
Explore and get value out of your raw data: An Introduction to Splunk

Install Splunk Enterprise

Let's start by installing Splunk Enterprise on your machine. Installing Splunk is quite straightforward and the setup package is available for pretty much all platforms: macOS/Linux/Windows. Download the package here and follow the installation instructions.

Splunk Enterprise? but... does it have a free license? Yes!

“Index 500 MB/Day. (…) After 60 days you can convert to a perpetual free license or purchase a Splunk Enterprise license to continue using the expanded functionality designed for enterprise-scale deployments.”

While a local installation on your machine is quite OK for the scope of an introduction and personal usage, I would highly recommend shifting to a proper Splunk deployment (on-premise or in the cloud) as soon as you start using it more extensively.

Splunk Enterprise 7.1.3 Web Interface

If your local installation went well, you will be greeted with a web interface similar to the screenshot above. Yay!

Import your raw data

This article applies to any type of raw data (Splunk is well known for being able to ingest raw data without prior knowledge of its schema), but to be able to demonstrate this I need a raw dataset. Instead of generating some meaningless dummy test dataset, I decided to search for an interesting real-world dataset available as Open Data.

Helsinki Public Transportation (HSL) — Passenger Volume per Station during November 2016

I found an interesting dataset from the Helsinki Region Transport (HSL) containing the volume of passengers per station in the Helsinki area. The dataset (available here) contains the average number of passengers per day during November 2016 and was collected from the passenger travel card system. While I was a bit disappointed that this particular dataset only covers an older period (November 2016), I was positively surprised to discover that HSL (and the Finnish public authorities in general) have quite a big catalog of data openly available (https://www.opendata.fi/en). Nice!

By downloading this particular HSL dataset — I chose the GeoJSON data format — you will get a raw data file named: HSL%3An_nousijamäärät.geojson

Raw data from HSL

As you can see, at the top level we have a single FeatureCollection that contains all the Feature events within.

Since we only care about the events (the high-level FeatureCollection array part is not needed), we can clean the data a bit by dropping the JSON array and piping all the Feature events to a new file (HSLvolumes.json).

Add the data to Splunk

It is quite straightforward to add new data into Splunk from a file on the local hard disk. Let's head to Splunk and use the UI options to do so.

Splunk > Add data

Click on the Add Data option and select Upload (from files in my computer)

Splunk > Add data: Select Source

A step-by-step guide will appear. Let's start by selecting our raw data file. In my case, I will be using the HSLvolumes.json file that contains the Feature events.

Splunk > Add data: Set Source Type

After getting your data in, Splunk will try to “understand” your data automatically and allow you to tweak and provide more details about the data format.

In this particular case, you can see that it automatically recognized my data as JSON (Source type: _json) and overall the events look good. However, there are some warnings that it failed to parse a timestamp for each event.

Why? Splunk is all about event processing and time is essential. Based on the events you are indexing, Splunk will automatically try to find a timestamp. Since our data doesn't have a timestamp field, Splunk will use the current time at which each event was indexed as the event timestamp.

For an in-depth explanation on how Splunk timestamp assignments works, please check this Splunk documentation page.

Splunk > Add data: Save Source Type

So, in the Timestamp section we will enforce this by choosing Current and, since we modified the _json Source type, let's hit Save As and name this according to our data source (e.g. hslvolumesjson).

Splunk > Add data: Input Settings

In this section, we need to select in which Splunk index we want to store this data. It is a good practice to create separate indexes for different types of data, so let’s create a new index.

Splunk > Add data: New Index

Choose your index name and click Save. We can leave the other fields with their default values.

Double check that the new index is selected. Click Review, Submit & Start Searching and you are ready to go.

For a more in-depth explanation about getting data in Splunk, please check the Splunk documentation: http://dev.splunk.com/view/dev-guide/SP-CAAAE3A

First glance at your data: Exploration

After you click the Start Searching button you will be directed to the Splunk Search panel.

There are a lot of interesting things in this view. If you have never used Splunk before you might actually feel a bit overwhelmed. Allow me to highlight some of the areas and break the view apart for you.

In the upper left corner, you will find in which Splunk app (default: Search & Reporting) and panel (default: Search) you currently are.

Right below that, you will find the Splunk search bar with a query that (at first glance) might look a bit complex. Given our simple use case, the exact same search results would have appeared with the query: index="hslnov2016". We will explore the query language below.

In the upper right corner, you will find the Time picker (default: All time). This allows you to select the time range of your search. Since our timestamp was set to be the indexing current time, this will not be useful here.

In the lower left corner, you find the Interesting Fields. These are fields from your data that Splunk was able to extract automatically.

Finally, the remaining lower part is where your search query result events are going to be displayed. In this case, all the index results are appearing.

Cool, What now?

One of my favorite options to use first when exploring data in Splunk is the "Interesting Fields" panel. By clicking on any field you can really quickly gain valuable insights.

In this case, by selecting the field properties.nimi_s we can quickly see the field's top values, i.e., which HSL station names appear in the majority of the events. [ Without much surprise for any Helsinki area resident, Rautatientori (Central Railway Station) and Kamppi are at the top :) ]
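If you want to reproduce that field summary as a search, the top command gives a similar breakdown, returning the most frequent values of the field along with their count and percentage. A minimal sketch, assuming the hslnov2016 index created above and an arbitrary limit of 20:

index="hslnov2016" | top limit=20 "properties.nimi_s"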

Getting answers: statistics and data aggregation

The Splunk search and query language is both powerful and vast, but with some simple commands and little experience you can quickly get some valuable answers.

If you start from index=yourindex | command, Splunk will provide autocomplete, guidance, and an explanation of each command.

Since each event contains the daily average of passengers at a single station, let's say we want to know what the total Volume of Passengers per Station is. How can we do this?

Easy! We can quickly use the stats command to sum all the daily averages (properties.nousijat) and aggregate the results by station name (properties.nimi_s).
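In the Splunk search language that looks roughly like this (a sketch; volume is just an alias so that later commands have something readable to refer to):

index="hslnov2016" | stats sum(properties.nousijat) as volume by "properties.nimi_s"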

Side bonus: By getting 5071 results we also got to know the total number of stations in our dataset. Nice!
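If all you want is that station count, a distinct count gets it directly. A small sketch, not part of the original walkthrough, using the dc stats function:

index="hslnov2016" | stats dc(properties.nimi_s) as stations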

What if I want to know the top or bottom X Stations?

By appending | sort -volume | head 20 to our previous query, we immediately get the answer to that question.

We use sort to get the highest-volume results first, i.e., descending order (for the lowest, i.e., ascending, it would be sort +volume), and head to keep only the first X results.
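Put together, the full top-20 search would look something like this (a sketch; adjust the sign on sort and the head count to change the direction and size of the list):

index="hslnov2016" | stats sum(properties.nousijat) as volume by "properties.nimi_s" | sort -volume | head 20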

Explore your data and get valuable answers with the different Splunk queries.

Dashboards & Visualizations

Once you start to get the hang of Splunk search and have saved a couple of your most interesting queries, you can create your first Dashboard and visualize your data in different ways.

Head to the Dashboards section and click Create New Dashboard. Give a name to your dashboard and add your first panel.

With the same query as before, I added a simple Column chart panel.

Quite quickly it becomes evident that our top 20 stations are very, very different in terms of volume of passengers. Kamppi and Rautatientori were each handling roughly 2x the passenger volume of the other 3 stations in the top 5, and roughly 3x the volume of the remaining 15 stations (in the top 20!).

At this point I decided to add two additional panels…

On the left, the Passenger Volume per Station top 50 (same query but with |head 50) and a simple table visualization.

On the right, the Passenger Volume per Station (bottom ranks, less than 30 passengers), with a pie chart and the query: index="hslnov2016" | stats sum(properties.nousijat) as volume by "properties.nimi_s" | sort +volume | search volume < 30 | stats count by volume

I decided to include only the stations with less than 30 passengers in volume. And I was surprised to see that there are so many stations (1827) with 0 passengers.

One last Panel…

Since my dataset included the geo coordinates (latitude and longitude) of each station, I decided to add one more panel (type Map). To do so, I extended my Splunk installation with a 3rd-party visualization called Maps+ for Splunk.

You can do the same by exploring the existing visualization types and going to "Find more visualizations".

Splunk has a built-in Map visualization. Why not use it?
I did use the built-in Map at first, but I found some limitations: you can't zoom to a city level and my Splunk query was more complex. Maps+ for Splunk was a clear winner for me.

The panel's Splunk search query is: index="hslnov2016" | spath path="geometry.coordinates{0}" output=longitude | spath path="geometry.coordinates{1}" output=latitude | stats first(latitude) as latitude, first(longitude) as longitude, first(properties.nimi_s) as description, sum(properties.nousijat) as title by "properties.nimi_s" | sort -title | search title > 0

The initial transformations using spath were needed because both the latitude and longitude were stored in the same field (a multi-value JSON array), so I had to "split" them into separate fields.
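Isolated from the rest of the query, that split step looks like this (a sketch that only extracts the coordinates and tables them next to the station name):

index="hslnov2016" | spath path="geometry.coordinates{0}" output=longitude | spath path="geometry.coordinates{1}" output=latitude | table "properties.nimi_s", latitude, longitude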

This visualization (Maps+ for Splunk) only requires that you have the fields in a table with particular field names. Check the project documentation at https://github.com/sghaskell/maps-plus for more details.

base_search | table latitude, longitude [ description | title | (...) ]

I found the map really nice and helpful. I was able to quickly see the volume of passengers at any given station by hovering over it.

I hope you found this article useful! Please share your feedback and thoughts.

Bruno