Data Mining and Exploration Events, Fields, and Anomalies
1. Data Mining and
Exploration
David Carasso, Office of CTO, Chief Mind
2. AGENDA
What is data mining?
What’s the plan of attack?
What type of events do I have?
How do I mine fields?
How do I detect anomalous events?
Why do I need to visualize my data?
9. Preparing the data
You've been thrown data you aren't familiar with…
Mar 7 12:40:01 willLaptop crond(pam_unix)[10696]: session opened for user root by (uid=0)
Mar 7 12:40:01 willLaptop crond(pam_unix)[10695]: session closed for user root
Mar 7 12:40:02 willLaptop crond(pam_unix)[10696]: session closed for user root
Mar 7 12:44:47 willLaptop gconfd (root-10750): starting (version 2.10.0), pid 10750 user
'root'
Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address
"xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only config...
Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readwrite:/root/.gconf"…
Mar 7 12:45:01 willLaptop crond(pam_unix)[10754]: session opened for user root by (uid=0)
Mar 7 12:45:02 willLaptop crond(pam_unix)[10754]: session closed for user root
....
Eventtypes (closed sessions), Fields (pid), Transactions (open-close), Anomalies (unexpected address)
12. Given Some Unknown Data
Mar 7 12:40:01 willLaptop crond(pam_unix)[10696]: session opened for user root by (uid=0)
Mar 7 12:40:01 willLaptop crond(pam_unix)[10695]: session closed for user root
Mar 7 12:40:02 willLaptop crond(pam_unix)[10696]: session closed for user root
Mar 7 12:44:47 willLaptop gconfd (root-10750): starting (version 2.10.0), pid 10750 user
'root'
Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address
"xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only config...
Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readwrite:/root/.gconf"…
Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address
"xml:readonly:/etc/gconf/gconf.xml.defaults" to a read-only configuration ...
Mar 7 12:45:01 willLaptop crond(pam_unix)[10754]: session opened for user root by (uid=0)
Mar 7 12:45:02 willLaptop crond(pam_unix)[10754]: session closed for user root
....
14. Group Events by Content
Cluster events with similar values.
Show 3 examples from each cluster, from the most
common cluster to the least:
…| cluster labelonly=t showcount=t
| dedup 3 cluster_label sortby -cluster_count, cluster_label, _time
15. Events By Content
count label _raw
---------------------------------------------------------------------------------------------------------
1339 3 Mar 7 11:05:01 willLaptop crond(pam_unix)[6785]: session opened for user root by…
1339 3 Mar 7 11:10:01 willLaptop crond(pam_unix)[1769]: session opened for user root by …
1339 3 Mar 7 11:10:01 willLaptop crond(pam_unix)[1766]: session opened for user root by …
1324 2 Mar 7 11:05:02 willLaptop crond(pam_unix)[6785]: session closed for user root
1324 2 Mar 7 11:10:01 willLaptop crond(pam_unix)[1766]: session closed for user root
1324 2 Mar 7 11:10:02 willLaptop crond(pam_unix)[1769]: session closed for user root
136 13 Mar 7 20:05:08 willLaptop kernel: SELinux: initialized (dev selinuxfs, type
selinuxfs)…
136 13 Mar 7 20:05:09 willLaptop kernel: SELinux: initialized (dev usbfs, type usbfs), uses …
136 13 Mar 7 20:05:09 willLaptop kernel: SELinux: initialized (dev sysfs, type sysfs), uses …
16. Group by $%#! Format
Cluster events by first 7 punctuation chars:
…| rex field=punct "(?<smallpunct>.{7})"
| eventstats count by smallpunct
| sort -count, smallpunct
| dedup 3 smallpunct
17. Events by Format
count smallpunct raw
------------------------------------------------------------------------------------------------
637 __::__( Mar 10 16:50:02 willLaptop crond(pam_unix)[9639]: session closed for user root
637 __::__( Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session closed for user root
637 __::__( Mar 10 16:50:01 willLaptop crond(pam_unix)[9639]: session opened for user root by …
367 __::__: Mar 10 15:30:25 willLaptop dhclient: bound to 10.1.1.194 -- renewal in 5788 seconds.
367 __::__: Mar 10 15:30:25 willLaptop dhclient: DHCPACK from 10.1.1.50
367 __::__: Mar 10 15:30:25 willLaptop dhclient: DHCPREQUEST on eth0 to 10.1.1.50 port 67
57 __::__[ Mar 10 16:46:32 willLaptop ntpd[2544]: synchronized to 138.23.180.126, stratum 2
57 __::__[ Mar 10 16:46:27 willLaptop ntpd[2544]: synchronized to LOCAL(0), stratum 10
57 __::__[ Mar 10 16:42:09 willLaptop ntpd[2544]: time reset -0.236567 s
18. Group by Time
Look for bursts of events
• Turn on computer
• Load a web page
• Detect a speeding car
• Print document
• Scan security badge
19. Group by Time Bursts
… | transaction maxpause=2s
| search eventcount>1
Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session opened for user root by (uid=0)
Mar 10 16:50:01 willLaptop crond(pam_unix)[9639]: session opened for user root by (uid=0)
Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session closed for user root
Mar 10 16:50:02 willLaptop crond(pam_unix)[9639]: session closed for user root
Mar 10 15:30:25 willLaptop dhclient: DHCPREQUEST on eth0 to 10.1.1.50 port 67
Mar 10 15:30:25 willLaptop dhclient: DHCPACK from 10.1.1.50
Mar 10 15:30:25 willLaptop dhclient: bound to 10.1.1.194 -- renewal in 5788 seconds.
Mar 10 16:45:01 willLaptop crond(pam_unix)[9553]: session opened for user root by (uid=0)
Mar 10 16:45:02 willLaptop crond(pam_unix)[9553]: session closed for user root
32. Splunkd.log Sample File
09-05-2012 15:34:11.886 -0700 INFO ExecProcessor - Ran script: python /opt/splunk/etc/apps/...
09-05-2012 15:34:02.467 -0700 ERROR TcpOutputProc - Can't find or illegal IP address or ...
09-05-2012 15:32:03.397 -0700 INFO ProcessTracker - Process ran long; type=SplunkOptimize ...
09-05-2012 15:30:20.016 -0700 WARN DispatchCommand - The system is approaching the maximum ...
fascinating
39. Handling Known Anomalies
Easy. Define a search for the anomalous condition
and make an alert to detect it.
ip=10.* NOT domain=mycompany.com
… | stats perc99(spent)    (returns, say, 500ms)
Alert on “spent>500”
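Rather than hard-coding the 500ms threshold, it can be computed from the data itself. A sketch from the speaker notes (same spent field as above):

… | eventstats perc99(spent) as bigspender
| where spent > bigspender

eventstats attaches the 99th percentile of spent to every event; where then keeps only the events that exceed it.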
41. Anomalies by Single Field Values
Identify anomalous values in a given field either by
frequency of occurrence or number of standard
deviations from the mean.
… | anomalousvalue action=summary pthresh=0.02
| search isNum=YES
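To act on the events themselves rather than a summary, the same command can filter; a sketch (anomalousvalue also supports action=filter and action=annotate):

… | anomalousvalue action=filter pthresh=0.02

This keeps only the events containing anomalous values, ready for an alert.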
43. Anomalies by Many Values
Look for small clusters – by content, format, and
time – to find anomalies. For example…
…| cluster …| sort cluster_count
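One way to fill in the elided options, borrowing them from the earlier cluster search on slide 14, is a minimal sketch that surfaces the rarest clusters first:

…| cluster labelonly=t showcount=t
| sort cluster_count
| head 20

An ascending sort on cluster_count puts the smallest, most anomalous clusters at the top.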
45. Small Clusters: Bursts of One
Find bursts of just a single event, where a pause of
more than 2 seconds occurred on either side of it.
… | transaction maxpause=2s | search eventcount=1
Mar 10 16:46:32 willLaptop ntpd[2544]: synchronized to 138.23.180.126…
Mar 10 16:46:27 willLaptop ntpd[2544]: synchronized to LOCAL(0), stratum…
Mar 10 16:42:09 willLaptop ntpd[2544]: time reset -0.236567…
46. Burst of One
Same idea, different data source: Splunk's own access log
[11:58:08] "POST /services/search/jobs/export HTTP/1.1" 200 201630 …
[11:12:51] "POST /services/search/jobs/export HTTP/1.1" 200 459441 …
[10:00:58] "GET /servicesNS/nobody/SplunkDeploymentMonitor/backfill/…
47. Anomalous by Context
Identify values not expected by the context of other
events.
… | anomalies field=file labelonly=true maxvalues=10
49. Surprise Eventtype: Part Deux!
Have you classified the major categories of your
data with eventtypes?
Then just search for events that don’t match those
eventtypes.
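Assuming your eventtypes are already defined, one way to express that search:

… | search NOT eventtype=*

Events with no eventtype value at all are exactly the ones your current classifications don't explain.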
53. Other mining commands
• kmeans: Performs k-means clustering on selected
fields.
• outlier: Removes outlying numerical values.
• af (analyzefields): Analyzes numerical fields for their
ability to predict another discrete field.
• fieldsummary: Generates summary information about fields.
• shape: Produces a symbolic 'shape' attribute describing
the shape of a numeric multivalued field.
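Hedged one-line sketches of each (the field names duration and status are hypothetical stand-ins):

… | kmeans k=4 duration
… | outlier action=remove duration
… | af classfield=status
… | fieldsummary maxvals=10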
55. Data Mining by Visualization
Visualization can reveal nuances in the data that
numerical or linguistic summaries cannot easily capture.
56. These data points are radically different.
*Source: Anscombe’s Quartet (Anscombe 1973)
57. Why visualize?
Because they all have the exact same
• average (7.50)
• standard deviation (2.03)
• least-squares fit (y = 3 + 0.5x).
Do not just rely on numerical summarization.
58. But I already have charts!
You don’t graph enough.
Data Exploration
Don’t decide ahead of time what graphs you want
Regularly explore out-of-the-box scenarios with graphs
59. Data Exploration
Variations:
• Subsets of Events (paying customers vs lookers)
• Fields by Fields (including eventtypes and tags)
• Ignored fields
• Min/max/avg/count
• Compare to other time windows
• Transactions
60. Visual Arrangement
Sorting data, changing scales (linear/log), and
adjusting min/max can make a huge difference in
how the same data looks.
61. Visual Considerations
Pick representations that make
obvious the distinctions you
need to care about.
63. Summary
• Discovery is an iterative process.
• Group events by content, format, and time, and
define classifications with eventtypes and tags.
• Focus on promising fields with correlations.
• Discover unknown anomalies with small clusters.
• Visualize your data, from a dozen angles.
65. More to come: Predictive Analytics
… | forecast foo
66. The End
Mine the Gap.
Golf clapping at #datamining
Editor's Notes
----- Meeting Notes (9/7/12 14:21) ----- [ASK AUDIENCE -- WHAT IS DATA MINING?]
No. Explicit. Learning nothing new. Not significant in meaning. I’m explicitly telling you what it is. You’re not mining it. By looking at the data, you’re not learning anything new by me saying this is an orange. And frankly it’s not useful.
Regularities, patterns, anomalies that are interesting, meaning not obvious, explicit inferences, and at the same time not coincidental or noisy inferences.
Yellow is Soda. Blue is Pop. Red is Coke.
Before we can really mine a bunch of text for valuable information, we need to do some prep work. We need to understand our data – the dimensions, the sets of values. In Splunk terms – create fields, eventtypes, transactions, etc. By adding fields, you’re mining out dimensions; by adding eventtypes, you’re mining classes; by adding transactions, you’re mining correlations; etc. BUT… Prepping the data for mining is a data mining task of sorts in itself, and the line between understanding your data and mining is really non-existent. This before-work is sometimes called Data Exploration.
The more knowledge you can add to Splunk about your data, the more options you’ll have to analyze it. There may be data cleaning involved.
You can go from groups of events to understanding events to understanding fields to understanding normality/anomalies to generating reports. But the truth is, this is an iterative process. Each step tells you more about something else. (Un)fortunately, this presentation is linear.
Raw values, like raw text.
Make eventtypes for “session opened”, “session closed”, “linux initialized”. Tag them. Then mine out questions like “how long is the average session?”, “how much churn is there?”, etc.
Consider linecount as well.
Make eventtypes or tags for cron jobs, ntpd, dhclient. Then mine out questions like “who is running what jobs? Which are the most common?”
One of the most useful ways to see how your individual events relate to each other is to look for pauses in your events, as real physical events often happen in bursts. For example, there are bursts of log activity: when you shut down a computer; when you access a web page, which has many images; when a car factory robot detects the next car; when you turn on a printer and it connects to your computer; when you scan your security badge.
Make transactions for sessions opening and closing. Find unclosed transactions. How often, how many, by whom?
No reason to limit correlations to a particular data source. Splunk can easily correlate them together in one search. The search isn’t correct in that the dedup is removing important consecutive events, but it was useful for showing small correlated events across sources.
If Facebook had eventtypes, you’d define any picture that has any of your family members but no co-workers as a ‘family’ pix that you could then have a virtual photo album for. Any pix with a family member outside the Bay Area as a “family vacation” pix. When you search your data, you’re essentially weeding out all unwanted events; the results of your search are events that share common characteristics, and you can give them a collective name or “event type”. The names of your event types are added as values into an eventtype field. This means that you can search for, and report on, these groups of events the same way you search for any field. The following example takes you through the steps to save a search as an eventtype and then searching for that field. If you run frequent searches to investigate SSH and firewall activities, such as sshd logins or firewall denies, you can save these searches as an event type. Also, if you see error messages that are cryptic, you can save them as an event type with a more descriptive name.
Why? Reduce the number of fields you should focus on to those with the most value, for analysis and graphing.
A 1.0 means two fields always co-occur. For example, Component and Log_Level always co-occur in splunkd.log. You can filter out fields to make this table more manageable.
----- Meeting Notes (9/4/12 11:49) -----give splunkd example output first to show log
This shows that before we know the component is SavedSplunker, the odds of a WARN Log_Level are 62.25%; afterwards, the odds are 100%. Before we know the component is loader, the odds of an INFO Log_Level are 33.15%; afterwards, 99.44%.
What are anomalies/outliers? The set of data points that are considerably different. Applications: network intrusion detection, fault detection, credit card fraud detection, telecommunication fraud detection. Build a profile of the “normal” behavior – patterns, stats – to detect anomalies. Very often you want to find “problems” in your IT data, but you don’t know what to look for. If you know what to look for, by all means, look.
Very often you want to find “problems” in your IT data, but you don’t know what to look for. If you know what to look for, by all means, look. … | eventstats perc99(spent) as bigspender | where spent > bigspender
Very often you want to find anomalies/problems in your IT data, but you don’t know what to look for. Single value: the ‘port’ value is highly irregular. Many values: many values look different than others. Anomalous: many values were unexpected by context. Everything applies to transactions as well. Look for anomalies.
Identifies values in the data that are anomalous either by frequency of occurrence or number of standard deviations from the mean. Make searches to find these anomalous values and create alerts.
catNormFreq = the average frequency of non-anomalous values. isNum means all values of the field were numerical. Basically we assume a normal distribution, but if we find that ends up causing too many values to be anomalous, we don’t use it.
Earlier we looked for large clusters to get a broad understanding of the events. We grouped by content, format, and time. Now, just flip it. Make searches to find these anomalous values and create alerts.
Same for form (looking for unusual punctuation) or especially long pauses between events (10 seconds?). Make searches to find these anomalous values and create alerts.
These slow events are often important and indicate longer tasks.
Make eventtypes or tags for these slow, important events. Who runs them most? Are they a problem? Why is someone exporting or backfilling their data? Make an alert when it happens.
Experimental search command that uses compression and a window of the N last events to see if a new event compresses well with past events, or if it looks unexpected. Make searches to find these anomalous values and create alerts.
Make searches to find these anomalous values and create alerts.
One of the most obvious and important methods of discovering what your data is saying is to simply graph your data. Humans have a well-developed ability to analyze large amounts of data presented visually, detecting general patterns and trends, as well as outliers and unusual patterns.
What data points are outliers? What inferences would you make? They are radically different.
Limitations of statistical approaches: they usually test a single attribute; distributions aren’t known for many dimensions, and the true distribution is hard to estimate. Do not just rely on numerical summarization, or you won’t see what’s going on.
Same for transactions of events, and classes of events (eventtypes) and field-values (tags)
Eventually you’ll tweak out little nuggets of knowledge. Over time, what is the average duration users spend on my website by language of country, compared to last month? How does the time on the website correlate with the time of day, or browser? Does the max delay for each server vary over time by language? Same for transactions of events, and classes of events (eventtypes) and field-values (tags).
So reduce the number of dimensions down to 2 or 3 for visualization, and limit the data shown.
Heat map vs much more useful chart
Discovery: Each step tells you more about everything else.
Predicting foo and getting better and better at it; towards the right edge you can see it’s predicting values that haven’t happened yet.