Data Mining and Exploration Events, Fields, and Anomalies
1. Data Mining and
Exploration
David Carasso, Office of CTO, Chief Mind
2. AGENDA
What is data mining?
What’s the plan of attack?
What type of events do I have?
How do I mine fields?
How do I detect anomalous events?
Why do I need to visualize my data?
9. Preparing the data
You've been thrown data you aren't familiar with…
Mar 7 12:40:01 willLaptop crond(pam_unix)[10696]: session opened for user root by (uid=0)
Mar 7 12:40:01 willLaptop crond(pam_unix)[10695]: session closed for user root
Mar 7 12:40:02 willLaptop crond(pam_unix)[10696]: session closed for user root
Mar 7 12:44:47 willLaptop gconfd (root-10750): starting (version 2.10.0), pid 10750 user
'root'
Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address
"xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only config...
Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readwrite:/root/.gconf"…
Mar 7 12:45:01 willLaptop crond(pam_unix)[10754]: session opened for user root by (uid=0)
Mar 7 12:45:02 willLaptop crond(pam_unix)[10754]: session closed for user root
....
Eventtypes (closed sessions), Fields (pid), Transactions (open-close), Anomalies (unexpected address)
12. Given Some Unknown Data
Mar 7 12:40:01 willLaptop crond(pam_unix)[10696]: session opened for user root by (uid=0)
Mar 7 12:40:01 willLaptop crond(pam_unix)[10695]: session closed for user root
Mar 7 12:40:02 willLaptop crond(pam_unix)[10696]: session closed for user root
Mar 7 12:44:47 willLaptop gconfd (root-10750): starting (version 2.10.0), pid 10750 user
'root'
Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address
"xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only config...
Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address "xml:readwrite:/root/.gconf"…
Mar 7 12:44:47 willLaptop gconfd (root-10750): Resolved address
"xml:readonly:/etc/gconf/gconf.xml.defaults" to a read-only configuration ...
Mar 7 12:45:01 willLaptop crond(pam_unix)[10754]: session opened for user root by (uid=0)
Mar 7 12:45:02 willLaptop crond(pam_unix)[10754]: session closed for user root
....
14. Group Events by Content
Cluster events with similar values.
Show 3 examples from each cluster, from the most
common cluster to the least:
…| cluster labelonly=t showcount=t
| dedup 3 cluster_label sortby -cluster_count, cluster_label, _time
15. Events By Content
count label _raw
---------------------------------------------------------------------------------------------------------
1339 3 Mar 7 11:05:01 willLaptop crond(pam_unix)[6785]: session opened for user root by…
1339 3 Mar 7 11:10:01 willLaptop crond(pam_unix)[1769]: session opened for user root by …
1339 3 Mar 7 11:10:01 willLaptop crond(pam_unix)[1766]: session opened for user root by …
1324 2 Mar 7 11:05:02 willLaptop crond(pam_unix)[6785]: session closed for user root
1324 2 Mar 7 11:10:01 willLaptop crond(pam_unix)[1766]: session closed for user root
1324 2 Mar 7 11:10:02 willLaptop crond(pam_unix)[1769]: session closed for user root
136 13 Mar 7 20:05:08 willLaptop kernel: SELinux: initialized (dev selinuxfs, type
selinuxfs)…
136 13 Mar 7 20:05:09 willLaptop kernel: SELinux: initialized (dev usbfs, type usbfs), uses …
136 13 Mar 7 20:05:09 willLaptop kernel: SELinux: initialized (dev sysfs, type sysfs), uses …
16. Group by $%#! Format
Cluster events by first 7 punctuation chars:
…| rex field=punct "(?<smallpunct>.{7})"
| eventstats count by smallpunct
| sort -count, smallpunct
| dedup 3 smallpunct
17. Events by Format
count smallpunct raw
------------------------------------------------------------------------------------------------
637 __::__( Mar 10 16:50:02 willLaptop crond(pam_unix)[9639]: session closed for user root
637 __::__( Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session closed for user root
637 __::__( Mar 10 16:50:01 willLaptop crond(pam_unix)[9639]: session opened for user root by …
367 __::__: Mar 10 15:30:25 willLaptop dhclient: bound to 10.1.1.194 -- renewal in 5788 seconds.
367 __::__: Mar 10 15:30:25 willLaptop dhclient: DHCPACK from 10.1.1.50
367 __::__: Mar 10 15:30:25 willLaptop dhclient: DHCPREQUEST on eth0 to 10.1.1.50 port 67
57 __::__[ Mar 10 16:46:32 willLaptop ntpd[2544]: synchronized to 138.23.180.126, stratum 2
57 __::__[ Mar 10 16:46:27 willLaptop ntpd[2544]: synchronized to LOCAL(0), stratum 10
57 __::__[ Mar 10 16:42:09 willLaptop ntpd[2544]: time reset -0.236567 s
18. Group by Time
Look for bursts of events
• Turn on computer
• Load a web page
• Detect a speeding car
• Print document
• Scan security badge
19. Group by Time Bursts
… | transaction maxpause=2s
| search eventcount>1
Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session opened for user root by (uid=0)
Mar 10 16:50:01 willLaptop crond(pam_unix)[9639]: session opened for user root by (uid=0)
Mar 10 16:50:01 willLaptop crond(pam_unix)[9638]: session closed for user root
Mar 10 16:50:02 willLaptop crond(pam_unix)[9639]: session closed for user root
Mar 10 15:30:25 willLaptop dhclient: DHCPREQUEST on eth0 to 10.1.1.50 port 67
Mar 10 15:30:25 willLaptop dhclient: DHCPACK from 10.1.1.50
Mar 10 15:30:25 willLaptop dhclient: bound to 10.1.1.194 -- renewal in 5788 seconds.
Mar 10 16:45:01 willLaptop crond(pam_unix)[9553]: session opened for user root by (uid=0)
Mar 10 16:45:02 willLaptop crond(pam_unix)[9553]: session closed for user root
32. Splunkd.log Sample File
09-05-2012 15:34:11.886 -0700 INFO ExecProcessor - Ran script: python /opt/splunk/etc/apps/...
09-05-2012 15:34:02.467 -0700 ERROR TcpOutputProc - Can't find or illegal IP address or ...
09-05-2012 15:32:03.397 -0700 INFO ProcessTracker - Process ran long; type=SplunkOptimize ...
09-05-2012 15:30:20.016 -0700 WARN DispatchCommand - The system is approaching the maximum ...
fascinating
39. Handling Known Anomalies
Easy. Define a search for the anomalous condition
and make an alert to detect it.
ip=10.* NOT domain=mycompany.com
… | stats perc99(spent)    (returns, say, 500ms)
Alert on “spent>500”
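Rather than hard-coding the 500ms threshold, it can be computed from the data itself. A sketch from the speaker notes (same spent field as above):

… | eventstats perc99(spent) as bigspender
| where spent > bigspender

eventstats attaches the 99th percentile of spent to every event; where then keeps only the events that exceed it.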
41. Anomalies by Single Field Values
Identify anomalous values in a given field either by
frequency of occurrence or number of standard
deviations from the mean.
… | anomalousvalue action=summary pthresh=0.02
| search isNum=YES
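To act on the events themselves rather than a summary, the same command can filter; a sketch (anomalousvalue also supports action=filter and action=annotate):

… | anomalousvalue action=filter pthresh=0.02

This keeps only the events containing anomalous values, ready for an alert.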
43. Anomalies by Many Values
Look for small clusters – by content, format, and
time – to find anomalies. For example…
…| cluster …| sort cluster_count
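One way to fill in the elided options, borrowing them from the earlier cluster search on slide 14, is a minimal sketch that surfaces the rarest clusters first:

…| cluster labelonly=t showcount=t
| sort cluster_count
| head 20

An ascending sort on cluster_count puts the smallest, most anomalous clusters at the top.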
45. Small Clusters: Bursts of One
Find bursts of just a single event, where a pause of
more than 2 seconds occurred on either side of it.
… | transaction maxpause=2s | search eventcount=1
Mar 10 16:46:32 willLaptop ntpd[2544]: synchronized to 138.23.180.126…
Mar 10 16:46:27 willLaptop ntpd[2544]: synchronized to LOCAL(0), stratum…
Mar 10 16:42:09 willLaptop ntpd[2544]: time reset -0.236567…
46. Burst of One
Same idea, different data source: Splunk's own access log
[11:58:08] "POST /services/search/jobs/export HTTP/1.1" 200 201630 …
[11:12:51] "POST /services/search/jobs/export HTTP/1.1" 200 459441 …
[10:00:58] "GET /servicesNS/nobody/SplunkDeploymentMonitor/backfill/…
47. Anomalous by Context
Identify values not expected by the context of other
events.
… | anomalies field=file labelonly=true maxvalues=10
49. Surprise Eventtype: Part Deux!
Have you classified the major categories of your
data with eventtypes?
Then just search for events that don’t match those
eventtypes.
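Assuming your eventtypes are already defined, one way to express that search:

… | search NOT eventtype=*

Events with no eventtype value at all are exactly the ones your current classifications don't explain.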
53. Other mining commands
• kmeans: Performs k-means clustering on selected
fields.
• outlier: Removes outlying numerical values.
• af (analyzefields): Analyzes numerical fields for their
ability to predict another discrete field.
• fieldsummary: Generates summary information about fields.
• shape: Produces a symbolic 'shape' attribute describing
the shape of a numeric multivalued field.
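Hedged one-line sketches of each (the field names duration and status are hypothetical stand-ins):

… | kmeans k=4 duration
… | outlier action=remove duration
… | af classfield=status
… | fieldsummary maxvals=10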
55. Data Mining by Visualization
Visualization can reveal nuances in the data that
numerical or linguistic summaries cannot easily capture.
56. These data points are radically different.
*Source: Anscombe’s Quartet (Anscombe 1973)
57. Why visualize?
Because they all have the exact same
• average (7.50)
• standard deviation (2.03)
• least-squares fit (y = 3 + 0.5x).
Do not just rely on numerical summarization.
58. But I already have charts!
You don’t graph enough.
Data Exploration
Don’t decide ahead of time what graphs you want
Regularly explore out-of-the-box scenarios with graphs
59. Data Exploration
Variations:
• Subsets of Events (paying customers vs lookers)
• Fields by Fields (including eventtypes and tags)
• Ignored fields
• Min/max/avg/count
• Compare to other time windows
• Transactions
60. Visual Arrangement
Sorting data, changing scales (linear/log), and
adjusting min/max can make a huge difference in
how the same data looks.
61. Visual Considerations
Pick representations that make
obvious the distinctions you
need to care about.
63. Summary
• Discovery is an iterative process.
• Group events by content, format, and time, and
define classifications with eventtypes and tags.
• Focus on promising fields with correlations.
• Discover unknown anomalies with small clusters.
• Visualize your data, from a dozen angles.
65. More to come: Predictive Analytics
… | forecast foo
66. The End
Mine the Gap.
Golf clapping at #datamining
Editor's Notes
----- Meeting Notes (9/7/12 14:21) ----- [ASK AUDIENCE -- WHAT IS DATA MINING?]
No. Explicit. Learning nothing new. Not significant in meaning. I’m explicitly telling you what it is. You’re not mining it. By looking at the data, you’re not learning anything new by me saying this is an orange. And frankly it’s not useful.
Regularities, patterns, anomalies that are interesting, meaning not obvious, explicit inferences, and at the same time not coincidental or noisy inferences.
Yellow is Soda. Blue is Pop. Red is Coke.
Before we can really mine a bunch of text for valuable information, we need to do some prep work. We need to understand our data – the dimensions, the sets of values. In Splunk terms – create fields, eventtypes, transactions, etc. By adding fields, you’re mining out dimensions; by adding eventtypes, you’re mining classes; by adding transactions, you’re mining correlations; etc. BUT… Prepping the data for mining is a data mining task of sorts in itself, and the line between understanding your data and mining is really non-existent. This before-work is sometimes called Data Exploration.
The more knowledge you can add to Splunk about your data, the more options you’ll have to analyze it. There may be data cleaning involved.
You can go from groups of events to understanding events to understanding fields to understanding normality/anomalies to generating reports. But the truth is, this is an iterative process. Each step tells you more about something else. (Un)fortunately, this presentation is linear.
Raw values, like raw text.
Make eventtypes for “session opened”, “session closed”, “linux initialized”. Tag them. Then mine out questions like “how long is the average session?”, “how much churn is there?”, etc.
Consider linecount as well.
Make eventtypes or tags for cron jobs, ntpd, dhclient. Then mine out questions like “who is running what jobs? Which are the most common?”
One of the most useful ways to see how your individual events relate to each other is to look for pauses in your events, as real physical events often happen in bursts. For example, there are bursts of log activity: when you shut down a computer; when you access a web page, which has many images; when a car factory robot detects the next car; when you turn on a printer and it connects to your computer; when you scan your security badge.
Make transactions for sessions opening and closing. Find unclosed transactions. How often, how many, by whom?
No reason to limit correlations to a particular data source. Splunk can easily correlate them together in one search. The search isn’t correct in that the dedup is removing important consecutive events, but it was useful for showing small correlated events across sources.
If Facebook had eventtypes, you’d define any picture that has any of your family members but no co-workers as a ‘family’ pix that you could then have a virtual photo album for. Any pix with a family member outside the Bay Area as a “family vacation” pix. When you search your data, you’re essentially weeding out all unwanted events; the results of your search are events that share common characteristics, and you can give them a collective name or “event type”. The names of your event types are added as values into an eventtype field. This means that you can search for, and report on, these groups of events the same way you search for any field. The following example takes you through the steps to save a search as an eventtype and then searching for that field. If you run frequent searches to investigate SSH and firewall activities, such as sshd logins or firewall denies, you can save these searches as an event type. Also, if you see error messages that are cryptic, you can save them as an event type with a more descriptive name.
Why? Reduce the number of fields you should focus on to those with the most value, for analysis and graphing.
A 1.0 means two fields always co-occur. For example, Component and Log_Level always co-occur in splunkd.log. You can filter out fields to make this table more manageable.
----- Meeting Notes (9/4/12 11:49) -----give splunkd example output first to show log
This shows that before we know the component is SavedSplunker, the odds of a WARN Log_Level are 62.25%; afterwards, the odds are 100%. Before we know the component is loader, the odds of an INFO Log_Level are 33.15%; afterwards, 99.44%.
What are anomalies/outliers? The set of data points that are considerably different. Applications: network intrusion detection, fault detection, credit card fraud detection, telecommunication fraud detection. Build a profile of the “normal” behavior – patterns, stats – to detect anomalies. Very often you want to find “problems” in your IT data, but you don’t know what to look for. If you know what to look for, by all means, look.
Very often you want to find “problems” in your IT data, but you don’t know what to look for. If you know what to look for, by all means, look. … | eventstats perc99(spent) as bigspender | where spent > bigspender
Very often you want to find anomalies/problems in your IT data, but you don’t know what to look for. Single value: the ‘port’ value is highly irregular. Many values: many values look different than others. Anomalous: many values were unexpected by context. Everything applies to transactions as well. Look for anomalies.
Identifies values in the data that are anomalous either by frequency of occurrence or number of standard deviations from the mean. Make searches to find these anomalous values and create alerts.
catNormFreq = the average frequency of non-anomalous values. isNum means all values of the field were numerical. Basically we assume a normal distribution, but if we find that ends up causing too many values to be anomalous, we don’t use it.
Earlier we looked for large clusters to get a broad understanding of the events. We grouped by content, format, and time. Now, just flip it. Make searches to find these anomalous values and create alerts.
Same for form (looking for unusual punctuation) or especially long pauses between events (10 seconds?). Make searches to find these anomalous values and create alerts.
These slow events are often important and indicate longer tasks.
Make eventtypes or tags for these slow, important events. Who runs them most? Are they a problem? Why is someone exporting or backfilling their data? Make an alert when it happens.
Experimental search command that uses compression and a window of the N last events to see if a new event compresses well with past events, or if it looks unexpected. Make searches to find these anomalous values and create alerts.
Make searches to find these anomalous values and create alerts.
One of the most obvious and important methods of discovering what your data is saying is to simply graph your data. Humans have a well-developed ability to analyze large amounts of data presented visually, detecting general patterns and trends, as well as outliers and unusual patterns.
What data points are outliers? What inferences would you make? They are radically different.
Limitations of statistical approaches: they usually test a single attribute; distributions aren’t known for many dimensions, and the true distribution is hard to estimate. Do not just rely on numerical summarization, or you won’t see what’s going on.
Same for transactions of events, and classes of events (eventtypes) and field-values (tags)
Eventually you’ll tweak out little nuggets of knowledge. Over time, what is the average duration users spend on my website by language of country, compared to last month? How does the time on the website correlate with the time of day, or browser? Does the max delay for each server vary over time by language? Same for transactions of events, and classes of events (eventtypes) and field-values (tags).
So reduce the number of dimensions down to 2 or 3 for visualization, and limit the data shown.
Heat map vs much more useful chart
Discovery: Each step tells you more about everything else.
Predicting foo and getting better and better at it; towards the right edge you can see it’s predicting values that haven’t happened yet.