Intro to Data Science with Cascading

Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid

[diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count]

Copyright @2012, Concurrent, Inc.

opportunity

Unstructured Data
meets
Enterprise Scale
1. backstory: how we got here
2. build: data science teams
3. overview: typical use cases
4. example: Cascading apps


1. backstory:
how we got here

inflection point
huge Internet successes after 1997 holiday season…
AMZN, EBAY, Inktomi (YHOO Search), then GOOG

consider this metric:
annual revenue per customer / amount of data stored
which dropped 100x within a few years after 1997

storage and processing costs plummeted, now we must
work much smarter to extract ROI from Big Data…
our methods must adapt

“conventional wisdom” of RDBMS and BI tools became
less viable; business cadre still focused on pivot tables
and pie charts… tends toward inertia!

MapReduce and the Hadoop open source stack grew
directly out of that contention… however, that effort
only solves parts of the puzzle

[timeline graphic: 1997, 1998, 2004]

inflection point: consequences
Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm)
Hadoop Summit, 2012:

“All of Fortune 500 is now on notice over the next 10-year period.”
Amazon and Google as exemplars of massive disruption in retail,
advertising, etc.
data as the major force displacing Global 1000 over the next decade,
mostly through apps — verticals, leveraging domain expertise

Michael Stonebraker (INGRES, PostgreSQL, Vertica, VoltDB, etc.)
XLDB, 2012:

“Complex analytics workloads are now displacing SQL as the basis
for Enterprise apps.”

primary sources
Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtube.com/watch?v=E91oEn1bnXM

Google
“The Birth of Google” – John Battelle
wired.com/wired/archive/13.08/battelle.html
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtube.com/watch?v=qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx

the world before…

BI, SQL, and highly
optimized code

data innovation: circa 1996
[diagram labels: Stakeholder, Product, Customers; BI Analysts; Excel pivot tables, PowerPoint slide decks; strategy; requirements; Engineering; optimized code; Web App; SQL Query, result sets; transactions; RDBMS]

the world after…

machine learning,
leveraging log files

[diagram labels: Stakeholder, Product, Customers; dashboards; UX; Engineering; models; servlets; Algorithmic Modeling; recommenders + classifiers; Web Apps; Middleware; aggregation; event history; SQL Query, result sets; customer transactions; Logs; DW; ETL; RDBMS]

the world ahead…

what our customers
are doing now

[diagram labels: Customers; Data Apps; business process; Workflow; Domain Expert; Data Scientist; Planner; App Dev; Ops; Eng; Prod; dashboard metrics; Web Apps, Mobile, etc.; History; services; data science; s/w dev; discovery + modeling; optimized capacity; social interactions, transactions, content; endpoints; Data Access Patterns; Hadoop, etc.; Log Events; In-Memory Data Grid; DW; batch; "real time"; Cluster Scheduler; introduced capability; existing SDLC; RDBMS]

statistical thinking

Process Variation Data Tools

employing a mode of thought which includes both logical and analytical reasoning:
evaluating the whole of a problem, as well as its component parts; attempting
to assess the effects of changing one or more variables

this approach attempts to understand not just problems and solutions,
but also the processes involved and their variances

particularly valuable in Big Data work when combined with hands-on experience in
physics – roughly 50% of my peers come from physics or physical engineering…

programmers typically don’t think this way…
however, both systems engineers and data scientists must!

most valuable skills
approximately 80% of the costs for data-related projects
get spent on data preparation – mostly on cleaning up
data quality issues: ETL, log file analysis, etc.

unfortunately, data-related budgets for many companies tend
to go into frameworks which can only be used after clean up

most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable

the rest of the skills – modeling,
algorithms, etc. – those are secondary

(image: D3)

social caveats
“This data cannot be correct!” may be an early warning
about the organization itself
much depends on how the people whom you work alongside
tend to arrive at decisions:
‣ probably good: Induction, Abduction, Circumscription
‣ probably poor: Deduction, Speculation, Justification

in general, one good data visualization
puts many ongoing verbal arguments to rest
however, let domain experts handle
“data storytelling”, not data scientists

xkcd

references

by Leo Breiman
Statistical Modeling:
The Two Cultures
Statistical Science, 2001
bit.ly/eUTh9L

references

by Jack Olson
Data Quality
Morgan Kaufmann, 2003
amazon.com/dp/1558608915

references

also check out RStudio:
rstudio.org/
rpubs.com/


2. build:
data science teams

core values
Data Science teams develop actionable insights,
building confidence for decisions

that work may influence a few decisions worth
billions (e.g., M&A) or billions of small decisions
(e.g., AdWords)

probably somewhere in-between…
solving for pattern, at scale.
an interdisciplinary pursuit which
requires teams, not sole players

team process
discovery: help people ask the right questions
modeling: allow automation to place informed bets
integration: deliver products at scale to customers
apps: build smarts into product features
systems: keep infrastructure running, cost-effective

(image: Gephi)

building teams
[matrix: roles (stakeholder, scientist, developer, ops) vs. team process stages (discovery, modeling, integration, apps, systems)]

references

by DJ Patil

Data Jujitsu
O’Reilly, 2012
amazon.com/dp/B008HMN5BE

Building Data Science Teams
O’Reilly, 2011
amazon.com/dp/B005O4U3ZE


3. overview:
typical use cases

using science in data science
in a nutshell, what we do…
‣ estimate probability
‣ calculate analytic variance
‣ manipulate order complexity
‣ make use of learning theory
‣ collab with DevOps, Stakeholders

[figure: matrix plot whose axes are labeled with event types, e.g., NUI:DressUpMode, Website Login, Add Friend, Chat Now]

probability estimation
“a random variable or stochastic variable is a
variable whose value is subject to variations”
“an estimator is a rule for calculating an
estimate of a given quantity based on observed
data”
estimators and probability
distributions provide the essential
basis for our insights
Bayesian methods, shrinkage…
these are our friends
quantile estimation, empirical CDFs…
…versus frequentist notions

analytic variance
our tools for automation leverage deep
understanding of covariance
cannot overstate the importance of
sampling… insist on metrics described
as confidence intervals, where valid
bootstrapping, bagging…
these are our friends
Monte Carlo methods resolve “black box”
problems
point estimates may help prevent
“uninformed” decisions
do not skimp on this part, ever…
a hard lesson learned from BI failures
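
A minimal sketch of the bootstrapping idea above, in plain Java: resample the observed data with replacement, recompute the metric each time, and report a percentile interval instead of a bare point estimate. The class name, method name, sample values, and the choice of the percentile method are all assumptions made for illustration, not something taken from the slides.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper: percentile bootstrap for the mean of a small sample.
final class BootstrapSketch {

    /** Returns {lower, upper} bounds of a percentile bootstrap interval for the mean. */
    static double[] meanConfidenceInterval(double[] sample, int resamples, double level, long seed) {
        Random rng = new Random(seed);
        double[] means = new double[resamples];
        for (int r = 0; r < resamples; r++) {
            // draw sample.length observations with replacement
            double sum = 0.0;
            for (int i = 0; i < sample.length; i++) {
                sum += sample[rng.nextInt(sample.length)];
            }
            means[r] = sum / sample.length;
        }
        Arrays.sort(means);
        // cut off (1 - level) / 2 of the resampled means on each side
        double alpha = (1.0 - level) / 2.0;
        int lo = (int) Math.floor(alpha * (resamples - 1));
        int hi = (int) Math.ceil((1.0 - alpha) * (resamples - 1));
        return new double[] { means[lo], means[hi] };
    }

    public static void main(String[] args) {
        // made-up measurements, for illustration only
        double[] sample = { 12.1, 9.8, 11.4, 10.2, 13.5, 9.1, 10.9, 12.7, 11.0, 10.4 };
        double[] ci = meanConfidenceInterval(sample, 2000, 0.95, 42L);
        System.out.printf("95%% interval for the mean: [%.2f, %.2f]%n", ci[0], ci[1]);
    }
}
```

The same pattern works for any metric that can be recomputed on a resample (median, conversion rate, etc.), which is why it pairs well with "black box" problems.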

order complexity
techniques for manipulating order complexity:
dimensional reduction… with clustering
as a common case
e.g., you may have 100 million HTML docs,
but there are only ~10K useful keywords
low-dimensional structures, PCA
linear algebra tricks: eigenvalues, matrix
decomposition, etc.
many hard problems resolved by “divide and
conquer”
this is an area ripe for much advancement in
algorithms research near-term

learning theory
in general, apps alternate between learning
patterns/rules and retrieving similar things…
statistical learning theory – rigorous,
prevents you from making billion dollar
mistakes, probably our future
machine learning – scalable, enables
you to make billion dollar mistakes, much
commercial emphasis
supervised vs. unsupervised
arguably, optimization is a related area

once Big Data projects get beyond merely
digesting log files, optimization will likely
become yet another buzzword :)

use case: marketing funnel
• must optimize a very large ad spend
• different vendors report different metrics
• seasonal variation distorts performance
• some campaigns are much smaller than others
• hard to predict ROI for incremental spend
(image: Wikipedia)

approach:
• log aggregation, followed with cohort analysis
• Bayesian point estimates compare different-sized ad tests
• customer lifetime value quantifies ROI of new leads
• time series analysis normalizes for seasonal variation
• geolocation adjusts for regional cost/benefit
• linear programming models estimate elasticity of demand
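
One way to read the "Bayesian point estimates" bullet, sketched in plain Java: treat each ad test as binomial and shrink its raw conversion rate toward a prior, so a tiny test cannot outrank a large one on noise alone. The Beta prior, its parameters, the class and method names, and the numbers are assumptions for illustration; the slides do not say which estimator was used.

```java
// Hypothetical helper: posterior mean of a conversion rate under a Beta prior.
final class AdTestSketch {

    /** Posterior mean for a binomial rate with a Beta(priorAlpha, priorBeta) prior. */
    static double posteriorMean(long successes, long trials, double priorAlpha, double priorBeta) {
        if (successes < 0 || trials < successes) {
            throw new IllegalArgumentException("need 0 <= successes <= trials");
        }
        return (successes + priorAlpha) / (trials + priorAlpha + priorBeta);
    }

    public static void main(String[] args) {
        // assumed prior: Beta(2, 18), i.e. roughly a 10% rate backed by 20 pseudo-observations
        double small = posteriorMean(3, 10, 2.0, 18.0);      // raw rate 0.30 from only 10 trials
        double large = posteriorMean(250, 1000, 2.0, 18.0);  // raw rate 0.25 from 1000 trials
        System.out.printf("small test: %.4f, large test: %.4f%n", small, large);
    }
}
```

With these made-up numbers the small test has the higher raw rate, but after shrinkage the large test ranks higher, which is the point of comparing different-sized tests this way.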

use case: ecommerce fraud
• sparse data means lots of missing values
• “needle in a haystack” lack of training cases
• answers are available in large-scale batch, results
are needed in real-time event processing
• not just one pattern to detect – many, ever-changing
(image: stat.berkeley.edu)

approach:
• random forest (RF) classifiers predict likely fraud
• subsampled data to re-balance training sets
• impute missing values based on density functions
• train on massive log files, run on in-memory grid
• adjust metrics to minimize customer support costs
• detect novelty – report anomalies via notifications

use case: customer segmentation
• many millions of customers, hard to determine
which features resonate
• multi-modal distributions get obscured by the
practice of calculating an “average”
• not much is known about individual customers
(image: Mathworks)

approach:
• connected components for sessionization, determining
uniques from logs
• estimates for age, gender, income, geo, etc.
• clustering algorithms to group into market segments
• social graph infers “unknown” relationships
• covariance/heat maps visualize segments vs. feature sets

use case: monetizing content
• need to suggest relevant content which would
otherwise get buried in the back catalog
• big disconnect between inventory and limited
performance ad market
• enormous amounts of text, hard to categorize
(image: Digital Humanities)

approach:
• text analytics glean key phrases from documents
• hierarchical clustering of char frequencies detects language
• latent Dirichlet allocation (LDA) reduces dimension to
topic models
• recommenders suggest similar topics to customers
• collaborative filters connect known users with less known

data+code “political spectrum”
“Notes from the Mystery Machine Bus”
by Steve Yegge, Google
goo.gl/SeRZa

“conservative”                         “liberal”
(mostly) Enterprise                    (mostly) Start-Up
risk management                        customer experiments
assurance                              flexibility
well-defined schema                    schema follows code
explicit configuration                 convention
type-checking compiler                 interpreted scripts
wants no surprises                     wants no impediments
Java, Scala, Clojure, etc.             PHP, Ruby, Python, etc.
Cascading, Scalding, Cascalog, etc.    Hive, Pig, Hadoop Streaming, etc.

a selection of great tools…
reporting: Graphite, PowerPivot, Pentaho, Jaspersoft, SAS
visualization: ggplot2, D3, Gephi
analytics/modeling: R, Weka, Matlab, PMML, GLPK
text: LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK
apps: Cascading, Scalding, Cascalog, R markdown, SWF
scale-out: Scalr, RightScale, CycleComputing, vFabric, Beanstalk
graph: Gremlin, GraphLab, Neo4J
column: Vertica, HBase, Drill, Dynamo
key/val: Redis, Membase, MySQL
index: Lucene/Solr, ElasticSearch
relational: usual suspects
imdg: Spark, Storm, Gigaspaces
hadoop: EMR, HW, MapR, EMC, Azure, Compute
machine data: Splunk, collectd, Nagios
durable storage: S3, ASV, GCS, Riak, Couch


4. example:
Cascading apps

the workflow abstraction

cascading.org/category/impatient/
[diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count]

layers of a workflow
business process: domain expertise, business trade-offs, market position, operating parameters, etc.
API language: Scala, Clojure, Python, Ruby, Java, etc. …envision whatever runs in a JVM
optimize / schedule: major changes in technology now
physical plan: [workflow diagram]
compute substrate: Apache Hadoop, in-memory local mode; “assembler” code …envision GPUs, streaming, etc.
machine data: Splunk, Nagios, Collectd, New Relic, etc.

audience?
• Business Stakeholder POV:
business process management for workflow orchestration (think BPM/BPEL)

• Systems Integrator POV:
system integration of heterogeneous data sources and compute platforms

• Data Scientist POV:
a directed, acyclic graph (DAG) on which we can apply Amdahl's Law

• Data Architect POV:
a physical plan for large-scale data flow management

• Software Architect POV:
a pattern language, similar to plumbing or circuit design

• App Developer POV:
API bindings for Scala, Clojure, Python, Ruby, Java

• Systems Engineer POV:
a JAR file, has passed CI, available in a Maven repo

1: copy
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class
  Main
  {
  public static void
  main( String[] args )
    {
    // input and output paths come from the command line
    String inPath = args[ 0 ];
    String outPath = args[ 1 ];

    Properties props = new Properties();
    AppProps.setApplicationJarClass( props, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

    // create the source tap (tab-delimited text with a header line)
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

    // create the sink tap
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
     .addSource( copyPipe, inTap )
     .addTailSink( copyPipe, outTap );

    // run the flow
    flowConnector.connect( flowDef ).complete();
    }
  }

[diagram: Source → M → Sink]
1 mapper
0 reducers
10 lines code

wait!

ten lines of code
for a file copy…
seems like a lot.

same JAR, any scale…
MegaCorp Enterprise IT:
PBs of data
1000+ node private cluster
EVP calls you when app fails
runtime: days+

Production Cluster:
TBs of data
EMR w/ 50 HPC Instances
Ops monitors results
runtime: hours – days

Staging Cluster:
GBs of data
EMR + 4 Spot Instances
CI shows red or green lights
runtime: minutes – hours

Your Laptop:
MBs of data
Hadoop standalone mode
passes unit tests, or not
runtime: seconds – minutes

2: word count
[diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
1 mapper
1 reducer
18 lines code
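
The Tokenize → GroupBy → Count steps of this example can be sketched in plain Java on a single machine. This is not the Cascading app itself and uses no Cascading classes; the class name, method name, delimiter regex, and sample text are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine stand-in for the word count workflow.
final class WordCountSketch {

    /** Splits each document into tokens, groups by token, and counts occurrences. */
    static Map<String, Integer> wordCount(Iterable<String> documents) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : documents) {
            // Tokenize: split on spaces, brackets, parentheses, commas, and periods (assumed delimiters)
            for (String token : text.split("[ \\[\\]\\(\\),.]")) {
                if (token.isEmpty()) {
                    continue; // adjacent delimiters produce empty strings
                }
                // GroupBy token + Count
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            wordCount(Arrays.asList("a rain shadow is a dry area", "rain, rain"));
        counts.forEach((token, n) -> System.out.println(token + "\t" + n));
    }
}
```

On a cluster, the split runs in the mapper and the grouped count runs in the reducer, which matches the "1 mapper, 1 reducer" note above.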

3: wc + scrub
[diagram: Document Collection → Tokenize → Scrub token (M) → GroupBy token → Count (R) → Word Count]
1 mapper
1 reducer
22+10 lines code

4: wc + scrub + stop words
[diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token (M) → GroupBy token → Count (R) → Word Count]
1 mapper
1 reducer
28+10 lines code
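
A rough plain-Java sketch of the scrub and stop word steps. In the workflow these are a left HashJoin against the stop word list followed by a regex filter; here a hash set lookup stands in for both, since the stop word list is small enough to hold in memory. The class, method names, scrub rules, and sample data are assumptions for illustration, not the Cascading code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for Scrub + stop word filtering ahead of the count.
final class StopWordSketch {

    /** Scrub: trim whitespace and lower-case the token (assumed cleanup rules). */
    static String scrub(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    /** Counts scrubbed tokens, skipping empties and anything in the stop word set. */
    static Map<String, Integer> countKeptTokens(List<String> tokens, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String raw : tokens) {
            String token = scrub(raw);
            if (token.isEmpty() || stopWords.contains(token)) {
                continue; // dropped, like rows that matched the stop word list
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "is", "the"));
        List<String> tokens = Arrays.asList("The", "rain", "shadow", "is", "a", "dry", "Rain ");
        System.out.println(countKeptTokens(tokens, stopWords));
    }
}
```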

5: tf-idf
[diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token, then branches:
  D: Unique doc_id → Insert 1 → SumBy doc_id
  DF: Unique token → GroupBy token → Count
  TF: GroupBy doc_id, token → Count
  Word Count: GroupBy token → Count
branches combined via HashJoin and CoGroup → ExprFunc tf-idf → TF-IDF]
11 mappers
9 reducers
65+10 lines code
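
The D, DF, and TF branches of this workflow correspond to three counts, which a small plain-Java sketch can make concrete. The scoring formula used here, tf * ln(n_docs / (1 + df)), is one common variant and is an assumption; the slides do not show the expression. Class and method names and the sample documents are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-machine stand-in for the tf-idf workflow.
final class TfIdfSketch {

    /** Returns doc_id -> token -> score, with score = tf * ln(nDocs / (1 + df)) (assumed variant). */
    static Map<String, Map<String, Double>> tfIdf(Map<String, List<String>> docs) {
        int nDocs = docs.size();                                  // D: number of documents
        Map<String, Map<String, Integer>> tf = new HashMap<>();   // TF: count per (doc_id, token)
        Map<String, Integer> df = new HashMap<>();                // DF: documents containing token

        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            Map<String, Integer> counts = new HashMap<>();
            for (String token : doc.getValue()) {
                counts.merge(token, 1, Integer::sum);
            }
            tf.put(doc.getKey(), counts);
            for (String token : counts.keySet()) {
                df.merge(token, 1, Integer::sum); // each document counted once per token
            }
        }

        Map<String, Map<String, Double>> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> doc : tf.entrySet()) {
            Map<String, Double> row = new HashMap<>();
            for (Map.Entry<String, Integer> term : doc.getValue().entrySet()) {
                double idf = Math.log((double) nDocs / (1.0 + df.get(term.getKey())));
                row.put(term.getKey(), term.getValue() * idf);
            }
            scores.put(doc.getKey(), row);
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new HashMap<>();
        docs.put("d1", Arrays.asList("rain", "shadow", "rain"));
        docs.put("d2", Arrays.asList("rain", "lee"));
        docs.put("d3", Arrays.asList("dry", "area"));
        docs.put("d4", Arrays.asList("dry"));
        tfIdf(docs).forEach((docId, row) -> System.out.println(docId + "\t" + row));
    }
}
```

Each of the three maps becomes its own grouped aggregation at scale, which is why the mapper and reducer counts jump compared with plain word count.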

6: tf-idf + tdd
[diagram: same tf-idf workflow (D, DF, TF, and Word Count branches combined via HashJoin and CoGroup → ExprFunc tf-idf → TF-IDF), with Assert, Checkpoint, and Failure Traps added]
12 mappers
9 reducers
76+14 lines code

City of Palo Alto open data
[diagram labels: CoPA GIS export; Regex parser; tsv parser; Regex filter (tree, road, park); Scrub species; HashJoin Left (RHS: Tree Metadata); HashJoin Left (RHS: Road Metadata); Estimate Albedo; Road Segments; Geohash; Tree Distance; Filter tree_dist; GroupBy tree_name; CoGroup; GPS logs; Checkpoint (tree, road, shade); Failure Traps; reco; park]
github.com/Cascading/CoPA/wiki
‣ GIS export for parks, roads, trees (unstructured / open data)
‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
‣ curated metadata, used to enrich the dataset
‣ could extend via mash-up with many available public data APIs

Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
“Find a shady spot on a summer day to walk near downtown and take a call…”

CoPA: results
[plot: Estimated Tree Height (meters); x-axis: avg_height, 0 to 50; y-axis: density, 0.00 to 0.12; legend: count, 0 to 300]

‣ addr: 115 HAWTHORNE AVE
‣ lat/lng: 37.446, -122.168
‣ geohash: 9q9jh0
‣ tree: 413 site 2
‣ species: Liquidambar styraciflua
‣ avg height: 23 m
‣ road albedo: 0.12
‣ distance: 10 m
‣ a short walk from my train stop ✔

drill-down

blog: cascading.org/
code/wiki/gists: github.com/Cascading/
jars: conjars.org/
list: goo.gl/KQtUL
DevOps products: concurrentinc.com/
