Intro to Data Science
with Cascading


Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid

[Figure: Cascading word-count workflow. Document Collection, Tokenize, Scrub token, HashJoin Left (RHS: Stop Word List), Regex token, GroupBy token, Count, Word Count]

Copyright @2012, Concurrent, Inc.
opportunity

 Unstructured Data
   meets
  Enterprise Scale
 1. backstory: how we got here
 2. build: data science teams
 3. overview: typical use cases
 4. example: Cascading apps
1. backstory:
how we got here
inflection point
 huge Internet successes after 1997 holiday season…
 AMZN, EBAY, Inktomi (YHOO Search), then GOOG

 consider this metric:
   annual revenue per customer / amount of data stored
 which dropped 100x within a few years after 1997

 storage and processing costs plummeted, now we must
 work much smarter to extract ROI from Big Data…
 our methods must adapt

 “conventional wisdom” of RDBMS and BI tools became
 less viable; business cadre still focused on pivot tables
 and pie charts… tends toward inertia!

 MapReduce and the Hadoop open source stack grew
 directly out of that contention… however, that effort
 only solves parts of the puzzle

 [Figure: timeline marking 1997, 1998, 2004]
inflection point: consequences
 Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm)
 Hadoop Summit, 2012:

 “All of Fortune 500 is now on notice over the next 10-year period.”
 Amazon and Google as exemplars of massive disruption in retail,
 advertising, etc.
 data as the major force displacing Global 1000 over the next decade,
 mostly through apps — verticals, leveraging domain expertise


 Michael Stonebraker (INGRES, PostgreSQL, Vertica, VoltDB, etc.)
 XLDB, 2012:

 “Complex analytics workloads are now displacing SQL as the basis
  for Enterprise apps.”
primary sources
 Amazon
 “Early Amazon: Splitting the website” – Greg Linden
 glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

 eBay
 “The eBay Architecture” – Randy Shoup, Dan Pritchett
 addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
 addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

 Inktomi (YHOO Search)
 “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
 youtube.com/watch?v=E91oEn1bnXM

 Google
 “The Birth of Google” – John Battelle
 wired.com/wired/archive/13.08/battelle.html
 “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
 youtube.com/watch?v=qsan-GQaeyk
 perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
the world before…

BI, SQL, and highly
optimized code
data innovation: circa 1996
 [Diagram: roles and artifacts circa 1996. Roles: Stakeholder, Product, Engineering, BI Analysts, Customers. Artifacts: Excel pivot tables, PowerPoint slide decks, strategy, requirements, optimized code, Web App, transactions, RDBMS, SQL Query result sets]
the world after…

machine learning,
leveraging log files
data innovation: circa 2001
 [Diagram: roles and artifacts circa 2001. Roles: Stakeholder, Product, Engineering, Algorithmic Modeling, Customers. Artifacts: dashboards, UX, models, servlets, recommenders + classifiers, Web Apps, Middleware, aggregation, event history, SQL Query result sets, customer transactions, Logs, DW, ETL, RDBMS]
the world ahead…

what our customers
are doing now
data innovation: circa 2013
 [Diagram: roles and artifacts circa 2013. Introduced capability: Domain Expert, Data Scientist, App Dev, Ops. Existing SDLC: Prod, s/w dev, Eng, Ops. Labels: Customers; Data Apps; business process; Workflow; dashboard metrics; History; services; Web Apps, Mobile, etc.; data science; Planner; social interactions; discovery + modeling; optimized capacity; transactions, content; endpoints; Data Access Patterns; Hadoop, etc.; Log Events; In-Memory Data Grid; DW; batch; "real time"; Cluster Scheduler; RDBMS]
a key difference…
statistical thinking

          Process           Variation            Data           Tools




  employing a mode of thought which includes both logical and analytical reasoning:
  evaluating the whole of a problem, as well as its component parts; attempting
  to assess the effects of changing one or more variables

  this approach attempts to understand not just problems and solutions,
  but also the processes involved and their variances

  particularly valuable in Big Data work when combined with hands-on experience in
  physics – roughly 50% of my peers come from physics or physical engineering…

  programmers typically don’t think this way…
  however, both systems engineers and data scientists must!
most valuable skills
 approximately 80% of the costs for data-related projects
 get spent on data preparation – mostly on cleaning up
 data quality issues: ETL, log file analysis, etc.

 unfortunately, data-related budgets for many companies tend
 to go into frameworks which can only be used after clean up

 most valuable skills:
   ‣ learn to use programmable tools that prepare data

   ‣ learn to generate compelling data visualizations

   ‣ learn to estimate the confidence for reported results

   ‣ learn to automate work, making analysis repeatable
 the rest of the skills – modeling,
 algorithms, etc. – those are secondary
social caveats
 “This data cannot be correct!” may be an early warning
 about the organization itself
 much depends on how the people whom you work alongside
 tend to arrive at decisions:
     ‣ probably good: Induction, Abduction, Circumscription
     ‣ probably poor: Deduction, Speculation, Justification


 in general, one good data visualization
 puts many ongoing verbal arguments to rest
 however, let domain experts handle
 “data storytelling”, not data scientists



references

 by Leo Breiman
 Statistical Modeling:
 The Two Cultures
 Statistical Science, 2001
 bit.ly/eUTh9L
references

 by Jack Olson
 Data Quality
 Morgan Kaufmann, 2003
 amazon.com/dp/1558608915
references

 also check out RStudio:
 rstudio.org/
 rpubs.com/
2. build:
data science teams
core values
 Data Science teams develop actionable insights,
 building confidence for decisions

 that work may influence a few decisions worth
 billions (e.g., M&A) or billions of small decisions
 (e.g., AdWords)

 probably somewhere in-between…
 solving for pattern, at scale.
 an interdisciplinary pursuit which
 requires teams, not sole players
team process

    discovery     help people ask the right questions
    modeling      allow automation to place informed bets
    integration   deliver products at scale to customers
    apps          build smarts into product features
    systems       keep infrastructure running, cost-effective
building teams
 [Chart: team roles (stakeholder, scientist, developer, ops) plotted against the five process areas (discovery, modeling, integration, apps, systems)]
references

 by DJ Patil

 Data Jujitsu
 O’Reilly, 2012
 amazon.com/dp/B008HMN5BE

 Building Data Science Teams
 O’Reilly, 2011
 amazon.com/dp/B005O4U3ZE
3. overview:
typical use cases
using science in data science
  in a nutshell, what we do…
  ‣ estimate probability
  ‣ calculate analytic variance
  ‣ manipulate order complexity
  ‣ make use of learning theory
  ‣ collab with DevOps, Stakeholders

  [Figure: chart whose axis labels are log event types, e.g., Website Login, Add Buddy, Chat Now, Unique Registration]
probability estimation
 “a random variable or stochastic variable is a
 variable whose value is subject to variations”
 “an estimator is a rule for calculating an
 estimate of a given quantity based on observed
 data”
 estimators and probability
 distributions provide the essential
 basis for our insights
 bayesian methods, shrinkage…
 these are our friends
 quantile estimation, empirical CDFs…
 …versus frequentist notions
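
 A minimal sketch of quantile estimation from an empirical CDF, in plain Python. The function names and the `latencies_ms` sample are hypothetical, chosen only for illustration; the quantile rule used here (smallest observation whose CDF reaches q) is one of several common conventions.

```python
import math
from bisect import bisect_right

def empirical_cdf(sample):
    """Return a function F where F(x) is the fraction of observations <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        # bisect_right counts how many observations are <= x
        return bisect_right(ordered, x) / n

    return cdf

def quantile(sample, q):
    """Smallest observation x whose empirical CDF satisfies F(x) >= q."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(sample)
    k = math.ceil(q * len(ordered)) - 1
    return ordered[k]

# Hypothetical sample with one large outlier.
latencies_ms = [12, 15, 11, 40, 13, 14, 250, 16, 12, 18]
cdf = empirical_cdf(latencies_ms)
median = quantile(latencies_ms, 0.5)
share_at_most_16 = cdf(16)
```

 With a sample like this, the median stays near the bulk of the data while the arithmetic mean is pulled upward by the single outlier, which is one reason to report quantiles alongside averages.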
analytic variance
 our tools for automation leverage deep
 understanding of covariance
 cannot overstate the importance of
 sampling… insist on metrics described
 as confidence intervals, where valid
 bootstrapping, bagging…
 these are our friends
 Monte Carlo methods resolve “black box”
 problems
 point estimates may help prevent
 “uninformed” decisions
 do not skimp on this part, ever…
 a hard lesson learned from BI failures
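
 One way to report a metric as a confidence interval is the percentile bootstrap. The sketch below is an illustration under assumptions: `bootstrap_ci`, its defaults, and the `order_values` sample are hypothetical, and the percentile method is only one of several bootstrap interval constructions.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000,
                 level=0.95, seed=42):
    """Percentile bootstrap interval for stat(sample)."""
    rng = random.Random(seed)  # fixed seed keeps the analysis repeatable
    n = len(sample)
    # Resample with replacement, recompute the statistic each time.
    estimates = sorted(
        stat([rng.choice(sample) for _ in range(n)])
        for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo = estimates[int(alpha * n_resamples)]
    hi = estimates[int((1.0 - alpha) * n_resamples) - 1]
    return lo, hi

# Hypothetical order values, including one unusually large order.
order_values = [23, 31, 18, 95, 27, 22, 40, 19, 35, 26, 30, 21]
point_estimate = statistics.mean(order_values)
lo, hi = bootstrap_ci(order_values)
```

 Reporting `(lo, hi)` next to `point_estimate` makes the uncertainty from a small sample visible instead of hiding it behind a single number.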
order complexity
 techniques for manipulating order complexity:
 dimensional reduction… with clustering
 as a common case
 e.g., you may have 100 million HTML docs,
 but there are only ~10K useful keywords
 low-dimensional structures, PCA
 linear algebra tricks: eigenvalues, matrix
 decomposition, etc.
 many hard problems resolved by “divide and
 conquer”
 this is an area ripe for much advancement in
 algorithms research near-term
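
 As a small illustration of the eigenvalue approach to low-dimensional structure, the sketch below finds the first principal component of a tiny data set by power iteration on its covariance matrix. It uses plain Python so that it stays self-contained; the function names and data are hypothetical, and at scale one would rely on a numerical library rather than this loop.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of rows (each row is one observation)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1)
             for j in range(d)] for i in range(d)]

def top_eigenpair(matrix, iterations=200):
    """Power iteration: dominant eigenvalue and unit eigenvector.

    Assumes the start vector is not orthogonal to the dominant eigenvector.
    """
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for unit v.
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v

# Hypothetical 2-D points lying exactly on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
cov = covariance_matrix(points)
eigenvalue, direction = top_eigenpair(cov)
```

 Because these points are collinear, one direction carries all of the variance, so the two columns can be replaced by a single coordinate along `direction` with no loss.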
learning theory
 in general, apps alternate between learning
 patterns/rules and retrieving similar things…
 statistical learning theory – rigorous,
 prevents you from making billion dollar
 mistakes, probably our future
 machine learning – scalable, enables
 you to make billion dollar mistakes, much
 commercial emphasis
 supervised vs. unsupervised
 arguably, optimization is a related area

 once Big Data projects get beyond merely
 digesting log files, optimization will likely
 become yet another buzzword :)
use case: marketing funnel
  • must optimize a very large ad spend
  • different vendors report different metrics
  • seasonal variation distorts performance
  • some campaigns are much smaller than others
  • hard to predict ROI for incremental spend

  approach:
  • log aggregation, followed with cohort analysis
  • bayesian point estimates compare different-sized ad tests
  • customer lifetime value quantifies ROI of new leads
  • time series analysis normalizes for seasonal variation
  • geolocation adjusts for regional cost/benefit
  • linear programming models estimate elasticity of demand
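
  To make the point about different-sized ad tests concrete, here is a minimal sketch of a Bayesian point estimate using a Beta prior on the conversion rate. The prior Beta(2, 8), the function name, and the counts are assumptions for illustration, not figures from the talk.

```python
def posterior_mean(conversions, impressions, prior_a=1.0, prior_b=1.0):
    """Posterior mean of a conversion rate under a Beta(prior_a, prior_b) prior.

    The prior acts like prior_a + prior_b pseudo-observations, so small
    tests are pulled toward the prior mean more than large ones (shrinkage).
    """
    return (conversions + prior_a) / (impressions + prior_a + prior_b)

# Hypothetical campaigns: a tiny test and a large one.
small_raw = 3 / 10          # raw rate 0.30 from only 10 impressions
large_raw = 200 / 1000      # raw rate 0.20 from 1000 impressions

# Assumed prior: mean 0.20, worth 10 pseudo-impressions.
small_est = posterior_mean(3, 10, prior_a=2.0, prior_b=8.0)
large_est = posterior_mean(200, 1000, prior_a=2.0, prior_b=8.0)
```

  The small test's estimate moves from 0.30 to 0.25, while the large test stays at 0.20, so the two campaigns are compared on a footing that reflects how much evidence each one has.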
use case: ecommerce fraud
  • sparse data means lots of missing values
  • “needle in a haystack” lack of training cases
  • answers are available in large-scale batch, results
    are needed in real-time event processing
  • not just one pattern to detect – many, ever-changing

  approach:
  • random forest (RF) classifiers predict likely fraud
  • subsampled data to re-balance training sets
  • impute missing values based on density functions
  • train on massive log files, run on in-memory grid
  • adjust metrics to minimize customer support costs
  • detect novelty – report anomalies via notifications
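
  The re-balancing step can be sketched as downsampling the majority class before training. This is an illustration only: `rebalance`, the `ratio` parameter, and the synthetic records are hypothetical, and other schemes (oversampling, class weights) are equally plausible.

```python
import random

def rebalance(records, is_positive, ratio=1.0, seed=7):
    """Keep every positive case; downsample negatives to ratio * positives."""
    rng = random.Random(seed)
    positives = [r for r in records if is_positive(r)]
    negatives = [r for r in records if not is_positive(r)]
    keep = min(len(negatives), int(ratio * len(positives)))
    balanced = positives + rng.sample(negatives, keep)
    rng.shuffle(balanced)  # avoid ordering the training set by class
    return balanced

# Hypothetical transactions: 10 fraud cases among 1000 records.
transactions = [{"id": i, "fraud": i < 10} for i in range(1000)]
training_set = rebalance(transactions, lambda r: r["fraud"])
fraud_count = sum(1 for r in training_set if r["fraud"])
```

  Here the classifier would see 10 fraud and 10 non-fraud cases rather than a 1:99 split; any probabilities it outputs then need recalibrating against the true base rate.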
use case: customer segmentation
  • many millions of customers, hard to determine
    which features resonate
  • multi-modal distributions get obscured by the
    practice of calculating an “average”
  • not much is known about individual customers

  approach:
  • connected components for sessionization, determining
    uniques from logs
  • estimates for age, gender, income, geo, etc.
  • clustering algorithms to group into market segments
  • social graph infers “unknown” relationships
  • covariance/heat maps visualize segments vs. feature sets
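
  A minimal sketch of connected components for determining uniques, using union-find. The identifier strings and the idea that each log line links two identifiers (a cookie with a login, for example) are assumptions made for illustration.

```python
from collections import defaultdict

def connected_components(edges):
    """Group nodes linked by edges, using union-find with path halving."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())

# Hypothetical identifier pairs observed together in log events.
observed_together = [
    ("cookie:a", "user:1"),
    ("cookie:b", "user:1"),
    ("cookie:c", "user:2"),
    ("cookie:b", "ip:9"),
]
uniques = connected_components(observed_together)
```

  Each resulting set is treated as one unique visitor, so six identifiers collapse to two uniques in this example.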
use case: monetizing content
  • need to suggest relevant content which would
    otherwise get buried in the back catalog
  • big disconnect between inventory and limited
    performance ad market
  • enormous amounts of text, hard to categorize

  approach:
  • text analytics glean key phrases from documents
  • hierarchical clustering of char frequencies detects language
  • latent dirichlet allocation (LDA) reduces dimension to
    topic models
  • recommenders suggest similar topics to customers
  • collaborative filters connect known users with less known
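
  The character-frequency idea can be sketched with a tiny single-linkage agglomerative clustering over per-document letter profiles. Everything here is a simplified assumption: the function names, the cosine distance, and the sample texts, which use two different scripts so that the separation is unambiguous. Languages that share a script need much longer texts to separate reliably.

```python
import math
from collections import Counter

def char_profile(text):
    """Relative frequency of each letter in the text."""
    letters = [c for c in text.lower() if c.isalpha()]
    total = len(letters) or 1
    return {c: n / total for c, n in Counter(letters).items()}

def cosine_distance(p, q):
    dot = sum(p[c] * q.get(c, 0.0) for c in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return 1.0 - dot / norm if norm else 1.0

def agglomerate(profiles, k):
    """Single-linkage clustering: merge the closest clusters until k remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(cosine_distance(profiles[i], profiles[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    return clusters

docs = [
    "the quick brown fox",
    "a lazy dog sleeps",
    "быстрая коричневая лиса",
    "ленивая собака спит",
]
groups = agglomerate([char_profile(d) for d in docs], k=2)
```

  Documents that land in the same group share a letter distribution, which serves as a cheap proxy for language before any heavier text analytics run.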
data+code “political spectrum”
 “Notes from the Mystery Machine Bus”
 by Steve Yegge, Google
 goo.gl/SeRZa
         “conservative”                             “liberal”
           (mostly) Enterprise                   (mostly) Start-Up

            risk management                    customer experiments

                assurance                            flexibility

          well-defined schema                   schema follows code
          explicit configuration                     convention

         type-checking compiler                 interpreted scripts

           wants no surprises                  wants no impediments

         Java, Scala, Clojure, etc.            PHP, Ruby, Python, etc.

  Cascading, Scalding, Cascalog, etc.   Hive, Pig, Hadoop Streaming, etc.
a selection of great tools…
   reporting: Graphite, PowerPivot, Pentaho, Jaspersoft, SAS
   visualization: ggplot2, D3, Gephi
   analytics/modeling: R, Weka, Matlab, PMML, GLPK
   text: LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK
   apps: Cascading, Scalding, Cascalog, R markdown, SWF
   scale-out: Scalr, RightScale, CycleComputing, vFabric, Beanstalk
   graph: Gremlin, GraphLab, Neo4J
   column: Vertica, HBase, Drill, Dynamo
   key/val: Redis, Membase, MySQL
   index: Lucene/Solr, ElasticSearch
   relational: usual suspects
   imdg: Spark, Storm, Gigaspaces
   hadoop: EMR, HW, MapR, EMC, Azure, Compute
   machine data: Splunk, collectd, Nagios
   durable storage: S3, ASV, GCS, Riak, Couch
4. example:
Cascading apps
the workflow abstraction

  cascading.org/category/impatient/
  [Figure: word-count workflow. Document Collection, Tokenize, Scrub token, HashJoin Left (RHS: Stop Word List), Regex token, GroupBy token, Count, Word Count]
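
  The stages named in the workflow diagram can be mimicked in plain Python to show the data flow. This is not Cascading code and does not use its API: the function names, the token-splitting regex, and the sample documents are assumptions, and the stop-word step stands in for the left HashJoin followed by the regex filter.

```python
import re
from collections import Counter

def tokenize(documents):
    """Split each document into raw tokens on spaces and punctuation."""
    for doc in documents:
        for token in re.split(r"[ \[\]\(\),.]", doc):
            yield token

def scrub(tokens):
    """Normalize tokens and drop the empty ones."""
    for token in tokens:
        cleaned = token.strip().lower()
        if cleaned:
            yield cleaned

def remove_stop_words(tokens, stop_words):
    """Keep only tokens with no match in the stop word list."""
    stop = set(stop_words)
    return (t for t in tokens if t not in stop)

def word_count(documents, stop_words):
    """Group by token and count, as in the GroupBy and Count stages."""
    return Counter(remove_stop_words(scrub(tokenize(documents)), stop_words))

# Hypothetical inputs.
documents = ["A rain shadow is a dry area.", "The rain falls on the plain."]
stop_words = ["a", "is", "the", "on"]
counts = word_count(documents, stop_words)
```

  Each function maps to one box in the diagram, which is the point of the workflow abstraction: the same pipeline description can be planned onto different compute substrates.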
layers of a workflow
  business process      domain expertise, business trade-offs,
                        market position, operating parameters, etc.

  API language          Scala, Clojure, Python, Ruby, Java, etc.
                        …envision whatever runs in a JVM

  optimize / schedule   major changes in technology now

  physical plan         “assembler” code

  compute substrate     Apache Hadoop, in-memory local mode
                        …envision GPUs, streaming, etc.

  machine data          Splunk, Nagios, Collectd, New Relic, etc.
audience?
 •   Business Stakeholder POV:
     business process management for workflow orchestration (think BPM/BPEL)

 •   Systems Integrator POV:
     system integration of heterogenous data sources and compute platforms

 •   Data Scientist POV:
     a directed, acyclic graph (DAG) on which we can apply Amdahl's Law

 •   Data Architect POV:
     a physical plan for large-scale data flow management

 •   Software Architect POV:
     a pattern language, similar to plumbing or circuit design

 •   App Developer POV:
     API bindings for Scala, Clojure, Python, Ruby, Java

 •   Systems Engineer POV:
     a JAR file, has passed CI, available in a Maven repo
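
For the Data Scientist view, Amdahl's Law bounds how much a workflow can speed up when only part of it runs in parallel. A minimal sketch, assuming the standard form speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction and n the number of workers. The class and method names are hypothetical and do not come from the deck:

```java
final class AmdahlSketch {
    // Standard Amdahl's Law: speedup = 1 / ((1 - p) + p / n).
    static double speedup(double parallelFraction, int workers) {
        if (parallelFraction < 0.0 || parallelFraction > 1.0 || workers < 1) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and workers >= 1");
        }
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    public static void main(String[] args) {
        // Show how the speedup flattens as workers are added.
        for (int n : new int[] {1, 10, 100, 1000}) {
            System.out.printf("workers=%d speedup=%.2f%n", n, speedup(0.9, n));
        }
    }
}
```

With 90% of the work parallelizable, the speedup stays below 10 no matter how many workers are added, which is why the serial stages of the graph deserve attention first.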
1: copy
 [diagram: Source → M → Sink]

   import java.util.Properties;

   import cascading.flow.FlowDef;
   import cascading.flow.hadoop.HadoopFlowConnector;
   import cascading.pipe.Pipe;
   import cascading.property.AppProps;
   import cascading.scheme.hadoop.TextDelimited;
   import cascading.tap.Tap;
   import cascading.tap.hadoop.Hfs;

   public class
     Main
     {
     public static void
     main( String[] args )
       {
       String inPath = args[ 0 ];
       String outPath = args[ 1 ];

       // tell Hadoop which JAR contains this application
       Properties props = new Properties();
       AppProps.setApplicationJarClass( props, Main.class );
       HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

       // create the source tap (tab-delimited text with a header line)
       Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

       // create the sink tap
       Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

       // specify a pipe to connect the taps
       Pipe copyPipe = new Pipe( "copy" );

       // connect the taps, pipes, etc., into a flow
       FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
        .addSource( copyPipe, inTap )
        .addTailSink( copyPipe, outTap );

       // run the flow
       flowConnector.connect( flowDef ).complete();
       }
     }

 1 mapper
 0 reducers
10 lines code
wait!



  ten lines of code
  for a file copy…
  seems like a lot.
same JAR, any scale…
                                                       MegaCorp Enterprise IT:
                                                       PBs of data
                                                       1000+ node private cluster
                                                       EVP calls you when app fails
                                                       runtime: days+

                                        Production Cluster:
                                        TBs of data
                                        EMR w/ 50 HPC Instances
                                        Ops monitors results
                                        runtime: hours – days

                    Staging Cluster:
                    GBs of data
                    EMR + 4 Spot Instances
                    CI shows red or green lights
                    runtime: minutes – hours

 Your Laptop:
 MBs of data
 Hadoop standalone mode
 passes unit tests, or not
 runtime: seconds – minutes
2: word count


 [diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; Tokenize runs in the map phase (M), GroupBy and Count in the reduce phase (R)]

 1 mapper
 1 reducer
18 lines code
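
What this flow computes can be sketched without a cluster. The following is plain Java rather than Cascading API: it tokenizes each line, groups by token, and counts. The delimiter pattern and the class name are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {
    // Assumed delimiters: whitespace, brackets, parentheses, commas, periods.
    private static final String DELIMS = "[\\s\\[\\]\\(\\),.]+";

    // Tokenize -> GroupBy token -> Count, all in memory.
    static Map<String, Long> count(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split(DELIMS)) {
                // A leading delimiter yields an empty first token; skip it.
                if (!token.isEmpty()) {
                    counts.merge(token, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("rain shadow", "rain, rain (again)");
        System.out.println(count(docs));
    }
}
```

In the distributed version the inner loop corresponds to the mapper and the merge step to the reducer.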
3: wc + scrub


 [diagram: Document Collection → Tokenize → Scrub token → GroupBy token → Count → Word Count; map phase (M) up to the GroupBy, reduce phase (R) after]

 1 mapper
 1 reducer
22+10 lines code
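
The scrub step normalizes each token before counting. A plain-Java sketch, assuming that scrubbing means trimming whitespace, lower-casing, and dropping tokens that end up empty. The deck does not spell out the rules, so treat both the rules and the names as assumptions:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class ScrubSketch {
    // Assumed scrub rule: trim and lower-case; empty results are discarded.
    static List<String> scrub(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String cleaned = token.trim().toLowerCase(Locale.ROOT);
            if (!cleaned.isEmpty()) {
                out.add(cleaned);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scrub(Arrays.asList(" Rain", "SHADOW ", "  ", "rain")));
    }
}
```

Without this step, "Rain" and "rain" would be counted as different words.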
4: wc + scrub + stop words


 [diagram: Document Collection → Tokenize → Scrub token → HashJoin (Left, with the Stop Word List as RHS) → Regex token → GroupBy token → Count → Word Count; map phase (M) up to the GroupBy, reduce phase (R) after]

 1 mapper
 1 reducer
28+10 lines code
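
The HashJoin and Regex steps together drop stop words: each token is left-joined against the stop word list, and only tokens without a match survive. A plain-Java sketch of that logic, not Cascading API, with hypothetical names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StopWordSketch {
    // Left join against the stop list, then keep rows whose right side is missing.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Right-hand side of the join: the matching stop word, or null.
            String match = stopWords.contains(token) ? token : null;
            if (match == null) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("a", "the"));
        System.out.println(removeStopWords(Arrays.asList("a", "rain", "the", "shadow"), stop));
    }
}
```

Holding the smaller side of the join in memory is what lets the join run in the map phase, so the reducer count stays at one.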
5: tf-idf


 [diagram: Document Collection → Tokenize → Scrub token → HashJoin (Left, with the Stop Word List as RHS) → Regex token, then four branches.
   D:  Unique doc_id → Insert 1 → SumBy doc_id
   DF: Unique token → GroupBy token → Count
   TF: GroupBy doc_id, token → Count
   Word Count: GroupBy token → Count
  DF is joined with D (RHS) by HashJoin, the result is joined with TF by CoGroup, and ExprFunc tf-idf produces the TF-IDF output.]

  11 mappers
   9 reducers
  65+10 lines code
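
The final ExprFunc step combines the three branches into one score. A sketch, assuming the formulation tf-idf = tf * ln(D / (1 + df)), where tf is the count of a token within one document, df is the number of documents containing the token, and D is the total number of documents. The deck does not state the expression, so the formula and all names here are assumptions:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TfIdfSketch {
    // Assumed formulation: tf * ln(D / (1 + df)).
    static double tfidf(long tf, long df, long nDocs) {
        return tf * Math.log((double) nDocs / (1.0 + df));
    }

    // Returns doc_id -> (token -> tf-idf) for documents that are already tokenized.
    static Map<String, Map<String, Double>> score(Map<String, List<String>> docs) {
        long nDocs = docs.size();                          // D branch
        Map<String, Long> df = new HashMap<>();            // DF branch
        for (List<String> tokens : docs.values()) {
            for (String token : new HashSet<>(tokens)) {   // unique tokens per document
                df.merge(token, 1L, Long::sum);
            }
        }
        Map<String, Map<String, Double>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            Map<String, Long> tf = new HashMap<>();        // TF branch
            for (String token : doc.getValue()) {
                tf.merge(token, 1L, Long::sum);
            }
            Map<String, Double> scores = new HashMap<>();
            for (Map.Entry<String, Long> e : tf.entrySet()) {
                scores.put(e.getKey(), tfidf(e.getValue(), df.get(e.getKey()), nDocs));
            }
            result.put(doc.getKey(), scores);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("doc1", Arrays.asList("rain", "rain", "shadow"));
        docs.put("doc2", Arrays.asList("rain"));
        docs.put("doc3", Arrays.asList("dry"));
        System.out.println(score(docs));
    }
}
```

Under this smoothing, a token that appears in every document scores slightly below zero, so common words sink to the bottom of the ranking.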
6: tf-idf + tdd


 [diagram: Document Collection → Assert → Tokenize → Scrub token → HashJoin (Left, with the Stop Word List as RHS) → Regex token, then four branches; Failure Traps are attached to the input side.
   D:  Unique doc_id → Insert 1 → SumBy doc_id
   DF: Unique token → GroupBy token → Count
   TF: GroupBy doc_id, token → Count
   Word Count: GroupBy token → Count
  DF is joined with D (RHS) by HashJoin, followed by a Checkpoint; the result is joined with TF by CoGroup, and ExprFunc tf-idf produces the TF-IDF output.]

  12 mappers
   9 reducers
  76+14 lines code
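
The Assert and Failure Traps nodes guard the input: records that fail a check are diverted instead of aborting the whole run. A plain-Java sketch of that idea, assuming each input line should be a document id, a tab, and then the text. The pattern and the names are hypothetical, and this is not Cascading API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class TrapSketch {
    // Assumed record shape: "doc" plus digits, a tab, then the text.
    private static final Pattern VALID = Pattern.compile("doc\\d+\\t.*");

    final List<String> accepted = new ArrayList<>();
    final List<String> trapped = new ArrayList<>();

    // Route each line to the main stream or to the trap.
    void process(List<String> lines) {
        for (String line : lines) {
            if (VALID.matcher(line).matches()) {
                accepted.add(line);
            } else {
                trapped.add(line);
            }
        }
    }

    public static void main(String[] args) {
        TrapSketch sketch = new TrapSketch();
        sketch.process(Arrays.asList("doc01\tA rain shadow is a dry area", "zoink"));
        System.out.println("accepted=" + sketch.accepted.size() + " trapped=" + sketch.trapped);
    }
}
```

Keeping the rejected records in a separate output makes them available for inspection and for unit tests on the edge cases.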
City of Palo Alto open data
 [diagram: CoPA workflow. The CoPA GIS export passes through a Regex parser into a tsv Checkpoint, with Failure Traps attached. Regex filters then split the records into tree, road, and park branches.
   tree: Regex parser → Scrub species → HashJoin (Left, with Tree Metadata as RHS) → Geohash
   road: Regex parser → HashJoin (Left, with Road Metadata as RHS) → Estimate Albedo → Road Segments → Geohash
   park: Regex filter
  The tree and road branches meet in a CoGroup, followed by Tree Distance → Filter tree_dist → GroupBy tree_name → shade Checkpoint. GPS logs are geohashed and joined with shade by CoGroup to produce reco.]

github.com/Cascading/CoPA/wiki
  ‣ GIS export for parks, roads, trees (unstructured / open data)
  ‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
  ‣ curated metadata, used to enrich the dataset
  ‣ could extend via mash-up with many available public data APIs

Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
“Find a shady spot on a summer day to walk near downtown and take a call…”
CoPA: log events
CoPA: results
 [figure: density plot titled "Estimated Tree Height (meters)"; x-axis avg_height (0 to 50), y-axis density (0.00 to 0.12), color legend count (0 to 300)]

 ‣   addr: 115 HAWTHORNE AVE
 ‣   lat/lng: 37.446, -122.168
 ‣   geohash: 9q9jh0
 ‣   tree: 413 site 2
 ‣   species: Liquidambar styraciflua
 ‣   avg height: 23 m
 ‣   road albedo: 0.12
 ‣   distance: 10 m
 ‣   a short walk from my train stop ✔
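
The geohash in the result above is a compact key for a latitude/longitude cell, which is what makes the joins on location cheap. A sketch of the standard geohash encoding, in which bits alternate between longitude and latitude and every five bits become one base-32 character. The class name is hypothetical and this is not the CoPA app's own code:

```java
final class GeohashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Encode a point by repeatedly halving the longitude and latitude ranges.
    static String encode(double lat, double lng, int precision) {
        double latLo = -90.0, latHi = 90.0;
        double lngLo = -180.0, lngHi = 180.0;
        StringBuilder hash = new StringBuilder();
        boolean useLng = true; // bits alternate, starting with longitude
        int bits = 0;
        int value = 0;
        while (hash.length() < precision) {
            if (useLng) {
                double mid = (lngLo + lngHi) / 2;
                if (lng >= mid) { value = (value << 1) | 1; lngLo = mid; }
                else { value = value << 1; lngHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2;
                if (lat >= mid) { value = (value << 1) | 1; latLo = mid; }
                else { value = value << 1; latHi = mid; }
            }
            useLng = !useLng;
            if (++bits == 5) { // five bits make one base-32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(37.446, -122.168, 4));
    }
}
```

Encoding the rounded coordinates listed above at four characters gives the prefix 9q9j. The finer characters depend on the unrounded coordinates, so a longer hash computed from the rounded values may differ from the one on the slide.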
drill-down


  blog, code/wiki/gists, jars, list, DevOps products:
  cascading.org/
  github.com/Cascading/
  conjars.org/
  goo.gl/KQtUL
  concurrentinc.com/

Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxAna-Maria Mihalceanu
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfROWELL MARQUINA
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 

Último (20)

Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdf
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 

Intro to Data Science with Cascading

  • 1. Intro to Data Science with Cascading. Paco Nathan, Concurrent, Inc. pnathan@concurrentinc.com, @pacoid. Copyright @2012, Concurrent, Inc. [Diagram: word-count workflow: Document Collection, Tokenize, Scrub token, HashJoin Left (RHS: Stop Word List), Regex token, GroupBy token, Count, Word Count; map (M) and reduce (R) markers]
  • 2. opportunity: Unstructured Data meets Enterprise Scale. 1. backstory: how we got here; 2. build: data science teams; 3. overview: typical use cases; 4. example: Cascading apps
  • 3. Intro to Data Science: 1. backstory: how we got here [workflow diagram repeated from slide 1]
  • 4. inflection point: huge Internet successes after the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG. Consider this metric: annual revenue per customer / amount of data stored, which dropped 100x within a few years after 1997. Storage and processing costs plummeted, so now we must work much smarter to extract ROI from Big Data… our methods must adapt. The “conventional wisdom” of RDBMS and BI tools became less viable; the business cadre, still focused on pivot tables and pie charts, tends toward inertia! MapReduce and the Hadoop open source stack grew directly out of that contention… however, that effort only solves parts of the puzzle. [Timeline markers: 1997, 1998, 2004]
  • 5. inflection point: consequences. Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm), Hadoop Summit, 2012: “All of Fortune 500 is now on notice over the next 10-year period.” Amazon and Google as exemplars of massive disruption in retail, advertising, etc.; data as the major force displacing Global 1000 over the next decade, mostly through apps: verticals, leveraging domain expertise. Michael Stonebraker (INGRES, PostgreSQL, Vertica, VoltDB, etc.), XLDB, 2012: “Complex analytics workloads are now displacing SQL as the basis for Enterprise apps.”
  • 6. primary sources. Amazon: “Early Amazon: Splitting the website”, Greg Linden, glinden.blogspot.com/2006/02/early-amazon-splitting-website.html. eBay: “The eBay Architecture”, Randy Shoup, Dan Pritchett, addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html and addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf. Inktomi (YHOO Search): “Inktomi’s Wild Ride”, Eric Brewer (0:05:31 ff), youtube.com/watch?v=E91oEn1bnXM. Google: “The Birth of Google”, John Battelle, wired.com/wired/archive/13.08/battelle.html; “Underneath the Covers at Google”, Jeff Dean (0:06:54 ff), youtube.com/watch?v=qsan-GQaeyk and perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
  • 7. the world before… BI, SQL, and highly optimized code
  • 8. data innovation: circa 1996 [Diagram labels: Stakeholder, Customers, Product, BI Analysts, Engineering, Web App, RDBMS; flows: Excel pivot tables, PowerPoint slide decks, strategy, requirements, SQL Query, result sets, optimized code, transactions]
  • 9. the world after… machine learning, leveraging log files
  • 10. data innovation: circa 2001 [Diagram labels: Stakeholder, Product, Customers, UX, Engineering, Algorithmic Modeling, Web Apps, Middleware, Logs, DW, ETL, RDBMS; flows: dashboards, models, servlets, recommenders + classifiers, aggregation, event history, SQL Query, result sets, customer transactions]
  • 11. the world ahead… what our customers are doing now
  • 12. data innovation: circa 2013 [Diagram labels: Customers, Data Apps, Web Apps, Mobile, etc., Prod, Eng, Ops, DW; roles: Domain Expert, Data Scientist, App Dev, Ops; components: Workflow, Planner, Data Access Patterns, Cluster Scheduler, Hadoop, etc., In-Memory Data Grid, Log Events, History, RDBMS; flows: business process, dashboard metrics, services, data science, s/w dev, social interactions, transactions, content, discovery + modeling, optimized capacity, endpoints, batch, "real time"; legend: introduced capability, existing SDLC]
  • 14. statistical thinking [Diagram labels: Process, Variation, Data, Tools]: employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables. This approach attempts to understand not just problems and solutions, but also the processes involved and their variances. Particularly valuable in Big Data work when combined with hands-on experience in physics (roughly 50% of my peers come from physics or physical engineering)… programmers typically don’t think this way… however, both systems engineers and data scientists must!
  • 15. most valuable skills: approximately 80% of the costs for data-related projects get spent on data preparation, mostly on cleaning up data quality issues: ETL, log file analysis, etc. Unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean-up. Most valuable skills: ‣ learn to use programmable tools that prepare data ‣ learn to generate compelling data visualizations ‣ learn to estimate the confidence for reported results ‣ learn to automate work, making analysis repeatable. The rest of the skills (modeling, algorithms, etc.) are secondary. [Image credit: D3]
  • 16. social caveats: “This data cannot be correct!” may be an early warning about the organization itself. Much depends on how the people whom you work alongside tend to arrive at decisions: ‣ probably good: Induction, Abduction, Circumscription ‣ probably poor: Deduction, Speculation, Justification. In general, one good data visualization puts many ongoing verbal arguments to rest; however, let domain experts handle “data storytelling”, not data scientists. [Image credit: xkcd]
  • 17. references: Leo Breiman, “Statistical Modeling: The Two Cultures”, Statistical Science, 2001, bit.ly/eUTh9L
  • 18. references: Jack Olson, Data Quality, Morgan Kaufmann, 2003, amazon.com/dp/1558608915
  • 19. references: also check out RStudio: rstudio.org/ and rpubs.com/
  • 20. Intro to Data Science: 2. build: data science teams [workflow diagram repeated from slide 1]
  • 21. core values: Data Science teams develop actionable insights, building confidence for decisions. That work may influence a few decisions worth billions (e.g., M&A) or billions of small decisions (e.g., AdWords); probably somewhere in-between… solving for pattern, at scale. An interdisciplinary pursuit which requires teams, not sole players.
  • 22. team process: discovery: help people ask the right questions; modeling: allow automation to place informed bets; integration: deliver products at scale to customers; apps: build smarts into product features; systems: keep infrastructure running, cost-effective. [Image credit: Gephi]
  • 23. building teams [Chart: roles (stakeholder, scientist, developer, ops) against needs (discovery, modeling, integration, apps, systems); cell values not recoverable]
  • 24. references: DJ Patil, Data Jujitsu, O’Reilly, 2012, amazon.com/dp/B008HMN5BE; Building Data Science Teams, O’Reilly, 2011, amazon.com/dp/B005O4U3ZE
  • 25. Intro to Data Science: 3. overview: typical use cases [workflow diagram repeated from slide 1]
  • 26. using science in data science: in a nutshell, what we do… ‣ estimate probability ‣ calculate analytic variance ‣ manipulate order complexity ‣ make use of learning theory ‣ collab with DevOps, Stakeholders [Background image: event-graph visualization with mirrored labels; image credit: D3]
  • 27. probability estimation “a random variable or stochastic variable is a variable whose value is subject to variations” “an estimator is a rule for calculating an estimate of a given quantity based on observed data” estimators and probability distributions provide the essential basis for our insights bayesian methods, shrinkage… these are our friends quantile estimation, empirical CDFs… …versus frequentist notions
  • 28. analytic variance our tools for automation leverage deep understanding of covariance cannot overstate the importance of sampling… insist on metrics described as confidence intervals, where valid bootstrapping, bagging… these are our friends Monte Carlo methods resolve “black box” problems point estimates may help prevent “uninformed” decisions do not skimp on this part, ever… a hard lesson learned from BI failures
  • 29. order complexity techniques for manipulating order complexity: dimensional reduction… with clustering as a common case e.g., you may have 100 million HTML docs, but there are only ~10K useful keywords low-dimensional structures, PCA linear algebra tricks: eigenvalues, matrix decomposition, etc. many hard problems resolved by “divide and conquer” this is an area ripe for much advancement in algorithms research near-term
  • 30. learning theory in general, apps alternate between learning patterns/rules and retrieving similar things… statistical learning theory – rigorous, prevents you from making billion dollar mistakes, probably our future machine learning – scalable, enables you to make billion dollar mistakes, much commercial emphasis supervised vs. unsupervised arguably, optimization is a related area once Big Data projects get beyond merely digesting log files, optimization will likely become yet another buzzword :)
  • 31. use case: marketing funnel • must optimize a very large ad spend • different vendors report different metrics • seasonal variation distorts performance • some campaigns are much smaller than others • hard to predict ROI for incremental spend. approach: • log aggregation, followed with cohort analysis • bayesian point estimates compare different-sized ad tests • customer lifetime value quantifies ROI of new leads • time series analysis normalizes for seasonal variation • geolocation adjusts for regional cost/benefit • linear programming models estimate elasticity of demand [Image credit: Wikipedia]
  • 32. use case: ecommerce fraud • sparse data means lots of missing values • “needle in a haystack” lack of training cases • answers are available in large-scale batch, results are needed in real-time event processing • not just one pattern to detect: many, ever-changing. approach: • random forest (RF) classifiers predict likely fraud • subsampled data to re-balance training sets • impute missing values based on density functions • train on massive log files, run on in-memory grid • adjust metrics to minimize customer support costs • detect novelty and report anomalies via notifications [Image credit: stat.berkeley.edu]
  • 33. use case: customer segmentation • many millions of customers, hard to determine which features resonate • multi-modal distributions get obscured by the practice of calculating an “average” • not much is known about individual customers. approach: • connected components for sessionization, determining uniques from logs • estimates for age, gender, income, geo, etc. • clustering algorithms to group into market segments • social graph infers “unknown” relationships • covariance/heat maps visualize segments vs. feature sets [Image credit: Mathworks]
  • 34. use case: monetizing content • need to suggest relevant content which would otherwise get buried in the back catalog • big disconnect between inventory and limited performance ad market • enormous amounts of text, hard to categorize. approach: • text analytics glean key phrases from documents • hierarchical clustering of char frequencies detects lang • latent dirichlet allocation (LDA) reduces dimension to topic models • recommenders suggest similar topics to customers • collaborative filters connect known users with less known [Image credit: Digital Humanities]
  • 35. data+code “political spectrum”: “Notes from the Mystery Machine Bus” by Steve Yegge, Google, goo.gl/SeRZa. “conservative” vs. “liberal”: (mostly) Enterprise vs. (mostly) Start-Up; risk management vs. customer experiments; assurance vs. flexibility; well-defined schema vs. schema follows code; explicit configuration vs. convention; type-checking compiler vs. interpreted scripts; wants no surprises vs. wants no impediments; Java, Scala, Clojure, etc. vs. PHP, Ruby, Python, etc.; Cascading, Scalding, Cascalog, etc. vs. Hive, Pig, Hadoop Streaming, etc.
  • 36. a selection of great tools… reporting: Graphite, PowerPivot, Pentaho, Jaspersoft, SAS; visualization: ggplot2, D3, Gephi; analytics/modeling: R, Weka, Matlab, PMML, GLPK; text: LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK; apps: Cascading, Scalding, Cascalog, R markdown, SWF; scale-out: Scalr, RightScale, CycleComputing, vFabric, Beanstalk; graph: Gremlin, GraphLab, Neo4J; column: Vertica, HBase, Drill, Dynamo; key/val: Redis, Membase, MySQL; index: Lucene/Solr, ElasticSearch; relational: usual suspects; imdg: Spark, Storm, Gigaspaces; hadoop: EMR, HW, MapR, EMC, Azure, Compute; machine data: Splunk, collectd, Nagios; durable storage: S3, ASV, GCS, Riak, Couch
  • 37. Intro to Data Science: 4. example: Cascading apps [workflow diagram repeated from slide 1]
  • 38. the workflow abstraction: cascading.org/category/impatient/ [workflow diagram repeated from slide 1]
  • 39. layers of a workflow: business process: domain expertise, business trade-offs, market position, operating parameters, etc.; API language: Scala, Clojure, Python, Ruby, Java, etc. …envision whatever runs in a JVM; optimize / schedule: major changes in technology now; physical plan: [workflow diagram repeated from slide 1]; compute substrate (“assembler” code): Apache Hadoop, in-memory local mode …envision GPUs, streaming, etc.; machine data: Splunk, Nagios, Collectd, New Relic, etc.
  • 40. audience? • Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL) • Systems Integrator POV: system integration of heterogeneous data sources and compute platforms • Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl's Law • Data Architect POV: a physical plan for large-scale data flow management • Software Architect POV: a pattern language, similar to plumbing or circuit design • App Developer POV: API bindings for Scala, Clojure, Python, Ruby, Java • Systems Engineer POV: a JAR file, has passed CI, available in a Maven repo [workflow diagram repeated from slide 1]
  • 41. 1: copy [Diagram: Source → M → Sink] 1 mapper, 0 reducers, 10 lines code

        import java.util.Properties;

        import cascading.flow.FlowDef;
        import cascading.flow.hadoop.HadoopFlowConnector;
        import cascading.pipe.Pipe;
        import cascading.property.AppProps;
        import cascading.scheme.hadoop.TextDelimited;
        import cascading.tap.Tap;
        import cascading.tap.hadoop.Hfs;

        public class Main
          {
          public static void main( String[] args )
            {
            // input and output paths come from the command line
            String inPath = args[ 0 ];
            String outPath = args[ 1 ];

            Properties props = new Properties();
            AppProps.setApplicationJarClass( props, Main.class );
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

            // create the source tap (tab-delimited text with a header row)
            Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

            // create the sink tap
            Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

            // specify a pipe to connect the taps
            Pipe copyPipe = new Pipe( "copy" );

            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
             .addSource( copyPipe, inTap )
             .addTailSink( copyPipe, outTap );

            // run the flow
            flowConnector.connect( flowDef ).complete();
            }
          }
  • 42. wait! ten lines of code for a file copy… seems like a lot.
  • 43. same JAR, any scale… MegaCorp Enterprise IT: PBs of data, 1000+ node private cluster, EVP calls you when app fails, runtime: days+. Production Cluster: TBs of data, EMR w/ 50 HPC Instances, Ops monitors results, runtime: hours to days. Staging Cluster: GBs of data, EMR + 4 Spot Instances, CI shows red or green lights, runtime: minutes to hours. Your Laptop: MBs of data, Hadoop standalone mode, passes unit tests, or not, runtime: seconds to minutes.
  • 44. 2: word count [Diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count] 1 mapper, 1 reducer, 18 lines code
  • 45. 3: wc + scrub [Diagram: Document Collection → Tokenize → Scrub token → GroupBy token → Count → Word Count] 1 mapper, 1 reducer, 22+10 lines code
  • 46. 4: wc + scrub + stop words [Diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count] 1 mapper, 1 reducer, 28+10 lines code
  • 47. 5: tf-idf [Diagram: TF-IDF workflow built on example 4. Branches: D (Unique doc_id, Insert 1, SumBy doc_id), DF (Unique token, GroupBy token, Count), TF (GroupBy doc_id, token; Count); combined through HashJoin and CoGroup steps, with an ExprFunc computing tf-idf for the TF-IDF output; the Word Count branch (GroupBy token, Count) remains] 11 mappers, 9 reducers, 65+10 lines code
  • 48. 6: tf-idf + tdd [Diagram: the tf-idf workflow from example 5 with Assert, Checkpoint, and Failure Traps added] 12 mappers, 9 reducers, 76+14 lines code
  • 49. City of Palo Alto open data [Diagram labels: CoPA GIS export, Regex parser, Regex filter, tsv parser, tsv filter, Scrub species, Tree Metadata, Road Metadata, Road Segments, HashJoin Left / RHS, CoGroup, GroupBy tree_name, Geohash, Estimate Albedo, Tree Distance (tree_dist), Filter, Checkpoint, Failure Traps, GPS logs; outputs: tree, road, park, shade, reco] github.com/Cascading/CoPA/wiki ‣ GIS export for parks, roads, trees (unstructured / open data) ‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks ‣ curated metadata, used to enrich the dataset ‣ could extend via mash-up with many available public data APIs. Enterprise-scale app: road albedo + tree species metadata + geospatial indexing. “Find a shady spot on a summer day to walk near downtown and take a call…”
  • 51. CoPA: results [Plot: Estimated Tree Height (meters); x-axis avg_height (0 to 50), y-axis density (0.00 to 0.12), shading by count (0 to 300)] ‣ addr: 115 HAWTHORNE AVE ‣ lat/lng: 37.446, -122.168 ‣ geohash: 9q9jh0 ‣ tree: 413 site 2 ‣ species: Liquidambar styraciflua ‣ avg height 23 m ‣ road albedo: 0.12 ‣ distance: 10 m ‣ a short walk from my train stop ✔
  • 52. drill-down: blog: cascading.org/; code/wiki/gists: github.com/Cascading/; jars: conjars.org/; list: goo.gl/KQtUL; DevOps products: concurrentinc.com/
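
Slides 27 and 31 mention Bayesian shrinkage and point estimates for comparing different-sized ad tests. One way to picture that is the posterior mean of a conversion rate under a Beta prior. The following is a minimal sketch in plain Java, not code from the deck: the class and method names, the Beta(2, 98) prior, and the sample counts are all illustrative assumptions.

```java
final class ShrinkageSketch {
    // Posterior mean of a conversion rate under a Beta(alpha, beta) prior,
    // after observing `conversions` successes in `trials` attempts.
    static double posteriorMean(long conversions, long trials, double alpha, double beta) {
        if (trials < 0 || conversions < 0 || conversions > trials || alpha <= 0 || beta <= 0) {
            throw new IllegalArgumentException("invalid counts or prior");
        }
        return (alpha + conversions) / (alpha + beta + trials);
    }

    public static void main(String[] args) {
        // Hypothetical prior: about a 2% rate, weighted like 100 pseudo-observations.
        double alpha = 2.0;
        double beta = 98.0;
        // A small test with a lucky streak versus a much larger test.
        double small = posteriorMean(3, 20, alpha, beta);       // raw rate 15%
        double large = posteriorMean(300, 10_000, alpha, beta); // raw rate 3%
        System.out.printf("small test: %.4f, large test: %.4f%n", small, large);
    }
}
```

The small test's raw 15% is pulled most of the way back toward the prior, while the large test's estimate stays close to its raw 3%, which is the sense in which shrinkage makes tests of different sizes comparable.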
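
Slide 28 asks for metrics described as confidence intervals and names bootstrapping as one of the tools. A minimal sketch of a percentile bootstrap for the mean, in plain Java: the class name, method signature, fixed seed, and sample values are assumptions made for illustration, and the percentile method is only one of several bootstrap interval constructions.

```java
import java.util.Arrays;
import java.util.Random;

final class BootstrapSketch {
    // Percentile-bootstrap confidence interval for the mean of `sample`.
    static double[] meanConfidenceInterval(double[] sample, int resamples, double level, long seed) {
        if (sample.length == 0 || resamples < 1 || level <= 0.0 || level >= 1.0) {
            throw new IllegalArgumentException("invalid arguments");
        }
        Random rng = new Random(seed);
        double[] means = new double[resamples];
        for (int r = 0; r < resamples; r++) {
            // Resample with replacement, keeping the original sample size.
            double sum = 0.0;
            for (int i = 0; i < sample.length; i++) {
                sum += sample[rng.nextInt(sample.length)];
            }
            means[r] = sum / sample.length;
        }
        Arrays.sort(means);
        // Cut equal tails off the sorted resampled means.
        double tail = (1.0 - level) / 2.0;
        int lo = (int) Math.floor(tail * (resamples - 1));
        int hi = (int) Math.ceil((1.0 - tail) * (resamples - 1));
        return new double[] { means[lo], means[hi] };
    }

    public static void main(String[] args) {
        double[] sample = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        double[] ci = meanConfidenceInterval(sample, 2000, 0.95, 42L);
        System.out.printf("95%% CI for the mean: [%.2f, %.2f]%n", ci[0], ci[1]);
    }
}
```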
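
Slide 40 describes a workflow as a DAG on which Amdahl's Law can be applied. The law bounds the speedup of a job when only a fraction p of its work can run in parallel on n workers: speedup = 1 / ((1 - p) + p / n). A small sketch, with the class name and the 90% parallel fraction chosen purely for illustration:

```java
final class AmdahlSketch {
    // Speedup when a fraction `parallelFraction` of the work runs on `workers` workers
    // and the remainder stays serial.
    static double speedup(double parallelFraction, int workers) {
        if (parallelFraction < 0.0 || parallelFraction > 1.0 || workers < 1) {
            throw new IllegalArgumentException("invalid arguments");
        }
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    public static void main(String[] args) {
        // Hypothetical flow in which 90% of the work parallelizes.
        for (int workers : new int[] { 1, 10, 100, 1000 }) {
            System.out.printf("%4d workers: %.2fx%n", workers, speedup(0.9, workers));
        }
    }
}
```

However many workers are added, the speedup in this example stays below 1 / (1 - 0.9) = 10, so the serial stages of a flow are where optimization effort pays off.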
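
Slides 44 to 46 build up a word count with token scrubbing and a stop word list. The same logical steps can be sketched in plain Java on in-memory data. This is not the Cascading API and not the deck's code: the class name, the delimiter regex, and the sample documents and stop words are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class WordCountSketch {
    // Tokenize -> scrub -> drop stop words -> group by token -> count.
    static Map<String, Integer> countWords(List<String> docs, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String doc : docs) {
            // Tokenize: split on whitespace and common punctuation.
            for (String raw : doc.split("[\\s\\[\\](),.]+")) {
                // Scrub: trim and lower-case each token.
                String token = raw.trim().toLowerCase(Locale.ROOT);
                // Filter: skip empty tokens and stop words.
                if (token.isEmpty() || stopWords.contains(token)) {
                    continue;
                }
                // GroupBy + Count: accumulate one count per token.
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "A rain shadow is a dry area.",
            "The rain falls on the windward side.");
        Set<String> stopWords = Set.of("a", "is", "the", "on");
        System.out.println(countWords(docs, stopWords));
    }
}
```

In the Cascading version the stop word list is joined in as the right-hand side of a HashJoin; the set lookup here plays the same role for data that fits in memory.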
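
Slide 47 extends the word count to TF-IDF, combining a per-document term count (TF), a per-token document count (DF), and the total number of documents (D). A minimal in-memory sketch in plain Java follows. The weighting used here, tf × ln(D / (1 + df)), is one common variant and is an assumption, as are the class name and sample data; the deck's diagram does not spell out the exact expression.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TfIdfSketch {
    // Returns one map per document: token -> tf * ln(D / (1 + df)).
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        int numDocs = docs.size();                      // D: total documents
        Map<String, Integer> df = new HashMap<>();      // DF: documents containing a token
        List<Map<String, Integer>> tfs = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>();  // TF: count per (document, token)
            for (String token : doc) {
                tf.merge(token, 1, Integer::sum);
            }
            for (String token : tf.keySet()) {          // each document counts once toward DF
                df.merge(token, 1, Integer::sum);
            }
            tfs.add(tf);
        }
        List<Map<String, Double>> result = new ArrayList<>();
        for (Map<String, Integer> tf : tfs) {
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) numDocs / (1.0 + df.get(e.getKey())));
                weights.put(e.getKey(), e.getValue() * idf);
            }
            result.add(weights);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("rain", "rain", "shadow"),
            Arrays.asList("rain", "dry"),
            Arrays.asList("dry", "area"),
            Arrays.asList("lee", "side"));
        System.out.println(tfIdf(docs));
    }
}
```

With this smoothing, a token that appears in nearly every document gets a weight near or below zero, which is why the stop word filtering in the earlier examples and the TF-IDF weighting complement each other.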
