Data Scientist Training for Librarians
   Harvard College Observatory
          March 28, 2013

          Tom Morris
           @tfmorris
Who am I?
•
    Independent software engineering & product
    management consultant
•
    Developer on open source OpenRefine project
•
    Curious data geek
•
    Contact:
     –   Twitter: @tfmorris
     –   Email: tfmorris@gmail.com

2013-03-28           Tom Morris @tfmorris         2
Data Analysis Lifecycle
 ●
     Find / Extract
 ●
     Prepare
     –   Characterize
     –   Clean
     –   Integrate / Extend
 ●
     Analyze
 ●
     Visualize / Report

Provenance
 ●
     Provenance is key! - both before and after
     you get data
 ●
     Record source (e.g. download URL) and date
 ●
     Unix command line
     –   build up a repeatable transformation pipeline script
     –   (use make to keep from having to repeat steps)

 ●
     OpenRefine maintains an undo history (but...)
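
As a minimal sketch of "record source and date", the Python below writes a small JSON sidecar file next to a downloaded data file. The function name `record_provenance`, the `.provenance.json` suffix, and the recorded fields are assumptions made for illustration; they are not from the talk or from any particular tool.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def record_provenance(data_path, source_url):
    """Write a JSON sidecar noting where and when a file was obtained."""
    data_path = Path(data_path)
    record = {
        "file": data_path.name,
        "source_url": source_url,
        # UTC timestamp of the moment this record was written
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        # Checksum lets you confirm later that the source file is unchanged
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
    }
    # Hypothetical naming convention: data.csv -> data.csv.provenance.json
    sidecar = data_path.with_name(data_path.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record
```

Calling `record_provenance("data.csv", url)` right after each download keeps the source and date with the file, so a later pipeline step can check the checksum before transforming anything.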


Irreversible Transforms
 ●
     Be careful of anything which isn't reversible
 ●
     Keep source files and plan recovery strategy
 ●
     Common gotchas:
     –   Character encoding – can't replace
         substitution character with its original
         value
     –   Leading 0s on identifiers

Provenance projects
 ●
     Stanford Panda (Provenance and Data) -
     http://infolab.stanford.edu/panda/
 ●
     Open Provenance Model -
     http://openprovenance.org/
 ●
     Both focus on bi-directional traceability




Tools vs Scale
 ●
     Editor with macro facility: emacs, vim
 ●
     Spreadsheet: Excel, OO Calc
 ●
     OpenRefine
 ●
     Unix shell commands – awk, sed, grep, cut,
     sort, head, tail
 ●
     “Real” programming – Python, Ruby, Java
 ●
     Map-Reduce

Regular Expressions
 ●
     Useful in so many contexts
 ●
     A little confusing to learn, but
 ●
     Absolutely worth the effort!
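
As a small taste of what a regular expression can do, the sketch below pulls four-digit years out of messy date strings using Python's `re` module. The pattern, the 1500-2099 range, the function name `find_years`, and the sample strings are all assumptions chosen for illustration.

```python
import re

# A four-digit year from 1500 to 2099, not embedded in a longer word or number.
YEAR = re.compile(r"\b(?:1[5-9]\d{2}|20\d{2})\b")


def find_years(text):
    """Return every year-like token found in the text, in order."""
    return YEAR.findall(text)


for raw in ["1923.", "[1887?]", "Boston, 1901-1902", "n.d."]:
    print(raw, "->", find_years(raw))
```

A similar pattern can usually be carried over to grep, sed, awk, or OpenRefine with small syntax adjustments, which is a large part of why the effort pays off.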




OpenRefine
 ●
     Power tool for working with messy data
 ●
     Free and open source
 ●
     Desktop based (data stays private)
 ●
     Faceted browsing interface
 ●
     Lots of input & output formats
 ●
     Powerful transformations
 ●
     Useful for analysis & web scraping/APIs too
OpenRefine Data Formats
 ●
     CSV/TSV/separator based
 ●
     Fixed width field
 ●
     JSON & XML
 ●
     Excel & OpenOffice Calc
 ●
     Google Spreadsheets & Fusion Tables
 ●
     RDF
 ●
     URLs & zip files too!
Data Characterization
 ●
     Coded vs free-form fields
 ●
     Distribution of values
     –   Missing values – skip, impute, ...
     –   Outliers – cause? Can they be rescaled?
 ●
     Delimiters & escaping (e.g. HTML, XML)
 ●
     Formatting problems
 ●
     Character encoding issues?
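
A first pass at characterizing one column can be sketched in plain Python: count the missing values, count the distinct values, and list the most common ones. The function name `profile_column` and the sample values are hypothetical; a text facet in OpenRefine gives a comparable view interactively.

```python
from collections import Counter


def profile_column(values, top=5):
    """Summarize one column: missing count, distinct count, most common values."""
    # Treat None and blank/whitespace-only strings as missing
    missing = sum(1 for v in values if v is None or not v.strip())
    # Strip surrounding whitespace so "eng" and "eng " count as one value
    counts = Counter(v.strip() for v in values if v is not None and v.strip())
    return {
        "total": len(values),
        "missing": missing,
        "distinct": len(counts),
        "most_common": counts.most_common(top),
    }


print(profile_column(["eng", "eng ", "fre", "", None, "ENG"]))
```

A handful of distinct values repeated many times suggests a coded field, while nearly as many distinct values as rows suggests free-form text. Near-duplicates such as "eng" and "ENG" point to formatting problems worth cleaning.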

Hands-on
 ●
     Let's play with some data!
 ●
     http://code.google.com/p/google-refine/




Export
 ●
     OpenRefine exports most import formats:
     Excel, CSV, TSV, OpenOffice, Google
     Spreadsheets, Fusion tables, JSON, RDF
 ●
     Template-based exporter for everything else:
     custom JSON formats, etc.




Scaling Up
 ●
     Experiment with a (representative) sample of
     your data
 ●
     Reuse regexes, filters, etc. with heavier-duty
     tools – awk, sed, Map-Reduce
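
One way to draw a sample from a file too large to load is reservoir sampling, sketched below in Python. The function name `reservoir_sample` is an assumption for illustration, and a uniform random sample is only one notion of "representative": if rare categories matter, a stratified sample may serve better.

```python
import random


def reservoir_sample(lines, k, seed=None):
    """Return up to k items chosen uniformly at random from any iterable."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items
            sample.append(line)
        else:
            # Keep the new item with probability k / (i + 1)
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = line
    return sample
```

Because it reads the input once and holds only `k` items in memory, it can be used as `reservoir_sample(open("big.csv"), 1000)` on a file of any size (the file name here is hypothetical).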




Resources
 ●
     Berkeley Data Science course
     http://datascienc.es/schedule/
     –   week 2 - Data Preparation has good R examples
         http://berkeleydatascience.files.wordpress.com/2012/02/2012
 ●
     Mike Loukides "Data Hand Tools"
     http://radar.oreilly.com/2011/04/data-hand-tools
 ●
     Jeremy Howard "Getting in shape for the sport
     of Data Science"
     http://media.kaggle.com/MelbURN.html
More resources
●
    MIT IAP Data Science course materials
    –   http://dataiap.github.com/dataiap/
●
    Quora
    –   http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public
    –   http://www.quora.com/What-are-some-good-methods-for-data-pre-processing-in-machine-learning

●
    OKFN School of Data handbook
    –   http://handbook.schoolofdata.org
●
    Hilary Mason
    –   http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
Resources mentioned
 ●
     Harvard Business Review competition at
     Kaggle
     –   Competition ends 8/27/2012 4:00 AM UTC!
     –   https://www.kaggle.com/c/harvard-business-review-vision-statement-prospect


 ●
     Stanford Data Wrangler
     –   http://vis.stanford.edu/wrangler/




Thanks!

•
     Questions now?
•
     Questions later:
      –   Twitter: @tfmorris
      –   Email: tfmorris@gmail.com




