Acquiring, Cleaning, and Storing Data

Digital Humanities, Lesson Two
Studia nových médií (New Media Studies), 15 October 2012

  1. Acquiring, Cleaning, and Storing Data. Digital Humanities, Lesson Two. Josef Šlerka, Studia nových médií (New Media Studies), 15 October 2012
  2. ETL (light version): Extracting data from outside sources; transforming it to fit operational needs (which can include quality levels); loading it into the end target (a database, more specifically an operational data store, data mart, or data warehouse). (See Wikipedia.)
  3. Real life according to Wikipedia: 1. Cycle initiation; 2. Build reference data; 3. Extract (from sources); 4. Validate; 5. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates); 6. Stage (load into staging tables, if used)
  4. Real life according to Wikipedia, continued: 7. Audit reports (for example, on compliance with business rules; in case of failure, they also help to diagnose and repair); 8. Publish (to target tables); 9. Archive; 10. Clean up
  5. Extracting: what will come in handy...
  6. Extract: structured vs. unstructured data (for DH, most often databases vs. the web); web API vs. scraping (you can get by with only a little knowledge); static vs. real-time data (real-time data can be tricky, but it is manageable)
  7. XPath: XPath, the XML Path Language, is a query language for selecting nodes from an XML document. In addition, XPath may be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document. XPath was defined by the World Wide Web Consortium (W3C).
  8. Simple tools: Google Docs (mainly static data), http://drive.google.com; YQL (mainly static data), http://developer.yahoo.com/yql/console/; Yahoo Pipes (mainly dynamic data), http://pipes.yahoo.com/pipes/; IFTTT (mainly dynamic data), https://ifttt.com/
  9. But powerful... Twitter Archiving Google Spreadsheet TAGS v3, http://mashe.hawksey.info/2012/01/twitter-archive-tagsv3/
  10. Transforming: mainly about cleaning and unifying data...
  11. Google Refine, http://code.google.com/p/google-refine/downloads/list?can=1. Google Refine is a standalone desktop application provided by Google for data cleanup and transformation to other formats. It is similar to spreadsheet applications (and can work with spreadsheet file formats), but it behaves more like a database.
  12. Loading: where to put the data, if not in a traditional database...
  13. Google Fusion Tables: a simple solution for ordinary users, http://www.google.com/fusiontables/Home/. A web service provided by Google for data management. Data is stored in multiple tables that Internet users can view and download. The web service provides means for visualizing data with pie charts, bar charts, line plots, scatter plots, timelines, and geographical maps. Data is exported in a comma-separated values file format.
  14. And now one more demo...
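
The extract, transform, and load steps from the slides can be sketched end to end in a few lines of Python. This is only an illustration, not the demo shown in the lecture: the sample XML, its element names, and the function names `extract`, `transform`, `load`, and `run_demo` are all hypothetical. Extraction uses XPath through the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath; cleaning is a simple whitespace and case normalization, far less than what Google Refine offers; loading writes comma-separated values, the export format mentioned for Google Fusion Tables.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input standing in for an XML feed from a web source.
SAMPLE_XML = """<tweets>
  <tweet lang="cs"><user> Alice </user><text>Digital   Humanities</text></tweet>
  <tweet lang="en"><user>BOB</user><text>ETL light</text></tweet>
  <tweet lang="cs"><user>carol</user><text>   </text></tweet>
</tweets>"""

FIELDS = ["lang", "user", "text"]


def extract(xml_text, lang=None):
    """Extract: select <tweet> nodes with an XPath expression."""
    root = ET.fromstring(xml_text)
    # An attribute predicate such as [@lang='cs'] is within ElementTree's XPath subset.
    path = "./tweet" if lang is None else f"./tweet[@lang='{lang}']"
    return [
        {
            "lang": node.get("lang", ""),
            "user": node.findtext("user", default=""),
            "text": node.findtext("text", default=""),
        }
        for node in root.findall(path)
    ]


def transform(rows):
    """Transform: unify case, collapse whitespace, drop rows with no text."""
    cleaned = []
    for row in rows:
        text = " ".join(row["text"].split())
        if not text:
            continue  # a minimal validation rule: skip empty records
        cleaned.append(
            {
                "lang": row["lang"].strip().lower(),
                "user": row["user"].strip().lower(),
                "text": text,
            }
        )
    return cleaned


def load(rows, out):
    """Load: write the cleaned rows as CSV to any file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def run_demo():
    """Run the three steps on the sample data and return the CSV text."""
    buffer = io.StringIO()
    load(transform(extract(SAMPLE_XML)), buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(run_demo())
```

Keeping the three steps as separate functions mirrors the ETL split: the XPath expression, the cleaning rules, or the output target can each be swapped without touching the other two.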