2. ETL (light version)
Extracting data from outside sources
Transforming it to fit operational needs (which can
include quality levels)
Loading it into the end target (database, more
specifically, operational data store, data mart or data
warehouse)
(see Wikipedia)
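The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production pipeline: the sample CSV text, the `people` table, and the function names `extract`, `transform`, and `load` are all assumptions made up for this example.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would come from a file or an API.
RAW_CSV = """name,year
 Alice ,1990
Bob,not-a-year
Carol,1985
"""

def extract(text):
    """Extract: read rows from an outside source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace and drop rows that fail a quality check."""
    clean = []
    for row in rows:
        name = row["name"].strip()
        year = row["year"].strip()
        if name and year.isdigit():
            clean.append((name, int(year)))
    return clean

def load(records, conn):
    """Load: write the cleaned records into the end target (a SQLite table)."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, year INTEGER)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, year FROM people ORDER BY name").fetchall()
print(loaded)
```

The row with a non-numeric year is rejected in the transform step, so only two records reach the target table.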
3. Real life according to Wikipedia
1. Cycle initiation
2. Build reference data
3. Extract (from sources)
4. Validate
5. Transform (clean, apply business rules, check for
data integrity, create aggregates or disaggregates)
6. Stage (load into staging tables, if used)
4. Real life according to Wikipedia (continued)
7. Audit reports (for example, on compliance with
business rules; in case of failure, they also help to
diagnose and repair the problem)
8. Publish (to target tables)
9. Archive
10. Clean up
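Several of the steps in this cycle (validate, transform, stage, audit, publish, clean up) can be shown together in a small sketch. It assumes an in-memory SQLite database and made-up city data; the table names `staging` and `target` and the shape of the `audit` dictionary are hypothetical choices for illustration only.

```python
import sqlite3

# Hypothetical extracted rows: (city, population as text).
extracted = [("Praha", "1300000"), ("Brno", ""), ("Ostrava", "280000")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (city TEXT, population INTEGER)")
conn.execute("CREATE TABLE target (city TEXT, population INTEGER)")

# Validate: split rows into accepted and rejected.
valid = [r for r in extracted if r[1].isdigit()]
rejected = [r for r in extracted if not r[1].isdigit()]

# Transform: convert the population to an integer.
transformed = [(city, int(pop)) for city, pop in valid]

# Stage: load into the staging table first.
conn.executemany("INSERT INTO staging VALUES (?, ?)", transformed)

# Audit: record how many rows passed and how many failed.
audit = {
    "extracted": len(extracted),
    "staged": len(transformed),
    "rejected": len(rejected),
}

# Publish: copy the staged rows into the target table.
conn.execute("INSERT INTO target SELECT city, population FROM staging")

# Clean up: empty the staging table for the next cycle.
conn.execute("DELETE FROM staging")
conn.commit()

published = conn.execute(
    "SELECT city, population FROM target ORDER BY city"
).fetchall()
staging_left = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
print(audit, published, staging_left)
```

Keeping the rejected rows and the counts makes it easier to diagnose a failed run, which is the purpose of the audit step.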
6. Extract
structured vs. unstructured data
for DH, most often databases vs. the web
web API vs. scraping
you can get by with only a little knowledge
static vs. real-time data can be tricky, but it can
be handled
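The difference between a web API and scraping can be illustrated with the Python standard library. The sketch below uses inline sample strings instead of real network requests; the JSON structure, the HTML markup, and the `title` class are assumptions invented for the example.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response: already structured as JSON.
API_RESPONSE = '{"items": [{"title": "First"}, {"title": "Second"}]}'

# Hypothetical web page carrying the same information as HTML.
HTML_PAGE = """
<html><body>
  <h2 class="title">First</h2>
  <p>Some text</p>
  <h2 class="title">Second</h2>
</body></html>
"""

def titles_from_api(text):
    """API: parse the JSON and read the fields directly."""
    return [item["title"] for item in json.loads(text)["items"]]

class TitleScraper(HTMLParser):
    """Scraping: walk the markup and pick out the elements of interest."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data.strip())

def titles_from_html(text):
    scraper = TitleScraper()
    scraper.feed(text)
    return scraper.titles

print(titles_from_api(API_RESPONSE))
print(titles_from_html(HTML_PAGE))
```

Both functions return the same list, but the scraper depends on the page layout and would break if the markup changed, which is one reason an API is usually preferable when one exists.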
7. XPath
XPath, the XML Path Language, is a query language
for selecting nodes from an XML document. In
addition, XPath may be used to compute values (e.g.,
strings, numbers, or Boolean values) from the content
of an XML document. XPath was defined by the World
Wide Web Consortium (W3C).
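As a small illustration of selecting nodes with path expressions, the sketch below uses Python's `xml.etree.ElementTree`, which supports only a limited subset of XPath. The sample XML document is made up for the example. Because this subset does not include XPath functions such as `count()`, the count is computed in Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document used only for this example.
XML_DOC = """
<library>
  <book year="1999"><title>Alpha</title></book>
  <book year="2005"><title>Beta</title></book>
  <book year="2005"><title>Gamma</title></book>
</library>
"""

root = ET.fromstring(XML_DOC)

# Select every <title> that is a child of a <book> under the root.
all_titles = [t.text for t in root.findall("./book/title")]

# Select titles only for books whose "year" attribute equals 2005.
titles_2005 = [t.text for t in root.findall("./book[@year='2005']/title")]

# Computed in Python, since ElementTree has no XPath count() function.
book_count = len(root.findall("./book"))

print(all_titles, titles_2005, book_count)
```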
8. Simple tools
Google Docs (mainly static data)
http://drive.google.com
YQL (mainly static data)
http://developer.yahoo.com/yql/console/
Yahoo Pipes (mainly dynamic data)
http://pipes.yahoo.com/pipes/
IFTTT (mainly dynamic data)
https://ifttt.com/
13. Google Fusion Tables
a simple solution for ordinary users
http://www.google.com/fusiontables/Home/
Web service provided by Google for data
management. Data is stored in multiple tables that
Internet users can view and download. The Web
service provides means for visualizing data with pie
charts, bar charts, line plots, scatter plots, and timelines,
as well as geographical maps. Data is exported in a
comma-separated values file format.
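Since the export is plain CSV, it can be read with any CSV parser. The following sketch uses Python's `csv` module on a made-up sample; the column names and values are assumptions and do not come from a real export.

```python
import csv
import io
from collections import Counter

# Hypothetical exported table; a real export would be read from a file.
EXPORT = (
    "place,category,lat,lon\n"
    "Library,culture,50.08,14.42\n"
    "Museum,culture,50.09,14.40\n"
    "Station,transport,50.08,14.43\n"
)

# Each row becomes a dictionary keyed by the header line.
rows = list(csv.DictReader(io.StringIO(EXPORT)))

# Count how many places fall into each category.
per_category = Counter(row["category"] for row in rows)

print(len(rows), dict(per_category))
```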