Is the traditional data warehouse dead?

Is the traditional data warehouse dead?
James Serra
Big Data Evangelist
Microsoft
JamesSerra3@gmail.com
(Data Lake and Data Warehouse – the best of both worlds)

About Me
• Microsoft, Big Data Evangelist
• In IT for 30 years, worked on many BI and DW projects
• Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
• Been perm employee, contractor, consultant, business owner
• Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
• Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions
• Blog at JamesSerra.com
• Former SQL Server MVP
• Author of book “Reporting with Microsoft SQL Server 2012”

Agenda
• Data Warehouse
• Data Lake
• The best of both worlds
• Federated querying
• Patterns

Considering Data Types
• Unstructured: Audio, video, images. Meaningless without adding some structure
• Semi-Structured: JSON, XML, sensor data, social media, device data, web logs. Flexible data model structure
• Structured: CSV, Columnar Storage (Parquet, ORC). Strict data model structure
Relational data and non-relational data are data models, describing how data is organized. Structured, semi-structured, and unstructured data are data types.

Two Approaches to getting value out of data: Top-Down + Bottom-Up
Top-down: Theory -> Hypothesis -> Observation -> Confirmation
• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
Bottom-up: Observation -> Pattern -> Hypothesis -> Theory
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?

Of course you still need a data warehouse
A data warehouse is where you store data from multiple data sources to be used for historical and trend analysis reporting. It acts as a central repository for many subject areas and contains the "single version of truth".
Reasons for a data warehouse:
• Reduce stress on production system
• Optimized for read access, sequential disk scans
• Integrate many sources of data
• Keep historical records (no need to save hardcopy reports)
• Restructure/rename tables and fields, model data
• Protect against source system upgrades
• Use Master Data Management, including hierarchies
• No IT involvement needed for users to create reports
• Improve data quality and plug holes in source systems
• One version of the truth
• Easy to create BI solutions on top of it (e.g. SSAS cubes)

Traditional Data Warehousing Uses A Top-Down Approach
1. Understand Corporate Strategy
2. Gather Requirements: Business Requirements, Technical Requirements
3. Implement Data Warehouse (fed by the data sources):
• Reporting & Analytics Design -> Reporting & Analytics Development
• Dimension Modelling -> Physical Design
• ETL Design -> ETL Development
• Setup Infrastructure -> Install and Tune

Traditional business analytics process
Flow: Relational LOB Applications -> ETL pipeline (dedicated ETL tools, e.g. SSIS) -> Defined schema -> Queries -> Results
1. Start with end-user requirements to identify desired reports and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract required data (curation) and transform it to target schema (‘schema-on-write’)
5. Create reports. Analyze data
All data not immediately required is discarded or archived
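The schema-on-write steps above can be sketched in a few lines of Python. This is only an illustration, not a real ETL tool such as SSIS: the `fact_orders` table, the `SOURCE_ROWS` sample data, and the `transform` and `run_etl` helpers are all hypothetical names, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source rows, as they might arrive from a line-of-business system.
SOURCE_ROWS = [
    {"order_id": "1001", "customer": " Alice ", "amount": "19.90", "notes": "gift wrap"},
    {"order_id": "1002", "customer": "Bob", "amount": "5.00", "notes": ""},
    {"order_id": "1003", "customer": "Carol", "amount": "not-a-number", "notes": "rush"},
]


def transform(row):
    """Coerce a raw row into the target schema, or return None if it cannot conform."""
    try:
        return (int(row["order_id"]), row["customer"].strip(), float(row["amount"]))
    except (KeyError, ValueError):
        return None


def run_etl(rows):
    """Load rows into a schema that is fixed before any data arrives (schema-on-write)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact_orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    # Rows that fail the transform are rejected; the 'notes' field is never loaded.
    clean = [t for t in (transform(r) for r in rows) if t is not None]
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", clean)
    conn.commit()
    return conn


conn = run_etl(SOURCE_ROWS)
row_count, total_amount = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM fact_orders"
).fetchone()
print(row_count, total_amount)
```

Note that the `notes` field and the malformed row never reach the target table, which mirrors the last point on the slide: anything not required by the predefined schema is dropped at load time.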

Harness the growing and changing nature of data
Need to collect any data: Structured, Unstructured, Streaming
Challenge is combining transactional data stored in relational databases with less structured data
Big Data = All Data
“Get the right information to the right people at the right time in the right format”

New big data thinking: All data has value
Flow: Gather data from all sources -> Store indefinitely -> Analyze -> See results -> Iterate
• All data has potential value
• Data hoarding
• No defined schema: data is stored in native format
• Schema is imposed and transformations are done at query time (schema-on-read)
• Apps and users interpret the data as they see fit
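A minimal Python sketch of schema-on-read, under the assumption that raw events are kept as JSON lines. The sample records and the `read_temperatures` and `read_logins` functions are hypothetical; the point is only that each consumer imposes its own schema while reading, and the stored data is never rewritten.

```python
import json

# Hypothetical raw events kept exactly as they arrived (one JSON document per line).
RAW_LINES = [
    '{"device": "t-1", "temp_c": 21.5, "ts": "2018-01-01T00:00:00"}',
    '{"device": "t-2", "temperature": "70.7F", "ts": "2018-01-01T00:05:00"}',
    '{"user": "alice", "action": "login"}',
]


def read_temperatures(lines):
    """Apply a temperature 'schema' while reading; the stored lines are never changed."""
    for line in lines:
        record = json.loads(line)
        if "temp_c" in record:
            yield record["device"], float(record["temp_c"])
        elif str(record.get("temperature", "")).endswith("F"):
            fahrenheit = float(record["temperature"][:-1])
            yield record["device"], round((fahrenheit - 32) * 5 / 9, 1)
        # Records that do not fit this reader's schema are skipped, not deleted.


def read_logins(lines):
    """A second consumer imposes a different schema on the same raw lines."""
    return [r["user"] for r in map(json.loads, lines) if r.get("action") == "login"]


temps = list(read_temperatures(RAW_LINES))
logins = read_logins(RAW_LINES)
print(temps, logins)
```

The trade-off is visible even at this scale: nothing is lost at ingestion, but every reader has to carry its own parsing and unit-conversion logic.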

The “data lake” Uses A Bottom-Up Approach
1. Ingest all data regardless of requirements (sources include devices)
2. Store all data in native format without schema definition
3. Do analysis using analytic engines like Hadoop
Analysis outputs: Interactive queries, Batch queries, Machine Learning, Data warehouse, Real-time analytics

Data Analysis Paradigm Shift
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure

Exactly what is a data lake?
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.
• Inexpensively store unlimited data
• Collect all data “just in case”
• Store data with no modeling – “Schema on read”
• Complements EDW
• Frees up expensive EDW resources
• Quick user access to data
• ETL Hadoop tools
• Easily scalable
• Place to back up data
• Place to move older data

Needs data governance so your data lake does not turn into a data swamp!

The real cost of Hadoop
https://www.scribd.com/document/172491475/WinterCorp-Report-Big-Data-What-Does-It-Really-Cost/

A data lake is just a glorified file folder with data files in it – how many end-users can accurately create reports from it?

• Query performance not as good as relational database
• Complex query support is weak due to the lack of a query optimizer, in-database operators, advanced memory management, concurrency, dynamic workload management and robust indexing
• Concurrency limitations
• No concept of “hot” and “cold” data storage with different levels of performance to reduce cost
• Not a DBMS so lack of features such as update/delete of data, referential integrity, statistics, ACID compliance, data security
• File based so no granular security definition at the column level
• No metadata stored in HDFS, so another tool required adding complexity and slowing performance
• Finding expertise in Hadoop is very difficult
• Super complex, with lots of integration with multiple technologies to make it work
• Many tools/technologies/versions/vendors (fragmentation), no standards, and it is difficult to make it a corporate standard
• Lack of master data management tools for Hadoop
• Requires end-users to learn new reporting tools and Hadoop technologies to query the data
• Pace of change is so quick many Hadoop technologies become obsolete, adding risk
• Lack of cost savings: cloud consumption, support, licenses, training, and migration costs
• Need conversion process to convert data to a relational format if a reporting tool requires it
• Some reporting tools don’t work against Hadoop

Current state of a data warehouse
Traditional Approaches
DATA SOURCES (CRM, ERP, OLTP, LOB)
• Well manicured, often relational sources
• Known and expected data volume and formats
• Little to no change
ETL
• Complex, rigid transformations
• Required extensive monitoring
• Transformed historical data into read structures
DATA WAREHOUSE
• Star schemas, views, other read-optimized structures
BI AND ANALYTICS
• Emailed, centrally stored Excel reports and dashboards
• Flat, canned or multi-dimensional access to historical data
• Many reports, multiple versions of the truth
• 24 to 48h delay
MONITORING AND TELEMETRY

Current state of a data warehouse
Traditional Approaches (under pressure)
DATA SOURCES (CRM, ERP, OLTP, LOB)
• Increase in variety of data sources
• Increase in data volume
• Increase in types of data
• Pressure on the ingestion engine
ETL
• Complex, rigid transformations can no longer keep pace
• Monitoring is abandoned
• Delay in data, inability to transform volumes, or react to new sources
DATA WAREHOUSE (star schemas, views, other read-optimized structures)
• Repair, adjust and redesign ETL
• Reports become invalid or unusable
BI AND ANALYTICS (emailed, centrally stored Excel reports and dashboards)
• Delay in preserved reports increases
• Users begin to “innovate” to relieve starvation
MONITORING AND TELEMETRY
Pressures: INCREASING DATA VOLUME, NON-RELATIONAL DATA, INCREASE IN TIME, STALE REPORTING

Data Lake Transformation (ELT not ETL)
New Approaches
DATA SOURCES (CRM, ERP, OLTP, LOB, future data sources, non-relational data)
• All data sources are considered
• Leverages the power of on-prem technologies and the cloud for storage and capture
• Native formats, streaming data, big data
EXTRACT AND LOAD
• Extract and load, no/minimal transform
• Storage of data in near-native format
• Orchestration becomes possible
• Streaming data accommodation becomes possible
DATA LAKE and DATA REFINERY PROCESS (TRANSFORM ON READ)
• Transform relevant data into data sets
• Refineries transform data on read
• Produce curated data sets to integrate with traditional warehouses
DATA WAREHOUSE
• Star schemas, views, other read-optimized structures
BI AND ANALYTICS
• Discover and consume predictive analytics, data sets and other reports
• Users discover published data sets/services using familiar tools

Data Lake + Data Warehouse Better Together
Data sources feed both the data lake and the data warehouse:
• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?

Modern Data Warehouse
• Ultimate goal
• Supports future data needs
• Data harmonized and analyzed in the data lake or moved to EDW for more quality and performance

Data Lake | Data Warehouse
Schema-on-read | Schema-on-write
Physical collection of uncurated data | Data of common meaning
System of Insight: Unknown data to do experimentation / data discovery | System of Record: Well-understood data to do operational reporting
Any type of data | Limited set of data types (i.e. relational)
Skills are limited | Skills mostly available
All workloads – batch, interactive, streaming, machine learning | Optimized for interactive querying
Complementary to DW | Can be sourced from Data Lake

Data Warehouse
Serving, Security & Compliance
• Business people
• Low latency
• Complex joins
• Interactive ad-hoc query
• High number of users
• Additional security
• Large support for tools
• Dashboards
• Easily create reports (Self-service BI)
• Known questions

Use cases using Hadoop and a DW in combination
• Bringing islands of Hadoop data together
• Archiving data warehouse data to Hadoop (move): Hadoop as cold storage
• Exporting relational data to Hadoop (copy): Hadoop as backup/DR, analysis, cloud use
• Importing Hadoop data into data warehouse (copy): Hadoop as staging area, sandbox, Data Lake
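As a rough illustration of the "archive (move)" use case, the Python sketch below moves old rows out of a warehouse table into a flat file and then deletes them from the table. Everything here is a stand-in: SQLite plays the warehouse, a local CSV file plays Hadoop cold storage, and the `sales` table and `archive_old_rows` helper are hypothetical.

```python
import csv
import os
import sqlite3
import tempfile


def archive_old_rows(conn, cutoff_year, archive_path):
    """Move rows older than cutoff_year from the warehouse table to a flat file."""
    old = conn.execute(
        "SELECT sale_id, sale_year, amount FROM sales WHERE sale_year < ? ORDER BY sale_id",
        (cutoff_year,),
    ).fetchall()
    with open(archive_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sale_id", "sale_year", "amount"])
        writer.writerows(old)
    # Delete only after the archive file has been written, so this is a move, not a copy.
    conn.execute("DELETE FROM sales WHERE sale_year < ?", (cutoff_year,))
    conn.commit()
    return len(old)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER, sale_year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 2012, 10.0), (2, 2013, 20.0), (3, 2017, 30.0)],
)
archive_path = os.path.join(tempfile.mkdtemp(), "sales_archive.csv")
moved = archive_old_rows(conn, 2015, archive_path)
remaining = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(moved, remaining)
```

The "export (copy)" use case would be the same sketch without the `DELETE` step.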

Reasons you still need a cube/OLAP
• Semantic layer
• Handle many concurrent users
• Aggregating data for performance
• Multidimensional analysis
• No joins or relationships
• Hierarchies, KPIs
• Row-level security
• Advanced time-calculations
• Slowly Changing Dimensions (SCD)
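The "aggregating data for performance" and "hierarchies" points can be illustrated with a toy pre-aggregation in Python. This is not how SSAS or any real OLAP engine is implemented; the `FACTS` rows and the `build_aggregates` helper are hypothetical and only show the idea of answering queries from precomputed totals at each level of a Year > Month hierarchy.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, region, sales_amount).
FACTS = [
    (2017, 1, "East", 100), (2017, 1, "West", 50),
    (2017, 2, "East", 70), (2018, 1, "East", 30),
]


def build_aggregates(facts):
    """Precompute totals at each level of a Year > Month hierarchy, per region."""
    by_month = defaultdict(int)
    by_year = defaultdict(int)
    for year, month, region, amount in facts:
        by_month[(year, month, region)] += amount
        by_year[(year, region)] += amount
    return by_year, by_month


by_year, by_month = build_aggregates(FACTS)
# Queries now read the small precomputed tables instead of scanning FACTS.
east_2017 = by_year[(2017, "East")]
west_jan_2017 = by_month[(2017, 1, "West")]
print(east_2017, west_jan_2017)
```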

Federated Querying
Other names: Data virtualization, logical data warehouse, data federation, virtual database, and decentralized data warehouse.
A model that allows a single query to retrieve and combine data as it sits from multiple data sources, so as to not need to use ETL or learn more than one retrieval technology
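The idea can be sketched in Python with one function that combines a relational source and a non-relational source at query time, leaving both where they sit. This is a simplified illustration, not PolyBase or any real federation engine (which would also handle query pushdown, optimization, and security); the `customers` table, the `CLICK_LINES` sample, and `clicks_per_customer` are hypothetical.

```python
import json
import sqlite3

# Source 1 (relational): hypothetical customer table in SQLite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

# Source 2 (non-relational): hypothetical clickstream kept as raw JSON lines.
CLICK_LINES = [
    '{"customer_id": 1, "page": "/home"}',
    '{"customer_id": 1, "page": "/pricing"}',
    '{"customer_id": 2, "page": "/home"}',
]


def clicks_per_customer(db, click_lines):
    """Join both sources at query time; neither source is copied into the other."""
    counts = {}
    for line in click_lines:
        cid = json.loads(line)["customer_id"]
        counts[cid] = counts.get(cid, 0) + 1
    rows = db.execute("SELECT customer_id, name FROM customers ORDER BY customer_id")
    return [(name, counts.get(cid, 0)) for cid, name in rows]


result = clicks_per_customer(db, CLICK_LINES)
print(result)
```

The caller sees one interface and one result set, which is the benefit the slide describes: no ETL job was needed, and the caller did not have to learn a second retrieval technology.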

SQL Server and PolyBase
Query relational and non-relational data with T-SQL

Advanced Analytics
Data sources: Social, LOB, Graph, IoT, Image, CRM
• INGEST: Data orchestration and monitoring (Azure Data Factory)
• STORE: Big data store (Azure Blob Storage, Azure Data Lake)
• PREP & TRAIN: Hadoop/Spark and machine learning (Azure Databricks, Azure HDInsight, Azure Machine Learning, Machine Learning Server)
• MODEL & SERVE: Data warehouse (Azure SQL Data Warehouse), Cloud Bursting, BI + Reporting (Azure Analysis Services)

Architecture diagram (components)
• Sources: Business/custom apps (structured); Logs, files and media (unstructured)
• Orchestration: Azure Data Factory
• Storage: Azure Data Lake Store
• Processing: Azure Databricks, Azure HDInsight, Data Lake Analytics
• Load: PolyBase
• Warehouse: Azure SQL Data Warehouse
• Serving: Azure Analysis Services, Analytical dashboards

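Architecture diagram (components)
• Sources: Business/custom apps (structured); Logs, files and media (unstructured)
• Storage and processing: Azure Data Lake Store, Hortonworks
• Load: PolyBase
• Warehouse and databases: Azure SQL Data Warehouse, Azure SQL Database
• Serving: Tableau Server, Analytical dashboards, Operational Reports, Ad-Hoc Query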

Q & A ?
James Serra, Big Data Evangelist
Email me at: JamesSerra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)
