With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that? No! In the presentation I’ll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I’ll put it all together by showing common big data architectures.
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Is the traditional data warehouse dead?
1. Is the traditionnel data
warehouse dead?
James Serra
Big Data Evangelist
Microsoft
JamesSerra3@gmail.com
(Data Lake and Data Warehouse – the
best of both worlds)
2. About Me
Microsoft, Big Data Evangelist
In IT for 30 years, worked on many BI and DW projects
Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM
architect, PDW/APS developer
Been perm employee, contractor, consultant, business owner
Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure
Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data
Platform Solutions
Blog at JamesSerra.com
Former SQL Server MVP
Author of book “Reporting with Microsoft SQL Server 2012”
4. Considering Data Types
Audio, video, images. Meaningless
without adding some structure
Unstructured
JSON, XML, sensor data, social media,
device data, web logs. Flexible data
model structure
Semi-Structured
Structured CSV, Columnar Storage (Parquet,
ORC). Strict data model structure
Relational data and non-relational data are data models, describing how data is organized. Structured, semi-structured, and unstructured data are data types
5. Observation
Pattern
Theory
Hypothesis
What will
happen?
How can we
make it happen?
Predictive
Analytics
Prescriptive
Analytics
What
happened?
Why did
it happen?
Descriptive
Analytics
Diagnostic
Analytics
Confirmation
Theory
Hypothesis
Observation
Two Approaches to getting value out of data: Top-Down +
Bottoms-Up
6.
7.
8. Of course you still need a data warehouse
A data warehouse is where you store data from multiple data sources to be used for historical and
trend analysis reporting. It acts as a central repository for many subject areas and contains the "single
version of truth".
Reasons for a data warehouse:
Reduce stress on production system
Optimized for read access, sequential disk scans
Integrate many sources of data
Keep historical records (no need to save hardcopy reports)
Restructure/rename tables and fields, model data
Protect against source system upgrades
Use Master Data Management, including hierarchies
No IT involvement needed for users to create reports
Improve data quality and plugs holes in source systems
One version of the truth
Easy to create BI solutions on top of it (i.e. SSAS Cubes)
9. Implement Data Warehouse
Physical Design
ETL
Development
Reporting &
Analytics
Development
Install and Tune
Reporting &
Analytics Design
Dimension Modelling
ETL Design
Setup Infrastructure
Understand
Corporate
Strategy
Traditional Data Warehousing Uses A Top-Down Approach
Data sources
Gather
Requirements
Business
Requirements
Technical
Requirements
10.
11.
12.
13.
14. ETL pipeline
Dedicated ETL tools (e.g. SSIS)
Defined schema
Queries
Results
Relational
LOB
Applications
Traditional business analytics process
1. Start with end-user requirements to identify desired reports
and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create a Extract-Transform-Load (ETL) pipeline to extract
required data (curation) and transform it to target schema
(‘schema-on-write’)
5. Create reports. Analyze data
All data not immediately required is discarded or archived
14
15. Harness the growing and changing nature of data
Need to collect any data
StreamingStructured
Challenge is combining transactional data stored in relational databases with less structured data
Big Data = All Data
Get the right information to the right people at the right time in the right format
Unstructured
“ ”
17. Store indefinitely Analyze See results
Gather data
from all sources
Iterate
New big data thinking: All data has value
All data has potential value
Data hoarding
No defined schema—stored in native format
Schema is imposed and transformations are done at query time (schema-on-read).
Apps and users interpret the data as they see fit
17
18. The “data lake” Uses A Bottoms-Up Approach
Ingest all data
regardless of requirements
Store all data
in native format without
schema definition
Do analysis
Using analytic engines
like Hadoop
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
19. Data Analysis Paradigm Shift
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure
20. Exactly what is a data lake?
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native
format until it is needed.
• Inexpensively store unlimited data
• Collect all data “just in case”
• Store data with no modeling – “Schema on read”
• Complements EDW
• Frees up expensive EDW resources
• Quick user access to data
• ETL Hadoop tools
• Easily scalable
• Place to backup data to
• Place to move older data
24. The real cost of Hadoop
https://www.scribd.com/document/172491475/WinterCorp-
Report-Big-Data-What-Does-It-Really-Cost/
25.
26. A data lake is just a glorified file folder with data files in it – how many end-users can
accurately create reports from it?
27. • Query performance not as good as relational database
• Complex query support not good due to lack of query optimizer, in-database operators, advanced memory management,
concurrency, dynamic workload management and robust indexing
• Concurrency limitations
• No concept of “hot” and “cold” data storage with different levels of performance to reduce cost
• Not a DBMS so lack of features such as update/delete of data, referential integrity, statistics, ACID compliance, data security
• File based so no granular security definition at the column level
• No metadata stored in HDFS, so another tool required adding complexity and slowing performance
• Finding expertise in Hadoop is very difficult
• Super complex, with lot’s of integration with multiple technologies to make it work
• Many tools/technologies/versions/vendors (fragmentation), no standards, and it is difficult to make it a corporate standard
• Lack of master data management tools for Hadoop
• Requires end-users to learn new reporting tools and Hadoop technologies to query the data
• Pace of change is so quick many Hadoop technologies become obsolete, adding risk
• Lack of cost savings: cloud consumption, support, licenses, training, and migration costs
• Need conversion process to convert data to a relational format if a reporting tool requires it
• Some reporting tools don’t work against Hadoop
28.
29. Current state of a data warehouse
Traditional Approaches
CRMERPOLTP LOB
DATA SOURCES ETL DATA WAREHOUSE
Star schemas,
views
other read-
optimized
structures
BI AND ANALYTCIS
Emailed,
centrally
stored Excel
reports and
dashboards
Well manicured, often relational
sources
Known and expected data volume
and formats
Little to no change
Complex, rigid transformations
Required extensive monitoring
Transformed historical into read
structures
Flat, canned or multi-dimensional
access to historical data
Many reports, multiple versions of
the truth
24 to 48h delay
MONITORING AND TELEMETRY
30. Current state of a data warehouse
Traditional Approaches
CRMERPOLTP LOB
DATA SOURCES ETL DATA WAREHOUSE
Star schemas,
views
other read-
optimized
structures
BI AND ANALYTCIS
Emailed,
centrally
stored Excel
reports and
dashboards
Increase in variety of data sources
Increase in data volume
Increase in types of data
Pressure on the ingestion engine
Complex, rigid transformations can’t
longer keep pace
Monitoring is abandoned
Delay in data, inability to transform
volumes, or react to new sources
Repair, adjust and redesign ETL
Reports become invalid or unusable
Delay in preserved reports increases
Users begin to “innovate” to relieve
starvation
MONITORING AND TELEMETRY
INCREASING DATA VOLUME NON-RELATIONAL DATA
INCREASE IN TIME
STALE REPORTING
31. Data Lake Transformation (ELT not ETL)
New Approaches
All data sources are considered
Leverages the power of on-prem
technologies and the cloud for
storage and capture
Native formats, streaming data, big
data
Extract and load, no/minimal transform
Storage of data in near-native format
Orchestration becomes possible
Streaming data accommodation becomes
possible
Refineries transform data on read
Produce curated data sets to
integrate with traditional warehouses
Users discover published data
sets/services using familiar tools
CRMERPOLTP LOB
DATA SOURCES
FUTURE DATA
SOURCESNON-RELATIONAL DATA
EXTRACT AND LOAD
DATA LAKE DATA REFINERY PROCESS
(TRANSFORM ON READ)
Transform
relevant data
into data sets
BI AND ANALYTCIS
Discover and
consume
predictive
analytics, data
sets and other
reports
DATA WAREHOUSE
Star schemas,
views
other read-
optimized
structures
32. Data Lake + Data Warehouse Better Together
Data sources
What happened?
Descriptive
Analytics
Diagnostic
Analytics
Why did it happen?
What will happen?
Predictive
Analytics
Prescriptive
Analytics
How can we make it happen?
33. Modern Data Warehouse
• Ultimate goal
• Supports future data needs
• Data harmonized and analyzed in
the data lake or moved to EDW for
more quality and performance
34. Data Lake Data Warehouse
Schema-on-read Schema-on-write
Physical collection of uncurated data Data of common meaning
System of Insight: Unknown data to do
experimentation / data discovery
System of Record: Well-understood data to do
operational reporting
Any type of data Limited set of data types (ie. relational)
Skills are limited Skills mostly available
All workloads – batch, interactive, streaming,
machine learning
Optimized for interactive querying
Complementary to DW Can be sourced from Data Lake
35. Data Warehouse
Serving, Security & Compliance
• Business people
• Low latency
• Complex joins
• Interactive ad-hoc query
• High number of users
• Additional security
• Large support for tools
• Dashboards
• Easily create reports (Self-service BI)
• Know questions
36. Use cases using Hadoop and a DW in combination
Bringing islands of Hadoop data together
Archiving data warehouse data to Hadoop (move)
(Hadoop as cold storage)
Exporting relational data to Hadoop (copy)
(Hadoop as backup/DR, analysis, cloud use)
Importing Hadoop data into data warehouse (copy)
(Hadoop as staging area, sandbox, Data Lake)
37. Reasons you still need a cube/OLAP
• Semantic layer
• Handle many concurrent users
• Aggregating data for performance
• Multidimensional analysis
• No joins or relationships
• Hierarchies, KPI’s
• Row-level security
• Advanced time-calculations
• Slowly Changing Dimensions (SCD)
39. Federated Querying
Other names: Data virtualization, logical data warehouse, data
federation, virtual database, and decentralized data warehouse.
A model that allows a single query to retrieve and combine data as it sits
from multiple data sources, so as to not need to use ETL or learn more
than one retrieval technology
40. SQL Server and PolyBase
Query relational and non-relational data with T-SQL
41.
42. Advanced Analytics
Social
LOB
Graph
IoT
Image
CRM
INGEST STORE PREP & TRAIN MODEL & SERVE
Data orchestration
and monitoring
Big data store Hadoop/Spark and
machine learning
Data warehouse
Cloud Bursting
BI + Reporting
Azure Data Factory Azure Blob Storage Azure Databricks
Azure Data Lake
Azure HDInsight
Azure Machine Learning
Machine Learning Server
Azure SQL Data Warehouse
Azure Analysis Services
43. INGEST STORE PREP & TRAIN MODEL & SERVE
Logs, files and media
(unstructured)
Azure SQL Data
Warehouse
Azure Data Factory
Azure Data Factory
Azure Databricks
Azure HDInsight
Data Lake Analytics
Analytical
dashboards
PolyBase
Business/custom apps
(Structured) Azure Analysis
Services
Azure Data Lake Store
44. INGEST STORE PREP & TRAIN MODEL & SERVE
Azure Data Lake Store
Analytical
dashboards
Business/custom apps
(Structured)
Logs, files and media
(unstructured)
Azure SQL Data
Warehouse
Tableau
Server
PolyBase
Operational
Reports
Ad-Hoc
Query
Azure SQL
Database
Hortonworks
47. Q & A ?
James Serra, Big Data Evangelist
Email me at: JamesSerra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)
Notas del editor
With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that? No! In the presentation I'll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I'll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I'll put it all together by showing common big data architectures.
http://www.jamesserra.com/archive/2017/12/is-the-traditional-data-warehouse-dead/
https://www.slideshare.net/jamserra/big-data-architectures-and-the-data-lake
https://www.slideshare.net/jamserra/differentiate-big-data-vs-data-warehouse-use-cases-for-a-cloud-solution
Fluff, but point is I bring real work experience to the session
No
Pay the piper now or later
The real question is…
Dump files into data lake and tell user to go for it
Relational databases (RDBMS) generally work with structured data. Non-relational databases (NoSQL) work with semi-structured data
Relational data and non-relational data are data models, describing how data is organized. Structured, semi-structured, and unstructured data are data types
Top down starts with descriptive analytics and progresses to prescriptive analytics. Know the questions to ask. Lot’s of upfront work to get data to where you can use it
Bottoms up starts with predictive analytics. Don’t know the questions to ask. Little work needs to be done to start using data
There are two approaches to doing information management for analytics:
Top-down (deductive approach). This is where analytics is done starting with a clear understanding of corporate strategy where theories and hypothesis are made up front. The right data model is then designed and implemented prior to any data collection. Oftentimes, the top-down approach is good for descriptive and diagnostic analytics. What happened in the past and why did it happen?
Bottom-up (inductive approach). This is the approach where data is collected up front before any theories and hypothesis are made. All data is kept so that patterns and conclusions can be derived from the data itself. This type of analysis allows for more advanced analytics such as doing predictive or prescriptive analytics: what will happen and/or how can we make it happen?
In Gartner’s 2013 study, “Big Data Business Benefits Are Hampered by ‘Culture Clash’”, they make the argument that both approaches are needed for innovation to be successful. Oftentimes what happens in the bottom-up approach becomes part of the top-down approach.
.
One version of truth story: different departments using different financial formulas to help bonus
This leads to reasons to use BI. This is used to convince your boss of need for DW
Note that you still want to do some reporting off of source system (i.e. current inventory counts).
It’s important to know upfront if data warehouse needs to be updated in real-time or very frequently as that is a major architectural decision
JD Edwards has tables names like T117
The Data Warehouses leverages the top-down approach where there is a well-architected information store and enterprisewide BI solution. To build a data warehouse follows the top-down approach where the company’s corporate strategy is defined first. This is followed by gathering of business and technical requirements for the warehouse. The data warehouse is then implemented by dimension modelling and ETL design followed by the actual development of the warehouse. This is all done prior to any data being collected. It utilizes a rigorous and formalized methodology because a true enterprise data warehouse supports many users/applications within an organization to make better decisions.
Key Points:
Businesses can use new data streams to gain a competitive advantage.
Microsoft is uniquely equipped to help you manage the growing volume and variety of data: structured, unstructured, and streaming.
Talk Track:
Does it not seem like every day there is a new kind of data that we need to understand?
New data types continue to expand—we need to be prepared to collect that data so that the organization can then go do something with it.
Structured data, the type of data we have been working with for years, continues to accelerate. Think how many transactions are occurring across your business.
Unstructured data, the typical source of all our big data, takes many forms and originates from various places across the web including social.
Streaming data is the data at the heart of the Internet of Things revolution. Just think about how many things in your organization are smart or instrumented and generating data every second.
All of this means that data volumes are growing and bringing new capacity challenges. You are also dealing with an enormous opportunity, taking all of this data and putting it to work. In order to take advantage of all this data, you first need a platform that enables you to collect any data—no matter the size or type. The Microsoft data platform is uniquely complete and can help you collect any data using a flexible approach:
Collecting data on-premises with SQL Server
SQL Server can help you collect and manage structured, unstructured, and streaming data to power all your workloads: OLTP, BI, and Data Warehousing
With new in-memory capabilities that are built into SQL Server 2014, you get the benefit of breakthrough speed with your existing hardware and without having to rewrite your apps.
If you’ve been considering the cloud, SQL Server provides an on-ramp to help you get started. Using the wizards built into SQL Server Management Studio, extending to the cloud by combining SQL and Microsoft Azure is simple.
Capture new data types using the power and flexibility of the Microsoft Azure Cloud
Azure is well equipped to provide the flexibility you need to collect and manage any data in the cloud in a way that meets the needs of your business.
Big data in Azure: HDInsight: an Apache Hadoop-based analytics solution that allows cluster deployment in minutes, scale up or down as needed, and insights through familiar BI tools.
SQL Databases: managed relational SQL Database-as-a-service that offers business-ready capabilities built on SQL Server technology.
Blobs: a cloud storage solution offering the simplest way to store large amounts of unstructured text or binary data, such as video, audio, and images.
Tables: a NoSQL key/value storage solution that provides simple access to data at a lower cost for applications that do not need robust querying capabilities.
Intelligent Systems Service: cloud service that helps enterprises embrace the Internet of Things by securely connecting, managing, and capturing machine-generated data from a variety of sensors and devices to drive improvements in operations and tap into new business opportunities.
Machine Learning: if you’re looking to anticipate business challenges or opportunities, or perhaps expand your data practice into data science, Azure’s new Machine Learning service—cloud-based predictive analytics— can help. ML Studio is a fully-managed cloud service that enables data scientists and developers to efficiently embed predictive analytics into their applications, helping organizations use massive data sets and bring all the benefits of the cloud to machine learning.
Document DB: a fully managed, highly scalable, NoSQL document database service
Azure Stream Analytics: real-time event processing engine that helps uncover insights from devices, sensors, infrastructure, applications, and data
Azure Data Factory: enables information production by orchestrating and managing diverse data
Azure Event Hubs: a scalable service for collecting data from millions of “things” in seconds
Microsoft Analytics Platform System:
In the past, to provide users with reliable, trustworthy information, enterprises gathered relational and transactional data in a single data warehouse.
But this traditional data warehouse is under pressure, hitting limits amidst massive change.
Data volumes are projected to grow tenfold over the next five years. End users want real-time responses and insights.
They want to use non-relational data, which now constitutes 85 percent of data growth. They want access to “cloud-born” data, data that was created from growing cloud IT investments.
Your enterprise can only cope with these shifts with a modern data warehouse—the Microsoft Analytics Platform System is the answer.
The Analytics Platform System brings Microsoft’s massively parallel processing (MPP) data warehouse technology—the SQL Server Parallel Data Warehouse (PDW), together with HDInsight, Microsoft’s 100 percent Apache Hadoop distribution—and delivers it as a turnkey appliance.
Now you can collect relational and non-relational data in one appliance.
You can have seamless integration of the relational data warehouse and Hadoop with PolyBase.
All of these options give you the flexibility to get the most out of your existing data capture investments while providing a path to a more efficient and optimized data environment that is ready to support new data types.
All data has immediate or potential value
This leads to data hoarding—all data is stored indefinitely
With an unknown future, there is no defined schema. Data is prepared and stored in native format; No upfront transformation or aggregation
Schema is imposed and transformations are done at query time(schema-on-read). Applications and users interpret the data as they see fit.
Also called bit bucket, staging area, landing zone or enterprise data hub (Cloudera)
http://www.jamesserra.com/archive/2014/05/hadoop-and-data-warehouses/
http://www.jamesserra.com/archive/2014/12/the-modern-data-warehouse/
http://adtmag.com/articles/2014/07/28/gartner-warns-on-data-lakes.aspx
http://intellyx.com/2015/01/30/make-sure-your-data-lake-is-both-just-in-case-and-just-in-time/
http://www.blue-granite.com/blog/bid/402596/Top-Five-Differences-between-Data-Lakes-and-Data-Warehouses
http://www.martinsights.com/?p=1088
http://data-informed.com/hadoop-vs-data-warehouse-comparing-apples-oranges/
http://www.martinsights.com/?p=1082
http://www.martinsights.com/?p=1094
http://www.martinsights.com/?p=1102
Inexpensively store unlimited data
Collect all data “just in case”
Easy integration of differently-structured data
Store data with no modeling – “Schema on read”
Complements enterprise data warehouse (EDW)
Frees up expensive EDW resources, especially for refining data
Hadoop cluster offers faster ETL processing over SMP solutions
Quick user access to data
Data exploration to see if data valuable before writing ETL and schema for relational database
Allows use of Hadoop tools such as ETL and extreme analytics
Place to land IoT streaming data
On-line archive or backup for data warehouse data
Easily scalable
With Hadoop, high availability built in
Allows for data to be used many times for different analytic needs and use cases
Low-cost storage for raw data saving space on the EDW
https://www.sqlchick.com/entries/2017/12/30/zones-in-a-data-lake
https://www.sqlchick.com/entries/2016/7/31/data-lake-use-cases-and-planning
Question: Do you see many companies building data lakes?
Raw: Raw events are stored for historical reference. Also called staging layer or landing area
Cleansed: Raw events are transformed (cleaned and mastered) into directly consumable data sets. Aim is to uniform the way files are stored in terms of encoding, format, data types and content (i.e. strings). Also called conformed layer
Application: Business logic is applied to the cleansed data to produce data ready to be consumed by applications (i.e. DW application, advanced analysis process, etc). This is also called by a lot of other names: workspace, trusted, gold, secure, production ready, governed, presentation
Sandbox: Optional layer to be used to “play” in. Also called exploration layer or data science workspace
Question: Do you see many companies building data lakes?
I’m not saying your data warehouse can’t consist of just a Hadoop data lake, as it has been done at Google, the NY Times, eBay, Twitter, and Yahoo. But are you as big as them? Do you have their resources? Do you generate data like them? Do you want a solution that only 1% of the workforce has the skillset for? Is your IT department radical or is it conservative?
does a data scientist or analyst think locally or globally? Do they create a model that supports just their use case or do think more broadly how this data set can support other use cases? So it may be best to continue to let IT model and refine the data inside a relational data warehouse so that it is suitable for different types of business users.
As far as reporting goes, whether to have users report off of a data lake or via a relational database and/or a cube is a balance between giving users data quickly and having them do the work to join, clean and master data (getting IT out-of-the-way) versus having IT make multiple copies of the data and cleaning, joining and mastering it to make it easier for users to report off of the data but dealing with the delay in waiting for IT to do all this. The risk in the first case is having users repeating the process to clean/join/master data and cleaning/joining/mastering it wrong and getting different answers to the same question. Another risk in the first case is slower performance because the data is not laid out efficiently. Most solutions incorporate both to allow power users or data scientists to access the data quickly via a data lake while allowing all the other users to access the data in a relational database or cube, making self-service BI a reality (as most users would not have the skills to access data in a data lake properly or at all so a cube would be appropriate as it provides a semantic layer among other advantages to make report building very easy – see Why use a SSAS cube?).
http://www.jamesserra.com/archive/2014/05/hadoop-and-data-warehouses/
Not saying your EDW can’t consist of a Hadoop data lake, as it has been done at Google, the NY Times, eBay, Twitter, Yahoo. But are you as big as them? Do you have their resources? Do you generate data like them? Do you want a solution that only 1% of the workforce has the skillset? Radical vs conservative
http://www.wintercorp.com/tcod-report/
Why move relational data to data lake? Offload processing to refine data to free-up EDW, use low-cost storage for raw data saving space on EDW, help if ETL jobs on EDW taking too long. So can actually use a data lake for small data – move EDW to Hadoop, refine it, move it back to EDW. Cons: rewriting all current ETL to Hadoop, re-training
I believe APS should be used for staging (i.e. “ELT”) in most cases, but there are some good use cases for using a Hadoop Data Lake:
- Wanting to offload the data refinement to Hadoop, so the processing and space on the EDW is reduced
- Wanting to use some Hadoop technologies/tools to refine/filter data that are not available for APS
- Landing zone for unstructured data, as it can ingest large files quickly and provide data redundancy
- ELT jobs on EDW are taking too long, so offload some of them to the Hadoop data lake
- There may be cases when you want to move EDW data to Hadoop, refine it, and move it back to EDW (offload processing, need to use Hadoop tools)
- The data lake is a good place for data that you “might” use down the road. You can land it in the data lake and have users use SQL via Polybase to look at the data and determine if it has value
In Gartner’s 2013 study, “Big Data Business Benefits Are Hampered by ‘Culture Clash’”, they make the argument that both approaches are needed for innovation to be successful. Oftentimes what happens in the bottoms-up approach becomes part of the top-down approach.
The Top-down approach with the data warehouse utilizes a rigorous and formal approach to designing an enterprise wide data warehouse that can support the entire enterprise. It usually can answer questions that are backwards facing like what just happened or even answer why things happened.
The bottoms-up approach with the data lake utilizes a exploratory and informal approach of collecting all data in a single place so that data scientists can do advanced analytics like leveraging Hadoop and machine learning tools. It usually can identify new opportunities, predict future outcomes, etc.
In the ideal world, both are leveraged so that they can exploit information in the most valued way where each works together with the other to grow the business.
An evolution of the three previous scenarios that provides multiple options for the various technologies. Data may be harmonized and analyzed in the data lake or moved out to a EDW when more quality and performance is needed, or when users simply want control. ELT is usually used instead of ETL (see Difference between ETL and ELT). The goal of this scenario is to support any future data needs no matter what the variety, volume, or velocity of the data.
Hub-and-spoke should be your ultimate goal. See Why use a data lake? for more details on the various tools and technologies that can be used for the modern data warehouse.
HDInsights benefits: Cheap, quickly procure
Key goal of slide: Highlight the four main use cases for PolyBase.
Slide talk track:
There are four key scenarios for using PolyBase with the data lake of data normally locked up in Hadoop.
PolyBase leverages the APS MPP architecture along with optimizations like push-down computing to query data using Transact-SQL faster than using other Hadoop technologies like Hive. More importantly, you can use the Transact-SQL join syntax between Hadoop data and PDW data without having to import the data into PDW first.
PolyBase is a great tool for archiving older or unused data in APS to less expensive storage on a Hadoop cluster. When you do need to access the data for historical purposes, you can easily join it back up with your PDW data using Transact-SQL.
There are times when you need to share your PDW with Hadoop users and PolyBase makes it easy to copy data to a Hadoop cluster.
Using a simple SELECT INTO statement, PolyBase makes it easy to import valuable Hadoop data into PDW without having to use external ETL processes.
Question: Data warehouses and data lakes are now so fast, do we still need cubes?
Question: Do we need a relational database if we create a data lake? So we don’t have to make another copy of the data? Hive LLAP, Spark SQL, and Impala are so fast can’t we get away with just having a data lake? In other words, is the traditional data warehouse dead?
I’m a little confused with the update to SQL DW that has unlimited columnar storage. What was the limit before this change? Does this change the maximum database size of 240TB?
The max previously was governed by max db size of 240TB. We now are holding the columnstore data out on blob store and so the amount of that data we can hold is unlimited. We are still bound by the 240TB limit for page data so indexes and heaps are still capped today.
Speed: Blob/ADLS/Local, size, versions, query type, data size, driver, front-end tool, concurrency, etc.
Storage spaces
Azure SQL Database Managed Instance will be added to this.
SMP is one server where each CPU in the server shares the same memory, disk, and network controllers (scale-up). MPP means data is distributed among many independent servers running in parallel and is a shared-nothing architecture, where each server operates self-sufficiently and controls its own memory and disk (scale-out).
http://demo.sqlmag.com/scaling-success-sql-server-2016/integrating-big-data-and-sql-server-2016
When it comes to key BI investments we are making it much easier to manage relational and non-relational data with Polybase technology that allows you to query Hadoop data and SQL Server relational data through single T-SQL query. One of the challenges we see with Hadoop is there are not enough people out there with Hadoop and Map Reduce skillset and this technology simplifies the skillset needed to manage Hadoop data. This can also work across your on-premises environment or SQL Server running in Azure.
Question: Should SQL Database be considered in the Model & Serve blade, using it as a data mart?
Four Reasons to Migrate Your SQL Server Databases to the Cloud: Security, Agility, Availability, and Reliability
Reasons not to move to the cloud:
Security concerns (potential for compromised information, issues of privacy when data is stored on a public facility, might be more prone to outside security threats because its high-profile, some providers might not implement the same layers of protection you can achieve in-house)
Lack of operational control: Lack of access to servers (i.e. say you are hacked and want to get to security and system log files; if something goes wrong you have no way of controlling how and when a response is carried out; the provider can update software, change configuration settings, and allocate resources without your input or your blessing; you must conform to the environment and standards implemented by the provider)
Lack of ownership (an outside agency can get to data easier in the cloud data center that you don’t own vs getting to data in your onsite location that you own. Or a concern that you share a cloud data center with other companies and someone from another company can be onsite near your servers)
Compliance restrictions
Regulations (health, financial)
Legal restrictions (i.e. data can’t leave your country)
Company policies
You may be sharing resources on your server, as well as competing for system and network resources
Data getting stolen in-flight (i.e. from the cloud data center to the on-prem user)
Question: Where should be do data transformations (data lake, relational database, Databricks, etc)?
Question: What are the cost vs performance tradeoffs with our products? (many companies will sacrifice performance to save money)