The data lake has become extremely popular, but there is still confusion on how it should be used. In this presentation I will cover common big data architectures that use the data lake, the characteristics and benefits of a data lake, and how it works in conjunction with a relational data warehouse. Then I’ll go into details on using Azure Data Lake Store Gen2 as your data lake, and various typical use cases of the data lake. As a bonus I’ll talk about how to organize a data lake and discuss the various products that can be used in a modern data warehouse.
1. Data Lake Overview
James Serra
Data & AI Architect
Microsoft, NYC MTC
JamesSerra3@gmail.com
Blog: JamesSerra.com
2. About Me
Microsoft, Big Data Evangelist
In IT for 30 years, worked on many BI and DW projects
Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
Have been a permanent employee, contractor, consultant, and business owner
Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure
Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data
Platform Solutions
Blog at JamesSerra.com
Former SQL Server MVP
Author of book “Reporting with Microsoft SQL Server 2012”
3. Agenda
Big Data Architectures
Why data lakes?
Top-down vs Bottom-up
Data lake defined
Creating ADLS Gen2
Data Lake Use Cases
5. Enterprise data warehouse augmentation
• Seen when the EDW has been in existence a while and can’t handle new data
• Data hub, not data lake
• Cons: not offloading EDW work, can’t use existing tools, difficulty joining data in the data hub with the EDW
6. Data hub plus EDW
• Data hub is used as temporary staging and refining, no reporting
• Cons: data hub is temporary, no reporting/analyzing done with the data hub
7. All-in-one
Is the traditional data warehouse dead? https://www.jamesserra.com/archive/2017/12/is-the-traditional-data-warehouse-dead/
• Data hub is the total solution, no EDW
• Cons: queries are slower, new training for reporting tools, difficulty understanding data, security limitations
8. Modern Data Warehouse
• Evolution of the three previous scenarios
• Ultimate goal
• Supports future data needs
• Data harmonized and analyzed in the data lake or moved to the EDW for more quality and performance
9. Modern Data Warehouse (architecture diagram)
INGEST: Azure Data Factory, pulling from logs (unstructured), media (unstructured), files (unstructured), and business/custom apps (structured)
STORE: Azure Data Lake Store Gen2
PREP & TRAIN: Azure Databricks
MODEL & SERVE: Azure SQL Data Warehouse (loaded via PolyBase), Azure Analysis Services, Power BI
Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs.
10. Advanced Analytics (architecture diagram)
DATA SOURCES: Social, LOB, Graph, IoT, Image, CRM
INGEST (data orchestration and monitoring): Azure Data Factory, SSIS
STORE (big data store): Azure Data Lake Storage Gen2, Blob Storage, SQL Server 2019 Big Data Cluster
PREP (& store; transform & clean): Azure Databricks, Azure HDInsight, PolyBase & Stored Procedures, Power BI Dataflows
MODEL & SERVE (data warehouse): Azure SQL Data Warehouse, Azure Analysis Services, SQL Database (Single, MI, Hyperscale, Serverless), SQL Server in a VM, Cosmos DB, Power BI Aggregations
Consumption: AI, BI + Reporting
12. ETL pipeline
Traditional business analytics process (relational and LOB application sources, dedicated ETL tools such as SSIS, a defined schema, queries, results):
1. Start with end-user requirements to identify desired reports and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract required data (curation) and transform it to the target schema (“schema-on-write”)
5. Create reports. Analyze data
All data not immediately required is discarded or archived
13. Harness the growing and changing nature of data
Need to collect any data: structured, unstructured, streaming
Challenge is combining transactional data stored in relational databases with less structured data
Big Data = All Data
Get the right information to the right people at the right time in the right format
15. New big data thinking: All data has value
Use a data lake: gather data from all sources → store indefinitely → analyze → see results → iterate
All data has potential value
Data hoarding
No defined schema—stored in native format
Schema is imposed and transformations are done at query time (schema-on-read)
Apps and users interpret the data as they see fit
17. Two approaches to getting value out of data: top-down + bottom-up
Top-down (theory → hypothesis → observation → confirmation): descriptive analytics (what happened?) and diagnostic analytics (why did it happen?)
Bottom-up (observation → pattern → hypothesis → theory): predictive analytics (what will happen?) and prescriptive analytics (how can we make it happen?)
18. Data Warehousing Uses a Top-Down Approach
Understand corporate strategy → gather requirements (business, technical) → design and implement the data warehouse over the data sources:
• Dimension modelling
• Physical design
• ETL design and development
• Setup infrastructure, install and tune
• Reporting & analytics design and development
19. The “Data Lake” Uses a Bottom-Up Approach
Ingest all data (from devices and other sources) regardless of requirements
Store all data in native format without schema definition
Do analysis using analytic engines like Hadoop: interactive queries, batch queries, machine learning, data warehouse, real-time analytics
20. Data Lake + Data Warehouse Better Together
From the same data sources, combine both approaches: descriptive analytics (what happened?) and diagnostic analytics (why did it happen?) alongside predictive analytics (what will happen?) and prescriptive analytics (how can we make it happen?)
22. Exactly what is a data lake?
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.
• Inexpensively store unlimited data
• Centralized place for multiple subjects (single version of the truth)
• Collect all data “just in case” (data hoarding)
• Easy integration of differently-structured data
• Store data with no modeling – “schema on read” (see the sketch after this list)
• Complements enterprise data warehouse (EDW)
• Frees up expensive EDW resources for queries instead of using EDW resources for transformations (avoiding user contention)
• Hadoop cluster offers faster ETL processing over SMP solutions
• Quick user access to data for power users/data scientists (allowing for faster ROI)
• Data exploration to see if data is valuable before writing ETL and schema for a relational database, or use for a one-time report
• Allows use of Hadoop tools such as ETL and extreme analytics
• Place to land IoT streaming data
• On-line archive or backup for data warehouse data
• With Hadoop/ADLS, high availability and disaster recovery built in
• Keep raw data so you don’t have to go back to the source if you need to re-run
• Allows for data to be used many times for different analytic needs and use cases
• Cost savings and faster transformations: storage tiers with lifecycle management; separation of storage and compute resources allowing multiple instances of different sizes working with the same data simultaneously vs scaling the data warehouse; low-cost storage for raw data saving space on the EDW
• Extreme performance for transformations by having multiple compute options each accessing different folders containing data
• The ability for an end-user or product to easily access the data from any location
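Schema-on-read in practice — a minimal PySpark sketch, assuming a Spark cluster already configured with credentials for the storage account; the container, account name, and column names are placeholders, not from the deck:

# A minimal schema-on-read sketch (PySpark). The raw files carry no
# predefined schema; one is imposed only at query time, so different
# users can read the same files with different schemas.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

schema = StructType([                      # hypothetical columns
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.read
       .schema(schema)                     # schema applied at read, not at write
       .csv("abfss://raw@<account>.dfs.core.windows.net/iot/2019/"))

raw.groupBy("device_id").avg("reading").show()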
23. Current state of a data warehouse
Traditional Approaches
DATA SOURCES (CRM, ERP, OLTP, LOB) → ETL → DATA WAREHOUSE (star schemas, views, other read-optimized structures) → BI AND ANALYTICS (emailed, centrally stored Excel reports and dashboards), with MONITORING AND TELEMETRY throughout
• Well manicured, often relational sources
• Known and expected data volume and formats
• Little to no change
• Complex, rigid transformations
• Required extensive monitoring
• Transformed historical data into read structures
• Flat, canned or multi-dimensional access to historical data
• Many reports, multiple versions of the truth
• 24- to 48-hour delay
24. Current state of a data warehouse
Traditional Approaches
DATA SOURCES (CRM, ERP, OLTP, LOB) → ETL → DATA WAREHOUSE (star schemas, views, other read-optimized structures) → BI AND ANALYTICS (emailed, centrally stored Excel reports and dashboards), with MONITORING AND TELEMETRY throughout
Under increasing data volume, non-relational data, and increasing time pressure, reporting goes stale:
• Increase in variety of data sources
• Increase in data volume
• Increase in types of data
• Pressure on the ingestion engine
• Complex, rigid transformations can no longer keep pace
• Monitoring is abandoned
• Delay in data, inability to transform volumes, or react to new sources
• Repair, adjust and redesign ETL
• Reports become invalid or unusable
• Delay in preserved reports increases
• Users begin to “innovate” to relieve starvation
25. Data Lake Transformation (ELT not ETL)
New Approaches
DATA SOURCES (CRM, ERP, OLTP, LOB, non-relational data, future data sources) → EXTRACT AND LOAD → DATA LAKE → DATA REFINERY PROCESS (transform on read, producing data sets of relevant data) → DATA WAREHOUSE (star schemas, views, other read-optimized structures) → BI AND ANALYTICS (discover and consume predictive analytics, data sets and other reports)
• All data sources are considered
• Leverages the power of on-prem technologies and the cloud for storage and capture
• Native formats, streaming data, big data
• Extract and load, no/minimal transform
• Storage of data in near-native format
• Orchestration becomes possible
• Streaming data accommodation becomes possible
• Refineries transform data on read
• Produce curated data sets to integrate with traditional warehouses
• Users discover published data sets/services using familiar tools
26. Data Analysis Paradigm Shift
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure
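To make the new way concrete, here is a hedged PySpark sketch of ELT against the lake — ingest raw first, structure later. The zone names follow the organization slides later in this deck; the account, container, paths, and column names are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-refine").getOrCreate()
base = "abfss://datalake@<account>.dfs.core.windows.net"

# 1. Ingest: data was landed in the raw zone as-is, with no upfront transform.
raw = spark.read.json(f"{base}/raw/salesforce/customercontacts/2016/12/01/")

# 2. Analyze/refine: cleaning happens after the data is already stored.
curated = (raw.dropDuplicates(["contact_id"])     # hypothetical key column
              .withColumn("load_date", F.current_date()))

# 3. Structure: only now is a read-optimized format and layout imposed.
curated.write.mode("overwrite").parquet(f"{base}/curated/customercontacts/")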
28. Objectives
Plan the structure based on optimal data retrieval
Avoid a chaotic, unorganized data swamp
Common ways to organize the data:
• Data Retention Policy: temporary data; permanent data; applicable period (e.g., project lifetime); etc.
• Business Impact / Criticality: high (HBI); medium (MBI); low (LBI); etc.
• Confidential Classification: public information; internal use only; supplier/partner confidential; personally identifiable information (PII); sensitive – financial; sensitive – intellectual property; etc.
• Probability of Data Access: recent/current data; historical data; etc.
• Owner / Steward / SME
• Subject Area
• Security Boundaries: department; business unit; etc.
• Time Partitioning: year/month/day/hour/minute
• Downstream App/Purpose
Special thanks to: Melissa Coates, CoatesDataStrategies.com
29. Example 1: Organizing a Data Lake
Pros: subject area at top level, organization-wide; partitioned by time
Cons: no obvious security or organizational boundaries
Raw Data Zone — Subject Area / Data Source / Object / Date Loaded / File(s):
Sales/Salesforce/CustomerContacts/2016/12/01/CustContact_2016_12_01.txt
Curated Data Zone — Purpose / Type / Snapshot Date / File(s):
Sales Trending Analysis/Summarized/2016_12_01/SalesTrend_2016_12_01.txt
Thanks to Melissa Coates, www.CoatesDataStrategies.com
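A sketch of landing a file along the raw-zone folder convention above, using the azure-storage-file-datalake Python SDK; the account URL and key are placeholders:

from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>")

fs = service.get_file_system_client("raw")  # the Raw Data Zone container

# Subject Area / Data Source / Object / Date Loaded / File
path = "Sales/Salesforce/CustomerContacts/2016/12/01/CustContact_2016_12_01.txt"

with open("CustContact_2016_12_01.txt", "rb") as f:
    # On a hierarchical-namespace account, parent folders in the path
    # are created implicitly along with the file.
    fs.get_file_client(path).upload_data(f, overwrite=True)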
30. Data Warehouse
Serving, Security & Compliance
• Business people
• Low latency
• Complex joins
• Interactive ad-hoc query
• High number of users
• Additional security
• Large support for tools
• Dashboards
• Easily create reports (Self-service BI)
• Known questions
32. A “no-compromises” Data Lake: secure, performant, massively-scalable Data Lake storage that brings the cost and scale profile of object storage together with the performance and analytics feature set of data lake storage
Azure Data Lake Storage Gen2
SECURE: support for fine-grained ACLs, protecting data at the file and folder level; multi-layered protection via at-rest Storage Service encryption and Azure Active Directory integration
MANAGEABLE: automated lifecycle policy management; object-level tiering
SCALABLE: no limits on data store size; global footprint (50 regions)
FAST: optimized for Spark and Hadoop analytic engines; atomic file operations mean jobs complete faster
COST EFFECTIVE: object store pricing levels; file system operations minimize transactions required for job completion
INTEGRATION READY: tightly integrated with Azure end-to-end analytics solutions
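A sketch of the file/folder-level ACLs mentioned under SECURE, again using the azure-storage-file-datalake Python SDK; the AAD group object ID and credentials are placeholders:

from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>")

sales_dir = service.get_file_system_client("raw").get_directory_client("Sales")

# Grant an AAD group read+execute on the folder (POSIX-style ACL string);
# everything after "group:" in the last entry is a placeholder object ID.
sales_dir.set_access_control(
    acl="user::rwx,group::r-x,other::---,group:<aad-group-object-id>:r-x")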
33. Blob Storage + Data Lake Store = Azure Data Lake Storage Gen2
From Blob Storage: large partner ecosystem; global scale – all 50 regions; durability options; tiered – hot/cool/archive; cost efficient
From Data Lake Store: built for Hadoop; hierarchical namespace; ACLs, AAD and RBAC; performance tuned for big data; very high scale capacity and throughput
34. Missing blob storage features:
- Archive and Premium tier
- Soft Delete
- Snapshots
- Some features in preview
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-known-issues
Missing ADLS Gen1 features:
- Microsoft product support: ADC, Excel, AAS
- 3rd-party products: Informatica, Attunity, Alteryx
- Some features in preview
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-upgrade
35. Azure Data Lake Store – Distributed File System
Files of any size can be stored because ADLS is a distributed system in which file contents are divided into blocks across backend storage nodes (data node 1, 2, 3, 4, …).
A read operation on the file is also parallelized across the nodes.
Blocks are also replicated for fault tolerance.
The ideal file size in ADLS is 256MB – 2GB.
Many very tiny files introduce significant overhead, which reduces performance; this is a well-known issue with storing data in HDFS (one common mitigation is sketched below).
Techniques: append-only data streams
Thanks to Melissa Coates, www.CoatesDataStrategies.com
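The mitigation referenced above, sketched in PySpark under assumed paths: periodically compact many tiny files into a few large ones in the ideal size range.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

path = "abfss://raw@<account>.dfs.core.windows.net/iot/2019/06/01"

df = spark.read.json(path)   # e.g. thousands of tiny per-event files

# Rewrite as a handful of large files; 8 is a placeholder - size the
# target file count from the data's actual on-disk footprint so each
# output file lands in the 256MB-2GB sweet spot.
df.coalesce(8).write.mode("overwrite").parquet(path + "-compacted")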
37. Storage redundancy: LRS, ZRS, GRS, RA-GRS
LRS: multiple replicas across a datacenter; protects against disk, node, and rack failures; write is ack’d when all replicas are committed; superior to dual-parity RAID; 11 9s of durability; SLA: 99.9%
ZRS: replicas across 3 zones; protects against disk, node, rack, and zone failures; synchronous writes to all 3 zones; 12 9s of durability; available in 8 regions; SLA: 99.9%
GRS: multiple replicas across each of 2 regions; protects against major regional disasters; asynchronous replication to secondary; 16 9s of durability; SLA: 99.9%
RA-GRS: GRS + read access to secondary; separate secondary endpoint; RPO delay to secondary can be queried; SLA: 99.99% (read), 99.9% (write)
38. File Sync: Windows Server <-> Azure; local caching; with offline (Data Box) can ‘sync’ the remainder
Fuse: mount blobs as a local FS; commit on write; Linux
Site Replication: on-premises & cloud; Windows, Linux; physical, virtual; Hyper-V, VMware
Network Acceleration: Aspera; Signiant
AzCopy: throughput +30%; S3 to Azure Blobs; sync to cloud; high latency 10-100%
NetApp: CloudSync; SnapMirror; SnapVault
Data Factory: on-premises & cloud sources; structured & unstructured; over 60 connectors; UI design data flow
Partners: Peer Global File Service; Talon FAST; Zerto; …
Offline: Data Box; Data Box Heavy; Data Box Disk; Disk Import / Export
Fast Data Transfer: microsoft.com/en-us/garage/profiles/fast-data-transfer/
39. Data Box Heavy (PREVIEW): capacity 1 PB; weight 500+ lbs; secure, ruggedized appliance; same service as Data Box, but targeted to petabyte-sized datasets
Data Box Gateway: virtual device provisioned in your hypervisor; supports storage gateway, SMB, NFS, Azure blob, files; virtual network transfer appliance (VM), runs on your choice of hardware
Data Box Edge: local cache capacity ~12 TB; includes Data Box Gateway and Azure IoT Edge; manages uploads to Azure and can pre-process data prior to upload
Data Box: capacity 100 TB; weight ~50 lbs; secure, ruggedized appliance; enables bulk migration to Azure when network isn’t an option
Data Box Disk: capacity 8 TB each, 40 TB/order; secure, ruggedized USB drives orderable in packs of 5 (up to 40 TB); perfect for projects that require a smaller form factor, e.g., autonomous vehicles
Diagram labels: offline data transfer (order, fill, send, return, upload); online/network data transfer; edge compute (cloud to edge, edge to cloud, pre-processing, ML inferencing)
48. Data Lake Use Cases: Ingestion of New File Types
Raw data from social media data, devices & sensors, web logs, images, audio, video, and spatial/GPS sources lands in the data lake, feeding exploratory analysis and a data science sandbox.
• Preparatory file storage for multi-structured data
• Exploratory analysis + POCs to determine the value of new data types & sources
• Affords additional time for longer-term planning while accumulating data or handling an influx of data
Thanks to Melissa Coates, www.CoatesDataStrategies.com
49. Data Lake Use Cases: Data Science Experimentation | Hadoop Integration
Raw data from social media data, devices & sensors, flat files, and spatial/GPS sources lands in the data lake, feeding exploratory analysis, a data science sandbox, Hadoop, advanced analytics, machine learning, and curated data for analytics & reporting.
• Sandbox solutions for initial data prep, experimentation, and analysis
• Migrate from proof of concept to operationalized solution
• Integrate with open source projects such as Hive, Pig, Spark, Storm, etc.
• Big data clusters
• SQL-on-Hadoop solutions
Thanks to Melissa Coates, www.CoatesDataStrategies.com
50. Data Lake Use Cases: Data Warehouse Staging Area
Raw data from social media data, devices & sensors, corporate data, cloud systems, and third-party data/flat files lands in the data lake as a staging area; data processing jobs feed the data warehouse, then cubes & semantic models and analytics & reporting.
• ELT strategy
• Reduce storage needs in the relational platform by using the data lake as the landing area
• Practical use for data stored in the data lake
• Potentially also handle transformations in the data lake
Thanks to Melissa Coates, www.CoatesDataStrategies.com
51. Data Lake Use Cases: Integration with DW | Data Archival | Centralization
Raw data (staging area) and archived data from social media data, devices & sensors, corporate data, cloud systems, and third-party data/flat files sit in the data lake alongside the data warehouse, cubes & semantic models, and analytics & reporting.
• Grow around the existing DW
• Aged data available for querying when needed
• Complement to the DW via data virtualization
• Federated queries to access current data (relational DB) + archive (data lake)
Thanks to Melissa Coates, www.CoatesDataStrategies.com
52. Data Lake Use Cases: Lambda Architecture | Near Real-Time
Data from social media data and devices & sensors flows through data ingestion into a speed layer for low-latency processing and into a batch layer (data lake: raw data, plus corporate data) for data processing; a serving layer (data lake: curated data, data warehouse, cubes & semantic models) feeds analytics & reporting.
• Support for low-latency, high-velocity data in near real time (a landing sketch follows below)
• Support for batch-oriented operations
Thanks to Melissa Coates, www.CoatesDataStrategies.com
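The landing sketch referenced above: stream events into the raw zone with Spark Structured Streaming. It assumes the azure-eventhubs-spark connector is installed; the connection string, account, and paths are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("speed-layer-landing").getOrCreate()

conn = "<event-hubs-connection-string>"
# Recent versions of the connector expect the connection string encrypted:
ehConf = {"eventhubs.connectionString":
          spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn)}

events = (spark.readStream
          .format("eventhubs")
          .options(**ehConf)
          .load())

# Land raw events continuously in the lake's raw zone (the batch layer
# picks them up later); the checkpoint folder tracks stream progress.
(events.writeStream
 .format("parquet")
 .option("path", "abfss://raw@<account>.dfs.core.windows.net/iot/stream/")
 .option("checkpointLocation", "abfss://raw@<account>.dfs.core.windows.net/_checkpoints/iot/")
 .start())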
53. Q & A
James Serra, Big Data Evangelist
Email me at: JamesSerra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)
Editor's Notes
Fluff, but point is I bring real work experience to the session
Removed
- SMP vs MPP
This scenario uses an enterprise data warehouse (EDW) built on a RDBMS, but will extract data from the EDW and load it into a big data hub along with data from other sources that are deemed not cost-effective to move into the EDW (usually high-volume data or cold data). Some data enrichment is usually done in the data hub. This data hub can then be queried, but primary analytics remain with the EDW. The data hub is usually built on Hadoop or NoSQL. This can save costs since storage using Hadoop or NoSQL is much cheaper than an EDW. Plus, this can speed up the development of reports since the data in Hadoop or NoSQL can be used right away instead of waiting for an IT person to write the ETL and create the schema’s to ingest the data into the EDW. Another benefit is it can support data growth faster as it is easy to expand storage on a Hadoop/NoSQL solution instead of on a SAN with an EDW solution. Finally, it can help by reducing the number of queries on the EDW.
This scenario is most common when an EDW has been in existence for a while and users are requesting data that the EDW cannot handle because of space, performance, and data loading times.
The challenges to this approach are that you might not be able to use your existing tools to query the data hub, and that the data in the hub can be difficult to understand and join and may not be completely clean. Not using the big data repository as a cleaning area or offloading from the EDW.
The data hub is used as a data staging and extreme-scale data transformation platform, but long-term persistence and analytics is performed in the EDW. Hadoop or NoSQL is used to refine the data in the data hub. Once refined, the data is copied to the EDW and then deleted from the data hub.
This will lower the cost of data capture, provide scalable data refinement, and provide fast queries via the EDW. It also offloads the data refinement from the EDW.
Cons: data hub is temporary, no reporting or analyzing done with it
A distributed data system is implemented for long-term, high-detail big data persistence in the data hub and analytics without employing an EDW. Low-level code is written or big data packages are added that integrate directly with the distributed data store for extreme-scale operations and analytics.
The distributed data hub is usually created with Hadoop, HBase, Cassandra, or MongoDB. BI tools specifically integrated with or designed for distributed data access and manipulation are needed. Data operations either use BI tools that provide NoSQL capability or low-level code is required (e.g., MapReduce or Pig script).
The disadvantages of this scenario are reports and queries can have longer latency, new reporting tools require training which could lead to lower adoption, and the difficulty of providing governance and structure on top of a non-RDBMS solution.
An evolution of the three previous scenarios that provides multiple options for the various technologies. Data may be harmonized and analyzed in the data lake or moved out to an EDW when more quality and performance is needed, or when users simply want control. ELT is usually used instead of ETL (see Difference between ETL and ELT). The goal of this scenario is to support any future data needs no matter what the variety, volume, or velocity of the data.
Hub-and-spoke should be your ultimate goal. See Why use a data lake? for more details on the various tools and technologies that can be used for the modern data warehouse.
Key Points:
Businesses can use new data streams to gain a competitive advantage.
Microsoft is uniquely equipped to help you manage the growing volume and variety of data: structured, unstructured, and streaming.
Talk Track:
Does it not seem like every day there is a new kind of data that we need to understand?
New data types continue to expand—we need to be prepared to collect that data so that the organization can then go do something with it.
Structured data, the type of data we have been working with for years, continues to accelerate. Think how many transactions are occurring across your business.
Unstructured data, the typical source of all our big data, takes many forms and originates from various places across the web including social.
Streaming data is the data at the heart of the Internet of Things revolution. Just think about how many things in your organization are smart or instrumented and generating data every second.
All of this means that data volumes are growing and bringing new capacity challenges. You are also dealing with an enormous opportunity, taking all of this data and putting it to work. In order to take advantage of all this data, you first need a platform that enables you to collect any data—no matter the size or type. The Microsoft data platform is uniquely complete and can help you collect any data using a flexible approach:
Collecting data on-premises with SQL Server
SQL Server can help you collect and manage structured, unstructured, and streaming data to power all your workloads: OLTP, BI, and Data Warehousing
With new in-memory capabilities that are built into SQL Server 2014, you get the benefit of breakthrough speed with your existing hardware and without having to rewrite your apps.
If you’ve been considering the cloud, SQL Server provides an on-ramp to help you get started. Using the wizards built into SQL Server Management Studio, extending to the cloud by combining SQL and Microsoft Azure is simple.
Capture new data types using the power and flexibility of the Microsoft Azure Cloud
Azure is well equipped to provide the flexibility you need to collect and manage any data in the cloud in a way that meets the needs of your business.
Big data in Azure: HDInsight: an Apache Hadoop-based analytics solution that allows cluster deployment in minutes, scale up or down as needed, and insights through familiar BI tools.
SQL Databases: managed relational SQL Database-as-a-service that offers business-ready capabilities built on SQL Server technology.
Blobs: a cloud storage solution offering the simplest way to store large amounts of unstructured text or binary data, such as video, audio, and images.
Tables: a NoSQL key/value storage solution that provides simple access to data at a lower cost for applications that do not need robust querying capabilities.
Intelligent Systems Service: cloud service that helps enterprises embrace the Internet of Things by securely connecting, managing, and capturing machine-generated data from a variety of sensors and devices to drive improvements in operations and tap into new business opportunities.
Machine Learning: if you’re looking to anticipate business challenges or opportunities, or perhaps expand your data practice into data science, Azure’s new Machine Learning service—cloud-based predictive analytics— can help. ML Studio is a fully-managed cloud service that enables data scientists and developers to efficiently embed predictive analytics into their applications, helping organizations use massive data sets and bring all the benefits of the cloud to machine learning.
Document DB: a fully managed, highly scalable, NoSQL document database service
Azure Stream Analytics: real-time event processing engine that helps uncover insights from devices, sensors, infrastructure, applications, and data
Azure Data Factory: enables information production by orchestrating and managing diverse data
Azure Event Hubs: a scalable service for collecting data from millions of “things” in seconds
Microsoft Analytics Platform System:
In the past, to provide users with reliable, trustworthy information, enterprises gathered relational and transactional data in a single data warehouse.
But this traditional data warehouse is under pressure, hitting limits amidst massive change.
Data volumes are projected to grow tenfold over the next five years. End users want real-time responses and insights.
They want to use non-relational data, which now constitutes 85 percent of data growth. They want access to “cloud-born” data, data that was created from growing cloud IT investments.
Your enterprise can only cope with these shifts with a modern data warehouse—the Microsoft Analytics Platform System is the answer.
The Analytics Platform System brings Microsoft’s massively parallel processing (MPP) data warehouse technology—the SQL Server Parallel Data Warehouse (PDW), together with HDInsight, Microsoft’s 100 percent Apache Hadoop distribution—and delivers it as a turnkey appliance.
Now you can collect relational and non-relational data in one appliance.
You can have seamless integration of the relational data warehouse and Hadoop with PolyBase.
All of these options give you the flexibility to get the most out of your existing data capture investments while providing a path to a more efficient and optimized data environment that is ready to support new data types.
All data has immediate or potential value
This leads to data hoarding—all data is stored indefinitely
With an unknown future, there is no defined schema. Data is prepared and stored in native format; no upfront transformation or aggregation.
Schema is imposed and transformations are done at query time (schema-on-read). Applications and users interpret the data as they see fit.
Top-down starts with descriptive analytics and progresses to prescriptive analytics. Know the questions to ask. Lots of upfront work to get data to where you can use it.
Bottoms up starts with predictive analytics. Don’t know the questions to ask. Little work needs to be done to start using data
There are two approaches to doing information management for analytics:
Top-down (deductive approach). This is where analytics is done starting with a clear understanding of corporate strategy, where theories and hypotheses are made up front. The right data model is then designed and implemented prior to any data collection. Oftentimes, the top-down approach is good for descriptive and diagnostic analytics. What happened in the past and why did it happen?
Bottom-up (inductive approach). This is the approach where data is collected up front before any theories and hypotheses are made. All data is kept so that patterns and conclusions can be derived from the data itself. This type of analysis allows for more advanced analytics such as doing predictive or prescriptive analytics: what will happen and/or how can we make it happen?
In Gartner’s 2013 study, “Big Data Business Benefits Are Hampered by ‘Culture Clash’”, they make the argument that both approaches are needed for innovation to be successful. Oftentimes what happens in the bottom-up approach becomes part of the top-down approach.
https://www.jamesserra.com/archive/2017/06/data-lake-details/
https://blog.pythian.com/reduce-costs-by-adding-a-data-lake-to-your-cloud-data-warehouse/
Also called bit bucket, staging area, landing zone or enterprise data hub (Cloudera)
http://www.jamesserra.com/archive/2014/05/hadoop-and-data-warehouses/
http://www.jamesserra.com/archive/2014/12/the-modern-data-warehouse/
http://adtmag.com/articles/2014/07/28/gartner-warns-on-data-lakes.aspx
http://intellyx.com/2015/01/30/make-sure-your-data-lake-is-both-just-in-case-and-just-in-time/
http://www.blue-granite.com/blog/bid/402596/Top-Five-Differences-between-Data-Lakes-and-Data-Warehouses
http://www.martinsights.com/?p=1088
http://data-informed.com/hadoop-vs-data-warehouse-comparing-apples-oranges/
http://www.martinsights.com/?p=1082
http://www.martinsights.com/?p=1094
http://www.martinsights.com/?p=1102
Why move relational data to the data lake? Offload processing to refine data to free up the EDW, use low-cost storage for raw data to save space on the EDW, and help if ETL jobs on the EDW are taking too long. So you can actually use a data lake for small data – move EDW data to Hadoop, refine it, move it back to the EDW. Cons: rewriting all current ETL for Hadoop, re-training.
I believe APS should be used for staging (i.e. “ELT”) in most cases, but there are some good use cases for using a Hadoop Data Lake:
- Wanting to offload the data refinement to Hadoop, so the processing and space on the EDW is reduced
- Wanting to use some Hadoop technologies/tools to refine/filter data that are not available for APS
- Landing zone for unstructured data, as it can ingest large files quickly and provide data redundancy
- ELT jobs on EDW are taking too long, so offload some of them to the Hadoop data lake
- There may be cases when you want to move EDW data to Hadoop, refine it, and move it back to EDW (offload processing, need to use Hadoop tools)
- The data lake is a good place for data that you “might” use down the road. You can land it in the data lake and have users use SQL via Polybase to look at the data and determine if it has value
https://www.sqlchick.com/entries/2017/12/30/zones-in-a-data-lake
https://www.sqlchick.com/entries/2016/7/31/data-lake-use-cases-and-planning
Question: Do you see many companies building data lakes?
Raw: Raw events are stored for historical reference. Also called staging layer or landing area
Cleansed: Raw events are transformed (cleaned and mastered) into directly consumable data sets. The aim is to standardize the way files are stored in terms of encoding, format, data types and content (i.e. strings). Also called the conformed layer.
Application: Business logic is applied to the cleansed data to produce data ready to be consumed by applications (i.e. DW application, advanced analysis process, etc). This is also called by a lot of other names: workspace, trusted, gold, secure, production ready, governed, presentation
Sandbox: Optional layer to be used to “play” in. Also called exploration layer or data science workspace
Thanks to Melissa Coates, www.CoatesDataStrategies.com
Gen1 vs Gen2:
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-upgrade