SlideShare a Scribd company logo
1 of 53
Data Lake Overview
James Serra
Data & AI Architect
Microsoft, NYC MTC
JamesSerra3@gmail.com
Blog: JamesSerra.com
About Me
 Microsoft, Big Data Evangelist
 In IT for 30 years, worked on many BI and DW projects
 Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM
architect, PDW/APS developer
 Been perm employee, contractor, consultant, business owner
 Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
 Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure
Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data
Platform Solutions
 Blog at JamesSerra.com
 Former SQL Server MVP
 Author of book “Reporting with Microsoft SQL Server 2012”
Agenda
 Big Data Architectures
 Why data lakes?
 Top-down vs Bottom-up
 Data lake defined
 Creating ADLS Gen2
 Data Lake Use Cases
?
?
?
?
Big Data Architectures
Enterprise data warehouse augmentation
• Seen when EDW has been in
existence a while and EDW can’t
handle new data
• Data hub, not data lake
• Cons: not offloading EDW work,
can’t use existing tools, difficulty
joining data in data hub with EDW
Data hub plus EDW
• Data hub is used as temporary
staging and refining, no reporting
• Cons: data hub is temporary, no
reporting/analyzing done with the
data hub
(temporary)
All-in-one
Is the traditional data warehouse dead? https://www.jamesserra.com/archive/2017/12/is-the-traditional-data-warehouse-dead/
• Data hub is total solution, no EDW
• Cons: queries are slower, new
training for reporting tools, difficulty
understanding data, security
limitations
Modern Data Warehouse
• Evolution of three previous scenarios
• Ultimate goal
• Supports future data needs
• Data harmonized and analyzed in the
data lake or moved to EDW for more
quality and performance
INGEST STORE PREP & TRAIN MODEL & SERVE
M O D E R N D A T A W A R E H O U S E
Azure Data Lake Store Gen2
Logs (unstructured)
Azure Data Factory
Azure Databricks
Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs.
Media (unstructured)
Files (unstructured)
PolyBase
Business/custom apps
(structured)
Azure SQL Data
Warehouse
Azure Analysis
Services
Power BI
Advanced Analytics
Social
LOB
Graph
IoT
Image
CRM
INGEST STORE PREP MODEL & SERVE
(& store)
Data orchestration
and monitoring
Big data store Transform & Clean Data warehouse
AI
BI + Reporting
Azure Data Factory
SSIS
Azure Data Lake
Storage Gen2
Blob Storage
SQL Server 2019 Big
Data Cluster
Azure Databricks
Azure HDInsight
PolyBase & Stored
Procedures
Power BI Dataflows
Azure SQL Data Warehouse
Azure Analysis Services
SQL Database (Single, MI,
HyperScale, Serverless)
SQL Server in a VM
Cosmos DB
Power BI Aggregations
?
?
?
?
Why data lakes?
ETL pipeline
Dedicated ETL tools (e.g. SSIS)
Defined schema
Queries
Results
Relational
LOB
Applications
Traditional business analytics process
1. Start with end-user requirements to identify desired reports
and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create a Extract-Transform-Load (ETL) pipeline to extract
required data (curation) and transform it to target schema
(‘schema-on-write’)
5. Create reports. Analyze data
All data not immediately required is discarded or archived
12
Harness the growing and changing nature of data
Need to collect any data
StreamingStructured
Challenge is combining transactional data stored in relational databases with less structured data
Big Data = All Data
Get the right information to the right people at the right time in the right format
Unstructured
“ ”
The three V’s
Store indefinitely Analyze See results
Gather data
from all sources
Iterate
New big data thinking: All data has value
Use a data lake:
All data has potential value
Data hoarding
No defined schema—stored in native format
Schema is imposed and transformations are done at query time (schema-on-read).
Apps and users interpret the data as they see fit
?
?
?
?
Top-down vs Bottom-up
Observation
Pattern
Theory
Hypothesis
What will
happen?
How can we
make it happen?
Predictive
Analytics
Prescriptive
Analytics
What
happened?
Why did
it happen?
Descriptive
Analytics
Diagnostic
Analytics
Confirmation
Theory
Hypothesis
Observation
Two Approaches to getting value out of data: Top-Down +
Bottoms-Up
Implement Data Warehouse
Physical Design
ETL
Development
Reporting &
Analytics
Development
Install and Tune
Reporting &
Analytics Design
Dimension Modelling
ETL Design
Setup Infrastructure
Understand
Corporate
Strategy
Data Warehousing Uses A Top-Down Approach
Data sources
Gather
Requirements
Business
Requirements
Technical
Requirements
The “data lake” Uses A Bottoms-Up Approach
Ingest all data
regardless of requirements
Store all data
in native format without
schema definition
Do analysis
Using analytic engines
like Hadoop
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
Data Lake + Data Warehouse Better Together
Data sources
What happened?
Descriptive
Analytics
Diagnostic
Analytics
Why did it happen?
What will happen?
Predictive
Analytics
Prescriptive
Analytics
How can we make it happen?
?
?
?
?
Data lake defined
Exactly what is a data lake?
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native
format until it is needed.
• Inexpensively store unlimited data
• Centralized place for multiple subjects (single version of the truth)
• Collect all data “just in case” (data hoarding)
• Easy integration of differently-structured data
• Store data with no modeling – “Schema on read”
• Complements enterprise data warehouse (EDW)
• Frees up expensive EDW resources for queries instead of using EDW resources for transformations (avoiding user contention)
• Hadoop cluster offers faster ETL processing over SMP solutions
• Quick user access to data for power users/data scientists (allowing for faster ROI)
• Data exploration to see if data valuable before writing ETL and schema for relational database, or use for one-time report
• Allows use of Hadoop tools such as ETL and extreme analytics
• Place to land IoT streaming data
• On-line archive or backup for data warehouse data
• With Hadoop/ADLS, high availability and disaster recovery built in
• Keep raw data so don’t have to go back to source if need to re-run
• Allows for data to be used many times for different analytic needs and use cases
• Cost savings and faster transformations: storage tiers with lifecycle management; separation of storage and compute resources allowing multiple instances of different
sizes working with the same data simultaneously vs scaling data warehouse; low-cost storage for raw data saving space on the EDW
• Extreme performance for transformations by having multiple compute options each accessing different folders containing data
• The ability for an end-user or product to easily access the data from any location
Current state of a data warehouse
Traditional Approaches
CRMERPOLTP LOB
DATA SOURCES ETL DATA WAREHOUSE
Star schemas,
views
other read-
optimized
structures
BI AND ANALYTCIS
Emailed,
centrally
stored Excel
reports and
dashboards
Well manicured, often relational
sources
Known and expected data volume
and formats
Little to no change
Complex, rigid transformations
Required extensive monitoring
Transformed historical into read
structures
Flat, canned or multi-dimensional
access to historical data
Many reports, multiple versions of
the truth
24 to 48h delay
MONITORING AND TELEMETRY
Current state of a data warehouse
Traditional Approaches
CRMERPOLTP LOB
DATA SOURCES ETL DATA WAREHOUSE
Star schemas,
views
other read-
optimized
structures
BI AND ANALYTCIS
Emailed,
centrally
stored Excel
reports and
dashboards
Increase in variety of data sources
Increase in data volume
Increase in types of data
Pressure on the ingestion engine
Complex, rigid transformations can’t
longer keep pace
Monitoring is abandoned
Delay in data, inability to transform
volumes, or react to new sources
Repair, adjust and redesign ETL
Reports become invalid or unusable
Delay in preserved reports increases
Users begin to “innovate” to relieve
starvation
MONITORING AND TELEMETRY
INCREASING DATA VOLUME NON-RELATIONAL DATA
INCREASE IN TIME
STALE REPORTING
Data Lake Transformation (ELT not ETL)
New Approaches
All data sources are considered
Leverages the power of on-prem
technologies and the cloud for
storage and capture
Native formats, streaming data, big
data
Extract and load, no/minimal transform
Storage of data in near-native format
Orchestration becomes possible
Streaming data accommodation becomes
possible
Refineries transform data on read
Produce curated data sets to
integrate with traditional warehouses
Users discover published data
sets/services using familiar tools
CRMERPOLTP LOB
DATA SOURCES
FUTURE DATA
SOURCESNON-RELATIONAL DATA
EXTRACT AND LOAD
DATA LAKE DATA REFINERY PROCESS
(TRANSFORM ON READ)
Transform
relevant data
into data sets
BI AND ANALYTCIS
Discover and
consume
predictive
analytics, data
sets and other
reports
DATA WAREHOUSE
Star schemas,
views
other read-
optimized
structures
Data Analysis Paradigm Shift
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure
Needs data governance so your data lake does not turn
into a data swamp!
Objectives
 Plan the structure based on optimal data retrieval
 Avoid a chaotic, unorganized data swamp
Data Retention Policy
Temporary data
Permanent data
Applicable period (ex: project lifetime)
etc…
Business Impact / Criticality
High (HBI)
Medium (MBI)
Low (LBI)
etc…
Confidential Classification
Public information
Internal use only
Supplier/partner confidential
Personally identifiable information (PII)
Sensitive – financial
Sensitive – intellectual property
etc…
Probability of Data Access
Recent/current data
Historical data
etc…
Owner / Steward / SME
Subject Area
Security Boundaries
Department
Business unit
etc…
Time Partitioning
Year/Month/Day/Hour/Minute
Downstream App/Purpose
Common ways to organize the data:
Special thanks to:
Melissa Coates
CoatesDataStrategies.com
Example 1
Pros: Subject area at top level, organization-wide
Partitioned by time
Cons: No obvious security or organizational boundaries
Raw Data Zone
Subject Area
Data Source
Object
Date Loaded
File(s)
----------------------------------------------------
Sales
Salesforce
CustomerContacts
2016
12
01
CustContact_2016_12_01.txt
Curated Data Zone
Purpose
Type
Snapshot Date
File(s)
-----------------------------------
Sales Trending Analysis
Summarized
2016_12_01
SalesTrend_2016_12_01.txt
Organizing a Data Lake
Thanks to Melissa Coates,
www.CoatesDataStrategies.com
Data Warehouse
Serving, Security & Compliance
• Business people
• Low latency
• Complex joins
• Interactive ad-hoc query
• High number of users
• Additional security
• Large support for tools
• Dashboards
• Easily create reports (Self-service BI)
• Know questions
?
?
?
?
Creating ADLS Gen2
A “no-compromises” Data Lake: secure, performant, massively-scalable Data Lake storage that brings the cost and
scale profile of object storage together with the performance and analytics feature set of data lake storage
A z u r e D a t a L a k e S t o r a g e G e n 2
M A N A G E A B L E S C A L A B L EF A S TS E C U R E
 No limits on
data store size
 Global footprint
(50 regions)
 Optimized for Spark
and Hadoop Analytic
Engines
 Tightly integrated
with Azure end to
end analytics
solutions
 Automated
Lifecycle Policy
Management
 Object Level
tiering
 Support for fine-grained
ACLs, protecting data at
the file and folder level
 Multi-layered protection
via at-rest Storage
Service encryption and
Azure Active Directory
integration
C O S T
E F F E C T I V E
I N T E G R AT I O N
R E A D Y
 Atomic file
operations
means jobs
complete faster
 Object store
pricing levels
 File system
operations
minimize
transactions
required for job
completion
Blob Storage Data Lake Store
Azure Data Lake Storage Gen2
Large partner ecosystem
Global scale – All 50 regions
Durability options
Tiered - Hot/Cool/Archive
Cost Efficient
Built for Hadoop
Hierarchical namespace
ACLs, AAD and RBAC
Performance tuned for big data
Very high scale capacity and throughput
Large partner ecosystem
Global scale – All 50 regions
Durability options
Tiered - Hot/Cool/Archive
Cost Efficient
Built for Hadoop
Hierarchical namespace
ACLs, AAD and RBAC
Performance tuned for big data
Very high scale capacity and throughput
Missing blob storage features:
- Archive and Premium tier
- Soft Delete
- Snapshots
- Some features in preview
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-known-issues
Missing ADLS Gen1 features:
- Microsoft product support: ADC, Excel, AAS
- 3rd-party products: Informatica, Attunity, Alteryx
- Some features in preview
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-upgrade
Azure Data Lake Store – Distributed File
System
ADLS File
Files of any size can be stored because ADLS is a distributed system which file contents are
divided up across backend storage nodes.
A read operation on the file is also parallelized across the nodes.
Blocks are also replicated for fault tolerance.
Data
Node 1
Block
Data
Node 2
Block
Data
Node 4
Block
Data
Node 3
Block
The ideal file size in ADLS is 256MB –
2GB in size.
Many very tiny files introduces
significant overhead which reduces
performance. This is a well-known
issue with storing data in HDFS.
Techniques:
• Append-only data streams Thanks to Melissa Coates, www.CoatesDataStrategies.com
LRS
Multiple replicas across
a datacenter
Protect against disk,
node, rack failures
Write is ack’d when all
replicas are committed
Superior to dual-parity
RAID
11 9s of durability
SLA: 99.9%
GRS
Multiple replicas across each
of 2 regions
Protects against major
regional disasters
Asynchronous to secondary
16 9s of durability
SLA: 99.9%
RA-GRS
GRS + Read access to secondary
Separate secondary endpoint
RPO delay to secondary can be
queried
SLA: 99.99% (read), 99.9% (write)
Zone 1
ZRS
Replicas across 3 Zones
Protect against disk, node, rack and
zone failures
Synchronous writes to all 3 zones
12 9s of durability
Available in 8 regions
SLA: 99.9%
Zone 2 Zone 3
File Sync
• Windows Srv <-> Azure
• Local caching
• With offline (Databox) can
'sync' remainder
Fuse
• Mount blobs as local FS
• Commit on write
• Linux
Site Replication
• On premise & cloud
• Windows, Linux
• Physical, virtual
• Hyper-V, VMWare
Network Acceleration
• Aspera
• Signiant
AZCopy
• Throughput +30%
• S3 to Azure Blobs
• Sync to cloud
• Hi Latency 10-100%
NetApp
• CloudSync
• SnapMirror
• SnapVault
Data Factory
• On premise & cloud sources
• Structured & unstructured
• Over 60 connectors
• UI design data flow
Partners
• Peer Global File Service
• Talon FAST
• Zerto
• …
Offline
• Data Box
• Data Box Heavy
• Data Box Disk
• Disk Import / Export
Fast Data Transfer
microsoft.com/en-us/garage/profiles/fast-data-transfer/
Data Box Heavy PREVIEW
• Capacity: 1 PB
• Weight 500+ lbs
• Secure, ruggedized
appliance
• Same service as Data Box,
but targeted to petabyte-
sized datasets.
Data Box Gateway
• Virtual device provisioned in
your hypervisor
• Supports storage gateway,
SMB, NFS, Azure blob, files
• Virtual network transfer
appliance (VM), runs on your
choice of hardware.
Data Box Edge
• Local Cache Capacity: ~12 TB
• Includes Data Box Gateway
and Azure IoT Edge.
• Data Box Edge manages
uploads to Azure and can
pre-process data prior to
upload.
Data Box
• Capacity: 100 TB
• Weight: ~50 lbs
• Secure, ruggedized
appliance
• Data Box enables bulk
migration to Azure when
network isn’t an option.
Data Box Disk
• Capacity: 8TB ea.; 40TB/order
• Secure, ruggedized USB
drives orderable in packs of 5
(up to 40TB).
• Perfect for projects that
require a smaller form factor,
e.g., autonomous vehicles.
Order Fill UploadSend Return Cloud to Edge Edge to Cloud Pre-processing ML Inferencing
Network Data Transfer Edge Compute
Offline Data Transfer Online Data Transfer
https://azure.microsoft.com/en-us/features/storage-explorer/
Caching Layer (Avere tech)
Extra Hot Tier - Premium (SSD + NVME)
Hot Tier (HDD)
Cool Tier (HDD)
Cooler Tier (Pelican)
Archive Tier (Tape)
Deep Storage Tier (Glass, DNA, etc.)
Analytics Engines (Hadoop,
Spark, SCOPE …)
High Performance
Compute
AI / ML
Current Future
Edge
Azure File Sync
Azure Backup
Data Box
Data Box
Edge
Azure
Stack
REST HDFS NFS SMB …
Automatic
Lifecycle
Management
Priority
Retrieval
[preview]
1hr vs 15
Avere
FXT
?
?
?
?
Data Lake Use Cases
Data Lake Use Cases
Social
Media Data
Devices &
Sensors
Data Lake
 Preparatory file storage for
multi-structured data
 Exploratory analysis +
POCs to determine value of
new data types & sources
 Affords additional time for
longer-term planning while
accumulating data or
handling an influx of data
Raw Data
Web Logs
Images, Audio,
Video
Spatial,
GPS
Exploratory
Analysis
Data Science
Sandbox
Ingestion of New File Types
Thanks to Melissa Coates,
www.CoatesDataStrategies.com
Data Lake Use Cases
Social
Media Data
Devices &
Sensors
Data Lake
 Sandbox solutions for initial
data prep, experimentation,
and analysis
 Migrate from proof of concept
to operationalized solution
 Integrate with open source
projects such as Hive, Pig,
Spark, Storm, etc.
 Big data clusters
 SQL-on-Hadoop solutions
Raw Data
Spatial,
GPS
Exploratory
Analysis
Data Science
Sandbox
Hadoop
Advanced
Analytics
Curated Data
Analytics &
Reporting
Machine
Learning
Data Science Experimentation | Hadoop
Integration
Flat Files
Thanks to Melissa Coates,
www.CoatesDataStrategies.com
Data Lake Use Cases
Social
Media Data
Devices &
Sensors
Data Lake
Data
Warehouse
Analytics &
Reporting
Cubes &
Semantic
Models
Corporate
Data
Cloud
Systems
Third Party
Data, Flat Files
 ELT strategy
 Reduce storage needs in
relational platform by
using the data lake as
landing area
 Practical use for data
stored in the data lake
 Potentially also handle
transformations in the
data lake
Data
Processing
Jobs
Raw Data:
Staging Area
Data Warehouse Staging Area
Thanks to Melissa Coates,
www.CoatesDataStrategies.com
Data Lake Use Cases
Social
Media Data
Devices &
Sensors
Data Lake
Data
Warehouse
Analytics &
Reporting
Cubes &
Semantic
Models
Corporate
Data
Cloud
Systems
Third Party
Data, Flat Files
 Grow around existing
DW
 Aged data available for
querying when needed
 Complement to the DW
via data virtualization
 Federated queries to
access current data
(relational DB) +
archive (data lake)
Raw Data:
Staging Area
Archived Data
Integration with DW | Data Archival |
Centralization
Thanks to Melissa Coates,
www.CoatesDataStrategies.com
Data Lake Use Cases
Social
Media Data
Devices &
Sensors
Speed Layer
Data
Warehouse Analytics &
Reporting
Cubes &
Semantic
Models
 Support for low-latency,
high-velocity data in
near real time
 Support for batch-
oriented operations
Batch Layer
Serving Layer
Data
Ingestion
Data
Processing
Data Lake:
Raw Data
Corporate
Data
Data Lake:
Curated Data
Lambda Architecture
Near Real-Time
Thanks to Melissa Coates,
www.CoatesDataStrategies.com
Q & A
James Serra, Big Data Evangelist
Email me at: JamesSerra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)

More Related Content

What's hot

Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's includedJames Serra
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks FundamentalsDalibor Wijas
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure DatabricksJames Serra
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsKhalid Salama
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureDatabricks
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxCalvinSim10
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data MeshLibbySchulze
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)James Serra
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?James Serra
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overviewJames Serra
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouseJames Serra
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureDmitry Anoshin
 

What's hot (20)

Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptx
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Data mesh
Data meshData mesh
Data mesh
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 

Similar to Data Lake Overview

Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Michael Rys
 
A lap around microsofts business intelligence platform
A lap around microsofts business intelligence platformA lap around microsofts business intelligence platform
A lap around microsofts business intelligence platformIke Ellis
 
Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lakepunedevscom
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureJames Serra
 
Power BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data SolutionsPower BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data SolutionsJames Serra
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)Moacyr Passador
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionJames Serra
 
Exploring Microsoft Azure Infrastructures
Exploring Microsoft Azure InfrastructuresExploring Microsoft Azure Infrastructures
Exploring Microsoft Azure InfrastructuresCCG
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarQuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarRTTS
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Amazon Web Services LATAM
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudMark Kromer
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data AnalyticsAttunity
 
Derfor skal du bruge en DataLake
Derfor skal du bruge en DataLakeDerfor skal du bruge en DataLake
Derfor skal du bruge en DataLakeMicrosoft
 
How does Microsoft solve Big Data?
How does Microsoft solve Big Data?How does Microsoft solve Big Data?
How does Microsoft solve Big Data?James Serra
 
Business intelligence
Business intelligenceBusiness intelligence
Business intelligenceshraddha mane
 
Sql server 2008 business intelligence tdm deck
Sql server 2008 business intelligence tdm deckSql server 2008 business intelligence tdm deck
Sql server 2008 business intelligence tdm deckKlaudiia Jacome
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataAshnikbiz
 

Similar to Data Lake Overview (20)

Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)
 
A lap around microsofts business intelligence platform
A lap around microsofts business intelligence platformA lap around microsofts business intelligence platform
A lap around microsofts business intelligence platform
 
Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lake
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
 
Power BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data SolutionsPower BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data Solutions
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
Exploring Microsoft Azure Infrastructures
Exploring Microsoft Azure InfrastructuresExploring Microsoft Azure Infrastructures
Exploring Microsoft Azure Infrastructures
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarQuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing Webinar
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data Analytics
 
Derfor skal du bruge en DataLake
Derfor skal du bruge en DataLakeDerfor skal du bruge en DataLake
Derfor skal du bruge en DataLake
 
How does Microsoft solve Big Data?
How does Microsoft solve Big Data?How does Microsoft solve Big Data?
How does Microsoft solve Big Data?
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Business intelligence
Business intelligenceBusiness intelligence
Business intelligence
 
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
 
Sql server 2008 business intelligence tdm deck
Sql server 2008 business intelligence tdm deckSql server 2008 business intelligence tdm deck
Sql server 2008 business intelligence tdm deck
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big Data
 

More from James Serra

Microsoft Fabric Introduction
Microsoft Fabric IntroductionMicrosoft Fabric Introduction
Microsoft Fabric IntroductionJames Serra
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookJames Serra
 
Power BI Overview, Deployment and Governance
Power BI Overview, Deployment and GovernancePower BI Overview, Deployment and Governance
Power BI Overview, Deployment and GovernanceJames Serra
 
Power BI Overview
Power BI OverviewPower BI Overview
Power BI OverviewJames Serra
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AIJames Serra
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...James Serra
 
How to build your career
How to build your careerHow to build your career
How to build your careerJames Serra
 
Azure SQL Database Managed Instance
Azure SQL Database Managed InstanceAzure SQL Database Managed Instance
Azure SQL Database Managed InstanceJames Serra
 
What’s new in SQL Server 2017
What’s new in SQL Server 2017What’s new in SQL Server 2017
What’s new in SQL Server 2017James Serra
 
Learning to present and becoming good at it
Learning to present and becoming good at itLearning to present and becoming good at it
Learning to present and becoming good at itJames Serra
 
Microsoft cloud big data strategy
Microsoft cloud big data strategyMicrosoft cloud big data strategy
Microsoft cloud big data strategyJames Serra
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudJames Serra
 
What's new in SQL Server 2016
What's new in SQL Server 2016What's new in SQL Server 2016
What's new in SQL Server 2016James Serra
 
Introducing DocumentDB
Introducing DocumentDB Introducing DocumentDB
Introducing DocumentDB James Serra
 
Introduction to PolyBase
Introduction to PolyBaseIntroduction to PolyBase
Introduction to PolyBaseJames Serra
 
Overview on Azure Machine Learning
Overview on Azure Machine LearningOverview on Azure Machine Learning
Overview on Azure Machine LearningJames Serra
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)James Serra
 
HA/DR options with SQL Server in Azure and hybrid
HA/DR options with SQL Server in Azure and hybridHA/DR options with SQL Server in Azure and hybrid
HA/DR options with SQL Server in Azure and hybridJames Serra
 

More from James Serra (18)

Microsoft Fabric Introduction
Microsoft Fabric IntroductionMicrosoft Fabric Introduction
Microsoft Fabric Introduction
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
 
Power BI Overview, Deployment and Governance
Power BI Overview, Deployment and GovernancePower BI Overview, Deployment and Governance
Power BI Overview, Deployment and Governance
 
Power BI Overview
Power BI OverviewPower BI Overview
Power BI Overview
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AI
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
 
How to build your career
How to build your careerHow to build your career
How to build your career
 
Azure SQL Database Managed Instance
Azure SQL Database Managed InstanceAzure SQL Database Managed Instance
Azure SQL Database Managed Instance
 
What’s new in SQL Server 2017
What’s new in SQL Server 2017What’s new in SQL Server 2017
What’s new in SQL Server 2017
 
Learning to present and becoming good at it
Learning to present and becoming good at itLearning to present and becoming good at it
Learning to present and becoming good at it
 
Microsoft cloud big data strategy
Microsoft cloud big data strategyMicrosoft cloud big data strategy
Microsoft cloud big data strategy
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 
What's new in SQL Server 2016
What's new in SQL Server 2016What's new in SQL Server 2016
What's new in SQL Server 2016
 
Introducing DocumentDB
Introducing DocumentDB Introducing DocumentDB
Introducing DocumentDB
 
Introduction to PolyBase
Introduction to PolyBaseIntroduction to PolyBase
Introduction to PolyBase
 
Overview on Azure Machine Learning
Overview on Azure Machine LearningOverview on Azure Machine Learning
Overview on Azure Machine Learning
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)
 
HA/DR options with SQL Server in Azure and hybrid
HA/DR options with SQL Server in Azure and hybridHA/DR options with SQL Server in Azure and hybrid
HA/DR options with SQL Server in Azure and hybrid
 

Recently uploaded

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 

Recently uploaded (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 

Data Lake Overview

  • 1. Data Lake Overview James Serra Data & AI Architect Microsoft, NYC MTC JamesSerra3@gmail.com Blog: JamesSerra.com
  • 2. About Me  Microsoft, Big Data Evangelist  In IT for 30 years, worked on many BI and DW projects  Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer  Been perm employee, contractor, consultant, business owner  Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference  Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions  Blog at JamesSerra.com  Former SQL Server MVP  Author of book “Reporting with Microsoft SQL Server 2012”
  • 3. Agenda  Big Data Architectures  Why data lakes?  Top-down vs Bottom-up  Data lake defined  Creating ADLS Gen2  Data Lake Use Cases
  • 5. Enterprise data warehouse augmentation • Seen when EDW has been in existence a while and EDW can’t handle new data • Data hub, not data lake • Cons: not offloading EDW work, can’t use existing tools, difficulty joining data in data hub with EDW
  • 6. Data hub plus EDW • Data hub is used as temporary staging and refining, no reporting • Cons: data hub is temporary, no reporting/analyzing done with the data hub (temporary)
  • 7. All-in-one Is the traditional data warehouse dead? https://www.jamesserra.com/archive/2017/12/is-the-traditional-data-warehouse-dead/ • Data hub is total solution, no EDW • Cons: queries are slower, new training for reporting tools, difficulty understanding data, security limitations
  • 8. Modern Data Warehouse • Evolution of three previous scenarios • Ultimate goal • Supports future data needs • Data harmonized and analyzed in the data lake or moved to EDW for more quality and performance
  • 9. INGEST STORE PREP & TRAIN MODEL & SERVE M O D E R N D A T A W A R E H O U S E Azure Data Lake Store Gen2 Logs (unstructured) Azure Data Factory Azure Databricks Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs. Media (unstructured) Files (unstructured) PolyBase Business/custom apps (structured) Azure SQL Data Warehouse Azure Analysis Services Power BI
  • 10. Advanced Analytics Social LOB Graph IoT Image CRM INGEST STORE PREP MODEL & SERVE (& store) Data orchestration and monitoring Big data store Transform & Clean Data warehouse AI BI + Reporting Azure Data Factory SSIS Azure Data Lake Storage Gen2 Blob Storage SQL Server 2019 Big Data Cluster Azure Databricks Azure HDInsight PolyBase & Stored Procedures Power BI Dataflows Azure SQL Data Warehouse Azure Analysis Services SQL Database (Single, MI, HyperScale, Serverless) SQL Server in a VM Cosmos DB Power BI Aggregations
  • 12. ETL pipeline Dedicated ETL tools (e.g. SSIS) Defined schema Queries Results Relational LOB Applications Traditional business analytics process 1. Start with end-user requirements to identify desired reports and analysis 2. Define corresponding database schema and queries 3. Identify the required data sources 4. Create a Extract-Transform-Load (ETL) pipeline to extract required data (curation) and transform it to target schema (‘schema-on-write’) 5. Create reports. Analyze data All data not immediately required is discarded or archived 12
  • 13. Harness the growing and changing nature of data Need to collect any data StreamingStructured Challenge is combining transactional data stored in relational databases with less structured data Big Data = All Data Get the right information to the right people at the right time in the right format Unstructured “ ”
  • 15. Store indefinitely Analyze See results Gather data from all sources Iterate New big data thinking: All data has value Use a data lake: All data has potential value Data hoarding No defined schema—stored in native format Schema is imposed and transformations are done at query time (schema-on-read). Apps and users interpret the data as they see fit
  • 17. Observation Pattern Theory Hypothesis What will happen? How can we make it happen? Predictive Analytics Prescriptive Analytics What happened? Why did it happen? Descriptive Analytics Diagnostic Analytics Confirmation Theory Hypothesis Observation Two Approaches to getting value out of data: Top-Down + Bottoms-Up
  • 18. Implement Data Warehouse Physical Design ETL Development Reporting & Analytics Development Install and Tune Reporting & Analytics Design Dimension Modelling ETL Design Setup Infrastructure Understand Corporate Strategy Data Warehousing Uses A Top-Down Approach Data sources Gather Requirements Business Requirements Technical Requirements
  • 19. The “data lake” Uses A Bottoms-Up Approach Ingest all data regardless of requirements Store all data in native format without schema definition Do analysis Using analytic engines like Hadoop Interactive queries Batch queries Machine Learning Data warehouse Real-time analytics Devices
  • 20. Data Lake + Data Warehouse Better Together Data sources What happened? Descriptive Analytics Diagnostic Analytics Why did it happen? What will happen? Predictive Analytics Prescriptive Analytics How can we make it happen?
  • 22. Exactly what is a data lake? A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed. • Inexpensively store unlimited data • Centralized place for multiple subjects (single version of the truth) • Collect all data “just in case” (data hoarding) • Easy integration of differently-structured data • Store data with no modeling – “Schema on read” • Complements enterprise data warehouse (EDW) • Frees up expensive EDW resources for queries instead of using EDW resources for transformations (avoiding user contention) • Hadoop cluster offers faster ETL processing over SMP solutions • Quick user access to data for power users/data scientists (allowing for faster ROI) • Data exploration to see if data valuable before writing ETL and schema for relational database, or use for one-time report • Allows use of Hadoop tools such as ETL and extreme analytics • Place to land IoT streaming data • On-line archive or backup for data warehouse data • With Hadoop/ADLS, high availability and disaster recovery built in • Keep raw data so don’t have to go back to source if need to re-run • Allows for data to be used many times for different analytic needs and use cases • Cost savings and faster transformations: storage tiers with lifecycle management; separation of storage and compute resources allowing multiple instances of different sizes working with the same data simultaneously vs scaling data warehouse; low-cost storage for raw data saving space on the EDW • Extreme performance for transformations by having multiple compute options each accessing different folders containing data • The ability for an end-user or product to easily access the data from any location
  • 23. Current state of a data warehouse Traditional Approaches CRMERPOLTP LOB DATA SOURCES ETL DATA WAREHOUSE Star schemas, views other read- optimized structures BI AND ANALYTCIS Emailed, centrally stored Excel reports and dashboards Well manicured, often relational sources Known and expected data volume and formats Little to no change Complex, rigid transformations Required extensive monitoring Transformed historical into read structures Flat, canned or multi-dimensional access to historical data Many reports, multiple versions of the truth 24 to 48h delay MONITORING AND TELEMETRY
  • 24. Current state of a data warehouse Traditional Approaches CRMERPOLTP LOB DATA SOURCES ETL DATA WAREHOUSE Star schemas, views other read- optimized structures BI AND ANALYTCIS Emailed, centrally stored Excel reports and dashboards Increase in variety of data sources Increase in data volume Increase in types of data Pressure on the ingestion engine Complex, rigid transformations can’t longer keep pace Monitoring is abandoned Delay in data, inability to transform volumes, or react to new sources Repair, adjust and redesign ETL Reports become invalid or unusable Delay in preserved reports increases Users begin to “innovate” to relieve starvation MONITORING AND TELEMETRY INCREASING DATA VOLUME NON-RELATIONAL DATA INCREASE IN TIME STALE REPORTING
  • 25. Data Lake Transformation (ELT not ETL) New Approaches All data sources are considered Leverages the power of on-prem technologies and the cloud for storage and capture Native formats, streaming data, big data Extract and load, no/minimal transform Storage of data in near-native format Orchestration becomes possible Streaming data accommodation becomes possible Refineries transform data on read Produce curated data sets to integrate with traditional warehouses Users discover published data sets/services using familiar tools CRMERPOLTP LOB DATA SOURCES FUTURE DATA SOURCESNON-RELATIONAL DATA EXTRACT AND LOAD DATA LAKE DATA REFINERY PROCESS (TRANSFORM ON READ) Transform relevant data into data sets BI AND ANALYTCIS Discover and consume predictive analytics, data sets and other reports DATA WAREHOUSE Star schemas, views other read- optimized structures
  • 26. Data Analysis Paradigm Shift OLD WAY: Structure -> Ingest -> Analyze NEW WAY: Ingest -> Analyze -> Structure
  • 27. Needs data governance so your data lake does not turn into a data swamp!
  • 28. Objectives  Plan the structure based on optimal data retrieval  Avoid a chaotic, unorganized data swamp Data Retention Policy Temporary data Permanent data Applicable period (ex: project lifetime) etc… Business Impact / Criticality High (HBI) Medium (MBI) Low (LBI) etc… Confidential Classification Public information Internal use only Supplier/partner confidential Personally identifiable information (PII) Sensitive – financial Sensitive – intellectual property etc… Probability of Data Access Recent/current data Historical data etc… Owner / Steward / SME Subject Area Security Boundaries Department Business unit etc… Time Partitioning Year/Month/Day/Hour/Minute Downstream App/Purpose Common ways to organize the data: Special thanks to: Melissa Coates CoatesDataStrategies.com
  • 29. Example 1 Pros: Subject area at top level, organization-wide Partitioned by time Cons: No obvious security or organizational boundaries Raw Data Zone Subject Area Data Source Object Date Loaded File(s) ---------------------------------------------------- Sales Salesforce CustomerContacts 2016 12 01 CustContact_2016_12_01.txt Curated Data Zone Purpose Type Snapshot Date File(s) ----------------------------------- Sales Trending Analysis Summarized 2016_12_01 SalesTrend_2016_12_01.txt Organizing a Data Lake Thanks to Melissa Coates, www.CoatesDataStrategies.com
  • 30. Data Warehouse Serving, Security & Compliance • Business people • Low latency • Complex joins • Interactive ad-hoc query • High number of users • Additional security • Large support for tools • Dashboards • Easily create reports (Self-service BI) • Know questions
  • 32. A “no-compromises” Data Lake: secure, performant, massively-scalable Data Lake storage that brings the cost and scale profile of object storage together with the performance and analytics feature set of data lake storage A z u r e D a t a L a k e S t o r a g e G e n 2 M A N A G E A B L E S C A L A B L EF A S TS E C U R E  No limits on data store size  Global footprint (50 regions)  Optimized for Spark and Hadoop Analytic Engines  Tightly integrated with Azure end to end analytics solutions  Automated Lifecycle Policy Management  Object Level tiering  Support for fine-grained ACLs, protecting data at the file and folder level  Multi-layered protection via at-rest Storage Service encryption and Azure Active Directory integration C O S T E F F E C T I V E I N T E G R AT I O N R E A D Y  Atomic file operations means jobs complete faster  Object store pricing levels  File system operations minimize transactions required for job completion
  • 33. Blob Storage Data Lake Store Azure Data Lake Storage Gen2 Large partner ecosystem Global scale – All 50 regions Durability options Tiered - Hot/Cool/Archive Cost Efficient Built for Hadoop Hierarchical namespace ACLs, AAD and RBAC Performance tuned for big data Very high scale capacity and throughput Large partner ecosystem Global scale – All 50 regions Durability options Tiered - Hot/Cool/Archive Cost Efficient Built for Hadoop Hierarchical namespace ACLs, AAD and RBAC Performance tuned for big data Very high scale capacity and throughput
  • 34. Missing blob storage features: - Archive and Premium tier - Soft Delete - Snapshots - Some features in preview https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-known-issues Missing ADLS Gen1 features: - Microsoft product support: ADC, Excel, AAS - 3rd-party products: Informatica, Attunity, Alteryx - Some features in preview https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-upgrade
  • 35. Azure Data Lake Store – Distributed File System ADLS File Files of any size can be stored because ADLS is a distributed system which file contents are divided up across backend storage nodes. A read operation on the file is also parallelized across the nodes. Blocks are also replicated for fault tolerance. Data Node 1 Block Data Node 2 Block Data Node 4 Block Data Node 3 Block The ideal file size in ADLS is 256MB – 2GB in size. Many very tiny files introduces significant overhead which reduces performance. This is a well-known issue with storing data in HDFS. Techniques: • Append-only data streams Thanks to Melissa Coates, www.CoatesDataStrategies.com
  • 36.
  • 37. LRS Multiple replicas across a datacenter Protect against disk, node, rack failures Write is ack’d when all replicas are committed Superior to dual-parity RAID 11 9s of durability SLA: 99.9% GRS Multiple replicas across each of 2 regions Protects against major regional disasters Asynchronous to secondary 16 9s of durability SLA: 99.9% RA-GRS GRS + Read access to secondary Separate secondary endpoint RPO delay to secondary can be queried SLA: 99.99% (read), 99.9% (write) Zone 1 ZRS Replicas across 3 Zones Protect against disk, node, rack and zone failures Synchronous writes to all 3 zones 12 9s of durability Available in 8 regions SLA: 99.9% Zone 2 Zone 3
  • 38. File Sync • Windows Srv <-> Azure • Local caching • With offline (Databox) can 'sync' remainder Fuse • Mount blobs as local FS • Commit on write • Linux Site Replication • On premise & cloud • Windows, Linux • Physical, virtual • Hyper-V, VMWare Network Acceleration • Aspera • Signiant AZCopy • Throughput +30% • S3 to Azure Blobs • Sync to cloud • Hi Latency 10-100% NetApp • CloudSync • SnapMirror • SnapVault Data Factory • On premise & cloud sources • Structured & unstructured • Over 60 connectors • UI design data flow Partners • Peer Global File Service • Talon FAST • Zerto • … Offline • Data Box • Data Box Heavy • Data Box Disk • Disk Import / Export Fast Data Transfer microsoft.com/en-us/garage/profiles/fast-data-transfer/
  • 39. Data Box Heavy PREVIEW • Capacity: 1 PB • Weight 500+ lbs • Secure, ruggedized appliance • Same service as Data Box, but targeted to petabyte- sized datasets. Data Box Gateway • Virtual device provisioned in your hypervisor • Supports storage gateway, SMB, NFS, Azure blob, files • Virtual network transfer appliance (VM), runs on your choice of hardware. Data Box Edge • Local Cache Capacity: ~12 TB • Includes Data Box Gateway and Azure IoT Edge. • Data Box Edge manages uploads to Azure and can pre-process data prior to upload. Data Box • Capacity: 100 TB • Weight: ~50 lbs • Secure, ruggedized appliance • Data Box enables bulk migration to Azure when network isn’t an option. Data Box Disk • Capacity: 8TB ea.; 40TB/order • Secure, ruggedized USB drives orderable in packs of 5 (up to 40TB). • Perfect for projects that require a smaller form factor, e.g., autonomous vehicles. Order Fill UploadSend Return Cloud to Edge Edge to Cloud Pre-processing ML Inferencing Network Data Transfer Edge Compute Offline Data Transfer Online Data Transfer
  • 40.
  • 41.
  • 43.
  • 44. Caching Layer (Avere tech) Extra Hot Tier - Premium (SSD + NVME) Hot Tier (HDD) Cool Tier (HDD) Cooler Tier (Pelican) Archive Tier (Tape) Deep Storage Tier (Glass, DNA, etc.) Analytics Engines (Hadoop, Spark, SCOPE …) High Performance Compute AI / ML Current Future Edge Azure File Sync Azure Backup Data Box Data Box Edge Azure Stack REST HDFS NFS SMB … Automatic Lifecycle Management Priority Retrieval [preview] 1hr vs 15 Avere FXT
  • 45.
  • 46.
  • 48. Data Lake Use Cases Social Media Data Devices & Sensors Data Lake  Preparatory file storage for multi-structured data  Exploratory analysis + POCs to determine value of new data types & sources  Affords additional time for longer-term planning while accumulating data or handling an influx of data Raw Data Web Logs Images, Audio, Video Spatial, GPS Exploratory Analysis Data Science Sandbox Ingestion of New File Types Thanks to Melissa Coates, www.CoatesDataStrategies.com
  • 49. Data Lake Use Cases Social Media Data Devices & Sensors Data Lake  Sandbox solutions for initial data prep, experimentation, and analysis  Migrate from proof of concept to operationalized solution  Integrate with open source projects such as Hive, Pig, Spark, Storm, etc.  Big data clusters  SQL-on-Hadoop solutions Raw Data Spatial, GPS Exploratory Analysis Data Science Sandbox Hadoop Advanced Analytics Curated Data Analytics & Reporting Machine Learning Data Science Experimentation | Hadoop Integration Flat Files Thanks to Melissa Coates, www.CoatesDataStrategies.com
  • 50. Data Lake Use Cases Social Media Data Devices & Sensors Data Lake Data Warehouse Analytics & Reporting Cubes & Semantic Models Corporate Data Cloud Systems Third Party Data, Flat Files  ELT strategy  Reduce storage needs in relational platform by using the data lake as landing area  Practical use for data stored in the data lake  Potentially also handle transformations in the data lake Data Processing Jobs Raw Data: Staging Area Data Warehouse Staging Area Thanks to Melissa Coates, www.CoatesDataStrategies.com
  • 51. Data Lake Use Cases Social Media Data Devices & Sensors Data Lake Data Warehouse Analytics & Reporting Cubes & Semantic Models Corporate Data Cloud Systems Third Party Data, Flat Files  Grow around existing DW  Aged data available for querying when needed  Complement to the DW via data virtualization  Federated queries to access current data (relational DB) + archive (data lake) Raw Data: Staging Area Archived Data Integration with DW | Data Archival | Centralization Thanks to Melissa Coates, www.CoatesDataStrategies.com
  • 52. Data Lake Use Cases Social Media Data Devices & Sensors Speed Layer Data Warehouse Analytics & Reporting Cubes & Semantic Models  Support for low-latency, high-velocity data in near real time  Support for batch- oriented operations Batch Layer Serving Layer Data Ingestion Data Processing Data Lake: Raw Data Corporate Data Data Lake: Curated Data Lambda Architecture Near Real-Time Thanks to Melissa Coates, www.CoatesDataStrategies.com
  • 53. Q & A James Serra, Big Data Evangelist Email me at: JamesSerra3@gmail.com Follow me at: @JamesSerra Link to me at: www.linkedin.com/in/JamesSerra Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)

Editor's Notes

  1. The data lake has become extremely popular, but there is still confusion on how it should be used. In this presentation I will cover common big data architectures that use the data lake, the characteristics and benefits of a data lake, and how it works in conjunction with a relational data warehouse. Then I’ll go into details on using Azure Data Lake Store Gen2 as your data lake, and various typical use cases of the data lake. As a bonus I’ll talk about how to organize a data lake and discuss the various products that can be used in a modern data warehouse.
  2. Fluff, but point is I bring real work experience to the session
  3. Removed - SMP vs MPP
  4. This scenario uses an enterprise data warehouse (EDW) built on a RDBMS, but will extract data from the EDW and load it into a big data hub along with data from other sources that are deemed not cost-effective to move into the EDW (usually high-volume data or cold data).  Some data enrichment is usually done in the data hub.  This data hub can then be queried, but primary analytics remain with the EDW.  The data hub is usually built on Hadoop or NoSQL.  This can save costs since storage using Hadoop or NoSQL is much cheaper than an EDW.  Plus, this can speed up the development of reports since the data in Hadoop or NoSQL can be used right away instead of waiting for an IT person to write the ETL and create the schema’s to ingest the data into the EDW.  Another benefit is it can support data growth faster as it is easy to expand storage on a Hadoop/NoSQL solution instead of on a SAN with an EDW solution.  Finally, it can help by reducing the number of queries on the EDW. This scenario is most common when a EDW has been in existence for a while and users are requesting data that the EDW cannot handle because of space, performance, and data loading times. The challenges to this approach is you might not be able to use your existing tools to query the data hub, as well as the data in the hub being difficult to understand and join and may not be completely clean. Not using big data repository as cleaning area, offloading from EDW
  5. The data hub is used as a data staging and extreme-scale data transformation platform, but long-term persistence and analytics is performed in the EDW.  Hadoop or NoSQL is used to refine the data in the data hub.  Once refined, the data is copied to the EDW and then deleted from the data hub. This will lower the cost of data capture, provide scalable data refinement, and provide fast queries via the EDW.  It also offloads the data refinement from the EDW. Cons: data hub is temporary, no reporting or analyzing done with it
  6. A distributed data system is implemented for long-term, high-detail big data persistence in the data hub and analytics without employing a EDW.  Low level code is written or big data packages are added that integrate directly with the distributed data store for extreme-scale operations and analytics. The distributed data hub is usually created with Hadoop, HBase, Cassandra, or MongoDB.  BI tools specifically integrated with or designed for distributed data access and manipulation are needed.  Data operations either use BI tools that provide NoSQL capability or low-level code is required (e.g., MapReduce or Pig script). The disadvantages of this scenario are reports and queries can have longer latency, new reporting tools require training which could lead to lower adoption, and the difficulty of providing governance and structure on top of a non-RDBMS solution.
  7. An evolution of the three previous scenarios that provides multiple options for the various technologies.  Data may be harmonized and analyzed in the data lake or moved out to a EDW when more quality and performance is needed, or when users simply want control.  ELT is usually used instead of ETL (see Difference between ETL and ELT).  The goal of this scenario is to support any future data needs no matter what the variety, volume, or velocity of the data. Hub-and-spoke should be your ultimate goal.  See Why use a data lake? for more details on the various tools and technologies that can be used for the modern data warehouse.
  8. 10
  9. Key Points: Businesses can use new data streams to gain a competitive advantage. Microsoft is uniquely equipped to help you manage the growing volume and variety of data: structured, unstructured, and streaming. Talk Track: Does it not seem like every day there is a new kind of data that we need to understand? New data types continue to expand—we need to be prepared to collect that data so that the organization can then go do something with it. Structured data, the type of data we have been working with for years, continues to accelerate. Think how many transactions are occurring across your business. Unstructured data, the typical source of all our big data, takes many forms and originates from various places across the web including social. Streaming data is the data at the heart of the Internet of Things revolution. Just think about how many things in your organization are smart or instrumented and generating data every second. All of this means that data volumes are growing and bringing new capacity challenges. You are also dealing with an enormous opportunity, taking all of this data and putting it to work. In order to take advantage of all this data, you first need a platform that enables you to collect any data—no matter the size or type. The Microsoft data platform is uniquely complete and can help you collect any data using a flexible approach: Collecting data on-premises with SQL Server SQL Server can help you collect and manage structured, unstructured, and streaming data to power all your workloads: OLTP, BI, and Data Warehousing With new in-memory capabilities that are built into SQL Server 2014, you get the benefit of breakthrough speed with your existing hardware and without having to rewrite your apps. If you’ve been considering the cloud, SQL Server provides an on-ramp to help you get started. Using the wizards built into SQL Server Management Studio, extending to the cloud by combining SQL and Microsoft Azure is simple. Capture new data types using the power and flexibility of the Microsoft Azure Cloud Azure is well equipped to provide the flexibility you need to collect and manage any data in the cloud in a way that meets the needs of your business. Big data in Azure: HDInsight: an Apache Hadoop-based analytics solution that allows cluster deployment in minutes, scale up or down as needed, and insights through familiar BI tools. SQL Databases: managed relational SQL Database-as-a-service that offers business-ready capabilities built on SQL Server technology. Blobs: a cloud storage solution offering the simplest way to store large amounts of unstructured text or binary data, such as video, audio, and images. Tables: a NoSQL key/value storage solution that provides simple access to data at a lower cost for applications that do not need robust querying capabilities. Intelligent Systems Service: cloud service that helps enterprises embrace the Internet of Things by securely connecting, managing, and capturing machine-generated data from a variety of sensors and devices to drive improvements in operations and tap into new business opportunities. Machine Learning: if you’re looking to anticipate business challenges or opportunities, or perhaps expand your data practice into data science, Azure’s new Machine Learning service—cloud-based predictive analytics— can help. ML Studio is a fully-managed cloud service that enables data scientists and developers to efficiently embed predictive analytics into their applications, helping organizations use massive data sets and bring all the benefits of the cloud to machine learning. Document DB: a fully managed, highly scalable, NoSQL document database service Azure Stream Analytics: real-time event processing engine that helps uncover insights from devices, sensors, infrastructure, applications, and data Azure Data Factory: enables information production by orchestrating and managing diverse data Azure Event Hubs: a scalable service for collecting data from millions of “things” in seconds Microsoft Analytics Platform System: In the past, to provide users with reliable, trustworthy information, enterprises gathered relational and transactional data in a single data warehouse. But this traditional data warehouse is under pressure, hitting limits amidst massive change. Data volumes are projected to grow tenfold over the next five years. End users want real-time responses and insights. They want to use non-relational data, which now constitutes 85 percent of data growth. They want access to “cloud-born” data, data that was created from growing cloud IT investments. Your enterprise can only cope with these shifts with a modern data warehouse—the Microsoft Analytics Platform System is the answer. The Analytics Platform System brings Microsoft’s massively parallel processing (MPP) data warehouse technology—the SQL Server Parallel Data Warehouse (PDW), together with HDInsight, Microsoft’s 100 percent Apache Hadoop distribution—and delivers it as a turnkey appliance. Now you can collect relational and non-relational data in one appliance. You can have seamless integration of the relational data warehouse and Hadoop with PolyBase.   All of these options give you the flexibility to get the most out of your existing data capture investments while providing a path to a more efficient and optimized data environment that is ready to support new data types.
  10. All data has immediate or potential value This leads to data hoarding—all data is stored indefinitely With an unknown future, there is no defined schema. Data is prepared and stored in native format; No upfront transformation or aggregation Schema is imposed and transformations are done at query time (schema-on-read). Applications and users interpret the data as they see fit.
  11. Top down starts with descriptive analytics and progresses to prescriptive analytics. Know the questions to ask. Lot’s of upfront work to get data to where you can use it Bottoms up starts with predictive analytics. Don’t know the questions to ask. Little work needs to be done to start using data There are two approaches to doing information management for analytics: Top-down (deductive approach). This is where analytics is done starting with a clear understanding of corporate strategy where theories and hypothesis are made up front. The right data model is then designed and implemented prior to any data collection. Oftentimes, the top-down approach is good for descriptive and diagnostic analytics. What happened in the past and why did it happen? Bottom-up (inductive approach). This is the approach where data is collected up front before any theories and hypothesis are made. All data is kept so that patterns and conclusions can be derived from the data itself. This type of analysis allows for more advanced analytics such as doing predictive or prescriptive analytics: what will happen and/or how can we make it happen? In Gartner’s 2013 study, “Big Data Business Benefits Are Hampered by ‘Culture Clash’”, they make the argument that both approaches are needed for innovation to be successful. Oftentimes what happens in the bottom-up approach becomes part of the top-down approach. .
  12. https://www.jamesserra.com/archive/2017/06/data-lake-details/ https://blog.pythian.com/reduce-costs-by-adding-a-data-lake-to-your-cloud-data-warehouse/ Also called bit bucket, staging area, landing zone or enterprise data hub (Cloudera) http://www.jamesserra.com/archive/2014/05/hadoop-and-data-warehouses/ http://www.jamesserra.com/archive/2014/12/the-modern-data-warehouse/ http://adtmag.com/articles/2014/07/28/gartner-warns-on-data-lakes.aspx http://intellyx.com/2015/01/30/make-sure-your-data-lake-is-both-just-in-case-and-just-in-time/ http://www.blue-granite.com/blog/bid/402596/Top-Five-Differences-between-Data-Lakes-and-Data-Warehouses http://www.martinsights.com/?p=1088 http://data-informed.com/hadoop-vs-data-warehouse-comparing-apples-oranges/ http://www.martinsights.com/?p=1082 http://www.martinsights.com/?p=1094 http://www.martinsights.com/?p=1102
  13. Why move relational data to data lake? Offload processing to refine data to free-up EDW, use low-cost storage for raw data saving space on EDW, help if ETL jobs on EDW taking too long. So can actually use a data lake for small data – move EDW to Hadoop, refine it, move it back to EDW. Cons: rewriting all current ETL to Hadoop, re-training I believe APS should be used for staging (i.e. “ELT”) in most cases, but there are some good use cases for using a Hadoop Data Lake:   - Wanting to offload the data refinement to Hadoop, so the processing and space on the EDW is reduced - Wanting to use some Hadoop technologies/tools to refine/filter data that are not available for APS - Landing zone for unstructured data, as it can ingest large files quickly and provide data redundancy - ELT jobs on EDW are taking too long, so offload some of them to the Hadoop data lake - There may be cases when you want to move EDW data to Hadoop, refine it, and move it back to EDW (offload processing, need to use Hadoop tools) - The data lake is a good place for data that you “might” use down the road. You can land it in the data lake and have users use SQL via Polybase to look at the data and determine if it has value
  14. https://www.sqlchick.com/entries/2017/12/30/zones-in-a-data-lake https://www.sqlchick.com/entries/2016/7/31/data-lake-use-cases-and-planning Question: Do you see many companies building data lakes? Raw: Raw events are stored for historical reference. Also called staging layer or landing area Cleansed: Raw events are transformed (cleaned and mastered) into directly consumable data sets. Aim is to uniform the way files are stored in terms of encoding, format, data types and content (i.e. strings). Also called conformed layer Application: Business logic is applied to the cleansed data to produce data ready to be consumed by applications (i.e. DW application, advanced analysis process, etc). This is also called by a lot of other names: workspace, trusted, gold, secure, production ready, governed, presentation Sandbox: Optional layer to be used to “play” in.  Also called exploration layer or data science workspace
  15. Thanks to Melissa Coates, www.CoatesDataStrategies.com
  16. Gen1 vs Gen2: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-upgrade
  17. https://azure.microsoft.com/en-us/support/legal/sla/storage/v1_3/
  18. Transfer options: https://www.microsoft.com/en-us/garage/profiles/fast-data-transfer/ Azure data box
  19. https://azure.microsoft.com/en-us/services/databox/