This document discusses how to build a successful data lake by focusing on the right data, platform, and interface. It emphasizes the importance of saving raw data to analyze later, organizing the data lake into zones with different governance levels, and providing self-service tools to find, understand, provision, prepare, and analyze data. It promotes the use of a smart data catalog like Waterline Data to automate metadata tagging, enable data discovery and collaboration, and maximize business value from the data lake.
3. Maximize Business Value With a Data Lake
How Do You Democratize the Data Lake to Maximize Business Value?
[Chart: data lake maturity. One axis runs from tight control to “governed” self-service; the other from no value to enterprise impact. Stages plotted: data swamp, data puddle, DW off-loading, and a data lake delivering business value through data democratization.]
5. Data Warehouse Offloading: Cost Savings
• “I prefer a data warehouse; it’s more predictable”
• “It takes IT 3 months of data architecture and ETL work to add new data to the data lake”
• “I can’t get the original data”
6. Data Puddles: Limited Scope and Value
Low variety of data and low adoption:
• Focused use case (e.g., fraud detection)
• Fully automated programs (e.g., ETL off-loading)
• Small user community (e.g., data science sandbox)
• Strong technical skill set requirement
7. What Makes a Successful Data Lake?
Right Data + Right Platform + Right Interface
8. Right Platform:
• Volume—massively scalable
• Variety—schema on read
• Future proof—modular; the same data can be used by many different projects and technologies
• Cost—extremely attractive cost structure
9. Right Data Challenges: Most Data Is Lost, So It Can’t Be Analyzed Later
Only a small portion of data in enterprises today is saved in data warehouses; the rest becomes data exhaust.
10. Right Data: Save Raw Data Now to Analyze Later
• Don’t know now what data will be needed later
• Save as much data as possible now to analyze later
• Save raw data, so it can be treated correctly for each use case
12. Right Data Challenges: Data Silos and Data Hoarding
• Departments hoard and protect their data and do not share it with the rest of the enterprise
• Frictionless ingestion does not depend on data owners
13. Right Interface: Key to Broad Adoption
• Data marketplace for data self-service
• Providing data at the right level of expertise
14. Providing Data at the Right Level of Expertise
• Data scientists: raw data
• Business analysts: clean, trusted, prepared data
15. Roadmap to Data Lake Success
• Organize the lake
• Set up for self-service
• Open the lake to the users
17. Multi-Modal IT: Different Governance Levels for Different Zones
• Raw or Landing zone (data engineers): minimal governance; make sure there is no sensitive data
• Work zone (data scientists): minimal governance; make sure there is no sensitive data
• Gold or Curated zone (data scientists, business analysts): heavy governance; trusted, curated data with lineage and data quality
• Sensitive zone (data stewards): heavy governance; restricted access
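The zone model above can be sketched as a simple access-policy table. Zone names, roles, and governance levels come from the slide; the dictionary layout and the `can_access` helper are illustrative assumptions, not a real product API.

```python
# Illustrative sketch of the multi-modal zone model: each zone carries
# a governance level and the roles allowed to work in it.
ZONES = {
    "raw":       {"governance": "minimal", "roles": {"data_engineer"}},
    "work":      {"governance": "minimal", "roles": {"data_scientist"}},
    "gold":      {"governance": "heavy",   "roles": {"data_scientist", "business_analyst"}},
    "sensitive": {"governance": "heavy",   "roles": {"data_steward"}},
}

def can_access(zone: str, role: str) -> bool:
    """Return True if a user with `role` may read data in `zone`."""
    return role in ZONES[zone]["roles"]

print(can_access("gold", "business_analyst"))       # True
print(can_access("sensitive", "business_analyst"))  # False
```

In practice this policy would live in a tool like Ranger or Sentry rather than application code, but the shape of the decision is the same: zone plus role determines access.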
19. Finding, understanding and governing data in a data lake is like shopping at a flea market
“We have 100 million fields of data – how can anyone find or trust anything?” – Telco Executive
20. Pain points by role:
• Data scientist / business analyst: needs data to use with self-service tools, but can’t explore everything manually to find and understand data
• Data steward: can’t govern and trust data (unknown metadata, data quality, PII, data lineage)
• Big data architect: can’t catalog all the data manually and keep up with data provisioning
23. Finding and Understanding Data (stage: Find and Understand)
• Crowdsource metadata and automate creation of a catalog
• Institutionalize tribal data knowledge
• Automate discovery to cover all data sets
• Establish trust:
  • Curated, annotated data sets
  • Lineage
  • Data quality
  • Governance
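One way to picture a catalog entry that combines automated discovery with crowdsourced tribal knowledge is a small field-level record. All class and attribute names here are assumptions for illustration, not Waterline Data’s actual schema.

```python
# Sketch of a field-level catalog entry: automated tags from discovery,
# curated tags from stewards/SMEs, and free-form comments for tribal knowledge.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    dataset: str
    column: str
    auto_tags: set = field(default_factory=set)      # from automated discovery
    curated_tags: set = field(default_factory=set)   # from data stewards / SMEs
    comments: list = field(default_factory=list)     # crowdsourced knowledge

catalog = [
    CatalogEntry("claims_2016", "cust_ssn", auto_tags={"ssn", "pii"}),
    CatalogEntry("claims_2016", "claim_amt", auto_tags={"currency"},
                 curated_tags={"claim_amount"},
                 comments=["Use for loss-ratio reporting"]),
]

def find_by_tag(tag: str):
    """Search across both automated and curated tags."""
    return [e for e in catalog if tag in e.auto_tags | e.curated_tags]

print([e.column for e in find_by_tag("pii")])  # ['cust_ssn']
```

Searching both tag sets together is what lets a business analyst find data by glossary term even before a steward has curated every field.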
24. Accessing and Provisioning Data (stage: Provision)
You cannot give all users access to all data; you must protect PII and sensitive business information.
• Top-down approach: find and de-identify all sensitive data, then provide access to every user for every dataset as needed
• Agile/self-service approach: create a metadata-only catalog; when users request access, data is de-identified and provisioned
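The agile approach above de-identifies data only at provisioning time. A minimal sketch of that step, assuming salted hashing as the de-identification technique and illustrative column names:

```python
# Sketch of on-request provisioning: PII columns are replaced with
# stable pseudonyms; everything else passes through untouched.
# The column list and salt are illustrative assumptions.
import hashlib

PII_COLUMNS = {"ssn", "email", "phone"}
SALT = b"per-deployment-secret"

def pseudonymize(value: str) -> str:
    """Replace a sensitive value with a stable, irreversible token."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:12]

def provision(rows, columns):
    """Yield rows with PII columns de-identified, others untouched."""
    for row in rows:
        yield {c: (pseudonymize(v) if c in PII_COLUMNS else v)
               for c, v in zip(columns, row)}

cols = ["ssn", "claim_amt"]
out = list(provision([("123-45-6789", "1500")], cols))
print(out[0]["claim_amt"])  # prints 1500
```

Using a stable hash (rather than random masking) keeps joins across provisioned datasets possible, which matters for the analyst’s downstream work.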
26. Prepare Data for Analytics (stage: Prep)
• Clean data: remove or fix bad data, fill in missing values, convert to common units of measure
• Shape data: combine (join, concatenate); resolve entities (create a single customer record from multiple records or sources); transform (aggregate, bucketize, filter, convert codes to names, etc.)
• Blend data: harmonize data from multiple sources to a common schema or model
• Tooling: many great dedicated data wrangling tools on the horizon; some capabilities in BI and data visualization tools; SQL and scripting languages for the more technical analysts
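The clean/shape/blend steps above can be sketched on toy records; the field names, unit conversion, and join table are illustrative assumptions, and a real analyst would use a wrangling tool or SQL rather than hand-written Python.

```python
# Stdlib-only sketch of the three prep steps: clean, shape, blend.
raw = [
    {"id": "1", "amount": "100", "unit": "USD"},
    {"id": "2", "amount": "",    "unit": "USD"},   # missing value
    {"id": "3", "amount": "0.2", "unit": "kUSD"},  # different unit
]

# Clean: fill missing values, convert to a common unit of measure.
def clean(rec):
    amt = float(rec["amount"] or 0)
    if rec["unit"] == "kUSD":
        amt *= 1000
    return {"id": rec["id"], "amount_usd": amt}

cleaned = [clean(r) for r in raw]

# Shape: join with a second source (a lookup of customer names).
names = {"1": "acme", "3": "globex"}
joined = [{**r, "customer": names.get(r["id"], "unknown")} for r in cleaned]

# Blend: harmonize to a common model (here: total amount per customer).
totals = {}
for r in joined:
    totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount_usd"]

print(totals)  # {'acme': 100.0, 'unknown': 0.0, 'globex': 200.0}
```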
27. Data Analysis (stage: Analyze)
• Many wonderful self-service BI and data visualization tools
• Mature space with many established and innovative vendors
Source: Gartner, Magic Quadrant for Business Intelligence and Analytics Platforms, 04 February 2016, ID G00275847. Analysts: Josh Parenteau, Rita L. Sallam, Cindi Howson, Joao Tapadinhas, Kurt Schlegel, Thomas W. Oestreich.
28. Unlock the Value of the Data Lake with the Waterline Data Smart Data Catalog
Time to Value · Tribal Knowledge Sharing · Trust
29. Waterline Data Is the Only Smart Data Catalog for the Data Lake
• “Use an information catalog to maximize business value from information assets”
• “Automatically identify, profile, and metatag files in HDFS and make them available for analysis and exploration”
• “Tapped into an important and underserved opportunity”
• “Comprehensive big data governance and discovery platform”
• “Opens the data to a wider variety of people”
• “Fills a critical gap in big data exploratory analytics by automating the tagging and cataloging of data”
30. Current Customers
Healthcare
Insurance
Life Sciences
Aerospace
Automotive
Banking
Government
Marketing
"Opening up a data lake for self-service analytics requires a data catalog that's smart enough to automatically catalog every field of data so business analysts can maximize time to value" -- Jerry Megaro, Global Head of Data Analytics, Merck KGaA
“Understanding where your data came from and what it means in context is vital to making a data lake initiative successful and not just another data quagmire – the catalog plays a critical component in this” -- Global Head of Data Governance, Risk, and Standard, International Multi-Line Insurer
“A governed yet agile data catalog is key to open up the data lake to business people” -- Paolo Arvati, Big Data, CSI-Piemonte
32. Workflow of Enabling Self-Service Analytics with Hortonworks
• Smart data discovery: profiling, sensitive-data and data-lineage discovery, automated tagging
• Data stewardship: curate tags
• Self-service data catalog: find, collaborate, and take action
• Data prep, analytics and visualization in end-user tools
Integration with Hortonworks Atlas and Ranger: metadata, tags, and data lineage are exchanged with Atlas; roles and access control are enforced through Ranger.
End-user tools only provide the last mile to leverage data; in and of themselves, they don’t know where the right data is. The right data has to be found quickly and securely.
The opposite of a flea market is Amazon: it gives the consumer self-service, but it functions as a managed application.
Like Amazon, we offer a solution that catalogs the data assets, provides a front-end to find, understand, and share, and provides a way to take action and quickly open the data in any end-user tool to wrangle, visualize, or analyze the data.
A data lake provides one place where any data can be saved and used by business analysts and data scientists to mash up data in new ways to answer new business questions.
Waterline Data enables you to open up the data lake to business analysts and data scientists so they can do data prep, analytics, or modeling.
Our product delivers value along three dimensions (the three T’s: Time to value, Tribal knowledge sharing, and Trust).
We catalog every field of data across the entire data lake and provide an interface to quickly find, understand, and take action on the data (e.g., you can provision the data or open it in Trifacta). The end result is faster time to value.
We don’t just discover what the data means, but we also empower subject matter experts to augment the data catalog with additional tags and comments to capture additional information, such as the intended use of the data, to help accelerate future projects
We facilitate data governance by tagging data based on approved business glossaries and data stewardship curation, as well as by providing secure self-service access to the data based on roles and visibility rules
Waterline Data has been acknowledged as filling an important gap in opening up data lakes for self-service data preparation and analytics.
The need for a data catalog has been recognized as key to enabling a data democracy and self-service by the business. For instance, Gartner just released a paper on how CDOs can leverage an information catalog to get more business value from data assets.
We are the only company that can build a data catalog automatically, and at scale, for a data lake.
We have customers in production across many industries. They realize value by being able to catalog all the data quickly and make it easily available to the business to do self-service data preparation and analytics. They also get value from the fact that the data catalog supports agile data governance, by enabling data stewards to quickly curate tags, and by providing several levels of access control based on the data governance policies (e.g., access to sensitive data is protected).
(if they ask, data lakes range from smaller 5-node clusters to over 100 nodes, so our product can be used right away even when the lake is small, and grow to a large lake)
Our product runs natively on the major platforms like AWS, Cloudera, Hortonworks, MapR, and Pivotal. We are also in the process of certifying on IIP.
We integrate with existing data management tools:
• We can import and export data lineage and tag information with Atlas and Navigator
• We support access control policies and integrate with LDAP, Ranger, and Sentry
• We can import existing business glossaries from Collibra, Informatica, or IBM (this is done through our API, so we should be able to import from any business glossary)
• We can integrate with ETL tools to import metadata
• We integrate with end-user tools through an open framework (we can generate Hive tables automatically and open the data directly in end-user tools)
Waterline Data accelerates the creation of the data catalog at big data scale:
• We parse, profile, and discover sensitive data and data lineage, and automatically tag fields based on an integrated business glossary and tagging rules
• We empower data stewards to quickly curate tags
• We empower business analysts and data scientists to quickly find the right data and take immediate action on it by opening it in the desired end-user tool
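The profiling-and-tagging step described above can be sketched as regex rules over sampled field values that propose glossary tags for a steward to accept or reject. The rules, tags, and threshold are illustrative assumptions, not the product’s actual discovery logic.

```python
# Sketch of rule-based automated tagging: a field gets a proposed tag
# when most of its sampled values match that tag's pattern.
import re

TAG_RULES = {
    "ssn":   re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def suggest_tags(sample_values, threshold=0.8):
    """Propose tags whose pattern matches at least `threshold` of the sample."""
    tags = set()
    for tag, pattern in TAG_RULES.items():
        hits = sum(bool(pattern.match(v)) for v in sample_values)
        if sample_values and hits / len(sample_values) >= threshold:
            tags.add(tag)
    return tags

print(suggest_tags(["123-45-6789", "987-65-4321"]))  # {'ssn'}
```

A threshold below 100% matters in practice: real fields contain dirty values, so demanding a perfect match rate would miss most genuinely sensitive columns.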