This document discusses how to build a successful data lake by focusing on the right data, platform, and interface. It emphasizes the importance of saving raw data to analyze later, organizing the data lake into zones with different governance levels, and providing self-service tools to find, understand, provision, prepare, and analyze data. It promotes the use of a smart data catalog like Waterline Data to automate metadata tagging, enable data discovery and collaboration, and maximize business value from the data lake.
3. Maximize Business Value With a Data Lake
How Do You Democratize the Data Lake to Maximize Business Value?
[Chart: data lake maturity. One axis runs from tight control to “governed” self-service; the other from no value to enterprise impact. Stages plotted: data swamp, data puddle, DW off-loading, and a data lake delivering business value through data democratization.]
5. Data Warehouse Offloading: Cost Savings
• “I prefer a data warehouse; it’s more predictable”
• “It takes IT 3 months of data architecture and ETL work to add new data to the data lake”
• “I can’t get the original data”
6. Data Puddles: Limited Scope and Value
Low variety of data and low adoption:
• Focused use case (e.g., fraud detection)
• Fully automated programs (e.g., ETL off-loading)
• Small user community (e.g., data science sandbox)
• Strong technical skill set requirement
7. What Makes a Successful Data Lake?
Right Data + Right Platform + Right Interface
8. Right Platform:
• Volume—massively scalable
• Variety—schema on read
• Future proof—modular; the same data can be used by many different projects and technologies
• Cost—extremely attractive cost structure
9. Right Data Challenges: Most Data Is Lost, So It Can’t Be Analyzed Later
Only a small portion of data in enterprises today is saved in data warehouses; the rest becomes data exhaust.
10. Right Data: Save Raw Data Now to Analyze Later
• Don’t know now what data will be needed later
• Save as much data as possible now to analyze later
• Save raw data, so it can be treated correctly for each use case
12. Right Data Challenges: Data Silos and Data Hoarding
• Departments hoard and protect their data and do not share it with the rest of the enterprise
• Frictionless ingestion does not depend on data owners
13. Right Interface: Key to Broad Adoption
• Data marketplace for data self-service
• Providing data at the right level of expertise
14. Providing Data at the Right Level of Expertise
• Data scientists: raw data
• Business analysts: clean, trusted, prepared data
15. Roadmap to Data Lake Success
• Organize the lake
• Set up for self-service
• Open the lake to the users
17. Multi-Modal IT: Different Governance Levels for Different Zones
• Raw or Landing zone (data engineers): minimal governance; make sure there is no sensitive data
• Work zone (data scientists): minimal governance; make sure there is no sensitive data
• Gold or Curated zone (data scientists, business analysts): heavy governance; trusted, curated data with lineage and data quality
• Sensitive zone (data stewards): heavy governance; restricted access
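The zone model above can be sketched as a simple access-policy table. Zone names, roles, and governance levels come from the slide; the dictionary layout and the `can_access` helper are illustrative assumptions, not a real product API.

```python
# Illustrative sketch of the multi-modal zone model: each zone carries
# a governance level and the roles allowed to work in it.
ZONES = {
    "raw":       {"governance": "minimal", "roles": {"data_engineer"}},
    "work":      {"governance": "minimal", "roles": {"data_scientist"}},
    "gold":      {"governance": "heavy",   "roles": {"data_scientist", "business_analyst"}},
    "sensitive": {"governance": "heavy",   "roles": {"data_steward"}},
}

def can_access(zone: str, role: str) -> bool:
    """Return True if a user with `role` may read data in `zone`."""
    return role in ZONES[zone]["roles"]

print(can_access("gold", "business_analyst"))       # True
print(can_access("sensitive", "business_analyst"))  # False
```

In practice this policy would live in a tool like Ranger or Sentry rather than application code, but the shape of the decision is the same: zone plus role determines access.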
19. Finding, understanding and governing data in a data lake is like shopping at a flea market
“We have 100 million fields of data – how can anyone find or trust anything?” – Telco Executive
20. Pain points by role:
• Data scientist / business analyst: needs data to use with self-service tools, but can’t explore everything manually to find and understand data
• Data steward: can’t govern and trust data (unknown metadata, data quality, PII, data lineage)
• Big data architect: can’t catalog all the data manually and keep up with data provisioning
23. Finding and Understanding Data (stage: Find and Understand)
• Crowdsource metadata and automate creation of a catalog
• Institutionalize tribal data knowledge
• Automate discovery to cover all data sets
• Establish trust:
  • Curated, annotated data sets
  • Lineage
  • Data quality
  • Governance
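One way to picture a catalog entry that combines automated discovery with crowdsourced tribal knowledge is a small field-level record. All class and attribute names here are assumptions for illustration, not Waterline Data’s actual schema.

```python
# Sketch of a field-level catalog entry: automated tags from discovery,
# curated tags from stewards/SMEs, and free-form comments for tribal knowledge.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    dataset: str
    column: str
    auto_tags: set = field(default_factory=set)      # from automated discovery
    curated_tags: set = field(default_factory=set)   # from data stewards / SMEs
    comments: list = field(default_factory=list)     # crowdsourced knowledge

catalog = [
    CatalogEntry("claims_2016", "cust_ssn", auto_tags={"ssn", "pii"}),
    CatalogEntry("claims_2016", "claim_amt", auto_tags={"currency"},
                 curated_tags={"claim_amount"},
                 comments=["Use for loss-ratio reporting"]),
]

def find_by_tag(tag: str):
    """Search across both automated and curated tags."""
    return [e for e in catalog if tag in e.auto_tags | e.curated_tags]

print([e.column for e in find_by_tag("pii")])  # ['cust_ssn']
```

Searching both tag sets together is what lets a business analyst find data by glossary term even before a steward has curated every field.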
24. Accessing and Provisioning Data (stage: Provision)
You cannot give all users access to all data; you must protect PII and sensitive business information.
• Top-down approach: find and de-identify all sensitive data, then provide access to every user for every dataset as needed
• Agile/self-service approach: create a metadata-only catalog; when users request access, data is de-identified and provisioned
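The agile approach above de-identifies data only at provisioning time. A minimal sketch of that step, assuming salted hashing as the de-identification technique and illustrative column names:

```python
# Sketch of on-request provisioning: PII columns are replaced with
# stable pseudonyms; everything else passes through untouched.
# The column list and salt are illustrative assumptions.
import hashlib

PII_COLUMNS = {"ssn", "email", "phone"}
SALT = b"per-deployment-secret"

def pseudonymize(value: str) -> str:
    """Replace a sensitive value with a stable, irreversible token."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:12]

def provision(rows, columns):
    """Yield rows with PII columns de-identified, others untouched."""
    for row in rows:
        yield {c: (pseudonymize(v) if c in PII_COLUMNS else v)
               for c, v in zip(columns, row)}

cols = ["ssn", "claim_amt"]
out = list(provision([("123-45-6789", "1500")], cols))
print(out[0]["claim_amt"])  # prints 1500
```

Using a stable hash (rather than random masking) keeps joins across provisioned datasets possible, which matters for the analyst’s downstream work.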
26. Prepare Data for Analytics (stage: Prep)
• Clean data: remove or fix bad data, fill in missing values, convert to common units of measure
• Shape data: combine (join, concatenate); resolve entities (create a single customer record from multiple records or sources); transform (aggregate, bucketize, filter, convert codes to names, etc.)
• Blend data: harmonize data from multiple sources to a common schema or model
• Tooling: many great dedicated data wrangling tools on the horizon; some capabilities in BI and data visualization tools; SQL and scripting languages for the more technical analysts
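The clean/shape/blend steps above can be sketched on toy records; the field names, unit conversion, and join table are illustrative assumptions, and a real analyst would use a wrangling tool or SQL rather than hand-written Python.

```python
# Stdlib-only sketch of the three prep steps: clean, shape, blend.
raw = [
    {"id": "1", "amount": "100", "unit": "USD"},
    {"id": "2", "amount": "",    "unit": "USD"},   # missing value
    {"id": "3", "amount": "0.2", "unit": "kUSD"},  # different unit
]

# Clean: fill missing values, convert to a common unit of measure.
def clean(rec):
    amt = float(rec["amount"] or 0)
    if rec["unit"] == "kUSD":
        amt *= 1000
    return {"id": rec["id"], "amount_usd": amt}

cleaned = [clean(r) for r in raw]

# Shape: join with a second source (a lookup of customer names).
names = {"1": "acme", "3": "globex"}
joined = [{**r, "customer": names.get(r["id"], "unknown")} for r in cleaned]

# Blend: harmonize to a common model (here: total amount per customer).
totals = {}
for r in joined:
    totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount_usd"]

print(totals)  # {'acme': 100.0, 'unknown': 0.0, 'globex': 200.0}
```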
27. Data Analysis (stage: Analyze)
• Many wonderful self-service BI and data visualization tools
• Mature space with many established and innovative vendors
Source: Gartner, Magic Quadrant for Business Intelligence and Analytics Platforms, 04 February 2016, ID G00275847. Analysts: Josh Parenteau, Rita L. Sallam, Cindi Howson, Joao Tapadinhas, Kurt Schlegel, Thomas W. Oestreich.
28. Unlock the Value of the Data Lake with the Waterline Data Smart Data Catalog
Time to Value · Tribal Knowledge Sharing · Trust
29. Waterline Data Is the Only Smart Data Catalog for the Data Lake
• “Use an information catalog to maximize business value from information assets”
• “Automatically identify, profile, and metatag files in HDFS and make them available for analysis and exploration”
• “Tapped into an important and underserved opportunity”
• “Comprehensive big data governance and discovery platform”
• “Opens the data to a wider variety of people”
• “Fills a critical gap in big data exploratory analytics by automating the tagging and cataloging of data”
30. Current Customers
Healthcare
Insurance
Life Sciences
Aerospace
Automotive
Banking
Government
Marketing
"Opening up a data lake for self-service analytics requires a data catalog that's smart enough to automatically catalog every field of data so business analysts can maximize time to value" -- Jerry Megaro, Global Head of Data Analytics, Merck KGaA
“Understanding where your data came from and what it means in context is vital to making a data lake initiative successful and not just another data quagmire – the catalog plays a critical component in this” -- Global Head of Data Governance, Risk, and Standard, International Multi-Line Insurer
“A governed yet agile data catalog is key to open up the data lake to business people” -- Paolo Arvati, Big Data, CSI-Piemonte
32. Workflow of Enabling Self-Service Analytics with Hortonworks
• Smart data discovery: profiling, sensitive-data and data-lineage discovery, automated tagging
• Data stewardship: curate tags
• Self-service data catalog: find, collaborate, and take action
• Data prep, analytics and visualization in end-user tools
Integration with Hortonworks Atlas and Ranger: metadata, tags, and data lineage are exchanged with Atlas; roles and access control are enforced through Ranger.
End-user tools only provide the last mile to leverage data; in and of themselves, they don’t know where the right data is. The right data has to be found quickly and securely.
The opposite of a flea market is Amazon: it gives the consumer self-service, but it functions as a managed application.
Like Amazon, we offer a solution that catalogs the data assets, provides a front-end to find, understand, and share, and provides a way to take action and quickly open the data in any end-user tool to wrangle, visualize, or analyze the data.
A data lake provides one place where any data can be saved and used by business analysts and data scientists to mash up data in new ways to answer new business questions.
Waterline Data enables you to open up the data lake to business analysts and data scientists so they can do data prep, analytics, or modeling.
Our product delivers value along three dimensions (the three T’s: Time to value, Tribal knowledge sharing, and Trust).
We catalog every field of data across the entire data lake and provide an interface to quickly find, understand, and take action on the data (e.g., you can provision the data or open it in Trifacta). The end result is faster time to value.
We don’t just discover what the data means, but we also empower subject matter experts to augment the data catalog with additional tags and comments to capture additional information, such as the intended use of the data, to help accelerate future projects
We facilitate data governance by tagging data based on approved business glossaries and data stewardship curation, as well as by providing secure self-service access to the data based on roles and visibility rules
Waterline Data has been acknowledged as filling an important gap in opening up data lakes for self-service data preparation and analytics.
The need for a data catalog has been recognized as key to enabling a data democracy and self-service by the business. For instance, Gartner just released a paper on how CDOs can leverage an information catalog to get more business value from data assets.
We are the only company that can build a data catalog automatically, and at scale, for a data lake.
We have customers in production across many industries. They realize value by being able to catalog all the data quickly and make it easily available to the business to do self-service data preparation and analytics. They also get value from the fact that the data catalog supports agile data governance, by enabling data stewards to quickly curate tags, and by providing several levels of access control based on the data governance policies (e.g., access to sensitive data is protected).
(if they ask, data lakes range from smaller 5-node clusters to over 100 nodes, so our product can be used right away even when the lake is small, and grow to a large lake)
Our product runs natively on the major platforms like AWS, Cloudera, Hortonworks, MapR, and Pivotal. We are also in the process of certifying on IIP.
We integrate with existing data management tools:
• We can import and export data lineage and tag information with Atlas and Navigator
• We support access control policies and integrate with LDAP, Ranger, and Sentry
• We can import existing business glossaries from Collibra, Informatica, or IBM (this is done through our API, so we should be able to import from any business glossary)
• We can integrate with ETL tools to import metadata
• We integrate with end-user tools through an open framework (we can generate Hive tables automatically and open the data directly in end-user tools)
Waterline Data accelerates the creation of the data catalog at big data scale:
• We parse, profile, and discover sensitive data and data lineage, and automatically tag fields based on an integrated business glossary and tagging rules
• We empower data stewards to quickly curate tags
• We empower business analysts and data scientists to quickly find the right data and take immediate action on it by opening it in the desired end-user tool
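The profiling-and-tagging step described above can be sketched as regex rules over sampled field values that propose glossary tags for a steward to accept or reject. The rules, tags, and threshold are illustrative assumptions, not the product’s actual discovery logic.

```python
# Sketch of rule-based automated tagging: a field gets a proposed tag
# when most of its sampled values match that tag's pattern.
import re

TAG_RULES = {
    "ssn":   re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def suggest_tags(sample_values, threshold=0.8):
    """Propose tags whose pattern matches at least `threshold` of the sample."""
    tags = set()
    for tag, pattern in TAG_RULES.items():
        hits = sum(bool(pattern.match(v)) for v in sample_values)
        if sample_values and hits / len(sample_values) >= threshold:
            tags.add(tag)
    return tags

print(suggest_tags(["123-45-6789", "987-65-4321"]))  # {'ssn'}
```

A threshold below 100% matters in practice: real fields contain dirty values, so demanding a perfect match rate would miss most genuinely sensitive columns.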