by
Mark Herman
herman_mark@bah.com
Michael Delurey
delurey_mike@bah.com
Turning Big Data into Opportunity
The Data Lake
Table of Contents
Introduction
A New Mindset
Ingesting Data into the Data Lake
Opening Up the Data
Tagging the Data
A New Way of Storing Data
Accessing the Data for Analytics
Bottom-Line Savings and Top-Line Growth
Introduction
Big data by itself does not create opportunity. The most
successful, competitive organizations will be the ones
with the ability to turn that data into game-changing
paths to new kinds of value, cost savings, revenue
growth, and operational effectiveness.
Organizations today are amassing so much information
so quickly that they are reaching a tipping point. They
are gaining remarkable potential to use big data in
new ways, and redefine the very nature of how they do
business. Yet none of this is guaranteed. Current tools
cannot easily integrate disparate data collections, or
fully use the kinds of “unstructured” data—such as
photographs, doctors’ examination notes, and social
media posts—that hold the most promise.
The bigger that data gets, the more impractical these
tools become, in terms of time, cost, and analytic
ability. The conventional approaches have, in effect,
created a glass ceiling with big data. Organizations
may be able to envision new opportunities for growth
and effectiveness, and yet have no method of reaching
them. There is, however, a way around the glass ceiling.
Booz Allen Hamilton has developed a revolutionary
approach known as the “data lake” that removes the
current constraints.
With the data lake, an organization’s repository of
information—structured and unstructured, along with
streaming and batch data—is consolidated in a single,
large table. The entire body of information in the data
lake is available for every inquiry, and all at once—a
capability that can create powerful new knowledge and
insight. And because the data lake simplifies virtually
every aspect of the loading, storing, and accessing
of data, it provides business and government with
substantial cost savings and efficiencies.
The data lake is now being used in a wide range of
business and government applications. For example, it
is helping a pharmaceutical company bring to market
successful new drug compounds up to three times
faster than was previously possible. It is enabling
hospitals to more quickly identify and treat life-
threatening infections. And it is helping the US military
integrate its intelligence sources to track insurgents
and others who are planting improvised explosive
devices (IEDs).
In these and other instances, the data lake is creating
the kinds of opportunities that would have been prohibitively
expensive and time-consuming with conventional tools.
Instead of being left behind by big data, organizations
are now using it to compete and win in our digitally
enabled economy.
A New Mindset
Many organizations are now collecting large amounts
of data in the cloud. But the data lake is an entirely
different model—it does not just bring data together,
it helps connect and integrate the data so that its full
value can be realized. Even in the cloud, data is stored
in rigid, regimented data structures—essentially data
silos—that are difficult to connect, limiting our ability to
see the big picture. Despite its promise to revolutionize
data analysis, the cloud does not truly integrate data—
it simply makes the data silos taller and fatter.
The data lake is not an incremental advance, but rather
represents a completely new mindset. Big data requires
organizations to stop thinking in terms of data mining
and data warehouses—the equivalent of industrial
processes—and to consider how data can be more fluid
and expansive, like in a data lake.
Organizations may be concerned that by consolidating
and connecting their data, they might be making it more
vulnerable. Just the opposite is true. The data lake
incorporates a new, granular level of security and privacy
that is not available with conventional techniques.1
1 See the Booz Allen Viewpoint “Enabling Cloud Analytics with Data-Level Security: Tapping the Full Value of Big Data and the Cloud,” http://www.boozallen.com/media/file/Enabling_Cloud_Analytics_with_Data-Level_Security.pdf
Ingesting Data into the Data Lake
As with much of the conventional approach, the process
of preparing the data for analysis, known as extract/
transform/load (ETL), tends to be highly inefficient in
terms of the resources used. At many organizations,
analysts may spend as much as 80 percent of their
time preparing the data, leaving just 20 percent for
conducting actual analysis. The reason is that with
each new line of inquiry, a specific data structure
and analytic is custom-built. All information entered
into the data structure must first be converted into a
recognizable format, often a slow, painstaking task.
For example, an analyst might be faced with merging
several different data sources that each use different
fields. The analyst must decide which fields to use
and whether new ones need to be created. The more
complex the query, the more data sources typically
must be homogenized. Formatting also carries the risk
of data-entry errors. By contrast, data from a wide
range of sources is smoothly and easily ingested into
the data lake.
More importantly, there are no requirements for rigid
data structures—and so no need for formal data
formatting as the information is loaded. In a data lake,
indexing is not done en masse at the time of ingestion,
which is a time-consuming part of the traditional ETL
process. Instead, indices and relationships can be
derived to enrich the information base over time, and
executed at the time of the analysis to create “views”
that are tailored to the needs of a specific analysis,
reducing the time to operationalize data.
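To make the contrast concrete, here is a minimal Python sketch of this schema-on-read idea. Everything in it is illustrative, not an actual data lake implementation: records are stored exactly as they arrive, and an index (a “view”) is derived only when a particular analysis asks for one.

from collections import defaultdict

class LakeSketch:
    """Toy store: no schema at ingest, views derived at query time."""
    def __init__(self):
        self.records = []                 # raw records, any shape

    def ingest(self, record):
        self.records.append(record)      # no up-front ETL or format check

    def view(self, key_fn):
        """Build an index tailored to one analysis, on demand."""
        index = defaultdict(list)
        for rec in self.records:
            key = key_fn(rec)
            if key is not None:           # records lacking the field are
                index[key].append(rec)    # simply absent from this view
        return index

lake = LakeSketch()
lake.ingest({"patient": "A12", "note": "fever, elevated heart rate"})
lake.ingest({"tweet": "clinic wait times improving"})
lake.ingest({"patient": "B07", "note": "post-op, stable"})

by_patient = lake.view(lambda r: r.get("patient"))  # view exists only now
print(by_patient["A12"])

The point of the sketch is the timing: nothing about the records is decided at load time, so a different analysis tomorrow can derive a different view from the same raw data.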
The data lake might be thought of as a giant collection
grid, like a spreadsheet—with billions of rows and
billions of columns available to hold data. Each cell
of the grid contains a piece of data—a document,
perhaps, or maybe a paragraph, or even a single
word from the document. Cells might contain names,
photographs, incident reports, or Twitter feeds—
anything and everything. It does not matter where in the
grid each bit of information is located. It also makes
no difference where the data comes from, whether it is
formatted, or how it might relate to any other piece of
information in the data lake. The data simply takes its
place in the cell, and after only minimal preparation by
analysts, is ready for use.
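A rough sketch of that grid, again with purely illustrative names: a sparse map from (row, column) coordinates to cells, where a cell may hold a single word, a paragraph, a whole document, or raw bytes.

grid = {}   # (row, column) -> cell contents; only occupied cells exist

grid[(0, "report")] = "Full text of an incident report..."
grid[(1, "word")] = "sepsis"
grid[(2, "photo")] = b"\x89PNG..."         # binary data is fine too
grid[(3, "tweet")] = "Outage reported downtown this morning"

# Neither position nor format carries meaning; each cell simply
# holds whatever was ingested.
for coords, value in grid.items():
    print(coords, type(value).__name__)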
The image of the grid helps describe the difference
between data mining and the data lake. If we want to
mine precious metals, we have to find where they are,
then dig deep to retrieve them. But imagine if, when
the Earth was formed, nuggets of precious metals were
laid out in a big grid on top of the ground. We could just
walk along, picking up what we wanted. The data lake
makes information just as readily available.
The process of placing the data in open cells as it
comes in gives the ingest process remarkable speed.
Large amounts of data that might take three weeks to
prepare using conventional cloud computing can be
placed into the data lake in as little as three hours. This
enables organizations to achieve substantial savings
in IT resources and manpower. Just as important, it
frees analysts for the more important task of finding
connections and value in the data. Many organizations
today are trying to “do more with less.” That is difficult
with the conventional approach, but becomes possible,
for the first time, with the data lake.
Opening Up the Data
The ingest process of the data lake also removes
another disadvantage of the conventional approach—
the need to pre-define our questions. With conventional
computing techniques, we have to know in advance
what kinds of answers we are looking for and where
in the existing data the computer needs to look to
answer the inquiry. Analysts do not really ask questions
of the data—they form hypotheses well in advance of
the actual analysis, and then create data structures
and analytics that will enable them to test those
hypotheses. The only results that come back are the
ones that the custom-made databases and analytics
happen to provide.
What makes this exercise even more constraining is
that the data supporting an analysis typically contains
only a portion of the potentially available information.
Because the process of formatting and structuring
the data is so time-intensive, analysts have no choice
but to cull the data by some method. One of the most
prevalent techniques is to discount (and even ignore)
unstructured data. This simplifies the data ingest, but it
severely reduces the value of the data for analysis.
Hampered by these severe limitations, analysts can
pose only narrow questions of the data. And there is a
risk that the data structures will become closed-loop
systems—echo chambers that merely validate the
original hypotheses. When we ask the system what is
important, it points to the data that we happened to put
in. The fact that a particular piece of data is included in
a database tends to make it de facto significant—it is
important only because the hypothesis sees it that way.
With the data lake, data is ingested with a wide-open
view as to the queries that may come later. Because
there are no structures, we can get all of the data in—
all 100 variables, or 500, or any other number, so that
the data in its totality becomes available. Organizations
may have a great deal of data stored in the cloud, but
without the data lake they cannot easily connect it all,
and discover the often-hidden relationships in the world
around us. It is in those relationships that knowledge
and insight—and opportunity—reside.
Tagging the Data
The data lake also provides organizations with value in
the way the data itself is managed. When a piece of
data is ingested, certain details, called metadata (or
“data about the data”), are added so that the basic
information can be quickly located and identified. For
example, an investor’s portfolio balance (the data)
might be stored with the name of the investor, the
account number, the location of the account, the types
of investments, the country the investor lives in, and so
on. These metadata “tags” serve the same purpose as
old-style card catalogues, which allow readers to find a
book by searching the author, title, or subject. As with
the card catalogues, tags enable us to find particular
information from a number of different starting points—
but with today’s tagging abilities, we can characterize
data in nearly limitless ways. The more tags, the more
complex and rich the analytics can become.
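As a hedged sketch of how tags can serve as the primary means of locating data (the field names and helper functions below are assumptions for illustration, not any specific product’s API), an inverted index over the tags plays the role of the card catalogue:

from collections import defaultdict

data = {}                        # id -> the datum itself
tag_index = defaultdict(set)     # (tag key, tag value) -> ids

def ingest(datum_id, value, tags):
    data[datum_id] = value
    for k, v in tags.items():
        tag_index[(k, v)].add(datum_id)

def retag(datum_id, key, old, new):
    """Update a tag in place; no data structure is rebuilt."""
    tag_index[(key, old)].discard(datum_id)
    tag_index[(key, new)].add(datum_id)

ingest("acct-991", 125_000.00,
       {"investor": "J. Rivera", "country": "US", "type": "equities"})
ingest("acct-445", 88_500.00,
       {"investor": "M. Chen", "country": "US", "type": "bonds"})

# Start from any tag and pivot by intersecting with another:
hits = tag_index[("country", "US")] & tag_index[("type", "equities")]
print([data[i] for i in hits])   # [125000.0]

The retag helper anticipates a point made later in this section: changing how data is found is a simple tag update, not a schema migration.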
With the tags, we can look for connections and
patterns not only in the data but also in the tags themselves. As an
example of how this technology can be applied, tags
were used to help a major pharmaceutical company find
connections in a wide range of public data sources to
identify drug compounds with few adverse reactions,
and a high likelihood of clinical and commercial
success. Those sources included market and
social media data—to help determine the need—
as well as data on clinical development, structural
analysis, disease structures, and patents—to determine
where there might be a gap. Data from those sources
were tagged and ingested into a data lake, enabling the
pharmaceutical company to identify the most promising
compounds. With conventional techniques, those
compounds would have been needles in a haystack, but
tags and the data lake help them stand out brightly.
The data lake allows us to ask questions and search
for patterns using the data itself, the tags
themselves, or a combination of the two. We can begin our
search with any piece of data or tag—for example, a
market analysis or the existing patents on a type
of drug—and pivot off of it in any direction to look
for connections.
While the process of tagging information is not new,
the data lake uses it in a unique way—as the primary
method of locating and managing the data. With
the tags, the rigid data structures that so limit the
conventional approach are no longer needed.
Along with the streamlined ingest process, tags help
give the data lake its speed. When organizations need
to update or search the data in new ways, they do not
have to tear down and rebuild data structures, as in the
conventional method. They can simply update the tags
already in place.
Tagging all of the data, and at a much more granular
level than is possible in the conventional cloud
approach, greatly expands the value that big data can
provide. Information in the data lake is not random and
chaotic, but rather is purposeful. The tags help make
the data lake like a viscous medium that holds the data
in place, and at the same time fosters connections.
The tags also provide a strong new layer of security.
We can tag each piece of data, down to the image
or paragraph in a document, with the relevant
restrictions, authorities, and security and privacy
levels. Organizations can establish rules regarding
which information can be shared, with whom, and
under what circumstances. With the conventional
approach, the primary obstacle to information sharing
is not technology, but rather the concern that secure
information will be compromised. The data lake,
by contrast, makes it possible for business and
government organizations to easily share information,
confident that security, privacy, and other rules
governing the data will be strictly maintained. The
security of data in the data lake has been proven
to work in very secure environments within the US
government, where the highest levels of precision
in security and privacy are required.
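A toy illustration of cell-level security labels follows. It is a deliberate simplification (real systems, such as Apache Accumulo’s visibility labels, support richer boolean expressions), and every label and record in it is invented:

def visible(required: set, authorizations: set) -> bool:
    """A cell is returned only if the reader holds every required label."""
    return required <= authorizations

cells = [
    ("note",      "fever, elevated heart rate", {"medical"}),
    ("invoice",   "statement 2213",             {"finance"}),
    ("diagnosis", "severe sepsis",              {"medical", "phi"}),
]

def scan(authorizations):
    return [(k, v) for k, v, req in cells if visible(req, authorizations)]

print(scan({"medical"}))           # the note, but not the diagnosis
print(scan({"medical", "phi"}))    # both medical cells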
A New Way of Storing Data
With the conventional approach, data storage is
expensive—even in the cloud. The reason is that
so much space is wasted. Imagine a spreadsheet
combining two data sources, an original one with 100
fields and the other with 50. The process of combining
means that we will be adding 50 new “columns” into
the original spreadsheet. Rows from the original will
hold no data for the new columns, and rows from the
new source will hold no data from the original. The
result will be a great many empty cells. This wastes
storage space and creates the opportunity for a great
many errors.
In the data lake, however, every cell is filled—no
space is wasted. This makes it possible to store
vast amounts of data in far less space than would
be required for even relatively small conventional
cloud databases. With the conventional approach,
organizations must continually reinvest in infrastructure
as analytic needs change. Connecting the data silos,
for example, typically requires reconfiguring and even
expanding the infrastructure. But with the data lake, the
infrastructure becomes a stable platform. Organizations
do not need to continually rebuild and reconfigure their
infrastructure. Their initial investment in infrastructure is
both enduring and cost-effective.
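Back-of-the-envelope arithmetic shows the storage difference, using the 100-field and 50-field sources from the example above and assuming, purely for illustration, a thousand rows from each:

rows_a, fields_a = 1_000, 100    # original source
rows_b, fields_b = 1_000, 50     # merged-in source

# Dense merged table: every row carries all 150 columns
dense_slots = (rows_a + rows_b) * (fields_a + fields_b)   # 300,000

# Sparse cell storage: only occupied (row, column) cells exist
occupied = rows_a * fields_a + rows_b * fields_b          # 150,000

print(f"dense table slots : {dense_slots:,}")
print(f"occupied cells    : {occupied:,}")
print(f"empty, wasted     : {dense_slots - occupied:,}")  # half the table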
The data lake’s almost limitless capacity also
enables organizations to store data in a variety of
different forms, to aid in later analysis. A financial
institution, for example, could store records of certain
transactions converted into all of the world’s major
currencies. Or, a company could translate every
document on a particular subject into Chinese, and
store it until it might be needed.
One of the more transformative aspects of the data
lake is that it stores every type of data equally—not
just structured and unstructured, but also batch
and streaming. Batch data is typically collected on
an automated basis and then delivered for analysis
en masse—for example, the utility meter readings
from homes. Streaming data is information from a
continuous feed, such as video surveillance.
Formatting unstructured, batch, and streaming data
inevitably strips it of much of its richness. And even
if a portion of the information can be put into a
conventional cloud database, we are still constrained
by limited, pre-defined questions. The data lake
imposes no such constraints. When unstructured, batch,
and streaming data are ingested, analytics can take
advantage of the tagging approach to begin to look for
patterns that naturally emerge. All types of data, and
the value they hold, now become fully accessible.
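A short sketch of that equal treatment, with invented sources: because nothing is formatted at load time, a batch delivery and a continuous stream can share the same ingest path.

import itertools

store = []

def ingest(record):
    store.append(record)    # stored as-is; tagging and views come later

# Batch: a day's utility meter readings, delivered en masse
for reading in [{"meter": i, "kwh": 12.5 + i} for i in range(3)]:
    ingest(reading)

# Streaming: a continuous surveillance feed, consumed record by record
def motion_events():
    for n in itertools.count():
        yield {"camera": 7, "frame": n, "motion": n % 2 == 0}

for event in itertools.islice(motion_events(), 3):
    ingest(event)

print(len(store), "records ingested, batch and streaming alike")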
The US military is taking advantage of this capability
to help track insurgents and others who are planting
improvised explosive devices (IEDs) and other bombs.
Many of the military’s data sources include unstructured
data, and using the conventional approach—with
its extensive preparation—had proved unwieldy and
time-consuming. With the data lake, the military is now
able to quickly integrate and analyze its vast array of
disparate data sources—including its unstructured
data—giving military commanders unprecedented
situational awareness. This is another example of why
simply amassing large amounts of data does not create
a data lake. The military was collecting an enormous
quantity of data, but without the data lake could not
make full use of it to try to stop IEDs. Commanders
have reported that the current approach—which has the
data lake as its centerpiece—is saving more lives, and
at a lower operating cost than the traditional methods.
Accessing the Data for Analytics
One of the chief drawbacks of the conventional
approach, which the cloud does not ameliorate, is
that it essentially samples the data. When we have
questions (or want to test hypotheses), we select a
sample of the available data and apply analytics to it.
The problem is that we are never quite sure we are
pulling the right sample—that is, whether it is really
representative of the whole. The data lake eliminates
sampling. We no longer have to guess about which data
to use, because we are using it all.
With the data lake, our information is available for
analysis on-demand, when the need arises. The
conventional approach not only requires extensive data
preparation but also makes databases difficult to change
as queries change. Say the pharmaceutical company
wants to add new data sources to identify promising
drug compounds, or perhaps wants to change the type
of financial analyses it uses. With the conventional
approach, analysts would have to tear down the initial
data and analytics structures, and re-engineer new
ones. With the data lake, analysts would simply add the
new data, and ask the new questions.
Because there is no need to continually engineer
and re-engineer data structures, the data lake also
becomes accessible to non-technical subject matter
experts. They no longer need to rely on computer
scientists and others to explore the data—they can ask
the questions themselves. Subject matter experts best
understand the needs and goals of their organizations,
and the data lake helps make it possible for them to
identify where a specific opportunity may lie. This might
entail pinpointing a promising area for revenue growth
that has been overlooked by competitors, or finding
ways to execute a government agency’s mission faster
and more effectively, as in the military’s search for
insurgents and IEDs.
The data lake sets the stage for the advanced, high-
powered analytics that can point the way to top-line
business growth, and help government agencies
achieve their mission goals in better ways. Analytics
that search for connections and look for patterns have
long been hamstrung by being confined to limited, rigid
datasets and databases. The data lake frees them to
search for knowledge and insight across all of the data.
In essence, it allows the analytics, for the first time, to
reach their true potential.
A version of the data lake, for example, helped
researchers from Booz Allen and a large hospital chain
in the Midwest gain surprising insights into severe
sepsis and septic shock, life-threatening conditions
brought on by serious infection. Using the data lake,
researchers consolidated the electronic medical records
of tens of thousands of past patients with sepsis, and
found unexpected patterns in how their conditions
progressed. Those insights prompted the hospital chain
to begin a program to more quickly identify and treat
current sepsis patients. The program was credited with
saving nearly 100 lives during just its first nine months.
Bottom-Line Savings and Top-Line Growth
Virtually every aspect of the data lake creates cost
savings and efficiencies, from freeing up analysts
to the ability to easily and inexpensively scale to an
organization’s growing data. While the conventional
methods have worked in the past, they are simply too
costly and cumbersome in the age of big data. The data
lake gives organizations a reset, in a sense, allowing
them to distribute their resources to obtain optimal
efficiency and effectiveness. That is critical in today’s
economic climate. Organizations can address budgetary
constraints while significantly expanding, rather than
limiting, their data analysis.
At the same time, the data lake helps organizations
to reach and then exploit the tipping point of
opportunity. Ultimately, the real value of big data lies
in big analytics—the capacity to help us do things
not just cheaper and better, but in ways we have not
yet imagined. For government, this can mean new
paradigms for mission success. For business, it can
show the way to entire new areas of revenue growth.
As big data grows even larger in the coming years, it
will increasingly be used by organizations to differentiate
themselves and compete in the marketplace. The
winners will be the ones with the greatest ability to
extract knowledge and insight from that data, and use it
to remake their futures. The data lake opens that door.
About Booz Allen Hamilton
Booz Allen Hamilton has been at the forefront of
strategy and technology consulting for nearly a century.
Today, Booz Allen is a leading provider of management
and technology consulting services to the US
government in defense, intelligence, and civil markets,
and to major corporations, institutions, and not-for-
profit organizations. In the commercial sector, the firm
focuses on leveraging its existing expertise for clients in
the financial services, healthcare, and energy markets,
and to international clients in the Middle East. Booz
Allen offers clients deep functional knowledge spanning
strategy and organization, engineering and operations,
technology, and analytics—which it combines with
specialized expertise in clients’ mission and domain
areas to help solve their toughest problems.
The firm’s management consulting heritage is the
basis for its unique collaborative culture and operating
model, enabling Booz Allen to anticipate needs and
opportunities, rapidly deploy talent and resources, and
deliver enduring results. By combining a consultant’s
problem-solving orientation with deep technical
knowledge and strong execution, Booz Allen helps
clients achieve success in their most critical missions—
as evidenced by the firm’s many client relationships that
span decades. Booz Allen helps shape thinking and
prepare for future developments in areas of national
importance, including cybersecurity, homeland security,
healthcare, and information technology.
Booz Allen is headquartered in McLean, Virginia,
employs approximately 25,000 people, and had revenue
of $5.86 billion for the 12 months ended March 31,
2012. For over a decade, Booz Allen’s high standing
as a business and an employer has been recognized
by dozens of organizations and publications, including
Fortune, Working Mother, G.I. Jobs, and DiversityInc.
More information is available at www.boozallen.com.
(NYSE: BAH)
Contacts
Mark Herman
Executive Vice President
herman_mark@bah.com
703-902-5986
Michael Delurey
Principal
delurey_mike@bah.com
703-902-6858
The most complete, recent list of offices and their addresses and telephone numbers can be found on
www.boozallen.com
Principal Offices
Huntsville, Alabama
Sierra Vista, Arizona
Los Angeles, California
San Diego, California
San Francisco, California
Colorado Springs, Colorado
Denver, Colorado
District of Columbia
Orlando, Florida
Pensacola, Florida
Sarasota, Florida
Tampa, Florida
Atlanta, Georgia
Honolulu, Hawaii
O’Fallon, Illinois
Indianapolis, Indiana
Leavenworth, Kansas
Aberdeen, Maryland
Annapolis Junction, Maryland
Hanover, Maryland
Lexington Park, Maryland
Linthicum, Maryland
Rockville, Maryland
Troy, Michigan
Kansas City, Missouri
Omaha, Nebraska
Red Bank, New Jersey
New York, New York
Rome, New York
Dayton, Ohio
Philadelphia, Pennsylvania
Charleston, South Carolina
Houston, Texas
San Antonio, Texas
Abu Dhabi, United Arab Emirates
Alexandria, Virginia
Arlington, Virginia
Chantilly, Virginia
Charlottesville, Virginia
Falls Church, Virginia
Herndon, Virginia
McLean, Virginia
Norfolk, Virginia
Stafford, Virginia
Seattle, Washington
www.boozallen.com/cloud ©2013 Booz Allen Hamilton Inc.
12.032.12G

Más contenido relacionado

Destacado

The Government's Effective Migration to a Cloud Computing Environment
The Government's Effective Migration to a Cloud Computing EnvironmentThe Government's Effective Migration to a Cloud Computing Environment
The Government's Effective Migration to a Cloud Computing EnvironmentBooz Allen Hamilton
 
Balancing the tension between Lean and Agile
Balancing the tension between Lean and AgileBalancing the tension between Lean and Agile
Balancing the tension between Lean and AgileJames Coplien
 
Pre-Con Education: Effective Change/Configuration Management With CA Service...
Pre-Con Education: Effective Change/Configuration Management With CA Service...Pre-Con Education: Effective Change/Configuration Management With CA Service...
Pre-Con Education: Effective Change/Configuration Management With CA Service...CA Technologies
 
Tribute to Muhammad Ali 1942 2016
Tribute to Muhammad Ali 1942 2016Tribute to Muhammad Ali 1942 2016
Tribute to Muhammad Ali 1942 2016Arbunize
 
The Rise and Fall of Ellen Pao. Perpetrator or Victim?
The Rise and Fall of Ellen Pao. Perpetrator or Victim?The Rise and Fall of Ellen Pao. Perpetrator or Victim?
The Rise and Fall of Ellen Pao. Perpetrator or Victim?Sage HR
 
Ten Things You Should not Forget in Mainframe Security
Ten Things You Should not Forget in Mainframe Security Ten Things You Should not Forget in Mainframe Security
Ten Things You Should not Forget in Mainframe Security CA Technologies
 
India Vs Australia - A Social Media Analysis
India Vs Australia - A Social Media AnalysisIndia Vs Australia - A Social Media Analysis
India Vs Australia - A Social Media AnalysisGermin8
 
Retail Revolution: Thrive in Disruption
Retail Revolution: Thrive in DisruptionRetail Revolution: Thrive in Disruption
Retail Revolution: Thrive in DisruptionBooz Allen Hamilton
 
The Marketing Automation Revolution
The Marketing Automation RevolutionThe Marketing Automation Revolution
The Marketing Automation RevolutionUberflip
 
Paper Jam: Why Documents are Dragging Us Down
Paper Jam: Why Documents are Dragging Us DownPaper Jam: Why Documents are Dragging Us Down
Paper Jam: Why Documents are Dragging Us DownAdobe
 
Figuring out World Cup 2014 – An animated Infographic
Figuring out World Cup 2014 – An animated InfographicFiguring out World Cup 2014 – An animated Infographic
Figuring out World Cup 2014 – An animated InfographicEthinos Digital Marketing
 
Social Data Intelligence: Integrating Social and Enterprise Data for Competit...
Social Data Intelligence: Integrating Social and Enterprise Data for Competit...Social Data Intelligence: Integrating Social and Enterprise Data for Competit...
Social Data Intelligence: Integrating Social and Enterprise Data for Competit...Susan Etlinger
 
Mattermark - Fortune Brainstorm Tech 2015
Mattermark - Fortune Brainstorm Tech 2015Mattermark - Fortune Brainstorm Tech 2015
Mattermark - Fortune Brainstorm Tech 2015Mattermark
 

Destacado (15)

The Government's Effective Migration to a Cloud Computing Environment
The Government's Effective Migration to a Cloud Computing EnvironmentThe Government's Effective Migration to a Cloud Computing Environment
The Government's Effective Migration to a Cloud Computing Environment
 
Balancing the tension between Lean and Agile
Balancing the tension between Lean and AgileBalancing the tension between Lean and Agile
Balancing the tension between Lean and Agile
 
Pre-Con Education: Effective Change/Configuration Management With CA Service...
Pre-Con Education: Effective Change/Configuration Management With CA Service...Pre-Con Education: Effective Change/Configuration Management With CA Service...
Pre-Con Education: Effective Change/Configuration Management With CA Service...
 
Tribute to Muhammad Ali 1942 2016
Tribute to Muhammad Ali 1942 2016Tribute to Muhammad Ali 1942 2016
Tribute to Muhammad Ali 1942 2016
 
The Rise and Fall of Ellen Pao. Perpetrator or Victim?
The Rise and Fall of Ellen Pao. Perpetrator or Victim?The Rise and Fall of Ellen Pao. Perpetrator or Victim?
The Rise and Fall of Ellen Pao. Perpetrator or Victim?
 
Ten Things You Should not Forget in Mainframe Security
Ten Things You Should not Forget in Mainframe Security Ten Things You Should not Forget in Mainframe Security
Ten Things You Should not Forget in Mainframe Security
 
India Vs Australia - A Social Media Analysis
India Vs Australia - A Social Media AnalysisIndia Vs Australia - A Social Media Analysis
India Vs Australia - A Social Media Analysis
 
Retail Revolution: Thrive in Disruption
Retail Revolution: Thrive in DisruptionRetail Revolution: Thrive in Disruption
Retail Revolution: Thrive in Disruption
 
The Retail Reality Check
The Retail Reality CheckThe Retail Reality Check
The Retail Reality Check
 
The Marketing Automation Revolution
The Marketing Automation RevolutionThe Marketing Automation Revolution
The Marketing Automation Revolution
 
Paper Jam: Why Documents are Dragging Us Down
Paper Jam: Why Documents are Dragging Us DownPaper Jam: Why Documents are Dragging Us Down
Paper Jam: Why Documents are Dragging Us Down
 
The Signs of Life
The Signs of LifeThe Signs of Life
The Signs of Life
 
Figuring out World Cup 2014 – An animated Infographic
Figuring out World Cup 2014 – An animated InfographicFiguring out World Cup 2014 – An animated Infographic
Figuring out World Cup 2014 – An animated Infographic
 
Social Data Intelligence: Integrating Social and Enterprise Data for Competit...
Social Data Intelligence: Integrating Social and Enterprise Data for Competit...Social Data Intelligence: Integrating Social and Enterprise Data for Competit...
Social Data Intelligence: Integrating Social and Enterprise Data for Competit...
 
Mattermark - Fortune Brainstorm Tech 2015
Mattermark - Fortune Brainstorm Tech 2015Mattermark - Fortune Brainstorm Tech 2015
Mattermark - Fortune Brainstorm Tech 2015
 

Más de Booz Allen Hamilton

You Can Hack That: How to Use Hackathons to Solve Your Toughest Challenges
You Can Hack That: How to Use Hackathons to Solve Your Toughest ChallengesYou Can Hack That: How to Use Hackathons to Solve Your Toughest Challenges
You Can Hack That: How to Use Hackathons to Solve Your Toughest ChallengesBooz Allen Hamilton
 
Examining Flexibility in the Workplace for Working Moms
Examining Flexibility in the Workplace for Working MomsExamining Flexibility in the Workplace for Working Moms
Examining Flexibility in the Workplace for Working MomsBooz Allen Hamilton
 
Booz Allen's 10 Cyber Priorities for Boards of Directors
Booz Allen's 10 Cyber Priorities for Boards of DirectorsBooz Allen's 10 Cyber Priorities for Boards of Directors
Booz Allen's 10 Cyber Priorities for Boards of DirectorsBooz Allen Hamilton
 
Homeland Threats: Today and Tomorrow
Homeland Threats: Today and TomorrowHomeland Threats: Today and Tomorrow
Homeland Threats: Today and TomorrowBooz Allen Hamilton
 
Preparing for New Healthcare Payment Models
Preparing for New Healthcare Payment ModelsPreparing for New Healthcare Payment Models
Preparing for New Healthcare Payment ModelsBooz Allen Hamilton
 
The Product Owner’s Universe: Agile Coaching
The Product Owner’s Universe: Agile CoachingThe Product Owner’s Universe: Agile Coaching
The Product Owner’s Universe: Agile CoachingBooz Allen Hamilton
 
Immersive Learning: The Future of Training is Here
Immersive Learning: The Future of Training is HereImmersive Learning: The Future of Training is Here
Immersive Learning: The Future of Training is HereBooz Allen Hamilton
 
Nuclear Promise: Reducing Cost While Improving Performance
Nuclear Promise: Reducing Cost While Improving PerformanceNuclear Promise: Reducing Cost While Improving Performance
Nuclear Promise: Reducing Cost While Improving PerformanceBooz Allen Hamilton
 
Frenemies – When Unlikely Partners Join Forces
Frenemies – When Unlikely Partners Join ForcesFrenemies – When Unlikely Partners Join Forces
Frenemies – When Unlikely Partners Join ForcesBooz Allen Hamilton
 
Booz Allen Secure Agile Development
Booz Allen Secure Agile DevelopmentBooz Allen Secure Agile Development
Booz Allen Secure Agile DevelopmentBooz Allen Hamilton
 
Booz Allen Industrial Cybersecurity Threat Briefing
Booz Allen Industrial Cybersecurity Threat BriefingBooz Allen Industrial Cybersecurity Threat Briefing
Booz Allen Industrial Cybersecurity Threat BriefingBooz Allen Hamilton
 
Booz Allen Hamilton and Market Connections: C4ISR Survey Report
Booz Allen Hamilton and Market Connections: C4ISR Survey ReportBooz Allen Hamilton and Market Connections: C4ISR Survey Report
Booz Allen Hamilton and Market Connections: C4ISR Survey ReportBooz Allen Hamilton
 
Modern C4ISR Integrates, Innovates and Secures Military Networks
Modern C4ISR Integrates, Innovates and Secures Military NetworksModern C4ISR Integrates, Innovates and Secures Military Networks
Modern C4ISR Integrates, Innovates and Secures Military NetworksBooz Allen Hamilton
 
Agile and Open C4ISR Systems - Helping the Military Integrate, Innovate and S...
Agile and Open C4ISR Systems - Helping the Military Integrate, Innovate and S...Agile and Open C4ISR Systems - Helping the Military Integrate, Innovate and S...
Agile and Open C4ISR Systems - Helping the Military Integrate, Innovate and S...Booz Allen Hamilton
 
Booz Allen Field Guide to Data Science
Booz Allen Field Guide to Data Science Booz Allen Field Guide to Data Science
Booz Allen Field Guide to Data Science Booz Allen Hamilton
 

Más de Booz Allen Hamilton (20)

You Can Hack That: How to Use Hackathons to Solve Your Toughest Challenges
You Can Hack That: How to Use Hackathons to Solve Your Toughest ChallengesYou Can Hack That: How to Use Hackathons to Solve Your Toughest Challenges
You Can Hack That: How to Use Hackathons to Solve Your Toughest Challenges
 
Examining Flexibility in the Workplace for Working Moms
Examining Flexibility in the Workplace for Working MomsExamining Flexibility in the Workplace for Working Moms
Examining Flexibility in the Workplace for Working Moms
 
The True Cost of Childcare
The True Cost of ChildcareThe True Cost of Childcare
The True Cost of Childcare
 
Booz Allen's 10 Cyber Priorities for Boards of Directors
Booz Allen's 10 Cyber Priorities for Boards of DirectorsBooz Allen's 10 Cyber Priorities for Boards of Directors
Booz Allen's 10 Cyber Priorities for Boards of Directors
 
Inaugural Addresses
Inaugural AddressesInaugural Addresses
Inaugural Addresses
 
Military Spouse Career Roadmap
Military Spouse Career Roadmap Military Spouse Career Roadmap
Military Spouse Career Roadmap
 
Homeland Threats: Today and Tomorrow
Homeland Threats: Today and TomorrowHomeland Threats: Today and Tomorrow
Homeland Threats: Today and Tomorrow
 
Preparing for New Healthcare Payment Models
Preparing for New Healthcare Payment ModelsPreparing for New Healthcare Payment Models
Preparing for New Healthcare Payment Models
 
The Product Owner’s Universe: Agile Coaching
The Product Owner’s Universe: Agile CoachingThe Product Owner’s Universe: Agile Coaching
The Product Owner’s Universe: Agile Coaching
 
Immersive Learning: The Future of Training is Here
Immersive Learning: The Future of Training is HereImmersive Learning: The Future of Training is Here
Immersive Learning: The Future of Training is Here
 
Nuclear Promise: Reducing Cost While Improving Performance
Nuclear Promise: Reducing Cost While Improving PerformanceNuclear Promise: Reducing Cost While Improving Performance
Nuclear Promise: Reducing Cost While Improving Performance
 
Frenemies – When Unlikely Partners Join Forces
Frenemies – When Unlikely Partners Join ForcesFrenemies – When Unlikely Partners Join Forces
Frenemies – When Unlikely Partners Join Forces
 
Booz Allen Secure Agile Development
Booz Allen Secure Agile DevelopmentBooz Allen Secure Agile Development
Booz Allen Secure Agile Development
 
Booz Allen Industrial Cybersecurity Threat Briefing
Booz Allen Industrial Cybersecurity Threat BriefingBooz Allen Industrial Cybersecurity Threat Briefing
Booz Allen Industrial Cybersecurity Threat Briefing
 
Booz Allen Hamilton and Market Connections: C4ISR Survey Report
Booz Allen Hamilton and Market Connections: C4ISR Survey ReportBooz Allen Hamilton and Market Connections: C4ISR Survey Report
Booz Allen Hamilton and Market Connections: C4ISR Survey Report
 
CITRIX IN AMAZON WEB SERVICES
CITRIX IN AMAZON WEB SERVICESCITRIX IN AMAZON WEB SERVICES
CITRIX IN AMAZON WEB SERVICES
 
Modern C4ISR Integrates, Innovates and Secures Military Networks
Modern C4ISR Integrates, Innovates and Secures Military NetworksModern C4ISR Integrates, Innovates and Secures Military Networks
Modern C4ISR Integrates, Innovates and Secures Military Networks
 
Agile and Open C4ISR Systems - Helping the Military Integrate, Innovate and S...
Agile and Open C4ISR Systems - Helping the Military Integrate, Innovate and S...Agile and Open C4ISR Systems - Helping the Military Integrate, Innovate and S...
Agile and Open C4ISR Systems - Helping the Military Integrate, Innovate and S...
 
Women On The Leading Edge
Women On The Leading Edge Women On The Leading Edge
Women On The Leading Edge
 
Booz Allen Field Guide to Data Science
Booz Allen Field Guide to Data Science Booz Allen Field Guide to Data Science
Booz Allen Field Guide to Data Science
 

Último

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfsimulationsindia
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfrahulyadav957181
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 

Último (20)

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdf
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 

Turning Big Data into Opportunity: The Data Lake

  • 2.
  • 3. Table of Contents Introduction........................................................................................................................ 1 A New Mindset .................................................................................................................. 1 Ingesting Data into the Data Lake ....................................................................................... 1 Opening Up the Data.......................................................................................................... 2 Tagging the Data................................................................................................................ 3 A New Way of Storing Data.................................................................................................. 4 Accessing the Data for Analytics ......................................................................................... 4 Bottom-Line Savings and Top-Line Growth ............................................................................ 5
  • 4.
  • 5. 1 Introduction Big data by itself does not create opportunity. The most successful, competitive organizations will be the ones with the ability to turn that data into game-changing paths to new kinds of value, cost savings, revenue growth, and operational effectiveness. Organizations today are amassing so much information so quickly that they are reaching a tipping point. They are gaining remarkable potential to use big data in new ways, and redefine the very nature of how they do business. Yet none of this is guaranteed. Current tools cannot easily integrate disparate data collections, or fully use the kinds of “unstructured” data—such as photographs, doctors’ examination notes, and social media posts—that hold the most promise. The bigger that data gets, the more impractical these tools become, in terms of time, cost, and analytic ability. The conventional approaches have, in effect, created a glass ceiling with big data. Organizations may be able to envision new opportunities for growth and effectiveness, and yet have no method of reaching them. There is, however, a way around the glass ceiling. Booz Allen Hamilton has developed a revolutionary approach known as the “data lake” that removes the current constraints. With the data lake, an organization’s repository of information—structured and unstructured, along with streaming and batch data—is consolidated in a single, large table. The entire body of information in the data lake is available for every inquiry, and all at once—a capability that can create powerful new knowledge and insight. And because the data lake simplifies virtually every aspect of the loading, storing, and accessing of data, it provides business and government with substantial cost savings and efficiencies. The data lake is now being used in a wide range of business and government applications. For example, it is helping a pharmaceutical company bring to market successful new drug compounds up to three times faster than was previously possible. It is enabling hospitals to more quickly identify and treat life- threatening infections. And it is helping the US military integrate its intelligence sources to track insurgents and others who are planting improvised explosive devices (IEDs). In these and other instances, the data lake is creating the kinds of opportunities would have been prohibitively expensive and time-consuming with conventional tools. Instead of being left behind by big data, organizations are now using it to compete and win in our digitally enabled economy. A New Mindset Many organizations are now collecting large amounts of data in the cloud. But the data lake is an entirely different model—it does not just bring data together, it helps connect and integrate the data so that its full value can be realized. Even in the cloud, data is stored in rigid, regimented data structures—essentially data silos—that are difficult to connect, limiting our ability to see the big picture. Despite its promise to revolutionize data analysis, the cloud does not truly integrate data— it simply makes the data silos taller and fatter. The data lake is not an incremental advance, but rather represents a completely new mindset. Big data requires organizations to stop thinking in terms of data mining and data warehouses—the equivalent of industrial processes—and to consider how data can be more fluid and expansive, like in a data lake. Organizations may be concerned that by consolidating and connecting their data, they might be making it more vulnerable. 
Just the opposite is true. The data lake incorporates a new, granular level of security and privacy that is not available with conventional techniques.1 Ingesting Data into the Data Lake As with much of the conventional approach, the process of preparing the data for analysis, known as extract/ The Data Lake Turning Big Data into Opportunity: 1 See Booz Allen Viewpoint “Enabling Cloud Analytics with Data-Level Security: Tapping the Full Value of Big Data and the Cloud” http://www.boozallen.com/media/file/Enabling_ Cloud_Analytics_with_Data-Level_Security.pdf
  • 6. 2 transform/load (ETL), tends to be highly inefficient in terms of the resources used. At many organizations, analysts may spend as much as 80 percent of their time preparing the data, leaving just 20 percent for conducting actual analysis. The reason is that with each new line of inquiry, a specific data structure and analytic is custom-built. All information entered into the data structure must first be converted into a recognizable format, often a slow, painstaking task. For example, an analyst might be faced with merging several different data sources that each use different fields. The analyst must decide which fields to use and whether new ones need to be created. The more complex the query, the more data sources that typically must be homogenized. Formatting also carries the risk of data-entry errors. By contrast, data from a wide range of sources is smoothly and easily ingested into the data lake. More importantly, there are no requirements for rigid data structures—and so no need for formal data formatting as the information is loaded. In a data lake, indexing is not done en masse at the time of ingestion, which is a time-consuming part of the traditional ETL process. Instead, indices and relationships can be derived to enrich the information base over time, and executed at the time of the analysis to create “views” that are tailored to the needs of a specific analysis, reducing the time to operationalize data. The data lake might be thought of as a giant collection grid, like a spreadsheet—with billions of rows and billions of columns available to hold data. Each cell of the grid contains a piece of data—a document, perhaps, or maybe a paragraph, or even a single word from the document. Cells might contain names, photographs, incident reports, or Twitter feeds— anything and everything. It does not matter where in the grid each bit of information is located. It also makes no difference where the data comes from, whether it is formatted, or how it might relate to any other piece of information in the data lake. The data simply takes its place in the cell, and after only minimal preparation by analysts, is ready for use. The image of the grid helps describe the difference between data mining and the data lake. If we want to mine precious metals, we have to find where they are, then dig deep to retrieve them. But imagine if, when the Earth was formed, nuggets of precious metals were laid out in a big grid on top of the ground. We could just walk along, picking up what we wanted. The data lake makes information just as readily available. The process of placing the data in open cells as it comes in gives the ingest process remarkable speed. Large amounts of data that might take 3 weeks to prepare using conventional cloud computing can be placed into the data lake in as little as 3 hours. This enables organizations to achieve substantial savings in IT resources and manpower. Just as important, it frees analysts for the more important task of finding connections and value in the data. Many organizations today are trying to “do more with less.” That is difficult with the conventional approach, but becomes possible, for the first time, with the data lake. Opening Up the Data The ingest process of the data lake also removes another disadvantage of the conventional approach— the need to pre-define our questions. 
Opening Up the Data

The ingest process of the data lake also removes another disadvantage of the conventional approach—the need to pre-define our questions. With conventional computing techniques, we have to know in advance what kinds of answers we are looking for, and where in the existing data the computer needs to look to answer the inquiry. Analysts do not really ask questions of the data—they form hypotheses well in advance of the actual analysis, and then create data structures and analytics that will enable them to test those hypotheses. The only results that come back are the ones the custom-made databases and analytics happen to provide.

What makes this exercise even more constraining is that the data supporting an analysis typically contains only a portion of the potentially available information. Because the process of formatting and structuring the data is so time-intensive, analysts have no choice but to cull the data by some method. One of the most prevalent techniques is to discount (and even ignore) unstructured data. This simplifies the data ingest, but it severely reduces the value of the data for analysis.

Hampered by these limitations, analysts can pose only narrow questions of the data. And there is a risk that the data structures will become closed-loop systems—echo chambers that merely validate the original hypotheses.
When we ask the system what is important, it points to the data we happened to put in. The fact that a particular piece of data is included in a database tends to make it de facto significant—it is important only because the hypothesis sees it that way.

With the data lake, data is ingested with a wide-open view as to the queries that may come later. Because there are no rigid structures, we can get all of the data in—all 100 variables, or 500, or any other number—so that the data in its totality becomes available. Organizations may have a great deal of data stored in the cloud, but without the data lake they cannot easily connect it all and discover the often-hidden relationships in the world around us. It is in those relationships that knowledge and insight—and opportunity—reside.

Tagging the Data

The data lake also provides organizations with value in the way the data itself is managed. When a piece of data is ingested, certain details, called metadata (or "data about the data"), are added so that the basic information can be quickly located and identified. For example, an investor's portfolio balance (the data) might be stored with the name of the investor, the account number, the location of the account, the types of investments, the country the investor lives in, and so on.

These metadata "tags" serve the same purpose as old-style card catalogues, which allowed readers to find a book by searching the author, title, or subject. As with the card catalogues, tags enable us to find particular information from a number of different starting points—but with today's tagging abilities, we can characterize data in nearly limitless ways. The more tags, the more complex and rich the analytics can become. With the tags, we can look for connections and patterns not only in the data, but in the tags themselves.

As an example of how this technology can be applied, tags were used to help a major pharmaceutical company find connections in a wide range of public data sources to identify drug compounds with few adverse reactions and a high likelihood of clinical and commercial success. Those sources included market and social media data—to help determine the need—as well as data on clinical development, structural analysis, disease structures, and patents—to determine where there might be a gap. Data from those sources were tagged and ingested into a data lake, enabling the pharmaceutical company to identify the most promising compounds. With conventional techniques, those compounds would have been needles in a haystack; tags and the data lake help them stand out brightly.

The data lake allows us to ask questions and search for patterns using the data itself, the tags themselves, or a combination of both. We can begin our search with any piece of data or tag—for example, a market analysis or the existing patents on a type of drug—and pivot off of it in any direction to look for connections.

While the process of tagging information is not new, the data lake uses it in a unique way—as the primary method of locating and managing the data. With the tags, the rigid data structures that so limit the conventional approach are no longer needed. Along with the streamlined ingest process, tags help give the data lake its speed.
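The sketch below shows what tag-driven search and pivoting might look like. The names and the deliberately simplified inverted index are assumptions for illustration; the point is that tags, not table schemas, locate the data.

```python
from collections import defaultdict

class TaggedLake:
    """Toy tag-indexed store: tags, not schemas, locate the data."""

    def __init__(self):
        self.data = {}                     # cell_id -> value
        self.cell_tags = defaultdict(set)  # cell_id -> {tags}
        self.tag_index = defaultdict(set)  # tag -> {cell_ids}

    def ingest(self, cell_id, value, tags):
        self.data[cell_id] = value
        for tag in tags:
            self.cell_tags[cell_id].add(tag)
            self.tag_index[tag].add(cell_id)

    def find(self, *query_tags):
        """Return every cell carrying all of the query tags."""
        hits = set.intersection(*(self.tag_index[t] for t in query_tags))
        return {c: self.data[c] for c in hits}

    def pivot(self, cell_id):
        """From one cell, fan out to everything sharing any of its tags."""
        related = set()
        for tag in self.cell_tags[cell_id]:
            related |= self.tag_index[tag]
        related.discard(cell_id)
        return related

lake = TaggedLake()
lake.ingest("doc-17", "trial notes for compound X42", {"compound:X42", "clinical"})
lake.ingest("pat-03", "patent filing, compound X42", {"compound:X42", "patent"})
print(lake.find("compound:X42"))  # start from a tag...
print(lake.pivot("doc-17"))       # ...or from a piece of data, and fan out
```

Updating how the data can be searched means adding or changing tags, not rebuilding the store, which is what gives the approach its speed.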
When organizations need to update or search the data in new ways, they do not have to tear down and rebuild data structures, as in the conventional method. They can simply update the tags already in place. Tagging all of the data, and at a much more granular level than is possible in the conventional cloud approach, greatly expands the value that big data can provide. Information in the data lake is not random and chaotic, but purposeful. The tags make the data lake like a viscous medium that holds the data in place while fostering connections.

The tags also provide a strong new layer of security. We can tag each piece of data, down to the image or paragraph in a document, with the relevant restrictions, authorities, and security and privacy levels. Organizations can establish rules regarding which information can be shared, with whom, and under what circumstances. With the conventional approach, the primary obstacle to information sharing is not technology, but rather the concern that secure information will be compromised. The data lake, by contrast, makes it possible for business and government organizations to easily share information, confident that security, privacy, and other rules governing the data will be strictly maintained. The security of data in the data lake has been proven to work in very secure environments within the US government, where the highest levels of precision in security and privacy are required.
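The sketch below illustrates the cell-level idea in miniature. It borrows the visibility-label concept found in stores such as Apache Accumulo, but the class, the labels, and the simple "reader must hold every required label" rule are assumptions for illustration, not the data lake's actual mechanism.

```python
class SecureLake:
    """Toy cell-level security: each cell carries its own required labels."""

    def __init__(self):
        self.cells = {}  # cell_id -> (value, required_labels)

    def ingest(self, cell_id, value, required_labels):
        self.cells[cell_id] = (value, frozenset(required_labels))

    def read(self, cell_id, user_auths):
        """Return the cell only if the reader holds every required label."""
        value, required = self.cells[cell_id]
        if required <= set(user_auths):
            return value
        return None  # withheld: the cell is simply invisible to this reader

lake = SecureLake()
lake.ingest("note-9", "exam note: elevated lactate", {"phi", "clinical"})
print(lake.read("note-9", {"phi", "clinical"}))  # authorized reader sees it
print(lake.read("note-9", {"clinical"}))         # lacking "phi": sees nothing
```

Because the restriction travels with each cell rather than with a whole database, two users can run the same query over the same lake and each sees only what their authorizations allow.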
A New Way of Storing Data

With the conventional approach, data storage is expensive—even in the cloud. The reason is that so much space is wasted. Imagine a spreadsheet combining two data sources, an original one with 100 fields and the other with 50. Combining them means adding 50 new "columns" to the original spreadsheet. Rows from the original will hold no data in the new columns, and rows from the new source will hold no data in the original ones. The result is a great many empty cells. This is wasted storage space, and it creates the opportunity for a great many errors.

In the data lake, however, every cell is filled—no space is wasted. This makes it possible to store vast amounts of data in far less space than would be required for even relatively small conventional cloud databases.
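A back-of-the-envelope sketch of that storage arithmetic, with illustrative row counts (the figures are assumptions, not measurements):

```python
# Merging a 100-field source with a 50-field source into one dense table
# forces every row to carry all 150 columns. A sparse layout stores one
# (row, column, value) entry per populated cell only.
rows_a, fields_a = 1_000_000, 100  # original source
rows_b, fields_b = 1_000_000, 50   # second source, no shared fields

dense_cells = (rows_a + rows_b) * (fields_a + fields_b)   # 300,000,000
sparse_cells = rows_a * fields_a + rows_b * fields_b      # 150,000,000

print(f"dense merged table: {dense_cells:,} cells")
print(f"sparse data lake:   {sparse_cells:,} cells (none of them empty)")
```

In this toy merge, fully half of the dense table's cells would sit empty; the sparse layout simply never allocates them.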
With the conventional approach, organizations must also continually reinvest in infrastructure as analytic needs change. Connecting the data silos, for example, typically requires reconfiguring and even expanding the infrastructure. But with the data lake, the infrastructure becomes a stable platform. Organizations do not need to continually rebuild and reconfigure it; their initial investment is both enduring and cost-effective.

The data lake's almost limitless capacity also enables organizations to store data in a variety of different forms, to aid in later analysis. A financial institution, for example, could store records of certain transactions converted into all of the world's major currencies. Or a company could translate every document on a particular subject into Chinese, and store it until it might be needed.

One of the more transformative aspects of the data lake is that it stores every type of data equally—not just structured and unstructured, but also batch and streaming. Batch data is typically collected on an automated basis and then delivered for analysis en masse—for example, utility meter readings from homes. Streaming data is information from a continuous feed, such as video surveillance. Formatting unstructured, batch, and streaming data inevitably strips it of much of its richness. And even if a portion of the information can be put into a conventional cloud database, we are still constrained by limited, pre-defined questions. The data lake imposes no such constraints. When unstructured, batch, and streaming data are ingested, analytics can take advantage of the tagging approach to look for patterns that naturally emerge. All types of data, and the value they hold, become fully accessible.

The US military is taking advantage of this capability to help track insurgents and others who are planting improvised explosive devices (IEDs) and other bombs. Many of the military's data sources include unstructured data, and using the conventional approach—with its extensive preparation—had proved unwieldy and time-consuming. With the data lake, the military is now able to quickly integrate and analyze its vast array of disparate data sources—including its unstructured data—giving military commanders unprecedented situational awareness.

This is another example of why simply amassing large amounts of data does not create a data lake. The military was collecting an enormous quantity of data, but without the data lake it could not make full use of that data to try to stop IEDs. Commanders have reported that the current approach—which has the data lake as its centerpiece—is saving more lives, and at a lower operating cost, than the traditional methods.

Accessing the Data for Analytics

One of the chief drawbacks of the conventional approach, which the cloud does not ameliorate, is that it essentially samples the data. When we have questions (or want to test hypotheses), we select a sample of the available data and apply analytics to it. The problem is that we are never quite sure we are pulling the right sample—that is, whether it is really representative of the whole. The data lake eliminates sampling. We no longer have to guess about which data to use, because we are using it all.
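A minimal, self-contained sketch of what "using it all" can look like in practice, with hypothetical names: every question scans the whole lake rather than a chosen sample, and, as the next paragraphs discuss, a new source is simply ingested and queried without any rebuilding.

```python
lake = {}  # (record_id, field) -> value, echoing the grid sketch earlier

def ingest(record_id, record):
    """Add a record from any source; no structure to tear down first."""
    for field, value in record.items():
        lake[(record_id, field)] = value

def ask(predicate):
    """Run a question over all of the data, not over a chosen sample."""
    return {cell: value for cell, value in lake.items() if predicate(value)}

ingest("trial-7", {"compound": "X42", "adverse_events": 2})
# A new source arrives later: simply add the data and ask a new question.
ingest("market-3", {"compound": "X42", "signal": "strong demand"})
print(ask(lambda value: value == "X42"))  # every mention, across all sources
```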
With the data lake, our information is available for analysis on-demand, when the need arises. The conventional approach not only requires extensive data preparation; it also makes databases difficult to change as queries change. Say the pharmaceutical company wants to add new data sources to identify promising drug compounds, or perhaps wants to change the type of financial analyses it uses. With the conventional approach, analysts would have to tear down the initial data and analytics structures and re-engineer new ones. With the data lake, analysts would simply add the new data and ask the new questions.

Because there is no need to continually engineer and re-engineer data structures, the data lake also becomes accessible to non-technical subject matter experts. They no longer need to rely on computer scientists and others to explore the data—they can ask the questions themselves. Subject matter experts best understand the needs and goals of their organizations, and the data lake helps make it possible for them to identify where a specific opportunity may lie. This might entail pinpointing a promising area for revenue growth that has been overlooked by competitors, or finding ways to execute a government agency's mission faster and more effectively, as in the military's search for insurgents and IEDs.

The data lake sets the stage for the advanced, high-powered analytics that can point the way to top-line business growth and help government agencies achieve their mission goals in better ways. Analytics that search for connections and patterns have long been hamstrung by being confined to limited, rigid datasets and databases. The data lake frees them to search for knowledge and insight across all of the data. In essence, it allows the analytics, for the first time, to reach their true potential.

A version of the data lake, for example, helped researchers from Booz Allen and a large hospital chain in the Midwest gain surprising insights into severe sepsis and septic shock, life-threatening conditions brought on by serious infection. Using the data lake, researchers consolidated the electronic medical records of tens of thousands of past patients with sepsis, and found unexpected patterns in how their conditions progressed. Those insights prompted the hospital chain to begin a program to more quickly identify and treat current sepsis patients. The program was credited with saving nearly 100 lives during just its first nine months.

Bottom-Line Savings and Top-Line Growth

Virtually every aspect of the data lake creates cost savings and efficiencies, from freeing up analysts to the ability to easily and inexpensively scale to an organization's growing data. While the conventional methods have worked in the past, they are simply too costly and cumbersome in the age of big data. The data lake gives organizations a reset, in a sense, allowing them to distribute their resources for optimal efficiency and effectiveness. That is critical in today's economic climate. Organizations can address budgetary constraints while significantly expanding, rather than limiting, their data analysis. At the same time, the data lake helps organizations to reach and then exploit the tipping point of opportunity.
Ultimately, the real value of big data lies in big analytics—the capacity to help us do things not just cheaper and better, but in ways we have not yet imagined. For government, this can mean new paradigms for mission success. For business, it can show the way to entire new areas of revenue growth. As big data grows even larger in the coming years, it will increasingly be used by organizations to differentiate themselves and compete in the marketplace. The winners will be the ones with the greatest ability to extract knowledge and insight from that data, and use it to remake their futures. The data lake opens that door.
About Booz Allen Hamilton

Booz Allen Hamilton has been at the forefront of strategy and technology consulting for nearly a century. Today, Booz Allen is a leading provider of management and technology consulting services to the US government in defense, intelligence, and civil markets, and to major corporations, institutions, and not-for-profit organizations. In the commercial sector, the firm focuses on leveraging its existing expertise for clients in the financial services, healthcare, and energy markets, and for international clients in the Middle East. Booz Allen offers clients deep functional knowledge spanning strategy and organization, engineering and operations, technology, and analytics—which it combines with specialized expertise in clients' mission and domain areas to help solve their toughest problems.

The firm's management consulting heritage is the basis for its unique collaborative culture and operating model, enabling Booz Allen to anticipate needs and opportunities, rapidly deploy talent and resources, and deliver enduring results. By combining a consultant's problem-solving orientation with deep technical knowledge and strong execution, Booz Allen helps clients achieve success in their most critical missions—as evidenced by the firm's many client relationships that span decades. Booz Allen helps shape thinking and prepare for future developments in areas of national importance, including cybersecurity, homeland security, healthcare, and information technology.

Booz Allen is headquartered in McLean, Virginia, employs approximately 25,000 people, and had revenue of $5.86 billion for the 12 months ended March 31, 2012. For over a decade, Booz Allen's high standing as a business and an employer has been recognized by dozens of organizations and publications, including Fortune, Working Mother, G.I. Jobs, and DiversityInc. More information is available at www.boozallen.com. (NYSE: BAH)

Contacts

Mark Herman
Executive Vice President
herman_mark@bah.com
703-902-5986

Michael Delurey
Principal
delurey_mike@bah.com
703-902-6858
Principal Offices

Huntsville, Alabama; Sierra Vista, Arizona; Los Angeles, California; San Diego, California; San Francisco, California; Colorado Springs, Colorado; Denver, Colorado; District of Columbia; Orlando, Florida; Pensacola, Florida; Sarasota, Florida; Tampa, Florida; Atlanta, Georgia; Honolulu, Hawaii; O'Fallon, Illinois; Indianapolis, Indiana; Leavenworth, Kansas; Aberdeen, Maryland; Annapolis Junction, Maryland; Hanover, Maryland; Lexington Park, Maryland; Linthicum, Maryland; Rockville, Maryland; Troy, Michigan; Kansas City, Missouri; Omaha, Nebraska; Red Bank, New Jersey; New York, New York; Rome, New York; Dayton, Ohio; Philadelphia, Pennsylvania; Charleston, South Carolina; Houston, Texas; San Antonio, Texas; Abu Dhabi, United Arab Emirates; Alexandria, Virginia; Arlington, Virginia; Chantilly, Virginia; Charlottesville, Virginia; Falls Church, Virginia; Herndon, Virginia; McLean, Virginia; Norfolk, Virginia; Stafford, Virginia; Seattle, Washington

The most complete, recent list of offices and their addresses and telephone numbers can be found at www.boozallen.com.

www.boozallen.com/cloud

©2013 Booz Allen Hamilton Inc. 12.032.12G