SlideShare una empresa de Scribd logo
1 de 102
Descargar para leer sin conexión
WHAT DOES SCALABLE MEAN?
Operationally:
•  In the past: “Works even if data doesn’t fit in main memory”
•  Now: “Can make use of 1000s of cheap computers”
Formally:
•  In the past: If you have N data items, you must do no more than Nk
operations -- “polynomial time algorithms”
•  Soon: If you have N data items, you must do no more than N * log(N)
operations -- “logarithmic time algorithms”
•  As data sizes go up, you’ll only get one pass at the data
•  So you better make that one pass count
1/16/16
Bill Howe, eScience Institute
1
EXAMPLE:FIND MATCHING DNA SEQUENCES
Given a set of sequences
Find all sequences equal to “GATTACGATATTA”
1/16/16
Bill Howe, eScience Institute
2
GATTACGATATTATACCTGCCGTAA
1/16/16
Bill Howe, eScience Institute
4
TACCTGCCGTAA = GATTACGATATTA?
GATTACGATATTA
time = 0
No.
1/16/16
Bill Howe, eScience Institute
5
CCCCCAATGAC = GATTACGATATTA?
GATTACGATATTA
time = 1
No.
1/16/16
Bill Howe, eScience Institute
6
GATTACGATATTA contains GATTACGATATTA?
GATTACGATATTA
time = 17
Yes!
Send it to the output.
1/16/16
Bill Howe, eScience Institute
7
GATTACGATATTA
40 records, 40 comparions
N records, N comparisons
The algorithmic complexity is order N: O(N)
GATTACGATATTA
AAAATCCTGCA
AAACGCCTGCA
TTTACGTCAA
TTTTCGTAATT
What if we sort the sequences?
CTGTACACAACCT < GATTACGATATTA
GATTACGATATTA
No match.
Skip to 75% mark
Start at the 50% mark
time = 0
0% 100%
CTGTACACAACCT
GGATACACATTTA > GATTACGATATTA
GATTACGATATTA
No match.
Go back to 62.5% mark
time = 1
0% 100%
GGATACACATTTA
GATATTTTAAGC < GATTACGATATTA
GATTACGATATTA
No match.
Skip back to 68.75% mark
0% 100%
GATATTTTAAGC
GATTACGATATTA = GATTACGATATTA
GATTACGATATTA
0% 100%
Match!
Walk through the records until we
fail to match.
GATTACGATATTA
0% 100%
How many comparisons did we do?
40 records, only 4 comparisons
N records, log(N) comparisons
This algorithm is O(log(N)) Far better scalability
RELATIONAL DATABASES
Databases are good at “Needle in Haystack”
problems:
•  Extracting small results from big datasets
•  Transparently provide “old style” scalability
•  Your query will always* finish, regardless of dataset size.
•  Indexes are easily built and automatically used when
appropriate
1/16/16
Bill Howe, eScience Institute
14
CREATE INDEX seq_idx ON sequence(seq);
SELECT seq
FROM sequence
WHERE seq = ‘GATTACGATATTA’;
*almost
NEW TASK:READ TRIMMING
Given a set of DNA sequences
Trim the final n bps of each sequence
Generate a new dataset
1/16/16
Bill Howe, eScience Institute
15
GATTACGATATTATACCTGCCGTAA
TACCTGCCGTAA becomes TACCT
time = 0
CCCCCAATGAC becomes CCCCC
time = 1
GATTACGATATTA becomes GATTA
time = 17
Can we use an index?
No. We have to touch every record no matter what.
The task is fundamentally O(N)
Can we do any better?
time = 0
time = 1
time = 2
time = 3
time = 7
How much time did this take?
7 cycles
40 records, 6 workers
Ο(
N
k
)
You are given short
“reads”: genomic
sequences about
35-75 characters each
Distribute the reads
among k computers
f f f f f f
f is a function to
trim a read; apply it
to every item
Now we have a big
distributed set of
trimmed reads
SCHEMATIC OF A PARALLEL“READ
TRIMMING”TASK
You are given
TIFF images
Distribute the
images among
k computers
f f f f f f
f is a function
to convert
TIFF to PNG;
apply it to
every item
Now we have a
big distributed
set of converted
images
http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-
timesmachine/
NEW TASK:CONVERT 405K TIFF IMAGES TO
PNG
You have sets of
parameters for
thousands of
small simulations
Divide the
parameter sets
among k
computers
f f f f f f
f is runs the
simulation and
produces some
output; apply it
to every item
Now we have a
big distributed
set of simulation
results
http://escience.washington.edu/get-help-now/dave-williams-simulating-muscle-dynamics-cloud
NEW TASK:RUN THOUSANDS OF
SIMULATIONS
You have
millions of
documents
Distribute the
documents
among k
computers
f f f f f f
f finds the most
common word in
a single
document
Now we have a
big distributed
list of (doc_id,
word) pairs
FIND THE MOST COMMONWORD IN EACH
DOCUMENT
Consider a slightly more general program to compute the
word frequency of every word in a single document
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from that subordination in
which they have hitherto remained, and to assume among powers of the earth the equal and independent station
to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind
requires that they should declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent; that from that equal
creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and
the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just
power from the consent of the governed; that whenever any form of government shall become destructive of
these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's
foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect
their safety and happiness. Prudence indeed will dictate that governments long established should not be
changed for light and transient causes: and accordingly all experience hath shewn that mankind are more
disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are
accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing
invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to
throw off such government and to provide new guards for future security. Such has been the patient sufferings
of the colonies; and such is now the necessity which constrains them to expunge their former systems of
government. the history of his present majesty is a history of unremitting injuries and usurpations, among which
no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object
the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid
world, for the truth of which we pledge a faith yet unsullied by falsehood.
(people, 2)
(government, 6)
(assume, 1)
(history, 2)
…
You have millions
of documents
Distribute the
documents
among k
computers
f f f f f f
For each
document f
returns a set of
(word,freq)
pairs
Now we have a
big distributed
list of sets of
word freqs.
COMPUTE THEWORD FREQUENCY OF 5M
DOCUMENTS
THERE’S A PATTERN HERE….
A function that maps a read to a trimmed read
A function that maps a TIFF image to a PNG image
A function that maps a set of parameters to a simulation result
A function that maps a document to its most common word
A function that maps a document to a histogram of word frequencies
WHAT IFWEWANT TO COMPUTE THEWORD
FREQUENCY ACROSS ALL DOCUMENTS?
US Constitution
Declaration of
Independence
Articles of
Confederation(people, 78)
(government, 123)
(assume, 23)
(history, 38)
…
You have millions
of documents
Distribute the
documents
among k
computers
map map map map map map
For each
document,
return a set of
(word,freq)
pairs
Now what?
COMPUTE THEWORD FREQUENCY ACROSS 5M
DOCUMENTS
Distribute the
documents
among k
computers
map map map map map map
For each
document,
return a set of
(word,freq)
pairs
Now we have a
big distributed
list of sets of
word freqs.
reduce reduce reduce reduce
Now count the
occurrences of each
word
44 3
We have our
distributed
histogram
COMPUTE THEWORD FREQUENCY ACROSS 5M
DOCUMENTS
SOME DISTRIBUTED ALGORITHM…
1/16/16
Bill Howe, eScience Institute
38
Map
(Shuffle)
Reduce
MAPREDUCE PROGRAMMING MODEL
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
•  Processes input key/value pair
•  Produces set of intermediate pairs
•  Combines all intermediate values for a particular key
•  Produces a set of merged output values (usually just one)
1/16/16
Bill Howe, eScience Institute
39
map (in_key, in_value) -> list(out_key, intermediate_value)
reduce (out_key, list(intermediate_value)) -> list(out_value)
Inspired by primitives from functional programming
languages such as Lisp,Scheme,and Haskell
slide source: Google,
Inc.
EXAMPLE:WHAT DOES THIS DO?
1/16/16
Bill Howe, eScience Institute
40
map(String input_key, String input_value):
// input_key: document name
// input_value: document contents
for each word w in input_value:
EmitIntermediate(w, 1);
reduce(String output_key, Iterator
intermediate_values):
// output_key: word
// output_values: ????
int result = 0;
for each v in intermediate_values:
result += v;
Emit(result);
slide source: Google,
Inc.
EXAMPLE:DOCUMENT PROCESSING
3/12/09
Bill Howe, eScience Institute
41
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from that subordination in
which they have hitherto remained, and to assume among powers of the earth the equal and independent station
to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind
requires that they should declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent; that from that equal
creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and
the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just
power from the consent of the governed; that whenever any form of government shall become destructive of
these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's
foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect
their safety and happiness. Prudence indeed will dictate that governments long established should not be
changed for light and transient causes: and accordingly all experience hath shewn that mankind are more
disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are
accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing
invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to
throw off such government and to provide new guards for future security. Such has been the patient sufferings
of the colonies; and such is now the necessity which constrains them to expunge their former systems of
government. the history of his present majesty is a history of unremitting injuries and usurpations, among which
no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object
the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid
world, for the truth of which we pledge a faith yet unsullied by falsehood.
EXAMPLE:WORD LENGTH HISTOGRAM
3/12/09
Bill Howe, eScience Institute
42
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from that subordination in
which they have hitherto remained, and to assume among powers of the earth the equal and independent station
to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind
requires that they should declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent; that from that equal
creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and
the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just
power from the consent of the governed; that whenever any form of government shall become destructive of
these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's
foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect
their safety and happiness. Prudence indeed will dictate that governments long established should not be
changed for light and transient causes: and accordingly all experience hath shewn that mankind are more
disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are
accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing
invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to
throw off such government and to provide new guards for future security. Such has been the patient sufferings
of the colonies; and such is now the necessity which constrains them to expunge their former systems of
government. the history of his present majesty is a history of unremitting injuries and usurpations, among which
no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object
the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid
world, for the truth of which we pledge a faith yet unsullied by falsehood.
How many “big”, “medium”, and “small” words are used?
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General
Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from
that subordination in which they have hitherto remained, and to assume among powers of
the earth the equal and independent station to which the laws of nature and of nature's
god entitle them, a decent respect to the opinions of mankind requires that they should
declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent;
that from that equal creation they derive rights inherent and inalienable, among which are
the preservation of life, and liberty, and the pursuit of happiness; that to secure these
ends, governments are instituted among men, deriving their just power from the consent
of the governed; that whenever any form of government shall become destructive of these
ends, it is the right of the people to alter or to abolish it, and to institute new government,
laying it's foundation on such principles and organizing it's power in such form, as to
them shall seem most likely to effect their safety and happiness. Prudence indeed will
dictate that governments long established should not be changed for light and transient
causes: and accordingly all experience hath shewn that mankind are more disposed to
suffer while evils are sufferable, than to right themselves by abolishing the forms to
which they are accustomed. But when a long train of abuses and usurpations, begun at a
distinguished period, and pursuing invariably the same object, evinces a design to reduce
them to arbitrary power, it is their right, it is their duty, to throw off such government and
to provide new guards for future security. Such has been the patient sufferings of the
colonies; and such is now the necessity which constrains them to expunge their former
systems of government. the history of his present majesty is a history of unremitting
injuries and usurpations, among which no one fact stands single or solitary to contradict
the uniform tenor of the rest, all of which have in direct object the establishment of an
absolute tyranny over these states. To prove this, let facts be submitted to a candid world,
for the truth of which we pledge a faith yet unsullied by falsehood.
Big =Yellow = 10+ letters
Medium = Red = 5..9 letters
Small = Blue = 2.4 letters
Tiny = Pink = 1 letter
EXAMPLE:WORD LENGTH HISTOGRAM
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General
Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from
that subordination in which they have hitherto remained, and to assume among powers of
the earth the equal and independent station to which the laws of nature and of nature's
god entitle them, a decent respect to the opinions of mankind requires that they should
declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent;
that from that equal creation they derive rights inherent and inalienable, among which are
the preservation of life, and liberty, and the pursuit of happiness; that to secure these
ends, governments are instituted among men, deriving their just power from the consent
of the governed; that whenever any form of government shall become destructive of these
ends, it is the right of the people to alter or to abolish it, and to institute new government,
laying it's foundation on such principles and organizing it's power in such form, as to
them shall seem most likely to effect their safety and happiness. Prudence indeed will
dictate that governments long established should not be changed for light and transient
causes: and accordingly all experience hath shewn that mankind are more disposed to
suffer while evils are sufferable, than to right themselves by abolishing the forms to
which they are accustomed. But when a long train of abuses and usurpations, begun at a
distinguished period, and pursuing invariably the same object, evinces a design to reduce
them to arbitrary power, it is their right, it is their duty, to throw off such government and
to provide new guards for future security. Such has been the patient sufferings of the
colonies; and such is now the necessity which constrains them to expunge their former
systems of government. the history of his present majesty is a history of unremitting
injuries and usurpations, among which no one fact stands single or solitary to contradict
the uniform tenor of the rest, all of which have in direct object the establishment of an
absolute tyranny over these states. To prove this, let facts be submitted to a candid world,
for the truth of which we pledge a faith yet unsullied by falsehood.
EXAMPLE:WORD LENGTH HISTOGRAM
Split the document into
chunks and process
each chunk on a
different computer
Chunk 1
Chunk 2
(yellow, 20)
(red, 71)
(blue, 93)
(pink, 6 )
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General
Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from
that subordination in which they have hitherto remained, and to assume among powers of
the earth the equal and independent station to which the laws of nature and of nature's
god entitle them, a decent respect to the opinions of mankind requires that they should
declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent;
that from that equal creation they derive rights inherent and inalienable, among which are
the preservation of life, and liberty, and the pursuit of happiness; that to secure these
ends, governments are instituted among men, deriving their just power from the consent
of the governed; that whenever any form of government shall become destructive of these
ends, it is the right of the people to alter or to abolish it, and to institute new government,
laying it's foundation on such principles and organizing it's power in such form, as to
them shall seem most likely to effect their safety and happiness. Prudence indeed will
dictate that governments long established should not be changed for light and transient
causes: and accordingly all experience hath shewn that mankind are more disposed to
suffer while evils are sufferable, than to right themselves by abolishing the forms to
which they are accustomed. But when a long train of abuses and usurpations, begun at a
distinguished period, and pursuing invariably the same object, evinces a design to reduce
them to arbitrary power, it is their right, it is their duty, to throw off such government and
to provide new guards for future security. Such has been the patient sufferings of the
colonies; and such is now the necessity which constrains them to expunge their former
systems of government. the history of his present majesty is a history of unremitting
injuries and usurpations, among which no one fact stands single or solitary to contradict
the uniform tenor of the rest, all of which have in direct object the establishment of an
absolute tyranny over these states. To prove this, let facts be submitted to a candid world,
for the truth of which we pledge a faith yet unsullied by falsehood.
Map Task 1
(204 words)
Map Task 2
(190 words)
(key, value)
(yellow, 17)
(red, 77)
(blue, 107)
(pink, 3)
EXAMPLE:WORD LENGTH HISTOGRAM
EXAMPLE:WORD LENGTH HISTOGRAM
3/12/09
Bill Howe, eScience Institute
46
(yellow, 17)
(red, 77)
(blue, 107)
(pink, 3)
(yellow, 20)
(red, 71)
(blue, 93)
(pink, 6 )
Reduce tasks
(yellow, 17)
(yellow, 20)
(red, 77)
(red, 71)
(blue, 93)
(blue, 107)
(pink, 6)
(pink, 3)
A Declaration By the Representatives of the United States of America, in General
Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from
that subordination in which they have hitherto remained, and to assume among powers of
the earth the equal and independent station to which the laws of nature and of nature's
god entitle them, a decent respect to the opinions of mankind requires that they should
declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent;
that from that equal creation they derive rights inherent and inalienable, among which are
the preservation of life, and liberty, and the pursuit of happiness; that to secure these
ends, governments are instituted among men, deriving their just power from the consent
of the governed; that whenever any form of government shall become destructive of these
ends, it is the right of the people to alter or to abolish it, and to institute new government,
laying it's foundation on such principles and organizing it's power in such form, as to
them shall seem most likely to effect their safety and happiness. Prudence indeed will
dictate that governments long established should not be changed for light and transient
causes: and accordingly all experience hath shewn that mankind are more disposed to
suffer while evils are sufferable, than to right themselves by abolishing the forms to
which they are accustomed. But when a long train of abuses and usurpations, begun at a
distinguished period, and pursuing invariably the same object, evinces a design to reduce
them to arbitrary power, it is their right, it is their duty, to throw off such government and
to provide new guards for future security. Such has been the patient sufferings of the
colonies; and such is now the necessity which constrains them to expunge their former
systems of government. the history of his present majesty is a history of unremitting
injuries and usurpations, among which no one fact stands single or solitary to contradict
the uniform tenor of the rest, all of which have in direct object the establishment of an
absolute tyranny over these states. To prove this, let facts be submitted to a candid world,
for the truth of which we pledge a faith yet unsullied by falsehood.
Map task 1
Map task 2
“Shuffle step”
(yellow,
37)
(red, 148)
(blue, 200)
(pink, 9)
Local	storage	`	
MR PHASES
Each Map and Reduce task has multiple phases:
1/16/16
Bill Howe, eScience Institute
48
Hadoop in
One Slide
src: HuyVo,NYU
Poly
TAXONOMY OF PARALLEL ARCHITECTURES
1/16/16
Bill Howe, eScience Institute
49
Easiest to program,
but $$$$
Scales to 1000s of computers
DESIGN SPACE
1/16/16
Bill Howe
50
50	
Throughput	Latency	
Internet	
Private	
data	
center	
Data-	
parallel	
Shared	
memory	
The area we’re
discussing
inspired by a slide by Michael Isard at Microsoft Research
In a few weeks
1/16/16
Bill Howe, UW
51
LARGE-SCALE DATA PROCESSING
1/16/16
Bill Howe, eScience Institute
52
Many tasks process big data, produce big data
Want to use hundreds or thousands of CPUs
•  ... but this needs to be easy
•  Parallel databases exist, but they are expensive, difficult to set
up, and do not necessarily scale to hundreds of nodes.
MapReduce is a lightweight framework, providing:
•  Automatic parallelization and distribution
•  Fault-tolerance
•  I/O scheduling
•  Status and monitoring
KEY IDEA:DECLARATIVE LANGUAGES
1/16/16
Bill Howe, eScience Institute
53
SELECT *
FROM Order o, Item i
WHERE o.item = i.item
AND o.date = today()
join
select
scan scan
date = today()
o.item = i.item
Order oItem i
Find all orders from today,along with the items ordered
TWO NOTIONS OF PARALLEL QUERY
PROCESSING
“Distributed Query”
•  Rewrite the query as a union of subqueries
•  Workers communicate through standard interfaces, so compatible
with federated, heterogeneous, or distributed databases
“Parallel Query”
•  Each operator is implemented with a parallel algorithm
1/16/16
Bill Howe, UW
54
DISTRIBUTED QUERY EXAMPLE
1/16/16
Bill Howe, UW
55
CREATE VIEW Sales AS
SELECT * FROM JanSales
UNION ALL
SELECT * FROM FebSales
UNION ALL
SELECT * FROM MarSales
CREATE TABLE MarSales(
OrderID INT,
CustomerID INT NOT NULL,
OrderDate DATETIME NULL
CHECK (DATEPART(mm, OrderDate) = 3),
CONSTRAINT OrderIDMonth PRIMARY KEY(OrderID)
)
DISTRIBUTED FILE SYSTEM (DFS)
For very large files:TBs, PBs
Each file is partitioned into chunks, typically 64MB
Each chunk is replicated several times (≥3), on different racks, for fault
tolerance
Implementations:
•  Google’s DFS: GFS, proprietary
•  Hadoop’s DFS: HDFS, open source
CSE 344 – Winter 2012
56
PARALLEL QUERY EXAMPLE:TERADATA
1/16/16
Bill Howe, eScience Institute
57
AMP = unit of parallelism
EXAMPLE SYSTEM:TERADATA
1/16/16
Bill Howe, eScience Institute
58
SELECT *
FROM Orders o, Lines i
WHERE o.item = i.item
AND o.date = today()
Find all orders from today,along with the items ordered
join
select
scan scan
date = today()
o.item = i.item
Order oItem i
EXAMPLE SYSTEM:TERADATA
1/16/16
Bill Howe, eScience Institute
59
AMP 1 AMP 2 AMP 3
select
date=today()
select
date=today()
select
date=today()
scan
Order o
scan
Order o
scan
Order o
hash
h(item)
hash
h(item)
hash
h(item)
AMP 4 AMP 5 AMP 6
EXAMPLE SYSTEM:TERADATA
1/16/16
Bill Howe, eScience Institute
60
AMP 1 AMP 2 AMP 3
scan
Item i
AMP 4 AMP 5 AMP 6
hash
h(item)
scan
Item i
hash
h(item)
scan
Item i
hash
h(item)
EXAMPLE SYSTEM:TERADATA
1/16/16
Bill Howe, eScience Institute
61
AMP 4 AMP 5 AMP 6
join join join
o.item = i.item o.item = i.item o.item = i.item
contains all orders and all lines where
hash(item) = 1
contains all orders and all lines where
hash(item) = 2
contains all orders and all lines
where hash(item) = 3
MAPREDUCE CONTEMPORARIES
Dryad (Microsoft)
•  Relational Algebra
Pig (Yahoo)
•  Near Relational Algebra over MapReduce
HIVE (Facebook)
•  SQL over MapReduce
Cascading
•  Relational Algebra
Clustera
•  U of Wisconsin
Hbase
•  Indexing on HDFS
1/16/16
Bill Howe, eScience Institute
62
MAPREDUCEVS RDBMS
RDBMS
•  Declarative query languages
•  Schemas
•  Logical Data Independence
•  Indexing
•  Algebraic Optimization
•  Caching/Materialized Views
•  ACID/Transactions
MapReduce
•  High Scalability
•  Fault-tolerance
•  “One-person deployment”
1/16/16
Bill Howe, eScience Institute
63
HIVE, Pig, DryadLINQ
DryadLINQ, Pig, HIVE
Pig, (Dryad, HIVE)
Hbase
COMPARISON
1/16/16
Bill Howe, eScience Institute
64
Data Model Prog. Model Services
GPL * * Typing (maybe)
Workflow * dataflow typing, provenance,
scheduling, caching,
task parallelism, reuse
Relational
Algebra
Relations Select, Project,
Join, Aggregate, …
optimization, physical
data independence,
data parallelism
MapReduce [(key,value)] Map, Reduce massive data
parallelism, fault
tolerance
MS Dryad IQueryable,
IEnumerable
RA + Apply +
Partitioning
typing, massive data
parallelism, fault
tolerance
MPI Arrays/
Matrices
70+ ops data parallelism,
full control
Local	storage	`	
MAP REDUCE PHASES
Each Map and Reduce task has multiple phases:
MAP REDUCE
(w1,1)
(w2,1)
(w1,1)
…
(w1,1)
(w2,1)
…
(did1,v1)
(did2,v2)
(did3,v3)
. . . .
(w1, (1,1,1,…,1))
(w2, (1,1,…))
(w3,(1…))
…
…
…
…
(w1, 25)
(w2, 77)
(w3, 12)
…
…
…
…
Shuffle
Same word appears
twice. Why not just send
(w1,2)?
map map map map map map
reduce reduce reduce reduce
44 3
COUNTWORD OCCURRENCES ACROSS ALL
DOCUMENTS
MAP REDUCE
(w1,1)
(w2,1)
(w3,1)
…
(w1,1)
(w2,1)
…
(did1,v1)
(did2,v2)
(did3,v3)
. . . .
(w1, (1,1,1,…,1))
(w2, (1,1,…))
(w3,(1…))
…
…
…
…
(w1, 25)
(w2, 77)
(w3, 12)
…
…
…
…
Shuffle
MAP REDUCE
Google: paper published 2004
Free variant: Hadoop
Map-reduce = high-level programming
model and implementation for large-scale
parallel data processing
slide src: Dan Suciu and Magda Balazinska
69
DATA MODEL
Files !
A file = a bag of (key, value) pairs
A map-reduce program:
Input: a bag of (inputkey, value)pairs
Output: a bag of (outputkey, value)pairs
slide src: Dan Suciu and Magda Balazinska
70
STEP 1:THE MAP PHASE
User provides the MAP-function:
Input: (input key, value)
Ouput:
bag of (intermediate key, value)
System applies the map function in parallel to all
(input key, value) pairs in the input file
CSE 344 – Winter 2012
71
STEP 2:THE REDUCE PHASE
User provides the REDUCE function:
Input:
(intermediate key, bag of values)
Output: bag of output (values)
The system will group all pairs with the same
intermediate key, and passes the bag of values to the
REDUCE function
CSE 344 – Winter 2012
72
MAPREDUCE PROGRAMMING MODEL
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
•  Processes input key/value pair
•  Produces set of intermediate pairs
•  Combines all intermediate values for a particular key
•  Produces a set of merged output values (usually just one)
1/16/16
Bill Howe, eScience Institute
73
map (in_key, in_value) -> list(out_key, intermediate_value)
reduce (out_key, list(intermediate_value)) -> list(out_value)
Inspired by primitives from functional programming
languages such as Lisp,Scheme,and Haskell
slide source: Google, Inc.
EXAMPLE:WHAT DOES THIS DO?
1/16/16
Bill Howe, eScience Institute
74
map(String input_key, String input_value):
// input_key: document name
// input_value: document contents
for each word w in input_value:
EmitIntermediate(w, 1);
reduce(String output_key, Iterator, intermediate_values):
// output_key: word
// output_values: ????
int result = 0;
for each v in intermediate_values:
result += v;
Emit(result);
slide source: Google, Inc.
MORE EXAMPLES:BUILD AN INVERTED
INDEX
1/16/16
Bill Howe, UW
75
Input:
tweet1, (“I love pancakes for breakfast”)
tweet2, (“I dislike pancakes”)
tweet3, (“What should I eat for breakfast?”)
tweet4, (“I love to eat”)
Desired output:
“pancakes”, (tweet1, tweet2)
“breakfast”, (tweet1, tweet3)
“eat”, (tweet3, tweet4)
“love”, (tweet1, tweet4)
…
MORE EXAMPLES:RELATIONAL JOIN
1/16/16
Bill Howe, UW
76
Name SSN
Sue 999999999
Tony 777777777
EmpSSN DepName
999999999 Accounts
777777777 Sales
777777777 Marketing
Emplyee ⨝ Assigned Departments
Employee Assigned Departments
Name SSN EmpSSN DepName
Sue 999999999 999999999 Accounts
Tony 777777777 777777777 Sales
Tony 777777777 777777777 Marketing
RELATIONAL JOIN IN MAPREDUCE:BEFORE
MAP PHASE
1/16/16
Bill Howe, UW
77
Name SSN
Sue 999999999
Tony 777777777
EmpSSN DepName
999999999 Accounts
777777777 Sales
777777777 Marketing
Employee
Assigned Departments
Key idea: Lump all the
tuples together into one
dataset
Employee, Sue, 999999999
Employee,Tony, 777777777
Department, 999999999, Accounts
Department, 777777777, Sales
Department, 777777777, Marketing
What is this for?
RELATIONAL JOIN IN MAPREDUCE:MAP
PHASE
1/16/16
Bill Howe, UW
78
Employee, Sue, 999999999
Employee,Tony, 777777777
Department, 999999999, Accounts
Department, 777777777, Sales
Department, 777777777, Marketing
key=999999999, value=(Employee, Sue, 999999999)
key=777777777, value=(Employee,Tony, 777777777)
key=999999999, value=(Department, 999999999, Accounts)
key=777777777, value=(Department, 777777777, Sales)
key=777777777, value=(Department, 777777777, Marketing)
why do we use this as the key?
RELATIONAL JOIN IN MAPREDUCE:REDUCE
PHASE
1/16/16
Bill Howe, UW
79
key=999999999, values=[(Employee, Sue, 999999999),
(Department, 999999999, Accounts)]
key=777777777, values=[(Employee,Tony, 777777777),
(Department, 777777777, Sales),
(Department, 777777777, Marketing)]
Sue, 999999999, 999999999, Accounts
Tony, 777777777, 777777777, Sales
Tony, 777777777, 777777777, Marketing
RELATIONAL JOIN IN MAPREDUCE,AGAIN
1/16/16
Bill Howe, Data Science, Autumn 2012
80
Order(orderid, account, date)
1, aaa, d1
2, aaa, d2
3, bbb, d3
LineItem(orderid, itemid, qty)
1, 10, 1
1, 20, 3
2, 10, 5
2, 50, 100
3, 20, 1tagged with
relation
nameOrder
1, aaa, d1 à 1 :“Order”, (1,aaa,d1)
2, aaa, d2 à 2 :“Order”, (2,aaa,d2)
3, bbb, d3 à 3 :“Order”, (3,bbb,d3)
Line
1, 10, 1 à 1 :“Line”, (1, 10, 1)
1, 20, 3 à 1 :“Line”, (1, 20, 3)
2, 10, 5 à 2 :“Line”, (2, 10, 5)
2, 50, 100 à 2 :“Line”, (2, 50, 100)
3, 20, 1 à 3 :“Line”, (3, 20, 1)
Map
Reducer for key 1
“Order”, (1,aaa,d1)
“Line”, (1, 10, 1)
“Line”, (1, 20, 3)
(1, aaa, d1, 1, 10, 1)
(1, aaa, d1, 1, 20, 3)
SIMPLE SOCIAL NETWORK ANALYSIS:
COUNT FRIENDS
1/16/16
Bill Howe, UW
81
Jim, Sue
Sue, Jim
Lin, Joe
Joe, Lin
Jim, Kai
Kai, Jim
Jim, Lin
Lin, Jim
Input Desired Output
Jim, 3
Lin, 2
Sue, 1
Kai, 1
Joe, 1
Jim,1
Sue,1
Lin,1
Joe,1
Jim,1
Kai,1
Jim,1
Lin,1
MAP REDUCE
Jim,(1,1,1)
Lin,(1,1)
Sue,(1)
Kai,(1)
Joe,(1)
SHUFFLE
MATRIX MULTIPLY IN MAPREDUCE
C = A X B
A has dimensions L,M
B has dimensions M,N
In the map phase:
•  for each element (i,j) of A, emit ((i,k), A[i,j]) for k in 1..N
•  for each element (j,k) of B, emit ((i,k), B[j,k]) for i in 1..L
In the reduce phase, emit
•  key = (i,k)
•  value = Sumj (A[i,j] * B[j,k])
1/16/16
Bill Howe, Data Science, Autumn 2012
82
1/16/16
Bill Howe, Data Science, Autumn 2012
83
AB
X
A B
=
•  One reducer per output cell
• 
•  Each reducer computes: Ai, jBj,k
j
∑
CLUSTER COMPUTING
Large number of commodity servers, connected
by high speed, commodity network
Rack: holds a small number of servers
Data center: holds many racks
CSE 344 – Winter 2012
84
CLUSTER COMPUTING
Massive parallelism:
•  100s, or 1000s, or 10000s servers
•  Many hours
Failure:
•  If medium-time-between-failure is 1 year
•  Then 10000 servers have one failure / hour
slide src: Dan Suciu and Magda Balazinska
85
DISTRIBUTED FILE SYSTEM (DFS)
For very large files:TBs, PBs
Each file is partitioned into chunks, typically 64MB
Each chunk is replicated several times (≥3), on different
racks, for fault tolerance
Implementations:
•  Google’s DFS: GFS, proprietary
•  Hadoop’s DFS: HDFS, open source
slide src: Dan Suciu and Magda Balazinska
86
MAP REDUCE
VS.DATABASES
1/16/16
87
Bill Howe, UW
HADOOPVS.RDBMS
Comparison of 3 systems
•  Hadoop
•  Vertica (a column-oriented database)
•  DBMS-X (a row-oriented database)
•  rhymes with “schmoracle”
Qualitative
•  Programming model, ease of setup, features, etc.
Quantitative
•  Data loading, different types of queries
1/16/16
Bill Howe, eScience Institute
88
Pavlo et al.2009
GREP TASK
•  Find 3-byte pattern in 100-byte record
•  1 match per 10,000 records
•  Data set:
•  10-byte unique key, 90-byte value
•  1TB spread across 25, 50, or 100 nodes
•  10 billion records
•  Original MR Paper (Dean et al. 2004)
89
GREP TASK LOADING RESULTS
90
Seconds
GREP TASK EXECUTION RESULTS
91
Seconds
SELECTION TASK
92
SELECT pageURL, pageRank
FROM Rankings WHERE pageRank > X
1 GB / node
ANALYTICAL TASKS
•  Simple web programming schema
•  Data set
•  600k HTML Documents (6GB/node)
•  155 million UserVisit records (20GB/node)
•  18 million Rankings records (1GB/node)
93
AGGREGATE TASK
•  Simple query to find adRevenue by IP prefix
94
SELECT SUBSTR(sourceIP, 1, 7),
SUM (adRevenue)
FROM userVisits
GROUP BY SUBSTR (sourceIP, 1, 7)
AGGREGATE TASK RESULTS
95
JOIN TASK
•  Find the sourceIP that generated the most adRevenue along
with its average pageRank
•  Implementations:
•  DMPSs – complex SQL using temporary table
•  MapReduce – Three separate MR programs
96
JOIN TASK
97
SELECT INTO TempsourceIP,
AVG(pageRank)as avgPageRank,
SUM(adRevenue)as totalRevenue
FROM RankingsAS R
, UserVisitsAS UV
WHERE R.pageURL = UV.destURL
AND UV.visitDate
BETWEEN ‘2000-01-15’
AND ‘2000-01-22’
GROUP BY UV.sourceIP;
SELECT sourceIP,
totalRevenue,
avgPageRank
FROM Temp
ORDER BY totalRevenueDESC
LIMIT 1;
JOIN TASK RESULTS
98
PROBLEMSWITH THIS ANALYSIS?
Other ways to avoid sequential scans?
Fault-tolerance in large clusters?
Tasks that cannot be expressed as queries?
1/16/16
Bill Howe, UW
99
GOOGLE’S RESPONSE:CLUSTER SIZE
•  Largest known database installations
•  Greenplum – 96 nodes – 4.5 PB (eBay)[1]
•  Teradata – 72 nodes – 2+PB (eBay)[1]
•  Largest known MR installations:
•  Hadoop – 3658 nodes – 1 PB (Yahoo)[2]
•  Hive – 600+ nodes – 2.5 PB (Facebook)[3]
[1] eBay’s two enormous data warehouses – April 30th, 2009
http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/
[2] Hadoop sorts a petabyte in 16.25 hours and a terabyte in 62 seconds – May 11th, 2009
https://developer.yahoo.com/blogs/hadoop/hadoop-sorts-petabyte-16-25-hours-terabyte-62-422.html
[3] Hive – A Petabyte Scale DataWarehouse using Hadoop – June 10th, 2009
https://www.facebook.com/notes/facebook-engineering/hive-a-petabyte-scale-data-warehouse-using-hadoop/89508453919
100
CONCLUDING REMARKS
•  What can MapReduce learn from Databases?
•  Declarative languages are a good thing
•  Schemas are important
•  What can Databases learn from MapReduce?
•  Query fault-tolerance
•  Support for in situ data
•  Embrace open-source
101
SLIDES CAN BE FOUND AT:
TEACHINGDATASCIENCE.ORG

Más contenido relacionado

Similar a Bill howe 3_mapreduce

Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Vivian S. Zhang
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014P. Taylor Goetz
 
C++ Standard Template Library
C++ Standard Template LibraryC++ Standard Template Library
C++ Standard Template LibraryIlio Catallo
 
Concurrent programming with RTOS
Concurrent programming with RTOSConcurrent programming with RTOS
Concurrent programming with RTOSSirin Software
 
Lambda? You Keep Using that Letter
Lambda? You Keep Using that LetterLambda? You Keep Using that Letter
Lambda? You Keep Using that LetterKevlin Henney
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesOleksii Diagiliev
 
Our Concurrent Past; Our Distributed Future
Our Concurrent Past; Our Distributed FutureOur Concurrent Past; Our Distributed Future
Our Concurrent Past; Our Distributed FutureC4Media
 
Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.Dan Lynn
 
Zero to a Billion in 4.86 Years (A Whirlwind History of Operating Systems)
Zero to a Billion in 4.86 Years (A Whirlwind History of Operating Systems)Zero to a Billion in 4.86 Years (A Whirlwind History of Operating Systems)
Zero to a Billion in 4.86 Years (A Whirlwind History of Operating Systems)David Evans
 
Functional data structures
Functional data structuresFunctional data structures
Functional data structuresRalf Laemmel
 
The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Bir...
The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Bir...The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Bir...
The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Bir...Revolution Analytics
 
Modeling and Reasoning in Event Calculus using Goal-Directed Constraint Answe...
Modeling and Reasoning in Event Calculus using Goal-Directed Constraint Answe...Modeling and Reasoning in Event Calculus using Goal-Directed Constraint Answe...
Modeling and Reasoning in Event Calculus using Goal-Directed Constraint Answe...Universidad Rey Juan Carlos
 
Streams processing with Storm
Streams processing with StormStreams processing with Storm
Streams processing with StormMariusz Gil
 
High Performance Systems Without Tears - Scala Days Berlin 2018
High Performance Systems Without Tears - Scala Days Berlin 2018High Performance Systems Without Tears - Scala Days Berlin 2018
High Performance Systems Without Tears - Scala Days Berlin 2018Zahari Dichev
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012Dan Lynn
 
Storm - The Real-Time Layer Your Big Data's Been Missing
Storm - The Real-Time Layer Your Big Data's Been MissingStorm - The Real-Time Layer Your Big Data's Been Missing
Storm - The Real-Time Layer Your Big Data's Been MissingFullContact
 

Similar a Bill howe 3_mapreduce (20)

Introduction to Apache Storm
Introduction to Apache StormIntroduction to Apache Storm
Introduction to Apache Storm
 
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
C++ Standard Template Library
C++ Standard Template LibraryC++ Standard Template Library
C++ Standard Template Library
 
01 Machine Learning Introduction
01 Machine Learning Introduction 01 Machine Learning Introduction
01 Machine Learning Introduction
 
Concurrent programming with RTOS
Concurrent programming with RTOSConcurrent programming with RTOS
Concurrent programming with RTOS
 
stack.ppt
stack.pptstack.ppt
stack.ppt
 
Lambda? You Keep Using that Letter
Lambda? You Keep Using that LetterLambda? You Keep Using that Letter
Lambda? You Keep Using that Letter
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpaces
 
Our Concurrent Past; Our Distributed Future
Our Concurrent Past; Our Distributed FutureOur Concurrent Past; Our Distributed Future
Our Concurrent Past; Our Distributed Future
 
Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.
 
Zero to a Billion in 4.86 Years (A Whirlwind History of Operating Systems)
Zero to a Billion in 4.86 Years (A Whirlwind History of Operating Systems)Zero to a Billion in 4.86 Years (A Whirlwind History of Operating Systems)
Zero to a Billion in 4.86 Years (A Whirlwind History of Operating Systems)
 
Functional data structures
Functional data structuresFunctional data structures
Functional data structures
 
The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Bir...
The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Bir...The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Bir...
The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Bir...
 
Storm introduction
Storm introductionStorm introduction
Storm introduction
 
Modeling and Reasoning in Event Calculus using Goal-Directed Constraint Answe...
Modeling and Reasoning in Event Calculus using Goal-Directed Constraint Answe...Modeling and Reasoning in Event Calculus using Goal-Directed Constraint Answe...
Modeling and Reasoning in Event Calculus using Goal-Directed Constraint Answe...
 
Streams processing with Storm
Streams processing with StormStreams processing with Storm
Streams processing with Storm
 
High Performance Systems Without Tears - Scala Days Berlin 2018
High Performance Systems Without Tears - Scala Days Berlin 2018High Performance Systems Without Tears - Scala Days Berlin 2018
High Performance Systems Without Tears - Scala Days Berlin 2018
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012
 
Storm - The Real-Time Layer Your Big Data's Been Missing
Storm - The Real-Time Layer Your Big Data's Been MissingStorm - The Real-Time Layer Your Big Data's Been Missing
Storm - The Real-Time Layer Your Big Data's Been Missing
 

Más de Mahammad Valiyev

Más de Mahammad Valiyev (6)

Bill howe 8_graphs
Bill howe 8_graphsBill howe 8_graphs
Bill howe 8_graphs
 
Bill howe 7_machinelearning_2
Bill howe 7_machinelearning_2Bill howe 7_machinelearning_2
Bill howe 7_machinelearning_2
 
Bill howe 6_machinelearning_1
Bill howe 6_machinelearning_1Bill howe 6_machinelearning_1
Bill howe 6_machinelearning_1
 
Bill howe 5_statistics
Bill howe 5_statisticsBill howe 5_statistics
Bill howe 5_statistics
 
Bill howe 4_bigdatasystems
Bill howe 4_bigdatasystemsBill howe 4_bigdatasystems
Bill howe 4_bigdatasystems
 
Bill howe 2_databases
Bill howe 2_databasesBill howe 2_databases
Bill howe 2_databases
 

Último

LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 

Último (20)

LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 

Bill howe 3_mapreduce

  • 1. WHAT DOES SCALABLE MEAN? Operationally: •  In the past: “Works even if data doesn’t fit in main memory” •  Now: “Can make use of 1000s of cheap computers” Formally: •  In the past: If you have N data items, you must do no more than Nk operations -- “polynomial time algorithms” •  Soon: If you have N data items, you must do no more than N * log(N) operations -- “logarithmic time algorithms” •  As data sizes go up, you’ll only get one pass at the data •  So you better make that one pass count 1/16/16 Bill Howe, eScience Institute 1
  • 2. EXAMPLE:FIND MATCHING DNA SEQUENCES Given a set of sequences Find all sequences equal to “GATTACGATATTA” 1/16/16 Bill Howe, eScience Institute 2
  • 4. 1/16/16 Bill Howe, eScience Institute 4 TACCTGCCGTAA = GATTACGATATTA? GATTACGATATTA time = 0 No.
  • 5. 1/16/16 Bill Howe, eScience Institute 5 CCCCCAATGAC = GATTACGATATTA? GATTACGATATTA time = 1 No.
  • 6. 1/16/16 Bill Howe, eScience Institute 6 GATTACGATATTA contains GATTACGATATTA? GATTACGATATTA time = 17 Yes! Send it to the output.
  • 7. 1/16/16 Bill Howe, eScience Institute 7 GATTACGATATTA 40 records, 40 comparions N records, N comparisons The algorithmic complexity is order N: O(N)
  • 9. CTGTACACAACCT < GATTACGATATTA GATTACGATATTA No match. Skip to 75% mark Start at the 50% mark time = 0 0% 100% CTGTACACAACCT
  • 10. GGATACACATTTA > GATTACGATATTA GATTACGATATTA No match. Go back to 62.5% mark time = 1 0% 100% GGATACACATTTA
  • 11. GATATTTTAAGC < GATTACGATATTA GATTACGATATTA No match. Skip back to 68.75% mark 0% 100% GATATTTTAAGC
  • 12. GATTACGATATTA = GATTACGATATTA GATTACGATATTA 0% 100% Match! Walk through the records until we fail to match.
  • 13. GATTACGATATTA 0% 100% How many comparisons did we do? 40 records, only 4 comparisons N records, log(N) comparisons This algorithm is O(log(N)) Far better scalability
  • 14. RELATIONAL DATABASES Databases are good at “Needle in Haystack” problems: •  Extracting small results from big datasets •  Transparently provide “old style” scalability •  Your query will always* finish, regardless of dataset size. •  Indexes are easily built and automatically used when appropriate 1/16/16 Bill Howe, eScience Institute 14 CREATE INDEX seq_idx ON sequence(seq); SELECT seq FROM sequence WHERE seq = ‘GATTACGATATTA’; *almost
  • 15. NEW TASK:READ TRIMMING Given a set of DNA sequences Trim the final n bps of each sequence Generate a new dataset 1/16/16 Bill Howe, eScience Institute 15
  • 20. Can we use an index? No. We have to touch every record no matter what. The task is fundamentally O(N) Can we do any better?
  • 21.
  • 26. time = 7 How much time did this take? 7 cycles 40 records, 6 workers Ο( N k )
  • 27. You are given short “reads”: genomic sequences about 35-75 characters each Distribute the reads among k computers f f f f f f f is a function to trim a read; apply it to every item Now we have a big distributed set of trimmed reads SCHEMATIC OF A PARALLEL“READ TRIMMING”TASK
  • 28. You are given TIFF images Distribute the images among k computers f f f f f f f is a function to convert TIFF to PNG; apply it to every item Now we have a big distributed set of converted images http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services- timesmachine/ NEW TASK:CONVERT 405K TIFF IMAGES TO PNG
  • 29. You have sets of parameters for thousands of small simulations Divide the parameter sets among k computers f f f f f f f is runs the simulation and produces some output; apply it to every item Now we have a big distributed set of simulation results http://escience.washington.edu/get-help-now/dave-williams-simulating-muscle-dynamics-cloud NEW TASK:RUN THOUSANDS OF SIMULATIONS
  • 30. You have millions of documents Distribute the documents among k computers f f f f f f f finds the most common word in a single document Now we have a big distributed list of (doc_id, word) pairs FIND THE MOST COMMONWORD IN EACH DOCUMENT
  • 31. Consider a slightly more general program to compute the word frequency of every word in a single document Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood. (people, 2) (government, 6) (assume, 1) (history, 2) …
  • 32. You have millions of documents Distribute the documents among k computers f f f f f f For each document f returns a set of (word,freq) pairs Now we have a big distributed list of sets of word freqs. COMPUTE THEWORD FREQUENCY OF 5M DOCUMENTS
  • 33. THERE’S A PATTERN HERE…. A function that maps a read to a trimmed read A function that maps a TIFF image to a PNG image A function that maps a set of parameters to a simulation result A function that maps a document to its most common word A function that maps a document to a histogram of word frequencies
  • 34. WHAT IFWEWANT TO COMPUTE THEWORD FREQUENCY ACROSS ALL DOCUMENTS? US Constitution Declaration of Independence Articles of Confederation(people, 78) (government, 123) (assume, 23) (history, 38) …
  • 35. You have millions of documents Distribute the documents among k computers map map map map map map For each document, return a set of (word,freq) pairs Now what? COMPUTE THEWORD FREQUENCY ACROSS 5M DOCUMENTS
  • 36.
  • 37. Distribute the documents among k computers map map map map map map For each document, return a set of (word,freq) pairs Now we have a big distributed list of sets of word freqs. reduce reduce reduce reduce Now count the occurrences of each word 44 3 We have our distributed histogram COMPUTE THEWORD FREQUENCY ACROSS 5M DOCUMENTS
  • 38. SOME DISTRIBUTED ALGORITHM… 1/16/16 Bill Howe, eScience Institute 38 Map (Shuffle) Reduce
  • 39. MAPREDUCE PROGRAMMING MODEL Input & Output: each a set of key/value pairs Programmer specifies two functions: •  Processes input key/value pair •  Produces set of intermediate pairs •  Combines all intermediate values for a particular key •  Produces a set of merged output values (usually just one) 1/16/16 Bill Howe, eScience Institute 39 map (in_key, in_value) -> list(out_key, intermediate_value) reduce (out_key, list(intermediate_value)) -> list(out_value) Inspired by primitives from functional programming languages such as Lisp,Scheme,and Haskell slide source: Google, Inc.
  • 40. EXAMPLE:WHAT DOES THIS DO? 1/16/16 Bill Howe, eScience Institute 40 map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: EmitIntermediate(w, 1); reduce(String output_key, Iterator intermediate_values): // output_key: word // output_values: ???? int result = 0; for each v in intermediate_values: result += v; Emit(result); slide source: Google, Inc.
  • 41. EXAMPLE:DOCUMENT PROCESSING 3/12/09 Bill Howe, eScience Institute 41 Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.
  • 42. EXAMPLE:WORD LENGTH HISTOGRAM 3/12/09 Bill Howe, eScience Institute 42 Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood. How many “big”, “medium”, and “small” words are used?
  • 43. Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood. Big =Yellow = 10+ letters Medium = Red = 5..9 letters Small = Blue = 2.4 letters Tiny = Pink = 1 letter EXAMPLE:WORD LENGTH HISTOGRAM
  • 44. Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood. EXAMPLE:WORD LENGTH HISTOGRAM Split the document into chunks and process each chunk on a different computer Chunk 1 Chunk 2
  • 45. (yellow, 20) (red, 71) (blue, 93) (pink, 6 ) Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood. Map Task 1 (204 words) Map Task 2 (190 words) (key, value) (yellow, 17) (red, 77) (blue, 107) (pink, 3) EXAMPLE:WORD LENGTH HISTOGRAM
  • 46. EXAMPLE:WORD LENGTH HISTOGRAM 3/12/09 Bill Howe, eScience Institute 46 (yellow, 17) (red, 77) (blue, 107) (pink, 3) (yellow, 20) (red, 71) (blue, 93) (pink, 6 ) Reduce tasks (yellow, 17) (yellow, 20) (red, 77) (red, 71) (blue, 93) (blue, 107) (pink, 6) (pink, 3) A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood. Map task 1 Map task 2 “Shuffle step” (yellow, 37) (red, 148) (blue, 200) (pink, 9)
  • 47. Local storage ` MR PHASES Each Map and Reduce task has multiple phases:
  • 48. 1/16/16 Bill Howe, eScience Institute 48 Hadoop in One Slide src: HuyVo,NYU Poly
  • 49. TAXONOMY OF PARALLEL ARCHITECTURES 1/16/16 Bill Howe, eScience Institute 49 Easiest to program, but $$$$ Scales to 1000s of computers
  • 50. DESIGN SPACE 1/16/16 Bill Howe 50 50 Throughput Latency Internet Private data center Data- parallel Shared memory The area we’re discussing inspired by a slide by Michael Isard at Microsoft Research In a few weeks
  • 52. LARGE-SCALE DATA PROCESSING 1/16/16 Bill Howe, eScience Institute 52 Many tasks process big data, produce big data Want to use hundreds or thousands of CPUs •  ... but this needs to be easy •  Parallel databases exist, but they are expensive, difficult to set up, and do not necessarily scale to hundreds of nodes. MapReduce is a lightweight framework, providing: •  Automatic parallelization and distribution •  Fault-tolerance •  I/O scheduling •  Status and monitoring
  • 53. KEY IDEA:DECLARATIVE LANGUAGES 1/16/16 Bill Howe, eScience Institute 53 SELECT * FROM Order o, Item i WHERE o.item = i.item AND o.date = today() join select scan scan date = today() o.item = i.item Order oItem i Find all orders from today,along with the items ordered
  • 54. TWO NOTIONS OF PARALLEL QUERY PROCESSING “Distributed Query” •  Rewrite the query as a union of subqueries •  Workers communicate through standard interfaces, so compatible with federated, heterogeneous, or distributed databases “Parallel Query” •  Each operator is implemented with a parallel algorithm 1/16/16 Bill Howe, UW 54
  • 55. DISTRIBUTED QUERY EXAMPLE 1/16/16 Bill Howe, UW 55 CREATE VIEW Sales AS SELECT * FROM JanSales UNION ALL SELECT * FROM FebSales UNION ALL SELECT * FROM MarSales CREATE TABLE MarSales( OrderID INT, CustomerID INT NOT NULL, OrderDate DATETIME NULL CHECK (DATEPART(mm, OrderDate) = 3), CONSTRAINT OrderIDMonth PRIMARY KEY(OrderID) )
  • 56. DISTRIBUTED FILE SYSTEM (DFS) For very large files:TBs, PBs Each file is partitioned into chunks, typically 64MB Each chunk is replicated several times (≥3), on different racks, for fault tolerance Implementations: •  Google’s DFS: GFS, proprietary •  Hadoop’s DFS: HDFS, open source CSE 344 – Winter 2012 56
  • 57. PARALLEL QUERY EXAMPLE:TERADATA 1/16/16 Bill Howe, eScience Institute 57 AMP = unit of parallelism
  • 58. EXAMPLE SYSTEM:TERADATA 1/16/16 Bill Howe, eScience Institute 58 SELECT * FROM Orders o, Lines i WHERE o.item = i.item AND o.date = today() Find all orders from today,along with the items ordered join select scan scan date = today() o.item = i.item Order oItem i
  • 59. EXAMPLE SYSTEM:TERADATA 1/16/16 Bill Howe, eScience Institute 59 AMP 1 AMP 2 AMP 3 select date=today() select date=today() select date=today() scan Order o scan Order o scan Order o hash h(item) hash h(item) hash h(item) AMP 4 AMP 5 AMP 6
  • 60. EXAMPLE SYSTEM:TERADATA 1/16/16 Bill Howe, eScience Institute 60 AMP 1 AMP 2 AMP 3 scan Item i AMP 4 AMP 5 AMP 6 hash h(item) scan Item i hash h(item) scan Item i hash h(item)
  • 61. EXAMPLE SYSTEM:TERADATA 1/16/16 Bill Howe, eScience Institute 61 AMP 4 AMP 5 AMP 6 join join join o.item = i.item o.item = i.item o.item = i.item contains all orders and all lines where hash(item) = 1 contains all orders and all lines where hash(item) = 2 contains all orders and all lines where hash(item) = 3
  • 62. MAPREDUCE CONTEMPORARIES Dryad (Microsoft) •  Relational Algebra Pig (Yahoo) •  Near Relational Algebra over MapReduce HIVE (Facebook) •  SQL over MapReduce Cascading •  Relational Algebra Clustera •  U of Wisconsin Hbase •  Indexing on HDFS 1/16/16 Bill Howe, eScience Institute 62
  • 63. MAPREDUCEVS RDBMS RDBMS •  Declarative query languages •  Schemas •  Logical Data Independence •  Indexing •  Algebraic Optimization •  Caching/Materialized Views •  ACID/Transactions MapReduce •  High Scalability •  Fault-tolerance •  “One-person deployment” 1/16/16 Bill Howe, eScience Institute 63 HIVE, Pig, DryadLINQ DryadLINQ, Pig, HIVE Pig, (Dryad, HIVE) Hbase
  • 64. COMPARISON 1/16/16 Bill Howe, eScience Institute 64 Data Model Prog. Model Services GPL * * Typing (maybe) Workflow * dataflow typing, provenance, scheduling, caching, task parallelism, reuse Relational Algebra Relations Select, Project, Join, Aggregate, … optimization, physical data independence, data parallelism MapReduce [(key,value)] Map, Reduce massive data parallelism, fault tolerance MS Dryad IQueryable, IEnumerable RA + Apply + Partitioning typing, massive data parallelism, fault tolerance MPI Arrays/ Matrices 70+ ops data parallelism, full control
  • 65. Local storage ` MAP REDUCE PHASES Each Map and Reduce task has multiple phases:
  • 66. MAP REDUCE (w1,1) (w2,1) (w1,1) … (w1,1) (w2,1) … (did1,v1) (did2,v2) (did3,v3) . . . . (w1, (1,1,1,…,1)) (w2, (1,1,…)) (w3,(1…)) … … … … (w1, 25) (w2, 77) (w3, 12) … … … … Shuffle Same word appears twice. Why not just send (w1,2)?
  • 67. map map map map map map reduce reduce reduce reduce 44 3 COUNTWORD OCCURRENCES ACROSS ALL DOCUMENTS
  • 68. MAP REDUCE (w1,1) (w2,1) (w3,1) … (w1,1) (w2,1) … (did1,v1) (did2,v2) (did3,v3) . . . . (w1, (1,1,1,…,1)) (w2, (1,1,…)) (w3,(1…)) … … … … (w1, 25) (w2, 77) (w3, 12) … … … … Shuffle
  • 69. MAP REDUCE Google: paper published 2004 Free variant: Hadoop Map-reduce = high-level programming model and implementation for large-scale parallel data processing slide src: Dan Suciu and Magda Balazinska 69
  • 70. DATA MODEL Files ! A file = a bag of (key, value) pairs A map-reduce program: Input: a bag of (inputkey, value)pairs Output: a bag of (outputkey, value)pairs slide src: Dan Suciu and Magda Balazinska 70
  • 71. STEP 1:THE MAP PHASE User provides the MAP-function: Input: (input key, value) Ouput: bag of (intermediate key, value) System applies the map function in parallel to all (input key, value) pairs in the input file CSE 344 – Winter 2012 71
  • 72. STEP 2:THE REDUCE PHASE User provides the REDUCE function: Input: (intermediate key, bag of values) Output: bag of output (values) The system will group all pairs with the same intermediate key, and passes the bag of values to the REDUCE function CSE 344 – Winter 2012 72
  • 73. MAPREDUCE PROGRAMMING MODEL Input & Output: each a set of key/value pairs Programmer specifies two functions: •  Processes input key/value pair •  Produces set of intermediate pairs •  Combines all intermediate values for a particular key •  Produces a set of merged output values (usually just one) 1/16/16 Bill Howe, eScience Institute 73 map (in_key, in_value) -> list(out_key, intermediate_value) reduce (out_key, list(intermediate_value)) -> list(out_value) Inspired by primitives from functional programming languages such as Lisp,Scheme,and Haskell slide source: Google, Inc.
  • 74. EXAMPLE:WHAT DOES THIS DO? 1/16/16 Bill Howe, eScience Institute 74 map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: EmitIntermediate(w, 1); reduce(String output_key, Iterator, intermediate_values): // output_key: word // output_values: ???? int result = 0; for each v in intermediate_values: result += v; Emit(result); slide source: Google, Inc.
  • 75. MORE EXAMPLES:BUILD AN INVERTED INDEX 1/16/16 Bill Howe, UW 75 Input: tweet1, (“I love pancakes for breakfast”) tweet2, (“I dislike pancakes”) tweet3, (“What should I eat for breakfast?”) tweet4, (“I love to eat”) Desired output: “pancakes”, (tweet1, tweet2) “breakfast”, (tweet1, tweet3) “eat”, (tweet3, tweet4) “love”, (tweet1, tweet4) …
  • 76. MORE EXAMPLES:RELATIONAL JOIN 1/16/16 Bill Howe, UW 76 Name SSN Sue 999999999 Tony 777777777 EmpSSN DepName 999999999 Accounts 777777777 Sales 777777777 Marketing Emplyee ⨝ Assigned Departments Employee Assigned Departments Name SSN EmpSSN DepName Sue 999999999 999999999 Accounts Tony 777777777 777777777 Sales Tony 777777777 777777777 Marketing
  • 77. RELATIONAL JOIN IN MAPREDUCE:BEFORE MAP PHASE 1/16/16 Bill Howe, UW 77 Name SSN Sue 999999999 Tony 777777777 EmpSSN DepName 999999999 Accounts 777777777 Sales 777777777 Marketing Employee Assigned Departments Key idea: Lump all the tuples together into one dataset Employee, Sue, 999999999 Employee,Tony, 777777777 Department, 999999999, Accounts Department, 777777777, Sales Department, 777777777, Marketing What is this for?
  • 78. RELATIONAL JOIN IN MAPREDUCE:MAP PHASE 1/16/16 Bill Howe, UW 78 Employee, Sue, 999999999 Employee,Tony, 777777777 Department, 999999999, Accounts Department, 777777777, Sales Department, 777777777, Marketing key=999999999, value=(Employee, Sue, 999999999) key=777777777, value=(Employee,Tony, 777777777) key=999999999, value=(Department, 999999999, Accounts) key=777777777, value=(Department, 777777777, Sales) key=777777777, value=(Department, 777777777, Marketing) why do we use this as the key?
  • 79. RELATIONAL JOIN IN MAPREDUCE:REDUCE PHASE 1/16/16 Bill Howe, UW 79 key=999999999, values=[(Employee, Sue, 999999999), (Department, 999999999, Accounts)] key=777777777, values=[(Employee,Tony, 777777777), (Department, 777777777, Sales), (Department, 777777777, Marketing)] Sue, 999999999, 999999999, Accounts Tony, 777777777, 777777777, Sales Tony, 777777777, 777777777, Marketing
  • 80. RELATIONAL JOIN IN MAPREDUCE,AGAIN 1/16/16 Bill Howe, Data Science, Autumn 2012 80 Order(orderid, account, date) 1, aaa, d1 2, aaa, d2 3, bbb, d3 LineItem(orderid, itemid, qty) 1, 10, 1 1, 20, 3 2, 10, 5 2, 50, 100 3, 20, 1tagged with relation nameOrder 1, aaa, d1 à 1 :“Order”, (1,aaa,d1) 2, aaa, d2 à 2 :“Order”, (2,aaa,d2) 3, bbb, d3 à 3 :“Order”, (3,bbb,d3) Line 1, 10, 1 à 1 :“Line”, (1, 10, 1) 1, 20, 3 à 1 :“Line”, (1, 20, 3) 2, 10, 5 à 2 :“Line”, (2, 10, 5) 2, 50, 100 à 2 :“Line”, (2, 50, 100) 3, 20, 1 à 3 :“Line”, (3, 20, 1) Map Reducer for key 1 “Order”, (1,aaa,d1) “Line”, (1, 10, 1) “Line”, (1, 20, 3) (1, aaa, d1, 1, 10, 1) (1, aaa, d1, 1, 20, 3)
  • 81. SIMPLE SOCIAL NETWORK ANALYSIS: COUNT FRIENDS 1/16/16 Bill Howe, UW 81 Jim, Sue Sue, Jim Lin, Joe Joe, Lin Jim, Kai Kai, Jim Jim, Lin Lin, Jim Input Desired Output Jim, 3 Lin, 2 Sue, 1 Kai, 1 Joe, 1 Jim,1 Sue,1 Lin,1 Joe,1 Jim,1 Kai,1 Jim,1 Lin,1 MAP REDUCE Jim,(1,1,1) Lin,(1,1) Sue,(1) Kai,(1) Joe,(1) SHUFFLE
  • 82. MATRIX MULTIPLY IN MAPREDUCE C = A X B A has dimensions L,M B has dimensions M,N In the map phase: •  for each element (i,j) of A, emit ((i,k), A[i,j]) for k in 1..N •  for each element (j,k) of B, emit ((i,k), B[j,k]) for i in 1..L In the reduce phase, emit •  key = (i,k) •  value = Sumj (A[i,j] * B[j,k]) 1/16/16 Bill Howe, Data Science, Autumn 2012 82
  • 83. 1/16/16 Bill Howe, Data Science, Autumn 2012 83 AB X A B = •  One reducer per output cell •  •  Each reducer computes: Ai, jBj,k j ∑
  • 84. CLUSTER COMPUTING Large number of commodity servers, connected by high speed, commodity network Rack: holds a small number of servers Data center: holds many racks CSE 344 – Winter 2012 84
  • 85. CLUSTER COMPUTING Massive parallelism: •  100s, or 1000s, or 10000s servers •  Many hours Failure: •  If medium-time-between-failure is 1 year •  Then 10000 servers have one failure / hour slide src: Dan Suciu and Magda Balazinska 85
  • 86. DISTRIBUTED FILE SYSTEM (DFS) For very large files:TBs, PBs Each file is partitioned into chunks, typically 64MB Each chunk is replicated several times (≥3), on different racks, for fault tolerance Implementations: •  Google’s DFS: GFS, proprietary •  Hadoop’s DFS: HDFS, open source slide src: Dan Suciu and Magda Balazinska 86
  • 88. HADOOPVS.RDBMS Comparison of 3 systems •  Hadoop •  Vertica (a column-oriented database) •  DBMS-X (a row-oriented database) •  rhymes with “schmoracle” Qualitative •  Programming model, ease of setup, features, etc. Quantitative •  Data loading, different types of queries 1/16/16 Bill Howe, eScience Institute 88 Pavlo et al.2009
  • 89. GREP TASK •  Find 3-byte pattern in 100-byte record •  1 match per 10,000 records •  Data set: •  10-byte unique key, 90-byte value •  1TB spread across 25, 50, or 100 nodes •  10 billion records •  Original MR Paper (Dean et al. 2004) 89
  • 90. GREP TASK LOADING RESULTS 90 Seconds
  • 91. GREP TASK EXECUTION RESULTS 91 Seconds
  • 92. SELECTION TASK 92 SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X 1 GB / node
  • 93. ANALYTICAL TASKS •  Simple web programming schema •  Data set •  600k HTML Documents (6GB/node) •  155 million UserVisit records (20GB/node) •  18 million Rankings records (1GB/node) 93
  • 94. AGGREGATE TASK •  Simple query to find adRevenue by IP prefix 94 SELECT SUBSTR(sourceIP, 1, 7), SUM (adRevenue) FROM userVisits GROUP BY SUBSTR (sourceIP, 1, 7)
  • 96. JOIN TASK •  Find the sourceIP that generated the most adRevenue along with its average pageRank •  Implementations: •  DMPSs – complex SQL using temporary table •  MapReduce – Three separate MR programs 96
  • 97. JOIN TASK 97 SELECT INTO TempsourceIP, AVG(pageRank)as avgPageRank, SUM(adRevenue)as totalRevenue FROM RankingsAS R , UserVisitsAS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN ‘2000-01-15’ AND ‘2000-01-22’ GROUP BY UV.sourceIP; SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenueDESC LIMIT 1;
  • 99. PROBLEMSWITH THIS ANALYSIS? Other ways to avoid sequential scans? Fault-tolerance in large clusters? Tasks that cannot be expressed as queries? 1/16/16 Bill Howe, UW 99
  • 100. GOOGLE’S RESPONSE:CLUSTER SIZE •  Largest known database installations •  Greenplum – 96 nodes – 4.5 PB (eBay)[1] •  Teradata – 72 nodes – 2+PB (eBay)[1] •  Largest known MR installations: •  Hadoop – 3658 nodes – 1 PB (Yahoo)[2] •  Hive – 600+ nodes – 2.5 PB (Facebook)[3] [1] eBay’s two enormous data warehouses – April 30th, 2009 http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/ [2] Hadoop sorts a petabyte in 16.25 hours and a terabyte in 62 seconds – May 11th, 2009 https://developer.yahoo.com/blogs/hadoop/hadoop-sorts-petabyte-16-25-hours-terabyte-62-422.html [3] Hive – A Petabyte Scale DataWarehouse using Hadoop – June 10th, 2009 https://www.facebook.com/notes/facebook-engineering/hive-a-petabyte-scale-data-warehouse-using-hadoop/89508453919 100
  • 101. CONCLUDING REMARKS •  What can MapReduce learn from Databases? •  Declarative languages are a good thing •  Schemas are important •  What can Databases learn from MapReduce? •  Query fault-tolerance •  Support for in situ data •  Embrace open-source 101
  • 102. SLIDES CAN BE FOUND AT: TEACHINGDATASCIENCE.ORG