Interactive Big Data Analysis
Viet-Trung Tran
  
MapReduce wordcount
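The word count job named on this slide can be sketched in a few lines. This is a single-process illustration of the map, shuffle, and reduce phases, not a Hadoop program; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the partial counts collected for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run the three phases in order over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data analysis", "interactive big data"]))
```

In a real cluster the reduce phase cannot start until every map task has finished, which is one source of the latency discussed on the following slides.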
  
MR – batch processing
•  Long running job
– latency between running the job and getting the answer
•  Lots of computations
•  Specific language
  
Example Problem
•  Jane works as an analyst at an e-commerce company
•  How does she figure out good targeting segments for the next marketing campaign?
•  She has some ideas and lots of data
– User profiles
– Transaction information
– Access logs
  
Solving the problems?
All compiled to MapReduce jobs
  
Dremel: interactive analysis of web-scale datasets
Melnik et al., Google Inc.
[VLDB 2010]
  
What is Dremel?
•  Near real time interactive analysis (instead of batch processing). SQL-like query language
–  Trillion record, multi-terabyte datasets
•  Nested data with a column storage representation
•  Serving tree: multi-level execution trees for query processing
•  Interoperates "in place" with GFS, Bigtable
•  The engine behind Google BigQuery
•  Builds on the ideas from web search and parallel DBMS.
  
Why call it Dremel
•  Brand of power tools that primarily rely on their speed as opposed to torque
•  Data analysis tool that uses speed instead of raw power
  
Widely used inside Google
•  Analysis of crawled web documents
•  Tracking install data for applications on Android Market
•  Crash reporting for Google products
•  OCR results from Google Books
•  Spam analysis
•  Debugging of map tiles on Google Maps
•  Tablet migrations in managed Bigtable instances
•  Results of tests run on Google's distributed build system
•  Disk I/O statistics for hundreds of thousands of disks
•  Resource monitoring for jobs run in Google's data centers
•  Symbols and dependencies in Google's codebase
  
Records vs. columns
[Figure: fields A, B, C, D, E of records r1 and r2, stored record by record versus column by column]
Challenge: preserve structure, reconstruct from a subset of fields
Read less, cheaper decompression
  
Columnar format
•  Values in a column stored next to one another
– Better compression
– Range-map: save min-max
•  Only access columns participating in query
•  Aggregations can be done without decoding
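The range-map idea above can be illustrated with a small sketch: each chunk of a column keeps its minimum and maximum, so a filter can skip or fully accept a chunk without inspecting its values. The `ColumnChunk` class, the chunk size, and the sample `price` column are assumptions made for this example only; they do not describe Dremel's actual storage format.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """One chunk of a single column, with its min/max kept as a range map."""
    values: list
    lo: int
    hi: int


def make_chunks(values, chunk_size):
    """Split one column into fixed-size chunks and record min/max per chunk."""
    chunks = []
    for i in range(0, len(values), chunk_size):
        part = values[i:i + chunk_size]
        chunks.append(ColumnChunk(part, min(part), max(part)))
    return chunks


def count_greater_than(chunks, threshold):
    """Count values > threshold; return (count, number of chunks scanned)."""
    count = scanned = 0
    for chunk in chunks:
        if chunk.hi <= threshold:
            continue  # no value can match: skip the chunk entirely
        if chunk.lo > threshold:
            count += len(chunk.values)  # every value matches: no per-value check
            continue
        scanned += 1  # range overlaps the threshold: values must be inspected
        count += sum(1 for v in chunk.values if v > threshold)
    return count, scanned


if __name__ == "__main__":
    # Only the column named in the query is touched; other columns stay unread.
    table = {
        "price": make_chunks([1, 2, 3, 4, 10, 11, 12, 13, 5, 20, 6, 30], 4),
        "user_id": make_chunks(list(range(12)), 4),
    }
    print(count_greater_than(table["price"], 9))
```

With the sample data, one chunk is skipped, one is accepted whole, and only the third is actually scanned.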
  
Nested data model

Schema (multiplicity shown in brackets):
message Document {
  required int64 DocId;          [1,1]
  optional group Links {
    repeated int64 Backward;     [0,*]
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;   [0,1]
    }
    optional string Url;
  }
}

Record r1:
DocId: 10
Links
  Forward: 20
  Forward: 40
  Forward: 60
Name
  Language
    Code: 'en-us'
    Country: 'us'
  Language
    Code: 'en'
  Url: 'http://A'
Name
  Url: 'http://B'
Name
  Language
    Code: 'en-gb'
    Country: 'gb'

Record r2:
DocId: 20
Links
  Backward: 10
  Backward: 30
  Forward: 80
Name
  Url: 'http://C'
  
Column-striped representation
(r = repetition level, d = definition level)

DocId
value     r  d
10        0  0
20        0  0

Name.Url
value     r  d
http://A  0  2
http://B  1  2
NULL      1  1
http://C  0  2

Name.Language.Code
value     r  d
en-us     0  2
en        2  2
NULL      1  1
en-gb     1  2
NULL      0  1

Name.Language.Country
value     r  d
us        0  3
NULL      2  2
NULL      1  1
gb        1  3
NULL      0  1

Links.Forward
value     r  d
20        0  2
40        1  2
60        1  2
80        0  2

Links.Backward
value     r  d
NULL      0  1
10        0  2
30        1  2
  
Repetition and definition levels
r: at what repeated field in the field's path the value has repeated
d: how many fields in the path that could be undefined (optional or repeated) are actually present

Name.Language.Code (records r1 and r2 as in the nested data model)
value     r  d
en-us     0  2
en        2  2
NULL      1  1
en-gb     1  2
NULL      0  1

Repeated fields on the path: Name (r=1), Language (r=2); Code itself is non-repeating
en-us, r=0: the record has repeated (a new record starts)
en, r=2: Language has repeated
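The two definitions can be made concrete with a short sketch that derives the (value, r, d) triples for one field path from nested records. The records are held as plain Python dicts, and the `PATHS` table, `stripe`, and `column` are hypothetical names for this illustration; this is not Dremel's column-striping algorithm as implemented, only a direct reading of the definitions of r and d.

```python
# Field kinds along each path, read off the Document schema on the slides.
PATHS = {
    "Name.Language.Code": ["repeated", "repeated", "required"],
    "Name.Language.Country": ["repeated", "repeated", "optional"],
}

R1 = {
    "DocId": 10,
    "Links": {"Forward": [20, 40, 60]},
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
R2 = {
    "DocId": 20,
    "Links": {"Backward": [10, 30], "Forward": [80]},
    "Name": [{"Url": "http://C"}],
}


def stripe(record, names, kinds):
    """Return the (value, r, d) triples of one field path within one record."""
    out = []

    def walk(node, depth, r, d, rep_depth):
        name, kind = names[depth], kinds[depth]
        if kind == "repeated":
            rep_depth += 1  # one more repeated field on the path so far
            items = node.get(name, [])
        else:
            items = [node[name]] if name in node else []
        if not items:
            out.append((None, r, d))  # NULL: path is cut off at this depth
            return
        child_d = d + (0 if kind == "required" else 1)
        for i, item in enumerate(items):
            # The first item inherits r; later items repeat at this field's level.
            item_r = r if i == 0 else rep_depth
            if depth == len(names) - 1:
                out.append((item, item_r, child_d))
            else:
                walk(item, depth + 1, item_r, child_d, rep_depth)

    walk(record, 0, 0, 0, 0)
    return out


def column(records, path):
    """Concatenate the triples of all records into one column."""
    rows = []
    for record in records:
        rows.extend(stripe(record, path.split("."), PATHS[path]))
    return rows


if __name__ == "__main__":
    for value, r, d in column([R1, R2], "Name.Language.Code"):
        print(value, r, d)
```

NULL is represented as `None`; the output rows follow the order of the Name.Language.Code table on the slide.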
  
Record assembly FSM
[Figure: finite-state machine over the fields of the Document schema; transitions labeled with repetition levels]
DocId -> Links.Backward on 0
Links.Backward -> Links.Backward on 1; -> Links.Forward on 0
Links.Forward -> Links.Forward on 1; -> Name.Language.Code on 0
Name.Language.Code -> Name.Language.Country on 0, 1, 2
Name.Language.Country -> Name.Language.Code on 2; -> Name.Url on 0, 1
Name.Url -> Name.Language.Code on 1; -> end on 0
  
Record assembly FSM: example
[Figure: the same automaton, with transitions labeled with repetition levels, traversed to assemble record r1 (DocId: 10)]
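A rough sketch of how such an automaton drives the column readers: after reading a value from one column, the repetition level of the next value in that same column selects the next field to read. The transition table `FSM` below is an assumption about how the repetition-level labels in the figure attach to edges, following the automaton published in the Dremel paper; `read_record` is a hypothetical helper. The sketch only traces the order in which fields are read and does not rebuild the nested structure, which would additionally need the definition levels.

```python
# Columns as (value, r, d) triples, copied from the column-striped slide.
COLUMNS = {
    "DocId": [(10, 0, 0), (20, 0, 0)],
    "Links.Backward": [(None, 0, 1), (10, 0, 2), (30, 1, 2)],
    "Links.Forward": [(20, 0, 2), (40, 1, 2), (60, 1, 2), (80, 0, 2)],
    "Name.Language.Code": [("en-us", 0, 2), ("en", 2, 2), (None, 1, 1),
                           ("en-gb", 1, 2), (None, 0, 1)],
    "Name.Language.Country": [("us", 0, 3), (None, 2, 2), (None, 1, 1),
                              ("gb", 1, 3), (None, 0, 1)],
    "Name.Url": [("http://A", 0, 2), ("http://B", 1, 2), (None, 1, 1),
                 ("http://C", 0, 2)],
}

# state -> {next repetition level: next state}; None marks the end of a record.
FSM = {
    "DocId": {0: "Links.Backward"},
    "Links.Backward": {1: "Links.Backward", 0: "Links.Forward"},
    "Links.Forward": {1: "Links.Forward", 0: "Name.Language.Code"},
    "Name.Language.Code": {0: "Name.Language.Country",
                           1: "Name.Language.Country",
                           2: "Name.Language.Country"},
    "Name.Language.Country": {2: "Name.Language.Code",
                              0: "Name.Url", 1: "Name.Url"},
    "Name.Url": {1: "Name.Language.Code", 0: None},
}


def read_record(columns, fsm, cursors, start="DocId"):
    """Return the (field, value) pairs of the next record, in read order."""
    trace = []
    field = start
    while field is not None:
        column = columns[field]
        value, _r, _d = column[cursors[field]]
        cursors[field] += 1
        trace.append((field, value))
        # Peek at the next entry of the same column; an exhausted column
        # behaves like repetition level 0 (start of a new record).
        pos = cursors[field]
        next_r = column[pos][1] if pos < len(column) else 0
        field = fsm[field][next_r]
    return trace


if __name__ == "__main__":
    cursors = {name: 0 for name in COLUMNS}
    for field, value in read_record(COLUMNS, FSM, cursors):
        print(field, value)
```

Calling `read_record` a second time with the same `cursors` continues with record r2, since each column reader keeps its own position.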
  
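Reading two fields
[Figure: automaton for DocId and Name.Language.Country only. DocId -> Name.Language.Country on 0; Name.Language.Country -> Name.Language.Country on 1, 2; -> end on 0]

Record s1:
DocId: 10
Name
  Language
    Country: 'us'
  Language
Name
Name
  Language
    Country: 'gb'

Record s2:
DocId: 20
Name

Structure of parent fields is preserved.
Useful for queries like /Name[3]/Language[1]/Country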
  
Query processing
•  Optimized for select-project-aggregate
– Very common class of interactive queries
– Single scan
– Within-record and cross-record aggregation
•  Approximations: count(distinct), top-k
•  Joins, temp tables, UDFs/TVFs, etc.
  
SQL dialect for nested data

SELECT DocId AS Id,
  COUNT(Name.Language.Code) WITHIN Name AS Cnt,
  Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

Output table t1:
Id: 10
Name
  Cnt: 2
  Language
    Str: 'http://A,en-us'
    Str: 'http://A,en'
Name
  Cnt: 0

Output schema:
message QueryResult {
  required int64 Id;
  repeated group Name {
    optional uint64 Cnt;
    repeated group Language {
      optional string Str;
    }
  }
}

No record assembly during query processing
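To see what the WITHIN aggregation and the filter do, the query can be emulated over the two sample records held as plain dicts. This is only an illustration of the query's semantics as read from the slide; Dremel evaluates it directly on columns without assembling records. The function name `run_query` is hypothetical, and each Str is placed in its own Language entry, following the output schema.

```python
import re

R1 = {
    "DocId": 10,
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
R2 = {"DocId": 20, "Name": [{"Url": "http://C"}]}


def run_query(records):
    """Emulate the slide's SELECT over nested records held as dicts."""
    results = []
    for rec in records:
        if not rec["DocId"] < 20:  # WHERE ... AND DocId < 20
            continue
        names = []
        for name in rec.get("Name", []):
            url = name.get("Url")
            if url is None or not re.search(r"^http", url):
                continue  # WHERE REGEXP(Name.Url, '^http') prunes this Name
            codes = [lang["Code"] for lang in name.get("Language", [])]
            names.append({
                "Cnt": len(codes),  # COUNT(Name.Language.Code) WITHIN Name
                "Language": [{"Str": url + "," + code} for code in codes],
            })
        results.append({"Id": rec["DocId"], "Name": names})
    return results


if __name__ == "__main__":
    for row in run_query([R1, R2]):
        print(row)
```

The third Name of r1 has no Url and is pruned, and r2 fails the DocId condition, which is why the output table contains a single record with two Name entries.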
  
Serving tree
[Figure: client -> root server -> intermediate servers -> leaf servers (with local storage) -> storage layer (e.g., GFS)]
[Figure: histogram of response times]
  
Multi-level serving tree
•  Parallelizes scheduling and aggregation
– Reduced fan-in
– Divide/conquer
– Better network utilization
•  Fault tolerance
  
Example: count()

Level 0 (root server):
SELECT A, COUNT(B) FROM T
GROUP BY A
T = {/gfs/1, /gfs/2, …, /gfs/100000}
is rewritten as
SELECT A, SUM(c)
FROM (R11 UNION ALL … R110)
GROUP BY A

Level 1, one query per partition of T:
R11:
SELECT A, COUNT(B) AS c
FROM T11 GROUP BY A
T11 = {/gfs/1, …, /gfs/10000}
R12:
SELECT A, COUNT(B) AS c
FROM T12 GROUP BY A
T12 = {/gfs/10001, …, /gfs/20000}
. . .

Level 3 (data access ops):
SELECT A, COUNT(B) AS c
FROM T31 GROUP BY A
T31 = {/gfs/1}
. . .
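The rewrite above, where leaves compute COUNT and every higher level computes SUM over partial results, can be sketched as follows. Tablets are modeled as in-memory lists of rows and the tree as repeated grouping by a fixed fan-out; `leaf_query`, `merge`, `serve`, and the `fanout` parameter are hypothetical names for this illustration, not part of Dremel.

```python
from collections import Counter


def leaf_query(tablet):
    """Leaf server: SELECT A, COUNT(B) AS c FROM tablet GROUP BY A."""
    counts = Counter()
    for row in tablet:
        if row.get("B") is not None:  # COUNT(B) ignores NULL values
            counts[row["A"]] += 1
    return counts


def merge(partials):
    """Intermediate or root server: SELECT A, SUM(c) ... GROUP BY A."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts per key
    return total


def serve(tablets, fanout):
    """Evaluate the query through a tree whose inner nodes merge `fanout` children."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [leaf_query(t) for t in tablets]
    while len(level) > 1:
        level = [merge(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0] if level else Counter()


if __name__ == "__main__":
    tablets = [
        [{"A": "x", "B": 1}, {"A": "y", "B": None}],
        [{"A": "x", "B": 2}],
        [{"A": "y", "B": 3}, {"A": "x", "B": 4}],
    ]
    print(dict(serve(tablets, fanout=2)))
```

Because SUM is associative, the result does not depend on the fan-out or depth of the tree; those only change how much work each server does.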
  
Experiments
Table  Number of     Size (unrepl.,  Number     Data    Repl.
name   records       compressed)     of fields  center  factor
T1     85 billion    87 TB           270        A       3×
T2     24 billion    13 TB           530        A       3×
T3     4 billion     70 TB           1200       A       3×
T4     1+ trillion   105 TB          50         B       3×
T5     1+ trillion   20 TB           30         B       2×
•  1 PB of real data (uncompressed, non-replicated)
•  100K-800K tablets per table
•  Experiments run during business hours
  
Read from disk
[Figure: time (sec) vs. number of fields read. From columns: (a) read + decompress, (b) assemble records, (c) parse as C++ objects. From records: (d) read + decompress, (e) parse as C++ objects]
Table partition: 375 MB (compressed), 300K rows, 125 columns
2-4x overhead of using records
10x speedup using columnar storage
  
MR and Dremel execution
Avg # of terms in txtField in 85 billion record table T1

Sawzall program ran on MR:
num_recs: table sum of int;
num_words: table sum of int;
emit num_recs <- 1;
emit num_words <- count_words(input.txtField);

Q1:
SELECT SUM(count_words(txtField)) / COUNT(*)
FROM T1

[Figure: execution time (sec) on 3000 nodes, log scale, for MR-records (87 TB read), MR-columns (0.5 TB read), and Dremel (0.5 TB read)]
MR overheads: launch jobs, schedule 0.5M tasks, assemble records
  
Impact of serving tree depth
[Figure: execution time (sec) of Q2 and Q3 with 2, 3, and 4 levels in the serving tree]
Q2 (returns 100s of records):
SELECT country, SUM(item.amount) FROM T2
GROUP BY country
Q3 (returns 1M records):
SELECT domain, SUM(item.amount) FROM T2
WHERE domain CONTAINS '.net'
GROUP BY domain
40 billion nested items
  
Scalability
Q5 on a trillion-row table T4:
SELECT TOP(aid, 20), COUNT(*) FROM T4
[Figure: execution time (sec) vs. number of leaf servers, 1000 to 4000]
  
Interactive speed
[Figure: percentage of queries vs. execution time (sec), log scale from 1 to 1000]
Monthly query workload of one 3000-node Dremel instance
Most queries complete under 10 sec
  
Observations
•  Possible to analyze large disk-resident datasets interactively on commodity hardware
–  1T records, 1000s of nodes
•  MR can benefit from columnar storage just like a parallel DBMS
–  But record assembly is expensive
–  Interactive SQL and MR can be complementary
•  Parallel DBMSes may benefit from serving tree architecture just like search engines
  
Vs. MapReduce
•  Scheduling Model
–  Coarse resource model reduces hardware utilization
–  Acquisition of resources typically takes 100s of milliseconds to seconds
•  Barriers
–  Map completion required before shuffle/reduce commencement
–  All maps must complete before reduce can start
–  In chained jobs, one job must finish entirely before the next one can start
•  Persistence and Recoverability
–  Data is persisted to disk between each barrier
–  Serialization and deserialization are required between execution phases
  
Apache Drill
  
Full SQL – ANSI SQL 2003
•  SQL-like is not enough
•  Fine integration with existing BI tools
– Tableau, SAP
– Standard ODBC/JDBC driver
  
Working data
•  Flat files in DFS
– Complex data (Thrift, Avro, protobuf)
– Columnar data (Parquet, ORC)
– JSON
– CSV, TSV
•  NoSQL stores
– Document stores
– Sparse data
– Relational-like
  
Flexible schema

Sample query
  
Nested data
•  Nested data as first class entity
– Similar to BigQuery
– No upfront flattening required
– JSON, BSON, Avro, Protocol Buffers
  
Cross data source queries
•  Combine data from files, HBase, and Hive in a single query
•  No central metadata definitions necessary
  
High level architecture
•  Cluster of drillbits, one per node, designed to maximize data locality
•  Form a distributed query processing engine
•  ZooKeeper for cluster membership only
•  Hazelcast distributed cache for query plans, metadata, locality information
•  Columnar record organization
•  No dependency on other execution engines (MapReduce, Tez, Spark)
  
Basic query flow
  
Drillbit modules
•  SQL parser
•  Optimizer
•  Execution
•  Query execution
– source query: what
– logical plan: what
– physical plan: how
– execution plan: where
  
Optimistic execution
•  Short running query
– No checkpoints
– Rerun entire query in face of failure
•  No barriers
•  No persistence
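The policy on this slide (no checkpoints, rerun the entire query on failure) amounts to a retry loop around the whole query. The sketch below is an assumption-laden illustration: `run_optimistically`, the `execute` callback, the `max_attempts` limit, and the use of `RuntimeError` as a stand-in for a node or fragment failure are all hypothetical and not taken from Apache Drill.

```python
def run_optimistically(query, execute, max_attempts=3):
    """Run execute(query) without checkpoints; on failure, rerun the whole query.

    Returns (result, attempts_used). No intermediate state is saved, so a
    retry always starts again from the beginning.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return execute(query), attempt
        except RuntimeError as err:  # stand-in for a node or fragment failure
            last_error = err
    raise RuntimeError(f"query failed after {max_attempts} attempts") from last_error


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_execute(query):
        """Fail on the first call to simulate a lost node, then succeed."""
        calls["n"] += 1
        if calls["n"] == 1:
            raise RuntimeError("node lost")
        return f"rows for: {query}"

    print(run_optimistically("SELECT 1", flaky_execute))
```

This trade-off is reasonable only when queries are short: the cost of an occasional full rerun stays below the cost of persisting intermediate results for every query.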
  
Runtime compilation

Roadmap
  

Más contenido relacionado

La actualidad más candente

Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Sparksamthemonad
 
(Julien le dem) parquet
(Julien le dem)   parquet(Julien le dem)   parquet
(Julien le dem) parquetNAVER D2
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tipsSubhas Kumar Ghosh
 
Extending lifespan with Hadoop and R
Extending lifespan with Hadoop and RExtending lifespan with Hadoop and R
Extending lifespan with Hadoop and RRadek Maciaszek
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelTakahiro Inoue
 
Hadoop本 輪読会 1章〜2章
Hadoop本 輪読会 1章〜2章Hadoop本 輪読会 1章〜2章
Hadoop本 輪読会 1章〜2章moai kids
 
Python for R developers and data scientists
Python for R developers and data scientistsPython for R developers and data scientists
Python for R developers and data scientistsLambda Tree
 
R the unsung hero of Big Data
R the unsung hero of Big DataR the unsung hero of Big Data
R the unsung hero of Big DataDhafer Malouche
 
ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)moai kids
 
Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Brian O'Neill
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopMohamed Elsaka
 
Perly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data RecordsPerly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data RecordsWorkhorse Computing
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceDr Ganesh Iyer
 
R + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop clusterR + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop clusterJeffrey Breen
 

La actualidad más candente (20)

14 query processing-sorting
14 query processing-sorting14 query processing-sorting
14 query processing-sorting
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Spark
 
(Julien le dem) parquet
(Julien le dem)   parquet(Julien le dem)   parquet
(Julien le dem) parquet
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
Extending lifespan with Hadoop and R
Extending lifespan with Hadoop and RExtending lifespan with Hadoop and R
Extending lifespan with Hadoop and R
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
 
Hadoop本 輪読会 1章〜2章
Hadoop本 輪読会 1章〜2章Hadoop本 輪読会 1章〜2章
Hadoop本 輪読会 1章〜2章
 
Scalding
ScaldingScalding
Scalding
 
Python for R developers and data scientists
Python for R developers and data scientistsPython for R developers and data scientists
Python for R developers and data scientists
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
R the unsung hero of Big Data
R the unsung hero of Big DataR the unsung hero of Big Data
R the unsung hero of Big Data
 
ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)
 
Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Report
ReportReport
Report
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
Perly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data RecordsPerly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data Records
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
R + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop clusterR + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop cluster
 
C07.heaps
C07.heapsC07.heaps
C07.heaps
 

Destacado

Social media strategies for libraries poster
Social media strategies for libraries posterSocial media strategies for libraries poster
Social media strategies for libraries posterNataly Blas
 
Practica 2 quimica organica -espol
Practica 2  quimica organica -espolPractica 2  quimica organica -espol
Practica 2 quimica organica -espolLissy Rodriguez
 
Tachyon memory centric, fault tolerance storage for cluster framworks
Tachyon  memory centric, fault tolerance storage for cluster framworksTachyon  memory centric, fault tolerance storage for cluster framworks
Tachyon memory centric, fault tolerance storage for cluster framworksViet-Trung TRAN
 
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...Paul Brown
 
The State of Facilities at Eastern Region Institutions JUNE16
The State of Facilities at Eastern Region Institutions JUNE16The State of Facilities at Eastern Region Institutions JUNE16
The State of Facilities at Eastern Region Institutions JUNE16Sightlines
 
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...Dave McClure
 
Balanceo de una ecuación química
Balanceo de una ecuación químicaBalanceo de una ecuación química
Balanceo de una ecuación químicadopamina mexico
 
xoxooo tkmmm
xoxooo tkmmmxoxooo tkmmm
xoxooo tkmmmceny2
 
Guia De Estudio Digestivo
Guia De Estudio DigestivoGuia De Estudio Digestivo
Guia De Estudio DigestivoLuciana Yohai
 
Jobs consultant
Jobs consultantJobs consultant
Jobs consultantTenforce
 
How to increase traffic to your WordPress website.
How to increase traffic to your WordPress website. How to increase traffic to your WordPress website.
How to increase traffic to your WordPress website. Liquis Design
 
William Gross Sues Pimco for Hundreds of Millions
William Gross Sues Pimco for Hundreds of MillionsWilliam Gross Sues Pimco for Hundreds of Millions
William Gross Sues Pimco for Hundreds of MillionsTric Park
 
Charitable Giving and Happiness
Charitable Giving and HappinessCharitable Giving and Happiness
Charitable Giving and HappinessFaircom New York
 

Destacado (20)

Social media strategies for libraries poster
Social media strategies for libraries posterSocial media strategies for libraries poster
Social media strategies for libraries poster
 
Practica 2 quimica organica -espol
Practica 2  quimica organica -espolPractica 2  quimica organica -espol
Practica 2 quimica organica -espol
 
Tachyon memory centric, fault tolerance storage for cluster framworks
Tachyon  memory centric, fault tolerance storage for cluster framworksTachyon  memory centric, fault tolerance storage for cluster framworks
Tachyon memory centric, fault tolerance storage for cluster framworks
 
The Rules - SGS
The Rules - SGSThe Rules - SGS
The Rules - SGS
 
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...
 
The State of Facilities at Eastern Region Institutions JUNE16
The State of Facilities at Eastern Region Institutions JUNE16The State of Facilities at Eastern Region Institutions JUNE16
The State of Facilities at Eastern Region Institutions JUNE16
 
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...
 
Balanceo de una ecuación química
Balanceo de una ecuación químicaBalanceo de una ecuación química
Balanceo de una ecuación química
 
teaching methods
teaching methods teaching methods
teaching methods
 
Moving to the Right Side of Safety
Moving to the Right Side of SafetyMoving to the Right Side of Safety
Moving to the Right Side of Safety
 
God Is Forgiving
God Is ForgivingGod Is Forgiving
God Is Forgiving
 
xoxooo tkmmm
xoxooo tkmmmxoxooo tkmmm
xoxooo tkmmm
 
Guia De Estudio Digestivo
Guia De Estudio DigestivoGuia De Estudio Digestivo
Guia De Estudio Digestivo
 
Jobs consultant
Jobs consultantJobs consultant
Jobs consultant
 
Jvm mbeans jmxtran
Jvm mbeans jmxtranJvm mbeans jmxtran
Jvm mbeans jmxtran
 
How to increase traffic to your WordPress website.
How to increase traffic to your WordPress website. How to increase traffic to your WordPress website.
How to increase traffic to your WordPress website.
 
William Gross Sues Pimco for Hundreds of Millions
William Gross Sues Pimco for Hundreds of MillionsWilliam Gross Sues Pimco for Hundreds of Millions
William Gross Sues Pimco for Hundreds of Millions
 
Latin Dansları
Latin DanslarıLatin Dansları
Latin Dansları
 
Charitable Giving and Happiness
Charitable Giving and HappinessCharitable Giving and Happiness
Charitable Giving and Happiness
 
Torque
TorqueTorque
Torque
 

Similar a Interactive big data analytics

Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets robertlz
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster InnardsMartin Dvorak
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profilepramodbiligiri
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep DiveAmazon Web Services
 
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Data Con LA
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingCollin Bennett
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersCleverence Kombe
 
Apache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelApache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelAndrey Lomakin
 
Dremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsDremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsCarl Lu
 
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...InfluxData
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezDataWorks Summit
 

Similar a Interactive big data analytics (20)

Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
Handout3o
Handout3oHandout3o
Handout3o
 
MongoDB 3.0
MongoDB 3.0 MongoDB 3.0
MongoDB 3.0
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive
 
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive Computing
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
User biglm
User biglmUser biglm
User biglm
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clusters
 
Apache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelApache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data model
 
Data Collection and Storage
Data Collection and StorageData Collection and Storage
Data Collection and Storage
 
Dremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsDremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasets
 
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
 

Interactive big data analytics

  • 1. Interactive Big Data Analysis, Viet-Trung Tran
  • 3. MR – batch processing
      • Long-running job
          – latency between running the job and getting the answer
      • Lots of computation
      • Specific language
  • 4. Example Problem
      • Jane works as an analyst at an e-commerce company
      • How does she figure out good targeting segments for the next marketing campaign?
      • She has some ideas and lots of data
          – User profiles
          – Transaction information
          – Access logs
  • 5. Solving the problems?
      • All compiled to MapReduce jobs
  • 6. Dremel: interactive analysis of web-scale datasets
      • Melnik et al., Google Inc. [VLDB 2010]
  • 7. What is Dremel?
      • Near-real-time interactive analysis (instead of batch processing), with a SQL-like query language
          – Trillion-record, multi-terabyte datasets
      • Nested data with a columnar storage representation
      • Serving tree: multi-level execution trees for query processing
      • Interoperates "in place" with GFS and Bigtable
      • The engine behind Google BigQuery
      • Builds on ideas from web search and parallel DBMSs
  • 8. Why call it Dremel?
      • Dremel is a brand of power tools that primarily rely on their speed as opposed to torque
      • A data analysis tool that uses speed instead of raw power
  • 9. Widely used inside Google
      • Analysis of crawled web documents
      • Tracking install data for applications on Android Market
      • Crash reporting for Google products
      • OCR results from Google Books
      • Spam analysis
      • Debugging of map tiles on Google Maps
      • Tablet migrations in managed Bigtable instances
      • Results of tests run on Google's distributed build system
      • Disk I/O statistics for hundreds of thousands of disks
      • Resource monitoring for jobs run in Google's data centers
      • Symbols and dependencies in Google's codebase
  • 10. Records vs. columns
      • [Figure: fields A, B, C, D, E of records r1 and r2, laid out record by record versus column by column]
      • Columnar layout: read less, cheaper decompression
      • Challenge: preserve structure, reconstruct from a subset of fields
  • 11. Columnar format
      • Values in a column are stored next to one another
          – Better compression
          – Range map: save min-max
      • Only access columns participating in the query
      • Aggregations can be done without decoding
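The column layout and the min-max range map can be illustrated with a minimal Python sketch. The toy table, the field names, and the helpers to_columns, min_max and may_contain are all assumptions made for illustration; none of this is Dremel's actual storage format.

```python
from collections import defaultdict

# Hypothetical toy table: three flat records with two fields each.
rows = [
    {"country": "us", "amount": 10},
    {"country": "gb", "amount": 25},
    {"country": "us", "amount": 7},
]


def to_columns(records):
    """Stripe row-oriented records into one list of values per field."""
    columns = defaultdict(list)
    for record in records:
        for field, value in record.items():
            columns[field].append(value)
    return dict(columns)


def min_max(values):
    """Range summary for one column block (the 'save min-max' idea)."""
    return min(values), max(values)


def may_contain(value_range, threshold):
    """True if some value in the block could be greater than threshold."""
    return value_range[1] > threshold


columns = to_columns(rows)
amount_range = min_max(columns["amount"])

# A query that only touches 'amount' never reads the 'country' values.
total_amount = sum(columns["amount"])

# The range map lets a scan skip this block for 'amount > 100'.
skip_block = not may_contain(amount_range, 100)
```

Because each field lives in its own list, a query over one field reads only that list, and the stored (min, max) pair is enough to rule out a whole block without looking at its values.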
  • 12. Nested data model
      Schema (multiplicity in brackets):
          message Document {
            required int64 DocId;            [1,1]
            optional group Links {
              repeated int64 Backward;       [0,*]
              repeated int64 Forward;
            }
            repeated group Name {
              repeated group Language {
                required string Code;
                optional string Country;     [0,1]
              }
              optional string Url;
            }
          }
      Record r1:
          DocId: 10
          Links
            Forward: 20
            Forward: 40
            Forward: 60
          Name
            Language
              Code: 'en-us'
              Country: 'us'
            Language
              Code: 'en'
            Url: 'http://A'
          Name
            Url: 'http://B'
          Name
            Language
              Code: 'en-gb'
              Country: 'gb'
      Record r2:
          DocId: 20
          Links
            Backward: 10
            Backward: 30
            Forward: 80
          Name
            Url: 'http://C'
  • 13. Column-striped representation
      (r = repetition level, d = definition level)
      DocId
          value      r  d
          10         0  0
          20         0  0
      Name.Url
          value      r  d
          http://A   0  2
          http://B   1  2
          NULL       1  1
          http://C   0  2
      Name.Language.Code
          value      r  d
          en-us      0  2
          en         2  2
          NULL       1  1
          en-gb      1  2
          NULL       0  1
      Name.Language.Country
          value      r  d
          us         0  3
          NULL       2  2
          NULL       1  1
          gb         1  3
          NULL       0  1
      Links.Forward
          value      r  d
          20         0  2
          40         1  2
          60         1  2
          80         0  2
      Links.Backward
          value      r  d
          NULL       0  1
          10         0  2
          30         1  2
  • 14. Repetition and definition levels
      • r: at what repeated field in the field's path the value has repeated
      • d: how many fields in the path that could be undefined (optional or repeated) are actually present
      • Levels on the path Name.Language.Code: record (r=0), Name (r=1), Language (r=2), Code (non-repeating)
      • Column Name.Language.Code for records r1 and r2 (the records of slide 12):
          value      r  d
          en-us      0  2    (record has repeated)
          en         2  2    (Language has repeated)
          NULL       1  1
          en-gb      1  2
          NULL       0  1
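A minimal Python sketch of how the r and d values of the Name.Language.Code column could be derived from records r1 and r2. The dict layout of the records and the function name stripe_code are assumptions for illustration, and the function is hard-wired to this one path; it is not Dremel's general column-striping algorithm.

```python
# Records r1 and r2 from the slides, written as plain Python dicts (assumed layout).
r1 = {
    "DocId": 10,
    "Links": {"Forward": [20, 40, 60]},
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
r2 = {
    "DocId": 20,
    "Links": {"Backward": [10, 30], "Forward": [80]},
    "Name": [{"Url": "http://C"}],
}


def stripe_code(record):
    """Return (value, r, d) triples for the path Name.Language.Code.

    Name and Language are repeated, so each contributes 1 to the
    definition level when present; Code is required and adds nothing.
    """
    names = record.get("Name", [])
    if not names:
        return [(None, 0, 0)]  # no Name at all: nothing on the path is defined
    out = []
    r = 0  # the first value of a record always has repetition level 0
    for name in names:
        languages = name.get("Language", [])
        if not languages:
            out.append((None, r, 1))  # Name is present, Language is missing
        for i, language in enumerate(languages):
            # The first Language inherits the level at which Name repeated;
            # later ones repeat at Language, which is level 2.
            out.append((language["Code"], r if i == 0 else 2, 2))
        r = 1  # any further Name repeats at level 1
    return out


column = stripe_code(r1) + stripe_code(r2)
for value, r, d in column:
    print(value, r, d)
```

The resulting triples match the Name.Language.Code table on the slide, including the two NULL entries that record a Name without any Language.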
  • 15. Record assembly FSM
      • Schema: message Document, as on slide 12
      • [Figure: finite-state machine whose states are the field readers DocId, Links.Backward, Links.Forward, Name.Language.Code, Name.Language.Country and Name.Url]
      • Transitions labeled with repetition levels
  • 16. Record assembly FSM: example
      • [Figure: the FSM over DocId, Links.Backward, Links.Forward, Name.Language.Code, Name.Language.Country and Name.Url, shown assembling record r1 of slide 12]
      • Transitions labeled with repetition levels
  • 17. Reading two fields
      • [Figure: two-state FSM over DocId and Name.Language.Country; transitions labeled with repetition levels 0 and 1,2]
      Assembled record s1:
          DocId: 10
          Name
            Language
              Country: 'us'
            Language
          Name
          Name
            Language
              Country: 'gb'
      Assembled record s2:
          DocId: 20
          Name
      • Structure of parent fields is preserved, which is useful for queries like /Name[3]/Language[1]/Country
  • 18. Query processing
      • Optimized for select-project-aggregate
          – Very common class of interactive queries
          – Single scan
          – Within-record and cross-record aggregation
      • Approximations: count(distinct), top-k
      • Joins, temp tables, UDFs/TVFs, etc.
  • 19. SQL dialect for nested data
      Query:
          SELECT DocId AS Id,
                 COUNT(Name.Language.Code) WITHIN Name AS Cnt,
                 Name.Url + ',' + Name.Language.Code AS Str
          FROM t
          WHERE REGEXP(Name.Url, '^http') AND DocId < 20;
      Output schema:
          message QueryResult {
            required int64 Id;
            repeated group Name {
              optional uint64 Cnt;
              repeated group Language {
                optional string Str;
              }
            }
          }
      Output table t1:
          Id: 10
          Name
            Cnt: 2
            Language
              Str: 'http://A,en-us'
              Str: 'http://A,en'
          Name
            Cnt: 0
      • No record assembly during query processing
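The meaning of the WITHIN aggregation can be sketched in Python as a reference evaluation over whole records. This is an assumption-laden illustration of the query's semantics only: Dremel itself evaluates the query on columns without assembling records, and the dict layout and the function name run_query are hypothetical.

```python
import re

# Records r1 and r2 from the slides as plain dicts (assumed layout).
r1 = {
    "DocId": 10,
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
r2 = {"DocId": 20, "Name": [{"Url": "http://C"}]}


def run_query(record):
    """Evaluate the slide's query on one record; None if the record is filtered out."""
    if not record["DocId"] < 20:
        return None
    names = []
    for name in record.get("Name", []):
        url = name.get("Url")
        # WHERE REGEXP(Name.Url, '^http'): a Name without a Url is dropped.
        if url is None or not re.search(r"^http", url):
            continue
        languages = name.get("Language", [])
        names.append({
            # COUNT(Name.Language.Code) WITHIN Name: one count per Name group.
            "Cnt": len(languages),
            # Name.Url + ',' + Name.Language.Code, one Str per Language.
            "Language": [{"Str": url + "," + lang["Code"]} for lang in languages],
        })
    return {"Id": record["DocId"], "Name": names}


results = [out for out in (run_query(r) for r in (r1, r2)) if out is not None]
```

For r1 the sketch yields one Name with Cnt 2 and one with Cnt 0, matching output table t1 on the slide; the third Name is dropped because it has no Url, and r2 is dropped because its DocId is not below 20.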
  • 20. Serving tree
      • [Figure: client → root server → intermediate servers → leaf servers (with local storage) → storage layer (e.g., GFS); inset: histogram of response times]
  • 21. Multi-level serving tree
      • Parallelizes scheduling and aggregation
          – Reduced fan-in
          – Divide and conquer
          – Better network utilization
      • Fault tolerance
  • 22. Example: count()
      Level 0 (root server):
          SELECT A, COUNT(B) FROM T GROUP BY A
          T = {/gfs/1, /gfs/2, …, /gfs/100000}
          rewritten as:
          SELECT A, SUM(c) FROM (R11 UNION ALL R110) GROUP BY A
      Level 1 (results R11, R12, . . . returned to the root):
          SELECT A, COUNT(B) AS c FROM T11 GROUP BY A
          T11 = {/gfs/1, …, /gfs/10000}
          SELECT A, COUNT(B) AS c FROM T12 GROUP BY A
          T12 = {/gfs/10001, …, /gfs/20000}
          . . .
      Level 3 (data access ops):
          SELECT A, COUNT(B) AS c FROM T31 GROUP BY A
          T31 = {/gfs/1}
          . . .
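The rewrite above, COUNT at the leaves and SUM of partial counts at every level above, can be sketched in Python. The four tablets, the two-level grouping, and the function names leaf_count and merge_partials are hypothetical; a real serving tree runs these steps on separate servers.

```python
def leaf_count(tablet):
    """SELECT A, COUNT(B) AS c FROM tablet GROUP BY A (a NULL B is not counted)."""
    counts = {}
    for a, b in tablet:
        counts[a] = counts.get(a, 0) + (0 if b is None else 1)
    return counts


def merge_partials(partials):
    """SELECT A, SUM(c) FROM (partials UNION ALL) GROUP BY A."""
    totals = {}
    for partial in partials:
        for a, c in partial.items():
            totals[a] = totals.get(a, 0) + c
    return totals


# Hypothetical tablets holding (A, B) rows; None stands for NULL.
t1 = [("x", 1), ("y", 2)]
t2 = [("x", None), ("x", 3)]
t3 = [("y", 4)]
t4 = [("z", None)]

# Two intermediate servers, each merging the results of two leaf servers.
intermediate_1 = merge_partials([leaf_count(t1), leaf_count(t2)])
intermediate_2 = merge_partials([leaf_count(t3), leaf_count(t4)])

# The root merges the intermediate results the same way.
root = merge_partials([intermediate_1, intermediate_2])

# A single flat scan over all tablets gives the same answer.
flat = leaf_count(t1 + t2 + t3 + t4)
```

The rewrite works because COUNT decomposes into a sum of partial counts, so each level of the tree only has to combine small per-group results instead of raw rows.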
  • 23. Experiments
      Table  Number of records  Size (unrepl., compressed)  Number of fields  Data center  Repl. factor
      T1     85 billion         87 TB                       270               A            3×
      T2     24 billion         13 TB                       530               A            3×
      T3     4 billion          70 TB                       1200              A            3×
      T4     1+ trillion        105 TB                      50                B            3×
      T5     1+ trillion        20 TB                       30                B            2×
      • 1 PB of real data (uncompressed, non-replicated)
      • 100K-800K tablets per table
      • Experiments run during business hours
  • 24. Read from disk
      • [Figure: time (sec) versus number of fields. From columns: (a) read + decompress, (b) assemble records, (c) parse as C++ objects. From records: (d) read + decompress, (e) parse as C++ objects]
      • Table partition: 375 MB (compressed), 300K rows, 125 columns
      • 2-4x overhead of using records
      • 10x speedup using columnar storage
  • 25. MR and Dremel execution
      • Task: average number of terms in txtField in the 85-billion-record table T1
      Sawzall program run on MR:
          num_recs: table sum of int;
          num_words: table sum of int;
          emit num_recs <- 1;
          emit num_words <- count_words(input.txtField);
      Q1 (Dremel):
          SELECT SUM(count_words(txtField)) / COUNT(*) FROM T1
      • [Figure: execution time (sec) on 3000 nodes for MR-records (87 TB), MR-columns (0.5 TB) and Dremel (0.5 TB)]
      • MR overheads: launch jobs, schedule 0.5M tasks, assemble records
  • 26. Impact of serving tree depth
      • [Figure: execution time (sec) of Q2 and Q3 with 2, 3 and 4 levels in the serving tree]
      Q2 (returns 100s of records):
          SELECT country, SUM(item.amount) FROM T2
          GROUP BY country
      Q3 (returns 1M records):
          SELECT domain, SUM(item.amount) FROM T2
          WHERE domain CONTAINS '.net'
          GROUP BY domain
      • 40 billion nested items
  • 27. Scalability
      • [Figure: execution time (sec) versus number of leaf servers]
      Q5 on a trillion-row table T4:
          SELECT TOP(aid, 20), COUNT(*) FROM T4
  • 28. Interactive speed
      • [Figure: percentage of queries versus execution time (sec); monthly query workload of one 3000-node Dremel instance]
      • Most queries complete in under 10 sec
  • 29. Observations
      • Possible to analyze large disk-resident datasets interactively on commodity hardware
          – 1T records, 1000s of nodes
      • MR can benefit from columnar storage just like a parallel DBMS
          – But record assembly is expensive
          – Interactive SQL and MR can be complementary
      • Parallel DBMSs may benefit from a serving tree architecture just like search engines
  • 30. Vs. MapReduce
      • Scheduling model
          – Coarse resource model reduces hardware utilization
          – Acquisition of resources typically takes hundreds of milliseconds to seconds
      • Barriers
          – Map completion is required before shuffle/reduce can commence
          – All maps must complete before reduce can start
          – In chained jobs, one job must finish entirely before the next one can start
      • Persistence and recoverability
          – Data is persisted to disk between each barrier
          – Serialization and deserialization are required between execution phases
  • 35. Full SQL – ANSI SQL 2003
      • SQL-like is not enough
      • Fine integration with existing BI tools
          – Tableau, SAP
          – Standard ODBC/JDBC driver
  • 36. Working data
      • Flat files in DFS
          – Complex data (Thrift, Avro, protobuf)
          – Columnar data (Parquet, ORC)
          – JSON
          – CSV, TSV
      • NoSQL stores
          – Document stores
          – Sparse data
          – Relational-like
  • 41. Nested data
      • Nested data as a first-class entity
          – Similar to BigQuery
          – No upfront flattening required
          – JSON, BSON, Avro, Protocol Buffers
  • 42. Cross data source queries
      • Combine data from files, HBase and Hive in one single query
      • No central metadata definitions necessary
  • 43. High-level architecture
      • Cluster of drillbits, one per node, designed to maximize data locality
      • Together they form a distributed query processing engine
      • ZooKeeper for cluster membership only
      • Hazelcast distributed cache for query plans, metadata and locality information
      • Columnar record organization
      • No dependency on other execution engines (MapReduce, Tez, Spark)
  • 45. Drillbit modules
      • SQL parser
      • Optimizer
      • Execution
      • Query execution
          – source query: what
          – logical plan: what
          – physical plan: how
          – execution plan: where
  • 47. Optimistic execution
      • Short-running queries
          – No checkpoints
          – Rerun the entire query in the face of failure
      • No barriers
      • No persistence
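The "no checkpoints, rerun the entire query" policy can be sketched in a few lines of Python. The function name run_optimistically, the max_attempts limit and the simulated failure are assumptions for illustration and are not taken from any real engine's API.

```python
def run_optimistically(query_fn, max_attempts=3):
    """Run a short query with no checkpoints; on failure, restart it from scratch.

    Returns (result, attempts_used). max_attempts is a hypothetical safety limit.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # No intermediate state is saved, so every attempt starts over.
            return query_fn(), attempt
        except RuntimeError as err:
            last_error = err
    raise last_error


# Hypothetical query that fails once (e.g., a lost node) and then succeeds.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated node failure")
    return sum(range(10))


result, attempts = run_optimistically(flaky_query)
```

The trade-off is that a failure costs a full rerun, which is acceptable only because the queries are short; in exchange, the normal path pays nothing for checkpoints, barriers or persisted intermediate data.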