GalvanizeU Seattle: Eleven Almost-Truisms About Data

Eleven Almost-Truisms
About Data
2015-07-24 • Seattle
Paco Nathan, @pacoid 
O’Reilly Learning

Set and Setting:
Almost a Dozen Almost-Truisms about Data … 
to consider when embarking on a journey  
into Data Science
There are a number of preconceptions about
working with data at scale, where the realities
beg to differ
We’ll crank this number up to eleven – even
though the actual number is of course much
larger, but that’s perhaps for another day

Let’s discuss some less-intuitive directions,
along with likely consequences and corollaries
This is not intended to prove a set of points,
rather to provide a set of launching points
Set and Setting:

The rates of data being stored and analyzed
jumped quite dramatically in the late 1990s  
to early 2000s … partly because storage
became incredibly cheap … partly because
internetworked machines suddenly started
producing much more machine data
Fifteen years later, the rates jump again, this
time by orders of magnitude … Because IoT
It’s almost like this thing has a pulse?
#01: Because Rates

In other words, to paraphrase von Schelling,
experience precedes analysis
Typically, we’re swimming in data, and we tend
to respond by struggling to understand its
structure and dynamics
That, in contrast to the myth that our analysis
drives data collection
#01: Because Rates

Four independent teams were working toward horizontal  
scale-out of workflows based on commodity hardware
This effort prepared the way for huge Internet successes during 
the 1997 holiday season…
AMZN, EBAY, Inktomi (YHOO Search), then GOOG
MapReduce on clusters of commodity hardware and the  
Apache Hadoop open source stack emerged from this context
#01: Because Rates – 1997 Q3 Inflection Point

Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-
website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/
you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/
eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtu.be/E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtu.be/qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/
JeffDeanOnGoogleInfrastructure.aspx
#01: Because Rates – 1997 Q3 Inflection Point

[Diagram: e-commerce data architecture. Labels: Web Apps, Middleware, servlets, models, customer transactions, RDBMS, SQL Query, result sets, Logs, event history, DW, ETL, aggregation, dashboards, Algorithmic Modeling, recommenders + classifiers. Teams: Product, Engineering, UX, Stakeholder, Customers]
#01: Because Rates – Circa 2001, post e-commerce success

[Diagram repeated, with the added label “data products”]
#01: Because Rates – Circa 2001, post e-commerce success

Primary sources for the notion:
Cleveland, W. S.,  
“Data Science: an Action Plan for Expanding  
the Technical Areas of the Field of Statistics,”  
International Statistical Review (2001), 69, 21-26.
http://cm.bell-labs.com/stat/doc/datascience.ps
Breiman, L.,  
“Statistical modeling: the two cultures”,  
Statistical Science (2001), 16:199-231.
http://projecteuclid.org/euclid.ss/1009213726
…also good to mention John Tukey
#01: Because Rates – Whither Data Science?

Rashomon, the 1950 Japanese period drama  
by Akira Kurosawa, symbolizes a long-standing
tension in Statistics, one which Mark Twain
described ever so succinctly…
wikipedia.org/wiki/Rashomon:
“The film is known for a plot device 
which involves various characters 
providing alternative, self-serving 
and contradictory versions of the 
same incident.”
#01: Because Rates – A Sea Change

Because IoT! (exabytes/day per sensor)
bits.blogs.nytimes.com/2013/06/19/g-e-makes-the-
machine-and-then-uses-sensors-to-listen-to-it/
#01: Because Rates – A Sea Change, Redux

#02: Batch Defenestration
Batch Analytics
Going strong, since 1944 
Been there, done that

Businesses want to join the 21c.,  
and level up to streaming analytics
“I saw what you did … in batch,” 
now performed a zillion times faster
#02: Batch Defenestration – Infrastructure, Remodeled
[Chart: Contributors per Month to Spark, 2011–2015; y-axis from 0 to 100]
Most active project at Apache,
More than 500 known production deployments

Tuning Spark Streaming for Throughput
Gerard Maas, 2014-12-22
virdata.com/tuning-spark/
#02: Batch Defenestration – “Team Apache”, $316.4M funding

Can Spark Streaming survive Chaos Monkey?
Bharat Venkat, Prasanna Padmanabhan,  
Antony Arokiasamy, Raju Uppalapati
techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
#02: Batch Defenestration – Resiliency, at the edge of Comp Sci

Trending interests:
• electric cars
• organic farm-to-table cuisine
• permaculture
• sustainable urbanism
#03: Circa 1904

Speaking of batch windows…
The last century or two of statistics
represent an extremely huge mess
Let’s start the clock over, then move
forward into a more real-time near-future
#03: Circa 1904

#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
Probability got going, formally, in the 16th c. –  
although interesting mathematical estimations  
trace back to classical times
Arabs in the 9th c. used frequency analysis –  
later rediscovered by Europeans during the  
early Italian Renaissance
Statistics followed, originally more about what  
we might call demographics – through 18th c.

Laplace, Gauss, et al., bridged prob & stats in the  
late 18th c. using distributions (what we studied  
in Stats 101) to infer the probability of errors  
in estimates
Much of the 19th/20th c. work was about using
goodness of fit tests, etc., justifying some distribution
• generally speaking, those tests require samples
• that, in turn, implies batch windows

While 19th/20th c. stats work focused on defensibility
21st c. work, w.r.t. Big Data apps, focuses more  
on predictability – plus there’s a shift in how we make
estimates…
BTW, doesn’t it seem weird to crunch through piles
of data in large batch jobs, at large expense, when
the results get used to approximate features
ultimately? Why not perform that in stream?
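
As one small illustration of estimating in stream rather than in batch, here is a minimal sketch of Welford's online algorithm for mean and variance. It is a swapped-in textbook technique, not something from the talk; the class name `RunningStats` and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running count, mean, and variance, updated one value at a time."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Welford's update: earlier values never need to be stored
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # sample variance; undefined for fewer than two observations
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # stand-in for a stream
    stats.update(value)
```

Each arriving value costs constant time and memory, so the estimate is always current and no batch window is needed.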

A fascinating, relatively new area pioneered by
relatively few people – e.g., Philippe Flajolet
Provides approximation with error bounds using
far fewer resources (RAM, CPU, etc.)
highlyscalable.wordpress.com/
2012/05/01/probabilistic-structures-
web-analytics-data-mining/

algorithm          use case                  example
Bloom Filter       set membership            code
MinHash            set similarity            code
HyperLogLog        set cardinality           code
Count-Min Sketch   frequency summaries       code
DSQ                streaming quantiles       code
SkipList           ordered sequence search   code
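
To make the first row concrete, here is a minimal Bloom filter sketch for approximate set membership. It is an illustration under stated assumptions, not the code the slide linked to: the class name, the default sizes, and the choice of salted SHA-256 as the hash family are all hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item: str):
        # derive k positions by salting one hash function with the index
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # False means definitely absent; True means probably present
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for user in ["alice", "bob", "carol"]:  # hypothetical keys
    seen.add(user)
```

Items that were added always test as present; items that were not added may occasionally test as present too, which is the error the structure trades for its fixed memory footprint.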

E.g., ±4% could buy you two orders of magnitude
reduction in the required memory footprint for  
an analytics app
OSS projects such as Algebird and BlinkDB
provide for this newer approach to the math of
approximations at scale
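
A rough sketch of how an error tolerance maps to memory, using a Count-Min Sketch with the standard sizing rule (width ⌈e/ε⌉, depth ⌈ln(1/δ)⌉). This is not Algebird's or BlinkDB's API; the class, the salted SHA-256 hashing, and the event data are assumptions made for illustration.

```python
import hashlib
import math
from collections import Counter


class CountMinSketch:
    def __init__(self, epsilon: float, delta: float) -> None:
        # overestimate stays within epsilon * total, with probability 1 - delta
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0

    def _column(self, row: int, item: str) -> int:
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        self.total += count
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += count

    def estimate(self, item: str) -> int:
        # collisions only ever inflate a counter, so take the minimum
        return min(self.table[row][self._column(row, item)]
                   for row in range(self.depth))


events = ["click"] * 500 + ["view"] * 300 + ["buy"] * 20  # hypothetical stream
sketch = CountMinSketch(epsilon=0.04, delta=0.01)
for e in events:
    sketch.add(e)
exact = Counter(events)
```

With ε = 0.04 and δ = 0.01 the table is 5 rows by 68 columns no matter how many distinct keys arrive, and an estimate is never below the true count.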

IMO, many notions of “API” are illusions
Arguably, reductionist shell games
And that imposes limitations on how we
work, and even how we think…
#04: Your API is an Illusion

[Diagram: machine learning workflow, circa 2010, with phases labeled representation, optimization, evaluation. Labels: ETL into cluster/cloud; data; Data Prep; Features; Explore; visualize, reporting; Unsupervised Learning; Learners, Parameters; train set; test set; models; Evaluate; Optimize; Scoring; production data; use cases; data pipelines; actionable results; decisions, feedback; foo algorithms; bar developers]
Algorithms and developer-centric template thinking
only go so far in a workflow…
Results are shown in blue, while the real work  
is highlighted in red
#04: Your API is an Illusion – The Libraries: Alexandria, Redux

On the other hand, Physics
does well to teach modeling –
I like to hire physicists to work
on Data teams…
They tend to get the interdisciplinary aspects:  
got the math background, coding experience,  
generally good at systems engineering, etc.
Not saying we must all rush out to get Physics  
degrees – there’s something to be learned there,  
vital for the work and priorities ahead
#04: Your API is an Illusion – The Interzone

“The impact of computing extends far beyond 
science… affecting all aspects of our lives.  
To flourish in today's world, everyone needs 
computational thinking.” – Jeannette Wing, CMU
Computing now ranks alongside the proverbial
Reading, Writing, and Arithmetic…
Center for Computational Thinking @ CMU 
http://www.cs.cmu.edu/~CompThink/
Exploring Computational Thinking @ Google 
https://www.google.com/edu/computational-thinking/
#04: Your API is an Illusion – Antidote: Computational Thinking

Even so, do we really need to  
write code for WordCount  
10^N times?
#05: Code Inceptionism

Inceptionism: Going Deeper into  
Neural Networks 
Alexander Mordvintsev,  
Christopher Olah, Mike Tyka 
Google (2015-06-17)
googleresearch.blogspot.com/2015/06/
inceptionism-going-deeper-into-neural.html
Artificial Neural Networks have spurred remarkable recent
progress in image classification and speech recognition. But
even though these are very useful tools based on well-known
mathematical methods, we actually understand surprisingly
little of why certain models work and others don’t. So let’s
take a look at some simple techniques for peeking inside
these networks.

Imagine data mining GitHub commit
histories of popular open source projects,
then applying genetic programming to
evolve patches for other OSS projects...  
 
In other words, brilliant:
Sidebar: Claire Le Goues, automating software repair
Claire Le Goues 
cmu.edu
GenProg: A Generic Method for Automatic
Software Repair 
Claire Le Goues, ThanhVu Nguyen,
Stephanie Forrest, Westley Weimer 
IEEE TSE (2012) 
www.cs.cmu.edu/~clegoues/docs/legoues-tse-genprog12.pdf
We describe the algorithm and report experimental
results of its success on 16 programs totaling 1.25M
lines of C code and 120K lines of module code,
spanning eight classes of defects, in 357 seconds,  
on average. We analyze the generated repairs
qualitatively and quantitatively to demonstrate  
that the process efficiently produces evolved
programs that repair the defect, are not fragile  
input memorizations, and do not lead to serious
degradation in functionality.

Are databases going extinct?
Distributed file systems that can be accessed
as column stores are generally quite useful
There’s an old saying in Computer Science:  
it’s difficult to distinguish a really good file
system from a database, and vice versa
#06: Database Extinction?

Original definitions for what became relational
databases had less to do with dedicated SQL
products, more similarity with something like
Spark SQL:
A relational model of data for  
large shared data banks 
Edgar Codd 
Communications of the ACM (1970) 
dl.acm.org/citation.cfm?id=362685

[Diagram: Tungsten Execution. Python, SQL, R, Streaming, and Advanced Analytics sit on top of DataFrame. Physical Execution (Tungsten): CPU Efficient Data Structures; Keep data closer to CPU cache]

#07: “N Dims good, 2 Dims baa-d”

Consider: matrices, pivot tables, etc.
Our thinking about data representation  
is often quite two-dimensional…

• many real-world problems are often
represented as graphs
• graphs can generally be converted into sparse
matrices (bridge to linear algebra)
• eigenvectors find the stable points in  
a system defined by matrices – which  
may be more efficient to compute
• beyond simpler graphs, complex data  
may require work with tensors

Suppose we have a graph as shown below:
We call x a vertex (sometimes called a node)
An edge (sometimes called an arc) is any line
connecting two vertices
[Figure: graph with vertices u, v, w, x]

We can represent this kind of graph as an
adjacency matrix:
• label the rows and columns based  
on the vertices
• entries get a 1 if an edge connects the
corresponding vertices, or 0 otherwise
[Figure: graph with vertices u, v, w, x]
    u  v  w  x
u   0  1  0  1
v   1  0  1  1
w   0  1  0  1
x   1  1  1  0
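
The two steps above can be sketched in a few lines. The edge list here is read off the matrix in the slide (the figure itself is not reproduced), and the variable names are illustrative only.

```python
vertices = ["u", "v", "w", "x"]
edges = [("u", "v"), ("u", "x"), ("v", "w"), ("v", "x"), ("w", "x")]

# label rows and columns by vertex
index = {name: i for i, name in enumerate(vertices)}
n = len(vertices)

# entries get a 1 where an edge connects the two vertices, 0 otherwise
A = [[0] * n for _ in range(n)]
for a, b in edges:
    i, j = index[a], index[b]
    A[i][j] = 1
    A[j][i] = 1  # undirected graph: mirror each edge

# sparse view: keep only the coordinates of the non-zero entries
sparse = {(i, j) for i in range(n) for j in range(n) if A[i][j]}
```

For large graphs most entries are zero, so the coordinate set is usually the more practical representation.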

An adjacency matrix for an undirected graph
always has certain properties:
• it is symmetric, i.e., A = Aᵀ
• it has real eigenvalues
Therefore algebraic graph theory bridges
between linear algebra and graph theory
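
A minimal check of both properties on the 4-vertex example, assuming vertex order u, v, w, x. Power iteration is used here as one simple way to approximate the dominant eigenvalue; it is a swapped-in illustration, and the function names are hypothetical.

```python
import math

A = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
]
n = len(A)

# symmetric: A[i][j] == A[j][i] for every pair
is_symmetric = all(A[i][j] == A[j][i] for i in range(n) for j in range(n))


def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]


def power_iteration(M, steps=200):
    # repeatedly apply M and renormalize; converges to the dominant eigenvector
    x = [1.0] * len(M)
    for _ in range(steps):
        y = matvec(M, x)
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    # Rayleigh quotient x·(Mx) for the unit vector x
    lam = sum(a * b for a, b in zip(x, matvec(M, x)))
    return lam, x


dominant_value, dominant_vector = power_iteration(A)
```

For this graph the dominant eigenvalue works out to (1 + √17)/2, and the eigenvector gives the better-connected vertices v and x more weight than u and w.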

Tensors are a good way to handle time-
series geo-spatially distributed linked data
with lots of N-dimensional attributes
In other words, potentially a general case  
for handling much of the data that we’re
likely to encounter
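
As a toy sketch of the idea, the snippet below arranges time-stamped, geo-located readings into a third-order tensor indexed by (hour, site, attribute). The readings, site names, and attribute names are invented for illustration.

```python
# hypothetical readings: (hour, site, attribute, value)
readings = [
    (0, "seattle", "temp", 14.0),
    (0, "seattle", "humidity", 0.71),
    (0, "austin", "temp", 29.0),
    (1, "seattle", "temp", 15.5),
    (1, "austin", "temp", 30.5),
    (1, "austin", "humidity", 0.40),
]

hours = sorted({r[0] for r in readings})
sites = sorted({r[1] for r in readings})
attrs = sorted({r[2] for r in readings})

# third-order tensor T[hour][site][attribute]; missing cells stay None
T = [[[None for _ in attrs] for _ in sites] for _ in hours]
for hour, site, attr, value in readings:
    T[hours.index(hour)][sites.index(site)][attrs.index(attr)] = value

# fixing one index yields an ordinary matrix (a slice of the tensor)
temp = attrs.index("temp")
temp_by_hour_and_site = [[T[h][s][temp] for s in range(len(sites))]
                         for h in range(len(hours))]
```

Each additional kind of attribute or linkage adds another index, which is where the two-dimensional table view starts to strain.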

Although tensor factorization is considered
problematic (many tensor problems are NP-hard
in general), it may provide more general
case solutions:
The Tensor Renaissance in Data Science 
Anima Anandkumar @UC Irvine 
radar.oreilly.com/2015/05/the-tensor-
renaissance-in-data-science.html
Spacey Random Walks and  
Higher Order Markov Chains 
David Gleich @Purdue 
slideshare.net/dgleich/spacey-random-walks-
and-higher-order-markov-chains

There is Science … and there is Data
Data Science is largely about interdisciplinary
teams, largely about crossing boundaries
(organizational, cognitive) that might otherwise
preclude arriving at crucial insights –
In other words, about learning
It’s also about the repeatability and predictive
aspects of science, where workflows combine
people + automation
NB: may conflict with large portions of academia
which tend to decontextualize subjects
#08: Science … and Data

The Science in Data Science tends to rely on
the phenomenology and modeling of complex
systems (did we already mention Physics?)
Speaking of science and predictions, two
important thinkers to include:
• Charles Sanders Peirce – one of the
most prolific scientists in the US, and also
one of the fiercest critics (abduction,
etc.)
• Karl Popper – who articulated some  
of the inherent risks of mixing “science”,
“history”, and politics

For excellent examples of Science and Data
together, see CodeNeuro, particularly for
use of notebooks:

#09: Learning Curves are Forever

Learning Curves are forever –  
the part you need to manage
more carefully than just about
anything else, especially within 
a social context
In some sense, this is the essence of
Data Science: How well do you
learn?
Much of the risk in managing  
a Data Science team is about
budgeting for the learning curve

In contrast, IT has a long history of
practicing a flavor of engineering
“conservatism”: highly structured
process, strictly codified practices
People learn a few things well, then
avoid having to struggle with learning
many new things perpetually…
That leads to enormous teams and
low ROI, among other badness
[Chart axes: scale ➞, complexity ➞]

Throw Your Life a Curve 
Whitney Johnson
blogs.hbr.org/johnson/2012/09/
throw-your-life-a-curve.html
Aggressively Pro-Active Learning:
• deconstruction of the cognitive bias One Size Fits All
• “makes a compelling case for personal disruption”
• “plan your career around learning curves”
• hire people who learn/re-learn efficiently

Education is more than just lessons, exams,
certifications, instructor evaluations, etc., …
though some tools would try to reduce it  
to that level
What’s even more interesting is to leverage
ML to understand the “distance” between
the learner and the subject material

#10: Books, not so much, sadly…

Speaking as a former alt bookstore owner…
Sadly, we don’t use books quite as much  
these days:
• above ~35: buy it on Kindle
• below ~35: watch it on YouTube

From a publisher perspective, consider
some of the risks:
• fewer people buy the titles
• search engines surface oh-so-much noise
• increasingly, it’s more difficult for experts
to take time to author good content and
keep it updated
[Chart: Contributors per Month to Spark, 2011–2015; y-axis from 0 to 100]
Most active project at Apache,
More than 500 known production deployments

However, it’s unlikely that Kindle, etc.,
represent the end-all-be-all of publishing…
Here’s an idea: your next “book” or
“video” should be able to compute
something useful

Interactive notebooks: Sharing the code
Helen Shen
Nature (2014-11-05)
nature.com/news/interactive-notebooks-
sharing-the-code-1.16261
#10: Books, not so much – Repeatable Science

Embracing Jupyter Notebooks at O'Reilly 
Andrew Odewahn, 2015-05-07
https://beta.oreilly.com/ideas/jupyter-at-oreilly
“O'Reilly Media is using our Atlas platform to  
make Jupyter Notebooks a first class authoring
environment for our publishing program.”
Jupyter, Thebe, Docker, etc.
#10: Books, not so much – Something Borrowed, Something New


MOOCs have become popular, some are
quite useful … even so, these tend to have  
a very low completion rate
Don’t hold your breath waiting for MOOCs
to replace other modes of education
Learning generally requires a social context:
for reinforcement, peer insights/modeling,
and frankly some people really feel a need
to be given permission to learn
#11: A MOOCish Edumacation?

One problem with university study is that
disciplines tend to decontextualize
GalvanizeU is a rare opportunity in that way:
accredited, with contextualized hands-on
experience

A significant improvement may be
found in the notion of “flipped”  
or inverted classrooms
For a good example, see:
Caltech Offers Online Course with  
Live Lectures in Machine Learning
Yaser Abu-Mostafa (2012-03-30)
http://www.caltech.edu/news/caltech-offers-online-
course-live-lectures-machine-learning-4248

So a good bit of advice about learning and
Data Science … is to invert your classrooms,
recontextualize, cross the boundaries to do
things that matter, and leverage the hands-on
social aspects of learning
Like here at GalvanizeU
Summary…

contact:
Just Enough Math
O’Reilly (2014)
justenoughmath.com 
preview: youtu.be/TQ58cWgdCpA
monthly newsletter for updates,  
events, conf summaries, etc.:
liber118.com/pxn/
Intro to Apache Spark 
O’Reilly (2015) 
shop.oreilly.com/product/
0636920036807.do

After we’ve cleaned up data, formulated workflows
in terms of monoids, used graph representation, and
parallelized with a wealth of linear algebra, much of
the heavy-lifting that remains on the clusters is in
optimization
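
For readers new to the term: a monoid is an associative combine operation with an identity element, which is what lets partial results from separate partitions be merged in any grouping. A minimal sketch, with the hypothetical name `MeanState` and made-up partitions:

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class MeanState:
    """(count, total) pairs form a monoid; MeanState() is the identity."""
    count: int = 0
    total: float = 0.0

    def combine(self, other: "MeanState") -> "MeanState":
        # associative: (a + b) + c gives the same state as a + (b + c)
        return MeanState(self.count + other.count, self.total + other.total)

    @property
    def mean(self) -> float:
        return self.total / self.count


def summarize(partition):
    # fold one partition into a single state, starting from the identity
    return reduce(MeanState.combine, (MeanState(1, x) for x in partition), MeanState())


partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]  # stand-in for cluster partitions
partials = [summarize(p) for p in partitions]
left_first = reduce(MeanState.combine, partials, MeanState())
right_first = partials[0].combine(partials[1].combine(partials[2]))
```

Because the grouping does not matter, each partition can be summarized independently and the small states merged afterward.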
For example, deep learning @Google  
uses many layers of neural nets trained  
with gradient descent optimization
Taming Latency Variability and Scaling Deep Learning 
Jeff Dean @Google (2013) 
youtu.be/S9twUcX1Zp0
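
To show the optimization step in its simplest form, here is plain batch gradient descent fitting a line by least squares. It is a toy stand-in, not a neural net and not Google's system; the data, learning rate, and iteration count are assumptions chosen so the loop converges.

```python
# hypothetical data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def gradient(w, b):
    """Gradient of mean squared error for the model y ≈ w*x + b."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db


w, b = 0.0, 0.0
learning_rate = 0.05
for _ in range(5000):
    dw, db = gradient(w, b)
    # step against the gradient to reduce the error
    w -= learning_rate * dw
    b -= learning_rate * db
```

Every step needs a gradient over the data, which is why this loop is the part of the workload that scales with data size and model size.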
Vector Quantization:

One potential advantage of quantum algorithms is  
estimating a numerical gradient with a single
function query, regardless of dimension… Once high-ROI apps
are reworked to leverage lots of ML and large clusters,  
SGD represents the datacenter cost basis,
notably the part that scales…
Want to slash costs exponentially?  
Plug in quantum for a game-changer, 
maybe
Fast quantum algorithm for  
numerical gradient estimation 
Stephen P. Jordan 
Phys. Rev. Lett. 95, 050501 (2005) 
arxiv.org/abs/quant-ph/0405146
dwavesys.com

Proposal: let’s drop clusters of quantum
devices into lunar polar craters, so we  
can handle massive vector quantization
workloads
• micro-kelvin environs
• near perpetual sunlight  
for energy sources
• park routers at L4
• approx. $15B to finance,  
i.e., ~6 days DoD budget

We’ll just put this here…  
a couple o’ Googly projects in progress:
qCraft: Quantum Physics In Minecraft 
plus.google.com/u/
1/+QuantumAILab/posts/
grMbaaDGChH
“We’re going back to the Moon. For good.”
lunar.xprize.org
