This document discusses how big data and data science can be used to attain new heights, likening it to magic. It provides an overview of Ken Johnston's background and experiences in data science. It then discusses six keys to a "big" magic show with big data: trying multiple times, addressing issues with over-counting, experimentation techniques like A/B testing, infrastructure for big data, tools and skills, and security, privacy and fraud protection. The document emphasizes the importance of an assistant to help the data scientist or data engineer with various tasks.
Big Data: The Magic to Attain New Heights
1. Big Data: The Magic to Attain New Heights
Ken Johnston, Principal Data Science Manager
Twitter – @rkjohnston
Blog – http://linkedin.com/in/rkjohnston
Email – kenj@Microsoft.com
LinkedIn – http://linkedin.com/in/rkjohnston
@rkjohnston #DataMagic
2. About Ken
• Data Scientist in the Core Data Science Team
• Office Live, WebApps, Office Online
• Cosmos, AutoPilot, Local, Shopping
• Kanban and Data Science series on LinkedIn
• EaaSy & MVQ – Everything as a Service & Minimum Viable Quality
• Writes books, a blog, and some fiction
16. Six Keys to a “Big” Magic Show
• Try, Try, Try Again
• The Tyranny of Counting
• Magic Tricks (A/B Testing, Runtime Flags)
• The Venue (Big Data Infrastructure)
• Foundation (Tools for Big Data)
• Security (Protection, Privacy, Fraud)
• The Assistant (Recruit, Train, & Retain)
“Big Data” Search Trends
@rkjohnston #DataMagic
18. Common Design Patterns
Good paper to read – IDC: Six Patterns of Big Data and Analytics Adoption: The Importance of the Information Architecture
• Ingest – from services, IoT, and apps; via streams; into storage
• Process – build pipelines; reduce, transform, join; pipe out
• Analyze
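The Ingest → Process → Analyze pattern above can be sketched end to end in a few lines. This is a minimal illustration, not code from the talk; the function names and the toy click events are invented.

```python
# Toy ingest -> process -> analyze pipeline (illustrative names only).

def ingest(events):
    """Ingest: land raw events (from services, IoT, apps) in storage."""
    return [dict(event) for event in events]

def process(storage):
    """Process: reduce/transform the raw events into a refined form."""
    refined = {}
    for event in storage:
        # Reduce: total clicks per user.
        refined[event["user"]] = refined.get(event["user"], 0) + event["clicks"]
    return refined

def analyze(refined):
    """Analyze: derive a simple insight (most active user)."""
    return max(refined, key=refined.get)

raw = [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 5},
       {"user": "a", "clicks": 4}]
print(analyze(process(ingest(raw))))  # "a" (7 clicks beats b's 5)
```

In a real platform each stage is a separate service with durable storage between stages; the function boundaries here only mirror that shape.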
19. Azure Model
Cindy Gross – Technical Fellow: Big Data and Cloud
Twitter: @SQLCindy, cindyg@NealAnalytics.com
Ingest → Process → Analyze
23. Prototypical Big Data Platform
Clients 1–3 send data to a Telemetry Front-End Service. A fast pipeline for high-priority data feeds an Alerting DB and Alerting Dashboard. The main path flows through a Big Data Map/Reduce Cloud, a PII Scrubbing Service, and a Data Extraction Service into Insights DBs 1–N and additional reporting dashboards.
• Personally Identifiable Information (PII) management is very critical.
• Data Driven Quality (DDQ) and big data pipelines will need a cloud platform.
• The superfast pipeline typically (not always) bypasses the cloud and is also void of PII.
Big Data & ML Model Orchestration
@rkjohnston #DataMagic
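The PII Scrubbing Service in the diagram could, at its simplest, mask identifiers in telemetry before records reach the insights stores. This is a hypothetical sketch: the field names and patterns are invented, and a real scrubber would cover far more PII classes than emails and IPs.

```python
import re

# Hypothetical PII scrubber. The deck's notes say scrubbing happens at
# the client and again upon ingestion; this sketch only masks emails
# and IPv4 addresses in string fields.

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def scrub_pii(record: dict) -> dict:
    """Return a copy of a telemetry record with obvious PII masked."""
    clean = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = EMAIL.sub("<email>", value)  # mask emails first
            value = IPV4.sub("<ip>", value)      # then IP addresses
        clean[key] = value
    return clean

event = {"msg": "login from 10.0.0.7 by ken@example.com", "latency_ms": 42}
print(scrub_pii(event)["msg"])  # "login from <ip> by <email>"
```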
24. Prototypical Big Data Platform (continued)
The same platform diagram, with its stages mapped to Ingest → Process → Analyze.
Big Data & ML Model Orchestration
@rkjohnston #DataMagic
25. User Segmentation Approaches
• Risk Tolerance Model
  • Users segment themselves
  • Opt in for greater risk with a reward in mind
• Profile Based
  • Usage behaviors (new vs. power users)
  • Browser type
  • Connection type
  • Device and device OS
@rkjohnston #DataMagic
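The two approaches above can be combined in a single segmentation function. A toy sketch; the segment names, profile fields, and the ten-sessions-per-week threshold are all invented for illustration.

```python
# Hypothetical segmentation combining opt-in (risk tolerance) and
# profile-based signals. All field names and thresholds are invented.

def segment(profile: dict) -> str:
    """Assign a user profile to a coarse segment."""
    if profile.get("opted_into_beta"):
        return "risk_tolerant"            # users segment themselves (opt-in)
    if profile.get("sessions_per_week", 0) >= 10:
        return "power_user"               # profile-based: usage behavior
    return "new_user"

print(segment({"opted_into_beta": True}))   # risk_tolerant
print(segment({"sessions_per_week": 14}))   # power_user
print(segment({"sessions_per_week": 2}))    # new_user
```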
26. Balancing Speed and Risk with Rings
• Ring 0: Buddy Build
• Ring 1: My Team
• Ring 2: Company & NDA
• Ring 3: External Beta Users
• Ring 4: Everyone
A red line demarks disclosure risk and possible loss of patent rights. Risk tolerance is highest in the inner rings; by Ring 4 there is no desire for risk.
@rkjohnston #DataMagic
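Ring progression gated on telemetry, as the editor's notes describe for Rings 2–4, might look like the sketch below. The ring names follow the slide; the crash-rate gate and its threshold are invented for illustration.

```python
# Sketch of a ring-gated rollout: a build only advances to the next
# ring when its telemetry passes a quality gate. The 1% crash-rate
# threshold is a made-up example, not a recommendation from the deck.

RINGS = ["ring0_buddy", "ring1_team", "ring2_company_nda",
         "ring3_external_beta", "ring4_everyone"]

def next_ring(current: str, crash_rate: float,
              max_crash_rate: float = 0.01) -> str:
    """Return the next ring if the quality gate passes, else stay put."""
    i = RINGS.index(current)
    if crash_rate > max_crash_rate or i == len(RINGS) - 1:
        return current            # gate failed, or already fully released
    return RINGS[i + 1]

print(next_ring("ring1_team", crash_rate=0.002))  # advances to ring2
print(next_ring("ring1_team", crash_rate=0.05))   # held at ring1
```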
29. Office 365 Advanced Threat Protection
• A big-data-only solution
• Safe Links is powered by Cloud Exchange & Bing data
• AI model powered by data from thousands of companies and attachments
@rkjohnston #DataMagic
33. App Store Data Architecture
Feeds into Cosmos storage and compute:
• App certification and analysis pipeline
• Store services log and telemetry
• Bing spam and malware
• Windows services safety platform (MSA, SmartScreen, etc.)
• MMPC/SpyNet
Signals used: network IPs, file hashes, PhotoDNA, strings, APIs called, user install data, ratings and reviews, purchases, geographic data, account reputation, bad URLs, botnet-infected clients.
BTW, this was not Big Data.
34. NoName Was Learning Basic DS
“Look at how I did this k-means clustering and found these weird outliers in buying circles from dev accounts created the same week and from the same IP address.”
“Check it out, I found this guy’s FB page. We have his picture!”
NoName and I were spitballing ideas.
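A 1-D k-means like the one described, clustering purchase totals so that a small, far-away cluster stands out, can be written from scratch in a few lines. This is a generic sketch with invented data, not the intern's actual analysis.

```python
# From-scratch 1-D k-means. The purchase totals are invented; the
# small cluster far from the rest plays the role of the "weird
# outliers in buying circles" from the anecdote.

def kmeans_1d(xs, k, iters=20):
    """Plain 1-D k-means: returns final centers and their clusters."""
    step = max(1, len(xs) // k)
    centers = sorted(xs)[::step][:k]          # crude spread-out init
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in xs:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)       # assign to closest center
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]  # recompute means
    return centers, clusters

purchases = [5, 6, 7, 8, 9, 100, 105]
centers, clusters = kmeans_1d(purchases, 2)
print(min(clusters, key=len))  # the suspicious small cluster: [100, 105]
```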
35. Fraud Network Identification
XXXDeveloper created 40 different Store developer accounts and 100s of apps.
Signals linking Bad Dev 1, Bad Dev 2, … Bad Dev ‘N’, and new identities:
• Shared fraudulent payment instruments
• App similarity
• App metadata (URLs, websites)
• Social networks
• 3rd-party app stores
• Developer watering holes
• New identity metadata
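Grouping developer accounts that share a signal such as a payment instrument is a connected-components problem. Below is a sketch using union-find; the account and card identifiers are made up, and a real system would combine many more signals (app similarity, metadata, watering holes) than this single one.

```python
from collections import defaultdict

# Hypothetical fraud-network grouping: link developer accounts that
# share a payment instrument, then return the connected groups.

def fraud_networks(accounts: dict) -> list:
    """accounts maps dev account -> set of payment-instrument ids.
    Returns groups of accounts connected by shared instruments."""
    by_instrument = defaultdict(set)
    for dev, instruments in accounts.items():
        for pi in instruments:
            by_instrument[pi].add(dev)

    parent = {dev: dev for dev in accounts}   # union-find forest
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)

    for devs in by_instrument.values():
        devs = sorted(devs)
        for d in devs[1:]:
            union(devs[0], d)                 # same card => same network

    groups = defaultdict(set)
    for dev in accounts:
        groups[find(dev)].add(dev)
    return [g for g in groups.values() if len(g) > 1]

accounts = {"dev1": {"card_A"}, "dev2": {"card_A", "card_B"},
            "dev3": {"card_B"}, "dev4": {"card_C"}}
print(fraud_networks(accounts))  # one network: dev1, dev2, dev3
```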
41. Code Churn Example 1: Six-Week Coding Milestone
Code churn is cumulative. Imagine this as part of a larger multi-layered project (Layer 1, Layer 2, Layer 3).
• Tightly coupled layers
• Long stabilization phase
• Complicated end-to-end integration
• Sim-ship increases risk
• Maximum point of instability is at the end of the milestone
@rkjohnston #DataMagic
42. Code Churn Example 2 (Continuous Deployment)
Layers 1 through N ship on a rapid release cadence (weekly or daily).
• Risk per release decreases because of more incremental change
• You still must be careful of risk within production, but…
• Total risk over time can be less with incremental change
• Max risk is in production
@rkjohnston #DataMagic
45. Measures = Test Cases
• We do measures: what is a post-release test case?
• Automation validates the golden path; we measure the golden path
• Measures are the same as test cases
• Monitor the golden path
@rkjohnston #DataMagic
46. >1.5*IQR = Outlier = Bug (Probably)
• What is a test case? What I expect to happen vs. what does happen. A test case is binary.
• Measures can observe success and failure, and have a history of pass/fail.
• When pass or fail rates drift from the standard expected rates, we find outliers.
• Outliers are often bugs.
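The >1.5*IQR rule on this slide is the standard Tukey-fence test: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. A from-scratch sketch with invented latency measures:

```python
# Tukey-fence outlier detection: values more than 1.5*IQR beyond the
# quartiles are flagged. Quartiles use linear interpolation.

def quartiles(values):
    s = sorted(values)
    def percentile(p):
        k = (len(s) - 1) * p
        lo, hi = int(k), min(int(k) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (k - lo)
    return percentile(0.25), percentile(0.75)

def iqr_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

latencies = [101, 99, 103, 98, 102, 100, 250]  # one drifting measure
print(iqr_outliers(latencies))  # [250]
```

In the slide's terms: the 250 ms measure has drifted from the expected rate, so it is an outlier and probably a bug worth investigating.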
47. Rings + Speed + Data = Success
• When speed increases, the need for telemetry increases
• The rings model provides a buffer
@rkjohnston #DataMagic
51. Generic Service Stack
• Service UX Front Door – front-door servers for logging and access management
• Service Auth/Identity – identity or authentication layers
• Layer A vCurrent and Layer B vCurrent – UX rendering layers
• Service Layer C (Persistent Data Store) – persistent data layers
Production traffic follows the default path through the stack.
@rkjohnston #DataMagic
52. Runtime Flags Example 1: Side-by-Side Deployments
• Flags direct traffic through the stack
• Used to test vNext before full release
Production traffic takes the default path (Service UX Front Door → Service Auth/Identity → Layer A vCurrent → Layer B vCurrent → Service Layer C, the persistent data store), while runtime flags fork test traffic through Layer B vNext.
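A runtime flag that forks a slice of traffic to Layer B vNext can be as simple as deterministic hashing on a user id. The function name and flag semantics below are illustrative, not from the deck.

```python
import hashlib

# Hypothetical runtime flag: route a stable percentage of users to the
# vNext build of one layer; everyone else stays on the default path.

def layer_b_version(user_id: str, vnext_percent: int) -> str:
    """Deterministically route a stable slice of users to Layer B vNext."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "layer_b_vnext" if bucket < vnext_percent else "layer_b_vcurrent"

# With the flag at 0%, all traffic takes the default path.
assert layer_b_version("user-42", vnext_percent=0) == "layer_b_vcurrent"
# Raising the flag to 100% forks everyone to vNext without a redeploy.
assert layer_b_version("user-42", vnext_percent=100) == "layer_b_vnext"
```

Hashing the user id (rather than picking randomly per request) keeps each user on a consistent side of the fork, which also makes this usable for A/B testing.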
53. Runtime Flags Example 2: N Test Environments
Production traffic takes the default path (Service UX Front Door → Service Auth/Identity → Layer A vCurrent → Layer B vCurrent → Service Layer C, the persistent data store), while runtime flags route test cases and check-in tests through Layer A and Layer B test paths.
54. Apps as a Service: Facebook
“How Facebook secretly redesigned its iPhone app with your help”
“…a system for creating alternate versions… within the native app. The team could then turn on certain new features for a subset of its users, directly, …a system of ‘different types of Legos… and see the results on the server in real time.’”
From an article on The Verge by Dieter Bohn, September 18, 2013
@rkjohnston #DataMagic
56. Typical Industry Staffing
Data Scientist: statistics, math, programming, modeling, storytelling, data exploration
Shared: visualization, machine learning
Data Engineer: extract/load/transform, data architecture, operations and monitoring, big data infrastructure & storage, DB administration
http://www.datasciencecentral.com/profiles/blogs/difference-between-data-engineers-and-data-scientists
57. Blended Role for Agile
Data Scientist/Data Engineer – one blended role covering visualization, machine learning, extract/load/transform, data architecture, operations and monitoring, big data infrastructure & storage, DB administration, statistics, math, programming, modeling, storytelling, and data exploration.
@rkjohnston #DataMagic
58. Process and Culture Impact Retention
• Kanban for project management
• Balance long- and short-term impact
• Participate in industry papers and reviews
Example Kanban board (columns: Backlog, Doing, Validation, Done) with cards such as: LDA vs. PCA vs. A13 before stratified sampling; MLADS ARPD rehearsal; submit abstract to Strata + Hadoop World; Edge Experiment 1 data processing; Edge Experiment 2; customer sat and post-sales monetization factors analysis; install-base decay-rate estimation using a Bayesian model; Friday review slides for Edge Experiment 1; Edge Experiment 1 insights analysis; top enterprise DSAT list from textual analysis; business entity graph with DUNS, domain names, & tax IDs; open-source entity-graph visualization technology research; submit paper to INFORMS 2016; ARPD V3 model with FFF; MLADS ARPD slides draft 1; Device Lifetime Value (LTV) model 2.
@rkjohnston #DataMagic
59. Trying Again & Again
Advantages and disadvantages of the counting culture
61. The 5 Vs of Big Data
Nine months ago there were only three Vs.
Volume, Variety, Velocity, Verification, Value
Verification – managing data quality and access control at all points
62. Must Count More
• Counting more granular
• Make it go up and to the right
• “Is” vs. “likely”
• Business impact is a given
• Drives behavior (especially if tied to compensation)
63. MVP in a Nutshell
Possible features fall along two axes: Minimum and Viable.
• Minimum + Viable: good features to test the users’ responses.
• Minimum without Viable: bad user experience. Too minimal a set, or the wrong set, of features will not engage users enough to gain valuable insights.
• Viable without Minimum: the product you want to build, but delivering all its features will take too long.
• Neither: wasted work adding features that do not add critical value for winning and retaining customers.
64. Minimum Viable Model (MVM)
Possible data falls along the same Minimum and Viable axes.
• Viable: the model should provide enough coverage that it can be used for core insights. If precision is too low, the model can’t be trusted for even first-level insights.
• Minimum: many models try to include all data and large numbers of attributes, but that slows down innovation. More features can increase complexity without significant improvement in precision and recall.
An ideal MVM uses a modest amount of data, implements a relatively simple initial algorithm, has good precision (we aim for 98% or more), and enough recall to be used for core insights.
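The MVM bar (precision of 98% or more, plus enough recall for core insights) is easy to check from labeled predictions. The fraud-model numbers below are invented purely to show the arithmetic.

```python
# Precision and recall from sets of predicted and actual positives.
# The account counts are made up to illustrate the 98% precision bar.

def precision_recall(predicted: set, actual: set):
    """Precision: flagged items that are right; recall: bad items found."""
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical fraud model: 50 flagged accounts, 49 of them truly bad,
# out of 79 bad accounts overall.
predicted = set(range(50))
actual = set(range(1, 80))
p, r = precision_recall(predicted, actual)
print(round(p, 2), round(r, 2))  # 0.98 0.62 -> meets the precision bar
```

By the MVM framing, this model's 98% precision makes its flags trustworthy, and whether 62% recall is "enough" depends on whether the insights it powers are still representative.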
65. Keep Your Eye on the Target
The goal is not to get a bullseye every time. The goal is to get the data and learn.
67. Six Keys to a “Big” Magic Show
• Try, Try, Try Again
• The Tyranny of Counting
• Magic Tricks (A/B Testing, Runtime Flags)
• The Venue (Big Data Infrastructure)
• Foundation (Tools for Big Data)
• Security (Protection, Privacy, Fraud)
• The Assistant (Recruit, Train, & Retain)
“Big Data” Search Trends
@rkjohnston #DataMagic
68. Big Data: The Magic to Attain New Heights
Ken Johnston, Principal Data Science Manager
Twitter – @rkjohnston
Blog – http://linkedin.com/in/rkjohnston
Email – kenj@Microsoft.com
LinkedIn – http://linkedin.com/in/rkjohnston
@rkjohnston #DataMagic
Editor’s notes
They aren’t afraid to get their hands dirty in the data.
They are uniquely gifted at connecting the dots.
Through data they make original and deep insights.
I have to tell them all the time just how amazing they are.
My son gets this magic kit in a box. Within an hour of playing with it he comes to tell me how he’s going to be a magician and we have to throw a magic show.
I thought I’d use his idea of creating a magic show as a way to talk about the magic of data science.
Presenter guidance:
Share how we think about the data platform in the cloud. Today, we’ll specifically talk about SQL in a VM (briefly), SQL DB, DocumentDB, HBase on HDInsight, and Tables/Blobs. There are lots of other adjacent services such as Redis Cache, Event Hubs, HDInsight, Azure ML, Data Factory, Stream Analytics that will not be addressed in this deck.
Slide talk track:
The top row is Power BI – you’re making decisions based on data
The middle row is ML, Stream Analytics, HDInsight, and Data Factory – processing and making sense of the data
The bottom row is where you ingest and store data.
With Azure, organizations have access to a whole range of services that allow them to use the right tool for the right job when developing applications.
In the cloud, organizations can collect and manage data in the form in which it’s born and store it in the form that best suits an application’s needs.
Clients: Common Library but support multiple OS.
Front End: Telemetry and debug data come through Front End.
PII Scrubbing: Happens at client and again upon ingestion.
Cloud Platform: large scale, many developers, shared structured data. Cloud allows for elastic scaling
APIs and Query Service: Allows access to refined data. Often data is piped to a SQL Server for KPIs and deep analysis
Databases and Reporting Services: Deep analysis is usually done with tools like R Studio and Power Pivot for visualization. Dashboards monitor well known KPIs but are not insights.
Ring 0: Buddy Build – Build may not have been checked in, pass component to buddy developer
Ring 1: My Team – Should pass Unit and check-in tests
Ring 2: Company and NDA – Pushing to these users is based upon quality gates and telemetry measures. Further progression all telemetry based.
Ring 3: External Beta Users – Release based upon telemetry results. Release is metered by % and device models.
Ring 4: Everyone – product is available for general adoption but may still use metered rollout.
Rings 2-4: leverage rolling deployments (small % at a time) with metrics to stop and roll back
• Volume – how much data do you have, and how much do you really need?
• Variety – what data sources do you have, and how can they be combined for more value?
• Velocity – speed of data to insight impacts how you use it
• Verification – managing data quality and access control at all points
• Value – big data can be expensive and must produce valuable insights