This document discusses how big data and data science can be used to attain new heights, likening it to magic. It provides an overview of Ken Johnston's background and experiences in data science. It then discusses six keys to a "big" magic show with big data: trying multiple times, addressing issues with over-counting, experimentation techniques like A/B testing, infrastructure for big data, tools and skills, and security, privacy and fraud protection. The document emphasizes the importance of an assistant to help the data scientist or data engineer with various tasks.
Big Data: The Magic to Attain New Heights
1. Big Data: The Magic to Attain New Heights
Ken Johnston, Principal Data Science Manager
Twitter – @rkjohnston
Blog – http://linkedin.com/in/rkjohnston
Email – kenj@Microsoft.com
LinkedIn – http://linkedin.com/in/rkjohnston
@rkjohnston #DataMagic
2. About Ken
• Data Scientist in the Core Data Science Team
• Office Live, WebApps, Office Online
• Cosmos, AutoPilot, Local, Shopping
• Kanban and Data Science series on LinkedIn
• EaaSy & MVQ – Everything as a Service & Minimum Viable Quality
• Writes books, a blog, and some fiction
16. Six Keys to a “Big” Magic Show
• Try, Try, Try Again
• The Tyranny of Counting
• Magic Tricks (A/B Testing, Runtime Flags)
• The Venue (Big Data Infrastructure)
• Foundation (Tools for Big Data)
• Security (Protection, Privacy, Fraud)
• The Assistant (Recruit, Train, & Retain)
“Big Data” Search Trends
@rkjohnston #DataMagic
18. Common Design Patterns
Good paper to read – IDC: Six Patterns of Big Data and Analytics Adoption: The Importance of the Information Architecture
• Ingest – from services, IoT, and apps; via streams; into storage
• Process – build pipelines; reduce, transform, join; pipe out
• Analyze
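The Ingest → Process → Analyze pattern above can be sketched end to end in a few lines. This is a minimal illustration, not code from the talk; the function names and the toy click events are invented.

```python
# Toy ingest -> process -> analyze pipeline (illustrative names only).

def ingest(events):
    """Ingest: land raw events (from services, IoT, apps) in storage."""
    return [dict(event) for event in events]

def process(storage):
    """Process: reduce/transform the raw events into a refined form."""
    refined = {}
    for event in storage:
        # Reduce: total clicks per user.
        refined[event["user"]] = refined.get(event["user"], 0) + event["clicks"]
    return refined

def analyze(refined):
    """Analyze: derive a simple insight (most active user)."""
    return max(refined, key=refined.get)

raw = [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 5},
       {"user": "a", "clicks": 4}]
print(analyze(process(ingest(raw))))  # "a" (7 clicks beats b's 5)
```

In a real platform each stage is a separate service with durable storage between stages; the function boundaries here only mirror that shape.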
19. Azure Model
Cindy Gross – Technical Fellow: Big Data and Cloud
Twitter: @SQLCindy, cindyg@NealAnalytics.com
Ingest → Process → Analyze
23. Prototypical Big Data Platform
Clients 1–3 send data to a Telemetry Front-End Service. A fast pipeline for high-priority data feeds an Alerting DB and Alerting Dashboard. The main path flows through a Big Data Map/Reduce Cloud, a PII Scrubbing Service, and a Data Extraction Service into Insights DBs 1–N and additional reporting dashboards.
• Personally Identifiable Information (PII) management is very critical.
• Data Driven Quality (DDQ) and big data pipelines will need a cloud platform.
• The superfast pipeline typically (not always) bypasses the cloud and is also void of PII.
Big Data & ML Model Orchestration
@rkjohnston #DataMagic
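The PII Scrubbing Service in the diagram could, at its simplest, mask identifiers in telemetry before records reach the insights stores. This is a hypothetical sketch: the field names and patterns are invented, and a real scrubber would cover far more PII classes than emails and IPs.

```python
import re

# Hypothetical PII scrubber. The deck's notes say scrubbing happens at
# the client and again upon ingestion; this sketch only masks emails
# and IPv4 addresses in string fields.

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def scrub_pii(record: dict) -> dict:
    """Return a copy of a telemetry record with obvious PII masked."""
    clean = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = EMAIL.sub("<email>", value)  # mask emails first
            value = IPV4.sub("<ip>", value)      # then IP addresses
        clean[key] = value
    return clean

event = {"msg": "login from 10.0.0.7 by ken@example.com", "latency_ms": 42}
print(scrub_pii(event)["msg"])  # "login from <ip> by <email>"
```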
24. Prototypical Big Data Platform (continued)
The same platform diagram, with its stages mapped to Ingest → Process → Analyze.
Big Data & ML Model Orchestration
@rkjohnston #DataMagic
25. User Segmentation Approaches
• Risk Tolerance Model
  • Users segment themselves
  • Opt in for greater risk with a reward in mind
• Profile Based
  • Usage behaviors (new vs. power users)
  • Browser type
  • Connection type
  • Device and device OS
@rkjohnston #DataMagic
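The two approaches above can be combined in a single segmentation function. A toy sketch; the segment names, profile fields, and the ten-sessions-per-week threshold are all invented for illustration.

```python
# Hypothetical segmentation combining opt-in (risk tolerance) and
# profile-based signals. All field names and thresholds are invented.

def segment(profile: dict) -> str:
    """Assign a user profile to a coarse segment."""
    if profile.get("opted_into_beta"):
        return "risk_tolerant"            # users segment themselves (opt-in)
    if profile.get("sessions_per_week", 0) >= 10:
        return "power_user"               # profile-based: usage behavior
    return "new_user"

print(segment({"opted_into_beta": True}))   # risk_tolerant
print(segment({"sessions_per_week": 14}))   # power_user
print(segment({"sessions_per_week": 2}))    # new_user
```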
26. Balancing Speed and Risk with Rings
• Ring 0: Buddy Build
• Ring 1: My Team
• Ring 2: Company & NDA
• Ring 3: External Beta Users
• Ring 4: Everyone
A red line demarks disclosure risk and possible loss of patent rights. Risk tolerance is highest in the inner rings; by Ring 4 there is no desire for risk.
@rkjohnston #DataMagic
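Ring progression gated on telemetry, as the editor's notes describe for Rings 2–4, might look like the sketch below. The ring names follow the slide; the crash-rate gate and its threshold are invented for illustration.

```python
# Sketch of a ring-gated rollout: a build only advances to the next
# ring when its telemetry passes a quality gate. The 1% crash-rate
# threshold is a made-up example, not a recommendation from the deck.

RINGS = ["ring0_buddy", "ring1_team", "ring2_company_nda",
         "ring3_external_beta", "ring4_everyone"]

def next_ring(current: str, crash_rate: float,
              max_crash_rate: float = 0.01) -> str:
    """Return the next ring if the quality gate passes, else stay put."""
    i = RINGS.index(current)
    if crash_rate > max_crash_rate or i == len(RINGS) - 1:
        return current            # gate failed, or already fully released
    return RINGS[i + 1]

print(next_ring("ring1_team", crash_rate=0.002))  # advances to ring2
print(next_ring("ring1_team", crash_rate=0.05))   # held at ring1
```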
29. Office 365 Advanced Threat Protection
• A big-data-only solution
• Safe Links is powered by Cloud Exchange & Bing data
• AI model powered by data from thousands of companies and attachments
@rkjohnston #DataMagic
33. App Store Data Architecture
Feeds into Cosmos storage and compute:
• App certification and analysis pipeline
• Store services log and telemetry
• Bing spam and malware
• Windows services safety platform (MSA, SmartScreen, etc.)
• MMPC/SpyNet
Signals used: network IPs, file hashes, PhotoDNA, strings, APIs called, user install data, ratings and reviews, purchases, geographic data, account reputation, bad URLs, botnet-infected clients.
BTW, this was not Big Data.
34. NoName Was Learning Basic DS
“Look at how I did this k-means clustering and found these weird outliers in buying circles from dev accounts created the same week and from the same IP address.”
“Check it out, I found this guy’s FB page. We have his picture!”
NoName and I were spitballing ideas.
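A 1-D k-means like the one described, clustering purchase totals so that a small, far-away cluster stands out, can be written from scratch in a few lines. This is a generic sketch with invented data, not the intern's actual analysis.

```python
# From-scratch 1-D k-means. The purchase totals are invented; the
# small cluster far from the rest plays the role of the "weird
# outliers in buying circles" from the anecdote.

def kmeans_1d(xs, k, iters=20):
    """Plain 1-D k-means: returns final centers and their clusters."""
    step = max(1, len(xs) // k)
    centers = sorted(xs)[::step][:k]          # crude spread-out init
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in xs:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)       # assign to closest center
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]  # recompute means
    return centers, clusters

purchases = [5, 6, 7, 8, 9, 100, 105]
centers, clusters = kmeans_1d(purchases, 2)
print(min(clusters, key=len))  # the suspicious small cluster: [100, 105]
```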
35. Fraud Network Identification
XXXDeveloper created 40 different Store developer accounts and 100s of apps.
Signals linking Bad Dev 1, Bad Dev 2, … Bad Dev ‘N’, and new identities:
• Shared fraudulent payment instruments
• App similarity
• App metadata (URLs, websites)
• Social networks
• 3rd-party app stores
• Developer watering holes
• New identity metadata
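Grouping developer accounts that share a signal such as a payment instrument is a connected-components problem. Below is a sketch using union-find; the account and card identifiers are made up, and a real system would combine many more signals (app similarity, metadata, watering holes) than this single one.

```python
from collections import defaultdict

# Hypothetical fraud-network grouping: link developer accounts that
# share a payment instrument, then return the connected groups.

def fraud_networks(accounts: dict) -> list:
    """accounts maps dev account -> set of payment-instrument ids.
    Returns groups of accounts connected by shared instruments."""
    by_instrument = defaultdict(set)
    for dev, instruments in accounts.items():
        for pi in instruments:
            by_instrument[pi].add(dev)

    parent = {dev: dev for dev in accounts}   # union-find forest
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)

    for devs in by_instrument.values():
        devs = sorted(devs)
        for d in devs[1:]:
            union(devs[0], d)                 # same card => same network

    groups = defaultdict(set)
    for dev in accounts:
        groups[find(dev)].add(dev)
    return [g for g in groups.values() if len(g) > 1]

accounts = {"dev1": {"card_A"}, "dev2": {"card_A", "card_B"},
            "dev3": {"card_B"}, "dev4": {"card_C"}}
print(fraud_networks(accounts))  # one network: dev1, dev2, dev3
```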
41. Code Churn Example 1: Six-Week Coding Milestone
Code churn is cumulative. Imagine this as part of a larger multi-layered project (Layer 1, Layer 2, Layer 3).
• Tightly coupled layers
• Long stabilization phase
• Complicated end-to-end integration
• Sim-ship increases risk
• Maximum point of instability is at the end of the milestone
@rkjohnston #DataMagic
42. Code Churn Example 2 (Continuous Deployment)
Layers 1 through N ship on a rapid release cadence (weekly or daily).
• Risk per release decreases because of more incremental change
• You still must be careful of risk within production, but…
• Total risk over time can be less with incremental change
• Max risk is in production
@rkjohnston #DataMagic
45. Measures = Test Cases
• We do measures: what is a post-release test case?
• Automation validates the golden path; we measure the golden path
• Measures are the same as test cases
• Monitor the golden path
@rkjohnston #DataMagic
46. >1.5*IQR = Outlier = Bug (Probably)
• What is a test case? What I expect to happen vs. what does happen. A test case is binary.
• Measures can observe success and failure, and have a history of pass/fail.
• When pass or fail rates drift from the standard expected rates, we find outliers.
• Outliers are often bugs.
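The >1.5*IQR rule on this slide is the standard Tukey-fence test: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. A from-scratch sketch with invented latency measures:

```python
# Tukey-fence outlier detection: values more than 1.5*IQR beyond the
# quartiles are flagged. Quartiles use linear interpolation.

def quartiles(values):
    s = sorted(values)
    def percentile(p):
        k = (len(s) - 1) * p
        lo, hi = int(k), min(int(k) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (k - lo)
    return percentile(0.25), percentile(0.75)

def iqr_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

latencies = [101, 99, 103, 98, 102, 100, 250]  # one drifting measure
print(iqr_outliers(latencies))  # [250]
```

In the slide's terms: the 250 ms measure has drifted from the expected rate, so it is an outlier and probably a bug worth investigating.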
47. Rings + Speed + Data = Success
• When speed increases, the need for telemetry increases
• The rings model provides a buffer
@rkjohnston #DataMagic
51. Generic Service Stack
• Service UX Front Door – front-door servers for logging and access management
• Service Auth/Identity – identity or authentication layers
• Layer A vCurrent and Layer B vCurrent – UX rendering layers
• Service Layer C (Persistent Data Store) – persistent data layers
Production traffic follows the default path through the stack.
@rkjohnston #DataMagic
52. Runtime Flags Example 1: Side-by-Side Deployments
• Flags direct traffic through the stack
• Used to test vNext before full release
Production traffic takes the default path (Service UX Front Door → Service Auth/Identity → Layer A vCurrent → Layer B vCurrent → Service Layer C, the persistent data store), while runtime flags fork test traffic through Layer B vNext.
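A runtime flag that forks a slice of traffic to Layer B vNext can be as simple as deterministic hashing on a user id. The function name and flag semantics below are illustrative, not from the deck.

```python
import hashlib

# Hypothetical runtime flag: route a stable percentage of users to the
# vNext build of one layer; everyone else stays on the default path.

def layer_b_version(user_id: str, vnext_percent: int) -> str:
    """Deterministically route a stable slice of users to Layer B vNext."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "layer_b_vnext" if bucket < vnext_percent else "layer_b_vcurrent"

# With the flag at 0%, all traffic takes the default path.
assert layer_b_version("user-42", vnext_percent=0) == "layer_b_vcurrent"
# Raising the flag to 100% forks everyone to vNext without a redeploy.
assert layer_b_version("user-42", vnext_percent=100) == "layer_b_vnext"
```

Hashing the user id (rather than picking randomly per request) keeps each user on a consistent side of the fork, which also makes this usable for A/B testing.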
53. Runtime Flags Example 2: N Test Environments
Production traffic takes the default path (Service UX Front Door → Service Auth/Identity → Layer A vCurrent → Layer B vCurrent → Service Layer C, the persistent data store), while runtime flags route test cases and check-in tests through Layer A and Layer B test paths.
54. Apps as a Service: Facebook
“How Facebook secretly redesigned its iPhone app with your help”
“…a system for creating alternate versions… within the native app. The team could then turn on certain new features for a subset of its users, directly, …a system of ‘different types of Legos… and see the results on the server in real time.’”
From an article on The Verge by Dieter Bohn, September 18, 2013
@rkjohnston #DataMagic
56. Typical Industry Staffing
Data Scientist: statistics, math, programming, modeling, storytelling, data exploration
Shared: visualization, machine learning
Data Engineer: extract/load/transform, data architecture, operations and monitoring, big data infrastructure & storage, DB administration
http://www.datasciencecentral.com/profiles/blogs/difference-between-data-engineers-and-data-scientists
57. Blended Role for Agile
Data Scientist/Data Engineer – one blended role covering visualization, machine learning, extract/load/transform, data architecture, operations and monitoring, big data infrastructure & storage, DB administration, statistics, math, programming, modeling, storytelling, and data exploration.
@rkjohnston #DataMagic
58. Process and Culture Impact Retention
• Kanban for project management
• Balance long- and short-term impact
• Participate in industry papers and reviews
Example Kanban board (columns: Backlog, Doing, Validation, Done) with cards such as: LDA vs. PCA vs. A13 before stratified sampling; MLADS ARPD rehearsal; submit abstract to Strata + Hadoop World; Edge Experiment 1 data processing; Edge Experiment 2; customer sat and post-sales monetization factors analysis; install-base decay-rate estimation using a Bayesian model; Friday review slides for Edge Experiment 1; Edge Experiment 1 insights analysis; top enterprise DSAT list from textual analysis; business entity graph with DUNS, domain names, & tax IDs; open-source entity-graph visualization technology research; submit paper to INFORMS 2016; ARPD V3 model with FFF; MLADS ARPD slides draft 1; Device Lifetime Value (LTV) model 2.
@rkjohnston #DataMagic
59. Trying Again & Again
Advantages and disadvantages of the counting culture
61. The 5 Vs of Big Data
Nine months ago there were only three Vs.
Volume, Variety, Velocity, Verification, Value
Verification – managing data quality and access control at all points
62. Must Count More
• Counting more granular
• Make it go up and to the right
• “Is” vs. “likely”
• Business impact is a given
• Drives behavior (especially if tied to compensation)
63. MVP in a Nutshell
Possible features fall along two axes: Minimum and Viable.
• Minimum + Viable: good features to test the users’ responses.
• Minimum without Viable: bad user experience. Too minimal a set, or the wrong set, of features will not engage users enough to gain valuable insights.
• Viable without Minimum: the product you want to build, but delivering all its features will take too long.
• Neither: wasted work adding features that do not add critical value for winning and retaining customers.
64. Minimum Viable Model (MVM)
Possible data falls along the same Minimum and Viable axes.
• Viable: the model should provide enough coverage that it can be used for core insights. If precision is too low, the model can’t be trusted for even first-level insights.
• Minimum: many models try to include all data and large numbers of attributes, but that slows down innovation. More features can increase complexity without significant improvement in precision and recall.
An ideal MVM uses a modest amount of data, implements a relatively simple initial algorithm, has good precision (we aim for 98% or more), and enough recall to be used for core insights.
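The MVM bar (precision of 98% or more, plus enough recall for core insights) is easy to check from labeled predictions. The fraud-model numbers below are invented purely to show the arithmetic.

```python
# Precision and recall from sets of predicted and actual positives.
# The account counts are made up to illustrate the 98% precision bar.

def precision_recall(predicted: set, actual: set):
    """Precision: flagged items that are right; recall: bad items found."""
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical fraud model: 50 flagged accounts, 49 of them truly bad,
# out of 79 bad accounts overall.
predicted = set(range(50))
actual = set(range(1, 80))
p, r = precision_recall(predicted, actual)
print(round(p, 2), round(r, 2))  # 0.98 0.62 -> meets the precision bar
```

By the MVM framing, this model's 98% precision makes its flags trustworthy, and whether 62% recall is "enough" depends on whether the insights it powers are still representative.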
65. Keep Your Eye on the Target
The goal is not to get a bullseye every time. The goal is to get the data and learn.
67. Six Keys to a “Big” Magic Show
• Try, Try, Try Again
• The Tyranny of Counting
• Magic Tricks (A/B Testing, Runtime Flags)
• The Venue (Big Data Infrastructure)
• Foundation (Tools for Big Data)
• Security (Protection, Privacy, Fraud)
• The Assistant (Recruit, Train, & Retain)
“Big Data” Search Trends
@rkjohnston #DataMagic
68. Big Data: The Magic to Attain New Heights
Ken Johnston, Principal Data Science Manager
Twitter – @rkjohnston
Blog – http://linkedin.com/in/rkjohnston
Email – kenj@Microsoft.com
LinkedIn – http://linkedin.com/in/rkjohnston
@rkjohnston #DataMagic
Editor’s notes
They aren’t afraid to get their hands dirty in the data.
They are uniquely gifted at connecting the dots.
Through data they make original and deep insights.
I have to tell them all the time just how amazing they are.
My son gets this magic kit in a box. Within an hour of playing with it he comes to tell me how he’s going to be a magician and we have to throw a magic show.
I thought I’d use his idea of creating a magic show as a way to talk about the magic of data science.
Presenter guidance:
Share how we think about the data platform in the cloud. Today, we’ll specifically talk about SQL in a VM (briefly), SQL DB, DocumentDB, HBase on HDInsight, and Tables/Blobs. There are lots of other adjacent services such as Redis Cache, Event Hubs, HDInsight, Azure ML, Data Factory, Stream Analytics that will not be addressed in this deck.
Slide talk track:
The top row is Power BI – you’re making decisions based on data
The middle row is ML, Stream Analytics, HDInsight, and Data Factory – processing and making sense of the data
The bottom row is where you ingest and store data.
With Azure, organizations have access to a whole range of services that allow them to use the right tool for the right job when developing applications.
In the cloud, organizations can collect and manage data in the form in which it’s born and store it in the form that best suits an application’s needs.
Clients: Common Library but support multiple OS.
Front End: Telemetry and debug data come through Front End.
PII Scrubbing: Happens at client and again upon ingestion.
Cloud Platform: large scale, many developers, shared structured data. Cloud allows for elastic scaling
APIs and Query Service: Allows access to refined data. Often data is piped to a SQL Server for KPIs and deep analysis
Databases and Reporting Services: Deep analysis is usually done with tools like R Studio and Power Pivot for visualization. Dashboards monitor well known KPIs but are not insights.
Ring 0: Buddy Build – Build may not have been checked in, pass component to buddy developer
Ring 1: My Team – Should pass Unit and check-in tests
Ring 2: Company and NDA – Pushing to these users is based upon quality gates and telemetry measures. Further progression all telemetry based.
Ring 3: External Beta Users – Release based upon telemetry results. Release is metered by % and device models.
Ring 4: Everyone – product is available for general adoption but may still use metered rollout.
Rings 2-4: leverage rolling deployments (small % at a time) with metrics to stop and roll back
• Volume – how much data do you have, and how much do you really need?
• Variety – what data sources do you have, and how can they be combined for more value?
• Velocity – speed of data to insight impacts how you use it
• Verification – managing data quality and access control at all points
• Value – big data can be expensive and must produce valuable insights