The document discusses Amazon SageMaker Model Monitor and Debugger for monitoring machine learning models in production. SageMaker Model Monitor collects prediction data from endpoints, creates a baseline, and runs scheduled monitoring jobs to detect deviations from that baseline, writing violation reports to Amazon S3 and emitting metrics to Amazon CloudWatch. SageMaker Debugger helps debug training issues by capturing debug data with no code changes and providing real-time alerts and visualizations in SageMaker Studio. Both services help detect model degradation and trigger corrective actions such as retraining.
1. Waking the Data Scientist @ 2am:
Detect Model Degradation on Production Models
with Amazon SageMaker Endpoints & Model Monitor
Chris Fregly
Developer Advocate @ AWS
AI and Machine Learning https://datascienceonaws.com
github.com/data-science-on-aws
@cfregly
linkedin.com/in/cfregly
2. Who am I?
• Former Netflix, Databricks
• Organizer, Advanced Kubeflow Meetup (global)
• Co-Author @ Data Science on AWS (O'Reilly 2021)
3. Data Science on AWS – Book Outline
https://www.datascienceonaws.com/
5. Amazon SageMaker – re:Invent 2019 announcements
• Amazon SageMaker Studio – first fully integrated development environment (IDE) for machine learning
• Amazon SageMaker Notebooks – enhanced notebook experience with quick-start & easy collaboration
• Amazon SageMaker Debugger – automatic debugging, analysis, and alerting
• Amazon SageMaker Experiments – experiment management system to organize, track, & compare thousands of experiments
• Amazon SageMaker Model Monitor – model monitoring to detect deviation in quality & take corrective actions
• Amazon SageMaker Autopilot – automatic generation of ML models with full visibility & control
6. Amazon SageMaker – focus of this session
Of the re:Invent 2019 announcements above, this session focuses on two services: Amazon SageMaker Debugger (automatic debugging, analysis, and alerting) and Amazon SageMaker Model Monitor (model monitoring to detect deviation in quality & take corrective actions).
8. Challenges with Machine Learning Training
Debugging machine learning training is painful: large neural networks with many layers + many connections + computationally intensive workloads = a 'black box' that is extraordinarily difficult to inspect, debug, and profile.
9. Challenges with Machine Learning Training
Debugging machine learning training is painful: manually printing debug data + manually analyzing the debug data + using open-source tools for charting = valuable data scientist/ML practitioner time wasted.
10. Example Issues While Training ML Models
• Vanishing gradients
• Exploding gradients
• Loss not decreasing across steps
• Weight update ratios that are either too small or too large
• Tensor values that are all zeros
All of these issues impact the learning process. Debugging them is hard, and even harder when running distributed training.
11. An example: vanishing gradients
[Diagram: a fully connected network (input → hidden1 → hidden2 → output) with inputs $x_1, x_2, x_3$ and weight matrices $w^{(1)}, w^{(2)}, w^{(3)}$; backpropagation flows from output back to input.]
Weight update rule: $W_{new} = W - \eta \cdot \nabla_W L$
Intuition: gradients vanish when they take on very small values, so there is almost no weight update during backpropagation.
Why does this happen? An example: consider the sigmoid activation $\sigma(z) = \frac{1}{1 + e^{-z}}$, whose gradient $\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \sigma} \frac{\partial \sigma}{\partial z}$ contains a factor $\frac{\partial \sigma}{\partial z}$ that can be small. By the chain rule,
$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \mathrm{output}} \cdot \frac{\partial \mathrm{output}}{\partial \mathrm{hidden2}} \cdot \frac{\partial \mathrm{hidden2}}{\partial \mathrm{hidden1}} \cdot \frac{\partial \mathrm{hidden1}}{\partial w_1}$$
Each factor can be small, and their product shrinks multiplicatively as the network gets deeper.
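To make the chain-rule intuition concrete, here is a small numpy sketch (the layer counts are arbitrary assumptions) showing how the gradient factor shrinks multiplicatively through sigmoid layers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The sigmoid derivative sigma'(z) = sigma(z) * (1 - sigma(z)) peaks at 0.25 (at z = 0).
best_case_factor = sigmoid(0.0) * (1.0 - sigmoid(0.0))  # 0.25

# Backpropagating through n layers multiplies n such factors together.
for n_layers in [1, 2, 4, 8]:
    print(n_layers, "layers -> gradient factor <=", best_case_factor ** n_layers)
# 8 layers -> gradient factor <= ~1.5e-05: almost no weight update.
```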
12. An Example: XGBoost – Loss Not Decreasing
• Overfitting is a problem with non-linear algorithms such as XGBoost
• By monitoring the loss over the last several steps, training can be stopped early once the loss is no longer decreasing, or is not decreasing at the expected rate (see the sketch below)
• In this example, training could have been stopped somewhere between 20 and 40 epochs
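A minimal sketch of that early-stopping pattern with the open-source xgboost library; the synthetic data, parameters, and round counts below are illustrative assumptions:

```python
import numpy as np
import xgboost as xgb

# Synthetic train/validation split (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)
dtrain = xgb.DMatrix(X[:800], label=y[:800])
dval = xgb.DMatrix(X[800:], label=y[800:])

# Stop once validation loss has not improved for 10 consecutive rounds,
# instead of running all 200 rounds and overfitting.
booster = xgb.train(
    params={"objective": "binary:logistic", "eval_metric": "logloss"},
    dtrain=dtrain,
    num_boost_round=200,
    evals=[(dval, "validation")],
    early_stopping_rounds=10,
)
print("best iteration:", booster.best_iteration)
```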
13. Introducing Amazon SageMaker Debugger
Training data analysis, debugging, & alert generation
• Relevant data capture – debug data is captured with no code changes
• Automatic data analysis – captured data is automatically analyzed
• Automatic error detection – errors are automatically detected and alerts are sent
• Faster training – analyze and debug across distributed clusters
• Amazon SageMaker Studio integration – analyze & debug from Amazon SageMaker Studio
14. How does Amazon SageMaker Debugger Work?
[Architecture: while training is in progress on Amazon SageMaker, debug data is written to the customer's S3 bucket; analysis runs in parallel, surfacing results through an Amazon CloudWatch Event, Amazon SageMaker Studio visualizations, and an Amazon SageMaker notebook.]
• Action → stop the training (triggered via the Amazon CloudWatch Event)
• Action → analyze using the Debugger SDK (from an Amazon SageMaker notebook)
• Action → visualize tensors using charts (in Amazon SageMaker Studio)
• No code change is necessary to emit debug data with built-in algorithms and custom training scripts
• Analysis occurs in real time as data is emitted, making real-time alerts possible
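As a hedged sketch of the "analyze using the Debugger SDK" action, the open-source smdebug library can read the emitted debug data from S3; the path and collection name below are assumptions:

```python
from smdebug.trials import create_trial

# Point a trial at the debug data the training job emitted (path is an assumption).
trial = create_trial("s3://my-bucket/debugger-output")

print(trial.tensor_names())  # every tensor the hook captured
print(trial.steps())         # training steps at which data was saved

# Inspect one gradient tensor across steps, e.g. to eyeball vanishing gradients.
tname = trial.tensor_names(collection="gradients")[0]
for step in trial.steps():
    print(step, trial.tensor(tname).value(step).mean())
```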
15. Add Debugger to a Training Job
• Initialize your hook, which saves tensors to a specified path
• Initialize your rules, which read data for analysis from the path specified in the hook
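A minimal sketch of that hook-plus-rules setup using the SageMaker Python SDK; the S3 path, training script, role, and instance settings are illustrative assumptions:

```python
from sagemaker.debugger import (
    CollectionConfig,
    DebuggerHookConfig,
    Rule,
    rule_configs,
)
from sagemaker.tensorflow import TensorFlow

# Hook: which tensors to save, how often, and where (path is hypothetical).
hook_config = DebuggerHookConfig(
    s3_output_path="s3://my-bucket/debugger-output",
    collection_configs=[
        CollectionConfig(name="gradients", parameters={"save_interval": "100"}),
    ],
)

# Rules: read the saved tensors from that path and analyze them as training runs.
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
]

estimator = TensorFlow(
    entry_point="train.py",  # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.3",
    py_version="py37",
    debugger_hook_config=hook_config,
    rules=rules,
)
estimator.fit("s3://my-bucket/train")  # rule jobs run in parallel with training
```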
17. Deploying a model is not the end.
You need to continuously monitor models in production and iterate.
Concept drift due to divergence of the data + model performance changing due to unknown factors + continuous monitoring requiring a lot of tooling and expense = model monitoring is cumbersome but critical.
18. Introducing Amazon SageMaker Model Monitor
Continuous monitoring of models in production
• Automatic data collection – data collected from endpoints is stored in Amazon S3
• Continuous monitoring – define a monitoring schedule and detect changes in quality against a pre-defined baseline
• CloudWatch integration – metrics emitted to Amazon CloudWatch make it easy to alarm and automate corrective actions
• Visual data analysis – see monitoring results, data statistics, and violation reports in Amazon SageMaker Studio; analyze in notebooks
• Flexibility with rules – use built-in rules to detect data drift or write your own rules for custom analysis
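A hedged end-to-end sketch of these pieces with the SageMaker Python SDK; the bucket paths, role, instance types, and schedule are illustrative assumptions, and `model` is assumed to be a trained SageMaker Model object from an earlier step:

```python
from sagemaker.model_monitor import (
    CronExpressionGenerator,
    DataCaptureConfig,
    DefaultModelMonitor,
)
from sagemaker.model_monitor.dataset_format import DatasetFormat

# 1. Capture requests and predictions from the endpoint into S3.
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://my-bucket/data-capture",  # hypothetical bucket
)
predictor = model.deploy(  # `model` is assumed trained earlier
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    data_capture_config=capture_config,
)

# 2. Baseline: suggested statistics and constraints from the training data.
monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",  # hypothetical dataset
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/baseline",
)

# 3. Scheduled monitoring job comparing captured data against the baseline.
monitor.create_monitoring_schedule(
    monitor_schedule_name="my-monitoring-schedule",
    endpoint_input=predictor.endpoint_name,
    output_s3_uri="s3://my-bucket/monitoring-reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```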
27. Under the hood
1. Amazon SageMaker Model Monitor runs a processing job on your behalf
• On-demand, distributed job
• Fully managed – ideal for data processing and custom analysis
• Pay only for the duration the job runs
2. Analyzes the data collected
• SageMaker provides a pre-built container for analysis
• The pre-built container runs Deequ on Spark
• Custom analysis is also supported
32. Under the hood
1. Amazon SageMaker Model Monitor runs processing jobs on your behalf at the schedule you select (i.e., monitoring jobs)
2. Analyzes the data collected using your choice of analysis container (pre-built or custom)
3. Compares results against the baseline
4. Generates results for each monitoring job (see the sketch below):
• Violations report for each job in Amazon S3
• Statistics report for data collected during the run
• Summary metrics and statistics emitted to Amazon CloudWatch
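A hedged sketch of reading those per-job results back with the SageMaker Python SDK; it assumes the `monitor` object from the earlier schedule setup and that at least one monitoring run has completed and produced a violations report:

```python
# List past monitoring-job executions for this schedule.
executions = monitor.list_executions()
latest = executions[-1]
print(latest.describe()["ProcessingJobStatus"])

# Constraint violations found by the most recent monitoring job.
violations = monitor.latest_monitoring_constraint_violations()
for v in violations.body_dict.get("violations", []):
    print(v["feature_name"], "->", v["constraint_check_type"])
```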
34. 5. View monitoring results
[Architecture: an Amazon SageMaker training job produces the model, and a baseline processing job produces baseline statistics and constraints; the model is deployed to an Amazon SageMaker endpoint, which serves applications and captures requests and predictions; a scheduled monitoring job compares the captured data against the baseline statistics and constraints, producing results (statistics and violations) and Amazon CloudWatch metrics.]
35. 6. Get alerted and take corrective actions
[Same architecture as the previous slide, extended with alerting: the Amazon CloudWatch metrics drive notifications and analysis of results, which in turn trigger corrective actions.]
Corrective actions include:
• Model updates
• Training data updates
• Retraining
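A hedged boto3 sketch of wiring up such an alert; the alarm name, threshold, dimensions, and SNS topic are assumptions to adapt to what your monitoring schedule actually emits (Model Monitor publishes metrics to the aws/sagemaker/Endpoints/data-metrics namespace):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when a drift metric emitted by the monitoring schedule crosses a threshold.
cloudwatch.put_metric_alarm(
    AlarmName="model-monitor-feature-drift",         # hypothetical name
    Namespace="aws/sagemaker/Endpoints/data-metrics",
    MetricName="feature_baseline_drift_my_feature",  # assumed metric name
    Dimensions=[
        {"Name": "Endpoint", "Value": "my-endpoint"},  # hypothetical endpoint
        {"Name": "MonitoringSchedule", "Value": "my-monitoring-schedule"},
    ],
    Statistic="Average",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=0.2,  # assumed drift threshold
    ComparisonOperator="GreaterThanThreshold",
    # Page the on-call data scientist (at 2am) and/or kick off retraining.
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alert-data-scientist"],
)
```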
39. Thank You!
Waking the Data Scientist @ 2am:
Detect Model Degradation on Production
Models with Amazon SageMaker Endpoints &
Model Monitor
Chris Fregly
Developer Advocate @ AWS
AI and Machine Learning https://datascienceonaws.com
github.com/data-science-on-aws
@cfregly
linkedin.com/in/cfregly