The document discusses Amazon SageMaker Model Monitor and Debugger for monitoring machine learning models in production. SageMaker Model Monitor collects prediction data from endpoints, creates a baseline, and runs scheduled monitoring jobs to detect deviations from that baseline, writing violation reports to Amazon S3 and emitting metrics to Amazon CloudWatch. SageMaker Debugger helps debug training issues by capturing debug data with no code changes and providing real-time alerts and visualizations in SageMaker Studio. Both services help detect model degradation and trigger corrective actions such as retraining.
1. Waking the Data Scientist @ 2am:
Detect Model Degradation on Production Models
with Amazon SageMaker Endpoints & Model Monitor
Chris Fregly
Developer Advocate @ AWS
AI and Machine Learning https://datascienceonaws.com
github.com/data-science-on-aws
@cfregly
linkedin.com/in/cfregly
2. Who am I?
• Former Netflix, Databricks
• Organizer, Advanced Kubeflow Meetup (global)
• Co-Author @ Data Science on AWS (O'Reilly 2021)
3. Data Science on AWS – Book Outline
https://www.datascienceonaws.com/
5. Amazon SageMaker – re:Invent 2019 announcements
• Amazon SageMaker Studio – first fully integrated development environment (IDE) for machine learning
• Amazon SageMaker Notebooks – enhanced notebook experience with quick-start & easy collaboration
• Amazon SageMaker Debugger – automatic debugging, analysis, and alerting
• Amazon SageMaker Experiments – experiment management system to organize, track, & compare thousands of experiments
• Amazon SageMaker Model Monitor – model monitoring to detect deviation in quality & take corrective actions
• Amazon SageMaker Autopilot – automatic generation of ML models with full visibility & control
6. Amazon SageMaker – focus of this session
Of the re:Invent 2019 announcements above, this session focuses on two services: Amazon SageMaker Debugger (automatic debugging, analysis, and alerting) and Amazon SageMaker Model Monitor (model monitoring to detect deviation in quality & take corrective actions).
8. Challenges with Machine Learning Training
Debugging machine learning training is painful: large neural networks with many layers + many connections + computationally intensive workloads = a 'black box' that is extraordinarily difficult to inspect, debug, and profile.
9. Challenges with Machine Learning Training
Debugging machine learning training is painful: manually printing debug data + manually analyzing the debug data + using open-source tools for charting = valuable data scientist/ML practitioner time wasted.
10. Example Issues While Training ML Models
• Vanishing gradients
• Exploding gradients
• Loss not decreasing across steps
• Weight update ratios that are either too small or too large
• Tensor values that are all zeros
All of these issues impact the learning process. Debugging them is hard, and even harder when running distributed training.
11. An example: vanishing gradients
[Diagram: a fully connected network (input → hidden1 → hidden2 → output) with inputs $x_1, x_2, x_3$ and weight matrices $w^{(1)}, w^{(2)}, w^{(3)}$; backpropagation flows from output back to input.]
Weight update rule: $W_{new} = W - \eta \cdot \nabla_W L$
Intuition: gradients vanish when they take on very small values, so there is almost no weight update during backpropagation.
Why does this happen? An example: consider the sigmoid activation $\sigma(z) = \frac{1}{1 + e^{-z}}$, whose gradient $\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \sigma} \frac{\partial \sigma}{\partial z}$ contains a factor $\frac{\partial \sigma}{\partial z}$ that can be small. By the chain rule,
$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \mathrm{output}} \cdot \frac{\partial \mathrm{output}}{\partial \mathrm{hidden2}} \cdot \frac{\partial \mathrm{hidden2}}{\partial \mathrm{hidden1}} \cdot \frac{\partial \mathrm{hidden1}}{\partial w_1}$$
Each factor can be small, and their product shrinks multiplicatively as the network gets deeper.
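To make the chain-rule intuition concrete, here is a small numpy sketch (the layer counts are arbitrary assumptions) showing how the gradient factor shrinks multiplicatively through sigmoid layers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The sigmoid derivative sigma'(z) = sigma(z) * (1 - sigma(z)) peaks at 0.25 (at z = 0).
best_case_factor = sigmoid(0.0) * (1.0 - sigmoid(0.0))  # 0.25

# Backpropagating through n layers multiplies n such factors together.
for n_layers in [1, 2, 4, 8]:
    print(n_layers, "layers -> gradient factor <=", best_case_factor ** n_layers)
# 8 layers -> gradient factor <= ~1.5e-05: almost no weight update.
```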
12. An Example: XGBoost – Loss Not Decreasing
• Overfitting is a problem with non-linear algorithms such as XGBoost
• By monitoring the loss over the last several steps, training can be stopped early once the loss is no longer decreasing, or is not decreasing at the expected rate (see the sketch below)
• In this example, training could have been stopped somewhere between 20 and 40 epochs
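A minimal sketch of that early-stopping pattern with the open-source xgboost library; the synthetic data, parameters, and round counts below are illustrative assumptions:

```python
import numpy as np
import xgboost as xgb

# Synthetic train/validation split (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)
dtrain = xgb.DMatrix(X[:800], label=y[:800])
dval = xgb.DMatrix(X[800:], label=y[800:])

# Stop once validation loss has not improved for 10 consecutive rounds,
# instead of running all 200 rounds and overfitting.
booster = xgb.train(
    params={"objective": "binary:logistic", "eval_metric": "logloss"},
    dtrain=dtrain,
    num_boost_round=200,
    evals=[(dval, "validation")],
    early_stopping_rounds=10,
)
print("best iteration:", booster.best_iteration)
```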
13. Introducing Amazon SageMaker Debugger
Training data analysis, debugging, & alert generation
• Relevant data capture – debug data is captured with no code changes
• Automatic data analysis – captured data is automatically analyzed
• Automatic error detection – errors are automatically detected and alerts are sent
• Faster training – analyze and debug across distributed clusters
• Amazon SageMaker Studio integration – analyze & debug from Amazon SageMaker Studio
14. How does Amazon SageMaker Debugger Work?
[Architecture: while training is in progress on Amazon SageMaker, debug data is written to the customer's S3 bucket; analysis runs in parallel, surfacing results through an Amazon CloudWatch Event, Amazon SageMaker Studio visualizations, and an Amazon SageMaker notebook.]
• Action → stop the training (triggered via the Amazon CloudWatch Event)
• Action → analyze using the Debugger SDK (from an Amazon SageMaker notebook)
• Action → visualize tensors using charts (in Amazon SageMaker Studio)
• No code change is necessary to emit debug data with built-in algorithms and custom training scripts
• Analysis occurs in real time as data is emitted, making real-time alerts possible
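As a hedged sketch of the "analyze using the Debugger SDK" action, the open-source smdebug library can read the emitted debug data from S3; the path and collection name below are assumptions:

```python
from smdebug.trials import create_trial

# Point a trial at the debug data the training job emitted (path is an assumption).
trial = create_trial("s3://my-bucket/debugger-output")

print(trial.tensor_names())  # every tensor the hook captured
print(trial.steps())         # training steps at which data was saved

# Inspect one gradient tensor across steps, e.g. to eyeball vanishing gradients.
tname = trial.tensor_names(collection="gradients")[0]
for step in trial.steps():
    print(step, trial.tensor(tname).value(step).mean())
```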
15. Add Debugger to a Training Job
• Initialize your hook, which saves tensors to a specified path
• Initialize your rules, which read data for analysis from the path specified in the hook
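A minimal sketch of that hook-plus-rules setup using the SageMaker Python SDK; the S3 path, training script, role, and instance settings are illustrative assumptions:

```python
from sagemaker.debugger import (
    CollectionConfig,
    DebuggerHookConfig,
    Rule,
    rule_configs,
)
from sagemaker.tensorflow import TensorFlow

# Hook: which tensors to save, how often, and where (path is hypothetical).
hook_config = DebuggerHookConfig(
    s3_output_path="s3://my-bucket/debugger-output",
    collection_configs=[
        CollectionConfig(name="gradients", parameters={"save_interval": "100"}),
    ],
)

# Rules: read the saved tensors from that path and analyze them as training runs.
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
]

estimator = TensorFlow(
    entry_point="train.py",  # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.3",
    py_version="py37",
    debugger_hook_config=hook_config,
    rules=rules,
)
estimator.fit("s3://my-bucket/train")  # rule jobs run in parallel with training
```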
17. Deploying a model is not the end.
You need to continuously monitor models in production and iterate.
Concept drift due to divergence of the data + model performance changing due to unknown factors + continuous monitoring requiring a lot of tooling and expense = model monitoring is cumbersome but critical.
18. Introducing Amazon SageMaker Model Monitor
Continuous monitoring of models in production
• Automatic data collection – data collected from endpoints is stored in Amazon S3
• Continuous monitoring – define a monitoring schedule and detect changes in quality against a pre-defined baseline
• CloudWatch integration – metrics emitted to Amazon CloudWatch make it easy to alarm and automate corrective actions
• Visual data analysis – see monitoring results, data statistics, and violation reports in Amazon SageMaker Studio; analyze in notebooks
• Flexibility with rules – use built-in rules to detect data drift or write your own rules for custom analysis
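A hedged end-to-end sketch of these pieces with the SageMaker Python SDK; the bucket paths, role, instance types, and schedule are illustrative assumptions, and `model` is assumed to be a trained SageMaker Model object from an earlier step:

```python
from sagemaker.model_monitor import (
    CronExpressionGenerator,
    DataCaptureConfig,
    DefaultModelMonitor,
)
from sagemaker.model_monitor.dataset_format import DatasetFormat

# 1. Capture requests and predictions from the endpoint into S3.
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://my-bucket/data-capture",  # hypothetical bucket
)
predictor = model.deploy(  # `model` is assumed trained earlier
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    data_capture_config=capture_config,
)

# 2. Baseline: suggested statistics and constraints from the training data.
monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",  # hypothetical dataset
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/baseline",
)

# 3. Scheduled monitoring job comparing captured data against the baseline.
monitor.create_monitoring_schedule(
    monitor_schedule_name="my-monitoring-schedule",
    endpoint_input=predictor.endpoint_name,
    output_s3_uri="s3://my-bucket/monitoring-reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```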
27. Under the hood
1. Amazon SageMaker Model Monitor runs a processing job on your behalf
• On-demand, distributed job
• Fully managed – ideal for data processing and custom analysis
• Pay only for the duration the job runs
2. Analyzes the data collected
• SageMaker provides a pre-built container for analysis
• The pre-built container runs Deequ on Spark
• Custom analysis is also supported
32. Under the hood
1. Amazon SageMaker Model Monitor runs processing jobs on your behalf at the schedule you select (i.e., monitoring jobs)
2. Analyzes the data collected using your choice of analysis container (pre-built or custom)
3. Compares results against the baseline
4. Generates results for each monitoring job (see the sketch below):
• Violations report for each job in Amazon S3
• Statistics report for data collected during the run
• Summary metrics and statistics emitted to Amazon CloudWatch
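A hedged sketch of reading those per-job results back with the SageMaker Python SDK; it assumes the `monitor` object from the earlier schedule setup and that at least one monitoring run has completed and produced a violations report:

```python
# List past monitoring-job executions for this schedule.
executions = monitor.list_executions()
latest = executions[-1]
print(latest.describe()["ProcessingJobStatus"])

# Constraint violations found by the most recent monitoring job.
violations = monitor.latest_monitoring_constraint_violations()
for v in violations.body_dict.get("violations", []):
    print(v["feature_name"], "->", v["constraint_check_type"])
```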
34. 5. View monitoring results
[Architecture: an Amazon SageMaker training job produces the model, and a baseline processing job produces baseline statistics and constraints; the model is deployed to an Amazon SageMaker endpoint, which serves applications and captures requests and predictions; a scheduled monitoring job compares the captured data against the baseline statistics and constraints, producing results (statistics and violations) and Amazon CloudWatch metrics.]
35. 6. Get alerted and take corrective actions
[Same architecture as the previous slide, extended with alerting: the Amazon CloudWatch metrics drive notifications and analysis of results, which in turn trigger corrective actions.]
Corrective actions include:
• Model updates
• Training data updates
• Retraining
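A hedged boto3 sketch of wiring up such an alert; the alarm name, threshold, dimensions, and SNS topic are assumptions to adapt to what your monitoring schedule actually emits (Model Monitor publishes metrics to the aws/sagemaker/Endpoints/data-metrics namespace):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when a drift metric emitted by the monitoring schedule crosses a threshold.
cloudwatch.put_metric_alarm(
    AlarmName="model-monitor-feature-drift",         # hypothetical name
    Namespace="aws/sagemaker/Endpoints/data-metrics",
    MetricName="feature_baseline_drift_my_feature",  # assumed metric name
    Dimensions=[
        {"Name": "Endpoint", "Value": "my-endpoint"},  # hypothetical endpoint
        {"Name": "MonitoringSchedule", "Value": "my-monitoring-schedule"},
    ],
    Statistic="Average",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=0.2,  # assumed drift threshold
    ComparisonOperator="GreaterThanThreshold",
    # Page the on-call data scientist (at 2am) and/or kick off retraining.
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alert-data-scientist"],
)
```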
39. Thank You!
Waking the Data Scientist @ 2am:
Detect Model Degradation on Production
Models with Amazon SageMaker Endpoints &
Model Monitor
Chris Fregly
Developer Advocate @ AWS
AI and Machine Learning https://datascienceonaws.com
github.com/data-science-on-aws
@cfregly
linkedin.com/in/cfregly