MONITORING OF GPU USAGE
WITH TENSORFLOW MODEL TRAINING USING PROMETHEUS
Diane Feddema, Principal Software Engineer
Zak Hassan, Senior Software Engineer
#RED_HAT #AICOE #CTO_OFFICE
YOUR SPEAKERS
DIANE FEDDEMA
PRINCIPAL SOFTWARE ENGINEER - ARTIFICIAL INTELLIGENCE CENTER OF EXCELLENCE, CTO OFFICE
● Currently focused on developing and applying Data Science and Machine Learning techniques for performance
analysis, automating these analyses and displaying data in novel ways.
● Previously worked as a performance engineer at the National Center for Atmospheric Research, NCAR, working on
optimizations and tuning in parallel global climate models.
ZAK HASSAN
SENIOR SOFTWARE ENGINEER - ARTIFICIAL INTELLIGENCE CENTER OF EXCELLENCE, CTO OFFICE
● Leading the log anomaly detection project within the aiops team and building a user feedback service for improved
accuracy of machine learning predictions.
● Developing data science apps and working on improved observability of machine learning systems such as spark and
tensorflow.
Outline
● Story
● Concepts
○ Comparing CPU vs GPU
○ What is CUDA and anatomy of CUDA on Kubernetes
○ Monitoring GPU and custom metrics with Pushgateway
○ TF with Prometheus integration
○ What is TensorFlow and PyTorch
○ A PyTorch example from MLPerf
○ TensorFlow tracing
● Examples:
○ Running Jupyter (CPU, GPU, targeting a specific GPU type)
○ Mounting training data into notebook/TF job
○ Uses of nvidia-smi
● Demo
○ Running Detectron on a Tesla V100 with Prometheus & Grafana
monitoring
“Design the factory like you
would design an advanced
computer… In fact use
engineers that are used to doing
that and have them work on
this.”
-- Elon Musk (2016)
https://youtu.be/f9uveu-c5us
Source: https://flic.kr/p/chEftd
WHY IS DEEP LEARNING A BIG DEAL?
Mobile
• unlocking phones
Online
• Netflix.com
• Amazon.com
• Targeted ads
Automotive
• self driving
• voice assistant
Source: https://bit.ly/2I8zIcs
Source: https://bit.ly/2HVCaUC
PARALLEL PROCESSING
● MOST LANGUAGES AND MODERN HARDWARE SUPPORT EXECUTION OF PARALLEL PROCESSES/THREADS AND HAVE APIS TO SPAWN PROCESSES IN PARALLEL
● YOUR ONLY LIMIT IS HOW MANY CPU CORES YOU HAVE ON YOUR MACHINE
● THE CPU USED TO BE A KEY COMPONENT OF HPC
● A GPU HAS A DIFFERENT ARCHITECTURE AND NUMBER OF CORES
CPU
INSTRUCTION
MEMORY
DATA
MEMORY
Input/Output
ARITHMETIC
LOGIC UNIT
CONTROL
UNIT
Project Thoth
Hardware accelerators
● GPU
○ CUDA
○ OpenCL
● TPU
Performance Goals
Latency
Decreased
Bandwidth
Increased
Throughput
Increased
WHAT IS CUDA?
PROPRIETARY TOOLING
● Hardware/software platform for HPC
● Prerequisite: an NVIDIA graphics card that supports CUDA
● ML frameworks such as TensorFlow, Theano, and PyTorch use CUDA for hardware acceleration
● Machine learning jobs may run around 10x faster with CUDA
ANATOMY OF A CUDA
WORKLOAD ON K8S
TENSORFLOW
CUDA LIBS
CONTAINER RUNTIME
NVIDIA LIBS
HOST OS
SERVER
/dev/nvidiaX
GPU
CONTAINER
HARDWARE
JUPYTER
CLI monitoring tool
nvidia-smi
● Tool that displays usage metrics for what is running on your GPU.
TFJob + Prometheus
[Diagram: TensorFlow jobs, training data, and a GPU node exporter; metrics are PUSHed to the Pushgateway and PULLed by Prometheus; Alertmanager sends notifications by email, messaging, or webhook]
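The push path in this diagram can be sketched with the Python standard library alone. This is a minimal sketch, not the speakers' implementation: the metric names `tf_training_loss` and `tf_training_step` and the helper names are hypothetical, and the gateway address passed to `push_metrics` is whatever your cluster exposes. It relies on the Pushgateway accepting metrics in the Prometheus text format at `/metrics/job/<job>`.

```python
import urllib.request


def render_metrics(metrics):
    """Render {name: (help_text, value)} in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    return "\n".join(lines) + "\n"


def push_metrics(gateway_url, job, metrics, timeout=5):
    """PUT the rendered metrics to <gateway_url>/metrics/job/<job>."""
    request = urllib.request.Request(
        url=f"{gateway_url.rstrip('/')}/metrics/job/{job}",
        data=render_metrics(metrics).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical custom metrics from one training step.
example = {
    "tf_training_loss": ("Loss at the latest training step", 0.25),
    "tf_training_step": ("Latest completed training step", 100),
}
print(render_metrics(example), end="")
```

A training loop would call `push_metrics` with the gateway address every few steps; Prometheus then scrapes the Pushgateway on its normal schedule.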
Idle GPU Alert
● Alertmanager can notify via:
○ Slack chat notification
○ email
○ webhook
○ more
● Get notified when your GPU isn’t being utilized and shut down your VMs in the cloud to save on cost.
groups:
- name: nvidia_gpu.rules
  rules:
  - alert: UnusedResources
    expr: nvidia_gpu_duty_cycle == 0
    for: 10m
    labels:
      severity: critical
    annotations:
      description: GPU is not being utilized you should scale down your gpu node
      summary: GPU Node isn't being utilized
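The `for: 10m` clause means the expression has to stay true for ten minutes before the alert fires; a single idle sample is not enough. The sketch below approximates that behaviour on a list of samples. It is an illustration only: the function name and sample layout are hypothetical, and Prometheus itself evaluates rules at its configured evaluation interval rather than per sample.

```python
def idle_alert_fires(samples, hold_seconds=600):
    """Return True if the duty cycle has been 0 for at least hold_seconds.

    samples: list of (unix_seconds, duty_cycle) tuples sorted by time.
    """
    idle_since = None
    for timestamp, duty_cycle in samples:
        if duty_cycle == 0:
            if idle_since is None:
                idle_since = timestamp  # start of the current idle streak
        else:
            idle_since = None  # any activity resets the streak
    if idle_since is None:
        return False
    return samples[-1][0] - idle_since >= hold_seconds


# One sample per minute: idle the whole time vs. a burst of work at t=300.
always_idle = [(t, 0) for t in range(0, 601, 60)]
briefly_busy = [(t, 50 if t == 300 else 0) for t in range(0, 601, 60)]
print(idle_alert_fires(always_idle))
print(idle_alert_fires(briefly_busy))
```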
Alert On Idle GPU
CPU vs GPU
Jupyter + TF on CPU
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-tf-cpu
spec:
  restartPolicy: OnFailure
  containers:
  - name: jupyter-tf-cpu
    image: "quay.io/zmhassan/fedora28:tensorflow-cpu-2.0.0-alpha0"
Jupyter + TF on GPU
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-tf-gpu
spec:
  restartPolicy: OnFailure
  containers:
  - name: jupyter-tf-gpu
    image: "tensorflow/tensorflow:nightly-gpu-py3-jupyter"
    resources:
      limits:
        nvidia.com/gpu: 1
Specific GPU Node Target
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-tf-gpu
spec:
  containers:
  - name: jupyter-tf-gpu
    image: "tensorflow/tensorflow:nightly-gpu-py3-jupyter"
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-v100
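The GPU pod specs above differ only in the GPU limit and the optional node selector, so they can be generated from one function. This is a sketch under the assumption that you apply the result with `kubectl apply -f`, which accepts JSON as well as YAML; the helper name `gpu_pod_manifest` is hypothetical.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, accelerator=None):
    """Build a Pod manifest that requests GPUs and optionally pins a GPU type."""
    container = {
        "name": name,
        "image": image,
        "resources": {"limits": {"nvidia.com/gpu": gpus}},
    }
    spec = {"restartPolicy": "OnFailure", "containers": [container]}
    if accelerator is not None:
        # Matches the node label set with `kubectl label node ... accelerator=...`
        spec["nodeSelector"] = {"accelerator": accelerator}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


manifest = gpu_pod_manifest(
    "jupyter-tf-gpu",
    "tensorflow/tensorflow:nightly-gpu-py3-jupyter",
    accelerator="nvidia-tesla-v100",
)
print(json.dumps(manifest, indent=2))
```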
Relabel Kubernetes node
kubectl label node <node_name> accelerator=nvidia-tesla-k80
# or
kubectl label node <node_name> accelerator=nvidia-tesla-v100
Mount Training Data
AzureDisk
GlusterFS
NFS
AzureFile
GCE Persistent Disk
AWS Elastic Block Store
CephFS
… more
Persistent Volume Claim
● Native k8s resource
● Lets you access a PersistentVolume (PV)
● Can be used to share data across different pods
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 100Gi
Persistent Volume
● Native k8s resource
● Access mode can be ReadOnlyMany, ReadWriteOnce, or ReadWriteMany
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteMany
  nfs:
    server: 0.0.0.0
    path: "/"
Mounting Training Data
● Use persistent volume claims to access your data
● This example uses NFS, but you can choose another type
apiVersion: v1
kind: Pod
metadata:
  name: jp-notebook
spec:
  containers:
  - name: jp-notebook
    image: tensorflow/tensorflow:nightly-gpu-py3-jupyter
    volumeMounts:
    - name: my-pvc-nfs
      mountPath: "/tf/data"
  volumes:
  - name: my-pvc-nfs
    persistentVolumeClaim:
      claimName: nfs
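Once the pod is running, it helps to confirm from inside the notebook that the volume is actually mounted and populated before starting a long training run. A minimal sketch, assuming the `/tf/data` mount path from the pod spec above; the helper name is hypothetical.

```python
import os


def summarize_data_dir(path):
    """Return (file_count, total_bytes) for all files under path."""
    if not os.path.isdir(path):
        raise FileNotFoundError(f"{path} is not mounted or not a directory")
    count = 0
    total = 0
    for root, _dirs, files in os.walk(path):
        for filename in files:
            count += 1
            total += os.path.getsize(os.path.join(root, filename))
    return count, total
```

In a notebook cell, `summarize_data_dir("/tf/data")` either returns the file count and size or raises if the claim was not bound.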
Additional Tips
● Kubernetes doesn’t support sharing GPUs
● If you’re running in the cloud, consider stopping your VM when no workloads are running and restarting it when you need it. The costs can add up.
● Use volumes to mount your training data and share it across your environment
Monitoring and Performance
of ML on GPUs
● Benchmarking ML on GPUs
○ Monitoring
○ Performance
● Example using MLperf together with Prometheus
and Grafana
● Computing requirements & why GPUs for ML
Why do we need GPUs to solve these problems?
● Neural networks rely heavily on floating-point matrix multiplication
● These algorithms also require a lot of training data, large memory (GBs), and high-speed networks to complete in a reasonable amount of time
● Faster deep learning training
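To put a number on the matrix-multiplication cost, the sketch below uses the conventional estimate of one multiply and one add per term, so roughly 2·m·k·n floating-point operations for an (m × k) by (k × n) product. The function names are hypothetical, and the 5000 × 5000 shape is the one used in the tracing example later in this deck.

```python
def matmul_flops(m, k, n):
    """Approximate floating-point operations for an (m x k) @ (k x n) product."""
    return 2 * m * k * n


def matmul_bytes(m, k, n, bytes_per_value=4):
    """Memory needed to hold both inputs and the output (float32 by default)."""
    return bytes_per_value * (m * k + k * n + m * n)


flops = matmul_flops(5000, 5000, 5000)
memory = matmul_bytes(5000, 5000, 5000)
print(f"{flops:.2e} floating-point operations")
print(f"{memory / 1e6:.0f} MB for inputs and output")
```

A single product at this size is already hundreds of billions of operations, which is the kind of workload that benefits from thousands of GPU cores working in parallel.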
Nvidia DGX-2
[Figure: 16 V100 GPUs, each paired with DRAM]
Source: Nvidia
Benchmarks in MLPerf
Application Area | Problem | Dataset | Models | Metrics
Vision | Image classification | ImageNet | ResNet-50 | Prediction accuracy
Vision | Object detection (lightweight and heavyweight) | COCO | Detectron | COCO mAP
Language | Translation | WMT English-German | Transformer, OpenNMT | BLEU
Commerce | Recommendations | MovieLens-20M | Neural Collaborative Filtering | Prediction accuracy
Reinforcement Learning | Games (Go) | Go | Mini Go | Prediction accuracy, Win/Loss
MLPerf Project Sponsors
University research contributors
Industry contributors
What is TensorFlow?
● Open source Python library used to implement deep neural networks (released by Google in 2015)
● A machine learning framework
● Tools to write your own models in Python, JavaScript, or Swift
● Collection of datasets ready to use with TensorFlow
● TF runs in eager and graph mode
● TF can run on CPUs or GPUs
What is PyTorch?
● Python-based open source deep learning library
● Used to build Neural Networks
● Replacement for NumPy for use with GPUs
● Can run on CPUs or GPUs
● Uses GPUs to accelerate numerical computations
● Pytorch performs computations
COCO Dataset
85,000 images
Identify 91 objects
Source: Cornell Project
Detectron - Example Output
MLPerf Results
Source: Nvidia Developer News Dec 2018
MLPerf Results - Single Node
Source: Nvidia Developer News Dec 2018
How to monitor GPUs with nvidia-smi
$ nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5
Monitoring GPUs with nvidia-smi
Output of the same query, sampled every 5 seconds (-l 5):
2019/04/17 14:41:35.223, Tesla V100-SXM2-32GB, 00000000:06:00.0, 418.40.04, P0, 3, 3, 44, 100 %, 0 %, 32480 MiB, 24052 MiB, 8428 MiB
2019/04/17 14:41:35.225, Tesla V100-SXM2-32GB, 00000000:07:00.0, 418.40.04, P0, 3, 3, 48, 100 %, 0 %, 32480 MiB, 14565 MiB, 17915 MiB
2019/04/17 14:41:35.227, Tesla V100-SXM2-32GB, 00000000:0A:00.0, 418.40.04, P0, 3, 3, 47, 100 %, 0 %, 32480 MiB, 15773 MiB, 16707 MiB
2019/04/17 14:41:35.229, Tesla V100-SXM2-32GB, 00000000:0B:00.0, 418.40.04, P0, 3, 3, 43, 100 %, 0 %, 32480 MiB, 14363 MiB, 18117 MiB
2019/04/17 14:41:35.231, Tesla V100-SXM2-32GB, 00000000:85:00.0, 418.40.04, P0, 3, 3, 46, 100 %, 0 %, 32480 MiB, 13363 MiB, 19117 MiB
2019/04/17 14:41:35.233, Tesla V100-SXM2-32GB, 00000000:86:00.0, 418.40.04, P0, 3, 3, 46, 100 %, 0 %, 32480 MiB, 14719 MiB, 17761 MiB
2019/04/17 14:41:35.234, Tesla V100-SXM2-32GB, 00000000:89:00.0, 418.40.04, P0, 3, 3, 49, 100 %, 0 %, 32480 MiB, 15861 MiB, 16619 MiB
2019/04/17 14:41:35.236, Tesla V100-SXM2-32GB, 00000000:8A:00.0, 418.40.04, P0, 3, 3, 44, 100 %, 0 %, 32480 MiB, 12317 MiB, 20163 MiB
2019/04/17 14:41:40.239, Tesla V100-SXM2-32GB, 00000000:06:00.0, 418.40.04, P0, 3, 3, 44, 100 %, 0 %, 32480 MiB, 24052 MiB, 8428 MiB
2019/04/17 14:41:40.240, Tesla V100-SXM2-32GB, 00000000:07:00.0, 418.40.04, P0, 3, 3, 48, 100 %, 1 %, 32480 MiB, 14565 MiB, 17915 MiB
2019/04/17 14:41:40.240, Tesla V100-SXM2-32GB, 00000000:0A:00.0, 418.40.04, P0, 3, 3, 47, 100 %, 1 %, 32480 MiB, 15773 MiB, 16707 MiB
2019/04/17 14:41:40.241, Tesla V100-SXM2-32GB, 00000000:0B:00.0, 418.40.04, P0, 3, 3, 43, 100 %, 1 %, 32480 MiB, 14363 MiB, 18117 MiB
Columns, in order: timestamp, name, pci.bus_id, driver_version, pstate, pcie.link.gen.max, pcie.link.gen.current, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
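The CSV rows above are easy to turn into structured records for further processing. The sketch below parses two of the rows shown on this slide using only the standard library; the field list mirrors the `--query-gpu` arguments, and the helper names `parse_rows` and `number` are hypothetical.

```python
import csv
import io

# Two rows copied from the nvidia-smi output above.
SAMPLE = """\
2019/04/17 14:41:35.223, Tesla V100-SXM2-32GB, 00000000:06:00.0, 418.40.04, P0, 3, 3, 44, 100 %, 0 %, 32480 MiB, 24052 MiB, 8428 MiB
2019/04/17 14:41:35.225, Tesla V100-SXM2-32GB, 00000000:07:00.0, 418.40.04, P0, 3, 3, 48, 100 %, 0 %, 32480 MiB, 14565 MiB, 17915 MiB
"""

# Same order as the --query-gpu argument list.
FIELDS = [
    "timestamp", "name", "pci.bus_id", "driver_version", "pstate",
    "pcie.link.gen.max", "pcie.link.gen.current", "temperature.gpu",
    "utilization.gpu", "utilization.memory",
    "memory.total", "memory.free", "memory.used",
]


def parse_rows(text):
    """Parse nvidia-smi CSV rows into dicts keyed by query field name."""
    rows = []
    for values in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(values) != len(FIELDS) or values[0] == "timestamp":
            continue  # skip blank lines and a header row, if present
        rows.append(dict(zip(FIELDS, (value.strip() for value in values))))
    return rows


def number(value):
    """Drop a unit suffix such as ' %' or ' MiB' and return a float."""
    return float(value.split()[0])


for row in parse_rows(SAMPLE):
    print(row["pci.bus_id"], number(row["utilization.gpu"]), number(row["memory.used"]))
```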
How to log nvidia-smi metric data (long/short term logging)
[cephagent@asgnode021 object_detection]$ nvidia-smi --query-gpu=index,timestamp,power.draw,clocks.sm,clocks.mem,clocks.gr --format=csv
index, timestamp, power.draw [W], clocks.current.sm [MHz], clocks.current.memory [MHz], clocks.current.graphics [MHz]
0, 2019/04/17 15:25:33.862, 68.71 W, 1530 MHz, 877 MHz, 1530 MHz
1, 2019/04/17 15:25:33.865, 77.53 W, 1530 MHz, 877 MHz, 1530 MHz
2, 2019/04/17 15:25:33.868, 74.54 W, 1530 MHz, 877 MHz, 1530 MHz
3, 2019/04/17 15:25:33.870, 146.91 W, 1530 MHz, 877 MHz, 1530 MHz
4, 2019/04/17 15:25:33.873, 143.57 W, 1530 MHz, 877 MHz, 1530 MHz
5, 2019/04/17 15:25:33.875, 76.06 W, 1530 MHz, 877 MHz, 1530 MHz
6, 2019/04/17 15:25:33.878, 77.58 W, 1530 MHz, 877 MHz, 1530 MHz
7, 2019/04/17 15:25:33.881, 74.15 W, 1530 MHz, 877 MHz, 1530 MHz
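For longer-term logging, the same CSV can be appended to a file and summarized afterwards. The sketch below averages power draw per GPU index from text in the format shown above, header row included. The function names are hypothetical and the embedded log is a two-row excerpt of the slide's output.

```python
import csv
import io

# Header plus two rows copied from the output above.
LOG = """\
index, timestamp, power.draw [W], clocks.current.sm [MHz], clocks.current.memory [MHz], clocks.current.graphics [MHz]
0, 2019/04/17 15:25:33.862, 68.71 W, 1530 MHz, 877 MHz, 1530 MHz
3, 2019/04/17 15:25:33.870, 146.91 W, 1530 MHz, 877 MHz, 1530 MHz
"""


def power_by_gpu(text):
    """Collect power readings in watts per GPU index."""
    readings = {}
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        watts = float(row["power.draw [W]"].split()[0])
        readings.setdefault(int(row["index"]), []).append(watts)
    return readings


def mean_power(readings):
    """Average the collected readings for each GPU."""
    return {gpu: sum(values) / len(values) for gpu, values in readings.items()}


print(mean_power(power_by_gpu(LOG)))
```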
TensorFlow Tracing
# Written against the TensorFlow 1.x API (tf.Session, tf.random_uniform)
import tensorflow as tf
from tensorflow.python.client import timeline

shape = (5000, 5000)
device_name = "/gpu:0"

# Build the graph on the chosen device (requires a visible GPU)
with tf.device(device_name):
    random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
    random_matrix2 = tf.random_uniform(shape=shape, minval=0, maxval=1)
    dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix2))

with tf.Session() as sess:
    # add options to trace the session execution
    options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    result = sess.run(dot_operation, options=options, run_metadata=run_metadata)
    print(result)

    # Create the Timeline object and write it to a json file
    fetched_timeline = timeline.Timeline(run_metadata.step_stats)
    chrome_trace = fetched_timeline.generate_chrome_trace_format()
    with open('timeline_01.json', 'w') as f:
        f.write(chrome_trace)
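The resulting file can be opened in Chrome's tracing viewer, but it can also be summarized in a few lines of Python. The sketch below assumes the standard Chrome trace event layout, a top-level `traceEvents` list where complete events have `ph` equal to `"X"` and a `dur` in microseconds. The embedded trace is a small hypothetical example, not output from the run above.

```python
import json
from collections import defaultdict


def total_duration_by_op(trace_json):
    """Sum the duration in microseconds of complete events per event name."""
    totals = defaultdict(int)
    for event in json.loads(trace_json).get("traceEvents", []):
        if event.get("ph") == "X":  # "X" marks a complete event with a duration
            totals[event["name"]] += event.get("dur", 0)
    return dict(totals)


# Hypothetical trace with one metadata event and three timed ops.
EXAMPLE_TRACE = json.dumps({
    "traceEvents": [
        {"ph": "M", "name": "process_name", "pid": 0, "args": {"name": "/gpu:0"}},
        {"ph": "X", "name": "RandomUniform", "pid": 0, "tid": 0, "ts": 0, "dur": 5},
        {"ph": "X", "name": "RandomUniform", "pid": 0, "tid": 0, "ts": 5, "dur": 5},
        {"ph": "X", "name": "MatMul", "pid": 0, "tid": 0, "ts": 10, "dur": 900},
    ]
})

totals = total_duration_by_op(EXAMPLE_TRACE)
for name, microseconds in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{name}: {microseconds} us")
```

To use it on a real run, pass the contents of `timeline_01.json` instead of `EXAMPLE_TRACE`.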
Tensorflow Tracing
DEMO
Questions?

 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 

Monitoring of GPU Usage with Tensorflow Models Using Prometheus

  • 1. MONITORING OF GPU USAGE WITH TENSORFLOW MODEL TRAINING USING PROMETHEUS Diane Feddema, Principal Software Engineer Zak Hassan, Senior Software Engineer #RED_HAT #AICOE #CTO_OFFICE
  • 2. YOUR SPEAKERS DIANE FEDDEMA PRINCIPAL SOFTWARE ENGINEER - ARTIFICIAL INTELLIGENCE CENTER OF EXCELLENCE, CTO OFFICE ● Currently focused on developing and applying Data Science and Machine Learning techniques for performance analysis, automating these analyses and displaying data in novel ways. ● Previously worked as a performance engineer at the National Center for Atmospheric Research, NCAR, working on optimizations and tuning in parallel global climate models. ZAK HASSAN SENIOR SOFTWARE ENGINEER - ARTIFICIAL INTELLIGENCE CENTER OF EXCELLENCE, CTO OFFICE ● Leading the log anomaly detection project within the aiops team and building a user feedback service for improved accuracy of machine learning predictions. ● Developing data science apps and working on improved observability of machine learning systems such as spark and tensorflow. #RED_HAT #AICOE #CTO_OFFICE
  • 3. Outline ● Story ● Concepts ○ Comparing CPU vs GPU ○ What Is Cuda and anatomy of cuda on kubernetes ○ Monitoring GPU and custom metrics with pushgateway ○ TF with Prometheus integration ○ What is Tensorflow and Pytorch ○ A Pytorch example from MLPerf ○ Tensorflow Tracing ● Examples: ○ Running Jupyter (CPU, GPU, targeting specific gpu type) ○ Mounting Training data into notebook/tf job ○ Uses of Nvidia-smi ● Demo ○ Running Detectron on a Tesla V100 with Prometheus & Grafana monitoring
  • 4. “Design the factory like you would design an advanced computer… In fact use engineers that are used to doing that and have them work on this.” -- Elon Musk (2016) https://youtu.be/f9uveu-c5us Source: https://flic.kr/p/chEftd
  • 5. WHY IS DEEP LEARNING A BIG DEAL? Mobile • unlocking phones Online • Netflix.com • Amazon.com • Targeted ads Automotive • self driving • voice assistant
  • 8. PARALLEL PROCESSING: MOST LANGUAGES SUPPORT IT ● MODERN HARDWARE SUPPORTS EXECUTION OF PARALLEL PROCESSES/THREADS, AND LANGUAGES HAVE APIS TO SPAWN PROCESSES IN PARALLEL ● YOUR ONLY LIMIT IS HOW MANY CPU CORES YOU HAVE ON YOUR MACHINE ● CPU USED TO BE A KEY COMPONENT OF HPC ● GPU HAS A DIFFERENT ARCHITECTURE & NUMBER OF CORES [Diagram labels: CPU, INSTRUCTION MEMORY, DATA MEMORY, Input/Output, ARITHMETIC LOGIC UNIT, CONTROL UNIT]
  • 12. Hardware accelerators ● GPU ○ CUDA ○ OpenCL ● TPU
  • 15. WHAT IS CUDA? PROPRIETARY TOOLING ● Hardware/software platform for HPC ● Prerequisite: an Nvidia graphics card that supports CUDA ● ML frameworks like TensorFlow, Theano, and PyTorch use CUDA for hardware acceleration ● You may get 10x faster performance for machine learning jobs by using CUDA
  • 16. ANATOMY OF A CUDA WORKLOAD ON K8S TENSORFLOW CUDA LIBS CONTAINER RUNTIME NVIDIA LIBS HOST OS SERVER /dev/nvidiaX GPU CONTAINER HARDWARE JUPYTER
  • 17. CLI monitoring tool: nvidia-smi ● Displays usage metrics for what is running on your GPU.
  • 19. Idle GPU Alert ● Alertmanager can notify via: ○ Slack chat notification ○ email ○ webhook ○ more ● Get notified when your GPU isn’t being utilized and shut down your VMs in the cloud to save on cost.
      groups:
      - name: nvidia_gpu.rules
        rules:
        - alert: UnusedResources
          expr: nvidia_gpu_duty_cycle == 0
          for: 10m
          labels:
            severity: critical
          annotations:
            description: GPU is not being utilized; you should scale down your GPU node
            summary: GPU node isn't being utilized
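The `for: 10m` clause in the rule above means the expression has to stay true for ten minutes before the alert fires. The following is a minimal sketch of that idea in plain Python, not Prometheus's actual evaluation loop: the function name `is_idle_alert_firing` and the `(timestamp, duty_cycle)` sample format are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def is_idle_alert_firing(
    samples: Iterable[Tuple[float, float]],
    hold_seconds: float = 600.0,
) -> bool:
    """Return True if the duty cycle has been 0 for at least hold_seconds.

    samples: (unix_timestamp, duty_cycle) pairs sorted by time (hypothetical
    format). Any non-zero sample resets the idle window, which loosely mirrors
    how a pending alert is cleared when the expression stops matching.
    """
    idle_since = None
    last_ts = None
    for ts, duty_cycle in samples:
        if duty_cycle == 0:
            if idle_since is None:
                idle_since = ts  # start of the current idle window
        else:
            idle_since = None  # GPU was used, so the window resets
        last_ts = ts
    if idle_since is None or last_ts is None:
        return False
    return (last_ts - idle_since) >= hold_seconds
```

A single burst of GPU activity inside the window is enough to keep the alert from firing, which is why a short `for` duration can produce noisy notifications on bursty workloads.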
  • 23. Jupyter + TF on CPU
      apiVersion: v1
      kind: Pod
      metadata:
        name: jupyter-tf-cpu
      spec:
        restartPolicy: OnFailure
        containers:
        - name: jupyter-tf-cpu
          image: "quay.io/zmhassan/fedora28:tensorflow-cpu-2.0.0-alpha0"
  • 24. Jupyter + TF on GPU
      apiVersion: v1
      kind: Pod
      metadata:
        name: jupyter-tf-gpu
      spec:
        restartPolicy: OnFailure
        containers:
        - name: jupyter-tf-gpu
          image: "tensorflow/tensorflow:nightly-gpu-py3-jupyter"
          resources:
            limits:
              nvidia.com/gpu: 1
  • 25. Specific GPU Node Target
      apiVersion: v1
      kind: Pod
      metadata:
        name: jupyter-tf-gpu
      spec:
        containers:
        - name: jupyter-tf-gpu
          image: "tensorflow/tensorflow:nightly-gpu-py3-jupyter"
          resources:
            limits:
              nvidia.com/gpu: 1
        nodeSelector:
          accelerator: nvidia-tesla-v100
  • 26. Label a Kubernetes node
      kubectl label node <node_name> accelerator=nvidia-tesla-k80
      # or
      kubectl label node <node_name> accelerator=nvidia-tesla-v100
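The GPU pod specs and the `accelerator` node label fit together as follows: the GPU limit belongs to the container, while the node selector belongs to the pod spec. This is a small sketch that builds the same manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The helper name `build_gpu_pod` is hypothetical; the field names and values are taken from the slides.

```python
import json
from typing import Optional


def build_gpu_pod(name: str, image: str, gpus: int = 1,
                  accelerator: Optional[str] = None) -> dict:
    """Build a Pod manifest requesting GPUs (hypothetical helper)."""
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [{
                "name": name,
                "image": image,
                # The GPU limit is set per container.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }
    if accelerator is not None:
        # The node selector sits at the pod spec level, next to `containers`.
        pod["spec"]["nodeSelector"] = {"accelerator": accelerator}
    return pod


if __name__ == "__main__":
    manifest = build_gpu_pod(
        "jupyter-tf-gpu",
        "tensorflow/tensorflow:nightly-gpu-py3-jupyter",
        accelerator="nvidia-tesla-v100",
    )
    print(json.dumps(manifest, indent=2))
```

The selector only matches nodes that carry the same `accelerator` label, so the label command has to be run before the pod can be scheduled.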
  • 27. Mount Training Data ● AzureDisk ● GlusterFS ● NFS ● AzureFile ● GCE Persistent Disk ● AWS Elastic Block Store ● CephFS ● … more
  • 28. Persistent Volume Claim ● Native k8s resource ● Lets you access a persistent volume (PV) ● Can be used to share data across different pods.
      kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: nfs
      spec:
        accessModes:
        - ReadWriteMany
        storageClassName: ""
        resources:
          requests:
            storage: 100Gi
  • 29. Persistent Volume ● Native k8s resource ● Access mode can be ReadOnlyMany, ReadWriteOnce, or ReadWriteMany
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: nfs
      spec:
        capacity:
          storage: 100Gi
        accessModes:
        - ReadWriteMany
        nfs:
          server: 0.0.0.0
          path: "/"
  • 30. Mounting Training Data ● Use persistent volume claims to access your data ● In this example we use NFS, but you can choose another type.
      apiVersion: v1
      kind: Pod
      metadata:
        name: jp-notebook
      spec:
        containers:
        - name: jp-notebook
          image: tensorflow/tensorflow:nightly-gpu-py3-jupyter
          volumeMounts:
          - name: my-pvc-nfs
            mountPath: "/tf/data"
        volumes:
        - name: my-pvc-nfs
          persistentVolumeClaim:
            claimName: nfs
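Once the claim is mounted, it can help to confirm from inside the notebook that the training data is actually visible before starting a long job. A minimal sketch using only the standard library: the helper name `summarize_training_data` is hypothetical, and the default path is the `mountPath` from the pod spec above.

```python
from pathlib import Path


def summarize_training_data(root: str = "/tf/data") -> dict:
    """Count files and bytes under the mounted volume (hypothetical helper)."""
    root_path = Path(root)
    if not root_path.is_dir():
        # A missing directory usually means the volume was not mounted.
        raise FileNotFoundError(f"{root} is not a mounted directory")
    files = [p for p in root_path.rglob("*") if p.is_file()]
    return {
        "files": len(files),
        "bytes": sum(p.stat().st_size for p in files),
    }
```

An empty result on a directory that exists is a hint that the pod is seeing an empty volume rather than the expected NFS export.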
  • 31. Additional Tips ● Kubernetes doesn’t support sharing GPUs ● If you’re running in the cloud, stop your VM when no workloads are running and restart it when you need it. The costs can add up. ● Use volumes to mount your training data and share it across your environment
  • 32. Monitoring and Performance of ML on GPUs ● Benchmarking ML on GPUs ○ Monitoring ○ Performance ● Example using MLPerf together with Prometheus and Grafana ● Computing requirements & why GPUs for ML
  • 33. Why do we need GPUs to solve these problems? ● Neural networks rely heavily on floating-point matrix multiplication ● These algorithms also require a lot of training data, large memory (GBs), and high-speed networks to complete in a reasonable amount of time ● Faster deep learning training
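To make the matrix multiplication point concrete, here is a minimal pure-Python sketch of the operation together with a count of its floating-point work. Multiplying an m×k matrix by a k×n matrix takes about 2·m·k·n floating-point operations (one multiply and one add per inner-loop step), and every output element can be computed independently, which is what a GPU's many cores exploit. The function names are illustrative only.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive matrix product; each output cell is an independent dot product."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def matmul_flops(m: int, k: int, n: int) -> int:
    """Approximate floating-point operations for an (m x k) @ (k x n) product."""
    return 2 * m * k * n


if __name__ == "__main__":
    print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    # Work for one 5000 x 5000 product, the size used in the tracing slide.
    print(matmul_flops(5000, 5000, 5000))
```

A single 5000×5000 product is on the order of 2.5e11 operations, and a training run repeats products like this many times per step.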
  • 34. Nvidia DGX-2 [Diagram: 16 Tesla V100 GPUs, each paired with DRAM. Source: Nvidia]
  • 35. Benchmarks in MLPerf ● Application areas: Vision, Language, Commerce, Reinforcement Learning ● Problems: Image classification, Object detection (light weight and heavy weight), Translation, Recommendations, Games (Go) ● Datasets: ImageNet, COCO, WMT English-German, MovieLens-20M, Go ● Models: ResNet-50, Detectron, Transformer, OpenNMT, Neural Collaborative Filtering, Mini Go ● Metrics: COCO mAP, Prediction accuracy, BLEU, Prediction accuracy, Prediction accuracy, Win/Loss
  • 36. MLPerf Project Sponsors University research contributors Industry contributors
  • 37. What is TensorFlow? ● Open source Python library used to implement deep neural networks (released by Google in 2015) ● A machine learning framework ● Tools to write your own models in Python, JavaScript or Swift ● Collection of datasets ready to use with TensorFlow ● TF runs in eager and graph modes ● TF can run on CPUs or GPUs
  • 38. What is PyTorch? ● Python-based open source deep learning library ● Used to build neural networks ● Replacement for NumPy for use with GPUs ● Can run on CPUs or GPUs ● Uses GPUs to accelerate numerical computations ● PyTorch performs computations
  • 39. COCO Dataset ● 85,000 images ● Identify 91 objects Source: Cornell Project
  • 41. MLPerf Results Source: Nvidia Developer News Dec 2018
  • 42. MLPerf Results - Single Node Source: Nvidia Developer News Dec 2018
  • 43. How to monitor GPUs with nvidia-smi
      $ nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5
  • 44. Monitoring GPUs with nvidia-smi
      $ nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5
      2019/04/17 14:41:35.223, Tesla V100-SXM2-32GB, 00000000:06:00.0, 418.40.04, P0, 3, 3, 44, 100 %, 0 %, 32480 MiB, 24052 MiB, 8428 MiB
      2019/04/17 14:41:35.225, Tesla V100-SXM2-32GB, 00000000:07:00.0, 418.40.04, P0, 3, 3, 48, 100 %, 0 %, 32480 MiB, 14565 MiB, 17915 MiB
      2019/04/17 14:41:35.227, Tesla V100-SXM2-32GB, 00000000:0A:00.0, 418.40.04, P0, 3, 3, 47, 100 %, 0 %, 32480 MiB, 15773 MiB, 16707 MiB
      2019/04/17 14:41:35.229, Tesla V100-SXM2-32GB, 00000000:0B:00.0, 418.40.04, P0, 3, 3, 43, 100 %, 0 %, 32480 MiB, 14363 MiB, 18117 MiB
      2019/04/17 14:41:35.231, Tesla V100-SXM2-32GB, 00000000:85:00.0, 418.40.04, P0, 3, 3, 46, 100 %, 0 %, 32480 MiB, 13363 MiB, 19117 MiB
      2019/04/17 14:41:35.233, Tesla V100-SXM2-32GB, 00000000:86:00.0, 418.40.04, P0, 3, 3, 46, 100 %, 0 %, 32480 MiB, 14719 MiB, 17761 MiB
      2019/04/17 14:41:35.234, Tesla V100-SXM2-32GB, 00000000:89:00.0, 418.40.04, P0, 3, 3, 49, 100 %, 0 %, 32480 MiB, 15861 MiB, 16619 MiB
      2019/04/17 14:41:35.236, Tesla V100-SXM2-32GB, 00000000:8A:00.0, 418.40.04, P0, 3, 3, 44, 100 %, 0 %, 32480 MiB, 12317 MiB, 20163 MiB
      2019/04/17 14:41:40.239, Tesla V100-SXM2-32GB, 00000000:06:00.0, 418.40.04, P0, 3, 3, 44, 100 %, 0 %, 32480 MiB, 24052 MiB, 8428 MiB
      2019/04/17 14:41:40.240, Tesla V100-SXM2-32GB, 00000000:07:00.0, 418.40.04, P0, 3, 3, 48, 100 %, 1 %, 32480 MiB, 14565 MiB, 17915 MiB
      2019/04/17 14:41:40.240, Tesla V100-SXM2-32GB, 00000000:0A:00.0, 418.40.04, P0, 3, 3, 47, 100 %, 1 %, 32480 MiB, 15773 MiB, 16707 MiB
      2019/04/17 14:41:40.241, Tesla V100-SXM2-32GB, 00000000:0B:00.0, 418.40.04, P0, 3, 3, 43, 100 %, 1 %, 32480 MiB, 14363 MiB, 18117 MiB
      Columns, in order: timestamp, name, pci.bus_id, driver_version, pstate, pcie.link.gen.max, pcie.link.gen.current, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
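The CSV rows that nvidia-smi prints can be turned into named fields with the standard library alone. The sketch below assumes the column order of the `--query-gpu` list shown in the slide; the names `FIELDS`, `parse_rows`, and `number` are hypothetical helpers, not part of any Nvidia tooling.

```python
import csv
import io
from typing import Dict, List

# Column order follows the --query-gpu list used on the slide.
FIELDS = [
    "timestamp", "name", "pci.bus_id", "driver_version", "pstate",
    "pcie.link.gen.max", "pcie.link.gen.current", "temperature.gpu",
    "utilization.gpu", "utilization.memory",
    "memory.total", "memory.free", "memory.used",
]


def parse_rows(text: str) -> List[Dict[str, str]]:
    """Parse nvidia-smi CSV output into one dict per GPU sample."""
    rows = []
    for values in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(values) != len(FIELDS):
            continue  # blank or truncated line
        if values[0].strip() == "timestamp":
            continue  # header row printed by --format=csv
        rows.append(dict(zip(FIELDS, (v.strip() for v in values))))
    return rows


def number(value: str) -> float:
    """Strip the unit suffix: '100 %' -> 100.0, '8428 MiB' -> 8428.0."""
    return float(value.split()[0])
```

With the fields named, a check such as `number(row["memory.free"]) + number(row["memory.used"])` against `memory.total` is a quick way to confirm the columns were mapped in the right order.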
  • 45. How to log nvidia-smi metric data (long/short term logging)
      [cephagent@asgnode021 object_detection]$ nvidia-smi --query-gpu=index,timestamp,power.draw,clocks.sm,clocks.mem,clocks.gr --format=csv
      index, timestamp, power.draw [W], clocks.current.sm [MHz], clocks.current.memory [MHz], clocks.current.graphics [MHz]
      0, 2019/04/17 15:25:33.862, 68.71 W, 1530 MHz, 877 MHz, 1530 MHz
      1, 2019/04/17 15:25:33.865, 77.53 W, 1530 MHz, 877 MHz, 1530 MHz
      2, 2019/04/17 15:25:33.868, 74.54 W, 1530 MHz, 877 MHz, 1530 MHz
      3, 2019/04/17 15:25:33.870, 146.91 W, 1530 MHz, 877 MHz, 1530 MHz
      4, 2019/04/17 15:25:33.873, 143.57 W, 1530 MHz, 877 MHz, 1530 MHz
      5, 2019/04/17 15:25:33.875, 76.06 W, 1530 MHz, 877 MHz, 1530 MHz
      6, 2019/04/17 15:25:33.878, 77.58 W, 1530 MHz, 877 MHz, 1530 MHz
      7, 2019/04/17 15:25:33.881, 74.15 W, 1530 MHz, 877 MHz, 1530 MHz
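Logged rows like these can be converted into the Prometheus text exposition format (`name{label="value"} number`), which is the form a Pushgateway accepts over HTTP. The sketch below only does the conversion. The metric names `gpu_power_draw_watts` and `gpu_sm_clock_mhz` are made up for illustration and are not the names used by any particular exporter.

```python
import csv
import io


def csv_to_exposition(text: str) -> str:
    """Convert 'index, timestamp, power.draw, clocks.sm, ...' rows to metrics.

    Metric names are hypothetical; only power draw and SM clock are exported.
    """
    lines = []
    for values in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(values) < 4 or values[0].strip() == "index":
            continue  # skip blank lines and the CSV header
        gpu = values[0].strip()
        power_watts = float(values[2].split()[0])   # '68.71 W' -> 68.71
        sm_clock_mhz = float(values[3].split()[0])  # '1530 MHz' -> 1530.0
        lines.append(f'gpu_power_draw_watts{{gpu="{gpu}"}} {power_watts}')
        lines.append(f'gpu_sm_clock_mhz{{gpu="{gpu}"}} {sm_clock_mhz}')
    # The exposition format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n" if lines else ""
```

Keeping the GPU index as a label rather than part of the metric name lets one Grafana panel plot all eight devices from a single query.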
  • 46. TensorFlow Tracing
      # Uses the TensorFlow 1.x session API
      import tensorflow as tf
      from tensorflow.python.client import timeline

      shape = (5000, 5000)
      # Target device; wrap the ops below in tf.device(device_name) to pin them to it
      device_name = "/gpu:0"
      random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
      random_matrix2 = tf.random_uniform(shape=shape, minval=0, maxval=1)
      dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix2))

      with tf.Session() as sess:
          # Add options to trace the session execution
          options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
          run_metadata = tf.RunMetadata()
          result = sess.run(dot_operation, options=options, run_metadata=run_metadata)
          print(result)

          # Create the Timeline object and write it to a JSON file
          fetched_timeline = timeline.Timeline(run_metadata.step_stats)
          chrome_trace = fetched_timeline.generate_chrome_trace_format()
          with open('timeline_01.json', 'w') as f:
              f.write(chrome_trace)
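The JSON file written by the tracing example is normally opened in Chrome's trace viewer, but it can also be summarized in a few lines. This sketch assumes the file follows the Chrome trace event format, with a top-level `traceEvents` list whose complete events (`"ph": "X"`) carry a `dur` in microseconds. The function name `summarize_trace` is hypothetical.

```python
import json
from collections import defaultdict
from typing import List, Tuple


def summarize_trace(trace_json: str, top: int = 5) -> List[Tuple[str, int]]:
    """Return the `top` event names by total duration in microseconds."""
    events = json.loads(trace_json).get("traceEvents", [])
    totals = defaultdict(int)
    for event in events:
        # Only complete events ("X") carry a duration; metadata events do not.
        if event.get("ph") == "X":
            totals[event.get("name", "?")] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    with open("timeline_01.json") as f:
        for name, micros in summarize_trace(f.read()):
            print(f"{name}: {micros} us")
```

For the matrix product in the slide, one would expect the `MatMul` entry to dominate the totals, though the exact event names depend on the TensorFlow version.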
  • 48. DEMO