Distributed TensorFlow
on Kubernetes
Information and Communications Research Laboratories, Mac Chiang (蔣是文)
Copyright 2017 ITRI (Industrial Technology Research Institute)
Agenda
• Kubernetes Introduction
• Scheduling GPUs on Kubernetes
• Distributed TensorFlow Introduction
• Running Distributed TensorFlow on Kubernetes
• Experience Sharing
• Summary
What’s Kubernetes
• “Kubernetes” is Greek for captain or pilot
• Built on Google's experience and designed by Google
• Kubernetes is a production-grade, open-source platform that
orchestrates the placement (scheduling) and execution of
application containers within and across computer clusters.
• Masters manage the cluster and the nodes are used to host the
running applications.
Why Kubernetes
• Automatic binpacking
• Horizontal scaling
• Automated rollouts and rollbacks
• Service monitor
• Self-healing
• Service discovery and load balancing
• 100% Open source, written in Go
Scheduling GPUs on Kubernetes
• NVIDIA drivers must be installed on the nodes
• The alpha feature gate Accelerators must be turned on across the system
▪ --feature-gates="Accelerators=true"
• Nodes must use Docker Engine as the container runtime
Scheduling GPUs on Kubernetes (cont.)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: gpu-container-1
      image: gcr.io/google_containers/pause:2.0
      resources:
        limits:
          alpha.kubernetes.io/nvidia-gpu: 2 # requesting 2 GPUs
    - name: gpu-container-2
      image: gcr.io/google_containers/pause:2.0
      resources:
        limits:
          alpha.kubernetes.io/nvidia-gpu: 3 # requesting 3 GPUs
Scheduling on Different GPU Versions
• Labeling nodes with GPU HW type
• Specify the GPU types via Node Affinity rules
Node1: 2 * Tesla K80, label gpu:k80
Node2: 1 * Tesla P100, label gpu:p100
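The labelling and node-affinity steps on this slide can be sketched as a pod manifest built in Python. This is a minimal sketch, assuming the nodes carry a label with key gpu and value k80 or p100 (following the gpu:k80 / gpu:p100 labels shown here). The helper name gpu_pod_manifest, the pod name, and the use of the required (rather than preferred) affinity rule are illustrative assumptions, not taken from the slides.

```python
import json


def gpu_pod_manifest(name, image, gpu_type, gpu_count):
    """Build a pod manifest (as a dict) pinned to nodes labelled gpu=<gpu_type>.

    Hypothetical helper for illustration only; the label key "gpu" follows
    the gpu:k80 / gpu:p100 labels from the slide.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "affinity": {
                "nodeAffinity": {
                    # Hard requirement: only schedule on matching nodes.
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [
                            {
                                "matchExpressions": [
                                    {
                                        "key": "gpu",
                                        "operator": "In",
                                        "values": [gpu_type],
                                    }
                                ]
                            }
                        ]
                    }
                }
            },
            "containers": [
                {
                    "name": name + "-container",
                    "image": image,
                    "resources": {
                        "limits": {"alpha.kubernetes.io/nvidia-gpu": gpu_count}
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest(
        "p100-pod", "gcr.io/google_containers/pause:2.0", "p100", 1
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be piped to kubectl create -f -, and a node would first be labelled with a command such as kubectl label nodes <node_name> gpu=p100.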
Access to CUDA libraries from Docker
(Diagram: nvidia-driver and libcuda.so made available to the Docker container)
TensorFlow
• Originally developed by the Google Brain Team
within Google's Machine Intelligence research
organization
• An open source software library for numerical
computation using data flow graphs
• Nodes in the graph represent mathematical
operations, while the graph edges represent the
multidimensional data arrays (tensors)
communicated between them.
• Supports one or more CPUs or GPUs in a desktop,
server, or mobile device with a single API
Distributed TensorFlow
http://www.inwinstack.com/index.php/zh/2017/04/17/tensorflow-on-kubernetes/
Distributed TensorFlow (cont.)
• Replication
▪ In-graph
▪ Between-graph
• Training
▪ Asynchronous
▪ Synchronous
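The difference between the two training modes can be illustrated without TensorFlow. The following is a toy sketch in plain Python, assuming a single scalar parameter and made-up per-worker gradients; sync_step and async_steps are hypothetical names. In synchronous training, one averaged update is applied after all workers have reported. In asynchronous training, each worker's gradient is applied as soon as it arrives.

```python
def sync_step(param, worker_grads, lr):
    """Synchronous: wait for every worker, then apply one averaged update."""
    avg_grad = sum(worker_grads) / len(worker_grads)
    return param - lr * avg_grad


def async_steps(param, worker_grads, lr):
    """Asynchronous: apply each worker's gradient as soon as it arrives."""
    for grad in worker_grads:
        param = param - lr * grad
    return param


if __name__ == "__main__":
    grads = [1.0, 3.0]  # made-up gradients from two workers
    print(sync_step(10.0, grads, lr=0.5))    # one update with the mean gradient
    print(async_steps(10.0, grads, lr=0.5))  # two independent updates
```

The sketch leaves out gradient staleness: in a real asynchronous run, a worker may compute its gradient from parameters that other workers have already changed.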
Distributed TensorFlow on K8S
• TensorFlow ecosystem
▪ https://github.com/tensorflow/ecosystem
Distributed TensorFlow on K8S (cont.)
• Prepare the code for distributed training
▪ Flags for configuring the task
▪ Construct the cluster and start the server
▪ Set the device before graph construction
# TensorFlow 1.x API
import tensorflow as tf

flags = tf.app.flags
FLAGS = flags.FLAGS

# Flags for configuring the task
flags.DEFINE_integer("task_index", None,
                     "Worker task index, should be >= 0. task_index=0 is "
                     "the master worker task that performs the variable "
                     "initialization.")
flags.DEFINE_string("ps_hosts", None,
                    "Comma-separated list of hostname:port pairs")
flags.DEFINE_string("worker_hosts", None,
                    "Comma-separated list of hostname:port pairs")
flags.DEFINE_string("job_name", None, "job name: worker or ps")

# Construct the cluster and start the server
ps_spec = FLAGS.ps_hosts.split(",")
worker_spec = FLAGS.worker_hosts.split(",")
cluster = tf.train.ClusterSpec({
    "ps": ps_spec,
    "worker": worker_spec})
server = tf.train.Server(
    cluster, job_name=FLAGS.job_name,
    task_index=FLAGS.task_index)
if FLAGS.job_name == "ps":
    # Parameter servers only host variables, so they block here
    server.join()

# Set the device before graph construction
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):
    # Construct the TensorFlow graph.
    pass  # placeholder: the model definition goes here

# Run the TensorFlow graph.
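The flag handling on this slide can be checked without starting a TensorFlow server. The following is a minimal sketch, assuming the same comma-separated hostname:port format; build_cluster_dict and validate_task are hypothetical helpers, and the dictionary they produce has the same shape as the one the slide passes to tf.train.ClusterSpec.

```python
def build_cluster_dict(ps_hosts, worker_hosts):
    """Split comma-separated host:port strings into a job-name -> hosts mapping."""
    return {
        "ps": [h.strip() for h in ps_hosts.split(",") if h.strip()],
        "worker": [h.strip() for h in worker_hosts.split(",") if h.strip()],
    }


def validate_task(cluster, job_name, task_index):
    """Return the address of the task, or raise ValueError if it does not exist."""
    if job_name not in cluster:
        raise ValueError("job_name must be one of %s" % sorted(cluster))
    if task_index is None or not 0 <= task_index < len(cluster[job_name]):
        raise ValueError("task_index out of range for job %r" % job_name)
    return cluster[job_name][task_index]


if __name__ == "__main__":
    cluster = build_cluster_dict("ps-0:5000", "worker-0:5000,worker-1:5000")
    print(validate_task(cluster, "worker", 1))
```

Validating job_name and task_index up front may give a clearer error than a server that fails to start, which can help when the flags are generated from a template.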
Distributed TensorFlow on K8S (cont.)
• Build docker image
▪ Prepare Dockerfile
FROM tensorflow/tensorflow:latest-gpu
COPY mnist_replica.py /
ENTRYPOINT ["python", "/mnist_replica.py"]
▪ Build docker image
docker build -t <image_name>:v1 -f Dockerfile .
docker build -t macchiang/mnist:v7 -f Dockerfile .
▪ Push image to Docker Hub
docker push <image_name>:v1
docker push macchiang/mnist:v7
Distributed TensorFlow on K8S (cont.)
• My revision history
▪ https://hub.docker.com/r/macchiang/mnist/tags/
Distributed TensorFlow on K8S (cont.)
• Specify parameters in jinja template file
▪ name, image, worker_replicas, ps_replicas, script, data_dir, and train_dir
▪ You may optionally specify credential_secret_name and
credential_secret_key if you need to read and write to Google Cloud
Storage
• Generate K8S YAML and create services and pods
▪ python render_template.py mnist.yaml.jinja | kubectl create -f -
command:
  - "/root/inception/bazel-bin/inception/imagenet_distributed_train"
args:
  - "--data_dir=/data/raw-data"
  - "--task_index=0"
  - "--job_name=worker"
  - "--worker_hosts=inception-worker-0:5000,inception-worker-1:5000"
  - "--ps_hosts=inception-ps-0:5000"
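In the rendered arguments, only job_name and task_index change from pod to pod, while the host lists are shared by every task. The following is a minimal sketch of how such per-task arguments could be generated, assuming service names of the form <name>-worker-<i> and <name>-ps-<i> on port 5000 as in the example. host_list and task_args are hypothetical helpers and are not part of render_template.py.

```python
def host_list(name, job, replicas, port=5000):
    """Return 'name-job-0:port,name-job-1:port,...' for the given replica count."""
    return ",".join("%s-%s-%d:%d" % (name, job, i, port) for i in range(replicas))


def task_args(name, job, index, worker_replicas, ps_replicas, data_dir):
    """Build the argument list for one worker or ps task."""
    return [
        "--data_dir=%s" % data_dir,
        "--task_index=%d" % index,
        "--job_name=%s" % job,
        # Every task receives the same full host lists.
        "--worker_hosts=%s" % host_list(name, "worker", worker_replicas),
        "--ps_hosts=%s" % host_list(name, "ps", ps_replicas),
    ]


if __name__ == "__main__":
    for arg in task_args("inception", "worker", 0, 2, 1, "/data/raw-data"):
        print(arg)
```

Calling task_args once per worker index and once per ps index yields one argument list per pod, which matches the one-service-per-task layout that the host names imply.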
Distributed TensorFlow on K8S (cont.)
(Diagram: Worker0, Worker1, and PS0, each behind its own service (Service Worker0, Service Worker1, Service PS0) on port 5000; training data is read from NFS and the training result is written to NFS.)
Distributed TensorFlow with CPUs
• Container orchestration platform
▪ 35 nodes, 35 containers
▪ 2 * Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
▪ 48GB Memory
• ImageNet data (145GB) on NFS; training result on NFS
• Inception model (Rethinking the Inception Architecture for Computer Vision)
▪ Spent 9.23 days
Summary
• Kubernetes
▪ Production-grade container orchestration platform
▪ GPU resources management
a. Only NVIDIA GPUs are supported for now
b. In Kubernetes 1.8, you can use the NVIDIA device plugin.
» https://github.com/NVIDIA/k8s-device-plugin
• Kubernetes + Distributed TensorFlow
▪ Easy to build the distributed training cluster
▪ Leverage Kubernetes advantages
a. Restart failed container
b. Monitoring
c. Scheduling
Thank you!
macchiang@itri.org.tw
Kubernetes Taiwan User Group

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Distributed TensorFlow on Kubernetes

  • 2. Agenda
      • Kubernetes Introduction
      • Scheduling GPUs on Kubernetes
      • Distributed TensorFlow Introduction
      • Running Distributed TensorFlow on Kubernetes
      • Experience Sharing
      • Summary
  • 3. What’s Kubernetes
      • “Kubernetes” is Greek for captain or pilot
      • Built on Google's experience and designed by Google
      • Kubernetes is a production-grade, open-source platform that orchestrates the placement (scheduling) and execution of application containers within and across computer clusters.
      • Masters manage the cluster, and the nodes host the running applications.
  • 4. Why Kubernetes
      • Automatic binpacking
      • Horizontal scaling
      • Automated rollouts and rollback
      • Service monitor
      • Self-healing
      • Service discovery and load balancing
      • 100% Open source, written in Go
  • 5. Scheduling GPUs on Kubernetes
      • Nvidia drivers must be installed on the nodes
      • The alpha feature gate Accelerators must be turned on across the system
          ▪ --feature-gates="Accelerators=true"
      • Nodes must use Docker Engine as the container runtime
  • 6. Scheduling GPUs on Kubernetes (cont.)
        apiVersion: v1
        kind: Pod
        metadata:
          name: gpu-pod
        spec:
          containers:
            - name: gpu-container-1
              image: gcr.io/google_containers/pause:2.0
              resources:
                limits:
                  alpha.kubernetes.io/nvidia-gpu: 2 # requesting 2 GPUs
            - name: gpu-container-2
              image: gcr.io/google_containers/pause:2.0
              resources:
                limits:
                  alpha.kubernetes.io/nvidia-gpu: 3 # requesting 3 GPUs
  • 7. Scheduling on Different GPU Versions
      • Label nodes with their GPU hardware type
      • Specify the GPU types via Node Affinity rules
      [Diagram: Node1 has 2 * Tesla K80 and is labeled gpu:k80; Node2 has 1 * Tesla P100 and is labeled gpu:p100]
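The node-affinity rule from slide 7 can be sketched as a Pod manifest built in Python. This is a minimal illustration, not the speaker's configuration: the label key `gpu` and the values `k80` / `p100` come from the slide, while the function name `gpu_pod_manifest`, the pod name `p100-pod`, and the reuse of the pause image from slide 6 are assumptions made for the example.

```python
import json


def gpu_pod_manifest(pod_name, image, gpu_type, gpu_count):
    """Build a Pod manifest (as a dict) that may only be scheduled
    on nodes carrying the label gpu=<gpu_type>."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "affinity": {
                "nodeAffinity": {
                    # Hard requirement: nodes without a matching label are skipped.
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [{
                            "matchExpressions": [{
                                "key": "gpu",  # label key shown on the slide
                                "operator": "In",
                                "values": [gpu_type],
                            }]
                        }]
                    }
                }
            },
            "containers": [{
                "name": pod_name + "-container",  # hypothetical name
                "image": image,
                "resources": {
                    # Alpha GPU resource name used on slide 6.
                    "limits": {"alpha.kubernetes.io/nvidia-gpu": gpu_count}
                },
            }],
        },
    }


manifest = gpu_pod_manifest(
    "p100-pod", "gcr.io/google_containers/pause:2.0", "p100", 1)
print(json.dumps(manifest, indent=2))
```

Assuming the nodes were labeled beforehand (for example with `kubectl label nodes <node-name> gpu=p100`), the printed JSON could be piped to `kubectl create -f -`, since kubectl accepts JSON as well as YAML.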
  • 8. Access to CUDA libraries from Docker
      [Diagram labels: nvidia-driver, libcuda.so]
  • 9. TensorFlow
      • Originally developed by the Google Brain Team within Google's Machine Intelligence research organization
      • An open source software library for numerical computation using data flow graphs
      • Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them.
      • Supports one or more CPUs or GPUs in a desktop, server, or mobile device with a single API
  • 10. Distributed TensorFlow
      http://www.inwinstack.com/index.php/zh/2017/04/17/tensorflow-on-kubernetes/
  • 11. Distributed TensorFlow (cont.)
      • Replication
          ▪ In-graph
          ▪ Between-graph
      • Training
          ▪ Asynchronous
          ▪ Synchronous
  • 12. Distributed TensorFlow on K8S
      • TensorFlow ecosystem
          ▪ https://github.com/tensorflow/ecosystem
  • 13. Distributed TensorFlow on K8S (cont.)
      • Prepare code for distributed training
          ▪ Flags for configuring the task
          ▪ Construct the cluster and start the server
          ▪ Set the device before graph construction

        # TensorFlow 1.x API (tf.app.flags, tf.train.Server)
        import tensorflow as tf

        flags = tf.app.flags
        FLAGS = flags.FLAGS

        # Flags for configuring the task
        flags.DEFINE_integer("task_index", None,
                             "Worker task index, should be >= 0. task_index=0 is "
                             "the master worker task that performs the variable "
                             "initialization.")
        flags.DEFINE_string("ps_hosts", None,
                            "Comma-separated list of hostname:port pairs")
        flags.DEFINE_string("worker_hosts", None,
                            "Comma-separated list of hostname:port pairs")
        flags.DEFINE_string("job_name", None, "job name: worker or ps")

        # Construct the cluster and start the server
        ps_spec = FLAGS.ps_hosts.split(",")
        worker_spec = FLAGS.worker_hosts.split(",")
        cluster = tf.train.ClusterSpec({"ps": ps_spec, "worker": worker_spec})
        server = tf.train.Server(
            cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

        # A parameter server only hosts variables, so it blocks here
        if FLAGS.job_name == "ps":
            server.join()

        # Set the device before graph construction
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster)):
            # Construct the TensorFlow graph.
            # Run the TensorFlow graph.
            pass  # placeholder: the model code goes here
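To make the device-setting step on slide 13 more concrete: by default, `tf.train.replica_device_setter` places variables on the ps tasks in round-robin order and leaves the remaining ops on the worker device. The sketch below imitates that placement in plain Python. It is an illustration only, not TensorFlow's implementation; the function name `place_round_robin`, the variable names, and the `<compute ops>` key are hypothetical.

```python
def place_round_robin(variable_names, num_ps_tasks, worker_task_index):
    """Assign each variable to a ps task in round-robin order and
    keep all other (compute) ops on the given worker task."""
    placement = {}
    for i, name in enumerate(variable_names):
        placement[name] = "/job:ps/task:%d" % (i % num_ps_tasks)
    # Stand-in entry for every non-variable op in the graph.
    placement["<compute ops>"] = "/job:worker/task:%d" % worker_task_index
    return placement


placement = place_round_robin(
    ["weights_1", "biases_1", "weights_2", "biases_2"],
    num_ps_tasks=2,
    worker_task_index=0,
)
for name, device in placement.items():
    print(name, "->", device)
```

Because every worker runs the same assignment over the same variable list, each variable ends up on the same ps task no matter which worker builds the graph, which is what between-graph replication relies on.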
  • 14. Distributed TensorFlow on K8S (cont.)
      • Build docker image
          ▪ Prepare Dockerfile
                FROM tensorflow/tensorflow:latest-gpu
                COPY mnist_replica.py /
                ENTRYPOINT ["python", "/mnist_replica.py"]
          ▪ Build docker image
                docker build -t <image_name>:v1 -f Dockerfile .
                docker build -t macchiang/mnist:v7 -f Dockerfile .
          ▪ Push image to Docker Hub
                docker push <image_name>:v1
                docker push macchiang/mnist:v7
  • 15. Distributed TensorFlow on K8S (cont.)
      • My revision history
          ▪ https://hub.docker.com/r/macchiang/mnist/tags/
  • 16. Distributed TensorFlow on K8S (cont.)
      • Specify parameters in the jinja template file
          ▪ name, image, worker_replicas, ps_replicas, script, data_dir, and train_dir
          ▪ You may optionally specify credential_secret_name and credential_secret_key if you need to read and write to Google Cloud Storage
      • Generate K8S YAML and create services and pods
          ▪ python render_template.py mnist.yaml.jinja | kubectl create -f -
      • Example container command and arguments:
            command:
              - "/root/inception/bazel-bin/inception/imagenet_distributed_train"
            args:
              - "--data_dir=/data/raw-data"
              - "--task_index=0"
              - "--job_name=worker"
              - "--worker_hosts=inception-worker-0:5000,inception-worker-1:5000"
              - "--ps_hosts=inception-ps-0:5000"
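The arguments shown on slide 16 follow from the template parameters (name, worker_replicas, ps_replicas, data_dir). The sketch below derives them in plain Python. It is not the template from the tensorflow/ecosystem repository; it assumes that host names follow the `<name>-<job>-<index>:<port>` pattern visible on the slide, and the helper names `host_list` and `task_args` are hypothetical.

```python
def host_list(name, job, replicas, port=5000):
    """Comma-separated host:port list, one entry per replica of a job."""
    return ",".join(
        "%s-%s-%d:%d" % (name, job, i, port) for i in range(replicas))


def task_args(name, job_name, task_index, worker_replicas, ps_replicas,
              data_dir):
    """Command-line arguments for one task (worker or ps) of the cluster."""
    return [
        "--data_dir=%s" % data_dir,
        "--task_index=%d" % task_index,
        "--job_name=%s" % job_name,
        "--worker_hosts=%s" % host_list(name, "worker", worker_replicas),
        "--ps_hosts=%s" % host_list(name, "ps", ps_replicas),
    ]


args = task_args("inception", "worker", 0,
                 worker_replicas=2, ps_replicas=1,
                 data_dir="/data/raw-data")
for arg in args:
    print(arg)
```

Every task receives the same `--worker_hosts` and `--ps_hosts` values; only `--job_name` and `--task_index` differ from pod to pod.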
  • 17. Distributed TensorFlow on K8S (cont.)
      [Diagram: pods Worker0, Worker1, and PS0 are each exposed through their own Service (Service Worker0, Service Worker1, Service PS0) on port 5000. Training data is read from NFS, and training results are written to NFS.]
  • 18. Distributed TensorFlow with CPUs
      • Container orchestration platform: 35 nodes, 35 containers
      • Hardware: 2 * Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 48GB memory
      • Storage: ImageNet data (145GB) on NFS; training results on NFS
      • Inception model (Rethinking the Inception Architecture for Computer Vision)
          ▪ Spent 9.23 days
  • 19. Summary
      • Kubernetes
          ▪ Production-grade container orchestration platform
          ▪ GPU resource management
              a. Only Nvidia GPUs are supported for now
              b. In Kubernetes 1.8, you can use the NVIDIA device plugin.
                  » https://github.com/NVIDIA/k8s-device-plugin
      • Kubernetes + Distributed TensorFlow
          ▪ Easy to build the distributed training cluster
          ▪ Leverage Kubernetes advantages
              a. Restart failed containers
              b. Monitoring
              c. Scheduling

Editor's notes

  1. https://kubernetes.io/docs/tutorials/kubernetes-basics/cluster-intro/
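  2. In mid-April of last year, Google released TensorFlow 0.8, which added distributed computing support. TensorFlow can now run training across hundreds of machines to build all kinds of machine learning models, cutting training that used to take days or weeks down to hours.
     A TensorFlow job can be split into multiple tasks that perform the same function. Jobs are divided into parameter servers and workers:
     • Parameter server: mainly updates variables according to the gradients and stores them in tf.Variable. It can be thought of as storing only the model's variables and holding the Variable replicas.
     • Worker: usually called a compute node. It mainly runs the compute-intensive part of the graph and computes gradients from the variables. It can also hold a replica of the graph.
     Other terms:
     • Client: builds the TensorFlow computation graph and creates the tensorflow::Session that interacts with the cluster. It is usually written in Python or C++. A single client can connect to multiple TF servers at the same time, and a single TF server can serve multiple clients.
     • Master Service: an RPC service that provides remote access to a set of distributed devices. It mainly provides the tensorflow::Session interface and is responsible for communicating with the working tasks through the Worker Service.
     • Worker Service: an RPC service that executes parts of the graph on local devices (CPU or GPU). It implements the worker_service.proto interface. All TensorFlow servers include the Worker Service logic.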
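The division of labor described in note 2 can be sketched with a toy model in plain Python: the parameter server stores a variable and applies gradients, and the worker computes gradients from the current value. This is an illustration under stated assumptions, not TensorFlow code; the classes `ParameterServer` and `Worker`, the scalar variable, and the loss (w - target)^2 are all made up for the example.

```python
class ParameterServer:
    """Stores the model variable and applies gradients sent by workers."""

    def __init__(self, initial_value, learning_rate):
        self.value = initial_value
        self.learning_rate = learning_rate

    def read(self):
        return self.value

    def apply_gradient(self, gradient):
        # Plain gradient-descent update of the stored variable.
        self.value -= self.learning_rate * gradient


class Worker:
    """Computes the gradient of the toy loss (w - target)^2."""

    def __init__(self, target):
        self.target = target

    def compute_gradient(self, w):
        return 2.0 * (w - self.target)


ps = ParameterServer(initial_value=0.0, learning_rate=0.25)
worker = Worker(target=4.0)
for step in range(3):
    grad = worker.compute_gradient(ps.read())  # worker: compute-heavy part
    ps.apply_gradient(grad)                    # ps: update and store
    print(step, ps.read())
```

In a real cluster the read and the update travel over the network between separate processes; here both objects live in one process to keep the roles visible.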
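  3. TensorFlow also has two modes for launching distributed deep learning training jobs. One is in-graph replication: the parameters of the neural network are all kept in a single TensorFlow computation graph, and only the computation is distributed to different compute servers. The other is between-graph replication: every compute server also creates the parameters, but the parameters are assigned to the parameter servers in a uniform way. Because in-graph replication is somewhat weaker at handling massive amounts of data, between-graph replication is the more commonly used mode.

     In-graph replication. In this approach, the client builds a single tf.Graph that contains one set of parameters (in tf.Variable nodes pinned to /job:ps); and multiple copies of the compute-intensive part of the model, each pinned to a different task in /job:worker.

     Between-graph replication. In this approach, there is a separate client for each /job:worker task, typically in the same process as the worker task. Each client builds a similar graph containing the parameters (pinned to /job:ps as before using tf.train.replica_device_setter to map them deterministically to the same tasks); and a single copy of the compute-intensive part of the model, pinned to the local task in /job:worker.

     Deep learning models are commonly trained in one of two distributed ways: synchronous updates or asynchronous updates. As shown on the slide above, in synchronous mode all servers read the parameter values together, compute the parameter gradients, and then apply the update together. In asynchronous mode, each server reads the parameters, computes gradients, and updates the parameters on its own, without synchronizing with the other servers. The biggest problem with synchronous updates is that all servers must finish every operation in step, so fast servers have to wait for slow ones and resource utilization is relatively lower. Asynchronous mode may update the parameters with stale gradients, which can hurt training quality. Each mode has its own strengths and weaknesses; it is hard to say in general which is better, and the choice has to be made case by case.

     Asynchronous training. In this approach, each replica of the graph has an independent training loop that executes without coordination. It is compatible with both forms of replication above.

     Synchronous training. In this approach, all of the replicas read the same values for the current parameters, compute gradients in parallel, and then apply them together. It is compatible with in-graph replication (e.g. using gradient averaging as in the CIFAR-10 multi-GPU trainer), and between-graph replication (e.g. using the tf.train.SyncReplicasOptimizer).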
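The difference between the two training modes in note 3 can be shown with a small numeric sketch in plain Python. It assumes a single scalar parameter w, two replicas whose toy losses are (w - target)^2 with different targets, and, for the asynchronous case, one particular interleaving in which both replicas read w before either update lands. None of the function names come from TensorFlow.

```python
def gradient(w, target):
    """Gradient of the toy per-replica loss (w - target)^2."""
    return 2.0 * (w - target)


def synchronous_step(w, targets, lr):
    """All replicas read the same w; their gradients are averaged
    and applied together as one update."""
    grads = [gradient(w, t) for t in targets]
    return w - lr * sum(grads) / len(grads)


def asynchronous_step(w, targets, lr):
    """Each replica applies its own update without coordination.
    In this interleaving every replica read w before any update
    landed, so the later updates are based on a stale value."""
    stale_w = w
    for t in targets:
        w = w - lr * gradient(stale_w, t)
    return w


targets = [2.0, 6.0]
w_sync = synchronous_step(0.0, targets, lr=0.25)
w_async = asynchronous_step(0.0, targets, lr=0.25)
print("synchronous:", w_sync)
print("asynchronous:", w_async)
```

With these numbers the synchronous step moves w to 2.0, while the asynchronous interleaving moves it to 4.0, because two uncoordinated updates computed from the same stale value are both applied in full. Other interleavings would give other results, which is the source of the variability the note mentions.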