This document discusses Kubeflow operators and how they enable Kubeflow to support multiple machine learning frameworks like TensorFlow, PyTorch, MXNet, and Chainer. It explains that operators and custom resource definitions (CRDs) allow ML jobs to be defined and managed for different frameworks. It provides examples of how jobs are defined for TensorFlow using TFJobs and for Chainer using ChainerJobs. It also summarizes how operators work by expanding the custom resources into Kubernetes objects like pods, services, and statefulsets.
6. How? ➔ Operators and CRDs !!
Icons made by Gregor Cresnar from www.flaticon.com is licensed by CC 3.0 BY
What is CRD !?
kind: CustomResourceDefinition
…
spec:
  kind: MyKind
What is Operator !?
Operator
7. What is CRD !?
Icons made by Gregor Cresnar, Kiranshastry, Icon Pond, Icon Monk from www.flaticon.com is licensed by CC 3.0 BY
Custom Resource Definition
kind: CustomResourceDefinition
…
spec:
  kind: MyKind
Custom Resource
kind: MyKind
metadata:
  name: my-name
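As a rough illustration of how a Custom Resource relates to its Custom Resource Definition, the sketch below writes both objects as plain Python dictionaries and checks that one is an instance of the other. Only `MyKind` and `my-name` come from the slide; the group `example.com`, the version `v1`, the plural `mykinds`, and the helper `is_instance_of` are hypothetical placeholders. The sketch assumes the current `apiextensions.k8s.io/v1` layout, where the kind is declared under `spec.names.kind`.

```python
# Sketch only: group, version and plural names are hypothetical placeholders.
crd = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "mykinds.example.com"},  # <plural>.<group>
    "spec": {
        "group": "example.com",
        "scope": "Namespaced",
        "names": {"kind": "MyKind", "plural": "mykinds", "singular": "mykind"},
        "versions": [{
            "name": "v1",
            "served": True,
            "storage": True,
            "schema": {"openAPIV3Schema": {"type": "object"}},
        }],
    },
}

# The custom resource from the slide, with an apiVersion added so it can be matched.
custom_resource = {
    "apiVersion": "example.com/v1",
    "kind": "MyKind",
    "metadata": {"name": "my-name"},
}


def is_instance_of(resource: dict, definition: dict) -> bool:
    """Return True if `resource` uses a group/version/kind that `definition` declares."""
    group, _, version = resource["apiVersion"].partition("/")
    spec = definition["spec"]
    served = {v["name"] for v in spec["versions"] if v["served"]}
    return (
        group == spec["group"]
        and version in served
        and resource["kind"] == spec["names"]["kind"]
    )


print(is_instance_of(custom_resource, crd))
```

The definition registers the new kind with the API server once; after that, any number of `MyKind` objects can be created like built-in resources.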
8. What is Operator !?
kind: MyKind
metadata:
  name: my-name
Operator: Custom Resource & Cluster State ➔ Cluster State
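The operator idea on this slide (read the Custom Resource and the current cluster state, then drive the cluster toward the desired state) can be sketched as a small reconcile function. This is a minimal in-memory illustration, not the code of any Kubeflow operator: the `spec.replicas` field, the `owner` label, the pod naming scheme, and the function names are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def desired_pods(resource: dict) -> Dict[str, dict]:
    """Expand a custom resource into the pods it should own (hypothetical naming)."""
    name = resource["metadata"]["name"]
    return {f"{name}-{i}": {"owner": name} for i in range(resource["spec"]["replicas"])}


def reconcile(resource: dict, cluster_pods: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Compare desired and observed state; return the actions that close the gap."""
    name = resource["metadata"]["name"]
    desired = set(desired_pods(resource))
    owned = {p for p, meta in cluster_pods.items() if meta.get("owner") == name}
    actions = [("create", p) for p in sorted(desired - owned)]
    actions += [("delete", p) for p in sorted(owned - desired)]
    return actions


def apply_actions(actions: List[Tuple[str, str]], cluster_pods: Dict[str, dict], owner: str) -> None:
    """Apply the actions to the in-memory cluster state."""
    for verb, pod in actions:
        if verb == "create":
            cluster_pods[pod] = {"owner": owner}
        else:
            del cluster_pods[pod]


# `spec.replicas` is an assumed field; the slide only shows kind and metadata.
resource = {"kind": "MyKind", "metadata": {"name": "my-name"}, "spec": {"replicas": 3}}
cluster = {
    "my-name-0": {"owner": "my-name"},
    "my-name-7": {"owner": "my-name"},  # stale pod that should go away
    "other-0": {"owner": "other"},      # not owned, must be left alone
}

actions = reconcile(resource, cluster)
print(actions)
apply_actions(actions, cluster, "my-name")
print(reconcile(resource, cluster))  # a second pass finds nothing left to do
```

Running the loop repeatedly is safe because each pass only acts on the difference between the desired and the observed state.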
9. Kubeflow’s multi ML framework support
apiVersion: kubeflow.org/v1alpha*
kind: **Job
...
Operator
CRD         Operator          ksonnet package
TFJob       tf-operator       examples **
PyTorchJob  pytorch-operator  pytorch-job
MPIJob      mpi-operator *    mpi-job
MXJob       mxnet-operator    mxnet-job
Caffe2Job   caffe2-operator   (no pkg for caffe2)
ChainerJob  chainer-operator  chainer-job
* mpi-operator supports Horovod jobs
** examples package contains TFJob
Icons made by Gregor Cresnar, Kiranshastry, Icon Pond, Icon Monk, Freepik from www.flaticon.com is licensed by CC 3.0 BY
10. Kubeflow’s multi ML framework support
All the CRDs support both single-node and multi-node machine learning jobs.
15. Two Different Distributed Training Job Styles
Icons made by Eucalyp, Smashicons from www.flaticon.com is licensed by CC 3.0 BY
Parameter Servers Style
Parameter servers
● calc gradient avgs
● send them back to Workers
Workers
● train (calc gradients) in parallel
● send them to parameter servers
All-Reduce Style
Workers
● train (calc gradients) in parallel
● exchange them with each other
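To make the difference between the two styles concrete, here is a toy, single-process sketch of one gradient-averaging step in each style. It is an illustration only: the function names are hypothetical, the gradients are made-up numbers, and the all-reduce shown is a naive ring in which whole gradients are passed around, not the bandwidth-optimized variant that real libraries use.

```python
from typing import List

Vector = List[float]


def parameter_server_step(worker_grads: List[Vector]) -> List[Vector]:
    """Workers send gradients to a parameter server, which averages and sends them back."""
    n = len(worker_grads)
    avg = [sum(column) / n for column in zip(*worker_grads)]  # done on the parameter server
    return [list(avg) for _ in worker_grads]                  # every worker gets a copy


def all_reduce_step(worker_grads: List[Vector]) -> List[Vector]:
    """Workers exchange gradients with each other around a ring; no central server."""
    n = len(worker_grads)
    totals = [list(g) for g in worker_grads]
    passing = [list(g) for g in worker_grads]
    for _ in range(n - 1):
        # Each worker forwards what it last received to its right-hand neighbour.
        passing = [passing[(i - 1) % n] for i in range(n)]
        for i in range(n):
            totals[i] = [t + p for t, p in zip(totals[i], passing[i])]
    return [[t / n for t in total] for total in totals]


grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # one made-up gradient per worker
print(parameter_server_step(grads))
print(all_reduce_step(grads))
```

Both functions leave every worker holding the same averaged gradient; what differs is who talks to whom, which is why the two styles need different sets of pods.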
16. Two Different Distributed Training Job Styles
Example of the All-Reduce style: Horovod
17. TFJob structure (Parameter Server style)
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
spec:
  cleanPodPolicy: ...    # which pods to delete when the job terminates (Running, All, None)
  tfReplicaSpecs:
    Chief: …             # orchestrates training and performs tasks such as checkpointing the model
    Evaluator: …         # computes evaluation metrics as the model is trained
    Ps: …                # parameter servers
    Worker:              # does the actual work of training the model; worker 0 might also act as the chief
      replicas: ...      # number of replicas
      restartPolicy:     # behaviour when the replicas exit (Always, OnFailure, ExitCode, Never)
      template: …        # PodTemplate
c.f. https://www.kubeflow.org/docs/guides/components/tftraining/
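As a hedged example of filling in the structure above, the following sketch builds a parameter-server-style TFJob manifest as a Python dictionary and prints it as JSON. The job name, the image, the replica counts, and the helpers `replica_spec` and `make_tfjob` are hypothetical; the placement of `cleanPodPolicy` directly under `spec` and the container name `tensorflow` are assumptions that should be checked against the tf-operator version in use.

```python
import json


def replica_spec(replicas: int, image: str, restart_policy: str = "OnFailure") -> dict:
    """One entry of tfReplicaSpecs: replica count, restart behaviour and a PodTemplate."""
    return {
        "replicas": replicas,
        "restartPolicy": restart_policy,
        "template": {
            # Assumption: the training container is named "tensorflow".
            "spec": {"containers": [{"name": "tensorflow", "image": image}]}
        },
    }


def make_tfjob(name: str, image: str, workers: int, ps: int) -> dict:
    """Build a parameter-server-style TFJob manifest (sketch, not validated by a cluster)."""
    return {
        "apiVersion": "kubeflow.org/v1alpha2",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "cleanPodPolicy": "Running",  # assumed to sit directly under spec
            "tfReplicaSpecs": {
                "Ps": replica_spec(ps, image),
                "Worker": replica_spec(workers, image),
            },
        },
    }


# Hypothetical job name and image.
manifest = make_tfjob("mnist", "example.com/my-training-image:latest", workers=4, ps=2)
print(json.dumps(manifest, indent=2))
```

The printed JSON is also valid YAML input for `kubectl`, so a manifest generated this way can be reviewed before it is submitted.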
18. Anatomy of TFJobs
TFJob ➔ tf-operator ➔ k8s: Pods and Services
● expands a TFJob into Pods and Services
● retries when pods exit, according to restartPolicy
● cleans up pods when the job finishes, according to cleanPodPolicy
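The three responsibilities listed for tf-operator can be sketched as three small functions. This is an illustration under assumptions, not the operator's actual code: the pod naming scheme, the one-service-per-replica mapping, and the rule that `ExitCode` retries only on exit codes 128 and above are assumptions, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple


def expand(tfjob: dict) -> Tuple[List[str], List[str]]:
    """List the Pod and Service names an operator might create for a TFJob."""
    name = tfjob["metadata"]["name"]
    pods = [
        f"{name}-{rtype.lower()}-{i}"  # assumed naming scheme
        for rtype, spec in tfjob["spec"]["tfReplicaSpecs"].items()
        for i in range(spec["replicas"])
    ]
    services = list(pods)  # assumption: one service per replica so peers can find each other
    return pods, services


def should_restart(restart_policy: str, exit_code: int) -> bool:
    """Decide whether an exited replica is retried."""
    if restart_policy == "Always":
        return True
    if restart_policy == "OnFailure":
        return exit_code != 0
    if restart_policy == "ExitCode":
        return exit_code >= 128  # assumption: lower non-zero codes are permanent errors
    return False  # "Never"


def pods_to_delete(clean_pod_policy: str, pod_phases: Dict[str, str]) -> List[str]:
    """Pick the pods to remove once the job has finished."""
    if clean_pod_policy == "All":
        return sorted(pod_phases)
    if clean_pod_policy == "Running":
        return sorted(p for p, phase in pod_phases.items() if phase == "Running")
    return []  # "None"


tfjob = {
    "metadata": {"name": "mnist"},
    "spec": {"tfReplicaSpecs": {"Ps": {"replicas": 1}, "Worker": {"replicas": 2}}},
}
pods, services = expand(tfjob)
print(pods)
print(should_restart("ExitCode", 137), should_restart("OnFailure", 0))
print(pods_to_delete("Running", {"mnist-ps-0": "Running", "mnist-worker-0": "Succeeded"}))
```

With `cleanPodPolicy: Running`, only pods that are still running are removed (in this toy run, the parameter server), so the logs of completed workers stay available.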
19. ChainerJob structure (All-Reduce style)
apiVersion: kubeflow.org/v1alpha2
kind: ChainerJob
spec:
  backend: mpi               # protocol used to initiate process groups (only ‘mpi’ is supported now)
  master:                    # initiates and orchestrates the distributed job
    activeDeadlineSeconds:   # same as in JobSpec
    backoffLimit:            # same as in JobSpec
    ...
  workerSets:                # a set of workerSets (for defining heterogeneous workers)
    workerSetName:           # your own workerSet name
      replicas:              # number of replicas in the workerSet
      mpiConfig:             # lets you define the number of slots for each worker
      template:              # PodTemplate
c.f. https://www.kubeflow.org/docs/guides/components/chainer/
20. Anatomy of ChainerJob
ChainerJob ➔ chainer-operator ➔ k8s: ConfigMap, Job (Pod), Service, StatefulSets (Pods)
● expands a ChainerJob into a ConfigMap, a Job, a Service and StatefulSets
● fault tolerance is borrowed from Job and StatefulSets
● scales down when the job finishes, for cleanup
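The expansion described above can be sketched in a few lines of Python. Everything here is illustrative: the object names, the `slots` default of 1, the hostfile layout (an MPI-style `host slots=N` list), and the functions `expand` and `scale_down` are assumptions for the example, not chainer-operator's real output. The one detail taken from Kubernetes itself is that StatefulSet pods are named `<statefulset-name>-<ordinal>`.

```python
def slots(replica: dict) -> int:
    """Slots per pod; defaults to 1 when mpiConfig is absent (assumed default)."""
    return replica.get("mpiConfig", {}).get("slots", 1)


def expand(job: dict) -> dict:
    """Expand a ChainerJob into a ConfigMap, a Job, a Service and StatefulSets (sketch)."""
    name = job["metadata"]["name"]
    spec = job["spec"]
    master = f"{name}-master"  # hypothetical name for the master Job
    hostfile = [f"{master} slots={slots(spec['master'])}"]
    stateful_sets = {}
    for ws_name, ws in spec.get("workerSets", {}).items():
        sts = f"{name}-workerset-{ws_name}"  # hypothetical naming scheme
        stateful_sets[sts] = ws["replicas"]
        # StatefulSet pods get stable names: <statefulset-name>-<ordinal>.
        hostfile += [f"{sts}-{i} slots={slots(ws)}" for i in range(ws["replicas"])]
    return {
        "ConfigMap": {"hostfile": "\n".join(hostfile)},
        "Job": master,
        "Service": name,
        "StatefulSets": stateful_sets,
    }


def scale_down(objects: dict) -> dict:
    """Clean up after the job finishes by scaling every StatefulSet to zero replicas."""
    return {**objects, "StatefulSets": {n: 0 for n in objects["StatefulSets"]}}


job = {
    "kind": "ChainerJob",
    "metadata": {"name": "demo"},
    "spec": {
        "backend": "mpi",
        "master": {"backoffLimit": 3},
        "workerSets": {"ws0": {"replicas": 2, "mpiConfig": {"slots": 2}}},
    },
}
objects = expand(job)
print(objects["ConfigMap"]["hostfile"])
print(scale_down(objects)["StatefulSets"])
```

Delegating the master to a Job and the workers to StatefulSets means retries and stable pod identities come from built-in controllers rather than from custom operator code.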
21. Demo Time!!
demo script
Icons made by Eucalyp from www.flaticon.com is licensed by CC 3.0 BY