.. _dlrs:

Deep Learning Reference Stack
#############################

This tutorial shows you how to run benchmarking workloads in |CL-ATTR| using
TensorFlow\* or PyTorch\* with the Deep Learning Reference Stack. We also
cover using Kubeflow for multi-node benchmarking.

.. contents::
   :local:
   :depth: 1

The Deep Learning Reference Stack is available in four versions:

* `Eigen`_, which includes `TensorFlow`_ optimized for Intel® architecture.
* `Intel MKL-DNN`_, which includes the TensorFlow framework optimized using
  Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN)
  primitives.
* `PyTorch with OpenBLAS`_, which includes PyTorch with OpenBLAS.
* `PyTorch with Intel MKL-DNN`_, which includes PyTorch optimized using
  Intel® Math Kernel Library (Intel® MKL) and Intel MKL-DNN.

Release notes
*************

* View current `release notes`_ for the Deep Learning Reference Stack.
* View current `TensorFlow benchmark results`_ for the Deep Learning
  Reference Stack with TensorFlow.
* View current `PyTorch benchmark results`_ for the Deep Learning Reference
  Stack with PyTorch.

.. note::

   Performance test numbers in the Deep Learning Reference Stack were
   obtained using `runc` as the runtime.

Prerequisites
*************

* |CL| installed on the host system. See :ref:`Install <bare-metal-install>`.
* `containers-basic` bundle
* `cloud-native-basic` bundle

In |CL|, `containers-basic` provides Docker\*, which is required for
TensorFlow and PyTorch benchmarking. Use the :command:`swupd` utility to
check if `containers-basic` and `cloud-native-basic` are present:

.. code-block:: bash

   sudo swupd bundle-list

If you need to install the `containers-basic` or `cloud-native-basic`
bundle, enter:

.. code-block:: bash

   sudo swupd bundle-add containers-basic cloud-native-basic

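The bundle check above can also be scripted. The following is a minimal
sketch, not part of the official tooling: ``missing_bundles`` is a
hypothetical helper name, and the sketch assumes that each installed bundle
appears as a separate whitespace-delimited word in the output of
:command:`swupd bundle-list`.

.. code-block:: bash

   # Hypothetical helper: reads a bundle listing on stdin and prints every
   # required bundle (given as arguments) that does not appear in it.
   missing_bundles() {
       local installed bundle
       installed="$(cat)"
       for bundle in "$@"; do
           # Match the bundle name only as a whole whitespace-delimited word,
           # so that "containers-basic-dev" does not count as "containers-basic".
           if ! printf '%s\n' "$installed" |
               grep -Eq "(^|[[:space:]])${bundle}([[:space:]]|\$)"; then
               echo "$bundle"
           fi
       done
   }

On a |CL| host, you could pipe the real listing into the helper with
``sudo swupd bundle-list | missing_bundles containers-basic cloud-native-basic``.
Empty output would mean that both bundles are already present.
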
To ensure that Kubernetes is correctly installed and configured, follow
:ref:`kubernetes`.

We have validated these steps against the following software package
versions:

* |CL| 26240 (the lowest permissible version)
* Docker 18.06.1
* Kubernetes 1.11.3
* Go 1.11.12

TensorFlow single and multi-node benchmarks
*******************************************

This section describes running the `TensorFlow benchmarks`_ on a single node.
For multi-node testing, replicate these steps for each node. These steps
provide a template to run other benchmarks, provided that they can invoke
TensorFlow.

#. Download either the `Eigen`_ or the `Intel MKL-DNN`_ Docker image
   from `Docker Hub`_.

#. Run the image with Docker:

   .. code-block:: bash

      docker run --name <image name> --rm -i -t <clearlinux/stacks-dlrs-TYPE> bash

   .. note::

      Launching the Docker image with the :command:`-i` argument puts
      you into interactive mode within the container. Enter the
      following commands in the running container.

#. Clone the benchmark repository:

   .. code-block:: bash

      git clone http://github.com/tensorflow/benchmarks -b cnn_tf_v1.12_compatible

#. Execute the benchmark script:

   .. code-block:: bash

      python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --device=cpu --model=resnet50 --data_format=NHWC

   .. note::

      You can replace the model with one of your choice supported by the
      TensorFlow benchmarks.

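Because only the ``--model`` value changes between runs, the benchmark
command can be generated for several models at once. The sketch below is
illustrative only: ``benchmark_cmd`` is a hypothetical helper, and it assumes
that ``alexnet`` is available in the benchmarks branch you cloned. It prints
the commands instead of running them, so you can review them first.

.. code-block:: bash

   # Hypothetical helper: print the benchmark command for one model name.
   benchmark_cmd() {
       printf 'python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --device=cpu --model=%s --data_format=NHWC\n' "$1"
   }

   # Dry run: print one command per model.
   for model in resnet50 alexnet; do
       benchmark_cmd "$model"
   done

Inside the running container, piping this output to :command:`bash` would run
each benchmark in turn.
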
PyTorch single and multi-node benchmarks
****************************************

This section describes running the `PyTorch benchmarks`_ for Caffe2 on a
single node. We validate the Caffe2 APIs with the official benchmarks, but
the same process applies to other cases.

#. Download either the `PyTorch with OpenBLAS`_ or the
   `PyTorch with Intel MKL-DNN`_ Docker image from `Docker Hub`_.

#. Run the image with Docker:

   .. code-block:: bash

      docker run --name <image name> --rm -i -t <clearlinux/stacks-dlrs-TYPE> bash

   .. note::

      Launching the Docker image with the :command:`-i` argument puts
      you into interactive mode within the container. Enter the
      following commands in the running container.

#. Clone the benchmark repository:

   .. code-block:: bash

      git clone https://github.com/pytorch/pytorch.git

#. Execute the benchmark script:

   .. code-block:: bash

      cd pytorch/caffe2/python
      python convnet_benchmarks.py --batch_size 32 \
                                   --cpu \
                                   --model AlexNet

Kubeflow multi-node benchmarks
******************************

The benchmark workload runs in a Kubernetes cluster. We use `Kubeflow`_ to
deploy the machine learning workload on three nodes.

Kubernetes setup
================

Follow the instructions in the :ref:`kubernetes` tutorial to get set up on
|CL|. The Kubernetes community also has
`instructions for creating a cluster`_.

Kubernetes networking
=====================

We used `flannel`_ as the network provider for these tests. If you are
comfortable with another network layer, refer to the Kubernetes
`networking documentation`_ for setup.

Images
======

We need to add `launcher.py` to a Docker image that includes the Deep
Learning Reference Stack and put the benchmarks repository in the correct
location. From within the Docker image, run the following:

.. code-block:: bash

   mkdir -p /opt
   git clone https://github.com/tensorflow/benchmarks.git /opt/tf-benchmarks
   cp launcher.py /opt
   chmod u+x /opt/*

Your entry point now becomes `/opt/launcher.py`.

The resulting image can be consumed directly by TFJob from Kubeflow. We are
working to create these images as part of our release cycle.

ksonnet\*
=========

Kubeflow uses ksonnet\* to manage deployments, so we need to install it
before setting up Kubeflow.

As of |CL| version 27550, ksonnet is included in the `cloud-native-basic`
bundle. If you are using an older version (not recommended), install
ksonnet manually as shown below.

On |CL|, follow these steps:

.. code-block:: bash

   sudo swupd bundle-add go-basic-dev
   export GOPATH=$HOME/go
   export PATH=$PATH:$GOPATH/bin
   go get github.com/ksonnet/ksonnet
   cd $GOPATH/src/github.com/ksonnet/ksonnet
   make install

After the ksonnet installation is complete, ensure that the `ks` binary is
accessible across the environment.

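One way to confirm that `ks` is reachable is to check for it on ``PATH``
before continuing. This is a minimal sketch under stated assumptions:
``require_cmds`` is a hypothetical helper name, and checking for
:command:`kubectl` as well is our own suggestion. Note that an ``export PATH``
entered at the prompt applies only to the current shell session.

.. code-block:: bash

   # Hypothetical helper: succeeds only if every named command is on PATH.
   require_cmds() {
       local cmd status=0
       for cmd in "$@"; do
           if ! command -v "$cmd" >/dev/null 2>&1; then
               echo "missing command: $cmd" >&2
               status=1
           fi
       done
       return "$status"
   }

   # Example use before the Kubeflow steps:
   #   require_cmds ks kubectl || echo "fix PATH before continuing"

If the check reports `ks` as missing in a new shell, adding the ``PATH``
export to your shell profile is one possible fix.
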
Kubeflow
========

Once you have Kubernetes running on your nodes, you can set up `Kubeflow`_ by
following these instructions from their `quick start guide`_.

.. code-block:: bash

   export KUBEFLOW_SRC=$HOME/kflow
   export KUBEFLOW_TAG="v0.4.1"
   export KFAPP="kflow_app"
   export K8S_NAMESPACE="kubeflow"

   mkdir ${KUBEFLOW_SRC}
   cd ${KUBEFLOW_SRC}
   ks init ${KFAPP}
   cd ${KFAPP}
   ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${KUBEFLOW_TAG}/kubeflow
   ks pkg install kubeflow/common
   ks pkg install kubeflow/tf-training

Now you have all the required Kubeflow packages, and you can deploy the
primary one for our purposes: tf-job-operator.

.. code-block:: bash

   ks env rm default
   kubectl create namespace ${K8S_NAMESPACE}
   ks env add default --namespace "${K8S_NAMESPACE}"
   ks generate tf-job-operator tf-job-operator
   ks apply default -c tf-job-operator

This creates the CustomResourceDefinition (CRD) endpoint to launch a TFJob.

Run a TFJob
***********

#. Select this link for the `ksonnet registries for deploying TFJobs`_.

#. Install the TFJob components as follows:

   .. code-block:: bash

      ks registry add dlrs-tfjob github.com/clearlinux/dockerfiles/tree/master/stacks/dlrs/kubeflow/dlrs-tfjob

      ks pkg install dlrs-tfjob/dlrs-bench

#. Export the image name you'd like to use for the deployment:

   .. code-block:: bash

      export DLRS_IMAGE=<docker_name>

   .. note::

      Replace <docker_name> with the image name you specified in previous steps.

#. Next, generate Kubernetes manifests for the workloads and apply them to
   create and run them using these commands:

   .. code-block:: bash

      ks generate dlrs-resnet50 dlrsresnet50 --name=dlrsresnet50 --image=${DLRS_IMAGE}
      ks generate dlrs-alexnet dlrsalexnet --name=dlrsalexnet --image=${DLRS_IMAGE}
      ks apply default -c dlrsresnet50
      ks apply default -c dlrsalexnet

This will replicate and deploy three test setups in your Kubernetes cluster.

Results of Running this Tutorial
********************************

You need to parse the logs of the Kubernetes pods to get the performance
numbers. The pods remain after the jobs finish and are in the "Completed"
state. You can get the logs from any of the pods to inspect the
benchmark results. More information about `Kubernetes logging`_ is available
from the Kubernetes community.

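The log parsing described above could be sketched as follows. This is a
hedged example, not an official tool: ``completed_pods`` and
``images_per_sec`` are hypothetical helper names. The sketch assumes that the
pod listing uses the default :command:`kubectl get pods` column layout, with
the status in the third column, and that the benchmark log contains a summary
line of the form ``total images/sec: <number>``.

.. code-block:: bash

   # Print the names of pods whose STATUS column is "Completed".
   # Expects `kubectl get pods` style output on stdin (header line, then rows).
   completed_pods() {
       awk 'NR > 1 && $3 == "Completed" { print $1 }'
   }

   # Print the throughput value from a benchmark log read on stdin.
   # Assumes a summary line of the form "total images/sec: <number>".
   images_per_sec() {
       awk -F': ' '/total images\/sec/ { print $2 }'
   }

   # Example use on a cluster (the namespace variable comes from the
   # Kubeflow setup steps):
   #   kubectl get pods -n "${K8S_NAMESPACE}" | completed_pods
   #   kubectl logs -n "${K8S_NAMESPACE}" <pod name> | images_per_sec

If your logs use a different summary format, adjust the pattern in
``images_per_sec`` to match.
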
.. _TensorFlow: https://www.tensorflow.org/

.. _Kubeflow: https://www.kubeflow.org/

.. _Docker Hub: https://hub.docker.com/

.. _TensorFlow benchmarks: https://www.tensorflow.org/guide/performance/benchmarks

.. _PyTorch benchmarks: https://github.com/pytorch/pytorch/blob/master/caffe2/python/convnet_benchmarks.py

.. _instructions for creating a cluster: https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/

.. _flannel: https://github.com/coreos/flannel

.. _networking documentation: https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/#pod-network

.. _quick start guide: https://www.kubeflow.org/docs/started/getting-started/

.. _Eigen: https://hub.docker.com/r/clearlinux/stacks-dlrs-oss/

.. _Intel MKL-DNN: https://hub.docker.com/r/clearlinux/stacks-dlrs-mkl/

.. _PyTorch with OpenBLAS: https://hub.docker.com/r/clearlinux/stacks-pytorch-oss

.. _PyTorch with Intel MKL-DNN: https://hub.docker.com/r/clearlinux/stacks-pytorch-mkl

.. _release notes: https://github.com/clearlinux/dockerfiles/tree/master/stacks/dlrs

.. _ksonnet registries for deploying TFJobs: https://github.com/clearlinux/dockerfiles/tree/master/stacks/dlrs/kubeflow/dlrs-tfjob

.. _Kubernetes logging: https://kubernetes.io/docs/concepts/cluster-administration/logging/

.. _TensorFlow benchmark results: https://clearlinux.org/stacks/deep-learning-reference-stack

.. _PyTorch benchmark results: https://clearlinux.org/stacks/deep-learning-reference-stack-pytorch