diff --git a/source/clear-linux/tutorials/dlrs/dlrs.rst b/source/clear-linux/tutorials/dlrs/dlrs.rst index 4fe7e62d..7af186c9 100644 --- a/source/clear-linux/tutorials/dlrs/dlrs.rst +++ b/source/clear-linux/tutorials/dlrs/dlrs.rst @@ -3,81 +3,95 @@ Deep Learning Reference Stack ############################# -This tutorial shows you how to run benchmarking workloads in |CL-ATTR| using -TensorFlow\* or PyTorch\* with the Deep Learning Reference Stack. We also -cover using Kubeflow for multi-node benchmarking. +This tutorial describes how to run benchmarking workloads for TensorFlow\*, +PyTorch\*, and Kubeflow in |CL-ATTR| using the Deep Learning Reference Stack. + .. contents:: :local: :depth: 1 -The Deep Learning Reference Stack is available in five versions: +Overview +******** -* `Intel MKL-DNN-VNNI`_, which is optimized using Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) primitives and introduces support for AVX-512 Vector Neural Network Instructions (VNNI). -* `Intel MKL-DNN`_, which includes the TensorFlow framework optimized using Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) primitives. +We created the Deep Learning Reference Stack to help AI developers deliver the +best experience on Intel® Architecture. This stack reduces the complexity common +to deep learning software components, provides flexibility for customized +solutions, and enables you to quickly prototype and deploy deep learning +workloads. Use this tutorial to run benchmarking workloads on your solution. + +The Deep Learning Reference Stack is available in the following versions: + +* `Intel MKL-DNN-VNNI`_, which is optimized using Intel® Math Kernel Library + for Deep Neural Networks (Intel® MKL-DNN) primitives and introduces support + for Intel® AVX-512 Vector Neural Network Instructions (VNNI).
+* `Intel MKL-DNN`_, which includes the TensorFlow framework optimized using + Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) primitives. * `Eigen`_, which includes `TensorFlow`_ optimized for Intel® architecture. * `PyTorch with OpenBLAS`_, which includes PyTorch with OpenBlas. -* `PyTorch with Intel MKL-DNN`_, which includes PyTorch optimized using Intel® Math Kernel Library (Intel® MKL)and Intel MKL-DNN. +* `PyTorch with Intel MKL-DNN`_, which includes PyTorch optimized using Intel® + Math Kernel Library (Intel® MKL) and Intel MKL-DNN. .. note:: - To take advantage of the AVX-512 and VNNI functionality with the Deep Learning Reference Stack, please use the following hardware: - * AVX 512 images requires an Intel® Xeon® Scalable Platform - * VNNI requires a Second-Generation Intel® Xeon® Scalable Platform + To take advantage of the Intel® AVX-512 and VNNI functionality with the Deep + Learning Reference Stack, you must use the following hardware: + + * Intel® AVX-512 images require an Intel® Xeon® Scalable Platform + * VNNI requires a 2nd generation Intel® Xeon® Scalable Platform -Release notes -************* +Stack features +============== -* View current `release notes`_ for the Deep Learning Reference Stack V3. -* View current `PyTorch benchmark results`_ for the Deep Learning Reference Stack with PyTorch, DLRS V2. -* View current `TensorFlow benchmark results`_ for the first release of the Deep Learning Reference Stack with TensorFlow. -* Go to the `github release notes`_ for the latest release. +* Deep Learning Reference Stack `V3.0 release announcement`_. +* Deep Learning Reference Stack V2.0, including current `PyTorch benchmark results`_. +* Deep Learning Reference Stack V1.0, including current `TensorFlow benchmark results`_. +* `Release notes on GitHub\*`_ for the latest release of the Deep Learning Reference Stack. .. note:: - Performance test numbers in the Deep Learning Reference Stack were obtained using `runc` as the runtime.
+ Performance test results for the Deep Learning Reference Stack were + obtained using `runc` as the runtime. Prerequisites -************* +============= -* |CL| installed on host system. :ref:`Install ` -* `containers-basic` bundle -* `cloud-native-basic` bundle +* :ref:`Install ` |CL| on your host system. +* :command:`containers-basic` bundle +* :command:`cloud-native-basic` bundle -In |CL|, `containers-basic` provides Docker\*, which is required for +In |CL|, :command:`containers-basic` includes Docker\*, which is required for TensorFlow and PyTorch benchmarking. Use the :command:`swupd` utility to -check if `containers-basic` and `cloud-native-basic` are present: +check if :command:`containers-basic` and :command:`cloud-native-basic` are present: .. code-block:: bash sudo swupd bundle-list -If you need to install the `containers-basic` or `cloud-native-basic`, enter: +To install the :command:`containers-basic` or :command:`cloud-native-basic` bundles, enter: .. code-block:: bash sudo swupd bundle-add containers-basic cloud-native-basic -Note that docker is not started upon installation of the containers-basic bundle. To start docker, enter: - +Docker does not start automatically when you install the :command:`containers-basic` +bundle. To start Docker, enter: .. code-block:: bash sudo systemctl start docker +To ensure that Kubernetes is correctly installed and configured, follow the +instructions in :ref:`kubernetes`. +Version compatibility +===================== -To ensure that Kubernetes is correctly installed and configured, follow -:ref:`kubernetes`. +We validated these steps against the following software package versions: - - -We have validated these steps against the following software package -versions: - -* |CL| 26240--lowest version permissible. +* |CL| 26240 (lower versions are not supported) * Docker 18.06.1 * Kubernetes 1.11.3 * Go 1.11.12 @@ -90,7 +104,7 @@ For multi-node testing, replicate these steps for each node.
These steps provide a template to run other benchmarks, provided that they can invoke TensorFlow. -#. Download either the `Eigen`_ or the `Intel MKL-DNN`_ docker image +#. Download either the `Eigen`_ or the `Intel MKL-DNN`_ Docker image from `Docker Hub`_. #. Run the image with Docker: @@ -102,9 +116,9 @@ TensorFlow. .. note:: - Launching the docker image with the :command:`-i` argument will put - you into interactive mode within the container. You will enter the - following commands in the running container. The following commands are executed within the scope of the container. + Launching the Docker image with the :command:`-i` argument starts + interactive mode within the container. Enter the following commands in + the running container. #. Clone the benchmark repository in the container: @@ -112,7 +126,7 @@ TensorFlow. git clone http://github.com/tensorflow/benchmarks -b cnn_tf_v1.12_compatible -#. Next, execute the benchmark script to run the benchmark. +#. Execute the benchmark script: .. code-block:: bash @@ -127,12 +141,10 @@ PyTorch single and multi-node benchmarks **************************************** This section describes running the `PyTorch benchmarks`_ for Caffe2 in -single node. We will be looking at validating the Caffe2 APIs with the -official benchmarks, but the same process applies for other cases. +a single-node configuration. #. Download either the `PyTorch with OpenBLAS`_ or the `PyTorch with Intel - MKL-DNN`_ docker image - from `Docker Hub`_. + MKL-DNN`_ Docker image from `Docker Hub`_. #. Run the image with Docker: @@ -142,17 +154,17 @@ official benchmarks, but the same process applies for other cases. .. note:: - Launching the docker image with the :command:`-i` argument will put - you into interactive mode within the container. You will enter the - following commands in the running container. + Launching the Docker image with the :command:`-i` argument starts + interactive mode within the container.
Enter the following commands in + the running container. #. Clone the benchmark repository: .. code-block:: bash - git clone https://github.com/pytorch/pytorch.git + git clone https://github.com/pytorch/pytorch.git -#. Next, execute the benchmark script to run the benchmark. +#. Execute the benchmark script: .. code-block:: bash @@ -164,29 +176,29 @@ official benchmarks, but the same process applies for other cases. Kubeflow multi-node benchmarks ****************************** -The benchmark workload will run in a Kubernetes cluster. We will use +The benchmark workload runs in a Kubernetes cluster. This tutorial uses `Kubeflow`_ for the Machine Learning workload deployment on three nodes. Kubernetes setup ================ Follow the instructions in the :ref:`kubernetes` tutorial to get set up on -|CL|. The kubernetes community also has +|CL|. The Kubernetes community also has `instructions for creating a cluster`_. Kubernetes networking ===================== -We used `flannel`_ as the network provider for these tests. If you are -comfortable with another network layer, refer to the Kubernetes +We used `flannel`_ as the network provider for these tests. If you +prefer a different network layer, refer to the Kubernetes `networking documentation`_ for setup. Images ====== -We need to add `launcher.py` to our docker image to include the Deep +You must add `launcher.py` to the Docker image to include the Deep Learning Reference Stack and put the benchmarks repo in the correct -location. From the docker image, run the following: +location. From the Docker image, run the following: .. code-block:: bash @@ -195,21 +207,19 @@ location. From the docker image, run the following: cp launcher.py /opt chmod u+x /opt/* -Your entry point now becomes "/opt/launcher.py". +Your entry point becomes :file:`/opt/launcher.py`. -This will build an image which can be consumed directly by TFJob from -kubeflow. We are working to create these images as part of our release -cycle.
+This builds an image that can be consumed directly by TFJob from Kubeflow. ksonnet\* ========= -Kubeflow uses ksonnet\* to manage deployments, so we need to install that +Kubeflow uses ksonnet\* to manage deployments, so you must install it before setting up Kubeflow. -Since Clear Linux version 27550, the ksonnet was added to the bundle -cloud-native-basic. But if using old versions (not recommended), please -manually install the ksonnet as below. +ksonnet was added to the :command:`cloud-native-basic` bundle in |CL| version 27550. If +you are using an older |CL| version (not recommended), you must manually +install ksonnet as described below. On |CL|, follow these steps: @@ -228,8 +238,8 @@ accessible across the environment. Kubeflow ======== -Once you have Kubernetes running on your nodes, you can setup `Kubeflow`_ by -following these instructions from their `quick start guide`_. +Once you have Kubernetes running on your nodes, set up `Kubeflow`_ by +following these instructions from the `quick start guide`_. .. code-block:: bash @@ -246,7 +256,7 @@ following these instructions from their `quick start guide`_. ks pkg install kubeflow/common ks pkg install kubeflow/tf-training -Now you have all the required kubeflow packages, and you can deploy the primary one for our purposes: tf-job-operator. +Next, deploy the primary package for our purposes: tf-job-operator. .. code-block:: bash @@ -256,22 +266,22 @@ Now you have all the required kubeflow packages, and you can deploy the primary ks generate tf-job-operator tf-job-operator ks apply default -c tf-job-operator -This creates the CustomResourceDefinition(CRD) endpoint to launch a TFJob. +This creates the CustomResourceDefinition (CRD) endpoint to launch a TFJob. Run a TFJob -*********** +=========== #. Select this link for the `ksonnet registries for deploying TFJobs`_. - #. Install the TFJob componets as follows: +#. Install the TFJob components as follows: - .. code-block:: bash + .. 
code-block:: bash - ks registry add dlrs-tfjob github.com/clearlinux/dockerfiles/tree/master/stacks/dlrs/kubeflow/dlrs-tfjob + ks registry add dlrs-tfjob github.com/clearlinux/dockerfiles/tree/master/stacks/dlrs/kubeflow/dlrs-tfjob - ks pkg install dlrs-tfjob/dlrs-bench + ks pkg install dlrs-tfjob/dlrs-bench -#. Export the image name you'd like to use for the deployment: +#. Export the image name to use for the deployment: .. code-block:: bash @@ -281,8 +291,7 @@ Run a TFJob Replace with the image name you specified in previous steps. -#. Next, generate Kubernetes manifests for the workloads and apply them to - create and run them using these commands +#. Generate Kubernetes manifests for the workloads and apply them using these commands: .. code-block:: bash @@ -291,13 +300,13 @@ Run a TFJob ks apply default -c dlrsresnet50 ks apply default -c dlrsalexnet -This will replicate and deploy three test setups in your Kubernetes cluster. +This replicates and deploys three test setups in your Kubernetes cluster. -Results of Running this Tutorial +Results of running this tutorial ================================ -You need to parse the logs of the Kubernetes pod to get the performance -numbers. The pods will still be around post completion and will be in +You must parse the logs of the Kubernetes pod to retrieve performance +data. After the job completes, the pods remain available in the ‘Completed’ state. You can get the logs from any of the pods to inspect the benchmark results. More information about `Kubernetes logging`_ is available from the Kubernetes community. @@ -305,19 +314,22 @@ from the Kubernetes community. Use Jupyter Notebook ******************** -We will use the `PyTorch with OpenBLAS`_ container image for these steps. Once it is downloaded, run the docker image with :command:`-p` to specify the shared port between the container and the host. For this example we will use port 8888. +This example uses the `PyTorch with OpenBLAS`_ container image.
After it is +downloaded, run the Docker image with :command:`-p` to specify the shared port +between the container and the host. This example uses port 8888. .. code-block:: bash - docker run --name pytorchtest --rm -i -t -p 8888:8888 clearlinux/stacks-pytorch-oss bash + docker run --name pytorchtest --rm -i -t -p 8888:8888 clearlinux/stacks-pytorch-oss bash -After you've started the container, you can launch the Jupyter Notebook. This command is executed inside the container image. +After you start the container, launch the Jupyter Notebook. This +command is executed inside the container image. .. code-block:: bash - jupyter notebook --ip 0.0.0.0 --no-browser --allow-root + jupyter notebook --ip 0.0.0.0 --no-browser --allow-root -Once the notebook has loaded, you will see output similar to the following: +After the notebook has loaded, you will see output similar to the following: .. code-block:: console @@ -325,13 +337,15 @@ Once the notebook has loaded, you will see output similar to the following: Or copy and paste one of these URLs: http://(846e526765e3 or 127.0.0.1):8888/?token=6357dbd072bea7287c5f0b85d31d70df344f5d8843fbfa09 -From your host system, or any system that can access the host's IP address, start a web browser with the following. If you are not running the browser on the host system, replace :command:`127.0.0.1` with the IP address of the host. +From your host system, or any system that can access the host's IP address, +start a web browser with the following. If you are not running the browser on +the host system, replace :command:`127.0.0.1` with the IP address of the host. .. code-block:: bash http://127.0.0.1:8888/?token=6357dbd072bea7287c5f0b85d31d70df344f5d8843fbfa09 -Your browser will display the following: +Your browser displays the following: .. 
figure:: figures/dlrs-fig-1.png :scale: 50 % @@ -340,7 +354,7 @@ Your browser will display the following: Figure 1: :guilabel:`Jupyter Notebook` -To create a new notebook, click on :guilabel:`New` and select :guilabel:`Python 3` +To create a new notebook, click :guilabel:`New` and select :guilabel:`Python 3`. .. figure:: figures/dlrs-fig-2.png :scale: 50% @@ -348,7 +362,7 @@ To create a new notebook, click on :guilabel:`New` and select :guilabel:`Python Figure 2: Create a new notebook -You will be presented with a new, blank notebook, with a cell ready for input. +A new, blank notebook is displayed, with a cell ready for input. .. figure:: figures/dlrs-fig-3.png :scale: 50% @@ -357,12 +371,12 @@ You will be presented with a new, blank notebook, with a cell ready for input. To verify that PyTorch is working, copy the following snippet into the blank cell, and run the cell. - .. code-block:: console +.. code-block:: console - from __future__ import print_function - import torch - x = torch.rand(5, 3) - print(x) + from __future__ import print_function + import torch + x = torch.rand(5, 3) + print(x) .. figure:: figures/dlrs-fig-4.png :scale: 50% @@ -374,10 +388,19 @@ When you run the cell, your output will look something like this: :scale: 50% :alt: code output -You can continue working in this notebook, or you can download existing notebooks to take advantage of the Deep Learning Reference Stack's optimized deep learning frameworks. More information on `Jupyter Notebook`_. - +You can continue working in this notebook, or you can download existing +notebooks to take advantage of the Deep Learning Reference Stack's optimized +deep learning frameworks. Refer to `Jupyter Notebook`_ for details. +Related topics +************** +* Deep Learning Reference Stack `V3.0 release announcement`_ +* `TensorFlow benchmarks`_ +* `PyTorch benchmarks`_ +* `Kubeflow`_ +* :ref:`kubernetes` tutorial +* `Jupyter Notebook`_ .. 
_TensorFlow: https://www.tensorflow.org/ @@ -408,7 +431,7 @@ You can continue working in this notebook, or you can download existing notebook .. _Intel MKL-DNN-VNNI: https://hub.docker.com/r/clearlinux/stacks-dlrs-mkl-vnni -.. _release notes: https://clearlinux.org/stacks/deep-learning-reference-stack-v3 +.. _V3.0 release announcement: https://clearlinux.org/stacks/deep-learning-reference-stack-v3 .. _ksonnet registries for deploying TFJobs: https://github.com/clearlinux/dockerfiles/tree/master/stacks/dlrs/kubeflow/dlrs-tfjob @@ -420,4 +443,4 @@ You can continue working in this notebook, or you can download existing notebook .. _Jupyter Notebook: https://jupyter.org/ -.. _github release notes: https://github.com/clearlinux/dockerfiles/blob/master/stacks/dlrs/releasenote.md +.. _Release notes on GitHub\*: https://github.com/clearlinux/dockerfiles/blob/master/stacks/dlrs/releasenote.md
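
The "Results of running this tutorial" section asks you to parse the pod logs for performance data. The sketch below shows one possible way to do that with standard shell tools. It assumes the benchmark log contains a summary line of the form ``total images/sec: <number>``, which is what the TensorFlow benchmark script typically prints; the function name ``extract_images_per_sec`` and the sample log values are hypothetical and only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the numeric value from every
# "total images/sec: <number>" line read on standard input.
# Assumption: the benchmark log uses this summary format.
extract_images_per_sec() {
    sed -n 's/^.*total images\/sec:[[:space:]]*\([0-9][0-9.]*\).*$/\1/p'
}

# Example with an inline sample log (the numbers are made up for illustration).
printf '%s\n' \
    'Step Img/sec total_loss' \
    '10 images/sec: 95.1 +/- 0.3 (jitter = 0.8) 7.9' \
    'total images/sec: 94.87' | extract_images_per_sec
```

On a cluster, the input would typically come from ``kubectl logs`` for one of the completed pods, piped into the function in place of the sample text.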