mirror of https://github.com/clearlinux/clear-linux-documentation.git
synced 2026-06-30 01:35:59 +00:00
Completed TCS editorial review (#490)
* Completed TCS editorial review
* Resolved feedback.

Signed-off-by: MCamp859 <maryx.camp@intel.com>
committed by michael vincerra
parent de8f97102c
commit 046cfd2bd2
@@ -3,81 +3,95 @@
 Deep Learning Reference Stack
 #############################

-This tutorial shows you how to run benchmarking workloads in |CL-ATTR| using
-TensorFlow\* or PyTorch\* with the Deep Learning Reference Stack. We also
-cover using Kubeflow for multi-node benchmarking.
+This tutorial describes how to run benchmarking workloads for TensorFlow\*,
+PyTorch\*, and Kubeflow in |CL-ATTR| using the Deep Learning Reference Stack.


 .. contents::
 :local:
 :depth: 1

-The Deep Learning Reference Stack is available in five versions:
+Overview
+********

-* `Intel MKL-DNN-VNNI`_, which is optimized using Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) primitives and introduces support for AVX-512 Vector Neural Network Instructions (VNNI).
-* `Intel MKL-DNN`_, which includes the TensorFlow framework optimized using Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) primitives.
+We created the Deep Learning Reference Stack to help AI developers deliver the
+best experience on Intel® Architecture. This stack reduces complexity common
+with deep learning software components, provides flexibility for customized
+solutions, and enables you to quickly prototype and deploy Deep Learning
+workloads. Use this tutorial to run benchmarking workloads on your solution.

+The Deep Learning Reference Stack is available in the following versions:

+* `Intel MKL-DNN-VNNI`_, which is optimized using Intel® Math Kernel Library
+for Deep Neural Networks (Intel® MKL-DNN) primitives and introduces support
+for Intel® AVX-512 Vector Neural Network Instructions (VNNI).
+* `Intel MKL-DNN`_, which includes the TensorFlow framework optimized using
+Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) primitives.
 * `Eigen`_, which includes `TensorFlow`_ optimized for Intel® architecture.
 * `PyTorch with OpenBLAS`_, which includes PyTorch with OpenBlas.
-* `PyTorch with Intel MKL-DNN`_, which includes PyTorch optimized using Intel® Math Kernel Library (Intel® MKL)and Intel MKL-DNN.
+* `PyTorch with Intel MKL-DNN`_, which includes PyTorch optimized using Intel®
+Math Kernel Library (Intel® MKL) and Intel MKL-DNN.


 .. note::

-To take advantage of the AVX-512 and VNNI functionality with the Deep Learning Reference Stack, please use the following hardware:
-* AVX 512 images requires an Intel® Xeon® Scalable Platform
-* VNNI requires a Second-Generation Intel® Xeon® Scalable Platform
+To take advantage of the Intel® AVX-512 and VNNI functionality with the Deep
+Learning Reference Stack, you must use the following hardware:

+* Intel® AVX-512 images require an Intel® Xeon® Scalable Platform
+* VNNI requires a 2nd generation Intel® Xeon® Scalable Platform


-Release notes
-*************
+Stack features
+==============

-* View current `release notes`_ for the Deep Learning Reference Stack V3.
-* View current `PyTorch benchmark results`_ for the Deep Learning Reference Stack with PyTorch, DLRS V2.
-* View current `TensorFlow benchmark results`_ for the first release of the Deep Learning Reference Stack with TensorFlow.
-* Go to the `github release notes`_ for the latest release.
+* Deep Learning Reference Stack `V3.0 release announcement`_.
+* Deep Learning Reference Stack v2.0 including current `PyTorch benchmark results`_.
+* Deep Learning Reference Stack v1.0 including current `TensorFlow benchmark results`_.
+* `Release notes on Github\*`_ for the latest release of Deep Learning Reference Stack.

 .. note::

-Performance test numbers in the Deep Learning Reference Stack were obtained using `runc` as the runtime.
+Performance test results for the Deep Learning Reference Stack were
+obtained using `runc` as the runtime.

 Prerequisites
-*************
+=============

-* |CL| installed on host system. :ref:`Install <bare-metal-install-desktop>`
-* `containers-basic` bundle
-* `cloud-native-basic` bundle
+* :ref:`Install <bare-metal-install-desktop>` |CL| on your host system.
+* :command:`containers-basic` bundle
+* :command:`cloud-native-basic` bundle

-In |CL|, `containers-basic` provides Docker\*, which is required for
+In |CL|, :command:`containers-basic` includes Docker\*, which is required for
 TensorFlow and PyTorch benchmarking. Use the :command:`swupd` utility to
-check if `containers-basic` and `cloud-native-basic` are present:
+check if :command:`containers-basic` and :command:`cloud-native-basic` are present:

 .. code-block:: bash

 sudo swupd bundle-list

-If you need to install the `containers-basic` or `cloud-native-basic`, enter:
+To install the :command:`containers-basic` or :command:`cloud-native-basic` bundles, enter:

 .. code-block:: bash

 sudo swupd bundle-add containers-basic cloud-native-basic

-Note that docker is not started upon installation of the containers-basic bundle. To start docker, enter:

+Docker is not started upon installation of the :command:`containers-basic`
+bundle. To start Docker, enter:

 .. code-block:: bash

 sudo systemctl start docker

+To ensure that Kubernetes is correctly installed and configured, follow the
+instructions in :ref:`kubernetes`.

+Version compatibility
+=====================

-To ensure that Kubernetes is correctly installed and configured, follow
-:ref:`kubernetes`.
+We validated these steps against the following software package versions:



-We have validated these steps against the following software package
-versions:

-* |CL| 26240--lowest version permissible.
+* |CL| 26240 (Lower version not supported.)
 * Docker 18.06.1
 * Kubernetes 1.11.3
 * Go 1.11.12
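
The bundle check described in the Prerequisites text above can be scripted. The following is a minimal sketch, not part of the tutorial: the function names are hypothetical, and it assumes that ``swupd bundle-list`` prints one bundle name per line, optionally prefixed with ``-``. Adjust the parsing if your swupd version formats its output differently.

```python
import subprocess

# Bundles the tutorial lists as prerequisites.
REQUIRED_BUNDLES = {"containers-basic", "cloud-native-basic"}


def missing_bundles(bundle_list_output, required=REQUIRED_BUNDLES):
    """Return the required bundles that do not appear in the listing."""
    installed = set()
    for line in bundle_list_output.splitlines():
        # Assumption: one bundle name per line, possibly prefixed with "- ".
        name = line.strip().lstrip("-").strip()
        if name:
            installed.add(name)
    return sorted(set(required) - installed)


def read_bundle_list():
    """Run the tutorial's listing command and return its standard output."""
    result = subprocess.run(
        ["sudo", "swupd", "bundle-list"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

On a |CL| host, ``missing_bundles(read_bundle_list())`` returns the names to pass to ``sudo swupd bundle-add``; an empty list means both bundles are already present.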
@@ -90,7 +104,7 @@ For multi-node testing, replicate these steps for each node. These steps
 provide a template to run other benchmarks, provided that they can invoke
 TensorFlow.

-#. Download either the `Eigen`_ or the `Intel MKL-DNN`_ docker image
+#. Download either the `Eigen`_ or the `Intel MKL-DNN`_ Docker image
 from `Docker Hub`_.

 #. Run the image with Docker:
@@ -102,9 +116,9 @@ TensorFlow.

 .. note::

-Launching the docker image with the :command:`-i` argument will put
-you into interactive mode within the container. You will enter the
-following commands in the running container. The following commands are executed within the scope of the container.
+Launching the Docker image with the :command:`-i` argument starts
+interactive mode within the container. Enter the following commands in
+the running container.

 #. Clone the benchmark repository in the container:

@@ -112,7 +126,7 @@ TensorFlow.

 git clone http://github.com/tensorflow/benchmarks -b cnn_tf_v1.12_compatible

-#. Next, execute the benchmark script to run the benchmark.
+#. Execute the benchmark script:

 .. code-block:: bash

@@ -127,12 +141,10 @@ PyTorch single and multi-node benchmarks
 ****************************************

 This section describes running the `PyTorch benchmarks`_ for Caffe2 in
-single node. We will be looking at validating the Caffe2 APIs with the
-official benchmarks, but the same process applies for other cases.
+single node.

 #. Download either the `PyTorch with OpenBLAS`_ or the `PyTorch with Intel
-MKL-DNN`_ docker image
-from `Docker Hub`_.
+MKL-DNN`_ Docker image from `Docker Hub`_.

 #. Run the image with Docker:

@@ -142,17 +154,17 @@ official benchmarks, but the same process applies for other cases.

 .. note::

-Launching the docker image with the :command:`-i` argument will put
-you into interactive mode within the container. You will enter the
-following commands in the running container.
+Launching the Docker image with the :command:`-i` argument starts
+interactive mode within the container. Enter the following commands in
+the running container.

 #. Clone the benchmark repository:

 .. code-block:: bash

-git clone https://github.com/pytorch/pytorch.git
+git clone https://github.com/pytorch/pytorch.git

-#. Next, execute the benchmark script to run the benchmark.
+#. Execute the benchmark script:

 .. code-block:: bash

@@ -164,29 +176,29 @@ official benchmarks, but the same process applies for other cases.
 Kubeflow multi-node benchmarks
 ******************************

-The benchmark workload will run in a Kubernetes cluster. We will use
+The benchmark workload runs in a Kubernetes cluster. The tutorial uses
 `Kubeflow`_ for the Machine Learning workload deployment on three nodes.

 Kubernetes setup
 ================

 Follow the instructions in the :ref:`kubernetes` tutorial to get set up on
-|CL|. The kubernetes community also has
+|CL|. The Kubernetes community also has
 `instructions for creating a cluster`_.

 Kubernetes networking
 =====================

-We used `flannel`_ as the network provider for these tests. If you are
-comfortable with another network layer, refer to the Kubernetes
+We used `flannel`_ as the network provider for these tests. If you
+prefer a different network layer, refer to the Kubernetes
 `networking documentation`_ for setup.

 Images
 ======

-We need to add `launcher.py` to our docker image to include the Deep
+You must add `launcher.py` to the Docker image to include the Deep
 Learning Reference Stack and put the benchmarks repo in the correct
-location. From the docker image, run the following:
+location. From the Docker image, run the following:

 .. code-block:: bash

@@ -195,21 +207,19 @@ location. From the docker image, run the following:
 cp launcher.py /opt
 chmod u+x /opt/*

-Your entry point now becomes "/opt/launcher.py".
+Your entry point becomes: :file:`/opt/launcher.py`

-This will build an image which can be consumed directly by TFJob from
-kubeflow. We are working to create these images as part of our release
-cycle.
+This builds an image that can be consumed directly by TFJob from Kubeflow.

 ksonnet\*
 =========

-Kubeflow uses ksonnet\* to manage deployments, so we need to install that
+Kubeflow uses ksonnet\* to manage deployments, so you must install it
 before setting up Kubeflow.

-Since Clear Linux version 27550, the ksonnet was added to the bundle
-cloud-native-basic. But if using old versions (not recommended), please
-manually install the ksonnet as below.
+ksonnet was added to the :command:`cloud-native-basic` bundle in |CL| version 27550. If
+you are using an older |CL| version (not recommended), you must manually
+install ksonnet as described below.

 On |CL|, follow these steps:

@@ -228,8 +238,8 @@ accessible across the environment.
 Kubeflow
 ========

-Once you have Kubernetes running on your nodes, you can setup `Kubeflow`_ by
-following these instructions from their `quick start guide`_.
+Once you have Kubernetes running on your nodes, set up `Kubeflow`_ by
+following these instructions from the `quick start guide`_.

 .. code-block:: bash

@@ -246,7 +256,7 @@ following these instructions from their `quick start guide`_.
 ks pkg install kubeflow/common
 ks pkg install kubeflow/tf-training

-Now you have all the required kubeflow packages, and you can deploy the primary one for our purposes: tf-job-operator.
+Next, deploy the primary package for our purposes: tf-job-operator.

 .. code-block:: bash

@@ -256,22 +266,22 @@ Now you have all the required kubeflow packages, and you can deploy the primary
 ks generate tf-job-operator tf-job-operator
 ks apply default -c tf-job-operator

-This creates the CustomResourceDefinition(CRD) endpoint to launch a TFJob.
+This creates the CustomResourceDefinition (CRD) endpoint to launch a TFJob.

 Run a TFJob
-***********
+===========

 #. Select this link for the `ksonnet registries for deploying TFJobs`_.

-#. Install the TFJob componets as follows:
+#. Install the TFJob components as follows:

-.. code-block:: bash
+.. code-block:: bash

-ks registry add dlrs-tfjob github.com/clearlinux/dockerfiles/tree/master/stacks/dlrs/kubeflow/dlrs-tfjob
+ks registry add dlrs-tfjob github.com/clearlinux/dockerfiles/tree/master/stacks/dlrs/kubeflow/dlrs-tfjob

-ks pkg install dlrs-tfjob/dlrs-bench
+ks pkg install dlrs-tfjob/dlrs-bench

-#. Export the image name you'd like to use for the deployment:
+#. Export the image name to use for the deployment:

 .. code-block:: bash

@@ -281,8 +291,7 @@ Run a TFJob

 Replace <docker_name> with the image name you specified in previous steps.

-#. Next, generate Kubernetes manifests for the workloads and apply them to
-create and run them using these commands
+#. Generate Kubernetes manifests for the workloads and apply them using these commands:

 .. code-block:: bash

@@ -291,13 +300,13 @@ Run a TFJob
 ks apply default -c dlrsresnet50
 ks apply default -c dlrsalexnet

-This will replicate and deploy three test setups in your Kubernetes cluster.
+This replicates and deploys three test setups in your Kubernetes cluster.

-Results of Running this Tutorial
+Results of running this tutorial
 ================================

-You need to parse the logs of the Kubernetes pod to get the performance
-numbers. The pods will still be around post completion and will be in
+You must parse the logs of the Kubernetes pod to retrieve performance
+data. The pods will still exist post-completion and will be in
 ‘Completed’ state. You can get the logs from any of the pods to inspect the
 benchmark results. More information about `Kubernetes logging`_ is available
 from the Kubernetes community.
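
The log parsing mentioned above could be sketched as follows. This is an illustration under stated assumptions, not the tutorial's own tooling: the function name is hypothetical, and the sketch assumes the pod log contains the ``total images/sec: <number>`` summary line that the TensorFlow ``tf_cnn_benchmarks`` script prints at the end of a run. If your workload logs a different summary, change the pattern accordingly.

```python
import re

# Assumption: the pod log includes a tf_cnn_benchmarks-style summary line,
# for example "total images/sec: 123.45".
TOTAL_RE = re.compile(r"total images/sec:\s*([0-9]+(?:\.[0-9]+)?)")


def total_images_per_sec(log_text):
    """Return the last reported total throughput, or None if none is found."""
    matches = TOTAL_RE.findall(log_text)
    return float(matches[-1]) if matches else None
```

The log text itself can be captured from a completed pod with ``kubectl logs <pod_name>`` and passed to the function as a string.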
@@ -305,19 +314,22 @@ from the Kubernetes community.
 Use Jupyter Notebook
 ********************

-We will use the `PyTorch with OpenBLAS`_ container image for these steps. Once it is downloaded, run the docker image with :command:`-p` to specify the shared port between the container and the host. For this example we will use port 8888.
+This example uses the `PyTorch with OpenBLAS`_ container image. After it is
+downloaded, run the Docker image with :command:`-p` to specify the shared port
+between the container and the host. This example uses port 8888.

 .. code-block:: bash

-docker run --name pytorchtest --rm -i -t -p 8888:8888 clearlinux/stacks-pytorch-oss bash
+docker run --name pytorchtest --rm -i -t -p 8888:8888 clearlinux/stacks-pytorch-oss bash

-After you've started the container, you can launch the Jupyter Notebook. This command is executed inside the container image.
+After you start the container, launch the Jupyter Notebook. This
+command is executed inside the container image.

 .. code-block:: bash

-jupyter notebook --ip 0.0.0.0 --no-browser --allow-root
+jupyter notebook --ip 0.0.0.0 --no-browser --allow-root

-Once the notebook has loaded, you will see output similar to the following:
+After the notebook has loaded, you will see output similar to the following:

 .. code-block:: console

@@ -325,13 +337,15 @@ Once the notebook has loaded, you will see output similar to the following:
 Or copy and paste one of these URLs:
 http://(846e526765e3 or 127.0.0.1):8888/?token=6357dbd072bea7287c5f0b85d31d70df344f5d8843fbfa09

-From your host system, or any system that can access the host's IP address, start a web browser with the following. If you are not running the browser on the host system, replace :command:`127.0.0.1` with the IP address of the host.
+From your host system, or any system that can access the host's IP address,
+start a web browser with the following. If you are not running the browser on
+the host system, replace :command:`127.0.0.1` with the IP address of the host.

 .. code-block:: bash

 http://127.0.0.1:8888/?token=6357dbd072bea7287c5f0b85d31d70df344f5d8843fbfa09

-Your browser will display the following:
+Your browser displays the following:

 .. figure:: figures/dlrs-fig-1.png
 :scale: 50 %
@@ -340,7 +354,7 @@ Your browser will display the following:
 Figure 1: :guilabel:`Jupyter Notebook`


-To create a new notebook, click on :guilabel:`New` and select :guilabel:`Python 3`
+To create a new notebook, click :guilabel:`New` and select :guilabel:`Python 3`.

 .. figure:: figures/dlrs-fig-2.png
 :scale: 50%
@@ -348,7 +362,7 @@ To create a new notebook, click on :guilabel:`New` and select :guilabel:`Python

 Figure 2: Create a new notebook

-You will be presented with a new, blank notebook, with a cell ready for input.
+A new, blank notebook is displayed, with a cell ready for input.

 .. figure:: figures/dlrs-fig-3.png
 :scale: 50%
@@ -357,12 +371,12 @@ You will be presented with a new, blank notebook, with a cell ready for input.

 To verify that PyTorch is working, copy the following snippet into the blank cell, and run the cell.

-.. code-block:: console
+.. code-block:: console

-from __future__ import print_function
-import torch
-x = torch.rand(5, 3)
-print(x)
+from __future__ import print_function
+import torch
+x = torch.rand(5, 3)
+print(x)

 .. figure:: figures/dlrs-fig-4.png
 :scale: 50%
@@ -374,10 +388,19 @@ When you run the cell, your output will look something like this:
 :scale: 50%
 :alt: code output

-You can continue working in this notebook, or you can download existing notebooks to take advantage of the Deep Learning Reference Stack's optimized deep learning frameworks. More information on `Jupyter Notebook`_.

+You can continue working in this notebook, or you can download existing
+notebooks to take advantage of the Deep Learning Reference Stack's optimized
+deep learning frameworks. Refer to `Jupyter Notebook`_ for details.

+Related topics
+**************

+* Deep Learning Reference Stack `V3.0 release announcement`_
+* `TensorFlow benchmarks`_
+* `PyTorch benchmarks`_
+* `Kubeflow`_
+* :ref:`kubernetes` tutorial
+* `Jupyter Notebook`_


 .. _TensorFlow: https://www.tensorflow.org/
@@ -408,7 +431,7 @@ You can continue working in this notebook, or you can download existing notebook

 .. _Intel MKL-DNN-VNNI: https://hub.docker.com/r/clearlinux/stacks-dlrs-mkl-vnni

-.. _release notes: https://clearlinux.org/stacks/deep-learning-reference-stack-v3
+.. _V3.0 release announcement: https://clearlinux.org/stacks/deep-learning-reference-stack-v3

 .. _ksonnet registries for deploying TFJobs: https://github.com/clearlinux/dockerfiles/tree/master/stacks/dlrs/kubeflow/dlrs-tfjob

@@ -420,4 +443,4 @@ You can continue working in this notebook, or you can download existing notebook

 .. _Jupyter Notebook: https://jupyter.org/

-.. _github release notes: https://github.com/clearlinux/dockerfiles/blob/master/stacks/dlrs/releasenote.md
+.. _Release notes on Github\*: https://github.com/clearlinux/dockerfiles/blob/master/stacks/dlrs/releasenote.md