- Test machines: V100 x8
- NFS address: 38037492dc-pol25.cn-shanghai.nas.aliyuncs.com
Cluster | Alibaba Cloud Kubernetes. v1.16.9-aliyun.1 |
---|---|
ECS instance | ECS type: ecs.gn6v-c10g1.20xlarge, CPU: 82 cores |
Distributed storage | Capacity NAS |
Software version: 0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6
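- Mounting through a host directory is not the recommended approach. It relies on a mount point that is maintained outside of Kubernetes, so it is not reliable in practice and may lead to data inconsistency.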
- Download the dataset
$ wget http://imagenet-tar.oss-cn-shanghai.aliyuncs.com/imagenet.tar.gz
- Extract the dataset
$ tar -I pigz -xvf imagenet.tar.gz
- Mount the dataset on the NFS server
- Mount the NFS export onto a host directory
$ sudo mount -t nfs -o vers=3,nolock,proto=tcp,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport <YOUR_NFS_SERVER>:<YOUR_PATH_TO_DATASET> /mnt/nfs-imagenet
- Check that the NFS export is mounted
$ mount | grep nfs
<YOUR_NFS_SERVER>:<YOUR_PATH_TO_DATASET> on /mnt/nfs-imagenet type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.28,mountvers=3,mountport=2049,mountproto=tcp,local_lock=all,addr=192.168.1.28)
NOTE: Replace <YOUR_NFS_SERVER> and <YOUR_PATH_TO_DATASET> in the commands above with your NFS server address and the exported path.
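Because the jobs below read the dataset through the host directory, it can help to verify the mount on every GPU node before submitting. The following is a minimal sketch, not part of the original benchmark: the helper name `is_nfs_mounted` is hypothetical, and it assumes a Linux host where `/proc/mounts` lists one mount per line as device, mount point, filesystem type and options.

```shell
# is_nfs_mounted DIR [MOUNTS_FILE]
# Succeeds when DIR is listed as an NFS mount point in MOUNTS_FILE.
# MOUNTS_FILE defaults to /proc/mounts (fields: device dir fstype options ...).
# The function body runs in a subshell so its variables do not leak.
is_nfs_mounted() (
    dir=$1
    mounts_file=${2:-/proc/mounts}
    awk -v dir="$dir" '
        $2 == dir && ($3 == "nfs" || $3 == "nfs4") { found = 1 }
        END { exit found ? 0 : 1 }
    ' "$mounts_file"
)

# Example (commented out so that sourcing this file has no side effects):
# is_nfs_mounted /mnt/nfs-imagenet || { echo "NFS is not mounted" >&2; exit 1; }
```

The optional second argument exists only so the check can be exercised against a sample file instead of the live mount table.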
arena submit mpijob \
--name horovod-v2-nfs-hostpath-1x8-093000 \
--gpus=8 \
--workers=1 \
--working-dir=/horovod-demo/tensorflow-demo/ \
--data-dir /mnt/nfs-imagenet:/data \
-e DATA_DIR=/data/imagenet \
-e num_batch=1000 \
-e datasets_num_private_threads=8 \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
./launch-example.sh 1 8
arena submit mpi \
--name horovod-v2-nfs-hostpath-4x8-092921 \
--gpus=8 \
--workers=4 \
--working-dir=/horovod-demo/tensorflow-demo/ \
--data-dir /mnt/nfs-imagenet:/data \
-e DATA_DIR=/data/imagenet \
-e num_batch=1000 \
-e datasets_num_private_threads=8 \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
./launch-example.sh 4 8
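The two submissions above differ only in the job name, the worker count and the arguments passed to launch-example.sh. A possible way to avoid copy-and-paste drift is to generate the command from those parameters. This is a sketch under assumptions: `print_submit_cmd` and its suffix argument are hypothetical names, and the function only prints the command rather than running it.

```shell
# print_submit_cmd WORKERS GPUS_PER_WORKER SUFFIX
# Prints (does not run) the arena command for one hostPath benchmark run.
# All flags and values are copied from the commands above; only the worker
# count, GPU count and job-name suffix are parameters.
print_submit_cmd() (
    workers=$1
    gpus=$2
    suffix=$3
    image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6
    # In an unquoted here-document, "\\" prints a single backslash.
    cat <<EOF
arena submit mpi \\
    --name horovod-v2-nfs-hostpath-${workers}x${gpus}-${suffix} \\
    --gpus=${gpus} \\
    --workers=${workers} \\
    --working-dir=/horovod-demo/tensorflow-demo/ \\
    --data-dir /mnt/nfs-imagenet:/data \\
    -e DATA_DIR=/data/imagenet \\
    -e num_batch=1000 \\
    -e datasets_num_private_threads=8 \\
    --image=${image} \\
    ./launch-example.sh ${workers} ${gpus}
EOF
)

# Example: print the four-machine command for review.
# print_submit_cmd 4 8 demo
```

Printing first keeps the step reviewable; once the output looks right, it can be piped to sh to submit the job.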
- Mount NFS by following the steps above
- Deploy Fluid to accelerate the host directory that NFS is mounted on
$ cat <<EOF > dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: imagenet
spec:
  mounts:
  - mountPoint: local:///mnt/nfs-imagenet
    name: imagenet
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: aliyun.accelerator/nvidia_name
          operator: In
          values:
          - Tesla-V100-SXM2-16GB
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: imagenet
spec:
  replicas: 4
  data:
    replicas: 1
  tieredstore:
    levels:
    - mediumtype: MEM
      path: /alluxio/ram
      quota: 50Gi
      high: "0.99"
      low: "0.8"
EOF
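NOTE:
- mounts.mountPoint uses the local:// prefix to specify the host directory to mount (e.g. /mnt/nfs-imagenet).
- spec.replicas should match the number of workers in the DawnBench test: 1 for one machine with eight GPUs, 4 for four machines with eight GPUs each.
- nodeSelectorTerms restricts the dataset to machines that have V100 GPUs; adjust it to match your own environment.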
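Since spec.replicas has to follow the worker count, the AlluxioRuntime manifest can be rendered from that number instead of edited by hand. The sketch below is illustrative only: `render_runtime` is a hypothetical helper, the field values are copied from the manifest above, and the indentation assumes the usual nesting of the AlluxioRuntime spec.

```shell
# render_runtime WORKERS
# Prints an AlluxioRuntime manifest whose spec.replicas equals WORKERS.
# Every other value is taken unchanged from the manifest in this document.
render_runtime() (
    workers=$1
    cat <<EOF
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: imagenet
spec:
  replicas: ${workers}
  data:
    replicas: 1
  tieredstore:
    levels:
    - mediumtype: MEM
      path: /alluxio/ram
      quota: 50Gi
      high: "0.99"
      low: "0.8"
EOF
)

# Example: write the single-machine variant to a file for review.
# render_runtime 1 > runtime-1x8.yaml
```

The file name runtime-1x8.yaml in the example is arbitrary; the rendered manifest would still be applied with kubectl create -f as in the next step.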
$ kubectl create -f dataset.yaml
- Check the deployment
$ kubectl get pv,pvc
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
persistentvolume/imagenet 100Gi RWX Retain Bound default/imagenet 3h28m
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/imagenet Bound imagenet 100Gi RWX 3h28m
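The training jobs can only mount the dataset once the imagenet PVC is Bound, as in the output above. For scripted runs, the check could be polled rather than read by eye. This is a minimal sketch: `wait_for_pvc_bound` and the `KUBECTL` override variable are hypothetical names, while `-o jsonpath='{.status.phase}'` is standard kubectl output formatting.

```shell
# wait_for_pvc_bound NAME [TRIES] [SLEEP_SECONDS]
# Polls the PVC phase until it is "Bound"; fails after TRIES attempts.
# KUBECTL may name an alternative command (assumption: used for testing).
wait_for_pvc_bound() (
    name=$1
    tries=${2:-30}
    pause=${3:-10}
    kubectl_cmd=${KUBECTL:-kubectl}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # A failing query is treated as "not bound yet".
        phase=$("$kubectl_cmd" get pvc "$name" -o jsonpath='{.status.phase}' 2>/dev/null) || phase=""
        if [ "$phase" = "Bound" ]; then
            exit 0   # leaves only the function's subshell
        fi
        i=$((i + 1))
        sleep "$pause"
    done
    exit 1
)

# Example: wait up to 5 minutes before submitting the Fluid jobs.
# wait_for_pvc_bound imagenet 30 10 || echo "PVC imagenet is not Bound" >&2
```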
arena submit mpi \
--name horovod-v2-nfs-fluid-1x8-093009 \
--gpus=8 \
--workers=1 \
--working-dir=/horovod-demo/tensorflow-demo/ \
--data imagenet:/data \
-e DATA_DIR=/data/imagenet/imagenet \
-e num_batch=1000 \
-e datasets_num_private_threads=8 \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
./launch-example.sh 1 8
arena submit mpi \
--name horovod-v2-nfs-fluid-4x8-092910 \
--gpus=8 \
--workers=4 \
--working-dir=/horovod-demo/tensorflow-demo/ \
--data imagenet:/data \
-e DATA_DIR=/data/imagenet/imagenet \
-e num_batch=1000 \
-e datasets_num_private_threads=8 \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
./launch-example.sh 4 8
1 machine x 8 GPUs | nfs-hostpath | fluid (cold) | fluid (warm) |
---|---|---|---|
Training time | 4h20m36s | 4h21m56s | 4h2m16s |
Speed at step 1000 (images/second) | 2426.4 | 2467.2 | 8959.7 |
Final speed (images/second) | 8218.1 | 8219.8 | 8275.8 |
Steps | 56300 | 56300 | 56300 |
Accuracy @ 5 | 0.9280 | 0.9288 | 0.9291 |
4 machines x 8 GPUs | nfs-hostpath | fluid (cold) | fluid (warm) |
---|---|---|---|
Training time | 2h9m21s | 1h40m15s | 1h29m55s |
Speed at step 1000 (images/second) | 3219.2 | 11067.2 | 21951.3 |
Final speed (images/second) | 15855.7 | 20964.4 | 21869.8 |
Steps | 14070 | 14070 | 14070 |
Accuracy @ 5 | 0.9227 | 0.9232 | 0.9228 |
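The results show that Fluid brings no noticeable speedup with one machine and eight GPUs, but the speedup is substantial with four machines of eight GPUs each. With warm data, training time drops by (129-89)/129 = 31%; with cold data, it drops by (129-100)/129 = 22% (training times in whole minutes). This is because NFS bandwidth becomes the bottleneck in the four-machine setup, whereas Fluid, built on Alluxio, provides P2P data reads from a distributed cache.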
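The percentages above use training times truncated to whole minutes. As a cross-check, the same reduction can be computed from the exact durations reported for the four-machine runs. The helpers `to_seconds` and `reduction_pct` below are hypothetical names for this sketch; it relies only on POSIX awk.

```shell
# to_seconds DURATION
# Converts a duration such as 2h9m21s into seconds.
to_seconds() (
    printf '%s\n' "$1" | awk '{
        h = 0; m = 0; s = 0
        if (match($0, /[0-9]+h/)) h = substr($0, RSTART, RLENGTH - 1)
        if (match($0, /[0-9]+m/)) m = substr($0, RSTART, RLENGTH - 1)
        if (match($0, /[0-9]+s/)) s = substr($0, RSTART, RLENGTH - 1)
        print h * 3600 + m * 60 + s
    }'
)

# reduction_pct BASELINE CANDIDATE
# Prints how much shorter CANDIDATE is than BASELINE, in percent.
reduction_pct() (
    base=$(to_seconds "$1")
    cand=$(to_seconds "$2")
    awk -v b="$base" -v c="$cand" 'BEGIN { printf "%.1f\n", (b - c) / b * 100 }'
)

reduction_pct 2h9m21s 1h29m55s   # fluid (warm) vs nfs-hostpath
reduction_pct 2h9m21s 1h40m15s   # fluid (cold) vs nfs-hostpath
```

With exact seconds this prints 30.5 and 22.5, which agrees with the rounded figures of about 31% and 22% to within one percentage point.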