User Tutorial : Docker and Kubernetes
Docker provides a more flexible approach to cloud computing than block-booking virtual machines (VMs). When processing very large datasets, there can be cost and efficiency reasons for first estimating your usage and then directly booking VMs that you will use to the full. For out-of-the-box scalability, however, the approach outlined here is better.
What is Docker?
Docker encapsulates your code in a container, which here plays the role of a single worker. It is lightweight in that it does not contain the whole operating system in the way that a VM does. The default approach to parallelism in distpy is to use Python's multiprocessing, and the architectural level at which that happens is the scope of a data-chunk. This means that we can simply encapsulate the workers as separate Docker containers.
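To make the chunk-level parallelism concrete, here is a minimal sketch of the multiprocessing pattern described above. The names process_chunk and run_all, and the placeholder computation, are hypothetical and are not taken from distpy. The point is only that each data-chunk is handled independently, which is what makes a one-container-per-chunk design possible.

```python
import multiprocessing


def process_chunk(chunk_id):
    """Stand-in worker for one data-chunk (hypothetical, not distpy's code)."""
    # Placeholder computation: each chunk is processed without reference
    # to any other chunk, so the work can be distributed freely.
    return chunk_id, sum(range(chunk_id + 1))


def run_all(chunk_ids, n_workers=2):
    """Fan the chunks out to a pool of worker processes."""
    with multiprocessing.Pool(processes=n_workers) as pool:
        return dict(pool.map(process_chunk, chunk_ids))


if __name__ == "__main__":
    print(run_all([0, 1, 2, 3]))
```

Because process_chunk takes only a chunk identifier and shares no state, the same function could equally be the entry point of a container.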
The files for this walkthrough can be found in the docker directory.
- Copy the files Dockerfile and distpy_docker.py to a clean location.
- Build a Docker image using
docker build --no-cache -t distpy-python-app .
- Test your container using
docker run distpy-python-app
The output should look like this (additional missing-file messages will appear on non-GPU systems):
Using TensorFlow backend.
usage: distpy-docker.py [-h] [-f FILE] [-c JSON] [-m MODULE]
optional arguments:
  -h, --help                  show this help message and exit
  -f FILE, --file FILE        write report to FILE
  -c JSON, --config JSON      read config from JSON
  -m MODULE, --module MODULE  select a processing module
This basic example supports 4 commands in the -m option, corresponding to the high-level ingestion and processing workers used in the distpy system:
-m segy_ingest
-m ingest_h5
-m strainrate2summary
-m plotgenerator
This covers the main use cases. The -f option takes the specific file to process, for example the sgy file to be ingested.
The -c option is where you supply the JSON configuration. This means that a single Docker image is all that is needed for all distpy workflows, and that as many containers as needed can be spawned from that image using Kubernetes orchestration.
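The command-line interface shown in the usage message can be approximated with argparse. This is a sketch reconstructed from the help text only; it is not the contents of the actual script. In particular, restricting -m to the four listed modules through choices is an assumption, as is the function name build_parser.

```python
import argparse

# The four module names listed above; enforcing them via `choices`
# is an assumption of this sketch.
MODULES = ["segy_ingest", "ingest_h5", "strainrate2summary", "plotgenerator"]


def build_parser():
    """Build a parser that mirrors the usage message shown earlier."""
    parser = argparse.ArgumentParser(prog="distpy-docker.py")
    parser.add_argument("-f", "--file", dest="file", metavar="FILE",
                        help="write report to FILE")
    parser.add_argument("-c", "--config", dest="config", metavar="JSON",
                        help="read config from JSON")
    parser.add_argument("-m", "--module", dest="module", metavar="MODULE",
                        choices=MODULES, help="select a processing module")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.file, args.config, args.module)
```

Since the module is selected at run time, one image can serve every stage of a workflow, and only the arguments differ between containers.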
As a concrete example, you can test ingestion of a single SEGY file (see the Tutorials for details on projects and distpy JSON configuration):
docker run -v C:\NotBackedUp:/scratch distpy-python-app \
-f /scratch/myproject/sgy/test.sgy \
-c /scratch/myproject/config/docker_sgyConfig.json \
-m segy_ingest
The -v option, which comes before the image name, binds the local C:\NotBackedUp folder on a Windows machine to the container's /scratch. The mapping creates a shared space that all containers can use. The options after the image name are passed to the container when it starts, and so constitute its instructions:
- Configure distpy using docker_sgyConfig.json
- Use this configuration to ingest the SEGY file test.sgy
This level of granularity means you would have one container for each of the SEGY files you want to ingest, followed by one container for each 1-second numpy file containing ingested strain-rate data. The execution style is therefore the same as that achieved through the CASE00.py example.
We replace multiprocessing by using a separate container wherever we previously had a separate worker process.
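The one-container-per-file pattern can be sketched by generating one docker run command per SEGY file. The helper names below are hypothetical, and the directory layout (a project folder with sgy and config subfolders under /scratch) is assumed from the single-file example above.

```python
from pathlib import PurePosixPath

IMAGE = "distpy-python-app"  # image name used in the build step above


def docker_run_command(host_dir, sgy_name, config_name,
                       module="segy_ingest", project="myproject"):
    """Return the argument list for one container (hypothetical helper)."""
    # Paths inside the container are POSIX paths under the /scratch mount.
    scratch = PurePosixPath("/scratch") / project
    return [
        "docker", "run",
        "-v", f"{host_dir}:/scratch",
        IMAGE,
        "-f", str(scratch / "sgy" / sgy_name),
        "-c", str(scratch / "config" / config_name),
        "-m", module,
    ]


def commands_for(host_dir, sgy_names, config_name):
    """One command per SEGY file, mirroring one container per file."""
    return [docker_run_command(host_dir, name, config_name)
            for name in sgy_names]


if __name__ == "__main__":
    for cmd in commands_for(r"C:\NotBackedUp", ["test.sgy"],
                            "docker_sgyConfig.json"):
        print(" ".join(cmd))
```

Each list could be passed to subprocess.run(cmd, check=True) to launch the container locally. At scale that launching is what an orchestrator does for you.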
The other half of the equation is replacing the controllers that served up the separate processes. For this we use Kubernetes, which orchestrates the lifecycles of the containers.
What is Kubernetes?
Kubernetes is an orchestrator for containers. This means that it can take your containers and map them across your available hardware, tracking them, re-running any failed calculations and organizing everything. From the distpy perspective we need a flexible batch runner to fire off lots of instances of our containerized workers, so we are interested in the Kubernetes concept of jobs that run to completion.
Consider a basic pod that uses our Docker image:
apiVersion: v1
kind: Pod
metadata:
  name: distpy-job
spec:
  containers:
  - name: distpy-container
    image: distpy-python-app
    args: ["-f", "FILE", "-c", "JSON", "-m", "MODULE"]
    volumeMounts:
    - name: scratch
      mountPath: "/scratch"
  volumes:
  - name: scratch
    hostPath:
      path: "C:\\NotBackedUp"
Referring back to the command-line SEGY ingestion above, we can see the Windows drive mapped to /scratch and the three arguments in their templated form.
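Since the goal is jobs that run to completion, one possible next step is to wrap the same container specification in a Kubernetes Job rather than a bare Pod. The sketch below fills in the template for one file and prints the manifest as JSON, which kubectl accepts as well as YAML. The function name job_manifest, the job name and the backoffLimit value are illustrative assumptions, not part of distpy.

```python
import json


def job_manifest(job_name, file_path, config_path, module,
                 host_path="C:\\NotBackedUp"):
    """Build a batch/v1 Job manifest for one worker (hypothetical helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            "backoffLimit": 2,  # assumed retry limit for failed runs
            "template": {
                "spec": {
                    # A Job's pod template must use Never or OnFailure.
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "distpy-container",
                        "image": "distpy-python-app",
                        # Each flag and each value is its own list element.
                        "args": ["-f", file_path,
                                 "-c", config_path,
                                 "-m", module],
                        "volumeMounts": [
                            {"name": "scratch", "mountPath": "/scratch"},
                        ],
                    }],
                    "volumes": [
                        {"name": "scratch", "hostPath": {"path": host_path}},
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = job_manifest(
        "distpy-ingest-0",
        "/scratch/myproject/sgy/test.sgy",
        "/scratch/myproject/config/docker_sgyConfig.json",
        "segy_ingest",
    )
    print(json.dumps(manifest, indent=2))
```

Saving the output to a file and running kubectl apply -f on it would submit the job. Generating one manifest per input file reproduces the one-container-per-file granularity described earlier.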