Merge pull request #281 from DataBiosphere/dev
PR for 0.4.10 release
wnojopra authored Dec 18, 2023
2 parents 8e97456 + 287f648 commit 254f3b0
Showing 21 changed files with 227 additions and 78 deletions.
8 changes: 5 additions & 3 deletions README.md

@@ -376,7 +376,8 @@ If your script has dependent files, you can make them available to your script
 by:
 
 * Building a private Docker image with the dependent files and publishing the
-  image to a public site, or privately to Google Container Registry
+  image to a public site, or privately to Google Container Registry or
+  Artifact Registry
 * Uploading the files to Google Cloud Storage
 
 To upload the files to Google Cloud Storage, you can use the
@@ -465,8 +466,9 @@ local directory in a similar fashion to support your local development.
 
 ##### Mounting a Google Cloud Storage bucket
 
-To have the `google-v2` or `google-cls-v2` provider mount a Cloud Storage bucket using
-Cloud Storage FUSE, use the `--mount` command line flag:
+To have the `google-v2` or `google-cls-v2` provider mount a Cloud Storage bucket
+using [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse),
+use the `--mount` command line flag:
 
     --mount RESOURCES=gs://mybucket
 
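For context, a complete `dsub` invocation using the `--mount` flag documented above might look like the following sketch. The project, bucket, and command are illustrative placeholders, not values from this PR:

```
# Hypothetical example: mount a bucket read-only via Cloud Storage FUSE
# and list its contents. MY-PROJECT and mybucket are placeholders.
dsub \
  --provider google-v2 \
  --project MY-PROJECT \
  --logging gs://mybucket/logging/ \
  --mount RESOURCES=gs://mybucket \
  --command 'ls -R "${RESOURCES}"' \
  --wait
```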
12 changes: 7 additions & 5 deletions docs/code.md

@@ -111,9 +111,10 @@ sites such as [Docker Hub](https://hub.docker.com/). Images can be pulled
 from Docker Hub or any container registry:
 
 ```
---image debian:jessie                                  # pull image implicitly from Docker hub.
---image gcr.io/PROJECT/IMAGE                           # pull from GCR registry.
---image quay.io/quay/ubuntu                            # pull from Quay.io.
+--image debian:jessie                                  # pull image implicitly from Docker hub
+--image gcr.io/PROJECT/IMAGE                           # pull from Google Container Registry
+--image us-central1-docker.pkg.dev/PROJECT/REPO/IMAGE  # pull from Artifact Registry
+--image quay.io/quay/ubuntu                            # pull from Quay.io
 ```
 
 When you have more than a single custom script to run or you have dependent
@@ -123,8 +124,9 @@ store it in a container registry.
 
 A quick way to start using custom Docker images is to use Google Container
 Builder which will build an image remotely and store it in the [Google Container
-Registry](https://cloud.google.com/container-registry/docs/). Alternatively you
-can build a Docker image locally and push it to a registry. See the
+Registry](https://cloud.google.com/container-registry/docs)
+or [Artifact Registry](https://cloud.google.com/artifact-registry/docs).
+Alternatively you can build a Docker image locally and push it to a registry. See the
 [FastQC example](../examples/fastqc) for a demonstration of both strategies.
 
 For information on building Docker images, see the Docker documentation:
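As a concrete illustration of the build-and-push path now that Artifact Registry is mentioned, here is a hedged sketch. The project, repository, and image names are made up, and the host assumes a Docker-format repository in `us-central1`:

```
# One-time: let Docker authenticate to the Artifact Registry host.
gcloud auth configure-docker us-central1-docker.pkg.dev

# Build locally and push.
docker build -t us-central1-docker.pkg.dev/my-project/my-repo/my-image:1.0 .
docker push us-central1-docker.pkg.dev/my-project/my-repo/my-image:1.0

# Or build remotely with Cloud Build and store the result directly.
gcloud builds submit --tag us-central1-docker.pkg.dev/my-project/my-repo/my-image:1.0 .
```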
8 changes: 5 additions & 3 deletions docs/compute_resources.md

@@ -82,8 +82,8 @@ A Compute Engine VM by default has both a public (external) IP address and a
 private (internal) IP address. For batch processing, it is often the case that
 no public IP address is necessary. If your job only accesses Google services,
 such as Cloud Storage (inputs, outputs, and logging) and Google Container
-Registry (your Docker image), then you can run your `dsub` job on VMs without a
-public IP address.
+Registry or Artifact Registry (your Docker image), then you can run your `dsub`
+job on VMs without a public IP address.
 
 For more information on Compute Engine IP addresses, see:
 
@@ -132,7 +132,9 @@ was assigned.**
 
 The default `--image` used for `dsub` tasks is `ubuntu:14.04` which is pulled
 from Dockerhub. For VMs that do not have a public IP address, set the `--image`
 flag to a Docker image hosted by
-[Google Container Registry](https://cloud.google.com/container-registry/docs).
+[Google Container Registry](https://cloud.google.com/container-registry/docs) or
+[Artifact Registry](https://cloud.google.com/artifact-registry/docs).
+
 Google provides a set of
 [Managed Base Images](https://cloud.google.com/container-registry/docs/managed-base-images)
 in Container Registry that can be used as simple replacements for your tasks.
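Putting the two pieces together, a job on VMs without public IPs might be launched as in this sketch (project, bucket, and image names are placeholders; the VM's network must also allow Private Google Access):

```
# Hypothetical example: no public IP address; the image must come from a
# gcr.io or pkg.dev host so the VM can pull it without internet egress.
dsub \
  --provider google-v2 \
  --project MY-PROJECT \
  --use-private-address \
  --image gcr.io/MY-PROJECT/my-image \
  --logging gs://mybucket/logging/ \
  --command 'echo "hello world"' \
  --wait
```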
21 changes: 17 additions & 4 deletions docs/input_output.md

@@ -256,15 +256,28 @@ the name of the input parameter must comply with the
 
 ## Requester Pays
 
-To access a Google Cloud Storage
-[Requester Pays bucket](https://cloud.google.com/storage/docs/requester-pays),
-you will need to specify a billing project. To do so, use the `dsub`
-command-line option `--user-project`:
+Unless specifically enabled, a Google Cloud Storage bucket is "owner pays"
+for all requests. This includes
+[network charges](https://cloud.google.com/vpc/network-pricing) for egress
+(data downloads or copies to a different cloud region), as well as
+[retrieval charges](https://cloud.google.com/storage/pricing#retrieval-pricing)
+on files in "cold" storage classes, such as Nearline, Coldline, and Archive.
+
+When [Requester Pays](https://cloud.google.com/storage/docs/requester-pays)
+is enabled on a bucket, the requester must specify a Cloud project to which
+charges can be billed. Use the `dsub` command-line option `--user-project`:
 
 ```
 --user-project my-cloud-project
 ```
 
+The user project specified will be passed for all GCS interactions, including:
+
+- Logging
+- Localization (inputs)
+- Delocalization (outputs)
+- Mount (gcs fuse)
+
 ## Unsupported path formats:
 
 * GCS recursive wildcards (**) are not supported
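To make the billing flow concrete, a minimal sketch, assuming you administer `mybucket` and can bill `MY-BILLING-PROJECT` (all names are placeholders):

```
# Enable Requester Pays on a bucket you administer.
gsutil requesterpays set on gs://mybucket

# Requesters must then name a billing project; with dsub:
dsub \
  --provider google-v2 \
  --project MY-PROJECT \
  --user-project MY-BILLING-PROJECT \
  --logging gs://mybucket/logging/ \
  --input INPUT=gs://mybucket/data/sample.bam \
  --command 'md5sum "${INPUT}"' \
  --wait
```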
2 changes: 1 addition & 1 deletion dsub/_dsub_version.py

@@ -26,4 +26,4 @@
 0.1.3.dev0 -> 0.1.3 -> 0.1.4.dev0 -> ...
 """
 
-DSUB_VERSION = '0.4.9'
+DSUB_VERSION = '0.4.10'
22 changes: 15 additions & 7 deletions dsub/commands/dsub.py

@@ -180,20 +180,26 @@ def get_credentials(args):
 
 
 def _check_private_address(args):
-  """If --use-private-address is enabled, ensure the Docker path is for GCR."""
+  """If --use-private-address is enabled, Docker path must be for GCR or AR."""
   if args.use_private_address:
     image = args.image or DEFAULT_IMAGE
     split = image.split('/', 1)
-    if len(split) == 1 or not split[0].endswith('gcr.io'):
+    if len(split) == 1 or not (
+        split[0].endswith('gcr.io') or split[0].endswith('pkg.dev')
+    ):
       raise ValueError(
-          '--use-private-address must specify a --image with a gcr.io host')
+          '--use-private-address must specify a --image with a gcr.io or'
+          ' pkg.dev host'
+      )
 
 
 def _check_nvidia_driver_version(args):
   """If --nvidia-driver-version is set, warn that it is ignored."""
   if args.nvidia_driver_version:
-    print('***WARNING: The --nvidia-driver-version flag is deprecated and will '
-          'be ignored.')
+    print(
+        '***WARNING: The --nvidia-driver-version flag is deprecated and will '
+        'be ignored.'
+    )
 
 
 def _google_cls_v2_parse_arguments(args):
@@ -360,8 +366,10 @@ def _parse_arguments(prog, argv):
   parser.add_argument(
       '--user-project',
       help="""Specify a user project to be billed for all requests to Google
-      Cloud Storage (logging, localization, delocalization). This flag exists
-      to support accessing Requester Pays buckets (default: None)""")
+      Cloud Storage (logging, localization, delocalization, mounting).
+      This flag exists to support accessing Requester Pays buckets
+      (default: None)""",
+  )
   parser.add_argument(
       '--mount',
       nargs='*',
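Illustrative only: how the relaxed host check plays out for a few `--image` values. Other required flags are elided and the image names are made up:

```
# Accepted: the image host ends with gcr.io or pkg.dev.
dsub --use-private-address --image gcr.io/my-project/my-image ...
dsub --use-private-address --image us-central1-docker.pkg.dev/my-project/my-repo/my-image ...

# Rejected: a bare Docker Hub image raises
# "ValueError: --use-private-address must specify a --image with a gcr.io or pkg.dev host"
dsub --use-private-address --image debian:jessie ...
```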
31 changes: 23 additions & 8 deletions dsub/providers/google_v2_base.py

@@ -296,30 +296,43 @@ def _get_logging_env(self, logging_uri, user_project):
         'USER_PROJECT': user_project,
     }
 
-  def _get_mount_actions(self, mounts, mnt_datadisk):
+  def _get_mount_actions(self, mounts, mnt_datadisk, user_project):
     """Returns a list of two actions per gcs bucket to mount."""
     actions_to_add = []
     for mount in mounts:
       bucket = mount.value[len('gs://'):]
       mount_path = mount.docker_path
 
+      mount_command = (
+          ['--billing-project', user_project] if user_project else []
+      )
+      mount_command.extend([
+          '--implicit-dirs',
+          '--foreground',
+          '-o ro',
+          bucket,
+          os.path.join(_DATA_MOUNT_POINT, mount_path),
+      ])
+
       actions_to_add.extend([
           google_v2_pipelines.build_action(
               name='mount-{}'.format(bucket),
               enable_fuse=True,
               run_in_background=True,
               image_uri=_GCSFUSE_IMAGE,
               mounts=[mnt_datadisk],
-              commands=[
-                  '--implicit-dirs', '--foreground', '-o ro', bucket,
-                  os.path.join(_DATA_MOUNT_POINT, mount_path)
-              ]),
+              commands=mount_command,
+          ),
          google_v2_pipelines.build_action(
               name='mount-wait-{}'.format(bucket),
               enable_fuse=True,
               image_uri=_GCSFUSE_IMAGE,
               mounts=[mnt_datadisk],
-              commands=['wait',
-                        os.path.join(_DATA_MOUNT_POINT, mount_path)])
+              commands=[
+                  'wait',
+                  os.path.join(_DATA_MOUNT_POINT, mount_path),
+              ],
+          ),
       ])
     return actions_to_add
@@ -418,7 +431,9 @@ def _build_pipeline_request(self, task_view):
     if job_resources.ssh:
       optional_actions += 1
 
-    mount_actions = self._get_mount_actions(gcs_mounts, mnt_datadisk)
+    mount_actions = self._get_mount_actions(
+        gcs_mounts, mnt_datadisk, user_project
+    )
     optional_actions += len(mount_actions)
 
     user_action = 4 + optional_actions
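Conceptually, the mount action now assembles a gcsfuse command line like the sketch below when a user project is supplied. The bucket, billing project, and mount point are illustrative, not the exact values dsub renders:

```
# Approximate shape of the gcsfuse invocation built above:
# a read-only mount, billed to the requester's project when one is given.
gcsfuse \
  --billing-project my-billing-project \
  --implicit-dirs \
  --foreground \
  -o ro \
  mybucket \
  /mnt/data/gs/mybucket
```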
4 changes: 3 additions & 1 deletion dsub/providers/local/runner.sh

@@ -153,7 +153,9 @@ function configure_docker_if_necessary() {
 
   # Check that the prefix is gcr.io or <location>.gcr.io
   if [[ "${prefix}" == "gcr.io" ]] ||
-     [[ "${prefix}" == *.gcr.io ]]; then
+     [[ "${prefix}" == *.gcr.io ]] ||
+     [[ "${prefix}" == "pkg.dev" ]] ||
+     [[ "${prefix}" == *.pkg.dev ]] ; then
    log_info "Ensuring docker auth is configured for ${prefix}"
    gcloud --quiet auth configure-docker "${prefix}"
  fi
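For reference, the auth step the runner performs for an Artifact Registry image prefix is equivalent to running (the host shown is one illustrative region):

```
# Equivalent manual command for a pkg.dev image prefix:
gcloud --quiet auth configure-docker us-central1-docker.pkg.dev
```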
24 changes: 12 additions & 12 deletions setup.py

@@ -14,28 +14,28 @@
     # dependencies for dsub, ddel, dstat
     # Pin to known working versions to prevent episodic breakage from library
     # version mismatches.
-    # This version list generated: 04/13/2023
+    # This version list generated: 12/07/2023
     # direct dependencies
-    'google-api-python-client>=2.47.0,<=2.85.0',
-    'google-auth>=2.6.6,<=2.17.3',
-    'google-cloud-batch==0.10.0',
+    'google-api-python-client>=2.47.0,<=2.109.0',
+    'google-auth>=2.6.6,<=2.25.1',
+    'google-cloud-batch==0.17.5',
     'python-dateutil<=2.8.2',
     'pytz<=2023.3',
-    'pyyaml<=6.0',
-    'tenacity<=8.2.2',
+    'pyyaml<=6.0.1',
+    'tenacity<=8.2.3',
     'tabulate<=0.9.0',
     # downstream dependencies
     'funcsigs==1.0.2',
-    'google-api-core>=2.7.3,<=2.11.0',
-    'google-auth-httplib2<=0.1.0',
+    'google-api-core>=2.7.3,<=2.15.0',
+    'google-auth-httplib2<=0.1.1',
     'httplib2<=0.22.0',
-    'pyasn1<=0.4.8',
-    'pyasn1-modules<=0.2.8',
+    'pyasn1<=0.5.1',
+    'pyasn1-modules<=0.3.0',
     'rsa<=4.9',
     'uritemplate<=4.1.1',
     # dependencies for test code
-    'parameterized<=0.8.1',
-    'mock<=4.0.3',
+    'parameterized<=0.9.0',
+    'mock<=5.1.0',
 ]
 
 
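These pins ship with the release, so a plain install resolves compatible versions. For example:

```
# Install the 0.4.10 release; the pinned ranges above resolve automatically.
python3 -m pip install dsub==0.4.10
```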
16 changes: 8 additions & 8 deletions test/integration/e2e_dstat.sh

@@ -35,9 +35,9 @@ function verify_dstat_output() {
 
   # Verify that the jobs are found and are in the expected order.
   # dstat sort ordering is by create-time (descending), so job 0 here should be the last started.
-  local first_job_name="$(python "${SCRIPT_DIR}"/get_data_value.py "yaml" "${dstat_out}" "[0].job-name")"
-  local second_job_name="$(python "${SCRIPT_DIR}"/get_data_value.py "yaml" "${dstat_out}" "[1].job-name")"
-  local third_job_name="$(python "${SCRIPT_DIR}"/get_data_value.py "yaml" "${dstat_out}" "[2].job-name")"
+  local first_job_name="$(python3 "${SCRIPT_DIR}"/get_data_value.py "yaml" "${dstat_out}" "[0].job-name")"
+  local second_job_name="$(python3 "${SCRIPT_DIR}"/get_data_value.py "yaml" "${dstat_out}" "[1].job-name")"
+  local third_job_name="$(python3 "${SCRIPT_DIR}"/get_data_value.py "yaml" "${dstat_out}" "[2].job-name")"
 
   if [[ "${first_job_name}" != "${RUNNING_JOB_NAME_2}" ]]; then
     1>&2 echo "Job ${RUNNING_JOB_NAME_2} not found in the correct location in the dstat output! "
@@ -87,8 +87,8 @@ function verify_dstat_google_provider_fields() {
 
   for (( task=0; task < 3; task++ )); do
     # Run the provider test.
-    local job_name="$(python "${SCRIPT_DIR}"/get_data_value.py "yaml" "${dstat_out}" "[${task}].job-name")"
-    local job_provider="$(python "${SCRIPT_DIR}"/get_data_value.py "yaml" "${dstat_out}" "[${task}].provider")"
+    local job_name="$(python3 "${SCRIPT_DIR}"/get_data_value.py "yaml" "${dstat_out}" "[${task}].job-name")"
+    local job_provider="$(python3 "${SCRIPT_DIR}"/get_data_value.py "yaml" "${dstat_out}" "[${task}].provider")"
 
     # Validate provider.
     if [[ "${job_provider}" != "${DSUB_PROVIDER}" ]]; then
@@ -99,7 +99,7 @@
 
   # For google-cls-v2, validate that the correct "location" was used for the request.
   if [[ "${DSUB_PROVIDER}" == "google-cls-v2" ]]; then
-    local op_name="$(python "${SCRIPT_DIR}"/get_data_value.py "yaml" "${DSTAT_OUTPUT}" "[0].internal-id")"
+    local op_name="$(python3 "${SCRIPT_DIR}"/get_data_value.py "yaml" "${DSTAT_OUTPUT}" "[0].internal-id")"
 
     # The operation name format is projects/<project-number>/locations/<location>/operations/<operation-id>
     local op_location="$(echo -n "${op_name}" | awk -F '/' '{ print $4 }')"
@@ -131,15 +131,15 @@
     util::dstat_yaml_assert_boolean_field_equal "${dstat_out}" "[${task}].provider-attributes.preemptible" "false"
 
     # Check that instance name is not empty
-    local instance_name=$(python "${SCRIPT_DIR}"/get_data_value.py "yaml" "${dstat_out}" "[${task}].provider-attributes.instance-name")
+    local instance_name=$(python3 "${SCRIPT_DIR}"/get_data_value.py "yaml" "${dstat_out}" "[${task}].provider-attributes.instance-name")
     if [[ -z "${instance_name}" ]]; then
       1>&2 echo " - FAILURE: Instance ${instance_name} for job ${job_name}, task $((task+1)) is empty."
       1>&2 echo "${dstat_out}"
       exit 1
     fi
 
     # Check zone exists and is expected format
-    local job_zone=$(python "${SCRIPT_DIR}"/get_data_value.py "yaml" "${dstat_out}" "[${task}].provider-attributes.zone")
+    local job_zone=$(python3 "${SCRIPT_DIR}"/get_data_value.py "yaml" "${dstat_out}" "[${task}].provider-attributes.zone")
     if ! [[ "${job_zone}" =~ ^[a-z]{1,4}-[a-z]{2,15}[0-9]-[a-z]$ ]]; then
       1>&2 echo " - FAILURE: Zone ${job_zone} for job ${job_name}, task $((task+1)) not valid."
       1>&2 echo "${dstat_out}"
43 changes: 43 additions & 0 deletions test/integration/e2e_io_mount_bucket_requester_pays.google-v2.sh

@@ -0,0 +1,43 @@
+#!/bin/bash
+
+# Copyright 2023 Google Inc. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+set -o errexit
+set -o nounset
+
+# Test gcsfuse abilities.
+#
+# This test is designed to verify that named GCS bucket (mount)
+# command-line parameters work correctly.
+#
+# The actual operation performed here is to mount to a bucket containing a BAM
+# and compute its md5, writing it to <filename>.bam.md5.
+
+readonly SCRIPT_DIR="$(dirname "${0}")"
+
+# Do standard test setup
+source "${SCRIPT_DIR}/test_setup_e2e.sh"
+
+# Do io setup
+source "${SCRIPT_DIR}/io_setup.sh"
+
+echo "Launching pipeline..."
+
+JOB_ID="$(io_setup::run_dsub_with_mount "gs://${DSUB_BUCKET_REQUESTER_PAYS}" "true")"
+echo "JOB_ID = $JOB_ID"
+
+# Do validation
+io_setup::check_output
+io_setup::check_dstat "${JOB_ID}" false "gs://${DSUB_BUCKET_REQUESTER_PAYS}"
4 changes: 2 additions & 2 deletions test/integration/e2e_local_relative_paths.local.sh

@@ -35,9 +35,9 @@ cd "${TEST_LOCAL_ROOT}"
 # OUTPUT_FILE=outputs/relative/test.txt
 
 readonly INPUT_PATH_RELATIVE="$(
-  python -c "import os; print(os.path.relpath('"${LOCAL_INPUTS}/relative"'));")"
+  python3 -c "import os; print(os.path.relpath('"${LOCAL_INPUTS}/relative"'));")"
 readonly OUTPUT_PATH_RELATIVE="$(
-  python -c "import os; print(os.path.relpath('"${LOCAL_OUTPUTS}/relative"'));")"
+  python3 -c "import os; print(os.path.relpath('"${LOCAL_OUTPUTS}/relative"'));")"
 
 readonly INPUT_TEST_FILE="${INPUT_PATH_RELATIVE}/test.txt"
 readonly OUTPUT_TEST_FILE="${OUTPUT_PATH_RELATIVE}/test.txt"
4 changes: 3 additions & 1 deletion test/integration/io_setup.sh

@@ -137,7 +137,7 @@ function io_setup::run_dsub_requester_pays() {
   run_dsub \
     --unique-job-id \
     ${IMAGE:+--image "${IMAGE}"} \
-    --user-project "$PROJECT_ID" \
+    --user-project "${PROJECT_ID}" \
     --script "${SCRIPT_DIR}/script_io_test.sh" \
     --env TASK_ID="task" \
     --input INPUT_PATH="${REQUESTER_PAYS_INPUT_BAM_FULL_PATH}" \
@@ -151,10 +151,12 @@ readonly -f io_setup::run_dsub_requester_pays
 
 function io_setup::run_dsub_with_mount() {
   local mount_point="${1}"
+  local requester_pays="${2:-}"
 
   run_dsub \
     --unique-job-id \
     ${IMAGE:+--image "${IMAGE}"} \
+    ${requester_pays:+--user-project "${PROJECT_ID}"} \
    --script "${SCRIPT_DIR}/script_io_test.sh" \
    --env TASK_ID="task" \
    --output OUTPUT_PATH="${OUTPUTS}/task/*.md5" \
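The `${requester_pays:+...}` expansion adds the `--user-project` flag only when a non-empty second argument was passed. A self-contained sketch of the pattern, with made-up values:

```
#!/bin/bash
# Demonstration of the ${var:+...} expansion used above.
PROJECT_ID="my-project"

requester_pays=""
echo run_dsub ${requester_pays:+--user-project "${PROJECT_ID}"}
# -> run_dsub

requester_pays="true"
echo run_dsub ${requester_pays:+--user-project "${PROJECT_ID}"}
# -> run_dsub --user-project my-project
```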