Replies: 8 comments 12 replies
-
Yes, this is ugly. Which Nextflow version are you using?
-
I'll add that I've just run into this issue. It's been trying to publish one file (to S3) for the last 8 hours. The file is only ~20 GB in size; I know because I've run the pipeline on this exact same data before. I don't know whether I should kill and rerun the pipeline or let it keep going.
-
Hi all, I ran into the same issue using Nextflow version 22.10.0 in combination with the Azure Batch executor. I also saw this line in my log files:
[main] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'FileTransfer' minSize=10; maxSize=36; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false
Is there a fix for this issue or a way to update these settings?
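One thing that may be worth checking, assuming a Nextflow release recent enough to honour the threadPool configuration scope (an assumption, not a confirmed fix for 22.10.0), is whether the 'FileTransfer' pool can be resized from nextflow.config, for example:
// Untested sketch: the threadPool scope may not be supported by older releases
threadPool {
    FileTransfer {
        maxThreads = 64   // example value only
    }
}
If the setting is picked up, the log line above should report the new maxSize.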
-
An interesting data point... We seem to be hitting this when the output is a directory, but not individual file globs. For example, the following will fail consistently (and take HOURS to fail):
But this will succeed within minutes (it's the exact same set of files):
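As a purely hypothetical illustration (not the original snippets; process, bucket, and file names are made up), the difference is roughly between publishing a whole directory output and publishing individual file globs:
// Hypothetical sketch of the two output patterns
process MAKE_RESULTS {
    publishDir "s3://my-bucket/results", mode: "copy"   // made-up destination

    output:
    // path "outdir"            // variant 1: publish the whole directory (the slow case reported above)
    path "outdir/*.txt"         // variant 2: publish individual files via a glob (the fast case)

    script:
    """
    mkdir -p outdir
    for i in 1 2 3; do echo sample \$i > outdir/sample_\$i.txt; done
    """
}

workflow {
    MAKE_RESULTS()
}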
-
I am also encountering the same issue running the pipeline on AWS Batch. The same pipeline did not have this issue a few months ago. When the pipeline completes in about 2 h and needs to transfer around 80 GB total across multiple files, I'm getting a message
However, the run is still ongoing at 20 h. I also tried changing the outputs to individual file globs (which have worked for me in the other pipeline), but that did not seem to help here.
-
I am experiencing this same issue but am not writing to S3 (though I am using an EC2 instance). My log message:
No processes have run for the last 11 hours, though there are still more processes to run in this step. It looks like it finished process ~795/1328 before pausing. There are no files in the output directory, and definitely more than 7 have been created in the work dir and need to be moved there (~795, to be exact). The files themselves aren't very large, most under a MB. I tried this before with a small test subset of samples and it worked just fine. I have plenty of space left on the disk. My Nextflow version is 22.10.1, with DSL2.
-
One thing you all can try is to enable virtual threads. See this blog post for details, but here's the gist:
Then virtual threads will be enabled automatically. I have done some benchmarks and found that this feature can significantly reduce the time to publish files at the end of the workflow, especially when copying from S3 to S3. I haven't done any benchmarks for Google Cloud Storage, but there might be some benefit; worth trying in any case.
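A rough sketch of the usual setup (not the original gist; the Java package identifier below is just an example, and the environment variable is an assumption that applies to Nextflow releases with virtual-thread support on Java 19/20, whereas newer Java versions reportedly need no flag at all):
# Install a recent Java via SDKMAN (identifier is an example)
sdk install java 21.0.2-tem
sdk use java 21.0.2-tem

# On Java 19/20, opt in explicitly (assumption: honoured by your Nextflow version)
export NXF_ENABLE_VIRTUAL_THREADS=true

nextflow run <pipeline> ...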
-
I've just run into this issue again and have been facing it for the last few months. I tried switching NF versions; however, this results in an OOM error from Java (see below). My processes don't even complete before being terminated.
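If the OOM is coming from the Nextflow head process itself (an assumption), one knob worth knowing about is the NXF_OPTS environment variable, which passes JVM options such as the maximum heap size to the Nextflow launcher; the values below are examples only:
# Example only: give the Nextflow launcher JVM more heap
export NXF_OPTS='-Xms1g -Xmx8g'
nextflow run <pipeline> ...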
-
Hello,
I have been using Nextflow for a while now, but primarily on AWS; recently I switched to using it on Google Cloud. I noticed that once the jobs are finished, the pipeline takes a really long time to transfer/copy files from the work directory to the final destination folder. The work directory is in the same bucket as the final destination.
For reference, here is my publishDir directive:
publishDir params.alignments_outdir, mode: "copy"
And the log statements to highlight how long it took:
For a total of ~11 hours. The biggest files in this group are about 6.3 GB each, and there are 154 of those; the rest are small .json files, a few MB each. The total size being transferred is a little less than 1 TB.
I have transferred that much data between folders in a bucket on Google Cloud before using the gsutil command, and it takes nowhere near as much time. For example, using a VM on Google Cloud to transfer files between two folders in the same bucket:
[1.4 GiB/ 2.1 TiB] 32% Done 3.3 GiB/s ETA 00:07:16
using the command
gsutil -m cp -r
so it should be much faster than the whopping 11 hours it took. I also believe that Nextflow uses the gsutil cp command as well.
I was wondering if maybe there is a setting I am missing? I have looked around for solutions to this specific issue on Google Cloud but haven't found one that works. Here is my nextflow.config gls profile:
gls {
    process.executor = "google-lifesciences"
    docker.enabled = true
    google.location = "us-west2"
    google.region = "us-west1"
    google.lifeSciences.cpuPlatform = "Intel Skylake"
    google.lifeSciences.bootDiskSize = "100.GB"
    google.storage.parallelThreadCount = 100
    google.storage.maxParallelTransfers = 100
}
I was looking through the logs and found this statement, and I was wondering if it's slow because it's limited to 4 transfers at a time, but I can't find a setting to increase it to check. I added the "parallelThreadCount" and "maxParallelTransfers" options to my config, but neither of these seems to change the 'FileTransfer' thread pool size.
Jul-14 03:48:18.471 [Task monitor] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'FileTransfer' minSize=4; maxSize=4; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false
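One untested possibility, again assuming a Nextflow version recent enough to honour the threadPool configuration scope (an assumption rather than a documented guarantee for every release), would be to add something like the following alongside the gls profile and check whether the 'FileTransfer' maxSize reported in the log changes:
threadPool.FileTransfer.maxThreads = 32   // example value; untested sketch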