[Feature request] Easier way to create CSV files #4821
Comments
I have been using collectFile for this |
Just to add more details than sending a link to sarek. So this should do the trick in an easier/nicer way:
Having a delimiter field would be nice to handle both What I found was the most tricky part with the |
It would be interesting to know more about the use case. For example, #4670 discussed implicit creation of a CSV-based index file for outputs |
For Sarek, the use case is simple: when we started working on it and using it to remap the data for the Swedish Genome Project, we had some hardware issues, so resumability was often not working properly, and around 10% of the samples run were failing and couldn't resume. In my opinion, the way we do that in fetchngs, where we create such files to be the entry point for other pipelines, is the same idea. But my guess is that one could also want to create a file for reporting, or even to gather results to be used by a downstream tool. |
The problem with creating files in an operator is that there is no task hash associated with an operator execution, which hinders caching and provenance tracking. An Having said that, I understand that an See also: |
I have seen this pattern happen again and again where a framework implements some special way to execute code (promises in JS, observables, etc) that requires every new function under the sun to be re-implemented in this special way. I don't want us to repeat this cycle with operators. A common example I see is, a user wants to use the ch_fastq | flatMap { splitFastq(it) } But then what's the point of the This is basically how I feel about Operators should be used when they provide some kind of dataflow functionality. The |
Regarding the
It's interesting to me how If only there were some native way to do these things in Nextflow... |
Simple sketch for an inline
ch_records
| collect
| exec(name: 'RECORDS_TO_CSV') { records -> mergeCsv(records) }
The It is more verbose than just having a The same treatment can be applied to |
@bentsherman regarding the use of a Nextflow process for this, especially some of the issues with a Nextflow process for this included having to launch a new container instance, potentially on a new cloud instance, at least when using a using on the other hand, it's worth noting that CSV has an actual spec https://datatracker.ietf.org/doc/html/rfc4180 which includes things like carriage returns. You'll also need to handle quoting, and potentially escape characters, I think. CSV is such a mess of a format, though, in practice with real-life files, that it would be great if its usage could be discouraged. Not sure that will ever happen though. It would be nice if we could convince people to stop using CSV so much and just switch to something like JSON, which coincidentally makes your pipeline code easier as well, since it supports typing and nested data structures, so you don't need ridiculous things like multiple samplesheets or samplesheets with grouping fields duplicated per row, etc. If a tabular delimited text file must be written, my experience has been that TSV is generally a better option. |
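To make the quoting point concrete, here is a minimal plain-Groovy sketch of RFC 4180-style field quoting. The helper names `csvField` and `csvRecord` are hypothetical and are not part of Nextflow; the sketch only covers quoting and the doubling of embedded double quotes, not line endings or encodings.

```groovy
// Hypothetical helper (not a Nextflow API): quote a single field per RFC 4180.
String csvField(Object value, String sep = ',') {
    String s = value == null ? '' : value.toString()
    // A field needs quoting if it contains the separator, a double quote, or a line break
    boolean needsQuoting = s.contains(sep) || s.contains('"') || s.contains('\n') || s.contains('\r')
    // Embedded double quotes are escaped by doubling them
    return needsQuoting ? '"' + s.replace('"', '""') + '"' : s
}

// Hypothetical helper: join already-quoted fields into one record (no line terminator added)
String csvRecord(List fields, String sep = ',') {
    return fields.collect { csvField(it, sep) }.join(sep)
}

println csvRecord(['sample1', 'a,b', 'say "hi"'])
```

Passing `'\t'` as `sep` gives the TSV variant, where commas no longer force quoting but tabs do.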
Agree that a |
hey maybe stuff like this would be good candidates for inclusion in something like an nf-core community utility Nextflow plugin :) |
This thread derailed a bit from the original request. @vdauwera what's your use case for having a |
The specific use case was a section of GATK pipeline that proceeds as follows:
The second step requires a sample map, defined as a TSV file (that's tab separated, not a typo) listing sample ID (as used in the original data RG:SM tag, and subsequently in the VCF) and the file paths to the GVCF and its index for each sample; one sample per line. This is a requirement of the GATK tools involved. |
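As a rough illustration of the sample map described above, the following plain-Groovy sketch builds one tab-separated line per sample from records of sample ID, GVCF path, and index path. The function name `sampleMapText`, the sample names, and the file paths are all made up for the example; inside a pipeline, the same per-record string could presumably be produced in a closure and concatenated with `collectFile`, as mentioned earlier in the thread.

```groovy
// Hypothetical helper: one line per sample, fields separated by tabs.
// Each record is assumed to be [sampleId, gvcfPath, indexPath].
String sampleMapText(List<List> records) {
    return records.collect { rec ->
        rec.collect { it.toString() }.join('\t') + '\n'
    }.join('')
}

// Illustrative records only; these names and paths are invented
def records = [
    ['sampleA', '/data/sampleA.g.vcf.gz', '/data/sampleA.g.vcf.gz.tbi'],
    ['sampleB', '/data/sampleB.g.vcf.gz', '/data/sampleB.g.vcf.gz.tbi'],
]

print sampleMapText(records)
```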
Sorry for the digression, I will save my notes elsewhere. We can consider the inline exec idea separately. My point is that operators that write files are pesky and should be avoided in favor of regular functions where possible. @vdauwera this is possible to do with some custom Groovy code, e.g. the @stevekm we could also have a |
I don't believe so. For example |
IMO the operator is justified for these kinds of small operations where you need to generate a small intermediate file that's needed for technical reasons but is not a proper data output of the pipeline. |
I agree that |
Also automatic cleanup, which relies on the lifecycle of tasks and publishing, and has no visibility of files created by operators like |
@vdauwera I have implemented a https://github.com/bentsherman/nf-boost I know you're writing content and a plugin might be a weird thing to introduce, but this is a nice way for you to play with an experimental version before we add it as a core feature. There is an example pipeline which shows you how to use it. Feel free to try it out and let me know if there's anything more you'd like to see from it. |
Hey, sorry for the lag. Will definitely try this out. Probably don't want to introduce a plugin in the most basic training module but this might actually slot nicely into a follow-up. As a way to introduce plugins with a basic use case that builds on an earlier training. |
Agreed. Do let me know if it at least satisfies your use case so that we can work towards adding it as a core feature |
Hello, could someone help me with this? WARN: Input directory 'Control_R1' was found, but sample sheet '/mnt/c/Users/juanj/Desktop/maestria ric/Sample_sheet.csv' has no such entry. My sample sheet is: |
New feature
We have had the splitCSV operator for a long time and it's commonly used. @vdauwera was asking me how to do the reverse - write to a CSV file - and I found it surprisingly difficult to find a clean syntax.
Usage scenario
The example that @vdauwera had was a process with the following output:
Going into another process which collects those outputs and needs a CSV sample sheet file.
The closest I could find was this CSV writing implementation in nf-core/fetchngs:
Only four lines of code, but it has to happen within a dedicated process and it's kind of ugly 👀 It'd be nicer to be able to do this with a channel operator on the way into a process, like with collectFile.
Suggest implementation
Ideally could extend collectFile to be able to handle tuples like this with control of separators etc. Alternatively a new core Nextflow function to do this (toCSV?), or if "there's an easy way to do this in Groovy" then a new pattern to point people to in the Nextflow Patterns docs could also work.
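For the "easy way to do this in Groovy" option, a minimal sketch might look like the following. The function name `recordsToCsv` and its `sep` and `header` options are assumptions made for illustration, not an existing Nextflow function, and it takes a list of maps rather than tuples so that a header row can be derived. It also assumes values contain no separator, quote, or newline characters, so no quoting is attempted.

```groovy
// Hypothetical function; the name and options are assumptions, not a Nextflow API
String recordsToCsv(List<Map> records, String sep = ',', boolean header = true) {
    if (!records) return ''
    // Column order is taken from the keys of the first record
    List keys = records[0].keySet() as List
    List<String> lines = []
    if (header) lines << keys.join(sep)
    records.each { rec ->
        // Assumes values contain no separator, quote, or newline characters
        lines << keys.collect { k -> rec[k] == null ? '' : rec[k].toString() }.join(sep)
    }
    return lines.join('\n') + '\n'
}

// Illustrative rows only; the sample IDs and file names are invented
def rows = [
    [id: 'sampleA', fastq_1: 'a_1.fastq.gz', fastq_2: 'a_2.fastq.gz'],
    [id: 'sampleB', fastq_1: 'b_1.fastq.gz', fastq_2: 'b_2.fastq.gz'],
]

print recordsToCsv(rows)               // comma-separated, with header row
print recordsToCsv(rows, '\t', false)  // tab-separated, no header row
```

Exposing the separator as a parameter covers both CSV and TSV output from one function, which is the kind of control over separators asked for above.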