Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Somatic update #156

Open
7 of 11 tasks
eudesbarbosa opened this issue Jun 21, 2022 · 7 comments · May be fixed by #159
Open
7 of 11 tasks

Somatic update #156

eudesbarbosa opened this issue Jun 21, 2022 · 7 comments · May be fixed by #159
Assignees

Comments

@eudesbarbosa
Copy link
Member

eudesbarbosa commented Jun 21, 2022

Tasks

Tasks extracted from PR #94 :

Based on commit 0a30513

  • Unify output filenames for Control Freec and CopyWriter.
  • Consolidate GRCh37 and GRCh38.

Based on commit e3be244

  • Output both Full and Filtered VCF files for mutect2.
  • Rename 'common' to 'biallelic' variants - follow GATK convention.

Based on commit 1a45161

  • Remove parallel somatic variant annotation.
  • Catch multiple filter conditions in eb_filter.
  • Log conda information used for jannovar wrapper.

Based on commit c259444

  • Expand list of result expansion files in snappy_pipeline/workflows/somatic_variant_calling/__init__.py - mutect2 related.

Based on commit 18b8362

  • Fix log file in wrapper snappy_wrappers/wrappers/mutect2/contamination/wrapper.py
  • Fix log file in wrapper snappy_wrappers/wrappers/mutect2/filter/wrapper.py

Based on commit c469173

  • Update cbioportal_export default config.
  • Fix wrapper snappy_wrappers/wrappers/cbioportal/case_lists/wrapper.py.
  • Fix wrapper snappy_wrappers/wrappers/cbioportal/meta_files/wrapper.py.
  • Fix wrapper snappy_wrappers/wrappers/vcf2maf/vcf2maf/wrapper.py.

Based on commit d407f20

  • Disable collect target coverage when no target coverage files requested.

Based on commit f062fc5

  • Miscellaneous - hard to pin, probably just debugging.
@eudesbarbosa
Copy link
Member Author

For 'Unify output filenames for Control Freec and CopyWriter':
one should also consider unifying cnvetti_on_target_postprocess output. Currently using _ to separate the file extension:

output/bwa.cnvetti_on_target_postprocess.P00{i}-T{t}-DNA1-WGS1/out/bwa.cnvetti_on_target_postprocess.P00{i}-T{t}-DNA1-WGS1_{ext}

Line 290 - method CnvettiStepPartBase._get_output_files_postprocess()

@eudesbarbosa
Copy link
Member Author

I don't understand the changes to snappy_wrappers/wrappers/control_freec/transform/snappy-convert-control_freec.R:
-> They seem to be the reverse of the changes to the workflow.

--                                   ratios_fn=paste("freec.", sample_name, ".ratio.txt", sep=""),
--                                   log2_fn=paste("freec.", sample_name, ".log2.txt", sep=""),
--                                   call_fn=paste("freec.", sample_name, ".call.txt", sep=""),
--                                   segments_fn=paste("freec.", sample_name, ".segments.txt", sep=""),
+-                                   ratios_fn=paste("freec.", sample_name, "_ratio.txt", sep=""),
+-                                   log2_fn=paste("freec.", sample_name, "_gene_log2.txt", sep=""),
+-                                   call_fn=paste("freec.", sample_name, "_gene_call.txt", sep=""),
+-                                   segments_fn=paste("freec.", sample_name, "_segments.txt", sep=""),

For the function postProcess in snappy_wrappers/wrappers/copywriter/call/snappy-copywriter-call.R:
-> Can't the outputs be an argument, similar to control_freec_write_files, so they don't need to be agnostic towards the rest of the code?
Example:

--                         paste( mapper, ".copywriter.", fullID, "_segments.txt", sep="" ) ),
+-                         paste( mapper, ".copywriter.", fullID, ".segments.txt", sep="" ) ), 

@eudesbarbosa
Copy link
Member Author

Another points about control_freec_write_files, it seems to have default values but the code will stop if files are not accessible. Are defaults really necessary? Can't they just be defined/taken from the wrapper?

@ericblanc20
Copy link
Contributor

I am afraid I have created a monster here. I should have stuck to a naming convention for all CNV steps:

  • copywriter in somatic_targeted_seq_cnv_calling,
  • CNVkit in somatic_targeted_seq_cnv_calling & somatic_wgs_cnv_calling,
  • Control-FREEC in somatic_wgs_cnv_calling
  • cnvetti (germline)

You are welcome to unify the naming convention.

On a side note, I found that output filenames are configured differently in different places,even within the same step & tool (for example names are constructed with underscore and with dot).

I don't know which is the right way to go. Ideally, I would like to:

  • merge the somatic_targeted_seq_cnv_calling with the somatic_wgs_cnv_calling (there is no logical need to keep them apart, as the WES & WGS somatic snvs & indel calling are grouped in a single step)
  • unify using the dot as separator in filenames, the same way that the tools are separated.

This is probably a lot of unnecessary work, but perhaps we should consider moving to dot separation across the somatic CNV steps.

@ericblanc20
Copy link
Contributor

ericblanc20 commented Jun 28, 2022

Regarding the control_freec_write_files, I assume you are referring to the Bioconductor packages org.Hs.eg.db, TxDb.Hsapiens.UCSC.hg19.knownGene & BSgenome.Hsapiens.1000genomes.hs37d5.

This is a real problem, because these packages are huge (genome sequence, genome feature annotations, gene ids & functional annotations), and they are only valid for one genome release. If we want to use GRCh38, or mouse data, then we need to use other version of these packages.

The problem comes as the packages are downloaded using the wrapper's conda environment. So the only solution (at the moment) is to put in the conda environment annotation packages for several genome releases
and species.

We discussed with Clemens a way around this problem. We would have an initial sub-step which creates a sub-directory in $PWD/work in which required packages are installed. This would then be used as supplementary library path for the R scripts, provided by the wrapper. I have a pretty good idea how we could do that, but I need a few days to work on it.

@eudesbarbosa eudesbarbosa linked a pull request Jun 28, 2022 that will close this issue
@eudesbarbosa
Copy link
Member Author

Name pattern: Probably a good idea to stick to one across all workflows. I think I went over all the ones present in the original PR, but it might be something to gradually fix.

Workflow distinction: I suggest we keep the separation. Too much work for very little value.

control_freec_write_files: Way beyond my knowledge of the topic. I was just referring to the fact that there seems to be some default values for the output arguments - in this case you needed to modify them, you might consider have no defaults and avoid that:

--                                   ratios_fn=paste("freec.", sample_name, ".ratio.txt", sep=""),
+-                                   ratios_fn=paste("freec.", sample_name, "_ratio.txt", sep=""),

@ericblanc20
Copy link
Contributor

Sorry I misunderstood your comment. You are absolutely right: ratios_fn should be obtained from the wrapper.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants