Cram files? #15

Open
jjfarrell opened this issue Mar 13, 2018 · 4 comments

Comments

@jjfarrell

Does spark-bam handle cram files? If so, how does the reference get specified?

@ryan-williams
Member

I went through the motions of passing .cram loading through to hadoop-bam, but haven't tested it! You'd just call sc.loadReads, the same as with a .bam.

IIUC, you'd specify relevant options like the reference path the same way you do in hadoop-bam, e.g. as a property on the Hadoop Configuration (i.e. SparkContext.hadoopConfiguration).
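A minimal, untested sketch of setting the reference that way. The property key is an assumption: it is the one hadoop-bam's CRAMInputFormat is believed to read (REFERENCE_SOURCE_PATH_PROPERTY), so check it against the hadoop-bam version you have. The `CramReference` object and `setReference` helper are hypothetical names used only for illustration, and Spark is assumed to be on the classpath.

```scala
import org.apache.spark.SparkContext

object CramReference {
  // Assumed key: believed to match hadoop-bam's
  // CRAMInputFormat.REFERENCE_SOURCE_PATH_PROPERTY; verify for your version.
  val ReferenceSourcePathKey = "hadoopbam.cram.reference-source-path"

  // Hypothetical helper: records the reference FASTA location on the
  // Hadoop Configuration that the SparkContext hands to input formats.
  def setReference(sc: SparkContext, referenceUri: String): Unit =
    sc.hadoopConfiguration.set(ReferenceSourcePathKey, referenceUri)
}
```

After calling something like `CramReference.setReference(sc, "file:///path/to/reference.fa")` (placeholder path), the .cram would then be loaded with sc.loadReads as described above.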

Making those properties proper method-params to be more idiomatic Scala would be nice.

Feel free to post the results of trying it!

Alternatively, your application code can decide to call hadoop-bam or spark-bam based on the file's extension 🙁
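A rough sketch of that extension-based fallback, in plain Scala with no library calls. The `Loader` type and `loaderFor` function are hypothetical names, and the routing rule (.bam goes to spark-bam, everything else to hadoop-bam) is an assumption rather than anything either library prescribes.

```scala
object ReaderChoice {
  // Hypothetical tags for the two libraries an application might call into.
  sealed trait Loader
  case object SparkBam  extends Loader
  case object HadoopBam extends Loader

  // Assumed rule: use spark-bam for .bam files, and fall back to hadoop-bam
  // for anything else (e.g. .cram, where spark-bam's path is untested).
  def loaderFor(path: String): Loader =
    if (path.toLowerCase.endsWith(".bam")) SparkBam else HadoopBam
}
```

The application would then match on the result and call the corresponding library's own loading code.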

@ryan-williams
Member

Hey @jjfarrell, I looked through the related posts broadinstitute/gatk#4506 and HadoopGenomics/Hadoop-BAM#196 (comment) and am curious to dig in a little.

Is there a public .cram file you can point me at? I couldn't tell whether your adni/cram/ADNI_002_S_0413.hg38.realign.bqsr.cram is available anywhere.

@jjfarrell
Author

@ryan-williams That CRAM is not available. However, I am working on getting a CRAM of a GIAB sample made available for testing.

@jjfarrell
Author

@ryan-williams

Here is a publicly available CRAM from 1000 Genomes. Again, I found that the Spark GATK v4.0.2.1 job was quite slow processing this CRAM.

gatk FlagStatSpark --input 1000g/cram/HG00419.alt_bwamem_GRCh38DH.20150917.CHS.high_coverage.cram --reference file:///restricted/projectnb/casa/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa -- --spark-runner SPARK --spark-master yarn

Here are the URLs for the CRAM, its index (.crai), and the reference FASTA:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/CHS/HG00419/high_cov_alignment/HG00419.alt_bwamem_GRCh38DH.20150917.CHS.high_coverage.cram
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/CHS/HG00419/high_cov_alignment/HG00419.alt_bwamem_GRCh38DH.20150917.CHS.high_coverage.cram.crai
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa


2 participants