Use sharded BAM output by default on Spark, and write utilities and design APIs to make it as easy to work with as single BAMs #1409

Closed
droazen opened this issue Jan 5, 2016 · 3 comments

@droazen
Contributor

droazen commented Jan 5, 2016

No description provided.

@droazen droazen added this to the beta milestone Jan 5, 2016
@droazen droazen added the Spark label Jan 5, 2016
@droazen droazen modified the milestones: alpha-1, beta Jan 6, 2016
@tomwhite
Contributor

I've sketched what this might look like at https://github.com/broadinstitute/gatk/compare/tw_sharded_bam_reader. I'd be happy for someone else to take this over.

A few notes about the status:

  • Currently the Hadoop Path API is used. This should be changed to use the java.nio.file.Path API (see #1426, "Add support for java.nio.file.Path to SamReaderFactory in htsjdk").
  • SamReader.PrimitiveSamReaderToSamReaderAdapter is not public, so htsjdk will need a change to make this easier to integrate. Another possibility would be to build this into htsjdk itself.
  • Many of the methods on SamReader assume that an index is present. To support these, there needs to be more work on how indexes are created and merged for sharded BAMs.
  • Comprehensive tests are needed.
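
To make the first note concrete, here is a minimal sketch of how the shards of a sharded BAM might be enumerated with the java.nio.file API instead of the Hadoop Path API. It is not taken from the linked branch. The class `ShardedBamLister`, the method `listShards`, and the `part-*` naming (the usual Hadoop output convention) are assumptions made for illustration only.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper; not part of GATK, htsjdk, or Hadoop-BAM. */
final class ShardedBamLister {
    private ShardedBamLister() {}

    /**
     * Returns the shard files inside a sharded BAM directory, sorted by path.
     * Assumption: shards are regular files named "part-*"; marker files such
     * as "_SUCCESS" and hidden checksum files do not match the glob.
     */
    static List<Path> listShards(Path shardedBamDir) throws IOException {
        List<Path> shards = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(shardedBamDir, "part-*")) {
            for (Path candidate : stream) {
                if (Files.isRegularFile(candidate)) {
                    shards.add(candidate);
                }
            }
        }
        // Directory iteration order is unspecified, so sort explicitly.
        Collections.sort(shards);
        return shards;
    }
}
```

A reader for the whole sharded BAM could then open each returned path in order. Index-dependent SamReader methods would still need the index work described above.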

@droazen
Contributor Author

droazen commented Mar 10, 2016

This one will not be done for alpha-1; moving milestones.

@droazen droazen modified the milestones: alpha-2, alpha-1 Mar 10, 2016
@droazen droazen modified the milestones: alpha-3, alpha-2 Jul 1, 2016
@tomwhite
Contributor

tomwhite commented Jul 4, 2016

I think the core of this work belongs in Hadoop-BAM, along with the code to merge sharded BAM files. I'll leave this open for the GATK integration.

Opened HadoopGenomics/Hadoop-BAM#110.
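
For orientation, the byte-level core of merging shards can be sketched as below. This is not the Hadoop-BAM implementation. It shows only plain concatenation and assumes the parts were written so that concatenating them is valid. A real BAM merge also has to deal with the header and the BGZF end-of-file marker, which this sketch ignores. The class `ShardConcatenator` and the method `concatenate` are hypothetical names.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper; not part of GATK, htsjdk, or Hadoop-BAM. */
final class ShardConcatenator {
    private ShardConcatenator() {}

    /**
     * Appends the bytes of each part, in list order, to a single output file.
     * Assumption: the caller supplies parts for which byte concatenation
     * yields a valid file; no header or end-of-file handling is done here.
     */
    static void concatenate(List<Path> parts, Path output) throws IOException {
        // Creates the output file, or truncates it if it already exists.
        try (OutputStream out = Files.newOutputStream(output)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
    }
}
```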

@tomwhite tomwhite removed their assignment Jul 11, 2016
@droazen droazen modified the milestones: 4.0 release, alpha-3 Mar 20, 2017
@droazen droazen removed this from the Engine-4.0 milestone Oct 17, 2017