
Mutect implementation work #394

Open: jstjohn wants to merge 13 commits into master
Conversation

@jstjohn commented Mar 13, 2016

An initial rough version of the MuTect algorithm, mostly ported over from work on Avocado.


This change is reviewable on Reviewable.io

@ryan-williams (Member)

Thanks for the PR @jstjohn! One of @timodonnell, @arahuja, or I will have a look this week.

Any interest in a quick call to discuss your interest in Guacamole? Shoot me an email (address on my profile) if so!

@timodonnell (Member)

Thanks for the contribution @jstjohn. Looks awesome. A couple of quick comments (we'll add more as people look at this in more depth):

  • I think it'd be helpful to document what subset of MuTect 1's features this implements, and any known differences between the two implementations. For example, is filtering by the panel of normals supported (or is that what the noisy VCF is for)? This could just be a big header comment in SomaticMutectLikeCaller.scala, or potentially added to the docs. It would also be useful to indicate (maybe in the README) how mature we can expect this code to be, e.g. how many full datasets it has been run on and compared against MuTect calls.
  • I think a rebase is needed -- this PR seems to include some unrelated commits from me and Arun.

Code style

  • We've been trying to follow the convention of avoiding abbreviations, except for extremely common ones like num. Renaming some of your identifiers (e.g. pu -> pileupElement) would help keep this consistent.
  • We've also been trying to keep a correct header comment with every non-trivial method.

Also, I tried running:

$ scripts/guacamole somatic-mutect-like \
    --normal-reads src/test/resources/synth1.normal.100k-200k.withmd.bam \
    --tumor-reads src/test/resources/synth1.tumor.100k-200k.withmd.bam \
    --out /tmp/out.vcf

And I got this serialization error:

--> [0.02 sec. later]: Loci partitioning: 1:0-249250620=0,10:0-135534746=0,11:0-135006515=0,12:0-133851894=0,13:0-115169877=0,14:0-107349539=0,15:0-102531391=0,16:0-90354752=0,17:0-81195209=0,18:0-78077247=0,19:0-59128982=0,2:0-243199372=0,20:0-63025519=0,21:0-48129894=0,22:0-51304565=0,3:0-198022429=0,4:0-191154275=0,5:0-180915259=0,6:0-171115066=0,7:0-159138662=0,8:0-146364021=0,9:0-141213430=0,GL000191.1:0-106432=0,GL000192.1:0-547495=0,GL000193.1:0-189788=0,GL000194.1:0-191468=0,GL000195.1:0-182895=0,GL000196.1:0-38913=0,GL000197.1:0-37174=0, [...]
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:742)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:741)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:741)
    at org.hammerlab.guacamole.DistributedUtil$.windowTaskFlatMapMultipleRDDs(DistributedUtil.scala:628)
    at org.hammerlab.guacamole.DistributedUtil$.windowFlatMapWithState(DistributedUtil.scala:395)
    at org.hammerlab.guacamole.DistributedUtil$.pileupFlatMapTwoRDDs(DistributedUtil.scala:323)
    at org.hammerlab.guacamole.commands.SomaticMutectLike$Caller$.run(SomaticMutectLikeCaller.scala:192)
    at org.hammerlab.guacamole.commands.SomaticMutectLike$Caller$.run(SomaticMutectLikeCaller.scala:154)
    at org.hammerlab.guacamole.SparkCommand.run(Command.scala:55)
    at org.hammerlab.guacamole.Command.run(Command.scala:46)
    at org.hammerlab.guacamole.Guacamole$.main(Guacamole.scala:71)
    at org.hammerlab.guacamole.Guacamole.main(Guacamole.scala)
Caused by: java.io.NotSerializableException: org.hammerlab.guacamole.commands.SomaticMutectLike$Arguments
Serialization stack:
    - object not serializable (class: org.hammerlab.guacamole.commands.SomaticMutectLike$Arguments, value: org.hammerlab.guacamole.commands.SomaticMutectLike$Arguments@3611153f)
    - field (class: org.hammerlab.guacamole.commands.SomaticMutectLike$Caller$$anonfun$6, name: args$1, type: class org.hammerlab.guacamole.commands.SomaticMutectLike$Arguments)
    - object (class org.hammerlab.guacamole.commands.SomaticMutectLike$Caller$$anonfun$6, <function2>)
    - field (class: org.hammerlab.guacamole.DistributedUtil$$anonfun$pileupFlatMapTwoRDDs$1, name: function$3, type: interface scala.Function2)
    - object (class org.hammerlab.guacamole.DistributedUtil$$anonfun$pileupFlatMapTwoRDDs$1, <function2>)
    - field (class: org.hammerlab.guacamole.DistributedUtil$$anonfun$windowFlatMapWithState$1, name: function$2, type: interface scala.Function2)
    - object (class org.hammerlab.guacamole.DistributedUtil$$anonfun$windowFlatMapWithState$1, <function3>)
    - field (class: org.hammerlab.guacamole.DistributedUtil$$anonfun$windowTaskFlatMapMultipleRDDs$3, name: function$5, type: interface scala.Function3)
    - object (class org.hammerlab.guacamole.DistributedUtil$$anonfun$windowTaskFlatMapMultipleRDDs$3, <function2>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    ... 18 more
       12.95 real        28.89 user         2.31 sys

Do you get that as well? Our command-line parsing library (args4j) is kind of a hassle, and I haven't found a way to make the parsed arguments serializable. The solution to this is either to copy the args over to vals when they are referenced in a Spark serialized closure, or else copy all of the args over to a case class (which is serializable).
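
For illustration only, a minimal sketch of the two workarounds described above; the class shapes and field names are simplified placeholders, not the actual Guacamole code:

    import org.apache.spark.rdd.RDD

    // Placeholder for an args4j-populated Arguments class (not serializable by default).
    class Arguments {
      var minAlignmentQuality: Int = 0
    }

    object ClosureSketch {
      // Option 1: copy the needed arg into a local val, so the Spark closure
      // captures only the primitive value rather than the Arguments object.
      def run(args: Arguments, reads: RDD[Int]): Long = {
        val minAlignmentQuality = args.minAlignmentQuality
        reads.filter(_ >= minAlignmentQuality).count()
      }
    }

    // Option 2: copy all args into a case class (case classes are Serializable)
    // and pass that into any code that runs inside Spark closures.
    case class CallerConfig(minAlignmentQuality: Int)
    object CallerConfig {
      def fromArguments(args: Arguments): CallerConfig =
        CallerConfig(args.minAlignmentQuality)
    }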

@ryan-williams (Member)

> The solution to this is either to copy the args over to vals when they are referenced in a Spark serialized closure, or else copy all of the args over to a case class (which is serializable).

Adding a with Serializable to Arguments would fix the serialization error too, right?
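
For reference, a minimal sketch of that suggestion; ExistingArgumentsBase stands in for whatever the real Arguments class in SomaticMutectLikeCaller.scala currently extends:

    // Placeholder for the existing parent of Arguments.
    trait ExistingArgumentsBase

    // Mixing in Serializable lets Java serialization handle the captured
    // Arguments instance instead of throwing NotSerializableException.
    class Arguments extends ExistingArgumentsBase with Serializable {
      // ...args4j-annotated fields unchanged...
    }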

Inline review comment from a member on this snippet:

    pu.referenceBase,
    pu.elements.filterNot(pe => {
      val clippedFilter = isReadHeavilyClipped(pe.read, maxFractionBasesSoftClippedTumor)
      val noisyFilter = pe.read.mdTagOpt.map(m =>
nit: IntelliJ points out that .map(…).getOrElse(false) should be replaced with .exists(…)
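
For illustration, here is the suggested refactor applied to a standalone, simplified example; the MD-tag value and the noisiness predicate are made-up placeholders, not Guacamole's real types or logic:

    // Standalone sketch of replacing Option.map(...).getOrElse(false) with Option.exists(...).
    object ExistsRefactorExample {
      // Hypothetical predicate standing in for the real noisy-read check.
      def isNoisy(mdTag: String): Boolean = mdTag.count(_.isLetter) > 3

      def main(args: Array[String]): Unit = {
        val mdTagOpt: Option[String] = Some("10A5^AC6T2")

        // Before: map to a Boolean, then default to false when the Option is empty.
        val noisyBefore = mdTagOpt.map(m => isNoisy(m)).getOrElse(false)

        // After: exists collapses both steps and reads more directly.
        val noisyAfter = mdTagOpt.exists(m => isNoisy(m))

        assert(noisyBefore == noisyAfter)
        println(s"noisy = $noisyAfter")
      }
    }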

@jstjohn (Author) commented Mar 15, 2016

The serialization issue seems to be fixed with the suggested with Serializable addition to Arguments. I have also gone through, updated many of the function names, and added docs to the functions. I uploaded a README to the docs folder outlining the MuTect algorithm and the differences I am aware of.


Review status: 0 of 17 files reviewed at latest revision, 3 unresolved discussions.


src/main/scala/org/hammerlab/guacamole/commands/SomaticMutectLikeCaller.scala, line 303 [r3] (raw file):
Done.


src/main/scala/org/hammerlab/guacamole/commands/SomaticMutectLikeCaller.scala, line 479 [r3] (raw file):
Done.


Comments from the review on Reviewable.io
