
Tool request: Fast/Ortho/Finder #1168

Closed
lecorguille opened this issue Feb 8, 2017 · 15 comments

Comments

@lecorguille
Member

In live from Gitter

It seems that this tool is needed by different groups.

I propose to put Victor (@Mataivic), a student, on this subject.
But @nsoranzo suggested integrating it into "a small hackathon coming up next week at my institute",
so we can take over afterwards, or at least test the output.

Also involved in the thread: @abretaud @pvanheus

@timdiels

You may also want to consider OrthoFinder, which is believed to be more accurate than OrthoMCL/FastOrtho.

@lecorguille
Member Author

For our project, we should build a little benchmark with these three tools and an old homemade script we plan to replace.

@Mataivic
Contributor

Mataivic commented May 3, 2017

@lecorguille OrthoFinder : I'm on it.

@nsoranzo
Member

nsoranzo commented May 3, 2017

@Mataivic We've already implemented a basic but working wrapper at https://github.com/NetBiol/hackathon2017/blob/master/galaxy-tools/orthofinder.xml . I can open a pull request here and you can work from that maybe?

@gregvonkuster
Contributor

@nsoranzo At first glance, it looks like your tools here https://github.com/NetBiol/hackathon2017/tree/master/galaxy-tools may complement my tools here https://github.com/gregvonkuster/galaxy_tools/tree/master/tools/plant_tribes. I'll keep an eye on your work. ;)

@lecorguille lecorguille changed the title Tool request: FastOrtho Tool request: Fast/Ortho/Finder May 22, 2017
@lecorguille
Member Author

@nsoranzo
What @Mataivic has already produced is a little more advanced. So thanks for the proposal, but we will continue from our base.

@nsoranzo
Member

@lecorguille Cool, no problem! Planning to submit it here?

@lecorguille
Member Author

Yes!
(but be nice, it's his first wrapper 😄)

This tool will be part of a bigger workflow developed in a dedicated GitHub repository.

@Mataivic
Contributor

Mataivic commented May 26, 2017

Hello,

I was away for two weeks and only really started working on the wrapper this week; here is what I've done so far: https://github.com/abims-sbr/tools-iuc/tree/orthofinder/tools/orthofinder

A whole bunch of tool options are implemented, but this draft does not yet handle incompatible options. Issues remain with dataset collections:

  • for outputs: for example, OrthoFinder writes its results into sub-directories, and I have not yet figured out how to handle dataset collections with sub-directories. I wrote more details in the readme.md file.
  • for inputs: for now, the wrapper works with multiple file inputs without using a dataset collection.
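One possible way around the sub-directory problem (a sketch only, not tested against this wrapper: the collection name and the Results directory are assumptions about where OrthoFinder writes, and the recurse attribute requires a Galaxy release that supports it) is to let <discover_datasets> walk the output tree:

```xml
<outputs>
    <!-- hypothetical collection; "Results" is assumed to be the tool's output dir -->
    <collection name="all_results" type="list" label="OrthoFinder results">
        <!-- recurse="true" (where supported) descends into sub-directories;
             __name_and_ext__ takes each element's identifier and datatype
             from the discovered file name -->
        <discover_datasets pattern="__name_and_ext__" directory="Results" recurse="true" />
    </collection>
</outputs>
```

Note that this flattens the hierarchy into one list, so element identifiers from different sub-directories could collide; a named-regex pattern capturing the relative path would be one way to keep them unique.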

@gregvonkuster
Contributor

@Mataivic I have built several Galaxy wrappers, available here https://github.com/gregvonkuster/galaxy_tools/tree/master/tools/plant_tribes, for the PlantTribes analysis pipelines here https://github.com/dePamphilis/PlantTribes. These tools are doing similar things to yours, so hopefully our work will be complementary.

I have dealt with the same issues you face with regard to outputs: my tools also produce directories of files, but I don't define the outputs as dataset collections because in most cases my outputs consist of multiple Galaxy datatypes, and dataset collections assume 1 datatype. Also, in some cases my tools produce multiple directory levels on output, and I'm not sure if/how dataset collections would handle directory hierarchies like this.

I've taken the approach of defining new Galaxy datatypes for these tools which are subclasses of the HTML datatype; they are in this PR: galaxyproject/galaxy#3999. These datatypes allow the directories of files to be placed in the primary dataset's extra_files_path. The primary dataset is rendered with these directories of files as clickable items, and multiple levels of directories can be browsed as well. Here is an example: the center panel is rendered when the eye icon is clicked on history item #118.

[screenshot: primary dataset rendered with browsable directories of files]

These tools form a workflow that typically proceeds in this order:

AssemblyPostProcessor -> GeneFamilyClassifier -> GeneFamilyIntegrator -> GeneFamilyAligner -> GeneFamilyPhylogenyBuilder

Tools downstream from those that produce these directories of files are written to consume them as inputs. The "end-point" tools in the workflow produce typical Galaxy datatypes so that general Galaxy features work on them. For example, the GeneFamilyPhylogenyBuilder tool produces a dataset collection of tree files with the Galaxy nhx datatype:

[screenshot: dataset collection of nhx tree files]

The elements of this dataset collection can then be rendered with the recently introduced Phylocanvas chart.

[screenshot: Phylocanvas chart rendering of a tree]

Although these tools wrap the PlantTribes analysis pipelines, they can be used to perform this same analysis on any genome. It would be great if you find these datatypes useful for your tools as well as it would help define a more standard approach for handling these directories of files.

@Mataivic
Contributor

@gregvonkuster Thank you, I'll have a look at it. I don't really know how to deal with Galaxy datatypes yet, but I'll do my best to learn quickly.

What do you mean by "dataset collections assume 1 datatype"? My output collections contain files with various file extensions (.txt, .csv and .faa), so I guess it means something other than file extensions?

@gregvonkuster
Contributor

I don't really know how to deal with Galaxy datatypes yet, but I'll do my best to learn quickly.

Here is a good explanation: https://galaxyproject.org/learn/datatypes/. If you choose to use my datatypes for your tool, you just define your tool outputs to use one of those datatypes via the format attribute, something like this:

<outputs>
    <data name="output_aln" format="ptalign"/>
</outputs>

What do you mean by "dataset collections assume 1 datatype"?

In the tool outputs section, dataset collections are defined with a format as well, something like this:

<outputs>
    <collection name="tree" type="list">
        <discover_datasets pattern="__name__" directory="some_dir" format="nhx" />
    </collection>
</outputs>

The format attribute defines the Galaxy datatype for all datasets within the collection, and dataset collections do not currently accommodate multiple Galaxy datatypes. So in the above tool, the dataset collection definition requires that all files within the some_dir directory will be the nhx Galaxy datatype.

My output collections contain files with various file extensions (.txt, .csv and .faa), so I guess it means something other than file extensions?

Galaxy "loosely" uses file extensions to categorize Galaxy datatypes, with each Galaxy datatype class having a file extension via the file_ext attribute (e.g., https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/datatypes/text.py#L24). Your directories contain files of multiple datatypes: the txt extension is associated with the Galaxy datatype class Text, csv is associated with (subclassed from) the Tabular class https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/datatypes/tabular.py#L879, and faa is associated with the Fasta class. So you will run into problems attempting to define your tool output as a single dataset collection. You will definitely have to use one of the new datatypes I've defined for these kinds of tools, or perhaps add a new one.

Can you provide some details about your outputs? That will help me tell you whether one of my existing datatypes can be used or whether you will need an additional datatype.

@Mataivic
Contributor

@gregvonkuster Thank you for the details !

Can you provide some details about your outputs?

Well, the outputs are several dataset collections, corresponding to the output files of the different steps of the tool:

  • A collection for the orthogroups (the first main results of the tool). It contains .csv and .txt files.
  • A collection for the working directory after the BLAST step (it is useful for the user to get these files, e.g. to re-run the tool from pre-computed BLAST results). It contains .txt and .faa files.
  • A collection containing the final results, which does not work for now since there are many subfolders. I don't remember each file exactly, but I guess there are several file extensions here as well.

Should I consider splitting each collection? Each collection would contain a single datatype, but I guess it would make a lot of outputs...

@gregvonkuster
Contributor

Well, the outputs are several dataset collections, corresponding to the output files of the different steps of the tool:

From this description, it sounds like you have a single tool that produces outputs at multiple steps, which implies that steps following an output step consume that output, do further processing, and produce more outputs. If this is the case, perhaps your tool should be split into multiple tools?

  • A collection for the orthogroups (the first main results of the tool). It contains .csv and .txt files.
  • A collection for the working directory after the BLAST step (it is useful for the user to get these files, e.g. to re-run the tool from pre-computed BLAST results). It contains .txt and .faa files.
  • A collection containing the final results, which does not work for now since there are many subfolders. I don't remember each file exactly, but I guess there are several file extensions here as well.

Except for your third item, it looks like your directories of files are fairly easy to handle. I'm not quite sure of the best approach, since I don't have context about the analyses your tool is attempting to perform. Do the tools (or tool processing steps) that consume the outputs assume the files are all in the same directory? My tools do. If so, you can probably still use dataset collections, but you'll need to account for a couple of important items.

  1. As stated previously, dataset collections require a single Galaxy datatype per collection. So for your first and second items you'll need separate collections: one for the .csv files and another for the .txt files in the first item, and one for the .faa files and another for the .txt files in the second item.
  2. If downstream tools (or tool processing steps) require all files to be in the same directory, your tool code will have to symlink all elements of the collections into a temporary directory to be consumed.
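The symlinking in item 2 can be done directly in the wrapper's <command> block with a Cheetah loop. A minimal sketch, assuming hypothetical input collection names csv_collection and txt_collection and a placeholder downstream command:

```xml
<command><![CDATA[
    mkdir -p staged &&
    ## link every element of both input collections into one directory,
    ## naming each link after the collection element's identifier
    #for $el in $csv_collection
        ln -s '$el' 'staged/${el.element_identifier}.csv' &&
    #end for
    #for $el in $txt_collection
        ln -s '$el' 'staged/${el.element_identifier}.txt' &&
    #end for
    ## placeholder for the actual processing step
    downstream_step --input-dir staged
]]></command>
```

Element identifiers are preserved across tool runs, so downstream steps can rely on the staged file names.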

The other approach would be to use one of the new datatypes I've created in the PR discussed above or add a new one yourself. My datatypes categorize the data in this way.

  • ptortho: Proteins orthogroup fasta files
  • ptorthocs: Protein and coding sequences orthogroup fasta files
  • pttgf: Targeted gene families
  • pttree: Phylogenetic trees
  • ptphylip: Orthogroup phylip multiple sequence alignments
  • ptalign: Proteins orthogroup alignments
  • ptalignca: Protein and coding sequences orthogroup alignments
  • ptaligntrimmed: Trimmed proteins orthogroup alignments
  • ptaligntrimmedca: Trimmed protein and coding sequences orthogroup alignments
  • ptalignfiltered: Filtered proteins orthogroup alignments
  • ptalignfilteredca: Filtered protein and coding sequences orthogroup alignments

If any of your outputs consist of datasets that are defined by any of those descriptions, that datatype could be used for your output. Or you could define a new datatype if needed.

A very important caveat regarding this approach is that these datatypes cannot currently be tested with the Travis test environment defined for this tools-iuc repository. The current Galaxy functional test framework does not accommodate datatypes that represent dynamic numbers of files of multiple datatypes contained within directory hierarchies.

In fact, I'm still working to get some functional tests built for several of my tools that use these datatypes. My approach for this is to incorporate Galaxy workflows for testing the tools. I have taken a look at this project https://github.com/phnmnl/wft4galaxy, but ran into this issue phnmnl/wft4galaxy#2, so I haven't pursued it. Instead, I'm trying to use planemo for testing the workflows, but I have yet to get this approach working.

Should I consider splitting each collection? Each collection would contain a single datatype, but I guess it would make a lot of outputs...

Based on the testing issues I've discussed above, this may be your best approach. But I only see this working for your first 2 items; I don't see how it will work for your third item, which is a hierarchy of directories of files of multiple datatypes. I'm not sure dataset collections will work for this (ping @jmchilton).

For your first 2 items, if you use dataset collections, I think you'll only need 2, so there won't be "a lot of outputs", only 2 collections. Of course, the number of elements in each collection may be very large, but that's OK.
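For the record, a sketch of what 2 collections for the first item could look like, assuming the wrapper gathers the orthogroup files into an orthogroups/ directory (the collection names, labels, and directory are illustrative only; the named regex group "name" sets each element's identifier):

```xml
<outputs>
    <!-- one collection per datatype, both discovered from the same directory -->
    <collection name="orthogroups_csv" type="list" label="Orthogroups (csv)">
        <discover_datasets pattern="(?P&lt;name&gt;.+)\.csv" directory="orthogroups" format="csv" />
    </collection>
    <collection name="orthogroups_txt" type="list" label="Orthogroups (txt)">
        <discover_datasets pattern="(?P&lt;name&gt;.+)\.txt" directory="orthogroups" format="txt" />
    </collection>
</outputs>
```

Splitting by extension like this keeps each collection to a single Galaxy datatype, which is the constraint discussed above.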

@abretaud
Contributor

OrthoFinder is in IUC and up to date now.
