Tool request: Fast/Ortho/Finder #1168

lecorguille · 2017-02-08T10:49:36Z

Live from Gitter:

It seems that this tool is needed by different groups.

I propose putting Victor (@Mataivic), a student, on this subject.
But @nsoranzo suggested integrating it at "a small hackathon coming up next week at my institute".
So we can take over afterwards, or at least test the output.

Also involved in the thread: @abretaud @pvanheus

timdiels · 2017-02-10T18:06:03Z

You may also want to consider OrthoFinder, it is believed to be more accurate than OrthoMCL/FastOrtho.

lecorguille · 2017-02-10T19:56:13Z

For our project, we should build a little benchmark with those 3 tools and an old homemade script we plan to replace.

Mataivic · 2017-05-03T07:40:42Z

@lecorguille OrthoFinder: I'm on it.

nsoranzo · 2017-05-03T20:09:06Z

@Mataivic We've already implemented a basic but working wrapper at https://github.com/NetBiol/hackathon2017/blob/master/galaxy-tools/orthofinder.xml . I can open a pull request here and you can work from that maybe?

gregvonkuster · 2017-05-04T11:34:32Z

@nsoranzo At first glance, it looks like your tools here https://github.com/NetBiol/hackathon2017/tree/master/galaxy-tools may complement my tools here https://github.com/gregvonkuster/galaxy_tools/tree/master/tools/plant_tribes. I'll keep an eye on your work. ;)

lecorguille · 2017-05-22T09:16:41Z

@nsoranzo
What @Mataivic has already produced is a little more advanced. So, thanks for the proposal, but we will continue from our own base.

nsoranzo · 2017-05-22T13:22:37Z

@lecorguille Cool, no problem! Planning to submit it here?

lecorguille · 2017-05-22T13:27:15Z

Yes!
(but be nice, it's his first wrapper 😄)

This tool should be integrated into a bigger workflow developed in a dedicated GitHub repository.

Mataivic · 2017-05-26T12:25:07Z

Hello,

I have been away for two weeks and only really started working on the wrapper this week; here is what I've done so far: https://github.com/abims-sbr/tools-iuc/tree/orthofinder/tools/orthofinder

A whole bunch of tool options are implemented, but this draft does not deal with incompatible options. Issues remain with dataset collections:

For outputs: OrthoFinder produces output in sub-directories, and I have not yet figured out how to deal with dataset collections that contain sub-directories. I wrote more details in the readme.md file.
For inputs: for now, the wrapper works with multiple input files without using a dataset collection.

gregvonkuster · 2017-05-26T14:22:29Z

@Mataivic I have built several Galaxy wrappers, available here https://github.com/gregvonkuster/galaxy_tools/tree/master/tools/plant_tribes, for the PlantTribes analysis pipelines here https://github.com/dePamphilis/PlantTribes. These tools are doing similar things to yours, so hopefully our work will be complementary.

I have dealt with the same issues you face with regard to outputs: my tools also produce directories of files. But I don't define the outputs as dataset collections because in most cases my outputs consist of multiple Galaxy datatypes, and dataset collections assume 1 datatype. Also, in some cases my tools produce multiple directory levels on output, and I'm not sure if/how dataset collections would handle directory hierarchies like this.

I've taken the approach of defining new Galaxy datatypes for these tools which are subclasses of the HTML datatype - they are in this PR: galaxyproject/galaxy#3999. These datatypes allow for the directories of files to be placed in the primary dataset's extra_files_path. The primary dataset is rendered with these directories of files as clickable items. Multiple levels of directories can be browsed as well. Here is an example - the center panel is rendered when the eye icon is clicked on history item # 118.
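To make the idea concrete, here is a minimal sketch of how a directory of files could be rendered as a page of clickable items. It is not the code from the PR: `render_index` and the demo file names are hypothetical, and the real datatypes rely on Galaxy's own classes rather than this standalone function.

```python
import html
import os
import tempfile


def render_index(root):
    """Return an HTML list with one link per file found under root.

    Hypothetical helper: it only illustrates listing a directory
    hierarchy as clickable items, it is not Galaxy code.
    """
    items = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            rel = rel.replace(os.sep, "/")  # use URL-style separators
            items.append(
                '<li><a href="%s">%s</a></li>'
                % (html.escape(rel, quote=True), html.escape(rel))
            )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


# Build a small two-level directory to try the helper on (made-up names).
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "alignments"))
for rel_path in ("summary.txt", os.path.join("alignments", "og1.faa")):
    with open(os.path.join(demo_root, rel_path), "w") as handle:
        handle.write("demo\n")

index_html = render_index(demo_root)
print(index_html)
```

Nested directories simply show up as longer relative paths in the links, which is why a hierarchy of any depth can be browsed from a single primary dataset.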

These tools form a workflow that typically proceeds in this order:

AssemblyPostProcessor -> GeneFamilyClassifier -> GeneFamilyIntegrator -> GeneFamilyAligner -> GeneFamilyPhylogenyBuilder

Tools downstream from those that produce these directories of files are written to consume them as inputs. The "end-point" tools in the workflow produce typical Galaxy-like datatypes so that general Galaxy features work on them. For example, the GeneFamilyPhylogenyBuilder tool produces a dataset collection of tree files with the Galaxy nhx datatype.

The elements of this dataset collection can then be rendered with the recently introduced Phylocanvas chart.

Although these tools wrap the PlantTribes analysis pipelines, they can be used to perform this same analysis on any genome. It would be great if you find these datatypes useful for your tools as well as it would help define a more standard approach for handling these directories of files.

Mataivic · 2017-05-26T14:43:29Z

@gregvonkuster Thank you, I'll have a look at it. I don't really know how to deal with Galaxy datatypes yet, but I'll do my best to learn that quickly.

What do you mean by "dataset collections assume 1 datatype"? My output collections contain files with various file extensions (.txt, .csv and .faa), so I guess it means something other than file extensions?

gregvonkuster · 2017-05-26T15:31:26Z

> I don't really know how to deal with Galaxy datatypes yet, but I'll do my best to learn that quickly.

Here is a good explanation https://galaxyproject.org/learn/datatypes/. If you choose to use my datatypes for your tool, you will just define your tool outputs to use one of those datatypes using the format attribute, something like this:

<outputs>
    <data name="output_aln" format="ptalign"/>
</outputs>

> What do you mean by "dataset collections assume 1 datatype"?

In the tool outputs section, dataset collections are defined with a format as well, something like this:

<outputs>
    <collection name="tree" type="list">
        <discover_datasets pattern="__name__" directory="some_dir" format="nhx" />
    </collection>
</outputs>

The format attribute defines the Galaxy datatype for all datasets within the collection, and dataset collections do not currently accommodate multiple Galaxy datatypes. So in the above tool, the dataset collection definition requires that all files within the some_dir directory will be the nhx Galaxy datatype.
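As a rough illustration of that constraint, the toy model below assigns one format to every file found in a directory, whatever its extension. It is not Galaxy's discovery code: `discover_as_single_format` and `extensions_present` are hypothetical helpers, and the file names are made up.

```python
import os
import tempfile


def discover_as_single_format(directory, fmt):
    """Toy model: every file in directory becomes an element typed fmt."""
    return [
        {"name": name, "format": fmt}
        for name in sorted(os.listdir(directory))
        if os.path.isfile(os.path.join(directory, name))
    ]


def extensions_present(directory):
    """Return the set of file extensions actually present in directory."""
    return {
        os.path.splitext(name)[1].lstrip(".")
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    }


# A directory that mixes tree files with a text file (made-up names).
some_dir = tempfile.mkdtemp()
for name in ("tree1.nhx", "tree2.nhx", "notes.txt"):
    with open(os.path.join(some_dir, name), "w") as handle:
        handle.write("demo\n")

elements = discover_as_single_format(some_dir, "nhx")
mixed = extensions_present(some_dir) != {"nhx"}
print(elements)
print("directory mixes datatypes:", mixed)
```

In this model notes.txt ends up labelled nhx along with the trees, which is the kind of mismatch a mixed directory causes when a collection carries a single format.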

> My output collections contain files with various file extensions (.txt, .csv and .faa), so I guess it means something other than file extensions?

Galaxy "loosely" uses file extensions to categorize Galaxy datatypes, with each Galaxy datatype class having a file extension via the file_ext attribute (e.g., https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/datatypes/text.py#L24). Your directories contain files of multiple datatypes: file ext txt is associated with the Galaxy datatype class Text, file ext csv is associated with a class subclassed from the Galaxy datatype class Tabular (https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/datatypes/tabular.py#L879), and file ext faa is associated with the Galaxy datatype class Fasta. Because of this, you will bump into problems attempting to define your tool output as a dataset collection. You will definitely have to use one of the new datatypes I've defined for these kinds of tools, or perhaps add a new one.

Can you provide some details about your outputs? That will help me tell you which of my existing datatypes can be used, or whether you will need an additional datatype.

Mataivic · 2017-05-26T16:42:00Z

@gregvonkuster Thank you for the details!

> Can you provide some details about your outputs?

Well, the outputs are several dataset collections, which correspond to the output files of different steps of the tool:

A collection for the orthogroups (the first main results of the tool). It contains .csv and .txt files.
A collection for the Working Directory after the blast (it is useful for the user to get these files if they want to run the tool from pre-computed blast results). It contains .txt and .faa files.
A collection which contains the final results, which does not work for now since there are many subfolders. I don't remember each file exactly, but I guess there are several file extensions here as well.

Should I consider splitting each collection? Each collection would contain a single datatype, but I guess it would make a lot of outputs...

gregvonkuster · 2017-05-27T12:57:50Z

> Well, the outputs are several dataset collections, which correspond to the output files of different steps of the tool:

From this description, it sounds like you have a single tool that produces outputs at multiple steps, which implies that steps following an output step will consume the output, do further processing, and produce more outputs. If this is the case, perhaps your tool should be split into multiple tools?

> A collection for the orthogroups (the first main results of the tool). It contains .csv and .txt files.

> A collection for the Working Directory after the blast (it is useful for the user to get these files if they want to run the tool from pre-computed blast results). It contains .txt and .faa files.

> A collection which contains the final results, which does not work for now since there are many subfolders. I don't remember each file exactly, but I guess there are several file extensions here as well.

Except for your third item, it looks like your directories of files are fairly easy to handle. I'm not quite sure of the best approach since I don't have the context about the analyses your tool is attempting to perform. Do tools (or tool processing steps) that consume outputs assume the files are all in the same directory? My tools do. If so, you can probably still use dataset collections, but you'll need to account for a couple of important items.

As stated previously, dataset collections require a single Galaxy datatype per collection. So for your first and second items you'll need 2 collections each: one for .csv files and another for .txt files in the first item, and one for .faa files and another for .txt files in the second item.
If downstream tools (or tool processing steps) require all files to be in the same directory, your tool code will have to symlink all elements of the 2 collections into a temporary directory to be consumed.

The other approach would be to use one of the new datatypes I've created in the PR discussed above or add a new one yourself. My datatypes categorize the data in this way:
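The two items above can be sketched roughly as follows. This is only an illustration under assumptions: `split_by_extension` and `stage_together` are hypothetical helpers, the file names are made up, and a real wrapper would do the staging in its command section rather than in a standalone script.

```python
import os
import tempfile
from collections import defaultdict


def split_by_extension(directory):
    """Group the files of directory by extension, one group per collection."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            ext = os.path.splitext(name)[1].lstrip(".")
            groups[ext].append(path)
    return dict(groups)


def stage_together(groups):
    """Symlink every file of every group into one fresh temporary directory."""
    staging = tempfile.mkdtemp()
    for paths in groups.values():
        for path in paths:
            os.symlink(path, os.path.join(staging, os.path.basename(path)))
    return staging


# A mixed output directory like the first item (made-up file names).
work_dir = tempfile.mkdtemp()
for name in ("groups.csv", "counts.csv", "groups.txt"):
    with open(os.path.join(work_dir, name), "w") as handle:
        handle.write("demo\n")

groups = split_by_extension(work_dir)  # one single-datatype group per extension
staged = stage_together(groups)        # what a downstream step would read from
print(sorted(groups))
print(sorted(os.listdir(staged)))
```

Symlinking avoids copying the data, and the staging directory can be thrown away once the downstream step has finished.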

ptortho: Proteins orthogroup fasta files
ptorthocs: Protein and coding sequences orthogroup fasta files
pttgf: Targeted gene families
pttree: Phylogenetic trees
ptphylip: Orthogroup phylip multiple sequence alignments
ptalign: Proteins orthogroup alignments
ptalignca: Protein and coding sequences orthogroup alignments
ptaligntrimmed: Trimmed proteins orthogroup alignments
ptaligntrimmedca: Trimmed protein and coding sequences orthogroup alignments
ptalignfiltered: Filtered proteins orthogroup alignments
ptalignfilteredca: Filtered protein and coding sequences orthogroup alignments

If any of your outputs consist of datasets that are defined by any of those descriptions, that datatype could be used for your output. Or you could define a new datatype if needed.

A very important caveat regarding this approach is that these datatypes cannot currently be tested with the Travis test environment defined for this tools-iuc repository. The current Galaxy functional test framework does not accommodate datatypes that represent dynamic numbers of files of multiple datatypes contained within directory hierarchies.

In fact, I'm still working to get some functional tests built for several of my tools that use these datatypes. My approach for this is to incorporate Galaxy workflows for testing the tools. I have taken a look at this project https://github.com/phnmnl/wft4galaxy, but ran into this issue phnmnl/wft4galaxy#2, so I haven't pursued it. Instead, I'm trying to use planemo for testing the workflows, but I have yet to get this approach working.

> Should I consider splitting each collection? Each collection would contain a single datatype, but I guess it would make a lot of outputs...

Based on the testing issues I've discussed above, this may be your best approach. But I only see this working for your first 2 items. I don't see how it will work for your third item which is a hierarchy of directories of files of multiple datatypes. I'm not sure dataset collections will work for this (ping @jmchilton).

For your first 2 items, if you use dataset collections, you'll only need 2 I think, so there won't be "a lot of outputs", but only 2 collections. Of course, the number of elements in each collection may be very large, but that's ok.

abretaud · 2020-09-24T11:50:20Z

OrthoFinder is in IUC and up to date now.

nsoranzo added the missing tools label Feb 8, 2017

martenson added new software tool request and removed missing tools new software labels Feb 23, 2017

lecorguille changed the title ~~Tool request: FastOrtho~~ Tool request: Fast/Ortho/Finder May 22, 2017

abretaud closed this as completed Sep 24, 2020
