Integrate sequence data into Bio4j

difficulty medium
technologies aws++, s3+++, dynamodb+++, biology+, scala++

A lot of raw sequence data is connected to the data already integrated in Bio4j: protein sequences, coding sequences (genes), RNA, etc. Some of this sequence data is available as part of Bio4j modules; however, graph databases are not designed for storing (and indexing) what are essentially strings at this scale. Another key aspect is that string matching in biology has quite specific needs, which depend on factors such as

the type of sequences involved
the subtle and complex issue of assigning biological meaning to sequence similarities
the input for the queries

There is, however, a use case for which this integration would add great value: selecting a particular set of sequences based on the results of queries that take advantage of the rich model and integrated datasets. In effect, the Bio4j graph is used as an index. For this, the sequences could be stored as a combination of DynamoDB items and S3 objects, and queries could then return either just IDs or the sequences themselves, as needed.
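The storage idea above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `SequenceStore`, `fetch`, the `INLINE_LIMIT` cutoff and the key layout are hypothetical names, and plain dictionaries stand in for the DynamoDB table and the S3 bucket so the example runs without AWS access. Python is used only to keep the sketch short. The one external fact relied on is that DynamoDB caps item size at 400 KB, which is why long sequences would need to live in object storage.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, Optional

# Hypothetical cutoff (in residues) for keeping a sequence inline.
# DynamoDB items are capped at 400 KB; the value here is only illustrative.
INLINE_LIMIT = 1000


@dataclass
class SequenceStore:
    """In-memory stand-in for a DynamoDB table plus an S3 bucket."""
    items: Dict[str, dict] = field(default_factory=dict)   # plays the DynamoDB table
    objects: Dict[str, str] = field(default_factory=dict)  # plays the S3 bucket

    def put(self, seq_id: str, sequence: str) -> None:
        if len(sequence) <= INLINE_LIMIT:
            # Short sequence: store it directly in the item.
            self.items[seq_id] = {"id": seq_id, "sequence": sequence}
        else:
            # Long sequence: store the payload as an object, keep a pointer.
            key = f"sequences/{seq_id}"
            self.objects[key] = sequence
            self.items[seq_id] = {"id": seq_id, "object_key": key}

    def get(self, seq_id: str) -> Optional[str]:
        item = self.items.get(seq_id)
        if item is None:
            return None
        if "sequence" in item:
            return item["sequence"]
        return self.objects[item["object_key"]]


def fetch(store: SequenceStore, ids: Iterable[str],
          with_sequences: bool = False) -> Iterator:
    """Yield ids only, or (id, sequence) pairs, for ids returned by a graph query."""
    for seq_id in ids:
        if with_sequences:
            yield seq_id, store.get(seq_id)
        else:
            yield seq_id


store = SequenceStore()
store.put("protA", "MKTAYIAKQR")   # made-up id and sequence, stored inline
store.put("protB", "M" * 5000)     # made-up long sequence, stored as an object
graph_hits = ["protA", "protB"]    # pretend result of a Bio4j graph query
ids_only = list(fetch(store, graph_hits))
with_seqs = list(fetch(store, graph_hits, with_sequences=True))
```

The graph query only ever produces ids; whether the sequence payload is fetched is decided by the caller, which keeps the graph free of large strings.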

As for letting the user query the sequence composition itself, we want to focus on protein sequences. Here even exact matches could be useful in some contexts, all the more so if they can be combined with graph traversals. Integrating domain-specific tools, such as BLAST-like local alignment tools, would also be very useful.
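As a rough illustration of combining exact matches with a graph traversal, the sketch below first narrows the candidates through a toy graph and then filters them by an exact substring match. The dictionaries, ids, sequences and the function names `proteins_containing` and `find_positions` are all hypothetical; a real implementation would obtain the candidates from a Bio4j traversal and the sequences from the external store.

```python
from typing import Dict, Iterable, List, Set

# Toy stand-in for a graph traversal result: taxon id -> linked protein ids.
proteins_by_taxon: Dict[str, Set[str]] = {
    "taxonA": {"prot1", "prot2"},
    "taxonB": {"prot3"},
}

# Made-up protein sequences keyed by protein id.
sequences: Dict[str, str] = {
    "prot1": "MKTAYIAKQRQISFVK",
    "prot2": "MSHHWGYGKHNGPEHW",
    "prot3": "MKTAYGGG",
}


def find_positions(motif: str, sequence: str) -> List[int]:
    """Return every start offset of motif in sequence, overlaps included."""
    if not motif:
        raise ValueError("motif must not be empty")
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def proteins_containing(motif: str, candidates: Iterable[str],
                        seqs: Dict[str, str]) -> List[str]:
    """Keep only the candidate proteins whose sequence contains motif exactly."""
    return sorted(p for p in candidates if motif in seqs.get(p, ""))


# Traversal first (proteins linked to taxonA), exact match second.
hits = proteins_containing("KTAY", proteins_by_taxon["taxonA"], sequences)
```

Restricting the match to the traversal result means the string search only touches the sequences the graph query selected, rather than the whole dataset.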

Expected outcome

One or more Bio4j modules providing access to sequence-based data linked with Bio4j model entities. A deeper integration for protein sequences, allowing the user to query sequence composition, together with the integration of domain-specific tools such as BLAST.

Possible mentors

@evdokim (mailto:ekovach@ohnosequences.com)
@eparejatobes (mailto:eparejatobes@ohnosequences.com)
@alberskib (mailto:alberskib@gmail.com)

If you are interested, ask on the bio4j/gsoc15 gitter channel or contact any of the possible mentors.
