-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How should one search for ORFs? #299
Comments
We should definitely support
|
Sorry, I wasn't being clear. We should definitely extend the |
I mean, codons seem like 3-mers to me 🤷 - actually, they might be the O.G. Kmer 😆 I suppose my questions would be
|
Thanks for opening the issue out of the PR, I did have my doubts about the implementation and where to put it. Finding that there were other predicates in This
The implementation I ended up with was a little cumbersome: It first creates an iterator of a function locationiterator(sequence::LongNucOrView{N}; alternative_start::Bool = false) where N
regorf = alternative_start ? biore"DTG(?:[N]{3})*?T(AG|AA|GA)"dna : biore"ATG(?:[N]{3})*?T(AG|AA|GA)"dna
finder(x) = findfirst(regorf, sequence, first(x) + 1) # + 3
itr = takewhile(!isnothing, iterated(finder, findfirst(regorf, sequence)))
return itr
end And then traverse the sequence looking for the iterator match in both strands of the sequence and stores the ORFs that match the function findorfs(sequence::LongNucOrView{N}; alternative_start::Bool = false, min_len::Int64 = 6) where N
orfs = Vector{ORF}()
reversedseq = reverse_complement(sequence)
seqlen = length(sequence)
frames = Dict(0 => 3, 1 => 1, 2 => 2)
for strand in ('+', '-')
seq = strand == '-' ? reversedseq : sequence
@inbounds for location in locationiterator(seq; alternative_start)
if length(location) >= min_len
frame = strand == '+' ? frames[location.start % 3] : frames[(seqlen - location.stop + 1) % 3]
push!(orfs, ORF(strand == '+' ? location : (seqlen - location.stop + 1):(seqlen - location.start + 1), strand, frame))
end
end
end
return sort(orfs)
end It takes care of the frame issue by simply calculating the module of the start position of each match ( Another alternative I was thinking about (but I lack the expertise) is to create a better |
Please take a look here: https://github.com/Roleren/ORFik/blob/master/src/findOrfsFasta.cpp Its well tested implementation and very fast at the same time. |
CC @camilogarciabotero, ref. #297
How would a user go about findings ORFs with BioSequences? To do this, they need to be able to do several fundamental operations that BioSequences currently doesn't support:
BioRegex
doesn't support mostfind*
functions, likefindnext
andfindall
.It's possible this needs to be added in Kmers.jl, since Kmers contain the RNACodon type, the CodonSet type, and reverse translation. Kmers.jl is currently in an incomplete state, but I'm slowly working rewriting it.
So, my current proposal is:
find*
functions forBioRegex
.Maybe we should include a convenience function in Kmers that is something like this:
The text was updated successfully, but these errors were encountered: