-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix mac issues #159
base: master
Are you sure you want to change the base?
Fix mac issues #159
Conversation
…:initializeBacktrace
This commit introduces two checks in the MAC realignment algorithm in order to alleviate Issue soedinglab#153 ("MAC realignment algorithm damages hits", soedinglab#153): 1. If MAC hit is significantly shorter than the Viterbi hit in terms of number of matched columns then the original Viterbi hit is restored and an appropriate warning is issued. This happens when MAC algorithm generates several disjoint local hits or when MAC hit is shorter due to weak score at the start/end regions of the Viterbi hit. The threshold on the MAC hit length is defined by a new option, -mac_min_length (default=0.9, i.e. at least 90% of number of matched columns in Viterbi hit must be in the corresponding MAC hit). There is no obvious way to deal with hit split in the current MAC algorithm aproach, it is expected that the algorithm should find the same Viterbi hit albeit improved and extended at its borders. Usually the MAC hit restores its integrity as soon as the -mact option is decreased. For example, in the test case described in the issue report one could find the proper MAC hit if "-mact 0.33" is used (instead of the default "-mact 0.35" value). Therefore the warning also suggests decreasing -mact option as a viable solution if MAC alignment is still needed. Note: It would be nice to have support for "-mact auto" option that forced the MAC algorithm to decrease -mact to the max. value that restores the hit integrity (i.e. match the original Viterbi hit to the degree defined by -mac_min_length option). 2. MAC realignment algorithm also can find a local hit that is completely disjoint from the original Viterbi hit. This possibility follows from the fact that viterbi_matrix used in MAC DP algorithm is masked (see maskViterbiAlignment()) in a way that allows extension of the hit. This commit adds a check that the found MAC hit has intersection with Viterbi hit in both, query and target, coordinate spaces. If the check fails, the alternative MAC hit that satisfies this condition is sought (in decreasing score order). When no valid MAC hit is found, the original Viterbi hit is restored and an appropriate warning is printed. This error was also observed in the reported issue, see the hit soedinglab#4 (Score=11.3) that was mapped erroneously to the second part of main hit, template columns [261; 357]. The alternative MAC hits' positions are collected in macAlgorithm() procedure and later used in backtraceMAC() procedure.
Dear Dmitry, Thanks a lot for your contribution. Johannes and me spent some time to review your code. Sorry that we were not able to look into it earlier. We agree that its a major issue that the MAC algorithm can return a hit outside of the Viterbi hit. However, as we understand your code, you only check if this happens on the alignment end ( We are not sure if we agree on the 90% coverage criterion. The purpose of the Viterbi is to give the best possible score/e-value estimate that the two sequences are homologous. The MAC algorithm is supposed to give us highest confidence alignment within that region. If only a short part of the alignment is sufficiently confident, then so be it. We are open to including this option with a default 0% threshold, which will always give a maximum accuracy alignment if it was requested. Additionally, a user can also request very greedy (i.e. short, confident) alignments with a large Thanks again for your contribution. |
The default for this option was changed from 0.9 to 0.0. When MAC is enabled user could agree that the hit will be shorter than the Viterbi hit, especially when high -mact values are used. Therefore discarding shorter hits by default would not be the best strategy (Milot's suggestion).
than the Viterbi hit in terms of number of matched columns. This warning is issued when mac_min_length threshold is not triggered, i.e. when MAC hit is accepted despite of being shorter than the Viterbi hit.
about MAC hit if it's shorter than the Viterbi hit (i.e. when -mac_min_length threshold is not triggered). Suggest not only lowering -mact value but also using -mac_min_length threshold.
Dear Milot, Thanks for your prompt reply (I didn't expect immediate feedback, it's summer time).
It's not quite correct. The check is performed for the result of backtracking where both start and end coordinates in query and target spaces are known. See lines 193 and 210 where the MAC hit is considered as valid if it's segments, [i1; i2] and [j1; j2], have non-empty intersection with the Viterbi hit's segments, respectively (i.e. hits overlap in both query and target spaces). In other words, the MAC and Viterbi hits' rectangles in the DP matrix have intersection.
OK. But I think there is no need to short circuit the code of additional backtracking, because even when user agrees to get shorter hits, this doesn't mean the check for the intersection is not needed (see above). We would still need to skip completely wrong hits. Therefore I've changed the default for the mac_min_length to 0.0, this would virtually mean that no checks for hit length are performed. Thank you for your attention. Best regards, |
P.S. I want to remind what is happening when Viterbi hit is split in two local hits during MAC realignment step: then only the strongest part is reported and the weaker part is lost. |
This pull request introduces two checks in the MAC realignment algorithm in order to alleviate
Issue #153 ("MAC realignment algorithm damages hits").
See the last commit comment for the description of the changes.