Error: invalid base: 0067 #137
Comments
The error message is printing the invalid character in hex (0x67, i.e., ASCII 'g'), though there's a slight issue with the formatting: the hex prefix is missing.
Thanks @zaeleus! @holmrenser, perhaps a quick solution to the problem would be to convert the lowercase reference sequence to uppercase? Something like:
with the excellent @DongzeHE: We should handle this internally. Specifically we should decide when we normalize the sequence.
Ah I definitely did not catch the hex formatting! Since I'm doing the preprocessing myself I can easily convert to uppercase. It would be nice to have some docs and somewhat more explicit warnings/errors. With some help I can probably submit a PR.
Hi @holmrenser, I definitely agree. It would be great to handle such cases internally (and then potentially report statistics). The only question here is "when" normalization should take place. Right now the process looks like:
(1) Construct the augmented references (this is where the error you saw occurred)
Currently, any normalization is happening after step 1. We could put normalization in step (1) itself, or we could just upgrade the noodles version. The question is whether that takes care of all cases we'd wish to handle or not (I think it does).
The previous message incorrectly printed the invalid base without a hex prefix. Instead of printing all characters in hex, this now prints the debug output of a byte string, i.e., printable characters are shown normally and non-printable characters are escaped. See COMBINE-lab/simpleaf#137.
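As a rough Python analogue of the formatting change described above (the actual fix lives in the Rust noodles crate; the function names below are hypothetical and only mimic the two message styles), compare zero-padded hex without a prefix against the escaped byte-string form:

```python
def old_style_message(base: int) -> str:
    # Mimics the earlier message: zero-padded hex with no "0x" prefix.
    return f"invalid base: {base:04x}"


def new_style_message(base: int) -> str:
    # Mimics a byte-string debug form: printable bytes are shown as-is,
    # non-printable bytes are escaped.
    return f"invalid base: {bytes([base])!r}"


print(old_style_message(ord("g")))  # the cryptic form from the issue title
print(new_style_message(ord("g")))  # the offending character is visible
print(new_style_message(0))         # a non-printable byte is escaped
```

With the second form, a lowercase base in the input is recognizable at a glance instead of requiring an ASCII table lookup.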
I can see there are a few considerations that have to be weighed. For now, I can confirm that converting the genome sequences to all uppercase solved the issue. Thanks!
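For anyone applying the same workaround during preprocessing, here is a minimal sketch of uppercasing a FASTA file. It is not code from this thread: the helper names and the file paths in the usage note are hypothetical, and it assumes a plain-text (uncompressed) FASTA in which header lines start with ">".

```python
def uppercase_fasta(lines):
    """Yield FASTA lines with sequence lines uppercased; headers are left untouched."""
    for line in lines:
        if line.startswith(">"):
            yield line          # header: keep names and descriptions as-is
        else:
            yield line.upper()  # sequence: acgtn -> ACGTN


def uppercase_fasta_file(src_path, dst_path):
    """Stream src_path to dst_path, uppercasing sequence lines only."""
    with open(src_path) as fin, open(dst_path, "w") as fout:
        fout.writelines(uppercase_fasta(fin))
```

Usage would look like `uppercase_fasta_file("genome.fa", "genome.upper.fa")`. Note that lowercase bases often encode soft-masked repeats, so uppercasing discards that information in the output file.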
I'm trying to use simpleaf to build an index for Glycine max (soybean). The genome and gtf files required some preprocessing to get them properly formatted.
I ran the following command (using simpleaf 0.16.2):
Which resulted in the following output:
The error message is a bit cryptic, so I don't really know what to do. I tried searching some of the Rust repositories but haven't found the source of the error message yet.
If relevant I can provide the genome and gtf files.
EDIT:
Upon further investigation this seems to stem from the noodles crate: https://github.com/zaeleus/noodles/blob/906f5237c68fc6b04a73010580d3c4fed2c7b66e/noodles-fasta/src/record/sequence/complement.rs#L24. However, I don't really understand what's wrong yet.
Quick Python check:
Which should be possible to reverse complement?
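To illustrate why a sequence that looks valid can still fail, here is a minimal sketch of a strict reverse complement. It assumes, based only on what this thread reports, that the complement step at that noodles version accepted uppercase bases only; the table and function below are hypothetical and are not the noodles implementation.

```python
# Hypothetical strict table: uppercase bases only (an assumption for illustration).
STRICT_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def strict_reverse_complement(seq: str) -> str:
    """Reverse complement that rejects any base missing from the strict table."""
    out = []
    for base in reversed(seq):
        if base not in STRICT_COMPLEMENT:
            # Mimic the cryptic message: zero-padded hex code of the character.
            raise ValueError(f"invalid base: {ord(base):04x}")
        out.append(STRICT_COMPLEMENT[base])
    return "".join(out)


print(strict_reverse_complement("ACGTN"))

try:
    strict_reverse_complement("ACgT")  # soft-masked (lowercase) base
except ValueError as err:
    print(err)
```

Under this assumption, a check that treats bases case-insensitively would pass, while the strict complement still rejects the lowercase character.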