Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error: invalid base: 0067 #137

Open
holmrenser opened this issue Jun 4, 2024 · 6 comments
Open

Error: invalid base: 0067 #137

holmrenser opened this issue Jun 4, 2024 · 6 comments

Comments

@holmrenser
Copy link

holmrenser commented Jun 4, 2024

I'm trying to use simpleaf to build an index for Glycine max (soybean). The genome and gtf files required some preprocessing to get them properly formatted.

I ran the following command (using simpleaf 0.16.2):

simpleaf index --output simpleaf_index --fasta ../../g_max.genome.fasta --gtf ../../g_max.longest_transcripts.gtf --rlen 91 --threads 16 --use-piscem

Which resulted in the following output:

2024-06-04T10:05:48.261414Z  INFO simpleaf::simpleaf_commands::indexing: preparing to make reference with roers
2024-06-04T10:05:50.342651Z  INFO grangers::reader::gtf: Finished parsing the input file. Found 0 comments and 752330 records.
2024-06-04T10:05:51.029383Z  INFO roers: Built the Grangers object for 752330 records
2024-06-04T10:05:51.237147Z  WARN grangers::grangers_info: The exon_number column contains null values. Will compute the exon number from exon start position .
2024-06-04T10:05:51.527120Z  WARN roers: Found missing gene_id and/or gene_name; Imputing. If both missing, will impute using transcript_id; Otherwise, will impute using the existing one.
2024-06-04T10:05:51.549542Z  INFO roers: Proceed 278761 exon records from 55589 transcripts
Error: invalid base: 0067

The error message is a bit cryptic, so I don't really know what to do. I tried searching some of the rust repositories but haven't found the error message source yet.

If relevant I can provide the genome and gtf files.

EDIT:

Upon further investigation this seems to stem from the noodles crate: https://github.com/zaeleus/noodles/blob/906f5237c68fc6b04a73010580d3c4fed2c7b66e/noodles-fasta/src/record/sequence/complement.rs#L24. However, I don't really understand what's wrong yet.

Quick python check:

>>> bytes([67])
b'C'

Which should be possible to reverse complement?

@rob-p
Copy link
Contributor

rob-p commented Jun 4, 2024

Tagging @DongzeHE and @zaeleus here; any idea what might be going on?

@zaeleus
Copy link

zaeleus commented Jun 4, 2024

The error message is printing the invalid character in hex, though there's a slight issue with the formatting. 0x67 is g. Complementing lowercase bases was recently added to noodles 0.75.0/noodles-fasta 0.39.0 in zaeleus/noodles#267.

@rob-p
Copy link
Contributor

rob-p commented Jun 4, 2024

Thanks @zaeleus! @holmrenser — perhaps a quick solution to the problem would be to convert the lowercase reference sequence to uppercase? Something like:

seqkit seq -u input_ref.fa > output_ref.fa

with the excellent seqkit tool?

@DongzeHE: We should handle this internally. Specifically we should decide when we normalize the sequence.

@holmrenser
Copy link
Author

Ah I definitely did not catch the hex formatting! Since I'm doing the preprocessing myself I can easily convert to uppercase. Would be nice to have some docs and a bit more explicit warnings/errors. With some help I can probably submit a PR.

@rob-p
Copy link
Contributor

rob-p commented Jun 4, 2024

Hi @holmrenser,

I definitely agree. It would be great to handle such cases internally (and then potentially report statistics). The only question here is "when" normalization should take place. Right now the process looks like:

(1) Construct the augmented references (this is where the error you saw occurred)
(2) Compute the relevant signatures of the resulting sequences (for reproducibility & metadata propagation)
(3) Build the index on these sequences

Currently, any normalization is happening after step 1. We could put normalization in step (1) itself, or we could just upgrade the noodles version. The question is if that takes care of all cases we'd wish to handle or not (I think it does).

zaeleus added a commit to zaeleus/noodles that referenced this issue Jun 4, 2024
The previous message incorrectly printed the invalid base without a hex
prefix. Instead of printing all characters in hex, this now prints the debug
output of a byte string, i.e., printable characters are shown normally
and non-printable characters are escaped.

See COMBINE-lab/simpleaf#137.
@holmrenser
Copy link
Author

I can see there's a few considerations that have to be weighed. For now, I can confirm that converting the genome sequences to all uppercase solved the issue. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants