Nullomers Assessor: A computational method for statistical evaluation of nullomers and Minimal Absent Words (MAWs)
Description:
Nullomers Assessor is a probabilistic methodology for evaluating biological MAWs, based on Markovian models with multiple-testing correction. A 'significant' MAW is an absent motif that would statistically be expected to occur. Significant absent motifs are considered to be under negative selection. The algorithm estimates the frequency of residues and subsequently calculates 3 transition probabilities, one for each of the first three Markovian orders. The method can analyse either nucleotide or amino acid sequences. Three statistical correction methods (Bonferroni, Tarone and Benjamini-Hochberg) are built into the current version of the script.
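To make the idea of an absent motif that is "statistically expected to occur" more concrete, the following is a minimal sketch and not the code of nullomers_assessor.py. It estimates residue frequencies and transition probabilities from a sequence by plain counting, computes the probability of a word under a Markov model of a chosen order, and converts the expected number of occurrences into a probability of absence. The function names (kmer_counts, word_probability, absence_p_value) are hypothetical, and the Poisson approximation in the last step is an assumption made for illustration; the publication describes the procedure actually used.

```python
from collections import Counter
from math import exp


def kmer_counts(seq, k):
    """Count all overlapping k-mers of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def word_probability(word, seq, order):
    """Probability of `word` at one position under a Markov model of the
    given order, with parameters estimated from `seq` by plain counting."""
    if len(word) <= order:
        raise ValueError("word must be longer than the Markov order")
    if order == 0:
        # Order 0: residues are independent, multiply their frequencies.
        freq = kmer_counts(seq, 1)
        p = 1.0
        for residue in word:
            p *= freq[residue] / len(seq)
        return p
    start = kmer_counts(seq, order)          # every context of length `order`
    followed = kmer_counts(seq[:-1], order)  # contexts that have a successor
    extended = kmer_counts(seq, order + 1)   # context plus the next residue
    # Probability of the first `order` residues, then one transition per residue.
    p = start[word[:order]] / sum(start.values())
    for i in range(order, len(word)):
        context = word[i - order:i]
        if followed[context] == 0:
            return 0.0
        p *= extended[word[i - order:i + 1]] / followed[context]
    return p


def absence_p_value(word, seq, order):
    """Assumed Poisson approximation: P(zero occurrences) = exp(-expected)."""
    positions = len(seq) - len(word) + 1
    return exp(-word_probability(word, seq, order) * positions)
```

Under these assumptions, a very small value returned by absence_p_value means that the word would be expected to appear many times, so its absence is surprising.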
Preparatory steps:
The nullomers_assessor.py script requires 2 input files. The first is a fasta file containing the entire genome/proteome of a species. The second is a list of MAWs. Tools such as MAW or em-MAW can be used to identify MAWs. Both tools, however, require a fasta file with a single header and a single sequence in order to calculate globally missing motifs from a given genome/proteome. This format can be produced with the fasta_formatter.py script, which transforms a typical fasta file into a two-line fasta file. Sample files are provided in a separate directory. For more information visit: https://www.nullomers.org/.
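The two-line format described above can be illustrated with a short sketch. This is not the code of fasta_formatter.py; it is a hypothetical function (to_two_line_fasta, with an assumed default header ">merged") that drops all header lines, joins the remaining sequence lines and emits one header followed by one sequence. How the real script treats the junctions between concatenated records is not stated here, so this sketch simply joins them.

```python
def to_two_line_fasta(lines, header=">merged"):
    """Join every sequence line of a multi-record fasta into one record.

    `lines` is any iterable of text lines, e.g. an open file handle.
    The header text is an illustrative assumption.
    """
    parts = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the original record headers.
        if not line or line.startswith(">"):
            continue
        parts.append(line)
    return header + "\n" + "".join(parts) + "\n"


if __name__ == "__main__":
    example = ">chr1\nACGT\nAC\n>chr2\nGG\n"
    print(to_two_line_fasta(example.splitlines()), end="")
```

For a file on disk, the same function can be given an open file handle instead of a list of lines.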
Usage:
Download the nullomers_assessor.py script and run it with the following 6 arguments, separated by blank spaces. In the examples further down, the values are passed in the order listed here.
--absolute-path-of-fasta-file <string> <mandatory> A typical fasta file (either DNA or protein sequences)
--absolute-path-of-MAWs-file <string> <mandatory> A list of MAWs (without header)
--threshold <double> <mandatory> A float in the range [0, 1] indicating the threshold of statistical correction. MAWs with corrected p-values greater than the user-specified cut-off will be discarded
--type-of-sequences <string> <mandatory> 'DNA' for nucleotide sequences, 'PROT' for protein sequences
--statistical-correction <string> <mandatory> 'BONF' for Bonferroni correction, 'TARONE' for Tarone correction, 'FDR' for Benjamini-Hochberg procedure
--print-log <boolean> <mandatory> 'TRUE' or 'FALSE'
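As a rough illustration of how the threshold interacts with the statistical correction, the sketch below implements the textbook Bonferroni and Benjamini-Hochberg adjustments and then discards entries whose corrected p-value exceeds the cut-off. It is an assumption-laden sketch rather than the script's own code: the function names are hypothetical, the number of tests is taken to be the length of the input list (the script may count tests differently), and the Tarone correction is not shown.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def keep_significant(words, corrected, threshold):
    """Keep words whose corrected p-value does not exceed the threshold."""
    return [w for w, p in zip(words, corrected) if p <= threshold]


if __name__ == "__main__":
    words = ["AAAC", "CGTA", "GGCT", "TTAG"]   # made-up example words
    raw = [0.01, 0.04, 0.03, 0.005]            # made-up example p-values
    print(keep_significant(words, bonferroni(raw), 0.05))
    print(keep_significant(words, benjamini_hochberg(raw), 0.05))
```

Bonferroni is the more conservative of the two, so with the same threshold it typically retains fewer MAWs than the Benjamini-Hochberg procedure.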
Example of usage:
Unix OS:
python3 nullomers_assessor.py input/P.troglodytes_genome.fasta output/pantro_nullomers.out 0.01 DNA BONF TRUE
Windows OS:
py nullomers_assessor.py C:\input\P.troglodytes_genome.fasta C:\output\pantro_nullomers.out 0.01 DNA BONF TRUE
Results:
Nullomers Assessor has been applied to numerous genomes and proteomes of various organisms, including Homo sapiens. The results of the study are provided via Nullomers Database, a web-based repository which hosts significant MAWs from hundreds of species and thousands of individual virus sequences.
Publication:
For more information about the method and the statistical correction procedures, please consult the publication:
Koulouras G, Frith MC. Significant non-existence of sequences in genomes and proteomes. Nucleic Acids Res. 2021;49(6):3139–3155. [link]
Contact:
For questions, suggestions, bug reports, or feedback, please email me:
- gkoulouras {at symbol} gmail {dot symbol} com
License:
This project is licensed under the Apache 2.0 license, quoted below.
Copyright (c) 2020 Grigorios Koulouras
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.