taxonomy¶
Overview¶
This submodule of suvtk
assigns virus taxonomy to your nucleotide sequences based on the ICTV guidelines (ie. classification to the lowest fitting taxon appended with “sp.”, eg. “Coronavirus sp.”). It uses an MMseqs2 database with all proteins of ICTV ratified viruses downloaded from NCBI. After mmseqs2
alignment, the taxonomy is then decided with a lowest common ancestor approach (LCA) based on the best hits as implemented with mmseqs easy-taxonomy
.
After determining the taxonomy, this subcommand also gives you information on any possible segmented viruses in your data. This could be interesting to look more into your data and try to group segments of the same virus together. For this, the suvtk co-occurrence
module can be helpful.
Finally, suvtk taxonomy
also runs suvtk virus-info
which outputs the mandatory MIUVIG parameters predicted genome structure (segmented, non-segmented, undetermined) and genome type (ssDNA, dsDNA, ssRNA(+), etc.) based on the predicted taxonomy in the miuvig_taxonomy.tsv
file. This file is a required input of suvtk comments
but can be generated yourself following subsequent tsv file format:
contig |
pred_genome_type |
pred_genome_struc |
---|---|---|
<sequence_name_as_in_fasta> |
<allowed_value> |
<allowed_value> |
Allowed MIUVIG values
genome_pred_struc
segmented | non-segmented | undetermined
genome_pred_type
DNA | dsDNA | ssDNA | RNA | dsRNA | ssRNA | ssRNA (+) | ssRNA (-) | mixed | uncharacterized
Adding your own taxonomy
You can provide your own taxonomy to the other submodules (eg. suvtk features
, suvtk comments
), if it is a tsv file in following format:
contig |
taxonomy |
taxid |
---|---|---|
<sequence_name_as_in_fasta> |
<lowest fitting taxon> sp. |
<taxid> |
… |
… |
… |
Required Input¶
-i, --input: Input FASTA file with sequences. (Required)
-o, --output: Output directory where results will be saved. (Required)
-d, --database: Path to the suvtk database folder. (Required)
Optional Parameters¶
-s, --identity: Minimum sequence identity threshold for hits (default: 0.7).
-t, --threads: Number of threads to use (default: 4).
Output¶
taxonomy.tsv
: Main file with taxonomy assignments for each sequence.miuvig_taxonomy.tsv
: MIUVIG-related taxonomy details (ie. predicted genome structure and type).segmented_viruses_info.tsv
: Additional info for segmented viruses (if applicable).
Example Usage¶
suvtk taxonomy -i sequences.fasta -o taxonomy_output -d /path/to/ICTV_db -s 0.7 -t 4