virus-info
Overview
The virus-info subcommand identifies and reports details on potentially segmented viruses based on their taxonomy. It cross-references a user-supplied taxonomy file with a segmented viruses database and a genome type database to predict the genome structure and type for each contig. Additionally, it outputs information for further investigation if a high fraction of the viruses within a contig's taxon are segmented.
For each contig in the taxonomy file:
If the taxonomy is "unclassified viruses", a minimal record is created (genome type set to "uncharacterized" and structure to "undetermined").
Otherwise, the contig's taxonomy lineage is searched against a list of known segmented taxa, and information on the segmentation of the corresponding taxon is retrieved. If the contig's taxonomy belongs to a taxon containing segmented viruses and the fraction of segmented viruses in that taxon is high (>=25%), extra details are echoed to the terminal (see Example below).
It also deduces the predicted genome structure based on the taxonomy:
"segmented" if the fraction of segmented viruses in that taxon (segmented_fraction) is 100%,
"undetermined" if segmented_fraction is greater than 0 but less than 100%,
"non-segmented" otherwise.
Finally, it will also predict the genome type based on the contig’s taxonomy.
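The structure rule and the genome type lookup described above can be sketched as follows. This is a minimal illustration, not suvtk's implementation: the function names and the `GENOME_TYPES` table are hypothetical placeholders, segmented_fraction is assumed to be a percentage between 0 and 100, and any taxon missing from the placeholder table is treated as falling outside the official taxonomy.

```python
# Hypothetical stand-in for the genome type database shipped with suvtk.
# The single entry and its label are placeholders for illustration only.
GENOME_TYPES = {"Chrysoviridae": "dsRNA"}


def predict_genome_structure(segmented_fraction: float) -> str:
    """Map a taxon's segmented fraction (in percent) to a structure label."""
    if segmented_fraction >= 100.0:
        return "segmented"
    if segmented_fraction > 0.0:
        return "undetermined"
    return "non-segmented"


def predict_record(contig, taxon, segmented_fraction=0.0, genome_types=GENOME_TYPES):
    """Build one output record for a contig (assumed layout, not suvtk's code)."""
    # Unclassified viruses and taxa unknown to the lookup table get the
    # minimal record: uncharacterized type, undetermined structure.
    if taxon == "unclassified viruses" or taxon not in genome_types:
        return {
            "contig": contig,
            "pred_genome_type": "uncharacterized",
            "pred_genome_struc": "undetermined",
        }
    return {
        "contig": contig,
        "pred_genome_type": genome_types[taxon],
        "pred_genome_struc": predict_genome_structure(segmented_fraction),
    }


print(predict_record("Seq4", "Chrysoviridae", 100.0))
print(predict_record("Seq9", "unclassified viruses"))
```

Keeping the structure rule in its own function makes the three thresholds easy to check in isolation.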
If a contig’s taxonomy is not part of the official ICTV taxonomy, the predicted genome type will default to "uncharacterized" and its structure to "undetermined".
Example
Information echoed to the terminal if a sequence belongs to a taxon with segmented viruses:
Seq4 is part of the Chrysoviridae Family, 100.00% of these are segmented viruses.
Most segmented viruses of the Family Chrysoviridae have 4 segments, but it can vary between 3 and 7 depending on the species.
You might want to look into your data to see if you can identify the missing segments.
The corresponding info in segmented_viruses_info.tsv:
| contig | rank | taxon | parent | total | segmented | segmented_fraction | majority_segment | min_segment | max_segment |
|---|---|---|---|---|---|---|---|---|---|
| Seq4 | Family | Chrysoviridae | Alphatotivirineae | 29 | 29 | 100.0 | 4 | 3 | 7 |
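To show how the terminal message relates to this record, the sketch below rebuilds the notice from one row. It is only an illustration under stated assumptions: the file is assumed to be tab-separated with a header row, the function name `format_notice` is hypothetical, and the wording is copied from the example output above rather than from suvtk's source. The 25% threshold is the one stated in the Overview.

```python
import csv
import io

THRESHOLD = 25.0  # percent, as stated in the Overview

COLUMNS = ["contig", "rank", "taxon", "parent", "total", "segmented",
           "segmented_fraction", "majority_segment", "min_segment", "max_segment"]
VALUES = ["Seq4", "Family", "Chrysoviridae", "Alphatotivirineae", "29", "29",
          "100.0", "4", "3", "7"]
# In-memory copy of the example row, standing in for segmented_viruses_info.tsv.
EXAMPLE_TSV = "\t".join(COLUMNS) + "\n" + "\t".join(VALUES) + "\n"


def format_notice(row):
    """Return the notice for one record, or None if it is below the threshold."""
    fraction = float(row["segmented_fraction"])
    if fraction < THRESHOLD:
        return None
    return (
        f"{row['contig']} is part of the {row['taxon']} {row['rank']}, "
        f"{fraction:.2f}% of these are segmented viruses.\n"
        f"Most segmented viruses of the {row['rank']} {row['taxon']} have "
        f"{row['majority_segment']} segments, but it can vary between "
        f"{row['min_segment']} and {row['max_segment']} depending on the species.\n"
        "You might want to look into your data to see if you can identify "
        "the missing segments."
    )


for row in csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"):
    notice = format_notice(row)
    if notice is not None:
        print(notice)
```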
Required Input
--taxonomy:
Path to the taxonomy file (TSV). This file should contain at least the columns "contig" and "taxonomy". For example, it might be generated by the suvtk taxonomy subcommand.
-d, --database:
The directory path to the suvtk database. This directory must contain the following files:
nodes.dmp
names.dmp
-o, --output:
The output directory path where the results will be saved. If the directory does not exist, it will be created.
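Before running the subcommand, it can help to confirm that the inputs meet these requirements. The following is a small pre-flight check, assuming the taxonomy file is tab-separated with a header row; the helper name `check_inputs` is hypothetical and not part of suvtk.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = {"contig", "taxonomy"}
REQUIRED_DB_FILES = ("nodes.dmp", "names.dmp")


def check_inputs(taxonomy, database):
    """Return a list of problems found; an empty list means the inputs look usable."""
    problems = []
    taxonomy = Path(taxonomy)
    if not taxonomy.is_file():
        problems.append(f"taxonomy file not found: {taxonomy}")
    else:
        # Only the header row is needed to verify the column names.
        with taxonomy.open(newline="") as handle:
            header = next(csv.reader(handle, delimiter="\t"), [])
        missing = REQUIRED_COLUMNS - set(header)
        if missing:
            problems.append(f"taxonomy file lacks columns: {sorted(missing)}")
    for name in REQUIRED_DB_FILES:
        if not (Path(database) / name).is_file():
            problems.append(f"database is missing {name}")
    return problems
```

Calling `check_inputs("taxonomy.tsv", "/path/to/database")` and printing the returned list gives a quick summary of anything that needs fixing before the real run.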
Output Files
miuvig_taxonomy.tsv:
Columns: contig, pred_genome_type, pred_genome_struc
This file lists each contig with its predicted genome type and structure.
segmented_viruses_info.tsv:
Contains detailed information on segmented virus records (if any were found), sorted by the segmented fraction.
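As an illustration of how miuvig_taxonomy.tsv might be consumed downstream, the sketch below tallies contigs per predicted structure. It assumes the file is tab-separated and starts with a header row holding the three column names listed above; the function name and the sample rows are made up for the example.

```python
import csv
import io
from collections import Counter


def count_structures(handle):
    """Count contigs per predicted genome structure in a miuvig_taxonomy.tsv-like file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["pred_genome_struc"] for row in reader)


# Made-up rows standing in for a real miuvig_taxonomy.tsv.
example = io.StringIO(
    "contig\tpred_genome_type\tpred_genome_struc\n"
    "Seq1\tuncharacterized\tundetermined\n"
    "Seq4\tdsRNA\tsegmented\n"
)
counts = count_structures(example)
for structure, n in sorted(counts.items()):
    print(structure, n)
```

For a real run, the same function can be given an open file handle, for example `open("output_dir/miuvig_taxonomy.tsv", newline="")`.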
Example Usage
Below is an example command-line invocation:
suvtk virus-info --taxonomy taxonomy.tsv --database /path/to/database --output output_dir
This command will process the taxonomy.tsv file, using the nodes.dmp and names.dmp in /path/to/database, and write the results (e.g., miuvig_taxonomy.tsv and segmented_viruses_info.tsv) to the output_dir.
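When the subcommand is driven from a script, the same invocation can be assembled in Python. This is a sketch that assumes the suvtk executable is on PATH; the wrapper function names are hypothetical, and only the flags shown in the command above are used.

```python
import subprocess


def build_virus_info_command(taxonomy, database, output):
    """Assemble the argument list for the virus-info invocation shown above."""
    return [
        "suvtk", "virus-info",
        "--taxonomy", str(taxonomy),
        "--database", str(database),
        "--output", str(output),
    ]


def run_virus_info(taxonomy, database, output):
    """Run the command; check=True raises CalledProcessError on a non-zero exit."""
    command = build_virus_info_command(taxonomy, database, output)
    return subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues when paths contain spaces.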
Additional Notes
If a contig’s taxonomy is not part of the official ICTV taxonomy, the predicted genome type will default to "uncharacterized" and its structure to "undetermined".