virus-info
Overview
The virus-info subcommand identifies and reports details on potential segmented viruses based on their taxonomy. It cross-references a user-supplied taxonomy file with a segmented viruses database and a genome type database to predict the genome structure and type for each contig. Additionally, it outputs information for further investigation if a high fraction of viruses within a taxon are segmented.
For each contig in the taxonomy file:
If the taxonomy is "unclassified viruses", a minimal record is created (genome type set to "uncharacterized" and structure to "undetermined").
Otherwise, the contig's taxonomy lineage is searched against a list of known segmented taxa, and the segmentation information for the matching taxon is retrieved. If the contig's taxonomy belongs to a taxon containing segmented viruses and the fraction of segmented viruses in that taxon is high (>=25%), extra details are echoed to the terminal (see Example below).
It also deduces the predicted genome structure based on the taxonomy:
"segmented" if the fraction of segmented viruses in that taxon (segmented_fraction) is 100%,
"undetermined" if segmented_fraction is greater than 0% but less than 100%,
"non-segmented" otherwise.
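The two thresholds described above can be sketched as small Python functions. This is a minimal illustration, not suvtk's actual code: the function names predict_structure and should_report are hypothetical, and the sketch assumes segmented_fraction is expressed as a percentage between 0 and 100, as in segmented_viruses_info.tsv.

```python
def predict_structure(segmented_fraction: float) -> str:
    """Map a taxon's segmented fraction (percent, 0-100) to a predicted structure.

    Hypothetical helper that mirrors the rule described in the text.
    """
    if segmented_fraction >= 100.0:
        return "segmented"
    if segmented_fraction > 0.0:
        return "undetermined"
    return "non-segmented"


def should_report(segmented_fraction: float, threshold: float = 25.0) -> bool:
    """Return True when the fraction is high enough (>=25%) to echo extra details."""
    return segmented_fraction >= threshold


for fraction in (100.0, 40.0, 10.0, 0.0):
    print(fraction, predict_structure(fraction), should_report(fraction))
```

Note that a taxon can fall below the 25% reporting threshold and still be predicted as "undetermined", because any non-zero fraction leaves the structure ambiguous.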
Finally, it also predicts the genome type based on the contig's taxonomy.
If a contig's taxonomy is not part of the official ICTV taxonomy, the predicted genome type defaults to "uncharacterized" and its structure to "undetermined".
Example
Information echoed to the terminal if a sequence belongs to a taxon with segmented viruses:
Seq4 is part of the Chrysoviridae Family, 100.00% of these are segmented viruses.
Most segmented viruses of the Family Chrysoviridae have 4 segments, but it can vary between 3 and 7 depending on the species.
You might want to look into your data to see if you can identify the missing segments.
The corresponding info in segmented_viruses_info.tsv:
| contig | rank | taxon | parent | total | segmented | segmented_fraction | majority_segment | min_segment | max_segment |
|---|---|---|---|---|---|---|---|---|---|
| Seq4 | Family | Chrysoviridae | Alphatotivirineae | 29 | 29 | 100.0 | 4 | 3 | 7 |
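To show how the first terminal message relates to this row, the sketch below parses the example record with Python's csv module and rebuilds that line. The describe function and the exact format string are assumptions chosen to reproduce the example output; they are not taken from suvtk's source.

```python
import csv
import io

# The example record from segmented_viruses_info.tsv, inlined as TSV text.
TSV = (
    "contig\trank\ttaxon\tparent\ttotal\tsegmented\tsegmented_fraction\t"
    "majority_segment\tmin_segment\tmax_segment\n"
    "Seq4\tFamily\tChrysoviridae\tAlphatotivirineae\t29\t29\t100.0\t4\t3\t7\n"
)


def describe(row: dict) -> str:
    """Hypothetical formatter that rebuilds the first echoed line from one row."""
    fraction = float(row["segmented_fraction"])
    return (
        f"{row['contig']} is part of the {row['taxon']} {row['rank']}, "
        f"{fraction:.2f}% of these are segmented viruses."
    )


rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))
# Keep only taxa at or above the 25% reporting threshold mentioned in the Overview.
messages = [describe(r) for r in rows if float(r["segmented_fraction"]) >= 25.0]
print(messages[0])
```

To read a real output file instead, the io.StringIO object can be replaced with an open file handle.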
Required Input
--taxonomy: Path to the taxonomy file (TSV). This file must contain at least the columns "contig" and "taxonomy". It can be generated, for example, by the suvtk taxonomy subcommand.
-d, --database: Path to the suvtk database directory. This directory must contain the following files: nodes.dmp, names.dmp
-o, --output: Path to the output directory where the results will be saved. If the directory does not exist, it will be created.
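Before running the subcommand, it can be useful to confirm that the inputs meet these requirements. The following is a minimal pre-flight check, written under the assumption that the taxonomy file has a tab-separated header row; check_inputs is a hypothetical helper and is not part of suvtk.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = {"contig", "taxonomy"}
REQUIRED_DB_FILES = ("nodes.dmp", "names.dmp")


def check_inputs(taxonomy_path, database_dir):
    """Return a list of problems found; an empty list means the inputs look usable."""
    problems = []

    tax = Path(taxonomy_path)
    if not tax.is_file():
        problems.append(f"taxonomy file not found: {tax}")
    else:
        # Read only the header row of the TSV.
        with tax.open(newline="") as handle:
            header = next(csv.reader(handle, delimiter="\t"), [])
        missing = REQUIRED_COLUMNS - set(header)
        if missing:
            problems.append(f"missing columns: {', '.join(sorted(missing))}")

    db = Path(database_dir)
    for name in REQUIRED_DB_FILES:
        if not (db / name).is_file():
            problems.append(f"missing database file: {db / name}")

    return problems
```

For example, check_inputs("taxonomy.tsv", "/path/to/database") returns an empty list when both required columns and both database files are present.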
Output Files
miuvig_taxonomy.tsv: Lists each contig with its predicted genome type and structure. Columns: contig, pred_genome_type, pred_genome_struc
segmented_viruses_info.tsv: Contains detailed information on segmented virus records (if any were found), sorted by the segmented fraction.
Example Usage
Below is an example command-line invocation:
suvtk virus-info --taxonomy taxonomy.tsv --database /path/to/database --output output_dir
This command processes the taxonomy.tsv file using the nodes.dmp and names.dmp files in /path/to/database, and writes the results (e.g., miuvig_taxonomy.tsv and segmented_viruses_info.tsv) to output_dir.
Additional Notes
If a contig's taxonomy is not part of the official ICTV taxonomy, the predicted genome type defaults to "uncharacterized" and its structure to "undetermined".