virus-info


Overview

The virus-info subcommand identifies and reports details on potentially segmented viruses based on their taxonomy. It cross-references a user-supplied taxonomy file with a segmented viruses database and a genome type database to predict the genome structure and genome type of each contig. If a high fraction (>=25%) of the viruses in a contig's taxon are segmented, it also prints details to guide further investigation.

For each contig in the taxonomy file:

  • If the taxonomy is "unclassified viruses", a minimal record is created (genome type set to "uncharacterized" and structure set to "undetermined").

  • Otherwise, the contig's taxonomy lineage is searched against a list of known segmented taxa, and the segmentation information for the matching taxon is retrieved. If the contig belongs to a taxon that contains segmented viruses and the fraction of segmented viruses in that taxon is high (>=25%), extra details are echoed to the terminal (see Example below).

  • It also deduces the predicted genome structure based on the taxonomy:

    • "segmented" if the fraction of segmented viruses in that taxon (segmented_fraction) is 100%,

    • "undetermined" if segmented_fraction is > 0 but less than 100%,

    • "non-segmented" otherwise.

  • Finally, it will also predict the genome type based on the contig’s taxonomy.

  • If a contig’s taxonomy is not part of the official ICTV taxonomy, the predicted genome type will default to "uncharacterized" and its structure to "undetermined".
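The rules above can be sketched in a few lines of Python. This is only an illustration of the documented thresholds, not suvtk's actual code: the function names `predict_structure` and `should_report` are hypothetical, and the fraction is assumed to be a percentage (0 to 100), as in the segmented_fraction column.

```python
# Minimal sketch of the documented rules; names are hypothetical, not suvtk's API.
REPORT_THRESHOLD = 25.0  # percent, from the ">=25%" rule described above


def predict_structure(segmented_fraction: float) -> str:
    """Map a taxon's segmented fraction (in percent) to a predicted genome structure."""
    if segmented_fraction >= 100.0:
        return "segmented"
    if segmented_fraction > 0.0:
        return "undetermined"
    return "non-segmented"


def should_report(segmented_fraction: float) -> bool:
    """Return True when extra details would be echoed to the terminal."""
    return segmented_fraction >= REPORT_THRESHOLD


for fraction in (100.0, 40.0, 10.0, 0.0):
    print(fraction, predict_structure(fraction), should_report(fraction))
```

Note that a taxon can be reported on the terminal (fraction of 25% or more) while its predicted structure is still "undetermined", because only a fraction of 100% yields "segmented".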

Example

Information echoed to the terminal if a sequence belongs to a taxon with segmented viruses:

Seq4 is part of the Chrysoviridae Family, 100.00% of these are segmented viruses.
Most segmented viruses of the Family Chrysoviridae have 4 segments, but it can vary between 3 and 7 depending on the species.

You might want to look into your data to see if you can identify the missing segments.

The corresponding info in segmented_viruses_info.tsv:

contig  rank    taxon          parent             total  segmented  segmented_fraction  majority_segment  min_segment  max_segment
Seq4    Family  Chrysoviridae  Alphatotivirineae  29     29         100.0               4                 3            7
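To show how the columns of this record relate to the terminal message, the sketch below rebuilds the two message lines from the example row. The helper `format_segment_message` is hypothetical and only mirrors the wording shown in the example; it is not taken from suvtk.

```python
# Hypothetical helper: rebuilds the example terminal message from one record.
def format_segment_message(row: dict) -> str:
    fraction = float(row["segmented_fraction"])
    return (
        f"{row['contig']} is part of the {row['taxon']} {row['rank']}, "
        f"{fraction:.2f}% of these are segmented viruses.\n"
        f"Most segmented viruses of the {row['rank']} {row['taxon']} have "
        f"{row['majority_segment']} segments, but it can vary between "
        f"{row['min_segment']} and {row['max_segment']} depending on the species."
    )


# The example record from segmented_viruses_info.tsv, as a dict of strings.
example_row = {
    "contig": "Seq4",
    "rank": "Family",
    "taxon": "Chrysoviridae",
    "parent": "Alphatotivirineae",
    "total": "29",
    "segmented": "29",
    "segmented_fraction": "100.0",
    "majority_segment": "4",
    "min_segment": "3",
    "max_segment": "7",
}

print(format_segment_message(example_row))
```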


Required Input

  • --taxonomy:
    Path to the taxonomy file (TSV). This file must contain at least the columns "contig" and "taxonomy". It can be generated, for example, by the suvtk taxonomy subcommand.

  • -d, --database:
    The directory path to the suvtk database. This directory must contain the following files:

    • nodes.dmp

    • names.dmp

  • -o, --output:
    The output directory path where the results will be saved. If the directory does not exist, it will be created.
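Before running the subcommand, it can be useful to check the inputs against the requirements listed above. The following is a minimal sketch, assuming the taxonomy file is tab-separated with a header row; the helpers `missing_columns` and `missing_db_files` are hypothetical and not part of suvtk.

```python
import csv
import tempfile
from pathlib import Path

REQUIRED_COLUMNS = ("contig", "taxonomy")   # columns named in the docs above
REQUIRED_DB_FILES = ("nodes.dmp", "names.dmp")  # files named in the docs above


def missing_columns(taxonomy_path) -> list:
    """Return the required columns absent from the TSV header."""
    with open(taxonomy_path, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in REQUIRED_COLUMNS if name not in header]


def missing_db_files(database_dir) -> list:
    """Return the required database files absent from the directory."""
    return [n for n in REQUIRED_DB_FILES if not (Path(database_dir) / n).is_file()]


# Small demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    taxonomy = Path(tmp) / "taxonomy.tsv"
    taxonomy.write_text("contig\ttaxonomy\nSeq4\tChrysoviridae\n")
    (Path(tmp) / "nodes.dmp").write_text("")
    print(missing_columns(taxonomy))   # no required column is missing
    print(missing_db_files(tmp))       # names.dmp was not created
```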

Output Files

  • miuvig_taxonomy.tsv:

    • Columns: contig, pred_genome_type, pred_genome_struc

    • This file lists each contig with its predicted genome type and structure.

  • segmented_viruses_info.tsv:

    • Contains detailed information on segmented virus records (if any were found) sorted by the segmented fraction.
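As a quick way to inspect the results, the predicted structures in miuvig_taxonomy.tsv can be tallied with the standard library. This is a sketch under assumptions: the file is taken to be tab-separated with the three columns listed above, the rows below are made-up sample data, and the genome type label "dsRNA" is a placeholder, since the exact label strings suvtk writes are not specified here.

```python
import csv
import io
from collections import Counter


def count_structures(handle) -> Counter:
    """Count contigs per predicted genome structure in a miuvig_taxonomy.tsv handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["pred_genome_struc"] for row in reader)


# Made-up sample rows; "dsRNA" is a placeholder genome type label.
example = io.StringIO(
    "contig\tpred_genome_type\tpred_genome_struc\n"
    "Seq1\tuncharacterized\tundetermined\n"
    "Seq2\tuncharacterized\tundetermined\n"
    "Seq4\tdsRNA\tsegmented\n"
)

for structure, count in count_structures(example).most_common():
    print(structure, count)
```

For a real run, pass an open file instead of the in-memory example, for instance `open("output_dir/miuvig_taxonomy.tsv", newline="")`.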

Example Usage

Below is an example command-line invocation:

suvtk virus-info --taxonomy taxonomy.tsv --database /path/to/database --output output_dir

This command will process the taxonomy.tsv file, using the nodes.dmp and names.dmp in /path/to/database, and write the results (e.g., miuvig_taxonomy.tsv and segmented_viruses_info.tsv) to the output_dir.
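When the subcommand is called from a script, the same invocation can be assembled in Python. The sketch below uses only the flags shown above; the wrapper functions `build_virus_info_command` and `run_virus_info` are hypothetical, and actually running the command assumes that suvtk is installed and on the PATH.

```python
import subprocess


def build_virus_info_command(taxonomy, database, output) -> list:
    """Assemble the argument list for the documented virus-info invocation."""
    return [
        "suvtk", "virus-info",
        "--taxonomy", str(taxonomy),
        "--database", str(database),
        "--output", str(output),
    ]


def run_virus_info(taxonomy, database, output) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_virus_info_command(taxonomy, database, output), check=True)


command = build_virus_info_command("taxonomy.tsv", "/path/to/database", "output_dir")
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.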
