features

Overview

This command creates an NCBI feature table by predicting open reading frames (ORFs) in an input FASTA file with pyrodigal-rv. To annotate phage sequences, pass the --phage option, which switches ORF prediction to pyrodigal-gv. For phages, however, feature annotation with pharokka and phold is recommended instead.

The ORFs are then annotated by aligning their protein translations against the Big Fantastic Virus Database with mmseqs2 and assigning each ORF the protein name of its top hit.
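As a rough illustration of top-hit selection, the sketch below reads a tab-separated alignment file in the standard 12-column BLAST-tab (m8) layout and keeps the best hit per ORF. It assumes the alignment file uses that default column order and that "top hit" means the highest bit score; the function name top_hits and the example identifiers are hypothetical, and suvtk's own selection logic may differ.

```python
import csv
import io

# Assumed column order: the standard 12-column BLAST-tab (m8) layout.
M8_COLUMNS = ["query", "target", "pident", "alnlen", "mismatch", "gapopen",
              "qstart", "qend", "tstart", "tend", "evalue", "bits"]


def top_hits(handle):
    """Return {query: (target, bit_score)} with the highest-scoring hit per query."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < len(M8_COLUMNS):
            continue  # skip blank or malformed lines
        rec = dict(zip(M8_COLUMNS, row))
        bits = float(rec["bits"])
        # Assumption: hits are ranked by bit score; the first hit wins ties.
        if rec["query"] not in best or bits > best[rec["query"]][1]:
            best[rec["query"]] = (rec["target"], bits)
    return best


# Made-up example rows, not real suvtk output.
example = io.StringIO(
    "orf_1\tprotA\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200\n"
    "orf_1\tprotB\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t120\n"
    "orf_2\tprotC\t60.0\t90\t36\t0\t1\t90\t5\t94\t1e-10\t55.5\n"
)
hits = top_hits(example)
```

Mapping the selected target identifier to a protein name depends on how the database stores its annotations, so that step is left out here.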

Reorientation of sequences

Additionally, if a taxonomy file is provided, suvtk features reorients the input nucleotide sequences based on their taxonomy. ssRNA(-) virus sequences (i.e., members of the Negarnaviricota) are given a negative orientation (3’ → 5’), and all other sequences, including unclassified ones, a positive orientation (5’ → 3’). The reorientation is based on the strand on which pyrodigal finds the majority of ORFs.
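The majority-strand rule can be sketched as follows. This is a minimal illustration rather than suvtk's implementation: it assumes each predicted ORF is reported with a strand of +1 or -1, that a sequence is reverse-complemented when its ORF majority lies on the "wrong" strand for its taxonomy, and that ties count as forward. The names orient and reverse_complement are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient(seq, orf_strands, negative_sense=False):
    """Orient seq by the strand carrying most ORFs.

    orf_strands: one +1 or -1 per predicted ORF.
    negative_sense: True for ssRNA(-) viruses, whose ORF majority
    should end up on the reverse strand.
    """
    forward = sum(1 for s in orf_strands if s > 0)
    reverse = len(orf_strands) - forward
    majority_is_forward = forward >= reverse  # assumption: ties count as forward
    # Flip when the majority strand does not match the desired orientation.
    flip = majority_is_forward == negative_sense
    return reverse_complement(seq) if flip else seq


positive = orient("ATGC", [-1, -1, 1])                      # majority reverse, gets flipped
negative = orient("ATGC", [1, 1, -1], negative_sense=True)  # majority forward, gets flipped
```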

Required Input

-i, --input: Input FASTA file with sequences. (Required)
-o, --output: Output directory where the results will be saved. (Required)
-d, --database: Path to the suvtk database folder. (Required)

Optional Parameters

--coding-complete: Flag to keep only genomes with >50% coding capacity.
--phage: Input sequences are phage genomes (use pyrodigal-gv for ORF prediction).
--taxonomy: Taxonomy file for adjusting sequence orientation (particularly for ssRNA(-) viruses).
--separate-files: Flag to save feature tables into separate files rather than one combined file.
-t, --threads: Number of threads to use (default: auto).

Warning

For now, only coding-complete sequences (>50% coding capacity) will have predicted features and appear in the feature table (.tbl).
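To make the 50% threshold concrete, the sketch below computes coding capacity as the fraction of nucleotides covered by at least one ORF. That definition is an assumption (the tool may compute it differently, for example by summing ORF lengths), and coding_capacity and is_coding_complete are hypothetical names.

```python
def coding_capacity(seq_length, orfs):
    """Fraction of positions covered by at least one ORF.

    orfs: iterable of (start, end) pairs, 1-based inclusive; reverse-strand
    ORFs may be given as (end, start) and are normalised here.
    """
    intervals = sorted((min(a, b), max(a, b)) for a, b in orfs)
    covered = 0
    current_end = 0  # last position already counted
    for start, end in intervals:
        if end <= current_end:
            continue  # fully inside a region that was already counted
        covered += end - max(start, current_end + 1) + 1
        current_end = end
    return covered / seq_length if seq_length else 0.0


def is_coding_complete(seq_length, orfs, threshold=0.5):
    """True when coding capacity is strictly above the threshold (>50% by default)."""
    return coding_capacity(seq_length, orfs) > threshold


capacity = coding_capacity(1000, [(1, 300), (201, 600)])  # overlapping ORFs counted once
```

Counting overlapping positions once keeps the value at or below 1.0, which is why the intervals are merged rather than summed.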

Output

proteins.faa: Protein sequences in FASTA format.
miuvig_features.tsv: MIUVIG-related feature details (i.e., prediction tool, reference database, and search method).
reoriented_nucleotide_sequences.fna: Potentially reoriented nucleotide sequences.
alignment.m8: Alignment file from the protein search.
One or more feature table files (.tbl) in NCBI format: a single combined file by default, or separate files when --separate-files is set.
no_ORF_prediction.txt: List of sequences with insufficient ORF predictions.
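For a quick sanity check of the feature table output, the following sketch counts CDS features per sequence. It assumes the files follow the usual NCBI five-column layout (a ">Feature <seq_id>" header line, tab-separated "start, stop, feature key" lines, and qualifier lines that begin with three tabs); the function name count_cds and the example content are hypothetical.

```python
import io
from collections import Counter


def count_cds(handle):
    """Count CDS features per sequence in an NCBI-style feature table."""
    counts = Counter()
    seq_id = None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">Feature"):
            parts = line.split()
            seq_id = parts[1] if len(parts) > 1 else None
            if seq_id is not None:
                counts[seq_id] += 0  # register sequences even if they have no CDS
            continue
        fields = line.split("\t")
        # Feature lines have a start coordinate in column 1 and the key in column 3;
        # qualifier lines start with empty columns and are skipped.
        if seq_id and len(fields) >= 3 and fields[0] and fields[2] == "CDS":
            counts[seq_id] += 1
    return counts


# Made-up example table, not real suvtk output.
example_tbl = io.StringIO(
    ">Feature contig_1\n"
    "1\t300\tCDS\n"
    "\t\t\tproduct\thypothetical protein\n"
    "900\t400\tCDS\n"
    "\t\t\tproduct\tcapsid protein\n"
    ">Feature contig_2\n"
    "10\t250\tCDS\n"
    "\t\t\tproduct\tRdRp\n"
)
cds_counts = count_cds(example_tbl)
```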

Example Usage

suvtk features -i sequences.fasta -o output_dir -d /path/to/database --coding-complete --taxonomy taxonomy.tsv -t 4