features


Overview

This command creates an NCBI feature table by predicting open reading frames (ORFs) in an input FASTA file with pyrodigal-gv. The ORFs are then annotated by aligning their protein translations against the Big Fantastic Virus Database (BFVD) with MMseqs2 and taking the protein name of the top hit.

Example
>Feature Seq1
73	3537	CDS
			product	RNA-directed RNA polymerase 
			inference	ab initio prediction:pyrodigal-gv:0.3.2
			inference	alignment:MMseqs2:17.b804f:UniProtKB:A5Y5A1,BFVD:A5Y5A1_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000
>Feature Seq2
92	1645	CDS
			product	Maturation protein
			inference	ab initio prediction:pyrodigal-gv:0.3.2
			inference	alignment:MMseqs2:17.b804f:UniProtKB:A0A8K1XYM6,BFVD:A0A8K1XYM6_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000
1664	2035	CDS
			product	hypothetical protein
			inference	ab initio prediction:pyrodigal-gv:0.3.2
2037	3962	CDS
			product	RNA-directed RNA polymerase 
			inference	ab initio prediction:pyrodigal-gv:0.3.2
			inference	alignment:MMseqs2:17.b804f:UniProtKB:A0A8S5L3Y1,BFVD:A0A8S5L3Y1_unrelaxed_rank_001_alphafold2_ptm_model_2_seed_000
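
The layout above can be read back with a few lines of Python. This is a minimal sketch, not part of suvtk: the names `Feature` and `parse_feature_table` are hypothetical, and it assumes tab-separated fields, tab-indented qualifier lines, and plain integer coordinates (no `<` or `>` partial markers).

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    start: int
    end: int
    key: str
    # (qualifier name, value) pairs in file order
    qualifiers: list = field(default_factory=list)


def parse_feature_table(text):
    """Return {sequence_id: [Feature, ...]} for feature-table text."""
    table = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith(">Feature"):
            # Header line: ">Feature <sequence_id>"
            seq_id = line.split(None, 1)[1].strip()
            current = table.setdefault(seq_id, [])
        elif line.startswith("\t"):
            # Qualifier line: leading tabs, then name<TAB>value
            name, _, value = line.strip().partition("\t")
            current[-1].qualifiers.append((name, value.strip()))
        else:
            # Feature line: start<TAB>end<TAB>feature key
            start, end, key = line.split("\t")[:3]
            current.append(Feature(int(start), int(end), key))
    return table
```

Applied to the Seq2 entry above, this would yield three CDS features, the second carrying only a product and an ab initio inference qualifier.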

Note

  1. When a predicted ORF has no hit, it is annotated as ‘hypothetical protein’.

  2. The /inference qualifier records the evidence supporting each annotation in the feature table:

  • ORF prediction: ab initio prediction:<prediction_software>:<version>

  • ORF annotation: alignment:<alignment_software>:<version>[:<reference_db1>:<reference_accession1>,<reference_db2>:<reference_accession2>,...]
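
The two patterns above can be assembled as plain strings. The sketch below only illustrates the format; the function names `prediction_inference` and `alignment_inference` are hypothetical and are not taken from suvtk.

```python
def prediction_inference(software, version):
    """Build the qualifier value for an ab initio ORF prediction."""
    return f"ab initio prediction:{software}:{version}"


def alignment_inference(software, version, references=()):
    """Build the qualifier value for an alignment-based annotation.

    references is an optional sequence of (database, accession) pairs;
    they are appended as a comma-separated list of db:accession entries.
    """
    value = f"alignment:{software}:{version}"
    if references:
        value += ":" + ",".join(f"{db}:{acc}" for db, acc in references)
    return value
```

For instance, `alignment_inference("MMseqs2", "17.b804f", [("UniProtKB", "A5Y5A1")])` gives `alignment:MMseqs2:17.b804f:UniProtKB:A5Y5A1`.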

Reorientation of sequences

Additionally, if a taxonomy file is provided, suvtk features reorients the input nucleotide sequences based on their taxonomy: ssRNA(-) virus sequences (i.e. members of the Negarnaviricota) are given a negative orientation (3’ → 5’), and all other sequences, including unclassified ones, a positive orientation (5’ → 3’). The orientation is determined from the strand on which pyrodigal-gv finds the majority of ORFs.
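
One way to read the rule above is as a majority vote over ORF strands followed by an optional reverse complement. The sketch below is an illustration under stated assumptions, not suvtk's actual implementation: the function names are hypothetical, ORF strands are given as +1 or -1, a tie counts as the forward strand, and "negative orientation" is taken to mean that most ORFs end up on the reverse strand.

```python
# Maps A/C/G/T/U/N to their complements; other IUPAC codes are left unchanged.
_COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def majority_strand(orf_strands):
    """Return +1 or -1 for the strand holding most ORFs (ties: +1, an assumption)."""
    return 1 if sum(orf_strands) >= 0 else -1


def reorient(seq, orf_strands, is_negative_sense):
    """Flip seq when its majority ORF strand does not match the desired one.

    is_negative_sense is True for ssRNA(-) sequences (Negarnaviricota)
    and False for everything else, including unclassified sequences.
    """
    desired = -1 if is_negative_sense else 1
    if majority_strand(orf_strands) != desired:
        return reverse_complement(seq)
    return seq
```

With this reading, a sequence whose ORFs are mostly on the reverse strand is flipped unless it is classified as an ssRNA(-) virus.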


Required Input

  • -i, --input: Input FASTA file with sequences. (Required)

  • -o, --output: Output directory where the results will be saved. (Required)

  • -d, --database: Path to the suvtk database folder. (Required)

Optional Parameters

  • -g, --translation-table: Translation table (default: 1). Valid codes: 1–6, 9–16, 21–31.

  • --coding-complete: Flag to keep only genomes with >50% coding capacity.

  • --taxonomy: Taxonomy file for adjusting sequence orientation (particularly for ssRNA(-) viruses).

  • --separate-files: Flag to save feature tables into separate files rather than one combined file.

  • -t, --threads: Number of threads to use (default: 4).

Warning

For now, only coding-complete sequences (>50% coding capacity) have features predicted and appear in the feature table (.tbl).
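
To make the threshold concrete, the sketch below computes coding capacity as the fraction of sequence positions covered by at least one ORF. That definition is an assumption, since the exact calculation used by suvtk is not stated here, and the function name `coding_capacity` is hypothetical. Coordinates are 1-based and inclusive, as in the feature table.

```python
def coding_capacity(seq_length, orf_intervals):
    """Return the fraction of positions covered by at least one ORF.

    orf_intervals holds (start, end) pairs, 1-based and inclusive;
    reverse-strand ORFs may be given as (end, start).
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    covered = 0
    last_end = 0  # right-most position already counted
    for start, end in sorted((min(a, b), max(a, b)) for a, b in orf_intervals):
        start = max(start, last_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / seq_length
```

The three Seq2 ORFs in the example feature table cover 1554 + 372 + 1926 = 3852 nt, so under this definition Seq2 exceeds 50% coding capacity if the sequence is shorter than 7704 nt.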

Output

  • proteins.faa: Protein sequences in FASTA format.

  • miuvig_features.tsv: Details of MIUVIG-related features (i.e. prediction tool, reference database and search method).

  • reoriented_nucleotide_sequences.fna: Potentially reoriented nucleotide sequences.

  • alignment.m8: Alignment file from the protein search.

  • One or more feature table files in NCBI format.

  • no_ORF_prediction.txt: List of sequences with insufficient ORF predictions.
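
The annotation step picks the top hit per ORF, and something similar can be done on alignment.m8 after the run. The sketch below is a hedged illustration, not suvtk code: the name `top_hits` is hypothetical, it assumes the file uses the 12-column BLAST-style tabular layout (query, target, identity, alignment length, mismatches, gap openings, query start/end, target start/end, E-value, bit score), and it treats the hit with the highest bit score as the top hit.

```python
def top_hits(m8_text):
    """Return {query: (target, bit_score)} keeping the best-scoring hit per query."""
    best = {}
    for line in m8_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, target, bits = fields[0], fields[1], float(fields[11])
        # Keep the first hit seen for a query unless a later one scores higher
        if query not in best or bits > best[query][1]:
            best[query] = (target, bits)
    return best
```

If the file was written with a different column layout, the field indices would need to be adjusted accordingly.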

Example Usage

suvtk features -i sequences.fasta -o output_dir -d /path/to/database -g 1 --coding-complete --taxonomy taxonomy.tsv -t 4