features
Overview
This command creates an NCBI feature table by predicting open reading frames (ORFs) in an input FASTA file with pyrodigal-rv. To annotate phage sequences, pass the --phage option, which switches ORF prediction to pyrodigal-gv. However, it is recommended to do feature annotation of your phages with pharokka and phold.
Subsequently, the ORFs are annotated by aligning their protein translations against the Big Fantastic Virus Database (BFVD) with MMseqs2 and selecting the protein name of the top hit.
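The top-hit selection can be sketched in Python as follows. This is only an illustration, not the tool's own code: it assumes the alignment file uses the default 12-column BLAST-tab (m8) layout of MMseqs2 and that "top hit" means the highest bit score per query. The function name `top_hits`, the ORF identifiers and the target names are hypothetical.

```python
import csv
import io

# Assumed column order: the default MMseqs2 BLAST-tab (m8) layout.
M8_COLUMNS = [
    "query", "target", "identity", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]


def top_hits(handle):
    """Return {query_id: row}, keeping the highest-bit-score hit per query."""
    best = {}
    for row in csv.DictReader(handle, fieldnames=M8_COLUMNS, delimiter="\t"):
        current = best.get(row["query"])
        if current is None or float(row["bits"]) > float(current["bits"]):
            best[row["query"]] = row
    return best


# Made-up alignment lines for two hypothetical ORFs.
example_m8 = io.StringIO(
    "Seq1_1\thitA\t0.45\t300\t165\t3\t1\t300\t5\t304\t1e-40\t150\n"
    "Seq1_1\thitB\t0.91\t310\t28\t0\t1\t310\t1\t310\t1e-90\t420\n"
    "Seq2_1\thitC\t0.60\t120\t48\t1\t2\t121\t1\t120\t1e-20\t95\n"
)
hits = top_hits(example_m8)
for query, row in hits.items():
    print(query, row["target"], row["bits"])
```

An ORF that does not appear in `hits` at all has no alignment and would keep the default product name.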
Example
>Feature Seq1
73	3537	CDS
			product	RNA-directed RNA polymerase
			inference	ab initio prediction:pyrodigal-gv:0.3.2
			inference	alignment:MMseqs2:17.b804f:UniProtKB:A5Y5A1,BFVD:A5Y5A1_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000
>Feature Seq2
92	1645	CDS
			product	Maturation protein
			inference	ab initio prediction:pyrodigal-gv:0.3.2
			inference	alignment:MMseqs2:17.b804f:UniProtKB:A0A8K1XYM6,BFVD:A0A8K1XYM6_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000
1664	2035	CDS
			product	hypothetical protein
			inference	ab initio prediction:pyrodigal-gv:0.3.2
2037	3962	CDS
			product	RNA-directed RNA polymerase
			inference	ab initio prediction:pyrodigal-gv:0.3.2
			inference	alignment:MMseqs2:17.b804f:UniProtKB:A0A8S5L3Y1,BFVD:A0A8S5L3Y1_unrelaxed_rank_001_alphafold2_ptm_model_2_seed_000
Note
When there is no hit for the predicted ORF, the ORF will be annotated with ‘hypothetical protein’.
The /inference evidence qualifier is further used to add support for the annotation in the feature table:
ORF prediction:
ab initio prediction:<prediction_software>:<version>
ORF annotation:
alignment:<alignment_software>:<version>[:<reference_db1>:<reference_accession1>,<reference_db2>:<reference_accession2>,...]
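As a rough illustration of the two evidence formats, the following Python sketch builds the inference strings and writes them into a tab-delimited feature table block. The helper names (`ab_initio_inference`, `alignment_inference`, `feature_block`) and the ORF dictionary layout are hypothetical and not part of suvtk; the software names, versions and accessions are taken from the example table.

```python
def ab_initio_inference(software, version):
    """Format the ORF-prediction evidence string."""
    return f"ab initio prediction:{software}:{version}"


def alignment_inference(software, version, references=()):
    """Format the ORF-annotation evidence string.

    references is a sequence of (database, accession) pairs; it may be empty.
    """
    text = f"alignment:{software}:{version}"
    if references:
        text += ":" + ",".join(f"{db}:{acc}" for db, acc in references)
    return text


def feature_block(seq_id, orfs):
    """Render one feature table block for a sequence.

    Each ORF is a dict with 'start', 'end', an optional 'product'
    and an optional list of 'inferences' (hypothetical layout).
    """
    lines = [f">Feature {seq_id}"]
    for orf in orfs:
        lines.append(f"{orf['start']}\t{orf['end']}\tCDS")
        # ORFs without a hit fall back to 'hypothetical protein'.
        product = orf.get("product") or "hypothetical protein"
        lines.append(f"\t\t\tproduct\t{product}")
        for inference in orf.get("inferences", []):
            lines.append(f"\t\t\tinference\t{inference}")
    return "\n".join(lines) + "\n"


prediction = ab_initio_inference("pyrodigal-gv", "0.3.2")
alignment = alignment_inference(
    "MMseqs2",
    "17.b804f",
    [("UniProtKB", "A5Y5A1"),
     ("BFVD", "A5Y5A1_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000")],
)
table = feature_block("Seq1", [
    {"start": 73, "end": 3537, "product": "RNA-directed RNA polymerase",
     "inferences": [prediction, alignment]},
])
print(table)
```

Keeping the reference list optional mirrors the square brackets in the format above: the database and accession part is only appended when a reference is available.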
Reorientation of sequences
Additionally, if a taxonomy file is provided, suvtk features reorients the input nucleotide sequences based on their taxonomy. ssRNA(-) virus sequences (i.e. members of the Negarnaviricota) are given a negative orientation (3’ → 5’), and all other sequences, including unclassified sequences, a positive orientation (5’ → 3’). The current orientation of a sequence is determined from the strand on which pyrodigal finds the majority of ORFs.
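One possible reading of this rule is sketched below in Python. It is an interpretation under stated assumptions, not suvtk's implementation: the sequence is reverse-complemented whenever the strand holding most ORFs differs from the orientation wanted for its taxonomy, ties are left untouched, and membership of Negarnaviricota is detected by a simple substring match on a lineage string. The names `reorient` and `reverse_complement` are hypothetical.

```python
# Complement table covering the IUPAC nucleotide codes (S, W and N map to themselves).
_COMPLEMENT = str.maketrans(
    "ACGTRYKMBVDHNacgtrykmbvdhn",
    "TGCAYRMKVBHDNtgcayrmkvbhdn",
)


def reverse_complement(seq):
    """Return the reverse complement of a DNA-alphabet sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reorient(seq, orf_strands, lineage=""):
    """Reorient seq so that most ORFs lie on the wanted strand.

    orf_strands: one +1 or -1 per predicted ORF.
    lineage: taxonomy string; empty for unclassified sequences.
    """
    strands = list(orf_strands)
    plus = sum(1 for s in strands if s > 0)
    minus = sum(1 for s in strands if s < 0)
    if plus == minus:
        # Assumption: without a majority strand the sequence is left as is.
        return seq
    majority = 1 if plus > minus else -1
    # ssRNA(-) viruses should end up with their ORFs on the minus strand.
    wanted = -1 if "Negarnaviricota" in lineage else 1
    return seq if majority == wanted else reverse_complement(seq)


print(reorient("ATGC", [1, 1, -1], "Riboviria;Negarnaviricota"))
print(reorient("ATGC", [1, 1, -1]))
```

In this sketch the first call flips the sequence because a Negarnaviricota genome with most ORFs on the plus strand is in the unwanted orientation, while the second call returns the unclassified sequence unchanged.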
Required Input
-i, --input: Input FASTA file with sequences. (Required)
-o, --output: Output directory where the results will be saved. (Required)
-d, --database: Path to the suvtk database folder. (Required)
Optional Parameters
--coding-complete: Flag to keep only genomes with >50% coding capacity.
--phage: Input sequences are phage genomes (use pyrodigal-gv for ORF prediction).
--taxonomy: Taxonomy file for adjusting sequence orientation (particularly for ssRNA(-) viruses).
--separate-files: Flag to save feature tables into separate files rather than one combined file.
-t, --threads: Number of threads to use (default: auto).
Warning
For now, only coding-complete sequences (>50% coding capacity) will have predicted features and be present in the feature table (.tbl).
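The exact definition of coding capacity is not spelled out here. The Python sketch below assumes it is the fraction of nucleotide positions covered by at least one predicted ORF, so overlapping ORFs are merged before counting. The function names and the sequence length of 4000 are hypothetical; the ORF coordinates are those of Seq2 in the example table.

```python
def coding_capacity(seq_length, orfs):
    """Fraction of positions covered by at least one ORF.

    orfs: iterable of (start, end) pairs, 1-based and inclusive;
    the order of start and end does not matter.
    """
    intervals = sorted((min(s, e), max(s, e)) for s, e in orfs)
    covered = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / seq_length


def is_coding_complete(seq_length, orfs, threshold=0.5):
    """Apply the >50% rule from the warning above."""
    return coding_capacity(seq_length, orfs) > threshold


seq2_orfs = [(92, 1645), (1664, 2035), (2037, 3962)]
print(coding_capacity(4000, seq2_orfs))
print(is_coding_complete(4000, seq2_orfs))
```

Merging overlaps keeps the value at or below 1.0 even when ORFs on opposite strands cover the same region.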
Output
proteins.faa: Protein sequences in FASTA format.
miuvig_features.tsv: MIUVIG-related feature details (i.e. prediction tool, reference database and search method).
reoriented_nucleotide_sequences.fna: Potentially reoriented nucleotide sequences.
alignment.m8: Alignment file from the protein search.
One or more feature table files in NCBI format.
no_ORF_prediction.txt: List of sequences with insufficient ORF predictions.
Example Usage
suvtk features -i sequences.fasta -o output_dir -d /path/to/database --coding-complete --taxonomy taxonomy.tsv -t 4
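When the command is driven from a script, for instance to process several FASTA files, the argument list can be assembled programmatically. The Python sketch below uses only the options documented on this page; the helper name `build_features_command` is hypothetical, and the sketch only builds and prints the command rather than running it.

```python
import shlex


def build_features_command(fasta, outdir, database, coding_complete=False,
                           phage=False, taxonomy=None, separate_files=False,
                           threads=None):
    """Assemble the argument list for a `suvtk features` call."""
    cmd = ["suvtk", "features", "-i", fasta, "-o", outdir, "-d", database]
    if coding_complete:
        cmd.append("--coding-complete")
    if phage:
        cmd.append("--phage")
    if taxonomy is not None:
        cmd += ["--taxonomy", taxonomy]
    if separate_files:
        cmd.append("--separate-files")
    if threads is not None:
        cmd += ["-t", str(threads)]
    return cmd


cmd = build_features_command(
    "sequences.fasta", "output_dir", "/path/to/database",
    coding_complete=True, taxonomy="taxonomy.tsv", threads=4,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where suvtk is installed.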