suvtk package¶
Attention
Docstrings were generated with GitHub Copilot and can contain errors.
Submodules¶
suvtk.cli module¶
cli.py¶
This script provides a command-line interface (CLI) for submitting viral sequences to GenBank. It includes various subcommands for processing and preparing data, such as taxonomy assignment, feature extraction, structured comment generation, and more.
Classes¶
- FullHelpGroup
Custom Click Group to display commands in the order they were added.
Functions¶
- cli()
Main entry point for the CLI tool.
Commands¶
download-database
taxonomy
features
virus_info
co_occurrence
gbk2tbl
comments
table2asn
- class suvtk.cli.FullHelpGroup(name=None, commands=None, invoke_without_command=False, no_args_is_help=None, subcommand_metavar=None, chain=False, result_callback=None, **kwargs)[source]¶
Bases: click.Group
Custom Click Group to display commands in the order they were added.
- format_commands(ctx: click.Context, formatter: click.HelpFormatter)[source]¶
Formats and displays commands in the correct order.
suvtk.co_occurrence module¶
co_occurrence.py¶
This script identifies co-occurring sequences in an abundance table based on prevalence and correlation thresholds. It supports optional segment-specific analysis and contig length correction.
Functions¶
- calculate_proportion(df)
Calculate the proportion of samples for each contig.
- create_correlation_matrix(df_transposed)
Generate a Spearman correlation matrix and mask the upper triangle.
- segment_correlation_matrix(df, segment_list)
Calculate correlations for specific segments with all rows in the DataFrame.
- create_segment_list(segment_file)
Read a file containing segment identifiers and return them as a list.
- co_occurrence(input, output, segments, lengths, prevalence, correlation, strict)
Main command to identify co-occurring sequences in an abundance table.
- suvtk.co_occurrence.calculate_proportion(df)[source]¶
Calculate the proportion of samples for each contig in a dataframe.
- Parameters:
df (pandas.DataFrame) – A pandas DataFrame where rows represent contigs and columns represent samples.
- Returns:
The original DataFrame with two additional columns: ‘sample_count’, the total number of samples a contig is present in, and ‘proportion_samples’, the proportion of samples a contig is present in.
- Return type:
pandas.DataFrame
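The following is a minimal sketch of the behaviour described above, not the suvtk source. It assumes a contig counts as present in a sample when its abundance is greater than zero; the function name `calculate_proportion_sketch` and the example table are hypothetical.

```python
import pandas as pd


def calculate_proportion_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Add per-contig presence counts and proportions (illustrative only)."""
    result = df.copy()
    # Assumption: a contig is "present" in a sample when its abundance is > 0.
    presence = df.gt(0)
    result["sample_count"] = presence.sum(axis=1)
    result["proportion_samples"] = result["sample_count"] / df.shape[1]
    return result


# Hypothetical abundance table: rows are contigs, columns are samples.
abundance = pd.DataFrame(
    {"sample_A": [10, 0, 3], "sample_B": [5, 0, 0], "sample_C": [1, 2, 0]},
    index=["contig_1", "contig_2", "contig_3"],
)
proportions = calculate_proportion_sketch(abundance)
print(proportions)
```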
- suvtk.co_occurrence.create_correlation_matrix(df_transposed)[source]¶
Calculate a Spearman correlation matrix for the transposed dataframe and mask the upper triangle.
- Parameters:
df_transposed (pandas.DataFrame) – A transposed pandas DataFrame where rows represent samples and columns represent variables.
- Returns:
A masked correlation matrix with the upper triangle set to NaN, and the axes renamed to ‘Contig1’ and ‘Contig2’.
- Return type:
pandas.DataFrame
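One way to build such a masked matrix with pandas and NumPy is sketched below. This is an illustration under stated assumptions rather than the suvtk implementation: the diagonal is masked together with the upper triangle (the description above does not say how the diagonal is treated), and `create_correlation_matrix_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def create_correlation_matrix_sketch(df_transposed: pd.DataFrame) -> pd.DataFrame:
    """Spearman correlations between columns, keeping only the lower triangle."""
    corr = df_transposed.corr(method="spearman")
    # Assumption: self-correlations on the diagonal are masked as well.
    keep = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    masked = corr.where(keep)
    return masked.rename_axis(index="Contig1", columns="Contig2")


# Hypothetical input: rows are samples, columns are contigs.
samples = pd.DataFrame(
    {"c1": [1, 2, 3, 4], "c2": [2, 4, 6, 8], "c3": [4, 3, 2, 1]},
    index=["s1", "s2", "s3", "s4"],
)
masked = create_correlation_matrix_sketch(samples)
print(masked)
```

Masking one triangle means each contig pair appears once, so a later filter on the correlation threshold does not report the same pair twice.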
- suvtk.co_occurrence.create_segment_list(segment_file)[source]¶
Reads a file containing segment identifiers and returns them as a list.
- Parameters:
segment_file (str) – The path to a file containing segment identifiers, one per line.
- Returns:
A list of segment identifiers with whitespace stripped.
- Return type:
list
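A short sketch of reading such a file is shown below. Skipping blank lines is an assumption, and `create_segment_list_sketch`, the file name, and the identifiers are hypothetical.

```python
import os
import tempfile


def create_segment_list_sketch(segment_file):
    """Return the identifiers in a one-per-line file with whitespace stripped."""
    with open(segment_file) as handle:
        # Assumption: blank lines carry no identifier and are skipped.
        return [line.strip() for line in handle if line.strip()]


# Write a small hypothetical segment file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "segments.txt")
    with open(path, "w") as handle:
        handle.write("segment_L\n  segment_M  \n\nsegment_S\n")
    segments = create_segment_list_sketch(path)

print(segments)
```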
- suvtk.co_occurrence.segment_correlation_matrix(df, segment_list)[source]¶
Calculate the correlation of each segment in the segment list with all rows in the DataFrame.
- Parameters:
df (pandas.DataFrame) – A pandas DataFrame with rows representing samples and columns representing variables.
segment_list (list) – A list of segment indices to calculate correlations with.
- Returns:
A DataFrame where each column represents a segment from the segment_list and each value is the Spearman correlation with the corresponding row in the original DataFrame.
- Return type:
pandas.DataFrame
suvtk.comments module¶
suvtk.download_database module¶
download_database.py¶
This script downloads and extracts the suvtk database as a gzipped tar file from Zenodo.
Functions¶
- doi_to_record_id(doi: str) -> str
Extract the numeric record ID from a Zenodo DOI.
- fetch_record_metadata(record_id: str) -> dict
Fetch the Zenodo record metadata in JSON form.
- find_tar_file(files: list) -> dict
Locate the first .tar or .tar.gz file in the record’s files list.
- download_file(url: str, dest: str, chunk_size: int = 1024 * 1024)
Stream-download a file from a URL to a local path.
- unpack_tar(archive: str, output_dir: str = None)
Extract a .tar or .tar.gz archive to a directory.
- suvtk.download_database.doi_to_record_id(doi)[source]¶
Extract the numeric record ID from a Zenodo DOI.
- Return type:
str
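Zenodo DOIs have the form `10.5281/zenodo.<record id>`, so the record ID can be taken from the trailing digits. The sketch below illustrates this with a regular expression; it is not the suvtk source, the error handling is an assumption, and the DOI shown is a placeholder rather than the real database record.

```python
import re


def doi_to_record_id_sketch(doi: str) -> str:
    """Extract the trailing numeric record ID from a Zenodo DOI."""
    match = re.search(r"zenodo\.(\d+)$", doi.strip())
    if match is None:
        # Assumption: anything that is not a Zenodo DOI is rejected.
        raise ValueError(f"Not a Zenodo DOI: {doi!r}")
    return match.group(1)


# Placeholder DOI for illustration only.
record_id = doi_to_record_id_sketch("10.5281/zenodo.1234567")
print(record_id)
```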
- suvtk.download_database.download_file(url, dest, chunk_size=1048576)[source]¶
Stream-download a file from url to local path dest.
- suvtk.download_database.fetch_record_metadata(record_id)[source]¶
Fetch the Zenodo record metadata in JSON form.
- Return type:
dict
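The `find_tar_file` step listed in the function summary can be sketched as follows. The sketch assumes each entry in the record's files list is a dict whose `key` field holds the file name, as in Zenodo record metadata; the function name, the example entries, and the error raised when nothing matches are hypothetical.

```python
def find_tar_file_sketch(files: list) -> dict:
    """Return the first entry whose file name ends in .tar or .tar.gz."""
    for entry in files:
        # Assumption: the file name is stored under the "key" field.
        name = entry.get("key", "")
        if name.endswith((".tar", ".tar.gz")):
            return entry
    raise ValueError("No .tar or .tar.gz file found in the record")


# Hypothetical files list for illustration.
example_files = [
    {"key": "README.md"},
    {"key": "database.tar.gz"},
    {"key": "older_database.tar"},
]
archive_entry = find_tar_file_sketch(example_files)
print(archive_entry["key"])
```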
suvtk.features module¶
features.py¶
This script processes input sequences to predict open reading frames (ORFs), aligns the predicted protein sequences against a database, and generates feature tables for submission to GenBank.
Functions¶
- validate_translation_table(ctx, param, value)
Validate the given translation table.
- calculate_coding_capacity(genes, seq_length)
Calculate the total coding capacity for a list of genes.
- find_orientation(genes)
Determine the orientation of genes based on strand information.
- predict_orfs(orf_finder, seq)
Predict ORFs, compute coding capacity, and determine orientation.
- features(fasta_file, output_path, database, transl_table, coding_complete, taxonomy, separate_files, threads)
Main command to create feature tables for sequences.
- suvtk.features.calculate_coding_capacity(genes, seq_length)[source]¶
Calculate the total coding capacity for a list of genes.
- Parameters:
genes (list) – A list of gene objects.
seq_length (int) – The length of the sequence.
- Returns:
The total coding capacity.
- Return type:
float
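The exact formula is not given above. A plausible reading, sketched here as an assumption, is the summed length of all genes divided by the sequence length. `GeneStub` is a hypothetical stand-in for the gene objects, and coordinates are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass


@dataclass
class GeneStub:
    """Hypothetical stand-in for a predicted gene."""
    begin: int
    end: int
    strand: int


def calculate_coding_capacity_sketch(genes, seq_length):
    """Fraction of the sequence covered by genes (assumed definition)."""
    # Assumption: 1-based inclusive coordinates, overlaps counted twice.
    coding_bases = sum(abs(gene.end - gene.begin) + 1 for gene in genes)
    return coding_bases / seq_length


genes = [GeneStub(1, 300, 1), GeneStub(401, 700, -1)]
capacity = calculate_coding_capacity_sketch(genes, seq_length=1000)
print(capacity)
```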
- suvtk.features.extract_gene_results(genes, record_id, seq_length)[source]¶
Extract gene prediction results for a sequence.
- Parameters:
genes (list) – A list of gene objects.
record_id (str) – The ID of the sequence record.
seq_length (int) – The length of the sequence.
- Returns:
A list of gene prediction results.
- Return type:
list
- suvtk.features.find_orientation(genes)[source]¶
Calculate the sum of the strand orientations for a list of genes. If the sum is zero, return the orientation of the largest gene.
- Parameters:
genes (list) – A list of gene objects, each having ‘strand’, ‘begin’, and ‘end’ attributes.
- Returns:
The sum of strand orientations across all genes, or the orientation of the largest gene if the sum is zero.
- Return type:
int
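The rule described above can be sketched as below. `GeneStub` is a hypothetical stand-in for the gene objects, strands are assumed to be encoded as +1 and -1, and returning 0 for an empty gene list is an assumption not stated in the docstring.

```python
from dataclasses import dataclass


@dataclass
class GeneStub:
    """Hypothetical stand-in for a predicted gene."""
    begin: int
    end: int
    strand: int  # assumed encoding: +1 forward, -1 reverse


def find_orientation_sketch(genes):
    """Sum the strands; on a tie, fall back to the largest gene's strand."""
    if not genes:
        return 0  # Assumption: no genes means no orientation signal.
    total = sum(gene.strand for gene in genes)
    if total != 0:
        return total
    largest = max(genes, key=lambda gene: abs(gene.end - gene.begin))
    return largest.strand


mostly_forward = [GeneStub(1, 300, 1), GeneStub(400, 600, 1), GeneStub(700, 800, -1)]
tied = [GeneStub(1, 300, 1), GeneStub(400, 1000, -1)]
print(find_orientation_sketch(mostly_forward))
print(find_orientation_sketch(tied))
```

In the tied case the strands cancel out, so the reverse-strand gene, being the longer of the two, decides the result.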
- suvtk.features.get_lineage(record_id, taxonomy_data, taxdb)[source]¶
Retrieve the lineage of a given record from the taxonomy table.
- Parameters:
record_id (str) – The ID of the sequence record.
taxonomy_data (pandas.DataFrame) – The taxonomy data table.
taxdb (taxopy.TaxDb) – The taxonomy database.
- Returns:
The lineage of the record.
- Return type:
list
- suvtk.features.predict_orfs(orf_finder, seq)[source]¶
Find genes, compute coding capacity, and determine orientation.
- Parameters:
orf_finder (pyrodigal_gv.ViralGeneFinder) – The ORF finder object.
seq (Bio.Seq.Seq) – The sequence to analyze.
- Returns:
A tuple containing genes, coding capacity, orientation, and the ORF finder used.
- Return type:
tuple
- suvtk.features.save_ncbi_feature_tables(df, output_dir='.', single_file=True)[source]¶
Generate and save NCBI feature tables for sequences in a DataFrame.
This function creates a single feature table file by default, but can also save separate files for each unique sequence ID when specified.
- Parameters:
df (pd.DataFrame) – DataFrame containing sequence data with columns [‘seqid’, ‘accession’, ‘start’, ‘end’, ‘strand’, ‘type’, ‘Protein names’, ‘source’, ‘start_codon’, ‘partial_begin’, ‘partial_end’].
output_dir (str, optional) – Directory path to save the feature tables. Defaults to “.”.
single_file (bool, optional) – If True, saves all features to one file; otherwise, saves separate files.
- Return type:
None
- suvtk.features.select_top_structure(df)[source]¶
Select the top structure for each query based on the bitscore.
- Parameters:
df (pandas.DataFrame) – A DataFrame with columns ‘query’ and ‘bits’.
- Returns:
A DataFrame with the top structure for each query.
- Return type:
pandas.DataFrame
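Selecting the highest-scoring row per query is a common pandas pattern; one way to express it is sketched below. This is not necessarily how suvtk does it. The `target` column and the example values are hypothetical, and when two hits share the top bitscore this sketch keeps the first one.

```python
import pandas as pd


def select_top_structure_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the row with the highest 'bits' value for each 'query'."""
    best_rows = df.groupby("query")["bits"].idxmax()
    return df.loc[best_rows].reset_index(drop=True)


# Hypothetical alignment results.
hits = pd.DataFrame(
    {
        "query": ["a", "a", "b"],
        "target": ["t1", "t2", "t3"],
        "bits": [50.0, 120.0, 80.0],
    }
)
top_hits = select_top_structure_sketch(hits)
print(top_hits)
```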
- suvtk.features.validate_translation_table(ctx, param, value)[source]¶
Validate that the given translation table is one of the valid genetic codes.
- Parameters:
ctx (click.Context) – The Click context object. Unused.
param (click.Parameter) – The parameter object. Unused.
value (int) – The given translation table.
- Returns:
The given translation table if it is valid.
- Return type:
int
- Raises:
click.BadParameter – If the given translation table is not valid.
- suvtk.features.write_feature_entries(file, group)[source]¶
Helper function to write feature entries to a file.
- Parameters:
file (file-like object) – The file to write to.
group (pandas.DataFrame) – The group of feature entries to write.
- Return type:
None
- suvtk.features.write_nucleotides(sequence, output_handle, overwrite)[source]¶
Write nucleotide sequences to a file.
- Parameters:
sequence (Bio.SeqRecord.SeqRecord) – The sequence record to write.
output_handle (str) – The output file path.
overwrite (bool) – Whether to overwrite the file.
- Returns:
Updated overwrite flag.
- Return type:
bool
- suvtk.features.write_proteins(genes, record_id, dst_path, overwrite)[source]¶
Write protein translations to a file.
- Parameters:
genes (list) – A list of gene objects.
record_id (str) – The ID of the sequence record.
dst_path (str) – The destination file path.
overwrite (bool) – Whether to overwrite the file.
- Returns:
Updated overwrite flag.
- Return type:
bool
suvtk.gbk2tbl module¶
This script converts a GenBank file (.gbk or .gb) into a Sequin feature table (.tbl), which is an input file of table2asn used for creating an ASN.1 file (.sqn).
Package requirements: Biopython and click
Examples
- Simple command:
python gbk2tbl.py --mincontigsize 200 --prefix any_prefix --input annotation.gbk
Inputs¶
- GenBank file
Passed to the script through input.
Outputs¶
- any_prefix.tbl (str)
The Sequin feature table.
- any_prefix.fsa (str)
The corresponding FASTA file.
- Parameters:
--mincontigsize (int, optional) – The minimum contig size, default = 0.
--prefix (str, optional) – The prefix of output filenames, default = ‘seq’.
Notes
These files are inputs for table2asn which generates ASN.1 files (*.sqn).
Development notes¶
This script is derived from the one developed by SEQanswers users nickloman (https://gist.github.com/nickloman/2660685/genbank_to_tbl.py) and ErinL who modified nickloman’s script and put it on the forum post (http://seqanswers.com/forums/showthread.php?t=19975).
Author of this version: Yu Wan (wanyuac@gmail.com, github.com/wanyuac)
Creation: 20 June 2015 - 11 July 2015; the latest edition: 21 October 2019
Compatibility: Python 2 and 3.
Licence: GNU GPL 2.1
suvtk.table2asn module¶
table2asn.py¶
This script generates a .sqn submission file for GenBank. It processes source and comments files, validates data, and prepares the submission package.
Functions¶
- process_comments(src_file, comments_file)
Update the comments file based on the source file.
- table2asn(input, output, src_file, features, template, comments)
Main command to generate a .sqn file for GenBank submission.
- suvtk.table2asn.process_comments(src_file, comments_file)[source]¶
Processes comments by updating the comments file based on the source file.
For each group of isolates, it updates the comments file with:
- Extra data: collection_date, geo_loc_name, and lat_lon (copied from Collection_date, geo_loc_name, and Lat_Lon in the source file).
- If duplicates exist, it also updates:
  - The count of isolates.
  - The majority predicted genome type (taken from the comments file).
  - The predicted genome structure, which is set to “segmented”.
- Parameters:
src_file (str) – The file path to the source file in tab-separated format.
comments_file (str) – The file path to the comments file in tab-separated format.
- Return type:
None
suvtk.taxonomy module¶
taxonomy.py¶
This script assigns taxonomy to sequences by searching them with MMseqs2 against the ICTV nr database. It generates taxonomy files and integrates with other modules for further processing.
Functions¶
- taxonomy(fasta_file, database, output_path, seqid, threads)
Main command to assign taxonomy to sequences.
suvtk.utils module¶
utils.py¶
This script provides utility functions for executing shell commands, reading CSV files safely, and determining the number of available CPUs. These utilities are used across various modules in the project.
Functions¶
- Exec(CmdLine, fLog=None, capture=False)
Execute a shell command and optionally log or capture the output.
- safe_read_csv(path, **kwargs)
Read a CSV file with ASCII encoding and handle UnicodeDecodeError.
- get_available_cpus()
Get the number of available CPUs for the current process.
- suvtk.utils.Exec(CmdLine, fLog=None, capture=False)[source]¶
Execute a command line in a shell, logging it to a file if specified.
- Parameters:
CmdLine (str) – The command line to execute.
fLog (file object or None, optional) – A file object to log the command and results, or None.
capture (bool, optional) – Whether to capture output instead of printing.
- Returns:
The output of the command if captured, else None.
- Return type:
str or None
- Raises:
subprocess.CalledProcessError – If the command execution fails.
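A minimal sketch of this kind of wrapper, built on `subprocess.run`, is shown below. It is an illustration and not the suvtk source: only the command line is written to the log here, whereas the function described above also logs results, and `exec_sketch` is a hypothetical name.

```python
import io
import subprocess


def exec_sketch(cmd_line, log_handle=None, capture=False):
    """Run a shell command; optionally log it and return its output."""
    if log_handle is not None:
        # Assumption: only the command itself is logged in this sketch.
        log_handle.write(cmd_line + "\n")
    # check=True raises subprocess.CalledProcessError on a non-zero exit code.
    result = subprocess.run(
        cmd_line, shell=True, check=True, text=True, capture_output=capture
    )
    return result.stdout if capture else None


log = io.StringIO()
output = exec_sketch("echo hello", log_handle=log, capture=True)
print(output.strip())
```

Because the command goes through a shell, the command line should come from trusted input only.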
- suvtk.utils.get_available_cpus()[source]¶
Get the number of available CPUs for the current process.
- Returns:
The number of available CPUs.
- Return type:
int
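The number of CPUs a process may use can be lower than the machine total, for example under a job scheduler or in a container with CPU pinning. The sketch below shows one common way to account for that, under the assumption that the total core count is an acceptable fallback; it is not necessarily how suvtk does it.

```python
import os


def get_available_cpus_sketch():
    """Return the number of CPUs usable by the current process."""
    try:
        # Respects CPU affinity; only available on some platforms (e.g. Linux).
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Assumption: fall back to the machine-wide count, or 1 if unknown.
        return os.cpu_count() or 1


available = get_available_cpus_sketch()
print(available)
```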
- suvtk.utils.safe_read_csv(path, **kwargs)[source]¶
Reads a CSV file using ASCII encoding. If a UnicodeDecodeError occurs, raises a ClickException showing the offending character.
- Parameters:
path (str) – Path to the CSV file.
**kwargs (dict) – Additional arguments to pass to pandas.read_csv.
- Returns:
The contents of the CSV file.
- Return type:
pandas.DataFrame
- Raises:
click.ClickException – If the file contains non-ASCII characters.
suvtk.virus_info module¶
virus_info.py¶
This script provides information on potentially segmented viruses based on their taxonomy. It also outputs genome type and structure information for MIUVIG structured comments.
Functions¶
- load_segment_db()
Load the segmented viruses database.
- load_genome_type_db()
Load the genome structure database.
- run_segment_info(tax_df, database, output_path)
Process taxonomy data to extract segmented virus and genome type information.
- virus_info(taxonomy, database, output_path)
Main command to analyze segmented viruses and generate genome type information.
- suvtk.virus_info.load_genome_type_db()[source]¶
Load the genome structure database.
- Returns:
Data frame with the genome structure database.
- Return type:
pandas.DataFrame
- suvtk.virus_info.load_segment_db()[source]¶
Load the segmented viruses database.
- Returns:
Data frame with the segmented viruses database.
- Return type:
pandas.DataFrame
- suvtk.virus_info.run_segment_info(tax_df, database, output_path)[source]¶
Process the taxonomy file to get segmented virus and genome type info.
- Parameters:
tax_df (pandas.DataFrame) – Pandas DataFrame with taxonomy information.
database (str) – The suvtk database path (contains nodes.dmp, names.dmp, etc.).
output_path (str) – The output directory where results will be saved.
- Return type:
None
comments.py¶
This script generates structured comment files based on MIUVIG standards. It validates input files, merges data from multiple sources, and ensures compliance with predefined standards for submission to GenBank.
Functions¶
- comments
Main command to generate a structured comment file based on MIUVIG standards.