suvtk package

Attention

Docstrings were generated with GitHub Copilot and can contain errors.

Submodules

suvtk.cli module

cli.py

This script provides a command-line interface (CLI) for submitting viral sequences to GenBank. It includes subcommands for processing and preparing data, such as taxonomy assignment, feature extraction, and structured comment generation.

Classes

FullHelpGroup

Custom Click Group to display commands in the order they were added.

Functions

cli()

Main entry point for the CLI tool.

Commands

  • download-database

  • taxonomy

  • features

  • virus_info

  • co_occurrence

  • gbk2tbl

  • comments

  • table2asn

class suvtk.cli.FullHelpGroup(name=None, commands=None, invoke_without_command=False, no_args_is_help=None, subcommand_metavar=None, chain=False, result_callback=None, **kwargs)[source]

Bases: Group

Custom Click Group to display commands in the order they were added.

format_commands(ctx, formatter)[source]

Formats and displays commands in the correct order.

Parameters:
  • ctx (click.Context) – The Click context.

  • formatter (click.HelpFormatter) – The Click help formatter.

list_commands(ctx)[source]

Return commands in the order they were added.

Parameters:

ctx (click.Context) – The Click context.

Returns:

List of command names in the order they were added.

Return type:

list

suvtk.co_occurrence module

co_occurrence.py

This script identifies co-occurring sequences in an abundance table based on prevalence and correlation thresholds. It supports optional segment-specific analysis and contig length correction.

Functions

calculate_proportion(df)

Calculate the proportion of samples for each contig.

create_correlation_matrix(df_transposed)

Generate a Spearman correlation matrix and mask the upper triangle.

segment_correlation_matrix(df, segment_list)

Calculate correlations for specific segments with all rows in the DataFrame.

create_segment_list(segment_file)

Read a file containing segment identifiers and return them as a list.

co_occurrence(input, output, segments, lengths, prevalence, correlation, strict)

Main command to identify co-occurring sequences in an abundance table.

suvtk.co_occurrence.calculate_proportion(df)[source]

Calculate the proportion of samples for each contig in a dataframe.

Parameters:

df (pandas.DataFrame) – A pandas DataFrame where rows represent contigs and columns represent samples.

Returns:

The original DataFrame with two additional columns: ‘sample_count’, the total number of samples a contig is present in, and ‘proportion_samples’, the proportion of samples a contig is present in.

Return type:

pandas.DataFrame

suvtk.co_occurrence.create_correlation_matrix(df_transposed)[source]

Calculate a Spearman correlation matrix for the transposed dataframe and mask the upper triangle.

Parameters:

df_transposed (pandas.DataFrame) – A transposed pandas DataFrame where rows represent samples and columns represent variables.

Returns:

A masked correlation matrix with the upper triangle set to NaN, and the axes renamed to ‘Contig1’ and ‘Contig2’.

Return type:

pandas.DataFrame

suvtk.co_occurrence.create_segment_list(segment_file)[source]

Reads a file containing segment identifiers and returns them as a list.

Parameters:

segment_file (str) – The path to a file containing segment identifiers, one per line.

Returns:

A list of segment identifiers with whitespace stripped.

Return type:

list
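As a rough illustration of the behaviour described above, the following sketch reads one identifier per line and strips whitespace. The function name `read_segment_ids` is hypothetical, and skipping blank lines is an assumption; the actual implementation may differ.

```python
from pathlib import Path


def read_segment_ids(segment_file):
    """Return the lines of a file with surrounding whitespace removed.

    Blank lines are skipped here; that is an assumption of this sketch.
    """
    with Path(segment_file).open() as handle:
        return [line.strip() for line in handle if line.strip()]
```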

suvtk.co_occurrence.segment_correlation_matrix(df, segment_list)[source]

Calculate the correlation of each segment in the segment list with all rows in the DataFrame.

Parameters:
  • df (pandas.DataFrame) – A pandas DataFrame with rows representing samples and columns representing variables.

  • segment_list (list) – A list of segment indices to calculate correlations with.

Returns:

A DataFrame where each column represents a segment from the segment_list and each value is the Spearman correlation with the corresponding row in the original DataFrame.

Return type:

pandas.DataFrame

suvtk.comments module

comments.py

This script generates structured comment files based on MIUVIG standards. It validates input files, merges data from multiple sources, and ensures compliance with predefined standards for submission to GenBank.

Functions

comments(taxonomy, features, miuvig, assembly, checkv, output)

Generate a structured comment file based on MIUVIG standards.

suvtk.download_database module

download_database.py

This script downloads and extracts the suvtk database as a gzipped tar file from Zenodo.

Functions

doi_to_record_id(doi: str) -> str

Extract the numeric record ID from a Zenodo DOI.

fetch_record_metadata(record_id: str) -> dict

Fetch the Zenodo record metadata in JSON form.

find_tar_file(files: list) -> dict

Locate the first .tar or .tar.gz file in the record’s files list.

download_file(url: str, dest: str, chunk_size: int = 1024 * 1024)

Stream-download a file from a URL to a local path.

unpack_tar(archive: str, output_dir: str = None)

Extract a .tar or .tar.gz archive to a directory.

suvtk.download_database.doi_to_record_id(doi)[source]

Extract the numeric record ID from a Zenodo DOI.

Return type:

str
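One plausible way to do this is with a regular expression, assuming Zenodo DOIs end in `zenodo.<digits>`. The function name `record_id_from_doi` and the DOI in the usage note are placeholders for illustration, not values taken from suvtk.

```python
import re


def record_id_from_doi(doi):
    """Return the trailing digits of a DOI that ends in 'zenodo.<digits>'."""
    match = re.search(r"zenodo\.(\d+)$", doi.strip())
    if match is None:
        raise ValueError(f"Not a recognisable Zenodo DOI: {doi!r}")
    return match.group(1)
```

For example, `record_id_from_doi("10.5281/zenodo.1234567")` returns the string `"1234567"`.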

suvtk.download_database.download_file(url, dest, chunk_size=1048576)[source]

Stream-download a file from url to local path dest.
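Streaming in chunks keeps memory use bounded for large archives. The sketch below shows the idea with the standard library's `urllib`; the module itself may well use a different HTTP client, and the name `stream_to_file` is hypothetical.

```python
from urllib.request import urlopen


def stream_to_file(url, dest, chunk_size=1024 * 1024):
    """Copy the body of a URL to a local file, one chunk at a time."""
    with urlopen(url) as response, open(dest, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # an empty read means the stream is finished
                break
            out.write(chunk)
    return dest
```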

suvtk.download_database.fetch_record_metadata(record_id)[source]

Fetch the Zenodo record metadata in JSON form.

Return type:

dict

suvtk.download_database.find_tar_file(files)[source]

From the record’s files list, locate the first .tar or .tar.gz.

Return type:

dict
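A minimal sketch of such a lookup is shown below. It assumes each entry in the files list is a dict whose file name is stored under a `"key"` field; that field name, the error raised, and the function name `first_tar_entry` are assumptions of this example rather than confirmed details of the module.

```python
def first_tar_entry(files):
    """Return the first entry whose file name ends in .tar or .tar.gz."""
    for entry in files:
        # Assumption: the file name is stored under the "key" field.
        name = entry.get("key", "")
        if name.endswith((".tar", ".tar.gz")):
            return entry
    raise FileNotFoundError("No .tar or .tar.gz file in the files list")
```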

suvtk.download_database.unpack_tar(archive, output_dir=None)[source]

Extract a .tar or .tar.gz archive into the specified output_dir (defaults to current directory), preserving all folder structure.
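The extraction step can be sketched with the standard `tarfile` module as follows. The function name `extract_archive` is hypothetical, and the use of the `"data"` extraction filter (where the running Python supports it) is a precaution added in this sketch, not something the documentation above states.

```python
import os
import tarfile


def extract_archive(archive, output_dir=None):
    """Extract a .tar or .tar.gz archive, keeping its folder structure."""
    target = output_dir or os.getcwd()
    os.makedirs(target, exist_ok=True)
    # Use the safer "data" filter when this Python version provides it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    # Mode "r:*" lets tarfile detect whether the archive is compressed.
    with tarfile.open(archive, "r:*") as tar:
        tar.extractall(path=target, **kwargs)
    return target
```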

suvtk.features module

features.py

This script processes input sequences to predict open reading frames (ORFs), aligns the predicted protein sequences against a database, and generates feature tables for submission to GenBank.

Functions

validate_translation_table(ctx, param, value)

Validate the given translation table.

calculate_coding_capacity(genes, seq_length)

Calculate the total coding capacity for a list of genes.

find_orientation(genes)

Determine the orientation of genes based on strand information.

predict_orfs(orf_finder, seq)

Predict ORFs, compute coding capacity, and determine orientation.

features(fasta_file, output_path, database, transl_table, coding_complete, taxonomy, separate_files, threads)

Main command to create feature tables for sequences.

suvtk.features.calculate_coding_capacity(genes, seq_length)[source]

Calculate the total coding capacity for a list of genes.

Parameters:
  • genes (list) – A list of gene objects.

  • seq_length (int) – The length of the sequence.

Returns:

The total coding capacity.

Return type:

float
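The documentation does not spell out the formula. A common definition, assumed here, is the number of bases covered by genes divided by the sequence length. The `Gene` class below is a hypothetical stand-in with 1-based inclusive coordinates, and overlapping genes are counted twice in this sketch.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    """Hypothetical stand-in for a predicted gene (1-based, inclusive)."""
    begin: int
    end: int
    strand: int


def coding_fraction(genes, seq_length):
    """Return the summed gene lengths as a fraction of the sequence length."""
    coding_bases = sum(abs(g.end - g.begin) + 1 for g in genes)
    return coding_bases / seq_length
```

With two 300-base genes on a 1000-base sequence, this gives a fraction of 0.6.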

suvtk.features.extract_gene_results(genes, record_id, seq_length)[source]

Extract gene prediction results for a sequence.

Parameters:
  • genes (list) – A list of gene objects.

  • record_id (str) – The ID of the sequence record.

  • seq_length (int) – The length of the sequence.

Returns:

A list of gene prediction results.

Return type:

list

suvtk.features.find_orientation(genes)[source]

Calculate the sum of the strand orientations for a list of genes. If the sum is zero, return the orientation of the largest gene.

Parameters:

genes (list) – A list of gene objects, each having ‘strand’, ‘begin’, and ‘end’ attributes.

Returns:

The sum of strand orientations across all genes, or the orientation of the largest gene if the sum is zero.

Return type:

int
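The rule described above can be sketched as follows. The `Gene` tuple is a hypothetical stand-in carrying the ‘strand’, ‘begin’, and ‘end’ attributes, with strand encoded as +1 or -1 (an assumption). The sketch expects a non-empty gene list.

```python
from collections import namedtuple

# Hypothetical stand-in for a gene object; strand is +1 or -1.
Gene = namedtuple("Gene", ["begin", "end", "strand"])


def net_orientation(genes):
    """Sum the strands; on a tie, fall back to the largest gene's strand."""
    strand_sum = sum(g.strand for g in genes)
    if strand_sum != 0:
        return strand_sum
    largest = max(genes, key=lambda g: abs(g.end - g.begin))
    return largest.strand
```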

suvtk.features.get_lineage(record_id, taxonomy_data, taxdb)[source]

Retrieve the lineage of a given record from the taxonomy table.

Parameters:
  • record_id (str) – The ID of the sequence record.

  • taxonomy_data (pandas.DataFrame) – The taxonomy data table.

  • taxdb (taxopy.TaxDb) – The taxonomy database.

Returns:

The lineage of the record.

Return type:

list

suvtk.features.predict_orfs(orf_finder, seq)[source]

Find genes, compute coding capacity, and determine orientation.

Parameters:
  • orf_finder (pyrodigal_gv.ViralGeneFinder) – The ORF finder object.

  • seq (Bio.Seq.Seq) – The sequence to analyze.

Returns:

A tuple containing genes, coding capacity, orientation, and the ORF finder used.

Return type:

tuple

suvtk.features.save_ncbi_feature_tables(df, output_dir='.', single_file=True)[source]

Generate and save NCBI feature tables for sequences in a DataFrame.

This function creates a single feature table file by default, but can also save separate files for each unique sequence ID when specified.

Parameters:
  • df (pd.DataFrame) – DataFrame containing sequence data with columns [‘seqid’, ‘accession’, ‘start’, ‘end’, ‘strand’, ‘type’, ‘Protein names’, ‘source’, ‘start_codon’, ‘partial_begin’, ‘partial_end’].

  • output_dir (str, optional) – Directory path to save the feature tables. Defaults to “.”.

  • single_file (bool, optional) – If True, saves all features to one file; otherwise, saves separate files.

Return type:

None

suvtk.features.select_top_structure(df)[source]

Select the top structure for each query based on the bitscore.

Parameters:

df (pandas.DataFrame) – A DataFrame with columns ‘query’ and ‘bits’.

Returns:

A DataFrame with the top structure for each query.

Return type:

pandas.DataFrame

suvtk.features.validate_translation_table(ctx, param, value)[source]

Validate that the given translation table is one of the valid genetic codes.

Parameters:
  • ctx (click.Context) – The Click context object. Unused.

  • param (click.Parameter) – The parameter object. Unused.

  • value (int) – The given translation table.

Returns:

The given translation table if it is valid.

Return type:

int

Raises:

click.BadParameter – If the given translation table is not valid.

suvtk.features.write_feature_entries(file, group)[source]

Helper function to write feature entries to a file.

Parameters:
  • file (file-like object) – The file to write to.

  • group (pandas.DataFrame) – The group of feature entries to write.

Return type:

None

suvtk.features.write_nucleotides(sequence, output_handle, overwrite)[source]

Write nucleotide sequences to a file.

Parameters:
  • sequence (Bio.SeqRecord.SeqRecord) – The sequence record to write.

  • output_handle (str) – The output file path.

  • overwrite (bool) – Whether to overwrite the file.

Returns:

Updated overwrite flag.

Return type:

bool

suvtk.features.write_proteins(genes, record_id, dst_path, overwrite)[source]

Write protein translations to a file.

Parameters:
  • genes (list) – A list of gene objects.

  • record_id (str) – The ID of the sequence record.

  • dst_path (str) – The destination file path.

  • overwrite (bool) – Whether to overwrite the file.

Returns:

Updated overwrite flag.

Return type:

bool

suvtk.gbk2tbl module

This script converts a GenBank file (.gbk or .gb) into a Sequin feature table (.tbl), which is an input file of table2asn used for creating an ASN.1 file (.sqn).

Package requirements: Biopython and click

Examples

Simple command:

python gbk2tbl.py --mincontigsize 200 --prefix any_prefix --input annotation.gbk

Inputs

GenBank file

Passed to the script through the --input option.

Outputs

any_prefix.tbl (str)

The Sequin feature table.

any_prefix.fsa (str)

The corresponding FASTA file.

Parameters:
  • --mincontigsize (int, optional) – The minimum contig size, default = 0.

  • --prefix (str, optional) – The prefix of output filenames, default = ‘seq’.

Notes

These files are inputs for table2asn which generates ASN.1 files (*.sqn).

Development notes

This script is derived from the one developed by SEQanswers users nickloman (https://gist.github.com/nickloman/2660685/genbank_to_tbl.py) and ErinL who modified nickloman’s script and put it on the forum post (http://seqanswers.com/forums/showthread.php?t=19975).

Author of this version: Yu Wan (wanyuac@gmail.com, github.com/wanyuac)

Creation: 20 June 2015 - 11 July 2015; latest edition: 21 October 2019

Dependency: Python versions 2 and 3 compatible.

Licence: GNU GPL 2.1

suvtk.table2asn module

table2asn.py

This script generates a .sqn submission file for GenBank. It processes source and comments files, validates data, and prepares the submission package.

Functions

process_comments(src_file, comments_file)

Update the comments file based on the source file.

table2asn(input, output, src_file, features, template, comments)

Main command to generate a .sqn file for GenBank submission.

suvtk.table2asn.process_comments(src_file, comments_file)[source]

Processes comments by updating the comments file based on the source file.

For each group of isolates, it updates the comments file as follows:

  • Adds extra data: collection_date, geo_loc_name, and lat_lon (copied from Collection_date, geo_loc_name, and Lat_Lon in the source file).

  • If duplicates exist, it also:
    • Updates the count of isolates.

    • Updates the majority predicted genome type (taken from the comments file).

    • Sets the predicted genome structure to “segmented”.

Parameters:
  • src_file (str) – The file path to the source file in tab-separated format.

  • comments_file (str) – The file path to the comments file in tab-separated format.

Return type:

None

suvtk.taxonomy module

taxonomy.py

This script assigns taxonomy to sequences by searching them with MMseqs2 against the ICTV nr database. It generates taxonomy files and integrates with other modules for further processing.

Functions

taxonomy(fasta_file, database, output_path, seqid, threads)

Main command to assign taxonomy to sequences.

suvtk.utils module

utils.py

This script provides utility functions for executing shell commands, reading CSV files safely, and determining the number of available CPUs. These utilities are used across various modules in the project.

Functions

Exec(CmdLine, fLog=None, capture=False)

Execute a shell command and optionally log or capture the output.

safe_read_csv(path, **kwargs)

Read a CSV file with ASCII encoding and handle UnicodeDecodeError.

get_available_cpus()

Get the number of available CPUs for the current process.

suvtk.utils.Exec(CmdLine, fLog=None, capture=False)[source]

Execute a command line in a shell, logging it to a file if specified.

Parameters:
  • CmdLine (str) – The command line to execute.

  • fLog (file object or None, optional) – A file object to log the command and results, or None.

  • capture (bool, optional) – Whether to capture output instead of printing.

Returns:

The output of the command if captured, else None.

Return type:

str or None

Raises:

subprocess.CalledProcessError – If the command execution fails.
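A wrapper with this contract could look roughly like the sketch below, built on `subprocess.run`. The name `run_shell` and the exact log format are hypothetical; only the overall behaviour (optional logging, optional capture, an exception on failure) follows the description above.

```python
import subprocess


def run_shell(cmd_line, log=None, capture=False):
    """Run a command in a shell; return its stdout if capture is True."""
    if log is not None:
        log.write(f"$ {cmd_line}\n")
    # check=True raises subprocess.CalledProcessError on a non-zero exit.
    result = subprocess.run(
        cmd_line,
        shell=True,
        check=True,
        text=True,
        stdout=subprocess.PIPE if capture else None,
    )
    if capture and log is not None:
        log.write(result.stdout)
    return result.stdout if capture else None
```

Because the command runs through a shell, it should only be built from trusted input.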

suvtk.utils.get_available_cpus()[source]

Get the number of available CPUs for the current process.

Returns:

The number of available CPUs.

Return type:

int
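The number of CPUs available to the current process can be smaller than the machine total, for example under a job scheduler or CPU affinity mask. One common way to account for this, assumed here rather than confirmed from the source, is to prefer `os.sched_getaffinity` where the platform offers it. The name `available_cpus` is hypothetical.

```python
import os


def available_cpus():
    """Return the CPU count usable by this process, falling back to the total."""
    if hasattr(os, "sched_getaffinity"):
        # Only available on some platforms (e.g. Linux).
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1
```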

suvtk.utils.safe_read_csv(path, **kwargs)[source]

Reads a CSV file using ASCII encoding. If a UnicodeDecodeError occurs, raises a ClickException showing the offending character.

Parameters:
  • path (str) – Path to the CSV file.

  • **kwargs (dict) – Additional arguments to pass to pandas.read_csv.

Returns:

The contents of the CSV file.

Return type:

pandas.DataFrame

Raises:

click.ClickException – If the file contains non-ASCII characters.
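The sketch below illustrates the same idea using only the standard library: it swaps in the `csv` module for pandas and `ValueError` for `click.ClickException`, so it is not the module's implementation. It also assumes that any non-ASCII content is UTF-8 when displaying the offending character. The name `read_ascii_csv` is hypothetical.

```python
import csv
import io


def read_ascii_csv(path, **kwargs):
    """Parse a CSV file as ASCII; name the first non-ASCII character if any."""
    with open(path, "rb") as handle:
        raw = handle.read()
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError as err:
        # Decode a few bytes at the failing offset as UTF-8 to show the character.
        offender = raw[err.start:err.start + 4].decode("utf-8", errors="replace")[0]
        line_no = raw.count(b"\n", 0, err.start) + 1
        raise ValueError(
            f"{path}: non-ASCII character {offender!r} on line {line_no}"
        ) from err
    return list(csv.reader(io.StringIO(text, newline=""), **kwargs))
```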

suvtk.virus_info module

virus_info.py

This script provides information on potentially segmented viruses based on their taxonomy. It also outputs genome type and structure information for MIUVIG structured comments.

Functions

load_segment_db()

Load the segmented viruses database.

load_genome_type_db()

Load the genome structure database.

run_segment_info(tax_df, database, output_path)

Process taxonomy data to extract segmented virus and genome type information.

virus_info(taxonomy, database, output_path)

Main command to analyze segmented viruses and generate genome type information.

suvtk.virus_info.load_genome_type_db()[source]

Load the genome structure database.

Returns:

Data frame with the genome structure database.

Return type:

pandas.DataFrame

suvtk.virus_info.load_segment_db()[source]

Load the segmented viruses database.

Returns:

Data frame with the segmented viruses database.

Return type:

pandas.DataFrame

suvtk.virus_info.run_segment_info(tax_df, database, output_path)[source]

Process the taxonomy file to get segmented virus and genome type info.

Parameters:
  • tax_df (pandas.DataFrame) – Pandas DataFrame with taxonomy information.

  • database (str) – The suvtk database path (contains nodes.dmp, names.dmp, etc.).

  • output_path (str) – The output directory where results will be saved.

Return type:

None

Module contents