varona.extract

Module with functions for extracting data from lower-level structures.

This module contains functions that extract data from lower-level structures such as VCF records and VEP API responses (dictionaries). The top-level Varona functions in the varona.varona module call these functions, and you can substitute custom functions for them when calling those top-level functions.

default_vep_cli_json_extractor(response_item: dict) -> dict

An example function to extract VEP data from VEP CLI output.

When using the VEP CLI, the output is a file where each line is a JSON record. The format of the JSON record is similar to the API response, but not identical. With v112 of Ensembl VEP and the command below, the keys are mapped as listed in the table below the command:

vep \
    -i some.vcf \
    -o some.json.gz \
    --variant_class \
    --nearest symbol \
    --pick \
    --stats_text \
    --stats_file some_vep.txt \
    --compress_output bgzip \
    --json --assembly GRCh37 \
    --species homo_sapiens \
    --cache --cache_version 112 \
    --dir_cache /data/vep_cache/112/GRCh37

extracted key    original response_item key
---------------  ------------------------------------------
contig           seq_region_name
pos              start
ref              allele_string (first allele, '/' sep)
alt (comma-sep)  allele_string (all other alleles, '/' sep)
type             variant_class
effect           most_severe_consequence
gene_name        nearest[0]
gene_id          transcript_consequences[0].gene_id
transcript_id    transcript_consequences[0].transcript_id

Parameters:

response_item – One JSON record from the VEP CLI output file, parsed into a dictionary. It plays the same role as one item in the list of dictionaries returned by the VEP API.

Returns:

A dictionary with the VEP data flattened into the extracted keys listed in the table above.
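The key mapping in the table above can be sketched in plain Python. This is a hypothetical re-implementation for illustration, not Varona's own code: the names sketch_cli_extractor and read_vep_json_lines and the sample record are made up. It also assumes that nearest and transcript_consequences may be missing and uses None in that case, which may differ from what the actual extractor does.

```python
import gzip
import json


def sketch_cli_extractor(response_item: dict) -> dict:
    """Hypothetical flattening of one VEP CLI JSON record (illustration only)."""
    # allele_string looks like "G/A/T": the first allele is ref, the rest are alts.
    ref, *alts = response_item["allele_string"].split("/")
    # Assumption: these lists may be missing or empty; fall back to None values.
    nearest = response_item.get("nearest") or [None]
    first = (response_item.get("transcript_consequences") or [{}])[0]
    return {
        "contig": response_item["seq_region_name"],
        "pos": response_item["start"],
        "ref": ref,
        "alt": ",".join(alts),
        "type": response_item.get("variant_class"),
        "effect": response_item.get("most_severe_consequence"),
        "gene_name": nearest[0],
        "gene_id": first.get("gene_id"),
        "transcript_id": first.get("transcript_id"),
    }


def read_vep_json_lines(path):
    """Yield one dictionary per non-empty line of a compressed JSON-lines file."""
    # bgzip output is a multi-member gzip stream, so gzip.open can read it.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Made-up record that only contains the keys named in the table.
sample = {
    "seq_region_name": "1",
    "start": 1000,
    "allele_string": "G/A/T",
    "variant_class": "SNV",
    "most_severe_consequence": "missense_variant",
    "nearest": ["GENE1"],
    "transcript_consequences": [
        {"gene_id": "GENE_ID_1", "transcript_id": "TRANSCRIPT_ID_1"}
    ],
}
row = sketch_cli_extractor(sample)
```

With a file produced by the command above, the rows could then be collected with something like [sketch_cli_extractor(item) for item in read_vep_json_lines("some.json.gz")].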

default_vep_response_extractor(response_item: dict) -> dict

An example function to extract VEP data from the API response.

Each response item from the API is a nested dictionary, and this function extracts the data we're interested in into a flat dictionary.

extracted key    original response_item key
---------------  ------------------------------------------
contig           seq_region_name
pos              start
ref              allele_string (first allele, '/' sep)
alt (comma-sep)  allele_string (all other alleles, '/' sep)
type             variant_class
effect           most_severe_consequence
gene_name        transcript_consequences[0].gene_symbol
gene_id          transcript_consequences[0].gene_id
transcript_id    transcript_consequences[0].transcript_id

Parameters:

response_item – The VEP API response is a list of dictionaries, and this is one of the items in the list.

Returns:

A flat dictionary with the VEP data under the extracted keys listed in the table above.
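As a rough illustration of the flattening described above, the sketch below applies the table's mapping to a made-up API response. The function name sketch_response_extractor and the sample data are hypothetical, and the None fallback for items without transcript_consequences is an assumption rather than documented behaviour of the real extractor.

```python
def sketch_response_extractor(response_item: dict) -> dict:
    """Hypothetical flattening of one VEP API response item (illustration only)."""
    # Split "ref/alt1/alt2" into the reference allele and the alternates.
    ref, *alts = response_item["allele_string"].split("/")
    # Assumption: transcript_consequences may be absent (e.g. intergenic variants).
    first = (response_item.get("transcript_consequences") or [{}])[0]
    return {
        "contig": response_item["seq_region_name"],
        "pos": response_item["start"],
        "ref": ref,
        "alt": ",".join(alts),
        "type": response_item.get("variant_class"),
        "effect": response_item.get("most_severe_consequence"),
        # Unlike the CLI mapping, gene_name comes from the first consequence.
        "gene_name": first.get("gene_symbol"),
        "gene_id": first.get("gene_id"),
        "transcript_id": first.get("transcript_id"),
    }


# Made-up response: a list of nested dictionaries, one per variant.
response = [
    {
        "seq_region_name": "1",
        "start": 1000,
        "allele_string": "G/A",
        "variant_class": "SNV",
        "most_severe_consequence": "missense_variant",
        "transcript_consequences": [
            {
                "gene_symbol": "GENE1",
                "gene_id": "GENE_ID_1",
                "transcript_id": "TRANSCRIPT_ID_1",
            }
        ],
    },
    {
        "seq_region_name": "2",
        "start": 2000,
        "allele_string": "C/T/G",
        "variant_class": "SNV",
        "most_severe_consequence": "intergenic_variant",
    },
]
rows = [sketch_response_extractor(item) for item in response]
```

Each element of rows is flat, so the list can be passed directly to a DataFrame constructor.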

platypus_vcf_record_extractor(record: VariantRecord, **addl_cols: Callable[[VariantRecord], int | float | str]) -> dict

Example function to extract data from a VCF record.

There is currently no higher-level function that calls this extractor the way the VEP API response extractor is called, but the general pattern is:

import pathlib

import polars as pl
import pysam

from varona import extract

vcf_path = pathlib.Path("/path/to/file.vcf")
data = []
# Each VCF record becomes one dictionary, i.e. one row of the DataFrame.
with pysam.VariantFile(vcf_path) as vcf:
    for record in vcf:
        extracted_data = extract.platypus_vcf_record_extractor(record)
        data.append(extracted_data)
df = pl.DataFrame(data)

The idea is that the "extractor" transforms the VCF record into a dictionary that in turn is used as a row in a DataFrame. This function is therefore just an example extractor, and the one that the Varona command-line tool uses. If other fields from the VCF record are needed, the goal is to make substituting this extractor with a different one as easy as possible.

Below are the members of the VCF record object that are extracted by this function, along with the key names they are assigned to in the returned dictionary.

extracted key      VCF record member
-----------------  ----------------------------------------
contig             contig
pos                pos
ref                ref
alt (comma-sep)    alts (list)
sequence_depth     info["TC"]
max_variant_reads  max(info["TR"])
variant_read_pct   max_variant_reads / sequence_depth * 100

Parameters:
  • record – A pysam.VariantRecord object.

  • addl_cols – Additional columns to extract from the record, given as a variable number of keyword arguments. Each argument name becomes a key in the returned dictionary, and each argument value is a function that takes a pysam.VariantRecord object and returns a scalar for that column rather than a dictionary, as varona.maf.maf_from_fr() does. This extra layer of callbacks accommodates three competing MAF calculations in a single function; see the varona.maf module for more information on those functions.
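
The addl_cols callback pattern can be sketched without pysam by using a stand-in record. Everything here is hypothetical and for illustration only: sketch_vcf_extractor is not Varona's implementation, the SimpleNamespace object merely imitates the few VariantRecord members listed in the table above, and hypothetical_maf is a made-up calculation, not one of the functions in varona.maf.

```python
from types import SimpleNamespace
from typing import Callable, Union

Scalar = Union[int, float, str]


def sketch_vcf_extractor(record, **addl_cols: Callable[..., Scalar]) -> dict:
    """Hypothetical extractor following the table above (illustration only)."""
    depth = record.info["TC"]
    max_reads = max(record.info["TR"])
    row = {
        "contig": record.contig,
        "pos": record.pos,
        "ref": record.ref,
        "alt": ",".join(record.alts),
        "sequence_depth": depth,
        "max_variant_reads": max_reads,
        "variant_read_pct": max_reads / depth * 100,
    }
    # Each keyword name becomes a column; its callback computes the value.
    for name, func in addl_cols.items():
        row[name] = func(record)
    return row


def hypothetical_maf(record) -> float:
    """Made-up scalar callback; NOT one of the varona.maf calculations."""
    return min(record.info["TR"]) / record.info["TC"]


# Stand-in for a pysam.VariantRecord with only the members used above.
record = SimpleNamespace(
    contig="1",
    pos=100,
    ref="G",
    alts=("A", "T"),
    info={"TC": 40, "TR": (10, 4)},
)
row = sketch_vcf_extractor(record, maf=hypothetical_maf)
```

Passing the callback as a keyword argument keeps the extractor itself unchanged when a different calculation is wanted: only the function bound to the maf keyword needs to be swapped.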