varona.dataframe

High-level routines for the Varona library.

All of the functions and classes in this module are imported into the top-level library namespace.

vcf_dataframe(vcf_path: Path, vcf_extractor: Callable[[VariantRecord], dict], schema: dict[str, Any] | Schema | None = None) -> DataFrame

Build a DataFrame from the records in a VCF file, using a caller-supplied extractor function.

import pathlib
import pysam
from varona import vcf_dataframe

def example_extractor(record: pysam.VariantRecord) -> dict:
    return {
        "contig": record.contig,
        "pos": record.pos,
        "ref": record.ref,
        "alt": record.alts[0]
    }

# Make a DataFrame from the VCF file. The columns are laid out by
# the keys of the dict that the extractor function returns.

vcf_path = pathlib.Path("/path/to/file.vcf")
df = vcf_dataframe(vcf_path, example_extractor)
print(df)
##shape: (5, 4)
##┌────────┬─────────┬─────┬─────┐
##│ contig ┆ pos     ┆ ref ┆ alt │
##│ ---    ┆ ---     ┆ --- ┆ --- │
##│ str    ┆ i64     ┆ str ┆ str │
##╞════════╪═════════╪═════╪═════╡
##│ 1      ┆ 1158631 ┆ A   ┆ G   │
##│ 1      ┆ 1246004 ┆ A   ┆ G   │
##│ 1      ┆ 1249187 ┆ G   ┆ A   │
##│ 1      ┆ 1261824 ┆ G   ┆ C   │
##│ 1      ┆ 1387667 ┆ C   ┆ G   │
##└────────┴─────────┴─────┴─────┘
Parameters:
  • vcf_path – The path to the VCF file.

  • vcf_extractor – The function applied to each record in the VCF; it returns a dict mapping column names to values.

  • schema – Optional schema for the DataFrame to help enforce column types.

Returns:

A DataFrame with the extracted data.
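
The example extractor above reads record.alts[0], which keeps only the first alternate allele and would fail on a record whose ALT column is empty (pysam is understood to report alts as None in that case). A minimal sketch of a more defensive extractor follows. StubRecord is a hypothetical stand-in so the snippet runs without pysam, and safe_extractor and the n_alts column are illustrative names, not part of Varona.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class StubRecord:
    """Hypothetical stand-in exposing the four VariantRecord attributes used below."""
    contig: str
    pos: int
    ref: str
    alts: Optional[Tuple[str, ...]]


def safe_extractor(record) -> dict:
    # Treat a missing ALT (alts is None) as an empty tuple.
    alts = record.alts or ()
    return {
        "contig": record.contig,
        "pos": record.pos,
        "ref": record.ref,
        # First alternate allele, or None when the record has no ALT.
        "alt": alts[0] if alts else None,
        # Keep the allele count so multi-allelic sites stay visible.
        "n_alts": len(alts),
    }


multi = StubRecord("1", 91859795, "TATGTGA", ("CATGTGA", "CATGTGG"))
no_alt = StubRecord("1", 100, "A", None)
print(safe_extractor(multi))
print(safe_extractor(no_alt))
```

A function like this could be passed as vcf_extractor in place of example_extractor, assuming the resulting column is allowed to contain null values.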

vep_api_dataframe(client: Client, loci_list: list[str], genome_assembly: Assembly, api_extractor: Callable[[dict], dict], schema: dict[str, Any] | None = None) -> DataFrame

Query the Ensembl VEP API and make a DataFrame using a provided extractor.

Like vcf_dataframe(), this function applies a custom extractor, here to each response dictionary returned by the Ensembl VEP API. The example below shows how to use it. The caller must supply an httpx.Client.

import httpx
from varona import vep_api_dataframe, ensembl

def example_extractor(response: dict) -> dict:
    return {
        "contig": response["seq_region_name"],
        "pos": response["start"],
        "type": response["variant_class"]
    }

loci_list = [
    "1 1158631 . A G . . .",
    "1 91859795 . TATGTGA CATGTGA,CATGTGG . . .",
]
with httpx.Client(
    limits=httpx.Limits(
        max_connections=5,
        max_keepalive_connections=5
    ),
    timeout=httpx.Timeout(300.0),
) as client:
    api_df = vep_api_dataframe(
        client,
        loci_list,
        ensembl.Assembly.GRCH37,
        example_extractor
    )
    print(api_df)
    ##shape: (2, 3)
    ##┌────────┬──────────┬──────────────┐
    ##│ contig ┆ pos      ┆ type         │
    ##│ ---    ┆ ---      ┆ ---          │
    ##│ str    ┆ i64      ┆ str          │
    ##╞════════╪══════════╪══════════════╡
    ##│ 1      ┆ 1158631  ┆ SNV          │
    ##│ 1      ┆ 91859795 ┆ substitution │
    ##└────────┴──────────┴──────────────┘
Parameters:
  • client – The HTTPX client to use for the API query.

  • loci_list – The list of loci to query, given as VCF-style strings as in the example above.

  • genome_assembly – The genome assembly used in the Ensembl VEP API.

  • api_extractor – The function applied to each VEP API response dictionary; it returns a dict mapping column names to values.

  • schema – Optional schema for the DataFrame to help enforce column types.

Returns:

A DataFrame with the data from the VEP API.
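
The extractor in the example above indexes the response with square brackets, so a response that lacks one of those keys would raise KeyError. The sketch below shows one way to tolerate optional keys with dict.get. It assumes that some responses may omit variant_class or most_severe_consequence. The tolerant_api_extractor name and the sample dictionary are illustrative only, and the sample is hand-written rather than real API output.

```python
def tolerant_api_extractor(response: dict) -> dict:
    return {
        # Required keys: fail loudly if the response lacks them.
        "contig": response["seq_region_name"],
        "pos": response["start"],
        # Optional keys: fall back to None instead of raising KeyError.
        "type": response.get("variant_class"),
        "consequence": response.get("most_severe_consequence"),
    }


# Hand-written sample for illustration, not real VEP output.
sample_response = {
    "seq_region_name": "1",
    "start": 1158631,
    "variant_class": "SNV",
}
row = tolerant_api_extractor(sample_response)
print(row)
```

Whether null values are acceptable in the resulting column depends on the schema passed to vep_api_dataframe(), so treat this as a starting point rather than a drop-in replacement.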