varona.ensembl
Module for querying Ensembl API.
This module also contains code to read the VCF file and prepare the data for querying the Ensembl API.
- API_LONG_RETRY = 60
Delay retrying an API call after a 429 response without Retry-After header.
Currently this is the only place in the code where such a delay is defined. In the future, more settings for the httpx client may be desirable.
- class Assembly(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: CiStrEnum
Enum for the assembly choice.
There are only two choices for now, GRCh37 and GRCh38.
- HTS_COMPRESSED_VALS = ('GZIP', 'BGZF')
Possible values for the compression field in a pysam.VariantFile.
- VCF_MASK_COLS = [2, 5, 6, 7]
Columns to mask in the VCF file.
Values in these columns are replaced with ‘.’ before querying the Ensembl API.
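The masking step can be sketched as follows. This is only an illustration: it assumes the indices in VCF_MASK_COLS are zero-based (which would make them ID, QUAL, FILTER and INFO), and `mask_row` is a hypothetical helper, not a function of this module.

```python
VCF_MASK_COLS = [2, 5, 6, 7]  # assumed zero-based: ID, QUAL, FILTER, INFO


def mask_row(row: list[str], mask_cols: list[int] = VCF_MASK_COLS) -> list[str]:
    """Return a copy of a VCF row with the masked columns set to '.'."""
    return ["." if i in mask_cols else value for i, value in enumerate(row)]


# Hypothetical VCF data row: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
row = ["1", "12345", "rs123", "A", "G", "50", "PASS", "DP=10"]
masked = mask_row(row)
query_string = " ".join(masked)
```

Under that assumption only the position and alleles survive, which is the information a region query needs.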
- import_vep_data(json_path: Path, json_extractor: Callable[[dict], dict] | None = None)
Imports VEP data from a JSON file where VEP was run locally.
At the record level, the format of the JSON file is the same as the API response, where each line is a record dictionary.
- Parameters:
json_path – Path to the JSON file.
json_extractor – Optional function to transform each record. Without it, records will probably contain more fields than needed, and those fields will not have the desired names. In addition, the default dict items are nested (sublists, subdicts), and this function is the preferred place to flatten them.
- query_vep_api(client: Client, chunk: list[str], assembly: Assembly = Assembly.GRCH37, retries=3, params: dict | None = {'pick': True, 'species': 'human', 'variant_class': True}, response_extractor: Callable[[dict], dict] | None = None)
Queries the Ensembl Variant Effect Predictor (VEP) API.
The API endpoint is POST vep/:species/region. The API has several limits on this endpoint, including a maximum of 200 variants per request, and may return a 429 status code if the rate limit is exceeded. This function can retry the request multiple times, using the Retry-After header if it’s present to delay the next request.
This function uses the synchronous client from the httpx library, which is not as powerful as the asynchronous client. A future version of this function may use the asynchronous client to improve performance.
- Parameters:
chunk – List of (up to 200) strings from the VCF file. See get_vcf_query_data() for the format of these strings.
assembly – Assembly to use for the API query, by default GRCh37.
retries – Number of times to retry the API call if it returns a 429 status code.
params – Additional parameters to pass to the API. By default, these are species=human, variant_class=true, and pick=true. pick=true simplifies the consequences reported for each variant and makes it easier to assign the variant a single annotation derived from those consequences.
response_extractor – Optional function to transform data from the API response. Without it, the response will probably contain more fields than needed, and those fields will not have the desired names. In addition, the default dict items are nested (sublists, subdicts), and this function is the preferred place to flatten them.
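The retry behaviour described above (honour Retry-After on a 429, otherwise fall back to API_LONG_RETRY) can be sketched without any network access. This is not the module's implementation: `retry_delay` and `call_with_retry` are hypothetical names, and `send` stands in for whatever callable performs the actual request.

```python
import time
from types import SimpleNamespace

API_LONG_RETRY = 60  # fallback delay in seconds, as documented above


def retry_delay(headers: dict, default: float = API_LONG_RETRY) -> float:
    """Seconds to wait after a 429: Retry-After if usable, else the default."""
    value = headers.get("Retry-After")
    try:
        return float(value) if value is not None else default
    except ValueError:
        return default  # Retry-After can also be an HTTP date; not handled here


def call_with_retry(send, retries: int = 3, sleep=time.sleep):
    """Call send() again after each 429, at most `retries` extra times."""
    response = send()
    for _ in range(retries):
        if response.status_code != 429:
            break
        sleep(retry_delay(response.headers))
        response = send()
    return response


# Demo with fake responses; delays are recorded instead of slept.
responses = iter([
    SimpleNamespace(status_code=429, headers={"Retry-After": "2"}),
    SimpleNamespace(status_code=429, headers={}),
    SimpleNamespace(status_code=200, headers={}),
])
delays = []
result = call_with_retry(lambda: next(responses), sleep=delays.append)
```

With a real httpx client, `send` could plausibly be a small closure around `client.post(...)`; the exact URL and payload used by this module are not shown here.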
- vcf_rows(vcf_path: Path) → Iterator[list[str]]
Yields rows from a VCF file.
This function opens a VCF file that’s either plain text or compressed. Normally the pysam.VariantFile class would be used to read VCF files, but in this case there’s no need to parse the VCF fully as VariantFile does; it’s just a matter of reading tab-separated lines of text.
This function could be moved to a more general module in the future, but currently it’s only needed by the Ensembl API querying code.
- Parameters:
vcf_path – Path to the VCF file.
- Returns:
Iterator of rows from the VCF file.
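A minimal sketch of reading tab-separated rows from a plain or compressed VCF with the standard library only. It assumes compression is detected from the gzip magic bytes (BGZF files are gzip-compatible) and that header lines starting with `#` are skipped; neither detail is confirmed for this module, and `read_vcf_rows` is a hypothetical name.

```python
import gzip
import tempfile
from pathlib import Path
from typing import Iterator


def read_vcf_rows(vcf_path: Path) -> Iterator[list[str]]:
    """Yield data rows of a VCF file as lists of column strings."""
    with open(vcf_path, "rb") as fh:
        is_gzip = fh.read(2) == b"\x1f\x8b"  # gzip/BGZF magic number
    opener = gzip.open if is_gzip else open
    with opener(vcf_path, "rt") as fh:
        for line in fh:
            # Assumption: blank lines and header lines are not wanted.
            if not line.strip() or line.startswith("#"):
                continue
            yield line.rstrip("\n").split("\t")


# Demo on throwaway files with made-up content
with tempfile.TemporaryDirectory() as tmp:
    text = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n1\t12345\trs1\tA\tG\n"
    plain = Path(tmp) / "demo.vcf"
    plain.write_text(text)
    packed = Path(tmp) / "demo.vcf.gz"
    with gzip.open(packed, "wt") as out:
        out.write(text)
    plain_rows = list(read_vcf_rows(plain))
    packed_rows = list(read_vcf_rows(packed))
```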
- vcf_to_vep_query_data(vcf_path: Path, chunk_size=200) → Iterator[list[str]]
Gets lists of data in chunks from the VCF file for querying the Ensembl API.
This is tailored to provide input to the POST vep/:species/region endpoint.
- Parameters:
vcf_path – Path to the VCF file.
chunk_size – Maximum number of variants to include in each chunk.
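The chunking itself can be illustrated with a generic helper. This sketch only shows how an iterable of query strings might be split into lists of at most chunk_size items; `chunked` is a hypothetical name and the variant strings are made up.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(items: Iterable[str], chunk_size: int = 200) -> Iterator[list[str]]:
    """Yield successive lists of at most chunk_size items."""
    iterator = iter(items)
    # islice pulls the next chunk_size items; an empty list ends the loop.
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


# 450 made-up variant strings split under the 200-per-request limit
variants = [f"1 {pos} . A G . . ." for pos in range(1, 451)]
sizes = [len(chunk) for chunk in chunked(variants)]
```

Because the helper consumes its input lazily, a large VCF does not need to be held in memory to be split this way.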