Demonstrate the basic commands to extract variants from a genotype freeze

First we include boiler plate code to define the project, set environment variables and import libraries.

Next we define a Python string variable that we can reuse in multiple queries. These are GOR-pipe syntax definitions for the genotype freeze we use in the example.

First we define which samples (PNs) and which variants we want to export the genotypes from. We use CREATE statements to create what we refer to as virtual relations or temporary tables. We pick the first 10 samples in the freeze bucket relation/table/file and 100 variants from the BRCA2 gene. The bucket relation has two columns (PN,bucket). The row number and the bucket value are used to determine the position of the sample within the value column in the genotype freeze rows. We use filtering options to extract only rows from buckets that contain genotypes from our samples of interest that are stored in #pns#. Because the value column in the horizontal bucketized genotype table stores genotypes from multiple PNs, we first trim it down by replacing it with its length.

We will move the create statement we setup into a Python string because we will be using them in multiple samples below.

Horizontal GOR-style output

First we show how we can extract a table that represents the variants, for only the samples of interest, in a horizontal format. Both as fixed value-size format (1 char per genotype) and then we use the "fixed size value map" function to transform values into a comma separated list with VCF-style genotypes. The CSVSEL command takes a stream of variant value rows and uses the bucket relation and samples to generate a single value row per variant (because we use -gc #3,#4).

Multi-columnar VCF-style output

The CSVSEL command has a "-vcf" option to format the genotypes according to the VCF specification.

Notice that we use a nested query for the PN selection, to prevent that we get too many columns in the output. If we end the query with a `WRITE` command, we can easily write the output to a file stored in user_data, even though it contains hundred thousand rows. Keep in mind however, that displaying such columns is difficult and the SequenceMiner will not handle them well if there are more than few hundred columns.

Row-level output

By using the "-tag PN" option on the CSVSEL command, rather than getting a single row with a value column representing all the samples, we get one row per sample for each variant. Optionally, the "-hide" option can be used to suppress unwanted values, such as homozygous reference.

Transposed output

In some cases it may be desired to get the genotypes in a format where all the variants per sample are stored in a single row. This can be achieved with the GTTRANSPOSEcommand. Notice here that we the nested GOR expression as an input to the NOR command in order to apply SELECT to eliminate the redundant (chrom,pos) command that come from the transpose GOR command. As an "additiona stunt", we filter the samples such that we only see samples with two or more genotype columns with non-zero value.

Parallel execution

As a final example, we show an example where we extract genotypes for all carriers of high-impact variants in the top 10 genes with the highest number of high impact variants.

For example purpose, we use an advanced "-split" option in PGOR where we provide it segments to split up the genome for parallel execution. We defined the segments based on the density of the variants that we want to export, i.e. the variants in #selvars#, and the fact that we are extraction the genotypes for all the samples in the freeze (possibly >> 10k). With the SEGHIST command, we can ensure that each parallel task extracts no more than #numVarsPerTask# = 1000 in our example. One can inspect the #segs# table to see the actual number. Keep in mind that we don't want to have the task too few but also not too many, e.g. much over 500 partition tasks per query. We will fetch the data into a the Python variable pythongenotypes.

Since most sample are carriers of few variants, we can collapse the information per sample. We can use the federated virtual relation support in the GORdb query service to do the job using NOR: