Joining variant allele frequency information with variant effect prediction

First we include boiler plate code to define the project, set environment variables and import libraries.

Next we define a Python string variable that we can reuse in multiple queries. These are GOR-pipe syntax definitions for the genotype freeze we use in the example.

The first query is a simple inspection of the AF table using the UNPIVOT command. In GOR context we must have chrom and pos columns and therefore we would have had to use unpivot only from column three and upwards. Hence, we use non-ordered context and use NOR to read the AF table and unpivot all the columns.

We will now select few rows from the beginning of chr7, showing only a handful of the columns listed above.

We can now join the bi-allelic variants in the AF table (#af#) with the VEP information stored in #vep#. To do this, we use VARJOIN instead of the more generic JOIN command. It joins equivalent representations of bi-allelic variants in the left-source with the data in the right-source (here #vep#). The option "-r" is used to eliminate redundant display of (pos,ref,alt) columns from the right-source, although it can be different for InDels when no normalization is used. However, in the freeze tables, all variants have been left-normalized.

It is important to emphasize that #vep# table (often referred to as VEP-single) may have the same variant represented in multiple rows. This happens when the variant overlap with multiple genes. Indeed, the columns Max_impact and Max_consequence refer to "maximum" for the variant in the corresponding gene. We can see this easily if we join the genes stored in the #genes# table with the #vep# table. We us the JOIN command with the a "-segsnp" option:

We notice that first columns are from the left-source (#genes#) and that the Gene_symbol column (automatically rename Gene_symbolx to avoid naming conflict) has symbols different from BRCA2. We can see this more easily by using a join option that includes only the right-source, i.e. "-ir" that specifies "intersect from right":

We can do similarly for the join between #af# and #vep# that we showed earlier if we join the genes with a nested query as the right-source:

Finally, we can "collapse" the rows such that there is only a single row per variant, by using the list or set functionality of the GROUP command:

Notice, that we select a grouping genomic bin-size of 1bp but in addition, we use the "-gc" option to select grouping columns. Here, we select not only ref and alt but also the columns from the #af# table, since they should be constant per variant, given that there is only on row per variant in the #af# table. The "-sc" options specifies the string columns for which we apply the list-aggregator.