Demonstrate functionality to compare different GWAS runs

First we include boiler plate code to define the project, set environment variables and import libraries.

Next we define a Python string variable that we can reuse in multiple queries. These are GOR-pipe syntax definitions for the genotype freeze we use in the examples and paths to two GWAS runs: one Plink analysis and a corresponding Saige analysis of the same phenotype.

First we setup a query that fetches rows from both of the GWAS analysis. We do this by using the MERGE command within a nested query, hence the range specified with the "-p" option on the GOR command applies to both of the sources.

Join approach

GOR does not support outer-join that would ensure that a join between the two GWAS runs returns all variants from both runs. We can however achieve the same results as an outer-join by performing a left-join from the union of variants with each of the GWAS tables:

Pivot approach

An alternative to using join to present GWAS data from two or more runs, it is possible to use pivoting. The PIVOT command takes multiple rows from an input stream with a pivot column and presents the in a horizontal fashion. The benefit of this approach compared to the join approach above is that the data is only read once but also this approach is more easily extended to deal with multiple GWAS runs.

As an example, if we are only interested in variants where one of the runs has significant association, we can apply a filter with OR condition on the plink_P column and the saige_P column shown in previous output. When there are many GWAS runs, this can be done more elegantly by using the GRANNO command, where we aggregate over rows and annotate them for the same input stream. In our case, we find the minimum p-value per variant.

With minor modification to the above query, we can select variants from a subset of genes and add annotations to the results. The VEP consequence and AF annotations need no explanation. The NHGRI-EBI GWAS Catalog annotations are more involved. First, we use the RANK command to pick only the strongest disease associations per variant. Second, because the variants in the catalog may be based on different variants that the Plink and Saige runs, we do approximate/fuzzy join instead of a varjoin. The "-f 1000" option allow join with distance up to 1000bp and then we use ÀTMIN to pick only the join output representing the closest distance.