Here we show how to compare our Plink results with GWAS data stored in the Analysis Catalog (AC) structure. First, we run our standard boilerplate code.

Finngen in Analysis Catalog structure

In the Finngen AC, we have GWAS runs for over 2800 different phenotypes. Keep in mind that these are BIG data sets, with over 12 million rows of marker associations per phenotype and over 24 billion rows in total. We can count the phenotypes using the phenotype overview table:

In our first example, we will pick three cancer phenotypes that have the highest number of cases:

It is now possible to use first principles to access the GWAS results for each phenotype. However, as noted above, these are BIG data sets, so such queries must be written with care.

For example, to list the top five p-values in the BRCA2 gene for each of these three phenotypes, we can do the following:

In the query above, we apply the RANK command on the start position of the gene. This is more memory efficient than ranking over the entire chromosome, and it works because all variants overlapping the gene have the same values in the first two columns. However, we then change the position column by selecting the variant position, so we must apply a window-based sort to preserve proper GOR ordering in case of overlapping genes. This is not strictly needed in this example, because we select only a single gene, but it is an important pattern nevertheless.
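The rank-then-re-sort pattern described above can be illustrated outside of GOR. The following is a minimal Python sketch, not the GOR query itself: the function name `top_n_per_group`, the phenotype labels, and all positions and p-values are hypothetical and made up for illustration. It keeps the `n` smallest p-values per phenotype and then sorts the surviving rows by position again, mirroring the sort step that restores genomic order.

```python
import heapq
from collections import defaultdict

# Hypothetical rows: (phenotype, variant position, p-value). All values are made up.
rows = [
    ("PHENO_A", 100, 1e-4),
    ("PHENO_A", 250, 1e-9),
    ("PHENO_A", 400, 1e-6),
    ("PHENO_B", 120, 1e-3),
    ("PHENO_B", 300, 1e-8),
    ("PHENO_B", 450, 1e-2),
]

def top_n_per_group(rows, n):
    """Keep the n smallest p-values per phenotype, then restore position order."""
    groups = defaultdict(list)
    for pheno, pos, pval in rows:
        groups[pheno].append((pheno, pos, pval))

    kept = []
    for hits in groups.values():
        # Rank within the group by p-value and keep only the best n.
        kept.extend(heapq.nsmallest(n, hits, key=lambda h: h[2]))

    # Ranking shuffles the rows, so sort by position (then phenotype) again.
    return sorted(kept, key=lambda h: (h[1], h[0]))

top_hits = top_n_per_group(rows, 2)
for hit in top_hits:
    print(hit)
```

Ranking within a small group and sorting only the surviving rows keeps the working set small, which is the same motivation as ranking over a gene rather than over a whole chromosome.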

Using AC table functions

If we want to run this type of analysis genome wide, for a large number of phenotypes, for a gene list, or for a marker list, it is important to leverage parallel execution. For the AC structures, the easiest way to do that is to use the pre-built YML report builders, which can be used and queried as parameterized table functions. Importantly, behind the scenes they are set up to run optimized queries in parallel, with the execution plan depending on the input parameters at hand. Of course, we must be careful not to select "crazy" combinations of input parameters, e.g. all phenotypes and a p-value threshold of 1.0, since that may duplicate the entire AC dataset in temporary files. Here is how we can repeat the above query using the YML table function.

We see annotated variant rows from BRCA2 and overlapping genes for the three phenotypes of interest. Now we can add ranking logic similar to what we used before to find the top three hits per phenotype.

Genome wide comparison

In our final example, we will show how we can define regions of interest based on a GWAS run and then use the AC table function to locate hits from the Finngen phenotypes.

The above definitions locate all moderately significant association hits outside of chr6 and find 10kb regions surrounding them. The SEGPROJ command is used to project overlapping segments together. We can now inspect these regions and the number of variants that overlap:
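To make the region-building logic concrete, here is a minimal Python sketch of the same idea: filter hits by p-value, skip chr6, pad each hit, and merge overlapping segments. It is an illustration under stated assumptions, not a reimplementation of SEGPROJ. The function name `hit_regions`, the threshold, the padding of 5,000 bases on each side (one possible reading of "10kb regions"), and the sample hits are all hypothetical.

```python
from collections import defaultdict

def hit_regions(hits, p_max, pad, skip_chrom="chr6"):
    """Build merged regions around hits given as (chrom, pos, pval) tuples."""
    by_chrom = defaultdict(list)
    for chrom, pos, pval in hits:
        if chrom != skip_chrom and pval <= p_max:
            # Pad each hit on both sides, clamping at the chromosome start.
            by_chrom[chrom].append((max(0, pos - pad), pos + pad))

    merged = []
    # Note: plain string sort of chromosome names, not genomic order.
    for chrom in sorted(by_chrom):
        cur_start, cur_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if cur_end is not None and start <= cur_end:
                # Overlaps (or touches) the current segment: extend it.
                cur_end = max(cur_end, end)
            else:
                if cur_end is not None:
                    merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged

# Hypothetical hits; all values are made up.
hits = [
    ("chr1", 50_000, 1e-6),
    ("chr1", 55_000, 1e-7),
    ("chr1", 200_000, 1e-6),
    ("chr6", 100_000, 1e-9),  # dropped: excluded chromosome
    ("chr2", 10_000, 0.5),    # dropped: not significant
]
regions = hit_regions(hits, p_max=1e-5, pad=5_000)
print(regions)
```

Merging the padded segments first means each genomic range is fetched only once downstream, even when several hits sit close together.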

To fetch significant Finngen hits from these regions we can do the following:

Currently, the AC table function does not support a phenotype relation as a parameter; however, we can use the approach from above.

Finally, we can find which hits are closest to our variants of interest:
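As an illustration of the closest-hit lookup, the following minimal Python sketch finds the nearest hit to a variant on a single chromosome using binary search over sorted positions. It is a sketch of the concept only, not the GOR query used in this notebook; the function name `closest_hit` and all positions are hypothetical.

```python
import bisect

def closest_hit(variant_pos, hit_positions):
    """Return (hit position, distance) for the nearest hit, or None if there are no hits.

    hit_positions must be sorted ascending and come from the same chromosome.
    """
    if not hit_positions:
        return None
    i = bisect.bisect_left(hit_positions, variant_pos)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = hit_positions[max(0, i - 1): i + 1]
    best = min(candidates, key=lambda h: abs(h - variant_pos))
    return best, abs(best - variant_pos)

# Hypothetical sorted hit positions on one chromosome.
finngen_hits = [100, 500, 900]
print(closest_hit(520, finngen_hits))
```

Because the positions are already sorted, each lookup inspects at most two neighbours instead of scanning every hit. On a tie, this sketch returns the hit with the smaller position.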