In [2]:
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1

project = "matthew_sampson_hg38"
%env GOR_API_PROJECT={project}

Gnomad v3.1.2¶

Here we will show how we can take very large VCF files from the Genome Aggregation Database and convert them to GOR tables. We will use this to explore some strategic options for computational efficiency and explain how we can leverage parallelism to both process and store the results in S3-based GOR tables (dictionaries). The focus here is on version 3.1.2, but we will also show how we can combine it with v2.1.1 to produce a more reliable AF table for filtering out "common" variants in rare-disease analysis.

The input VCF files are massive: 2.3Tbytes with over 759M rows and close to 1000 attributes in the INFO column. We will convert this into GORZ files with an equal number of rows but with close to 1000 columns, i.e. each attribute stored in a dedicated column. Storing the attributes as columns in GORZ reduces the storage footprint to 942Gbytes, without losing any information, and at the same time makes it much faster to access the data.

Create a dictionary for the VCF files¶

First we create a dictionary to make it easy to refer to all the VCF files through a single "table". Because we are lazy, we use the #genes# table to list all the chromosomes, with the GROUP command. This also gives us the start and stop positions for the chromosomes, but we must convert the start coordinate because dictionaries are one-based (like the -p option in GOR) whereas the genome segments in GOR follow the zero-based convention à la UCSC. We don't want life to be too simple! Also, we specifically filter out chrM because we don't have a VCF file for the mitochondrial genome.
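For reference, here is a small Python sketch (plain Python, not GOR, and only an illustration under the assumption that the dictionary rows follow the layout the query below selects: source file, alias, start chromosome/position, stop chromosome/position) of the one-based shift and the row format we are after:

# Illustration of the 0-based to 1-based shift and of the row layout the
# dictionary query below writes: file, ID, start-chrom, start-pos, stop-chrom, stop-pos
prefix = "s3://gsci-reference-data-sources/extdata/gnomad/3.1.2/genomes"

def gord_row(row_id, chrom, bp_start_0based, bp_stop):
    bp_start = bp_start_0based + 1   # dictionaries are 1-based, like gor -p
    vcf_file = f"{prefix}/gnomad.genomes.v3.1.2.sites.{chrom}.vcf.gz"
    return "\t".join(map(str, (vcf_file, row_id, chrom, bp_start, chrom, bp_stop)))

print(gord_row(1, "chr7", 0, 159345973))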

In [ ]:
%%gor
nor <(gor #genes# | group chrom | where chrom != 'chrM') 
      | replace bpStart bpStart+1 | select 1-3 | calc range #1+':'+bpstart+'-'+bpstop 
      | calc inputfile 's3://gsci-reference-data-sources/extdata/gnomad/3.1.2/genomes/gnomad.genomes.v3.1.2.sites.'+chrom+'.vcf.gz' 
      | rownum
| select inputfile,rownum,chrom,bpstart,chrom,bpstop
| rename #1 file
| rename #2 ID
| write user_data/hakon/gnomad/vcf_input.gord

Now we can sample the VCF files by reading 10k rows at 500 different locations. We split the INFO field with SPLIT into attribute-value pairs and then use COLSPLIT to get the attribute name. We then use GROUP to get a distinct list of them:

In [5]:
%%gor dfAttributes <<
create x = pgor  -split 500 user_data/hakon/gnomad/vcf_input.gord  | top 10000
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,filter,x_1,x_2
| group chrom -gc x_1
;

nor [x] | group -gc x_1 | rename #1 attr
Query ran in 2.61 sec
Query fetched 908 rows in 1.01 sec (total time 3.63 sec)
                                             
In [7]:
dfAttributes.shape
Out[7]:
(908, 1)
In [13]:
dfAttributes.sort_values(by='attr', ascending=True).head()
Out[13]:
attr
0 AC
1 AC_XX
2 AC_XY
3 AC_afr
4 AC_afr_XX

We see that we have 908 attributes in the INFO field. Our plan is to convert these AV pairs into columns, to make them more easily accessible in tables, with a smaller data footprint and much faster parsing. Note that in order to generate this attribute list, we sampled the VCF files. Another way is to carve the attribute list out of the VCF header. With a minor manual text edit to turn the header into a single-column TSV (x.tsv), we then ran the query shown in the cell below.
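As an aside, the manual text edit can also be scripted. Here is a minimal Python sketch of that step, assuming a hypothetical local copy of one of the VCF files (the INFO declarations are the same in every chromosome file):

import gzip

vcf_path = "gnomad.genomes.v3.1.2.sites.chr21.vcf.gz"   # hypothetical local copy

# Pull the ##INFO=<...> declarations out of the VCF header and write them as the
# single-column TSV (x.tsv) that the query in the next cell parses with TAG
with gzip.open(vcf_path, "rt") as vcf, open("x.tsv", "w") as out:
    out.write("x\n")                        # single column named x
    for line in vcf:
        if not line.startswith("##"):       # meta-header ends at the #CHROM line
            break
        if line.startswith("##INFO=<"):
            out.write(line.lstrip("#").rstrip("\n") + "\n")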

In [46]:
av_info_query = """nor x.tsv 
| replace x replace(replace(x,'INFO=<',''),'>','')
| calc Attribute TAG(x,'ID',',')
| calc Number TAG(x,'Number',',')
| calc Type TAG(x,'Type',',')
| calc Description dunquote(TAG(x,'Description',','))
| hide x
| write user_data/hakon/gnomad3/attribute_descr.tsv """
In [47]:
%%gor
nor user_data/hakon/gnomad3/attribute_descr.tsv | top 5
Query ran in 0.56 sec
Query fetched 5 rows in 0.00 sec (total time 0.56 sec)
                                             
Out[47]:
Attribute Number Type Description
0 AC A Integer Alternate allele count
1 AN 1 Integer Total number of alleles
2 AF A Float Alternate allele frequency
3 popmax A String Population with maximum allele frequency
4 faf95_popmax A Float Filtering allele frequency (using Poisson 95% ...

Generate multiple CALC commands¶

To make it easier to visualize and execute, we first generate only five attribute columns.

In [19]:
calcCommands = ''
adf = dfAttributes
for i in range(0,len(adf.sort_values(by='attr', ascending=True).head())):
    calcCommands += f"""| calc {adf.at[i,'attr']} replace(tag(info,'{adf.at[i,'attr']}',';'),'NOT_FOUND','.')\n"""
    
print(calcCommands)
| calc AC replace(tag(info,'AC',';'),'NOT_FOUND','.')
| calc AC_XX replace(tag(info,'AC_XX',';'),'NOT_FOUND','.')
| calc AC_XY replace(tag(info,'AC_XY',';'),'NOT_FOUND','.')
| calc AC_afr replace(tag(info,'AC_afr',';'),'NOT_FOUND','.')
| calc AC_afr_XX replace(tag(info,'AC_afr_XX',';'),'NOT_FOUND','.')

Note that we use the REPLACE function to change the default behaviour of the TAG function, i.e. to return a dot when the attribute is not found. We can now form a query that picks up the first few attributes like this:

In [21]:
%%gor
gor -p chr7 user_data/hakon/gnomad/vcf_input.gord | top 10
$calcCommands
| select chrom,pos,AC-
Query ran in 0.71 sec
Query fetched 10 rows in 0.10 sec (total time 0.80 sec)
                                             
Out[21]:
CHROM POS AC AC_XX AC_XY AC_afr AC_afr_XX
0 chr7 10020 0 0 0 0 0
1 chr7 10020 0 0 0 0 0
2 chr7 10021 0 0 0 0 0
3 chr7 10021 0 0 0 0 0
4 chr7 10022 0 0 0 0 0
5 chr7 10026 0 0 0 0 0
6 chr7 10028 0 0 0 0 0
7 chr7 10030 0 0 0 0 0
8 chr7 10050 0 0 0 0 0
9 chr7 10059 0 0 0 0 0

Generate single multi CALC command¶

It turns out that when we have 908 consecutive CALC commands we introduce overhead which is not eliminated by the GOR compiler (as it stands today). First, when few rows have been read, each row introduces overhead in type inference. Second, with a high number of rows, the overhead each CALC command introduces in order to add its column to the row starts to be noticeable. Therefore, in this extreme scenario with hundreds of new columns, we may be better off using the multi-calc functionality in CALC. Here we will actually use the GORpipe syntax to generate the query stubs. We use the query helper and add our create statement into it.

In [2]:
qh = GQH.GOR_Query_Helper()
mydefs = qh.add("""
create x = pgor  -split 500 user_data/hakon/gnomad/vcf_input.gord  | top 10000
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,filter,x_1,x_2
| group chrom -gc x_1;
""")

We use GROUP twice, first to get a distinct set of attributes and then to make a list out of them.

In [3]:
%%gor dfQuery <<
$mydefs
nor [x] | group -gc x_1
| rename #1 col
| calc cmd "replace(tag(info,'"+#1+"',';'),'NOT_FOUND','.')"
| top 5
| group -lis -sc col,cmd -len 1000000
Query ran in 1.50 sec
Query fetched 1 rows in 0.93 sec (total time 2.43 sec)
                                             

Now we form the multi CALC command:

In [29]:
calcCommand = '| calc '+dfQuery.at[0,'lis_col']+' '+dfQuery.at[0,'lis_cmd']
print(calcCommand)
| calc AC,AC_XX,AC_XY,AC_afr,AC_afr_XX replace(tag(info,'AC',';'),'NOT_FOUND','.'),replace(tag(info,'AC_XX',';'),'NOT_FOUND','.'),replace(tag(info,'AC_XY',';'),'NOT_FOUND','.'),replace(tag(info,'AC_afr',';'),'NOT_FOUND','.'),replace(tag(info,'AC_afr_XX',';'),'NOT_FOUND','.')
In [30]:
%%gor
gor -p chr7 user_data/hakon/gnomad/vcf_input.gord | top 10
$calcCommand
| select chrom,pos,AC-
Query ran in 1.07 sec
Query fetched 10 rows in 0.10 sec (total time 1.16 sec)
                                             
Out[30]:
CHROM POS AC AC_XX AC_XY AC_afr AC_afr_XX
0 chr7 10020 0 0 0 0 0
1 chr7 10020 0 0 0 0 0
2 chr7 10021 0 0 0 0 0
3 chr7 10021 0 0 0 0 0
4 chr7 10022 0 0 0 0 0
5 chr7 10026 0 0 0 0 0
6 chr7 10028 0 0 0 0 0
7 chr7 10030 0 0 0 0 0
8 chr7 10050 0 0 0 0 0
9 chr7 10059 0 0 0 0 0

PIVOT approach¶

Before we compare the performance of the two approaches shown above, we will present yet another way to extract AV data into columns from a VCF file, without introducing a special-purpose command for the task. Such a command would be the fastest option but is, in our opinion, not necessary, given that this is a one-time conversion exercise.
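As a mental model of what the PIVOT approach does, here is a toy pandas sketch (plain Python, not GOR) that turns the attribute-value pairs of a single made-up INFO string into columns; the attribute values here are invented for illustration:

import pandas as pd

# One made-up VCF row; SPLIT turns the INFO field into one row per AV pair
row = {"chrom": "chr7", "pos": 10020, "ref": "A", "alt": "G",
       "info": "AC=0;AC_XX=0;AC_XY=0;AC_afr=0"}
long_df = pd.DataFrame(
    [(row["chrom"], row["pos"], row["ref"], row["alt"], av)
     for av in row["info"].split(";")],
    columns=["chrom", "pos", "ref", "alt", "x"])

# COLSPLIT separates the attribute name (x_1) from its value (x_2)
long_df[["x_1", "x_2"]] = long_df["x"].str.split("=", n=1, expand=True)

# PIVOT brings it back to one row per variant with attributes as columns,
# filling '.' for attributes that are absent from this particular row
wide_df = (long_df.pivot_table(index=["chrom", "pos", "ref", "alt"],
                               columns="x_1", values="x_2", aggfunc="first")
           .reindex(columns=["AC", "AC_XX", "AC_XY", "AC_afr", "AC_afr_XX"])
           .fillna(".")
           .reset_index())
print(wide_df)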

In [35]:
colList = dfQuery.at[0,'lis_col']
colList
Out[35]:
'AC,AC_XX,AC_XY,AC_afr,AC_afr_XX'
In [37]:
%%gor
gor -p chr7 user_data/hakon/gnomad/vcf_input.gord | top 10
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,filter,x_1,x_2
| pivot -gc ref,alt,filter -e . x_1 -v $colList
| rename (.*)_x_2 #{1}
| select chrom,pos,AC-
Query ran in 0.70 sec
Query fetched 10 rows in 0.11 sec (total time 0.82 sec)
                                             
Out[37]:
CHROM POS AC AC_XX AC_XY AC_afr AC_afr_XX
0 chr7 10020 0 0 0 0 0
1 chr7 10020 0 0 0 0 0
2 chr7 10021 0 0 0 0 0
3 chr7 10021 0 0 0 0 0
4 chr7 10022 0 0 0 0 0
5 chr7 10026 0 0 0 0 0
6 chr7 10028 0 0 0 0 0
7 chr7 10030 0 0 0 0 0
8 chr7 10050 0 0 0 0 0
9 chr7 10059 0 0 0 0 0

The above method uses SPLIT and COLSPLIT to generate multiple rows from a single row (908 in the full case). While this introduces some overhead in handling the AV pairs, it turns out to be smaller than the cost of calling the TAG function repeatedly to extract the attribute values from the INFO string in the VCF file. A specialized command for AV-to-columnar transformation could combine SPLIT, COLSPLIT, and PIVOT into a single command, without introducing the row model to represent the AV pairs in between. But for now let's just test the performance of these three methods for all 908 attributes and a larger number of rows. First, let's time the cost of reading the VCF file:

In [38]:
%%gor
gor -p chr7 user_data/hakon/gnomad/vcf_input.gord | top 50000 | group chrom -count | calc t str(time()/1000)+'sec'
Query ran in 1.42 sec
Query fetched 1 rows in 18.12 sec (total time 19.54 sec)
                                             
Out[38]:
CHROM bpStart bpStop allCount t
0 chr7 0 159345973 50000 18.952sec
In [6]:
calcCommands = ''
adf = dfAttributes
for i in range(0,len(adf.sort_values(by='attr', ascending=True))):
    calcCommands += f"""| calc {adf.at[i,'attr']} replace(tag(info,'{adf.at[i,'attr']}',';'),'NOT_FOUND','.')\n"""
In [7]:
%%gor dfQuery <<
$mydefs
nor [x] | group -gc x_1
| rename #1 col
| calc cmd "replace(tag(info,'"+#1+"',';'),'NOT_FOUND','.')"
| group -lis -sc col,cmd -len 1000000
Query ran in 2.59 sec
Query fetched 1 rows in 0.94 sec (total time 3.53 sec)
                                             
In [8]:
calcCommand = '| calc '+dfQuery.at[0,'lis_col']+' '+dfQuery.at[0,'lis_cmd']
colList = dfQuery.at[0,'lis_col']
In [47]:
%%gor
gor -p chr7 user_data/hakon/gnomad/vcf_input.gord | top 50000 
$calcCommands
| group chrom -count | calc t str(time()/1000)+'sec'
Query ran in 2.46 sec
Query fetched 1 rows in 521.22 sec (total time 523.68 sec)
                                             
Out[47]:
CHROM bpStart bpStop allCount t
0 chr7 0 159345973 50000 523.138sec
In [48]:
%%gor
gor -p chr7 user_data/hakon/gnomad/vcf_input.gord | top 50000 
$calcCommand
| group chrom -count | calc t str(time()/1000)+'sec'
Query ran in 1.51 sec
Query fetched 1 rows in 451.86 sec (total time 453.37 sec)
                                             
Out[48]:
CHROM bpStart bpStop allCount t
0 chr7 0 159345973 50000 452.741sec
In [50]:
%%gor
gor -p chr7 user_data/hakon/gnomad/vcf_input.gord | top 50000
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,filter,x_1,x_2
| pivot -gc ref,alt,filter -e . x_1 -v $colList
| rename (.*)_x_2 #{1}
| group chrom -count | calc t str(time()/1000)+'sec'
Query ran in 1.40 sec
Query fetched 1 rows in 131.03 sec (total time 132.42 sec)
                                             
Out[50]:
CHROM bpStart bpStop allCount t
0 chr7 0 159345973 50000 131.874sec

What we see from these experiments is that because the cost is dominated by the repeated use of the TAG function, there is not a great difference between the regular CALC approach and the multi-CALC approach, i.e. 523 sec vs 452 sec. Both are much longer than the 19 sec it takes to simply stream in the VCF. The PIVOT approach, however, takes only 132 sec, i.e. roughly 7x overhead on top of reading the data.
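To put those numbers in context, here are the overhead factors relative to plain streaming, computed from the timings quoted above:

# Overhead relative to simply streaming the 50k rows from the VCF (times from above)
read_only = 19.0                                  # seconds, plain streaming
for name, secs in [("CALC", 523.0), ("multi-CALC", 452.0), ("PIVOT", 132.0)]:
    print(f"{name:>10}: {secs / read_only:4.1f}x the streaming time")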

Genome wide processing¶

The Gnomad data is close to 759M rows. Therefore converting the entire Gnomad 3.1.2 data set to tabular GOR files will take about 15 thousand times longer than our 50k-row test, i.e. around 557 hours. With GORdb parallel processing this will be a walk in the park!
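The 557-hour figure is a simple extrapolation from the 50k-row PIVOT timing above:

# Extrapolating the PIVOT timing (132 sec for 50k rows) to the full data set
total_rows = 759_000_000
sample_rows = 50_000
pivot_sec = 132

scale = total_rows / sample_rows                  # ~15,180x more rows than the test
serial_hours = scale * pivot_sec / 3600
print(f"~{scale:,.0f}x the test size, ~{serial_hours:,.0f} hours if run serially")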

However, since we know that this query will run for more than a few minutes, we will execute it in the long-running queue, namely "lord". We therefore import the "old" query service SDK which supports the lord job types. First we run it for only 10 rows per worker, to make sure that everything is in order.

In [9]:
import nextcode
svc_query = nextcode.get_service("query")
In [10]:
lqh = GQH.GOR_Query_Helper()
replace_symbol = "#{1}"
mydefs = lqh.add("""def ##top## = | top 10;""")
mydefs = lqh.add(f"""
create #temp# = pgor -split <(gor #genes# | group chrom | where chrom != 'chrM' | segspan -maxseg 10000000) 
user_data/hakon/gnomad/vcf_input.gord ##top##
| split info -s ';'
| colsplit info 2 x -s '='
| select chrom,pos,ref,alt,filter,x_1,x_2
| pivot -gc ref,alt,filter -e . x_1 -v {colList}
| rename (.*)_x_2 {replace_symbol}
""")
In [9]:
job = svc_query.execute(f"{mydefs} gor -p chr1 [#temp#] | top 3",job_type="lord") #,nowait="true"

job.dataframe()
Out[9]:
CHROM POS REF ALT FILTER AC AC_XX AC_XY AC_afr AC_afr_XX ... popmax primate_ai_score revel_score segdup splice_ai_consequence splice_ai_max_ds transmitted_singleton variant_type vep was_mixed
0 chr1 10031 T C AC0;AS_VQSR 0 0 0 0 0 ... . . . NaN . . . multi-snv C|upstream_gene_variant|MODIFIER|DDX11L1|ENSG0... .
1 chr1 10037 T C AS_VQSR 2 1 1 0 0 ... eas . . NaN . . . snv C|upstream_gene_variant|MODIFIER|DDX11L1|ENSG0... .
2 chr1 10043 T C AS_VQSR 1 1 0 1 1 ... afr . . NaN . . . snv C|upstream_gene_variant|MODIFIER|DDX11L1|ENSG0... .

3 rows × 913 columns

Now we modify the ##top## definition such that it passes all rows through. Based on our estimates above and the roughly 440 partitions we provide via the -split option using SEGSPAN with 10Mb maximum segments, the following query will take between one and two hours, depending on the number of workers available.
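One way to read that estimate (our own back-of-the-envelope reading, not from the GOR documentation): with roughly 440 partitions, the 557 hours of serial work correspond to a bit over an hour per partition, so with enough workers the wall-clock time approaches the duration of the slowest partition:

# Rough per-partition estimate behind the one-to-two-hour expectation
serial_hours = 557          # estimated serial processing time from above
partitions = 440            # approx. number of -split partitions (segspan -maxseg 10Mb)

print(f"~{serial_hours / partitions:.1f} hours of work per partition on average")
# With about as many workers as partitions, the wall-clock time is close to the
# slowest partition; with fewer workers it stretches accordingly.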

In [20]:
mydefs = lqh.add("""def ##top## = | skip 0;""")
In [21]:
job = svc_query.execute(f"{mydefs} nor <(gor -p chr1 [#temp#] | top 1) | unpivot 1-",job_type="lord",nowait="true")
In [22]:
job
Out[22]:
<GorQuery 652492 (DONE)>
In [23]:
job.dataframe()
Out[23]:
Col_Name Col_Value
0 CHROM chr1
1 POS 10031
2 REF T
3 ALT C
4 FILTER AC0;AS_VQSR
... ... ...
908 splice_ai_max_ds .
909 transmitted_singleton .
910 variant_type multi-snv
911 vep C|upstream_gene_variant|MODIFIER|DDX11L1|ENSG0...
912 was_mixed .

913 rows × 2 columns

Persisting the results in S3¶

In the query above, which took close to two hours, we created a temporary table from our massive gnomad data, roughly 759M rows by close to 1000 columns. This table is typically stored in the Lustre/Posix GORdb cache and, given its size, it will most likely be garbage collected automatically unless it is used often. Now we want to write it into S3, a cheaper storage that we can also link into multiple projects. Writing out all these rows and columns from the cache to S3 is also not trivial, and therefore we will use a parallel write into a GOR dictionary.

In [25]:
job_w = svc_query.execute(f"""{mydefs} pgor -split 100 [#temp#] 
                          | write s3data://project/user_data/hakon/gnomad/all.gord""",job_type="lord") #,nowait="true")

The above write took about 30 minutes. Notice that while the content of the dictionary is written to S3, the dictionary folder all.gord is here stored under user_data on Lustre/Posix. Thus, when we access it, we skip the s3data prefix in the URL source, for instance now when we write out a subset of the columns:

In [27]:
%%gor
pgor -split 100 user_data/hakon/gnomad/all.gord 
| select 1,2,REF,ALT,FILTER,AS_VQSLOD,AC,AC_afr,AC_afr_XX,AC_afr_XY,AC_ami,AC_ami_XX,AC_ami_XY,AC_amr,AC_amr_XX,AC_amr_XY,AC_asj,AC_asj_XX,AC_asj_XY,AC_eas,AC_eas_XX,AC_eas_XY,AC_fin,AC_fin_XX,AC_fin_XY,AC_mid,AC_mid_XX,AC_mid_XY,AC_nfe,AC_nfe_XX,AC_nfe_XY,AC_oth,AC_oth_XX,AC_oth_XY,AC_popmax,AC_raw,AC_sas,AC_sas_XX,AC_sas_XY,AC_XX,AC_XY,AF,AF_afr,AF_afr_XX,AF_afr_XY,AF_ami,AF_ami_XX,AF_ami_XY,AF_amr,AF_amr_XX,AF_amr_XY,AF_asj,AF_asj_XX,AF_asj_XY,AF_eas,AF_eas_XX,AF_eas_XY,AF_fin,AF_fin_XX,AF_fin_XY,AF_mid,AF_mid_XX,AF_mid_XY,AF_nfe,AF_nfe_XX,AF_nfe_XY,AF_oth,AF_oth_XX,AF_oth_XY,AF_popmax,AF_raw,AF_sas,AF_sas_XX,AF_sas_XY,AF_XX,AF_XY,AN,AN_afr,AN_afr_XX,AN_afr_XY,AN_ami,AN_ami_XX,AN_ami_XY,AN_amr,AN_amr_XX,AN_amr_XY,AN_asj,AN_asj_XX,AN_asj_XY,AN_eas,AN_eas_XX,AN_eas_XY,AN_fin,AN_fin_XX,AN_fin_XY,AN_mid,AN_mid_XX,AN_mid_XY,AN_nfe,AN_nfe_XX,AN_nfe_XY,AN_oth,AN_oth_XX,AN_oth_XY,AN_popmax,AN_raw,AN_sas,AN_sas_XX,AN_sas_XY,AN_XX,AN_XY,nhomalt,nhomalt_afr,nhomalt_afr_XX,nhomalt_afr_XY,nhomalt_ami,nhomalt_ami_XX,nhomalt_ami_XY,nhomalt_amr,nhomalt_amr_XX,nhomalt_amr_XY,nhomalt_asj,nhomalt_asj_XX,nhomalt_asj_XY,nhomalt_eas,nhomalt_eas_XX,nhomalt_eas_XY,nhomalt_fin,nhomalt_fin_XX,nhomalt_fin_XY,nhomalt_mid,nhomalt_mid_XX,nhomalt_mid_XY,nhomalt_nfe,nhomalt_nfe_XX,nhomalt_nfe_XY,nhomalt_oth,nhomalt_oth_XX,nhomalt_oth_XY,nhomalt_popmax,nhomalt_raw,nhomalt_sas,nhomalt_sas_XX,nhomalt_sas_XY,nhomalt_XX,nhomalt_XY,n_alt_alleles,popmax,faf95,faf95_popmax,faf99
| write s3data://project/user_data/hakon/gnomad/maincols.gord
Query ran in 3.12 sec.
Query fetched 0 rows in 780.38 sec (total time 783.50 sec)
                                             
Out[27]:
CHROM POS REF ALT FILTER AS_VQSLOD AC AC_afr AC_afr_XX AC_afr_XY ... nhomalt_sas nhomalt_sas_XX nhomalt_sas_XY nhomalt_XX nhomalt_XY n_alt_alleles popmax faf95 faf95_popmax faf99

0 rows × 151 columns

Writing out a subset of the columns in gnomad/all.gord finished in just a few minutes, because the GOR format is much more efficient than the VCF format for extracting a subset of columns/attributes. And reading maincols.gord is likewise significantly faster than all.gord. We can try to see how long it takes to count the number of variants.

In [30]:
%%gor 
create #temp# = pgor -split 100 user_data/hakon/gnomad/maincols.gord | group chrom -count | calc tsec int(time()/1000);
nor [#temp#] | group -sum -ic allcount,tsec -avg -min -max
Query ran in 0.85 sec
Query fetched 1 rows in 0.04 sec (total time 0.89 sec)
                                             
Out[30]:
min_allCount max_allCount avg_allCount sum_allCount min_tsec max_tsec avg_tsec sum_tsec
0 23880 10602398 6.602628e+06 759302267 0 64 42.104348 4842

We see that there are 759M variants and that with a split of 100 the average time for a worker to count its rows is 42 sec; here all the workers completed within about 2 minutes.

Combining Gnomad version 3.1.2 and version 2.1.1¶

Version 3.1.2 is based on WGS data from about 75k samples while version 2.1.1 is a combination of 125k WES samples and 15k WGS samples. Here we will generate a combined Gnomad table that includes the most important allele frequencies, including "popmax" frequencies and a conservative AF based on a 95% confidence interval. For the conservative estimate, we use the parameterized definition #conserv(). It is derived from the assumption of a Poisson-distributed allele count and the fact that the stdev is then the square root of the mean, i.e. ACobs = ACm + 2*sqrt(ACm), which leads us to a formula for our conservative estimate, ACm = ACobs - 2*sqrt(ACobs+1) + 2. It converges to a formula that is easier to understand, ACm = ACobs - 2*sqrt(ACobs), when AC is high.
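To make that concrete, here is a small Python check of the derivation (a sketch that mirrors the #conserv definition used in the query below; it is not part of the original query). Solving ACobs = ACm + 2*sqrt(ACm) for ACm gives ACm = ACobs - 2*sqrt(ACobs + 1) + 2, which is the expression inside #conserv, and for large AC it approaches ACobs - 2*sqrt(ACobs):

import math

def conservative_ac(ac_obs):
    # Solve ac_obs = ac_m + 2*sqrt(ac_m) for ac_m: with s = sqrt(ac_m),
    # s^2 + 2s - ac_obs = 0  =>  s = sqrt(ac_obs + 1) - 1
    # =>  ac_m = ac_obs - 2*sqrt(ac_obs + 1) + 2
    return max(0.0, ac_obs - 2.0 * math.sqrt(ac_obs + 1.0) + 2.0)

def conservative_af(ac_obs, an):
    # Mirrors #conserv($1,$2): conservative allele count divided by allele number
    return conservative_ac(ac_obs) / an if an > 0 else 0.0

for ac in (1, 10, 100, 10_000):
    approx = ac - 2.0 * math.sqrt(ac)       # the easier-to-understand high-AC form
    print(f"AC={ac:>6}: conservative AC={conservative_ac(ac):10.3f}  approx={approx:10.3f}")
print(f"conservative AF for AC=3, AN=152000: {conservative_af(3, 152000):.2e}")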

The query below first generates intermediate transformations in #v2# and #v3#, both of which pick the popmax frequency and population using SPLIT and ATMAX. For v2.1.1 we pick popmax for both the regular and the conservative AF, whereas in #v3# we leverage the fact that the regular popmax frequency is already provided, hence only a simple rename is needed. We then pick out the most important columns that we think are relevant for fast downstream filtering; keeping the column count low (compared with close to 1000 cols) is important when we have to join a table with close to 800M rows in downstream annotation analysis such as the one intended for this table.

In [ ]:
def ##gnomad_freq## = ref/variants/gnomad_freq.gorz;
def ##freq_max_file## = ##gnomad_freq##;

def #conserv($1,$2) = if($1 = '.' or $2 = '.','0',if(float($2)!=0,str(max(0.0,float(float($1) - 2*sqrt(float($1) + 1)+2)/float($2))),'0'));

create #v2# = pgor -split 100 ##gnomad_freq## 
| calc popmax_af gnomad_af_afr + ',' + gnomad_af_amr + ',' + gnomad_af_eas
      + ',' + gnomad_af_nfe + ',' + gnomad_af_sas
| calc popmax 'afr,amr,eas,nfe,sas'
| select 1-4,popmax,popmax_af
| split popmax,popmax_af
| atmax 1 -gc ref,alt popmax_af
| replace popmax_af gform(popmax_af,4,5)
| varjoin -r -norm <(gor ##gnomad_freq##
  | calc popmax_c_af #conserv(gnomad_ac_afr,gnomad_an_afr) + ',' + #conserv(gnomad_ac_amr,gnomad_an_amr) + ',' + #conserv(gnomad_ac_eas,gnomad_an_eas)
      + ',' + #conserv(gnomad_ac_nfe,gnomad_an_nfe)+ ',' + #conserv(gnomad_ac_sas,gnomad_an_sas)
  | calc popmax_c 'afr,amr,eas,nfe,sas'
  | select 1-4,popmax_c_af,popmax_c
  | split popmax_c,popmax_c_af
  | atmax 1 -gc ref,alt popmax_c_af
)
| varjoin -r -norm <(gor ##gnomad_freq## | select 1-4,gnomad_an,gnomad_genomes_af,gnomad_genomes_hom,gnomad_exomes_af,gnomad_exomes_hom,gnomad_genomes_qc,gnomad_exomes_qc)
| rename gnomad_(.*) #{1}
| replace  popmax_af,popmax_c_af,genomes_af,exomes_af if(isfloat(#rc),gform(float(#rc),4,4),if(#rc='.','0',#rc))
;


create #v3# = pgor -split 500  user_data/hakon/gnomad/maincols.gord 
| rename af_popmax popmax_af 
| calc popmax_c_af #conserv(ac_afr,an_afr) + ',' + #conserv(ac_amr,an_amr) + ',' + #conserv(ac_eas,an_eas)
      + ',' + #conserv(ac_nfe,an_nfe)+ ',' + #conserv(ac_sas,an_sas)
      
| calc popmax_c_af_XX #conserv(ac_afr_XX,an_afr_XX) + ',' + #conserv(ac_amr_XX,an_amr_XX) + ',' + #conserv(ac_eas_XX,an_eas_XX)
      + ',' + #conserv(ac_nfe_XX,an_nfe_XX)+ ',' + #conserv(ac_sas_XX,an_sas_XX)
      
| calc popmax_c_af_XY #conserv(ac_afr_XY,an_afr_XY) + ',' + #conserv(ac_amr_XY,an_amr_XY) + ',' + #conserv(ac_eas_XY,an_eas_XY)
      + ',' + #conserv(ac_nfe_XY,an_nfe_XY)+ ',' + #conserv(ac_sas_XY,an_sas_XY)
| calc popmax_c 'afr,amr,eas,nfe,sas'
| select 1-alt,an,popmax_af,popmax_c_af,popmax_c_af_XX,popmax_c_af_XY,popmax,popmax_c,AF_XX,AF_XY,FILTER,AS_VQSLOD,faf95_*
| split popmax_c,popmax_c_af,popmax_c_af_XX,popmax_c_af_XY
| atmax 1 -gc ref,alt popmax_c_af
| replace popmax_af,popmax_c_af,popmax_c_af_XX,popmax_c_af_XY,AF_XX,AF_XY if(isfloat(#rc),gform(float(#rc),4,4),if(#rc='.','0',#rc))

;

pgor -split 100 [#v2#] | select 1-4 | calc version 'v2' | merge <(gor [#v3#] | select 1-4 | calc version 'v3')
| group 1 -gc ref,alt -set -sc version | rename set_version versions
| varjoin -r -norm -l -e 'NaN' -rprefix v2 [#v2#] 
| varjoin -r -norm -l -e 'NaN' -rprefix v3 [#v3#]
| calc version_use if(versions='v2,v3',if(v2_exomes_qc='PASS' and v3_FILTER != 'PASS' or v2_an > v3_an,'v2','v3'),if(versions='v3','v3','v2'))
| calc ref_af,ref_c_af if(version_use = 'v2',v2_popmax_af,v3_popmax_af),if(version_use = 'v2',v2_popmax_c_af,v3_popmax_c_af)
| columnsort chrom,pos,ref,alt,versions,version_use,ref_*    
| write s3data://project/user_data/hakon/gnomad/v211v312popmaxconserv.gord
Alert: Notice the use of float($1) and float($2) in the #conserv function. This is important because the columns used for $1 and $2 may contain dots and are therefore treated as strings. Hence, in calculations, they have to be "forced" to be treated as floating point numbers, otherwise we will get a parsing error due to a string within a numeric calculation.

The transformations for #v2# and #v3# ran in less than 10 minutes but the write took slightly longer, i.e. close to 20 minutes. Because the source table for v3.1.2 is significantly larger than the table with v2.1.1, we used -split 500 for #v3#. Both of these creates will eventually be garbage collected automatically. In the final write we use only -split 100 because we don't want the end result, which will most likely be used very often, to be stored in too many partitions; a larger number of files per chromosome can slow down seeks, i.e. the large number of random-access lookups that may happen in joins.

Accessing the data¶

Before we end we will perform a quick analysis to count the number of variants in the combined table:

In [4]:
%%gor
create x = pgor -split 100 user_data/hakon/gnomad/v211v312popmaxconserv.gord | group chrom -gc versions -count;
nor [x] | group -sum -ic allcount -gc versions | granno -sum -ic sum_allcount
Query ran in 2.10 sec
Query fetched 3 rows in 65.24 sec (total time 67.34 sec)
                                             
Out[4]:
versions sum_allCount sum_sum_allCount
0 v2 21160577 780527236
1 v2,v3 253987685 780527236
2 v3 505378974 780527236

With 28 gorworkers on standby, this query ran in about a minute. We see that in total there are 780 million variants in Gnomad, most of them present in v3.1.2. Now let's do a quick lookup test against the original VCF files and compare its speed with the GOR files:

In [5]:
%%gor
gor -p chr7:100000000 user_data/hakon/gnomad/vcf_input.gord
| calc AF tag(info,'AF',';')
| calc AC tag(info,'AC',';')
| calc AN tag(info,'AN',';')
| hide info
| calc t time()
Query ran in 3.85 sec.
Query fetched 1 rows in 0.01 sec (total time 3.86 sec)
                                             
Out[5]:
CHROM POS ID REF ALT QUAL FILTER IDx AF AC AN t
0 chr7 100000000 . A G . PASS 20 0.000007 1 151970 2848

And now read the same information from the GOR files:

In [6]:
%%gor
gor -p chr7:100000000 user_data/hakon/gnomad/all.gord
| select 1-4,AF,AC,AN | calc t time()
Query ran in 2.01 sec
Query fetched 1 rows in 0.00 sec (total time 2.02 sec)
                                             
Out[6]:
CHROM POS REF ALT AF AC AN t
0 chr7 100000000 A G 0.000007 1 151970 1864

We see that the GOR lookup is almost twice as fast and the difference is greater when we access more attributes:

In [8]:
%%gor
nor <(gor -p chr7:100000000 user_data/hakon/gnomad/vcf_input.gord)
| select chrom,pos,ref,alt,id,info
| split info -s ';'
Query ran in 3.10 sec.
Query fetched 872 rows in 2.73 sec (total time 5.83 sec)
                                             
Out[8]:
CHROM POS REF ALT ID INFO
0 chr7 100000000 A G . AC=1
1 chr7 100000000 A G . AN=151970
2 chr7 100000000 A G . AF=6.58025e-06
3 chr7 100000000 A G . popmax=nfe
4 chr7 100000000 A G . faf95_popmax=0.00000
... ... ... ... ... ... ...
867 chr7 100000000 A G . dp_hist_all_n_larger=1
868 chr7 100000000 A G . ab_hist_alt_bin_freq=0|0|0|0|0|0|0|0|0|0|0|0|1...
869 chr7 100000000 A G . cadd_raw_score=0.192810
870 chr7 100000000 A G . cadd_phred=3.04200
871 chr7 100000000 A G . vep=G|intron_variant&non_coding_transcript_var...

872 rows × 6 columns

In [9]:
%%gor
nor <(gor -p chr7:100000000 user_data/hakon/gnomad/all.gord)
| unpivot 5- | calc t time()
Query ran in 0.49 sec
Query fetched 909 rows in 0.22 sec (total time 0.71 sec)
                                             
Out[9]:
CHROM POS REF ALT Col_Name Col_Value t
0 chr7 100000000 A G FILTER PASS 645
1 chr7 100000000 A G AC 1 645
2 chr7 100000000 A G AC_XX 0 645
3 chr7 100000000 A G AC_XY 1 645
4 chr7 100000000 A G AC_afr 0 645
... ... ... ... ... ... ... ...
904 chr7 100000000 A G splice_ai_max_ds . 646
905 chr7 100000000 A G transmitted_singleton . 646
906 chr7 100000000 A G variant_type snv 646
907 chr7 100000000 A G vep G|intron_variant&non_coding_transcript_variant... 646
908 chr7 100000000 A G was_mixed . 646

909 rows × 7 columns

We see that with GOR, we fetch all the attributes for a variant within a second, 5x faster than the VCF lookup.

In [ ]: