In [19]:
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1

project = "bch_connect_hg38"
%env GOR_API_PROJECT={project}

HPO PheWAS based on a Phecode-like catalog

In this notebook, we use the phenotypes defined in the HPO_Phecode notebook in a gene-based PheWAS.

  • We use variants and sequence read coverage to perform joint genotyping and output horizontally organized genotypes.
  • Then we aggregate the genotype counts across all samples to enable very efficient case-control analysis.
  • Finally, we use the PARALLEL command to calculate a Fisher exact test for every variant and each phenotype.

Joint genotyping

We use the non-probabilistic GTGEN command to perform the genotyping. As input, we need row-level (sparse) variants and a measure of sufficient read coverage, here set to at least 6 reads. We use the goodcov table instead of the more generic segcov table because of its much smaller data footprint. Given that we have fewer than 25k samples, we do not use PARTGOR parallelism and simply store the genotypes in a single horizontal bucket.
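
To make this concrete, here is a minimal Python sketch of the per-sample decision that joint genotyping boils down to (an illustration of the idea, not the GTGEN implementation): a called variant keeps its copy number, sufficient coverage without a call means homozygous reference, and anything else is unknown.

from typing import Optional

MIN_COVERAGE_READS = 6   # the goodcov_6 threshold used below

def infer_genotype(called_gt: Optional[str], has_good_coverage: bool) -> str:
    """Return '0', '1' or '2' (copies of the alternate allele) or '3' (unknown)."""
    if called_gt is not None:         # a sparse variant row exists for this sample
        return called_gt              # may already be '3' if the call failed QC
    if has_good_coverage:             # covered by >= MIN_COVERAGE_READS reads, no call
        return '0'                    # confident homozygous reference
    return '3'                        # neither a call nor sufficient coverage

assert infer_genotype('1', True) == '1'
assert infer_genotype(None, True) == '0'
assert infer_genotype(None, False) == '3'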

In [20]:
qh = GQH.GOR_Query_Helper()
mydefs = qh.add_many("""
def #hpo_pn# = user_data/hakon/hpo_pheno.tsv;
def #wgsvars# = source/var/wgs_varcalls.gord -s PN ;
def #goodcov# = source/cov/goodcov_6.wgs.gord -s PN ;
create #buckets# = nor -asdict #wgsvars# | select #2 | rename #1 PN | calc bucket 'b1_1';
""")
In [21]:
%%gor
$mydefs
nor [#buckets#] | group -count
Query ran in 0.83 sec
Query fetched 1 rows in 0.06 sec (total time 0.89 sec)
                                             
Out[21]:
allCount
0 24158

We see that we have over 24k samples. Next we create the genotypes by fetching all the variants within 20bp of a gene of choice, e.g. BRCA1. A GT column is mandatory for GTGEN and is calculated from the QC metrics stored in each row, e.g. we map low-quality variants to the unknown state '3'. Because we do not assume that the variants have been properly normalized, we perform a left-normalization and ensure that there are no duplicate rows per sample, using ATMAX. Finally, we feed the variant rows into GTGEN, which also joins in the coverage information. Depending on the size of the gene, this query may take a minute or two to run.
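
In Python terms, the gt mapping fed into GTGEN below is simply the following (an illustrative sketch; GL_Call and callcopies are columns of the variant source):

GL_CALL_THRESHOLD = 20   # same cutoff as GL_Call >= 20 in the query below

def qc_gt(callcopies: str, gl_call: float) -> str:
    """Keep the called copy number for confident calls, otherwise mark the genotype unknown ('3')."""
    return callcopies if gl_call >= GL_CALL_THRESHOLD else '3'

assert qc_gt('1', 35.0) == '1'
assert qc_gt('2', 12.0) == '3'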

In [22]:
mydefs = qh.add_many("""
create #genotypes# = gor #genes# | where gene_symbol = 'BRCA1' | segproj 
| join -segsnp -f 20 -ir <(gor #wgsvars# | calc gt if(GL_Call>=20,callcopies,'3') | select 1-4,gt,pn )
| varnorm -left #3 #4
| atmax 1 -gc pn gt
| gtgen -gc #3,#4 [#buckets#] <(gor #goodcov# ) -maxseg 10000;
""")

We can inspect the genotypes and generate a simple histogram of the genotype values:

In [23]:
%%gor
$mydefs
gor [#genotypes#] | top 10 | replace values listcount(fsvmap(values,1,'x',',')) 
Query ran in 0.11 sec
Query fetched 10 rows in 0.06 sec (total time 0.17 sec)
                                             
Out[23]:
CHROM POS Reference Call Bucket Values
0 chr17 43044346 C T b1_1 0;3121,1;36,3;21001
1 chr17 43044351 C T b1_1 0;3155,1;1,3;21002
2 chr17 43044355 T C b1_1 0;3168,1;1,3;20989
3 chr17 43044358 T G b1_1 0;3165,1;1,3;20992
4 chr17 43044386 A G b1_1 0;3173,1;1,3;20984
5 chr17 43044391 G A b1_1 0;2327,1;676,2;169,3;20986
6 chr17 43044492 T C b1_1 0;3149,1;2,3;21007
7 chr17 43044565 C T b1_1 0;3148,1;3,3;21007
8 chr17 43044725 C T b1_1 0;3088,1;1,3;21069
9 chr17 43044750 G A b1_1 0;3080,1;3,3;21075

and we can see the total number of variants:

In [24]:
%%gor
$mydefs
gor [#genotypes#] | group chrom -count
Query ran in 0.10 sec
Query fetched 1 rows in 1.09 sec (total time 1.19 sec)
                                             
Out[24]:
CHROM bpStart bpStop allCount
0 chr17 0 83257441 5691

Next we want to generate summary statistics for all the samples. Typically, the number of samples that we exclude is much smaller than the number of controls, so once we have counts for all the samples we can derive the control counts by simple subtraction. Here we also calculate the allele number (AN), allele count (AC) and allele frequency (AF). Note that this calculation is only correct for autosomal chromosomes - see the report-builder freeze_af.ftl.yml() for a gender-specific implementation of AF.
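
To make the arithmetic concrete, here is the same calculation in plain Python for the first variant row of the genotype histogram above (counts 3121/36/0/21001); the result matches the first row of the output further below.

# Genotype counts for chr17:43044346 C>T (0 = hom-ref, 1 = het, 2 = hom-alt, 3 = unknown).
gt_counts = {"0": 3121, "1": 36, "2": 0, "3": 21001}

AN = 2 * (gt_counts["0"] + gt_counts["1"] + gt_counts["2"])   # allele number, autosomes only
AC = gt_counts["1"] + 2 * gt_counts["2"]                      # alternate allele count
AF = AC / AN                                                  # allele frequency

print(AN, AC, round(AF, 6))   # 6314 36 0.005702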

In [25]:
mydefs = qh.add_many("""
create #all# = gor [#genotypes#] 
| csvcc -gc #3,#4 -vs 1 -u 3 [#buckets#] <(nor [#buckets#] | select PN | calc cc 'all')
| pivot -gc #3,#4,cc -v 0,1,2,3 gt -e 0 
| rename (.*_GTcount) all_#{1}
| hide cc
| calc AN 2*(all_0_GTcount + all_1_GTcount + all_2_GTcount)
| calc AC all_1_GTcount + 2*all_2_GTcount 
| calc AF GFORM(AC/AN,4,4);
""")
In [26]:
%%gor
$mydefs
gor [#all#] | top 10
Query ran in 0.10 sec
Query fetched 10 rows in 0.01 sec (total time 0.10 sec)
                                             
Out[26]:
CHROM POS Reference Call all_0_GTcount all_1_GTcount all_2_GTcount all_3_GTcount AN AC AF
0 chr17 43044346 C T 3121 36 0 21001 6314 36 0.005702
1 chr17 43044351 C T 3155 1 0 21002 6312 1 0.000158
2 chr17 43044355 T C 3168 1 0 20989 6338 1 0.000158
3 chr17 43044358 T G 3165 1 0 20992 6332 1 0.000158
4 chr17 43044386 A G 3173 1 0 20984 6348 1 0.000158
5 chr17 43044391 G A 2327 676 169 20986 6344 1014 0.159800
6 chr17 43044492 T C 3149 2 0 21007 6302 2 0.000317
7 chr17 43044565 C T 3148 3 0 21007 6302 3 0.000476
8 chr17 43044725 C T 3088 1 0 21069 6178 1 0.000162
9 chr17 43044750 G A 3080 3 0 21075 6166 3 0.000487

The query above took only a short time to run because we read the genotypes once and the CSVCC command efficiently counts the frequency of each genotype value per variant row. However, we will be using close to 5k phenotypes, so we want to use the PARALLEL command to split up the load from the many phenotypes at hand. To split the load evenly, we order the phenotypes by size and round-robin them across the workers, as sketched below.
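
A minimal Python sketch of that assignment (the phenotype codes are taken from outputs in this notebook; the case counts are made up for illustration):

def round_robin(pheno_sizes: dict, parts: int = 20) -> dict:
    """Order phenotypes by case count and deal them out to workers with a modulus."""
    ordered = sorted(pheno_sizes, key=pheno_sizes.get)          # like sort -c allcount:n
    return {pheno: rownum % parts                               # like mod(rownum, #parts#)
            for rownum, pheno in enumerate(ordered, start=1)}

assignment = round_robin({"HP:0012531": 9000, "HP:0000218": 455, "HP:0030714": 7}, parts=2)
print(assignment)   # {'HP:0030714': 1, 'HP:0000218': 0, 'HP:0012531': 1}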

In [27]:
mydefs = qh.add_many("""
create #phenosizeordered# = nor #hpo_pn# 
| group -gc pheno_code -count | sort -c allcount:n | rownum;
""")

We are now ready to set up the PARALLEL macro command, which uses the rownum in the parts relation and a modulus filter to provide a different phenotype subset to CSVCC in each parallel task. For a quick first pass we use top 1 to run only one partition and top 10 to terminate the calculation quickly:

In [28]:
mydefs = qh.add_many("""
def #parts# = 20;
create #phewas# = parallel -parts <(norrows #parts# | top 1) <(
gor [#genotypes#] | csvcc -gc #3,#4 -vs 1 -u 3 [#buckets#] 
  <(nor #hpo_pn# 
    | inset -c pheno_code <(nor [#phenosizeordered#] | where mod(rownum, #parts#) = #{col:rownum} ))
| pivot -gc #3,#4,pheno,cc -v 0,1,2,3 gt -e 0
| pivot -gc #3,#4,pheno -v 1,3 cc -e 0
| prefix pheno[+1]- y
| rename y_1_(.*) case_#{1}
| rename y_3_(.*) excl_#{1}
| varjoin -r [#all#]
| calc ctrl_0_GTcount all_0_GTcount-case_0_GTcount-excl_0_GTcount
| calc ctrl_1_GTcount all_1_GTcount-case_1_GTcount-excl_1_GTcount
| calc ctrl_2_GTcount all_2_GTcount-case_2_GTcount-excl_2_GTcount
| calc ctrl_3_GTcount all_3_GTcount-case_3_GTcount-excl_3_GTcount
| top 10
);
""")
In [29]:
%%gor
$mydefs
gor [#phewas#]
Query ran in 0.15 sec
Query fetched 10 rows in 0.01 sec (total time 0.16 sec)
                                             
Out[29]:
CHROM POS Reference Call Pheno case_0_GTcount case_1_GTcount case_2_GTcount case_3_GTcount excl_0_GTcount ... all_1_GTcount all_2_GTcount all_3_GTcount AN AC AF ctrl_0_GTcount ctrl_1_GTcount ctrl_2_GTcount ctrl_3_GTcount
0 chr17 43044346 C T HP:0030714 0 0 0 1 6 ... 36 0 21001 6314 36 0.005702 3115 36 0 20913
1 chr17 43044346 C T HP:0003049 0 0 0 1 11 ... 36 0 21001 6314 36 0.005702 3110 35 0 20951
2 chr17 43044346 C T HP:0000218 26 0 0 429 17 ... 36 0 21001 6314 36 0.005702 3078 36 0 20378
3 chr17 43044346 C T HP:0008661 3 0 0 15 3 ... 36 0 21001 6314 36 0.005702 3115 36 0 20975
4 chr17 43044346 C T HP:0004385 0 0 0 3 294 ... 36 0 21001 6314 36 0.005702 2827 29 0 20098
5 chr17 43044346 C T HP:0009103 0 0 0 4 62 ... 36 0 21001 6314 36 0.005702 3059 35 0 20606
6 chr17 43044346 C T HP:0002265 2 0 0 11 3 ... 36 0 21001 6314 36 0.005702 3116 36 0 20937
7 chr17 43044346 C T HP:0008668 0 0 0 1 9 ... 36 0 21001 6314 36 0.005702 3112 36 0 20939
8 chr17 43044346 C T HP:0010883 2 0 0 6 21 ... 36 0 21001 6314 36 0.005702 3098 36 0 20868
9 chr17 43044346 C T HP:0041064 0 0 0 2 38 ... 36 0 21001 6314 36 0.005702 3083 33 0 20729

10 rows × 24 columns

Notice how we use the all_*_GTcount columns to calculate the control counts. Since the number of excluded samples is typically much smaller than the number of controls, the CSVCC command needs to do far less counting; the short sketch below illustrates the subtraction and the query that follows quantifies the saving:
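
Using the first row of the output above (HP:0030714 at chr17:43044346 C>T; the excluded counts for gt 1, 2 and 3 are back-calculated from the table), the subtraction looks like this in Python:

all_c  = {"0": 3121, "1": 36, "2": 0, "3": 21001}   # all samples
case_c = {"0": 0,    "1": 0,  "2": 0, "3": 1}       # cases for HP:0030714
excl_c = {"0": 6,    "1": 0,  "2": 0, "3": 87}      # excluded samples (gt 1-3 inferred)
ctrl_c = {gt: all_c[gt] - case_c[gt] - excl_c[gt] for gt in all_c}
print(ctrl_c)   # {'0': 3115, '1': 36, '2': 0, '3': 20913}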

In [30]:
%%gor
$mydefs
nor user_data/hakon/hpo_pheno_summary.tsv | select 1-3 
| calc x casecount+exclcount
| multimap -cartesian <(nor [#buckets#] | group -count)
| group -sum -ic x,allcount
| calc f sum_x/sum_allcount
Query ran in 0.13 sec
Query fetched 1 rows in 0.17 sec (total time 0.31 sec)
                                             
Out[30]:
sum_x sum_allCount f
0 1511312 112817860 0.013396

From the above, we see that the case and excluded samples that CSVCC has to count amount to only about 1.3% of all sample-phenotype pairs, i.e. roughly a 75-fold reduction in the counting work behind the Fisher exact statistics! Now we complete the final query by adding p-value and odds-ratio calculations for the multiplicative, dominant, and recessive models.
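
Before running the full query, here is what the three 2x2 tables and the odds-ratio standard error amount to in plain Python, using scipy's two-sided Fisher exact test as a stand-in for GOR's PVAL function (a sketch of the statistics, not the query itself):

from math import exp, log, sqrt
from scipy.stats import fisher_exact   # two-sided Fisher exact test

def assoc_stats(case, ctrl):
    """case/ctrl: genotype counts {'0': hom-ref, '1': het, '2': hom-alt} for one variant."""
    # Multiplicative (allelic) model: alt vs ref allele counts in cases and controls.
    a, b = 2 * case["2"] + case["1"], 2 * case["0"] + case["1"]
    c, d = 2 * ctrl["2"] + ctrl["1"], 2 * ctrl["0"] + ctrl["1"]
    or_mm, p_mm = fisher_exact([[a, b], [c, d]])
    # Dominant model: carriers (het + hom-alt) vs hom-ref.
    or_dom, p_dom = fisher_exact([[case["2"] + case["1"], case["0"]],
                                  [ctrl["2"] + ctrl["1"], ctrl["0"]]])
    # Recessive model: hom-alt vs everything else.
    or_rec, p_rec = fisher_exact([[case["2"], case["1"] + case["0"]],
                                  [ctrl["2"], ctrl["1"] + ctrl["0"]]])
    # Standard error of OR_mm via the log-odds-ratio delta method (the lnse/OR_SEmm step).
    lnse = sqrt(sum(1.0 / max(1.0, x) for x in (a, b, c, d)))
    or_se = ((exp(log(or_mm) + 1.96 * lnse) - exp(log(or_mm) - 1.96 * lnse)) / (2 * 1.96)
             if or_mm > 0 else float("nan"))
    return p_mm, or_mm, or_se, p_dom, or_dom, p_rec, or_rec

# Counts from the first row of the output below (HP:0100852 at chr17:43044346 C>T):
stats = assoc_stats(case={"0": 241, "1": 6, "2": 0}, ctrl={"0": 2812, "1": 26, "2": 0})
print(round(stats[1], 1))   # multiplicative odds ratio, ~2.7

Note that scipy's odds ratio is the same cross-product ratio as OR_mm in the query, apart from the query's explicit handling of zero cells.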

In [31]:
mydefs = qh.add_many("""
def #parts# = 20;
create #phewas# = parallel -parts <(norrows #parts# ) <(
gor [#genotypes#] | csvcc -gc #3,#4 -vs 1 -u 3 [#buckets#] 
  <(nor #hpo_pn# 
    | inset -c pheno_code <(nor [#phenosizeordered#] | where mod(rownum, #parts#) = #{col:rownum} ))
| pivot -gc #3,#4,pheno,cc -v 0,1,2,3 gt -e 0
| pivot -gc #3,#4,pheno -v 1,3 cc -e 0
| prefix pheno[+1]- y
| rename y_1_(.*) case_#{1}
| rename y_3_(.*) excl_#{1}
| varjoin -r [#all#]
| calc ctrl_0_GTcount all_0_GTcount-case_0_GTcount-excl_0_GTcount
| calc ctrl_1_GTcount all_1_GTcount-case_1_GTcount-excl_1_GTcount
| calc ctrl_2_GTcount all_2_GTcount-case_2_GTcount-excl_2_GTcount
| calc ctrl_3_GTcount all_3_GTcount-case_3_GTcount-excl_3_GTcount

| calc pVal_mm EFORM(PVAL(case_2_GTcount*2+case_1_GTcount,case_0_GTcount*2 + case_1_GTcount,ctrl_2_GTcount*2+ctrl_1_GTcount,ctrl_0_GTcount*2 + ctrl_1_GTcount),5,1)
| calc OR_mm if((ctrl_2_GTcount*2+ctrl_1_GTcount) = 0 or (case_0_GTcount*2 + case_1_GTcount) = 0 and (ctrl_0_GTcount*2 + ctrl_1_GTcount) != 0,1000,if((case_0_GTcount*2 + case_1_GTcount) = 0 or (ctrl_0_GTcount*2 + ctrl_1_GTcount) = 0,float('NaN'),((case_2_GTcount*2+case_1_GTcount)/(case_0_GTcount*2 + case_1_GTcount))/((ctrl_2_GTcount*2+ctrl_1_GTcount)/(ctrl_0_GTcount*2 + ctrl_1_GTcount))))
  
| calc pVal_dom EFORM(PVAL(case_2_GTcount+case_1_GTcount,case_0_GTcount,ctrl_2_GTcount+ctrl_1_GTcount,ctrl_0_GTcount),5,1)
| calc OR_dom if(ctrl_2_GTcount+ctrl_1_GTcount = 0 or case_0_GTcount = 0 and ctrl_0_GTcount != 0,1000,if(case_0_GTcount = 0 or ctrl_0_GTcount = 0,float('NaN'),((case_2_GTcount+case_1_GTcount)/case_0_GTcount)/((ctrl_2_GTcount+ctrl_1_GTcount)/ctrl_0_GTcount)))

| calc pVal_rec EFORM(PVAL(case_2_GTcount,case_1_GTcount + case_0_GTcount,ctrl_2_GTcount,ctrl_1_GTcount + ctrl_0_GTcount),5,1)
| calc OR_rec if(ctrl_2_GTcount = 0 or case_1_GTcount + case_0_GTcount = 0 and ctrl_1_GTcount + ctrl_0_GTcount != 0,1000,if(case_1_GTcount + case_0_GTcount = 0 or ctrl_1_GTcount + ctrl_0_GTcount = 0,float('NaN'),(case_2_GTcount/(case_1_GTcount + case_0_GTcount))/(ctrl_2_GTcount/(ctrl_1_GTcount + ctrl_0_GTcount))))

| calc lnse sqrt(1.0/max(1.0,float(case_2_GTcount*2+case_1_GTcount))+1.0/max(1.0,float(case_0_GTcount*2 + case_1_GTcount))+1.0/max(1.0,float(ctrl_2_GTcount*2+ctrl_1_GTcount))+1.0/max(1.0,float(ctrl_0_GTcount*2 + ctrl_1_GTcount)))
| calc OR_SEmm GFORM((exp(ln(OR_mm)+1.96*lnse)-exp(ln(OR_mm)-1.96*lnse))/(2.0*1.96),4,3)

| select 1,2,Reference,Call,Pheno,AF,pVal_mm,OR_mm,OR_SEmm,pVal_dom,OR_dom,pVal_rec,OR_rec,case_0_GTcount,case_1_GTcount,case_2_GTcount,case_3_GTcount,ctrl_0_GTcount,ctrl_1_GTcount,ctrl_2_GTcount,ctrl_3_GTcount,AN,AC
| where float(pval_mm) < 0.1 and case_1_GTcount+case_2_GTcount > 0
| replace OR_* GFORM(float(#rc),2,2)
);
""")
In [32]:
%%gor
$mydefs
gor [#phewas#] | top 100
Query ran in 0.20 sec
Query fetched 100 rows in 0.02 sec (total time 0.22 sec)
                                             
Out[32]:
CHROM POS Reference Call Pheno AF pVal_mm OR_mm OR_SEmm pVal_dom ... case_0_GTcount case_1_GTcount case_2_GTcount case_3_GTcount ctrl_0_GTcount ctrl_1_GTcount ctrl_2_GTcount ctrl_3_GTcount AN AC
0 chr17 43044346 C T HP:0100852 0.005702 0.03900 2.7 1.4 0.03800 ... 241 6 0 1172 2812 26 0 19428 6314 36
1 chr17 43044346 C T HP:0001337 0.005702 0.06500 3.5 2.7 0.06500 ... 80 3 0 579 2916 31 0 19499 6314 36
2 chr17 43044346 C T HP:0031197 0.005702 0.01100 180.0 740.0 0.01100 ... 0 1 0 0 3112 35 0 20985 6314 36
3 chr17 43044346 C T HP:0001696 0.005702 0.05600 20.0 41.0 0.05600 ... 4 1 0 15 3116 35 0 20966 6314 36
4 chr17 43044346 C T HP:0031218 0.005702 0.01100 180.0 760.0 0.01100 ... 0 1 0 4 3108 34 0 20929 6314 36
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 chr17 43044346 C T HP:0000549 0.005702 0.01000 5.2 3.3 0.01000 ... 71 4 0 744 3024 32 0 19948 6314 36
96 chr17 43044346 C T HP:0001397 0.005702 0.09800 11.0 20.0 0.09800 ... 8 1 0 30 3113 35 0 20971 6314 36
97 chr17 43044346 C T HP:0006385 0.005702 0.00047 92.0 130.0 0.00038 ... 1 2 0 20 3120 34 0 20980 6314 36
98 chr17 43044346 C T HP:0100836 0.005702 0.08800 12.0 23.0 0.08800 ... 7 1 0 32 3108 35 0 20946 6314 36
99 chr17 43044346 C T HP:0006699 0.005702 0.04400 26.0 55.0 0.04400 ... 3 1 0 9 3093 34 0 20860 6314 36

100 rows × 23 columns

We can see that by spreading the phenotypes across 20 parallel workers, we complete about 25 million Fisher exact tests for 25k samples in less than a minute! We can map in the phenotype names and filter out the most significant associations like this:
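
The scale of that claim can be sanity-checked from the earlier outputs:

n_variants = 5691                      # BRCA1-region variants (group chrom -count output above)
n_phenos = 112_817_860 // 24_158       # = 4670 phenotypes (sum_allCount / sample count above)
print(n_variants * n_phenos)           # 26,576,970 variant-phenotype pairs, roughly the 25 million quoted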

In [33]:
%%gor
$mydefs

nor <(gor [#phewas#] | where pval_mm < 0.01 | select 1-OR_semm
      | map -c pheno <(nor user_data/hakon/hpo_pheno_summary.tsv | select 1,name)
    ) | rank pval_mm -o asc -gc pheno | where rank_pval_mm < 10 | sort -c pval_mm:n
Query ran in 0.18 sec
Query fetched 21,235 rows in 1.19 sec (total time 1.37 sec)
                                             
Out[33]:
CHROM POS Reference Call Pheno AF pVal_mm OR_mm OR_SEmm name rank_pVal_mm
0 chr17 43159574 T C HP:0011458 0.594900 1.900000e-219 14.0 1.50 Abdominal symptom 1
1 chr17 43159574 T C HP:0012531 0.594900 1.900000e-219 18.0 2.00 Pain 1
2 chr17 43169893 C T HP:0002017 0.341000 1.900000e-219 17.0 1.60 Nausea and vomiting 1
3 chr17 43169893 C T HP:0002037 0.341000 1.900000e-219 41.0 7.20 Inflammation of the large intestine 1
4 chr17 43169893 C T HP:0011458 0.341000 1.900000e-219 13.0 0.91 Abdominal symptom 1
... ... ... ... ... ... ... ... ... ... ... ...
21230 chr17 43097369 A G HP:0012014 0.001242 9.900000e-03 120.0 240.00 EEG with central focal spikes 1
21231 chr17 43097369 A G HP:0010471 0.001242 9.900000e-03 120.0 240.00 Oligosacchariduria 3
21232 chr17 43100620 CATATATATATGTT C HP:0008458 0.004777 9.900000e-03 210.0 920.00 Progressive congenital scoliosis 6
21233 chr17 43100664 A G HP:0010301 0.026530 9.900000e-03 19.0 25.00 Spinal dysraphism 7
21234 chr17 43133822 A G HP:0000327 0.001657 9.900000e-03 140.0 370.00 Hypoplasia of the maxilla 9

21235 rows × 11 columns

Finally, we print out the complete query:

In [34]:
print(mydefs)
print("""nor <(gor [#phewas#] | where pval_mm < 0.01 | select 1-OR_semm
      | map -c pheno <(nor user_data/hakon/hpo_pheno_summary.tsv | select 1,name)
    ) | rank pval_mm -o asc -gc pheno | where rank_pval_mm < 10 | sort -c pval_mm:n""")
def #hpo_pn# = user_data/hakon/hpo_pheno.tsv;
def #wgsvars# = source/var/wgs_varcalls.gord -s PN;
def #goodcov# = source/cov/goodcov_6.wgs.gord -s PN;
create #buckets# = nor -asdict #wgsvars# | select #2 | rename #1 PN | calc bucket 'b1_1';
create #genotypes# = gor #genes# | where gene_symbol = 'BRCA1' | segproj 
| join -segsnp -f 20 -ir <(gor #wgsvars# | calc gt if(GL_Call>=20,callcopies,'3') | select 1-4,gt,pn )
| varnorm -left #3 #4
| atmax 1 -gc pn gt
| gtgen -gc #3,#4 [#buckets#] <(gor #goodcov# ) -maxseg 10000;
create #all# = gor [#genotypes#] 
| csvcc -gc #3,#4 -vs 1 -u 3 [#buckets#] <(nor [#buckets#] | select PN | calc cc 'all')
| pivot -gc #3,#4,cc -v 0,1,2,3 gt -e 0 
| rename (.*_GTcount) all_#{1}
| hide cc
| calc AN 2*(all_0_GTcount + all_1_GTcount + all_2_GTcount)
| calc AC all_1_GTcount + 2*all_2_GTcount 
| calc AF GFORM(AC/AN,4,4);
create #phenosizeordered# = nor #hpo_pn# 
| group -gc pheno_code -count | sort -c allcount:n | rownum;
def #parts# = 20;
create #phewas# = parallel -parts <(norrows #parts# ) <(
gor [#genotypes#] | csvcc -gc #3,#4 -vs 1 -u 3 [#buckets#] 
  <(nor #hpo_pn# 
    | inset -c pheno_code <(nor [#phenosizeordered#] | where mod(rownum, #parts#) = #{col:rownum} ))
| pivot -gc #3,#4,pheno,cc -v 0,1,2,3 gt -e 0
| pivot -gc #3,#4,pheno -v 1,3 cc -e 0
| prefix pheno[+1]- y
| rename y_1_(.*) case_#{1}
| rename y_3_(.*) excl_#{1}
| varjoin -r [#all#]
| calc ctrl_0_GTcount all_0_GTcount-case_0_GTcount-excl_0_GTcount
| calc ctrl_1_GTcount all_1_GTcount-case_1_GTcount-excl_1_GTcount
| calc ctrl_2_GTcount all_2_GTcount-case_2_GTcount-excl_2_GTcount
| calc ctrl_3_GTcount all_3_GTcount-case_3_GTcount-excl_3_GTcount

| calc pVal_mm EFORM(PVAL(case_2_GTcount*2+case_1_GTcount,case_0_GTcount*2 + case_1_GTcount,ctrl_2_GTcount*2+ctrl_1_GTcount,ctrl_0_GTcount*2 + ctrl_1_GTcount),5,1)
| calc OR_mm if((ctrl_2_GTcount*2+ctrl_1_GTcount) = 0 or (case_0_GTcount*2 + case_1_GTcount) = 0 and (ctrl_0_GTcount*2 + ctrl_1_GTcount) != 0,1000,if((case_0_GTcount*2 + case_1_GTcount) = 0 or (ctrl_0_GTcount*2 + ctrl_1_GTcount) = 0,float('NaN'),((case_2_GTcount*2+case_1_GTcount)/(case_0_GTcount*2 + case_1_GTcount))/((ctrl_2_GTcount*2+ctrl_1_GTcount)/(ctrl_0_GTcount*2 + ctrl_1_GTcount))))
  
| calc pVal_dom EFORM(PVAL(case_2_GTcount+case_1_GTcount,case_0_GTcount,ctrl_2_GTcount+ctrl_1_GTcount,ctrl_0_GTcount),5,1)
| calc OR_dom if(ctrl_2_GTcount+ctrl_1_GTcount = 0 or case_0_GTcount = 0 and ctrl_0_GTcount != 0,1000,if(case_0_GTcount = 0 or ctrl_0_GTcount = 0,float('NaN'),((case_2_GTcount+case_1_GTcount)/case_0_GTcount)/((ctrl_2_GTcount+ctrl_1_GTcount)/ctrl_0_GTcount)))

| calc pVal_rec EFORM(PVAL(case_2_GTcount,case_1_GTcount + case_0_GTcount,ctrl_2_GTcount,ctrl_1_GTcount + ctrl_0_GTcount),5,1)
| calc OR_rec if(ctrl_2_GTcount = 0 or case_1_GTcount + case_0_GTcount = 0 and ctrl_1_GTcount + ctrl_0_GTcount != 0,1000,if(case_1_GTcount + case_0_GTcount = 0 or ctrl_1_GTcount + ctrl_0_GTcount = 0,float('NaN'),(case_2_GTcount/(case_1_GTcount + case_0_GTcount))/(ctrl_2_GTcount/(ctrl_1_GTcount + ctrl_0_GTcount))))

| calc lnse sqrt(1.0/max(1.0,float(case_2_GTcount*2+case_1_GTcount))+1.0/max(1.0,float(case_0_GTcount*2 + case_1_GTcount))+1.0/max(1.0,float(ctrl_2_GTcount*2+ctrl_1_GTcount))+1.0/max(1.0,float(ctrl_0_GTcount*2 + ctrl_1_GTcount)))
| calc OR_SEmm GFORM((exp(ln(OR_mm)+1.96*lnse)-exp(ln(OR_mm)-1.96*lnse))/(2.0*1.96),4,3)

| select 1,2,Reference,Call,Pheno,AF,pVal_mm,OR_mm,OR_SEmm,pVal_dom,OR_dom,pVal_rec,OR_rec,case_0_GTcount,case_1_GTcount,case_2_GTcount,case_3_GTcount,ctrl_0_GTcount,ctrl_1_GTcount,ctrl_2_GTcount,ctrl_3_GTcount,AN,AC
| where float(pval_mm) < 0.1 and case_1_GTcount+case_2_GTcount > 0
| replace OR_* GFORM(float(#rc),2,2)
);

nor <(gor [#phewas#] | where pval_mm < 0.01 | select 1-OR_semm
      | map -c pheno <(nor user_data/hakon/hpo_pheno_summary.tsv | select 1,name)
    ) | rank pval_mm -o asc -gc pheno | where rank_pval_mm < 10 | sort -c pval_mm:n
In [ ]: