In [2]:
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1

project = "matthew_sampson_hg38"
%env GOR_API_PROJECT={project}

Parallel GORpipe queries¶

This notebook explains the basics of how to use parallel execution in GORdb. Parallel execution in important when working with very large datasets or when applying computationally heavy operations on the data. Importantly, GOR is inherently multi-threaded, meaning that regular commands may actually involve more than one thread, e.g. a special thread for decompressing files, another thread for command steps, another for nested queries, or for writing and compressing data. The commands we will discuss here are the macro commands for parallel execution. They are three:

  • PGOR
  • PARTGOR
  • PARALLEL

This notebook will focus only on PGOR and the more advanced commands will be described separately. We will proceed with selected examples to explain the behaviour but more detailed specification can be found in the documentation pages.

Basic PGOR functionality¶

Consider counting the number of exons per chromosome. Note that the number of exons is less than 2 millions and this query runs a second or two so it hardly justifies using parallel commands to speed it up.

In [3]:
%%gor

gor #exons# | group chrom -count | granno genome -sum -ic allcount
Query ran in 1.16 sec
Query fetched 25 rows in 2.86 sec (total time 4.02 sec)
                                             
Out[3]:
chrom bpStart bpStop allCount sum_allCount
0 chr1 0 248956422 145831 1572313
1 chr10 0 133797422 61586 1572313
2 chr11 0 135086622 91380 1572313
3 chr12 0 133275309 86588 1572313
4 chr13 0 114364328 27565 1572313
5 chr14 0 107043718 55289 1572313
6 chr15 0 101991189 58444 1572313
7 chr16 0 90338345 70107 1572313
8 chr17 0 83257441 88751 1572313
9 chr18 0 80373285 27803 1572313
10 chr19 0 58617616 81557 1572313
11 chr2 0 242193529 123667 1572313
12 chr20 0 64444167 37696 1572313
13 chr21 0 46709983 18791 1572313
14 chr22 0 50818468 33549 1572313
15 chr3 0 198295559 104300 1572313
16 chr4 0 190214555 65051 1572313
17 chr5 0 181538259 70514 1572313
18 chr6 0 170805979 73980 1572313
19 chr7 0 159345973 74905 1572313
20 chr8 0 145138636 58508 1572313
21 chr9 0 138394717 60518 1572313
22 chrM 0 16569 37 1572313
23 chrX 0 156040895 51549 1572313
24 chrY 0 57227415 4347 1572313
Note: We use the GRANNO genome command to sum and annotate the number of exons across the genome. One has to be careful applying this command with large grouping range because it load the rows into memory for annotation, however, here the rows are only 24.

Now we perform the same using the macro commad PGOR:

In [4]:
%%gor

pgor #exons# | group chrom -count | granno genome -sum -ic allcount
Query ran in 0.36 sec
Query fetched 25 rows in 0.02 sec (total time 0.38 sec)
                                             
Out[4]:
chrom bpStart bpStop allCount sum_allCount
0 chr1 0 248956422 145831 145831
1 chr10 0 133797422 61586 61586
2 chr11 0 135086622 91380 91380
3 chr12 0 133275309 86588 86588
4 chr13 0 114364328 27565 27565
5 chr14 0 107043718 55289 55289
6 chr15 0 101991189 58444 58444
7 chr16 0 90338345 70107 70107
8 chr17 0 83257441 88751 88751
9 chr18 0 80373285 27803 27803
10 chr19 0 58617616 81557 81557
11 chr2 0 242193529 123667 123667
12 chr20 0 64444167 37696 37696
13 chr21 0 46709983 18791 18791
14 chr22 0 50818468 33549 33549
15 chr3 0 198295559 104300 104300
16 chr4 0 190214555 65051 65051
17 chr5 0 181538259 70514 70514
18 chr6 0 170805979 73980 73980
19 chr7 0 159345973 74905 74905
20 chr8 0 145138636 58508 58508
21 chr9 0 138394717 60518 60518
22 chrM 0 16569 37 37
23 chrX 0 156040895 51549 51549
24 chrY 0 57227415 4347 4347
Alert: PGOR cannot be used inside a nested-query. Also, notice that even though GRANNO has genome group range, the sum_allCount number is only the sum per chromosome because of the way the PGOR command is implemented, i.e. there is no automatic distributive aggregation.

Previous parallel command is indeed a shorthand equivalent to the following query:

create #chr1# = gor -p chr1 <(gor #exons# | group chrom -count | granno genome -sum -ic allcount);
create #chr10# = gor -p chr10 <(gor #exons# | group chrom -count | granno genome -sum -ic allcount);
create #chr11# = gor -p chr11 <(gor #exons# | group chrom -count | granno genome -sum -ic allcount);
...
create #chrY# = gor -p chrY <(gor #exons# | group chrom -count | granno genome -sum -ic allcount);

gor [#chr1#] [#chr10#] [#chr11#] ... [#chrY#]

Therefore, one or more gor-workers execute each CREATE statement and the result (a single row in this case) will be persited in cache. Also the data in scope for each GRANNO command is only the rows within each create statement. Because of the caching, re-run of a PGOR is always fast. Indeed, PGOR does always use a CREATE statement and the query can also be written the in following way:

In [5]:
%%gor
create #exon_counts# = pgor #exons# | group chrom -count | granno genome -sum -ic allcount;
gor [#exon_counts#] | top 3
Query ran in 0.19 sec
Query fetched 3 rows in 0.02 sec (total time 0.22 sec)
                                             
Out[5]:
chrom bpStart bpStop allCount sum_allCount
0 chr1 0 248956422 145831 145831
1 chr10 0 133797422 61586 61586
2 chr11 0 135086622 91380 91380

Fix the aggregation¶

In order to get correct exon counts across the genome we must aggregate after the parallel execution, a pattern that is commonly applied in GOR queries:

In [6]:
%%gor 
create #exon_counts# = pgor #exons# | group chrom -count ;
gor [#exon_counts#] | granno genome -sum -ic allcount | top 3
Query ran in 1.08 sec
Query fetched 3 rows in 0.20 sec (total time 1.28 sec)
                                             
Out[6]:
chrom bpStart bpStop allCount sum_allCount
0 chr1 0 248956422 145831 1572313
1 chr10 0 133797422 61586 1572313
2 chr11 0 135086622 91380 1572313

Controlling the level of splitting¶

Before we simply used PGOR without any instructions of how to split the queries across the genome. By default, the execution engine will rely on a metadata chrom split file and partition the larger chromosomes into two segments, with the cut at the centromeres, and a single partition for the smaller ones, i.e. 38 partitions. However, if there are commands that perform grouping where the grouping range is larger than 1base, it uses only one per chromosome, i.e. 25 partitions.

The PGOR command does however have a -split option that allows us to control how the splitting is performed. For instance, if there is large volume of data, we may want to use many more partitions.

Tip: For efficient parallel processing, it is important not to use too many partitions, because there is a small startup overhead cost for each gor-worker. Also, each worker will deliver it's results in a cache-file and merging the results from too many files can generate additional overhead.

We will now to our attention to the dbNSFP data set. It is a very large table with about 84M rows of variants and 643 columns. Our goal is to map calibrated Revel and MutPred scores to ACMG evidence level for PP3 and BP4. For the following examples, we will use the Python GOR query helper library.

In [7]:
qh = GQH.GOR_Query_Helper()
qh.add("""def ##ref## = ref;""")
qh.add("""def ##dbnsfp## = ##ref##/variants/dbnsfp.gorz
| rename ensembl_transcriptid transcript_stable_id | rename genename gene_symbol""")
mydefs = qh.defs()
print(mydefs)
def ##ref## = ref;
def ##dbnsfp## = ##ref##/variants/dbnsfp.gorz
| rename ensembl_transcriptid transcript_stable_id | rename genename gene_symbol;

Now we can use the string variable mydefs within the magic syntax like this:

In [8]:
%%gor
$mydefs
nor ##dbnsfp## | top 1 | unpivot 1- | granno -count | rename allcount Tot_Col_Count
| where lower(col_name) in ('chr','pos','ref','alt','transcript_stable_id','gene_symbol','revel_score','mutpred_score')
Query ran in 1.13 sec
Query fetched 8 rows in 0.01 sec (total time 1.15 sec)
                                             
Out[8]:
Col_Name Col_Value Tot_Col_Count
0 chr chr1 643
1 pos 65565 643
2 ref A 643
3 alt C 643
4 gene_symbol OR4F5 643
5 transcript_stable_id ENST00000641515 643
6 REVEL_score . 643
7 MutPred_score . 643

We inspect the dbNSFP table and see that it has many of the columns with semicolon separated list, i.e. multiple transcrips are collapsed into lists per variant.

In [9]:
%%gor
$mydefs

gor ##ref##/ensgenes/ensgenes_transcripts.gor | segproj | where segcount > 1
| join -segsnp -ir ##dbnsfp## 
| where contains(transcript_stable_id,';')
| top 1
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,revel_score,mutpred_score
| unpivot 3-
Query ran in 1.07 sec
Query fetched 6 rows in 2.12 sec (total time 3.19 sec)
                                             
Out[9]:
chr pos Col_Name Col_Value
0 chr1 925942 ref A
1 chr1 925942 alt C
2 chr1 925942 transcript_stable_id ENST00000420190;ENST00000437963;ENST0000034206...
3 chr1 925942 gene_symbol SAMD11;SAMD11;SAMD11;SAMD11;SAMD11;SAMD11;SAMD...
4 chr1 925942 REVEL_score 0.521;0.521;0.521;.;.;.;.;.;.;.;.
5 chr1 925942 MutPred_score 0.988

We will now split these lists into multiple rows and replace the score value where it is missing for the transcrips with the minimum score per variant.

In [10]:
%%gor
$mydefs

gor ##ref##/ensgenes/ensgenes_transcripts.gor | segproj | where segcount > 1
| join -segsnp -ir ##dbnsfp## 
| where contains(transcript_stable_id,';')
| top 1
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,revel_score,mutpred_score
| split 5- -s ';'
| granno 1 -gc ref,alt -min -fc revel_score,mutpred_score
| calc new_revel_score if(not(isfloat(revel_score)),min_revel_score,float(revel_score))
| calc new_mutpred_score if(not(isfloat(mutpred_score)),min_mutpred_score,float(mutpred_score))
Query ran in 0.98 sec
Query fetched 11 rows in 1.19 sec (total time 2.17 sec)
                                             
Out[10]:
chr pos ref alt transcript_stable_id gene_symbol REVEL_score MutPred_score min_REVEL_score min_MutPred_score new_revel_score new_mutpred_score
0 chr1 925942 A C ENST00000420190 SAMD11 0.521 0.988 0.521 0.988 0.521 0.988
1 chr1 925942 A C ENST00000437963 SAMD11 0.521 NaN 0.521 0.988 0.521 0.988
2 chr1 925942 A C ENST00000342066 SAMD11 0.521 NaN 0.521 0.988 0.521 0.988
3 chr1 925942 A C ENST00000618181 SAMD11 . NaN 0.521 0.988 0.521 0.988
4 chr1 925942 A C ENST00000622503 SAMD11 . NaN 0.521 0.988 0.521 0.988
5 chr1 925942 A C ENST00000618323 SAMD11 . NaN 0.521 0.988 0.521 0.988
6 chr1 925942 A C ENST00000616016 SAMD11 . NaN 0.521 0.988 0.521 0.988
7 chr1 925942 A C ENST00000618779 SAMD11 . NaN 0.521 0.988 0.521 0.988
8 chr1 925942 A C ENST00000616125 SAMD11 . NaN 0.521 0.988 0.521 0.988
9 chr1 925942 A C ENST00000620200 SAMD11 . NaN 0.521 0.988 0.521 0.988
10 chr1 925942 A C ENST00000617307 SAMD11 . NaN 0.521 0.988 0.521 0.988

Specifying the number of split partitions¶

We are now ready to create our first transformation of dbNSFP

In [11]:
mydefs = qh.add("""create ##dbnsfp_small## = pgor -split 100 ##dbnsfp## 
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,revel_score,mutpred_score
| split 5- -s ';'
| granno 1 -gc ref,alt -min -fc revel_score,mutpred_score
| calc new_revel_score if(not(isfloat(revel_score)),min_revel_score,float(revel_score))
| calc new_mutpred_score if(not(isfloat(mutpred_score)),min_mutpred_score,float(mutpred_score))
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,new_revel_score,new_mutpred_score
| replace new_* if(#rc='NaN','.',#rc)
| rename new_(.*) #{1}
""")
print(mydefs)
def ##ref## = ref;
def ##dbnsfp## = ##ref##/variants/dbnsfp.gorz
| rename ensembl_transcriptid transcript_stable_id | rename genename gene_symbol;
create ##dbnsfp_small## = pgor -split 100 ##dbnsfp## 
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,revel_score,mutpred_score
| split 5- -s ';'
| granno 1 -gc ref,alt -min -fc revel_score,mutpred_score
| calc new_revel_score if(not(isfloat(revel_score)),min_revel_score,float(revel_score))
| calc new_mutpred_score if(not(isfloat(mutpred_score)),min_mutpred_score,float(mutpred_score))
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,new_revel_score,new_mutpred_score
| replace new_* if(#rc='NaN','.',#rc)
| rename new_(.*) #{1}
;

Notice that we use here 100 splits, because dbNSFP is a large table and we are performing significant calculations per row. With sufficient number of stand-by workers this should finish in about a minute. Also, worth noting is that with the CREATE statement, we are creating a temporary table which is significantly smaller than the original dbNSFP table, with its 643 columns.

In [12]:
%%gor
$mydefs

gor [##dbnsfp_small##] | where revel_score != '.' | top 2
Query ran in 0.74 sec
Query fetched 2 rows in 0.15 sec (total time 0.90 sec)
                                             
Out[12]:
chr pos ref alt transcript_stable_id gene_symbol revel_score mutpred_score
0 chr1 69091 A C ENST00000641515 OR4F5 0.109 0.823
1 chr1 69091 A C ENST00000335137 OR4F5 0.109 0.823

To understand better how the results in ##dbnsfp_small## are stored we can inspect the "dictionary" like this:

In [13]:
%%gor
$mydefs

nor -asdict [##dbnsfp_small##] | granno -count | top 5
Query ran in 0.58 sec
Query fetched 5 rows in 0.02 sec (total time 0.60 sec)
                                             
Out[13]:
col1 col2 col3 col4 col5 col6 allCount
0 ../../../../cache/result_cache/cache01/a64/a64... 1 chr7 120274633 chr7 149879677 114
1 ../../../../cache/result_cache/cache01/130/130... 2 chr12 61710374 chr12 89655886 114
2 ../../../../cache/result_cache/cache01/440/440... 3 chr6 30061525 chr6 57646158 114
3 ../../../../cache/result_cache/cache01/bc3/bc3... 4 chr2 94588522 chr2 119977070 114
4 ../../../../cache/result_cache/cache01/e9e/e9e... 5 chr9 14807 chr9 27950671 114

We see reference to 114 cache files (slightly different than the target split of 100 because of uneven chromosome sizes) and each cache file has metadata telling the system the position of the first and last row in any given file. Now we are ready to map the scores to ACMG evidence.

In [14]:
mydefs = qh.add("""create ##BP4_PP3_scores## = pgor [##dbnsfp_small##]
| calc Revel_BP4_PP3 if(revel_score!='.',if(float(revel_score) <= 0.003,'BP4_verystrong',
if(0.003 < float(revel_score) and float(revel_score) <= 0.016,'BP4_strong',
if(0.016 < float(revel_score) and float(revel_score) <= 0.183,'BP4_moderate',
if(0.183 < float(revel_score) and float(revel_score) <= 0.290,'BP4_supporting',
if(0.644 <= float(revel_score) and float(revel_score) < 0.773,'PP3_supporting',
if(0.773 <= float(revel_score) and float(revel_score) < 0.932,'PP3_moderate',
if(0.932 <= float(revel_score),'PP3_strong',''))))))),'')
| calc Mutpred_BP4_PP3 if(mutpred_score!='.',if(float(mutpred_score) <= 0.010,'BP4_strong',
if(0.010 < float(mutpred_score) and float(mutpred_score) <= 0.197,'BP4_moderate',
if(0.197 < float(mutpred_score) and float(mutpred_score) <= 0.391,'BP4_supporting',
if(0.737 <= float(mutpred_score) and float(mutpred_score) < 0.829,'PP3_supporting',
if(0.829 <= float(mutpred_score) and float(mutpred_score) < 0.932,'PP3_moderate',
if(0.932 <= float(mutpred_score),'PP3_strong','')))))),'')
""")
In [15]:
%%gor
$mydefs
gor [##BP4_PP3_scores##] | where len(revel_bp4_pp3)>1 | top 4
Query ran in 0.73 sec
Query fetched 4 rows in 0.06 sec (total time 0.79 sec)
                                             
Out[15]:
chr pos ref alt transcript_stable_id gene_symbol revel_score mutpred_score Revel_BP4_PP3 Mutpred_BP4_PP3
0 chr1 69091 A C ENST00000641515 OR4F5 0.109 0.823 BP4_moderate PP3_supporting
1 chr1 69091 A C ENST00000335137 OR4F5 0.109 0.823 BP4_moderate PP3_supporting
2 chr1 69091 A G ENST00000641515 OR4F5 0.133 0.813 BP4_moderate PP3_supporting
3 chr1 69091 A G ENST00000335137 OR4F5 0.133 0.813 BP4_moderate PP3_supporting

Notice, that in ##BP4_PP3_scores## we use only the default split, because the volume of data we are working with now is much smaller than in the original dbNSFP table.

Specifying the partition with segments¶

Here we want to show how we can calculate how the evidence levels are distributes with VEP consequence labels. Hence, we will join the project VEP table with ##BP4_PP3_scores##. This can be a heavy quer and we would like to make sure that the load per gor-workers is similar. To do that we will use the SEGHIST command to generate regions with similar number of rows. We start by counting the total number of rows in the VEP table by joining the distinct variants with the transcrips:

In [16]:
mydefs = qh.add("def ##veppath## = source/anno/vep_v107;")
In [17]:
%%gor rowcount <<
$mydefs
create #temp# = pgor ##veppath##/allvars.gord 
| join -snpseg -ic ##ref##/ensgenes/ensgenes_transcripts.gor 
| replace overlapcount if(overlapcount = 0,1,overlapcount)
| group chrom -sum -ic overlapcount;
nor [#temp#] | group -sum -ic sum_OverlapCount 
Query ran in 2.32 sec
Query fetched 1 rows in 5.91 sec (total time 8.22 sec)
                                             
In [18]:
rowCountPerSeg = int(rowcount.at[0,'sum_sum_OverlapCount']/200)
rowCountPerSeg
Out[18]:
1084847

The number above is the targe count to achieve 200 partitions. With it in hand, we can apply SEGHIST that gives us close to this count per segment.

In [19]:
mydefs = qh.add(f"""create #segs# = pgor ##veppath##/allvars.gord 
| join -snpseg -ic ##ref##/ensgenes/ensgenes_transcripts.gor
| replace overlapcount if(overlapcount = 0,1,overlapcount)
| group 10000 -sum -ic overlapcount | seghist {rowCountPerSeg}""")
In [20]:
%%gor
$mydefs
nor [#segs#] | granno -count | top 5
Query ran in 1.31 sec
Query fetched 5 rows in 1.69 sec (total time 3.00 sec)
                                             
Out[20]:
Chrom bpStart bpStop Count allCount
0 chr1 0 6088290 1084847 216
1 chr1 6088290 16584106 1084847 216
2 chr1 16584106 26792265 1084847 216
3 chr1 26792265 39327763 1084847 216
4 chr1 39327763 45336868 1084847 216

We see that we get approximately 200 segments that contain close to 216k rows in the VEP table. Now we will join the consequence of the transcripts in the VEP table with the evidence level in the ##BP4_PP3_scores## table. For demonstration purposes, we will only use the segments on chr22 to begin with.

Note: SEGHIST provides a simple yet a very powerful framework to split parallelization based on expected load, e.g. density of variants, density of variants multiplied by transcript count as shown here, or any other measure that is expected to reflect query cost. The GROUP command in front of SEGHIST should use a range that is large enough such that the input to SEGHIST does not contain too many rows and small enough to enable SEGHIST to interpolate accurately. 10kbp is usually a safe number.
In [21]:
mydefs = qh.add(f"""create ##part_counts## = pgor -split <(gor [#segs#] | where chrom = 'chr22') [##BP4_PP3_scores##] 
| varjoin -r -xl transcript_stable_id -xr feature ##veppath##/vep_multi_wgs.gord
| group chrom -gc Revel_BP4_PP3,Mutpred_BP4_PP3,consequence -count""")
In [22]:
%%gor
$mydefs
gor [##part_counts##]
Query ran in 0.72 sec
Query fetched 791 rows in 0.02 sec (total time 0.73 sec)
                                             
Out[22]:
chr bpStart bpStop Revel_BP4_PP3 Mutpred_BP4_PP3 Consequence allCount
0 chr22 0 50818468 NaN NaN 3_prime_UTR_variant 70
1 chr22 0 50818468 NaN NaN NMD_transcript_variant 88
2 chr22 0 50818468 NaN NaN coding_sequence_variant 1
3 chr22 0 50818468 NaN NaN intron_variant 94
4 chr22 0 50818468 NaN NaN missense_variant 15360
... ... ... ... ... ... ... ...
786 chr22 0 50818468 PP3_supporting PP3_strong missense_variant 57
787 chr22 0 50818468 PP3_supporting PP3_supporting 5_prime_UTR_variant 2
788 chr22 0 50818468 PP3_supporting PP3_supporting missense_variant 271
789 chr22 0 50818468 PP3_supporting PP3_supporting splice_region_variant 13
790 chr22 0 50818468 PP3_supporting PP3_supporting start_lost 3

791 rows × 7 columns

Now we override ##part_counts## and for the whole genome:

In [23]:
mydefs = qh.add(f"""create ##part_counts## = pgor -split <(gor [#segs#]) [##BP4_PP3_scores##] 
| varjoin -r -xl transcript_stable_id -xr feature ##veppath##/vep_multi_wgs.gord
| group chrom -gc Revel_BP4_PP3,Mutpred_BP4_PP3,consequence -count""")

and run a summary showing how the evidence levels are covered by different VEP impacts:

In [24]:
%%gor
$mydefs
nor [##part_counts##] 
| map -c consequence ##ref##/VEP_impact.map
| group -gc Revel_BP4_PP3,vep_impact -sum -ic allcount
| granno -gc Revel_BP4_PP3 -sum -ic sum_allcount
| calc f form(100*sum_allcount/sum_sum_allcount,2,1)+'%'
| rename sum_allcount count
| select revel_bp4_pp3,vep_impact,f,count
| pivot -gc revel_bp4_pp3 vep_impact -v 'HIGH','MODERATE','LOW','LOWEST' -e '0%'
| calc order decode(#1,'BP4_verystrong,1,BP4_strong,2,BP4_moderate,3,BP4_supporting,4,PP3_supporting,5,PP3_moderate,6,PP3_strong,7,0')
| sort -c order:r | hide order
Query ran in 3.91 sec.
Query fetched 8 rows in 6.14 sec (total time 10.05 sec)
                                             
Out[24]:
Revel_BP4_PP3 HIGH_f HIGH_count MODERATE_f MODERATE_count LOW_f LOW_count LOWEST_f LOWEST_count
0 PP3_strong 0.1% 282 94.4% 206552 5.3% 11704 0.1% 246
1 PP3_moderate 0.3% 2190 94.6% 685780 5.1% 36741 0.1% 428
2 PP3_supporting 0.5% 3315 94.8% 664206 4.7% 33059 0.1% 371
3 BP4_supporting 0.6% 13501 95.3% 2061078 4.1% 87969 0.0% 966
4 BP4_moderate 0.2% 12682 96.3% 5934168 3.5% 215552 0.0% 2658
5 BP4_strong 0.1% 320 97.2% 349011 2.6% 9515 0.1% 341
6 BP4_verystrong 0.1% 10 97.6% 18445 2.3% 435 0.1% 15
7 NaN 17.1% 822918 78.7% 3776064 4.1% 197932 0.1% 3973

Because we aggregated the counts for -gc Revel_BP4_PP3,Mutpred_BP4_PP3,consequence, we can use the same CREATE statement ##part_counts## to summarize for Mutpred a second:

In [25]:
%%gor
$mydefs
nor [##part_counts##] 
| map -c consequence ##ref##/VEP_impact.map
| group -gc Mutpred_BP4_PP3,vep_impact -sum -ic allcount
| granno -gc Mutpred_BP4_PP3 -sum -ic sum_allcount
| calc f form(100*sum_allcount/sum_sum_allcount,2,1)+'%'
| rename sum_allcount count
| select Mutpred_BP4_PP3,vep_impact,f,count
| pivot -gc Mutpred_BP4_PP3 vep_impact -v 'HIGH','MODERATE','LOW','LOWEST' -e '0%'
| calc order decode(#1,'BP4_verystrong,1,BP4_strong,2,BP4_moderate,3,BP4_supporting,4,PP3_supporting,5,PP3_moderate,6,PP3_strong,7,0')
| sort -c order:r | hide order
Query ran in 1.01 sec
Query fetched 6 rows in 0.20 sec (total time 1.21 sec)
                                             
Out[25]:
Mutpred_BP4_PP3 HIGH_f HIGH_count MODERATE_f MODERATE_count LOW_f LOW_count LOWEST_f LOWEST_count
0 PP3_strong 17.1% 22489 77.9% 102720 4.9% 6427 0.2% 200
1 PP3_moderate 1.1% 5085 94.3% 427941 4.5% 20637 0.1% 303
2 PP3_supporting 0.5% 2778 95.2% 577747 4.3% 25966 0.0% 227
3 BP4_supporting 0.1% 4879 96.0% 3254904 3.8% 129346 0.1% 1932
4 BP4_moderate 0.1% 1853 96.4% 1196030 3.3% 41469 0.1% 1122
5 NaN 8.8% 818134 87.2% 8135962 4.0% 369062 0.1% 5214

Finally, if we want to test these queries in the Sequence Miner, we can get all the complete definitions like this:

In [26]:
print(qh.defs())
def ##ref## = ref;
def ##dbnsfp## = ##ref##/variants/dbnsfp.gorz
| rename ensembl_transcriptid transcript_stable_id | rename genename gene_symbol;
create ##dbnsfp_small## = pgor -split 100 ##dbnsfp## 
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,revel_score,mutpred_score
| split 5- -s ';'
| granno 1 -gc ref,alt -min -fc revel_score,mutpred_score
| calc new_revel_score if(not(isfloat(revel_score)),min_revel_score,float(revel_score))
| calc new_mutpred_score if(not(isfloat(mutpred_score)),min_mutpred_score,float(mutpred_score))
| select chr,pos,ref,alt,transcript_stable_id,gene_symbol,new_revel_score,new_mutpred_score
| replace new_* if(#rc='NaN','.',#rc)
| rename new_(.*) #{1}
;
create ##BP4_PP3_scores## = pgor [##dbnsfp_small##]
| calc Revel_BP4_PP3 if(revel_score!='.',if(float(revel_score) <= 0.003,'BP4_verystrong',
if(0.003 < float(revel_score) and float(revel_score) <= 0.016,'BP4_strong',
if(0.016 < float(revel_score) and float(revel_score) <= 0.183,'BP4_moderate',
if(0.183 < float(revel_score) and float(revel_score) <= 0.290,'BP4_supporting',
if(0.644 <= float(revel_score) and float(revel_score) < 0.773,'PP3_supporting',
if(0.773 <= float(revel_score) and float(revel_score) < 0.932,'PP3_moderate',
if(0.932 <= float(revel_score),'PP3_strong',''))))))),'')
| calc Mutpred_BP4_PP3 if(mutpred_score!='.',if(float(mutpred_score) <= 0.010,'BP4_strong',
if(0.010 < float(mutpred_score) and float(mutpred_score) <= 0.197,'BP4_moderate',
if(0.197 < float(mutpred_score) and float(mutpred_score) <= 0.391,'BP4_supporting',
if(0.737 <= float(mutpred_score) and float(mutpred_score) < 0.829,'PP3_supporting',
if(0.829 <= float(mutpred_score) and float(mutpred_score) < 0.932,'PP3_moderate',
if(0.932 <= float(mutpred_score),'PP3_strong','')))))),'')
;
def ##veppath## = source/anno/vep_v107;
create #segs# = pgor ##veppath##/allvars.gord 
| join -snpseg -ic ##ref##/ensgenes/ensgenes_transcripts.gor
| replace overlapcount if(overlapcount = 0,1,overlapcount)
| group 10000 -sum -ic overlapcount | seghist 1084847;
create ##part_counts## = pgor -split <(gor [#segs#]) [##BP4_PP3_scores##] 
| varjoin -r -xl transcript_stable_id -xr feature ##veppath##/vep_multi_wgs.gord
| group chrom -gc Revel_BP4_PP3,Mutpred_BP4_PP3,consequence -count;

In [ ]: