In [1]:
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1

project = "bch_connect_hg38"
%env GOR_API_PROJECT={project}

Parallel GORpipe queries with PARALLEL¶

The PARALLEL macro command is similar to PGOR and PARTGOR, however, unlike them it can be used both for GOR and NOR command and provides flexibility to make parallel queries based on any parameters provided in a relation/table. First we will start with a simple example where we use PARALLEL instead of PGOR.

In [7]:
qh = GQH.GOR_Query_Helper()
mydefs = qh.add("""create #segs# = gor #genes# | group chrom | segspan -maxseg 100000000;""")
In [10]:
%%gor
$mydefs
nor [#segs#] | granno -count | top 10
Query ran in 0.09 sec
Query fetched 10 rows in 0.01 sec (total time 0.11 sec)
                                             
Out[10]:
Chrom bpStart bpStop segCount allCount
0 chr1 0 62239105 1 45
1 chr1 62239105 124478211 1 45
2 chr1 124478211 186717316 1 45
3 chr1 186717316 248956422 1 45
4 chr10 0 66898711 1 45
5 chr10 66898711 133797422 1 45
6 chr11 0 67543311 1 45
7 chr11 67543311 135086622 1 45
8 chr12 0 66637654 1 45
9 chr12 66637654 133275309 1 45

We see that we get 45 zero-based segments, each no more than 100Mbases in size. Lets now calculate the numbers of variants in dbSNP.

In [24]:
mydefs = qh.add("""create #pcounts# = parallel -parts <(nor [#segs#] | replace bpStart bpStart+1)
<(gor -p #{col:Chrom}:#{col:bpStart}-#{col:bpStop} #dbsnp# | group chrom -count);""")
In [18]:
%%gor
$mydefs
gor [#pcounts#] | top 10
Query ran in 0.79 sec
Query fetched 10 rows in 0.03 sec (total time 0.82 sec)
                                             
Out[18]:
chrom bpStart bpStop allCount
0 chr1 0 248956422 25553008
1 chr1 0 248956422 18569734
2 chr1 0 248956422 27038573
3 chr1 0 248956422 26007566
4 chr10 0 133797422 27554851
5 chr10 0 133797422 29196823
6 chr11 0 135086622 28411265
7 chr11 0 135086622 29833483
8 chr12 0 133275309 27487946
9 chr12 0 133275309 28683770

We see that we get multiple counts just like if we had done a PGOR. Unlike for PGOR, the temp table dictionary does not have chromasomal range partition metadata:

In [19]:
%%gor
$mydefs
nor -asdict [#pcounts#] | top 10
Query ran in 0.78 sec
Query fetched 10 rows in 0.01 sec (total time 0.79 sec)
                                             
Out[19]:
col1 col2
0 ../../../../cache/result_cache/cache01/ae1/ae1... 45
1 ../../../../cache/result_cache/cache01/4a4/4a4... 44
2 ../../../../cache/result_cache/cache01/709/709... 43
3 ../../../../cache/result_cache/cache01/bab/bab... 42
4 ../../../../cache/result_cache/cache01/4b6/4b6... 41
5 ../../../../cache/result_cache/cache01/11a/11a... 40
6 ../../../../cache/result_cache/cache01/d22/d22... 39
7 ../../../../cache/result_cache/cache01/ede/ede... 38
8 ../../../../cache/result_cache/cache01/21b/21b... 37
9 ../../../../cache/result_cache/cache01/6ec/6ec... 36

This is because PARALLEL is designed for both GOR and NOR execution and as is stands does always return same types of metadata (as it stands now). We can therefore modify the query and have it return non-gor data:

In [38]:
mydefs = qh.add("""create #pcounts# = parallel -parts <(nor [#segs#] | replace bpStart bpStart+1)
<(nor <(gor -p #{col:Chrom}:#{col:bpStart}-#{col:bpStop} #dbsnp# | group chrom -count) 
| calc range '#{col:Chrom}:#{col:bpStart}-#{col:bpStop}' | select range,allcount) ;""")
In [40]:
%%gor
$mydefs
nor [#pcounts#] | sort -c range | top 10
Query ran in 0.87 sec
Query fetched 10 rows in 0.22 sec (total time 1.09 sec)
                                             
Out[40]:
range allCount
0 chr10:1-66898711 29196823
1 chr10:66898712-133797422 27554851
2 chr11:1-67543311 29833483
3 chr11:67543312-135086622 28411265
4 chr12:1-66637654 28683770
5 chr12:66637655-133275309 27487946
6 chr13:1-57182164 18190940
7 chr13:57182165-114364328 24140223
8 chr14:1-53521859 15182593
9 chr14:53521860-107043718 22303823

We can indeed get similar results using PGOR with split option:

In [41]:
mydefs = qh.add("""create #pgor_counts# = pgor -split <(gor [#segs#]) #dbsnp# | top 10
| calc range '#{CHROM}:#{BPSTART}-#{BPSTOP}'
| group chrom -gc range -count
;""")
In [43]:
%%gor
$mydefs
gor [#pgor_counts#] | top 10
Query ran in 1.66 sec
Query fetched 10 rows in 0.03 sec (total time 1.69 sec)
                                             
Out[43]:
chrom bpStart bpStop range allCount
0 chr1 0 248956422 chr1:124478212-186717316 10
1 chr1 0 248956422 chr1:186717317-248956422 10
2 chr1 0 248956422 chr1:1-62239105 10
3 chr1 0 248956422 chr1:62239106-124478211 10
4 chr10 0 133797422 chr10:1-66898711 10
5 chr10 0 133797422 chr10:66898712-133797422 10
6 chr11 0 135086622 chr11:67543312-135086622 10
7 chr11 0 135086622 chr11:1-67543311 10
8 chr12 0 133275309 chr12:1-66637654 10
9 chr12 0 133275309 chr12:66637655-133275309 10
In [45]:
%%gor
$mydefs
nor -asdict [#pgor_counts#] | granno -count | top 10
Query ran in 0.49 sec
Query fetched 10 rows in 0.01 sec (total time 0.50 sec)
                                             
Out[45]:
col1 col2 col3 col4 col5 col6 allCount
0 ../../../../cache/result_cache/cache01/9f1/9f1... 1 chr6 0 chr6 0 45
1 ../../../../cache/result_cache/cache01/4a7/4a7... 2 chr2 0 chr2 0 45
2 ../../../../cache/result_cache/cache01/43a/43a... 3 chr9 0 chr9 0 45
3 ../../../../cache/result_cache/cache01/8e8/8e8... 4 chr5 0 chr5 0 45
4 ../../../../cache/result_cache/cache01/c92/c92... 5 chr3 0 chr3 0 45
5 ../../../../cache/result_cache/cache01/79f/79f... 6 chr11 0 chr11 0 45
6 ../../../../cache/result_cache/cache01/b76/b76... 7 chrM 0 chrM 0 45
7 ../../../../cache/result_cache/cache01/0a9/0a9... 8 chr14 0 chr14 0 45
8 ../../../../cache/result_cache/cache01/657/657... 9 chr11 0 chr11 0 45
9 ../../../../cache/result_cache/cache01/0d1/0d1... 10 chr3 0 chr3 0 45
Note: PGOR and PARTGOR generate GOR dictionaries that can be used in downstream GOR queries with partitioned seek, whereas PARALLEL outputs dictionary without range metadata. Therefore, we typically recommend the use of them for GOR based queries while PARALLEL may be used with NORD type phenotype dictionaries and NOR style queries.

NOR type parallelization¶

Here we look at an example where we want to calculate column statistics of a table. First we select an arbitrary table with a mix of numeric and string columns.

In [56]:
qh2 = GQH.GOR_Query_Helper()
mydefs2 = qh2.add_many("""
def #maxrows# = 1000000;
create #table# = nor ref/ensgenes/ensexons.gorz | calc f1 random() | calc f2 random() | calc f3 sqr(f1/(1+sqr(f2)));
""")

Firs we will run a generic query that is relatively slow but memory efficient to discover the column types:

In [59]:
mydefs2 = qh2.add("""
create #coltypes# = nor [#table#] | calc dummys 's' | calc dummyi 1
| unpivot 1-
| where col_value != 'NA'
| calc type if(isint(col_value),'I',if(isfloat(col_value),'F','S'))
| group -gc col_name,type
| group -gc col_name -set -sc type;
""")
In [60]:
%%gor
$mydefs2
nor [#coltypes#]
Query ran in 91.93 sec
Query fetched 15 rows in 0.01 sec (total time 91.94 sec)
                                             
Out[60]:
Col_Name set_type
0 chrom S
1 chromend I
2 chromstart I
3 dummyi I
4 dummys S
5 exon S
6 f1 F
7 f2 F
8 f3 F
9 gene_biotype S
10 gene_stable_id S
11 gene_symbol S
12 strand S
13 transcript_biotype S
14 transcript_stable_id S

Ignore the dummy columns that are created for now - we will explain them later. Now we add definitions to calculate more complex statistics, both for numeric and string columns. Because this analysis uses the RANK command, we will run this in parallel for a subset of columns at the time, because it will require significant memory and compute.

In [77]:
mydefs2 = qh2.add_many("""
def #parallel_parts# = 4;

create #fcolparts# = nor [#coltypes#] | where set_type in ('F','I','F,I')  | select col_name 
| rownum | calc g mod(rownum,#parallel_parts#) | group -gc g -lis -sc col_name;

create #scolparts# = nor [#coltypes#] | where not(set_type in ('F','I','F,I'))  | select col_name
| rownum | calc g mod(rownum,#parallel_parts#) | group -gc g -lis -sc col_name;

create #numstat# = parallel -parts [#fcolparts#] <(nor [#table#]
| unpivot 1-
| where col_value != 'NA'
| inset -c col_name <(nor [#coltypes#] | where listhasany('#{col:lis_col_name}',col_name))
| rank col_value -o asc -gc col_name -q 
| calc v10 if(lowOReqRank <= 0.1, col_value, -1e100)
| calc v25 if(lowOReqRank <= 0.25, col_value, -1e100)
| calc v50 if(lowOReqRank <= 0.5, col_value, -1e100)
| calc v75 if(lowOReqRank <= 0.75, col_value, -1e100)
| calc v90 if(lowOReqRank <= 0.90, col_value, -1e100)
| group -gc col_name -count -min -max -avg -std -fc col_value,v10-
| rename allcount nonNA
| select col_name-std_col_value,max_v10,max_v25,max_v50,max_v75,max_v90
| rename max_(v.*) #{1}
);
create #stringstat# = parallel -parts [#scolparts#] <(nor [#table#]
| unpivot 1-
| inset -c col_name <(nor [#coltypes#] | where listhasany('#{col:lis_col_name}',col_name))
| group -gc col_name,col_value -count
| sort -c col_name,allcount:nr
| rename allcount value_count
| group -gc col_name -min -max -len 1000000 -lis -sc col_value,value_count
| calc distValues listsize(lis_value_count)
| calc histo str(distValues)+':'+listmap(listfilter(listzip(lis_col_value,lis_value_count),'i<=10'),'brackets(x)')+if(listsize(lis_value_count)>10,',...','')
| hide lis_*
);
""")

In #fcolparts# and #scolparts# we make #parallel_parts# list of numeric and string columns, respectively. We group the column names together into lis_col_name and use a relation with this column to generate parallel partitions for #numstat# and #stringstat#. We ensure that the parallel commands don't get an emtpy relation with no row by using the dummy columns mentioned above (because corrently it does not tolerate it). Notice that each parallel query will read all of #table#, however, we unpivot and filter with listhasany('#{col:lis_col_name}',col_name)) to limit the scope of data that goes into RANK and GROUP -gc col_name,col_value, both of which may require significant memory. The PARALLEL macro command replaces #{col:lis_col_name} with the appropriate value for the partition at hand. We generate ranges for the numeric distribution in v10-v90 and a histogram with the most common strings.

We generate a description for all the columns in #table# by:

In [86]:
%%gor
$mydefs2

nor [#coltypes#] | map -c col_name [#numstat#]
| merge <(nor [#coltypes#] | map -c col_name [#stringstat#])
| where not(col_name in ('dummyi','dummys'))
Query ran in 0.40 sec
Query fetched 13 rows in 0.06 sec (total time 0.46 sec)
                                             
Out[86]:
Col_Name set_type nonNA min_Col_Value max_Col_Value avg_Col_Value std_Col_Value v10 v25 v50 v75 v90 min_value_count max_value_count distValues histo
0 chrom S NaN chr1 chrY NaN NaN NaN NaN NaN NaN NaN 104300.0 91380.0 25.0 25:(chr1;145831),(chr2;123667),(chr3;104300),(...
1 chromend I 1572313.0 647.0 2.48937043E8 7.604323e+07 5.600179e+07 1.130701e+07 3.260182e+07 6.383704e+07 1.117808e+08 1.553267e+08 NaN NaN NaN NaN
2 chromstart I 1572313.0 576.0 2.4893658E8 7.604297e+07 5.600179e+07 1.130672e+07 3.260151e+07 6.383714e+07 1.117807e+08 1.553259e+08 NaN NaN NaN NaN
3 exon S NaN ENSE00000000003 ENSE00003970673 NaN NaN NaN NaN NaN NaN NaN 1.0 98.0 491801.0 491801:(ENSE00001639513;201),(ENSE00001666408;...
4 f1 F 1572313.0 1.9561712161220157E-7 0.9999998997299483 5.001228e-01 2.887756e-01 9.975428e-02 2.499995e-01 5.002753e-01 7.503033e-01 9.001638e-01 NaN NaN NaN NaN
5 f2 F 1572313.0 6.00124052674289E-7 0.9999996842541631 5.002463e-01 2.885996e-01 1.002356e-01 2.504988e-01 5.005184e-01 7.503596e-01 9.000775e-01 NaN NaN NaN NaN
6 f3 F 1572313.0 2.9139279845133705E-14 0.9998610928646812 2.142417e-01 2.213703e-01 5.607365e-03 3.527133e-02 1.407709e-01 3.238778e-01 5.541057e-01 NaN NaN NaN NaN
7 gene_biotype S NaN IG_C_gene vault_RNA NaN NaN NaN NaN NaN NaN NaN 1.0 99.0 40.0 40:(protein_coding;1322775),(lncRNA;188620),(t...
8 gene_stable_id S NaN ENSG00000000003 ENSG00000289719 NaN NaN NaN NaN NaN NaN NaN 1.0 997.0 61498.0 61498:(ENSG00000145362;3796),(ENSG00000224078;...
9 gene_symbol S NaN 5S_rRNA snoZ196 NaN NaN NaN NaN NaN NaN NaN 1.0 997.0 39850.0 39850:(na;89965),(ANK2;3796),(SNHG14;2915),(AB...
10 strand S NaN + - NaN NaN NaN NaN NaN NaN NaN 769432.0 802881.0 2.0 2:(+;802881),(-;769432)
11 transcript_biotype S NaN IG_C_gene vault_RNA NaN NaN NaN NaN NaN NaN NaN 1.0 99.0 44.0 44:(protein_coding;823843),(nonsense_mediated_...
12 transcript_stable_id S NaN ENST00000000233 ENST00000697417 NaN NaN NaN NaN NaN NaN NaN 1.0 98.0 246450.0 246450:(ENST00000589042;363),(ENST00000591111;...

We can display the histogram with wider columns like this:

In [87]:
%%gor dfHisto <<
$mydefs2
nor [#stringstat#] | select col_name,histo
Query ran in 0.57 sec
Query fetched 8 rows in 0.03 sec (total time 0.60 sec)
                                             
In [89]:
for i in range(0,len(dfHisto)):
    print(f"{i+1}\t{dfHisto.at[i,'Col_Name']}\t{dfHisto.at[i,'histo']}")
    print("----------------")
1	exon	491801:(ENSE00001639513;201),(ENSE00001666408;199),(ENSE00001912145;151),(ENSE00001352600;143),(ENSE00001926576;137),(ENSE00001854584;131),(ENSE00001723087;122),(ENSE00001697882;118),(ENSE00001836596;117),(ENSE00001859880;113),...
----------------
2	strand	2:(+;802881),(-;769432)
----------------
3	gene_symbol	39850:(na;89965),(ANK2;3796),(SNHG14;2915),(ABCC8;2691),(SNX14;2419),(DYNC1H1;2357),(IFT122;2338),(KCNMA1;2138),(MAPK10;2044),(PLCB4;1918),...
----------------
4	chrom	25:(chr1;145831),(chr2;123667),(chr3;104300),(chr11;91380),(chr17;88751),(chr12;86588),(chr19;81557),(chr7;74905),(chr6;73980),(chr5;70514),...
----------------
5	gene_stable_id	61498:(ENSG00000145362;3796),(ENSG00000224078;2915),(ENSG00000006071;2691),(ENSG00000135317;2419),(ENSG00000197102;2357),(ENSG00000163913;2338),(ENSG00000156113;2138),(ENSG00000109339;2044),(ENSG00000101333;1918),(ENSG00000155657;1700),...
----------------
6	transcript_stable_id	246450:(ENST00000589042;363),(ENST00000591111;313),(ENST00000342992;312),(ENST00000342175;191),(ENST00000359218;191),(ENST00000460472;191),(ENST00000397345;182),(ENST00000427231;182),(ENST00000454784;173),(ENST00000409198;150),...
----------------
7	gene_biotype	40:(protein_coding;1322775),(lncRNA;188620),(transcribed_unprocessed_pseudogene;22141),(processed_pseudogene;11768),(unprocessed_pseudogene;7847),(transcribed_unitary_pseudogene;5151),(transcribed_processed_pseudogene;2886),(misc_RNA;2211),(snRNA;1901),(miRNA;1877),...
----------------
8	transcript_biotype	44:(protein_coding;823843),(nonsense_mediated_decay;212755),(lncRNA;186275),(retained_intron;168473),(processed_transcript;142138),(processed_pseudogene;11768),(unprocessed_pseudogene;7808),(transcribed_unprocessed_pseudogene;5855),(misc_RNA;2211),(snRNA;1901),...
----------------

For more examples of the use of the PARALLEL macro command, see the notebook HPO_Phewas.

In [ ]: