In [17]:
%%capture
# load the magic extension and imports
%reload_ext nextcode
import pandas as pd
import GOR_query_helper as GQH
%env LOG_QUERY=1

project = "bch_connect_hg38"
%env GOR_API_PROJECT={project}

Genome JOINS¶

This notebook explains how genomic joins work in GOR and some of the subtle details around their performance. The genomic features that can be used in spatial joins are of three types (sketched below):

  • snp (single nucleotide position, e.g. chrom,pos)
  • seg (segments, e.g. chrom,bpstart,bpstop)
  • var (variants, e.g. segments based on chrom,pos,ref)
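
As a rough sketch, the leading columns for each type look like this (the trailing annotation columns are illustrative only):

  snp:  chrom  pos                      rsids, ...
  seg:  chrom  bpstart  bpstop          gene_symbol, ...
  var:  chrom  pos      ref  alt        ...

Note that a variant effectively represents the segment from pos to pos+len(ref).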

There are two main commands, JOIN and VARJOIN; the former is a general-purpose join command whereas the latter is for joining two variant tables.
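
As a minimal sketch of the two commands (the file name myvars.gorz is hypothetical; options shown are one typical choice):

gor #genes# | join -segsnp #dbsnp#
gor myvars.gorz | varjoin -r #dbsnp#

The first joins gene segments with dbsnp positions; the second joins two variant relations on their variant columns, with -r keeping only the annotation columns of the right-source, as in the examples later in this notebook.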

GOR assumes consistent genomic ordering of both tables (files) and streams, and primarily uses merge-join logic (equivalent to sort-merge logic in non-ordered databases). What makes the merge logic particularly powerful is its ability to switch between streaming mode and seek mode, based on real-time cost optimization. Additionally, GOR can combine hash-based equi-joins on non-positional columns, providing a very efficient way to express genome-spatial joins for multiple samples.
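
For example, a genomic join can be combined with an equi-join on a sample column such as PN, using the -xl and -xr options demonstrated later in this notebook (the dictionary names here are hypothetical):

gor vars.gord -s PN | join -snpseg -xl PN -xr PN <(gor cov.gord -s PN)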

Join-order¶

Here we will explain how the order of relations matters in joins. Importantly, the left-source is always read fully while the right-source may not be, i.e. the join logic may decide to seek into the right-source.

In [3]:
%%gor
gor #genes# | where gene_symbol ~ 'brc*' | join -segseg #exons# | group chrom -count | calc t time()
Query ran in 2.37 sec
Query fetched 4 rows in 0.06 sec (total time 2.43 sec)
                                             
Out[3]:
chrom bpStart bpStop allCount t
0 chr13 0 114364328 188 1753
1 chr17 0 83257441 473 1753
2 chr5 0 181538259 1 1753
3 chrX 0 156040895 64 1753
In [4]:
%%gor
gor #exons#  | join -segseg <(gor #genes# | where gene_symbol ~ 'brc*') | group chrom -count | calc t time()
Query ran in 0.07 sec
Query fetched 4 rows in 1.92 sec (total time 1.98 sec)
                                             
Out[4]:
chrom bpStart bpStop allCount t
0 chr13 0 114364328 188 1921
1 chr17 0 83257441 473 1921
2 chr5 0 181538259 1 1921
3 chrX 0 156040895 64 1921

Above we see two equivalent queries that join a subset of the genes with all the exons. The first version is slightly faster because it does not have to read the entire #exons# table. The difference is not great, because the system reads all 1.5M exon rows very quickly in the second example. However, if we change exons to the variants in #dbsnp#, we see a much greater difference.

In [5]:
%%gor
gor #genes# | where gene_symbol ~ 'brc*' | join -segsnp #dbsnp# | group chrom -count | calc t time()
Query ran in 0.64 sec
Query fetched 4 rows in 3.71 sec (total time 4.35 sec)
                                             
Out[5]:
chrom bpStart bpStop allCount t
0 chr13 0 114364328 61303 4197
1 chr17 0 83257441 75293 4197
2 chr5 0 181538259 394 4197
3 chrX 0 156040895 13932 4197

If we turn the query around, however, it will take forever to run, i.e.

gor #dbsnp# | join -segseg <(gor #genes# | where gene_symbol ~ 'brc*') | group chrom -count | calc t time()

will read all the 1.2 billion rows. In the former example, the JOIN command made 4 seeks into #dbsnp# and only variants in the neighborhood of the genes were read. We will explore this further below.

Elevator pitch - skip-scan-join¶

Here we will explain how the cost of random access vs. the cost of streaming determines when to perform a seek rather than simply continuing to stream. Although #genes# and #exons# contain segments, we will use -snpsnp joins against #dbsnp#, treating the segment start as a position, just to explain the principle.

In [6]:
%%gor
gor -p chr1 #genes# | join -snpsnp -l -e 'missing' #dbsnp# 
| calc found if(rsids!='missing',1,0)
| group chrom -count -sum -ic found | calc t time() | calc at t/allcount
Query ran in 0.28 sec
Query fetched 1 rows in 55.10 sec (total time 55.38 sec)
                                             
Out[6]:
chrom bpStart bpStop allCount sum_found t at
0 chr1 0 248956422 4345 2091 55169 12.697123
In [7]:
%%gor
gor -p chr1 #dbsnp# | join -snpsnp #genes# | group chrom -count | calc t time() | calc at t/allcount
Query ran in 0.36 sec
Query fetched 1 rows in 83.24 sec (total time 83.60 sec)
                                             
Out[7]:
chrom bpStart bpStop allCount t at
0 chr1 0 248956422 2091 83316 39.84505

We see that the former query, which reads the big #dbsnp# table as the right-source, takes 55sec while the query that reads #dbsnp# fully as the left-source takes 83sec. The reason is that the former query does indeed perform 4243 seeks into the #dbsnp# table, spending on average 13msec per seek. This reflects the random-access cost of S3 object-storage. Next we will move chr1 of #dbsnp# into Lustre by putting it into a temporary table.

In [8]:
qh = GQH.GOR_Query_Helper()
mydefs = qh.add("""create temp_dbsnp = gor -p chr1 #dbsnp#;""")
In [13]:
%%gor
$mydefs
gor -p chr1 #genes# | join -snpsnp -l -e 'missing' [temp_dbsnp] 
| calc found if(rsids!='missing',1,0)
| group chrom -count -sum -ic found | calc t time() | calc at t/allcount
Query ran in 0.17 sec
Query fetched 1 rows in 1.90 sec (total time 2.08 sec)
                                             
Out[13]:
chrom bpStart bpStop allCount sum_found t at
0 chr1 0 248956422 4345 2091 1916 0.440967

Now the query took less than 2sec! We see that the average seek cost is only 0.4msec, i.e. about 30 times faster random access than on S3. Now we switch the join order.

In [14]:
%%gor
$mydefs
gor -p chr1 [temp_dbsnp]  | join -snpsnp #genes# | group chrom -count | calc t time() | calc at t/allcount
Query ran in 0.39 sec
Query fetched 1 rows in 80.54 sec (total time 80.93 sec)
                                             
Out[14]:
chrom bpStart bpStop allCount t at
0 chr1 0 248956422 2091 80553 38.523673

We see that the streaming speed on Lustre and S3 is practically the same for the 97M rows on chr1. To hammer this point home, let's try #exons# instead of #genes#; two orders of magnitude more rows to join.

In [15]:
%%gor
$mydefs
gor -p chr1 #exons# | join -snpsnp -l -e 'missing' [temp_dbsnp] 
| calc found if(rsids!='missing',1,0)
| group chrom -count -sum -ic found | calc t time() | calc at t/allcount
Query ran in 0.30 sec
Query fetched 1 rows in 7.68 sec (total time 7.98 sec)
                                             
Out[15]:
chrom bpStart bpStop allCount sum_found t at
0 chr1 0 248956422 167823 71618 7711 0.045947

We see that we get close to 168k rows in 8sec, with the average seek cost now down to 0.05msec! Finally, we try this for the S3 version:

In [16]:
%%gor
$mydefs
gor -p chr1 #exons# | join -snpsnp -l -e 'missing' #dbsnp# 
| calc found if(rsids!='missing',1,0)
| group chrom -count -sum -ic found | calc t time() | calc at t/allcount
Query ran in 0.26 sec
Query fetched 1 rows in 54.42 sec (total time 54.68 sec)
                                             
Out[16]:
chrom bpStart bpStop allCount sum_found t at
0 chr1 0 248956422 167823 71618 54519 0.32486

The query time is significantly slower, i.e. 54sec vs 8sec, but we see that the average seek cost is now down to 0.32msec. This is because many of the exons are close together; rather than performing a seek for every exon segment, the JOIN command may revert to a skip-scan pattern, seeking into #dbsnp# when segments are far apart and streaming when segments are close to each other.

Note: Seek time is not only affected by the random-access cost of the underlying file-system. It also depends on how many files are accessed, e.g. a dictionary with 100 files has 100 times slower random access.

We now look at the use-case of fetching variants from all samples in a project where there are about 25k samples stored and bucketized into over 150 buckets. First we fetch all the variants in a single exon:

In [25]:
%%gor
$mydefs
gor -p chr1 #exons# | grep WASH7P | top 1
| join -segsnp -ir <(gor #wgsvars#)
| group chrom -count | calc t time()
Query ran in 0.76 sec
Query fetched 1 rows in 0.59 sec (total time 1.35 sec)
                                             
Out[25]:
CHROM bpStart bpStop allCount t
0 chr1 0 248956422 1497 1155

We see that it took over a second to fetch 1497 rows from these 25k samples (actually longer on first execution, since the file system was cold). Now we select 5 samples from the project and run this experiment again, this time fetching variants in the same exon from only these five samples:

In [26]:
%%gor myPNs <<
nor <(gor #wgsvars# | top 1000) | select PN | distinct | top 5
Query ran in 1.52 sec
Query fetched 5 rows in 0.66 sec (total time 2.17 sec)
                                             
In [28]:
%%gor
$mydefs
gor -p chr1 #exons# | grep WASH7P | top 1
| join -segsnp -ir <(gor #wgsvars# -ff [var:myPNs])
| group chrom -count | calc t time()
Query ran in 0.29 sec
Loading relations [var:myPNs] from local state
Query ran in 0.62 sec
Query fetched 1 rows in 0.10 sec (total time 0.72 sec)
                                             
Out[28]:
CHROM bpStart bpStop allCount t
0 chr1 0 248956422 3 233

As expected, we see a significantly faster response because we are only doing random access into 5 files instead of the 150 or so before.

Maximum segment size¶

When we join against a right-source that has segments, the system needs information about the segment sizes, because GOR only indexes the data implicitly by the start position of the features and does not, at the moment, rely on any file metadata to describe the segment-size distribution in files.
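
The reason the segment size matters is that, to find all segments overlapping a given position, the join must start reading the right-source at that position minus the maximum segment size, because only the segment start positions are ordered. A minimal sketch (the file names are hypothetical):

gor myvars.gorz | join -snpseg -maxseg 10000 mycov.gorz

With -maxseg 10000 the join only has to look back 10kbp from each variant position, instead of the 4Mbp gene-sized default discussed below.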

We will now make a query that uses a join to fetch variants from two genes and then joins them with sequence read coverage segments from the same samples. In order to exaggerate the times, we pick 100 samples.

In [49]:
qh2 = GQH.GOR_Query_Helper()
mydefs2 = qh2.add_many("""
def #wgsvars# = source/var/wgs_varcalls.gord -s PN;
def #segcov# = source/cov/segment_cov.gord -s PN;
create #mysegs# = gor -p chr13 #genes# | merge <(gor -p chr17 #genes#) | grep brca;
create #mypns# = nor -asdict #wgsvars# | select #2 | top 100;
""")
In [51]:
%%gor
$mydefs2

gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| calc t1 time()
| join -snpseg -l -e 0 <(gor #segcov# -ff [#mypns#]) | calc t2 time()
| calc t t2-t1
| group chrom -count -min -max -ic t
Query ran in 0.14 sec
Query fetched 2 rows in 16.76 sec (total time 16.90 sec)
                                             
Out[51]:
CHROM bpStart bpStop allCount min_t max_t
0 chr13 0 114364328 212149 0 5158
1 chr17 0 83257441 222009 0 10149

In the above query, we see that the time difference between before and after adding the sequence coverage information is between 5sec and 10sec (i.e. max_t). When the JOIN command with the -snpseg option sees a nested query, it will by default assume that the maximum segment size in the nested query is that of a gene, i.e. 4Mbp. However, when these segments were generated, a SEGSPAN command was used with a -maxseg of 10kbp. Thus, if we provide this parameter, the seek can be done in a more efficient manner:

In [52]:
%%gor
$mydefs2

gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| calc t1 time()
| join -snpseg -l -e 0 -maxseg 10000 <(gor #segcov# -ff [#mypns#]) | calc t2 time()
| calc t t2-t1
| group chrom -count -min -max -ic t
Query ran in 0.19 sec
Query fetched 2 rows in 1.66 sec (total time 1.84 sec)
                                             
Out[52]:
CHROM bpStart bpStop allCount min_t max_t
0 chr13 0 114364328 212149 0 221
1 chr17 0 83257441 222009 0 145

We see from the above a much-reduced overhead in fetching the first rows from the #segcov# query.

Alert: One should always specify -maxseg in a JOIN with the -snpseg, -segseg, or -varseg options.
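
For context, a hedged sketch of how such size-bounded segments might be produced at write time (the file names are hypothetical):

gor raw_coverage.gorz | segspan -maxseg 10000 | write segment_cov.gorz

Bounding the segment span at write time is what makes the small -maxseg assertion safe at read time.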

Hash-option for additional equi-join¶

In the above query, we joined the two tables based only on genomic overlap; we did not look at the PN column present in both relations. More typically, we would want only rows from the same sample to join. We can do that like this:

Note: We are only covering genomic joins in this notebook and not MAP and MULTIMAP, which do indeed use a hash-join approach to perform non-genomic joins, e.g. for phenotypes or other similar annotations where the right-source is not crazy-big!
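
For completeness, a hedged sketch of such a non-genomic lookup (assuming the usual map -c <column> <file> form and a hypothetical phenotypes.tsv keyed on pn):

gor #wgsvars# -ff [#mypns#] | map -c pn phenotypes.tsv

Here MAP holds the phenotype relation in memory and appends its annotation columns by hashing on the pn value of each row.
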
In [53]:
%%gor
$mydefs2

gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| join -snpseg -l -e 0 -maxseg 10000 <(gor #segcov# -ff [#mypns#] | where random()<0.9) 
| where PN = PNx
| group chrom -count | calc t time()
Query ran in 0.18 sec
Query fetched 2 rows in 1.47 sec (total time 1.66 sec)
                                             
Out[53]:
CHROM bpStart bpStop allCount t
0 chr13 0 114364328 2782 1600
1 chr17 0 83257441 3136 1600

Notice however that the condition PN = PNx does not respect the fact that we specified a left-join with the -l option. Therefore, given that the #segcov# table is sparse (i.e. it may have no segments where the depth is zero; here we use random() to ensure that is the case in this demo), our result may be wrong! GOR does however have a special LEFTWHERE command that can be used in scenarios like this.

In [54]:
%%gor
$mydefs2

gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| join -snpseg -l -e 0 -maxseg 10000 <(gor #segcov# -ff [#mypns#] | where random()<0.9)
| leftwhere distance[-1] PN = PNx
| group chrom -count | calc t time()
Query ran in 0.33 sec
Query fetched 2 rows in 1.51 sec (total time 1.84 sec)
                                             
Out[54]:
CHROM bpStart bpStop allCount t
0 chr13 0 114364328 3073 1642
1 chr17 0 83257441 3457 1642

A more appropriate way to write the above query is to use the -xl and -xr options to enforce the equi-join condition within the JOIN command.

In [55]:
%%gor
$mydefs2

gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| join -snpseg -l -e 0 -maxseg 10000 -xl pn -xr pn <(gor #segcov# -ff [#mypns#] | where random()<0.9)
| group chrom -count | calc t time()
Query ran in 0.22 sec
Query fetched 2 rows in 1.47 sec (total time 1.69 sec)
                                             
Out[55]:
CHROM bpStart bpStop allCount t
0 chr13 0 114364328 3073 1613
1 chr17 0 83257441 3457 1613

We see that the counts are the same as with the LEFTWHERE filter and the times are the same as before. If we increase the number of samples, however, we will see a big difference between the Cartesian-join and hash-join approaches:

In [57]:
mydefs2 = qh2.add("""create #mypns# = nor -asdict #wgsvars# | select #2 | top 1000""")

First the Cartesian-join approach with 1000 samples:

In [59]:
%%gor
$mydefs2

gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| join -snpseg -l -e 0 -maxseg 10000 <(gor #segcov# -ff [#mypns#] | where random()<0.9)
| leftwhere distance[-1] PN = PNx
| group chrom -count | calc t time()
Query ran in 0.71 sec
Query fetched 2 rows in 37.11 sec (total time 37.82 sec)
                                             
Out[59]:
CHROM bpStart bpStop allCount t
0 chr13 0 114364328 26848 37486
1 chr17 0 83257441 35289 37486

And now the hash-join approach:

In [60]:
%%gor
$mydefs2

gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| join -snpseg -l -e 0 -maxseg 10000 -xl pn -xr pn <(gor #segcov# -ff [#mypns#] | where random()<0.9)
| group chrom -count | calc t time()
Query ran in 0.30 sec
Query fetched 2 rows in 6.65 sec (total time 6.95 sec)
                                             
Out[60]:
CHROM bpStart bpStop allCount t
0 chr13 0 114364328 26848 6851
1 chr17 0 83257441 35289 6851

We see that with 1000 samples the hash-join approach is much faster; the difference grows with the number of samples being processed simultaneously.

Note: Unlike -xl and -xr, LEFTWHERE can be used to express any condition in a left-join. Sometimes one can combine a fuzzy-style equi-join with a refined condition expressed with LEFTWHERE, to implement fast but complex left-joins.
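
A hedged sketch of that combined pattern, reusing the query above (the depth column and its threshold are hypothetical):

gor [#mysegs#]
| join -segsnp -ir <(gor #wgsvars# -ff [#mypns#])
| join -snpseg -l -e 0 -maxseg 10000 -xl pn -xr pn <(gor #segcov# -ff [#mypns#])
| leftwhere distance[-1] depth >= 10

The -xl/-xr options handle the fast sample matching while LEFTWHERE applies the finer coverage condition without breaking the left-join semantics.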

Join column and genome order¶

We will end with a few examples that highlight how table/source order in joins affects the GOR-order of the output relation. Importantly, it is always the first two (or three) columns that are used to perform the genome-spatial join. However, we have also learned that the left-source is read fully; therefore, if we are joining a small table against a massive one, it is preferable to use the small table as the left-source and allow the system to seek into the right-source as needed.

We will start by finding a region with multiple overlapping genes:

In [61]:
qh3 = GQH.GOR_Query_Helper()
mydefs3 = qh3.add_many("""
create #busy# = gor #genes# | segproj 
| rank genome -o desc segcount | where rank_segcount = 1 
| join -segseg -ir -maxseg 4000000 #genes#;
""")
In [62]:
%%gor
$mydefs3

gor [#busy#]
Query ran in 1.28 sec
Query fetched 22 rows in 0.00 sec (total time 1.28 sec)
                                             
Out[62]:
chrom gene_start gene_end gene_symbol
0 chr5 141330513 141512975 PCDHGA1
1 chr5 141338759 141512975 PCDHGA2
2 chr5 141343828 141512975 PCDHGA3
3 chr5 141350098 141512975 PCDHGB1
4 chr5 141355020 141512975 PCDHGA4
5 chr5 141359993 141512975 PCDHGB2
6 chr5 141364161 141512975 PCDHGA5
7 chr5 141370241 141512975 PCDHGB3
8 chr5 141373890 141512975 PCDHGA6
9 chr5 141382738 141512975 PCDHGA7
10 chr5 141387697 141512975 PCDHGB4
11 chr5 141390156 141512975 PCDHGA8
12 chr5 141397946 141512975 PCDHGB5
13 chr5 141402777 141512975 PCDHGA9
14 chr5 141408020 141512975 PCDHGB6
15 chr5 141412986 141512975 PCDHGA10
16 chr5 141417644 141512975 PCDHGB7
17 chr5 141421046 141512975 PCDHGA11
18 chr5 141430506 141512975 PCDHGA12
19 chr5 141475946 141512977 PCDHGC3
20 chr5 141484996 141512975 PCDHGC4
21 chr5 141489080 141512975 PCDHGC5

We see that we get multiple overlapping genes. Now we join with the variants:

In [82]:
mydefs3 = qh3.add("""def #cvars# = ref/disvariants/clinical_variants.gorz | select 1-4,disease;""")
In [83]:
%%gor
$mydefs3

gor [#busy#] | join -segsnp <(gor #cvars#)
Query ran in 0.34 sec
Query fetched 5,865 rows in 0.13 sec (total time 0.47 sec)
                                             
Out[83]:
chrom gene_start gene_end gene_symbol distance pos ref alt disease
0 chr5 141330513 141512975 PCDHGA1 0 141330778 C A INBORN_GENETIC_DISEASES
1 chr5 141330513 141512975 PCDHGA1 0 141330954 A G INBORN_GENETIC_DISEASES
2 chr5 141330513 141512975 PCDHGA1 0 141331012 G T INBORN_GENETIC_DISEASES
3 chr5 141330513 141512975 PCDHGA1 0 141331196 A T INBORN_GENETIC_DISEASES
4 chr5 141330513 141512975 PCDHGA1 0 141331255 C T INBORN_GENETIC_DISEASES
... ... ... ... ... ... ... ... ... ...
5860 chr5 141489080 141512975 PCDHGC5 0 141491632 C A INBORN_GENETIC_DISEASES
5861 chr5 141489080 141512975 PCDHGC5 0 141511002 C T NOT_PROVIDED
5862 chr5 141489080 141512975 PCDHGC5 0 141511014 A G NOT_PROVIDED
5863 chr5 141489080 141512975 PCDHGC5 0 141511035 C T NOT_PROVIDED
5864 chr5 141489080 141512975 PCDHGC5 0 141511049 G A INBORN_GENETIC_DISEASES

5865 rows × 9 columns

We see that column 2 is the start position of the genes. Therefore, if we want to join the variants with #dbnsfp# in order to annotate them with MutPred_rankscore, we will have to perform the join either within the nested query or by switching the variant position and the gene start. Let's do the former first:

In [73]:
mydefs3 = qh3.add("""def #mutpred# = ref/variants/dbnsfp.gorz | select 1-4,MutPred_rankscore;""")
In [84]:
%%gor
$mydefs3

gor [#busy#] | join -segsnp <(gor #cvars# | varjoin -r <(gor #mutpred# ) )
Query ran in 0.31 sec
Query fetched 5,522 rows in 4.45 sec (total time 4.77 sec)
                                             
Out[84]:
chrom gene_start gene_end gene_symbol distance pos ref alt disease MutPred_rankscore
0 chr5 141330513 141512975 PCDHGA1 0 141330778 C A INBORN_GENETIC_DISEASES .
1 chr5 141330513 141512975 PCDHGA1 0 141330954 A G INBORN_GENETIC_DISEASES 0.97773
2 chr5 141330513 141512975 PCDHGA1 0 141331012 G T INBORN_GENETIC_DISEASES 0.66843
3 chr5 141330513 141512975 PCDHGA1 0 141331196 A T INBORN_GENETIC_DISEASES 0.45202
4 chr5 141330513 141512975 PCDHGA1 0 141331255 C T INBORN_GENETIC_DISEASES 0.54201
... ... ... ... ... ... ... ... ... ... ...
5517 chr5 141489080 141512975 PCDHGC5 0 141491324 T C INBORN_GENETIC_DISEASES 0.56767
5518 chr5 141489080 141512975 PCDHGC5 0 141491544 C G INBORN_GENETIC_DISEASES 0.24911
5519 chr5 141489080 141512975 PCDHGC5 0 141491591 G A INBORN_GENETIC_DISEASES 0.11515
5520 chr5 141489080 141512975 PCDHGC5 0 141491632 C A INBORN_GENETIC_DISEASES 0.06613
5521 chr5 141489080 141512975 PCDHGC5 0 141511049 G A INBORN_GENETIC_DISEASES .

5522 rows × 10 columns

The above approach works fine because, within the context of the nested query, the VARJOIN command acts on the pos columns of #cvars# and #mutpred#. Now we will try an alternative approach:

In [85]:
%%gor
$mydefs3

gor [#busy#] | join -segsnp <(gor #cvars# ) 
| select chrom,pos-disease,distance,gene_*
Query ran in 0.19 sec
Query fetched 5,865 rows in 0.13 sec (total time 0.32 sec)
                                             
Out[85]:
chrom pos ref alt disease distance gene_start gene_end gene_symbol
0 chr5 141330778 C A INBORN_GENETIC_DISEASES 0 141330513 141512975 PCDHGA1
1 chr5 141330954 A G INBORN_GENETIC_DISEASES 0 141330513 141512975 PCDHGA1
2 chr5 141331012 G T INBORN_GENETIC_DISEASES 0 141330513 141512975 PCDHGA1
3 chr5 141331196 A T INBORN_GENETIC_DISEASES 0 141330513 141512975 PCDHGA1
4 chr5 141331255 C T INBORN_GENETIC_DISEASES 0 141330513 141512975 PCDHGA1
... ... ... ... ... ... ... ... ... ...
5860 chr5 141491632 C A INBORN_GENETIC_DISEASES 0 141489080 141512975 PCDHGC5
5861 chr5 141511002 C T NOT_PROVIDED 0 141489080 141512975 PCDHGC5
5862 chr5 141511014 A G NOT_PROVIDED 0 141489080 141512975 PCDHGC5
5863 chr5 141511035 C T NOT_PROVIDED 0 141489080 141512975 PCDHGC5
5864 chr5 141511049 G A INBORN_GENETIC_DISEASES 0 141489080 141512975 PCDHGC5

5865 rows × 9 columns

And now we add the varjoin to #mutpred#:

In [86]:
%%gor
$mydefs3

gor [#busy#] | join -segsnp <(gor #cvars# ) 
| select chrom,pos-disease,distance,gene_*
| varjoin -r #mutpred#
Query ran in 0.28 sec
                                             
==== Data Error ====
VARJOIN
Wrong order observed at row: chr5	141338997	A	G	INBORN_GENETIC_DISEASES	0	141338759	141512975	PCDHGA2
Last row: chr5:141511049

Row: chr5	141338997	A	G	INBORN_GENETIC_DISEASES	0	141338759	141512975	PCDHGA2

Stack Trace:
org.gorpipe.exceptions.GorDataException: VARJOIN
Wrong order observed at row: chr5	141338997	A	G	INBORN_GENETIC_DISEASES	0	141338759	141512975	PCDHGA2
Last row: chr5:141511049
Row: chr5	141338997	A	G	INBORN_GENETIC_DISEASES	0	141338759	141512975	PCDHGA2

	at gorsat.Analysis.CheckOrder.process(CheckOrder.scala:39)
	at gorsat.Commands.Analysis.process(Analysis.scala:167)
	at gorsat.Analysis.Select2.process(Select2.scala:32)
	at gorsat.Commands.Analysis.process(Analysis.scala:167)
	at gorsat.Analysis.JoinAnalysis$SegOverlap.output_row(JoinAnalysis.scala:155)
	at gorsat.Analysis.JoinAnalysis$SegOverlap.nested_process(JoinAnalysis.scala:213)
	at gorsat.Analysis.JoinAnalysis$SegOverlap.process(JoinAnalysis.scala:375)
	at gorsat.Commands.Analysis.process(Analysis.scala:167)
	at gorsat.Commands.Analysis.process(Analysis.scala:167)
	at gorsat.Analysis.InRange.process(InRange.scala:36)
	at gorsat.Commands.Analysis.process(Analysis.scala:167)
	at gorsat.Monitors.CancelMonitor.process(CancelMonitor.scala:34)
	at gorsat.Commands.Analysis.process(Analysis.scala:167)
	at gorsat.Analysis.CheckOrder.process(CheckOrder.scala:41)
	at gorsat.Commands.Analysis.process(Analysis.scala:167)
	at gorsat.Monitors.CancelMonitor.process(CancelMonitor.scala:34)
	at gorsat.ReaderThread.run(ReaderThread.java:132)


We see that we get a Data Error because the rows are not in correct order! The VARJOIN command verifies the order but does not correct it; it assumes it and relies on it for performance. The reason this happens is that there are overlapping gene segments in the #busy# table and they overlapped multiple variants; when we changed the column order, we broke the genomic order. To fix this we can apply SORT. But we can be clever: we don't have to use a fully-blocking sort operation, because GOR provides a range-based sort. Since we know that the maximum segment size of genes is 4Mbp, we can inform SORT that this is the largest possible ordering error.

In [87]:
%%gor
$mydefs3

gor [#busy#] | join -segsnp <(gor #cvars# ) 
| select chrom,pos-disease,distance,gene_*
| sort 4000000
| varjoin -r <(gor #mutpred#)
Query ran in 0.25 sec
Query fetched 5,522 rows in 4.36 sec (total time 4.61 sec)
                                             
Out[87]:
chrom pos ref alt disease distance gene_start gene_end gene_symbol MutPred_rankscore
0 chr5 141330778 C A INBORN_GENETIC_DISEASES 0 141330513 141512975 PCDHGA1 .
1 chr5 141330954 A G INBORN_GENETIC_DISEASES 0 141330513 141512975 PCDHGA1 0.97773
2 chr5 141331012 G T INBORN_GENETIC_DISEASES 0 141330513 141512975 PCDHGA1 0.66843
3 chr5 141331196 A T INBORN_GENETIC_DISEASES 0 141330513 141512975 PCDHGA1 0.45202
4 chr5 141331255 C T INBORN_GENETIC_DISEASES 0 141330513 141512975 PCDHGA1 0.54201
... ... ... ... ... ... ... ... ... ... ...
5517 chr5 141511049 G A INBORN_GENETIC_DISEASES 0 141421046 141512975 PCDHGA11 .
5518 chr5 141511049 G A INBORN_GENETIC_DISEASES 0 141430506 141512975 PCDHGA12 .
5519 chr5 141511049 G A INBORN_GENETIC_DISEASES 0 141475946 141512977 PCDHGC3 .
5520 chr5 141511049 G A INBORN_GENETIC_DISEASES 0 141484996 141512975 PCDHGC4 .
5521 chr5 141511049 G A INBORN_GENETIC_DISEASES 0 141489080 141512975 PCDHGC5 .

5522 rows × 10 columns

The important take-home lesson here is that while we can easily change the column order, if we want to apply pipe-step GOR commands that rely on consistent genomic order, such as VARJOIN, GROUP, RANK, etc., we must ensure that the first two columns are in proper order. Before we leave this subject, we provide a few more ways to achieve the above.

First we make sure that we don't have overlapping segments by using the SEGPROJ command; however, we then lose the gene symbol. Therefore, we have to join it back in:

In [93]:
%%gor
$mydefs3

gor [#busy#] | segproj | join -segsnp <(gor #cvars# ) 
| select chrom,pos-disease | join -r -snpseg -maxseg 4000000 [#busy#]
| varjoin -r <(gor #mutpred# )
Query ran in 0.57 sec
Query fetched 5,522 rows in 4.58 sec (total time 5.15 sec)
                                             
Out[93]:
chrom pos ref alt disease gene_symbol MutPred_rankscore
0 chr5 141330778 C A INBORN_GENETIC_DISEASES PCDHGA1 .
1 chr5 141330954 A G INBORN_GENETIC_DISEASES PCDHGA1 0.97773
2 chr5 141331012 G T INBORN_GENETIC_DISEASES PCDHGA1 0.66843
3 chr5 141331196 A T INBORN_GENETIC_DISEASES PCDHGA1 0.45202
4 chr5 141331255 C T INBORN_GENETIC_DISEASES PCDHGA1 0.54201
... ... ... ... ... ... ... ...
5517 chr5 141511049 G A INBORN_GENETIC_DISEASES PCDHGA11 .
5518 chr5 141511049 G A INBORN_GENETIC_DISEASES PCDHGA12 .
5519 chr5 141511049 G A INBORN_GENETIC_DISEASES PCDHGC3 .
5520 chr5 141511049 G A INBORN_GENETIC_DISEASES PCDHGC4 .
5521 chr5 141511049 G A INBORN_GENETIC_DISEASES PCDHGC5 .

5522 rows × 7 columns

Notice that we did not have to use a SORT, because the segments in the left-source are non-overlapping after SEGPROJ. We can indeed do something similar by using the -ir option in JOIN. This option outputs only the right-source rows, and only once, even though they may overlap multiple rows in the left-source.

In [94]:
%%gor
$mydefs3

gor [#busy#]  | join -segsnp -ir <(gor #cvars# ) 
| varjoin -r <(gor #mutpred# )
| join -r -snpseg -maxseg 4000000 [#busy#]
Query ran in 0.31 sec
Query fetched 5,522 rows in 4.33 sec (total time 4.64 sec)
                                             
Out[94]:
chrom pos ref alt disease MutPred_rankscore gene_symbol
0 chr5 141330778 C A INBORN_GENETIC_DISEASES . PCDHGA1
1 chr5 141330954 A G INBORN_GENETIC_DISEASES 0.97773 PCDHGA1
2 chr5 141331012 G T INBORN_GENETIC_DISEASES 0.66843 PCDHGA1
3 chr5 141331196 A T INBORN_GENETIC_DISEASES 0.45202 PCDHGA1
4 chr5 141331255 C T INBORN_GENETIC_DISEASES 0.54201 PCDHGA1
... ... ... ... ... ... ... ...
5517 chr5 141511049 G A INBORN_GENETIC_DISEASES . PCDHGA11
5518 chr5 141511049 G A INBORN_GENETIC_DISEASES . PCDHGA12
5519 chr5 141511049 G A INBORN_GENETIC_DISEASES . PCDHGC3
5520 chr5 141511049 G A INBORN_GENETIC_DISEASES . PCDHGC4
5521 chr5 141511049 G A INBORN_GENETIC_DISEASES . PCDHGC5

5522 rows × 7 columns

We see that for this approach to work, we must read the gene segments twice if we want to both use the gene segments to select rows from the right-source and include the gene annotations in the output. In the above examples, this is not a big concern because the gene table is very lightweight. This is very often the case; for example, reading all genes or exons comes at very little cost.
