Longitudinal phenotype analysis

First we include boiler plate code to define the project, set environment variables and import libraries.

We start by defining a new ICD data table, here as the combination of two table in a our project.

List summary

We can summarize the ICD codes for each PN in the two tables in following way using the PIVOT command:

Similarly, we can collapse the tables into one and identify samples with two or more ICD codes of interes.

We show few selected rows from the the output with the following Python logic:

Finding first, last, and last-n ICDs

There are several ways to play with time ordered data as we will show, but first lets change the definition of are table such that it is ordered by samples and date in order to leverage "-ordered" option in some of the commads, like ÀTMIN and GROUP, and make the evaluation of the queries more efficient.

Row based approach

Now we can write the following queries. They are pretty self-explanatory keeping in mind the ordering of the #ordered# table and that the command ROWNUM generates a column with the row number. Note however that the RANK command does not support the"-ordered" option at the moment.

List based approach

Here we turn all the columns into a list per PN and then use the listfilter to pick only the last 3 elemenst. The list-functions LISTMAP, LISTFILTER, and LISTZIPFILTER recognize two variables in addition to the column values; namely x for the "current" element value (as string) and i for the list element index (as int).

We can compare the results and see that they are identical using the THROWIF command like this, utilizing the federated virtual relations based on Python dataframes.

De-list the data

We could also easily turn the list-based data into row-level data using the SPLIT command, e.g.

We see that the samples have only three rows each. Note that the above approach may not work well if there are column characters that "conflict" with the list separator (here by default a simple comma). A more specific separator can be specified both in GROU, SPLIT and the LISFILTER function.

Generating multiple Boolean conditions

This example show a simple pattern, based on GROUP and "-max", where multiple Boolean variables are defined where each is only based on a single observation.

Table like this is easily explored in the SM grid using the Venn-tool or we can simply summarize it in the following way:

Time based window analysis

Join based approach

The last set of examples will show how we can work with longitudinal data and do window based analysis. For example be able to count the number of ICD codes in a particular time window or to find a set of certain codes that coexist in a given time window. The first approach uses a MULTIMAP join, leveraging the order optimization in that command. Importantly, it prevents the right-source to be loaded into memory, which is important if we have high number of observations.

Parallel NOR version

These types of queries can be quite costly and the cost goes up with the square of the number of observations per sample. We can speed things up by assigning multiple workers to the processing with the PARALLEL command and partition the PN samples on the fly using the "in-order-of-appearance" function ÌOOA and filter them with WHERE based on the modulus of their row number. In addition to associating each ICD observation with the count and list of other ICD codes that are close in time, we can compute a Boolean condition based on the diagnoses found in the time window.

List based approach

Finally, we show how the same type of window based time analysis can be done using list functions in the GORpipe syntax. One can argue that syntactically this approach is more complicated, however, for the analysis presented here it is actually close to 4x more efficient! The idea is to collapse all the observations per sample into lists. Then clone the lists, split them and extract sub-lists based on the time condition using LISTZIPFILTER where we can filter the ICD code list based on the time list.

While the above approach is faster than the join approach shown earlier, it is more complex to read and the performance difference may be smaller if we need to keep track of more columns than icd_code and entry_date like in our example here. Obviously, the above query can be parallelized in the same way as the multimap join based approach. Finally, we show that we get the same sample count for our earlier condition analysis and we leave it as an exersize for the reader to show that they give exactly the same result.