```bash
pip install MEDS-transforms
```
```yaml
input_dir: $MEDS_ROOT
output_dir: $PIPELINE_OUTPUT
description: Your special pipeline

stages:
  - filter_subjects:
      min_events_per_subject: 5
  - add_time_derived_measurements:
      age:
        DOB_code: MEDS_BIRTH
        age_code: AGE
        age_unit: years
      time_of_day:
        time_of_day_code: TIME_OF_DAY
        endpoints: [6, 12, 18, 24]
  - fit_outlier_detection:
      _base_stage: aggregate_code_metadata
      aggregations:
        - values/n_occurrences
        - values/sum
        - values/sum_sqd
  - occlude_outliers:
      stddev_cutoff: 1
  - fit_normalization:
      _base_stage: aggregate_code_metadata
      aggregations:
        - code/n_occurrences
        - code/n_subjects
        - values/n_occurrences
        - values/sum
        - values/sum_sqd
  - fit_vocabulary_indices
  - normalization
```
Save your pipeline YAML file on disk at `$PIPELINE_YAML`. Then, in the terminal, run:

```bash
MEDS_transform-pipeline pipeline_fp="$PIPELINE_YAML"
```

Once it completes, you will find output files in `$PIPELINE_OUTPUT`, with the results of each stage of the pipeline stored in stage-specific directories and the global outputs in `$PIPELINE_OUTPUT/data` and `$PIPELINE_OUTPUT/metadata` (for data and metadata outputs, respectively). That's it!
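As a rough illustration only (the per-stage directory names depend on your pipeline and are not spelled out here), the resulting output tree looks something like this:

```text
$PIPELINE_OUTPUT/
├── <stage_1_name>/   # intermediate outputs for each stage, one directory per stage
├── <stage_2_name>/
├── ...
├── data/             # final transformed MEDS data shards
└── metadata/         # final metadata outputs
```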
Beyond just running a simple pipeline over the built-in stages, you can also do things like
- Define your own stages or use stages from other packages!
- Run your pipeline in parallel or across a Slurm cluster with stage-specific compute and memory requirements!
- Use meta-stage functionality like Match-Revise to dynamically control how your stage is run over different parts of the data!
To understand these capabilities and more, read the full documentation.
See any of the below projects to understand how to use MEDS-Transforms in different ways!
Note
If your package uses MEDS-Transforms, please submit a PR to add it to this list!
Read the full API documentation for technical details.
MEDS-Transforms is built around the following design philosophy:
- MEDS-Transforms is built for use with MEDS datasets. This format is an incredibly simple, usable, and powerful format for representing electronic health record (EHR) datasets for use in machine learning or artificial intelligence applications.
- Any complex data pre-processing pipeline should be expressible as a series of simpler, interoperable stages. Expressing complex pipelines in this way allows the MEDS community to curate a library of "pre-processing stages" which can be used within the community to build novel, complex pipelines.
- Each stage of a pipeline should be simple, testable, and (where possible) interoperable with other stages. This helps the community ensure correctness of pipelines and develop new tools in an efficient, reliable manner. It also helps researchers break down complex operations into simpler conceptual pieces. See the documentation on MEDS-Transforms Stages for more details on how to define your own stages!
- Complex pipelines should also be communicable to other researchers, so that we can easily reproduce others' results, understand their work, and iterate on it. This is best enabled when pipelines can be defined by clear, simple configuration files over this shared library of stages. MEDS-Transforms realizes this with our pipeline configuration specification, shown above. See the full pipeline configuration documentation for more details.
- Just as the MEDS format is designed to enable easy scaling of datasets through sharding, MEDS-Transforms is built around a mapreduce paradigm to enable easy scaling of pipelines to arbitrary dataset sizes by parallelizing operations across the input datasets' shards. Check out the mapreduce helpers MEDS-Transforms exposes for your use in downstream pipelines.
- Much as MEDS is a data standard, MEDS-Transforms tries to embody the principle that data, rather than Python objects, should be the interface between pipeline components as much as possible. To that end, each MEDS-Transforms stage can be run as a standalone script outputting transformed files to disk, which subsequent stages read. This means that you can easily run multiple MEDS-Transforms pipelines in sequence to combine operations across different packages or use-cases, and seamlessly resume pipelines after interruptions or failures from the partially completed stage outputs (a sketch of this chaining pattern follows the note below).
Note
This does cause some performance limitations, which we are solving; follow Issue #56 to track updates on this!
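Because each pipeline's output directory contains its own `data/` and `metadata/` folders, chaining pipelines is just a matter of pointing one pipeline's `input_dir` at another's `output_dir`. A minimal, purely illustrative sketch (the file names and environment variables here are hypothetical):

```yaml
# first_pipeline.yaml -- runs over the raw MEDS dataset
input_dir: $MEDS_ROOT
output_dir: $FIRST_OUTPUT
description: First pass (e.g., filtering)
stages:
  - filter_subjects:
      min_events_per_subject: 5
```

```yaml
# second_pipeline.yaml -- consumes the finished output of the first pipeline
input_dir: $FIRST_OUTPUT
output_dir: $SECOND_OUTPUT
description: Second pass (e.g., add time-derived measurements)
stages:
  - add_time_derived_measurements:
      age:
        DOB_code: MEDS_BIRTH
        age_code: AGE
        age_unit: years
```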
MEDS-Transforms pipelines can be run in serial mode or with controllable parallelization via Hydra launchers. The use of Hydra launchers and the core design principle of this library means that this parallelization is as simple as launching the individual stages multiple times with near-identical arguments to spin up more workers in parallel, and they can be launched in any mode over a networked file-system that you like. For example, default supported modes include:
- Local parallelism via the `joblib` Hydra launcher, which can be used to run multiple copies of the same script in parallel on a single machine.
- Slurm parallelism via the `submitit` Hydra launcher, which can be used to run multiple copies of the same script in parallel on a cluster.
Note
The `joblib` and `submitit` Hydra launchers are optional dependencies of this package. To install them, you can run `pip install MEDS-transforms[local_parallelism]` or `pip install MEDS-transforms[slurm_parallelism]`, respectively.
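For a sense of what this looks like in practice, a local parallel launch is roughly a standard Hydra multirun. The sketch below is hypothetical; in particular, the `worker` sweep override is an assumption here, so consult the parallelization documentation for the exact override names:

```bash
# Hypothetical sketch: launch 4 local workers via the joblib Hydra launcher.
# The `worker` override name is an assumption; see the parallelization docs.
MEDS_transform-pipeline --multirun \
    pipeline_fp="$PIPELINE_YAML" \
    worker="range(0,4)" \
    hydra/launcher=joblib
```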
MEDS-Transforms is built for you and other users to define your own stages and export them from your own packages. When you define a stage in your package, you simply "register" it as a `MEDS_transforms.stages.Stage` object via a `MEDS_transforms.stages` plugin in your package's entry points, and MEDS-Transforms will be able to find it and use it in pipelines, tests, and more.
Concretely, to define a function that you want to run as a MEDS-Transforms stage, you simply register it and declare it as an entry point. E.g., in `my_package/my_stage.py`:
```python
from omegaconf import DictConfig

from MEDS_transforms.stages import Stage


@Stage.register
def main(cfg: DictConfig):
    # Do something with the MEDS data
    pass
```
E.g., in your `pyproject.toml` file:

```toml
[project.entry-points."MEDS_transforms.stages"]
my_stage = "my_package.my_stage:main"
```
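Once registered, the stage can then be referenced by name in a pipeline configuration just like the built-in stages. A minimal sketch (the stage name `my_stage` and the surrounding pipeline here are hypothetical):

```yaml
input_dir: $MEDS_ROOT
output_dir: $PIPELINE_OUTPUT
description: Pipeline using a custom stage

stages:
  - filter_subjects:
      min_events_per_subject: 5
  - my_stage # discovered via the MEDS_transforms.stages entry point above
```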
MEDS-Transforms supports several different types of stages, which are listed in the `StageType` `StrEnum`. These are:

- `MAP` stages, which apply an operation to each data shard in the input and save the output to the same shard name in the output folder.
- `MAPREDUCE` stages, which apply a metadata extraction operation to each shard in the input, then reduce those outputs to a single metadata file, which is merged with the input metadata and written to the output.
- `MAIN` stages, which do not fall into either of the above categories and are simply run as standalone scripts without additional modification. `MAIN` stages cannot use things like the "Match-Revise" protocol.
`MAP` and `MAPREDUCE` stages take in map and reduce functions; these can be direct functions that apply to each shard, but more commonly they are "functors" that take as input the configuration parameters or other consistently typed and annotated information and build the specific functions that are to be applied. MEDS-Transforms can reliably bind these functors to the particular pipeline parameters to streamline your ability to register stages. See the `bind_compute_fn` function to better understand how this works and how to ensure your stages will be appropriately recognized in downstream usage.
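To make the functor pattern concrete, here is a purely illustrative sketch in plain Python. It is not the library's API: the parameter name `min_numeric_value`, the stage it implies, and exactly how the returned function would be registered are all assumptions for illustration.

```python
import polars as pl
from omegaconf import DictConfig


def make_threshold_filter(cfg: DictConfig):
    """Functor: reads parameters from the stage configuration and returns the
    per-shard map function that will actually be applied to each data shard."""
    threshold = cfg.min_numeric_value  # hypothetical stage parameter

    def map_fn(shard: pl.DataFrame) -> pl.DataFrame:
        # Keep only measurements whose numeric value meets the configured threshold.
        return shard.filter(pl.col("numeric_value") >= threshold)

    return map_fn
```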
Stages are registered via the `Stage.register` method, which can be used as a function or as a decorator.
In addition to writing your own scripts, you can also allow users to reference your pipeline configuration files directly from your package by ensuring they are included in your packaged files. Users can then refer to them by using the `pkg://` syntax when specifying the pipeline configuration file path, rather than an absolute path on disk. For example:

```bash
MEDS_transform-pipeline pipeline_fp="pkg://my_package.my_pipeline.yaml"
```
Currently, the only supported meta-stage functionality is the "Match-Revise" protocol, which allows you to dynamically control how your stage is run over different parts of the data. This is useful for things like extraction of numerical values based on a collection of regular expressions, filtering different subsets of the data with different criteria, etc.
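As a purely illustrative sketch of the idea (the `_match_revise` and `_matcher` key names below are assumptions, not confirmed syntax; see the Match-Revise documentation for the actual specification), a stage configuration might apply different parameters to different subsets of the data roughly like this:

```yaml
# Hypothetical sketch: apply different outlier cutoffs to different code groups.
# The _match_revise / _matcher key names are assumptions, not the confirmed syntax.
- occlude_outliers:
    _match_revise:
      - _matcher: {code: "LAB//HEART_RATE"}
        stddev_cutoff: 2
      - _matcher: {code: "LAB//GLUCOSE"}
        stddev_cutoff: 4
```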
Given the critical importance of testing in the MEDS-Transforms library, we have built-in support for you to test your derived stages via a semi-automated, clear pipeline that will aid you in both writing tests and ensuring your stages are understandable to your users.
MEDS-Transforms has several key current priorities:
- Improve the quality of the documentation and tutorials.
- Improve the performance of the library, in particular by eliminating the current requirement that every stage write its outputs to disk and read its inputs from disk, and by addressing the fact that polars is not as efficient in low-resource settings.
- Improve the usability and clarity of the core components of this library, both conceptually and technically; this includes things like removing the distinction between data and metadata stages, ensuring all stages have a clear output schema, supporting reduce- or metadata-only stages, etc.
- Support more parallelization and scheduling systems, such as LSF, Spark, and more.
See the GitHub Issues to see all open issues we're considering. If you have an idea for a new feature, please open an issue to discuss it with us!
Contributions are very welcome; please follow the MEDS Organization's Contribution Guide if you submit a PR.