The steps in our proposed pipeline for a WSI are:
- Creating a thumbnail version of the WSI
- Creating a binary mask of the thumbnail, indicating where the foreground (tissue) is
- Extracting all the tissue patches of the WSI into a single .tar file
- Creating a feature vector for each tissue patch and dumping them into a single .pickle file
A conda env `wsi-preproc` with all the required dependencies can be created by running:

```bash
make setup
```
Scripts to run the preprocessing steps are in the `tools` directory. The scripts' arguments can be passed as a Hydra configuration `.yaml`:

```bash
conda activate wsi-preproc
python3 tools/extract-thumbnails.py -cp configs/examples -cn wsi-thumbnails-example.yaml
```
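For reference, a Hydra-driven script typically looks like the sketch below; the `config_path`/`config_name` defaults of the actual tools may differ. `-cp`/`-cn` select the config file, and `++key=value` adds or overrides individual entries:

```python
import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path="../configs", config_name="config")
def main(cfg: DictConfig) -> None:
    # -cp/-cn on the command line override config_path/config_name,
    # and ++key=value adds or overrides individual entries in cfg.
    print(cfg.input_path, cfg.output_path, cfg.wsi_extension)

if __name__ == "__main__":
    main()
```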
The preprocessing steps can be executed with the following commands. The full set of arguments is documented in each script's source.
Extract thumbnails:

```bash
# input_path:    Root folder of WSIs. Can have nested directories.
# output_path:   Root folder for the thumbnails.
# wsi_extension: Extension of the slide files (including the '.').
python3 tools/extract-thumbnails.py \
    ++input_path=/path/to/wsis \
    ++output_path=/path/to/thumbnails \
    ++wsi_extension=.svs
```
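A thumbnail can be read straight from the slide with openslide-python. The minimal sketch below illustrates the idea; the downsample factor of 32 is an arbitrary illustrative choice, not necessarily what `tools/extract-thumbnails.py` uses:

```python
import openslide

def save_thumbnail(wsi_path: str, out_path: str, downsample: int = 32) -> None:
    """Save a downsampled overview image of a WSI (sketch, not the repo's exact logic)."""
    slide = openslide.OpenSlide(wsi_path)
    width, height = slide.dimensions  # level-0 size in pixels
    thumb = slide.get_thumbnail((width // downsample, height // downsample))
    thumb.convert("RGB").save(out_path, quality=90)

# save_thumbnail("/path/to/wsis/slide_01.svs", "/path/to/thumbnails/slide_01.jpg")
```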
Binarize the thumbnails:

```bash
# input_path:          Root folder of thumbnails. Can have nested directories.
# output_path:         Root folder for the binary masks.
# thumbnail_extension: Extension of the thumbnail image files (including the '.').
python3 tools/binarize-thumbnails.py \
    ++input_path=/path/to/thumbnails \
    ++output_path=/path/to/binary/masks \
    ++thumbnail_extension=.jpg
```
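A common binarization rule is an Otsu threshold on the saturation channel, since tissue is coloured while the background is nearly white. The following scikit-image sketch shows the idea; it is not necessarily the exact rule implemented in `tools/binarize-thumbnails.py`:

```python
import numpy as np
from PIL import Image
from skimage.color import rgb2hsv
from skimage.filters import threshold_otsu

def binarize_thumbnail(thumb_path: str, mask_path: str) -> None:
    """Mark tissue as white (255) and background as black (0)."""
    rgb = np.asarray(Image.open(thumb_path).convert("RGB"))
    saturation = rgb2hsv(rgb)[..., 1]  # tissue is saturated, background is not
    mask = saturation > threshold_otsu(saturation)
    Image.fromarray((mask * 255).astype(np.uint8)).save(mask_path)
```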
Extract the tissue patches:

```bash
# wsis_path:               Root folder of WSIs. Can have nested directories.
# masks_path:              Root folder of the binary masks. Should mirror the structure of wsis_path.
# output_path:             Root folder for the .tar files containing the tissue patches.
# wsi_extension:           Extension of the slide files (including the '.').
# masks_extension:         Extension of the binary masks (including the '.').
# target_magnification:    Magnification at which the WSI will be tiled.
# tile_size:               Size of the output tissue patches.
# patch_content_threshold: Minimum fraction of tissue pixels a tile must contain to be saved.
python3 tools/extract-patches.py \
    ++wsis_path=/path/to/wsis \
    ++masks_path=/path/to/binary/masks \
    ++output_path=/path/to/tarfiles \
    ++wsi_extension=.svs \
    ++masks_extension=.jpg \
    ++target_magnification=20 \
    ++tile_size=256 \
    ++patch_content_threshold=0.6
```
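Conceptually, patch extraction walks a grid over the slide at the target magnification, keeps tiles whose mask coverage reaches `patch_content_threshold`, and appends them to a single `.tar`. The simplified openslide-python sketch below illustrates this; the member naming and the magnification handling are assumptions, and the actual script is likely more careful (e.g. about pyramid levels and missing metadata):

```python
import io
import tarfile

import numpy as np
import openslide
from PIL import Image

def extract_patches(wsi_path, mask_path, tar_path, target_magnification=20,
                    tile_size=256, patch_content_threshold=0.6):
    slide = openslide.OpenSlide(wsi_path)
    base_mag = float(slide.properties[openslide.PROPERTY_NAME_OBJECTIVE_POWER])
    src_tile = int(tile_size * base_mag / target_magnification)  # tile size in level-0 pixels
    width, height = slide.dimensions
    mask = np.asarray(Image.open(mask_path).convert("L")) > 0
    sx, sy = mask.shape[1] / width, mask.shape[0] / height       # mask-to-slide scale

    with tarfile.open(tar_path, "w") as tar:
        for y in range(0, height - src_tile + 1, src_tile):
            for x in range(0, width - src_tile + 1, src_tile):
                # Fraction of tissue pixels under this tile, measured on the low-res mask.
                region = mask[int(y * sy):int((y + src_tile) * sy),
                              int(x * sx):int((x + src_tile) * sx)]
                if region.size == 0 or region.mean() < patch_content_threshold:
                    continue
                tile = slide.read_region((x, y), 0, (src_tile, src_tile)).convert("RGB")
                buf = io.BytesIO()
                tile.resize((tile_size, tile_size)).save(buf, format="JPEG")
                info = tarfile.TarInfo(name=f"{x}_{y}.jpg")
                info.size = buf.getbuffer().nbytes
                buf.seek(0)
                tar.addfile(info, buf)
```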
Extract the feature vectors:

```bash
# input_path:  Root folder of .tar files with tissue patches. Can have nested directories.
# output_path: Root folder for the feature vectors of the patches.
# model:       Name of a feature extractor model registered in the framework.
python3 tools/extract-features.py \
    ++input_path=/path/to/tarfiles \
    ++output_path=/path/to/feature/vectors \
    ++model=vit_l_16_imagenet
```
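Feature extraction then streams the patches back out of each `.tar`, runs them through the selected backbone, and pickles the results. Below is a simplified, unbatched sketch with a torchvision ViT-L/16; the registered `vit_l_16_imagenet` model may use different weights, transforms, batching, or output format:

```python
import io
import pickle
import tarfile

import torch
import torchvision
from PIL import Image

def extract_features(tar_path, pickle_path):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    weights = torchvision.models.ViT_L_16_Weights.IMAGENET1K_V1
    model = torchvision.models.vit_l_16(weights=weights)
    model.heads = torch.nn.Identity()  # keep the 1024-d embedding, drop the classifier
    model.eval().to(device)
    transform = weights.transforms()

    names, feats = [], []
    with tarfile.open(tar_path, "r") as tar, torch.no_grad():
        for member in tar:
            if not member.isfile():
                continue
            patch = Image.open(io.BytesIO(tar.extractfile(member).read())).convert("RGB")
            feats.append(model(transform(patch).unsqueeze(0).to(device)).squeeze(0).cpu())
            names.append(member.name)

    with open(pickle_path, "wb") as f:
        pickle.dump({"names": names, "features": torch.stack(feats)}, f)
```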
Some notes on the design:
- Extracting patches only where the slide actually contains tissue is reasonable, since most of a WSI is empty background.
- There is room for experimentation: it is easy to add new models or binarization functions to the framework, and we encourage you to do so.
- The steps are decoupled from each other. You could extract patches at 20x magnification once and subsequently try different feature extractors.
- An important bottleneck in WSI preprocessing is I/O during tiling and feature extraction:
  - Reading thousands of patches from a single slide is slow.
  - Reading and saving thousands of individual image patches/feature vectors is slow.
  - Reading and saving a single .tar with thousands of patches is considerably faster.
- Although creating a .tar of patches is an intermediate step, it is useful for resource management:
  - A GPU running feature extraction directly from a slide would sit idle most of the time because of slow WSI I/O; it could be working on something else instead.
  - The overhead introduced by this intermediate step is negligible.
New binarization functions, models, and data augmentations can be implemented and added to the framework just by decorating functions:
```python
import torch
import torchvision
from sample_processor.models import register_model  # adjust the import to the framework's layout

@register_model
def dummy_model():
    return {"model": torch.nn.Identity(), "transform": torchvision.transforms.ToTensor()}
```
Now `dummy_model` is available to be passed as the `model` argument to `tools/extract-features.py`.
Check the code under `src/sample_processor/` (`binarization`, `models`, `data_augmentations`) to see how the currently available options were implemented.
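For instance, a new binarization function could be registered in the same way. The decorator name and expected signature below are assumptions for illustration; the actual ones are defined under `src/sample_processor/binarization`:

```python
import numpy as np
# Decorator name and expected signature are assumptions; see src/sample_processor/binarization.
from sample_processor.binarization import register_binarization_function

@register_binarization_function
def simple_intensity_threshold(thumbnail: np.ndarray) -> np.ndarray:
    """Mark every pixel darker than near-white as tissue (illustrative only)."""
    return thumbnail.mean(axis=-1) < 220
```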
Furthermore, if you want to process all the files in one folder differently, you can implement your own classes that inherit from `PoolFolderProcessor` and `ShardingFolderProcessor` to take advantage of multiprocessing.