The steps in our proposed pipeline for a WSI are:
- Creating a thumbnail version of the WSI
- Creating a binary mask of the thumbnail, indicating where the foreground (tissue) is
- Extracting all the tissue patches of the WSI into a single .tar file
- Creating a feature vector for each tissue patch and dumping them into a single .pickle file
A conda env `wsi-preproc` with all the required dependencies can be created by running:

```bash
make setup
```
Scripts to run the preprocessing steps are in the `tools` directory. The scripts' arguments can be passed as a Hydra configuration `.yaml`:

```bash
conda activate wsi-preproc
python3 tools/extract-thumbnails.py -cp configs/examples -cn wsi-thumbnails-example.yaml
```
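For reference, a Hydra-driven script typically looks like the sketch below; the `config_path`/`config_name` defaults of the actual tools may differ. `-cp`/`-cn` select the config file, and `++key=value` adds or overrides individual entries:

```python
import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path="../configs", config_name="config")
def main(cfg: DictConfig) -> None:
    # -cp/-cn on the command line override config_path/config_name,
    # and ++key=value adds or overrides individual entries in cfg.
    print(cfg.input_path, cfg.output_path, cfg.wsi_extension)

if __name__ == "__main__":
    main()
```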
The preprocessing steps can be executed with the following commands. The full set of arguments is documented in each script's source.
Extract thumbnails:

```bash
# input_path:    Root folder of WSIs. Can have nested directories.
# output_path:   Root folder for the thumbnails.
# wsi_extension: Extension of the slide files (including the '.').
python3 tools/extract-thumbnails.py \
    ++input_path=/path/to/wsis \
    ++output_path=/path/to/thumbnails \
    ++wsi_extension=.svs
```
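A thumbnail can be read straight from the slide with openslide-python. The minimal sketch below illustrates the idea; the downsample factor of 32 is an arbitrary illustrative choice, not necessarily what `tools/extract-thumbnails.py` uses:

```python
import openslide

def save_thumbnail(wsi_path: str, out_path: str, downsample: int = 32) -> None:
    """Save a downsampled overview image of a WSI (sketch, not the repo's exact logic)."""
    slide = openslide.OpenSlide(wsi_path)
    width, height = slide.dimensions  # level-0 size in pixels
    thumb = slide.get_thumbnail((width // downsample, height // downsample))
    thumb.convert("RGB").save(out_path, quality=90)

# save_thumbnail("/path/to/wsis/slide_01.svs", "/path/to/thumbnails/slide_01.jpg")
```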
Binarize the thumbnails:

```bash
# input_path:          Root folder of thumbnails. Can have nested directories.
# output_path:         Root folder for the binary masks.
# thumbnail_extension: Extension of the thumbnail image files (including the '.').
python3 tools/binarize-thumbnails.py \
    ++input_path=/path/to/thumbnails \
    ++output_path=/path/to/binary/masks \
    ++thumbnail_extension=.jpg
```
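A common binarization rule is an Otsu threshold on the saturation channel, since tissue is coloured while the background is nearly white. The following scikit-image sketch shows the idea; it is not necessarily the exact rule implemented in `tools/binarize-thumbnails.py`:

```python
import numpy as np
from PIL import Image
from skimage.color import rgb2hsv
from skimage.filters import threshold_otsu

def binarize_thumbnail(thumb_path: str, mask_path: str) -> None:
    """Mark tissue as white (255) and background as black (0)."""
    rgb = np.asarray(Image.open(thumb_path).convert("RGB"))
    saturation = rgb2hsv(rgb)[..., 1]  # tissue is saturated, background is not
    mask = saturation > threshold_otsu(saturation)
    Image.fromarray((mask * 255).astype(np.uint8)).save(mask_path)
```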
Extract the tissue patches:

```bash
# wsis_path:               Root folder of WSIs. Can have nested directories.
# masks_path:              Root folder of the binary masks. Should mirror the structure of wsis_path.
# output_path:             Root folder for the .tar files containing the tissue patches.
# wsi_extension:           Extension of the slide files (including the '.').
# masks_extension:         Extension of the binary masks (including the '.').
# target_magnification:    Magnification at which the WSI will be tiled.
# tile_size:               Size of the output tissue patches.
# patch_content_threshold: Minimum fraction of tissue pixels a tile must contain to be saved.
python3 tools/extract-patches.py \
    ++wsis_path=/path/to/wsis \
    ++masks_path=/path/to/binary/masks \
    ++output_path=/path/to/tarfiles \
    ++wsi_extension=.svs \
    ++masks_extension=.jpg \
    ++target_magnification=20 \
    ++tile_size=256 \
    ++patch_content_threshold=0.6
```
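Conceptually, patch extraction walks a grid over the slide at the target magnification, keeps tiles whose mask coverage reaches `patch_content_threshold`, and appends them to a single `.tar`. The simplified openslide-python sketch below illustrates this; the member naming and the magnification handling are assumptions, and the actual script is likely more careful (e.g. about pyramid levels and missing metadata):

```python
import io
import tarfile

import numpy as np
import openslide
from PIL import Image

def extract_patches(wsi_path, mask_path, tar_path, target_magnification=20,
                    tile_size=256, patch_content_threshold=0.6):
    slide = openslide.OpenSlide(wsi_path)
    base_mag = float(slide.properties[openslide.PROPERTY_NAME_OBJECTIVE_POWER])
    src_tile = int(tile_size * base_mag / target_magnification)  # tile size in level-0 pixels
    width, height = slide.dimensions
    mask = np.asarray(Image.open(mask_path).convert("L")) > 0
    sx, sy = mask.shape[1] / width, mask.shape[0] / height       # mask-to-slide scale

    with tarfile.open(tar_path, "w") as tar:
        for y in range(0, height - src_tile + 1, src_tile):
            for x in range(0, width - src_tile + 1, src_tile):
                # Fraction of tissue pixels under this tile, measured on the low-res mask.
                region = mask[int(y * sy):int((y + src_tile) * sy),
                              int(x * sx):int((x + src_tile) * sx)]
                if region.size == 0 or region.mean() < patch_content_threshold:
                    continue
                tile = slide.read_region((x, y), 0, (src_tile, src_tile)).convert("RGB")
                buf = io.BytesIO()
                tile.resize((tile_size, tile_size)).save(buf, format="JPEG")
                info = tarfile.TarInfo(name=f"{x}_{y}.jpg")
                info.size = buf.getbuffer().nbytes
                buf.seek(0)
                tar.addfile(info, buf)
```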
Extract the feature vectors:

```bash
# input_path:  Root folder of .tar files with tissue patches. Can have nested directories.
# output_path: Root folder for the feature vectors of the patches.
# model:       Name of a feature extractor model registered in the framework.
python3 tools/extract-features.py \
    ++input_path=/path/to/tarfiles \
    ++output_path=/path/to/feature/vectors \
    ++model=vit_l_16_imagenet
```
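Feature extraction then streams the patches back out of each `.tar`, runs them through the selected backbone, and pickles the results. Below is a simplified, unbatched sketch with a torchvision ViT-L/16; the registered `vit_l_16_imagenet` model may use different weights, transforms, batching, or output format:

```python
import io
import pickle
import tarfile

import torch
import torchvision
from PIL import Image

def extract_features(tar_path, pickle_path):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    weights = torchvision.models.ViT_L_16_Weights.IMAGENET1K_V1
    model = torchvision.models.vit_l_16(weights=weights)
    model.heads = torch.nn.Identity()  # keep the 1024-d embedding, drop the classifier
    model.eval().to(device)
    transform = weights.transforms()

    names, feats = [], []
    with tarfile.open(tar_path, "r") as tar, torch.no_grad():
        for member in tar:
            if not member.isfile():
                continue
            patch = Image.open(io.BytesIO(tar.extractfile(member).read())).convert("RGB")
            feats.append(model(transform(patch).unsqueeze(0).to(device)).squeeze(0).cpu())
            names.append(member.name)

    with open(pickle_path, "wb") as f:
        pickle.dump({"names": names, "features": torch.stack(feats)}, f)
```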
Some notes on the design:
- Extracting patches only where the slide actually contains tissue is reasonable, since most of a WSI is empty background.
- There is room for experimentation: it is easy to add new models or binarization functions to the framework, and we encourage you to do so.
- The steps are decoupled from each other. You could extract patches at 20x magnification once and subsequently try different feature extractors.
- An important bottleneck in WSI preprocessing is I/O during tiling and feature extraction:
  - Reading thousands of patches from a single slide is slow.
  - Reading and saving thousands of individual image patches/feature vectors is slow.
  - Reading and saving a single .tar with thousands of patches is considerably faster.
- Although creating a .tar of patches is an intermediate step, it is useful for resource management:
  - A GPU running feature extraction directly from a slide would sit idle most of the time because of slow WSI I/O; it could be working on something else instead.
  - The overhead introduced by this intermediate step is negligible.
New binarization functions, models, and data augmentations can be implemented and added to the framework just by decorating functions:
```python
import torch
import torchvision
from sample_processor.models import register_model  # adjust the import to the framework's layout

@register_model
def dummy_model():
    return {"model": torch.nn.Identity(), "transform": torchvision.transforms.ToTensor()}
```
Now `dummy_model` is available to be passed as the `model` argument to `tools/extract-features.py`.
Check the code under `src/sample_processor/` (`binarization`, `models`, `data_augmentations`) to see how the currently available options were implemented.
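For instance, a new binarization function could be registered in the same way. The decorator name and expected signature below are assumptions for illustration; the actual ones are defined under `src/sample_processor/binarization`:

```python
import numpy as np
# Decorator name and expected signature are assumptions; see src/sample_processor/binarization.
from sample_processor.binarization import register_binarization_function

@register_binarization_function
def simple_intensity_threshold(thumbnail: np.ndarray) -> np.ndarray:
    """Mark every pixel darker than near-white as tissue (illustrative only)."""
    return thumbnail.mean(axis=-1) < 220
```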
Furthermore, if you want to process all the files in one folder differently, you can implement your own classes that inherit from `PoolFolderProcessor` and `ShardingFolderProcessor` to take advantage of multiprocessing.