bozeklab/wsi-preproc

Preprocessing steps for Whole Slide Images, with Multiple Instance Learning problems in mind.

The steps in our proposed pipeline for a WSI are:

  • Creating a thumbnail version of the WSI
  • Creating a binary mask of the thumbnail, indicating where the foreground (tissue) is
  • Extracting all the tissue patches of the WSI into a single .tar file
  • Creating a feature vector for each tissue patch and dumping them into a single .pickle file

Installation:

A conda environment named wsi-preproc with all the required dependencies can be created by running

make setup 

Running:

Scripts for the preprocessing steps live in the tools directory. Their arguments can be passed via a Hydra configuration .yaml file:

conda activate wsi-preproc
python3 tools/extract-thumbnails.py -cp configs/examples -cn wsi-thumbnails-example.yaml
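
For reference, the config-file form simply collects the same keys that can be overridden on the command line. A minimal sketch of what such a .yaml could look like for the thumbnail step, based on the arguments documented below (the actual example config in configs/examples may differ):

input_path: /path/to/wsis         # root folder of WSIs; can have nested directories
output_path: /path/to/thumbnails  # root folder for the thumbnails
wsi_extension: .svs               # extension of the slide files (including the '.')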

The preprocessing steps can be executed with the following commands. The complete arguments are documented in the scripts' source.

Creating thumbnails:

# input_path:     root folder of WSIs; can have nested directories
# output_path:    root folder for the thumbnails
# wsi_extension:  extension of the slide files (including the '.')
python3 tools/extract-thumbnails.py \
    ++input_path=/path/to/wsis \
    ++output_path=/path/to/thumbnails \
    ++wsi_extension=.svs

Creating binary masks:

# input_path:           root folder of thumbnails; can have nested directories
# output_path:          root folder for the binary masks
# thumbnail_extension:  extension of the thumbnail image files (including the '.')
python3 tools/binarize-thumbnails.py \
    ++input_path=/path/to/thumbnails \
    ++output_path=/path/to/binary/masks \
    ++thumbnail_extension=.jpg

Extracting tissue patches:

# wsis_path:                root folder of WSIs; can have nested directories
# masks_path:               root folder of the binary masks; should mirror the structure of wsis_path
# output_path:              root folder for the .tar files containing the tissue patches
# wsi_extension:            extension of the slide files (including the '.')
# masks_extension:          extension of the binary masks (including the '.')
# target_magnification:     magnification at which the WSI will be tiled
# tile_size:                size of the output tissue patches
# patch_content_threshold:  minimum fraction of tissue pixels a tile must contain to be saved
python3 tools/extract-patches.py \
    ++wsis_path=/path/to/wsis \
    ++masks_path=/path/to/binary/masks \
    ++output_path=/path/to/tarfiles \
    ++wsi_extension=.svs \
    ++masks_extension=.jpg \
    ++target_magnification=20 \
    ++tile_size=256 \
    ++patch_content_threshold=0.6
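
To make patch_content_threshold concrete: with tile_size=256 and a threshold of 0.6, a tile is kept only if at least 0.6 * 256 * 256 ≈ 39,322 of its mask pixels are tissue. A minimal sketch of that check, assuming the binary mask crop for one tile is available as a NumPy array (this illustrates the idea, it is not the script's actual code):

import numpy as np

def keep_tile(mask_tile: np.ndarray, threshold: float = 0.6) -> bool:
    # mask_tile: binary mask crop for one tile, nonzero where tissue is
    tissue_fraction = np.count_nonzero(mask_tile) / mask_tile.size
    return tissue_fraction >= threshold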

Extracting features:

# input_path:   root folder of .tar files with tissue patches; can have nested directories
# output_path:  root folder for the feature vectors of the patches
# model:        name of a feature extractor model registered in the framework
python3 tools/extract-features.py \
    ++input_path=/path/to/tarfiles \
    ++output_path=/path/to/feature/vectors \
    ++model=vit_l_16_imagenet
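
The resulting .pickle can be inspected with the standard library. The exact layout of the stored feature vectors is defined by the script, so the sketch below only loads a file and reports what it contains (the path is a placeholder):

import pickle

with open('/path/to/feature/vectors/some_slide.pickle', 'rb') as f:
    features = pickle.load(f)

print(type(features))  # check the container type before assuming its layout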

Why this way?

  • Extracting patches only where the binary mask indicates tissue avoids wasting time on empty background.
  • There is room for experimentation: it is easy to add new models or binarization functions to the framework, and we encourage you to do so.
  • The steps are decoupled from each other. You could extract patches at 20x magnification once and subsequently try different feature extractors.
  • An important bottleneck of WSI preprocessing is I/O during tiling and feature extraction:
    • Reading thousands of patches from a single slide is slow.
    • Reading and saving thousands of individual image patches/feature vectors is slow.
    • Reading and saving a single .tar with thousands of patches is considerably faster (see the sketch after this list).
  • Although the .tar of patches is an intermediate artifact, it is useful for resource management:
    • A GPU running feature extraction directly from a slide would be idle most of the time because of slow WSI I/O; that GPU could be working on something else instead.
    • The overhead of introducing this intermediate step is negligible.
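
As an illustration of the I/O argument above, reading patches back out of one of the generated .tar files is a single sequential scan with the standard tarfile module. This is only a sketch; the member names and the image format inside the archive are assumptions, not something the scripts guarantee:

import io
import tarfile

from PIL import Image

with tarfile.open('/path/to/tarfiles/some_slide.tar') as tar:  # placeholder path
    for member in tar:
        if not member.isfile():
            continue
        patch = Image.open(io.BytesIO(tar.extractfile(member).read()))
        # ... hand `patch` to a feature extractor here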

Extending the package

It is possible to implement and add new binarization functions, models, and data augmentations to the framework simply by decorating functions:

import torch
import torchvision

@register_model  # register_model is provided by the framework (see src/sample_processor)
def dummy_model():
    return {'model': torch.nn.Identity(), 'transform': torchvision.transforms.ToTensor()}

Now dummy_model is available to be passed as the model argument to tools/extract-features.py.
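For instance, combined with the feature-extraction command above (paths are placeholders):

python3 tools/extract-features.py \
    ++input_path=/path/to/tarfiles \
    ++output_path=/path/to/feature/vectors \
    ++model=dummy_model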
Check the code in src/sample_processor/ (the binarization, models, and data_augmentations modules) to see how the currently available options were implemented.

Furthermore, if you want to process all the files in a folder in your own way, you can implement classes that inherit from PoolFolderProcessor and ShardingFolderProcessor to take advantage of multiprocessing.
