sispca is a Python package for learning linear representations that capture variation associated with factors of interest in high-dimensional data. It extends Principal Component Analysis (PCA) to multiple subspaces and encourages subspace disentanglement by maximizing the Hilbert-Schmidt Independence Criterion (HSIC). The model is implemented in PyTorch and uses the Lightning framework for training. See the documentation for more details.
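To make the HSIC term concrete, here is a minimal sketch of the standard biased empirical estimator, tr(K_x H K_y H) / (n - 1)^2, written with NumPy and linear kernels. It is an illustration only and does not reproduce the package's implementation; the helper names `linear_kernel` and `hsic` are hypothetical.

```python
import numpy as np


def linear_kernel(a):
    """Gram matrix K = A A^T for samples stored in rows."""
    return a @ a.T


def hsic(k_x, k_y):
    """Biased empirical HSIC: tr(K_x H K_y H) / (n - 1)^2."""
    n = k_x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix H
    return np.trace(k_x @ h @ k_y @ h) / (n - 1) ** 2


rng = np.random.default_rng(0)
z1 = rng.normal(size=(200, 3))
z2_indep = rng.normal(size=(200, 3))           # drawn independently of z1
z2_dep = z1 + 0.1 * rng.normal(size=(200, 3))  # noisy copy of z1

hsic_indep = hsic(linear_kernel(z1), linear_kernel(z2_indep))
hsic_dep = hsic(linear_kernel(z1), linear_kernel(z2_dep))
print(f"independent: {hsic_indep:.4f}, dependent: {hsic_dep:.4f}")
```

With linear kernels this quantity equals the squared Frobenius norm of the sample cross-covariance between the two representations, so it is close to zero for the independent pair and much larger for the dependent pair.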
For theoretical connections and applications, please refer to our paper, "Disentangling Interpretable Factors with Supervised Independent Subspace Principal Component Analysis".
- v1.1.0 (2025-02-27): Memory-efficient handling of supervision kernel for large datasets.
- v1.0.0 (2024-10-11): Initial release.
Via GitHub (latest version):
pip install git+https://github.com/JiayuSuPKU/sispca.git#egg=sispca
Via PyPI (stable version):
pip install sispca
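After installing, one way to confirm which version is available in the current environment is to query the package metadata from Python. This sketch uses only the standard library; the helper name `installed_version` is hypothetical, and the distribution name `sispca` is taken from the `pip install` command above.

```python
from importlib import metadata


def installed_version(package="sispca"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print(installed_version())
```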
Basic usage:
import numpy as np
import torch
from sispca import Supervision, SISPCADataset, SISPCA
# simulate random inputs
x = torch.randn(100, 20)
y_cont = torch.randn(100, 5) # continuous target
y_group = np.random.choice(['A', 'B', 'C'], 100) # categorical target
# simulate custom kernel K_y
# in general, K_y should be either sparse (e.g. a graph Laplacian kernel) or low-rank (i.e. K_y = L @ L.T)
L = torch.randn(100, 20)
K_y = L @ L.T # (n_sample, n_sample)
# create a dataset with supervision
sdata = SISPCADataset(
    data=x.float(),  # (n_sample, n_feature)
    target_supervision_list=[
        Supervision(target_data=y_cont, target_type='continuous'),
        Supervision(target_data=y_group, target_type='categorical'),
        # Supervision(target_data=None, target_type='custom', target_kernel_K=K_y)
        Supervision(target_data=None, target_type='custom', target_kernel_Q=L)  # equivalent to the above
    ]
)
# fit the sisPCA model
sispca = SISPCA(
    sdata,
    n_latent_sub=[3, 3, 3, 3],  # the last subspace will be unsupervised
    lambda_contrast=10,
    kernel_subspace='linear',
    solver='eig'
)
sispca.fit(batch_size = -1, max_epochs = 100, early_stopping_patience = 5)
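The comment in the example above recommends a sparse or low-rank custom kernel. As a rough illustration of why a low-rank factor is convenient, the NumPy sketch below solves classical supervised PCA for a single subspace with a linear kernel and no contrast term. This is a simplified stand-in, not the sisPCA solver. It compares two routes: building the full (n_sample, n_sample) kernel K_y = L @ L.T, and working only with the factor L. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r, d = 100, 20, 5, 3       # samples, features, kernel rank, latent dims
x = rng.normal(size=(n, p))
L = rng.normal(size=(n, r))      # low-rank factor, K_y = L @ L.T

xc = x - x.mean(axis=0)          # centered data, H X
Lc = L - L.mean(axis=0)          # centered factor, H L

# Route 1: form the full kernel, then take the top-d eigenvectors
# of X^T H K_y H X, a (p, p) matrix.
k_y = Lc @ Lc.T                  # (n, n), costly for large n
evals, evecs = np.linalg.eigh(xc.T @ k_y @ xc)
u_full = evecs[:, ::-1][:, :d]   # eigh sorts ascending, so reverse

# Route 2: never form K_y. Since X^T H K_y H X = M M^T with M = (H X)^T (H L),
# the top-d left singular vectors of M span the same subspace.
m = xc.T @ Lc                    # (p, r)
u_svd, s, _ = np.linalg.svd(m, full_matrices=False)
u_lowrank = u_svd[:, :d]

# Compare projection matrices, since eigenvectors are defined only up to sign.
same_subspace = np.allclose(u_full @ u_full.T, u_lowrank @ u_lowrank.T, atol=1e-8)
print("same subspace:", same_subspace)

z = xc @ u_lowrank               # (n, d) latent representation
```

The second route only touches matrices of size (n, r) and (p, r), which is the practical reason to pass a factor rather than a dense kernel when the number of samples is large.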
Tutorials:
- Feature selection using sisPCA on the Breast Cancer Wisconsin dataset.
- Learning unsupervised residual subspace in simulation.
- Learning interpretable infection subspaces in scRNA-seq data using sisPCA. Fitting a single sisPCA-linear model on a scRNA-seq dataset with 20,000 cells and 2,000 genes takes approximately 1 min on an M1 MacBook Air.
For additional details, please refer to the documentation.
If you find sisPCA useful in your research, please consider citing our paper:
@misc{su2024disentangling,
  title={Disentangling Interpretable Factors with Supervised Independent Subspace Principal Component Analysis},
  author={Jiayu Su and David A. Knowles and Raul Rabadan},
  year={2024},
  eprint={2410.23595},
  archivePrefix={arXiv},
  primaryClass={stat.ML},
  url={https://arxiv.org/abs/2410.23595},
}