David Nordström · Johan Edstedt · Fredrik Kahl · Georg Bökman
Incorporating octic layers into Vision Transformers (ViTs) reduces the computational complexity while maintaining or improving representational power. We provide a PyTorch implementation for easy integration into existing ViT pipelines.
In the `octic_vits` folder you will find all the components needed to build octic-equivariant Vision Transformers (intended to be compatible with the timm library). For example, to create an octic ViT-H you can run the following:
```python
from octic_vits import OcticVisionTransformer

model = OcticVisionTransformer(embed_dim=1280, depth=32, num_heads=16)
```
This defaults to a hybrid model in which the first half of the blocks are octic and the remaining half are standard (i.e. this model has approximately 40% fewer FLOPs than a regular ViT-H). To instead obtain an invariant model, simply set `invariant=True`. You can further choose the number of octic blocks `k` by setting `octic_equi_break_layer=k`, as in the sketch below.
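For example, a minimal sketch using only the keyword arguments described above (all other settings are left at their defaults):

```python
from octic_vits import OcticVisionTransformer

# Hybrid ViT-H with the first 8 of 32 blocks octic
# (instead of the default first half, i.e. 16 blocks).
hybrid_8 = OcticVisionTransformer(
    embed_dim=1280, depth=32, num_heads=16,
    octic_equi_break_layer=8,
)

# Invariant ViT-H: features are made invariant to the octic group.
invariant = OcticVisionTransformer(
    embed_dim=1280, depth=32, num_heads=16,
    invariant=True,
)
```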
Code based on the official DeiT repo has been placed in the `deit` folder, and code based on the official DINOv2 repo in the `dinov2` folder.
Code to reproduce the experiments can be found in the `experiments` folder. Below follow general instructions on how to run it and how to obtain pretrained model weights.
All the code is written with the intent of being run on a Slurm cluster using submitit, so you must first set up the cluster settings in `utils/cluster.py`. If you intend to run it with `torchrun` instead, it should work out of the box. Also, make sure to run `export PYTHONPATH=$(pwd)` from the root of this repository so that relative imports work as intended.
For DINOv2 we use the same environment as the original repo, and the same goes for DeiT. Additional miscellaneous packages, e.g. submitit, need to be installed separately. Since DeiT III is deprecated, we provide some additional guidance on its installation in DEIT_ENV.md.
We follow the DINOv2 IN1K data structure. As such, the root directory of the dataset should hold the following contents:
```
<ROOT>/test/ILSVRC2012_test_00000001.JPEG
<ROOT>/test/[..]
<ROOT>/test/ILSVRC2012_test_00100000.JPEG
<ROOT>/train/n01440764/n01440764_10026.JPEG
<ROOT>/train/[...]
<ROOT>/train/n15075141/n15075141_9993.JPEG
<ROOT>/val/n01440764/ILSVRC2012_val_00000293.JPEG
<ROOT>/val/[...]
<ROOT>/val/n15075141/ILSVRC2012_val_00049174.JPEG
<ROOT>/labels.txt
```
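A quick, purely illustrative way to check that your dataset root matches this layout (paths taken from the listing above):

```python
from pathlib import Path

root = Path("<ROOT>")  # replace with your dataset root
for name in ("train", "val", "test", "labels.txt"):
    print(f"{name}: {'ok' if (root / name).exists() else 'MISSING'}")
```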
The provided dataset implementation expects a few additional metadata files to be present under the extra directory:
```
<EXTRA>/class-ids-TRAIN.npy
<EXTRA>/class-ids-VAL.npy
<EXTRA>/class-names-TRAIN.npy
<EXTRA>/class-names-VAL.npy
<EXTRA>/entries-TEST.npy
<EXTRA>/entries-TRAIN.npy
<EXTRA>/entries-VAL.npy
```
These metadata files can be generated (once) with the following lines of Python code:
```python
from dinov2.data.datasets import ImageNet

for split in ImageNet.Split:
    dataset = ImageNet(split=split, root="<ROOT>", extra="<EXTRA>")
    dataset.dump_extra()
```
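Once the metadata has been dumped, the dataset can be used directly. A minimal sanity check (assuming the standard dataset interface of the DINOv2 ImageNet class):

```python
from dinov2.data.datasets import ImageNet

# Instantiate the train split now that the entries files exist.
dataset = ImageNet(split=ImageNet.Split.TRAIN, root="<ROOT>", extra="<EXTRA>")
print(f"train samples: {len(dataset)}")  # ~1.28 M for the full IN1K train split
```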
For segmentation evaluation we use the code from capi and as the creator of said repository is very helpful, he has enabled automatic downloading of the datasets. For more information consult the original repo.
Download the weights from here to reproduce the evaluation metrics. The DINOv2 weights only include the teacher backbone. HERE is a link to a Google Drive that contains all files of interest.
DeiT III (supervised ImageNet-1K classification):

| model | # of params | # of FLOPs | ImageNet Top-1 | weights | logs |
|---|---|---|---|---|---|
| Hybrid ViT-H/14 | 356 M | 102 G | 85.0% | weights | logs |
| Invariant ViT-H/14 | 362 M | 104 G | 84.7% | weights | logs |
| Hybrid ViT-L/16 | 171 M | 38 G | 84.5% | weights | logs |
| Invariant ViT-L/16 | 175 M | 39 G | 84.0% | weights | logs |
DINOv2 (self-supervised pretraining, evaluated with a linear probe and k-NN):

| model | # of FLOPs | ImageNet linear | ImageNet knn | weights | logs |
|---|---|---|---|---|---|
| ViT-H/16 | 128 G | 81.7% | 81.0% | weights | logs |
| Hybrid ViT-H/16 | 78 G | 82.2% | 81.4% | weights | logs |
| Invariant ViT-H/16 | 78 G | 81.9% | 80.9% | weights | logs |
| ViT-L/16 | 62 G | 80.9% | 80.5% | weights | logs |
| Hybrid ViT-L/16 | 38 G | 81.3% | 80.8% | weights | logs |
| Invariant ViT-L/16 | 38 G | 81.2% | 80.4% | weights | logs |
We use the Hybrid ViT-H model as an example (as it is the best performing) to show how to run evaluation; replace it with whichever model you want to test.
After downloading the weights, you should be able to run the following command:

```bash
python experiments/eval_deit.py --model hybrid_deit_huge_patch14 --eval pretrained_models/hybrid_deit_huge_patch14.pth
```

This should give:

```
* Acc@1 84.996 Acc@5 96.390 loss 0.799
```
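If you prefer to load the checkpoint directly in Python (e.g. for a custom evaluation loop), something along the following lines should work. This is only a hedged sketch: the `"model"` key, the `patch_size` keyword, and `strict=False` are assumptions about the checkpoint layout and the timm-style constructor, so adapt as needed:

```python
import torch
from octic_vits import OcticVisionTransformer

# Hypothetical manual loading of the DeiT III hybrid ViT-H/14 checkpoint.
ckpt = torch.load(
    "pretrained_models/hybrid_deit_huge_patch14.pth",
    map_location="cpu",
    weights_only=False,  # DeiT-style checkpoints may contain non-tensor metadata
)
state_dict = ckpt.get("model", ckpt)  # assumed to be nested under "model"

model = OcticVisionTransformer(embed_dim=1280, depth=32, num_heads=16, patch_size=14)
missing, unexpected = model.load_state_dict(state_dict, strict=False)
model.eval()
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```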
For classification, run:

```bash
python experiments/eval_dinov2_classification.py output_dir --config-file dinov2/configs/eval/hybrid_vith16.yaml --pretrained-weights pretrained_models/hybrid_dinov2_huge_patch16.pth
```

This should give an accuracy of 82.2% and 81.4% for linear and knn, respectively.
For segmentation, run:

```bash
python experiments/eval_dinov2_segmentation.py model_path=dinov2/eval/segmentation/dinov2_loader.py model_loader_kwargs.model_name=dinov2_hybrid_vith16 model_loader_kwargs.weights=pretrained_models/hybrid_dinov2_huge_patch16.pth distributed=True ntasks_per_node=4 account=... gpus-per-node=4 nodes=1 output_dir=./output_dir
```

This should give an mIoU of 35.1 (linear) and 31.1 (knn) for ADE20K, and 70.8 (linear) and 61.7 (knn) for VOC2012.
Per-GPU batch sizes are adjusted to work well on A100-40GB GPUs. Feel free to adjust them for your setup while making sure the effective batch size remains the same (2048 for DeiT III and 1024 for DINOv2); see the example below.
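For reference, the effective batch size is simply the per-GPU batch size times the total number of GPUs (times any gradient-accumulation steps); the values below are just an example:

```python
# Example: keeping the DeiT III effective batch size of 2048 on different hardware.
per_gpu_batch = 64      # assumed value that fits in memory on your GPUs
gpus_per_node = 8
nodes = 4
grad_accum_steps = 1    # increase if you have to shrink per_gpu_batch further

effective_batch = per_gpu_batch * gpus_per_node * nodes * grad_accum_steps
assert effective_batch == 2048, effective_batch
```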
To launch distributed DeiT III training, run:

```bash
python experiments/train_deit.py --model hybrid_deit_huge_patch14
```
To launch distributed DINOv2 training, run:

```bash
python experiments/train_dinov2.py --config-file dinov2/configs/train/hybrid_vith16.yaml --ngpus 4 --nodes 2
```
We have provided a utility file to verify octic equivariance (and invariance). Simply run:
```bash
python experiments/test_equivariance.py
```
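For intuition, invariance is also easy to probe by hand: an invariant model should give (numerically) identical outputs for all eight octic transformations of the input, i.e. the four 90° rotations with and without a reflection. A minimal sketch, assuming a timm-style forward pass on 224×224 inputs:

```python
import torch
from octic_vits import OcticVisionTransformer

model = OcticVisionTransformer(
    embed_dim=1280, depth=32, num_heads=16, invariant=True
).eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    ref = model(x)
    for flip in (False, True):          # with / without horizontal reflection
        for k in range(4):              # the four 90-degree rotations
            t = torch.rot90(x, k, dims=(-2, -1))
            if flip:
                t = torch.flip(t, dims=(-1,))
            diff = (model(t) - ref).abs().max().item()
            print(f"flip={flip}, rot={k * 90:3d}: max diff {diff:.2e}")  # ~0 expected
```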
In the paper we report throughput numbers. To replicate them, run the following command on an A100-80GB:

```bash
python experiments/complexity.py --amp --compile
```
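For a rough standalone throughput estimate (rather than the full script above), the usual PyTorch pattern looks as follows; the batch size, dtype, and warmup settings are assumptions and will affect the numbers:

```python
import time
import torch
from octic_vits import OcticVisionTransformer

device = "cuda"
model = OcticVisionTransformer(embed_dim=1280, depth=32, num_heads=16).to(device).eval()
model = torch.compile(model)  # analogous to the --compile flag above

batch, warmup, iters = 64, 10, 50
x = torch.randn(batch, 3, 224, 224, device=device)

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # --amp analogue
    for i in range(warmup + iters):
        if i == warmup:               # start timing after compilation and warmup
            torch.cuda.synchronize()
            start = time.perf_counter()
        model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"throughput: {batch * iters / elapsed:.1f} images/s")
```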
- Release the D8 models + weights
- Add to timm library
Stronger ViTs with Octic Equivariance code is released under the Apache License 2.0. See LICENSE for additional details. Training recipes are taken from DeiT III and DINOv2, and evaluation is taken from capi, all released under the Apache License 2.0.
Code structure is inspired by capi and RoMa.
If you find this repository useful, please consider giving it a star ⭐ and a citation:
```bibtex
@misc{nordström2025strongervitsocticequivariance,
  title={Stronger ViTs With Octic Equivariance},
  author={David Nordström and Johan Edstedt and Fredrik Kahl and Georg Bökman},
  year={2025},
  eprint={2505.15441},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.15441},
}
```