David Nordström · Johan Edstedt · Fredrik Kahl · Georg Bökman
Incorporating octic layers into Vision Transformers (ViTs) reduces the computational complexity while maintaining or improving representational power. We provide a PyTorch implementation for easy integration into existing ViT pipelines.
In the `octic_vits` folder you will find all the components needed to build octic-equivariant Vision Transformers (intended to be compatible with the timm library). For example, to create an octic ViT-H you can run the following:
```python
from octic_vits import OcticVisionTransformer

model = OcticVisionTransformer(embed_dim=1280, depth=32, num_heads=16)
```
This defaults to a hybrid model in which the first half of the blocks are octic and the remaining half are standard (i.e. this model has approximately 40% fewer FLOPs than a regular ViT-H). To instead obtain an invariant model, simply set `invariant=True`. You can further choose the number of octic blocks `k` by setting `octic_equi_break_layer=k`, as in the sketch below.
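For example, a minimal sketch using only the keyword arguments described above (all other settings are left at their defaults):

```python
from octic_vits import OcticVisionTransformer

# Hybrid ViT-H with the first 8 of 32 blocks octic
# (instead of the default first half, i.e. 16 blocks).
hybrid_8 = OcticVisionTransformer(
    embed_dim=1280, depth=32, num_heads=16,
    octic_equi_break_layer=8,
)

# Invariant ViT-H: features are made invariant to the octic group.
invariant = OcticVisionTransformer(
    embed_dim=1280, depth=32, num_heads=16,
    invariant=True,
)
```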
Code based on the official DeiT repo has been placed in the `deit` folder, and code based on the official DINOv2 repo in the `dinov2` folder.
Code to reproduce the experiments can be found in the `experiments` folder. Below follow general instructions on how to run it and how to obtain pretrained model weights.
All the code is written with the intent of being run on a Slurm cluster using submitit, so you must first set up the cluster settings in `utils/cluster.py`. If you intend to run it with `torchrun` instead, it should work out of the box. Also, make sure to run `export PYTHONPATH=$(pwd)` from the root of this repository so that relative imports work as intended.
For DINOv2 we use the same environment as the original repo, and the same goes for DeiT. Additional miscellaneous packages, e.g. submitit, need to be installed separately. Since DeiT III is deprecated, we provide some additional guidance on its installation in DEIT_ENV.md.
We follow the DINOv2 IN1K data structure. As such, the root directory of the dataset should hold the following contents:
```
<ROOT>/test/ILSVRC2012_test_00000001.JPEG
<ROOT>/test/[..]
<ROOT>/test/ILSVRC2012_test_00100000.JPEG
<ROOT>/train/n01440764/n01440764_10026.JPEG
<ROOT>/train/[...]
<ROOT>/train/n15075141/n15075141_9993.JPEG
<ROOT>/val/n01440764/ILSVRC2012_val_00000293.JPEG
<ROOT>/val/[...]
<ROOT>/val/n15075141/ILSVRC2012_val_00049174.JPEG
<ROOT>/labels.txt
```
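A quick, purely illustrative way to check that your dataset root matches this layout (paths taken from the listing above):

```python
from pathlib import Path

root = Path("<ROOT>")  # replace with your dataset root
for name in ("train", "val", "test", "labels.txt"):
    print(f"{name}: {'ok' if (root / name).exists() else 'MISSING'}")
```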
The provided dataset implementation expects a few additional metadata files to be present under the extra directory:
```
<EXTRA>/class-ids-TRAIN.npy
<EXTRA>/class-ids-VAL.npy
<EXTRA>/class-names-TRAIN.npy
<EXTRA>/class-names-VAL.npy
<EXTRA>/entries-TEST.npy
<EXTRA>/entries-TRAIN.npy
<EXTRA>/entries-VAL.npy
```
These metadata files can be generated (once) with the following lines of Python code:
```python
from dinov2.data.datasets import ImageNet

for split in ImageNet.Split:
    dataset = ImageNet(split=split, root="<ROOT>", extra="<EXTRA>")
    dataset.dump_extra()
```
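Once the metadata has been dumped, the dataset can be used directly. A minimal sanity check (assuming the standard dataset interface of the DINOv2 ImageNet class):

```python
from dinov2.data.datasets import ImageNet

# Instantiate the train split now that the entries files exist.
dataset = ImageNet(split=ImageNet.Split.TRAIN, root="<ROOT>", extra="<EXTRA>")
print(f"train samples: {len(dataset)}")  # ~1.28 M for the full IN1K train split
```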
For segmentation evaluation we use the code from capi and as the creator of said repository is very helpful, he has enabled automatic downloading of the datasets. For more information consult the original repo.
Download the weights from here to reproduce the evaluation metrics. The DINOv2 weights only include the teacher backbone. HERE is a link to a Google Drive that contains all files of interest.
DeiT III (supervised ImageNet-1K classification):

| model | # of params | # of FLOPs | ImageNet Top-1 | weights | logs |
|---|---|---|---|---|---|
| Hybrid ViT-H/14 | 356 M | 102 G | 85.0% | weights | logs |
| Invariant ViT-H/14 | 362 M | 104 G | 84.7% | weights | logs |
| Hybrid ViT-L/16 | 171 M | 38 G | 84.5% | weights | logs |
| Invariant ViT-L/16 | 175 M | 39 G | 84.0% | weights | logs |
DINOv2 (self-supervised pretraining, evaluated with a linear probe and k-NN):

| model | # of FLOPs | ImageNet linear | ImageNet knn | weights | logs |
|---|---|---|---|---|---|
| ViT-H/16 | 128 G | 81.7% | 81.0% | weights | logs |
| Hybrid ViT-H/16 | 78 G | 82.2% | 81.4% | weights | logs |
| Invariant ViT-H/16 | 78 G | 81.9% | 80.9% | weights | logs |
| ViT-L/16 | 62 G | 80.9% | 80.5% | weights | logs |
| Hybrid ViT-L/16 | 38 G | 81.3% | 80.8% | weights | logs |
| Invariant ViT-L/16 | 38 G | 81.2% | 80.4% | weights | logs |
We use the Hybrid ViT-H model as an example (as it is the best performing) to show how to run evaluation; replace it with whichever model you want to test.
After downloading the weights, you should be able to run the following command:

```bash
python experiments/eval_deit.py --model hybrid_deit_huge_patch14 --eval pretrained_models/hybrid_deit_huge_patch14.pth
```

This should give:

```
* Acc@1 84.996 Acc@5 96.390 loss 0.799
```
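If you prefer to load the checkpoint directly in Python (e.g. for a custom evaluation loop), something along the following lines should work. This is only a hedged sketch: the `"model"` key, the `patch_size` keyword, and `strict=False` are assumptions about the checkpoint layout and the timm-style constructor, so adapt as needed:

```python
import torch
from octic_vits import OcticVisionTransformer

# Hypothetical manual loading of the DeiT III hybrid ViT-H/14 checkpoint.
ckpt = torch.load(
    "pretrained_models/hybrid_deit_huge_patch14.pth",
    map_location="cpu",
    weights_only=False,  # DeiT-style checkpoints may contain non-tensor metadata
)
state_dict = ckpt.get("model", ckpt)  # assumed to be nested under "model"

model = OcticVisionTransformer(embed_dim=1280, depth=32, num_heads=16, patch_size=14)
missing, unexpected = model.load_state_dict(state_dict, strict=False)
model.eval()
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```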
For classification, run:

```bash
python experiments/eval_dinov2_classification.py output_dir --config-file dinov2/configs/eval/hybrid_vith16.yaml --pretrained-weights pretrained_models/hybrid_dinov2_huge_patch16.pth
```

This should give an accuracy of 82.2% and 81.4% for linear and knn, respectively.
For segmentation, run:

```bash
python experiments/eval_dinov2_segmentation.py model_path=dinov2/eval/segmentation/dinov2_loader.py model_loader_kwargs.model_name=dinov2_hybrid_vith16 model_loader_kwargs.weights=pretrained_models/hybrid_dinov2_huge_patch16.pth distributed=True ntasks_per_node=4 account=... gpus-per-node=4 nodes=1 output_dir=./output_dir
```

This should give an mIoU of 35.1 (linear) and 31.1 (knn) for ADE20K, and 70.8 (linear) and 61.7 (knn) for VOC2012.
Per-GPU batch sizes are adjusted to work well on A100-40GB GPUs. Feel free to adjust them for your setup while making sure the effective batch size remains the same (2048 for DeiT III and 1024 for DINOv2); see the example below.
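For reference, the effective batch size is simply the per-GPU batch size times the total number of GPUs (times any gradient-accumulation steps); the values below are just an example:

```python
# Example: keeping the DeiT III effective batch size of 2048 on different hardware.
per_gpu_batch = 64      # assumed value that fits in memory on your GPUs
gpus_per_node = 8
nodes = 4
grad_accum_steps = 1    # increase if you have to shrink per_gpu_batch further

effective_batch = per_gpu_batch * gpus_per_node * nodes * grad_accum_steps
assert effective_batch == 2048, effective_batch
```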
To launch distributed DeiT III training, run:

```bash
python experiments/train_deit.py --model hybrid_deit_huge_patch14
```
To launch distributed DINOv2 training, run:

```bash
python experiments/train_dinov2.py --config-file dinov2/configs/train/hybrid_vith16.yaml --ngpus 4 --nodes 2
```
We have provided a utility file to verify octic equivariance (and invariance). Simply run:
```bash
python experiments/test_equivariance.py
```
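For intuition, invariance is also easy to probe by hand: an invariant model should give (numerically) identical outputs for all eight octic transformations of the input, i.e. the four 90° rotations with and without a reflection. A minimal sketch, assuming a timm-style forward pass on 224×224 inputs:

```python
import torch
from octic_vits import OcticVisionTransformer

model = OcticVisionTransformer(
    embed_dim=1280, depth=32, num_heads=16, invariant=True
).eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    ref = model(x)
    for flip in (False, True):          # with / without horizontal reflection
        for k in range(4):              # the four 90-degree rotations
            t = torch.rot90(x, k, dims=(-2, -1))
            if flip:
                t = torch.flip(t, dims=(-1,))
            diff = (model(t) - ref).abs().max().item()
            print(f"flip={flip}, rot={k * 90:3d}: max diff {diff:.2e}")  # ~0 expected
```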
In the paper we report throughput numbers. To replicate them, run the following command on an A100-80GB:

```bash
python experiments/complexity.py --amp --compile
```
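For a rough standalone throughput estimate (rather than the full script above), the usual PyTorch pattern looks as follows; the batch size, dtype, and warmup settings are assumptions and will affect the numbers:

```python
import time
import torch
from octic_vits import OcticVisionTransformer

device = "cuda"
model = OcticVisionTransformer(embed_dim=1280, depth=32, num_heads=16).to(device).eval()
model = torch.compile(model)  # analogous to the --compile flag above

batch, warmup, iters = 64, 10, 50
x = torch.randn(batch, 3, 224, 224, device=device)

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # --amp analogue
    for i in range(warmup + iters):
        if i == warmup:               # start timing after compilation and warmup
            torch.cuda.synchronize()
            start = time.perf_counter()
        model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"throughput: {batch * iters / elapsed:.1f} images/s")
```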
- Release the D8 models + weights
- Add to timm library
Stronger ViTs with Octic Equivariance code is released under the Apache License 2.0. See LICENSE for additional details. Training recipes are taken from DeiT III and DINOv2, and evaluation is taken from capi, all released under the Apache License 2.0.
Code structure is inspired by capi and RoMa.
If you find this repository useful, please consider giving it a star ⭐ and a citation:
```bibtex
@misc{nordström2025strongervitsocticequivariance,
  title={Stronger ViTs With Octic Equivariance},
  author={David Nordström and Johan Edstedt and Fredrik Kahl and Georg Bökman},
  year={2025},
  eprint={2505.15441},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.15441},
}
```