Vision Transformer from Scratch in PyTorch

This is a simplified from-scratch PyTorch implementation of the Vision Transformer (ViT) with detailed steps (refer to model.py).

Overview:

  • The default network is a scaled-down version of the original ViT architecture from the ViT paper.
  • It has only 200k-800k parameters depending on the embedding dimension (the original ViT-Base has 86 million).
  • Tested on MNIST, FashionMNIST, SVHN, CIFAR10, and CIFAR100 datasets.
  • Uses a smaller patch size of 4 (see the patch-embedding sketch after this list).
  • Can be scaled to larger datasets by increasing the model parameters and the patch size.
  • Includes an option to use PyTorch's built-in transformer layers in place of the from-scratch implementation when defining the ViT.
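
The patch-embedding step is easiest to follow as a shape walk-through. The sketch below is a minimal, hypothetical illustration (the class and argument names are chosen for this example and are not necessarily those used in model.py) of how a 28x28 MNIST image becomes a sequence of 49 patch tokens plus a class token at patch size 4:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to embed_dim.

    A Conv2d with kernel_size == stride == patch_size is equivalent to flattening
    each patch and applying a shared linear projection.
    """
    def __init__(self, n_channels=1, image_size=28, patch_size=4, embed_dim=64):
        super().__init__()
        self.proj = nn.Conv2d(n_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        n_patches = (image_size // patch_size) ** 2        # 49 for 28x28, 64 for 32x32
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, embed_dim))

    def forward(self, x):                                  # x: (B, C, H, W)
        x = self.proj(x)                                   # (B, embed_dim, H/4, W/4)
        x = x.flatten(2).transpose(1, 2)                   # (B, n_patches, embed_dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)     # one class token per sample
        x = torch.cat([cls, x], dim=1)                     # (B, n_patches + 1, embed_dim)
        return x + self.pos_embed                          # add learned positional embedding

# Quick shape check on an MNIST-sized input
tokens = PatchEmbedding()(torch.randn(2, 1, 28, 28))
print(tokens.shape)  # torch.Size([2, 50, 64]) -> 49 patch tokens + 1 class token
```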

Run commands (also available in scripts.sh):

| Dataset | Run command | Test Acc |
|---|---|---|
| MNIST | python main.py --dataset mnist --epochs 100 | 99.5 |
| Fashion MNIST | python main.py --dataset fmnist | 92.3 |
| SVHN | python main.py --dataset svhn --n_channels 3 --image_size 32 --embed_dim 128 | 96.2 |
| CIFAR10 | python main.py --dataset cifar10 --n_channels 3 --image_size 32 --embed_dim 128 | 86.3 (82.5 w/o RandAug) |
| CIFAR100 | python main.py --dataset cifar100 --n_channels 3 --image_size 32 --embed_dim 128 | 59.6 (55.8 w/o RandAug) |

The use_torch_transformer_layers argument (in main.py) switches between PyTorch's built-in transformer layers and the from-scratch implementation when defining the Vision Transformer's encoder and its layers (code in model.py).
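
A rough sketch of what that switch might look like (the function and class names below are illustrative, not the ones in model.py; the repo's from-scratch path implements attention by hand, whereas this sketch falls back on nn.MultiheadAttention to stay short):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """A minimal pre-norm transformer encoder block, standing in for the hand-written one."""
    def __init__(self, embed_dim, n_heads, forward_mul, dropout):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, forward_mul * embed_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(forward_mul * embed_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # self-attention + residual
        x = x + self.mlp(self.norm2(x))                      # feed-forward + residual
        return x

def build_encoder(embed_dim=64, n_layers=6, n_heads=4, forward_mul=2,
                  dropout=0.1, use_torch_transformer_layers=False):
    """Return either PyTorch's built-in encoder stack or a stack of the block above."""
    if use_torch_transformer_layers:
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads,
            dim_feedforward=forward_mul * embed_dim,
            dropout=dropout, batch_first=True)               # tokens are (batch, seq, embed_dim)
        return nn.TransformerEncoder(layer, num_layers=n_layers)
    return nn.Sequential(*[EncoderBlock(embed_dim, n_heads, forward_mul, dropout)
                           for _ in range(n_layers)])
```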

Transformer Config:

| Config | MNIST and FMNIST | SVHN and CIFAR |
|---|---|---|
| Input Size | 1 x 28 x 28 | 3 x 32 x 32 |
| Patch Size | 4 | 4 |
| Sequence Length | 7*7 = 49 | 8*8 = 64 |
| Embedding Size | 64 | 128 |
| Parameters | 210k | 820k |
| Num of Layers | 6 | 6 |
| Num of Heads | 4 | 4 |
| Forward Multiplier | 2 | 2 |
| Dropout | 0.1 | 0.1 |
Further optimizing the configuration can provide additional performance gains.
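
The sequence-length row follows directly from the input and patch sizes in the table, as a quick check shows:

```python
# Each image is cut into (image_size // patch_size) ** 2 non-overlapping patches.
for name, image_size, patch_size in [("MNIST/FMNIST", 28, 4), ("SVHN/CIFAR", 32, 4)]:
    seq_len = (image_size // patch_size) ** 2
    print(f"{name}: ({image_size} // {patch_size})^2 = {seq_len} patches")
# MNIST/FMNIST: (28 // 4)^2 = 49 patches
# SVHN/CIFAR: (32 // 4)^2 = 64 patches
```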
