# Visual Attention Analysis with Spatial Transformer Networks for Handwritten Digit Classification on MNIST
Step 1: Clone the repository:

```shell
git clone https://github.com/biswassanket/STN_FGC.git
```

Step 2: Enter the project directory:

```shell
cd STN_FGC
```

Step 3: Create the conda environment:

```shell
conda env create -f environment.yml
```

Step 4: Activate the environment:

```shell
conda activate stn_fgc
```

Step 5: Run a model variant.

- To run the base STN with standard Conv layers:

  ```shell
  $ python main.py --stn
  ```

- To run the STN with CoordConv layers and the CoordConv localization network:

  ```shell
  $ python main.py --stncoordconv --localization
  ```

- To run the Vision Transformer baseline:

  ```shell
  $ python main.py --vit
  ```

Step 6: For a detailed analysis of the visual attention models experimented with here, see the complete report:
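For readers who want a feel for what the `--stn` variant does before running it, here is a minimal PyTorch sketch of a spatial transformer for 1x28x28 MNIST inputs. This is not the repository's actual implementation; the layer sizes and class name are illustrative. The key pieces are a small localization network that regresses 6 affine parameters, followed by `affine_grid` and `grid_sample` to warp the input:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSTN(nn.Module):
    """Illustrative spatial transformer for 1x28x28 MNIST digits."""

    def __init__(self):
        super().__init__()
        # Localization network: small CNN that summarizes the input image.
        # 28 -> conv7 -> 22 -> pool -> 11 -> conv5 -> 7 -> pool -> 3
        self.localization = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        # Regressor: predicts 6 affine parameters (a 2x3 matrix) per image.
        self.fc_loc = nn.Sequential(
            nn.Linear(10 * 3 * 3, 32), nn.ReLU(), nn.Linear(32, 6)
        )
        # Initialize the last layer to the identity transform so the
        # network starts by passing the image through unchanged.
        self.fc_loc[2].weight.data.zero_()
        self.fc_loc[2].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float)
        )

    def forward(self, x):
        theta = self.fc_loc(self.localization(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```

A downstream classifier would then consume the warped output in place of the raw image. Because of the identity initialization, the module initially reproduces its input and only learns to attend (crop, rotate, scale) as training progresses.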
| Model Variant | Accuracy | Best Epoch |
|---|---|---|
| Simple Conv | 0.9879 | 48 |
| Simple STN+Conv | 0.9889 | 44 |
| Simple STN+CoordConv | 0.9850 | 43 |
| Simple STN+CoordConv+localization | 0.9910 | 47 |
| Simple STN+CoordConv+localization+r-channel | 0.9868 | 40 |
| Vision Transformers | 0.9844 | 49 |
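The CoordConv and r-channel variants in the table differ from plain convolution by concatenating coordinate channels to the feature map before convolving. As a rough sketch of the idea (again, not the repo's actual code; the class name and layer sizes are made up for illustration):

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Conv2d that appends normalized x/y coordinate channels (and an
    optional radius channel) to its input before convolving."""

    def __init__(self, in_channels, out_channels, with_r=False, **kwargs):
        super().__init__()
        self.with_r = with_r
        extra = 3 if with_r else 2  # x, y, and optionally r
        self.conv = nn.Conv2d(in_channels + extra, out_channels, **kwargs)

    def forward(self, x):
        n, _, h, w = x.shape
        # Coordinate channels normalized to [-1, 1].
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(n, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(n, 1, h, w)
        feats = [x, xs, ys]
        if self.with_r:
            # The "r-channel": distance from the image center.
            feats.append(torch.sqrt(xs ** 2 + ys ** 2))
        return self.conv(torch.cat(feats, dim=1))
```

The extra channels give the convolution explicit access to position, which is what the localization network of an STN needs to regress a spatial transform; `with_r=True` corresponds to the r-channel row in the table above.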
Enjoy playing with the models! Stay tuned: more implementations of visual attention models on fine-grained image classification tasks are coming soon. Thank you, and apologies for any bugs, as usual.