Focal Transformer
This is the official implementation of our Focal Transformer -- "Focal Self-attention for Local-Global Interactions in Vision Transformers", by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.
Introduction
Our Focal Transformer introduces a new self-attention mechanism, called focal self-attention, for vision transformers. In this mechanism, each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity, and can therefore capture both short- and long-range visual dependencies efficiently and effectively.
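The sketch below illustrates this fine/coarse pattern in PyTorch: queries within each non-overlapping window attend, in a single softmax, to the full-resolution tokens of their own window (fine level) and to an average-pooled summary of the whole feature map (coarse level). It is a minimal illustration under simplifying assumptions (single head, no learned query/key/value projections, window-based rather than per-token neighborhoods), not the repository's implementation.

```python
import torch
import torch.nn.functional as F

def focal_attention_sketch(x, window=7, pool_stride=8):
    """Toy focal self-attention over a (B, H, W, C) feature map.

    Assumes H and W are divisible by both `window` and `pool_stride`.
    Single head and no learned projections: queries/keys/values are the
    raw features, which keeps the sketch short.
    """
    B, H, W, C = x.shape
    scale = C ** -0.5

    # Coarse level: average-pool the whole map so far-away regions
    # contribute a small number of summarized keys/values.
    coarse = F.avg_pool2d(x.permute(0, 3, 1, 2), pool_stride)      # (B, C, Hc, Wc)
    coarse = coarse.flatten(2).transpose(1, 2)                     # (B, Nc, C)

    # Fine level: partition the map into non-overlapping windows so each
    # query keeps full-resolution access to its local neighborhood.
    fine = x.reshape(B, H // window, window, W // window, window, C)
    fine = fine.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)  # (B*nW, Nw, C)
    n_windows = fine.shape[0] // B

    # Every window sees its own fine tokens plus the shared coarse tokens.
    k = torch.cat([fine, coarse.repeat_interleave(n_windows, dim=0)], dim=1)
    v = k
    attn = (fine @ k.transpose(-2, -1)) * scale                    # (B*nW, Nw, Nw+Nc)
    out = attn.softmax(dim=-1) @ v                                 # (B*nW, Nw, C)

    # Merge the windows back into the (B, H, W, C) layout.
    out = out.reshape(B, H // window, W // window, window, window, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

# Example: a 56x56 map with 96 channels, roughly a stage-1 feature size.
feat = torch.randn(2, 56, 56, 96)
print(focal_attention_sketch(feat).shape)   # torch.Size([2, 56, 56, 96])
```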
With our Focal Transformers, we achieved superior performance over the state-of-the-art vision Transformers on a range of public benchmarks.
In particular, our Focal Transformer models with a moderate size of 51.1M parameters and a larger size of 89.8M parameters achieve 83.6% and 84.0% Top-1 accuracy, respectively, on ImageNet classification at 224x224 resolution. Using Focal Transformers as the backbones, we obtain consistent and substantial improvements over the current state of the art across 6 different object detection methods trained with standard 1x and 3x schedules. Our largest Focal Transformer yields 58.7/58.9 box mAP and 50.9/51.3 mask mAP on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K semantic segmentation.
Benchmarking
Image Classification on ImageNet-1K
| Model | Pretrain | Use Conv | Resolution | acc@1 | acc@5 | #params | FLOPs | Checkpoint | Config |
|---|---|---|---|---|---|---|---|---|---|
| Focal-T | IN-1K | No | 224 | 82.2 | 95.9 | 28.9M | 4.9G | download | yaml |
| Focal-T | IN-1K | Yes | 224 | 82.7 | 96.1 | 30.8M | 4.9G | download | yaml |
| Focal-S | IN-1K | No | 224 | 83.6 | 96.2 | 51.1M | 9.4G | download | yaml |
| Focal-S | IN-1K | Yes | 224 | 83.8 | 96.5 | 53.1M | 9.4G | download | yaml |
| Focal-B | IN-1K | No | 224 | 84.0 | 96.5 | 89.8M | 16.4G | download | yaml |

Object Detection and Instance Segmentation on COCO
Mask R-CNN
| Backbone | Pretrain | Lr Schd | #params | FLOPs | box mAP | mask mAP |
|---|---|---|---|---|---|---|
| Focal-T | ImageNet-1K | 1x | 49M | 291G | 44.8 | 41.0 |
| Focal-T | ImageNet-1K | 3x | 49M | 291G | 47.2 | 42.7 |
| Focal-S | ImageNet-1K | 1x | 71M | 401G | 47.4 | 42.8 |
| Focal-S | ImageNet-1K | 3x | 71M | 401G | 48.8 | 43.8 |
| Focal-B | ImageNet-1K | 1x | 110M | 533G | 47.8 | 43.2 |
| Focal-B | ImageNet-1K | 3x | 110M | 533G | 49.0 | 43.7 |

RetinaNet
| Backbone | Pretrain | Lr Schd | #params | FLOPs | box mAP |
|---|---|---|---|---|---|
| Focal-T | ImageNet-1K | 1x | 39M | 265G | 43.7 |
| Focal-T | ImageNet-1K | 3x | 39M | 265G | 45.5 |
| Focal-S | ImageNet-1K | 1x | 62M | 367G | 45.6 |
| Focal-S | ImageNet-1K | 3x | 62M | 367G | 47.3 |
| Focal-B | ImageNet-1K | 1x | 101M | 514G | 46.3 |
| Focal-B | ImageNet-1K | 3x | 101M | 514G | 46.9 |

Other detection methods
| Backbone | Pretrain | Method | Lr Schd | #params | FLOPs | box mAP |
|---|---|---|---|---|---|---|
| Focal-T | ImageNet-1K | Cascade Mask R-CNN | 3x | 87M | 770G | 51.5 |
| Focal-T | ImageNet-1K | ATSS | 3x | 37M | 239G | 49.5 |
| Focal-T | ImageNet-1K | RepPointsV2 | 3x | 45M | 491G | 51.2 |
| Focal-T | ImageNet-1K | Sparse R-CNN | 3x | 111M | 196G | 49.0 |

Semantic Segmentation on ADE20K
| Backbone | Pretrain | Method | Resolution | Iters | #params | FLOPs | mIoU | mIoU (MS) |
|---|---|---|---|---|---|---|---|---|
| Focal-T | ImageNet-1K | UPerNet | 512x512 | 160k | 62M | 998G | 45.8 | 47.0 |
| Focal-S | ImageNet-1K | UPerNet | 512x512 | 160k | 85M | 1130G | 48.0 | 50.0 |
| Focal-B | ImageNet-1K | UPerNet | 512x512 | 160k | 126M | 1354G | 49.0 | 50.5 |
| Focal-L | ImageNet-22K | UPerNet | 640x640 | 160k | 240M | 3376G | 54.0 | 55.4 |

Getting Started
- Please follow get_started_for_image_classification.md to get started for image classification.
- Please follow get_started_for_object_detection.md to get started for object detection.
- Please follow get_started_for_semantic_segmentation.md to get started for semantic segmentation.
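If you just want to sanity-check a checkpoint downloaded from the tables above before following these guides, the snippet below is one way to do it. It assumes the file is a standard PyTorch checkpoint whose weights sit under a 'model' key, as in the Swin Transformer codebase this repository builds on, and the filename is hypothetical; the getting-started documents remain the authoritative instructions.

```python
import torch

# Hypothetical filename; use the path of the checkpoint you downloaded
# from the classification table above.
ckpt = torch.load("focal_small_patch4_window7_224.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

num_params = sum(p.numel() for p in state_dict.values()) / 1e6
print(f"{len(state_dict)} tensors, ~{num_params:.1f}M parameters")
# The printed size should be roughly comparable to the #params column above.
```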
Citation
If you find this repo useful for your project, please consider citing it with the following BibTeX entry:
@misc{yang2021focal,
title={Focal Self-attention for Local-Global Interactions in Vision Transformers},
author={Jianwei Yang and Chunyuan Li and Pengchuan Zhang and Xiyang Dai and Bin Xiao and Lu Yuan and Jianfeng Gao},
year={2021},
eprint={2107.00641},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Acknowledgement
Our codebase is built on top of Swin-Transformer. We thank the authors for the nicely organized code!
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.