ViT | Tokenless

May 08, 2026

How Transformers Finally Ate Vision – Isaac Robinson, Roboflow

Isaac Robinson of Roboflow explains why Vision Transformers (ViTs), despite their initial disadvantages in computational complexity and their lack of inductive bias, ultimately surpassed Convolutional Neural Networks (CNNs) on computer vision tasks. The talk covers the critical role of massive, ViT-specific pre-training methods such as MAE and DINO, the architectural evolution through models such as Swin, ConvNeXt, and Hiera, and optimizations borrowed from the LLM ecosystem. It closes with a discussion of the practical challenges of deploying large foundation models such as SAM, and of how Neural Architecture Search can bridge the gap between those models and practical deployment.

