This paper demonstrates an approach for learning highly semantic image
representations without relying on hand-crafted data augmentations. We
introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a
non-generative approach for self-supervised learning from images. The idea
behind I-JEPA is simple: from a single context block, predict the
representations of various target blocks in the same image. A core design
choice to guide I-JEPA towards producing semantic representations is the
masking strategy; specifically, it is crucial to (a) predict several target
blocks in the image, (b) sample target blocks with sufficiently large scale
(occupying 15%-20% of the image), and (c) use a sufficiently informative
(spatially distributed) context block. Empirically, when combined with Vision
Transformers, we find I-JEPA to be highly scalable. For instance, we train a
ViT-Huge/16 on ImageNet using 32 A100 GPUs in under 38 hours to achieve strong
downstream performance across a wide range of tasks requiring various levels of
abstraction, from linear classification to object counting and depth
prediction.
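
To make the multi-block masking strategy concrete, here is a minimal NumPy sketch. Only the 15%-20% target scale comes from the text above; the number of target blocks (4), the target aspect-ratio range, the large (85%-100%) unit-aspect context block, the 14x14 patch grid, and the removal of context patches that overlap the targets are illustrative assumptions reflecting typical I-JEPA settings, not a reproduction of the released code.

```python
import numpy as np

def sample_block(grid_h, grid_w, scale_range, aspect_range, rng):
    """Sample a rectangular block of patches with a random scale and aspect ratio."""
    scale = rng.uniform(*scale_range)           # fraction of the image the block covers
    aspect = rng.uniform(*aspect_range)         # height / width ratio of the block
    num_patches = scale * grid_h * grid_w
    h = max(1, min(grid_h, int(round(np.sqrt(num_patches * aspect)))))
    w = max(1, min(grid_w, int(round(np.sqrt(num_patches / aspect)))))
    top = rng.integers(0, grid_h - h + 1)
    left = rng.integers(0, grid_w - w + 1)
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    mask[top:top + h, left:left + w] = True
    return mask

def sample_ijepa_masks(grid_h=14, grid_w=14, num_targets=4, seed=0):
    """Multi-block masking sketch: several target blocks (15%-20% of the image each)
    plus one large, spatially distributed context block with target patches removed."""
    rng = np.random.default_rng(seed)
    targets = [sample_block(grid_h, grid_w, (0.15, 0.20), (0.75, 1.5), rng)
               for _ in range(num_targets)]
    context = sample_block(grid_h, grid_w, (0.85, 1.00), (1.0, 1.0), rng)
    # Keep the prediction task non-trivial: the context never includes target patches.
    context &= ~np.logical_or.reduce(targets)
    return context, targets

if __name__ == "__main__":
    context, targets = sample_ijepa_masks()
    print("context patches:", int(context.sum()), "of", context.size)
    for i, t in enumerate(targets):
        print(f"target block {i}: {int(t.sum())} patches")
```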
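
The prediction itself happens in representation space, which is what makes the approach non-generative: a context encoder embeds the visible context block, a slow-moving (EMA) copy of it embeds the full image to produce targets, and a predictor regresses the target-block representations from the context. The sketch below captures this loop with toy MLP stand-ins for the ViT encoder and predictor; the module interfaces, EMA rate, and L2 loss are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in for the ViT encoder: a per-patch MLP over toy patch features."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):                       # x: (num_patches, dim)
        return self.net(x)

class ToyPredictor(nn.Module):
    """Stand-in for the predictor: pools the context representation and, conditioned
    on each target patch's position, predicts that patch's representation."""
    def __init__(self, dim=64, num_patches=196):
        super().__init__()
        self.pos = nn.Embedding(num_patches, dim)
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, ctx_repr, target_idx):    # ctx_repr: (|ctx|, dim)
        pooled = ctx_repr.mean(dim=0, keepdim=True).expand(target_idx.numel(), -1)
        return self.net(torch.cat([pooled, self.pos(target_idx)], dim=-1))

def ijepa_step(ctx_enc, tgt_enc, predictor, opt, patches, ctx_idx, tgt_idx_list, ema=0.996):
    """One sketched I-JEPA-style update on a single image."""
    with torch.no_grad():                       # targets: EMA encoder on the full image
        full_repr = tgt_enc(patches)

    ctx_repr = ctx_enc(patches[ctx_idx])        # context encoder sees only the context block

    # Average L2 loss between predicted and target representations, per target block.
    loss = sum(F.mse_loss(predictor(ctx_repr, t), full_repr[t]) for t in tgt_idx_list)
    loss = loss / len(tgt_idx_list)

    opt.zero_grad()
    loss.backward()
    opt.step()

    with torch.no_grad():                       # target encoder tracks the context encoder (EMA)
        for p_t, p_c in zip(tgt_enc.parameters(), ctx_enc.parameters()):
            p_t.mul_(ema).add_(p_c, alpha=1.0 - ema)
    return loss.item()

if __name__ == "__main__":
    # Toy usage: 14x14 patch grid with random "patch features" and the masks
    # produced by sample_ijepa_masks from the masking sketch above.
    ctx_enc, tgt_enc, pred = ToyEncoder(), ToyEncoder(), ToyPredictor()
    tgt_enc.load_state_dict(ctx_enc.state_dict())
    opt = torch.optim.AdamW(list(ctx_enc.parameters()) + list(pred.parameters()), lr=1e-3)

    patches = torch.randn(196, 64)
    context_mask, target_masks = sample_ijepa_masks()
    ctx_idx = torch.from_numpy(np.flatnonzero(context_mask))
    tgt_idx_list = [torch.from_numpy(np.flatnonzero(m)) for m in target_masks]
    print("loss:", ijepa_step(ctx_enc, tgt_enc, pred, opt, patches, ctx_idx, tgt_idx_list))
```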