FT-DINOSAUR: Zero-Shot Object-Centric Representation Learning

1Max Planck Institute for Intelligent Systems, Tübingen
and University of Tübingen
2Mila, Université de Montréal
3University of Amsterdam
4University of Colorado, Boulder
*Equal Contribution

Summary

FT-DINOSAUR, which stands for Finetuned DINOSAUR, is a model for learning general-purpose object-centric representations. Once trained on a real-world dataset like COCO, FT-DINOSAUR transfers to a diverse range of real-world and synthetic domains. To achieve this, FT-DINOSAUR builds upon DINOSAUR and finetunes its pre-trained encoder for the task of object discovery. In the paper, we show that FT-DINOSAUR achieves state-of-the-art object discovery performance and also exhibits strong zero-shot transferability across 8 different datasets.

Model

FT-DINOSAUR model

The FT-DINOSAUR model is trained in two stages:
  • In the first stage, the image is passed through a trainable DINOv2 encoder, which outputs a set of patch features. Slot attention groups these patch features into a set of slots. The model is trained by reconstructing the patch features of the original (frozen) pre-trained DINOv2 encoder from the slots.
  • In the second stage, the model is initialized from the parameters obtained in the first stage and trained with the same pipeline, but with input images at a higher resolution of 518 x 518 (high-resolution adaptation). A minimal code sketch of this pipeline is shown below the list.
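
The sketch below illustrates the training pipeline under simplifying assumptions: the DINOv2 backbones are loaded via torch.hub, the class names, slot-attention settings, and decoder sizes are illustrative defaults rather than the paper's exact configuration, and top-k decoding is omitted. See the paper for the precise setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlotAttention(nn.Module):
    """Standard slot attention: iteratively groups patch features into K slots."""

    def __init__(self, num_slots=7, dim=384, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.zeros(1, 1, dim))
        self.slots_log_sigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim))
        self.norm_in, self.norm_slots, self.norm_mlp = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, feats):
        b, n, d = feats.shape
        feats = self.norm_in(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.slots_mu + self.slots_log_sigma.exp() * torch.randn(b, self.num_slots, d, device=feats.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = F.softmax(torch.einsum('bkd,bnd->bkn', q, k) * self.scale, dim=1)  # slots compete per patch
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)              # weighted mean over patches
            updates = torch.einsum('bkn,bnd->bkd', attn, v)
            slots = self.gru(updates.reshape(-1, d), slots.reshape(-1, d)).reshape(b, self.num_slots, d)
            slots = slots + self.mlp(self.norm_mlp(slots))
        return slots  # (B, K, D)


class MLPDecoder(nn.Module):
    """Decodes each slot independently into patch features plus alpha logits (DINOSAUR-style)."""

    def __init__(self, dim=384, num_patches=256, hidden=1024):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, 1, num_patches, dim))
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim + 1))

    def forward(self, slots):
        x = slots.unsqueeze(2) + self.pos                    # broadcast slots over patch positions: (B, K, N, D)
        out = self.mlp(x)
        feats, alpha = out[..., :-1], out[..., -1:]
        masks = F.softmax(alpha, dim=1)                      # per-patch competition between slots
        return (masks * feats).sum(dim=1), masks.squeeze(-1)  # reconstruction (B, N, D), masks (B, K, N)


class FTDINOSAUR(nn.Module):
    """Stage-1 training: finetune DINOv2 and reconstruct frozen DINOv2 features from slots.

    Stage 2 (high-resolution adaptation) would re-initialize from these weights and
    continue training on 518 x 518 inputs (37 x 37 = 1369 patches for patch size 14).
    """

    def __init__(self, num_slots=7, image_size=224):
        super().__init__()
        self.encoder = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
        self.target_encoder = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
        for p in self.target_encoder.parameters():
            p.requires_grad = False                # frozen copy provides reconstruction targets
        dim = self.encoder.embed_dim               # 384 for ViT-S/14
        self.slot_attention = SlotAttention(num_slots=num_slots, dim=dim)
        self.decoder = MLPDecoder(dim=dim, num_patches=(image_size // 14) ** 2)

    def forward(self, images):
        feats = self.encoder.forward_features(images)['x_norm_patchtokens']
        with torch.no_grad():
            targets = self.target_encoder.forward_features(images)['x_norm_patchtokens']
        slots = self.slot_attention(feats)
        recon, masks = self.decoder(slots)
        return F.mse_loss(recon, targets), masks   # feature-reconstruction loss + slot masks
```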

Examples

FT-DINOSAUR is trained on COCO and transferred to 7 other datasets. We visualize the masks predicted by FT-DINOSAUR in comparison to other baselines.

Comparison to Prior Work

Comparison performed on the COCO dataset. FT-DINOSAUR uses a ViT-B/14 encoder with top-k decoding and high-resolution adaptation. Results for DINOSAUR and FT-DINOSAUR are averaged over 3 seeds. SPOT, SlotDiffusion, and SAM were evaluated using their official checkpoints. We evaluate the foreground adjusted Rand index (FG-ARI), mean best overlap (mBO), panoptic adjusted Rand index (P-ARI), and panoptic quality (PQ); refer to the paper for details. A sketch of the FG-ARI computation is shown below the table.

Model           FG-ARI   mBO   P-ARI    PQ
DINOSAUR          40.5  27.7    37.1  14.4
SlotDiffusion     37.3  31.4    47.6  21.0
SPOT              37.0  34.8    52.4  21.3
FT-DINOSAUR       48.8  36.3    49.4  23.9
SAM (comp.)       12.1  19.0    10.8   9.4
SAM (best)        44.9  56.9    54.4  10.9
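
For reference, here is a minimal sketch of how FG-ARI is typically computed, assuming per-pixel integer mask arrays with ground-truth label 0 marking background; it follows the common object-discovery evaluation protocol rather than the paper's exact evaluation code.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score


def fg_ari(true_masks: np.ndarray, pred_masks: np.ndarray, bg_label: int = 0) -> float:
    """ARI between predicted and ground-truth segmentations, restricted to foreground pixels.

    true_masks, pred_masks: integer arrays of per-pixel instance ids with matching shapes.
    """
    fg = true_masks.flatten() != bg_label            # ignore ground-truth background pixels
    return adjusted_rand_score(true_masks.flatten()[fg], pred_masks.flatten()[fg])


# Toy example: two ground-truth objects; the prediction splits object 2 into two slots.
gt = np.array([[0, 1, 1], [0, 2, 2]])
pred = np.array([[3, 4, 4], [3, 5, 6]])
print(fg_ari(gt, pred))  # < 1.0 because object 2 is over-segmented
```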

Zero-shot Performance

We study the zero-shot transfer capabilities of the proposed FT-DINOSAUR approach in comparison to state-of-the-art unsupervised and supervised object-discovery approaches. All unsupervised approaches are trained on COCO and evaluated on the target datasets without any further training. A sketch of this evaluation loop is given below the tables.

                          MOVi-C         MOVi-E         ScanNet        YCB
Model                     FG-ARI   mBO   FG-ARI   mBO   FG-ARI   mBO   FG-ARI   mBO
DINOSAUR                    67.0  34.5     71.1  24.2     57.4  40.8     60.2  42.2
SlotDiffusion               66.9  43.6     67.6  26.4     52.0  51.7     62.5  59.2
SPOT                        63.0  40.8     47.8  21.5     48.6  43.2     52.9  45.1
FT-DINOSAUR (ViT-S/14)      71.3  44.2     71.1  29.9     54.8  48.4     67.4  54.5
FT-DINOSAUR (ViT-B/14)      73.3  42.9     69.7  27.9     55.8  48.6     70.1  54.1
SAM (comp.)                 57.6  45.3     38.5  27.4     45.8  45.5     46.9  40.9
SAM (best)                  79.7  73.5     84.7  69.7     62.2  64.7     69.4  69.8

                          ClevrTex       PASCAL VOC     EntitySeg      Average
Model                     FG-ARI   mBO   FG-ARI   mBO   FG-ARI   mBO   FG-ARI   mBO
DINOSAUR                    82.5  35.2     24.0  37.2     43.5  19.4     65.2  33.0
SlotDiffusion               77.0  45.0     21.1  42.0     43.7  25.1     62.6  41.3
SPOT                        63.3  40.0     21.2  50.6     41.7  27.4     53.0  37.0
FT-DINOSAUR (ViT-S/14)      86.0  50.1     24.0  37.6     48.1  28.4     67.8  42.5
FT-DINOSAUR (ViT-B/14)      83.9  45.9     25.9  37.8     49.7  29.0     68.0  40.8
SAM (comp.)                 82.9  70.3     31.0  51.5     25.9  16.5     53.5  45.7
SAM (best)                  94.0  90.0     31.1  64.2     53.4  51.0     76.1  73.2
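
The sketch below illustrates this zero-shot protocol: a single COCO-trained checkpoint is evaluated on each target dataset without any finetuning. Here, load_ft_dinosaur and SegmentationDataset are hypothetical placeholders rather than a released API, fg_ari refers to the sketch above, and the predicted patch-level masks are assumed to already match the resolution of the ground-truth masks (in practice they are upsampled to image resolution before scoring).

```python
import numpy as np
import torch
from torch.utils.data import DataLoader


@torch.no_grad()
def evaluate_zero_shot(model, dataset, batch_size=32):
    """Average FG-ARI of a frozen, COCO-trained model on an unseen dataset."""
    model.eval()
    scores = []
    for images, gt_masks in DataLoader(dataset, batch_size=batch_size):
        _, slot_masks = model(images)                    # (B, K, N) soft masks over patches
        pred = slot_masks.argmax(dim=1).cpu().numpy()    # hard slot assignment per patch
        for p, g in zip(pred, gt_masks.numpy()):
            scores.append(fg_ari(g, p))                  # fg_ari from the sketch above
    return float(np.mean(scores))


# Hypothetical usage -- checkpoint loader and dataset wrapper are placeholders:
# model = load_ft_dinosaur('ft_dinosaur_coco.ckpt')
# for name in ['movi_c', 'movi_e', 'scannet', 'ycb', 'clevrtex', 'pascal_voc', 'entityseg']:
#     print(name, evaluate_zero_shot(model, SegmentationDataset(name)))
```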

Implementation

Related Projects

  • DINOSAUR (ICLR 2023): object-centric representations for real-world images using self-supervised feature reconstruction.
  • VideoSAUR (NeurIPS 2023): object-centric representations for real-world videos using a DINOSAUR-style framework and a novel temporal similarity loss.

BibTeX


@article{Didolkar2024ZeroShotOCRL,
  title={Zero-Shot Object-Centric Representation Learning},
  author={Didolkar, Aniket and Zadaianchuk, Andrii and Goyal, Anirudh and Mozer, Mike and Bengio, Yoshua and Martius, Georg and Seitzer, Maximilian},
  year={2024},
  journal={arXiv:2408.09162},
  url={https://arxiv.org/abs/2408.09162}
}