FT-DINOSAUR: Zero-Shot Object-Centric Representation Learning

1Max Planck Institute for Intelligent Systems, Tübingen
and University of Tübingen
2Mila, Université de Montréal
3University of Amsterdam
4University of Colorado, Boulder
*Equal Contribution

Summary

FT-DINOSAUR, which stands for Finetuned DINOSAUR, is a model for learning general-purpose object-centric representations. Once trained on a real-world dataset like COCO, FT-DINOSAUR transfers to a diverse range of real-world and synthetic domains. To achieve this, FT-DINOSAUR builds upon DINOSAUR and finetunes its pre-trained encoder for the task of object discovery. In the paper, we show that FT-DINOSAUR achieves state-of-the-art object discovery performance and also exhibits strong zero-shot transferability across 8 different datasets.

Model

FT-DINOSAUR model

The FT-DINOSAUR model is trained in two stages:
  • In the first stage, the image is passed through a trainable DINOv2 encoder, which outputs a set of patch features. Slot attention groups these patch features into a set of slots. The model is trained by reconstructing the patch features of the original (frozen) pre-trained DINOv2 encoder from the slots.
  • In the second stage, the model is initialized from the parameters obtained in the first stage and trained with the same pipeline, but with input images at a higher resolution of 518 x 518 (high-resolution adaptation). A minimal code sketch of this pipeline is shown below the list.
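
The sketch below illustrates the training pipeline under simplifying assumptions: the DINOv2 backbones are loaded via torch.hub, the class names, slot-attention settings, and decoder sizes are illustrative defaults rather than the paper's exact configuration, and top-k decoding is omitted. See the paper for the precise setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlotAttention(nn.Module):
    """Standard slot attention: iteratively groups patch features into K slots."""

    def __init__(self, num_slots=7, dim=384, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.zeros(1, 1, dim))
        self.slots_log_sigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim))
        self.norm_in, self.norm_slots, self.norm_mlp = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, feats):
        b, n, d = feats.shape
        feats = self.norm_in(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.slots_mu + self.slots_log_sigma.exp() * torch.randn(b, self.num_slots, d, device=feats.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = F.softmax(torch.einsum('bkd,bnd->bkn', q, k) * self.scale, dim=1)  # slots compete per patch
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)              # weighted mean over patches
            updates = torch.einsum('bkn,bnd->bkd', attn, v)
            slots = self.gru(updates.reshape(-1, d), slots.reshape(-1, d)).reshape(b, self.num_slots, d)
            slots = slots + self.mlp(self.norm_mlp(slots))
        return slots  # (B, K, D)


class MLPDecoder(nn.Module):
    """Decodes each slot independently into patch features plus alpha logits (DINOSAUR-style)."""

    def __init__(self, dim=384, num_patches=256, hidden=1024):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, 1, num_patches, dim))
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim + 1))

    def forward(self, slots):
        x = slots.unsqueeze(2) + self.pos                    # broadcast slots over patch positions: (B, K, N, D)
        out = self.mlp(x)
        feats, alpha = out[..., :-1], out[..., -1:]
        masks = F.softmax(alpha, dim=1)                      # per-patch competition between slots
        return (masks * feats).sum(dim=1), masks.squeeze(-1)  # reconstruction (B, N, D), masks (B, K, N)


class FTDINOSAUR(nn.Module):
    """Stage-1 training: finetune DINOv2 and reconstruct frozen DINOv2 features from slots.

    Stage 2 (high-resolution adaptation) would re-initialize from these weights and
    continue training on 518 x 518 inputs (37 x 37 = 1369 patches for patch size 14).
    """

    def __init__(self, num_slots=7, image_size=224):
        super().__init__()
        self.encoder = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
        self.target_encoder = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
        for p in self.target_encoder.parameters():
            p.requires_grad = False                # frozen copy provides reconstruction targets
        dim = self.encoder.embed_dim               # 384 for ViT-S/14
        self.slot_attention = SlotAttention(num_slots=num_slots, dim=dim)
        self.decoder = MLPDecoder(dim=dim, num_patches=(image_size // 14) ** 2)

    def forward(self, images):
        feats = self.encoder.forward_features(images)['x_norm_patchtokens']
        with torch.no_grad():
            targets = self.target_encoder.forward_features(images)['x_norm_patchtokens']
        slots = self.slot_attention(feats)
        recon, masks = self.decoder(slots)
        return F.mse_loss(recon, targets), masks   # feature-reconstruction loss + slot masks
```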

Examples

FT-DINOSAUR is trained on COCO and transferred to 7 other datasets. We visualize the masks predicted by FT-DINOSAUR in comparison to other baselines.

Comparison to Prior Work

Comparison performed on the COCO dataset. FT-DINOSAUR uses a ViT-B/14 encoder with top-k decoding and high-resolution adaptation. Results for DINOSAUR and FT-DINOSAUR are averaged over 3 seeds. SPOT, SlotDiffusion, and SAM were evaluated using their official checkpoints. We evaluate the foreground adjusted Rand index (FG-ARI), mean best overlap (mBO), panoptic adjusted Rand index (P-ARI), and panoptic quality (PQ); refer to the paper for details. A sketch of the FG-ARI computation is shown below the table.

Model           FG-ARI   mBO   P-ARI    PQ
DINOSAUR          40.5  27.7    37.1  14.4
SlotDiffusion     37.3  31.4    47.6  21.0
SPOT              37.0  34.8    52.4  21.3
FT-DINOSAUR       48.8  36.3    49.4  23.9
SAM (comp.)       12.1  19.0    10.8   9.4
SAM (best)        44.9  56.9    54.4  10.9
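
For reference, here is a minimal sketch of how FG-ARI is typically computed, assuming per-pixel integer mask arrays with ground-truth label 0 marking background; it follows the common object-discovery evaluation protocol rather than the paper's exact evaluation code.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score


def fg_ari(true_masks: np.ndarray, pred_masks: np.ndarray, bg_label: int = 0) -> float:
    """ARI between predicted and ground-truth segmentations, restricted to foreground pixels.

    true_masks, pred_masks: integer arrays of per-pixel instance ids with matching shapes.
    """
    fg = true_masks.flatten() != bg_label            # ignore ground-truth background pixels
    return adjusted_rand_score(true_masks.flatten()[fg], pred_masks.flatten()[fg])


# Toy example: two ground-truth objects; the prediction splits object 2 into two slots.
gt = np.array([[0, 1, 1], [0, 2, 2]])
pred = np.array([[3, 4, 4], [3, 5, 6]])
print(fg_ari(gt, pred))  # < 1.0 because object 2 is over-segmented
```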

Zero-shot Performance

We study the zero-shot transfer capabilities of the proposed FT-DINOSAUR approach in comparison to state-of-the-art unsupervised and supervised object-discovery approaches. All unsupervised approaches are trained on COCO and evaluated on the target datasets without any further training. A sketch of this evaluation loop is given below the tables.

                          MOVi-C         MOVi-E         ScanNet        YCB
Model                     FG-ARI   mBO   FG-ARI   mBO   FG-ARI   mBO   FG-ARI   mBO
DINOSAUR                    67.0  34.5     71.1  24.2     57.4  40.8     60.2  42.2
SlotDiffusion               66.9  43.6     67.6  26.4     52.0  51.7     62.5  59.2
SPOT                        63.0  40.8     47.8  21.5     48.6  43.2     52.9  45.1
FT-DINOSAUR (ViT-S/14)      71.3  44.2     71.1  29.9     54.8  48.4     67.4  54.5
FT-DINOSAUR (ViT-B/14)      73.3  42.9     69.7  27.9     55.8  48.6     70.1  54.1
SAM (comp.)                 57.6  45.3     38.5  27.4     45.8  45.5     46.9  40.9
SAM (best)                  79.7  73.5     84.7  69.7     62.2  64.7     69.4  69.8

                          ClevrTex       PASCAL VOC     EntitySeg      Average
Model                     FG-ARI   mBO   FG-ARI   mBO   FG-ARI   mBO   FG-ARI   mBO
DINOSAUR                    82.5  35.2     24.0  37.2     43.5  19.4     65.2  33.0
SlotDiffusion               77.0  45.0     21.1  42.0     43.7  25.1     62.6  41.3
SPOT                        63.3  40.0     21.2  50.6     41.7  27.4     53.0  37.0
FT-DINOSAUR (ViT-S/14)      86.0  50.1     24.0  37.6     48.1  28.4     67.8  42.5
FT-DINOSAUR (ViT-B/14)      83.9  45.9     25.9  37.8     49.7  29.0     68.0  40.8
SAM (comp.)                 82.9  70.3     31.0  51.5     25.9  16.5     53.5  45.7
SAM (best)                  94.0  90.0     31.1  64.2     53.4  51.0     76.1  73.2
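
The sketch below illustrates this zero-shot protocol: a single COCO-trained checkpoint is evaluated on each target dataset without any finetuning. Here, load_ft_dinosaur and SegmentationDataset are hypothetical placeholders rather than a released API, fg_ari refers to the sketch above, and the predicted patch-level masks are assumed to already match the resolution of the ground-truth masks (in practice they are upsampled to image resolution before scoring).

```python
import numpy as np
import torch
from torch.utils.data import DataLoader


@torch.no_grad()
def evaluate_zero_shot(model, dataset, batch_size=32):
    """Average FG-ARI of a frozen, COCO-trained model on an unseen dataset."""
    model.eval()
    scores = []
    for images, gt_masks in DataLoader(dataset, batch_size=batch_size):
        _, slot_masks = model(images)                    # (B, K, N) soft masks over patches
        pred = slot_masks.argmax(dim=1).cpu().numpy()    # hard slot assignment per patch
        for p, g in zip(pred, gt_masks.numpy()):
            scores.append(fg_ari(g, p))                  # fg_ari from the sketch above
    return float(np.mean(scores))


# Hypothetical usage -- checkpoint loader and dataset wrapper are placeholders:
# model = load_ft_dinosaur('ft_dinosaur_coco.ckpt')
# for name in ['movi_c', 'movi_e', 'scannet', 'ycb', 'clevrtex', 'pascal_voc', 'entityseg']:
#     print(name, evaluate_zero_shot(model, SegmentationDataset(name)))
```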

Implementation

Related Projects

  • DINOSAUR (ICLR 2023): object-centric representations for real-world images using self-supervised feature reconstruction.
  • VideoSAUR (NeurIPS 2023): object-centric representations for real-world videos using a DINOSAUR-style framework and a novel temporal similarity loss.

BibTeX


@article{Didolkar2024ZeroShotOCRL,
  title={Zero-Shot Object-Centric Representation Learning},
  author={Didolkar, Aniket and Zadaianchuk, Andrii and Goyal, Anirudh and Mozer, Mike and Bengio, Yoshua and Martius, Georg and Seitzer, Maximilian},
  year={2024},
  journal={arXiv:2408.09162},
  url={https://arxiv.org/abs/2408.09162}
}