FT-DINOSAUR, which stands for Finetuned DINOSAUR, is a model for learning general-purpose object-centric representations. Once trained on a real-world dataset such as COCO, FT-DINOSAUR transfers to a diverse range of real-world and synthetic domains. To achieve this, FT-DINOSAUR builds upon DINOSAUR and finetunes the pre-trained encoder used in DINOSAUR for the task of object discovery. In the paper, we show that FT-DINOSAUR achieves state-of-the-art object discovery performance and also exhibits strong zero-shot transferability across 8 different datasets.
FT-DINOSAUR is trained on COCO and transferred to 7 other datasets. We visualize the masks obtained from FT-DINOSAUR in comparison to other baselines.
Comparison performed on the COCO dataset. FT-DINOSAUR uses a ViT-B/14 encoder with top-k decoding and high-resolution adaptation. Results for DINOSAUR and FT-DINOSAUR are averaged over 3 seeds. SPOT, SlotDiffusion, and SAM were evaluated using official checkpoints. We evaluate foreground adjusted Rand index (FG-ARI), mean best overlap (mBO), panoptic adjusted Rand index (P-ARI), and panoptic quality (PQ). Refer to the paper for details.
Model | FG-ARI | mBO | P-ARI | PQ |
---|---|---|---|---|
DINOSAUR | 40.5 | 27.7 | 37.1 | 14.4 |
SlotDiffusion | 37.3 | 31.4 | 47.6 | 21.0 |
SPOT | 37.0 | 34.8 | 52.4 | 21.3 |
FT-DINOSAUR | 48.8 | 36.3 | 49.4 | 23.9 |
SAM (comp.) | 12.1 | 19.0 | 10.8 | 9.4 |
SAM (best) | 44.9 | 56.9 | 54.4 | 10.9 |
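As a rough illustration of the two mask-level metrics used throughout, the sketch below computes FG-ARI and mBO for a single image with NumPy. It is not the evaluation code from the paper. It assumes that segmentations are integer ID maps, that ground-truth background is labeled `0` (the `background_id` parameter is hypothetical), and that scores are returned in [0, 1], whereas the tables appear to report them scaled by 100.

```python
import numpy as np


def adjusted_rand_index(true_ids, pred_ids):
    """Adjusted Rand index between two flat integer label arrays."""
    true_ids = np.asarray(true_ids).ravel()
    pred_ids = np.asarray(pred_ids).ravel()
    if true_ids.size < 2:
        return 1.0  # degenerate case: nothing to compare
    # Map arbitrary IDs to 0..K-1 and build the contingency table.
    _, t = np.unique(true_ids, return_inverse=True)
    _, p = np.unique(pred_ids, return_inverse=True)
    contingency = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(contingency, (t, p), 1)

    def pairs(x):
        # Number of unordered pairs, "x choose 2", elementwise.
        return x * (x - 1) / 2.0

    sum_cells = pairs(contingency).sum()
    sum_rows = pairs(contingency.sum(axis=1)).sum()
    sum_cols = pairs(contingency.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(true_ids.size)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # both partitions are trivial and identical
    return (sum_cells - expected) / (max_index - expected)


def fg_ari(true_seg, pred_seg, background_id=0):
    """ARI restricted to pixels that belong to a ground-truth object."""
    true_seg = np.asarray(true_seg)
    pred_seg = np.asarray(pred_seg)
    foreground = true_seg != background_id
    return adjusted_rand_index(true_seg[foreground], pred_seg[foreground])


def mean_best_overlap(true_seg, pred_seg, background_id=0):
    """Average, over ground-truth objects, of the best IoU with any predicted mask."""
    true_seg = np.asarray(true_seg)
    pred_seg = np.asarray(pred_seg)
    pred_masks = [pred_seg == i for i in np.unique(pred_seg)]
    best_ious = []
    for obj_id in np.unique(true_seg):
        if obj_id == background_id:
            continue
        gt_mask = true_seg == obj_id
        ious = [
            np.logical_and(gt_mask, m).sum() / np.logical_or(gt_mask, m).sum()
            for m in pred_masks
        ]
        best_ious.append(max(ious))
    return float(np.mean(best_ious)) if best_ious else 0.0
```

For example, a prediction that recovers every object mask exactly (up to a renaming of IDs) scores 1.0 on both metrics, while a prediction that merges all pixels into a single mask gets an FG-ARI of 0 whenever the image contains more than one object.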
We study the zero-shot transfer capabilities of the proposed FT-DINOSAUR approach in comparison to state-of-the-art unsupervised and supervised object discovery approaches. All unsupervised approaches are trained on COCO.
Model | MOVi-C FG-ARI | MOVi-C mBO | MOVi-E FG-ARI | MOVi-E mBO | ScanNet FG-ARI | ScanNet mBO | YCB FG-ARI | YCB mBO |
---|---|---|---|---|---|---|---|---|
DINOSAUR | 67.0 | 34.5 | 71.1 | 24.2 | 57.4 | 40.8 | 60.2 | 42.2 |
SlotDiffusion | 66.9 | 43.6 | 67.6 | 26.4 | 52.0 | 51.7 | 62.5 | 59.2 |
SPOT | 63.0 | 40.8 | 47.8 | 21.5 | 48.6 | 43.2 | 52.9 | 45.1 |
FT-DINOSAUR (ViT-S/14) | 71.3 | 44.2 | 71.1 | 29.9 | 54.8 | 48.4 | 67.4 | 54.5 |
FT-DINOSAUR (ViT-B/14) | 73.3 | 42.9 | 69.7 | 27.9 | 55.8 | 48.6 | 70.1 | 54.1 |
SAM (comp.) | 57.6 | 45.3 | 38.5 | 27.4 | 45.8 | 45.5 | 46.9 | 40.9 |
SAM (best) | 79.7 | 73.5 | 84.7 | 69.7 | 62.2 | 64.7 | 69.4 | 69.8 |
Model | ClevrTex FG-ARI | ClevrTex mBO | PASCAL VOC FG-ARI | PASCAL VOC mBO | EntitySeg FG-ARI | EntitySeg mBO | Average FG-ARI | Average mBO |
---|---|---|---|---|---|---|---|---|
DINOSAUR | 82.5 | 35.2 | 24.0 | 37.2 | 43.5 | 19.4 | 65.2 | 33.0 |
SlotDiffusion | 77.0 | 45.0 | 21.1 | 42.0 | 43.7 | 25.1 | 62.6 | 41.3 |
SPOT | 63.3 | 40.0 | 21.2 | 50.6 | 41.7 | 27.4 | 53.0 | 37.0 |
FT-DINOSAUR (ViT-S/14) | 86.0 | 50.1 | 24.0 | 37.6 | 48.1 | 28.4 | 67.8 | 42.5 |
FT-DINOSAUR (ViT-B/14) | 83.9 | 45.9 | 25.9 | 37.8 | 49.7 | 29.0 | 68.0 | 40.8 |
SAM (comp.) | 82.9 | 70.3 | 31.0 | 51.5 | 25.9 | 16.5 | 53.5 | 45.7 |
SAM (best) | 94.0 | 90.0 | 31.1 | 64.2 | 53.4 | 51.0 | 76.1 | 73.2 |
@article{Didolkar2024ZeroShotOCRL,
title={Zero-Shot Object-Centric Representation Learning},
author={Didolkar, Aniket and Zadaianchuk, Andrii and Goyal, Anirudh and Mozer, Mike and Bengio, Yoshua and Martius, Georg and Seitzer, Maximilian},
year={2024},
journal={arXiv:2408.09162},
url={https://arxiv.org/abs/2408.09162}
}