Exploring Self-Supervised Pretraining Datasets for Complex Scene Understanding

Document Type: Original Article

Authors

1 Computer Science Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt

2 Scientific Computing Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt

3 Computer Science Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt

Abstract

With the rapid advancement of deep learning research, many milestones have been achieved in the field of computer vision. However, most of these advances are only applicable when hand-annotated datasets are available. This is considered the current bottleneck of deep learning, which self-supervised learning aims to overcome. The self-supervised framework consists of a proxy task and a target task: the proxy task is a self-supervised task pretrained on unlabeled data, and its learned weights are transferred to the target task. The prevalent paradigm in self-supervised research is to pretrain on ImageNet, a single-object-centric dataset. In this work, we investigate whether this is the best choice when the target task is multi-object centric. We pretrain SimSiam, a non-contrastive self-supervised algorithm, on two different pretraining datasets: ImageNet100 (single-object centric) and COCO (multi-object centric). The transfer performance of each pretrained model is evaluated on the target task of multi-label classification using PascalVOC. Furthermore, we evaluate the two pretrained models on CityScapes, an autonomous driving dataset, in order to study the implications of the chosen pretraining dataset in a different domain. Our results show that the SimSiam model pretrained on COCO consistently outperforms the ImageNet100-pretrained model by approximately one point of mAP (57.4 vs. 58.3 mAP on CityScapes). This is notable since COCO is the smaller dataset. We conclude that pretraining self-supervised learning algorithms on multi-object-centric datasets is more efficient when the target task is multi-object centric, as in complex scene understanding tasks such as autonomous driving applications.
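As a rough illustration of the transfer setup described above, the sketch below shows how a self-supervised pretrained backbone might be reused for a multi-label classification target task. This is not the authors' code: the ResNet-50 encoder, the checkpoint filename, and the training hyperparameters are assumptions made only for illustration.

```python
# Minimal sketch (assumptions, not the authors' pipeline): transfer a
# SimSiam-pretrained encoder to multi-label classification (e.g., PascalVOC).
import os
import torch
import torch.nn as nn
from torchvision.models import resnet50

NUM_CLASSES = 20  # PascalVOC defines 20 object classes

# 1) Build the encoder and load proxy-task (self-supervised) weights.
encoder = resnet50(weights=None)
encoder.fc = nn.Identity()  # drop the supervised classification head

ckpt_path = "simsiam_coco.pth"  # hypothetical checkpoint name
if os.path.exists(ckpt_path):
    state = torch.load(ckpt_path, map_location="cpu")
    encoder.load_state_dict(state, strict=False)  # keep matching backbone keys only

# 2) Attach a fresh head for the target task.
model = nn.Sequential(encoder, nn.Linear(2048, NUM_CLASSES))

# 3) Multi-label targets are independent binary labels, so BCE-with-logits
#    replaces the usual softmax cross-entropy.
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# One illustrative training step on dummy data.
images = torch.randn(8, 3, 224, 224)
targets = torch.randint(0, 2, (8, NUM_CLASSES)).float()
optimizer.zero_grad()
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
```

In this setting, transfer performance is typically reported as mAP over the per-class binary predictions, which is the metric quoted in the abstract.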

Keywords