Abstract: The Contrastive Language-Image Pretraining (CLIP) model is widely used in downstream vision tasks. The few-shot learning paradigm has been commonly adopted to augment its ...
Abstract: Visual Semantic Navigation (VSN) requires an agent to navigate to a target object of a specified category in a previously unseen scene. To tackle this task, the agent must learn a nimble ...