Google AI Blog

Pic2Word: Mapping pictures to words for zero-shot composed image retrieval

Composed image retrieval (CIR) is the task of retrieving images using a query that pairs a reference image with text describing how that image should be modified to match the intended target. However, existing CIR methods require large amounts of labeled triplet data (a reference image, a text modification, and a target image), which makes them difficult to scale.

To address this, we propose Pic2Word, a method designed to perform a variety of CIR tasks without requiring labeled triplet data. Instead, we train the retrieval model on large-scale image-caption pairs and unlabeled images.

We add a lightweight mapping network to CLIP that maps an input image to a word token in the text encoder's input space. The mapping network is trained to reconstruct the image embedding in the language embedding space, optimizing only the mapping network while keeping the visual and text encoders frozen.
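
To make this concrete, below is a minimal PyTorch sketch of the idea, assuming frozen CLIP-style encoders with 512-dimensional embeddings. The encoder stubs (plain linear layers), the MLP architecture, the prompt handling, and the symmetric contrastive loss are illustrative stand-ins, not the exact Pic2Word training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 512  # assumed CLIP embedding size


class MappingNetwork(nn.Module):
    """Lightweight MLP that maps a frozen image embedding to a pseudo word token."""

    def __init__(self, dim: int = EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        return self.net(image_emb)


# Stand-ins for the frozen CLIP visual and text encoders (illustrative only);
# the text stub plays the role of encoding a prompt such as "a photo of [*]".
image_encoder = nn.Linear(3 * 224 * 224, EMB_DIM).eval()
text_encoder_with_token = nn.Linear(EMB_DIM, EMB_DIM).eval()
for p in list(image_encoder.parameters()) + list(text_encoder_with_token.parameters()):
    p.requires_grad_(False)

mapper = MappingNetwork()
optimizer = torch.optim.AdamW(mapper.parameters(), lr=1e-4)


def contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2


# One training step on a batch of unlabeled images.
images = torch.randn(8, 3 * 224 * 224)               # placeholder image batch
with torch.no_grad():
    image_emb = image_encoder(images)                 # frozen visual embedding
pseudo_token = mapper(image_emb)                      # map image -> pseudo word token
text_emb = text_encoder_with_token(pseudo_token)      # embed prompt containing the token
loss = contrastive_loss(text_emb, image_emb)          # pull the two embeddings together
optimizer.zero_grad()
loss.backward()                                       # gradients flow only into the mapper
optimizer.step()
```

Because only the small mapping network receives gradients, training is cheap and the frozen CLIP encoders retain their original vision-language alignment.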

With the trained mapping network, we can flexibly compose joint image-text queries by treating the image as a word token and pairing it with a text description.
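
As a sketch of how such a composed query could be scored at retrieval time, the snippet below assumes a trained mapper, a hypothetical helper encode_prompt_with_token that embeds a prompt in which the pseudo token stands in for the image (e.g., "a sketch of [*]"), and precomputed CLIP embeddings for the candidate gallery.

```python
import torch
import torch.nn.functional as F


def retrieve(query_image_emb: torch.Tensor,
             prompt: str,
             mapper,
             encode_prompt_with_token,
             candidate_embs: torch.Tensor,
             k: int = 5) -> torch.Tensor:
    """Return indices of the top-k gallery images for a composed image+text query."""
    pseudo_token = mapper(query_image_emb)                      # image -> pseudo word token
    query_emb = encode_prompt_with_token(prompt, pseudo_token)  # joint image-text query
    query_emb = F.normalize(query_emb, dim=-1)
    candidate_embs = F.normalize(candidate_embs, dim=-1)
    scores = candidate_embs @ query_emb                         # cosine similarity per candidate
    return scores.topk(k).indices


# Example (domain conversion): retrieve sketch versions of the query image.
# top_idx = retrieve(query_image_emb, "a sketch of [*]", mapper,
#                    encode_prompt_with_token, candidate_embs)
```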

We compared Pic2Word with three approaches that do not require supervised training data: image only, text only, and image + text.
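The post does not spell out how the image + text baseline combines the two modalities; a common zero-shot choice, assumed in this sketch, is to average the normalized CLIP image and text embeddings of the query.

```python
import torch
import torch.nn.functional as F


def baseline_query_embedding(image_emb: torch.Tensor,
                             text_emb: torch.Tensor,
                             mode: str = "image+text") -> torch.Tensor:
    """Build a query embedding for the three unsupervised baselines."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    if mode == "image":    # query with the reference image alone
        return image_emb
    if mode == "text":     # query with the text description alone
        return text_emb
    # "image+text": average the two normalized embeddings (assumed combination)
    return F.normalize((image_emb + text_emb) / 2, dim=-1)
```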

Our experiments show that training on an image-caption dataset can build an effective CIR model, achieving strong results on a variety of CIR tasks such as object composition, attribute editing, and domain conversion.

One potential future research direction is utilizing caption data to train the mapping network, further improving the effectiveness of the model.