Conventional image search engines use a lexical approach and do not perform well for long queries. This is because they score individual tokens by their frequency within a document and their rarity across all documents, and match the query keywords against image textual metadata such as tags and titles. For longer queries, we need a search engine that understands the holistic meaning of the text query and expresses text and image embeddings in the same space, so that search can be performed with a nearest-neighbor approach.
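To see why lexical scoring breaks down, here is a toy sketch of TF-IDF-style ranking over a hypothetical corpus of image captions (the corpus, queries, and `tfidf_score` helper are illustrative inventions, not from any real search engine):

```python
import math
from collections import Counter

# Toy corpus of image captions, as a lexical engine would index them.
docs = [
    "sunset over the ocean",
    "a dog running on the beach at sunset",
    "city skyline at night",
]

def tfidf_score(query, doc, corpus):
    """Score a document against a query using term frequency (token
    frequency in the document) and inverse document frequency (token
    rarity across the corpus) -- a purely lexical match."""
    doc_tokens = Counter(doc.split())
    n_docs = len(corpus)
    score = 0.0
    for token in query.split():
        tf = doc_tokens[token]
        df = sum(1 for d in corpus if token in d.split())
        if tf and df:
            score += tf * math.log(n_docs / df)
    return score

# An exact-wording query scores highly...
print(tfidf_score("a dog running on the beach at sunset", docs[1], docs))
# ...but a paraphrase with the same meaning barely matches at all,
# because no individual tokens overlap.
print(tfidf_score("pet by the sea in the evening", docs[1], docs))
```

The second query describes the same image, yet the lexical score collapses because almost none of its tokens appear in the caption; this is exactly the gap a shared embedding space closes.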
The Contrastive Language-Image Pre-training (CLIP) model from OpenAI provides similarity capability across multi-modal inputs such as text and images by expressing their embeddings in the same space. It is trained with a contrastive objective on image-text pairs gathered from the internet, and is evaluated zero-shot on benchmarks such as ImageNet. Because it is a zero-shot model, we don't have to retrain it on our own corpus: we can simply use it to generate text and visual embeddings and sort the results by the distance between the vectors.
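The "generate embeddings and sort by distance" step can be sketched with plain NumPy. The embeddings below are random stand-ins (in the real pipeline they would come from CLIP's text and image encoders), and `nearest_images` is a hypothetical helper name:

```python
import numpy as np

# Stand-ins for precomputed CLIP embeddings: 1000 images and one text
# query, all living in the same 512-dimensional space.
rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(1000, 512)).astype(np.float32)
query_embedding = rng.normal(size=512).astype(np.float32)

def normalize(v):
    """Scale vectors to unit length so dot products become cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def nearest_images(text_embedding, image_embeddings, k=5):
    """Rank images by cosine similarity to a text embedding and
    return the indices and similarities of the top k."""
    sims = normalize(image_embeddings) @ normalize(text_embedding)
    top = np.argsort(-sims)[:k]
    return top, sims[top]

top_idx, top_sims = nearest_images(query_embedding, image_embeddings)
print(top_idx, top_sims)
```

Cosine similarity (rather than raw Euclidean distance) is the usual choice here because CLIP embeddings are compared after normalization, so only the direction of the vectors matters.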
In this course, we'll use the CLIP model from OpenAI, PyTorch for preprocessing images and text, and a simple nearest-neighbor search to illustrate the effect of longer queries on the Unsplash image search engine with and without the CLIP model.
This course assumes a basic understanding of deep learning and embeddings.