
Image Search & Discovery Engine

I'm going to show you how to build an AI-powered search engine. This one runs on my laptop; it could probably run on my phone; soon enough it'll run just about anywhere. I'll tell you how it works, how to talk to it (politely), and maybe how to hack the system entirely. The demos are interactive, so you should click on them, and there's a photo of me with the caption "a creeping sense of dread". I also cover why a recommendation algorithm might want to go a little bit Monty Python and show you something completely different. Let's begin...

Product Requirements

We're going to have two requirements for our system. We want to be able to search with text, like you would with any normal search bar, and we also want to be able to search by image. This leaves us with a pretty clear path towards our chosen solution. We won't linger on the technical terms for too long, but spoiler alert: the answer is a multi-modal AI model capable of zero-shot image classification.

For artists, the Curators Collective platform is about making their artwork discoverable online and getting it exhibited out in the real world. An artwork search engine therefore needs to blend accurate search results with a recommendation algorithm that enables natural discoverability of related or similar artwork. Next, we're going to talk about Cillian Murphy.

Natural Language Search

A natural language search term is a query using everyday language, as opposed to specific keywords or phrases. It's the type of question you might ask someone if you were having a conversation with them - "who is that guy in Oppenheimer and Peaky Blinders" - there's a ton of context and understanding that goes unsaid in a question like that. A human would intuitively reason that "Oppenheimer" refers to the film about the person - not the person themselves - and that "Peaky Blinders" refers to a TV show, and therefore that we are looking for the name of an actor who appears in both. To be useful to our users, the search engine should be able to interpret a phrase in plain, conversational English and return accurate results. To show you how this works I've entered four basic search terms into the system and returned the eight highest matches for each.

Natural Language Search - Interactive Demo
(An example search term and its eight highest matches, with similarity scores ranging from 30.70% down to 28.82%.)

All of the demos are populated from a public domain set of around 1,000 artworks, with all embeddings and search results pre-calculated.

The beauty of this search engine is that we haven't done any categorising, describing, indexing, or otherwise labelling of the images in our database. When a user uploads an image of their artwork to the platform it is instantly available in our discovery & recommendation system.

"The ginger artist who cut off his ear"

Let's start really demonstrating the power of a natural language search - the previous example used some fairly generic phrases that you would expect a traditional keyword search engine to perform well with. However, when searching for artwork our users may not know exactly what it is they want to find. They might want to search by something intangible like a vibe or an emotion they want to evoke... or by that artist they've half-remembered, you know... the one who was Dutch and painted sunflowers.

The remarkable thing is that by typing in the search term "The ginger artist who cut off his ear" the system returns a self-portrait by Vincent van Gogh alongside work from fellow Dutch artists Johannes Vermeer and Rembrandt, and his French contemporary Édouard Manet. It makes sense that all of these artworks are suggested together because they are semantically or thematically similar. And that's what our system is capable of expressing - semantic similarity. With semantic search, a picture is worth 1,000 keywords.

Semantic Similarity - Interactive Demo
(An example search term and its eight highest matches, with similarity scores ranging from 22.02% down to 20.13%.)

Natural Discovery

A classic pattern you'll see on an ecommerce product page is a section called "View similar items" or "Customers who viewed this item also viewed". We can do something similar for our artwork discovery mechanism. Below is an artwork by Henri Rousseau pulled from the demo database, alongside the four closest matches to it. The idea is simple: if someone is interested enough to click on this particular artwork, they may also be interested in similar pieces. This creates a digital experience with no dead-ends; each subsequent page has links to related or similar artwork, which in turn has more related links, and so on. This time, the results weren't generated using a search term - they were generated by matching against the image of the artwork itself.

In the top eight results for the Rousseau painting we get four artworks from Paul Gauguin, a fellow French Post-Impressionist. If we were interested in the works of Rousseau, we've now been exposed to the work of Gauguin. We follow the thread and view one of the Gauguin paintings, which gives us eight more recommendations, this time featuring three pieces from Modigliani. With very little work we already have the makings of an engaging discovery mechanism for new artists.

You might also like...
(Tropical Forest with Monkeys by Henri Rousseau, oil on canvas, shown alongside its four closest matches, with similarity scores ranging from 80.64% down to 73.28%.)
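As a rough illustration of the lookup itself, assume every artwork's embedding has already been calculated and saved; finding the closest matches is then just a nearest-neighbour search. The file names and the similar_artworks helper below are hypothetical - a minimal numpy sketch rather than the production code:

similar_artworks.py

import numpy as np

# Hypothetical precomputed data: one 768-dimensional embedding per artwork,
# L2-normalised so that a dot product equals cosine similarity.
embeddings = np.load("artwork_embeddings.npy")   # shape: (n_artworks, 768)
artwork_ids = np.load("artwork_ids.npy")         # shape: (n_artworks,)

def similar_artworks(index: int, top_k: int = 4):
    """Return the ids and scores of the top_k artworks closest to the artwork at `index`."""
    scores = embeddings @ embeddings[index]      # cosine similarity against every artwork
    ranked = np.argsort(-scores)                 # highest similarity first
    ranked = ranked[ranked != index]             # drop the artwork we started from
    return [(artwork_ids[i], float(scores[i])) for i in ranked[:top_k]]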

Will it blend?

Initially, there will be a static weighting for our recommendations, meaning that we won't be responding to user actions as part of our algorithm. There simply won't be enough data for a while to make this practical, though I intend to utilise some cache replacement algorithms for this job in the future (that's for a future write-up). Instead we're going to lean on our insights from earlier whilst taking into account several key factors (there's a rough sketch of how they might be blended after the list):

Is this a keyword search?

If the user has typed in a specific artist's name that exists in our database, such as "Paul Gauguin", then we should obviously prioritise them in the results. The AI-powered semantically similar search results are great for some things - but not all things. A keyword search on an artist's name that doesn't surface them in the results would be an extremely annoying user experience.

Semantic Similarity

The results from the AI model. We need to ensure that semantically similar results are sufficiently visible: if the user is searching for 'A countryside landscape' then these results need to take precedence.

Extract meaningful data points from Semantic Similarity

Here's where we would allow interesting emergent trends, such as the Rousseau -> Gauguin -> Modigliani connection, to be explorable, and really lean into the power of the AI model to give us great recommendations.

A wildcard

One of the first things I built to test how well the semantic similarity search was working was a randomiser - a way of selecting a single artwork from the database completely at random. As it turns out, this is quite a fun way of navigating through a database of art, and it also provides an escape hatch for a user who is just browsing.
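Here's a rough sketch of how those factors could be blended with static weights. The weights, field names, and scoring inputs are illustrative placeholders, not the tuned production algorithm:

blend.py

import random

# Illustrative static weights - placeholders rather than tuned production values.
WEIGHTS = {"keyword": 0.5, "semantic": 0.4, "wildcard": 0.1}

def blend_results(keyword_hits, semantic_hits, catalogue, top_k=8):
    """Blend keyword and semantic scores with a random wildcard into one ranking.

    keyword_hits / semantic_hits: dicts mapping artwork_id -> score in [0, 1].
    catalogue: all artwork_ids, used to draw the wildcard from.
    """
    scores = {}
    for artwork_id, score in keyword_hits.items():
        scores[artwork_id] = scores.get(artwork_id, 0.0) + WEIGHTS["keyword"] * score
    for artwork_id, score in semantic_hits.items():
        scores[artwork_id] = scores.get(artwork_id, 0.0) + WEIGHTS["semantic"] * score

    # The wildcard: give one completely random artwork a small boost so that
    # something unexpected can always surface for a browsing user.
    wildcard = random.choice(catalogue)
    scores[wildcard] = scores.get(wildcard, 0.0) + WEIGHTS["wildcard"]

    return sorted(scores, key=scores.get, reverse=True)[:top_k]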

How does it work?

Our search engine relies largely on an AI model that is trained to be very good at matching an image to a caption. That's essentially all that is going on under the hood. When the user enters a search term we treat it as though they were submitting a caption; the model can then help us find the images in our database of artwork that best match that caption. The AI model is good at this because it has been trained on 400,000,000 image & caption pairs, which is an awful lot of data to learn from. In fact, we could download a dataset of that size ourselves if we wanted to - except it would take roughly 3.5 days to complete and require ~10TB of storage space.
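As a rough sketch of that flow - encode the search term, then rank the artwork images by how well they match it - something like the following would do the job. This assumes the Hugging Face transformers wrappers for the checkpoint listed further down, plus a hypothetical array of precomputed, L2-normalised image embeddings; it isn't a description of the production container.

search.py

import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def search(query: str, image_embeddings: np.ndarray, top_k: int = 8):
    """Treat the search term as a caption and rank precomputed image embeddings against it."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_embedding = model.get_text_features(**inputs)
    text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)

    # image_embeddings: shape (n_artworks, 768), already L2-normalised,
    # so a dot product is the cosine similarity shown as a 'Match' percentage.
    scores = image_embeddings @ text_embedding[0].numpy()
    top = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in top]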

This still doesn't quite explain how the search term "The ginger artist who cut off his ear" got the interesting results that it did. It is possible that some of that information appeared in the AI model's training data - after all, those images are in the public domain - so it's possible that the model has 'remembered' some of that learnt information. Two of the results were self-portraits, so it's also possible that the model responded by literally returning images of artists. Maybe it simply returned portraits of people with an ear showing. When you go back and look at those results, the ears are quite prominent, come to think of it... The answer is probably 'all of the above' - we can't really know exactly how we arrived at these results, but we can evaluate them - and they're very good.

Comparing Apples to Oranges

In order to compare images to images, and text to images, we need some sort of common ground that enables a direct comparison. The AI model encodes the text and image data in a way that makes them mathematically comparable, and as you can see below this takes the form of a high-dimensional vector embedding. Roughly speaking, the closer two vectors are to each other when we measure them, the more semantically similar the content they represent.

Same same but different - Interactive Demo
(Each input - two images and the caption "A creeping sense of dread" - is converted into an embedding. These embedding vectors are now directly comparable with each other, with pairwise matches of 16.39%, 18.53%, and 32.67%.)
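To make that concrete, here's a minimal sketch of producing one image embedding and one text embedding and measuring how close they are. Again this leans on the Hugging Face transformers wrappers for the checkpoint listed further down, and the image path is hypothetical:

compare.py

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("artwork.jpg")                     # hypothetical image file
caption = "A creeping sense of dread"

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_embedding = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_embedding = model.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"])

# Both are 768-dimensional vectors; normalise them and the dot product is the
# cosine similarity reported as a match percentage in the demo above.
image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)
text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
print(f"Match: {(image_embedding @ text_embedding.T).item() * 100:.2f}%")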

Implementation & Experimentation Notes

UI for AI

There is a limited amount of text that the AI model can handle in a single go - about two or three sentences, although it can vary a lot. The first stage of translating the text into something the model can understand is to split it down into 'tokens'. As the code snippet below shows, any tokens beyond the context_length of 77 are discarded (or, if truncation is disabled, an error is raised). Reading the CLIP Paper seems to indicate that a single sentence is likely to get the best results, considering the way in which the model was trained - and so this is what we must gently nudge the user into giving us.

clip.py
 
for i, tokens in enumerate(all_tokens):
    if len(tokens) > context_length:
        if truncate:
            # Delete all tokens after the context_length
            tokens = tokens[:context_length]
            tokens[-1] = eot_token  # Make sure the truncated sequence still ends with the end-of-text token
        else:
            raise RuntimeError(f"Input {texts[i]} is too long for context length {context_length}")
 

One of the joys of a soft limit like this is in building UI components that visually communicate the constraint without needing to dive into a paragraph like the one above about tokens and truncation. The custom textarea component below lets our artists write descriptions during the artwork upload process whilst keeping them informed of the 75 token budget. We're able to kill two birds with one stone here: we want our artwork images to have accurate, descriptive alt text for accessibility reasons, and as it happens this is exactly the kind of caption that works best with our search model. Meta, for example, have been using AI generated alt text for years now.

AI UI Components - Interactive Demo
(What the user sees, and what's under the hood.)

The code below shows exactly how the textarea component is built, but using a character count rather than a token count just to keep things simple. The concept works in exactly the same way - you just introduce a tokenLimit rather than a characterLimit.

CustomTextarea.jsx
 
import { useState } from "react"
 
export default function CustomTextarea() {
 
    let initialTextareaValue = "A white path and gray stone wall wind through ochre and harvest-yellow fields beneath a brilliant, turquoise sky in this stylized, horizontal landscape painting. The scene is created with long, visible brushstrokes of vivid color.";
    const [characterCount, setCharacterCount] = useState(initialTextareaValue.length);
    const characterLimit = 380;
    const circleRadius = 30;
    
    // The SVG circles below use pathLength="100", so dash lengths behave like
    // percentages of the ring regardless of its actual radius.
    const calculateCircumference = (radius) => {
        return 2 * Math.PI * radius;
    }
    
    const dashOffset = calculateCircumference(circleRadius);
 
    // Offsetting the dash pattern by the full circumference and extending the dash
    // by (count / characterLimit) * 100 leaves exactly that percentage of the ring visible.
    const calculateDashArray = (count) => {
        return dashOffset + (count / characterLimit) * 100;
    }
 
    return (
        <>
            <label htmlFor="description" className="block text-sm font-medium leading-5 text-gray-900 mr-8">
                Say what you see: Write a sentence or two to describe the content of your artwork...
            </label>
            <div className="mt-2 relative">
                <textarea defaultValue={initialTextareaValue} onChange={(e) => setCharacterCount(e.target.value.length)} rows="4" name="description" id="description" className="bg-white block w-full rounded-md p-3 pr-16 text-gray-900 shadow-sm ring-1 ring-slate-400/30 focus-within:ring-2 focus-within:ring-offset-2 focus-within:ring-offset-transparent focus-within:ring-blue-700/30 border border-transparent focus-within:border-blue-500 focus-visible:outline-none sm:text-sm sm:leading-6 min-h-[75px]" />
                <svg data-limit={characterCount >= characterLimit} className="group absolute top-2 right-2 -rotate-90 w-14 h-14 -mx-1" viewBox="0 0 100 100">
                    <circle cx="50" cy="50" r={circleRadius} pathLength="100" className="group-data-[limit=true]:stroke-red-500 group-data-[limit=false]:stroke-slate-500/30 fill-transparent stroke-[10px]" />
                    <circle cx="50" cy="50" r={circleRadius} pathLength="100" strokeDasharray={calculateDashArray(characterCount)} strokeDashoffset={dashOffset} className="group-data-[limit=true]:stroke-red-500 group-data-[limit=false]:stroke-green-500 fill-transparent stroke-[10px]" />
                </svg>
            </div>
        </>
    )
 
}
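If we later want to show a true token count rather than a character count, the model's own tokenizer can provide it from a small server-side endpoint. A minimal sketch, assuming the Hugging Face CLIPTokenizer for the checkpoint listed further down (the tokens_used helper is illustrative); the count excludes the start- and end-of-text markers that CLIP adds automatically, which is where the 75 token budget inside the 77 token context comes from:

token_count.py

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

def tokens_used(description: str) -> int:
    """Count the tokens a description occupies, excluding the special
    start- and end-of-text tokens that CLIP appends automatically."""
    return len(tokenizer(description, add_special_tokens=False)["input_ids"])

# e.g. tokens_used("A white path and grey stone wall wind through ochre fields.")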
 

Cogtainers

Cog, built by Replicate, is "an open-source tool that lets you package machine learning models in a standard, production-ready container". It's really useful and you can check it out on GitHub.

The Model

Model: openai/clip-vit-large-patch14
Embedding Dimension: 768
Input Resolution: 224 x 224 pixels

The CLIP Paper has some interesting benchmarks that show how prompt engineering can improve zero-shot performance. In this instance we can think of 'prompt engineering' as 'how to structure a search term to get the best search results'. The paper shows that "compared to the baseline of using contextless class names, prompt engineering and ensembling boost zero-shot classification performance by almost 5 points on average across 36 datasets". This means that we could potentially improve results just by prefixing/suffixing our users' search requests with additional text. In our narrow use case as an artwork search engine this probably won't lead to markedly better results, but by nudging the user to search in a specific way we may be able to help them achieve better results. Take the examples below: even tiny, incremental changes to the search query produce better search results. One way to improve this search engine in the future would be to offer dynamically generated 'suggested searches'.

Prompt Engineering - Interactive Demo
(An example search term and its eight highest matches, with similarity scores ranging from 25.35% down to 23.12%.)
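One low-effort way to experiment with this would be to wrap the user's query in a handful of templates and average the resulting text embeddings, in the spirit of the ensembling described in the paper. A rough sketch - the templates and the embed_text helper are illustrative, not something that exists in the codebase yet:

prompt_ensemble.py

import numpy as np

# Illustrative templates in the spirit of the CLIP paper's prompt ensembling.
TEMPLATES = [
    "{}",
    "a painting of {}",
    "an artwork depicting {}",
]

def ensembled_query_embedding(query: str, embed_text) -> np.ndarray:
    """Average the embeddings of several templated versions of the query.

    `embed_text` is assumed to be a function returning an L2-normalised text
    embedding as a 1-D numpy array (e.g. a thin wrapper around get_text_features).
    """
    vectors = np.stack([embed_text(template.format(query)) for template in TEMPLATES])
    mean = vectors.mean(axis=0)
    return mean / np.linalg.norm(mean)   # re-normalise the averaged embedding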

Problems for future me?

The Database

Using Postgres and installing the pgvector extension suits our use case down to the ground. Having our app data and our embeddings data in the same database reduces the complexity of both the code and the server configuration. We also know that we're never likely to be storing millions of embeddings. However, there are ways of improving database performance if we ever need to - an excellent talk by Jonathan Katz highlights a few strategies. Since Postgres will TOAST (The Oversized-Attribute Storage Technique) data larger than 2KB by default, we can safely assume that our 768-dimension 4-byte float vectors will be TOASTed. There are also gotchas around filtering within a query that might cause the database to ignore indexes entirely. These aren't problems we're going to have to worry about any time soon with our search engine, but they're worth bearing in mind.
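For reference, the nearest-neighbour query itself stays pleasingly small with pgvector. A sketch using psycopg and the pgvector Python helper - the artworks table and its columns are illustrative rather than our actual schema:

query.py

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

# Illustrative schema: artworks(id, title, embedding vector(768))
QUERY = """
    SELECT id, title, 1 - (embedding <=> %s) AS similarity
    FROM artworks
    ORDER BY embedding <=> %s
    LIMIT 8;
"""

def search_artworks(conn: psycopg.Connection, query_embedding: np.ndarray):
    """Return the eight artworks closest to the query embedding.

    <=> is pgvector's cosine distance operator, so 1 - distance gives the
    cosine similarity reported as a match percentage elsewhere in this write-up.
    """
    register_vector(conn)   # teach psycopg how to send numpy arrays as pgvector values
    with conn.cursor() as cur:
        cur.execute(QUERY, (query_embedding, query_embedding))
        return cur.fetchall()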

A paper by some researchers at Netflix did the rounds recently, reinforcing the idea that measuring the semantic similarity of vectors using cosine similarity can sometimes produce arbitrary and therefore meaningless similarities. Fortunately we don't have to worry about this, because the CLIP model was specifically designed around a contrastive loss function that maximises the cosine similarity between matching text and image embeddings. If we ever switch out the underlying model powering our search engine, though, any hard-coded references to cosine similarity in our database queries will need to be reassessed.

This document is the first to be made publicly available in a series detailing the product, design, and code choices made while building out the fledgling Curators Collective tech stack. Writing these documents has been a valuable way for the team to work asynchronously, and to communicate the complex technical solutions we provide to our early adopters and team members alike. Want artwork from local Norfolk artists brightening up your office walls? Get in touch - hello@thecuco.co.uk