Image deduplication using OpenAI’s CLIP and Community Detection

Theodoros Ntakouris
Apr 9, 2023

A short guide on how to use image embeddings from OpenAI’s CLIP and clustering techniques in order to group near-duplicate images together.



  • We’re going to use huggingface/transformers to quickly load openai/clip-vit-base-patch32 from the hub
  • We’re going to perform inference in order to extract image features
  • We will use those feature vectors to perform community detection, placing near-duplicate images into the same buckets.


CLIP is trained by trying to align image <> text embedding pairs, or “learning visual representations from natural language supervision”.

You can use its text or image embeddings to accomplish a lot of different tasks, such as zero-shot image classification. Its embeddings are pretty powerful.
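To make the zero-shot idea a bit more concrete, here is a minimal sketch of the scoring step only, assuming you already have one image embedding and one text embedding per candidate label. The three-dimensional vectors and the label prompts are made-up stand-ins (real embeddings from this checkpoint have 512 dimensions), the function names are hypothetical, and the scale of 100 is an assumed temperature, not a value read from the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_scores(image_embedding, label_embeddings, scale=100.0):
    """Softmax over scaled cosine similarities between one image and each label."""
    labels = list(label_embeddings)
    logits = [scale * cosine_similarity(image_embedding, label_embeddings[label])
              for label in labels]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}

# Toy stand-ins for CLIP outputs (assumption: not real embeddings)
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a bathroom": [1.0, 0.0, 0.1],
    "a photo of a kitchen": [0.0, 1.0, 0.1],
}

scores = zero_shot_scores(image_embedding, label_embeddings)
best_label = max(scores, key=scores.get)
```

The label whose text embedding points in the direction closest to the image embedding gets the highest probability, which is all "zero-shot" means here: no classifier is trained, only similarities are compared.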

Dataset & Evaluation

For this task, we’re going to use the AirBnB Duplicate Image Dataset, available on Kaggle. For ease of use, we’re only going to use one directory, which contains 14 images: Test Data/bathroom. Don’t worry, the code easily generalizes to many images.

We’re only going to evaluate this qualitatively (just by looking at the output images), to get the gist of it. You can design your own evaluation criteria based on your dataset, the task at hand and any assumptions that come from the business problem.
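As one possible starting point for such criteria, here is a sketch that compares predicted groups against hand-labelled ones at the level of image pairs. It assumes you have ground-truth duplicate groups available; the image ids and the function names below are invented for illustration.

```python
from itertools import combinations

def duplicate_pairs(groups):
    """Return every unordered pair of ids that share a group."""
    pairs = set()
    for group in groups:
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs

def pairwise_precision_recall(predicted_groups, true_groups):
    """Precision and recall over pairs of images placed in the same group."""
    predicted = duplicate_pairs(predicted_groups)
    actual = duplicate_pairs(true_groups)
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 1.0
    recall = correct / len(actual) if actual else 1.0
    return precision, recall

# Invented example: one true triple was only partly found,
# and one unrelated image was wrongly merged into a pair.
true_groups = [{"a_1", "a_2", "a_3"}, {"b_1", "b_2"}]
predicted_groups = [{"a_1", "a_2"}, {"b_1", "b_2", "c_1"}]

precision, recall = pairwise_precision_recall(predicted_groups, true_groups)
```

Pair-level scores penalize both missed duplicates (lower recall) and unrelated images merged together (lower precision), so they are a reasonable first metric before moving to anything task-specific.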


Without further delay, let’s dive straight into the code. First, we’ll load the model and its processor from the Hugging Face Hub.

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

Then, let’s just load the images under our directory and generate our embeddings.

import torch
from pathlib import Path
from PIL import Image

demo_directory = Path(".../bathroom")
# {"<name>": ".../bathroom/<name>.jpg"}
images_to_paths = {image_path.stem: image_path for image_path in demo_directory.iterdir()}

# assumption: each file is loaded as an RGB image with Pillow
images = [Image.open(path).convert("RGB") for path in images_to_paths.values()]
inputs = processor(images=images, return_tensors="pt", padding=True)

# gradients are not needed for inference
with torch.no_grad():
    outputs = model.get_image_features(**inputs)

# one embedding per image id, as a numpy array
images_to_embeddings = {image_id: tensor_embedding.detach().numpy() for image_id, tensor_embedding in zip(images_to_paths.keys(), outputs)}

Now, it’s time to perform community detection by clustering. There are lots of clustering algorithms out there, and one of the popular kids on the block is KMeans. Beware: we will avoid clustering methods that take a predefined number of clusters as an input, as this assumes that we know how many groups of similar images our dataset includes! Instead, we will use DBSCAN, a density-based algorithm that works out the number of clusters on its own.
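To see how a density-based algorithm such as scikit-learn's DBSCAN behaves without being told a cluster count, the sketch below runs it on five made-up 2-D points instead of real embeddings. The `eps` and `min_samples` values are chosen only to suit this toy layout.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 2-D points: two tight pairs and one isolated point
points = np.array([
    [0.0, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.0, 5.1],
    [10.0, 0.0],
])

# eps is the neighbourhood radius; min_samples=2 lets a pair form a cluster
# (the count includes the point itself)
labels = DBSCAN(eps=0.5, min_samples=2).fit(points).labels_.tolist()

# Points that belong to no cluster are labelled -1 (noise)
num_clusters = len(set(labels) - {-1})
```

Each tight pair ends up in its own cluster and the isolated point is marked as noise, which maps nicely onto "groups of near-duplicates" versus "images with no duplicate".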

import numpy as np
from sklearn.cluster import DBSCAN
from collections import defaultdict

# tune eps to fit your needs
# np.stack needs a sequence, so the dict values are turned into a list first
clustering = DBSCAN(min_samples=2, eps=3).fit(np.stack(list(images_to_embeddings.values())))

# postprocess cluster labels into groups of similar images
image_id_communities = defaultdict(set)
independent_image_ids = set()

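for image_id, cluster_idx in zip(images_to_paths.keys(), clustering.labels_):
    cluster_idx = int(cluster_idx)
    if cluster_idx == -1:
        # DBSCAN labels noise points with -1: no near-duplicate was found
        independent_image_ids.add(image_id)
        continue
    image_id_communities[cluster_idx].add(image_id)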


Let’s plot the results.

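from PIL import Image

len(independent_image_ids) # = 10

image_id_communities # =
# {
#   0: {'berlin_1583556_1', 'berlin_1583556_2'},
#   1: {'berlin_969200_1', 'berlin_969200_2'}
# }

# show the images of each near-duplicate group
# (assumption: each file is opened with Pillow's default viewer as a stand-in display call)
for image_id_community in image_id_communities.values():
    for image_id in image_id_community:
        Image.open(images_to_paths[image_id]).show()

# images that have not got a cluster with similar images assigned to them
for image_id in independent_image_ids:
    Image.open(images_to_paths[image_id]).show()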

You can see that, out of the 14 images in the dataset, we detect 2 groups that contain 2 similar images each. Quickly skim through the other 10 images and you’ll see that they are not near-duplicates.

Here are the plotted similar groups of images: pretty tough examples for a near-duplicate detection system!

Next Steps

For the interested data scientist, here are some take-home exercises or things to think about:

  • Use a projection method to plot the 512-d embeddings of the images, and estimate a typical nearest-neighbour distance to guide the choice of `eps` for the clustering algorithm
  • Design a pipeline that evaluates the performance of near-duplicate community detection algorithms
  • Experiment with different model embeddings and different clustering algorithms. What’s the best combination to use?
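As a nudge for the first exercise, one common heuristic is to look at each embedding's distance to its nearest neighbour and pick `eps` somewhere between the small distances (likely duplicates) and the large ones (unrelated images). The sketch below computes those distances with NumPy on made-up 2-D points; the helper name `nearest_neighbour_distances` is hypothetical, and on real data you would pass the stacked CLIP embeddings instead.

```python
import numpy as np

def nearest_neighbour_distances(embeddings):
    """Euclidean distance from each row to its closest other row."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    distances = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(distances, np.inf)  # ignore the distance of a point to itself
    return distances.min(axis=1)

# Made-up points: two close pairs and one far-away outlier
points = np.array([
    [0.0, 0.0], [0.0, 1.0],
    [10.0, 0.0], [10.0, 1.0],
    [50.0, 50.0],
])

sorted_distances = np.sort(nearest_neighbour_distances(points))
```

A large jump in the sorted distances hints at where duplicates end and unrelated images begin; any `eps` inside that gap would separate the two for this toy data. This builds a full pairwise distance matrix, so it is only practical for small collections.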

Thanks for reading!