in Development

Detecting duplicate images using Python

With thousands of icons being uploaded to Iconfinder.com every month, the risk of pirated content also increases. To keep out the swindlers we have been working on a new clever image duplication technique.

One of the features that came out of our little hackaton and will be rolling out in the next couple of weeks is the ability to detect duplicate icons upon submission. So, for example if a user downloads an existing icon and tries to re-upload it in order to sell it and make a profit (yes, we had some of those cases) we will be able to detect it as a duplicate of an already existing icon and mark the account as fraudulent.

A frequently used solution to identify if a given file already exists in a large collection of files is to calculate a hash for each individual file in the dataset, store the hashes in a database and then when we want to locate a particular file calculate the hash for that file and look up to see if it’s hash value exists in the database.

Picking a hashing algorithm

A commonly used type of hash algorithms for this are the cryptographic hashing algorithms. Implementations of the most common used ones like MD5, SHA-1, SHA-256 are available in the standard library of almost any programming language and they are really easy to use and work really well for the most simple use cases.

For instance, in Python you would simply import the hashlib module and call a function to generate the hash value for a particular string or file.

This would work fine and dandy if we were sure the files uploaded aren’t tampered with but due to the nature of how cryptographic hashing algorithms work even the slightest change to the input will result in an avalanche effect so the generated hash of the new file will be totally different from the one of the original file.

Let’s take an example where we add a period at the end of the sentence.

As you can see in the example above, the values 9e107d9d372bb6826bd81d3542a419d6 and e4d909c290d0fb1ca068ffaddf22cbd0 are almost completely different (except for few characters).

In the case of images if the background color is changed, the image is cropped or rotated or if just one pixel is modified out of the original image, we won’t be able to match the hash of the image to an already existing one. This is not really practical since we want to detect similar images, even if they have been modified a little.

For example when we modify the color of the nose on the cat the computed hash value will be mostly different.

Original image

Original image

Modified image

Modified image



What we needed for our use case is a perceptual hashing algorithm. A perceptual hashing algorithm that takes a fingerprint of a multimedia file by deriving it from various features from its content so it can take into account transformations on a given input and yet be flexible enough to distinguish between dissimilar files.

There are a number of perceptual hashing algorithms but the one that I’m gonna present is the dHash (difference hashing) algorithm which computes the difference in brightness between adjacent pixels, identifying the relative gradient direction.

dHash

Before we dive into the algorithm, let’s start with some basics. A color image is comprised out of red, green and blue (RGB) pixels and we can think of it as a list of sets of red, green and blue values. For example, let’s take an image, load it using Python’s imaging library (PIL) and print the pixel values.

Test image

Test image



The first three pixels have an intensity value of 255 for red, green and blue respectively and 0 for other values, white has all three values at their maximum intensity (255) and black has all three values at their lowest intensity (0). The rest of the color pixels are a combination of red, green and blue at different intensities.

Now let’s get back on track, the dHash algorithm has four steps, I’m gonna go through each one and illustrate it’s effects on the original and modified image.

1. Grayscale the image.

By grayscaling the image we reduce each pixel value to a luminous intensity value. For example, a white pixel (255, 255, 255) will be reduced to an intensity value of 255 and a black pixel (0, 0, 0) will be reduced to an intensity value of 0.

Original image (after step 1)

Original image (after step 1)

Modified image (after step 1)

Modified image (after step 1)

2. Shrink the image to a common size.

By shrinking the image to a common base size, for example 9×8 pixels, where the width is 1px larger than the height (you’ll understand why the odd size in step 3). This way we remove all the high frequencies and details of the image and we are left with a sample size of 72 intensity values. This also means that resizing or stretching an image won’t affect it’s hash value, all images are normalized to a common size.

Original image (after step 2)

Original image (after step 2)

Modified image (after step 2)

Modified image (after step 2)

3. Compare adjacent pixels.

After the previous two steps we are left with a list containing intensity values, we then compare each intensity value to it’s adjacent value for each row resulting in an array of binary values.

The first value (254) will be compared to the second value (254), the second value to the third and so on, giving us 8 boolean values per row.

4. Convert the difference into bits.

To make the hash easy to store and use, we convert it into a hexadecimal string. A True value will have a binary value of 1 and a False value will have one of 0.

Python implementation

A full implementation of the algorithm in Python:

Comparing hashes

In the most simple case where the images differ only slightly the hashes most likely will be the same so we can directly compare them.

If we kept an SQL database with our hashes and we wanted to see if the ’4c8e3366c275650f’ hash existed in our database we could simply do something like this:

Now, in the case that the images differ a bit more and the hashes are not exactly the same we need to compare them by calculating the minimum number of substitutions required to change one string into the other, this is called a hamming distance.

On the Wikipedia page there is some sample Python code that computes the hamming distance between two string but we can calculate the hamming distance and query based on it directly in MySQL:

It will do a binary XOR of the two values, the first one from the database and the second one the one that we query, and count the bits that are different amongst the two values. Because BIT_COUNT works only on integers and we stored the hash as a hex string (base 16) we have to convert it to a decimal value (base 10), the same goes for the value that we want to query by.

Ending words

Even though I’ve presented the algorithm using Python snippets the algorithm is pretty general and can be implemented in any other programming language.

As I said in the introduction, we will be using the algorithm in the future on Iconfinder to warn us if a submitted icon already exists in our collection but it can also have other practical uses. For example, because images with similar features have the hamming distance of their hashes close, it can also be used as a basic recommendation system where the recommended images are within a certain hashing distance of the current image.

Write a Comment

Comment

36 Comments

    • From what I see the algorithm should cope better with image cropping than the algorithm presented here (dhash).

      I’m gonna dive into the paper and check it out, thank you.

    • Mainly performance issues, CTPH is a rolling hash algorithm so you end up processing the image byte per byte. This has the advantage of and having less false positives and a better match for near similar images but it’s main trade-off is speed.

      We are also trying to move towards feature extraction and the dhash algorithm is a bit more useful and interesting to experiment with.

  1. Have you considered using a Bloom filter for fast lookups?
    http://www.eecs.harvard.edu/~michaelm/postscripts/alenex2006.pdf (Kirsch and Mitzenmacher: Distance-Sensitive Bloom Filters) solves pretty much this exact problem, and are quite effective for a small number of bit-wise differences.

    I am working on some python code for this, as part of a research project. Feel free to contact me, if you want a copy.

    /Martin Faartoft

  2. Hi Silviu,

    The quite same thing I am using in my final year engineering project. It is named as Content Based Image Retrieval.

    On the basis of color it will pick images from database.

    Happy that IconFinder is improving day by day…

  3. Awesome post.

    I noticed that following code will not print the pixels row by row

    for col in xrange(width):
    print pixels[col:col+width]

    Maybe you mean something like this:

    for row in xrange(height):
    col = row * width
    print pixels[col: col+width]

  4. As you can see in the example above, the values 9e107d9d372bb6826bd81d3542a419d6 and e4d909c290d0fb1ca068ffaddf22cbd0 are completely different.

    Actually, there are two characters in the same location which share the same value, so they aren’t “completely different” just mostly.

  5. Errata:
    as ↝an↜ list of sets
    illustrate ↝it’s↜ effects
    value to ↝an↜ luminous intensity
    left with ↝an↜ list containing
    and can ↝↜ implemented in
    we will be ↝using↜ using the algorithm
    if ↝an↜ submitted icon

  6. To find all duplicates, you’ll need to compare O(N^2) hash pairs, though. Your SQL query is a full table scan.
    I like the focus on protecting the originality of submissions, though; good job!

    • Who would buy it, though? They’re trying to detect duplicate images to prevent people from stealing others’ intellectual property. Who’s going to buy some mirrored image?

      It seems like you’re not really thinking of the big picture, but rather trying to show how “clever” you are.

      • It’s also quite easy to mirror and rotate the analyzed data – so detecting flipped images is not a difficult problem to solve. It’s merely a question of adding an extra step to the algorithm.

  7. This is a very interesting read. I always wonder about the different methods sites use to detect duplicate content. It is definitely a major issue, and I like seeing the inside thought process like this.