Cuckoo is an image duplicate detection tool designed to remove duplicate images from large datasets using locality-sensitive hashing (LSH). For image deduplication at scale, an approximate search algorithm such as LSH trades some accuracy for speed and efficiency, and the balance between the two can be adjusted through parameter tuning.
This program employs Locality Sensitive Hashing (LSH) to detect near-duplicate images within a directory. Detecting duplicates at scale is crucial to maintaining good-quality image datasets. The program is structured into the following components:
- ImageProcessor
  - Handles image preprocessing tasks such as converting images to grayscale, resizing, and flipping so that the brightest quarter always ends up in the same position.
  - Computes a perceptual hash (dhash) of the image to generate a signature for similarity comparison.
- LSHProcessor
  - Implements LSH for efficient similarity detection by dividing image signatures into bands and using the bytes of each band as a bucket key.
  - Stores image signatures in buckets and provides methods to find potentially similar images.
- Utility functions
  - get_image_files: Retrieves a list of image files from a specified directory based on recognized file extensions.
  - process_images: Processes each image in the directory using the ImageProcessor and populates the LSHProcessor with image signatures.
  - find_near_duplicates: Coordinates the entire process by initializing the components and finding duplicate images using LSH.
- Main program
  - Reads the input directory path, similarity threshold, hash size, and number of bands.
  - Uses find_near_duplicates to identify near-duplicate images based on the provided threshold.
  - Outputs a CSV file (results.csv) containing filenames and their corresponding similarity labels.

In summary:
- ImageProcessor: Preprocesses images and computes their signatures.
- LSHProcessor: Implements LSH for efficient similarity detection.
- Utility functions: Handle file operations and coordinate image processing tasks.
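As an illustration of the utility layer, here is a minimal sketch of how get_image_files might collect candidate files. The extension set, the return type, and the choice to ignore subdirectories are assumptions made for this example, not details taken from the tool itself.

```python
from pathlib import Path
from typing import List

# Assumed set of recognized extensions; the real tool may accept others.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def get_image_files(input_dir: str) -> List[str]:
    """Return sorted paths of files in input_dir whose extension looks like an image."""
    directory = Path(input_dir)
    return sorted(
        str(path)
        for path in directory.iterdir()  # top level only, no recursion
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Matching on the lowercased suffix means files such as photo.JPG are picked up as well.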
Inputs:
- input_dir: Directory path containing images to be analyzed.
- threshold: Minimum similarity threshold (between 0 and 1) for considering images as near-duplicates.
- hash_size and bands: Parameters for LSH configuration, affecting granularity and efficiency of similarity detection.
Output:
- Generates a CSV file (results.csv) containing image paths and labels; duplicates share the same label.
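The output step could look roughly like the sketch below. The function name write_results, the header names, and the integer labels are hypothetical; only the file name results.csv and the path-plus-label layout come from the description above.

```python
import csv
from typing import Dict


def write_results(labels: Dict[str, int], output_path: str = "results.csv") -> None:
    """Write one row per image: its path and the label of its duplicate group."""
    with open(output_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "label"])  # assumed header names
        for path, label in sorted(labels.items()):
            writer.writerow([path, label])
```

Images that share a label in the resulting file are the ones treated as duplicates of each other.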
To detect similarity between two images A and B at a threshold X, the program proceeds as follows.
The ImageProcessor class calculates the image signature/hash with the calculate_signature method:
- The image is converted to grayscale and resized to (hash_size+1, hash_size).
- The image is then flipped so that the brightest quarter is always at the top left, to deal with image rotations.
- A difference hash is then calculated using hash_size and collapsed to a 1-dimensional array.
- This 1-dimensional array is returned as the signature of the image.
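The flipping and difference-hash steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the tool's actual code: the input is assumed to be a grid of grayscale values already resized to hash_size rows by hash_size+1 columns (loading and resizing would normally be done with an imaging library), the helper flip_brightest_to_top_left is a hypothetical name, and brightness is compared by mean value per quarter.

```python
from typing import List

Grid = List[List[int]]  # rows of grayscale pixel values


def _mean(rows: Grid, start: int, stop: int) -> float:
    """Average pixel value over columns start..stop-1 of the given rows."""
    values = [value for row in rows for value in row[start:stop]]
    return sum(values) / len(values)


def flip_brightest_to_top_left(pixels: Grid) -> Grid:
    """Flip the grid so that its brightest quarter ends up top-left.

    Needs at least 2 rows and 2 columns. With an odd width the left and
    right quarters differ in size, which is why means are compared.
    """
    mid_row, mid_col = len(pixels) // 2, len(pixels[0]) // 2
    width = len(pixels[0])
    top, bottom = pixels[:mid_row], pixels[mid_row:]
    # Keys are the (vertical, horizontal) flips that move that quarter top-left.
    brightness = {
        (False, False): _mean(top, 0, mid_col),
        (False, True): _mean(top, mid_col, width),
        (True, False): _mean(bottom, 0, mid_col),
        (True, True): _mean(bottom, mid_col, width),
    }
    flip_vertically, flip_horizontally = max(brightness, key=brightness.get)
    if flip_vertically:
        pixels = pixels[::-1]
    if flip_horizontally:
        pixels = [row[::-1] for row in pixels]
    return pixels


def calculate_signature(pixels: Grid) -> List[int]:
    """Difference hash: 1 where a pixel is brighter than its right neighbour."""
    pixels = flip_brightest_to_top_left(pixels)
    return [
        int(left > right)
        for row in pixels
        for left, right in zip(row, row[1:])
    ]
```

Each row of hash_size+1 pixels yields hash_size comparisons, so the flattened signature has hash_size * hash_size entries.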
The LSHProcessor class is employed to add each image path and signature to the bucket list, hash_buckets_list, using the add_signature method. The band size and rows are used to iteratively extract the bytes of each band of the signature, which are stored in hash_buckets_list. If a previous image has produced the same bytes, the image path is appended to that entry's list of image paths in hash_buckets_list. This indicates that the current band of the image matches the same band of a different image.

NB: hash_buckets_list contains dicts with signature bytes as keys and lists of image paths as values.
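A minimal sketch of this bucketing, assuming the signature is a flat list of 0/1 values whose length is divisible by the number of bands. The constructor arguments and the candidate_groups helper are hypothetical; add_signature and hash_buckets_list follow the names used above.

```python
from typing import Dict, List


class LSHProcessor:
    def __init__(self, signature_length: int, bands: int) -> None:
        self.bands = bands
        self.rows = signature_length // bands  # signature entries per band
        # One dict per band: band bytes -> paths of images sharing those bytes.
        self.hash_buckets_list: List[Dict[bytes, List[str]]] = [
            {} for _ in range(bands)
        ]

    def add_signature(self, path: str, signature: List[int]) -> None:
        """File the image under the bytes of each band of its signature."""
        for band in range(self.bands):
            start = band * self.rows
            key = bytes(signature[start:start + self.rows])
            self.hash_buckets_list[band].setdefault(key, []).append(path)

    def candidate_groups(self) -> List[List[str]]:
        """Buckets holding more than one path, i.e. potentially similar images."""
        return [
            paths
            for buckets in self.hash_buckets_list
            for paths in buckets.values()
            if len(paths) > 1
        ]
```

Two images become candidates as soon as they agree on a single band, so more bands (shorter bands) make matches more likely, while fewer bands make the filter stricter.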
The LSHProcessor class then assigns labels. For each list of similar image paths in hash_buckets_list, the images are compared to each other in pairs, and a similarity score is calculated using the calculate_similarity method, which uses the Hamming distance between image signatures. If the similarity score exceeds the threshold, the same label is assigned to both images. Images that are not assigned a label in this step each receive a new label.
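The labeling step might be sketched as follows. The normalization of the Hamming distance into a 0 to 1 score, the assign_labels name, its arguments, and the merging of two existing groups when a pair links them are all assumptions made for this illustration.

```python
from itertools import combinations
from typing import Dict, List


def calculate_similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions: 1 - (Hamming distance / length)."""
    distance = sum(a != b for a, b in zip(sig_a, sig_b))
    return 1 - distance / len(sig_a)


def assign_labels(
    groups: List[List[str]],
    signatures: Dict[str, List[int]],
    threshold: float,
) -> Dict[str, int]:
    """Give images in a candidate group the same label when they are similar enough."""
    labels: Dict[str, int] = {}
    next_label = 0
    for group in groups:
        for a, b in combinations(group, 2):
            if calculate_similarity(signatures[a], signatures[b]) <= threshold:
                continue
            label_a, label_b = labels.get(a), labels.get(b)
            if label_a is None and label_b is None:
                labels[a] = labels[b] = next_label
                next_label += 1
            elif label_a is None or label_b is None:
                # Propagate the label that already exists to the other image.
                labels[a] = labels[b] = label_a if label_a is not None else label_b
            elif label_a != label_b:
                # Both already labeled: merge b's group into a's.
                for path in list(labels):
                    if labels[path] == label_b:
                        labels[path] = label_a
    # Images without any similar partner each get a fresh label.
    for path in signatures:
        if path not in labels:
            labels[path] = next_label
            next_label += 1
    return labels
```

Only pairs that already share a bucket are compared, which is what keeps the number of Hamming-distance computations far below the all-pairs count.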
For images A and B, if their similarity score exceeds threshold X, the same label is assigned.