PDF and Image Diff Tool

This tool compares the text content of two PDF files or images and generates an HTML file that displays the differences in a view similar to VSCode's Git diff.

Features

Extract text from PDF files.
Extract text from images (using Tesseract OCR).
Compare two text contents and generate a diff result.
Output the diff result in HTML format with highlighted additions, deletions, and unchanged content.
Automatically open the generated HTML file.
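
The compare-and-render steps can be approximated with Python's standard library alone. The sketch below uses `difflib.HtmlDiff` to build a side-by-side HTML page from two strings. It is an illustration only: this README does not say which diff routine the tool uses, and the function name `build_html_diff` and its parameters are hypothetical.

```python
import difflib


def build_html_diff(text_a, text_b, name_a="file1", name_b="file2"):
    """Return a complete HTML page showing a side-by-side line diff."""
    # wrapcolumn keeps long extracted lines from stretching the table
    differ = difflib.HtmlDiff(wrapcolumn=80)
    return differ.make_file(
        text_a.splitlines(),
        text_b.splitlines(),
        fromdesc=name_a,
        todesc=name_b,
    )


if __name__ == "__main__":
    page = build_html_diff("alpha\nbeta\n", "alpha\ngamma\n")
    print(len(page), "characters of HTML")
```

`HtmlDiff` applies its own color scheme, so a tool that wants specific green and red highlighting would likely render its own rows instead.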

Dependencies

Python 3.x
pdfplumber library
pytesseract library
Pillow library
Tesseract OCR engine

Installation

Install Python 3.x (recommended version: Python 3.9).
Install the required Python libraries:

```
pip install -r requirements.txt
```

Install the Tesseract OCR engine:
- macOS:
```
brew install tesseract
```
- Linux:
```
sudo apt-get install tesseract-ocr
```
- Windows: Download and install the Tesseract OCR engine from [Tesseract OCR for Windows]

Usage

Prepare the two files (PDFs or images) you want to compare.
If you want to compare images, set the Tesseract OCR engine path and language in the compare.py file.
Run the following command in the terminal:

```
python compare.py <file1> <file2>
```

Replace <file1> and <file2> with the paths to the files you want to compare.
The program will generate the HTML file output/diff_output.html and automatically open it in your default browser.
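
The OCR settings mentioned in the steps above live in compare.py, which this README does not show. As a rough sketch of what the extraction step might look like, the following uses pdfplumber for PDFs and pytesseract for images. The names `detect_kind` and `extract_text`, the `tesseract_cmd` and `lang` parameters, and the set of image extensions are assumptions made for illustration, not the tool's actual code.

```python
from pathlib import Path

PDF_EXTENSIONS = {".pdf"}
# Assumed set of image types; the real tool may accept others
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def detect_kind(path):
    """Classify a file as 'pdf' or 'image' from its extension."""
    suffix = Path(path).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    raise ValueError(f"Unsupported file type: {path}")


def extract_text(path, tesseract_cmd=None, lang="eng"):
    """Return the text of a PDF (pdfplumber) or an image (Tesseract OCR)."""
    kind = detect_kind(path)
    if kind == "pdf":
        import pdfplumber  # imported lazily so image-only use does not need it

        with pdfplumber.open(path) as pdf:
            # extract_text() may return None for pages without a text layer
            return "\n".join(page.extract_text() or "" for page in pdf.pages)

    import pytesseract
    from PIL import Image

    if tesseract_cmd:
        # Point pytesseract at the Tesseract binary when it is not on PATH
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

Passing `tesseract_cmd` is mainly useful on Windows, where the Tesseract binary is often not on PATH. The `lang` value must match a language pack installed with Tesseract.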

Examples

Compare two PDF files:

```
python compare.py ./input/pdf1.pdf ./input/pdf2.pdf
```

Compare a PDF file and an image:

```
python compare.py ./input/pdf1.pdf ./input/pdf2.png
```

Output

The generated HTML file will display the text content of the two files side by side, with differences highlighted:

Green: Added content.
Red: Deleted content.
White: Unchanged content.
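
One way to arrive at this three-color scheme is to classify each line as added, deleted, or unchanged and then pick a row background from that status. The sketch below does this with `difflib.SequenceMatcher`. The function names and the exact hex colors are assumptions for illustration, and the tool's real rendering may differ.

```python
import difflib
import html

# Assumed shades of green, red, and white
COLORS = {"added": "#e6ffed", "deleted": "#ffeef0", "unchanged": "#ffffff"}


def diff_rows(lines_a, lines_b):
    """Yield (status, left, right) tuples describing a side-by-side diff."""
    matcher = difflib.SequenceMatcher(a=lines_a, b=lines_b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for left, right in zip(lines_a[i1:i2], lines_b[j1:j2]):
                yield "unchanged", left, right
            continue
        # A 'replace' is reported as deletions followed by additions
        for left in lines_a[i1:i2]:   # non-empty for 'delete' and 'replace'
            yield "deleted", left, ""
        for right in lines_b[j1:j2]:  # non-empty for 'insert' and 'replace'
            yield "added", "", right


def render_table(lines_a, lines_b):
    """Render the diff as an HTML table with one colored row per line."""
    rows = []
    for status, left, right in diff_rows(lines_a, lines_b):
        rows.append(
            f'<tr style="background:{COLORS[status]}">'
            f"<td>{html.escape(left)}</td><td>{html.escape(right)}</td></tr>"
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

Escaping each cell with `html.escape` matters here because extracted PDF or OCR text can contain `<` and `&` characters that would otherwise break the page.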

Notes

If comparing image files, ensure the text in the images is clear to improve OCR accuracy.
The generated HTML file will be saved at output/diff_output.html.
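
Saving the report and opening it can be sketched with `pathlib` and `webbrowser` from the standard library. The function name `save_report` and the `open_browser` flag are hypothetical. Only the default path output/diff_output.html comes from this README.

```python
import webbrowser
from pathlib import Path


def save_report(html_text, out_path="output/diff_output.html", open_browser=True):
    """Write the HTML report to disk and optionally open it in the default browser."""
    target = Path(out_path)
    target.parent.mkdir(parents=True, exist_ok=True)  # create output/ if missing
    target.write_text(html_text, encoding="utf-8")
    if open_browser:
        # as_uri() requires an absolute path, hence resolve()
        webbrowser.open(target.resolve().as_uri())
    return target
```

Setting `open_browser=False` is convenient for scripted or headless runs where no browser is available.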

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

Feel free to submit issues and pull requests. Contributions are welcome!
