PDF Metadata Scanner

A command-line tool to recursively scan folders for PDF files and extract:

PDF metadata (Info dictionary via pikepdf)
XMP and RDF metadata
Embedded image metadata (JPEG, PNG, TIFF — EXIF, text, and other supported fields)

🛠 Features

🔍 Recursive folder scanning
🧼 Clean separation of metadata output and error/warning logs
🖼 Embedded image metadata support via Pillow (JPEG, PNG, TIFF)
📑 XMP/RDF metadata parsing
⚙️ Optional progress bar and verbose logging

📦 Installation

Install directly from PyPI:

pip install pdf-metadata-scanner

Or from source:

git clone https://github.com/yourname/pdf-metadata-scanner.git
cd pdf-metadata-scanner
pip install .

🚀 Usage

After installing:

pdfscan <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]

If running from source without installation:

python scanner.py <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]

Arguments:

Flag	Shorthand	Description	Default
`folder`		Folder to recursively scan for PDFs	required
`--log LOG_FILE`	`-l`	Log file for warnings/errors	`scanner_warnings.log`
`--out OUTPUT_FILE`	`-o`	Output file for extracted metadata	`pdf_metadata_output.txt`
`--verbose`	`-v`	Output logs to both file and console	(off)
`--progress`	`-p`	Show a live progress bar while scanning PDFs	(off)

🧾 Example

pdfscan ./documents --log logs.txt --out metadata.txt --verbose --progress

logs.txt: Contains only errors or warnings.
metadata.txt: Contains all extracted metadata.

📄 Output Format

[PDF Metadata] test.pdf
    /Author: Jane Doe
    /Title: Example Document

[XMP Metadata] test.pdf
    <dc:title>Example</dc:title>
    <dc:creator>Jane Doe</dc:creator>

[Image Metadata] test.pdf - Page 1 - Im0
    DateTimeOriginal: 2024:01:01 12:00:00
    DPI: (300, 300)

🧪 Testing

This project includes unit tests for core functionality.

Run tests:

pip install -r requirements-dev.txt
python -m unittest test_scanner.py

🔒 Notes

Some image formats (e.g. CCITT, JBIG2) are skipped due to decoding limitations.
PNG and JPEG/TIFF metadata is extracted where available.
The tool is read-only — it does not modify PDFs.

🧾 License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
.github/workflows		.github/workflows
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
publiccode.yml		publiccode.yml
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
scanner.py		scanner.py
test_scanner.py		test_scanner.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

PDF Metadata Scanner

🛠 Features

📦 Installation

🚀 Usage

Arguments:

🧾 Example

📄 Output Format

🧪 Testing

Run tests:

🔒 Notes

🧾 License

About

Resources

Uh oh!

Releases 2

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

annejan/pdf-metadata-scanner

Folders and files

Latest commit

History

Repository files navigation

PDF Metadata Scanner

🛠 Features

📦 Installation

🚀 Usage

Arguments:

🧾 Example

📄 Output Format

🧪 Testing

Run tests:

🔒 Notes

🧾 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages