8000 GitHub - dmitrysobolev/pdf-redact-tool
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

dmitrysobolev/pdf-redact-tool

Repository files navigation

PDF Redaction Tool

A professional-grade Python tool that securely redacts text from PDF files with advanced pattern matching, configuration management, and comprehensive optimization capabilities.

🚀 Features

Core Functionality

True Redaction: Permanently removes text from PDF content (not just visual covering)
Multiple Pattern Support: Redact multiple text patterns in a single command
Regular Expression Support: Advanced pattern matching with full regex capabilities
Case-Insensitive Matching: Flexible text search options
Whole Word Matching: Precise redaction control
File Size Optimization: Automatic compression and optimization with qpdf

Advanced Features

Dry-Run Mode: Preview changes before applying them
Interactive Progress Bars: Real-time progress tracking with detailed statistics
Configuration Management: JSON-based configuration with predefined pattern sets
Professional Logging: Configurable logging levels with detailed progress reporting
Modular Architecture: Clean, maintainable, and extensible codebase
Comprehensive Testing: Unit tests with mock support
Type Safety: Full type hints for better IDE support

🏗️ Architecture

The tool is built with a modular, professional architecture:

pdf-redact-tool/
├── main.py                    # CLI interface and entry point
├── pdf_redactor.py           # Core redaction logic (PDFRedactor class)
├── config.py                 # Configuration management
├── test_redactor.py          # Unit tests
├── example_config.json       # Configuration example
├── README.md                 # This file
└── pyproject.toml           # Project configuration

📋 Requirements

  • Python 3.8+
  • uv (Python package manager)
  • qpdf (PDF optimization tool)

🛠️ Installation

  1. Install qpdf (if not already installed):

    # macOS
    brew install qpdf
    
    # Ubuntu/Debian
    sudo apt-get install qpdf
    
    # Windows (with Chocolatey)
    choco install qpdf
  2. Clone and set up the project:

    git clone <repository-url>
    cd pdf-redact-tool
  3. Dependencies are managed automatically with uv and the project configuration.

🎯 Usage

Basic Usage

# Simple text redaction
uv run python main.py document.pdf "text to redact"

# Multiple patterns
uv run python main.py document.pdf "pattern1" "pattern2" "pattern3"

Advanced Usage

# Regular expressions with case-insensitive matching
uv run python main.py --regex --case-insensitive document.pdf "licensed to.*"

# Dry run to preview changes
uv run python main.py --dry-run document.pdf "confidential"

# Custom output file with verbose logging
uv run python main.py -v -o sanitized.pdf document.pdf "sensitive.*"

# Whole word matching only
uv run python main.py --whole-words document.pdf "John"

# Disable progress bars for scripting
uv run python main.py --no-progress document.pdf "confidential"

Command Line Options

usage: main.py [-h] [-o OUTPUT] [--regex] [--case-insensitive] [--whole-words] 
               [--dry-run] [-v] [--no-progress] input_file patterns [patterns ...]

options:
  -h, --help           Show help message
  -o, --output OUTPUT  Custom output file path
  --regex              Treat patterns as regular expressions
  --case-insensitive   Perform case-insensitive matching
  --whole-words        Match whole words only
  --dry-run            Preview changes without applying them
  -v, --verbose        Enable verbose output with detailed logging
  --no-progress        Disable progress bars

📊 Progress Tracking

The tool provides comprehensive progress tracking with interactive progress bars:

Features

  • Pattern Processing: Shows progress through multiple patterns with current statistics
  • Page Scanning: Real-time page-by-page progress for each pattern
  • Performance Metrics: Processing speed, estimated time remaining, and match counts
  • Customizable: Can be disabled with --no-progress for scripting environments

Progress Bar Components

  • Pattern Progress: Overall progress through all patterns
  • Page Progress: Current page being processed within each pattern
  • Statistics: Number of matches found and processing speed
  • Time Estimates: Completion time estimates for long operations

Integration

  • CLI Mode: Automatic progress bars for interactive use
  • Scripting Mode: Use --no-progress to disable for clean output
  • Programmatic: Control via show_progress parameter in PDFRedactor

📊 Sample Output

Dry Run Preview

INFO: Processing: document.pdf
INFO: Patterns to redact: ['www.example.com']
INFO: DRY RUN MODE - No changes will be made
INFO: Previewing redactions for: document.pdf
Previewing: 'www.example.com': 100%|██████████| 1/1 [00:01<00:00,  1.2pattern/s, found=15, total=15]
INFO: Preview Results:
INFO:   Pattern 'www.example.com': 15 instances on pages [1, 3, 5, 8, 12]
INFO: Total instances that would be redacted: 15
INFO: Pages that would be affected: [1, 3, 5, 8, 12]

Actual Redaction

INFO: Processing: document.pdf
INFO: Patterns to redact: ['confidential', 'internal use only']
INFO: Output will be: document_redacted.pdf
INFO: Opening PDF: document.pdf
Processing pattern: 'confidential': 50%|██████     | 1/2 [00:02<00:02, 2.1s/pattern, found=8, total=8]
INFO: Searching for pattern: 'confidential'
INFO: Redacted 8 instances of 'confidential'
Processing pattern: 'internal use only': 100%|██████████| 2/2 [00:04<00:00, 2.0s/pattern, found=3, total=11]
INFO: Searching for pattern: 'internal use only'
INFO: Redacted 3 instances of 'internal use only'
Applying redactions and optimizing...
INFO: Applying redactions and optimizing...
INFO: qpdf optimization completed successfully
INFO: Total instances redacted: 11
INFO: Original size: 3.2 MB
INFO: Final size: 2.7 MB
INFO: Size change: -15.6%
INFO: Redaction completed: document_redacted.pdf

⚙️ Configuration

Create a config.json file for advanced configuration:

{
  "case_sensitive": false,
  "use_regex": true,
  "create_backup": true,
  "log_level": "INFO",
  "pattern_sets": {
    "email": ["\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b"],
    "phone": ["\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b"],
    "confidential": ["\\bCONFIDENTIAL\\b", "\\bPROPRIETARY\\b"]
  }
}

🔧 Programmatic Usage

Use the tool programmatically in your Python code:

from pdf_redactor import PDFRedactor

# Create redactor instance with progress bars
redactor = PDFRedactor("input.pdf", "output.pdf", show_progress=True)

# Redact multiple patterns
patterns = ["confidential", "internal.*only"]
count = redactor.find_and_redact_text(
    patterns, 
    case_sensitive=False, 
    use_regex=True
)

print(f"Redacted {count} instances")

# Get file size information
original, final, change = redactor.get_size_info()
print(f"Size change: {change:+.1f}%")

🛡️ Security

This tool performs true redaction:

  • Text is permanently removed from PDF content
  • Cannot be recovered by copy/paste, search, or PDF analysis tools
  • Text areas become empty/transparent (showing page background)
  • No traces of original text remain in the file structure
  • Enterprise-grade security suitable for confidential documents

📈 Performance & Optimization

The tool provides excellent performance and optimization:

  • Efficient processing of large PDFs (tested on 238-page documents)
  • Advanced compression often reduces file size by 15-25%
  • Memory-efficient processing with proper resource management
  • Parallel processing capabilities for multiple patterns
  • Optimized qpdf integration for maximum compression

🧪 Testing

Run the comprehensive test suite:

uv run python test_redactor.py

The test suite includes:

  • Unit tests for all core functionality
  • Mock-based testing for external dependencies
  • Edge case validation
  • Performance benchmarks

🔍 Real-World Example

Successfully tested on a 238-page PDF:

  • Removed: 238 instances of watermark text
  • File size reduction: 22.7% (11MB → 8.1MB)
  • Processing time: Under 30 seconds
  • Verification: 0 instances remaining after redaction

📁 Project Structure Details

  • main.py: Command-line interface with argparse
  • pdf_redactor.py: Core PDFRedactor class with all redaction logic
  • config.py: Configuration management with RedactionConfig class
  • test_redactor.py: Comprehensive unit test suite
  • example_config.json: Sample configuration file

🚨 Error Handling

Comprehensive error handling includes:

  • ✅ File existence validation
  • ✅ PDF format validation
  • ✅ qpdf installation checking
  • ✅ Regex pattern validation
  • ✅ Resource cleanup (temporary files)
  • ✅ Graceful failure with helpful error messages

🎯 Migration from Simple Script

If you were using the original simple script:

  • Basic usage remains the same: python main.py file.pdf "text" still works
  • All new features are optional: Existing workflows continue unchanged
  • Enhanced output: More detailed and informative reporting
  • Better reliability: Professional error handling and validation

🤝 Contributing

We welcome contributions! The modular architecture makes it easy to:

  • Add new redaction patterns
  • Enhance PDF processing capabilities
  • Improve performance optimizations
  • Extend configuration options

📄 License

This project is open source. Use responsibly and ensure you have rights to modify any PDFs you process.


📚 Additional Resources

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

0