This project converts images and PDF files into speech using OpenAI's GPT-4o for transcription and TTS-1 for text-to-speech conversion. It's designed to take unstructured visual data and turn it into easily understandable audio content, with special support for automated flyer processing.
- Supports both image (.png, .jpg, .jpeg) and PDF input files
- Uses GPT-4o to transcribe and describe the content of images and PDFs
- Converts transcriptions to speech using OpenAI's TTS-1 model
- Outputs MP3 audio files
- Automated flyer processing with date-based organization
- Support for multiple flyer sources (e.g., Target Weekly Ad)
- Python 3.7+
- OpenAI API key
- Chrome or Firefox browser (for Selenium)
- ChromeDriver or GeckoDriver (for Selenium)
- Clone this repository:
git clone https://github.com/access-news/tts.git
cd tts
- Install the required packages:
pip install -r requirements.txt
nix-shell shell.nix
- Install Selenium and WebDriver dependencies:
For Chrome:
# Install Chrome WebDriver
pip install webdriver-manager
For Firefox:
# Install Firefox GeckoDriver
# On Ubuntu/Debian:
sudo apt-get install firefox-geckodriver
# On macOS with Homebrew:
brew install geckodriver
- Create a
.env
file in the project root and add your OpenAI API key:
OPENAI_API_KEY=your_api_key_here
project/
├── main.py # Main script for processing files
├── transcription.py # Handles image/PDF transcription
├── input/ # Regular input folder for files
├── output/ # Output folder for MP3 files
├── flyers/ # Root folder for flyer processing
│ └── target_weekly_ad/ # Example flyer source
│ └── YYYYMMDD/ # Date-based folders
└── requirements.txt # Python dependencies
- Place your input files (images or PDFs) in the
input
folder. - Run the main script:
python main.py
- Select option 1 when prompted for input source.
- The script will process each file and generate corresponding MP3 files in the
output
folder.
- Ensure the proper folder structure exists under
flyers/
:
flyers/
└── target_weekly_ad/
└── YYYYMMDD/
└── [image files]
- Run the main script:
python main.py
- Select option 2 when prompted for input source.
- Choose the flyer type (e.g., target_weekly_ad).
- Select the date folder to process.
- The script reads files from the
input
folder. - For each file:
- If it's a PDF, it's converted to images.
- The image(s) are encoded to base64.
- GPT-4o is used to transcribe and describe the content.
- The transcription is converted to speech using OpenAI's TTS-1 model.
- The resulting audio is saved as an MP3 file in the
output
folder.
- Flyers are organized by source and date in the
flyers
directory. - Each flyer source (e.g., target_weekly_ad) has its own directory.
- Within each source directory, flyers are organized by date (YYYYMMDD format).
- The script processes all images in the selected date folder.
- Generated audio files are saved in the
output
folder.
The project includes support for automated flyer downloading using Selenium:
- Selenium automates the browser to navigate to flyer websites
- Downloads are organized by date in the appropriate flyer source folder
- The system is designed to add context and description to the transcriptions, making them more suitable for audio consumption.
- The TTS model uses the "nova" voice, but this can be changed in the
main.py
file. - Flyer processing is organized by date to maintain historical records.
- Selenium automation helps maintain up-to-date flyer content.
- Large PDF files may take longer to process due to the conversion to images.
- The quality of transcription and description depends on the clarity and complexity of the input images.
- Automated flyer downloads may break if websites change their structure.
- Selenium requires proper WebDriver setup and maintenance.
If you encounter issues with Selenium:
- Check WebDriver compatibility:
# For Chrome, check versions
google-chrome --version
chromedriver --version
# For Firefox, check versions
firefox --version
geckodriver --version
- Update WebDriver if needed:
# For Chrome
webdriver-manager update
# For Firefox
# Update through your package manager
- Common Selenium errors:
- "WebDriver not found": Ensure the correct WebDriver is installed and in your PATH
- "Version mismatch": Update WebDriver to match your browser version
- "Browser not found": Verify browser installation and PATH settings
- For memory issues with large files, try processing fewer files at once
Contributions are welcome! Please feel 8948 free to submit a Pull Request.
POST /convert
Converts images or PDFs to speech audio. Accepts file uploads, base64-encoded data, or URLs.
Request Body:
- Multipart form data with file upload:
file: [binary file data]
- OR JSON with base64 data:
{ "base64": "base64_encoded_string" }
- OR JSON with URL:
{ "url": "https://example.com/image.jpg" }
Supported File Types:
- Images: .png, .jpg, .jpeg
- Documents: .pdf
Response:
{
"audio": "base64_encoded_audio",
"message": "Conversion successful"
}
Error Response:
{
"error": "Error message description"
}
Status Codes:
- 200: Success
- 400: Bad request (invalid file type, missing file)
- 500: Server error
GET /fetch-target-ads
Fetches the latest Target weekly advertisements, processes them, and returns base64-encoded audio for each ad.
Response:
{
"message": "Target ads fetched and processed successfully",
"date": "20240320",
"audio_files": [
{
"filename": "target_ad_1.jpg",
"audio": "base64_encoded_audio_string"
},
{
"filename": "target_ad_2.jpg",
"audio": "base64_encoded_audio_string"
}
// ... more files
]
}
Error Response:
{
"error": "Error message description"
}
Status Codes:
- 200: Success
- 404: No ads found
- 500: Server error
Converting a File:
curl -X POST -F "file=@image.png" http://127.0.0.1:5000/convert
Converting from URL:
curl -X POST -H "Content-Type: application/json" \
-d '{"url":"https://example.com/image.jpg"}' \
http://127.0.0.1:5000/convert
Converting Base64 Data:
curl -X POST -H "Content-Type: application/json" \
-d '{"base64":"base64_encoded_string"}' \
http://127.0.0.1:5000/convert
Fetching Target Ads:
curl http://127.0.0.1:5000/fetch-target-ads