⚠️ Work in Progress - Technology Showcase
This is an experimental MVP demonstrating real-time AI capabilities. Not intended as a production-ready product.
Turn any live TV stream into a real-time movie script. AI watches, transcribes, and writes television as cinema.
Ever wondered what your favorite TV show would look like as a screenplay? tvtxt is an AI-powered pipeline that watches live television streams and transforms them into properly formatted movie scripts in real time. Think of it as having a tireless scriptwriter that never blinks, never sleeps, and never misses a moment.
Watch the magic unfold in real time: tvtxt live demo
This is a proof-of-concept showcase built to demonstrate the integration of several cutting-edge technologies:
- Real-time speech recognition.
- Vision-language understanding.
- Cloud-native AI inference.
- Live streaming media processing.
What this is:
- A technology demonstration.
- An experimental MVP.
- A learning playground for AI + media processing.
What this is NOT:
- A production-ready application.
- A commercial product.
- A fully-featured streaming service.
tvtxt combines cutting-edge AI models with cloud infrastructure to create a TV-to-screenplay transformation:
Modal handles our cloud GPU infrastructure, running two critical AI workloads:
- Parakeet ASR Model (NVIDIA): Transcribes speech with remarkable accuracy and speed.
- Qwen2-VL Vision-Language Model: Describes visual scenes with cinematic flair.
Structured output generation ensures our vision model returns well-formed JSON responses:
- Schema enforcement: Guarantees consistent screenplay structure.
- Azure Blob Storage: Temporarily stores captured video frames for visual analysis.
- Redis Cloud: Acts as the bridge between our backend pipeline and frontend display.
- FastHTML: Creates our live web interface with authentic screenplay styling.
- FFmpeg: The unsung hero that handles all media processing.
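To make the schema enforcement idea concrete, here is a minimal sketch that checks a model response against a fixed set of fields after it has been generated. This is plain post-hoc validation with the standard library, not the constrained generation the project itself uses, and the field names (`scene_heading`, `action`, `characters`) are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical schema: these field names are illustrative,
# not taken from the tvtxt source.
REQUIRED_FIELDS = {"scene_heading": str, "action": str, "characters": list}


def parse_scene(raw: str) -> dict:
    """Parse a model response and check it against the assumed scene schema."""
    # json.loads raises json.JSONDecodeError (a ValueError subclass) on bad JSON.
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected_type):
            raise ValueError(f"field {name!r} should be {expected_type.__name__}")
    return data
```

A response that passes `parse_scene` can be rendered without further checks; one that fails can be dropped so the display keeps the previous scene.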
- 🎥 Stream Capture: FFmpeg latches onto a live TV stream, extracting both audio and video.
- 🎧 Audio Analysis: Every 10 seconds, audio chunks are sent to Modal's Parakeet ASR model for transcription.
- 📸 Frame Extraction: When speech is detected, FFmpeg captures a corresponding video frame.
- ☁️ Image Upload: The frame is uploaded to Azure Blob Storage and gets a public URL.
- 👁️ Visual Understanding: Modal's Qwen2-VL model analyzes the image and generates a screenplay-formatted scene description.
- 💾 Memory Update: The latest transcription and scene description are saved to Redis Cloud.
- 🖥️ Live Display: FastHTML serves a web page that auto-refreshes, showing the generated screenplay.
- 🔄 Repeat: The cycle continues, creating an ever-updating script of live television.
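The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `ingest.py`: the function names are hypothetical, the 16 kHz mono audio format is an assumption (a common choice for speech recognition), and the transcription, upload, description, and save steps are passed in as callables so that no cloud service is needed to follow the control flow.

```python
from typing import Callable

CHUNK_SECONDS = 10  # chunk length stated in the steps above


def audio_chunk_cmd(stream_url: str, out_path: str) -> list[str]:
    """FFmpeg arguments to record one audio-only chunk from the stream."""
    # -vn drops video; -ac 1 / -ar 16000 (mono, 16 kHz) are assumed settings.
    return ["ffmpeg", "-y", "-i", stream_url, "-t", str(CHUNK_SECONDS),
            "-vn", "-ac", "1", "-ar", "16000", out_path]


def frame_cmd(stream_url: str, out_path: str) -> list[str]:
    """FFmpeg arguments to grab a single video frame as an image."""
    return ["ffmpeg", "-y", "-i", stream_url, "-frames:v", "1", out_path]


def run_cycle(
    transcribe: Callable[[], str],
    capture_frame: Callable[[], str],
    upload: Callable[[str], str],
    describe: Callable[[str], str],
    save: Callable[[dict], None],
) -> bool:
    """Run one pass of the pipeline; return False when no speech was found."""
    text = transcribe().strip()
    if not text:
        return False  # no speech detected, so skip the frame and vision steps
    frame_path = capture_frame()
    frame_url = upload(frame_path)
    scene = describe(frame_url)
    save({"transcript": text, "scene": scene})
    return True
```

In a real run, each command list would be handed to `subprocess.run`, and `run_cycle` would be called in a loop. Skipping the vision step on silent chunks avoids spending GPU time on frames that have no dialogue to accompany them.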
uv venv
source .venv/bin/activate # or `.venv\Scripts\activate` on Windows
uv pip install -r requirements.txt
modal token new
Create a .env file with your secret weapons:
# Azure Blob Storage (for frame storage)
AZURE_STORAGE_CONNECTION_STRING=your_azure_connection_string
# Redis Cloud (for state management)
REDIS_HOST=your_redis_host
REDIS_PORT=your_redis_port
REDIS_USERNAME=your_redis_username
REDIS_PASSWORD=your_redis_password
# HuggingFace (for model access)
HF_TOKEN=your_huggingface_token
# Modal endpoint (will be generated after deployment)
IMAGE_DESCRIBER_URL=your_modal_endpoint_url
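A small startup check can catch a missing or mistyped setting before the pipeline begins. The sketch below is an assumption about how one might do this, not code from the project: `load_settings` is a hypothetical helper, and it expects the variables to already be in the process environment (it does not parse the .env file itself).

```python
import os

# Variable names as listed in the .env template above.
REQUIRED_KEYS = [
    "AZURE_STORAGE_CONNECTION_STRING",
    "REDIS_HOST",
    "REDIS_PORT",
    "REDIS_USERNAME",
    "REDIS_PASSWORD",
    "HF_TOKEN",
    "IMAGE_DESCRIBER_URL",
]


def load_settings(env=None) -> dict:
    """Collect required settings, failing early if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("missing settings: " + ", ".join(missing))
    settings = {key: env[key] for key in REQUIRED_KEYS}
    settings["REDIS_PORT"] = int(settings["REDIS_PORT"])  # ports are numeric
    return settings
```

Because `IMAGE_DESCRIBER_URL` only exists after the model has been deployed, a check like this belongs in the steps that run after deployment.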
Launch your scene description model to the cloud:
modal deploy scene_describer.py
Note: Copy the generated endpoint URL to your .env file as IMAGE_DESCRIBER_URL
Fire up the transcription pipeline:
modal run ingest.py
Launch the web interface:
cd app
python main.py
Open your browser to http://localhost:5001 and watch as live TV transforms into screenplay format before your eyes!
tvtxt embraces ephemerality by design. Like live theater, each moment exists only in the present:
- No databases: Only the current scene matters.
- No history: Previous scripts vanish like morning mist.
- No storage: Frames and audio exist only long enough to be processed.
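One way to picture the "only the current scene matters" design is a store with a single key that every update overwrites. The sketch below uses an in-memory dictionary as a stand-in for the real key-value store; the class name and the key `tvtxt:latest` are hypothetical, chosen for illustration.

```python
import json


class LatestSceneStore:
    """In-memory stand-in for a store that keeps only the newest scene."""

    KEY = "tvtxt:latest"  # hypothetical key name

    def __init__(self):
        self._data = {}

    def save(self, transcript: str, scene: str) -> None:
        # Writing to the same key replaces the previous scene, so no history builds up.
        self._data[self.KEY] = json.dumps({"transcript": transcript, "scene": scene})

    def load(self):
        """Return the latest scene as a dict, or None if nothing was saved yet."""
        raw = self._data.get(self.KEY)
        return json.loads(raw) if raw is not None else None
```

With this shape, the frontend only ever reads one value, and the previous scene disappears as soon as the next one is written.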
This project demonstrates real-time AI transcription and visual analysis using Al Jazeera English as a public live stream. No content is stored, archived, or redistributed. The system processes live broadcasts in real time for educational and demonstration purposes only.