WHISPER API

An HTTP frontend service for audio transcription with WhisperX, protected by Bearer token authentication.

Architecture

The Whisper API follows a queue-based architecture with modular components:

HTTP Handler:
- Receives transcription requests containing audio files
- Stores the audio file in a temporary directory
- Adds the job to the queue manager's queue
- Responds immediately with a URL for checking transcription status
- Serves completed transcriptions when requested
- Requests cleanup of files after delivering results
Queue Manager:
- Manages a FIFO queue of transcription jobs
- Supports both sequential and concurrent processing modes (configurable via ENABLE_CONCURRENCY)
- Limits the number of concurrent jobs when concurrency is enabled (configurable via MAX_CONCURRENT_JOBS)
- Invokes the WhisperX command to transcribe audio files
- Stores results in a temporary directory
- Tracks job status (queued, processing, completed, failed)
- Handles cleanup of all job files (both temporary and output files)
- Automatically removes old jobs after the retention period (configurable, default: 48 hours)
Client Workflow:
- Client authenticates by including an Authorization header with a Bearer token
- Client submits audio for transcription and receives a job ID
- Client periodically checks status URL until transcription is ready
- When ready, client downloads the transcription result
- After successful download, all files (both temporary and output files) are cleaned up
- Alternatively, client can cancel a pending job if transcription is no longer needed
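
The polling step of the client workflow above can be sketched as follows. This is a minimal illustration, not the project's client: the function name wait_for_job, the polling interval, and the injected fetch_status callable are all assumptions made so the loop can run without a live server. Only the status values and the idea of polling the status URL come from this document.

```python
import time
from typing import Callable, Dict


def wait_for_job(
    fetch_status: Callable[[], Dict],
    interval: float = 2.0,
    max_polls: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll a job until it reports Completed or Failed and return that status.

    fetch_status is a hypothetical callable that performs the
    GET /transcription/{job_id} request and returns the decoded JSON body.
    """
    for _ in range(max_polls):
        status = fetch_status()
        # "Completed" and "Failed" are the terminal states listed in this README.
        if status.get("status") in ("Completed", "Failed"):
            return status
        sleep(interval)
    raise TimeoutError("job did not finish within the polling budget")
```

In a real client, fetch_status would send the Authorization header with each request, and a Completed status would be followed by a GET on the result endpoint.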

Configuration

Configuration File

The application uses a configuration system with the following priority (highest to lowest):

1. Environment variables
2. Configuration file (whisper_api.conf)
3. Default constants

The configuration file uses TOML format and should be placed in the same directory as the application. Here's an example of the configuration file:

# Whisper API Configuration File
WHISPER_API_HOST = "192.168.0.116"
WHISPER_API_PORT = "8181"
WHISPER_CMD = "/home/llm/whisper_api/whisperx.sh"
WHISPER_DEVICE = "cuda"
WHISPER_DEVICE_INDEX = "0"
# Additional configuration options...

At startup, the application will:

1. Try to load the configuration from whisper_api.conf
2. Set environment variables from the config file if they don't already exist
3. Use environment variables or fall back to default values

This allows for flexible configuration management across different environments.
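
The priority order described above (environment variables, then the configuration file, then defaults) can be illustrated with a small sketch. The function resolve_config and the DEFAULTS subset are hypothetical names chosen for this example; the actual loader lives in the Rust sources and may differ in detail.

```python
from typing import Dict, Mapping

# A small subset of the documented defaults, used only for illustration.
DEFAULTS = {
    "WHISPER_API_HOST": "127.0.0.1",
    "WHISPER_API_PORT": "8181",
    "WHISPER_DEVICE": "cuda",
}


def resolve_config(
    file_values: Mapping[str, str],
    env: Mapping[str, str],
    defaults: Mapping[str, str] = DEFAULTS,
) -> Dict[str, str]:
    """Merge settings so that env overrides the file, which overrides defaults."""
    resolved = dict(defaults)
    resolved.update(file_values)
    # Only settings known from the defaults or the file are taken from the
    # environment; unrelated variables such as PATH are ignored.
    resolved.update({k: v for k, v in env.items() if k in resolved})
    return resolved
```

For example, a file that sets WHISPER_DEVICE to "cpu" is overridden when the environment already defines WHISPER_DEVICE as "cuda", while WHISPER_API_PORT falls back to its default if neither source sets it.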

Environment Variables

The application can be configured using the following environment variables:

Variable	Description	Default Value
`WHISPER_CMD`	Path to the WhisperX script	`/home/llm/whisper_api/whisperx.sh`
`WHISPER_MODELS_DIR`	Path to the models directory	`/models`
`WHISPER_OUTPUT_DIR`	Directory for WhisperX to store output files	`/home/llm/whisper_api/output`
`WHISPER_OUTPUT_FORMAT`	Default output format for transcriptions	`txt`
`WHISPER_DEVICE`	Device to use for PyTorch inference (cuda or cpu)	`cuda`
`WHISPER_DEVICE_INDEX`	Device index to use for FasterWhisper inference (used when multiple GPUs are available)	`0`
`WHISPER_TMP_FILES`	Directory for storing temporary files	`/home/llm/whisper_api/tmp`
`WHISPER_API_HOST`	Host address to bind the server	`127.0.0.1`
`WHISPER_API_PORT`	Port for the HTTP server	`8181`
`WHISPER_API_TIMEOUT`	Client disconnect timeout in seconds	`480`
`WHISPER_API_KEEPALIVE`	Keep-alive timeout in seconds	`480`
`HTTP_WORKER_NUMBER`	Number of HTTP worker processes (0 = use number of CPU cores)	`0`
`WHISPER_JOB_RETENTION_HOURS`	Number of hours to keep job files before automatic cleanup	`48`
`WHISPER_CLEANUP_INTERVAL_HOURS`	Interval in hours between cleanup runs	`1`
`MAX_FILE_SIZE`	Maximum file size for uploads in bytes	`536870912` (512MB)
`ENABLE_CONCURRENCY`	Enable concurrent job processing	`false`
`MAX_CONCURRENT_JOBS`	Maximum number of concurrent jobs (when enabled)	`6`
`ENABLE_AUTHORIZATION`	Require authentication for all API requests	`true`
`SYNC_REQUEST_TIMEOUT_SECONDS`	Timeout for synchronous transcription requests (0 = no timeout)	`1800`
`DEFAULT_SYNC_MODE`	Default mode when sync parameter is missing (true = sync, false = async)	`false`
`RUST_LOG`	Logging level (error, warn, info, debug, trace)	`info`
`HF_TOKEN`	Hugging Face API token for diarization models access (can alternatively be passed per-request or loaded from file)	None
`WHISPER_HF_TOKEN_FILE`	Path to file containing Hugging Face API token	`/home/llm/whisper_api/hf_token.txt`

Authentication

By default, all API requests must include an Authorization header with a Bearer token:

Authorization: Bearer your_token_here

Requests without a valid Authorization header will be rejected with a 401 Unauthorized response.

Note: The API currently performs only a placeholder check that accepts any token, but the header must be present and properly formatted. To enforce real validation, customize the validate_token function in the authentication.rs file.

Authentication can be disabled by setting ENABLE_AUTHORIZATION=false in the configuration file or environment variables. When disabled, requests will be accepted without any Authorization header.
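
What "present and properly formatted" means for the header can be sketched like this. The helper extract_bearer_token is a hypothetical name, and treating the scheme as case-insensitive is an assumption; the server's real check is the Rust validate_token function and may be stricter or looser.

```python
from typing import Optional


def extract_bearer_token(header_value: Optional[str]) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <token>' value, or None.

    None models the case where a server would answer 401 Unauthorized.
    """
    if not header_value:
        return None
    parts = header_value.strip().split(None, 1)
    # Exactly two parts are expected: the scheme and a non-empty token.
    if len(parts) != 2 or parts[0].lower() != "bearer":
        return None
    token = parts[1].strip()
    return token or None
```

A custom validation step would then compare the returned token against whatever credential store the deployment uses.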

API Endpoints

OPTIONS for Transcription Resource

OPTIONS /transcription

Description: Returns the available HTTP methods and CORS headers for the /transcription resource. This endpoint is always accessible regardless of authentication settings to support CORS pre-flight requests.

Response Headers:

Allow: Lists all available HTTP methods for the resource
Access-Control-Allow-Methods: Same as Allow header, for CORS
Access-Control-Allow-Headers: Lists allowed headers, including Authorization and Content-Type
Access-Control-Max-Age: Caching duration for preflight response (86400 seconds = 24 hours)

Response:

HTTP/1.1 200 OK
Allow: OPTIONS, POST, GET, DELETE
Access-Control-Allow-Methods: OPTIONS, POST, GET, DELETE
Access-Control-Allow-Headers: Authorization, Content-Type
Access-Control-Max-Age: 86400

Submit a Transcription Job

POST /transcription

Form Parameters:

file: Audio file to transcribe (required)
language: Language code (optional, default: "fr")
model: Model name (optional, default: "large-v3")
diarize: Whether to apply speaker diarization (optional, default: true)
prompt: Initial text prompt for transcription (optional, default: "")
hf_token: Hugging Face API token for accessing diarization models (optional; if omitted, the server tries to load it from the hf_token.txt file). Speaker diarization needs a token from one of these two sources.
response_format: Format of transcription output (optional, values: "srt", "vtt", "txt", "tsv", "json", "aud", default: "txt")
device: Device to use for PyTorch inference (optional, values: "cuda", "cpu", default: value of WHISPER_DEVICE configuration)
device_index: Device index for inference when using CUDA (optional, default: value of WHISPER_DEVICE_INDEX configuration)
sync: Whether to process the request synchronously (optional, values: "true", "false", default: value of DEFAULT_SYNC_MODE configuration)

Response (Async Mode - default):

{
  "job_id": "uuid-string",
  "status_url": "/transcription/uuid-string"
}

Response (Sync Mode - when sync=true):

{
  "text": "Transcription text content",
  "language": "Detected language",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Segment text"
    }
  ]
}
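
The form parameters listed above can be assembled and checked on the client side before posting. The sketch below is only an illustration: build_form_fields is a hypothetical helper, and the server performs its own validation regardless. The allowed values and defaults are the ones documented in this section.

```python
from typing import Dict, Optional

ALLOWED_FORMATS = {"srt", "vtt", "txt", "tsv", "json", "aud"}
ALLOWED_DEVICES = {"cuda", "cpu"}


def build_form_fields(
    language: str = "fr",
    model: str = "large-v3",
    diarize: bool = True,
    response_format: str = "txt",
    device: Optional[str] = None,
    sync: Optional[bool] = None,
) -> Dict[str, str]:
    """Build the text fields of the multipart form (the file part is separate)."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if device is not None and device not in ALLOWED_DEVICES:
        raise ValueError(f"unsupported device: {device}")
    fields = {
        "language": language,
        "model": model,
        "diarize": str(diarize).lower(),  # sent as "true" or "false"
        "response_format": response_format,
    }
    # Omitted fields let the server apply its configured defaults.
    if device is not None:
        fields["device"] = device
    if sync is not None:
        fields["sync"] = str(sync).lower()
    return fields
```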

Check Transcription Status

GET /transcription/{job_id}

Response:

{
  "status": "Queued|Processing|Completed|Failed",
  "queue_position": 1,
  "data": "Additional information (result or error)"
}

Notes:

The queue_position field is only included when the status is "Queued".
It indicates the job's position in the queue (1-based index), where 1 means it's the next job to be processed after the current one.
If a job is already processing or completed, the queue_position field will not be included in the response.
You can use this value to provide users with an estimated wait time or position in line.
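
A rough wait estimate based on queue_position could look like the sketch below. The function name and the avg_job_seconds parameter are assumptions; the API does not report job durations, so the average has to come from the client's own measurements.

```python
from typing import Mapping, Optional


def estimated_wait_seconds(status: Mapping, avg_job_seconds: float) -> Optional[float]:
    """Estimate the wait for a queued job, or return None if it is not queued.

    Position 1 means the job runs after the current one, so the estimate
    is simply position times the average job duration.
    """
    position = status.get("queue_position")
    if status.get("status") != "Queued" or position is None:
        return None
    return position * avg_job_seconds
```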

Get Transcription Result

GET /transcription/{job_id}/result

Response:

{
  "text": "Transcription text content",
  "language": "Detected language",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Segment text"
    }
  ]
}

Security Note: When this endpoint is called, the system automatically removes all transcription files (both from the temporary directory and the WhisperX output directory) to ensure privacy and prevent data leakage.

Cancel a Transcription Job

DELETE /transcription/{job_id}

Description: Cancels a pending job and removes associated audio files. Jobs that are already processing cannot be canceled.

Response (Success):

{
  "success": true,
  "message": "Job canceled successfully"
}

Response (Error):

{
  "error": "Error message (job not found, already processing, etc.)"
}

Get API Status

GET /status

Description: Returns a snapshot of the API configuration and the current queue state.

Response:

{
  "server": {
    "host": "127.0.0.1",
    "port": "8181",
    "timeout": 480,
    "keepalive": 480,
    "worker_number": 2
  },
  "processing": {
    "concurrent_mode": false,
    "max_concurrent_jobs": 6,
    "device": "cuda",
    "device_index": "0",
    "default_output_format": "txt",
    "default_sync_mode": false,
    "sync_timeout": 1800
  },
  "resources": {
    "max_file_size": 536870912,
    "job_retention_hours": 48,
    "cleanup_interval_hours": 12
  },
  "security": {
    "authorization_enabled": false
  },
  "queue_state": {
    "queued_jobs": 2,
    "processing_jobs": 1
  },
  "error": null
}

Fields:

server: Server configuration details (host, port, timeouts, worker count)
processing: Processing configuration (sequential/concurrent mode, device settings, formats)
resources: Resource limits and retention policies
security: Security settings like authentication requirement
queue_state: Current number of queued and processing jobs
error: Any error that occurred when fetching queue state (null if no errors)
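
A monitoring script might reduce the /status response to a single line, as in this sketch. The helper queue_summary is hypothetical; it reads only fields that appear in the example response above.

```python
import json


def queue_summary(status_body: str) -> str:
    """Summarize the queue_state section of a /status response body."""
    doc = json.loads(status_body)
    # A non-null "error" means the queue state could not be fetched.
    if doc.get("error") is not None:
        return f"queue state unavailable: {doc['error']}"
    queue = doc["queue_state"]
    mode = "concurrent" if doc["processing"]["concurrent_mode"] else "sequential"
    return f"{queue['queued_jobs']} queued, {queue['processing_jobs']} processing ({mode} mode)"
```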

Configuration Example

A complete whisper_api.conf file might look like this:

# Server Configuration
WHISPER_API_HOST = "0.0.0.0"
WHISPER_API_PORT = "9000"
WHISPER_API_TIMEOUT = 600
WHISPER_API_KEEPALIVE = 600
HTTP_WORKER_NUMBER = 4  # Use 4 workers (limited to number of CPU cores)

# API Configuration
SYNC_REQUEST_TIMEOUT_SECONDS = 1800  # 30 minutes timeout for synchronous requests
DEFAULT_SYNC_MODE = false            # Default to asynchronous mode when 'sync' parameter is missing

# File Storage Configuration
WHISPER_TMP_FILES = "/data/whisper/tmp"
WHISPER_HF_TOKEN_FILE = "/data/secrets/hf_token.txt"

# WhisperX Configuration
WHISPER_CMD = "/opt/whisper/whisperx.sh"
WHISPER_MODELS_DIR = "/opt/whisper/models"
WHISPER_OUTPUT_DIR = "/data/whisper/output"
WHISPER_OUTPUT_FORMAT = "srt"
WHISPER_DEVICE = "cuda"
WHISPER_DEVICE_INDEX = "0"

# Job Management Configuration
WHISPER_JOB_RETENTION_HOURS = 24
WHISPER_CLEANUP_INTERVAL_HOURS = 6

# Upload Configuration
MAX_FILE_SIZE = 1073741824  # 1GB

# Concurrency Configuration
ENABLE_CONCURRENCY = true
MAX_CONCURRENT_JOBS = 6

# Security Configuration
ENABLE_AUTHORIZATION = false

Test Commands

Submit Transcription (Asynchronous Mode - Default)

curl -X POST "http://localhost:8181/transcription" \
  -H "Authorization: Bearer your_token_here" \
  -F "language=fr" \
  -F "diarize=true" \
  -F "prompt=Meeting transcript:" \
  -F "hf_token=YOUR_HUGGINGFACE_TOKEN" \
  -F "response_format=txt" \
  -F "file=@/path/to/audio.wav"

Submit Transcription (Synchronous Mode)

curl -X POST "http://localhost:8181/transcription" \
  -H "Authorization: Bearer your_token_here" \
  -F "language=fr" \
  -F "diarize=true" \
  -F "prompt=Meeting transcript:" \
  -F "hf_token=YOUR_HUGGINGFACE_TOKEN" \
  -F "sync=true" \
  -F "file=@/path/to/audio.wav"

Check Status (with Queue Position)

curl -X GET "http://localhost:8181/transcription/YOUR_JOB_ID" \
  -H "Authorization: Bearer your_token_here"

Note:

The maximum file size is configurable using the MAX_FILE_SIZE setting (default: 512 MB).
Job processing can be configured for concurrent operation by setting ENABLE_CONCURRENCY=true and MAX_CONCURRENT_JOBS to the desired number of simultaneous jobs (default: 6).
When concurrency is enabled, the queue position reflects the "batch" in which a job will be processed rather than its exact position in the queue.
The number of HTTP workers is configurable using the HTTP_WORKER_NUMBER setting (0 = use number of CPU cores, >0 = use specified number up to the number of CPU cores).
The API supports both synchronous and asynchronous transcription modes:
- Asynchronous Mode (default): Returns immediately with a job ID and requires polling for status/results
- Synchronous Mode: Waits for transcription to complete and returns the result directly (use sync=true parameter)
- Default mode when sync parameter is missing is controlled by DEFAULT_SYNC_MODE setting (default: false = async)
- Timeout for synchronous requests is configurable with SYNC_REQUEST_TIMEOUT_SECONDS (default: 1800 seconds, 0 = no timeout)

Examples

Check Job Status and Queue Position

curl -X GET "http://localhost:8181/transcription/123e4567-e89b-12d3-a456-426614174000" \
  -H "Authorization: Bearer your_token_here"

Example response for a queued job:

{
  "status": "Queued",
  "queue_position": 3
}

Example response for a processing job:

{
  "status": "Processing"
}

Disable Speaker Diarization

curl -X POST "http://localhost:8181/transcription" \
  -H "Authorization: Bearer your_token_here" \
  -F "diarize=false" \
  -F "file=@/path/to/audio.wav"

Add Initial Prompt

curl -X POST "http://localhost:8181/transcription" \
  -H "Authorization: Bearer your_token_here" \
  -F "prompt=This is an interview between John and Sarah:" \
  -F "file=@/path/to/audio.wav"

Specify Device and Device Index

curl -X POST "http://localhost:8181/transcription" \
  -H "Authorization: Bearer your_token_here" \
  -F "device=cuda" \
  -F "device_index=0" \
  -F "file=@/path/to/audio.wav"

Use CPU Instead of GPU

curl -X POST "http://localhost:8181/transcription" \
  -H "Authorization: Bearer your_token_here" \
  -F "device=cpu" \
  -F "file=@/path/to/audio.wav"

Use Diarization with Hugging Face Token

curl -X POST "http://localhost:8181/transcription" \
  -H "Authorization: Bearer your_token_here" \
  -F "diarize=true" \
  -F "hf_token=YOUR_HUGGINGFACE_TOKEN" \
  -F "file=@/path/to/audio.wav"

Note: If you don't provide the hf_token parameter, the system will attempt to read the token from the file specified by WHISPER_HF_TOKEN_FILE (default: /home/llm/whisper_api/hf_token.txt). If no token is available (file missing, empty, or unreadable) and diarization is requested, diarization will be automatically disabled with a warning in the logs. This ensures transcription jobs can proceed even without a token, but without speaker identification.
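
The fallback described in this note can be sketched as a small decision function. The name resolve_hf_token and its return shape are assumptions made for illustration; the real logic is part of the Rust service and also logs a warning when diarization is disabled.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_hf_token(
    request_token: Optional[str], token_file: str, diarize: bool
) -> Tuple[Optional[str], bool]:
    """Return (token, diarization_enabled) following the documented fallback."""
    if not diarize:
        return None, False
    # 1. A token passed with the request takes precedence.
    if request_token and request_token.strip():
        return request_token.strip(), True
    # 2. Otherwise try the token file; missing or unreadable counts as empty.
    try:
        file_token = Path(token_file).read_text(encoding="utf-8").strip()
    except (OSError, UnicodeDecodeError):
        file_token = ""
    if file_token:
        return file_token, True
    # 3. No token available: transcription proceeds without diarization.
    return None, False
```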

Specify Output Format

curl -X POST "http://localhost:8181/transcription" \
  -H "Authorization: Bearer your_token_here" \
  -F "response_format=srt" \
  -F "file=@/path/to/audio.wav"

File Structure

The API is organized into modular components:

handlers/: HTTP request handlers and form processing
models.rs: Data structures for requests and responses
file_utils.rs: File operations and resource management
queue_manager.rs: Job queue and transcription processing with secure file cleanup
config.rs: Application configuration management
config_loader.rs: Configuration loading from TOML file and environment variables
error.rs: Error handling and HTTP responses
whisperx.sh: Wrapper script for running WhisperX in its virtual environment

The whisperx.sh script is responsible for activating the Python virtual environment, running the WhisperX command with the provided arguments, and then deactivating the environment. This ensures proper execution of WhisperX without requiring the API to manage Python environments directly.

The path to this script can be configured via the WHISPER_CMD setting in the configuration file or environment variable.

API and Processing Models

The Whisper API supports multiple modes of operation:

Request Processing Modes

Asynchronous Mode (default)
- Returns immediately with a job ID
- Client must poll for status and retrieve results separately
- Doesn't block HTTP connections during processing
- Ideal for long transcriptions or when immediate results aren't needed
Synchronous Mode
- Waits for the transcription to complete before responding
- Returns transcription results directly in the response
- Simpler client implementation (single HTTP request)
- Configurable timeout with SYNC_REQUEST_TIMEOUT_SECONDS
- Use by adding sync=true parameter to /transcription requests

Job Processing Models

The Whisper API supports two processing models that can be configured via the ENABLE_CONCURRENCY setting:

Sequential Processing (default, ENABLE_CONCURRENCY=false):
- Jobs are processed one at a time in FIFO order
- Prevents GPU memory contention by ensuring only one transcription runs at a time
- Simplifies resource management and provides predictable processing
- Queue position directly indicates how many jobs are ahead in the queue
- Recommended for systems with limited GPU memory
Concurrent Processing (ENABLE_CONCURRENCY=true):
- Multiple jobs can be processed simultaneously
- The number of concurrent jobs is controlled by the MAX_CONCURRENT_JOBS setting (default: 6)
- Increases throughput when sufficient GPU memory is available
- Queue positions are reported in "batches" based on concurrency level
- Optimal for systems with multiple GPUs or high-memory GPUs
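
One plausible reading of "batch" queue positions is sketched below. This README does not state the exact formula, so the ceiling division used here is an assumption, not the service's confirmed behavior; batch_position is a hypothetical name.

```python
import math


def batch_position(queue_index: int, max_concurrent_jobs: int = 6) -> int:
    """Map a 1-based queue index to the batch in which the job would start.

    With max_concurrent_jobs=1 this reduces to the sequential case, where
    the batch number equals the queue index.
    """
    if queue_index < 1 or max_concurrent_jobs < 1:
        raise ValueError("queue_index and max_concurrent_jobs must be >= 1")
    return math.ceil(queue_index / max_concurrent_jobs)
```

Under this assumption and the default of 6 concurrent jobs, the first six queued jobs share batch 1 and the seventh falls into batch 2.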

Security and Privacy

The Whisper API implements several security and privacy measures:

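Authentication: By default, API endpoints require a Bearer token in the Authorization header. OPTIONS /transcription is always accessible for CORS preflight, and the built-in token check is a placeholder until validate_token is customized.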
Comprehensive File Cleanup:
- Temporary files in the job directory are removed after delivering results
- Output files in the WhisperX output directory are also removed
- Files are deleted immediately after successful result delivery
- Automatic cleanup of expired jobs runs periodically
File Isolation:
- Each job gets a unique UUID-based directory
- Audio files and transcription results are isolated
- File paths are not exposed to clients
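
The periodic cleanup of expired jobs can be illustrated by the selection step alone. This sketch assumes each job records a creation time; the names expired_job_ids and created_at are hypothetical, and the actual service also deletes the job's temporary and output files.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def expired_job_ids(
    created_at: Dict[str, datetime], now: datetime, retention_hours: int = 48
) -> List[str]:
    """Return the IDs of jobs older than the retention period, sorted by ID."""
    cutoff = now - timedelta(hours=retention_hours)
    return sorted(job_id for job_id, created in created_at.items() if created < cutoff)
```

A cleanup task would call this once per WHISPER_CLEANUP_INTERVAL_HOURS and remove the UUID-based directory of every job it returns.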

Resources

WhisperX: https://github.com/m-bain/whisperX
WhisperX Documentation
