This is a tool to manage, synchronize and otherwise work with the Blackbird music dataset.
- Flexible dataset structure with optional components (various file types per track)
- WebDAV-based synchronization with selective component sync
- Schema management for dataset structure
- Statistics and analysis tools
- Command-line interface for common operations
- Performance optimizations:
  - Parallel downloads with multi-threading
  - HTTP/2 support for faster connections
  - Connection pooling for reduced overhead
  - Detailed performance profiling
- Planned: resume a download that stopped abruptly, and cover that with tests. Files that are already downloaded should already be skipped, but this behavior has not yet been tested or verified.
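The skip-if-already-downloaded behavior mentioned in the last item can be illustrated with a minimal sketch. It is not the tool's actual implementation: the function name `files_to_download`, the `remote_sizes` mapping, and the choice of comparing file sizes are all assumptions made for illustration.

```python
from pathlib import Path


def files_to_download(remote_sizes: dict[str, int], local_root: Path) -> list[str]:
    """Return the relative paths that still need downloading.

    Hypothetical helper: a file is skipped only when a local copy exists
    with the same size as the remote one. A missing or size-mismatched
    (for example, truncated) file is scheduled again.
    """
    pending = []
    for rel_path, remote_size in sorted(remote_sizes.items()):
        local_file = local_root / rel_path
        if local_file.is_file() and local_file.stat().st_size == remote_size:
            continue  # already complete locally
        pending.append(rel_path)
    return pending
```

Comparing sizes catches files that were cut off mid-transfer, which a simple existence check would miss. A checksum comparison would be stricter, at the cost of reading every local file.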
There are two ways to install the Blackbird Dataset Manager:
# Install from the repository root
pip install -e .
# Install dependencies first
pip install -r requirements.txt
# Then install the package
pip install -e .
After installation, you can use the blackbird command-line tool:
# Show help
blackbird --help
# Clone a dataset (downloads from remote to a new local dataset)
blackbird clone webdav://server/dataset /path/to/local --components vocals,instrumental --parallel 4
# Sync a dataset (updates existing local dataset)
blackbird sync webdav://server/dataset /path/to/local --components vocals,instrumental --parallel 4
# Clone only for tracks missing a specific component
blackbird clone webdav://server/dataset /path/to/local --components vocals,instrumental --missing mir
# This will only download vocals and instrumental files for tracks that don't have mir files
# Show dataset statistics
blackbird stats /path/to/dataset
# Find tracks
blackbird find-tracks /path/to/dataset --missing vocals
# Show schema
blackbird schema show /path/to/dataset
# Add new component
blackbird schema add /path/to/dataset component_name "*_pattern.mp3"
# Rebuild index
blackbird reindex /path/to/dataset
The Blackbird Dataset Manager includes several performance optimizations for faster downloads:
Use the --parallel option to enable multi-threaded downloads:
# Download with 4 parallel threads
blackbird sync webdav://server/dataset /path/to/local --parallel 4
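As a rough illustration of what multi-threaded downloading involves, here is a sketch built on the standard library's thread pool. It assumes a caller-supplied `download_one` callable, and the names `download_all` and `download_one` are hypothetical. It does not reflect how blackbird implements --parallel internally.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def download_all(paths: Iterable[str],
                 download_one: Callable[[str], bool],
                 parallel: int = 4) -> dict[str, bool]:
    """Run download_one for every path on up to `parallel` worker threads.

    download_one is a placeholder for whatever fetches a single file and
    returns True on success.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # pool.map preserves input order, so results line up with paths.
        results = list(pool.map(download_one, paths))
    return dict(zip(paths, results))
```

Threads suit this workload because downloads spend most of their time waiting on the network rather than using the CPU.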
Enable HTTP/2 for faster connections with the --http2 flag:
# Use HTTP/2 protocol (requires httpx package)
blackbird sync webdav://server/dataset /path/to/local --http2
Adjust connection pool size with the --connection-pool option:
# Set connection pool size to 20
blackbird sync webdav://server/dataset /path/to/local --connection-pool 20
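To show why a connection pool reduces overhead, the following sketch reuses connections instead of opening a new one per request. It is a generic, standard-library-only illustration: the `ConnectionPool` class and its `factory` argument are hypothetical. In practice the HTTP client library usually provides pooling, and nothing here describes blackbird's own code.

```python
import queue
import threading
from contextlib import contextmanager


class ConnectionPool:
    """Hand out reusable connections, creating at most `size` of them."""

    def __init__(self, factory, size=20):
        self._factory = factory          # callable that opens one connection
        self._size = size
        self._created = 0
        self._lock = threading.Lock()
        self._idle = queue.LifoQueue()   # most recently used connection first

    def _acquire(self):
        try:
            return self._idle.get_nowait()   # reuse an idle connection
        except queue.Empty:
            pass
        with self._lock:
            if self._created < self._size:
                self._created += 1
                return self._factory()       # open a new one, within the limit
        return self._idle.get()              # limit reached: wait for a return

    @contextmanager
    def connection(self):
        conn = self._acquire()
        try:
            yield conn
        finally:
            self._idle.put(conn)             # return it for later reuse
```

A larger pool allows more simultaneous connections, so it generally makes sense to size it at or above the number of parallel download threads.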
Enable performance profiling to identify bottlenecks:
# Enable profiling
blackbird sync webdav://server/dataset /path/to/local --profile
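For profiling code that uses the Python API rather than the CLI, the standard library's cProfile can serve a similar purpose. This is a generic sketch; `profile_call` is a hypothetical helper, and the report format of the --profile flag may well differ.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the ten entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()
```

Sorting by cumulative time puts the call paths that dominate the run at the top, which is usually the quickest way to tell whether time goes to network waits, file I/O, or indexing.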
For best performance, combine all optimizations:
# Use all optimizations
blackbird sync webdav://server/dataset /path/to/local --parallel 4 --http2 --connection-pool 20 --profile
For development, install additional dependencies:
pip install -e ".[dev]"
This will install development tools like pytest, black, and mypy.
To run all tests for the Blackbird Dataset Manager:
# Run all tests
python -m pytest
# Run tests with verbose output
python -m pytest -v
# Run tests for a specific module
python -m pytest blackbird/tests/test_locations.py
# Run tests with coverage report
python -m pytest --cov=blackbird
from blackbird import Dataset
# Initialize dataset
dataset = Dataset("/path/to/dataset")
# Find tracks missing MIR data
missing_mir = dataset.find_tracks(missing=['mir'])
# Sync specific components
dataset.sync("webdav://server", components=['vocals', 'mir'])
# Sync with performance optimizations
dataset.sync("webdav://server",
             components=['vocals', 'mir'],
             parallel=4,
             use_http2=True,
             connection_pool_size=20)
# Initialize new dataset
blackbird schema discover /path/to/dataset
blackbird reindex /path/to/dataset
# Show dataset status
blackbird status
# Add new component type
blackbird schema add /path/to/dataset lyrics "*.lyrics.txt"
# Sync specific components
blackbird sync webdav://server/dataset /path/to/dataset --components vocals,mir
The dataset follows a hierarchical structure:
dataset_root/
├── Artist1/
│   ├── Album1/
│   │   ├── track1_instrumental.mp3
│   │   └── ...
│   └── Album2/
│       ├── CD1/
│       │   ├── track1_instrumental.mp3
│       │   └── ...
│       └── CD2/
│           ├── track1_instrumental.mp3
│           └── ...
└── ...
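The layout above can be walked with a short script to see which components each track has. This is a sketch under stated assumptions: the `_instrumental.mp3` suffix comes from the tree above, while the `_vocals.mp3` suffix, the `COMPONENT_SUFFIXES` table, and the functions `index_tracks` and `find_missing` are hypothetical. A real dataset's schema defines the actual patterns.

```python
from collections import defaultdict
from pathlib import Path

# Assumed suffixes for illustration only; the dataset schema is authoritative.
COMPONENT_SUFFIXES = {
    "instrumental": "_instrumental.mp3",
    "vocals": "_vocals.mp3",
}


def index_tracks(root: Path) -> dict[str, set[str]]:
    """Map each track (relative path minus component suffix) to its components."""
    tracks = defaultdict(set)
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        for component, suffix in COMPONENT_SUFFIXES.items():
            if path.name.endswith(suffix):
                base = path.name[: -len(suffix)]
                track_id = (path.parent / base).relative_to(root).as_posix()
                tracks[track_id].add(component)
    return dict(tracks)


def find_missing(tracks: dict[str, set[str]], component: str) -> list[str]:
    """Return the tracks that lack the given component, sorted by path."""
    return sorted(t for t, comps in tracks.items() if component not in comps)
```

Because the walk is recursive, albums with and without CD subdirectories are handled the same way, and the relative path keeps tracks with the same name in different albums distinct.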
The repository includes a test script to compare different optimization strategies:
# Run performance tests
python test_optimized_sync.py webdav://server/dataset
This will generate a performance comparison chart showing the impact of different optimizations.
MIT License