Blackbird Dataset Manager

This is a tool to manage, synchronize and otherwise work with the Blackbird music dataset.

Features

Flexible dataset structure with optional components (various file types per track)
WebDAV-based synchronization with selective component sync
Schema management for dataset structure
Statistics and analysis tools
Command-line interface for common operations
Performance optimizations:
- Parallel downloads with multi-threading
- HTTP/2 support for faster connections
- Connection pooling for reduced overhead
- Detailed performance profiling

TODO

Resume a download that stopped abruptly, and cover that with tests. The sync should already skip files that have been downloaded, but this behavior has not been verified or covered by tests.
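
The skip-on-resume behavior described in this TODO can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper name files_to_download and the idea of comparing file sizes are assumptions made for the example.

```python
from pathlib import Path


def files_to_download(remote_files, local_root):
    """Return relative paths that are missing locally or differ in size.

    remote_files maps a relative path to its expected size in bytes.
    Both the function and its input format are hypothetical.
    """
    local_root = Path(local_root)
    pending = []
    for rel_path, expected_size in sorted(remote_files.items()):
        local_file = local_root / rel_path
        # A partial file left by an interrupted download has the wrong size,
        # so it is queued again instead of being skipped.
        if not local_file.is_file() or local_file.stat().st_size != expected_size:
            pending.append(rel_path)
    return pending
```

A test for the resume case could create one complete file, one truncated file, and leave a third absent, then check that only the last two are queued.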

Installation

There are two ways to install the Blackbird Dataset Manager:

1. Using pip (recommended)

# Install from the repository root
pip install -e .

2. Using requirements.txt

# Install dependencies first
pip install -r requirements.txt

# Then install the package
pip install -e .

Usage

After installation, you can use the blackbird command-line tool:

# Show help
blackbird --help

# Clone a dataset (downloads from remote to a new local dataset)
blackbird clone webdav://server/dataset /path/to/local --components vocals,instrumental --parallel 4

# Sync a dataset (updates existing local dataset)
blackbird sync webdav://server/dataset /path/to/local --components vocals,instrumental --parallel 4

# Clone only for tracks missing a specific component
blackbird clone webdav://server/dataset /path/to/local --components vocals,instrumental --missing mir
# This will only download vocals and instrumental files for tracks that don't have mir files

# Show dataset statistics
blackbird stats /path/to/dataset

# Find tracks
blackbird find-tracks /path/to/dataset --missing vocals

# Show schema
blackbird schema show /path/to/dataset

# Add new component
blackbird schema add /path/to/dataset component_name "*_pattern.mp3"

# Rebuild index
blackbird reindex /path/to/dataset
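
The pattern argument of schema add above looks like a shell-style glob. As a rough sketch of how such patterns could map files to components, assuming plain glob semantics (the schema dictionary, the vocals pattern, and the helper name are hypothetical, not taken from the tool):

```python
from fnmatch import fnmatchcase

# Hypothetical schema: component name -> glob pattern.
SCHEMA = {
    "instrumental": "*_instrumental.mp3",
    "vocals": "*_vocals.mp3",
}


def component_of(filename, schema=SCHEMA):
    """Return the first component whose pattern matches, or None."""
    for name, pattern in schema.items():
        # fnmatchcase keeps matching case-sensitive on every platform.
        if fnmatchcase(filename, pattern):
            return name
    return None
```

Files that match no pattern, such as cover art, simply fall outside the schema in this sketch.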

Performance Optimizations

The Blackbird Dataset Manager includes several performance optimizations for faster downloads:

Parallel Downloads

Use the --parallel option to enable multi-threaded downloads:

# Download with 4 parallel threads
blackbird sync webdav://server/dataset /path/to/local --parallel 4
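
Multi-threaded downloading of this kind is commonly built on a thread pool. The sketch below shows the general technique with the standard library; download_all and the download_one callback are hypothetical names, and the tool's actual internals may differ.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(paths, download_one, parallel=4):
    """Run download_one(path) for each path on a pool of worker threads.

    Returns (succeeded, failed), where failed maps path -> exception.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = {pool.submit(download_one, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()  # re-raises any error from the worker
                succeeded.append(path)
            except Exception as exc:
                # One failed file should not abort the whole batch.
                failed[path] = exc
    return succeeded, failed
```

Threads suit this workload because downloads spend most of their time waiting on the network rather than using the CPU.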

HTTP/2 Support

Enable HTTP/2 for faster connections with the --http2 flag:

# Use HTTP/2 protocol (requires httpx package)
blackbird sync webdav://server/dataset /path/to/local --http2

Connection Pooling

Adjust connection pool size with the --connection-pool option:

# Set connection pool size to 20
blackbird sync webdav://server/dataset /path/to/local --connection-pool 20
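
Since the HTTP/2 option requires httpx, the two settings above could plausibly be passed to an httpx client as sketched here. The helper names are hypothetical and this is not necessarily how the tool wires them up; httpx.Client, its http2 argument, and httpx.Limits are part of the httpx API.

```python
def client_options(use_http2=True, connection_pool_size=20):
    """Collect the settings as plain data so they can be inspected or logged."""
    return {
        "http2": use_http2,
        "max_connections": connection_pool_size,
        "max_keepalive_connections": connection_pool_size,
    }


def make_client(options):
    """Build an httpx.Client from the options above."""
    # Imported lazily so the rest of the module works without httpx installed.
    # HTTP/2 also needs the optional extra: pip install "httpx[http2]"
    import httpx

    limits = httpx.Limits(
        max_connections=options["max_connections"],
        max_keepalive_connections=options["max_keepalive_connections"],
    )
    return httpx.Client(http2=options["http2"], limits=limits)
```

Reusing one client for all requests is what lets pooled connections be shared between downloads.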

Performance Profiling

Enable performance profiling to identify bottlenecks:

# Enable profiling
blackbird sync webdav://server/dataset /path/to/local --profile
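
What the profiler reports is not specified here. One simple way to find bottlenecks is to accumulate wall-clock time per stage, as in this sketch; the StageTimer class and the stage names are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate elapsed time and call counts per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failures are still timed.
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Return (stage, total_seconds) pairs, slowest stage first."""
        return sorted(self.totals.items(), key=lambda item: item[1], reverse=True)
```

Wrapping steps such as listing remote files and writing to disk in timer.stage(...) blocks would show which one dominates a sync.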

Combining Optimizations

For best performance, combine the download optimizations; --profile can be added to measure their effect:

# Use all optimizations and profile the run
blackbird sync webdav://server/dataset /path/to/local --parallel 4 --http2 --connection-pool 20 --profile

Development

For development, install additional dependencies:

pip install -e ".[dev]"

This will install development tools like pytest, black, and mypy.

Running Tests

To run all tests for the Blackbird Dataset Manager:

# Run all tests
python -m pytest

# Run tests with verbose output
python -m pytest -v

# Run tests for a specific module
python -m pytest blackbird/tests/test_locations.py

# Run tests with coverage report
python -m pytest --cov=blackbird

Quick Start

from blackbird import Dataset

# Initialize dataset
dataset = Dataset("/path/to/dataset")

# Find tracks missing MIR data
missing_mir = dataset.find_tracks(missing=['mir'])

# Sync specific components
dataset.sync("webdav://server", components=['vocals', 'mir'])

# Sync with performance optimizations
dataset.sync(
    "webdav://server",
    components=['vocals', 'mir'],
    parallel=4,
    use_http2=True,
    connection_pool_size=20,
)
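
To make the missing filter concrete, here is a small stand-alone sketch of the selection logic. It assumes an index that maps each track to the set of components it has, and that a track qualifies when it has none of the listed missing components; the function name and index layout are hypothetical, not the library's API.

```python
def tracks_missing(index, missing=(), has=()):
    """Return track ids lacking every component in `missing`
    and having every component in `has`.

    index maps a track id to the set of component names present.
    """
    missing, has = set(missing), set(has)
    return sorted(
        track
        for track, components in index.items()
        if not (components & missing) and has <= components
    )
```

With an index such as {"t1": {"vocals"}, "t2": {"vocals", "mir"}}, calling tracks_missing(index, missing=["mir"]) selects only "t1".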

Command Line Usage

# Initialize new dataset
blackbird schema discover /path/to/dataset
blackbird reindex /path/to/dataset

# Show dataset status
blackbird status

# Add new component type
blackbird schema add /path/to/dataset lyrics "*.lyrics.txt"

# Sync specific components
blackbird sync --components=vocals,mir webdav://server

Dataset Structure

The dataset follows a hierarchical structure:

dataset_root/
├── Artist1/
│   ├── Album1/
│   │   ├── track1_instrumental.mp3
│   │   └── ...
│   └── Album2/
│       ├── CD1/
│       │   ├── track1_instrumental.mp3
│       │   └── ...
│       └── CD2/
│           ├── track1_instrumental.mp3
│           └── ...
└── ...
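
The layout above allows track files either directly inside an album folder or one level deeper inside a CD folder. A minimal sketch of walking such a tree, assuming exactly those two depths (the function name and default pattern are illustrative only):

```python
from pathlib import Path


def group_tracks(dataset_root, pattern="*_instrumental.mp3"):
    """Map (artist, album, cd) to the file names found there.

    cd is None for albums without CD subfolders.
    """
    root = Path(dataset_root)
    groups = {}
    for path in sorted(root.rglob(pattern)):
        parts = path.relative_to(root).parts
        if len(parts) == 3:        # Artist/Album/file
            key = (parts[0], parts[1], None)
        elif len(parts) == 4:      # Artist/Album/CD/file
            key = (parts[0], parts[1], parts[2])
        else:
            continue               # outside the layout shown above
        groups.setdefault(key, []).append(path.name)
    return groups
```

Files at any other depth are ignored in this sketch rather than guessed at.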

Performance Testing

The repository includes a test script to compare different optimization strategies:

# Run performance tests
python test_optimized_sync.py webdav://server/dataset

This will generate a performance comparison chart showing the impact of different optimizations.

License

MIT License

Kiberchaika/The_Blackbird_Dataset

Folders and files

Latest commit

History

Repository files navigation

Blackbird Dataset Manager

Features

TODO

Installation

1. Using pip (recommended)

2. Using requirements.txt

Usage

Performance Optimizations

Parallel Downloads

HTTP/2 Support

Connection Pooling

Performance Profiling

Combining Optimizations

Development

Running Tests

Quick Start

Command Line Usage

Dataset Structure

Performance Testing

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages