Using state-of-the-art time series analysis neural networks for musical instrument recognition!

Powered by PyOmniTS, the unified framework for time series analysis.
> [!IMPORTANT]
> Accuracy is not guaranteed (and I'm not an expert in music)! Refer to the benchmark section for model performance details.
Models are currently trained on the OpenMIC-2018 dataset, which includes 20 types of "instruments" (a quick loading sketch follows the list):
- Accordion [wiki]
- Banjo [wiki]
- Bass [wiki]
- Cello [wiki]
- Clarinet [wiki]
- Cymbals [wiki]
- Drums [wiki]
- Flute [wiki]
- Guitar [wiki]
- Mallet Percussion [wiki]
- Mandolin [wiki]
- Organ [wiki]
- Piano [wiki]
- Saxophone [wiki]
- Synthesizer [wiki]
- Trombone [wiki]
- Trumpet [wiki]
- Ukulele [wiki]
- Violin [wiki]
- Voice [wiki]
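For reference, the dataset's label structure can be inspected with a short script. This is only a hedged sketch: it assumes the official `openmic-2018.npz` layout (arrays `X`, `Y_true`, `Y_mask`, `sample_key`) and that the archive has been extracted to the path used later in this README.

```python
# Hedged sketch: peek at the OpenMIC-2018 label structure.
# Assumes the official openmic-2018.npz layout; adjust the path to your extraction location.
import numpy as np

data = np.load("backend/storage/datasets/OpenMIC/openmic-2018.npz", allow_pickle=True)

X = data["X"]            # VGGish features shipped with the dataset: (n_clips, 10, 128)
Y_true = data["Y_true"]  # soft labels in [0, 1], one column per instrument: (n_clips, 20)
Y_mask = data["Y_mask"]  # True where an instrument was actually annotated for a clip
sample_key = data["sample_key"]

print(X.shape, Y_true.shape, Y_mask.shape)
print(sample_key[:3])
```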
- Clone this repository and its submodules, then check out the `InsRec` branch of the backend submodule:

  ```shell
  git clone --recurse-submodules https://github.com/Ladbaby/InsRec
  cd InsRec/backend
  git checkout InsRec
  cd ..
  ```
- Create a Python virtual environment via the tool of your choice, for example using Miniconda/Anaconda:

  ```shell
  conda create -n InsRec python=3.12
  conda activate InsRec
  ```

  Python 3.11 and 3.12 have been tested; other versions may also work.
- Install dependencies in the created environment:

  ```shell
  pip install -r backend/requirements.txt
  pip install -r requirements.txt
  ```

  Some models may require extra dependencies, which can be found in the comments of `backend/requirements.txt`.
The web UI is launched via:

```shell
streamlit run main.py
```

or by running `sh main.sh`.

During the first run, it will ask you in the terminal whether to download checkpoint files for the models.
Neural network training is powered by the PyOmniTS framework. The training procedure for the existing models on the OpenMIC-2018 dataset is detailed here.
- Download the dataset from here, and place the extracted result under `backend/storage/datasets/OpenMIC`. Create the parent folder if it does not exist.
- Download the processed VGGish representations of the corresponding audios from huggingface, and place them under `backend/storage/datasets/OpenMIC/processed`. Note that these VGGish representations are different from the "X" stored in `backend/storage/datasets/OpenMIC/openmic-2018.npz`: ours are obtained using the pretrained PyTorch VGGish pipeline and the PCA weights from torchvggish (a minimal extraction sketch follows this list).
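The extraction step roughly looks like the following. This is a hedged sketch rather than the repository's actual preprocessing script: the audio path is a placeholder, and it relies on the publicly documented `torch.hub` entry point of torchvggish (which applies the PCA postprocessing by default).

```python
# Hedged sketch of extracting VGGish embeddings with torchvggish
# (not the repository's exact preprocessing code; the file path is a placeholder).
import torch

# Load the pretrained VGGish model from torch.hub; PCA postprocessing is enabled by default.
model = torch.hub.load("harritaylor/torchvggish", "vggish")
model.eval()

with torch.no_grad():
    # Returns one 128-dimensional embedding per ~0.96 s frame of the clip.
    embeddings = model.forward("path/to/clip.wav")

print(embeddings.shape)  # (num_frames, 128)
```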
You may find the experimental settings (e.g., learning rate, d_model) for the chosen model in its script under `backend/scripts/CHOSEN_MODEL/OpenMIC.sh`.
Start training by:

```shell
cd backend
sh scripts/CHOSEN_MODEL/OpenMIC.sh
```
Model weights (`pytorch_model.bin`) will be found under `backend/storage/results`.

To infer using your trained weights instead, replace the `pytorch_model.bin` file under the `backend/storage/pretrained/OpenMIC/CHOSEN_MODEL` folder with your own.
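If you want to sanity-check a checkpoint before swapping it in, something like the following works. This is a hedged sketch that assumes `pytorch_model.bin` stores a plain PyTorch state dict; the results path is a placeholder for your own run's output folder.

```python
# Hedged sketch: inspect a trained checkpoint before replacing the pretrained one.
# Assumes pytorch_model.bin is a plain state dict; the path below is a placeholder.
import torch

state_dict = torch.load("backend/storage/results/YOUR_RUN/pytorch_model.bin", map_location="cpu")

print(f"{len(state_dict)} tensors in checkpoint")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```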
Test set performance on the OpenMIC-2018 dataset (all metrics in %):

Model | Accuracy | Precision | Recall | F1 |
---|---|---|---|---|
Pyraformer | 67.86 | 64.72 | 65.01 | 64.30 |
Reformer | 67.66 | 64.60 | 64.74 | 63.94 |
Informer | 67.40 | 64.23 | 64.06 | 63.38 |
Nonstationary Transformer | 66.88 | 64.42 | 64.09 | 63.46 |
iTransformer | 66.04 | 63.74 | 60.87 | 61.53 |
Hi-Patch | 65.90 | 63.72 | 60.76 | 61.12 |
GRU-D | 65.83 | 63.30 | 62.34 | 61.95 |
TSMixer | 65.47 | 62.27 | 60.80 | 60.82 |
LightTS | 65.26 | 62.42 | 60.32 | 60.68 |
Mamba | 65.05 | 61.89 | 61.52 | 60.95 |
Raindrop | 64.95 | 61.81 | 62.17 | 61.13 |
Transformer | 64.53 | 61.43 | 64.00 | 61.08 |
FEDformer | 64.48 | 60.58 | 59.96 | 59.61 |
FreTS | 64.48 | 62.12 | 59.30 | 59.99 |
DLinear | 64.22 | 62.04 | 59.04 | 59.64 |
TimesNet | 64.17 | 61.19 | 60.62 | 60.03 |
Linear | 64.17 | 63.07 | 58.41 | 59.60 |
Leddam | 63.23 | 62.53 | 59.66 | 59.38 |
mTAN | 60.89 | 53.87 | 44.75 | 46.73 |
SegRNN | 58.70 | 60.02 | 50.77 | 53.23 |
Autoformer | 54.43 | 52.15 | 50.25 | 50.56 |
PatchTST | 45.63 | 43.01 | 41.91 | 41.27 |
TiDE | 36.35 | 34.18 | 27.93 | 29.38 |
MICN | 36.67 | 34.91 | 29.31 | 30.01 |
Crossformer | 21.72 | 1.09 | 5.00 | 1.78 |
FiLM | 21.72 | 1.09 | 5.00 | 1.78 |
Existing state-of-the-art time series models mainly learn in the time domain, while audio processing models primarily learn in the frequency domain. In addition, raw audio (e.g., 16k samples per second) is far longer than any time series in research datasets (e.g., 720 steps). Therefore, VGGish is currently used as an encoder to convert the audio input into embeddings, and the time series models take these embeddings as input instead (it makes little sense, I know, but this is possibly the only way to adapt them painlessly).

Further improvements may require changing the network architecture of the time series models, so that the VGGish embeddings are treated as representations rather than as time series.
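To make the data flow concrete, here is a minimal sketch of that adaptation (not the actual PyOmniTS model code; the module and its hyperparameters are made up for illustration): the `(num_frames, 128)` VGGish embedding sequence of a clip is fed to a sequence backbone as if it were a 128-variable time series, and the pooled representation is mapped to 20 instrument logits.

```python
# Minimal sketch of the adaptation described above (illustrative only, not PyOmniTS code):
# treat the (num_frames, 128) VGGish embedding sequence as a 128-variable time series
# and pool the backbone output into 20 multi-label instrument logits.
import torch
import torch.nn as nn

class EmbeddingSequenceClassifier(nn.Module):
    def __init__(self, n_features: int = 128, d_model: int = 64, n_instruments: int = 20):
        super().__init__()
        # Stand-in for an arbitrary time series backbone (Transformer, Mamba, ...).
        self.backbone = nn.GRU(n_features, d_model, batch_first=True)
        self.head = nn.Linear(d_model, n_instruments)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, 128) VGGish embeddings
        hidden, _ = self.backbone(x)       # (batch, num_frames, d_model)
        pooled = hidden.mean(dim=1)        # average over the "time" axis
        return self.head(pooled)           # one logit per instrument

model = EmbeddingSequenceClassifier()
dummy = torch.randn(4, 10, 128)            # 4 clips x 10 frames x 128 dims
print(model(dummy).shape)                  # torch.Size([4, 20])
```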