Using state-of-the-art time series analysis neural networks for musical instrument recognition!

Powered by PyOmniTS, the unified framework for time series analysis.
> [!IMPORTANT]
> Accuracy is not guaranteed (and I'm not an expert in music)! Refer to the benchmark section for model performance details.
Models are currently trained on the OpenMIC-2018 dataset, which includes 20 types of "instruments" (a quick loading sketch follows the list):
- Accordion [wiki]
- Banjo [wiki]
- Bass [wiki]
- Cello [wiki]
- Clarinet [wiki]
- Cymbals [wiki]
- Drums [wiki]
- Flute [wiki]
- Guitar [wiki]
- Mallet Percussion [wiki]
- Mandolin [wiki]
- Organ [wiki]
- Piano [wiki]
- Saxophone [wiki]
- Synthesizer [wiki]
- Trombone [wiki]
- Trumpet [wiki]
- Ukulele [wiki]
- Violin [wiki]
- Voice [wiki]
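For reference, the dataset's label structure can be inspected with a short script. This is only a hedged sketch: it assumes the official `openmic-2018.npz` layout (arrays `X`, `Y_true`, `Y_mask`, `sample_key`) and that the archive has been extracted to the path used later in this README.

```python
# Hedged sketch: peek at the OpenMIC-2018 label structure.
# Assumes the official openmic-2018.npz layout; adjust the path to your extraction location.
import numpy as np

data = np.load("backend/storage/datasets/OpenMIC/openmic-2018.npz", allow_pickle=True)

X = data["X"]            # VGGish features shipped with the dataset: (n_clips, 10, 128)
Y_true = data["Y_true"]  # soft labels in [0, 1], one column per instrument: (n_clips, 20)
Y_mask = data["Y_mask"]  # True where an instrument was actually annotated for a clip
sample_key = data["sample_key"]

print(X.shape, Y_true.shape, Y_mask.shape)
print(sample_key[:3])
```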
- Clone this repository and its submodules, then check out the `InsRec` branch of the backend submodule:

  ```shell
  git clone --recurse-submodules https://github.com/Ladbaby/InsRec
  cd InsRec/backend
  git checkout InsRec
  cd ..
  ```
- Create a Python virtual environment via the tool of your choice, for example using Miniconda/Anaconda:

  ```shell
  conda create -n InsRec python=3.12
  conda activate InsRec
  ```

  Python 3.11 and 3.12 have been tested; other versions may also work.
- Install dependencies in the created environment:

  ```shell
  pip install -r backend/requirements.txt
  pip install -r requirements.txt
  ```

  Some models may require extra dependencies, which can be found in the comments of `backend/requirements.txt`.
The web UI is launched via:

```shell
streamlit run main.py
```

or by running `sh main.sh`.

During the first run, it will ask you in the terminal whether to download checkpoint files for the models.
Neural network training is powered by the PyOmniTS framework. The training procedure for the existing models on the OpenMIC-2018 dataset is detailed here.
- Download the dataset from here, and place the extracted result under `backend/storage/datasets/OpenMIC`. Create the parent folder if it does not exist.
- Download the processed VGGish representations of the corresponding audios from huggingface, and place them under `backend/storage/datasets/OpenMIC/processed`. Note that these VGGish representations are different from the "X" stored in `backend/storage/datasets/OpenMIC/openmic-2018.npz`: ours are obtained using the pretrained PyTorch VGGish pipeline and the PCA weights from torchvggish (a minimal extraction sketch follows this list).
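The extraction step roughly looks like the following. This is a hedged sketch rather than the repository's actual preprocessing script: the audio path is a placeholder, and it relies on the publicly documented `torch.hub` entry point of torchvggish (which applies the PCA postprocessing by default).

```python
# Hedged sketch of extracting VGGish embeddings with torchvggish
# (not the repository's exact preprocessing code; the file path is a placeholder).
import torch

# Load the pretrained VGGish model from torch.hub; PCA postprocessing is enabled by default.
model = torch.hub.load("harritaylor/torchvggish", "vggish")
model.eval()

with torch.no_grad():
    # Returns one 128-dimensional embedding per ~0.96 s frame of the clip.
    embeddings = model.forward("path/to/clip.wav")

print(embeddings.shape)  # (num_frames, 128)
```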
You may find the experimental settings (e.g., learning rate, d_model) for the chosen model in its script under `backend/scripts/CHOSEN_MODEL/OpenMIC.sh`.
Start training by:

```shell
cd backend
sh scripts/CHOSEN_MODEL/OpenMIC.sh
```
Model weights (`pytorch_model.bin`) will be found under `backend/storage/results`.

To infer using your trained weights instead, replace the `pytorch_model.bin` file under the `backend/storage/pretrained/OpenMIC/CHOSEN_MODEL` folder with your own.
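If you want to sanity-check a checkpoint before swapping it in, something like the following works. This is a hedged sketch that assumes `pytorch_model.bin` stores a plain PyTorch state dict; the results path is a placeholder for your own run's output folder.

```python
# Hedged sketch: inspect a trained checkpoint before replacing the pretrained one.
# Assumes pytorch_model.bin is a plain state dict; the path below is a placeholder.
import torch

state_dict = torch.load("backend/storage/results/YOUR_RUN/pytorch_model.bin", map_location="cpu")

print(f"{len(state_dict)} tensors in checkpoint")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```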
Test set performance on the OpenMIC-2018 dataset (all metrics in %):

Model | Accuracy | Precision | Recall | F1 |
---|---|---|---|---|
Pyraformer | 67.86 | 64.72 | 65.01 | 64.30 |
Reformer | 67.66 | 64.60 | 64.74 | 63.94 |
Informer | 67.40 | 64.23 | 64.06 | 63.38 |
Nonstationary Transformer | 66.88 | 64.42 | 64.09 | 63.46 |
iTransformer | 66.04 | 63.74 | 60.87 | 61.53 |
Hi-Patch | 65.90 | 63.72 | 60.76 | 61.12 |
GRU-D | 65.83 | 63.30 | 62.34 | 61.95 |
TSMixer | 65.47 | 62.27 | 60.80 | 60.82 |
LightTS | 65.26 | 62.42 | 60.32 | 60.68 |
Mamba | 65.05 | 61.89 | 61.52 | 60.95 |
Raindrop | 64.95 | 61.81 | 62.17 | 61.13 |
Transformer | 64.53 | 61.43 | 64.00 | 61.08 |
FEDformer | 64.48 | 60.58 | 59.96 | 59.61 |
FreTS | 64.48 | 62.12 | 59.30 | 59.99 |
DLinear | 64.22 | 62.04 | 59.04 | 59.64 |
TimesNet | 64.17 | 61.19 | 60.62 | 60.03 |
Linear | 64.17 | 63.07 | 58.41 | 59.60 |
Leddam | 63.23 | 62.53 | 59.66 | 59.38 |
mTAN | 60.89 | 53.87 | 44.75 | 46.73 |
SegRNN | 58.70 | 60.02 | 50.77 | 53.23 |
Autoformer | 54.43 | 52.15 | 50.25 | 50.56 |
PatchTST | 45.63 | 43.01 | 41.91 | 41.27 |
TiDE | 36.35 | 34.18 | 27.93 | 29.38 |
MICN | 36.67 | 34.91 | 29.31 | 30.01 |
Crossformer | 21.72 | 1.09 | 5.00 | 1.78 |
FiLM | 21.72 | 1.09 | 5.00 | 1.78 |
Existing state-of-the-art time series models mainly learn in the time domain, while audio processing models primarily learn in the frequency domain. In addition, raw audio (e.g., 16k samples per second) is far longer than any time series in research datasets (e.g., 720 steps). Therefore, VGGish is currently used as an encoder to convert the audio input into embeddings, and the time series models take these embeddings as input instead (it makes little sense, I know, but this is possibly the only way to adapt them painlessly).

Further improvements may require changing the network architecture of the time series models, so that the VGGish embeddings are treated as representations rather than as time series.
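To make the data flow concrete, here is a minimal sketch of that adaptation (not the actual PyOmniTS model code; the module and its hyperparameters are made up for illustration): the `(num_frames, 128)` VGGish embedding sequence of a clip is fed to a sequence backbone as if it were a 128-variable time series, and the pooled representation is mapped to 20 instrument logits.

```python
# Minimal sketch of the adaptation described above (illustrative only, not PyOmniTS code):
# treat the (num_frames, 128) VGGish embedding sequence as a 128-variable time series
# and pool the backbone output into 20 multi-label instrument logits.
import torch
import torch.nn as nn

class EmbeddingSequenceClassifier(nn.Module):
    def __init__(self, n_features: int = 128, d_model: int = 64, n_instruments: int = 20):
        super().__init__()
        # Stand-in for an arbitrary time series backbone (Transformer, Mamba, ...).
        self.backbone = nn.GRU(n_features, d_model, batch_first=True)
        self.head = nn.Linear(d_model, n_instruments)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, 128) VGGish embeddings
        hidden, _ = self.backbone(x)       # (batch, num_frames, d_model)
        pooled = hidden.mean(dim=1)        # average over the "time" axis
        return self.head(pooled)           # one logit per instrument

model = EmbeddingSequenceClassifier()
dummy = torch.randn(4, 10, 128)            # 4 clips x 10 frames x 128 dims
print(model(dummy).shape)                  # torch.Size([4, 20])
```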