Add stage 5 & stage 6 #4649

A-Quarter-Mile · 2022-09-19T17:03:45Z

Hi @ftshijt, here are new updates on Muskit:

Add stage 5 & stage 6.
In stage 5, we take the pitch from the XML and align it according to lab_timeseq. In this version, data is collected for the case where the duration predictor is not used.
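
To illustrate the kind of alignment described for stage 5, here is a minimal sketch that gives each phoneme segment the pitch of the score note it overlaps most in time. It is not the recipe's actual code: the tuple layouts, the function name `assign_pitch_to_phones`, and the `rest_pitch` value are assumptions made for this example, and the real format of lab_timeseq may differ.

```python
from typing import List, Tuple

# Hypothetical layouts, assumed only for this sketch.
Note = Tuple[float, float, int]    # (start_sec, end_sec, midi_pitch) from the XML score
Phone = Tuple[float, float, str]   # (start_sec, end_sec, phoneme) from the label file


def assign_pitch_to_phones(
    notes: List[Note], phones: List[Phone], rest_pitch: int = 0
) -> List[Tuple[float, float, str, int]]:
    """Give each phoneme the pitch of the note it overlaps most in time."""
    aligned = []
    for p_start, p_end, phoneme in phones:
        best_pitch, best_overlap = rest_pitch, 0.0
        for n_start, n_end, pitch in notes:
            # Length of the intersection of the two intervals (<= 0 if disjoint).
            overlap = min(p_end, n_end) - max(p_start, n_start)
            if overlap > best_overlap:
                best_pitch, best_overlap = pitch, overlap
        # Phonemes that overlap no note keep rest_pitch.
        aligned.append((p_start, p_end, phoneme, best_pitch))
    return aligned


notes = [(0.0, 0.5, 60), (0.5, 1.0, 62)]
phones = [(0.0, 0.2, "k"), (0.2, 0.55, "a"), (0.55, 1.0, "i"), (1.0, 1.2, "pau")]
for segment in assign_pitch_to_phones(notes, phones):
    print(segment)
```

Maximum overlap is only one possible rule; a recipe could equally snap note boundaries to the label boundaries first.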

The naive_rnn recipe without DP on Ofuton can now be run (stages 1 to 5 are tested, stage 6 is still running, and the remaining stages are still to be done).
Some details of the NAR model with DP still need to be checked; it is not included in this version.

Add stage 7 (Decoding) & 8 (Scoring) & 9 (Pack model).
Add evalute_semitone & evalute_vuv.

Recipe ofuton naive_rnn has been tested from stage 1 to stage 9.
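
For readers unfamiliar with the two scoring metrics, the sketch below shows one common way to compute a semitone accuracy and a voiced/unvoiced (V/UV) error from frame-level F0 tracks. It is an assumption about what such metrics typically measure, not the code of the scripts added in this PR; the function names and the convention that F0 = 0 marks an unvoiced frame are hypothetical.

```python
import math
from typing import Sequence


def hz_to_semitone(f0_hz: float) -> int:
    """Nearest MIDI-style semitone number (A4 = 440 Hz = 69)."""
    return round(69 + 12 * math.log2(f0_hz / 440.0))


def semitone_accuracy(ref_f0: Sequence[float], pred_f0: Sequence[float]) -> float:
    """Fraction of frames voiced in both tracks whose semitones match."""
    hits = total = 0
    for ref, pred in zip(ref_f0, pred_f0):
        if ref > 0 and pred > 0:  # assumed convention: 0 means unvoiced
            total += 1
            hits += hz_to_semitone(ref) == hz_to_semitone(pred)
    return hits / total if total else 0.0


def vuv_error(ref_f0: Sequence[float], pred_f0: Sequence[float]) -> float:
    """Fraction of frames whose voiced/unvoiced decision differs."""
    n = min(len(ref_f0), len(pred_f0))
    if n == 0:
        return 0.0
    mismatches = sum((ref_f0[i] > 0) != (pred_f0[i] > 0) for i in range(n))
    return mismatches / n


ref = [440.0, 440.0, 0.0, 220.0]
pred = [445.0, 466.16, 0.0, 0.0]
print(semitone_accuracy(ref, pred), vuv_error(ref, pred))
```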

ftshijt

Many thanks! A few comments

espnet2/fileio/xml_scp.py

ftshijt · 2022-09-19T21:15:33Z

espnet2/svs/abs_svs.py

@@ -0,0 +1,45 @@
+# Copyright 2021 Tomoki Hayashi
+# Copyright 2021 Carnegie Mellon University (Jiatong Shi)


Could you add the Muskits contributors here?

The same can be applied to the other files (but it does not have to be done now).

espnet2/svs/espnet_model.py

ftshijt · 2022-09-19T23:54:46Z

espnet2/svs/espnet_model.py

+
+                (
+                    labelFrame,
+                    labelFrame_lengths,
+                    scoreFrame,
+                    scoreFrame_lengths,
+                    tempoFrame,
+                    tempoFrame_lengths,
+                ) = extractMethod_frame(
+                    durations=durations.unsqueeze(-1),
+                    durations_lengths=durations_lengths,
+                    score=score.unsqueeze(-1),
+                    score_lengths=score_lengths,
+                    tempo=tempo.unsqueeze(-1),
+                    tempo_lengths=tempo_lengths,
+                )
+
+                labelFrame = labelFrame[
+                    :, : labelFrame_lengths.max()
+                ]  # for data-parallel
+                scoreFrame = scoreFrame[
+                    :, : scoreFrame_lengths.max()
+                ]  # for data-parallel
+
+                # Extract Syllable Level label, score, tempo information from Frame Level
+                (
+                    label,
+                    label_lengths,
+                    score,
+                    score_lengths,
+                    tempo,
+                    tempo_lengths,
+                ) = self.score_feats_extract(
+                    durations=labelFrame,
+                    durations_lengths=labelFrame_lengths,
+                    score=scoreFrame,
+                    score_lengths=scoreFrame_lengths,
+                    tempo=tempoFrame,
+                    tempo_lengths=tempoFrame_lengths,
+                )
+
+                # calculate durations, represent syllable encoder outputs to feats mapping
+                # Syllable Level duration info needs phone & midi
+                ds = []
+                for i, _ in enumerate(labelFrame_lengths):
+                    assert labelFrame_lengths[i] == scoreFrame_lengths[i]
+                    assert label_lengths[i] == score_lengths[i]
+
+                    frame_length = labelFrame_lengths[i]
+                    _phoneFrame = labelFrame[i, :frame_length]
+                    _midiFrame = scoreFrame[i, :frame_length]
+
+                    # Clean _phoneFrame & _midiFrame
+                    for index in range(frame_length):
+                        if _phoneFrame[index] == 0 and _midiFrame[index] == 0:
+                            frame_length -= 1
+                            feats_lengths[i] -= 1
+
+                    syllable_length = label_lengths[i]
+                    _phoneSyllable = label[i, :syllable_length]
+                    _midiSyllable = score[i, :syllable_length]
+
+                    start_index = 0
+                    ds_tmp = []
+                    flag_finish = 0
+                    for index in range(syllable_length):
+                        _findPhone = _phoneSyllable[index]
+                        _findMidi = _midiSyllable[index]
+                        _length = 0
+                        if flag_finish == 1:
+                            # Fix error in _phoneSyllable & _midiSyllable
+                            label[i, index] = 0
+                            score[i, index] = 0
+                            tempo[i, index] = 0
+                            label_lengths[i] -= 1
+                            score_lengths[i] -= 1
+                            tempo_lengths[i] -= 1
+                        else:
+                            for indexFrame in range(start_index, frame_length):
+                                if (
+                                    _phoneFrame[indexFrame] == _findPhone
+                                    and _midiFrame[indexFrame] == _findMidi
+                                ):
+                                    _length += 1
+                                else:
+                                    ds_tmp.append(_length)
+                                    start_index = indexFrame
+                                    break
+                                if indexFrame == frame_length - 1:
+                                    flag_finish = 1
+                                    ds_tmp.append(_length)
+                                    start_index = indexFrame
+                                    break
+
+                    assert (
+                        sum(ds_tmp) == frame_length and sum(ds_tmp) == feats_lengths[i]
+                    )
+
+                    ds.append(torch.tensor(ds_tmp))
+                ds = pad_list(ds, pad_value=0).to(label.device)


This part needs to be adapted to the XML features.

Also, could we move this logic into a dedicated module (separate from the tts model)?
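
As a rough illustration of what such a standalone helper might look like, the sketch below derives segment durations from frame-level phone and MIDI sequences by run-length grouping. It is a simplification, not a drop-in replacement for the hunk above: the function name `frame_to_segment_durations` and the use of plain Python lists instead of tensors are assumptions, and adjacent segments with identical phone and MIDI values are merged into one.

```python
from typing import List, Sequence, Tuple


def frame_to_segment_durations(
    phone_frames: Sequence[int], midi_frames: Sequence[int], pad_id: int = 0
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Group consecutive frames sharing the same (phone, midi) pair.

    Returns the list of (phone, midi) segments and the number of frames
    each segment spans. Trailing padding frames are ignored.
    """
    if len(phone_frames) != len(midi_frames):
        raise ValueError("phone and MIDI frame sequences must have equal length")

    # Strip trailing padding (frames where both phone and MIDI equal pad_id).
    n = len(phone_frames)
    while n > 0 and phone_frames[n - 1] == pad_id and midi_frames[n - 1] == pad_id:
        n -= 1

    segments: List[Tuple[int, int]] = []
    durations: List[int] = []
    for phone, midi in zip(phone_frames[:n], midi_frames[:n]):
        if segments and segments[-1] == (phone, midi):
            durations[-1] += 1  # same segment continues
        else:
            segments.append((phone, midi))  # a new segment starts
            durations.append(1)
    return segments, durations


segments, durations = frame_to_segment_durations(
    [3, 3, 5, 5, 5, 0, 0], [60, 60, 62, 62, 62, 0, 0]
)
print(segments, durations)
```

Keeping this in its own function makes the invariant easy to test in isolation: the durations should sum to the number of non-padding frames.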

ftshijt · 2022-09-19T23:56:08Z

espnet2/svs/espnet_model.py

+        if tempo is not None:
+            tempo = tempo.to(dtype=torch.long)
+            batch.update(tempo=tempo, tempo_lengths=tempo_lengths)
+        if ds is not None:
+            batch.update(ds=ds)


We had a lot of issues with naming in the previous repo. Please consider renaming it so that its meaning is easier to understand.

espnet2/svs/naive_rnn/naive_rnn.py

espnet2/train/preprocessor.py

Co-authored-by: Jiatong <728307998@qq.com>

ftshijt · 2022-09-24T14:14:23Z

Thanks for the update. I will merge it first.

Add stage 5 & stage 6

01a3760

mergify bot added the ESPnet2 label Sep 19, 2022

ftshijt reviewed Sep 20, 2022


A-Quarter-Mile and others added 3 commits September 20, 2022 10:50

Update espnet2/svs/espnet_model.py

6a61ef8

Co-authored-by: Jiatong <728307998@qq.com>

Remove mix_train & noise

bb6bb71

Add stage 7 & 8 & 9

ae779e5

ftshijt merged commit 48b23ad into espnet:muskits Sep 24, 2022
