Add stage 5 & stage 6 by A-Quarter-Mile · Pull Request #4649 · espnet/espnet

Add stage 5 & stage 6 #4649

Merged
merged 4 commits into from
Sep 24, 2022

Conversation

A-Quarter-Mile
Contributor
@A-Quarter-Mile A-Quarter-Mile commented Sep 19, 2022

Hi @ftshijt, here are new updates on Muskit:

  • Add stage 5 & stage 6.
  • In stage 5, we take the pitch from the XML score and align it according to lab_timeseq. In this version, data is collected without the duration predictor.

The naive_rnn recipe without DP (duration predictor) on Ofuton can now be run (stages 1 to 5 are tested, stage 6 is still running, and the remaining stages are not done yet).
Some details of the NAR model with DP still need to be checked; it has not been applied in this version.
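
As a rough illustration of the stage 5 alignment idea (score pitch mapped onto the label time sequence), here is a minimal sketch. It is not the code from this PR: the function name `align_pitch_to_labels`, the tuple layouts, and the "largest overlap wins" rule are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical layouts, assumed for this sketch only.
Note = Tuple[float, float, int]     # (start_sec, end_sec, midi_pitch) from the score
Segment = Tuple[float, float, str]  # (start_sec, end_sec, phoneme) from the label


def align_pitch_to_labels(notes: List[Note], segments: List[Segment]) -> List[int]:
    """Give each label segment the pitch of the note it overlaps most (0 if none)."""
    pitches = []
    for seg_start, seg_end, _phoneme in segments:
        best_pitch, best_overlap = 0, 0.0
        for note_start, note_end, midi in notes:
            # Length of the time interval shared by the segment and the note.
            overlap = min(seg_end, note_end) - max(seg_start, note_start)
            if overlap > best_overlap:
                best_pitch, best_overlap = midi, overlap
        pitches.append(best_pitch)
    return pitches


notes = [(0.0, 0.5, 60), (0.5, 1.0, 62)]
segments = [(0.0, 0.2, "k"), (0.2, 0.55, "a"), (0.55, 1.0, "i")]
print(align_pitch_to_labels(notes, segments))  # [60, 60, 62]
```

The second segment straddles both notes and takes the pitch of the first because more of it falls inside that note; a real recipe may resolve such boundaries differently.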

  • Add stage 7 (Decoding) & 8 (Scoring) & 9 (Pack model).
  • Add evalute_semitone & evalute_vuv.

Recipe ofuton naive_rnn has been tested from stage 1 to stage 9.
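
The scoring scripts themselves are not shown in this conversation, so the following is only a sketch of one common way to define the two metrics from frame-level F0 tracks. The function names, the "0 Hz means unvoiced" convention, and the rounding to the nearest MIDI note are assumptions, not the definitions used by evalute_semitone or evalute_vuv.

```python
import math
from typing import Sequence


def hz_to_midi(f0_hz: float) -> float:
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69.0 + 12.0 * math.log2(f0_hz / 440.0)


def vuv_error(ref_f0: Sequence[float], pred_f0: Sequence[float]) -> float:
    """Fraction of frames whose voiced/unvoiced decision differs (0 Hz = unvoiced)."""
    mismatches = sum((r > 0) != (p > 0) for r, p in zip(ref_f0, pred_f0))
    return mismatches / len(ref_f0)


def semitone_accuracy(ref_f0: Sequence[float], pred_f0: Sequence[float]) -> float:
    """Fraction of frames voiced in both tracks whose rounded MIDI pitch agrees."""
    voiced = [(r, p) for r, p in zip(ref_f0, pred_f0) if r > 0 and p > 0]
    if not voiced:
        return 0.0
    hits = sum(round(hz_to_midi(r)) == round(hz_to_midi(p)) for r, p in voiced)
    return hits / len(voiced)


ref = [0.0, 440.0, 440.0, 466.16]
pred = [0.0, 445.0, 0.0, 440.0]
print(vuv_error(ref, pred))          # 0.25
print(semitone_accuracy(ref, pred))  # 0.5
```

Both functions assume the two tracks are non-empty and already have the same number of frames; length mismatches would need handling before scoring.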

@mergify mergify bot added the ESPnet2 label Sep 19, 2022
Collaborator
@ftshijt ftshijt left a comment

Many thanks! A few comments

@@ -0,0 +1,45 @@
# Copyright 2021 Tomoki Hayashi
# Copyright 2021 Carnegie Mellon University (Jiatong Shi)
Collaborator

Could you try to add the Muskits contributors here?

Collaborator

The same can be applied to other files (but it does not have to be done now).

Comment on lines +128 to +282

(
    labelFrame,
    labelFrame_lengths,
    scoreFrame,
    scoreFrame_lengths,
    tempoFrame,
    tempoFrame_lengths,
) = extractMethod_frame(
    durations=durations.unsqueeze(-1),
    durations_lengths=durations_lengths,
    score=score.unsqueeze(-1),
    score_lengths=score_lengths,
    tempo=tempo.unsqueeze(-1),
    tempo_lengths=tempo_lengths,
)

labelFrame = labelFrame[
    :, : labelFrame_lengths.max()
]  # for data-parallel
scoreFrame = scoreFrame[
    :, : scoreFrame_lengths.max()
]  # for data-parallel

# Extract Syllable Level label, score, tempo information from Frame Level
(
    label,
    label_lengths,
    score,
    score_lengths,
    tempo,
    tempo_lengths,
) = self.score_feats_extract(
    durations=labelFrame,
    durations_lengths=labelFrame_lengths,
    score=scoreFrame,
    score_lengths=scoreFrame_lengths,
    tempo=tempoFrame,
    tempo_lengths=tempoFrame_lengths,
)

# calculate durations, represent syllable encoder outputs to feats mapping
# Syllable Level duration info needs phone & midi
ds = []
for i, _ in enumerate(labelFrame_lengths):
    assert labelFrame_lengths[i] == scoreFrame_lengths[i]
    assert label_lengths[i] == score_lengths[i]

    frame_length = labelFrame_lengths[i]
    _phoneFrame = labelFrame[i, :frame_length]
    _midiFrame = scoreFrame[i, :frame_length]

    # Clean _phoneFrame & _midiFrame
    for index in range(frame_length):
        if _phoneFrame[index] == 0 and _midiFrame[index] == 0:
            frame_length -= 1
            feats_lengths[i] -= 1

    syllable_length = label_lengths[i]
    _phoneSyllable = label[i, :syllable_length]
    _midiSyllable = score[i, :syllable_length]

    start_index = 0
    ds_tmp = []
    flag_finish = 0
    for index in range(syllable_length):
        _findPhone = _phoneSyllable[index]
        _findMidi = _midiSyllable[index]
        _length = 0
        if flag_finish == 1:
            # Fix error in _phoneSyllable & _midiSyllable
            label[i, index] = 0
            score[i, index] = 0
            tempo[i, index] = 0
            label_lengths[i] -= 1
            score_lengths[i] -= 1
            tempo_lengths[i] -= 1
        else:
            for indexFrame in range(start_index, frame_length):
                if (
                    _phoneFrame[indexFrame] == _findPhone
                    and _midiFrame[indexFrame] == _findMidi
                ):
                    _length += 1
                else:
                    ds_tmp.append(_length)
                    start_index = indexFrame
                    break
                if indexFrame == frame_length - 1:
                    flag_finish = 1
                    ds_tmp.append(_length)
                    start_index = indexFrame
                    break

    assert (
        sum(ds_tmp) == frame_length and sum(ds_tmp) == feats_lengths[i]
    )

    ds.append(torch.tensor(ds_tmp))
ds = pad_list(ds, pad_value=0).to(label.device)
Collaborator

This part needs to be changed accordingly to work with the XML features.

Collaborator

Also, can we move this into a separate module (apart from the TTS model)?
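
For illustration, one way such a standalone helper could look is sketched below. It is not the code from this PR: the name `frame_runs_to_durations` is hypothetical, it works on plain Python sequences rather than padded tensors, and it uses run-length grouping with `itertools.groupby` in place of the nested loops above.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def frame_runs_to_durations(
    phone_frames: Sequence[int], midi_frames: Sequence[int]
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Collapse frame-level (phone, midi) pairs into syllables and frame counts."""
    # Assumption: a frame where both phone and midi are 0 is padding and is dropped.
    pairs = [(p, m) for p, m in zip(phone_frames, midi_frames) if (p, m) != (0, 0)]
    syllables, durations = [], []
    # groupby yields one group per run of consecutive identical (phone, midi) pairs.
    for key, run in groupby(pairs):
        syllables.append(key)
        durations.append(sum(1 for _ in run))
    return syllables, durations


phone = [3, 3, 3, 5, 5, 7, 0, 0]
midi = [60, 60, 60, 60, 60, 62, 0, 0]
syllables, durations = frame_runs_to_durations(phone, midi)
print(syllables)  # [(3, 60), (5, 60), (7, 62)]
print(durations)  # [3, 2, 1]
```

The durations sum to the number of non-padding frames, which mirrors the assertion in the diff. One limitation of this sketch: two adjacent syllables with the same phone and pitch collapse into a single run, so it cannot replace the original logic where that case matters.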

Comment on lines +329 to +333
if tempo is not None:
    tempo = tempo.to(dtype=torch.long)
    batch.update(tempo=tempo, tempo_lengths=tempo_lengths)
if ds is not None:
    batch.update(ds=ds)
Collaborator

We had a lot of issues with naming in the previous repo. Please consider renaming it for better interpretability.

Collaborator
ftshijt commented Sep 24, 2022

Thanks for the update. I will merge it first.

@ftshijt ftshijt merged commit 48b23ad into espnet:muskits Sep 24, 2022