SpeechLM PR1 Modeling #6146
base: espnet3
Conversation
for more information, see https://pre-commit.ci
It is better to keep using espnet2/bin/speechlm_inference.py to follow the consistent binary naming; see https://github.com/espnet/espnet/tree/master/espnet2/bin. The only exception is having a variant of the inference algorithm (like _k2 and _maskctc), and this case is different.
So, espnet2/bin/speechlm_inference_chat.py
--> espnet2/bin/speechlm_inference.py
espnet2/speechlm/core_lm/ar_delay.py
It's better to ask someone to check the details.
@wanchichen
return x, targets, loss_mask

@torch.no_grad()
def inference(
Will you add this in the future? If so, it's better to add a TODO.
Did you drop this due to performance? It is fine, but it would be better to record this history (e.g., adding this decision at the top of this PR).
from espnet2.speechlm.inference_utils import AbsInferenceConfig


class ARParallelLM(AbsCoreLM):
Is it used as an independent class, or is it only used for the ArDelayLM class?
If it is only used for the ArDelayLM class, we may merge this class with that.
If not, please explain it.
It's better to ask someone to check the details.
@wanchichen
espnet2/speechlm/core_lm/valle.py
Ditto: did you drop this due to performance? It is fine, but it would be better to record this history (e.g., adding this decision at the top of this PR).
@@ -95,7 +276,6 @@ def find_modality_type(self):  # Used in shell data preparation script
# b. don't delete / modify it, otherwise the model trained
# previously can become incompatible. New tokens can be
# added - there are enough slots

special_tokens = [
Is there a place that explains these special tokens? It is very important.
- If you don't have one, please prepare it in the appropriate place (README.md in the template).
- If you have one, please add a link to the document.
import torchaudio
from kaldiio import WriteHelper

from espnet2.speechlm.definitions import SPEECHLM_TASKS
Why is SPEECHLM_TASKS in all upper-case? Any intention? This breaks the naming convention.
return mask


class TaskOrientedWriter:
Please explain the difference between TaskOrientedWriter and ChatOrientedWriter.
import torch


class SpeechLMCrossEntropyLossV2(torch.nn.Module):
Why V2?
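For background on what such a criterion typically computes: a cross-entropy that honors a loss mask (so prompt or padding positions are ignored) can be sketched in plain Python as below. This is an illustrative sketch only, not the actual SpeechLMCrossEntropyLossV2 implementation; the function name and list-based inputs are assumptions for clarity.

```python
import math

def masked_cross_entropy(logits, targets, mask):
    """Average negative log-likelihood over positions where mask is True.

    logits: list of per-position score lists (unnormalized).
    targets: list of target token indices, one per position.
    mask: list of booleans; False positions (e.g. prompt tokens) are skipped.
    """
    total, count = 0.0, 0
    for scores, t, m in zip(logits, targets, mask):
        if not m:
            continue  # position excluded from the loss
        z = math.log(sum(math.exp(s) for s in scores))  # log-partition
        total += z - scores[t]                          # -log softmax(scores)[t]
        count += 1
    return total / max(count, 1)
```

In practice a framework implementation would vectorize this with tensor ops and may weight the loss per modality, but the masking idea is the same.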
# Copyright 2024 Jinchuan Tian
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)

from abc import ABC, abstractmethod
What is ABC? (Sorry, I simply don't know this.)
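For context: ABC is Python's standard abstract-base-class helper from the built-in abc module. Subclassing it, together with the @abstractmethod decorator, prevents instantiating a class that has not implemented its abstract methods. A minimal sketch (class names here are simplified illustrations, not ESPnet's actual interfaces):

```python
from abc import ABC, abstractmethod

class AbsTokenizer(ABC):            # subclassing ABC marks the class abstract
    @abstractmethod
    def tokenize(self, text: str):  # concrete subclasses must override this
        ...

class WordTokenizer(AbsTokenizer):
    def tokenize(self, text: str):
        return text.split()

# AbsTokenizer() would raise TypeError because tokenize() is abstract;
# WordTokenizer() instantiates fine since it overrides the method.
```

This is why abs_tokenizer.py imports ABC: it defines the interface that every concrete tokenizer must implement.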
Why did you remove this?
If there is a valid reason, please record it at the top of this PR.
espnet2/speechlm/module/valle.py
Why did you remove this?
Ditto. If there is a valid reason, please record it at the top of this PR.
Please add detailed changes at the top of this PR.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #6146 +/- ##
==========================================
+ Coverage 20.63% 25.43% +4.79%
==========================================
Files 93 884 +791
Lines 10230 82937 +72707
==========================================
+ Hits 2111 21091 +18980
- Misses 8119 61846 +53727
Flags with carried forward coverage won't be shown.
This pull request introduces significant updates to the SpeechLM framework.

Refactoring and Core Updates:
New Architectures:
Removal of Outdated Implementation:
Pull Request Overview
This PR introduces the initial implementation of the SpeechLM module by adding key modeling, tokenizer, task, and inference files while refactoring or removing legacy modules.
- Added and updated files in espnet2/tasks, espnet2/speechlm/tokenizer, espnet2/speechlm/module, and espnet2/speechlm/core_lm.
- Replaced unused modules (e.g., transformer.py, valle.py, ar_multiscale.py) with new implementations and integrated HuggingFace Transformer support.
- Extended task definitions and loss computation while updating argument parsers and in-model processing.
Reviewed Changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
espnet2/tasks/speechlm.py | Updated import statements, added new task arguments, and integrated new model configuration parameters. |
espnet2/speechlm/tokenizer/text_bpe_tokenizer.py | Added a new text BPE tokenizer with minor documentation updates. |
espnet2/speechlm/tokenizer/codec_tokenizer.py | Adjusted default token per frame and added conditional logic for HF transformer support. |
espnet2/speechlm/tokenizer/abs_tokenizer.py | Revised method signatures and added detokenization interface. |
espnet2/speechlm/net_utils.py | Modified logits-to-tokens function with adjustments to support new search algorithms. |
espnet2/speechlm/module/huggingface.py | Introduced HFTransformerDecoder for HuggingFace model integration. |
espnet2/speechlm/module/abs_transformer.py | Added an abstract class defining the Transformer API. |
espnet2/speechlm/loss.py | Implemented SpeechLMCrossEntropyLossV2 with support for modality-specific loss computation. |
espnet2/speechlm/espnet_model.py | Updated the forward method to integrate the new criterion and model-building approach. |
espnet2/speechlm/definitions.py | Revised modalities and task definitions to support new SpeechLM tasks. |
espnet2/speechlm/core_lm/ar_parallel.py | Added new auto-regressive LM based on parallel interleave. |
espnet2/speechlm/core_lm/ar_delay.py | Added new delay-based auto-regressive LM with delay interleaving procedure. |
espnet2/speechlm/core_lm/abs_core_lm.py | Updated the abstract base class interface for core LM modules. |
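On the delay interleaving mentioned for ar_delay.py: the commonly used delay pattern (as in MusicGen-style multi-codebook LMs) shifts codebook k right by k frames, so each codebook is predicted conditioned on the lower codebooks of earlier steps. A minimal sketch under that assumption; this is not ESPnet's actual code, and apply_delay_pattern and pad_id are illustrative names:

```python
def apply_delay_pattern(codes, pad_id):
    """Delay-interleave multi-codebook token streams.

    codes: list of K codebook streams, each a list of T token ids.
    Returns K streams of length T + K - 1, where stream k is shifted
    right by k frames and padded with pad_id at the vacated positions.
    """
    K = len(codes)
    T = len(codes[0])
    out = [[pad_id] * (T + K - 1) for _ in range(K)]
    for k, stream in enumerate(codes):
        out[k][k:k + T] = stream  # shift codebook k right by k frames
    return out
```

With this layout, an auto-regressive model can emit all K codebooks at each step while still respecting the intra-frame dependency from codebook 0 up to codebook K-1; the parallel interleave in ar_parallel.py corresponds to the degenerate case with no shift.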
"--pad_speaker_prompt",
type=str2bool,
default=True,
help="If ture, add padding to the speaker prompt that is shorter"
There is a spelling mistake in the help message for '--pad_speaker_prompt'. Replace 'ture' with 'true'.
- help="If ture, add padding to the speaker prompt that is shorter"
+ help="If true, add padding to the speaker prompt that is shorter"
Copilot uses AI. Check for mistakes.
class TextBPETokenizer(AbsTokenizer):
    """
    A warpper for SentencePiece tokenizer, only used for speechlm BPE detokenization
The docstring contains a spelling mistake: 'warpper' should be corrected to 'wrapper'.
- A warpper for SentencePiece tokenizer, only used for speechlm BPE detokenization
+ A wrapper for SentencePiece tokenizer, only used for speechlm BPE detokenization
What?
This PR adds several key files in the SpeechLM module:
espnet2/speechlm
espnet2/speechlm/tokenizers
espnet2/tasks/speechlm.py
espnet2/bin/speechlm_inference_chat.py
Why?
To merge code step-by-step.
See also
#6050