SpeechLM PR1 Modeling by jctian98 · Pull Request #6146 · espnet/espnet · GitHub

SpeechLM PR1 Modeling #6146

Open · wants to merge 9 commits into espnet3
Conversation

@jctian98 (Contributor) commented Jun 12, 2025

What?

This PR adds several key files to the SpeechLM module.

  • The modeling files under espnet2/speechlm
  • The tokenizer files under espnet2/speechlm/tokenizer
  • The task definition espnet2/tasks/speechlm.py
  • The inference file espnet2/bin/speechlm_inference_chat.py

Why?

To merge code step-by-step.

See also

#6050

dosubot (bot) added the label size:XXL (This PR changes 1000+ lines, ignoring generated files.) on Jun 12, 2025
mergify (bot) added the label ESPnet2 on Jun 12, 2025
dosubot (bot) added the label SLM on Jun 12, 2025
sw005320 added this to the v.202506 milestone on Jun 12, 2025
sw005320 requested a review from ftshijt on June 12, 2025 at 22:29
Contributor:

It is better to keep using espnet2/bin/speechlm_inference.py to follow the consistent binary naming.
See https://github.com/espnet/espnet/tree/master/espnet2/bin

The only exception is when there is a variant of the inference algorithm (like _k2 and _maskctc), and this case is different.

Contributor:

So, espnet2/bin/speechlm_inference_chat.py --> espnet2/bin/speechlm_inference.py

Contributor:

It's better to ask someone to check the details.
@wanchichen

return x, targets, loss_mask

@torch.no_grad()
def inference(
Contributor:

Will you add this in the future?
If so, it's better to add a TODO.

Contributor:

Did you drop this due to performance?
It is fine, but it would be better to record this history (e.g., adding this decision at the top of this PR).

from espnet2.speechlm.inference_utils import AbsInferenceConfig


class ARParallelLM(AbsCoreLM):
Contributor:

Is it used as an independent class, or is it only used by the ARDelayLM class?
If it is only used by the ARDelayLM class, we may merge this class with that one.
If not, please explain.

Contributor:

It's better to ask someone to check the details.
@wanchichen

Contributor:

Ditto.
Did you drop this due to performance?
It is fine, but it would be better to record this history (e.g., adding this decision at the top of this PR).

@@ -95,7 +276,6 @@ def find_modality_type(self): # Used in shell data preparation script
# b. don't delete / modify it, otherwise the model trained
# previously can become incompatible. New tokens can be
# added - there are enough slots

special_tokens = [
Contributor:

Is there a place that explains these special tokens?
They are very important.

  • If you don't have one, please prepare it in the appropriate place (README.md in the template).
  • If you have one, please add a link to the document.
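
For readers unfamiliar with the layout, here is a purely hypothetical illustration of what such a reserved special-token inventory can look like; the token names and slot count below are placeholders, not the actual list in espnet2/speechlm/definitions.py.

# Hypothetical example only; the real inventory lives in espnet2/speechlm/definitions.py.
# Existing entries must never be reordered or removed, since that would shift token ids
# and break previously trained models; new tokens may only be appended to free slots.
example_special_tokens = [
    "<pad>",              # padding
    "<unk>",              # unknown token
    "<sos/eos>",          # sequence start/end
    "<system_prompt>",    # placeholder chat-role markers
    "<user_input>",
    "<assistant_output>",
]
# Reserve unused slots so the total count (and all later ids) stays stable.
example_special_tokens += [
    f"<unused_{i}>" for i in range(len(example_special_tokens), 64)
]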

import torchaudio
from kaldiio import WriteHelper

from espnet2.speechlm.definitions import SPEECHLM_TASKS
Contributor:

Why is SPEECHLM_TASKS all upper-case?
Is that intentional?
This breaks the naming convention.

return mask


class TaskOrientedWriter:
Contributor:

Please explain the difference between TaskOrientedWriter and ChatOrientedWriter
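
For context while that explanation is being added, here is a minimal, generic sketch of the kind of output writer such classes typically wrap (kaldiio.WriteHelper for ark/scp token output plus torchaudio for waveforms); the class and argument names are illustrative assumptions, not the actual TaskOrientedWriter or ChatOrientedWriter implementation.

import torch
import torchaudio
from kaldiio import WriteHelper

class ExampleInferenceWriter:
    """Generic sketch: dump generated tokens to Kaldi ark/scp and audio to wav files."""

    def __init__(self, output_dir: str):
        self.output_dir = output_dir
        self.token_writer = WriteHelper(
            f"ark,scp:{output_dir}/tokens.ark,{output_dir}/tokens.scp"
        )

    def write(self, key: str, tokens: torch.Tensor, wav: torch.Tensor = None, sample_rate: int = 16000):
        # Store the generated token sequence under the utterance key.
        self.token_writer(key, tokens.cpu().numpy())
        if wav is not None:
            # torchaudio expects a (channels, samples) tensor.
            torchaudio.save(f"{self.output_dir}/{key}.wav", wav.cpu(), sample_rate)

    def close(self):
        self.token_writer.close()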

import torch


class SpeechLMCrossEntropyLossV2(torch.nn.Module):
Contributor:

why V2?
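
As background for readers, here is a minimal sketch of the masked cross-entropy that a SpeechLM criterion typically computes over several parallel token streams; the function name, shapes, and masking convention are assumptions for illustration, not the actual SpeechLMCrossEntropyLossV2.

import torch

def masked_multistream_ce(logits, targets, loss_mask):
    """Generic sketch of a masked cross-entropy over parallel token streams.

    logits:    (batch, time, n_streams, vocab)
    targets:   (batch, time, n_streams) integer token ids
    loss_mask: (batch, time, n_streams), 1.0 where a position contributes to the loss
    """
    vocab = logits.size(-1)
    ce = torch.nn.functional.cross_entropy(
        logits.reshape(-1, vocab),
        targets.reshape(-1),
        reduction="none",
    ).view(targets.shape)
    # Average only over unmasked positions (e.g., exclude prompt and padding frames),
    # matching the `return x, targets, loss_mask` interface quoted earlier in this thread.
    return (ce * loss_mask).sum() / loss_mask.sum().clamp(min=1)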

# Copyright 2024 Jinchuan Tian
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)

from abc import ABC, abstractmethod
Contributor:

What is ABC?
(Sorry, I simply don't know this.)
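
For reference, ABC is the helper class from Python's standard abc module: a class that inherits from ABC and marks methods with @abstractmethod cannot be instantiated until a subclass overrides those methods, which is how abstract interfaces such as the Abs* classes here are typically declared. A minimal illustration:

from abc import ABC, abstractmethod

class AbsExample(ABC):
    @abstractmethod
    def forward(self, x):
        """Subclasses must implement this."""
        raise NotImplementedError

class ConcreteExample(AbsExample):
    def forward(self, x):
        return x * 2

# AbsExample() raises TypeError; the concrete subclass works.
print(ConcreteExample().forward(3))  # 6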

Contributor:

why did you remove this?

Contributor:

If there is a valid reason, please record it at the top of this PR.

Contributor:

why did you remove this?

Contributor:

Ditto.
If there is a valid reason, please record it at the top of this PR.

@sw005320 (Contributor) left a comment:

Please add detailed changes at the top of this PR.

codecov bot commented Jun 12, 2025

Codecov Report

Attention: Patch coverage is 7.42009% with 811 lines in your changes missing coverage. Please review.

Project coverage is 25.43%. Comparing base (ae4688d) to head (aef1335).
Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
espnet2/speechlm/inference_utils.py 0.00% 238 Missing ⚠️
espnet2/bin/speechlm_inference_chat.py 0.00% 179 Missing ⚠️
espnet2/speechlm/core_lm/ar_delay.py 0.00% 88 Missing ⚠️
espnet2/speechlm/loss.py 0.00% 88 Missing ⚠️
espnet2/tasks/speechlm.py 0.00% 58 Missing ⚠️
espnet2/speechlm/module/huggingface.py 0.00% 47 Missing ⚠️
espnet2/speechlm/core_lm/ar_parallel.py 0.00% 34 Missing ⚠️
espnet2/speechlm/net_utils.py 0.00% 31 Missing ⚠️
espnet2/speechlm/tokenizer/text_bpe_tokenizer.py 0.00% 14 Missing ⚠️
espnet2/speechlm/definitions.py 83.33% 13 Missing ⚠️
... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6146      +/-   ##
==========================================
+ Coverage   20.63%   25.43%   +4.79%     
==========================================
  Files          93      884     +791     
  Lines       10230    82937   +72707     
==========================================
+ Hits         2111    21091   +18980     
- Misses       8119    61846   +53727     
Flag Coverage Δ
test_integration_espnetez 37.89% <83.33%> (?)
test_python_espnetez 12.73% <7.42%> (?)
test_utils 20.63% <ø> (ø)

Flags with carried forward coverage won't be shown.

@Fhrozen (Member) commented Jun 13, 2025

This pull request introduces significant updates to the SpeechLM framework in the espnet2 library, including refactoring the abstract base class, adding new architectures, and removing an outdated implementation. The changes aim to enhance modularity, simplify inference configurations, and support advanced language modeling techniques.

Refactoring and Core Updates:

  • espnet2/speechlm/core_lm/abs_core_lm.py: Refactored the abstract base class (AbsCoreLM) by removing the SpeechLMInferenceOptions dataclass and replacing it with AbsInferenceConfig. Updated method signatures to streamline arguments and added support for continuous feature tuples (conti_feats).

New Architectures:

  • espnet2/speechlm/core_lm/ar_parallel.py: Added ARParallelLM, a new auto-regressive language model supporting parallel interleave codec patterns. This includes modular continuous feature encoders and configurable embeddings.
  • espnet2/speechlm/core_lm/ar_delay.py: Introduced ARDelayLM, which implements a delay interleave pattern for training and inference, improving flexibility in sequence generation (a generic sketch of delay interleaving follows below).
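
For readers unfamiliar with these codec patterns, here is a minimal sketch of delay interleaving under simplified assumptions: codebook k of a multi-stream codec sequence is shifted right by k frames (the MusicGen-style delay pattern), so at each step the model only conditions on codebook levels that have already been generated. This is an illustration of the general technique, not the ARDelayLM implementation.

import torch

def delay_interleave(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Shift codebook k right by k frames.

    codes: (time, n_codebooks) discrete codec tokens.
    Returns a (time + n_codebooks - 1, n_codebooks) tensor padded with pad_id.
    """
    T, K = codes.shape
    out = codes.new_full((T + K - 1, K), pad_id)
    for k in range(K):
        out[k : k + T, k] = codes[:, k]
    return out

def delay_deinterleave(delayed: torch.Tensor, n_frames: int) -> torch.Tensor:
    """Invert the delay pattern back to (n_frames, n_codebooks)."""
    K = delayed.shape[1]
    return torch.stack([delayed[k : k + n_frames, k] for k in range(K)], dim=1)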

Removal of Outdated Implementation:

@Fhrozen requested a review from Copilot on June 13, 2025 at 04:02
@Copilot (Copilot AI) left a comment:

Pull Request Overview

This PR introduces the initial implementation of the SpeechLM module by adding key modeling, tokenizer, task, and inference files while refactoring or removing legacy modules.

  • Added and updated files in espnet2/tasks, espnet2/speechlm/tokenizer, espnet2/speechlm/module, and espnet2/speechlm/core_lm.
  • Replaced unused modules (e.g., transformer.py, valle.py, ar_multiscale.py) with new implementations and integrated HuggingFace Transformer support.
  • Extended task definitions and loss computation while updating argument parsers and in-model processing.

Reviewed Changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

Summary per file:

  • espnet2/tasks/speechlm.py: Updated import statements, added new task arguments, and integrated new model configuration parameters.
  • espnet2/speechlm/tokenizer/text_bpe_tokenizer.py: Added a new text BPE tokenizer with minor documentation updates.
  • espnet2/speechlm/tokenizer/codec_tokenizer.py: Adjusted default tokens per frame and added conditional logic for HF transformer support.
  • espnet2/speechlm/tokenizer/abs_tokenizer.py: Revised method signatures and added a detokenization interface.
  • espnet2/speechlm/net_utils.py: Modified the logits-to-tokens function with adjustments to support new search algorithms.
  • espnet2/speechlm/module/huggingface.py: Introduced HFTransformerDecoder for HuggingFace model integration (see the hedged sketch after this list).
  • espnet2/speechlm/module/abs_transformer.py: Added an abstract class defining the Transformer API.
  • espnet2/speechlm/loss.py: Implemented SpeechLMCrossEntropyLossV2 with support for modality-specific loss computation.
  • espnet2/speechlm/espnet_model.py: Updated the forward method to integrate the new criterion and model-building approach.
  • espnet2/speechlm/definitions.py: Revised modalities and task definitions to support new SpeechLM tasks.
  • espnet2/speechlm/core_lm/ar_parallel.py: Added a new auto-regressive LM based on parallel interleave.
  • espnet2/speechlm/core_lm/ar_delay.py: Added a new delay-based auto-regressive LM with a delay interleaving procedure.
  • espnet2/speechlm/core_lm/abs_core_lm.py: Updated the abstract base class interface for core LM modules.
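
As context for the new HuggingFace integration mentioned above, here is a minimal sketch of how a pretrained causal LM backbone can be run over externally computed input embeddings; the class name ExampleHFDecoder and the model_name default are illustrative assumptions, not the actual HFTransformerDecoder in espnet2/speechlm/module/huggingface.py.

import torch
from transformers import AutoModelForCausalLM

class ExampleHFDecoder(torch.nn.Module):
    """Hypothetical wrapper: use a pretrained causal LM body as a SpeechLM decoder."""

    def __init__(self, model_name: str = "Qwen/Qwen2-0.5B"):
        super().__init__()
        lm = AutoModelForCausalLM.from_pretrained(model_name)
        # Keep only the transformer body; a SpeechLM typically supplies its own
        # embedding tables and output heads for the text/codec streams.
        self.backbone = lm.model if hasattr(lm, "model") else lm.transformer
        self.hidden_size = lm.config.hidden_size

    def forward(self, inputs_embeds: torch.Tensor) -> torch.Tensor:
        # (batch, time, hidden) -> (batch, time, hidden); causal attention is applied inside.
        return self.backbone(inputs_embeds=inputs_embeds).last_hidden_state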

"--pad_speaker_prompt",
type=str2bool,
default=True,
help="If ture, add padding to the speaker prompt that is shorter"
Copilot AI commented on Jun 13, 2025:

There is a spelling mistake in the help message for '--pad_speaker_prompt'. Replace 'ture' with 'true'.

Suggested change:
- help="If ture, add padding to the speaker prompt that is shorter"
+ help="If true, add padding to the speaker prompt that is shorter"



class TextBPETokenizer(AbsTokenizer):
"""
A warpper for SentencePiece tokenizer, only used for speechlm BPE detokenization
Copilot AI commented on Jun 13, 2025:

The docstring contains a spelling mistake: 'warpper' should be corrected to 'wrapper'.

Suggested change:
- A warpper for SentencePiece tokenizer, only used for speechlm BPE detokenization
+ A wrapper for SentencePiece tokenizer, only used for speechlm BPE detokenization


sw005320 changed the base branch from master to espnet3 on June 13, 2025 at 11:18
Labels: ESPnet2, size:XXL, SLM · 3 participants