SpeechLM PR1 Modeling by jctian98 · Pull Request #6146 · espnet/espnet · GitHub

SpeechLM PR1 Modeling #6146

Open · wants to merge 9 commits into espnet3
Conversation

@jctian98 (Contributor) commented Jun 12, 2025

What?

This PR adds several key files to the SpeechLM module.

  • The modeling files under espnet2/speechlm
  • The tokenizer files under espnet2/speechlm/tokenizer
  • The task definition espnet2/tasks/speechlm.py
  • The inference file espnet2/bin/speechlm_inference_chat.py

Why?

To merge code step-by-step.

See also

#6050

dosubot (bot) added the label size:XXL (This PR changes 1000+ lines, ignoring generated files.) on Jun 12, 2025
mergify (bot) added the label ESPnet2 on Jun 12, 2025
dosubot (bot) added the label SLM on Jun 12, 2025
sw005320 added this to the v.202506 milestone on Jun 12, 2025
sw005320 requested a review from ftshijt on June 12, 2025 at 22:29
Contributor:

It is better to keep using espnet2/bin/speechlm_inference.py to follow the consistent binary naming.
See https://github.com/espnet/espnet/tree/master/espnet2/bin

The only exception is when there is a variant of the inference algorithm (like _k2 and _maskctc), and this case is different.

Contributor:

So, espnet2/bin/speechlm_inference_chat.py --> espnet2/bin/speechlm_inference.py

Contributor:

It's better to ask someone to check the details.
@wanchichen

return x, targets, loss_mask

@torch.no_grad()
def inference(
Contributor:

Will you add this in the future?
If so, it's better to add a TODO.

Contributor:

Did you drop this due to performance?
It is fine, but it would be better to record this history (e.g., adding this decision at the top of this PR).

from espnet2.speechlm.inference_utils import AbsInferenceConfig


class ARParallelLM(AbsCoreLM):
Contributor:

Is it used as an independent class, or is it only used by the ARDelayLM class?
If it is only used by the ARDelayLM class, we may merge this class with that one.
If not, please explain.

Contributor:

It's better to ask someone to check the details.
@wanchichen

Contributor:

Ditto.
Did you drop this due to performance?
It is fine, but it would be better to record this history (e.g., adding this decision at the top of this PR).

@@ -95,7 +276,6 @@ def find_modality_type(self): # Used in shell data preparation script
# b. don't delete / modify it, otherwise the model trained
# previously can become incompatible. New tokens can be
# added - there are enough slots

special_tokens = [
Contributor:

Is there a place that explains these special tokens?
They are very important.

  • If you don't have one, please prepare it in the appropriate place (README.md in the template).
  • If you have one, please add a link to the document.
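
For readers unfamiliar with the layout, here is a purely hypothetical illustration of what such a reserved special-token inventory can look like; the token names and slot count below are placeholders, not the actual list in espnet2/speechlm/definitions.py.

# Hypothetical example only; the real inventory lives in espnet2/speechlm/definitions.py.
# Existing entries must never be reordered or removed, since that would shift token ids
# and break previously trained models; new tokens may only be appended to free slots.
example_special_tokens = [
    "<pad>",              # padding
    "<unk>",              # unknown token
    "<sos/eos>",          # sequence start/end
    "<system_prompt>",    # placeholder chat-role markers
    "<user_input>",
    "<assistant_output>",
]
# Reserve unused slots so the total count (and all later ids) stays stable.
example_special_tokens += [
    f"<unused_{i}>" for i in range(len(example_special_tokens), 64)
]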

import torchaudio
from kaldiio import WriteHelper

from espnet2.speechlm.definitions import SPEECHLM_TASKS
Contributor:

Why is SPEECHLM_TASKS all upper-case?
Is that intentional?
This breaks the naming convention.

return mask


class TaskOrientedWriter:
Contributor:

Please explain the difference between TaskOrientedWriter and ChatOrientedWriter
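
For context while that explanation is being added, here is a minimal, generic sketch of the kind of output writer such classes typically wrap (kaldiio.WriteHelper for ark/scp token output plus torchaudio for waveforms); the class and argument names are illustrative assumptions, not the actual TaskOrientedWriter or ChatOrientedWriter implementation.

import torch
import torchaudio
from kaldiio import WriteHelper

class ExampleInferenceWriter:
    """Generic sketch: dump generated tokens to Kaldi ark/scp and audio to wav files."""

    def __init__(self, output_dir: str):
        self.output_dir = output_dir
        self.token_writer = WriteHelper(
            f"ark,scp:{output_dir}/tokens.ark,{output_dir}/tokens.scp"
        )

    def write(self, key: str, tokens: torch.Tensor, wav: torch.Tensor = None, sample_rate: int = 16000):
        # Store the generated token sequence under the utterance key.
        self.token_writer(key, tokens.cpu().numpy())
        if wav is not None:
            # torchaudio expects a (channels, samples) tensor.
            torchaudio.save(f"{self.output_dir}/{key}.wav", wav.cpu(), sample_rate)

    def close(self):
        self.token_writer.close()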

import torch


class SpeechLMCrossEntropyLossV2(torch.nn.Module):
Contributor:

why V2?
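
As background for readers, here is a minimal sketch of the masked cross-entropy that a SpeechLM criterion typically computes over several parallel token streams; the function name, shapes, and masking convention are assumptions for illustration, not the actual SpeechLMCrossEntropyLossV2.

import torch

def masked_multistream_ce(logits, targets, loss_mask):
    """Generic sketch of a masked cross-entropy over parallel token streams.

    logits:    (batch, time, n_streams, vocab)
    targets:   (batch, time, n_streams) integer token ids
    loss_mask: (batch, time, n_streams), 1.0 where a position contributes to the loss
    """
    vocab = logits.size(-1)
    ce = torch.nn.functional.cross_entropy(
        logits.reshape(-1, vocab),
        targets.reshape(-1),
        reduction="none",
    ).view(targets.shape)
    # Average only over unmasked positions (e.g., exclude prompt and padding frames),
    # matching the `return x, targets, loss_mask` interface quoted earlier in this thread.
    return (ce * loss_mask).sum() / loss_mask.sum().clamp(min=1)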

# Copyright 2024 Jinchuan Tian
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)

from abc import ABC, abstractmethod
Contributor:

What is ABC?
(Sorry, I simply don't know this.)
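
For reference, ABC is the helper class from Python's standard abc module: a class that inherits from ABC and marks methods with @abstractmethod cannot be instantiated until a subclass overrides those methods, which is how abstract interfaces such as the Abs* classes here are typically declared. A minimal illustration:

from abc import ABC, abstractmethod

class AbsExample(ABC):
    @abstractmethod
    def forward(self, x):
        """Subclasses must implement this."""
        raise NotImplementedError

class ConcreteExample(AbsExample):
    def forward(self, x):
        return x * 2

# AbsExample() raises TypeError; the concrete subclass works.
print(ConcreteExample().forward(3))  # 6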

Contributor:

why did you remove this?

Contributor:

If there is a valid reason, please record it at the top of this PR.

Contributor:

why did you remove this?

Contributor:

Ditto.
If there is a valid reason, please record it at the top of this PR.

@sw005320 (Contributor) left a comment:

Please add detailed changes at the top of this PR.

codecov bot commented Jun 12, 2025

Codecov Report

Attention: Patch coverage is 7.42009% with 811 lines in your changes missing coverage. Please review.

Project coverage is 25.43%. Comparing base (ae4688d) to head (aef1335).
Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
espnet2/speechlm/inference_utils.py 0.00% 238 Missing ⚠️
espnet2/bin/speechlm_inference_chat.py 0.00% 179 Missing ⚠️
espnet2/speechlm/core_lm/ar_delay.py 0.00% 88 Missing ⚠️
espnet2/speechlm/loss.py 0.00% 88 Missing ⚠️
espnet2/tasks/speechlm.py 0.00% 58 Missing ⚠️
espnet2/speechlm/module/huggingface.py 0.00% 47 Missing ⚠️
espnet2/speechlm/core_lm/ar_parallel.py 0.00% 34 Missing ⚠️
espnet2/speechlm/net_utils.py 0.00% 31 Missing ⚠️
espnet2/speechlm/tokenizer/text_bpe_tokenizer.py 0.00% 14 Missing ⚠️
espnet2/speechlm/definitions.py 83.33% 13 Missing ⚠️
... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6146      +/-   ##
==========================================
+ Coverage   20.63%   25.43%   +4.79%     
==========================================
  Files          93      884     +791     
  Lines       10230    82937   +72707     
==========================================
+ Hits         2111    21091   +18980     
- Misses       8119    61846   +53727     
Flag Coverage Δ
test_integration_espnetez 37.89% <83.33%> (?)
test_python_espnetez 12.73% <7.42%> (?)
test_utils 20.63% <ø> (ø)

Flags with carried forward coverage won't be shown.

@Fhrozen (Member) commented Jun 13, 2025

This pull request introduces significant updates to the SpeechLM framework in the espnet2 library, including refactoring the abstract base class, adding new architectures, and removing an outdated implementation. The changes aim to enhance modularity, simplify inference configurations, and support advanced language modeling techniques.

Refactoring and Core Updates:

  • espnet2/speechlm/core_lm/abs_core_lm.py: Refactored the abstract base class (AbsCoreLM) by removing the SpeechLMInferenceOptions dataclass and replacing it with AbsInferenceConfig. Updated method signatures to streamline arguments and added support for continuous feature tuples (conti_feats).

New Architectures:

  • espnet2/speechlm/core_lm/ar_parallel.py: Added ARParallelLM, a new auto-regressive language model supporting parallel interleave codec patterns. This includes modular continuous feature encoders and configurable embeddings.
  • espnet2/speechlm/core_lm/ar_delay.py: Introduced ARDelayLM, which implements a delay interleave pattern for training and inference, improving flexibility in sequence generation (a generic sketch of delay interleaving follows below).
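
For readers unfamiliar with these codec patterns, here is a minimal sketch of delay interleaving under simplified assumptions: codebook k of a multi-stream codec sequence is shifted right by k frames (the MusicGen-style delay pattern), so at each step the model only conditions on codebook levels that have already been generated. This is an illustration of the general technique, not the ARDelayLM implementation.

import torch

def delay_interleave(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Shift codebook k right by k frames.

    codes: (time, n_codebooks) discrete codec tokens.
    Returns a (time + n_codebooks - 1, n_codebooks) tensor padded with pad_id.
    """
    T, K = codes.shape
    out = codes.new_full((T + K - 1, K), pad_id)
    for k in range(K):
        out[k : k + T, k] = codes[:, k]
    return out

def delay_deinterleave(delayed: torch.Tensor, n_frames: int) -> torch.Tensor:
    """Invert the delay pattern back to (n_frames, n_codebooks)."""
    K = delayed.shape[1]
    return torch.stack([delayed[k : k + n_frames, k] for k in range(K)], dim=1)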

Removal of Outdated Implementation:

@Fhrozen requested a review from Copilot on June 13, 2025 at 04:02
@Copilot (Copilot AI) left a comment:

Pull Request Overview

This PR introduces the initial implementation of the SpeechLM module by adding key modeling, tokenizer, task, and inference files while refactoring or removing legacy modules.

  • Added and updated files in espnet2/tasks, espnet2/speechlm/tokenizer, espnet2/speechlm/module, and espnet2/speechlm/core_lm.
  • Replaced unused modules (e.g., transformer.py, valle.py, ar_multiscale.py) with new implementations and integrated HuggingFace Transformer support.
  • Extended task definitions and loss computation while updating argument parsers and in-model processing.

Reviewed Changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

Summary per file:

  • espnet2/tasks/speechlm.py: Updated import statements, added new task arguments, and integrated new model configuration parameters.
  • espnet2/speechlm/tokenizer/text_bpe_tokenizer.py: Added a new text BPE tokenizer with minor documentation updates.
  • espnet2/speechlm/tokenizer/codec_tokenizer.py: Adjusted default tokens per frame and added conditional logic for HF transformer support.
  • espnet2/speechlm/tokenizer/abs_tokenizer.py: Revised method signatures and added a detokenization interface.
  • espnet2/speechlm/net_utils.py: Modified the logits-to-tokens function with adjustments to support new search algorithms.
  • espnet2/speechlm/module/huggingface.py: Introduced HFTransformerDecoder for HuggingFace model integration (see the hedged sketch after this list).
  • espnet2/speechlm/module/abs_transformer.py: Added an abstract class defining the Transformer API.
  • espnet2/speechlm/loss.py: Implemented SpeechLMCrossEntropyLossV2 with support for modality-specific loss computation.
  • espnet2/speechlm/espnet_model.py: Updated the forward method to integrate the new criterion and model-building approach.
  • espnet2/speechlm/definitions.py: Revised modalities and task definitions to support new SpeechLM tasks.
  • espnet2/speechlm/core_lm/ar_parallel.py: Added a new auto-regressive LM based on parallel interleave.
  • espnet2/speechlm/core_lm/ar_delay.py: Added a new delay-based auto-regressive LM with a delay interleaving procedure.
  • espnet2/speechlm/core_lm/abs_core_lm.py: Updated the abstract base class interface for core LM modules.
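
As context for the new HuggingFace integration mentioned above, here is a minimal sketch of how a pretrained causal LM backbone can be run over externally computed input embeddings; the class name ExampleHFDecoder and the model_name default are illustrative assumptions, not the actual HFTransformerDecoder in espnet2/speechlm/module/huggingface.py.

import torch
from transformers import AutoModelForCausalLM

class ExampleHFDecoder(torch.nn.Module):
    """Hypothetical wrapper: use a pretrained causal LM body as a SpeechLM decoder."""

    def __init__(self, model_name: str = "Qwen/Qwen2-0.5B"):
        super().__init__()
        lm = AutoModelForCausalLM.from_pretrained(model_name)
        # Keep only the transformer body; a SpeechLM typically supplies its own
        # embedding tables and output heads for the text/codec streams.
        self.backbone = lm.model if hasattr(lm, "model") else lm.transformer
        self.hidden_size = lm.config.hidden_size

    def forward(self, inputs_embeds: torch.Tensor) -> torch.Tensor:
        # (batch, time, hidden) -> (batch, time, hidden); causal attention is applied inside.
        return self.backbone(inputs_embeds=inputs_embeds).last_hidden_state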

"--pad_speaker_prompt",
type=str2bool,
default=True,
help="If ture, add padding to the speaker prompt that is shorter"
Copilot AI commented on Jun 13, 2025:

There is a spelling mistake in the help message for '--pad_speaker_prompt'. Replace 'ture' with 'true'.

Suggested change:
- help="If ture, add padding to the speaker prompt that is shorter"
+ help="If true, add padding to the speaker prompt that is shorter"



class TextBPETokenizer(AbsTokenizer):
"""
A warpper for SentencePiece tokenizer, only used for speechlm BPE detokenization
Copilot AI commented on Jun 13, 2025:

The docstring contains a spelling mistake: 'warpper' should be corrected to 'wrapper'.

Suggested change:
- A warpper for SentencePiece tokenizer, only used for speechlm BPE detokenization
+ A wrapper for SentencePiece tokenizer, only used for speechlm BPE detokenization


sw005320 changed the base branch from master to espnet3 on June 13, 2025 at 11:18
Labels: ESPnet2, size:XXL, SLM · 3 participants