LLMBM

This repository contains the code for Enhancing Governmental New Word Discovery with Large Language Models, as well as the GovDict and GovTerm datasets.

Corpus

We begin with PolicyBank as our raw corpus and supplement it with additional government-related texts scraped from public websites. After preprocessing, we obtain GovTerm, an instruction-tuning dataset intended to improve a large model's knowledge of governmental terminology and its ability to extract new terms. We also provide GovDict in two forms: a glossary of government terms with partial explanations, and a plain list of government terms without explanations.

Usage

Getting Started

pip install -r requirements.txt

Model

We use Qwen2.5-7B as our base model. Download it from the Qwen2.5-7B model page and place it under model/Qwen2.5-7B-Instruct.
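Before training, it can help to confirm that the model files are where the scripts expect them. The sketch below is not part of this repository: `check_model_dir` is a hypothetical helper, and the assumption that a usable checkpoint contains `config.json` and `tokenizer_config.json` follows the usual Hugging Face directory layout.

```python
from pathlib import Path


# Hypothetical helper, not part of this repository.
def check_model_dir(model_dir):
    """Return the expected files that are missing from model_dir."""
    root = Path(model_dir)
    # ASSUMPTION: a Hugging Face checkpoint ships at least these two files.
    expected = ["config.json", "tokenizer_config.json"]
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = check_model_dir("model/Qwen2.5-7B-Instruct")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Model directory looks complete.")
```

An empty list means both files were found; it does not verify that the weights themselves are complete.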

Dataset

Under /data you will find:

GovTerm: the fine-tuning and evaluation data.
- datasets: the fine-tuning set. It includes three task types: classification, correction, and explanation. You can select task types and proportions before or during training.
- datasets_eval: the evaluation set. It contains only the classification task.

GovDict:
- govdict.json: our curated glossary of government terms with explanations.
- govdict.txt: the plain list of government terms.
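Selecting task types and proportions for the fine-tuning set could look like the sketch below. It is an illustration under stated assumptions, not the repository's data loader: each example is assumed to be a dict with a `task` field, and `mix_tasks`, the field name, and the weights are all hypothetical.

```python
import random


def mix_tasks(examples, proportions, total, seed=0):
    """Sample up to `total` examples so each task gets roughly its share.

    examples:    list of dicts, each with a "task" key (assumed schema)
    proportions: dict mapping task name -> relative weight
    """
    rng = random.Random(seed)
    weight_sum = sum(proportions.values())
    mixed = []
    for task, weight in proportions.items():
        pool = [ex for ex in examples if ex["task"] == task]
        # Requested count for this task, capped by how many examples exist.
        k = min(len(pool), round(total * weight / weight_sum))
        mixed.extend(rng.sample(pool, k))
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Toy placeholder records, only to exercise the function.
    toy = [
        {"task": task, "id": i}
        for task in ("classification", "correction", "explanation")
        for i in range(10)
    ]
    weights = {"classification": 2, "correction": 1, "explanation": 1}
    subset = mix_tasks(toy, weights, total=8)
    print(len(subset))
```

Leaving a task out of `proportions` excludes it entirely, which covers the case of training on a subset of task types.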

Discovering New Words

Unzip your documents into the data directory, organized by year/month, following the GOVGLM documents. We also include a sample test case, 2024-08.txt. Run discovery.py to extract new words from that document.
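If documents are organized by month, walking them in chronological order might look like this. The `YYYY-MM.txt` naming is inferred from the sample file `2024-08.txt`; `list_monthly_files` is a hypothetical helper and does not reflect how `discovery.py` is implemented.

```python
import re
from pathlib import Path

# ASSUMPTION: monthly files are named like the sample, e.g. 2024-08.txt.
MONTH_FILE = re.compile(r"^(\d{4})-(\d{2})\.txt$")


def list_monthly_files(data_dir):
    """Return (year, month, path) tuples for YYYY-MM.txt files, oldest first."""
    found = []
    for path in Path(data_dir).iterdir():
        match = MONTH_FILE.match(path.name)
        if match and path.is_file():
            year, month = int(match.group(1)), int(match.group(2))
            if 1 <= month <= 12:  # skip names such as 2024-13.txt
                found.append((year, month, path))
    return sorted(found, key=lambda item: (item[0], item[1]))


if __name__ == "__main__":
    data_dir = Path("data")
    if data_dir.is_dir():
        for year, month, path in list_monthly_files(data_dir):
            print(f"{year}-{month:02d}: {path}")
```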

Training and Evaluation

Training

After preparing your model files and datasets, run scripts/train.py to start training the LoRA model. The configurations used in our experiments were:

Qwen

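from transformers import TrainingArguments

# OUTPUT_DIR must be set to the directory where checkpoints are written.
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=5,  # 2 x 5 = 10 examples per optimizer step per device
    num_train_epochs=3,
    learning_rate=5e-4,
    logging_steps=50,
    save_steps=50,
    eval_strategy="no",
    save_total_limit=3,
    fp16=False,  # full precision (fp32)
    gradient_checkpointing=False,
    max_grad_norm=1.0,
    remove_unused_columns=True,
)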
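With these settings, each optimizer step consumes `per_device_train_batch_size × gradient_accumulation_steps` examples per device. The arithmetic is sketched below; the dataset size is a placeholder, and the step count is approximate because the Trainer's exact rounding depends on the library version.

```python
import math


def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Examples consumed per optimizer step."""
    return per_device * grad_accum * num_devices


def approx_optimizer_steps(num_examples, per_device, grad_accum, epochs, num_devices=1):
    """Rough number of optimizer steps over the whole run."""
    per_step = effective_batch_size(per_device, grad_accum, num_devices)
    return math.ceil(num_examples / per_step) * epochs


if __name__ == "__main__":
    # Batch settings from the configuration above; 10_000 examples is a placeholder.
    print(effective_batch_size(2, 5))                # 10
    print(approx_optimizer_steps(10_000, 2, 5, 3))   # 3000
```

With `save_steps=50` and `save_total_limit=3`, a checkpoint is written every 50 optimizer steps and only the three most recent are kept.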

LoRA

from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    # Adapt only the attention projections.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    inference_mode=False,
    r=8,
    lora_alpha=16,
    lora_dropout=0.2,
    bias='none',
)
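To get a feel for how small this adapter is, the trainable parameter count can be estimated by hand: LoRA adds two matrices per adapted linear layer, of shapes `r × in_features` and `out_features × r`. The sketch below does that arithmetic. The layer dimensions are assumptions about Qwen2.5-7B and should be checked against the model's `config.json`; the helper names are hypothetical.

```python
def lora_params(in_features, out_features, r):
    """Parameters LoRA adds to one linear layer: A is r x in, B is out x r."""
    return r * (in_features + out_features)


def total_lora_params(layer_shapes, r, num_layers):
    """Sum over all adapted projections in every transformer layer."""
    per_layer = sum(lora_params(i, o, r) for i, o in layer_shapes.values())
    return num_layers * per_layer


if __name__ == "__main__":
    # ASSUMPTION: Qwen2.5-7B uses hidden size 3584, 4 key/value heads of
    # size 128 (so k_proj/v_proj output 512), and 28 layers.
    hidden, kv_dim, layers = 3584, 512, 28
    shapes = {
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, kv_dim),
        "v_proj": (hidden, kv_dim),
        "o_proj": (hidden, hidden),
    }
    r, lora_alpha = 8, 16
    print(total_lora_params(shapes, r, layers))  # 5046272
    print(lora_alpha / r)  # 2.0, the scaling factor applied to the LoRA update
```

Under those assumed dimensions the adapter has roughly 5 million trainable parameters, a small fraction of the base model.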

Our experiments used full precision (fp32). If your hardware supports it, you may switch to bf16 to significantly speed up training with minimal impact on accuracy.
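One way to make that switch conditional is to derive the precision flags from a single boolean. This is a minimal sketch, not code from `scripts/train.py`; `precision_flags` is a hypothetical helper.

```python
def precision_flags(bf16_supported):
    """Return mixed-precision keyword arguments for TrainingArguments."""
    if bf16_supported:
        return {"bf16": True, "fp16": False}
    # Fall back to the full-precision setting used in the experiments.
    return {"bf16": False, "fp16": False}


if __name__ == "__main__":
    print(precision_flags(True))
    print(precision_flags(False))
```

In a training script, the boolean could come from `torch.cuda.is_available() and torch.cuda.is_bf16_supported()`, and the result would be unpacked into `TrainingArguments(...)` in place of the explicit `fp16=False`.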

Evaluation

We provide built‑in evaluation routines to measure the model’s classification accuracy on our selected data. To assess overall new‑word discovery performance and coverage against our term lists, run scripts/eval.py.
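Coverage against a term list can be understood as set overlap between discovered terms and reference terms. The sketch below is not the logic of `scripts/eval.py`; it assumes exact string matching and a reference file with one term per line (the layout of `govdict.txt` is an assumption), and the function names are hypothetical.

```python
def load_terms(path):
    """Read a term list, assuming one term per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def coverage_metrics(discovered, reference):
    """Compare discovered terms with a reference list by exact match."""
    discovered, reference = set(discovered), set(reference)
    hits = discovered & reference
    # Precision: share of discovered terms that appear in the reference.
    precision = len(hits) / len(discovered) if discovered else 0.0
    # Recall (coverage): share of reference terms that were discovered.
    recall = len(hits) / len(reference) if reference else 0.0
    return {"hits": len(hits), "precision": precision, "recall": recall}


if __name__ == "__main__":
    # Placeholder strings, only to exercise the function.
    print(coverage_metrics(["a", "b", "c"], ["b", "c", "d", "e"]))
```

Note that a newly discovered term absent from the reference list lowers precision here even if it is a valid new word, so this measure is best read alongside manual review.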
