This is the PyTorch implementation of the paper "CitaLaw: Enhancing LLM with Citations in Legal Domain".
.
├── datasets                    # * dataset path
│   ├── layperson               # * question path
│   ├── practitioner            # * question path
│   └── corpus
│       ├── law article         # * corpus path and index path
│       └── precedent case      # * corpus path and index path
└── benchmark                   # * evaluation benchmark
    ├── models                  # * evaluation models
    │   ├── flashrag            # * required code from FlashRAG
    │   ├── citation_attach     # * code for adding citations
    │   ├── closebook.py        # * code for CloseBook
    │   ├── cgg.py              # * code for citation-guided generation
    │   ├── arg.py              # * code for answer refinement generation
    │   └── utils               # * utility code
    ├── evaluation              # * code for evaluation
    │   ├── global-level        # * code for global-level metrics
    │   └── syllogism-level     # * code for syllogism-level metrics
    └── shell                   # * scripts for quick evaluation
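The corpus files referenced above (e.g. `law_article_corpus.jsonl`) are JSONL. As a minimal sketch, assuming each line holds an `id` and a `contents` field (the common FlashRAG corpus convention; the exact schema is an assumption, not confirmed here), the corpus can be loaded like this:

```python
import json

def load_corpus(path):
    """Load a JSONL corpus file into a list of documents.

    Assumes one JSON object per line with "id" and "contents" keys
    (schema assumed here, following the usual FlashRAG convention).
    """
    docs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            docs.append(json.loads(line))
    return docs
```

For example, `load_corpus("../../datasets/corpus/law_article_corpus.jsonl")` would return the list of law articles used to build the BGE index.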
Create a conda environment and install the dependencies from the requirements file:
conda create -n CitaLaw python=3.10
conda activate CitaLaw
pip install -r requirements.txt
The bge-base-en-v1.5, Llama3-8b-Instruct, and Qwen2-7b-Instruct models can be downloaded from Hugging Face.
For the legal LLMs, download them from their respective links:
Check the datasets folder for details.
cd benchmark/shell
# for layperson
sh layperson.sh
# for practitioner
sh practitioner.sh
Example run commands for the layperson dataset:
# legal LLMs
models=("lexilaw" "lawgpt_zh" "fuzi" "hanfei" "tailing" "zhihai" "disc_lawllm")
model_paths=("lexilaw_path" "lawgpt_zh_path" "fuzi_path" "hanfei_path" "tailing_path" "zhihai_path" "disc_path")
# set an entry to Null if there is no LoRA path
lora_paths=("lexilaw_lora_path" "lawgpt_zh_lora_path" "fuzi_lora_path" "hanfei_lora_path" "tailing_lora_path" "zhihai_lora_path" "disc_lora_path")
len=${#models[@]}
for ((i=0; i<len; i++)); do
    model=${models[$i]}
    model_path=${model_paths[$i]}
    lora_path=${lora_paths[$i]}
    echo "Running model: $model with model_path: $model_path and lora_path: $lora_path"
    python cgg.py \
        --data_dir ../../datasets/layperson \
        --dataset_name layperson \
        --split "layperson_test" \
        --index_path ../../datasets/corpus/bge_law_article.index \
        --corpus_path ../../datasets/corpus/law_article_corpus.jsonl \
        --gpu_id 2 \
        --model_path "$model_path" \
        --generator_model "$model" \
        --generator_lora_path "$lora_path"
done
# open-domain LLM
# qwen2
# closebook
python closebook.py \
    --data_dir ../../datasets/layperson \
    --dataset_name layperson \
    --split "layperson_test" \
    --gpu_id 2 \
    --model_path Qwen2_path \
    --generator_model Qwen2
# cgg
python cgg.py \
    --data_dir ../../datasets/layperson \
    --dataset_name layperson \
    --split "layperson_test" \
    --index_path ../../datasets/corpus/bge_law_article.index \
    --corpus_path ../../datasets/corpus/law_article_corpus.jsonl \
    --gpu_id 2 \
    --model_path Qwen2_path \
    --generator_model Qwen2
# arg-q
python arg.py --input_file closebook_output_file --output_file /qwen_arg_q_lay.json
# arg-qa
# step 1: retrieve using q+a
python closebook.py \
    --data_dir ../../datasets/layperson \
    --dataset_name layperson \
    --split "layperson_test_qa" \
    --gpu_id 2 \
    --model_path Qwen2_path \
    --generator_model Qwen2
# step 2: arg
python arg.py --input_file closebook_qa_output_file --output_file /qwen_arg_qa_lay.json
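The arg-qa variant above retrieves with the question concatenated with the closed-book answer. A minimal sketch of that query construction, assuming the CloseBook output stores the fields "question" and "pred" (field names are an assumption, used only for illustration):

```python
def build_qa_query(record):
    """Concatenate the question and the closed-book answer into one
    retrieval query for the q+a split.

    The field names "question" and "pred" are assumptions about the
    CloseBook output format, not confirmed by the repository.
    """
    question = record.get("question", "")
    answer = record.get("pred", "")
    return f"{question} {answer}".strip()
```

For a record missing the answer field, the function falls back to the question alone, so the same code can serve both the q and q+a retrieval settings.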
Check the models folder for details.
Place the result file in the specified location first, then run the citation attachment.
cd benchmark/shell
sh citation_attach.sh
Check the models/citation_attach folder for details.
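Conceptually, citation attachment pairs each answer sentence with a retrieved article. The repository's own matcher lives in models/citation_attach; the sketch below is only an illustrative stand-in that cites, for each sentence, the article with the largest word overlap:

```python
def attach_citations(sentences, articles):
    """Append a [k] citation marker to each answer sentence.

    Each sentence is matched to the retrieved article sharing the most
    tokens with it. This is an illustrative heuristic, not the
    repository's actual matching logic.
    """
    cited = []
    for sent in sentences:
        sent_tokens = set(sent.lower().split())
        # index of the article with maximal token overlap
        best = max(range(len(articles)),
                   key=lambda k: len(sent_tokens & set(articles[k].lower().split())))
        cited.append(f"{sent} [{best + 1}]")
    return cited
```

Real matchers typically use embedding similarity rather than raw token overlap, but the input/output shape is the same: plain sentences in, sentences with citation markers out.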
Place the result file in the specified location first, then run the evaluation.
cd benchmark/shell
sh evaluation.sh
Check the evaluation folder for details.
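At the global level, citation quality is commonly scored as precision and recall over the cited versus gold supporting articles. The repository's metrics live in benchmark/evaluation; the sketch below is a generic set-based formulation of such a metric (an assumption for illustration, not necessarily the repository's exact formula):

```python
def citation_pr(predicted, gold):
    """Set-based citation precision and recall.

    predicted / gold are collections of article identifiers. This is a
    generic formulation, not confirmed to match the repository's metric.
    """
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0, 0.0
    hits = len(pred & ref)
    return hits / len(pred), hits / len(ref)
```

Syllogism-level evaluation then checks whether the cited law articles and precedent cases actually support the major and minor premises of the answer, which requires the per-step annotations in the dataset rather than a simple set comparison.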
CitaLaw is built on the following projects:
We conducted the experiments based on the following environments:
- CUDA Version: 11.4
- torch version: 2.2.0
- OS: Ubuntu 18.04.5 LTS
- GPU: NVIDIA RTX A6000
- CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz