Benchmarking Legal Knowledge of Large Language Models

🌐 Website • 🤗 Hugging Face • ⏬ Data • 📃 Paper

📖 中文 | English

Large language models (LLMs) have demonstrated strong capabilities in various aspects. However, when applying them to the highly specialized, safety-critical legal domain, it is unclear how much legal knowledge they possess and whether they can reliably perform law-related tasks. To address this gap, we propose LawBench, a comprehensive evaluation benchmark. Please refer to our paper for more details.

Tasks in LawBench are based on the legal system of China. A similar benchmark based on the American legal system is available here.

✨ Introduction

LawBench has been meticulously crafted to provide a precise assessment of LLMs' legal capabilities. In designing the testing tasks, we simulated three dimensions of judicial cognition and selected 20 tasks to evaluate the abilities of large models. Compared to some existing benchmarks with only multiple-choice questions, we include more diverse types of tasks closely related to real-world applications, such as legal entity recognition, reading comprehension, criminal damages calculation, and consultation. We recognize that, due to their safety policies, current large models may decline to respond to certain legal queries or may fail to comprehend the instructions, leading to a lack of response. Therefore, we have developed a separate evaluation metric, the "abstention rate", which measures how often a model refuses to provide an answer or fails to follow the instruction properly. We report the performance of 51 large language models on LawBench, including 20 multilingual LLMs, 22 Chinese-oriented LLMs, and 9 legal-specific LLMs.

📖 Dataset

Our dataset includes 20 diverse tasks covering 3 cognitive levels:

  • Legal knowledge memorization: whether large language models can memorize necessary legal concepts, terminologies, articles, and facts in their parameters.
  • Legal knowledge understanding: whether large language models can comprehend entities, events, and relationships within legal texts, so as to grasp the meanings and connotations of legal text.
  • Legal knowledge applying: whether large language models can properly utilize their legal knowledge and reason over it to solve realistic legal tasks in downstream applications.

Task List

The following is the included task list. Every task has 500 examples.

| Cognitive Level | ID | Task | Data Source | Metric | Type |
|---|---|---|---|---|---|
| Legal Knowledge Memorization | 1-1 | Article Recitation | FLK | ROUGE-L | Generation |
| | 1-2 | Knowledge Question Answering | JEC_QA | Accuracy | SLC |
| Legal Knowledge Understanding | 2-1 | Document Proofreading | CAIL2022 | F0.5 | Generation |
| | 2-2 | Dispute Focus Identification | LAIC2021 | F1 | MLC |
| | 2-3 | Marital Disputes Identification | AIStudio | F1 | MLC |
| | 2-4 | Issue Topic Identification | CrimeKgAssitant | Accuracy | SLC |
| | 2-5 | Reading Comprehension | CAIL2019 | rc-F1 | Extraction |
| | 2-6 | Named Entity Recognition | CAIL2021 | soft-F1 | Extraction |
| | 2-7 | Opinion Summarization | CAIL2022 | ROUGE-L | Generation |
| | 2-8 | Argument Mining | CAIL2022 | Accuracy | SLC |
| | 2-9 | Event Detection | LEVEN | F1 | MLC |
| | 2-10 | Trigger Word Extraction | LEVEN | soft-F1 | Extraction |
| Legal Knowledge Applying | 3-1 | Fact-based Article Prediction | CAIL2018 | F1 | MLC |
| | 3-2 | Scene-based Article Prediction | LawGPT_zh Project | ROUGE-L | Generation |
| | 3-3 | Charge Prediction | CAIL2018 | F1 | MLC |
| | 3-4 | Prison Term Prediction w/o Article | CAIL2018 | Normalized log-distance | Regression |
| | 3-5 | Prison Term Prediction w/ Article | CAIL2018 | Normalized log-distance | Regression |
| | 3-6 | Case Analysis | JEC_QA | Accuracy | SLC |
| | 3-7 | Criminal Damages Calculation | LAIC2021 | Accuracy | Regression |
| | 3-8 | Consultation | hualv.com | ROUGE-L | Generation |

SLC: single-label classification; MLC: multi-label classification.

Data Format

The data is stored under the data folder. Every task is stored in a <task_id>.json file. The JSON file can be loaded via json.load as a list of dictionaries. The data format is as follows (using task 3-2 as an example):

[
  {
    "instruction": "请根据具体场景与问题给出法律依据,只需要给出具体法条内容,每个场景仅涉及一个法条。",
    "question": "场景:某个地区的三个以上专业农民合作社想要出资设立农民专业合作社联合社,以提高其在市场中的竞争力和规模效应。根据哪条法律,三个以上的农民专业合作社可以出资设立农民专业合作社联合社?",
    "answer": "根据《农民专业合作社法》第五十六条,三个以上的农民专业合作社在自愿的基础上,可以出资设立农民专业合作社联合社。该联合社应当有自己的名称、组织机构和住所,由联合社全体成员制定并承认的章程,以及符合章程规定的成员出资。"
  },
]
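
For example, a task file can be loaded as follows (a minimal sketch; the path assumes the repository root as the working directory):

import json

# Load all 500 examples of task 3-2 as a list of dictionaries.
with open("data/3-2.json", encoding="utf-8") as f:
    examples = json.load(f)

# Each example has the fields "instruction", "question" and "answer".
print(examples[0]["instruction"])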

Model Output Format

The model outputs are stored under the predictions/zero_shot and predictions/one_shot folders. Every system has its own subfolder. Within each subfolder, task predictions are stored in <task_id>.json files. The JSON file can be loaded via json.load as a dictionary. The data format is as follows (using task 3-2 from the GPT-4 zero-shot predictions as an example):

{
    "0": {
        "origin_prompt": [
            {
                "role": "HUMAN",
                "prompt": "请根据具体场景与问题给出法律依据,只需要给出具体法条内容,每个场景仅涉及一个法条。\n场景:某个地区的三个以上专业农民合作社想要出资设立农民专业合作社联合社,以提高其在市场中的竞争力和规模效应。根据哪条法律,三个以上的农民专业合作社可以出资设立农民专业合作社联合社?"
            }
        ],
        "prediction": "根据《中华人民共和国农民专业合作社法》第十七条:“三个以上的农民专业合作社可以出资设立农民专业合作社联合社。”",
        "refr": "根据《农民专业合作社法》第五十六条,三个以上的农民专业合作社在自愿的基础上,可以出资设立农民专业合作社联合社。该联合社应当有自己的名称、组织机构和住所,由联合社全体成员制定并承认的章程,以及符合章程规定的成员出资。"
    },
    ...
}

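A prediction file can be paired with its reference answers in a similar way (a minimal sketch; the subfolder name GPT-4 is illustrative and may differ from the actual folder name under predictions/zero_shot):

import json

# Load one system's predictions for task 3-2.
with open("predictions/zero_shot/GPT-4/3-2.json", encoding="utf-8") as f:
    outputs = json.load(f)

# Every entry pairs a model prediction with its reference answer ("refr").
for idx, record in outputs.items():
    prediction, reference = record["prediction"], record["refr"]
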
📖 Model List

We test 51 popular large language models. They are grouped as shown in the following table:

| Model | Parameters | SFT | RLHF | Access | Base Model |
|---|---|---|---|---|---|
| Multilingual LLMs | | | | | |
| MPT | 7B | | | Weights | - |
| MPT-Instruct | 7B | | | Weights | MPT-7B |
| LLaMA | 7/13/30/65B | | | Weights | - |
| LLaMA-2 | 7/13/70B | | | Weights | - |
| LLaMA-2-Chat | 7/13/70B | | | Weights | LLaMA-2-7/13/70B |
| Alpaca-v1.0 | 7B | | | Weights | LLaMA-7B |
| Vicuna-v1.3 | 7/13/33B | | | Weights | LLaMA-7/13/33B |
| WizardLM | 7B | | | Weights | LLaMA-7B |
| StableBeluga2 | 70B | | | Weights | LLaMA-2-70B |
| ChatGPT | N/A | | | API | - |
| GPT-4 | N/A | | | API | - |
| Chinese-oriented LLMs | | | | | |
| MOSS-Moon | 16B | | | Weights | - |
| MOSS-Moon-SFT | 16B | | | Weights | MOSS-Moon |
| TigerBot-Base | 7B | | | Weights | - |
| TigerBot-SFT | 7B | | | Weights | TigerBot-Base |
| GoGPT | 7B | | | Weights | LLaMA-7B |
| ChatGLM2 | 6B | | | Weights | ChatGLM |
| Ziya-LLaMA | 13B | | | Weights | LLaMA-13B |
| Baichuan | 7/13B | | | Weights | - |
| Baichuan-13B-Chat | 13B | | | Weights | Baichuan-13B |
| XVERSE | 13B | | | Weights | - |
| InternLM | 7B | | | Weights | - |
| InternLM-Chat | 7B | | | Weights | InternLM-7B |
| InternLM-Chat-7B-8K | 7B | | | Weights | InternLM-7B |
| Qwen | 7B | | | Weights | - |
| Qwen-7B-Chat | 7B | | | Weights | Qwen-7B |
| Yulan-Chat-2 | 13B | | | Weights | LLaMA-2-13B |
| BELLE-LLaMA-2 | 13B | | | Weights | LLaMA-2-13B |
| Chinese-LLaMA-2 | 7B | | | Weights | LLaMA-2-7B |
| Chinese-Alpaca-2 | 7B | | | Weights | LLaMA-2-7B |
| LLaMA-2-Chinese | 7/13B | | | Weights | LLaMA-2-7/13B |
| Legal-specific LLMs | | | | | |
| HanFei | 7B | | | Weights | HanFei |
| LaWGPT-7B-beta1.0 | 7B | | | Weights | Chinese-LLaMA |
| LaWGPT-7B-beta1.1 | 7B | | | Weights | Chinese-alpaca-plus-7B |
| LexiLaw | 6B | | | Weights | ChatGLM-6B |
| Wisdom-Interrogatory | 7B | | | Weights | Baichuan-7B |
| Fuzi-Mingcha | 6B | | | Weights | ChatGLM-6B |
| Lawyer-LLaMA | 13B | | | Weights | LLaMA |
| ChatLaw | 13/33B | | | Weights | Ziya-LLaMA-13B/Anima-33B |

📊 Model Performance

We test model performance under 2 scenarios: (1) zero-shot, where only the instruction is provided in the prompt, and (2) one-shot, where the instruction and a one-shot example are concatenated in the prompt.
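
As an illustration only (not the official prompt builder, whose exact formatting may differ), the two settings can be sketched as follows:

# A minimal sketch of how zero-shot and one-shot prompts could be assembled
# from the dataset fields; the official separators/templates may differ.
def build_prompt(example, shot=None):
    prompt = example["instruction"] + "\n"
    if shot is not None:  # one-shot: prepend one solved example
        prompt += shot["question"] + "\n" + shot["answer"] + "\n"
    prompt += example["question"]
    return prompt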

Zero-shot Performance

Average performance (zero-shot) of 51 LLMs evaluated on LawBench.

We show the performance of the top-5 models with the highest average scores.

Note: gpt-3.5-turbo refers to the 2023.6.13 version; all gpt-3.5-turbo results below are from this version.

| Task ID | GPT-4 | GPT-3.5-turbo | StableBeluga2 | qwen-7b-chat | internlm-chat-7b-8k |
|---|---|---|---|---|---|
| AVG | 52.35 | 42.15 | 39.23 | 37.00 | 35.73 |
| 1-1 | 15.38 | 15.86 | 14.58 | 18.54 | 15.45 |
| 1-2 | 55.20 | 36.00 | 34.60 | 34.00 | 40.40 |
| 2-1 | 12.53 | 9.10 | 7.70 | 22.56 | 22.64 |
| 2-2 | 41.65 | 32.37 | 25.57 | 27.42 | 35.46 |
| 2-3 | 69.79 | 51.73 | 44.20 | 31.42 | 28.96 |
| 2-4 | 44.00 | 41.20 | 39.00 | 35.00 | 35.60 |
| 2-5 | 56.50 | 53.75 | 52.03 | 48.48 | 54.13 |
| 2-6 | 76.60 | 69.55 | 65.54 | 37.88 | 17.95 |
| 2-7 | 37.92 | 33.49 | 39.07 | 36.04 | 27.11 |
| 2-8 | 61.20 | 36.40 | 45.80 | 24.00 | 36.20 |
| 2-9 | 78.82 | 66.48 | 65.27 | 44.88 | 62.93 |
| 2-10 | 65.09 | 39.05 | 41.64 | 18.90 | 20.94 |
| 3-1 | 52.47 | 29.50 | 16.41 | 44.62 | 34.86 |
| 3-2 | 27.54 | 31.30 | 24.52 | 33.50 | 19.11 |
| 3-3 | 41.99 | 35.52 | 22.82 | 40.67 | 41.05 |
| 3-4 | 82.62 | 78.75 | 76.06 | 76.74 | 63.21 |
| 3-5 | 81.91 | 76.84 | 65.35 | 77.19 | 67.20 |
| 3-6 | 48.60 | 27.40 | 34.40 | 26.80 | 34.20 |
| 3-7 | 77.60 | 61.20 | 56.60 | 42.00 | 43.80 |
| 3-8 | 19.65 | 17.45 | 13.39 | 19.32 | 13.37 |

One-Shot Performance

Average performance (one-shot) of 51 LLMs evaluated on LawBench.

We show the performance of the top-5 models with the highest average scores.

| Task ID | GPT-4 | GPT-3.5-turbo | qwen-7b-chat | StableBeluga2 | internlm-chat-7b-8k |
|---|---|---|---|---|---|
| AVG | 53.85 | 44.52 | 38.99 | 38.97 | 37.28 |
| 1-1 | 17.21 | 16.15 | 17.73 | 15.03 | 15.16 |
| 1-2 | 54.80 | 37.20 | 28.60 | 36.00 | 40.60 |
| 2-1 | 18.31 | 13.50 | 25.16 | 8.93 | 21.64 |
| 2-2 | 46.00 | 40.60 | 27.40 | 15.00 | 36.60 |
| 2-3 | 69.99 | 54.01 | 32.96 | 41.76 | 30.91 |
| 2-4 | 44.40 | 41.40 | 31.20 | 38.00 | 33.20 |
| 2-5 | 64.80 | 61.98 | 46.71 | 53.55 | 54.35 |
| 2-6 | 79.96 | 74.04 | 57.34 | 64.99 | 26.86 |
| 2-7 | 40.52 | 40.68 | 42.58 | 45.06 | 30.56 |
| 2-8 | 59.00 | 37.40 | 26.80 | 37.60 | 30.60 |
| 2-9 | 76.55 | 67.59 | 50.63 | 65.89 | 63.42 |
| 2-10 | 65.26 | 40.04 | 21.27 | 40.54 | 20.69 |
| 3-1 | 53.20 | 30.81 | 52.86 | 16.87 | 38.88 |
| 3-2 | 33.15 | 34.49 | 34.49 | 32.44 | 28.70 |
| 3-3 | 41.30 | 34.55 | 39.91 | 23.07 | 42.25 |
| 3-4 | 83.21 | 77.12 | 78.47 | 75.80 | 67.74 |
| 3-5 | 82.74 | 73.72 | 73.92 | 63.59 | 71.10 |
| 3-6 | 49.60 | 31.60 | 26.80 | 33.00 | 36.20 |
| 3-7 | 77.00 | 66.40 | 44.60 | 56.00 | 44.00 |
| 3-8 | 19.90 | 17.17 | 20.39 | 16.24 | 12.11 |

🛠️ How to Evaluate Models

We design different rule-based parsers to extract answers from model predictions. The evaluation script for every task is in evaluation/evaluation_functions.
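
As an illustration (not the actual evaluation code), a rule-based parser for a single-label choice task might look like the sketch below; the function name and regular expression are hypothetical:

import re

# Hypothetical example of a rule-based parser for a single-label choice task.
# The real parsers in evaluation/evaluation_functions are task-specific.
def extract_choice(prediction):
    match = re.search(r"[ABCD]", prediction.upper())
    # A missing match counts toward the abstention rate.
    return match.group(0) if match else None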

Steps

The steps to evaluate the model predictions are as below:

  1. Put the prediction results from all systems under a folder F. Every system has its own subfolder.
  2. Under each system's subfolder, store one prediction file per task, named by its task id.
  3. Enter the evaluation folder and run "python main.py -i F -o <metric_result>".

The expected folder layout is as below:

data/
├── system-1
│   ├── 1-1.json
│   ├── 1-2.json
│   ├── ...
├── system-2
│   ├── 1-1.json
│   ├── 1-2.json
│   ├── ...
├── ...

The output result will be saved in <metric_result>.

For example, the zero-shot predictions from the 51 tested models are saved in predictions/zero_shot. You can run

cd evaluation
python main.py -i ../predictions/zero_shot -o ../predictions/zero_shot/results.csv

to get their evaluation results stored as ../predictions/zero_shot/results.csv.

Result Format

The result file is a CSV file with four columns: task, model_name, score, and abstention_rate:

| Column | Description |
|---|---|
| task | Task name, set as the name of the prediction file. |
| model_name | Model name, set as the name of the folder storing the prediction files. |
| score | Model score for the corresponding task. |
| abstention_rate | Abstention rate for the corresponding task. This rate indicates how often an answer cannot be extracted from the model prediction. |
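
The result file can then be inspected, for example, like this (a minimal sketch; pandas is not in the requirements list, and the path assumes the evaluation folder as the working directory):

import pandas as pd

# Illustrative only: load the CSV produced by main.py.
results = pd.read_csv("../predictions/zero_shot/results.csv")

# Pivot into a model x task score matrix and rank models by average score.
scores = results.pivot(index="model_name", columns="task", values="score")
print(scores.mean(axis=1).sort_values(ascending=False))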

Requirements

rouge_chinese==1.0.3
cn2an==0.5.22
ltp==4.2.13
OpenCC==1.1.6
python-Levenshtein==0.21.1
pypinyin==0.49.0
tqdm==4.64.1
timeout_decorator==0.5.0
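
As an illustration of how the ROUGE-L scores used by several tasks can be computed with rouge_chinese (a minimal sketch; the character-level tokenization is an assumption, and the official evaluation scripts may segment text differently):

from rouge_chinese import Rouge

def rouge_l_f(prediction, reference):
    # Character-level tokenization for illustration only; the official
    # evaluation code may use a word segmenter instead.
    hyp = " ".join(list(prediction))
    ref = " ".join(list(reference))
    scores = Rouge().get_scores(hyp, ref)
    return scores[0]["rouge-l"]["f"]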

📌 Licenses

LawBench is a mix of created and transformed datasets. We ask that you follow the license of each original dataset creator. Please see the task list for the original source of each task.

🔜 Future Plan

  • ROUGE-L is not a good metric for evaluating long-form generation results. We will explore using large language model-based evaluation metrics dedicated to law tasks.
  • We will keep updating the task list included in LawBench. We welcome external contributors to collaborate with us.

🖊️ Citation

@article{fei2023lawbench,
  title={LawBench: Benchmarking Legal Knowledge of Large Language Models},
  author={Fei, Zhiwei and Shen, Xiaoyu and Zhu, Dawei and Zhou, Fengzhe and Han, Zhuo and Zhang, Songyang and Chen, Kai and Shen, Zongwen and Ge, Jidong},
  journal={arXiv preprint arXiv:2309.16289},
  year={2023}
}

If you have law datasets that you would like to include, or if you would like to evaluate your own models, feel free to contact us.