🌐 Website • 🤗 Hugging Face • ⏬ Data • 📃 Paper
Large language models (LLMs) have demonstrated strong capabilities in various aspects. However, when applying them to the highly specialized, safety-critical legal domain, it is unclear how much legal knowledge they possess and whether they can reliably perform law-related tasks. To address this gap, we propose LawBench, a comprehensive evaluation benchmark. Please refer to our paper for more details.
Tasks in LawBench are based on the Chinese legal system. A similar benchmark based on the American legal system is available here.
LawBench has been meticulously crafted to provide a precise assessment of LLMs' legal capabilities. In designing the test tasks, we simulated three dimensions of judicial cognition and selected 20 tasks to evaluate the abilities of large models. Compared to some existing benchmarks that contain only multiple-choice questions, we include more diverse task types closely related to real-world applications, such as legal entity recognition, reading comprehension, criminal damages calculation and legal consultation. We recognize that, due to their safety policies, current large models may decline to respond to certain legal queries, or may have difficulty comprehending the instructions, leading to a lack of response. We therefore develop a separate evaluation metric, the "abstention rate", which measures how often a model refuses to provide an answer or fails to understand the instructions properly. We report the performances of 51 large language models on LawBench, including 20 multilingual LLMs, 22 Chinese-oriented LLMs and 9 legal-specific LLMs.
Our dataset includes 20 diverse tasks covering three cognitive levels:
- Legal knowledge memorization: whether large language models can memorize necessary legal concepts, terminologies, articles and facts in their parameters.
- Legal knowledge understanding: whether large language models can comprehend entities, events, and relationships within legal texts, so as to understand the meanings and connotations of legal text.
- Legal knowledge applying: whether large language models can properly utilize their legal knowledge and reason over it to solve realistic legal tasks in downstream applications.
The following table lists the included tasks. Every task has 500 examples.
| Cognitive Level | ID | Tasks | Data Sources | Metrics | Type |
|---|---|---|---|---|---|
| Legal Knowledge Memorization | 1-1 | Article Recitation | FLK | ROUGE-L | Generation |
| | 1-2 | Knowledge Question Answering | JEC_QA | Accuracy | SLC |
| Legal Knowledge Understanding | 2-1 | Document Proofreading | CAIL2022 | F0.5 | Generation |
| | 2-2 | Dispute Focus Identification | LAIC2021 | F1 | MLC |
| | 2-3 | Marital Disputes Identification | AIStudio | F1 | MLC |
| | 2-4 | Issue Topic Identification | CrimeKgAssitant | Accuracy | SLC |
| | 2-5 | Reading Comprehension | CAIL2019 | rc-F1 | Extraction |
| | 2-6 | Named Entity Recognition | CAIL2021 | soft-F1 | Extraction |
| | 2-7 | Opinion Summarization | CAIL2022 | ROUGE-L | Generation |
| | 2-8 | Argument Mining | CAIL2022 | Accuracy | SLC |
| | 2-9 | Event Detection | LEVEN | F1 | MLC |
| | 2-10 | Trigger Word Extraction | LEVEN | soft-F1 | Extraction |
| Legal Knowledge Applying | 3-1 | Fact-based Article Prediction | CAIL2018 | F1 | MLC |
| | 3-2 | Scene-based Article Prediction | LawGPT_zh Project | ROUGE-L | Generation |
| | 3-3 | Charge Prediction | CAIL2018 | F1 | MLC |
| | 3-4 | Prison Term Prediction w/o Article | CAIL2018 | Normalized log-distance | Regression |
| | 3-5 | Prison Term Prediction w/ Article | CAIL2018 | Normalized log-distance | Regression |
| | 3-6 | Case Analysis | JEC_QA | Accuracy | SLC |
| | 3-7 | Criminal Damages Calculation | LAIC2021 | Accuracy | Regression |
| | 3-8 | Consultation | hualv.com | ROUGE-L | Generation |
The data is stored under the data folder. Every task is stored in a <task_id>.json file. The JSON file can be loaded via json.load as a list of dictionaries. The data format is as follows (using task 3-2 as an example):
[
{
"instruction": "请根据具体场景与问题给出法律依据,只需要给出具体法条内容,每个场景仅涉及一个法条。",
"question": "场景:某个地区的三个以上专业农民合作社想要出资设立农民专业合作社联合社,以提高其在市场中的竞争力和规模效应。根据哪条法律,三个以上的农民专业合作社可以出资设立农民专业合作社联合社?",
"answer": "根据《农民专业合作社法》第五十六条,三个以上的农民专业合作社在自愿的基础上,可以出资设立农民专业合作社联合社。该联合社应当有自己的名称、组织机构和住所,由联合社全体成员制定并承认的章程,以及符合章程规定的成员出资。"
},
]
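As a quick sanity check, the following minimal sketch loads one task file with the standard json module (the path simply follows the data layout described above):

```python
import json

# Load all examples of task 3-2 (scene-based article prediction).
with open("data/3-2.json", encoding="utf-8") as f:
    examples = json.load(f)          # a list of dictionaries

print(len(examples))                 # 500 examples per task
first = examples[0]
print(first["instruction"])          # task instruction
print(first["question"])             # input question
print(first["answer"])               # reference answer
```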
The model outputs are stored under the predictions/zero_shot and predictions/one_shot folders. Every system has its own subfolder. Within each subfolder, the predictions for every task are stored in a <task_id>.json file. The JSON file can be loaded via json.load as a dictionary. The data format is as follows (using task 3-2 from the GPT-4 zero-shot predictions as an example):
{
"0": {
"origin_prompt": [
{
"role": "HUMAN",
"prompt": "请根据具体场景与问题给出法律依据,只需要给出具体法条内容,每个场景仅涉及一个法条。\n场景:某个地区的三个以上专业农民合作社想要出资设立农民专业合作社联合社,以提高其在市场中的竞争力和规模效应。根据哪条法律,三个以上的农民专业合作社可以出资设立农民专业合作社联合社?"
}
],
"prediction": "根据《中华人民共和国农民专业合作社法》第十七条:“三个以上的农民专业合作社可以出资设立农民专业合作社联合社。”",
"refr": "根据《农民专业合作社法》第五十六条,三个以上的农民专业合作社在自愿的基础上,可以出资设立农民专业合作社联合社。该联合社应当有自己的名称、组织机构和住所,由联合社全体成员制定并承认的章程,以及符合章程规定的成员出资。"
},
...
}
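Prediction files can be paired with their references in the same way. The sketch below scores one file with ROUGE-L via the rouge_chinese package listed in the requirements; the subfolder name GPT4 and the character-level tokenization are assumptions for illustration only, and official scores should be produced with the evaluation scripts described below.

```python
import json
from rouge_chinese import Rouge  # see the requirements list below

# NOTE: the subfolder name "GPT4" is an assumption; use the actual system folder name.
with open("predictions/zero_shot/GPT4/3-2.json", encoding="utf-8") as f:
    preds = json.load(f)             # a dict keyed by example index

rouge = Rouge()
for idx, item in preds.items():
    # rouge_chinese expects space-separated tokens; character-level splitting is
    # used here for illustration and may differ from the official evaluation.
    hyp = " ".join(item["prediction"]) or "."
    ref = " ".join(item["refr"])
    score = rouge.get_scores(hyp, ref)[0]["rouge-l"]["f"]
    print(idx, round(score, 4))
```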
We test 51 popular large language models. We group them as in the following table:
| Model | Parameters | SFT | RLHF | Access | Base Model |
|---|---|---|---|---|---|
| Multilingual LLMs | | | | | |
| MPT | 7B | ✗ | ✗ | Weights | - |
| MPT-Instruct | 7B | ✓ | ✗ | Weights | MPT-7B |
| LLaMA | 7/13/30/65B | ✗ | ✗ | Weights | - |
| LLaMA-2 | 7/13/70B | ✗ | ✗ | Weights | - |
| LLaMA-2-Chat | 7/13/70B | ✓ | ✓ | Weights | LLaMA-2-7/13/70B |
| Alpaca-v1.0 | 7B | ✓ | ✗ | Weights | LLaMA-7B |
| Vicuna-v1.3 | 7/13/33B | ✓ | ✗ | Weights | LLaMA-7/13/33B |
| WizardLM | 7B | ✓ | ✗ | Weights | LLaMA-7B |
| StableBeluga2 | 70B | ✓ | ✗ | Weights | LLaMA-2-70B |
| ChatGPT | N/A | ✓ | ✓ | API | - |
| GPT-4 | N/A | ✓ | ✓ | API | - |
| Chinese-oriented LLMs | | | | | |
| MOSS-Moon | 16B | ✗ | ✗ | Weights | - |
| MOSS-Moon-SFT | 16B | ✓ | ✗ | Weights | MOSS-Moon |
| TigerBot-Base | 7B | ✗ | ✗ | Weights | - |
| TigerBot-SFT | 7B | ✓ | ✗ | Weights | TigerBot-Base |
| GoGPT | 7B | ✓ | ✗ | Weights | LLaMA-7B |
| ChatGLM2 | 6B | ✓ | ✗ | Weights | ChatGLM |
| Ziya-LLaMA | 13B | ✓ | ✓ | Weights | LLaMA-13B |
| Baichuan | 7/13B | ✗ | ✗ | Weights | - |
| Baichuan-13B-Chat | 13B | ✓ | ✗ | Weights | Baichuan-13B |
| XVERSE | 13B | ✗ | ✗ | Weights | - |
| InternLM | 7B | ✗ | ✗ | Weights | - |
| InternLM-Chat | 7B | ✓ | ✗ | Weights | InternLM-7B |
| InternLM-Chat-7B-8K | 7B | ✓ | ✗ | Weights | InternLM-7B |
| Qwen | 7B | ✗ | ✗ | Weights | - |
| Qwen-7B-Chat | 7B | ✓ | ✗ | Weights | Qwen-7B |
| Yulan-Chat-2 | 13B | ✓ | ✗ | Weights | LLaMA-2-13B |
| BELLE-LLaMA-2 | 13B | ✓ | ✗ | Weights | LLaMA-2-13B |
| Chinese-LLaMA-2 | 7B | ✓ | ✗ | Weights | LLaMA-2-7B |
| Chinese-Alpaca-2 | 7B | ✓ | ✗ | Weights | LLaMA-2-7B |
| LLaMA-2-Chinese | 7/13B | ✓ | ✗ | Weights | LLaMA-2-7/13B |
| Legal Specific LLMs | | | | | |
| HanFei | 7B | ✓ | ✗ | Weights | HanFei |
| LaWGPT-7B-beta1.0 | 7B | ✓ | ✗ | Weights | Chinese-LLaMA |
| LaWGPT-7B-beta1.1 | 7B | ✓ | ✗ | Weights | Chinese-alpaca-plus-7B |
| LexiLaw | 6B | ✓ | ✗ | Weights | ChatGLM-6B |
| Wisdom-Interrogatory | 7B | ✓ | ✗ | Weights | Baichuan-7B |
| Fuzi-Mingcha | 6B | ✓ | ✗ | Weights | ChatGLM-6B |
| Lawyer-LLaMA | 13B | ✓ | ✗ | Weights | LLaMA |
| ChatLaw | 13/33B | ✓ | ✗ | Weights | Ziya-LLaMA-13B/Anima-33B |
We test the model performance under two scenarios: (1) zero-shot, where only instructions are provided in the prompt, and (2) one-shot, where the instruction and one worked example are concatenated in the prompt.
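The exact prompt templates are defined in the prediction pipeline and may differ from what is shown here; the sketch below (with build_prompt as a purely illustrative helper, not part of the released code) only shows how the two settings differ, assuming the task data format shown earlier:

```python
import json

def build_prompt(example, demo=None):
    """Illustrative only; the actual LawBench prompt templates may differ."""
    if demo is None:
        # zero-shot: instruction followed by the test question
        return f"{example['instruction']}\n{example['question']}"
    # one-shot: instruction, one solved example, then the test question
    return (f"{example['instruction']}\n"
            f"{demo['question']}\n{demo['answer']}\n"
            f"{example['question']}")

with open("data/3-2.json", encoding="utf-8") as f:
    data = json.load(f)

zero_shot_prompt = build_prompt(data[1])
one_shot_prompt = build_prompt(data[1], demo=data[0])
```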
Average performance (zero-shot) of 51 LLMs evaluated on LawBench.
We show the performances of top-5 models with the highest average scores.
Note: GPT-3.5-turbo refers to the 2023.06.13 version; all GPT-3.5-turbo results below are from this version.
Task ID | GPT4 | GPT-3.5-turbo | StableBeluga2 | qwen-7b-chat | internlm-chat-7b-8k |
---|---|---|---|---|---|
AVG | 52.35 | 42.15 | 39.23 | 37.00 | 35.73 |
1-1 | 15.38 | 15.86 | 14.58 | 18.54 | 15.45 |
1-2 | 55.20 | 36.00 | 34.60 | 34.00 | 40.40 |
2-1 | 12.53 | 9.10 | 7.70 | 22.56 | 22.64 |
2-2 | 41.65 | 32.37 | 25.57 | 27.42 | 35.46 |
2-3 | 69.79 | 51.73 | 44.20 | 31.42 | 28.96 |
2-4 | 44.00 | 41.20 | 39.00 | 35.00 | 35.60 |
2-5 | 56.50 | 53.75 | 52.03 | 48.48 | 54.13 |
2-6 | 76.60 | 69.55 | 65.54 | 37.88 | 17.95 |
2-7 | 37.92 | 33.49 | 39.07 | 36.04 | 27.11 |
2-8 | 61.20 | 36.40 | 45.80 | 24.00 | 36.20 |
2-9 | 78.82 | 66.48 | 65.27 | 44.88 | 62.93 |
2-10 | 65.09 | 39.05 | 41.64 | 18.90 | 20.94 |
3-1 | 52.47 | 29.50 | 16.41 | 44.62 | 34.86 |
3-2 | 27.54 | 31.30 | 24.52 | 33.50 | 19.11 |
3-3 | 41.99 | 35.52 | 22.82 | 40.67 | 41.05 |
3-4 | 82.62 | 78.75 | 76.06 | 76.74 | 63.21 |
3-5 | 81.91 | 76.84 | 65.35 | 77.19 | 67.20 |
3-6 | 48.60 | 27.40 | 34.40 | 26.80 | 34.20 |
3-7 | 77.60 | 61.20 | 56.60 | 42.00 | 43.80 |
3-8 | 19.65 | 17.45 | 13.39 | 19.32 | 13.37 |
Average performance (one-shot) of 51 LLMs evaluated on LawBench.
We show the performances of top-5 models with the highest average scores.
Task ID | GPT4 | GPT-3.5-turbo | qwen-7b-chat | StableBeluga2 | internlm-chat-7b-8k |
---|---|---|---|---|---|
AVG | 53.85 | 44.52 | 38.99 | 38.97 | 37.28 |
1-1 | 17.21 | 16.15 | 17.73 | 15.03 | 15.16 |
1-2 | 54.80 | 37.20 | 28.60 | 36.00 | 40.60 |
2-1 | 18.31 | 13.50 | 25.16 | 8.93 | 21.64 |
2-2 | 46.00 | 40.60 | 27.40 | 15.00 | 36.60 |
2-3 | 69.99 | 54.01 | 32.96 | 41.76 | 30.91 |
2-4 | 44.40 | 41.40 | 31.20 | 38.00 | 33.20 |
2-5 | 64.80 | 61.98 | 46.71 | 53.55 | 54.35 |
2-6 | 79.96 | 74.04 | 57.34 | 64.99 | 26.86 |
2-7 | 40.52 | 40.68 | 42.58 | 45.06 | 30.56 |
2-8 | 59.00 | 37.40 | 26.80 | 37.60 | 30.60 |
2-9 | 76.55 | 67.59 | 50.63 | 65.89 | 63.42 |
2-10 | 65.26 | 40.04 | 21.27 | 40.54 | 20.69 |
3-1 | 53.20 | 30.81 | 52.86 | 16.87 | 38.88 |
3-2 | 33.15 | 34.49 | 34.49 | 32.44 | 28.70 |
3-3 | 41.30 | 34.55 | 39.91 | 23.07 | 42.25 |
3-4 | 83.21 | 77.12 | 78.47 | 75.80 | 67.74 |
3-5 | 82.74 | 73.72 | 73.92 | 63.59 | 71.10 |
3-6 | 49.60 | 31.60 | 26.80 | 33.00 | 36.20 |
3-7 | 77.00 | 66.40 | 44.60 | 56.00 | 44.00 |
3-8 | 19.90 | 17.17 | 20.39 | 16.24 | 12.11 |
We design task-specific rule-based parsing functions to extract answers from model predictions. The evaluation script for each task is in evaluation/evaluation_functions.
The steps to evaluate the model predictions are as below:
- Put prediction results from all systems under a folder F. Every system has one subfolder.
- Under the subfolder of every system, every task has a prediction file. The name of every task is the task id.
- Enter the evaluation folder and run "python main.py -i F -o <metric_result>"
The expected folder structure is as below:
data/
├── system-1
│ ├── 1-1.json
│ ├── 1-2.json
│ ├── ...
├── system-2
│ ├── 1-1.json
│ ├── 1-2.json
│ ├── ...
├── ...
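Before running the evaluation, a quick check such as the hypothetical helper below (not part of the released code) can verify that every system subfolder contains one prediction file per task ID:

```python
from pathlib import Path

# The 20 task IDs listed in the task table above.
EXPECTED_TASKS = {f"{level}-{i}" for level, n in ((1, 2), (2, 10), (3, 8))
                  for i in range(1, n + 1)}

root = Path("predictions/zero_shot")          # the folder F passed to main.py
for system_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    present = {f.stem for f in system_dir.glob("*.json")}
    missing = EXPECTED_TASKS - present
    if missing:
        print(f"{system_dir.name}: missing {sorted(missing)}")
```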
The output result will be saved in <metric_result>.
For example, the zero-shot predictions from the 51 tested models are saved in predictions/zero_shot. You can run
cd evaluation
python main.py -i ../predictions/zero_shot -o ../predictions/zero_shot/results.csv
to get their evaluation results stored as ../predictions/zero_shot/results.csv.
The result file is a CSV file with four columns: task, model_name, score and abstention_rate:
Column | Description |
---|---|
task | Task name. Set as the name of the prediction file |
model_name | Model name. Set as the name of the folder storing the prediction files |
score | Model score for the corresponding task. |
abstention_rate | Abstention rate for the corresponding task. This rate indicates how often the answer cannot be extracted from the model prediction. |
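Assuming the column layout above and that score is written as a plain number, a small sketch like the following aggregates the per-task results into per-model averages:

```python
import csv
from collections import defaultdict

scores = defaultdict(list)
with open("predictions/zero_shot/results.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        scores[row["model_name"]].append(float(row["score"]))

# Rank models by their average score over all 20 tasks.
for model, vals in sorted(scores.items(),
                          key=lambda kv: sum(kv[1]) / len(kv[1]), reverse=True):
    print(f"{model}: {sum(vals) / len(vals):.2f}")
```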
The evaluation scripts depend on the following Python packages:
rouge_chinese==1.0.3
cn2an==0.5.22
ltp==4.2.13
OpenCC==1.1.6
python-Levenshtein==0.21.1
pypinyin==0.49.0
tqdm==4.64.1
timeout_decorator==0.5.0
LawBench is a mix of newly created and transformed datasets. We ask that you follow the licenses of the original dataset creators. Please see the task list for the original source of each task.
- ROUGE-L is not a good metric for evaluating long-form generation results. We will explore large language model-based evaluation metrics dedicated to legal tasks.
- We will keep updating the task list included in LawBench. We welcome collaboration with external contributors.
@article{fei2023lawbench,
title={LawBench: Benchmarking Legal Knowledge of Large Language Models},
author={Fei, Zhiwei and Shen, Xiaoyu and Zhu, Dawei and Zhou, Fengzhe and Han, Zhuo and Zhang, Songyang and Chen, Kai and Shen, Zongwen and Ge, Jidong},
journal={arXiv preprint arXiv:2309.16289},
year={2023}
}
If you have legal datasets that you would like us to include, or you would like to evaluate your own models, feel free to contact us.