Cocoruta may reproduce biases and prejudices inherent in the legal documents used for its training, which include older legislation. Users should exercise caution when interpreting the model's outputs, especially in contexts that require up-to-date legal perspectives or that may involve underrepresented groups. We observed that Cocoruta, although less proficient than larger models at handling utterances, imparts a legal bias to its interactions.
Cocoruta is a specialized large language model fine-tuned for legal document-based Question Answering (Q&A), developed to address legal queries related to the "Blue Amazon", a term used to describe Brazil's extensive maritime territory. Cocoruta 1.0 is based on the LLaMa 2-7B model, fine-tuned on a corpus of 68,991 legal documents totaling 28.4 million tokens. Despite having fewer parameters than larger models, Cocoruta demonstrates competitive performance in domain-specific legal discourse.
- Parameter count: 7B (LLaMa 2-7B)
- Training data: 28.4 million tokens from 68,991 legal documents
- Training epochs: 15
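To put the corpus figures above in perspective, the following minimal sketch derives two rough quantities from them: the average document length and the number of tokens processed over training. It assumes that each epoch is one full pass over the 28.4 million tokens, which the text does not state explicitly, and the function names are illustrative only.

```python
# Figures quoted in the model details; the token count is the rounded
# value given in the text, so derived numbers are approximate.
TOTAL_TOKENS = 28_400_000
TOTAL_DOCUMENTS = 68_991
EPOCHS = 15


def average_tokens_per_document(tokens: int, documents: int) -> float:
    """Mean document length in tokens."""
    return tokens / documents


def tokens_processed(tokens: int, epochs: int) -> int:
    """Tokens seen during training, ASSUMING one full corpus pass per epoch."""
    return tokens * epochs


avg_tokens = average_tokens_per_document(TOTAL_TOKENS, TOTAL_DOCUMENTS)
total_seen = tokens_processed(TOTAL_TOKENS, EPOCHS)

print(f"Average tokens per document: {avg_tokens:.1f}")
print(f"Tokens processed over {EPOCHS} epochs: {total_seen:,}")
```

The average of roughly 400 tokens per document suggests that the corpus consists mostly of short legal texts or excerpts rather than full-length statutes.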
Cocoruta has been evaluated using multiple automatic metrics to measure its effectiveness in generating accurate and relevant legal content. The model performed as follows:
- BLEU: 61.2
- ROUGE-N: 79.2
- BERTScore: 91.2
- MoverScore: 76.5
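As an illustration of what an n-gram overlap metric such as ROUGE-N measures, here is a minimal sketch of ROUGE-N recall in plain Python. It is not the evaluation code used for Cocoruta: the whitespace tokenization, the lowercasing, and the function names are assumptions made for this example, and the scores listed above appear to be on a 0-100 scale whereas this sketch returns a value between 0 and 1.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of reference n-grams that also occur in the candidate.

    Matches are clipped, so an n-gram repeated in the candidate is credited
    at most as many times as it appears in the reference.
    """
    # Simplifying assumption: lowercase and split on whitespace.
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    return overlap / total


reference = "the law protects the sea"
candidate = "the law protects the ocean"
print(rouge_n_recall(candidate, reference, n=1))
print(rouge_n_recall(candidate, reference, n=2))
```

Surface-overlap metrics like this one reward shared wording, while embedding-based metrics such as BERTScore and MoverScore compare meaning, which is one reason to report both families side by side.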
The qualitative evaluation showed the utility of fine-tuning: answers aligned with legal discourse were more frequent from Cocoruta than from larger models. The larger models were more proficient overall and delivered well-structured answers. However, for questions not directly related to the legal context, their responses did not maintain legal discourse.
- Adherence to legal discourse: 74%
- Correct answers: 68%
- Inappropriate discourse: 51%
- Docker installed
- Windows or Linux OS
git clone https://github.com/C4AI/cocoruta.git
cd cocoruta
docker compose up
If you use Cocoruta in your research, please cite the following paper:
@inproceedings{2024cocoruta,
author={do Espírito Santo, Felipe Oliveira and Marques Peres, Sarajane and de Sousa Gramacho, Givanildo and Alves Franco Brandão, Anarosa and Cozman, Fabio Gagliardi},
booktitle={2024 International Joint Conference on Neural Networks (IJCNN)},
title={Legal Document-Based, Domain-Driven Q&A System: LLMs in Perspective},
year={2024},
volume={},
number={},
pages={1-9},
keywords={Law;Large language models;Neural networks;Question answering (information retrieval);Complexity theory;Large language models;LLM evaluation;legal Q&A systems;legal-document corpus},
address={Yokohama, Japan},
isbn={979-8-3503-5931-2},
doi={10.1109/IJCNN60899.2024.10650895},
url={https://ieeexplore.ieee.org/abstract/document/10650895}
}