Incorrect results in baseline model comparison mode #118
Comments
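Thanks, we will try to reproduce this on our side as well. "model_a wins every time" could have a few causes. One is that model_a's answers really are better, for example because the questions happen to favor model_a. The other is GPT-4's own bias: it tends to prefer certain answer lengths and styles of expression, which skews the verdicts.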
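What I don't understand is the first question's `"win_1": "model_a", "win_2": "model_a"`. The two reviews clearly end with different verdicts, one A and one B. How do both wins come out as model_a?
The other modes, pairwise comparison and single scoring, run normally without errors.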
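One possible reading, offered as an assumption rather than confirmed evalscope behavior: if the second review round shows the two answers in swapped order (a common way to mitigate position bias), then "Assistant B" in `review_text_2` is the same model as "Assistant A" in `review_text_1`, and both verdicts point to model_a. A minimal sketch of that mapping, where the helper `verdict_to_winner` and its `swapped` flag are hypothetical names introduced for illustration:

```python
import re


def verdict_to_winner(review_text, swapped):
    """Map a judge verdict such as [[A]] or [[B]] to a model slot.

    `swapped` is an assumption about the review round: True means the judge
    saw model_b's answer as "Assistant A" and model_a's as "Assistant B".
    """
    match = re.search(r"\[\[(A|B|C)\]\]", review_text)
    if match is None:
        return "unknown"
    letter = match.group(1)
    if letter == "C":
        # Assumption: [[C]] denotes a tie, as in common pairwise judge prompts.
        return "tie"
    first, second = ("model_b", "model_a") if swapped else ("model_a", "model_b")
    return first if letter == "A" else second


# Round 1 in the original order, round 2 with positions swapped.
print(verdict_to_winner("... Final Verdict: [[A]]", swapped=False))  # model_a
print(verdict_to_winner("... Final Verdict: [[B]]", swapped=True))   # model_a
```

If this reading holds, the two reviews agree with each other instead of contradicting, and the result is a consistent win for model_a rather than a tie.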
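I am following https://evalscope.readthedocs.io/zh-cn/latest/user_guides/arena.html#id8
Since I had no baseline, I first ran predictions with two models, which gave me two jsonl prediction files. I then used one as the baseline and the other as target_answers. After evaluating with openai I got registry/data/arena/reviews/review_gpt4_pair_baseline.jsonl, whose contents are shown below.
I would expect a tie, but model_a wins every time.
Is the expected format of baseline_file the same as the output of a single model prediction run?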
{"model_a": "2.2", "model_b": "self", "win_1": "model_a", "win_2": "model_a", "anony": true, "tstamp": 1724924814.5014713, "language": "NA", "question_id": 1, "category": "generic", "question": "How can I improve my time management skills?", "review_text_1": "Both assistants provide relevant and helpful advice on improving time management skills. However, Assistant A provides a more detailed and structured response, with each point elaborated upon. Assistant A also includes additional tips such as eliminating distractions, practicing saying no, and learning to delegate, which are not mentioned by Assistant B. Furthermore, Assistant A ends with a note of encouragement, emphasizing that improving time management skills takes practice and patience. Therefore, Assistant A's response is more comprehensive and helpful. \n\nFinal Verdict: [[A]]", "review_text_2": "Both assistants provide relevant and helpful advice on improving time management skills. They both suggest prioritizing tasks, creating a schedule, breaking down large tasks, and taking breaks. However, Assistant B provides a more detailed and structured response, including additional tips such as eliminating distractions, learning to say no, and delegating tasks. Assistant B also emphasizes the importance of patience and practice in improving time management skills. Therefore, Assistant B's response is more comprehensive and detailed. \n\nFinal Verdict: [[B]]"}
{"model_a": "2.2", "model_b": "self", "win_1": "model_a", "win_2": "model_a", "anony": true, "tstamp": 1724924826.6124618, "language": "NA", "question_id": 2, "category": "generic", "question": "What are the most effective ways to deal with stress?", "review_text_1": "Both assistants provide relevant and accurate responses to the user's question about dealing with stress. However, Assistant A's response is more detailed and comprehensive, offering a wider range of strategies and explaining how each one can help manage stress. Assistant A also mentions the importance of trial and error in finding what works best for the individual, and suggests seeking professional help if needed. While Assistant B's response is accurate and relevant, it lacks the depth and detail of Assistant A's response. Therefore, Assistant A provides a better answer to the user's question. \n\nFinal Verdict: [[A]]", "review_text_2": "Both assistants provided relevant and helpful responses to the user's question about dealing with stress. They both mentioned similar strategies such as exercise, relaxation techniques, social support, self-care, and mindfulness. However, Assistant B provided a more detailed and organized response, breaking down the strategies into numbered points and elaborating on each one. Assistant B also mentioned cognitive-behavioral therapy, which Assistant A did not, and concluded by reminding the user that everyone deals with stress differently and that it's okay to seek professional help if needed. Therefore, Assistant B's response is more comprehensive and detailed. \n\nFinal Verdict: [[B]]"}