Discrepancy Between Theoretical Design and Actual Implementation in CogAgent #12

aptsunny · 2024-12-30T13:14:50Z

System Info

I have been reviewing the recent AutoGLM paper, focusing on the section that discusses the "Agent with Intermediate Interface Design." The paper states that the system is designed with a clear distinction between the planner and the grounder, so that the two can be improved separately. The planner is described as executing actions based on high-level descriptions, while the grounder is responsible for identifying the coordinates of elements within the GUI based on textual instructions.

However, after examining the CogAgent codebase, I have not found evidence of this separation. The planner and grounder functionality appears to be integrated into a single component, which contradicts the design philosophy outlined in the paper.
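The separation described above can be sketched as two independent interfaces that are composed at run time. This is only an illustration of the idea as summarized in this issue: every name below (`Planner`, `Grounder`, `Step`, `run_step`, and the two stub classes) is hypothetical and is not taken from AutoGLM or CogAgent.

```python
from dataclasses import dataclass
from typing import Protocol, Tuple


class Planner(Protocol):
    """Hypothetical interface: decides WHAT to do next, in words."""

    def plan(self, task: str, observation: str) -> str:
        ...


class Grounder(Protocol):
    """Hypothetical interface: decides WHERE on screen the action lands."""

    def ground(self, description: str, observation: str) -> Tuple[int, int]:
        ...


@dataclass
class Step:
    description: str        # high-level action text from the planner
    point: Tuple[int, int]  # GUI coordinates from the grounder


def run_step(planner: Planner, grounder: Grounder, task: str, observation: str) -> Step:
    # The two stages only communicate through the textual description,
    # so either one can be replaced or retrained without touching the other.
    description = planner.plan(task, observation)
    point = grounder.ground(description, observation)
    return Step(description, point)


# Stub implementations, for illustration only.
class EchoPlanner:
    def plan(self, task: str, observation: str) -> str:
        return f"click the '{task}' button"


class FixedGrounder:
    def ground(self, description: str, observation: str) -> Tuple[int, int]:
        return (0, 0)
```

In a single-component design, by contrast, one model call would return both the description and the coordinates, and there would be no intermediate text interface to swap parts behind.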

Who can help?

No response

Information

The official example scripts
My own modified scripts and tasks

Reproduction

xx

Expected behavior

Could you please clarify the following:
Is the separation of planner and grounder still part of CogAgent's design philosophy, or have the two been integrated for specific reasons?
If the separation is intentional, are there plans to update the codebase to reflect this design in the future?
If the integration was a deliberate choice, could you provide insights into the benefits or considerations that led to this decision?
Understanding the rationale behind this design choice is important for ensuring that the implementation aligns with the theoretical framework presented in the paper and for any future development or research building upon this work.

zRzRzRzRzRzRzR · 2024-12-31T05:52:06Z

To clarify: this model is not the open-source version of AutoGLM, and it has little to do with AutoGLM.
The product corresponding to CogAgent is GLM-PC, so AutoGLM concepts such as planner and grounder do not apply here.
When you say "integration", are you referring to the model output, where the plan and the action are produced together in one pass?

aptsunny · 2025-01-02T11:53:15Z

As I understand it from the project, the model supports several output formats, such as:

Action-Operation-Sensitive format
Status-Plan-Action-Operation format
Status-Action-Operation-Sensitive format
Status-Action-Operation format
Action-Operation format
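Read literally, each format name lists the fields that appear in the output. The sketch below assumes the hyphen-separated names map one-to-one onto output fields, which the project has not confirmed here; `fields_for_format` and `has_reasoning_fields` are hypothetical helpers, not part of CogAgent.

```python
from typing import List

# Format names copied from the list above.
FORMATS = [
    "Action-Operation-Sensitive",
    "Status-Plan-Action-Operation",
    "Status-Action-Operation-Sensitive",
    "Status-Action-Operation",
    "Action-Operation",
]


def fields_for_format(name: str) -> List[str]:
    """Split a format name into its field names (assumed 1:1 mapping)."""
    name = name.strip()
    suffix = " format"
    if name.lower().endswith(suffix):
        name = name[: -len(suffix)]
    return name.split("-")


def has_reasoning_fields(name: str) -> bool:
    """True if the format includes Status or Plan before the action."""
    return any(field in ("Status", "Plan") for field in fields_for_format(name))
```

Under this reading, every listed format contains both Action and Operation, and the formats differ only in the extra fields around them.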

My question is whether the grounding results and specific actions returned by the model are consistent regardless of the chosen format. If they are consistent, could you please explain why the SFT data is structured differently for each format? Additionally, what are the benefits of ensuring that the outputs remain consistent despite the format changes?

Thank you very much for your assistance. Your response will be greatly appreciated.

jasonnoy · 2025-01-03T10:20:36Z

The Action and Operation are not consistent across formats. The chain-of-thought (CoT) in the Status and Plan fields lets the model think before acting, which improves operation accuracy. In previous experiments, we observed a mild reduction in accuracy when we changed the format from Action-Operation to Operation alone, which also supports this point.
We recommend the Action-Operation format to reduce the waiting time between steps, but you are encouraged to try the other formats and see whether any of them helps on your task. We would like to hear any feedback and suggestions from you!
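One way to try the formats on your own task is to run the same steps under each format and compare how often the predicted operation matches the expected one. The following is a minimal sketch, not the evaluation used in the experiments mentioned above: `accuracy_by_format` is a hypothetical helper, and exact string matching is a simplifying assumption (a real grounding check would more likely compare boxes or coordinates).

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_format(records: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute per-format accuracy.

    Each record is (format_name, predicted_operation, expected_operation).
    A prediction counts as correct only on an exact match after stripping
    surrounding whitespace (simplifying assumption).
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for fmt, predicted, expected in records:
        total[fmt] += 1
        if predicted.strip() == expected.strip():
            correct[fmt] += 1
    return {fmt: correct[fmt] / total[fmt] for fmt in total}
```

Recording the time per step alongside the accuracy would also make the latency trade-off between the shorter and longer formats visible.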

zRzRzRzRzRzRzR self-assigned this Dec 31, 2024
