Discrepancy Between Theoretical Design and Actual Implementation in CogAgent #12

Open
2 tasks
aptsunny opened this issue Dec 30, 2024 · 3 comments
@aptsunny

System Info

I have been reviewing the recent AutoGLM paper, specifically the section on "Agent with Intermediate Interface Design." The paper describes a system with a clear separation between the planner and the grounder, so that the two can be improved independently. The planner decides which action to take and expresses it as a high-level description, while the grounder locates the coordinates of the target element in the GUI from that textual instruction.

However, after examining the CogAgent codebase, I could not find evidence of this separation. The planner and grounder functionality appears to be integrated into a single component, which contradicts the design philosophy outlined in the paper.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Reproduction

xx

Expected behavior

Could you please clarify the following:
Is the separation of planner and grounder still part of CogAgent's design philosophy, or have the two been integrated for specific reasons?
If the separation is intentional, are there plans to update the codebase to reflect this design in the future?
If the integration was a deliberate choice, could you provide insights into the benefits or considerations that led to this decision?
Understanding the rationale behind this design choice is important for ensuring that the implementation aligns with the theoretical framework presented in the paper and for any future development or research building upon this work.

@zRzRzRzRzRzRzR self-assigned this Dec 31, 2024
@zRzRzRzRzRzRzR
Member

zRzRzRzRzRzRzR commented Dec 31, 2024

To clarify: this model is not the open-source version of AutoGLM, and it has little to do with AutoGLM.
The product corresponding to CogAgent is GLM-PC. AutoGLM concepts such as planner and grounder therefore do not apply to it.
By "integration", do you mean that the model outputs the plan and the action together in a single response?

@aptsunny
Author

aptsunny commented Jan 2, 2025

As I understand it from the project, the model supports several output formats:

Action-Operation-Sensitive format
Status-Plan-Action-Operation format
Status-Action-Operation-Sensitive format
Status-Action-Operation format
Action-Operation format
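
To make the naming concrete, here is a minimal sketch that treats each hyphen-separated part of a format name as one output field. That reading is an assumption based on the names alone, and `fields_of` and `extra_fields` are hypothetical helpers, not part of the CogAgent code.

```python
# Hypothetical helpers: field meanings are inferred from the format names only.
FORMATS = [
    "Action-Operation-Sensitive",
    "Status-Plan-Action-Operation",
    "Status-Action-Operation-Sensitive",
    "Status-Action-Operation",
    "Action-Operation",
]


def fields_of(format_name: str) -> list[str]:
    """Split a format name into its assumed output fields."""
    return format_name.split("-")


def extra_fields(format_name: str, baseline: str = "Action-Operation") -> list[str]:
    """Return the fields a format adds on top of the baseline format."""
    base = set(fields_of(baseline))
    return [f for f in fields_of(format_name) if f not in base]


for name in FORMATS:
    print(f"{name}: adds {extra_fields(name) or 'nothing'}")
```

Under this reading, every format contains Action and Operation, and the formats differ only in whether Status, Plan, or Sensitive is produced alongside them.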

My question is whether the grounding results and the specific actions returned by the model are consistent regardless of the chosen format. If they are, could you please explain why the SFT data is structured differently for each format? Additionally, what is the benefit of keeping the outputs consistent across formats?

Thank you very much for your assistance. Your response will be greatly appreciated.

@jasonnoy

jasonnoy commented Jan 3, 2025

The Action and Operation outputs are not consistent across formats. The chain-of-thought (CoT) provided by Status-Plan lets the model think before acting, which improves operation accuracy. In previous experiments, we observed a mild reduction in accuracy when we changed the format from Action-Operation to Operation only, which also supports this point.
We recommend the Action-Operation format to reduce the waiting time between steps, but you are encouraged to try the other formats and see whether any of them helps on your task. We would be glad to hear any feedback and suggestions from you!
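
One way to try the formats on your own task is a small comparison loop like the sketch below. It is only an illustration under stated assumptions: `run_step` is a hypothetical wrapper around your own inference call (it is not a CogAgent API), correctness is checked by exact string match for simplicity, and `fake_run_step` is an arbitrary stub so that the code runs on its own. The numbers it prints are not experimental results.

```python
import time
from typing import Callable, Iterable


def compare_formats(
    run_step: Callable[[str, str], str],
    cases: Iterable[tuple[str, str]],
    formats: Iterable[str],
) -> dict[str, dict[str, float]]:
    """Return accuracy and mean seconds per step for each output format.

    run_step(task, format_name) is a hypothetical wrapper around your own
    model call and should return the predicted operation as a string.
    cases holds (task, expected_operation) pairs from your own data.
    """
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one test case")
    results: dict[str, dict[str, float]] = {}
    for fmt in formats:
        correct = 0
        start = time.perf_counter()
        for task, expected in cases:
            # Exact match is a simplification; real operations may need
            # a tolerance on coordinates instead.
            if run_step(task, fmt) == expected:
                correct += 1
        elapsed = time.perf_counter() - start
        results[fmt] = {
            "accuracy": correct / len(cases),
            "seconds_per_step": elapsed / len(cases),
        }
    return results


# Arbitrary stub standing in for a real model call (assumption, not real behavior).
def fake_run_step(task: str, format_name: str) -> str:
    return "CLICK" if "Plan" in format_name or task == "easy" else "TYPE"


cases = [("easy", "CLICK"), ("hard", "CLICK")]
report = compare_formats(
    fake_run_step, cases, ["Action-Operation", "Status-Plan-Action-Operation"]
)
for fmt, stats in report.items():
    print(fmt, stats)
```

Reporting accuracy next to seconds per step keeps both sides of the trade-off visible: longer formats may help accuracy but add waiting time between steps.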
