
Grounding Module: Details on Implementation #20

Open
mnbucher opened this issue Nov 19, 2024 · 3 comments

Comments

@mnbucher

Hi dear authors! I am digging into the codebase, as we want to run some evals on current SoTA models.

While doing this, I wanted to better understand the proposed "Grounding Module" (mentioned in Section 3.4 and Figure 2 of the paper). You mention that the details are in the appendix, but the appendix only contains prompting examples.

Unfortunately, from a quick glance I can't find anything related in the codebase. You run:

output_ids = model.generate(...) 
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

on all evals, which means the token IDs get decoded directly into the vocabulary. I couldn't find any logic that conditionally checks for "location tokens" and decodes them differently. Is that done inside `model.generate()` under the hood? Where is that logic implemented, and what is the architecture of the projection module?
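
For reference, this is roughly the kind of post-processing I was expecting to find somewhere in the eval scripts. It is just a sketch under my own assumptions: the `<loc_*>` special tokens and the `grounding_head` projection are hypothetical names, not anything I found in your code.

# purely hypothetical: dedicated "<loc_*>" special tokens that get routed to a
# projection head instead of being decoded as plain vocabulary text
loc_token_ids = set(tokenizer.convert_tokens_to_ids([f"<loc_{i}>" for i in range(256)]))

gen = model.generate(**inputs, max_new_tokens=256,
                     return_dict_in_generate=True, output_hidden_states=True)
output_ids = gen.sequences[0].tolist()

# split the generated sequence into plain text tokens vs. location tokens
text_ids = [t for t in output_ids if t not in loc_token_ids]
loc_positions = [i for i, t in enumerate(output_ids) if t in loc_token_ids]

answer = tokenizer.decode(text_ids, skip_special_tokens=True).strip()

# ...and the hidden states at loc_positions would then presumably be fed through
# some projection module to regress the 3D box, e.g.:
#   boxes = grounding_head(hidden_states[loc_positions])   # hypothetical module

Is it something along these lines, or is the grounding handled entirely outside of generation?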

Thanks for the quick help!

@ZCMax
Owner

ZCMax commented Nov 19, 2024

Hello, we currently do not provide the grounding-related code in the repo. We recently updated the grounding module to a simpler architecture with higher performance for our CVPR submission, and we'll release the related code after the CVPR supplementary deadline. Stay tuned!

@mnbucher
Author

Alright! Along these lines, I was also looking for the evaluation code for ScanRefer, for which you report numbers in Table 6 of the paper. Any chance we can get this?

@mnbucher
Author

Additionally, I am trying to better understand the "3D Grounding" capabilities of the model as-is, which produces text tokens for the bounding box coordinates. For a query like

"Where is the couch located? Please provide its coordinates"

which matches the prompt style of "3D Question Answering" in Figure 1, I mostly get only 2D coordinates or no coordinates at all (compared to the 3D bbox answer shown in Figure 1).

I figured out that when prompting the model with

"Where is the couch located? Please provide its coordinates in the scene"

I sometimes get 3D coordinates, depending on the nucleus sampling. However, I can't figure out what coordinate system and box format I should assume, since the predictions seem to be very far off from the ground truth.
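
(Side note: to take the sampling variance out of the picture, I also tried decoding greedily, roughly like this, assuming the same model/tokenizer/inputs setup as in your eval scripts:)

# greedy decoding to get reproducible answers instead of nucleus sampling
output_ids = model.generate(**inputs, do_sample=False, num_beams=1, max_new_tokens=128)
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()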

Either way, I get a response in a format like

[0.886, 1.084, 0.452, 0.862, 1.603, 0.847]

What format should I assume here: [min_x, min_y, min_z, max_x, max_y, max_z], or something else like a center + size encoding?

I tried to cross-check via the ScanRefer evaluation pipeline, but since you haven't published that part in the repo, it's not really clear from the code alone.
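
For reference, this is roughly how I'm cross-checking the predictions right now. It is a minimal sketch under my own assumptions (axis-aligned boxes in [min_x, min_y, min_z, max_x, max_y, max_z] format in the scene's world frame), not something confirmed by the paper:

import re
import numpy as np

def parse_box(answer):
    # pull the six floats out of the model's answer string
    vals = [float(v) for v in re.findall(r"-?\d+\.?\d*", answer)]
    assert len(vals) >= 6, "expected at least 6 coordinates"
    return np.array(vals[:6])   # assumed [min_x, min_y, min_z, max_x, max_y, max_z]

def iou_3d(box_a, box_b):
    # axis-aligned 3D IoU between two boxes in [min_xyz, max_xyz] format
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(np.clip(box_a[3:] - box_a[:3], 0.0, None))
    vol_b = np.prod(np.clip(box_b[3:] - box_b[:3], 0.0, None))
    return inter / (vol_a + vol_b - inter + 1e-8)

pred = parse_box("[0.886, 1.084, 0.452, 0.862, 1.603, 0.847]")
gt = np.array([0.70, 0.95, 0.00, 1.90, 2.10, 0.85])   # made-up GT box, just for illustration
print(iou_3d(pred, gt))   # ScanRefer Acc@0.25 / Acc@0.5 would threshold this IoU

(Under the min/max interpretation, the prediction above would even have a negative x-extent, which is part of why I'm unsure about the format.)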

Thank you so much!
