
Grounding Module: Details on Implementation #20

Open
mnbucher opened this issue Nov 19, 2024 · 3 comments

Comments

@mnbucher

Hi dear authors! I am digging into the codebase, as we want to run some evals on current SoTA models.

While doing this, I wanted to better understand the proposed "Grounding Module" (mentioned in Section 3.4 and Figure 2 of the paper). You mention that the details are in the appendix, but the appendix only contains prompting examples.

Unfortunately, from a quick glance I can't find anything related in the codebase. You run:

output_ids = model.generate(...) 
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

on all evals, which means the token IDs get decoded directly into the vocabulary. I couldn't find any logic that conditionally checks for "location tokens" and decodes them differently. Is that done inside `model.generate()` under the hood? Where is that logic implemented, and what is the architecture of the projection module?
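
For reference, this is roughly the kind of post-processing I was expecting to find somewhere in the eval scripts. It is just a sketch under my own assumptions: the `<loc_*>` special tokens and the `grounding_head` projection are hypothetical names, not anything I found in your code.

# purely hypothetical: dedicated "<loc_*>" special tokens that get routed to a
# projection head instead of being decoded as plain vocabulary text
loc_token_ids = set(tokenizer.convert_tokens_to_ids([f"<loc_{i}>" for i in range(256)]))

gen = model.generate(**inputs, max_new_tokens=256,
                     return_dict_in_generate=True, output_hidden_states=True)
output_ids = gen.sequences[0].tolist()

# split the generated sequence into plain text tokens vs. location tokens
text_ids = [t for t in output_ids if t not in loc_token_ids]
loc_positions = [i for i, t in enumerate(output_ids) if t in loc_token_ids]

answer = tokenizer.decode(text_ids, skip_special_tokens=True).strip()

# ...and the hidden states at loc_positions would then presumably be fed through
# some projection module to regress the 3D box, e.g.:
#   boxes = grounding_head(hidden_states[loc_positions])   # hypothetical module

Is it something along these lines, or is the grounding handled entirely outside of generation?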

Thanks for the quick help!

@ZCMax
Owner

ZCMax commented Nov 19, 2024

Hello, we currently do not provide the grounding-related code in the repo. We recently updated the grounding module to a simpler architecture with higher performance for our CVPR submission, and we'll release the related code after the CVPR supplementary deadline. Stay tuned!

@mnbucher
Author

Alright! Along these lines, I was also looking for the evaluation code for ScanRefer, for which you report numbers in Table 6 of the paper. Any chance we can get this?

@mnbucher
Author

Additionally, I am trying to better understand the "3D Grounding" capabilities of the model as-is, which produces text tokens for the bounding box coordinates. For a query like

"Where is the couch located? Please provide its coordinates"

which matches the prompt style of "3D Question Answering" in Figure 1, I mostly get only 2D coordinates or no coordinates at all (compared to the 3D bbox answer shown in Figure 1).

I figured out that when prompting the model with

"Where is the couch located? Please provide its coordinates in the scene"

I sometimes get 3D coordinates, depending on the nucleus sampling. However, I can't figure out what coordinate system and box format I should assume, since the predictions seem to be very far off from the ground truth.
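
(Side note: to take the sampling variance out of the picture, I also tried decoding greedily, roughly like this, assuming the same model/tokenizer/inputs setup as in your eval scripts:)

# greedy decoding to get reproducible answers instead of nucleus sampling
output_ids = model.generate(**inputs, do_sample=False, num_beams=1, max_new_tokens=128)
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()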

Either way, I get a response in a format like

[0.886, 1.084, 0.452, 0.862, 1.603, 0.847]

What format should I assume here: [min_x, min_y, min_z, max_x, max_y, max_z], or something else like a center + size encoding?

I tried to cross-check via the ScanRefer evaluation pipeline, but since you haven't published that part in the repo, it's not really clear from the code alone.
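
For reference, this is roughly how I'm cross-checking the predictions right now. It is a minimal sketch under my own assumptions (axis-aligned boxes in [min_x, min_y, min_z, max_x, max_y, max_z] format in the scene's world frame), not something confirmed by the paper:

import re
import numpy as np

def parse_box(answer):
    # pull the six floats out of the model's answer string
    vals = [float(v) for v in re.findall(r"-?\d+\.?\d*", answer)]
    assert len(vals) >= 6, "expected at least 6 coordinates"
    return np.array(vals[:6])   # assumed [min_x, min_y, min_z, max_x, max_y, max_z]

def iou_3d(box_a, box_b):
    # axis-aligned 3D IoU between two boxes in [min_xyz, max_xyz] format
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(np.clip(box_a[3:] - box_a[:3], 0.0, None))
    vol_b = np.prod(np.clip(box_b[3:] - box_b[:3], 0.0, None))
    return inter / (vol_a + vol_b - inter + 1e-8)

pred = parse_box("[0.886, 1.084, 0.452, 0.862, 1.603, 0.847]")
gt = np.array([0.70, 0.95, 0.00, 1.90, 2.10, 0.85])   # made-up GT box, just for illustration
print(iou_3d(pred, gt))   # ScanRefer Acc@0.25 / Acc@0.5 would threshold this IoU

(Under the min/max interpretation, the prediction above would even have a negative x-extent, which is part of why I'm unsure about the format.)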

Thank you so much!
