Grounding Module: Details on Implementation #20
Hello, we currently do not provide the grounding-related code in the repo, and we have recently updated the grounding module to a simpler architecture with higher performance for our CVPR submission. We'll release the related code after the CVPR supplementary deadline. Stay tuned!
alright! along this line, i was looking for the evaluation code for ScanRefer, where you guys report the numbers in Table 6 of the paper. any chance we can get this?
additionally, i am trying to better understand the "3D Grounding" capabilities of the model as-is, which produces text tokens for bounding box coordinates. now, for a query like
which matches the prompt style of "3D Question Answering" in Figure 1, i get mostly only 2D coordinates or no coordinates at all (compared to the 3D bbox answer in Figure 1). i figured out that when i'm prompting the model like
i sometimes get 3D coordinates, depending on the nucleus sampling. however, i can't figure out what the coordinate system is and what format i should assume here, since the predictions seem to be very far off from the actual ground-truth. i get a response in the format like
what format should i assume here? [min_x, min_y, min_z, max_x, max_y, max_z]? or another format? i tried to cross-check via the ScanRefer evaluation pipeline, but since you guys haven't published that part in the repo, it's not really clear from the code alone. thank you so much!
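for context, here's the minimal cross-check i'm running on my side. it's just a sketch that assumes the predicted box is an axis-aligned [min_x, min_y, min_z, max_x, max_y, max_z] in the same world coordinates as the ScanRefer ground truth (both of which are exactly the assumptions i'm unsure about), and the GT values below are made-up placeholders:

```python
import re
import numpy as np

def parse_box(text):
    # pull the first six floats out of the model's text response
    # (assumes the answer contains the box as plain numbers)
    nums = [float(x) for x in re.findall(r"-?\d+\.?\d*", text)]
    if len(nums) < 6:
        return None
    # assumed order: [min_x, min_y, min_z, max_x, max_y, max_z]
    return np.array(nums[:6])

def iou_3d(box_a, box_b):
    # axis-aligned 3D IoU, as used for ScanRefer Acc@0.25 / Acc@0.5
    min_a, max_a = box_a[:3], box_a[3:]
    min_b, max_b = box_b[:3], box_b[3:]
    inter = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0, None)
    inter_vol = inter.prod()
    vol_a = np.clip(max_a - min_a, 0, None).prod()
    vol_b = np.clip(max_b - min_b, 0, None).prod()
    union = vol_a + vol_b - inter_vol
    return inter_vol / union if union > 0 else 0.0

# hypothetical GT box given as center + size and converted to min/max corners
pred = parse_box("... [0.1, 0.2, 0.0, 1.3, 1.1, 0.9] ...")
gt_center, gt_size = np.array([0.7, 0.6, 0.45]), np.array([1.2, 0.9, 0.9])
gt = np.concatenate([gt_center - gt_size / 2, gt_center + gt_size / 2])
if pred is not None:
    print("IoU:", iou_3d(pred, gt))
```

if the coordinate system (e.g. before vs. after ScanNet axis alignment) or the box order is different from what i assume above, that would explain why my IoUs come out near zero.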
hi dear authors! i am digging into the codebase as we want to run some evals on current SoTA models.
while doing this, i wanted to better understand the proposed "Grounding Module" (Section 3.4 and Figure 2 in the paper). you mention that the details are in the appendix, but the appendix only contains prompting examples.
unfortunately, i can't find anything in the codebase from a quick glance. you guys run:
on all evals. that means the token IDs get decoded directly into the vocabulary. i couldn't find any logic that conditionally checks for "location tokens" or decodes them differently. is that done within `model.generate()` under the hood? where is that logic implemented, and what is the architecture of that projection module?
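just to be concrete about what i mean by "decoded directly into the vocabulary", here's a minimal sketch of the generic HuggingFace generate-then-decode path (the checkpoint path and prompt are placeholders, not the repo's actual code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# placeholder checkpoint path, just to illustrate the plain decode path
model = AutoModelForCausalLM.from_pretrained("path/to/checkpoint", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("path/to/checkpoint")

inputs = tokenizer("where is the brown chair?", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)

# plain vocabulary decode: every generated ID is mapped back to its
# vocabulary string, including any special "location" tokens. there is no
# branch here that routes location-token IDs through a separate projection
# head, so if such a module exists it would have to hook in either inside
# generate() (e.g. on the hidden states at those positions) or as a
# post-processing step on output_ids before this call
text = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(text)
```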
thanks for the quick help!