
Using Depth Embeddings in NyuV2 Zero-Shot Classification #107

Open
Leeinsu1 opened this issue Jan 25, 2024 · 4 comments
@Leeinsu1

Thank you for your exceptional work and the code you've provided.
I have a question regarding the use of depth embeddings for NYUv2 zero-shot classification.
For the conversion from depth to disparity, I am using a focal length of 518.857901 and a baseline of 0.075.
However, the accuracy I am achieving is only 45%, which is 10% lower than what is reported in the paper.

Could you possibly advise on any additional steps that might be necessary?
Currently, my pipeline consists of converting depth to disparity, resizing, center cropping, and normalizing.
For the normalization step, I am using a mean of 0.0418 and a standard deviation of 0.0295.
Additionally, I tried applying DepthNorm after converting to disparity, but it did not yield the desired results.
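
Concretely, my current pipeline looks roughly like the sketch below (the helper names are my own, and I assume the depth map is a single-channel tensor in metres):

```python
import torch
import torchvision.transforms as T

FOCAL = 518.857901  # NYUv2 focal length in pixels
BASELINE = 0.075    # baseline value I am currently using

def depth_to_disparity(depth: torch.Tensor) -> torch.Tensor:
    # depth: (1, H, W) tensor; clamp to avoid division by zero
    return FOCAL * BASELINE / depth.clamp(min=1e-6)

transform = T.Compose([
    T.Resize(224),       # resize the shorter side to 224
    T.CenterCrop(224),   # center crop to 224x224
    T.Normalize(mean=[0.0418], std=[0.0295]),
])

def prepare_depth(depth: torch.Tensor) -> torch.Tensor:
    return transform(depth_to_disparity(depth))
```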

For the 10th class, I am trying both approaches: labeling it as 'others', and simply selecting the class with the highest cosine similarity among the 18 classes specified in the paper.
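
For the classification step itself, I compute cosine similarity between the depth embedding and the text embeddings of the class prompts and take the argmax, roughly as in this sketch (function and variable names are mine):

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(depth_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # depth_emb: (N, D) depth embeddings; text_emb: (C, D) class-prompt embeddings
    depth_emb = F.normalize(depth_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sims = depth_emb @ text_emb.T   # cosine similarity once both are unit-norm
    return sims.argmax(dim=-1)      # predicted class index per sample
```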

Your guidance on this matter would be greatly appreciated.
Thank you.

@zhang-ziang

@Leeinsu1 I encountered a similar problem. Could you please share the code you used so we can discuss it? :)

@jbrownkramer

jbrownkramer commented Mar 1, 2024

I am trying to get embeddings for depth images, but I am also struggling since I have to guess at the normalization process.

@Leeinsu1 have you tried using a baseline of 75? If you look at the example disparity file from the omnivore repo, you'll see that the average value is around 16, which suggests a disparity formula like 518.857901 * 75 / d, where d is depth in mm. I think you might then want to apply DepthNorm before normalizing with mean 0.0418 and std 0.0295, since that matches the Omnivore pipeline.

That said, the mean of the disparity after DepthNorm, as defined above, is probably about 10x bigger than 0.0418, so I don't know where that value came from.

https://github.com/facebookresearch/omnivore/blob/1d55abdc8dfc7bd5cbf69316841ab804d0acf1ca/inference_tutorial.ipynb#L560
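
To make that concrete, here is a rough sketch of what I mean (helper names are mine, and the depth_norm step is only my reading of Omnivore's DepthNorm, not copied from its implementation):

```python
import torch

FOCAL = 518.857901  # NYUv2 focal length in pixels

def depth_to_disparity_mm(depth_mm: torch.Tensor) -> torch.Tensor:
    # a baseline of 75 with depth in mm gives disparities around 16 on average,
    # matching the example disparity file in the Omnivore repo
    return FOCAL * 75.0 / depth_mm.clamp(min=1e-6)

def depth_norm(disparity: torch.Tensor, max_depth: float = 75.0) -> torch.Tensor:
    # my guess at what DepthNorm does: clamp, then scale into [0, 1]
    return disparity.clamp(max=max_depth) / max_depth

def prepare(depth_mm: torch.Tensor) -> torch.Tensor:
    disparity = depth_to_disparity_mm(depth_mm)
    x = depth_norm(disparity)        # DepthNorm before the final normalization
    return (x - 0.0418) / 0.0295     # normalize with mean 0.0418 and std 0.0295
```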

@StanLei52

Hi there, I recommend you check out our project ViT-Lens. For the depth experiments, we obtained better performance than ImageBind on the same test data. Hope that helps.

@jbrownkramer

@StanLei52 Oh, that looks great! I looked at your paper and code. It seems to follow the same data normalization pipeline as Omnivore and ImageBind. One missing piece of information is the scale in the conversion from depth to disparity. The ViT-Lens code starts by loading pre-computed disparity maps, so that info is not present.

Do you know if disparity is 518.857901 * 75 / depth or 518.857901 * .075 / depth or something else?
