In the referring expressions task, the model is given an image and an expression, and has to find a bounding box in the image for the thing that the expression refers to.
Here is an example of some images with expressions:
To do this, we need the following components:
A `DatasetReader` that reads the referring expression data, matches it up with the images, and pre-processes it to produce candidate bounding boxes. The best way to get the referring expression annotations is from https://github.com/lichengunc/refer, though the code there is out of date, so we'll have to write our own code to read in that data. Other than that, the dataset reader should follow the example of `VQAv2Reader`. The resulting `Instance`s should consist of the embedded regions of interest from the `RegionDetector`, the text of one referring expression in a `TextField`, and a label field that gives the IoU between the gold annotated region and each predicted region.
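As a rough sketch of the label computation described above, the snippet below builds one `Instance` with per-region IoU targets. The field names, the `expression_to_instance` helper, and the (x1, y1, x2, y2) box format are assumptions for illustration, not the actual reader API; in the real reader this logic would live in `text_to_instance`, with the region features and candidate boxes coming from the same image featurization pipeline that `VQAv2Reader` uses.

```python
from typing import Dict

import numpy as np
from allennlp.data import Instance
from allennlp.data.fields import ArrayField, TextField
from allennlp.data.token_indexers import TokenIndexer
from allennlp.data.tokenizers import Tokenizer


def box_iou(boxes: np.ndarray, gold_box: np.ndarray) -> np.ndarray:
    """IoU between each candidate box and the gold box; boxes are (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes[:, 0], gold_box[0])
    y1 = np.maximum(boxes[:, 1], gold_box[1])
    x2 = np.minimum(boxes[:, 2], gold_box[2])
    y2 = np.minimum(boxes[:, 3], gold_box[3])
    intersection = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_gold = (gold_box[2] - gold_box[0]) * (gold_box[3] - gold_box[1])
    return intersection / (area_boxes + area_gold - intersection)


def expression_to_instance(
    expression: str,
    region_features: np.ndarray,  # (num_regions, feature_dim) from the region detector pipeline
    region_boxes: np.ndarray,     # (num_regions, 4) candidate boxes
    gold_box: np.ndarray,         # (4,) gold annotated box for this expression
    tokenizer: Tokenizer,
    token_indexers: Dict[str, TokenIndexer],
) -> Instance:
    """Builds one training instance; the field names here are illustrative."""
    tokens = tokenizer.tokenize(expression)
    return Instance(
        {
            "box_features": ArrayField(region_features),
            "box_coordinates": ArrayField(region_boxes),
            "text": TextField(tokens, token_indexers),
            # Label field: IoU of each candidate box with the gold box.
            "labels": ArrayField(box_iou(region_boxes, gold_box)),
        }
    )
```

Computing the IoUs in the reader keeps the model itself agnostic to the annotation format.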
A `Model` that uses VilBERT as a back-end to combine the vision and text data, and gives each region a score. The model computes a loss by taking the softmax of the region scores and computing the dot product of that with the label field. You might want to look at `VqaVilbert` to steal some ideas.
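Here is a minimal PyTorch sketch of that loss, assuming the scoring head produces one scalar per region; negating the dot product and averaging over the batch is our assumption about how to turn it into something to minimize, not the reference implementation.

```python
import torch
import torch.nn.functional as F


def referring_expression_loss(region_scores: torch.Tensor, region_ious: torch.Tensor) -> torch.Tensor:
    """Loss sketch for the scoring head described above.

    region_scores: (batch, num_regions) raw scores from the VilBERT-backed head.
    region_ious:   (batch, num_regions) IoU of each candidate box with the gold box.
    """
    probs = F.softmax(region_scores, dim=-1)
    # Dot product of the softmax distribution with the IoU label field,
    # i.e. the expected IoU under the model's distribution over regions.
    expected_iou = (probs * region_ious).sum(dim=-1)
    return -expected_iou.mean()


# At inference time the predicted box would presumably just be the
# highest-scoring region: region_scores.argmax(dim=-1).
```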
A model config that trains this whole thing end-to-end. We're hoping to get somewhere near the scores in the VilBERT 12-in-1 paper, though we won't beat the high score, since this issue does not cover the extensive multi-task training from that paper.
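For orientation, here is what the skeleton of such a config might contain, written as a Python dict for illustration (in practice it would be a Jsonnet file passed to `allennlp train`). The registered names `"referring_expressions"` and `"re_vilbert"`, the data paths, and all hyperparameters are placeholders, not existing components.

```python
# Skeleton of a training config, expressed as a Python dict for illustration.
# The "type" names below are hypothetical registered names for the new reader
# and model, and the hyperparameters are placeholders rather than tuned values.
config = {
    "dataset_reader": {"type": "referring_expressions"},
    "train_data_path": "/path/to/refcoco/train",       # placeholder paths
    "validation_data_path": "/path/to/refcoco/val",
    "model": {"type": "re_vilbert"},
    "data_loader": {"batch_size": 32, "shuffle": True},
    "trainer": {
        "optimizer": {"type": "huggingface_adamw", "lr": 4e-5},
        "num_epochs": 20,
    },
}
```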
As always, we recommend you use the AllenNLP Repository Template as a starting point.