This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

Support exporting fpn #372

Closed

Conversation

daquexian
Contributor

@daquexian daquexian commented Apr 17, 2018

Based on @orionr's work

With this PR, FPN is supported in cooperation with pytorch/pytorch#7091. I have verified that it works on e2e_faster_rcnn_R-50-FPN_1x.yaml

@daquexian
Contributor Author

@orionr FPN works with pytorch/pytorch#7091! Please review it when you have time :)

@ZesongLee

The “rpn_post_nms_topN” argument in the convert_collect_and_distribute() function should be "post_nms_topN".

@daquexian
Contributor Author

@ZesongLee Yes, I have fixed it in pytorch/pytorch#7091, because it makes the names consistent. Please use the latest caffe2 code :)

@gadcam
Contributor

gadcam commented May 15, 2018

This PR seems ready to merge! And it goes in the direction of #24, which is dear to a lot of people.
@orionr Do you have time to check this PR? Or do you think it is better to go with #334?

inputs = [x for x in op.input]
ret = core.CreateOperator(
    'CollectAndDistributeFpnRpnProposals',
    inputs,
Contributor


Why not use list(op.input)? It seems to produce exactly the same output.
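
For illustration, both forms produce the same plain Python list of input blob names from the operator's protobuf repeated field (a minimal sketch; op stands for any caffe2 OperatorDef):

    inputs_a = [x for x in op.input]  # list comprehension over the repeated field
    inputs_b = list(op.input)         # equivalent, and arguably more idiomatic
    assert inputs_a == inputs_b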

Contributor Author


Yes, it comes from @orionr's code. I am not sure whether it is better for me to change it.


Hi @daquexian,

Thanks for writing this. While running it to convert a pkl model (FPN), I get an error at verify_model(args, [net, init_net], args.test_img).

Something like the output shown below:

AssertionError:
Arrays are not almost equal to 3 decimals
result_boxes and result_boxes not matched. Max diff: 5.0
(mismatch 0.606060606061%)
x: array([[3.329e+02, 1.799e+01, 5.250e+02, 3.556e+02, 9.995e-01],
[0.000e+00, 2.073e+01, 2.308e+02, 3.685e+02, 9.926e-01],
[2.459e+01, 3.320e+01, 1.330e+02, 1.092e+02, 9.909e-01],...
y: array([[3.329e+02, 1.799e+01, 5.250e+02, 3.556e+02, 9.995e-01],
[0.000e+00, 2.073e+01, 2.308e+02, 3.685e+02, 9.926e-01],
[2.459e+01, 3.320e+01, 1.330e+02, 1.092e+02, 9.909e-01],...

I get this error when I pass a test_img; otherwise the model gets written in pb format.

How important is it to get correct boxes on the test image?


@daquexian Sure. I can share the model, config file, and test image. How would you like me to share them?


@daquexian: I was able to convert a pkl model (with FPN) to protobufs. However, when I try to run this net, I get the following error:

RuntimeError: [enforce fail at utility_ops.h:275] . Check failed: output->dims() == Input(i).dims().Description: Input #1, input dimension:1 256 38 60 should match output dimension: 1 256 38 59 Error from operator:
input: "fpn_inner_res4_5_sum_lateral" input: "fpn_inner_res4_5_sum_topdown" output: "fpn_inner_res4_5_sum" name: "" type: "Sum" device_option { device_type: 1 cuda_gpu_id: 0 }

The error seems to be typical of models with FPN layers; I never encountered this kind of issue before. Any pointers to what might be causing it would be of great help.


@daquexian: Yes, there is a method blob_utils.im_list_to_blob([im]) in convert_pkl_to_pb.py. The size of the original image is (377, 590, 3).

The size of the blob after data prep becomes (1, 3, 500, 782). Also, the content of im_info is array([[500. , 782. , 1.32626]], dtype=float32).

Contributor Author


@rohitbhio The error you encountered is caused by an input size that is not a multiple of 32. When the size is not a multiple of 32, for example a width of 36, a feature map of width 36/2/2=9 is produced after a certain layer; then, after a conv layer with stride=2 followed by an upsample2x layer, a feature map of width 8 instead of 9 is produced, and an error occurs when summing these two feature maps whose widths differ. blob_utils.im_list_to_blob pads the input so that its height and width are multiples of 32. You can try printing some messages to check whether blob_utils.im_list_to_blob works correctly.
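
For illustration, a minimal sketch of the padding idea described above (the function name pad_to_coarsest_stride is hypothetical; it is not the actual blob_utils.im_list_to_blob implementation):

    import numpy as np

    def pad_to_coarsest_stride(im, stride=32):
        # Pad H and W up to the next multiple of `stride` so that repeated
        # 2x downsampling followed by 2x upsampling in FPN yields matching shapes.
        h, w = im.shape[:2]
        pad_h = int(np.ceil(h / float(stride)) * stride) - h
        pad_w = int(np.ceil(w / float(stride)) * stride) - w
        return np.pad(im, ((0, pad_h), (0, pad_w), (0, 0)), mode='constant')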


@daquexian: Thanks for the explanation. Your observation about the blob_utils.im_list_to_blob method is correct. For some reason, the method was not adjusting the image blob dimensions according to COARSEST_STRIDE. Now the size of the blob after data prep becomes (1, 3, 512, 800) and I am able to make detections. Wow, thanks.

Contributor Author


@rohitbhio Glad to see it :) Could you please tell me why the method doesn't adjust the dimensions? Is there anything wrong with my PR?


@daquexian Your PR is awesome. I was trying to load these protobuf models without loading the corresponding config yaml file, so the blob_utils.im_list_to_blob method ended up using cfg.FPN.FPN_ON and cfg.FPN.COARSEST_STRIDE from the default config file (config.py), where cfg.FPN.FPN_ON is set to False. For my purpose, I made a copy of the im_list_to_blob method from blob.py and modified it so that I could explicitly pass FPN_ON and COARSEST_STRIDE as input parameters, so the method no longer depends on the global values.
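
For illustration, a minimal sketch of such a modified helper (the name and signature are hypothetical; Detectron's real im_list_to_blob in blob.py reads these settings from the global cfg):

    import numpy as np

    def im_list_to_blob_explicit(ims, fpn_on, coarsest_stride):
        # Build an NCHW blob from a list of HWC float images, padding H/W up to
        # a multiple of coarsest_stride when FPN is enabled.
        max_shape = np.array([im.shape for im in ims]).max(axis=0)
        if fpn_on:
            stride = float(coarsest_stride)
            max_shape[0] = int(np.ceil(max_shape[0] / stride) * stride)
            max_shape[1] = int(np.ceil(max_shape[1] / stride) * stride)
        blob = np.zeros((len(ims), max_shape[0], max_shape[1], 3), dtype=np.float32)
        for i, im in enumerate(ims):
            blob[i, :im.shape[0], :im.shape[1], :] = im
        return blob.transpose((0, 3, 1, 2))  # NHWC -> NCHW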

Contributor

@gadcam gadcam left a comment


Apart from the very minor change I suggested, LGTM.
I tested it with a few different models and it ran without problems.

gadcam added a commit to gadcam/Detectron-1 that referenced this pull request May 28, 2018
Prerequisite: facebookresearch#372
Purpose: enable exporting all the models for CPU by exporting two separate nets: one for the bboxes and one for the rest of the inference.

Two main modifications:
- Refactor main(): it calls a function convert_to_pb for each sub_net.
- run_model_pb: always run the inference for bbox, then call the mask or keypoint part if needed. The exact same approach is adopted.

The helper functions are only lightly modified to fit the new objective of exporting two pb files.
@daquexian
Contributor Author

@rohitbhio It might be caused by some inconsistent implementation between the Python layer and the corresponding C++ layer, or by some inherent difference between GPU and CPU. I think a max diff of 5.0 and a mismatch of 0.6% don't matter. You can comment out verify_model to skip the verification and save the model. It would also be appreciated if you could share the model you want to convert and the test img.

@gadcam
Contributor

gadcam commented Jun 15, 2018

@rbgirshick @ir413 This PR has been open for almost two months: do you plan to review it, and if so, do you know when you will have time to do it?

@@ -293,7 +293,7 @@ def add_topdown_lateral_module(
bias_init=const_fill(0.0)
)
# Top-down 2x upsampling
- td = model.net.UpsampleNearest(fpn_top, fpn_bottom + '_topdown', scale=2)
+ td = model.net.ResizeNearest(fpn_top, fpn_bottom + '_topdown', width_scale=2., height_scale=2.)
Contributor


Are these ops exactly equivalent? Please verify that this change does not impact the AP of FPN model zoo models and does not impact the results when training new models.

Contributor Author


@rbgirshick Thanks for your review. I will check it in the next few days.

@rbgirshick
Contributor

@gadcam I haven't been able to follow all of the discussions about CPU support for FPN models. This PR looks self-contained, so I should be able to review it, though probably not until after CVPR, which is next week. I left a comment about one concern over the switch from the UpsampleNearest op to the ResizeNearest op. I want someone to check that they produce equivalent results (the history is that when this was originally implemented, the C2 ResizeNearest op did not exist).
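
For illustration, a minimal sketch of a direct numerical check (assumptions: the Detectron custom ops library can be loaded via detectron.utils.c2.import_detectron_ops(), and a CUDA device is available, since UpsampleNearest only has a GPU implementation):

    import numpy as np
    from caffe2.proto import caffe2_pb2
    from caffe2.python import core, workspace
    from detectron.utils.c2 import import_detectron_ops

    import_detectron_ops()  # registers the custom UpsampleNearest op
    gpu = core.DeviceOption(caffe2_pb2.CUDA, 0)

    x = np.random.rand(1, 256, 38, 60).astype(np.float32)
    workspace.FeedBlob('x', x, device_option=gpu)

    workspace.RunOperatorOnce(core.CreateOperator(
        'UpsampleNearest', ['x'], ['y_up'], scale=2, device_option=gpu))
    workspace.RunOperatorOnce(core.CreateOperator(
        'ResizeNearest', ['x'], ['y_rs'], width_scale=2., height_scale=2.,
        device_option=gpu))

    np.testing.assert_allclose(
        workspace.FetchBlob('y_up'), workspace.FetchBlob('y_rs'))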

@daquexian
Contributor Author

daquexian commented Jun 25, 2018

@rbgirshick I just tested the AP of e2e_faster_rcnn_R-50-FPN_1x and e2e_keypoint_rcnn_R-50-FPN_1x; the APs are exactly the same when using ResizeNearest and UpsampleNearest.

Both Faster R-CNN results are:

INFO task_evaluation.py: 181: copypaste: Dataset: coco_2017_val
INFO task_evaluation.py: 183: copypaste: Task: box
INFO task_evaluation.py: 186: copypaste: AP,AP50,AP75,APs,APm,APl
INFO task_evaluation.py: 187: copypaste: 0.3671,0.5845,0.3962,0.2112,0.3985,0.4813

Both Keypoint R-CNN results are:

INFO task_evaluation.py: 181: copypaste: Dataset: keypoints_coco_2017_val
INFO task_evaluation.py: 183: copypaste: Task: box
INFO task_evaluation.py: 186: copypaste: AP,AP50,AP75,APs,APm,APl
INFO task_evaluation.py: 187: copypaste: 0.5356,0.8285,0.5828,0.3649,0.6122,0.6970
INFO task_evaluation.py: 183: copypaste: Task: keypoint
INFO task_evaluation.py: 186: copypaste: AP,AP50,AP75,APm,APl
INFO task_evaluation.py: 187: copypaste: 0.6421,0.8643,0.6991,0.5854,0.7337

@daquexian
Contributor Author

daquexian commented Jun 26, 2018

@rbgirshick I have also checked the code. Because nearest-neighbor resize/upsample is simple to implement, both operators do basically the same thing. The only differences are:

  1. There is a limit on the number of CUDA blocks in UpsampleNearest but not in ResizeNearest.

  2. UpsampleNearest can accept an input with ndim()==3, but ResizeNearest only supports ndim()==4.

@rbgirshick
Contributor

@daquexian thanks for checking! Will try to review and merge in the near future.

@rbgirshick
Contributor

@orionr @newstzpz does this diff look good to you? My understanding is that it subsumes #334, which you worked on.

@facebook-github-bot

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

spatial_scale,
(cfg.FPN.RPN_ANCHOR_START_SIZE * 2.**(lvl - cfg.FPN.RPN_MIN_LEVEL),)
) \
if lvl else get_anchors(spatial_scale, cfg.RPN.SIZES)
Contributor


It may be easier to read if we write

  anchor_sizes = ....
  blobs[anchor_name] = get_anchors(spatial_scale, anchor_sizes)
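
For illustration, the suggestion applied to the snippet above might look like this (a sketch only; it assumes the surrounding lvl, spatial_scale, blobs, anchor_name, and get_anchors from the PR code):

    if lvl:
        anchor_sizes = (
            cfg.FPN.RPN_ANCHOR_START_SIZE * 2.**(lvl - cfg.FPN.RPN_MIN_LEVEL),
        )
    else:
        anchor_sizes = cfg.RPN.SIZES
    blobs[anchor_name] = get_anchors(spatial_scale, anchor_sizes)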

@newstzpz
Contributor

newstzpz commented Jul 3, 2018

@rbgirshick The diff looks good to me.

inputs = [x for x in op.input]
ret = core.CreateOperator(
    'CollectAndDistributeFpnRpnProposals',
    inputs,


Hi, I am trying to convert a Mask R-CNN model from pkl to pb format, but I am getting the error below:
Traceback (most recent call last):
  File "convert_pkl_to_pb.py", line 586, in <module>
    main()
  File "convert_pkl_to_pb.py", line 574, in main
    [net, init_net] = convert_model_gpu(args, net, init_net)
  File "convert_pkl_to_pb.py", line 300, in convert_model_gpu
    ret = core.InjectDeviceCopiesAmongNets([ret_init_net, ret_net])
  File "/usr/local/anaconda3/envs/caffetwo/lib/python2.7/site-packages/caffe2/python/core.py", line 2407, in InjectDeviceCopiesAmongNets
    blob_remap=blob_remap,
  File "/usr/local/anaconda3/envs/caffetwo/lib/python2.7/site-packages/caffe2/python/core.py", line 2298, in InjectCrossDeviceCopies
    "input {} should be defined in the net.".format(input)
AssertionError: input rpn_rois should be defined in the net.

The input to the script is the config file e2e_mask_rcnn_R-101-FPN_1x.yaml and its pkl model.

Contributor


AssertionError: input rpn_rois should be defined in the net.

Looks like you also need PR #449 to do what you want.


Hi, I am able to convert the model from pkl to pb format with convert_pkl_to_pb.py and args.device = cpu. But when I view the network using Netron (https://github.com/lutzroeder/Netron), the mask_conv_body section is missing from the graph. I want to understand whether this is an issue with convert_pkl_to_pb.py or with Netron.

@gadcam
Contributor

gadcam commented Jul 12, 2018

@orionr @rbgirshick Did you reach a final decision on this one?

@gadcam
Contributor

gadcam commented Jul 22, 2018

@orionr @rbgirshick @ir413 This is the last time I bother you about this one: do you think it is now mergeable?

As @daquexian did the tests you required and @newstzpz approved the changes, it looks ready to merge, at least from an external viewpoint.

@daquexian
Contributor Author

daquexian commented Jul 23, 2018

@newstzpz Hi :) I have made the changes you requested. Is it ready for landing now? Thanks!

@ferasboulala

Does this mean that it will be possible to convert both FPN and Mask models?

@gadcam
Contributor

gadcam commented Jul 25, 2018

@ferasboulala #449 adds support for mask models.

@daquexian
Contributor Author

@rbgirshick @orionr @newstzpz Could you please consider merging it? As you may have noticed, many users are eager to see it land.

@rbgirshick
Contributor

@daquexian please rebase and then I will do some necessary testing to ensure that the UpsampleNearest -> ResizeNearest op change doesn't cause any problems.

@newstzpz
Contributor

newstzpz commented Aug 3, 2018

Another possible way is to keep UpsampleNearest unchanged in the training code but convert it to ResizeNearest when converting to pb.
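
For illustration, a minimal sketch of such a rewrite on a caffe2 OperatorDef during the pkl-to-pb conversion (the function name and argument handling here are assumptions; this PR's actual conversion goes through the convert_op_in_proto helper mentioned later in the thread):

    from caffe2.python import utils

    def upsample_to_resize_nearest(op):
        # Rewrite a Detectron UpsampleNearest op (GPU-only custom op) into the
        # built-in ResizeNearest op, which has a CPU implementation.
        if op.type != 'UpsampleNearest':
            return op
        scale = 2
        for arg in op.arg:
            if arg.name == 'scale':
                scale = arg.i
        del op.arg[:]
        op.type = 'ResizeNearest'
        op.arg.extend([
            utils.MakeArgument('width_scale', float(scale)),
            utils.MakeArgument('height_scale', float(scale)),
        ])
        return op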

@daquexian
Contributor Author

@rbgirshick @newstzpz Thanks! I will rebase in a few hours and then explore the possibility of keeping UpsampleNearest unchanged.

@rbgirshick
Contributor

@newstzpz great suggestion!

@daquexian if you keep UpsampleNearest unchanged in the training code, then I will be able to merge this trivially. It is my only concern, because the op change would require significant testing to ensure that the behavior of existing models is unchanged.

@daquexian daquexian force-pushed the add-export-support-fpn branch from fdcf219 to abaa4b9 on August 3, 2018 14:49
@daquexian
Contributor Author

@rbgirshick Thanks! I have rebased, and I will now explore the possibility of keeping UpsampleNearest unchanged in the training code. I may only be able to update it several days from now because I don't have much time these days :)

@gadcam
Contributor

gadcam commented Aug 4, 2018

@daquexian @rbgirshick To me, it looks like this should be merged with UpsampleNearest as a first step: the PRs of @jgong5 will enable CPU support with UpsampleNearest, cf. #596 and pytorch/pytorch#10157.

But as I understand it, there will be no NEON optimization and it is still a custom op, so the question would remain open after the merge.

@daquexian
Contributor Author

daquexian commented Aug 8, 2018

@rbgirshick @gadcam I have updated the code. UpsampleNearest is now kept unchanged in the training code. It was easier than I imagined; I spent less than 20 minutes on it :) If pytorch/pytorch#10157 is merged, one can simply remove the corresponding convert_op_in_proto call to use the CPU version of UpsampleNearest.

I have verified the new code on faster-resnet-50-fpn and the detection part of keypoint-resnet-50-fpn; the results of both models are correct.


@facebook-github-bot facebook-github-bot left a comment


rbgirshick has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

xzhewei pushed a commit to xzhewei/Detectron that referenced this pull request Jan 10, 2019
Summary:
Based on orionr's work

 - [x] Solve [the problem about GenerateProposals](facebookresearch#334 (comment))
 - [x] Use the existing [ResizeNearest](https://github.com/caffe2/caffe2/blob/master/caffe2/operators/resize_op.cc#L57) layer instead of UpsampleNearest. ResizeNearest has cpu implementation and neon optimization
 - [x] Make it work (with pytorch/pytorch#7091)

With this PR, FPN is supported in cooperation with pytorch/pytorch#7091. I have verified that it works on `e2e_faster_rcnn_R-50-FPN_1x.yaml`
Pull Request resolved: facebookresearch#372

Reviewed By: newstzpz

Differential Revision: D9213242

Pulled By: rbgirshick

fbshipit-source-id: 8fc7b77e6cbf08adaafd760505dd760df59bfd79

10 participants