Merge branch 'mindspore-lab:main' into main

Showing 54 changed files with 4,355 additions and 877 deletions.
@@ -0,0 +1,104 @@
English | [中文](README_CN.md)

# LayoutLMv3
<!--- Guideline: use url linked to abstract in ArXiv instead of PDF for fast loading. -->

> [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387)
> [Original Repo](https://github.com/microsoft/unilm/tree/master/layoutlmv3)

## 1. Introduction

Unlike previous LayoutLM-series models, LayoutLMv3 does not rely on a complex CNN or Faster R-CNN network to represent images. Instead, it works directly on patches of the document image, which greatly reduces the number of parameters and avoids complex document preprocessing such as manual annotation of target region boxes and document object detection. Its simple, unified architecture and training objectives make LayoutLMv3 a general-purpose pretrained model suitable for both text-centric and image-centric document AI tasks.

Experimental results show that LayoutLMv3 achieves better performance with fewer parameters on the following datasets:

- Text-centric datasets: the FUNSD form understanding dataset, the CORD receipt understanding dataset, and the DocVQA document visual question answering dataset.
- Image-centric datasets: the RVL-CDIP document image classification dataset and the PubLayNet document layout analysis dataset.

LayoutLMv3 uses a text-image multimodal Transformer to learn cross-modal representations. Text embeddings are obtained by summing word embeddings, 1D positional embeddings, and 2D positional embeddings. The text of a document image and its corresponding 2D positional information (layout) are extracted with optical character recognition (OCR) tools. Because adjacent words in a text usually convey similar semantics, LayoutLMv3 shares 2D positional embeddings among adjacent words, whereas LayoutLM and LayoutLMv2 assign each word its own 2D positional embedding.

Image representations usually rely on grid features extracted by a CNN or region features extracted by Faster R-CNN, which increase computational cost or depend on region annotations. The authors instead obtain image features by linearly projecting image patches, a representation first proposed in ViT, which incurs minimal computational cost and requires no region annotations, effectively addressing the issues above. Concretely, the image is resized to a fixed size (e.g., 224x224), split into fixed-size patches (e.g., 16x16), and each patch is linearly projected to form a sequence of image features; a learnable 1D positional embedding is then added to obtain the image embeddings. [[1](#references)]
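The patch-embedding step described above can be illustrated with a short NumPy sketch. This is only an illustration: the 224x224 input size and 16x16 patch size follow the example in the text, while the hidden size of 768 and the random weights are assumptions, not the exact values used by this implementation.

```python
import numpy as np

# Resized document image (H, W, C) as described above.
image = np.random.rand(224, 224, 3).astype(np.float32)
patch, hidden = 16, 768  # assumed patch size and hidden size

# Split the image into (224/16)^2 = 196 non-overlapping patches and flatten each.
grid = 224 // patch
patches = image.reshape(grid, patch, grid, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)  # (196, 768)

# Linear projection of the flattened patches (random weights stand in for learned ones).
proj = 0.02 * np.random.randn(patch * patch * 3, hidden).astype(np.float32)
patch_embeddings = patches @ proj  # (196, hidden)

# Add a learnable 1D positional embedding to obtain the image token sequence.
pos_embedding = 0.02 * np.random.randn(patches.shape[0], hidden).astype(np.float32)
image_tokens = patch_embeddings + pos_embedding  # fed to the multimodal Transformer
```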
<p align="center">
  <img src=../../kie/layoutlmv3/layoutlmv3_arch.jpg width=1000 />
</p>
<p align="center">
  <em> Figure 1. LayoutLMv3 architecture [<a href="#references">1</a>] </em>
</p>
## 2. Quick Start

### 2.1 Preparation

| mindspore | ascend driver | firmware    | cann toolkit/kernel |
|:---------:|:-------------:|:-----------:|:-------------------:|
| 2.3.1     | 24.1.RC2      | 7.3.0.1.231 | 8.0.RC2.beta1       |

#### 2.1.1 Installation

Please refer to the [installation instructions](https://github.com/mindspore-lab/mindocr#installation) in MindOCR.
#### 2.1.2 PubLayNet Dataset Preparation

PubLayNet is a dataset for document layout analysis. It contains images of research papers and articles, together with annotations for the various elements on a page, such as "text", "list", and "figure". The dataset was built by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central.

The training and validation sets of PubLayNet can be downloaded [here](https://dax-cdn.cdn.appdomain.cloud/dax-publaynet/1.0.0/publaynet.tar.gz).
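A minimal sketch of fetching and unpacking the archive is shown below. The local file name and target directory are assumptions; the archive is large, so adjust paths to your environment (or download it with any tool you prefer).

```python
import tarfile
import urllib.request

# Download the PubLayNet archive (training + validation splits).
url = "https://dax-cdn.cdn.appdomain.cloud/dax-publaynet/1.0.0/publaynet.tar.gz"
urllib.request.urlretrieve(url, "publaynet.tar.gz")

# Extract into the current directory so that the publaynet/ path used by the
# conversion command below exists (assumed archive layout).
with tarfile.open("publaynet.tar.gz", "r:gz") as tar:
    tar.extractall(".")
```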
After extraction, the data can be converted to the layoutlmv3 input format using the conversion script provided by MindOCR:

```bash
python tools/dataset_converters/convert.py \
    --dataset_name publaynet \
    --image_dir publaynet/ \
    --output_path publaynet/
```
### 2.2 Model Conversion

Note: please install torch before running the conversion script:

```bash
pip install torch
```

Download the [layoutlmv3-base-finetuned-publaynet](https://huggingface.co/HYPJUDY/layoutlmv3-base-finetuned-publaynet) model to /path/to/layoutlmv3-base-finetuned-publaynet, and run:

```bash
python tools/param_converter_from_torch.py \
    --input_path /path/to/layoutlmv3-base-finetuned-publaynet/model_final.pt \
    --json_path configs/layout/layoutlmv3/layoutlmv3_publaynet_param_map.json \
    --output_path /path/to/layoutlmv3-base-finetuned-publaynet/from_torch.ckpt
```
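To sanity-check the conversion, the resulting checkpoint can be loaded back with MindSpore and a few parameter names inspected. This is a minimal sketch, assuming the output path from the command above:

```python
import mindspore as ms

# Load the checkpoint produced by param_converter_from_torch.py.
param_dict = ms.load_checkpoint(
    "/path/to/layoutlmv3-base-finetuned-publaynet/from_torch.ckpt"
)

# Print a handful of parameter names and shapes to confirm the mapping looks sane.
for i, (name, value) in enumerate(param_dict.items()):
    print(name, tuple(value.shape))
    if i >= 4:
        break
print("total parameters:", len(param_dict))
```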
### 2.3 Model Evaluation

The evaluation results on the public benchmark dataset (PubLayNet) are as follows:

Experiments were run on Ascend 910* with MindSpore 2.3.1 in PyNative mode.

<div align="center">

| **model name** | **cards** | **batch size** | **img/s** | **mAP** | **config** |
|----------------|-----------|----------------|-----------|---------|------------|
| LayoutLMv3     | 1         | 1              | 2.7       | 94.3%   | [yaml](https://github.com/mindspore-lab/mindocr/blob/main/configs/layout/layoutlmv3/layoutlmv3_publaynet.yaml) |

</div>
### 2.4 Model Inference

```bash
python tools/infer/text/predict_layout.py \
    --mode 1 \
    --image_dir {path_to_img} \
    --layout_algorithm LAYOUTLMV3 \
    --config {config_path}
```

By default, inference results are saved in the inference_results folder:

- layout_res.png (visualized inference results)
- layout_results.txt (inference results in text form)
### 2.5 Model Training

Coming soon.

## References
<!--- Guideline: Citation format GB/T 7714 is suggested. -->

[1] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. arXiv preprint arXiv:2204.08387, 2022.
@@ -0,0 +1,108 @@
[English](README.md) | 中文

# LayoutLMv3
<!--- Guideline: use url linked to abstract in ArXiv instead of PDF for fast loading. -->

> [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387)
> [Original Repo](https://github.com/microsoft/unilm/tree/master/layoutlmv3)

## 1. Model Description
<!--- Guideline: Introduce the model and architectures. Cite if you use/adopt paper explanation from others. -->

Unlike previous LayoutLM-series models, LayoutLMv3 does not rely on a complex CNN or Faster R-CNN network to represent images. Instead, it works directly on patches of the document image, which greatly reduces the number of parameters and avoids complex document preprocessing such as manual annotation of target region boxes and document object detection. Its simple, unified architecture and training objectives make LayoutLMv3 a general-purpose pretrained model suitable for both text-centric and image-centric document AI tasks.

Experimental results show that LayoutLMv3 achieves better performance with fewer parameters on the following datasets:

- Text-centric datasets: the FUNSD form understanding dataset, the CORD receipt understanding dataset, and the DocVQA document visual question answering dataset.
- Image-centric datasets: the RVL-CDIP document image classification dataset and the PubLayNet document layout analysis dataset.

LayoutLMv3 uses a text-image multimodal Transformer to learn cross-modal representations. Text embeddings are obtained by summing word embeddings, 1D positional embeddings, and 2D positional embeddings. The text of a document image and its corresponding 2D positional information (layout) are extracted with optical character recognition (OCR) tools. Because adjacent words in a text usually convey similar semantics, LayoutLMv3 shares 2D positional embeddings among adjacent words, whereas LayoutLM and LayoutLMv2 assign each word its own 2D positional embedding.

Image representations usually rely on grid features extracted by a CNN or region features extracted by Faster R-CNN, which increase computational cost or depend on region annotations. The authors instead obtain image features by linearly projecting image patches, a representation first proposed in ViT, which incurs minimal computational cost and requires no region annotations, effectively addressing the issues above. Concretely, the image is resized to a fixed size (e.g., 224x224), split into fixed-size patches (e.g., 16x16), and each patch is linearly projected to form a sequence of image features; a learnable 1D positional embedding is then added to obtain the image embeddings. [<a href="#references">1</a>]
<!--- Guideline: If an architecture table/figure is available in the paper, put one here and cite for intuitive illustration. -->

<p align="center">
  <img src=../../kie/layoutlmv3/layoutlmv3_arch.jpg width=1000 />
</p>
<p align="center">
  <em> Figure 1. LayoutLMv3 architecture [<a href="#references">1</a>] </em>
</p>
## 2. Quick Start

### 2.1 Environment and Data Preparation

| mindspore | ascend driver | firmware    | cann toolkit/kernel |
|:---------:|:-------------:|:-----------:|:-------------------:|
| 2.3.1     | 24.1.RC2      | 7.3.0.1.231 | 8.0.RC2.beta1       |

#### 2.1.1 Installation

For environment setup, please refer to the MindOCR [installation instructions](https://github.com/mindspore-lab/mindocr#installation).

#### 2.1.2 PubLayNet Dataset Preparation

PubLayNet is a dataset for document layout analysis. It contains images of research papers and articles, together with annotations for the various elements on a page, such as "text", "list", and "figure". The dataset was built by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central.

The training and validation sets of PubLayNet can be downloaded [here](https://dax-cdn.cdn.appdomain.cloud/dax-publaynet/1.0.0/publaynet.tar.gz).

Once the download is complete, the data can be converted to the layoutlmv3 input format using the conversion script provided by MindOCR:

```bash
python tools/dataset_converters/convert.py \
    --dataset_name publaynet \
    --image_dir publaynet/ \
    --output_path publaynet/
```
### 2.2 Model Conversion

Note: please install torch before running the conversion script:

```bash
pip install torch
```

Download the [layoutlmv3-base-finetuned-publaynet](https://huggingface.co/HYPJUDY/layoutlmv3-base-finetuned-publaynet) model to /path/to/layoutlmv3-base-finetuned-publaynet, and run:

```bash
python tools/param_converter_from_torch.py \
    --input_path /path/to/layoutlmv3-base-finetuned-publaynet/model_final.pt \
    --json_path configs/layout/layoutlmv3/layoutlmv3_publaynet_param_map.json \
    --output_path /path/to/layoutlmv3-base-finetuned-publaynet/from_torch.ckpt
```
### 2.3 Model Evaluation

The evaluation results on the public benchmark dataset (PubLayNet) are as follows:

Experiments were run on Ascend 910* with MindSpore 2.3.1 in PyNative mode.

<div align="center">

| **model name** | **cards** | **batch size** | **img/s** | **mAP** | **config** |
|----------------|-----------|----------------|-----------|---------|------------|
| LayoutLMv3     | 1         | 1              | 2.7       | 94.3%   | [yaml](https://github.com/mindspore-lab/mindocr/blob/main/configs/layout/layoutlmv3/layoutlmv3_publaynet.yaml) |

</div>

### 2.4 Model Inference

```bash
python tools/infer/text/predict_layout.py \
    --mode 1 \
    --image_dir {path_to_img} \
    --layout_algorithm LAYOUTLMV3 \
    --config {config_path}
```

By default, inference results are saved in the inference_results folder:

- layout_res.png (visualized inference results)
- layout_results.txt (inference results in text form)

### 2.5 Model Training

Coming soon.

## References
<!--- Guideline: Citation format GB/T 7714 is suggested. -->

[1] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. arXiv preprint arXiv:2204.08387, 2022.
@@ -0,0 +1,136 @@
system:
  mode: 1  # 0 for graph mode, 1 for pynative mode in MindSpore
  distribute: False
  amp_level: "O0"
  seed: 42
  log_interval: 10
  val_start_epoch: 50
  val_while_train: True
  drop_overflow_update: False

model:
  type: layout
  transform: null
  backbone:
    name: build_layoutlmv3_fpn_backbone
    out_features: ["layer3", "layer5", "layer7", "layer11"]
    fpn:
      in_features: ["layer3", "layer5", "layer7", "layer11"]
      norm: ""
      out_channels: 256
      fuse_type: sum
  neck:
    name: RPN
    in_features: ["p2", "p3", "p4", "p5", "p6"]
    pre_nms_topk_train: 2000
    pre_nms_topk_test: 1000
    feat_channel: 256
    anchor_generator:
      aspect_ratios: [0.5, 1.0, 2.0]
      anchor_sizes: [[32], [64], [128], [256], [512]]
      strides: [4, 8, 16, 32, 64]
    rpn_label_assignment:
      rpn_sample_batch: 256
      fg_fraction: 0.5
      negative_overlap: 0.3
      positive_overlap: 0.7
      use_random: True
    train_proposal:
      min_size: 0
      nms_thresh: 0.7
      pre_nms_top_n: 2000
      post_nms_top_n: 1000
    test_proposal:
      min_size: 0
      nms_thresh: 0.7
      pre_nms_top_n: 1000
      post_nms_top_n: 1000
  head:
    name: CascadeROIHeads
    mask_on: True
    in_features: ["p2", "p3", "p4", "p5"]
    num_classes: 5
    bbox_loss: None
    add_gt_as_proposals: True
    roi_extractor:
      featmap_strides: [4, 8, 16, 32]
    roi_box_head:
      cls_agnostic_bbox_reg: True
      name: FastRCNNConvFCHead
      conv_dims: []
      fc_dims: [1024, 1024]
      pooler_resolution: 7
      pooler_sampling_ratio: 0
      pooler_type: ROIAlignV2
      in_channel: 256
      out_channel: 1024
    roi_mask_head:
      name: MaskRCNNConvUpsampleHead
      conv_dims: [256, 256, 256, 256, 256]
      pooler_resolution: 14
      pooler_sampling_ratio: 0
      pooler_type: ROIAlignV2
      in_channel: 256
    roi_box_cascade_head:
      bbox_reg_weights: [[10.0, 10.0, 5.0, 5.0], [20.0, 20.0, 10.0, 10.0], [30.0, 30.0, 15.0, 15.0]]
      ious: [0.5, 0.6, 0.7]
    bbox_assigner:
      name: BBoxAssigner
      rois_per_batch: 512
      bg_thresh: 0.5
      fg_thresh: 0.5
      fg_fraction: 0.25
  pretrained:

postprocess:
  name: Layoutlmv3Postprocess
  conf_thres: 0.05
  iou_thres: 0.5
  conf_free: False
  multi_label: True
  time_limit: 100

metric:
  name: Layoutlmv3Metric
  annotations_path: &annotations_path publaynet/val.json

eval:
  ckpt_load_path: "from_torch.ckpt"
  dataset_sink_mode: False
  dataset:
    type: PublayNetDataset
    dataset_path: publaynet/val.txt
    annotations_path: *annotations_path
    img_size: 800
    model_name: layoutlmv3
    transform_pipeline:
      - func_name: letterbox
      - func_name: label_norm
        xyxy2xywh_: True
      - func_name: label_pad
        padding_size: 160
        padding_value: -1
      - func_name: image_normal
        mean: [ 127.5, 127.5, 127.5 ]
        std: [ 127.5, 127.5, 127.5 ]
      - func_name: image_transpose
        bgr2rgb: True
        hwc2chw: True
      - func_name: image_batch_pad
        max_size: 1333
    batch_size: &refine_batch_size 1
    stride: 64
    output_columns: ["image", "labels", "image_ids", "hw_ori", "hw_scale", "pad"]
    net_input_column_index: [0, 3, 4]  # input indices for network forward func in output_columns
    meta_data_column_index: [2, 3, 4, 5]  # input indices marked as label
  loader:
    shuffle: False
    batch_size: *refine_batch_size
    drop_remainder: False
    max_rowsize: 12
    num_workers: 1

predict:
  ckpt_load_path: "from_torch.ckpt"
  category_dict: {1: 'text', 2: 'title', 3: 'list', 4: 'table', 5: 'figure'}
  color_dict: {1: [255, 0, 0], 2: [0, 0, 255], 3: [0, 255, 0], 4: [0, 255, 255], 5: [255, 0, 255]}
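The configuration above can be loaded and spot-checked with PyYAML before launching evaluation or inference. A minimal sketch, assuming the file is saved at the path referenced in the evaluation table (configs/layout/layoutlmv3/layoutlmv3_publaynet.yaml):

```python
import yaml

# Load the layoutlmv3 PubLayNet config and inspect a few key fields.
with open("configs/layout/layoutlmv3/layoutlmv3_publaynet.yaml", "r") as f:
    cfg = yaml.safe_load(f)

print(cfg["system"]["mode"])            # 1 -> pynative mode
print(cfg["eval"]["ckpt_load_path"])    # "from_torch.ckpt"
print(cfg["predict"]["category_dict"])  # PubLayNet category id -> name mapping
```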