
Commit

Merge branch 'mindspore-lab:main' into main
panshaowu authored Dec 3, 2024
2 parents f4c3e63 + d3ea8a4 commit 71065ce
Showing 54 changed files with 4,355 additions and 877 deletions.
12 changes: 6 additions & 6 deletions README.md
@@ -43,12 +43,12 @@ MindOCR is an open-source toolbox for OCR development and application based on [
The following are the corresponding `mindocr` versions and the supported MindSpore versions.

| mindocr | mindspore |
|:-------:|:---------:|
| master | master |
| 0.4 | 2.3.0 |
| 0.3 | 2.2.10 |
| 0.1 | 1.8 |
| mindocr | mindspore |
|:-------:|:-----------:|
| main | master |
| 0.4 | 2.3.0/2.3.1 |
| 0.3 | 2.2.10 |
| 0.1 | 1.8 |


## Installation
12 changes: 6 additions & 6 deletions README_CN.md
Expand Up @@ -43,12 +43,12 @@ MindOCR是一个基于[MindSpore](https://www.mindspore.cn/en) 框架开发的OC

The following are the corresponding `mindocr` versions and the supported MindSpore versions.

| mindocr | mindspore |
|:-------:|:---------:|
| master | master |
| 0.4 | 2.3.0 |
| 0.3 | 2.2.10 |
| 0.1 | 1.8 |
| mindocr | mindspore |
|:-------:|:-----------:|
| main | master |
| 0.4 | 2.3.0/2.3.1 |
| 0.3 | 2.2.10 |
| 0.1 | 1.8 |


## Installation
263 changes: 107 additions & 156 deletions configs/det/dbnet/README.md

Large diffs are not rendered by default.

255 changes: 103 additions & 152 deletions configs/det/dbnet/README_CN.md

Large diffs are not rendered by default.

104 changes: 104 additions & 0 deletions configs/layout/layoutlmv3/README.md
@@ -0,0 +1,104 @@
English | [中文](README_CN.md)

# LayoutLMv3
<!--- Guideline: use url linked to abstract in ArXiv instead of PDF for fast loading. -->

> [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387)
> [Original Repo](https://github.com/microsoft/unilm/tree/master/layoutlmv3)
## 1. Introduction
Unlike previous LayoutLM series models, LayoutLMv3 does not rely on complex CNN or Faster R-CNN networks to represent images in its model architecture. Instead, it directly utilizes image blocks of document images, thereby greatly reducing parameters and avoiding complex document preprocessing such as manual annotation of target region boxes and document object detection. Its simple unified architecture and training objectives make LayoutLMv3 a versatile pretraining model suitable for both text-centric and image-centric document AI tasks.

The experimental results demonstrate that LayoutLMv3 achieves better performance with fewer parameters on the following datasets:

- Text-centric datasets: Form Understanding FUNSD dataset, Receipt Understanding CORD dataset, and Document Visual Question Answering DocVQA dataset.
- Image-centric datasets: Document Image Classification RVL-CDIP dataset and Document Layout Analysis PubLayNet dataset.

LayoutLMv3 also employs a text-image multimodal Transformer architecture to learn cross-modal representations. Text vectors are obtained by adding word vectors, one-dimensional positional vectors, and two-dimensional positional vectors of words. Text from document images and their corresponding two-dimensional positional information (layout information) are extracted using optical character recognition (OCR) tools. As adjacent words in text often convey similar semantics, LayoutLMv3 shares the two-dimensional positional vectors of adjacent words, while each word in LayoutLM and LayoutLMv2 has different two-dimensional positional vectors.

Image representations typically rely on grid features extracted by a CNN or region features extracted by Faster R-CNN, which increases computational cost or depends on region annotations. The authors therefore obtain image features by linearly projecting image patches, a representation first proposed in ViT, which incurs minimal computational cost and requires no region annotations, effectively addressing the issues above. Specifically, the image is first resized to a uniform size (e.g., 224x224) and divided into fixed-size patches (e.g., 16x16); the patches are linearly projected into an image feature sequence, and a learnable one-dimensional positional vector is added to obtain the image vectors.[[1](#references)]

<p align="center">
<img src="../../kie/layoutlmv3/layoutlmv3_arch.jpg" width="1000" />
</p>
<p align="center">
<em> Figure 1. LayoutLMv3 architecture [<a href="#references">1</a>] </em>
</p>
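
To make the patch-embedding step described above concrete, here is a minimal MindSpore sketch (illustrative only, not the MindOCR implementation; names such as `PatchEmbed` and the default `embed_dim=768` are assumptions) of how a 224x224 document image is split into 16x16 patches, linearly projected, and given a learnable one-dimensional positional embedding:

```python
# Minimal sketch of ViT-style patch embedding as used by LayoutLMv3-like models.
# Names (PatchEmbed, embed_dim) are illustrative, not the MindOCR implementation.
import mindspore as ms
import mindspore.nn as nn
import mindspore.ops as ops


class PatchEmbed(nn.Cell):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2  # 14 x 14 = 196 patches
        # A convolution with kernel_size == stride == patch_size is equivalent to
        # cutting the image into patches and applying one shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size,
                              stride=patch_size, pad_mode="valid", has_bias=True)
        # Learnable 1-D positional embedding added to every patch token.
        self.pos_embed = ms.Parameter(ops.zeros((1, num_patches, embed_dim), ms.float32))

    def construct(self, x):          # x: (B, 3, 224, 224)
        x = self.proj(x)             # (B, embed_dim, 14, 14)
        b, c, h, w = x.shape
        x = x.reshape(b, c, h * w).transpose(0, 2, 1)  # (B, 196, embed_dim)
        return x + self.pos_embed


if __name__ == "__main__":
    dummy = ops.zeros((1, 3, 224, 224), ms.float32)
    print(PatchEmbed()(dummy).shape)  # (1, 196, 768)
```

With these defaults, one image yields 14x14 = 196 patch tokens, which the multimodal Transformer consumes alongside the text tokens.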

## 2. Quick Start

### 2.1 Preparation

| mindspore | ascend driver | firmware | cann toolkit/kernel |
|:----------:|:---------------:|:------------:|:--------------------:|
| 2.3.1 | 24.1.RC2 | 7.3.0.1.231 | 8.0.RC2.beta1 |

#### 2.1.1 Installation
Please refer to the [installation instruction](https://github.com/mindspore-lab/mindocr#installation) in MindOCR.

#### 2.1.2 PubLayNet Dataset Preparation

PubLayNet is a dataset for document layout analysis. It contains images of research papers and articles, together with annotations for the various elements on each page, such as "text", "list", and "figure". The dataset was built by automatically matching the XML representations with the content of over one million PDF articles publicly available on PubMed Central.

The training and validation sets of PubLayNet can be downloaded [here](https://dax-cdn.cdn.appdomain.cloud/dax-publaynet/1.0.0/publaynet.tar.gz).

Once the download is complete, convert the data to the layoutlmv3 input format using the conversion script provided by MindOCR:

```bash
python tools/dataset_converters/convert.py \
--dataset_name publaynet \
--image_dir publaynet/ \
--output_path publaynet/
```
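
As an optional sanity check (a sketch only; the `publaynet/val.json` path is an assumption matching the extracted archive and the evaluation config), you can inspect the original COCO-style annotations and confirm they contain the five layout categories used below:

```python
# Sanity-check sketch (path is an assumption): inspect the COCO-style PubLayNet
# annotations and count annotated regions per layout category.
import json
from collections import Counter

with open("publaynet/val.json", "r") as f:
    coco = json.load(f)

id2name = {c["id"]: c["name"] for c in coco["categories"]}
counts = Counter(id2name[a["category_id"]] for a in coco["annotations"])

print(id2name)  # expected: {1: 'text', 2: 'title', 3: 'list', 4: 'table', 5: 'figure'}
print(counts)   # number of annotated regions per category in the validation split
```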

### 2.2 Model Conversion

Note: Please install torch before starting the conversion script
```bash
pip install torch
```

Download the [layoutlmv3-base-finetuned-publaynet](https://huggingface.co/HYPJUDY/layoutlmv3-base-finetuned-publaynet) model to /path/to/layoutlmv3-base-finetuned-publaynet, and run:

```bash
python tools/param_converter_from_torch.py \
--input_path /path/to/layoutlmv3-base-finetuned-publaynet/model_final.pt \
--json_path configs/layout/layoutlmv3/layoutlmv3_publaynet_param_map.json \
--output_path /path/to/layoutlmv3-base-finetuned-publaynet/from_torch.ckpt
```
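
To confirm the conversion succeeded, a quick check (a sketch; the output path mirrors the command above) is to load the resulting checkpoint with MindSpore and list a few parameters:

```python
# Sketch: confirm the converted checkpoint can be read by MindSpore and
# print a few parameter names and shapes.
import mindspore as ms

params = ms.load_checkpoint("/path/to/layoutlmv3-base-finetuned-publaynet/from_torch.ckpt")
print(f"{len(params)} parameters loaded")
for name, p in list(params.items())[:5]:
    print(name, tuple(p.shape))
```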

### 2.3 Model Evaluation
The evaluation results on the public benchmark dataset (PubLayNet) are as follows:

Experiments were run on Ascend 910* with MindSpore 2.3.1 in PyNative mode.
<div align="center">

| **model name** | **cards** | **batch size** | **img/s** | **mAP** | **config** |
|----------------|-----------|----------------|-----------|---------|----------------------------------------------------------------------------------------------------------------|
| LayoutLMv3 | 1 | 1 | 2.7 | 94.3% | [yaml](https://github.com/mindspore-lab/mindocr/blob/main/configs/layout/layoutlmv3/layoutlmv3_publaynet.yaml) |
</div>
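
For reference, the mAP reported above is the standard COCO-style box mAP on the PubLayNet validation set. A minimal sketch of how such a score can be reproduced with `pycocotools` is shown below (this is not necessarily how MindOCR's `Layoutlmv3Metric` is implemented internally, and `detections.json` is a hypothetical predictions file in COCO results format):

```python
# Sketch: score COCO-format layout detections against the PubLayNet validation
# annotations; summarize() prints AP@[0.50:0.95], i.e. the mAP quoted above.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("publaynet/val.json")            # ground-truth annotations
coco_dt = coco_gt.loadRes("detections.json")    # hypothetical predicted boxes
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()
```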

### 2.4 Model Inference

```bash
python tools/infer/text/predict_layout.py \
--mode 1 \
--image_dir {path_to_img} \
--layout_algorithm LAYOUTLMV3 \
--config {config_path}
```
By default, model inference results are saved in the `inference_results` folder:

- `layout_res.png`: model inference visualization results
- `layout_results.txt`: model inference text results

### 2.5 Model Training

coming soon

## References
<!--- Guideline: Citation format GB/T 7714 is suggested. -->

[1] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. arXiv preprint arXiv:2204.08387, 2022.
108 changes: 108 additions & 0 deletions configs/layout/layoutlmv3/README_CN.md
@@ -0,0 +1,108 @@
[English](README.md) | 中文

# LayoutLMv3
<!--- Guideline: use url linked to abstract in ArXiv instead of PDF for fast loading. -->

> [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387)
> [Original Repo](https://github.com/microsoft/unilm/tree/master/layoutlmv3)
## 1. Model Description
<!--- Guideline: Introduce the model and architectures. Cite if you use/adopt paper explanation from others. -->

Unlike earlier LayoutLM-series models, LayoutLMv3 does not rely on a complex CNN or Faster R-CNN network to represent images. Instead, it directly uses image patches of the document image, which greatly reduces the number of parameters and avoids complex document preprocessing such as manually annotating target region boxes or running document object detection. Its simple unified architecture and training objectives make LayoutLMv3 a general-purpose pretrained model that suits both text-centric and image-centric document AI tasks.

Experimental results show that LayoutLMv3 achieves better performance with fewer parameters on the following datasets:
- Text-centric datasets: the form understanding dataset FUNSD, the receipt understanding dataset CORD, and the document visual question answering dataset DocVQA.
- Image-centric datasets: the document image classification dataset RVL-CDIP and the document layout analysis dataset PubLayNet.

LayoutLMv3 also adopts a text-image multimodal Transformer architecture to learn cross-modal representations. A text vector is obtained by summing the word embedding, the one-dimensional position embedding, and the two-dimensional position embedding of a word. The text of a document image and its corresponding two-dimensional position information (layout information) are extracted with optical character recognition (OCR) tools. Because adjacent words in a text usually express similar semantics, LayoutLMv3 shares the two-dimensional position embedding among adjacent words, whereas LayoutLM and LayoutLMv2 assign a different two-dimensional position embedding to each word.

Image representations usually rely on grid features extracted by a CNN or region features extracted by Faster R-CNN, which increases computational cost or depends on region annotations. The authors therefore obtain image features by linearly projecting image patches, a representation first proposed in ViT, which incurs minimal computational cost and requires no region annotations, effectively addressing the issues above. Specifically, the image is first resized to a uniform size (e.g., 224x224) and divided into fixed-size patches (e.g., 16x16); the patches are linearly projected into an image feature sequence, and a learnable one-dimensional position embedding is added to obtain the image vectors. [<a href="#references">1</a>]

<!--- Guideline: If an architecture table/figure is available in the paper, put one here and cite for intuitive illustration. -->

<p align="center">
<img src="../../kie/layoutlmv3/layoutlmv3_arch.jpg" width="1000" />
</p>
<p align="center">
<em> Figure 1. LayoutLMv3 architecture [<a href="#references">1</a>] </em>
</p>


## 2. Quick Start

### 2.1 Environment and Data Preparation

| mindspore | ascend driver | firmware | cann toolkit/kernel |
|:----------:|:---------------:|:------------:|:--------------------:|
| 2.3.1 | 24.1.RC2 | 7.3.0.1.231 | 8.0.RC2.beta1 |

#### 2.1.1 Installation
Please refer to the MindOCR [installation instruction](https://github.com/mindspore-lab/mindocr#installation).

#### 2.1.2 PubLayNet Dataset Preparation

PubLayNet is a dataset for document layout analysis. It contains images of research papers and articles, together with annotations for the various elements on each page, such as "text", "list", and "figure". The dataset was built by automatically matching the XML representations with the content of over one million PDF articles publicly available on PubMed Central.

The training and validation sets of PubLayNet can be downloaded [here](https://dax-cdn.cdn.appdomain.cloud/dax-publaynet/1.0.0/publaynet.tar.gz).

Once the download is complete, convert the data to the layoutlmv3 input format using the conversion script provided by MindOCR:

```bash
python tools/dataset_converters/convert.py \
--dataset_name publaynet \
--image_dir publaynet/ \
--output_path publaynet/
```

### 2.2 Model Conversion

Note: please install torch before running the conversion script
```bash
pip install torch
```

Download the [layoutlmv3-base-finetuned-publaynet](https://huggingface.co/HYPJUDY/layoutlmv3-base-finetuned-publaynet) model to /path/to/layoutlmv3-base-finetuned-publaynet, and run:

```bash
python tools/param_converter_from_torch.py \
--input_path /path/to/layoutlmv3-base-finetuned-publaynet/model_final.pt \
--json_path configs/layout/layoutlmv3/layoutlmv3_publaynet_param_map.json \
--output_path /path/to/layoutlmv3-base-finetuned-publaynet/from_torch.ckpt
```

### 2.3 Model Evaluation
The evaluation results on the public benchmark dataset (PubLayNet) are as follows:

Experiments were run on Ascend 910* with MindSpore 2.3.1 in PyNative mode.
<div align="center">

| **model name** | **cards** | **batch size** | **img/s** | **mAP** | **config** |
|----------------|-----------|----------------|-----------|---------|----------------------------------------------------------------------------------------------------------------|
| LayoutLMv3 | 1 | 1 | 2.7 | 94.3% | [yaml](https://github.com/mindspore-lab/mindocr/blob/main/configs/layout/layoutlmv3/layoutlmv3_publaynet.yaml) |
</div>

### 2.4 Model Inference

```bash
python tools/infer/text/predict_layout.py \
--mode 1 \
--image_dir {path_to_img} \
--layout_algorithm LAYOUTLMV3 \
--config {config_path}
```
By default, model inference results are saved in the `inference_results` folder:

- `layout_res.png`: model inference visualization results
- `layout_results.txt`: model inference text results

### 2.5 Model Training

coming soon

## References
<!--- Guideline: Citation format GB/T 7714 is suggested. -->

[1] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. arXiv preprint arXiv:2204.08387, 2022.
136 changes: 136 additions & 0 deletions configs/layout/layoutlmv3/layoutlmv3_publaynet.yaml
@@ -0,0 +1,136 @@
system:
  mode: 1 # 0 for graph mode, 1 for pynative mode in MindSpore
  distribute: False
  amp_level: "O0"
  seed: 42
  log_interval: 10
  val_start_epoch: 50
  val_while_train: True
  drop_overflow_update: False

model:
  type: layout
  transform: null
  backbone:
    name: build_layoutlmv3_fpn_backbone
    out_features: ["layer3", "layer5", "layer7", "layer11"]
    fpn:
      in_features: ["layer3", "layer5", "layer7", "layer11"]
      norm: ""
      out_channels: 256
      fuse_type: sum
  neck:
    name: RPN
    in_features: ["p2", "p3", "p4", "p5", "p6"]
    pre_nms_topk_train: 2000
    pre_nms_topk_test: 1000
    feat_channel: 256
    anchor_generator:
      aspect_ratios: [0.5, 1.0, 2.0]
      anchor_sizes: [[32], [64], [128], [256], [512]]
      strides: [4, 8, 16, 32, 64]
    rpn_label_assignment:
      rpn_sample_batch: 256
      fg_fraction: 0.5
      negative_overlap: 0.3
      positive_overlap: 0.7
      use_random: True
    train_proposal:
      min_size: 0
      nms_thresh: 0.7
      pre_nms_top_n: 2000
      post_nms_top_n: 1000
    test_proposal:
      min_size: 0
      nms_thresh: 0.7
      pre_nms_top_n: 1000
      post_nms_top_n: 1000
  head:
    name: CascadeROIHeads
    mask_on: True
    in_features: ["p2", "p3", "p4", "p5"]
    num_classes: 5
    bbox_loss: None
    add_gt_as_proposals: True
    roi_extractor:
      featmap_strides: [4, 8, 16, 32]
    roi_box_head:
      cls_agnostic_bbox_reg: True
      name: FastRCNNConvFCHead
      conv_dims: []
      fc_dims: [1024, 1024]
      pooler_resolution: 7
      pooler_sampling_ratio: 0
      pooler_type: ROIAlignV2
      in_channel: 256
      out_channel: 1024
    roi_mask_head:
      name: MaskRCNNConvUpsampleHead
      conv_dims: [256, 256, 256, 256, 256]
      pooler_resolution: 14
      pooler_sampling_ratio: 0
      pooler_type: ROIAlignV2
      in_channel: 256
    roi_box_cascade_head:
      bbox_reg_weights: [[10.0, 10.0, 5.0, 5.0], [20.0, 20.0, 10.0, 10.0], [30.0, 30.0, 15.0, 15.0]]
      ious: [0.5, 0.6, 0.7]
    bbox_assigner:
      name: BBoxAssigner
      rois_per_batch: 512
      bg_thresh: 0.5
      fg_thresh: 0.5
      fg_fraction: 0.25
  pretrained:

postprocess:
  name: Layoutlmv3Postprocess
  conf_thres: 0.05
  iou_thres: 0.5
  conf_free: False
  multi_label: True
  time_limit: 100

metric:
  name: Layoutlmv3Metric
  annotations_path: &annotations_path publaynet/val.json

eval:
  ckpt_load_path: "from_torch.ckpt"
  dataset_sink_mode: False
  dataset:
    type: PublayNetDataset
    dataset_path: publaynet/val.txt
    annotations_path: *annotations_path
    img_size: 800
    model_name: layoutlmv3
    transform_pipeline:
      - func_name: letterbox
      - func_name: label_norm
        xyxy2xywh_: True
      - func_name: label_pad
        padding_size: 160
        padding_value: -1
      - func_name: image_normal
        mean: [ 127.5, 127.5, 127.5 ]
        std: [ 127.5, 127.5, 127.5 ]
      - func_name: image_transpose
        bgr2rgb: True
        hwc2chw: True
      - func_name: image_batch_pad
        max_size: 1333
    batch_size: &refine_batch_size 1
    stride: 64
    output_columns: ["image", "labels", "image_ids", "hw_ori", "hw_scale", "pad"]
    net_input_column_index: [0, 3, 4] # input indices for network forward func in output_columns
    meta_data_column_index: [2, 3, 4, 5] # input indices marked as label
  loader:
    shuffle: False
    batch_size: *refine_batch_size
    drop_remainder: False
    max_rowsize: 12
    num_workers: 1

predict:
  ckpt_load_path: "from_torch.ckpt"
  category_dict: {1: 'text', 2: 'title', 3: 'list', 4: 'table', 5: 'figure'}
  color_dict: {1: [255, 0, 0], 2: [0, 0, 255], 3: [0, 255, 0], 4: [0, 255, 255], 5: [255, 0, 255]}
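
For reference, a minimal sketch of reading this config with PyYAML (assuming it is saved as `configs/layout/layoutlmv3/layoutlmv3_publaynet.yaml`); note that the `&annotations_path`/`*annotations_path` and `&refine_batch_size`/`*refine_batch_size` anchors resolve to plain values when parsed:

```python
# Sketch: load the config and show that the YAML anchors/aliases above resolve
# to ordinary values once parsed.
import yaml  # requires PyYAML

with open("configs/layout/layoutlmv3/layoutlmv3_publaynet.yaml", "r") as f:
    cfg = yaml.safe_load(f)

print(cfg["metric"]["annotations_path"])           # publaynet/val.json
print(cfg["eval"]["dataset"]["annotations_path"])  # same value, via *annotations_path
print(cfg["eval"]["loader"]["batch_size"])         # 1, via *refine_batch_size
```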