Showing 9 changed files with 952 additions and 6 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,19 +25,55 @@ Following, an updated view of the cascade architecture: | |
|
||
At the moment, the flavored processes are available as follows: | ||
|
||
| Identifier | Flavored models | Description | Advantages and Limitations | | ||
|-----------------------|---------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| `article/light` | `segmentation`, `header` | Simple process that extracts only the title, authors, publication date and DOI from the header, and puts everything else in the body | Simple model that can work with any document and brings the advantage of pdfalto processing, which solves many issues with text ordering and column recognition. The limitation is that all noise that is not part of the article, such as references, page numbers, headnotes, and footnotes, is also included in the body. | | ||
| `article/light-ref` | `segmentation`, `header` | Simple process that extracts only the title, authors, publication date and DOI from the header, plus the references, and puts everything else in the body | Variation of `article/light` that includes the recognition of references. More versatile than `article/light` across variants of scientific articles, such as corrections, errata, and letters, which may contain references. | | ||
| Name | Identifier | Flavored models | Description | Advantages and Limitations | | ||
|-----------------------------------------------|-----------------------|---------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| Article lightweight structure | `article/light` | `segmentation`, `header` | Simple process that extracts only the title, authors, publication date and DOI from the header, and puts everything else in the body | Simple model that can work with any document and brings the advantage of pdfalto processing, which solves many issues with text ordering and column recognition. The limitation is that all noise that is not part of the article, such as references, page numbers, headnotes, and footnotes, is also included in the body. | | ||
| Article lightweight structure with references | `article/light-ref` | `segmentation`, `header` | Simple process that extracts only the title, authors, publication date and DOI from the header, plus the references, and puts everything else in the body | Variation of `article/light` that includes the recognition of references. More versatile than `article/light` across variants of scientific articles, such as corrections, errata, and letters, which may contain references. | | ||
|
||
## Benchmarking | ||
|
||
The evaluation of the flavors is performed in the same way as the standard processing for scientific articles. | ||
However, the evaluation is performed on a reduced set of fields: | ||
The evaluation of the flavors is performed in the same way as the standard processing for scientific articles: | ||
|
||
- **BidLSTM_ChainCRF_FEATURES** as sequence labeling for the header model | ||
|
||
- **BidLSTM_ChainCRF_FEATURES** as sequence labeling for the reference-segmenter model | ||
|
||
- **BidLSTM_CRF_FEATURES** as sequence labeling for the citation model | ||
|
||
- **BidLSTM_CRF_FEATURES** as sequence labeling for the affiliation-address model | ||
|
||
- **CRF Wapiti** as sequence labelling engine for all other models. | ||
|
||
Header extractions are consolidated by default with the [biblio-glutton](https://github.com/kermitt2/biblio-glutton) service (results with the CrossRef REST API as consolidation service should be similar, but much slower to obtain). | ||
|
||
The evaluation, which usually creates Grobid files with the suffix `fulltext.tei.xml`, also includes the flavor in the suffix; for example, files for `article/light` are suffixed with `article_light.tei.xml`. | ||
This makes it possible to run the evaluation for multiple flavors without losing the Grobid-processed files. | ||
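
As a rough illustration of the naming rule described above, the following minimal Python sketch derives the suffix from a flavor identifier. The function name `flavored_tei_suffix` is hypothetical and is not part of Grobid; it only assumes that the `/` in the identifier is replaced by `_`, which is the one case the text documents. Whether other characters (such as `-`) are also normalized is not stated here.

```python
def flavored_tei_suffix(flavor: str) -> str:
    """Hypothetical helper: build the output file suffix for a flavor identifier."""
    # Assumption: only the "/" separator is rewritten, as in the documented
    # example `article/light` -> `article_light.tei.xml`.
    return flavor.replace("/", "_") + ".tei.xml"


print(flavored_tei_suffix("article/light"))
```

Because each flavor gets its own suffix, the output files of one flavor do not overwrite those of another in the same directory.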
|
||
The evaluation is performed on a reduced set of fields: | ||
|
||
| Flavor | Header fields | Fulltext fields | Citation fields | | ||
|---------------------|--------------------------------------|-----------------|----------------------------------| | ||
| `article/light` | `title`, `first author`, `authors` | N/A | N/A | | ||
| `article/light-ref` | `title`, `first author`, `authors` | N/A | Same as the standard processing* | | ||
|
||
(*) For this flavor, the citation fields are included to detect regressions, since citation parsing is performed using the standard citation model. | ||
|
||
The benchmark results are listed below, with links to the full reports. | ||
|
||
### Article lightweight structure | ||
|
||
| Corpus | Header (avg. micro F1 Ratcliff/Obershelp@0.95) | Full report | | ||
|-----------------|------------------------------------------------|----------------------------------------------------------------------------------| | ||
| Bioxiv | 89.4 | [benchmaking-bioxiv.md](benchmarks/flavors/article_light/benchmaking-bioxiv.md) | | ||
| PMC_sample_1943 | 95.71 | [benchmaking-pmc.md](benchmarks/flavors/article_light/benchmaking-pmc.md) | | ||
| PLOS_1000 | 99.37 | [benchmaking-plos.md](benchmarks/flavors/article_light/benchmaking-plos.md) | | ||
| eLife_984 | 88.73 | [benchmaking-elife.md](benchmarks/flavors/article_light/benchmaking-elife.md) | | ||
|
||
### Article lightweight structure with references | ||
|
||
| Corpus | Header (avg. micro F1 Ratcliff/Obershelp@0.95) | Citations (Instance-level f-score (RatcliffObershelp)) | Full report | | ||
|-----------------|------------------------------------------------|--------------------------------------------------------|-------------------------------------------------------------------------------------| | ||
| Bioxiv | 89.79 | 56.31 | [benchmaking-bioxiv.md](benchmarks/flavors/article_light_ref/benchmaking-bioxiv.md) | | ||
| PMC_sample_1943 | 95.74 | 58.78 | [benchmaking-pmc.md](benchmarks/flavors/article_light_ref/benchmaking-pmc.md) | | ||
| PLOS_1000 | 99.52 | 48.04 | [benchmaking-plos.md](benchmarks/flavors/article_light_ref/benchmaking-plos.md) | | ||
| eLife_984 | 91.35 | 76.14 | [benchmaking-elife.md](benchmarks/flavors/article_light_ref/benchmaking-elife.md) | |
72 changes: 72 additions & 0 deletions
doc/benchmarks/flavors/article_light/benchmaking-bioxiv.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
## Header metadata | ||
|
||
Evaluation on 1996 random PDF files out of 1998 PDF (ratio 1.0). | ||
|
||
#### Strict Matching (exact matches) | ||
|
||
**Field-level results** | ||
|
||
| label | precision | recall | f1 | support | | ||
|-----------------------------|-----------|-----------|-----------|---------| | ||
| authors | 82.92 | 81.5 | 82.2 | 1995 | | ||
| first_author | 96.33 | 94.78 | 95.55 | 1993 | | ||
| title | 78.16 | 73.7 | 75.86 | 1996 | | ||
| | | | | | | ||
| **all fields (micro avg.)** | **85.91** | **83.32** | **84.59** | 5984 | | ||
| all fields (macro avg.) | 85.8 | 83.33 | 84.54 | 5984 | | ||
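
To make the two summary rows easier to read, here is a small Python sketch of the standard definitions, applied to the strict-matching numbers in the table above. It assumes the usual conventions: the macro average is the plain mean of the per-field scores, and F1 is the harmonic mean of precision and recall. The helper names are illustrative and are not taken from Grobid's evaluation code.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean over fields, regardless of each field's support."""
    return sum(values) / len(values)


# Per-field (precision, recall, f1) copied from the strict-matching table.
strict = {
    "authors": (82.92, 81.5, 82.2),
    "first_author": (96.33, 94.78, 95.55),
    "title": (78.16, 73.7, 75.86),
}

macro_p = macro_average([p for p, _, _ in strict.values()])
macro_r = macro_average([r for _, r, _ in strict.values()])
macro_f = macro_average([f for _, _, f in strict.values()])
print(round(macro_p, 2), round(macro_r, 2), round(macro_f, 2))

# The micro-averaged F1 follows from the micro-averaged precision and recall.
print(round(f1(85.91, 83.32), 2))
```

The micro-averaged precision and recall themselves pool the counts of all fields, so they cannot be recomputed from the rounded per-field percentages alone.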
|
||
#### Soft Matching (ignoring punctuation, case and space characters mismatches) | ||
|
||
**Field-level results** | ||
|
||
| label | precision | recall | f1 | support | | ||
|-----------------------------|-----------|-----------|----------|---------| | ||
| authors | 83.53 | 82.11 | 82.81 | 1995 | | ||
| first_author | 96.63 | 95.08 | 95.85 | 1993 | | ||
| title | 80.66 | 76.05 | 78.29 | 1996 | | ||
| | | | | | | ||
| **all fields (micro avg.)** | **87.03** | **84.41** | **85.7** | 5984 | | ||
| all fields (macro avg.) | 86.94 | 84.41 | 85.65 | 5984 | | ||
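
Soft matching, as described in the heading above, can be sketched as a comparison after normalization. The Python snippet below is only an approximation under stated assumptions: it strips ASCII punctuation and whitespace and lowercases the text, whereas the normalization actually used by the evaluation may differ (for example in its handling of Unicode punctuation). The names `soft_normalize` and `soft_match` are hypothetical.

```python
import string

# Translation table that deletes ASCII punctuation and whitespace characters.
_DROP = str.maketrans("", "", string.punctuation + string.whitespace)


def soft_normalize(text: str) -> str:
    """Remove punctuation and spaces, then lowercase (assumed normalization)."""
    return text.translate(_DROP).lower()


def soft_match(expected: str, predicted: str) -> bool:
    """True when the two strings are identical after normalization."""
    return soft_normalize(expected) == soft_normalize(predicted)


print(soft_match("Deep Learning: A Review.", "deep learning a review"))
```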
|
||
#### Levenshtein Matching (Minimum Levenshtein distance at 0.8) | ||
|
||
**Field-level results** | ||
|
||
| label | precision | recall | f1 | support | | ||
|-----------------------------|-----------|-----------|-----------|---------| | ||
| authors | 91.59 | 90.03 | 90.8 | 1995 | | ||
| first_author | 96.84 | 95.28 | 96.05 | 1993 | | ||
| title | 92.03 | 86.77 | 89.32 | 1996 | | ||
| | | | | | | ||
| **all fields (micro avg.)** | **93.5** | **90.69** | **92.08** | 5984 | | ||
| all fields (macro avg.) | 93.48 | 90.69 | 92.06 | 5984 | | ||
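
The 0.8 threshold in the heading above is most naturally read as a minimum normalized similarity rather than a raw edit distance. The Python sketch below illustrates that reading. It assumes the similarity is `1 - distance / max(len)`, which is a common convention but is not confirmed by this document, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_match(expected: str, predicted: str, threshold: float = 0.8) -> bool:
    """Assumed rule: normalized similarity must reach the threshold."""
    longest = max(len(expected), len(predicted))
    if longest == 0:
        return True  # two empty strings are considered identical
    similarity = 1 - levenshtein(expected, predicted) / longest
    return similarity >= threshold


print(levenshtein("kitten", "sitting"))
```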
|
||
#### Ratcliff/Obershelp Matching (Minimum Ratcliff/Obershelp similarity at 0.95) | ||
|
||
**Field-level results** | ||
|
||
| label | precision | recall | f1 | support | | ||
|-----------------------------|-----------|-----------|----------|---------| | ||
| authors | 87.51 | 86.02 | 86.75 | 1995 | | ||
| first_author | 96.33 | 94.78 | 95.55 | 1993 | | ||
| title | 88.42 | 83.37 | 85.82 | 1996 | | ||
| | | | | | | ||
| **all fields (micro avg.)** | **90.78** | **88.05** | **89.4** | 5984 | | ||
| all fields (macro avg.) | 90.75 | 88.05 | 89.37 | 5984 | | ||
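
For readers who want to experiment with this kind of matching, Python's standard library offers `difflib.SequenceMatcher`, whose `ratio()` is based on an algorithm similar to Ratcliff/Obershelp "gestalt pattern matching". The sketch below applies the 0.95 threshold from the heading above. It is not the implementation used by the evaluation, and its scores may differ slightly from it; the function name is hypothetical.

```python
from difflib import SequenceMatcher


def ratcliff_obershelp_match(expected: str, predicted: str,
                             threshold: float = 0.95) -> bool:
    """Approximate Ratcliff/Obershelp matching via difflib (an assumption)."""
    # autojunk=False disables difflib's heuristic that ignores very frequent
    # characters in long sequences, which would distort the similarity.
    matcher = SequenceMatcher(None, expected, predicted, autojunk=False)
    return matcher.ratio() >= threshold


reference = "The quick brown fox jumps over the lazy dog"
candidate = "The quick brown fox jumps over the lazy dig"
print(ratcliff_obershelp_match(reference, candidate))
```

A single-character difference in a title of this length still passes the threshold, whereas shorter strings tolerate proportionally fewer differences.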
|
||
#### Instance-level results | ||
|
||
``` | ||
Total expected instances: 1996 | ||
Total correct instances: 1278 (strict) | ||
Total correct instances: 1312 (soft) | ||
Total correct instances: 1613 (Levenshtein) | ||
Total correct instances: 1496 (ObservedRatcliffObershelp) | ||
Instance-level recall: 64.03 (strict) | ||
Instance-level recall: 65.73 (soft) | ||
Instance-level recall: 80.81 (Levenshtein) | ||
Instance-level recall: 74.95 (RatcliffObershelp) | ||
``` | ||
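
The instance-level recall values above can be reproduced by dividing the number of correct instances by the number of expected instances. The short Python check below uses the counts from the block above; the helper name `instance_recall` is illustrative.

```python
def instance_recall(correct: int, expected: int) -> float:
    """Share of expected instances that were fully correct, in percent."""
    return 100.0 * correct / expected if expected else 0.0


expected_instances = 1996
correct_instances = {
    "strict": 1278,
    "soft": 1312,
    "Levenshtein": 1613,
    "RatcliffObershelp": 1496,
}

for matching, count in correct_instances.items():
    print(f"{matching}: {instance_recall(count, expected_instances):.2f}")
```

An instance counts as correct only when all evaluated fields of a document match, which is why these figures are lower than the field-level scores.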
|
||
Evaluation metrics produced in 15.364 seconds |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
## Header metadata | ||
|
||
Evaluation on 957 random PDF files out of 982 PDF (ratio 1.0). | ||
|
||
#### Strict Matching (exact matches) | ||
|
||
**Field-level results** | ||
|
||
| label | precision | recall | f1 | support | | ||
|-----------------------------|-----------|-----------|-----------|---------| | ||
| authors | 78.74 | 78.16 | 78.45 | 957 | | ||
| first_author | 92 | 91.42 | 91.71 | 956 | | ||
| title | 89.92 | 87.67 | 88.78 | 957 | | ||
| | | | | | | ||
| **all fields (micro avg.)** | **86.87** | **85.75** | **86.31** | 2870 | | ||
| all fields (macro avg.) | 86.89 | 85.75 | 86.31 | 2870 | | ||
|
||
#### Soft Matching (ignoring punctuation, case and space characters mismatches) | ||
|
||
**Field-level results** | ||
|
||
| label | precision | recall | f1 | support | | ||
|-----------------------------|-----------|-----------|-----------|---------| | ||
| authors | 79.05 | 78.47 | 78.76 | 957 | | ||
| first_author | 92 | 91.42 | 91.71 | 956 | | ||
| title | 97 | 94.57 | 95.77 | 957 | | ||
| | | | | | | ||
| **all fields (micro avg.)** | **89.3** | **88.15** | **88.73** | 2870 | | ||
| all fields (macro avg.) | 89.35 | 88.15 | 88.75 | 2870 | | ||
|
||
#### Levenshtein Matching (Minimum Levenshtein distance at 0.8) | ||
|
||
**Field-level results** | ||
|
||
| label | precision | recall | f1 | support | | ||
|-----------------------------|-----------|-----------|-----------|---------| | ||
| authors | 90.53 | 89.86 | 90.19 | 957 | | ||
| first_author | 92.32 | 91.74 | 92.03 | 956 | | ||
| title | 98.5 | 96.03 | 97.25 | 957 | | ||
| | | | | | | ||
| **all fields (micro avg.)** | **93.75** | **92.54** | **93.14** | 2870 | | ||
| all fields (macro avg.) | 93.78 | 92.54 | 93.16 | 2870 | | ||
|
||
#### Ratcliff/Obershelp Matching (Minimum Ratcliff/Obershelp similarity at 0.95) | ||
|
||
**Field-level results** | ||
|
||
| label | precision | recall | f1 | support | | ||
|-----------------------------|-----------|-----------|-----------|---------| | ||
| authors | 84.32 | 83.7 | 84.01 | 957 | | ||
| first_author | 92 | 91.42 | 91.71 | 956 | | ||
| title | 98.5 | 96.03 | 97.25 | 957 | | ||
| | | | | | | ||
| **all fields (micro avg.)** | **91.56** | **90.38** | **90.97** | 2870 | | ||
| all fields (macro avg.) | 91.61 | 90.38 | 90.99 | 2870 | | ||
|
||
#### Instance-level results | ||
|
||
``` | ||
Total expected instances: 957 | ||
Total correct instances: 678 (strict) | ||
Total correct instances: 729 (soft) | ||
Total correct instances: 811 (Levenshtein) | ||
Total correct instances: 773 (ObservedRatcliffObershelp) | ||
Instance-level recall: 70.85 (strict) | ||
Instance-level recall: 76.18 (soft) | ||
Instance-level recall: 84.74 (Levenshtein) | ||
Instance-level recall: 80.77 (RatcliffObershelp) | ||
``` | ||
|
||
Evaluation metrics produced in 13.732 seconds |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
## Header metadata | ||
|
||
Evaluation on 1000 random PDF files out of 998 PDF (ratio 1.0). | ||
|
||
#### Strict Matching (exact matches) | ||
|
||
**Field-level results** | ||
|
||
| label | precision | recall | f1 | support | | ||
|-----------------------------|-----------|-----------|-----------|---------| | ||
| authors | 98.97 | 99.28 | 99.12 | 969 | | ||
| first_author | 99.28 | 99.59 | 99.43 | 969 | | ||
| title | 95.79 | 95.5 | 95.64 | 1000 | | ||
| | | | | | | ||
| **all fields (micro avg.)** | **97.99** | **98.09** | **98.04** | 2938 | | ||
| all fields (macro avg.) | 98.01 | 98.12 | 98.07 | 2938 | | ||
|
||
#### Soft Matching (ignoring punctuation, case and space characters mismatches) | ||
|
||
**Field-level results** | ||
|
||
| label | precision | recall | f1 | support | | ||
|-----------------------------|-----------|-----------|-----------|---------| | ||
| authors | 98.97 | 99.28 | 99.12 | 969 | | ||
| first_author | 99.28 | 99.59 | 99.43 | 969 | | ||
| title | 99.3 | 99 | 99.15 | 1000 | | ||
| | | | | | | ||
| **all fields (micro avg.)** | **99.18** | **99.29** | **99.23** | 2938 | | ||
| all fields (macro avg.) | 99.18 | 99.29 | 99.24 | 2938 | | ||
|
||
#### Levenshtein Matching (Minimum Levenshtein distance at 0.8) | ||
|
||
**Field-level results** | ||
|
||
| label | precision | recall | f1 | support | | ||
|-----------------------------|-----------|-----------|-----------|---------| | ||
| authors | 99.28 | 99.59 | 99.43 | 969 | | ||
| first_author | 99.38 | 99.69 | 99.54 | 969 | | ||
| title | 99.7 | 99.4 | 99.55 | 1000 | | ||
| | | | | | | ||
| **all fields (micro avg.)** | **99.46** | **99.56** | **99.51** | 2938 | | ||
| all fields (macro avg.) | 99.45 | 99.56 | 99.51 | 2938 | | ||
|
||
#### Ratcliff/Obershelp Matching (Minimum Ratcliff/Obershelp similarity at 0.95) | ||
|
||
**Field-level results** | ||
|
||
| label | precision | recall | f1 | support | | ||
|-----------------------------|-----------|-----------|-----------|---------| | ||
| authors | 99.18 | 99.48 | 99.33 | 969 | | ||
| first_author | 99.28 | 99.59 | 99.43 | 969 | | ||
| title | 99.5 | 99.2 | 99.35 | 1000 | | ||
| | | | | | | ||
| **all fields (micro avg.)** | **99.32** | **99.42** | **99.37** | 2938 | | ||
| all fields (macro avg.) | 99.32 | 99.42 | 99.37 | 2938 | | ||
|
||
#### Instance-level results | ||
|
||
``` | ||
Total expected instances: 1000 | ||
Total correct instances: 950 (strict) | ||
Total correct instances: 985 (soft) | ||
Total correct instances: 989 (Levenshtein) | ||
Total correct instances: 988 (ObservedRatcliffObershelp) | ||
Instance-level recall: 95 (strict) | ||
Instance-level recall: 98.5 (soft) | ||
Instance-level recall: 98.9 (Levenshtein) | ||
Instance-level recall: 98.8 (RatcliffObershelp) | ||
``` | ||
|
||
Evaluation metrics produced in 12.571 seconds |