Merge pull request #92 from Thomas-Rowlands/PMC_Config_Update
Pmc config update
Thomas-Rowlands authored Dec 6, 2024
2 parents cb897cc + c5d557e commit 2290832
Showing 12 changed files with 3,228 additions and 17 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -24,13 +24,13 @@ pip install autocorpus
Run the below command for a single file example

```sh
-auto-corpus -c "configs/config_pmc.json" -t "output" -f "path/to/html/file" -o JSON
+auto-corpus -c "autocorpus/configs/config_pmc.json" -t "output" -f "path/to/html/file" -o JSON
```

Run the main app for a directory of files example

```sh
-auto-corpus -c "configs/config_pmc.json" -t "output" -f "path/to/directory/of/html/files" -o JSON
+auto-corpus -c "autocorpus/configs/config_pmc.json" -t "output" -f "path/to/directory/of/html/files" -o JSON
```

### Available arguments
@@ -46,7 +46,7 @@ auto-corpus -c "configs/config_pmc.json" -t "output" -f "path/to/directory/of/ht

If you wish to contribute or edit a config file then please follow the instructions in the [config guide](docs/config_tutorial.md).

-Auto-CORPus is able to parse HTML from different publishers, which utilise different HTML structures and naming conventions. This is made possible by the inclusion of config files which tell Auto-CORPus how to identify specific sections of the article/table within the source HTML. We have supplied a config template along with example config files for [PubMed Central](configs/config_pmc.json), [Plos Genetics](configs/config_plos_genetics.json) and [Nature Genetics](configs/config_nature_genetics.json) in the [configs](configs) directory. Users of Auto-CORPus can submit their own config files for different sources via the [issues](https://github.com/omicsNLP/Auto-CORPus/issues) tab.
+Auto-CORPus is able to parse HTML from different publishers, which utilise different HTML structures and naming conventions. This is made possible by the inclusion of config files which tell Auto-CORPus how to identify specific sections of the article/table within the source HTML. We have supplied a config template along with example config files for [PubMed Central](autocorpus/configs/config_pmc.json), [Plos Genetics](autocorpus/configs/config_plos_genetics.json) and [Nature Genetics](autocorpus/configs/config_nature_genetics.json) in the [configs](autocorpus/configs) directory. Users of Auto-CORPus can submit their own config files for different sources via the [issues](https://github.com/omicsNLP/Auto-CORPus/issues) tab.

**Auto-CORPus recognises 2 types of input file which are:**

@@ -126,13 +126,13 @@ To get started:
1. Run the main app for a single file example:

```sh
-python -m autocorpus -c "configs/config_pmc.json" -t "output" -f "path/to/html/file" -o JSON
+python -m autocorpus -c "autocorpus/configs/config_pmc.json" -t "output" -f "path/to/html/file" -o JSON
```

1. Run the main app for a directory of files example

```sh
-python -m autocorpus -c "configs/config_pmc.json" -t "output" -f "path/to/directory/of/html/files" -o JSON
+python -m autocorpus -c "autocorpus/configs/config_pmc.json" -t "output" -f "path/to/directory/of/html/files" -o JSON
```

**Note:** The `auto-corpus` commandline script is also available and will behave the same as `python -m autocorpus`
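
The commands in this hunk can also be assembled from Python, for example when scripting several runs. The following is a minimal sketch, not part of Auto-CORPus: `build_command` is a hypothetical helper, the flags (`-c`, `-t`, `-f`, `-o`) are those shown in the README, and the relative config path assumes the command is run from the repository root.

```python
import sys


def build_command(html_path,
                  config="autocorpus/configs/config_pmc.json",
                  target="output"):
    """Build the argument list for one `python -m autocorpus` run.

    Hypothetical helper: only the flags shown in the README are used.
    """
    return [
        sys.executable, "-m", "autocorpus",
        "-c", config,          # config file describing the publisher's HTML
        "-t", target,          # output directory
        "-f", str(html_path),  # a single HTML file or a directory of files
        "-o", "JSON",          # output format
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)`; passing arguments as a list avoids shell quoting problems with paths that contain spaces.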
File renamed without changes.
File renamed without changes.
194 changes: 194 additions & 0 deletions autocorpus/configs/config_pmc.json
@@ -0,0 +1,194 @@
{
  "config": {
    "references": {
      "data": {
        "title": [],
        "journal": [],
        "volume": []
      },
      "defined-by": [
        {
          "tag": "li"
        },
        {
          "tag": "p",
          "attrs": {
            "id": "(__){0,2}p\\d+"
          }
        },
        {
          "xpath": "//*[@class=\"ref-list\"]"
        }
      ]
    },
    "title": {
      "data": {},
      "defined-by": [
        {
          "tag": "h1",
          "xpath": "/html/body/div[2]/div[2]/div/div[1]/div/div[2]/main/article/section[1]/section[2]/div/hgroup/h1"
        }
      ]
    },
    "keywords": {
      "data": {},
      "defined-by": [
        {
          "tag": "section",
          "attrs": {
            "class": [
              "kwd-group"
            ]
          }
        }
      ]
    },
    "abbreviations-table": {
      "data": {},
      "defined-by": [
        {
          "tag": "table",
          "attrs": {
            "class": "glossary"
          }
        }
      ]
    },
    "sections": {
      "data": {
        "headers": [
          {
            "tag": "h2",
            "attrs": {
              "class": "pmc_sec_title"
            }
          },
          {
            "tag": "h2"
          }
        ]
      },
      "defined-by": [
        {
          "xpath": "//section[contains(@class, 'body')]/section"
        }
      ]
    },
    "sub-sections": {
      "data": {
        "headers": [
          {
            "tag": "h[3-6]",
            "attrs": {
              "class": "pmc_sec_title"
            }
          }
        ]
      },
      "defined-by": [
        {
          "tag": "section",
          "xpath": "//section[contains(@class, 'body')]/section/section"
        }
      ]
    },
    "paragraphs": {
      "data": {},
      "defined-by": [
        {
          "tag": "p"
        },
        {
          "tag": "p",
          "xpath": "//section[contains(@class, 'body')]/section//p"
        }
      ]
    },
    "tables": {
      "data": {
        "caption": [
          {
            "tag": "div",
            "attrs": {
              "class": "caption"
            }
          }
        ],
        "table-content": [
          {
            "tag": "table"
          }
        ],
        "title": [
          {
            "tag": "h4",
            "attrs": {
              "class": "obj_head"
            }
          }
        ],
        "footer": [
          {
            "tag": "div",
            "attrs": {
              "class": "tw-foot"
            }
          }
        ],
        "table-row": [
          {
            "tag": "tr"
          }
        ],
        "header-row": [
          {
            "tag": "thead"
          }
        ],
        "header-element": [
          {
            "tag": "th"
          }
        ]
      },
      "defined-by": [
        {
          "tag": "section",
          "attrs": {
            "class": "tw"
          }
        }
      ]
    },
    "figures": {
      "data": {
        "caption": [
          {
            "tag": "p"
          }
        ]
      },
      "defined-by": [
        {
          "tag": "figcaption"
        }
      ]
    }
  },
  "contributions": {
    "author": {
      "name": "Tom Shorter",
      "contact_email": "[email protected]",
      "comments": "Provided with Auto-CORPus for processing PubMed Central HTML files"
    },
    "editors": [
      {
        "name": "Thomas Rowlands",
        "contact_email": "",
        "date_edited": "28/11/2024",
        "comments": "Modified for compatibility with PMC website changes from October 2024."
      }
    ]
  },
  "example_source_HTML_URL": "https://pmc.ncbi.nlm.nih.gov/articles/PMC8885717"
}
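
Several `defined-by` rules in the config above pair a `tag` with `attrs` whose values look like regular expressions (for example `h[3-6]` and `(__){0,2}p\d+`). The sketch below illustrates how such a rule could be checked against an element's tag name and attributes. It is an illustration under stated assumptions, not Auto-CORPus's actual matching code: `rule_matches` is a hypothetical helper, patterns are assumed to match the whole value, and rules that rely only on `xpath` are not handled.

```python
import re

# Two "defined-by" style rules copied from the config above.
RULES = [
    {"tag": "p", "attrs": {"id": "(__){0,2}p\\d+"}},
    {"tag": "h[3-6]", "attrs": {"class": "pmc_sec_title"}},
]


def rule_matches(rule, tag, attrs):
    """Return True if an element (tag name plus attribute dict) fits a rule.

    Assumptions: tag names and attribute values are regular expressions that
    must match the whole string; an attribute pattern may be a string or a
    list of strings; multi-valued attributes such as `class` are split on
    whitespace and any single value may satisfy the pattern.
    """
    if "tag" in rule and not re.fullmatch(rule["tag"], tag):
        return False
    for name, pattern in rule.get("attrs", {}).items():
        patterns = pattern if isinstance(pattern, list) else [pattern]
        values = attrs.get(name, "").split()
        if not any(re.fullmatch(p, v) for p in patterns for v in values):
            return False
    return True
```

Under these assumptions, a `<p id="__p12">` element satisfies the first rule, while an `<h2 class="pmc_sec_title">` element does not satisfy the second, because `h[3-6]` excludes `h2`.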
File renamed without changes.
File renamed without changes.