Skip to content

Commit

Permalink
Revert "Apply codespell, manual fixes"
Browse files Browse the repository at this point in the history
This reverts commit e0bf216.
  • Loading branch information
tsmbland committed Oct 24, 2024
1 parent e0bf216 commit 126371f
Show file tree
Hide file tree
Showing 9 changed files with 8 additions and 14 deletions.
1 change: 0 additions & 1 deletion .codespell_ignore.txt

This file was deleted.

5 changes: 0 additions & 5 deletions .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -28,8 +28,3 @@ repos:
hooks:
- id: markdownlint-fix
args: [--disable, MD013, MD033, MD036, MD041, MD040, --]
- repo: https://github.com/codespell-project/codespell
rev: v2.3.0
hooks:
- id: codespell
args: [-I, .codespell_ignore.txt]
2 changes: 1 addition & 1 deletion CODE_OF_CONDUCT.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socioeconomic status,
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.

Expand Down
2 changes: 1 addition & 1 deletion docs/config_tutorial.md
Original file line number Diff line number Diff line change
Expand Up @@ -105,7 +105,7 @@ the HTML is generated from each source without having to define exact matches fo
The second example identifies all `header` elements ranging from `<h3>` to `<h6>`. Auto-CORPus will process all matching
headers at the same time.

Within the first example, notice the use of "\\\d" instead of the usual "\d" for identifying any digit. This is due to the regex pattern being defined within the config which is a JSON file. For further information about escapaing special characters within JSON have a look at [this guide by tutorials point](https://www.tutorialspoint.com/json_simple/json_simple_escape_characters.htm).
Within the first example, notice the use of "\\\d" instead of the usual "\d" for identifying any digit. This is due to the regex pattern being defined within the config which is a JSON file. For further informaion about escapaing special characters within JSON have a look at [this guide by tutorials point](https://www.tutorialspoint.com/json_simple/json_simple_escape_characters.htm).

<h3><a name="submit">Submitting/editing config files</a></h3>

Expand Down
2 changes: 1 addition & 1 deletion src/IAO_dicts/IAO_FINAL_MAPPING.txt
Original file line number Diff line number Diff line change
Expand Up @@ -114,7 +114,7 @@ experimental design methods section / materials section
experimental methods methods section
experimental procedures methods section
experimental section methods section
extended data supplementary material section
extented data supplementary material section
figures and tables supplementary material section
financial support funding source declaration section
footnotes footnote section
Expand Down
2 changes: 1 addition & 1 deletion src/abbreviation.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ def __conditions(self, candidate):
viable = False
if len(candidate.split()) > 2:
viable = False
if candidate.islower(): # customize function discard all lower case candidate
if candidate.islower(): # customize funcition discard all lower case candidate
viable = False
if not re2.search(r"\p{L}", candidate): # \p{L} = All Unicode letter
viable = False
Expand Down
2 changes: 1 addition & 1 deletion src/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -195,7 +195,7 @@ def handle_defined_by(config, soup):
if new_matches:
new_matches = [x for x in new_matches if x.text]
if "xpath" in bsAttrs:
if isinstance(bsAttrs["xpath"], list):
if type(bsAttrs["xpath"]) is list:
for path in bsAttrs["xpath"]:
xpath_matches = fromstring(str(soup)).xpath(path)
if xpath_matches:
Expand Down
2 changes: 1 addition & 1 deletion tests/data/PMC8885717.html

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions tests/data/PMC8885717_bioc.json
Original file line number Diff line number Diff line change
Expand Up @@ -51,7 +51,7 @@
"iao_name_1": "textual abstract section",
"iao_id_1": "IAO:0000315"
},
"text": "To analyse large corpora using machine learning and other Natural Language Processing (NLP) algorithms, the corpora need to be standardized. The BioC format is a community-driven simple data structure for sharing text and annotations, however there is limited access to biomedical literature in BioC format and a lack of bioinformatics tools to convert online publication HTML formats to BioC. We present Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications), a novel NLP tool for the standardization and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics. Firstly, Auto-CORPus can be configured to convert HTML from various publication sources to BioC. To standardize the description of heterogeneous publication sections, the Information Artifact Ontology is used to annotate each section within the BioC output. Secondly, Auto-CORPus transforms publication tables to a JSON format to store, exchange and annotate table data between text analytics systems. The BioC specification does not include a data structure for representing publication table data, so we present a JSON format for sharing table content and metadata. Inline tables within full-text HTML files and linked tables within separate HTML files are processed and converted to machine-interpretable table JSON format. Finally, Auto-CORPus extracts abbreviations declared within publication text and provides an abbreviations JSON output that relates an abbreviation with the full definition. This abbreviation collection supports text mining tasks such as named entity recognition by including abbreviations unique to individual publications that are not contained within standard bio-ontologies and dictionaries. The Auto-CORPus package is freely available with detailed instructions from GitHub at: https://github.com/omicsNLP/Auto-CORPus.",
"text": "To analyse large corpora using machine learning and other Natural Language Processing (NLP) algorithms, the corpora need to be standardized. The BioC format is a community-driven simple data structure for sharing text and annotations, however there is limited access to biomedical literature in BioC format and a lack of bioinformatics tools to convert online publication HTML formats to BioC. We present Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications), a novel NLP tool for the standardization and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics. Firstly, Auto-CORPus can be configured to convert HTML from various publication sources to BioC. To standardize the description of heterogenous publication sections, the Information Artifact Ontology is used to annotate each section within the BioC output. Secondly, Auto-CORPus transforms publication tables to a JSON format to store, exchange and annotate table data between text analytics systems. The BioC specification does not include a data structure for representing publication table data, so we present a JSON format for sharing table content and metadata. Inline tables within full-text HTML files and linked tables within separate HTML files are processed and converted to machine-interpretable table JSON format. Finally, Auto-CORPus extracts abbreviations declared within publication text and provides an abbreviations JSON output that relates an abbreviation with the full definition. This abbreviation collection supports text mining tasks such as named entity recognition by including abbreviations unique to individual publications that are not contained within standard bio-ontologies and dictionaries. The Auto-CORPus package is freely available with detailed instructions from GitHub at: https://github.com/omicsNLP/Auto-CORPus.",
"sentences": [],
"annotations": [],
"relations": []
Expand All @@ -63,7 +63,7 @@
"iao_name_1": "introduction section",
"iao_id_1": "IAO:0000316"
},
"text": "Natural language processing (NLP) is a branch of artificial intelligence that uses computers to process, understand, and use human language. NLP is applied in many different fields including language modeling, speech recognition, text mining, and translation systems. In the biomedical realm NLP has been applied to extract, for example, medication data from electronic health records and patient clinical history from free-text (unstructured) clinical notes, to significantly speed up processes that would otherwise be extracted manually by experts (1, 2). Biomedical research publications, although semi-structured, pose similar challenges with regards to extracting and integrating relevant information (3). The full-text of biomedical literature is predominately made available online in the accessible and reusable HTML format, however, some publications are only available as PDF documents which are more difficult to reuse. Efforts to resolve the problem of publication text accessibility across science in general includes work by the Semantic Scholar search engine to convert PDF documents to HTML formats (4). Whichever process is used to obtain a suitable HTML file, before the text can be processed using NLP, heterogeneously structured HTML requires standardization and optimization. BioC is a simple JSON (and XML) format for sharing and reusing text data that has been developed by the text mining community to improve system interoperability (5). The BioC data model consists of collections of documents divided into data elements such as publication sections and associated entity and relation annotations. PubMed Central (PMC) makes full-text articles from its Open Access and Author Manuscript collections available in BioC format (6). To our knowledge there are no services available to convert PMC publications that are not part of these collections to BioC. Additionally, there is a gap in available software to convert publishers' publication HTML to BioC, creating a bottleneck in many biomedical literature text mining workflows caused by having to process documents in heterogeneous formats. To bridge this gap, we have developed an Automated pipeline for Consistent Outputs from Research Publications (Auto-CORPus) that can be configured to process any HTML publication structure and transform the corresponding publications to BioC format.",
"text": "Natural language processing (NLP) is a branch of artificial intelligence that uses computers to process, understand, and use human language. NLP is applied in many different fields including language modeling, speech recognition, text mining, and translation systems. In the biomedical realm NLP has been applied to extract, for example, medication data from electronic health records and patient clinical history from free-text (unstructured) clinical notes, to significantly speed up processes that would otherwise be extracted manually by experts (1, 2). Biomedical research publications, although semi-structured, pose similar challenges with regards to extracting and integrating relevant information (3). The full-text of biomedical literature is predominately made available online in the accessible and reusable HTML format, however, some publications are only available as PDF documents which are more difficult to reuse. Efforts to resolve the problem of publication text accessibility across science in general includes work by the Semantic Scholar search engine to convert PDF documents to HTML formats (4). Whichever process is used to obtain a suitable HTML file, before the text can be processed using NLP, heterogeneously structured HTML requires standardization and optimization. BioC is a simple JSON (and XML) format for sharing and reusing text data that has been developed by the text mining community to improve system interoperability (5). The BioC data model consists of collections of documents divided into data elements such as publication sections and associated entity and relation annotations. PubMed Central (PMC) makes full-text articles from its Open Access and Author Manuscript collections available in BioC format (6). To our knowledge there are no services available to convert PMC publications that are not part of these collections to BioC. Additionally, there is a gap in available software to convert publishers' publication HTML to BioC, creating a bottleneck in many biomedical literature text mining workflows caused by having to process documents in heterogenous formats. To bridge this gap, we have developed an Automated pipeline for Consistent Outputs from Research Publications (Auto-CORPus) that can be configured to process any HTML publication structure and transform the corresponding publications to BioC format.",
"sentences": [],
"annotations": [],
"relations": []
Expand Down

0 comments on commit 126371f

Please sign in to comment.