Revert "Apply codespell, manual fixes"

This reverts commit e0bf216.
omicsNLP · Oct 24, 2024 · 126371f
1 parent e0bf216
commit 126371f

Showing 9 changed files with 8 additions and 14 deletions.
diff --git a/.codespell_ignore.txt b/.codespell_ignore.txt
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -28,8 +28,3 @@ repos:
     hooks:
       - id: markdownlint-fix
         args: [--disable, MD013, MD033, MD036, MD041, MD040, --]
-  - repo: https://github.com/codespell-project/codespell
-    rev: v2.3.0
-    hooks:
-      - id: codespell
-        args: [-I, .codespell_ignore.txt]
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
@@ -5,7 +5,7 @@
 We as members, contributors, and leaders pledge to make participation in our
 community a harassment-free experience for everyone, regardless of age, body
 size, visible or invisible disability, ethnicity, sex characteristics, gender
-identity and expression, level of experience, education, socioeconomic status,
+identity and expression, level of experience, education, socio-economic status,
 nationality, personal appearance, race, religion, or sexual identity
 and orientation.
 

diff --git a/docs/config_tutorial.md b/docs/config_tutorial.md
@@ -105,7 +105,7 @@ the HTML is generated from each source without having to define exact matches fo
 The second example identifies all `header` elements ranging from `<h3>` to `<h6>`. Auto-CORPus will process all matching
 headers at the same time.
 
-Within the first example, notice the use of "\\\d" instead of the usual "\d" for identifying any digit. This is due to the regex pattern being defined within the config which is a JSON file. For further information about escapaing special characters within JSON have a look at [this guide by tutorials point](https://www.tutorialspoint.com/json_simple/json_simple_escape_characters.htm).
+Within the first example, notice the use of "\\\d" instead of the usual "\d" for identifying any digit. This is due to the regex pattern being defined within the config which is a JSON file. For further informaion about escapaing special characters within JSON have a look at [this guide by tutorials point](https://www.tutorialspoint.com/json_simple/json_simple_escape_characters.htm).
 
 <h3><a name="submit">Submitting/editing config files</a></h3>
 

diff --git a/src/IAO_dicts/IAO_FINAL_MAPPING.txt b/src/IAO_dicts/IAO_FINAL_MAPPING.txt
@@ -114,7 +114,7 @@ experimental design	methods section / materials section
 experimental methods	methods section
 experimental procedures	methods section
 experimental section	methods section
-extended data	supplementary material section
+extented data	supplementary material section
 figures and tables	supplementary material section
 financial support	funding source declaration section
 footnotes	footnote section

diff --git a/src/abbreviation.py b/src/abbreviation.py
@@ -34,7 +34,7 @@ def __conditions(self, candidate):
             viable = False
         if len(candidate.split()) > 2:
             viable = False
-        if candidate.islower():  # customize function discard all lower case candidate
+        if candidate.islower():  # customize funcition discard all lower case candidate
             viable = False
         if not re2.search(r"\p{L}", candidate):  # \p{L} = All Unicode letter
             viable = False

diff --git a/src/utils.py b/src/utils.py
@@ -195,7 +195,7 @@ def handle_defined_by(config, soup):
             if new_matches:
                 new_matches = [x for x in new_matches if x.text]
         if "xpath" in bsAttrs:
-            if isinstance(bsAttrs["xpath"], list):
+            if type(bsAttrs["xpath"]) is list:
                 for path in bsAttrs["xpath"]:
                     xpath_matches = fromstring(str(soup)).xpath(path)
                     if xpath_matches:

diff --git a/tests/data/PMC8885717.html b/tests/data/PMC8885717.html
diff --git a/tests/data/PMC8885717_bioc.json b/tests/data/PMC8885717_bioc.json
@@ -51,7 +51,7 @@
                         "iao_name_1": "textual abstract section",
                         "iao_id_1": "IAO:0000315"
                     },
-                    "text": "To analyse large corpora using machine learning and other Natural Language Processing (NLP) algorithms, the corpora need to be standardized. The BioC format is a community-driven simple data structure for sharing text and annotations, however there is limited access to biomedical literature in BioC format and a lack of bioinformatics tools to convert online publication HTML formats to BioC. We present Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications), a novel NLP tool for the standardization and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics. Firstly, Auto-CORPus can be configured to convert HTML from various publication sources to BioC. To standardize the description of heterogeneous publication sections, the Information Artifact Ontology is used to annotate each section within the BioC output. Secondly, Auto-CORPus transforms publication tables to a JSON format to store, exchange and annotate table data between text analytics systems. The BioC specification does not include a data structure for representing publication table data, so we present a JSON format for sharing table content and metadata. Inline tables within full-text HTML files and linked tables within separate HTML files are processed and converted to machine-interpretable table JSON format. Finally, Auto-CORPus extracts abbreviations declared within publication text and provides an abbreviations JSON output that relates an abbreviation with the full definition. This abbreviation collection supports text mining tasks such as named entity recognition by including abbreviations unique to individual publications that are not contained within standard bio-ontologies and dictionaries. The Auto-CORPus package is freely available with detailed instructions from GitHub at: https://github.com/omicsNLP/Auto-CORPus.",
+                    "text": "To analyse large corpora using machine learning and other Natural Language Processing (NLP) algorithms, the corpora need to be standardized. The BioC format is a community-driven simple data structure for sharing text and annotations, however there is limited access to biomedical literature in BioC format and a lack of bioinformatics tools to convert online publication HTML formats to BioC. We present Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications), a novel NLP tool for the standardization and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics. Firstly, Auto-CORPus can be configured to convert HTML from various publication sources to BioC. To standardize the description of heterogenous publication sections, the Information Artifact Ontology is used to annotate each section within the BioC output. Secondly, Auto-CORPus transforms publication tables to a JSON format to store, exchange and annotate table data between text analytics systems. The BioC specification does not include a data structure for representing publication table data, so we present a JSON format for sharing table content and metadata. Inline tables within full-text HTML files and linked tables within separate HTML files are processed and converted to machine-interpretable table JSON format. Finally, Auto-CORPus extracts abbreviations declared within publication text and provides an abbreviations JSON output that relates an abbreviation with the full definition. This abbreviation collection supports text mining tasks such as named entity recognition by including abbreviations unique to individual publications that are not contained within standard bio-ontologies and dictionaries. The Auto-CORPus package is freely available with detailed instructions from GitHub at: https://github.com/omicsNLP/Auto-CORPus.",
                     "sentences": [],
                     "annotations": [],
                     "relations": []
@@ -63,7 +63,7 @@
                         "iao_name_1": "introduction section",
                         "iao_id_1": "IAO:0000316"
                     },
-                    "text": "Natural language processing (NLP) is a branch of artificial intelligence that uses computers to process, understand, and use human language. NLP is applied in many different fields including language modeling, speech recognition, text mining, and translation systems. In the biomedical realm NLP has been applied to extract, for example, medication data from electronic health records and patient clinical history from free-text (unstructured) clinical notes, to significantly speed up processes that would otherwise be extracted manually by experts (1, 2). Biomedical research publications, although semi-structured, pose similar challenges with regards to extracting and integrating relevant information (3). The full-text of biomedical literature is predominately made available online in the accessible and reusable HTML format, however, some publications are only available as PDF documents which are more difficult to reuse. Efforts to resolve the problem of publication text accessibility across science in general includes work by the Semantic Scholar search engine to convert PDF documents to HTML formats (4). Whichever process is used to obtain a suitable HTML file, before the text can be processed using NLP, heterogeneously structured HTML requires standardization and optimization. BioC is a simple JSON (and XML) format for sharing and reusing text data that has been developed by the text mining community to improve system interoperability (5). The BioC data model consists of collections of documents divided into data elements such as publication sections and associated entity and relation annotations. PubMed Central (PMC) makes full-text articles from its Open Access and Author Manuscript collections available in BioC format (6). To our knowledge there are no services available to convert PMC publications that are not part of these collections to BioC. 
Additionally, there is a gap in available software to convert publishers' publication HTML to BioC, creating a bottleneck in many biomedical literature text mining workflows caused by having to process documents in heterogeneous formats. To bridge this gap, we have developed an Automated pipeline for Consistent Outputs from Research Publications (Auto-CORPus) that can be configured to process any HTML publication structure and transform the corresponding publications to BioC format.",
+                    "text": "Natural language processing (NLP) is a branch of artificial intelligence that uses computers to process, understand, and use human language. NLP is applied in many different fields including language modeling, speech recognition, text mining, and translation systems. In the biomedical realm NLP has been applied to extract, for example, medication data from electronic health records and patient clinical history from free-text (unstructured) clinical notes, to significantly speed up processes that would otherwise be extracted manually by experts (1, 2). Biomedical research publications, although semi-structured, pose similar challenges with regards to extracting and integrating relevant information (3). The full-text of biomedical literature is predominately made available online in the accessible and reusable HTML format, however, some publications are only available as PDF documents which are more difficult to reuse. Efforts to resolve the problem of publication text accessibility across science in general includes work by the Semantic Scholar search engine to convert PDF documents to HTML formats (4). Whichever process is used to obtain a suitable HTML file, before the text can be processed using NLP, heterogeneously structured HTML requires standardization and optimization. BioC is a simple JSON (and XML) format for sharing and reusing text data that has been developed by the text mining community to improve system interoperability (5). The BioC data model consists of collections of documents divided into data elements such as publication sections and associated entity and relation annotations. PubMed Central (PMC) makes full-text articles from its Open Access and Author Manuscript collections available in BioC format (6). To our knowledge there are no services available to convert PMC publications that are not part of these collections to BioC. 
Additionally, there is a gap in available software to convert publishers' publication HTML to BioC, creating a bottleneck in many biomedical literature text mining workflows caused by having to process documents in heterogenous formats. To bridge this gap, we have developed an Automated pipeline for Consistent Outputs from Research Publications (Auto-CORPus) that can be configured to process any HTML publication structure and transform the corresponding publications to BioC format.",
                     "sentences": [],
                     "annotations": [],
                     "relations": []