Training a BioBERT-based text classifier for categorization of epidemiological study catalogues with the Maelstrom Taxonomy
This repository contains scripts, the API and detailed information about the trained BioBERT model.
Background: Findable, Accessible, Interoperable and Reusable (FAIR) data as a prerequisite for data re-use and a more collaborative and reusable research environment is becoming increasingly important. The National Research Data Infrastructure for Personal Health Data (NFDI4Health) aims to improve the discoverability, reusability and interoperability of health data, in particular from epidemiological, clinical and public health studies. To improve findability of health data sets, NFDI4Health has established the German Central Health Study Hub, which provides rich metadata for searching health data sets. With the Maelstrom Catalog, Maelstrom Research provides a large dataset of labeled and harmonized study variables, contributing to the findability and reusability of epidemiological data. Both infrastructures facilitate access to and search within study data catalogues. The categorization and standardization of these catalogue variables is essential to optimize findability and re-use of the data. To this end, Maelstrom Research has developed the Maelstrom Taxonomy, a well-established classification system for epidemiological studies which is also used by NFDI4Health.
As the categorization and standardization of variables is time-consuming and labor-intensive, there is a need for tools and automated methods to ease the curators' workload. Therefore, NFDI4Health has developed the Metadata Annotation Workbench, a service that supports the annotation of metadata from data catalogues with standardized vocabulary, e.g., to facilitate the categorization of study data. Here, we present the first AI solution for automatic classification and annotation of semantic relationships integrated into this service. In particular, we have developed a text classifier based on the biomedical text mining model BioBERT to support classification into 135 categories of the Maelstrom Taxonomy. On test data, the model achieved a weighted F1 score of over 92 % and a macro F1 score of over 81 %. Initial usability tests within the Metadata Annotation Workbench have also shown improved annotation performance, especially for non-experts in the taxonomy. As a result, the categorization of study variables can be accelerated, enabling filtering and facilitating the identification of variables through standardized labeling within the German Central Health Study Hub or the Maelstrom Catalog. We are optimistic that such AI applications can be further developed in close collaboration with curators to significantly reduce the annotation effort and to realize semantically annotated interoperable variable catalogues for all application areas in the future.
The first version of the Workbench does not include an automated annotation mechanism. A search and auto-completion feature of the underlying terminology service allows the annotator to search through the concepts in a terminology and manually assign a concept to the variable. The annotator is supported by 10 annotation suggestions generated by string matching between variable and concept. The suggestions are created by a text search across all textual fields in the terminologies, and the results are ranked by hits, then labels, then synonyms, then definitions, then annotations. The search endpoint of the underlying terminology service is based on the API of the Ontology Lookup Service (OLS).
Data were provided by the Maelstrom Research Group in CSV format. Variables from international epidemiological studies were collected in a raw format in the form of codebooks; categories as labels, ISO scales or the variable labels were missing. The raw material coming from the study owners is manually converted by Maelstrom Research into a cleaned and categorized format for publication in the Maelstrom Catalog. We worked in collaboration with Maelstrom to train and test the model on the data dictionaries from their catalog. Maelstrom selected a subset of variables from their catalog to create a CSV file of 779,381 tagged variables with information such as definition, description and, for categorical variables, the description and definition of categories. To train and evaluate the model's performance, the associated classification of each variable (domain and taxonomy), performed by Maelstrom, was appended to the file. The data required transformations such as the creation of additional labels, resulting from strong correlations between two domains, and only baseline variables were retained to eliminate the effect of over-representation of identical follow-up variables. Finally, Maelstrom's taxonomies labeled as 'Other' (e.g., 'Other psychological measures and assessments') were excluded from the model because they lack specificity and may contain inconsistent or unreliable data, which could negatively impact the model's performance. The Maelstrom Taxonomy consists of 18 domains, which cover all types of information collected by population-based cohort studies (e.g. in demographics and socioeconomic, health and well-being, assessment and measurement, lifestyle and behavior, or environmental factors). The domains contain a total of 135 sub-domains, used for variable categorization. The variable distribution of the dataset is heterogeneous.
The largest number of variables (~147,000) are assigned to the domain of Socio-Demographic and Economic Characteristics, followed by Cognition, Personality and Psychological Measures and Assessments (~108,000), Lifestyle and Behaviors (~90,000), and Diseases (~73,000). The fewest variables are assigned to the domains of Non-Pharmacological Interventions (4820), End of Life (7185), Reproduction (7866), and Health and Community Care Services Utilization (9110). The data are highly imbalanced: the least represented sub-domains are Toxicology (3) from the domain Laboratory Measures (~13,000), Reproduction (23) from the domain Socio-Demographic and Economic Characteristics, and Histology (33) from the domain Physical Measures (~50,000). The Maelstrom dataset is real data from the harmonization of epidemiological studies. The Maelstrom Taxonomy is successfully used by Maelstrom Research for manual categorization of variables. Imbalances in the dataset reflect this use case: variables assigned to categories such as Toxicology or Reproduction are simply less common in epidemiological studies. To preserve this real-world distribution, methods such as over- or under-sampling were not used. This must be considered when analyzing and applying the AI model. Only English language variables were used for training and evaluation.
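The effect of this imbalance on the reported scores can be illustrated with a small sketch. Macro F1 averages the per-class F1 values equally, whereas weighted F1 weights each class by its support, so rare sub-domains pull the macro score down much more than the weighted one. The class names and counts below are invented for illustration and are not taken from the Maelstrom evaluation.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical per-class counts (tp, fp, fn); NOT real evaluation results.
counts = {
    "frequent_subdomain": (950, 40, 50),
    "rare_subdomain": (2, 1, 3),
}

per_class_f1 = {name: f1(*c) for name, c in counts.items()}
# Support = number of true instances of the class (tp + fn).
support = {name: tp + fn for name, (tp, fp, fn) in counts.items()}

# Macro F1: unweighted mean over classes.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
# Weighted F1: mean over classes, weighted by support.
weighted_f1 = sum(per_class_f1[n] * support[n] for n in counts) / sum(support.values())

print(f"macro F1:    {macro_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
```

In this toy case the poorly predicted rare class barely moves the weighted score but lowers the macro score considerably, which is the pattern to keep in mind when comparing the two reported values.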
The availability of a large, labelled dataset allowed the use of a deep learning approach. BERT was chosen because of its state-of-the-art performance in various NLP tasks, e.g. text classification. Several pre-trained models were fine-tuned for a text-classification task and evaluated with the Maelstrom data: BioBERT, Clinical BioBERT, BERT base cased and BERT base uncased. Hyperparameter optimization of the initial training parameters was performed with literature-based parameter variations and did not result in any significant differences between the different BERT models (results not shown). The final BioBERT model was trained with a batch size of 10, a learning rate of 5e-5 and 5 epochs. The default threshold (0.5) was used for evaluating the BioBERT model. Based on Nguyen et al. (2021), a training/testing ratio of 70/30 and a training/validation ratio of 90/10 were used. The final training, test and validation sets were therefore split in a 63/30/7 ratio. Because the sub-domain Toxicology has so few occurrences, it could not be guaranteed that it is represented in each of the training, validation and test sets. It was ensured that the other rare categories are present in all three sets. Nevertheless, the imbalanced support of the categories must be considered for evaluation and application. Training of the AI model was done with Transformers for PyTorch, the state-of-the-art machine learning library in Python [26]. We used the model BertForSequenceClassification with the multi_label_classification problem type. The generic tokenizer class AutoTokenizer was used to automatically return the correct tokenizer class for the model type. Evaluation was done with the Weights & Biases MLOps platform as well as with the scikit-learn Python library.
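The split arithmetic can be checked with a short sketch: holding out 30 % for testing and then 10 % of the remainder for validation leaves 63 % for training, 7 % for validation and 30 % for testing. The function name, the seed and the plain random shuffle are assumptions made for illustration; in particular, this sketch does not ensure that rare categories appear in all three sets, as was done for the actual data.

```python
import random

def two_stage_split(items, test_pct=30, val_pct=10, seed=42):
    """Shuffle, hold out test_pct % for testing, then val_pct % of the rest for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary, hypothetical choice

    # Stage 1: 70/30 training/testing split.
    n_test = len(shuffled) * test_pct // 100
    test, remainder = shuffled[:n_test], shuffled[n_test:]

    # Stage 2: 90/10 training/validation split of the remaining 70 %.
    n_val = len(remainder) * val_pct // 100
    val, train = remainder[:n_val], remainder[n_val:]
    return train, val, test

train, val, test = two_stage_split(range(1000))
print(len(train), len(val), len(test))  # 630 70 300
```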
Data analysis, data set generation and model training were performed using Python and the Hugging Face Transformers library. The model trained on the raw variables is available on the Hugging Face Hub. The model can be used via the Hugging Face TextClassificationPipeline API [29]. We used our own API for integration into the Annotation Workbench. Additional metadata for the predicted Maelstrom tag (IRI, label, tag, and confidence), which is not available via the Hugging Face pipeline, can be requested via an API call.
A Docker image with a Python Flask API with an endpoint for the text classification task was created to allow for easy integration into web applications such as the Annotation Workbench. A request for classification of a word, term, or sentence returns the predicted categories (sub-domains or composed domains), a confidence score, and metadata such as the identifier of the Maelstrom categories. The automatic classification method has been integrated as a semi-automatic annotation system. The user is presented with annotation predictions on an item-by-item basis and must actively select the annotation term. The number of predictions shown was set to the top five categories, ordered by descending confidence score.
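The selection of the five highest-confidence categories might look roughly like the following sketch. It assumes that confidences are obtained by applying a sigmoid to each logit independently, which is consistent with a multi-label setup but is not confirmed by this repository; the label names, logits and dictionary keys are hypothetical, and the real API additionally returns metadata such as the IRI.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def top_predictions(logits, labels, k=5):
    """Return the k labels with the highest confidence, in descending order."""
    scored = [
        {"label": label, "confidence": sigmoid(logit)}
        for label, logit in zip(labels, logits)
    ]
    scored.sort(key=lambda p: p["confidence"], reverse=True)
    return scored[:k]

# Hypothetical labels and logits, for illustration only.
labels = ["cat_a", "cat_b", "cat_c", "cat_d", "cat_e", "cat_f", "cat_g"]
logits = [2.0, -1.0, 0.5, 3.0, -3.0, 0.0, 1.0]

top5 = top_predictions(logits, labels)
for p in top5:
    print(f'{p["label"]}: {p["confidence"]:.3f}')
```

Because the user must still pick one suggestion, returning a fixed number of ranked candidates rather than only those above the 0.5 threshold keeps the list useful even when the model is uncertain.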
Prediction via Hugging Face pipeline
Two models are available on the Hugging Face Hub: one trained on the raw variables and one trained on the cleaned variables. They can be used via the TextClassificationPipeline pipeline task. An access token is required to access the repositories.
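A minimal usage sketch, assuming a recent version of the transformers library (where pipeline accepts a token argument; older versions used use_auth_token) and an access token stored in an environment variable named HF_TOKEN. The environment variable name, the helper function and the example sentence are assumptions for illustration.

```python
import os

# Model ID taken from the model links in this README.
MODEL_ID = "JuSas/biobert-Maelstrom-raw"

def build_classifier(token, model_id=MODEL_ID):
    """Create a text-classification pipeline for the access-restricted model."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline
    return pipeline("text-classification", model=model_id, token=token)

# Runs only when executed as a script and a token is provided (HF_TOKEN is an assumed name).
if __name__ == "__main__" and os.environ.get("HF_TOKEN"):
    classifier = build_classifier(os.environ["HF_TOKEN"])
    # top_k=5 requests the five highest-scoring labels; the input text is a made-up example.
    print(classifier("Age at first diagnosis of diabetes", top_k=5))
```

The pipeline returns labels and scores only; the additional metadata (IRI, label, tag) is provided by the API in this repository.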
The API in this repository can be installed as a Docker container; see the documentation in the api folder. It provides metadata for the predicted Maelstrom tag (IRI, label, tag, confidence).
AI Model with cleaned Maelstrom variables: https://huggingface.co/JuSas/biobert-Maelstrom-cleaned
AI Model with raw Maelstrom variables: https://huggingface.co/JuSas/biobert-Maelstrom-raw
This work was done as part of the NFDI4Health Consortium and is published on behalf of this Consortium (www.nfdi4health.de). It is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 442326535.