👨‍🏫 Tutorial
============================

Three steps to train powerful PubMed sentence embeddings
------------------------------------------------------------

This tutorial walks through training sentence embeddings on PubMed data with the AnglE framework. It covers data preparation, model training, and evaluation.

Step 1: Data preparation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Clean data from the `qiaojin/PubMedQA <https://huggingface.co/datasets/qiaojin/PubMedQA>`_ dataset and prepare it into AnglE's `DatasetFormats.C <https://angle.readthedocs.io/en/latest/notes/training.html#data-prepration>`_ format.

We have already processed the data and made it available on HuggingFace: `WhereIsAI/medical-triples <https://huggingface.co/datasets/WhereIsAI/medical-triples/viewer/all_pubmed_en_v1>`_. You can use this processed dataset for this tutorial.

Step 2: Train the model with `angle-trainer`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To train AnglE embeddings, you'll need to install the `angle-emb` package:

.. code-block:: bash

    python -m pip install -U angle-emb

The `angle-emb` package includes a user-friendly command-line interface called `angle-trainer <https://angle.readthedocs.io/en/latest/notes/training.html#angle-trainer-recommended>`_ for training AnglE embeddings.

With `angle-trainer`, you can quickly start model training by specifying the data path and `hyperparameters <https://angle.readthedocs.io/en/latest/notes/training.html#fine-tuning-tips>`_.

Here's an example of training a BERT-base model:

.. code-block:: bash

    # Launch 3 processes on GPUs 1-3, with Weights & Biases logging disabled.
    WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc_per_node=3 --master_port=1234 -m angle_emb.angle_trainer \
        --train_name_or_path WhereIsAI/medical-triples \
        --train_subset_name all_pubmed_en_v1 \
        --save_dir ckpts/pubmedbert-medical-base-v1 \
        --model_name_or_path microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext \
        --pooling_strategy cls \
        --maxlen 75 \
        --ibn_w 20.0 \
        --cosine_w 0.0 \
        --angle_w 1.0 \
        --learning_rate 1e-6 \
        --logging_steps 5 \
        --save_steps 500 \
        --warmup_steps 50 \
        --batch_size 64 \
        --seed 42 \
        --gradient_accumulation_steps 3 \
        --push_to_hub 1 --hub_model_id pubmed-angle-base-en --hub_private_repo 1 \
        --epochs 1 \
        --fp16 1

And here's an example of training a BERT-large model:

.. code-block:: bash

    # Same launch setup; starts from UAE-Large-V1 and uses a smaller batch size.
    WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc_per_node=3 --master_port=1234 -m angle_emb.angle_trainer \
        --train_name_or_path WhereIsAI/medical-triples \
        --train_subset_name all_pubmed_en_v1 \
        --save_dir ckpts/uae-medical-large-v1 \
        --model_name_or_path WhereIsAI/UAE-Large-V1 \
        --load_mlm_model 1 \
        --pooling_strategy cls \
        --maxlen 75 \
        --ibn_w 20.0 \
        --cosine_w 0.0 \
        --angle_w 1.0 \
        --learning_rate 1e-6 \
        --logging_steps 5 \
        --save_steps 500 \
        --warmup_steps 50 \
        --batch_size 32 \
        --seed 42 \
        --gradient_accumulation_steps 3 \
        --push_to_hub 1 --hub_model_id pubmed-angle-large-en --hub_private_repo 1 \
        --epochs 1 \
        --fp16 1

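Both examples train on the `all_pubmed_en_v1` subset of `WhereIsAI/medical-triples` for one epoch and share most hyperparameters. The large run starts from `WhereIsAI/UAE-Large-V1`, sets `--load_mlm_model 1`, and lowers `--batch_size` from 64 to 32. Adjust the hyperparameters as needed for your specific use case.
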
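As a rough worked example of how the batch-related flags in the two commands combine, the sketch below multiplies the number of `torchrun` processes, `--batch_size`, and `--gradient_accumulation_steps`. It assumes that `--batch_size` is applied per process, which is a common convention for distributed trainers but is not stated in this tutorial. The helper name `effective_batch_size` is hypothetical and not part of `angle-emb`.

```python
def effective_batch_size(num_processes: int, batch_size: int, grad_accum_steps: int) -> int:
    """Samples contributing to one optimizer step.

    Assumption: `batch_size` is the per-process batch size, so every
    process sees `batch_size` samples per forward/backward pass.
    """
    return num_processes * batch_size * grad_accum_steps


# Values taken from the two example commands (--nproc_per_node=3,
# --gradient_accumulation_steps 3, --batch_size 64 or 32).
base = effective_batch_size(num_processes=3, batch_size=64, grad_accum_steps=3)
large = effective_batch_size(num_processes=3, batch_size=32, grad_accum_steps=3)

print(f"base model:  {base} samples per optimizer step")
print(f"large model: {large} samples per optimizer step")
```

If you train on a different number of GPUs, adjusting `--batch_size` or `--gradient_accumulation_steps` so that this product stays the same is one way to keep the runs comparable.
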
Step 3: Evaluate the model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AnglE provides a `CorrelationEvaluator <https://angle.readthedocs.io/en/latest/notes/evaluation.html#spearman-and-pearson-correlation>`_ to evaluate the performance of sentence embeddings.

For convenience, we have processed the `PubMedQA pqa_labeled <https://huggingface.co/datasets/qiaojin/PubMedQA/viewer/pqa_labeled>`_ data into the `DatasetFormats.A` format and made it available as `WhereIsAI/pubmedqa-test-angle-format-a <https://huggingface.co/datasets/WhereIsAI/pubmedqa-test-angle-format-a>`_ for evaluation purposes.

The following code demonstrates how to evaluate the trained `pubmed-angle-base-en` model:

.. code-block:: python

    import os

    # Select GPU 0 before any CUDA-aware library is imported.
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'

    from angle_emb import AnglE, CorrelationEvaluator
    from datasets import load_dataset

    # Load the trained model with the same pooling strategy used in training.
    angle = AnglE.from_pretrained('WhereIsAI/pubmed-angle-base-en', pooling_strategy='cls').cuda()

    # The processed evaluation pairs are stored in the `train` split.
    ds = load_dataset('WhereIsAI/pubmedqa-test-angle-format-a', split='train')

    # Score the (text1, text2) pairs against their labels.
    metric = CorrelationEvaluator(
        text1=ds['text1'],
        text2=ds['text2'],
        labels=ds['label']
    )(angle, show_progress=True)
    print(metric)

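To make the reported metric more concrete, here is a minimal, dependency-free sketch of a Spearman correlation between pairwise cosine similarities and labels. It is an illustration under assumptions, not the code inside `CorrelationEvaluator`: the helper names (`cosine`, `average_ranks`, `pearson`, `spearman`) and the toy vectors are hypothetical, and the real evaluator may differ in details such as batching or tie handling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy stand-ins for the embeddings of text1/text2 and their 0/1 labels.
emb1 = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.2]]
emb2 = [[1.0, 0.1], [1.0, 0.0], [1.0, 0.0], [0.9, 0.2]]
labels = [1, 0, 0, 1]

sims = [cosine(a, b) for a, b in zip(emb1, emb2)]
score = spearman(sims, labels)
print(f"Spearman's correlation (x100): {100 * score:.2f}")
```

Because the metric only depends on ranks, a model scores well when it orders related pairs above unrelated ones, regardless of the absolute similarity values it produces.
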
Here, we compare the performance of our trained models with two popular models trained on PubMed data. The results are as follows:

+----------------------------------------+-------------------------+
| Model                                  | Spearman's Correlation  |
+========================================+=========================+
| tavakolih/all-MiniLM-L6-v2-pubmed-full | 84.56                   |
+----------------------------------------+-------------------------+
| NeuML/pubmedbert-base-embeddings       | 84.88                   |
+----------------------------------------+-------------------------+
| WhereIsAI/pubmed-angle-base-en         | 86.01                   |
+----------------------------------------+-------------------------+
| WhereIsAI/pubmed-angle-large-en        | **86.21**               |
+----------------------------------------+-------------------------+

The results show that both of our trained models, `WhereIsAI/pubmed-angle-base-en` and `WhereIsAI/pubmed-angle-large-en`, outperform the two other models on the PubMedQA data. The large model achieves the highest Spearman's correlation, **86.21**, which is 1.33 points above the strongest of the other models (84.88); the base model follows at 86.01.