add tutorial
SeanLee97 committed Jul 31, 2024
1 parent 8dcae41 commit 1a6e8a8
Showing 1 changed file with 138 additions and 0 deletions.
138 changes: 138 additions & 0 deletions docs/notes/tutorial.rst
@@ -0,0 +1,138 @@
👨‍🏫 Tutorial
============================


Three steps to train powerful PubMed sentence embeddings
------------------------------------------------------------

This tutorial will guide you through the process of training powerful sentence embeddings using PubMed data with the AnglE framework. We'll cover data preparation, model training, and evaluation.


Step 1: Data preparation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^


Clean the data from the `qiaojin/PubMedQA <https://huggingface.co/datasets/qiaojin/PubMedQA>`_ dataset and convert it into AnglE's `DatasetFormats.C <https://angle.readthedocs.io/en/latest/notes/training.html#data-prepration>`_ format.

We have already processed the data and made it available on HuggingFace: `WhereIsAI/medical-triples <https://huggingface.co/datasets/WhereIsAI/medical-triples/viewer/all_pubmed_en_v1>`_. You can use this processed dataset for this tutorial.
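
The cleaning step itself is not shown in this tutorial. The following is only a minimal sketch of what such a step might look like, not the procedure used to build the published dataset. The column names (``question``, ``long_answer``, ``text``, ``positive``) and the toy rows are placeholders chosen for illustration; check the format documentation linked above for the columns AnglE actually expects.

.. code-block:: python

    import re


    def clean_text(text):
        """Collapse runs of whitespace into single spaces and strip both ends."""
        return re.sub(r"\s+", " ", text).strip()


    def prepare_rows(raw_rows, source_keys, target_keys):
        """Rename columns, clean the text, and drop incomplete or duplicate rows.

        source_keys and target_keys are hypothetical column names supplied by
        the caller; they are not taken from the AnglE documentation.
        """
        seen = set()
        prepared = []
        for row in raw_rows:
            # Missing or None values become empty strings and are filtered out below.
            values = tuple(clean_text(row.get(key) or "") for key in source_keys)
            if not all(values) or values in seen:
                continue
            seen.add(values)
            prepared.append(dict(zip(target_keys, values)))
        return prepared


    # Toy rows for illustration only (not real PubMedQA records).
    raw_rows = [
        {"question": "  first   question\n", "long_answer": "first answer"},
        {"question": "first question", "long_answer": "first answer"},  # duplicate once cleaned
        {"question": "second question", "long_answer": None},  # missing answer, dropped
    ]

    rows = prepare_rows(raw_rows, ("question", "long_answer"), ("text", "positive"))
    print(rows)

Passing the column names as arguments keeps the helper independent of any one dataset layout, so the same function can be reused if the source or target columns differ.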


Step 2: Train the model with `angle-trainer`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


To train AnglE embeddings, you'll need to install the `angle-emb` package:

.. code-block:: bash

    python -m pip install -U angle-emb

The `angle-emb` package includes a user-friendly command-line interface called `angle-trainer <https://angle.readthedocs.io/en/latest/notes/training.html#angle-trainer-recommended>`_ for training AnglE embeddings.

With `angle-trainer`, you can quickly start model training by specifying the data path and `hyperparameters <https://angle.readthedocs.io/en/latest/notes/training.html#fine-tuning-tips>`_.

Here's an example of training a BERT-base model:

.. code-block:: bash

    WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc_per_node=3 --master_port=1234 -m angle_emb.angle_trainer \
        --train_name_or_path WhereIsAI/medical-triples \
        --train_subset_name all_pubmed_en_v1 \
        --save_dir ckpts/pubmedbert-medical-base-v1 \
        --model_name_or_path microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext \
        --pooling_strategy cls \
        --maxlen 75 \
        --ibn_w 20.0 \
        --cosine_w 0.0 \
        --angle_w 1.0 \
        --learning_rate 1e-6 \
        --logging_steps 5 \
        --save_steps 500 \
        --warmup_steps 50 \
        --batch_size 64 \
        --seed 42 \
        --gradient_accumulation_steps 3 \
        --push_to_hub 1 --hub_model_id pubmed-angle-base-en --hub_private_repo 1 \
        --epochs 1 \
        --fp16 1

And here's an example of training a BERT-large model:

.. code-block:: bash

    WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc_per_node=3 --master_port=1234 -m angle_emb.angle_trainer \
        --train_name_or_path WhereIsAI/medical-triples \
        --train_subset_name all_pubmed_en_v1 \
        --save_dir ckpts/uae-medical-large-v1 \
        --model_name_or_path WhereIsAI/UAE-Large-V1 \
        --load_mlm_model 1 \
        --pooling_strategy cls \
        --maxlen 75 \
        --ibn_w 20.0 \
        --cosine_w 0.0 \
        --angle_w 1.0 \
        --learning_rate 1e-6 \
        --logging_steps 5 \
        --save_steps 500 \
        --warmup_steps 50 \
        --batch_size 32 \
        --seed 42 \
        --gradient_accumulation_steps 3 \
        --push_to_hub 1 --hub_model_id pubmed-angle-large-en --hub_private_repo 1 \
        --epochs 1 \
        --fp16 1

These examples use the `WhereIsAI/medical-triples` dataset and specify various hyperparameters for training. Adjust the hyperparameters as needed for your specific use case.
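
When adjusting the batch size, keep in mind how the settings combine. The sketch below computes the number of samples contributing to each optimizer step, assuming that ``--batch_size`` is applied per process (as is typical for ``torchrun`` data-parallel training) rather than globally. That is an assumption, not something this tutorial states, and the helper ``effective_batch_size`` is a hypothetical name used only here.

.. code-block:: python

    def effective_batch_size(num_processes, per_process_batch_size, gradient_accumulation_steps):
        """Samples per optimizer step under data-parallel training.

        Assumes each process handles its own batch and gradients are
        accumulated over several steps before each update.
        """
        return num_processes * per_process_batch_size * gradient_accumulation_steps


    # Values taken from the two commands above: 3 processes, 3 accumulation steps.
    base_effective = effective_batch_size(3, 64, 3)
    large_effective = effective_batch_size(3, 32, 3)
    print(base_effective, large_effective)

Under that assumption, if you train on fewer GPUs you can raise ``--gradient_accumulation_steps`` to keep this product roughly constant.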


Step 3: Evaluate the model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AnglE provides a `CorrelationEvaluator <https://angle.readthedocs.io/en/latest/notes/evaluation.html#spearman-and-pearson-correlation>`_ to evaluate the performance of sentence embeddings.

For convenience, we have processed the `PubMedQA pqa_labeled <https://huggingface.co/datasets/qiaojin/PubMedQA/viewer/pqa_labeled>`_ data into the `DatasetFormats.A` format and made it available as `WhereIsAI/pubmedqa-test-angle-format-a <https://huggingface.co/datasets/WhereIsAI/pubmedqa-test-angle-format-a>`_ for evaluation purposes.

The following code demonstrates how to evaluate the trained `pubmed-angle-base-en` model:

.. code-block:: python

    import os

    # Expose only GPU 0 to this process; set this before loading CUDA-aware libraries.
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'

    from angle_emb import AnglE, CorrelationEvaluator
    from datasets import load_dataset

    # Load the trained model with the same pooling strategy used during training.
    angle = AnglE.from_pretrained('WhereIsAI/pubmed-angle-base-en', pooling_strategy='cls').cuda()

    # Load the evaluation pairs (columns: text1, text2, label).
    ds = load_dataset('WhereIsAI/pubmedqa-test-angle-format-a', split='train')

    metric = CorrelationEvaluator(
        text1=ds['text1'],
        text2=ds['text2'],
        labels=ds['label']
    )(angle, show_progress=True)
    print(metric)

Here, we compare the performance of our trained models with two popular models trained on PubMed data. The results are as follows:


+----------------------------------------+-------------------------+
| Model                                  | Spearman's Correlation  |
+========================================+=========================+
| tavakolih/all-MiniLM-L6-v2-pubmed-full | 84.56                   |
+----------------------------------------+-------------------------+
| NeuML/pubmedbert-base-embeddings       | 84.88                   |
+----------------------------------------+-------------------------+
| WhereIsAI/pubmed-angle-base-en         | 86.01                   |
+----------------------------------------+-------------------------+
| WhereIsAI/pubmed-angle-large-en        | **86.21**               |
+----------------------------------------+-------------------------+


The results show that our trained models, `WhereIsAI/pubmed-angle-base-en` and `WhereIsAI/pubmed-angle-large-en`, outperform the two baseline models on the PubMedQA evaluation set, with the large model achieving the highest Spearman's correlation of **86.21**.
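
To make the metric concrete, here is a small self-contained sketch of a Spearman correlation between cosine similarities and labels. It is an illustration under stated assumptions, not the code inside `CorrelationEvaluator`: it assumes the score for each pair is the cosine similarity of its two embeddings, and the two-dimensional vectors and labels are toy values invented for the example.

.. code-block:: python

    import math


    def cosine_similarity(u, v):
        """Cosine of the angle between two non-zero vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)


    def average_ranks(values):
        """1-based ranks in ascending order; tied values share their mean rank."""
        order = sorted(range(len(values)), key=lambda i: values[i])
        ranks = [0.0] * len(values)
        start = 0
        while start < len(order):
            end = start
            while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
                end += 1
            shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
            for k in range(start, end + 1):
                ranks[order[k]] = shared
            start = end + 1
        return ranks


    def pearson(x, y):
        """Pearson correlation of two equal-length sequences."""
        mean_x = sum(x) / len(x)
        mean_y = sum(y) / len(y)
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
        std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
        return cov / (std_x * std_y)


    def spearman(x, y):
        """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
        return pearson(average_ranks(x), average_ranks(y))


    # Toy embedding pairs and binary labels, for illustration only.
    pairs = [([1, 0], [1, 0]), ([1, 0], [1, 1]), ([1, 0], [0, 1]), ([1, 0], [-1, 0])]
    labels = [1, 1, 0, 0]

    scores = [cosine_similarity(u, v) for u, v in pairs]
    rho = spearman(scores, labels)
    print(rho)

Because Spearman's correlation depends only on rank order, it rewards a model for scoring related pairs above unrelated ones, regardless of the absolute similarity values.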
