add tutorial

SeanLee97 · Jul 31, 2024 · 1a6e8a8 · 1a6e8a8
1 parent 8dcae41
commit 1a6e8a8
Showing 1 changed file with 138 additions and 0 deletions.
diff --git a/docs/notes/tutorial.rst b/docs/notes/tutorial.rst
@@ -0,0 +1,138 @@
+👨‍🏫 Tutorial
+============================
+
+
+3-steps to train a powerful pubmed sentence embeddings.
+------------------------------------------------------------
+
+This tutorial will guide you through the process of training powerful sentence embeddings using PubMed data with the AnglE framework. We'll cover data preparation, model training, and evaluation.
+
+
+Step 1: Data preparation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+Clean data from the `qiaojin/PubMedQA <https://huggingface.co/datasets/qiaojin/PubMedQA>`_ dataset and prepare it into AnglE's `DatasetFormats.C <https://angle.readthedocs.io/en/latest/notes/training.html#data-prepration>`_ format.
+
+We have already processed the data and made it available on HuggingFace: `WhereIsAI/medical-triples <https://huggingface.co/datasets/WhereIsAI/medical-triples/viewer/all_pubmed_en_v1>`_. You can use this processed dataset for this tutorial.
+
+
+Step 2: Train the model with `angle-trainer`
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+To train AnglE embeddings, you'll need to install the `angle-emb` package:
+
+.. code-block:: bash
+
+    python -m pip install -U angle-emb
+
+The `angle-emb` package includes a user-friendly command-line interface called `angle-trainer <https://angle.readthedocs.io/en/latest/notes/training.html#angle-trainer-recommended>`_ for training AnglE embeddings.
+
+With `angle-trainer`, you can quickly start model training by specifying the data path and `hyperparameters <https://angle.readthedocs.io/en/latest/notes/training.html#fine-tuning-tips>`_.
+
+Here's an example of training a BERT-base model:
+
+.. code-block:: bash
+
+    WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc_per_node=3 --master_port=1234 -m angle_emb.angle_trainer \
+    --train_name_or_path WhereIsAI/medical-triples \
+    --train_subset_name all_pubmed_en_v1 \
+    --save_dir ckpts/pubmedbert-medical-base-v1 \
+    --model_name_or_path microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext \
+    --pooling_strategy cls \
+    --maxlen 75 \
+    --ibn_w 20.0 \
+    --cosine_w 0.0 \
+    --angle_w 1.0 \
+    --learning_rate 1e-6 \
+    --logging_steps 5 \
+    --save_steps 500 \
+    --warmup_steps 50 \
+    --batch_size 64 \
+    --seed 42 \
+    --gradient_accumulation_steps 3 \
+    --push_to_hub 1 --hub_model_id pubmed-angle-base-en --hub_private_repo 1 \
+    --epochs 1 \
+    --fp16 1
+
+
+And here's an example of training a BERT-large model:
+
+.. code-block:: bash
+
+    WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc_per_node=3 --master_port=1234 -m angle_emb.angle_trainer \
+    --train_name_or_path WhereIsAI/medical-triples \
+    --train_subset_name all_pubmed_en_v1 \
+    --save_dir ckpts/uae-medical-large-v1 \
+    --model_name_or_path WhereIsAI/UAE-Large-V1 \
+    --load_mlm_model 1 \
+    --pooling_strategy cls \
+    --maxlen 75 \
+    --ibn_w 20.0 \
+    --cosine_w 0.0 \
+    --angle_w 1.0 \
+    --learning_rate 1e-6 \
+    --logging_steps 5 \
+    --save_steps 500 \
+    --warmup_steps 50 \
+    --batch_size 32 \
+    --seed 42 \
+    --gradient_accumulation_steps 3 \
+    --push_to_hub 1 --hub_model_id pubmed-angle-large-en --hub_private_repo 1 \
+    --epochs 1 \
+    --fp16 1
+
+
+These examples use the `WhereIsAI/medical-triples` dataset and specify various hyperparameters for training. Adjust the hyperparameters as needed for your specific use case.
+
+
+Step 3: Evaluate the model
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+AnglE provides a `CorrelationEvaluator <https://angle.readthedocs.io/en/latest/notes/evaluation.html#spearman-and-pearson-correlation>`_ to evaluate the performance of sentence embeddings.
+
+For convenience, we have processed the `PubMedQA pqa_labeled <https://huggingface.co/datasets/qiaojin/PubMedQA/viewer/pqa_labeled>`_ data into the `DatasetFormats.A` format and made it available as `WhereIsAI/pubmedqa-test-angle-format-a <https://huggingface.co/datasets/WhereIsAI/pubmedqa-test-angle-format-a>`_ for evaluation purposes.
+
+The following code demonstrates how to evaluate the trained `pubmed-angle-base-en` model:
+
+.. code-block:: python
+
+    import os
+    os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+    from angle_emb import AnglE, CorrelationEvaluator
+    from datasets import load_dataset
+
+
+    angle = AnglE.from_pretrained('WhereIsAI/pubmed-angle-base-en', pooling_strategy='cls').cuda()
+
+    ds = load_dataset('WhereIsAI/pubmedqa-test-angle-format-a', split='train')
+
+    metric = CorrelationEvaluator(
+        text1=ds['text1'],
+        text2=ds['text2'],
+        labels=ds['label']
+    )(angle, show_progress=True)
+
+    print(metric)
+
+
+Here, we compare the performance of our trained models with two popular models trained on PubMed data. The results are as follows:
+
+
++----------------------------------------+-------------------------+
+| Model                                  | Spearman's Correlation  |
++========================================+=========================+
+| tavakolih/all-MiniLM-L6-v2-pubmed-full | 84.56                   |
++----------------------------------------+-------------------------+
+| NeuML/pubmedbert-base-embeddings       | 84.88                   |
++----------------------------------------+-------------------------+
+| WhereIsAI/pubmed-angle-base-en         | 86.01                   |
++----------------------------------------+-------------------------+
+| WhereIsAI/pubmed-angle-large-en        | **86.21**               |
++----------------------------------------+-------------------------+
+
+
+The results show that our trained models, `WhereIsAI/pubmed-angle-base-en` and `WhereIsAI/pubmed-angle-large-en`, performs better than other popular models on the PubMedQA dataset, with the large model achieving the highest Spearman's correlation of **86.21**.
+