From 1a6e8a8d3e9efe85818b6d667d6ff1773114f44f Mon Sep 17 00:00:00 2001
From: Sean Lee
Date: Wed, 31 Jul 2024 16:10:49 +0800
Subject: [PATCH] add tutorial

---
 docs/notes/tutorial.rst | 138 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 138 insertions(+)
 create mode 100644 docs/notes/tutorial.rst

diff --git a/docs/notes/tutorial.rst b/docs/notes/tutorial.rst
new file mode 100644
index 0000000..afbd85b
--- /dev/null
+++ b/docs/notes/tutorial.rst
@@ -0,0 +1,138 @@
+👨‍🏫 Tutorial
+============================
+
+
+3 steps to train powerful PubMed sentence embeddings
+------------------------------------------------------------
+
+This tutorial will guide you through the process of training powerful sentence embeddings using PubMed data with the AnglE framework. We'll cover data preparation, model training, and evaluation.
+
+
+Step 1: Data preparation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+Clean data from the `qiaojin/PubMedQA `_ dataset and prepare it into AnglE's `DatasetFormats.C `_ format.
+
+We have already processed the data and made it available on HuggingFace: `WhereIsAI/medical-triples `_. You can use this processed dataset for this tutorial.
+
+
+Step 2: Train the model with `angle-trainer`
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+To train AnglE embeddings, you'll need to install the `angle-emb` package:
+
+.. code-block:: bash
+
+    python -m pip install -U angle-emb
+
+The `angle-emb` package includes a user-friendly command-line interface called `angle-trainer `_ for training AnglE embeddings.
+
+With `angle-trainer`, you can quickly start model training by specifying the data path and `hyperparameters `_.
+
+Here's an example of training a BERT-base model:
+
+.. code-block:: bash
+
+    WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc_per_node=3 --master_port=1234 -m angle_emb.angle_trainer \
+    --train_name_or_path WhereIsAI/medical-triples \
+    --train_subset_name all_pubmed_en_v1 \
+    --save_dir ckpts/pubmedbert-medical-base-v1 \
+    --model_name_or_path microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext \
+    --pooling_strategy cls \
+    --maxlen 75 \
+    --ibn_w 20.0 \
+    --cosine_w 0.0 \
+    --angle_w 1.0 \
+    --learning_rate 1e-6 \
+    --logging_steps 5 \
+    --save_steps 500 \
+    --warmup_steps 50 \
+    --batch_size 64 \
+    --seed 42 \
+    --gradient_accumulation_steps 3 \
+    --push_to_hub 1 --hub_model_id pubmed-angle-base-en --hub_private_repo 1 \
+    --epochs 1 \
+    --fp16 1
+
+
+And here's an example of training a BERT-large model:
+
+.. code-block:: bash
+
+    WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc_per_node=3 --master_port=1234 -m angle_emb.angle_trainer \
+    --train_name_or_path WhereIsAI/medical-triples \
+    --train_subset_name all_pubmed_en_v1 \
+    --save_dir ckpts/uae-medical-large-v1 \
+    --model_name_or_path WhereIsAI/UAE-Large-V1 \
+    --load_mlm_model 1 \
+    --pooling_strategy cls \
+    --maxlen 75 \
+    --ibn_w 20.0 \
+    --cosine_w 0.0 \
+    --angle_w 1.0 \
+    --learning_rate 1e-6 \
+    --logging_steps 5 \
+    --save_steps 500 \
+    --warmup_steps 50 \
+    --batch_size 32 \
+    --seed 42 \
+    --gradient_accumulation_steps 3 \
+    --push_to_hub 1 --hub_model_id pubmed-angle-large-en --hub_private_repo 1 \
+    --epochs 1 \
+    --fp16 1
+
+
+These examples use the `WhereIsAI/medical-triples` dataset and specify various hyperparameters for training. Adjust the hyperparameters as needed for your specific use case.
+
+
+Step 3: Evaluate the model
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+AnglE provides a `CorrelationEvaluator `_ to evaluate the performance of sentence embeddings.
+
+For convenience, we have processed the `PubMedQA pqa_labeled `_ data into the `DatasetFormats.A` format and made it available as `WhereIsAI/pubmedqa-test-angle-format-a `_ for evaluation purposes.
+
+The following code demonstrates how to evaluate the trained `pubmed-angle-base-en` model:
+
+.. code-block:: python
+
+    import os
+    os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+    from angle_emb import AnglE, CorrelationEvaluator
+    from datasets import load_dataset
+
+
+    angle = AnglE.from_pretrained('WhereIsAI/pubmed-angle-base-en', pooling_strategy='cls').cuda()
+
+    ds = load_dataset('WhereIsAI/pubmedqa-test-angle-format-a', split='train')
+
+    metric = CorrelationEvaluator(
+        text1=ds['text1'],
+        text2=ds['text2'],
+        labels=ds['label']
+    )(angle, show_progress=True)
+
+    print(metric)
+
+
+Here, we compare the performance of our trained models with two popular models trained on PubMed data. The results are as follows:
+
+
++----------------------------------------+-------------------------+
+| Model                                  | Spearman's Correlation  |
++========================================+=========================+
+| tavakolih/all-MiniLM-L6-v2-pubmed-full | 84.56                   |
++----------------------------------------+-------------------------+
+| NeuML/pubmedbert-base-embeddings       | 84.88                   |
++----------------------------------------+-------------------------+
+| WhereIsAI/pubmed-angle-base-en         | 86.01                   |
++----------------------------------------+-------------------------+
+| WhereIsAI/pubmed-angle-large-en        | **86.21**               |
++----------------------------------------+-------------------------+
+
+
+The results show that our trained models, `WhereIsAI/pubmed-angle-base-en` and `WhereIsAI/pubmed-angle-large-en`, perform better than other popular models on the PubMedQA dataset, with the large model achieving the highest Spearman's correlation of **86.21**.
+
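
To make the Spearman's correlation reported in Step 3 concrete, here is a minimal, dependency-free sketch of the general technique: compute the cosine similarity of each embedding pair, rank the similarities and the gold labels, and take the Pearson correlation of the two rank lists. The toy vectors and labels and the helper names (`cosine_similarity`, `average_ranks`, `pearson`, `spearman`) are illustrative assumptions. This is not AnglE's `CorrelationEvaluator` implementation, which may differ in details such as tie handling or scaling (the table's values appear to be on a 0-100 scale).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical 2-d "embeddings" for four sentence pairs (not real model output).
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),   # nearly the same direction
    ([1.0, 0.0], [0.5, 0.5]),   # 45 degrees apart
    ([1.0, 0.0], [0.0, 1.0]),   # orthogonal
    ([1.0, 0.0], [-1.0, 0.2]),  # nearly opposite
]
labels = [1.0, 0.7, 0.3, 0.0]   # hypothetical gold similarity labels

similarities = [cosine_similarity(u, v) for u, v in pairs]
score = spearman(similarities, labels)
print(f"Spearman's correlation: {score:.4f}")
```

In this toy data the similarities and the labels are ordered identically, so the score is 1 up to floating-point error. Because only ranks matter, the metric rewards a model for ordering pairs correctly, not for matching the label values themselves.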