- This repo contains the source code of our solutions for WSDM Cup 2023: Pre-training for Web Search and Unbiased Learning for Web Search.
- For the Pre-training task, we provide implementations in both PyTorch and PaddlePaddle (you can pretrain & finetune with either of the two frameworks).
- For the Unbiased LTR task, the code is implemented in PyTorch.
- All checkpoints are available here: Download
Please refer to our papers for details of this competition:
- Task 1 (Unbiased Learning to Rank): Multi-Feature Integration for Perception-Dependent Examination-Bias Estimation
- Task 2 (Pre-training for Web Search): Pretraining De-Biased Language Model with Large-scale Click Logs for Document Ranking
Below are the details for the Pre-training task. For the Unbiased LTR task, see its README.md for details.
- Pre-training BERT with MLM and CTR prediction loss (or multi-task CTR prediction loss).
- Finetuning BERT with pairwise ranking loss.
- Obtaining prediction scores from different BERTs.
- Ensemble learning to combine BERT features and sparse features.
Details will be updated in the submission paper.
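As a rough illustration of the two objectives above, here is a minimal PyTorch sketch, assuming a model that outputs MLM logits and a scalar CTR/relevance logit; the actual heads, masking, and loss weights live in `pytorch_pretrain/` and may differ.

```python
# Hedged sketch of the pretraining (MLM + CTR) and finetuning (pairwise) losses.
# All names and shapes below are illustrative, not the repo's actual API.
import torch
import torch.nn as nn
import torch.nn.functional as F

mlm_loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # MLM loss over masked positions only
ctr_loss_fn = nn.BCEWithLogitsLoss()                   # click / no-click prediction

def pretrain_loss(mlm_logits, mlm_labels, ctr_logits, click_labels, ctr_weight=1.0):
    # mlm_logits: [B, L, V]; mlm_labels: [B, L] with -100 on unmasked positions
    # ctr_logits: [B]; click_labels: [B] in {0, 1}
    mlm = mlm_loss_fn(mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1))
    ctr = ctr_loss_fn(ctr_logits, click_labels.float())
    return mlm + ctr_weight * ctr

def pairwise_ranking_loss(pos_scores, neg_scores, margin=1.0):
    # Finetuning: the clicked/relevant document should outscore the sampled negative.
    return F.relu(margin - (pos_scores - neg_scores)).mean()
```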
- In all `start.sh` files in `pytorch_pretrain` (or `paddle_pretrain`), modify `data_root={Place Your Data Root Path Here}/baidu_ultr` to point to your data path.
- Set `NPROC` to the number of GPUs you use.
cd pytorch_pretrain/pretrain (or paddle_pretrain/pretrain)
sh start.sh
cd pytorch_pretrain/finetune (or paddle_pretrain/finetune)
sh start.sh
cd pytorch_pretrain/submit (or paddle_pretrain/submit)
sh start.sh
We use LambdaMART (implemented with LightGBM) to ensemble the scores from the different finetuned BERT models together with the following sparse features (a training sketch is given after the list):
- query length
- document length
- query frequency
- number of hit words of query in document
- BM25 score
- TF-IDF score
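As a rough sketch of this ensembling step, the snippet below trains a LightGBM LambdaMART ranker on a feature matrix that concatenates the BERT scores with the sparse features listed above; the feature layout and hyper-parameters here are assumptions, and the actual setup is in `./lambdamart/run.ipynb`.

```python
# Minimal LambdaMART sketch with LightGBM (hyper-parameters are illustrative).
import lightgbm as lgb

def train_lambdamart(X_train, y_train, group_train):
    # X_train: [n_docs, n_features] = BERT scores + sparse features (BM25, TF-IDF, ...)
    # y_train: relevance labels; group_train: number of candidate documents per query
    train_set = lgb.Dataset(X_train, label=y_train, group=group_train)
    params = {
        "objective": "lambdarank",
        "metric": "ndcg",
        "learning_rate": 0.05,
        "num_leaves": 31,
    }
    return lgb.train(params, train_set, num_boost_round=500)

# ranker = train_lambdamart(X_train, y_train, group_train)
# test_scores = ranker.predict(X_test)
```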
1) Model details: Checkpoints Download Here
Index | Model Flag | Method | Pretrain step | Finetune step | DCG on leaderboard |
---|---|---|---|---|---|
1 | large_group2_wwm_from_unw4625K | M1 | 1700K | 5130 | 11.96214 |
2 | large_group2_wwm_from_unw4625K | M1 | 1700K | 5130 | NAN |
3 | base_group2_wwm | M2 | 2150K | 5130 | ~11.32363 |
4 | large_group2_wwm_from_unw4625K | M1 | 590K | 5130 | 11.94845 |
5 | large_group2_wwm_from_unw4625K | M1 | 1700K | 4180 | NAN |
6 | large_group2_mt_pretrain | M3 | 1940K | 5130 | NAN |
Method | Model Layers | Details |
---|---|---|
M1 | 24 | WWM & CTR prediction as pretraining tasks |
M2 | 12 | WWM & CTR prediction as pretraining tasks |
M3 | 24 | WWM & Multi-task CTR prediction as pretraining tasks |
- Run cross-validation on the validation set to determine the best parameters; see `./lambdamart/cross_validation.ipynb` (a minimal sketch follows this list).
- Generate the final scores with the parameters determined in step 1; see `./lambdamart/run.ipynb`.
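Below is a minimal sketch of what such a parameter search can look like with LightGBM's built-in `cv`, assuming a grouped ranking dataset; the grids, folds, and metrics actually used are in the notebook, so treat everything here as illustrative.

```python
# Hedged cross-validation sketch for picking LambdaMART parameters (illustrative only).
import lightgbm as lgb

def cv_ndcg(X, y, group, num_leaves, learning_rate):
    dataset = lgb.Dataset(X, label=y, group=group)
    params = {
        "objective": "lambdarank",
        "metric": "ndcg",
        "ndcg_eval_at": [10],
        "num_leaves": num_leaves,
        "learning_rate": learning_rate,
    }
    # stratified=False is required for ranking objectives.
    result = lgb.cv(params, dataset, num_boost_round=300, nfold=5, stratified=False)
    return result  # per-iteration mean/std NDCG; keep the parameters with the best mean
```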
python ./paddle_pretrain/convert/convert-onnx.py
python -m onnxsim model.onnx model_sim.onnx
x2paddle --framework=onnx --model=model_sim.onnx --save_dir=./pd_model
This outputs a folder named `./pd_model`, which contains `x2paddle.py` (the model definition in PaddlePaddle) and `model.pdparams` (the trained parameters). Copy `x2paddle.py` to `./paddle_pretrain/review/x2paddle.py`. We have already generated such a file there from a 24-layer model, so you can use it directly if you always use a 24-layer model.
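If you need to regenerate the ONNX model for a different checkpoint, the first step is a standard `torch.onnx.export` call; the sketch below is only an assumption about what `convert-onnx.py` roughly does (input names, shapes, and opset are placeholders).

```python
# Hedged sketch of exporting a finetuned PyTorch ranking model to ONNX.
import torch

def export_to_onnx(model, seq_len=128, path="model.onnx"):
    model.eval()
    input_ids = torch.zeros(1, seq_len, dtype=torch.long)
    attention_mask = torch.ones(1, seq_len, dtype=torch.long)
    token_type_ids = torch.zeros(1, seq_len, dtype=torch.long)
    torch.onnx.export(
        model,
        (input_ids, attention_mask, token_type_ids),
        path,
        input_names=["input_ids", "attention_mask", "token_type_ids"],
        output_names=["score"],
        opset_version=12,
    )
```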
- Modify `data_root` to your path in `./paddle_pretrain/review/start.sh` and then run it with `sh start.sh`.
- This uses the PaddlePaddle framework to infer the score of each query-document pair.
- We have already inferred the scores of the 6 different models; they are all contained in `./lambdamart/features`.
- Run all cells in `./lambdamart/run.ipynb`. This reproduces the scores of our final submission by ensembling the scores from the different models, identical to `./lambdamart/features/final_result_submit.csv`.
We open-source Docker images for both PyTorch and PaddlePaddle to save you environment configuration time.
Version | Key configuration |
---|---|
Pytorch | |
PaddlePaddle | |

To be updated.
- Xiangsheng Li: [email protected].
- Xiaoshu Chen: [email protected]