RaLMSpec - A Speculation and Caching Framework for Iterative Retrieval-augmented Language Models

Retrieval-augmented language models (RaLM) have emerged as a powerful approach for addressing knowledge-intensive natural language processing (NLP) tasks. RaLM combines a non-parametric knowledge base with a parametric language model to deliver impressive results.

Iterative RaLM interleaves retrieval with generation, and the repeated retrieval steps add significant serving overhead. To reduce this overhead, we introduce RaLMSpec, a framework inspired by speculation and caching techniques. RaLMSpec provides a generic speed-up for iterative RaLM while preserving the model outputs. It does so through two key mechanisms: speculative retrieval and batched verification.
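The interplay of speculative retrieval and batched verification can be illustrated with a toy sketch. It is not the repository's implementation. It assumes that speculation answers each query from a small local cache, that verification checks all speculated queries against the knowledge base in one batched call, and that generation rolls back to the first mismatch. The integer "documents", `nearest`, `lm_step`, `baseline`, and `speculative` are hypothetical names made up for this illustration.

```python
from typing import Iterable, List, Optional, Tuple

Context = List[int]

def nearest(query: int, docs: Iterable[int]) -> Optional[int]:
    """Toy top-1 retrieval: closest document to the query (ties go to the smaller id)."""
    docs = list(docs)
    return min(docs, key=lambda d: (abs(d - query), d)) if docs else None

def lm_step(ctx: Context, doc: int) -> Context:
    """Toy stand-in for one generation stride conditioned on a retrieved document."""
    return ctx + [(ctx[-1] + doc // 10 + 1) % 40]

def baseline(ctx: Context, n_steps: int, kb: List[int]) -> Context:
    """Plain iterative RaLM: one knowledge-base lookup per step."""
    for _ in range(n_steps):
        ctx = lm_step(ctx, nearest(ctx[-1], kb))
    return ctx

def speculative(ctx: Context, n_steps: int, kb: List[int],
                spec_step: int = 3) -> Tuple[Context, int]:
    """Same output as baseline(), but with fewer (batched) knowledge-base calls."""
    cache, kb_calls, done = set(), 0, 0
    while done < n_steps:
        # Speculation: take up to spec_step steps using only the local cache.
        trace = []  # (context before the step, query, speculated document)
        cur = ctx
        for _ in range(min(spec_step, n_steps - done)):
            query = cur[-1]
            guess = nearest(query, cache)
            trace.append((cur, query, guess))
            if guess is None:  # empty cache: nothing to speculate with yet
                break
            cur = lm_step(cur, guess)
        # Batched verification: one knowledge-base call for all speculated queries.
        truth = [nearest(q, kb) for _, q, _ in trace]
        kb_calls += 1
        cache.update(truth)
        for i, ((before, _, guess), true_doc) in enumerate(zip(trace, truth)):
            if guess != true_doc:
                # Roll back to the first mismatch and redo that step correctly.
                ctx = lm_step(before, true_doc)
                done += i + 1
                break
        else:
            # Every guess was verified, so the speculated context is kept as is.
            ctx, done = cur, done + len(trace)
    return ctx, kb_calls

kb = [0, 10, 20, 30]
output, calls = speculative([1], n_steps=8, kb=kb, spec_step=3)
print(output == baseline([1], 8, kb), calls)
```

Because every speculated step is either confirmed or redone with the verified document, the toy output always matches the baseline; the saving comes only from issuing fewer knowledge-base calls.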

Three optional features further increase the speed-up: prefetching (P), an optimal speculation stride scheduler (S), and asynchronous verification (A).

How to Run RaLMSpec

Evaluate RaLMSpec:

To run RaLMSpec without any special features enabled, use the following command:

python -u eval_rag_serve.py \
--model_name $MODEL_NAME \
--dataset_path $DATASET \
--dataset_split validation \
--output_dir $OUTPUT_PATH \
--gpu_id $HOST_GPU_ID \
--trial_num $NUMBER_TRAIL \
--stride 4 \
--spec_step 3 \
--retrieval_type $RETRIEVAL_TYPE \
--max_length 128 \
--retriever \
--cache

In our experiments, $MODEL_NAME is one of:

gpt2-medium
meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-13b-hf
meta-llama/Llama-2-70b-hf (Requires 4 A100-80G GPUs)
facebook/opt-1.3b

Evaluate RaLMSpec with Prefetch (P):

To run RaLMSpec with Prefetch enabled, use the following command:

python -u eval_rag_serve.py \
--model_name $MODEL_NAME \
--dataset_path $DATASET \
--dataset_split validation \
--output_dir $OUTPUT_PATH \
--gpu_id $HOST_GPU_ID \
--trial_num $NUMBER_TRAIL \
--stride 4 \
--spec_step 3 \
--retrieval_type $RETRIEVAL_TYPE \
--max_length 128 \
--retriever \
--cache \
--cache_update_width 20
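As a rough intuition for why prefetching helps, the sketch below assumes that each verification step copies the top-k documents for a query into the local cache instead of only the top-1, and that `--cache_update_width` plays the role of k. Both points are assumptions rather than statements about the repository's code, and `top_k` and `LocalCache` are hypothetical names with integer stand-ins for documents.

```python
from typing import Iterable, List, Optional, Set

def top_k(query: int, docs: Iterable[int], k: int) -> List[int]:
    """Toy ranking: the k documents closest to the query (ties go to the smaller id)."""
    return sorted(docs, key=lambda d: (abs(d - query), d))[:k]

class LocalCache:
    """Per-request cache that is refreshed with the top `update_width` documents."""

    def __init__(self, update_width: int = 1) -> None:
        self.update_width = update_width
        self.docs: Set[int] = set()

    def update(self, query: int, kb: Iterable[int]) -> None:
        # Prefetching: keep more than the single best document for this query.
        self.docs.update(top_k(query, kb, self.update_width))

    def lookup(self, query: int) -> Optional[int]:
        # Speculative retrieval: best match among the cached documents only.
        hits = top_k(query, self.docs, 1)
        return hits[0] if hits else None

kb = [0, 10, 20, 30]
plain, prefetching = LocalCache(update_width=1), LocalCache(update_width=2)
for cache in (plain, prefetching):
    cache.update(14, kb)  # a verified query populates the cache

# A later query whose true top-1 document in the knowledge base is 20.
print(plain.lookup(17), prefetching.lookup(17), top_k(17, kb, 1)[0])
```

In this toy run the wider cache already holds the document the later query needs, so the speculation would pass verification, while the top-1-only cache would guess wrong.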

Evaluate RaLMSpec with Optimal Speculation Stride (S):

To run RaLMSpec with Optimal Speculation Stride enabled, use the following command:

python -u eval_rag_serve.py \
--model_name $MODEL_NAME \
--dataset_path $DATASET \
--dataset_split validation \
--output_dir $OUTPUT_PATH \
--gpu_id $HOST_GPU_ID \
--trial_num $NUMBER_TRAIL \
--stride 4 \
--spec_step 1 \
--retrieval_type $RETRIEVAL_TYPE \
--max_length 128 \
--retriever \
--cache \
--adapt_spec_step
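One way to think about choosing a speculation stride is sketched below. It assumes each speculated document is correct independently with probability `gamma`, that a speculation step costs `spec_latency`, and that one batched verification costs `verify_latency`. Under those assumptions, a stride of s yields an expected 1 + gamma + ... + gamma^(s-1) verified documents per round, and the scheduler picks the stride that maximizes verified documents per unit time. This objective and the function names are illustrative assumptions, not a description of what `--adapt_spec_step` does internally.

```python
def expected_verified(gamma: float, stride: int) -> float:
    """Expected documents confirmed per round: sum of gamma**i for i in 0..stride-1.

    The i = 0 term reflects that verification always yields at least one
    correct document, even when the very first guess is wrong.
    """
    return sum(gamma ** i for i in range(stride))

def best_stride(gamma: float, spec_latency: float, verify_latency: float,
                max_stride: int = 16) -> int:
    """Stride in 1..max_stride that maximizes verified documents per unit time."""
    def throughput(stride: int) -> float:
        round_latency = stride * spec_latency + verify_latency
        return expected_verified(gamma, stride) / round_latency
    return max(range(1, max_stride + 1), key=throughput)

# Hypothetical latencies in arbitrary units, chosen only for illustration.
for gamma in (0.0, 0.5, 0.9, 1.0):
    print(gamma, best_stride(gamma, spec_latency=1.0, verify_latency=5.0))
```

The two extremes behave as expected: when speculation is never right, a stride of 1 wastes the least work, and when it is always right, the longest allowed stride amortizes verification best. In practice `gamma` and the latencies would have to be estimated at run time.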

Evaluate RaLMSpec with Asynchronous Verification (A):

To run RaLMSpec with Asynchronous Verification enabled, use the following command:

python -u eval_rag_serve.py \
--model_name $MODEL_NAME \
--dataset_path $DATASET \
--dataset_split validation \
--output_dir $OUTPUT_PATH \
--gpu_id $HOST_GPU_ID \
--trial_num $NUMBER_TRAIL \
--stride 4 \
--spec_step 3 \
--retrieval_type $RETRIEVAL_TYPE \
--max_length 128 \
--retriever \
--cache \
--async_retrieval
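The idea of overlapping verification with further speculation can be sketched with a thread pool. The sketch assumes that asynchronous verification means running one extra speculation step while the batched knowledge-base lookup is in flight, and discarding that extra step if verification finds a mismatch. `verify_with_overlap`, `kb_batch_search`, and `speculate_next` are hypothetical names, not functions from the repository.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Optional, Sequence, Tuple

def verify_with_overlap(
    queries: Sequence[Any],
    guesses: Sequence[Any],
    kb_batch_search: Callable[[List[Any]], List[Any]],
    speculate_next: Callable[[], Any],
) -> Tuple[Optional[int], List[Any], Optional[Any]]:
    """Verify speculated documents while one more speculation step runs.

    Returns (index of the first mismatch or None, verified documents,
    the extra speculated step or None if it had to be discarded).
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(kb_batch_search, list(queries))
        extra = speculate_next()   # overlaps with the knowledge-base lookup
        truth = pending.result()   # wait for verification to finish
    mismatch = next(
        (i for i, (g, t) in enumerate(zip(guesses, truth)) if g != t), None
    )
    if mismatch is not None:
        extra = None  # the extra step was built on an unverified, wrong document
    return mismatch, truth, extra

# Toy knowledge base: the "document" for a query is the query rounded down to a ten.
toy_kb = lambda qs: [q // 10 * 10 for q in qs]
print(verify_with_overlap([11, 12], [10, 10], toy_kb, lambda: "extra step"))
print(verify_with_overlap([11, 25], [10, 10], toy_kb, lambda: "extra step"))
```

When verification succeeds, the extra step is free progress; when it fails, only that one step is wasted, which is why the overlap tends to pay off when speculation is usually right.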

Evaluate RaLMSpec with Prefetch (P), Asynchronous Verification (A), and Optimal Speculation Stride (S):

To run RaLMSpec with all functionalities, use the following command:

python -u eval_rag_serve.py \
--model_name $MODEL_NAME \
--dataset_path $DATASET \
--dataset_split validation \
--output_dir $OUTPUT_PATH \
--gpu_id $HOST_GPU_ID \
--trial_num $NUMBER_TRAIL \
--stride 4 \
--spec_step 1 \
--retrieval_type $RETRIEVAL_TYPE \
--max_length 128 \
--retriever \
--cache \
--cache_update_width 20 \
--adapt_spec_step \
--async_retrieval

Diverse workloads

To run experiments on Iter-Retgen or FLARE, replace eval_rag_serve.py in the commands above with eval_rag_serve_retgen.py or eval_rag_serve_active.py, respectively.
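Since the commands above differ only in a few flags and in the script name, a small helper can assemble them. The sketch below uses only the flags and script names that appear in this README. The helper itself, its parameter names, and the workload keys are hypothetical, and it assumes the Iter-Retgen and FLARE scripts accept the same arguments as eval_rag_serve.py.

```python
from typing import List

# Script per workload; the pairing of names to workloads is an assumption.
SCRIPTS = {
    "default": "eval_rag_serve.py",
    "iter-retgen": "eval_rag_serve_retgen.py",
    "flare": "eval_rag_serve_active.py",
}

def build_command(model_name: str, dataset_path: str, output_dir: str,
                  gpu_id: str, trial_num: str, retrieval_type: str, *,
                  prefetch: bool = False, stride_scheduler: bool = False,
                  async_verification: bool = False,
                  workload: str = "default") -> List[str]:
    """Return the argument list for one evaluation run."""
    cmd = [
        "python", "-u", SCRIPTS[workload],
        "--model_name", model_name,
        "--dataset_path", dataset_path,
        "--dataset_split", "validation",
        "--output_dir", output_dir,
        "--gpu_id", gpu_id,
        "--trial_num", trial_num,
        "--stride", "4",
        # The README commands start from spec_step 1 when the scheduler is on.
        "--spec_step", "1" if stride_scheduler else "3",
        "--retrieval_type", retrieval_type,
        "--max_length", "128",
        "--retriever",
        "--cache",
    ]
    if prefetch:             # (P)
        cmd += ["--cache_update_width", "20"]
    if stride_scheduler:     # (S)
        cmd.append("--adapt_spec_step")
    if async_verification:   # (A)
        cmd.append("--async_retrieval")
    return cmd

# Placeholder values; substitute your own paths and retriever type.
print(" ".join(build_command(
    "gpt2-medium", "<dataset>", "<output-dir>", "0", "1", "<retrieval-type>",
    prefetch=True, stride_scheduler=True, async_verification=True,
)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues.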

Evaluation

For document-level retrieval, we evaluated RaLMSpec with three language models on four downstream QA datasets. Compared with the baseline, RaLMSpec achieves speed-up ratios of 1.75-2.39×, 1.04-1.39×, and 1.31-1.77× when the retriever is an exact dense retriever (EDR), an approximate dense retriever (ADR), and a sparse retriever (SR), respectively.

For token-level iterative RaLM (kNN-LM) serving, RaLMSpec achieves speed-up ratios of up to 7.59× and 2.45× over the kNN-LM baseline on wiki-QA when the retriever is an exact dense retriever and an approximate dense retriever, respectively.

In summary, RaLMSpec speeds up iterative RaLM serving while preserving model outputs, which makes these approaches more practical for knowledge-intensive NLP tasks.

Citation

Please cite RaLMSpec as:

@misc{zhang2024accelerating,
      title={Accelerating Retrieval-Augmented Language Model Serving with Speculation}, 
      author={Zhihao Zhang and Alan Zhu and Lijie Yang and Yihua Xu and Lanting Li and Phitchaya Mangpo Phothilimthana and Zhihao Jia},
      year={2024},
      eprint={2401.14021},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
