From d11a5eab3ba331a9cc581db7aca0c496afbda663 Mon Sep 17 00:00:00 2001
From: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com>
Date: Wed, 23 Oct 2024 14:41:04 -0700
Subject: [PATCH] Apply suggestions from code review

Co-authored-by: Ryan McCormick
---
 .../Part_8-semantic_caching/README.md | 32 +++++++++----------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/Conceptual_Guide/Part_8-semantic_caching/README.md b/Conceptual_Guide/Part_8-semantic_caching/README.md
index 9f13bb80..45bbd3fa 100644
--- a/Conceptual_Guide/Part_8-semantic_caching/README.md
+++ b/Conceptual_Guide/Part_8-semantic_caching/README.md
@@ -26,7 +26,7 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 -->

-# Semantic caching
+# Semantic Caching

 When deploying large language models (LLMs) or LLM-based workflows
 there are two key factors to consider: the performance and cost-efficiency
@@ -37,10 +37,9 @@ pressing need for optimization strategies that can maintain high-quality
 outputs while minimizing operational expenses.

 Semantic caching emerges as a powerful solution to reduce computational costs
-for LLM-based applications. Unlike traditional caching, it considers
-the content and context of incoming requests.
+for LLM-based applications.

-## Definition and Main Benefits
+## Definition and Benefits

 **_Semantic caching_** is a caching mechanism that takes into account
 the semantics of the incoming request, rather than just the raw data itself.
@@ -60,7 +59,7 @@ This approach offers several benefits including, but not limited to:
   - One of the primary benefits of semantic caching is its ability to
     significantly improve response times. By retrieving cached responses for
     similar queries, the system can bypass the need for full model inference,
-    resulting in the reduced latency.
+    resulting in reduced latency.

 + **Increased Throughput**

@@ -85,7 +84,7 @@ This approach offers several benefits including, but not limited to:

 ## Sample Reference Implementation

-In this tutorial we provide a reference implementation for Semantic Cache in
+In this tutorial we provide a reference implementation for a Semantic Cache in
 [semantic_caching.py.](./artifacts/semantic_caching.py)
 There are 3 key dependencies:
 * [SentenceTransformer](https://sbert.net/): a Python framework for computing
@@ -99,7 +98,7 @@ developed by Facebook AI Research for efficient similarity search and
 clustering of dense vectors.
   - This library is used for the embedding store and extracting the most
     similar embedded prompt from the cached requests (or from the index store).
-  - This is a mighty library with a great variety of CPu and GPU accelerated
+  - This is a mighty library with a great variety of CPU and GPU accelerated
     algorithms.
   - Alternatives include [annoy](https://github.com/spotify/annoy), or
     [cuVS](https://github.com/rapidsai/cuvs). However, note that cuVS already
@@ -107,7 +106,7 @@ clustering of dense vectors.
 * [Theine](https://github.com/Yiling-J/theine): High performance in-memory
   cache.
   - We will use it as our exact match cache backend. After the most similar
-    prompt is identified, the corresponding cached response id extracted from
+    prompt is identified, the corresponding cached response is extracted from
     the cache. This library supports multiple eviction policies, in this
     tutorial we use "LRU".
   - One may also look into [MemCached](https://memcached.org/about) as a
@@ -124,12 +123,12 @@ as our example, focusing on demonstrating how to cache responses for the
 non-streaming case.
 The principles covered here can be extended to handle streaming scenarios
 as well.

-### Cutomising vllm backend
+### Customising vLLM Backend

 First, let's start by cloning Triton's vllm backend repository.
 This will provide the necessary codebase to implement our semantic caching
 example.

-``bash
+```bash
 git clone https://github.com/triton-inference-server/vllm_backend.git
 ```
@@ -143,7 +142,7 @@ wget -P vllm_backend/src/utils/ https://raw.githubusercontent.com/triton-inferen
 ```

 Now that we have added the semantic caching script, let's proceed by making
-some adjustments in `/vllm_backend/src/model.py`. These changes will integrate
+some adjustments in `vllm_backend/src/model.py`. These changes will integrate
 the semantic caching functionality into the model.

 First, ensure that you import the necessary classes from `semantic_caching.py`:
@@ -234,11 +233,12 @@ but make sure to specify proper paths to the cloned `vllm_backend` repository
 and replace `<xx.yy>` with the latest release of Triton.

 ```bash
-docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G \
---ulimit memlock=-1 --ulimit stack=67108864 \
--v /path/to/vllm_backend/src/:/opt/tritonserver/backends/vllm \
--v /path/to/vllm_backend/samples/model_repository:/work/model_repository \
--w /work nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3
+docker run --gpus all -it --net=host --rm \
+  --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 \
+  -v /path/to/vllm_backend/src/:/opt/tritonserver/backends/vllm \
+  -v /path/to/vllm_backend/samples/model_repository:/workspace/model_repository \
+  -w /workspace \
+  nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3
 ```

 When inside the container, make sure to install required dependencies:
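
To make the caching flow described in the patch above concrete, here is a minimal sketch of the two-step lookup: embed the incoming prompt with SentenceTransformer, search a Faiss index for the closest previously seen prompt, and return its stored response when the match is close enough. This is a sketch only, not the tutorial's `semantic_caching.py`: a plain Python dict stands in for the Theine "LRU" cache, and the embedding model name and distance threshold are illustrative assumptions.

```python
# Minimal sketch of a semantic cache lookup -- not the tutorial's semantic_caching.py.
# Assumptions: "all-MiniLM-L6-v2" as the embedding model, an L2-distance threshold
# of 0.2, and a plain dict in place of the Theine "LRU" cache used by the tutorial.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
dim = encoder.get_sentence_embedding_dimension()
index = faiss.IndexFlatL2(dim)   # embedding store for previously seen prompts
responses = {}                   # Faiss id -> cached response text
DIST_THRESHOLD = 0.2             # illustrative cutoff for "similar enough"


def lookup(prompt: str):
    """Return a cached response for a semantically similar prompt, or None."""
    if index.ntotal == 0:
        return None
    emb = encoder.encode([prompt]).astype("float32")
    distances, ids = index.search(emb, 1)          # nearest cached prompt
    if distances[0][0] <= DIST_THRESHOLD:
        return responses.get(int(ids[0][0]))
    return None


def insert(prompt: str, response: str):
    """Embed the prompt and remember its response for future lookups."""
    emb = encoder.encode([prompt]).astype("float32")
    responses[index.ntotal] = response             # Faiss assigns ids sequentially
    index.add(emb)
```

On a cache hit the full vLLM inference is skipped entirely, which is where the latency, throughput, and cost benefits listed earlier come from.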
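
Once the customised backend is running, the effect of the cache can be observed from the client side by sending two semantically similar prompts and comparing response times. The snippet below is a sketch under stated assumptions: Triton has been started with the sample `vllm_model` from `vllm_backend/samples/model_repository` and is serving HTTP on `localhost:8000` (the launch and dependency-installation steps are not shown in this excerpt), and the `text_input`/`text_output` names follow the vLLM backend's sample model.

```python
# Sketch of a client-side check for the semantic cache. Assumes a running Triton
# server with the sample "vllm_model" on localhost:8000; the model name, port, and
# field names are assumptions based on the vLLM backend samples, not this diff.
import time

import requests

URL = "http://localhost:8000/v2/models/vllm_model/generate"


def generate(prompt: str) -> tuple[str, float]:
    """Send a prompt to the generate endpoint and return (output, latency)."""
    start = time.perf_counter()
    resp = requests.post(
        URL,
        json={"text_input": prompt,
              "parameters": {"stream": False, "temperature": 0}},
    )
    resp.raise_for_status()
    return resp.json()["text_output"], time.perf_counter() - start


# The second, slightly reworded prompt should be answered from the semantic
# cache and therefore return noticeably faster than the first one.
print(generate("Tell me about the Triton Inference Server."))
print(generate("Tell me about Triton Inference Server."))
```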