Apply suggestions from code review
Co-authored-by: Ryan McCormick <[email protected]>
oandreeva-nv and rmccorm4 authored Oct 23, 2024
1 parent 0fb90d6 commit d11a5ea
32 changes: 16 additions & 16 deletions Conceptual_Guide/Part_8-semantic_caching/README.md
@@ -26,7 +26,7 @@
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Semantic Caching

When deploying large language models (LLMs) or LLM-based workflows,
there are two key factors to consider: the performance and cost-efficiency
@@ -37,10 +37,9 @@
pressing need for optimization strategies that can maintain
high-quality outputs while minimizing operational expenses.

Semantic caching emerges as a powerful solution to reduce computational costs
for LLM-based applications.

## Definition and Benefits

**_Semantic caching_** is a caching mechanism that takes into account
the semantics of the incoming request, rather than just the raw data itself.
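
To make the idea concrete before the reference implementation below, here is a
minimal sketch of a semantic cache lookup. It is not the tutorial's
`semantic_caching.py`; it only illustrates the flow (embed the prompt, look for a
close neighbor, reuse the stored response on a hit) using the SentenceTransformer
and Faiss libraries introduced later. The model name, distance threshold, and
helper names (`get_or_generate`, `responses`, `THRESHOLD`) are illustrative
assumptions, not the tutorial's code.

```python
# Illustrative sketch only -- not the tutorial's semantic_caching.py implementation.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here
dim = encoder.get_sentence_embedding_dimension()   # 384 for this particular model
index = faiss.IndexFlatL2(dim)                     # exact L2 nearest-neighbor search
responses = []                                     # responses[i] pairs with vector i in the index
THRESHOLD = 0.25                                   # max L2 distance counted as a semantic "hit"

def get_or_generate(prompt: str, generate_fn):
    """Return a cached response for a semantically similar prompt, or run inference."""
    emb = encoder.encode([prompt]).astype(np.float32)
    if index.ntotal > 0:
        distances, ids = index.search(emb, 1)
        if distances[0][0] < THRESHOLD:            # a similar prompt was seen before
            return responses[ids[0][0]]            # reuse its response, skip inference
    response = generate_fn(prompt)                 # cache miss: full model inference
    index.add(emb)                                 # remember this prompt's embedding...
    responses.append(response)                     # ...and the response it produced
    return response
```

With this shape, two differently worded versions of the same question map to
nearby embeddings, and the second one is answered from the cache without running
the LLM again.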
@@ -60,7 +59,7 @@
This approach offers several benefits including, but not limited to:
- One of the primary benefits of semantic caching is its ability to
significantly improve response times. By retrieving cached responses for
similar queries, the system can bypass the need for full model inference,
resulting in reduced latency.

+ **Increased Throughput**

@@ -85,7 +84,7 @@

## Sample Reference Implementation

In this tutorial, we provide a reference implementation for a Semantic Cache in
[semantic_caching.py](./artifacts/semantic_caching.py). There are three key
dependencies:
* [SentenceTransformer](https://sbert.net/): a Python framework for computing
@@ -99,15 +98,15 @@
developed by Facebook AI Research for efficient similarity search and
clustering of dense vectors.
- This library is used for the embedding store and extracting the most
similar embedded prompt from the cached requests (or from the index store).
- This is a powerful library offering a wide variety of CPU- and GPU-accelerated
algorithms.
- Alternatives include [annoy](https://github.com/spotify/annoy) or
[cuVS](https://github.com/rapidsai/cuvs). However, note that cuVS is already
integrated in Faiss; more on this can be found [here](https://docs.rapids.ai/api/cuvs/nightly/integrations/faiss/).
* [Theine](https://github.com/Yiling-J/theine): a high-performance in-memory
cache.
- We will use it as our exact match cache backend. After the most similar
prompt is identified, the corresponding cached response is extracted from
the cache. This library supports multiple eviction policies; in this
tutorial we use "LRU".
- One may also look into [MemCached](https://memcached.org/about) as a
@@ -124,12 +123,12 @@
as our example, focusing on demonstrating how to cache responses for the
non-streaming case. The principles covered here can be extended to handle
streaming scenarios as well.

### Customising vLLM Backend

Let's start by cloning Triton's vLLM backend repository. This will
provide the necessary codebase to implement our semantic caching example.

```bash
git clone https://github.com/triton-inference-server/vllm_backend.git
```

@@ -143,7 +142,7 @@
```bash
wget -P vllm_backend/src/utils/ https://raw.githubusercontent.com/triton-inferen
```

Now that we have added the semantic caching script, let's proceed by making
some adjustments in `vllm_backend/src/model.py`. These changes will integrate
the semantic caching functionality into the model.

First, ensure that you import the necessary classes from `semantic_caching.py`:
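
The concrete import (and the related initialization in `model.py`) sits in the
collapsed portion of the diff below. Purely as a hedged illustration, assuming
`semantic_caching.py` exposes a CPU cache class together with a matching config
object (these names are an assumption, not confirmed by the visible diff), the
step might look roughly like:

```python
# Hypothetical sketch -- consult semantic_caching.py for the actual class names.
from utils.semantic_caching import SemanticCPUCache, SemanticCPUCacheConfig  # assumed names

# A single cache instance would then typically be created once, e.g. in
# TritonPythonModel.initialize():
# self.semantic_cache = SemanticCPUCache(config=SemanticCPUCacheConfig())
```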
@@ -234,11 +233,12 @@
but make sure to specify proper paths to the cloned `vllm_backend`
repository and replace `<xx.yy>` with the latest release of Triton.

```bash
docker run --gpus all -it --net=host --rm \
--shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 \
-v /path/to/vllm_backend/src/:/opt/tritonserver/backends/vllm \
-v /path/to/vllm_backend/samples/model_repository:/workspace/model_repository \
-w /workspace \
nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3
```

When inside the container, make sure to install required dependencies:
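
The exact command is in the collapsed part of the diff; based on the three
dependencies listed earlier (SentenceTransformer, Faiss, and Theine), it is
presumably something along these lines (the package names here are an
assumption, not taken from the tutorial):

```bash
# Assumed package names -- defer to the tutorial's actual install step.
pip install sentence-transformers faiss-cpu theine
```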
