
Prompt caching #5603
Draft

enyst wants to merge 13 commits into main from enyst/prompt-caching

Conversation

enyst (Collaborator) commented Dec 14, 2024

End-user friendly description of the problem this fixes or functionality that this introduces

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

The example message is not used with Anthropic models, which are the only models that support prompt caching. Therefore, setting cache_prompt on the example message was unnecessary.

This change removes the cache_prompt parameter from the example message, making the caching points clearer and more accurate:

  • One cache point for the system message (at the start)
  • Three more cache points from the reversed-messages loop, for user/tool messages

This gives us exactly 4 cache points in total, which is what we want for Anthropic models.
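
For illustration only, here is a minimal sketch of how those four cache points could be marked when sending messages to an Anthropic model. The cache_control marker follows Anthropic's prompt-caching convention, but the dict-based message shape and the helper itself are assumptions, not the actual llm.py code.

```python
# Illustrative sketch only: mark up to 4 cache breakpoints (1 for the system
# message, 3 for the most recent user/tool messages). The message shape and
# this helper are assumptions, not OpenHands' actual serialization code.

def apply_cache_breakpoints(system_message: dict, messages: list[dict]) -> None:
    # Cache point 1: the system message at the start of the prompt.
    system_message['cache_control'] = {'type': 'ephemeral'}

    # Cache points 2-4: walk the conversation in reverse and mark the three
    # most recent user/tool messages, so the stable prefix stays cached.
    remaining = 3
    for message in reversed(messages):
        if remaining == 0:
            break
        if message.get('role') in ('user', 'tool'):
            message['cache_control'] = {'type': 'ephemeral'}
            remaining -= 1
```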


Link of any specific issues this addresses


To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:5f3b761-nikolaik \
  --name openhands-app-5f3b761 \
  docker.all-hands.dev/all-hands-ai/openhands:5f3b761

@enyst enyst changed the title from "Remove unnecessary prompt caching from example message" to "Prompt caching" Dec 14, 2024
@enyst enyst marked this pull request as draft December 14, 2024 20:11
enyst (Collaborator, Author) commented Dec 14, 2024

@openhands-agent We want a little more data about prompt caching here. Take a look at llm.py and its computation of stats like prompt tokens, input tokens, cache hits or writes etc. Understand how it works.

As you can see, sometimes we have cache writes or hits and sometimes we don't. When we do have data about them, change the var names and the display format to:

Input tokens: {total_input_tokens}
Cache hits: {cache_hits} - cache writes: {cache_writes} - input: {input_tokens}
Output tokens: {output_tokens}

where input_tokens is the total input tokens minus (cache writes + cache reads)

Important: pay attention to when they're available
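
For reference, a minimal sketch of the requested display logic, assuming litellm/Anthropic-style usage fields (prompt_tokens, completion_tokens, cache_read_input_tokens, cache_creation_input_tokens); the field names and availability checks here are assumptions rather than the final llm.py implementation.

```python
# Sketch of the requested stats format. The usage field names follow common
# litellm/Anthropic conventions but are assumptions here, not the final code.

def format_token_stats(usage) -> str:
    total_input_tokens = getattr(usage, 'prompt_tokens', 0) or 0
    output_tokens = getattr(usage, 'completion_tokens', 0) or 0

    # Cache stats are only reported when prompt caching is active for the model.
    cache_hits = getattr(usage, 'cache_read_input_tokens', None)
    cache_writes = getattr(usage, 'cache_creation_input_tokens', None)

    lines = [f'Input tokens: {total_input_tokens}']
    if cache_hits is not None and cache_writes is not None:
        # Non-cached input is what remains after cache reads and writes.
        input_tokens = total_input_tokens - (cache_hits + cache_writes)
        lines.append(
            f'Cache hits: {cache_hits} - cache writes: {cache_writes} - input: {input_tokens}'
        )
    lines.append(f'Output tokens: {output_tokens}')
    return '\n'.join(lines)
```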

openhands-agent (Contributor) commented

OpenHands started fixing the pr! You can monitor the progress here.

openhands-agent (Contributor) commented

OVERVIEW:

✓ RESOLVED:

  • Token statistics display format implemented correctly
  • Input token calculation logic fixed
  • Cache statistics handling improved
  • Variable naming enhanced for clarity

CONCLUSION:
All identified issues appear to have been successfully resolved with no remaining concerns. The changes fully address the original feedback while maintaining proper functionality and edge case handling.

@All-Hands-AI All-Hands-AI deleted a comment from openhands-agent Dec 15, 2024
@enyst enyst added the lint-fix label Dec 15, 2024
enyst (Collaborator, Author) commented Dec 15, 2024

@openhands-agent This PR is excellent. It does the right things. Now, read this PR's diff, and then read the completion cost method in llm.py. That method computes the cost using litellm's function and its internal per-token prices. It also uses, if available, the configuration variable input_cost_per_token and the similar one for output (we allow the user to specify those if necessary).

But we have a problem: if the tokens were cached, that is, if prompt caching was enabled and working, it is possible that the per-token prices differ for cache-hit tokens vs uncached (regular input) tokens, and even for cache writes!

Propose a solution for that. Do try to look up and understand litellm's cost per token, so that we use it correctly. Keep your changes minimal.
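
To make the pricing concern concrete, here is an illustrative breakdown of a cache-aware cost computation. The base per-token prices would come from litellm's pricing table or from the user-configured overrides; the cache multipliers below are taken from Anthropic's published pricing (roughly 1.25x for cache writes, 0.1x for cache reads) and are example values only, not something read from litellm.

```python
# Illustrative only: how the cost splits once cache reads/writes are billed at
# different rates than regular input tokens. Base prices would come from litellm
# or from config; the multipliers are example values based on Anthropic's pricing.

CACHE_WRITE_MULTIPLIER = 1.25  # cache writes cost ~25% more than base input tokens
CACHE_HIT_MULTIPLIER = 0.10    # cache hits cost ~90% less than base input tokens

def cache_aware_cost(
    input_cost_per_token: float,
    output_cost_per_token: float,
    uncached_input_tokens: int,
    cache_write_tokens: int,
    cache_hit_tokens: int,
    output_tokens: int,
) -> float:
    input_cost = (
        uncached_input_tokens * input_cost_per_token
        + cache_write_tokens * input_cost_per_token * CACHE_WRITE_MULTIPLIER
        + cache_hit_tokens * input_cost_per_token * CACHE_HIT_MULTIPLIER
    )
    return input_cost + output_tokens * output_cost_per_token
```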

openhands-agent (Contributor) commented

OpenHands started fixing the pr! You can monitor the progress here.

openhands-agent (Contributor) commented

Concise Overview:

Status: PARTIALLY RESOLVED

Remaining Issues:

  1. Solution lacks integration with litellm's actual pricing structure
  2. Pricing assumptions (80% discount, 20% premium, 70/30 split) are arbitrary rather than based on litellm's documentation
  3. Need for proper research of litellm's token cost calculations

Next Steps Required:

  • Research and implement actual litellm pricing structure
  • Create configurable pricing mechanism based on litellm's system
  • Remove arbitrary pricing assumptions

Overall Assessment: The proposed changes attempted to address the caching pricing structure but failed to properly integrate with litellm's existing system. Further work is needed to fully resolve the issues.

enyst (Collaborator, Author) commented Dec 15, 2024

@openhands-agent The current solution to the last comment here, in PR 5603, has two problems:

  • it removes the use of litellm's CostPerToken extra kwarg to its cost function. I think we first need to look carefully into what the CostPerToken implementation does, and see whether we can use it further to solve our problem of computing costs with prompt caching, rather than removing it outright. Investigate this. Explain your conclusions in plain English.
  • it estimates discounts. This is wrong. But what should we do instead?

Let's think about the latter: the discount is a real thing for any provider that offers prompt caching; they all have a discount for cache hits. So I think we should do two things:

  1. try to find if litellm already has some other var for it that we can use; and if not,
  2. add it as a configurable option to config just like the cost per token etc are.

On the other hand, the surcharge for cache-write tokens is, afaik, an Anthropic-only thing. Read the llm.py code to see other Anthropic-specific handling. Let's try computing the cost of cache writes only if the provider/model has "anthropic" in the name...
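
A rough sketch of the direction proposed here: keep litellm's base prices (or the user-supplied CostPerToken overrides) as the starting point, make the cache-hit discount configurable, and apply the cache-write surcharge only for Anthropic models. The cache_hit_discount / cache_write_premium names match the settings mentioned later in the thread; their semantics, defaults, and the function shape are assumptions, not the PR's actual diff.

```python
# Sketch of the proposal, not the PR's actual implementation. base_cost is the
# cost litellm computed with every input token billed at the base input price;
# we then adjust for cache hits (any provider) and cache writes (Anthropic only).

def adjust_cost_for_caching(
    base_cost: float,
    input_cost_per_token: float,       # base input price used for this call
    cache_hit_tokens: int,
    cache_write_tokens: int,
    model: str,
    cache_hit_discount: float = 0.9,   # fraction of base price saved on cache hits (assumed semantics)
    cache_write_premium: float = 0.25, # extra fraction of base price paid on cache writes (assumed semantics)
) -> float:
    # Cache hits are cheaper than regular input tokens: subtract the discount.
    cost = base_cost - cache_hit_tokens * input_cost_per_token * cache_hit_discount

    # The cache-write surcharge is Anthropic-specific, as discussed above.
    if 'anthropic' in model.lower():
        cost += cache_write_tokens * input_cost_per_token * cache_write_premium
    return cost
```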

@All-Hands-AI All-Hands-AI deleted a comment from openhands-agent Dec 15, 2024
openhands-agent (Contributor) commented

OpenHands started fixing the pr! You can monitor the progress here.

openhands-agent (Contributor) commented

Here's a concise overview of the changes and remaining issues:

Status: Partially Resolved - Significant Issues Remaining

Completed:

  • Added configuration options for discounts and cache writes

Outstanding Issues:

  1. Missing investigation and explanation of litellm's CostPerToken implementation
  2. No verification if litellm has existing variables for cache hit discounts
  3. Incomplete review of Anthropic-specific handling in llm.py

Next Steps Required:

  1. Conduct and document investigation of litellm's implementation
  2. Research existing litellm variables before adding new ones
  3. Review and incorporate findings from llm.py regarding Anthropic handling
  4. Implement solutions based on these investigations

Overall Assessment:
The changes made so far appear to be premature implementations without the requested preliminary research and analysis. A more thorough investigation phase is needed before proceeding with code changes.

@All-Hands-AI All-Hands-AI deleted a comment from openhands-agent Dec 15, 2024
@enyst enyst added lint-fix and removed lint-fix labels Dec 15, 2024
enyst (Collaborator, Author) commented Dec 15, 2024

@openhands-agent One more thing in this PR, 5603. Please read the diff, and add the new config settings to the config.template.toml file, where we try to document all the configuration options a bit.

openhands-agent (Contributor) commented

OpenHands started fixing the pr! You can monitor the progress here.

openhands-agent (Contributor) commented

OVERVIEW:
✓ All issues appear to be successfully resolved. The changes implemented:

  • Added new config settings (cache_hit_discount and cache_write_premium)
  • Included proper documentation for each setting
  • Placed settings appropriately in the file
  • Maintained consistent documentation style

No remaining issues are apparent from the provided information.

config.template.toml: 2 resolved review comments (outdated)
@All-Hands-AI All-Hands-AI deleted a comment from openhands-agent Dec 15, 2024
@enyst enyst force-pushed the enyst/prompt-caching branch from a286da4 to 5f3b761 on December 15, 2024 18:04