Prompt caching #5603
base: main
Conversation
@openhands-agent We want a little more data about prompt caching here. Take a look at llm.py and its computation of stats such as prompt tokens, input tokens, and cache hits or writes, and understand how it works. As you can see, sometimes we have cache writes or hits and sometimes we don't. When we do have data about them, change the variable names and the display format to `Input tokens: {total_input_tokens}`, where the input tokens shown are the total input tokens minus (cache writes + cache reads). Important: pay attention to when they're available.
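For illustration, here is a minimal sketch of the bookkeeping this comment asks for, assuming hypothetical function and field names (they are not the names used in llm.py):

```python
def format_token_stats(
    total_input_tokens: int,
    cache_write_tokens: int | None,
    cache_read_tokens: int | None,
    output_tokens: int,
) -> str:
    """Report uncached input tokens separately only when cache data is available."""
    if cache_write_tokens is None and cache_read_tokens is None:
        # No cache metrics for this model/provider: show the raw total.
        return f'Input tokens: {total_input_tokens} | Output tokens: {output_tokens}'
    writes = cache_write_tokens or 0
    reads = cache_read_tokens or 0
    # 'Input tokens' becomes the uncached remainder: total - (writes + reads).
    uncached_input = total_input_tokens - (writes + reads)
    return (
        f'Input tokens: {uncached_input} | '
        f'Cache writes: {writes} | Cache hits: {reads} | '
        f'Output tokens: {output_tokens}'
    )
```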
OVERVIEW: ✓ RESOLVED:
CONCLUSION:
@openhands-agent This PR is excellent; it does the right things. Now, read this PR's diff, and then read the completion cost method in llm.py. That method computes the cost using litellm's function and its internal per-token prices. It also uses, if available, the configuration variable input_cost_per_token and the similar one for output (we allow the user to specify those if necessary). But we have a problem: if tokens were cached, that is, if prompt caching was enabled and working, the price per token may differ for cache-hit tokens versus uncached (regular input) tokens, and even for cache writes. Propose a solution for that. Do try to look up and understand litellm's cost per token, so that we use it correctly. Keep your changes minimal.
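As a rough, provider-agnostic illustration of the pricing concern in this comment (the parameter names are assumptions, this is not litellm's actual API, and whether litellm's own cost helpers account for cached tokens depends on the version):

```python
def completion_cost_with_caching(
    uncached_input_tokens: int,
    cache_read_tokens: int,
    cache_write_tokens: int,
    output_tokens: int,
    input_cost_per_token: float,
    output_cost_per_token: float,
    cache_read_cost_per_token: float | None = None,
    cache_write_cost_per_token: float | None = None,
) -> float:
    """Price each token class separately, falling back to the regular input price."""
    read_price = (cache_read_cost_per_token
                  if cache_read_cost_per_token is not None
                  else input_cost_per_token)
    write_price = (cache_write_cost_per_token
                   if cache_write_cost_per_token is not None
                   else input_cost_per_token)
    return (
        uncached_input_tokens * input_cost_per_token
        + cache_read_tokens * read_price
        + cache_write_tokens * write_price
        + output_tokens * output_cost_per_token
    )
```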
Concise Overview:
Status: PARTIALLY RESOLVED
Remaining Issues:
Next Steps Required:
Overall Assessment: The proposed changes attempted to address the caching pricing structure but failed to properly integrate with litellm's existing system. Further work is needed to fully resolve the issues.
@openhands-agent The current solution to the last comment here, in PR 5603, has two problems:
Let's think about the latter: the discount is a real thing for any provider that offers prompt caching, since they all charge less for cache hits, so I think we should do two things:
On the other hand, the increment for cache-write tokens is, as far as I know, an Anthropic-only thing. Read the llm.py code to see other Anthropic-specific handling. Let's try computing the cost of cache writes only if the provider/model has "anthropic" in the name...
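One possible shape for that provider check is sketched below. The 0.1x cache-hit and 1.25x cache-write multipliers mirror Anthropic's published pricing but are placeholders here, and the function name is hypothetical:

```python
def cached_input_cost(
    model: str,
    uncached_input_tokens: int,
    cache_read_tokens: int,
    cache_write_tokens: int,
    input_cost_per_token: float,
) -> float:
    """Discount cache hits for every provider; surcharge cache writes only for Anthropic."""
    cost = uncached_input_tokens * input_cost_per_token
    # Every provider that offers prompt caching discounts cache hits
    # (0.1x is Anthropic's rate, used here only as a placeholder).
    cost += cache_read_tokens * input_cost_per_token * 0.1
    if 'anthropic' in model.lower() or 'claude' in model.lower():
        # The cache-write surcharge is Anthropic-specific (1.25x per their pricing).
        cost += cache_write_tokens * input_cost_per_token * 1.25
    else:
        cost += cache_write_tokens * input_cost_per_token
    return cost
```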
Here's a concise overview of the changes and remaining issues:
Status: Partially Resolved - Significant Issues Remaining
Completed:
Outstanding Issues:
Next Steps Required:
Overall Assessment:
@openhands-agent One more thing in this PR, 5603. Please read the diff, and then add the new config settings to the config.template.toml file, where we try to briefly document all configuration options.
OVERVIEW:
No remaining issues are apparent from the provided information. |
Force-pushed from a286da4 to 5f3b761
End-user friendly description of the problem this fixes or functionality that this introduces
Give a summary of what the PR does, explaining any non-trivial design decisions
The example message is not used with Anthropic models, which are the only models that support prompt caching. Therefore, setting cache_prompt on the example message was unnecessary.
This change removes the cache_prompt parameter from the example message, making the caching points clearer and more accurate:
One cache point for the system message (at start)
Three more cache points from the reversed messages loop for user/tool messages
This gives us exactly 4 cache points total, which is what we want for Anthropic models.
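For reference, a simplified sketch of that placement (not the PR's actual code; messages are plain dicts here): one cache point on the system message, then up to three more on the most recent user/tool messages, matching Anthropic's maximum of four cache breakpoints per request.

```python
def set_cache_breakpoints(messages: list[dict]) -> None:
    """Mark at most four messages for prompt caching."""
    for message in messages:
        if message['role'] == 'system':
            message['cache_prompt'] = True  # cache point 1: the system prompt
            break
    remaining = 3
    # Walk the conversation in reverse so the most recent turns are cached.
    for message in reversed(messages):
        if remaining == 0:
            break
        if message['role'] in ('user', 'tool'):
            message['cache_prompt'] = True  # cache points 2-4
            remaining -= 1
```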
Link of any specific issues this addresses
To run this PR locally, use the following command: