
Is it possible to stream LLM responses without waiting for a full response? #224

Open
witold-gren opened this issue Oct 14, 2024 · 2 comments
Labels: enhancement (New feature or request)

Comments

@witold-gren (Contributor)

Please describe what you are trying to do with the component
Is it possible to stream the response from the Ollama model? At the moment the integration returns the entire answer at once: https://github.com/acon96/home-llm/blob/develop/custom_components/llama_conversation/conversation.py#L1513.

I checked what the response from the Ollama API looks like, and it streams the answer token by token:

{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.28350888Z","response":"J","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.382914643Z","response":"ako","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.453863468Z","response":" s","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.523662503Z","response":"zt","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.602522129Z","response":"uc","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.6787896Z","response":"z","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.751536982Z","response":"na","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.82446238Z","response":" intel","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.896017956Z","response":"ig","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.967416444Z","response":"enc","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.038923347Z","response":"ja","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.110432138Z","response":",","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.181888705Z","response":" mo","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.253508119Z","response":"im","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.324975342Z","response":" z","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.396822724Z","response":"ad","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.46840745Z","response":"an","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.540148976Z","response":"iem","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.61190247Z","response":" jest","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.68361878Z","response":" pom","done":false}

Describe the solution you'd like
The integration would return the answer sentence by sentence. In the conversation window we would then see, for example, 4-5 one-sentence responses instead of a single full response. This behaviour could be configurable, so it can be enabled or disabled during model configuration.
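
To make the idea concrete, here is a rough sketch of how streamed tokens could be buffered and flushed one sentence at a time. The sentence-boundary regex and the `speak` callback are illustrative assumptions, not part of the existing integration.

```python
import re

# Naive sentence boundary: ., ! or ? followed by whitespace (illustrative only).
SENTENCE_END = re.compile(r"[.!?]\s+")

def stream_sentences(tokens, speak):
    """Accumulate streamed tokens and pass each complete sentence to `speak`."""
    buffer = ""
    for token in tokens:
        buffer += token
        match = SENTENCE_END.search(buffer)
        while match:
            # Flush everything up to and including the sentence terminator.
            speak(buffer[:match.end()].strip())
            buffer = buffer[match.end():]
            match = SENTENCE_END.search(buffer)
    if buffer.strip():
        # Flush whatever remains once the stream ends.
        speak(buffer.strip())
```

Each flushed sentence could then be handed to the TTS pipeline while the model is still generating the rest of the answer.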

Additional context
Currently I am using an Nvidia Jetson AGX Orin 64GB and run an 11B LLM on it (this device consumes very little electricity). However, generating a full response takes approximately 15-20 seconds. Would it be possible to reply with sentences as the Ollama server generates them? That way each sentence could be spoken by TTS (Piper) as soon as it is ready, and the time spent generating the rest of the text would not be noticeable.

witold-gren added the enhancement (New feature or request) label on Oct 14, 2024

@acon96 (Owner) commented Oct 16, 2024

As far as I know, Home Assistant doesn't currently support this. As soon as Home Assistant adds support for it via the chat UI or the STT agent, I would consider implementing it.

@witold-gren (Contributor, Author)

Thanks for your reply. I asked a similar question on the Home Assistant Discord, but I haven't received any answer yet, so I guess it can't be done.
