
Is it possible to stream LLM responses without waiting for a full response? #224

Open
witold-gren opened this issue Oct 14, 2024 · 2 comments
Labels: enhancement (New feature or request)

Comments

@witold-gren (Contributor)

Please describe what you are trying to do with the component
Is it possible to stream the response from the Ollama model? At the moment the integration returns the entire answer at once: https://github.com/acon96/home-llm/blob/develop/custom_components/llama_conversation/conversation.py#L1513.

I checked what the response from the Ollama API looks like, and it streams the answer token by token:

{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.28350888Z","response":"J","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.382914643Z","response":"ako","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.453863468Z","response":" s","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.523662503Z","response":"zt","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.602522129Z","response":"uc","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.6787896Z","response":"z","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.751536982Z","response":"na","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.82446238Z","response":" intel","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.896017956Z","response":"ig","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:41.967416444Z","response":"enc","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.038923347Z","response":"ja","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.110432138Z","response":",","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.181888705Z","response":" mo","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.253508119Z","response":"im","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.324975342Z","response":" z","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.396822724Z","response":"ad","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.46840745Z","response":"an","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.540148976Z","response":"iem","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.61190247Z","response":" jest","done":false}
{"model":"SpeakLeash/bielik-11b-v2.3-instruct:Q6_K","created_at":"2024-10-14T16:00:42.68361878Z","response":" pom","done":false}

Describe the solution you'd like
The integration would return the answer sentence by sentence. In the conversation window we would then see, for example, 4-5 one-sentence responses instead of a single full response. This behaviour could be configurable, so it can be enabled or disabled during model configuration.
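
To make the idea concrete, here is a rough sketch of how streamed tokens could be buffered and flushed one sentence at a time. The sentence-boundary regex and the `speak` callback are illustrative assumptions, not part of the existing integration.

```python
import re

# Naive sentence boundary: ., ! or ? followed by whitespace (illustrative only).
SENTENCE_END = re.compile(r"[.!?]\s+")

def stream_sentences(tokens, speak):
    """Accumulate streamed tokens and pass each complete sentence to `speak`."""
    buffer = ""
    for token in tokens:
        buffer += token
        match = SENTENCE_END.search(buffer)
        while match:
            # Flush everything up to and including the sentence terminator.
            speak(buffer[:match.end()].strip())
            buffer = buffer[match.end():]
            match = SENTENCE_END.search(buffer)
    if buffer.strip():
        # Flush whatever remains once the stream ends.
        speak(buffer.strip())
```

Each flushed sentence could then be handed to the TTS pipeline while the model is still generating the rest of the answer.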

Additional context
Currently I am using an Nvidia Jetson AGX Orin 64GB and run an 11B LLM on it (this device consumes very little electricity). However, generating a full response takes approximately 15-20 seconds. Would it be possible to reply with sentences as the Ollama server generates them? That way each sentence could be spoken by TTS (Piper) as soon as it is ready, and the time spent generating the rest of the text would not be noticeable.

witold-gren added the enhancement (New feature or request) label on Oct 14, 2024

@acon96 (Owner) commented Oct 16, 2024

As far as I know, Home Assistant doesn't currently support this. As soon as Home Assistant adds support for it via the chat UI or the STT agent, I would consider implementing it.

@witold-gren (Contributor, Author)

Thanks for your reply. I asked a similar question on the Home Assistant Discord, but I haven't received any answer yet, so I guess it can't be done.
