Evaluating Home LLM and collaboration with core developers #179
-
Hi, that sounds interesting. I currently don't have much time to commit to this project and am mostly just making fixes to the integration as it breaks. If all you're looking for are some local models to benchmark, then all of the models that I've fine-tuned are on HuggingFace here: https://huggingface.co/acon96. If you need them in a different format (e.g. safetensors), just let me know and I can upload them when I get a bit of time.
I have my own eval script in this repo that at least tries to measure how accurate the model is, but it does use a test set similar to the training set (no direct examples, but device names and requests are re-used). I would definitely like to see results from completely held-out examples, since I couldn't find any other datasets online when I built my own dataset.
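(Not part of the original exchange, but for anyone following along, here is a minimal sketch of what pulling one of these models for a quick local smoke test could look like, assuming the `transformers` library. The repo id and the bare prompt are assumptions; a proper benchmark would use the integration's own system prompt with a device list and the model's expected output format.)

```python
# Minimal sketch: load one of the fine-tuned models from the acon96
# HuggingFace namespace and generate a reply to a single request.
# The repo id "acon96/Home-3B-v3" is an assumption; check the namespace
# for the exact model names and formats (GGUF vs. safetensors).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "acon96/Home-3B-v3"  # assumed repo name; substitute the model under test

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# A held-out style request: device names here are not drawn from the training set.
prompt = "Turn off the kitchen light and set the thermostat to 20 degrees."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```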
-
Hi there,
I have been working with a group of other Home Assistant core developers on the LLM APIs and on the evaluations in https://github.com/allenporter/home-assistant-datasets, which aim to set a quality baseline.
Are you interested in collaborating on a quality evaluation for Home LLM? Generally, I am interested in establishing a quality baseline that shows how your fine-tuning or other fixes improve quality over base local LLMs, and in helping the community run their own local LLMs.
If you are interested and/or want to join the Home Assistant LLM group on Discord to discuss, let me know!