FR: Add Diagnostic Entities on Scanners #363

agittins · 2024-11-08T10:16:12Z

From a message from Lash-L (today at 2:13 PM):

Thoughts on adding a device per each scanner and then adding some entities to it?
I see on a lot of issues you do a lot of debugging diagnostics. What if some of those attributes became easier for users to see and parse?
i.e. "Average update interval"

I'm dead keen on this, but haven't worked out how/what exactly yet, so let's try now.... DRAFT FOR DISCUSSION

These would all probably be on a 1 minute or longer update cycle.

Proxy Avg Update Interval (or peak avg update interval?)
- answers: What's the best this proxy can do at catching every ad from a device?
- interpretation: 0-1 Excellent, 1-2 Great, 3 Moderate
  - search all (recent) hist_intervals for the given proxy, and find the one with the lowest mean().
  - report the mean/min/max of that interval set.
  - this works by identifying the device that makes this scanner look the best it can, avoiding benchmarking against devices with long intervals or lossy signal paths.
  - this is "aliased" by Bermuda's ~1-second update rate, so a proxy can't do better than that.
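
A minimal sketch of the search described above, assuming each device's recent intervals for one proxy are available as a plain list of seconds. The `hist_intervals_by_device` mapping and the function name are hypothetical, not Bermuda's actual data structures:

```python
from statistics import mean


def best_interval_stats(hist_intervals_by_device):
    """Return (mean, min, max) of the interval set with the lowest mean.

    hist_intervals_by_device: hypothetical mapping of device address ->
    list of recent advert intervals (seconds) as seen by one proxy.
    Returns None if no device has any recorded intervals.
    """
    # Ignore devices with no recorded intervals so mean() never sees an empty list.
    candidates = [iv for iv in hist_intervals_by_device.values() if iv]
    if not candidates:
        return None
    # The device that makes this proxy look its best is the one with the lowest mean.
    best = min(candidates, key=mean)
    return mean(best), min(best), max(best)
```

Picking the single best device, rather than averaging across all devices, keeps slow or lossy devices from dragging down the proxy's score.
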
Proxy Reporting Stats
- How well does the proxy forward ads to HA? Do we always have fresh data from the proxy every second?
- Replace the stale_updates count with a hist_interval_updates list of [fresh, stale] pairs. On each update cycle we increment fresh or stale, depending on whether the proxy has given us any new data. If the update is fresh and the current stale count is not zero, we first insert a new pair into the list. The list then contains pairs of contiguous fresh/stale update counts.
- Entities:
  - Proxy Update Loss (%) = 1 - (sum(fresh) / sum(fresh+stale))
    - how often this proxy fails to provide fresh data. ESPHome should be 0%. Shelly should be 33% or 0%, depending on whether its rate-limiting is synchronous across all devices or not.
  - Proxy Avg Outage Duration (s) = mean(stale)
  - Proxy Avg Outage Frequency (Hz) = mean(fresh) / sum(fresh+stale)
    - multiply by seconds in a day for outages/day
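
The fresh/stale bookkeeping above could be sketched roughly like this. The class and method names are made up for illustration, and only the first two entities (update loss and average outage duration) are covered:

```python
from statistics import mean


class UpdateRuns:
    """Tracks contiguous runs of fresh/stale update cycles as [fresh, stale] pairs.

    Hypothetical helper, not part of Bermuda.
    """

    def __init__(self):
        self.runs = [[0, 0]]  # newest pair is last

    def record(self, is_fresh):
        """Call once per update cycle."""
        current = self.runs[-1]
        if is_fresh:
            if current[1] != 0:
                # A fresh update after an outage starts a new pair.
                current = [0, 0]
                self.runs.append(current)
            current[0] += 1
        else:
            current[1] += 1

    def update_loss(self):
        """Fraction of cycles without fresh data: 1 - sum(fresh) / sum(fresh + stale)."""
        fresh = sum(f for f, _ in self.runs)
        total = sum(f + s for f, s in self.runs)
        return 1 - fresh / total if total else 0.0

    def avg_outage_duration(self):
        """Mean length of the stale runs, in update cycles (roughly seconds)."""
        # Assumption: a pair with no stale cycles yet is not counted as an outage.
        outages = [s for _, s in self.runs if s]
        return mean(outages) if outages else 0.0
```

A proxy that misses every third cycle (fresh, fresh, stale, repeated) ends up with pairs of `[2, 1]`, giving a loss of one third, which lines up with the 33% Shelly case. Multiply `update_loss()` by 100 for a percentage entity.
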

I think we could do something very similar to the proxy stats for devices. It's a little trickier in that "outages" are legitimate for devices, because sometimes they leave home, while proxies aren't expected to. But by trimming the lists so that sum(fresh)+sum(stale) stays below a certain time limit, we get a good "recent stats as of now" measurement, and HA's history of that entity shows its variation over time - so you'd see your phone performing well, but then doing "poorly" for a few hours because you were out at work, etc.

So for devices, we check if any proxy has a fresh update for us and update fresh/stale accordingly.
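
As a rough sketch of the trimming and the per-device freshness check, assuming the same list of [fresh, stale] pairs with the newest pair last (both function names and the `max_cycles` parameter are hypothetical):

```python
def trim_runs(runs, max_cycles):
    """Drop the oldest [fresh, stale] pairs until the total fits in max_cycles.

    The newest pair is always kept, even if it alone exceeds the limit,
    so a single very long run is not thrown away.
    """
    trimmed = list(runs)
    while len(trimmed) > 1 and sum(f + s for f, s in trimmed) > max_cycles:
        trimmed.pop(0)  # oldest pair is first
    return trimmed


def device_is_fresh(proxy_has_fresh_data):
    """A device counts as fresh this cycle if any proxy reported new data for it.

    proxy_has_fresh_data: iterable of booleans, one per proxy.
    """
    return any(proxy_has_fresh_data)
```

With roughly one update cycle per second, `max_cycles` doubles as the length of the "recent" window in seconds.
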

Something to keep in mind is that proxy entries in the devices{} dict will also be metadevices in future.


agittins · 2024-11-08T12:46:16Z

It would also be great if the diagnostics could include an n-most-recent list of each of these sensors too.

Either by storing them into something in Bermuda or by having the diagnostics do a query against recorder for the data.
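
For the "store them in Bermuda" option, one possible shape is a small bounded buffer per sensor. This is only a sketch; the class name and default size are assumptions:

```python
from collections import deque


class RecentValues:
    """Keeps the n most recent values of one diagnostic sensor (hypothetical helper)."""

    def __init__(self, maxlen=10):
        # A deque with maxlen silently discards the oldest item when full.
        self._values = deque(maxlen=maxlen)

    def add(self, value):
        self._values.append(value)

    def as_list(self):
        """Return values oldest-first as a plain list, e.g. for a diagnostics dump."""
        return list(self._values)
```

This keeps memory use fixed and avoids a recorder query at diagnostics time, at the cost of losing the history on restart.
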

The goal here is that someone can work out their own issues between the exposed entities and the docs, or they can share screenshots, or they can upload a diagnostics file.

At some point I'd like to write a bot/workflow for issues that will parse out useful information when a diagnostics file is included in the ticket, which should make triage quicker. (Raised as #365.)

Lash-L · 2024-11-08T13:53:51Z

What do you think about a binary sensor to go along with this data? Either one to encompass all of it or one for each entity.

Basically healthy or unhealthy.

That way the user can immediately see - Oh something is wrong with this proxy, let me dive in and figure out what.

Maybe also a FAQ for each entity with troubleshooting steps, e.g. if Proxy Avg Update Interval > 3 then try:

1., 2., 3., etc.

agittins · 2024-11-09T01:56:44Z

Yeah, great idea! I was thinking about using the "Repairs" feature, which allows using URIs to provide solutions. I don't know how annoying that might get for this sort of thing, but we could have a Button entity to "Check for Repairs" so that it only created Repairs when the user asks for help, perhaps prompted by the Health indicator going off.

But yes a simple binary that makes an easy automation target might be a great idea.
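
A sketch of what the healthy/unhealthy decision might look like. The thresholds are placeholders: the 3-second interval limit is borrowed from the FAQ example in this thread, and the loss limit is an arbitrary assumption, as are the function names:

```python
def proxy_problems(avg_update_interval, update_loss, max_interval=3.0, max_loss=0.5):
    """Return the names of the metrics that are out of range (empty list if none).

    avg_update_interval: seconds; update_loss: fraction between 0 and 1.
    Both thresholds are illustrative defaults, not agreed values.
    """
    problems = []
    if avg_update_interval > max_interval:
        problems.append("avg_update_interval")
    if update_loss > max_loss:
        problems.append("update_loss")
    return problems


def proxy_is_healthy(avg_update_interval, update_loss):
    """Single on/off value that could back a binary sensor."""
    return not proxy_problems(avg_update_interval, update_loss)
```

Returning the list of failing metrics, rather than just a boolean, would let a FAQ link or Repair point at the specific entity that tripped.
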

Lash-L · 2024-11-13T18:00:36Z

So this seems a bit more difficult than I originally thought.

Since the Bermuda device scanner can exist on all of the Bermuda devices, I think we would have to loop over coordinator.scanners.

Then, for each device, check whether that scanner is on the device, and do whatever math we want.

Or potentially the coordinator could hold the data we want?
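
To make the nested loop concrete, here is a rough sketch using stand-in types. The `Device` class, its `intervals_by_scanner` field, and the function name are all hypothetical and do not reflect Bermuda's real classes:

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    address: str
    # Hypothetical: scanner address -> recent advert intervals (seconds).
    intervals_by_scanner: dict = field(default_factory=dict)


def intervals_per_scanner(scanner_addresses, devices):
    """Collect, for each scanner, the interval lists of every device it has seen."""
    result = {addr: [] for addr in scanner_addresses}
    for scanner in scanner_addresses:
        for device in devices:
            intervals = device.intervals_by_scanner.get(scanner)
            if intervals:  # skip devices this scanner has not seen
                result[scanner].append(intervals)
    return result
```

The per-scanner result could then be cached on the coordinator so that each diagnostic entity reads it instead of repeating the loop.
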

What are your thoughts, @agittins?

agittins · 2024-11-13T20:17:22Z

Yes, my bad - I glazed over in my brain when it comes to the specifics of the scanner/device interdependence.

I'm heading to bed but I'll try and formulate some thoughts later.

agittins assigned agittins and unassigned agittins Nov 8, 2024

agittins mentioned this issue Nov 8, 2024

Create a github workflow/bot for parsing diagnostics in issues #365

Open

agittins assigned Lash-L Nov 11, 2024
