FR: Add Diagnostic Entities on Scanners #363

Open
5 tasks
agittins opened this issue Nov 8, 2024 · 5 comments
@agittins
Owner

agittins commented Nov 8, 2024

From message from Lash-L — Today at 2:13 PM

Thoughts on adding a device per each scanner and then adding some entities to it?
I see on a lot of issues you do a lot of debugging diagnostics. What if some of those attributes became easier for users to see and parse.
i.e. "Average update interval"

I'm dead keen on this, but haven't worked out how/what exactly yet, so let's try now.... DRAFT FOR DISCUSSION

These would all probably be on a 1 minute or longer update cycle.

  • Proxy Avg Update Interval (or peak avg update interval?)

    • answers: What's the best this proxy can do at catching every ad from a device?
    • interpretation: 0-1 Excellent, 1-2 Great, 3 Moderate
      • search all (recent) hist_intervals for the given proxy, and find the one with the lowest mean().
      • report the mean/min/max of that interval set.
      • this works by identifying the device that makes this scanner look the best it can, avoiding benchmarking against devices with long intervals or lossy signal paths.
      • this is "aliased" by Bermuda's ~1-second update rate, so a proxy can't do better than that.
  • Proxy Reporting Stats

    • How well does the proxy forward ads to HA? Do we always have fresh data from the proxy every second?
    • Replace the stale_updates count with a hist_interval_updates list of [fresh, stale] pairs. On each update cycle we increment fresh or stale, depending on whether the proxy has given us any new data. If the update is fresh and the current stale count is not zero, we first append a new pair to the list. The list then contains pairs of contiguous fresh/stale update counts.
    • Entities:
      • Proxy Update Loss (%) = 1 - (sum(fresh) / sum(fresh+stale))
        • how often this proxy fails to provide fresh data. ESPHome should be 0%. Shelly should be 33% or 0%, depending on whether its rate-limiting is synchronous across all devices or not.
      • Proxy Avg Outage Duration (s) = mean(stale)
      • Proxy Avg Outage Frequency (Hz) = mean(fresh) / sum(fresh+stale)
        • multiply by seconds in a day for outages/day
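The "Proxy Avg Update Interval" selection at the top of this list could look roughly like the sketch below. It assumes each proxy exposes its recent hist_interval values grouped per device in a plain dict; that layout and the name `best_interval_stats` are made up for illustration and are not Bermuda's actual data model.

```python
from statistics import mean


def best_interval_stats(hist_intervals_by_device):
    """Return (mean, min, max) of the interval set with the lowest mean.

    `hist_intervals_by_device` is a hypothetical mapping of device address
    to the list of recent advert intervals (seconds) seen by ONE proxy.
    Returns None if no device has any intervals yet.
    """
    # Ignore devices we have no interval history for.
    candidates = [iv for iv in hist_intervals_by_device.values() if iv]
    if not candidates:
        return None
    # The device that makes this proxy look its best.
    best = min(candidates, key=mean)
    return mean(best), min(best), max(best)


# Illustrative data only.
intervals = {
    "phone": [1.0, 1.2, 0.9, 1.1],
    "tag": [5.0, 6.5, 4.8],
    "watch": [],
}
stats = best_interval_stats(intervals)
```

Here the "phone" set wins because it has the lowest mean, so the slow "tag" doesn't drag the proxy's score down.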

I think we could do something very similar to the proxy stats for devices. It's a little trickier in that "outages" are legitimate for devices, because sometimes they leave home, while proxies aren't expected to. But by trimming the lists so that sum(fresh)+sum(stale) stays below a certain time limit, we get a good "recent stats as of now" measurement, and HA's history of that entity shows its variation over time - so you'd see your phone performing well, but then doing "poorly" for a few hours because you were out at work, etc.

So for devices, we check if any proxy has a fresh update for us and update fresh/stale accordingly.
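A rough sketch of the fresh/stale bookkeeping and the trimming described above, usable for either a proxy or a device. The class name, `max_cycles`, and the method names are placeholders, not existing Bermuda code. Counts are in update cycles, which only equal seconds if the update cycle is about 1 second. Two deliberate deviations from the draft formulas: outage duration averages only the pairs that actually have stale cycles (so an ongoing fresh run doesn't pull the mean down), and outage frequency is computed as number of outages / total cycles, since mean(fresh) / sum(fresh+stale) doesn't come out in Hz.

```python
from statistics import mean


class FreshStaleHistory:
    """Track contiguous runs of fresh/stale update cycles (hypothetical helper)."""

    def __init__(self, max_cycles=3600):
        self.max_cycles = max_cycles  # assumed "recent" window, in cycles
        self.pairs = [[0, 0]]  # oldest first, each entry is [fresh, stale]

    def record(self, is_fresh):
        """Call once per update cycle."""
        current = self.pairs[-1]
        if is_fresh:
            if current[1] > 0:
                # A stale run just ended, so start a new pair.
                current = [0, 0]
                self.pairs.append(current)
            current[0] += 1
        else:
            current[1] += 1
        self._trim()

    def _trim(self):
        # Drop the oldest pairs until we are back inside the window.
        while len(self.pairs) > 1 and self.total() > self.max_cycles:
            self.pairs.pop(0)

    def total(self):
        return sum(f + s for f, s in self.pairs)

    def update_loss(self):
        """Fraction of cycles with no fresh data (multiply by 100 for %)."""
        total = self.total()
        if total == 0:
            return None
        return 1 - sum(f for f, _ in self.pairs) / total

    def avg_outage_duration(self):
        """Mean length of the stale runs, in cycles."""
        outages = [s for _, s in self.pairs if s > 0]
        return mean(outages) if outages else 0.0

    def outage_frequency(self):
        """Outages per cycle (an assumption, see the note above)."""
        total = self.total()
        if total == 0:
            return None
        return sum(1 for _, s in self.pairs if s > 0) / total


# A proxy that misses every third cycle.
hist = FreshStaleHistory()
for is_fresh in [True, True, False] * 3:
    hist.record(is_fresh)
```

With that input the list ends up as three [2, 1] pairs, so the loss works out to one third, which matches the "33%" case mentioned for rate-limited proxies.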

Something to keep in mind is that proxy entries in the devices{} dict will also be metadevices in future.

@agittins agittins assigned agittins and unassigned agittins Nov 8, 2024
@agittins
Owner Author

agittins commented Nov 8, 2024

It would also be great if the diagnostics could include an n-most-recent list of each of these sensors too.

Either by storing them into something in Bermuda or by having the diagnostics do a query against recorder for the data.

The goal here is that someone can work out their own issues between the exposed entities and the docs, or they can share screenshots, or they can upload a diagnostics file.

At some point I'd like to write a bot/workflow for issues that will parse out useful information when a diags is included in the ticket, which should make triage quicker. (Raised #365 )

@Lash-L
Contributor

Lash-L commented Nov 8, 2024

What do you think about a binary sensor to go along with this data? Either one to encompass all of it or one for each entity.

Basically healthy or unhealthy.

That way the user can immediately see - Oh something is wrong with this proxy, let me dive in and figure out what.

Maybe also a FAQ for each entity on troubleshooting steps. i.e. If Proxy Avg Update Interval > 3 then try:

    1., 2., 3., etc.
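As a starting point for discussion, the healthy/unhealthy roll-up could be as simple as the sketch below. The interval threshold of 3 comes from the example above; the loss threshold and the function name are pure assumptions and would need tuning against real proxies.

```python
def proxy_is_healthy(avg_update_interval, update_loss,
                     max_interval=3.0, max_loss=0.5):
    """Return True/False for healthy/unhealthy, or None if there is no data yet.

    `avg_update_interval` is in seconds, `update_loss` is a 0..1 fraction.
    Both thresholds are placeholder values, not tested recommendations.
    """
    if avg_update_interval is None or update_loss is None:
        return None  # unknown, don't flag a proxy we haven't measured
    return avg_update_interval <= max_interval and update_loss <= max_loss


# Illustrative values only.
good = proxy_is_healthy(1.1, 0.0)
slow = proxy_is_healthy(4.5, 0.0)
unknown = proxy_is_healthy(None, 0.0)
```

If this fed a binary sensor with the "problem" device class, the sensor state would be the inverse of this result (on means a problem was detected).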

@agittins
Owner Author

agittins commented Nov 9, 2024

Yeah, great idea! I was thinking about using the "Repairs" feature, which allows using URIs to provide solutions. I don't know how annoying that might get for this sort of thing, but we could have a Button entity to "Check for Repairs" so that it only creates Repairs when the user asks for help, perhaps prompted by the Health indicator going off.

But yes a simple binary that makes an easy automation target might be a great idea.

@Lash-L
Contributor

Lash-L commented Nov 13, 2024

So this seems a bit more difficult than I originally thought.

Since the bermuda device scanner can exist on all of the bermuda devices, I think we would have to go for scanner in coordinator.scanners

Then for device, see if the scanner is on that device, and then do whatever math we want.

Or potentially the coordinator could hold the data we want?

What are your thoughts, @agittins?

@agittins
Owner Author

Yes, my bad - I glazed over in my brain when it comes to the specifics of the scanner/device interdependence.

I'm heading to bed but I'll try and formulate some thoughts later.
