Incident management tool #9
Comments
@kirillyu Thank you for the feedback, and we are glad that the project fits the need.
BI Engine (Alerting) supports Grafana HA out of the box using LBA for performance and redundancy. We aim to solve this kind of visibility issue and provide an easy and clean interface for Alerting in Grafana.
Thanks for the quick and detailed answer. It is not entirely clear what LBA means in this context. Can you expand on the term?
@kirillyu BI Engine takes a totally different approach than native Alerting. It is designed to run in a separate Docker container, while native Alerting runs as part of the Grafana container or installation. The Engine connects to Grafana using a URL and a Service token to get the panel/query configuration and run the queries. Following Grafana HA best practices (https://grafana.com/docs/grafana/latest/setup-grafana/set-up-for-high-availability/), it's recommended to use a load balancer (LBA)/proxy in front of the Grafana instances, which is what you specify as the Grafana URL in the Engine configuration.
I hope that's clear. We will add a diagram to the documentation in the upcoming release.
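A minimal sketch of the connection pattern described above, assuming a client reaches Grafana through the load-balancer URL with a service-account token. The URLs, token, and dashboard UID are placeholders, and this illustrates the standard Grafana HTTP API rather than the Engine's actual code.

```python
# Illustration only: fetching panel/query configuration from Grafana through an LBA/proxy
# using a service-account token. URLs, token, and UID are placeholders.
import requests

GRAFANA_URL = "https://grafana-lba.example.com"  # load balancer in front of the Grafana HA cluster
SERVICE_TOKEN = "glsa_..."                        # Grafana service-account token (placeholder)

def get_dashboard(uid: str) -> dict:
    """Fetch a dashboard (including panel targets/queries) via the Grafana HTTP API."""
    response = requests.get(
        f"{GRAFANA_URL}/api/dashboards/uid/{uid}",
        headers={"Authorization": f"Bearer {SERVICE_TOKEN}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    dashboard = get_dashboard("abc123")  # hypothetical dashboard UID
    for panel in dashboard["dashboard"].get("panels", []):
        print(panel.get("id"), panel.get("title"))
```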
Got it, but it's still not totally clear. That article is about Grafana HA, and I already have it. Can your container be deployed as a cluster with event exchange for deduplication?
@kirillyu As I replied, supporting distributed cron jobs and API requests is on our roadmap. We won't use deduplication; we will use distributed locking to prevent starting the same alerts (cron jobs) more than once.
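For readers unfamiliar with the approach, here is a minimal sketch of distributed locking around a scheduled alert run, using a shared Redis instance and `SET NX` with a TTL. This is a generic illustration with assumed key names and hosts, not the Engine's actual implementation.

```python
# Generic illustration of distributed locking for scheduled alerts:
# whichever instance acquires the lock first runs the alert; the others skip it.
import redis

r = redis.Redis(host="redis", port=6379)  # shared Redis instance (placeholder host)

def try_run_alert(alert_id: str, tick: str, ttl_seconds: int = 60) -> bool:
    """Acquire a per-alert, per-tick lock; return True if this instance should run the alert."""
    lock_key = f"alert-lock:{alert_id}:{tick}"  # hypothetical key layout
    # SET key value NX EX ttl -> truthy only for the first instance that sets it
    acquired = r.set(lock_key, "owner", nx=True, ex=ttl_seconds)
    return bool(acquired)

# Example: called by each instance on its cron tick
if try_run_alert("cpu-high", "2024-01-17T10:00"):
    print("running alert evaluation")
else:
    print("another instance already took this alert")
```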
I just wanted to make sure that's what this is about, thanks!
@kirillyu We published a new blog post and hands-on tutorials to highlight our latest development: https://volkovlabs.io/blog/big-1.6.0-20240117/ We are looking for feedback and would be interested to hear your thoughts.
@mikhail-vl
I see the Swagger API release! Cool
@kirillyu Yes, Swagger was added in v1.7.0 and is covered in the documentation: https://volkovlabs.io/big/api/. Also, in v1.7.0 we implemented distributed alert scheduling. Each scheduler assigns alerts independently. If one of the engines dies, the rest will take the load. BI(G) supports HA at all levels now: https://volkovlabs.io/big/high-availability/
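One common way to get the "each scheduler assigns alerts independently, survivors take the load" behavior is to partition alerts deterministically across the currently healthy instances. The sketch below is a generic illustration of that idea with assumed names; it is not BI Engine's actual algorithm.

```python
# Generic illustration of deterministic alert partitioning across healthy scheduler instances.
# Every instance computes the same assignment; when an instance disappears from the healthy
# list, its alerts automatically fall to the remaining ones on the next evaluation.
import hashlib

def owner_of(alert_id: str, healthy_instances: list[str]) -> str:
    """Pick the instance responsible for an alert, identically on every node."""
    digest = int(hashlib.sha256(alert_id.encode()).hexdigest(), 16)
    return sorted(healthy_instances)[digest % len(healthy_instances)]

alerts = ["cpu-high", "disk-full", "latency-p99"]

for alert in alerts:
    print(alert, "->", owner_of(alert, ["engine-1", "engine-2", "engine-3"]))

# If engine-2 dies, recompute with the smaller list and its alerts move to the survivors:
for alert in alerts:
    print(alert, "->", owner_of(alert, ["engine-1", "engine-3"]))
```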
@kirillyu Thank you for sharing, that's impressive! I would like to hear about your experience with OnCall when you're ready. Integration with OnCall is next on our list to implement, together with increasing test coverage and small features, before the official release.
Now it's for BIG systems :) |
I'm very deep into incident management. Grafana OnCall is actually poorly suited for this task. The paradox is that it has everything needed for it. This idea was inspired by the BIG tool: Grafana has very wide possibilities, which have to be pulled out of it with great difficulty, and it's the same with OnCall. Therefore, without further details, it seems to me that OnCall is better used as an engine rather than as an integration: build an interface on top of it.

What's wrong with it? Lack of flexibility. At the input, you set the conditions under which a certain escalation flow should be triggered. So, for example, if you want to notify two teams at once based on some alert, it is very difficult to do. Next, the escalations themselves are also completely hardcoded and do not work with variables. I can't make one universal escalation, even the simplest one, where I simply specify the team to be notified as a variable taken from a trigger or from a metric label. For each team and for each unique variant of the escalation flow, a new escalation has to be created. I have a huge number of teams, so it's just too expensive to implement.

The last case is the UI/UX of the "incident" itself. From there I can call in a specific person or team and look at the resolution notes along with the status history. But in my opinion this is terribly limited functionality. I should be able to resend an alert, manually advance the escalation step, make an announcement or call to managers, or send out a conference call to the engineers telling them where to go for a resolution; integrate this with ChatOps so information comes straight from the chats and the incident manager gets direct contact with the end engineers; and finally, simply convert all of this into a post-mortem after the incident. Then this would be a real tool. And a tool for resolving incidents could provide at least the MTTR metric, not to mention the other key ones. We fill out the rules themselves in YAML, as well as the on-call rotations and their notification options; OnCall does not support this well either. Without going into unnecessary details, I'll stop here.

At the same time, OnCall does have acknowledgments, rotation of on-call calendars to which you can send alerts, and, most importantly, escalation steps for the case when the person who initially received the alert did not respond.
That came out a bit chaotic, but it's all I have on OnCall for now :)
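To make the "one universal escalation driven by a variable" idea above concrete, here is a minimal, purely illustrative sketch: the team to notify is taken from an alert label instead of being hard-coded per escalation. The `team` label and `notify_team` function are hypothetical; this is not Grafana OnCall's API.

```python
# Purely illustrative: route an alert to a team taken from one of its labels,
# instead of maintaining a separate hard-coded escalation per team.
# The "team" label and notify_team() are hypothetical, not Grafana OnCall API.

def notify_team(team: str, alert: dict) -> None:
    print(f"notifying {team} about {alert['name']}")  # placeholder for a real notification

def escalate(alert: dict, default_team: str = "sre") -> None:
    """Single universal escalation: the target team comes from the alert's labels."""
    team = alert.get("labels", {}).get("team", default_team)
    notify_team(team, alert)

escalate({"name": "disk-full", "labels": {"team": "storage"}})
escalate({"name": "latency-p99", "labels": {}})  # falls back to the default team
```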
Hi, this is just a really cool project. My team has been searching the market for something like this, something that combines simplicity, functionality, and flexibility.
Usage pattern:
Separately, there is a question not related to feature development:
Is an HA installation possible? Clustering, or alert exchange with deduplication? We have more than 15 data centers and more than a hundred teams, and we want to give everyone the ability to use alerts that are resilient and have a clear interface. We'd like the support staff of all teams to have a shared panel from which they could manage incidents based on alerts from any part of the business and quickly identify the team