Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check-mysql-replication-status: Lag flapping protection #92

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

DrMurx
Copy link

@DrMurx DrMurx commented Dec 18, 2018

Pull Request Checklist

General

  • Update Changelog following the conventions laid out here

  • Update README with any necessary configuration snippets

  • RuboCop passes

  • Existing tests pass

  • First test written 😄

Purpose

MariaDB/MySQL sometimes wrongly reports a very high replication lag for a short moment. Flapping protection helps mitigating this issue better than setting occurrences in sensu's checks definition because you don't lose any alerting granularity.

I had to do major refactoring on check-mysql-replication-status to allow testing.

While doing so, I discovered two minor flaws. I've fixed one of them in dedicated commit upfront so that they could be cherry-picked independently if needed.

The other flaw affects the warning/critical thresholds which are exclusive, contrary to the common practice in other checks. Fixing this would be a breaking change, though, so I left a comment in code for a future release.

Known Compatibility Issues

@DrMurx
Copy link
Author

DrMurx commented Dec 19, 2018

@majormoses The codeclimate-test-reporter in the (existing) test/spec_helper.rb prevents the tests from running. Can I ditch it?

@majormoses
Copy link
Member

The codeclimate-test-reporter in the (existing) test/spec_helper.rb prevents the tests from running. Can I ditch it?

ya we kinda abandoned that as we had pretty low unit test coverage and I have decided to focus more on integration testing which can't really be measured in the same way. We can drop it and bring it back later if we get enough coverage to see the value in it.

@majormoses
Copy link
Member

Thanks for your contribution to Sensu plugins! Without people like you submitting PRs we couldn't run the project. I will review it shortly.

@majormoses
Copy link
Member

I will review the code in a minute but I did want to point out that sensu natively has its own flap detection algorithm borrowed from nagios. The Documentation is here look for low_flap_threshold and high_flap_threshold

Copy link
Member

@majormoses majormoses left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work I have a couple questions, comments, and recommendations.

bin/check-mysql-replication-status.rb Show resolved Hide resolved
bin/check-mysql-replication-status.rb Outdated Show resolved Hide resolved
bin/check-mysql-replication-status.rb Outdated Show resolved Hide resolved
bin/check-mysql-replication-status.rb Outdated Show resolved Hide resolved
# check-mysql-replication-status_spec
#
# DESCRIPTION:
# rspec tests for check-mysql-replication-status
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for writing tests, honestly I keep maintaining all these plugins and its a lot of work, it really helps when there are tests because I won't pretend to know every service in as much detail as all the awesome people (like yourself) that know them. The biggest bang for buck testing IMHO opinion is integration testing. I have written a blog post on writing integration tests for infrastructure: https://blog.sensuapp.org/writing-sensu-plugin-tests-with-test-kitchen-and-serverspec-b646d2eeee51 if you want any ideas or help please feel free to hit me up in slack on tag me in an issue/pr.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see an issue with my choice of unit test - since the checkscript itself is required by the test, the check will be executed once after the rspec tests have been finished and fail with unknown due to missing MySQL credentials.

I didn't come up with this pattern for a checkscript unit test myself, just borrowed it from another plugin's test. Any suggestions how to solve this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CHANGELOG.md Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
@DrMurx
Copy link
Author

DrMurx commented Dec 21, 2018

I will review the code in a minute but I did want to point out that sensu natively has its own flap detection algorithm borrowed from nagios. The Documentation is here look for low_flap_threshold and high_flap_threshold

Sensu's flapping protection is meant to prevent an unnecessary storm of alerting events in case a check is flapping. This PR does a completely different thing: it works around a condition of MySQL/MariaDB during which a replication slave incorrectly reports very high lag times (in my case it's >10 days!) for a short moment.

Of course I want any real lag to trigger a prompt alert, but I want the check to requery SHOW SLAVE STATUS under this specific condition.

Maybe "flapping protection" isn't the best name; maybe you have a different suggestion?

@DrMurx DrMurx force-pushed the lag-flapping-protection branch from 787474c to b8db04e Compare December 21, 2018 11:09
@DrMurx
Copy link
Author

DrMurx commented Jan 22, 2019

@majormoses I'd like to go on with this one. With your experience in the plugin ecosystem, can you guide me to a different way of testing without including the check script? And maybe you have a better proposal for naming the feature because "flapping" is already a standard term in monitoring solutions? As I wrote earlier, we're dealing with an bug/side effect of MySQL/MariaDB replication which rarely report outliers in the replication lag.

long: '--flapping-sleep=VALUE',
description: 'Sleep between flapping protection retries',
default: 1,
proc: lambda { |s| s.to_i } # rubocop:disable Lambda
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason to use a lambda over a proc that I am not seeing? This would remove the need to disable the cop.

Suggested change
proc: lambda { |s| s.to_i } # rubocop:disable Lambda
proc: proc { |s| s.to_i }

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest rework

@majormoses
Copy link
Member

I am having a hard time coming up with a good name as its pretty much trying to work around a reporting bug. Maybe something like --discard-invalid-results-threshold, --discard-invalid-results-retries, --discard-invalid-results-sleep? I can't say I like that either but nothing obvious is coming to mind.

Properly detect when server is not a slave
@DrMurx DrMurx force-pushed the lag-flapping-protection branch from b8db04e to 98c20a9 Compare June 11, 2019 09:19
@DrMurx
Copy link
Author

DrMurx commented Jun 17, 2019

@majormoses I've picked up work on this PR again, and I renamed the feature to lag outlier; I think that wording expresses that it's an exceptional condition, yet unrelated to the flapping protection sensu already offers.

Tests are now running properly. I removed the CodeClimate from spec_helper.rb, and added what you pointed out in comment #92 (comment)

@DrMurx DrMurx force-pushed the lag-flapping-protection branch 3 times, most recently from d46418e to bd31e24 Compare June 17, 2019 15:47
@DrMurx DrMurx force-pushed the lag-flapping-protection branch from bd31e24 to 67c3823 Compare June 17, 2019 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants