Skip to content

Commit

Permalink
CA-391651: Fix spike in derived RRD metrics (#6086)
Browse files Browse the repository at this point in the history
`xcp-rrdd` used to ignore timestamps of measurement times provided by
the RRDD plugins, and always calculated change in all values against a
single timestamp captured at the time of reading on the server. This can
potentially trigger a case where the time that passed between
measurements is not equal to the time that passed between readings (or
other measurements), and values would be calculated incorrectly. This
can be triggered on a host with 100% CPU load, where `cpu0-P0` derived
metric, for example, would often be > 1.0.

This PR reworks `xcp-rrdd` and the plugin infrastructure to calculate
values based on the actual length of time passed since the last value
for this particular metric was recorded. This required extensive
architectural changes, explained in detail in commits themselves - *so
best reviewed by commit*.

I've tested this over the weekend, and haven't registered any spikes in
metrics on a fully-loaded host. Will run BST+BVT as well, since RRDD is
at the center of a lot of other things and is quite fragile.
  • Loading branch information
last-genius authored Dec 4, 2024
2 parents 95fc08f + 1d7fe35 commit 8971c8d
Show file tree
Hide file tree
Showing 32 changed files with 675 additions and 411 deletions.
9 changes: 3 additions & 6 deletions doc/content/design/plugin-protocol-v2.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ DATASOURCES
000001e4
dba4bf7a84b6d11d565d19ef91f7906e
{
"timestamp": 1339685573,
"timestamp": 1339685573.245,
"data_sources": {
"cpu-temp-cpu0": {
"description": "Temperature of CPU 0",
Expand Down Expand Up @@ -62,7 +62,7 @@ reported datasources.
### Example
```
{
"timestamp": 1339685573,
"timestamp": 1339685573.245,
"data_sources": {
"cpu-temp-cpu0": {
"description": "Temperature of CPU 0",
Expand Down Expand Up @@ -96,7 +96,7 @@ Protocol V2
|data checksum |32 |int32 |binary-encoded crc32 of the concatenation of the encoded timestamp and datasource values|
|metadata checksum |32 |int32 |binary-encoded crc32 of the metadata string (see below) |
|number of datasources|32 |int32 |only needed if the metadata has changed - otherwise RRDD can use a cached value |
|timestamp |64 |int64 |Unix epoch |
|timestamp |64 |double|Unix epoch |
|datasource values |n * 64 |int64 \| double |n is the number of datasources exported by the plugin, type dependent on the setting in the metadata for value_type [int64\|float] |
|metadata length |32 |int32 | |
|metadata |(string length)*8|string| |
Expand Down Expand Up @@ -193,6 +193,3 @@ This means that for a normal update, RRDD will only have to read the header plus
the first (16 + 16 + 4 + 8 + 8*n) bytes of data, where n is the number of
datasources exported by the plugin. If the metadata changes RRDD will have to
read all the data (and parse the metadata).

n.b. the timestamp reported by plugins is not currently used by RRDD - it uses
its own global timestamp.
57 changes: 45 additions & 12 deletions doc/content/xcp-rrdd/design/plugin-protocol-v2.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
---
title: RRDD plugin protocol v2
layout: default
design_doc: true
revision: 1
status: released (7.0)
Expand All @@ -19,7 +20,7 @@ DATASOURCES
000001e4
dba4bf7a84b6d11d565d19ef91f7906e
{
"timestamp": 1339685573,
"timestamp": 1339685573.245,
"data_sources": {
"cpu-temp-cpu0": {
"description": "Temperature of CPU 0",
Expand Down Expand Up @@ -58,9 +59,10 @@ This should always be present.
* The JSON data itself, encoding the values and metadata associated with the
reported datasources.

### Example
```
{
"timestamp": 1339685573,
"timestamp": 1339685573.245,
"data_sources": {
"cpu-temp-cpu0": {
"description": "Temperature of CPU 0",
Expand Down Expand Up @@ -90,19 +92,32 @@ Protocol V2

|value|bits|format|notes|
|-----|----|------|-----|
|header string |(string length)*8|string|"Datasources" as in the V1 protocol |
|header string |(string length)*8|string|"DATASOURCES" as in the V1 protocol |
|data checksum |32 |int32 |binary-encoded crc32 of the concatenation of the encoded timestamp and datasource values|
|metadata checksum |32 |int32 |binary-encoded crc32 of the metadata string (see below) |
|number of datasources|32 |int32 |only needed if the metadata has changed - otherwise RRDD can use a cached value |
|timestamp |64 |int64 |Unix epoch |
|datasource values |n * 64 |int64 |n is the number of datasources exported by the plugin |
|timestamp |64 |double|Unix epoch |
|datasource values |n * 64 |int64 \| double |n is the number of datasources exported by the plugin, type dependent on the setting in the metadata for value_type [int64\|float] |
|metadata length |32 |int32 | |
|metadata |(string length)*8|string| |

All integers are bigendian. The metadata will have the same JSON-based format as
All integers/double are bigendian. The metadata will have the same JSON-based format as
in the V1 protocol, minus the timestamp and `value` key-value pair for each
datasource, for example:
datasource.

| field | values | notes | required |
|-------|--------|-------|----------|
|description|string|Description of the datasource|no|
|owner|host \| vm \| sr|The object to which the data relates|no, default host|
|value_type|int64 \| float|The type of the datasource|yes|
|type|absolute \| derive \| gauge|The type of measurement being sent. Absolute for counters which are reset on reading, derive stores the derivative of the recorded values (useful for metrics which continually increase like amount of data written since start), gauge for things like temperature|no, default absolute|
|default|true \| false|Whether the source is default enabled or not|no, default false|
|units|<TBD>|The units the data should be displayed in|no|
|min||The minimum value for the datasource|no, default -infinity|
|max||The maximum value for the datasource|no, default +infinity|


### Example
```
{
"datasources": {
Expand All @@ -125,6 +140,27 @@ datasource, for example:
"units":"B",
"min":"-inf",
"max":"inf"
},
{
"cpu-temp-cpu0": {
"description": "Temperature of CPU 0",
"owner":"host",
"value_type": "float",
"type": "absolute",
"default":"true",
"units": "degC",
"min":"-inf",
"max":"inf"
},
"cpu-temp-cpu1": {
"description": "Temperature of CPU 1",
"owner":"host",
"value_type": "float",
"type": "absolute",
"default":"true",
"units": "degC",
"min":"-inf",
"max":"inf"
}
}
}
Expand All @@ -140,13 +176,13 @@ if header != expected_header:
raise InvalidHeader()
if data_checksum == last_data_checksum:
raise NoUpdate()
if data_checksum != md5sum(encoded_timestamp_and_values):
if data_checksum != crc32(encoded_timestamp_and_values):
raise InvalidChecksum()
if metadata_checksum == last_metadata_checksum:
for datasource, value in cached_datasources, values:
update(datasource, value)
else:
if metadata_checksum != md5sum(metadata):
if metadata_checksum != crc32(metadata):
raise InvalidChecksum()
cached_datasources = create_datasources(metadata)
for datasource, value in cached_datasources, values:
Expand All @@ -157,6 +193,3 @@ This means that for a normal update, RRDD will only have to read the header plus
the first (16 + 16 + 4 + 8 + 8*n) bytes of data, where n is the number of
datasources exported by the plugin. If the metadata changes RRDD will have to
read all the data (and parse the metadata).

n.b. the timestamp reported by plugins is not currently used by RRDD - it uses
its own global timestamp.
Loading

0 comments on commit 8971c8d

Please sign in to comment.