`hydra investigate show-monitor` doesn't work (Fedora 38, podman) #7080

michoecho · 2024-01-09T20:42:07Z

On a fresh Fedora 38 system with podman (and podman-docker) installed, using SCT master, I ran `hydra investigate show-monitor 0d40b784-fe2e-484b-9b2d-7e78c3cbe186` (the specific test ID doesn't matter) and received the following output:

Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
There is scylladb/hydra:v1.56-bump_paramiko_3.4.0 in local cache, using it.
Obtaining QA SSH keys...
QA SSH keys obtained.
Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
Going to run './sct.py  investigate show-monitor 0d40b784-fe2e-484b-9b2d-7e78c3cbe186'...
Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
WARN[0000] Using the Podman API service with TCP sockets is not recommended, please see `podman system service` manpage for details
Obtaining QA SSH keys...
QA SSH keys obtained.
/usr/local/lib/python3.10/site-packages/elasticsearch/connection/http_urllib3.py:209: UserWarning: Connecting to https://746f0ad652a3447d83b1572f657c67cb.us-east-1.aws.found.io:9243 using SSL with verify_certs=False is insecure.
  warnings.warn(
install-bash-completion current path: /home/fedora/scylla-cluster-tests
When Podman daemon is used we don't login
New directory created: /home/fedora/sct-results/20240109-203016-495036-investigate-show-monitor
Search monitoring stack archive files for test id 0d40b784-fe2e-484b-9b2d-7e78c3cbe186 and restoring...
Checking that docker is available...
Docker is available
Restoring monitoring stack from archive 0d40b784-fe2e-484b-9b2d-7e78c3cbe186/20240101_165351/monitor-set-0d40b784.tar.gz
Download file https://cloudius-jenkins-test.s3.amazonaws.com/0d40b784-fe2e-484b-9b2d-7e78c3cbe186/20240101_165351/monitor-set-0d40b784.tar.gz to directory /tmp/tmpashh5vo3
Downloading 0d40b784-fe2e-484b-9b2d-7e78c3cbe186/20240101_165351/monitor-set-0d40b784.tar.gz from cloudius-jenkins-test
Downloaded finished
Creating monitoring stack directories for longevity-large-partitions-8h-2024--monitor-node-0d40b784-1 cluster
Error executing command: "cd /tmp/tmpashh5vo3/scylla-monitoring-src;
            echo "" > UA.sh
            ./start-all.sh             $(grep -q -- --no-renderer ./start-all.sh && echo "--no-renderer")              $(grep -q -- --no-loki ./start-all.sh && echo "--no-loki")              -g 3000 -m 6000 -p 9090             -s /tmp/tmpashh5vo3/scylla-monitoring-src/config/scylla_servers.yml             -d /tmp/tmpashh5vo3/monitoring_data_dir/longevity-large-partitions-8h-2024--monitor-node-0d40b784-1/20240101T165654Z-5c5d923f9c1dad34 -v 2024.1             -b '-storage.tsdb.retention.time=100y'             -c 'GF_USERS_DEFAULT_THEME=dark'"; Exit status: 1
Start docker containers [try #1]
Error executing command: "cd /tmp/tmpashh5vo3/scylla-monitoring-src;
            echo "" > UA.sh
            ./start-all.sh             $(grep -q -- --no-renderer ./start-all.sh && echo "--no-renderer")              $(grep -q -- --no-loki ./start-all.sh && echo "--no-loki")              -g 3000 -m 6000 -p 9090             -s /tmp/tmpashh5vo3/scylla-monitoring-src/config/scylla_servers.yml             -d /tmp/tmpashh5vo3/monitoring_data_dir/longevity-large-partitions-8h-2024--monitor-node-0d40b784-1/20240101T165654Z-5c5d923f9c1dad34 -v 2024.1             -b '-storage.tsdb.retention.time=100y'             -c 'GF_USERS_DEFAULT_THEME=dark'"; Exit status: 1
Start docker containers [try #2]
Error executing command: "cd /tmp/tmpashh5vo3/scylla-monitoring-src;
            echo "" > UA.sh
            ./start-all.sh             $(grep -q -- --no-renderer ./start-all.sh && echo "--no-renderer")              $(grep -q -- --no-loki ./start-all.sh && echo "--no-loki")              -g 3000 -m 6000 -p 9090             -s /tmp/tmpashh5vo3/scylla-monitoring-src/config/scylla_servers.yml             -d /tmp/tmpashh5vo3/monitoring_data_dir/longevity-large-partitions-8h-2024--monitor-node-0d40b784-1/20240101T165654Z-5c5d923f9c1dad34 -v 2024.1             -b '-storage.tsdb.retention.time=100y'             -c 'GF_USERS_DEFAULT_THEME=dark'"; Exit status: 1
'start_dockers': Number of retries exceeded!
Dockers are not started. Error: Encountered a bad command exit code!

Command: 'cd /tmp/tmpashh5vo3/scylla-monitoring-src;\n            echo "" > UA.sh\n            ./start-all.sh             $(grep -q -- --no-renderer ./start-all.sh && echo "--no-renderer")              $(grep -q -- --no-loki ./start-all.sh && echo "--no-loki")              -g 3000 -m 6000 -p 9090             -s /tmp/tmpashh5vo3/scylla-monitoring-src/config/scylla_servers.yml             -d /tmp/tmpashh5vo3/monitoring_data_dir/longevity-large-partitions-8h-2024--monitor-node-0d40b784-1/20240101T165654Z-5c5d923f9c1dad34 -v 2024.1             -b \'-storage.tsdb.retention.time=100y\'             -c \'GF_USERS_DEFAULT_THEME=dark\''

Exit code: 1

Stdout:

Loading prometheus data from /tmp/tmpashh5vo3/monitoring_data_dir/longevity-large-partitions-8h-2024--monitor-node-0d40b784-1/20240101T165654Z-5c5d923f9c1dad34
Wait for alert manager container to start
Error: Alertmanager container failed to start
For more information use: docker logs aalert-6000

Stderr:




Errors were found when restoring Scylla monitoring stack
Killing Grafana service
Killing Prometheus service
Killing Alert service

The error message suggests looking at the logs of aalert-6000, but the output of `podman logs aalert-6000` is empty.
Running the same start-all.sh on the same data manually (outside of the containers) works as expected.


michoecho · 2024-01-09T20:43:33Z

The above is actually a different issue from the one I initially wanted to report. The errors above come from a fresh Fedora 38 AMI. On my own machine with Fedora 38, I see the following:

Initial part same as above [...]
Error executing command: "cd /tmp/tmpudnwr4bu/scylla-monitoring-src;
            echo "" > UA.sh
            ./start-all.sh             $(grep -q -- --no-renderer ./start-all.sh && echo "--no-renderer")              $(grep -q -- --no-loki ./start-all.sh && echo "--no-loki")              -g 3000 -m 6000 -p 9090             -s /tmp/tmpudnwr4bu/scylla-monitoring-src/config/scylla_servers.yml             -d /tmp/tmpudnwr4bu/monitoring_data_dir/longevity-large-partitions-8h-2024--monitor-node-0d40b784-1/20240101T165654Z-5c5d923f9c1dad34 -v 2024.1             -b '-storage.tsdb.retention.time=100y'             -c 'GF_USERS_DEFAULT_THEME=dark'"; Exit status: 1
Start docker containers [try #1]
Start docker containers [try #2]
'start_dockers': Number of retries exceeded!
Dockers are not started. Error: Can't allocate a free port
Errors were found when restoring Scylla monitoring stack
Killing Grafana service
Killing Prometheus service
Killing Alert service

Here the error message is a misleading complaint about free ports, but looking into the container logs and stracing the process tree shows that Prometheus is failing with filesystem permission errors (on queries.active), apparently because some files are created with the wrong UID. (I won't go deeper into the details for now, because this might be a problem with my system.)
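For anyone hitting the same symptom, the suspected ownership mismatch could be checked with something like the sketch below. It assumes the failure really is caused by files owned by a different UID than the user running the stack; `list_foreign_owned` is a hypothetical helper name, not part of SCT or scylla-monitoring.

```shell
# Hypothetical helper (not part of SCT or scylla-monitoring): list every
# entry under a directory whose owner is NOT the current user.
list_foreign_owned() {
    dir="$1"
    # `! -user UID` matches entries not owned by that numeric UID;
    # `ls -ldn` prints the numeric owner and group of each match.
    find "$dir" ! -user "$(id -u)" -exec ls -ldn {} +
}

# Example call on the data directory from the log above (commented out so
# that sourcing this snippet does nothing by itself):
# list_foreign_owned /tmp/tmpudnwr4bu/monitoring_data_dir
```

An empty result would suggest that ownership is not the problem and that the permission errors have another cause.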

michoecho · 2024-01-09T20:48:47Z

Bottom line: show-monitor seems to be broken on a fresh Fedora with podman. Please check if that's really the case or maybe I'm just doing something wrong.

(It's also broken in a different way on my own non-fresh Fedora, but that could be something wrong with my system, so please just assure me that it works on a fresh Fedora, and I might investigate my own problems from there).

fruch · 2024-01-09T22:23:39Z

I've found two issues:

* podman now defaults to short-name-mode="enforcing", so the monitoring stack isn't able to pull images when working via the socket: there is no TTY on which podman could ask which registry the short name should be aliased to. Setting short-name-mode="permissive" in /etc/containers/registries.conf alleviates this issue.
* scylla-monitoring uses DOCKER_HOST inside its scripts, which causes the configuration we set for it to be ignored. I think it's a bug: those variables are used for something completely different and need to be renamed. Once I renamed them, I was able to manually start the stack via the podman socket:

DOCKER_HOST=tcp://localhost:3345 bash -x ./start-all.sh -v 2024.1
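The short-name-mode change from the first item could be applied with a small script along these lines. This is only a sketch: `set_short_name_mode` is a hypothetical helper, and it assumes the setting lives as a top-level key in a registries.conf-style file, which is why a missing key is added at the top of the file rather than appended after any table.

```shell
# Hypothetical helper, not part of SCT: set the top-level short-name-mode key
# in a registries.conf-style TOML file, creating the file if it is missing.
set_short_name_mode() {
    conf="$1"
    mode="$2"
    mkdir -p "$(dirname "$conf")"
    [ -f "$conf" ] || : > "$conf"
    tmp="$(mktemp)"
    if grep -Eq '^[[:space:]]*short-name-mode[[:space:]]*=' "$conf"; then
        # Key already present: rewrite that line, keep everything else.
        sed -E "s|^[[:space:]]*short-name-mode[[:space:]]*=.*|short-name-mode = \"$mode\"|" "$conf" > "$tmp"
    else
        # Key absent: put it first so it stays outside any [[registry]] table.
        { printf 'short-name-mode = "%s"\n' "$mode"; cat "$conf"; } > "$tmp"
    fi
    # Copy the content back instead of moving the temp file, so the original
    # file keeps its permissions.
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Example call for a per-user config file (commented out so that sourcing
# this snippet changes nothing):
# set_short_name_mode "$HOME/.config/containers/registries.conf" permissive
```

Editing /etc/containers/registries.conf instead would need root; the per-user path avoids that.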

* podman now defaults to short-name-mode="enforcing", and the monitoring stack isn't able to pull images when working via the socket, since there is no TTY on which podman could ask which registry to alias the short name to; setting short-name-mode="permissive" in ~/.config/containers/registries.conf alleviates this issue
* scylla-monitoring was using DOCKER_HOST inside its start scripts, hence breaking our ability to use it to point to the podman socket; a fix was submitted, but until then we patch those out of the scripts so older collected monitoring data can be used

Ref: scylladb/scylla-monitoring#2149
Fixes: scylladb#7080
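Before patching an older monitoring archive by hand, it may help to see which start scripts mention DOCKER_HOST at all. The sketch below only lists the matching lines and makes no assumption about what the real scripts contain or how the actual patch rewrites them; `list_docker_host_uses` is a hypothetical name.

```shell
# Hypothetical helper: print file:line:text for every mention of DOCKER_HOST
# in the top-level *.sh files of a directory.
list_docker_host_uses() {
    dir="$1"
    # -H forces the file name to be printed even when only one file is
    # searched; "no match" is not treated as an error.
    grep -Hn 'DOCKER_HOST' "$dir"/*.sh 2>/dev/null || true
}

# Example call on the extracted archive from the log above (commented out so
# that sourcing this snippet does nothing by itself):
# list_docker_host_uses /tmp/tmpashh5vo3/scylla-monitoring-src
```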


github-actions bot assigned michoecho Jan 9, 2024

bhalevy unassigned michoecho Jan 10, 2024

fruch mentioned this issue Jan 10, 2024

set DOCKER_HOST only if needed scylladb/scylla-monitoring#2149

Merged

fruch mentioned this issue Jan 10, 2024

fix(monitorstack): make it work when running hydra via podman #7085

Merged

2 tasks

fruch closed this as completed in #7085 Jan 11, 2024
