hydra investigate show-monitor doesn't work (Fedora 38, podman) #7080

Closed
michoecho opened this issue Jan 9, 2024 · 3 comments · Fixed by #7085

Comments

@michoecho
Contributor

michoecho commented Jan 9, 2024

On a fresh Fedora 38 system, with podman (and podman-docker) installed, using SCT master, I run hydra investigate show-monitor 0d40b784-fe2e-484b-9b2d-7e78c3cbe186 (the specific test ID doesn't matter) and I receive the following output:

Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
There is scylladb/hydra:v1.56-bump_paramiko_3.4.0 in local cache, using it.
Obtaining QA SSH keys...
QA SSH keys obtained.
Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
Going to run './sct.py  investigate show-monitor 0d40b784-fe2e-484b-9b2d-7e78c3cbe186'...
Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
WARN[0000] Using the Podman API service with TCP sockets is not recommended, please see `podman system service` manpage for details
Obtaining QA SSH keys...
QA SSH keys obtained.
/usr/local/lib/python3.10/site-packages/elasticsearch/connection/http_urllib3.py:209: UserWarning: Connecting to https://746f0ad652a3447d83b1572f657c67cb.us-east-1.aws.found.io:9243 using SSL with verify_certs=False is insecure.
  warnings.warn(
install-bash-completion current path: /home/fedora/scylla-cluster-tests
When Podman daemon is used we don't login
New directory created: /home/fedora/sct-results/20240109-203016-495036-investigate-show-monitor
Search monitoring stack archive files for test id 0d40b784-fe2e-484b-9b2d-7e78c3cbe186 and restoring...
Checking that docker is available...
Docker is available
Restoring monitoring stack from archive 0d40b784-fe2e-484b-9b2d-7e78c3cbe186/20240101_165351/monitor-set-0d40b784.tar.gz
Download file https://cloudius-jenkins-test.s3.amazonaws.com/0d40b784-fe2e-484b-9b2d-7e78c3cbe186/20240101_165351/monitor-set-0d40b784.tar.gz to directory /tmp/tmpashh5vo3
Downloading 0d40b784-fe2e-484b-9b2d-7e78c3cbe186/20240101_165351/monitor-set-0d40b784.tar.gz from cloudius-jenkins-test
Downloaded finished
Creating monitoring stack directories for longevity-large-partitions-8h-2024--monitor-node-0d40b784-1 cluster
Error executing command: "cd /tmp/tmpashh5vo3/scylla-monitoring-src;
            echo "" > UA.sh
            ./start-all.sh             $(grep -q -- --no-renderer ./start-all.sh && echo "--no-renderer")              $(grep -q -- --no-loki ./start-all.sh && echo "--no-loki")              -g 3000 -m 6000 -p 9090             -s /tmp/tmpashh5vo3/scylla-monitoring-src/config/scylla_servers.yml             -d /tmp/tmpashh5vo3/monitoring_data_dir/longevity-large-partitions-8h-2024--monitor-node-0d40b784-1/20240101T165654Z-5c5d923f9c1dad34 -v 2024.1             -b '-storage.tsdb.retention.time=100y'             -c 'GF_USERS_DEFAULT_THEME=dark'"; Exit status: 1
Start docker containers [try #1]
Error executing command: "cd /tmp/tmpashh5vo3/scylla-monitoring-src;
            echo "" > UA.sh
            ./start-all.sh             $(grep -q -- --no-renderer ./start-all.sh && echo "--no-renderer")              $(grep -q -- --no-loki ./start-all.sh && echo "--no-loki")              -g 3000 -m 6000 -p 9090             -s /tmp/tmpashh5vo3/scylla-monitoring-src/config/scylla_servers.yml             -d /tmp/tmpashh5vo3/monitoring_data_dir/longevity-large-partitions-8h-2024--monitor-node-0d40b784-1/20240101T165654Z-5c5d923f9c1dad34 -v 2024.1             -b '-storage.tsdb.retention.time=100y'             -c 'GF_USERS_DEFAULT_THEME=dark'"; Exit status: 1
Start docker containers [try #2]
Error executing command: "cd /tmp/tmpashh5vo3/scylla-monitoring-src;
            echo "" > UA.sh
            ./start-all.sh             $(grep -q -- --no-renderer ./start-all.sh && echo "--no-renderer")              $(grep -q -- --no-loki ./start-all.sh && echo "--no-loki")              -g 3000 -m 6000 -p 9090             -s /tmp/tmpashh5vo3/scylla-monitoring-src/config/scylla_servers.yml             -d /tmp/tmpashh5vo3/monitoring_data_dir/longevity-large-partitions-8h-2024--monitor-node-0d40b784-1/20240101T165654Z-5c5d923f9c1dad34 -v 2024.1             -b '-storage.tsdb.retention.time=100y'             -c 'GF_USERS_DEFAULT_THEME=dark'"; Exit status: 1
'start_dockers': Number of retries exceeded!
Dockers are not started. Error: Encountered a bad command exit code!

Command: 'cd /tmp/tmpashh5vo3/scylla-monitoring-src;\n            echo "" > UA.sh\n            ./start-all.sh             $(grep -q -- --no-renderer ./start-all.sh && echo "--no-renderer")              $(grep -q -- --no-loki ./start-all.sh && echo "--no-loki")              -g 3000 -m 6000 -p 9090             -s /tmp/tmpashh5vo3/scylla-monitoring-src/config/scylla_servers.yml             -d /tmp/tmpashh5vo3/monitoring_data_dir/longevity-large-partitions-8h-2024--monitor-node-0d40b784-1/20240101T165654Z-5c5d923f9c1dad34 -v 2024.1             -b \'-storage.tsdb.retention.time=100y\'             -c \'GF_USERS_DEFAULT_THEME=dark\''

Exit code: 1

Stdout:

Loading prometheus data from /tmp/tmpashh5vo3/monitoring_data_dir/longevity-large-partitions-8h-2024--monitor-node-0d40b784-1/20240101T165654Z-5c5d923f9c1dad34
Wait for alert manager container to start
Error: Alertmanager container failed to start
For more information use: docker logs aalert-6000

Stderr:




Errors were found when restoring Scylla monitoring stack
Killing Grafana service
Killing Prometheus service
Killing Alert service

The error message suggests looking at the logs of aalert-6000, but the output of podman logs aalert-6000 is empty.
Running the same start-all.sh on the same data manually (outside of the containers) works as expected.
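
When the logs of a container come back empty, it can help to check whether the container exists at all and what exit code the engine recorded for it. Below is a minimal sketch of such a check, not something taken from SCT: the function name `container_diagnostics` and the `runner` parameter are made up for illustration, and it only assumes that the `ps`, `inspect` and `logs` subcommands behave the same under podman and docker.

```python
import subprocess


def container_diagnostics(name, cli="podman", runner=subprocess.run):
    """Collect listing, state and logs for a container, even a stopped one.

    `runner` is injectable so the function can be exercised without a
    container engine; by default it shells out via subprocess.run.
    """
    commands = {
        # List the container even if it has already exited.
        "ps": [cli, "ps", "--all", "--filter", f"name={name}"],
        # Status and exit code as recorded by the container engine.
        "state": [cli, "inspect", "--format",
                  "{{.State.Status}} exit={{.State.ExitCode}}", name],
        # Whatever the container process wrote before it stopped.
        "logs": [cli, "logs", name],
    }
    report = {}
    for label, argv in commands.items():
        result = runner(argv, capture_output=True, text=True, check=False)
        report[label] = {
            "returncode": result.returncode,
            "output": (result.stdout + result.stderr).strip(),
        }
    return report
```

Calling `container_diagnostics("aalert-6000")` on the affected host would show whether the container was ever created; a non-zero return code from the `state` command would suggest it was not.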

@michoecho
Contributor Author

The above is actually a different issue from the one I initially wanted to report. The errors above come from a fresh Fedora 38 AMI. On my own machine with Fedora 38, I see the following:

Initial part same as above [...]
Error executing command: "cd /tmp/tmpudnwr4bu/scylla-monitoring-src;
            echo "" > UA.sh
            ./start-all.sh             $(grep -q -- --no-renderer ./start-all.sh && echo "--no-renderer")              $(grep -q -- --no-loki ./start-all.sh && echo "--no-loki")              -g 3000 -m 6000 -p 9090             -s /tmp/tmpudnwr4bu/scylla-monitoring-src/config/scylla_servers.yml             -d /tmp/tmpudnwr4bu/monitoring_data_dir/longevity-large-partitions-8h-2024--monitor-node-0d40b784-1/20240101T165654Z-5c5d923f9c1dad34 -v 2024.1             -b '-storage.tsdb.retention.time=100y'             -c 'GF_USERS_DEFAULT_THEME=dark'"; Exit status: 1
Start docker containers [try #1]
Start docker containers [try #2]
'start_dockers': Number of retries exceeded!
Dockers are not started. Error: Can't allocate a free port
Errors were found when restoring Scylla monitoring stack
Killing Grafana service
Killing Prometheus service
Killing Alert service

Here the error message is some nonsense about free ports, but looking into the container logs and stracing the process tree shows that Prometheus is failing with filesystem permission errors (on queries.active), apparently because some files are created with the wrong UID. (I won't go deeper into the details for now, because this might be a problem with my system.)
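
A quick way to confirm a wrong-UID suspicion is to walk the restored data directory and list everything not owned by the expected user. The sketch below is only an illustration under that assumption: `find_unexpected_owners` is a hypothetical helper, and the thread does not establish which UID Prometheus actually expects inside the container.

```python
import os


def find_unexpected_owners(root, expected_uid=None):
    """Return sorted (path, uid) pairs under root not owned by expected_uid."""
    if expected_uid is None:
        expected_uid = os.getuid()  # default: the user running this script
    mismatches = []
    # os.walk visits every real directory once; symlinked directories are
    # not followed, so their targets are not inspected.
    for dirpath, _dirnames, filenames in os.walk(root):
        candidates = [dirpath] + [os.path.join(dirpath, n) for n in filenames]
        for path in candidates:
            uid = os.lstat(path).st_uid  # lstat: do not follow symlinks
            if uid != expected_uid:
                mismatches.append((path, uid))
    return sorted(mismatches)
```

Pointing it at the monitoring_data_dir under the temporary directory from the log would show whether queries.active or its parent directory belongs to a different UID than the one running the stack.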

@michoecho
Contributor Author

Bottom line: show-monitor seems to be broken on a fresh Fedora with podman. Please check whether that's really the case or whether I'm just doing something wrong.

(It's also broken in a different way on my own non-fresh Fedora, but that could be something wrong with my system, so please just assure me that it works on a fresh Fedora, and I might investigate my own problems from there).

@fruch
Contributor

fruch commented Jan 9, 2024

I've found two issues:

  1. podman now defaults to short-name-mode="enforcing", so the monitoring stack isn't able to pull images when working via the socket: there is no TTY on which podman could ask which repository the short name should be aliased to. Setting short-name-mode="permissive" in /etc/containers/registries.conf alleviates this issue.

  2. scylla-monitoring uses DOCKER_HOST inside its scripts, which causes the configuration we set for it to be ignored. I think it's a bug: those variables are used for something completely different and need to be renamed. Once I renamed them, I was able to start the stack manually via the podman socket:

DOCKER_HOST=tcp://localhost:3345 bash -x ./start-all.sh -v 2024.1
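
For the first point, the change to registries.conf can be scripted. What follows is a rough sketch rather than the fix that went into SCT: `set_short_name_mode` is a hypothetical helper that works on the file's text, and it puts the key on the first line because top-level TOML keys have to appear before any table header.

```python
import re

# Values accepted for short-name-mode in containers-registries.conf.
VALID_MODES = ("enforcing", "permissive", "disabled")


def set_short_name_mode(conf_text, mode="permissive"):
    """Return conf_text with a single top-level short-name-mode setting."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported short-name-mode: {mode!r}")
    # Drop existing active settings; commented-out lines are left untouched.
    kept = [
        line for line in conf_text.splitlines()
        if not re.match(r"\s*short-name-mode\s*=", line)
    ]
    return "\n".join([f'short-name-mode = "{mode}"', *kept]) + "\n"
```

The result would then be written back to /etc/containers/registries.conf, or to ~/.config/containers/registries.conf for a per-user setting.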

fruch added a commit to fruch/scylla-cluster-tests that referenced this issue Jan 10, 2024
* podman now defaults to short-name-mode="enforcing", and the monitor stack
  isn't able to pull images when working via socket,
  since it doesn't have a tty on which to ask which repository to alias to;
  setting short-name-mode="permissive" in ~/.config/containers/registries.conf
  alleviates this issue

* scylla-monitoring was using DOCKER_HOST inside its start scripts, hence breaking
  our ability to use it to point to the podman socket; a fix was submitted,
  but until then we patch those out of the scripts so older collected monitoring
  data can be used

Ref: scylladb/scylla-monitoring#2149
Fixes: scylladb#7080
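
The commit message does not spell out how the scripts are patched. One possible approach, sketched here purely as an assumption, is to rename the script-internal variable so it no longer shadows the DOCKER_HOST that points at the podman socket. The helper name `rename_shell_variable` and the replacement name `MONITORING_DOCKER_HOST` are both hypothetical.

```python
import re


def rename_shell_variable(script_text, old="DOCKER_HOST",
                          new="MONITORING_DOCKER_HOST"):
    """Rename whole-word occurrences of a shell variable in script text."""
    # The lookarounds keep longer identifiers such as MY_DOCKER_HOST_X
    # intact, and make a second run over already patched text a no-op.
    pattern = re.compile(
        rf"(?<![A-Za-z0-9_]){re.escape(old)}(?![A-Za-z0-9_])"
    )
    return pattern.sub(new, script_text)
```

Applied to each start script in the extracted scylla-monitoring-src directory before start-all.sh runs, this would leave the exported DOCKER_HOST from the environment untouched by the scripts.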
fruch added a commit to fruch/scylla-cluster-tests that referenced this issue Jan 11, 2024
fruch added a commit that referenced this issue Jan 11, 2024