Update smartctl.py #46

Open
wants to merge 5 commits into
base: master
Conversation

1234Erwan

Try to add hard raid support for smartctl.py

Try to add hard raid support for smartctl.py

Signed-off-by: 1234Erwan <[email protected]>
Signed-off-by: 1234Erwan <[email protected]>
@olivierlambert
Member

Hey thanks a lot for your contribution! We'll review your PR ASAP 👍

Contributor

@gthvn1 gthvn1 left a comment

Thx a lot for your PR. I added some comments.

@@ -14,32 +14,47 @@ def _list_disks():
disks = []
result = run_command(['smartctl', '--scan'])
for line in result['stdout'].splitlines():
if line.startswith('/dev/') and not line.startswith('/dev/bus/'):
disks.append(line.split()[0])
disks.append(line.split()[0])
Contributor

I think that you can get the type of the disk here. No need for an extra function _list_raids.
Maybe return a tuple here with line.split()[0] and line.split()[2].

Author

I've tried to solve that.

Contributor

@gthvn1 gthvn1 Aug 26, 2024

I think the most readable way to do that is to use a dict. For example you can do:
devices.append({'name': line.split()[0], 'type': line.split()[2]})
This way we know what to expect, and when calling smartctl later in the code you won't need the index: you will be able to use the explicit field.
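
As a rough illustration of the dict-based approach, here is a minimal sketch that parses `smartctl --scan` text into a list of dicts. The helper name `parse_scan_output` is hypothetical (it is not part of the plugin), and the sketch assumes each scan line has the shape `<name> -d <type> # <comment>`, as in the outputs quoted in this thread.

```python
def parse_scan_output(stdout):
    """Turn `smartctl --scan` text into a list of {'name', 'type'} dicts.

    Hypothetical helper for illustration only; the plugin itself gets the
    text from its own run_command() wrapper.
    """
    devices = []
    for line in stdout.splitlines():
        fields = line.split()
        # Assumed line shape: <name> -d <type> # <comment>; skip anything else.
        if len(fields) < 3 or fields[1] != '-d':
            continue
        devices.append({'name': fields[0], 'type': fields[2]})
    return devices


sample = (
    "/dev/sda -d scsi # /dev/sda, SCSI device\n"
    "/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device\n"
)
devices = parse_scan_output(sample)
for device in devices:
    # Explicit fields instead of positional indexes.
    print(device['name'], device['type'])
```

Callers can then write `device['type']` and `device['name']` directly, which avoids the paired-index bookkeeping.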

Contributor

@gthvn1 gthvn1 Aug 26, 2024

Note that I used devices here because it is closer to the smartctl man page and, IMHO, since we are getting the device name and the device type, it looks better :)

Contributor

And also rename _list_disks to _list_devices if you decide to rename it.

return disks

@error_wrapped
def _list_raids():
Contributor

It looks like you are getting the type, which is not specific to RAIDs. I also think this function is not needed because we already do the scan in _list_disks(). See my previous comment.

@error_wrapped
def get_information(session, args):
results = {}
raids = {}
Contributor

not needed if disks holds a tuple with device and type

with OperationLocker():
disks = _list_disks()
raids = _list_raids()
for disk in disks:
Contributor

here you can get disk, disk_type from disks if it holds the tuple

Removed unnecessary lines from smartctl.py

Signed-off-by: 1234Erwan <[email protected]>
@gthvn1
Contributor

gthvn1 commented Sep 3, 2024

I thought about a solution like this instead of managing an index.

--- a/SOURCES/etc/xapi.d/plugins/smartctl.py
+++ b/SOURCES/etc/xapi.d/plugins/smartctl.py
@@ -14,36 +14,31 @@ def _list_disks():
     disks = []
     result = run_command(['smartctl', '--scan'])
     for line in result['stdout'].splitlines():
-        disks.append(line.split()[2])
-        disks.append(line.split()[0])
+        disks.append({'name': line.split()[0], 'type': line.split()[2]})
     return disks

 @error_wrapped
 def get_information(session, args):
     results = {}
-    i = 0
     with OperationLocker():
         disks = _list_disks()
         for disk in disks:
-            cmd = run_command(["smartctl", "-j", "-a", "-d", disks[i], disks[i+1]], check=False)
-            results[disk] = json.loads(cmd['stdout'])
-            i = i + 2
+            cmd = run_command(["smartctl", "-j", "-a", "-d", disk['type'], disk['name']], check=False)
+            results[disk['name']] = json.loads(cmd['stdout'])
         return json.dumps(results)

 @error_wrapped
 def get_health(session, args):
     results = {}
-    i = 0
     with OperationLocker():
         disks = _list_disks()
         for disk in disks:
-            cmd = run_command(["smartctl", "-j", "-H", "-d", disks[i], disks[i+1]])
+            cmd = run_command(["smartctl", "-j", "-H", "-d", disk['type'], disk['name']])
             json_output = json.loads(cmd['stdout'])
             if json_output['smart_status']['passed']:
-                results[disk] = "PASSED"
+                results[disk['name']] = "PASSED"
             else:
-                results[disk] = "FAILED"
-            i = i + 2
+                results[disk['name']] = "FAILED"
         return json.dumps(results)

@stormi
Member

stormi commented Oct 17, 2024

How do we move forward on this PR?

@gthvn1
Contributor

gthvn1 commented Oct 17, 2024

How do we move forward on this PR?

I think that managing an index is error-prone and, IMHO, the solution I posted looks better. But I don't have a strong opinion, so if it is OK for you we can merge it and fix it later. @1234Erwan what do you think?

@gthvn1
Contributor

gthvn1 commented Oct 25, 2024

Also I just tested the patch on a physical machine that has a megaraid and it fails:

', stderr: '', Traceback (most recent call last):
  File "/etc/xapi.d/plugins/xcpngutils/__init__.py", line 127, in wrapper
    return func(*args, **kwds)
  File "/etc/xapi.d/plugins/smartctl.py", line 40, in get_health
    cmd = run_command(["smartctl", "-j", "-H", "-d", disks[i], disks[i+1]])
  File "/etc/xapi.d/plugins/xcpngutils/__init__.py", line 79, in run_command
    raise ProcessException(code, command, stdout, stderr)
ProcessException: Command '['smartctl', '-j', '-H', '-d', 'megaraid,0', '/dev/bus/0']' failed with code: 4

In fact, the command smartctl -j -H -d megaraid,0 /dev/bus/0 does return valid output:

[10:01 r620-x1 ~]# smartctl -j -H -d megaraid,0 /dev/bus/0
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      0
    ],
    "svn_revision": "4883",
    "platform_info": "x86_64-linux-4.19.0+1",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "-j",
      "-H",
      "-d",
      "megaraid,0",
      "/dev/bus/0"
    ],
    "messages": [
      {
        "string": "Warning: This result is based on an Attribute check.",
        "severity": "warning"
      }
    ],
    "exit_status": 4
  },
  "device": {
    "name": "/dev/bus/0",
    "info_name": "/dev/bus/0 [megaraid_disk_00] [SAT]",
    "type": "sat+megaraid,0",
    "protocol": "ATA"
  },
  "smart_status": {
    "passed": true
  }
}

But the return code is not 0; it is 4.
So we need to understand whether this exit code is "normal" and, if so, handle it so that megaraid devices can be used.
Note: this error happens only with the health function; the information function works well (it already passes check=False). That is probably why you didn't notice it, @1234Erwan.
@1234Erwan I will take some time to look into how to fix it.

Contributor

@gthvn1 gthvn1 left a comment

Health commands failed on my host; pytest and pycodecheck also fail.
No worries, we will have a look.

@gthvn1
Contributor

gthvn1 commented Oct 25, 2024

In fact, smartctl's exit status is a bitmask. The value 4 corresponds to bit 2: some SMART or other ATA command to the disk failed, or there was a checksum error in a SMART data structure ("DISK FAILING" would be bit 3, i.e. 8).
So I think we need a better check of the exit status: since the return value of smartctl is a bitmask, we can extract useful information for the user.
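
To make the bitmask idea concrete, here is a minimal sketch of decoding a smartctl exit status. The bit meanings are paraphrased from the EXIT STATUS section of the smartctl(8) man page. The names `SMARTCTL_EXIT_BITS` and `decode_exit_status` are hypothetical and not part of the plugin.

```python
# Bit meanings paraphrased from the smartctl(8) man page, EXIT STATUS section.
SMARTCTL_EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed, no IDENTIFY DEVICE structure, or device in low-power mode",
    2: "some SMART or other ATA command failed, or checksum error in a SMART data structure",
    3: 'SMART status check returned "DISK FAILING"',
    4: "prefail attributes found at or below threshold",
    5: "SMART status OK, but some attributes were at or below threshold in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_exit_status(code):
    """Return the list of messages for every bit set in a smartctl exit code."""
    return [
        message
        for bit, message in sorted(SMARTCTL_EXIT_BITS.items())
        if code & (1 << bit)
    ]


# Exit code 4 from the megaraid host above has only bit 2 set.
for message in decode_exit_status(4):
    print(message)
```

With a decoder like this, the plugin could report which bits were set instead of treating any non-zero exit status as a hard failure.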

@1234Erwan
Author

My Python skills are quite limited. What is sure is that, even if my script is not clean at all, it works on my hardware RAID.

@gthvn1
Copy link
Contributor

gthvn1 commented Oct 25, 2024

My Python skills are quite limited. What is sure is that, even if my script is not clean at all, it works on my hardware RAID.

When you say it works, you also mean xe host-call-plugin host-uuid=<uuid> plugin=smartctl.py fn=health, right?
No worries about the skills, it already helps us 👍 I will try to understand why I have the issue and we will update your PR to pass the checks. Thanks for the answer.

@1234Erwan
Author

My Python skills are quite limited. What is sure is that, even if my script is not clean at all, it works on my hardware RAID.

When you say it works, you also mean xe host-call-plugin host-uuid=<uuid> plugin=smartctl.py fn=health, right? No worries about the skills, it already helps us 👍 I will try to understand why I have the issue and we will update your PR to pass the checks. Thanks for the answer.

About 2 months ago I replaced the original plugin on my XCP-ng host with my latest experiment, and it works on my hardware RAID.
Screenshot 2024-10-25 112215

Add usage of tuples and rename disk to device.

Signed-off-by: 1234Erwan <[email protected]>
@1234Erwan
Author

@gthvn1 Thanks for all your suggestions and for your help, I've just implemented your idea.
It works for me; I'm waiting to know if it works for you too.

results[disk] = json.loads(cmd['stdout'])
devices = _list_devices()
for device in devices:
cmd = run_command(["smartctl", "-j", "-a", "-d", devices[i]['name'], devices[i]['type']], check=False)
Contributor

@gthvn1 gthvn1 Oct 25, 2024

I think that you don't need the index anymore. Now you can use device['name'] instead of devices[i]['name'], so you can remove the initialization i = 0 and also the increment i = i + 1.

cmd = run_command(["smartctl", "-j", "-H", disk])
devices = _list_devices()
for device in devices:
cmd = run_command(["smartctl", "-j", "-H", "-d", devices[i]['name'], devices[i]['type']])
Contributor

Same here. If I'm right, you don't need the index anymore.

Contributor

@gthvn1 gthvn1 Oct 25, 2024

Also, I understand why it fails on my machine: the exit status of the command is 4. In your case I think that it is 0, so it is OK. To fix it on my machine you just need to add the parameter check=False at the end of the call. Note that -d takes the device type, so the type must come before the name. Like this: cmd = run_command(["smartctl", "-j", "-H", "-d", device['type'], device['name']], check=False)
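
The effect of check=False can be sketched with the standard library alone. This is not the plugin's run_command helper: it uses `subprocess.run` as a stand-in, assumes Python 3, and replaces smartctl with a small Python one-liner that prints valid JSON and then exits with status 4, mimicking the megaraid case above. The names `run_json_command` and `fake_smartctl` are hypothetical.

```python
import json
import subprocess
import sys


def run_json_command(argv):
    """Run argv and return (exit_code, parsed_json) without raising on non-zero exit."""
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=False,  # do not raise CalledProcessError when the exit code is non-zero
    )
    return proc.returncode, json.loads(proc.stdout)


# Stand-in for smartctl (assumption for illustration): valid JSON on stdout, exit status 4.
fake_smartctl = [
    sys.executable,
    "-c",
    "import json, sys; "
    "print(json.dumps({'smart_status': {'passed': True}})); "
    "sys.exit(4)",
]

code, output = run_json_command(fake_smartctl)
health = "PASSED" if output['smart_status']['passed'] else "FAILED"
print(code, health)
```

The JSON payload is still usable even though the process reported a non-zero status, which is why the health call needs to tolerate it.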

Signed-off-by: 1234Erwan <[email protected]>
@1234Erwan
Author

It's done. You've finally written the entirety of the code yourself 😅.

@gthvn1
Contributor

gthvn1 commented Oct 25, 2024

Your code helped us think about the issue, and even if it changed a little bit, it helped us a lot 👍
Also, can you confirm that your last commit is working well on your host?

@gthvn1
Contributor

gthvn1 commented Oct 25, 2024

And for information, I have created a branch gtn-pr-46 that has your current commits. Because it is a local branch, it triggers more checks (like pytest and pycodestyle) than yours. So don't worry: I will use this new branch and fix the tests, and your commits will be included. I will probably squash them into one commit, but your contribution will remain :)

@1234Erwan
Author

Fine, do as you please. I don't care much for my meager contribution. I much prefer to know that smartctl will work for everyone.

@1234Erwan
Author

Your code helped us think about the issue, and even if it changed a little bit, it helped us a lot 👍 Also, can you confirm that your last commit is working well on your host?

No, it didn't work. As you said earlier, the exit status is not 0.

devices = _list_devices()
for device in devices:
cmd = run_command(["smartctl", "-j", "-a", "-d", device['name'], device['type']], check=False)
results[device] = json.loads(cmd['stdout'])
Contributor

This shouldn't work because device is a dict and cannot be used as a key. The problem with megaraid is that we now have 2 devices with the same name. Something like:

/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device

So if we use the name as the key, as the current version does, the second /dev/bus/0 will overwrite the first one. In XO we will see information and health, but the info will only be for the last device... Maybe that is OK. Otherwise I am thinking of using a key like /dev/bus/0@megaraid,1. I'm currently checking with the XO team what can be done. For now just use device["name"], I think.
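
The overwrite problem and the `name@type` idea can be sketched as follows. The helper `device_key` is hypothetical, and the key format is just the one floated in this comment, not something XO is known to accept.

```python
def device_key(device):
    """Build a key such as '/dev/bus/0@megaraid,1' (hypothetical format)."""
    return "{}@{}".format(device['name'], device['type'])


devices = [
    {'name': '/dev/bus/0', 'type': 'megaraid,0'},
    {'name': '/dev/bus/0', 'type': 'megaraid,1'},
]

# Keyed by name only: the second device silently replaces the first.
by_name = {device['name']: device['type'] for device in devices}

# Keyed by name and type: both devices are kept.
by_key = {device_key(device): device['type'] for device in devices}

print(len(by_name), len(by_key))
```

Keying by name alone keeps a single entry for the two megaraid disks, while the composite key keeps both.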

Author

With my old program that works with increments of i, it worked.

Contributor

Can you give me the output of smartctl --scan? With it I will be able to reproduce what _list_devices() returns and run some unit tests.
When you say it works, did you try xe host-call-plugin host-uuid=<HOSTUUID> plugin=smartctl.py fn=information and xe host-call-plugin host-uuid=<HOSTUUID> plugin=smartctl.py fn=health, or did you just look at the result in XO?

Author

I see that from the xe host-call.
image

cmd = run_command(["smartctl", "-j", "-H", disk])
devices = _list_devices()
for device in devices:
cmd = run_command(["smartctl", "-j", "-H", "-d", device['name'], device['type']], check=False)
json_output = json.loads(cmd['stdout'])
if json_output['smart_status']['passed']:
results[disk] = "PASSED"
Contributor

@gthvn1 gthvn1 Oct 25, 2024

disk looks wrong... should be device["name"] probably ;)

Author

I will test that immediately.

Author

That doesn't seem to work.

Contributor

Do you think that you can give this smartctl.py a try?

Author

sure

Contributor

I think I found the issue. This new version of smartctl.py should work better. Can you give it a try?
And thanks a lot for your help and patience in testing it 👍 I don't have such hardware myself, so it is really appreciated.

Author

Hello, I have no problem testing the code, moreover it only takes two commands 😉

Contributor

So the last version should work 🤞

Contributor

@1234Erwan, have you had the opportunity to test the latest version?


Hi @gthvn1, I pulled your latest smartctl.py and gave it a spin. It runs successfully on my host (XO reports "All disks are healthy"; no errors in /var/log/xensource.log).

For context:

# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], SCSI device
/dev/bus/0 -d megaraid,5 # /dev/bus/0 [megaraid_disk_05], SCSI device
/dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], SCSI device
/dev/bus/0 -d megaraid,7 # /dev/bus/0 [megaraid_disk_07], SCSI device

5 participants