gimlet-seq: Record why the power state changed #1953

Merged Dec 19, 2024 (16 commits)

@hawkw hawkw commented Dec 14, 2024

Currently, when the Gimlet CPU sequencer changes the system's power
state, no information about why the power state changed is recorded by
the SP firmware. A system may power off or reboot for a variety of
reasons: it may be requested by the host OS over IPCC, by the control
plane over the management network, or triggered by the thermal task due
to an overheat condition. This makes debugging an unexpected reboot or
power off difficult, as the SP ringbuffers and other diagnostics do not
indicate why an unexpected power state change occurred. See #1950 for a
motivating example.

This commit resolves this as described in #1950 by adding a new field to
the SetState variant in the drv-gimlet-seq-server ringbuffer, so
that the reason a power state change occurred can be recorded. A new IPC
function, Sequencer.set_state_with_reason, is added to the cpu_seq
IPC API. This is equivalent to Sequencer.set_state, but takes a
StateChangeReason argument in addition to the desired power state, and
the sequencer task will record the provided reason in
its ringbuffer. This way, we can distinguish between the various reasons
a power state change may have occurred when debugging such issues.

All Hubris-internal callers of Sequencer.set_state are updated to
instead use Sequencer.set_state_with_reason. In particular,
host-sp-comms will record a variety of different StateChangeReasons,
allowing us to indicate whether the host requested a normal
power-off/reboot, the host OS panicked or failed to boot, or the host
CPU reset itself. Other callers like control-plane-agent and thermal
are simpler and just say "it was the control plane" or "overheat",
respectively. For backwards compatibility with existing callers of
Sequencer.set_state via hiffy, the set_state IPC is left as-is,
and will be recorded in the ringbuffer with StateChangeReason::Other.
Since all Hubris tasks now use the new API, Other basically just
means hiffy.
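
As a rough illustration of the two entry points described above, here is a minimal sketch in plain Rust. It is not the actual Hubris code: the `Vec`-backed trace, the reduced `PowerState` enum, the `why` field name, and the `Sequencer` struct itself are assumptions made for the example. Only the `set_state` / `set_state_with_reason` split and the `Other` fallback come from the description.

```rust
// Hypothetical, simplified stand-ins for the types named in the description.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum PowerState {
    A2,
    A0,
}

#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum StateChangeReason {
    InitialPowerOn,
    ControlPlane,
    HostPowerOff,
    HostReboot,
    Overheat,
    /// No reason was provided (in practice, a caller of the legacy IPC).
    Other,
}

#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum Trace {
    SetState {
        prev: PowerState,
        next: PowerState,
        why: StateChangeReason,
    },
}

/// Assumption: a `Vec` stands in for the fixed-size ring buffer.
pub struct Sequencer {
    state: PowerState,
    trace: Vec<Trace>,
}

impl Sequencer {
    pub fn new() -> Self {
        Self { state: PowerState::A2, trace: Vec::new() }
    }

    /// New entry point: the caller says why the state is changing.
    pub fn set_state_with_reason(&mut self, next: PowerState, why: StateChangeReason) {
        self.trace.push(Trace::SetState { prev: self.state, next, why });
        self.state = next;
    }

    /// Legacy entry point, kept for existing callers; recorded as `Other`.
    pub fn set_state(&mut self, next: PowerState) {
        self.set_state_with_reason(next, StateChangeReason::Other);
    }

    pub fn trace(&self) -> &[Trace] {
        &self.trace
    }
}
```

Keeping `set_state` as a thin wrapper means both paths produce the same kind of trace entry, and an `Other` entry points at a caller that has not been updated.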

The StateChangeReason enum also generates counters, so that the total
number of power state changes can be tracked.
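
To show what per-reason counters give us, here is a hand-rolled sketch. It does not use the Hubris counters machinery; `ReasonCounters` and its methods are hypothetical names for the example, and the enum is reduced to the variants named in this PR.

```rust
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum StateChangeReason {
    InitialPowerOn,
    ControlPlane,
    HostPowerOff,
    HostReboot,
    Overheat,
    Other,
}

impl StateChangeReason {
    /// Number of variants; must be kept in sync with the enum by hand here.
    pub const COUNT: usize = 6;

    fn index(self) -> usize {
        self as usize
    }
}

/// Hypothetical counter table: one slot per reason.
pub struct ReasonCounters {
    counts: [u32; StateChangeReason::COUNT],
}

impl ReasonCounters {
    pub const fn new() -> Self {
        Self { counts: [0; StateChangeReason::COUNT] }
    }

    /// Bump the counter for one reason, saturating rather than wrapping.
    pub fn record(&mut self, why: StateChangeReason) {
        let slot = &mut self.counts[why.index()];
        *slot = slot.saturating_add(1);
    }

    pub fn get(&self, why: StateChangeReason) -> u32 {
        self.counts[why.index()]
    }

    /// Total number of recorded power state changes across all reasons.
    pub fn total(&self) -> u32 {
        self.counts.iter().fold(0u32, |acc, &c| acc.saturating_add(c))
    }
}
```

Unlike ring buffer entries, counts like these are not overwritten when the buffer wraps.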

Also, while I was here, I've changed the Trace::SetState entry in the
drv-gimlet-seq-server ringbuf from a tuple-like enum variant to a
struct-like enum variant with named fields. This entry includes two
PowerState fields, one recording the previous power state and the
other recording the new power state that has been set. IMHO, using a
tuple-like variant to represent this is a bit unfortunate, as in
Humility, we'll see two values of the same type and it's not immediately
obvious which is the previous state and which is the new state. This
must be determined based on the order of the fields in the ringbuf
entry, which requires referencing the Hubris code to determine.

I felt like it was nicer to just use a struct-like variant with named
fields for this. That way, the semantic meaning of the two PowerStates
is actually encoded in the debug info, and Humility can just indicate
which is the previous state and which is the new state when displaying
the ring buffer. I also think it's a bit nicer to name the timestamp
field --- otherwise, it just looks like some arbitrary integer, and you
need to look at the code to determine that it's the timestamp of the
power state change.
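
The difference is easy to see with Rust's derived `Debug` output, used here only as an analogy for what a debugger can display from debug info. This is a sketch: the field names `prev`, `next`, and `now` are assumptions for the example, not necessarily the names used in the ringbuf entry.

```rust
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum PowerState {
    A2,
    A0,
}

/// Tuple-like variant: the two states are only distinguishable by position.
#[derive(Debug)]
pub enum TupleTrace {
    SetState(PowerState, PowerState, u64),
}

/// Struct-like variant: each field carries its meaning in its name.
#[derive(Debug)]
pub enum NamedTrace {
    SetState { prev: PowerState, next: PowerState, now: u64 },
}

pub fn render_tuple() -> String {
    format!("{:?}", TupleTrace::SetState(PowerState::A2, PowerState::A0, 1234))
}

pub fn render_named() -> String {
    format!(
        "{:?}",
        NamedTrace::SetState { prev: PowerState::A2, next: PowerState::A0, now: 1234 }
    )
}
```

Here `render_tuple()` yields `SetState(A2, A0, 1234)`, while `render_named()` yields `SetState { prev: A2, next: A0, now: 1234 }`, which can be read without looking up the field order.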

Fixes #1950

Currently, when the Gimlet CPU sequencer changes the system's power
state, no information about *why* the power state changed is recorded by
the SP firmware. A system may power off or reboot for a variety of
reasons: it may be requested by the host OS over IPCC, by the control
plane over the management network, or triggered by the thermal task due
to an overheat condition. This makes debugging an unexpected reboot or
power off difficult, as the SP ringbuffers and other diagnostics do not
indicate why an unexpected power state change occurred. See #1950 for a
motivating example.

This commit resolves this as described in #1950 by adding a new field to
the `SetState` variant in the `drv-gimlet-seq-server` ringbuffer, so
that the reason a power state change occurred can be recorded. Clients
of the `cpu_seq` IPC API must now provide a `StateChangeReason` when
calling `Sequencer.set_state`, along with the desired power state, and
the sequencer task will record the provided reason in its ringbuffer.
This way, we can distinguish between the various reasons a power state
change may have occurred when debugging such issues.

The `StateChangeReason` enum also generates counters, so that the total
number of power state changes can be tracked.

Fixes #1950

Currently, the `Trace::SetState` ringbuf entry in the sequencer is a
tuple-like enum variant. This entry includes two `PowerState` fields,
one recording the previous power state and the other recording the new
power state that has been set. IMHO, using a tuple-like variant to
represent this is a bit unfortunate, as in Humility, we'll see two
values of the same type and it's not immediately obvious which is the
previous state and which is the new state. This must be determined based
on the order of the fields in the ringbuf entry, which requires
referencing the Hubris code to determine.

I felt like it was nicer to just use a struct-like variant with named
fields for this. That way, the semantic meaning of the two `PowerState`s
is actually encoded in the debug info, and Humility can just indicate
which is the previous state and which is the new state when displaying
the ring buffer. I also think it's a bit nicer to name the timestamp
field --- otherwise, it just looks like some arbitrary integer, and you
need to look at the code to determine that it's the timestamp of the
power state change.

If this is controversial for some reason, I'm happy to land it in a
separate PR, but I figured it was nice to do while I was messing with
the sequencer ringbuf.
@hawkw hawkw requested review from cbiffle and jgallagher December 14, 2024 19:57
hawkw commented Dec 14, 2024

Comment on lines 37 to 49
pub enum StateChangeReason {
    /// The system has just received power, so the sequencer has booted the
    /// host CPU.
    InitialPowerOn = 1,
    /// A power state change was requested by the control plane.
    ControlPlane,
    /// The host OS requested that the system power off without rebooting.
    HostPowerOff,
    /// The host OS requested that the system reboot.
    HostReboot,
    /// The system powered off because a component has overheated.
    Overheat,
}

I'm very open to suggestions about the naming of these variants, if anyone dislikes the ones I came up with...

drv/cpu-seq-api/src/lib.rs (review thread, resolved)
drv/mock-gimlet-seq-server/src/main.rs (review thread, outdated, resolved)

@jgallagher jgallagher left a comment

Thanks for this! I'm not sure what to suggest about your (very good) question about hiffy backwards compatibility - might be worth checking with mfg and host-os folks who are probably the biggest users of that?

drv/mock-gimlet-seq-server/src/main.rs (review thread, outdated, resolved)
drv/cpu-seq-api/src/lib.rs (review thread, resolved)
Since manufacturing and test automation currently use the
`Sequencer.set_state` IPC via Hiffy, let's avoid breaking it and instead
introduce a separate `Sequencer.set_state_with_reason`. Now, calls to
`set_state` without a reason will get `StateChangeReason::Other`. In
practice, this means Hiffy, as all Hubris-internal callers now use
`set_state_with_reason`.
task/host-sp-comms/src/main.rs (review thread, outdated, resolved)
task/host-sp-comms/src/main.rs (review thread, outdated, resolved)
@hawkw hawkw requested a review from jgallagher December 16, 2024 20:13
Instead of always setting it to `HostReboot`, hang onto the reason for
the last power off until reaching A0, so that the same reason can be
sent with the `set_state` call to reboot as well as with the power off.

Also, just use `StateChangeReason` here instead of our own enum, and add
it to the `host-sp-comms` ringbuf as well.
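
A rough sketch of the "hang onto the reason until A0" idea follows. It is not the actual `host-sp-comms` code: the `HostComms` struct, its method names, and the `Vec` of recorded sequencer calls are all hypothetical, chosen only to show the same reason being reused for the power off and for the power on that follows it.

```rust
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum PowerState {
    A2,
    A0,
}

#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum StateChangeReason {
    HostPowerOff,
    HostReboot,
    Other,
}

/// Hypothetical stand-in for the task's state. `calls` records what would
/// be sent to the sequencer via `set_state_with_reason`.
pub struct HostComms {
    last_power_off: Option<StateChangeReason>,
    reboot_pending: bool,
    calls: Vec<(PowerState, StateChangeReason)>,
}

impl HostComms {
    pub fn new() -> Self {
        Self { last_power_off: None, reboot_pending: false, calls: Vec::new() }
    }

    /// The host asked to power off, optionally followed by a reboot.
    pub fn host_power_off(&mut self, why: StateChangeReason, reboot: bool) {
        self.last_power_off = Some(why);
        self.reboot_pending = reboot;
        self.calls.push((PowerState::A2, why));
    }

    /// The system has settled in A2; finish a pending reboot, reusing the
    /// remembered reason instead of a hard-coded `HostReboot`.
    pub fn on_reached_a2(&mut self) {
        if self.reboot_pending {
            let why = self.last_power_off.unwrap_or(StateChangeReason::Other);
            self.calls.push((PowerState::A0, why));
            self.reboot_pending = false;
        }
    }

    /// Once back in A0, the remembered reason is no longer needed.
    pub fn on_reached_a0(&mut self) {
        self.last_power_off = None;
    }

    pub fn calls(&self) -> &[(PowerState, StateChangeReason)] {
        &self.calls
    }
}
```

With this shape, a reboot shows up as two sequencer calls carrying the same reason, so the A2 and A0 transitions can be correlated in the ringbuf.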
@hawkw hawkw requested a review from rmustacc December 17, 2024 18:51

hawkw commented Dec 17, 2024

@jgallagher Alright, I think all the review feedback has been addressed and I'd love another review whenever you've got the chance!

@jgallagher jgallagher left a comment

LGTM 👍

drv/gimlet-seq-server/src/main.rs (review thread, outdated, resolved)
@hawkw hawkw force-pushed the eliza/sequencer-reason branch from 0939d49 to 007470b Compare December 18, 2024 18:02
@hawkw hawkw enabled auto-merge (squash) December 19, 2024 19:53
@hawkw hawkw merged commit 11da9c0 into master Dec 19, 2024
125 checks passed
@hawkw hawkw deleted the eliza/sequencer-reason branch December 19, 2024 20:00
Linked issue: Gimlet sequencer: Add ringbuf logging for source/cause of A2 transitions