WIP: Add an example of how to checkpoint #2088

abejgonzalez · 2024-10-17T00:36:22Z

This is a cleaner example of how to checkpoint into user-space Linux than the one in the documentation (the documentation is good for bare-metal). We should eventually upstream this branch and its fixes. Leaving this here for others temporarily.

Related PRs / Issues:

Type of change:

Bug fix
New feature
Other enhancement

Impact:

RTL change
Software change (RISC-V software)
Build system change
Other

Contributor Checklist:

Did you set main as the base branch?
Is this PR's title suitable for inclusion in the changelog and have you added a changelog:<topic> label?
Did you state the type-of-change/impact?
Did you delete any extraneous prints/debugging code?
Did you mark the PR with a changelog: label?
(If applicable) Did you update the conda .conda-lock.yml file if you updated the conda requirements file?
(If applicable) Did you add documentation for the feature?
(If applicable) Did you add a test demonstrating the PR?

(If applicable) Did you mark the PR as Please Backport?

CI Help:
Add the following labels to modify the CI for a set of features.
Generally, an added label only affects subsequent changes to the PR (i.e., new commits, force pushing, closing/reopening).
See ci:* for full list of labels:

ci:fpga-deploy - Run FPGA-based E2E testing
ci:local-fpga-buildbitstream-deploy - Build local FPGA bitstreams for platforms that are released
ci:disable - Disable CI

abejgonzalez · 2024-12-05T17:32:50Z

There are a few major issues with architectural checkpoints as they currently exist:

Ideally, state is loaded into all harts first and then all harts are started at the same time. This reduces the chance that a multi-core simulation hits a race condition in which one hart starting before another leads to simulation stalls. One place where this isn't supported is https://github.com/ucb-bar/testchipip/blob/6b2eb77dda56998b05030567e9998680b7fc55fc/src/main/resources/testchipip/csrc/testchip_dtm.cc#L287, which loads the CLINT with mtime{cmp} using write_chunk. write_chunk restarts the main hart unnecessarily. There should be a non-restarting version of this function in dtm_t.
The current checkpointing script uses the 1st hart's PC or instruction to decide where the checkpoint starts (see the pc/insn check in chipyard/scripts/generate-ckpt.sh, line 150 in d4ebf93):

echo "until insn 0 $INSN" >> $CMDS_FILE

Instead, this check should be made against all harts in the system.
Since the harts cannot all be restarted at the same time, it makes sense to restart the main hart (the hart that saw the checkpoint PC or instruction) first and the other harts afterwards. This allows checkpointing in Linux at a point where the main hart is running and the other harts are waiting.
PLIC state is not restored in the checkpoint. If the checkpoint is taken in Linux after the interrupt controller has been set up (see screenshot below), the PLIC will already be configured (enable bits, priorities, thresholds). This matters only when a device on the system uses interrupts; otherwise the PLIC will not be set up, since there are no interrupt sources.

MMIO device state is not checkpointed. For example, the UART can be used with the checkpoint if its MMIO registers are reset to the proper values after the checkpoint is loaded, e.g., by using write_chunk to set the IE, TXEN, RXEN, TXMARK, and RXMARK registers.
write_chunk in the DTM code defaults to doing 64-bit writes irrespective of its function arguments. Therefore you can't write to addresses that are not 64-bit aligned (problematic for MMIO registers, which are often aligned to 32-bit addresses).
MAJOR: Checkpointing for Linux doesn't work due to non-trivial behavior involving virtual memory. In my specific test case, after a checkpoint is reloaded, Linux jumps to an incorrect address, leading to a failure later. This incorrect address was read from the stack through the mapping VA=VA1, PA=PA1. While the VA is correct, the PA associated with it is incorrect; it should be PA2. In Spike, Linux properly changes the mapping to VA1 -> PA2, unlike in the RTL. It is unclear why this is happening. To fix this issue, the Spike cosimulation should be changed to do the following:

Pass the PA of each LD/ST from the RTL into the cosim Spike (from here on just called Spike). Spike checks its memory to see what the translation for that LD/ST is. If it matches the RTL's PA, Spike stores that VA->PA mapping in a new infinite TLB (a "TLB holding 'valid mappings' verified by the RTL"), i.e., the mapping is "good". If the translation from Spike's memory doesn't match, first check the infinite TLB to see whether the most recent entry matches the RTL's mapping. If it does, make that entry recent in the TLB. Otherwise, there is a divergence.
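
The check described above could be sketched roughly as follows. This is only a Python model of the decision logic, not Spike or testchipip code: the names VerifiedTlb, Verdict, and spike_translate are hypothetical, 4 KiB pages are assumed, and "most recent entry" is read here as "the last RTL-verified mapping for that virtual page".

```python
from collections import OrderedDict
from enum import Enum
from typing import Callable, Optional

PAGE_SHIFT = 12  # assumption: 4 KiB pages, so mappings are compared per page


class Verdict(Enum):
    VERIFIED = "verified"  # Spike's own translation agrees with the RTL's PA
    STALE_OK = "stale_ok"  # only a previously verified mapping agrees
    DIVERGED = "diverged"  # nothing agrees with the RTL's PA


class VerifiedTlb:
    """Unbounded "TLB" of VA->PA page mappings that the RTL has confirmed."""

    def __init__(self, spike_translate: Callable[[int], Optional[int]]):
        # spike_translate is a hypothetical hook standing in for Spike's
        # page-table walk: it returns the PA for a VA, or None if unmapped.
        self._translate = spike_translate
        self._entries: "OrderedDict[int, int]" = OrderedDict()  # vpn -> ppn

    def check(self, va: int, rtl_pa: int) -> Verdict:
        vpn, rtl_ppn = va >> PAGE_SHIFT, rtl_pa >> PAGE_SHIFT
        spike_pa = self._translate(va)
        if spike_pa is not None and (spike_pa >> PAGE_SHIFT) == rtl_ppn:
            # Spike and the RTL agree: record the mapping as "good".
            self._entries[vpn] = rtl_ppn
            self._entries.move_to_end(vpn)
            return Verdict.VERIFIED
        if self._entries.get(vpn) == rtl_ppn:
            # Spike's memory disagrees, but the RTL is using the last
            # verified mapping for this page: refresh its recency.
            self._entries.move_to_end(vpn)
            return Verdict.STALE_OK
        return Verdict.DIVERGED
```

With a walker backed by a small dictionary, a load that both sides translate identically returns VERIFIED. If Spike's page table is then remapped while the RTL keeps using the old PA, the same page returns STALE_OK, and an RTL PA that was never verified returns DIVERGED. Whether STALE_OK should be tolerated indefinitely or only until an sfence.vma is a design question this sketch leaves open.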

Add an example of how to checkpoint

4ad5f90

abejgonzalez added changelog:added ci:disable labels Oct 17, 2024

abejgonzalez self-assigned this Oct 17, 2024
