WIP: Add an example of how to checkpoint #2088

Open
wants to merge 1 commit into
base: main
from

Conversation

abejgonzalez
Contributor

This is a cleaner example of how to checkpoint into user-space Linux than the documentation (the documentation is good for bare-metal). We should eventually upstream this branch and its fixes. Leaving this here for others temporarily.

Related PRs / Issues:

Type of change:

  • Bug fix
  • New feature
  • Other enhancement

Impact:

  • RTL change
  • Software change (RISC-V software)
  • Build system change
  • Other

Contributor Checklist:

  • Did you set main as the base branch?
  • Is this PR's title suitable for inclusion in the changelog and have you added a changelog:<topic> label?
  • Did you state the type-of-change/impact?
  • Did you delete any extraneous prints/debugging code?
  • Did you mark the PR with a changelog: label?
  • (If applicable) Did you update the conda .conda-lock.yml file if you updated the conda requirements file?
  • (If applicable) Did you add documentation for the feature?
  • (If applicable) Did you add a test demonstrating the PR?
  • (If applicable) Did you mark the PR as Please Backport?

CI Help:
Add the following labels to modify the CI for a set of features.
Generally, an added label only affects subsequent changes to the PR (i.e. new commits, force pushing, closing/reopening).
See ci:* for full list of labels:

  • ci:fpga-deploy - Run FPGA-based E2E testing
  • ci:local-fpga-buildbitstream-deploy - Build local FPGA bitstreams for platforms that are released
  • ci:disable - Disable CI

@abejgonzalez
Contributor Author

There are a few major issues with architectural checkpoints as they currently exist:

  1. Ideally, state is loaded into all harts first and then all harts are started at the same time. This reduces the chance that a multi-core simulation hits a race condition where one hart starting before another leads to a simulation stall. One place where this isn't supported is https://github.com/ucb-bar/testchipip/blob/6b2eb77dda56998b05030567e9998680b7fc55fc/src/main/resources/testchipip/csrc/testchip_dtm.cc#L287, which loads the CLINT's mtime{cmp} using write_chunk. write_chunk restarts the main hart unnecessarily. There should be a non-restarting version of this function in dtm_t.
  2. The current checkpointing script uses only the first hart's PC or instruction to trigger the checkpoint (see the pc/insn command below, which names hart 0):
    echo "until insn 0 $INSN" >> $CMDS_FILE
    Instead, this check should be made against all harts in the system.
  3. Since the harts cannot all be restarted at the same time, it makes sense to restart the main hart (the hart that saw the checkpoint PC or instruction) first and the others afterwards. This allows checkpointing in Linux where the main hart is running and the other harts are waiting.
  4. PLIC state is not restored from the checkpoint. If the checkpoint is taken in Linux after the interrupt controller is set up (see screenshot below), the PLIC will already have been configured (enable bits, priorities, thresholds). This only matters when the system has a device that uses interrupts; otherwise the PLIC is never set up, since there are no interrupt sources.
image
  5. MMIO device state is not checkpointed. For example, the UART can be used with the checkpoint if its MMIO registers are reset to proper values after the checkpoint is loaded, e.g. by using write_chunk to set the IE, TXEN, RXEN, TXMARK, and RXMARK registers.
  6. write_chunk in the DTM code defaults to doing 64-bit writes irrespective of the function arguments. Therefore you can't write to addresses that are not 64-bit aligned (problematic for MMIO registers, which are often aligned to 32-bit addresses).
  7. MAJOR: Checkpointing for Linux doesn't work due to some non-trivial behavior with virtual memory. In my specific test case, after reloading a checkpoint, Linux jumps to an incorrect address, leading to a failure later. This incorrect address was read from the stack at VA=VA1, PA=PA1. While the VA is correct, the PA associated with it is incorrect; it should be PA2. In Spike, Linux properly maps VA1 -> PA2, unlike in the RTL. It is unclear why this is happening. To fix this issue, the Spike cosimulation should be changed to do the following:

Pass the PA of each LD/ST from the RTL into cosim Spike (from here on just called Spike). Spike checks its own memory to see what the translation is for that LD/ST. If it matches the RTL's PA, Spike stores that VA->PA mapping into a new infinite TLB (a "TLB holding 'valid mappings' verified by the RTL"), i.e. the mapping is "good". If the translation from Spike's memory doesn't match, first check the infinite TLB to see whether its most recent entry for that VA matches the RTL's mapping. If it does, mark that entry as most recent in the TLB. Otherwise, there is a divergence.
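The check described above could be modeled roughly as follows. This is only a Python sketch of the decision logic, not Spike code: the class name VerifiedTlb, the check() signature, and the returned strings are all hypothetical, and it assumes 4 KiB pages and that Spike's own page-table walk result is handed in as spike_paddr (None if the walk faults). It also interprets "most recent entry" as the last RTL-verified PA page recorded for that VA page.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # assumption: 4 KiB pages, superpages ignored in this sketch


class VerifiedTlb:
    """Unbounded map of VA page -> PA page for mappings the RTL has confirmed."""

    def __init__(self):
        # OrderedDict so that entries can be moved to the "most recent" end.
        self._entries = OrderedDict()

    def check(self, vaddr, rtl_paddr, spike_paddr):
        """Classify one LD/ST as 'match', 'stale-ok', or 'divergence'."""
        vpn = vaddr >> PAGE_SHIFT
        rtl_ppn = rtl_paddr >> PAGE_SHIFT

        # Case 1: Spike's walk agrees with the RTL, so record the mapping as good.
        if spike_paddr is not None and (spike_paddr >> PAGE_SHIFT) == rtl_ppn:
            self._entries[vpn] = rtl_ppn
            self._entries.move_to_end(vpn)
            return "match"

        # Case 2: Spike disagrees, but the RTL is using a mapping that was
        # verified earlier; accept it and refresh its recency.
        if self._entries.get(vpn) == rtl_ppn:
            self._entries.move_to_end(vpn)
            return "stale-ok"

        # Case 3: the RTL's mapping was never verified.
        return "divergence"
```

With this model, a first access where both sides agree returns "match"; a later access to the same VA page where only Spike's page tables have changed returns "stale-ok"; any RTL mapping that was never verified returns "divergence".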
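For the 64-bit write_chunk limitation noted in the list above, one possible shape for a width-aware write is to split the payload into naturally aligned accesses. The sketch below is a Python illustration only; plan_writes, its max_width parameter, and the example addresses are hypothetical and are not part of the existing DTM code.

```python
def plan_writes(addr, data, max_width=8):
    """Split a little-endian payload into naturally aligned 4- or 8-byte writes.

    Returns a list of (address, width_in_bytes, value) tuples.
    """
    if addr % 4 != 0 or len(data) % 4 != 0:
        raise ValueError("sketch only handles 32-bit-aligned, word-multiple writes")
    writes = []
    offset = 0
    while offset < len(data):
        cur = addr + offset
        remaining = len(data) - offset
        # Use a 64-bit access only when allowed, 8-byte aligned, and 8 bytes remain.
        width = 8 if (max_width >= 8 and cur % 8 == 0 and remaining >= 8) else 4
        value = int.from_bytes(data[offset:offset + width], "little")
        writes.append((cur, width, value))
        offset += width
    return writes
```

For example, plan_writes(0x10020004, (1).to_bytes(4, "little")) yields a single 32-bit write to 0x10020004 rather than a 64-bit write that would also touch the neighboring register. For MMIO regions, passing max_width=4 keeps every access at 32 bits.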
