8000 Riscv, kernel trapping, and "double" faults · Issue #1374 · seL4/seL4 · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Riscv, kernel trapping, and "double" faults #1374

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
2 of 3 tasks
midnightveil opened this issue Jan 21, 2025 · 5 comments
Open
2 of 3 tasks

Riscv, kernel trapping, and "double" faults #1374

midnightveil opened this issue Jan 21, 2025 · 5 comments

Comments

@midnightveil
Copy link
Contributor
midnightveil commented Jan 21, 2025

Whilst modifying a RISC-V build of seL4, I was getting messages that looked like this:

Caught cap fault in send phase at address 0
while trying to handle:
vm fault on data at address 0 with status ��7
in thread ��ffffffff8401bc00 "idle_thread" at address 0
With stack:
Invalid vspace

The kernel would then continue as if nothing had happened.

This was eventually traced down to the modifications corrupting the stack, eventually resulting in a memory access fault within the kernel memory, which got attributed to the idle thread. 1

  1. I am under the impression that seL4 should not have nested exceptions (in normal operation).

  2. It appears that the AArch64 traps.S code calls a kernelDataAbort() function if there is a trap taken whilst within the kernel, as per

    BEGIN_FUNC(cur_el_sync)
    lsp_i x19
    /* Read esr and branch to respective labels */
    mrs x25, ESR
    lsr x24, x25, #ESR_EC_SHIFT
    cmp x24, #ESR_EC_CEL_DABT
    b.eq cur_el_da
    cmp x24, #ESR_EC_CEL_IABT
    b.eq cur_el_ia
    b cur_el_inv
    cur_el_da:
    #ifdef CONFIG_DEBUG_BUILD
    mrs x0, ELR
    bl kernelDataAbort
    #endif /* CONFIG_DEBUG_BUILD */
    b halt
    cur_el_ia:
    #ifdef CONFIG_DEBUG_BUILD
    mrs x0, ELR
    bl kernelPrefetchAbort
    #endif /* CONFIG_DEBUG_BUILD */
    b halt
    cur_el_inv:
    b invalid_vector_entry
    END_FUNC(cur_el_sync)

  3. In contrast, the RISC-V traps.S assumes that any trap is from u-mode. I'm not quite sure how traps within s-mode end up back here (perhaps openSBI forwards them back?), but this is what I was observing. It looks like the linux kernel stores a 0 in sscratch whilst in the kernel, otherwise it stores it stores the tp (~-> sp), of which we could do a very similar thing: 2. Unless of course there's an architectural way to determine this (mcause? do we have access to that in s-mode?)

  4. We should never get "faults" from the idle thread, since it's a pseudo-thread managed by the kernel?

  5. (This is more of a query) The fact that there is a function called handleDoubleFault that gets invoked by handleFault() when sendFaultIPC fails:

    status = sendFaultIPC(tptr);
    if (status != EXCEPTION_NONE) {
    handleDoubleFault(tptr, fault);

    1. I guess it makes sense in that there is a "fault" when trying to do a fault IPC, but this was confusing to debug because "double fault" makes me think of nested exceptions. The MCS name of handleNoFaultHandler makes more sense. Maybe it would make sense to rename this function, or, at the very least, add a doc comment explaining what this function is doing.
    2. On a similar note, why does most invocations of handleFault() set current_fault global, which is then immediately read back by handleFault()? Most cases seem to be immediately setting current_fault before calling. There are a few cases where this does not do so immediately beforehand, but would it not be clearer for this to be two separate codepaths (if it's even necessary?)

In conclusion (unless there is something wrong in the above), what I think should be adjusted (and can probably help with):

  • Faults originating from the idle_thread should appear as a kernel abort/assertion, as these should not be possible (possibly unnecessary if the next part exists? but does every arch have this?)
  • RISC-V exception code should either:
    1. Detect based off architectural registers that we came from a fault within S-mode, and do kernelDataAbort() as necessary (please advise how), or
    2. Store a sentinel 0 within sscratch that we can check upon a fault to see if we came from the kernel, and then abort
  • Document or rename handleDoubleFault such that its meaning / origin is more clear.

Footnotes

  1. It was in particular when switching to the idle thread, so I think there was nothing that would stop an exception inside s-mode from being attributed to an arbitrary thread.

  2. We seem to store the current thread registers on the core-local stack (SMP) otherwise in ksCurThread's registers (not-SMP). I'm not particularly sure why this couldn't be the same in SMP and non-SMP cases? But irregardless...

@lsf37
Copy link
Member
lsf37 commented Jan 21, 2025
  1. I am under the impression that seL4 should not have nested exceptions (in normal operation).

Yes, that is correct. There should be no faults at all generated while in kernel mode or in the idle thread. There are some delayed faults on ARM that cannot be masked. IIRC, kernelDataAbort was for that, but I might be wrong. Again, IIRC, RISC-V does not have these.

  1. We should never get "faults" from the idle thread, since it's a pseudo-thread managed by the kernel?

Yes

  1. (handleDoubleFault)

handleDoubleFault is commented in the Haskell spec. It's not a good idea to modify the kernel without reading that spec first - a lot of design structure and documentation is in there. Amongst other things what phases invocations go through, etc.

We should put a big comment somewhere that says that. Currently there isn't because that knowledge was usually passed along from the current kernel engineers.

Using the global variable current_fault is a mechanism for implementing the corresponding exception passing mechanism from Haskell without passing structures on the stack or taking addresses of local variables. The compiler is usually better at optimising these accesses than humans, so we are preferring the regular structure over eliminating redundant reads in C.

@midnightveil
Copy link
Contributor Author
midnightveil commented Jan 21, 2025

handleDoubleFault is commented in the Haskell spec. It's not a good idea to modify the kernel without reading that spec first - a lot of design structure and documentation is in there. Amongst other things what phases invocations go through, etc.

Is there a more up to date version of https://sel4.systems/Info/Docs/seL4-spec.pdf anywhere? (even just a single platform is useful, because there is arch-independent things). It makes it a bit easier to ctrl+f through if it's in a single document.
Fair — I had actually run into this before, I just forget about it when I'm spending a lot of time reading C kernel code.

IIRC, kernelDataAbort was for that, but I might be wrong. Again, IIRC, RISC-V does not have these.

I'm not sure whether you're saying RISC-V shouldn't have an equivalent, or if RISC-V doesn't have an equivalent because it has no delayed faults like on ARM. If it's the first, I would argue that it should, because it was useful for catching a kernel bug in the past. At the very least in debug mode would make sense to me, as it would have me time debugging (and probably another person a few months ago with the same code).

@Indanz
Copy link
Contributor
Indanz commented Jan 21, 2025
1. I am under the impression that seL4 should not have nested exceptions (in normal operation).

True, it keeps IRQs disabled while in kernel mode.

4. We should never get "faults" from the idle thread, since it's a pseudo-thread managed by the kernel?

True, but like Gerwin said, ARM has asynchronous faults (SError) which theoretically can arrive much later. Though I don't think it can arrive late enough for a task switch to complete in practice.

Also, we do have an SMP bug somewhere which can cause faults for the idle thread, at least with Clang.

* [ ]  Faults originating from the idle_thread should appear as a kernel abort/assertion, as these should not be possible (possibly unnecessary if the next part exists? but does every arch have this?)* [ ]  RISC-V exception code should either:

It's still good to know if it happens while the idle task is running, as that shouldn't happen.

2. We seem to store the current thread registers on the core-local stack (SMP) otherwise in `ksCurThread`'s registers (not-SMP). I'm not particularly sure why this couldn't be the same in SMP and non-SMP cases?

It could be the same. It's simply that unicore code came first and was kept unmodified when SMP was added, to avoid invalidating any proofs.

@lsf37
Copy link
Member
lsf37 commented Jan 22, 2025

Is there a more up to date version of https://sel4.systems/Info/Docs/seL4-spec.pdf anywhere? (even just a single platform is useful, because there is arch-independent things).

There currently isn't but there probably should be. I've made a TODO item for myself to make it at least a CI artefact, so that it should become available for all proof configs that are built regularly. Then it would also be relatively easy to update during releases.

This is the abstract spec, though, not the Haskell code. The Haskell has comments, the formal spec not so much (it does have some explanations, but the narrative was already there in Haskell so it wasn't duplicated).

IIRC, kernelDataAbort was for that, but I might be wrong. Again, IIRC, RISC-V does not have these.

I'm not sure whether you're saying RISC-V shouldn't have an equivalent, or if RISC-V doesn't have an equivalent because it has no delayed faults like on ARM.

I was just trying to explain why RISC-V doesn't currently have it, but AARCH64 does. It's not an oversight or bug, but it does sound like it would be useful for debugging to have it. If it's in assembly and basically a new entry point, it's not going to be covered by the proofs, so we should be extra careful to make sure it either only exists in debug mode or that it cannot interfere with normal operation.

@midnightveil
Copy link
Contributor Author

This is the abstract spec, though, not the Haskell code. The Haskell has comments, the formal spec not so much (it does have some explanations, but the narrative was already there in Haskell so it wasn't duplicated).

The Haskell is literate haskell, so it's LaTeX. Is there not an equivalent PDF for that? (I might have confused sel4-spec with the haskell spec).

it would be useful for debugging to have it. If it's in assembly and basically a new entry point, it's not going to be covered by the proofs, so we should be extra careful to make sure it either only exists in debug mode or that it cannot interfere with normal operation.

Yep, that's what I was going for :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants
0