smp: Address MCS timekeeping issues in ipiStallCoreCallback #867

Conversation
include/smp/lock.h
Outdated
/* we only handle irq_remote_call_ipi here, as other types of IPIs
 * are async and could be delayed. 'handleIPI' may not return,
 * depending on the value of 'irqPath'. */
handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
It's actually incorrect to call handleIPI here, as the value of big_kernel_lock.node_owners[cpu].next hasn't been correctly initialized until sel4_atomic_exchange returns, so ipiStallCoreCallback, called by handleIPI, will poll the incorrect lock address.
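For readers following along, here is a minimal self-contained sketch of the acquisition step in question. The names big_kernel_lock, node_owners, next, and CLHState_Granted come from this thread; the type definitions, NUM_CPUS, and the use of C11 atomic_exchange in place of sel4_atomic_exchange are illustrative assumptions, not the kernel's verbatim code:

```c
#include <stdatomic.h>

#define NUM_CPUS 4 /* placeholder core count, sketch only */

typedef enum { CLHState_Granted, CLHState_Pending } clh_state_t;

typedef struct {
    _Atomic clh_state_t value;
} clh_qnode_t;

typedef struct {
    clh_qnode_t *node; /* this core's own queue node */
    clh_qnode_t *next; /* predecessor's node; valid only after the exchange */
} clh_qnode_p_t;

static struct {
    clh_qnode_t *_Atomic head;
    clh_qnode_p_t node_owners[NUM_CPUS];
} big_kernel_lock;

static void clh_lock_acquire_sketch(int cpu)
{
    big_kernel_lock.node_owners[cpu].node->value = CLHState_Pending;

    /* node_owners[cpu].next only becomes valid once this exchange has
     * returned. Handling an IPI before that point, as the removed code
     * did, lets ipiStallCoreCallback read a stale 'next' and poll the
     * wrong lock address. */
    big_kernel_lock.node_owners[cpu].next =
        atomic_exchange(&big_kernel_lock.head,
                        big_kernel_lock.node_owners[cpu].node);

    /* Spin until the predecessor grants the lock; only from here on is
     * it safe to handle pending IPIs. */
    while (big_kernel_lock.node_owners[cpu].next->value != CLHState_Granted) {
        /* spin */
    }
}
```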
I agree that it is incorrect, but do we have any data on why we were trying to do handleIPI in here? Were we trying to assure progress of something that otherwise wouldn't happen?
The commit history says that it can take up to seconds until the atomic exchange succeeds. It does not say on which platform; maybe one that was quite bad at SMP support. Modern platforms should be better; ARMv8.1 even brings the swap instruction back.
I was wondering about this code and eventually found this discussion, which has some background: https://sel4.discourse.group/t/why-does-the-clh-lock-not-use-gcc-for-arm-atomics/103. I then created #856
Those long delays may be fixed by the commits "Repair barriers in sel4_atomic_exchange" and "Repair barriers in clh_lock_acquire".
include/smp/lock.h
Outdated
    handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
}

arch_pause();
Why also remove arch_pause()? That call has nothing to do with handling IPIs.
I was essentially trying to revert 3d2ae69, which also added arch_pause(). Before that commit, the x86 variant called __atomic_exchange_n directly and continued to call __atomic_exchange_n in its implementation of try_arch_atomic_exchange. Meanwhile, arch_pause is an empty static inline on ARM.

This part of the lock is supposed to have minimal bus contention, as it's just coordinating insertion into the FIFO queue. If the pause is needed for x86, then it should either be handled by the compiler's implementation of __atomic_exchange_n, or introducing it should have been a separate commit explaining why __atomic_exchange_n alone was insufficient.
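For reference, an illustrative reconstruction of the two shapes being compared. These are fragments rather than compilable code, and the actual diff of 3d2ae69 (including the memory order and the try_arch_atomic_exchange signature) may differ:

```c
/* Before 3d2ae69 (as described above): swap directly into the queue
 * tail; minimal bus traffic, nothing else in the loop. */
prev = __atomic_exchange_n(&big_kernel_lock.head, node, __ATOMIC_ACQ_REL);

/* After 3d2ae69: a retry loop, so that an LL/SC-based ARM exchange may
 * fail and retry, with IPI handling and arch_pause() added inside. */
while (!try_arch_atomic_exchange(&big_kernel_lock.head, node, &prev)) {
    if (clh_is_ipi_pending(cpu)) {
        handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
    }
    arch_pause();
}
```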
Seems you are aiming at something like what I was experimenting with in axel-h#73.
> Seems you are aiming at something like what I was experimenting with in axel-h#73.

I'm trying to resolve the issue with ipiStallCoreCallback on MCS.
Then you should also revert the other changes in that commit, like the introduction of try_arch_atomic_exchange_rlx, because the only reason they were made was to be able to handle IPIs within the while loop. If you remove that, ARM can also just use __atomic_exchange_n.

Reading the links @axel-h posted, long delays can happen on architectures where:
- The exclusive reservation granule is bigger than one cache line.
- Other cores keep writing memory within the lock's granule region for other reasons.

In other words, crappy multicore implementations. I'd rather fix those by adding architecture-dependent extra padding around head, as that fixes the fundamental problem. (The only alternative is to use a different locking algorithm.)
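A sketch of that padding idea; EXCLUSIVE_RESERVATION_GRANULE and clh_head_slot_t are hypothetical names, and 64 is an example value only (on ARM the ERG is discoverable via CTR_EL0 and can be larger than a cache line):

```c
#define EXCLUSIVE_RESERVATION_GRANULE 64 /* example; architecture dependent */

typedef struct clh_qnode clh_qnode_t; /* defined elsewhere */

typedef struct {
    clh_qnode_t *head;
    /* Keep all other, frequently written data out of head's reservation
     * granule, so stores from other cores can't keep cancelling this
     * core's load-exclusive reservation and starve the exchange. */
    char pad[EXCLUSIVE_RESERVATION_GRANULE - sizeof(clh_qnode_t *)];
} __attribute__((aligned(EXCLUSIVE_RESERVATION_GRANULE))) clh_head_slot_t;
```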
include/smp/lock.h
Outdated
    handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
}

arch_pause();
Is removing arch_pause() also intended? It seems removing this addresses another issue: it avoids potential delays where we'd miss a chance to do the atomic exchange.
Only ARM's try_arch_atomic_exchange_rlx may return false, and ARM's arch_pause() is empty, so it's fine to remove. Worth mentioning, though.

I propose renaming try_arch_atomic_exchange_rlx to arch_atomic_exchange_rlx and moving the while loop into ARM's implementation; clh_lock_acquire is the only caller.
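A sketch of what that proposal could look like; the wrapper's signature is assumed, and try_arch_atomic_exchange_rlx stands for the existing ARM helper that can fail when the store-exclusive loses its reservation:

```c
static inline void *arch_atomic_exchange_rlx(void *volatile *ptr, void *new_val)
{
    void *prev;
    /* Retry here rather than in clh_lock_acquire, its only caller, so
     * that callers see an exchange that always succeeds. */
    while (!try_arch_atomic_exchange_rlx(ptr, new_val, &prev)) {
        /* spin until the LL/SC pair succeeds */
    }
    return prev;
}
```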
See #867 (comment)

This draft PR is exploring how to handle time accounting in ipiStallCoreCallback on MCS. I just pushed a version that appears to at least pass sel4test for TX2 SMP MCS and x86_64 SMP MCS.
/* Continue waiting on lock */
while (big_kernel_lock.node_owners[getCurrentCPUIndex()].next->value != CLHState_Granted) {
    if (clh_is_ipi_pending(getCurrentCPUIndex())) {
        /* Continue waiting on lock */
I think I accidentally reverted a comment I had here:

- Because there's no way to remove a node from the lock queue until it has acquired the lock and can then release it, in the syscall context we need to wait for the lock here and then immediately return. We can't return higher up the stack, because the syscall context is no longer valid since the thread was changed.
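In context, the restored comment would sit at the top of the wait loop quoted above; a sketch (loop body elided):

```c
/* Because there's no way to remove a node from the lock queue until it
 * has acquired the lock and can then release it, in the syscall context
 * we need to wait for the lock here and then immediately return. We
 * can't return higher up the stack, because the syscall context is no
 * longer valid since the thread was changed. */
while (big_kernel_lock.node_owners[getCurrentCPUIndex()].next->value != CLHState_Granted) {
    if (clh_is_ipi_pending(getCurrentCPUIndex())) {
        /* handle the pending remote call, then keep waiting */
    }
}
```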
}
switchToIdleThread();
NODE_STATE(ksSchedulerAction) = SchedulerAction_ResumeCurrentThread;
doMaskReschedule(ARCH_NODE_STATE(ipiReschedulePending));
Triggering an IPI within an IPI handler is getting pretty sketchy... I believe this is currently needed because a timeout fault, generated by the current thread reaching the end of its budget, can make a thread runnable on a different node. As doMaskReschedule is a non-blocking IPI, it only delivers an interrupt to the other nodes, which act on it only when they independently take the lock. (Compare blocking IPIs, which are executed by the target nodes without acquiring the lock.) I expect this means that if the reschedule IPI is sent back to the original node that triggered this synchronous IPI, it will be delivered once that node releases the lock.
Why not let the future schedule call do the doMaskReschedule? Sending reschedule IPIs is non-blocking, but receiving isn't. As the reschedule IPI is handled with the kernel lock taken, doing it here will force a third remote core to block on the kernel lock immediately.

I am wondering if it wouldn't be easier to go back to the old code, which didn't switch to the idle thread's SC, and to do the switch only when the SC is being deleted. That, or do the actual deletion on the remote core at the end of ipiStallCoreCallback. Then all timekeeping issues are bypassed: either the interrupted SC is charged as it used to be, or the SC is gone, so time accounting doesn't matter any more. What started as a simple bugfix has opened a can of worms, it seems.
The callback function has one implementation for the IRQ path and one for the synchronous path, so we make the conditional more obvious.

Restructure the implementation of this function to make it clearer that when it is entered from a non-IRQ path, the additional while loop to acquire the lock is necessary because of the changed thread.

Under MCS, the time needs to be accounted correctly before the thread is switched to the idle thread.

For synchronous IPI interrupts, the kernel lock acquire is skipped, as the lock is already held by a different node. Once the IPI call returns, it isn't safe for the node to do anything other than restore the context of the currently active thread and return to user. Some asserts are added to try to catch any kernel state inconsistencies at runtime.
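A rough sketch of the shape this describes, inferred from the description alone rather than from the PR's actual diff; updateTimestamp is assumed to be the MCS time-accounting entry point, and the elided parts are summarized in comments:

```c
void ipiStallCoreCallback(bool_t irqPath)
{
    if (clh_is_self_in_queue() && !irqPath) {
        /* Synchronous path: this core was spinning for the kernel lock
         * when the remote call arrived. Charge the time consumed so far
         * before switching away from the current SC, then park on the
         * idle thread. */
#ifdef CONFIG_KERNEL_MCS
        updateTimestamp(); /* assumed MCS accounting call */
#endif
        switchToIdleThread();
        NODE_STATE(ksSchedulerAction) = SchedulerAction_ResumeCurrentThread;
        /* ... stall until released, wait for the lock, then restore the
         * active thread's context and return to user ... */
    } else {
        /* IRQ path: the kernel lock is already held by the remote
         * caller, so acquiring it here is skipped. */
    }
}
```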
After talking a bit to @Indanz, we're starting to consider that it might be better to remove the stallTCB IPI and replace it with case-specific operations to be performed on the remote core instead of stalling. This way, the problem of accounting for the stall time goes away.
Closing this change for now. I extracted the last commit, which fixes a non-MCS SMP bug: #1421
Update (5 Mar 2025): With #867 recently merged, this PR needs to be rebased. I think the top 5 commits are still worth reapplying, so I'll keep the PR open.
Because it's been so long since I did this work, I can't find the issue it was addressing. But essentially, when a TCB with an SC on one core performs an operation that requires remotely stalling another core, the time consumed by this stall isn't properly accounted. There are also some correctness bugs where some operations are performed outside of the kernel lock.