smp: Address MCS timekeeping issues in ipiStallCoreCallback #867

Conversation
include/smp/lock.h
Outdated
/* we only handle irq_remote_call_ipi here, as other types of IPIs
 * are async and could be delayed. 'handleIPI' may not return,
 * depending on the value of 'irqPath'. */
handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
It's actually incorrect to call handleIPI here, as the value of big_kernel_lock.node_owners[cpu].next hasn't been correctly initialized until sel4_atomic_exchange returns, so ipiStallCoreCallback, called by handleIPI, will poll the incorrect lock address.
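For readers following along, here is a minimal self-contained sketch of the acquisition step in question. The names big_kernel_lock, node_owners, next, and CLHState_Granted come from this thread; the type definitions, NUM_CPUS, and the use of C11 atomic_exchange in place of sel4_atomic_exchange are illustrative assumptions, not the kernel's verbatim code:

```c
#include <stdatomic.h>

#define NUM_CPUS 4 /* placeholder core count, sketch only */

typedef enum { CLHState_Granted, CLHState_Pending } clh_state_t;

typedef struct {
    _Atomic clh_state_t value;
} clh_qnode_t;

typedef struct {
    clh_qnode_t *node; /* this core's own queue node */
    clh_qnode_t *next; /* predecessor's node; valid only after the exchange */
} clh_qnode_p_t;

static struct {
    clh_qnode_t *_Atomic head;
    clh_qnode_p_t node_owners[NUM_CPUS];
} big_kernel_lock;

static void clh_lock_acquire_sketch(int cpu)
{
    big_kernel_lock.node_owners[cpu].node->value = CLHState_Pending;

    /* node_owners[cpu].next only becomes valid once this exchange has
     * returned. Handling an IPI before that point, as the removed code
     * did, lets ipiStallCoreCallback read a stale 'next' and poll the
     * wrong lock address. */
    big_kernel_lock.node_owners[cpu].next =
        atomic_exchange(&big_kernel_lock.head,
                        big_kernel_lock.node_owners[cpu].node);

    /* Spin until the predecessor grants the lock; only from here on is
     * it safe to handle pending IPIs. */
    while (big_kernel_lock.node_owners[cpu].next->value != CLHState_Granted) {
        /* spin */
    }
}
```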
I agree that it is incorrect, but do we have any data on why we were trying to do handleIPI in here? Were we trying to assure progress of something that otherwise wouldn't happen?
The commit history says that it can take up to seconds until the atomic exchange succeeds. It does not say on which platform; maybe one that was quite bad at SMP support. Modern platforms should be better; ARMv8.1 even brings the swap instruction back.
I was wondering about this code and eventually found this discussion, which has some background: https://sel4.discourse.group/t/why-does-the-clh-lock-not-use-gcc-for-arm-atomics/103. I then created #856
Those long delays may be fixed by the commits "Repair barriers in sel4_atomic_exchange" and "Repair barriers in clh_lock_acquire".
include/smp/lock.h
Outdated
    handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
}

arch_pause();
Why also remove arch_pause()? That call has nothing to do with handling IPIs.
I was essentially trying to revert 3d2ae69, which also added arch_pause(). Before that commit, the x86 variant called __atomic_exchange_n directly and continued to call __atomic_exchange_n in its implementation of try_arch_atomic_exchange. Meanwhile, arch_pause is an empty static inline on ARM.

This part of the lock is supposed to have minimal bus contention, as it's just coordinating insertion into the FIFO queue. If the pause is needed for x86, then it should either be handled by the compiler's implementation of __atomic_exchange_n, or introducing it should have been a separate commit explaining why __atomic_exchange_n alone was insufficient.
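For reference, an illustrative reconstruction of the two shapes being compared. These are fragments rather than compilable code, and the actual diff of 3d2ae69 (including the memory order and the try_arch_atomic_exchange signature) may differ:

```c
/* Before 3d2ae69 (as described above): swap directly into the queue
 * tail; minimal bus traffic, nothing else in the loop. */
prev = __atomic_exchange_n(&big_kernel_lock.head, node, __ATOMIC_ACQ_REL);

/* After 3d2ae69: a retry loop, so that an LL/SC-based ARM exchange may
 * fail and retry, with IPI handling and arch_pause() added inside. */
while (!try_arch_atomic_exchange(&big_kernel_lock.head, node, &prev)) {
    if (clh_is_ipi_pending(cpu)) {
        handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
    }
    arch_pause();
}
```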
Seems you are aiming at something like what I was experimenting with in axel-h#73.
> Seems you are aiming at something like what I was experimenting with in axel-h#73.

I'm trying to resolve the issue with ipiStallCoreCallback on MCS.
Then you should also revert the other changes in that commit, like the introduction of try_arch_atomic_exchange_rlx, because the only reason they were made was to be able to handle IPIs within the while loop. If you remove that, ARM can also just use __atomic_exchange_n.

Reading the links @axel-h posted, long delays can happen on architectures where:
- The exclusive reservation granule is bigger than one cache line.
- Other cores keep writing memory within the lock's granule region for other reasons.

In other words, crappy multicore implementations. I'd rather fix those by adding architecture-dependent extra padding around head, as that fixes the fundamental problem. (The only alternative is to use a different locking algorithm.)
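A sketch of that padding idea; EXCLUSIVE_RESERVATION_GRANULE and clh_head_slot_t are hypothetical names, and 64 is an example value only (on ARM the ERG is discoverable via CTR_EL0 and can be larger than a cache line):

```c
#define EXCLUSIVE_RESERVATION_GRANULE 64 /* example; architecture dependent */

typedef struct clh_qnode clh_qnode_t; /* defined elsewhere */

typedef struct {
    clh_qnode_t *head;
    /* Keep all other, frequently written data out of head's reservation
     * granule, so stores from other cores can't keep cancelling this
     * core's load-exclusive reservation and starve the exchange. */
    char pad[EXCLUSIVE_RESERVATION_GRANULE - sizeof(clh_qnode_t *)];
} __attribute__((aligned(EXCLUSIVE_RESERVATION_GRANULE))) clh_head_slot_t;
```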
include/smp/lock.h
Outdated
    handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
}

arch_pause();
Is removing arch_pause() also intended? It seems removing this addresses another issue: it avoids potential delays where we'd miss a chance to do the atomic exchange.
Only ARM's try_arch_atomic_exchange_rlx may return false, and ARM's arch_pause() is empty, so it's fine to remove. Worth mentioning, though.

I propose renaming try_arch_atomic_exchange_rlx to arch_atomic_exchange_rlx and moving the while loop into ARM's implementation; clh_lock_acquire is the only caller.
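A sketch of what that proposal could look like; the wrapper's signature is assumed, and try_arch_atomic_exchange_rlx stands for the existing ARM helper that can fail when the store-exclusive loses its reservation:

```c
static inline void *arch_atomic_exchange_rlx(void *volatile *ptr, void *new_val)
{
    void *prev;
    /* Retry here rather than in clh_lock_acquire, its only caller, so
     * that callers see an exchange that always succeeds. */
    while (!try_arch_atomic_exchange_rlx(ptr, new_val, &prev)) {
        /* spin until the LL/SC pair succeeds */
    }
    return prev;
}
```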
See #867 (comment)

This draft PR is exploring how to handle time accounting in ipiStallCoreCallback on MCS. I just pushed a version that appears to at least pass sel4test for TX2 SMP MCS and x86_64 SMP MCS.
/* Continue waiting on lock */
while (big_kernel_lock.node_owners[getCurrentCPUIndex()].next->value != CLHState_Granted) {
    if (clh_is_ipi_pending(getCurrentCPUIndex())) {
        /* Continue waiting on lock */
I think I accidentally reverted a comment I had here:

- Because there's no way to remove a node from the lock queue until it has acquired the lock and can then release it, in the syscall context we need to wait for the lock here and then immediately return. We can't return higher up the stack, because the syscall context is no longer valid since the thread was changed.
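In context, the restored comment would sit at the top of the wait loop quoted above; a sketch (loop body elided):

```c
/* Because there's no way to remove a node from the lock queue until it
 * has acquired the lock and can then release it, in the syscall context
 * we need to wait for the lock here and then immediately return. We
 * can't return higher up the stack, because the syscall context is no
 * longer valid since the thread was changed. */
while (big_kernel_lock.node_owners[getCurrentCPUIndex()].next->value != CLHState_Granted) {
    if (clh_is_ipi_pending(getCurrentCPUIndex())) {
        /* handle the pending remote call, then keep waiting */
    }
}
```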
}
switchToIdleThread();
NODE_STATE(ksSchedulerAction) = SchedulerAction_ResumeCurrentThread;
doMaskReschedule(ARCH_NODE_STATE(ipiReschedulePending));
Triggering an IPI within an IPI handler is getting pretty sketchy... I believe this is currently needed because a timeout fault, generated by the current thread reaching the end of its budget, can make a thread runnable on a different node. As doMaskReschedule is a non-blocking IPI, it only delivers an interrupt to the other nodes, which act on it only when they independently take the lock. (Compare blocking IPIs, which are executed by the target nodes without acquiring the lock.) I expect this means that if the reschedule IPI is sent back to the original node that triggered this synchronous IPI, it will be delivered once that node releases the lock.
Why not let the future schedule call do the doMaskReschedule? Sending reschedule IPIs is non-blocking, but receiving isn't. As the reschedule IPI is handled with the kernel lock taken, doing it here will force a third remote core to block on the kernel lock immediately.

I am wondering if it wouldn't be easier to go back to the old code, which didn't switch to the idle thread's SC, and to do the switch only when the SC is being deleted. That, or do the actual deletion on the remote core at the end of ipiStallCoreCallback. Then all timekeeping issues are bypassed: either the interrupted SC is charged as it used to be, or the SC is gone, so time accounting doesn't matter any more. What started as a simple bugfix has opened a can of worms, it seems.
The callback function has one implementation for the IRQ path and one for the synchronous path, so we make the conditional more obvious.

Restructure the implementation of this function to make it clearer that when it is entered from a non-IRQ path, the additional while loop to acquire the lock is necessary because of the changed thread.

Under MCS, the time needs to be accounted correctly before the thread is switched to the idle thread.

For synchronous IPI interrupts, the kernel lock acquire is skipped, as the lock is already held by a different node. Once the IPI call returns, it isn't safe for the node to do anything other than restore the context of the currently active thread and return to user. Some asserts are added to try to catch any kernel state inconsistencies at runtime.
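A rough sketch of the shape this describes, inferred from the description alone rather than from the PR's actual diff; updateTimestamp is assumed to be the MCS time-accounting entry point, and the elided parts are summarized in comments:

```c
void ipiStallCoreCallback(bool_t irqPath)
{
    if (clh_is_self_in_queue() && !irqPath) {
        /* Synchronous path: this core was spinning for the kernel lock
         * when the remote call arrived. Charge the time consumed so far
         * before switching away from the current SC, then park on the
         * idle thread. */
#ifdef CONFIG_KERNEL_MCS
        updateTimestamp(); /* assumed MCS accounting call */
#endif
        switchToIdleThread();
        NODE_STATE(ksSchedulerAction) = SchedulerAction_ResumeCurrentThread;
        /* ... stall until released, wait for the lock, then restore the
         * active thread's context and return to user ... */
    } else {
        /* IRQ path: the kernel lock is already held by the remote
         * caller, so acquiring it here is skipped. */
    }
}
```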
After talking a bit to @Indanz, we're starting to consider that it might be better to remove the stallTCB IPI and replace it with case-specific operations to be performed on the remote core instead of stalling. This way, the problem of accounting for the stall time goes away.
Closing this change for now. I extracted the last commit, which fixes a non-MCS SMP bug: #1421
Update (5 Mar 2025): With #867 recently merged, this PR needs to be rebased. I think the top 5 commits are still worth reapplying, so I'll keep the PR open.
Because it's been so long since I did this work, I can't find the issue it was addressing. But essentially, when a TCB with an SC on one core performs an operation that requires remotely stalling another core, the time consumed by this stall isn't properly accounted. There are also some correctness bugs where some operations are performed outside of the kernel lock.