
smp: Address MCS timekeeping issues in ipiStallCoreCallback #867


Closed
kent-mcleod wants to merge 5 commits

Conversation

@kent-mcleod kent-mcleod (Member) commented Jun 7, 2022

Update (5.Mar.2025): With #867 recently merged, this PR needs to be rebased. I think the top 5 commits are still worth reapplying, so I'll keep the PR open.

Because it's been so long since doing this work, I can't find the issue it was addressing. But essentially, when a TCB with an SC on one core performs an operation that requires remotely stalling another core, the time consumed by this stall isn't properly accounted for. There are also some correctness bugs where some operations are performed outside of the kernel lock.

/* we only handle irq_remote_call_ipi here as other type of IPIs
* are async and could be delayed. 'handleIPI' may not return
* based on value of the 'irqPath'. */
handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
@kent-mcleod kent-mcleod (Member, Author) commented Jun 7, 2022

It's actually incorrect to call handleIPI here, as the value of big_kernel_lock.node_owners[cpu].next hasn't been correctly initialized until sel4_atomic_exchange returns, so ipiStallCoreCallback, called from handleIPI, will poll the wrong lock address.
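To make the ordering concrete, here is a rough paraphrase of the acquire path (simplified and from memory, not the verbatim kernel source; the sel4_atomic_exchange signature shown is an assumption):

/* Paraphrased sketch of clh_lock_acquire, not the in-tree code. The point is
 * that 'next' is only assigned after sel4_atomic_exchange() returns, so an
 * ipiStallCoreCallback triggered from inside the exchange's retry loop would
 * poll whatever stale node 'next' still points at. */
static void clh_lock_acquire_sketch(word_t cpu, bool_t irqPath)
{
    big_kernel_lock.node_owners[cpu].node->value = CLHState_Pending;

    /* handleIPI() was being called from inside this function's retry loop... */
    clh_qnode_t *prev = sel4_atomic_exchange(&big_kernel_lock.head, irqPath,
                                             cpu, __ATOMIC_ACQ_REL);

    /* ...but only here does 'next' become valid. */
    big_kernel_lock.node_owners[cpu].next = prev;

    while (big_kernel_lock.node_owners[cpu].next->value != CLHState_Granted) {
        /* spin until our predecessor grants us the lock */
        arch_pause();
    }
}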

Member

I agree that it is incorrect, but do we have any data on why we were trying to do handleIPI here? Were we trying to ensure progress of something that otherwise wouldn't happen?

Member

The commit history mentions that it can take up to seconds for the atomic exchange to succeed. It does not say on which platform, maybe one that was quite bad at SMP support. Modern platforms should be better; ARMv8.1 even brings the swap instruction back.

Member Author

@axel-h do you have a link for that message? The commit that introduced the handleIPI call doesn't have a body: 3d2ae69

Member

I was wondering about this code and eventually found this discussion that has some background: https://sel4.discourse.group/t/why-does-the-clh-lock-not-use-gcc-for-arm-atomics/103. Then I created #856.

Contributor

Those long delays may be fixed by the commits "Repair barriers in sel4_atomic_exchange" and "Repair barriers in clh_lock_acquire".

handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
}

arch_pause();
Contributor

Why also remove arch_pause()? That call has nothing to do with handling IPIs.

Member Author

I was essentially trying to revert 3d2ae69, which also added arch_pause(). Before that commit, the x86 variant directly called __atomic_exchange_n and continued to call __atomic_exchange_n in its implementation of try_arch_atomic_exchange. Meanwhile, arch_pause is an empty static inline on ARM.

This part of the lock is supposed to have minimal bus contention as it's just coordinating the insertion into the FIFO queue. If the pause is needed for x86, then it should either be handled by the compiler's implementation of __atomic_exchange_n, or introducing it should have been done in a separate commit explaining why just having __atomic_exchange_n was insufficient.
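For context, this is roughly the direct-exchange style being described; the helper name clh_swap_tail is made up and the snippet is illustrative, not the pre-3d2ae69 source:

/* Illustrative only: swap ourselves onto the CLH tail with a single compiler
 * builtin. No retry loop and no arch_pause() are needed because the exchange
 * cannot fail. */
static inline clh_qnode_t *clh_swap_tail(clh_qnode_t *my_node)
{
    return __atomic_exchange_n(&big_kernel_lock.head, my_node, __ATOMIC_ACQ_REL);
}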

@axel-h axel-h (Member) commented Jun 7, 2022

Seems you are aiming at something like what I was experimenting with in axel-h#73

Member Author

Seems you are aiming at something like what I was experimenting with in axel-h#73

I'm trying to resolve the issue with ipiStallCoreCallback on MCS.

Contributor

Then you should also revert the other changes in that commit, like introducing try_arch_atomic_exchange_rlx, because the only reason they were made was to be able to handle IPIs within the while loop. If you remove that, ARM can also just use __atomic_exchange_n.

Reading the links @axel-h posted, long delays can happen on architectures where:

  • The exclusive reservation granule is bigger than one cache line.
  • Other cores keep writing memory within the lock granule region for other reasons.

In other words, crappy multicore implementations. I'd rather fix those by adding an architecture dependent extra padding around head, as that fixes the fundamental problem. (The only alternative is to use a different locking algorithm.)
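A minimal sketch of that padding idea, under the assumption that the granule size is exposed as an architecture-dependent constant (RESERVATION_GRANULE is a made-up name, not a real kernel macro):

/* Hypothetical sketch: isolate the CLH head in its own exclusive-reservation
 * granule so that unrelated stores by other cores cannot keep breaking a
 * core's reservation on it. The real granule size would have to come from
 * the architecture configuration. */
#define RESERVATION_GRANULE 128

struct clh_padded_head {
    clh_qnode_t *head;
    char pad[RESERVATION_GRANULE - sizeof(clh_qnode_t *)];
} __attribute__((aligned(RESERVATION_GRANULE)));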

handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
}

arch_pause();
@axel-h axel-h (Member) commented Jun 7, 2022

Is removing arch_pause() also intended? Removing it seems to address another issue: it avoids potential delays where we'd miss a chance to do the atomic exchange.

Contributor

Only ARM's try_arch_atomic_exchange_rlx may return false, and ARM's arch_pause() is empty, so it's fine to remove. Worth mentioning though.

I propose to rename try_arch_atomic_exchange_rlx to arch_atomic_exchange_rlx and to move the while loop into ARM's implementation; clh_lock_acquire is the only caller.
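A sketch of the proposed shape (names follow the suggestion above; the signature of the existing try_ helper is shown from memory and may differ):

/* Illustrative only: the spurious-failure retry loop moves from
 * clh_lock_acquire into the ARM implementation, so callers see an exchange
 * that always succeeds. On ARM the exclusive store can fail spuriously,
 * hence the loop; other architectures could map this straight onto
 * __atomic_exchange_n. */
static inline void *arch_atomic_exchange_rlx(void **ptr, void *new_val)
{
    void *prev;
    while (!try_arch_atomic_exchange_rlx(ptr, new_val, &prev)) {
        /* exclusive monitor was lost; just retry */
    }
    return prev;
}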


@kent-mcleod (Member, Author) commented:

This draft PR is exploring how to handle time accounting in ipiStallCoreCallback on MCS. I just pushed a version that appears to at least pass sel4test for tx2 SMP MCS and x86_64 SMP MCS.

/* Continue waiting on lock */
while (big_kernel_lock.node_owners[getCurrentCPUIndex()].next->value != CLHState_Granted) {
if (clh_is_ipi_pending(getCurrentCPUIndex())) {
/* Continue waiting on lock */
Member Author

I think I accidentally reverted a comment I had here:

  • Because there's no way to remove a node from the lock queue until it has acquired the lock and can then release it, in the syscall context we need to wait for the lock here and then immediately return. We can't return higher up the stack because the syscall context is no longer valid since the thread was changed.
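A control-flow sketch of what that comment describes (illustrative only; helper names follow mainline seL4 naming and are not necessarily this PR's exact code):

/* Illustrative sketch of the syscall-path tail: we cannot leave the CLH queue
 * until the lock is granted to us, and once it is, the original syscall
 * context is gone, so we return straight to user instead of unwinding the
 * C call stack. */
static void stall_syscall_tail_sketch(void)
{
    word_t cpu = getCurrentCPUIndex();

    /* keep waiting for our predecessor to grant the lock */
    while (big_kernel_lock.node_owners[cpu].next->value != CLHState_Granted) {
        arch_pause();
    }

    /* the interrupted thread was switched away, so activate whatever thread
     * is now current and return directly to user mode */
    activateThread();
    restore_user_context();
}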

@kent-mcleod kent-mcleod added the hw-build (do all sel4test hardware builds on this PR) and hw-test (sel4test hardware builds + runs for this PR) labels Jun 7, 2022
}
switchToIdleThread();
NODE_STATE(ksSchedulerAction) = SchedulerAction_ResumeCurrentThread;
doMaskReschedule(ARCH_NODE_STATE(ipiReschedulePending));
Member Author

Triggering an IPI within an IPI handler is getting pretty sketchy... I believe this is currently needed because a timeoutFault, generated by the current thread being at the end of its budget, can make a thread runnable on a different node. As doMaskReschedule is a non-blocking IPI, it only delivers an interrupt to the other nodes, which only act on it when they independently get the lock (compared with blocking IPIs, which are executed by the target nodes without acquiring the lock). I expect this means that if the reschedule IPI is sent back to the original node that triggered this synchronous IPI, it will be delivered once that node releases the lock.

Contributor

Why not let the future schedule call do the doMaskReschedule? Sending reschedule IPIs is non-blocking, but receiving isn't. As the reschedule IPI is handled with the kernel lock taken, doing it here will force a third remote core to block on the kernel lock immediately.

@kent-mcleod kent-mcleod changed the title from "smp: Don't check for IPIs in sel4_atomic_exchange" to "smp: Address MCS timekeeping issues in ipiStallCoreCallback" Jun 7, 2022
@Indanz Indanz (Contributor) commented Aug 26, 2022

I am wondering if it's not easier to go back to the old code, which didn't switch to the idle thread's SC, and to do the switch only when the SC is being deleted. That, or do the actual deletion on the remote core at the end of ipiStallCoreCallback. Then all timekeeping issues are bypassed: either the interrupted SC is charged as it used to be, or the SC is gone so time accounting doesn't matter any more.

What started as a simple bugfix opened a can of worms it seems.

The callback function has one implementation for the IRQ path and one
for the synchronous path, so we make the conditional more obvious.
Restructure the implementation of this function to make it clearer
that when it is entered from a non-IRQ path, the additional while
loop to acquire the lock is necessary because of the changed thread.
Under MCS, the timing needs to be accounted correctly before the
thread is switched to the idle thread.
For synchronous IPI interrupts, the kernel lock acquire is skipped as
the lock is already held by a different node. Once the IPI call returns,
it isn't safe for the node to do anything other than restore the context
of the current active thread and return to user.

Some asserts are added to try to catch any kernel state inconsistencies
at runtime.
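For the MCS part of that description, the intent is roughly the following. This is illustrative only: it uses the mainline helper names updateTimestamp() and checkBudget(), and whether additional commit/rollover handling is needed at this point is part of what this PR was working out.

/* Illustrative only: account for the time the stalled thread has consumed
 * before it is replaced by the idle thread, so the remote stall is not left
 * unaccounted or billed to the idle thread. */
static void stall_mcs_accounting_sketch(void)
{
#ifdef CONFIG_KERNEL_MCS
    /* bring the kernel's view of time up to date, then let the budget check
     * handle the consumed time for the current SC */
    updateTimestamp();
    checkBudget();
#endif
    switchToIdleThread();
    NODE_STATE(ksSchedulerAction) = SchedulerAction_ResumeCurrentThread;
}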
@kent-mcleod (Member, Author) commented:

After talking a bit to @Indanz, we're starting to consider that it might be better to remove the stallTCB IPI and replace it with case-specific operations to be performed on the remote core instead of stalling. That way, the time accounting problem for the stall time goes away.

@kent-mcleod (Member, Author) commented:

Closing this change for now. I extracted the last commit, which fixes a non-MCS SMP bug: #1421

@kent-mcleod kent-mcleod closed this Mar 5, 2025
Labels
hw-build (do all sel4test hardware builds on this PR), hw-test (sel4test hardware builds + runs for this PR)
4 participants