8000 TLB simulation by HanyzPAPU2 · Pull Request #36 · d-iii-s/msim · GitHub
Merged: 421 commits, Sep 19, 2023

Conversation

HanyzPAPU2

This pull request adds TLB simulation to the RISC-V virtual address translation scheme and implements the SFENCE.VMA instruction.

Also, a misunderstanding of the WPRI CSRs has been fixed and a minor performance optimization has been implemented.

@vhotspur self-requested a review on August 8, 2023, 09:27
// There has been some problem with the page table while we tried to set the A/D bits.
// We still use the cached translation and act as if nothing happened.
// This is done to introduce bugs that show up on improper ASID management and SFENCE usage.
alert("Used cached address translation that is not present in the page table!");
Member

Please, what does the specification say about this? Is this specified as undefined behavior?

Author

The Privileged RISC-V manual says:

4.3.2 Virtual Address Translation Process
page 82:
1 Let a be satp.ppn × PAGESIZE, and let i = LEVELS − 1. (For Sv32, PAGESIZE=2^12 and
LEVELS=2.) [...]
2 Let pte be the value of the PTE at address a+va.vpn[i]×PTESIZE. (For Sv32, PTESIZE=4.) [...]
[...]
7 If pte.a = 0, or if the original memory access is a store and pte.d = 0, either raise a page-fault exception corresponding to the original access type, or:

  • If a store to pte would violate a PMA or PMP check, raise an access-fault exception corresponding to the original access type.
  • Perform the following steps atomically:
    • Compare pte to the value of the PTE at address a + va.vpn[i] × PTESIZE.
    • If the values match, set pte.a to 1 and, if the original memory access is a store, also set pte.d to 1.
    • If the comparison fails, return to step 2

[...]
page 83:
The results of implicit address-translation reads in step 2 may be held in a read-only, incoherent
address-translation cache but not shared with other harts. The address-translation cache may hold
an arbitrary number of entries, including an arbitrary number of entries for the same address and
ASID. Entries in the address-translation cache may then satisfy subsequent step 2 reads if the
ASID associated with the entry matches the ASID loaded in step 0 or if the entry is associated
with a global mapping. To ensure that implicit reads observe writes to the same memory locations,
an SFENCE.VMA instruction must be executed after the writes to flush the relevant cached
translations.
The address-translation cache cannot be used in step 7; accessed and dirty bits may only be updated
in memory directly.

Note:
It is permitted for multiple address-translation cache entries to co-exist for the same address.
This represents the fact that in a conventional TLB hierarchy, it is possible for multiple entries
to match a single address if, for example, a page is upgraded to a superpage without first clearing
the original non-leaf PTE’s valid bit and executing an SFENCE.VMA with rs1=x0, or if multiple
TLBs exist in parallel at a given level of the hierarchy. In this case, just as if an SFENCE.VMA
is not executed between a write to the memory-management tables and subsequent implicit read
of the same address: it is unpredictable whether the old non-leaf PTE or the new leaf PTE is
used, but the behavior is otherwise well defined.

So virtual addresses that are not in the current page-table structure can still be translated using the TLB if the ASID matches or the mapping is global, but there is a problem with setting the Accessed and Dirty bits in the page-table structure.

This is what the specification says about the A and D bits:

4.3.1 Addressing and Memory Protection
page 81
Each leaf PTE contains an accessed (A) and dirty (D) bit. The A bit indicates the virtual page has
been read, written, or fetched from since the last time the A bit was cleared. The D bit indicates
the virtual page has been written since the last time the D bit was cleared.
[...]
When a virtual page is accessed and the A bit is clear, or is written and the D bit is clear, the
implementation sets the corresponding bit(s) in the PTE. The PTE update must be atomic
with respect to other accesses to the PTE, and must atomically check that the PTE is valid
and grants sufficient permissions. Updates of the A bit may be performed as a result of
speculation, but updates to the D bit must be exact (i.e., not speculative), and observed in
program order by the local hart. Furthermore, the PTE update must appear in the global
memory order no later than the explicit memory access, or any subsequent explicit memory
access to that virtual page by the local hart. The ordering on loads and stores provided by
FENCE instructions and the acquire/release bits on atomic instructions also orders the PTE
updates associated with those loads and stores as observed by remote harts.
The PTE update is not required to be atomic with respect to the explicit memory access that
caused the update, and the sequence is interruptible. However, the hart must not perform
the explicit memory access before the PTE update is globally visible.

The PTE needs to be updated when the virtual page is accessed/dirtied, comparing against the value of the PTE as we currently have it (which we got from the TLB). But the PTE needs to be updated in memory, in the current page-table structure, where the translation does not exist.

This situation is not described in the specification (or at least I have not found a description), but based on the note about having multiple TLB entries for the same address with the same ASID, I would guess that this situation can be classified as undefined behavior.
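For illustration, the compare-and-swap from step 7 of the spec quote above could be sketched like this in C. This is a hypothetical helper, not MSIM's actual code; the A/D bit positions follow Sv32 (A is bit 6, D is bit 7), and in a single-stepping simulator the "atomic" part is trivial:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sv32 leaf-PTE flag bits (V=0, R=1, W=2, X=3, U=4, G=5, A=6, D=7). */
#define PTE_A (1u << 6)
#define PTE_D (1u << 7)

typedef uint32_t pte_t;

/* Hypothetical sketch of step 7: atomically set the A (and, for stores,
 * the D) bit of a leaf PTE, re-checking that the PTE in memory still
 * matches the value used for the translation.
 * Returns true on success; false means the PTE in memory changed and
 * the translation must restart from step 2 (the page walk). */
static bool update_ad_bits(pte_t *pte_in_memory, pte_t cached_pte, bool is_store)
{
    pte_t wanted = cached_pte | PTE_A | (is_store ? PTE_D : 0);
    if (*pte_in_memory != cached_pte)
        return false;           /* comparison failed: restart the walk */
    *pte_in_memory = wanted;    /* set A (and D for stores) */
    return true;
}
```

The problem discussed above is precisely that, when the translation came from the TLB, `pte_in_memory` may no longer be part of the current page-table structure.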

Collaborator

I've thought about this a bit more, and I think that in the case stated above, the A/D bits should be set in the cached PTE which has been used in the wrong translation.
I've come to this conclusion because the standard allows the caching of PTEs based on their physical address, so the update of the A/D bits would be based on the physical address of the incorrect PTE.

Moreover, the current implementation of the TLB is not exactly true to the specification, but fixing this by keeping track of where the cached PTEs lie in physical memory could be argued to be equivalent to the version in the specification.

So there are two options: either reimplement the TLB to be true to the spec (and cache PTEs based on the physical address), or keep the current version (caching whole translations) and add the mitigation described above.

I think that caching the whole translation is more intuitive, especially for dumping the contents of the TLB, but maybe we should stick to the specification word for word.

Member

I have been thinking about this and I am still not sure which is the better option here. I am more inclined to follow the specification word for word.

By the way, what implications would it have if we went with the simpler scheme where the A and D bits must always be set (i.e., the first of the two schemes from page 81)?

@lbulej, what do you think?

Collaborator

I am also inclined to follow the standard.
It should not be too much work; I think I could get it done over the weekend.
The system tests are still applicable, so that is even less work to be done.

Collaborator

BTW, @vhotspur also mentioned the option of always requiring A=D=1.
[...]
So with respect to the OS course, we could easily go with A=D=1, because at the moment our assignments make no use of the A/D bits.

Yes, we could, but the only difference in execution would be raising a page fault when the bits are incorrect.
I think this difference is not that significant.
Thanks to your clarification I understand how it should work, and I don't think it should introduce many problems.

On a somewhat unrelated note—in some earlier comments you mention that "the standard allows the caching of PTEs based on their physical address".

The standard states:

The results of implicit address-translation reads in step 2 may be held in a read-only, incoherent address-translation cache.
[...]
Entries in the address-translation cache may then satisfy subsequent step 2 reads.

The read in step 2 is based on a physical address, and the cached PTEs can be used for satisfying this read request.
As I understand it, the TLB can only be used for the reads in step 2 and not to cache whole translations.

This makes the TLB operate on physical rather than virtual addresses at its input.

As I understand it, the whole address translation process must still be executed for each memory request.

One interesting conclusion can be deduced from this interpretation: even if an ASID is stolen, as long as the translation table structures are separate in physical memory, the address spaces will not interfere with each other.

Collaborator

The conclusion from my last comment seemed unexpected to me for a TLB, so I decided to read up on how TLBs are implemented in practice.
I found the following two articles:

[1] https://ieeexplore.ieee.org/abstract/document/9221630/
[2] https://arxiv.org/abs/1905.06825

From them, I've found out that the caching described in Section 4.3.2 on page 83, after the description of the address-translation process, does not describe how a TLB should behave, but rather a page-table-walk cache [1, 2].
But a TLB could be seen as caching the leaf PTE of the translation based solely on the virtual address and ASID (skipping the implicit reads of the non-leaf PTEs), and it can be understood to fulfill the last read from the translation algorithm.

A TLB should work as a cache of the whole translation, and the only part of the spec that addresses this issue is Section 4.2.1 Supervisor Memory-Management Fence Instruction, which defines memory ordering constraints on the implicit accesses to the page-table structures.

The closest mention I could find is on page 77 and it reads:

A consequence of this specification is that an implementation may use any translation for an
address that was valid at any time since the most recent SFENCE.VMA that subsumes that address.

This is done to allow a "wider variety of dynamic caching structures and memory-management schemes", but it has left me rather confused.

The problem which started this discussion was what to do with the A/D bits on cached PTEs that we used from the TLB.
Now I see that this problem arises only when the cached PTE has A=0 or D=0 and the access is a write.
Then we need to perform the compare-and-swap in memory and update the PTE (or restart).
For this to work, we need to keep the physical address at which the PTE lies in the TLB; without it we could not perform the CAS without doing the whole page walk.
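A minimal sketch of what such a TLB entry might look like, with illustrative field names (not MSIM's actual structures) — the point being the extra `pte_phys` field that lets the A/D compare-and-swap be replayed in memory without redoing the page walk:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical TLB entry: besides the cached leaf PTE and the
 * virtual-page/ASID tag, it remembers the physical address the PTE
 * was loaded from, so the A/D update can address the PTE in memory
 * directly. Field names are illustrative. */
typedef struct {
    uint32_t vpn;       /* virtual page number (tag) */
    uint16_t asid;      /* address-space ID the entry belongs to */
    bool global;        /* entry matches regardless of ASID */
    uint32_t pte;       /* cached leaf PTE value */
    uint64_t pte_phys;  /* physical address of the PTE in memory */
    bool valid;
} tlb_entry_t;

/* An entry satisfies a lookup if the VPN matches and either the
 * mapping is global or the ASID matches, per the spec's lookup rule. */
static bool tlb_entry_matches(const tlb_entry_t *e, uint32_t vpn, uint16_t asid)
{
    return e->valid && e->vpn == vpn && (e->global || e->asid == asid);
}
```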

I do not think MSIM needs the page-table-walk cache, as it would not introduce a new point of failure.

Member

The results of implicit address-translation reads in step 2 may be held in a read-only, incoherent address-translation cache.
[...]
Entries in the address-translation cache may then satisfy subsequent step 2 reads.

The read in step 2 is based on a physical address, and the cached PTEs can be used for satisfying this read request.
As I understand it, the TLB can only be used for the reads in step 2 and not to cache whole translations.

Hmm, I don't think I would interpret it that way. The purpose of the TLB is to cache the end result of the VA->PA translation. If there is a TLB miss, the spec allows implementations to use a hierarchy of various other caches to speed up the translation (so that the CPU does not have to walk 4-5 levels of page tables). These are assumed to be read-only but potentially incoherent; I believe the OS writer should be at most aware of the TLB (not the architecture-specific implementation of all the translation-caching hardware) and the need for telling the CPU about page-table changes or changes in the association between an ASID and a page table, hence the need for sfence.vma.

As I understand it, the whole address translation process must still be executed for each memory request.

Yes, but most of the time it runs off the TLB. If that fails, the hardware may or may not use partial translation caches to speed up things, but the programmer's responsibility is to behave nicely with respect to the TLB only. Anything else must be subsumed by it.

One interesting conclusion can be deduced from this interpretation: even if an ASID is stolen, as long as the translation table structures are separate in physical memory, the address spaces will not interfere with each other.

I'm not sure about this. Stealing an ASID means giving it to some other process, together with a potentially different physical address of the first page table. But if the TLB contains some cached entries with A=1 or A=D=1 for a particular ASID and VA combination (there are similarities in the layout of process memory), it will happily use the old translation result cached in the TLB.

Maybe you are right, but I would stick to the more conservative interpretation. The RISC-V hardware is all about making provisions for various (expensive) optimizations while still making simple implementations reasonably efficient. You can do a CPU with TLB only (without partial translation caches) and it should still be efficient (i.e., not require doing the translation all the time).

Now I see that maybe the whole problem with interpreting the spec is that when it says "translation process", it does not mean "when the CPU accesses a particular virtual address" in general, but when the CPU actually has to do the page walk, i.e., after a TLB miss. Then it makes sense that the spec does not talk about the TLB in the translation process, because the TLB is out of the game at that point—only the (potential) partial translation caches remain.

Perhaps I was the only one who did not realize that; sorry about that if that's the case.

Still, it should not change the earlier conclusion about the A/D update behavior. Let's see.

For explicit memory accesses:

  • read
    • on TLB hit, A=1 (otherwise the mapping would not be in the TLB), do nothing to the PTE
    • on TLB miss, walk the page tables, update the PTE in memory to set A=1, cache the end result with A=1
  • write
    • on TLB hit with A=1, D=1, do nothing to the PTE
    • on TLB hit with A=1, D=0, walk the page tables, update the PTE in memory to set A=D=1, cache the end result with A=D=1
    • on TLB miss, walk the page tables, update the PTE in memory to set A=D=1, cache the end result with A=D=1
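The case analysis above boils down to a single predicate deciding whether a cached entry can be used as-is; a hypothetical sketch (names assumed, not MSIM code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sv32 leaf-PTE flag bits used below. */
#define PTE_A (1u << 6)
#define PTE_D (1u << 7)

/* Hypothetical condensation of the hit/miss cases listed above.
 * Returns true when the cached entry can be used without touching
 * the PTE; false means the page tables must be walked (a miss, or a
 * write hitting an entry with D=0), after which the walker sets A
 * (and D for writes) in memory and re-caches the result. */
static bool tlb_entry_usable(bool hit, uint32_t cached_pte, bool is_write)
{
    if (!hit)
        return false;                     /* miss: walk, set A (and D), cache */
    if (is_write && !(cached_pte & PTE_D))
        return false;                     /* write hit with D=0: walk and set D */
    return true;                          /* hit: A=1 already, nothing to do */
}
```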

There are no implicit writes (apart from writes to PTEs, but these need to go through page tables).

So we are left with implicit reads (prefetching, speculation).

Prefetches would be triggered by an explicit memory access and often don't cross page boundaries, so we can assume they will mostly run off the TLB without triggering any updates to the A bits.

Speculation can probably take an arbitrary (computed) VA and try to fetch it, dealing with the possibility of a TLB miss. In that case, it can run the translation process and get a PTE from memory. If the PTE has A=0, the spec explicitly allows the hardware to update the PTE to set A=1. It is not defined whether the result of a translation triggered by speculation is going to be stored in the TLB (I would wager not, because speculation would be able to pollute the TLB, but the partial translation caches are a different story). If the speculation got a TLB hit, the entry will already have A=1.

Member

:-) We seem to have arrived at the same/similar conclusion regarding what the spec is talking about in the end, but GH did not update the comments while I was writing it down. Anyway, MSIM certainly does not need a page-walk cache; the TLB is enough if we want to play with ASIDs and sfence.vma.

Collaborator

I've changed the implementation to behave as we have discussed here.
(commits 6460832, c11dcac and 77a505f)

@HanyzPAPU
Collaborator

I reimplemented the TLB to be fully associative with a full LRU eviction strategy.

I also realized that we can make the ASID recycle/steal tests fail if we make the TLB large enough for those tests.
If there are more entries available than there are ASIDs, then this fully associative cache will not have to evict any entries between runs of threads with the same ASID (even if the other 511 threads run in between).

I have tested this on DeutschOS and I have been able to make these tests fail by not flushing the TLB correctly.
The test ran with 520 threads and the full 512 available ASIDs.
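A minimal sketch of full-LRU victim selection in a fully associative TLB, with illustrative names (not MSIM's actual code): each entry carries a last-use timestamp, and on refill the entry with the smallest timestamp is evicted. With more entries than live ASIDs, entries from a previous holder of a recycled ASID can survive arbitrarily long, which is what makes missing SFENCE.VMA flushes observable.

```c
#include <stddef.h>
#include <stdint.h>

#define TLB_SIZE 8  /* illustrative; MSIM's size is configurable */

typedef struct {
    uint64_t last_used;  /* LRU timestamp, bumped on every hit */
    int asid;            /* owner ASID; -1 = invalid */
} lru_entry_t;

/* Pick the eviction victim: the entry with the smallest (oldest)
 * last_used timestamp. Invalid entries would normally be preferred
 * first; this sketch assumes a full TLB. */
static size_t lru_victim(const lru_entry_t tlb[TLB_SIZE])
{
    size_t victim = 0;
    for (size_t i = 1; i < TLB_SIZE; i++)
        if (tlb[i].last_used < tlb[victim].last_used)
            victim = i;
    return victim;
}
```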

@vhotspur
Member

I think we can merge this now. Thanks a lot!

But before doing that, would you please revert the version-number changes you introduced? It works better for me if versions are bumped directly in master rather than keeping the counter updated in various PRs. And I think I will do a major release anyway :-)

@vhotspur merged commit 76077f0 into d-iii-s:master on Sep 19, 2023