implement paged attention #367

**Description of change**

Divide the KV cache into pages to enable more fine-grained and efficient memory management.

Related Issues:

Linked Issues:

Issue [Feature Request] - qwen3-8B demo; larger batch size #358

Copilot

Pull Request Overview

This PR implements a paged attention mechanism by dividing the KV cache into fixed-size pages and adds corresponding wrappers and tests.

- Introduce `paged_attention` CUDA kernel and host wrapper to manage paged KV caches
- Update Python bindings in `runtime_kernel_wrapper.cu` and expose new API
- Add and adjust Python unit tests for paged attention and decoding with normalization
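For readers unfamiliar with the idea, the page bookkeeping behind a paged KV cache can be sketched in a few lines of plain Python. This is an illustration only, not the PR's CUDA implementation: the class `PagedKVCache`, its methods, and the page size of 4 are all hypothetical names and values chosen for the example, and no tensors or GPU memory are involved.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Toy page-table bookkeeping for a paged KV cache (hypothetical, CPU only)."""

    num_pages: int
    page_size: int = 4  # assumed tokens per page; the PR's real page size is not stated
    free_pages: list = field(default_factory=list)
    page_table: dict = field(default_factory=dict)  # seq_id -> list of physical page ids
    seq_len: dict = field(default_factory=dict)     # seq_id -> number of tokens stored

    def __post_init__(self):
        # Every physical page starts out free.
        self.free_pages = list(range(self.num_pages))

    def append_token(self, seq_id):
        """Reserve a slot for one new token and return (physical_page, offset)."""
        n = self.seq_len.get(seq_id, 0)
        pages = self.page_table.setdefault(seq_id, [])
        if n % self.page_size == 0:
            # The last page is full (or no page exists yet): take a new one.
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            pages.append(self.free_pages.pop(0))
        self.seq_len[seq_id] = n + 1
        return pages[n // self.page_size], n % self.page_size

    def locate(self, seq_id, token_idx):
        """Translate a logical token index into (physical_page, offset)."""
        if not 0 <= token_idx < self.seq_len.get(seq_id, 0):
            raise IndexError(token_idx)
        pages = self.page_table[seq_id]
        return pages[token_idx // self.page_size], token_idx % self.page_size

    def release(self, seq_id):
        """Return all pages of a finished sequence to the free list."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_pages=3)
    for _ in range(5):
        cache.append_token("a")       # sequence "a" now spans two pages
    cache.append_token("b")           # sequence "b" gets the remaining page
    print(cache.locate("a", 4))       # fifth token of "a": first slot of its second page
    cache.release("a")                # pages of "a" become reusable
    print(sorted(cache.free_pages))
```

Because a sequence only holds the pages it has actually filled, memory is reserved in page-sized steps rather than for the maximum sequence length up front, which is what makes larger batch sizes feasible.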

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `tests/runtime_python/test_paged_attention.py` | New test to compare `paged_attention` kernel against a Torch ref |
| `tests/runtime_python/test_decoding_w_norm.py` | Updated decoding test to include RMSNorm and rotary arguments |
| `tests/runtime_python/runtime_kernel_wrapper.cu` | Added `paged_attention` & updated `single_batch_decoding` wrappers |
| `include/mirage/persistent_kernel/tasks/paged_attention.cuh` | Full implementation of the paged attention device task |
| `include/mirage/persistent_kernel/tasks/kernel.h` | Exposed the new `paged_attention` task in the kernel header |

Comments suppressed due to low confidence (3)

tests/runtime_python/test_paged_attention.py:162

[nitpick] This test only prints random ratios without any assertions. Consider adding deterministic checks (e.g., `assert torch.allclose(mirage_output, torch_out, atol=1e-2)`) to automatically validate correctness.

    if torch.rand(1).item() < 0.05:
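The kind of tolerance check suggested above can be sketched without any GPU or Torch dependency. The helper below is a hypothetical pure-Python stand-in that applies the same closeness rule `torch.allclose` documents, `|actual - expected| <= atol + rtol * |expected|`, to flat lists of floats. The names `allclose`, `max_abs_error`, and `check_outputs` are made up for this example, and unlike `torch.allclose` it treats any NaN or infinity as a mismatch.

```python
import math


def allclose(actual, expected, rtol=1e-5, atol=1e-8):
    """Return True if every pair satisfies |a - e| <= atol + rtol * |e|.

    Works on flat sequences of floats. Non-finite values always fail,
    which is stricter than torch.allclose.
    """
    if len(actual) != len(expected):
        return False
    return all(
        math.isfinite(a)
        and math.isfinite(e)
        and abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )


def max_abs_error(actual, expected):
    """Largest elementwise absolute difference, for a readable failure message."""
    return max((abs(a - e) for a, e in zip(actual, expected)), default=0.0)


def check_outputs(actual, expected, rtol=1e-5, atol=1e-2):
    """Fail loudly instead of printing a ratio that nobody inspects."""
    assert allclose(actual, expected, rtol=rtol, atol=atol), (
        f"outputs differ: max abs error {max_abs_error(actual, expected):.3e} "
        f"exceeds atol={atol}"
    )


if __name__ == "__main__":
    # Stand-ins for flattened kernel output and reference output.
    kernel_out = [0.500, -0.250, 1.000]
    reference = [0.501, -0.249, 1.002]
    check_outputs(kernel_out, reference)  # passes: all differences are below 1e-2
```

In the actual test, the equivalent would presumably be a single `torch.allclose` assertion on the two output tensors, run on every invocation rather than behind a random gate.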

tests/runtime_python/test_decoding_w_norm.py:141

[nitpick] This prints the output ratio but lacks assertions. Adding `assert` statements to compare `mirage_output` and `torch_output` within tolerance will make the test fail on regressions.

    print(torch_output / mirage_output)

tests/runtime_python/runtime_kernel_wrapper.cu:116

[nitpick] The parameter name `rotary_emd` appears to be a typo. Consider renaming it to `rotary_emb` or `rotary_embed` for consistency and clarity.

                                              bool rotary_emd,

Collaborator

undefined-c0der commented Jun 28, 2025

implement paged attention

a7ad78f

undefined-c0der requested review from jiazhihao, xinhaoc and Copilot June 28, 2025 02:47

Copilot AI reviewed Jun 28, 2025

tests/runtime_python/test_paged_attention.py (resolved)

tests/runtime_python/test_decoding_w_norm.py (resolved)

tests/runtime_python/runtime_kernel_wrapper.cu (resolved)

add sm_scale

e332ca1
