Feature Description
A key advantage of MPK is extremely low latency for LLM serving. Speculative decoding is a lossless way to reduce LLM latency further, which would push MPK's decoding speedup to the next level.
We will first implement model-free methods such as prompt lookup decoding and lookahead decoding. Other methods will be scheduled later.
Comments on the roadmap are welcome, and so is picking up any task you're interested in.
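To make "lossless" concrete, here is a minimal sketch of greedy verification, assuming greedy (argmax) decoding. `verify_greedy` and `next_token` are hypothetical names for illustration, not part of MPK. The accepted tokens are exactly what plain greedy decoding would have produced, so the output is unchanged and only the number of sequential steps drops.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    context: Sequence[int],
    draft: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens committed by one speculative step.

    `next_token` is a hypothetical stand-in for the target model's greedy
    choice given a prefix. A real kernel would score every draft position in
    one batched forward pass; it is called per position here for clarity.
    """
    accepted: List[int] = []
    prefix = list(context)
    for tok in draft:
        target = next_token(prefix)
        if target != tok:
            # First mismatch: keep the target model's token and stop.
            accepted.append(target)
            return accepted
        accepted.append(tok)
        prefix.append(tok)
    # Every draft token matched: the target model yields one extra token.
    accepted.append(next_token(prefix))
    return accepted
```

Each step commits at least one token, so a bad draft costs no correctness, only the wasted verification work.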
General TODO (to be extended)
- Python interface parameters to choose whether speculative decoding is used, which method to use, and the parameters each method needs.
- Parallel decoding support in the different kernels
- Mask support for topology-aware verification (e.g. SpecInfer)
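As a rough illustration of the mask item above, the sketch below builds an attention mask over a tree of draft tokens, where each draft token may attend only to itself and its ancestors. The parent-array encoding and the name `tree_attention_mask` are assumptions made for this example, not MPK's actual data layout. Attention to the already-committed prefix is omitted.

```python
from typing import List, Sequence


def tree_attention_mask(parents: Sequence[int]) -> List[List[bool]]:
    """Build an n x n mask for n draft tokens arranged in a tree.

    parents[i] is the index of node i's parent draft token, or -1 if the node
    hangs directly off the committed prefix. Assumes parents[i] < i, i.e.
    nodes are listed in topological order.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True  # node i sees itself and every ancestor
            j = parents[j]
    return mask
```

For a linear chain such as `parents = [-1, 0, 1]` this reduces to the usual lower-triangular causal mask, so sequence-style drafts are a special case of the tree case.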
Method-specific TODO
Please see the issue page for each method.
Methods to support
Model-free methods
- Prompt Lookup Decoding @NorthmanPKU
- Lookahead Decoding
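For context, the drafting step of prompt lookup decoding can be sketched as below: match the last few tokens against earlier text and propose whatever followed that earlier occurrence. The function name, the parameters `max_ngram` and `num_draft`, and the choice to prefer the most recent match are assumptions for this sketch, not the planned MPK interface.

```python
from typing import List, Sequence


def prompt_lookup_draft(
    tokens: Sequence[int], max_ngram: int = 3, num_draft: int = 5
) -> List[int]:
    """Propose draft tokens by n-gram lookup in the existing token sequence.

    Tries the longest suffix first, then shorter ones. Returns an empty list
    when no earlier occurrence of any suffix exists.
    """
    n_tokens = len(tokens)
    for n in range(min(max_ngram, n_tokens - 1), 0, -1):
        suffix = list(tokens[-n:])
        # Scan right to left so the most recent earlier match wins; the start
        # bound excludes the suffix's own position.
        for start in range(n_tokens - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                # Copy the tokens that followed the earlier occurrence.
                return list(tokens[start + n:start + n + num_draft])
    return []
```

The returned draft still has to be verified by the target model, so a wrong guess only costs the wasted verification work.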