Multi-Token Prediction Just Landed in llama.cpp — Here's What's Actually Available
For anyone running local LLMs, inference speed is the constant bottleneck. You either buy faster hardware, run smaller models, or accept the wait. Multi-token prediction support in llama.cpp changes that equation, but the reality for local operators is more nuanced than the headlines suggest.