v1.1.0
- ⬆️ Adds support for vllm v0.11.0
- 🔥 Drops support for vllm v0.10.1.1
- ✨ Writes performance metrics to file when `VLLM_SPYRE_PERF_METRIC_LOGGING_ENABLED` is set (see the sketch after this list)
- 🐛 Fixes a bug where incorrect logits processors were applied to requests under load
- 🐛 Fixes a bug where `/chat/completions` required a user-specified `max_tokens` param to function
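A minimal sketch of enabling the new perf-metric logging. Only the variable name `VLLM_SPYRE_PERF_METRIC_LOGGING_ENABLED` comes from these notes; the `"1"` value, the `vllm serve` invocation, and the model placeholder are illustrative assumptions.

```python
import os
import subprocess

# Sketch only: export the flag before launching the server so the Spyre
# plugin writes performance metrics to file. Treating "1" as the enabling
# value is an assumption; the model name below is a placeholder.
env = dict(os.environ, VLLM_SPYRE_PERF_METRIC_LOGGING_ENABLED="1")
subprocess.run(["vllm", "serve", "<your-model>"], env=env)
```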
What's Changed
- fix: unbatch removals of requests from input_batch by @tjohnson31415 in #511
- 🐛 fixup more tests to use the default max model length by @joerunde in #512
- ✨ Add vLLM 0.11.0 support by @joerunde in #513
- [CB] consistent max context length by @yannicks1 in #514
- [docs] rephrase comment about continuous batching configuration by @yannicks1 in #518
- [CB] set new_tokens to max value given the constraints by @yannicks1 in #516
- ✨ add debug perf logger by @joerunde in #515
Full Changelog: v1.0.2...v1.1.0