Releases: vllm-project/vllm-spyre
v1.2.3
Includes a change required to support Torch >= 2.8
What's Changed
- fix: use sampling during warmup and disable backed_size_oblivious after model compilation by @tjohnson31415 in #551
Full Changelog: v1.2.2...v1.2.3
v1.2.2
v1.2.1
v1.2.0
- ✨ Adds a custom `GoldenTokenInjector` `LogitsProcessor` for evaluating model quality
- ✨ Initial Granite 4 model support
- 🐛 Fixes a bug where `min_tokens` was not behaving properly (forced longer sequences than desired)
- 🐛 Fixes a bug in handling `top_k` that could crash the server (both parameters are exercised in the sketch after this list)
- 📝 Adds a `runtime_config_validator` to check and warn about unsupported model configurations
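Since both sampling fixes touch standard vLLM parameters, a minimal offline sketch is shown below; the model name is illustrative rather than prescribed by this release, and it assumes a working vllm-spyre install.

```python
# Minimal sketch exercising the fixed sampling params.
# The model name is an illustrative choice, not from the release notes.
from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-3.3-8b-instruct")

# min_tokens forces at least that many generated tokens (previously it
# forced longer sequences than requested); top_k limits sampling to the
# k most likely tokens (previously it could crash the server).
params = SamplingParams(max_tokens=64, min_tokens=8, top_k=50)

outputs = llm.generate(["Write a haiku about batching."], params)
print(outputs[0].outputs[0].text)
```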
What's Changed
- update and expand online example to continuous batching by @yannicks1 in #517
- refact: removed unnecessary logits processor by @wallashss in #520
- test: update tests to use golden token injection by @wallashss in #510
- Fix test model revision usage by @prashantgupta24 in #522
- [ppc64le] Update ppc64le dependencies by @Daniel-Schenker in #524
- [CI] Enable model revisions in GHA test by @ckadner in #523
- Manage supported model configurations by @ckadner in #445
- 📜 Add documentation and diagrams on the plugin architecture by @maxdebayser in #530
- ♻️ Simplify env var overrides and add tests by @joerunde in #525
- [Docs] Add arch doc to dev guide view by @rafvasq in #534
- Granite4 2b & 3b support by @yannicks1 in #496
- 🐛 Fix fp8 model name check with quantization check by @gkumbhat in #535
- 📝 add supported torch versions by @joerunde in #528
- add e5-multilingual to known configurations by @maxdebayser in #533
- fix: logits processor state at each step by @wallashss in #544
- fix crashes with the usage of top_k by @tjohnson31415 in #543
- feat: improve golden token injection by @maxdebayser in #540
- Update links to granite FP8 model by @ckadner in #539
- fix: min_tokens > 1 causes long generation with continuous batching by @tjohnson31415 in #545
Full Changelog: v1.1.0...v1.2.0
v1.1.0
- ⬆️ Adds support for vllm v0.11.0
- 🔥 Drops support for vllm v0.10.1.1
- ✨ Writes performance metrics to file when `VLLM_SPYRE_PERF_METRIC_LOGGING_ENABLED` is set (see the sketch after this list)
- 🐛 Fixes a bug where incorrect logits processors were applied to requests under load
- 🐛 Fixes a bug where `/chat/completions` required a user-specified `max_tokens` param to function
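As a rough illustration of the new metrics flag, the sketch below sets it before vLLM is imported so the plugin sees it at startup; the value `"1"`, the model name, and the offline flow are assumptions, so consult the vllm-spyre docs for the exact semantics and output location.

```python
# Hedged sketch: enable perf metric logging before importing vLLM.
# The value "1" is an assumption about how the flag is parsed.
import os

os.environ["VLLM_SPYRE_PERF_METRIC_LOGGING_ENABLED"] = "1"

from vllm import LLM, SamplingParams  # import after setting the flag

llm = LLM(model="ibm-granite/granite-3.3-8b-instruct")  # illustrative model
llm.generate(["Hello, Spyre."], SamplingParams(max_tokens=16))
# Performance metrics are written to file as described above.
```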
What's Changed
- fix: unbatch removals of requests from input_batch by @tjohnson31415 in #511
- 🐛 fixup more tests to use the default max model length by @joerunde in #512
- ✨ Add vLLM 0.11.0 support by @joerunde in #513
- [CB] consistent max context length by @yannicks1 in #514
- [docs] rephrase comment about continuous batching configuration by @yannicks1 in #518
- [CB] set new_tokens to max value given the constraints by @yannicks1 in #516
- ✨ add debug perf logger by @joerunde in #515
Full Changelog: v1.0.2...v1.1.0
v1.0.2
v1.0.2 patch - test fixes only
This contains fixes for our test suites to run with the full granite 8b models and to be compatible with post-1.0 versions of the spyre runtime stack.
What's Changed
- feat: golden token injector logits processor by @wallashss in #478
- 🐛 fixup full_model marker by @joerunde in #507
- 🐛 use 512 tokens instead of 256 by @joerunde in #509
Full Changelog: v1.0.1...v1.0.2
v1.0.1
1.0.1 Bugfix Release
This Release:
- Fixes a bug where cancelling multiple in-flight requests could crash the vllm server
- Fixes a bug where granite-3.x-8b models were not detected correctly, leading to `VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS` not functioning properly
- Fixes a bug where the number of processors was not detected correctly for setting threading configs. `VLLM_SPYRE_NUM_CPUS` is now available as a manual override to set the number of CPU cores available to vLLM (see the sketch after this list)
- Fixes a bug where attempting to run pooling models in continuous batching mode would crash, instead of defaulting to static batching
- Fixes a bug where the lower bound of FMS was not properly specified
- Disables prompt logprobs completely because it's still broken
- Updates the "simple compile backend" to
inductorto align with vLLM
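A minimal sketch of the new CPU override follows; the core count of 8 is illustrative, and it assumes the variable is read at startup, so it is set before vLLM initializes.

```python
# Hedged sketch of the VLLM_SPYRE_NUM_CPUS override added here.
# "8" is an illustrative core count, not a recommendation.
import os

os.environ["VLLM_SPYRE_NUM_CPUS"] = "8"

from vllm import LLM  # import afterwards so threading configs pick it up

llm = LLM(model="ibm-granite/granite-3.3-8b-instruct")  # illustrative model
```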
What's Changed
- disable prompt logprobs by @yannicks1 in #486
- [docs] update docs continuous batching by @yannicks1 in #485
- 🐛 correct fms lower bound by @joerunde in #493
- 🎨 scheduler: make holdback queue a local variable by @yannicks1 in #465
- [CB] 🐛 fix padding of position ids by @yannicks1 in #495
- [s390x] Update s390x dependencies by @nikheal2 in #494
- fix: logits processors for CB by @wallashss in #484
- [fp8] fix cb scheduler step tests by @yannicks1 in #491
- 🔥 remove auto-marked xfail for fp8, include fp8 tests by default, add xfail manually by @prashantgupta24 in #490
- feat: add VLLM_SPYRE_NUM_CPUS and psutil to help with cpu checks by @tjohnson31415 in #487
- 🐛 implement better checking for granite by @joerunde in #500
- Better Error Handling for attempts to run CB with pooling models by @gmarinho2 in #476
- 🔥 remove unused test parametrizations by @joerunde in #505
- 🔧 Update default simple compile backend by @joerunde in #506
Full Changelog: v1.0.0...v1.0.1
v1.0.0
🎉 vllm-spyre v1.0.0 🎉
This release of vllm-spyre is compatible with the 1.0.0 version of the spyre runtime stack.
See the docs for a list of supported models and configurations.
Supported Features:
- ⚡⚡⚡ Production-ready continuous batching (with `VLLM_SPYRE_USE_CB=1`) for a GPU-like user experience
- 🤓 Accurate text generation results with continuous batching for contexts up to 32k
- 🤏 Support for FP8-quantized models
- 🥅 Support for enforcing pre-compiled model graphs with `VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS=1` (see the sketch after this list)
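A minimal sketch combining the two flags from the list above; the model name and the 32k context length are illustrative, and it assumes both flags are read at process startup.

```python
# Hedged sketch: continuous batching plus enforced pre-compiled graphs.
import os

os.environ["VLLM_SPYRE_USE_CB"] = "1"  # continuous batching
os.environ["VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS"] = "1"  # refuse on-the-fly compilation

from vllm import LLM  # import after the flags are set

# Contexts up to 32k are supported with continuous batching per the notes above.
llm = LLM(model="ibm-granite/granite-3.3-8b-instruct", max_model_len=32768)
```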
Known Issues:
- The container image for this release does not have the correct v1.0 spyre runtime stack installed and will not function properly; the Containerfile is still for demonstration purposes only
- Logits processors (custom and builtin) are not applied to the first generated token (prefill phase). Users might see incorrect results for the sampling params `min_p`, `logit_bias`, and `min_tokens`.
- It is possible to crash the server with an IndexError and a stack trace pointing at `logits[self.logits_slice] = -float("inf")` when sending and cancelling batches of requests with certain parameters; see #492
- The lower bound for ibm-fms is wrong; it should be <= 1.4.0. The lockfile contains a valid set of dependencies. See #493
- For reranker models with the sendnn backend, output scores can differ by up to 15% from sentence-transformers inference on GPU or CPU.
What's Changed
- [CB] 🧹 moving VLLM_SPYRE_MAX_WAITING_TIME_SECONDS to dev branch by @yannicks1 in #459
- [fp8] fix tests: increase ISCLOSE_ABS_TOL_QUANTIZATION by @yannicks1 in #460
- Fix dimension of tensor passed to transformer classifier by @maxdebayser in #458
- [CB][FP8] throw error for batch size 1 by @yannicks1 in #467
- fix: tests for graph comparison with FP8 by @wallashss in #462
- Add Sampling Params tests by @gmarinho2 in #379
- ⬆️ bump vllm lower bound to 0.10.1.1 by @prashantgupta24 in #468
- feat: enable custom logits processors by @wallashss in #473
- 🐛 override flex_hdma_p2psize by @joerunde in #475
- test: restored test_swap_decode_programs_for_cb with 32K context by @wallashss in #474
- [Tests] Enable up to 32k by @rafvasq in #472
- ⬆️ bump vllm upper bound to support 0.10.2 by @prashantgupta24 in #463
- get the token_type_ids from pooling params by @maxdebayser in #480
- disable transformers pooler by @maxdebayser in #481
- 🔥 rip out VLLM_SPYRE_TEST_BACKEND_LIST by @prashantgupta24 in #482
- Document supported model configurations by @ckadner in #479
- Disable compilation catalog by @gkumbhat in #471
- 🐛 use eager compile by @joerunde in #488
- [CB] optimization only return last block of prefill logits by @yannicks1 in #464
- [high prio] enable VLLM_SPYRE_ENABLE_PREFILL_OPTIMIZATION by default by @yannicks1 in #477
- fix: custom logits processor by @wallashss in #489
Full Changelog: v0.9.4...v1.0.0
v1.0.0rc3
1.0 Release Candidate 3
Includes 🌶️🌶️🌶️ performance optimizations and one test bugfix
What's Changed
- [CB] optimization only return last block of prefill logits by @yannicks1 in #464
- [high prio] enable VLLM_SPYRE_ENABLE_PREFILL_OPTIMIZATION by default by @yannicks1 in #477
- fix: custom logits processor by @wallashss in #489
Full Changelog: v1.0.0rc2...v1.0.0rc3
v1.0.0rc2
1.0 Release Candidate 2
This contains a critical bugfix for environments that do not have gcc and python3-devel installed
What's Changed
- 🔥 rip out VLLM_SPYRE_TEST_BACKEND_LIST by @prashantgupta24 in #482
- Document supported model configurations by @ckadner in #479
- Disable compilation catalog by @gkumbhat in #471
- 🐛 use eager compile by @joerunde in #488
Full Changelog: v1.0.0rc1...v1.0.0rc2