Releases: vllm-project/vllm-spyre

v1.2.3

12 Nov 17:46
9d049db

Includes a change required to support Torch >= 2.8

What's Changed

  • fix: use sampling during warmup and disable backed_size_oblivious after model compilation by @tjohnson31415 in #551

Full Changelog: v1.2.2...v1.2.3

v1.2.2

05 Nov 18:16
f081f4f

What's Changed

  • Remove aftu script copying, use directly by @rafvasq in #548

Full Changelog: v1.2.1...v1.2.2

v1.2.1

31 Oct 16:07
ddf3c4d

v1.2.1 Torch profiler bugfix release

  • 🐛 Fixes a bug where the AIU profiler crashes in tensor parallel mode

What's Changed

  • [profiler] fix multi-aiu profiling and add setable options by @mcalman in #519

Full Changelog: v1.2.0...v1.2.1

v1.2.0

29 Oct 17:14
07928f2

v1.2.0

  • ✨ Adds custom GoldenTokenInjector LogitsProcessor for evaluating model quality
  • ✨ Initial Granite 4 model support
  • 🐛 Fixes a bug where min_tokens was not behaving correctly (it forced longer sequences than requested)
  • 🐛 Fixes a bug in handling top_k that could crash the server (see the sampling sketch after this list)
  • 📝 Adds a runtime_config_validator that checks model configurations and warns about unsupported ones
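
As a hedged illustration of the sampling parameters touched by these fixes, the sketch below issues a request with min_tokens and top_k through vLLM's offline API. The model name is only a placeholder, and the GoldenTokenInjector is not shown because its interface is specific to this plugin.

```python
# Minimal sketch (not taken from the release docs): exercise the sampling
# parameters affected by the v1.2.0 fixes via vLLM's offline API.
from vllm import LLM, SamplingParams

# Illustrative model name; substitute a model supported by your deployment.
llm = LLM(model="ibm-granite/granite-3.3-8b-instruct")

params = SamplingParams(
    max_tokens=64,
    min_tokens=8,  # previously could force longer sequences than requested
    top_k=50,      # previously could crash the server in some cases
)

outputs = llm.generate(["Write a haiku about hardware accelerators."], params)
print(outputs[0].outputs[0].text)
```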

What's Changed

Full Changelog: v1.1.0...v1.2.0

v1.1.0

10 Oct 23:28
dff277b

v1.1.0

  • ⬆️ Adds support for vllm v0.11.0
  • 🔥 Drops support for vllm v0.10.1.1
  • ✨ Writes performance metrics to a file when VLLM_SPYRE_PERF_METRIC_LOGGING_ENABLED is set (see the sketch after this list)
  • 🐛 Fixes a bug where incorrect logits processors were applied to requests under load
  • 🐛 Fixes a bug where /chat/completions required a user-specified max_tokens param to function
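
A minimal sketch of turning on the new performance-metric logging, assuming the flag is read from the environment before the engine starts; the value "1" and the model name are assumptions, so check the vllm-spyre docs for the exact accepted values and output location.

```python
# Sketch only: enable vllm-spyre performance-metric logging before the
# engine is created. The value "1" is an assumption.
import os

os.environ["VLLM_SPYRE_PERF_METRIC_LOGGING_ENABLED"] = "1"

from vllm import LLM  # import after setting the flag so the plugin sees it

llm = LLM(model="ibm-granite/granite-3.3-8b-instruct")  # illustrative model
print(llm.generate(["Hello"])[0].outputs[0].text)
```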

What's Changed

Full Changelog: v1.0.2...v1.1.0

v1.0.2

07 Oct 21:14
0c9b971

v1.0.2 patch: test fixes only

This release contains fixes that allow our test suites to run with the full Granite 8B models and to be compatible with post-1.0 versions of the Spyre runtime stack

What's Changed

Full Changelog: v1.0.1...v1.0.2

v1.0.1

06 Oct 20:41
0ae7872

1.0.1 Bugfix Release

This release:

  1. Fixes a bug where cancelling multiple in-flight requests could crash the vLLM server
  2. Fixes a bug where granite-3.x-8b models were not detected correctly, leading to VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS not functioning properly
  3. Fixes a bug where the number of processors was not detected correctly for setting threading configs.
    1. VLLM_SPYRE_NUM_CPUS is now available as a manual override to set the number of CPU cores available to vLLM (see the sketch after this list)
  4. Fixes a bug where attempting to run pooling models in continuous batching mode would crash, instead of defaulting to static batching
  5. Fixes a bug where the lower bound of FMS was not properly specified
  6. Disables prompt logprobs entirely because support for them is still broken
  7. Updates the "simple compile backend" to inductor to align with vLLM
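
A hedged sketch of the CPU-count override from item 3: set VLLM_SPYRE_NUM_CPUS before starting vLLM when automatic detection misbehaves. The value shown is only an example.

```python
# Sketch only: manually override the CPU core count used for threading
# configuration (item 3 above). "8" is an arbitrary example value.
import os

os.environ["VLLM_SPYRE_NUM_CPUS"] = "8"

from vllm import LLM  # construct the engine after the override is in place

llm = LLM(model="ibm-granite/granite-3.3-8b-instruct")  # illustrative model
```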

What's Changed

New Contributors

Full Changelog: v1.0.0...v1.0.1

v1.0.0

29 Sep 22:42
88350f8

🎉 vllm-spyre v1.0.0 🎉

This release of vllm-spyre is compatible with version 1.0.0 of the Spyre runtime stack.

See the docs for a list of supported models and configurations

Supported Features:

  • ⚡⚡⚡ Production-ready continuous batching (with VLLM_SPYRE_USE_CB=1) for a GPU-like user experience
  • 🤓 Accurate text generation results with continuous batching for contexts up to 32k
  • 🤏 Support for FP8-quantized models
  • 🥅 Support for enforcing pre-compiled model graphs with VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS=1 (see the sketch after this list)
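
As a hedged sketch rather than an official recipe, the snippet below sets these two flags for an offline run; the flag values and model name are assumptions, and the accepted values are documented in the vllm-spyre docs.

```python
# Sketch only: opt in to continuous batching and require pre-compiled decoder
# graphs before constructing the engine. Values and model name are examples.
import os

os.environ["VLLM_SPYRE_USE_CB"] = "1"
os.environ["VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-3.3-8b-instruct")  # illustrative model
out = llm.generate(["Summarize continuous batching in one sentence."],
                   SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```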

Known Issues:

  • The container image for this release does not have the correct v1.0 Spyre runtime stack installed and will not function properly; the Containerfile is still for demonstration purposes only
  • Logits processors (custom and built-in) are not applied to the first generated token (prefill phase). Users might get incorrect results for the sampling params min_p, logit_bias, and min_tokens.
  • It is possible to crash the server with an IndexError and a stack trace pointing at logits[self.logits_slice] = -float("inf") when sending and cancelling batches of requests with certain parameters; see #492
  • The lower bound for ibm-fms is wrong; it should be <= 1.4.0. The lockfile contains a valid set of dependencies. See #493
  • For reranker models with the sendnn backend, the output scores can differ by up to 15% compared with sentence-transformers inference on a GPU or CPU.

What's Changed

Full Changelog: v0.9.4...v1.0.0

v1.0.0rc3

27 Sep 01:47
88350f8

v1.0.0rc3 Pre-release

1.0 Release Candidate 3

Includes 🌶️🌶️🌶️ performance optimizations and one test bugfix

What's Changed

  • [CB] optimization only return last block of prefill logits by @yannicks1 in #464
  • [high prio] enable VLLM_SPYRE_ENABLE_PREFILL_OPTIMIZATION by default by @yannicks1 in #477
  • fix: custom logits processor by @wallashss in #489

Full Changelog: v1.0.0rc2...v1.0.0rc3

v1.0.0rc2

27 Sep 01:20
7ebe354

v1.0.0rc2 Pre-release

1.0 Release Candidate 2

This contains a critical bugfix for environments that do not have gcc and python3-devel installed

What's Changed

Full Changelog: v1.0.0rc1...v1.0.0rc2