πŸš€

Xinference v1.17.0

Release Notes

⚠️ Important Notice

v1.17.0 is the last release of the Xinference v1 series.

βœ… Key Highlights

Discover the major improvements and new features in this release

🧩 MThreads GPU (MUSA) Support

Added native support for domestic MThreads GPUs, further improving multi-hardware ecosystem compatibility and bringing more flexibility to your AI deployments.

πŸ–ΌοΈ

Multi-modal Engine Upgrade

  • β€’ OCR: Added Apple MLX engine support
  • β€’ Image Models: Now support multi-engine switching
  • β€’ Video Models: Added GGUF quantization format support
πŸš€

vLLM Distributed & Enhancement

  • β€’ Fixed and improved multi-machine distributed inference for vLLM β‰₯ 0.11.0
  • β€’ Added RoPE Scaling and MTP (Multi-Token Prediction) parameter support
🧠 New Model Support

  • Qwen-Image-Edit-2511
  • Qwen-Image-2512

🌐 Community Edition Updates

Open-source enhancements and new features for everyone

πŸ“¦ Installation Methods

pip install:

pip install 'xinference==1.17.0'

Docker:

Pull the latest image or update via pip inside the container
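A minimal sketch of both routes, assuming the project's published `xprobe/xinference` image name (verify the exact tag on your registry):

```shell
# Pull the v1.17.0 image (image name/tag assumed, check your registry)
docker pull xprobe/xinference:v1.17.0

# Or update Xinference inside an already running container via pip
docker exec -it <container> pip install 'xinference==1.17.0'
```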

πŸ†• New Model Support

  • Qwen-Image-Edit-2511
  • Qwen-Image-2512
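Once upgraded, the new image models can be launched through the usual CLI flow; this invocation is a hypothetical sketch, so confirm flag names and the model-type value with `xinference launch --help`:

```shell
# Hypothetical launch of one of the newly supported models
xinference launch --model-name Qwen-Image-Edit-2511 --model-type image
```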

✨ New Features

  • βœ“ Support for enable_thinking parameter
  • βœ“ Added MThreads GPU (MUSA) support
  • βœ“ vLLM β‰₯ 0.11.0 distributed model launch
  • βœ“ OCR multi-engine + MLX backend
  • βœ“ Image models multi-engine switching
  • βœ“ Video models GGUF quantization
  • βœ“ Sentence-Transformers rerank auto batch
  • βœ“ Added FP4 inference support
  • βœ“ Added MiniMax tool call support

πŸ›  Enhancements

  • βœ“ vLLM MTP & RoPE Scaling parameters
  • βœ“ Model metadata updates (DeepSeek, OCR, R1)

🐞 Bug Fixes

  • βœ“ Fixed vLLM embedding/rerank empty cache
  • βœ“ Fixed worker duplicate selection
  • βœ“ Fixed vLLM OCR model stop issue
  • βœ“ Fixed model download cancel issue

πŸ“š Documentation Updates

πŸ“„ Updated v1.16.0 release notes
🐳 Improved Docker documentation
πŸ”§ vLLM + Torch compatibility notes

🏒 Enterprise Edition Updates

Advanced features for production deployments at scale

☸️ Kubernetes Support

  • βœ“ Optimized deployment and scheduling in K8s environments
  • βœ“ Improved stability in multi-node scenarios
  • βœ“ Enhanced maintainability for multi-replica deployments
⚑

KV Cache Architecture

  • βœ“ Decentralized, engine-agnostic KV cache storage
  • βœ“ Cross-engine PD separation (Prefill/Decode)
  • βœ“ Foundation for heterogeneous inference collaboration

Ready to Upgrade?

Experience the latest features and improvements in Xinference v1.17.0