🚀 Xinference v1.16.0 Release Notes

New Features, Enhancements, and Bug Fixes

Highlights

🧩 VASTAI GPU (VACC) Support

Added support for VASTAI GPUs (VACC), now extended to VLM (vision-language model) workloads, further broadening the supported hardware ecosystem.

🍎 Apple MLX Backend - Continuous Batching

MLX chat models now support continuous batching, enabling concurrent request processing and significantly improving throughput under concurrent load.
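To exercise this, a client can simply issue overlapping requests against the server's OpenAI-compatible chat endpoint. The sketch below assumes a locally running Xinference server; the URL and the `my-mlx-chat` model uid are placeholders, not values from this release.

```python
"""Sketch: send several chat requests concurrently to a local Xinference
server. The endpoint URL and model uid are illustrative placeholders."""
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

ENDPOINT = "http://localhost:9997/v1/chat/completions"  # OpenAI-compatible API
MODEL_UID = "my-mlx-chat"  # placeholder uid of a launched MLX chat model


def build_payload(prompt: str, model_uid: str = MODEL_UID) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model_uid,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }


def ask(prompt: str) -> str:
    """POST one chat request and return the assistant's reply text."""
    req = Request(
        ENDPOINT,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


def run_concurrently(prompts: list[str]) -> list[str]:
    # With continuous batching, these overlapping requests are batched by
    # the MLX backend rather than served strictly one at a time.
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        return list(pool.map(ask, prompts))
```

Before this release, the same client code would still work against an MLX chat model, but requests were effectively serialized; continuous batching is what makes the concurrent calls pay off.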

🧠 New Model Support

  • Qwen-Image-Layered
  • Fun-ASR-Nano-2512
  • Fun-ASR-MLT-Nano-2512

⚠️ Python Version Support Change

Starting from this version, Python 3.9 is no longer supported. Please use Python 3.10 or above.
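For scripts that embed xinference, an explicit guard makes the new floor visible early. This is an illustrative snippet, not part of xinference itself:

```python
# Illustrative startup guard: fail fast on interpreters older than the
# new minimum (Python 3.10) before importing xinference.
import sys


def require_python(version_info=None, minimum=(3, 10)):
    """Raise RuntimeError if the (given or running) interpreter is too old."""
    version_info = sys.version_info if version_info is None else version_info
    if tuple(version_info[:2]) < minimum:
        raise RuntimeError(
            "Xinference 1.16.0 requires Python %d.%d or above" % minimum
        )
```

Calling `require_python()` with no arguments checks the running interpreter and is a no-op on Python 3.10+.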

🌐 Community Edition Updates

📦 Installation

  • Pip install: pip install 'xinference==1.16.0'
  • Docker: Pull the latest image or update via pip in the container

🆕 New Model Support

  • Qwen-Image-Layered
  • Fun-ASR-Nano-2512
  • Fun-ASR-MLT-Nano-2512

New Features

  • vLLM Backend: Added vLLM engine support for DeepSeek-V3.2 / DeepSeek-V3.2-Exp
  • VACC (VASTAI GPU): Support for LLM and VLM inference
  • MLX: Chat models support continuous batching for concurrent inference
  • Rerank: Support for async batch processing
  • Model Launch: Added `architectures` field
  • UI: Image models support configuration via environment variables and custom parameters
  • MiniMaxM2ForCausalLM: Added vLLM backend support
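The idea behind async batch rerank can be sketched client-side: split a large document list into batches and score them concurrently. Everything below is conceptual; `score_batch` is a placeholder for a real call to a rerank model, not Xinference's implementation.

```python
# Conceptual sketch of async batch reranking: documents are split into
# batches, each batch is scored concurrently, and results are merged.
import asyncio


def chunked(items, size):
    """Yield successive fixed-size batches from items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


async def score_batch(query, batch):
    # Placeholder scorer: a real client would send query+batch to a rerank
    # model and get relevance scores back. Here we score by word overlap.
    await asyncio.sleep(0)  # yield control, as real network I/O would
    query_words = set(query.split())
    return [float(len(query_words & set(doc.split()))) for doc in batch]


async def rerank(query, documents, batch_size=4):
    """Score all documents in concurrent batches; return them ranked."""
    batches = list(chunked(documents, batch_size))
    results = await asyncio.gather(*(score_batch(query, b) for b in batches))
    scores = [s for batch_scores in results for s in batch_scores]
    return sorted(zip(documents, scores), key=lambda p: p[1], reverse=True)
```

The batches are dispatched with `asyncio.gather`, so a slow batch no longer blocks the others from being scored.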

🛠 Enhancements

  • Optimized replica allocation so that GPU indices are assigned more contiguously
  • Docker image upgraded to CUDA 12.9, using vLLM v0.11.2
  • Support for torchaudio 2.9.0
  • Ongoing updates to model metadata JSON (DeepSeek, GLM, LLaMA, Jina, Z-Image, etc.)

🐞 Bug Fixes

  • Fixed PaddleOCR-VL output anomalies
  • Fixed custom embedding / rerank analysis errors
  • Fixed CPU startup and multi-worker startup issues
  • Fixed OCR API returning empty results
  • Fixed `n_gpu` parameter handling issues

📚 Documentation Updates

  • Updated new model documentation
  • Added v1.15.0 release documentation

🏢 Enterprise Edition Updates

  • Ascend Performance Optimization: Further improved inference performance and stability on Ascend platform
  • Enhanced Fine-tuning: Strengthened fine-tuning workflow and capabilities, supporting more complex enterprise-level training and tuning requirements