
Release Notes

Xinference v1.14.0 brings exciting new features including parallel model loading, enhanced virtual environments, and improved engine diagnostics.

✨ Key Highlights

Parallel Model Loading

  • Multiple replicas support parallel loading
  • UI displays independent loading progress
  • View all replica statuses on instance page
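The idea behind parallel replica loading can be sketched with a small Python snippet. This is an illustration of the scheduling pattern only, not Xinference's actual implementation; `load_replica` is a hypothetical placeholder for real model initialization.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_replica(replica_id: int) -> str:
    """Simulate loading one model replica (stand-in for weight loading)."""
    time.sleep(0.1)  # placeholder for real initialization work
    return f"replica-{replica_id}: ready"

# Launch all replica loads concurrently instead of one after another,
# so total wall time approaches the slowest single load rather than the sum.
with ThreadPoolExecutor() as pool:
    statuses = list(pool.map(load_replica, range(3)))

for status in statuses:
    print(status)
```

Each replica's progress can then be reported independently, which is what the new UI surfaces on the instance page.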

Enhanced Virtual Environment

Virtual environments can now be managed and deleted directly

Engine Diagnostics

Displays the specific reason when an engine is unavailable, so troubleshooting no longer requires guesswork
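A minimal sketch of this kind of diagnostic check: instead of a bare "unavailable" flag, the status carries the cause, here a missing package. The function name and message format are illustrative assumptions, not Xinference's actual API.

```python
import importlib.util

def engine_status(engine: str, required_module: str) -> str:
    """Return a human-readable availability status for an inference engine.

    Illustrative only: reports *why* the engine cannot be used
    (e.g. its backing package is not installed) rather than a bare flag.
    """
    if importlib.util.find_spec(required_module) is None:
        return f"{engine}: unavailable ({required_module} is not installed)"
    return f"{engine}: available"

# Hypothetical check: report whether vLLM can be used on this machine.
print(engine_status("vLLM", "vllm"))
```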

Rerank with llama.cpp

Rerank models now support llama.cpp backend for lightweight inference

vLLM 0.11.1+ Compatible

Full compatibility with the latest vLLM version

FLUX.2-dev Support

High-quality image generation model with enhanced capabilities

🌐 Community Edition

📦 Installation

pip install: pip install 'xinference==1.14.0'

Docker: Pull the latest image, or update with pip inside the container

🆕 New Model Support

  • HunyuanOCR
  • FLUX.2-dev

✨ New Features

  • Parallel model replica loading
  • Manageable and deletable virtual environments
  • Rerank models support the llama.cpp backend
  • vLLM 0.11.1+ compatibility

🛠️ Build & Fixes

  • Fixed gradio 6.x UI model startup exceptions
  • Fixed GPU selection issues in hybrid CPU/GPU clusters
  • Compatible with xllamacpp 0.2.5+
  • Fixed DeepSeek-OCR Docker errors
  • Fixed multimodal model cache display issues
  • Tool call ID changed to UUID to avoid conflicts
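The UUID change in the list above can be illustrated with a short sketch. Sequential or timestamp-based counters can collide when several replicas or sessions generate tool call IDs independently; random UUIDs make collisions negligible without coordination. The `call_` prefix and helper name here are illustrative assumptions, not Xinference's actual format.

```python
import uuid

def new_tool_call_id() -> str:
    """Generate a collision-resistant tool call ID (illustrative format)."""
    return f"call_{uuid.uuid4().hex}"

# Even generating many IDs independently yields no duplicates in practice.
ids = {new_tool_call_id() for _ in range(10_000)}
print(len(ids))
```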

📚 Documentation Updates

  • Updated v1.13.0 release documentation
  • Updated model and documentation generation process

🏢 Enterprise Edition

Performance Improvements

Multi-replica parallel loading and faster inference scheduling

Enhanced Stability

Fixed multiple issues for more reliable enterprise cluster operations