
Release Notes

Xinference v1.14.0 brings exciting new features including parallel model loading, enhanced virtual environments, and improved engine diagnostics.

✨ Key Highlights

Parallel Model Loading

  • Multiple replicas support parallel loading
  • UI displays independent loading progress
  • View all replica statuses on instance page
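The idea behind parallel replica loading can be sketched with a small Python snippet. This is an illustration of the scheduling pattern only, not Xinference's actual implementation; `load_replica` is a hypothetical placeholder for real model initialization.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_replica(replica_id: int) -> str:
    """Simulate loading one model replica (stand-in for weight loading)."""
    time.sleep(0.1)  # placeholder for real initialization work
    return f"replica-{replica_id}: ready"

# Launch all replica loads concurrently instead of one after another,
# so total wall time approaches the slowest single load rather than the sum.
with ThreadPoolExecutor() as pool:
    statuses = list(pool.map(load_replica, range(3)))

for status in statuses:
    print(status)
```

Each replica's progress can then be reported independently, which is what the new UI surfaces on the instance page.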

Enhanced Virtual Environment

Virtual environments can now be managed and deleted directly

Engine Diagnostics

Displays the specific reason when an engine is unavailable, so troubleshooting no longer requires guesswork
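A minimal sketch of this kind of diagnostic check: instead of a bare "unavailable" flag, the status carries the cause, here a missing package. The function name and message format are illustrative assumptions, not Xinference's actual API.

```python
import importlib.util

def engine_status(engine: str, required_module: str) -> str:
    """Return a human-readable availability status for an inference engine.

    Illustrative only: reports *why* the engine cannot be used
    (e.g. its backing package is not installed) rather than a bare flag.
    """
    if importlib.util.find_spec(required_module) is None:
        return f"{engine}: unavailable ({required_module} is not installed)"
    return f"{engine}: available"

# Hypothetical check: report whether vLLM can be used on this machine.
print(engine_status("vLLM", "vllm"))
```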

Rerank with llama.cpp

Rerank models now support llama.cpp backend for lightweight inference

vLLM 0.11.1+ Compatible

Full compatibility with the latest vLLM version

FLUX.2-dev Support

High-quality image generation model with enhanced capabilities

🌐 Community Edition

📦 Installation

pip install: pip install 'xinference==1.14.0'

Docker: Pull the latest image, or update with pip inside the container

🆕 New Model Support

  • HunyuanOCR
  • FLUX.2-dev

✨ New Features

  • Parallel model replica loading
  • Manageable and deletable virtual environments
  • Rerank models support the llama.cpp backend
  • vLLM 0.11.1+ compatibility

🛠️ Build & Fixes

  • Fixed gradio 6.x UI model startup exceptions
  • Fixed GPU selection issues in hybrid CPU/GPU clusters
  • Compatible with xllamacpp 0.2.5+
  • Fixed DeepSeek-OCR Docker errors
  • Fixed multimodal model cache display issues
  • Tool call ID changed to UUID to avoid conflicts
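The UUID change in the list above can be illustrated with a short sketch. Sequential or timestamp-based counters can collide when several replicas or sessions generate tool call IDs independently; random UUIDs make collisions negligible without coordination. The `call_` prefix and helper name here are illustrative assumptions, not Xinference's actual format.

```python
import uuid

def new_tool_call_id() -> str:
    """Generate a collision-resistant tool call ID (illustrative format)."""
    return f"call_{uuid.uuid4().hex}"

# Even generating many IDs independently yields no duplicates in practice.
ids = {new_tool_call_id() for _ in range(10_000)}
print(len(ids))
```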

📚 Documentation Updates

  • Updated v1.13.0 release documentation
  • Updated model and documentation generation process

🏢 Enterprise Edition

Performance Improvements

Multi-replica parallel loading and faster inference scheduling

Enhanced Stability

Fixed multiple issues for more reliable enterprise cluster operations