Xinference is a unified, production-ready inference platform. Effortlessly deploy the latest or your own models using just one command. Whether you are a researcher, developer, or data scientist, Xinference empowers you to unleash the full potential of AI today.

1-click deployment
pip install xinference
or run via Docker
Unified OpenAI compatible API to access any model
from xinference_client import RESTfulClient as Client

# Connect to a local Xinference server (9997 is the default port)
client = Client("http://localhost:9997")

# MODEL_UID is the UID returned when the model was launched
model = client.get_model("MODEL_UID")
model.chat(messages=[{"role": "user", "content": "What is an inference platform?"}])
Heterogeneous GPU and hardware abstraction
GPU optimisation
Model lifecycle management
Autoscaling & performance optimisation
Access 300+ LLMs, embedding & multimodal models
OpenAI · Claude · Gemini · Grok · DeepSeek · Qwen
Public cloud
AWS GCP Azure Oracle Alibaba Cloud
On-premise
Hybrid environments

A high-performance inference platform that lets you scale with confidence

From single-command deployment to enterprise-grade cluster management — Xinference handles it all.

Comprehensive Model Repository

Integrates 100+ of the latest models, including mainstream models like DeepSeek, Qwen3, and InternVL, and supports voice, multimodal, and other model types out of the box.

Enterprise-grade Management Functions

Provides fine-tuning support, permission management, monitoring, batch processing, and other enterprise-grade functions to meet professional domain requirements in finance, healthcare, and more.

Extensive Computing Power Support

Comprehensive support for mainstream compute chips from Nvidia, Intel, AMD, Apple and other vendors, with unified scheduling across heterogeneous hardware.

Multi-Engine Concurrent Inference

Runs vLLM, SGLang, Transformers, MLX and other engines concurrently, providing large-scale, multi-feature inference services for enterprise workloads.

Enterprise-grade Distributed Deployment

Built on the self-developed Xoscar high-performance distributed computing foundation, supporting stable operation at 200,000-core scale with automatic load balancing and fault recovery.

High-Concurrency Optimisation

Optimised for enterprise high-concurrency scenarios, with structured output support, memory optimisation, and performance acceleration to ensure business continuity and stability.

Xinference on GitHub
OPEN SOURCE

Powered by Xinference, open source by the community, for the community

Xinference is built in the open. Contributions, issues, and ideas from the global developer community drive the project forward — join thousands of engineers already running Xinference in production.

Xinference Enterprise
Delivers better performance and enterprise-grade reliability

Feature | Open Source | Enterprise
Hardware Support | NVIDIA only | Comprehensive support for mainstream chips
Run Multiple Models on a Single GPU |  | Powering higher GPU utilisation
Enterprise Features |  | Enterprise-grade management functions
Super-Charged Performance | N/A | Up to 2× faster than open source

Learn more about our research — Xinference documentation ↗

Trusted by teams building at scale

See how teams optimise performance and cut costs with Xinference

40%
reduction in AI infrastructure costs
DataCore →
faster model inference for production workloads
NeuralSoft →

"Xinference aligned with our vision: to iterate faster, scale smarter, and operate more efficiently across all our AI workloads."

Sarah Chen
Head of AI Platform
CloudScale →

"We chose Xinference not just for what we needed today, but for where we know we're heading. As our AI workloads grow more complex, Xinference gives us the infrastructure to scale without limits."

James Park
Engineering Lead, AI Infrastructure
QuantumAI →
99%
uptime SLA across enterprise deployments
Infratech →
10×
faster model deployment vs. previous solution
Synapse Labs →
increase in model throughput after migration
VectorEdge →
$2M+
annual GPU cost savings across 12 deployments
Orbis AI →

"Switching to Xinference cut our time-to-deploy from days to minutes. The team finally has the breathing room to focus on model quality instead of infrastructure."

Priya Nair
VP of Machine Learning
DeepLayer →

Inferencing made better.
Run any model with total control.

One-click deployment  |  Complete control from day one

Frequently Asked Questions

Everything you need to know about Xinference and how it fits into your AI stack.

What is Xinference and how does it work?

Xinference is an open-source platform that lets you deploy and serve large language models, embedding models, image models, and more — all through a unified API. It abstracts away the complexity of model loading, hardware management, and scaling so your team can focus on building applications.

How does Xinference compare to running models via cloud providers?

Cloud providers charge you for every token processed through their managed AI services, and your data passes through their infrastructure. With Xinference, you deploy models on your own infrastructure — cloud, on-prem, or hybrid.

Xinference is a unified, production-ready inference platform giving you full control over which models to run, which GPU to use, and where to deploy; all while ensuring best-in-class performance and cost optimisation.

How does pricing work?

Pricing is based on the number of nodes per cluster, billed annually. Xinference Enterprise costs US$15k per node per annum.

For example, a small deployment of 2 nodes (usually ~16 GPUs) would cost US$30k / annum, while a larger deployment of 250 nodes (usually ~2,000 GPUs) would cost US$3.75m / annum. Running multiple clusters means each cluster is billed separately.
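The per-node pricing above reduces to simple arithmetic; a minimal sketch (the function name is illustrative, not part of any Xinference API):

```python
PRICE_PER_NODE = 15_000  # US$ per node, per annum

def annual_cost(nodes_per_cluster, clusters=1):
    """Total annual cost; billing applies separately to each cluster."""
    return PRICE_PER_NODE * nodes_per_cluster * clusters

small = annual_cost(2)    # 2-node cluster  -> 30_000
large = annual_cost(250)  # 250-node cluster -> 3_750_000
```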

What is the difference between the open source and Enterprise solution?

Xinference Enterprise delivers better performance and enterprise-grade reliability. Our customers pick the Enterprise solution because it delivers comprehensive hardware compatibility, enables running multiple models on a single GPU, and super-charges performance with up to 2× greater throughput.

Most importantly, Xinference Enterprise comes with critical enterprise management features like RBAC, audit logs, a unified management console and SLA guarantees.

How does Xinference handle data privacy?

With Xinference, you can choose to run your models on your own infrastructure — cloud or on-premises — so your prompts and data never leave your environment. This makes Xinference purpose-built for industries with strict data requirements like finance and healthcare.

Can Xinference integrate with our existing MLOps stack?

Xinference provides a RESTful API compatible with OpenAI's protocol, meaning any tool already built around OpenAI's API works with Xinference by changing a single line of code. Xinference integrates with popular third-party libraries including LangChain, LlamaIndex, Dify, and Chatbox. Kubernetes deployment via Helm is also supported for teams running containerised infrastructure.
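As a sketch of that compatibility, the request below targets Xinference's OpenAI-style chat-completions endpoint using only the Python standard library. The host and port are the defaults, and "MODEL_UID" is a placeholder for the UID of a model you have launched:

```python
import json
from urllib import request

# Default local Xinference endpoint, OpenAI-compatible path
XINFERENCE_URL = "http://localhost:9997/v1/chat/completions"

# Standard OpenAI chat-completions payload shape
payload = {
    "model": "MODEL_UID",
    "messages": [{"role": "user", "content": "What is an inference platform?"}],
}

req = request.Request(
    XINFERENCE_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = request.urlopen(req)  # requires a running Xinference server
```

Any OpenAI SDK can be pointed at the same endpoint by swapping its base URL, which is the single-line change mentioned above.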