How to Run vLLM as an OpenAI-Compatible API and Stop Burning Time on Local LLM Glue Code
A practical vLLM guide that shows how to serve a model with an OpenAI-compatible API, call it from Python, and understand why vLLM has become a popular choice for serious local or self-hosted inference workflows.
Why developers keep landing on vLLM
A lot of local model experiments die in a swamp of custom wrappers, uneven throughput, and one-off scripts. vLLM became popular because it gives teams a more serious serving layer instead of another toy runner.
What vLLM is good at
vLLM is designed for high-throughput inference and provides an OpenAI-compatible server interface, which is one of the biggest reasons it is useful in real developer workflows. If your app already knows how to talk to an OpenAI-style chat completion endpoint, you can often reuse much more of your integration layer than expected.
That is a major productivity win.
Install vLLM
The exact install path depends on your environment, but the official quickstart includes a straightforward pip install:
pip install vllm
If you are doing this in a dedicated Python project, pair it with a clean environment first.
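To confirm the package landed in the environment you expect, a small check like the one below can help. It is a minimal sketch using only the standard library; `installed_version` and `in_virtualenv` are hypothetical helper names, not part of vLLM.

```python
import sys
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a virtual environment."""
    # In a venv, sys.prefix points at the environment and base_prefix at the base install.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("vllm version:", installed_version("vllm"))
print("inside a virtual environment:", in_virtualenv())
```

If the first line prints `None`, the install most likely went into a different interpreter than the one running the script.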
Start an OpenAI-compatible server
Example:
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000
Once that process is running, you have a local endpoint that behaves like an OpenAI-style API.
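Loading model weights can take a while, so the port may not answer immediately after the command starts. The sketch below polls the endpoint until it responds. It assumes the server exposes the OpenAI-style `/v1/models` route on the host and port used above; `is_ready` and `wait_until_ready` are hypothetical helper names, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(base_url: str, attempts: int = 30, delay: float = 2.0, probe=is_ready) -> bool:
    """Poll `probe` up to `attempts` times, sleeping `delay` seconds between failed tries."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_until_ready("http://localhost:8000")` before the first request keeps startup scripts from failing on a server that is still loading. The `probe` parameter exists only so the loop can be exercised without a running server.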
Test it quickly with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a concise coding assistant."},
      {"role": "user", "content": "Explain what a healthcheck is in Docker Compose."}
    ]
  }'
That alone is a huge workflow simplifier compared with building your own local prompt endpoint from scratch.
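The same request can be sent from Python with nothing but the standard library, which makes the wire format explicit. This is a sketch that assumes the server from the previous step is listening on `localhost:8000`; `build_chat_request`, `extract_reply`, and `ask` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed: same host and port as the server command
MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def build_chat_request(user_text: str, system_text: str = "You are a concise coding assistant."):
    """Build the same POST request the curl command sends."""
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion body."""
    return body["choices"][0]["message"]["content"]


def ask(user_text: str, timeout: float = 60.0) -> str:
    """Send the request to the running server and return the assistant's text."""
    with urllib.request.urlopen(build_chat_request(user_text), timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `print(ask("Explain what a healthcheck is in Docker Compose."))` should print the model's answer. The reply text sits at `choices[0].message.content`, the same shape the OpenAI chat completions API uses.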
Use it from Python with the OpenAI SDK shape
If your client code already uses an OpenAI-compatible SDK, point it at the local base URL:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-for-local-dev",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to retry an HTTP request three times."},
    ],
)

print(response.choices[0].message.content)
This is where the OpenAI-compatible design stops being a buzzword and starts saving engineering time.
Why teams like this setup
It is attractive when you want:
- local experimentation without rewriting your whole app integration
- more control over model hosting
- self-hosted inference for privacy or cost reasons
- a cleaner path from prototype to internal service
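One way to get the first and last items on that list is to keep the endpoint details out of the code entirely. The sketch below reads them from environment variables and falls back to the local vLLM values used in this guide. `LLM_BASE_URL`, `LLM_API_KEY`, and `LLM_MODEL` are hypothetical variable names chosen for illustration, not something vLLM or the OpenAI SDK defines.

```python
import os

# Local defaults, taken from the examples in this guide.
DEFAULTS = {
    "base_url": "http://localhost:8000/v1",
    "api_key": "not-needed-for-local-dev",
    "model": "meta-llama/Llama-3.1-8B-Instruct",
}


def client_settings(env=None) -> dict:
    """Read endpoint settings from the environment, falling back to local defaults.

    LLM_BASE_URL, LLM_API_KEY and LLM_MODEL are hypothetical variable names.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", DEFAULTS["base_url"]),
        "api_key": env.get("LLM_API_KEY", DEFAULTS["api_key"]),
        "model": env.get("LLM_MODEL", DEFAULTS["model"]),
    }
```

The result plugs straight into the SDK: `settings = client_settings()` followed by `OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])`. Moving from a laptop to an internal service then becomes a deployment setting, not a code change.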
What still trips people up
They underestimate hardware constraints
Local inference is still real infrastructure work. Model size, GPU memory, batching behavior, and concurrency expectations all matter.
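A back-of-the-envelope estimate makes those constraints concrete. The sketch below assumes round numbers for an 8B-class model: about 8 billion parameters stored in 16-bit precision, 32 layers, 8 key/value heads, and a head dimension of 128. Treat every one of those as an assumption and check the model's own configuration before relying on them. The figures also ignore activations, framework overhead, and whatever memory the server reserves for itself.

```python
GIB = 1024 ** 3


def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone (2 bytes per parameter for fp16/bf16)."""
    return num_params * bytes_per_param


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """KV cache cost of one token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed figures for an 8B-class model; verify against the real model config.
weights = weight_bytes(8e9)
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

print(f"weights: {weights / GIB:.1f} GiB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache for 8 sequences of 4096 tokens: {8 * 4096 * per_token / GIB:.1f} GiB")
```

Under these assumptions the weights alone need roughly 15 GiB, and every concurrent long conversation adds to that. This is why concurrency expectations matter as much as model size.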
They treat “it runs” as “it is production ready”
Serving a model once is easy. Operating a reliable inference service is a different job. Throughput, latency, observability, and fallback behavior still matter.
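Fallback behavior and latency tracking do not need a framework to get started. The sketch below wraps any two callables, for example a call to the local vLLM endpoint and a call to a backup endpoint, and reports which one answered and how long it took. `call_with_fallback` is a hypothetical helper, and catching every exception is a simplification for illustration.

```python
import time


def call_with_fallback(primary, fallback, *args, **kwargs):
    """Call `primary`; if it raises, call `fallback` with the same arguments.

    Returns (result, source, seconds), where source is "primary" or "fallback"
    and seconds covers the whole attempt, including a failed primary call.
    """
    start = time.perf_counter()
    try:
        result, source = primary(*args, **kwargs), "primary"
    except Exception:
        # Simplification: real code should catch only the errors it expects
        # (timeouts, connection failures) and log the original exception.
        result, source = fallback(*args, **kwargs), "fallback"
    return result, source, time.perf_counter() - start
```

Logging `source` and `seconds` for every request is a cheap first step toward observability: it shows how often the primary endpoint fails and what latency users actually see.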
They over-customize too early
One of the best things about vLLM is that it gives you a standardized serving surface. Do not immediately bury that under five custom abstractions unless you have a real reason.
When to choose vLLM
Choose it when you want a serious inference service boundary instead of another notebook-only demo. If your team already has OpenAI-style client code, the compatibility layer alone can justify the decision because it cuts integration churn dramatically.
The real appeal of vLLM is not that it is trendy. It is that it lets developers spend less time inventing local LLM plumbing and more time testing actual product behavior.