Vast provides an OpenAI API-compatible proxy service that lets you point any application or library that works with the OpenAI API at a Vast Serverless vLLM endpoint instead. If your code already uses the OpenAI Python client (or any OpenAI-compatible HTTP client), you can switch to Vast by changing two values: the API key and the base URL.

Prerequisites

  • A Vast.ai account with a valid API key. You can find your key on the Account page.
  • An active Serverless endpoint running the vLLM template. See the Quickstart guide to create one.

How It Works

Vast runs a lightweight proxy at openai.vast.ai that accepts requests in the OpenAI API format and routes them to your Serverless vLLM endpoint. Your client sends a standard OpenAI request, the proxy translates it into a Vast Serverless call, and the response is returned in the OpenAI format your client expects. This means frameworks and tools built on the OpenAI SDK — such as LangChain, LlamaIndex, or custom chat applications — can use Vast Serverless without any code changes beyond updating credentials.
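Because the proxy speaks the OpenAI request/response format, a plain HTTP POST works just as well as the SDK. The sketch below uses only the Python standard library; the endpoint name my-endpoint is a placeholder, and the request path is assumed to mirror what the OpenAI SDK sends (base URL plus /chat/completions):

```python
import json
import urllib.request

PROXY_BASE = "https://openai.vast.ai"


def build_chat_request(endpoint_name, messages, max_tokens=128):
    """Build the URL and an OpenAI-format payload for the Vast proxy.

    The path here assumes the proxy sees the same path the OpenAI SDK
    produces: the base URL with /chat/completions appended.
    """
    url = f"{PROXY_BASE}/{endpoint_name}/chat/completions"
    payload = {
        "model": "",  # ignored by the proxy; the endpoint's MODEL_NAME decides
        "messages": messages,
        "max_tokens": max_tokens,
    }
    return url, payload


def send_chat_request(url, payload, api_key):
    """POST the payload and return the parsed JSON response (network call)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires a live endpoint):
# url, payload = build_chat_request("my-endpoint",
#                                   [{"role": "user", "content": "Hi"}])
# result = send_chat_request(url, payload, "<YOUR_VAST_API_KEY>")
# print(result["choices"][0]["message"]["content"])
```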

Migrating from OpenAI (or Another Provider)

If you already have an application that calls the OpenAI API (or another OpenAI-compatible provider such as Together AI, Anyscale, or a self-hosted vLLM instance), migration requires only two changes:
Setting     Before                                       After
API Key     Your OpenAI / provider key                   Your Vast API key
Base URL    https://api.openai.com/v1 (or provider URL)  https://openai.vast.ai/<ENDPOINT_NAME>
Replace <ENDPOINT_NAME> with the name of your Serverless endpoint. No other code changes are required — the proxy accepts the same request and response schema for the supported endpoints.
The model field is required by the OpenAI SDK but is ignored by the proxy. The model served is determined entirely by the MODEL_NAME environment variable set in your vLLM endpoint configuration. You can pass any string (including an empty string) for this field.
from openai import OpenAI

client = OpenAI(
    api_key="<YOUR_VAST_API_KEY>",
    base_url="https://openai.vast.ai/<ENDPOINT_NAME>",
)

response = client.chat.completions.create(
    model="",  # model is determined by your endpoint configuration
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain serverless computing in two sentences."},
    ],
    max_tokens=256,
    temperature=0.7,
)

print(response.choices[0].message.content)

Supported Endpoints

The proxy supports the following OpenAI-compatible endpoints exposed by vLLM:
Endpoint                Description
/v1/chat/completions    Multi-turn conversational completions
/v1/completions         Single-prompt text completions
Both endpoints support streaming ("stream": true). For detailed request/response schemas and parameters, see the vLLM template documentation.
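With the OpenAI client, streaming works by passing stream=True and iterating over the returned chunks. Since a vLLM backend can emit chunks whose delta is None or an empty string (see the Limitations section), it helps to filter those out. The helper below is an illustrative sketch, not part of the proxy; the endpoint name is a placeholder:

```python
def stream_text(stream):
    """Yield non-empty text deltas from an OpenAI-style chat stream.

    vLLM backends can emit chunks whose delta content is None (role or
    finish chunks) or an empty string (e.g. when chunked prefill is
    enabled), so both are skipped here.
    """
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # skips None and ""
            yield delta


# Usage with the OpenAI client (requires a live endpoint):
# from openai import OpenAI
# client = OpenAI(api_key="<YOUR_VAST_API_KEY>",
#                 base_url="https://openai.vast.ai/<ENDPOINT_NAME>")
# stream = client.chat.completions.create(
#     model="",
#     messages=[{"role": "user", "content": "Count to five."}],
#     stream=True,
# )
# print("".join(stream_text(stream)))
```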

Limitations

The OpenAI-compatible proxy is designed for text-in, text-out workloads only. Review the limitations below before integrating.

Text only

The proxy supports text inputs and text outputs only. The following OpenAI features are not supported:
  • Vision / image inputs — Passing images via image_url in message content is not supported.
  • Audio inputs and outputs — The /v1/audio endpoints (speech, transcription, translation) are not available.
  • Image generation — The /v1/images endpoint is not available.
  • Embeddings — The /v1/embeddings endpoint is not available through the proxy.
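A small client-side pre-flight check can catch unsupported multimodal inputs before a request reaches the proxy. The helper below is our own illustration (not part of the proxy) and assumes messages use the OpenAI content-part format:

```python
def assert_text_only(messages):
    """Raise ValueError if any message carries non-text content parts.

    The Vast proxy is text-in, text-out: image_url (vision) parts and
    other multimodal content blocks are not supported.
    """
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):  # OpenAI multimodal content-part format
            for part in content:
                if part.get("type") != "text":
                    raise ValueError(
                        f"unsupported content part for this proxy: "
                        f"{part.get('type')!r}"
                    )
```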

vLLM-specific differences from the OpenAI specification

Because the proxy routes to a vLLM backend rather than OpenAI’s own service, there are inherent differences between the two:
  • Tokenization — Token counts may differ from OpenAI models because vLLM uses the tokenizer bundled with the open-source model (e.g., Qwen, Llama). This can affect billing estimates and max_tokens behavior.
  • Streaming chunk boundaries — While the proxy uses the same Server-Sent Events (SSE) format, the exact boundaries of streamed chunks may differ. Some chunks may contain empty strings when chunked prefill is enabled.
  • Tool / function calling — Tool calling is supported on models that are fine-tuned for it, but behavior may differ from OpenAI’s implementation. The parallel_tool_calls parameter is not supported. See the vLLM template documentation for details.
  • Unsupported parameters — The following request parameters are accepted but ignored: user, suffix, and image_url.detail.
  • Response fields — vLLM may return additional fields not present in the OpenAI specification (e.g., kv_transfer_params). Standard OpenAI client libraries will safely ignore these.
  • Moderation — No content moderation layer is applied. OpenAI’s /v1/moderations endpoint is not available.
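When migrating existing code, it can be useful to strip parameters the proxy ignores (user, suffix) or does not support (parallel_tool_calls) before sending a request, so the call expresses only what the vLLM backend will honor. A minimal sketch; the helper is illustrative, not part of the proxy:

```python
# Per the list above: accepted-but-ignored vs. unsupported parameters.
IGNORED_PARAMS = {"user", "suffix"}
UNSUPPORTED_PARAMS = {"parallel_tool_calls"}


def sanitize_params(params):
    """Return a copy of request kwargs without ignored/unsupported keys.

    Dropping these client-side keeps requests explicit about what the
    vLLM backend will actually act on.
    """
    dropped = IGNORED_PARAMS | UNSUPPORTED_PARAMS
    return {k: v for k, v in params.items() if k not in dropped}
```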