AI Services Gateway API

A secure, unified gateway in front of our self-hosted Large Language Models. It speaks two protocols over the same credentials and routing engine.

OpenAI-Compatible

Drop-in for the OpenAI SDK / Chat Completions API. Returns chat.completion objects.

POST /api/v1/chat/completions

Ollama-Native

The native Ollama /api/chat response shape for existing Ollama clients.

POST /api/v1/chat.php

Authentication

Every request must carry a Bearer token (API key) in the Authorization header. Keys are issued and managed by an administrator.

Authorization: Bearer YOUR_API_KEY

There is no model-listing endpoint: the gateway is chat-only and every key is bound to exactly one model server-side.

How Routing Works

The routing pipeline explains why the model you send is ignored and how requests are distributed:

  1. Authenticate: the key is matched by an indexed lookup digest (constant work regardless of how many keys exist).
  2. Enforce policy: expiry, IP and domain allow-lists, the atomic per-minute rate limit, and token quotas are checked.
  3. Resolve the model: the gateway substitutes the model bound to your key; any model you send is overwritten.
  4. Select a server: candidates are the active servers that carry the model. If your key is bound to a cluster, candidates are restricted to that cluster's members (workload isolation).
  5. Weighted load balancing: among candidates, a server is chosen by routing weight (stronger GPUs absorb more traffic), with automatic failover to the next candidate on error.
  6. Circuit breaker: a node that fails repeatedly is briefly skipped, so requests don't pay timeouts against a known-bad server.

Transparent to clients. Clusters, weights, failover and the model binding are all server-side. Your request body stays the same no matter how the backend is organized.
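Steps 4 to 6 can be pictured with a small sketch. This is not the gateway's actual code: the Server fields, the ConnectionError failure signal, FAILURE_THRESHOLD and COOLDOWN_SECONDS are all hypothetical names and values chosen for illustration.

```python
import random
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    name: str
    weight: int                    # routing weight; higher absorbs more traffic
    models: frozenset              # model names this server carries
    cluster: Optional[str] = None
    active: bool = True
    failures: int = 0              # consecutive failures seen so far
    skip_until: float = 0.0        # circuit-breaker cool-down deadline


FAILURE_THRESHOLD = 3     # hypothetical: failures before the breaker opens
COOLDOWN_SECONDS = 30.0   # hypothetical: how long a tripped node is skipped


def candidates(servers, model, cluster=None, now=None):
    """Steps 4 and 6: active servers carrying the model, minus tripped nodes."""
    now = time.monotonic() if now is None else now
    return [
        s for s in servers
        if s.active and model in s.models
        and (cluster is None or s.cluster == cluster)
        and s.skip_until <= now
    ]


def route(servers, model, send, cluster=None, rng=random):
    """Step 5: weighted pick, failing over to the remaining candidates."""
    pool = candidates(servers, model, cluster)
    while pool:
        server = rng.choices(pool, weights=[s.weight for s in pool], k=1)[0]
        try:
            result = send(server)
        except ConnectionError:
            server.failures += 1
            if server.failures >= FAILURE_THRESHOLD:
                server.skip_until = time.monotonic() + COOLDOWN_SECONDS
            pool = [s for s in pool if s is not server]  # fail over
            continue
        server.failures = 0
        return server.name, result
    raise RuntimeError("no healthy backend for this model")  # a 503 to the client
```

Here send is any callable that forwards the request to one server and raises ConnectionError on failure. A key bound to a cluster simply passes cluster="...", which narrows the pool before weights are applied.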

Endpoints

POST /api/v1/chat/completions (OpenAI)

Content-Type: application/json

Primary OpenAI-compatible endpoint. With the OpenAI SDK, set base_url to https://api.adelk.sa/llm/api/v1 and call client.chat.completions.create(...).

Direct path (no URL rewriting required): /api/v1/chat/completions.php.

POST /api/v1/chat.php (Ollama)

Content-Type: application/json

Native Ollama-compatible endpoint. Accepts the same request body and returns Ollama's /api/chat response shape (NDJSON when streaming).

Both endpoints accept an identical request body. They differ only in the response format they emit.

Request Parameters

Parameter | Type | Required | Description
messages | Array | Yes | The conversation. Each item has role (system / user / assistant) and content. Content is a string, or an array of parts (text / image_url) for vision.
model | String | SDK | Required by the OpenAI SDK but ignored; the gateway always uses the model bound to your key. Send any non-empty string.
stream | Boolean | No | When true, partial deltas are streamed. Default false.
stream_options | Object | No | OpenAI streaming only. {"include_usage": true} emits a final usage chunk with token counts before [DONE].
temperature, top_p | Number | No | Sampling controls, forwarded to the backend.
max_tokens | Number | No | Maximum tokens to generate. Maps to Ollama num_predict.
presence_penalty, frequency_penalty, seed, stop | Number / String[] | No | Forwarded as generation options. stop accepts a string or array.
response_format | Object | No | {"type":"json_object"} requests JSON output. Keys configured with Force JSON always request JSON.
tools, tool_choice | Array / String | No | Function/tool calling (OpenAI tool schema). Honored only for keys with Tools enabled; otherwise silently ignored.
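To make the forwarding concrete, here is a rough sketch of how an OpenAI-style body could be translated into Ollama generation options. Only the max_tokens to num_predict mapping is stated in the table above; the same-name entries are assumptions based on Ollama's option names, and to_ollama_options is a hypothetical helper, not a gateway function.

```python
# Hypothetical translation table: only max_tokens -> num_predict comes from
# this document; the same-name entries are assumptions.
OPTION_NAMES = {
    "max_tokens": "num_predict",
    "temperature": "temperature",
    "top_p": "top_p",
    "presence_penalty": "presence_penalty",
    "frequency_penalty": "frequency_penalty",
    "seed": "seed",
    "stop": "stop",
}


def to_ollama_options(body: dict) -> dict:
    """Collect the generation options present in an OpenAI-style request body."""
    options = {}
    for openai_name, ollama_name in OPTION_NAMES.items():
        value = body.get(openai_name)
        if value is None:
            continue  # parameter not supplied; leave the backend default
        if openai_name == "stop" and isinstance(value, str):
            value = [value]  # stop may arrive as a single string
        options[ollama_name] = value
    return options
```

Fields that are not generation options (model, messages, stream, and so on) are left out of the result.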

Vision (image_url) is honored only for keys with Vision enabled, and only data: base64 URIs are forwarded (remote URLs are dropped to prevent SSRF).
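Because only data: URIs are forwarded, a client has to inline the image itself. The sketch below shows one way to do that; image_part and vision_message are hypothetical helper names, and the MIME type is guessed from the file name.

```python
import base64
import mimetypes


def image_part(data: bytes, filename: str) -> dict:
    """Build an image_url content part carrying the image as a base64 data: URI."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image file name: {filename}")
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


def vision_message(text: str, data: bytes, filename: str) -> dict:
    """Combine a text part and an image part into one user message."""
    return {
        "role": "user",
        "content": [{"type": "text", "text": text}, image_part(data, filename)],
    }
```

A typical call would read the file with open(path, "rb").read() and pass the bytes plus the file name to vision_message.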

Request JSON Schemas

The same body works on both endpoints. Below are the common shapes the gateway supports.

Minimal
{
  "model": "assigned-by-key",
  "messages": [
    { "role": "user", "content": "Hello!" }
  ],
  "stream": false
}

Full (sampling + JSON mode)
{
  "model": "assigned-by-key",
  "messages": [
    { "role": "system", "content": "You are concise." },
    { "role": "user", "content": "Summarize the water cycle." }
  ],
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 300,
  "stop": ["\n\n"],
  "seed": 42,
  "response_format": { "type": "json_object" },
  "stream": false
}

Vision (multimodal content)
{
  "model": "assigned-by-key",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What is in this image?" },
        { "type": "image_url",
          "image_url": { "url": "data:image/png;base64,iVBORw0KGgo..." } }
      ]
    }
  ]
}

Tool / Function calling
{
  "model": "assigned-by-key",
  "messages": [
    { "role": "user", "content": "What is the weather in Riyadh?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": { "city": { "type": "string" } },
          "required": ["city"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}

Code Examples

cURL (OpenAI-compatible)
curl -X POST "https://api.adelk.sa/llm/api/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -d '{
        "model": "assigned-by-key",
        "messages": [
            {"role": "user", "content": "Explain quantum computing in one sentence."}
        ],
        "stream": false
     }'

cURL (native Ollama response shape)
curl -X POST "https://api.adelk.sa/llm/api/v1/chat.php" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -d '{
        "model": "assigned-by-key",
        "messages": [
            {"role": "user", "content": "Why is the sky blue?"}
        ],
        "stream": false
     }'

Python (requests)
import requests

url = "https://api.adelk.sa/llm/api/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_API_KEY",
}
data = {
    "model": "assigned-by-key",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
}

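# A timeout keeps the client from hanging if the gateway is unreachable.
resp = requests.post(url, headers=headers, json=data, timeout=60)
print(resp.status_code)
resp.raise_for_status()  # stop on 4xx/5xx; error bodies have no "choices"
print(resp.json()["choices"][0]["message"]["content"])

JavaScript (fetch)
const url = "https://api.adelk.sa/llm/api/v1/chat/completions";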
const apiKey = "YOUR_API_KEY";

const res = await fetch(url, {
    method: "POST",
    headers: {
        "Content-Type": "application/json",
        "Authorization": `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
        model: "assigned-by-key",
        messages: [{ role: "user", content: "Tell me a joke." }],
    }),
});

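// Error responses carry an "error" object instead of "choices".
if (!res.ok) throw new Error(`Gateway returned HTTP ${res.status}`);
const data = await res.json();
console.log(data.choices[0].message.content);

Python (OpenAI SDK)
from openai import OpenAI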

client = OpenAI(
    base_url="https://api.adelk.sa/llm/api/v1",
    api_key="YOUR_API_KEY",
)

# `model` is required by the SDK; the gateway enforces your key's model.
resp = client.chat.completions.create(
    model="assigned-by-key",
    messages=[{"role": "user", "content": "Write a haiku about code."}],
)
print(resp.choices[0].message.content)

Streaming

Set "stream": true. The wire format depends on the endpoint:

OpenAI Server-Sent Events

Content-Type: text/event-stream — each event is a chat.completion.chunk; the stream ends with [DONE].

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[],"usage":{"prompt_tokens":9,"completion_tokens":3,"total_tokens":12}}

data: [DONE]
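A client without the OpenAI SDK can consume this stream by hand. The following is a minimal sketch, assuming each data: line holds one complete JSON chunk; read_sse is a hypothetical helper name.

```python
import json


def read_sse(lines):
    """Return (text, usage) assembled from an iterable of SSE lines."""
    text, usage = [], None
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]  # only sent with include_usage
        for choice in chunk.get("choices", []):
            text.append(choice.get("delta", {}).get("content") or "")
    return "".join(text), usage
```

With the requests library, the lines can come from requests.post(..., stream=True).iter_lines(). usage stays None unless the request set stream_options to {"include_usage": true}.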

Ollama NDJSON

Content-Type: application/x-ndjson — one JSON object per line; the last line has done: true.

{"model":"llama3:8b","message":{"role":"assistant","content":"Hello"},"done":false}
{"model":"llama3:8b","message":{"role":"assistant","content":" there"},"done":false}
{"model":"llama3:8b","message":{"role":"assistant","content":""},"done":true,"prompt_eval_count":9,"eval_count":3,"total_duration":123456789}

Mid-stream failures. Once bytes have been sent, the gateway cannot transparently fail over to another server. The stream ends early; the request is logged as partial and the tokens already produced are still counted and billed.
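On the Ollama endpoint, a client can tell a truncated stream from a complete one by checking whether a line with done: true ever arrived. A minimal sketch follows; read_ndjson and TruncatedStream are hypothetical names, and how to recover (retry, show partial text) is left to the caller.

```python
import json


class TruncatedStream(Exception):
    """Raised when the stream ends without a final done: true line."""

    def __init__(self, partial_text):
        super().__init__("stream ended before done: true")
        self.partial_text = partial_text


def read_ndjson(lines):
    """Return (text, final_object) from an iterable of NDJSON lines."""
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        if not raw.strip():
            continue  # tolerate blank lines
        obj = json.loads(raw)
        parts.append(obj.get("message", {}).get("content") or "")
        if obj.get("done"):
            return "".join(parts), obj  # final line carries the counters
    raise TruncatedStream("".join(parts))
```

The final object exposes prompt_eval_count and eval_count, while TruncatedStream.partial_text holds whatever was received before the stream broke.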

Response Structure

OpenAI Chat Completion
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "llama3:latest",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here is your answer..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}

When the model calls a tool, message.tool_calls is present and finish_reason is tool_calls.
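A client might dispatch such tool calls as sketched below. The tool_calls item layout (id, function.name, function.arguments as a JSON string) follows the OpenAI schema and is assumed here rather than confirmed by this document; get_weather and TOOLS are hypothetical local stand-ins. How tool results are sent back to the gateway is not covered in this document, so the sketch stops at executing the calls.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical local implementation of the get_weather tool."""
    return f"Sunny in {city}"


TOOLS = {"get_weather": get_weather}


def run_tool_calls(response: dict) -> list:
    """Execute each tool call in a chat.completion; return (id, name, result) tuples."""
    choice = response["choices"][0]
    if choice.get("finish_reason") != "tool_calls":
        return []  # ordinary text answer, nothing to run
    results = []
    for call in choice["message"].get("tool_calls", []):
        fn = call["function"]
        args = fn.get("arguments") or {}
        if isinstance(args, str):  # OpenAI encodes arguments as a JSON string
            args = json.loads(args)
        results.append((call.get("id"), fn["name"], TOOLS[fn["name"]](**args)))
    return results
```

An unknown tool name raises KeyError here; production code would validate the name and arguments before calling anything.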

Ollama Chat Response
{
  "model": "llama3:8b",
  "created_at": "2026-06-29T10:00:00Z",
  "message": {
    "role": "assistant",
    "content": "Here is your answer..."
  },
  "done": true,
  "total_duration": 123456789,
  "prompt_eval_count": 9,
  "eval_count": 12
}

The model field shows the alias configured for the serving node, not necessarily the raw tag.

Key Policies & Limits

These are configured per API key by an administrator and enforced on every request. They are not part of the request body — but they shape the responses you get.

Policy | Effect
Model binding | Each key serves exactly one model; your model field is ignored.
Cluster routing | The key's requests are restricted to a dedicated pool of servers (workload / model isolation).
Rate limit | Maximum requests per minute. Exceeding it returns 429.
Token quotas | Daily / weekly / monthly token caps (per key or per owning user). Exceeding a cap returns 429.
IP & domain allow-lists | Requests from other IPs / referrer domains are rejected with 403.
Expiry | After the expiry date the key returns 403.
Tools / Vision | Function calling and image input are honored only when enabled on the key; otherwise stripped silently.
Force JSON | Always requests JSON output, regardless of response_format.
System prompt | An administrator-defined system message may be injected ahead of your messages.
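As an illustration of the rate-limit row, the sketch below implements a per-minute limit as a fixed-window counter guarded by a lock. The gateway's real algorithm and storage are not described in this document; the fixed window, the in-memory dictionary, and the PerMinuteLimiter name are all assumptions.

```python
import threading
import time


class PerMinuteLimiter:
    """Fixed-window counter: at most `limit` requests per key per clock minute."""

    def __init__(self, limit, clock=time.time):
        self.limit = limit
        self.clock = clock
        self._lock = threading.Lock()
        self._counts = {}  # key -> (window index, requests seen in that window)

    def allow(self, key):
        window = int(self.clock() // 60)
        with self._lock:  # check-and-increment happens as one atomic step
            seen_window, count = self._counts.get(key, (window, 0))
            if seen_window != window:
                count = 0  # a new minute started; reset the counter
            if count >= self.limit:
                return False  # the caller would answer 429
            self._counts[key] = (window, count + 1)
            return True
```

The lock is what makes the check and the increment atomic within one process; a multi-process deployment would need a shared store with an equivalent atomic operation.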

Error Codes

Errors are returned as JSON in the same envelope on both endpoints: {"error":{"message":...,"type":...,"code":...}}. On the Ollama endpoint, type is always "proxy_error".

Status | Meaning | Description
400 | Bad Request | Invalid JSON body or malformed messages.
401 | Unauthorized | Missing or invalid API key. Check the Authorization header.
403 | Forbidden | Key expired, or IP / domain not allowed.
405 | Method Not Allowed | Only POST is supported.
429 | Too Many Requests | Rate limit or token quota exceeded.
502 | Bad Gateway | The backend returned an invalid response (or a stream broke mid-way).
503 | Service Unavailable | No healthy backend server can currently serve the key's model.
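One possible client-side policy for these codes is sketched below: retry the transient statuses with exponential backoff and surface everything else immediately. Treating 429, 502 and 503 as retryable is an assumption, as are the helper names and delay values. A 429 caused by an exhausted token quota will not clear within seconds, so attempts should stay small.

```python
import random
import time

RETRYABLE = {429, 502, 503}  # assumption: transient; 400/401/403/405 are not


def error_message(body):
    """Pull error.message out of the JSON error envelope, if present."""
    err = body.get("error") if isinstance(body, dict) else None
    return err.get("message") if isinstance(err, dict) else None


def call_with_retry(send, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() -> (status, body), retrying transient statuses with backoff."""
    for attempt in range(attempts):
        status, body = send()
        if status not in RETRYABLE or attempt == attempts - 1:
            return status, body
        # exponential backoff plus jitter so clients do not retry in lockstep
        sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Here send is any zero-argument callable that performs one request and returns the HTTP status together with the decoded JSON body.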

Postman

A ready-to-use Postman collection covers every endpoint and mode (chat, streaming, vision, tools, JSON mode, and the Ollama-native endpoint).

  1. Download and import both files (the collection and the environment).
  2. In Postman, select the AI Gateway environment (top-right).
  3. Set base_url (e.g. https://api.adelk.sa/llm) and your api_key.
  4. Open any request and hit Send. Authentication is pre-wired via the collection's Bearer token.

The collection's variables default to this server. Override them in the environment for staging / production hosts without editing each request.

AI Services Gateway Manager - Developed by Adelk