AI Services Gateway API

A secure, unified gateway in front of our self-hosted Large Language Models. It speaks two protocols over the same credentials and routing engine.

OpenAI-Compatible

Drop-in for the OpenAI SDK / Chat Completions API. Returns chat.completion objects.

POST /api/v1/chat/completions

Ollama-Native

The native Ollama /api/chat response shape for existing Ollama clients.

POST /api/v1/chat.php

To migrate an existing OpenAI integration, change only the base_url and api_key. The backend model, server pool, and limits are governed by your key — no client changes are required when infrastructure changes.

Authentication

Every request must carry a bearer token (your API key) in the Authorization header. Keys are issued and managed by an administrator.

Authorization: Bearer YOUR_API_KEY

There is no model-listing endpoint: the gateway is chat-only and every key is bound to exactly one model server-side.

How Routing Works

Understanding the routing pipeline explains why the model you send is ignored and how requests are distributed:

1. Authenticate: the key is matched by an indexed lookup digest (constant work regardless of how many keys exist).
2. Enforce policy: expiry, IP & domain allow-lists, atomic per-minute rate limit, and token quotas are checked.
3. Resolve the model: the gateway substitutes the model bound to your key; any model you send is overwritten.
4. Select a server: candidates are the active servers that carry the model. If your key is bound to a cluster, candidates are restricted to that cluster's members (workload isolation).
5. Weighted load balancing: among candidates, a server is chosen by routing weight (stronger GPUs absorb more traffic), with automatic failover to the next candidate on error.
6. Circuit breaker: a node that fails repeatedly is briefly skipped, so requests don't pay timeouts against a known-bad server.

Transparent to clients. Clusters, weights, failover and the model binding are all server-side. Your request body stays the same no matter how the backend is organized.
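
As a rough illustration of the server selection, failover, and circuit-breaker steps above, the sketch below orders candidates by weighted random choice and skips a node after repeated failures. It is not the gateway's actual implementation: the Server class, the threshold of 3 failures, the 30-second cooldown, and all function names are assumptions made for the example.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    weight: int                          # routing weight; higher means more traffic
    failures: int = 0                    # consecutive failures seen so far
    skip_until: float = float("-inf")    # circuit-breaker deadline (seconds)


FAILURE_THRESHOLD = 3    # assumed: failures before a node is skipped
COOLDOWN_SECONDS = 30.0  # assumed: how long a tripped node is skipped


def failover_order(servers, rng=random, now=None):
    """Return non-skipped servers in weighted-random order (first = primary)."""
    now = time.monotonic() if now is None else now
    pool = [s for s in servers if s.skip_until <= now]
    order = []
    while pool:
        pick = rng.choices(pool, weights=[s.weight for s in pool], k=1)[0]
        order.append(pick)
        pool.remove(pick)
    return order


def route(servers, send, now=None):
    """Try candidates in order; `send(server)` must raise on a backend error."""
    now = time.monotonic() if now is None else now
    for server in failover_order(servers, now=now):
        try:
            result = send(server)
        except Exception:
            server.failures += 1
            if server.failures >= FAILURE_THRESHOLD:
                server.skip_until = now + COOLDOWN_SECONDS  # trip the breaker
            continue  # fail over to the next candidate
        server.failures = 0
        return result
    # No candidate answered; the gateway documents 503 for this case.
    raise RuntimeError("no healthy backend server")
```

Drawing the whole failover order up front keeps the weighting intact for the primary choice while still giving lower-weight servers a chance to absorb the request when the primary fails.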

Endpoints

Base URL: https://api.adelk.sa/llm
Endpoint paths below are relative to this base, e.g. https://api.adelk.sa/llm/api/v1/chat/completions.

POST /api/v1/chat/completions OpenAI

Content-Type: application/json

Primary OpenAI-compatible endpoint. With the OpenAI SDK, set base_url to https://api.adelk.sa/llm/api/v1 and call client.chat.completions.create(...).

Direct path (no URL rewriting required): /api/v1/chat/completions.php.

POST /api/v1/chat.php Ollama

Content-Type: application/json

Native Ollama-compatible endpoint. Accepts the same request body and returns Ollama's /api/chat response shape (NDJSON when streaming).

Both endpoints accept an identical request body. They differ only in the response format they emit.

Request Parameters

Parameter	Type	Required	Description
`messages`	Array	Yes	The conversation. Each item has `role` (`system` / `user` / `assistant`) and `content`. Content is a string, or an array of parts (`text` / `image_url`) for vision.
`model`	String	SDK	Required by the OpenAI SDK but ignored — the gateway always uses the model bound to your key. Send any non-empty string.
`stream`	Boolean	No	When `true`, partial deltas are streamed. Default `false`.
`stream_options`	Object	No	OpenAI streaming only. `{"include_usage": true}` emits a final usage chunk with token counts before `[DONE]`.
`temperature`, `top_p`	Number	No	Sampling controls, forwarded to the backend.
`max_tokens`	Number	No	Maximum tokens to generate. Maps to Ollama `num_predict`.
`presence_penalty`, `frequency_penalty`, `seed`, `stop`	Number / String / String[]	No	Forwarded as generation options. `stop` accepts a single string or an array of strings.
`response_format`	Object	No	`{"type":"json_object"}` requests JSON output. Keys configured with Force JSON always request JSON.
`tools`, `tool_choice`	Array / String	No	Function/tool calling (OpenAI tool schema). Honored only for keys with Tools enabled; otherwise silently ignored.
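
To make the forwarding described in the table concrete, here is a minimal sketch of how OpenAI-style fields might be translated into an Ollama options object. Only the max_tokens to num_predict mapping is stated above; passing the other fields through under the same names, and the OPTION_MAP and to_ollama_options names themselves, are assumptions for illustration.

```python
# Hypothetical mapping: OpenAI-style request field -> Ollama option name.
# Only "max_tokens" -> "num_predict" is documented; the rest is assumed.
OPTION_MAP = {
    "temperature": "temperature",
    "top_p": "top_p",
    "max_tokens": "num_predict",
    "presence_penalty": "presence_penalty",
    "frequency_penalty": "frequency_penalty",
    "seed": "seed",
    "stop": "stop",
}


def to_ollama_options(body: dict) -> dict:
    """Collect the generation options present in a request body."""
    options = {}
    for field, option in OPTION_MAP.items():
        value = body.get(field)
        if value is None:
            continue  # field not sent by the client
        if field == "stop" and isinstance(value, str):
            value = [value]  # normalise a single stop string to a list
        options[option] = value
    return options
```

Fields outside the map, such as model or messages, are left out of the options object.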

Vision (image_url) is honored only for keys with Vision enabled, and only data: base64 URIs are forwarded (remote URLs are dropped to prevent SSRF).
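
Because only data: base64 URIs are forwarded, a client has to inline the image bytes itself. The sketch below shows one way to build such a message part; the helper names image_part and vision_message are made up for this example, and the default MIME type is an assumption.

```python
import base64


def image_part(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Encode raw image bytes as an image_url content part with a data: URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def vision_message(prompt: str, image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build a user message that combines a text part and an inline image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            image_part(image_bytes, mime_type),
        ],
    }
```

In practice the bytes would typically come from reading a local file in binary mode before calling vision_message.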

Request Body Examples

The same body works on both endpoints. Below are the common shapes the gateway supports.

Minimal

{
  "model": "assigned-by-key",
  "messages": [
    { "role": "user", "content": "Hello!" }
  ],
  "stream": false
}

Full (sampling + JSON mode)

{
  "model": "assigned-by-key",
  "messages": [
    { "role": "system", "content": "You are concise." },
    { "role": "user", "content": "Summarize the water cycle." }
  ],
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 300,
  "stop": ["\n\n"],
  "seed": 42,
  "response_format": { "type": "json_object" },
  "stream": false
}
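
When JSON mode is requested, the reply text still arrives as a string in message.content, so the client has to decode it. A minimal sketch, assuming an OpenAI-shaped response dict; the function name parse_json_reply is hypothetical, and returning None on undecodable content is a choice made for this example rather than documented gateway behaviour.

```python
import json


def parse_json_reply(response: dict):
    """Decode the assistant's content as JSON; return None if it is not valid JSON."""
    content = response["choices"][0]["message"]["content"]
    try:
        return json.loads(content)
    except (TypeError, json.JSONDecodeError):
        # TypeError covers a missing (None) content field.
        return None
```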

Vision (multimodal content)

{
  "model": "assigned-by-key",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What is in this image?" },
        { "type": "image_url",
          "image_url": { "url": "data:image/png;base64,iVBORw0KGgo..." } }
      ]
    }
  ]
}

Tool / Function calling

{
  "model": "assigned-by-key",
  "messages": [
    { "role": "user", "content": "What is the weather in Riyadh?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": { "city": { "type": "string" } },
          "required": ["city"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}

Code Examples

cURL

curl -X POST "https://api.adelk.sa/llm/api/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -d '{
        "model": "assigned-by-key",
        "messages": [
            {"role": "user", "content": "Explain quantum computing in one sentence."}
        ],
        "stream": false
     }'

# Native Ollama response shape
curl -X POST "https://api.adelk.sa/llm/api/v1/chat.php" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -d '{
        "model": "assigned-by-key",
        "messages": [
            {"role": "user", "content": "Why is the sky blue?"}
        ],
        "stream": false
     }'

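Python (requests)

import requests

url = "https://api.adelk.sa/llm/api/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_API_KEY",
}
data = {
    # Any non-empty string works; the gateway uses the model bound to your key.
    "model": "assigned-by-key",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
}

# A timeout keeps the call from hanging if the gateway is unreachable.
resp = requests.post(url, headers=headers, json=data, timeout=60)
print(resp.status_code)
resp.raise_for_status()  # raise on 4xx/5xx before reading the completion
print(resp.json()["choices"][0]["message"]["content"])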

JavaScript (fetch)

const url = "https://api.adelk.sa/llm/api/v1/chat/completions";
const apiKey = "YOUR_API_KEY";

const res = await fetch(url, {
    method: "POST",
    headers: {
        "Content-Type": "application/json",
        "Authorization": `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
        model: "assigned-by-key",
        messages: [{ role: "user", content: "Tell me a joke." }],
    }),
});

const data = await res.json();
console.log(data.choices[0].message.content);

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.adelk.sa/llm/api/v1",
    api_key="YOUR_API_KEY",
)

# `model` is required by the SDK; the gateway enforces your key's model.
resp = client.chat.completions.create(
    model="assigned-by-key",
    messages=[{"role": "user", "content": "Write a haiku about code."}],
)
print(resp.choices[0].message.content)

Streaming

Set "stream": true. The wire format depends on the endpoint:

OpenAI Server-Sent Events

Content-Type: text/event-stream — each event is a chat.completion.chunk; the stream ends with [DONE].

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk",
       "choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk",
       "choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk",
       "choices":[],"usage":{"prompt_tokens":9,"completion_tokens":3,"total_tokens":12}}

data: [DONE]
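
A client that reads the stream without the OpenAI SDK could process it roughly as follows. This sketch assumes each event's JSON arrives on a single data: line (the sample above is wrapped for readability) and that the caller already has an iterable of decoded text lines; the function name read_sse is hypothetical.

```python
import json


def read_sse(lines):
    """Collect text, usage, and completion state from SSE `data:` lines."""
    text, usage, done = [], None, False
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            done = True
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # The first chunk carries only the role, so content may be absent.
            text.append(choice.get("delta", {}).get("content") or "")
        if chunk.get("usage"):
            usage = chunk["usage"]  # present only with include_usage
    return "".join(text), usage, done
```

A done value of False after the lines run out means the stream stopped before [DONE] was received.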

Ollama NDJSON

Content-Type: application/x-ndjson — one JSON object per line; the last line has done: true.

{"model":"llama3:8b","message":{"role":"assistant","content":"Hello"},"done":false}
{"model":"llama3:8b","message":{"role":"assistant","content":" there"},"done":false}
{"model":"llama3:8b","message":{"role":"assistant","content":""},"done":true,
 "prompt_eval_count":9,"eval_count":3,"total_duration":123456789}

Mid-stream failures. Once bytes have been sent, the gateway cannot transparently fail over to another server. The stream ends early; the request is logged as partial and the tokens already produced are still counted and billed.
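
On the Ollama endpoint, a client can detect an early end by checking whether a line with done: true ever arrived. The reader below is a minimal sketch that assumes an iterable of decoded NDJSON lines; the function name read_ndjson and the choice to signal truncation with None are assumptions for this example.

```python
import json


def read_ndjson(lines):
    """Return (text, final_object); final_object is None if the stream ended early."""
    text, final = [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between objects
        obj = json.loads(line)
        text.append(obj.get("message", {}).get("content", ""))
        if obj.get("done"):
            final = obj  # carries prompt_eval_count, eval_count, durations
            break
    return "".join(text), final
```

When final is None, the text collected so far is a partial reply, and per the note above its tokens have still been counted.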

Response Structure

OpenAI Chat Completion

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "llama3:latest",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here is your answer..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}

When the model calls a tool, message.tool_calls is present and finish_reason is tool_calls.
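
A client might dispatch such tool calls along the lines below. The sketch assumes the OpenAI tool-call layout (an id, plus function.name and function.arguments, with arguments usually a JSON-encoded string); the get_weather stand-in, the TOOLS registry, and the run_tool_calls name are all hypothetical.

```python
import json


def get_weather(city: str) -> str:
    """Stand-in tool; a real one would query a weather service."""
    return f"(weather for {city})"


# Hypothetical registry mapping tool names to local Python callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(response: dict) -> list:
    """Execute every tool call in an OpenAI-shaped response and return the results."""
    choice = response["choices"][0]
    if choice.get("finish_reason") != "tool_calls":
        return []
    results = []
    for call in choice["message"].get("tool_calls", []):
        fn = call["function"]
        handler = TOOLS.get(fn["name"])
        if handler is None:
            continue  # the model asked for a tool this client does not provide
        raw = fn.get("arguments") or "{}"
        # Arguments are normally a JSON string; accept a dict as well.
        args = json.loads(raw) if isinstance(raw, str) else raw
        results.append({
            "tool_call_id": call.get("id"),
            "name": fn["name"],
            "output": handler(**args),
        })
    return results
```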

Ollama Chat Response

{
  "model": "llama3:8b",
  "created_at": "2026-06-29T10:00:00Z",
  "message": {
    "role": "assistant",
    "content": "Here is your answer..."
  },
  "done": true,
  "total_duration": 123456789,
  "prompt_eval_count": 9,
  "eval_count": 12
}

The model field shows the alias configured for the serving node, not necessarily the raw tag.
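
Code that has to work against either endpoint can normalise the two response shapes shown above. This is a sketch only: the function name extract_reply is hypothetical, and it assumes Ollama's prompt_eval_count and eval_count correspond to prompt and completion token counts.

```python
def extract_reply(response: dict) -> dict:
    """Return the reply text and token counts from either response shape."""
    if "choices" in response:
        # OpenAI chat.completion shape
        usage = response.get("usage", {})
        return {
            "text": response["choices"][0]["message"]["content"],
            "prompt_tokens": usage.get("prompt_tokens"),
            "completion_tokens": usage.get("completion_tokens"),
        }
    # Ollama /api/chat shape
    return {
        "text": response["message"]["content"],
        "prompt_tokens": response.get("prompt_eval_count"),
        "completion_tokens": response.get("eval_count"),
    }
```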

Key Policies & Limits

These are configured per API key by an administrator and enforced on every request. They are not part of the request body — but they shape the responses you get.

Policy	Effect
Model binding	Each key serves exactly one model; your `model` field is ignored.
Cluster routing	The key's requests are restricted to a dedicated pool of servers (workload / model isolation).
Rate limit	Maximum requests per minute. Exceeding it returns `429`.
Token quotas	Daily / weekly / monthly token caps (per key or per owning user). Exceeding a cap returns `429`.
IP & domain allow-lists	Requests from other IPs / referrer domains are rejected with `403`.
Expiry	After the expiry date, requests made with the key are rejected with `403`.
Tools / Vision	Function calling and image input are honored only when enabled on the key; otherwise stripped silently.
Force JSON	Always requests JSON output, regardless of `response_format`.
System prompt	An administrator-defined system message may be injected ahead of your messages.
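
To show what an atomic per-minute rate limit can look like, here is a small fixed-window counter guarded by a lock. The document does not say how the gateway implements its limit, so the fixed-window approach, the in-memory storage, and the MinuteRateLimiter name are all assumptions.

```python
import threading
import time


class MinuteRateLimiter:
    """Fixed-window limiter: at most `limit_per_minute` requests per key per minute."""

    def __init__(self, limit_per_minute: int):
        self.limit = limit_per_minute
        self._lock = threading.Lock()
        self._counts = {}  # (key_id, minute index) -> requests seen

    def allow(self, key_id: str, now=None) -> bool:
        """Return True and count the request, or False if the window is full."""
        now = time.time() if now is None else now
        window = (key_id, int(now // 60))
        with self._lock:  # check-and-increment happens as one atomic step
            count = self._counts.get(window, 0)
            if count >= self.limit:
                return False  # the gateway documents 429 for this case
            self._counts[window] = count + 1
            return True
```

The sketch never prunes old windows; a long-running service would also need to expire them.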

Error Codes

Errors are returned as JSON. On the OpenAI endpoint the shape is {"error":{"message":...,"type":...,"code":...}}; on the Ollama endpoint it is {"error":{"message":...,"type":"proxy_error","code":...}}.

Status	Meaning	Description
400	Bad Request	Invalid JSON body or malformed `messages`.
401	Unauthorized	Missing or invalid API key. Check the `Authorization` header.
403	Forbidden	Key expired, or IP / domain not allowed.
405	Method Not Allowed	Only `POST` is supported.
429	Too Many Requests	Rate limit or token quota exceeded.
502	Bad Gateway	The backend returned an invalid response (or a stream broke mid-way).
503	Service Unavailable	No healthy backend server can currently serve the key's model.
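
A client can use the table above to decide whether a failed request is worth repeating. In the sketch below, treating 429, 502, and 503 as retryable is a suggestion rather than documented gateway guidance (a 429 caused by an exhausted token quota will not clear until the quota resets), and the names describe_error and backoff_delays are hypothetical.

```python
# Assumed classification: these statuses may succeed on a later attempt.
RETRYABLE = {429, 502, 503}


def describe_error(status: int, body) -> dict:
    """Flatten the documented error envelope and flag retryable statuses."""
    err = body.get("error", {}) if isinstance(body, dict) else {}
    if not isinstance(err, dict):
        err = {"message": str(err)}  # tolerate a plain-string error field
    return {
        "status": status,
        "message": err.get("message"),
        "type": err.get("type"),
        "code": err.get("code"),
        "retryable": status in RETRYABLE,
    }


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list:
    """Exponential backoff schedule in seconds, capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]
```

A retry loop would sleep for each delay in turn and stop as soon as a request succeeds or a non-retryable status comes back.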

Postman

A ready-to-use Postman collection covers every endpoint and mode (chat, streaming, vision, tools, JSON mode, and the Ollama-native endpoint).

1. Download and import both files:
   - postman_collection.json: the requests
   - postman_environment.json: the base_url / api_key variables
2. In Postman, select the AI Gateway environment (top-right).
3. Set base_url (e.g. https://api.adelk.sa/llm) and your api_key.
4. Open any request and hit Send. Authentication is pre-wired via the collection's Bearer token.

The collection's variables default to this server. Override them in the environment for staging / production hosts without editing each request.