AI Services Gateway API
A secure, unified gateway in front of our self-hosted Large Language Models. It speaks two protocols over the same credentials and routing engine.
OpenAI-Compatible
Drop-in for the OpenAI SDK / Chat Completions API.
Returns chat.completion objects.
POST /api/v1/chat/completions
Ollama-Native
The native Ollama /api/chat response shape
for existing Ollama clients.
POST /api/v1/chat.php
Clients configure only base_url and api_key. The backend model, server pool, and limits are governed by your key, so no
client changes are required when infrastructure changes.
Authentication
Every request is protected by a Bearer Token (API Key) sent in the Authorization
header. Keys are issued and managed by an administrator.
There is no model-listing endpoint: the gateway is chat-only and every key is bound to exactly one model server-side.
How Routing Works
Understanding the routing pipeline explains why the model you send is ignored and how
requests are distributed:
- Authenticate: the key is matched by an indexed lookup digest (constant work regardless of how many keys exist).
- Enforce policy: expiry, IP & domain allow-lists, atomic per-minute rate limit, and token quotas are checked.
- Resolve the model: the gateway substitutes the model bound to your key; any model you send is overwritten.
- Select a server: candidates are the active servers that carry the model. If your key is bound to a cluster, candidates are restricted to that cluster's members (workload isolation).
- Weighted load balancing: among candidates, a server is chosen by routing weight (stronger GPUs absorb more traffic), with automatic failover to the next candidate on error.
- Circuit breaker: a node that fails repeatedly is briefly skipped, so requests don't pay timeouts against a known-bad server.
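The last three steps of the pipeline can be illustrated with a small sketch. This is not the gateway's actual code: the `Server` class, the failure threshold, and the cooldown length are assumptions chosen for illustration, and only the general technique (weighted random ordering, failover to the next candidate, and a time-based circuit breaker) comes from the description above.

```python
import random
import time


class Server:
    """Hypothetical backend node: a name, a routing weight, and failure state."""

    def __init__(self, name, weight):
        self.name = name
        self.weight = weight
        self.failures = 0      # consecutive failures seen so far
        self.skip_until = 0.0  # the node is skipped until this timestamp


FAILURE_THRESHOLD = 3  # assumed: consecutive failures before a node is skipped
COOLDOWN_SECONDS = 30  # assumed: how long a tripped node is skipped


def order_candidates(servers, now, rng=random):
    """Return non-skipped servers in weighted-random order (first = primary)."""
    pool = [s for s in servers if s.skip_until <= now]
    ordered = []
    while pool:
        pick = rng.choices(pool, weights=[s.weight for s in pool], k=1)[0]
        ordered.append(pick)
        pool.remove(pick)
    return ordered


def route(servers, send, now=None, rng=random):
    """Try candidates in order; fail over on error and trip the breaker."""
    now = time.time() if now is None else now
    for server in order_candidates(servers, now, rng):
        try:
            result = send(server)
        except Exception:
            server.failures += 1
            if server.failures >= FAILURE_THRESHOLD:
                server.skip_until = now + COOLDOWN_SECONDS
            continue  # failover: move on to the next candidate
        server.failures = 0
        return result
    # No candidate succeeded; the gateway reports this case as 503.
    raise RuntimeError("no healthy backend server available")
```

In this sketch a server with weight 5 is about five times as likely to be tried first as a server with weight 1, and a node that has failed three times in a row is left out of the candidate list until its cooldown ends.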
Endpoints
Base URL: https://api.adelk.sa/llm
Full endpoint paths below are relative to this base, e.g.
https://api.adelk.sa/llm/api/v1/chat/completions.
/api/v1/chat/completions
OpenAI
Content-Type: application/json
Primary OpenAI-compatible endpoint. With the OpenAI SDK, set
base_url to https://api.adelk.sa/llm/api/v1 and call
client.chat.completions.create(...).
Direct path (no URL rewriting required):
/api/v1/chat/completions.php.
/api/v1/chat.php
Ollama
Content-Type: application/json
Native Ollama-compatible endpoint. Accepts the same request body and returns
Ollama's /api/chat response shape (NDJSON when streaming).
Both endpoints accept an identical request body. They differ only in the response format they emit.
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| messages | Array | Yes | The conversation. Each item has role (system / user / assistant) and content. Content is a string, or an array of parts (text / image_url) for vision. |
| model | String | SDK | Required by the OpenAI SDK but ignored: the gateway always uses the model bound to your key. Send any non-empty string. |
| stream | Boolean | No | When true, partial deltas are streamed. Default false. |
| stream_options | Object | No | OpenAI streaming only. {"include_usage": true} emits a final usage chunk with token counts before [DONE]. |
| temperature, top_p | Number | No | Sampling controls, forwarded to the backend. |
| max_tokens | Number | No | Maximum tokens to generate. Maps to Ollama num_predict. |
| presence_penalty, frequency_penalty, seed, stop | Number / String[] | No | Forwarded as generation options. stop accepts a string or array. |
| response_format | Object | No | {"type":"json_object"} requests JSON output. Keys configured with Force JSON always request JSON. |
| tools, tool_choice | Array / String | No | Function/tool calling (OpenAI tool schema). Honored only for keys with Tools enabled; otherwise silently ignored. |
Vision (image_url) is honored only for keys with
Vision enabled, and only data: base64 URIs are forwarded (remote URLs
are dropped to prevent SSRF).
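Because only data: URIs are forwarded, a client has to inline image bytes itself. The sketch below shows one way to do that with the Python standard library; the helper names `image_part` and `vision_message` are hypothetical, and the default MIME type is an assumption you should match to your actual file.

```python
import base64


def image_part(image_bytes, mime_type="image/png"):
    """Wrap raw image bytes as an image_url content part with a data: URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def vision_message(prompt, image_bytes, mime_type="image/png"):
    """Build a user message with one text part and one inline image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            image_part(image_bytes, mime_type),
        ],
    }
```

A typical call reads the file in binary mode, for example `vision_message("What is in this image?", open("photo.png", "rb").read())`, and places the result in the messages array.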
Request JSON Schemas
The same body works on both endpoints. Below are the common shapes the gateway supports.
Minimal
{
"model": "assigned-by-key",
"messages": [
{ "role": "user", "content": "Hello!" }
],
"stream": false
}
Full (sampling + JSON mode)
{
"model": "assigned-by-key",
"messages": [
{ "role": "system", "content": "You are concise." },
{ "role": "user", "content": "Summarize the water cycle." }
],
"temperature": 0.7,
"top_p": 0.9,
"max_tokens": 300,
"stop": ["\n\n"],
"seed": 42,
"response_format": { "type": "json_object" },
"stream": false
}
Vision (multimodal content)
{
"model": "assigned-by-key",
"messages": [
{
"role": "user",
"content": [
{ "type": "text", "text": "What is in this image?" },
{ "type": "image_url",
"image_url": { "url": "data:image/png;base64,iVBORw0KGgo..." } }
]
}
]
}
Tool / Function calling
{
"model": "assigned-by-key",
"messages": [
{ "role": "user", "content": "What is the weather in Riyadh?" }
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a city",
"parameters": {
"type": "object",
"properties": { "city": { "type": "string" } },
"required": ["city"]
}
}
}
],
"tool_choice": "auto"
}
Code Examples
cURL
curl -X POST "https://api.adelk.sa/llm/api/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"model": "assigned-by-key",
"messages": [
{"role": "user", "content": "Explain quantum computing in one sentence."}
],
"stream": false
}'
# Native Ollama response shape
curl -X POST "https://api.adelk.sa/llm/api/v1/chat.php" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"model": "assigned-by-key",
"messages": [
{"role": "user", "content": "Why is the sky blue?"}
],
"stream": false
}'
Python (requests)
import requests
url = "https://api.adelk.sa/llm/api/v1/chat/completions"
headers = {
"Content-Type": "application/json",
"Authorization": "Bearer YOUR_API_KEY",
}
data = {
"model": "assigned-by-key",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
}
resp = requests.post(url, headers=headers, json=data)
print(resp.status_code)
print(resp.json()["choices"][0]["message"]["content"])
JavaScript (fetch)
const url = "https://api.adelk.sa/llm/api/v1/chat/completions";
const apiKey = "YOUR_API_KEY";
const res = await fetch(url, {
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": `Bearer ${apiKey}`,
},
body: JSON.stringify({
model: "assigned-by-key",
messages: [{ role: "user", content: "Tell me a joke." }],
}),
});
const data = await res.json();
console.log(data.choices[0].message.content);
Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(
base_url="https://api.adelk.sa/llm/api/v1",
api_key="YOUR_API_KEY",
)
# `model` is required by the SDK; the gateway enforces your key's model.
resp = client.chat.completions.create(
model="assigned-by-key",
messages=[{"role": "user", "content": "Write a haiku about code."}],
)
print(resp.choices[0].message.content)
Streaming
Set "stream": true. The wire format depends on the endpoint:
OpenAI Server-Sent Events
Content-Type: text/event-stream — each event is
a chat.completion.chunk; the stream ends with [DONE].
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[],"usage":{"prompt_tokens":9,"completion_tokens":3,"total_tokens":12}}
data: [DONE]
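A client consuming this stream has to strip the data: prefix, stop at [DONE], and concatenate the content deltas. The following is a minimal sketch of that loop, assuming each event arrives as a single text line; the function name `read_sse` is hypothetical.

```python
import json


def read_sse(lines):
    """Collect streamed text and the optional usage chunk from SSE lines."""
    text_parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]  # sent only when include_usage is requested
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                text_parts.append(content)
    return "".join(text_parts), usage
```

With the requests library, the lines could come from `resp.iter_lines(decode_unicode=True)` on a response opened with `stream=True`; with the OpenAI SDK this parsing is handled for you.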
Ollama NDJSON
Content-Type: application/x-ndjson — one JSON
object per line; the last line has done: true.
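A reader for this format is simpler than the SSE case: parse each non-empty line as JSON, append the message content, and stop at the object marked done. The sketch below assumes the lines are already decoded to text; `read_ndjson` is a hypothetical helper name.

```python
import json


def read_ndjson(lines):
    """Join streamed message content and return it with the final object."""
    text_parts = []
    final = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        obj = json.loads(line)
        text_parts.append(obj.get("message", {}).get("content", ""))
        if obj.get("done"):
            final = obj  # carries prompt_eval_count, eval_count, durations
            break
    return "".join(text_parts), final
```

The final object is returned as-is so the caller can read the token counters from it.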
{"model":"llama3:8b","message":{"role":"assistant","content":"Hello"},"done":false}
{"model":"llama3:8b","message":{"role":"assistant","content":" there"},"done":false}
{"model":"llama3:8b","message":{"role":"assistant","content":""},"done":true,"prompt_eval_count":9,"eval_count":3,"total_duration":123456789}
Response Structure
OpenAI Chat Completion
{
"id": "chatcmpl-123",
"object": "chat.completion",
"created": 1677652288,
"model": "llama3:latest",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Here is your answer..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 9,
"completion_tokens": 12,
"total_tokens": 21
}
}
When the model calls a tool, message.tool_calls is
present and finish_reason is tool_calls.
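Handling a tool call on the client side could look like the sketch below. It assumes the tool_calls entries follow the OpenAI layout (an id, plus function.name and function.arguments as a JSON string), which this page does not show explicitly. The `get_weather` body is a hypothetical stand-in, and whether the gateway accepts follow-up messages with the tool role is also an assumption.

```python
import json


def get_weather(city):
    """Hypothetical local tool; a real one would query a weather service."""
    return {"city": city, "forecast": "sunny"}


# Maps tool names declared in the request to local Python callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(response):
    """Execute each requested tool and return tool-role result messages."""
    choice = response["choices"][0]
    if choice.get("finish_reason") != "tool_calls":
        return []
    results = []
    for call in choice["message"].get("tool_calls", []):
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        output = TOOLS[fn["name"]](**args)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(output),
        })
    return results
```

In the OpenAI convention, the assistant message and these result messages are appended to the conversation and sent in a second request so the model can produce its final answer.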
Ollama Chat Response
{
"model": "llama3:8b",
"created_at": "2026-06-29T10:00:00Z",
"message": {
"role": "assistant",
"content": "Here is your answer..."
},
"done": true,
"total_duration": 123456789,
"prompt_eval_count": 9,
"eval_count": 12
}
The model field shows the alias configured for the
serving node, not necessarily the raw tag.
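Code that talks to both endpoints may want a single view of the two response shapes. The sketch below maps the fields shown in the examples above onto one dictionary; the function name `normalize` and the output keys are hypothetical choices, not part of the gateway.

```python
def normalize(response):
    """Reduce either response shape to text plus token counts."""
    if "choices" in response:
        # OpenAI chat.completion shape
        usage = response.get("usage", {})
        return {
            "text": response["choices"][0]["message"].get("content") or "",
            "prompt_tokens": usage.get("prompt_tokens"),
            "completion_tokens": usage.get("completion_tokens"),
        }
    # Ollama /api/chat shape
    return {
        "text": response["message"]["content"],
        "prompt_tokens": response.get("prompt_eval_count"),
        "completion_tokens": response.get("eval_count"),
    }
```

Here prompt_eval_count and eval_count are treated as the Ollama counterparts of prompt_tokens and completion_tokens.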
Key Policies & Limits
These are configured per API key by an administrator and enforced on every request. They are not part of the request body — but they shape the responses you get.
| Policy | Effect |
|---|---|
| Model binding | Each key serves exactly one model; your model field is ignored. |
| Cluster routing | The key's requests are restricted to a dedicated pool of servers (workload / model isolation). |
| Rate limit | Maximum requests per minute. Exceeding it returns 429. |
| Token quotas | Daily / weekly / monthly token caps (per key or per owning user). Exceeding returns 429. |
| IP & domain allow-lists | Requests from other IPs / referrer domains are rejected with 403. |
| Expiry | After the expiry date the key returns 403. |
| Tools / Vision | Function calling and image input are honored only when enabled on the key; otherwise stripped silently. |
| Force JSON | Always requests JSON output, regardless of response_format. |
| System prompt | An administrator-defined system message may be injected ahead of your messages. |
Error Codes
Errors are returned as JSON. On the OpenAI endpoint the shape is
{"error":{"message":...,"type":...,"code":...}}; on the Ollama endpoint it is
{"error":{"message":...,"type":"proxy_error","code":...}}.
| Status | Meaning | Description |
|---|---|---|
| 400 | Bad Request | Invalid JSON body or malformed messages. |
| 401 | Unauthorized | Missing or invalid API key. Check the Authorization header. |
| 403 | Forbidden | Key expired, or IP / domain not allowed. |
| 405 | Method Not Allowed | Only POST is supported. |
| 429 | Too Many Requests | Rate limit or token quota exceeded. |
| 502 | Bad Gateway | The backend returned an invalid response (or a stream broke mid-way). |
| 503 | Service Unavailable | No healthy backend server can currently serve the key's model. |
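A client can use the status code to decide whether a retry is worthwhile. The sketch below is one possible policy, not a gateway requirement: treating 429, 502, and 503 as retryable and using exponential backoff with jitter are assumptions, and the helper names are hypothetical. Note that a 429 caused by an exhausted daily, weekly, or monthly token quota will not clear after a short wait.

```python
import random

# Assumed policy: these statuses may succeed on a later attempt.
RETRYABLE = {429, 502, 503}


def should_retry(status):
    """Return True when a later attempt might succeed for this status."""
    return status in RETRYABLE


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Exponential backoff with full jitter; attempt counts from 0."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


def error_message(body):
    """Extract error.message from a parsed error body on either endpoint."""
    err = body.get("error", {}) if isinstance(body, dict) else {}
    if isinstance(err, dict):
        return err.get("message", "unknown error")
    return str(err)
```

Client errors such as 400, 401, 403, and 405 are not retried in this sketch because repeating the same request would fail the same way.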
Postman
A ready-to-use Postman collection covers every endpoint and mode (chat, streaming, vision, tools, JSON mode, and the Ollama-native endpoint).
- Download and import both files:
  - postman_collection.json: the requests
  - postman_environment.json: the base_url / api_key variables
- In Postman, select the AI Gateway environment (top-right).
- Set base_url (e.g. https://api.adelk.sa/llm) and your api_key.
- Open any request and hit Send. Authentication is pre-wired via the collection's Bearer token.