ModelMesh Lite

Capability-driven AI model routing with automatic failover


Proxy Guide

Deploy ModelMesh as an OpenAI-compatible HTTP proxy. The proxy exposes a standard OpenAI REST API on a configurable port. Internally it leverages all ModelMesh capabilities: multi-provider routing, automatic failover, pool strategies, budget controls, and free-tier aggregation. Any OpenAI SDK client, curl, or plain fetch() call can talk to the proxy without modification.


Overview

Browser / SDK / curl
        |
        v  (OpenAI REST API)
  +-----------+
  |   Proxy   |  port 8080 (default)
  +-----------+
        |
        v  (ModelMesh routing)
  +-----------+     +-----------+     +----------+
  |  Router   | --> |   Pool    | --> |  Model   | --> Provider API
  +-----------+     +-----------+     +----------+

The proxy translates incoming requests to ModelMesh routing calls. When the client sends model: "text-generation", ModelMesh resolves the pool, picks the best active model using the configured rotation strategy, retries with backoff on failure, and rotates to the next provider when one is down or rate-limited.
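
The default stick-until-failure behaviour can be sketched in a few lines of Python. This is purely illustrative: the real strategy lives inside ModelMesh, and the model names here are hypothetical.

```python
def pick_model(models, failed):
    """Stick-until-failure in miniature: keep using the first model in
    preference order that has not failed; rotate only when it does."""
    for name in models:
        if name not in failed:
            return name
    raise RuntimeError("all providers failed")  # the proxy surfaces this as a 502

models = ["gpt-4o-mini", "claude-3-5-haiku", "llama-3.3-70b"]
print(pick_model(models, failed=set()))            # gpt-4o-mini
print(pick_model(models, failed={"gpt-4o-mini"}))  # claude-3-5-haiku
```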


Quick Start

1. Install

Python (pip):

pip install "modelmesh-lite[yaml]"

From source:

git clone https://github.com/ApartsinProjects/ModelMesh.git
cd ModelMesh
pip install -e ".[yaml]"

2. Set API Keys

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GROQ_API_KEY="gsk_..."

3. Start the Proxy (Auto-detect Mode)

With no config file, the proxy auto-detects providers from environment variables:

python -m modelmesh.proxy

The proxy starts on http://localhost:8080 and creates a default chat-completion pool from all detected providers.

4. Test

# List available models (pool IDs)
curl http://localhost:8080/v1/models

# Send a chat completion
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chat-completion",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
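
The same chat request can be issued from Python using only the standard library. Building the request is shown below; the commented-out urlopen call would send it against a running proxy.

```python
import json
import urllib.request

payload = {
    "model": "chat-completion",
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.method, req.full_url)  # POST http://localhost:8080/v1/chat/completions
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```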

Configuration

For full control over providers, models, pools, and strategies, use a YAML configuration file.

Minimal Configuration

# modelmesh.yaml
secrets:
  store: modelmesh.env.v1        # Read API keys from environment variables

providers:
  openai.llm.v1:
    api_key: ${secrets:OPENAI_API_KEY}

models:
  gpt-4o-mini:
    provider: openai.llm.v1
    capabilities:
      - generation.text-generation.chat-completion

pools:
  chat:
    capability: generation.text-generation.chat-completion
    strategy: modelmesh.stick-until-failure.v1

Multi-Provider Configuration

# modelmesh.yaml
secrets:
  store: modelmesh.env.v1

providers:
  openai.llm.v1:
    api_key: ${secrets:OPENAI_API_KEY}
    budget:
      daily_limit: 5.00

  anthropic.claude.v1:
    api_key: ${secrets:ANTHROPIC_API_KEY}
    budget:
      daily_limit: 5.00

  groq.api.v1:
    api_key: ${secrets:GROQ_API_KEY}

models:
  gpt-4o-mini:
    provider: openai.llm.v1
    capabilities:
      - generation.text-generation.chat-completion
    delivery:
      synchronous: true
      streaming: true
    features:
      tool_calling: true
      structured_output: true
      json_mode: true
      system_prompt: true
    constraints:
      context_window: 128000
      max_output_tokens: 16384

  claude-3-5-haiku:
    provider: anthropic.claude.v1
    capabilities:
      - generation.text-generation.chat-completion
    delivery:
      synchronous: true
      streaming: true
    features:
      tool_calling: true
      system_prompt: true
    constraints:
      context_window: 200000
      max_output_tokens: 8192

  llama-3.3-70b:
    provider: groq.api.v1
    capabilities:
      - generation.text-generation.chat-completion
    delivery:
      synchronous: true
      streaming: true
    features:
      tool_calling: true
      system_prompt: true
    constraints:
      context_window: 131072
      max_output_tokens: 32768

pools:
  text-generation:
    strategy: modelmesh.stick-until-failure.v1
    capability: generation.text-generation

Start with config:

python -m modelmesh.proxy --config modelmesh.yaml
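
The constraints values above are useful client-side as well. A hypothetical pre-flight check that a prompt plus the requested output fits a model's context window:

```python
def fits_context(prompt_tokens: int, max_tokens: int, context_window: int) -> bool:
    """Pre-flight check: prompt plus requested output must fit the window."""
    return prompt_tokens + max_tokens <= context_window

# gpt-4o-mini as configured above: 128000-token window, 16384 max output
print(fits_context(100_000, 16_384, 128_000))  # True
print(fits_context(120_000, 16_384, 128_000))  # False
```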

Configuration Sections

| Section | Purpose | Reference |
|---------|---------|-----------|
| secrets | Secret store backend (env vars, dotenv, cloud vaults) | SystemConfiguration |
| providers | Provider registration, API keys, budgets, rate limits | SystemConfiguration |
| models | Model definitions with capabilities, features, constraints | SystemConfiguration |
| pools | Capability pools with rotation strategies | SystemConfiguration |

Available Providers

| Connector ID | Provider | Capabilities |
|--------------|----------|--------------|
| openai.llm.v1 | OpenAI | Chat, embeddings, audio |
| anthropic.claude.v1 | Anthropic | Chat |
| groq.api.v1 | Groq | Chat (fast inference) |
| google.gemini.v1 | Google Gemini | Chat |
| deepseek.api.v1 | DeepSeek | Chat |
| mistral.api.v1 | Mistral | Chat, embeddings |
| together.api.v1 | Together AI | Chat |
| openrouter.api.v1 | OpenRouter | Chat (multi-model gateway) |
| xai.api.v1 | xAI (Grok) | Chat |
| cohere.api.v1 | Cohere | Chat, embeddings |

See ConnectorCatalogue for all connectors and config schemas.

Rotation Strategies

| Strategy | Connector ID | Behaviour |
|----------|--------------|-----------|
| Stick-until-failure | modelmesh.stick-until-failure.v1 | Use one model until it fails, then rotate (default) |
| Round-robin | modelmesh.round-robin.v1 | Cycle through models in sequence |
| Cost-first | modelmesh.cost-first.v1 | Always pick the model with the lowest accumulated cost |
| Latency-first | modelmesh.latency-first.v1 | Always pick the model with the lowest observed latency |
| Priority | modelmesh.priority-selection.v1 | Follow an ordered preference list with fallback |
| Session-stickiness | modelmesh.session-stickiness.v1 | Route same-session requests to the same model |
| Rate-limit-aware | modelmesh.rate-limit-aware.v1 | Track per-model quotas, switch before exhaustion |
| Load-balanced | modelmesh.load-balanced.v1 | Distribute requests using weighted round-robin |

For custom strategies, see the FAQ — CDK extension guide. For full strategy config options, see the Connector Catalogue.
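
For intuition, the first two strategies produce very different request orderings. A toy comparison over a hypothetical pool of the three models configured earlier:

```python
from itertools import cycle, islice

models = ["gpt-4o-mini", "claude-3-5-haiku", "llama-3.3-70b"]

# modelmesh.round-robin.v1: cycle through models in sequence
round_robin = list(islice(cycle(models), 5))
print(round_robin)

# modelmesh.stick-until-failure.v1: same model every request until it fails
stick = [models[0]] * 5
print(stick)
```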

Secret Stores

| Store | Connector ID | Source |
|-------|--------------|--------|
| Environment variables | modelmesh.env.v1 | process.env / os.environ |
| Dotenv file | modelmesh.dotenv.v1 | .env file |
| AWS Secrets Manager | aws.secrets-manager.v1 | AWS cloud |
| Google Secret Manager | google.secret-manager.v1 | GCP cloud |
| Azure Key Vault | microsoft.key-vault.v1 | Azure cloud |
| 1Password Connect | 1password.connect.v1 | 1Password vault |

CLI Reference

python -m modelmesh.proxy [OPTIONS]

| Flag | Default | Description |
|------|---------|-------------|
| --config PATH | (auto-detect) | Path to YAML configuration file |
| --host HOST | 0.0.0.0 | Bind address |
| --port PORT | 8080 | Listen port |
| --token TOKEN | (none) | Bearer token for authentication |
| --log-level LEVEL | INFO | Logging level: DEBUG, INFO, WARNING, ERROR |

Examples:

# Auto-detect providers from env vars
python -m modelmesh.proxy

# Custom config, port, and auth token
python -m modelmesh.proxy --config modelmesh.yaml --port 9090 --token my-secret

# Debug logging
python -m modelmesh.proxy --config modelmesh.yaml --log-level DEBUG

REST API Endpoints

All endpoints follow the OpenAI REST API specification.

GET /v1/models

List available models (pool IDs exposed as virtual model names).

curl http://localhost:8080/v1/models

Response:

{
  "object": "list",
  "data": [
    { "id": "text-generation", "object": "model", "created": 1710000000, "owned_by": "modelmesh" }
  ]
}
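
Client code usually only needs the id fields. Extracting the pool IDs from a response body of that shape:

```python
# Response body as returned by GET /v1/models (shape shown above)
body = {
    "object": "list",
    "data": [
        {"id": "text-generation", "object": "model",
         "created": 1710000000, "owned_by": "modelmesh"},
    ],
}
pool_ids = [m["id"] for m in body["data"]]
print(pool_ids)  # ['text-generation']
```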

POST /v1/chat/completions

Send a chat completion request. Use the pool ID as the model parameter.

Non-streaming:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-generation",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is ModelMesh?"}
    ],
    "temperature": 0.7,
    "max_tokens": 512
  }'

Streaming:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-generation",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Streaming responses use Server-Sent Events (SSE) with chunked transfer encoding. Each chunk is a data: {json}\n\n line, terminated by data: [DONE].
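
The same framing can be parsed in Python. A minimal parser for a single SSE line, assuming the OpenAI streaming chunk shape (tokens under choices[0].delta.content):

```python
import json

def delta_from_sse_line(line: str):
    """Return the content token from one SSE line, or None for
    non-data lines, empty deltas, and the [DONE] sentinel."""
    line = line.strip()
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    return choices[0].get("delta", {}).get("content") if choices else None

print(delta_from_sse_line('data: {"choices":[{"delta":{"content":"Hi"}}]}'))  # Hi
print(delta_from_sse_line("data: [DONE]"))  # None
```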

POST /v1/embeddings

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "embeddings", "input": "Hello world"}'

POST /v1/audio/speech

curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts", "input": "Hello world"}'

POST /v1/audio/transcriptions

curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F model="stt" \
  -F file="@audio.mp3"

GET /health

Server health and status.

curl http://localhost:8080/health

Response:

{
  "running": true,
  "host": "0.0.0.0",
  "port": 8080,
  "uptime_seconds": 120.5,
  "active_connections": 0,
  "total_requests": 42
}
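
A small illustrative helper that turns that payload into a one-line summary (field names exactly as in the response above):

```python
def health_summary(status: dict) -> str:
    """Render the /health JSON payload as a one-line status string."""
    state = "up" if status.get("running") else "down"
    return (f"{state} on {status['host']}:{status['port']} "
            f"({status['total_requests']} requests, "
            f"{status['uptime_seconds']:.0f}s uptime)")

payload = {"running": True, "host": "0.0.0.0", "port": 8080,
           "uptime_seconds": 120.5, "active_connections": 0,
           "total_requests": 42}
print(health_summary(payload))
```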

Authentication

When the proxy is started with --token, all requests must include a bearer token:

curl -H "Authorization: Bearer my-secret" http://localhost:8080/v1/models

CORS

The proxy includes full CORS support (Access-Control-Allow-Origin: *) for browser access. No additional CORS proxy is needed.


Docker Deployment

Prerequisites

Docker Engine installed and running (plus the Docker Compose plugin for Option B).

Files

| File | Purpose |
|------|---------|
| Dockerfile | Python 3.12-slim image with ModelMesh + PyYAML |
| docker-compose.yaml | Service definition with port mapping and config mount |
| modelmesh.yaml | Pool and provider configuration |
| .env | API keys (gitignored, never committed) |

Build and Run

Option A: Pull pre-built image from GHCR (fastest):

docker pull ghcr.io/apartsinprojects/modelmesh:latest

# Run with environment variables
docker run -p 8080:8080 \
  -e OPENAI_API_KEY="sk-..." \
  -e ANTHROPIC_API_KEY="sk-ant-..." \
  ghcr.io/apartsinprojects/modelmesh:latest \
  --host 0.0.0.0 --port 8080

# Or run with config file and env file
docker run -p 8080:8080 \
  --env-file .env \
  -v ./modelmesh.yaml:/app/modelmesh.yaml:ro \
  ghcr.io/apartsinprojects/modelmesh:latest \
  --config /app/modelmesh.yaml --host 0.0.0.0 --port 8080

Option B: Docker Compose (recommended for development):

# 1. Create .env with your API keys
cat > .env << 'EOF'
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GROQ_API_KEY=gsk_...
EOF

# 2. Build and start
docker compose up --build

# 3. Or run in background
docker compose up --build -d

# 4. View logs
docker compose logs -f

# 5. Stop
docker compose down

Option C: Build Docker image locally:

# Build
docker build -t modelmesh-proxy .

# Run
docker run -p 8080:8080 \
  -e OPENAI_API_KEY="sk-..." \
  -e ANTHROPIC_API_KEY="sk-ant-..." \
  -v ./modelmesh.yaml:/app/modelmesh.yaml:ro \
  modelmesh-proxy \
  --config /app/modelmesh.yaml --host 0.0.0.0 --port 8080

Using Automation Scripts

# Build + run (interactive, Ctrl+C to stop)
./scripts/proxy-up.sh

# Build + run in background
./scripts/proxy-up.sh --detach

# Stop
./scripts/proxy-down.sh

# Run smoke tests against running proxy
./scripts/proxy-test.sh

# Full cycle: build, start, test, stop
./scripts/proxy-test.sh --full

Verify

# Health check
curl http://localhost:8080/health

# List models
curl http://localhost:8080/v1/models

# Chat completion
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"text-generation","messages":[{"role":"user","content":"Hi"}]}'

Docker Compose Configuration

# docker-compose.yaml
services:
  modelmesh-proxy:
    build: .
    ports:
      - "8080:8080"
    env_file: .env
    volumes:
      - ./modelmesh.yaml:/app/modelmesh.yaml:ro
    command: ["--config", "/app/modelmesh.yaml", "--host", "0.0.0.0", "--port", "8080"]

Dockerfile

FROM python:3.12-slim
WORKDIR /app
COPY src/python/ ./
COPY pyproject.toml ./
RUN pip install . && pip install "pyyaml>=6.0"
EXPOSE 8080
ENTRYPOINT ["python", "-m", "modelmesh.proxy"]
CMD ["--host", "0.0.0.0", "--port", "8080"]

Browser Usage

The proxy includes CORS support, so you can call it directly from browser JavaScript using fetch().

Vanilla JS Example

A complete browser test page is provided at samples/proxy-test/index.html. Open it directly in a browser (no build step needed).

Non-streaming request:

const response = await fetch('http://localhost:8080/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'text-generation',
    messages: [{ role: 'user', content: 'Hello!' }],
  }),
});
const data = await response.json();
console.log(data.choices[0].message.content);

Streaming request:

const response = await fetch('http://localhost:8080/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'text-generation',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true,
  }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });

  const lines = buffer.split('\n');
  buffer = lines.pop();

  for (const line of lines) {
    const trimmed = line.trim();
    if (trimmed === 'data: [DONE]') break;
    if (!trimmed.startsWith('data: ')) continue;
    const chunk = JSON.parse(trimmed.slice(6));
    const token = chunk.choices?.[0]?.delta?.content;
    if (token) console.log(token); // or append it to the DOM
  }
}

OpenAI SDK Compatibility

Any OpenAI SDK client works with the proxy by changing the base URL:

Python (openai SDK):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # or your proxy token
)

response = client.chat.completions.create(
    model="text-generation",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

TypeScript (openai SDK):

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: 'not-needed',
});

const response = await client.chat.completions.create({
  model: 'text-generation',
  messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(response.choices[0].message.content);

Programmatic Usage

You can also embed the proxy server in your own Python application:

from modelmesh.proxy.server import ProxyServer

# Start in background
server = ProxyServer(
    config="modelmesh.yaml",
    host="0.0.0.0",
    port=8080,
    token="my-secret",
)
server.start(block=False)

# Check status
print(server.get_status())

# Stop
server.stop()

Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| ModuleNotFoundError: yaml | PyYAML not installed | pip install pyyaml, or install with pip install "modelmesh-lite[yaml]" |
| Connection refused | Proxy not running or wrong port | Check docker compose logs, or start with --log-level DEBUG |
| 502 Routing error | All providers failed | Check API keys in .env; verify provider connectivity |
| 401 Invalid token | Bearer token mismatch | Pass the correct --token, or remove the auth requirement |
| CORS error in browser | Proxy not reachable | Verify the proxy URL and that the proxy is running |
| Empty request body | Missing Content-Type header | Add -H "Content-Type: application/json" to the curl call |
| Streaming not working | Client not reading SSE | Read the body as a stream (ReadableStream), not via response.json() |

See also: FAQ · Quick Start · System Configuration