LLM Platform (also known as Inference Hub) is an open-source product that simplifies the development and deployment of large language model (LLM) applications at scale. It provides a unified framework through which developers can integrate and manage multiple LLM vendors, models, and related services in a standardized yet highly flexible way.
The platform integrates with a wide range of LLM providers and related services:
Chat/Completion Models:
- OpenAI Platform and Azure OpenAI Service (GPT models)
- Anthropic (Claude models)
- Google Gemini
- AWS Bedrock
- Groq
- Mistral AI
- xAI
- Hugging Face
- Local deployments: Ollama, llama.cpp, mistral.rs
- Custom models via gRPC plugins
Embedding Models:
- OpenAI, Azure OpenAI, Jina, Hugging Face, Google Gemini
- Local: Ollama, llama.cpp
- Custom embedders via gRPC
Media Processing:
- Image generation: OpenAI DALL-E, Replicate
- Speech-to-text: OpenAI Whisper, Groq Whisper, whisper.cpp
- Text-to-speech: OpenAI TTS
- Reranking: Jina
Document Extractors:
- Apache Tika for various document formats
- Unstructured.io for advanced document parsing
- Azure Document Intelligence
- Jina Reader for web content
- Exa and Tavily for web search and extraction
- Text extraction from plain files
- Custom extractors via gRPC
Text Segmentation:
- Jina segmenter for semantic chunking
- Text-based chunking with configurable sizes
- Unstructured.io segmentation
- Custom segmenters via gRPC
Information Retrieval:
- Web search: DuckDuckGo, Exa, Tavily
- Custom retrievers via gRPC plugins
Chains & Agents:
- Agent/Assistant chains with tool calling capabilities
- Custom conversation flows
- Multi-step reasoning workflows
- Tool integration and function calling
Tools & Function Calling:
- Built-in tools: search, extract, retrieve, render, synthesize, translate
- Model Context Protocol (MCP) support: Full server and client implementation
- Connect to external MCP servers as tool providers
- Built-in MCP server exposing platform capabilities
- Multiple transport methods (HTTP streaming, SSE, command execution)
- Custom tools via gRPC plugins
Additional Capabilities:
- Text summarization (via chat models)
- Language translation
- Content rendering and formatting
Routing & Load Balancing:
- Round-robin load balancer for distributing requests
- Model fallback strategies
- Request routing across multiple providers
Rate Limiting & Control:
- Per-provider and per-model rate limiting
- Request throttling and queuing
- Resource usage controls
Authentication & Security:
- Static token authentication
- OpenID Connect (OIDC) integration
- Secure credential management
API Compatibility:
- OpenAI-compatible API endpoints
- Custom API configurations
- Multiple API versions support
Observability & Monitoring:
- Full OpenTelemetry integration
- Request tracing across all components
- Comprehensive metrics and logging
- Performance monitoring and debugging
Developers can define providers, models, credentials, document processing pipelines, tools, and advanced AI workflows using YAML configuration files. This approach streamlines integration and makes it easy to manage complex AI applications.
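For example, a single configuration file can wire a provider, a retriever, a tool, and an agent chain together. The following sketch is assembled from the snippets shown later in this document; names and keys may need adjusting for your deployment:

providers:
  - type: openai
    token: ${OPENAI_API_KEY}
    models:
      - gpt-4o

retrievers:
  web:
    type: duckduckgo

tools:
  search:
    type: search
    retriever: web

chains:
  assistant:
    type: agent
    model: gpt-4o
    tools:
      - search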
The architecture is modular and extensible, allowing developers to plug in different providers and services as needed. It consists of the following key components:
Core Providers:
- Completers: Chat/completion models for text generation and reasoning
- Embedders: Vector embedding models for semantic understanding
- Renderers: Image generation and visual content creation
- Synthesizers: Text-to-speech and audio generation
- Transcribers: Speech-to-text and audio processing
- Rerankers: Result ranking and relevance scoring
Document & Data Processing:
- Extractors: Document parsing and content extraction from various formats
- Segmenters: Text chunking and semantic segmentation for RAG
- Retrievers: Web search and information retrieval
- Summarizers: Content compression and summarization
- Translators: Multi-language text translation
AI Workflows & Tools:
- Chains: Multi-step AI workflows and agent-based reasoning
- Tools: Function calling, web search, document processing, and custom capabilities
- APIs: Multiple API formats and compatibility layers
Infrastructure:
- Routers: Load balancing and request distribution
- Rate Limiters: Resource control and throttling
- Authorizers: Authentication and access control
- Observability: OpenTelemetry tracing and monitoring
Use Cases:
- Enterprise AI Applications: Unified platform for multiple AI services and models
- RAG (Retrieval-Augmented Generation): Document processing, semantic search, and knowledge retrieval
- AI Agents & Workflows: Multi-step reasoning, tool integration, and autonomous task execution
- Scalable LLM Deployment: High-volume applications with load balancing and failover
- Multi-Modal AI: Combining text, image, and audio processing capabilities
- Custom AI Pipelines: Flexible workflows using custom tools and chains
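For the RAG use case, the document-processing building blocks are declared individually; wiring them into an indexing and retrieval flow is application-specific. A sketch using components configured later in this document:

extractors:
  tika:
    type: tika
    url: http://localhost:9998

segmenters:
  text:
    type: text
    chunkSize: 1000
    chunkOverlap: 200

providers:
  - type: openai
    token: ${OPENAI_API_KEY}
    models:
      - text-embedding-3-small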
OpenAI Platform: https://platform.openai.com/docs/api-reference
providers:
- type: openai
token: sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
models:
- gpt-4o
- gpt-4o-mini
- text-embedding-3-small
- text-embedding-3-large
- whisper-1
- dall-e-3
- tts-1
- tts-1-hd
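With a provider in place, clients can talk to the hub through its OpenAI-compatible API. A minimal sketch, assuming the hub listens on localhost:8080 and a static token is configured (both are assumptions, not documented defaults):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-token" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}'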
Azure OpenAI Service: https://azure.microsoft.com/en-us/products/ai-services/openai-service
providers:
- type: openai
url: https://xxxxxxxx.openai.azure.com
token: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
models:
# https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models
#
# {alias}:
# - id: {azure oai deployment name}
gpt-3.5-turbo:
id: gpt-35-turbo-16k
gpt-4:
id: gpt-4-32k
text-embedding-ada-002:
id: text-embedding-ada-002
providers:
- type: anthropic
token: sk-ant-apixx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# https://docs.anthropic.com/en/docs/models-overview
#
# {alias}:
# - id: {anthropic api model name}
models:
claude-3.5-sonnet:
id: claude-3-5-sonnet-20240620
providers:
- type: gemini
token: ${GOOGLE_API_KEY}
# https://ai.google.dev/gemini-api/docs/models/gemini
#
# {alias}:
# - id: {gemini api model name}
models:
gemini-1.5-pro:
id: gemini-1.5-pro-latest
gemini-1.5-flash:
id: gemini-1.5-flash-latest
providers:
- type: bedrock
# AWS credentials configured via environment or IAM roles
models:
claude-3-sonnet:
id: anthropic.claude-3-sonnet-20240229-v1:0
providers:
- type: groq
token: ${GROQ_API_KEY}
# https://console.groq.com/docs/models
#
# {alias}:
# - id: {groq api model name}
models:
groq-llama-3-8b:
id: llama3-8b-8192
groq-whisper-1:
id: whisper-large-v3
providers:
- type: mistral
token: ${MISTRAL_API_KEY}
# https://docs.mistral.ai/getting-started/models/
#
# {alias}:
# - id: {mistral api model name}
models:
mistral-large:
id: mistral-large-latest
providers:
- type: xai
token: ${XAI_API_KEY}
models:
grok-beta:
id: grok-beta
providers:
- type: replicate
token: ${REPLICATE_API_KEY}
# https://replicate.com/explore
#
# {alias}:
#   - id: {replicate model name}
models:
replicate-flux-pro:
id: black-forest-labs/flux-pro
$ ollama serve
$ ollama run mistral
providers:
- type: ollama
url: http://localhost:11434
# https://ollama.com/library
#
# {alias}:
# - id: {ollama model name with optional version}
models:
mistral-7b-instruct:
id: mistral:latest
llama.cpp Server: https://github.com/ggerganov/llama.cpp/tree/master/examples/server
# using taskfile.dev
$ task llama:server
# llama.cpp server
$ llama-server --port 9081 --log-disable --model ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
providers:
- type: llama
url: http://localhost:9081
models:
- mistral-7b-instruct
mistral.rs: https://github.com/EricLBuehler/mistral.rs
$ mistralrs-server --port 1234 --isq Q4K plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama
providers:
- type: mistralrs
url: http://localhost:1234
models:
mistralrs-llama-3.1-8b:
id: llama
whisper.cpp Server: https://github.com/ggerganov/whisper.cpp/tree/master/examples/server
# using taskfile.dev
$ task whisper:server
# whisper.cpp server
$ whisper-server --port 9083 --convert --model ./models/whisper-large-v3-turbo.bin
providers:
- type: whisper
url: http://localhost:9083
models:
- whisper
providers:
- type: huggingface
token: hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
models:
mistral-7B-instruct:
id: mistralai/Mistral-7B-Instruct-v0.1
huggingface-minilm-l6-2:
id: sentence-transformers/all-MiniLM-L6-v2
retrievers:
web:
type: duckduckgo
retrievers:
exa:
type: exa
token: ${EXA_API_KEY}
retrievers:
tavily:
type: tavily
token: ${TAVILY_API_KEY}
retrievers:
custom:
type: custom
url: http://localhost:8080
# using Docker
docker run -it --rm -p 9998:9998 apache/tika:3.0.0.0-BETA2-full
extractors:
tika:
type: tika
url: http://localhost:9998
chunkSize: 4000
chunkOverlap: 200
# using taskfile.dev
task unstructured:server
# using Docker
docker run -it --rm -p 9085:8000 quay.io/unstructured-io/unstructured-api:0.0.80 --port 8000 --host 0.0.0.0
extractors:
unstructured:
type: unstructured
url: http://localhost:9085/general/v0/general
extractors:
azure:
type: azure
url: https://YOUR_INSTANCE.cognitiveservices.azure.com
token: ${AZURE_API_KEY}
extractors:
jina:
type: jina
token: ${JINA_API_KEY}
extractors:
exa:
type: exa
token: ${EXA_API_KEY}
tavily:
type: tavily
token: ${TAVILY_API_KEY}
extractors:
text:
type: text
extractors:
custom:
type: custom
url: http://localhost:8080
segmenters:
jina:
type: jina
token: ${JINA_API_KEY}
segmenters:
text:
type: text
chunkSize: 1000
chunkOverlap: 200
segmenters:
unstructured:
type: unstructured
url: http://localhost:9085/general/v0/general
segmenters:
custom:
type: custom
url: http://localhost:8080
chains:
assistant:
type: agent
model: gpt-4o
tools:
- search
- extract
messages:
- role: system
content: "You are a helpful AI assistant."
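Assuming chains are exposed under their configured name through the OpenAI-compatible API (an assumption based on the compatibility layer described above, not confirmed here), the assistant could then be invoked like any other model:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "assistant", "messages": [{"role": "user", "content": "Summarize the latest MCP news"}]}'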
The platform provides comprehensive support for the Model Context Protocol (MCP), enabling integration with MCP-compatible tools and services.
MCP Server Support:
- Built-in MCP server that exposes platform tools to MCP clients
- Automatic tool discovery and schema generation
- Multiple transport methods (HTTP streaming, SSE, command-line)
MCP Client Support:
- Connect to external MCP servers as tool providers
- Support for various MCP transport methods
- Automatic tool registration and execution
MCP Tool Configuration:
tools:
# MCP server via HTTP streaming
mcp-streamable:
type: mcp
url: http://localhost:8080/mcp
# MCP server via Server-Sent Events
mcp-sse:
type: mcp
url: http://localhost:8080/sse
vars:
api-key: ${API_KEY}
# MCP server via command execution
mcp-command:
type: mcp
command: /path/to/mcp-server
args:
- --config
- /path/to/config.json
vars:
ENV_VAR: value
Built-in MCP Server:
The platform automatically exposes its tools via the MCP protocol at the /mcp endpoint, allowing other MCP clients to discover and use platform capabilities.
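As a smoke test, an MCP client begins with a JSON-RPC initialize handshake against that endpoint. A sketch over the streamable HTTP transport, assuming the hub listens on localhost:8080 (the protocolVersion shown is one published MCP revision; use whatever your client supports):

curl -X POST http://localhost:8080/mcp \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2025-03-26", "capabilities": {}, "clientInfo": {"name": "probe", "version": "0.1"}}}'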
tools:
search:
type: search
retriever: web
extract:
type: extract
extractor: tika
translate:
type: translate
translator: default
render:
type: render
renderer: dalle-3
synthesize:
type: synthesize
synthesizer: tts-1
tools:
custom-tool:
type: custom
url: http://localhost:8080
authorizers:
- type: static
tokens:
- "your-secret-token"
authorizers:
- type: oidc
url: https://your-oidc-provider.com
audience: your-audience
Route requests across equivalent models with a round-robin load balancer:
routers:
llama-lb:
type: roundrobin
models:
- llama-3-8b
- groq-llama-3-8b
- huggingface-llama-3-8b
Add rate limiting to any provider:
providers:
- type: openai
token: sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
limit: 10 # requests per second
models:
gpt-4o:
limit: 5 # override for specific model
Summarization is automatically available for any chat model:
# Use any completer model for summarization
# The platform automatically adapts chat models for summarization tasks
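By analogy with the translators block below, a summarizer entry might look like the following. This is hypothetical: since summarization is adapted automatically from chat models, explicit configuration may not be required at all:

summarizers:
  default:
    type: default
    # hypothetical: mirrors the translators schema; uses configured chat models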
translators:
default:
type: default
# Uses configured chat models for translation