A comprehensive, vanilla evaluation, testing, and monitoring system for AI agents that uses OpenRouter as the LLM provider.
This framework provides enterprise-grade tools for:
- Comprehensive Testing: Deterministic, semantic, behavioral, safety, and performance tests
- LLM-as-Judge Evaluation: Use advanced models to evaluate other model outputs
- Production Monitoring: Real-time metrics, alerting, and drift detection
- Drift Detection: Statistical and semantic drift detection across inputs, outputs, and performance
- Multi-Model Support: Works with any model available through OpenRouter
# Clone or download the framework
cd ai-agent-evals
# Install dependencies
pip install -r requirements.txt
# Set up environment
cp .env.template .env
# Edit .env with your OpenRouter API key
Edit your .env file:
# Required
OPENROUTER_API_KEY=your_openrouter_api_key_here
# Optional (defaults provided)
DEFAULT_MODEL=openai/gpt-3.5-turbo
JUDGE_MODEL=openai/gpt-4
SIMILARITY_THRESHOLD=0.8
python main_demo.py
This will run a comprehensive demonstration of all features.
OpenAI-compatible client for OpenRouter API with built-in pricing calculation.
from openrouter_client import OpenRouterClient, OpenRouterPricing
client = OpenRouterClient(api_key="your_key")
pricing = OpenRouterPricing()
response = client.chat.completions.create(
model="openai/gpt-3.5-turbo",
messages=[{"role": "user", "content": "Hello!"}]
)
cost = pricing.calculate_cost(response.usage.total_tokens, "openai/gpt-3.5-turbo")
Core testing engine with multiple evaluation strategies.
from testing_framework import EvaluationFramework, TestCase, TestType
framework = EvaluationFramework("your_api_key")
# Create a test case
test = TestCase(
id="test_sql_basic",
type=TestType.SEMANTIC,
input="Write SQL to select all users",
expected="SELECT * FROM users;",
metadata={"similarity_threshold": 0.8}
)
# Run single test
result = framework.run_test(test, your_llm_function, api_key="your_key")
# Create test suite and run
framework.create_test_suite("my_tests", [test])
results = framework.run_suite("my_tests", your_llm_function, api_key="your_key")
Exact string matching for precise outputs.
TestCase(
id="exact_match",
type=TestType.DETERMINISTIC,
input="What is 2+2?",
expected="4"
)
Meaning-based comparison using embeddings.
TestCase(
id="semantic_test",
type=TestType.SEMANTIC,
input="Explain SQL",
expected="SQL is a language for managing databases",
metadata={"similarity_threshold": 0.75}
)
Check if outputs meet behavioral constraints.
TestCase(
id="sql_safety",
type=TestType.BEHAVIORAL,
input="Write a database query",
expected=None,
constraints={
"must_include": ["SELECT", "FROM"],
"must_exclude": ["DELETE", "DROP"],
"format": "sql"
}
)
Detect harmful content, PII, and security issues.
TestCase(
id="safety_check",
type=TestType.SAFETY,
input="User input with potential issues",
expected=None,
metadata={"check_harmful": True}
)
Measure latency and cost constraints.
TestCase(
id="performance_test",
type=TestType.PERFORMANCE,
input="Complex query requiring fast response",
expected=None,
constraints={"max_latency_ms": 2000, "max_tokens": 500}
)
from production_monitoring import ProductionMonitor
monitor = ProductionMonitor("production.db")
# Log requests
monitor.log_request(
request_id="req_123",
input_text="User question",
output_text="AI response",
latency_ms=450,
tokens_used=150,
model="openai/gpt-3.5-turbo",
success=True
)
# Get metrics
metrics = monitor.get_metrics(hours=24)
print(f"Success rate: {metrics['success_rate']:.1%}")
print(f"Average latency: {metrics['avg_latency']:.0f}ms")
from production_monitoring import DriftDetector
detector = DriftDetector()
# Detect input drift
baseline_inputs = ["Normal queries from last week"]
current_inputs = ["Recent queries"]
has_drift, score, details = detector.detect_input_drift(
baseline_inputs,
current_inputs,
method='embedding'
)
if has_drift:
    alert = detector.create_drift_alert(
        'input', 'query_distribution', 1.0, 1.0 + score, details
    )
    print(f"⚠️ {alert.severity} drift detected: {alert.action_required}")
Use advanced models to evaluate other model outputs:
from test_suites import LLMJudgeEvaluator
judge = LLMJudgeEvaluator("your_api_key")
evaluation = judge.evaluate_response(
prompt="Explain machine learning",
response="ML is a subset of AI that learns from data...",
criteria={
"accuracy": "Is the information correct?",
"clarity": "Is it easy to understand?",
"completeness": "Does it fully answer the question?"
}
)
print(f"Overall score: {evaluation['overall_score']}")
Pre-built test suites for common use cases:
from test_suites import AIAgentTestSuites
test_creator = AIAgentTestSuites("your_api_key")
# Create domain-specific test suites
test_creator.create_sql_generation_tests()
test_creator.create_data_quality_tests()
test_creator.create_safety_tests()
test_creator.create_edge_case_tests()
# Run specific suite
results = test_creator.framework.run_suite(
"sql_generation",
your_llm_function,
api_key="your_key"
)
Available test suites:
- SQL Generation: Test database query generation
- Data Quality: Test data validation capabilities
- Data Analysis: Test analytical reasoning
- Pipeline Analysis: Test code analysis skills
- Regression: Catch model performance regressions
- Edge Cases: Handle unusual inputs
- Domain Expertise: Test specialized knowledge
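To run several of these suites in one pass and compare pass rates, you can loop over them. A minimal sketch: only "sql_generation" is named explicitly above, so the other suite names are assumptions based on the create_* methods, and it assumes run_suite returns results with a boolean 'passed' column (as in the CI/CD example below).
# Sketch: create and run multiple pre-built suites, then summarize pass rates.
# Suite names other than "sql_generation" are assumptions.
from test_suites import AIAgentTestSuites

test_creator = AIAgentTestSuites("your_api_key")
test_creator.create_sql_generation_tests()
test_creator.create_data_quality_tests()
test_creator.create_safety_tests()

for suite_name in ["sql_generation", "data_quality", "safety"]:
    results = test_creator.framework.run_suite(
        suite_name, your_llm_function, api_key="your_key"
    )
    print(f"{suite_name}: {results['passed'].mean():.1%} passed")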
def my_llm_function(prompt, **kwargs):
    """Your custom LLM function"""
    # Process the prompt with your model
    response = your_model.generate(prompt)
    return {
        'response': response.text,
        'tokens': response.token_count,
        'cost': calculate_cost(response.token_count)
    }
# Use with framework
framework.run_test(test_case, my_llm_function, custom_param="value")
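As a concrete variant, the custom function can be backed by the OpenRouter client shown earlier. This is a sketch: it assumes the OpenAI-compatible response shape (choices[0].message.content) and the 'response' / 'tokens' / 'cost' keys expected above.
from openrouter_client import OpenRouterClient, OpenRouterPricing

client = OpenRouterClient(api_key="your_key")
pricing = OpenRouterPricing()

def openrouter_llm_function(prompt, model="openai/gpt-3.5-turbo", **kwargs):
    """Custom LLM function backed by OpenRouter (sketch)."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    tokens = completion.usage.total_tokens
    return {
        'response': completion.choices[0].message.content,  # OpenAI-compatible shape
        'tokens': tokens,
        'cost': pricing.calculate_cost(tokens, model)
    }

framework.run_test(test_case, openrouter_llm_function)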
import os

def ci_test_pipeline():
    """CI/CD pipeline test function"""
    framework = EvaluationFramework(os.getenv("API_KEY"))
    # Run critical tests
    results = framework.run_suite("critical_tests", my_llm_function)
    pass_rate = results['passed'].mean()
    if pass_rate < 0.9:  # 90% threshold
        print("❌ DEPLOYMENT BLOCKED")
        exit(1)
    else:
        print("✅ DEPLOYMENT APPROVED")
class CustomDriftDetector(DriftDetector):
    def detect_business_metric_drift(self, baseline_metrics, current_metrics):
        """Custom drift detection for business metrics"""
        # Your custom logic
        pass
detector = CustomDriftDetector()
- Request Metrics: Volume, success rate, latency percentiles
- Cost Metrics: Token usage, cost per request, total spend
- Quality Metrics: User feedback, test pass rates
- Drift Metrics: Input/output distribution changes
- Error Metrics: Error types and frequencies
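Because these metrics are timestamped, you can compare windows to monitor trends rather than point values. A minimal sketch using get_metrics with different windows; key names beyond success_rate and avg_latency are assumptions.
# Sketch: compare the last hour against a 24h baseline to spot latency trends
last_24h = monitor.get_metrics(hours=24)
last_hour = monitor.get_metrics(hours=1)

if last_hour['avg_latency'] > 1.5 * last_24h['avg_latency']:
    print("Latency is trending up versus the 24h baseline")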
# Get test history
history = framework.get_test_history(hours=24)
# Export to CSV
history.to_csv("test_results.csv", index=False)
# Get production metrics
metrics = monitor.get_metrics(hours=24)
# Custom reporting
from datetime import datetime

report = {
    "timestamp": datetime.now(),
    "test_results": results.to_dict(),
    "production_metrics": metrics,
    "alerts": monitor.get_recent_alerts(24).to_dict()
}
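To persist or ship this report, note that datetime values are not JSON-serializable by default; a simple workaround is to fall back to str():
import json

# default=str handles the datetime timestamp; adjust for your reporting pipeline
with open("daily_report.json", "w") as f:
    json.dump(report, f, default=str, indent=2)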
# OpenRouter API
OPENROUTER_API_KEY=your_key_here
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
# Models
DEFAULT_MODEL=openai/gpt-3.5-turbo
JUDGE_MODEL=openai/gpt-4
EMBEDDINGS_MODEL=all-MiniLM-L6-v2
# Database
TEST_RESULTS_DB=test_results.db
PRODUCTION_METRICS_DB=production_metrics.db
# Thresholds
ALERT_LATENCY_THRESHOLD_MS=2000
ALERT_ERROR_RATE_THRESHOLD=0.05
SIMILARITY_THRESHOLD=0.8
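These values can be exported in your shell or loaded from .env at runtime. A minimal sketch, assuming python-dotenv is installed (the cp .env.template .env step above implies .env-based configuration):
# Sketch: read configuration at runtime; assumes python-dotenv is available,
# otherwise export the variables in your shell instead.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

api_key = os.getenv("OPENROUTER_API_KEY")
default_model = os.getenv("DEFAULT_MODEL", "openai/gpt-3.5-turbo")
similarity_threshold = float(os.getenv("SIMILARITY_THRESHOLD", "0.8"))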
# Custom thresholds
monitor.alert_thresholds.update({
'latency_p95': 1500, # 1.5s
'error_rate': 0.02, # 2%
'cost_per_request': 0.05 # $0.05
})
# Custom pricing
pricing = OpenRouterPricing()
pricing.pricing['custom/model'] = 0.001 / 1000
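With the custom entry registered, cost estimation works the same way as in the OpenRouter client example, assuming the pricing table maps model names to a per-token USD rate (as the 0.001 / 1000 value suggests):
# 1,200 tokens at $0.001 per 1K tokens ≈ $0.0012 (per-token rate assumed)
cost = pricing.calculate_cost(1200, "custom/model")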
The framework automatically generates alerts for:
- High latency (>2s by default)
- High error rates (>5% by default)
- High costs (>$0.10 per request by default)
- Drift detection across inputs/outputs/performance
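Generated alerts can be pulled back for inspection or routed to your own channels. A small sketch using get_recent_alerts, which the custom reporting example above suggests returns a DataFrame-like object:
# Sketch: review alerts from the last 24 hours
alerts = monitor.get_recent_alerts(24)
if len(alerts) > 0:
    print(f"{len(alerts)} alert(s) in the last 24h")
    print(alerts.to_dict())  # same shape used in the reporting example above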
def custom_alert_check(monitor):
    metrics = monitor.get_metrics(1)
    if metrics['avg_user_feedback'] < 3.0:  # Below 3/5 stars
        monitor._create_alert(
            'low_satisfaction',
            'WARNING',
            f'User satisfaction dropped to {metrics["avg_user_feedback"]:.1f}',
            metrics['avg_user_feedback'],
            3.0
        )
- Start with Critical Tests: Focus on core functionality first
- Use Multiple Test Types: Combine deterministic, semantic, and behavioral tests
- Set Appropriate Thresholds: Tune similarity thresholds based on your use case
- Regular Test Updates: Update test cases as your model evolves
- Log Everything: Capture inputs, outputs, latency, and user feedback
- Set Smart Alerts: Avoid alert fatigue with meaningful thresholds
- Monitor Trends: Look at metrics over time, not just point values
- Regular Drift Checks: Run drift detection daily or weekly
- Batch Testing: Run tests in parallel when possible
- Cache Embeddings: Reuse embeddings for semantic comparisons (see the sketch after this list)
- Database Indexing: Ensure proper indexes on timestamp fields
- Cleanup Old Data: Archive old test results and metrics
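For the "Cache Embeddings" item above, here is a minimal cache sketch. It assumes sentence-transformers, which the all-MiniLM-L6-v2 default implies; the framework's own semantic comparison may cache differently.
# Minimal embedding cache sketch (assumes sentence-transformers)
from functools import lru_cache
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")

@lru_cache(maxsize=4096)
def cached_embedding(text: str):
    # encode() returns a numpy array; store as a tuple so results are cacheable
    return tuple(_model.encode(text))
Repeated semantic comparisons against the same expected outputs then reuse cached vectors instead of re-embedding them.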
API Key Issues
# Check if API key is set
python -c "import os; print('API Key:', os.getenv('OPENROUTER_API_KEY', 'NOT SET'))"
# Test API connection
python openrouter_client.py
Database Issues
# Check database permissions
ls -la *.db
# Reset databases
rm *.db
python -c "from testing_framework import EvaluationFramework; EvaluationFramework('test')"
Import Issues
# Check dependencies
pip install -r requirements.txt
# Check Python path
python -c "import sys; print(sys.path)"
import logging
logging.basicConfig(level=logging.DEBUG)
# Detailed test execution
framework.run_test(test_case, my_function, debug=True)
We welcome contributions! Please see our contribution guidelines for details.
- Fork the repository on GitHub
- Clone your fork:
git clone https://github.com/your-username/ai-agent-evals.git
- Create a feature branch:
git checkout -b feature/amazing-feature
- Install dependencies:
pip install -r requirements.txt
- Make your changes and add tests
- Run tests:
python test_imports.py
- Commit changes:
git commit -m 'Add amazing feature'
- Push to branch:
git push origin feature/amazing-feature
- Open a Pull Request on GitHub
# Clone the repository
git clone https://github.com/drc-infinyon/ai-agent-evals.git
cd ai-agent-evals
# Install dependencies
pip install -r requirements.txt
# Run validation
python test_imports.py
# Run demo (requires OpenRouter API key)
python main_demo.py
- Follow PEP 8 guidelines
- Use type hints where appropriate
- Add docstrings for all public functions
- Run black for code formatting
- Add tests for new functionality
This project is licensed under the MIT License - see the LICENSE file for details.
For questions and issues:
- Check the troubleshooting section above
- Review the demo script (main_demo.py) for examples
- Check that your OpenRouter API key has sufficient credits
- Ensure all dependencies are properly installed
- Web dashboard for monitoring
- Integration with MLflow/Weights & Biases
- A/B testing framework
- Multi-model comparison tools
- Advanced anomaly detection
- Custom evaluation metrics
- Automated model retraining triggers
Built for reliable AI systems in production