A comprehensive, vanilla evaluation, testing, and monitoring system for AI agents that uses OpenRouter as the LLM provider.
This framework provides enterprise-grade tools for:
- Comprehensive Testing: Deterministic, semantic, behavioral, safety, and performance tests
- LLM-as-Judge Evaluation: Use advanced models to evaluate other model outputs
- Production Monitoring: Real-time metrics, alerting, and drift detection
- Drift Detection: Statistical and semantic drift detection across inputs, outputs, and performance
- Multi-Model Support: Works with any model available through OpenRouter
# Clone or download the framework
cd ai-agent-evals
# Install dependencies
pip install -r requirements.txt
# Set up environment
cp .env.template .env
# Edit .env with your OpenRouter API key
Edit your .env file:
# Required
OPENROUTER_API_KEY=your_openrouter_api_key_here
# Optional (defaults provided)
DEFAULT_MODEL=openai/gpt-3.5-turbo
JUDGE_MODEL=openai/gpt-4
SIMILARITY_THRESHOLD=0.8
python main_demo.py
This will run a comprehensive demonstration of all features.
OpenAI-compatible client for OpenRouter API with built-in pricing calculation.
from openrouter_client import OpenRouterClient, OpenRouterPricing
client = OpenRouterClient(api_key="your_key")
pricing = OpenRouterPricing()
response = client.chat.completions.create(
model="openai/gpt-3.5-turbo",
messages=[{"role": "user", "content": "Hello!"}]
)
cost = pricing.calculate_cost(response.usage.total_tokens, "openai/gpt-3.5-turbo")
Core testing engine with multiple evaluation strategies.
from testing_framework import EvaluationFramework, TestCase, TestType
framework = EvaluationFramework("your_api_key")
# Create a test case
test = TestCase(
id="test_sql_basic",
type=TestType.SEMANTIC,
input="Write SQL to select all users",
expected="SELECT * FROM users;",
metadata={"similarity_threshold": 0.8}
)
# Run single test
result = framework.run_test(test, your_llm_function, api_key="your_key")
# Create test suite and run
framework.create_test_suite("my_tests", [test])
results = framework.run_suite("my_tests", your_llm_function, api_key="your_key")
Exact string matching for precise outputs.
TestCase(
id="exact_match",
type=TestType.DETERMINISTIC,
input="What is 2+2?",
expected="4"
)
Meaning-based comparison using embeddings.
TestCase(
id="semantic_test",
type=TestType.SEMANTIC,
input="Explain SQL",
expected="SQL is a language for managing databases",
metadata={"similarity_threshold": 0.75}
)
Check if outputs meet behavioral constraints.
TestCase(
id="sql_safety",
type=TestType.BEHAVIORAL,
input="Write a database query",
expected=None,
constraints={
"must_include": ["SELECT", "FROM"],
"must_exclude": ["DELETE", "DROP"],
"format": "sql"
}
)
Detect harmful content, PII, and security issues.
TestCase(
id="safety_check",
type=TestType.SAFETY,
input="User input with potential issues",
expected=None,
metadata={"check_harmful": True}
)
Measure latency and cost constraints.
TestCase(
id="performance_test",
type=TestType.PERFORMANCE,
input="Complex query requiring fast response",
expected=None,
constraints={"max_latency_ms": 2000, "max_tokens": 500}
)
from production_monitoring import ProductionMonitor
monitor = ProductionMonitor("production.db")
# Log requests
monitor.log_request(
request_id="req_123",
input_text="User question",
output_text="AI response",
latency_ms=450,
tokens_used=150,
model="openai/gpt-3.5-turbo",
success=True
)
# Get metrics
metrics = monitor.get_metrics(hours=24)
print(f"Success rate: {metrics['success_rate']:.1%}")
print(f"Average latency: {metrics['avg_latency']:.0f}ms")
from production_monitoring import DriftDetector
detector = DriftDetector()
# Detect input drift
baseline_inputs = ["Normal queries from last week"]
current_inputs = ["Recent queries"]
has_drift, score, details = detector.detect_input_drift(
baseline_inputs,
current_inputs,
method='embedding'
)
if has_drift:
    alert = detector.create_drift_alert(
        'input', 'query_distribution', 1.0, 1.0 + score, details
    )
    print(f"⚠️ {alert.severity} drift detected: {alert.action_required}")
Use advanced models to evaluate other model outputs:
from test_suites import LLMJudgeEvaluator
judge = LLMJudgeEvaluator("your_api_key")
evaluation = judge.evaluate_response(
prompt="Explain machine learning",
response="ML is a subset of AI that learns from data...",
criteria={
"accuracy": "Is the information correct?",
"clarity": "Is it easy to understand?",
"completeness": "Does it fully answer the question?"
}
)
print(f"Overall score: {evaluation['overall_score']}")
Pre-built test suites for common use cases:
from test_suites import AIAgentTestSuites
test_creator = AIAgentTestSuites("your_api_key")
# Create domain-specific test suites
test_creator.create_sql_generation_tests()
test_creator.create_data_quality_tests()
test_creator.create_safety_tests()
test_creator.create_edge_case_tests()
# Run specific suite
results = test_creator.framework.run_suite(
"sql_generation",
your_llm_function,
api_key="your_key"
)
Available test suites:
- SQL Generation: Test database query generation
- Data Quality: Test data validation capabilities
- Data Analysis: Test analytical reasoning
- Pipeline Analysis: Test code analysis skills
- Regression: Catch model performance regressions
- Edge Cases: Handle unusual inputs
- Domain Expertise: Test specialized knowledge
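To run several of these suites in one pass and compare pass rates, you can loop over them. A minimal sketch: only "sql_generation" is named explicitly above, so the other suite names are assumptions based on the create_* methods, and it assumes run_suite returns results with a boolean 'passed' column (as in the CI/CD example below).
# Sketch: create and run multiple pre-built suites, then summarize pass rates.
# Suite names other than "sql_generation" are assumptions.
from test_suites import AIAgentTestSuites

test_creator = AIAgentTestSuites("your_api_key")
test_creator.create_sql_generation_tests()
test_creator.create_data_quality_tests()
test_creator.create_safety_tests()

for suite_name in ["sql_generation", "data_quality", "safety"]:
    results = test_creator.framework.run_suite(
        suite_name, your_llm_function, api_key="your_key"
    )
    print(f"{suite_name}: {results['passed'].mean():.1%} passed")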
def my_llm_function(prompt, **kwargs):
    """Your custom LLM function"""
    # Process the prompt with your model
    response = your_model.generate(prompt)
    return {
        'response': response.text,
        'tokens': response.token_count,
        'cost': calculate_cost(response.token_count)
    }
# Use with framework
framework.run_test(test_case, my_llm_function, custom_param="value")
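As a concrete variant, the custom function can be backed by the OpenRouter client shown earlier. This is a sketch: it assumes the OpenAI-compatible response shape (choices[0].message.content) and the 'response' / 'tokens' / 'cost' keys expected above.
from openrouter_client import OpenRouterClient, OpenRouterPricing

client = OpenRouterClient(api_key="your_key")
pricing = OpenRouterPricing()

def openrouter_llm_function(prompt, model="openai/gpt-3.5-turbo", **kwargs):
    """Custom LLM function backed by OpenRouter (sketch)."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    tokens = completion.usage.total_tokens
    return {
        'response': completion.choices[0].message.content,  # OpenAI-compatible shape
        'tokens': tokens,
        'cost': pricing.calculate_cost(tokens, model)
    }

framework.run_test(test_case, openrouter_llm_function)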
import os

def ci_test_pipeline():
    """CI/CD pipeline test function"""
    framework = EvaluationFramework(os.getenv("API_KEY"))
    # Run critical tests
    results = framework.run_suite("critical_tests", my_llm_function)
    pass_rate = results['passed'].mean()
    if pass_rate < 0.9:  # 90% threshold
        print("❌ DEPLOYMENT BLOCKED")
        exit(1)
    else:
        print("✅ DEPLOYMENT APPROVED")
class CustomDriftDetector(DriftDetector):
    def detect_business_metric_drift(self, baseline_metrics, current_metrics):
        """Custom drift detection for business metrics"""
        # Your custom logic
        pass
detector = CustomDriftDetector()
- Request Metrics: Volume, success rate, latency percentiles
- Cost Metrics: Token usage, cost per request, total spend
- Quality Metrics: User feedback, test pass rates
- Drift Metrics: Input/output distribution changes
- Error Metrics: Error types and frequencies
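Because these metrics are timestamped, you can compare windows to monitor trends rather than point values. A minimal sketch using get_metrics with different windows; key names beyond success_rate and avg_latency are assumptions.
# Sketch: compare the last hour against a 24h baseline to spot latency trends
last_24h = monitor.get_metrics(hours=24)
last_hour = monitor.get_metrics(hours=1)

if last_hour['avg_latency'] > 1.5 * last_24h['avg_latency']:
    print("Latency is trending up versus the 24h baseline")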
# Get test history
history = framework.get_test_history(hours=24)
# Export to CSV
history.to_csv("test_results.csv", index=False)
# Get production metrics
metrics = monitor.get_metrics(hours=24)
# Custom reporting
from datetime import datetime

report = {
    "timestamp": datetime.now(),
    "test_results": results.to_dict(),
    "production_metrics": metrics,
    "alerts": monitor.get_recent_alerts(24).to_dict()
}
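To persist or ship this report, note that datetime values are not JSON-serializable by default; a simple workaround is to fall back to str():
import json

# default=str handles the datetime timestamp; adjust for your reporting pipeline
with open("daily_report.json", "w") as f:
    json.dump(report, f, default=str, indent=2)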
# OpenRouter API
OPENROUTER_API_KEY=your_key_here
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
# Models
DEFAULT_MODEL=openai/gpt-3.5-turbo
JUDGE_MODEL=openai/gpt-4
EMBEDDINGS_MODEL=all-MiniLM-L6-v2
# Database
TEST_RESULTS_DB=test_results.db
PRODUCTION_METRICS_DB=production_metrics.db
# Thresholds
ALERT_LATENCY_THRESHOLD_MS=2000
ALERT_ERROR_RATE_THRESHOLD=0.05
SIMILARITY_THRESHOLD=0.8
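These values can be exported in your shell or loaded from .env at runtime. A minimal sketch, assuming python-dotenv is installed (the cp .env.template .env step above implies .env-based configuration):
# Sketch: read configuration at runtime; assumes python-dotenv is available,
# otherwise export the variables in your shell instead.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

api_key = os.getenv("OPENROUTER_API_KEY")
default_model = os.getenv("DEFAULT_MODEL", "openai/gpt-3.5-turbo")
similarity_threshold = float(os.getenv("SIMILARITY_THRESHOLD", "0.8"))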
# Custom thresholds
monitor.alert_thresholds.update({
'latency_p95': 1500, # 1.5s
'error_rate': 0.02, # 2%
'cost_per_request': 0.05 # $0.05
})
# Custom pricing
pricing = OpenRouterPricing()
pricing.pricing['custom/model'] = 0.001 / 1000
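With the custom entry registered, cost estimation works the same way as in the OpenRouter client example, assuming the pricing table maps model names to a per-token USD rate (as the 0.001 / 1000 value suggests):
# 1,200 tokens at $0.001 per 1K tokens ≈ $0.0012 (per-token rate assumed)
cost = pricing.calculate_cost(1200, "custom/model")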
The framework automatically generates alerts for:
- High latency (>2s by default)
- High error rates (>5% by default)
- High costs (>$0.10 per request by default)
- Drift detection across inputs/outputs/performance
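Generated alerts can be pulled back for inspection or routed to your own channels. A small sketch using get_recent_alerts, which the custom reporting example above suggests returns a DataFrame-like object:
# Sketch: review alerts from the last 24 hours
alerts = monitor.get_recent_alerts(24)
if len(alerts) > 0:
    print(f"{len(alerts)} alert(s) in the last 24h")
    print(alerts.to_dict())  # same shape used in the reporting example above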
def custom_alert_check(monitor):
    metrics = monitor.get_metrics(1)
    if metrics['avg_user_feedback'] < 3.0:  # Below 3/5 stars
        monitor._create_alert(
            'low_satisfaction',
            'WARNING',
            f'User satisfaction dropped to {metrics["avg_user_feedback"]:.1f}',
            metrics['avg_user_feedback'],
            3.0
        )
- Start with Critical Tests: Focus on core functionality first
- Use Multiple Test Types: Combine deterministic, semantic, and behavioral tests
- Set Appropriate Thresholds: Tune similarity thresholds based on your use case
- Regular Test Updates: Update test cases as your model evolves
- Log Everything: Capture inputs, outputs, latency, and user feedback
- Set Smart Alerts: Avoid alert fatigue with meaningful thresholds
- Monitor Trends: Look at metrics over time, not just point values
- Regular Drift Checks: Run drift detection daily or weekly
- Batch Testing: Run tests in parallel when possible
- Cache Embeddings: Reuse embeddings for semantic comparisons (see the sketch after this list)
- Database Indexing: Ensure proper indexes on timestamp fields
- Cleanup Old Data: Archive old test results and metrics
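For the "Cache Embeddings" item above, here is a minimal cache sketch. It assumes sentence-transformers, which the all-MiniLM-L6-v2 default implies; the framework's own semantic comparison may cache differently.
# Minimal embedding cache sketch (assumes sentence-transformers)
from functools import lru_cache
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")

@lru_cache(maxsize=4096)
def cached_embedding(text: str):
    # encode() returns a numpy array; store as a tuple so results are cacheable
    return tuple(_model.encode(text))
Repeated semantic comparisons against the same expected outputs then reuse cached vectors instead of re-embedding them.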
API Key Issues
# Check if API key is set
python -c "import os; print('API Key:', os.getenv('OPENROUTER_API_KEY', 'NOT SET'))"
# Test API connection
python openrouter_client.py
Database Issues
# Check database permissions
ls -la *.db
# Reset databases
rm *.db
python -c "from testing_framework import EvaluationFramework; EvaluationFramework('test')"
Import Issues
# Check dependencies
pip install -r requirements.txt
# Check Python path
python -c "import sys; print(sys.path)"
import logging
logging.basicConfig(level=logging.DEBUG)
# Detailed test execution
framework.run_test(test_case, my_function, debug=True)
We welcome contributions! Please see our contribution guidelines for details.
- Fork the repository on GitHub
- Clone your fork:
git clone https://github.com/your-username/ai-agent-evals.git
- Create a feature branch:
git checkout -b feature/amazing-feature
- Install dependencies:
pip install -r requirements.txt
- Make your changes and add tests
- Run tests:
python test_imports.py
- Commit changes:
git commit -m 'Add amazing feature'
- Push to branch:
git push origin feature/amazing-feature
- Open a Pull Request on GitHub
# Clone the repository
git clone https://github.com/drc-infinyon/ai-agent-evals.git
cd ai-agent-evals
# Install dependencies
pip install -r requirements.txt
# Run validation
python test_imports.py
# Run demo (requires OpenRouter API key)
python main_demo.py
- Follow PEP 8 guidelines
- Use type hints where appropriate
- Add docstrings for all public functions
- Run black for code formatting
- Add tests for new functionality
This project is licensed under the MIT License - see the LICENSE file for details.
For questions and issues:
- Check the troubleshooting section above
- Review the demo script (main_demo.py) for examples
- Check that your OpenRouter API key has sufficient credits
- Ensure all dependencies are properly installed
- Web dashboard for monitoring
- Integration with MLflow/Weights & Biases
- A/B testing framework
- Multi-model comparison tools
- Advanced anomaly detection
- Custom evaluation metrics
- Automated model retraining triggers
Built for reliable AI systems in production