AI Agent Evaluation Framework

CI/CD Pipeline · Python 3.8+ · License: MIT · OpenRouter · Code style: black

A comprehensive vanilla evaluation, testing, and monitoring system for AI agents using OpenRouter as the LLM provider.

Overview

This framework provides enterprise-grade tools for:

  • Comprehensive Testing: Deterministic, semantic, behavioral, safety, and performance tests
  • LLM-as-Judge Evaluation: Use advanced models to evaluate other model outputs
  • Production Monitoring: Real-time metrics, alerting, and drift detection
  • Drift Detection: Statistical and semantic drift detection across inputs, outputs, and performance
  • Multi-Model Support: Works with any model available through OpenRouter

Quick Start

1. Installation

# Clone or download the framework
cd ai-agent-evals

# Install dependencies
pip install -r requirements.txt

# Set up environment
cp .env.template .env
# Edit .env with your OpenRouter API key

2. Configuration

Edit your .env file:

# Required
OPENROUTER_API_KEY=your_openrouter_api_key_here

# Optional (defaults provided)
DEFAULT_MODEL=openai/gpt-3.5-turbo
JUDGE_MODEL=openai/gpt-4
SIMILARITY_THRESHOLD=0.8

3. Run the Demo

python main_demo.py

This will run a comprehensive demonstration of all features.

Core Components

OpenRouter Client (openrouter_client.py)

OpenAI-compatible client for OpenRouter API with built-in pricing calculation.

from openrouter_client import OpenRouterClient, OpenRouterPricing

client = OpenRouterClient(api_key="your_key")
pricing = OpenRouterPricing()

response = client.chat.completions.create(
    model="openai/gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)

cost = pricing.calculate_cost(response.usage.total_tokens, "openai/gpt-3.5-turbo")
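
Because the client is OpenAI-compatible, the reply text is read the same way as with the OpenAI SDK; a short follow-up to the snippet above:

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}, estimated cost: ${cost:.6f}")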

Evaluation Framework (testing_framework.py)

Core testing engine with multiple evaluation strategies.

from testing_framework import EvaluationFramework, TestCase, TestType

framework = EvaluationFramework("your_api_key")

# Create a test case
test = TestCase(
    id="test_sql_basic",
    type=TestType.SEMANTIC,
    input="Write SQL to select all users",
    expected="SELECT * FROM users;",
    metadata={"similarity_threshold": 0.8}
)

# Run single test
result = framework.run_test(test, your_llm_function, api_key="your_key")

# Create test suite and run
framework.create_test_suite("my_tests", [test])
results = framework.run_suite("my_tests", your_llm_function, api_key="your_key")
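
The CI example later in this README treats suite results as a pandas DataFrame with a boolean passed column; under that assumption (the id column is also an assumption), a run can be summarized like this:

# Summarize a suite run; assumes `results` is a pandas DataFrame with a
# boolean 'passed' column (as in the CI example) and a per-test 'id' column
pass_rate = results['passed'].mean()
print(f"Pass rate: {pass_rate:.1%}")

failed = results[~results['passed']]
print("Failed tests:", failed['id'].tolist())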

Test Types

1. Deterministic Tests

Exact string matching for precise outputs.

TestCase(
    id="exact_match",
    type=TestType.DETERMINISTIC,
    input="What is 2+2?",
    expected="4"
)

2. Semantic Tests

Meaning-based comparison using embeddings.

TestCase(
    id="semantic_test",
    type=TestType.SEMANTIC,
    input="Explain SQL",
    expected="SQL is a language for managing databases",
    metadata={"similarity_threshold": 0.75}
)

3. Behavioral Tests

Check if outputs meet behavioral constraints.

TestCase(
    id="sql_safety",
    type=TestType.BEHAVIORAL,
    input="Write a database query",
    expected=None,
    constraints={
        "must_include": ["SELECT", "FROM"],
        "must_exclude": ["DELETE", "DROP"],
        "format": "sql"
    }
)

4. Safety Tests

Detect harmful content, PII, and security issues.

TestCase(
    id="safety_check",
    type=TestType.SAFETY,
    input="User input with potential issues",
    expected=None,
    metadata={"check_harmful": True}
)

5. Performance Tests

Measure latency and cost constraints.

TestCase(
    id="performance_test",
    type=TestType.PERFORMANCE,
    input="Complex query requiring fast response",
    expected=None,
    constraints={"max_latency_ms": 2000, "max_tokens": 500}
)

Production Monitoring

Basic Monitoring

from production_monitoring import ProductionMonitor

monitor = ProductionMonitor("production.db")

# Log requests
monitor.log_request(
    request_id="req_123",
    input_text="User question",
    output_text="AI response", 
    latency_ms=450,
    tokens_used=150,
    model="openai/gpt-3.5-turbo",
    success=True
)

# Get metrics
metrics = monitor.get_metrics(hours=24)
print(f"Success rate: {metrics['success_rate']:.1%}")
print(f"Average latency: {metrics['avg_latency']:.0f}ms")

Drift Detection

from production_monitoring import DriftDetector

detector = DriftDetector()

# Detect input drift
baseline_inputs = ["Normal queries from last week"]
current_inputs = ["Recent queries"]

has_drift, score, details = detector.detect_input_drift(
    baseline_inputs, 
    current_inputs,
    method='embedding'
)

if has_drift:
    alert = detector.create_drift_alert(
        'input', 'query_distribution', 1.0, 1.0 + score, details
    )
    print(f"⚠️ {alert.severity} drift detected: {alert.action_required}")

LLM-as-Judge

Use advanced models to evaluate other model outputs:

from test_suites import LLMJudgeEvaluator

judge = LLMJudgeEvaluator("your_api_key")

evaluation = judge.evaluate_response(
    prompt="Explain machine learning",
    response="ML is a subset of AI that learns from data...",
    criteria={
        "accuracy": "Is the information correct?",
        "clarity": "Is it easy to understand?",
        "completeness": "Does it fully answer the question?"
    }
)

print(f"Overall score: {evaluation['overall_score']}")

Test Suites

Pre-built test suites for common use cases:

from test_suites import AIAgentTestSuites

test_creator = AIAgentTestSuites("your_api_key")

# Create domain-specific test suites
test_creator.create_sql_generation_tests()
test_creator.create_data_quality_tests() 
test_creator.create_safety_tests()
test_creator.create_edge_case_tests()

# Run specific suite
results = test_creator.framework.run_suite(
    "sql_generation",
    your_llm_function,
    api_key="your_key"
)

Available test suites:

  • SQL Generation: Test database query generation
  • Data Quality: Test data validation capabilities
  • Data Analysis: Test analytical reasoning
  • Pipeline Analysis: Test code analysis skills
  • Regression: Catch model performance regressions
  • Edge Cases: Handle unusual inputs
  • Domain Expertise: Test specialized knowledge

Advanced Usage

Custom Test Functions

def my_llm_function(prompt, **kwargs):
    """Your custom LLM function"""
    # Process prompt with your model
    response = your_model.generate(prompt)
    
    return {
        'response': response.text,
        'tokens': response.token_count,
        'cost': calculate_cost(response.token_count)
    }

# Use with framework
framework.run_test(test_case, my_llm_function, custom_param="value")

Continuous Integration

import os
from testing_framework import EvaluationFramework

def ci_test_pipeline():
    """CI/CD pipeline test function"""
    framework = EvaluationFramework(os.getenv("API_KEY"))

    # Run critical tests
    results = framework.run_suite("critical_tests", my_llm_function)
    pass_rate = results['passed'].mean()

    if pass_rate < 0.9:  # 90% threshold
        print("❌ DEPLOYMENT BLOCKED")
        exit(1)
    else:
        print("✅ DEPLOYMENT APPROVED")

Custom Drift Detection

class CustomDriftDetector(DriftDetector):
    def detect_business_metric_drift(self, baseline_metrics, current_metrics):
        """Custom drift detection for business metrics"""
        # Your custom logic
        pass

detector = CustomDriftDetector()
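
One way to fill in the custom logic above is to flag any business metric whose relative change against the baseline exceeds a threshold. The metric names and the 20% threshold below are illustrative, not part of the framework:

class RelativeChangeDriftDetector(DriftDetector):
    def detect_business_metric_drift(self, baseline_metrics, current_metrics,
                                     threshold=0.20):
        """Flag metrics whose relative change exceeds `threshold` (sketch)."""
        drifted = {}
        for name, baseline in baseline_metrics.items():
            current = current_metrics.get(name)
            if current is None or baseline == 0:
                continue
            change = abs(current - baseline) / abs(baseline)
            if change > threshold:
                drifted[name] = {"baseline": baseline, "current": current,
                                 "relative_change": change}
        return bool(drifted), drifted

detector = RelativeChangeDriftDetector()
has_drift, details = detector.detect_business_metric_drift(
    {"avg_user_feedback": 4.2}, {"avg_user_feedback": 3.1}
)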

Metrics & Reporting

Available Metrics

  • Request Metrics: Volume, success rate, latency percentiles
  • Cost Metrics: Token usage, cost per request, total spend
  • Quality Metrics: User feedback, test pass rates
  • Drift Metrics: Input/output distribution changes
  • Error Metrics: Error types and frequencies

Exporting Data

# Get test history
history = framework.get_test_history(hours=24)

# Export to CSV
history.to_csv("test_results.csv", index=False)

# Get production metrics  
metrics = monitor.get_metrics(hours=24)

# Custom reporting
from datetime import datetime

report = {
    "timestamp": datetime.now(),
    "test_results": results.to_dict(),
    "production_metrics": metrics,
    "alerts": monitor.get_recent_alerts(24).to_dict()
}
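
To persist that report, the standard library is enough; default=str takes care of the datetime and any other non-JSON values:

import json

with open("daily_report.json", "w") as f:
    json.dump(report, f, indent=2, default=str)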

Configuration

Environment Variables

# OpenRouter API
OPENROUTER_API_KEY=your_key_here
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1

# Models  
DEFAULT_MODEL=openai/gpt-3.5-turbo
JUDGE_MODEL=openai/gpt-4
EMBEDDINGS_MODEL=all-MiniLM-L6-v2

# Database
TEST_RESULTS_DB=test_results.db
PRODUCTION_METRICS_DB=production_metrics.db

# Thresholds
ALERT_LATENCY_THRESHOLD_MS=2000
ALERT_ERROR_RATE_THRESHOLD=0.05
SIMILARITY_THRESHOLD=0.8
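
These variables can be read with the standard library alone; a minimal sketch whose defaults mirror the values shown above:

import os

OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")  # required
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "openai/gpt-3.5-turbo")
JUDGE_MODEL = os.getenv("JUDGE_MODEL", "openai/gpt-4")
SIMILARITY_THRESHOLD = float(os.getenv("SIMILARITY_THRESHOLD", "0.8"))

if not OPENROUTER_API_KEY:
    raise RuntimeError("OPENROUTER_API_KEY is not set")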

Programmatic Configuration

# Custom thresholds
monitor.alert_thresholds.update({
    'latency_p95': 1500,  # 1.5s
    'error_rate': 0.02,   # 2%
    'cost_per_request': 0.05  # $0.05
})

# Custom pricing
pricing = OpenRouterPricing()
pricing.pricing['custom/model'] = 0.001 / 1000

Alerting & Monitoring

Real-time Alerts

The framework automatically generates alerts for:

  • High latency (>2s by default)
  • High error rates (>5% by default)
  • High costs (>$0.10 per request by default)
  • Drift detection across inputs/outputs/performance
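
Alerts recorded by the monitor can be pulled back out for display or forwarding. A small polling sketch, assuming get_recent_alerts (used in the reporting example above) returns a pandas DataFrame with severity and message columns (both assumptions, not documented API):

recent = monitor.get_recent_alerts(24)
for _, alert in recent.iterrows():
    # 'severity' and 'message' column names are assumptions
    print(f"[{alert['severity']}] {alert['message']}")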

Custom Alerts

def custom_alert_check(monitor):
    metrics = monitor.get_metrics(1)
    
    if metrics['avg_user_feedback'] < 3.0:  # Below 3/5 stars
        monitor._create_alert(
            'low_satisfaction',
            'WARNING', 
            f'User satisfaction dropped to {metrics["avg_user_feedback"]:.1f}',
            metrics['avg_user_feedback'],
            3.0
        )
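
A check like this can be run on a schedule. The simplest version is a small loop next to your service; a cron job or CI scheduler works just as well:

import time

while True:
    custom_alert_check(monitor)
    time.sleep(300)  # re-evaluate every 5 minutes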

Best Practices

Test Design

  1. Start with Critical Tests: Focus on core functionality first
  2. Use Multiple Test Types: Combine deterministic, semantic, and behavioral tests (a combined example follows this list)
  3. Set Appropriate Thresholds: Tune similarity thresholds based on your use case
  4. Regular Test Updates: Update test cases as your model evolves
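
As a concrete example of mixing test types, a small suite assembled from the TestCase API shown earlier (ids and thresholds are illustrative):

mixed_suite = [
    TestCase(id="math_exact", type=TestType.DETERMINISTIC,
             input="What is 2+2?", expected="4"),
    TestCase(id="sql_meaning", type=TestType.SEMANTIC,
             input="Explain SQL",
             expected="SQL is a language for managing databases",
             metadata={"similarity_threshold": 0.75}),
    TestCase(id="sql_constraints", type=TestType.BEHAVIORAL,
             input="Write a database query", expected=None,
             constraints={"must_include": ["SELECT"], "must_exclude": ["DROP"]}),
]

framework.create_test_suite("mixed_critical", mixed_suite)
results = framework.run_suite("mixed_critical", your_llm_function, api_key="your_key")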

Production Monitoring

  1. Log Everything: Capture inputs, outputs, latency, and user feedback
  2. Set Smart Alerts: Avoid alert fatigue with meaningful thresholds
  3. Monitor Trends: Look at metrics over time, not just point values
  4. Regular Drift Checks: Run drift detection daily or weekly

Performance Optimization

  1. Batch Testing: Run tests in parallel when possible
  2. Cache Embeddings: Reuse embeddings for semantic comparisons (see the caching sketch after this list)
  3. Database Indexing: Ensure proper indexes on timestamp fields
  4. Cleanup Old Data: Archive old test results and metrics
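
A minimal embedding cache could look like the sketch below. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model referenced in the configuration, not any caching built into this framework:

from functools import lru_cache
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")

@lru_cache(maxsize=4096)
def cached_embedding(text: str):
    """Embed text once and reuse the vector on repeated comparisons."""
    return _model.encode(text)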

Troubleshooting

Common Issues

API Key Issues

# Check if API key is set
python -c "import os; print('API Key:', os.getenv('OPENROUTER_API_KEY', 'NOT SET'))"

# Test API connection
python openrouter_client.py

Database Issues

# Check database permissions
ls -la *.db

# Reset databases
rm *.db
python -c "from testing_framework import EvaluationFramework; EvaluationFramework('test')"

Import Issues

# Check dependencies
pip install -r requirements.txt

# Check Python path
python -c "import sys; print(sys.path)"

Debug Mode

import logging
logging.basicConfig(level=logging.DEBUG)

# Detailed test execution
framework.run_test(test_case, my_function, debug=True)

Contributing

We welcome contributions! Please see our contribution guidelines for details.

Quick Start for Contributors

  1. Fork the repository on GitHub
  2. Clone your fork: git clone https://github.com/your-username/ai-agent-evals.git
  3. Create a feature branch: git checkout -b feature/amazing-feature
  4. Install dependencies: pip install -r requirements.txt
  5. Make your changes and add tests
  6. Run tests: python test_imports.py
  7. Commit changes: git commit -m 'Add amazing feature'
  8. Push to branch: git push origin feature/amazing-feature
  9. Open a Pull Request on GitHub

Development Setup

# Clone the repository
git clone https://github.com/drc-infinyon/ai-agent-evals.git
cd ai-agent-evals

# Install dependencies
pip install -r requirements.txt

# Run validation
python test_imports.py

# Run demo (requires OpenRouter API key)
python main_demo.py

Code Style

  • Follow PEP 8 guidelines
  • Use type hints where appropriate
  • Add docstrings for all public functions
  • Run black for code formatting
  • Add tests for new functionality

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For questions and issues:

  1. Check the troubleshooting section above
  2. Review the demo script (main_demo.py) for examples
  3. Check that your OpenRouter API key has sufficient credits
  4. Ensure all dependencies are properly installed

Roadmap

  • Web dashboard for monitoring
  • Integration with MLflow/Weights & Biases
  • A/B testing framework
  • Multi-model comparison tools
  • Advanced anomaly detection
  • Custom evaluation metrics
  • Automated model retraining triggers

Built for reliable AI systems in production
