A production-ready, containerized observability stack template, derived from a live setup, built on Prometheus, Grafana, Traefik, and the OpenTelemetry Collector. It provides metrics collection, visualization, and distributed tracing, with HTTPS/TLS termination and basic authentication handled at the Traefik reverse proxy.
This observability stack consists of the following components:
- Prometheus - Time-series database and monitoring system for metrics collection
- Grafana Enterprise - Advanced visualization and dashboarding platform
- Traefik v3 - Modern reverse proxy with automatic HTTPS/TLS termination
- Node Exporter - System metrics collector for host monitoring
- OpenTelemetry Collector - Unified telemetry data collection and processing
- Loki - Log aggregation system (configured separately)
- Automatic HTTPS/TLS with Let's Encrypt certificate management
- Basic Authentication protection for all services
- Performance optimizations including compression, caching, and HTTP/2
- Secure cookie handling with SameSite and Secure flags
- User/Group isolation with configurable UID/GID
- Health checks for all services with automatic recovery
- Rolling updates with failure rollback capabilities
- Persistent data storage with local volume bindings
- Service dependencies and startup ordering
- Restart policies with exponential backoff
- System metrics via Node Exporter
- Application metrics via Prometheus
- Distributed tracing via OpenTelemetry
- Log aggregation via Loki integration
- Custom dashboards with persistent storage
- Reverse proxy routing with path-based routing
- Overlay networking for service communication
- External access only through Traefik (ports 80/443)
- Internal service discovery via Docker networks
- Docker Engine 20.10+ with Swarm mode enabled
- Docker Compose v2.0+
- Make utility
- Minimum 4GB RAM and 2 CPU cores
- Domain name with DNS pointing to your server (for production)
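The prerequisites above can be checked from a shell before deploying. The following is a minimal sketch, not part of the repository: `version_ge` and `check_prereqs` are hypothetical helper names, and the version comparison assumes a `sort` that supports `-V` (for example GNU coreutils). It does not check RAM, CPU, or DNS.

```shell
# Hypothetical helpers (not part of the repository) for checking prerequisites.

# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumes sort(1) supports -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_prereqs: prints one line per failed requirement, returns non-zero on failure.
check_prereqs() {
    status=0
    for tool in docker make; do
        command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool"; status=1; }
    done
    if command -v docker >/dev/null 2>&1; then
        # Engine version as reported by the Docker daemon.
        engine=$(docker version --format '{{.Server.Version}}' 2>/dev/null)
        version_ge "${engine:-0}" "20.10" ||
            { echo "Docker Engine 20.10+ required (found: ${engine:-unknown})"; status=1; }
        # Swarm state is "active" once `docker swarm init` has been run.
        swarm=$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null)
        [ "$swarm" = "active" ] ||
            { echo "Swarm mode is not active (run: docker swarm init)"; status=1; }
    fi
    return "$status"
}
```

Calling `check_prereqs` after sourcing the snippet prints nothing when every check passes.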
git clone <repository-url>
cd prom-grafana-observability-stack
# Copy environment template
cp .env-example .env
# Edit environment variables
nano .env
Required environment variables:
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=your-secure-password
TRAEFIK_ADMIN_USER=admin
TRAEFIK_ADMIN_PASSWORD=your-secure-password
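A quick check that the four variables listed above are present can catch a half-edited `.env` before deployment. This is a sketch only: `check_env_file` is a hypothetical helper, not something shipped with the project, and it treats the template value `your-secure-password` as an error.

```shell
# Hypothetical helper (not part of the repository): report required keys
# that are missing or empty in an env file, and flag the template password.
check_env_file() {
    env_file=${1:-.env}
    failed=0
    for key in GRAFANA_ADMIN_USER GRAFANA_ADMIN_PASSWORD \
               TRAEFIK_ADMIN_USER TRAEFIK_ADMIN_PASSWORD; do
        # Match "KEY=<at least one character>" at the start of a line.
        if ! grep -Eq "^${key}=.+" "$env_file" 2>/dev/null; then
            echo "missing or empty: $key"
            failed=1
        fi
    done
    # The example value from the template should never reach a deployment.
    if grep -q 'your-secure-password' "$env_file" 2>/dev/null; then
        echo "placeholder password still present"
        failed=1
    fi
    return "$failed"
}
```

Running `check_env_file .env` returns a non-zero status and names each problem when the file is incomplete.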
docker swarm init
# Create Prometheus master token secret
echo "your-prometheus-token" | docker secret create prometheus_master_token -
# Initialize data directories and deploy
make up
# Or step by step:
make init # Create data directories
make up # Deploy the stack
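The `prometheus_master_token` secret in the steps above is created from a literal string. One option is to generate a random value instead. The sketch below assumes `/dev/urandom`, `od`, and `tr` are available; `gen_token` is a hypothetical helper name.

```shell
# Hypothetical helper: print a random 64-character hex token (32 random bytes),
# with no trailing newline.
gen_token() {
    head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
}
```

The output can be piped straight into the existing command, for example `gen_token | docker secret create prometheus_master_token -`. Store a copy somewhere safe if other systems need the same token, since Docker does not reveal secret values after creation.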
The project includes a comprehensive Makefile with the following commands:
make help # Show all available commands
make init # Create necessary data directories
make up # Deploy the stack
make down # Remove the stack
make restart # Restart the stack (down + up)
make ps # List running services
make logs # Show logging instructions
make clean # Clean up Docker resources
make dev # Deploy stack for development
Once deployed, services are accessible via:
Production (through Traefik):
- Traefik Dashboard: https://monitoring.yourdomain.com/
- Grafana: https://monitoring.yourdomain.com/grafana
- Prometheus: https://monitoring.yourdomain.com/prometheus
Local development:
- Traefik Dashboard: http://localhost:8080
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
├── docker-stack.yml # Main Docker Swarm stack definition
├── Makefile # Automation commands
├── .env-example # Environment variables template
├── prometheus/
│   └── prometheus.yml # Prometheus configuration
├── grafana/
│   ├── datasources/ # Grafana datasource configurations
│   └── dashboards/ # Dashboard provisioning
├── traefik/
│   ├── traefik.yml # Traefik static configuration
│   ├── dynamic/ # Dynamic configuration files
│   └── acme.json.example # Let's Encrypt certificate storage
├── otel-collector/
│   └── config.yaml # OpenTelemetry Collector configuration
└── loki/ # Loki log aggregation configuration
- Scrape targets: Configured in prometheus/prometheus.yml
- Retention: Default 15 days (configurable)
- Storage: Persistent volume at /home/tofara/data/prometheus
- Datasources: Auto-provisioned from grafana/datasources/
- Dashboards: Auto-provisioned from grafana/dashboards/
- Plugins: Enterprise features enabled
- Storage: Persistent volume at /home/tofara/data/grafana
- Static config: traefik/traefik.yml
- Dynamic config: traefik/dynamic/ directory
- Certificates: Automatic Let's Encrypt with HTTP challenge
- Middlewares: Compression, caching, authentication
- Receivers: OTLP, Prometheus, Jaeger
- Processors: Batch processing, resource detection
- Exporters: Prometheus, Jaeger, logging
- Storage: Persistent volume at /home/tofara/data/otel-collector
- Add service definition to docker-stack.yml
- Configure Traefik labels for routing
- Add health checks and restart policies
- Update network configuration
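To make the steps above concrete, the sketch below writes an example service fragment to a scratch file so it can be reviewed and pasted under `services:` in docker-stack.yml. Everything in it is a placeholder rather than something taken from this repository: the service name `myapp`, the image, port 8080, the `/myapp` path, the `monitoring` network, the `websecure` entrypoint, and the `wget`-based health check (which assumes the image ships `wget`). In Swarm mode Traefik reads labels from `deploy.labels` and needs the service port declared explicitly.

```shell
# Write a hypothetical service fragment to a scratch file for review.
# All names below (myapp, image, port, path, network, entrypoint) are placeholders.
write_service_fragment() {
    out=${1:-myapp-service.yml}
    cat > "$out" <<'EOF'
  myapp:
    image: example/myapp:latest
    networks:
      - monitoring
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      restart_policy:
        condition: on-failure
        delay: 5s
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.myapp.rule=PathPrefix(`/myapp`)"
        - "traefik.http.routers.myapp.entrypoints=websecure"
        - "traefik.http.services.myapp.loadbalancer.server.port=8080"
EOF
}
```

Compare the network and entrypoint names with the ones already used in docker-stack.yml and traefik/traefik.yml before merging the fragment.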
- Place JSON dashboard files in grafana/dashboards/json/
- Configure dashboard provider in grafana/dashboards/
- Restart Grafana service
- Update domain names in Traefik labels
- Configure DNS to point to your server
- Ensure ports 80/443 are accessible
- Let's Encrypt will automatically provision certificates
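Traefik expects its ACME storage file to be readable and writable by the owner only (mode 600). The sketch below assumes the storage file is traefik/acme.json and that it is meant to be created from traefik/acme.json.example; neither is confirmed here, so check the storage path in traefik/traefik.yml first. `prepare_acme` is a hypothetical helper name.

```shell
# Hypothetical helper: create the ACME storage file from the example if it
# does not exist yet, then restrict it to owner read/write.
# Assumes the storage file lives at <dir>/acme.json.
prepare_acme() {
    dir=${1:-traefik}
    [ -f "$dir/acme.json" ] || cp "$dir/acme.json.example" "$dir/acme.json" || return 1
    chmod 600 "$dir/acme.json"
}
```

Run `prepare_acme traefik` from the repository root before the first deployment. An existing acme.json is left in place and only its permissions are tightened.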
# Check service status
make ps
# View service logs
docker service logs monitoring_prometheus
docker service logs monitoring_grafana
docker service logs monitoring_traefik
All services include comprehensive health checks:
- Prometheus: /prometheus/-/healthy
- Grafana: /api/health
- Traefik: /metrics
- Node Exporter: /metrics
- OpenTelemetry: Health check endpoint
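The endpoints listed above can also be probed by hand with `curl`. The helper below is a sketch, not part of the repository: `probe` is a hypothetical name, and the host, port, and any path prefix in the URL you pass depend on how the stack is exposed in your environment.

```shell
# Hypothetical helper: probe a health endpoint and print "OK: name" or "FAIL: name".
# -f makes curl exit non-zero on HTTP status >= 400.
# The optional third argument is user:password for basic authentication.
probe() {
    name=$1; url=$2; creds=${3:-}
    if curl -fsS -o /dev/null --max-time 5 ${creds:+-u "$creds"} "$url"; then
        echo "OK: $name"
    else
        echo "FAIL: $name"
        return 1
    fi
}
```

For example, `probe prometheus http://localhost:9090/prometheus/-/healthy` checks a local Prometheus, assuming it is served on that port under the `/prometheus` prefix. For endpoints behind Traefik, pass the basic-auth credentials as the third argument.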
- Permission errors: Ensure data directories have correct ownership
- Certificate issues: Verify DNS configuration and port accessibility
- Service startup: Check Docker Swarm status and resource availability
- Network connectivity: Verify overlay network configuration
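For the permission errors mentioned above, comparing the numeric owner of each data directory with the UID the services run as is a quick first check. The sketch below uses a hypothetical helper, `check_owner`; the UID to expect is whatever your deployment is configured with.

```shell
# Hypothetical helper: compare a directory's numeric owner with an expected UID.
# Uses `ls -nd` (numeric IDs, the directory itself) to avoid stat(1) differences
# between platforms.
check_owner() {
    dir=$1; want_uid=$2
    have_uid=$(ls -nd "$dir" 2>/dev/null | awk '{print $3}')
    if [ "$have_uid" = "$want_uid" ]; then
        echo "ok: $dir owned by UID $want_uid"
    else
        echo "mismatch: $dir owned by UID ${have_uid:-unknown}, expected $want_uid"
        return 1
    fi
}
```

For example, `check_owner /home/tofara/data/prometheus "$(id -u)"` reports whether the Prometheus data directory belongs to the current user. A mismatch is usually fixed with `chown -R` to the configured UID and GID.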
- System metrics: CPU, memory, disk, network via Node Exporter
- Container metrics: Docker container statistics
- Application metrics: Custom application metrics via Prometheus
- Traefik metrics: Request rates, response times, error rates
- Node Exporter system metrics
- Docker container monitoring
- Traefik performance metrics
- Custom application dashboards
- All services run with non-root users
- Basic authentication on all external endpoints
- HTTPS/TLS encryption for all external traffic
- Secure cookie configuration
- Network isolation via Docker overlay networks
- Regular security updates via automated image pulls
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For issues and questions:
- Check the troubleshooting section
- Review service logs
- Open an issue with detailed information
- Include environment details and error messages