Monitoring & Observability
This guide covers health check endpoints, Prometheus metrics, structured logging, Grafana dashboard setup, alerting, performance monitoring, and capacity planning for NFYio.
Health Check Endpoints
NFYio services expose HTTP health endpoints for load balancers and orchestration.
Endpoints
| Service | Endpoint | Expected Response |
|---|---|---|
| Gateway | GET /health | {"status":"ok","version":"0.9.0"} |
| Storage Proxy | GET /health | {"status":"ok","backend":"seaweedfs"} |
| Agent Service | GET /health | {"status":"ok","model":"gpt-4o"} |
Example Checks
# Gateway
curl -s http://localhost:3000/health | jq .
# Storage
curl -s http://localhost:7007/health | jq .
# Agent
curl -s http://localhost:7010/health | jq .
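The checks above can be wrapped into a small deploy gate. A minimal sketch, assuming `jq` is installed (as in the examples above); the `is_healthy` and `check_endpoint` helper names are illustrative, and the ports come from the endpoint table:

```shell
#!/bin/sh
# is_healthy: read a health JSON document on stdin; succeed when .status is "ok".
is_healthy() {
  [ "$(jq -r '.status // empty' 2>/dev/null)" = "ok" ]
}

# check_endpoint: fetch a health endpoint and validate its payload,
# e.g. check_endpoint http://localhost:3000/health
check_endpoint() {
  curl -sf "$1" | is_healthy
}
```

A deploy script can loop `check_endpoint` over the three service URLs and abort on the first failure.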
Kubernetes Liveness/Readiness
# In Helm values or Deployment
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
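Slow-starting services (for example, an agent that loads models at boot) can trip the liveness probe before they are ready. A startupProbe holds the other probes off until the first success; the thresholds below are illustrative, not NFYio defaults:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 3000
  failureThreshold: 30   # allow up to 30 × 5s = 150s to start
  periodSeconds: 5
```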
Docker Compose Healthcheck
nfyio-gateway:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
    interval: 10s
    timeout: 5s
    retries: 3
    start_period: 30s
Metrics (Prometheus)
NFYio services expose Prometheus metrics on /metrics.
Scrape Configuration
Add to prometheus.yml:
scrape_configs:
  - job_name: 'nfyio-gateway'
    static_configs:
      - targets: ['nfyio-gateway:3000']
    metrics_path: /metrics
    scrape_interval: 30s
  - job_name: 'nfyio-storage'
    static_configs:
      - targets: ['nfyio-storage:7007']
    metrics_path: /metrics
    scrape_interval: 30s
  - job_name: 'nfyio-agent'
    static_configs:
      - targets: ['nfyio-agent:7010']
    metrics_path: /metrics
    scrape_interval: 30s
Key Metrics
| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total HTTP requests by method, path, status |
| http_request_duration_seconds | Histogram | Request latency (P50, P95, P99) |
| nfyio_storage_operations_total | Counter | S3 operations (Get, Put, List, Delete) |
| nfyio_agent_queries_total | Counter | RAG/LLM queries |
| nfyio_agent_query_duration_seconds | Histogram | Query latency |
| nfyio_embeddings_total | Counter | Embedding API calls |
| process_resident_memory_bytes | Gauge | Memory usage |
| process_cpu_seconds_total | Counter | CPU usage |
Example Queries
# Request rate (requests per second)
rate(http_requests_total{job="nfyio-gateway"}[5m])
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="nfyio-gateway"}[5m]))
# Error rate (5xx)
rate(http_requests_total{job="nfyio-gateway",status=~"5.."}[5m]) / rate(http_requests_total{job="nfyio-gateway"}[5m])
# Storage operations by type
rate(nfyio_storage_operations_total[5m])
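`histogram_quantile` treats each `_bucket` series as a cumulative count per `le` upper bound and linearly interpolates inside the bucket that crosses the target rank. The `histq` helper below is not part of NFYio or Prometheus; it is just a sketch of that interpolation rule in awk, assuming "upper_bound cumulative_count" rows on stdin sorted by bound, with a final `+Inf` row holding the total:

```shell
#!/bin/sh
# histq Q: estimate the Q-quantile from cumulative histogram buckets on stdin.
histq() {
  awk -v q="$1" '
    { le[NR] = $1; cnt[NR] = $2 }
    END {
      rank = q * cnt[NR]              # total count lives in the +Inf bucket
      lo_bound = 0; lo_cnt = 0
      for (i = 1; i <= NR; i++) {
        if (cnt[i] >= rank) {
          if (le[i] == "+Inf") { print le[i-1]; exit }  # beyond the last finite bucket
          # linear interpolation inside the crossing bucket
          print lo_bound + (le[i] - lo_bound) * (rank - lo_cnt) / (cnt[i] - lo_cnt)
          exit
        }
        lo_bound = le[i]; lo_cnt = cnt[i]
      }
    }'
}
```

For example, buckets `0.1→50, 0.5→90, 1→100, +Inf→100` give a P95 of 0.75: the 95th sample falls halfway through the 0.5–1 bucket.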
Logging
Log Format
NFYio uses structured JSON logging:
{
  "timestamp": "2026-03-01T12:00:00.000Z",
  "level": "info",
  "message": "Request completed",
  "method": "GET",
  "path": "/api/buckets",
  "status": 200,
  "duration_ms": 12,
  "request_id": "req_abc123"
}
Log Levels
| Level | Use Case |
|---|---|
| debug | Development, verbose tracing |
| info | Normal operations, request logs |
| warn | Recoverable issues, deprecations |
| error | Failures, exceptions |
Docker Logs
# Follow gateway logs
docker compose logs -f nfyio-gateway
# JSON parsing with jq
docker compose logs nfyio-gateway 2>&1 | jq -r '.message'
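Because the logs are structured JSON, the jq pipeline above extends to ad-hoc queries. A sketch that pulls slow requests (the `slow_requests` name and the 100 ms threshold are illustrative; field names are from the log format above):

```shell
#!/bin/sh
# slow_requests: filter JSON log lines on stdin for requests slower than 100 ms.
slow_requests() {
  jq -r 'select(.duration_ms > 100) | "\(.method) \(.path) \(.duration_ms)ms"'
}
```

Pipe it the same way: `docker compose logs nfyio-gateway 2>&1 | slow_requests`.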
Centralized Logging (Loki / ELK)
Forward logs to Loki, Elasticsearch, or similar:
# Docker logging driver (example for Loki)
nfyio-gateway:
  logging:
    driver: loki
    options:
      loki-url: "http://loki:3100/loki/api/v1/push"
      loki-batch-size: "400"
Or use a sidecar/daemon (Fluent Bit, Filebeat) to ship container logs.
Dashboard Setup (Grafana)
Import NFYio Dashboard
- In Grafana, go to Dashboards → Import
- Upload the NFYio dashboard JSON (from the chart or docs)
- Select Prometheus as the data source
- Click Import
Custom Panels
Create panels for:
Overview
- Request rate (all services)
- Error rate (4xx, 5xx)
- P95 latency
Storage
- S3 operations/sec (Get, Put, List)
- Storage proxy latency
- Bucket object count (if exposed)
Agents
- RAG queries/sec
- Query latency
- Embedding calls/sec
Resources
- CPU and memory per pod/container
- PostgreSQL connections
- Redis memory usage
Example Panel (Request Rate)
- Query: sum(rate(http_requests_total{job=~"nfyio-.*"}[5m])) by (job)
- Visualization: Time series
- Legend: {{job}}
Alerts
PrometheusRule Examples
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nfyio-alerts
  namespace: nfyio
spec:
  groups:
    - name: nfyio
      rules:
        - alert: NfyioGatewayDown
          expr: up{job="nfyio-gateway"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "NFYio Gateway is down"
        - alert: NfyioHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="nfyio-gateway",status=~"5.."}[5m])) /
            sum(rate(http_requests_total{job="nfyio-gateway"}[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Gateway 5xx error rate > 5%"
        - alert: NfyioHighLatency
          expr: |
            histogram_quantile(0.95,
              rate(http_request_duration_seconds_bucket{job="nfyio-gateway"}[5m])
            ) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Gateway P95 latency > 2s"
        - alert: NfyioStorageDown
          expr: up{job="nfyio-storage"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Storage proxy is down"
        - alert: NfyioAgentHighMemory
          expr: |
            container_memory_usage_bytes{container="nfyio-agent"} /
            container_spec_memory_limit_bytes{container="nfyio-agent"} > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Agent memory usage > 90%"
Alertmanager Routing
Route NFYio alerts to your team:
# alertmanager config
route:
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
    - match:
        severity: warning
      receiver: slack-warnings
Performance Monitoring
Key SLOs
| SLO | Target | Measurement |
|---|---|---|
| Availability | 99.9% | up metric over time |
| Latency (P95) | < 500ms | http_request_duration_seconds |
| Error rate | < 0.1% | 5xx / total requests |
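The availability target translates directly into a monthly downtime budget. A sketch of the arithmetic (the `error_budget_minutes` helper is illustrative and assumes a 30-day month):

```shell
#!/bin/sh
# error_budget_minutes SLO: downtime budget in minutes over a 30-day month.
error_budget_minutes() {
  awk -v slo="$1" 'BEGIN { printf "%.1f\n", (1 - slo) * 30 * 24 * 60 }'
}
```

For the 99.9% target above, `error_budget_minutes 0.999` prints 43.2, i.e. roughly 43 minutes of allowed downtime per month.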
Slow Query Detection
For the agent service, alert on slow RAG queries:
histogram_quantile(0.99, rate(nfyio_agent_query_duration_seconds_bucket[5m])) > 10
Database Monitoring
Monitor PostgreSQL and Redis:
# PostgreSQL connections (if exposed)
pg_stat_activity_count
# Redis memory
redis_memory_used_bytes
Capacity Planning
Metrics to Track
| Metric | Action Threshold |
|---|---|
| CPU utilization | Scale when sustained > 70% |
| Memory utilization | Scale when sustained > 80% |
| Disk usage (PostgreSQL, SeaweedFS) | Alert at 80%, critical at 95% |
| Request rate growth | Plan scaling before 2x current peak |
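To turn the "plan before 2x current peak" rule into a date, estimate how long the request rate takes to double at the observed growth rate. A sketch (the `doubling_time_months` helper is illustrative; it assumes compounding growth at a fractional monthly rate):

```shell
#!/bin/sh
# doubling_time_months G: months until load doubles at fractional monthly growth G.
doubling_time_months() {
  awk -v g="$1" 'BEGIN { printf "%.1f\n", log(2) / log(1 + g) }'
}
```

At 10% monthly growth, `doubling_time_months 0.10` prints 7.3, so capacity work should land well inside seven months.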
Scaling Triggers
- Gateway: Scale on CPU or request rate
- Storage: Scale on S3 operation rate or latency
- Agent: Scale on query rate or queue depth
- PostgreSQL: Consider read replicas for high read load
- SeaweedFS: Add volume nodes for storage capacity
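On Kubernetes, the gateway CPU trigger maps naturally onto a HorizontalPodAutoscaler. A sketch with illustrative names and replica bounds (the target Deployment must declare CPU requests for the utilization target to work):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nfyio-gateway
  namespace: nfyio
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nfyio-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # matches the 70% CPU threshold above
```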
Example Capacity Dashboard
Create a Grafana dashboard with:
- Request rate trend (7d)
- Resource utilization (CPU, memory) per service
- Storage usage (PostgreSQL, Redis, SeaweedFS)
- Error budget consumption
What’s Next
- Deploy with Kubernetes — ServiceMonitor and PrometheusRule setup
- Error Handling — API error codes and troubleshooting
- Rate Limits — Throttling and quotas