Monitoring & Observability

Health checks, Prometheus metrics, logging, Grafana dashboards, alerts, and capacity planning for NFYio.

This guide covers health check endpoints, Prometheus metrics, logging, Grafana dashboard setup, alerting, performance monitoring, and capacity planning for NFYio.

Health Check Endpoints

NFYio services expose HTTP health endpoints for load balancers and orchestration.

Endpoints

Service        | Endpoint     | Expected Response
Gateway        | GET /health  | {"status":"ok","version":"0.9.0"}
Storage Proxy  | GET /health  | {"status":"ok","backend":"seaweedfs"}
Agent Service  | GET /health  | {"status":"ok","model":"gpt-4o"}

Example Checks

# Gateway
curl -s http://localhost:3000/health | jq .

# Storage
curl -s http://localhost:7007/health | jq .

# Agent
curl -s http://localhost:7010/health | jq .

Kubernetes Liveness/Readiness

# In Helm values or Deployment
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5

Docker Compose Healthcheck

nfyio-gateway:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
    interval: 10s
    timeout: 5s
    retries: 3
    start_period: 30s

Metrics (Prometheus)

NFYio services expose Prometheus metrics on /metrics.

Scrape Configuration

Add to prometheus.yml:

scrape_configs:
  - job_name: 'nfyio-gateway'
    static_configs:
      - targets: ['nfyio-gateway:3000']
    metrics_path: /metrics
    scrape_interval: 30s

  - job_name: 'nfyio-storage'
    static_configs:
      - targets: ['nfyio-storage:7007']
    metrics_path: /metrics
    scrape_interval: 30s

  - job_name: 'nfyio-agent'
    static_configs:
      - targets: ['nfyio-agent:7010']
    metrics_path: /metrics
    scrape_interval: 30s
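
If you run the Prometheus Operator, a ServiceMonitor can replace the static scrape config above. A minimal sketch, assuming the NFYio Services carry an app.kubernetes.io/part-of: nfyio label and name their HTTP port http (both assumptions to adjust for your deployment):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nfyio
  namespace: nfyio
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: nfyio   # assumed label on the NFYio Services
  endpoints:
    - port: http          # assumed name of the Service port exposing the HTTP API
      path: /metrics
      interval: 30s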

Key Metrics

Metric                              | Type      | Description
http_requests_total                 | Counter   | Total HTTP requests by method, path, status
http_request_duration_seconds       | Histogram | Request latency (P50, P95, P99)
nfyio_storage_operations_total      | Counter   | S3 operations (Get, Put, List, Delete)
nfyio_agent_queries_total           | Counter   | RAG/LLM queries
nfyio_agent_query_duration_seconds  | Histogram | Query latency
nfyio_embeddings_total              | Counter   | Embedding API calls
process_resident_memory_bytes       | Gauge     | Memory usage
process_cpu_seconds_total           | Counter   | CPU usage
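
To verify what a running service actually exposes, curl its /metrics endpoint directly (same ports as the health check examples above):

# List exposed NFYio and HTTP metrics from the gateway
curl -s http://localhost:3000/metrics | grep -E '^(http_|nfyio_|process_)' | head -20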

Example Queries

# Request rate (requests per second)
rate(http_requests_total{job="nfyio-gateway"}[5m])

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="nfyio-gateway"}[5m]))

# Error rate (5xx)
sum(rate(http_requests_total{job="nfyio-gateway",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="nfyio-gateway"}[5m]))

# Storage operations by type
rate(nfyio_storage_operations_total[5m])

Logging

Log Format

NFYio uses structured JSON logging:

{
  "timestamp": "2026-03-01T12:00:00.000Z",
  "level": "info",
  "message": "Request completed",
  "method": "GET",
  "path": "/api/buckets",
  "status": 200,
  "duration_ms": 12,
  "request_id": "req_abc123"
}

Log Levels

Level | Use Case
debug | Development, verbose tracing
info  | Normal operations, request logs
warn  | Recoverable issues, deprecations
error | Failures, exceptions

Docker Logs

# Follow gateway logs
docker compose logs -f nfyio-gateway

# JSON parsing with jq
docker compose logs --no-log-prefix nfyio-gateway | jq -r '.message'
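
You can also filter on the structured fields shown in the log format above, for example:

# Show only error-level entries
docker compose logs --no-log-prefix nfyio-gateway | jq 'select(.level == "error")'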

Centralized Logging (Loki / ELK)

Forward logs to Loki, Elasticsearch, or similar:

# Docker logging driver (example for Loki)
nfyio-gateway:
  logging:
    driver: loki
    options:
      loki-url: "http://loki:3100/loki/api/v1/push"
      loki-batch-size: "400"

Or use a sidecar/daemon (Fluent Bit, Filebeat) to ship container logs.
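
For the Loki route, Promtail is another common shipper alongside the driver shown above; a minimal single-host sketch that tails Docker's JSON log files (the Loki URL and log path are assumptions):

# promtail-config.yaml
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push   # assumed Loki endpoint
scrape_configs:
  - job_name: containers
    static_configs:
      - targets: [localhost]
        labels:
          job: nfyio-containers
          __path__: /var/lib/docker/containers/*/*.log   # default Docker log path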

Dashboard Setup (Grafana)

Import NFYio Dashboard

  1. In Grafana, go to Dashboards → Import
  2. Upload the NFYio dashboard JSON (from the chart or docs)
  3. Select Prometheus as the data source
  4. Import

Custom Panels

Create panels for:

Overview

  • Request rate (all services)
  • Error rate (4xx, 5xx)
  • P95 latency

Storage

  • S3 operations/sec (Get, Put, List)
  • Storage proxy latency
  • Bucket object count (if exposed)

Agents

  • RAG queries/sec
  • Query latency
  • Embedding calls/sec

Resources

  • CPU and memory per pod/container
  • PostgreSQL connections
  • Redis memory usage

Example Panel (Request Rate)

  • Query: sum(rate(http_requests_total{job=~"nfyio-.*"}[5m])) by (job)
  • Visualization: Time series
  • Legend: {{job}}
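
A matching panel for the error-rate item in the Overview list can reuse the same labels; a sketch:

sum(rate(http_requests_total{job=~"nfyio-.*",status=~"5.."}[5m])) by (job) /
sum(rate(http_requests_total{job=~"nfyio-.*"}[5m])) by (job)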

Alerts

PrometheusRule Examples

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nfyio-alerts
  namespace: nfyio
spec:
  groups:
    - name: nfyio
      rules:
        - alert: NfyioGatewayDown
          expr: up{job="nfyio-gateway"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "NFYio Gateway is down"

        - alert: NfyioHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="nfyio-gateway",status=~"5.."}[5m])) /
            sum(rate(http_requests_total{job="nfyio-gateway"}[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Gateway 5xx error rate > 5%"

        - alert: NfyioHighLatency
          expr: |
            histogram_quantile(0.95,
              rate(http_request_duration_seconds_bucket{job="nfyio-gateway"}[5m])
            ) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Gateway P95 latency > 2s"

        - alert: NfyioStorageDown
          expr: up{job="nfyio-storage"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Storage proxy is down"

        - alert: NfyioAgentHighMemory
          expr: |
            container_memory_usage_bytes{container="nfyio-agent"} /
            container_spec_memory_limit_bytes{container="nfyio-agent"} > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Agent memory usage > 90%"

Alertmanager Routing

Route NFYio alerts to your team:

# alertmanager config
route:
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
    - match:
        severity: warning
      receiver: slack-warnings
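
The receivers referenced above must be defined in the same Alertmanager config; a minimal sketch with placeholder keys and an assumed channel name:

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"   # placeholder
  - name: slack-warnings
    slack_configs:
      - api_url: "https://hooks.slack.com/services/<webhook-path>"   # placeholder
        channel: "#nfyio-alerts"                                     # assumed channel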

Performance Monitoring

Key SLOs

SLO           | Target  | Measurement
Availability  | 99.9%   | up metric over time
Latency (P95) | < 500ms | http_request_duration_seconds
Error rate    | < 0.1%  | 5xx / total requests
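
Availability against the 99.9% target can be approximated from the up metric; for example, over a 30-day window:

# Fraction of scrapes where the gateway was up over the last 30 days
avg_over_time(up{job="nfyio-gateway"}[30d])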

Slow Query Detection

For the agent service, alert on slow RAG queries:

histogram_quantile(0.99, rate(nfyio_agent_query_duration_seconds_bucket[5m])) > 10

Database Monitoring

Monitor PostgreSQL and Redis:

# PostgreSQL connections (if exposed)
pg_stat_activity_count

# Redis memory
redis_memory_used_bytes
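
These metrics are typically exposed by postgres_exporter and redis_exporter rather than by NFYio itself; a scrape-config sketch using the exporters' default ports (hostnames are assumptions):

scrape_configs:
  - job_name: 'nfyio-postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']   # assumed hostname; 9187 is postgres_exporter's default
  - job_name: 'nfyio-redis'
    static_configs:
      - targets: ['redis-exporter:9121']      # assumed hostname; 9121 is redis_exporter's default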

Capacity Planning

Metrics to Track

Metric                              | Action Threshold
CPU utilization                     | Scale when sustained > 70%
Memory utilization                  | Scale when sustained > 80%
Disk usage (PostgreSQL, SeaweedFS)  | Alert at 80%, critical at 95%
Request rate growth                 | Plan scaling before 2x current peak
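
If node_exporter is scraped, the disk thresholds above can be expressed in PromQL; a sketch (the mountpoint is an assumption):

# Used fraction of the data filesystem; alert when > 0.8, critical when > 0.95
1 - (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"})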

Scaling Triggers

  • Gateway: Scale on CPU or request rate (see the HPA sketch after this list)
  • Storage: Scale on S3 operation rate or latency
  • Agent: Scale on query rate or queue depth
  • PostgreSQL: Consider read replicas for high read load
  • SeaweedFS: Add volume nodes for storage capacity
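
On Kubernetes, the Gateway trigger maps naturally to a HorizontalPodAutoscaler; a sketch using the 70% CPU threshold from the table above (the Deployment name and replica bounds are assumptions):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nfyio-gateway
  namespace: nfyio
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nfyio-gateway   # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70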

Example Capacity Dashboard

Create a Grafana dashboard with:

  • Request rate trend (7d)
  • Resource utilization (CPU, memory) per service
  • Storage usage (PostgreSQL, Redis, SeaweedFS)
  • Error budget consumption

What’s Next