Monitoring & Observability

Health checks, Prometheus metrics, logging, Grafana dashboards, alerts, and capacity planning for NFYio.

This guide covers health check endpoints, Prometheus metrics, logging, Grafana dashboard setup, alerting, performance monitoring, and capacity planning for NFYio.

Health Check Endpoints

NFYio services expose HTTP health endpoints for load balancers and orchestration.

Endpoints

Service        | Endpoint     | Expected Response
Gateway        | GET /health  | {"status":"ok","version":"0.9.0"}
Storage Proxy  | GET /health  | {"status":"ok","backend":"seaweedfs"}
Agent Service  | GET /health  | {"status":"ok","model":"gpt-4o"}

Example Checks

# Gateway
curl -s http://localhost:3000/health | jq .

# Storage
curl -s http://localhost:7007/health | jq .

# Agent
curl -s http://localhost:7010/health | jq .

Kubernetes Liveness/Readiness

# In Helm values or Deployment
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5

Docker Compose Healthcheck

nfyio-gateway:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
    interval: 10s
    timeout: 5s
    retries: 3
    start_period: 30s

Metrics (Prometheus)

NFYio services expose Prometheus metrics on /metrics.

Scrape Configuration

Add to prometheus.yml:

scrape_configs:
  - job_name: 'nfyio-gateway'
    static_configs:
      - targets: ['nfyio-gateway:3000']
    metrics_path: /metrics
    scrape_interval: 30s

  - job_name: 'nfyio-storage'
    static_configs:
      - targets: ['nfyio-storage:7007']
    metrics_path: /metrics
    scrape_interval: 30s

  - job_name: 'nfyio-agent'
    static_configs:
      - targets: ['nfyio-agent:7010']
    metrics_path: /metrics
    scrape_interval: 30s
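
If you run the Prometheus Operator, a ServiceMonitor can replace the static scrape config above. A minimal sketch, assuming the NFYio Services carry an app.kubernetes.io/part-of: nfyio label and name their HTTP port http (both assumptions to adjust for your deployment):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nfyio
  namespace: nfyio
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: nfyio   # assumed label on the NFYio Services
  endpoints:
    - port: http          # assumed name of the Service port exposing the HTTP API
      path: /metrics
      interval: 30s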

Key Metrics

Metric                              | Type      | Description
http_requests_total                 | Counter   | Total HTTP requests by method, path, status
http_request_duration_seconds       | Histogram | Request latency (P50, P95, P99)
nfyio_storage_operations_total      | Counter   | S3 operations (Get, Put, List, Delete)
nfyio_agent_queries_total           | Counter   | RAG/LLM queries
nfyio_agent_query_duration_seconds  | Histogram | Query latency
nfyio_embeddings_total              | Counter   | Embedding API calls
process_resident_memory_bytes       | Gauge     | Memory usage
process_cpu_seconds_total           | Counter   | CPU usage
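
To verify what a running service actually exposes, curl its /metrics endpoint directly (same ports as the health check examples above):

# List exposed NFYio and HTTP metrics from the gateway
curl -s http://localhost:3000/metrics | grep -E '^(http_|nfyio_|process_)' | head -20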

Example Queries

# Request rate (requests per second)
rate(http_requests_total{job="nfyio-gateway"}[5m])

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="nfyio-gateway"}[5m]))

# Error rate (5xx)
sum(rate(http_requests_total{job="nfyio-gateway",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="nfyio-gateway"}[5m]))

# Storage operations by type
rate(nfyio_storage_operations_total[5m])

Logging

Log Format

NFYio uses structured JSON logging:

{
  "timestamp": "2026-03-01T12:00:00.000Z",
  "level": "info",
  "message": "Request completed",
  "method": "GET",
  "path": "/api/buckets",
  "status": 200,
  "duration_ms": 12,
  "request_id": "req_abc123"
}

Log Levels

Level | Use Case
debug | Development, verbose tracing
info  | Normal operations, request logs
warn  | Recoverable issues, deprecations
error | Failures, exceptions

Docker Logs

# Follow gateway logs
docker compose logs -f nfyio-gateway

# JSON parsing with jq
docker compose logs --no-log-prefix nfyio-gateway | jq -r '.message'
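
You can also filter on the structured fields shown in the log format above, for example:

# Show only error-level entries
docker compose logs --no-log-prefix nfyio-gateway | jq 'select(.level == "error")'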

Centralized Logging (Loki / ELK)

Forward logs to Loki, Elasticsearch, or similar:

# Docker logging driver (example for Loki)
nfyio-gateway:
  logging:
    driver: loki
    options:
      loki-url: "http://loki:3100/loki/api/v1/push"
      loki-batch-size: "400"

Or use a sidecar/daemon (Fluent Bit, Filebeat) to ship container logs.
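
For the Loki route, Promtail is another common shipper alongside the driver shown above; a minimal single-host sketch that tails Docker's JSON log files (the Loki URL and log path are assumptions):

# promtail-config.yaml
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push   # assumed Loki endpoint
scrape_configs:
  - job_name: containers
    static_configs:
      - targets: [localhost]
        labels:
          job: nfyio-containers
          __path__: /var/lib/docker/containers/*/*.log   # default Docker log path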

Dashboard Setup (Grafana)

Import NFYio Dashboard

  1. In Grafana, go to Dashboards → Import
  2. Upload the NFYio dashboard JSON (from the chart or docs)
  3. Select Prometheus as the data source
  4. Import

Custom Panels

Create panels for:

Overview

  • Request rate (all services)
  • Error rate (4xx, 5xx)
  • P95 latency

Storage

  • S3 operations/sec (Get, Put, List)
  • Storage proxy latency
  • Bucket object count (if exposed)

Agents

  • RAG queries/sec
  • Query latency
  • Embedding calls/sec

Resources

  • CPU and memory per pod/container
  • PostgreSQL connections
  • Redis memory usage

Example Panel (Request Rate)

  • Query: sum(rate(http_requests_total{job=~"nfyio-.*"}[5m])) by (job)
  • Visualization: Time series
  • Legend: {{job}}
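
A matching panel for the error-rate item in the Overview list can reuse the same labels; a sketch:

sum(rate(http_requests_total{job=~"nfyio-.*",status=~"5.."}[5m])) by (job) /
sum(rate(http_requests_total{job=~"nfyio-.*"}[5m])) by (job)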

Alerts

PrometheusRule Examples

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nfyio-alerts
  namespace: nfyio
spec:
  groups:
    - name: nfyio
      rules:
        - alert: NfyioGatewayDown
          expr: up{job="nfyio-gateway"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "NFYio Gateway is down"

        - alert: NfyioHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="nfyio-gateway",status=~"5.."}[5m])) /
            sum(rate(http_requests_total{job="nfyio-gateway"}[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Gateway 5xx error rate > 5%"

        - alert: NfyioHighLatency
          expr: |
            histogram_quantile(0.95,
              rate(http_request_duration_seconds_bucket{job="nfyio-gateway"}[5m])
            ) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Gateway P95 latency > 2s"

        - alert: NfyioStorageDown
          expr: up{job="nfyio-storage"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Storage proxy is down"

        - alert: NfyioAgentHighMemory
          expr: |
            container_memory_usage_bytes{container="nfyio-agent"} /
            container_spec_memory_limit_bytes{container="nfyio-agent"} > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Agent memory usage > 90%"

Alertmanager Routing

Route NFYio alerts to your team:

# alertmanager config
route:
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
    - match:
        severity: warning
      receiver: slack-warnings
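
The receivers referenced above must be defined in the same Alertmanager config; a minimal sketch with placeholder keys and an assumed channel name:

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"   # placeholder
  - name: slack-warnings
    slack_configs:
      - api_url: "https://hooks.slack.com/services/<webhook-path>"   # placeholder
        channel: "#nfyio-alerts"                                     # assumed channel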

Performance Monitoring

Key SLOs

SLO           | Target  | Measurement
Availability  | 99.9%   | up metric over time
Latency (P95) | < 500ms | http_request_duration_seconds
Error rate    | < 0.1%  | 5xx / total requests
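
Availability against the 99.9% target can be approximated from the up metric; for example, over a 30-day window:

# Fraction of scrapes where the gateway was up over the last 30 days
avg_over_time(up{job="nfyio-gateway"}[30d])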

Slow Query Detection

For the agent service, alert on slow RAG queries:

histogram_quantile(0.99, rate(nfyio_agent_query_duration_seconds_bucket[5m])) > 10

Database Monitoring

Monitor PostgreSQL and Redis:

# PostgreSQL connections (if exposed)
pg_stat_activity_count

# Redis memory
redis_memory_used_bytes
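
These metrics are typically exposed by postgres_exporter and redis_exporter rather than by NFYio itself; a scrape-config sketch using the exporters' default ports (hostnames are assumptions):

scrape_configs:
  - job_name: 'nfyio-postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']   # assumed hostname; 9187 is postgres_exporter's default
  - job_name: 'nfyio-redis'
    static_configs:
      - targets: ['redis-exporter:9121']      # assumed hostname; 9121 is redis_exporter's default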

Capacity Planning

Metrics to Track

Metric                              | Action Threshold
CPU utilization                     | Scale when sustained > 70%
Memory utilization                  | Scale when sustained > 80%
Disk usage (PostgreSQL, SeaweedFS)  | Alert at 80%, critical at 95%
Request rate growth                 | Plan scaling before 2x current peak
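
If node_exporter is scraped, the disk thresholds above can be expressed in PromQL; a sketch (the mountpoint is an assumption):

# Used fraction of the data filesystem; alert when > 0.8, critical when > 0.95
1 - (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"})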

Scaling Triggers

  • Gateway: Scale on CPU or request rate (see the HPA sketch after this list)
  • Storage: Scale on S3 operation rate or latency
  • Agent: Scale on query rate or queue depth
  • PostgreSQL: Consider read replicas for high read load
  • SeaweedFS: Add volume nodes for storage capacity
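
On Kubernetes, the Gateway trigger maps naturally to a HorizontalPodAutoscaler; a sketch using the 70% CPU threshold from the table above (the Deployment name and replica bounds are assumptions):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nfyio-gateway
  namespace: nfyio
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nfyio-gateway   # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70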

Example Capacity Dashboard

Create a Grafana dashboard with:

  • Request rate trend (7d)
  • Resource utilization (CPU, memory) per service
  • Storage usage (PostgreSQL, Redis, SeaweedFS)
  • Error budget consumption

What’s Next