RAG Agents
Retrieval Augmented Generation agents for document Q&A. Learn architecture, embedding models, chunk strategies, and how to create and query RAG agents via Chat UI and API.
RAG (Retrieval Augmented Generation) agents combine semantic search with LLM generation. They ingest your documents, index them as embeddings, and answer questions by retrieving relevant chunks and passing them to the LLM as context. Use RAG agents for document Q&A, knowledge-base chatbots, and internal search.
What is RAG?
RAG augments LLM responses with retrieved context from your documents. Instead of relying solely on the model’s training data, the model receives relevant passages from your corpus and generates answers grounded in that context. This reduces hallucinations and keeps answers up-to-date with your data.
RAG Flow
- Ingest — Documents are uploaded to an S3 bucket
- Chunk — Documents are split into overlapping chunks (e.g., 512 tokens)
- Embed — Chunks are converted to vector embeddings
- Index — Embeddings are stored in pgvector
- Query — User question is embedded and similar chunks are retrieved
- Generate — Retrieved chunks + question are sent to the LLM for answer generation
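The query-time half of this flow (embed, retrieve, build the prompt) can be sketched in a few lines. All names here are illustrative, and the embeddings are stubbed with toy vectors; a real deployment would call an embedding API and query pgvector:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy index: chunk text -> embedding (real systems store these in pgvector)
index = {
    "Refunds are accepted within 30 days.": [0.9, 0.1, 0.0],
    "Shipping takes 3-5 business days.":    [0.1, 0.9, 0.0],
}

def retrieve(query_vec, top_k=1):
    """Return the top_k chunks most similar to the query vector."""
    scored = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [text for text, _ in scored[:top_k]]

# A refund question embeds close to the refund chunk
context = retrieve([0.8, 0.2, 0.0], top_k=1)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: What is the refund policy?"
```

The retrieved chunks are prepended to the user question, and that combined prompt is what the LLM actually sees.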
Architecture
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Documents │────►│ Embeddings │────►│ Vector DB │────►│ Retrieval │────►│ LLM │
│ (S3 bucket) │ │ (OpenAI, │ │ (pgvector) │ │ (Top-K) │ │ (GPT-4o, │
│ │ │ Voyage) │ │ │ │ │ │ Claude) │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
│ │ │ │ │
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
Chunk size Model config Similarity Top-K value System prompt
256/512/1024 text-embedding-3 cosine/L2 (e.g., 5) + context
Creating a RAG Agent
1. Prepare Your Bucket
Create an S3 bucket and upload documents (PDF, DOCX, TXT, Markdown, images). The agent will watch the bucket or a specific prefix for new objects.
aws s3 cp ./docs/ s3://my-knowledge-base/ --recursive \
--endpoint-url https://storage.yourdomain.com
2. Configure Embedding Model
Choose an embedding model:
| Model | Dimensions | Use Case |
|---|---|---|
| text-embedding-3-small | 1536 | Fast, cost-effective |
| text-embedding-3-large | 3072 | Higher quality |
| voyage-3.5-lite | 1024 | Alternative, good for long docs |
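The dimension count directly drives index size. A rough back-of-envelope estimate, assuming 4-byte float32 components (pgvector's default representation) and ignoring row and index overhead:

```python
def index_size_mb(num_chunks, dimensions, bytes_per_component=4):
    """Approximate raw vector storage in MB, ignoring row and index overhead."""
    return num_chunks * dimensions * bytes_per_component / 1e6

# 100k chunks: text-embedding-3-small (1536 dims) vs. -large (3072 dims)
small = index_size_mb(100_000, 1536)  # ~614 MB
large = index_size_mb(100_000, 3072)  # ~1229 MB
```

Doubling the dimensions doubles storage (and similarity-computation cost), which is why the smaller model is the usual starting point.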
3. Set Chunk Size
Chunk size affects retrieval quality and cost:
| Size (tokens) | Pros | Cons |
|---|---|---|
| 256 | Fine-grained, many chunks | More API calls, may miss context |
| 512 | Balanced; default for most use cases | Neither as fine-grained as 256 nor as contextual as 1024 |
| 1024 | More context per chunk | Fewer chunks, coarser retrieval |
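A minimal chunker shows how size and overlap interact. This is a sketch over a pre-tokenized list (the function name is illustrative; production splitters also respect sentence and section boundaries):

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into overlapping chunks of at most `size` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # each chunk starts `step` tokens after the last
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

# 1200 tokens with the default 512/64 settings
chunks = chunk_tokens(list(range(1200)), size=512, overlap=64)
# 3 chunks: tokens 0-511, 448-959, 896-1199
```

The 64-token overlap means each chunk repeats the tail of the previous one, so a sentence split at a boundary still appears whole in at least one chunk.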
4. Configure LLM
Select the LLM for generation:
- GPT-4o — Best quality, recommended for production
- Claude 3 — Strong alternative
- GPT-3.5-turbo — Faster, lower cost for simple Q&A
5. Set Top-K
Top-K controls how many chunks are retrieved per query. Typical values: 3–10. Higher values give more context but increase token usage and latency.
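The prompt budget grows linearly with Top-K. A back-of-envelope check, assuming every retrieved chunk is near the configured chunk size:

```python
def context_tokens(top_k, chunk_size=512, question_tokens=50):
    """Rough prompt-size estimate: retrieved chunks plus the question."""
    return top_k * chunk_size + question_tokens

# Top-K = 5 with 512-token chunks
budget = context_tokens(5)  # ~2610 tokens before the system prompt
```

This is why raising Top-K from 5 to 10 roughly doubles per-query prompt cost and latency: the retrieved context dominates the token count.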
JSON Configuration Example
{
"name": "company-knowledge-base",
"type": "rag",
"bucket": "my-knowledge-base",
"prefix": "docs/",
"embedding": {
"model": "text-embedding-3-small",
"chunkSize": 512,
"chunkOverlap": 64
},
"llm": {
"model": "gpt-4o",
"temperature": 0.2,
"maxTokens": 2048
},
"retrieval": {
"topK": 5,
"similarityThreshold": 0.7
},
"systemPrompt": "You are a helpful assistant. Answer based only on the provided context. If the context does not contain the answer, say so."
}
Querying via Chat UI
Use the NFYio Chat UI to interact with your RAG agent:
- Select the agent from the dropdown
- Type your question
- The agent retrieves relevant chunks and streams the response
Conversations are persisted as threads. You can resume previous threads or start new ones.
Querying via API
Chat Completion (Streaming)
curl -X POST "https://api.yourdomain.com/v1/agents/company-knowledge-base/chat" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "What is our refund policy?"}
],
"stream": true
}'
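With `"stream": true`, responses typically arrive as server-sent events. The parser below assumes OpenAI-style SSE framing (`data:` lines with JSON deltas and a `[DONE]` sentinel); verify the exact framing against your deployment:

```python
import json

def parse_sse_lines(lines):
    """Extract JSON payloads from SSE 'data:' lines, stopping at [DONE]."""
    events = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments, blank keep-alive lines, etc.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        events.append(json.loads(payload))
    return events

# Sample stream as it might arrive over the wire
stream = [
    'data: {"choices": [{"delta": {"content": "Our refund"}}]}',
    'data: {"choices": [{"delta": {"content": " policy..."}}]}',
    "data: [DONE]",
]
text = "".join(e["choices"][0]["delta"]["content"] for e in parse_sse_lines(stream))
```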
Chat Completion (Non-Streaming)
curl -X POST "https://api.yourdomain.com/v1/agents/company-knowledge-base/chat" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "What is our refund policy?"}
],
"stream": false
}'
Response Format (Non-Streaming)
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Our refund policy allows returns within 30 days..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 450,
"completion_tokens": 120,
"total_tokens": 570
}
}
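Pulling the answer and token counts out of the non-streaming response shape shown above:

```python
import json

# The response body from the example above
response = json.loads("""
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant",
                 "content": "Our refund policy allows returns within 30 days..."},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 450, "completion_tokens": 120, "total_tokens": 570}
}
""")

answer = response["choices"][0]["message"]["content"]
used = response["usage"]["total_tokens"]
```

The `usage` block is worth logging per request: `prompt_tokens` reflects the retrieved context plus the question, so it rises with Top-K and chunk size.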
With Thread ID (Conversation History)
curl -X POST "https://api.yourdomain.com/v1/agents/company-knowledge-base/chat" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"threadId": "thread_xyz789",
"messages": [
{"role": "user", "content": "What is our refund policy?"}
],
"stream": true
}'
Best Practices
Chunking
- Use overlap (e.g., 64 tokens) to avoid splitting sentences awkwardly
- For structured docs (tables, lists), consider smaller chunks
- For narrative text, 512 tokens is a good default
Retrieval
- Start with Top-K = 5 and tune based on answer quality
- Use `similarityThreshold` to filter low-relevance chunks
- Consider hybrid search (keyword + semantic) for mixed queries
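Combining a similarity threshold with Top-K can be sketched as below (scores here are precomputed cosine similarities; function and variable names are illustrative):

```python
def filter_chunks(scored_chunks, top_k=5, threshold=0.7):
    """Keep up to top_k highest-scoring chunks that clear the threshold."""
    kept = [(text, score) for text, score in scored_chunks if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]

scored = [
    ("refund policy...", 0.91),
    ("shipping info...", 0.74),
    ("office dogs...",   0.42),
]
relevant = filter_chunks(scored)  # the 0.42 chunk is dropped
```

The threshold prevents barely-related chunks from padding the prompt on queries with few good matches, at the cost of sometimes returning fewer than Top-K chunks.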
System Prompts
- Instruct the model to answer only from the provided context
- Add fallback behavior: “If the context doesn’t contain the answer, say ‘I don’t have that information.’”
- Include domain-specific instructions (tone, format, citations)
Re-indexing
- Re-index when you add or update documents
- Use incremental indexing when possible to avoid full re-embedding
- Monitor embedding and indexing latency for large corpora
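One common way to implement incremental indexing is a content hash per document, re-embedding only what changed. A minimal sketch (the persistence of `seen_hashes` is left out; a real pipeline would store it alongside the index):

```python
import hashlib

def needs_reembedding(docs, seen_hashes):
    """Return keys of docs whose content hash is new or changed.

    docs: mapping of object key -> document text
    seen_hashes: mapping of object key -> last-indexed content hash (mutated)
    """
    changed = []
    for key, content in docs.items():
        digest = hashlib.sha256(content.encode()).hexdigest()
        if seen_hashes.get(key) != digest:
            changed.append(key)
            seen_hashes[key] = digest
    return changed

seen = {}
docs = {"docs/refunds.md": "Returns within 30 days.", "docs/shipping.md": "3-5 days."}
first = needs_reembedding(docs, seen)   # both docs are new
docs["docs/refunds.md"] = "Returns within 60 days."
second = needs_reembedding(docs, seen)  # only the edited doc
```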
Next Steps
- Embeddings & Vector Search — Deep dive into embeddings and chunk strategies
- LLM Agents — Direct LLM interaction without retrieval
- Agent Tools — Extend RAG with custom tools