RAG Intelligence Pipeline
Overview
The RAG (Retrieval-Augmented Generation) Intelligence Pipeline is an advanced AI feature that allows your application to answer questions based on uploaded documents. Unlike traditional AI chatbots that rely solely on pre-trained knowledge, this system dynamically retrieves relevant information from your custom knowledge base and uses it to generate accurate, context-aware responses.
[!TIP] Interactive Tutorial: An online tutorial walks through implementing this RAG pipeline step by step. You can follow it here: RAG Pipeline Step-by-Step Tutorial
What is RAG?
RAG combines two powerful AI concepts:
- Retrieval: Finding relevant documents from a knowledge base using semantic search
- Generation: Using an AI model (like Google's Gemini) to generate human-like responses based on retrieved context
This approach ensures your AI assistant provides answers grounded in your specific data, reducing hallucinations and improving accuracy.
Architecture Overview
```mermaid
flowchart TD
    A[User Uploads File] --> B[Generate Presigned URL]
    B --> C[Upload to Cloudflare R2]
    C --> D[Confirm Upload]
    D --> E[Add to Ingestion Queue]
    E --> F{File Type?}
    F -->|Text/PDF| G[Extract Text]
    F -->|Image| H[Gemini Vision Analysis]
    G --> I[Chunk Text]
    H --> I
    I --> J[Generate Embeddings]
    J --> K[Store in pgvector]

    L[User Asks Question] --> M[Rewrite Query w/ Context]
    M --> N[Convert to Embedding]
    N --> O[Hybrid Vector Search]
    O --> P[Generate Presigned URLs]
    P --> Q[Stream AI Response]
```
How It Works
Phase 1: File Ingestion
When a user uploads a file, the system processes it through an asynchronous pipeline:
Step 1: File Upload (Reservation Pattern)
The upload follows a secure "reservation" pattern:
```
// 1. Client requests signed URL
POST /api/v1/upload/generate-signed-url
{
  "fileName": "company-docs.pdf",
  "fileType": "application/pdf",
  "fileSize": 524288,
  "isPublic": false
}

// 2. Backend creates PENDING record in DB
const file = await prisma.file.create({
  data: {
    fileKey: `users/${userId}/uuid-company-docs.pdf`,
    status: 'PENDING',
    userId,
    // ...
  }
});

// 3. Client uploads directly to R2 using signed URL

// 4. Client confirms upload
POST /api/v1/upload/confirm
{ "fileId": "uuid" }
```
See File Upload Architecture for detailed flow.
Step 2: Background Processing
Once confirmed, a BullMQ job is triggered and processed by ingestion.worker.ts.
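The hand-off to the queue might look like the following sketch; the queue name, job name, and Redis connection settings are illustrative assumptions, not taken from the codebase:

```ts
import { Queue } from 'bullmq';

// Illustrative queue name and connection settings
const ingestionQueue = new Queue('ingestion', {
  connection: { host: 'localhost', port: 6379 },
});

// Called from the /upload/confirm handler once the upload is verified
export async function enqueueIngestion(fileId: string, userId: string) {
  await ingestionQueue.add('ingest-file', { fileId, userId }, {
    attempts: 3,                                    // retry transient failures
    backoff: { type: 'exponential', delay: 5000 },  // 5s, 10s, 20s...
  });
}
```

Inside the worker, processing branches on file type.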
For Text Files (PDF, TXT, Markdown):
- Download from R2: Fetch the file using the AWS SDK
- Hash Check: Calculate a SHA-256 hash to detect duplicates (see the sketch after this list)
- Text Extraction: Parse content (PDFs use the `unpdf` library)
- Chunking: Split text into ~1000-character chunks with 200-character overlap
- Embedding: Convert each chunk to a 768-dimension vector using Gemini's `text-embedding-004`
- Storage: Save to PostgreSQL with the pgvector extension
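A sketch of that hash check using Node's crypto module; the `hash` field and `COMPLETED` status value are assumptions about the Prisma schema (the `DUPLICATE` status is described under Troubleshooting below):

```ts
import { createHash } from 'node:crypto';

// Fingerprint the downloaded file
const sha256 = createHash('sha256').update(buffer).digest('hex');

// If an already-processed file has the same hash, skip re-ingestion
// ("hash" and "COMPLETED" are assumed schema details)
const existing = await prisma.file.findFirst({
  where: { hash: sha256, status: 'COMPLETED' },
});

if (existing) {
  // Mark the new upload as a duplicate; the original's chunks stay authoritative
  await prisma.file.update({
    where: { id: fileId },
    data: { status: 'DUPLICATE' },
  });
  return;
}
```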
For Images (JPEG, PNG, WEBP, GIF):
- Multimodal Analysis: Send the image to a Gemini Vision model (`gemini-2.5-flash-lite`)
- Description Generation: The AI describes visual content, text, and data tables
- Chunking & Embedding: Process the description text like above
Example Code from ingestion.worker.ts:
```ts
// Image processing
if (mimeType.startsWith('image/')) {
  const { text: imageDesc } = await generateText({
    model: google('gemini-2.5-flash-lite'),
    messages: [{
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'Analyze this image in detail. Extract all visible text...'
        },
        { type: 'image', image: buffer }
      ]
    }]
  });
  text = imageDesc;
}

// Generate embeddings
for (const chunk of chunks) {
  const { embedding } = await embed({
    model: google.textEmbeddingModel('text-embedding-004'),
    value: chunk.pageContent,
  });

  // Store in pgvector (camelCase columns need quoting in Postgres;
  // JSON.stringify of a number[] yields pgvector's '[...]' text format)
  await prisma.$executeRaw`
    INSERT INTO "Document" (id, content, metadata, "userId", "fileId", embedding, "createdAt")
    VALUES (gen_random_uuid(), ${chunk.pageContent}, ${JSON.stringify(metadata)}::jsonb,
            ${userId}, ${fileId}, ${JSON.stringify(embedding)}::vector, NOW())
  `;
}
```
Phase 2: Intelligent Retrieval
When a user asks a question, the system performs sophisticated retrieval:
Step 1: Contextual Query Rewriting
If the conversation has history, the system uses Gemini to rewrite the query:
```ts
// User: "Who is Elon Musk?"
// AI: "He is the CEO of Tesla and SpaceX"
// User: "How old is he?"
// System rewrites "How old is he?" → "How old is Elon Musk?"
const { text } = await generateText({
  model: google('gemma-3-12b'),
  messages: conversationHistory,
  system: 'Rewrite the last message into a standalone search query...'
});
```
Step 2: Vector Embedding
Convert the refined query to the same 768-D vector space:
```ts
const { embedding } = await embed({
  model: google.textEmbeddingModel('text-embedding-004'),
  value: refinedQuery
});
```
Step 3: Hybrid Search (Vector + Metadata)
The system performs a hybrid search combining:
- Vector Similarity: pgvector's `<=>` operator (cosine distance)
- Access Control: Filters by user ownership OR public visibility
```sql
SELECT d.content, d.metadata, f."originalName", f."fileKey", f."isPublic",
       (d.embedding <=> $vectorQuery::vector) AS distance
FROM "Document" d
LEFT JOIN "File" f ON d."fileId" = f.id
WHERE (d."userId" = $currentUserId OR f."isPublic" = true)
ORDER BY distance ASC
LIMIT 5
```
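From application code, this query would typically run through Prisma's raw query API; a sketch, assuming the embedding array is serialized into pgvector's `[x, y, ...]` text format (the result interface is illustrative):

```ts
// Illustrative result shape for the hybrid search
interface RetrievedDoc {
  content: string;
  metadata: unknown;
  originalName: string | null;
  fileKey: string | null;
  isPublic: boolean | null;
  distance: number;
}

// JSON.stringify of a number[] happens to produce pgvector's '[...]' literal
const vectorLiteral = JSON.stringify(embedding);

const documents = await prisma.$queryRaw<RetrievedDoc[]>`
  SELECT d.content, d.metadata, f."originalName", f."fileKey", f."isPublic",
         (d.embedding <=> ${vectorLiteral}::vector) AS distance
  FROM "Document" d
  LEFT JOIN "File" f ON d."fileId" = f.id
  WHERE (d."userId" = ${userId} OR f."isPublic" = true)
  ORDER BY distance ASC
  LIMIT 5
`;
```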
Security Model:
- Users can search their own private documents
- Users can search any public document (uploaded by admins/contributors)
- Users cannot see other users' private documents
Step 4: Smart Citations
For each retrieved document, the system generates a temporary presigned URL (valid 1 hour) so the AI can cite sources:
```ts
// Generate a presigned URL (valid 1 hour) for each retrieved document,
// then package the sources for the AI
const contextParts = await Promise.all(documents.map(async (doc) => {
  const signedUrl = await getSignedUrl(s3Client, new GetObjectCommand({
    Bucket: config.aws.s3.bucket,
    Key: doc.fileKey
  }), { expiresIn: 3600 });

  return `Source: ${doc.originalName} [${doc.isPublic ? 'Public' : 'Private'}]
Link: ${signedUrl}
Content: ${doc.content}`;
}));

const context = contextParts.join('\n\n');
```
Step 5: AI Response Generation
Finally, stream the response using Gemini:
```ts
const result = streamText({
  model: google('gemini-2.5-flash-lite'),
  messages: userConversation,
  system: `Answer based ONLY on this context. Cite sources using [Source](Link).
Context:
${context}`
});

result.pipeTextStreamToResponse(res);
```
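On the client, the streamed text can be rendered incrementally as it arrives; a minimal sketch using the Fetch API, where the `/api/v1/agent/chat` path and `renderChunk` helper are placeholders (see Agent API Routes for the real endpoint):

```ts
// Placeholder endpoint path; see Agent API Routes for the actual route
const res = await fetch('/api/v1/agent/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ messages: userConversation }),
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();

// Append tokens to the UI as they stream in, instead of waiting for the full answer
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  renderChunk(decoder.decode(value, { stream: true })); // renderChunk is app-specific
}
```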
Configuration
1. Google AI API Key
You need a Google Generative AI API key:
- Visit Google AI Studio
- Create an API key
- Add it to `.env`:

```
GOOGLE_GENERATIVE_AI_API_KEY=your_api_key_here
```
2. pgvector Extension
Ensure your PostgreSQL database has the pgvector extension:
```sql
CREATE EXTENSION IF NOT EXISTS vector;
```
This is typically configured in your database migrations. See Database (Prisma) for details.
3. Supported Models
The free tier includes:
- `gemini-2.5-flash-lite` (recommended for chat)
- `gemini-2.5-flash` (higher quality)
- `gemma-3-12b` (lightweight)
- `text-embedding-004` (embeddings only)
Check Google AI pricing for limits.
4. AI Model Configuration
The specific AI models used for ingestion, chat, and embeddings are configured in `src/config/ai-models.ts`. You can modify this file to switch between different Gemini models (e.g., upgrading from `gemini-2.5-flash-lite` to `gemini-1.5-pro` for better reasoning).
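The file's exact contents aren't reproduced here, but based on the models referenced throughout this page it plausibly looks something like the following sketch (key names are assumptions):

```ts
// Hypothetical shape of src/config/ai-models.ts
export const AI_MODELS = {
  ingestion: 'gemini-2.5-flash-lite',  // image analysis during upload
  chat: 'gemini-2.5-flash-lite',       // RAG answer generation
  queryRewriter: 'gemma-3-12b',        // standalone-query rewriting
  embedding: 'text-embedding-004',     // 768-dimension vectors
} as const;
```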
5. Prompt Configuration
You can customize the specific instructions given to the AI in `src/config/prompts.ts`. This allows you to fine-tune the persona, response style, and query refinement logic without changing the core code.
| Prompt Key | Purpose |
|---|---|
| `ingestion.imageAnalysis` | Instructions for Gemini Vision to extract information from images during file upload. |
| `agent.queryRewriter` | Instructions for rewriting the user's latest message into a standalone search query based on conversation history. |
| `agent.systemPrompt` | The master instruction for the RAG chatbot. It controls how the AI answers questions and cites sources. |
Important: When editing `agent.systemPrompt`, ensure you preserve the `{{context}}` placeholder, as this is where the retrieved documents are injected. Also, strict citation formats (e.g., `[Source Name](Link)`) are critical for the frontend to render clickable source links.
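A sketch of how these keys might map onto strings in `src/config/prompts.ts`, reusing the instruction snippets quoted earlier on this page (the real file may be structured differently):

```ts
// Hypothetical shape of src/config/prompts.ts
export const PROMPTS = {
  ingestion: {
    imageAnalysis: 'Analyze this image in detail. Extract all visible text...',
  },
  agent: {
    queryRewriter: 'Rewrite the last message into a standalone search query...',
    // Preserve {{context}}: the retrieved documents are injected here
    systemPrompt: `Answer based ONLY on this context. Cite sources using [Source](Link).
Context:
{{context}}`,
  },
} as const;
```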
File Support
| File Type | Processing Method | Max Size |
|---|---|---|
| PDF | Text extraction via `unpdf` | 5 MB |
| Text/Markdown | Direct UTF-8 decode | 5 MB |
| JPEG/PNG/GIF/WEBP | Gemini Vision → Text description | 5 MB |
Performance Considerations
Chunking Strategy (see the splitter sketch at the end of this section):
- Chunk size: 1000 characters
- Overlap: 200 characters (preserves context across boundaries)
Why Overlap? If a sentence is split across chunks, the overlap ensures at least one chunk contains the full context.
Embedding Dimensions: 768 (Gemini text-embedding-004)
Search Results: Top 5 most relevant chunks
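The `chunk.pageContent` field used in the ingestion worker matches LangChain's document interface, so the splitter configuration is plausibly something like this sketch (assuming the `@langchain/textsplitters` package):

```ts
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

// Mirrors the strategy above: ~1000-character chunks with 200 characters of overlap
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

// Each chunk is a Document whose pageContent feeds the embedding step
const chunks = await splitter.createDocuments([text]);
```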
Troubleshooting
"No text extracted from file"
- Cause: PDF might be image-based (scanned document)
- Solution: Use OCR preprocessing or upload as image
"Model not available in your region"
- Cause: Some Gemini models have geographic restrictions
- Solution: Switch to `gemma-3-12b` (globally available)
"Vector dimension mismatch"
- Cause: Database schema expects different vector size
- Solution: Verify `Document.embedding` is `vector(768)`
"Duplicate file detected"
- Info: This is expected! The system deduplicates files by SHA-256 hash
- Behavior: The new upload is marked `DUPLICATE`; the original remains
Next Steps
- Learn about File Upload Architecture
- Explore Agent API Routes
- Review Background Jobs setup