Building a Production RAG Pipeline with AWS Bedrock

Architecture decisions, embedding strategies, and the mistakes most teams make when going from demo to production.

RAG is the most practical AI pattern for businesses right now. Instead of fine-tuning models on your data, which is often expensive, slow, and becomes stale as your documents change, you retrieve relevant context at query time to feed to the LLM. The concept is simple, but getting it to work reliably in production is where most teams struggle.

We've built several RAG systems on AWS Bedrock for clients in financial services, professional services, and operations. This post covers what we've learned about architecture, embedding, retrieval, and the non-obvious things that make the difference between a demo that impresses and a system people actually trust.

The Architecture

Here's the architecture we use for most production RAG systems on AWS. Every component has alternatives, but this is our default stack and where we start conversations with clients.

Documents (PDF, DOCX, etc.)
        |
        v
   S3 Bucket (raw)
        |
        v
   Lambda (parse + chunk)
        |
        v
   Bedrock Titan Embeddings
        |
        v
   OpenSearch Serverless (vector store)
        |
        |
   -----+-----
   |         |
   v         v
 Query     Ingestion
 Path      Pipeline
   |
   v
User Query --> Embed --> Vector Search --> Context Assembly --> Bedrock Claude --> Response

The flow has two sides. Ingestion: documents land in S3, a Lambda function parses and chunks them, generates embeddings via Bedrock, and writes vectors to OpenSearch. Query: a user's question gets embedded with the same model, we search for similar chunks, assemble them into a prompt with the original question, and send the lot to Claude on Bedrock.

We default to OpenSearch Serverless for the vector store on most projects. It's a managed service, scales without capacity planning, supports both vector and keyword search in one index (critical for hybrid search, which we'll get to), and it's available in eu-west-2. The serverless pricing model means you're not paying for idle compute during off-hours, which matters for internal tools that see traffic 8 hours a day.

The alternative is Aurora PostgreSQL with pgvector. We reach for this when the client already has Aurora in their stack and the dataset is modest, say under 500k chunks. pgvector is simpler: querying is plain SQL and you avoid adding another service to manage. OpenSearch wins on advanced search features: built-in hybrid search, better performance at scale, and native support for filtering on metadata fields during vector queries.

For orchestration, Step Functions handles the ingestion pipeline: ingestion is asynchronous, documents can take minutes to process, and Step Functions gives you built-in retry logic. Lambda handles the synchronous query path. Keep the two separate; the ingestion pipeline has different scaling, error handling, and timeout requirements from the query path.

Choosing Your Embedding Model

This is the decision most teams get wrong by not spending enough time on it. Your embedding model determines the quality of retrieval, and retrieval quality is the single biggest factor in RAG accuracy. The LLM can only work with what you give it.

On Bedrock, you have two main options:

Amazon Titan Embeddings v2 supports configurable dimensions (256, 512, or 1024), handles up to 8,192 tokens of input, and is the cheapest option on Bedrock. At 512 dimensions, it strikes a good balance between quality and storage cost. It's what we use by default.

Cohere Embed v3 on Bedrock offers 1024 dimensions, supports up to 512 tokens per input, and has separate input types for search documents vs. search queries (which does improve retrieval quality). It edges ahead of Titan on benchmarks for English text, but the shorter input length means you need to think harder about chunk sizes.

Our take: start with Titan Embeddings v2 at 512 dimensions. It's cost-effective, handles longer inputs, and the quality gap with models like Cohere is smaller than benchmarks suggest. If you're building a system where retrieval precision is absolutely critical, such as legal document search or compliance queries, test multiple models on your actual data and measure the results yourself. Don't rely on benchmarks alone; they rarely reflect your specific document types and query patterns.
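When you do benchmark on your own data, the measurement doesn't need to be elaborate: recall@k over a labelled set of (query, relevant chunk) pairs goes a long way. A minimal sketch, with the embedding function injected so the same harness runs against Titan or Cohere (the helper names here are ours, not a Bedrock API):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def recall_at_k(embed, pairs, corpus, k=5):
    """Fraction of queries whose known-relevant chunk lands in the top k.

    embed:  callable text -> vector (e.g. a wrapper around a Bedrock model)
    pairs:  list of (query, relevant_chunk_id) built from real user questions
    corpus: dict of chunk_id -> chunk text
    """
    corpus_vecs = {cid: embed(text) for cid, text in corpus.items()}
    hits = 0
    for query, relevant_id in pairs:
        qv = embed(query)
        ranked = sorted(corpus_vecs,
                        key=lambda cid: cosine(qv, corpus_vecs[cid]),
                        reverse=True)
        if relevant_id in ranked[:k]:
            hits += 1
    return hits / len(pairs)
```

Run it once per candidate model over the same pairs and pick the winner on your data, not on a leaderboard.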

Here's how we generate embeddings with Bedrock:

import json
import boto3

bedrock = boto3.client('bedrock-runtime', region_name='eu-west-2')

def get_embedding(text):
    response = bedrock.invoke_model(
        modelId='amazon.titan-embed-text-v2:0',
        body=json.dumps({
            'inputText': text,
            'dimensions': 512,
            'normalize': True
        })
    )
    return json.loads(response['body'].read())['embedding']

A few things to note. We always set normalize: True; it makes cosine similarity calculations simpler and slightly faster. We default to 512 dimensions because going to 1024 rarely improves results enough to justify the extra storage and compute. And we pin the region: all Bedrock calls, the vector store, and S3 buckets should live in the same region to avoid latency and cross-region transfer charges.

Chunking Strategy Matters More Than You Think

Chunking is where most RAG demos fall apart when you point them at real documents. A PDF with a complex layout (tables, multi-column text) produces incoherent chunks if you split it every N tokens and hope for the best.

We've tried several approaches:

  • Fixed-size chunks: splitting every N tokens with no overlap is simple to implement but often leads to poor results. You risk splitting mid-sentence or mid-paragraph, which can result in useless context if relevant information spans two chunks.
  • Semantic chunking: this method uses embedding similarity between sentences to find natural break points. It offers better results but can be slow and unpredictable since it requires calling the embedding model for every sentence.
  • Our preferred approach: overlapping chunks with metadata preservation. We split on paragraph boundaries, aiming for 512-1024 tokens per chunk with a 20% overlap. Crucially, we preserve the document structure by including source filenames, page numbers, and section headings with every chunk.

Overlapping chunks ensure that relevant passages aren't lost at boundaries. A 20% overlap is typically our sweet spot, as it handles boundary cases effectively without significantly increasing storage costs.

Preserving metadata is also essential. It allows the LLM to provide accurate citations, such as document names and page numbers. Furthermore, it enables filtering during retrieval so you can limit searches to specific categories or date ranges.
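Here's what that approach looks like in code. This sketch splits on blank-line paragraph boundaries, packs paragraphs up to a word-count target (a rough stand-in for tokens), carries roughly 20% of each chunk into the next, and attaches the document-level metadata to every chunk:

```python
def chunk_document(text, metadata, target_words=200, overlap=0.2):
    """Split text into overlapping chunks on paragraph boundaries.

    Every chunk carries the document-level metadata (filename, page,
    section) so it can be cited at answer time.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = []   # paragraphs in the chunk being built
    fresh = 0      # paragraphs added since the last flush

    for para in paragraphs:
        current.append(para)
        fresh += 1
        if sum(len(p.split()) for p in current) >= target_words:
            chunks.append({"text": "\n\n".join(current),
                           "metadata": dict(metadata)})
            # Carry the tail of this chunk into the next one (~20% overlap)
            keep = max(1, int(len(current) * overlap))
            current = current[-keep:]
            fresh = 0
    if fresh:  # flush the remainder, but never a pure-overlap tail
        chunks.append({"text": "\n\n".join(current),
                       "metadata": dict(metadata)})
    return chunks
```

In production you'd count real tokens rather than words, but the structure (paragraph-aligned boundaries, overlap, metadata on every chunk) is the part that matters.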

One more thing: don't chunk tables. Instead, extract them separately and convert them into a structured format like Markdown. Store them as distinct chunks with explicit metadata marking them as tables, since splitting tables across chunks can confuse the model.

The Retrieval Step

Vector similarity search does most of the work, but production quality needs three more things: hybrid search, metadata filtering, and re-ranking. These are what separate a basic demo from a system users actually trust.

Hybrid search combines vector similarity with traditional keyword matching. This is important because embedding models can occasionally miss exact terms. For instance, a search for a specific policy number might return general results via vector search, but keyword matching will pinpoint the exact document. OpenSearch Serverless supports this natively by allowing you to combine scores from both vector and keyword queries on a single index.

Here's how we query OpenSearch for hybrid search:

{
  "size": 8,
  "_source": ["text", "metadata"],
  "query": {
    "hybrid": {
      "queries": [
        {
          "neural": {
            "embedding": {
              "query_text": "What is our policy on remote working?",
              "model_id": "your-embedding-model-id",
              "k": 20
            }
          }
        },
        {
          "match": {
            "text": {
              "query": "remote working policy"
            }
          }
        }
      ]
    }
  }
}
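If your cluster doesn't support the hybrid query type, or you'd rather combine results client-side, reciprocal rank fusion is a simple, score-scale-free way to merge the vector and keyword result lists. A minimal sketch:

```python
def reciprocal_rank_fusion(result_lists, k=60, top_n=8):
    """Merge ranked lists of document ids.

    Each list is ordered best-first. A document's fused score is
    sum(1 / (k + rank)) over the lists it appears in; k=60 is the
    conventional smoothing constant. Rank-based fusion sidesteps the
    problem that vector scores and BM25 scores live on different scales.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Documents that appear in both lists get boosted, which is exactly the behaviour you want from hybrid search.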

How many chunks to retrieve: start with 5-8 and test from there. Too few and you miss relevant context. Too many and you dilute the good results with noise, which confuses the LLM. We typically retrieve 15-20 candidates, then re-rank and take the top 5-8.

Re-ranking is the step most teams skip, and it makes a significant difference. The idea: your initial vector search returns a broad set of candidates. A re-ranker (a cross-encoder model) then scores each candidate against the original query more carefully and reorders them. The vector search is fast but approximate; the re-ranker is slower but more accurate. You can use a re-ranking model via a SageMaker endpoint or, increasingly, Bedrock's re-ranking API.
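Mechanically, re-ranking is straightforward once you have a scoring model. A sketch with the cross-encoder injected as a callable, so the same code works whether the scores come from a SageMaker endpoint or Bedrock (the score signature is our simplification, not a specific API):

```python
def rerank(query, candidates, score, top_n=8):
    """Reorder retrieval candidates by cross-encoder relevance.

    query:      the user's question
    candidates: list of {"text": ..., "metadata": ...} chunks from the
                initial (fast, approximate) vector search
    score:      callable (query, text) -> float, the (slow, accurate)
                cross-encoder
    """
    scored = [(score(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]
```

The pattern is retrieve wide, re-rank narrow: 15-20 candidates in, the best 5-8 out.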

Prompt Engineering for RAG

The system prompt for your RAG application is not a nice-to-have. It's load-bearing infrastructure. Get it wrong and the model will hallucinate, ignore your context, or give rambling answers that bury the useful information.

Here's the template we start with:

You are a helpful assistant for {company_name}. Answer questions
using ONLY the context provided below. Follow these rules strictly:

1. Base your answer entirely on the provided context.
2. If the context does not contain enough information to answer
   the question fully, say "I don't have enough information to
   answer that fully" and explain what's missing.
3. NEVER invent or assume information not present in the context.
4. Cite your sources by referencing the document name and page
   number in square brackets, e.g. [Finance Policy v3, p.12].
5. If multiple sources are relevant, synthesise them and cite each.
6. Keep answers concise and direct. Use bullet points for lists.

Context:
{context_chunks}

Question: {user_question}

A few hard-won lessons on RAG prompts:

  • Instruct the model to cite sources. Citations keep answers grounded in the provided context, let users verify claims, and, in our experience, reduce hallucination, because the model has to tie each statement to a specific document.
  • "I don't know" is a feature, not a bug. Explicitly telling the model it's okay to say it doesn't have the answer dramatically reduces hallucination. Without this instruction, the model will try to be helpful by making things up.
  • Don't just dump all chunks into the prompt. Order matters. Put the most relevant chunks first. The model pays more attention to context near the beginning of the prompt. This is another reason re-ranking matters.
  • Separate context from instructions. Use clear delimiters between your system instructions, the retrieved context, and the user's question. Mixing them together confuses the model about what's an instruction vs. what's context.

Context window management is crucial at scale. Claude on Bedrock gives you a large context window, but that doesn't mean you should fill it. More context means higher latency, higher cost (you pay per token), and diminishing returns on quality. We aim for 3,000-5,000 tokens of context for most queries. If you consistently need more, your chunking or retrieval is likely the problem, not your context window size.
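Putting the ordering and budget points together, context assembly reduces to: take the re-ranked chunks best first and stop when the budget is spent. A sketch using a rough four-characters-per-token estimate; swap in a real tokenizer if you need precision:

```python
def assemble_context(chunks, budget_tokens=4000):
    """Concatenate re-ranked chunks, best first, within a token budget.

    Each chunk is prefixed with its source reference so the model can
    cite it. Token count is estimated at ~4 characters per token.
    """
    parts, used = [], 0
    for chunk in chunks:
        meta = chunk["metadata"]
        block = f"[{meta['source']}, p.{meta['page']}]\n{chunk['text']}"
        cost = len(block) // 4 + 1
        if used + cost > budget_tokens:
            break  # chunks are already ordered by relevance; stop here
        parts.append(block)
        used += cost
    return "\n\n---\n\n".join(parts)
```

Because the input is already re-ranked, truncation always drops the least relevant material first.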

The Things That Bite You in Production

Everything above will get you a working demo. The following is what decides whether it survives contact with real users and real data.

Stale data

Documents change. Build an incremental ingestion pipeline from day one: S3 event notifications can trigger re-embedding the moment a document is added or updated, so the index never drifts far from the source of truth.
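The trigger itself is a small Lambda. A sketch of the handler, with the pipeline kickoff injected as a callable so the event parsing stays testable (in the real function that callable would be a Step Functions start_execution call):

```python
import json
import urllib.parse

def changed_objects(event):
    """Pull (bucket, key) pairs out of an S3 event notification payload.

    S3 URL-encodes object keys in the event, so spaces arrive as '+'.
    """
    return [
        (rec["s3"]["bucket"]["name"],
         urllib.parse.unquote_plus(rec["s3"]["object"]["key"]))
        for rec in event.get("Records", [])
    ]

def handler(event, context, start_ingestion=print):
    # start_ingestion would be sfn.start_execution(...) in the deployed
    # Lambda; injected here so the parsing logic is testable locally.
    for bucket, key in changed_objects(event):
        start_ingestion(json.dumps({"bucket": bucket, "key": key}))
```

One execution per changed object keeps failures isolated: a single unparseable PDF retries on its own instead of blocking the batch.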

Hallucination despite context

Even with good context and a solid system prompt, the model can still make things up. We've seen it synthesise plausible-sounding policy clauses that don't exist, combine numbers from different tables incorrectly, and present one document's guidance as if it came from another. Implement source citation checking: compare the model's cited sources against the chunks you actually provided. If it cites a document that wasn't in the context, flag the response for review. This isn't foolproof, but it catches a meaningful percentage of hallucinations.
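A sketch of that check, assuming citations follow the [Document Name, p.N] convention from our prompt template:

```python
import re

# Matches citations of the form [Finance Policy v3, p.12]
CITATION = re.compile(r"\[([^\],]+), p\.(\d+)\]")

def unverified_citations(answer, provided_chunks):
    """Return citations that don't match any chunk we actually supplied."""
    provided = {(c["metadata"]["source"], c["metadata"]["page"])
                for c in provided_chunks}
    cited = {(name.strip(), int(page))
             for name, page in CITATION.findall(answer)}
    return cited - provided
```

A non-empty result means the model cited something it was never shown, which is a strong signal to flag the response for review.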

Latency

The full RAG chain, from query embedding through vector search and re-ranking to final generation, can add several seconds of latency. That's usually fine for internal tools; for customer-facing applications it's often too slow.

To reduce it: cache embeddings for common queries, stream responses so users see output immediately, keep the context payload lean, and keep your Lambda functions warm.
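The embedding cache is the cheapest of those wins. A minimal in-memory version; for anything shared across Lambda invocations you'd back it with ElastiCache or DynamoDB instead (embed stands in for your Bedrock call):

```python
import hashlib

def cached_embedder(embed):
    """Wrap an embedding function with an in-memory cache.

    Keyed on a hash of the normalised query text, so repeated or
    trivially different queries skip the Bedrock round trip entirely.
    """
    cache = {}

    def embed_cached(text):
        key = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if key not in cache:
            cache[key] = embed(text)
        return cache[key]

    return embed_cached
```

For internal tools, a surprising share of queries are near-duplicates ("what's the holiday policy?"), so even this naive version pays for itself.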

Cost

Bedrock's per-token pricing adds up quickly at volume, so monitor usage and set billing alarms. A single query is cheap; thousands of queries a day against long contexts are not.

Use the right model for the right task. We take a tiered approach: Haiku for query classification and routing, Sonnet for generation. Haiku is significantly cheaper and fast enough for routing, which adds up to substantial savings. Don't reach for Opus on every task; reserve it for complex reasoning where the quality justifies the cost.
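The routing itself is a few lines. A sketch with the classifier injected as a callable (in our pipelines that's itself a fast Haiku call); the model ids were current when we wrote this, so check the Bedrock model catalogue before copying:

```python
def pick_model(question, classify):
    """Route cheap lookups to Haiku, real generation to Sonnet.

    classify: callable question -> "route" or "generate". Model ids
    are the Claude 3 generation ids; newer versions supersede them.
    """
    if classify(question) == "route":
        return "anthropic.claude-3-haiku-20240307-v1:0"
    return "anthropic.claude-3-sonnet-20240229-v1:0"
```

Because the classifier runs on the cheap model, the routing step costs a fraction of what it saves.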

Permission-awareness

This is the one that kills projects in regulated industries. Not every user should see every document. If your RAG system retrieves a chunk from an HR document about a specific employee and includes it in a response to someone who shouldn't see it, you have a serious problem.

The fix happens at retrieval time, not generation time. Tag every chunk with access metadata during ingestion, then filter on it in every query so users can only ever retrieve chunks they're authorised to see. Trying to filter the LLM's output after the fact is too late; by then the sensitive content is already in the prompt.
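In OpenSearch terms, enforcement means attaching a filter built from the caller's groups to every vector query. A sketch of the query construction (the metadata.allowed_groups field name is our convention, not a standard):

```python
def knn_query_with_acl(query_vector, user_groups, k=20):
    """Vector search restricted to chunks the caller may see.

    Assumes each chunk was indexed with a metadata.allowed_groups list
    at ingestion time; the filter excludes everything else before the
    results ever reach the prompt.
    """
    return {
        "size": k,
        "query": {
            "bool": {
                "filter": [
                    {"terms": {"metadata.allowed_groups": user_groups}}
                ],
                "must": [
                    {"knn": {"embedding": {"vector": query_vector, "k": k}}}
                ],
            }
        },
    }
```

The key property: the filter is applied inside the search engine, so an unauthorised chunk can never appear in the context, no matter how relevant it is.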

Our Recommended Stack

For most production RAG systems on AWS, this is where we start:

  • Ingestion: S3 + Lambda + Step Functions. S3 event notifications trigger the pipeline, Step Functions orchestrates parsing, chunking, embedding, and indexing with built-in retry logic.
  • Embeddings: Amazon Titan Embeddings v2 at 512 dimensions. Cost-effective, good quality, long input length. Switch to Cohere if benchmarking on your data shows a meaningful improvement.
  • Vector store: OpenSearch Serverless with the vector engine. Hybrid search support, managed scaling, no capacity planning.
  • LLM: Claude on Bedrock. Haiku for classification and routing, Sonnet for generation. Adjust as newer model versions land.
  • Orchestration: Step Functions for async ingestion, Lambda for synchronous query handling.
  • Monitoring: CloudWatch metrics and alarms, plus custom metrics for retrieval quality (relevance scores, citation accuracy, user feedback).
  • Infrastructure: CDK. Everything defined as code, deployed through a pipeline, reproducible across environments.

Start small by focusing on a specific use case and document collection. It's better to deliver a highly effective system for a narrow scope than to struggle with a "universal assistant" that never leaves the development stage.

Building a RAG System?

We can help you go from idea to production. Free 30-minute strategy session to discuss your use case.

Book a Free Session
Mark Hayward
Founder at Cloudavian. AWS Solutions Architect Professional building AI systems on Bedrock.