Most businesses have a document problem. Invoices, contracts, applications, compliance forms — they arrive in different formats, need data extracted, and feed into downstream systems. The traditional approach is either manual data entry or rigid template-based OCR that breaks the moment someone changes the layout of their invoice.
The combination of Amazon Textract (structured OCR) and Amazon Bedrock (LLM reasoning) changes this. Textract handles the extraction; Bedrock handles the understanding. Together they can process documents that would have been impossible to automate even two years ago.
This post walks through the architecture we use for intelligent document processing on AWS, with code examples and the production lessons we've picked up along the way.
The Architecture
At a high level, the pipeline is straightforward: a document arrives, gets read by OCR, gets understood by an LLM, and the structured output feeds into your systems. The complexity is in the orchestration.
Here's the flow:
```
Document Upload (S3)
        |
        v
EventBridge Rule
        |
        v
Step Functions Workflow
        |
        +---> Step 1: Textract  -- Extract text, tables, form key-value pairs
        |
        +---> Step 2: Bedrock   -- Classify document type (invoice, contract, etc.)
        |
        +---> Step 3: Bedrock   -- Extract structured data based on document type
        |
        +---> Step 4: Validate  -- Check required fields, data types, cross-references
        |
        +---> Step 5: Output    -- Write to DynamoDB/RDS
        |                          OR route failures to human review queue (SQS)
        v
      Done
```
Documents arrive in S3 via whatever mechanism makes sense — a web upload form, an email integration using SES, or a direct API integration. An EventBridge rule on the S3 bucket triggers a Step Functions state machine, which orchestrates the entire pipeline.
Step Functions is the right choice here because document processing involves multiple service calls that can fail independently. You get built-in retries, error handling, and the ability to route failures to a human review queue without writing retry logic yourself.
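To make that concrete, here's a sketch of what the retry and failure routing for one task might look like in the state machine definition. The state names, error codes, and thresholds are illustrative, not a drop-in definition:

```python
import json

# Illustrative fragment of an Amazon States Language definition.
# The task retries on transient Textract errors with exponential backoff,
# and anything unrecoverable is caught and routed to a human-review state
# instead of failing the whole execution.
textract_state = {
    "ExtractText": {
        "Type": "Task",
        "Resource": "arn:aws:states:::aws-sdk:textract:analyzeDocument",
        "Retry": [
            {
                "ErrorEquals": ["Textract.ThrottlingException"],
                "IntervalSeconds": 2,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }
        ],
        "Catch": [
            {
                "ErrorEquals": ["States.ALL"],
                "Next": "SendToHumanReview",
            }
        ],
        "Next": "ClassifyDocument",
    }
}

print(json.dumps(textract_state, indent=2))
```

The same Retry/Catch pattern applies to the Bedrock calls; throttling is the error you'll see most under load.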
Why Textract + Bedrock, Not Just Bedrock?
Fair question. Multimodal models like Claude on Bedrock can look at a document image and read it directly. Why bother with Textract at all?
Three reasons:
Structured extraction is more reliable with Textract. Textract was purpose-built for OCR. It returns tables as actual table structures with rows and columns, and forms as key-value pairs. When you send a document image to an LLM, it reads the text but loses the spatial relationships. A table that's obvious to a human can confuse an LLM when columns aren't clearly aligned.
Textract is significantly cheaper. Sending a full-page document image to a multimodal LLM costs roughly 10-20x what Textract charges for the same page. When you're processing thousands of documents, that adds up fast.
The combination is more powerful than either alone. Textract extracts the raw data with high fidelity. Bedrock reasons about what that data means — classifying the document, mapping fields to your schema, handling ambiguity. Each tool does what it's best at.
That said, we do use Bedrock's multimodal capabilities as a fallback. Some documents defeat Textract — handwritten notes, unusual layouts, poor scan quality. For those, sending the image directly to Claude on Bedrock is a good backup strategy.
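For that fallback path, the page image goes into the message body using Claude's multimodal content format. A minimal sketch of building the request body (the transcription prompt is invented; pair it with `bedrock.invoke_model` as in the examples below):

```python
import base64
import json

def build_multimodal_request(image_bytes: bytes, media_type: str = "image/png") -> str:
    """Build an invoke_model body that sends a page image directly to Claude."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            # Image bytes must be base64-encoded in the body
                            "data": base64.b64encode(image_bytes).decode("ascii"),
                        },
                    },
                    {"type": "text", "text": "Transcribe all text in this document image."},
                ],
            }
        ],
    })
```

This costs more per page than Textract, so we only take this route after Textract output fails validation or comes back with low confidence.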
Document Classification with Bedrock
Before you can extract structured data, you need to know what you're looking at. An invoice needs different fields extracted than a contract or a tax form.
We classify documents by sending the Textract output to Claude on Bedrock with a classification prompt. Here's the core of it:
```python
import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-2")

def classify_document(extracted_text: str) -> dict:
    prompt = f"""You are a document classification system. Classify the following document
into exactly one of these categories:

- invoice
- purchase_order
- contract
- tax_form
- bank_statement
- receipt
- unknown

If you are not confident in the classification, return "unknown".
Respond with JSON only: {{"document_type": "...", "confidence": "high|medium|low"}}

Document text:
{extracted_text[:4000]}"""

    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 256,
            "messages": [
                {"role": "user", "content": prompt}
            ],
        }),
    )

    result = json.loads(response["body"].read())
    classification = json.loads(result["content"][0]["text"])
    return classification
```
A few things worth noting:
Few-shot prompting improves accuracy significantly. The example above is zero-shot for clarity, but in production we include 2-3 examples of each document type in the prompt. This is especially important for distinguishing similar document types (invoices vs. receipts, for example).
The "unknown" category is critical. When the model isn't sure, it should say so. We route "unknown" classifications and "low" confidence results to a human review queue. This is cheaper than fixing bad data downstream, and the reviewed documents become training examples for improving the prompts.
Truncate the input. You don't need the entire document for classification. The first few thousand characters usually contain enough signals (headers, formatting, key terms) to classify correctly. This keeps the cost down.
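Putting the first two notes together, the production prompt is assembled from a small bank of labelled examples plus the truncated document text. A minimal sketch (the examples and wording here are invented placeholders, not our production prompt):

```python
# A couple of labelled snippets per document type; in production we keep 2-3
# examples for each category, chosen from human-reviewed documents.
FEW_SHOT_EXAMPLES = [
    ("INVOICE #1042\nAcme Ltd\nPayment due within 30 days\nTotal due: 1,200.00", "invoice"),
    ("RECEIPT\nCard payment approved\nThank you for shopping with us", "receipt"),
]

def build_classification_prompt(extracted_text: str, max_chars: int = 4000) -> str:
    examples = "\n\n".join(
        f"Document:\n{text}\nClassification: {label}"
        for text, label in FEW_SHOT_EXAMPLES
    )
    return (
        "You are a document classification system. Classify the document into "
        "exactly one of: invoice, purchase_order, contract, tax_form, "
        "bank_statement, receipt, unknown.\n\n"
        f"Examples:\n{examples}\n\n"
        # Truncate: the opening of the document carries most of the signal
        f"Document text:\n{extracted_text[:max_chars]}"
    )
```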
Structured Data Extraction
Classification tells you what the document is. Extraction pulls out the specific data you need. This is the core of the pipeline.
We define a schema for each document type — what fields to extract, their expected types, and which are required. Then we send the Textract output plus the schema to Claude with instructions to extract and return JSON.
```python
EXTRACTION_SCHEMAS = {
    "invoice": {
        "vendor_name": {"type": "string", "required": True},
        "invoice_number": {"type": "string", "required": True},
        "invoice_date": {"type": "date", "required": True},
        "due_date": {"type": "date", "required": False},
        "total_amount": {"type": "number", "required": True},
        "currency": {"type": "string", "required": True},
        "line_items": {
            "type": "array",
            "required": False,
            "items": {
                "description": "string",
                "quantity": "number",
                "unit_price": "number",
                "amount": "number",
            },
        },
        "vat_number": {"type": "string", "required": False},
        "payment_terms": {"type": "string", "required": False},
    }
}

def extract_fields(extracted_text: str, tables: list, document_type: str) -> dict:
    schema = EXTRACTION_SCHEMAS[document_type]
    prompt = f"""Extract the following fields from this document.
Return valid JSON matching the schema exactly. If a field is not found in the document,
use null for optional fields. Never invent or guess values.

Schema:
{json.dumps(schema, indent=2)}

Document text:
{extracted_text}

Table data:
{json.dumps(tables, indent=2)}

Respond with JSON only."""

    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 2048,
            "messages": [
                {"role": "user", "content": prompt}
            ],
        }),
    )

    result = json.loads(response["body"].read())
    extracted = json.loads(result["content"][0]["text"])
    return extracted
```
Notice we pass both the extracted text and the table data from Textract. Tables are where Textract really earns its keep — line items on an invoice, for example, are structured data that Textract preserves as rows and columns. Sending this structured table data to the LLM gives much better results than raw text where the column alignment is lost.
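Getting that `tables` list out of Textract takes a little work: `AnalyzeDocument` returns a flat list of `Block` objects, and you reassemble tables from `TABLE` and `CELL` blocks via their `CHILD` relationships. A minimal sketch (ignores merged cells and per-cell confidence):

```python
def textract_tables_to_rows(blocks: list[dict]) -> list[list[list[str]]]:
    """Convert Textract Blocks into tables, each a list of rows of cell text."""
    by_id = {b["Id"]: b for b in blocks}

    def cell_text(cell: dict) -> str:
        # A CELL's CHILD relationships point at the WORD blocks inside it
        words = []
        for rel in cell.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for cid in rel["Ids"]:
                    child = by_id[cid]
                    if child["BlockType"] == "WORD":
                        words.append(child["Text"])
        return " ".join(words)

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        # Cells carry 1-based RowIndex/ColumnIndex positions
        grid = {}
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for cid in rel["Ids"]:
                    cell = by_id[cid]
                    if cell["BlockType"] == "CELL":
                        grid[(cell["RowIndex"], cell["ColumnIndex"])] = cell_text(cell)
        if grid:
            n_rows = max(r for r, _ in grid)
            n_cols = max(c for _, c in grid)
            tables.append([
                [grid.get((r, c), "") for c in range(1, n_cols + 1)]
                for r in range(1, n_rows + 1)
            ])
    return tables
```

The resulting row lists serialise cleanly with `json.dumps` for the `tables` argument to `extract_fields`.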
Validate your outputs. Don't trust that the LLM will always return valid JSON matching your schema. Use Pydantic (or a similar validation library) to parse and validate the response. Check required fields are present, dates are valid dates, amounts are numbers. When validation fails, retry with a more explicit prompt before routing to human review.
A typical validation layer looks like this:
```python
from pydantic import BaseModel, validator
from typing import Optional
from datetime import date

class LineItem(BaseModel):
    description: str
    quantity: Optional[float] = None
    unit_price: Optional[float] = None
    amount: float

class InvoiceExtraction(BaseModel):
    vendor_name: str
    invoice_number: str
    invoice_date: date
    due_date: Optional[date] = None
    total_amount: float
    currency: str
    line_items: Optional[list[LineItem]] = None
    vat_number: Optional[str] = None
    payment_terms: Optional[str] = None

    @validator("total_amount")
    def total_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError("Total amount must be positive")
        return v
```
If the LLM output doesn't pass Pydantic validation, you have a clear, structured error to act on — retry, adjust the prompt, or route to human review.
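The retry-then-escalate flow can be captured in a small wrapper. A sketch of the shape, with the LLM call and validator injected as callables (names here are hypothetical; in practice `validate` is a Pydantic model's constructor and the exception is `ValidationError`):

```python
def extract_with_retry(call_llm, validate, max_attempts: int = 2) -> dict:
    """Run extraction, validate the result, retry once with a stricter prompt,
    and flag the document for human review if validation still fails."""
    errors = []
    for attempt in range(max_attempts):
        strict = attempt > 0  # second attempt uses a more explicit prompt
        raw = call_llm(strict=strict)
        try:
            return {"status": "ok", "data": validate(raw)}
        except Exception as exc:  # pydantic.ValidationError in practice
            errors.append(str(exc))
    # Exhausted retries: hand off to the human review queue with the errors
    return {"status": "needs_review", "errors": errors}
```

Keeping the validation errors in the review payload helps reviewers see at a glance which fields the model got wrong.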
Handling the Messy Reality
Sample documents in a demo always work perfectly. Production documents don't. Here's what to expect.
Multi-page documents. Textract handles multi-page PDFs natively, but you need to think about context windows when sending the output to Bedrock. A 50-page contract produces a lot of text. Strategies: extract from specific pages if you know where the data lives, or use a two-pass approach where the first LLM call identifies which pages contain the data you need and the second extracts from those pages only.
Poor scan quality. Faded text, skewed pages, coffee stains — they happen. Basic image preprocessing (deskewing, contrast enhancement) before Textract helps. For truly bad scans, fall back to sending the image directly to Claude on Bedrock in multimodal mode. It's more expensive per page but handles degraded quality better than OCR.
Multiple languages. Both Textract and Claude handle multiple languages well, but test with your specific document set. We've seen good results with English, French, German, and Spanish. Less common languages may need prompt adjustments or language-specific post-processing.
Confidence scoring matters. Build a composite confidence score from multiple signals:
- Textract's per-word confidence scores (available in the API response)
- The LLM's expressed certainty (ask it to rate its confidence)
- Validation pass rate (how many required fields were extracted successfully)
- Cross-reference checks (does the line item total match the invoice total?)
Weight these signals and set a threshold. Documents below the threshold go to human review.
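A minimal sketch of combining those four signals into one score on a 0-1 scale. The weights and the mapping from the LLM's "high/medium/low" labels are illustrative; tune them against your own reviewed documents:

```python
def composite_confidence(
    textract_word_conf: float,   # mean per-word confidence from Textract, 0-100
    llm_confidence: str,         # "high" | "medium" | "low" from the model
    required_field_rate: float,  # fraction of required fields extracted, 0-1
    cross_checks_passed: bool,   # e.g. line items sum to the invoice total
) -> float:
    llm_score = {"high": 1.0, "medium": 0.6, "low": 0.2}.get(llm_confidence, 0.0)
    # Weights are illustrative starting points, not tuned values
    return (
        0.3 * (textract_word_conf / 100)
        + 0.2 * llm_score
        + 0.3 * required_field_rate
        + 0.2 * (1.0 if cross_checks_passed else 0.0)
    )
```

Documents scoring below your threshold (say, 0.8) go to the human review queue; the rest flow straight through.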
The human-in-the-loop is not a failure — it's a design feature. Every document processing system needs a human review path. Start with AI processing plus human review of every document. Measure accuracy. Gradually increase automation as confidence improves. The reviewed documents become your ground truth for measuring accuracy and improving prompts over time. A system that processes 80% of documents automatically and routes 20% to humans is vastly more valuable than one that processes 100% with 15% errors.
Cost and Performance
Real numbers, because vague "it's cheap" claims aren't useful for planning.
Textract pricing (eu-west-2, as of early 2026):
- Text detection: ~$1.50 per 1,000 pages
- Form extraction (key-value pairs): ~$50 per 1,000 pages
- Table extraction: ~$15 per 1,000 pages
- For a typical invoice using text + tables: roughly $0.02 per page
Bedrock Claude costs (Claude 3.5 Sonnet):
- Classification call (~500 input tokens, ~50 output): ~$0.002
- Extraction call (~2,000 input tokens, ~500 output): ~$0.01
- Total LLM cost per document: typically $0.01-$0.05 depending on document length and complexity
Total cost per typical invoice: $0.03-$0.07. Compare that to manual data entry costs of $1-5 per document and the case makes itself.
Processing time: 5-15 seconds per document end-to-end. Textract is the bottleneck at 2-8 seconds depending on page count. The Bedrock calls add 1-3 seconds each. For most use cases this is fast enough. If you're processing in bulk, use SQS to decouple ingestion from processing and run multiple Step Functions executions in parallel.
Orchestration tip: Step Functions Express Workflows are cheaper than Standard Workflows for short-lived executions (under 5 minutes). Since document processing typically completes in under 30 seconds, Express Workflows make sense and save roughly 50% on orchestration costs at scale.
Getting Started
If you're planning a document processing pipeline, here's the approach that works:
Start with one document type. Pick the highest-volume document that's currently handled manually. Get it working end-to-end — extraction, classification, validation, human review — before adding more document types. Each new type is mostly prompt engineering once the infrastructure is in place.
Collect 50+ sample documents before building. Variety matters more than volume. You need to see the range of formats, layouts, and edge cases your system will encounter. Five samples from one vendor won't prepare you for the real world.
Build the human review interface early. You'll need it for edge cases from day one, and the reviewed documents become your training data for measuring and improving accuracy. A simple web form that shows the document alongside the extracted data, with the ability to correct fields, is enough to start.
Monitor extraction accuracy weekly. Track what percentage of documents pass through without human intervention, what percentage need correction, and what the most common errors are. Use this data to refine prompts and add validation rules.
Set up CloudWatch custom metrics for the numbers that matter:
- Documents processed per hour
- Automated vs. human-reviewed ratio
- Average confidence score
- Cost per document
- End-to-end processing time (P50, P95)
These metrics tell you when prompts are degrading, when a new document format is appearing that your system doesn't handle, and whether the ROI is tracking to plan.
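A sketch of assembling that metric data in the shape `put_metric_data` expects (the namespace and metric names are our convention, not anything AWS-defined):

```python
def build_pipeline_metrics(processed: int, auto_approved: int,
                           avg_confidence: float, total_cost: float) -> list[dict]:
    """Build a MetricData list for cloudwatch.put_metric_data."""
    automation_pct = (auto_approved / processed * 100) if processed else 0.0
    cost_per_doc = (total_cost / processed) if processed else 0.0
    return [
        {"MetricName": "DocumentsProcessed", "Value": float(processed), "Unit": "Count"},
        {"MetricName": "AutomationRatio", "Value": automation_pct, "Unit": "Percent"},
        {"MetricName": "AverageConfidence", "Value": avg_confidence, "Unit": "None"},
        {"MetricName": "CostPerDocument", "Value": cost_per_doc, "Unit": "None"},
    ]
```

Published from the end of each workflow run, e.g. `boto3.client("cloudwatch").put_metric_data(Namespace="DocumentPipeline", MetricData=build_pipeline_metrics(...))`, with CloudWatch alarms on the automation ratio to catch prompt or format drift early.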
Drowning in Manual Document Processing?
We build intelligent document pipelines on AWS. Let's talk about what you could automate.
Book a Free Session