LLaMA on AWS Bedrock: Ditching OpenAI for Production NLP

Note: To protect proprietary details, specific names, labels, and configurations in this post have been adapted into a similar but fictional context. The architectural decisions, trade-offs, and lessons learned are real.

Why Move Away from OpenAI

Three reasons pushed us from OpenAI to AWS Bedrock with LLaMA:

Cost — At our scale, OpenAI API costs were significant. Bedrock’s pricing for LLaMA was substantially lower.
Data residency — Our data stayed within our AWS environment. No external API calls, no data leaving our VPC.
Control — No dependency on OpenAI’s model deprecation schedule, rate limits, or availability.

The Bedrock Setup

AWS Bedrock provides managed access to foundation models without managing infrastructure. We focused first on the classification service — the core engine for multi-label text analysis. The summarizer followed as a separate system (covered in a dedicated post).

Classification Service (LLaMA 3 70B)

Multi-label text classification across around 25 dimensions — covering aspects like sentiment polarity, visual appeal, credibility, perceived quality, purchase likelihood, eco-consciousness, and more:

Model: LLaMA 3 70B via AWS Bedrock
Batch processing: Texts grouped into batches of 40-60, processed with high concurrency
Temperature: 0.5 with top-p sampling
Output: Structured JSON with label assignments per text
Prompts: Extensive label definitions (~30KB of prompt engineering) with examples and edge cases, versioned per environment
Pydantic validation on every response
DynamoDB caching layer for repeat classifications

Prompt Engineering for LLaMA vs GPT

LLaMA and GPT respond differently to the same prompts. Key differences we discovered:

Instruction following. GPT-3.5 is more forgiving with loose instructions. LLaMA 3 needs more explicit formatting directives — especially for JSON output.

Context window. LLaMA 3 70B has a generous context window but we found accuracy degraded with very long inputs. We kept batch sizes moderate (50 texts) rather than stuffing the context.

Temperature sensitivity. LLaMA at temperature 0.5 produced more diverse outputs than GPT at the same temperature. We tuned per-task.

JSON reliability. LLaMA occasionally produced JSON with trailing commas or missing brackets. We added a json_repair library to handle this automatically.

Architecture Decisions

Lambda + Bedrock, not ECS. Since Bedrock handles the model hosting, our inference code is just an API call. Lambda was sufficient — no need for persistent compute.

25 retries with exponential backoff. Bedrock occasionally throttles or times out. Generous retry logic ensured batch completion.

MLflow for prompt management. We stored prompt templates as MLflow artifacts, versioned and tagged per environment (dev/prod). This let us update prompts without code deploys.

Dual model support. The summarizer supports both LLaMA and Claude, selectable per request. This gave us flexibility when one model performed better on specific analysis types.

Results

Compared to the OpenAI approach:

60-70% cost reduction for equivalent throughput
Comparable accuracy on classification benchmarks
Better data governance — everything stays in AWS
More predictable latency — no external API variability

What Still Wasn’t Perfect

The 70B model is powerful but expensive per-token. For classification — where the output is just a single label — using a 70B parameter model felt wasteful. This led us to explore fine-tuning a smaller model with LoRA adapters.

Technical Stack

AWS Bedrock Runtime — model inference
Meta LLaMA 3 / 3.1 70B — primary model
Anthropic Claude Sonnet — fallback model
AWS Lambda — orchestration
MLflow — prompt versioning
DynamoDB — response caching
json_repair — malformed JSON recovery
Pydantic — response validation