Scaling RAG Architecture for 1 Million Users: A No-Fluff Engineering Guide

How to move your RAG system from a weekend prototype to something that actually survives production traffic.

**Who this is for:** Backend engineers and platform teams who already have a working RAG proof-of-concept and are preparing to take it to production at scale. If you haven't built a basic RAG system yet, start there first — this guide picks up where the tutorials leave off.

Let's be honest — most RAG tutorials show you how to get something working in a Jupyter notebook. You load a few PDFs, chunk them up, throw them into a vector store, and chain it all together with LangChain or a handful of OpenAI calls. It works. It's impressive in a demo. And it will absolutely fall apart the moment real users start hitting it.

This guide is about what happens after that. If you're planning to serve hundreds of thousands — or even a million — concurrent users with a RAG system, you need to stop thinking in terms of scripts and start thinking in terms of systems. The good news? The principles aren't alien. They're the same distributed systems ideas you already know — just applied to a new kind of workload. The bad news? There are a lot of moving parts, and getting any one of them wrong can silently degrade your entire system.

So let's walk through it, section by section, the way you'd actually build it.


1. Why a Single Pipeline Won't Cut It

The first and most important mindset shift: your ingestion path and your query path need to be completely separate.

In a POC, this doesn't matter. You ingest a few documents, then you query them. It's sequential. But at scale, ingestion is a continuous process — new documents are constantly being chunked, embedded, and indexed. If that workload lives on the same servers as the ones answering live user queries, you've got a classic "noisy neighbor" problem. A heavy re-indexing job starts running at 2 PM, and suddenly your p99 query latency doubles.

The solution is straightforward in concept: split it into two independent pipelines.

The Ingestion Pipeline handles everything upstream of the vector store — reading raw documents, chunking them, computing embeddings, and writing vectors. This runs asynchronously, often on a queue-based architecture. It doesn't need to be fast; it needs to be reliable and scalable.

The Query Pipeline handles everything a user touches — receiving the query, embedding it, searching the vector store, re-ranking results, and calling the LLM for a final answer. This one needs to be fast. We're talking sub-second response times, ideally.

At scale, both of these live as independent microservices or containerized workloads — think Kubernetes pods behind a load balancer, auto-scaling based on traffic. The ingestion workers pull from a queue like Kafka or SQS. The query servers sit behind an API gateway. They don't talk to each other except through the vector store itself.

┌─────────────────────────────────────────────────────────────────────┐
│                        HIGH-LEVEL ARCHITECTURE                      │
│                                                                     │
│  ┌──────────────┐    ┌─────────┐    ┌────────────┐                  │
│  │  Raw Docs    │───▶│  Queue  │───▶│  Ingestion │                  │
│  │  (S3, DB…)   │    │ (Kafka) │    │  Workers   │                  │
│  └──────────────┘    └─────────┘    │  ┌──────┐  │                  │
│                                     │  │Chunk │  │                  │
│                                     │  │Embed │  │                  │
│                                     │  └──┬───┘  │                  │
│                                     └─────┼──────┘                  │
│                                           ▼                         │
│                                    ┌──────────────┐                 │
│                                    │ Vector Store │                 │
│                                    │  (Pinecone)  │                 │
│                                    └──────┬───────┘                 │
│                                           ▲                         │
│                                     ┌─────┴──────┐                  │
│  ┌─────────┐    ┌────────────┐      │   Query    │                  │
│  │  User   │───▶│ API Gateway│─────▶│  Servers   │                  │
│  │ (1M)    │    │ + LB       │      │            │                  │
│  └─────────┘    └────────────┘      └────────────┘                  │
└─────────────────────────────────────────────────────────────────────┘

This separation is the foundation. Everything else we'll discuss builds on top of it.
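The decoupling can be sketched in a few lines. This is a toy in-process version — in production the queue would be Kafka or SQS and each worker its own pod — but the shape is the same (`runWorkerPool` and `Doc` are illustrative names, not part of any framework):

```typescript
type Doc = { id: string; text: string };

async function runWorkerPool(
  queue: Doc[],
  workerCount: number,
  processDoc: (doc: Doc) => Promise<void>
): Promise<void> {
  const worker = async () => {
    // Each worker pulls from the shared queue until it's drained
    while (queue.length > 0) {
      const doc = queue.shift();
      if (doc) await processDoc(doc);
    }
  };
  // Workers run concurrently, and scale independently of the producers
  // that fill the queue — exactly the property the split above buys you
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
}
```

The key property: producers (document sources) and consumers (ingestion workers) never talk directly, so either side can scale or fail without taking the other down.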


2. Chunking: It's Not as Simple as You Think

Before anything hits the vector store, your raw documents need to be broken into chunks. And this is one of those steps that people rush through in POCs but absolutely matters in production.

Here's the core tension: chunks need to be small enough that the embedding model can represent them meaningfully, but large enough that they actually contain useful context. Too small, and you lose coherence — the LLM gets a bunch of disconnected fragments. Too large, and your embeddings become vague, your retrieval becomes noisy, and you're wasting context window space.

**Tip:** The sweet spot most teams land on is somewhere between 256 and 1024 tokens per chunk, depending on your documents. For structured content like legal docs or technical manuals, section-level splitting works well. For looser content like articles or chat logs, paragraph-level splitting is usually better.

One trick that's easy to overlook: chunk overlap. If you slice a document into perfectly non-overlapping segments, you'll inevitably cut sentences or ideas right at the boundary. Adding a 10–20% overlap between consecutive chunks means that important context near a boundary shows up in both chunks. It's a small thing, but it meaningfully improves retrieval quality.

Here's what that looks like in practice:

chunker.ts

import GPT3Tokenizer from "gpt3-tokenizer";
 
function splitTextIntoChunks(text: string): string[] {
  const tokenizer = new GPT3Tokenizer({ type: "gpt3" });
  const maxTokens = 512;
  const overlap = 50; // ~10% of 512 — this is doing real work
 
  // encode() returns { bpe, text }; bpe holds the token IDs
  const { bpe: tokens } = tokenizer.encode(text);
  const chunks: string[] = [];
 
  // Step by (maxTokens - overlap) so consecutive chunks share ~50 tokens
  for (let i = 0; i < tokens.length; i += maxTokens - overlap) {
    const chunkTokens = tokens.slice(i, i + maxTokens);
    chunks.push(tokenizer.decode(chunkTokens));
  }
 
  return chunks;
}

Now, documents aren't static. When your source material changes, you need to update the index. Most teams start with a nightly batch re-index — it's simple, it works, and for most use cases the data freshness is acceptable. When you actually need real-time updates, that's when you bring in change-data-capture (CDC) tools like Debezium to stream updates into your ingestion pipeline. But don't reach for CDC on day one. It adds significant operational complexity, and most teams don't actually need it until they've proven everything else works.
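One way to keep that nightly batch cheap without reaching for CDC is a content-hash diff: re-chunk and re-embed only the documents whose content actually changed. A minimal sketch, assuming you persist a doc-ID-to-hash map between runs (`docsNeedingReindex` is an illustrative name):

```typescript
import { createHash } from "crypto";

function docsNeedingReindex(
  docs: { id: string; text: string }[],
  lastKnownHashes: Map<string, string> // persisted between runs
): string[] {
  const changed: string[] = [];
  for (const doc of docs) {
    const hash = createHash("sha256").update(doc.text).digest("hex");
    if (lastKnownHashes.get(doc.id) !== hash) {
      changed.push(doc.id);
      lastKnownHashes.set(doc.id, hash); // record the new hash
    }
  }
  return changed;
}
```

On most corpora only a small fraction of documents change day to day, so this turns a full re-index into an incremental one with a few lines of bookkeeping.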

**Warning:** Don't forget deletions. When a document is removed from your source system, the old vectors are still sitting in your vector store unless you explicitly clean them up. Build a mechanism for this — even if it's just a background job that periodically checks for tombstones. Stale vectors are a silent source of bad retrieval results.
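The core of such a sweep is just a set difference. A sketch, assuming you can list vector IDs from the index and doc IDs from the source system, and that chunk IDs embed their parent doc ID as `docId#chunkNo` (that naming convention is an assumption for illustration, not a standard):

```typescript
function findStaleVectorIds(
  indexedIds: Iterable<string>, // every vector ID currently in the index
  sourceDocIds: Set<string>     // every doc ID still present in the source
): string[] {
  const stale: string[] = [];
  for (const id of indexedIds) {
    // Chunk IDs like "doc42#3" map back to their parent doc ID
    const docId = id.split("#")[0];
    if (!sourceDocIds.has(docId)) stale.push(id);
  }
  return stale; // hand these to your vector store's delete API
}
```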


3. Embedding Smartly: Don't Recompute What You Already Know

Computing embeddings isn't free. If you're using a hosted model like OpenAI's, every embedding call is an API round-trip with associated latency and cost. At scale — where you might be embedding millions of chunks and handling thousands of queries per second — this adds up fast.

The good news is that embeddings are deterministic. The same input text will always produce the same vector. That means caching is a perfect fit.

The pattern is simple: before you call the embedding API, check a cache first. If you've already embedded this exact text, return the cached vector. If not, call the API, store the result, and move on.

embed-cache.ts

import OpenAI from "openai";
 
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const embedCache = new Map<string, number[]>();
 
async function getEmbedding(text: string): Promise<number[]> {
  // Hit the cache first — no API call needed
  if (embedCache.has(text)) {
    return embedCache.get(text)!;
  }
 
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
 
  const vector = res.data[0].embedding;
 
  // Store for next time
  embedCache.set(text, vector);
  return vector;
}

In a single-server setup, an in-memory Map works fine. But once you're running multiple query servers behind a load balancer, you need a shared cache — Redis is the go-to here. Set a TTL of around an hour. Semantic content doesn't change that fast, and you don't want stale vectors hanging around forever either.
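The TTL behavior itself is simple enough to sketch with an in-process Map — in a real multi-server deployment you'd back the same interface with Redis (`SET key value EX 3600`), but the semantics are identical:

```typescript
class TTLCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();
  constructor(private ttlMs: number) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // lazily evict expired entries on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```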

**Info:** This single optimization can cut your embedding API costs dramatically. And it's one of those things where the effort-to-impact ratio is genuinely excellent. Do this before almost anything else.


4. Caching Goes Deeper Than You Think

Embedding caching is the most obvious layer. But it's not the only one. A well-designed RAG system actually caches at three distinct levels, and the payoff at each one is significant.

| Layer | What's Cached | TTL | Why It Matters |
| --- | --- | --- | --- |
| Embedding Cache | text → vector | ~1 hour | Avoids redundant API calls for repeated inputs |
| Retrieval Cache | query → [doc IDs] | ~30 min | Skips the vector search entirely for recent queries |
| Answer Cache | query → final LLM answer | ~1–2 hours | The big one — no LLM call at all for repeated questions |

The Answer Cache is where it gets really interesting, and that's where semantic caching comes in. The idea: when you get a new query, embed it and do a fast vector search against a cache of previous queries (not your main document index — a separate, lightweight index of past Q&A pairs). If the similarity score is above some threshold — say 0.9 — you return the cached answer directly. No retrieval. No LLM call. Done.

semantic-cache.ts

async function semanticAnswer(query: string): Promise<string> {
  const qEmb = await getEmbedding(query);
 
  // Search the semantic cache — a separate, small vector index of past Q&As
  const results = await cacheIndex.query({
    vector: qEmb,
    topK: 1,
    includeMetadata: true,
  });
 
  const best = results.matches[0];
 
  // If we've seen something close enough, return that answer immediately.
  // The guard also handles the empty-cache case, where there's no match at all.
  if (best && (best.score ?? 0) > 0.9 && best.metadata?.answer) {
    return best.metadata.answer as string; // <100ms response
  }
 
  // Miss — run the full retrieval + LLM pipeline
  return await fullRAGPipeline(query);
}

The threshold tuning matters here. Too low, and you'll return answers to questions that are subtly different — which can be worse than no answer at all. Too high, and you never get cache hits. Most teams land somewhere in the 0.85–0.95 range and tune from there based on their specific domain.

**Real-world impact:** Semantic caching can reduce LLM API calls by 60–70%. A cached response can come back in under 100ms. A full LLM pipeline takes seconds. This is one of the highest-leverage optimizations in the entire stack. Always track your hit/miss rates — that's how you know whether your thresholds are set right.


5. Vector Store Scaling: Why You Need a Real Database

An in-memory list of vectors and a brute-force cosine similarity loop work fine for testing. They don't work for production. Once you're storing millions or tens of millions of vectors and fielding thousands of queries per second, you need a purpose-built vector database.

Managed services like Pinecone, Weaviate, or Qdrant handle the heavy lifting — sharding, replication, index management — so you don't have to. Redis also supports vector search natively, which is worth considering if you're already using it for caching.

Let's use Pinecone as a concrete example. Pinecone's serverless mode uses a "slab" architecture: your data is partitioned into fixed-size shards (250 GB each), and reads are distributed across those shards in parallel. Hot data gets cached in memory. The result is predictable, low-latency queries even at high throughput.

The two main scaling knobs:

| Knob | What It Scales | How |
| --- | --- | --- |
| Shards | Storage capacity | Add more as your dataset grows |
| Replicas | Query throughput | 2 replicas ≈ 2× QPS (roughly linear) |

If you need extreme, predictable throughput — hundreds to thousands of QPS with strict latency SLAs — Pinecone offers Dedicated Read Nodes (DRN). These provision fixed hardware with no shared rate limits. It's more expensive, but for mission-critical applications, it's worth the peace of mind.

**Warning:** Deploy your vector store in the same cloud region as your application. Cross-region latency will silently destroy your p99s. This is a common mistake that's annoying to diagnose once it's in production.


6. Hybrid Retrieval & Re-Ranking: Because Vector Search Alone Isn't Enough

Pure vector search is great at finding semantically similar content. It's not great at finding exact matches. If a user types a specific product SKU or a technical error code, pure semantic search might return something "close" — but not the thing they actually need.

Hybrid retrieval solves this by combining vector search with traditional keyword/lexical search (usually BM25 or full-text search via Elasticsearch). You run both in parallel, then merge the results.

The merging step is where it gets interesting. A simple and effective technique is Reciprocal Rank Fusion (RRF): each result gets a score based on its rank in each retrieval method, and you combine those scores. It's simple to implement and works surprisingly well without much tuning.
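RRF is short enough to show in full. A sketch, assuming each input ranking is a best-first list of document IDs (k = 60 is the constant from the original RRF paper, and a common default):

```typescript
function reciprocalRankFusion(
  rankings: string[][], // e.g. [vectorResults, keywordResults], best first
  k: number = 60
): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, rank) => {
      // A doc's fused score is the sum of 1 / (k + rank) across rankings;
      // rank is 0-based here, so add 1 to make the top result rank 1
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  // Highest fused score first
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}
```

A document that appears near the top of both lists beats one that's first in only one of them, which is exactly the behavior you want from hybrid retrieval.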

After retrieval, you typically have a set of candidate documents that are probably relevant, but not necessarily ordered correctly. This is where re-ranking comes in — and there are two main approaches:

Cross-Encoder

Cross-encoder models score the query and each document together, unlike bi-encoder embeddings which encode them separately. The result is a much more accurate relevance score. Since you're only scoring a small set of candidates (usually 10–20), the latency hit is fine.

// Use a cross-encoder to re-rank top candidates
async function rerankWithCrossEncoder(
  query: string,
  contexts: string[]
): Promise<string[]> {
  const result = await pinecone.inference.rerank(
    "cohere-rerank-3.5",
    query,
    contexts
  );
 
  // result.data comes back already sorted by relevance score, descending;
  // each entry's index points back into the contexts array
  return result.data.map((match) => contexts[match.index]);
}

MMR

Maximal Marginal Relevance considers diversity — it penalizes results that are too similar to each other. This is particularly valuable when feeding results into an LLM, because sending five near-duplicate chunks wastes context window space and doesn't actually give the model more information.

// MMR: balance relevance with diversity
function mmrRank(
  docEmbeddings: number[][],
  queryEmb: number[],
  k: number, // how many docs to select
  lambda: number = 0.7 // higher = more relevance, lower = more diversity
): number[] {
  const dot = (a: number[], b: number[]) => a.reduce((s, v, i) => s + v * b[i], 0);
  const sim = (a: number[], b: number[]) =>
    dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
 
  const selected: number[] = [];
  const remaining = docEmbeddings.map((_, i) => i);
  while (selected.length < k && remaining.length > 0) {
    // Greedily pick the doc that maximizes:
    //   λ * sim(doc, query) - (1 - λ) * max(sim(doc, already_selected))
    let best = remaining[0];
    let bestScore = -Infinity;
    for (const i of remaining) {
      const redundancy = selected.length
        ? Math.max(...selected.map((j) => sim(docEmbeddings[i], docEmbeddings[j])))
        : 0;
      const score = lambda * sim(docEmbeddings[i], queryEmb) - (1 - lambda) * redundancy;
      if (score > bestScore) [bestScore, best] = [score, i];
    }
    selected.push(best);
    remaining.splice(remaining.indexOf(best), 1);
  }
  return selected; // indices into docEmbeddings, in selection order
}

**Info:** The goal at the end of this pipeline is the smallest possible set of highly relevant documents to feed to the LLM. Every irrelevant chunk you include wastes tokens, increases latency, and can actually confuse the model.


7. Multi-Tenancy: Keeping Customers' Data Separate

If you're building a SaaS product on top of RAG — which, increasingly, a lot of companies are — you need to think carefully about data isolation. Customer A's documents should never show up in Customer B's queries. Ever.

The standard pattern with most vector stores is namespace-based isolation. In Pinecone, you can have a single index but partition it into namespaces — one per tenant. Reads and writes are automatically scoped to the namespace you specify. It's clean, it's simple, and it scales well.

multi-tenant.ts

import { Pinecone } from "@pinecone-database/pinecone";
 
const pinecone = new Pinecone(); // reads PINECONE_API_KEY from the environment
const index = pinecone.index("my-rag-index");
 
// Ingestion: write to the tenant's namespace
await index.namespace("tenant-acme-corp").upsert([
  {
    id: "doc1",
    values: embedding,
    metadata: { title: "Q3 Report", createdAt: "2026-01-15" },
  },
]);
 
// Query: automatically scoped — no cross-tenant leakage possible
const results = await index.namespace("tenant-acme-corp").query({
  vector: queryVec,
  topK: 5,
  includeMetadata: true,
});
 
// Offboarding: one call, all their data gone
await index.namespace("tenant-acme-corp").deleteAll();

**Cost tip:** Querying a small, scoped namespace is cheaper (in terms of resource units) than running a filtered query across a large index. Namespaces aren't just an isolation strategy — they're a cost optimization too.

A word of caution: namespaces give you logical isolation, but they're still within the same physical index. If one tenant is hammering the system with queries, it can affect latency for others — though managed services do a decent job mitigating this internally. For truly mission-critical isolation, you might need separate indices or even separate clusters per high-value tenant.


8. Ingesting at Scale: Queues, Workers, and Fault Tolerance

When you first need to index a large corpus — say, millions of documents — doing it one at a time is painfully slow. The solution is batch processing with a worker pool.

  Raw Docs          Queue           Workers           Vector Store
  ┌────────┐     ┌─────────┐     ┌──────────┐      ┌────────────┐
  │ Doc 1  │────▶│         │────▶│ Worker 1 │─────▶│            │
  │ Doc 2  │────▶│  Kafka  │────▶│ Worker 2 │─────▶│  Pinecone  │
  │ Doc 3  │────▶│  / SQS  │────▶│ Worker 3 │─────▶│            │
  │  ...   │────▶│         │────▶│   ...    │─────▶│            │
  │ Doc N  │────▶│         │────▶│ Worker N │─────▶│            │
  └────────┘     └─────────┘     └──────────┘      └────────────┘
                                      ↑
                              If a worker crashes,
                              message goes back to
                              the queue — no data lost

The queue decouples producers from consumers, adds built-in retry logic (if a worker crashes, the message goes back to the queue), and lets you scale workers independently based on load. Pinecone's upsert API supports batching hundreds or thousands of vectors per call, which dramatically reduces overhead compared to individual upserts.
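The batching itself is a tiny helper: split your vector array into fixed-size groups and upsert each group in one call. The batch size of a few hundred follows the guidance above; `toBatches` is an illustrative name:

```typescript
function toBatches<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    // The last batch holds whatever remainder is left
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Usage sketch inside a worker:
//   for (const batch of toBatches(vectors, 200)) {
//     await index.upsert(batch); // one network round-trip per 200 vectors
//   }
```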

For real-time updates — when new content needs to be searchable within seconds or minutes of being published — you graduate from batch processing to streaming. Apache Kafka is the most common choice here. It handles enormous throughput, guarantees ordering, and integrates well with stream processing frameworks like Flink or Spark Streaming.

**Warning:** Don't over-engineer this early. If your documents change once a day, a nightly batch job is perfectly fine. Streaming pipelines are more complex to operate, and the operational overhead isn't justified unless you genuinely need sub-minute data freshness. Start simple, add complexity when you feel the pain.


9. Load Balancing, Rate Limiting, and Not Shooting Yourself in the Foot

At a million users, you need to think about traffic management at every layer of the stack.

Load balancing is table stakes. Put your query API servers behind a standard load balancer (AWS ELB, Nginx, Kubernetes Ingress — doesn't matter much). Auto-scale based on CPU and latency. Use connection pooling for your vector store and Redis connections. If you're not pooling connections, you're creating and tearing down network connections on every request, which is expensive.

Rate limiting is where it gets interesting. You need it at multiple levels:

| Level | Purpose | Example |
| --- | --- | --- |
| User | Prevent bots / abuse | Block after 100 req/min per user |
| Tenant | Prevent noisy neighbors | Cap at 500 req/min per org |
| API | Handle upstream limits | Exponential backoff on 429s |
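The user and tenant tiers can both be built on a simple fixed-window counter. A sketch with in-process state — in production the counters would live in Redis (`INCR` plus `EXPIRE`) so every query server shares them:

```typescript
class RateLimiter {
  private windows = new Map<string, { windowStart: number; count: number }>();
  constructor(private limit: number, private windowMs: number) {}

  allow(key: string, now: number = Date.now()): boolean {
    const w = this.windows.get(key);
    if (!w || now - w.windowStart >= this.windowMs) {
      // New window: reset the counter for this user/tenant
      this.windows.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (w.count < this.limit) {
      w.count++;
      return true;
    }
    return false; // over the limit — respond with 429
  }
}
```

The same class covers both tiers: instantiate one limiter keyed by user ID (100/min) and another keyed by tenant ID (500/min), and require both to pass.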

At the API level, you're rate-limited by your upstream providers whether you like it or not. OpenAI has rate limits. Pinecone has rate limits. If you hit them, you get 429 errors. The solution: catch those errors, back off with exponential backoff and jitter, and retry.

retry.ts

async function embedWithRetry(text: string): Promise<number[]> {
  const maxRetries = 5;
  let waitMs = 500;
 
  for (let i = 0; i < maxRetries; i++) {
    try {
      const res = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: text,
      });
      return res.data[0].embedding;
    } catch (err: any) {
      if (err.status === 429) {
        // Jitter prevents thundering herd on retry
        await new Promise((r) => setTimeout(r, waitMs + Math.random() * 200));
        waitMs *= 2; // Exponential backoff
      } else {
        throw err; // Don't retry non-recoverable errors
      }
    }
  }
 
  throw new Error("Failed to get embedding after max retries");
}

**Why jitter matters:** If you have 100 servers all hitting a rate limit simultaneously and they all back off for exactly 1 second, they'll all retry at exactly the same time and hit the limit again. Jitter spreads them out. It's a small detail that makes a big difference under load.

Apply this same retry pattern to your LLM generation calls, not just embeddings. Any external API call that can fail transiently deserves it.


10. Observability: You Can't Fix What You Can't See

This section is short, but don't skip it. Observability is the difference between debugging a production issue in 10 minutes and debugging it in 10 hours.

At minimum, you need to be tracking these four things:

| Metric | Why It Matters |
| --- | --- |
| Latency per stage | Tells you which part of the pipeline is the bottleneck |
| Cache hit rates | A sudden drop means something broke in your caching layer |
| Error rates by type | 429s, timeouts, malformed responses — your early warning system |
| Queue depth | If it's growing, you need more ingestion workers. Now. |

The best way to get visibility across all of this is distributed tracing — a single request ID that flows through every stage of the pipeline, from the initial query all the way through to the final LLM response. When a request is slow, you pull up the trace and immediately see where the time went. Tools like Datadog, OpenTelemetry, or even just structured logging with correlation IDs will get you there.
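A lightweight way to get per-stage latency before you have a full tracing stack is to wrap each stage and record its duration on a per-request object. A sketch (`timeStage` is an illustrative name; you'd ship the collected timings to your metrics backend along with the request ID):

```typescript
async function timeStage<T>(
  timings: Record<string, number>, // one object per request
  stage: string,
  fn: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    // Record duration even if the stage throws — failures are the
    // traces you most want to see
    timings[stage] = Date.now() - start;
  }
}

// Usage sketch for one request:
//   const timings: Record<string, number> = {};
//   const emb = await timeStage(timings, "embed", () => getEmbedding(q));
//   const docs = await timeStage(timings, "search", () => search(emb));
//   // timings now looks like { embed: 12, search: 48 } in ms
```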


11. Putting It All Together — A Phased Approach

Here's the honest truth: you don't need to build all of this on day one. The teams that build successfully at scale almost always start simple and add complexity in response to actual problems, not anticipated ones.

1. Foundation

Two pipelines (ingestion and query), basic chunking with overlap, embedding caching, a managed vector store, and retry logic on external API calls.

This gets you a solid, production-grade RAG system that can handle meaningful traffic.

2. Scale

Multi-layer caching (retrieval + semantic caching), hybrid retrieval with re-ranking, namespace-based multi-tenancy, and streaming ingestion.

This moves you from "works well" to "works at scale."

3. Bulletproof

Dedicated read nodes, full CDC pipelines, advanced observability dashboards, and sophisticated per-user/per-tenant rate-limiting tiers.

This moves you from "works at scale" to "I can sleep at night."

The architecture isn't magic. It's just a collection of well-understood distributed systems patterns — applied thoughtfully to the specific constraints of RAG workloads. The key is understanding why each piece exists, so that when you're making tradeoffs (and you will be), you're making informed ones.