Building a RAG Semantic Pipeline: A Complete Guide to AI-Powered Contextual Search

In the age of Large Language Models (LLMs), one of the most powerful emerging patterns is Retrieval-Augmented Generation (RAG). This technique bridges the gap between static AI knowledge and dynamic, context-aware responses. In this post, I'll walk you through building a complete RAG semantic pipeline using TypeScript, OpenAI, and Pinecone.

What is RAG and Why Should You Care?

Traditional LLMs have a significant limitation: they only know what they were trained on. Ask them about your company's internal documentation, and they'll draw a blank. RAG solves this problem elegantly by:

  1. Storing your custom knowledge as vector embeddings
  2. Retrieving relevant context based on semantic similarity
  3. Augmenting the LLM's response with this context

The result? An AI that can answer questions about your data while leveraging the reasoning capabilities of powerful models like GPT-4.


Architecture Overview

Our pipeline consists of four main components:

┌─────────────┐     ┌──────────────┐     ┌─────────────┐     ┌─────────────┐
│   Express   │────▶│  Embeddings  │────▶│  Pinecone   │────▶│   OpenAI    │
│   Server    │     │   (OpenAI)   │     │  Vector DB  │     │   (GPT-4)   │
└─────────────┘     └──────────────┘     └─────────────┘     └─────────────┘

Tech Stack:

  • Runtime: Bun (fast JavaScript runtime)
  • Framework: Express.js
  • Vector Database: Pinecone
  • AI Provider: OpenAI (embeddings + chat completions)
  • Language: TypeScript

The Implementation

1. Setting Up Pinecone

First, we initialize our connection to Pinecone, a vector database that excels at similarity search:

import { Pinecone } from "@pinecone-database/pinecone";
 
const pc = new Pinecone({
  apiKey: process.env.PINECONE_KEY as string,
});
 
export const rag_index = pc.index("rag");

Pinecone stores our embeddings and enables lightning-fast similarity searches across millions of vectors.

2. Creating Embeddings

The magic of semantic search lies in embeddings—numerical representations of text that capture meaning:

import { OpenAI } from "openai";
 
const client = new OpenAI({
  apiKey: process.env.OPENAI_KEY as string,
});
 
const createEmbedding = async (prompt: string) => {
  const response = await client.embeddings.create({
    input: prompt,
    model: "text-embedding-3-small",
    dimensions: 512,
  });
 
  return response.data[0]?.embedding;
};

We use OpenAI's text-embedding-3-small model with 512 dimensions—a sweet spot between accuracy and performance.
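Pinecone computes similarity server-side, but it helps to see what "semantic similarity" means numerically. Here's a minimal cosine-similarity sketch, using toy 3-dimensional vectors in place of our 512-dimensional embeddings:

```typescript
// Cosine similarity between two embedding vectors: 1 means same
// direction (same meaning), 0 means unrelated, -1 means opposite.
const cosineSimilarity = (a: number[], b: number[]): number => {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimensionality");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i]! * b[i]!;
    normA += a[i]! * a[i]!;
    normB += b[i]! * b[i]!;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};

cosineSimilarity([1, 0, 0], [1, 0, 0]); // 1 (identical direction)
cosineSimilarity([1, 0, 0], [0, 1, 0]); // 0 (orthogonal, unrelated)
```

Scores near 1 mean the two texts are close paraphrases; scores near 0 mean they are unrelated. Cosine is a common metric choice when configuring a Pinecone index.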

3. Training: Storing Knowledge

The training step converts your content into searchable vectors. Despite the name, nothing is actually trained here: content is embedded and indexed, and no model weights change.

const train = async (content: string) => {
  // Generate embedding for the content
  const embedding = await createEmbedding(content);
  if (!embedding) {
    return { success: false, message: "failed to create embedding", data: null };
  }
 
  // Store with unique ID
  const id = crypto.randomUUID();
  await upsertRagItem({
    id: id,
    values: embedding,
    metadata: { content: content },
  });
 
  return { success: true, message: "rag data stored in the db", data: { id } };
};

Each piece of content is:

  1. Converted to a 512-dimensional vector
  2. Assigned a unique UUID
  3. Stored in Pinecone with the original content as metadata
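The `upsertRagItem` helper isn't shown in the post. Here's one plausible shape for it, written against a minimal structural interface so the sketch stays self-contained; in the real code it would close over the module-level `rag_index` instead of taking the index as a parameter. Pinecone's `upsert` accepts a batch (array) of records, so a single record is forwarded as a one-element batch:

```typescript
// Record shape that train() passes in, matching Pinecone's format.
interface RagRecord {
  id: string;
  values: number[];
  metadata: { content: string };
}

// Minimal structural view of the index: upsert takes an array of records.
interface VectorIndex {
  upsert(records: RagRecord[]): Promise<void>;
}

// Forward a single record to the index as a one-element batch.
const upsertRagItem = async (
  index: VectorIndex,
  record: RagRecord,
): Promise<void> => {
  await index.upsert([record]);
};
```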

4. Querying: The RAG Core

This is the heart of the pipeline, where retrieval and generation meet:

const core = async (main_prompt: string) => {
  // Convert user question to embedding
  const userEmbedding = await createEmbedding(main_prompt);
  if (!userEmbedding) {
    throw new Error("Failed to create query embedding");
  }
 
  // Find similar content in our knowledge base
  const searchResult = await rag_index.query({
    vector: userEmbedding,
    topK: 3,
    includeMetadata: true,
  });
 
  // Build context from retrieved documents
  const retrievedContext = searchResult.matches
    ?.map((match) => match.metadata?.content)
    .filter(Boolean)
    .join("\n\n");
 
  // Augment prompt with context
  const finalPrompt = `
You are an assistant answering using the context below.
If the context does not contain the answer, say you don't know.
 
Context:
${retrievedContext}
 
Question:
${main_prompt}
`;
 
  const response = await createChatCompletion(finalPrompt);
  return { success: true, data: response };
};

The process:

  1. Convert the user's question to an embedding
  2. Search Pinecone for the top 3 most similar documents
  3. Construct a prompt that includes the retrieved context
  4. Generate a response using GPT-4 with the augmented context
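The `createChatCompletion` helper referenced in `core` isn't shown either. A plausible implementation sends the augmented prompt through OpenAI's chat completions endpoint; the client is typed structurally here so the sketch is self-contained, but in the post it would reuse the same `client` as the embeddings call, and the `gpt-4` model name is an assumption:

```typescript
// Minimal structural view of the OpenAI client's chat API.
interface ChatClient {
  chat: {
    completions: {
      create(args: {
        model: string;
        messages: { role: "user"; content: string }[];
      }): Promise<{ choices: { message: { content: string | null } }[] }>;
    };
  };
}

// Send the augmented prompt as a single user message and return the
// model's reply text (empty string if the response has no content).
const createChatCompletion = async (
  client: ChatClient,
  prompt: string,
  model = "gpt-4",
): Promise<string> => {
  const response = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
  });
  return response.choices[0]?.message.content ?? "";
};
```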

API Endpoints

The pipeline exposes two RESTful endpoints:

Store Knowledge

POST /v1/store
Content-Type: application/json
 
{
  "content": "Your knowledge base content here..."
}

Query Knowledge

POST /v1/query
Content-Type: application/json
 
{
  "prompt": "What is your question about the stored content?"
}
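The Express handlers behind these routes aren't shown in the post. One way to sketch them is with `train` and `core` injected, so the request-handling logic can be tested in isolation; the exact validation and status codes below are assumptions, not the original code:

```typescript
// Result shape shared by both handlers: HTTP status plus JSON body.
interface HandlerResult {
  status: number;
  body: unknown;
}

// POST /v1/store: expects { content: string }, forwards to train().
const handleStore = async (
  body: { content?: unknown },
  train: (content: string) => Promise<unknown>,
): Promise<HandlerResult> => {
  if (typeof body.content !== "string" || body.content.length === 0) {
    return { status: 400, body: { success: false, message: "content is required" } };
  }
  return { status: 200, body: await train(body.content) };
};

// POST /v1/query: expects { prompt: string }, forwards to core().
const handleQuery = async (
  body: { prompt?: unknown },
  core: (prompt: string) => Promise<unknown>,
): Promise<HandlerResult> => {
  if (typeof body.prompt !== "string" || body.prompt.length === 0) {
    return { status: 400, body: { success: false, message: "prompt is required" } };
  }
  return { status: 200, body: await core(body.prompt) };
};
```

Wired into Express, this would look like `app.post("/v1/store", async (req, res) => { const r = await handleStore(req.body, train); res.status(r.status).json(r.body); })`, with the same pattern for `/v1/query`.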

Why This Architecture Works

Semantic Understanding

Unlike keyword search, semantic search understands meaning. "What's the return policy?" and "Can I get my money back?" are understood as related questions.

Scalability

Pinecone handles billions of vectors efficiently. Your knowledge base can grow without performance degradation.

Accuracy

By grounding the LLM in your actual data, you reduce hallucinations and get answers anchored in sources you can verify.

Flexibility

The modular design allows you to:

  • Swap OpenAI for other embedding providers
  • Add multiple vector indices for different knowledge domains
  • Implement re-ranking for even better results

Production Considerations

When taking this to production, consider:

  1. Chunking Strategy: Large documents should be split into meaningful chunks (paragraphs, sections) before embedding
  2. Metadata Enrichment: Store source URLs, timestamps, and categories for filtering
  3. Caching: Cache frequent embeddings to reduce API costs
  4. Rate Limiting: Protect your endpoints from abuse
  5. Monitoring: Track query latency and relevance scores
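As a taste of item 3, here's a minimal in-memory cache wrapped around any embedder. A production deployment would more likely use Redis, hash long keys, and bound the cache size (e.g. with an LRU policy):

```typescript
// Wrap an embedder with a Map-based cache so repeated inputs skip
// the API call. Keyed by the raw text; unbounded, so only a sketch.
const withEmbeddingCache = (
  embed: (text: string) => Promise<number[] | undefined>,
) => {
  const cache = new Map<string, number[]>();
  return async (text: string): Promise<number[] | undefined> => {
    const hit = cache.get(text);
    if (hit) return hit;
    const embedding = await embed(text);
    if (embedding) cache.set(text, embedding);
    return embedding;
  };
};
```

Dropping this in is a one-line change: `const cachedEmbedding = withEmbeddingCache(createEmbedding);` and then call `cachedEmbedding` wherever `createEmbedding` was used.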

What's Next? Optimizing Your RAG Pipeline

This implementation is just the beginning! RAG systems can be significantly enhanced with various optimization techniques. In upcoming blog posts, we'll dive deep into:

Advanced Chunking Strategies

  • Semantic chunking: Split documents at natural boundaries rather than fixed character counts
  • Overlapping windows: Maintain context across chunk boundaries
  • Hierarchical chunking: Create parent-child relationships between document sections
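As a baseline before those smarter variants, fixed-size chunking with overlapping windows can be sketched in a few lines (the 500/100 defaults are illustrative, not tuned):

```typescript
// Split text into fixed-size chunks with overlap, so sentences that
// straddle a boundary appear in both neighboring chunks.
const chunkText = (
  text: string,
  chunkSize = 500,
  overlap = 100,
): string[] => {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
};
```

Each chunk would then go through `train` individually, so a query can match the specific passage rather than the whole document.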

Retrieval Enhancements

  • Re-ranking: Use cross-encoder models to re-score retrieved documents for better relevance
  • Hybrid Search: Combine vector similarity with traditional keyword search (BM25) for the best of both worlds
  • Query Expansion: Automatically generate related queries to improve recall

Context Optimization

  • Contextual Compression: Distill retrieved documents to only the most relevant sentences
  • Lost in the Middle: Strategically order retrieved chunks to avoid the LLM ignoring middle content
  • Maximal Marginal Relevance (MMR): Balance relevance with diversity in retrieved results
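Of these, MMR is compact enough to sketch directly. Given any similarity function (for example, the cosine similarity the vector index already uses), greedily pick results that score high against the query but low against what's already been selected:

```typescript
// Maximal Marginal Relevance: greedily pick the next candidate that
// is relevant to the query but dissimilar to candidates already
// chosen. lambda trades relevance (1.0) against diversity (0.0).
// Returns the indices of the k selected candidates, in pick order.
const mmr = (
  query: number[],
  candidates: number[][],
  sim: (a: number[], b: number[]) => number,
  k: number,
  lambda = 0.5,
): number[] => {
  const selected: number[] = [];
  const remaining = candidates.map((_, i) => i);
  while (selected.length < k && remaining.length > 0) {
    let bestIdx = -1;
    let bestScore = -Infinity;
    for (const i of remaining) {
      const relevance = sim(query, candidates[i]!);
      const redundancy = selected.length
        ? Math.max(...selected.map((j) => sim(candidates[i]!, candidates[j]!)))
        : 0;
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = i;
      }
    }
    selected.push(bestIdx);
    remaining.splice(remaining.indexOf(bestIdx), 1);
  }
  return selected;
};
```

With `lambda = 1` this reduces to plain top-k by relevance; lowering it trades relevance for diversity in the final context.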

Advanced Techniques

  • HyDE (Hypothetical Document Embeddings): Generate hypothetical answers to improve query embeddings
  • Multi-Index Routing: Route queries to specialized indices based on intent
  • Agentic RAG: Let the LLM decide when and how to retrieve information

Evaluation & Monitoring

  • RAGAS Framework: Measure faithfulness, relevance, and context precision
  • A/B Testing: Compare different chunking and retrieval strategies
  • Feedback Loops: Use user interactions to improve retrieval quality

Stay tuned for these deep dives that will take your RAG pipeline from good to production-grade!


Conclusion

Building a RAG pipeline democratizes AI-powered search for any organization. With just a few hundred lines of TypeScript, you can create a system that:

  • Understands natural language queries
  • Retrieves contextually relevant information
  • Generates accurate, grounded responses

The combination of vector databases and LLMs is transforming how we interact with information. Whether you're building customer support bots, internal documentation search, or knowledge management systems, RAG provides the foundation for truly intelligent applications.


Ready to build your own? Clone the repository, add your API keys, and start storing your knowledge today!

# Install dependencies
bun install
 
# Configure environment
cp .env.example .env
# Add your OPENAI_KEY and PINECONE_KEY
 
# Run the server
bun run dev

Happy embedding! 🚀