Building a RAG Semantic Pipeline: A Complete Guide to AI-Powered Contextual Search

In the age of Large Language Models (LLMs), one of the most powerful emerging patterns is Retrieval-Augmented Generation (RAG). This technique bridges the gap between static AI knowledge and dynamic, context-aware responses. In this post, I'll walk you through building a complete RAG semantic pipeline using TypeScript, OpenAI, and Pinecone.

What is RAG and Why Should You Care?

Traditional LLMs have a significant limitation: they only know what they were trained on. Ask them about your company's internal documentation, and they'll draw a blank. RAG solves this problem elegantly by:

  1. Storing your custom knowledge as vector embeddings
  2. Retrieving relevant context based on semantic similarity
  3. Augmenting the LLM's response with this context

The result? An AI that can answer questions about your data while leveraging the reasoning capabilities of powerful models like GPT-4.


Architecture Overview

Our pipeline consists of four main components:

┌─────────────┐     ┌──────────────┐     ┌─────────────┐     ┌─────────────┐
│   Express   │────▶│  Embeddings  │────▶│  Pinecone   │────▶│   OpenAI    │
│   Server    │     │   (OpenAI)   │     │  Vector DB  │     │   (GPT-4)   │
└─────────────┘     └──────────────┘     └─────────────┘     └─────────────┘

Tech Stack:

  • Runtime: Bun (fast JavaScript runtime)
  • Framework: Express.js
  • Vector Database: Pinecone
  • AI Provider: OpenAI (embeddings + chat completions)
  • Language: TypeScript

The Implementation

1. Setting Up Pinecone

First, we initialize our connection to Pinecone, a vector database that excels at similarity search:

import { Pinecone } from "@pinecone-database/pinecone";
 
const pc = new Pinecone({
  apiKey: process.env.PINECONE_KEY as string,
});
 
export const rag_index = pc.index("rag");

Pinecone stores our embeddings and enables lightning-fast similarity searches across millions of vectors.

2. Creating Embeddings

The magic of semantic search lies in embeddings—numerical representations of text that capture meaning:

import { OpenAI } from "openai";
 
const client = new OpenAI({
  apiKey: process.env.OPENAI_KEY as string,
});
 
const createEmbedding = async (prompt: string) => {
  const response = await client.embeddings.create({
    input: prompt,
    model: "text-embedding-3-small",
    dimensions: 512,
  });
 
  return response.data[0]?.embedding;
};

We use OpenAI's text-embedding-3-small model with 512 dimensions—a sweet spot between accuracy and performance.
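Pinecone computes similarity server-side, but it helps to see what "semantic similarity" means numerically. Here's a minimal cosine-similarity sketch, using toy 3-dimensional vectors in place of our 512-dimensional embeddings:

```typescript
// Cosine similarity between two embedding vectors: 1 means same
// direction (same meaning), 0 means unrelated, -1 means opposite.
const cosineSimilarity = (a: number[], b: number[]): number => {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimensionality");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i]! * b[i]!;
    normA += a[i]! * a[i]!;
    normB += b[i]! * b[i]!;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};

cosineSimilarity([1, 0, 0], [1, 0, 0]); // 1 (identical direction)
cosineSimilarity([1, 0, 0], [0, 1, 0]); // 0 (orthogonal, unrelated)
```

Scores near 1 mean the two texts are close paraphrases; scores near 0 mean they are unrelated. Cosine is a common metric choice when configuring a Pinecone index.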

3. Training: Storing Knowledge

The training step converts your content into searchable vectors. Despite the name, nothing is actually trained here: content is embedded and indexed, and no model weights change.

const train = async (content: string) => {
  // Generate embedding for the content
  const embedding = await createEmbedding(content);
  if (!embedding) {
    return { success: false, message: "failed to create embedding", data: null };
  }
 
  // Store with unique ID
  const id = crypto.randomUUID();
  await upsertRagItem({
    id: id,
    values: embedding,
    metadata: { content: content },
  });
 
  return { success: true, message: "rag data stored in the db", data: { id } };
};

Each piece of content is:

  1. Converted to a 512-dimensional vector
  2. Assigned a unique UUID
  3. Stored in Pinecone with the original content as metadata
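The `upsertRagItem` helper isn't shown in the post. Here's one plausible shape for it, written against a minimal structural interface so the sketch stays self-contained; in the real code it would close over the module-level `rag_index` instead of taking the index as a parameter. Pinecone's `upsert` accepts a batch (array) of records, so a single record is forwarded as a one-element batch:

```typescript
// Record shape that train() passes in, matching Pinecone's format.
interface RagRecord {
  id: string;
  values: number[];
  metadata: { content: string };
}

// Minimal structural view of the index: upsert takes an array of records.
interface VectorIndex {
  upsert(records: RagRecord[]): Promise<void>;
}

// Forward a single record to the index as a one-element batch.
const upsertRagItem = async (
  index: VectorIndex,
  record: RagRecord,
): Promise<void> => {
  await index.upsert([record]);
};
```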

4. Querying: The RAG Core

This is the heart of the pipeline, where retrieval and generation meet:

const core = async (main_prompt: string) => {
  // Convert user question to embedding
  const userEmbedding = await createEmbedding(main_prompt);
  if (!userEmbedding) {
    throw new Error("Failed to create query embedding");
  }
 
  // Find similar content in our knowledge base
  const searchResult = await rag_index.query({
    vector: userEmbedding,
    topK: 3,
    includeMetadata: true,
  });
 
  // Build context from retrieved documents
  const retrievedContext = searchResult.matches
    ?.map((match) => match.metadata?.content)
    .filter(Boolean)
    .join("\n\n");
 
  // Augment prompt with context
  const finalPrompt = `
You are an assistant answering using the context below.
If the context does not contain the answer, say you don't know.
 
Context:
${retrievedContext}
 
Question:
${main_prompt}
`;
 
  const response = await createChatCompletion(finalPrompt);
  return { success: true, data: response };
};

The process:

  1. Convert the user's question to an embedding
  2. Search Pinecone for the top 3 most similar documents
  3. Construct a prompt that includes the retrieved context
  4. Generate a response using GPT-4 with the augmented context
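The `createChatCompletion` helper referenced in `core` isn't shown either. A plausible implementation sends the augmented prompt through OpenAI's chat completions endpoint; the client is typed structurally here so the sketch is self-contained, but in the post it would reuse the same `client` as the embeddings call, and the `gpt-4` model name is an assumption:

```typescript
// Minimal structural view of the OpenAI client's chat API.
interface ChatClient {
  chat: {
    completions: {
      create(args: {
        model: string;
        messages: { role: "user"; content: string }[];
      }): Promise<{ choices: { message: { content: string | null } }[] }>;
    };
  };
}

// Send the augmented prompt as a single user message and return the
// model's reply text (empty string if the response has no content).
const createChatCompletion = async (
  client: ChatClient,
  prompt: string,
  model = "gpt-4",
): Promise<string> => {
  const response = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
  });
  return response.choices[0]?.message.content ?? "";
};
```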

API Endpoints

The pipeline exposes two RESTful endpoints:

Store Knowledge

POST /v1/store
Content-Type: application/json
 
{
  "content": "Your knowledge base content here..."
}

Query Knowledge

POST /v1/query
Content-Type: application/json
 
{
  "prompt": "What is your question about the stored content?"
}
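The Express handlers behind these routes aren't shown in the post. One way to sketch them is with `train` and `core` injected, so the request-handling logic can be tested in isolation; the exact validation and status codes below are assumptions, not the original code:

```typescript
// Result shape shared by both handlers: HTTP status plus JSON body.
interface HandlerResult {
  status: number;
  body: unknown;
}

// POST /v1/store: expects { content: string }, forwards to train().
const handleStore = async (
  body: { content?: unknown },
  train: (content: string) => Promise<unknown>,
): Promise<HandlerResult> => {
  if (typeof body.content !== "string" || body.content.length === 0) {
    return { status: 400, body: { success: false, message: "content is required" } };
  }
  return { status: 200, body: await train(body.content) };
};

// POST /v1/query: expects { prompt: string }, forwards to core().
const handleQuery = async (
  body: { prompt?: unknown },
  core: (prompt: string) => Promise<unknown>,
): Promise<HandlerResult> => {
  if (typeof body.prompt !== "string" || body.prompt.length === 0) {
    return { status: 400, body: { success: false, message: "prompt is required" } };
  }
  return { status: 200, body: await core(body.prompt) };
};
```

Wired into Express, this would look like `app.post("/v1/store", async (req, res) => { const r = await handleStore(req.body, train); res.status(r.status).json(r.body); })`, with the same pattern for `/v1/query`.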

Why This Architecture Works

Semantic Understanding

Unlike keyword search, semantic search understands meaning. "What's the return policy?" and "Can I get my money back?" are understood as related questions.

Scalability

Pinecone handles billions of vectors efficiently. Your knowledge base can grow without performance degradation.

Accuracy

By grounding the LLM in your actual data, you reduce hallucinations and get answers anchored in sources you can verify.

Flexibility

The modular design allows you to:

  • Swap OpenAI for other embedding providers
  • Add multiple vector indices for different knowledge domains
  • Implement re-ranking for even better results

Production Considerations

When taking this to production, consider:

  1. Chunking Strategy: Large documents should be split into meaningful chunks (paragraphs, sections) before embedding
  2. Metadata Enrichment: Store source URLs, timestamps, and categories for filtering
  3. Caching: Cache frequent embeddings to reduce API costs
  4. Rate Limiting: Protect your endpoints from abuse
  5. Monitoring: Track query latency and relevance scores
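As a taste of item 3, here's a minimal in-memory cache wrapped around any embedder. A production deployment would more likely use Redis, hash long keys, and bound the cache size (e.g. with an LRU policy):

```typescript
// Wrap an embedder with a Map-based cache so repeated inputs skip
// the API call. Keyed by the raw text; unbounded, so only a sketch.
const withEmbeddingCache = (
  embed: (text: string) => Promise<number[] | undefined>,
) => {
  const cache = new Map<string, number[]>();
  return async (text: string): Promise<number[] | undefined> => {
    const hit = cache.get(text);
    if (hit) return hit;
    const embedding = await embed(text);
    if (embedding) cache.set(text, embedding);
    return embedding;
  };
};
```

Dropping this in is a one-line change: `const cachedEmbedding = withEmbeddingCache(createEmbedding);` and then call `cachedEmbedding` wherever `createEmbedding` was used.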

What's Next? Optimizing Your RAG Pipeline

This implementation is just the beginning! RAG systems can be significantly enhanced with various optimization techniques. In upcoming blog posts, we'll dive deep into:

Advanced Chunking Strategies

  • Semantic chunking: Split documents at natural boundaries rather than fixed character counts
  • Overlapping windows: Maintain context across chunk boundaries
  • Hierarchical chunking: Create parent-child relationships between document sections
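As a baseline before those smarter variants, fixed-size chunking with overlapping windows can be sketched in a few lines (the 500/100 defaults are illustrative, not tuned):

```typescript
// Split text into fixed-size chunks with overlap, so sentences that
// straddle a boundary appear in both neighboring chunks.
const chunkText = (
  text: string,
  chunkSize = 500,
  overlap = 100,
): string[] => {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
};
```

Each chunk would then go through `train` individually, so a query can match the specific passage rather than the whole document.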

Retrieval Enhancements

  • Re-ranking: Use cross-encoder models to re-score retrieved documents for better relevance
  • Hybrid Search: Combine vector similarity with traditional keyword search (BM25) for the best of both worlds
  • Query Expansion: Automatically generate related queries to improve recall

Context Optimization

  • Contextual Compression: Distill retrieved documents to only the most relevant sentences
  • Lost in the Middle: Strategically order retrieved chunks to avoid the LLM ignoring middle content
  • Maximal Marginal Relevance (MMR): Balance relevance with diversity in retrieved results
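Of these, MMR is compact enough to sketch directly. Given any similarity function (for example, the cosine similarity the vector index already uses), greedily pick results that score high against the query but low against what's already been selected:

```typescript
// Maximal Marginal Relevance: greedily pick the next candidate that
// is relevant to the query but dissimilar to candidates already
// chosen. lambda trades relevance (1.0) against diversity (0.0).
// Returns the indices of the k selected candidates, in pick order.
const mmr = (
  query: number[],
  candidates: number[][],
  sim: (a: number[], b: number[]) => number,
  k: number,
  lambda = 0.5,
): number[] => {
  const selected: number[] = [];
  const remaining = candidates.map((_, i) => i);
  while (selected.length < k && remaining.length > 0) {
    let bestIdx = -1;
    let bestScore = -Infinity;
    for (const i of remaining) {
      const relevance = sim(query, candidates[i]!);
      const redundancy = selected.length
        ? Math.max(...selected.map((j) => sim(candidates[i]!, candidates[j]!)))
        : 0;
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = i;
      }
    }
    selected.push(bestIdx);
    remaining.splice(remaining.indexOf(bestIdx), 1);
  }
  return selected;
};
```

With `lambda = 1` this reduces to plain top-k by relevance; lowering it trades relevance for diversity in the final context.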

Advanced Techniques

  • HyDE (Hypothetical Document Embeddings): Generate hypothetical answers to improve query embeddings
  • Multi-Index Routing: Route queries to specialized indices based on intent
  • Agentic RAG: Let the LLM decide when and how to retrieve information

Evaluation & Monitoring

  • RAGAS Framework: Measure faithfulness, relevance, and context precision
  • A/B Testing: Compare different chunking and retrieval strategies
  • Feedback Loops: Use user interactions to improve retrieval quality

Stay tuned for these deep dives that will take your RAG pipeline from good to production-grade!


Conclusion

Building a RAG pipeline democratizes AI-powered search for any organization. With just a few hundred lines of TypeScript, you can create a system that:

  • Understands natural language queries
  • Retrieves contextually relevant information
  • Generates accurate, grounded responses

The combination of vector databases and LLMs is transforming how we interact with information. Whether you're building customer support bots, internal documentation search, or knowledge management systems, RAG provides the foundation for truly intelligent applications.


Ready to build your own? Clone the repository, add your API keys, and start storing your knowledge today!

# Install dependencies
bun install
 
# Configure environment
cp .env.example .env
# Add your OPENAI_KEY and PINECONE_KEY
 
# Run the server
bun run dev

Happy embedding! 🚀