Building a RAG Semantic Pipeline: A Complete Guide to AI-Powered Contextual Search
In the age of Large Language Models (LLMs), one of the most powerful emerging patterns is Retrieval-Augmented Generation (RAG). This technique bridges the gap between static AI knowledge and dynamic, context-aware responses. In this post, I'll walk you through building a complete RAG semantic pipeline using TypeScript, OpenAI, and Pinecone.
What is RAG and Why Should You Care?
Traditional LLMs have a significant limitation: they only know what they were trained on. Ask them about your company's internal documentation, and they'll draw a blank. RAG solves this problem elegantly by:
- Storing your custom knowledge as vector embeddings
- Retrieving relevant context based on semantic similarity
- Augmenting the LLM's response with this context
The result? An AI that can answer questions about your data while leveraging the reasoning capabilities of powerful models like GPT-4.
Architecture Overview
Our pipeline consists of four main components:
┌─────────────┐ ┌──────────────┐ ┌─────────────┐ ┌─────────────┐
│ Express │────▶│ Embeddings │────▶│ Pinecone │────▶│ OpenAI │
│ Server │ │ (OpenAI) │ │ Vector DB │ │ (GPT-4) │
└─────────────┘ └──────────────┘ └─────────────┘ └─────────────┘
Tech Stack:
- Runtime: Bun (fast JavaScript runtime)
- Framework: Express.js
- Vector Database: Pinecone
- AI Provider: OpenAI (embeddings + chat completions)
- Language: TypeScript
The Implementation
1. Setting Up Pinecone
First, we initialize our connection to Pinecone, a vector database that excels at similarity search:
import { Pinecone } from "@pinecone-database/pinecone";

const pc = new Pinecone({
  apiKey: process.env.PINECONE_KEY as string,
});

export const rag_index = pc.index("rag");

Pinecone stores our embeddings and enables lightning-fast similarity searches across millions of vectors.
2. Creating Embeddings
The magic of semantic search lies in embeddings—numerical representations of text that capture meaning:
import { OpenAI } from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_KEY as string,
});

const createEmbedding = async (prompt: string) => {
  const response = await client.embeddings.create({
    input: prompt,
    model: "text-embedding-3-small",
    dimensions: 512,
  });
  return response.data[0]?.embedding;
};

We use OpenAI's text-embedding-3-small model with 512 dimensions—a sweet spot between accuracy and performance.
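To build some intuition for what "semantic similarity" means numerically, here is a small cosine-similarity sketch. It is not part of the pipeline itself (Pinecone computes similarity server-side); it only illustrates the math behind comparing two embedding vectors:

```typescript
// Cosine similarity between two embedding vectors: the dot product
// divided by the product of their magnitudes. Vectors pointing the
// same direction score 1; orthogonal vectors score 0.
const cosineSimilarity = (a: number[], b: number[]): number => {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};

console.log(cosineSimilarity([1, 0], [1, 0])); // 1 (identical direction)
console.log(cosineSimilarity([1, 0], [0, 1])); // 0 (unrelated)
```

Two sentences with similar meanings produce embeddings whose cosine similarity is close to 1, which is exactly what Pinecone's search exploits.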
3. Training: Storing Knowledge
The training process converts your content into searchable vectors:
const train = async (content: string) => {
  // Generate embedding for the content
  const embedding = await createEmbedding(content);
  if (!embedding) {
    return { success: false, message: "failed to create embedding", data: null };
  }

  // Store with unique ID
  const id = crypto.randomUUID();
  await upsertRagItem({
    id: id,
    values: embedding,
    metadata: { content: content },
  });

  return { success: true, message: "rag data stored in the db", data: { id } };
};

Each piece of content is:
- Converted to a 512-dimensional vector
- Assigned a unique UUID
- Stored in Pinecone with the original content as metadata
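The `upsertRagItem` helper isn't shown above. A minimal sketch of what it might look like — assuming it is just a thin wrapper over `rag_index.upsert` — separates the pure record-shaping from the network call:

```typescript
// Shape of a record as this pipeline stores it in Pinecone.
type RagRecord = {
  id: string;
  values: number[];
  metadata: { content: string };
};

// Pure helper: build the record that train() hands to upsertRagItem.
const toRagRecord = (id: string, values: number[], content: string): RagRecord => ({
  id,
  values,
  metadata: { content },
});

// The wrapper itself would then be a one-liner over the Pinecone index
// (sketch only — the post does not show its actual implementation):
// const upsertRagItem = (record: RagRecord) => rag_index.upsert([record]);
```

Keeping the record-shaping pure makes it trivial to unit-test without a live Pinecone connection.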
4. Querying: The RAG Core
Here's where the magic happens—combining retrieval with generation:
const core = async (main_prompt: string) => {
  // Convert user question to embedding
  const userEmbedding = await createEmbedding(main_prompt);
  if (!userEmbedding) {
    throw new Error("Failed to create query embedding");
  }

  // Find similar content in our knowledge base
  const searchResult = await rag_index.query({
    vector: userEmbedding,
    topK: 3,
    includeMetadata: true,
  });

  // Build context from retrieved documents
  const retrievedContext = searchResult.matches
    ?.map((match) => match.metadata?.content)
    .filter(Boolean)
    .join("\n\n");

  // Augment prompt with context
  const finalPrompt = `
You are an assistant answering using the context below.
If the context does not contain the answer, say you don't know.

Context:
${retrievedContext}

Question:
${main_prompt}
`;

  const response = await createChatCompletion(finalPrompt);
  return { success: true, data: response };
};

The process:
- Convert the user's question to an embedding
- Search Pinecone for the top 3 most similar documents
- Construct a prompt that includes the retrieved context
- Generate a response using GPT-4 with the augmented context
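The `createChatCompletion` helper is referenced but not shown. One way it might be written — with the client passed in explicitly so it can be exercised against a stub, whereas the post's version presumably closes over the module-level OpenAI client — is (model name assumed):

```typescript
// Minimal client surface so the helper can run without a live API key.
// The real OpenAI SDK client structurally satisfies this shape.
type ChatClient = {
  chat: {
    completions: {
      create: (args: {
        model: string;
        messages: { role: "user" | "system"; content: string }[];
      }) => Promise<{ choices: { message: { content: string | null } }[] }>;
    };
  };
};

// Sketch of createChatCompletion: send the augmented prompt and
// return the first choice's text (empty string if absent).
const createChatCompletion = async (
  client: ChatClient,
  prompt: string
): Promise<string> => {
  const response = await client.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: prompt }],
  });
  return response.choices[0]?.message?.content ?? "";
};
```

Accepting the client as a parameter is a small dependency-injection touch that keeps the helper testable; it does not change the pipeline's behavior.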
API Endpoints
The pipeline exposes two RESTful endpoints:
Store Knowledge
POST /v1/store
Content-Type: application/json
{
  "content": "Your knowledge base content here..."
}

Query Knowledge
POST /v1/query
Content-Type: application/json
{
  "prompt": "What is your question about the stored content?"
}

Why This Architecture Works
Semantic Understanding
Unlike keyword search, semantic search understands meaning. "What's the return policy?" and "Can I get my money back?" are understood as related questions.
Scalability
Pinecone handles billions of vectors efficiently. Your knowledge base can grow without performance degradation.
Accuracy
By grounding the LLM in your actual data, you reduce hallucinations and get factually accurate responses.
Flexibility
The modular design allows you to:
- Swap OpenAI for other embedding providers
- Add multiple vector indices for different knowledge domains
- Implement re-ranking for even better results
Production Considerations
When taking this to production, consider:
- Chunking Strategy: Large documents should be split into meaningful chunks (paragraphs, sections) before embedding
- Metadata Enrichment: Store source URLs, timestamps, and categories for filtering
- Caching: Cache frequent embeddings to reduce API costs
- Rate Limiting: Protect your endpoints from abuse
- Monitoring: Track query latency and relevance scores
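As a starting point for the chunking item above, here is one possible chunker: fixed-size sliding windows with overlap so context isn't lost at chunk boundaries. The sizes are illustrative, not tuned, and real documents usually deserve splitting on paragraph or section boundaries instead:

```typescript
// Split text into fixed-size chunks with a small overlap between
// consecutive chunks, so a sentence straddling a boundary still
// appears whole in at least one chunk.
const chunkText = (text: string, chunkSize = 800, overlap = 100): string[] => {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap;
  }
  return chunks;
};
```

Each chunk would then be passed through `train()` individually, so retrieval returns focused passages rather than whole documents.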
What's Next? Optimizing Your RAG Pipeline
This implementation is just the beginning! RAG systems can be significantly enhanced with various optimization techniques. In upcoming blog posts, we'll dive deep into:
Advanced Chunking Strategies
- Semantic chunking: Split documents at natural boundaries rather than fixed character counts
- Overlapping windows: Maintain context across chunk boundaries
- Hierarchical chunking: Create parent-child relationships between document sections
Retrieval Enhancements
- Re-ranking: Use cross-encoder models to re-score retrieved documents for better relevance
- Hybrid Search: Combine vector similarity with traditional keyword search (BM25) for the best of both worlds
- Query Expansion: Automatically generate related queries to improve recall
Context Optimization
- Contextual Compression: Distill retrieved documents to only the most relevant sentences
- Lost in the Middle: Strategically order retrieved chunks to avoid the LLM ignoring middle content
- Maximal Marginal Relevance (MMR): Balance relevance with diversity in retrieved results
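As a quick preview of MMR before the deep dive: it can be sketched as a greedy loop that repeatedly picks the candidate maximizing relevance minus redundancy. The λ value and cosine similarity choice here are arbitrary illustrations:

```typescript
type Scored = { id: string; simToQuery: number; vector: number[] };

// Cosine similarity, used to measure redundancy between candidates.
const cosine = (a: number[], b: number[]): number => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
};

// Greedy MMR: each step selects the candidate maximizing
//   lambda * sim(query, d) - (1 - lambda) * max sim(d, already-selected)
const mmr = (candidates: Scored[], k: number, lambda = 0.7): Scored[] => {
  const selected: Scored[] = [];
  const pool = [...candidates];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      const redundancy = selected.length
        ? Math.max(...selected.map((s) => cosine(pool[i].vector, s.vector)))
        : 0;
      const score = lambda * pool[i].simToQuery - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = i;
      }
    }
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected;
};

// A near-duplicate of the top hit loses out to a diverse, less relevant one:
const picked = mmr(
  [
    { id: "A", simToQuery: 0.9, vector: [1, 0] },
    { id: "B", simToQuery: 0.85, vector: [1, 0.01] }, // near-duplicate of A
    { id: "C", simToQuery: 0.5, vector: [0, 1] },
  ],
  2
);
console.log(picked.map((d) => d.id)); // ["A", "C"]
```

Plugging this in after the Pinecone query (with `topK` raised above the final count) would diversify the context handed to the LLM.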
Advanced Techniques
- HyDE (Hypothetical Document Embeddings): Generate hypothetical answers to improve query embeddings
- Multi-Index Routing: Route queries to specialized indices based on intent
- Agentic RAG: Let the LLM decide when and how to retrieve information
Evaluation & Monitoring
- RAGAS Framework: Measure faithfulness, relevance, and context precision
- A/B Testing: Compare different chunking and retrieval strategies
- Feedback Loops: Use user interactions to improve retrieval quality
Stay tuned for these deep dives that will take your RAG pipeline from good to production-grade!
Conclusion
Building a RAG pipeline democratizes AI-powered search for any organization. With just a few hundred lines of TypeScript, you can create a system that:
- Understands natural language queries
- Retrieves contextually relevant information
- Generates accurate, grounded responses
The combination of vector databases and LLMs is transforming how we interact with information. Whether you're building customer support bots, internal documentation search, or knowledge management systems, RAG provides the foundation for truly intelligent applications.
Ready to build your own? Clone the repository, add your API keys, and start storing your knowledge today!
# Install dependencies
bun install
# Configure environment
cp .env.example .env
# Add your OPENAI_KEY and PINECONE_KEY
# Run the server
bun run dev

Happy embedding! 🚀