Building RAG Applications with Semantic Kernel and Azure OpenAI
Retrieval-Augmented Generation has become the default pattern for building AI applications that need to work with private, domain-specific data. The concept is straightforward: instead of fine-tuning a model on your data (expensive, slow, and hard to keep current), you retrieve relevant context at query time and inject it into the prompt. The model gets the information it needs without ever being trained on it.
I've built several RAG systems in production over the past year, and the gap between a demo that works on stage and a system that works reliably at scale is enormous. Chunking strategies, embedding quality, retrieval precision, prompt engineering, hallucination mitigation — each of these deserves its own deep dive. In this post, I'll focus on the practical implementation using Microsoft's Semantic Kernel SDK with Azure OpenAI, because that's the stack where teams in the .NET ecosystem tend to have the most success.
If you've built a basic chatbot with Azure OpenAI and want to level up to a production RAG system, this is the guide you'll wish you had earlier.
Understanding the RAG Pattern
The RAG architecture has three distinct phases, and understanding them individually is crucial for debugging and optimization:
- Ingestion — Your source documents are split into chunks, converted to vector embeddings, and stored in a vector database.
- Retrieval — When a user asks a question, the query is embedded and used to search the vector store for semantically similar chunks.
- Generation — The retrieved chunks are injected into the prompt as context, and the LLM generates a response grounded in that context.
User Query
     ↓
[Embed Query] → [Vector Search] → [Top-K Chunks]
     ↓                                  ↓
     └──────────[Prompt Template]───────┘
                      ↓
            [Azure OpenAI GPT-4o]
                      ↓
             Grounded Response
The most common mistake teams make is treating this as a single pipeline. In practice, ingestion and retrieval are separate concerns with different optimization levers. Your ingestion pipeline might run nightly, while retrieval and generation happen on every user query.
Setting Up Semantic Kernel with Azure OpenAI
Semantic Kernel is Microsoft's open-source SDK for building AI applications. It provides abstractions over LLMs, vector stores, and plugins that make it easier to build and maintain RAG systems. Here's the foundation:
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.AzureOpenAI;
using Microsoft.SemanticKernel.Memory;
// Program.cs — configuring Semantic Kernel for RAG
var builder = WebApplication.CreateBuilder(args);
// Register Semantic Kernel
builder.Services.AddKernel()
.AddAzureOpenAIChatCompletion(
deploymentName: "gpt-4o",
endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!)
.AddAzureOpenAITextEmbeddingGeneration(
deploymentName: "text-embedding-3-large",
endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!);
// Register vector store — using Azure AI Search
builder.Services.AddAzureAISearchVectorStore(
new Uri(builder.Configuration["AzureSearch:Endpoint"]!),
new Azure.AzureKeyCredential(
builder.Configuration["AzureSearch:ApiKey"]!));
// Register our RAG service
builder.Services.AddScoped<RagService>();
var app = builder.Build();
app.MapPost("/api/ask", async (
AskRequest request,
RagService ragService) =>
{
var response = await ragService.AskAsync(request.Question);
return Results.Ok(new { answer = response.Answer, sources = response.Sources });
});
app.Run();
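For completeness, the configuration keys referenced above map to an appsettings.json like the following. The endpoint and key values are placeholders; in production the secrets belong in Key Vault or user secrets rather than configuration files:

```json
{
  "AzureOpenAI": {
    "Endpoint": "https://your-resource.openai.azure.com/",
    "ApiKey": "<from Key Vault or user secrets>"
  },
  "AzureSearch": {
    "Endpoint": "https://your-search.search.windows.net",
    "ApiKey": "<from Key Vault or user secrets>"
  }
}
```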
A note on model selection: you'll find that text-embedding-3-large with 3072 dimensions gives the best retrieval quality for technical documentation. If cost is a concern, text-embedding-3-small at 1536 dimensions works well for most use cases. Don't use text-embedding-ada-002 for new projects — the v3 models are strictly better.
Document Ingestion and Chunking Strategies
Chunking is where most RAG implementations fail. Chunk too small and you lose context. Chunk too large and you dilute relevance and waste token budget. Here's the approach that's worked best for us:
public class DocumentIngestionService
{
private readonly ITextEmbeddingGenerationService _embeddingService;
private readonly IVectorStore _vectorStore;
private const int ChunkSize = 512; // tokens, not characters
private const int ChunkOverlap = 50; // overlap for context continuity
public DocumentIngestionService(
ITextEmbeddingGenerationService embeddingService,
IVectorStore vectorStore)
{
_embeddingService = embeddingService;
_vectorStore = vectorStore;
}
public async Task IngestDocumentAsync(
string documentId, string content, DocumentMetadata metadata)
{
// Step 1: Split into semantic chunks
var chunks = SplitIntoChunks(content, ChunkSize, ChunkOverlap);
// Step 2: Generate embeddings in batches
var collection = _vectorStore
.GetCollection<string, DocumentChunk>("knowledge-base");
await collection.CreateCollectionIfNotExistsAsync();
var batchSize = 16; // Azure OpenAI embedding batch limit
for (int i = 0; i < chunks.Count; i += batchSize)
{
var batch = chunks.Skip(i).Take(batchSize).ToList();
var texts = batch.Select(c => c.Text).ToList();
var embeddings = await _embeddingService
.GenerateEmbeddingsAsync(texts);
for (int j = 0; j < batch.Count; j++)
{
var chunk = batch[j];
var record = new DocumentChunk
{
Key = $"{documentId}_chunk_{i + j}",
Text = chunk.Text,
Embedding = embeddings[j],
DocumentId = documentId,
DocumentTitle = metadata.Title,
ChunkIndex = i + j,
Source = metadata.Source,
LastUpdated = metadata.LastUpdated
};
await collection.UpsertAsync(record);
}
}
}
private List<TextChunk> SplitIntoChunks(
string content, int maxTokens, int overlap)
{
var chunks = new List<TextChunk>();
var paragraphs = content.Split("\n\n",
StringSplitOptions.RemoveEmptyEntries);
var currentChunk = new StringBuilder();
var currentTokens = 0;
foreach (var paragraph in paragraphs)
{
var paragraphTokens = EstimateTokenCount(paragraph);
if (currentTokens + paragraphTokens > maxTokens
&& currentChunk.Length > 0)
{
chunks.Add(new TextChunk(currentChunk.ToString().Trim()));
// Keep overlap — take last N tokens worth of text
var overlapText = GetLastNTokens(
currentChunk.ToString(), overlap);
currentChunk.Clear();
currentChunk.Append(overlapText);
currentTokens = overlap;
}
currentChunk.AppendLine(paragraph);
currentChunk.AppendLine();
currentTokens += paragraphTokens;
}
if (currentChunk.Length > 0)
chunks.Add(new TextChunk(currentChunk.ToString().Trim()));
return chunks;
}
// Rough estimation — for production, use a proper tokenizer like tiktoken
private int EstimateTokenCount(string text) => text.Length / 4;
private string GetLastNTokens(string text, int tokenCount)
{
var charCount = tokenCount * 4;
return text.Length <= charCount
? text
: text[^charCount..];
}
}
The chunk model needs to be defined as a Semantic Kernel vector store record:
public sealed class DocumentChunk
{
[VectorStoreRecordKey]
public string Key { get; set; } = string.Empty;
[VectorStoreRecordData(IsFullTextSearchable = true)]
public string Text { get; set; } = string.Empty;
[VectorStoreRecordVector(3072, DistanceFunction.CosineSimilarity)]
public ReadOnlyMemory<float> Embedding { get; set; }
[VectorStoreRecordData(IsFilterable = true)]
public string DocumentId { get; set; } = string.Empty;
[VectorStoreRecordData]
public string DocumentTitle { get; set; } = string.Empty;
[VectorStoreRecordData(IsFilterable = true)]
public string Source { get; set; } = string.Empty;
[VectorStoreRecordData]
public int ChunkIndex { get; set; }
[VectorStoreRecordData(IsFilterable = true)]
public DateTime LastUpdated { get; set; }
}
In practice, paragraph-aware chunking with 512 token chunks and 50 token overlap hits the sweet spot for most technical documentation. For code-heavy content, consider splitting on function/class boundaries instead.
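For that code-heavy case, a boundary-aware splitter can replace the paragraph splitter above. This brace-depth sketch (the class and method names are mine, not part of Semantic Kernel) keeps each top-level declaration intact; a production version would use Roslyn rather than counting braces, since string literals and comments can contain braces too:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Sketch: split source code on top-level declaration boundaries by tracking
// brace depth, so a function or class body is never cut mid-block.
public static class CodeChunker
{
    public static List<string> SplitOnDeclarations(string source)
    {
        var chunks = new List<string>();
        var current = new StringBuilder();
        var depth = 0;

        foreach (var line in source.Split('\n'))
        {
            current.AppendLine(line);
            foreach (var ch in line)
            {
                if (ch == '{') depth++;
                else if (ch == '}') depth--;
            }
            // A declaration is complete when brace depth returns to zero
            // after we have entered at least one block.
            if (depth == 0 && current.ToString().Contains('{'))
            {
                chunks.Add(current.ToString().Trim());
                current.Clear();
            }
        }
        if (current.ToString().Trim().Length > 0)
            chunks.Add(current.ToString().Trim());
        return chunks;
    }
}
```

Each resulting chunk then flows into the same embedding and upsert loop as the paragraph-based chunks.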
Retrieval and Response Generation
Now for the core RAG service that ties retrieval and generation together:
public class RagService
{
private readonly Kernel _kernel;
private readonly IVectorStore _vectorStore;
private readonly ITextEmbeddingGenerationService _embeddingService;
public RagService(
Kernel kernel,
IVectorStore vectorStore,
ITextEmbeddingGenerationService embeddingService)
{
_kernel = kernel;
_vectorStore = vectorStore;
_embeddingService = embeddingService;
}
public async Task<RagResponse> AskAsync(
string question, int topK = 5, float minRelevance = 0.75f)
{
// Step 1: Embed the query
var queryEmbedding = await _embeddingService
.GenerateEmbeddingAsync(question);
// Step 2: Search the vector store
var collection = _vectorStore
.GetCollection<string, DocumentChunk>("knowledge-base");
var searchResults = await collection.VectorizedSearchAsync(
queryEmbedding,
new VectorSearchOptions
{
Top = topK,
});
var relevantChunks = new List<DocumentChunk>();
await foreach (var result in searchResults.Results)
{
if (result.Score >= minRelevance)
relevantChunks.Add(result.Record);
}
if (relevantChunks.Count == 0)
{
return new RagResponse(
"I don't have enough information to answer that question " +
"based on the available documentation.",
[]);
}
// Step 3: Build the grounded prompt
var context = string.Join("\n\n---\n\n",
relevantChunks.Select(c =>
$"[Source: {c.DocumentTitle}]\n{c.Text}"));
var prompt = $"""
You are a helpful technical assistant. Answer the user's question
based ONLY on the provided context. If the context doesn't contain
enough information to answer fully, say so clearly.
Do not make up information. Cite the source document when possible.
## Context
{context}
## Question
{question}
## Answer
""";
// Step 4: Generate the response
var chatService = _kernel
.GetRequiredService<IChatCompletionService>();
var chatHistory = new ChatHistory();
chatHistory.AddUserMessage(prompt);
var response = await chatService.GetChatMessageContentAsync(
chatHistory,
new AzureOpenAIPromptExecutionSettings
{
Temperature = 0.1f, // Low temperature for factual responses
MaxTokens = 1024,
});
var sources = relevantChunks
.Select(c => c.DocumentTitle)
.Distinct()
.ToList();
return new RagResponse(
response.Content ?? "Unable to generate a response.",
sources);
}
}
public record RagResponse(string Answer, List<string> Sources);
public record AskRequest(string Question);
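A round-trip through the /api/ask endpoint then looks roughly like this; the payload shapes follow AskRequest and RagResponse, and the answer and source values are invented for illustration:

```json
// POST /api/ask
{ "question": "What chunk overlap does the ingestion pipeline use?" }

// 200 OK
{
  "answer": "The ingestion pipeline uses a 50-token overlap between chunks...",
  "sources": ["Ingestion Pipeline Guide"]
}
```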
A few hard-won lessons:
- Set temperature to 0.1 or lower for RAG responses. You want factual, grounded answers, not creative ones.
- The minimum relevance threshold matters. Set it too low and you get irrelevant context that confuses the model. Start at 0.75 and adjust based on your data.
- Always return sources. Your users need to verify the information, and it builds trust in the system.
Improving Retrieval Quality
Pure vector search gets you 70% of the way there. To get to 90%+, you need hybrid search and query transformation:
public class HybridSearchService
{
    private readonly IVectorStore _vectorStore;
    private readonly ITextEmbeddingGenerationService _embeddingService;
    public HybridSearchService(
        IVectorStore vectorStore,
        ITextEmbeddingGenerationService embeddingService)
    {
        _vectorStore = vectorStore;
        _embeddingService = embeddingService;
    }
    // Combine vector similarity with keyword matching. Azure AI Search
    // supports hybrid retrieval natively; the vector store abstractions
    // expose it through a hybrid search interface in recent previews.
    // The exact interface and option names have shifted between preview
    // releases, so treat this as a sketch and check your SDK version.
    public async Task<List<DocumentChunk>> HybridSearchAsync(
        string query, int topK = 10)
    {
        var collection = _vectorStore
            .GetCollection<string, DocumentChunk>("knowledge-base");
        var queryEmbedding = await _embeddingService
            .GenerateEmbeddingAsync(query);
        // Fuse keyword and vector scores server-side
        var hybrid = (IKeywordHybridSearch<DocumentChunk>)collection;
        var results = await hybrid.HybridSearchAsync(
            queryEmbedding,
            query.Split(' ', StringSplitOptions.RemoveEmptyEntries),
            new HybridSearchOptions<DocumentChunk> { Top = topK });
        var chunks = new List<DocumentChunk>();
        await foreach (var result in results.Results)
        {
            chunks.Add(result.Record);
        }
        return chunks;
    }
}
Another technique that dramatically improved our retrieval accuracy: query expansion. Before searching, ask the LLM to rephrase the query into multiple search queries:
public async Task<List<string>> ExpandQueryAsync(string originalQuery)
{
var prompt = $"""
Given the following user question, generate 3 alternative
phrasings that might match relevant documentation. Return
only the rephrased queries, one per line.
Question: {originalQuery}
""";
var chatHistory = new ChatHistory();
chatHistory.AddUserMessage(prompt);
var result = await _chatService.GetChatMessageContentAsync(chatHistory);
var queries = result.Content?
.Split('\n', StringSplitOptions.RemoveEmptyEntries)
.Select(q => q.Trim())
.Where(q => q.Length > 0)
.ToList() ?? [];
queries.Insert(0, originalQuery);
return queries;
}
Search with all expanded queries, deduplicate the results, and rank by combined relevance score. This single change improved our answer accuracy by roughly 25% in internal evaluations.
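The fusion step is not something the SDK does for you. A common recipe is Reciprocal Rank Fusion, where each document earns a score of 1/(k + rank) in every result list it appears in; this sketch (the class and method names are mine) deduplicates and ranks chunk keys coming back from the expanded-query searches:

```csharp
using System.Collections.Generic;
using System.Linq;

// Sketch: merge ranked result lists from several expanded queries using
// Reciprocal Rank Fusion (RRF). A document's combined score is the sum of
// 1 / (k + rank) across all lists that contain it; k = 60 is the value
// commonly used in the RRF literature.
public static class ResultFusion
{
    public static List<string> FuseByRrf(
        IEnumerable<IReadOnlyList<string>> rankedLists, int k = 60)
    {
        var scores = new Dictionary<string, double>();
        foreach (var list in rankedLists)
        {
            for (int rank = 0; rank < list.Count; rank++)
            {
                var key = list[rank];
                scores.TryGetValue(key, out var existing);
                scores[key] = existing + 1.0 / (k + rank + 1);
            }
        }
        // Deduplicated keys, highest combined score first.
        return scores
            .OrderByDescending(kv => kv.Value)
            .Select(kv => kv.Key)
            .ToList();
    }
}
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that similarity scores from different queries are not directly comparable.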
Production Considerations
Before shipping a RAG system, address these concerns that won't show up in your local testing:
Token budget management — GPT-4o has a 128K context window, but that doesn't mean you should fill it. In practice, 3-5 chunks of 512 tokens each (roughly 2000-3000 tokens of context) produces better answers than stuffing in 20 chunks. More context means more noise.
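A minimal way to enforce that budget is to take chunks in relevance order until the budget runs out, reusing the rough chars/4 token estimate from the ingestion code. The class here is illustrative:

```csharp
using System.Collections.Generic;

// Sketch: cap the prompt context at a fixed token budget by accepting
// chunks in relevance order until the budget is exhausted. Uses the same
// rough chars/4 token estimate as the ingestion code; swap in a real
// tokenizer for production.
public static class ContextBudgeter
{
    public static List<string> TakeWithinBudget(
        IReadOnlyList<string> rankedChunks, int maxTokens)
    {
        var selected = new List<string>();
        var usedTokens = 0;
        foreach (var chunk in rankedChunks)
        {
            var estimatedTokens = chunk.Length / 4;
            if (usedTokens + estimatedTokens > maxTokens) break;
            selected.Add(chunk);
            usedTokens += estimatedTokens;
        }
        return selected;
    }
}
```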
Caching — Embed your most common queries and cache the results. Embedding generation is fast but not free. We use a simple Redis cache with a 24-hour TTL:
public async Task<ReadOnlyMemory<float>> GetOrCreateEmbeddingAsync(
string text)
{
var cacheKey = $"embedding:{Convert.ToHexString(
SHA256.HashData(Encoding.UTF8.GetBytes(text)))}";
var cached = await _cache.GetAsync(cacheKey);
if (cached is not null)
return new ReadOnlyMemory<float>(
JsonSerializer.Deserialize<float[]>(cached));
var embedding = await _embeddingService.GenerateEmbeddingAsync(text);
await _cache.SetAsync(cacheKey,
JsonSerializer.SerializeToUtf8Bytes(embedding.ToArray()),
new DistributedCacheEntryOptions
{
AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(24)
});
return embedding;
}
Evaluation — Build an evaluation dataset of question-answer pairs grounded in your documentation. Run automated evaluations on every change to your chunking strategy, embedding model, or retrieval logic. Without this, you're flying blind.
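A practical starting point is a retrieval-only metric such as recall@k: did the expected document show up in the top k results? The record and class below are illustrative, not part of Semantic Kernel:

```csharp
using System.Collections.Generic;
using System.Linq;

// Sketch of a retrieval metric for an evaluation set: for each question,
// check whether the expected source document appears in the top-k results.
public sealed record EvalCase(
    string Question, string ExpectedDocumentId, List<string> RetrievedIds);

public static class RagEvaluator
{
    // Fraction of cases where the expected document was retrieved (recall@k).
    public static double RecallAtK(IEnumerable<EvalCase> cases, int k)
    {
        var list = cases.ToList();
        if (list.Count == 0) return 0.0;
        var hits = list.Count(c =>
            c.RetrievedIds.Take(k).Contains(c.ExpectedDocumentId));
        return (double)hits / list.Count;
    }
}
```

Tracking this number across chunking or embedding changes tells you whether retrieval got better before you even look at generated answers.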
Key Takeaways
- Chunking is the most important decision in your RAG pipeline. Start with 512-token paragraph-aware chunks and iterate based on retrieval quality.
- Use hybrid search (vector + keyword) from day one. Pure vector search misses exact-match scenarios that users expect.
- Keep temperature low (0.1-0.3) for RAG responses. You want grounded facts, not creative writing.
- Query expansion is a simple technique with outsized impact. Generate 2-3 alternative phrasings before searching.
- Cache embeddings aggressively. The same queries hit your system repeatedly, and embedding generation costs add up.
- Build evaluation infrastructure early. You can't improve what you can't measure, and RAG quality is notoriously hard to judge manually.
Semantic Kernel provides excellent abstractions for building RAG systems in .NET. The key is understanding that the SDK handles the plumbing — your job is to get the data preparation, chunking, and prompt engineering right. That's where the real engineering challenge lives.
Ajit Gangurde
Software Engineer II at Microsoft | 15+ years in .NET & Azure