
Building a RAG Chatbot with Azure OpenAI and AI Search

Saurav Rai

Founder & Lead Architect

AI / ML · 14 min read

Overview

How we build production RAG systems: chunking strategies, hybrid search, re-ranking, and grounding verification.

What Is Retrieval-Augmented Generation?

RAG solves the fundamental problem with LLMs: they don't know your proprietary data. Instead of fine-tuning (expensive, data-hungry), RAG retrieves relevant context at query time and injects it into the prompt.

The flow is: user asks a question → search your knowledge base for relevant chunks → inject chunks into GPT-4 prompt → return grounded answer with citations.

  • Far fewer hallucinations about your data
  • No fine-tuning cost or complexity
  • Always up-to-date with your latest documents
  • Verifiable answers with source citations
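The flow above can be sketched end to end. This is a language-agnostic illustration in Python; `search_chunks` and `complete` stand in for your retrieval layer and the Azure OpenAI chat call (the names are ours, not a real API):

```python
def build_prompt(question, chunks):
    """Inject retrieved chunks into the prompt, tagged for citation."""
    context = "\n\n".join(
        f"[{i + 1}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks)
    )
    return (
        "Answer ONLY from the sources below. Cite sources as [n].\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

def answer(question, search_chunks, complete):
    # 1. search the knowledge base for relevant chunks
    chunks = search_chunks(question, top=3)
    # 2. ground the prompt in those chunks
    prompt = build_prompt(question, chunks)
    # 3. generate a cited answer
    return complete(prompt)
```

The citation tags `[n]` are what makes answers verifiable: the model is told to cite, and each tag maps back to a source document.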

Chunking Strategy Matters

How you split documents into chunks dramatically affects retrieval quality. Fixed-size chunks are simple but break semantic units. Sentence-window chunking maintains context. Hierarchical chunking creates summary → detail relationships.

We use semantic chunking based on markdown headers for technical documentation—splitting at h1/h2 boundaries preserves complete concepts and dramatically improves retrieval precision.

  • Fixed size: simple, often adequate
  • Sentence window: preserves local context
  • Semantic/markdown: best for structured docs
  • Hierarchical: summary nodes accelerate multi-hop reasoning
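The markdown-header approach is simple enough to sketch. A minimal version that splits at h1/h2 boundaries while keeping h3+ subsections attached to their parent concept:

```python
import re

def chunk_by_headers(markdown: str):
    """Split markdown at h1/h2 boundaries so each chunk is a complete concept."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # A new h1/h2 heading starts a new chunk; h3+ stays with its parent.
        if re.match(r"^#{1,2} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

In production you'd also cap chunk length and carry the heading path into each chunk's metadata, but the boundary logic is the part that drives precision.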

Hybrid Search with Re-Ranking

Azure AI Search supports hybrid retrieval, combining dense vector search (semantic similarity) with sparse BM25 keyword search; Reciprocal Rank Fusion (RRF) then merges the two result sets.
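Azure AI Search performs the fusion server-side, but the formula itself is simple: each document scores 1/(k + rank) in every result list it appears in, with k conventionally around 60. A minimal sketch for intuition:

```python
def rrf_merge(result_lists, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked moderately well by both vector and keyword search can beat one ranked first by only a single method, which is exactly the behavior you want from hybrid retrieval.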

Adding a cross-encoder re-ranker as a final step—scoring each candidate chunk against the full query—consistently pushes retrieval accuracy above 90% in our production systems.

// Hybrid query: vector similarity + BM25 keywords (pass the raw query
// text to SearchAsync alongside these options), with semantic re-ranking.
var searchOptions = new SearchOptions {
    VectorSearch = new VectorSearchOptions {
        Queries = { new VectorizedQuery(embedding) { Fields = { "contentVector" } } }
    },
    // Semantic ranking only applies when the query type is Semantic.
    QueryType = SearchQueryType.Semantic,
    SemanticSearch = new SemanticSearchOptions {
        SemanticConfigurationName = "my-semantic-config",
        QueryAnswer = new QueryAnswer(QueryAnswerType.Extractive),
        QueryCaption = new QueryCaption(QueryCaptionType.Extractive)
    }
};
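The cross-encoder step itself is just "score every (query, chunk) pair, keep the best". Here's a sketch with the scoring model abstracted behind a `score` callback (in production that would be a real cross-encoder such as one from `sentence-transformers`; the toy scorer below is only for illustration):

```python
def rerank(query, candidates, score, top_k=5):
    """Re-score each (query, chunk) pair with a cross-encoder and keep top_k."""
    scored = [(score(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```

Because the cross-encoder sees the full query and chunk together, it catches relevance signals that independent embeddings miss, at the cost of one model call per candidate, which is why it runs only on the short fused list.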

Key Takeaways

  • RAG is cheaper and more maintainable than fine-tuning
  • Chunking strategy is the most impactful RAG tuning lever
  • Hybrid search outperforms pure vector or keyword search
  • Always include source citations in responses
  • Monitor retrieval quality with evaluation datasets

