RAG Document QA - Applied ML / Data Systems

Overview

Problem. Teams store procedures, SLAs, and policies in text files or PDFs. Finding a precise answer quickly—without guessing—requires search that understands meaning, not just keywords.

What this system does. It ingests chunked documents, builds dense vector embeddings, indexes them in FAISS, and exposes a REST API. Each POST /ask retrieves top-k chunks, assembles them into a prompt, and calls an OpenAI-compatible chat model. The response includes document name, page, and chunk id for each source.

Why grounding matters. Pure LLM answers can confabulate on factual details (dates, SLAs, thresholds). Conditioning on retrieved passages ties the answer to specific text and makes errors easier to audit.

Architecture & workflow

End-to-end path for a single question:

Question Query embedding FAISS top-k Prompt + context LLM Answer + sources

One request; offline steps (extract -> chunk -> embed -> index) are separate from this path.

1

User question

JSON body {"question":"..."} to POST /ask.
2

Embedding & retrieval

The query is embedded with the same model used at index time (all-MiniLM-L6-v2, normalized vectors).
3

FAISS search

Inner-product search on the index returns the top-k chunk rows with similarity scores.
4

Context assembly

Chunk text is formatted with source labels (document, page) and capped for prompt size.
5

LLM answer

A chat completion generates the answer constrained to the provided context.
6

Sources in the response

The API returns answer plus sources[] (and optionally retrieved chunk payloads for debugging).

Features

Document-grounded answers

Answers are driven by retrieved passages, not the model's unconstrained prior knowledge.

Source references

Each citation includes document name, page, and chunk id for traceability.

Retrieval pipeline

Separate ingestion, chunking, embedding, and indexing steps; API loads index and metadata at startup.

API-first

FastAPI with OpenAPI; easy to integrate from web clients, scripts, or other services.

Dockerized backend

Single image with pinned dependencies; artifacts baked or mounted per environment.

VPS deployment

Container on a VPS behind Caddy for HTTPS and reverse proxy to the app.

Tech stack

Runtime / API: Python 3.11, FastAPI, Uvicorn
Vectors: sentence-transformers, NumPy, FAISS (CPU)
Generation: OpenAI API (or OpenAI-compatible base URL)
Data: CSV chunk metadata, pandas in the indexing pipeline
Ops: Docker, Linux VPS, Caddy (TLS + reverse proxy)

Deployment & engineering

The service runs as a Docker container exposing the FastAPI app on an internal port. Caddy terminates TLS and proxies public HTTPS to the container. The live instance is reachable at rag-qa.vahdetkaratas.com with GET /health reporting index and metadata presence and whether retrieval is loaded in memory.

This matches a small production-style loop: build image -> run on VPS -> configure DNS and reverse proxy -> verify health before sending traffic to /ask.

Build and run notes are documented in the repository README.md (extraction -> chunking -> FAISS indexing -> Docker).

Live demo

Calls POST https://rag-qa.vahdetkaratas.com/ask from your browser. If the request is blocked (e.g. CORS or API key on the server), use Swagger UI or curl instead.

Try an example (fills the box; then press Ask):

Your question

Why this project

The repo demonstrates a full retrieval + generation loop with a real HTTP API and deployed endpoint - not only notebooks or local scripts. Design choices (separate indexing from inference, explicit sources, health checks) reflect how similar systems are operated in practice.

RAG document intelligence QA

Limitations and scope

Overview

Architecture & workflow

User question

Embedding & retrieval

FAISS search

Context assembly

LLM answer

Sources in the response