Portfolio artifact - retrieval + serving

RAG document intelligence QA

Role-fit proof of engineering ownership: I designed and shipped a production-minded RAG API with retrieval, explicit source citations, deployment on VPS, and embedding consistency checks between indexing and startup serving.

  • FastAPI
  • FAISS
  • sentence-transformers
  • Docker
  • OpenAI / Ollama

Limitations and scope

  • Fixed corpus only; the system does not browse the open web.
  • Retrieval quality constrains answer quality.
  • LLM behavior still needs inspection even when sources are present.

Overview

Problem. Teams store procedures, SLAs, and policies in text files or PDFs. Finding a precise answer quickly—without guessing—requires search that understands meaning, not just keywords.

What this system does. It ingests chunked documents, builds dense vector embeddings, indexes them in FAISS, and exposes a REST API. Each POST /ask retrieves top-k chunks, assembles them into a prompt, and calls an OpenAI-compatible chat model. The response includes document name, page, and chunk id for each source.

Why grounding matters. Pure LLM answers can confabulate on factual details (dates, SLAs, thresholds). Conditioning on retrieved passages ties the answer to specific text and makes errors easier to audit.

Architecture & workflow

End-to-end path for a single question:

Question Query embedding FAISS top-k Prompt + context LLM Answer + sources

One request; offline steps (extract -> chunk -> embed -> index) are separate from this path.

  1. 1

    User question

    JSON body {"question":"..."} to POST /ask.

  2. 2

    Embedding & retrieval

    The query is embedded with the same model used at index time (all-MiniLM-L6-v2, normalized vectors).

  3. 3

    FAISS search

    Inner-product search on the index returns the top-k chunk rows with similarity scores.

  4. 4

    Context assembly

    Chunk text is formatted with source labels (document, page) and capped for prompt size.

  5. 5

    LLM answer

    A chat completion generates the answer constrained to the provided context.

  6. 6

    Sources in the response

    The API returns answer plus sources[] (and optionally retrieved chunk payloads for debugging).

Features

Document-grounded answers

Answers are driven by retrieved passages, not the model's unconstrained prior knowledge.

Source references

Each citation includes document name, page, and chunk id for traceability.

Retrieval pipeline

Separate ingestion, chunking, embedding, and indexing steps; API loads index and metadata at startup.

API-first

FastAPI with OpenAPI; easy to integrate from web clients, scripts, or other services.

Dockerized backend

Single image with pinned dependencies; artifacts baked or mounted per environment.

VPS deployment

Container on a VPS behind Caddy for HTTPS and reverse proxy to the app.

Tech stack

  • Runtime / API: Python 3.11, FastAPI, Uvicorn
  • Vectors: sentence-transformers, NumPy, FAISS (CPU)
  • Generation: OpenAI API (or OpenAI-compatible base URL)
  • Data: CSV chunk metadata, pandas in the indexing pipeline
  • Ops: Docker, Linux VPS, Caddy (TLS + reverse proxy)

Deployment & engineering

The service runs as a Docker container exposing the FastAPI app on an internal port. Caddy terminates TLS and proxies public HTTPS to the container. The live instance is reachable at rag-qa.vahdetkaratas.com with GET /health reporting index and metadata presence and whether retrieval is loaded in memory.

This matches a small production-style loop: build image -> run on VPS -> configure DNS and reverse proxy -> verify health before sending traffic to /ask.

Build and run notes are documented in the repository README.md (extraction -> chunking -> FAISS indexing -> Docker).

Live demo

Calls POST https://rag-qa.vahdetkaratas.com/ask from your browser. If the request is blocked (e.g. CORS or API key on the server), use Swagger UI or curl instead.

Try an example (fills the box; then press Ask):

Why this project

The repo demonstrates a full retrieval + generation loop with a real HTTP API and deployed endpoint - not only notebooks or local scripts. Design choices (separate indexing from inference, explicit sources, health checks) reflect how similar systems are operated in practice.