Research Assistant - LLM Research Pipeline
An intelligent, end-to-end pipeline for processing research PDFs using LLMs (Ollama or Gemini) with dynamic category generation, accurate PDF parsing with OCR fallback, LLM-based metadata extraction, multi-category scoring, deduplication, and topic-focused summarization.

View on GitHub
View on PyPI
Problem & Solution
The Problem
Organizing research PDFs is labor-intensive: extracting metadata, categorizing across multiple themes, filtering by topic relevance, detecting duplicates, and creating summaries. Manual workflows are slow, inconsistent, and hard to reproduce.
The Solution
Research Assistant automates this pipeline with LLMs. It generates a dynamic taxonomy from your topic, parses PDFs with OCR fallback, extracts rich metadata, scores papers across all categories, moves each to the best-fit folder, removes duplicates, and produces topic-focused summaries with CSV/JSONL indices for downstream analysis.
Technologies Used
Core & Parsing
- Python 3.12+
- PyMuPDF
- ocrmypdf
- Tesseract
LLM & Embeddings
- Ollama
- Google Gemini
- nomic-embed-text
Indexing & Quality
- SQLite cache
- MinHash LSH
- pytest
Key Features
Dynamic LLM-Driven Taxonomy
Generates categories from your topic (no hardcoding) and scores each paper across all categories simultaneously to choose the best placement.
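A minimal sketch of the multi-category scoring step, assuming each paper receives one numeric score per category and is placed in the highest-scoring one. The names score_paper, best_category, and keyword_scorer are hypothetical and not part of the package; keyword_scorer is only a stand-in for the LLM call so the example runs offline.

```python
from typing import Callable, Dict, List


def score_paper(
    text: str,
    categories: List[str],
    scorer: Callable[[str, str], float],
) -> Dict[str, float]:
    """Score one paper against every category (hypothetical helper)."""
    return {category: scorer(text, category) for category in categories}


def best_category(
    scores: Dict[str, float],
    min_score: float = 0.0,
    fallback: str = "uncategorized",
) -> str:
    """Pick the highest-scoring category, or a fallback if none clears min_score."""
    if not scores:
        return fallback
    name, score = max(scores.items(), key=lambda item: item[1])
    return name if score >= min_score else fallback


def keyword_scorer(text: str, category: str) -> float:
    """Stand-in for an LLM scorer: fraction of category words found in the text."""
    words = category.lower().split()
    lowered = text.lower()
    hits = sum(1 for word in words if word in lowered)
    return hits / len(words) if words else 0.0


scores = score_paper(
    "We study prompt injection attacks on agents",
    ["prompt injection", "retrieval augmentation"],
    keyword_scorer,
)
placement = best_category(scores, min_score=0.5)
```

Passing the scorer as a parameter keeps the placement logic independent of the backend, so the same selection code could sit behind either Ollama or Gemini.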
Accurate Parsing with OCR Fallback
Uses PyMuPDF for born-digital PDFs and seamlessly falls back to OCR (ocrmypdf + Tesseract) for scanned documents, ensuring high-quality text extraction.
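The fallback decision can be sketched as follows. This is an illustration under stated assumptions, not the project's actual implementation: the function names and the 50-characters-per-page threshold are hypothetical. Only fitz.open, page.get_text, and ocrmypdf.ocr are real library calls, and they are imported lazily so the decision logic can be exercised without either library installed.

```python
from typing import Callable, List


def needs_ocr(page_texts: List[str], min_chars_per_page: int = 50) -> bool:
    """Heuristic (assumed): too little extractable text suggests a scanned PDF."""
    if not page_texts:
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


def read_pages(pdf_path: str) -> List[str]:
    """Extract the text layer of each page with PyMuPDF."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def run_ocr(src_path: str, dst_path: str) -> None:
    """Add a text layer with ocrmypdf (which drives Tesseract)."""
    import ocrmypdf

    # skip_text leaves pages that already contain text untouched.
    ocrmypdf.ocr(src_path, dst_path, skip_text=True)


def extract_text(
    pdf_path: str,
    ocr_path: str,
    reader: Callable[[str], List[str]] = read_pages,
    ocr: Callable[[str, str], None] = run_ocr,
    min_chars_per_page: int = 50,
) -> str:
    """Try direct extraction first; OCR and re-read only when text is sparse."""
    pages = reader(pdf_path)
    if needs_ocr(pages, min_chars_per_page):
        ocr(pdf_path, ocr_path)
        pages = reader(ocr_path)
    return "\n".join(pages)
```

Because OCR is far slower than reading an embedded text layer, checking text density first means born-digital PDFs never pay that cost.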
Smart Deduplication & Resume
Combines hash-based exact matching with MinHash for near-duplicate detection. A SQLite cache and manifests support resumable processing at scale.
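The two-stage duplicate check can be illustrated with a simplified, standard-library-only sketch. It assumes SHA-256 for exact matching and word 3-shingles for MinHash; both are illustrative choices, and all function names are hypothetical. For brevity it compares signatures pairwise and omits the LSH bucketing that avoids all-pairs comparison on large collections.

```python
import hashlib
from typing import List, Set


def exact_fingerprint(data: bytes) -> str:
    """Content hash used for exact-duplicate detection."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 3) -> Set[str]:
    """Overlapping k-word windows; short texts become a single shingle."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: Set[str], num_perm: int = 128) -> List[int]:
    """For each salted hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching slots approximates Jaccard similarity of the sets."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def is_duplicate(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    """Exact match first (cheap), then the MinHash similarity estimate."""
    if exact_fingerprint(text_a.encode("utf-8")) == exact_fingerprint(text_b.encode("utf-8")):
        return True
    sig_a = minhash_signature(shingles(text_a))
    sig_b = minhash_signature(shingles(text_b))
    return estimated_jaccard(sig_a, sig_b) >= threshold
```

The 0.8 threshold is an assumed default; a lower value catches more revised versions of the same paper at the cost of more false positives.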
Topic-Focused Summaries & Indices
Produces per-paper summaries that emphasize your topic, plus JSONL and CSV indices for analysis, and Markdown summaries per category.
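As a rough illustration of the index outputs, the sketch below writes the same records to a JSONL file (one JSON object per line) and to a CSV file. The field names and the write_indices helper are hypothetical; the real index schema is not described here.

```python
import csv
import json
from typing import Dict, List

# Hypothetical column set; the actual index schema may differ.
FIELDS = ["title", "category", "score", "summary"]


def write_indices(records: List[Dict], jsonl_path: str, csv_path: str) -> None:
    """Write per-paper records as both JSONL and CSV."""
    with open(jsonl_path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")

    with open(csv_path, "w", encoding="utf-8", newline="") as handle:
        # extrasaction="ignore" drops keys outside FIELDS instead of raising.
        writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
```

JSONL preserves nested or optional metadata for programmatic use, while the CSV gives a flat view that opens directly in a spreadsheet.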
Related Projects

resbuilder
This project was implemented as a reliability-first resume transformation system that uses AI inside strict operational boundaries. The architecture combines meta-prompting, evidence-driven review, constrained building, rendering feedback loops, and multi-layer validation to enforce factual grounding and one-page A4 output. Production behavior is stabilized through queue-based orchestration, stale job recovery, debug artifact observability, and CI/CD quality gates that block regressions before deployment.

Multi-Turn Multi-Agent System for Prompt Injection Detection
MAPD is a production-ready FastAPI service and research harness for detecting prompt injection and jailbreaks using a multi-agent LLM pipeline: agents normalize obfuscated prompts and judge them with optional ProtectedContext signals and an incremental history “unsure” loop for multi-turn cases. It supports Ollama or Gemini backends, detailed per-conversation logging and audit trails, a Vite frontend for interaction, and experiment tooling to run sweeps and ablations and generate metrics and figures for evaluation.