An end-to-end pipeline for processing research PDFs with LLMs (Ollama or Gemini). It combines dynamic category generation, PDF parsing with OCR fallback, LLM-based metadata extraction, multi-category scoring, deduplication, and topic-focused summarization.

View on GitHub

View on PyPI

Problem & Solution

The Problem

Organizing research PDFs is labor-intensive: extracting metadata, categorizing across multiple themes, filtering by topic relevance, detecting duplicates, and creating summaries. Manual workflows are slow, inconsistent, and hard to reproduce.

The Solution

Research Assistant automates this pipeline with LLMs. It generates a dynamic taxonomy from your topic, parses PDFs with OCR fallback, extracts rich metadata, scores papers across all categories, moves each to the best-fit folder, removes duplicates, and produces topic-focused summaries with CSV/JSONL indices for downstream analysis.

Technologies Used

Core & Parsing

Python 3.12+
PyMuPDF
ocrmypdf
Tesseract

LLM & Embeddings

Ollama
Google Gemini
nomic-embed-text

Indexing & Quality

SQLite Cache
MinHash LSH
pytest

Key Features

Dynamic LLM-Driven Taxonomy

Generates categories from your topic (no hardcoding) and scores each paper across all categories simultaneously to choose the best placement.
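
The placement step can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the function name `best_category`, the 0-to-1 score scale, and the `min_score` threshold with its "uncategorized" fallback are all hypothetical, and the scores themselves would come from the LLM.

```python
from typing import Mapping


def best_category(
    scores: Mapping[str, float],
    min_score: float = 0.5,           # hypothetical cutoff, not from the project
    fallback: str = "uncategorized",  # hypothetical folder name
) -> str:
    """Return the highest-scoring category, or `fallback` if none clears `min_score`."""
    if not scores:
        return fallback
    # Compare by score first, then by name, so ties resolve deterministically
    # (the alphabetically last name wins a tie).
    name, score = max(scores.items(), key=lambda kv: (kv[1], kv[0]))
    return name if score >= min_score else fallback


# Example: one paper scored against three generated categories.
paper_scores = {"retrieval": 0.82, "evaluation": 0.41, "fine-tuning": 0.15}
print(best_category(paper_scores))
```

Scoring against every category at once, rather than asking for a single label, keeps the runner-up scores available for later review of borderline papers.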

Accurate Parsing with OCR Fallback

Uses PyMuPDF to extract text from born-digital PDFs and falls back to OCR (ocrmypdf + Tesseract) for scanned documents.
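
One way such a fallback could work is sketched below. The heuristic `needs_ocr` and its `min_chars_per_page` threshold are assumptions made for illustration; the project may decide when to run OCR differently. The PyMuPDF and ocrmypdf imports are deferred so the heuristic can be read and tested without either library installed.

```python
import tempfile
from pathlib import Path


def needs_ocr(page_texts: list[str], min_chars_per_page: int = 50) -> bool:
    """Assumed heuristic: treat a PDF as scanned if its pages carry almost no text."""
    if not page_texts:
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


def read_pages(pdf_path: Path) -> list[str]:
    """Extract the text layer of every page with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(str(pdf_path)) as doc:
        return [page.get_text() for page in doc]


def extract_text(pdf_path: Path) -> str:
    """Try the embedded text layer first; run OCR only if it looks empty."""
    pages = read_pages(pdf_path)
    if needs_ocr(pages):
        import ocrmypdf  # needs the Tesseract binary installed on the system

        with tempfile.TemporaryDirectory() as tmp:
            ocr_path = Path(tmp) / "ocr.pdf"
            # skip_text=True leaves pages that already have text untouched.
            ocrmypdf.ocr(pdf_path, ocr_path, skip_text=True)
            pages = read_pages(ocr_path)
    return "\n".join(pages)
```

Checking the text layer first avoids the cost of OCR for the common born-digital case.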

Smart Deduplication & Resume

Combines hash-based exact matching with MinHash for near-duplicate detection. A SQLite cache and manifests support resumable processing at scale.
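
The two-stage idea can be illustrated with the standard library alone. This is a simplified sketch, not the project's implementation: it compares MinHash signatures pairwise and omits the LSH banding that makes lookups scale, and the shingle size, signature length, and function names are all assumptions.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Exact-match key: identical bytes always give the same SHA-256 digest."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 5) -> set[str]:
    """Overlapping k-word windows; short texts collapse to a single shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """Keep the minimum of each salted hash; salts stand in for permutations."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        hashes = (
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        )
        # An empty set gets 0 in every slot, so two empty texts look identical.
        signature.append(min(hashes, default=0))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In a pipeline like this one, the exact digest would typically be checked first because it is cheap, with the MinHash comparison reserved for files whose bytes differ (for example, the same paper downloaded from two sources).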

Topic-Focused Summaries & Indices

Produces per-paper summaries that emphasize your topic, Markdown summaries per category, and JSONL and CSV indices for analysis.
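
Writing the two index formats side by side might look like the sketch below. The file names `index.jsonl` and `index.csv`, the function name `write_indices`, and the record fields are hypothetical; the project's actual schema is not shown in this page.

```python
import csv
import json
from pathlib import Path


def write_indices(records: list[dict], out_dir: Path) -> None:
    """Write the same records as JSONL (one object per line) and as CSV."""
    out_dir.mkdir(parents=True, exist_ok=True)

    with open(out_dir / "index.jsonl", "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # Use the union of all keys as CSV columns; missing values become "".
    fields = sorted({key for record in records for key in record})
    with open(out_dir / "index.csv", "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)
```

JSONL preserves nested or optional metadata per paper, while the CSV gives a flat view that opens directly in a spreadsheet.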
