sift-kg Python juanceresa/sift-kg

Python CLI that turns document collections into structured knowledge graphs via LLM extraction with a human-in-the-loop deduplication step.

Python 0.5k stars MIT

About

sift-kg is a Python CLI built by Juan Ceresa that turns document collections into structured knowledge graphs. You point it at a folder of PDFs, papers, depositions, FOIA releases, or any other supported files on disk, and it produces a NetworkX graph where every node is an entity and every edge is a relationship the LLM extracted from the underlying text. Every entity and relation links back to its source document and passage. The project crossed 470 GitHub stars in the months after its first public release and is distributed under the MIT license.

The defining design choice is the human-in-the-loop deduplication step. LLMs over-extract. A single document referring to Sam Bankman-Fried, SBF, and Bankman-Fried can produce three separate entities. Most extraction pipelines either ignore this or auto-merge with heuristics that produce silent errors. sift-kg does neither. It runs a deterministic pre-dedup pass during sift build, asks an LLM to propose merges during sift resolve, and writes those proposals to output/merge_proposals.yaml with status DRAFT. Nothing is merged until you change a proposal's status to CONFIRMED and run sift apply-merges. For genealogy, legal review, or investigative work, that distinction is the whole point.
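The gating rule described above can be illustrated with a minimal sketch. It assumes proposals have already been loaded from YAML into plain dictionaries; the field names canonical, members, and status are hypothetical and may not match the real file layout. Only the DRAFT and CONFIRMED status values come from the description above.

```python
# Hypothetical proposal records, shaped as they might look after loading a
# merge-proposals YAML file. The field names are illustrative assumptions.
proposals = [
    {
        "canonical": "Sam Bankman-Fried",
        "members": ["SBF", "Bankman-Fried"],
        "status": "CONFIRMED",
    },
    {
        "canonical": "FTX",
        "members": ["FTX Trading Ltd."],
        "status": "DRAFT",
    },
]


def confirmed_merge_map(proposals):
    """Return {member_name: canonical_name} for CONFIRMED proposals only."""
    merge_map = {}
    for proposal in proposals:
        if proposal.get("status") != "CONFIRMED":
            continue  # DRAFT (or any other status) is never applied
        for member in proposal["members"]:
            merge_map[member] = proposal["canonical"]
    return merge_map


merge_map = confirmed_merge_map(proposals)
print(merge_map)
```

In this sketch the FTX proposal stays untouched because its status is still DRAFT, which mirrors the rule that a human has to confirm a proposal before it has any effect.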

The second design choice is LLM-agnosticism via LiteLLM. The same pipeline runs against OpenAI, Anthropic, Mistral, or a local Ollama backend with no code changes. Document ingestion goes through Kreuzberg for 75+ file formats, with optional OCR (Tesseract, EasyOCR, PaddleOCR, or Google Cloud Vision). The graph itself is a NetworkX MultiDiGraph persisted as JSON.
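To make the storage model concrete, here is a small sketch of a MultiDiGraph with provenance attributes, round-tripped through JSON. It uses NetworkX's generic node-link format; the node IDs, attribute names, and relation keys are assumptions for illustration, and the on-disk schema sift-kg actually writes may differ.

```python
import json

import networkx as nx
from networkx.readwrite import json_graph

graph = nx.MultiDiGraph()

# Attribute names below are illustrative, not sift-kg's schema.
graph.add_node("person:sbf", entity_type="PERSON", source_documents=["indictment.pdf"])
graph.add_node("org:ftx", entity_type="ORGANIZATION", source_documents=["indictment.pdf"])

# A MultiDiGraph allows several directed edges between the same pair of nodes,
# so two different relations can coexist, each with its own source document.
graph.add_edge("person:sbf", "org:ftx", key="FOUNDED", source_document="indictment.pdf")
graph.add_edge("person:sbf", "org:ftx", key="CONTROLLED", source_document="news.pdf")

# Serialize to a JSON string and rebuild the graph from it.
payload = json.dumps(json_graph.node_link_data(graph))
restored = json_graph.node_link_graph(json.loads(payload))

print(restored.number_of_nodes(), restored.number_of_edges())
```

A multigraph is a natural fit here because two entities are often connected by more than one extracted relationship, and each relationship needs to keep its own link back to the source.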

Architecture

The pipeline is six CLI verbs in src/sift_kg/cli.py: extract, build, resolve, review, apply-merges, and narrate. Each one is a thin Typer wrapper that loads SiftConfig, resolves a domain (bundled or YAML-defined), and dispatches into the corresponding subsystem. The same operations are exposed as run_extract, run_build, etc. in src/sift_kg/pipeline.py for library use.

Extraction lives in src/sift_kg/extract/. extractor.py chunks document text, runs schema discovery for schema-free domains (one LLM call samples the corpus and designs entity/relation types saved to discovered_domain.yaml), and processes chunks concurrently via asyncio.Semaphore. prompts.py builds a single combined prompt that asks the LLM to extract entities and relations in one call rather than two. Results are written as JSON to output/extractions/, one file per source document.
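The bounded-concurrency pattern mentioned above can be sketched with the standard library alone. This is not the code in extractor.py: chunk_text, fake_extract, and extract_all are hypothetical names, the chunking is a naive fixed-size split, and the LLM call is replaced by a short sleep.

```python
import asyncio


def chunk_text(text, chunk_size):
    """Split text into fixed-size character chunks (a deliberate simplification)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


async def fake_extract(chunk):
    """Stand-in for one LLM call; returns a dict shaped like an extraction result."""
    await asyncio.sleep(0.01)
    return {"chars": len(chunk), "entities": [], "relations": []}


async def extract_all(chunks, max_concurrency=4):
    """Process all chunks, with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(chunk):
        async with semaphore:  # blocks while max_concurrency calls are running
            return await fake_extract(chunk)

    # asyncio.gather returns results in the same order as its inputs.
    return await asyncio.gather(*(bounded(chunk) for chunk in chunks))


chunks = chunk_text("x" * 2500, chunk_size=1000)
results = asyncio.run(extract_all(chunks, max_concurrency=2))
print([result["chars"] for result in results])
```

The semaphore caps how many requests hit the provider at once, which is the usual way to stay under rate limits while still overlapping network latency across chunks.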

Graph construction is in src/sift_kg/graph/. builder.py walks the extraction JSON and adds each entity and relation to a KnowledgeGraph wrapping networkx.MultiDiGraph. Before nodes are added, prededup.py runs Unicode normalization, title stripping, and SemHash fuzzy matching at a 0.95 threshold to collapse obvious near-duplicates. postprocessor.py fixes reversed edge directions when the LLM swaps source and target types relative to the domain schema, normalizes synonymous relation type names, and removes redundant edges. communities.py runs Louvain detection without an LLM call.
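A rough sketch of the pre-dedup idea follows. It is not the logic in prededup.py: the title list is invented for illustration, the normalization form (NFKD with combining marks stripped) is an assumption, and difflib.SequenceMatcher from the standard library stands in for SemHash. Only the 0.95 threshold is taken from the description above.

```python
import unicodedata
from difflib import SequenceMatcher

TITLES = ("dr", "mr", "mrs", "ms", "prof")  # illustrative, not sift-kg's list


def normalize_name(name):
    """Lowercase, strip accents and periods, and drop a leading title."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Remove combining marks so that "José" compares equal to "Jose".
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = unaccented.lower().replace(".", "").split()
    if tokens and tokens[0] in TITLES:
        tokens = tokens[1:]
    return " ".join(tokens)


def prededup(names, threshold=0.95):
    """Map each name to the first earlier spelling it matches at >= threshold."""
    canonical = {}  # normalized key -> first original spelling seen
    mapping = {}
    for name in names:
        key = normalize_name(name)
        match = next(
            (seen for seen in canonical
             if SequenceMatcher(None, key, seen).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical[key] = name
            match = key
        mapping[name] = canonical[match]
    return mapping


mapping = prededup(["Dr. José Álvarez", "Jose Alvarez", "Maria Chen"])
print(mapping)
```

A high threshold such as 0.95 keeps this pass conservative: it collapses spelling and formatting variants but leaves genuinely ambiguous pairs for the LLM proposal and human review stages.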

Entity resolution is the system's defining feature and lives in src/sift_kg/resolve/. resolver.py sends batches of same-typed entities to the LLM with overlapping windows so entities near a batch boundary appear in both batches. clustering.py optionally replaces alphabetical batching with KMeans clustering on sentence-transformer embeddings (the [embeddings] extra). Proposals are written to output/merge_proposals.yaml with status DRAFT. reviewer.py presents each proposal in a Rich-styled terminal panel and reads a single keystroke for approve/reject/skip. engine.py applies confirmed merges by rewriting every edge through a member_id → canonical_id map, dropping self-loops, and combining source documents.
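The merge step described for engine.py can be sketched on plain Python data structures. This is an illustration under stated assumptions, not the project's code: apply_merges is a hypothetical function, nodes are modeled as a dict of source-document sets, and edges as (source, relation, target) tuples rather than a NetworkX graph.

```python
def apply_merges(nodes, edges, merge_map):
    """Apply confirmed merges.

    nodes:     {node_id: {"sources": set_of_document_names}}
    edges:     [(source_id, relation, target_id), ...]
    merge_map: {member_id: canonical_id}
    """
    merged_nodes = {}
    for node_id, attrs in nodes.items():
        target = merge_map.get(node_id, node_id)
        merged = merged_nodes.setdefault(target, {"sources": set()})
        merged["sources"] |= attrs["sources"]  # combine source documents

    merged_edges = []
    for src, relation, dst in edges:
        # Rewrite both endpoints through the member -> canonical map.
        src = merge_map.get(src, src)
        dst = merge_map.get(dst, dst)
        if src == dst:
            continue  # drop self-loops created by the merge
        merged_edges.append((src, relation, dst))
    return merged_nodes, merged_edges


nodes = {
    "person:sam_bankman_fried": {"sources": {"indictment.pdf"}},
    "person:sbf": {"sources": {"news.pdf"}},
    "org:ftx": {"sources": {"indictment.pdf"}},
}
edges = [
    ("person:sbf", "FOUNDED", "org:ftx"),
    ("person:sbf", "ALSO_KNOWN_AS", "person:sam_bankman_fried"),
]
merge_map = {"person:sbf": "person:sam_bankman_fried"}

merged_nodes, merged_edges = apply_merges(nodes, edges, merge_map)
print(merged_edges)
```

The ALSO_KNOWN_AS edge disappears because both of its endpoints resolve to the same canonical node, while the FOUNDED edge survives with its source rewritten and the canonical node inherits the provenance of both original entities.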

Start here

The Pipeline tour walks from CLI entry through extraction, graph construction, resolution, and apply-merges. Read it first to see how the pieces fit together end to end.

The Entity Resolution tour zooms into the four-layer dedup model: what runs automatically, what the LLM proposes, where the human reviews, and how merges actually mutate the NetworkX graph. Read it after the pipeline tour, or directly if you came for the human-in-the-loop design.

For context on why this project exists and where it sits next to other knowledge graph tools, read the origin story.

Maintainers

Juan Ceresa Author and primary maintainer
@juanceresa

Juan Ceresa is the author and primary maintainer of sift-kg. He works out of Austin, Texas, and lists his contact email as jcere@umich.edu in pyproject.toml, suggesting a University of Michigan affiliation. He also maintains forensic_analysis_platform, the closed-source Civic Table tool that wraps the sift-kg pipeline with analyst verification, LaTeX dossier generation, and a web interface for legal and journalism use cases.

Origin Story

sift-kg: Knowledge Graphs You Can Actually Trust

Why Juan Ceresa built a knowledge graph CLI where nothing gets merged without your approval, and what that costs.
