
sift-kg: Knowledge Graphs You Can Actually Trust

Why Juan Ceresa built a knowledge graph CLI where nothing gets merged without your approval, and what that costs.

Timeline

  1. 2026-02-10
    Repository created

    Juan Ceresa creates juanceresa/sift-kg on GitHub. The first commits land later that month.

  2. 2026-03-16
    v0.9.0 tagged

    The pinned commit for these tours. 470 stars at this point. Schema discovery, embedding-based clustering, and the four-layer entity resolution model are all in place.

  3. 2026-03
    Civic Table launched

    The same author releases forensic_analysis_platform (Civic Table), a closed-source forensic intelligence platform built on the sift-kg pipeline. It adds analyst verification, LaTeX dossier generation, and a web interface for legal and journalism use cases.

The Problem With Auto-Merging

If you run an LLM over a corpus of documents and ask for the entities, you will get duplicates. A single deposition mentioning Detective Joe Recarey, Joe Recarey, and Recarey will produce three separate entities. Across hundreds of documents, the duplication compounds: Sam Bankman-Fried, SBF, Bankman-Fried, Samuel Benjamin Bankman-Fried all show up as different nodes in a graph that is supposed to represent reality.

The standard responses are bad in different ways. Auto-merging with string similarity collapses real distinctions: Cameron Winklevoss and Tyler Winklevoss are not the same person, even though they share a surname and most of their characters. Auto-merging with LLM judgment introduces silent failures: the LLM hallucinates that two unrelated organizations are the same and your graph quietly gains a wrong edge. Skipping deduplication entirely produces a graph so cluttered with synonyms that it loses analytical value.
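The string-similarity failure is easy to reproduce. The sketch below uses difflib's SequenceMatcher from the Python standard library as one illustrative metric; it is not the metric sift-kg or any particular tool uses, and the similarity helper is a hypothetical name.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Two different people who share a surname.
different_people = similarity("Cameron Winklevoss", "Tyler Winklevoss")

# One person under two of the names from the earlier example.
same_person = similarity("Sam Bankman-Fried", "SBF")

print(f"different people: {different_people:.2f}")
print(f"same person:      {same_person:.2f}")
```

Under this metric the two brothers score higher than the two names for one person, so any single threshold loose enough to merge SBF into Sam Bankman-Fried would also merge the Winklevoss twins.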

Any of these approaches is acceptable for exploratory use, where the reader will look at the graph and adjust their interpretation. For legal review, genealogy, or investigative journalism, none of them is. The graph is supposed to be evidence, and evidence cannot rely on the LLM's confidence score.

The Four-Layer Model

sift-kg's response is a four-layer model where each layer catches a different kind of duplicate, and the user controls the boundary between automatic and manual.

Layer one is deterministic pre-dedup during sift build. Unicode normalization collapses Jose Garcia and José Garcia (with a non-ASCII é) into one node. A list of about thirty-five title prefixes (Detective, Dr., Senator, Honorable) is stripped before comparison. The inflect library singularizes plural nouns. SemHash, a Model2Vec-based fuzzy string library, runs at a 0.95 threshold to catch near-identical strings like MacAulay vs Mac Aulay. There is no LLM call and no token cost, and the result stays reversible because you can edit the source extraction JSON later.
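A minimal sketch of the first two steps, using only the standard library. It is not the code in graph/prededup.py: the function names are hypothetical, the title list is a four-entry stand-in for the real one, and the inflect and SemHash passes are left out because they need third-party packages.

```python
import unicodedata

# Hypothetical subset; the real list is described as about thirty-five entries.
TITLE_PREFIXES = ("detective", "dr.", "senator", "honorable")


def normalize_name(name: str) -> str:
    """Fold accents and case, then strip leading title prefixes."""
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = unaccented.lower().split()
    while words and words[0] in TITLE_PREFIXES:
        words.pop(0)
    return " ".join(words)


def prededup(names: list[str]) -> dict[str, list[str]]:
    """Group raw names by their normalized key."""
    groups: dict[str, list[str]] = {}
    for name in names:
        groups.setdefault(normalize_name(name), []).append(name)
    return groups


groups = prededup([
    "José García",
    "Jose Garcia",
    "Detective Joe Recarey",
    "Joe Recarey",
])
for key, raw_names in groups.items():
    print(key, "<-", raw_names)
```

Everything here is a pure function of the input strings, which is what makes this layer cheap and repeatable.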

Layer two is the sift resolve command. Entities of the same type are batched (alphabetically by default, or by KMeans clustering on sentence embeddings if the [embeddings] extra is installed) and sent to the LLM in overlapping windows. The LLM proposes merges. None of these proposals are applied. They are written to output/merge_proposals.yaml with status: DRAFT on every entry.
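The alphabetical batching can be sketched as a sliding window over the sorted names. The window size, the overlap, and the function name below are assumptions for illustration; the actual values and code in resolve/ may differ.

```python
def overlapping_windows(
    names: list[str], size: int = 4, overlap: int = 2
) -> list[list[str]]:
    """Sort names, then cut them into windows that share `overlap` items."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    ordered = sorted(names, key=str.casefold)
    step = size - overlap
    windows = []
    for start in range(0, len(ordered), step):
        windows.append(ordered[start:start + size])
        if start + size >= len(ordered):
            break  # the last window already reaches the end of the list
    return windows


names = [
    "SBF", "Sam Bankman-Fried", "Bankman-Fried", "Samuel Benjamin Bankman-Fried",
    "Tyler Winklevoss", "Cameron Winklevoss", "Joe Recarey", "Recarey",
]
windows = overlapping_windows(names)
for window in windows:
    print(window)
```

The overlap means two names that sit on either side of a window boundary still appear together in one batch. Alphabetical order keeps spelling variants adjacent but sorts Bankman-Fried far from SBF, which is presumably the kind of case the embedding-based clustering is meant to cover.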

Layer three is the human review step. The user runs sift review and walks through each DRAFT proposal in a Rich-styled panel, makes a single-keystroke decision, and the YAML is written back. The other path is to open merge_proposals.yaml in any editor and change DRAFT to CONFIRMED or REJECTED by hand. The README recommends the manual path for genealogy and legal use cases. Auto-approve and auto-reject thresholds are configurable on the review command, including --auto-approve 1.0 to disable auto-approval entirely.
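The threshold logic might look roughly like the sketch below. The field names (confidence, canonical, members), the auto_reject parameter, and the rule that a threshold of 1.0 switches auto-approval off are all assumptions made for illustration; only the three status values and the --auto-approve flag come from the description above.

```python
def triage(
    proposals: list[dict], auto_approve: float = 1.0, auto_reject: float = 0.0
) -> list[dict]:
    """Return copies of the proposals with DRAFT resolved where thresholds allow."""
    result = []
    for proposal in proposals:
        updated = dict(proposal)
        if proposal["status"] == "DRAFT":
            confidence = proposal["confidence"]
            # Assumption: a threshold of 1.0 disables auto-approval entirely.
            if auto_approve < 1.0 and confidence >= auto_approve:
                updated["status"] = "CONFIRMED"
            elif confidence < auto_reject:
                updated["status"] = "REJECTED"
        result.append(updated)
    return result


proposals = [
    {"canonical": "Sam Bankman-Fried", "members": ["SBF"], "confidence": 0.9, "status": "DRAFT"},
    {"canonical": "Joe Recarey", "members": ["Recarey"], "confidence": 0.6, "status": "DRAFT"},
    {"canonical": "Cameron Winklevoss", "members": ["Tyler Winklevoss"], "confidence": 0.1, "status": "DRAFT"},
]

lenient = triage(proposals, auto_approve=0.85, auto_reject=0.3)
strict = triage(proposals)  # defaults leave every decision to the reviewer
print([p["status"] for p in lenient])
print([p["status"] for p in strict])
```

With the strict defaults every proposal stays DRAFT until a person changes it, which matches the manual path recommended for genealogy and legal work.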

Layer four is sift apply-merges. Only at this point are the confirmed proposals applied to the NetworkX graph. Member node data is merged into the canonical node, every edge through the merged member is rewritten to point to the canonical, self-loops created by the rewrite are dropped, and the merged member nodes are removed. The graph is saved back to graph_data.json. Rejected relations from relation_review.yaml are removed in the same operation.
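The merge steps above can be sketched on plain Python structures. This is a stand-in, not the NetworkX code sift-kg uses: nodes are a dict of attribute dicts, edges are (source, relation, target) tuples, the proposal field names are hypothetical, and the policy that the canonical node's existing attributes win is an assumption.

```python
Edge = tuple[str, str, str]


def apply_merge(
    nodes: dict[str, dict], edges: list[Edge], canonical: str, members: list[str]
) -> tuple[dict[str, dict], list[Edge]]:
    """Fold `members` into `canonical` and return new (nodes, edges)."""
    member_set = set(members) - {canonical}
    merged_nodes = {k: dict(v) for k, v in nodes.items() if k not in member_set}
    for member in member_set:
        for key, value in nodes[member].items():
            # Assumed policy: keep canonical's value, fill only missing keys.
            merged_nodes[canonical].setdefault(key, value)
    merged_edges = []
    for src, rel, dst in edges:
        new_src = canonical if src in member_set else src
        new_dst = canonical if dst in member_set else dst
        if new_src == new_dst and src != dst:
            continue  # self-loop created by the rewrite
        merged_edges.append((new_src, rel, new_dst))
    return merged_nodes, merged_edges


def apply_confirmed(
    nodes: dict[str, dict], edges: list[Edge], proposals: list[dict]
) -> tuple[dict[str, dict], list[Edge]]:
    """Apply only proposals whose status is CONFIRMED."""
    for proposal in proposals:
        if proposal["status"] == "CONFIRMED":
            nodes, edges = apply_merge(
                nodes, edges, proposal["canonical"], proposal["members"]
            )
    return nodes, edges


nodes = {
    "Sam Bankman-Fried": {"type": "PERSON"},
    "SBF": {"type": "PERSON", "alias": "SBF"},
    "FTX": {"type": "ORGANIZATION"},
}
edges = [
    ("SBF", "FOUNDED", "FTX"),
    ("SBF", "ALIAS_OF", "Sam Bankman-Fried"),
    ("FTX", "EMPLOYS", "Sam Bankman-Fried"),
]
proposals = [
    {"canonical": "Sam Bankman-Fried", "members": ["SBF"], "status": "CONFIRMED"},
    {"canonical": "FTX", "members": ["Sam Bankman-Fried"], "status": "DRAFT"},
]
new_nodes, new_edges = apply_confirmed(nodes, edges, proposals)
print(new_nodes)
print(new_edges)
```

The DRAFT proposal is ignored, so the graph changes only where a reviewer has signed off.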

The Cost of This Design

The cost is wall-clock time and review fatigue. A 1000-entity graph might generate a few hundred merge proposals. Walking through them manually is real work. Auto-approve at 0.85 confidence reduces that workload, but the user is now trusting LLM confidence scores to gate evidence, which was the problem in the first place. The README is direct: for high-accuracy work, set --auto-approve 1.0 and review every merge by hand.

The benefit is that nothing gets merged silently. If the user later finds two nodes in the graph that should have been merged, the proposal is still in merge_proposals.yaml as DRAFT or REJECTED, and they can change the status and re-run. The decision history is auditable. The graph is reproducible from the YAML.

The Civic Table Connection

The same author maintains Civic Table as a closed-source product on top of the sift-kg pipeline. Civic Table adds a four-tier verification system where analysts and JDs validate AI-extracted facts before they carry evidentiary weight, plus LaTeX dossier generation for legal submissions and a web interface for sharing results with clients and families.

The human-in-the-loop design exists for that production deployment. sift-kg is the open-source engine; Civic Table is the deployment where someone has accepted legal liability for the output. Real legal review demands a clean line between what the LLM proposed and what a person decided, and the DRAFT-CONFIRMED-REJECTED status on every merge is where that line lives.

What This Means for Reading the Code

The pipeline tour walks the code in execution order: cli.py, extract/, graph/, resolve/. Each subsystem is small enough to fit in one file or two. The entity resolution tour zooms into resolve/ and the relevant pieces of graph/prededup.py. Read the pipeline tour first if you came to understand the system. Read the entity resolution tour first if you came to understand the human-in-the-loop design.

The codebase is small: under 25,000 lines of Python at the pinned commit, including tests. It is readable end to end in an afternoon. The interesting decisions are concentrated in resolve/ and graph/prededup.py; the rest is conventional pipeline plumbing done well.

Sources

  1. juanceresa/sift-kg README at commit d5d3de2e (fetched 2026-05-04)
  2. juanceresa/sift-kg pyproject.toml at commit d5d3de2e
  3. Entity Resolution Workflow section of the README
  4. KGGen — the NeurIPS 2025 paper sift-kg credits for pre-dedup and clustering techniques
  5. Civic Table — the closed-source forensic platform built on sift-kg
