sift-kg Python juanceresa/sift-kg

Human-in-the-Loop Entity Resolution

How sift-kg layers deterministic dedup, LLM proposals, YAML review, and graph mutation so nothing gets merged without your approval.

6 stops ~22 min Verified 2026-05-04
What you will learn
  • How layer 1 (deterministic pre-dedup) collapses unicode, title, and fuzzy variants without an LLM call
  • Why title prefixes are stripped before normalization and which prefixes (about fifty) the code recognizes
  • How layer 2 (LLM proposal) batches same-type entities with overlapping windows
  • The YAML proposal format and the DRAFT/CONFIRMED/REJECTED status that governs whether a merge runs
  • How the interactive reviewer auto-approves high-confidence proposals and panels the rest
  • How layer 4 (graph mutation) actually rewrites edges in the NetworkX MultiDiGraph
Prerequisites
  • Comfort with Python type hints and Pydantic models
  • Basic NetworkX terminology (nodes, edges, degree)
  • Ideally: read the Pipeline tour first for end-to-end context
1 / 6

Layer 1 — Deterministic Pre-Dedup

src/sift_kg/graph/prededup.py:75

Before any LLM is involved, two phases collapse obvious near-duplicates: deterministic normalization, then SemHash fuzzy matching.

Calling an LLM to dedup "Companies" and "Company" is wasteful: the answer is mechanical. Layer one of the four-layer model handles these cases without spending a token. Names are grouped by entity type (you would never merge a PERSON and a LOCATION just because they share a string), unicode-normalized to ASCII, stripped of titles, and singularized via the inflect library. Whatever still slips through is matched by SemHash at a 0.95 similarity threshold.

The return type is the contract: a map from (entity_type, original_name) to canonical name, populated only where they differ. build_graph consults this map every time it adds an entity, so the duplicates never become separate nodes and the user never sees a review prompt for them.
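
The contract can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: normalize_key and build_canonical_map are hypothetical names, the singularization and SemHash steps are omitted, and picking the first-seen spelling as canonical is an assumption.

```python
import unicodedata


def normalize_key(name: str) -> str:
    """ASCII-fold, lowercase, and collapse whitespace (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_only.lower().split())


def build_canonical_map(
    entities: list[tuple[str, str]],
) -> dict[tuple[str, str], str]:
    """Map (entity_type, original_name) -> canonical_name where they differ."""
    first_seen: dict[tuple[str, str], str] = {}
    mapping: dict[tuple[str, str], str] = {}
    for entity_type, name in entities:
        # Grouping by type keeps a PERSON and a LOCATION apart
        # even when their strings are identical.
        key = (entity_type, normalize_key(name))
        # Assumption: the first spelling seen for a key becomes canonical.
        canonical = first_seen.setdefault(key, name)
        if canonical != name:
            mapping[(entity_type, name)] = canonical
    return mapping


entities = [
    ("PERSON", "José Martí"),
    ("PERSON", "Jose Marti"),
    ("LOCATION", "Jose Marti"),
    ("PERSON", "Ana Ruiz"),
]
canonical_map = build_canonical_map(entities)
```

Only the second PERSON spelling ends up in canonical_map. The LOCATION entry and the names that are already canonical are absent, which matches the "only where they differ" rule.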

Key takeaway

Layer one runs in sift build before nodes exist. No LLM call, no review, no cost.

def prededup_entities(
    extractions: list[DocumentExtraction],
    similarity_threshold: float = 0.95,
) -> dict[tuple[str, str], str]:
    """Map (entity_type, original_name) -> canonical_name for near-duplicates.

    Args:
        extractions: List of document extractions to scan
        similarity_threshold: SemHash threshold for fuzzy matching (0-1)

    Returns:
        Mapping from (entity_type, original_name) to canonical_name.
        Only contains entries where original_name != canonical_name.
    """
    # Collect all entity names grouped by type
    names_by_type: dict[str, list[str]] = {}
    for extraction in extractions:
        if extraction.error:
            continue
        for entity in extraction.entities:
            names_by_type.setdefault(entity.entity_type, []).append(entity.name)
2 / 6

Layer 1 Detail — Stripping Titles

src/sift_kg/graph/prededup.py:29

About fifty title prefixes are stripped before comparison so titled and untitled forms of the same name collapse.

The use cases that drove this list are visible in the prefixes themselves: detective, sergeant, special agent for legal and investigative work; senator, representative, governor for political documents; esquire, attorney, judge for court filings. Detective Joe Recarey and Joe Recarey would otherwise be two nodes; the strip pass collapses them before SemHash even runs.

The loop is iterative on purpose: "Honorable Judge Smith" needs two passes (first "Honorable", then "Judge"). The prefixes are lowercase and startswith is case-sensitive, so the function expects a name that has already been lowercased. The same constant is reused in resolver.py for sorting PERSON entities by surname, so titled and untitled variants of the same person cluster into the same LLM batch in layer two.
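
A small sketch of both uses, assuming a shortened prefix list and a surname sort key. The names strip_titles and surname_key are hypothetical, and the explicit lowercasing is an assumption about what happens before the real function is called.

```python
# Shortened, illustrative subset of the prefix tuple.
TITLE_PREFIXES = ("detective", "judge", "hon.", "honorable", "agent", "special agent")


def strip_titles(name: str) -> str:
    """Strip stacked title prefixes; restart the scan after each removal."""
    name = name.strip().lower()  # assumption: names are lowercased first
    changed = True
    while changed:
        changed = False
        for prefix in TITLE_PREFIXES:
            if name.startswith(prefix + " "):
                name = name[len(prefix) + 1:].strip()
                changed = True
                break
    return name


def surname_key(name: str) -> str:
    """Sort key: last token of the title-stripped name (hypothetical)."""
    stripped = strip_titles(name)
    return stripped.split()[-1] if stripped else ""


names = ["Honorable Judge Smith", "Detective Joe Recarey", "Joe Recarey", "Ann Smith"]
ordered = sorted(names, key=surname_key)
```

Sorting on the stripped surname places "Detective Joe Recarey" next to "Joe Recarey", so a batching step that slices the sorted list would see both in the same window.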

Key takeaway

Title stripping is iterative, and the prefix list is shared with the LLM batch sorter, so the same normalization shapes layers one and two.

# Common title prefixes that don't change entity identity
_TITLE_PREFIXES = (
    "detective", "det.", "officer", "sergeant", "sgt.", "lieutenant", "lt.",
    "captain", "cpt.", "chief", "deputy", "agent", "special agent",
    "dr.", "dr", "doctor", "prof.", "professor",
    "mr.", "mr", "mrs.", "mrs", "ms.", "ms", "miss",
    "judge", "justice", "hon.", "honorable",
    "senator", "sen.", "representative", "rep.", "governor", "gov.",
    "president", "vice president",
    "attorney", "atty.", "counsel", "esquire", "esq.",
    "reverend", "rev.", "father", "sister", "brother",
    "sir", "dame", "lord", "lady",
)


def _strip_titles(name: str) -> str:
    """Strip common title prefixes from a name."""
    changed = True
    while changed:
        changed = False
        for prefix in _TITLE_PREFIXES:
            if name.startswith(prefix + " "):
                name = name[len(prefix) + 1:].strip()
                changed = True
                break
    return name
3 / 6

Layer 2 — Cross-Type Duplicates Without an LLM Call

src/sift_kg/resolve/resolver.py:190

If the LLM extracts the same name under two different types, no LLM call is needed to merge them; the higher-degree node wins.

Even after schema discovery and the fix_relation_directions post-processor, the LLM sometimes assigns inconsistent types, such as extracting "reading comprehension" as both CONCEPT and PHENOMENON within the same graph. find_merge_candidates calls into this helper alongside the LLM batches and gets a free dedup pass.

The heuristic is degree-based: the node with the most connections becomes the canonical entity, and every other typed copy is added as a member with confidence 0.95. The reasoning the function writes into the proposal is direct: "Same name across types... Relations will be combined." The proposal is still DRAFT, so the user can reject it if they actually meant to keep the types separate. But the LLM never sees this case.
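
The degree heuristic can be sketched without NetworkX by passing in precomputed degrees. Everything here is illustrative: the node ID format, the dict-shaped proposals, and the tie-breaking (first seen wins on equal degree) are assumptions, not the project's behavior.

```python
from collections import defaultdict

# (node_id, entity_type, name, degree); IDs and degrees are made up.
nodes = [
    ("concept:reading_comprehension", "CONCEPT", "Reading Comprehension", 7),
    ("phenomenon:reading_comprehension", "PHENOMENON", "reading comprehension", 2),
    ("concept:phonics", "CONCEPT", "phonics", 3),
]


def cross_type_proposals(nodes):
    """Propose merges for names that appear under more than one type."""
    groups = defaultdict(list)
    for node_id, entity_type, name, degree in nodes:
        key = name.strip().lower()
        if key:
            groups[key].append((node_id, entity_type, degree))

    proposals = []
    for members in groups.values():
        if len({entity_type for _, entity_type, _ in members}) < 2:
            continue  # same name but only one type: nothing to reconcile
        # Highest degree first; sorted() is stable, so ties keep input order.
        ranked = sorted(members, key=lambda m: m[2], reverse=True)
        canonical, rest = ranked[0], ranked[1:]
        proposals.append({
            "canonical_id": canonical[0],
            "entity_type": canonical[1],
            "status": "DRAFT",  # still needs human confirmation
            "members": [{"id": nid, "confidence": 0.95} for nid, _, _ in rest],
        })
    return proposals


proposals = cross_type_proposals(nodes)
```

The CONCEPT node wins because it has the higher degree, and "phonics" produces no proposal because it exists under a single type.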

Key takeaway

Cross-type duplicates are caught by graph topology, not language. Highest degree wins; the user still gets a DRAFT proposal to confirm.

def _find_cross_type_duplicates(kg: KnowledgeGraph) -> list[MergeProposal]:
    """Find entities with the same name but different types.

    When the LLM extracts "reading comprehension" as both CONCEPT and
    PHENOMENON, these are the same entity with an inconsistent type.
    The canonical is the one with more connections (more context for
    the type assignment). All relations get combined on merge.
    """
    from collections import defaultdict

    # Group by normalized name
    name_groups: dict[str, list[tuple[str, str, int]]] = defaultdict(list)
    for nid, data in kg.graph.nodes(data=True):
        entity_type = data.get("entity_type", "")
        if entity_type in SKIP_TYPES:
            continue
        name = data.get("name", "").strip().lower()
        if not name:
            continue
        degree = kg.graph.degree(nid)
        name_groups[name].append((nid, entity_type, degree))
4 / 6

Layer 3 — The Proposal as Pydantic Schema

src/sift_kg/resolve/models.py:28

Every merge is a Pydantic model whose three-valued status (DRAFT, CONFIRMED, REJECTED) gates whether the merge runs.

The whole human-in-the-loop design hinges on this status field. StatusType is a Literal["DRAFT", "CONFIRMED", "REJECTED"], and DRAFT is the model's default, so every new proposal starts there. apply_merges only acts on MergeFile.confirmed: proposals where the user (or auto-approval at a high confidence threshold) explicitly changed the status.

The file format follows directly. MergeFile is the top-level Pydantic model that maps to merge_proposals.yaml. reason captures the LLM's natural-language justification, which the reviewer surfaces to the user. The README explicitly recommends opening this YAML in any editor and changing statuses by hand. The schema is designed for human readability, not just programmatic round-trips.
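
The gate can be exercised in isolation. The sketch below uses stdlib dataclasses as stand-ins for the Pydantic models, an assumption made to keep it dependency-free. The Proposal and ProposalFile names and the sample IDs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Literal

StatusType = Literal["DRAFT", "CONFIRMED", "REJECTED"]


@dataclass
class Proposal:
    canonical_id: str
    member_ids: list[str]
    status: StatusType = "DRAFT"  # nothing is approved by default


@dataclass
class ProposalFile:
    proposals: list[Proposal] = field(default_factory=list)

    @property
    def confirmed(self) -> list[Proposal]:
        return [p for p in self.proposals if p.status == "CONFIRMED"]


merge_file = ProposalFile([
    Proposal("person:a", ["person:a2"]),
    Proposal("org:b", ["org:b2"]),
    Proposal("place:c", ["place:c2"]),
])

# Simulate a reviewer editing two statuses by hand; the third stays DRAFT.
merge_file.proposals[0].status = "CONFIRMED"
merge_file.proposals[1].status = "REJECTED"

to_apply = [p.canonical_id for p in merge_file.confirmed]
```

Only the proposal flipped to CONFIRMED appears in to_apply. The rejected one and the untouched draft are both invisible to an apply step that reads the confirmed list.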

Key takeaway

Status is the gate. Nothing applies until DRAFT becomes CONFIRMED, and the YAML is designed so a human can flip it directly.

class MergeProposal(BaseModel):
    """A proposed merge of multiple entities into one canonical entity.

    The `members` list contains the non-canonical entities that will be merged
    INTO the canonical entity. The canonical entity itself is identified by
    `canonical_id` and is not included in `members`.
    """

    canonical_id: str
    canonical_name: str
    entity_type: str
    status: StatusType = "DRAFT"
    members: list[MergeMember] = Field(min_length=1)
    reason: str = ""  # LLM's explanation for why these should merge


class MergeFile(BaseModel):
    """Top-level model for merge_proposals.yaml."""

    proposals: list[MergeProposal] = Field(default_factory=list)

    @property
    def confirmed(self) -> list[MergeProposal]:
        return [p for p in self.proposals if p.status == "CONFIRMED"]

    @property
    def draft(self) -> list[MergeProposal]:
        return [p for p in self.proposals if p.status == "DRAFT"]
5 / 6

Layer 3 — Interactive Review With Auto-Approve

src/sift_kg/resolve/reviewer.py:54

The reviewer auto-confirms proposals where every member meets a confidence threshold and panels the rest for keystroke review.

The threshold check uses min, not mean, and the default threshold is 0.85, set in the CLI command. A proposal with three members at 0.99 confidence and one at 0.60 averages about 0.89, above that default, yet it does not auto-approve, because every member must clear the bar. The auto_approve_threshold < 1.0 guard means --auto-approve 1.0 disables the auto-confirm path entirely and forces manual review of every proposal. The README recommends exactly that for genealogy and legal use cases.
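
The difference between the two aggregates is easy to check by hand. The sketch below restates the gate as a standalone function (auto_approves is a hypothetical name; the condition mirrors the one in the excerpt) and compares it with a mean-based rule that the code does not use.

```python
from statistics import mean


def auto_approves(confidences: list[float], threshold: float = 0.85) -> bool:
    """True when auto-approval is enabled and every member clears the bar."""
    return threshold < 1.0 and min(confidences) >= threshold


mixed = [0.99, 0.99, 0.99, 0.60]
uniform = [0.90, 0.95]

mixed_mean = mean(mixed)                    # about 0.89: a mean rule would pass
mixed_result = auto_approves(mixed)         # the weakest member blocks it
uniform_result = auto_approves(uniform)     # every member is at least 0.85
disabled_result = auto_approves([1.0, 1.0], threshold=1.0)  # guard turns it off
```

A single weak member is enough to send the whole proposal to manual review, and a threshold of 1.0 sends everything there even when every member is at full confidence.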

Whatever is not auto-approved goes into manual_review and is shown one proposal at a time as a Rich-styled panel: the canonical entity, a member table with confidence colored by threshold, and the LLM's reason. The user types a, r, s, or q and the function writes the YAML back.

Key takeaway

Auto-approval uses minimum member confidence, not average. Setting --auto-approve 1.0 disables it entirely, the recommended setting for high-accuracy work.

    drafts = merge_file.draft
    if not drafts:
        console.print("[dim]No merge proposals to review.[/dim]")
        return {"auto_approved": 0, "approved": 0, "rejected": 0, "skipped": 0}

    # Auto-approve high-confidence proposals
    auto_approved = []
    manual_review = []
    for proposal in drafts:
        min_conf = min(m.confidence for m in proposal.members)
        if auto_approve_threshold < 1.0 and min_conf >= auto_approve_threshold:
            proposal.status = "CONFIRMED"
            auto_approved.append(proposal)
        else:
            manual_review.append(proposal)
6 / 6

Layer 4 — Mutating the NetworkX Graph

src/sift_kg/resolve/engine.py:12

Confirmed proposals build a single member-to-canonical map, then a single pass through every edge rewrites endpoints and drops self-loops.

The docstring is the contract. Four steps, in order, applied only to merge_file.confirmed. Everything still in DRAFT stays untouched and survives to the next run, so the user can come back to it later, or the next round of sift resolve can re-propose it differently.

merge_map is built first as a flat member_id → canonical_id dictionary across every confirmed proposal. The next step (shown in the pipeline tour) materializes the edge list, rewrites each edge through this map, drops self-loops where both endpoints collapsed into the same node, and finally removes the orphaned member nodes. The graph is saved back to graph_data.json. Re-export, re-narrate, and re-visualize all read from the cleaned graph.
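
The edge rewrite can be sketched on plain tuples. This is an illustration under stated assumptions, not the engine's code: a list of (source, target, relation) triples stands in for the MultiDiGraph, apply_merge_map and the node IDs are hypothetical, and only self-loops created by the merge are dropped.

```python
def apply_merge_map(nodes, edges, merge_map):
    """Rewrite edge endpoints through merge_map, then drop merged nodes."""
    rewritten = []
    self_loops_removed = 0
    for source, target, relation in edges:
        new_source = merge_map.get(source, source)
        new_target = merge_map.get(target, target)
        # Assumption: a loop that only exists because both endpoints
        # collapsed into one node is removed; other edges are kept.
        if new_source == new_target and source != target:
            self_loops_removed += 1
            continue
        rewritten.append((new_source, new_target, relation))
    remaining = [n for n in nodes if n not in merge_map]
    return remaining, rewritten, self_loops_removed


nodes = ["person:joe_recarey", "person:det_recarey", "org:police_dept"]
edges = [
    ("person:det_recarey", "org:police_dept", "WORKS_FOR"),
    ("person:joe_recarey", "person:det_recarey", "SAME_AS"),
    ("person:joe_recarey", "org:police_dept", "WORKS_FOR"),
]
merge_map = {"person:det_recarey": "person:joe_recarey"}

remaining, rewritten, loops = apply_merge_map(nodes, edges, merge_map)
```

The two WORKS_FOR edges survive as parallel edges on the canonical node, which is the situation a multigraph is meant to hold, while the SAME_AS edge between the merged pair disappears as a self-loop.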

Key takeaway

Only CONFIRMED proposals run; DRAFT survives to the next round. The graph mutation is a single pass, predictable and reproducible from the YAML.

    """Apply confirmed merge proposals to the knowledge graph.

    For each confirmed proposal:
    1. Merge member node data into canonical node
    2. Rewrite all edges pointing to/from members to point to canonical
    3. Remove member nodes
    4. Remove self-loops created by merging

    Args:
        kg: KnowledgeGraph to modify in place
        merge_file: MergeFile with proposals (only CONFIRMED are applied)

    Returns:
        Stats dict with counts
    """
    confirmed = merge_file.confirmed
    if not confirmed:
        logger.info("No confirmed merges to apply")
        return {"merges_applied": 0, "nodes_removed": 0, "self_loops_removed": 0}

    stats = {"merges_applied": 0, "nodes_removed": 0, "self_loops_removed": 0}

    # Build full merge map: member_id → canonical_id
    merge_map: dict[str, str] = {}
    for proposal in confirmed:
        for member in proposal.members:
            if member.id != proposal.canonical_id:
                merge_map[member.id] = proposal.canonical_id