sift-kg Python juanceresa/sift-kg

Getting Started with Your First Contribution to sift-kg

Install the package, run the CLI, find the test convention, and locate the YAML extension point where most non-code contributions actually land.

6 stops ~16 min Verified 2026-05-04

What you will learn

How to install sift-kg for development with the right extras
The CLI surface and which command makes the cleanest smoke test
Where tests live, what a representative test looks like, and how to invoke pytest
Where bundled domain YAML files live and how new domains plug in
Which existing tours to read next once you can build and run the project

Prerequisites

Python 3.11+ with pip and a working virtualenv
A LiteLLM-compatible API key if you want to run the full pipeline (OpenAI, Anthropic, Mistral, or local Ollama)

1 / 6

Quick Start From the README

README.md:11

The whole workflow is nine CLI verbs: sift init plus eight pipeline commands. The README shows them in execution order so first-time users have a runnable baseline before reading any code.

A first contribution starts with running the thing. The README opens with an eleven-line shell block that takes you from pip install to a browseable graph, and each sift line in it maps one-to-one to a Typer command in src/sift_kg/cli.py. sift init writes a sift.yaml and .env.example in the current directory, which makes it the cheapest possible smoke test: it needs no LLM key and no documents, and it confirms that the binary is on your PATH and the package imports cleanly.

The order matters. extract writes JSON per document, build turns that JSON into a NetworkX graph, resolve proposes merges, review walks them, apply-merges mutates the graph. Reading the README block is faster than reading any code, and it tells you which file to open when something breaks.

Key takeaway

Install with pip install sift-kg and run sift init first. If that prints the Created sift.yaml message, the install is correct.

```bash
pip install sift-kg

sift init                           # create sift.yaml + .env.example
sift extract ./documents/           # extract entities & relations
sift build                          # build knowledge graph
sift resolve                        # find duplicate entities
sift review                         # approve/reject merges interactively
sift apply-merges                   # apply your decisions
sift narrate                        # generate narrative summary
sift view                           # interactive graph in your browser
sift export graphml                 # export to Gephi, yEd, Cytoscape, SQLite, etc.
```
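The smoke test can also be scripted. The sketch below is a hypothetical helper, not part of sift-kg: it only checks that a sift executable is on PATH and that the sift_kg package can be located, which is the same thing sift init confirms implicitly.

```python
import importlib.util
import shutil


def check_install(command: str = "sift", package: str = "sift_kg") -> dict[str, bool]:
    """Report whether the CLI is on PATH and the package can be located."""
    return {
        # shutil.which returns None when no executable with that name is found.
        "cli_on_path": shutil.which(command) is not None,
        # find_spec returns None when a top-level package is not installed.
        "package_importable": importlib.util.find_spec(package) is not None,
    }


if __name__ == "__main__":
    for check, ok in check_install().items():
        print(f"{check}: {'ok' if ok else 'MISSING'}")
```

If either check reports MISSING, the usual cause is running outside the virtualenv where the package was installed.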

2 / 6

Dev Install and the pytest Configuration

pyproject.toml:54

The dev extras list pytest and ruff, and the same file pins the test discovery rules, so installing with [dev] gives you everything needed to run the suite.

Two pieces of the file matter on day one. The dev extras tell you exactly what to install: pytest>=8.0 and ruff>=0.9.0. After cloning, run pip install -e ".[dev]" from the repo root and you have the editable package plus the test runner plus the linter. The [project.scripts] entry resolves the sift command to sift_kg.cli:app, which is why pip install -e is enough with no separate console-script build step.

Test discovery is pinned in [tool.pytest.ini_options] further down the same file: tests live under tests/, files start with test_, classes start with Test, functions start with test_. Run pytest from the repo root with no arguments and it finds everything.
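The discovery rules above can be sketched as a small predicate. This is a simplified illustration of the naming convention, not pytest's actual collector, and the class and function names in the usage lines are hypothetical.

```python
from __future__ import annotations


def would_collect(filename: str, function_name: str, class_name: str | None = None) -> bool:
    """Return True if a test with these names matches the naming rules described above."""
    # Files must start with test_ (and be Python files).
    if not (filename.startswith("test_") and filename.endswith(".py")):
        return False
    # Classes, when present, must start with Test.
    if class_name is not None and not class_name.startswith("Test"):
        return False
    # Functions and methods must start with test_.
    return function_name.startswith("test_")


if __name__ == "__main__":
    # Hypothetical names, used only to exercise the rules.
    print(would_collect("test_example.py", "test_merges", "TestExample"))
    print(would_collect("test_example.py", "_make_extraction", "TestExample"))
    print(would_collect("example_test.py", "test_merges"))
```

Helper methods such as a leading-underscore builder are skipped by the last rule, which is why they can live on the test class without being run as tests.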

Key takeaway

Install with pip install -e ".[dev]" from the cloned repo, then run pytest. The setup needs no extra config files and no path tweaks.

```toml
dev = [
    "pytest>=8.0",
    "ruff>=0.9.0",
]
all = [
    "sift-kg[embeddings]",
    "sift-kg[ocr]",
]

[project.scripts]
sift = "sift_kg.cli:app"
```
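The value "sift_kg.cli:app" uses the standard module:attribute form for console scripts. A rough sketch of how such a string is resolved follows. resolve_entry_point is a hypothetical helper, not part of sift-kg or pip, and the launcher generated at install time does more than this (it also calls the resolved object and handles the exit code).

```python
from importlib import import_module


def resolve_entry_point(spec: str):
    """Resolve a 'module:attribute' string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    obj = import_module(module_name)
    # The attribute part may be dotted, e.g. "pkg.mod:Class.method".
    for attr in attr_path.split("."):
        if attr:
            obj = getattr(obj, attr)
    return obj


if __name__ == "__main__":
    # Demonstrated on the standard library so it runs without sift-kg installed.
    dumps = resolve_entry_point("json:dumps")
    print(dumps({"ok": True}))
```

With the editable install in place, the same lookup applied to "sift_kg.cli:app" would return the Typer app object that the sift command invokes.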

3 / 6

The CLI Entry: sift init

src/sift_kg/cli.py:1015

sift init is the first command a contributor runs and the simplest one to read; it writes two files and prints next steps.

Reading code is faster than reading docs once you know which file to open. cli.py is one Typer app with a function per verb. init is the easiest entry to read: no LLM client, no async, no NetworkX, just two file writes and a Rich panel of next steps. If your first contribution is a doc tweak or a config flag, this is the function whose pattern you copy.

The shape repeats across every verb in the file. Decorate with @app.command(), take typed Typer options, do the work, exit cleanly. Once you have read this one and the matching command in pipeline.py, the pattern for adding or modifying a CLI verb is set.

Key takeaway

Read init first. Every other CLI verb in cli.py follows the same Typer-decorator-then-do-the-work shape.

```python
@app.command()
def init(
    domain: str | None = typer.Option(
        None, help="Path to custom domain YAML to set in project config"
    ),
) -> None:
    """Initialize a new sift-kg project in the current directory."""
    env_example_path = Path(".env.example")
    sift_yaml_path = Path("sift.yaml")

    # Create .env.example
    if not env_example_path.exists() or typer.confirm(
        "Overwrite existing .env.example?", default=False
    ):
        env_template = """# sift-kg Configuration
# Copy this file to .env and fill in your API keys

# === LLM API Keys ===
# At least one required. Ollama needs no key (local models).
SIFT_OPENAI_API_KEY=
SIFT_ANTHROPIC_API_KEY=
```
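The guard in init (write the file unless it already exists and the user declines to overwrite it) can be isolated as a small function. This is a minimal sketch of that one pattern with a hypothetical helper name. The confirmation prompt is passed in as a plain callable so the example runs without Typer.

```python
from pathlib import Path
from typing import Callable


def write_unless_kept(
    path: Path,
    content: str,
    confirm_overwrite: Callable[[str], bool],
) -> bool:
    """Write content to path; return True if the file was written.

    The file is written when it does not exist yet, or when it exists and
    confirm_overwrite returns True. The prompt is only asked for existing files.
    """
    if not path.exists() or confirm_overwrite(f"Overwrite existing {path.name}?"):
        path.write_text(content, encoding="utf-8")
        return True
    return False


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / ".env.example"
        print(write_unless_kept(target, "# first\n", lambda _: False))   # new file
        print(write_unless_kept(target, "# second\n", lambda _: False))  # declined
        print(target.read_text(encoding="utf-8"))
```

Because `or` short-circuits, the prompt never appears on a fresh directory, which is what keeps sift init non-interactive the first time it runs.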

4 / 6

The Test Convention

tests/test_prededup.py:73

Tests are plain pytest classes with small helper builders. New tests look exactly like the existing ones.

The convention is small inputs, plain asserts, no mocks unless an LLM is involved. _make_extraction at the top of the class builds a DocumentExtraction from a list of (name, type) tuples, and every test reuses it. The function under test, prededup_entities, returns a dict, so assertions take the form len(result) == 0 or len(result) >= 1. No fixtures from conftest.py, no async, no parametrize. New behavior gets a new method on the relevant Test class.

The same pattern applies across test_graph.py, test_resolve.py, and test_export.py. Anything that touches an LLM lives in test_llm_client.py or test_extract.py with explicit mocks. If you fix a bug, find the matching test module and add a method to the relevant class, with a name that describes the fix.

Key takeaway

One Test* class per module under test, one method per behavior. Add yours next to the existing siblings.

```python
    def test_exact_duplicates_merged(self):
        """Same name appearing twice should not produce a mapping (same canonical)."""
        ext = self._make_extraction([
            ("Alice Smith", "PERSON"),
            ("Alice Smith", "PERSON"),
        ])
        result = prededup_entities([ext])
        assert len(result) == 0

    def test_case_variants_merged(self):
        """Case-only differences should merge."""
        ext = self._make_extraction([
            ("Alice Smith", "PERSON"),
            ("alice smith", "PERSON"),
        ])
        result = prededup_entities([ext])
        assert len(result) >= 1
```
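To show the shape of a new test without depending on the project, here is a self-contained sketch in the same style. dedup_names is a hypothetical stand-in for the function under test, not prededup_entities itself, and the class and helper names are illustrative.

```python
def dedup_names(names: list[str]) -> dict[str, str]:
    """Hypothetical stand-in: map each case variant to the first spelling seen."""
    canonical_by_key: dict[str, str] = {}
    mapping: dict[str, str] = {}
    for name in names:
        first_seen = canonical_by_key.setdefault(name.casefold(), name)
        if first_seen != name:
            mapping[name] = first_seen
    return mapping


class TestDedupNames:
    def _make_names(self, *names: str) -> list[str]:
        """Small in-class builder, reused by every test method."""
        return list(names)

    def test_exact_duplicates_produce_no_mapping(self):
        """Identical names share a canonical form, so nothing is remapped."""
        result = dedup_names(self._make_names("Alice Smith", "Alice Smith"))
        assert len(result) == 0

    def test_case_variants_merged(self):
        """Case-only differences map to the first spelling seen."""
        result = dedup_names(self._make_names("Alice Smith", "alice smith"))
        assert result == {"alice smith": "Alice Smith"}

    def test_distinct_names_untouched(self):
        """A new behavior gets its own method next to its siblings."""
        result = dedup_names(self._make_names("Alice Smith", "Bob Jones"))
        assert len(result) == 0
```

Saved as a file whose name starts with test_, this would be collected by a bare pytest run under the discovery rules described earlier.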

5 / 6

The Extension Point: Bundled Domain YAML

src/sift_kg/domains/bundled/academic/domain.yaml:37

Most non-code contributions land here: a YAML file declaring entity types and relation types for a new field.

The interesting place to contribute is rarely the Python. src/sift_kg/domains/bundled/ ships four domains: schema-free, general, osint, academic. Each is a single YAML file declaring entity types (with description and extraction_hints) plus relation types (with source_types and target_types). The same YAML format works for any custom domain you point at with the --domain flag.

If you work in genealogy, legal review, scientific literature outside the bundled set, or any field where the corpus has its own vocabulary, a domain YAML is the contribution. DomainLoader in src/sift_kg/domains/loader.py reads the file, and the LLM prompt builder in src/sift_kg/extract/prompts.py consumes the descriptions and hints directly. No code change is needed to ship a new bundled domain: add a directory under bundled/ with a domain.yaml inside.

Key takeaway

New domains are YAML, not Python. Drop a domain.yaml under src/sift_kg/domains/bundled/<your-domain>/.

```yaml
entity_types:
  CONCEPT:
    description: "Core ideas, constructs, variables, and technical terms central to the research area"
    extraction_hints:
      - "Look for defined terms, key variables, and constructs that papers build arguments around"
      - "Include both broad concepts (e.g. 'cognitive load') and specific operationalizations"
      - "Capture the definition or description when the text provides one"
      - "Use CONCEPT for ideas that don't have a named theoretical framework — if it has a proper name and makes predictions, use THEORY instead"

  THEORY:
    description: "Named theoretical frameworks, models, paradigms, and schools of thought"
    extraction_hints:
      - "Look for named theories, frameworks, and models (e.g. 'Actor-Network Theory', 'Dual Process Model')"
      - "Include paradigms and schools of thought (e.g. 'constructivism', 'positivism')"
      - "Note the originator or key proponents when mentioned"
      - "Must have a proper name — 'Cognitive Load Theory' is a THEORY, 'cognitive load' is a CONCEPT"

  METHOD:
    description: "Research methods, techniques, analytical approaches, and tools"
```
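Before submitting a new domain, it can help to check that every relation only references entity types the file declares. The sketch below works on an already-parsed dict so it needs no YAML library. It is a hypothetical pre-flight check, not the project's DomainLoader, and both the top-level relation_types key and the genealogy example are assumptions made for illustration.

```python
from __future__ import annotations


def validate_domain(domain: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: list[str] = []
    entity_types = domain.get("entity_types", {})

    for name, spec in entity_types.items():
        if not spec.get("description"):
            problems.append(f"entity type {name} has no description")

    # Assumption: relations sit under a top-level 'relation_types' key.
    for rel_name, spec in domain.get("relation_types", {}).items():
        for field in ("source_types", "target_types"):
            for type_name in spec.get(field, []):
                if type_name not in entity_types:
                    problems.append(
                        f"relation {rel_name}: {field} references "
                        f"unknown entity type {type_name}"
                    )
    return problems


# Hypothetical genealogy domain, shaped like the parsed YAML.
GENEALOGY = {
    "entity_types": {
        "PERSON": {
            "description": "Individuals named in records",
            "extraction_hints": ["Look for full names with dates"],
        },
        "PLACE": {"description": "Towns, parishes, and regions"},
    },
    "relation_types": {
        "BORN_IN": {"source_types": ["PERSON"], "target_types": ["PLACE"]},
    },
}

if __name__ == "__main__":
    print(validate_domain(GENEALOGY) or "no problems found")
```

A check like this catches typos in type names early, before an extraction run spends LLM calls on a schema that cannot be satisfied.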

6 / 6

Where to Read Next

README.md:391

The README documents the output directory layout. After your first pipeline run, this is the map for everything you might want to fix.

The output directory is the contract between every CLI verb. extract writes extractions/*.json, build reads those and writes graph_data.json, resolve reads the graph and writes merge_proposals.yaml with status DRAFT, review mutates that YAML, apply-merges reads the CONFIRMED entries and updates graph_data.json in place. Every file in this layout has one writer and one or more readers, and most bugs are at those boundaries.

For end-to-end execution detail, read the Pipeline tour. For the four-layer dedup design that defines the project, read the Entity Resolution tour. Both are anchored to this same commit, so their line numbers match what you cloned.

Key takeaway

The output layout is the data contract between verbs. Read the Pipeline tour for the code path, the Entity Resolution tour for the design.
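Because each file has a single writer, the presence or absence of a file tells you how far a run got. The sketch below is a hypothetical diagnostic, not part of sift-kg. It covers only the three writers named above and checks file presence, not file contents or proposal status.

```python
from pathlib import Path


def next_verb(output_dir: Path) -> str:
    """Guess which CLI verb to run next from the files present in output_dir."""
    extractions = output_dir / "extractions"
    # extract writes extractions/*.json
    if not extractions.is_dir() or not any(extractions.glob("*.json")):
        return "extract"
    # build reads the extractions and writes graph_data.json
    if not (output_dir / "graph_data.json").exists():
        return "build"
    # resolve reads the graph and writes merge_proposals.yaml
    if not (output_dir / "merge_proposals.yaml").exists():
        return "resolve"
    # From here on, review and apply-merges operate on existing files.
    return "review"


if __name__ == "__main__":
    print(next_verb(Path("output")))
```

When a command fails, the same reasoning points at the boundary to inspect: the last file that exists was written correctly enough to be created, and the first missing one names the verb that broke.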

## Project Structure

After running the pipeline, your output directory contains:

```
output/
├── extractions/               # Per-document extraction JSON
│   ├── document1.json
│   └── document2.json
├── discovered_domain.yaml     # Auto-discovered schema (schema-free mode)
├── graph_data.json            # Knowledge graph (native format)
├── merge_proposals.yaml       # Entity merge proposals (DRAFT/CONFIRMED/REJECTED)
├── relation_review.yaml       # Flagged relations for review
├── narrative.md               # Generated narrative summary
├── entity_descriptions.json   # Entity descriptions (loaded by viewer)
```
