ZettelVault

CI Python 3.13 License: MIT Ruff

Transform a messy Obsidian vault into clean PARA + Zettelkasten structure using LLMs.

Most knowledge workers accumulate hundreds of notes in Obsidian over time, but the vault gradually becomes a tangled mess of long-form drafts, bullet dumps, and half-finished thoughts. ZettelVault fixes that. It reads every note from one or more source vaults, classifies each into the PARA method (Projects / Areas / Resources / Archive), then decomposes each note into atomic Zettelkasten notes - one idea per note, heavily cross-linked. Under the hood, it uses DSPy for structured LLM interaction, with dspy.RLM (Recursive Language Models) as the primary decomposition strategy.

Beyond vault restructuring, this project also serves as reference code for using dspy.RLM for document decomposition - a technique applicable to any use case where long-form documents need to be split into structured, atomic units.

Processed vault graph in Obsidian


Pipeline

The pipeline has five steps, each feeding into the next. The diagram below shows the overall flow, and the sections that follow explain each step in detail.

Source Vault(s)                                          Destination Vault
+------------------+                                      +--------------------+
| Note A (messy)   |    1. Read     2. Classify           | 1. Projects/       |
| Note B (messy)   | ---------> [PARA bucket] -------->   |    TopicA/         |
| Note C (messy)   |            [domain/tags]     |       |      Atomic Note 1 |
| ...              |                              |       |      Atomic Note 2 |
+------------------+            3. Decompose      |       | 2. Areas/          |
                                [RLM -> REPL]     |       |    TopicB/         |
                                [code analysis]   |       |      Atomic Note 3 |
                                [sub-LM calls]    |       | 3. Resources/      |
                                      |           |       |    TopicC/         |
                            4. Write  |           |       |      Atomic Note 4 |
                            ----------+---------->|       | 4. Archive/        |
                            5. Resolve links      |       | MOC/               |
                                      +---------->|       |   Domain-A.md      |
                                                          |   Domain-B.md      |
                                                          +--------------------+

Step 1: Read

The first thing ZettelVault needs to do is read every .md file from one or more source vaults. Rather than walking the filesystem and manually parsing Obsidian’s internal structures, ZettelVault delegates this to vlt.

Why vlt

vlt is a fast, zero-dependency CLI tool (a compiled Go binary) purpose-built for operating on Obsidian vaults without requiring the Obsidian desktop app, Electron, Node.js, or any network calls. It reads and writes vault files directly via the filesystem, starts in sub-millisecond time by leveraging the OS page cache, and uses advisory file locking for safe concurrent access.

ZettelVault uses vlt instead of reading files directly so that vault access stays fast, dependency-free, and consistent with Obsidian's on-disk conventions.

Multiple source vaults are supported: pass space-separated vault names and ZettelVault merges them (first vault wins on title collision). No Obsidian desktop app is required for processing, but it is recommended for viewing results.

Step 2: Classify (dspy.Predict)

Once all notes are loaded, ZettelVault classifies each one into the PARA framework, giving every note a clear place in the output vault's folder hierarchy. Each note receives a PARA bucket along with domain and tag assignments.

The classification uses a typed DSPy Signature with Literal type for PARA buckets, ensuring the model always produces a valid category.

Classification results are cached to classified_notes.json after every 50 notes for crash resilience. If the cache exists on a subsequent run, only new (uncached) notes are classified.

Step 3: Decompose (dspy.RLM with fallback)

This is the heart of the pipeline. Each classified note is decomposed into atomic Zettelkasten notes, where “atomic” means one idea per note, heavily cross-linked to its siblings. To make this reliable, ZettelVault uses a three-level fallback strategy:

  1. dspy.RLM (primary) - the model writes Python code in a sandboxed REPL to programmatically analyze the note’s structure, then generates atomic notes
  2. dspy.Predict with retry - direct LLM call with escalating temperature (0.1, 0.4, 0.7) and cache bypass
  3. Single-atom passthrough - guaranteed success; emits the original note as-is

Before decomposition begins, a concept index maps meaningful words to note titles, enabling cross-link suggestions. Each note receives a list of the most conceptually similar note titles as context, so the LLM can generate relevant wikilinks.
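A word-overlap concept index like the one described above can be sketched in a few lines of standard-library Python (the function names and scoring are illustrative, not the project's actual code):

```python
# Minimal sketch of a concept index: map meaningful words to note titles,
# then rank other notes by word overlap as cross-link candidates.
import re
from collections import defaultdict

def build_concept_index(notes: dict[str, str], min_word_len: int = 4) -> dict[str, set[str]]:
    index: defaultdict[str, set[str]] = defaultdict(set)
    for title, body in notes.items():
        for word in re.findall(r"[a-zA-Z]+", (title + " " + body).lower()):
            if len(word) >= min_word_len:   # skip short stop-word-ish tokens
                index[word].add(title)
    return index

def related_titles(title: str, body: str, index: dict[str, set[str]], top_n: int = 20) -> list[str]:
    scores: defaultdict[str, int] = defaultdict(int)
    for word in re.findall(r"[a-zA-Z]+", (title + " " + body).lower()):
        for other in index.get(word, ()):
            if other != title:
                scores[other] += 1          # one point per shared meaningful word
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:top_n]
```

The top-N related titles are then passed to the LLM as context so it can emit wikilinks to notes that actually exist.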

The output format uses ===-delimited markdown blocks (not JSON - see Design Decisions for why).

Notes that fall back to Predict or passthrough are logged to fallback_notes.json with the note title, reason, and atom count. This log can be used to selectively reprocess those notes later (see make reprocess).

Decomposition results are checkpointed to atomic_notes.json after every note. If the cache exists on a subsequent run, already-decomposed notes (matched by source title) are skipped automatically, enabling progressive processing across multiple sessions.
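The checkpoint-and-skip loop can be sketched as follows; the file name matches the README, but the helper functions and JSON layout are illustrative:

```python
# Sketch of per-note checkpointing with resume-by-source-title.
import json
from pathlib import Path

def load_checkpoint(path: Path) -> dict[str, list[dict]]:
    return json.loads(path.read_text()) if path.exists() else {}

def decompose_with_resume(notes: dict[str, str], path: Path, decompose) -> dict[str, list[dict]]:
    done = load_checkpoint(path)
    for title, body in notes.items():
        if title in done:                    # already decomposed in a prior session
            continue
        done[title] = decompose(title, body)
        path.write_text(json.dumps(done))    # checkpoint after every note
    return done
```

A crash between two notes therefore loses at most the note in flight; rerunning the same command picks up where the last checkpoint left off.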

Step 4: Write

With classification and decomposition complete, ZettelVault writes the atomic notes to the destination vault. Each note is written into its PARA/domain folder with YAML frontmatter carrying its tags, domain, subdomain, source note, and type.

A Map of Content (MOC) note per domain is generated in MOC/, containing [[wikilinks]] to all atomic notes in that domain. These MOC notes serve as navigational hubs once you open the vault in Obsidian.

The .obsidian configuration directory from the first source vault is copied to the destination (if it does not already exist), so plugins, themes, and settings carry over.

Step 5: Resolve Links

The final step cleans up the link graph. During decomposition, the LLM generates wikilinks to related notes, but not all of those targets actually exist in the output vault. ZettelVault scans the destination vault and resolves orphan [[wikilinks]] - links that point to notes that do not exist - using a four-tier strategy:

  1. Case-insensitive match: [[project planning]] resolves to Project Planning.md
  2. Fuzzy match: [[Proj Planning]] resolves to Project Planning.md if the similarity ratio is >= 0.85 (uses difflib.SequenceMatcher)
  3. Stub creation: orphan links referenced by 3+ notes get a stub note created in the most common PARA folder among referencing notes, with a “Referenced by” section
  4. Dead link removal: orphan links with only 1-2 references are stripped of brackets (converted to plain text)

The result is a clean, navigable link graph with no dangling references.

RLM vs Predict/ChainOfThought

The key insight behind this project is that document decomposition benefits enormously from programmatic analysis. Traditional dspy.Predict or dspy.ChainOfThought feed the entire document into the LLM’s context window and ask it to generate decomposed output in a single pass. dspy.RLM takes a fundamentally different approach: the document content is never loaded into the LLM’s primary context. Instead, it is stored as a variable (context) inside a sandboxed REPL environment, and the LLM writes Python code to access and process it programmatically.

This distinction matters: the LLM’s context window is used for reasoning and code generation, not for holding the document. The document lives in the execution environment as data, accessible via code. This means RLM can handle documents far larger than the model’s context window - the model only ever sees the slices it explicitly reads via code.

How RLM Works

When decomposing a note, the RLM module:

  1. Stores the note content in a context variable inside a Deno/Pyodide WASM sandbox - not in the LLM’s prompt
  2. The LLM writes Python code to access context and analyze the note’s structure (headings, bullet points, paragraphs)
  3. The code executes in the sandbox; output is captured and shown to the LLM
  4. The LLM iterates - writing more code to refine its analysis, split content, generate titles and tags
  5. For semantic tasks, the LLM calls llm_query() from within the sandbox (e.g., “what is the main idea of this paragraph?”)
  6. When done, the LLM calls SUBMIT(decomposed=...) to return the final result

Because the document is data in the REPL rather than tokens in the prompt, the model can inspect arbitrarily large documents, compute over their structure, and verify intermediate results without consuming its own context window.
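The document-as-data idea can be illustrated with a plain-Python sketch. This is a conceptual illustration only, not dspy.RLM's actual sandbox (which runs in Deno/Pyodide): the document sits in a namespace variable, "model-written" code inspects it, and only the captured (and truncated) output would be fed back to the LLM:

```python
# Conceptual illustration - not dspy.RLM's real implementation.
# The document lives as data in a namespace; only printed output
# (truncated) would reach the orchestrating LLM.
import io
from contextlib import redirect_stdout

def run_repl_step(namespace: dict, code: str, max_output_chars: int = 15000) -> str:
    buf = io.StringIO()
    with redirect_stdout(buf):
        exec(code, namespace)                    # RLM uses a WASM sandbox instead
    return buf.getvalue()[:max_output_chars]     # truncated output goes to the LLM

namespace = {"context": "# Heading A\ntext...\n# Heading B\nmore text..."}
out = run_repl_step(
    namespace,
    "print([l for l in context.splitlines() if l.startswith('#')])",
)
```

Note that the full document never appears in `out` unless the code explicitly prints it; the model sees only the slices it asks for.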

Performance Comparison

To quantify the difference, we tested both approaches on 4 source notes from an Obsidian vault, using qwen/qwen3.5-35b-a3b via OpenRouter (Parasail provider):

| Metric | dspy.Predict | dspy.RLM |
| --- | --- | --- |
| Atomic notes produced | 17 | 23 |
| Notes with fallback | 1 (25%) | 0 (0%) |
| Success rate | 75% (3/4) | 100% (4/4) |
| Avg iterations per note | n/a | 5.5 |
| Sub-LM calls (total) | n/a | 3 |
| Provider-reported cost | ~$0.04 | ~$0.08 |
| Latency per note | ~5s | ~30s |

Key findings: RLM produced more atomic notes (23 vs 17) with zero fallbacks and a 100% success rate, at roughly double the provider-reported cost and about six times the per-note latency of Predict.

When to Use RLM vs Predict

| Use Case | Recommendation |
| --- | --- |
| Structured documents (headings, lists) | RLM - programmatic splitting is more reliable |
| Short, simple notes (1-2 paragraphs) | Predict - RLM overhead isn't justified |
| Notes with complex cross-references | RLM - can programmatically match against related titles |
| High-volume batch processing (1000+ notes) | RLM with cost monitoring - 2x cost may matter at scale |
| Real-time / interactive use | Predict - 5s vs 30s latency matters for UX |
| Notes that consistently fail with Predict | RLM - its code-based approach bypasses template collisions |

Why RLM Over ChainOfThought

dspy.ChainOfThought adds a reasoning step before output generation. For decomposition tasks, this reasoning competes with the actual output for the model’s output token budget - the model spends tokens planning what it will do rather than doing it. Whether this trade-off helps depends on the task and model; we chose not to use it for decomposition because the reasoning step doesn’t produce actionable intermediate results the model can inspect or correct.

RLM is architecturally different - it doesn’t add reasoning text, it adds execution. The model writes code that runs, producing concrete intermediate results it can inspect and refine. Each REPL iteration is a feedback loop, not a one-shot preamble.

Model Comparison

Choosing the right model matters. We tested multiple open-source models as RLM orchestrators across two dimensions: quantitative metrics (cost, speed, convergence) and qualitative assessment (output quality, content fidelity, link richness). All tests used the same 4 source notes, run via OpenRouter.

These results reflect only the models we tested during the evaluation phase. Our focus during evaluation was on open-source models because many users want to run this locally via LM Studio, Ollama, or similar tools. After evaluation, we selected GLM-5 for the production run on our full vault (see Production Run (GLM-5)).

Quantitative Results (Evaluation, 4 Test Notes)

| Metric | Qwen 3.5-35B-A3B | MiniMax M2.5 | Kimi K2.5 |
| --- | --- | --- | --- |
| Atomic notes | 23 | 16 | 18 |
| RLM iterations | 21 (5.2 avg) | 17 (4.2 avg) | 15 (3.8 avg) |
| Sub-LM calls | 2 | 3 | 0 |
| LLM calls | 33 | 22 | 20 |
| Input tokens | 19K | 48K | 37K |
| Output tokens | 33K | 13K | 10K |
| Provider cost | $0.080 | $0.033 | $0.051 |
| Max iters hit | 0 | 0 | 0 |

Qualitative Assessment (Evaluation, 4 Test Notes)

Numbers only tell part of the story. We manually reviewed every atom produced by each model, comparing against the source notes for content fidelity, duplication, granularity, and cross-linking quality.

| Dimension | Qwen 3.5-35B-A3B | MiniMax M2.5 | Kimi K2.5 |
| --- | --- | --- | --- |
| Content preservation | Good - uses original phrasing | Poor - invents/paraphrases | Best - verbatim + source attribution |
| Content duplication | Near-duplicate atom pairs | Verbatim duplicate atoms | Zero duplicates |
| Granularity judgment | Over-splits (1 paragraph -> 7 atoms) | Mixed | Slightly over-splits (1 paragraph -> 4 atoms) |
| Cross-linking | Basic (links to source notes only) | Basic | Rich (links between sibling atoms + source notes) |
| Source attribution | None | None | Includes "From:" and "See also:" references |

Kimi K2.5 is the quality winner among the models we evaluated. It produces the most faithful content extraction with zero duplicates, the richest cross-link graph (it creates links between sibling atoms within the same source note, not just back to source notes), and proper source attribution. It also converges fastest (3.8 avg iterations) with the fewest total LLM calls.

Qwen 3.5-35B-A3B is a solid baseline that preserves original text well but over-splits aggressively - it will fragment a single paragraph into many single-sentence atoms, destroying the paragraph’s coherence. MiniMax M2.5 converges fast and is the cheapest, but produces duplicate content and invents text not present in the source.

Kimi K2.5’s strong showing during evaluation set the quality ceiling, and this informed our final model choice for the production run. See the next section for how GLM-5 performed at full vault scale.

Sub-LM Comparison (Qwen orchestrator, varying sub-LM)

We also tested different sub-LM models for llm_query() calls while keeping Qwen as the orchestrator:

| Metric | Same model | Liquid LFM-2 | Qwen Flash | Mercury-2 | Step-3.5-Flash |
| --- | --- | --- | --- | --- | --- |
| Atomic notes | 24 | 19 | 23 | 21 | 21 |
| RLM iterations | 26 (6.5 avg) | 23 (5.8 avg) | 42 (10.5 avg) | 42 (10.5 avg) | 42 (10.5 avg) |
| Sub-LM calls | 9 | 2 | 2 | 2 | 2 |
| Provider cost | $0.094 | $0.051 | $0.111 | $0.116 | $0.125 |
| Max iters hit | 0 | 0 | 2 | 2 | 2 |

The sub-LM currently has minimal influence - the orchestrator makes very few llm_query() calls (2-9 across 4 notes). A signature or prompt change that encourages more sub-LM usage could shift the economics of dual-model setups.

Notes on Reasoning Models

Some models (Kimi K2.5, DeepSeek R1, etc.) default to “thinking mode” where tokens go to an internal reasoning trace before producing content. This can cause DSPy to hang indefinitely - the model burns its token budget on reasoning and never emits content.

Fix for Kimi K2.5: Disable reasoning and use XMLAdapter (which matches Kimi’s post-training format):

# config.local.yaml
model:
  id: "moonshotai/kimi-k2.5"
  provider: "openrouter"
  max_tokens: 32000
  adapter: "xml"
  reasoning:
    enabled: false

The reasoning config maps directly to OpenRouter’s reasoning parameter in the request body. For Qwen 3.5, thinking mode doesn’t cause issues because it returns reasoning within the content field rather than a separate reasoning field.

Production Run (GLM-5)

After evaluating Qwen, MiniMax, and Kimi K2.5 on 4 test notes, we chose GLM-5 from Zhipu AI for the full production run. GLM-5 was accessed via z.ai's OpenAI-compatible API endpoint, not through OpenRouter.

Two factors drove this decision. First, GLM-5’s code generation capabilities proved strong enough for RLM’s REPL-driven decomposition, matching the quality ceiling that Kimi K2.5 had established during evaluation. Second, z.ai’s API endpoint is OpenAI-compatible, so no code changes were needed, just a config override pointing api_base at z.ai and setting the model ID.

Full Vault Results

The production run processed a real vault of 774 notes (merged from two source vaults containing 671 and 103 notes respectively), producing 5,805 atomic notes. Here are the key metrics:

| Metric | GLM-5 (Production) |
| --- | --- |
| Source notes | 774 |
| Atomic notes produced | 5,805 |
| Total REPL iterations | 3,981 (5.2 avg per note) |
| Sub-LM calls | 839 |
| Fallback to dspy.Predict | 35 / 769 (4.6%) |
| Runtime | ~51 hours |
| Provider cost | $0.00 (see note below) |

The 4.6% fallback rate means that 35 out of 769 notes (a few notes were filtered or deduplicated during merging) fell back from RLM to dspy.Predict. Notably, every one of those 35 fallback notes produced results that required no manual adjustment. This validates the three-level fallback strategy: even when RLM fails, Predict picks up the slack cleanly.

On cost: The cost report shows $0 because the author has a yearly subscription to z.ai at a fixed price, making the marginal cost of running GLM-5 effectively zero. If you use z.ai’s API directly without a subscription, you will incur whatever per-token costs Zhipu charges. The cost tracker itself has no pricing data for z.ai (it is not in OpenRouter’s model catalog), so it cannot calculate costs for this endpoint automatically.

Comparison to Evaluation

It is worth noting that the evaluation data (Qwen, MiniMax, Kimi K2.5) used only 4 test notes, while the GLM-5 numbers come from the full 774-note vault. The evaluation was designed to compare model quality and choose a winner; the production run was designed to process everything. The 5.2 average REPL iterations per note in production aligns closely with the evaluation results (3.8 to 6.5 range), suggesting that the evaluation was representative of real-world behavior.

Local Alternative: Qwen 3.5

Although GLM-5 was excellent for the production run, Qwen 3.5-35B-A3B is a very strong contender for running the entire pipeline locally. An MLX version is available for Mac, making it practical to process a full vault without any API costs at all. For users who want full control and privacy, this is worth considering. Qwen performed well during evaluation (see Model Comparison), and running it locally eliminates both cost and rate-limiting concerns.

Cost Tracking

ZettelVault includes a standalone cost tracking module (pricing.py) that:

  1. Fetches real-time pricing from OpenRouter’s /api/v1/models catalog at startup
  2. Tracks token usage per pipeline phase by inspecting DSPy’s lm.history
  3. Reports both calculated and provider-reported costs for accuracy

Sample Cost Report

======================================================================
COST REPORT: Qwen3.5-35B-A3B
Model: qwen/qwen3.5-35b-a3b
Pricing: $0.1625/M input, $1.3000/M output
Context window: 262,144 tokens
======================================================================
Phase                      Calls      Input     Output         Cost
----------------------------------------------------------------------
classification                 4      8,234      1,102    $0.002768
decomposition                  8     45,123     12,456    $0.023526  [22 iters, 3 sub]
----------------------------------------------------------------------
TOTAL                         12     53,357     13,558    $0.026294
TOTAL (provider)                                          $0.082341
======================================================================

The “provider” total reflects OpenRouter’s actual billing, which includes routing costs and provider markup. It is more accurate than the calculated total, which uses catalog pricing.

Cost Tracking Architecture

from pricing import CostTracker

tracker = CostTracker("qwen/qwen3.5-35b-a3b")

with tracker.phase("classification"):
    for note in notes:
        classify(note)

with tracker.phase("decomposition") as phase:
    decompose_all(notes)
    phase.rlm_iterations = total_iters   # optional RLM metrics
    phase.rlm_sub_calls = total_subs

tracker.report()

The tracker works by snapshotting the length of lm.history before each phase and summing token counts from the new entries afterwards. DSPy records usage data in two places per history entry; the tracker checks both, preferring the dict form. It also sums entry["cost"] (LiteLLM's per-call cost) for the provider-reported total.
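The snapshot-and-diff accounting can be sketched against a fake history list. The `usage` and `cost` keys mirror the README's description of DSPy history entries; the class itself is illustrative, not pricing.py's actual code:

```python
# Sketch of snapshot-based per-phase token accounting.
class PhaseAccountant:
    def __init__(self, history: list[dict]):
        self.history = history           # stands in for DSPy's lm.history

    def measure(self, fn):
        start = len(self.history)        # snapshot before the phase runs
        fn()
        new = self.history[start:]       # only entries added by this phase
        tokens_in = sum(e.get("usage", {}).get("prompt_tokens", 0) for e in new)
        tokens_out = sum(e.get("usage", {}).get("completion_tokens", 0) for e in new)
        provider_cost = sum(e.get("cost") or 0.0 for e in new)
        return tokens_in, tokens_out, provider_cost
```

Because phases only diff against their own snapshot, concurrent or nested phases would need extra care; the pipeline runs phases sequentially, so a simple length snapshot suffices.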

Known limitation: In our tests, the provider-reported cost (summed from entry["cost"]) was consistently higher than the cost we calculated from token counts. The exact cause of this delta is unclear - it may be due to DSPy internal retries, adapter overhead, or provider-side billing differences. We recommend treating the provider-reported total as the ground truth for actual spend, and the per-phase calculated values as useful for relative comparison between phases.

Design Decisions

Several design choices in ZettelVault are non-obvious and worth explaining. Each one was made after encountering a specific failure mode during development.

Why Markdown Over JSON

The decomposition output uses ===-delimited markdown blocks rather than JSON:

Title: Atomic Note Title
Tags: tag1, tag2, tag3
Links: Related Note A, Related Note B
Body:
The actual content of this atomic note...

===

Title: Another Note
...

JSON output failed consistently: a single malformed escape or truncated brace invalidates the entire payload, and note bodies full of markdown, quotes, and nested brackets are exactly the kind of content models mis-escape most often.

Markdown-delimited output is forgiving: each section is parsed independently, malformed sections are skipped, and the regex parser handles model quirks (concatenated hashtags, bracket-wrapped links, .md extensions).
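A forgiving parser for this format can be sketched with two regexes per block. This is an illustrative sketch; the project's actual parser also normalizes the model quirks mentioned above (concatenated hashtags, bracket-wrapped links, .md extensions):

```python
# Sketch of a forgiving parser for ===-delimited atom blocks.
# Malformed blocks are skipped instead of failing the whole payload.
import re

def parse_atoms(raw: str) -> list[dict]:
    atoms = []
    for block in raw.split("\n===\n"):
        title = re.search(r"^Title:\s*(.+)$", block, re.M)
        body = re.search(r"^Body:\s*\n(.*)", block, re.M | re.S)
        if not title or not body:
            continue                      # skip malformed sections independently
        tags = re.search(r"^Tags:\s*(.+)$", block, re.M)
        links = re.search(r"^Links:\s*(.+)$", block, re.M)
        atoms.append({
            "title": title.group(1).strip(),
            "tags": [t.strip().lstrip("#") for t in tags.group(1).split(",")] if tags else [],
            "links": [l.strip() for l in links.group(1).split(",")] if links else [],
            "body": body.group(1).strip(),
        })
    return atoms
```

The key property is per-block independence: one garbled block costs one atom, not the whole note, which is exactly what strict JSON parsing cannot offer.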

Preserving Wikilinks and Frontmatter

Obsidian notes have two structural elements that must survive the pipeline intact:

  1. Wikilinks ([[note title]]) - the author’s link graph. DSPy’s template system uses [[ ## field ## ]] markers that collide with wikilink syntax. Stripping them would destroy the vault’s cross-references.
  2. YAML frontmatter - metadata used by Obsidian plugins (Dataview, Templater, etc.). Properties like aliases, cssclass, publish, and custom plugin fields must not be discarded.

Wikilinks are escaped to Unicode guillemets (« / ») before sending to DSPy, and restored to [[brackets]] after parsing output. The roundtrip is lossless, including edge cases like [[Note (2024)]] and [[link|alias]].
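The guillemet round-trip can be sketched in two small functions (the real implementation lives in sanitize.py; this version only shows the idea):

```python
# Sketch of the wikilink <-> guillemet round-trip used to dodge
# DSPy's [[ ## field ## ]] template markers.
import re

def escape_wikilinks(text: str) -> str:
    # [[target]] or [[target|alias]]  ->  «target» / «target|alias»
    return re.sub(r"\[\[([^\[\]]+)\]\]", "\u00ab\\1\u00bb", text)

def restore_wikilinks(text: str) -> str:
    return re.sub(r"\u00ab([^\u00ab\u00bb]+)\u00bb", "[[\\1]]", text)
```

Guillemets work as the escape alphabet because they essentially never occur in vault content, so restoration cannot misfire on legitimate text.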

Frontmatter is extracted from the source note before sanitization via extract_frontmatter(), carried through the pipeline alongside classification data, and merged into each atomic note’s output. Generated fields (tags, domain, subdomain, source, type) take precedence; all other original properties are preserved as-is.

Three-Level Fallback

Reliability matters more than perfection when processing hundreds of notes in a batch. The fallback chain (RLM -> Predict with retry -> passthrough) ensures the pipeline never fails: every note produces output at some level, even if that output is the original note unchanged.

This means the pipeline can process an entire vault without manual intervention. Notes that fell back can always be reprocessed later with make reprocess.

Progressive Processing and Crash Resilience

Both classification and decomposition support progressive processing: classification checkpoints to classified_notes.json every 50 notes, decomposition checkpoints to atomic_notes.json after every note, and cached results are skipped automatically on subsequent runs.

This design means a crash after processing 400 of 800 notes loses at most 1 note’s work. Combined with the three-level fallback, it makes large vault migrations practical even over unreliable connections or with rate-limited APIs.

Known Limitations

There is an important distinction between what vlt preserves and what the LLM decomposition step preserves. Understanding this distinction will save you from surprises, especially if your vault relies heavily on Obsidian plugins.

What vlt preserves

vlt uses a 6-pass inert zone masking system that preserves comments and metadata during scanning. Obsidian comments (%% ... %%), HTML comments (<!-- ... -->), code blocks, inline code, and math expressions are all masked (replaced with spaces preserving byte offsets) before link and tag scanning. This means content inside these zones is never modified by vlt during read operations. Frontmatter is always preserved by vlt’s write operations (only the body is modified).

What the LLM does not preserve

ZettelVault's pipeline sends note content through an LLM for decomposition. The LLM does not have vlt's inert zone awareness. This means content embedded in note bodies (Obsidian comments, code blocks, Dataview queries, Templater snippets, and other inline plugin syntax) may be reworded, split across atoms, or dropped during decomposition.

If your vault relies heavily on Dataview queries, Templater scripts, or other plugin-generated content embedded in note bodies, be aware that this content will likely need manual restoration after decomposition. Frontmatter-based plugin data (custom properties, aliases, cssclass) will survive without issues.

Data Loss Prevention

ZettelVault processes your entire knowledge base through an LLM pipeline, so data safety is a core design concern, not an afterthought. The protections fall into five layers: preventing damage before it happens, surviving failures mid-run, keeping files safe on disk, preserving information through the LLM round-trip, and making everything observable.

Prevention: Stop Before Damage

| Mechanism | What it does |
| --- | --- |
| Dry-run mode (--dry-run, make dry-run) | Runs the full classification and decomposition pipeline but prints a sample of results and exits before writing any files to disk. |
| Sample mode (--sample) | Selects a small representative subset of notes for testing the full pipeline without processing the entire vault. |
| Limit flag (--limit N) | Processes only the first N notes, useful for bounded testing before committing to a multi-hour full run. |
| LLM output validation | Every LLM response is validated before acceptance: minimum 100 characters, must contain a real title (5+ chars), must contain a Body: field, and template garbage ({decomposed}, ## ]]) is rejected. See decompose.py:is_valid_output(). |

Resilience: Survive Failures Mid-Run

| Mechanism | What it does |
| --- | --- |
| Progressive checkpointing | Classification results are saved to classified_notes.json every 50 notes. Decomposition results are saved to atomic_notes.json after every single note. A crash after 400 of 800 notes loses at most 1 note's work. |
| Resume capability (make resume, make resume-all) | Reloads cached checkpoint data so a crashed or interrupted run continues where it left off rather than restarting from zero. |
| Three-level fallback | Decomposition uses a guaranteed-success chain: (1) RLM programmatic decomposition, (2) Predict with temperature retries, (3) passthrough that emits the original note unchanged. Every note always produces output. |
| Per-note error isolation | Exceptions during decomposition of a single note are caught, logged, and skipped. One problematic note never crashes the entire pipeline. |
| Fallback logging (fallback_notes.json) | Every note that used a degraded path (Predict or passthrough) is recorded with the reason, enabling targeted reprocessing later via make reprocess. |

File Safety: Don’t Corrupt the Filesystem

| Mechanism | What it does |
| --- | --- |
| Filename collision handling | When a note title matches an existing file, a counter suffix (_1, _2, ...) is appended instead of overwriting. See writer.py:write_note(). |
| Filename sanitization | Unsafe filename characters (such as `<`, `>`, `:`, `"`, `/`, `\`) are replaced before writing. |
| Idempotent directory creation | All mkdir calls use parents=True, exist_ok=True, making folder creation safe to repeat. |
| Configuration layering | config.yaml (defaults) is overlaid by config.local.yaml (user overrides), with deep merge preserving unset keys. Misconfiguration in one layer doesn't destroy defaults. |
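The counter-suffix collision handling can be sketched as follows (writer.py:write_note() is the real implementation; this sketch mirrors the behavior described above):

```python
# Sketch of counter-suffix filename collision handling:
# Note.md, then Note_1.md, Note_2.md, ... never overwriting.
from pathlib import Path

def unique_path(folder: Path, title: str) -> Path:
    candidate = folder / f"{title}.md"
    counter = 1
    while candidate.exists():
        candidate = folder / f"{title}_{counter}.md"
        counter += 1
    return candidate
```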

Information Preservation: Don’t Lose Content Through the LLM

This is the most critical layer. Files can be intact on disk while the information inside them has been silently mangled by the LLM. ZettelVault addresses this at every stage of the pipeline.

| Mechanism | What it does |
| --- | --- |
| Source note tracking | Every atomic note carries a source_note field linking it back to the original note title, enabling audits and reconstruction. |
| Frontmatter preservation | YAML frontmatter is extracted before LLM processing (sanitize.py:extract_frontmatter()), carried through the pipeline, and merged back into each output note. Generated fields (tags, domain, subdomain, source, type) override; all other original properties (aliases, cssclass, plugin fields) are kept as-is. |
| Wikilink round-trip | [[wikilinks]] are escaped to Unicode guillemets (« / ») before sending to DSPy (avoiding collision with DSPy's [[ ## field ## ]] template markers), then restored to [[brackets]] after parsing. The round-trip is lossless. See sanitize.py. |
| Tag preservation | Classification tags and domain assignments are attached to every atom produced from a note, whether via RLM, Predict, or passthrough. |
| Zero-loss passthrough guarantee | If both RLM and Predict fail, the passthrough fallback emits the complete original content (not the truncated version sent to the LLM) as a single atomic note with all metadata intact. It is impossible for a note to enter the pipeline and not appear in the output. |
| Input truncation transparency | Long notes are truncated to max_input_chars (default 8000) before LLM processing, but the full original content is preserved separately. The passthrough fallback always uses the full content. |
| Link resolution with stub creation | Orphan wikilinks referenced by 3+ notes get stub notes created automatically (preserving the link graph). Dead links with fewer references are removed cleanly rather than left broken. See resolve.py. |
| Classification carry-forward | PARA bucket, domain, and subdomain assignments from the classification phase are embedded in every atomic note, so organizational context is never lost even if decomposition degrades. |

Observability: Catch Problems Early

| Mechanism | What it does |
| --- | --- |
| Progress reporting with ETA | Percentage complete, elapsed time, and estimated time remaining are printed at regular intervals during both classification and decomposition phases. |
| Fallback audit trail | fallback_notes.json records which notes fell back and why, so you can assess quality and selectively reprocess. |
| Checkpoint inspection | classified_notes.json and atomic_notes.json are human-readable JSON files that can be inspected at any time, even mid-run. |

Known Gaps

These are areas where data loss is theoretically possible and not yet mitigated:

  1. Link rewriting is non-atomic. resolve.py rewrites files in-place with write_text(). A crash mid-rewrite could leave a file in a partially written state. An atomic rename pattern would close this gap.
  2. No automatic git integration. The pipeline does not create git commits before destructive operations. Running inside a git repository and committing before a run is recommended but not enforced.
  3. MOC pages are overwritten. Map of Content pages are regenerated on each run without versioning. Domain-based naming makes accidental collisions unlikely, but previous MOC content is not preserved.
  4. No multi-file transaction. Notes are written individually. There is no mechanism to atomically write all notes in a batch or roll back a partial run’s file output (though checkpoint-based resumption limits the blast radius).

Setup

Prerequisites

Installation

git clone https://github.com/RamXX/zettelvault.git
cd zettelvault
make install   # runs uv sync

# Install Deno (macOS)
curl -fsSL https://deno.land/install.sh | sh

# Set your API key
echo 'OPENROUTER_API_KEY=sk-or-...' > .env

Quickstart

# Full pipeline: read from SourceVault, write to ~/path/to/dest
make run SOURCE=MyVault DEST=~/path/to/dest

# Preview without writing files
make dry-run SOURCE=MyVault DEST=~/path/to/dest LIMIT=10

# Process multiple source vaults
make run SOURCE="VaultA VaultB" DEST=~/path/to/dest

Configuration

All parameters are in config.yaml. Copy to config.local.yaml to override without touching the tracked file (it is gitignored). You can also pass --config path/to/file.yaml on the command line.

# ── LLM Models ───────────────────────────────────────────────────────────────

# Primary model - used for classification and as the RLM orchestrator.
# Optional keys: adapter ("xml"|"json"), reasoning (passed to OpenRouter),
#                route (OpenRouter provider routing), api_base, api_key_env.
model:
  id: "qwen/qwen3.5-35b-a3b"
  provider: "openrouter"
  max_tokens: 32000
  temperature: 0.1
  # adapter: "xml"              # use XMLAdapter (recommended for Kimi K2.5)
  # api_base: "http://localhost:1234/v1"  # for local models (LM Studio, Ollama)
  # api_key_env: "MY_API_KEY"   # env var name for the API key (default: OPENROUTER_API_KEY)
  # reasoning:                  # OpenRouter reasoning params (for thinking models)
  #   enabled: false
  route:                        # OpenRouter provider routing (optional)
    only: ["Parasail"]

# Sub-LM - used inside RLM for llm_query() calls (semantic tasks).
# Can be a smaller/cheaper model to improve cost ratio.
# If not set, the primary model is used for sub-LM calls too.
sub_model:
  id: "qwen/qwen3.5-35b-a3b"
  provider: "openrouter"
  max_tokens: 32000
  route:
    only: ["Parasail"]

# ── RLM Settings ─────────────────────────────────────────────────────────────

rlm:
  max_iterations: 15          # REPL iterations before fallback
  max_llm_calls: 30           # sub-LM calls per decomposition
  max_output_chars: 15000     # truncation limit per REPL output
  verbose: false

# ── Pipeline Settings ────────────────────────────────────────────────────────

pipeline:
  max_retries: 3              # Predict fallback retry count
  max_input_chars: 8000       # content truncation for LLM input
  retry_temp_start: 0.1       # initial temperature for retries
  retry_temp_step: 0.3        # temperature increment per retry
  classify_checkpoint: 50     # save classification cache every N notes
  concept_min_word_len: 4     # minimum word length for concept index
  related_top_n: 20           # top-N related notes for decomposition

# -- Link Resolution -------------------------------------------------------

resolve:
  fuzzy_threshold: 0.85          # SequenceMatcher ratio for fuzzy wikilink matching
  stub_min_refs: 3               # orphan links with >= N references get a stub note

# -- Sampling ---------------------------------------------------------------

sample:
  size: 10                        # default number of notes for --sample
  bullet_heavy_threshold: 0.40    # fraction of bullet lines for bullet-heavy classification
  heading_heavy_threshold: 0.15   # fraction of heading lines for heading-heavy classification
  prose_heavy_threshold: 0.70     # fraction of prose lines for prose-heavy classification
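The sampling thresholds above classify each note by its dominant line shape. The sketch below illustrates the idea; the function name and exact line heuristics are assumptions for illustration, not the project's actual code:

```python
def note_shape(text: str,
               bullet_t: float = 0.40,
               heading_t: float = 0.15,
               prose_t: float = 0.70) -> str:
    """Label a note by the dominant kind of non-blank line it contains."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "empty"
    bullets = sum(ln.lstrip().startswith(("-", "*", "+")) for ln in lines)
    headings = sum(ln.lstrip().startswith("#") for ln in lines)
    n = len(lines)
    if bullets / n >= bullet_t:
        return "bullet-heavy"
    if headings / n >= heading_t:
        return "heading-heavy"
    if (n - bullets - headings) / n >= prose_t:
        return "prose-heavy"
    return "mixed"

note_shape("- a\n- b\n- c\nSome prose.")  # 3 of 4 lines are bullets -> "bullet-heavy"
```

A note only falls into a bucket when one line type clearly dominates, which is what makes the sampled set representative of the vault's different note styles.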

Local overrides merge deeply, so you only need to specify the keys you want to change:

# config.local.yaml - use Kimi K2.5 as orchestrator with reasoning disabled
model:
  id: "moonshotai/kimi-k2.5"
  adapter: "xml"
  reasoning:
    enabled: false
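The merge semantics can be sketched as follows; this is a hypothetical illustration of deep merging, not the project's actual config loader:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively overlay override onto base; nested dicts merge key-by-key,
    while scalars and lists are replaced wholesale."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"model": {"id": "qwen/qwen3.5-35b-a3b", "max_tokens": 32000}}
local = {"model": {"id": "moonshotai/kimi-k2.5"}}
deep_merge(base, local)
# {'model': {'id': 'moonshotai/kimi-k2.5', 'max_tokens': 32000}}
```

Note that `max_tokens` survives the override: only the keys you specify in config.local.yaml change.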

Configuration Reference

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `model.id` | string | `"qwen/qwen3.5-35b-a3b"` | Model ID (OpenRouter or LiteLLM format) |
| `model.provider` | string | `"openrouter"` | LiteLLM provider prefix |
| `model.max_tokens` | int | 32000 | Max output tokens |
| `model.temperature` | float | 0.1 | Base temperature for classification |
| `model.adapter` | string | none | DSPy adapter: `"xml"` or `"json"` |
| `model.api_base` | string | none | Custom API endpoint (for local models) |
| `model.api_key_env` | string | none | Env var name for API key |
| `model.reasoning` | dict | none | OpenRouter reasoning params (e.g., `enabled: false`) |
| `model.route` | dict | none | OpenRouter provider routing (e.g., `only: ["Parasail"]`) |
| `model.top_p` | float | none | Top-p sampling parameter |
| `sub_model.*` | | | Same keys as `model`, applied to the sub-LM |
| `rlm.max_iterations` | int | 15 | Max REPL iterations before fallback |
| `rlm.max_llm_calls` | int | 30 | Max sub-LM calls per decomposition |
| `rlm.max_output_chars` | int | 15000 | Truncation limit per REPL output |
| `rlm.verbose` | bool | false | Print RLM REPL traces |
| `pipeline.max_retries` | int | 3 | Predict fallback retry count |
| `pipeline.max_input_chars` | int | 8000 | Content truncation for LLM input |
| `pipeline.retry_temp_start` | float | 0.1 | Initial retry temperature |
| `pipeline.retry_temp_step` | float | 0.3 | Temperature increment per retry |
| `pipeline.classify_checkpoint` | int | 50 | Save classification cache every N notes |
| `pipeline.concept_min_word_len` | int | 4 | Minimum word length for concept index |
| `pipeline.related_top_n` | int | 20 | Top-N related notes passed to decomposition |
| `resolve.fuzzy_threshold` | float | 0.85 | SequenceMatcher ratio cutoff for fuzzy link matching |
| `resolve.stub_min_refs` | int | 3 | Minimum orphan link references before creating a stub note |
| `sample.size` | int | 10 | Default number of notes to sample |
| `sample.bullet_heavy_threshold` | float | 0.40 | Fraction of bullet lines to classify as bullet-heavy |
| `sample.heading_heavy_threshold` | float | 0.15 | Fraction of heading lines to classify as heading-heavy |
| `sample.prose_heavy_threshold` | float | 0.70 | Fraction of prose lines to classify as prose-heavy |

Dual-Model RLM (Primary + Sub-LM)

RLM decomposes documents using two distinct roles, and you can assign different models to each:

  1. Primary model (orchestrator) - writes the Python code, reasons about structure, decides how to split. This model needs to be capable enough to write correct code and understand document semantics.
  2. Sub-LM (worker) - handles llm_query() calls from within the REPL. These are simpler semantic tasks like “summarize this paragraph” or “generate tags for this content”. A smaller, cheaper model may work well here.

By setting sub_model to a smaller model, you can reduce the cost of the most frequent LLM calls (sub-queries) while keeping the orchestrator capable. This is configured via DSPy’s sub_lm parameter on dspy.RLM.
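For example, a config.local.yaml override might keep the default orchestrator and route sub-queries to a cheaper worker; the sub-model ID below is an illustrative placeholder, not a recommendation:

```yaml
# config.local.yaml - cheaper worker for llm_query() sub-queries
sub_model:
  id: "provider/cheaper-model"   # placeholder - substitute any model your provider serves
  provider: "openrouter"
  max_tokens: 8000
```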

Using Local Models

ZettelVault works with any OpenAI-compatible API. To use a local model via LM Studio, Ollama, or similar:

# config.local.yaml - local model via LM Studio
model:
  id: "my-local-model"
  provider: "openai"
  api_base: "http://localhost:1234/v1"
  api_key_env: "LOCAL_API_KEY"    # set to any non-empty string in .env
  max_tokens: 32000

Usage

Make Targets

| Target | Description |
| --- | --- |
| `make help` | Show all targets and variables |
| `make run` | Full pipeline (read -> classify -> decompose -> write -> resolve links) |
| `make dry-run` | Classify + decompose, no file writes (preview mode) |
| `make sample` | Select representative notes for pipeline preview (uses SAMPLE_SIZE) |
| `make resume` | Skip classification, reuse classified_notes.json |
| `make resume-all` | Skip classify + decompose, reuse atomic_notes.json |
| `make reprocess` | Re-run only the notes that fell back to Predict (reads fallback_notes.json) |
| `make status` | Show progress of caches and current run |
| `make clean` | Remove all caches (classified_notes.json, atomic_notes.json, fallback_notes.json, migration_log.txt) |
| `make clean-all` | Remove caches + all .md files in destination vault (preserves .obsidian) |
| `make install` | Create venv and install dependencies (uv sync) |
| `make test` | Unit tests only (no API key needed) |
| `make test-all` | Unit + integration tests (needs OPENROUTER_API_KEY) |
| `make lint` | Run ruff linter |

Override Make Variables

make run SOURCE=MyVault DEST=~/path/to/dest
make run SOURCE="VaultA VaultB" DEST=~/path/to/dest
make dry-run LIMIT=10
make run CONFIG=config.local.yaml

| Variable | Default | Description |
| --- | --- | --- |
| `SOURCE` | | Source vault name(s), space-separated |
| `DEST` | | Destination vault path |
| `LIMIT` | 0 (all) | Process only the first N notes |
| `CONFIG` | auto-detect | Path to config YAML override |
| `SAMPLE_SIZE` | 10 | Number of notes to sample |

Direct Invocation

# Full run
uv run --env-file .env -- python -m zettelvault MyVault ~/path/to/dest

# Disable RLM (use Predict only)
uv run --env-file .env -- python -m zettelvault MyVault ~/path/to/dest --no-rlm

# Process only first 10 notes
uv run --env-file .env -- python -m zettelvault MyVault ~/path/to/dest --limit 10

# Dry run (no writes)
uv run --env-file .env -- python -m zettelvault MyVault ~/path/to/dest --dry-run

# Skip classification (load from cache)
uv run --env-file .env -- python -m zettelvault MyVault ~/path/to/dest --skip-classification

# Skip decomposition (load from cache)
uv run --env-file .env -- python -m zettelvault MyVault ~/path/to/dest --skip-decomposition

# Multiple source vaults
uv run --env-file .env -- python -m zettelvault VaultA VaultB ~/path/to/dest

# Sample 5 representative notes for preview
uv run -p 3.13 -- python -m zettelvault MyVault --sample --sample-size 5

# Run pipeline on the sample
uv run --env-file .env -p 3.13 -- python -m zettelvault "$(pwd)/_sample/MyVault" ~/path/to/preview

# Custom config
uv run --env-file .env -- python -m zettelvault MyVault ~/path/to/dest --config config.local.yaml

CLI Arguments

| Argument | Description |
| --- | --- |
| `source_vault` | One or more source vault names (as known to vlt) |
| `dest_vault` | Destination vault path (absolute or `~/...`) |
| `--dry-run` | No file writes; preview only |
| `--no-rlm` | Disable RLM; use dspy.Predict for decomposition |
| `--skip-classification` | Load pre-classified notes from classified_notes.json |
| `--skip-decomposition` | Load atomic notes from atomic_notes.json (implies --skip-classification) |
| `--limit N` | Process only the first N notes (0 = all) |
| `--sample` | Select representative notes into a sample vault (no LLM calls needed) |
| `--sample-size N` | Number of notes to sample (default: 10, from config) |
| `--sample-dir PATH` | Output directory for sample vault (default: ./_sample) |
| `--config FILE` | Path to config YAML override |

Testing

make test          # Unit tests, no API key needed
make test-all      # Unit + integration tests (needs OPENROUTER_API_KEY)
make lint          # Run ruff linter

Integration tests are marked with @pytest.mark.integration and require OPENROUTER_API_KEY to be set.

Test Coverage

| Module | Tests | Coverage |
| --- | --- | --- |
| `_safe_filename` | 7 | Unsafe chars, empty input, leading dots |
| `sanitize_content` | 4 | Frontmatter stripping, wikilink escaping |
| `is_valid_output` | 6 | Length, template garbage, placeholder detection |
| `parse_atoms` | 11 | Single/multi atoms, tag normalization, link cleaning, hashtag splitting |
| `_build_content` | 6 | Frontmatter generation, domain/subdomain, tags, links, heading |
| `write_note` | 6 | PARA paths, collision handling, unsafe titles |
| `write_moc` | 4 | Domain grouping, wikilinks, deduplication |
| `vlt_run` / helpers | 4 | JSON/plain fallback, non-md filtering, error handling |
| `pricing.py` | 16 | ModelRate, PhaseUsage, API fetch, history extraction, CostTracker |
| Integration | 3 | Real LLM classification (2), decomposition (1) |

Project Structure

zettelvault/
  zettelvault/
    __init__.py         # Public API exports
    __main__.py         # CLI entry point (python -m zettelvault)
    config.py           # Configuration loading and access
    vault_io.py         # vlt CLI integration (read/list/resolve vaults)
    sanitize.py         # Content sanitization and wikilink escaping
    classify.py         # PARA classification and concept indexing
    decompose.py        # RLM/Predict decomposition and atom parsing
    writer.py           # Note and MOC file writing
    resolve.py          # Orphan wikilink resolution (Step 5)
    pipeline.py         # Pipeline class (LM init, orchestration)
    sample.py           # Representative sample vault generation
  pricing.py            # Cost tracking module (OpenRouter API + DSPy history)
  config.yaml           # Default configuration
  config.local.yaml     # Local overrides (gitignored)
  Makefile              # Build targets
  pyproject.toml        # Project metadata and dependencies
  pytest.ini            # Test markers (integration)
  .env                  # API keys (gitignored, not committed)
  .gitignore
  tests/
    test_zettelvault.py # Unit + integration tests for the pipeline
    test_pricing.py     # Unit tests for cost tracking
    conftest.py         # Shared test fixtures

Cache Files (gitignored)

| File | Contents |
| --- | --- |
| `classified_notes.json` | PARA classifications + note content |
| `atomic_notes.json` | Decomposed atomic notes |
| `fallback_notes.json` | Notes that fell back to Predict or passthrough |
| `migration_log.txt` | stdout log from the last run |

Potential Improvements

The current implementation is deliberately sequential for clarity - this is reference code meant to be read and adapted. Below are optimizations that would improve throughput for large vaults (500+ notes) without changing the pipeline logic.

Parallel classification. Each classify_note() call is a single independent dspy.Predict invocation with no shared mutable state. These can be parallelized with asyncio.gather() and a Semaphore to cap concurrency:

import asyncio
import dspy

sem = asyncio.Semaphore(4)

async def classify_one(title, content):
    async with sem:
        return await dspy.asyncify(classify_note)(title, content)

async def classify_all(notes):
    return await asyncio.gather(*[classify_one(t, c) for t, c in notes])

results = asyncio.run(classify_all(notes))

At ~15s/note sequentially, 4-way concurrency cuts classification from ~3.5 hours to ~50 minutes on a 770-note vault.

Parallel decomposition. Each decompose_note() is also independent - the concept index is read-only during decomposition. RLM spawns a Deno subprocess per call, so ThreadPoolExecutor is a better fit than asyncio:

from concurrent.futures import ThreadPoolExecutor
import json
import threading

lock = threading.Lock()

def decompose_and_checkpoint(title, data, related):
    atoms, iters, subs, method = decompose_note(title, data, related)
    with lock:  # serialize access to the shared list and cache file
        all_atomic.extend(atoms)
        ATOMIC_CACHE.write_text(json.dumps(all_atomic, indent=2))
    return atoms, method

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(decompose_and_checkpoint, t, d, r) for t, d, r in work]
    for future in futures:
        future.result()  # surface worker exceptions instead of silently dropping them

Start with 2-3 workers and increase the count if the API doesn't rate-limit you. At ~2.5 min/note, three workers reduce a 35-hour decomposition pass to ~12 hours.
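For progress reporting, concurrent.futures.as_completed yields futures as workers finish rather than in submission order. A self-contained sketch with a stand-in task (fake_decompose is a placeholder for the real decompose_note call):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_decompose(title: str) -> int:
    # stand-in for decompose_note(): pretend each note yields 2 atoms
    time.sleep(0.01)
    return 2

titles = [f"note-{i}" for i in range(5)]
done = 0
total_atoms = 0
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {pool.submit(fake_decompose, t): t for t in titles}
    for fut in as_completed(futures):
        done += 1
        total_atoms += fut.result()  # re-raises any worker exception
        print(f"[{done}/{len(futures)}] {futures[fut]} done")
```

Mapping futures back to their titles in a dict is what lets the progress line name the note that just finished.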

What cannot be parallelized: the final write and link-resolution steps. MOC generation groups atomic notes per domain, and orphan-link resolution scans the whole destination vault, so both need the complete set of atomic notes and remain sequential.

Pipeline overlap. A more advanced optimization: begin decomposing early notes while classification is still running. This requires building the concept index incrementally, which adds complexity for marginal gain since classification is the faster step.

References