How Zylch works

One desktop app, a local SQLite database, and a tight loop between your LLM of choice and the conversations you already have. Here's what happens every time you ask Zylch something.

01 · Conversation loop

One cached prefix, one volatile suffix.

Every chat turn goes to the LLM with a carefully split prompt. The base system prompt, your USER_NOTES, and your USER_SECRET_INSTRUCTIONS are concatenated once and marked with Anthropic's cache_control: ephemeral. That block rarely changes, so Anthropic serves it from cache on every subsequent turn.

After the cache breakpoint, Zylch appends a small volatile suffix — today's date down to the minute, your active profile, any last-second context. Because it sits outside the cached prefix, minute-granular updates never invalidate the cache.

Tools are cached too — the entire tool schema (search memory, send email, update task, create draft…) ships inside the same cached prefix.
Conversation history is cached up to the last message; a second cache marker lands on the tail, so replayed turns only pay for the delta.
Tool results flow back as tool_result blocks. Errors don't crash the turn — they surface as status: ERROR so the LLM can retry or ask you.
Direct-response tools (today: get_tasks) skip the second LLM call and return their formatted output straight to the UI — no round-trip tax.

BYOK. Anthropic and OpenAI are supported via their native SDKs. Prompt caching + tool use work on Anthropic; OpenAI gets tool use without caching. Your key sits in ~/.zylch/profiles/<email>/.env — nothing passes through our infrastructure.

02 · Compaction

Long conversations, narrated — not dropped.

Chat context grows. At around 80,000 tokens — conservatively estimated at 4 characters per token, rounded up on purpose — Zylch compacts the conversation before the next turn. The goal: keep the thread coherent without pushing against the model's context window.

The policy is simple and visible: keep the first turn, keep the last ten. The middle gets narrated by a smaller model (Haiku) into short prose: tool calls become one-liners ("called search_local_memory for Luigi Scrosati"), tool results become a sentence summary, plain text stays as-is.

Token-based, not turn-based. A ten-turn conversation full of large email bodies will compact before a fifty-turn conversation of short messages.
First turn is sacred. It usually anchors the whole conversation; Zylch keeps it verbatim.
Recent turns are sacred. The last ten stay intact so you never get déjà vu.
Failures are silent. If the summarizer call fails, Zylch returns the original history unchanged — a compaction bug must never break a chat.

The summarizer model is pinned to claude-haiku-4-5-20251001 and can be overridden via ZYLCH_COMPACTION_MODEL.

03 · Memory

Entity-centric blobs. One record per person, per company, per template.

Most AI memory layers append chunks of text forever and search them at retrieval time. Zylch is the other way around: memory is entity-centric. A blob is a single self-contained record about one real thing — a contact, a company, a recurring template you use — kept up to date as new information arrives.

Every blob carries an id (UUID), a namespace (user:<owner> for people, prefs:<owner> for preferences), the full text, its embedding, a JSON events log of every mutation with timestamp and reason, plus created_at and updated_at.

Auto-extraction. Phase 3 of zylch update runs a memory worker in parallel over unprocessed emails and events. The LLM extracts structured facts and the worker stores them as blobs.
LLM-decided updates. Two chat tools — search_local_memory, create_memory, and update_memory — force the model to search first, then explicitly pick between updating a specific blob_id and creating a new one. The tool layer doesn't guess; silent overwrites are structurally impossible.
Sentence-level embeddings. Alongside the blob's own embedding, each sentence gets its own vector in a blob_sentences table, so search can pinpoint the exact paragraph that matches your query.
Reconsolidation. When a hybrid-search result scores above 0.65, Zylch asks the LLM to merge two overlapping blobs into one. Prompt caching keeps the merge cheap.

The result: after a few weeks of use, asking "what's going on with Luigi Scrosati?" returns a structured record of the relationship — not a keyword hit in one random email.

04 · Hybrid search

Lexical and semantic, combined in-memory with numpy.

Finding a blob is a hybrid query: a lexical match against the blob text plus a semantic match against its embedding, reranked with a weighted sum. No vector database, no cloud index.

Embeddings. sentence-transformers/all-MiniLM-L6-v2 via fastembed on ONNX runtime. 384 dimensions. No PyTorch dependency, no GPU required. The model cache lives under ~/.zylch/fastembed_cache/ so it survives OS temp cleanups.
Vector similarity. Blob embeddings are loaded into a single numpy array (shape N × 384) at startup. A query is one dot product: np.dot(q, M) / (‖q‖ · ‖M‖). Sub-millisecond for < 10,000 blobs, about 750 KB of RAM per 500 blobs.
Lexical score. Multi-term LIKE matching against the blob content field, normalized to 0–1.
Reranking. hybrid_score = α · lexical + (1 − α) · semantic, with α = 0.5 by default. Tunable per profile.

At typical single-user scale (hundreds to a few thousand contacts and templates), this beats a remote vector DB on latency and beats plain keyword search on recall. Beyond ~10 k blobs the linear scan still works but starts to warm up; adding ANN is a question for that day, not this one.

05 · Local store

One SQLite file per profile. Nothing leaves the machine.

Everything — mail, WhatsApp threads, phone-call transcripts, calendar events, tasks, blobs — lives in a single SQLite database at ~/.zylch/profiles/<email>/zylch.db. WAL mode for crash safety, foreign keys enforced, each session auto-commits on success and rolls back on exception.

~19 ORM models cover the essentials: Email, Thread, Blob, BlobSentence, TaskItem, WhatsAppMessage, WhatsAppContact, MrcallConversation, CalendarEvent, OAuthToken, and friends.
Profile isolation. Every email address is a separate profile directory, each with its own .env, its own zylch.db, and its own profile.lock. Multiple profiles can run side by side in different Electron windows.
fcntl-based liveness. The lock is a POSIX advisory lock, not a PID check — the OS releases it automatically if the process crashes, so you never end up with a stale "profile already in use" error.
Credential handling. API keys (Anthropic, OpenAI, IMAP app passwords, Telegram, MrCall) sit in the profile .env with 600 permissions. If you set ENCRYPTION_KEY, OAuth credentials stored in the DB are Fernet-encrypted at rest.

Nothing syncs to our cloud. The desktop app talks to your IMAP server, your WhatsApp account, your MrCall instance, and your chosen LLM provider — directly. The only network traffic we see from your machine is the traffic you sent, to the services you chose.

The code is open — CLI and sidecar at github.com/malemi/zylch, desktop wrapper at github.com/malemi/zylch-desktop. Ready to try it? Get the installer.