When Keywords Aren't Enough
With lifestats, I shipped SQLite storage with FTS5 full-text search. It was a huge improvement over the jq archaeology workflow. Queries that took minutes now took milliseconds.
But something was off.
I’d search for “auth” and miss the conversation where I discussed “login flow.” I’d search for “error handling” and not find the session about “exception strategy.” The words were different. The meaning was the same.
FTS5 is literal. It finds what you said, not what you meant.
The Keyword Problem
Here’s a real example. I wanted to find a past discussion about streaming responses. The actual conversation used “SSE,” “server-sent events,” and “chunked transfer.” I searched for “streaming.”
Zero results.
graph LR
Q[Search: streaming] --> FTS[FTS5 Engine]
FTS --> M{Match?}
M -->|No| R1[❌ SSE discussion]
M -->|No| R2[❌ chunked transfer]
M -->|No| R3[❌ server-sent events]
style Q fill:#3f3f46,stroke:#71717a
style FTS fill:#6366f1,stroke:#4f46e5
style R1 fill:#ef4444,stroke:#dc2626,color:#fff
style R2 fill:#ef4444,stroke:#dc2626,color:#fff
style R3 fill:#ef4444,stroke:#dc2626,color:#fff
Keyword search fails when terminology varies
The information existed. The search couldn’t find it.
This is the fundamental limitation of keyword search. It operates on surface-level token matching. If the exact word isn’t there, it’s invisible—no matter how conceptually related the content might be.
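The failure is easy to reproduce. Here's a minimal sketch with rusqlite, assuming an SQLite build with FTS5 compiled in; the table and column names are illustrative, not lifestats' actual schema:
```rust
use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open_in_memory()?;

    // A "conversation" that is entirely about streaming without using the word.
    conn.execute_batch(
        "CREATE VIRTUAL TABLE notes USING fts5(content);
         INSERT INTO notes(content)
         VALUES ('Discussed SSE, server-sent events, and chunked transfer encoding.');",
    )?;

    // FTS5 matches tokens, not concepts: this finds nothing.
    let hits: i64 = conn.query_row(
        "SELECT count(*) FROM notes WHERE notes MATCH ?1",
        ["streaming"],
        |row| row.get(0),
    )?;
    println!("matches for 'streaming': {hits}"); // prints 0

    Ok(())
}
```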
Enter Embeddings
The solution is to search by meaning, not by words. That’s what embeddings do.
An embedding model converts text into a high-dimensional vector—a list of numbers that represents the semantic meaning of that text. Similar meanings produce similar vectors. “Streaming,” “SSE,” and “server-sent events” all land in roughly the same region of vector space.
graph TB
subgraph "Text Space (Words)"
T1["streaming"]
T2["SSE"]
T3["server-sent events"]
T4["authentication"]
end
subgraph "Vector Space (Meaning)"
V1["[0.82, -0.31, 0.44, ...]"]
V2["[0.79, -0.28, 0.41, ...]"]
V3["[0.81, -0.33, 0.42, ...]"]
V4["[-0.12, 0.67, -0.55, ...]"]
end
T1 -->|embed| V1
T2 -->|embed| V2
T3 -->|embed| V3
T4 -->|embed| V4
V1 <-.->|similar| V2
V2 <-.->|similar| V3
V1 <-.->|similar| V3
style T1 fill:#3f3f46,stroke:#71717a
style T2 fill:#3f3f46,stroke:#71717a
style T3 fill:#3f3f46,stroke:#71717a
style T4 fill:#27272a,stroke:#52525b
style V1 fill:#8b5cf6,stroke:#6366f1
style V2 fill:#8b5cf6,stroke:#6366f1
style V3 fill:#8b5cf6,stroke:#6366f1
style V4 fill:#27272a,stroke:#52525b
Embeddings map different words with similar meanings to nearby vectors
When I search for “streaming,” the query gets embedded too. Then it’s just geometry—find the stored vectors closest to the query vector. Cosine similarity. Dot products. Math that doesn’t care about exact word matches.
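That step fits in a few lines. A sketch, not aspy's actual code, with an in-memory document layout chosen for illustration; at the scale of a personal archive, a brute-force scan like this is already fast:
```rust
/// Cosine similarity: how closely two embedding vectors point in the same direction.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Rank stored documents by similarity to an embedded query and keep the top k.
fn nearest(query: &[f32], docs: &[(String, Vec<f32>)], k: usize) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = docs
        .iter()
        .map(|(id, vec)| (id.clone(), cosine_similarity(query, vec)))
        .collect();
    // Highest similarity first.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    scored.truncate(k);
    scored
}
```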
The Architecture
I added an embedding indexer that runs as a background task. It polls the lifestats database for unembedded content, batches it up, sends it to an embedding provider, and stores the resulting vectors.
graph TB
subgraph "Background Indexer"
POLL[Poll Timer] -->|every 30s| CHECK[Check for<br/>unembedded docs]
CHECK -->|batch of 10| EMB[Embedding Provider]
EMB -->|vectors| STORE[Store in SQLite]
end
subgraph "Embedding Providers"
EMB --> OPENAI[OpenAI API<br/>text-embedding-3-small]
EMB --> AZURE[Azure OpenAI]
EMB --> OLLAMA[Ollama<br/>nomic-embed-text]
EMB --> LOCAL[Local MiniLM<br/>all-MiniLM-L6-v2]
end
subgraph "Storage"
STORE --> TE[(thinking_embeddings)]
STORE --> PE[(prompts_embeddings)]
STORE --> RE[(responses_embeddings)]
end
style POLL fill:#27272a,stroke:#52525b
style CHECK fill:#3f3f46,stroke:#71717a
style EMB fill:#8b5cf6,stroke:#6366f1
style STORE fill:#10b981,stroke:#059669
style OPENAI fill:#27272a,stroke:#52525b
style AZURE fill:#27272a,stroke:#52525b
style OLLAMA fill:#27272a,stroke:#52525b
style LOCAL fill:#27272a,stroke:#52525b
Background indexer converts text to vectors across multiple provider options
The indexer is provider-agnostic. OpenAI’s API, Azure OpenAI, Ollama running locally, or a compiled-in MiniLM model—same interface, same storage format. I can switch providers without re-architecting anything.
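To make "same interface, same storage format" concrete, a provider-agnostic interface might look roughly like this; the trait and type names are hypothetical, not aspy's actual types:
```rust
use std::error::Error;
use std::time::Duration;

/// Hypothetical provider interface: a batch of texts in, one vector per text out.
pub trait EmbeddingProvider {
    /// Dimensionality of the vectors this provider returns (1536, 768, 384, ...).
    fn dimensions(&self) -> usize;

    /// Embed a batch of documents.
    fn embed(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, Box<dyn Error>>;
}

/// The indexer depends only on the trait, so OpenAI, Azure, Ollama, and the
/// compiled-in MiniLM model are interchangeable behind it.
pub struct Indexer {
    provider: Box<dyn EmbeddingProvider>,
    batch_size: usize,
    poll_interval: Duration,
}

impl Indexer {
    /// One polling cycle: embed whatever content is still missing vectors.
    pub fn run_once(&self, pending: &[String]) -> Result<Vec<Vec<f32>>, Box<dyn Error>> {
        let mut vectors = Vec::with_capacity(pending.len());
        for batch in pending.chunks(self.batch_size) {
            vectors.extend(self.provider.embed(batch)?);
        }
        Ok(vectors)
    }
}
```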
Configuration is straightforward:
[embeddings]
provider = "remote"
model = "text-embedding-3-small"
api_base = "https://api.openai.com/v1"
batch_size = 10
poll_interval_secs = 30
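Behind provider = "remote", each batch is one HTTP call to the embeddings endpoint. Here's a stripped-down, single-text sketch using reqwest's blocking client; the endpoint and response shape are OpenAI's public API, but the surrounding code is illustrative and skips batching, retries, and rate limiting:
```rust
use serde::Deserialize;

/// Subset of the OpenAI /v1/embeddings response we care about.
#[derive(Deserialize)]
struct EmbeddingResponse {
    data: Vec<EmbeddingData>,
}

#[derive(Deserialize)]
struct EmbeddingData {
    embedding: Vec<f32>,
}

/// Embed a single piece of text against the OpenAI API.
fn embed_text(api_key: &str, text: &str) -> Result<Vec<f32>, reqwest::Error> {
    let client = reqwest::blocking::Client::new();
    let resp: EmbeddingResponse = client
        .post("https://api.openai.com/v1/embeddings")
        .bearer_auth(api_key)
        .json(&serde_json::json!({
            "model": "text-embedding-3-small",
            "input": text,
        }))
        .send()?
        .error_for_status()?
        .json()?;

    // One input text, one vector back.
    Ok(resp.data.into_iter().next().map(|d| d.embedding).unwrap_or_default())
}
```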
For the initial indexing run, I embedded roughly 1 million tokens of past session data. Cost: $0.02. Time: about 15 minutes with rate limiting. Not bad for a complete semantic index of everything I’ve ever discussed with Claude.
But Keywords Still Matter
Here’s the thing—semantic search isn’t strictly better than keyword search. It’s different.
Semantic search excels at conceptual queries: “discussions about error handling” finds content about exceptions, panics, Result types, and error propagation even if none of those exact words appear in your query.
But keyword search excels at precision: searching for "unwrap()" should find exactly that function call, not conceptually similar error handling patterns.
The ideal is both.
Hybrid Search with RRF
This is where Reciprocal Rank Fusion comes in. RRF is an algorithm for combining ranked results from multiple search systems. It’s elegantly simple:
For each document, calculate:
RRF_score = Σ (1 / (k + rank_i))
Where k is a constant (typically 60) and rank_i is the document’s position in each result list. Documents that appear high in multiple lists get boosted. Documents that appear in only one list still contribute.
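That translates almost directly into code. A sketch of the fusion step, assuming each backend hands back document IDs in rank order, best first:
```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over any number of ranked result lists (k is typically 60).
fn rrf_fuse(ranked_lists: &[Vec<String>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();

    for list in ranked_lists {
        for (i, doc_id) in list.iter().enumerate() {
            let rank = (i + 1) as f64; // ranks are 1-based in the formula
            *scores.entry(doc_id.clone()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }

    // Documents that rank well in several lists accumulate the largest scores.
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    fused
}
```
No score normalization is needed, which is a big part of RRF's appeal: BM25 scores and cosine similarities live on incomparable scales, but a rank is just a rank.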
graph TB
Q[Query: streaming responses] --> SPLIT{Split}
SPLIT --> SEM[Semantic Search]
SPLIT --> KW[Keyword Search<br/>FTS5 + BM25]
SEM --> SR[Semantic Results<br/>1. SSE chunking<br/>2. Response streaming<br/>3. Async handlers]
KW --> KR[Keyword Results<br/>1. Response streaming<br/>2. HTTP responses<br/>3. Response parsing]
SR --> RRF[RRF Fusion]
KR --> RRF
RRF --> FINAL[Final Ranking<br/>1. Response streaming ⬆️<br/>2. SSE chunking<br/>3. HTTP responses<br/>4. Async handlers<br/>5. Response parsing]
style Q fill:#3f3f46,stroke:#71717a
style SEM fill:#8b5cf6,stroke:#6366f1
style KW fill:#6366f1,stroke:#4f46e5
style RRF fill:#10b981,stroke:#059669
style FINAL fill:#10b981,stroke:#059669
RRF combines semantic and keyword results—documents in both lists rise to the top
“Response streaming” appears in both result sets, so RRF boosts it to the top. “SSE chunking” only appears in semantic results but still makes the final list. The fusion captures both the conceptual matches and the literal ones.
BM25: The Keyword Side
On the keyword side, I’m using BM25 through SQLite’s FTS5. BM25 (Best Match 25) is a ranking function that improves on simple term frequency by accounting for:
- Term frequency saturation: The first occurrence of a word matters more than the tenth
- Document length normalization: Short documents with a keyword aren’t unfairly penalized against long documents
- Inverse document frequency: Rare words are more significant than common ones
graph LR
subgraph "BM25 Scoring"
TF[Term Frequency<br/>How often in doc?] --> SAT[Saturation<br/>Diminishing returns]
IDF[Inverse Doc Freq<br/>How rare overall?] --> WEIGHT[Word Weight]
DL[Doc Length<br/>Normalize for size] --> NORM[Length Norm]
SAT --> SCORE[BM25 Score]
WEIGHT --> SCORE
NORM --> SCORE
end
style TF fill:#3f3f46,stroke:#71717a
style IDF fill:#3f3f46,stroke:#71717a
style DL fill:#3f3f46,stroke:#71717a
style SCORE fill:#6366f1,stroke:#4f46e5
BM25 balances term frequency, rarity, and document length
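Written out, the per-term score looks roughly like this. It's textbook BM25 with the usual constants; FTS5's built-in implementation has its own details, so treat it as a sketch of the structure rather than SQLite's exact math:
```rust
/// BM25 contribution of one query term to one document's score.
fn bm25_term_score(
    tf: f64,      // occurrences of the term in this document
    df: f64,      // number of documents containing the term
    n_docs: f64,  // total documents in the corpus
    doc_len: f64, // length of this document, in tokens
    avg_len: f64, // average document length across the corpus
) -> f64 {
    let k1 = 1.2; // how quickly repeated terms stop adding score
    let b = 0.75; // how strongly document length is normalized

    // Inverse document frequency: rare terms carry more weight.
    let idf = ((n_docs - df + 0.5) / (df + 0.5) + 1.0).ln();

    // Saturating term frequency, normalized by relative document length.
    let tf_part = (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avg_len));

    idf * tf_part
}
```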
FTS5 handles all of this automatically. I just query with bm25() ranking and get relevance-sorted results.
SELECT content, bm25(thinking_fts) as score
FROM thinking_fts
WHERE thinking_fts MATCH 'streaming'
ORDER BY score
LIMIT 10;
Lower scores are better: FTS5's bm25() function returns negative values, and the better the match, the more negative the number, so sorting ascending puts the most relevant rows first. The results come back ranked by relevance, not just presence.
The Full Pipeline
Here’s how it all fits together at query time:
sequenceDiagram
participant User
participant MCP as MCP Tool
participant API as Aspy API
participant FTS as FTS5 Engine
participant VEC as Vector Search
participant RRF as RRF Fusion
User->>MCP: "Find discussions about<br/>error handling"
MCP->>API: GET /lifestats/context/hybrid<br/>?topic=error+handling
par Parallel Search
API->>FTS: BM25 keyword search
API->>VEC: Embed query → cosine similarity
end
FTS-->>API: Keyword results (ranked)
VEC-->>API: Semantic results (ranked)
API->>RRF: Combine with k=60
RRF-->>API: Fused ranking
API-->>MCP: Top 10 results
MCP-->>User: Structured matches with<br/>source, timestamp, preview
rect rgba(16, 185, 129, 0.1)
Note over User,RRF: ~200ms total latency
end
Hybrid search executes keyword and semantic queries in parallel, then fuses results
Both searches run in parallel. FTS5 queries the full-text index. The vector search embeds the query and finds nearest neighbors. RRF merges them. The whole thing takes about 200ms.
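The shape of that handler is roughly the following sketch: stubbed search helpers, tokio for the concurrency, and the rrf_fuse function from the earlier sketch. None of it is aspy's literal implementation:
```rust
// Stub: in the real system this runs the FTS5 query with bm25() ranking.
async fn keyword_search(_query: &str) -> Vec<String> {
    vec!["response-streaming".into(), "http-responses".into()]
}

// Stub: in the real system this embeds the query and ranks by cosine similarity.
async fn semantic_search(_query: &str) -> Vec<String> {
    vec!["sse-chunking".into(), "response-streaming".into()]
}

/// Run both searches concurrently, then fuse the two rankings with RRF.
async fn hybrid_search(query: &str) -> Vec<(String, f64)> {
    let (keyword, semantic) = tokio::join!(keyword_search(query), semantic_search(query));
    rrf_fuse(&[semantic, keyword], 60.0)
}
```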
Graceful Degradation
Not everyone wants to pay for embeddings. Not everyone has an OpenAI API key. The system handles this gracefully:
graph TD
Q[Hybrid Search Request] --> CHECK{Embeddings<br/>available?}
CHECK -->|Yes| HYBRID[Run both searches<br/>FTS5 + Vector]
CHECK -->|No| FTS[FTS5 only<br/>Keyword search]
HYBRID --> RRF[RRF Fusion]
RRF --> R1[Results<br/>search_type: hybrid]
FTS --> R2[Results<br/>search_type: fts_only]
style CHECK fill:#f59e0b,stroke:#d97706
style HYBRID fill:#10b981,stroke:#059669
style FTS fill:#6366f1,stroke:#4f46e5
style R1 fill:#10b981,stroke:#059669
style R2 fill:#6366f1,stroke:#4f46e5
No embeddings? System falls back to FTS-only with clear indication
The response includes search_type: "hybrid" or search_type: "fts_only" so Claude knows what kind of search was performed. If embeddings aren’t configured or haven’t finished indexing, keyword search still works. You lose the semantic matching but keep the core functionality.
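The response shape is easy to sketch with serde, which serializes the enum exactly as "hybrid" or "fts_only". The search_type field and the source/timestamp/preview fields come from the behavior described above; everything else is illustrative, not aspy's actual schema:
```rust
use serde::Serialize;

#[derive(Serialize)]
#[serde(rename_all = "snake_case")]
enum SearchType {
    Hybrid,  // serialized as "hybrid"
    FtsOnly, // serialized as "fts_only"
}

#[derive(Serialize)]
struct SearchResponse {
    search_type: SearchType,
    results: Vec<SearchHit>,
}

#[derive(Serialize)]
struct SearchHit {
    source: String, // e.g. "thinking", "prompt", "response"
    timestamp: String,
    preview: String,
}
```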
Real Impact
Back to the streaming example. With hybrid search:
Query: "streaming"
Results:
1. [thinking] "For streaming responses, I need to tee the SSE stream..."
2. [thinking] "The chunked transfer encoding handles backpressure..."
3. [response] "Server-sent events flow through the proxy..."
Only the first result contains the word “streaming.” The other two are about chunked transfer and server-sent events; the word is missing, but the concept isn’t. Semantic search found what I meant.
This is the difference between a search engine and a memory. Search engines find documents. Memory recalls ideas.
What’s Next
The embedding system opens up possibilities I haven’t fully explored yet:
- Clustering: Group similar discussions automatically
- Anomaly detection: Surface unusual patterns in Claude’s behavior
- Cross-session linking: Find related conversations across months of history
- Summarization triggers: Auto-summarize when themes recur
For now, hybrid search is the immediate win. When Claude loses context to compaction, it can query not just what I said, but what I meant. The vocabulary mismatch problem is solved.
The spy doesn’t just remember what you said. It remembers what you meant.
Appendix: The Numbers
For the curious, here are the actual costs and dimensions:
| Provider | Model | Dimensions | Cost per 1M tokens |
|---|---|---|---|
| OpenAI | text-embedding-3-small | 1536 | $0.02 |
| OpenAI | text-embedding-3-large | 3072 | $0.13 |
| Azure OpenAI | text-embedding-3-small | 1536 | $0.02 |
| Ollama | nomic-embed-text | 768 | Free (local) |
| Local | all-MiniLM-L6-v2 | 384 | Free (compiled in) |
Higher dimensions generally mean more nuanced semantic capture, but 1536 dimensions is plenty for context recovery. I’m using text-embedding-3-small and it handles everything I’ve thrown at it.
Storage overhead is roughly 6KB per embedded document (1536 floats × 4 bytes). For my ~5000 documents, that’s about 30MB of vector storage. Negligible.
The real cost is API calls during initial indexing. After that, it’s just new content as it arrives—a few cents per day of active usage.