Building an On-Device Vector Store for iPhone
Over the past few days, I've been exploring a simple idea for my next iOS app: an AI assistant that remembers things directly on the device. Part of this came from my own preference for privacy. I want personal data to stay on the phone whenever possible, with no servers, no cloud vector DB, and no hidden storage behind the scenes. Just a small, transparent system running locally on an iPhone.
Why store vectors on-device?
As I explored the assistant idea, it became clear that without memory it would behave like any other chatbot: reactive, forgetful, and disconnected from past conversations. Usually, memory means sending data to the cloud, storing embeddings server-side, and dealing with privacy and compliance. I wanted to see how far I could go if everything stayed on the phone.
My use case was simple: my assistant should remember what I tell it. But I quickly realized the same technique could power many other kinds of apps:
- Notes with semantic search
- Health tracker spotting symptom patterns
- Offline Q&A for a small doc set
- Photo grouping by meaning
The architecture (simple version)
The architecture ended up being much simpler than I expected: SQLite handles long‑term storage, an in‑memory index keeps searches fast, and CoreML generates embeddings on the device. Everything flows in a small loop:
- User sends a message
- Embed text on-device (CoreML)
- Store the vector in SQLite so it persists across launches
- Insert the vector into an in-memory index for fast cosine search
- For future messages, embed + query the index for relevant memories
- Send those memory matches as context to the LLM
- LLM responds with normal output plus any new structured memories
- Repeat
User Message
│
▼
┌──────────────┐
│ Tokenizer │
└──────────────┘
│
▼
┌──────────────────┐
│ CoreML Embedding │
│ (GTE-small) │
└──────────────────┘
│
├──────────────► Store in SQLite (persistent memory)
│
└──────────────► Add to In-Memory Vector Index (fast search)
│
▼
┌──────────────────────┐
│ Cosine Similarity │
│ Threshold Filtering │
│ Sorted Top Matches │
└─────────────┬────────┘
│
▼
┌────────────────────┐
│ LLM Backend │
│ (query + memory) │
└────────────────────┘
Once the flow felt clear, the next big challenge was figuring out how to generate embeddings directly on the device without a server.
Getting on-device embeddings working
When I first started this project, I honestly didn't know I could run an embedding model directly on an iPhone. I assumed embeddings had to be generated by a cloud API such as OpenAI, Hugging Face, or something I'd need a server for. But it turns out that CoreML has quietly become powerful enough to run models like GTE-small completely on-device. No internet, no round-trips, and no special hardware beyond what the phone already has. Once I realized that, the entire design of the app suddenly made a lot more sense: if the phone can generate its own embeddings, then the whole memory system can stay local.
The next step was figuring out how to actually generate those embeddings on the device, which meant converting an embedding model into a CoreML format that an iPhone can run efficiently.
I decided to use GTE-small, a lightweight 384-dimensional embedding model available on Hugging Face. It's small, fast, and accurate enough for mobile use.
The CoreML conversion process turned out to be much simpler than I expected. Here's the shortened version of the script I used:
import numpy as np
import torch
import coremltools as ct
from transformers import AutoModel

# Minimal wrapper (my assumption of the piece omitted from this shortened script):
# mean-pool the last hidden state with the attention mask into one sentence embedding.
class GTEWrapper(torch.nn.Module):
    def __init__(self, model_name):
        super().__init__()
        self.model = AutoModel.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask):
        hidden = self.model(input_ids=input_ids, attention_mask=attention_mask)[0]  # last_hidden_state
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# 1. Load and wrap the model (the Hugging Face hub id for GTE-small)
wrapper = GTEWrapper("thenlper/gte-small")
wrapper.eval()

# 2. Create example tensors for tracing
seq_len = 128
input_ids = torch.randint(0, 100, (1, seq_len))
attention_mask = torch.ones((1, seq_len))

# 3. Trace the model for CoreML conversion
traced = torch.jit.trace(wrapper, (input_ids, attention_mask), strict=False)

# 4. Convert to a CoreML ML Program
mlmodel = ct.convert(
    traced,
    inputs=[
        ct.TensorType(name="input_ids", shape=(1, seq_len), dtype=np.int32),
        ct.TensorType(name="attention_mask", shape=(1, seq_len), dtype=np.int32),
    ],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,
)

# 5. Save the result
mlmodel.save("GTE_Small_Embedding.mlpackage")
This produces a CoreML .mlpackage I can drag directly into Xcode. Once bundled with the tokenizer, the model runs entirely on-device and generates embeddings fast enough for real‑time use.
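Loading the bundled model takes only a couple of lines. Xcode generates a Swift class from the .mlpackage; I'll call it EmbeddingModel here to match the snippet below, but the real name comes from the model file:
import CoreML

// Load the bundled, compiled model
let config = MLModelConfiguration()
config.computeUnits = .all   // let CoreML choose CPU, GPU, or Neural Engine
let model = try EmbeddingModel(configuration: config)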
Generating an embedding is then just as simple:
let tokens = tokenizer.encode(text)                    // token IDs, padded to 128
let mask = tokens.map { $0 == 0 ? 0 : 1 }              // attention mask (0 = padding)
let input = try EmbeddingModelInput(input_ids: tokens, attention_mask: mask)
let output = try model.prediction(input: input)
let vector = output.embedding                          // [Float], 384 values
And now I have a vector I can store anywhere.
SQLite as the persistence layer (with GRDB)
Even though everything stays on the device, I still needed a reliable place to store vectors long-term. SQLite was the natural choice. It's fast, lightweight, and built into iOS. But instead of talking to SQLite directly, I used GRDB, a lightweight SQLite toolkit for Swift that makes database work much cleaner and safer.
GRDB gives me:
- Simple Swift structs as record models
- Migrations without boilerplate
- Safe read/write transactions
- No need to write raw SQL unless I want to
The table I ended up with is straightforward:
id TEXT PRIMARY KEY   -- UUID
text TEXT
embedding BLOB
memoryType TEXT
sensitivity TEXT
createdAt DATETIME
Each memory gets its own UUID, the original text, and the embedding blob. GRDB handles encoding and decoding, while SQLite takes care of persistence. The in-memory index performs the actual vector search, and SQLite simply ensures nothing disappears between app launches.
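In GRDB terms, that boils down to a record struct and one migration. Here's a sketch rather than my exact code; Memory, dbQueue, databasePath, and the encodeEmbedding helper are illustrative names:
import Foundation
import GRDB

// Record type mirroring the table above
struct Memory: Codable, FetchableRecord, PersistableRecord {
    static let databaseTableName = "memory"
    var id: String          // UUID string
    var text: String
    var embedding: Data     // [Float] packed into a BLOB
    var memoryType: String
    var sensitivity: String
    var createdAt: Date
}

// Pack a [Float] embedding into a BLOB before writing it
func encodeEmbedding(_ vector: [Float]) -> Data {
    vector.withUnsafeBufferPointer { Data(buffer: $0) }
}

// One migration creates the table
var migrator = DatabaseMigrator()
migrator.registerMigration("createMemory") { db in
    try db.create(table: "memory") { t in
        t.column("id", .text).primaryKey()
        t.column("text", .text).notNull()
        t.column("embedding", .blob).notNull()
        t.column("memoryType", .text)
        t.column("sensitivity", .text)
        t.column("createdAt", .datetime).notNull()
    }
}

// Open the database and write a memory inside a safe transaction
let dbQueue = try DatabaseQueue(path: databasePath)
try migrator.migrate(dbQueue)
try dbQueue.write { db in try memory.insert(db) }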
With persistence sorted out, the next step was figuring out how to search these vectors quickly.
The in-memory vector index
Once persistence worked, I needed a way to actually search those vectors. SQLite is perfect for keeping them in storage, but it won't do cosine similarity for me. The fix was a tiny in-memory index: load everything at startup, keep it warm, and answer queries instantly.
On launch, I pull every memory from SQLite, decode each embedding BLOB into a [Float], and stash an array of (id, vector) tuples. It stays lightweight and entirely in Swift.
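Decoding a BLOB back into a [Float] is just reinterpreting raw bytes. A minimal sketch, assuming the embeddings were written with the encodeEmbedding helper above:
// In-memory index, kept warm for the app's lifetime
var index: [(id: String, vector: [Float])] = []

// Reinterpret the BLOB's raw bytes as a [Float]
func decodeEmbedding(_ data: Data) -> [Float] {
    data.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }
}

// Load every memory once at launch
func loadIndex() throws {
    index = try dbQueue.read { db in
        try Memory.fetchAll(db).map { (id: $0.id, vector: decodeEmbedding($0.embedding)) }
    }
}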
The search loop is intentionally simple:
- Normalize the query. Make direction matter, not magnitude.
- Normalize each stored vector on the fly. Cosine becomes a dot product of normalized vectors.
- Score. 1.0 ≈ same idea, 0.0 ≈ unrelated.
- Rank. Sort the hits from best to worst.
- Filter. Two thresholds keep noise out: one for the top score, one for neighbors.
- Return IDs. Grab those rows from SQLite and hand them to the LLM as context.
// index: [(id: String, vector: [Float])] loaded from SQLite at launch
func topMatches(for query: [Float], in index: [(id: String, vector: [Float])],
                topK: Int, threshold: Float = 0.75) -> [String] {
    // Normalize the query inline so only direction matters
    let qNorm: [Float] = {
        let norm = sqrt(query.reduce(0) { $0 + $1 * $1 })
        return query.map { $0 / max(norm, 1e-9) }
    }()
    // Score all vectors
    var scored: [(id: String, score: Float)] = []
    for entry in index {
        let v = entry.vector
        // Normalize the stored vector inline
        let vNorm: [Float] = {
            let norm = sqrt(v.reduce(0) { $0 + $1 * $1 })
            return v.map { $0 / max(norm, 1e-9) }
        }()
        // Cosine similarity = dot product of normalized vectors
        let score = zip(qNorm, vNorm).reduce(0) { $0 + $1.0 * $1.1 }
        scored.append((id: entry.id, score: score))
    }
    // Filter out weak matches, sort best-first, and return the top IDs
    // (shortened: only the neighbor threshold is shown here)
    let filtered = scored.filter { $0.score >= threshold }
    let sorted = filtered.sorted { $0.score > $1.score }
    return sorted.prefix(topK).map { $0.id }
}
Even with brute-force cosine, a few thousand vectors search instantly in RAM. For this assistant, that's the sweet spot: easy to debug, nothing to tune, and everything stays on-device.
What if this grows?
Today, brute-force is perfect. But if I ever push past ~10k vectors, looping over everything starts to feel less instant. That's the point where approximate nearest neighbor indexes like HNSW (Hierarchical Navigable Small World graphs) make sense.
HNSW builds a layered graph so you don't scan every vector. You start high, hop toward promising regions, and drill down. Search time grows slowly while results stay close to exact. It's a natural upgrade path if this tiny Pinecone in your pocket ever needs to stretch.
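To make that concrete, here's a heavily simplified sketch of the greedy descent at the core of HNSW. It assumes the layered graph is already built and ignores everything a real implementation needs (insertion, candidate lists, pruning):
// One node in the layered graph
struct GraphNode {
    let vector: [Float]
    let neighbors: [[Int]]   // neighbors[layer] = indices reachable at that layer
}

// Squared Euclidean distance; for normalized vectors this ranks the same as cosine
func squaredDistance(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).reduce(0) { $0 + ($1.0 - $1.1) * ($1.0 - $1.1) }
}

// Start at the top layer's entry point and greedily hop toward the query
func hnswGreedyDescent(query: [Float], nodes: [GraphNode], entry: Int, topLayer: Int) -> Int {
    var current = entry
    for layer in stride(from: topLayer, through: 0, by: -1) {
        var improved = true
        while improved {
            improved = false
            for candidate in nodes[current].neighbors[layer] {
                if squaredDistance(query, nodes[candidate].vector)
                    < squaredDistance(query, nodes[current].vector) {
                    current = candidate
                    improved = true
                }
            }
        }
    }
    return current   // best node found; a real search keeps a top-k candidate set per layer
}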
And once you add something like HNSW, the system stops being just a playful experiment and starts resembling a real Vector Database running entirely on an iPhone. Fast inserts, scalable search, and approximate nearest neighbor lookup all happening locally. It works without cloud services or external dependencies; the entire vector store lives and runs inside the app.
I don't need this today, but it's good to know where the road leads if the dataset ever grows.
How this vector system fits into my app
Up to this point, everything I built (embeddings, SQLite, the in-memory index) was just the infrastructure. The real question is: how does this actually help the assistant feel more intelligent?
Here's a simple example. Suppose the user says:
I drive a Toyota Prius, and it's been running well so far.
The assistant replies normally, but the important part happens behind the scenes. The model also returns a small structured memory candidate:
memory: User owns a Toyota Prius
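How those candidates are encoded is up to the prompt; here's a rough sketch of pulling them out of the reply, assuming one memory: line per candidate:
import Foundation

// Pull structured memory candidates out of the model's reply
func extractMemories(from reply: String) -> [String] {
    reply
        .split(separator: "\n")
        .filter { $0.hasPrefix("memory:") }
        .map { $0.dropFirst("memory:".count).trimmingCharacters(in: .whitespaces) }
}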
My app then:
- Embeds that memory on-device ([0.0213, -0.0841, 0.3672, 0.1459, -0.0124, 0.0931, -0.2017, 0.0584, … 376 more values …])
- Stores it in SQLite
- Adds it to the in-memory index
Nothing leaves the device.
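Gluing those three steps together takes only a few lines. A sketch with illustrative names, where the vector comes from the CoreML call shown earlier and the memoryType and sensitivity values are placeholders:
// Persist a new memory and make it searchable immediately
func remember(text: String, vector: [Float]) throws {
    let memory = Memory(
        id: UUID().uuidString,
        text: text,
        embedding: encodeEmbedding(vector),   // pack [Float] into a BLOB
        memoryType: "fact",                   // placeholder classification
        sensitivity: "normal",                // placeholder sensitivity level
        createdAt: Date()
    )
    try dbQueue.write { db in try memory.insert(db) }   // SQLite (persistent)
    index.append((id: memory.id, vector: vector))       // in-memory index (fast search)
}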
Later, the user asks:
I think my car needs maintenance soon. What should I check?
Before sending the message to the LLM, the app:
- Embeds the new message
- Runs a cosine search against past memories
- Finds: User owns a Toyota Prius
- Sends that memory along with the query
And now the assistant can respond with more awareness of context without needing a cloud database or server-side personalization.
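That recall step is just the topMatches search from earlier plus a SQLite lookup. A sketch of how it can be folded into the prompt, with the topK value and prompt format as illustrative choices:
// Build the context block that gets sent along with the user's message
func contextPrompt(for message: String, queryVector: [Float]) throws -> String {
    let ids = topMatches(for: queryVector, in: index, topK: 3)
    let memories = try dbQueue.read { db in
        try Memory.filter(ids.contains(Column("id"))).fetchAll(db)
    }
    guard !memories.isEmpty else { return message }
    let recalled = memories.map { "- \($0.text)" }.joined(separator: "\n")
    return "Relevant memories:\n\(recalled)\n\nUser: \(message)"
}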
This is the part that made the whole vector system feel worthwhile. It's not just storing numbers; it's connecting pieces of a conversation in a way that feels natural and respectful of user privacy.
Conclusion: What I learned
Building this on-device vector database taught me a few things:
- Modern iPhones are much more capable than I assumed. Running embeddings and cosine search locally is not only possible. It's fast.
- I don't need a massive infrastructure to get useful semantic memory. A few hundred lines of Swift are enough for many real-world apps.
- Keeping everything on-device simplifies privacy dramatically. No servers, no external storage, no "trust me" messaging. Everything is local.
- A small, transparent memory system can feel more personal than a complex one.
- Most importantly, the assistant starts to behave a little more like something that remembers you rather than reacting in isolation every time.
Sometimes a simpler solution is all I need.
And that made the experiment worth it.