⚠️ This post links to an external website. ⚠️
When I came across Turbopuffer, a vector database built entirely on object storage, I got curious. Really curious. The architecture seemed almost too simple.
Write-ahead logs on S3? Centroid-based indexes? Stateless query nodes? It felt like someone had taken all the "rules" of database design and just... ignored them.
I wanted to understand the tradeoffs myself. Not by reading more blog posts, but by actually building something. Also, I broke up with my girlfriend, so I have nothing else to do.
We'll build a naive vector database on S3, inspired by Turbopuffer's architecture, using first principles and napkin math. We'll hit walls, make mistakes, and hopefully learn what design decisions we need to make.
The interesting part is that S3 is meant to be object storage; it isn't really designed for database operations. We need to think about updates, deletes, and managing indexes: how we store them, how we update them, and how we minimize roundtrips, because every extra roundtrip adds 200ms of latency, which your users will not like. So it's not as easy as just using S3 and turning it up to infinite scale.
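To make the roundtrip cost concrete, here is a bit of napkin math as a minimal sketch. It assumes roundtrips happen strictly one after another and uses the 200ms figure from above; the `query_latency_ms` helper and the example query plans are hypothetical, made up purely for illustration, and are not taken from Turbopuffer.

```python
# Napkin math: sequential S3 roundtrips add up linearly.
ROUNDTRIP_MS = 200  # per-roundtrip figure used in this post; real latency varies


def query_latency_ms(sequential_roundtrips: int, roundtrip_ms: float = ROUNDTRIP_MS) -> float:
    """Estimate query latency when each roundtrip waits on the previous one."""
    if sequential_roundtrips < 0:
        raise ValueError("roundtrip count cannot be negative")
    return sequential_roundtrips * roundtrip_ms


# Hypothetical query plans; the step counts are assumptions for illustration.
plans = {
    "fetch index, then fetch vectors": 2,
    "fetch manifest, then index, then vectors": 3,
}

for name, trips in plans.items():
    print(f"{name}: ~{query_latency_ms(trips):.0f} ms")
```

Under these assumptions, the point is simply that each extra dependent fetch costs another full roundtrip, so the layout of indexes and data on S3 matters more than raw compute.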
I'm not a database expert. I'm not the hardcore database guy who tweets about LSM trees and drops random MySQL facts at parties. I'm just someone who likes tinkering with things to understand how they work.
This is a learning exercise. I'm building my own version of Turbopuffer to really understand the trade-offs, the gotchas, and why certain decisions were made. If you're looking for production-ready code or groundbreaking research, this isn't it. But if you want to follow along as I figure out how to build a vector database from first principles, you're in the right place.
Continue reading on blog.karanjanthe.me