How Reddit designed their metadata store to serve 100k req/sec at p99 of 17ms
- Added 27. 07. 2024
- System Design for SDE-2 and above: arpitbhayani.me/masterclass
System Design for Beginners: arpitbhayani.me/sys-design
Redis Internals: arpitbhayani.me/redis
Build Your Own Redis / DNS / BitTorrent / SQLite - with CodeCrafters.
Sign up and get 40% off - app.codecrafters.io/join?via=...
Recommended videos and playlists
If you liked this video, you will find the following videos and playlists helpful
System Design: • PostgreSQL connection ...
Designing Microservices: • Advantages of adopting...
Database Engineering: • How nested loop, hash,...
Concurrency In-depth: • How to write efficient...
Research paper dissections: • The Google File System...
Outage Dissections: • Dissecting GitHub Outa...
Hash Table Internals: • Internal Structure of ...
Bittorrent Internals: • Introduction to BitTor...
Things you will find amusing
Knowledge Base: arpitbhayani.me/knowledge-base
Bookshelf: arpitbhayani.me/bookshelf
Papershelf: arpitbhayani.me/papershelf
Other socials
I keep writing and sharing my practical experience and learnings every day, so if you resonate then follow along. I keep it no fluff.
LinkedIn: / arpitbhayani
Twitter: / arpit_bhayani
Weekly Newsletter: arpit.substack.com
Thank you for watching and supporting! It means a ton.
I am on a mission to bring out the best engineering stories from around the world and make you all fall in love with engineering. If you resonate with this then follow along, I always keep it no-fluff. - Science & Technology
Successfully ruined my upcoming weekend. Have to view all of your videos now 😢
The Kafka CDC can solve the problem of synchronous write inconsistencies, but not the backfill overwriting. I suspect they might do some kind of business logic or SHA/checksum validation to ensure they are not overwriting the data during backfilling. Correct me if I'm missing something, bro.
Weekend party ❌ Asli Engineering ✅
Hey Arpit, thanks a lot for putting this up. Your writing skills are next level, crisp and crystal clear. Could you please tell what's the setup you use for taking these notes?
Thanks in advance.
Large Data Migration -> Event Driven Architecture
Also, it's interesting to learn about Postgres's extensions, which are not required if going with a serverless database solution like DynamoDB.
Nice, very informative 👍
Thank you for doing this!
How many shards were used to hold those partitions to achieve that much throughput?
It's great that YouTube has such useful videos. Thank you, Minister!
I guess we don't need both the CDC setup and dual writes; just the CDC setup would suffice to insert the data in the new DB, correct?
Can u share the notes ? Pls
Hi Arpit, I think you could have gone a bit more in depth, like they have mentioned in their blog: a bit about how they use an incrementing post_id, which allows them to serve most queries from a single partition. Not complaining at all. Thanks for being awesome as always.
TLDR; 7 minutes seem a bit short
I deliberately skipped it because it would have taken 4 more minutes of explanation. I experimented in this video by keeping it very surface level and around the 8-minute mark, and the retention for this one is through the roof 😅
In the last 15 videos I saw a massive drop in retention numbers when I started explaining implementation nuances or when the video length went beyond 8 minutes. So I wanted to test that hypothesis in this one video.
Hence you see I did not even inject the ad for my courses or the intro. Jumped right into the topic.
But yes, given their IDs are monotonic, their batch gets for media metadata would almost always hit a single partition if they partition the data by ranges.
Would appreciate if you can make another 8 min video for details.
I am here for the meat. Surface level stuff is ok. But meat.
No complaints, just stating the opinions of a random lurker.
@@AsliEngineering so in short your focus is more on viewers than on video quality, right?
@@amoghyermalkar5502 if you really feel like it then please check out other videos on my channel.
I go into depths that other people cannot even think about or comprehend.
Remember, it hurts to put two days of effort into a video only for it to be seen by just 2000 people in 7 days.
@@AsliEngineering Absolutely. I can understand that. Without naming names, there are so many "Tech" creators getting 10x the views we see here, but they just never seem to talk about substance.
I just want you to know we do appreciate it a lot. I am still trying to read more blogs on my own, so I don't think I am being spoon-fed by watching a video.
As usual great stuff🙌🏻
Since they are storing data in JSON and also scaling the Postgres DB, why did they not go with a non-relational DB like MongoDB, which stores data in JSON and also provides scaling out of the box?
I have already answered this in some other comment, do go through that.
TLDR; operational expertise and familiarity seems to have taken precedence.
Asli Engineering!
Good video. Would appreciate it a lot if you could attach any resources you used in the video, like the Reddit blog that is mentioned. Would be great if the link were also attached in the description.
Already added in the description. You are just a scroll away from finding that out.
Hey Arpit… thanks for the video
I liked doing partitioning as a policy that runs on a cron. But wouldn't moving data around in partitions also warrant a change in the backend (read path)?
Or you are saying the backend has been written in a way that it takes partitioning into account while reading the data?
They use range-based partitioning, so no repartitioning is required. A new range every 90 million IDs.
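The range-to-partition mapping above can be sketched in a few lines. This is a toy illustration, not Reddit's code; it only assumes the 90-million-ID boundary mentioned in this thread:

```python
# Toy sketch of range-based partition routing, assuming a new partition
# every 90 million post IDs (the boundary mentioned above).
PARTITION_SIZE = 90_000_000

def partition_for(post_id: int) -> tuple[int, int]:
    """Return the [lo, hi) ID range of the partition holding post_id."""
    lo = (post_id // PARTITION_SIZE) * PARTITION_SIZE
    return lo, lo + PARTITION_SIZE

# Because IDs are monotonically increasing, recent posts cluster in the
# newest partition, so a batch read of recent posts hits one partition.
print(partition_for(10))           # (0, 90000000)
print(partition_for(95_000_000))   # (90000000, 180000000)
```

Since new partitions only ever get appended at the top of the ID range, existing rows never move, which is why no repartitioning is needed.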
How will you handle search, given the relevant data might be in partitions several days old? Even if they're using a secondary data store, date/time range-based partitioning or even sharding will not suffice. What do you think?
Why would it be several days older? The migration was a one time activity.
Post the switch, writes of media metadata always go to the unified media metadata store.
Thank you for your response, Apologies, my question was not clear.
My question was more related to searching through such a data store where the data is partitioned daily, i.e., partitioned on created_at.
Let's say you search for an 'X term' post; the result will ideally contain lookups from several partitions. For example, if there is a relevant post from a year back, we are looking at many partitions. To build the search result, the DB has to load each daily partition.
Daily partition will work well if the lookup is limited to a couple of days back. That’s my understanding.
Why didn't Reddit go for a document DB for their storage, given the structure and access pattern? What do you think about it @arpit?
In my opinion, familiarity with the stack could be the biggest reason.
Apart from this, the query pattern here is that most requests hit a single partition (given IDs are monotonically increasing and partitions are created on a range basis). Most KV stores do hash-based partitioning, because of which lookups need to be fanned out across the shards, which is quite expensive.
One database that does support range-based lookups per shard is DDB, and that managed offering becomes very expensive at scale.
These are some of the pointers I could think of. But again, this is a pure guess.
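The fan-out point above can be shown with a toy routing function. This is purely illustrative (not any real store's code): hash partitioning scatters consecutive IDs across shards, while range partitioning keeps them together:

```python
# Toy illustration of why a batch get over consecutive IDs fans out
# under hash partitioning but stays on one shard under range partitioning.
NUM_SHARDS = 4
PARTITION_SIZE = 90_000_000

def hash_shard(post_id: int) -> int:
    return hash(post_id) % NUM_SHARDS     # scatters consecutive IDs

def range_shard(post_id: int) -> int:
    return post_id // PARTITION_SIZE      # keeps consecutive IDs together

batch = range(100, 125)                   # 25 consecutive post IDs
print(len({range_shard(i) for i in batch}))  # 1  -> single-shard read
print(len({hash_shard(i) for i in batch}))   # 4  -> fan-out to every shard
```

With monotonically increasing IDs, a "latest media metadata" batch is exactly this kind of consecutive-ID read, which is why range partitioning fits the access pattern.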
Arpit - using CDC and Kafka still does not solve the problem of data from the old source overriding data in the new Aurora Postgres during the migration, right?
What am I missing?
You will still need a bulk batch job that takes all the archival data from the multiple sources and ingests it into the new Aurora. Using CDC does not solve for that backfill, correct?
CDC can transfer the historical data using snapshots of the existing databases if/when the transaction log is not available for old data, and then the consumers report any write conflicts into a separate table which the devs can remediate later on. Hope that answers your question.
The consumers of Kafka have this responsibility. It is not that just adding Kafka solved the problem. The core conflict management is written in the consumer of it which checks and sets in the database.
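The check-and-set idea in the consumer can be sketched minimally. This is a guess at the mechanism, not Reddit's code; an in-memory dict stands in for the new metadata DB, and the field names are illustrative:

```python
# Minimal sketch of check-and-set conflict handling in a CDC consumer.
# An in-memory dict stands in for the new metadata store; in a real
# system this compare-and-set would be a conditional UPDATE/UPSERT.
store = {}  # post_id -> {"updated_at": int, "meta": dict}

def apply_event(post_id, updated_at, meta):
    """Write only if this event is newer than what the store holds.

    A backfill row loses to a fresher live write, so the bulk migration
    can never clobber data that the CDC stream already delivered.
    """
    current = store.get(post_id)
    if current is not None and current["updated_at"] >= updated_at:
        return False  # stale event: log/record the conflict and skip
    store[post_id] = {"updated_at": updated_at, "meta": meta}
    return True

apply_event(1, updated_at=100, meta={"url": "a.jpg"})   # live CDC write wins
apply_event(1, updated_at=50, meta={"url": "old.jpg"})  # stale backfill, ignored
print(store[1]["meta"])  # {'url': 'a.jpg'}
```

Adding Kafka only gives ordering and replayability per key; the last-writer-wins check above is what actually prevents the backfill overwrite.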
How does PgBouncer minimize the cost of creating a new process for each request?
Maybe I am wrong; can you tell me how the cost is reduced here?
Let's say each connection spawns a new process, and killing a connection kills the process. What would you do logically? Think before reading the next line.
Simply reuse the connections. That's what every database proxy in front usually does, in simple words: the connections are reused and managed accordingly.
@@niteshlohchab9219 Thanks, bro, for the easy and simple explanation; I appreciate it. What I was thinking was that the term "cost" is used for money, but I was wrong. Here, "cost" means scalability and performance, ensuring that each client gets a response as quickly as possible. So, in terms of money, we increase the cost, and in terms of scalability and performance, we decrease the cost.
If we look at it for the long term in enterprise applications, having a scalable product also increases revenue.
Let me know if I am correct or not 🙃
Because it does connection pooling, so connections are reused.
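A toy pool makes the cost saving concrete. This is a simplified sketch (a real pooler like PgBouncer also handles auth, transactions, and timeouts); the `FakeConn` class just stands in for the expensive backend-process spawn:

```python
# Toy connection pool: the expensive "spawn a backend process" step
# happens once per pooled connection, not once per client request.
import queue

class FakeConn:
    created = 0
    def __init__(self):
        FakeConn.created += 1  # stands in for fork() + auth handshake

class Pool:
    def __init__(self, size):
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(FakeConn())
    def acquire(self):
        return self._free.get()
    def release(self, conn):
        self._free.put(conn)   # back into the pool, never closed

pool = Pool(size=2)
for _ in range(100):           # 100 "client requests"
    c = pool.acquire()
    pool.release(c)
print(FakeConn.created)        # 2 -> only two backends ever created
```

So the "cost" here is CPU and latency per request (process creation, auth round-trips), which pooling amortizes across many requests.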
Nice. The Kafka part seemed like over-engineering. They could just verify the hash before writing to the new metadata DB in the syncing phase.
Amajeeng
Thanks Arpit!! Also, what are your thoughts on using Pandas as a metadata DB? Dropbox had a post about them using Pandas wherein they explained in depth why other DBs were not better for them. (Would like to know your views on it too.)
I am not aware of this. Let me take a look.
Why are they using Postgres if they are storing it as JSON?
Stack familiarity, plus range-based partitioning support.
What is used over here to write down the notes?
GoodNotes.
Thanks looks very clean
What is the CDC mentioned here? Please suggest some pointers.
Change Data Capture
How did they check whether the reads from the old vs the new database are the same?
A simple diff would work, given that the final JSON has to be the same as no changes were made to the client.
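That dual-read diff can be sketched as follows. This is an assumption about the mechanism, not Reddit's code: serve from the old store, shadow-read from the new one, and compare canonicalized JSON so key order never causes a false mismatch:

```python
# Sketch of a dual-read comparison: diff the JSON payloads from the old
# and new stores, canonicalized so key ordering cannot cause noise.
import json

def canonical(payload: dict) -> str:
    return json.dumps(payload, sort_keys=True)

def reads_match(old_resp: dict, new_resp: dict) -> bool:
    return canonical(old_resp) == canonical(new_resp)

print(reads_match({"id": 1, "url": "a.jpg"}, {"url": "a.jpg", "id": 1}))  # True
print(reads_match({"id": 1}, {"id": 2}))                                   # False
```

At scale you would sample a fraction of reads and log only the mismatching keys, so debugging means inspecting a stream of concrete diffs rather than the whole dataset.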
@@AsliEngineering if there is any issue at scale, wouldn't it be very hard to debug?
Postgres is still the king :) With extensions it is all you need...
How can I use AI to make this sound like my native language?
Are you saying that Reddit has a unified database per region?
Unified database implies that the data that was split across multiple services has been moved to one database.
Now this unified one can be replicated across regions to improve client-side response times.
What was the motivation for Reddit to go with a dedicated database per service initially? Could you please tell how many such services they have?
Regarding your second point, the Reddit team allowed only reads from the replicated databases and not writes. Correct?
what is CDC?
Change Data Capture, i.e., streaming the database's binlog (transaction log).
Dayumm