They optimized it even more by not updating that record so often, they would cache it in Redis in batches and only have 1 update to the central database for every 1000 likes or so. That would be distributed across servers. So you might get 5 updates per minute saying: "A thousand more please." That "thousand" number was also based on how often the like button was pressed for that post. Lots of likes? It would update every 5k likes. Slowing down? It would stop caching.
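A minimal sketch of that batching idea in Python (the class, callback, and thresholds are all made up for illustration; a real deployment would keep the pending counts in Redis rather than an in-process dict):

```python
from collections import defaultdict

class LikeBatcher:
    """Accumulates likes in memory (a stand-in for Redis) and flushes
    to the central database only once a per-post threshold is reached."""

    def __init__(self, flush_db, base_threshold=1000):
        self.flush_db = flush_db          # callback: flush_db(post_id, delta)
        self.base_threshold = base_threshold
        self.pending = defaultdict(int)   # post_id -> unflushed like count
        self.rate = defaultdict(int)      # crude popularity signal

    def threshold_for(self, post_id):
        # Hotter posts batch more aggressively (e.g. every 5k instead of 1k).
        return self.base_threshold * 5 if self.rate[post_id] > 10_000 else self.base_threshold

    def like(self, post_id):
        self.pending[post_id] += 1
        self.rate[post_id] += 1
        if self.pending[post_id] >= self.threshold_for(post_id):
            self.flush_db(post_id, self.pending.pop(post_id))

flushed = {}
batcher = LikeBatcher(lambda p, d: flushed.__setitem__(p, flushed.get(p, 0) + d),
                      base_threshold=3)
for _ in range(7):
    batcher.like("post1")
# 7 likes with threshold 3: two flushes of 3, one like still pending
```

The point is that the hot database row sees one UPDATE per batch, not per like.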
This wouldn't be a long-term solution on its own. The solution would be: up to a certain number the query runs conventionally, and only when the threshold is breached does the application write the count to the column. That covers the common case and saves CPU cycles.
Okay, I guess the reason would be that if you add an extra column for likes, you first need to find the row and then update it, whereas with a separate table you just add a new row: no lookup needed, saving some overhead. But as shown in the video, the shortcomings outweigh the benefits.
Instagram no longer uses a relational database for these sorts of things, i believe this specific case would have changed around 2017 but for sure by 2019
Normalization is great in theory, and as a diagram on paper. But you shouldn't take it all the way to 3NF or BCNF if you intend to use a real-world RDBMS in a real-world application... like... ever.
I'm assuming that you're still going to log that a user liked the post, because otherwise there is nothing stopping someone from hitting that endpoint over and over again if there is no validation.
Oh ok! I figured since we’re solving a problem, when the problem was essentially every time a celebrity uploaded causing the infrastructure to crash. My problem-solving skills kicked in. Immediately arriving at, we’ll If we remove celebrities from the equation problem solved! You could apply the solution to solve our problem here in the video, and in the real world! Bada bing bada BOOM, Bob’s your uncle! Isn’t coding fun!
1) Counting rows in a relational database is one of the fastest possible ways to count something. 2) You're missing the half of this problem where millions of people are requesting the count (as well as increasing it themselves). 3) A better term for fixing this problem is "materialization", not "denormalization".
I’ve found that generic statements about SQL optimization/performance are not very useful in practice. There are numerous things that affect the performance of a relational DB, not least of which is the specific DBMS; e.g. MySQL with InnoDB optimizes count(*) to use an index scan, which is quite performant.
The exact number of likes is not business-critical for large numbers. A new, time-stamped entity, 'estimated number of likes', does *not* denormalize the database. All it takes is a semantic confession that the cache is not necessarily up to date.
The better option for this is to use an optimistic update on the frontend side and send events using SNS topics, SQS... whatever... and if you want better results, process those events in batches for celebrity users like JB.
It's such a simple thing, but a franchise-level industry still won't take it as secure or even consider it as an option.
Because you can easily get into scenarios where "total_likes" doesn't actually reflect the number of likes
And no longer check who liked the post, just the number of likes. Meaning the list of who liked what would need to be kept somewhere else.
@@carlosmspk If posts couldn't be un-liked it would be a lot simpler
Race Conditions be like 🗿
@@crysist13 The idea is not to stop adding tuples to the "Likes" table, it's just that you don't count them for displaying total likes. That part wouldn't be an issue, you could still get who liked what
For people saying displaying incorrect numbers of likes, youtube does view count, the transactions are queued, and race conditions are not a big deal from a users perspective
Yeah I can’t believe people are arguing about accuracy lmao. Who cares. Goes to show you most programmers are GEEKS who can’t actually problem solve
What do you mean race conditions are not a big deal from a user's perspective? I'm pretty sure the reason is that they value availability for certain parts of YouTube (like the core system and like count) and consistency for other parts (like the comment system). When a user visits a video, it hits a node (preferably the closest node), and that node might not have the most up-to-date information, which means it's eventual consistency.
@@gabrielpedroza1346 users are fine with eventually consistent like counts
@@gabrielpedroza1346 it doesn't matter if a video gets 1001 views or 1000 views; the discrepancy is because of "eventual consistency", so it will eventually match the count even if it is updated on other shards.
The benefits outweigh the drawbacks so much that big companies are using that. I don't see you being an architect there.
@@frenchButt44 Idk exactly how YT works, but does it go like "you press like, the client-side interface updates what should be the eventual number, and then queues a transaction to update the DB"?
Eventual consistency is a fair trade off for performance.
But ONLY for data/rows/fields that aren't mission critical...such a "likes" counts! 😁 If you're writing an airline reservation system or stock trading program "eventual" doesn't really cut it.
@@oggassaggaoggaffa No, this is why atomic transactions exist 😬
Probably still not consistent when counting the rows, as you need to lock the table
Eventual consistency is only okay for this model, one producer many consumers, where the content isn't liable to ever change much. Using it for something with state that changes a lot over time is always a terrible idea imo
@@Fennecbutt Yes.
The fact that they even counted shows they never thought that thing would scale to more than 100k users
you don't optimize prematurely when building a startup
@@marcialabrahantes3369 this guy gets it
@@marcialabrahantes3369 When optimization is literally just 1 line of code, you absolutely should
Yep. There’s a good chance they might not ever have gotten 100K users. Why worry about good problems to have?
@@choilive I mean adding the total_likes seems like the obvious thing to do from the start. Insta had to have a business plan when they started out so yes they should have anticipated 100k
It’s a bit more complicated for Instagram, where, in order to drive engagement, it also shows which of the people you’re following also liked the post. So other than the count, which is global, there’s an aggregate query contextual to the logged-in user’s first connections to the post.
Welp, they can select the top 3 from the likes table, which should be rather fast
The most expensive part on the read side is showing the preview comments on the post, but there are fallbacks for very high volume posts and during times of high overall usage. I worked on ranking for these surfaces
Just query the friends' like list for the last couple of days.
And you take just the 5 friends with the most engagement, or who communicate with the user the most
@@ianhpete That seems like a good heuristic to start with but that would also assume that those n friends (in this case 5) would have had to have engaged with said post. If they didn't you'd have to widen your circle to a bigger subset and then that becomes a unique set for each post.
@@wywarren which it is
This is so untrue. Relational databases are very good at counting rows and will typically not even read the table, instead using a btree index to count. What he should say is that the hash join of two tables is slow.
This is actually bad advice. UPDATE is a very expensive operation on a key table and will generate logging data to deal with concurrency. Ideally there should be a separate post_statistics table, which can be unlogged for high-performance writes while trading off durability
yes... it's a real problem but his "solution" is in fact worse than the original
You raise some valid points about the efficiency of counting rows using indexed columns and the expense of update operations. I completely agree. I was happy to see btree mentioned in an Instagram comment, lol.
While btree indexes can indeed speed up count operations, this efficiency diminishes as the dataset grows and queries become more complex. Instagram deals with massive volumes of data and very high traffic, which means that even indexed counts can become a bottleneck.
I personally would use in memory caching (redis) and a message queue like Kafka. Also probably batch processing and composite indexing.
His foreword is so misleading.
It makes me feel like a less bad programmer knowing this was the solution
Same lol
Some databases allow you to do something like that automatically, so when a like row gets inserted/removed it automatically updates the count. Or you could just run the count operation every 1-2 minutes and cache the result in the post row :)
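A sketch of the second suggestion, the periodic recount-and-cache job, using SQLite with invented table names (in production this would run from a cron job or scheduler every minute or two):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, total_likes INTEGER DEFAULT 0);
    CREATE TABLE likes (user_id INTEGER, post_id INTEGER,
                        PRIMARY KEY (user_id, post_id));
    INSERT INTO posts (id) VALUES (1);
""")

# Likes are inserted normally; the cached count is allowed to lag behind.
for user_id in range(250):
    conn.execute("INSERT INTO likes VALUES (?, 1)", (user_id,))

def refresh_cached_counts(conn):
    """Run every 1-2 minutes: recount and cache the result in the post row."""
    conn.execute("""
        UPDATE posts SET total_likes =
            (SELECT COUNT(*) FROM likes WHERE likes.post_id = posts.id)
    """)
    conn.commit()

refresh_cached_counts(conn)
total = conn.execute("SELECT total_likes FROM posts WHERE id = 1").fetchone()[0]
```

The expensive COUNT runs once per interval instead of once per page view.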
Reminds me about a nosql rule ,
"Store related data together in a single structure to optimize retrieval, a principle particularly emphasized in NoSQL databases for efficient querying."
That's called "embedding"
The video starts from "counting rows on a RELATIONAL database"; why are you mentioning NoSQL rules?
The term NoSQL is so misleading
When I created an ecommerce website/app, at first I thought: okay, I'm just gonna create a column and name it total_likes; that way I can just show how many likes each product gets. But later, for marketing reasons, I needed to know who liked each product so we could send marketing emails. So I had to create a new table, liked_by_user_id for example; if we want to send a marketing email about some product, we just go to that table, find who liked that product, and send the email. So I suggest you do both: create a separate table and save the IDs of the users who liked the post or product, and for speed reasons also add a total_likes column to the product list or post. That way you have everything. It may make the database bigger, because now you have another big table.
Counting rows has never been slow. It's basically a core feature of almost all relational DB systems to offer ultra-fast counting. *BUT*: counting several thousands of times per second (!) introduces race conditions and counting-algorithm overhead, which can become absolutely deadly for performance because of suboptimal locking or incorrect usage of mutexes, semaphores and the like. It's not the counting itself killing the performance, it's the low scalability of it. There are databases out there which offer ultra-fast counting that is also scalable. In some cases this is realized via auto-updates of internal count fields, some use dark math wizardry, and some are simply that fast by design.
Counting rows has always been slow. Though this problem would be counting indexes.
@@phill6859 I recommend looking into current (and past) database technologies. Performance problems regarding row counting have been solved for *decades* now. You need to get with the times, expand your knowledge and try out new stuff. It isn't slow anymore, and it hasn't been for decades, except in cases where the database has been misconfigured or is being misused.
This sounds like a materialized view problem.
Precisely. Elasticsearch used HyperLogLog to trade off a bit of accuracy for lightning-fast counting and fixed-size memory usage.
Oh my god, your comment blew my mind. Master, teach me your ways.
Thanks for your music, and SQL query optimisation, Freddy!
Read optimised vs write optimised
Now tell me what happens when 1000000 people like the photo nearly simultaneously
Race condition, if not handled correctly
Simultaneous likes/unlikes will always be significantly fewer than simultaneous views
Queue the updates and if it takes a while no big deal, because no one cares if the like count is off by a bit
Low stakes though. You could avoid the race condition, or just account for it later by updating the value directly from the count
Just serialize it, no one cares if your like isn't instantly reflected on the app
It may seem as an unreliable system which might produce inaccurate results, but with proper structuring you won't be able to mess it up. And it saves a lot of resources, too.
It depends on the system, and environment you are working with.
If your count is slow on the database, you can change the operation from a post to a present operation.
It's more complicated than even that. As others have noted, you don't need (or want) to display a perfectly up-to-date like count, so it can be eventually consistent. The other important factor is despamming. A user clicking "like" and the like counter incrementing is not a single, atomic operation. If the user is identified as a spam bot, then their like should not count for the purpose of promoting the post.
The real workflow is more like:
1. User clicks "like." Store that.
2. Start a batch despam pipeline to figure out whether the like is legitimate.
3. If so, update the like count on the post.
The despam pipeline is going to take some time. In this case, nobody but spammers actually cares about accuracy down to a single like, so lazily updating is just fine.
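A toy sketch of steps 2-3 as a batch job (the user names and the bot list are fabricated; a real despam classifier is of course far more involved):

```python
# Step 1: store every raw like immediately; steps 2-3 run later in batch.
raw_likes = [("alice", 42), ("bot_1", 42), ("bob", 42), ("bot_2", 42)]
known_bots = {"bot_1", "bot_2"}          # output of the despam classifier

def despam_and_count(raw_likes, known_bots):
    """Batch job: drop likes from identified bots, then return the
    per-post deltas that should be applied to the cached counters."""
    counts = {}
    for user, post_id in raw_likes:
        if user not in known_bots:
            counts[post_id] = counts.get(post_id, 0) + 1
    return counts

counts = despam_and_count(raw_likes, known_bots)
# Only alice's and bob's likes survive despamming
```

Because the raw like rows are kept, the displayed counter can always be rebuilt after the classifier improves.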
Normally you batch the "despam" process. You won't do it whenever someone clicks on the button. It's a separate process. You identify bots, check which posts they liked and update the counts accordingly.
I did this with my project, a twitter clone web app. You're making me feel like a genius.
what did you use to draw the diagrams? it look so good!
MS Paint
Gold
This is not how it’s done tho.
Likes on a post still have a reference to who liked it and when. They still have that related table, but what actually happens is that they keep a read-only projection of the post with the aggregated data, and that is only updated from time to time, not every time someone likes the post. The UI creates the illusion, but on the backend the projection is updated after some delay.
I would prefer a "total likes" table with a foreign key to the post ID. Less bloat on the posts table and achieve the same thing. (ID, postId, Total)
The answer is always "cache it".
yeah, edit the database once a minute, with the value being the sum of likes and dislikes over that time period
Yeah, but to "cache it" you're required to count the rows before storing it in the cache
@@chawza8402 no, each time you add a new row, increase the number by 1. If you delete, decrease the number (like the video explains). That's also a cache because it is still a "copy of the real data", which is the like count.
@@chawza8402 Not necessarily. I think you could count the new likes and the new dislikes, add the cached value to the value already stored in the database, and then set the cached value to 0.
@@chawza8402 no, just increment it. You don't need the cache to be 100% accurate, just fast
The db really should handle this kind of caching itself
I know that Snowflake caches it; it's really fast with count(*)
You don't even necessarily need to put the counter in the *posts* table, either. 1,000,000 likes would mean 1,000,000 UPDATE calls on a table that is being read very often. Having another stats table, keeping the value hot in a cache and then reading from the cache if it's present will be vroom vroom.
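A sketch of that stats-table-plus-cache idea (SQLite standing in for the database, a plain dict standing in for Redis/Memcached; all names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post_stats (post_id INTEGER PRIMARY KEY, total_likes INTEGER);
    INSERT INTO post_stats VALUES (1, 500);
""")
cache = {}  # stand-in for Redis/Memcached

def get_like_count(post_id):
    if post_id in cache:                       # hot path: no DB hit at all
        return cache[post_id]
    row = conn.execute("SELECT total_likes FROM post_stats WHERE post_id = ?",
                       (post_id,)).fetchone()
    cache[post_id] = row[0] if row else 0      # warm the cache for next time
    return cache[post_id]

first = get_like_count(1)    # misses cache, reads post_stats
second = get_like_count(1)   # served from cache
```

Counter writes then land on post_stats, leaving the heavily-read posts table untouched.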
Great explanation, so solid.
In addition, you can place a trigger in the db so that every insert operation increments the counter
I like how perfect normalization led to this, and could've been easily fixed with a NoSQL database like Mongo that supports aggregates
What about maintaining a queue with the requests and updating a value in the database with an async job that runs every 10-20 items OR every 5 seconds, whichever comes first? That way you can do a batch update, keep one copy of the count, and keep normalization. And periodically you could correct this count value with another async check against the actual COUNT value in the database
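A sketch of that flush-every-N-items-or-T-seconds idea (class and parameter names are made up; a production version would use a real queue and a background worker):

```python
import time

class BatchedCounter:
    """Buffers like events and flushes to the DB when either 10 items
    have accumulated or 5 seconds have passed, whichever comes first."""

    def __init__(self, apply_to_db, max_items=10, max_age=5.0):
        self.apply_to_db = apply_to_db   # callback: apply_to_db(delta)
        self.max_items = max_items
        self.max_age = max_age
        self.buffer = 0
        self.last_flush = time.monotonic()

    def add(self, delta=1):
        # The age check runs on each add; a worker thread would also tick it.
        self.buffer += delta
        if (self.buffer >= self.max_items
                or time.monotonic() - self.last_flush >= self.max_age):
            self.flush()

    def flush(self):
        if self.buffer:
            self.apply_to_db(self.buffer)
            self.buffer = 0
        self.last_flush = time.monotonic()

db_value = [0]
counter = BatchedCounter(lambda d: db_value.__setitem__(0, db_value[0] + d),
                         max_items=10, max_age=5.0)
for _ in range(25):
    counter.add()
counter.flush()  # drain the tail; a periodic COUNT(*) can later correct drift
```

One UPDATE per batch instead of per like, with the periodic recount as the safety net.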
In Power Apps we call the column a rollup column, and its role is to count the items in the child normalized table
That's not really denormalizing the normal forms, as this is not removing a foreign key. It's simply adding a counter that is basically an automatically computed and cached value.
Videos like this it what we need!
They could also parse the output of EXPLAIN SELECT COUNT to get an approximate value. Above a certain number, 1,000,000 is the same as 1,000,002.
The real solution is to normalize it further by adding a new table to keep the count. Never de-normalize !!!
This is standard practice in Firestore for example.
Damn, what ER software is that?
I wanna know too
Tell me the person who designed the IG architecture had no idea what they were doing without telling me
We applied the same solution in a similar case before real load testing, but we had more than one property to calculate, and this place became the performance bottleneck of the app. It turned out that we didn't need it: just counting from the db (read nodes) didn't affect the performance of the app (in the context of our requirements). But we implemented a `snapshots` solution, where you take the last `snapshot` and just add all the `likes` for the post after the datetime when that snapshot was created
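A sketch of the snapshot approach described above (SQLite, with invented table names): the displayed count is the snapshot total plus a cheap count of only the likes newer than the snapshot.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE likes (post_id INTEGER, liked_at TEXT);
    CREATE TABLE snapshots (post_id INTEGER, total INTEGER, taken_at TEXT);
    -- last snapshot: 1000 likes as of noon
    INSERT INTO snapshots VALUES (1, 1000, '2024-01-01T12:00:00');
    -- three likes arrived after the snapshot
    INSERT INTO likes VALUES (1, '2024-01-01T12:05:00');
    INSERT INTO likes VALUES (1, '2024-01-01T12:06:00');
    INSERT INTO likes VALUES (1, '2024-01-01T12:07:00');
""")

def like_count(conn, post_id):
    total, taken_at = conn.execute(
        "SELECT total, taken_at FROM snapshots WHERE post_id = ?",
        (post_id,)).fetchone()
    delta = conn.execute(
        "SELECT COUNT(*) FROM likes WHERE post_id = ? AND liked_at > ?",
        (post_id, taken_at)).fetchone()[0]
    return total + delta   # snapshot value plus likes since the snapshot

count = like_count(conn, 1)
```

The delta query only scans rows since the last snapshot, so it stays small regardless of total history.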
counter caches are definitely a good thing
I did that in the balance field in my app
Yes I have all the transactions for integrity but I update the balance for faster processing
100% normalization is just theoretical fun. In a normal production DB, some intermediate results should always be stored, so as not to overload the DB unnecessarily.
Damn son,
Your videos are hella practical
I've just gotta say, that's a really high quality camera you got there
This is amazing shorts content. Nice, sweet, technical, interesting story. Learn a small fact, accessible
Hence they started calling it "breaking the internet"
You would have thought they had caching in the application layer. Strange to push that into the database.
I really like your explanations and examples
Subbed
Effective, but not as simple as it appears
I remember when I deleted my Facebook comments, I still saw the comment count even though I deleted all of them.
That is because of the soft deletion of comments in the db.
Most sites don't even do that; youtube updates views periodically
The only thing to keep in mind, though, is using transactions. But this is supported by virtually all database systems, so it should not be a big deal.
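A sketch of what wrapping the two writes in a transaction looks like (SQLite; table names invented). The unique key on (user_id, post_id) also rejects repeat likes, rolling back the counter bump with them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, total_likes INTEGER DEFAULT 0);
    CREATE TABLE likes (user_id INTEGER, post_id INTEGER,
                        PRIMARY KEY (user_id, post_id));
    INSERT INTO posts (id) VALUES (1);
""")

def add_like(conn, user_id, post_id):
    """Insert the like row and bump the counter in one transaction, so a
    crash between the two statements can't leave them out of sync."""
    try:
        with conn:  # sqlite3 commits on success, rolls back on exception
            conn.execute("INSERT INTO likes VALUES (?, ?)", (user_id, post_id))
            conn.execute("UPDATE posts SET total_likes = total_likes + 1 "
                         "WHERE id = ?", (post_id,))
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate like: neither statement takes effect

add_like(conn, 7, 1)
add_like(conn, 7, 1)  # same user again: rejected, counter untouched
count = conn.execute("SELECT total_likes FROM posts WHERE id = 1").fetchone()[0]
```

Either both writes land or neither does, which is exactly the guarantee the counter needs.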
The correct answer was an incremented counter.
Man, holy moly. This is really how they did it? This looks just like a newbie mistake of someone who didn't think about it twice 😅
I was thinking the same lol
A denormalised database has a higher risk of invalid or conflicting data. A process to increment the count could be interrupted or terminated prematurely, or an edge case can create a pattern of conflicting data. At a high scale, one might not observe a difference between 29877765 and 29877785 because both would be shown as 29M, but at a lower scale it's the difference between 325 and 290. At a lower scale, developers are afraid of this because the mistake will easily be noticed, so they choose the safer route of keeping the database normalised and denormalising later if needed.
These "later" things pile up and take the form of big technical debt, which bites back when the app needs to be scaled up.
Real question here is why Justin Bieber has so many fans?
Everyone is talented in their own way! Never understand celebrity worship
Right, but what about concurrent access?
Without the column there is no problem, as each backend instance inserts its row within its own connection, so the inserts can happen out of order and concurrently.
But if all of these connections want to update the new column, they can't all read the same value and each write back n+1, or increments would be lost. So does the database queue these requests?
How does it manage millions of updates per second when each update has to see the newest version of the value?
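One common answer to the question above is that clients never read the value at all: a relative `UPDATE ... SET likes = likes + 1` is applied atomically by the database under a row lock, so concurrent increments serialize inside the engine without any read-modify-write in the application. A toy illustration with SQLite and an invented `posts` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER)")
conn.execute("INSERT INTO posts VALUES (1, 0)")

# Racy pattern (don't do this): SELECT likes, add 1 in the application,
# write the result back. Two clients that both read 0 would both write 1,
# losing an update.

# Safe pattern: a relative UPDATE. The database applies each statement
# atomically, so no client ever needs to see "the newest value" itself.
for _ in range(1000):
    conn.execute("UPDATE posts SET likes = likes + 1 WHERE id = 1")

likes, = conn.execute("SELECT likes FROM posts WHERE id = 1").fetchone()
print(likes)  # 1000 — no increments lost
```

This sketch runs the updates sequentially; under real concurrency the engine's locking serializes them to the same effect.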
I think YouTube had a similar problem with counting likes; that's why they did it in some clever way, and right after the upload the count would stop at around 300 views for some time.
I don't know whether they still do it like this, or whether the times when YouTube views got stuck at 300 are now long gone, but my heart is with you if you remember that.
This is wrong at many levels.
1. Having just a likes counter will not work; we need to also save who liked a post so that we don't increment the view/like counter upon repeated liking/unliking.
2. Writing to the post table every time someone likes is not a good approach when millions can like a post in a few minutes.
To address these we need to save both who liked a post and the reference to the post, and to make sure this data takes less space we use probabilistic data structures like HyperLogLog. These can estimate cardinalities of > 10^9 with an accuracy of around 2% using roughly 1.5 KB of memory.
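For the curious, here's a minimal HyperLogLog sketch from scratch (stdlib only, using SHA-1 as the hash). It's an illustrative toy, not a production implementation: each of the m = 2^p registers remembers the longest run of leading zero bits it has seen, and the harmonic mean of the registers yields the cardinality estimate.

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: m registers, each storing the max 'rank'
    (leading zeros + 1) of the hashed values routed to it."""
    def __init__(self, p=10):            # m = 2^p registers (1024 here)
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64-p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)         # bias correction, m >= 128
        est = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:             # small-range correction
            est = self.m * math.log(self.m / zeros)
        return int(est)

hll = HyperLogLog()
for i in range(50000):
    hll.add(f"user-{i}")
for i in range(25000):        # duplicate likes don't change the estimate
    hll.add(f"user-{i}")
print(hll.count())            # close to the true cardinality of 50000
```

With 1024 registers the expected standard error is about 1.04/sqrt(1024) ≈ 3%, which matches the accuracy figure quoted in the comment above.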
Not to mention any sort of race condition... I'm not familiar with how modern databases handle race conditions.
@@d0cx Most of them with locks: they lock the table for writing while they do their transaction and then unlock it for other processes to write again.
You have no idea what you’re talking about.
1. “We need to also save who liked a post” - yeah that’s exactly what denormalization is doing. Preventing needing a COUNT query over the likes table every time a user views a post is exactly the problem that denormalization solves. The user only needs a ROUGH ESTIMATE of how many users liked the post and the only accurate information they need from the likes table is if THEY liked the post or not.
2. This is just technobabble bullshit. You’re just doing denormalization with extra steps. Did you even watch the video?
@@d0cx a race condition on what exactly? Where’s the race condition?
@@oxelf Table locking is extremely rare; most are using row locking nowadays.
They optimized it even more by not updating that record so often, they would cache it in Redis in batches and only have 1 update to the central database for every 1000 likes or so. That would be distributed across servers. So you might get 5 updates per minute saying: "A thousand more please."
That "thousand" number was also based on how often the like button was pressed for that post. Lots of likes? It would update every 5k likes. Slowing down? It would stop caching.
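The batching idea above can be shown with a toy: likes accumulate in a fast in-process counter (a plain dict standing in for Redis here) and are flushed to the "central database" only once per `BATCH` likes, so a thousand button presses become a single update. The numbers and names are made up for illustration.

```python
BATCH = 1000  # hypothetical flush size; the comment says it was adaptive

pending = {}    # post_id -> likes not yet flushed (the "Redis" side)
database = {}   # post_id -> persisted like count (the "central DB")
flushes = 0     # how many times we bothered the central database

def like(post_id):
    global flushes
    pending[post_id] = pending.get(post_id, 0) + 1
    if pending[post_id] >= BATCH:  # "a thousand more, please"
        database[post_id] = database.get(post_id, 0) + pending.pop(post_id)
        flushes += 1

for _ in range(5234):
    like("bieber-selfie")

# 5234 likes became 5 database updates; 234 are still waiting in the cache.
print(database["bieber-selfie"], pending["bieber-selfie"], flushes)
```

A real system would also flush on a timer so a slowing post doesn't strand its last partial batch forever.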
Yep, counting roses can be prickly....
It would not need to be the whole solution: up to a certain number the query runs conventionally, and only once the threshold is breached does the application write the count to the column. That covers the common case and saves CPU cycles.
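That hybrid can be sketched directly: small posts get a live `COUNT(*)` (cheap at that size), and only posts past a threshold rely on the cached column. The schema and cutoff below are invented for illustration.

```python
import sqlite3

THRESHOLD = 1000  # hypothetical cutoff; below it a live COUNT is cheap enough

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, cached_likes INTEGER);
CREATE TABLE likes (post_id INTEGER);
""")
# Post 1: small, no cached value yet. Post 2: popular, counter cached.
conn.execute("INSERT INTO posts VALUES (1, NULL), (2, 250000)")
conn.executemany("INSERT INTO likes VALUES (?)", [(1,)] * 42)

def like_count(post_id):
    cached, = conn.execute(
        "SELECT cached_likes FROM posts WHERE id = ?", (post_id,)).fetchone()
    if cached is not None and cached >= THRESHOLD:
        return cached          # popular post: trust the denormalized column
    live, = conn.execute(
        "SELECT COUNT(*) FROM likes WHERE post_id = ?", (post_id,)).fetchone()
    return live                # small post: count conventionally

print(like_count(1), like_count(2))  # 42 250000
```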
Honestly this is basic design that should have been in there from the start. What happens when you don’t plan your design
which is the web page that you're using for the diagrams?
+1
@@hiteshjoshi2736 In the end I had to discover it by myself haha. It's called Excalidraw and it's free
Like your explanation thank you
I am using this technique with a non-relational DB.
Why didn’t they do this to start with?
Okay, I guess the reason would be that if you add an extra column for likes, then one needs to first find the row and then update it, whereas with a separate table you could just add a new row, no lookup needed, saving some overhead. But as shown in the video, the shortcomings outweigh the benefits.
Instagram no longer uses a relational database for these sorts of things, i believe this specific case would have changed around 2017 but for sure by 2019
Seriously they were worse at database design than I was when I was 10 years old in year 2000..!?
Normalization is great in theory. And as a diagram on paper. But you shouldn't come to a 3NF or BCNF in the first place if you intend to use a real-world RDBMS in a real-world application... like.. ever.
Why do you ever need a separate likes table?!😅
This is 20-year-old knowledge, from when data warehousing started to be a thing
I believe it is important to mention the specific database. That idea, for example, is not a good one with PostgreSQL and its locks.
sometimes a little extra processing is a good thing
I'm assuming you're still going to log that a user liked the post, because otherwise there is nothing stopping someone from hitting that endpoint over and over again if there is no validation.
Oh ok! I figured since we're solving a problem, and the problem was essentially that every time a celebrity uploaded the infrastructure crashed, my problem-solving skills kicked in, immediately arriving at: well, if we remove celebrities from the equation, problem solved!
You could apply the solution to solve our problem here in the video, and in the real world! Bada bing bada BOOM, Bob’s your uncle!
Isn’t coding fun!
1) Counting rows in a relational database is one of the fastest possible ways to count something.
2) You're missing the half of this problem where millions of people are requesting the count (as well as increasing it themselves)
3) A better term for fixing this problem is "materialization", not "denormalization"
Relational databases 101. No shit. This is a "no shit, Sherlock" insight
We were taught that in University years ago, it is just logical
What software is used to show the database table ?
This is such good content
Nowadays they use Bloom filters.
Yesterday i was talking with a colleague that I don't really like SQL, today my feed is flooded with SQL tutorials.
Their biggest mistake was using a relational database for this kind of application.
or you make it like YouTube and only update the like counter periodically
Most likely not the count (cachable) but the federation out (newsfeed) was the issue
Good content!
I’ve found that generic statements about sql optimization/performance are not very useful in practice. There are numerous things that affect the performance of a relational db, not least of which is the specific DBMS e.g MySQL with InnoDB optimizes count(*) to use an index scan which is quite performant.
“Extra processing when someone likes a post”??? Total nonsense 😂😂😂
Unless you use column store indexes…
The exact number of likes is not business-critical for large numbers. A new, time-stamped entity, 'estimated number of likes', does *not* denormalize the database. All it takes is a semantic concession that the cache is not necessarily up to date.
I thought he said “counting bros”
Interesting, keep up the content
this is what I do... because I also encountered the same issue 😂
I thought that in row-oriented DBs, updates are more expensive than inserts.
The better option for this is to use an optimistic update on the frontend side and send events using SNS topics, SQS... whatever... and if you want better results, process those events in batches for celebrity users like JB..
Store in a virtual bucket and peek when reading is needed. Couchbase, for example.
lmao justin bieber could have held instagram hostage by just uploading photos
Thing is, they should have known how many users were on Instagram and expanded the cache or storage accordingly..... 😊 But they've done a great job 👏
What app is used in this video, pls?
Too many things to worry about with SQL
This is a simplification. Nowadays we're using big-data processing like MapReduce with probabilistic data structures for the real-time counts
Could you elaborate on “probabilistic data structures” and what’s out there to achieve that?
So there’s a chance likes can get screwed up. You still have a likes table and maybe once a day you can reconcile the total likes to the likes table.
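The nightly reconciliation suggested above is straightforward to express: a job that overwrites every cached counter with the true `COUNT(*)` from the likes table. A toy SQLite sketch with an invented schema, seeding one post whose cached counter has drifted:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, like_count INTEGER);
CREATE TABLE likes (post_id INTEGER, user_id INTEGER);
""")
# Drifted state: the cached counter says 7, the likes table holds 5 rows.
conn.execute("INSERT INTO posts VALUES (1, 7)")
conn.executemany("INSERT INTO likes VALUES (1, ?)", [(u,) for u in range(5)])

def reconcile():
    # Nightly job: replace every cached counter with the true COUNT(*).
    conn.execute("""
        UPDATE posts
        SET like_count = (SELECT COUNT(*) FROM likes
                          WHERE likes.post_id = posts.id)
    """)
    conn.commit()

reconcile()
count, = conn.execute("SELECT like_count FROM posts WHERE id = 1").fetchone()
print(count)  # drift repaired: back to 5
```

At real scale you'd reconcile in batches (or only for recently active posts) rather than rewriting the whole table in one statement.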