We Had Our First Outage...
- added Oct 12, 2023
- Thankful that UploadThing is fine and our community is dope. Outages like this are always scary, I'm proud of how we handled it
Check out UploadThing: uploadthing.com
ALL MY VIDEOS ARE POSTED EARLY ON PATREON / t3dotgg
Everything else (Twitch, Twitter, Discord & my blog): t3.gg/links
S/O Ph4seOn3 for the awesome edit 🙏 - Science & Technology
“Should have used OCaml.” Prime’s text, while petting his pet camel. 😂 🐫
Goddamn prime
Died
lol
Camels don't like being touched
Get Tom on your team, he is genius when it comes to this stuff
Jdsl for the win
tom is a genius
Who’s Tom ?
@@gauravkalele7491 he's the best-known programmer on YT these days, ThePrimeagen, a staff architect at Netflix
@@gauravkalele7491 a genius
If the product served non-tech users, the situation might have gone differently. Only within the tech community do we get sympathy for how hard things can be when building software
Also depends on the product. If YT or Twitch had a 3h outage, people would still use it.
Open telemetry APM makes debugging issues like this an absolute breeze, regardless of which layer of the stack they happen in.
theo is just intermediate, chances are he has nothing like that in place
what's OpenTelemetry used for?
@@maksadnahibhoolna-wc2ef it's just an open-source framework for telemetry; you still have to design a profile of information to collect from your users that would alert you to an outage like this, while also not invading privacy more than necessary.
otel is great and having the right tracing, metrics and logs are great. I wish things like datadog would support them better. They say they do but really they put very little effort in and it's not good enough for production use, so you have to use their own protocols/agents/etc.
Being able to view the internal query stats from the db in a time window would also be amazing though. We currently get which query is slow from metrics, and what the CPU/mem/iops etc are from the database but not things like rows read vs returned in a nice graph like that.
Love these type of "here is something that actually happened in a real world production app" videos.
The theory people are gonna say "well I always add the correct indices", but of course they don't. And if they do, something equally "simple and obvious and avoidable" is gonna break.
Databases tend to spoil developers who are just getting started learning them. The amount of performance you can get out of a DB with nothing but auto-generated indexes, under a million rows, is honestly insane. This is why scaling issues like this are not all that uncommon (unless perhaps you have a dedicated DB team or person who already understands these things): generally things just tend to "work" until they don't.
"Should have used OCAML" 🤣🤣
I think everyone had a problem like this :D
That's why some kind of monitoring and notifications is a must-have. Learning about the mistake before the customer is a huge win. Mongo Atlas can send you an email when your query does a big row scan, which is nice because it is not necessarily slow today, but might be slow tomorrow.
Cool video - good post-mortems are awesome
You should always index the column(s) used in your where clauses.
If there are multiple columns that are fixed in the where clause, then create a multi-column index to further improve performance.
A multi column index is not the same as indexing each column separately
Ehh I wouldn't say always but the multi column index is a good tip.
Also ... if you have another query that does NOT use all columns, you need an extra index. The MUL will only be used if all columns are in the condition 🎉
@@nimmneun I think that depends on the order of the columns in the multicolumn index. If the column that's used alone is first in the list of columns in the multicolumn index, then there's no need for a separate index
Ah you're totally right, totally forgot about that 👍
It depends how often that query actually needs to get run. Just adding indexing for everything has tradeoffs as well
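The leftmost-prefix rule this thread is circling can be checked in a couple of lines. A sketch using Python's built-in sqlite3 (the table and index names are made up; MySQL's planner applies the same rule):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (app_id TEXT, owner TEXT, name TEXT)")
conn.execute("CREATE INDEX idx_app_owner ON files (app_id, owner)")

# A query on the leading column alone can still use the composite index...
plan_leading = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM files WHERE app_id = 'a1'"
).fetchall()
print(plan_leading)

# ...but a query on only the second column falls back to a full table scan.
plan_trailing = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM files WHERE owner = 'bob'"
).fetchall()
print(plan_trailing)
```

So the column order in a multi-column index matters: put the column you also query on its own first, and you may not need a separate index for it.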
Hehe... been there, done that with indexes. Be sure to perform database maintenance over time. Index fragmentation happens, too, so you need to keep up on that, and that's something that's a little harder to debug. You'll be like, "but we have an index on that column, why isn't it working?" Symptoms are nearly identical with large numbers of rows read, exponentially slowing stuff down. I have more experience with this in SQL Server than in MySQL, but it's a thing there, too.
Great lesson on indexing for databases - I think I will check on this for my own projects, lol
reminds me of a funny bug Haskell once had. On Windows, if the source file was in a different folder and got a type error, it would report the error and then delete the source file. When users contacted the devs about it, they were understanding and explained they now had a cron job running to back up the source files regularly.
Simon Peyton-Jones mentioned this in a 2017 talk about Haskell
That is kinda funny, because Oracle teaches us that indexing is a no-brainer, since Oracle's DBs ignore indexes when it's worse to use them. However, when you use a NoSQL DB the basics become important again. Great video; reminded me of when I used to work at a bank and we switched from SQL to NoSQL. At that time we used to run performance tests, so we picked it up before production 😅
The transparency shown in this video is powerful. You managed to take a not so positive situation, and turn it into an invaluable learning experience for many of your viewers including myself. Nice job and glad things are back up. This is why I love your channel. I’m also an upload thing user and a huge fan of planetscale. Keep fighting the good fight!
Nice! How much an index can improve queries is amazing 😮
the only outage I had was when Vercel had problems initializing serverless functions for a few hours. Blew up my inbox with error log alerts that were entirely useless to me, and my attempts to push a new deployment that temporarily disabled affected features were useless since deployments were also affected...
I wish he went deeper on the technical reason why this happen. Like explain what a unique Index is, how it helps the query, unique key constraint vs unique key index, etc.
He sort of did. Unindexed queries put crazy loads on database. If you want a deeper dive on database indexes, one of the keywords is "B-tree"
Unique key index is an additional constraint added to make sure the value of the field is unique across the table. In this case, the app api key should be. That helps with data integrity.
"unique key constraint", "unique key index" and "unique Index" are basically the same thing. These terms are often used loosely.
An index is the same thing you find in a book. Large books have a few pages at the back which list, typically in alphabetical order, certain topics in the book and what page they're on. If you have to search, it's easier to look through the few pages of index than the few hundred pages of the book.
So when you put an index on the column "name", at every insert the name value is stored, in sorted order, in an index structure which the database can scan instead of the whole table.
A unique index is the same thing, except it doesn't allow duplicates. Unique indexes slow down inserts a bit, as the DB has to check for uniqueness.
The thing Theo should be doing is multi-column indexing. General rule of thumb: index the columns used in your where clauses.
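The book-index explanation above is easy to demo. A small sketch with Python's stdlib sqlite3, using an illustrative schema (not the real UploadThing one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apps (id INTEGER PRIMARY KEY, api_key TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_api_key ON apps (api_key)")
conn.execute("INSERT INTO apps (api_key) VALUES ('key_123')")

# The unique index rejects duplicates on insert...
duplicate_rejected = False
try:
    conn.execute("INSERT INTO apps (api_key) VALUES ('key_123')")
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)

# ...and doubles as a regular index, so lookups avoid a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM apps WHERE api_key = 'key_123'"
).fetchall()
print(plan)
```

So one unique index buys both data integrity and fast lookups on that column.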
Take a look at PlanetScale's own videos that explains indexes and different kinds of optimizations where you may not be able to utilize indexes on their own. Specially "Don't hide your database indexes!" and "Faster database indexes (straight from the docs)"
So adult programmers also forget to add Indices sometimes!? YAY!
Indices
And there I was, testing after my production update, thinking I broke my app, bashing my head into a wall for a while, although I hadn't touched the upload logic. All of a sudden it starts working. Didn't really look further into it, but this explains it!
if an outage occurs and you dont have any users, is it really an outage?
This video gave me the urge to keep brushing my hair out of my eyes which is interesting since I don't have long hair.
Cool to see a prod bug solved! Nice video!
Your thumbnail photo for this video is wild! 😂
You should probably also hash / encrypt your API keys, as right now you're basically storing passwords in cleartext in the database (if I'm not missing anything obvious here).
I suppose it's possible that `key` is already hashed. Just a naming thing as opposed to calling it `hashed_key`.
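For reference, a common pattern for stored API keys is to persist only a digest and compare digests at verification time. A hedged sketch (the function names are made up; this is not necessarily what UploadThing does). Since API keys are high-entropy random strings, a plain SHA-256 digest is generally considered acceptable here, unlike for user-chosen passwords:

```python
import hashlib
import hmac
import secrets

def new_api_key():
    # Return the raw key to the caller exactly once; persist only the digest.
    raw = "sk_" + secrets.token_hex(16)
    return raw, hashlib.sha256(raw.encode()).hexdigest()

def verify_key(raw, stored_digest):
    candidate = hashlib.sha256(raw.encode()).hexdigest()
    # compare_digest avoids leaking the match position via timing.
    return hmac.compare_digest(candidate, stored_digest)

raw, stored = new_api_key()
print(verify_key(raw, stored))         # True
print(verify_key("sk_wrong", stored))  # False
```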
Maybe consider caching the keys since they don't really change if I understood correctly. Redis or in-memory sqlite are extremely fast and could relieve pressure from db. But yeah, these things happen...
Fixed it for my birthday. Thanks!
Would it be possible to cover a cost analysis of the impact of this? What was the PlanetScale cost of this sudden hit?
Yea how much did the spike cost? Great question.
I'd even store it cached in Redis or something
When you realized planetscale doesn’t give Theo free planetscale but gives you free planetscale through the free tier
The best way to explain db indexes imo is without an idx its doing Array.find, with an idx it is Map.get or obj[key]. But more than that but it’s easier to explain why it’s faster
Perhaps a bit too much of an oversimplification. I think calling it a glossary is apt.
@@DryBones111 looking through a glossary to find something is an array find. Unless you know the exact place to look something up. Like a map.
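The Array.find vs Map.get analogy can be measured directly. A rough Python sketch (a real B-tree index is O(log n) rather than a hash map's O(1), but the scan-versus-lookup gap is the point):

```python
import timeit

n = 100_000
rows = [{"key": f"k{i}"} for i in range(n)]    # unindexed table: linear scan
by_key = {row["key"]: row for row in rows}     # index: direct lookup

target = f"k{n - 1}"  # worst case for the scan

scan_time = timeit.timeit(
    lambda: next(r for r in rows if r["key"] == target), number=100)
lookup_time = timeit.timeit(lambda: by_key[target], number=100)

print(f"scan:   {scan_time:.4f}s")
print(f"lookup: {lookup_time:.6f}s")
```

Even at a modest 100k rows the gap is several orders of magnitude, which is why one missing index can take a whole database down under load.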
k6 is a good way to do load testing simply to find fault points like this.
thanks for sharing this
Love your thumbnail game man 😂
Kudos for mitigating the issue that fast and being honest about it. I really hate kicking someone when they're down, but this is a clear indicator that no amount of code-as-whatever shortcuts and ORM abstractions that allegedly make life easier can replace an architect, a seasoned senior, or a DBA. As you've said yourself, it's a surprise this didn't bite you sooner. You've been lucky that your DB provider is so last-mile oriented. Imagine having such an issue on an AWS Redshift cluster that could easily cost you as much as a house in a few hours.
your comment is really backwards. just read it. nothing lucky about his db provider.
@@AmericazGotTalentYT maybe I should have been more clear - a DB hosting provider
Which pscale tier/cpu do they use for uploadthing?
What is the calculator that you used called? Looks better than default Spotlight one
That is that cowboy attitude and lifestyle.
you went all in, in the thumbnail
I DDoS'd my own Postgres db this week by accident. I have a process where 10,000 VMs collect stats, and there was supposed to be a final step where a single VM aggregates the metrics and upserts. I accidentally left in code that was for testing, and each one of the 10,000 VMs started upserting its own metrics (millions of rows) all at once. 😅
You making me switch from railway to planetscale 👀
The minute you mentioned planetscale i knew it was an index issue
5:21 wouldn't the unique index on line 113 cover it without also needing the index on line 112?
Good question
Try it yourself with an EXPLAIN; if it's just a constraint, MySQL may not use it for that query, but with whatever index they added and whatever the schema looks like now (you can probably go find the PR?), the query plan would change
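On the question above: in most engines a UNIQUE constraint is backed by its own index, so a second plain index on the same column is usually redundant, but the reliable answer is always EXPLAIN against the actual schema. A sketch with sqlite3 (MySQL's EXPLAIN output looks different but answers the same question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The UNIQUE constraint creates a backing index automatically.
conn.execute("CREATE TABLE api_keys (id INTEGER PRIMARY KEY, key TEXT UNIQUE)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM api_keys WHERE key = 'abc'"
).fetchall()
print(plan)  # the planner searches the auto-created unique index
```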
This was a fun day indeed 😂
So many of the major app problems are caused by overlooking basic DB optimizations.
I'm surprised that a query against a single table using a single column in the where clause didn't have an index. Imo this is a failure of the ORM mindset that abstracts away how queries are run
He is an idiot lol
I share your opinion on ORMs
I totally agree. I love Theo, but this is like a database-101-level thing that his team missed. Fucking ORMs!
And these "infinite scale" cloud services. On something like AWS this mistake could be SUPER costly, whereas your $5 Linode server might have just locked up for half an hour due to one dude; it doesn't seem like you have that many users yet where that would be noticed unless it was prolonged. I'm a big OOP guy, but I'm a guy who learnt OOP in like 2002 lol, so while I might have some habits that can lead to some minor over-abstractions, the issue here is that your database is abstracted, the query and therefore the communication layer is abstracted, your server hardware is abstracted... people, sometimes it's OK to not abstract every single concept so a 5yo could understand it. I get it, reliability and maintainability etc, but if your dev doesn't know how to interface and communicate with the database, should they be allowed to....?
Problems like this are inherent and numerous due to modern development trends.
Dang planetscale makes them a walk in the park, dang. AWS RDS Perf Insights is so far behind
If you think about it, it was a free stress test, and a trial by fire of UploadThing's ability to scale.
Scaling up is one of the first things on the list if you intend to have tons of customers. And this also proves that database indexes are something you should always use.
Theo is such a beginner I swear
@@perc-ai by beginner level do you mean the fact that his team missed indexing in the first place? And the fix was beginner level in your opinion, right?
@@maksadnahibhoolna-wc2ef a beginner would not create indexes for queries in general. It's very obvious which queries are gonna be run over and over again. This is basic db design
NULLs are a problem in indexes; avoid them and use some other default, like zero.
Wow the ui is awesome
Thank you
Funny how there were no conclusions for "How the team should be able to get hold of me quickly when they need to" 😅
Meh I wasn't too bothered about it. The CTO was on, Theo got in within an hour or so.
Pager Duty is prohibitively expensive, so best to leave it till capital is flowing
@@seannewell397 Yeah, come to think of it, why didn't anyone just call him rather than just leave urgent messages? 😁
shouldn't need to get hold of him
I don't care how many shareholders you have. If lives aren't on the line, it can wait until morning.
Are you using drizzle orm ??
Those metrics for sure help with debugging this, quite nice. But this should have been surfaced when writing the SQL and looking at the execution plan in the first place.
This is probably the "break things" part of "move fast and break things".
Load test your apps. Who knows what else is lying beneath the surface. Load testing should have exposed this issue.
Planetscale is an awesome platform, especially the DX that they offer.
My curious thought here is if you need an API Key lookup then why isn't it cached in-memory somewhere?
It makes no sense to query matches from disk for some operation which would probably require the highest throughput?
Also, If this was a short lived JWT, the issue could've been avoided, I believe.
Nonetheless, good work fixing it as quickly as you did.
why cache when the db query is done in milliseconds?
@@willi1978 because using the cache frees up the connection pool for the db, which has to deal with many other queries besides lookups.
Caching makes sense in cases where queries are _slow_ and/or where data is unlikely going to change often and is read more often.
If your queries take ms to retrieve, caching is just a premature optimization that causes more headaches, as you now have to think about cache invalidation strategies for each query based on the data you need.
PlanetScale already provides a caching mechanism that they call Boost that automatically handles cache invalidation, etc. but it highly depends on your queries!
@@dealloc API keys rarely change. I agree that the query is a quick one but if something isn’t changing utilising a db connection to fetch unchanged data is just too amateur to me especially for a production system.
There’s no doubt that you can pull this off without an in-memory cache but why not separate the backing store to ensure better distribution of load?
Finally, I have heard of boost and it sounds like a great optimisation from planetscale’s end but I’d still prefer to avoid involving the db in every way possible unless you need realtime data and caching isn’t an option.
@@willi1978 You just watched a video giving one reason :P
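A minimal sketch of the read-through cache this thread is debating, with a hypothetical db_lookup standing in for the real query (a production version would also need the invalidation strategy mentioned above):

```python
import time

class TTLCache:
    """Tiny in-memory read-through cache with a time-to-live per entry."""
    def __init__(self, ttl_seconds, loader):
        self.ttl = ttl_seconds
        self.loader = loader   # called on a cache miss
        self.store = {}        # key -> (value, expires_at)

    def get(self, key):
        hit = self.store.get(key)
        if hit is not None and hit[1] > time.monotonic():
            return hit[0]      # fresh: served from memory
        value = self.loader(key)  # stale or missing: fall through to the db
        self.store[key] = (value, time.monotonic() + self.ttl)
        return value

calls = []
def db_lookup(key):  # hypothetical stand-in for the real SELECT
    calls.append(key)
    return {"app_id": "app_1"}

cache = TTLCache(ttl_seconds=60, loader=db_lookup)
first = cache.get("sk_live_abc")
second = cache.get("sk_live_abc")
print(len(calls))  # 1: the second read never touched the "database"
```

The TTL keeps stale keys from living forever, at the cost of serving a revoked key for up to ttl_seconds, which is the trade-off to weigh against hitting the db every request.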
I really hope you didn't reach out to the CEO of Planet Scale over this small issue.
This is your brain on needless abstractions
No hate, but isn't indexing a table like a beginner concern?
Theo is essentially a junior React dev lol. Fools a lot of people too
@MrStupiuno He's a manager, not a coder. He's allowed to be mediocre. Leading people require diff skillet.
I'm assuming he wasn't the one implementing.
@@venicebeachsurferin the video he said he is the one that added the index
@@MrStupiuno
You guys are too harsh. lol
People make stupid mistakes regardless of their level. I've seen a lot of senior devs/leads, or so-called 10x devs, make extremely silly and naïve mistakes, but I caught them during code review or whatever. It's as simple as that: some mistakes are so dumb (they admitted it too) that they make you feel like the person is a junior.
The reality is that the complex, cool projects they had created or were working on (insert something complex like Figma for frontend here) are not something a junior or mid-level, or even a lot of the seniors I've known, could do; some even handle millions of rows of data daily. By that bar, 90% of devs aren't even qualified to be developers, maybe junior junior.
Just take their lessons, whether you knew it before or not, and remind yourself to avoid making similar mistakes. If it happens, hopefully you have good colleagues to catch them before production.
Technically it was indexed, since primary keys are indexed by default in most major dbs.
LMAO this thumbnail is hilarious
What happened with the serverless (homeless) infinite web scale?
The bad SQL aside, S3 probably gets 40k file uploads per second, not per 30 minutes. Doubt that serverless infra is going to scale to that magically by itself.
I stand by my mantra that everyone should be able to self-host, set up everything by hand and have intimate knowledge of all configuration in order to succeed. Other people's services is like using a Mac as opposed to something nice like Gentoo
@@spicynoodle7419 We don't know how many files are being uploaded per second but we know that s3 handles ~100 million requests per second. LOL
Why include an employee's (Mark's) info in this? Around 5:55... Or why not?! Just curious.
1. Wasn't intentional, just happens from my gitlens VS Code plugin
2. No info was shared that isn't already in the public Github repos
3. Mark watched the vid before I posted it
🤷
@@t3dotgg you are an incompetent cto lol, this is exactly why Hasura is a huge W: it creates indexes for you and caches queries as well
still down
(A simple) status page should not slow you down. Do a simple ping/uptime kind of report at least. It's the time to BE better, not make excuses. Every idea should have a clear, written analysis.
Send a follow-up explaining clearly how a simple status page implementation, with one dimension and one ping check, would slow you down.
Good job on owning it and being transparent.
Are you using drizzle
down again..
Brave of you to admit you don't index tables properly. If I were a customer, my loyalty would be affected :)
Yep, This is a characteristic of a very young DBA. But his admission allows you to grow with him.
I would actually shift about 80% of my traffic to someone else and maybe keep 20% with him, throttling traffic/queries with a proxy or load balancer. It's a trade secret that really smart DBAs will use an amateur small business to help refine their code and find bottlenecks. Or you use low-resource hardware, throw billions of queries at it, and see what happens, then increase over time for the comparison and make a permanent decision. But being able to make a support call and having him do a quick turnaround... that is what most business owners want when growing with a company. You need that business-to-business relationship with a real human.
Also, this totally sold planetscale for me.
Wait. How the hell do you have 410M+ API keys? Or am I missing something?
One minute just one minute, you talk about upstash one time every month 😢
"Should've used OCaml" 😂😂
I feel like you've reached the max amount of stretch in your neck on this thumbnail, it's peak vascularity now
Can someone tell me what are these p[number] things? How can I understand this
Look up percentiles in distributions
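For the question above: pNN is a percentile, the latency that NN% of requests complete under. A quick sketch with made-up millisecond latencies, using the nearest-rank method:

```python
# Hypothetical request latencies in milliseconds.
latencies = sorted([12, 14, 15, 15, 16, 18, 20, 25, 40, 900])

def percentile(values, p):
    # Nearest-rank method on an already-sorted list.
    k = max(0, int(round(p / 100 * len(values))) - 1)
    return values[k]

print(percentile(latencies, 50))  # 16: the typical request
print(percentile(latencies, 90))  # 40
print(percentile(latencies, 99))  # 900: the outlier a mean would hide
```

This is why dashboards show p50/p95/p99 rather than an average: the tail is where users actually feel an outage starting.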
"37ms is reasonable" ... looks around...
Serverless 🎉
How did you get famous? Oh, I DDoSed a YouTuber's website!
LMAO should have used ocaml. The trolliest comment ever. Primeagen troll level 8
Probably using ORM KEKW
9 out of 10 times i have had this it was indexing :(
I make a sacrifice, usually it helps
ah yes, try to aim for 5 indexes per table. Index on joins and filter clauses, gg.
This was a completely rookie mistake
If you do a select with columns then the index should be in all of them.
You mean a where clause on columns…
@@dave_jones yes, but also if you select a, b, c then you should have an index on a, b, c plus the where attribute
@chrishabgood8900 Indexing a column (or many, in a multi-column index) affects the search criteria - that is, by creating an index that "answers" the where clause, you reduce the time to create response by reducing the amount of rows to search. At no point does an index affect "what columns to return" - that part is trivial, the time cost is in "what rows to return" - not "what columns to return." This is all done by the "Query planner" which determines how it can most effectively solve the query it has received. You can use any rdbms' "explain" process to understand this more.
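What the thread above is describing is a covering index: the index stores the filtered column and the selected columns, so the engine can answer the query without touching the table at all. A sketch with sqlite3 (names are illustrative; in MySQL, EXPLAIN shows "Using index" for the same situation):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (app_id TEXT, name TEXT, size INTEGER)")
# The index carries the where-clause column AND the selected columns.
conn.execute("CREATE INDEX idx_cover ON files (app_id, name, size)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name, size FROM files WHERE app_id = 'a1'"
).fetchall()
print(plan)  # SQLite reports a COVERING INDEX here
```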
If you tested your code...
LOL c'mon, as if y'all would start investigating performance bottlenecks while there are none yet and all your apps are prepared for 10x load spikes, especially when you've been growing at a steady pace before. 😂😂 Shit happens, everywhere.
On a side note, Planetscales query stats really are nice 👍
This is my thought. He just talked about the whole "line of prime" analogy, which perfectly applies here. A new product that's just starting to acquire more users isn't going to invest time in performance optimizations over just building the product. I'm surprised at the number of comments criticizing that this wasn't caught. I had a similar issue pop up at one of my previous jobs with a new feature; things like this can easily be missed.
These comments are unhinged. I can get behind criticisms of ORMs, as that's likely why this happened, but some of you have clearly never worked on a product where you own the entire stack. Theo says a lot of things I don't agree with, but acting as though a new product should spend the time to write performance tests when they've had no issues is over-engineering at the current stage of their product. It's the equivalent of a junior writing optimized code that's completely unreadable to everyone else, with no justification as to why the optimization was necessary.
And at least to me it sounds like the CEO of PlanetScale volunteered to help. Nowhere in this does it ever sound like Theo asked for help.
Yeah, maybe an entry-level data engineer could've spotted this problem months ago haha
The companies who have room for an entry level data engineer or DBA are not making services like UploadThing, they are looking at the backing DB schema of some SalesForce plugin for some SAP integration deep within Oracle, and they don't ship code probably. /rant /tooharsh
Startups go brrrrr (b/c their query plans don't match their business plan, gotem! xD)
classic
gcp cloud sql ❤
thumbnails keep getting weirder and weirder
This is an amateur move
Omg im early af😂
Mmmm, kinda embarrassing tbh.
😂😂😂😂 should have used ocaml
Wow, this was an extremely dumb issue. I get it, everyone makes mistakes, but this is just a bit too silly given how preachy the rest of this channel is. Not indexing a field in the where clause of an extremely frequently used query, what? The presentation here is quite dramatic too, considering how silly the problem was. I really hope the PlanetScale CEO wasn't brought in to help with this.
Does that team have no DBAs? Is it all this scalable cloud stuff abstracting away core principles? Why's he writing a DB schema in JavaScript? I have no idea.
lol @ DBAs - what do you think this is, an enterprise?
This was fine tbh. They had a few indices/constraints and missed one. Shit happens.
Honestly it wasn't that bad (no data loss/corruption)
This issue is symptomatic of using ORMs. If you're gonna abstract SQL you better have the tooling to help flag mistuned queries.
DBAs are not a real position.
@@seannewell397 Didn't say it was a bad issue, I said it was a dumb issue (could also be bad, depends on your metric). If this channel is going to preach so hard and give advice in the tone it does, it should be held to enterprise standard. Theo doing his heavy-breathed wise senior voice making bold claims about how software should be made, then uploading a 10 minute postmortem on a missed index, isn't a good look.
Maybe pickup a book by Joe Celko. Or use Redis that wouldn’t care about such mistakes.
"should have used redis" - god damnit internet, you never fail to make me want to stab my eyes out
@@dave_jones I know right. Like what does that (that = "should have used redis") even mean, lols.
Firsssssstttt
💀
hahahaha
nobody gives a single flying fuck
Yup, TypeScript + no testing == problems. Good luck "changing the world" with "innovative" products like file upload. Peak solutionism
How is this a testing issue? Most tests wouldn't have caught this unless you're specifically doing some type of stress test on your product, and as Theo literally stated, this product just recently blew up, so they wouldn't have any metrics to test against.
If anything, this is an issue of the ORM abstracting away the query being executed in the database, as other commenters have pointed out.
it's not supposed to be innovative, it's supposed to solve customer's problem. which is exactly what products are supposed to do.
@@TFDusk Have you heard of performance testing?
@@AmericazGotTalentYT People literally equate products like these with innovation.
@@90sokrates Do we just not read responses, or do we believe this is an appropriate response? Excluding yourself and the minority of people who share your mindset, most people wouldn't write performance tests for a greenfield project they're just trying to get out the door, because premature optimizations are often not needed, which is even demonstrated here by the fact that this issue happened due to an unusual load being placed on the server.
Sure, performance tests could've caught it, just like writing raw SQL would've likely mitigated the mistake. But with both of those decisions, you're sacrificing velocity, which as a business isn't acceptable.
#Free_Palestine
your thumbnail faces are getting cringe and ridiculous now