The Problem With UUIDs

SdĂ­let
VloĆŸit
  • čas pƙidĂĄn 6. 05. 2024
  • THANK YOU PLANETSCALE FOR SPONSORING THIS VIDEO
    UUID's have a time and place, but I rarely see them used correctly. I've wanted to do this rant for awhile and I'm happy I did because CUID2 is NOT a good option either 🙃
    SOURCE
    planetscale.com/blog/the-prob...
    Check out my Twitch, Twitter, Discord more at t3.gg
    S/O Ph4se0n3 for the awesome edit 🙏
  • Věda a technologie

Komentáƙe • 900

  • @flymetothemoon5138
    @flymetothemoon5138 Pƙed 14 dny +223

    It's amazing watching somebody talk shit about something he obviously doesn't understand beyond his sponsors blog post

    • @pchris
      @pchris Pƙed 6 dny +38

      He talked up the video at the start like he's gone through many iterations of it so I was kind of shocked when the entire video was just him reading somebody else's blog post. Did he really have so few thoughts on it himself, or were they all about GUID and made obsolete as he read the UUID article?

    • @hellowill
      @hellowill Pƙed 5 dny +17

      this frontend guy needs to stfu about backend and databases lol

    • @aguenter
      @aguenter Pƙed 5 dny +7

      @@pchris Skimmed right over ULID too, which fits his stated requirements/wants.

    • @DavidOtto82
      @DavidOtto82 Pƙed 2 dny +1

      lol i saw this top comment and thought "lets see how long it takes until i can call bullsh*t"...took 2 minutes. btw: former frontend dev here xD

    • @tahvohck
      @tahvohck Pƙed 22 hodinami +1

      Thanks for saving me some time.

  • @cherubin7th
    @cherubin7th Pƙed 16 dny +879

    I propose v9, it is the same as v4, but instead of a 4 it has a 9. This changes everything.

    • @mx-kd2fl
      @mx-kd2fl Pƙed 16 dny +15

      lmao

    • @DavidLindes
      @DavidLindes Pƙed 16 dny +22

      vibes of the IPv9 (April Fools) RFC. :)

    • @johnblyberg4801
      @johnblyberg4801 Pƙed 16 dny +3

      Genius!

    • @follantic
      @follantic Pƙed 14 dny

      It's incremented. Good for perormance. Then we can proceed with the A-F versions.

    • @follantic
      @follantic Pƙed 14 dny

      Also, Tom, is that you?

  • @einrealist
    @einrealist Pƙed 16 dny +542

    UUID versions are NOT supposed to replace each other. They are just different implementations that address different problem domains. Once humans become a serious space-faring civilization, we probably need yet another UUID version for problems like time dillation. But such a UUID V99 is not supposed to replace V4, but to complement it. V4 can still be used within certain bounds.

    • @vikingthedude
      @vikingthedude Pƙed 16 dny +8

      So its like newton’s laws

    • @TheOriginalBlueKirby
      @TheOriginalBlueKirby Pƙed 16 dny +39

      ​@@vikingthedudeTerrible analogy.

    • @Imperial_Squid
      @Imperial_Squid Pƙed 16 dny +62

      Calling them something like flavours rather than versions is probably more fitting

    • @steamer2k319
      @steamer2k319 Pƙed 16 dny +23

      ​​@@Imperial_Squid
      Yeah. I've never been confused by the UUID types but I could see how one could be misled into expecting obviation/revision from a numerical sequence of "versions".

    • @vikingthedude
      @vikingthedude Pƙed 16 dny +3

      Newtons laws are still useful even though we’ve got general relativity

  • @xdaniels13
    @xdaniels13 Pƙed 16 dny +728

    Oh I am still waiting on the video about PlanetScale removing the free plan and the cheaper price twice as much as competitors...

    • @portalpacific4578
      @portalpacific4578 Pƙed 16 dny

      We need to stop watching compromised influencers pushing businesses despite not having any support from the dev community.
      I cant even hear the info cuz im so annoyed hes shilling their BS and taking their money.. Fck you Theo.

    • @portalpacific4578
      @portalpacific4578 Pƙed 16 dny +140

      Theo = sell out.

    • @yiannis_p
      @yiannis_p Pƙed 16 dny

      @@portalpacific4578do you ever Google anything before calling people names? Name me one other company offering a hosted vitess cluster with a https proxy, backups, branches and online ddl that is cheaper than planetscale and then I will agree with you.

    • @yiannis_p
      @yiannis_p Pƙed 16 dny

      You do realise planetscale offers something no other company offers right ?
      The only other way to get a hosted vitess cluster with online ddl, backups and an http proxy is to make it yourself, and unless you value your time at zero dollars, it would be a lot more expensive


    • @tanotive6182
      @tanotive6182 Pƙed 16 dny +14

      He already made a video about that

  • @eamonburns9597
    @eamonburns9597 Pƙed 13 dny +131

    17:58 Just so you know, "I'm not telling you the difference between them, that's your problem", takes a lot more time to say than "1 byte is 8 bits"

    • @nArgari
      @nArgari Pƙed 9 dny

      Nice... So no difference between an octet and a Byte !? Or may be...
      The terms "octet" and "byte" are often used interchangeably in common usage, but there is a technical distinction between the two.
      A byte is a unit of digital information that consists of 8 bits. A bit is the smallest unit of data in a computer, and a byte is a group of 8 bits that can represent a single character of text or a small amount of numerical data. Bytes are commonly used to represent characters in text, encode instructions for a computer program, store data in memory, and transmit data over networks.
      An octet, on the other hand, is specifically defined as a unit of digital information that consists of 8 bits. In networking and data communication contexts, the term "octet" is often used to emphasize the size of the unit (8 bits) without assuming any particular character encoding or representation. The use of the term "octet" in networking standards helps to avoid confusion that may arise from different character encodings that use different numbers of bits per character.
      In summary, an octet is a specific term for a group of 8 bits, while a byte is a more general term for a group of 8 bits that can represent a character or other data. In practice, however, the two terms are often used interchangeably, especially in everyday language.

    • @vorrnth8734
      @vorrnth8734 Pƙed 9 dny +13

      @@nArgari Actually a byte is the smallest directly accessible piece of memory. It does not need to be 8Bit wide. Historically there were other sizes too.

    • @uncaboat2399
      @uncaboat2399 Pƙed 8 dny

      @@nArgari I thought an "octet" was a baby octopus? 😝

    • @PhilHibbs
      @PhilHibbs Pƙed 7 dny +2

      Anyone that deep into the video who doesn't know that is in the wrong place.

    • @eamonburns9597
      @eamonburns9597 Pƙed 6 dny +1

      @@PhilHibbs I mean, if they are watching this channel, they are probably JavaScript developers, so....

  • @tarunpai2121
    @tarunpai2121 Pƙed 16 dny +241

    Dude literally just read an article and nothing more. Offer some POV at least.
    - I don't understand if this is really an issue with mysql or with postgres too?
    - Is this even as big a problem? or is it just clickbait?
    - What's the actual performance impact? there's a lot of "perf issues" thrown around with no(?) benchmarks.

    • @urmom8322
      @urmom8322 Pƙed 15 dny

      What do you mean? He was also able to get paid to shill for planetscale. Amazing content

    • @intentionally_obscure
      @intentionally_obscure Pƙed 14 dny +56

      Thanks for saving me half an hour

    • @hum6le
      @hum6le Pƙed 14 dny +65

      thats kinda this whole channel

    • @AxioMATlC
      @AxioMATlC Pƙed 11 dny +1

      Highly variable, but consider where an auto-incrementing 32bit int index is 16GB. UUIDs could be as large as 4x or 64GB. Now you are talking about the difference from a cheap server to a more expensive one unless you degrade to searching within an index based on drive speed instead of RAM. ~100x slower for index usage

    • @pchasco
      @pchasco Pƙed 10 dny +1

      @@AxioMATlCOK but you wouldn’t use cuid in a scenario where you can use a 32-bit int, either. You haven’t pointed to a problem with UUID here. This is a problem of choosing the wrong solution to a specific problem.

  • @chepossofare
    @chepossofare Pƙed 16 dny +506

    The proble with UUIDs is that you are not using PostgreSQL.

    • @mvnkycheez
      @mvnkycheez Pƙed 16 dny

      What's the difference?

    • @KangoV
      @KangoV Pƙed 16 dny +127

      In PostgreSQL a UUID is stored in 16 bytes. Indexes on these are super fast as they are performed numerically.

    • @spicynoodle7419
      @spicynoodle7419 Pƙed 16 dny +14

      When using PostgreSQL instead of MySQL, your whole DB is much slower rather than only the UUIDs xD

    • @milanmiljus823
      @milanmiljus823 Pƙed 16 dny +183

      @@spicynoodle7419 cringe

    • @landsman737
      @landsman737 Pƙed 16 dny

      hehe exactly

  • @ChrysusTV
    @ChrysusTV Pƙed 14 dny +30

    "I spent years researching this to make sure I got it right. Turns out, I don't need to. Here's what my sponsor has to say." Sheeesh, DarkViper's video accurately notes a lot of key issues, like this one. Something always felt off about this channel and getting a different perspective really made things clear for me. This channel really is just regurgitation with, as you admit yourself in the first few seconds, little research.

    • @raphaelmorgan2307
      @raphaelmorgan2307 Pƙed 2 dny +1

      also in the video shows us himself posting someone else's joke on Twitter lmao

  • @rikschaaf
    @rikschaaf Pƙed 16 dny +166

    3:49 To my knowledge, not all versions in UUID should be seen as "the next version". Some are more like variants than versions. That's why both v1 and v4 are still used often. You could say that variant 5 is a new version of variant 3 though, because we now have better hashing algs that might have not been considered in v3. Variant 6 similarly is a new version of v1, where they reordered the date to be sortable (with v7 being a variant of that). You shouldn't think of every new --"version"-- variant being better than the previous, but that they complement each other.

    • @workmad3
      @workmad3 Pƙed 16 dny +12

      Exactly this. Going back to RFC-4122 (2005 one detailing the 5 currently standard variants), it says this about the variant bits:
      > The variant field determines the layout of the UUID. That is, the
      interpretation of all other bits in the UUID depends on the setting
      of the bits in the variant field. As such, it could more accurately
      be called a type field; we retain the original term for
      compatibility.
      The idea is you pick the type that suits your needs. If you don't have any particular needs, v4 (the 'random' type) is a good choice, because you simply want a long, random value.

    • @aredrih6723
      @aredrih6723 Pƙed 16 dny +4

      It's unfortunate that the article didn't mention "version" 0 (that only contains the nil UUID), it would have made that point more obvious.

    • @davidmartensson273
      @davidmartensson273 Pƙed 16 dny +1

      Yes, they are intended to solve different problems and all come with different trade offs to achieve that, a user bound one is good if you need to be able to identify the user over values stored in many many systems to not have to add multiple extra columns.
      I see some problems with this, but I can se the reason.
      And while you sometimes do want UUIDs that are sortable, especially if you want to use them as primary keys to avoid having to inject things in pages, sometimes you really want more random with no traceable time component.
      This feels very much like a commercial for planetbase and not informational.

    • @rikschaaf
      @rikschaaf Pƙed 16 dny +11

      @@davidmartensson273 I don't think that the issue here was planetscale's article. It was very informative about the consequences of using those UUIDs in keys/indexes. I think that the issue here mostly came from Theo's prejudice and inability to see them as variants, rather than versions. To be fair, Wikipedia also talks about versions rather than variants, so Theo isn't alone in that regard. It would be good if official documentation and knowledge bases would be updated to reflect this. So, who's gonna change the word version to variant on wikipedia? 😄

    • @onoff5604
      @onoff5604 Pƙed 14 dny +6

      sounds like 'versions' of raid storage...

  • @workmad3
    @workmad3 Pƙed 16 dny +74

    Thinking about the different UUID variants as 'versions' that replace previous versions isn't quite right... the idea is that the different variants are used in different circumstances, so a deterministic variant is useful in some circumstances. Which is why there's a 'variant' number in it, so different variants can be checked differently.

  • @steveoc64
    @steveoc64 Pƙed 14 dny +73

    Javascript programmer using MySQL gives us a tech talk about the importance of performance at scale
    lol

    • @romankoncek150
      @romankoncek150 Pƙed 11 dny +6

      Probably more experienced with real projects at scale than 99.9% of people here posting useless opinionated statements having nothing better to do

    • @JohnSmith-xv1tp
      @JohnSmith-xv1tp Pƙed 8 dny +12

      @@romankoncek150 As someone who has actually worked at scale, Theo should definitely have stayed in his lane on this one. He clearly doesn't know enough about databases nor UUID for this conversation.

    • @iPuls3
      @iPuls3 Pƙed 2 dny

      Don't mind me, the Node.js TypeScript developer using MariaDB...

  • @supersat
    @supersat Pƙed 15 dny +15

    V1 was a pretty good idea because collisions were basically guaranteed to never occur. But that was in a world where MAC addresses were guaranteed to be unique, and VMs basically killed that.

  • @danhorus
    @danhorus Pƙed 16 dny +107

    This channel does read a lot of blog posts

    • @Dekutard
      @Dekutard Pƙed 16 dny

      i rather listen than read.

    • @thomassynths
      @thomassynths Pƙed 16 dny +19

      Reading blog posts is fine and dandy. Regurgitating sponsor speak isn't though.

  • @dijikstra8
    @dijikstra8 Pƙed 16 dny +82

    Using a char(36) to store a uuid just seems like the worst idea every, does anyone actually do this? Postgres has a specific datatype for uuid, does MySQL still not have that?

    • @mehmeterencelik594
      @mehmeterencelik594 Pƙed 16 dny +13

      Yes. For my opinion, this video only shows how better the postgres than other relational dbs.

    • @nathanalberg
      @nathanalberg Pƙed 16 dny +9

      Im moving our app to postgres because of the UUIDs alone. UUIDs in mysql is basically roll-your-own.

    • @SPeeSimon
      @SPeeSimon Pƙed 16 dny +4

      Yes, they should have used varchar(36). Barely joking, because i actually saw a team use nvarchar(36) on an Oracle DB for their PK. Not really a shocker, because the NIH disorder was really high. So they did not use the same raw(16) type as everywhere else in the system and did not reuse existing code.

    • @Lexaire
      @Lexaire Pƙed 15 dny +2

      @@SPeeSimon Codebase I work on uses varchar(40) and some uuids have hyphens and some don't. Thanks MySQL.

    • @johnbernardlambe8582
      @johnbernardlambe8582 Pƙed 11 dny +1

      If someone wants hexadecimal, they could at least omit the hyphens, giving 32 characters. Even worse is the 38-character format, with added chain brackets.

  • @averageemogirl
    @averageemogirl Pƙed 3 dny +5

    "hey guys! today i will be talking about something i clearly don't understand" is a great summary of this video

  • @loganyt8818
    @loganyt8818 Pƙed 16 dny +159

    instead of PlanetScale sponsoring Theo, they should have used that money to keep reasonable free plan

    • @sexymeal2427
      @sexymeal2427 Pƙed 16 dny

      lmao

    • @__Brandon__
      @__Brandon__ Pƙed 15 dny

      Except now 100 thousand new people know about planet scale and probably 5% of those people are likely to use their service. 5k new users is a lot of users

    • @epajarjestys9981
      @epajarjestys9981 Pƙed 14 dny +9

      @@__Brandon__ "probably 5%", lol. that's a completely ridiculous and extremely improbable number. it's just as probable as theo's sponsorship money being enough to finance the free plan.

  • @unknown.profile.handle
    @unknown.profile.handle Pƙed 10 dny +6

    btw it is an RFC, not just a proposal anymore.
    But there were many wrong assumptions in this video.

  • @NicolasTroncoso
    @NicolasTroncoso Pƙed 16 dny +25

    I believe there is some confusion about sortability.
    All primary keys are sortable, regardless of the value. They will fit in the B+ Tree pages with different packing sizes, and with the balancing issues described in the video.
    What some people want is that records are sorted by creation time. Which is what the CUID post talks about. Don't worry about the record being naturally sorted by creation time. Use a random key to avoid hot spots, and use a createAt (or reverse CreatedAt) index if you need to scan the table the naturally created order.

    • @rainmannoodles
      @rainmannoodles Pƙed 12 dny +1

      Exactly.
      Locating a given CUID is still efficient with the tree. Internally, the tree structure just won’t follow the creation time order. There is still a defined sort order, it just doesn’t sort based on criteria that a human would consider “useful”.

    • @PatrikTrefil
      @PatrikTrefil Pƙed 11 dny

      I am glad that someone cleared this up in the comments.
      There is still one thing that is unclear to me though. How is it any different if you use ids that are not sorted based on time creation and use an index for createdAt? The secondary index for createdAt is again a B+ Tree (or any other datastructure) which again causes hotspots. In this case the hotspots are just in a different index. It seems like the performance issues are still there. Could anyone explain if and how this solves the problem?

    • @pacifico4999
      @pacifico4999 Pƙed 11 dny +2

      But what about page fragmentation?

  • @Kazyek
    @Kazyek Pƙed 14 dny +29

    What I don't see mentioned here and should be mentioned is that your Database IDs, which you use for indexing, and referencing relationships, doesn't HAVE to be the same value you use to reference records from the API's PoV.
    You can just use an integer pirmary key, then add a binary UUID field with a unique key constraint, which you don't use for joining / referencing in the database but is used at api-surface-level to identify the records.

    • @TheCryn
      @TheCryn Pƙed 11 dny +1

      Yes but I don't see how this improves things. Consider your primary key being an autoincrement integer, that you never expose via api and an object id that is some sort of UUID that you do expose. If you expose an object via the object id (UUID), you probably want to have an index on the object id, as you now need to access the object via object id (UUID), as you don't know the primary key on an api level. So you are at the rebalancing problem again, this time "just" for the index and you probably don't win anything performance wise. If your dataset is so small that you don't index, then the performance hit for using a UUID PK probably won't matter.

    • @HenryLoenwind
      @HenryLoenwind Pƙed 8 dny +2

      @@TheCryn It improves things as you now only have one table with one index that has UUIDs. And this table will, in many cases, be relatively small, as the things you expose via ID often are short-lived (sessions, password reset tokens, email verification tokens, ...). That's way better than having a UUID primary key on each and every table---that could easily be hundreds of them.
      In any case, using a UUID as an external ID for a limited number of externally addressable objects is different from using it as the primary key for every table in the system.

    • @TheCryn
      @TheCryn Pƙed 7 dny +1

      @@HenryLoenwind I think most real world applications I have worked on had their PK for most tables exposed in one way or another (depending on where you draw the line on exposure).
      I have worked on a legacy system that dates back before the introduction of UUIDs there the original designers have put much effort into not using autoincrement PKs but to generate something time based (for sorting purposes) but also random (to prevent congestion on DB pages, if a lot of IDs/objects are added to the DB).
      You should also consider that a lot of IDs you expose (i.e. order IDs for a web shop) can give away insights you often don't want out there (see: German tank problem).
      So you probably still want most API IDs to not be predictable / autoincrement and need an ID-table for almost each table and an additional join for most queries (that can have a noticeable performance impact itself).

  • @tabletuser123
    @tabletuser123 Pƙed 16 dny +86

    “this video is sponsored by planetscale” is a good way to make sure we know the video is not credible

    • @xelspeth
      @xelspeth Pƙed 15 dny +13

      I mean he's just reading a blogpost and giving his own opinions, I don't see how this video would be credible in the first place lol
      You're supposed to form your own oppinion anyway

  • @falkkyburz
    @falkkyburz Pƙed 15 dny +11

    Why are we mixing "version" and "variant"?

  • @dragons_advocate
    @dragons_advocate Pƙed 16 dny +21

    "Planetscale is not very liked atm..." -- "Thank you planetscale for sponsoring this video"

  • @ckpioo
    @ckpioo Pƙed 16 dny +122

    did he really just say storing 20x more is no big deal casually?, all computer engineers biting their nails rn

    • @Luxalpa
      @Luxalpa Pƙed 16 dny +35

      The 20x figure is highly misleading, because it depends on what is multiplicated. As he said in the video, your records already contain a lot of data, in that case it wouldn't be 20x, more like 1.05x.

    • @ckpioo
      @ckpioo Pƙed 16 dny +11

      ​@@Luxalpa its not misleading instead you're misunderstanding it, the 20x figure is taking into account just the primary key, and youre also considering the *useful* data stored with the primary key, which you shouldn't be doing because the point is that if you consider the keys as useless data (because its mostly not used by the end user in an app) then what the 20x figure says is that as your data scales up, a larger % of the data will be "useless" data compared to another database which has the same data but is using a incremental primary key. the whole point of the key is just for locating the useful data, so if you are using a system which requires 20x more space to do just that then its basically just a big waste of storage space AND compute.

    • @Sharess
      @Sharess Pƙed 16 dny +5

      @@ckpioo you are correct, but overall, that 20+ x increase can be disregarded, unless you have an insane amount of data where reducing the storage and compute requirements would result in significant monetary savings

    • @DavidLindes
      @DavidLindes Pƙed 16 dny +5

      @@Sharess and @Luxalpa: the thing is, 20x _storage_ (long-term) for just the primary keys may not be a big deal in terms of overall table size, but it _is_ (or at least can be) for the _index size_ -- which does indeed have compute implications, as Theo discusses. And, ya know, it's nice when indexes are actually fast -- I've seen complex queries go from something absurd like 30 minutes (memory hazy on this, so grain of salt, but a long long time) to I think sub-1-second just by indexing a column that should have been indexed in the first place. But if those indexes are 20x slower (which may not strictly be the case here, not sure, but still)... ??

    • @danhorus
      @danhorus Pƙed 16 dny +5

      Keep in mind that these primary keys also appear as foreign keys in several other tables. It is not uncommon to have multi-million row transaction tables with dozens of foreign keys. This shit adds up.
      I'm also disappointed the video doesn't mention the performance loss in JOIN operations, which is also a big deal for applications with lots of users.

  • @Techman83
    @Techman83 Pƙed 15 dny +9

    This article appears to be entirely misunderstanding UUID as incremental versions, along with MySQL not handling them as well as something like PostgreSQL. UUID4 for guaranteed random, the rest have different, but valid use cases.

    • @iPuls3
      @iPuls3 Pƙed 2 dny +1

      If used properly in MySQL, it's stored as binary.

  • @benmeehan1968
    @benmeehan1968 Pƙed 16 dny +28

    Curious why there is no evidence regarding performance, only vague handwaving that storage space is a factor. What this does show is that MySQL has an incrementing index type and an indexing mechanism that is optimised for incrementing indexes. My (naive) understanding was that with a B-Tree, if the index values were random, the tree will likely tend towards being balanced (at least over time), and that B-Trees don't perform well with sequential index values, hence the need for an optimisation. Sequential values don't scale horizontally, hence the need for a larger 'namespace'. Sequential 32 bit values and random 128 bit values are apples and oranges, and meet different requirements. Choose your index keys appropriately, if horizontal scalability is a likely to be a factor (which is often a guess when we don't know just how large or distributed a storage system is going to be). Clearly a UUID can't compete with a 32bit sequential when you have a single key generator, but a 32bit sequential can't even function when there are multiple key generators (without segmenting the address space). Horses for courses.

    • @workmad3
      @workmad3 Pƙed 16 dny +15

      Yeah, the main thing I got from this was "MySQL sucks for UUID keys", because they've optimised their b-tree index page generation around sequential keys, and don't have the ability to change index type easily.
      Postgres, on the other hand, has a native uuid type that can automatically cast from strings, so you don't need to mess around with conversion functions, has a different optimization for b-trees that, while still has uuid impacts, isn't as severe from what I can tell (it's mostly a cache issue in postgres, as uuids are more likely to be spread over multiple pages, so if you've loaded a page into cache, it's not as likely to be hit again), and also gives the option of other index types that could be better depending on exact db use (like the hash index type, which only supports equality operations)

    • @7th_CAV_Trooper
      @7th_CAV_Trooper Pƙed 16 dny +4

      You don't need evidence. This is btree 101. Left inserts are slow. Go watch a video to find out why right-only append is faster.

    • @benmeehan1968
      @benmeehan1968 Pƙed 15 dny +3

      @@7th_CAV_Trooper I absolutely agree, but there is no quantification of how big the effect is, and that might mean any argument for optimisation is not relevant to any particular use case.

    • @7th_CAV_Trooper
      @7th_CAV_Trooper Pƙed 15 dny +1

      @@benmeehan1968 worst case is every pk inserts at index zero and forces full right shift of the array each time. I think DB perf is quantitized in terms of O notation rather than ms.

    • @7th_CAV_Trooper
      @7th_CAV_Trooper Pƙed 15 dny +1

      @@benmeehan1968 good point about perf not relevant for every use case

  • @debasishraychawdhuri
    @debasishraychawdhuri Pƙed 16 dny +59

    MySQL does care about a distributed database, so they are interested mostly in the performance of a single B-Tree. Imagine how hard it would be to create an auto-incrementing ID in a distributed insertion. If you use an auto-incrementing ID, you can only insert in sequence. This is why people use UUID. Theo needs to learn about databases. Everyone writing enterprise software needs to learn about databases.

    • @scragar
      @scragar Pƙed 16 dny +4

      Percona solved distributed ID insertions by having each master get a sequence, then each master only inserts IDs matching it's sequence mod the number of masters.
      So if you have 3 master servers you'll have one only inserting 1,4,7,10; another only inserting 2,5,8,11; and a third inserting 3,6,9,12; this ensures they never conflict on the ID.
      This does however mean the auto increment doesn't retain consistency between servers.

    • @workmad3
      @workmad3 Pƙed 16 dny +7

      @@scragar it also means that if you need to add a new master into the mix, you have more work to do adjusting every node, especially if you're using the id%n to identify the master for some purpose as well

    • @keithjohnson6510
      @keithjohnson6510 Pƙed 16 dny

      @@workmad3 A variant of @scragar option you just use a compound primary key of RecordID(seq), ClusterID, . ClusterID could even be a single byte giving you 256 nodes, U32 for the RecordID, then every node can have 4294967295 records. Guaranteed no collision with 5 bytes. Synchronising clusters is also very simple. Note also because is a sequence, RecordID, ClusterID, performance is also a added bonus as B-Tree's balance better than UUID.

    • @Novascrub
      @Novascrub Pƙed 16 dny +4

      @@scragar We used to do something like this on the client side. Each client (web sever) would claim a block of ids by inserting a claim into an ids table (i claim id's 90-99), and then assign them client side. Combed uuids are just better. We knew about the page splits problem and solved it in 2003.

    • @paulie-g
      @paulie-g Pƙed 16 dny +2

      You need to learn a bit more than you know to get you to the level you think you are at. Lamport timestamps or equivalent are necessary in a distributed system anyway for serializability and you get your unique, monotonically increasing id for free.

  • @titbarros
    @titbarros Pƙed 16 dny +30

    ULIDs resolve most, if not all of the problems mentioned. Is my go to now

  • @Denominus
    @Denominus Pƙed 16 dny +41

    UUIDv7 is suitable for all the relational dbs, except MSSQL which would require V8 and a custom generator.
    Our largest table is many billions of rows and UUIDv7 still performs well in Postgres.

    • @paulie-g
      @paulie-g Pƙed 16 dny +1

      Out of curiosity, what stops you using int autoincrement as primary key and uuids as an extra field if you need them for some reason? Is it some sort of roll-your-own-sharding thing?

    • @Denominus
      @Denominus Pƙed 16 dny +10

      @@paulie-g An earlier version of the system did actually have that, but it didn’t buy us much (if anything). It’s a big distributed system, so to say 99%+ of queries were done by the uuid (external_id as we called it) would be an underestimate.
      It also wrecks one of the benefits of using uuids as identifiers, that is you will need to do round trips to the DB on insert to get the ID for FKs.

    • @paulie-g
      @paulie-g Pƙed 16 dny

      @@Denominus I always assumed people would use stored procedures or equivalent to avoid the round trip.

    • @uttula
      @uttula Pƙed 16 dny +3

      @@paulie-gwhen you already have a (for most practical purposes) unique key you can use, adding a autoincrement field to use as your synthetic primary key serves zero purpose aside the fact that meatbag humans like sequential integers for some weird reason.

    • @paulie-g
      @paulie-g Pƙed 15 dny

      @@uttula UUIDs are synthetic as well. Integer primary keys are an implementation detail. And they're an implementation detail for a reason that has nothing to do with human preference.

  • @disjustice
    @disjustice Pƙed 14 dny +6

    I generally use an integer sequence as the primary key and have an associated UUID in a unique index. The primary key is only ever used internally in the DB model and in application code. When exposing the entity in public references, the UUID is used. In addition to the issues touched on in the article, this saves us from another drawback: having to use that same 128bit UUID in foreign keys all over the DB. This goes for any globally unique identifier scheme. Because these unique identifiers need to be globally unique, they are generally so long that they are unwieldy to use as primary keys, even if they are sortable.

  • @worldadmin9811
    @worldadmin9811 Pƙed 16 dny +24

    RIP ULIDs got no mention even when shown in article

  • @tesilab994
    @tesilab994 Pƙed 3 dny +2

    There seems to be just a couple of quick takeaways here. The only UUIDs of real interest are v4, and they only serve two essential use cases. 1. Prevent collision: you need to generate a key to associate with a record before it enters the database, or 2. Create a session token - you need a key which cannot be guessed. You are free to use more efficient primary keys where an outside agent cannot use it to access records. You are free to acquire the more efficient database assigned key and use that after the fact. You are certainly free to use time-sortable keys for tracing activity in your system.

  • @Viviko
    @Viviko Pƙed 16 dny +15

    Why do we need to sort by ID? Why not sort through other fields, like a user’s name, creation date, etc
?
    Point of the ID is
 as the name implies, identify some data. We don’t need to sort it using that field.

    • @7th_CAV_Trooper
      @7th_CAV_Trooper Pƙed 16 dny +2

      Because the clustered index is sorted by the PK.

    • @itsbazyli
      @itsbazyli Pƙed 15 dny +5

      As explained in the video - it helps to improve performance of retrieving a single record by ID. It's not related to your application logic's sorting, just to how databases access records by ID.

    • @7th_CAV_Trooper
      @7th_CAV_Trooper Pƙed 15 dny +1

      Sorted data makes it possible to perform a binary search. The ID isn't the offset into the data file. So even a point query requires a search.

    • @everton000rsc
      @everton000rsc Pƙed 4 dny +1

      When the database gets to huge sizes any index created is costly. I have a use case in my company where the ordering by ids is used, if we had to order by createdAt we'd need an index in that column, and guess what would be the size of that index in disk in our DB? Almost 1TB (and it takes weeks to create)... The primary key is almost always in memory (cached), almost always the first and fastest to load, so using it for ordering is much less expensive compared to creating another index just for that. But I understand that use cases vary, depends on what kind of system you're working with and it's business rules and requirements.

  • @username7763
    @username7763 Pƙed 14 dny +5

    UUIDs were not designed to be database primary keys. They were designed to solve problems with distributed networking, not data storage. There is no problem with UUIDs if used as designed. Kids these days see UUIDs and immediately think databases. Ha!

  • @Chamassa1210
    @Chamassa1210 Pƙed 16 dny +4

    The bad part about the size comparison, is that where it matters you are probably using bigint already, making binary(16) only twice as big as bigint, without collisions

  • @TFayas
    @TFayas Pƙed 4 dny +2

    Solution I've always used is have an incrementing int id primary key and a guid for external reference (and in some cases a unique human reasonable code name).
    Id for performance and internal joining, code name if you need to reference it in code, and guid for matching across environments in staging situations where you push items from one environment to another.

  • @MeriaDuck
    @MeriaDuck Pƙed 16 dny +6

    As others also pointet out, they are not versions, they are _variants_ with purposes per variant.

  • @ed.puckett
    @ed.puckett Pƙed 16 dny +3

    I avoid sequential ids because they make it too easy to accidentally reference an incorrect record. With uuids, there is no chance of this.

  • @user-qq7yc1qp8z
    @user-qq7yc1qp8z Pƙed 16 dny +56

    Why even care about planetscale? For production, you have AWS RDS, for small projects you have Supabase.

    • @MrDadidou
      @MrDadidou Pƙed 16 dny +6

      how does it invalid the whole discussion around uuids ?
      It's like having a guy screenshare his IDE when talking about a complex programing question and your only input is "dude why even use the gruvbox theme in your IDE? you have Catpuccin and DoomOne available."

    • @carlosmspk
      @carlosmspk Pƙed 16 dny +6

      @@MrDadidou I like that you tried to show how silly his comment was, but ended up providing a valid comment for a video. It's perfectly valid for you to comment on whatever shows up in the video. He could even just go "Damn Theo, your hair is looking mighty fine today" and it would be a valid comment.

    • @jeffreyjdesir
      @jeffreyjdesir Pƙed 15 dny +1

      @@MrDadidou I just wish the discussion wasn't based on faulty premises (like UUID variants serve to succeed their formers) cause it suggests biases I'm sure we're all aware of

    • @everton000rsc
      @everton000rsc Pƙed 4 dny

      RDS gets really expensive when you get to the point you need horizontal scaling. Even planetscale is too expensive, my company choose to deploy our own vitess cluster manually to be able to save almost 2/3 in database costs. I get it that people don't like that planetscale removed the free plans but after I suffered so much in the migration from mysql/mariadb to vitess I understand the real value of what they are delivering. And they're almost the only maintainers of vitess atm so i hope they don't go out of business anytime soon...

    • @raphaelmorgan2307
      @raphaelmorgan2307 Pƙed 2 dny

      "why even care about planetscale?" because they're paying him money, obviously! (ty for posting other options)

  • @ericf6964
    @ericf6964 Pƙed 6 dny +2

    Another problem with UUIDs is logging. First, logs almost always store UUIDs as text, which as you pointed out is very inefficient at 38 bytes per. Second, since they are designed to be unique, they tend to be relatively uncompressable compared to normal text. The time based UUIDs may be less bad in that respect. But the bottom line is that if you have a requirement to log your identifiers, then using UUIDs will practically guarantee huge storage costs for those logs.

    • @everton000rsc
      @everton000rsc Pƙed 4 dny +1

      Good point, at massive scale any extra byte matters

    • @someed3702
      @someed3702 Pƙed 3 dny

      @@everton000rsc If we're talking extra bytes on a record, not really. When we're talking about extra bytes on every index record, Absolutely.
      It's easy to think that the index files are storing one copy of the primary key per record in the database. If your index is just for equality matches, that's even correct. But if your index handles sub string searches or most kinds of approximate match that's very much not the case. Note that I'm not just talking about the index for the primary key, which I would expect to be an equality index. I'm talking about any other columns on that record, so some of them could reasonably be indexed for sub string searches.

  • @fifty6737
    @fifty6737 Pƙed 16 dny +5

    UUIDs event account for Time Travel but only as back as 1568, apparently, in the 1567 they were using Integers

  • @XCanG
    @XCanG Pƙed 16 dny +6

    I used snowflakes a lot and like them slightly more, than other type of identifiers, especially considering that each of them have datetime baked in.

    • @2penry2
      @2penry2 Pƙed 16 dny

      Yeah snowflake IDs are nice, I like being able to sort via time on them

  • @sd_pjwal
    @sd_pjwal Pƙed 16 dny +15

    My first startup I worked for, the main DB admin insisted on using primary keys made up of real user data or composite keys made up of real information. It was so insane. Social Security # as primary ID for a user? This was a benefits enrollment company. Shocker, some of our customers had illegals that shared SS #'s.

    • @User_2
      @User_2 Pƙed 16 dny +3

      > Illegal argument exception đŸ€­

    • @7th_CAV_Trooper
      @7th_CAV_Trooper Pƙed 16 dny +4

      SSN isn't unique by law and person can get a new one. They are not identifiers. Don't use natural keys.

    • @Spiker985Studios
      @Spiker985Studios Pƙed 15 dny

      @@7th_CAV_Trooper SSNs + Birth Date is unique, and it's why they are associated together. You will never see an SSN field without a DOB field, unless they're looking up your DOB by some other means

    • @7th_CAV_Trooper
      @7th_CAV_Trooper Pƙed 15 dny

      @@Spiker985Studios yes, but OP suggested using SSN alone as PK, unless I misunderstood.

    • @RickYorgason
      @RickYorgason Pƙed 15 dny

      In Canada, I believe it's illegal to use social security numbers as database keys.

  • @ShinigamiZone
    @ShinigamiZone Pƙed 15 dny +1

    UUID v4 do not only have the "4" statically at index 12, but also there is a "8, 9, a or b" at index 16 (starting index by 0)
    So its /^[0-9A-F]{8}-[0-9A-F]{4}-4[0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$/i

  • @paulmoore3755
    @paulmoore3755 Pƙed 12 dny +2

    Some reference to the collisions that included the context around them would be appreciated

  • @paulurban2
    @paulurban2 Pƙed 15 dny +3

    Having sequential IDs create contention hotspots in the B-Tree. No good for horizontal scalability or vertical scalability.

  • @mmathijs
    @mmathijs Pƙed 16 dny +7

    Who needs standards, its all about the vibes. UUID v8 is by far the best

    • @Zeragamba
      @Zeragamba Pƙed 16 dny

      variant 8 still uses the same formatting of the 128bits so it's compatible with any system that can store UUIDs, but leaves the specific way to fill those bits up to the user/system

  • @JimCraddock
    @JimCraddock Pƙed 6 dny +1

    35 years as a data architect. Inevitably, I run into uuid's being used by people that do not understand databases. Use an incrementing integer for your clustered index, preferably make it your primary key. The primary key is included in every index on a table, so if you are using the uuid as the primary key you just increased the size of every single index by a lot compared to an integer.

    • @everton000rsc
      @everton000rsc Pƙed 4 dny

      Yes, I initially believed that UUIDs are excellent and didn't care much about what ppl said about it's dawbacks, but after working on backend doing some query optimizations and understanding better how relational databases work I now understand how naive I was

  • @michaelutech4786
    @michaelutech4786 Pƙed 16 dny +4

    You can always use UUIDs (or any stable data whatsoever) to compare them orderly. The problem with non-ordered ID's that that when inserting data, they are not growing according to that order. Sequential IDs do that, the next ID will always be greater than the previous one. When inserting random IDs, they tend to fill pages better (it's random on which side of a node the tree grows). Not all indexes are necessarily trees.
    The problem with such statements is that there is no A is better than B in all generality. You have to actually look at what your systems do with your data.
    If you need distributed keys, e.g. keys where there are multiple authorities (instances) that generate synthetic keys (keys that are not data driven), then UUIDs and other mechanisms make the handling of these authorities easy. That has a price. If you don't want to pay this price, you need to pay another one. Maybe you need to use data storage that works efficiently with your keys. Maybe you organize your key-namespaces or scopes manually such that you can use the most efficient local storage and key management and save runtime performance but then have to invest in deployment configuration.
    Any attempt to find a silver bullet can at best work for were wolfs and vampires but will fail for mosquitos or blue whales.
    These generalizations and over-simplifications are just stupid. UUIDs are perfectly fine and battle proven data types for certain applications. When used in the wrong context or in the wrong way (version, ...), they fail, just like any other mechanism does, 32 bit numbers or 640K were proven not to always be enough, despite timely expectations.

  • @johnredberg
    @johnredberg Pƙed 15 dny +3

    So... "The Problem with InnoDB".

  • @yuwei2522
    @yuwei2522 Pƙed 14 dny +9

    Sounds more like a problem with MySQL. Some non-relational databases would actually benefit from the randomness of UUIDs.

    • @jongeduard
      @jongeduard Pƙed 7 dny

      Very true, I don't remember all the details, but for example SQL Server has solved this problem because of the way it sorts and stores the rows in a sorted order, while it generates the ids based on timestamps at the same time.
      Although it's still important how you configure things exactly, there are certain details which need to be done right.

    • @everton000rsc
      @everton000rsc Pƙed 4 dny

      Unfortunately not all systems can/should/are built with non-relational DBs, it has some really bad drawbacks for some use cases

  • @mattiasfagerlund
    @mattiasfagerlund Pƙed 3 dny

    Random ids also have the disadvantage that for a large table, if you need to access 1000 rows, they'll be stored in 1000 different pages, even if the rows are conceptually consecutive. Non random ids instead create hotspots during inserts. So a semi-random index would tick both boxes, but every use case is different.

  • @zebraforceone
    @zebraforceone Pƙed 16 dny +2

    Of course they hurt performance, they will continue to do so until 128-bit architecture is a thing.
    They're not supposed to match integers in terms of performance.

  • @TizzyT455
    @TizzyT455 Pƙed 14 dny +3

    Cuid2 sounds unnecessary. Its just a seeded sha3 from my reading. What is wrong with just combining a sequential counter and any secure prng? What I tend to use is an N-bit sequential part and an X-bit random number (usually 64/448). This allows to generate sequential numbers up to N, while having collision resistance of X). If you really wanted to, you could just use SHA3 to generate the X bits yourself if you want the same randomness as cuid2. Of course adjust as needed.

  • @Hypn0s2
    @Hypn0s2 Pƙed 16 dny +24

    Jebus. I couldn't last the full video. I don't know how you followed this all.
    I need to wash my brain now with something else and pretend this isn't real.

    • @diamondkingdiamond6289
      @diamondkingdiamond6289 Pƙed 16 dny

      It isn’t those aren’t even the main reasons of why you shouldn’t use uuid in MySQL. Using uuids in MySQL is bad for performance in MySQL but it’s fine in postgresql

    • @KangoV
      @KangoV Pƙed 16 dny +6

      Perfectly easy to understand. Versions in this instance are actually Variants. Each variant does not supersede the last.

    • @denisblack9897
      @denisblack9897 Pƙed 16 dny +1

      Like in Loki?😅

  • @acasualviewer5861
    @acasualviewer5861 Pƙed dnem

    One thing that seems to be a wrong assumption in the video: just because UUIDs may not be naturally sortable (in the sense that they don't generate in a sequential sortable way), doesn't mean that you need to do a table scan in a B-Tree.
    If you use random values you can still sort those in a B-Tree by whatever number or string that they represent. So a B-Tree will still find the data quickly. You aren't forced to do a table scan. Indices still work.

  • @DonAlcohol
    @DonAlcohol Pƙed 15 dny

    also for a start date/time to begin counting i have a good one :P , theres this relly weird but consistent thing the eath does , its like a wobble, and the whole thing lasts about 26000 years (25800) but as a consequence of this the sun rises in a different constellation on the spring equinox every 2160 years... so if we pick the beginning of one of those as the starting point to begin counting time wouldnt that be a great idea ? coincedentally the last time we shifted (from Ram to Pisces (its reverse form the horoscope) , its at the same time we shift from BC to AD so 1 jan of the year 1 seems like a nice spot to start counting time not?

  • @peppybocan
    @peppybocan Pƙed 16 dny +3

    I haven't heard of CUID, but I have heard of the ULID. I don't need the 128bit of random data. I am fine with the 64-80 bits of random data. It doesn't even need to be cryptographically secure, to be honest.

  • @dfs-comedy
    @dfs-comedy Pƙed 16 dny +6

    Using MySQL is the first mistake.
    FWIW, PostgreSQL has a native UUID type that occupies 16 byes of space. But yeah... I don't like UUIDs as primary keys.

    • @nicolaicornelis2853
      @nicolaicornelis2853 Pƙed 15 dny +1

      binary(16) is the same as a native UUID type though (storage wise).

  • @markm1514
    @markm1514 Pƙed 12 dny +1

    Editor: 'If you use the word "represented" too many times, too closely, it's bad writing, grammar and meaning be damned.'

  • @Marfig
    @Marfig Pƙed 13 dny +1

    Ok. Good video. But just one clarification: UUID versions are not to be interpreted as sequential version numbers. They have been created simply to address different ways to generate a UUID values according to use cases. The idea is simply, If you need a certain type of UUID you will probably find it. V4 is the most common, not because it's better or worse but simply because it's fully random generation algorithm is a more prevalent requirement in computing.

  • @dahahaka
    @dahahaka Pƙed 16 dny +3

    Is indexes nowadays an accepted form of indices? When did that happen 😭
    Edit: typo

    • @ceigey-au
      @ceigey-au Pƙed 16 dny +4

      Well, it's been more acceptable than spelling "accepted" as "excepted" for a while, but both are understood colloquially _irregardless_ ;-) (sorry, couldn't let those opportunities pass by...)

    • @dahahaka
      @dahahaka Pƙed 16 dny

      @@ceigey-au omfg how did that happen, well that's embarrassing XD

    • @dahahaka
      @dahahaka Pƙed 16 dny +1

      @@ceigey-au no worries, I would've done the same 💀

  • @Mike-zr9wq
    @Mike-zr9wq Pƙed 16 dny +3

    You'd have to be crazy to pay PlanetScale today

  • @TerranVisitor
    @TerranVisitor Pƙed 15 dny +1

    @Theo : Funny. About 4 Decades ago I designed a UID that incorporated brute force and accident aspects - such as mentioned herein - for incorporation into (truck) weigh scale software. The software which utilized this UID was smaller than 1.44MByte, including graphics, credit card processing, remote access and update, and recording of transactions. The software was stored on an EEPROMish type device which emulated, as you MAY have guessed, a floppy disk. For those curious, the OS was (a form of) DOS and it was programmed in BASIC and assembly.

  • @pauldraper1736
    @pauldraper1736 Pƙed 8 dny +1

    Bigger data size is extremely important. Sure it's irrelevant for disk storage or row size, but database are often RAM constrained. And perform terribly if they can't fit their indices (including UUID PKs, FKs) in memory.

  • @DominikZogg
    @DominikZogg Pƙed 16 dny +21

    Can we all agree that mysql is more of a pain in the ass than a decent database? So it's not a relevant metric for any solution.

    • @aldi_nh
      @aldi_nh Pƙed 16 dny +2

      no

    • @DominikZogg
      @DominikZogg Pƙed 16 dny +3

      @@aldi_nh if you burned your fingers often enough you'll change your mind. Postgres besides its ugly user management is superior. Or switch to non relational database cause most uses cases are better represented by documents anyway.

    • @aldi_nh
      @aldi_nh Pƙed 16 dny +3

      @@DominikZogg oh please, i beg to differ, mongodb are even harder to mantain. i dont know where you're coming from but from my experience, very little use cases are suitable with documents

    • @DominikZogg
      @DominikZogg Pƙed 16 dny

      @@aldi_nh web applications, websites

    • @byuwur
      @byuwur Pƙed 16 dny +4

      My dude doesn't know how a relationship works.

  • @TheOmfg02
    @TheOmfg02 Pƙed 16 dny +25

    5:00 complains about 1568 as an implementation detail
 proceeds to use 1970 instead.

    • @linminsu3443
      @linminsu3443 Pƙed 16 dny +5

      1970 is the unix epoch, which is already used everywhere

    • @mzg147
      @mzg147 Pƙed 16 dny +1

      complains that years 1568-1990 will never be used, therefore it's wasteful

    • @TheOmfg02
      @TheOmfg02 Pƙed 16 dny

      @@linminsu3443 i think you missed the point of my comment 😅

    • @KangoV
      @KangoV Pƙed 16 dny

      Anyone who's done anything with dates should know what the Gregorian calendar is and when it started. I knew it straight away :) Try converting dates to other calendars like Chinese and JIS and you'll see how important this is!

    • @gbjbaanb
      @gbjbaanb Pƙed 16 dny +3

      UNIX epoch time is as arbitrary a standard as Gregorian or any other time standard. You just have to pick a start point and a storage size. UNIX time isn't necessarily the best, as we will find out in 2038.

  • @MrL22
    @MrL22 Pƙed 6 dny

    Can someone explain what is wrong with storing number based id's but then using an encrypted version of that id for anywhere visible by the user? I have been using red-explosion/laravel-sqids

  • @Danny1986il
    @Danny1986il Pƙed 15 dny +2

    I think in UUID the v stands for variant, not version

  • @klaudyw3
    @klaudyw3 Pƙed 16 dny +12

    Is the versioning done by the USB forum? It sure looks like it might be.

    • @KangoV
      @KangoV Pƙed 16 dny

      I think you've misunderstood what version means (it's actually the type).

    • @Spiker985Studios
      @Spiker985Studios Pƙed 15 dny +5

      @@KangoV I think that you missed the fact that the USB spec is horrendous, and that if using the version marker on UUID as an actual version, and not a variant - they are comparable to how horrible they are to follow

  • @robertluong3024
    @robertluong3024 Pƙed 16 dny +5

    4am, waiting for my medical procedure to start and I'm spending it learning about UUIDs.

  • @thomaseichinger1717
    @thomaseichinger1717 Pƙed 15 dny

    What do you think about to use UUIDvTimebased for Businessobjects and snowflakes for provessingIdentifyers what are also not lived "long".
    integer/bigint/.. are amazingly fast because of the processor structure and also because of the storage structure.

  • @MikkoRantalainen
    @MikkoRantalainen Pƙed 14 dny +2

    15:20 I don't understand this argument. If you're happy to store 32 bit integer in binary form, you should be happy to store the 128 bit UUID in binary form, too. The database can convert from binary value to human readable value on the fly, just the way it happens for binary 32 bit integers, too!
    With the same logic you could argue that 128 bit binary UUID is actually better format than 32 bit integer stored as base-10 number as a string.
    That said, if your database doesn't support UUID column type, maybe you should switch database.

  • @sturdyfool103
    @sturdyfool103 Pƙed 16 dny +4

    I always love the large discussion around databases, because they’re nearly useless to me, as in, in fifteen years I’ve never once used a database, and only three of the thousands of my applications would’ve been an ok use of a database, with the rest just needing a config file

  • @ziv132
    @ziv132 Pƙed 16 dny +3

    I personally like ULID's, not just because they are incremental, but also the fact that I can just swap to using them in the same BINARY(16) column that was used for UUIDs (probably not the best idea to switch and have mixed content, but eh)

  • @Gersberms
    @Gersberms Pƙed 6 dny +1

    It turns out I accidentally invented my own UUID for references created on different systems. I wasn't smart enough to realize that this is what I was doing, but I was smart enough to have a sortable timestamp built in. I suppose I can pat myself on the back while kicking myself.

    • @everton000rsc
      @everton000rsc Pƙed 4 dny

      Haha, it happens. Creating a whole solution for a problem and then finding out that there are already libs that solved that same thing but better... (happened to me more times than I would've liked...)

  • @nicolaicornelis2853
    @nicolaicornelis2853 Pƙed 15 dny +2

    Nobody should use UUIDs as "non-guessable identifiers", so what the cuid(1) post is saying is weird - they're not meant to be cryptographically secure, they're meant to identify rows. If you want a secure key, generate a separate random column and put an index on that, using a truly random value. It's two entirely different problems.

    • @MurtagBY
      @MurtagBY Pƙed 4 dny

      They are sort of secure - people inspecting your business can not see how many users have registered, actions made, bills paid

  • @trapfethen
    @trapfethen Pƙed 16 dny +3

    Couple Things. UUIDS can be stored as a regular column with an index for fast lookup while using an integer for the primary key. The UUID is used for externally representing the resource, while the integers are used internally for performance (it is more resource intensive to do a 128-bit compare than a 32 or 64 bit compare on most CPUs and that is if you use the binary format). This approach even allows for distributed systems to coordinate via message passing, only using the UUIDs instead of the internal integers which are treated as an implementation detail. This approach is less attractive than just storing the UUIDs as the primary key which makes coordination and replication trivially simple by comparison. It is an option worth exploring based on your needs.
    The second thing is that the rebalancing problem introduced by random UUIDs is more severe than communicated here. While it is true that random UUIDs can cause databases to not utilize a larger portion of page sizes, the real performance hit is the stochastic nature of B-tree rebalances. B-trees are a compromise between the theoretically optimal B-tree and the reality of modern computers where data retrieval is severely slower than multiple comparisons. B-trees lean into splitting up the tree more aggressively at each level and requiring more value comparisons over-all in order to reduce the number of costly data-fetches per look-up. This causes each rebalance to be more expensive than in a comparably unbalanced binary search tree and require touching more nodes than in a BST. In particular, unlike in a BST. b-tree balance procedures requires frequent changing of the node's content rather than just changing the links between nodes. Coupled with the ripple effect that can occur where b-trees might have to traverse all the way back to the root node in order to rebalance, the performance hit is worse than it first seems and is slightly random itself. When b-tree insertions are roughly ordered, these copies and extra comparisons are more akin to amortized costs like we see with auto-scaling Vectors. When the insertions are random, the frequency of multi-level rebalancing operations increases. If you support key removals, even more so.

    • @nouveu1
      @nouveu1 Pƙed 15 dny

      But primary key is also an (unique) index and all indexed columns is what causes pressure when writing stuff. More indices means bigger penalty hence having both sequential id and uuid doesn't help and only make things worse. Not to mention that under hood index probably don't store whole record but pair uuid+id hence you have to make more lookups to get whole record.

  • @fgregerfeaxcwfeffece
    @fgregerfeaxcwfeffece Pƙed 16 dny +4

    The guy who came up with the 1568 wins a paddlin. That's a paddlin well deserved if I ever seen that. Unixtime was already the established standard and this new one does nothing better.

    • @guguludugulu
      @guguludugulu Pƙed 10 dny +2

      First UUIDs were used in 1980 and were counting time since 1980-01-01 and then were standardized around 2005 to use a gregorian calendar.
      While you could argue that unixtime was an established standard in 2005 i don't think it was "the" established standard back then since the programming landscape was very different and most of the programmers would use windows (well, arguably they still are). Windows times starts at 1601 btw.
      Either way that's a non-issue. If you have at least enough of a brain to carefully read the RFC and know how to count you won't have any problem implementing UUID.

    • @fgregerfeaxcwfeffece
      @fgregerfeaxcwfeffece Pƙed 2 dny

      @@guguludugulu Fair point.

  • @fgregerfeaxcwfeffece
    @fgregerfeaxcwfeffece Pƙed 16 dny +2

    This makes me feel justified in using my own function, that literally just uses the crypto safe random byte generator of the OS.
    It's slower then the regular one, but slower in this context means: several megabytes per second. So it's just unfit to overwrite entire drives.
    I just took the shortcut and it turned out to be just way less of a mess, while being better at doing the job it is supposed to do.

  • @MonsterWriter
    @MonsterWriter Pƙed 16 dny +1

    As I understand, the database does not need to rebalance the B-Tree very often if the indexed value is sequential (the id is sortable). If the ID is random, the tree deforms much faster.

    • @BruceWilsonTMB
      @BruceWilsonTMB Pƙed dnem

      That's backwards. If the indexed value is sequential, then all the new rows will be added at the end, and the tail of the index will get very heavy. The existing part of the tree will be fast, the newly added records will be slower to access until it rebalances. With a random ID, the values are distributed equally across the index, and the index stays mostly in balance for a long time.
      He mentioned page splitting. Page splitting happens with both sequential and random IDs, it's just a question of where (a lot in one place, or a little bit everywhere). He mentioned the worst case of 50% full pages, but failed to mention that most databases tend toward 75% full over time.

  • @DominikZogg
    @DominikZogg Pƙed 16 dny +3

    Autoincrement (number) cannot be used by developers who care that an entity should have an id before being persisted, so for me any argument pro them is irrelevant.

    • @gbjbaanb
      @gbjbaanb Pƙed 16 dny

      You pre-allocate the id in such cases. Most DBs will let you do that.
      But frankly you're probably doing it wrong if you want an entity to be tracked by the DB before you've even added it to the DB.

  • @tinymouse2
    @tinymouse2 Pƙed 16 dny +9

    I just use v4 and write my code with the idea of "can this collide? Oh it can? I'll check before generating it"
    Adding a read before generating and a write is a worthwhile step to ensure I don't run into a collision

    • @NubeBuster
      @NubeBuster Pƙed 16 dny +11

      You're writing a check for UUID4 collision?
      If you gave every person on the world 500 million uuids, there would be only 50% chance that there is 1 collision.

    • @tinymouse2
      @tinymouse2 Pƙed 16 dny +5

      ​@@NubeBuster What can I say, I'm generous with my UUID's.
      It's seriously overkill I know, but in the systems I've helped implement, the result of a collision could mean personal user data gets sent to the wrong user and that's not acceptable to me.

    • @jameslouizbuilds
      @jameslouizbuilds Pƙed 16 dny +2

      Realistically how likely are you going to get a collision? Are you operating at a huge scale?

    • @tinymouse2
      @tinymouse2 Pƙed 16 dny

      @@jameslouizbuilds I added some logging for when we have a collision, and so far we've had none.
      The scale for the biggest thing I worked on with this is still relatively small, but we're generating maybe a few hundred UUID's a day, and doing a read that many times a day is a pretty small overhead for the absolute guarantee that we don't pipe sensitive information to the wrong party.
      If it was for something where security didn't matter I probably wouldn't have added the collision check honestly.

    • @NubeBuster
      @NubeBuster Pƙed 16 dny

      @@tinymouse2 tell the global news outlets when you get a collision. This will be world news.

  • @d3fau1thmph
    @d3fau1thmph Pƙed 15 dny

    I use internal sequence as PKEY and UUID as "public id". PK is never used in the code or visible in any API, it is only used on the database level for relational stuff.
    That aside, are UUIDs equally problematic in PostgreSQL?

  • @gro967
    @gro967 Pƙed 16 dny +2

    As long as you are using UUIDs with PostGres in your DB and Nanoid everywhere else, there is no problem.

  • @MichaelKire
    @MichaelKire Pƙed 16 dny +6

    Regarding the sortability being a potential security: Security by obscurity is not actual security.

    • @jwueller
      @jwueller Pƙed 16 dny +3

      Technically it is if the obscurity is big enough. All of cryptography is technically just very hard to guess, not impossible.

    • @itsbazyli
      @itsbazyli Pƙed 15 dny +4

      There are a number of features where obscurity is the default mode of security. Think about situations like "share via link". You can do that to an unlisted CZcams video. The only thing you need to know to see the contents is the link itself. If CZcams's video ID was sequential, or guessable even just in part, you would be able to enumerate videos one-by-one, and discover every unlisted video on the platform. If the video ID is fully random though, your chance of "stumbling upon" an unlisted video is significantly lower. Not zero, but combined with rate limiting and other anti-bot security measures, becomes a pretty decent barrier from a bad actor doing just that.

    • @ivanfilhoz
      @ivanfilhoz Pƙed 15 dny

      This is a good motto for source code or other implementation details, but it shouldn't be taken generally. Exposing less information is part of security. The whole point of secrets, like keys, tokens or passwords, is to keep them safe. But they're not the only data that should be hidden, are they?
      Your application must only expose information it is supposed to, and if the sorting is guessable by your IDs, you are leaking data. Timestamps are not always sensitive data, but they often are, such as in bank transactions or medical history.

  • @hansdietrich1496
    @hansdietrich1496 Pƙed 16 dny +3

    And all that hazzle, just because incrementing an integer is too boring for some people ...

    • @nnnik3595
      @nnnik3595 Pƙed 16 dny +1

      Seems like you do not understand basic programming so let me explain:
      Certain systems become so large and distributed it is difficult and slow to get definite answers.
      In those systems it is difficult to just "increment an integer".
      These systems are also fairly common due to the scale of computing we use in every day lives.
      As such we just generate an ID we know will be unique for all of our use cases.

    • @Luxalpa
      @Luxalpa Pƙed 16 dny +1

      Incrementing requires a mutation, so it requires one to wait for an authority to give out those IDs. That's a nightmare in distributed systems.

    • @hansdietrich1496
      @hansdietrich1496 Pƙed 16 dny

      @@Luxalpa No. You just need to give the initial value to the one creating its IDs (and only check for collision on insert). And for all creating authorities, you have their own IDs. which become part of the ID, e.g. 1_000_000_000_000 * CreatorID + generatedID. All easier than that UUID stuff.

    • @KangoV
      @KangoV Pƙed 16 dny

      @@hansdietrich1496 We have between 20-100 nodes that are created/destroyed based upon load. These are split across multiple availability zones and regions and are stateless. Now, how would you handle the creation of IDs across that scenario? We're talking trillions (dns queries).

    • @hansdietrich1496
      @hansdietrich1496 Pƙed 15 dny

      @@KangoV Not that complicated. Each of your upcoming nodes has a unique ID. 20-100, so one byte would be enough for that. Be generous and reserve two bytes. So each of your 2^16 max parallel hosts has its unique ID. Then you have two bytes of host-id and six bytes of whatever-id, and you got your IDs sorted within a single long int. If a single host runs out of IDs, just give him a new one of the probably superflous host IDs. And if in doubt, take some int128 and split it up accordingly.

  • @rafaelmesaglio1399
    @rafaelmesaglio1399 Pƙed 11 dny

    So how do NanoIds (and other alternate ids) solve the sortability/paging issue?
    Not to mention those likely won't be natively handled by the database and language of choice, i don't think it's a great idea to go dependency heavy either

  • @xxgn
    @xxgn Pƙed 16 dny

    One interesting benefit to having your IDs be ordered is that querying data by date tends keep data contiguous. While physical disk contiguity is not critical (on SSDs), it allows some substantial optimizations. For example, MS SQL column store queries will skip reading segments that have no ids in the queried range.

    • @gbjbaanb
      @gbjbaanb Pƙed 16 dny

      Except on postgres which doesn't have clustered indexes. Data is written to disk on a "as it comes" basis. There's even a command to reorder the data to be sequential.

  • @coolemur976
    @coolemur976 Pƙed 16 dny +5

    "The calculated probability of at least one collision occurring when generating 1 billion UUIDv4 identifiers is effectively 0. "
    Are collisions even real ?
    Ok, so you might get an error when creating few billions of records, so what ? Try, catch, retry.
    And about performance - if your UUIDv4 is too slow for your billions of records, perhaps it's time to archive the data ?
    I mean everyone was using UUIDv4 for a long time. And suddently it's a problem ? How so ? Or is it just someone wants to push their own agenda while introducing their own type of IDs?

    • @IceTank
      @IceTank Pƙed 15 dny

      Yeah, Wikipedia says you need to generate one billion v4 UUIDs per second for 86 years to have a 50% chance of collision.

  • @NeoChromer
    @NeoChromer Pƙed 16 dny +5

    imagine using mysql in 2024 holy..

  • @TapetBart
    @TapetBart Pƙed 16 dny

    I work a mot with time series data from measuring devices, and my biggest problem with UUIDs (each device has an uuid as a PK) is that you can basically not compress them.
    You can compress a 8 byte int by only storing the difference between each int.

  • @winchesterdev
    @winchesterdev Pƙed 15 dny +1

    UUIDv7 can have counters which makes them sequential as well.
    The UUIDv7 that could be shipped with Postgres 17 has a randomly-seeded counter.

  • @notangelmario
    @notangelmario Pƙed 16 dny +3

    classic theo use someone else's work when you don't want to do it. man i love your work but you got to stop being so obvious about it

    • @philadams9254
      @philadams9254 Pƙed 16 dny

      Do you mean because he copied Josh tried coding's video? Or something else?

    • @notangelmario
      @notangelmario Pƙed 16 dny

      @@philadams9254 well i didn't know about that. but yeah sure, that and most of the videos. i am quite sad that these days i just click on the video and look up the article he is reading or the video he is watching and just do that instead of watching him

  • @md5ize
    @md5ize Pƙed 15 dny +2

    Add to the advice including 'we don't write tests'. If you see this guy on a proposal, run.

  • @landmanland
    @landmanland Pƙed 16 dny +2

    The original purpose of UUID was to create RPC functions, not random numbers. This was before cryptography became important.

  • @cfhay
    @cfhay Pƙed 10 dny

    There's an important distinction to make here: MySQL was designed with a single server in mind, while cloud-native databases are designed for clusters. These are different enough that the best approach is not the same for both. MySQL likes incremental IDs. Certain cloud-native databases perform better when they keyspace is sharded.

  • @danielreed5199
    @danielreed5199 Pƙed 7 dny

    I don't use them, simply because collisions can occur. The thought of trying to find that non-existent bug upsets me too much.

  • @BradleyWeston92
    @BradleyWeston92 Pƙed 15 dny +1

    Say for example, the consumer of an API is the one generating an identifier, a UUID is required.
    So the table can have the auto incrementing ID still, and have a UUID that is unique index. The application layer would only reference the row using the uuid, but having multiple indexes are still smaller as using the AI ID not the UUID.

  • @yanickpoirier8094
    @yanickpoirier8094 Pƙed 14 dny +1

    The timestamp of the UUID is the start of the Gregorian calendar that replaced the Julian calendar on Oct 15, 1582 (not 1568, skipping 10 days), originally by the Papal State under the rule of Pope Gregory XII or XIII (hence the Gregorian Calendar). You went to sleep on Oct 5, 1582 and woke up on Oct 15, 1582. Most western countries are using it since the 18th century while Asian and eastern European countries have adopted it in the late 19th, early 20th century. Heck, Saudi Arabia started using sometime in 2015 or 2016.

  • @mikulcek
    @mikulcek Pƙed 14 dny +1

    Oh, PlanetScale, the folks that role out the carpet, just to pull it out as soon as you got both feet on it.
    Is Theo that badly paid that he is forced to accept any and every sponsorship? Should we fund a campain for him.