How Shopify’s engineering improved database writes by 50% with ULID

Sdílet
Vložit
  • čas přidán 14. 06. 2024
  • Fundamentals of Database Engineering udemy course (link redirects to udemy with coupon)
    database.husseinnasser.com
    Shopify posted a blog on tips to for scalable payment system, one tip peeked my interest related to switching from UUID to ULID. I explore the reasoning behind this in this video.
    shopify.engineering/building-...
    0:00 Intro
    1:30 idempotency
    6:30 UUID vs ULID
    9:50 Clustered Index
    13:30 Why UUID4 Inserts are slow
    17:15 How ULID helps Shopify
    22:00 Problem with tail pages
    25:00 Does ULID help in all cases?
    Fundamentals of Backend Engineering Design patterns udemy course (link redirects to udemy with coupon)
    backend.husseinnasser.com
    Fundamentals of Networking for Effective Backends udemy course (link redirects to udemy with coupon)
    network.husseinnasser.com
    Fundamentals of Database Engineering udemy course (link redirects to udemy with coupon)
    database.husseinnasser.com
    Follow me on Medium
    / membership
    Introduction to NGINX (link redirects to udemy with coupon)
    nginx.husseinnasser.com
    Python on the Backend (link redirects to udemy with coupon)
    python.husseinnasser.com
    Become a Member on CZcams
    / @hnasr
    Buy me a coffee if you liked this
    www.buymeacoffee.com/hnasr
    Arabic Software Engineering Channel
    / @husseinnasser
    🔥 Members Only Content
    • Members-only videos
    🏭 Backend Engineering Videos in Order
    backend.husseinnasser.com
    💾 Database Engineering Videos
    • Database Engineering
    🎙️Listen to the Backend Engineering Podcast
    husseinnasser.com/podcast
    Gears and tools used on the Channel (affiliates)
    🖼️ Slides and Thumbnail Design
    Canva
    partner.canva.com/c/2766475/6...
    Stay Awesome,
    Hussein
  • Věda a technologie

Komentáře • 103

  • @hnasr
    @hnasr  Před rokem +10

    Get my fundamentals of database engineering udemy course database.husseinnasser.com

  • @Winnetou17
    @Winnetou17 Před rokem +61

    Hussein: "A GET request by definition must be idempotent [...] Nothing chages on the backend, nothing should change"
    View counters: 😨😰

    • @torocatala
      @torocatala Před 11 měsíci +17

      Logs: 😨😰

    • @jerynmathew
      @jerynmathew Před 11 měsíci +3

      One could always use POST for fetching counters and logs. Like how elasticsearch exposes /_search API

    • @abdelrahmandwedar
      @abdelrahmandwedar Před 3 měsíci +1

      Well... Yeah that didn't change the response of GET request for most cases... 😕
      There's cases like YT watch counter but mostly it doesn't increase on your GET request itself.

  • @gwapolaub
    @gwapolaub Před rokem +37

    Pure database architectural gold. Thank you for this analysis.

  • @dhillaz
    @dhillaz Před 11 měsíci +12

    Having the timestamp embedded in the UID also means you can use the sorted index for deleting/archiving old rows when doing housekeeping, without the overhead of maintaining an additional index.

  • @pqnet84
    @pqnet84 Před rokem +14

    24:30 you can tradeoff cache memory for mutex contention by adding some random bits at the START of the id so that for a certain timestamp you have multiple pages

    • @johannsebastianbach3411
      @johannsebastianbach3411 Před rokem +2

      Could you elaborate on that a little further for a simpleton like me? 😂

    • @pqnet84
      @pqnet84 Před rokem +13

      @@johannsebastianbach3411 by having random bits at the start of the key you are distributing the objects on various parts of the btree. Example, if your first bit is random half of the keys will start with 0 then a timestamp, other half will start with 1 then a timestamp. Because of the timestamp all the 0s are going to be grouped together and all the 1s are going to be grouped together but they will be in different part of the btree. Like having the same street name, but different cities. So if you have a cache big enough for 2 pages you can keep both in memory and both will be fast (you are still limited to 2 pages), but you won't always access the same page (they are randomly load balanced to the two pages) so you reduce the contemption on the page mutex

    • @johannsebastianbach3411
      @johannsebastianbach3411 Před rokem +5

      @@pqnet84 omg, that makes perfect sense! Thanks. We use something similar at work for foldering strategy, like a/b/abab-abab-ababa-restOfUUID, where we get the first two folder names from the first two chars of the uuid, never would have thought about doing that for indexing though!

  • @justinsullard1519
    @justinsullard1519 Před 11 měsíci +5

    I've been doing something like this for almost a decade now. The first 12 hex digits a time stamp, the next 4 he'd digits used as either a shard ID or a sequence, and then the rest random. Useful for so many reasons.

    • @Zezandro
      @Zezandro Před 4 měsíci

      Hey @justinsullard1519, from somebody who has not worked a lot in DB, how secure is this approach? Have you had any security risk issue using this? I've been looking for info about this aspect, but haven't found anything at all.

  • @lfsr1288
    @lfsr1288 Před rokem +7

    It's a great approach. The only case when you ended up inserting "old time request" is with the retry logic retrying some requests since they have a timeout of 24hs or less. But again, this is a great approach to improve the insert

  • @guss77
    @guss77 Před 11 měsíci +6

    Correction about UUID randomness - please note that only UUID version 4 are random. There are other UUID versions, notably version 1 - which is time based and has lexicographic sorting, just like ULID, but it is standard and have a lot of supported implementations. The main problem with UUIDv1 is that the bit order is a bit weird and if you look at it as an opaque bit set, it looks quite random as the least significant bits of the time (time low in UUID spec) are in front. This was done purposefully.
    MysQL version 8 improved support for using UUIDv1 as primary keys, with good index locality, by adding specific functions to reorder the time fields in a UUID so that they are ordered from most significant to less significant - similar to cardinals, and that convert the UUID string representation to a compact binary storage that is almost as efficient as using an integer.

    • @gsilva877
      @gsilva877 Před 9 měsíci

      Thanks for the coment, is useful ❤

  • @javedutube10
    @javedutube10 Před rokem +2

    Just bought 'Fundamentals of Networking for Effective Backend Design' from UDEMY, even though I could learn it from different channels that also suggested by you : ), including yours, of course. That's your magic. Thanks. Love you.

  • @EngineerNick
    @EngineerNick Před rokem

    Thankyou for this :) I love the idea that data can be 'tucked in' to a db 🧸

  • @amitmahadik4653
    @amitmahadik4653 Před 4 měsíci +1

    For race condition, we can think of synchronization (allow one thread to write), but it will generally slow down insertion

    • @abdelrahmandwedar
      @abdelrahmandwedar Před 3 měsíci

      That's exactly what I thought about. How would it increase the speed of insert by 50 percent while having everything synchronized?

  • @kickeddroid
    @kickeddroid Před 11 měsíci

    Great video man keep up the great work!!!!

  • @kenbee85
    @kenbee85 Před rokem

    I just stumble upon your video as suggested by YT. Though I merely focusing on Frontend I really got curious about this.

  • @abhaygoswami7021
    @abhaygoswami7021 Před rokem +23

    One can also use UUIDv7 which also use timestamp and is sortable in nature and 62 bits randomness as compare to 80 bits in ULID

    • @DrBatroni
      @DrBatroni Před 11 měsíci +1

      I also wonder why?
      They come from UUID v4, so in both cases the new timespamt would be new.
      So i wonder whats the big diffrence between UUID v7 and ULID?

    • @Tordek
      @Tordek Před 11 měsíci +2

      ULIDs are (slightly) shorter, the encoding is trivial (and easier to split: you split the first 8 characters for the timestamp without fiddling with dashes or variable length fields). Standards only matter if you need to interact with someone else, and you shouldn't depend on other people parsing your IDs. There doesn't seem to be any downsides to ULID over UUID.

  • @nemtudom5074
    @nemtudom5074 Před 11 měsíci +5

    This is the kinda optimizing we used to see in the 90's when you *HAD* to optimize your code because you couldnt just buy more performance, not like this anyway.
    Not to mention that even if you did, you didnt get much. Things perform so well now that you can just let your code be unoptimized and people wont care.
    Im glad they are putting in the effort.

  • @svencowart
    @svencowart Před 6 měsíci

    Fantastic explanation

  • @nuculabs
    @nuculabs Před rokem

    Great content, thank you!

  • @Vikasptl07
    @Vikasptl07 Před rokem

    Great stuff !!

  • @abcdef-fo1tf
    @abcdef-fo1tf Před rokem +2

    I feel like in url shortener, the ULID would still help reads as you're more likely to read newer URLs which might be in the buffer pool

  • @Aleks-fp1kq
    @Aleks-fp1kq Před 6 měsíci +1

    I didn't understand the part about 24hrs indempotency. How is it related to ulid?

  • @Stig12
    @Stig12 Před 6 měsíci +1

    This is either the same or very similar to a Comb Guid. Not sure if you mentioned it, but you'll want to modify the byte order before storing in MS SQL Server to get the same sequential benefit.

  • @sanzharsuleimenov6380

    Certainly! Here's a simpler version of your text:
    Thanks!
    I found your video about Discord switching from Cassandra, and I really like how you share your thoughts.

  • @ankgaur
    @ankgaur Před rokem +10

    Can anyone name youtube channels like this
    with this kind of detailed backend videos

  • @davidvultur8704
    @davidvultur8704 Před rokem

    Any plans on doing a new twitter architecture/speed video with the introduced updates? (compared to the old "Twitter Backend is slow" vide)
    Thanks

  • @CTimmerman
    @CTimmerman Před rokem +2

    TL;DW: UUID tends to be too long and random. ULID is 10 byes shorter and starts with a sortable timestamp.

  • @DurgaShiva7574
    @DurgaShiva7574 Před rokem +1

    Amazing video, but, one doubt, who is actually generating the UUID or ULID's .. in both the scenarios.. client or Server ??

  • @nerdative
    @nerdative Před 11 měsíci +1

    Do they assume that clients have the correct time? What happens if the client have the wrong time? Then it would generate "old" or "incorrect" ULIDs. Unless they somehow synchronize the time between server and client, or it happens so rarely that it doesn't matter.

  • @megalodon4272
    @megalodon4272 Před rokem

    This is gold

  • @eddie_dane
    @eddie_dane Před měsícem

    Which book or resource would you recommend to learn in depth knowledge on database

  • @pollathajeeva23
    @pollathajeeva23 Před rokem +2

    One video on Search engine indepth please

  • @billywang3829
    @billywang3829 Před rokem

    I love this man

  • @ashishbtech
    @ashishbtech Před rokem

    That explains well why SQL-type DBs do not perform well with the long tail of data.

  • @ahmedelzubair6907
    @ahmedelzubair6907 Před 9 měsíci

    Why do you think that ULID is not gonna help in reads?
    At the end, it's sorted, right ? So it can be easily indexed, and if we indexed our ULIDs, it will help in reading a well because we are going to use the index in reading this data.

  • @abcdef-fo1tf
    @abcdef-fo1tf Před rokem

    I'm curious why we couldn't just use the auto-incr feature on the SQL DB if we wanted sequential primary keys? Is this because this would make it harder to make it idempotent? If we're not using it for that in the case with URL shortener, we can just use auto-incr right? Also are there cases where ULID is worse than UUID? You talked about downsides, but it seems like they were also the case for UUID

    • @Zuriki09
      @Zuriki09 Před rokem

      The drawbacks of sequential IDs are discussed early on in the video.
      ULID can be worse in the following cases:
      Where you specifically don't want to expose the relative position of the ID - the timestamp gives some indication of when the ULID was created in relation to other ULIDs, this may be important for cryptographic reasons.
      When you are interfacing with systems that expect UUIDs. ULIDs can be converted to UUIDs but that may be more burdensome than simply using UUIDs to begin with.
      UUID spec is considered more stable and future proof, this may be important in large enterprise environments where change is extremely difficult to implement - you are unlikely to encounter this problem, but that is "a" criticism of ULIDs.

  • @skreibb
    @skreibb Před 5 měsíci

    UUID v4 could be beneficial for databases with partitions under the hood (dynamo db). ULID could make writes to the last partition really expensive

  • @1Eagler
    @1Eagler Před 11 měsíci

    So either threads will go sequentially either in memory or in disk

  • @tesla1772
    @tesla1772 Před rokem +40

    So can we say that mongodb ids are more efficient compared to ulids as it is 96 bits and are also sequential

    • @tesla1772
      @tesla1772 Před rokem +3

      @@JamesSmith-cm7sg we can genrate same ids locally

    • @btom1990
      @btom1990 Před rokem

      @@JamesSmith-cm7sg but MongoDB is webscale czcams.com/video/b2F-DItXtZs/video.html

    • @atmrar
      @atmrar Před rokem +9

      @@tesla1772 local generation of ulid is error prone. If system clocks are not synchronised the local id can go back in the time .

    • @tesla1772
      @tesla1772 Před rokem +1

      @@atmrar yeah. But i think machine id and process ids are aslo included in it right?

    • @MrOlivm
      @MrOlivm Před rokem +1

      Sequential importantly implies that mongo’s “web scale” ain’t so web scale. I don’t know why you would use mongo when you could use /dev/null

  • @fedemtz6
    @fedemtz6 Před rokem

    if the first 48 bits are time why is the random part so long? the chances of two identical random parts being generated at the exact same time are very low

  • @urrahman196
    @urrahman196 Před rokem +1

    Can you please make a video on refresh token? Why do we need this along with JWT? Pros n cons.

  • @JohnZakaria
    @JohnZakaria Před rokem

    That's my first time hearing about ULID

  • @oah8465
    @oah8465 Před rokem +2

    fantastic video but why are we going with GUIDs, UUID, ULID, if we have our beloved auto-increment primary key in MySQL? Let me put it this way, in which scenario is the primary key (auto-increment) by MySQL is no good and we have to go either the UUID or the ULID way?

  • @hirisraharjo
    @hirisraharjo Před rokem

    What about CUID ? is it better than ULID and UUID?

  • @akovbovich
    @akovbovich Před rokem

    Why not use UUIDv4 with hash indexes for fast lookups and a separate timestamp column for range selection?

    • @hck1bloodday
      @hck1bloodday Před rokem

      In my little knowledge of databases I feel that will cause slower inserts and they want to maximize insert speed

    • @-Jason-L
      @-Jason-L Před rokem

      If you're selecting a range, it better be indexed on that tange

  • @mayyar123
    @mayyar123 Před rokem

    Would the same problem of inefficient indexing occur if we use Non-clustered indexing w/ UUID ?

    • @hnasr
      @hnasr  Před rokem

      Still exists but not as bad as clustered indexes. Clustered indexes leaf pages are rich with data which means UUID inserts are more likely to cause page splits and encores more random IO and dirty pages flushing.
      non clustered uuids still has the same behavior but it will just take more rows to get there.

  • @nathansgreen
    @nathansgreen Před 11 měsíci +1

    Use UUIDv7, as defined in the drafted revisions to RFC 4122. Yes, it's slightly different, but it enables compatibility with existing UUID code, which already exists in a lot of places.

  • @Milchmann2
    @Milchmann2 Před rokem +17

    Using a "timestamp" as key will then also redirect all writes to the same node of the distributed database and thus creating a hotstpot. I wish the article and you talked a bit more why this is not an issue for them.

    • @fishoil5310
      @fishoil5310 Před rokem +6

      The database is probably sharded using user id. The write is still effectively distributed across different nodes.

    • @colt4by5
      @colt4by5 Před rokem +6

      Edit: I forgot about the 80 bits of random data! I think partitioners will hash the partition key, so any randomness will cause the data to be spread across nodes and avoid hotspots.
      I was also going to make this point... good insight! Using a timestamp as the partition key in a distributed database can indeed create a hotspot and degrade performance. I think it's not an issue for them because they are using MySQL, which is not a true distributed database, so that caveat doesn't apply. I guess the potential issue for them is they're limited to the write performance of a SQL database, which can't scale up as high as a distributed database can.

    • @miladin4023
      @miladin4023 Před rokem

      Can someone elaborate more on this? Thanks :)

    • @colt4by5
      @colt4by5 Před rokem +6

      @@miladin4023 I think the main point is that distributed databases use the partition key to "route" data to a physical node, so if you use just a timestamp, then all data will be routed to the same partition (and therefore node) for each time slice, which can create a hotspot and remove much of the value of using a distributed database. A 48-bit timestamp should allow for millisecond granularity, so maybe it's not a huge issue, but probably better to avoid it. The 80 bits of random data in the ULID should prevent the issue altogether though. You can find a quick explanation if you search for "DynamoDB designing partition keys to distribute your workload". On the relational database side, Hussein already did a much better job of explaining how data is routed than I could :).

    • @junaid1555
      @junaid1555 Před rokem +2

      @@colt4by5 if the partition key is hashed to avoid hotspots, won't the range queries be very expensive?

  • @figloalds
    @figloalds Před 11 měsíci +1

    The RID mechanism I use on my program is similar to that, but I use 32 byte array containing a machine specific signature, a timestamp, a random number and a in-memory sequential number, it is designed to absolutely never conflict

  • @mosespeter9711
    @mosespeter9711 Před rokem +6

    I have been using the sorted UUID strategy since MYSQL 8.0 with UUID_TO_BIN("uuid",1) and this function further convert it to BINARY(16) instead of storing it as string... this is super fast!

    • @DavisTibbz
      @DavisTibbz Před 11 měsíci

      Yes! More people should upgrade to mysql 8.0

  • @raunaquepatra3966
    @raunaquepatra3966 Před 11 měsíci

    What about BTree getting skewed

  • @harshsahu7825
    @harshsahu7825 Před rokem

    just joined you DB course on udemy :)

  • @johnsailor3590
    @johnsailor3590 Před rokem +2

    Yay

  • @engineeranonymous
    @engineeranonymous Před rokem +10

    Never use GUID's for anything if your database have the ability to use b-tree. Eventually some day you have to use the GUID field for something and you will regret it. Use integer keys and hashid library in order to use it with controllers.

  • @esra_erimez
    @esra_erimez Před rokem

    The more things change, the more things stay the same. IBM ISAM/VSAM

  • @guerra_dos_bichos
    @guerra_dos_bichos Před 11 měsíci

    Thats not a type universally, university universe should be proceeded by A not AN

  • @awnion
    @awnion Před rokem

    ULID is not something new btw

  • @cvoges12
    @cvoges12 Před 11 měsíci

    Be this guy and spend 30 minutes on talking about how sorted data makes things predictable.

  • @alikoc4713
    @alikoc4713 Před 10 měsíci

    Nothing new at all.

  • @cvoges12
    @cvoges12 Před 11 měsíci

    Stop playing with the mic. It's both highly distracting and messing with the audio by getting louder and quieter at near random. And aim the mic at your chest/jaw, not your forehead.