RAND is too slow (in MySQL)

Sdílet
Vložit
  • čas přidán 24. 08. 2024

Komentáře • 45

  • @scottp131
    @scottp131 Před 8 měsíci +10

    hey man your videos kick ass and i cannot thank you enough for your approach with these. your videos can be watched once and understood... every single one of them... i don't know how you do it, but the way you have picked to teach anything you teach is incredible. you freaking rock! thank you!

    • @PlanetScale
      @PlanetScale  Před 8 měsíci +2

      Gah that really encourages me. Thank you

  • @Wundero
    @Wundero Před 8 měsíci +5

    One other option for the "hash" variant is to convert the hash into a numeric column (take the bottom half or quarter of md5 (a 128 bit hash), for example) and then modulo that by some arbitrary number. The modulo would help produce groups that you could then filter by and randomize the order of. Ideally the number is determined by your data size, since a small number gives larger groups, so if you have like 1000 rows, % 3 or % 7 might work well, while on millions of rows, something % 17737 might work better. The concept is similar, in theory, to a hash table, where you would take the hash % the size of the table to figure out what group to put them in. This also lets you "seed" some of the randomness, especially if you do the remaining randomness in your application, which can let you have randomized pages of data, and you can just change the modulo if you want to change the "seed".

  •  Před 8 měsíci +20

    Another way: Create a generated decimal column, with default value is rand() function. Add index to them and use-it.

    • @PlanetScale
      @PlanetScale  Před 8 měsíci +3

      Good idea! That would lead to rows being really "clumpy" though. Rows with the same or similar random numbers would be returned together. Also a generated column can't have a RAND default unfortunately

    • @IsaacWalkercox
      @IsaacWalkercox Před 8 měsíci +1

      @@PlanetScale why can't Rand() be the default for a generated column? Too expensive computationally?

    • @devhammed
      @devhammed Před 8 měsíci +4

      @@IsaacWalkercox Because the function is not deterministic, A function is deterministic if, given the same data in tables, multiple invocations produce the same result, independently of the connected user. This is generally in place for features like master-slave database replication.

    • @IsaacWalkercox
      @IsaacWalkercox Před 8 měsíci +2

      @@devhammed makes sense. The index would need the same information per user. Seems like something they could get round with some sort of wider context though

    • @RandomGeometryDashStuff
      @RandomGeometryDashStuff Před 8 měsíci +1

      does this make result always same?

  • @olivierm.594
    @olivierm.594 Před 8 měsíci +1

    On your first example, you may have collisions (same id appears twice or more), i think a simple "distinct" could avoid that (if the list of ids you're parsing is small enough)

  • @samho321
    @samho321 Před 8 měsíci +3

    in my case, primary key is uuid, i can randomly pick 4 char inside 32 length from primary key and sort it and limit it, also i able to make those 4 random positions in to a random seed for pagination

  • @gfni
    @gfni Před 8 měsíci +3

    Great video as always! 👏

  • @___dyego
    @___dyego Před 6 měsíci

    You're an amazing teacher!

  • @jacksonzamorano
    @jacksonzamorano Před 8 měsíci +1

    Wow! I never knew about covering indexes. I don't need to do a random pull like in the video but I could definitely put those indexes to use. Great video, keep it up!

    • @PlanetScale
      @PlanetScale  Před 8 měsíci

      Check out the video right before this I've, it's all about covering indexes! The one with the timer in the thumbnail

  • @IvanMontilla
    @IvanMontilla Před 8 měsíci

    I'm pretty sure someone will implement a random password generator this way, by storing all possible passwords in a table and then letting MySQL do its magic.

  • @vasiovasio
    @vasiovasio Před 8 měsíci

    Super Useful Video! Thank you, Aaron!

  • @cafehodge
    @cafehodge Před 8 měsíci

    for gaps in auto incremented ids or uuids you could create a row_number with a window function to use the rand() function against and guarantee results will be returned. but idk how computationally expensive that’d be.

  • @tamasnogradi4524
    @tamasnogradi4524 Před 8 měsíci

    This was really interesting mate! Love your videos, you filled a huge gap in related db topic for a lot of people.

  • @sungvadan
    @sungvadan Před 8 měsíci

    Thank you so much for the video. I have an application that I use to learn vocabulary. It sends me ten random words by email every day. I'm using the random() function, which is fine because the vocabulary table is small. However, I've always wondered if there are better ways to do this. Thanks to your video, I now know some alternative methods.

  • @DuniC0
    @DuniC0 Před 8 měsíci

    I love videos that get me into start thinking.
    What if you strip down the size of the data several orders of magnitude with something like WHERE RAND() < 0.001
    And then just ORDER BY RAND() only for the subset?

  • @AlexCernat
    @AlexCernat Před 8 měsíci

    yeah, deferred joins and covering indexes rules! great idea using hashes for "randomness"!

  • @ahmad-murery
    @ahmad-murery Před 8 měsíci

    I liked the bucket method,
    Thanks Aaron!

  • @roganl
    @roganl Před 8 měsíci

    Excellent Video. Picking up where you left off, using the hash column index to expedite the lookup, how would generating rand (n) hex values improve randomness (16^4 space) over just using two hex buckets ? Would it make sense to generate multiple two digit hash columns and use them both?

  • @hannielgutierrez9139
    @hannielgutierrez9139 Před 8 měsíci

    If i try to make a virtual field an index, then Maria DB tells me this:
    1901 - Function or expression 'concat(`fecha`,' ',`hora`)' cannot be used in the GENERATED ALWAYS AS clause of `fecha_hora`
    Does it works just in MySQL?

  • @Kane0123
    @Kane0123 Před 8 měsíci

    If its fast enough, its fast enough... relevant in so many places...

  • @kriansa
    @kriansa Před 8 měsíci +3

    Would that inner join at 13:14 be effectively replaceable by a where exists keeping the same performance?

    • @mme725
      @mme725 Před 8 měsíci

      I'm wondering this as well.

    • @balduin_b4334
      @balduin_b4334 Před 8 měsíci

      If I had to make a guess, I would say that it is less performant, the inner join is a subquery, so it is executed first, and then it is getting joined on the primary key of the bookmarks table.
      "where exists " I would think is getting looked up (or even executed) for each row. (not an expert btw)

    • @OliverONeill
      @OliverONeill Před 8 měsíci

      He did explain this around 6:00. RAND() is a volatile function, so it being in the WHERE clause calls RAND() over and over again for each row.

  • @tylerslater
    @tylerslater Před 6 měsíci +1

    What client GUI are you using?

  • @mamad-dev
    @mamad-dev Před 8 měsíci

    i have a question, what is the application that he's using ?

  • @daniel_petrica
    @daniel_petrica Před 8 měsíci

    I got a similar problem with my plantescale free db where my bad code hanered the rows read the first month with the order by rand().
    I ended up doing a less random solution using cache and code.
    In php I generate a random int inside the count of table rows (so i dont care about id gaps), and i get n rows from the db with that int as an offset. Then I randomly order them in php and return an array. As I increase the amount of rows extracted, I'm caching that query in Laravel for a x amount of time and on each request I pick from the cache as with a big enough amount of rows the limited randomnes is good enough in my application.
    To reduce the time in the wire, I only select the columns I need for the use case, and I have an index on the columns needed.
    The cache can be generated by a worker to avoid having a slow request every x second to the web user.

  • @Voltra_
    @Voltra_ Před 8 měsíci

    Reservoir Sampling. 'nough said

    • @PlanetScale
      @PlanetScale  Před 8 měsíci +1

      Perhaps not quite enough, I've never heard of Reservoir Sampling! Will have to look that up.

    • @Voltra_
      @Voltra_ Před 8 měsíci

      @@PlanetScale here's the cppcon talk I've learned about it from: czcams.com/video/YA-nB2wjVcI/video.htmlsi=H40wG1Iha9yOygHZ (I resisted the urge to make it a Rick roll)

  • @youtubehotdogdogyoutube
    @youtubehotdogdogyoutube Před 8 měsíci

    Yeah, maybe in your SQL

  • @RayHorn5128088056
    @RayHorn5128088056 Před 8 měsíci

    Redis the rows and choose randomly from cache. Problem solved.

    • @PlanetScale
      @PlanetScale  Před 8 měsíci

      Redis the rows? I'm not sure what that means

    • @RayHorn5128088056
      @RayHorn5128088056 Před 8 měsíci

      @@PlanetScale Cache = Performance. Kudos on doing all that work to get your database to cough up the data but caching is always faster.

    • @PlanetScale
      @PlanetScale  Před 8 měsíci

      I understand that 🤗 But are you saying store all the rows in Redis? "Redis the rows" is unclear to me

    • @RayHorn5128088056
      @RayHorn5128088056 Před 8 měsíci

      @@PlanetScale I am always saying cache just enough data to speed up the process. This should always be the goal. Trust me, there is an optimal solution involving a cache.