The TRUTH About High Performance Data Partitioning

  ‱ Published 5 Jul 2024
  • Welcome back to our comprehensive series on Apache Spark performance optimization techniques! In today's episode, we dive deep into the world of partitioning in Spark - a crucial concept for anyone looking to master Apache Spark for big data processing.
    đŸ”„ What's Inside:
    1. Partitioning Basics in Spark: Understand the fundamental principles of partitioning in Apache Spark and why it's essential for performance tuning.
    2. Coding Partitioning in Spark: Step-by-step guide on implementing partitioning in your Spark applications using Python. Perfect for both beginners and experienced developers.
    3. How Partitioning Enhances Performance: Discover how strategic partitioning leads to faster and easier access to data, improving overall application performance.
    4. Smart Resource Allocation: Learn how partitioning in Spark allocates resources for optimised execution.
    5. Choosing the Right Partition Key: A comprehensive guide to selecting the most effective partition key for your Spark application.
    🌟 Whether you're preparing for Spark interview questions, starting your journey with our Apache Spark beginner tutorial, or looking to enhance your skills in Apache Spark, this video is for you.
    📚 Keep Learning:
    📄 Complete Code on GitHub: github.com/afaqueahmad7117/sp...
    đŸŽ„ Full Spark Performance Tuning Playlist: ‱ Apache Spark Performan...
    🔗 LinkedIn: / afaque-ahmad-5a5847129
    Chapters:
    00:00 Introduction
    02:22 Code for understanding partitioning
    05:44 Problems that partitioning solves
    09:48 Factors to consider when choosing a partition column
    13:36 Code to show single/multi level partitioning
    18:19 Understanding spark.sql.files.maxPartitionBytes
    22:09 Thank you
    #ApacheSparkTutorial #SparkPerformanceTuning #ApacheSparkPython #LearnApacheSpark #SparkInterviewQuestions #ApacheSparkCourse #PerformanceTuningInPySpark #ApacheSparkPerformanceOptimization

Comments ‱ 29

  • @user-dx9qw3cl8w
    @user-dx9qw3cl8w 7 months ago +2

    Super, super detailed. Thanks for uploading. I wasn't able to understand this before, but now it's clear. Thanks a lot. Please make this kind of in-depth topic video whenever you have the time. (You may not get the views and money that entertainment videos do, but you're helping people grow in this field, and surely many people are benefiting from your content. Please keep making videos like this.)

  • @sayedsamimahamed5324
    @sayedsamimahamed5324 3 months ago

    The best explanation on YouTube so far. Thank you very much.

  • @anandchandrashekhar2933
    @anandchandrashekhar2933 29 days ago +1

    Thank you so much again! I have one follow-up question about partitioning during writes. If I use df.write but specify no partitioning column and don't call repartition, how many partitions does Spark write to by default?
    Does it simply take the number of input partitions (total input size / 128 MB), or, if shuffling was involved and the default of 200 shuffle partitions was used, does it use that shuffled partition count?
    Thank you

    • @afaqueahmad7117
      @afaqueahmad7117  19 days ago

      Hey @anandchandrashekhar2933, so basically it should fall into 2 categories:
      1. If shuffling is performed: Spark will use the value of `spark.sql.shuffle.partitions` (defaults to 200) for the number of partitions during the write operation.
      2. If shuffling is not performed: Spark will use the current number of partitions in the DataFrame, which could be based on the input data's size or previous operations.
      Hope this clarifies :)
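
      A minimal PySpark sketch of the two cases above (the paths and the artist_id column are hypothetical, and with adaptive query execution enabled Spark 3.x may coalesce the shuffle partitions further):

      ```python
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("write-partitions-demo").getOrCreate()
      df = spark.read.parquet("/data/songs")  # hypothetical input path

      # Case 2: no shuffle before the write -> the output has as many files as the
      # DataFrame currently has partitions (roughly total input size / 128 MB at read time).
      df.write.mode("overwrite").parquet("/out/no_shuffle")

      # Case 1: a wide transformation (groupBy) shuffles the data first -> the write
      # produces spark.sql.shuffle.partitions (default 200) output partitions.
      df.groupBy("artist_id").count() \
          .write.mode("overwrite").parquet("/out/after_shuffle")
      ```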

  • @Fullon2
    @Fullon2 7 months ago +1

    Thanks for sharing your knowledge, your videos are amazing.

  • @user-fm2cb2yt5c
    @user-fm2cb2yt5c 3 months ago

    Thanks for the detailed video. I have a few questions on partitioning. 1. How does Spark decide the number of partitions if we don't specify the properties, and is it good practice to do a repartition (to, say, 400) after the read? 2. How do we decide the value to pass to repartition before writing to disk? If we pass a large number to repartition, will that be optimal?

  • @EverythingDudes
    @EverythingDudes 7 months ago +1

    Superb knowledge

  • @vamsikrishnabhadragiri402
    @vamsikrishnabhadragiri402 3 months ago +1

    Thanks for the informative videos. I have a question regarding repartition(4).partitionBy(key):
    does it mean each of the 4 part files in a partition will be read as a separate partition?
    Or does it consider the maxPartitionBytes setting and, depending on the sizes, create a partition by combining two or more part files if their combined size is within the maxPartitionBytes limit?

    • @afaqueahmad7117
      @afaqueahmad7117  3 months ago +1

      Hey @vamsikrishnabhadragiri402, `spark.sql.files.maxPartitionBytes` will be taken into consideration when reading the files.
      If each of the 4 part files is smaller than `spark.sql.files.maxPartitionBytes` e.g. each part is 64 MB and `spark.sql.files.maxPartitionBytes` is defined to be 128MB, then 4 files (partitions) will be read separately. Spark does not go into the overhead of merging files to bring it to 128MB.
      Consider another example where each part is greater than `spark.sql.files.maxPartitionBytes` (as discussed in the video), each of those parts will be broken down into sizes defined by `spark.sql.files.maxPartitionBytes` :)
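
      A minimal sketch of the read-side behaviour described above (the path and file sizes are hypothetical, and the exact partition count can also depend on file bin-packing and spark.default.parallelism):

      ```python
      # Read-side split size: files larger than this are broken into ~128 MB chunks.
      spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))

      # Suppose the folder holds 4 part files of ~512 MB each: every file exceeds the
      # limit, so each is split into ~4 chunks, giving roughly 16 read partitions.
      df = spark.read.parquet("/data/songs_large_parts")
      print(df.rdd.getNumPartitions())

      # If the same folder held 4 part files of ~64 MB each (all below the limit),
      # no file would be split and the partition count would stay small.
      ```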

  • @iamexplorer6052
    @iamexplorer6052 7 months ago +1

    Explained in a very detailed and understandable way. Great!

  • @danieldigital9333
    @danieldigital9333 2 months ago

    Hello, thanks for this video and for the whole course. I have a question about high-cardinality columns: say you have table A and table B, both with customer_id. If you want to perform a join on this column, how do you alleviate the performance issues that occur?

  • @utsavchanda4190
    @utsavchanda4190 6 months ago +1

    Good video. In fact, all of your videos are. One thing: in this video you were mostly talking about actual physical partitions on disk, but towards the end, when you were talking about "maxpartitionbytes" and doing only a READ operation, you were talking about shuffle partitions, which are in memory and not disk partitions. I had found that hard to grasp for a very long time, so I wanted to confirm whether my understanding is right here.

    • @afaqueahmad7117
      @afaqueahmad7117  6 months ago

      Hey @utsavchanda4190, many thanks for the appreciation. To clarify, when talking about "maxpartitionbytes", I'm referring to the partitions that Spark reads from files into memory. These are not shuffle partitions; shuffling only comes into the picture with wide transformations (e.g. groupBy, joins). Therefore, "maxpartitionbytes" dictates how many partitions Spark reads from the files into DataFrames in memory.
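
      A minimal sketch contrasting the two settings discussed in this thread (the path and the artist_id column are hypothetical):

      ```python
      spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))  # read-side split size
      spark.conf.set("spark.sql.shuffle.partitions", "200")  # only applies after wide transformations

      df = spark.read.parquet("/data/songs")
      print(df.rdd.getNumPartitions())    # driven by the input file sizes and maxPartitionBytes

      agg = df.groupBy("artist_id").count()   # wide transformation -> shuffle
      print(agg.rdd.getNumPartitions())   # 200 here, from spark.sql.shuffle.partitions (AQE may coalesce it)
      ```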

    • @utsavchanda4190
      @utsavchanda4190 6 months ago +1

      @@afaqueahmad7117 that's right. And that is still in memory and not physical partitions, right? I think this video covers both, physical disk partitions as well as in-memory partitions.

    • @afaqueahmad7117
      @afaqueahmad7117  6 months ago

      Yes, those are partitions in memory :)

  • @lunatyck05
    @lunatyck05 6 months ago +1

    Great video as always. When can we get a video on setting up an IDE like yours? Really nice UI; Visual Studio, I believe?

    • @afaqueahmad7117
      @afaqueahmad7117  6 months ago

      Thanks for the appreciation. Yep, it's VS Code. It's quite simple, not a lot of stuff on top except the terminal. I can share the Medium article I referred to for setting it up :)

  • @ComedyXRoad
    @ComedyXRoad 2 months ago

    thank you

  • @kvin007
    @kvin007 6 months ago +1

    Thanks for the content, Afaque. Question regarding spark.sql.files.maxPartitionBytes: I was thinking this would be beneficial when reading a file whose size you know upfront. What about files whose size you don't know? Do you recommend repartition or coalesce in those cases to adjust the number of partitions for the DataFrame?

    • @afaqueahmad7117
      @afaqueahmad7117  6 months ago +1

      Hey @kvin007, you could use the technique for determining the size of a DataFrame explained here czcams.com/video/1kWl6d1yeKA/video.html at 23:30. The link used in the video is umbertogriffo.gitbook.io/apache-spark-best-practices-and-tuning/parallelism/sparksqlshufflepartitions_draft
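
      Not the exact technique from the linked video, but a minimal sketch of one common alternative when the input size is unknown: read first, inspect the partition count Spark inferred, and then adjust it (the path and the target of 200 are hypothetical):

      ```python
      df = spark.read.parquet("/data/unknown_size")   # hypothetical path

      current = df.rdd.getNumPartitions()
      target = 200                                    # e.g. a small multiple of the cluster's total cores

      if current > target:
          df = df.coalesce(target)      # narrow operation, avoids a full shuffle
      elif current < target:
          df = df.repartition(target)   # full shuffle, but evens out partition sizes
      ```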

    • @kvin007
      @kvin007 6 months ago +1

      @@afaqueahmad7117 awesome, thanks for the response!

  • @Amarjeet-fb3lk
    @Amarjeet-fb3lk 1 month ago

    At 16:39, when you use repartition(3), why are there 6 files?

    • @afaqueahmad7117
      @afaqueahmad7117  1 month ago

      Hey @Amarjeet-fb3lk, good question, I should have pulled the editor sidebar to the right for clarity. It's actually 3 data files; the remaining 3 are `.crc` files, which Spark creates for data integrity, to make sure the written files are not corrupted.
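
      A minimal sketch to verify this locally (the output path is hypothetical): list the written files and separate the data files from the `.crc` checksum files written alongside them.

      ```python
      import os

      out_dir = "data/output/songs_repartitioned"   # hypothetical output directory
      for name in sorted(os.listdir(out_dir)):
          kind = "checksum" if name.endswith(".crc") else "data"
          print(f"{kind:9s} {name}")
      ```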

  • @retenim28
    @retenim28 7 months ago

    I am a little bit confused: at 15:17, in a folder for a specific value of listen_date, you say that there is only 1 file, which corresponds to 1 partition. But I thought that partitions are created depending on the values of listen_date, so as far as I can see, I would say there are more than 30 partitions (each one corresponding to a specific value of listen_date). After that you used the repartition function to change the number of partitions inside each folder. So the question is: is the number of partitions the number of listen_date folders or the number of files inside each folder?

    • @afaqueahmad7117
      @afaqueahmad7117  7 months ago

      Hey @retenim28, each listen_date folder is a partition. So you're right in saying that each partition corresponds to a specific value of listen_date. Each unique value of listen_date results in a separate folder (a.k.a. partition). The parquet files (those part-000... files) inside a partition (folder) are just the physical storage of the data rows belonging to that partition.
      Therefore, to answer your question:
      number of partitions = number of listen_date folders.
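
      A minimal sketch of the layout being described (the output path is hypothetical):

      ```python
      df.write \
          .mode("overwrite") \
          .partitionBy("listen_date") \
          .parquet("data/songs_partitioned")

      # Resulting layout (illustrative):
      # data/songs_partitioned/
      #   listen_date=2023-01-01/part-00000-....snappy.parquet
      #   listen_date=2023-01-02/part-00000-....snappy.parquet
      #   ...
      # Each listen_date=... folder is one partition; the part files inside it are
      # just the physical files holding that partition's rows.
      ```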

    • @retenim28
      @retenim28 7 months ago

      @@afaqueahmad7117 Oh, thank you sir, I just got the point. But I have another question: since Spark cares about the number of partitions, what is the advantage of creating more files for each partition? The number of partitions remains the same, so the parallelism is the same whether a partition contains 10 files or 3.

    • @afaqueahmad7117
      @afaqueahmad7117  6 months ago +1

      Good question @retenim28. The level of parallelism during data processing (e.g. the number of tasks launched, 1 task = 1 partition) is determined by the number of partitions. However, the number of parquet files inside each partition plays a role in read/write I/O parallelism. When reading data from storage, Spark reads each of the parquet files in parallel even if they're part of the same partition, so it can assign more resources and load the data faster. The same holds for writes. Just be cautious not to end up with too many parquet files (the small-file problem) or a few large ones (leading to data skew).

    • @retenim28
      @retenim28 6 months ago

      @@afaqueahmad7117 Thank you very much sir. I also watched the series about data skew, very clear explanation.