Dynamic Partition Pruning: How It Works (And When It Doesn’t)
- Uploaded: 26 July 2024
- Dive deep into Dynamic Partition Pruning (DPP) in Apache Spark with this comprehensive tutorial. If you've already explored my previous video on partitioning, you're perfectly set up for this one. In this video, I explain the concept of static partition pruning and then transition into the more advanced and efficient technique of dynamic partition pruning.
You'll learn through practical examples, starting with a listening activity dataset partitioned by date, and then move to a complex scenario involving a join operation between listening activity and songs datasets. The video meticulously explains how DPP optimizes query performance by reducing unnecessary data scans, and the conditions necessary for its effective implementation. I also highlight the differences between static and dynamic partition pruning and the importance of having partitioned data for DPP to work effectively.
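The core idea described above (skipping whole date partitions that a filter rules out) can be sketched outside Spark. Below is a minimal pure-Python toy model of static partition pruning; the dataset, dates, and function name are made up for illustration and are not Spark code:

```python
# Toy model of static partition pruning: a "table" stored as one
# partition per date (illustrative data, not the video's dataset).
listening_activity = {
    "2024-07-01": [{"song_id": 1}, {"song_id": 2}],
    "2024-07-02": [{"song_id": 3}],
    "2024-07-03": [{"song_id": 1}, {"song_id": 4}],
}

def scan_with_static_pruning(partitions, wanted_dates):
    """Static pruning: the filter value is a literal known at plan time,
    so whole partition directories are skipped before any rows are read."""
    rows = []
    for date, part in partitions.items():
        if date not in wanted_dates:  # partition skipped, never scanned
            continue
        rows.extend(part)
    return rows

# Only the 2024-07-02 partition is ever touched.
rows = scan_with_static_pruning(listening_activity, {"2024-07-02"})
```

Dynamic partition pruning extends this: the set of wanted dates is not a literal in the query but is computed at runtime from another (smaller) table in the join.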
Whether you're a data engineering enthusiast or a professional working with Spark, this video will enhance your understanding of optimizing Spark queries using Dynamic Partition Pruning. Don't forget to like, share, and subscribe for more insightful content on Apache Spark and big data analytics!
📄 Complete Code on GitHub: github.com/afaqueahmad7117/sp...
🎥 Full Spark Performance Tuning Playlist: • Apache Spark Performan...
🔗 LinkedIn: / afaque-ahmad-5a5847129
Chapters
00:00 Introduction
00:23 What is static pruning?
02:47 Dynamic partition pruning
12:07 Caveats when using dynamic partition pruning
14:29 Code to understand dynamic partition pruning
20:28 Thank you
#spark #dataengineering #apachespark #partition #partitioning #dynamicpartitionpruning #staticpruning #pruning #sparkperformancetuning #sparkoptimization #bigdataanalytics #sparktutorial #dataoptimization #sparkinterviewquestions
You deserve more subscribers! Thanks for explaining the concepts.
Those words mean a lot, thank you @gopinathdhanasekar328! If you wouldn't mind, I have a request: kindly share with your friends and colleagues. I would greatly appreciate your help in spreading the word.
Thanks for the great video; you make these concepts so simple.
Thank you for sharing; I learned something new from you.
Thanks for another in-depth video. Yes, we'd like one on how Spark uses its executor memory and on what basis it splits data across multiple executors.
Resource-level optimisation videos are coming in the next few weeks, stay tuned! :)
Loving your videos, bro!
All the videos are great and nicely explained, but the video clarity is poor even at 4K.
Thanks, @sathyamoorthy2362, for the kind words. On the video quality: I was trying out a new tool and it didn't work out, but I hope the other ones look good and you like them :)
Thanks Afaque. Terminology-wise, is this the same as the filter pushdown you explained in the query plan video?
Hey @anandchandrashekhar2933, appreciate it :)
On the question: DPP is different from "filter pushdown", although it uses filter pushdown under the hood to prune the large dataset based on filters derived from the smaller dataset. It's effective when you have a large dataset and a small dataset (small enough to be broadcast) and want to use the small one to filter records out of the large one at scan time.
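To make that distinction concrete, here is a minimal pure-Python sketch of the DPP mechanism (toy names and data, not Spark internals): the small `songs` table is filtered first, and the dates that survive become a runtime filter pushed into the scan of the partitioned fact table.

```python
# Toy sketch of dynamic partition pruning (illustrative data only).
# Fact table: listening activity, partitioned by date.
activity_partitions = {
    "2024-07-01": [{"song_id": 1}],
    "2024-07-02": [{"song_id": 2}],
    "2024-07-03": [{"song_id": 3}],
}

# Small dimension table: songs, one release_date per song.
songs = [
    {"song_id": 1, "genre": "rock", "release_date": "2024-07-01"},
    {"song_id": 2, "genre": "pop",  "release_date": "2024-07-02"},
    {"song_id": 3, "genre": "rock", "release_date": "2024-07-03"},
]

# Step 1: filter the small table (any predicate works, e.g. on genre).
filtered_songs = [s for s in songs if s["genre"] == "rock"]

# Step 2: the surviving partition keys become a runtime filter; in real
# Spark this travels with the broadcast of the small table.
runtime_filter = {s["release_date"] for s in filtered_songs}

# Step 3: pushdown at scan time — only allowed partitions are read.
scanned = {date: rows for date, rows in activity_partitions.items()
           if date in runtime_filter}
```

The pushdown in step 3 is ordinary filter pushdown; what makes it *dynamic* is that the filter in step 2 only exists at runtime, after the small table has been filtered.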
Hello, I think one correction: even if the dimension table (songs) doesn't have a filter condition on release date, DPP would still work, right? It will forward whichever release dates survive the filter, regardless of which column the filter is on. For example, even if we apply a filter on songID in the songs table, whatever release dates appear in the surviving records will be forwarded.
Can you make a video on how to decide the driver/executor memory size and the number of executors based on file size (e.g., 100 GB) in Spark?
Resource-level optimisation videos are coming in the next few weeks, stay tuned! :)
What if both datasets are too big? Does the broadcast exchange still happen in that case?
Hey @rohitshingare5352, good question. DPP generally works best when one table is large and the other is small enough to be broadcast. The most significant reason is that if both tables are large, the filters being moved will also be large (in the worst case), and propagating those filters over the network becomes the biggest bottleneck.
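For reference, the broadcast condition above is governed by a few Spark SQL settings. A PySpark config sketch (assumes a running Spark 3.x session named `spark`; defaults may differ by version, so check the docs for yours):

```python
# DPP is enabled by default from Spark 3.0 onwards.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Tables below this size (in bytes) are eligible for broadcast joins;
# raising it makes more dimension tables broadcastable, at the cost of
# driver/executor memory.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))

# By default Spark applies DPP only when it can reuse an existing
# broadcast exchange, which is exactly why the small-table/broadcast
# condition discussed above matters.
spark.conf.set(
    "spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly", "true")
```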