How to handle Data skewness in Apache Spark using Key Salting Technique

  • Published 22. 06. 2020
  • Handling data skewness using the key salting technique. One of the biggest problems in parallel computational systems is data skewness. Data skewness in Spark happens when joining on a key that is not evenly distributed across the cluster, causing some partitions to become very large and preventing Spark from processing the data in parallel. (A minimal code sketch of the technique follows this list.)
    GitHub Link - github.com/gjeevanm/SparkData...
    Content By - Jeevan Madhur [LinkedIn - / jeevan-madhur-225a3a86 ]
    Editing By - Sivaraman Ravi [LinkedIn - / sivaraman-ravi-791838114 ]
  • Science & Technology
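
    A minimal sketch of the key salting technique described above (the toy data, column and DataFrame names, and salt count are illustrative assumptions, not the exact code from the linked repo):

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions._

      val spark = SparkSession.builder().appName("SaltingSketch").master("local[*]").getOrCreate()
      import spark.implicits._

      // Toy data: "x" is a hot key on the left (skewed) side.
      val leftTable  = Seq("x", "x", "x", "x", "y", "z").toDF("id")
      val rightTable = Seq(("x", "A"), ("y", "B"), ("z", "C")).toDF("id", "name")

      val numSalts = 3  // how many partitions each hot key should be spread over

      // Left side: append a random salt 0..numSalts-1 to the join key.
      val saltedLeft = leftTable
        .withColumn("salted_id", concat($"id", lit("_"), floor(rand() * numSalts).cast("string")))

      // Right side: replicate every row once per salt value so no match is lost.
      val saltedRight = rightTable
        .withColumn("salt", explode(array((0 until numSalts).map(i => lit(i)): _*)))
        .withColumn("salted_id", concat($"id", lit("_"), $"salt".cast("string")))
        .drop("id", "salt")

      // Join on the salted key instead of the raw key; the hot key "x" is now
      // spread over numSalts shuffle partitions instead of one.
      val joined = saltedLeft.join(saltedRight, Seq("salted_id"))
      joined.show(false)

    Because each right-side row is copied once per salt value, the salted join returns the same matches as a plain join on id, while no single shuffle partition has to hold all of the hot key's rows.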

Comments • 26

  • @pariksheetde4573
    @pariksheetde4573 4 years ago +4

    Excellent. Thank you

  • @gautamyadav-cx7zx
    @gautamyadav-cx7zx 2 years ago

    Well, I must say, thanks a lot... I have been searching for this kind of explanation.

  • @someshchandra007
    @someshchandra007 3 years ago

    This is really great and a crystal clear explanation... thanks a lot for sharing and spreading knowledge!

  • @arunsundar3739
    @arunsundar3739 3 months ago

    beautifully explained, thank you very much :)

  • @ashwinc9867
    @ashwinc9867 3 years ago +1

    Excellent video, thanks for the explanation and for sharing the code

  • @soumyadipdas1406
    @soumyadipdas1406 4 years ago +1

    amazing sir! thanks a lot

  • @joeturkington1304
    @joeturkington1304 2 years ago

    Excellent Description

  • @chetansp912
    @chetansp912 3 years ago +1

    Amazing video..!!

  • @gurumoorthysivakolunthu9878

    Hi Sir... Perfect, great explanation... Thank you for your effort...
    I have a doubt:
    after the join, the salted keys should be unsalted and only then should the group-by be applied, right...?
    .....
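
    If the key column itself was overwritten with the salted value (as in the code quoted later in this thread), one hedged way to get back to the original key before a group-by is to strip the salt suffix. This assumes the key was salted as <id>_<salt>, the original ids contain no underscore, and joinedDf is a hypothetical name for whatever the salted join produced:

      import org.apache.spark.sql.functions._

      // Drop the "_<salt>" suffix to recover the original key, then aggregate on it.
      val unsalted = joinedDf.withColumn("id", substring_index(col("id"), "_", 1))
      val result   = unsalted.groupBy("id").count()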

  • @vijeandran
    @vijeandran 3 years ago

    Amazing video.... How can we use the salting technique in PySpark for data skew?

  • @savage_su
    @savage_su 2 years ago

    Good work; it would be better to show the output of the salted dataframes and to explain the UDF in more detail.

  • @shwetanandwani9059
    @shwetanandwani9059 2 years ago

    Hey great video, could you also link the associated resources you referred to while making this video?

  • @SpiritOfIndiaaa
    @SpiritOfIndiaaa 2 years ago +1

    Thanks, but if we have multiple columns as the key, how do we handle it?
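
    One hedged way to handle a composite key, as asked above, is to salt a single combined key built from all the key columns. The names key1, key2, leftTable and rightTable below are made up for illustration, and the key columns are assumed to be strings:

      import org.apache.spark.sql.functions._

      val numSalts = 3

      // Skewed side: combine the key columns and append a random salt.
      val saltedLeft = leftTable.withColumn("salted_key",
        concat_ws("|", col("key1"), col("key2"), floor(rand() * numSalts).cast("string")))

      // Other side: replicate each row once per salt value with the same combined key.
      val saltedRight = rightTable
        .withColumn("salt", explode(array((0 until numSalts).map(i => lit(i)): _*)))
        .withColumn("salted_key", concat_ws("|", col("key1"), col("key2"), col("salt").cast("string")))

      val joined = saltedLeft.join(saltedRight, Seq("salted_key"))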

  • @MahmoudHanafy1992
    @MahmoudHanafy1992 3 years ago +1

    Great Explanation, Thanks for sharing this.
    I think there is an off-by-one error:
    you are using (0 to 3), which gives (0, 1, 2, 3),
    but the random number range will be (0, 1, 2).
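
    A small hedged fragment showing how to keep both ranges consistent and avoid the mismatch pointed out above (variable names are illustrative, not from the repo):

      import org.apache.spark.sql.functions._

      val numSalts = 3

      // floor(rand() * 3) yields 0, 1 or 2 ...
      val saltColumn = floor(rand() * numSalts)

      // ... so the replicated side must use 0 until 3 = (0, 1, 2), not 0 to 3 = (0, 1, 2, 3).
      val saltValues = array((0 until numSalts).map(i => lit(i)): _*)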

  • @tanushreenagar3116
    @tanushreenagar3116 2 years ago

    best

  • @rishigc
    @rishigc 3 years ago +1

    Amazing video. However, I don't know Scala, so can you please give an example of how to implement the salting technique with Spark SQL queries? That would be of great help.

  • @akashhudge5735
    @akashhudge5735 2 years ago

    But the join output will not be correct, because in the previous scenario it would have joined with all the matching ids, whereas with the new salting method it joins only on the newly salted key. That's weird.
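
    On the correctness concern above: because the second table is replicated once per salt value, every id that matched before still matches exactly one salted copy, so the result is unchanged. A quick hedged sanity check, reusing the DataFrame names from the sketch under the video description:

      // Both joins should return exactly the same number of rows.
      val plainCount  = leftTable.join(rightTable, Seq("id")).count()
      val saltedCount = saltedLeft.join(saltedRight, Seq("salted_id")).count()
      assert(plainCount == saltedCount)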

  • @aravindkumar4411
    @aravindkumar4411 4 years ago

    Can you please explain how to take the random number count?

    • @jeevanmadhur3732
      @jeevanmadhur3732 4 years ago +1

      Hi Aravind, if I understand your question correctly, you want to take the count of the first data frame, the one where we append a random number to the key:
      // salt the join key by appending "_<random digit 0..9>" to it
      var df1 = leftTable
        .withColumn(leftCol, concat(
          leftTable.col(leftCol), lit("_"), floor(rand(123456) * 10)))
      We can simply do
      df1.select(col("id")).count()
      This should give the row count of the first data frame.
      For more details, you can refer to the git link below:
      github.com/gjeevanm/SparkDataSkewness/blob/master/src/main/scala/com/gjeevan/DataSkew/RemoveDataSkew.scala

  • @thomashass1
    @thomashass1 2 years ago

    I have 2 questions:
    First one: I think there is something wrong in your visual presentation of table 2 after salting. Why don't you have z_2 and z_3 there? Also, why are you using capital letters sometimes? That's confusing.
    Second question: I don't get the benefit of key salting in general. How is this different from broadcasting your second table? Because you explode it, you will end up sending the whole table to every executor anyway? No one can give an answer to this question.
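
    For contrast with the broadcast alternative raised above, a hedged sketch (not from the video): a broadcast hash join ships the whole small table to every executor up front and avoids shuffling the large side, whereas salting replicates the exploded side only numSalts times and still shuffles both sides. Salting is mainly useful when neither table is small enough to broadcast.

      import org.apache.spark.sql.functions.broadcast

      // Broadcast hash join: the large, skewed side is never shuffled,
      // so skew in its join key no longer creates oversized partitions.
      val joinedViaBroadcast = leftTable.join(broadcast(rightTable), Seq("id"))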

  • @NishaKumari-op2ek
    @NishaKumari-op2ek 3 years ago

    Hi, are you missing something in the code? I used your code but it's throwing an exception for the lines below:
    // join after eliminating data skewness
    df3.join(
      df4,
      df3.col("id") === df4.col("id")
    )
      .show(100, false)
    }

    • @jeevanmadhur3732
      @jeevanmadhur3732 3 years ago +1

      Hi,
      Thanks for highlighting this; there was a small issue with the checked-in join code, which I have fixed now. Please pull the latest code and try it out.

    • @NishaKumari-op2ek
      @NishaKumari-op2ek 3 years ago +2

      @@jeevanmadhur3732 Thank you Jeevan, your videos help us a lot :)