The Big Data Show
Interview Question on Cache v/s Persist - Part 1
During a Data Engineering interview, you may be asked about concepts related to #apachespark.
When working with large-scale data processing in Apache Spark, efficient data management is key to achieving optimal performance. Two commonly used methods for improving the efficiency of Spark jobs are cache() and persist(). While they might seem similar at first glance, they serve slightly different purposes and have distinct use cases. Let’s delve into the differences between these two methods and their use cases, along with the #interviewquestions
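To ground the comparison, here is a minimal PySpark sketch (an illustration with a toy dataset, not the video's exact code) showing both methods side by side:

    # Minimal sketch, assuming a toy DataFrame; not the video's demo code.
    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()
    df = spark.range(1_000_000)        # stand-in for a real dataset

    df.cache()                         # cache() = persist() with the default level
    df.count()                         # an action materializes the cache

    df2 = df.filter("id % 2 = 0")
    df2.persist(StorageLevel.MEMORY_AND_DISK)   # persist() lets you pick the level
    df2.count()

    df.unpersist()                     # release what you no longer need
    df2.unpersist()

Running both and checking the Storage tab of the Spark UI (as the video does) shows the different storage levels in effect.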
𝐀𝐫𝐭𝐢𝐜𝐥𝐞 & 𝐌𝐂𝐐 𝐋𝐢𝐧𝐤:
czcams.com/users/postUgkxtHWo5HNdQomArpA1qOeXi_szZzVfhJHP
In the video (Part 1 & Part 2), you will learn:
🔅 Definition
🔅 Cache vs Persist
🔅 Use case in Spark Optimization
🔅 Storage Level of Persist
🔅 Serialization vs Deserialization
🔅 Demo using Spark UI
🔅 MCQ Question
𝐂𝐡𝐚𝐩𝐭𝐞𝐫𝐬:
- 0:00 Introduction
- 1:00 Definition
- 6:34 Persist Storage Level
- 9:51 Difference between cache and persist
- 11:37 Real-time use case & other important questions
🔅 For scheduling a call for mentorship, mock interview preparation, 1:1 connect, collaboration - topmate.io/ankur_ranjan
🔅 LinkedIn - www.linkedin.com/in/thebigdatashow/
🔅 Instagram - ranjan_anku
🔅 Nisha's LinkedIn profile - www.linkedin.com/in/engineer-nisha/
🔅 Ankur's LinkedIn profile - www.linkedin.com/in/thebigdatashow/
#dataengineering #datascience #bigdata #pyspark #dataanalytics #spark #interviewquestions #interview
Views: 494

Videos

IPL Final 2024 Data Analysis: Building the Ultimate Scorecard with Pyspark
1.6K views · 14 days ago
Have you ever wondered how crucial data analysis is for a cricket team's success? Thousands of Data Engineers, Data Analysts, and Data Scientists work tirelessly behind the scenes to craft winning strategies. In this session, we'll dive into an exciting IPL dataset and perform a transformation to build the Scorecard of the IPL Final 2024 featuring #SRHvKKR. In this video, you'll learn how to pe...
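As a flavour of the transformations involved, here is a tiny hedged sketch (the column names and toy rows are assumptions, not the video's dataset) that aggregates ball-by-ball rows into a batting scorecard:

    # Hypothetical ball-by-ball rows -> per-batsman runs and balls faced.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("ipl-scorecard").getOrCreate()
    deliveries = spark.createDataFrame(
        [("Head", 4), ("Head", 0), ("Klaasen", 6), ("Klaasen", 1)],
        ["batsman", "runs"],
    )
    scorecard = (deliveries.groupBy("batsman")
                 .agg(F.sum("runs").alias("runs"),
                      F.count("*").alias("balls")))
    scorecard.show()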
Salting in Apache Spark - Part II
583 views · 21 days ago
In this video, we dive deep into the salting technique, a powerful method to tackle data skew issues in Spark. Data skew can significantly impact the performance of your Spark jobs by creating bottlenecks during data processing. Salting helps to evenly distribute the data across partitions, ensuring a smoother and more efficient processing flow. What You’ll Learn: 🔹 What is data skewness 🔹 How ...
Salting in Apache Spark - Part I
1.3K views · 21 days ago
During a Data Engineering interview, you may be asked about concepts related to #apachespark This video explains the Salting technique. We will go in-depth to help you understand the topic, but it's important to remember that theory alone may not be enough. The salting technique in Apache Spark is a method used to address data skew. Data skew happens when certain keys have more data than others...
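For a rough idea of the technique, here is a minimal salting sketch (an assumed example, not the video's exact code; the salting factor N and the toy data are arbitrary):

    # Spread a skewed join key over N sub-keys; replicate the small side to match.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("salting-demo").getOrCreate()
    N = 8  # salting factor: an assumption, tune for your skew

    big = spark.createDataFrame(
        [("hot", i) for i in range(100)] + [("cold", 1)], ["key", "val"])
    small = spark.createDataFrame([("hot", "a"), ("cold", "b")], ["key", "attr"])

    # Random salt on the big (skewed) side.
    big_salted = big.withColumn(
        "salted_key", F.concat_ws("_", "key", (F.rand() * N).cast("int")))

    # Cross-join the small side with all salts so every salted key exists.
    salts = spark.range(N).withColumnRenamed("id", "salt")
    small_salted = (small.crossJoin(salts)
                    .withColumn("salted_key", F.concat_ws("_", "key", "salt")))

    big_salted.join(small_salted, "salted_key").show(5)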
Big Data Mock Interview | Data Engineering Interview | First Round of Interview
4.1K views · 21 days ago
Data Engineering Mock Interview. Join Nisha, a Data Engineering professional with over 5 years of experience, and Sai Varun Kumar Namburi for an exciting and informative Data Engineering mock interview session. If you're preparing for a Data Engineering interview, this is the perfect opportunity to enhance your skills and increase your chances of success. The mock interview simulate...
How to read from APIs in PySpark codebase...
1.6K views · 1 month ago
PySpark mini project: Dive into the world of big data processing with our PySpark Practice playlist. This series is designed for both beginners and seasoned data professionals looking to sharpen their Apache Spark skills through scenario-based questions and challenges. Not all the inputs come from storage files like JSON, CSV and other formats. There can be cases where you are given a scenario ...
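A minimal sketch of the idea (the endpoint URL is a placeholder assumption): fetch JSON on the driver with requests, then turn it into a DataFrame:

    import requests
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("api-read").getOrCreate()

    resp = requests.get("https://api.example.com/v1/orders", timeout=30)
    resp.raise_for_status()
    records = resp.json()              # assumed to be a list of JSON objects

    df = spark.createDataFrame(records)
    df.printSchema()
    df.show(5)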
Data Engineering Interview at top product based company | First Round
6K views · 1 month ago
Data Engineering Mock Interview In top product-based companies like #meta #amazon #google #netflix etc, the first round of Data Engineering Interviews checks problem-solving skills. It mostly consists of screen-sharing sessions, where candidates are expected to solve multiple SQL and DSA problems, particularly in #python. We have tried to replicate the same things by asking multiple good SQL an...
What is topic, partition and offset in Kafka?
539 views · 2 months ago
This is the third video of our "Kafka for Data Engineers" playlist. In this video, we have tried understanding the topic, partition and offset in Apache Kafka in depth. Understanding and imagining Apache Kafka at its core is very important to grasp its concepts deeply. Stay tuned to this playlist for all upcoming videos. 𝗝𝗼𝗶𝗻 𝗺𝗲 𝗼𝗻 𝗦𝗼𝗰𝗶𝗮𝗹 𝗠𝗲𝗱𝗶𝗮: 🔅 Topmate (For collaboration and Scheduli...
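For illustration, a minimal consumer sketch using the kafka-python client (the broker address and topic name are assumptions); every record is addressed by its (topic, partition, offset):

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "orders",                          # hypothetical topic
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",      # start from the oldest offset
    )
    for msg in consumer:
        # Each record's address within the cluster:
        print(msg.topic, msg.partition, msg.offset, msg.value)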
Brokers in Apache Kafka | Replication factor & ISR in Kafka
349 views · 2 months ago
This is the fourth video of our "Kafka for Data Engineers" playlist. In this video, we have tried to understand brokers, the replication factor and ISR. Understanding and imagining Apache Kafka at its core is very important to grasp its concepts deeply. Stay tuned to this playlist for all upcoming videos. 𝗝𝗼𝗶𝗻 𝗺𝗲 𝗼𝗻 𝗦𝗼𝗰𝗶𝗮𝗹 𝗠𝗲𝗱𝗶𝗮: 🔅 Topmate (For collaboration and Scheduling calls) - t...
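As a rough sketch (again assuming kafka-python and a local broker): creating a topic with replication_factor=3 gives each partition one leader and two follower replicas spread across brokers, and the replicas that stay caught up with the leader form the ISR:

    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    admin.create_topics([
        # 3 partitions, each replicated on 3 brokers (1 leader + 2 followers).
        NewTopic(name="orders", num_partitions=3, replication_factor=3)
    ])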
Job, Stage and Task in Apache Spark | PySpark interview questions
1.2K views · 2 months ago
In this video, we explain the concept of Job, Stage and Task in Apache Spark or PySpark. We have gone in-depth to help you understand the topic, but it's important to remember that theory alone may not be enough. To reinforce your knowledge, we've created many problems for you to practice on the same topic in the community section of our CZcams channel. You can find a link to all the questions ...
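A minimal sketch of how the three map to code (a toy example, not the video's demo): each action submits a job; the wide groupBy transformation forces a shuffle and hence an extra stage; each stage runs one task per partition:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("job-stage-task").getOrCreate()
    df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

    df.count()                               # job 1: no shuffle needed
    df.groupBy("bucket").count().collect()   # job 2: shuffle adds a stage

The Jobs and Stages tabs of the Spark UI show the resulting breakdown.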
Unlocking Apache Kafka: The Secret Sauce of Event Streaming
698 views · 2 months ago
This is the second video of our "Apache Kafka for Data Engineers" playlist. In this video, we have tried understanding Apache Kafka in brief, and then the real meaning of events & event streaming. Understanding and imagining Apache Kafka at its core is very important to grasp its concepts deeply. Stay tuned to this playlist for all upcoming videos. 𝗝𝗼𝗶𝗻 𝗺𝗲 𝗼...
Unleashing #kafka Magic: What Data Engineers Do with Apache Kafka?
1.5K views · 2 months ago
This is the first video of our "Apache Kafka for Data Engineers" playlist. In this video, we have tried discussing one real use case, a big data pipeline involving Kafka, of the kind often used in the e-commerce industry by companies like Amazon, Walmart, etc. It is very important to understand some of the real use cases of Apache Kafka in the Data Engineering domain. I hope this video will set up the tone for t...
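For a flavour of such a pipeline, a hedged producer sketch (kafka-python, with a made-up topic and event) of the kind of clickstream publishing an e-commerce site might do:

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    # Hypothetical clickstream event flowing into the pipeline.
    producer.send("clickstream", {"user_id": 42, "event": "add_to_cart"})
    producer.flush()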
Repartition vs. Coalesce in Apache Spark | PySpark interview questions
625 views · 2 months ago
During a Data Engineering interview, you may be asked about concepts related to #apachespark. In this video, we explain the difference between Repartition and Coalesce in Apache Spark or PySpark. We go in-depth to help you understand the topic, but it's important to remember that theory alone may not be enough. To reinforce your knowledge, we've created over ten problems for you to practice on ...
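A minimal sketch of the difference (assumed toy data): repartition(n) performs a full shuffle and can grow or shrink the partition count evenly, while coalesce(n) only merges existing partitions, so it is cheaper when reducing them:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()
    df = spark.range(1_000_000)

    print(df.rdd.getNumPartitions())   # current partition count
    up = df.repartition(200)           # full shuffle, even redistribution
    down = df.coalesce(4)              # merges partitions, no full shuffle
    print(up.rdd.getNumPartitions(), down.rdd.getNumPartitions())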
Apache Spark End-To-End Data Engineering Project | Apple Data Analysis
27K views · 2 months ago
Sports Data Analysis using PySpark - Part 02
1.1K views · 2 months ago
Narrow vs. Wide Transformation in Apache Spark | PySpark interview questions
736 views · 2 months ago
Sports Data Analysis using PySpark - Part 01
1.4K views · 2 months ago
Big Data Mock Interview | Data Engineering Interview | First Round of Interview
6K views · 2 months ago
Data Engineering Interview
4.7K views · 3 months ago
Data Engineering Interview | PySpark Questions | Manager behavioural questions
7K views · 3 months ago
Data Engineering Interview at top product based company | First Round
11K views · 3 months ago
Big Data Mock Interview | Data Engineering Interview | First Round of Interview
7K views · 3 months ago
Big Data Mock Interview | Data Engineering Interview
16K views · 4 months ago
AWS Data Engineering Interview
23K views · 4 months ago
Data Engineering Interview | System Design
23K views · 4 months ago
System Design round of #dataengineering interview
15K views · 4 months ago
First round of Big Data Engineering #interview
2.6K views · 4 months ago
System Design round of Data Engineering #interview at top product-based company
41K views · 5 months ago
Big Data Mock Interview | First Round
27K views · 5 months ago
Data Engineering Mock Interview at Top Product Based Companies
10K views · 6 months ago

Comments

  • @AK-zs3we · 20 hours ago

    Very informative! 🤝

  • @SillyLittleMe · 1 day ago

    Does anybody have any idea what this error means: DBFS file browser StorageContext com.databricks.backend.storage.StorageContextType$DbfsRoot$@5c512926 for workspace 3667672304132597 is not set in the CustomerStorageInfo. I am not sure what this means. For context, I had uploaded some files yesterday on DBFS for practice purposes. Those files are still available if I try to find them through notebooks, however, the DBFS tab can't show them and throws this error. Any help will be much appreciated!

  • @kolodacool · 1 day ago

    Hey Manoj, great session on data extraction via APIs. A few points I'd like to share from my experience working on this: 1) While dealing with huge volumes of data from the source, it's crucial to use pagination to iteratively collect all the data. 2) Admins who manage these end_point URLs usually discourage multiple API calls within a certain timeframe, which would cause a deadlock on your batch id. Like you suggested, either have the bulk data pulled all at once or optimize the framework.
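    A minimal pagination sketch along the lines of point 1 (the endpoint and parameter names are assumptions):

        import time
        import requests

        records, page = [], 1
        while True:
            resp = requests.get("https://api.example.com/v1/orders",
                                params={"page": page, "per_page": 500},
                                timeout=30)
            resp.raise_for_status()
            batch = resp.json()
            if not batch:              # empty page: all data collected
                break
            records.extend(batch)
            page += 1
            time.sleep(1)              # stay under the admin's rate limits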

  • @footballalchemist · 2 days ago

    Just completed this amazing project 😍 Can I add this to my portfolio?

  • @DharmajiD · 2 days ago

    I see this when trying to upload files using the DBFS file browser: "Missing credentials to access AWS bucket".

  • @shouviksharma7621 · 3 days ago

    Thanks for the content, really beneficial.

  • @AnjaliH-wo4hm · 4 days ago

    Good effort; however, please minimize the usage of words like "perfect" and "well & good" after every sentence.

  • @AkshayBaishander · 9 days ago

    Great explanation, thanks!

  • @shobhitsharma2137 · 9 days ago

    Subtitles are not fully visible.

  • @MrTejasreddy · 10 days ago

    Super, Manoj. Thanks for your time. As you said, it's a really tough use case; if possible, do it with an easier dataset next time, so that everyone can understand easily. Keep going.... 👌

    • @manojt3164 · 9 days ago

      Thanks a lot for your kind words. Sure, I'll keep that in mind.

  • @DataEngineerPratik · 11 days ago

    My doubt is: in Spark we have cache and persist, both used to save the RDD/DataFrame. As per my understanding, cache() and persist(MEMORY_AND_DISK) both perform the same action for DataFrames. If this is the case, why should I prefer cache at all, when I can always use persist (with different parameters) and ignore cache? Could you please let me know when to use cache, or whether my understanding is wrong.

    • @manojt3164 · 9 days ago

      That's a good question. With cache, data is stored in memory in deserialized form; the same is true of persist(MEMORY_ONLY). cache() is just a synonym for persist() with the default storage level, which for RDDs is StorageLevel.MEMORY_ONLY: cache the data in memory, not on disk. So both store the data in memory in deserialized format. (Note that for DataFrames the default storage level is MEMORY_AND_DISK.)

    • @DataEngineerPratik · 9 days ago

      @manojt3164 OK Manoj, thanks

  • @RaviShankarPoosaRaviKumar

    WITH tab1 AS (
        SELECT order_id,
               MAX(quantity) AS max_value,
               AVG(quantity) AS avg_quantity
        FROM orders
        GROUP BY order_id
    )
    SELECT order_id
    FROM tab1
    WHERE max_value > ALL (SELECT avg_quantity FROM tab1)

  • @shafimahmed7711 · 12 days ago

    Thank you for your time and efforts. It's not an easy job 👏👏👏

  • @akhilsingh3801 · 13 days ago

    The important content is the questions asked 😅😅

  • @mohinraffik5222 · 16 days ago

    Appreciate your great effort in sharing your knowledge, brother! 👍

  • @rationalthinker3706 · 18 days ago

    Please add the dataset.

    • @manojt7012 · 18 days ago

      drive.google.com/drive/folders/1bH-38DLQWu46m0asGyaTqyspwJiwsxvH?usp=drive_link

    • @TheBigDataShow · 18 days ago

      Kindly check in the description.

  • @Aman-lv2ee · 18 days ago

    Can you add the dataset link?

    • @manojt7012 · 18 days ago

      Thanks for the response. drive.google.com/drive/folders/1bH-38DLQWu46m0asGyaTqyspwJiwsxvH?usp=drive_link

  • @manishkumartiwari420 · 19 days ago

    Can you please help us with the dataset?

  • @rationalthinker3706 · 19 days ago

    awesome sir

    • @TheBigDataShow · 13 days ago

      Thank you for your kind words. Keep learning :)

  • @payalbhatia6927 · 19 days ago

    Which pen tab/device is used for the videos? Can you please share?

  • @AshishDukare-vr6xb · 20 days ago

    WITH cte AS (
        SELECT order_id, AVG(quantity) AS avg_quantity
        FROM t1
        GROUP BY order_id
    )
    SELECT t1.order_id
    FROM t1
    JOIN cte ON t1.order_id = cte.order_id
    WHERE t1.quantity > cte.avg_quantity;

  • @AshishDukare-vr6xb · 20 days ago

    How come the project discussion is happening in the first round? Usually they ask Python and SQL questions in the first round to check the basic foundation. Correct me if I am wrong here.

  • @AshishDukare-vr6xb · 20 days ago

    Don't you think his intro was too long, and that the interviewer had to cut him off in between to ask questions quickly?

  • @atharvagaikwad9619 · 20 days ago

    Why would you set up a session in Spark when you already get one?

    • @manojt7012 · 18 days ago

      That's right. In a notebook the Spark session is already created, so the code could have used the existing session instead of creating the SparkSession itself.

  • @maazahmedansari4334 · 21 days ago

    I replied to my previous question but it seems not visible, so posting again. In the first pipeline I am getting: AnalysisException: Failed to merge fields 'customer_id' and 'customer_id'. Any suggestion would be appreciated. Thank you. Please find the code I am trying to follow along here: github.com/maaz-ahmed-ansari/apple-product-analysis/tree/main

    • @maazahmedansari4334 · 17 days ago

      The 2nd pipeline is working as expected. Still bashing my head against the 1st pipeline. Can someone suggest how to resolve the above error?

  • @maazahmedansari4334 · 21 days ago

    Getting this in the first pipeline: AnalysisException: Failed to merge fields 'customer_id' and 'customer_id'. Any suggestion would be appreciated. Thank you.

    • @TheBigDataShow · 21 days ago

      @maazahmedansari4334 Please share some more of your code snippets for debugging. Have you created a GitHub repo for the same?

  • @yashbhosle3582 · 22 days ago

    SELECT name,
           department_name,
           MAX(DATEDIFF(promotion_date, hire_date)) AS longest_time
    FROM employee
    JOIN department ON employee.dept_id = department.dept_id
    JOIN promotion ON employee.employee_id = promotion.employee_id
    GROUP BY name, department_name
    ORDER BY longest_time DESC;

  • @mufaddalrampurawala247

    This also increases the data size of the second dataset as we explode it, so is it still optimized, given that the data scanned will increase a lot and a lot of shuffle will be involved?

    • @nishabansal2978 · 22 days ago

      While salting can increase the data size and shuffle overhead in Spark, its benefits in mitigating data skewness and improving workload distribution often outweigh these drawbacks. The other important thing is to decide on the salting factor to choose for your workload, as that will again impact the overall distribution.

  • @TheBigDataShow · 24 days ago

    A practical demonstration will be released tomorrow. Kindly watch this video to understand the theory in depth.

  • @rationalthinker3706 · 24 days ago

    Thank you, waiting!

    • @TheBigDataShow · 24 days ago

      A practical demonstration will be released tomorrow. Kindly watch this video to understand the theory in depth.

  • @arpanmitra1994 · 25 days ago

    nums = [0, 0, 2, 3, 3, 3, 3, 5, 5]
    k = 2
    new_nums = []
    for i in nums:
        if nums.count(i) == k:
            if i not in new_nums:
                new_nums.append(i)
    print(new_nums)

  • @DataCoholic · 25 days ago

    @The Big Data Show Do you also develop RESTful web services or REST APIs as a data engineer?

    • @manojt3164 · 24 days ago

      Hello, yes, there can be cases where engineers are asked to develop solutions to expose data outside of the organization. In such cases, you might need to build REST APIs to expose your data securely. Of course, there can also be other approaches to exposing data!

    • @DataCoholic · 24 days ago

      @manojt3164 Did you work on Spring and Spring Boot?

  • @tanayvaswani-24blue · 25 days ago

    Can you do even a small project using Kafka?

    • @TheBigDataShow · 25 days ago

      Give me some time. I am already planning one, but I am currently getting less time due to my startup's initial days. Please give me some time and I will upload it. I have already uploaded some Kafka videos; please check the "Kafka for Data Engineers" playlist.

  • @adityeshchaturvedi6553

    Great explanation, Ankur!!

  • @adityeshchaturvedi6553

    Great video, Ankur. Been following your content and blogs via LinkedIn. Congrats!!

  • @dhruvingandhi1114 · 28 days ago

    Hello, I am getting an error reading the Delta table that is on default at 01:21:50: IllegalArgumentException: Path must be absolute: default.customer_delta_table_persist. Please help me through that.

  • @unknown_fact1586 · 28 days ago

    Please mention the experience of the interviewee either in the caption or in the thumbnail. It would be helpful.

  • @Ravi-oy8zl · 29 days ago

    The best playlist for anyone who wants to learn Kafka in the data engineering domain. Every video has a clear-cut explanation. Hope it will be continued. It can be a one-stop tutorial for those who want to learn Kafka.

    • @TheBigDataShow · 29 days ago

      Thank you for your kind words. I will continue this in a few days; I've been stuck with work for many days. Hope I get free time soon.