The Big Data Show
Interview Question on Cache v/s Persist - Part 1
During a Data Engineering interview, you may be asked about concepts related to #apachespark.
When working with large-scale data processing in Apache Spark, efficient data management is key to achieving optimal performance. Two commonly used methods for improving the efficiency of Spark jobs are cache() and persist(). While they might seem similar at first glance, they serve slightly different purposes and have distinct use cases. Let’s delve into the differences between these two methods and their use cases, along with the #interviewquestions
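To ground the comparison, here is a minimal PySpark sketch (an illustration with a toy dataset, not the video's exact code) showing both methods side by side:

    # Minimal sketch, assuming a toy DataFrame; not the video's demo code.
    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()
    df = spark.range(1_000_000)        # stand-in for a real dataset

    df.cache()                         # cache() = persist() with the default level
    df.count()                         # an action materializes the cache

    df2 = df.filter("id % 2 = 0")
    df2.persist(StorageLevel.MEMORY_AND_DISK)   # persist() lets you pick the level
    df2.count()

    df.unpersist()                     # release what you no longer need
    df2.unpersist()

Running both and checking the Storage tab of the Spark UI (as the video does) shows the different storage levels in effect.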
𝐀𝐫𝐭𝐢𝐜𝐥𝐞 & 𝐌𝐂𝐐 𝐋𝐢𝐧𝐤:
czcams.com/users/postUgkxtHWo5HNdQomArpA1qOeXi_szZzVfhJHP
In the video (Part 1 & Part 2), you will learn:
🔅 Definition
🔅 Cache vs Persist
🔅 Use case in Spark Optimization
🔅 Storage Level of Persist
🔅 Serialization vs Deserialization
🔅 Demo using Spark UI
🔅 MCQ Question
𝐂𝐡𝐚𝐩𝐭𝐞𝐫𝐬:
- 0:00 Introduction
- 1:00 Definition
- 6:34 Persist Storage Level
- 9:51 Difference between cache and persist
- 11:37 Real-time use case & other important questions
🔅 For scheduling a call for mentorship, mock interview preparation, 1:1 connect, collaboration - topmate.io/ankur_ranjan
🔅 LinkedIn - www.linkedin.com/in/thebigdatashow/
🔅 Instagram - ranjan_anku
🔅 Nisha's LinkedIn profile - www.linkedin.com/in/engineer-nisha/
🔅 Ankur's LinkedIn profile - www.linkedin.com/in/thebigdatashow/
#dataengineering #datascience #bigdata #pyspark #dataanalytics #spark #interviewquestions #interview
Views: 494

Videos

IPL Final 2024 Data Analysis: Building the Ultimate Scorecard with Pyspark
1.6K views · 14 days ago
Have you ever wondered how crucial data analysis is for a cricket team's success? Thousands of Data Engineers, Data Analysts, and Data Scientists work tirelessly behind the scenes to craft winning strategies. In this session, we'll dive into an exciting IPL dataset and perform a transformation to build the Scorecard of the IPL Final 2024 featuring #SRHvKKR. In this video, you'll learn how to pe...
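As a flavour of the transformations involved, here is a tiny hedged sketch (the column names and toy rows are assumptions, not the video's dataset) that aggregates ball-by-ball rows into a batting scorecard:

    # Hypothetical ball-by-ball rows -> per-batsman runs and balls faced.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("ipl-scorecard").getOrCreate()
    deliveries = spark.createDataFrame(
        [("Head", 4), ("Head", 0), ("Klaasen", 6), ("Klaasen", 1)],
        ["batsman", "runs"],
    )
    scorecard = (deliveries.groupBy("batsman")
                 .agg(F.sum("runs").alias("runs"),
                      F.count("*").alias("balls")))
    scorecard.show()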
Salting in Apache Spark - Part II
583 views · 21 days ago
In this video, we dive deep into the salting technique, a powerful method to tackle data skew issues in Spark. Data skew can significantly impact the performance of your Spark jobs by creating bottlenecks during data processing. Salting helps to evenly distribute the data across partitions, ensuring a smoother and more efficient processing flow. What You’ll Learn: 🔹 What is data skewness 🔹 How ...
Salting in Apache Spark - Part I
1.3K views · 21 days ago
During a Data Engineering interview, you may be asked about concepts related to #apachespark This video explains the Salting technique. We will go in-depth to help you understand the topic, but it's important to remember that theory alone may not be enough. The salting technique in Apache Spark is a method used to address data skew. Data skew happens when certain keys have more data than others...
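For a rough idea of the technique, here is a minimal salting sketch (an assumed example, not the video's exact code; the salting factor N and the toy data are arbitrary):

    # Spread a skewed join key over N sub-keys; replicate the small side to match.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("salting-demo").getOrCreate()
    N = 8  # salting factor: an assumption, tune for your skew

    big = spark.createDataFrame(
        [("hot", i) for i in range(100)] + [("cold", 1)], ["key", "val"])
    small = spark.createDataFrame([("hot", "a"), ("cold", "b")], ["key", "attr"])

    # Random salt on the big (skewed) side.
    big_salted = big.withColumn(
        "salted_key", F.concat_ws("_", "key", (F.rand() * N).cast("int")))

    # Cross-join the small side with all salts so every salted key exists.
    salts = spark.range(N).withColumnRenamed("id", "salt")
    small_salted = (small.crossJoin(salts)
                    .withColumn("salted_key", F.concat_ws("_", "key", "salt")))

    big_salted.join(small_salted, "salted_key").show(5)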
Big Data Mock Interview | Data Engineering Interview | First Round of Interview
4.1K views · 21 days ago
Data Engineering Mock Interview. Join Nisha, a Data Engineering professional with over 5 years of experience, and Sai Varun Kumar Namburi for an exciting and informative Data Engineering mock interview session. If you're preparing for a Data Engineering interview, this is the perfect opportunity to enhance your skills and increase your chances of success. The mock interview simulate...
How to read from APIs in PySpark codebase...
1.6K views · 1 month ago
PySpark mini project: Dive into the world of big data processing with our PySpark Practice playlist. This series is designed for both beginners and seasoned data professionals looking to sharpen their Apache Spark skills through scenario-based questions and challenges. Not all the inputs come from storage files like JSON, CSV and other formats. There can be cases where you are given a scenario ...
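A minimal sketch of the idea (the endpoint URL is a placeholder assumption): fetch JSON on the driver with requests, then turn it into a DataFrame:

    import requests
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("api-read").getOrCreate()

    resp = requests.get("https://api.example.com/v1/orders", timeout=30)
    resp.raise_for_status()
    records = resp.json()              # assumed to be a list of JSON objects

    df = spark.createDataFrame(records)
    df.printSchema()
    df.show(5)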
Data Engineering Interview at top product based company | First Round
6K views · 1 month ago
Data Engineering Mock Interview In top product-based companies like #meta #amazon #google #netflix etc, the first round of Data Engineering Interviews checks problem-solving skills. It mostly consists of screen-sharing sessions, where candidates are expected to solve multiple SQL and DSA problems, particularly in #python. We have tried to replicate the same things by asking multiple good SQL an...
What is topic, partition and offset in Kafka?
539 views · 2 months ago
This is the third video of our "Kafka for Data Engineers" playlist. In this video, we have tried understanding the topic, partition and offset in Apache Kafka in depth. Understanding and imagining Apache Kafka at its core is very important to grasp its concepts deeply. Stay tuned to this playlist for all upcoming videos. 𝗝𝗼𝗶𝗻 𝗺𝗲 𝗼𝗻 𝗦𝗼𝗰𝗶𝗮𝗹 𝗠𝗲𝗱𝗶𝗮: 🔅 Topmate (For collaboration and Scheduli...
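For illustration, a minimal consumer sketch using the kafka-python client (the broker address and topic name are assumptions); every record is addressed by its (topic, partition, offset):

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "orders",                          # hypothetical topic
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",      # start from the oldest offset
    )
    for msg in consumer:
        # Each record's address within the cluster:
        print(msg.topic, msg.partition, msg.offset, msg.value)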
Brokers in Apache Kafka | Replication factor & ISR in Kafka
349 views · 2 months ago
This is the fourth video of our "Kafka for Data Engineers" playlist. In this video, we have tried to understand brokers, the replication factor and ISR. Understanding and imagining Apache Kafka at its core is very important to grasp its concepts deeply. Stay tuned to this playlist for all upcoming videos. 𝗝𝗼𝗶𝗻 𝗺𝗲 𝗼𝗻 𝗦𝗼𝗰𝗶𝗮𝗹 𝗠𝗲𝗱𝗶𝗮: 🔅 Topmate (For collaboration and Scheduling calls) - t...
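As a rough sketch (again assuming kafka-python and a local broker): creating a topic with replication_factor=3 gives each partition one leader and two follower replicas spread across brokers, and the replicas that stay caught up with the leader form the ISR:

    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    admin.create_topics([
        # 3 partitions, each replicated on 3 brokers (1 leader + 2 followers).
        NewTopic(name="orders", num_partitions=3, replication_factor=3)
    ])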
Job, Stage and Task in Apache Spark | PySpark interview questions
1.2K views · 2 months ago
In this video, we explain the concept of Job, Stage and Task in Apache Spark or PySpark. We have gone in-depth to help you understand the topic, but it's important to remember that theory alone may not be enough. To reinforce your knowledge, we've created many problems for you to practice on the same topic in the community section of our CZcams channel. You can find a link to all the questions ...
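A minimal sketch of how the three map to code (a toy example, not the video's demo): each action submits a job; the wide groupBy transformation forces a shuffle and hence an extra stage; each stage runs one task per partition:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("job-stage-task").getOrCreate()
    df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

    df.count()                               # job 1: no shuffle needed
    df.groupBy("bucket").count().collect()   # job 2: shuffle adds a stage

The Jobs and Stages tabs of the Spark UI show the resulting breakdown.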
Unlocking Apache Kafka: The Secret Sauce of Event Streaming
698 views · 2 months ago
This is the second video of our "Apache Kafka for Data Engineers" playlist. In this video, we have tried understanding Apache Kafka in brief, and then the real meaning of events & event streaming. Understanding and imagining Apache Kafka at its core is very important to grasp its concepts deeply. Stay tuned to this playlist for all upcoming videos. 𝗝𝗼𝗶𝗻 𝗺𝗲 𝗼...
Unleashing #kafka Magic: What Data Engineers Do with Apache Kafka?
1.5K views · 2 months ago
This is the first video of our "Apache Kafka for Data Engineers" playlist. In this video, we have tried discussing one real use case, a big data pipeline involving Kafka, of the kind often used in the e-commerce industry by companies like Amazon, Walmart, etc. It is very important to understand some of the real use cases of Apache Kafka in the Data Engineering domain. I hope this video will set up the tone for t...
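For a flavour of such a pipeline, a hedged producer sketch (kafka-python, with a made-up topic and event) of the kind of clickstream publishing an e-commerce site might do:

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    # Hypothetical clickstream event flowing into the pipeline.
    producer.send("clickstream", {"user_id": 42, "event": "add_to_cart"})
    producer.flush()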
Repartition vs. Coalesce in Apache Spark | PySpark interview questions
625 views · 2 months ago
During a Data Engineering interview, you may be asked about concepts related to #apachespark. In this video, we explain the difference between Repartition and Coalesce in Apache Spark or PySpark. We go in-depth to help you understand the topic, but it's important to remember that theory alone may not be enough. To reinforce your knowledge, we've created over ten problems for you to practice on ...
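A minimal sketch of the difference (assumed toy data): repartition(n) performs a full shuffle and can grow or shrink the partition count evenly, while coalesce(n) only merges existing partitions, so it is cheaper when reducing them:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()
    df = spark.range(1_000_000)

    print(df.rdd.getNumPartitions())   # current partition count
    up = df.repartition(200)           # full shuffle, even redistribution
    down = df.coalesce(4)              # merges partitions, no full shuffle
    print(up.rdd.getNumPartitions(), down.rdd.getNumPartitions())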
Apache Spark End-To-End Data Engineering Project | Apple Data Analysis
27K views · 2 months ago
Sports Data Analysis using PySpark - Part 02
1.1K views · 2 months ago
Narrow vs. Wide Transformation in Apache Spark | PySpark interview questions
736 views · 2 months ago
Sports Data Analysis using PySpark - Part 01
1.4K views · 2 months ago
Big Data Mock Interview | Data Engineering Interview | First Round of Interview
6K views · 2 months ago
Data Engineering Interview
4.7K views · 3 months ago
Data Engineering Interview | PySpark Questions | Manager behavioural questions
7K views · 3 months ago
Data Engineering Interview at top product based company | First Round
11K views · 3 months ago
Big Data Mock Interview | Data Engineering Interview | First Round of Interview
7K views · 3 months ago
Big Data Mock Interview | Data Engineering Interview
16K views · 4 months ago
AWS Data Engineering Interview
23K views · 4 months ago
Data Engineering Interview | System Design
23K views · 4 months ago
System Design round of #dataengineering interview
15K views · 4 months ago
First round of Big Data Engineering #interview
2.6K views · 4 months ago
System Design round of Data Engineering #interview at top product-based company
41K views · 5 months ago
Big Data Mock Interview | First Round
27K views · 5 months ago
Data Engineering Mock Interview at Top Product Based Companies
10K views · 6 months ago

Comments

  • @AK-zs3we · 20 hours ago

    Very informative! 🤝

  • @SillyLittleMe · 1 day ago

    Does anybody have any idea what this error means: DBFS file browser StorageContext com.databricks.backend.storage.StorageContextType$DbfsRoot$@5c512926 for workspace 3667672304132597 is not set in the CustomerStorageInfo. I am not sure what this means. For context, I had uploaded some files yesterday on DBFS for practice purposes. Those files are still available if I try to find them through notebooks, however, the DBFS tab can't show them and throws this error. Any help will be much appreciated!

  • @kolodacool · 1 day ago

    Hey Manoj, great session on data extraction via APIs. A few points I'd like to share from my experience working on this: 1) While dealing with huge volumes of data from the source, it's crucial to use pagination to iteratively collect all the data. 2) Admins who manage these end_point URLs usually discourage multiple API calls within a certain timeframe, which would cause a deadlock on your batch id. Like you suggested, either have the bulk data pulled all at once or optimize the framework.
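    A minimal pagination sketch along the lines of point 1 (the endpoint and parameter names are assumptions):

        import time
        import requests

        records, page = [], 1
        while True:
            resp = requests.get("https://api.example.com/v1/orders",
                                params={"page": page, "per_page": 500},
                                timeout=30)
            resp.raise_for_status()
            batch = resp.json()
            if not batch:              # empty page: all data collected
                break
            records.extend(batch)
            page += 1
            time.sleep(1)              # stay under the admin's rate limits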

  • @footballalchemist · 2 days ago

    Just completed this amazing project 😍 Can I add this to my portfolio?

  • @DharmajiD · 2 days ago

    I see this when trying to upload files using the DBFS file browser: "Missing credentials to access AWS bucket".

  • @shouviksharma7621 · 3 days ago

    Thanks for the content, really beneficial.

  • @AnjaliH-wo4hm · 4 days ago

    Good effort; however, please minimize the usage of words like "perfect" and "well & good" after every sentence.

  • @AkshayBaishander · 9 days ago

    Great explanation, thanks!

  • @shobhitsharma2137 · 9 days ago

    Subtitles are not fully visible.

  • @MrTejasreddy · 10 days ago

    Super, Manoj. Thanks for your time. As you said, it's a really tough use case; if possible, do it with an easier dataset next time, so that everyone can understand easily. Keep going.... 👌

    • @manojt3164 · 9 days ago

      Thanks a lot for your kind words. Sure, I'll keep that in mind.

  • @DataEngineerPratik · 11 days ago

    My doubt is: in Spark we have cache and persist, both used to save the RDD/DataFrame. As per my understanding, cache() and persist(MEMORY_AND_DISK) both perform the same action for DataFrames. If this is the case, why should I prefer cache at all, when I can always use persist (with different parameters) and ignore cache? Could you please let me know when to use cache, or whether my understanding is wrong.

    • @manojt3164 · 9 days ago

      That's a good question. With cache, data is stored in memory in deserialized form; the same is true of persist(MEMORY_ONLY). cache() is just a synonym for persist() with the default storage level, which for RDDs is StorageLevel.MEMORY_ONLY: cache the data in memory, not on disk. So both store the data in memory in deserialized format. (Note that for DataFrames the default storage level is MEMORY_AND_DISK.)

    • @DataEngineerPratik · 9 days ago

      @manojt3164 OK Manoj, thanks

  • @RaviShankarPoosaRaviKumar

    WITH tab1 AS (
        SELECT order_id,
               MAX(quantity) AS max_value,
               AVG(quantity) AS avg_quantity
        FROM orders
        GROUP BY order_id
    )
    SELECT order_id
    FROM tab1
    WHERE max_value > ALL (SELECT avg_quantity FROM tab1)

  • @shafimahmed7711 · 12 days ago

    Thank you for your time and efforts. It's not an easy job 👏👏👏

  • @akhilsingh3801 · 13 days ago

    The important content is the questions asked 😅😅

  • @mohinraffik5222 · 16 days ago

    Appreciate your great effort in sharing your knowledge, brother! 👍

  • @rationalthinker3706 · 18 days ago

    Please add the dataset.

    • @manojt7012 · 18 days ago

      drive.google.com/drive/folders/1bH-38DLQWu46m0asGyaTqyspwJiwsxvH?usp=drive_link

    • @TheBigDataShow · 18 days ago

      Kindly check in the description.

  • @Aman-lv2ee · 18 days ago

    Can you add the dataset link?

    • @manojt7012 · 18 days ago

      Thanks for the response. drive.google.com/drive/folders/1bH-38DLQWu46m0asGyaTqyspwJiwsxvH?usp=drive_link

  • @manishkumartiwari420 · 19 days ago

    Can you please help us with the dataset?

  • @rationalthinker3706 · 19 days ago

    awesome sir

    • @TheBigDataShow · 13 days ago

      Thank you for your kind words. Keep learning :)

  • @payalbhatia6927 · 19 days ago

    Which pen tab/device is used for the videos? Can you please share?

  • @AshishDukare-vr6xb · 20 days ago

    WITH cte AS (
        SELECT order_id, AVG(quantity) AS avg_quantity
        FROM t1
        GROUP BY order_id
    )
    SELECT t1.order_id
    FROM t1
    JOIN cte ON t1.order_id = cte.order_id
    WHERE t1.quantity > cte.avg_quantity;

  • @AshishDukare-vr6xb · 20 days ago

    How come the project discussion is happening in the first round? Usually they ask Python and SQL questions in the first round to check the basic foundation. Correct me if I am wrong here.

  • @AshishDukare-vr6xb · 20 days ago

    Don't you think his intro was too long, and that the interviewer had to cut him off in between to ask questions quickly?

  • @atharvagaikwad9619 · 20 days ago

    Why would you set up a session in Spark when you already get one?

    • @manojt7012 · 18 days ago

      That's right. In a notebook the Spark session is already created, so the code could have used the existing session instead of creating the SparkSession itself.

  • @maazahmedansari4334 · 21 days ago

    I replied to my previous question but it seems not visible, so posting again. In the first pipeline I am getting: AnalysisException: Failed to merge fields 'customer_id' and 'customer_id'. Any suggestion would be appreciated. Thank you. Please find the code I am trying to follow along here: github.com/maaz-ahmed-ansari/apple-product-analysis/tree/main

    • @maazahmedansari4334 · 17 days ago

      The 2nd pipeline is working as expected. Still bashing my head against the 1st pipeline. Can someone suggest how to resolve the above error?

  • @maazahmedansari4334 · 21 days ago

    Getting this in the first pipeline: AnalysisException: Failed to merge fields 'customer_id' and 'customer_id'. Any suggestion would be appreciated. Thank you.

    • @TheBigDataShow · 21 days ago

      @maazahmedansari4334 Please share some more of your code snippets for debugging. Have you created a GitHub repo for the same?

  • @yashbhosle3582 · 22 days ago

    SELECT name,
           department_name,
           MAX(DATEDIFF(promotion_date, hire_date)) AS longest_time
    FROM employee
    JOIN department ON employee.dept_id = department.dept_id
    JOIN promotion ON employee.employee_id = promotion.employee_id
    GROUP BY name, department_name
    ORDER BY longest_time DESC;

  • @mufaddalrampurawala247

    This also increases the data size of the second dataset as we explode it, so is it still optimized, given that the data scanned will increase a lot and a lot of shuffle will be involved?

    • @nishabansal2978 · 22 days ago

      While salting can increase the data size and shuffle overhead in Spark, its benefits in mitigating data skewness and improving workload distribution often outweigh these drawbacks. The other important thing is to decide on the salting factor to choose for your workload, as that will again impact the overall distribution.

  • @TheBigDataShow · 24 days ago

    A practical demonstration will be released tomorrow. Kindly watch this video to understand the theory in depth.

  • @rationalthinker3706 · 24 days ago

    Thank you, waiting!

    • @TheBigDataShow · 24 days ago

      A practical demonstration will be released tomorrow. Kindly watch this video to understand the theory in depth.

  • @arpanmitra1994 · 25 days ago

    nums = [0, 0, 2, 3, 3, 3, 3, 5, 5]
    k = 2
    new_nums = []
    for i in nums:
        if nums.count(i) == k:
            if i not in new_nums:
                new_nums.append(i)
    print(new_nums)

  • @DataCoholic · 25 days ago

    @The Big Data Show Do you also develop RESTful web services or REST APIs as a data engineer?

    • @manojt3164 · 24 days ago

      Hello, yes, there can be cases where engineers are asked to develop solutions to expose data outside of the organization. In such cases, you might need to build REST APIs to expose your data securely. Of course, there can also be other approaches to exposing data!

    • @DataCoholic · 24 days ago

      @manojt3164 Did you work on Spring and Spring Boot?

  • @tanayvaswani-24blue · 25 days ago

    Can you do even a small project using Kafka?

    • @TheBigDataShow · 25 days ago

      Give me some time. I am already planning one, but I am currently getting less time due to my startup's initial days. Please give me some time and I will upload it. I have already uploaded some Kafka videos; please check the "Kafka for Data Engineers" playlist.

  • @adityeshchaturvedi6553

    Great explanation, Ankur!!

  • @adityeshchaturvedi6553

    Great video, Ankur. Been following your content and blogs via LinkedIn. Congrats!!

  • @dhruvingandhi1114 · 28 days ago

    Hello, I am getting an error reading the Delta table that is on default at 01:21:50: IllegalArgumentException: Path must be absolute: default.customer_delta_table_persist. Please help me through that.

  • @unknown_fact1586 · 28 days ago

    Please mention the experience of the interviewee either in the caption or in the thumbnail. It would be helpful.

  • @Ravi-oy8zl · 29 days ago

    The best playlist for anyone who wants to learn Kafka in the data engineering domain. Every video has a clear-cut explanation. Hope it will be continued. It can be a one-stop tutorial for those who want to learn Kafka.

    • @TheBigDataShow · 29 days ago

      Thank you for your kind words. I will continue this in a few days; I've been stuck with work for many days. Hope I get free time soon.