Ease With Data
30 Data Skipping and Z-Ordering in Delta Lake Tables | Optimize & Data Compaction Delta Lake Tables
Video explains - What is the impact of data skipping on jobs? How does Z-Ordering in Delta Lake work? How do you optimize Delta Lake tables?
Chapters
00:00 - Introduction
00:31 - What is Data Skipping and Z-Ordering in Delta Lake?
03:34 - Z-Ordering for more than 1 column/Multidimensional Z-ORDER
04:38 - Delta Lake Table Optimization with Example
11:59 - Multi Column Z-Ordering in Delta Lake Table
14:43 - Impact of Partitioning with Z-Ordering
16:24 - Selective Z-Ordering with Partition filters
17:57 - Auto Compaction in Delta Lake Table
For a local PySpark Jupyter Lab setup, just run the command - docker pull jupyter/pyspark-notebook
Python Basics - www.learnpython.org/
GitHub URL for code - github.com/subhamkharwal/pyspark-zero-to-hero/blob/master/25_delta_lake_optimization_and_z_ordering.ipynb
Delta Lake Optimization Documentation - docs.delta.io/latest/optimizations-oss.html#language-sql
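As a quick illustration of the commands the chapters walk through, here is a minimal, hedged sketch of OPTIMIZE with Z-Ordering. It assumes the delta-spark pip package; the table name sales, the column names, and the partition filter are hypothetical (region must be a partition column for the WHERE clause).

    import pyspark
    from delta import configure_spark_with_delta_pip
    from delta.tables import DeltaTable

    # Spark session configured for Delta Lake (as in the notebook setup)
    builder = (pyspark.sql.SparkSession.builder.appName("zorder-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Compact small files and co-locate rows on the column most queries filter by
    spark.sql("OPTIMIZE sales ZORDER BY (customer_id)")

    # Multidimensional Z-Order (more than one column), restricted to one
    # partition with a filter, as in the "Selective Z-Ordering" chapter
    spark.sql("OPTIMIZE sales WHERE region = 'EU' ZORDER BY (customer_id, order_date)")

    # Equivalent Python API
    DeltaTable.forName(spark, "sales").optimize().executeZOrderBy("customer_id")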
The series provides a step-by-step guide to learning PySpark, a popular open-source distributed computing framework that is used for big data processing.
New video every 3 days ❤️
#spark #pyspark #python #dataengineering
296 views


29 Optimize Data Scanning with Partitioning in Spark | How Partitioning data works | Optimize Jobs
462 views • 1 day ago
Video explains - What is the impact of data scanning on jobs? How does partitioning work? How to avoid unnecessary data scanning? How to optimize jobs using partitioning? Chapters 00:00 - Introduction 00:35 - Why avoiding unnecessary Data Scanning is Important 02:12 - Impact of Data Partitioning 03:07 - Impact of partitioning column missing from Query 03:33 - Impact of partitioning on High Ca...
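To make the data-scanning point concrete, a small, hedged sketch of partition pruning; it assumes an active spark session, and the path and column names are made up.

    from pyspark.sql import functions as F

    df = spark.range(1_000_000).withColumn("year", (F.col("id") % 5 + 2019).cast("int"))

    # Write partitioned by the column most queries filter on
    df.write.mode("overwrite").partitionBy("year").parquet("/tmp/orders")

    # A filter on the partition column scans only /tmp/orders/year=2021/
    spark.read.parquet("/tmp/orders").filter("year = 2021").explain()
    # Look for PartitionFilters in the plan: only one partition is read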
Get 100% more Interview calls from Naukri Portal | Boost your Naukri Profile | Optimize Naukri Search
1.1K views • 21 days ago
Video explains - How to optimize your Naukri profile to get more calls or opportunities? How to boost your Naukri profile? What are the relevant changes to get more interview calls? How to hack the Naukri portal to get more calls? Naukri job search help? Chapters 00:00 - Introduction 01:19 - Naukri Profile Changes 02:52 - Changes for Resume 04:32 - Keywords Optimization 05:45 - Serving Notice Period 07:15...
28 Get Started with Delta Lake using Databricks | Benefits and Features of Delta Lake | Time Travel
788 views • 2 months ago
Video explains - What is Delta Lake and why is it important? What are the key features of Delta Lake? How does a Delta table manage versions? How does a Delta table manage Time Travel? What is Schema Evolution in Delta Lake? Chapters 00:00 - Introduction 00:22 - Key Features of Delta Lake 01:27 - Sign Up and Login into Databricks Community Edition for Free 03:22 - Get Started with Databricks basics 10:34 -...
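A minimal, hedged sketch of the versioning and time-travel behaviour described above; it assumes a Delta-configured spark session, and the path is hypothetical.

    path = "/tmp/demo_delta"
    spark.range(5).write.format("delta").mode("overwrite").save(path)    # version 0
    spark.range(5, 10).write.format("delta").mode("append").save(path)   # version 1

    # Time travel: read an older snapshot by version (or use "timestampAsOf")
    spark.read.format("delta").option("versionAsOf", 0).load(path).show()

    # Inspect the table's version history
    spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show(truncate=False)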
27 Read and Write from Azure Cosmos DB using Spark | E2E Cosmos DB setup | NoSQL vs SQL Databases
414 views • 2 months ago
Video explains - How to read and write data from Azure Cosmos DB? What are NoSQL databases? Why is Azure Cosmos DB so important? How to create an Azure Cosmos DB account? What are the differences between SQL and NoSQL databases? What are the different write strategies in Azure Cosmos DB? Why are NoSQL databases so popular? Chapters: 00:00 - Introduction 00:45 - What is NoSQL and SQL Databases and ...
17 Read and Write from Azure Cosmos DB using Spark | E2E Cosmos DB setup | NoSQL vs SQL Databases
464 views • 3 months ago
Video covers - How to read and write data from Azure Cosmos DB? What are NoSQL databases? Why is Azure Cosmos DB so important? How to create an Azure Cosmos DB account? What are the differences between SQL and NoSQL databases? What are the different write strategies in Azure Cosmos DB? Why are NoSQL databases so popular? Chapters: 00:00 - Introduction 00:45 - What is NoSQL and SQL Databases and th...
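For orientation, a heavily hedged sketch of the read/write pattern with the Azure Cosmos DB Spark connector (azure-cosmos-spark); the endpoint, key, database, and container values are placeholders, and the option names should be checked against the connector version you use.

    cfg = {
        "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",
        "spark.cosmos.accountKey": "<key>",
        "spark.cosmos.database": "retail",
        "spark.cosmos.container": "orders",
    }

    # Read the container into a DataFrame
    df = spark.read.format("cosmos.oltp").options(**cfg).load()

    # Write back; ItemOverwrite is one of the connector's write strategies
    (df.write.format("cosmos.oltp").options(**cfg)
       .option("spark.cosmos.write.strategy", "ItemOverwrite")
       .mode("append").save())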
01 What is Distributed Computing, Big Data and Hadoop? | History of Distributed File System
301 views • 3 months ago
Understand - What is Distributed Computing and how does it work? What is Big Data? What is Hadoop and what are its important components? What is Horizontal and Vertical Scaling? Chapters: 01:10 - History and Why Big Data? 03:00 - What is Big Data? 04:49 - What is Hadoop? 06:36 - What is Distributed Computing and How it Works? 06:46 - Horizontal vs Vertical scaling 11:11 - Components of Hadoop Lang...
16 Late Data Processing | Watermarks | Tumbling and Sliding Window Operations in Spark Streaming
690 views • 3 months ago
Video covers - What are Watermarks in Spark? What are Tumbling, Sliding and Session Windows in Spark Streaming? What are the different Window operations in Spark Streaming? How to handle late data in Spark Streaming? Difference between Update and Complete modes? Chapters: 00:00 - Introduction 01:15 - Fixed Window Code Implementation 03:28 - Fixed Window with Watermark 07:39 - Late Events with Wate...
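A small, hedged sketch of the fixed-window-with-watermark pattern from the chapters; it assumes a streaming DataFrame events with an event_time timestamp column (both hypothetical).

    from pyspark.sql import functions as F

    windowed = (events
        .withWatermark("event_time", "10 minutes")       # tolerate up to 10 min of lateness
        .groupBy(F.window("event_time", "5 minutes"))    # tumbling (fixed) 5-minute windows
        .count())

    query = (windowed.writeStream
        .outputMode("update")     # emit only the windows changed in each micro-batch
        .format("console")
        .option("checkpointLocation", "/tmp/ckpt_window")
        .start())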
15 Tumbling, Sliding and Session Window Operations in Spark Streaming | Grouped Window Aggregations
748 views • 3 months ago
Video covers - What are Tumbling, Sliding and Session Windows in Spark Streaming? What are the different Window operations in Spark Streaming? How to handle late data in Spark Streaming? Chapters: 00:00 - Introduction 00:40 - Tumbling or Fixed Window 02:42 - Sliding or Overlapping Window 04:23 - Late data scenario and Importance of Watermark 06:37 - Session Window URLs: Github Code - github.com/sub...
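The sliding and session variants differ only in the grouping expression; a hedged sketch, again assuming a streaming DataFrame events with event_time and device_id columns (session_window requires PySpark 3.2+).

    from pyspark.sql import functions as F

    # Sliding (overlapping): 10-minute windows that advance every 5 minutes
    sliding = events.groupBy(F.window("event_time", "10 minutes", "5 minutes")).count()

    # Session: a window per key that closes after a 5-minute gap of inactivity
    sessions = events.groupBy(
        F.session_window("event_time", "5 minutes"), "device_id").count()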
14 Spark Streaming Event vs Processing Time | Late Arrival of Data | Stateful Processing |Watermarks
619 views • 4 months ago
Video covers - How to handle late arrival of data? What is the difference between Event and Processing Time? How does Spark handle Stateful Processing? What are Watermarks in Spark Streaming? Chapters: 00:00 - Introduction 00:29 - Event Time vs Processing Time 02:34 - Stateful Processing 03:38 - How Spark handles Late Data? URLs: Github Code - github.com/subhamkharwal/spark-streaming-with-pyspa...
13 Spark Streaming Handling Errors and Exceptions | Handle Exception for data re-processing in Spark
830 views • 4 months ago
Video covers - How to handle errors in Spark Streaming? How to handle exceptions in a Spark Streaming application? How to store error data for re-processing in Spark Streaming? How to write data to a JDBC Postgres table? Chapters: 00:00 - Introduction 00:41 - Error vs Exception in Spark Streaming 04:17 - Handling Error/Malformed data in Spark Streaming 10:02 - Handling Exception in Spark Streaming UR...
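One common way to park malformed records for re-processing, sketched under assumptions: a Kafka-style stream stream_df whose value column holds JSON, and a made-up schema.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([
        StructField("device_id", StringType()),
        StructField("temp", StringType()),
    ])

    parsed = stream_df.withColumn(
        "json", F.from_json(F.col("value").cast("string"), schema))

    good = parsed.filter("json IS NOT NULL").select("json.*")
    # from_json yields NULL for unparsable payloads, so bad rows can be
    # diverted to an error sink and replayed after a fix
    errors = parsed.filter("json IS NULL").select(
        F.col("value").cast("string").alias("raw_payload"))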
12 Spark Streaming Writing data to Multiple Sinks | foreachBatch | Writing data to JDBC(Postgres)
1.3K views • 4 months ago
Video covers - How to write data to multiple sinks in Spark Streaming? What is the issue with multiple writeStream commands? How to use foreachBatch in Spark Streaming? How to write data to Postgres/JDBC? Chapters: 00:00 - Introduction 00:46 - Issues with Using Multiple WriteStream 01:37 - foreachBatch command 04:03 - Code Implementation 04:51 - Writing data to Multiple Sink using foreachBatch ...
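A hedged sketch of the foreachBatch fan-out pattern; events is a hypothetical streaming DataFrame, the JDBC URL, credentials, table, and path are placeholders, and the Postgres JDBC driver JAR is assumed to be on the classpath.

    def write_to_sinks(batch_df, batch_id):
        batch_df.persist()   # avoid recomputing the micro-batch once per sink
        batch_df.write.mode("append").parquet("/tmp/events_parquet")
        (batch_df.write.format("jdbc")
            .option("url", "jdbc:postgresql://postgres:5432/demo")
            .option("dbtable", "events")
            .option("user", "postgres")
            .option("password", "postgres")
            .option("driver", "org.postgresql.Driver")
            .mode("append").save())
        batch_df.unpersist()

    query = (events.writeStream
        .foreachBatch(write_to_sinks)
        .option("checkpointLocation", "/tmp/ckpt_sinks")
        .start())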
11 Spark Streaming Triggers - Once, Processing Time & Continuous | Tune Kafka Streaming Performance
1K views • 4 months ago
Video covers - What are the different triggers available for Spark Streaming? How are the Processing Time and Once triggers different? How can we tune Kafka jobs with partitions? Chapters: 00:00 - Introduction 00:58 - Automating Device data for Kafka 02:37 - Trigger mode Once/AvailableNow 04:36 - Trigger mode ProcessingTime 05:49 - Tune Kafka Streaming Job 08:23 - Trigger mode Continuous URLs: Github Co...
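The three trigger modes from the chapters map to one-line differences on the writer; a sketch assuming a streaming DataFrame df (pick one trigger per query).

    # Drain all available data, then stop (replaces the older once=True)
    q1 = df.writeStream.format("console").trigger(availableNow=True).start()

    # Micro-batch on a fixed cadence
    q2 = df.writeStream.format("console").trigger(processingTime="10 seconds").start()

    # Continuous processing (experimental; only some sources/sinks, e.g. Kafka)
    q3 = df.writeStream.format("console").trigger(continuous="1 second").start()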
10 Spark Streaming Read from Kafka | Real time streaming from Kafka
2.1K views • 5 months ago
Video covers - How to read streaming data from Kafka? How to read real-time data from Kafka? How to use Kafka as a source for real-time Spark Streaming? Chapters: 00:00 - Introduction 00:34 - Example Device JSON Payload 01:09 - Import Kafka JAR Libraries 03:08 - Read from Kafka Source 06:27 - Extract JSON data from column using from_json URLs: Github Code - github.com/subhamkharwal/spark-stream...
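The core of the Kafka-source chapter fits in a few lines; a hedged sketch that assumes the spark-sql-kafka package on the classpath, a broker at kafka:9092, and a made-up topic and payload schema.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    schema = StructType([
        StructField("device_id", StringType()),
        StructField("temperature", DoubleType()),
    ])

    raw = (spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "devices")
        .option("startingOffsets", "earliest")
        .load())

    # Kafka delivers key/value as binary; cast, then extract fields with from_json
    devices = (raw
        .select(F.from_json(F.col("value").cast("string"), schema).alias("d"))
        .select("d.*"))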
09 Apache Kafka Basics & Architecture | Kafka Tutorial | Pub Sub Architecture | Learn Kafka in 15min
1.2K views • 5 months ago
08 Spark Streaming Checkpoint Directory | Contents of Checkpoint Directory
1.3K views • 5 months ago
07 Spark Streaming Read from Files | Flatten JSON data
1.5K views • 5 months ago
06 Lambda and Kappa Architectures | Data Processing Architectures in Big Data
972 views • 5 months ago
05 Spark Streaming Output Modes, Optimization and Background
1.4K views • 5 months ago
04 Spark Streaming Read from Sockets | Convert Batch Code to Streaming Code
1.9K views • 5 months ago
03 Spark Streaming Local Environment Setup - Docker, Jupyter, PySpark and Kafka
3.1K views • 5 months ago
02 How Spark Streaming Works
2.1K views • 5 months ago
01 Spark Streaming with PySpark - Agenda
3.7K views • 6 months ago
26 Spark SQL, Hints, Spark Catalog and Metastore
1.4K views • 6 months ago
25 AQE aka Adaptive Query Execution in Spark
2K views • 6 months ago
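The AQE knobs behind this title are plain session configs; a sketch on an active spark session (AQE is on by default from Spark 3.2).

    spark.conf.set("spark.sql.adaptive.enabled", "true")
    # Merge tiny shuffle partitions after a stage finishes
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split oversized partitions when joining skewed data
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")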
24 Fix Skewness and Spillage with Salting in Spark
3K views • 6 months ago
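The salting idea from this title, sketched under assumptions: facts is the skewed side, dims the smaller side, and join_key the hot key; all names are hypothetical.

    from pyspark.sql import functions as F

    SALT = 16  # number of salt buckets

    # Random salt on the skewed side spreads one hot key over 16 partitions
    facts_salted = facts.withColumn("salt", (F.rand() * SALT).cast("int"))

    # Replicate the other side once per salt value so every pair still matches
    dims_salted = dims.withColumn(
        "salt", F.explode(F.array([F.lit(i) for i in range(SALT)])))

    joined = facts_salted.join(dims_salted, ["join_key", "salt"]).drop("salt")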
23 Static vs Dynamic Resource Allocation in Spark
1.3K views • 6 months ago
22 Optimize Joins in Spark & Understand Bucketing for Faster joins
3.8K views • 6 months ago
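Bucketing, as the title suggests, pre-shuffles both tables into a matching layout so the join itself needs no exchange; a hedged sketch assuming a metastore-backed spark session and hypothetical DataFrames orders and customers.

    (orders.write.bucketBy(16, "customer_id").sortBy("customer_id")
        .mode("overwrite").saveAsTable("orders_bucketed"))
    (customers.write.bucketBy(16, "customer_id").sortBy("customer_id")
        .mode("overwrite").saveAsTable("customers_bucketed"))

    # With matching bucket counts, the plan shows no Exchange before the join
    spark.table("orders_bucketed").join(
        spark.table("customers_bucketed"), "customer_id").explain()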
21 Broadcast Variable and Accumulators in Spark
1.4K views • 7 months ago
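Both primitives from this title in one hedged sketch, on an active spark session; the lookup data is made up.

    sc = spark.sparkContext

    lookup = sc.broadcast({"IN": "India", "US": "United States"})  # read-only copy per executor
    misses = sc.accumulator(0)                                     # counter written by tasks

    def country(code):
        if code not in lookup.value:
            misses.add(1)
        return lookup.value.get(code, "Unknown")

    names = sc.parallelize(["IN", "US", "XX"]).map(country).collect()
    print(names, misses.value)   # ['India', 'United States', 'Unknown'] 1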
20 Data Caching in Spark
1.5K views • 7 months ago

Comments

  • @user-dj4ht7rg2f
    @user-dj4ht7rg2f 1 day ago

    It would have been great if each step in the cell was explained instead of the whole cell. Thanks anyway.

  • @MuzicForSoul
    @MuzicForSoul 3 days ago

    EaseWithData content is amazing, thank you. Please help on this one: has anybody been able to execute this successfully? I am getting a java.lang.ClassNotFoundException: org.postgresql.Driver exception. A few questions: 1) Will the JAR download automatically to the path when the Spark code cell is run? Is the JAR version very old and no longer found on its site, or is this version of Postgres still relevant? 2) Do we need to establish the network bridge manually one time so that the two containers, Spark and Postgres, can talk? 3) If the bridge network needs to be created, is it done in an ad hoc inline script before running the container in the cmd screen, or does it need to be done in the docker compose file? 4) I am not able to go further until I clear this section, since most of the stuff after this depends on the Postgres DB. Please help. Thanks

  • @Bijuthtt
    @Bijuthtt 4 days ago

    Hi Subham, @Ease With Data - I followed the installation procedures in the other playlist, but it seems it is not working as expected. Can you help me fix it? I can share more details. The basic chapters are fine with any of the code repositories, but the code from "18 Optimize join" with the line below is not working; even the jobs UI on port 4040 is not working: spark = (SparkSession.builder.appName("Optimizing Joins").master("spark://f6d8b23a8515:7077").config("spark.cores.max", 16).config("spark.executor.cores", 4).config("spark.executor.memory", "512M").getOrCreate())

  • @rakeshpanigrahi577
    @rakeshpanigrahi577 5 days ago

    Nice :)

  • @karthikinu
    @karthikinu 5 days ago

    Nice explanation 👍🏼

  • @user-dj4ht7rg2f
    @user-dj4ht7rg2f 7 days ago

    Love your content :) I have one small question. At 4:10, Spill (memory) is 137 MB and Spill (disk) is 77.2 MB. If 137 MB is spilled from memory, why is only 77.2 MB written to disk? Shouldn't it be 137 MB? Can you please clarify this?

    • @easewithdata
      @easewithdata 7 days ago

      Data written to disk is serialized, while data in memory is kept in deserialized form, so the amount on disk will be smaller. This is a major tradeoff when you read that data back from disk. Please make sure to share with your network if you love this content ❤️

    • @user-dj4ht7rg2f
      @user-dj4ht7rg2f 5 days ago

      @easewithdata Thanks for the quick response!! Sure, will recommend it to my mates.

  • @gyanaranjannayak3333

    I am using your pyspark-jupyter-lab Docker file, but when creating the Spark session I am getting a Java runtime error: Java gateway process exited.

    • @easewithdata
      @easewithdata 7 days ago

      Please use the PySpark notebook image via the following Docker command: docker pull jupyter/pyspark-notebook

  • @mihirkudale8512
    @mihirkudale8512 7 days ago

    How to optimize Naukri if you are doing a career transition into the data field? Civil experience: 3.5 years. Education: MCA. I am a bit confused with the resume summary also. As recruiters only consider relevant experience, it literally feels like I have wasted my career of 3.5 yrs. I am stuck, can you please guide?

    • @easewithdata
      @easewithdata 7 days ago

      Unfortunately it becomes very difficult to hide 3.5 years of experience. You need to prepare yourself for data engineering and apply for roles that require little or no relevant data engineering experience.

  • @mohdsaeedafri3314
    @mohdsaeedafri3314 10 days ago

    Nice explanation, thanks for the knowledge sharing 🙂 Please continue this playlist, it is very helpful.

    • @easewithdata
      @easewithdata 7 days ago

      Thank you so much ❤️ Please make sure to share with your network 🛜

  • @atonxment2868
    @atonxment2868 11 days ago

    Got the "scram authentication is not supported by this driver" error while trying to connect to Postgres. This is driving me nuts.

    • @easewithdata
      @easewithdata 11 days ago

      Please make sure to use the correct driver version for the Postgres version you are using

    • @atonxment2868
      @atonxment2868 10 days ago

      @easewithdata I solved this by setting up Postgres and Jupyter in the same compose file. Before, I was using a Docker network to connect the two, which didn't work no matter what. Everything broke after I removed the network group, so I tried setting it up again.

  • @vipulsarode2722
    @vipulsarode2722 11 days ago

    Hello, what if I wanted to do this in VS Code instead of Jupyter Notebook with Docker, as shown in the video?

    • @easewithdata
      @easewithdata 11 days ago

      You can write the complete code as scripts in VS Code and then trigger them using the spark-submit command

  • @IswaryaPydimarri
    @IswaryaPydimarri 11 days ago

    What is "invited to apply" in Naukri? And how do I reply?

    • @easewithdata
      @easewithdata 7 days ago

      Invitations are roles you are invited to apply for; click on the invitation and fill out the application shared.

  • @Kevin-nt4eb
    @Kevin-nt4eb 12 days ago

    So in deployment mode the driver program is submitted inside an executor which is present inside the cluster. Am I right?

    • @easewithdata
      @easewithdata 7 days ago

      The spark-submit command launches the driver; the driver does not run inside the executors.

  • @anveshkonda8334
    @anveshkonda8334 12 days ago

    Thanks a lot for sharing. It would be very helpful if you added the data directory to the GitHub repo

    • @easewithdata
      @easewithdata 7 days ago

      Some data files are too big to be uploaded to GitHub. Most of the data is uploaded at - github.com/subhamkharwal/pyspark-zero-to-hero/tree/master/datasets

  • @SharadSonwane-xk1ht
    @SharadSonwane-xk1ht 12 days ago

    Great 👍

    • @easewithdata
      @easewithdata 12 days ago

      Thank you ❤️ Please make sure to share with your network over LinkedIn 👍

  • @mohammadaftab7002
    @mohammadaftab7002 14 days ago

    Thanks for this valuable insight; expecting the same kind of video for Apache Iceberg and Hudi in the future

    • @easewithdata
      @easewithdata 12 days ago

      Sure and Thank you ❤️ Please make sure to share with your network over LinkedIn 👍

  • @SonuKumar-fn1gn
    @SonuKumar-fn1gn 15 days ago

    Very nice video

    • @easewithdata
      @easewithdata 12 days ago

      Thank you ❤️ Please make sure to share with your network over LinkedIn 👍

  • @irannamented9296
    @irannamented9296 17 days ago

    Need to understand one thing: why are yyyy and dd not in capital letters? Is there any reason for that?

    • @easewithdata
      @easewithdata 16 days ago

      Spark follows the datetime pattern format below (it mostly resembles Unix formats): spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
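      A quick sketch of why the case matters, on an active spark session; the sample timestamp is made up:

        from pyspark.sql import functions as F

        df = spark.sql("SELECT timestamp'2024-03-09 13:05:00' AS ts")
        df.select(
            F.date_format("ts", "yyyy-MM-dd").alias("date"),  # yyyy=year, MM=month, dd=day-of-month
            F.date_format("ts", "D").alias("day_of_year"),    # capital D is day-of-year
            F.date_format("ts", "mm").alias("minute"),        # lowercase mm is minutes, not months
        ).show()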

  • @ruantenorio4442
    @ruantenorio4442 18 days ago

    Thanks for your lessons! You covered all the gaps between the main concepts.

    • @easewithdata
      @easewithdata 16 days ago

      Thank you so much 😊 Please make sure to share with your network over LinkedIn ❤️

  • @RxLocum
    @RxLocum 18 days ago

    Thanks for such a detailed work. You're a Hero.

    • @easewithdata
      @easewithdata 16 days ago

      Thank you so much 😊 Please make sure to share with your network over LinkedIn ❤️

  • @sumitrawall
    @sumitrawall 18 days ago

    Sir, can you make a video on setting up Spark and Hadoop with Docker?

    • @easewithdata
      @easewithdata 16 days ago

      For a Spark setup, Hadoop is not mandatory; you can have a standalone Spark setup using Docker. But if you still want the same, you can clone and follow the steps from the GitHub repo below: github.com/Marcel-Jan/docker-hadoop-spark

  • @sumitrawall
    @sumitrawall 18 days ago

    Can a fresher become a data engineer?

    • @easewithdata
      @easewithdata 16 days ago

      Yes, absolutely. Please start with SQL, at least one programming language, and Spark

  • @Aman-lv2ee
    @Aman-lv2ee 18 days ago

    Please make a video on creating a resume for senior data engineers, and please share the template. Thanks

  • @Learn2Share786
    @Learn2Share786 20 days ago

    Thanks, please share the senior data engineer resume template.. it will help

  • @shivakant4698
    @shivakant4698 20 days ago

    Where does Spark's standalone cluster run, on Docker or somewhere else? Please tell me why my cluster execution code is not running.

    • @easewithdata
      @easewithdata 19 days ago

      The standalone cluster used in this tutorial is on Docker. You can set it up yourself. For the notebook - hub.docker.com/r/jupyter/pyspark-notebook You can use the Docker file below to set up the cluster: github.com/subhamkharwal/docker-images/tree/master/spark-cluster-new

  • @rakeshpanigrahi577
    @rakeshpanigrahi577 20 days ago

    Thanks bro :)

    • @easewithdata
      @easewithdata 19 days ago

      Thanks, Please make sure to share with your network over LinkedIn ❤️

  • @user-br6oe3kf9k
    @user-br6oe3kf9k 23 days ago

    How can I contact you, sir?

    • @easewithdata
      @easewithdata 21 days ago

      You can connect with me over topmate topmate.io/subham_khandelwal/

    • @user-br6oe3kf9k
      @user-br6oe3kf9k 18 days ago

      @easewithdata Hi sir, will you be available today, please? I don't have time till the weekend.

  • @akshaykadam1260
    @akshaykadam1260 23 days ago

    great work

    • @easewithdata
      @easewithdata 21 days ago

      Thank you for your feedback 💓 Please make sure to share it with your network over LinkedIn 👍

  • @ComedyXRoad
    @ComedyXRoad 23 days ago

    thank you for your efforts

    • @easewithdata
      @easewithdata 21 days ago

      Thank you for your feedback 💓 Please make sure to share it with your network over LinkedIn 👍

  • @ComedyXRoad
    @ComedyXRoad 24 days ago

    Thanks for your efforts, it helps a lot

    • @easewithdata
      @easewithdata 24 days ago

      Thanks ❤️ Please make sure to share with your network over LinkedIn 🛜

  • @sushantashow000
    @sushantashow000 24 days ago

    Can accumulator variables be used to calculate an average as well? When we calculate a sum it can be done per executor, but an average won't work the same way.

    • @easewithdata
      @easewithdata 24 days ago

      Hello Sushant, to calculate the average, the simplest approach is to use two variables, one for the sum and another for the count. Later you can divide the sum by the count to get the average. If you like the content, please make sure to share with your network 🛜
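      A minimal sketch of that sum-plus-count approach, assuming an active spark session:

        sc = spark.sparkContext
        total = sc.accumulator(0.0)
        count = sc.accumulator(0)

        def track(x):
            total.add(x)
            count.add(1)

        # foreach is an action, so the accumulators are populated here
        sc.parallelize([10, 20, 30, 40]).foreach(track)
        print(total.value / count.value)   # 25.0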

  • @ComedyXRoad
    @ComedyXRoad 26 days ago

    Thank you. In real-world work, do we use cluster mode or client mode, which you are using now?

  • @shivakant4698
    @shivakant4698 27 days ago

    localhost:4040 is not working when I set .master("spark://e75727ddf432:7077"). How can this be solved?

  • @Amarjeet-fb3lk
    @Amarjeet-fb3lk 28 days ago

    Why did you make 32 shuffle partitions if you have 8 cores? If one partition is going to be processed on a single core, where will it get the other remaining 24 cores from?

    • @easewithdata
      @easewithdata 24 days ago

      The 8 cores will process all 32 partitions in 4 waves (8 × 4 = 32).

  • @shivakant4698
    @shivakant4698 29 days ago

    When I refresh, my Spark UI gives an error. I am using "spark://6b16b66805db:7077" and "localhost:4040" is also not working; I set .master("spark://6b16b66805db:7077"). How can this be solved, please?

  • @SanthoshKumar-sl7zc
    @SanthoshKumar-sl7zc a month ago

    Thanks for the Explanation, Very useful

    • @easewithdata
      @easewithdata 24 days ago

      Glad it was helpful! Please make sure to share with your network over LinkedIn ❤️

  • @vaibhavkumar38
    @vaibhavkumar38 a month ago

    From the video: select, where, group by, etc. are transformations. We have narrow transformations and wide transformations. Wide transformations are those where data has to move or interact with data of other partitions in the next stages.

  • @vaibhavkumar38
    @vaibhavkumar38 a month ago

    Again from the video itself: executors are JVM processes, and 1 core can do 1 task at a time. In the picture above we have 6 cores, so 6 tasks were possible.

  • @vaibhavkumar38
    @vaibhavkumar38 a month ago

    Shuffle is the boundary which divides a job into stages.

  • @vaibhavkumar38
    @vaibhavkumar38 a month ago

    Great explanation.. liked the illustration that 2 counts happened, and the fact that between the local count and the global count some shuffling happened.

    • @easewithdata
      @easewithdata a month ago

      Thanks 👍 Please make sure to share with your Network over LinkedIn ❤️

  • @Bijuthtt
    @Bijuthtt a month ago

    Awesome. Super explanation. I love it.

    • @easewithdata
      @easewithdata a month ago

      Thanks, Please share with your Network over LinkedIn ❤️

  • @sovikguhabiswas8838
    @sovikguhabiswas8838 a month ago

    Why are we not using withColumn instead of expr?

    • @easewithdata
      @easewithdata a month ago

      Just to show all possible options. withColumn is also used in later videos.

  • @anveshkonda8334
    @anveshkonda8334 a month ago

    You are awesome bro.. Thanks a lot

    • @easewithdata
      @easewithdata a month ago

      Glad to hear that ☺️ Please make sure to share with your network over LinkedIn ❤️

  • @deepanshuaggarwal7042
    @deepanshuaggarwal7042 a month ago

    What is the use case of slide duration in a streaming app... any real-world example?

    • @easewithdata
      @easewithdata a month ago

      It's basically used for cumulative aggregations.

  • @Bijuthtt
    @Bijuthtt a month ago

    You are awesome, man. I was trying to set up Spark and Kafka in Docker for a long time; got it done today. Thank you very much.

    • @easewithdata
      @easewithdata a month ago

      Glad I could help. Please make sure to share with your network over LinkedIn ❤️

  • @deepanshuaggarwal7042
    @deepanshuaggarwal7042 a month ago

    Hi, I have a doubt: how can we check from the Spark UI whether a stream has multiple sinks?

    • @easewithdata
      @easewithdata a month ago

      Allow me some time to find the exact screenshot for you.

  • @DataEngineerPratik
    @DataEngineerPratik a month ago

    What if both tables are very small, like one is 5 MB and the other is 9 MB? Then which df is broadcast across the executors?

    • @easewithdata
      @easewithdata a month ago

      In that case it doesn't matter; however, AQE always prefers to broadcast the smaller table.
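      Both sizes here sit below the default spark.sql.autoBroadcastJoinThreshold of 10 MB, so either side qualifies automatically; the choice can also be forced with a hint. A sketch with hypothetical DataFrames big_df and small_df:

        from pyspark.sql.functions import broadcast

        # small_df is shipped to every executor; big_df's side is not shuffled
        joined = big_df.join(broadcast(small_df), "id")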

    • @DataEngineerPratik
      @DataEngineerPratik a month ago

      @easewithdata Thanks! I have been following you for more than a month and it has been a great learning experience. We want you to make an end-to-end project in PySpark.

  • @nishantsoni9330
    @nishantsoni9330 a month ago

    One of the best in-depth explanations, thanks :) Could you please make a video on an "end to end Data Engineering" project, from requirement gathering to deployment?

    • @easewithdata
      @easewithdata a month ago

      Thanks ❤️ Please make sure to share with your network on LinkedIn 🛜

  • @hamedtamadon6520
    @hamedtamadon6520 a month ago

    Hello, and thanks for sharing these useful videos. How do you handle writing to Delta tables, given the best practice that each Parquet file should be between 128 MB and 1 GB? How do you handle the situation where each batch is far smaller than that size? Or how do you collect a number of batches until the mentioned size is reached and finally write to the Delta lake?

    • @easewithdata
      @easewithdata a month ago

      Usually micro-batch execution in Spark writes multiple small files. This requires a later stage that reads all those files and writes a compacted file of bigger size (say, one per day) to avoid the small-file problem. You can point your downstream systems at the compacted output.
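      On a Delta table, that later compaction stage can be expressed directly; a hedged sketch assuming Delta Lake 2.x+ and a hypothetical path:

        from delta.tables import DeltaTable

        tbl = DeltaTable.forPath(spark, "/lake/events")
        tbl.optimize().executeCompaction()   # rewrite many small files into large ones
        tbl.vacuum(168)                      # afterwards, drop old unreferenced files (hours)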