Ease With Data
India
Joined 13 Jan 2023
30 Data Skipping and Z-Ordering in Delta Lake Tables | Optimize & Data Compaction Delta Lake Tables
Video explains - What is the impact of data skipping on jobs? How does Z-ordering in Delta Lake work? How to optimize Delta Lake tables?
Chapters
00:00 - Introduction
00:31 - What is Data Skipping and Z-Ordering in Delta Lake?
03:34 - Z-Ordering for more than 1 column/Multidimensional Z-ORDER
04:38 - Delta Lake Table Optimization with Example
11:59 - Multi Column Z-Ordering in Delta Lake Table
14:43 - Impact of Partitioning with Z-Ordering
16:24 - Selective Z-Ordering with Partition filters
17:57 - Auto Compaction in Delta Lake Table
For Local PySpark Jupyter Lab setup just run the command - docker pull jupyter/pyspark-notebook
Python Basics - www.learnpython.org/
GitHub URL for code - github.com/subhamkharwal/pyspark-zero-to-hero/blob/master/25_delta_lake_optimization_and_z_ordering.ipynb
Delta Lake Optimization Documentation - docs.delta.io/latest/optimizations-oss.html#language-sql
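The Z-ordering idea covered in the video can be sketched without Spark: interleaving the bits of two column values produces a Morton (Z-order) key, and sorting rows by that key keeps rows that are close in both columns physically close together, which is what makes per-file min/max data skipping effective. This is an illustrative sketch of the concept, not the actual Delta Lake implementation:

```python
def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y into a Morton (Z-order) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits land on even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits land on odd positions
    return z

# Sorting a 4x4 grid of (x, y) points by their Morton key traces the
# familiar "Z" curve: nearby points in both dimensions stay nearby.
points = [(x, y) for x in range(4) for y in range(4)]
zsorted = sorted(points, key=lambda p: interleave_bits(*p))
```

Because files written from a Z-sorted dataset cover tight ranges of both columns, a filter on either column can skip most files using their min/max statistics.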
The series provides a step-by-step guide to learning PySpark, a popular open-source distributed computing framework that is used for big data processing.
New video every 3 days ❤️
#spark #pyspark #python #dataengineering
296 views
29 Optimize Data Scanning with Partitioning in Spark | How Partitioning data works | Optimize Jobs
462 views · 1 day ago
Video explains - What is the impact of data scanning on jobs? How does partitioning work? How to avoid unnecessary data scanning? How to optimize jobs using Partitioning? Chapters 00:00 - Introduction 00:35 - Why is avoiding unnecessary Data Scanning Important? 02:12 - Impact of Data Partitioning 03:07 - Impact of partitioning column missing from Query 03:33 - Impact of partitioning on High Ca...
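The partition-pruning idea described above can be sketched with a toy file layout (the partition values and file names here are made up for illustration): when a query filters on the partition column, only the matching directory's files need to be scanned.

```python
# Hypothetical file layout for a table partitioned by "country".
files_by_partition = {
    "country=IN": ["part-0001.parquet", "part-0002.parquet"],
    "country=US": ["part-0003.parquet"],
    "country=UK": ["part-0004.parquet", "part-0005.parquet"],
}

def files_to_scan(country=None):
    """Return the files a query must read. With a partition filter,
    only the matching directory is scanned (partition pruning)."""
    if country is None:
        # No filter: full scan of every partition directory.
        return [f for fs in files_by_partition.values() for f in fs]
    return files_by_partition.get(f"country={country}", [])
```

A query with `WHERE country = 'US'` touches one file instead of five; without the filter, every file is scanned.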
Get 100% more Interview calls from Naukri Portal | Boost your Naukri Profile |Optimize Naukri Search
1.1K views · 21 days ago
Video explains - How to optimize your Naukri Profile to get more calls or opportunities? How to boost your Naukri Profile? What are the relevant changes to get more interview calls? How to hack the Naukri Portal to get more calls? Naukri job search help? Chapters 00:00 - Introduction 01:19 - Naukri Profile Changes 02:52 - Changes for Resume 04:32 - Keywords Optimization 05:45 - Serving Notice Period 07:15...
28 Get Started with Delta Lake using Databricks | Benefits and Features of Delta Lake | Time Travel
788 views · 2 months ago
Video explains - What is Delta Lake and why is it important? What are the Key Features of Delta Lake? How Delta table manages versions? How Delta table manages Time Travel? What is Schema Evolution in Delta Lake? Chapters 00:00 - Introduction 00:22 - Key Features of Delta Lake 01:27 - Sign Up and Login into Databricks Community Edition for Free 03:22 - Get Started with Databricks basics 10:34 -...
27 Read and Write from Azure Cosmos DB using Spark | E2E Cosmos DB setup | NoSQL vs SQL Databases
414 views · 2 months ago
Video explains - How to read and write data from Azure Cosmos DB? What are NoSQL databases? Why is Azure Cosmos DB so important? How to create an Azure Cosmos DB Account? What are the differences between SQL and NoSQL databases? What are the different write strategies in Azure Cosmos DB? Why are NoSQL databases so popular? Chapters: 00:00 - Introduction 00:45 - What are NoSQL and SQL Databases and ...
17 Read and Write from Azure Cosmos DB using Spark | E2E Cosmos DB setup | NoSQL vs SQL Databases
464 views · 3 months ago
Video covers - How to read and write data from Azure Cosmos DB? What are NoSQL databases? Why is Azure Cosmos DB so important? How to create an Azure Cosmos DB Account? What are the differences between SQL and NoSQL databases? What are the different write strategies in Azure Cosmos DB? Why are NoSQL databases so popular? Chapters: 00:00 - Introduction 00:45 - What are NoSQL and SQL Databases and th...
01 What is Distributed Computing, Big Data and Hadoop? | History of Distributed File System
301 views · 3 months ago
Understand - What is Distributed Computing and how does it work? What is Big Data? What is Hadoop and what are its important components? What are Horizontal and Vertical Scaling? Chapters: 01:10 - History and Why Big Data? 03:00 - What is Big Data? 04:49 - What is Hadoop? 06:36 - What is Distributed Computing and How it Works? 06:46 - Horizontal vs Vertical scaling 11:11 - Components of Hadoop Lang...
16 Late Data Processing | Watermarks | Tumbling and Sliding Window Operations in Spark Streaming
690 views · 3 months ago
Video covers - What are Watermarks in Spark? What are Tumbling, Sliding and Session Windows in Spark Streaming? What are the different Window Operations in Spark Streaming? How to handle Late Data in Spark Streaming? What is the difference between Update and Complete modes? Chapters: 00:00 - Introduction 01:15 - Fixed Window Code Implementation 03:28 - Fixed Window with Watermark 07:39 - Late Events with Wate...
15 Tumbling, Sliding and Session Window Operations in Spark Streaming | Grouped Window Aggregations
748 views · 3 months ago
Video covers - What are Tumbling, Sliding and Session Windows in Spark Streaming? What are different Window Operations in Spark Streaming? How to handle Late Data in Spark Streaming? Chapters: 00:00 - Introduction 00:40 - Tumbling or Fixed Window 02:42 - Sliding or Overlapping Window 04:23 - Late data scenario and Importance of Watermark 06:37 - Session Window URLs: Github Code - github.com/sub...
14 Spark Streaming Event vs Processing Time | Late Arrival of Data | Stateful Processing |Watermarks
619 views · 4 months ago
Video covers - How to handle Late Arrival of Data? What is the difference between Event and Processing Time? How does Spark handle Stateful Processing? What are Watermarks in Spark Streaming? Chapters: 00:00 - Introduction 00:29 - Event Time vs Processing Time 02:34 - Stateful Processing 03:38 - How Spark handles Late Data? URLs: Github Code - github.com/subhamkharwal/spark-streaming-with-pyspa...
13 Spark Streaming Handling Errors and Exceptions | Handle Exception for data re-processing in Spark
830 views · 4 months ago
Video covers - How to handle errors in Spark Streaming? How to handle Exceptions in a Spark Streaming application? How to store error data for re-processing in Spark Streaming? How to write data to a JDBC Postgres table? Chapters: 00:00 - Introduction 00:41 - Error vs Exception in Spark Streaming 04:17 - Handling Error/Malformed data in Spark Streaming 10:02 - Handling Exception in Spark Streaming UR...
12 Spark Streaming Writing data to Multiple Sinks | foreachBatch | Writing data to JDBC(Postgres)
1.3K views · 4 months ago
Video covers - How to write data to multiple sinks in Spark Streaming? What is the issue with multiple writeStream commands? How to use foreachBatch in Spark Streaming? How to write data to Postgres/JDBC? Chapters: 00:00 - Introduction 00:46 - Issues with Using Multiple WriteStream 01:37 - foreachBatch command 04:03 - Code Implementation 04:51 - Writing data to Multiple Sink using foreachBatch ...
11 Spark Streaming Triggers - Once, Processing Time & Continuous | Tune Kafka Streaming Performance
1K views · 4 months ago
Video covers - What are the different triggers available for Spark Streaming? How are the Processing Time and Once triggers different? How can we tune Kafka jobs with Partitions? Chapters: 00:00 - Introduction 00:58 - Automating Device data for Kafka 02:37 - Trigger mode Once/AvailableNow 04:36 - Trigger mode ProcessingTime 05:49 - Tune Kafka Streaming Job 08:23 - Trigger mode Continuous URLs: Github Co...
10 Spark Streaming Read from Kafka | Real time streaming from Kafka
2.1K views · 5 months ago
Video covers - How to read streaming data from Kafka? How to read real time data from Kafka? How to use Kafka as a Source for Real time Spark Streaming? Chapters: 00:00 - Introduction 00:34 - Example Device JSON Payload 01:09 - Import Kafka JAR Libraries 03:08 - Read from Kafka Source 06:27 - Extract JSON data from column using from_json URLs: Github Code - github.com/subhamkharwal/spark-stream...
09 Apache Kafka Basics & Architecture | Kafka Tutorial | Pub Sub Architecture | Learn Kafka in 15min
1.2K views · 5 months ago
08 Spark Streaming Checkpoint Directory | Contents of Checkpoint Directory
1.3K views · 5 months ago
07 Spark Streaming Read from Files | Flatten JSON data
1.5K views · 5 months ago
06 Lambda and Kappa Architectures | Data Processing Architectures in Big Data
972 views · 5 months ago
05 Spark Streaming Output Modes, Optimization and Background
1.4K views · 5 months ago
04 Spark Streaming Read from Sockets | Convert Batch Code to Streaming Code
1.9K views · 5 months ago
03 Spark Streaming Local Environment Setup - Docker, Jupyter, PySpark and Kafka
3.1K views · 5 months ago
01 Spark Streaming with PySpark - Agenda
3.7K views · 6 months ago
26 Spark SQL, Hints, Spark Catalog and Metastore
1.4K views · 6 months ago
25 AQE aka Adaptive Query Execution in Spark
2K views · 6 months ago
24 Fix Skewness and Spillage with Salting in Spark
3K views · 6 months ago
23 Static vs Dynamic Resource Allocation in Spark
1.3K views · 6 months ago
22 Optimize Joins in Spark & Understand Bucketing for Faster joins
3.8K views · 6 months ago
21 Broadcast Variable and Accumulators in Spark
1.4K views · 7 months ago
It would have been great if each step in the cell was explained instead of the whole cell. Thanks anyway.
EaseWithData content is amazing, thank you. Please help on this one. Has anybody been able to execute this successfully? I am getting a java.lang.ClassNotFoundException: org.postgresql.Driver exception. A few questions: 1) Will the jar download automatically to the path when the Spark code cell is run? Is the jar version very old and no longer found on its site, or is this version of the Postgres driver still relevant? 2) Do we need to establish the network bridge manually one time so that the two containers, Spark and Postgres, can talk? 3) If the bridge network needs to be created, is it done in an ad-hoc inline script before running the container in the cmd screen, or does it need to be done in the docker-compose file? 4) I am not able to go further until I clear this section, since most of the stuff after this depends on the Postgres DB. Please help. Thanks
Hi Subham, @Ease With Data - I followed the installation procedure in the other playlist, but it seems it is not working as expected. Can you help me fix it? I can share more details. The basic chapters are fine with any of the code repositories, but the code below from "18 Optimize Joins" is not working, and even the Jobs UI on port 4040 is not working: spark = ( SparkSession .builder .appName("Optimizing Joins") .master("spark://f6d8b23a8515:7077") .config("spark.cores.max", 16) .config("spark.executor.cores", 4) .config("spark.executor.memory", "512M") .getOrCreate() )
Nice :)
Nice explanation 👍🏼
Love your content :) I have one small question.. At 4:10 Spill memory is of 137MB and Spill Disk is of 77.2MB. If 137MB is spilled from memory why only 77.2MB is written in disk? Shouldn't it be 137MB? Can you please clarify this?
Data written to disk is serialized, while the data in memory is in deserialized form, so the amount on disk will be less. This is a major tradeoff when you are reading data back from disk. Please make sure to share with your network if you love this content ❤️
@@easewithdata Thanks for the quick response!! Sure, will recommend my mates.
I am using your pyspark-jupyter-lab Docker file, but when creating the Spark session I am getting a Java runtime error: Java gateway process exited.
Please use the pyspark notebook using the below docker command docker pull jupyter/pyspark-notebook
How to optimize Naukri if you are doing a career transition into the data field? Civil experience: 3.5 years. Education: MCA. I am a bit confused with the resume summary also. As recruiters only consider relevant experience, it literally feels like I have wasted my career of 3.5 years. I am stuck, can you please guide?
Unfortunately it becomes very difficult to hide 3.5 years of experience. You need to prepare yourself for data engineering and apply for roles that require less or no relevant experience in data engineering.
Nice explanation thanks for the knowledge sharing 🙂 please continue this playlist it is very helpful.
Thank you so much ❤️ Please make sure to share with your network 🛜
Got the "scram authentication is not supported by this driver" error while trying to connect to postgres. This is driving me nuts.
Please make sure to use the correct driver version for the postgres you are using
@easewithdata I solved this by setting up Postgres and Jupyter all with the same compose file. Before, I was using a docker network to connect the two; that didn't work no matter what. Everything broke after I removed the network group, so I tried setting it up again.
Hello, what if I wanted to do this in VSCode instead of Jupyter Notebook with docker as shown in the video?
You can write the complete code in scripts using vs code and then trigger them using spark submit command
What is "invited to apply" in Naukri? And how to reply?
Invitations are to apply for jobs. Click on the invitation and fill up the application shared.
So in cluster deploy mode the driver program is submitted inside an executor which is present inside the cluster. Am I right?
The spark-submit command launches the driver process on the driver, not on the executors.
Thanks a lot for sharing. It would be very helpful if you add the data directory to the GitHub repo.
Some data files are too big to be uploaded in github. Most of the data is uploaded at - github.com/subhamkharwal/pyspark-zero-to-hero/tree/master/datasets
Great 👍
Thank you ❤️ Please make sure to share with your network over LinkedIn 👍
thanks for this valuable insight, expecting the same video for apache iceberg and hudi in future
Sure and Thank you ❤️ Please make sure to share with your network over LinkedIn 👍
Very nice video
Thank you ❤️ Please make sure to share with your network over LinkedIn 👍
Need to understand one thing: why are yyyy and dd not in capital letters? Is there any reason for that?
Spark follows the datetime pattern format below (it mostly resembles Unix formats): spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
Thanks for your lessons! You covered all the gaps between the main concepts.
Thank you so much 😊 Please make sure to share with your network over LinkedIn ❤️
Thanks for such a detailed work. You're a Hero.
Thank you so much 😊 Please make sure to share with your network over LinkedIn ❤️
Sir, can you make a video on setting up Spark and Hadoop with Docker?
For a Spark setup, Hadoop is not mandatory; you can have a Spark standalone setup using Docker. But if you still want the same, you can clone and follow the steps from the GitHub repo below: github.com/Marcel-Jan/docker-hadoop-spark
Can a fresher become data engineer?
Yes absolutely, please start with SQL, at least one programming language, and Spark.
please make a video on creating resume for senior data engineers and please share the template thanks
Will definitely try. Thanks.
Thanks, pls share the senior data engineer resume template.. will help
Sure will try to share the same.
Where is Spark's standalone cluster, on Docker or somewhere else? Please tell me why my cluster execution code is not running.
The standalone cluster used in this tutorial is on Docker. You can set it up yourself. For the notebook - hub.docker.com/r/jupyter/pyspark-notebook You can use the below Docker file to set up the cluster: github.com/subhamkharwal/docker-images/tree/master/spark-cluster-new
Thanks bro :)
Thanks, Please make sure to share with your network over LinkedIn ❤️
how to contact you sir
You can connect with me over topmate topmate.io/subham_khandelwal/
@easewithdata Hi sir, will you be available today please, since I don't have time until the weekend?
great work
Thank you for your feedback 💓 Please make sure to share it with your network over LinkedIn 👍
thank you for your efforts
Thank you for your feedback 💓 Please make sure to share it with your network over LinkedIn 👍
thanks for your efforts it helps lot
Thanks ❤️ Please make sure to share with your network over LinkedIn 🛜
Can accumulator variables be used to calculate avg as well? When we are calculating the sum it can be done per executor, but average won't work in the same way.
Hello Sushant, to calculate the avg, the simplest approach is to use two accumulator variables, one for the sum and another for the count. Later you can divide the sum by the count to get the avg. If you like the content, please make sure to share it with your network 🛜
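The two-accumulator approach from the reply above can be sketched without a cluster. The names sum_acc and count_acc are illustrative; in real PySpark you would create them with spark.sparkContext.accumulator(0) and call .add() inside the task function:

```python
# Simulate two Spark accumulators: a running sum and a record count.
sum_acc = 0
count_acc = 0

def process_record(value: int) -> None:
    """Each task adds to both accumulators (like acc.add(value) / acc.add(1))."""
    global sum_acc, count_acc
    sum_acc += value
    count_acc += 1

# Records that would normally be processed across executors.
for v in [10, 20, 30, 40]:
    process_record(v)

# The division happens once, on the driver, after all tasks finish.
average = sum_acc / count_acc
```

The key point matching the reply: each accumulator only needs an associative add, and the non-associative division is deferred to the driver.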
Thank you. In real time do we use cluster mode or client mode, which you are using now?
I am using the client mode
localhost:4040 is not working when I set .master("spark://e75727ddf432:7077"). How can this be solved?
Why did you make 32 shuffle partitions if you have 8 cores? If one partition is going to be processed on a single core, from where will it get the remaining 24 cores?
The 8 cores will process all 32 partitions in 4 waves of 8 tasks each (8 × 4 = 32).
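The arithmetic in the reply can be made explicit: with fewer cores than partitions, tasks simply run in successive waves (the numbers below are the ones from the question):

```python
import math

total_cores = 8           # cores available in the cluster
shuffle_partitions = 32   # one task per shuffle partition in the stage

# 8 tasks run in parallel at a time, so 32 tasks need ceil(32 / 8) waves.
waves = math.ceil(shuffle_partitions / total_cores)
```

This is also why setting shuffle partitions to a small multiple of the core count is a common tuning heuristic: full waves keep every core busy.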
When I refresh my Spark UI it gives an error. I set .master("spark://6b16b66805db:7077") and localhost:4040 is also not working. How can this be solved, please?
Thanks for the Explanation, Very useful
Glad it was helpful! Please make sure to share with your network over LinkedIn ❤️
From the video: select, where, group by etc. are transformations. We have narrow transformations and wide transformations. Wide transformations are those where data has to move or interact with data from other partitions in the next stages.
Again from the video itself: executors are JVM processes, and 1 core can do 1 task at a time. In the picture above we have 6 cores, so 6 tasks were possible.
Shuffle is the boundary which divides job into stages
Great explanation. Liked the illustration that 2 counts happened, and the fact that after the local count and before the global count, some shuffling happened.
Thanks 👍 Please make sure to share with your Network over LinkedIn ❤️
Awesome. Super explanation . I love it
Thanks, Please share with your Network over LinkedIn ❤️
Why are we not using withColumn instead of expr?
Just to show all possible options. withColumn is also used in later videos.
You are awesome bro.. Thanks a lot
Glad to hear that ☺️ Please make sure to share with your network over LinkedIn ❤️
What is the use case of slide duration in streaming app... any real world example ?
It's basically used for cumulative aggregations.
You are awesome, man. I was trying to set up Spark and Kafka in Docker for a long time; done today. Thank you very much.
Glad I could help. Please make sure to share with your network over LinkedIn ❤️
Hi, I have a doubt: how can we check from the Spark UI whether a stream has multiple sinks?
Allow me sometime to search the exact screenshot for you.
what if both the tables are very small like one is 5 MB and other is 9 MB then which df is broadcasted across executor?
In that case it doesn't matter; however, AQE always prefers to broadcast the smaller table.
@easewithdata Thanks. I have been following you for more than a month and it has been a great learning experience. We want you to make an end-to-end project in PySpark.
One of the best in-depth explanations, thanks :) Could you please make a video on an "end to end data engineering" project, from requirement gathering to deployment?
Thanks ❤️ Please make sure to share with your network on LinkedIn 🛜
Hello, and thanks for sharing these useful videos. How do you handle writing to Delta tables? The best practice is that the size of each parquet file should be between 128 MB and 1 GB. How do you handle this when each batch is much smaller than the mentioned size? Or how do you collect a number of batches until reaching the mentioned size and finally write to the Delta lake?
Usually micro-batch execution in Spark writes multiple small files. This requires a later stage to read all those files and write a compacted file (say, one for each day) of bigger size to avoid the small-file issue. You can use this compacted file to read data in your downstream systems.
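The compaction step described in the reply can be sketched as simple greedy bin-packing of small micro-batch files into outputs of roughly 128 MB each. The sizes and target below are illustrative; in Delta Lake the OPTIMIZE command performs compaction for you:

```python
TARGET_MB = 128  # desired approximate output file size

def plan_compaction(file_sizes_mb):
    """Greedily group small input files into bins of at most TARGET_MB each."""
    bins, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Start a new output file when adding this input would overshoot.
        if current and current_size + size > TARGET_MB:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# Hypothetical micro-batch output sizes in MB accumulated over a day.
plan = plan_compaction([10, 40, 90, 30, 60, 50, 20])
```

Each bin becomes one rewritten file, so seven tiny files collapse into a few near-target-size files that downstream readers scan efficiently.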