Ease With Data
India
Joined 13 Jan 2023
30 Data Skipping and Z-Ordering in Delta Lake Tables | Optimize & Data Compaction Delta Lake Tables
Video explains - What is the impact of data skipping on jobs? How does Z-ordering in Delta Lake work? How to optimize Delta Lake tables?
Chapters
00:00 - Introduction
00:31 - What is Data Skipping and Z-Ordering in Delta Lake?
03:34 - Z-Ordering for more than 1 column/Multidimensional Z-ORDER
04:38 - Delta Lake Table Optimization with Example
11:59 - Multi Column Z-Ordering in Delta Lake Table
14:43 - Impact of Partitioning with Z-Ordering
16:24 - Selective Z-Ordering with Partition filters
17:57 - Auto Compaction in Delta Lake Table
For Local PySpark Jupyter Lab setup just run the command - docker pull jupyter/pyspark-notebook
Python Basics - www.learnpython.org/
GitHub URL for code - github.com/subhamkharwal/pyspark-zero-to-hero/blob/master/25_delta_lake_optimization_and_z_ordering.ipynb
Delta Lake Optimization Documentation - docs.delta.io/latest/optimizations-oss.html#language-sql
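The Z-ordering idea covered in the video can be sketched without Spark: interleaving the bits of two column values produces a Morton (Z-order) key, and sorting rows by that key keeps rows that are close in both columns physically close together, which is what makes per-file min/max data skipping effective. This is an illustrative sketch of the concept, not the actual Delta Lake implementation:

```python
def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y into a Morton (Z-order) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits land on even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits land on odd positions
    return z

# Sorting a 4x4 grid of (x, y) points by their Morton key traces the
# familiar "Z" curve: nearby points in both dimensions stay nearby.
points = [(x, y) for x in range(4) for y in range(4)]
zsorted = sorted(points, key=lambda p: interleave_bits(*p))
```

Because files written from a Z-sorted dataset cover tight ranges of both columns, a filter on either column can skip most files using their min/max statistics.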
The series provides a step-by-step guide to learning PySpark, a popular open-source distributed computing framework that is used for big data processing.
New video every 3 days ❤️
#spark #pyspark #python #dataengineering
296 views
29 Optimize Data Scanning with Partitioning in Spark | How Partitioning data works | Optimize Jobs
462 views · 1 day ago
Video explains - What is the impact of data scanning on jobs? How does partitioning work? How to avoid unnecessary data scanning? How to optimize jobs using Partitioning? Chapters 00:00 - Introduction 00:35 - Why is avoiding unnecessary Data Scanning Important? 02:12 - Impact of Data Partitioning 03:07 - Impact of partitioning column missing from Query 03:33 - Impact of partitioning on High Ca...
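The partition-pruning idea described above can be sketched with a toy file layout (the partition values and file names here are made up for illustration): when a query filters on the partition column, only the matching directory's files need to be scanned.

```python
# Hypothetical file layout for a table partitioned by "country".
files_by_partition = {
    "country=IN": ["part-0001.parquet", "part-0002.parquet"],
    "country=US": ["part-0003.parquet"],
    "country=UK": ["part-0004.parquet", "part-0005.parquet"],
}

def files_to_scan(country=None):
    """Return the files a query must read. With a partition filter,
    only the matching directory is scanned (partition pruning)."""
    if country is None:
        # No filter: full scan of every partition directory.
        return [f for fs in files_by_partition.values() for f in fs]
    return files_by_partition.get(f"country={country}", [])
```

A query with `WHERE country = 'US'` touches one file instead of five; without the filter, every file is scanned.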
Get 100% more Interview calls from Naukri Portal | Boost your Naukri Profile |Optimize Naukri Search
1.1K views · 21 days ago
Video explains - How to optimize your Naukri Profile to get more calls or opportunities? How to boost your Naukri Profile? What are the relevant changes to get more interview calls? How to hack the Naukri Portal to get more calls? Naukri job search help? Chapters 00:00 - Introduction 01:19 - Naukri Profile Changes 02:52 - Changes for Resume 04:32 - Keywords Optimization 05:45 - Serving Notice Period 07:15...
28 Get Started with Delta Lake using Databricks | Benefits and Features of Delta Lake | Time Travel
788 views · 2 months ago
Video explains - What is Delta Lake and why is it important? What are the Key Features of Delta Lake? How Delta table manages versions? How Delta table manages Time Travel? What is Schema Evolution in Delta Lake? Chapters 00:00 - Introduction 00:22 - Key Features of Delta Lake 01:27 - Sign Up and Login into Databricks Community Edition for Free 03:22 - Get Started with Databricks basics 10:34 -...
27 Read and Write from Azure Cosmos DB using Spark | E2E Cosmos DB setup | NoSQL vs SQL Databases
414 views · 2 months ago
Video explains - How to read and write data from Azure Cosmos DB? What are NoSQL databases? Why is Azure Cosmos DB so important? How to create an Azure Cosmos DB Account? What are the differences between SQL and NoSQL databases? What are the different write strategies in Azure Cosmos DB? Why are NoSQL databases so popular? Chapters: 00:00 - Introduction 00:45 - What are NoSQL and SQL Databases and ...
17 Read and Write from Azure Cosmos DB using Spark | E2E Cosmos DB setup | NoSQL vs SQL Databases
464 views · 3 months ago
Video covers - How to read and write data from Azure Cosmos DB? What are NoSQL databases? Why is Azure Cosmos DB so important? How to create an Azure Cosmos DB Account? What are the differences between SQL and NoSQL databases? What are the different write strategies in Azure Cosmos DB? Why are NoSQL databases so popular? Chapters: 00:00 - Introduction 00:45 - What are NoSQL and SQL Databases and th...
01 What is Distributed Computing, Big Data and Hadoop? | History of Distributed File System
301 views · 3 months ago
Understand - What is Distributed Computing and how does it work? What is Big Data? What is Hadoop and what are its important components? What are Horizontal and Vertical Scaling? Chapters: 01:10 - History and Why Big Data? 03:00 - What is Big Data? 04:49 - What is Hadoop? 06:36 - What is Distributed Computing and How it Works? 06:46 - Horizontal vs Vertical scaling 11:11 - Components of Hadoop Lang...
16 Late Data Processing | Watermarks | Tumbling and Sliding Window Operations in Spark Streaming
690 views · 3 months ago
Video covers - What are Watermarks in Spark? What are Tumbling, Sliding and Session Windows in Spark Streaming? What are the different Window Operations in Spark Streaming? How to handle Late Data in Spark Streaming? What is the difference between Update and Complete modes? Chapters: 00:00 - Introduction 01:15 - Fixed Window Code Implementation 03:28 - Fixed Window with Watermark 07:39 - Late Events with Wate...
15 Tumbling, Sliding and Session Window Operations in Spark Streaming | Grouped Window Aggregations
748 views · 3 months ago
Video covers - What are Tumbling, Sliding and Session Windows in Spark Streaming? What are different Window Operations in Spark Streaming? How to handle Late Data in Spark Streaming? Chapters: 00:00 - Introduction 00:40 - Tumbling or Fixed Window 02:42 - Sliding or Overlapping Window 04:23 - Late data scenario and Importance of Watermark 06:37 - Session Window URLs: Github Code - github.com/sub...
14 Spark Streaming Event vs Processing Time | Late Arrival of Data | Stateful Processing |Watermarks
619 views · 4 months ago
Video covers - How to handle Late Arrival of Data? What is the difference between Event and Processing Time? How does Spark handle Stateful Processing? What are Watermarks in Spark Streaming? Chapters: 00:00 - Introduction 00:29 - Event Time vs Processing Time 02:34 - Stateful Processing 03:38 - How Spark handles Late Data? URLs: Github Code - github.com/subhamkharwal/spark-streaming-with-pyspa...
13 Spark Streaming Handling Errors and Exceptions | Handle Exception for data re-processing in Spark
830 views · 4 months ago
Video covers - How to handle errors in Spark Streaming? How to handle Exceptions in a Spark Streaming application? How to store error data for re-processing in Spark Streaming? How to write data to a JDBC Postgres table? Chapters: 00:00 - Introduction 00:41 - Error vs Exception in Spark Streaming 04:17 - Handling Error/Malformed data in Spark Streaming 10:02 - Handling Exception in Spark Streaming UR...
12 Spark Streaming Writing data to Multiple Sinks | foreachBatch | Writing data to JDBC(Postgres)
1.3K views · 4 months ago
Video covers - How to write data to multiple sinks in Spark Streaming? What is the issue with multiple writeStream commands? How to use foreachBatch in Spark Streaming? How to write data to Postgres/JDBC? Chapters: 00:00 - Introduction 00:46 - Issues with Using Multiple WriteStream 01:37 - foreachBatch command 04:03 - Code Implementation 04:51 - Writing data to Multiple Sink using foreachBatch ...
11 Spark Streaming Triggers - Once, Processing Time & Continuous | Tune Kafka Streaming Performance
1K views · 4 months ago
Video covers - What are the different triggers available for Spark Streaming? How are the Processing Time and Once triggers different? How can we tune Kafka jobs with Partitions? Chapters: 00:00 - Introduction 00:58 - Automating Device data for Kafka 02:37 - Trigger mode Once/AvailableNow 04:36 - Trigger mode ProcessingTime 05:49 - Tune Kafka Streaming Job 08:23 - Trigger mode Continuous URLs: Github Co...
10 Spark Streaming Read from Kafka | Real time streaming from Kafka
2.1K views · 5 months ago
Video covers - How to read streaming data from Kafka? How to read real time data from Kafka? How to use Kafka as a Source for Real time Spark Streaming? Chapters: 00:00 - Introduction 00:34 - Example Device JSON Payload 01:09 - Import Kafka JAR Libraries 03:08 - Read from Kafka Source 06:27 - Extract JSON data from column using from_json URLs: Github Code - github.com/subhamkharwal/spark-stream...
09 Apache Kafka Basics & Architecture | Kafka Tutorial | Pub Sub Architecture | Learn Kafka in 15min
1.2K views · 5 months ago
08 Spark Streaming Checkpoint Directory | Contents of Checkpoint Directory
1.3K views · 5 months ago
07 Spark Streaming Read from Files | Flatten JSON data
1.5K views · 5 months ago
06 Lambda and Kappa Architectures | Data Processing Architectures in Big Data
972 views · 5 months ago
05 Spark Streaming Output Modes, Optimization and Background
1.4K views · 5 months ago
04 Spark Streaming Read from Sockets | Convert Batch Code to Streaming Code
1.9K views · 5 months ago
03 Spark Streaming Local Environment Setup - Docker, Jupyter, PySpark and Kafka
3.1K views · 5 months ago
01 Spark Streaming with PySpark - Agenda
3.7K views · 6 months ago
26 Spark SQL, Hints, Spark Catalog and Metastore
1.4K views · 6 months ago
25 AQE aka Adaptive Query Execution in Spark
2K views · 6 months ago
24 Fix Skewness and Spillage with Salting in Spark
3K views · 6 months ago
23 Static vs Dynamic Resource Allocation in Spark
1.3K views · 6 months ago
22 Optimize Joins in Spark & Understand Bucketing for Faster joins
3.8K views · 6 months ago
21 Broadcast Variable and Accumulators in Spark
1.4K views · 7 months ago
It would have been great if each step in the cell was explained instead of the whole cell. Thanks anyway.
EaseWithData content is amazing, thank you. Please help on this one. Has anybody been able to execute this successfully? I am getting a java.lang.ClassNotFoundException: org.postgresql.Driver exception. A few questions: 1) Will the jar download automatically to the path when the Spark code cell is run? Is the jar version very old and no longer found on its site, or is this version of the Postgres driver still relevant? 2) Do we need to establish the network bridge manually one time so that the two containers, Spark and Postgres, can talk? 3) If the bridge network needs to be created, is it done in an ad-hoc inline script before running the container in the cmd screen, or does it need to be done in the docker-compose file? 4) I am not able to go further until I clear this section, since most of the stuff after this depends on the Postgres DB. Please help. Thanks
Hi Subham, @Ease With Data - I followed the installation procedure in the other playlist, but it seems it is not working as expected. Can you help me fix it? I can share more details. The basic chapters are fine with any of the code repositories, but the code below from "18 Optimize Joins" is not working, and even the Jobs UI on port 4040 is not working: spark = ( SparkSession .builder .appName("Optimizing Joins") .master("spark://f6d8b23a8515:7077") .config("spark.cores.max", 16) .config("spark.executor.cores", 4) .config("spark.executor.memory", "512M") .getOrCreate() )
Nice :)
Nice explanation 👍🏼
Love your content :) I have one small question.. At 4:10 Spill memory is of 137MB and Spill Disk is of 77.2MB. If 137MB is spilled from memory why only 77.2MB is written in disk? Shouldn't it be 137MB? Can you please clarify this?
Data written to disk is serialized, while the data in memory is in deserialized form, so the amount on disk will be less. This is a major tradeoff when you are reading data back from disk. Please make sure to share with your network if you love this content ❤️
@@easewithdata Thanks for the quick response!! Sure, will recommend my mates.
I am using your pyspark-jupyter-lab Docker file, but when creating the Spark session I am getting a Java runtime error: Java gateway process exited.
Please use the pyspark notebook using the below docker command docker pull jupyter/pyspark-notebook
How to optimize Naukri if you are doing a career transition into the data field? Civil experience: 3.5 years. Education: MCA. I am a bit confused with the resume summary also. As recruiters only consider relevant experience, it literally feels like I have wasted my career of 3.5 years. I am stuck, can you please guide?
Unfortunately it becomes very difficult to hide 3.5 years of experience. You need to prepare yourself for data engineering and apply for roles that require less or no relevant experience in data engineering.
Nice explanation thanks for the knowledge sharing 🙂 please continue this playlist it is very helpful.
Thank you so much ❤️ Please make sure to share with your network 🛜
Got the "scram authentication is not supported by this driver" error while trying to connect to postgres. This is driving me nuts.
Please make sure to use the correct driver version for the postgres you are using
@easewithdata I solved this by setting up Postgres and Jupyter all with the same compose file. Before, I was using a docker network to connect the two; that didn't work no matter what. Everything broke after I removed the network group, so I tried setting it up again.
Hello, what if I wanted to do this in VSCode instead of Jupyter Notebook with docker as shown in the video?
You can write the complete code in scripts using vs code and then trigger them using spark submit command
What is "invited to apply" in Naukri? And how to reply?
Invitations are to apply for jobs. Click on the invitation and fill up the application shared.
So in cluster deploy mode the driver program is submitted inside an executor which is present inside the cluster. Am I right?
The spark-submit command launches the driver process on the driver, not on the executors.
Thanks a lot for sharing. It would be very helpful if you add the data directory to the GitHub repo.
Some data files are too big to be uploaded in github. Most of the data is uploaded at - github.com/subhamkharwal/pyspark-zero-to-hero/tree/master/datasets
Great 👍
Thank you ❤️ Please make sure to share with your network over LinkedIn 👍
thanks for this valuable insight, expecting the same video for apache iceberg and hudi in future
Sure and Thank you ❤️ Please make sure to share with your network over LinkedIn 👍
Very nice video
Thank you ❤️ Please make sure to share with your network over LinkedIn 👍
Need to understand one thing: why are yyyy and dd not in capital letters? Is there any reason for that?
Spark follows the datetime pattern format below (it mostly resembles Unix formats): spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
Thanks for your lessons! You covered all the gaps between the main concepts.
Thank you so much 😊 Please make sure to share with your network over LinkedIn ❤️
Thanks for such a detailed work. You're a Hero.
Thank you so much 😊 Please make sure to share with your network over LinkedIn ❤️
Sir, can you make a video on setting up Spark and Hadoop with Docker?
For a Spark setup, Hadoop is not mandatory; you can have a Spark standalone setup using Docker. But if you still want the same, you can clone and follow the steps from the GitHub repo below: github.com/Marcel-Jan/docker-hadoop-spark
Can a fresher become data engineer?
Yes absolutely, please start with SQL, at least one programming language, and Spark.
please make a video on creating resume for senior data engineers and please share the template thanks
Will definitely try. Thanks.
Thanks, pls share the senior data engineer resume template.. will help
Sure will try to share the same.
Where is Spark's standalone cluster, on Docker or somewhere else? Please tell me why my cluster execution code is not running.
The standalone cluster used in this tutorial is on Docker. You can set it up yourself. For the notebook - hub.docker.com/r/jupyter/pyspark-notebook You can use the below Docker file to set up the cluster: github.com/subhamkharwal/docker-images/tree/master/spark-cluster-new
Thanks bro :)
Thanks, Please make sure to share with your network over LinkedIn ❤️
how to contact you sir
You can connect with me over topmate topmate.io/subham_khandelwal/
@easewithdata Hi sir, will you be available today please, since I don't have time until the weekend?
great work
Thank you for your feedback 💓 Please make sure to share it with your network over LinkedIn 👍
thank you for your efforts
Thank you for your feedback 💓 Please make sure to share it with your network over LinkedIn 👍
thanks for your efforts it helps lot
Thanks ❤️ Please make sure to share with your network over LinkedIn 🛜
Can accumulator variables be used to calculate avg as well? When we are calculating the sum it can be done per executor, but average won't work in the same way.
Hello Sushant, to calculate the avg, the simplest approach is to use two accumulator variables, one for the sum and another for the count. Later you can divide the sum by the count to get the avg. If you like the content, please make sure to share it with your network 🛜
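The two-accumulator approach from the reply above can be sketched without a cluster. The names sum_acc and count_acc are illustrative; in real PySpark you would create them with spark.sparkContext.accumulator(0) and call .add() inside the task function:

```python
# Simulate two Spark accumulators: a running sum and a record count.
sum_acc = 0
count_acc = 0

def process_record(value: int) -> None:
    """Each task adds to both accumulators (like acc.add(value) / acc.add(1))."""
    global sum_acc, count_acc
    sum_acc += value
    count_acc += 1

# Records that would normally be processed across executors.
for v in [10, 20, 30, 40]:
    process_record(v)

# The division happens once, on the driver, after all tasks finish.
average = sum_acc / count_acc
```

The key point matching the reply: each accumulator only needs an associative add, and the non-associative division is deferred to the driver.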
Thank you. In real time do we use cluster mode or client mode, which you are using now?
I am using the client mode
localhost:4040 is not working when I set .master("spark://e75727ddf432:7077"). How can this be solved?
Why did you make 32 shuffle partitions if you have 8 cores? If one partition is going to be processed on a single core, from where will it get the remaining 24 cores?
The 8 cores will process all 32 partitions in 4 waves of 8 tasks each (8 × 4 = 32).
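The arithmetic in the reply can be made explicit: with fewer cores than partitions, tasks simply run in successive waves (the numbers below are the ones from the question):

```python
import math

total_cores = 8           # cores available in the cluster
shuffle_partitions = 32   # one task per shuffle partition in the stage

# 8 tasks run in parallel at a time, so 32 tasks need ceil(32 / 8) waves.
waves = math.ceil(shuffle_partitions / total_cores)
```

This is also why setting shuffle partitions to a small multiple of the core count is a common tuning heuristic: full waves keep every core busy.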
When I refresh my Spark UI it gives an error. I set .master("spark://6b16b66805db:7077") and localhost:4040 is also not working. How can this be solved, please?
Thanks for the Explanation, Very useful
Glad it was helpful! Please make sure to share with your network over LinkedIn ❤️
From the video: select, where, group by etc. are transformations. We have narrow transformations and wide transformations. Wide transformations are those where data has to move or interact with data from other partitions in the next stages.
Again from the video itself: executors are JVM processes, and 1 core can do 1 task at a time. In the picture above we have 6 cores, so 6 tasks were possible.
Shuffle is the boundary which divides job into stages
Great explanation. Liked the illustration that 2 counts happened, and the fact that after the local count and before the global count, some shuffling happened.
Thanks 👍 Please make sure to share with your Network over LinkedIn ❤️
Awesome. Super explanation . I love it
Thanks, Please share with your Network over LinkedIn ❤️
Why are we not using withColumn instead of expr?
Just to show all possible options. withColumn is also used in later videos.
You are awesome bro.. Thanks a lot
Glad to hear that ☺️ Please make sure to share with your network over LinkedIn ❤️
What is the use case of slide duration in streaming app... any real world example ?
It's basically used for cumulative aggregations.
You are awesome, man. I was trying to set up Spark and Kafka in Docker for a long time; done today. Thank you very much.
Glad I could help. Please make sure to share with your network over LinkedIn ❤️
Hi, I have a doubt: how can we check from the Spark UI whether a stream has multiple sinks?
Allow me sometime to search the exact screenshot for you.
what if both the tables are very small like one is 5 MB and other is 9 MB then which df is broadcasted across executor?
In that case it doesn't matter; however, AQE always prefers to broadcast the smaller table.
@easewithdata Thanks. I have been following you for more than a month and it has been a great learning experience. We want you to make an end-to-end project in PySpark.
One of the best in-depth explanations, thanks :) Could you please make a video on an "end to end data engineering" project, from requirement gathering to deployment?
Thanks ❤️ Please make sure to share with your network on LinkedIn 🛜
Hello, and thanks for sharing these useful videos. How do you handle writing to Delta tables? The best practice is that the size of each parquet file should be between 128 MB and 1 GB. How do you handle this when each batch is much smaller than the mentioned size? Or how do you collect a number of batches until reaching the mentioned size and finally write to the Delta lake?
Usually micro-batch execution in Spark writes multiple small files. This requires a later stage to read all those files and write a compacted file (say, one for each day) of bigger size to avoid the small-file issue. You can use this compacted file to read data in your downstream systems.
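The compaction step described in the reply can be sketched as simple greedy bin-packing of small micro-batch files into outputs of roughly 128 MB each. The sizes and target below are illustrative; in Delta Lake the OPTIMIZE command performs compaction for you:

```python
TARGET_MB = 128  # desired approximate output file size

def plan_compaction(file_sizes_mb):
    """Greedily group small input files into bins of at most TARGET_MB each."""
    bins, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Start a new output file when adding this input would overshoot.
        if current and current_size + size > TARGET_MB:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# Hypothetical micro-batch output sizes in MB accumulated over a day.
plan = plan_compaction([10, 40, 90, 30, 60, 50, 20])
```

Each bin becomes one rewritten file, so seven tiny files collapse into a few near-target-size files that downstream readers scan efficiently.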