10 Spark Streaming Read from Kafka | Real time streaming from Kafka
- Added on 27 Jul 2024
- Video covers: How to read streaming data from Kafka? How to read real-time data from Kafka? How to use Kafka as a source for real-time Spark Streaming?
Chapters:
00:00 - Introduction
00:34 - Example Device JSON Payload
01:09 - Import Kafka JAR Libraries
03:08 - Read from Kafka Source
06:27 - Extract JSON data from column using from_json
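The chapters above walk through reading a Kafka topic as a streaming source and parsing the JSON value column with `from_json`. A minimal sketch of that flow follows; the broker address, topic name, and device schema are assumptions for illustration, not taken from the video's repo:

```python
import json

# Assumed device payload shape, mirroring the "Example Device JSON Payload" chapter.
sample_value = '{"deviceId": "D001", "temperature": 21.5, "eventTime": "2024-07-27 10:00:00"}'

# DDL-style schema string that from_json would use to parse the Kafka value column.
device_schema = "deviceId STRING, temperature DOUBLE, eventTime STRING"

# The PySpark side (requires a SparkSession with the Kafka package attached):
#
# from pyspark.sql.functions import col, from_json
#
# kafka_df = (
#     spark.readStream
#     .format("kafka")
#     .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
#     .option("subscribe", "device-data")                   # assumed topic name
#     .option("startingOffsets", "earliest")
#     .load()
# )
#
# # Kafka delivers the value as binary; cast to string, then parse with from_json.
# parsed_df = (
#     kafka_df
#     .withColumn("value", col("value").cast("string"))
#     .withColumn("data", from_json(col("value"), device_schema))
#     .select("data.*")
# )

# Outside Spark, json.loads shows what from_json extracts from one such record:
record = json.loads(sample_value)
print(record["deviceId"], record["temperature"])
```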
URLs:
Github Code - github.com/subhamkharwal/spar...
Device data samples - github.com/subhamkharwal/spar...
To setup Kafka with Spark in Local environment - • 03 Spark Streaming Loc...
JSON data Flattening and reading from files - • 07 Spark Streaming Rea...
Keywords: Apache Spark, PySpark, Spark Streaming, Real-time Data Processing, Data Streaming, Big Data Analytics, PySpark Tutorial, Apache Spark Tutorial, Streaming Analytics, Spark Structured Streaming, PySpark Streaming, Big Data Processing.
New video every 3 days ❤️
Make sure to like and subscribe.
Superb playlist
Glad you like it
Thanks for sharing this fantastic list.
I just noticed something: you connect to the Kafka broker on port 9092, but the data can also be retrieved from port 29092. What did I miss? :)
Many thanks
Thanks. In this Docker setup, port 9092 is open for external applications, while containers on the internal Docker network can use 29092. Both are correct.
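The two ports typically come from the broker's advertised-listener configuration in docker-compose. A hypothetical fragment showing the pattern (service name, image tag, and listener names are assumptions, not from the video's setup):

```yaml
# Illustrative docker-compose fragment: one listener per audience.
services:
  kafka:
    image: confluentinc/cp-kafka:7.3.0
    ports:
      - "9092:9092"   # exposed to the host for external applications
    environment:
      # Containers on the Docker network reach the broker at kafka:29092;
      # applications on the host use localhost:9092.
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
```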
excellent tutorial
Here are the questions I faced in a Hewlett Packard interview on Spark Streaming applications; perhaps you can create a video on these too.
1. Suppose you read a message from Kafka and your application fails to process it. How do you ensure that the same message is eventually processed successfully? What he was getting at: out of thousands of messages read from Kafka, how do we reprocess the ones that failed, given that the application keeps reading the new messages coming in and the failed message was already read once?
2. How do you ensure parallelism between a Kafka producer and the Spark Streaming read API? There will be hundreds of messages incoming at any given point, and naturally your Spark application cannot process them one at a time; instead, Spark can process them in parallel by reading from multiple partitions. How do you configure your application to do that?
I think these questions give a much better idea of how a production-grade Spark Streaming application works. I'd appreciate it if you could create some content on them. Thanks.
Thank you for posting the questions. The answer to question 2 can be found in today's video.
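For readers who land on this thread, both interview questions map to standard Structured Streaming knobs. A hedged sketch; the path, topic, broker address, and option values below are illustrative assumptions:

```python
# Question 1: Structured Streaming tracks consumed Kafka offsets in a checkpoint
# directory. Offsets are only committed after a micro-batch completes, so if the
# application fails mid-batch and is restarted with the same checkpoint location,
# the failed batch is replayed from the last committed offsets.
checkpoint_option = {"checkpointLocation": "/tmp/checkpoints/device-stream"}  # illustrative path

# Question 2: Spark reads Kafka with one task per topic partition by default,
# so parallelism follows the topic's partition count. These reader options can
# tune it further (values are illustrative):
read_options = {
    "kafka.bootstrap.servers": "localhost:9092",  # assumed broker address
    "subscribe": "device-data",                   # assumed topic name
    "minPartitions": "6",             # split Kafka partitions into more Spark tasks
    "maxOffsetsPerTrigger": "10000",  # cap records per micro-batch for steady throughput
}

# The surrounding PySpark calls (require a running SparkSession and Kafka) would be:
#
# df = spark.readStream.format("kafka").options(**read_options).load()
# query = (
#     df.writeStream
#     .format("console")
#     .options(**checkpoint_option)
#     .start()
# )
```

Note that replay from the checkpoint gives at-least-once delivery; end-to-end exactly-once additionally needs an idempotent or transactional sink.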
Thank you @easewithdata
If you can cast it to a string, why can't you cast it to JSON, or map it out and use a JSON parser?
Yes, you can definitely cast it however you like.
I can't see the output on the Docker console, but I started the consumer console and can see the data there. Thanks for the video. Can you also make a video on how to read streaming job details in the Spark UI?
Please make sure to share with your network over LinkedIn
I'm getting the error below while reading from Kafka:
Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
How do I fix this?
Did you import the JAR file for Kafka in the SparkSession?
@easewithdata Yes, I did; here is the sample:
from pyspark.sql import SparkSession
spark = (
SparkSession
.builder
.appName("Streaming from Kafka")
.config("spark.streaming.stopGracefullyOnShutdown", True)
.config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0')
.config("spark.sql.shuffle.partitions", 3)
.master("local[*]")
.getOrCreate()
)
No comments are deleted. Once your Spark session is created, check the Environment section of the Spark UI to see whether the Kafka JAR was downloaded and attached to the cluster.
I'm getting the error below while reading from Kafka:
Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
Hello Prathmesh,
Did you import the JAR to support Kafka?
By import, do you mean it needs to be imported separately?
You need to import the Kafka JAR file in the SparkSession.
It sometimes occurs because of a mismatch between the downloaded JAR file and the Spark version. You should find the version of the JAR that is compatible with your Spark version.
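One way to avoid that mismatch is to derive the package coordinate from your Spark and Scala versions. The helper below is hypothetical, but the coordinate format is the standard Maven coordinate for the Spark-Kafka connector:

```python
def kafka_package(spark_version: str, scala_version: str = "2.12") -> str:
    """Build the Maven coordinate for the Spark-Kafka connector.

    The artifact's Scala suffix must match the Scala build of your Spark
    distribution, and the version must match Spark itself.
    """
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"

# For the Spark 3.3.0 / Scala 2.12 build used in the SparkSession sample above:
coordinate = kafka_package("3.3.0")
print(coordinate)  # org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0
```

The resulting string is what goes into the `spark.jars.packages` config option when building the SparkSession.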