YouTube Data Analysis | END TO END DATA ENGINEERING PROJECT | Part 2
- Published 26. 06. 2024
- In this video, you will execute the END TO END DATA ENGINEERING PROJECT using the Kaggle YouTube Trending Dataset.
If you are someone who wants to learn Data Engineering by doing hands-on projects then this video is for you!
👉🏻Watch Part 1 Of This Video Here - • YouTube Data Analysis ...
✨Visit ProjectPro for more projects - bit.ly/3uBzam5
✨ Tags ✨
data engineering projects, big data project, data engineering project hands-on, hands-on data engineering projects, learn data engineering, data engineering roadmap, how to become data engineer, data engineering free projects, big data engineering, big data
✨ Hashtags ✨
#dataengineer #project #darshil
Darshil is a great teacher! Great project.
Thanks for bringing me close to the real use case scenario of Data Engineering.
Darshil, I learned a lot. I believe this is helping many people. Thanks for all the effort you put into this.
simple and to the point explanation. Great work bro 👍🏻
You are doing such great work; people should learn from you how to teach with this learning-by-doing method…
Please do some more projects like this using real-time data and big data as well, so that we can learn those too.
And thanks again, this tutorial is helping a lot🎉❤
Best project from scratch, thanks bro☺☺
Too good. Learned a lot. Thank you
That was an awesome project, thank you!
Great work Darshil!
I have only one suggestion after finishing the whole project along with the video, which took me around 6-8 hours in total, excluding the dashboard. My suggestion is to take an extra minute to explain the code properly, so that we viewers can understand which transform actions the ETL performs; that would make the video more coherent overall, and it would be clearer why you chose the steps that come before and after the ETL step.
Anyway, thanks for this wonderful project; I am probably moving on to the Azure analytics project after this one.
hey Darshil.... I hope the project is complete!!
Finally got this project done. Great project to learn data engineering!!
Hi, how did you set up the ETL job?
Great job !!!
Great video bro..
Thanks for this!
Great video Darshil, thank you so much!!
I get a parquet file error when I cast string to bigint. I also deleted the file from the S3 bucket but it's still not working. Can anyone please help me?
Thank you so much for the Amazing video
Really great project! Just wanted to ask: when more data lands in the landing area, will the rest of the processes run automatically through the pipeline you created? It seemed like some parts had to be done manually, like using AWS Lambda.
informative
Thanks for the great video Darshil!!! Learnt a lot of new things :)
Hey bro, your videos are very easy to understand. Could you make a more in-depth video about QuickSight?
Finally completed this project. Thank you so much for this! You're a gem :)
Heyy, I am facing an error while joining cleaned and raw data. Can you please help?
Hey, I am facing the same issue. Can you help out?
This is a simple error I ran into gonna post it here incase others have the same.
When trying to run the job @21:25 I was getting "NameError: name 'gluecontext' is not defined".
When adding the line df_final_output = DynamicFrame.fromDF(datasink1, gluecontext, "df_final_output") I accidentally wrote gluecontext in lowercase instead of glueContext.
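For anyone else hitting this, a minimal sketch of the fixed line in context, assuming the standard Glue job boilerplate where the context variable is named glueContext (the datasink1 DataFrame here is a stand-in for whatever the real script built earlier):

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)  # note the capital C: Python names are case-sensitive
spark = glueContext.spark_session

datasink1 = spark.createDataFrame([(1, 'demo')], ['id', 'title'])  # stand-in for the job's real DataFrame
df_final_output = DynamicFrame.fromDF(datasink1, glueContext, "df_final_output")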
Thank you for this walkthrough. I start my new Data Engineering job tomorrow and the company uses AWS, so this has helped me tremendously. You are doing magic, my friend.
Hi Robert, hope you are doing well. It's been over a year since you posted and joined your new company. Just wanted to check: was this your first data engineering job, or were you already experienced in DE? And how are things at your new workplace?
00:03 Creating a crawler to understand and analyze data stored in AWS S3 buckets
05:48 Query execution and data type casting
11:43 Preprocessing and Efficiency for Querying
17:16 Writing data into the target bucket and creating partitions
22:05 Create a glue crawler to clean and catalog data
27:25 Data processing pipeline created using AWS Glue Studio
32:23 Created an analytical pipeline using AWS Glue to transform and store data
37:18 Building a reporting version of the data makes it easier for data scientists to analyze and query it.
42:01 Create a dashboard to visualize data from YouTube
Hi, can you explain how the "raw_statistic" table was created automatically after he created crawl_1? When I tried the same process, it didn't work for me.
Thanks a lot Darshil and ProjectPro
Not able to see the region column in my schema; also all columns show string as the datatype 16:07
Hi Darshil, thank you for this video. A question: when you created the cleaned version converting CSV to parquet, why didn't we use a Lambda function instead of a Glue job?
Thanks Darshil!! Finally made this cool project after overcoming all those errors. Really good explanation.
@@aishwaryapatel7045 facing the same issue
Great work Darshil bro... can you send the ppt if possible?
Thank you Darshil for this wonderful project. I have been looking for such a project for a long time.
How did you solve the runtime error?
@@ishan358 What kind of error are you getting?
@@chayanshrangraj4298 Can you help me with an error? It's in the AWS Glue step where the tables are joined.
@@lguerrero17 Sure! What is the error that you are facing?
@@chayanshrangraj4298 When I try to create the ETL job to generate the analytics table, it creates the table but doesn't generate any columns or rows.
Another great video. Only thing is... AWS has updated the Glue console along with other consoles. I believe I updated accordingly, except for the schema datatypes (which it looks like I have to update after the job is run). But the script looks entirely different. Could you assist with an updated video on using the new Glue console?
I am facing the same issue
@Jenith Mehta If you scroll to the bottom of the navigation pane there are "LEGACY" versions. I realized this after I posted, but that's what I used. Hope that helps. 😀
@@ajtam05 can't find that
Good work, it's a great project; it helped me learn many things.
Can you help me? I don't understand the new interface of AWS Glue.
Has anyone used ProjectPro before? I'm considering investing in it, but just wanted to see if anyone has experience with it yet? Looks promising.
@darshil thanks for the effort, great job!! I just finished the project and am so proud of myself; it's my very first project switching from DA to DE. Thanks a lot
Hey, did you get a Lambda timeout error by any chance?
Hey, I'm seeing the new AWS Glue UI. How did you create the job there? I'm facing a lot of confusion about what to select and how to navigate it; the UI in the video is different.
@@allenclement5672 hey did you solve this issue
@@ybalasaireddy1248 if you are getting a timeout error... please increase the timeout
@@allenclement5672 same problems...
Did u solve them? can u help me?
Anyone who is struggling with the trigger: please create the trigger on the S3 bucket. That will work perfectly.
Is there a project where you used Python notebooks or EMR for processing data instead of Lambda functions?
To the folks struggling with the Glue script to filter regions out: try deleting the region files manually from S3 (make sure to enable bucket versioning so that objects are not permanently deleted). By doing this you can check whether the rest of your code is good, and even go on with the rest of the video if it's working.
I am getting a timeout error
I think it's better if you just move the folders somewhere else so you won't have to upload them again in the future.
Which part of the video ?
Hey hi, can you please share the pyspark code?
Hey thanks for this great video. I want to know, how much does it cost to complete this entire project on AWS?
Before the trigger, did you change the Lambda to take all records? Initially it was ["Records"][0]
Hey.. how do I convert from CSV to parquet for other regions like Russia, Korea, etc.?
why did you re-create the crawler at the start of the video?
👌🏻🙏🏻
Can you provide the ETL script shown in the video? I am getting an error even after adding predicate_pushdown.
You go so fast
But good work
Hi, where can I get the ppt you are using?
Excellent !!
How do we showcase this project on our LinkedIn profile or in our resume?
Finally, after 100 disappointments... I did it... Great effort, and it's my very first project in the Data Engineering field... Thanks
...Errors are challenging, but anyone with a real interest in Data Engineering will definitely get through them and complete this project.
An error occurred while calling o88.getDynamicFrame. User's pushdown predicate: region in ('ca','gb','us') can not be resolved against partition columns: [] in my job 23:00
@@N12SR48SLC sorry bro.
@@N12SR48SLC yes bro
Don't forget to shut down the activated services on AWS.. after the project is done
@@N12SR48SLC what's the error
Hi Darshil, I have been trying to implement this project. At 13:28 you created a job, but I am not able to see that option in the current version. All I can see is the option to create an ETL job visually. Can you please help me with this?
The Athena query works the first time on the parquet file, and then I have to delete the unsaved folder in the cleansed bucket. Has anyone dealt with this? I am still at the 5-minute mark of this video. Really frustrating!!
Is anyone else facing an issue with the Lambda function? I added the trigger, but no new file is created once I upload the JSON file to the bucket.
Do I have to pay anything to complete this project, or is it completely free?
You didn't answer the initial question as in video 1: how to categorise videos based on their comments and stats, and what factors affect how popular a YouTube video will be.
@Darshil Parmar - "region=us/" folder is not created for me; only ca and gb folders are created upon running the ETL job. PS: I added "predicate_pushdown = "region in ('ca','gb','us')" as well but floder is missing for "us" region. Can you please take a look at this?
Same thing happened to me. The error occurred when initially using the AWS CLI to load data into the S3 buckets. After executing the command to upload the csv files, I did not hit enter after the upload was "complete"; I just exited the cmd box. To fix it, I manually uploaded the data and re-ran the processes from both videos
Edit: this only applies if you go into your raw-data S3 folder and don't find the folder "region=us"
because us is not present in the initial dataset
Use the AWS CLI to create the folders, using the cp command.
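For example, a sketch of the upload with a hive-style partition prefix (the bucket name and layout are placeholders; adjust to your own):

aws s3 cp USvideos.csv s3://your-raw-bucket/youtube/raw_statistics/region=us/

The region=us/ part of the key is what the crawler later registers as the partition value.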
INTERESTING POINT HERE: at 4:56
How do you know which primary key to choose for the INNER JOIN? Before watching, I tried a.video_id = b.id,
because it seems logical that each row is unique, so video_id should be compared against the other table's id, which is also unique per video row.
Am I wrong? Does anyone have an idea? Thanks a lot
Go and read the data column descriptions for that.
Doing this kind of join gives you proper knowledge of the data.
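For what it's worth, the reference JSON in this dataset describes video categories, not individual videos, so the join is on the category key; a hedged example (table and column names may differ in your catalog):

SELECT a.title, a.category_id, b.snippet_title
FROM raw_statistics a
INNER JOIN cleaned_statistics_reference_data b ON a.category_id = b.id;

That is why a.video_id = b.id finds no matches: b.id is a category id, not a video id.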
Where is the link for Discord?
Thanks! Why parquet files? Isn't it simpler to keep everything in JSON or CSV?
I think it's because of the larger volumes of data; with parquet you can use Spark efficiently
Where is the discord link?
Sorry, but I don't see any use of Kafka, Spark, Hadoop, etc.
It's just AWS, Python, and SQL
We created an ETL job to join the data so that when new data gets added to the bucket it is joined automatically instead of us running an SQL query. But shouldn't we trigger this ETL job on the data-addition event in S3? Can anyone answer this?
No, I think the Lambda trigger from S3 only fires once, for the .json-to-parquet step --> then the cleansed S3 bucket is filled -> from there the analytics data is picked up.. can someone confirm this?
I asked you a question on your channel about the wrangler, which didn't seem to be working for me. I don't know if it has to do with the location?
Yes, it is only available in some locations
@@DarshilParmar oh I see! Thanks for the work you do!! You have been very helpful!!!
@@DarshilParmar why is that ?
Don't forget to shut down the activated services on AWS..
Can you tell us how to shut them down?
Do we have to delete the bucket and ETL job, or something else?
@@sivasahoo6980 Delete all the services
Is there an updated version of this? The Legacy Glue UI cannot be accessed now.
I was also searching for it. I'm wondering what to use now.
@projectpro please consider a monthly subscription instead; billed 6 months/yearly is too much.
Hey, we have some discounts going on, valid only for a few days; please share your email id and our team will get in touch with you. Thanks
When my lambda function is triggered by an S3 event, the cleaned_statistics_reference_data table is created. But when I check by SQL command "SELECT * FROM cleaned_statistics_reference_data", the result is an empty table. I tested the lambda function with a test event, and everything is OK (there is data in the cleaned_statistics_reference_data table). Please help me with a solution! Thank you!
Have you found the solution? I am facing the same issue. Please help me
@@drishtihingar2160 facing the same problem, have you found any solution?
no, not yet @@nandinisingh9217
You have to upload the json files through the CLI AFTER creating the trigger.. Lambda won't process already-existing json files
Hi, I am at the last step of building the ETL pipeline. I successfully created the Glue job named 'de-on-youtube-parquet-analytics-version'. The contents of the de-on-youtube-analytics bucket are being added, but the 'final_analytics' table is not being created. Please help me resolve the issue. Thanks in advance
Hi, I created the Glue job but it isn't creating the same files under raw_statistics as shown in the video. How did you do it?
Why have you done analytics on only these three regions? region in ('ca','gb','us')
He's testing the pipeline on the English-language regions first to make sure his ETL job works, before going through the trouble of converting the foreign-language data with a UDF
The S3 trigger is not working for me; I tried many times. The data is not being written into the cleansed S3 bucket (json files).
Please check whether there is any space when defining the Prefix: youtube/raw_statistics_reference_data/ . If you are copying from S3, there may be a space after youtube/
hello brother, have you found the solution? I'm getting the same error
@@harshalshende69 I actually did it manually, by uploading the files.
@@eduhilfe1886 This fixed it for me thank You!!
@@harshalshende69 Check your S3 trigger; make sure youtube/ doesn't have a space after it
Adding the trigger to the Lambda function is not working for me. Tried many times. Please suggest.
Facing the same issue. Did you find a solution?
@@divyakhiani1116 I redid the same steps once again, I guess. I don't remember, though !
Hey all, I am stuck at 40:35. I don't see the Database option for 'New Athena data source'. Not sure if QuickSight had an update since this video was created. Any suggestions?
Answering my own question, had to change the region which was a default selection.
Thanks. Can you please tell me whether I need to pay anything to complete this project?
thank you so much
I'm stuck at 12:55; I'm unable to get past the id type error... I deleted the parquet several times but it's still not working
I did everything in the us-west-1 (California) region, but this region is not available in QuickSight. Can you help please @@kopalsoni4780
Hello Sir, I'm not able to convert the id field type to bigint.
I tried the steps according to the video multiple times.
I even looked online for the procedure but found nothing.
Can you help me, sir?
Has your issue with converting the id column to bigint been resolved?
No
what to do about creating jobs
@darshit #darshil
Should I use the script that is given? AWS has now moved to Visual ETL, and simple job creation has become complex for someone who doesn't know how to work with Visual ETL.
The job creation UI has completely changed. I am stuck at that step.
Go to the script tab and click edit. Paste the Spark code from the GitHub repo; it will work.
@@russophile9874 could you please explain precisely which script tab and which edit button? I am stuck on this step. Thanks.
Not able to see the region column in my schema; also all columns show string as the datatype
Same. Did you find any solution?
I added this code: predicate_pushdown = "region in ('ca','gb','us')"
and got this error:
"Error Category: UNCLASSIFIED_ERROR; An error occurred while calling o103.getDynamicFrame. User's pushdown predicate: region in ('ca','gb','us') can not be resolved against partition columns: []"
The source S3 data in my setup is partitioned by the "region" column.
Please, how do I resolve this?
Bro, have you found the solution?
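A note for everyone hitting this: the empty list [] means the Glue catalog table has no partition columns registered, so the predicate has nothing to resolve against; usually the region= folders are missing in S3 or the crawler hasn't re-run over them. A minimal sketch of the read, inside the usual Glue job boilerplate where glueContext is already defined, with the database and table names as placeholders for your own:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "db_youtube_raw",  # placeholder: your raw catalog database
    table_name = "raw_statistics",  # placeholder: your partitioned table
    push_down_predicate = "region in ('ca','gb','us')",
    transformation_ctx = "datasource0"
)

If the table's partition list in the Glue console is empty, fix the S3 layout (region=ca/ style prefixes) and re-run the crawler before running the job again.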
Hello Darshil, I am kind of stuck at (22:52) of the video. My job runs successfully but the raw_statistics folder is not created. I have set the region correctly in the code.
Any suggestion would be helpful,
Check the S3 trigger; remove the space after youtube/
I'm caught with the same issue. In my case, files were created directly in raw_statistics; there are no "region=" sub-folders. Could you please help me? Thanks
Can you please share your script? I created my job but it is not executing. Please share; it would be a great help.
Actually, in my case I am getting confused creating the job because the current AWS UI directly shows Visual ETL; there is no option for target and data transform and no option to add a job manually. If anyone could please help me with that.
It seems that for me, at 28:26, the parquet files weren't transformed. I checked the trigger and the region but still haven't found a solution. Does anyone have any idea?
Did you remove the extra white space in the prefix and retry? I solved the same problem that way.
@@user-zu3yp3bu5s yes, and I had the same result: only a blank database with just the column names, and no parquet file uploaded.
Thank you so much @@user-zu3yp3bu5s. After days of asking, I finally found a solution.
Someone please help; the UI for creating the job has completely changed. I am not able to create new jobs.
Same here @Darshil Parmar
Just finished the project; Amazing work man!!!
Hi, I need some help with the project. Will you be able to help?
@@ybalasaireddy1248 how did you solve the runtime error in Lambda?
How did you solve the Lambda runtime error?
@@ybalasaireddy1248 Hi, I am trying to do the project; could you support us?
@@lguerrero17 hi, can you help me too, please?
Somebody help, I am getting this error:
TYPE_MISMATCH: Unable to read parquet data. This is most likely caused by a mismatch between the parquet and metastore schema
This query ran against the "de-yt-clean" database, unless qualified by the query.
I have changed the schema but still no progress.
You need to create your parquet file again now by re-running the Lambda function; this was covered in the video too. Until it's regenerated, the old parquet file still carries the previous schema, which is what Athena is complaining about.
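If your Lambda uses awswrangler like the one in the video, a minimal sketch of regenerating the parquet with the id column cast up front (the DataFrame contents, path, database, and table names below are placeholders):

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({'id': ['1', '2'], 'snippet_title': ['Film & Animation', 'Autos & Vehicles']})  # stand-in for the normalized json
df['id'] = df['id'].astype('int64')  # cast to bigint before writing so the parquet matches the catalog schema
wr.s3.to_parquet(
    df=df,
    path='s3://your-cleansed-bucket/youtube/',  # placeholder path
    dataset=True,
    database='de_youtube_cleaned',  # placeholder catalog database
    table='cleaned_statistics_reference_data',  # placeholder table
    mode='overwrite'  # replaces the old file that carried the mismatched schema
)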
How do I get to Jobs (13:30)? Apparently the Glue console has changed, so I'm not sure how to proceed.
got solution to this ? 💀
At 16:54, I'm not able to see the region source key in my output schema. What should I do?
Can anyone please explain how to set up the ETL job? The AWS Glue UI has changed; there is no option like the one shown in the video. Instead there are Visual ETL, Notebook, and Script editor. Many students are facing the same issue but no one is replying. Can anyone please help and write down what needs to be done?
Hello, I'm facing the same issue
If you have solved it, please share
I am stuck on creating the Glue job as the UI is different. Please, can anyone help? Where do I change the data types? I am able to add the source and target.
Stuck in the same place
I added the data target and source but am not able to figure out how to change the data type
@@shubhamnikam4759 use Google Gemini
I understand why we need to convert JSON to parquet, but why do we convert CSV to parquet? It's already clean, right?
The parquet file format is columnar, more optimized, and faster to query; read more about it online
@@vineetsrivastava4906 Can we transform JSON data into parquet through Glue?
Heyya, are you currently doing the project hands-on? I am looking for someone to start the project with.
No, I am also looking for someone to do the project with
Hey, can someone help me? The UI for ETL Jobs has changed a lot and I cannot add a job successfully.
were you able to figure this out
Hello! I faced this issue and figured it out.
You go to ETL Jobs, click the button "create job from a blank graph", and go to "Job details", the third item in the menu
For the second part, after clicking next, go to Visual (the first item on the same tab where you clicked Job details), add a node, choose the S3 bucket as the source, then add a new node for the schema, and then a third node from the target tab
Hey @@gabilinguas, I couldn't follow your last comment. Can you please explain a little more?😊
Hey @@RonitSagar !
You can follow basically the same steps described in the "Build ETL Pipeline" section at 30:33 in the video.
The process is almost the same; you just have to pay attention to the differing details.
While creating the job I am not able to get region in the options. Can you please help me? At 16:10 in the video.
That's because you are getting the data from S3 directly. Instead, we need to select our source from the Data Catalog, where region is registered as a partition column.
Hi, I am getting an error while creating the Glue ETL job 17:00; the UI is completely different and I cannot proceed further. Any help?
same here.. stuck there
@@Chandu_Art I've set up the job pipeline using the new UI, but the script editing is mismatched
Hey Parakh, did your issue with the ETL get resolved? If yes, can you please help me with it?
Same for me, stuck at the ETL job creation section
@@srihariraman9409 how did you set up the new UI pipeline?
Please mention a few steps
Why is my trigger not invoked when a file is uploaded to S3? My test works properly in the Lambda function, and it shows no errors either. I am not able to understand the issue.
did you get any solution
@@sivasahoo6980 It's been a while since I posted, but if I remember correctly it was a naming or syntax issue: there was an extra space somewhere that I couldn't find at first. After re-watching everything I found it; I don't remember exactly where, but it may have been in a path.
Hello guys, you might be getting an error at the testing step because the DB name has not been changed in the environment variable. Please take care: he forgot to change the DB name. If you notice, in Athena the database name is db_youtube_cleaned but it should be de_youtube_cleaned, which gives the "Entity not found" error in the final Lambda test (see the sketch at the end of this thread).
@@banarasi91 thanks a lot
Yeah, there is an extra space in the path
@@banarasi91 Not able to see the region column in my schema; also all columns show string as the datatype 16:07. My ETL job is also failing.
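On the environment-variable point above, a minimal sketch of how the Lambda picks up its configuration; the variable names are assumptions based on the video, so check your function's Configuration > Environment variables tab:

import os

# assumption: these keys match the ones defined on your Lambda
db_name = os.environ['glue_catalog_db_name']  # must match the Glue/Athena database name exactly
table_name = os.environ['glue_catalog_table_name']
output_path = os.environ['s3_cleansed_layer']

A one-character slip here (db_ vs de_) is exactly what produces the "Entity not found" error in the final test.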
From part 1 of this project I'm facing the error below;
let me know the solution for it.
Test Event Name
db_amazon
Response
Calling the invoke API action failed with this message: Failed to fetch
Function Logs
Request ID
Hi, my Lambda trigger for json files is not getting fired; I don't know what's wrong.
Yeah same. Have you figured it out?
@@aneeqbokhari4611 Same here
Had to stop here too. After deleting all files and re-uploading, the trigger does nothing.
@@bukunmiadebanjo9684 The same thing happened to me. Has anyone figured out how to solve it?
@@shantanuumrani9163 I didn't find a solution. The whole UI also looks different, as AWS has already made changes, so I decided to move to a different course and abandoned this.
Thank you so much for the wonderful project. I am getting the below error while testing the Lambda function. Can you please advise?
Test Event Name
s3-put
Response
{
"statusCode": 200,
"body": "\"Hello from Lambda!\""
}
you need to deploy first then test the lambda function.
@@shaikanishmib8391 deploy is disabled
Did the trigger work for you?
@@prafulbs7216 I'm stuck there. It is not working.
@@SankarJankoti Yeah, so I just did it manually, one by one for the 3 regions (ca, gb, us), with the Lambda function only, and continued.
I'm facing an error in the PySpark code; please help me out.
Try changing the bucket name and database name in the PySpark script according to the naming conventions you used.
7:10 correction: the characters are not in Russian but in Korean script. My gawwddd, Indians and their obsession with Russian.
very fast
Thanks a lot for this project!
It helped me understand what tools we generally use as Data Engineers to build data pipelines, etc. But I don't feel I have learned how to do it myself. I mean, I followed along and understand what we built, but I need more explanation of how you process the data, e.g. how you get your bucket in AWS Lambda (the code "bucket = event['Records'][0]['s3']['bucket']['name']; key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')" is not self-explanatory). I need to practise on my own.
You can go through the test event that we generated. There is a JSON in the test event that we use to test the function. Try to navigate it and you will understand how the bucket name is captured, etc. Hope it helps.
@@mananyadav6401 Oh yeah, I'll do that, good idea, thanks for your answer :)!
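For anyone tracing it, the s3-put test event has roughly this shape (trimmed to the fields the code touches; the names are placeholders), which is how event['Records'][0]['s3']['bucket']['name'] resolves:

{
  "Records": [
    {
      "s3": {
        "bucket": { "name": "your-raw-bucket" },
        "object": { "key": "youtube/raw_statistics_reference_data/US_category_id.json" }
      }
    }
  ]
}

unquote_plus is used because S3 URL-encodes the object key in the event (spaces arrive as +, for example).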
Just completed this project. Thanks for the content; understanding AWS services and using them for our use case is a really crazy thing! @DarshilParmar ❤
#AWS CLI
#S3
#Lambda
#Glue
#Crawler
#Glue Studio
#Glue ETL
#Athena
#Database
#Quicksight
Hey Manoj, can you please help me with the new ETL job visual editor scripts? I am having trouble understanding it
@@saiganesh5702 Even I am facing issues in the ETL job creation section due to the new UI
how did you set up etl glue job
@@saiganesh5702 were you able to figure this out
@Darshil. KINDLY ASSIST.
Great Job Darshil!!
So far so good, but I got stuck running the de-youtube-parquet-analytics-version job in part 2 (minute 35:00) of the tutorial; I keep getting the error below:
Error Category: UNCLASSIFIED_ERROR; An error occurred while calling o114.pyWriteDynamicFrame. Unable to parse file: RUvideos.csv
How did you solve this problem?
I found the solution. This error happens when your csv files contain characters outside UTF-8. What you have to do is save the files to the buckets again in UTF-8 format. If you open your csvs in Google Sheets or Excel, you can save them with UTF-8 encoding.
@@gabilinguas Hi, I'm stuck at the same spot, 35:00. Can you help me out?
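If you'd rather script it than round-trip through Sheets/Excel, a minimal sketch of re-encoding a file to UTF-8 before re-uploading; the source encoding is a guess (the RU files are often cp1251 or similar), so adjust it if the output looks wrong:

# hypothetical helper: rewrite a csv as utf-8 so Glue/Spark can parse it
src_encoding = 'cp1251'  # assumption: try 'utf-8-sig', 'latin-1', etc. if this guess is wrong
with open('RUvideos.csv', 'r', encoding=src_encoding, errors='replace') as src:
    data = src.read()
with open('RUvideos_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(data)

Then upload the re-encoded file back to the raw bucket and re-run the job.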
I set the Lambda function "timeout" duration to 10 minutes, but it still gives me a timeout error.
I also tried increasing the duration to 15 minutes and it failed again.
Before timing out, the function created the parquet file in the destination folder, but no table was created in the Glue catalog.
Can someone help me fix this issue?
Change the Lambda general config memory to 512 MB instead of 128
increase memory