Cricket Statistics Data Pipeline in Google Cloud using Airflow | Data Engineering Project
- Published 17 Dec 2023
- Looking to get in touch?
Drop me a line at vishal.bulbule@gmail.com, or schedule a meeting via topmate.io/vishal_bulbule
Cricket Statistics Data Pipeline in Google Cloud using Airflow, Dataflow, Cloud Functions and Looker Studio
Data Retrieval: We fetch data from the Cricbuzz API using Python.
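The retrieval step could look roughly like the sketch below. The endpoint URL, header names, and response shape are assumptions (the public Cricbuzz API is served through RapidAPI, and the exact JSON layout depends on the endpoint you call), so treat this as a sketch rather than the project's actual code.

```python
import json
import urllib.request

# Assumed endpoint and headers -- the Cricbuzz API on RapidAPI needs an API
# key; adjust the URL and header values to match your RapidAPI subscription.
RANKINGS_URL = "https://cricbuzz-cricket.p.rapidapi.com/stats/v1/rankings/batsmen"
HEADERS = {
    "X-RapidAPI-Key": "YOUR_API_KEY",
    "X-RapidAPI-Host": "cricbuzz-cricket.p.rapidapi.com",
}

def parse_rankings(payload: dict) -> list[dict]:
    """Keep only the fields the pipeline later loads into BigQuery."""
    return [
        {"rank": p.get("rank"), "name": p.get("name"), "country": p.get("country")}
        for p in payload.get("rank", [])
    ]

def fetch_rankings() -> list[dict]:
    """Call the (assumed) rankings endpoint and extract the records."""
    req = urllib.request.Request(RANKINGS_URL, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return parse_rankings(json.load(resp))

# Usage (requires a valid RapidAPI key):
# rows = fetch_rankings()
```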
Storing Data in GCS: After fetching the data, we store it in a CSV file in Google Cloud Storage (GCS).
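A minimal sketch of the storage step, assuming a split between CSV serialization and the upload (both helper names are made up, and the bucket and object names are placeholders):

```python
import csv
import io

def records_to_csv(records: list[dict]) -> str:
    """Serialize ranking records to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["rank", "name", "country"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def upload_to_gcs(csv_text: str, bucket_name: str, blob_name: str) -> None:
    """Upload the CSV to a GCS bucket (names are placeholders)."""
    from google.cloud import storage  # pip install google-cloud-storage
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(csv_text, content_type="text/csv")

# Usage (needs GCP credentials and an existing bucket):
# rows = [{"rank": "1", "name": "Babar Azam", "country": "Pakistan"}]
# upload_to_gcs(records_to_csv(rows), "my-cricket-bucket", "batsmen_rankings.csv")
```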
Cloud Function Trigger: Create a Cloud Function that fires on file upload to the GCS bucket; it executes whenever a new CSV file is detected and triggers the Dataflow job.
Cloud Function Execution: Inside the Cloud Function, code triggers a Dataflow job. Ensure you handle the trigger event correctly and pass the required parameters to initiate the Dataflow job.
Dataflow Job: The Dataflow job, triggered by the Cloud Function, loads the data from the CSV file in the GCS bucket into BigQuery. Ensure the template parameters (input file, schema file, output table) are configured.
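The Cloud Function plus Dataflow launch can be sketched as below. The template path and parameter names follow the Google-provided GCS_Text_to_BigQuery classic template; the project ID, bucket, dataset, and table names are placeholders, and `trigger_dataflow` is a hypothetical 1st-gen function entry point, not necessarily the project's actual code.

```python
def build_launch_request(project: str, bucket: str, file_name: str) -> dict:
    """Request body for the GCS_Text_to_BigQuery classic template.
    All bucket, dataset, and table names below are placeholders."""
    return {
        "jobName": "cricket-stats-load",
        "parameters": {
            "javascriptTextTransformGcsPath": f"gs://{bucket}/udf.js",
            "JSONPath": f"gs://{bucket}/bq_schema.json",
            "javascriptTextTransformFunctionName": "transform",
            "outputTable": f"{project}:cricket_dataset.batsmen_rankings",
            "inputFilePath": f"gs://{bucket}/{file_name}",
            "bigQueryLoadingTemporaryDirectory": f"gs://{bucket}/tmp",
        },
    }

def trigger_dataflow(event, context):
    """Cloud Function (1st gen) entry point for a GCS 'finalize' event."""
    from googleapiclient.discovery import build  # pip install google-api-python-client
    project = "my-project-id"  # placeholder
    body = build_launch_request(project, event["bucket"], event["name"])
    service = build("dataflow", "v1b3")
    service.projects().templates().launch(
        projectId=project,
        gcsPath="gs://dataflow-templates/latest/GCS_Text_to_BigQuery",
        body=body,
    ).execute()
```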
Looker Studio Dashboard: BigQuery serves as the data source for the Looker Studio dashboard. Configure Looker Studio to connect to BigQuery and build the dashboard on the loaded data.
Github Repo for all code used in this project
github.com/vishal-bulbule/cri...
============================================
Associate Cloud Engineer - Complete Free Course
• Associate Cloud Engine...
Google Cloud Data Engineer Certification Course
• Google Cloud Data Engi...
Google Cloud Platform(GCP) Tutorials
• Google Cloud Platform(...
Generative AI
• Generative AI
Getting Started with Duet AI
• Getting started with D...
Google Cloud Projects
• Google Cloud Projects
Python For GCP
• Python for GCP
Terraform Tutorials
• Terraform Associate C...
Linkedin
/ vishal-bulbule
Medium Blog
/ vishalbulbule
Github Repository for Source Code
github.com/vishal-bulbule
Email - vishal.bulbule@techtrapture.com
#dataengineeringessentials #dataengineers #dataengineeringproject #airflow #dataflow #cloudcomposer #bigquery #looker #googlecloud #datapipeline - Science & Technology
Very good video. I would recommend it to anyone who is new to GCP.
I was thinking of building a project on GCP and your video arrived. Great work, sir! Thank you.
I sincerely recommend this to people who want to explore DE pipeline orchestration on GCP.
I was looking for this type of video for a long time. Thanks.
Learnt a lot from you. Thank you, sir.
Good job. Looks like the best video for GCP ELT & other GCP stuff.
Glad it was helpful!
Thanks is a small word to you sir..🙏
This is the best explanation I have ever seen on YouTube. It has been very helpful to me. I completed this project end to end and learnt so many things.
Glad that it helped you.
Thank you, learned a lot from you sir
Happy to know. Keep learning brother 🎉
Thanks a lot for such a great explanation. Could you please share which video recording/editing tool you use?
Thanks
Hi Vishal, this is a really great video, but it would be very helpful if you could also explain the code that you have written from 6:01.
I am getting stuck on the Airflow code. I think it might be an issue with the filename in the Python code, bash_command='python /home/airflow/gcs/dags/scripts/extract_data_and_push_gcs.py'. I have uploaded extract_data_and_push_gcs.py to the scripts folder under dags.
However, is there any way to check the path /home/airflow/gcs/dags/scripts/?
/home/airflow/gcs/dags = your dags GCS bucket
It's the same path.
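For reference, on Cloud Composer workers /home/airflow/gcs/dags is a Cloud Storage FUSE mount of the environment's DAGs bucket, so a scripts/ sub-folder uploaded to the bucket appears at the same relative path. A tiny sketch of that convention (the helper name is made up):

```python
from pathlib import PurePosixPath

# On Cloud Composer, the environment's DAGs bucket is mounted here via
# Cloud Storage FUSE, so gs://<dags-bucket>/dags/scripts/x.py shows up
# at /home/airflow/gcs/dags/scripts/x.py on the workers.
DAGS_ROOT = PurePosixPath("/home/airflow/gcs/dags")

def script_command(script_name: str) -> str:
    """Build the bash_command for a BashOperator (hypothetical helper)."""
    return f"python {DAGS_ROOT / 'scripts' / script_name}"
```

You can also confirm the bucket contents from your machine with `gcloud composer environments storage dags list`.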
I have been facing issues invoking the Dataflow job while using the default App Engine service account. Could you let me know if you used a specific service account with the Cloud Function?
No, I am using the same default service account. What error are you getting?
Nice video. Just one question: why do you create a Dataflow job? You could insert the rows using Python.
Yes, I agree, but as a project I want to show the complete orchestration process and use multiple services.
@@techtrapture thanks for the quick answer. I will watch all your videos.
Hi Vishal, in this and your other Composer videos you use standard Airflow operators (for example, Python or Bash). Do you know how to install Google Cloud Airflow package for Google cloud specific operators? I've tried to upload the wheel to /plugins bucket, but nothing happens. Composer can't import Google Cloud operators (like pubsub) and DAGs with these operators are listed as broken.
Thanks!
I usually refer to this code sample:
airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/index.html
@@techtrapture thanks! But how to use these operators in Composer?
In Airflow I just pip install the package. How to do this in Composer?!
Ah, OK, I understand your question now... you have to add it to requirements.txt and keep it in the dags folder. Other options are also available here:
cloud.google.com/composer/docs/how-to/using/installing-python-dependencies
@@techtrapture yes, this is exactly what I needed. I can use either of these options, depending on the DAGs. Great!
I tried the same steps as in your video, but I got this error when running the Dataflow job through the template. Could you please help me figure out what mistake I have made? I used the same schema you used.
Error message from worker: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: Failed to serialize json to table row: 1,Babar Azam,Pakistan
Are you using the same JSON files?
Yes @@techtrapture
Below is the JSON file
{
"BigQuery Schema": [{
"name": "rank",
"type": "STRING"
},
{
"name": "name",
"type": "STRING"
},
{
"name": "country",
"type": "STRING"
}
]
}
I tried the rank column with both STRING and INTEGER data types. For both I am getting the same issue.
@sampathgoud8108 I was getting the same error; it got resolved after I put 'transform' as the JavaScript UDF name in Optional Parameters while setting up the Dataflow job.
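That fix makes sense: the "Failed to serialize json to table row" error on a raw line like "1,Babar Azam,Pakistan" usually means the template never ran a UDF to turn the CSV line into a JSON object matching the schema. As a hedged illustration, here is a Python equivalent of what such a 'transform' UDF typically does (the field list is taken from the schema JSON above; the function itself is illustrative, not the project's UDF):

```python
SCHEMA_FIELDS = ["rank", "name", "country"]  # matches the BigQuery Schema JSON above

def csv_line_to_row(line: str) -> dict:
    """Illustrative Python equivalent of the JavaScript 'transform' UDF:
    split a CSV line and emit an object whose keys match the schema."""
    values = line.strip().split(",")
    if len(values) != len(SCHEMA_FIELDS):
        raise ValueError(f"expected {len(SCHEMA_FIELDS)} fields, got {len(values)}")
    return dict(zip(SCHEMA_FIELDS, values))
```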
This tutorial is 😩 a waste of time for beginners. He did not show how to connect Python to GCP before storing data in the bucket. There are a lot of missing steps.
You didn't show how to connect to GCP before storing data in the bucket. You have skipped a lot of steps, and the video lacks quality. You should also include which dependencies to use. Just running your code and uploading it to GitHub is not everything.