Cricket Statistics Data Pipeline in Google Cloud using Airflow | Data Engineering Project

  • Published 17 Dec 2023
  • Looking to get in touch?
    Drop me a line at vishal.bulbule@gmail.com, or schedule a meeting using the provided link: topmate.io/vishal_bulbule

    Cricket Statistics Data Pipeline in Google Cloud using Airflow, Dataflow, Cloud Functions and Looker Studio
    Data Retrieval: We fetch data from the Cricbuzz API using Python.
    Storing Data in GCS: After fetching the data, we store it in a CSV file in Google Cloud Storage (GCS).
    Cloud Function Trigger: Create a Cloud Function that triggers upon file upload to the GCS bucket. The function executes when a new CSV file is detected and triggers a Dataflow job.
    Cloud Function Execution: Inside the Cloud Function, we will have code that triggers a Dataflow job. Ensure you handle the trigger correctly and pass the required parameters to initiate the Dataflow job.
    Dataflow Job: The Dataflow job is triggered by the Cloud Function and loads the data from the CSV file in the GCS bucket into BigQuery. Ensure you have set up the necessary configurations.
    Looker Dashboard: BigQuery serves as the data source for your Looker Studio dashboard. Configure Looker to connect to BigQuery and create the dashboard based on the data loaded.
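The retrieval and storage steps above can be sketched in Python. This is a minimal sketch: the Cricbuzz endpoint, header names, and bucket/file names are assumptions for illustration, not the project's exact code; check the API listing for the real values.

```python
import csv
import io
import json
import urllib.request


def fetch_batsmen_rankings(api_key):
    """Fetch batsmen rankings from the Cricbuzz API (RapidAPI-hosted).

    The URL and header names below follow RapidAPI conventions and are
    assumptions; verify them against the actual API documentation.
    """
    url = "https://cricbuzz-cricket.p.rapidapi.com/stats/v1/rankings/batsmen"
    req = urllib.request.Request(url, headers={
        "X-RapidAPI-Key": api_key,
        "X-RapidAPI-Host": "cricbuzz-cricket.p.rapidapi.com",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def rows_to_csv(records):
    """Serialize a list of {rank, name, country} dicts as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["rank", "name", "country"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def upload_to_gcs(bucket_name, blob_name, csv_text):
    """Upload the CSV text to a GCS bucket (needs google-cloud-storage)."""
    # Imported lazily so the pure helpers above stay usable without
    # the GCP client library installed.
    from google.cloud import storage
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_string(
        csv_text, content_type="text/csv"
    )
```

The CSV helper is kept separate from the network and GCS calls so the row formatting can be tested locally before wiring up credentials.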
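The Cloud Function step can be sketched as a background function on the bucket's object-finalize event that launches the Google-provided GCS_Text_to_BigQuery Dataflow template. The project, region, bucket, table, and file names below are placeholders; the template parameter names follow the public template but should be verified against its documentation.

```python
PROJECT = "my-project"      # placeholder
REGION = "us-central1"      # placeholder
TEMPLATE = "gs://dataflow-templates/latest/GCS_Text_to_BigQuery"


def launch_body(bucket, file_name):
    """Assemble launch parameters for the GCS_Text_to_BigQuery template.

    Parameter names follow the Google-provided template; the UDF, schema,
    and output-table paths are placeholders.
    """
    return {
        "jobName": "cricket-stats-load",
        "parameters": {
            "inputFilePattern": f"gs://{bucket}/{file_name}",
            "JSONPath": f"gs://{bucket}/bq_schema.json",
            "javascriptTextTransformGcsPath": f"gs://{bucket}/udf.js",
            "javascriptTextTransformFunctionName": "transform",
            "outputTable": f"{PROJECT}:cricket.batsmen_rankings",
            "bigQueryLoadingTemporaryDirectory": f"gs://{bucket}/temp",
        },
    }


def trigger_dataflow(event, context):
    """Background Cloud Function entry point: fires on object finalize."""
    # Imported lazily: needs google-api-python-client in requirements.txt.
    from googleapiclient.discovery import build
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body=launch_body(event["bucket"], event["name"]),
    ).execute()
```

Keeping the parameter assembly in its own function makes it easy to check the request body without actually launching a job.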
    Github Repo for all code used in this project
    github.com/vishal-bulbule/cri...
    ============================================
    Associate Cloud Engineer -Complete Free Course
    • Associate Cloud Engine...
    Google Cloud Data Engineer Certification Course
    • Google Cloud Data Engi...
    Google Cloud Platform(GCP) Tutorials
    • Google Cloud Platform(...
    Generative AI
    • Generative AI
    Getting Started with Duet AI
    • Getting started with D...
    Google Cloud Projects
    • Google Cloud Projects
    Python For GCP
    • Python for GCP
    Terraform Tutorials
    • Terraform Associate C...
    Linkedin
    / vishal-bulbule
    Medium Blog
    / vishalbulbule
    Github Repository for Source Code
    github.com/vishal-bulbule
    Email - vishal.bulbule@techtrapture.com
    #dataengineeringessentials #dataengineers #dataengineeringproject #airflow #dataflow #cloudcomposer #bigquery #looker #googlecloud #datapipeline
  • Science & Technology

Comments • 33

  • @shyjukoppayilthiruvoth6568
    @shyjukoppayilthiruvoth6568 A month ago +1

    Very good video. Would recommend to anyone who is new to GCP.

  • @dhananjaylakkawar4621
    @dhananjaylakkawar4621 6 months ago +2

    I was thinking of building a project on GCP and your video arrived. Great work, sir! Thank you.

  • @venkatatejanatireddi8018
    @venkatatejanatireddi8018 6 months ago +1

    I sincerely recommend this to people who want to explore DE pipeline orchestration on GCP.

  • @ajayagrawal7586
    @ajayagrawal7586 2 months ago

    I was looking for this type of video for a long time. Thanks.

  • @prabhuduttasahoo7802
    @prabhuduttasahoo7802 2 months ago

    Learnt a lot from you. Thank you, sir.

  • @brjkumar
    @brjkumar 5 months ago +2

    Good job. Looks like the best video for GCP ELT & other GCP stuff.

  • @balajichakali9293
    @balajichakali9293 4 months ago

    Thanks is a small word for you, sir..🙏
    This is the best explanation I have ever seen on YouTube. It has been very helpful to me. I have completed this project end to end and have learnt so many things.

  • @wreckergta5470
    @wreckergta5470 4 months ago

    Thank you, learned a lot from you, sir.

    • @techtrapture
      @techtrapture  4 months ago

      Happy to know. Keep learning brother 🎉

  • @user-ws9xy6db6y
    @user-ws9xy6db6y 4 months ago

    Thanks a lot for such great explanation. Can you please share which video recording/editing tool is being used?

  • @rishiraj2548
    @rishiraj2548 6 months ago +1

    Thanks

  • @Anushri_M29
    @Anushri_M29 A month ago

    Hi Vishal, this is a really great video, but it would be very helpful if you could also explain the code that you have written from 6:01.

  • @NirvikVermaBCE
    @NirvikVermaBCE 3 months ago

    I am getting stuck on the Airflow code. I think it might be an issue with the filename in the Python code: bash_command='python /home/airflow/gcs/dags/scripts/extract_data_and_push_gcs.py'. I have uploaded extract_data_and_push_gcs.py in the scripts folder of dags.
    However, is there any way to check the path /home/airflow/gcs/dags/scripts/?

    • @techtrapture
      @techtrapture  3 months ago

      /home/airflow/gcs/dags = your DAGs GCS bucket
      It's the same path.
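For context on the reply above: Cloud Composer syncs the environment's GCS bucket to /home/airflow/gcs on the workers, so a file visible at /home/airflow/gcs/dags/scripts/ lives under the dags/scripts/ prefix of the DAGs bucket. A small sketch of that mapping (the bucket name is a placeholder):

```python
COMPOSER_MOUNT = "/home/airflow/gcs/"


def local_to_gcs(local_path, env_bucket):
    """Translate a Composer worker path to the matching GCS object path.

    Cloud Composer syncs the environment bucket to /home/airflow/gcs on
    the workers (dags/, data/ and plugins/ subfolders), so any file at
    /home/airflow/gcs/dags/... lives at gs://<bucket>/dags/...
    """
    if not local_path.startswith(COMPOSER_MOUNT):
        raise ValueError(f"not a Composer-mounted path: {local_path}")
    return f"gs://{env_bucket}/{local_path[len(COMPOSER_MOUNT):]}"
```

So the script's presence can be checked by listing the dags/scripts/ prefix of the environment bucket in the Cloud Console or with gsutil.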

  • @venkatatejanatireddi8018
    @venkatatejanatireddi8018 6 months ago

    I have been facing issues invoking the Dataflow job while using the default App Engine service account. Could you let me know if you were using a specific service account to work with the Cloud Function?

    • @techtrapture
      @techtrapture  6 months ago

      No, I am using the same default service account. What error are you getting?

  • @ShigureMuOnline
    @ShigureMuOnline A month ago

    Nice video. Just one question: why do you create a Dataflow job? You can insert rows using Python?

    • @techtrapture
      @techtrapture  A month ago +1

      Yes, I agree, but as a project I want to show the complete orchestration process and use multiple services.

    • @ShigureMuOnline
      @ShigureMuOnline A month ago

      @@techtrapture Thanks for the fast answer. I will watch all your videos.

  • @SwapperTheFirst
    @SwapperTheFirst 4 months ago

    Hi Vishal, in this and your other Composer videos you use standard Airflow operators (for example, Python or Bash). Do you know how to install the Google Cloud Airflow package for Google Cloud-specific operators? I've tried uploading the wheel to the /plugins bucket, but nothing happens. Composer can't import Google Cloud operators (like pubsub), and DAGs with these operators are listed as broken.
    Thanks!

    • @techtrapture
      @techtrapture  4 months ago +1

      I usually refer to this code sample:
      airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/index.html

    • @SwapperTheFirst
      @SwapperTheFirst 4 months ago

      @@techtrapture Thanks! But how do I use these operators in Composer?
      In Airflow I just pip install the package. How do I do this in Composer?!

    • @techtrapture
      @techtrapture  4 months ago +1

      Ah, OK, I understand your question now. You have to list it in a requirements.txt file and install it via the environment's PyPI packages option. Other options are also described here:
      cloud.google.com/composer/docs/how-to/using/installing-python-dependencies
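For reference, Composer installs PyPI dependencies at the environment level rather than from files bundled alongside DAGs. One way to do this (package version, environment name, and location below are placeholders) is a requirements.txt file passed to gcloud:

```shell
# requirements.txt -- one PyPI requirement per line, e.g.:
#   apache-airflow-providers-google>=10.0.0

# Install the listed packages into the Composer environment
# (environment name and location are placeholders):
gcloud composer environments update my-composer-env \
    --location us-central1 \
    --update-pypi-packages-from-file requirements.txt
```

The update restarts the environment's workers, so newly installed providers become importable in DAGs only after the operation completes.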

    • @SwapperTheFirst
      @SwapperTheFirst 4 months ago

      @@techtrapture Yes, this is exactly what I needed. I can use both of these options, depending on the DAGs. Great!

  • @sampathgoud8108
    @sampathgoud8108 2 months ago

    I tried the same way as per your video, but I got this error when running the Dataflow job through the template. Could you please help me figure out exactly what mistake I have made? I used the same schema which you have used.
    Error message from worker: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: Failed to serialize json to table row: 1,Babar Azam,Pakistan

    • @techtrapture
      @techtrapture  2 months ago

      Are you using the same JSON files?

    • @sampathgoud8108
      @sampathgoud8108 2 months ago

      Yes @@techtrapture
      Below is the JSON file:
      {
        "BigQuery Schema": [
          {
            "name": "rank",
            "type": "STRING"
          },
          {
            "name": "name",
            "type": "STRING"
          },
          {
            "name": "country",
            "type": "STRING"
          }
        ]
      }

    • @sampathgoud8108
      @sampathgoud8108 2 months ago

      I tried the rank column with both STRING and INTEGER data types. For both I am getting the same issue.

    • @pankajgurbani1484
      @pankajgurbani1484 2 months ago

      @sampathgoud8108 I was getting the same error; it got resolved after I put 'transform' in the JavaScript UDF name field under Optional Parameters while setting up the Dataflow job.
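For readers hitting the same serialization error: the Google-provided GCS Text to BigQuery template applies a JavaScript UDF to each input line, and the function name given in Optional Parameters ('transform' in this thread) must match a function defined in the UDF file. A minimal sketch matching the three-column schema above, with the field order assumed from the error message:

```javascript
// udf.js -- uploaded to GCS and referenced by the Dataflow template.
// Converts one CSV line ("1,Babar Azam,Pakistan") into a JSON string
// whose keys match the BigQuery schema (rank, name, country).
function transform(line) {
  var values = line.split(',');
  var obj = {
    rank: values[0],
    name: values[1],
    country: values[2]
  };
  return JSON.stringify(obj);
}
```

Without this UDF (or with a mismatched function name), the template tries to interpret the raw CSV line as JSON, which produces exactly the "Failed to serialize json to table row" error quoted above.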

  • @TechwithRen-Z
    @TechwithRen-Z A month ago +1

    This tutorial is 😩 a waste of time for beginners. He did not show how to connect Python to GCP before storing data in the bucket. There are a lot of missing steps.

  • @Rajdeep6452
    @Rajdeep6452 4 months ago +1

    You didn't show how to connect to GCP before storing data in the bucket. You have jumped a lot of steps, and your video lacks quality. You should also include which dependencies to use and so on. Just running your code and uploading it to GitHub is not everything.