Scalable Data Ingestion Architecture Using Airflow and Spark | Komodo Health

  • Added Jun 1, 2024
  • Get the slides: www.datacouncil.ai/talks/scal...
    ABOUT THE TALK:
    This is an experience report on implementing and moving to a scalable data ingestion architecture. The requirements were to process tens of terabytes of data coming from several sources with data refresh cadences varying from daily to annual. The main challenge is that each provider has their own quirks in schemas and delivery processes. To achieve this we use Apache Airflow to organize the workflows and to schedule their execution, including developing custom Airflow hooks and operators to handle similar tasks in different pipelines. We are running on AWS using Apache Spark to horizontally scale the data processing and Kubernetes for container management.
    We will explain the reasons for this architecture and share the pros and cons we have observed while working with these technologies. We will also explain how this approach has simplified the process of bringing in new data sources and considerably reduced maintenance and operation overhead, as well as the challenges we faced during this transition.
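    The per-provider "quirks in schemas" problem the talk describes can be sketched minimally: a column mapping per provider that normalizes each source into one common schema. The provider names, columns, and the `normalize_record` helper below are illustrative assumptions, not Komodo Health's actual code.

    ```python
    # Hypothetical sketch: normalize provider-specific records into a common schema.
    # All names here are illustrative assumptions, not the speakers' real code.

    COMMON_SCHEMA = ["patient_id", "service_date", "diagnosis_code"]

    # Each data provider delivers the same information under different column names.
    PROVIDER_COLUMN_MAPS = {
        "provider_a": {"pat_id": "patient_id", "dos": "service_date", "dx": "diagnosis_code"},
        "provider_b": {"member": "patient_id", "svc_dt": "service_date", "icd": "diagnosis_code"},
    }

    def normalize_record(provider: str, record: dict) -> dict:
        """Rename provider-specific columns to the common schema, dropping extras."""
        mapping = PROVIDER_COLUMN_MAPS[provider]
        out = {mapping[k]: v for k, v in record.items() if k in mapping}
        missing = [c for c in COMMON_SCHEMA if c not in out]
        if missing:
            raise ValueError(f"{provider} record missing columns: {missing}")
        return out
    ```

    In the architecture described, this kind of mapping logic would live inside custom Airflow operators so that each pipeline shares one normalization path rather than reimplementing it per source.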
    ABOUT THE SPEAKERS:
    Dr. Johannes Leppä is a Data Engineer building scalable solutions for ingesting complex data sets at Komodo Health. Johannes is interested in the design of distributed systems and intricacies in the interactions between different technologies. He claims not to be lazy, but gets most excited about automating his work. Prior to data engineering he conducted research in the field of aerosol physics at the California Institute of Technology, and holds a PhD in physics from the University of Helsinki. Johannes is passionate about metal: wielding it, forging it and, especially, listening to it.
    ABOUT DATA COUNCIL:
    Data Council (www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers. Make sure to subscribe to our channel for more videos, including DC_THURS, our series of live online interviews with leading data professionals from top open source projects and startups.
    FOLLOW DATA COUNCIL:
    Twitter: /datacouncilai
    LinkedIn: /datacouncil-ai
    Facebook: /datacouncilai
    Eventbrite: www.eventbrite.com/o/data-cou...
  • Science & Technology

Comments • 8

  • @Barnabassteiniger
    1 year ago +1

    Wow. Great speaker. Learned a lot. Nice to see someone is dealing with the same problem.

  • @dsinghr
    4 years ago +2

    Composer vs Airflow: Airflow version upgrades are a nightmare; you won't have to worry about that if you use Composer. Another advantage is IPv4 addresses: as they are limited, you don't have to think too much about them if you use Composer. Imagine you created multiple namespaces for different use cases and each use case has 3-5 different environments; just think about how many IP addresses you would need. You may exhaust your quota pretty quickly that way. So Composer is great, but I think it is still in beta.

  • @MarioRugeles
    2 years ago

    I've got a question: why not use AWS EMR's autoscaling for the Spark layer?

  • @supermousedd
    5 years ago

    Very Cooooool!

  • @sumitkumarsahoo
    3 years ago

    Can anyone tell me what that commonization tool is that brings schemas or columns into shape for transformation or joining? Curious about it; it seems to be built in-house at that organization.

  • @Funfina
    2 years ago

    What could be a common schema?

  • @atampanday6085
    4 years ago

    Why not use EKS?

  • @dsinghr
    4 years ago

    Why wouldn't you use Cloud Dataflow on GCP instead of Spark? Then you wouldn't have to worry about Kubernetes at all as far as ETL is concerned. Airflow itself should definitely run inside Kubernetes.