Scalable Data Ingestion Architecture Using Airflow and Spark | Komodo Health

  • Added Jun 1, 2024
  • Get the slides: www.datacouncil.ai/talks/scal...
    ABOUT THE TALK:
    This is an experience report on implementing and moving to a scalable data ingestion architecture. The requirements were to process tens of terabytes of data coming from several sources with data refresh cadences varying from daily to annual. The main challenge is that each provider has their own quirks in schemas and delivery processes. To achieve this we use Apache Airflow to organize the workflows and to schedule their execution, including developing custom Airflow hooks and operators to handle similar tasks in different pipelines. We are running on AWS using Apache Spark to horizontally scale the data processing and Kubernetes for container management.
    We will explain the reasons for this architecture and share the pros and cons we have observed while working with these technologies. We will also explain how this approach has simplified the process of bringing in new data sources and considerably reduced maintenance and operation overhead, as well as the challenges we faced during this transition.
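    The per-provider "quirks in schemas" problem the talk describes can be sketched minimally: a column mapping per provider that normalizes each source into one common schema. The provider names, columns, and the `normalize_record` helper below are illustrative assumptions, not Komodo Health's actual code.

    ```python
    # Hypothetical sketch: normalize provider-specific records into a common schema.
    # All names here are illustrative assumptions, not the speakers' real code.

    COMMON_SCHEMA = ["patient_id", "service_date", "diagnosis_code"]

    # Each data provider delivers the same information under different column names.
    PROVIDER_COLUMN_MAPS = {
        "provider_a": {"pat_id": "patient_id", "dos": "service_date", "dx": "diagnosis_code"},
        "provider_b": {"member": "patient_id", "svc_dt": "service_date", "icd": "diagnosis_code"},
    }

    def normalize_record(provider: str, record: dict) -> dict:
        """Rename provider-specific columns to the common schema, dropping extras."""
        mapping = PROVIDER_COLUMN_MAPS[provider]
        out = {mapping[k]: v for k, v in record.items() if k in mapping}
        missing = [c for c in COMMON_SCHEMA if c not in out]
        if missing:
            raise ValueError(f"{provider} record missing columns: {missing}")
        return out
    ```

    In the architecture described, this kind of mapping logic would live inside custom Airflow operators so that each pipeline shares one normalization path rather than reimplementing it per source.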
    ABOUT THE SPEAKERS:
    Dr. Johannes Leppä is a Data Engineer building scalable solutions for ingesting complex data sets at Komodo Health. Johannes is interested in the design of distributed systems and intricacies in the interactions between different technologies. He claims not to be lazy, but gets most excited about automating his work. Prior to data engineering he conducted research in the field of aerosol physics at the California Institute of Technology, and holds a PhD in physics from the University of Helsinki. Johannes is passionate about metal: wielding it, forging it and, especially, listening to it.
    ABOUT DATA COUNCIL:
    Data Council (www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers. Make sure to subscribe to our channel for more videos, including DC_THURS, our series of live online interviews with leading data professionals from top open source projects and startups.
    FOLLOW DATA COUNCIL:
    Twitter: /datacouncilai
    LinkedIn: /datacouncil-ai
    Facebook: /datacouncilai
    Eventbrite: www.eventbrite.com/o/data-cou...
  • Science & Technology

Comments • 8

  • @Barnabassteiniger
    1 year ago +1

    Wow. Great speaker. Learned a lot. Nice to see someone is dealing with the same problem.

  • @dsinghr
    4 years ago +2

    Composer vs Airflow: Airflow version upgrades are a nightmare; you won't have to worry about that if you use Composer. Another advantage is IPv4 addresses: as they are limited, you don't have to think too much about them if you use Composer. Imagine you created multiple namespaces for different use cases and each use case has 3-5 different environments; just think about how many IP addresses you would need. You may exhaust your quota pretty quickly that way. So Composer is great, but I think it is still in beta.

  • @MarioRugeles
    2 years ago

    I've got a question: why not use AWS EMR's autoscaling for the Spark layer?

  • @supermousedd
    5 years ago

    Very Cooooool!

  • @sumitkumarsahoo
    3 years ago

    Can anyone tell me what that commonization tool is that brings schemas or columns into shape for transformation or joining? Curious about it; it seems to be built in-house at that organization.

  • @Funfina
    2 years ago

    What could be a common schema?

  • @atampanday6085
    4 years ago

    Why not use EKS?

  • @dsinghr
    4 years ago

    Why wouldn't you use Cloud Dataflow on GCP instead of Spark? Then you wouldn't have to worry about Kubernetes at all as far as ETL is concerned. Airflow itself should definitely run inside Kubernetes.