Data Architecture 101: The Modern Data Warehouse

  • Published Sep 5, 2024

Comments • 42

  • @KahanDataSolutions  1 year ago  +4

    ►► The Starter Guide for Modern Data (Free PDF) → www.kahandatasolutions.com/startermds

  • @shashankemani1609  1 year ago  +1

    It's really a great video for understanding the high-level architecture of the modern data stack. It would be great if you could start an in-depth data modelling playlist, as it plays a crucial role in designing data engineering pipelines. Thank you

  • @Austin-dm5bp  1 year ago

    Really appreciated seeing the different examples, as it helped to underline how the stages remain the same, regardless of the specific tools being used.

  • @shakedm7256  1 year ago  +1

    Just discovered your channel recently and I wanted to say it is a gold mine! Keep making this kind of content!

  • @aliahmad1987  7 months ago  +1

    Great video! Your pace, presentation and visuals are really on point.
    Keep up the good work :)

  • @johnflanagan6367  2 months ago

    I just discovered your videos. They are excellent. Clear, concise and to the point. Great content! Thanks so much!

  • @jayakrishna8121  1 year ago  +2

    Awesome, this is really useful. Keep making these sample architecture videos.

  • @SheranneTan-n1p  1 month ago

    Great content! I had a question - why would companies choose to use standalone ELT / ETL providers (e.g. Stitch, Matillion) over the native AWS Glue / Azure Data Factory? Wouldn't it be easier to use the cloud provider's tools, as they would be more integrated?

  • @thomasbrothaler4963  1 year ago  +1

    How did I not find your channel much earlier? Your videos are extremely concise, well visualized and informative. I am a Data Scientist transitioning to Data Engineering (because in gaming I am also always the healer/support 😉)

  • @user-ri1ox6ut2y  1 year ago

    I think it would be really cool to see how, once the data has landed in the data lake, you bring all the data together, since you won't necessarily have matching IDs from different sources to work with.

  • @colter7  6 months ago

    Best data modeling videos I've come across so far, great job!

  • @navoabey6047  1 year ago

    Simple and to-the-point explanation. I think it's very important to understand the concepts as well, not just the tools; very useful for interviews also.

  • @rks.siddhartha  1 year ago

    These design and architecture videos are great to learn the concepts in bite sizes. Looking forward to more such videos.

  • @sylwiasiniakowska6939  7 months ago

    Hello! Where do ETL tools like Informatica or Alteryx land in a modern data architecture? Or not at all, because we have dbt / Azure Data Factory / SQL scripts?

  • @tomastruchly9484  1 year ago

    Heh, the concept you presented (collect data from various sources into a Snowflake DWH and transform it via dbt) is exactly what we do for a customer :) I worked on-premise, where we handled everything via scripts and Jenkins, and I must say this modern approach is better in many aspects :)

    • @KahanDataSolutions  1 year ago

      Snowflake + dbt is my favorite stack as well. Doesn't have to be overly complex to be effective.
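
      For readers curious what that stack looks like in practice, here is a minimal sketch of a dbt model that would run against Snowflake. The source, table, and column names are hypothetical, not taken from the video:

        -- models/staging/stg_orders.sql (illustrative names only)
        with source as (
            select * from {{ source('erp', 'orders') }}  -- raw table landed in Snowflake
        )
        select
            id          as order_id,
            customer_id,
            amount,
            created_at
        from source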

  • @sailedship6530  1 year ago

    What if I want to add local data marts to that traditional flow? Would that be a bad idea?
    I just want to set up local data marts and connect them to a data lake (and somehow make it replace the data warehouse; and if that's not possible, connect them to the warehouse).
    Can you please make a video to show us the disadvantages of this setup? Thank you in advance

  • @AlexKashie  11 months ago

    Wow, great content broken down simply.... Thank you.

  • @jgianan  7 months ago

    Awesome explanation and visuals! Keep it up!

  • @StartDataLate  7 months ago

    Can I ask where you would put HDFS in the current data architecture stack?

  • @michaelhunger6160  1 year ago

    I was curious coming from your "simple, small/mid-sized" data email. I expected something on efficient analytics databases like DuckDB/MotherDuck/Hydra/Firebolt - do you plan to cover those in a future episode? Basically, the other parts of the stack would stay the same; just the processing moves from Snowflake/Synapse/BigQuery to one of these more efficient, lower-cost tools.

    • @KahanDataSolutions  1 year ago

      Thanks for reading the email and checking out this video! I actually have not used any of those tools in depth so I can't speak much on them at this time. But perhaps in the future!
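
      For what it's worth, the swap described above can be sketched in DuckDB SQL; the lake and BI layers stay the same and only the query engine changes. The bucket path below is made up:

        -- DuckDB can query Parquet files in object storage directly
        INSTALL httpfs;  -- one-time extension install for S3 access
        LOAD httpfs;

        SELECT customer_id, sum(amount) AS total_spend
        FROM read_parquet('s3://my-lake/raw/orders/*.parquet')  -- hypothetical path
        GROUP BY customer_id;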

  • @pabloortiz_romero2740

    Great content, simple and clear.

  • @muftkuseng5924  6 months ago

    Is it meant to add only the new data to the data lake, or a full copy? As an example, we are using Odoo as our ERP system. The SQL database has an unpacked size of 6 GB. If I copied it daily, the amount of data would get huge. On the one hand, all the data would be persisted and I would have more options to analyze, but is this really best practice?
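
    One common pattern (a sketch, not a prescription for Odoo specifically): extract incrementally using a last-modified watermark instead of copying the full database daily. Odoo tables generally carry a write_date audit column, so a pull might look like the following, where sale_order is one example table and :last_loaded_at is a watermark stored from the previous run:

      -- Illustrative incremental extract against the Odoo Postgres DB
      SELECT *
      FROM sale_order
      WHERE write_date > :last_loaded_at;  -- only rows changed since the last load

    Each run then appends only changed rows to the lake, so full history is preserved without daily 6 GB copies.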

  • @user-ri1ox6ut2y  1 year ago

    What is the use case for bringing all of the data into the data lake prior to the data warehouse? Is it possible to bring some data into your data warehouse from the source systems directly and some data in from S3 buckets?

    • @KahanDataSolutions  1 year ago

      Keeping it in a data lake:
      - Gives a historical log of all source data in its raw form (before any DW transformations)
      - Allows you to load data faster, and separately from the DW transformation processes
      - Provides a clear location for all source data to be landed, whereas a DW might have other processes involved
      Plus, storage is less expensive nowadays, so it's less of a problem to store it all this way (to an extent).
      I'm probably missing other things, but that's just off the top of my head.
      Hope that helps!

    • @user-ri1ox6ut2y  1 year ago

      @@KahanDataSolutions Interesting, thanks for the reply! I guess my only follow-up would be around your first point about keeping a historical log. Couldn't this be done just as easily in the data warehouse, assuming you are dealing with structured data? The data could be dumped into the DW just as raw as it could be in the data lake, right?
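
      To make the "land in the lake, then load" pattern concrete, here is a rough Snowflake sketch; the stage, bucket, and table names are made up, and credentials/storage integration are omitted. The raw files stay in S3 as the historical log, and loading them into the DW is a separate, repeatable step:

        -- Hypothetical external stage over the S3 landing bucket
        CREATE STAGE raw_orders_stage
          URL = 's3://my-lake/raw/orders/'
          FILE_FORMAT = (TYPE = JSON);

        -- Load the files as-is into a raw landing table (e.g. a single
        -- VARIANT column); transformations happen later, e.g. in dbt
        COPY INTO raw.orders_landing
        FROM @raw_orders_stage;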

  • @Bravopetwal-wj3iz  5 months ago

    Superb

  • @Ka_Vin_Da  1 year ago

    Really useful ❤

  • @gatorpika  1 year ago

    What are some examples of the difference between the data warehouse and data models? So, like, if you build a star schema data warehouse, couldn't Tableau just connect directly to that rather than another layer of models? Or are you building the models to differentiate the data used by different groups (i.e. a marketing mart)? Also, would you typically materialize those as OBT views or physical tables? I kind of can't wrap my head around that part. Thanks!

    • @KahanDataSolutions  1 year ago  +2

      Great question, and the short answer is that both approaches you're mentioning are possible.
      It's typically a matter of how much logic you want to hold in the database/queries vs. offload to Tableau or another reporting tool.
      I find that a lot of companies start by going from the Data Warehouse straight to the reporting tool, but then end up shifting to having a handful of custom data mart models in between that get tied to different reports. Ideally, you can then reuse the same mart for multiple reports.
      The reason teams often struggle when adding a lot of logic directly in reporting tools is that as you add more and more complex logic, it becomes really hard to track & troubleshoot. It basically gets lost within reports. It's easier in the short term but results in duplication, conflicting logic and more.
      You also don't have easy version control, testing, transparency, etc. like you would have if you wrote it in SQL (and with a tool like dbt) and deployed it to a DB first.
      If you don't want to add an extra mart layer, it's also possible to handle a lot of that complexity within the warehouse layer. It's really up to your team and how you want to organize it.
      For the second part of your question on materialization, again there's no one-size-fits-all answer. But I find that the marts layer typically takes more of an OBT (one big table) approach, or close to it.
      For example, you can tie together a bunch of DW tables to create a common "summary" view, or one at a granularity that can be reused for multiple reports. But as you said, it's also acceptable to create additional custom models simply to separate user groups. I've seen all of the above done, and it is often a case-by-case basis.
      This was a LONG-winded response, but hopefully it was helpful. Data strategy can be confusing but at the end of the day it's just finding a way to organize tables/views/data in ways that work best for you.
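
      As a concrete sketch of that marts layer (all model and column names here are hypothetical), a dbt model materialized as a physical table that ties DW tables together at a reusable grain might look like:

        -- models/marts/mart_customer_orders.sql (illustrative names only)
        {{ config(materialized='table') }}

        select
            c.customer_id,
            c.customer_name,
            count(o.order_id) as order_count,
            sum(o.amount)     as lifetime_value
        from {{ ref('dim_customers') }} c
        left join {{ ref('fct_orders') }} o
            on o.customer_id = c.customer_id
        group by 1, 2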

    • @gatorpika  1 year ago

      @@KahanDataSolutions Thanks for spending the time to write that up; that was very helpful. I had not really thought of version control over the mart logic, since our current BI tool sort of handles that, so that makes a lot of sense. I guess I got confused since we do all the things you mentioned within what we call the "data warehouse" layer in our architecture, and probably wouldn't call that out separately on a diagram, so I was assuming there was something magical happening there that I couldn't figure out after having seen that architecture a few times. Makes sense to call it out; I guess I just wasn't bright enough to figure out why. Appreciate your content.

    • @kinuthiasteve4505  1 year ago

      @@gatorpika This is really an amazing question. I am in a situation where management wants near real-time dashboards. My manager wants to plug Tableau directly into the DB (it's AWS DynamoDB) using an ODBC driver. But my thinking is: stream the data from DynamoDB with DynamoDB Streams/Kinesis Firehose, use AWS Glue to crawl and maybe change data types, then load it into Redshift or S3, where I can connect with Tableau. I'd much appreciate your views, thanks.

    • @gatorpika  1 year ago

      @@kinuthiasteve4505 I'm not a streaming expert but yeah I think you are on the right track. I have not used Tableau in years, but it used to be more of an analysis platform for historical data, not a streaming platform, right? Like you have to manually refresh the data? We are working on something like that now where we use Kafka to consume the source data and that feeds some apps that display the real time stream and also feeds our data platform where history is accumulated and accessible via a BI tool like Tableau.

  • @bananaboydan3642  1 year ago

    Amazing video man. As a senior CS student and aspiring data engineer, I get none of this in school! Love the channel man. Are you on Instagram / Twitter?