Advancing Spark - Databricks Delta Change Feed

  • Added 27. 04. 2021
  • The Databricks Change Feed enables CDC, or Change Data Capture, in the Spark environment - this is pretty huge. Keeping track of changed records can be a hugely inefficient exercise, comparing masses of records to determine which ones have been changed by upstream events. With the Delta table change feed, we can keep an efficient track of exactly which records have changed!
    In this video, Simon looks at the initial release, exploring how the change feed works under the covers, what happens in the transaction log, and how you get started with CDC in your Lake! (A short code sketch follows below.)
    For more details on the new change feed functionality - check out the docs here: docs.microsoft.com/en-us/azur...
    As always - don't forget to hit those like & subscribe buttons, and get in touch if we can help with your advanced analytics objectives!
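
    A minimal PySpark sketch of the idea, assuming a Delta table already exists (the table name silver.customers is illustrative; spark is the ambient Databricks SparkSession):

      # Enable the change feed on an existing Delta table.
      spark.sql("""
          ALTER TABLE silver.customers
          SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
      """)

      # Read the changes between two table versions. Each row carries the
      # metadata columns _change_type ('insert', 'delete', 'update_preimage',
      # 'update_postimage'), _commit_version and _commit_timestamp.
      changes = (spark.read.format("delta")
                      .option("readChangeFeed", "true")
                      .option("startingVersion", 2)  # inclusive; illustrative
                      .option("endingVersion", 5)    # inclusive; illustrative
                      .table("silver.customers"))
      changes.show()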

Comments • 22

  • @Advancing_Terry
    @Advancing_Terry 3 years ago +2

    Who doesn't like storage costs? Great video Si.

  • @Sunkarakrwutarth9389
    @Sunkarakrwutarth9389 3 years ago

    very nice feature

  • @alexischicoine2072
    @alexischicoine2072 2 years ago

    Hi Simon. Love your videos. Do you know if there's a library or framework that exists to propagate changes from a table to an aggregated table or other modifications? I started my own as I couldn't find anything, but I'd rather use something existing.

  • @ArcaLuiNeo
    @ArcaLuiNeo 3 years ago +1

    Thanks for this excellent video.
    Would you have time to do a video on some of the better practices for doing "proper" engineering on your local machine - automated testing, CI, etc.? Especially since you mentioned the open-source Delta table...

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +1

      Our very excellent senior engineer Anna has been doing lots of work in this space. We've got some more DevOps/CI/CD/testing-focused topics on the roadmap, hopefully in the next month or two!
      Simon

  • @felipecastro3710
    @felipecastro3710 a year ago +1

    Hey Simon! I usually see people showing examples where the _commit_version is hardcoded. What would be a good way to keep track of the already processed versions of a table during a silver -> silver flow, for example? It could be useful for incrementally loading data into a silver table composed of multiple silver table joins. Thanks! (One checkpoint-based pattern is sketched at the end of this thread.)

    • @realjackofall
      @realjackofall a year ago +1

      Did you find a solution on where/how to store the last commit version?

    • @chinny4953
      @chinny4953 9 months ago

      @@realjackofall Good question. I'm also trying to figure out how to automatically read the last changed commit versions in a pipeline - hardcoding obviously won't work.
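
      One pattern that avoids hardcoding, sketched here as a suggestion rather than taken from the video: consume the change feed as a stream, so the streaming checkpoint records which commit versions have already been processed (table names and the checkpoint path are illustrative; availableNow needs a reasonably recent runtime):

        # The checkpoint tracks consumed versions, so each run picks up only
        # the commits it has not yet seen - no hardcoded _commit_version.
        changes = (spark.readStream.format("delta")
                        .option("readChangeFeed", "true")
                        .table("silver.orders"))

        (changes.writeStream
                .option("checkpointLocation", "/mnt/checkpoints/orders_cdf")
                .trigger(availableNow=True)  # drain pending changes, then stop
                .toTable("silver.orders_enriched"))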

  • @kuldipjoshi1406
    @kuldipjoshi1406 2 years ago +1

    What about parallel upserts happening over partitioned tables - how do we find the changed data on each partition separately? Because versions will be created in parallel, we won't be able to determine which versions were created for which partition. How do we determine either the timestamp or the version numbers in that case? And what about multiple versions created in the same operation? E.g. an upsert + optimize will create at least 2-3 versions in one go.

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago

      Versioning of the Delta table works across partitions - you would normally update multiple partitions as a single batch, rather than running separate parallel jobs for each partition. So you see a single Delta table version that covers all partition updates, and can roll back using that logic.
      For upsert & optimize, absolutely you'll see two different versions, one for each batch transaction. Optimize shouldn't affect any downstream tables given there is no actual data change in there!
      Simon
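
      A sketch of how to tell those versions apart after the fact, via the table history (the table name is illustrative):

        from delta.tables import DeltaTable

        # Every commit records its operation, so the MERGE (the upsert) and
        # the OPTIMIZE show up as separate, identifiable versions.
        history = DeltaTable.forName(spark, "silver.sales").history()
        (history.select("version", "timestamp", "operation")
                .orderBy("version")
                .show(truncate=False))
        # Filtering on operation = 'MERGE' isolates the versions that
        # actually changed data.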

  • @ascensiuneabm2683
    @ascensiuneabm2683 a year ago

    What is the difference between Auto Loader and CDF?

  • @pragattiwari5530
    @pragattiwari5530 a year ago

    Hi, my question might be a little dumb, but please allow me to ask.
    Let's say we have a parquet file with 1000 records and my table is CDF-enabled. If I update just one record out of those 1000, then for that one specific record I get a change log directory created, with a parquet file holding two records - the current and the previous version. When I just query my table, how will it know that for that one record it has to look at the other parquet file too? And what will happen to the old copy of that record, which is still present in the latest parquet file? Please answer ASAP!

    • @AdvancingAnalytics
      @AdvancingAnalytics  a year ago

      It's been a while since I've dug into it, but from memory the standard Delta parquet files separate inserts and updated records. If you query the table looking for table changes, it'll return records from the "inserted" parquet files, plus records from the change feed directory. If you query the table normally, it just takes all of the normal parquet. I'll double-check that's how it's working in the week! (See the read-path sketch below this thread.)
      Simon

    • @pragattiwari5530
      @pragattiwari5530 Před rokem

      @@AdvancingAnalytics Thanks, that would be really helpful.
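
      A sketch of the two read paths described in the reply above (table name and version number are illustrative):

        # A normal read resolves to the current snapshot only - the superseded
        # copy of the updated record is never returned.
        current = spark.read.table("silver.events")

        # A change feed read over the update's commit returns the pre/post
        # images written under the table's _change_data directory.
        changed = (spark.read.format("delta")
                        .option("readChangeFeed", "true")
                        .option("startingVersion", 3)
                        .option("endingVersion", 3)
                        .table("silver.events"))
        changed.filter("_change_type LIKE 'update%'").show()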

  • @kuldipjoshi1406
    @kuldipjoshi1406 3 years ago

    readStream upserts are gonna drive data lake size crazy, I think... maybe I can sort of hijack the micro-batch and do a VACUUM there (I see a risk in doing that, though). Useful feature anyway. Looking forward to a video continuing the flights dataset example with this feature.

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +2

      Yep, definitely gonna be some considerations if you're always upserting. Splitting it into insert and update transactions will make it more efficient, but that's harder to do in a stream! (A micro-batch sketch follows at the end of this thread.)
      I've not yet looked at whether we can do anything with OPTIMIZE over the change feed... Doubt it but worth a look!
      Simon

    • @kuldipjoshi1406
      @kuldipjoshi1406 Před 3 lety

      @@AdvancingAnalytics I tried doing it on a stream - it worked for some time and then failed. I've set it to trigger every few minutes and now it works fine. It's always a trade-off, I think.
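
      A sketch of that micro-batch pattern - upserting per batch via foreachBatch with a processing-time trigger (table names, the join key and the checkpoint path are illustrative):

        from delta.tables import DeltaTable

        def upsert_batch(batch_df, batch_id):
            # MERGE each micro-batch into the target: one transaction per
            # batch, instead of a version per record.
            target = DeltaTable.forName(spark, "silver.customers")
            (target.alias("t")
                   .merge(batch_df.alias("s"), "t.id = s.id")
                   .whenMatchedUpdateAll()
                   .whenNotMatchedInsertAll()
                   .execute())

        (spark.readStream.table("bronze.customers")
              .writeStream
              .foreachBatch(upsert_batch)
              .trigger(processingTime="5 minutes")  # the 'every few minutes'
              .option("checkpointLocation", "/mnt/checkpoints/cust_upsert")
              .start())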

  • @baatchus1519
    @baatchus1519 3 years ago

    👍👍👍

  • @bigchallenge7987
    @bigchallenge7987 2 years ago

    I did not understand the difference between time travel and CDF - could you please explain?

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago

      Time Travel shows you the state of the data at a single point in time. Change Feed shows you which records changed between two points in time. Change Feed requires storing additional data, but makes propagating changes downstream more efficient! (See the sketch below this thread.)

    • @bigchallenge7987
      @bigchallenge7987 2 years ago

      @@AdvancingAnalytics Thanks so much, it makes more sense to me now :) And we should also not confuse this with change data capture from a source system - as I've seen in the docs, the change data feed is a simplification over CDC. I'd like to have your insights on this?
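
      A sketch contrasting the two features (table name and version numbers are illustrative):

        # Time travel: the full state of the table as of one version.
        state_at_v5 = (spark.read.format("delta")
                            .option("versionAsOf", 5)
                            .table("silver.products"))

        # Change feed: only the rows that changed between two versions.
        changed_v3_to_v5 = (spark.read.format("delta")
                                 .option("readChangeFeed", "true")
                                 .option("startingVersion", 3)
                                 .option("endingVersion", 5)
                                 .table("silver.products"))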

  • @Idea_Center
    @Idea_Center 2 years ago

    Any comments about the impact of a VACUUM run on version history? I believe a VACUUM run updates the version history.
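
    For what it's worth, a sketched note rather than an answer from the thread: VACUUM removes data files that are no longer referenced by versions inside the retention window, so change feed or time travel reads older than that window can fail even though the transaction log entries themselves remain (table name and retention are illustrative):

      # Remove files not needed by versions newer than the retention window.
      spark.sql("VACUUM silver.products RETAIN 168 HOURS")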