Advancing Spark - Setting up Databricks Unity Catalog Environments

  • Published 12 Sep 2024
  • Unity Catalog is a huge part of Databricks' platform; if you're currently using Databricks without UC enabled, you're going to miss out on some pretty big features in the future! But where do you start? Do you really need a Global AD Admin, and what do they actually have to do? How do you manage dev, test and prod if they all have to share the same metastore?!
    In this video Simon takes two existing Databricks workspaces and builds out Unity Catalog from scratch: provisioning account console access, creating a metastore, allocating access for the managed identity and locking down catalogs to their respective workspaces - see the sketch below the links. Want to get started with UC? Check this video out!
    For some more detailed step-by-step examples, see the MS Learn docs: learn.microsof...
    And for the training Simon mentions at the start, check out advancinganaly...
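
    As a rough illustration of the catalog creation and catalog-to-workspace lockdown the video walks through (every name, path and ID below is a placeholder, not taken from the video, and the binding call is an assumption-laden sketch of the workspace-bindings REST API):

    ```python
    import requests

    # Placeholder names/paths: create a dev catalog whose managed data lands in
    # the dev lake rather than the metastore root (assumes an external location
    # already covers this path).
    spark.sql("""
        CREATE CATALOG IF NOT EXISTS dev_lakehouse
        MANAGED LOCATION 'abfss://managed@devlake.dfs.core.windows.net/'
    """)

    # Binding the catalog to the dev workspace only (so prod workspaces can't
    # see it) is done in the account console, or via the workspace-bindings
    # REST API once the catalog's isolation mode is set to ISOLATED:
    host = "https://adb-1111111111111111.11.azuredatabricks.net"  # placeholder
    token = dbutils.secrets.get("admin-scope", "admin-pat")       # placeholder
    requests.patch(
        f"{host}/api/2.1/unity-catalog/workspace-bindings/catalogs/dev_lakehouse",
        headers={"Authorization": f"Bearer {token}"},
        json={"assign_workspaces": [1234567890]},  # dev workspace ID (placeholder)
    )
    ```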

Comments • 31

  • @user-tq8dr9er2h
    @user-tq8dr9er2h 6 months ago +2

    Awesome, thanks buddy! I was a victim of the never-ending loop; after setting up the account admin role, I was able to enable Unity Catalog.

  • @julius8183
    @julius8183 4 months ago

    Finally a good, solid video that explains it well. Thanks! I would love to see a follow-up where you actually land some data in Bronze and transform it to Silver in development. What does the data look like in the containers? How does the Catalog tab show the data in Databricks? How are they related? I want to know these things but can barely find anyone explaining them well. I basically want to build an enterprise lakehouse from scratch. Thanks!

  • @knowwhyt
    @knowwhyt 11 months ago +1

    Good explanation!! I was missing the Global AD access, which many other YouTube videos did not explain. Thank you!

  • @saugatmukherjee8119
    @saugatmukherjee8119 a year ago +1

    Hi Simon, your videos are always on point and something tangible that I have always found very useful. Just one thing: since you have already assigned the RBAC role of Storage Blob Data Contributor on the lake for the access connector, you do not need ACLs at the container level. Had you not granted the RBAC role on the lake, you would have needed the ACLs.

  • @MohammadAwad-q4v
    @MohammadAwad-q4v 9 months ago

    Thank you for the clear and practical videos! So far they've been helping me build foundational knowledge of Databricks, industry standards, and how things are/should be done. Looking forward to watching more of your content!

  • @akhilannan
    @akhilannan a year ago +8

    Databricks should really come up with a way to enable Unity Catalog by default for a workspace without having to chase down Global Admins.

    • @user-ln3jj4ip9d
      @user-ln3jj4ip9d a year ago +2

      Yeah, this is a big mistake on their part, tbh. One of the huge advantages of Azure Databricks was that a team could just start building - no procurement process, no need for global admins. Making a change like this can take months in a large enterprise.

    • @advait2062
      @advait2062 a year ago +2

      Databricks SA here; this feature is coming in due course. It is being designed as we speak!

  • @drummerboi4eva
    @drummerboi4eva 10 months ago

    Incredible video Simon, thanks for making it simple and clear!

  • @TeeVanBee
    @TeeVanBee a year ago +2

    Great video. Thank you! I believe the ACLs are not needed when you've assigned Storage Blob Data Contributor. I've been able to set it up successfully without them.

    • @AdvancingAnalytics
      @AdvancingAnalytics  a year ago

      I thought so too, but then got errors during metastore creation! Might just be UI checks but it certainly thought it needed the ACLs!

    • @arunr2265
      @arunr2265 a year ago

      @@AdvancingAnalytics Same here, the ACLs are not required.

  • @LS-rv1lk
    @LS-rv1lk 10 months ago +2

    Hi Simon, why do we need the ACLs as well as the RBAC for the access connector on the metastore ADLS?

  • @fredrikovesson
    @fredrikovesson a year ago

    Thx!! This was just what I needed. And not the first time you read my mind 😂.
    Keep up the good work. Very much appreciated

  • @DGermishuizen
    @DGermishuizen a year ago

    Great walkthrough, thanks!

  • @hellhax
    @hellhax a year ago +1

    I very much like the features that come with Unity Catalog. But at the same time, I find it extremely challenging to implement in a big organization in its current form, due to the 1:1 relation to the AAD tenant. We have one AAD tenant used by multiple business groups that run multiple products. They are from different industries and have little to do with each other. I am an architect on one of those products. We have multiple envs with multiple lakes and DB workspaces. Sounds like a good use case for us, right? Well, not so fast.
    There are organizational questions that are difficult to answer:
    1) Who will be managing the "account"? Our AAD global admins know nothing about Databricks and they don't want to manage this stuff (give permissions, create catalogs etc.). So it has to be delegated - but to whom? It could be me, but that means I would be able to control access to other business groups' catalogs. Will they agree to that? It also means I'll be dealing with their requests all the time. So some "company-wide Databricks admin" has to be nominated to manage all this stuff. Getting that done is not easy.
    2) Who will be hosting and managing the metastore storage account and access connector? Since it's for the entire org, it falls into some "common infra / landing zone" bucket, usually managed by a central infra team. So you need to onboard them.
    3) What about automation? I'd like to have an SPN that can, for instance, create catalogs, and use it for my CI/CD. But for now, there are no granular permissions at the metastore level - either you are an admin or not. Having an "admin" SPN that can create and control access to all catalogs in the metastore (which may belong to multiple business groups) - not only is it close to impossible, it's also stupid.
    All these problems come down to one question - why does this have to be tied to the AAD tenant? Or why can't we have multiple metastores per region - each product/product group having its own? Then everyone would take care of their own stuff and everyone would be happy!

  • @AlessandroGattolin
    @AlessandroGattolin a year ago

    Thanks Simon, engaging and crystal clear explanation as always!
    I would like to ask you a question: how would you design workspaces/catalogs to avoid replicating the data 1:1 across DEV-STG-PRD environments, while still being able to handle these two different scenarios:
    1 - a new requirement driven by a data consumer. For example, you need to develop a new gold table.
    2 - a new requirement driven by a data producer. For example, you need to add one column to the bronze and silver tables.
    For scenario 1, it could be nice to work in the DEV workspace but actually read prd_silver_table, in order to be sure the gold logic works properly (if not on 100% of the data, maybe just the data from the last month/week, with PII anonymized).
    For scenario 2, since the column is new, it is of course not present in PRD, so it is necessary to import new data from the producers into dev_bronze_table and then dev_silver_table in the DEV workspace. In this case, if you want to run a regression test on a gold table and, once again, you would like to run it on a subset of PRD data, how would you approach it?
    Thanks anyway for all the material!! :)

    • @AdvancingAnalytics
      @AdvancingAnalytics  a year ago +1

      In most scenarios we come across, this is absolutely not allowed by InfoSec - whilst it would be great to cross-reference prod data for development & testing, we're rarely allowed to open up that access.
      If you wanted to, you wouldn't enforce the workspace-catalog bindings, instead you would have to rely on table permissions - your developers can read from the prod schema, but they are denied write access. If only an elevated account can write to prod, you can then model your scenarios.
      Pretty rare though, environment separation is much more common!
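
      As an illustrative aside, that read-but-not-write pattern might look like the following in UC SQL (catalog, schema and group names are all hypothetical):

      ```python
      # Hypothetical names: developers can read the prod silver schema, but
      # only the elevated ETL principal can write to it.
      spark.sql("GRANT USE CATALOG ON CATALOG prod_lakehouse TO `developers`")
      spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA prod_lakehouse.silver TO `developers`")
      spark.sql("GRANT ALL PRIVILEGES ON SCHEMA prod_lakehouse.silver TO `prod-etl-spn`")
      # UC permissions are additive, so simply never granting MODIFY or
      # CREATE TABLE to `developers` keeps the schema read-only for them.
      ```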

    • @AlessandroGattolin
      @AlessandroGattolin a year ago

      @@AdvancingAnalytics thanks for the quick feedback! I understand, and I agree this is the common scenario. I am trying to find a solution that is a win-win: avoid duplicating data, but at the same time have a secure infrastructure. It could be read-only access as you said, maybe with some masking of PII. I'll investigate this some more nonetheless and see where the future of the data world goes :)

  • @IsmaelByrd
    @IsmaelByrd a year ago +1

    Thanks Simon.
    Decision question: Databricks solution architects have recommended we use managed tables in UC. They mention a lot of benefits built into UC for query optimizations, AI-driven optimizations, etc. But the idea of having external tables live in the Azure subscriptions of the various data-producing domain teams seems like the best option. How would you decide which option to use? Or can you do some combination of both?

    • @fb-gu2er
      @fb-gu2er a year ago +1

      I don't recommend managed tables. If a table is dropped, you have to undo the drop to avoid losing the data. Migrating to a new workspace or to UC will force you to copy all the data. This is not the case with external tables.

    • @AdvancingAnalytics
      @AdvancingAnalytics  a year ago +2

      So - you absolutely, definitely want your tables stored in external storage rather than the UC metastore location, as it's fairly likely you'll have dev/test/prod environments etc., and having everything in a single lake goes against every InfoSec rule!
      However, you can override the managed table location by adding the MANAGED LOCATION clause when creating the catalog or schema. That means you can have managed tables, but in your choice of lake location.
      The decision then lies with whether you want UC to be the primary owner of the data. If you drop a managed table, you drop the underlying data. Historically, we've avoided this like the plague, as there's very little benefit, but as UC adds a ton more "we'll optimize it for you" features, I imagine there will be a strong push to switch over to managed tables in the future.
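
      A minimal sketch of that override (all names and abfss paths below are placeholders, not from the video):

      ```python
      # Placeholder names/paths: override the managed location at schema level
      # so managed tables land in your own lake, not the metastore root
      # (assumes an external location already covers this path).
      spark.sql("""
          CREATE SCHEMA IF NOT EXISTS prod_lakehouse.silver
          MANAGED LOCATION 'abfss://managed@prodlake.dfs.core.windows.net/silver/'
      """)

      # A table created without an explicit LOCATION is still a managed table,
      # but its files now live under the schema's managed location above.
      spark.sql("""
          CREATE TABLE IF NOT EXISTS prod_lakehouse.silver.customers (
              id BIGINT,
              name STRING
          )
      """)
      ```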

    • @fb-gu2er
      @fb-gu2er a year ago +1

      @@AdvancingAnalytics liquid clustering provides optimization for external tables as well. The only part missing would be vacuuming. But that’s simple enough. I have a dedicated job for that purpose that will scan all databases and tables once a day and vacuum them
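
      For illustration, a daily vacuum sweep like the one described might look roughly like this (a hedged sketch, assuming it runs as a scheduled job with access to every catalog, and skipping anything it can't vacuum):

      ```python
      # Illustrative only: walk every catalog/schema/table and VACUUM each one,
      # keeping the default 7 days (168 hours) of history.
      for cat_row in spark.sql("SHOW CATALOGS").collect():
          catalog = cat_row[0]
          for sch_row in spark.sql(f"SHOW SCHEMAS IN `{catalog}`").collect():
              schema = sch_row[0]
              for tbl_row in spark.sql(f"SHOW TABLES IN `{catalog}`.`{schema}`").collect():
                  if tbl_row.isTemporary:
                      continue  # skip temp views
                  name = f"`{catalog}`.`{schema}`.`{tbl_row.tableName}`"
                  try:
                      spark.sql(f"VACUUM {name} RETAIN 168 HOURS")
                  except Exception as err:  # views, non-Delta tables, permissions
                      print(f"Skipped {name}: {err}")
      ```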

  • @wiwiwiii
    @wiwiwiii 8 months ago

    Don't you think that having DEV, QA and PROD data all in the same data lake could create performance issues? DEV, QA and PROD data usually have different lifecycles and data magnitudes; workspaces could be bound to different VNets, as could the ADLS accounts targeted as data lakes.
    So I would be more comfortable having a separate ADLS data lake for every env.
    If we used external data lakes, would we still have all the features of Unity Catalog available?

  • @yuvakarthiking
    @yuvakarthiking 5 months ago

    Hi Simon,
    I'm having trouble sending data to Domo using the pydomo lib. As I am using a UC external location as the source path, os.listdir and other os functions are not able to read files at an abfss path. Is there any solution to this?
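
    A possible workaround, sketched with placeholder paths: Python's os module only sees the driver's local filesystem, so abfss:// URIs won't resolve; dbutils.fs does understand them, and a file can be copied to local disk for libraries that expect a local path:

    ```python
    # Placeholder paths. List files in the UC external location with
    # dbutils.fs instead of os.listdir:
    files = dbutils.fs.ls("abfss://raw@mylake.dfs.core.windows.net/exports/")
    for f in files:
        print(f.path, f.size)

    # If a downstream library (e.g. pydomo) needs a local file, copy it down
    # to the driver's local disk first:
    dbutils.fs.cp(
        "abfss://raw@mylake.dfs.core.windows.net/exports/data.csv",
        "file:/tmp/data.csv",
    )
    # /tmp/data.csv is now visible to plain Python os / open() calls.
    ```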

  • @lucaslira5
    @lucaslira5 9 months ago

    In the case of having 2 storage accounts, 1 for dev and 1 for prod, should I create 1 metastore in each storage account and assign the dev workspace to the dev metastore and the prod workspace to the prod metastore?

  • @user-ww6yf3iq8q
    @user-ww6yf3iq8q a year ago

    I'm getting an "Internal Server Error".

  • @fb-gu2er
    @fb-gu2er a year ago

    Informative, but for anyone reading: I don't recommend ever enabling UC through the UI. Use proper IaC tools like Terraform, or whatever you prefer.

    • @AdvancingAnalytics
      @AdvancingAnalytics  a year ago +1

      Absolutely, we TF most of our client deployments - but if there are people out there who haven't come across any of these setup steps, it's important to know what's actually happening before you automate it!

  • @jacovangelder9700
    @jacovangelder9700 a year ago

    Why on earth can you not add a group as an account admin?