Header-detail vs star schema models in Tabular and Power BI

  • Published 9 Sep 2024
  • Compare header/detail models and star schemas: even though a star schema might require a few ETL steps and some coding, it is always the best modeling option.
    Article and download: sql.bi/728493?...
    How to learn DAX: www.sqlbi.com/...
    The definitive guide to DAX: www.sqlbi.com/...

Comments • 66

  • @Milhouse77BS · 2 years ago · +6

    Nice touch with the espresso drink at 24:10

  • @MortenHannibalsenOlsen · 2 years ago · +8

    These deep dive videos where you actually prove (or disprove) common data modeling theory are pure gold!

  • @Milhouse77BS · 2 years ago · +3

    Love the "No" answer at 1:48

  • @toniskolik5807 · 2 years ago · +11

    You deliver by far the best content for the community! Thank you so much!

  • @alenalihovidova7607 · 2 years ago · +4

    That was really useful! I was using header-detail model structures extensively to save some model size, but I had no idea they had such a huge impact on performance.
    I will rebuild the existing models. It's good that I spotted this video right now. Many thanks!

  • @beginho2454 · 2 years ago · +2

    Hello Alberto and Marco, thank you for your books...
    you opened a new page in my career and life ;)

  • @Ralom2000 · 8 months ago · +1

    Excellent video. Crystal clear and very good practical demonstrations

  • @ajaaskelainen · 2 years ago · +2

    I've learned a lot here! I knew that Star schema is better, but never actually investigated how much faster it is. I also liked the most interesting scenarios after around 19:00

  • @luanacristinalongo2094 · 7 months ago · +1

    Excellent explanation.

  • @carloscantu75 · 2 years ago · +1

    Thank you Alberto, great content as always, and also thank you for taking the time to answer the questions in this section. I learned not only from the video but also from your replies.

  • @mabl4367 · 2 years ago · +1

    Fantastic video!
    It addressed questions I have had since first coming into contact with Power BI, and it did so in the clearest way possible!!!
    Thank you!

  • @bgg732 · 2 years ago · +1

    One of the conveniences of having a header table is that it makes DAX much simpler for things like the average difference between payment due date and actual payment date. Those fields live at the header level and require you to use table functions like SUMMARIZECOLUMNS() in every measure whose granularity stops at the header level. Not a show stopper, just something to be aware of.

    • @SQLBI · 2 years ago · +5

      Nothing stops you from having two fact tables with two different granularities (header and detail). The issue is to create a relationship between header and detail tables.

  • @rafalg8575 · 2 years ago · +1

    As always perfect! I was sure that star schema is the best but now I can prove it.

  • @gulhermepereira249 · 2 years ago · +2

    This is gold, thank you Alberto

  • @ikar2k · 1 year ago · +1

    Brilliant video 👌 I'll go and have a cup of double espresso in honor of you! 😌 Molte grazie!

  • @aminejerbi1634 · 2 years ago · +1

    Lovely and very convincing....Thank you Alberto

  • @nlagreste · 2 years ago · +1

    Excellent as always! Thanks for sharing Alberto.

  • @sebastiendebosscher · 2 years ago · +1

    Fantastic video. Very clear explanation! I love that you actually prove everything you say 👍👍

  • @charlesdelanoy · 2 years ago · +1

    Thanks so much for making these videos, just great content and much appreciated...

  • @Milhouse77BS · 2 years ago · +1

    I've been wrestling with this for nearly a year, with the added twist of having another level of header-detail: say each Line has many Discounts per line, and on the other end, Transactions having many Payments.

  • @hannesw.8297 · 2 years ago · +1

    Perfect explanation, love it! Great work, thank you!

  • @jacqueskassubeck2338 · 2 years ago · +1

    Thank you for the information - really great content!

  • @downunderdax · 2 years ago · +1

    Great Video!
    Before I go and fix my model: I have been playing with a header-detail model and encountered a behavior I don't understand.
    My model has customer --> Header and Product --> Detail.
    Building a table with
    Customer[Name], Product[SKU], filtered for a single Sales Order in the Header table, I get 4 lines in the table (the 4 products for that order).
    If I drag SalesHeader[Amount] as an implicit measure (Sum of SalesHeader[Amount]) into the Table visual, I get the expected behavior. 4 rows, repeating the same Amount value in each row.
    If however, I use a measure Total Sales = SUM(SalesHeader[Amount]), the table returns a row for every single Product SKU.
    I can fix the behavior with CROSSFILTER():
    Total Sales Fixed = CALCULATE(SUM(SalesHeader[Amount]), CROSSFILTER(SalesHeader[OrderID], SalesDetail[OrderID], BOTH))
    But what I don't understand is: why does the implicit measure deliver 4 rows naturally, while the explicit measure requires CROSSFILTER() to behave the same way?
    Do you have a video explaining this?

    • @SQLBI · 2 years ago · +1

      When you use the implicit measure, Power BI adds a hidden COUNTROWS(Sales) measure to the query - not shown in the results - just to reduce the rows in the visualization. The presence of an explicit measure suggests to Power BI that there is already business logic to reduce the rows to the relevant ones, so no additional hidden measures are added to the query. You can analyze the different queries generated by using Performance Analyzer and running them in DAX Studio (search for related videos about them - there are many - we don't have a video about this specific behavior).

  • @smartdata678 · 2 years ago · +3

    hahahahahaha, it is not, no, not no, no no no, nooooooo.
    I love your videos!

  • @nishantkumar9570 · 2 years ago · +1

    Well explained. Thank you very much.

  • @Anthony_Lecoq · 2 years ago · +1

    What a beautiful fact checked demonstration. Many thanks for your valuable videos Alberto.

  • @marilenadonofrio9787 · 2 years ago · +1

    Kudos, well done!

  • @alexkim7270 · 1 year ago · +1

    This is the second time I'm reviewing this video. This time I have a practical problem and would need advice.
    For e-commerce databases, it is extremely common to have header-detail where discounts can be applied at the total-order level and the item level. From here, I have 2 questions.
    1. Denormalising HD to Star would create duplicate records of "order discount amt", and it would not be easy to sum the max of discounts (I may be wrong, as I've never done this before).
    2. I also "imputed" some relevant "product dimension" attributes into the detail table to avoid that heavy calculation (in your demo, 18 mins). Obviously this approach hinders me from easily deploying basket-value analysis that requires 2 product dimension tables, as you explained in your insightful articles.
    I am caught in a Catch-22 where both solutions would be suboptimal. Hence I chose the lesser evil by using HD for very specific use cases (such as tracking order item fulfillment), while mostly relying on Star to do the heavy lifting. What's your advice on this? Would it be optimal for me to denormalise, use Star, and figure out ways to calculate the sum of max of discounts? Would love to hear from you. Thanks again for sharing!

    • @SQLBI · 1 year ago

      1) Allocate the discount - the demo with MAX is an example of why a star schema is better for analytical purposes.
      2) Converting to a star schema has a cost, but also benefits.
      The denormalization process can be done with many tools, usually classified as ETL. Power Query is the most accessible tool in Power BI, but it is not the most performant when data volume grows. Depending on your scenario, consider SSIS / ADF / SQL / other ETL tools.
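
      A minimal sketch of that allocation step, assuming hypothetical header/detail records and made-up field names (plain Python stands in for whatever ETL tool is used). Each line receives a share of the order-level discount proportional to its line amount, so the allocated amounts sum back to the header value:

```python
# Allocate an order-level discount down to detail lines,
# proportionally to each line amount (hypothetical data).

headers = {1: {"discount": 10.0}}  # OrderID -> header-level attributes
details = [
    {"order_id": 1, "line_amount": 30.0},
    {"order_id": 1, "line_amount": 70.0},
]

def allocate_discounts(headers, details):
    # Allocation base: total line amount per order.
    totals = {}
    for d in details:
        totals[d["order_id"]] = totals.get(d["order_id"], 0.0) + d["line_amount"]
    # Copy each line, adding its proportional share of the header discount.
    rows = []
    for d in details:
        share = d["line_amount"] / totals[d["order_id"]]
        row = dict(d)
        row["line_discount"] = headers[d["order_id"]]["discount"] * share
        rows.append(row)
    return rows

flat = allocate_discounts(headers, details)
print([r["line_discount"] for r in flat])  # shares follow the 30/70 split
```

      After this step the header table is no longer needed for the discount measure: a plain SUM over line_discount returns the correct order-level total at any granularity.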

    • @alexkim7270 · 1 year ago

      By allocating the discount here, I assume I need to perform an INNER JOIN between the header table and the details table? I am not able to allocate the order-level discount amount to the respective line items, as this is not how the business logic works. This is the only way I can think of to denormalise the warehouse model from HD to a star schema.
      I am thankful that, for the time being, the SQL scripts I wrote are still able to handle the complex operations. I totally agree with you that they will be less performant when the data volume grows. I am considering proposing ADF as well.

  • @shekharkumardas · 2 years ago · +1

    Excellent

  • @Linger92 · 2 years ago

    It would be interesting to see the performance in the header detail version with SUMX ( DISTINCT ( SalesHeader[CustomerKey] ), 1 ).
    I had an issue with DISTINCTCOUNT in a bi-directional relationship scenario only recently and that solved it for me in this case (because I read it in one of your articles of course).

    • @SQLBI · 2 years ago

      It depends on the number of customers involved. Up to a few thousand is manageable; if it goes to millions, it's going to be slower. It also depends on how you use that measure in the report - a matrix repeats the same calculation several times; when you do that for a small number, the SUMX could be an advantage, but as usual... it depends!

  • @kamilkalecinski4115 · 1 year ago

    This is a struggle for my model too: heavy relationships at the lowest fact granularity. However, we have a business case: our users want to use attributes from Sales to slice Orders and the other way round. So, the only solution we have introduced so far is to always split fact tables, i.e. split Sales into "Sales" and "Sales Attributes" so that "Sales Attributes" becomes a dimension table that can slice both "Sales" and "Orders". And then we have "Orders Attributes" that can slice back "Sales". Is there any better solution to avoid these heavy relationships while still following basic star schema principles?

  • @andresgangas9532 · 2 years ago

    Really clear explanation. But could you please explain or give some tips on what happens with the dimension tables, and whether it is necessary to "normalize" them? For instance, the product table could have a hierarchy like groups of materials --> lines of materials --> material. Is it necessary to create each of these levels in a different table? Or keep only one denormalized table? Or does it depend on what we are going to calculate/filter?
    Thanks a lot for all your videos

    • @SQLBI · 2 years ago

      In the star schema you should denormalize all the attributes of a business entity (like products) into a single table. Hierarchies are formed by different columns of that dimension. If you normalize the product table into multiple tables, you have a snowflake schema, which works in Power BI but is not a best practice (you don't get the best performance and you might have issues in certain DAX expressions).
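
      As a rough sketch of that denormalization (hypothetical table shapes and key names; Python stands in for whatever ETL tool is used), flattening a Product -> Subcategory -> Category snowflake into one wide dimension looks like this:

```python
# Flatten a snowflaked Product -> Subcategory -> Category chain
# into a single, denormalized product dimension (hypothetical data).

categories = {1: "Bikes"}  # CategoryKey -> name
subcategories = {10: {"name": "Road Bikes", "category_id": 1}}
products = [{"sku": "BK-R01", "subcategory_id": 10}]

def denormalize_products(products, subcategories, categories):
    rows = []
    for p in products:
        sub = subcategories[p["subcategory_id"]]
        rows.append({
            "sku": p["sku"],
            "subcategory": sub["name"],                  # copied from Subcategory
            "category": categories[sub["category_id"]],  # copied from Category
        })
    return rows

dim_product = denormalize_products(products, subcategories, categories)
print(dim_product)  # one wide row per product, hierarchy as plain columns
```

      The hierarchy survives as ordinary columns of the single dimension table, which is exactly the shape a star schema expects.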

  • @solangefranco3444 · 1 year ago · +1

    Hi Alberto! I always learn a lot from your videos and the explanations.
    But currently I am facing a huge challenge, and it has been hard to find a better design in Power BI. I am working on a dataset, and the goal is to create a full dataset that enables final users to create their own reports. The dataset must contain header and detail data from each step of procurement, starting from Budget, Purchase Request and lines, then the related Request for Quotation, and other specific details. At the same time, the data model must be easy to use for creating reports by inexpert Power BI final users. Do you have a video or advice for a model like that? Thanks

    • @SQLBI · 1 year ago · +1

      Start with the Introduction to Data Modeling, and then move forward with the full Data Modeling class - that is the first step to create a model that is easy to use!
      www.sqlbi.com/p/introduction-to-data-modeling-for-power-bi-video-course/
      www.sqlbi.com/p/data-modeling-for-power-bi-video-course/

  • @porraloka31 · 2 years ago

    Alberto, when we need to know how many new customers we have, coming from a table with the date each client was registered, connected to a table of transactions with a different set of dates (date of transaction), are we obliged to have two calendars? Because every time we filter the date to see new customers we might wrongly filter the transaction table as well, right?

  • @ofbirkinsandmen6025 · 2 years ago

    What if there is no back end to create the fact table in the star schema? Can I just do a left outer join and use that table?

  • @ivankhoo93 · 2 years ago · +1

    Thanks for the insightful video. Is the same thing applicable on the dimension side, e.g. in an extended snowflake configuration like product/subcategory/category, where it would face similar issues if the dimension key links are large as well?

    • @SQLBI · 2 years ago · +1

      Yes, you should denormalize a snowflake into a star schema whenever possible.

  • @craigdubin6325 · 1 year ago

    Toni, great video! All the facts in our business are designed this way. I was wondering what your advice would be for a sales star schema where the line items aren't all at the same grain. So instead of solely a product key in the line items, you have some lines with products and other lines with labor. I could do two different sales tables (fact_sale_products and fact_sales_labor), but the problem with this is that one sales header has both products and labor as part of the sale.

    • @ikar2k · 1 year ago

      Sorry, but our sensei's name is Alberto. Toni is a brother in arms 🙏👍

  • @mshparber · 2 years ago

    Many times I need to calculate columns and measures at the Header level, for example: AVERAGEX(SalesHeader, ‘SalesHeader’[OrderAmount])
    Are you saying it is better to do this instead:
    AVERAGEX(VALUES(Sales[SalesOrder]),CALCULATE(SUM(Sales[LineAmount])))
    ?

    • @SQLBI · 2 years ago · +1

      The first approach is faster at query time because it precalculates the order amount. However, if you have 100K orders, you might not notice the performance difference. If you have tens of millions of orders, you have to evaluate whether the additional cost of a second fact table makes sense for the performance advantage. However, even in that case you should create two fact tables related to shared dimensions instead of creating a relationship between the two fact tables.

  • @IN2FPA · 2 years ago

    Given that a DISTINCTCOUNT of one column on a large table takes longer than a DISTINCTCOUNT on a smaller one, would it make sense to create a dimension that contains the order header excluding the customer + date columns?
    In this case you would still have one fact table, but now joined to product, customer, date, and order header in a star schema.
    PS: Wondering why the DISTINCTCOUNT of a single column on a large table is slower - doesn't Tabular store only the distinct values for each column...?

    • @SQLBI · 2 years ago · +1

      The compression is by column, so having a separate table wouldn't help. It's not that DISTINCTCOUNT is slow, it's that the other aggregations are super-fast. DISTINCTCOUNT is computationally more expensive by its nature.
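
      A toy illustration of why a distinct count composes worse than a sum (an analogy only, not how the VertiPaq engine is actually implemented): partial sums from two storage segments combine with a single addition, while partial distinct counts cannot simply be added, because the same value can appear in both segments.

```python
# Two "segments" of the same CustomerKey column (hypothetical values).
seg1 = [1, 2, 2, 3]
seg2 = [3, 3, 4]

# SUM: per-segment results combine with one addition.
total = sum(seg1) + sum(seg2)  # 8 + 10 = 18

# DISTINCTCOUNT: per-segment counts cannot simply be added,
# because customer 3 appears in both segments.
naive = len(set(seg1)) + len(set(seg2))  # 3 + 2 = 5, overcounts
correct = len(set(seg1) | set(seg2))     # requires merging the sets: 4

print(total, naive, correct)
```

      Merging value sets across segments is the computationally expensive part, which is why DISTINCTCOUNT costs more than additive aggregations even over compressed columns.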

  • @Narses3 · 2 years ago

    Obviously not great practice, but would performance be any better with a sales order bridge table? I guess probably not?

    • @SQLBI · 2 years ago · +1

      Do you mean a bridge table between Header and Details? If that is the question, the answer is no.

  • @sebastiendebosscher · 2 years ago

    What should you do when you have two sales tables from different ERP systems (different columns) and you want to slice by brand once, to see the total sales of both ERP systems?
    I was thinking of having two sales tables, each with one related item table, and on top of both item tables a third (simplified) huge item table that slices both item tables.
    This would violate the star schema, but how else can I solve this?

    • @SQLBI · 2 years ago · +1

      You can append the rows from the two tables into the same table, or create two independent fact tables. It depends on many details, but in none of these cases would you create a relationship between the two fact tables if you keep them separate.

  • @Wzxxx · 1 year ago

    Well, but what if I have a shipment cost column at the order header level? I cannot just copy it to each order line, as there would be duplicated values and the total shipment cost would end up wrong.

    • @SQLBI · 1 year ago

      In that case you keep two fact tables, one for each granularity, without a relationship between the two fact tables. This is covered in our Data Modeling training (also watch the free Introducing Data modeling: www.sqlbi.com/p/introduction-to-data-modeling-for-power-bi-video-course/ )

    • @Wzxxx · 1 year ago

      @SQLBI Thank you for the answer. I thought about this solution, but in this case I will not be able to filter the shipment cost by the products dimension.

    • @SQLBI · 1 year ago · +1

      You must allocate the cost if you want to see it by product, no?

    • @Wzxxx · 1 year ago · +1

      I would have to. There is one concept I don't understand. If I keep both tables and connect them through a dimension table (distinct order numbers) - without it there is no point - I still need to iterate over the same number of order numbers. On the other hand, I need to use CROSSFILTER to get shipment cost by product, going from order items to sales through the order number table. Isn't that the same as connecting both fact tables directly? All examples are always very simple and far from real cases. Why do I never see a dimension table with order numbers used to filter data from many fact tables?

  • @jjpp1970 · 2 years ago

    Alberto,
    What application did you use to build the "Sales" table by joining the "SalesHeader" and "SalesDetail" tables?
    I mean, for cases with such a large number of rows, what good choices do we have when using Power BI Desktop?
    DAX, Power Query, any other?
    Thank you.
    Jose Padilla

    • @SQLBI · 2 years ago · +2

      SQL is usually a good idea. We used a view in SQL Server.
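
      Outside SQL, the shape of that view is just a keyed join that repeats the header columns on every detail row; a minimal Python sketch with made-up column names:

```python
# Join SalesHeader to SalesDetail on OrderID, repeating the header
# attributes (customer, date) on every detail row (hypothetical data).

sales_header = {
    1: {"customer": "Contoso", "order_date": "2024-09-09"},
}
sales_detail = [
    {"order_id": 1, "sku": "BK-R01", "line_amount": 30.0},
    {"order_id": 1, "sku": "BK-R02", "line_amount": 70.0},
]

def flatten_sales(header, detail):
    rows = []
    for d in detail:
        h = header[d["order_id"]]
        rows.append({**h, **d})  # header columns copied onto each line
    return rows

sales = flatten_sales(sales_header, sales_detail)
print(len(sales))  # one row per detail line
```

      The result has the granularity of SalesDetail; in production the same join is better pushed into the source, such as the SQL Server view mentioned above.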

    • @jjpp1970 · 2 years ago

      @@SQLBI Thank you.

  • @zentiumx · 4 months ago

    I tend to disagree a bit with you. Your discussion is great and the topic is well covered, but one thing you need to take into consideration: when using one schema or the other, stick to that schema's best practices and use cases. Using a star schema's development style in an HD schema will of course yield worse results. In H/D, sometimes it is better to denormalize. For example, since an invoice (HD schema) does not change over time, how about denormalizing the sum of the details and storing it in the header? Welcome triggers. In a write-once/read-many workload, pre-calculation is a godsend, as we tend to emphasize reading time. Now redo the math!
    In addition, when you are summing while grouping two or more tables, as in your products-and-customers grouping example, do the sum and group directly in the header table, linking to the dimension tables only once you have the results. This is where writing stored procedures comes in handy, as they are able to optimize such queries where DAX might not be able to.
    To conclude, when you want to compare efficiency, it would be better to do so using the best practices for each model, not the same practice for both models.
    Finally, you also need to compare the other pros/cons, such as inserting, deleting, and updating data.
    For example, you said that one wouldn't require fetching distinct records much, a statement that is not quite correct. When you need to display your orders, which happens very frequently in production environments, the star approach will always prove to be very slow.
    I personally use a mixture of both, denormalizing my HD architecture where I need the 'star' performance and cutting down on long 'waiting lists'.

  • @NicholasEnslow · 2 years ago

    Can we get a video on taking a trash, 250+-column-wide denormalized model and normalizing it? That would be great for people who are dealing with report output and having to turn it into gold.

    • @SQLBI · 2 years ago

      Look at these courses:
      www.sqlbi.com/p/introduction-to-data-modeling-for-power-bi-video-course/
      www.sqlbi.com/p/data-modeling-for-power-bi-video-course/
      However, they are not about the transformation itself; that can be performed with many different tools (SQL, Power Query, other tools... look for ETL tools).

  • @Phoenixspin · 2 years ago

    "I'm sorry, that's on you."