What is the best architecture to build a data warehouse?

Rafael Resendiz Ramírez @Rafael_Resendiz_Ramirez

06 June 2014 10 10K Report

Inmon, Kimball, Hefesto or another? I'm currently building a data warehouse to pave the way for data mining, the goal of this work is to improve the process of decision-making in education policy. This requires knowing what the best architecture is.

Harco Leslie Hendric Spits Warnars Popular answer

Its depend on what do you need, no matter what the architecture, its go back to the purpose to build datawarehouse where we need to deliver database environment which can create best sql performance when access data from datawarehouse

you can create star schema or snowflake, top down or bottom up, its depend on how best performance your sql to access datawarehouse rather than from oltp.

Harco Leslie Hendric Spits Warnars

you can create star schema or snowflake, top down or bottom up, its depend on how best performance your sql to access datawarehouse rather than from oltp.

Dony Saputra

Based on experience inmon hub-spoke architecture is the better than kimball data mart architecture methods as it architecture more success to intended purpose identified by information quality, system quality, individual and organizational impact that have been research by Hugh J.Watson and Thilini A.

you can find the article on :

https://cours.etsmtl.ca/mti820/public_docs/lectures/WhichDWArchitectureIsMostSuccessful.pdf

Most enterprise use kimball methods using bottom up approach where they build datamarts that can tan be integrated into datawarehouse, from my experience problem came on integrating all datamarts as it goes with physical and ego's of every department that can take time and continuous big cost at the end

While inmon methods using top down approach is expensive at the beginning but have a clearer define vision on target and applied to big data analytic today. Create first datawarehouse than the datamart can follow and develop under datawarehouse rule.

As My Friend Mr. Harco said OLAP is better approach and view for organization/enterprise architecture/information system than OLTP as Kimball.

Better water fall falling down than capture the water drops using loseless umbrella isn't.

How about combining the two approach, take both best part and remove all the bad part.

Sami Laine

The article by Ariyachandra and Watson states:

"The bus, hub-and-spoke, and centralized architectures earned similar scores on the success metrics. This finding helps explain why these competing architectures have survived over time—they are equally successful for their intended purposes."

According to article, on average, actually Kimball bus architecture is slightly better than hub-and-spoke. However, I wouldn't pay too much attention this kind of survey results that are based on too general concepts and subjective answer scales. Just like the article states - all architectures can be successful.

The issue isn't architectures like these.

For example, the disadvantage of normalized hub-spoke is that it is the most expensive and also most likely to fail project. It 's a fusion architecture and combines the efforts of centralized AND bus-architecture. Therefore also their risks and challenges.

The wicked problem in centralized data warehousing (centralized or hub) is that normalizing heterogeneous data from heterogeneous sources to a unified data model means creating a false illusion. That's the wicked problem of all standardized data models although normalized centralized hub seems first like a dream solution. On the contrary, it easily becomes a nightmare disguised under the hood of "polished truth" = false illusions. Data looks cleansed, solid and unified. But it is actually biased and false because inherent subtle heterogeneity is hidden from the sight. Yet, the data in sources remain heterogeneous and is different. You just do not see the differences anymore, since they are transformed to a unified data model.

Of course, the same problem can be built-in to datamart architecture also simply by building unified data marts from heterogeneous sources...

Considering above, my ex-practitioner view is that all these five generic architectures are illusions and misconceptions. These conceptual architectures are too general concepts that first fail to identify and then also mix-up actual factors that affect the quality of data warehouse. For example, hub-spoke architecture does not have to be normalized! Actually, I wouldn't suggest normalization to the core of data warehouse and neither does Inmon anymore. Actually, Inmon has lately sided with Dan Linstedts Data Vault-modeling rather than normalization...

http://en.wikipedia.org/wiki/Data_Vault_Modeling

Anyway, I would even claim, that following any of those suggested models most likely leads to failure and waste. The issue isn't generalized architectures but how to solve actual data management and quality control issues -> in a way or other.

Just like Saputra said - the issue is actually combining them. Or actually, it's doing things that are left out of the architectures.

For example, all data warehouses have to solve a problem of audit trail or they will become a failure. Therefore successful (hub, centralized or data mart architectures etc) data warehouses always have some kind of archived original data layer. It can be called "archive layer", "acquisition stage" or any other similar concept.

Also, all data warehouses have to do hard technical and soft business transformations. Therefore, all successful data warehouses have a strict and transparent discipline to do those transformations starting from simple technical and moving towards more complex business transformations issue by issue and from layer to another layer. How these are organized as a series of procedures (serial, parallel, amount etc) or layers (virtual, physical, temporary, permanent etc) is a matter design preferences and requirements.

All successful data warehouses have to build end-user friendly data layer in a way or other. It can be a virtual database view on top of normalized model, an isolated stand-alone-table in data mart layer or systematically cleaned snowflake model. Any and all of these can be built-in to all data warehouse architectures for good reasons. End-users or even business analysts might not even know the difference - they think it's another architecture while in reality it's completely different.

I claim that there isn't such thing as the best architecture for data warehouse and it certainly isn't related to those generalized abstractions like hub spoke or bus architecture. Successful data warehouses just apply various techniques for good reasons in a systematic way. They all have to solve same wicked problems like archiving and version history, organizing data transformations systematically, quality controlling data flows and scripts, providing full audit trail from archives to end-user data layer. Etc.

RECOMMENDATIONS

Rather than looking for best or even good data warehouse architecture, one should look SYSTEMATIC REFERENCE ARCHITECTURES that provide systematic methods to real data management and data quality issues.

For a data warehouse for data mining, especially for policy making and education data, issues aren't about SQL performance. It's how to archive all potentially relevant data and how to document its original local semantics for later use. Business changes, world changes, software systems change - and nobody remembers how data definitions, structures, user interfaces etc changed during the 10 year period in all of the data providers. And all these subtle changes affect the data that is used for data mining trends and benchmarking comparisons. Even if the data looks completely same - but only technically. In the data source, context or semantics, something has and will change. All these original meanings, structures and changes have to be traceable and documented if one wants to make any valid decisions...

So it all begins from systematic archiving and documentation practices. After that one can build step-by-step and layer-by-layer more sophisticated data quality controls, data transformations etc. That's where you need a systematic and high quality reference architecture, that do not follow any generalized architecture but provides methods for real-data problems. One could use Linstedts Data Vault in EDW archive and Kimballs dimensional modelling in business layer. But these are too general guidelines to be actually helpful or accurate. One has to have a deeper understanding and know what issues can and should be solved at each layer and by what techniques. And what should not be done at one layer or at any of them.

And most important thing is nothing about technologies or architectures. The most important thing is to get a rock-solid and credible EXPERT, who has seen all the mistakes and also built successful data warehouses. Someone who does not do what customers or architectures demand but explains why something should be done and why something shouldn't be done. Explains how problems could be solved and solves them from experience rather than trying to do them for the first time.

Juanita Walton Billings

Mr. Laine, I thoroughly enjoyed reading your remarks in answer to the question. I absolutely agree with you that archiving and documentation processes are key to any measure of success enjoyed while actually using and manipulating datasets for anlysis. What are your thoughts regarding the underlying data model(s) that support the various sources of data being warehoused? Do you consider the point of origin, i.e., the data mart, the governing body for configuration changes to the data model? Or should the data marts conform to a stategic-level data model? What is your experience with configuration management of data models?

~ Juanita Walton Billings

Sami Laine

I wonder if I understood your questions correctly...

This link might be good start for considering different data models for different layers:

http://bifuture.blogspot.nl/2012/03/four-different-datamodeling-methods.html

Data Sources

I think that original data sources, their actual original data models, original user interfaces and actual data entry practices should be controlled and documented better than I have seen so far happening in reality.

For example, in healthcare, a common problem is that data interfaces are standardized as common file formats or data models, but in reality actual data sources are different in their structures (in and between technical systems) , meanings (semantics in and between clinical, administrative etc domains) and data usage (pragmatics in work context). Data definitions are not enough precise documentation to distinguish difference between technically identical but clinically completely different data instances. In practice, one should document also other contextual semantics such as data entry and work processes etc. If there is changes in those then also data will change. It is difficult or often impossible to distinguish which change is related to observed phenomena and which is related to context of data supply.

How this would be done is a different story. Currently it's almost impossible and original contexts and their subtle details are often lost forever...

However, in the future, the more and more should be done to document better actual data supply situations. The original data creation and its context should be recorded and stored much more widely and accurately rather than creating standardized data sets based on standardized data models (=illusions) that only hide the inherent differences.

After the true original meaning and context is documented you can start to build different interpretations i.e. secondary data models or business data marts.

Modeling Data Sources

In the data warehouse or archiving layer I like the ideas of Data Vault and similar models that try to record everything as-it-is, including all the errors and their versions. Rather than "modeling" data in warehouse, you systematically document it including metadata about its source. In this view, data quality error is not something simply to be fixed but more like a sign for business improvement and contextual problems. It should be stored rather than hidden or ignored. Fixing them in a data warehouse hides organizational facts and can make data biased.

However, these modeling methods are also somewhat unorthodox and should be applied very carefully. There is reasons why you need to follow strictly the modeling rules (that differ from normalized and dimensional modeling) or you end up in problems. Also, in the future, data warehouses should store much more rigorously metadata about data supply context. Issues that happen before any data is stored in source data model.

Point-of-Origin

I consider point-of-origin being outside of data warehouse. The true origin of data is in the data source and at the data providers business context.

I like Kimballs definition what is the lowest granularity of data - it's a click. The point of origin is that "click" and all the context that is surrounding it during the click situation. For example, when a physician "clicks" diagnosis code user interface component at a ward, it is actually a different "click" than the same data instance provided through a different user interface by a secretary transcribing physicians notes. The better you can trace "the click" and the more your know about the context when the keyboards "click" was heard the better information audit trail capability you have.

Unfortunately, these "clicks" are often hidden by data models and that leads to loss of context information and potential bias hidden in data sets. However, I believe that one should document more metadata about data sources and their contexts - not only data definitions but also how the "clicks" actually happen and what their mean in practice. If that metadata were available for data warehouses, it would improve validity and productivity of further data development and analysis significantly.

Configuration management of data models

I have been doing conceptual, logical and physical data models. However, the models were developed manually and more like isolated data models. I have no first-hand experience from technical environments that would use sophisticated data models that could actually link different versions of data models. Management was manual and based on each data model separately and maybe some mappings in administrative documents.

According to my understanding linking data models and all their version history as well as their related context changes are often in the heads of data analysts. Hopefully they have documented such changes at least in someway. Previous data models and data definitions are probably archived but that's not probably enough to recognize or understand potential biases or anomalies found during data analysis.

Data Marts and Strategic Data Models

I think that data warehouses do have to have also another layer (in addition to archive/documentation layer) that is based on this data use perspective. All the information needs simply cannot be derived directly from original data models. One has to change data to fit it to the required target data model and resulting data model has to be also explicitly defined.

In my opinion, target data models should not be idealized conceptual end-user requirement models. That's because you can't really produce all information in a way that someone wants it. Idealistic end-user data model is just an illusion that can be filled with data - but it is actually false and biased data just transformed enough to fit unrealistic expectations. Real data is transformed to wishful illusion.

Therefore, you should rather provide them data model that is close enough to original data and still useful for end-users. However, even then it might be too far away from actual semantics of original data and contain inherent biases or hidden heterogeneity. That's where you need traceability and transparency. End-users should have ability to trace back from their "own model" to the original "source model" and hopefully even behind that - actual context of data creation.

Also, if there is inherent accuracy problems that are based on difference between source and target models, those differences should be made visible. For example, in the data source, there is planned ambulatory surgeries. However, you asked for finished ambulatory surgeries. We can provide data along the surgery process until the end of surgery perfectly. However, after that there might be administrative reasons why some cases change. We have no way to get data about such cases but they are more likely to be only 1-2 percent from the total amount.

Even if you never can provide requested data according to the model users would like to have, you can have quite close model and then some kind of additional information about errors rates, subtle semantic changes and context influences that risk the absolute validity of presented data in target data model.

Juanita Walton Billings

Thanks for such a comprehensive response to my questions. Ideally, "point of capture" metadata should include "data ownership" (from a single point of entry) that can follow any route the data takes, with updates to a user's metamodel (together with date/time stamps) for archiving usage and to track iterative changes/maniuplations of "original" data. From my perspective, data is a strategic asset that is engineered for the sole purpose of meeting the organization's objectives and, thus, should be closely tied to operational processes (both of which can be modelled). This, of course, impacts (and is impacted by) operational software. It also introduces the opportunity for "slicing and dicing" of data for analysis, e.g., statistical, to generate the "view" required by the end user, which puts the onus on the back-end to document usage and manipulation.

One other comment: while the data metamodel may include "ownership," I believe the crucial step in data engineering is the "description" provided, i.e., the definition. With proper oversight, every data element can be understood in, hopefully, "plain English" and, if inadequate to user requirements, the baseline metamodel can be reassessed for appropriate interpretation and understanding.

Thanks again!

Mayur Ingle

You can reffer :- http://db.csail.mit.edu/madden/high_perf.pdf

http://db.csail.mit.edu/madden/high_perf.pdf

Roberto Tellez Ibarra

You should analize the different methodologies by leveraging its advantages and fit to your project, and after to decide what is the most appropiate. There isn't a top methodology: although Kimball and Inmon are leaders, everyone defends a different point of view.

Zoubir Ouaret

In data warehousing (DW or DWH), William Inmon and Ralph Kimball are the two great pioneers of all time in the field of data warehouse design and Multidimensional modeling, Their approaches to the DW design differ greatly in terms of design strategy; Inmon’s approach is defined as the top-down design process of building a large centralized enterprise-wide data warehouse as first step then build data mart to serve the analytical needs of specific departments or process such as "Sales" or "Production.". Conversely, Kimball’s approach known as bottom-up approach which dimensional data marts are first created to provide the analytical needs for specific business processes, then integrating and joingning together these data marts using a bus architecture, which consists of conformed dimensions between all the data marts to create a comprehensive and an integrated data warehouse.

The following compare the Inmon and Kimball approaches to the data warehouse and design approach:

Inmon

Design approach: Top-down

Architecture: Enterprise DW

Complexity: Complex

Paradigm: Subject or data driven

Facilitating end-user access: Low

Maintenance: Easy

Kimball

Design approach: Bottom-up

Architecture: Data marts

Complexity: Simple

Paradigm: Process oriented

Facilitating end-user access: High

Maintenance: Diffecult

Badges
Science topic

Similar topics
Education

Could you recommend some articles on Urban Transportation System optimization and Innovation?

13 August 2024 2,595 3 View

Feedback defines the constitution of an organism?

“Here is a thought experiment. Let's place Rodolpho Llinas's jarred-brain on top of a body (Fig. 1). I bet Llinas would argue that his jarred-brain retains its own consciousness, and the android...

11 August 2024 2,483 1 View

How can I prepare virus for a TEM or SEM imaging?

I have virus (viral hemorrhagic septicemia virus) in suspension and the experiment will not involve cells. What level of TCID50 is preferred?

11 August 2024 3,115 1 View

How to learn more about SPSS and its Application?

I would like to learn more about SPSS and Its application especially in regards to data analysis. Please suggest me how I can learn more about it. Thank you so much.

11 August 2024 9,101 4 View

Text-Communication from the M1 Hand Area using BCI—and then there is Elon Musk?

Willett, Shenoy et al. (2021) have developed a brain computer interface (BCI) that used neural signal collected from the hand area of the motor cortex (area M1) of a paralyzed patient. The...

10 August 2024 7,180 0 View

Handling Missing Data and Building a Predictive Model with Incomplete Information ?

I am developing a predictive model for a water supply network that involves 20 influencing points. However, I only have historical data for 10 out of these 20 points. I would like to know how to...

10 August 2024 4,005 2 View

Has anyone applied Python in the field of textile engineering for data analysis, automation, or smart textiles?

I'm currently exploring the application of Python in textile engineering, specifically in areas like data analysis, process automation, and the development of smart textiles. I'm interested in...

10 August 2024 7,429 2 View

Is it possible to use the Fused Deposition Modeling (FDM) to additively manufacture interconnected porous structure generation of >100-200 micrometer?

Usually, additive manufacturing techniques like SEBM, SLS, and SLM are used for interconnected porous lattice structure generation with sizes of >100–200 micrometers. Can the Fused Deposition...

09 August 2024 7,892 0 View

How to define an anisotropic material with asymmetric elastic compliance/stiffness matrix in ANSYS APDL?

I need to model an anisotropic material in which the Poisson's ratio ν_12 ≠ ν_21 and so on. Therefore, the elastic compliance matrix wouldn't be a symmetric one. In ANSYS APDL, for TB,ANEL...

09 August 2024 5,048 2 View

Can we mark 'EFL Learners shifting from general digital to AI technologies' as technological transition?

After COVID-19 it has seen that EFL learners technological affiliation has raised. In addition, in the post-COVID period learners started to engage AI technologies like ChatGPT while learning...

08 August 2024 8,964 4 View