Five major differences between data and data stores

01 January 1970

Five major differences between data and data stores:

1. Data Lakes Retain All Data

During the development of a data warehouse, a considerable amount of time is spent analyzing data sources, understanding business processes and profiling data. The result is a highly structured data model designed for reporting. A large part of this process is deciding what data to include in the warehouse and what to leave out. Generally, if data isn’t used to answer specific questions or in a defined report, it may be excluded from the warehouse. This is usually done to simplify the data model and to conserve space on the expensive disk storage that makes the data warehouse performant.

In contrast, the data lake retains ALL data. Not just data that is in use today but data that may be used and even data that may never be used just because it MIGHT be used someday. Data is also kept for all time so that we can go back in time to any point to do analysis.

This approach becomes possible because the hardware for a data lake usually differs greatly from that used for a data warehouse. Commodity, off-the-shelf servers combined with cheap storage make scaling a data lake to terabytes and petabytes fairly economical.

2. Data Lakes Support All Data Types

Data warehouses generally hold data extracted from transactional systems: quantitative metrics and the attributes that describe them. Non-traditional data sources such as web server logs, sensor data, social network activity, text and images are largely ignored. New uses for these data types continue to be found, but consuming and storing them can be expensive and difficult.

The data lake approach embraces these non-traditional data types. In the data lake, we keep all data regardless of source and structure. We keep it in its raw form and we only transform it when we’re ready to use it. This approach is known as “Schema on Read” vs. the “Schema on Write” approach used in the data warehouse.
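To make the “Schema on Read” vs. “Schema on Write” distinction concrete, here is a minimal sketch in plain Python. The sample events, field names and function names (`load_warehouse`, `read_with_schema`) are all made up for illustration; a real warehouse or lake would use a database or a file store rather than in-memory lists.

```python
import json

# Hypothetical raw events as they might land in a data lake: stored as-is,
# with no fixed set of fields enforced at write time.
RAW_EVENTS = [
    '{"user": "a1", "amount": "19.99", "channel": "web"}',
    '{"user": "b2", "amount": "5", "device": {"os": "android"}}',
    '{"user": "c3", "note": "sensor reading with no amount"}',
]


def load_warehouse(raw_lines, columns=("user", "amount")):
    """Schema on write: shape and validate each record before storing it.

    Records that do not fit the predefined columns are dropped at load time,
    and fields outside the schema (channel, device, note) are discarded.
    """
    stored = []
    for line in raw_lines:
        record = json.loads(line)
        if all(column in record for column in columns):
            stored.append(
                {"user": str(record["user"]), "amount": float(record["amount"])}
            )
    return stored


def read_with_schema(raw_lines, fields):
    """Schema on read: keep the raw lines, apply a schema only when querying.

    `fields` maps a field name to a cast function; missing fields become None.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append(
            {
                name: (cast(record[name]) if name in record else None)
                for name, cast in fields.items()
            }
        )
    return rows


if __name__ == "__main__":
    # The warehouse keeps only the records and columns chosen up front.
    print(load_warehouse(RAW_EVENTS))
    # The lake can answer a question nobody planned for, e.g. about channels.
    print(read_with_schema(RAW_EVENTS, {"user": str, "channel": str}))
```

In this sketch the third event never reaches the warehouse because it has no amount, while the lake still holds it and can serve a later query that asks for a different set of fields.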

3. Data Lakes Support All Users

In most organizations, 80% or more of users are “operational”. They want to get their reports, see their key performance metrics or slice the same set of data in a spreadsheet every day. The data warehouse is usually ideal for these users because it is well structured, easy to use and understand and it is purpose-built to answer their questions.

The next 10% or so do more analysis on the data. They use the data warehouse as a source but often go back to source systems to get data that is not included in the warehouse, and sometimes bring in data from outside the organization. Their favorite tool is the spreadsheet, and they create new reports that are often distributed throughout the organization. The data warehouse is their go-to source for data, but they often go beyond its bounds.

Finally, the last few percent of users do deep analysis. They may create totally new data sources based on research. They mash up many different types of data and come up with entirely new questions to be answered. These users may use the data warehouse but often ignore it as they are usually charged with going beyond its capabilities. These users include the Data Scientists and they may use advanced analytic tools and capabilities like statistical analysis and predictive modeling.

The data lake approach supports all of these users equally well. The data scientists can go to the lake and work with the very large and varied data sets they need while other users make use of more structured views of the data provided for their use.

4. Data Lakes Adapt Easily to Changes

One of the chief complaints about data warehouses is how long it takes to change them. Considerable time is spent up front during development getting the warehouse’s structure right. A good warehouse design can adapt to change but because of the complexity of the data loading process and the work done to make analysis and reporting easy, these changes will necessarily consume some developer resources and take some time.

Many business questions can’t wait for the data warehouse team to adapt their system to answer them. The ever increasing need for faster answers is what has given rise to the concept of self-service business intelligence.

In the data lake, on the other hand, all data is stored in its raw form and is always accessible to someone who needs to use it, so users are empowered to go beyond the structure of the warehouse, explore data in novel ways and answer their questions at their own pace.

If the result of an exploration is shown to be useful and there is a desire to repeat it, then a more formal schema can be applied to it and automation and reusability can be developed to help extend the results to a broader audience. If it is determined that the result is not useful, it can be discarded and no changes to the data structures have been made and no development resources have been consumed.
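As a rough illustration of that workflow, the sketch below first runs an ad hoc exploration over raw records and then, once the result proves useful, applies a formal schema to it so it can be reused. The record fields (`day`, `amount`), the `DailySales` type and both function names are assumptions made up for this example, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DailySales:
    """Formal schema applied only after the exploration proved useful."""
    day: str
    total: float


def explore(raw_records):
    """Ad hoc exploration: total the amounts per day straight from raw data.

    Records that lack the needed fields are skipped, not rejected, so nothing
    in the lake has to change for this question to be asked.
    """
    totals = {}
    for record in raw_records:
        if "day" in record and "amount" in record:
            day = record["day"]
            totals[day] = totals.get(day, 0.0) + float(record["amount"])
    return totals


def promote(totals):
    """Turn a useful exploration into typed, sorted rows for repeat reporting."""
    return [
        DailySales(day=day, total=round(total, 2))
        for day, total in sorted(totals.items())
    ]


if __name__ == "__main__":
    # Hypothetical raw records with inconsistent shapes and types.
    records = [
        {"day": "2024-01-02", "amount": "3.5"},
        {"day": "2024-01-01", "amount": 2},
        {"note": "free text with no amount"},
        {"day": "2024-01-01", "amount": "1.25"},
    ]
    for row in promote(explore(records)):
        print(row)
```

If the exploration turns out not to be useful, `explore` is simply thrown away; `promote` is only written once there is a reason to repeat the analysis.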

5. Data Lakes Provide Faster Insights

This last difference is really the result of the other four. Because a data lake contains all data and data types, and because it lets users access data before it has been transformed, cleansed and structured, users can get to their results faster than with the traditional data warehouse approach.

However, this early access to the data comes at a price. The work typically done by the data warehouse development team may not be done for some or all of the data sources required to do an analysis. This leaves users in the driver’s seat to explore and use the data as they see fit, but the first tier of business users I described above may not want to do that work. They still just want their reports and KPIs.

Maxwell Obubu

I've learnt from the responses so far. Thanks for the insightful contributions.

C K Gomathy

Hi,

Data

The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

Data Store

A Data Store is a connection to a store of data, whether the data is stored in a database or in one or more files. A data store may be used as a source of data for a process, or you may export the written Staged Data results of a process to a data store, or both. Data store data may be structured, unstructured or in another electronic format.

Best wishes.

Mohsen Ghorbian

C K Gomathy Thank you for your guidance

Sajda Taha Mahmood Thank you for your guidance

Nirmala S.V.S.G

Thank you for your guidance

Maxwell Obubu

Diary R. Sulaiman

With Dr. Sajda Taha Mahmood

Thank you

Fares F. Fares

Good question and interesting following.

Devi Parameswari.C

A Data Store is a connection to a store of data, whether the data is stored in a database or in one or more files.
The data store may be used as the source of data for a process, or you may export the written Staged Data results of a process to a data store, or both.

Madhusudhana Kamath

Scheduling, Record, Investigation, Regulatory and Significance.

Dr Kamath Madhusudhana

Satish Narula

It appears to be more related to a data lake, whether software or company.
