The term “warehousing” can be understood in English as “a storehouse for depositing goods and merchandise”. If the goods and merchandise are our good old data (sales reports, statistics, and business transactions), the result is a “data warehouse”. The concept of data warehousing is that data from older systems is copied into a new system dedicated entirely to analyzing data. A data warehouse stores a substantial amount of historical data. Users of this system can continuously query it to retrieve data for analysis. Many organizations have adopted this concept to make informed decisions faster using historical data.
The evolution of the data warehouse starts with Decision Support Systems (DSS), where computers were used to control and make basic decisions. In the early 1960s, the world of computation consisted of creating individual applications that were run using master files. The master files were housed on magnetic tape and had to be accessed sequentially. To overcome these shortcomings, technologies such as the Direct Access Storage Device (DASD) and PC/4GL technology emerged, but they still left problems with data credibility and an inability to transform data into information. A change in approach was needed, which is where the architected data warehouse comes in. It differentiates between primitive data (detailed data used to run the day-to-day operations of the company) and derived data (data summarized to meet the needs of the company's management). In a data warehouse, primitive data and derived data coexist peacefully at different levels.
Technically, a data warehouse can be defined as “a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process”. In other words, it is a repository of historical, time-stamped data that gives business intelligence professionals enough information to make sound business decisions. Breaking down the original definition into parts:
Subject-Oriented: Data is organized around the major subjects of the enterprise, such as customer, product, or sales, rather than around the individual applications that produce it.
Integrated: Data is fed into the data warehouse from multiple sources. To ensure that the data is held in a single consistent form in the warehouse, it is reformatted, re-sequenced and summarized before being entered. For example, gender may be encoded differently in different sources: 1/0, m/f, male/female, and so on. The warehouse then applies one encoding consistently: a single format for all male/female assignments.
Time-Variant: The data warehouse contains snapshots of a record at a moment in time, so every unit of data is accurate as of some moment in time. It can store data with a time horizon of 5 to 10 years or more.
Non-Volatile: In a data warehouse, records are not updated in place; data is non-volatile. Instead, snapshots of new records or data are made and stored.
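The integrated, time-variant and non-volatile properties can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the class and function names, the customer fields, and the reading of 1/0 as male/female are all hypothetical and would depend on the actual source systems.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical mapping from source-specific codes to one warehouse code.
# Treating "1" as male and "0" as female is an assumption for this sketch.
GENDER_MAP = {
    "1": "M", "m": "M", "male": "M",
    "0": "F", "f": "F", "female": "F",
}

def normalize_gender(raw) -> Optional[str]:
    """Integrated: map any known source encoding to "M" or "F", else None."""
    return GENDER_MAP.get(str(raw).strip().lower())

@dataclass(frozen=True)
class CustomerSnapshot:
    """Time-variant: each row is accurate as of the date in `as_of`."""
    customer_id: str
    gender: Optional[str]
    city: str
    as_of: date

class SnapshotStore:
    """Non-volatile: rows are only appended, never updated in place."""

    def __init__(self):
        self._rows = []

    def record(self, customer_id, raw_gender, city, as_of):
        snapshot = CustomerSnapshot(customer_id, normalize_gender(raw_gender), city, as_of)
        self._rows.append(snapshot)

    def history(self, customer_id):
        """All snapshots for one customer, oldest first."""
        rows = [s for s in self._rows if s.customer_id == customer_id]
        return sorted(rows, key=lambda s: s.as_of)

    def state_at(self, customer_id, when):
        """Latest snapshot taken on or before `when`, or None if there is none."""
        rows = [s for s in self.history(customer_id) if s.as_of <= when]
        return rows[-1] if rows else None

store = SnapshotStore()
store.record("C-1", 1, "Austin", date(2012, 1, 31))       # one source uses 1/0
store.record("C-1", "male", "Boston", date(2012, 6, 30))  # another uses male/female
```

Asking `store.state_at("C-1", date(2012, 3, 1))` returns the January snapshot, because the later record was added as a new row rather than overwriting the earlier one.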
Data warehouses have evolved to support more than just strategic reporting, analytics, and forecasting. Today, companies are investing significant resources to integrate valuable information contained in their data warehouse into their day-to-day operations.
Some of the key questions addressed in this section are:
- Why Data Warehousing?
- Structure of a Data Warehouse?
- How is it used? Who are the users?
- Difference between a Database and a Data Warehouse?
- Trade-off in terms of cost?
- Major vendors involved?
Why Data Warehousing?
1. Decision Support System Implementations: The DSS should have the ability to monitor historical trends and patterns in data and provide suggestions/conclusions.
2. Increased Compliance and Regulatory Requirements
3. Data Center Migrations
The lifecycle of a data record through enterprise analytics starts with the capture of a business event in a data repository such as a database. Data acquisition technologies deliver the event record to the data warehouse. Analytical processing helps turn the data into information, and a business decision leads to a corresponding action. To approach real time, the duration between the event and its consequent action needs to be minimized.
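To make the gap between event and action concrete, the following sketch measures how long one record spends between lifecycle stages. The stage names and timestamps are hypothetical and chosen only for illustration; a real system would read them from its own logs.

```python
from datetime import datetime

# Hypothetical timestamps for one business event moving through the lifecycle.
stages = [
    ("event_captured", datetime(2013, 3, 1, 9, 0, 0)),
    ("loaded_to_warehouse", datetime(2013, 3, 1, 9, 5, 0)),
    ("analysis_ready", datetime(2013, 3, 1, 9, 20, 0)),
    ("action_taken", datetime(2013, 3, 1, 10, 0, 0)),
]

def stage_latencies(stages):
    """Return (from_stage, to_stage, seconds) for each consecutive pair of stages."""
    return [
        (a_name, b_name, (b_time - a_time).total_seconds())
        for (a_name, a_time), (b_name, b_time) in zip(stages, stages[1:])
    ]

latencies = stage_latencies(stages)

# End-to-end delay between the event and the resulting action.
total_seconds = (stages[-1][1] - stages[0][1]).total_seconds()

# The step that contributes most to the delay is the first candidate to shorten.
slowest = max(latencies, key=lambda item: item[2])
```

Breaking the total delay down by stage shows where the time goes, which is the first step in moving an analytics pipeline closer to real time.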
Structure of a Data Warehouse
Figure 1: Data Warehouse – the big picture
There are four levels of data in the architected environment: the operational level, the atomic (or data warehouse) level, the departmental (or data mart) level, and the individual level. These levels of data are the basis of a larger architecture called the corporate information factory. The departmental level, sometimes called the data mart level, OLAP level, or multidimensional DBMS level, can be described as a specialized subset of the data warehouse. It caters to the data requirements of a specific group, for example Finance, HR, or Marketing. Data warehousing is the process of building and maintaining a data warehouse, including the data marts and any downstream client applications. Data integrity is of the utmost importance in data warehousing. However, unlike in OLTP systems, the data need not be normalized to remove redundancies.
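The relationship between the atomic level and a departmental data mart can be illustrated with a small sketch: detailed rows are filtered to one department and summarized. The row layout, field names and figures below are hypothetical and only stand in for whatever the warehouse actually holds.

```python
from collections import defaultdict

# Hypothetical atomic (warehouse-level) transaction rows.
atomic_rows = [
    {"department": "Finance", "month": "2013-01", "amount": 120.0},
    {"department": "Marketing", "month": "2013-01", "amount": 80.0},
    {"department": "Marketing", "month": "2013-01", "amount": 40.0},
    {"department": "Marketing", "month": "2013-02", "amount": 60.0},
]

def build_data_mart(rows, department):
    """Summarize one department's detailed rows into monthly totals."""
    totals = defaultdict(float)
    for row in rows:
        if row["department"] == department:
            totals[row["month"]] += row["amount"]
    return dict(totals)

# The mart is a derived, department-specific subset of the warehouse data.
marketing_mart = build_data_mart(atomic_rows, "Marketing")
```

The atomic rows stay untouched in the warehouse; the mart is simply a summarized view built for one group of users.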
Usage and users of a Data Warehouse
A data warehouse brings together data from heterogeneous sources into a single destination. It captures the data of the entire organization. Data is Extracted from the source (such as a database), Transformed, and then Loaded into the data warehouse. This processing, called ETL, is done in the staging area (see Figure 1). Several ETL software packages on the market today can automate this tedious process. Some of the ETL tools are listed in Table 1.
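A minimal ETL pass can be sketched with Python's built-in `sqlite3` module. This is an illustration only, not how any particular ETL tool works: the table names (`orders`, `fact_orders`), the columns and the cleaning rules are assumptions made for the example, and two in-memory databases stand in for the source system and the warehouse.

```python
import sqlite3

def extract(source_conn):
    """Extract: read raw order rows from the (hypothetical) source table."""
    return source_conn.execute(
        "SELECT id, customer, amount_cents FROM orders"
    ).fetchall()

def transform(rows):
    """Transform: tidy customer names and convert cents to currency units."""
    return [(oid, name.strip().title(), cents / 100.0) for oid, name, cents in rows]

def load(warehouse_conn, rows):
    """Load: append the cleaned rows to the (hypothetical) warehouse table."""
    warehouse_conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    warehouse_conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    warehouse_conn.commit()

# Stand-in for an operational source database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "  alice smith ", 1250), (2, "BOB JONES", 300)],
)

# Stand-in for the data warehouse.
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source)))

loaded = warehouse.execute(
    "SELECT order_id, customer, amount FROM fact_orders ORDER BY order_id"
).fetchall()
```

Keeping the three steps as separate functions mirrors the staging-area idea: the transform step works on extracted copies, so the source system is never modified.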
The data warehouse user, also called the DSS analyst, is a business person first and foremost and a technician second. The primary job of the DSS analyst is to define and discover information used in corporate decision-making. The DSS analyst operates in a mode of discovery: only on seeing a report or a screen can the DSS analyst begin to explore the possibilities for DSS.
Trade-off in terms of cost
In most cases, the real benefits of the data warehouse are not known before its construction. The DSS analyst cannot determine how or why the data warehouse will be useful until the first iteration of the data warehouse is developed. For this reason, classic ROI (Return on Investment) techniques do not apply to the data warehouse environment. A data warehouse is built incrementally, so the starting cost is relatively small. The first iteration of the warehouse should be small enough to be built and large enough to be meaningful. However, as the warehouse is built and populated, the DSS analyst has to justify the increasing development costs of the warehouse. Data, processors, communications, software, tools, and so forth all cost money. The major investment for a system lies in creating, installing and establishing the system. In addition, a data warehouse is never done. Even after the initial few iterations of the data warehouse are successfully completed, adding more subject areas to the data warehouse is an ongoing need.
Major vendors involved in data warehousing projects
Several multinational companies, especially IT and business organizations, are moving towards data warehousing and big data analytics. Oracle, Teradata, Intel, IBM, Microsoft and SAP are some of the companies that make use of this technology. Oracle provides Oracle GoldenGate, Oracle Data Integrator, Oracle Exadata and Oracle Business Intelligence Enterprise Edition, which together form a complete real-time, operational data warehousing solution. Intel IT is implementing a strategy of multiple business intelligence (BI) data warehouses to provide significantly more powerful analytics capabilities to business groups across Intel. With an array of BI platforms, Intel can mine a broader range of data faster, deeper, and more cost-effectively. This expanded architecture enables its business groups to solve more high-value business problems, achieve greater operational efficiencies, and improve their competitive performance in global markets.
Teradata is a leading company in data warehousing and is exploring options for integrating DW products with Hadoop, a platform (or set of frameworks) for processing big data, that is, large amounts of data such as data from the web. Hadoop plays a key role in capturing, transforming, and publishing data. Using tools such as Apache Pig, advanced transformations can be applied in Hadoop with little manual programming effort, and since Hadoop is a low-cost storage repository, data can be held for months or even years. Because Hadoop has already been used to clean and transform the data, the data can be loaded directly into the data warehouse. Hadoop is not itself an Extract-Transform-Load (ETL) tool; it is a platform that supports running ETL processes in parallel. As an example of what this enables, marketing can develop additional money-saving offers for consumers by analyzing trends by household, neighborhood, time of day, and local events, and an in-home display unit can give consumers detailed knowledge of their usage.
Figure 2: Hadoop and Data Warehouse
According to the data warehouse Magic Quadrant report from Gartner, an IT analyst and market research firm, the competition between today’s leading companies can be represented by the following graph.
A data warehouse application is about making better decisions. In addition to reducing cost, the business can actually earn money with a data warehouse implementation. An example from Paul Westerman’s “Data Warehousing” piqued my interest and helped me understand the use of a data warehouse better. Consider inventory management. A merchandise manager, through the use of a data warehouse, is able to graphically view inventory levels, sales, and deliveries. Reviewing a historical chart, she can see that the interval between orders for a particular article is too long, because the article is continually out of stock. The chart visually illustrates an inventory flow problem. Knowing this, the manager takes action to increase the order review frequency; orders are placed more frequently, which increases sales and profits. Ultimately, the goal is that business people will be able to make better, faster business decisions by reviewing summarized historical information as well as newly created information.
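The inventory example can be sketched as a small calculation over historical snapshots. Everything specific here is an assumption made for illustration and does not come from Westerman's book: the daily figures, the 10% stock-out threshold, and the rule of halving the review interval.

```python
# Hypothetical daily inventory snapshots for one article: (day, units_on_hand).
snapshots = [
    (1, 40), (2, 22), (3, 5), (4, 0), (5, 0), (6, 0), (7, 50),
    (8, 30), (9, 8), (10, 0), (11, 0), (12, 0), (13, 0), (14, 50),
]

def out_of_stock_rate(snapshots):
    """Fraction of snapshot days on which no units were on hand."""
    empty_days = sum(1 for _, units in snapshots if units == 0)
    return empty_days / len(snapshots)

def suggest_review_interval(current_days, rate, threshold=0.1):
    """Halve the order review interval (minimum 1 day) when the stock-out
    rate exceeds the threshold; otherwise keep the current interval.
    Both the threshold and the halving rule are illustrative assumptions."""
    if rate > threshold:
        return max(1, current_days // 2)
    return current_days

rate = out_of_stock_rate(snapshots)               # share of days with empty shelves
new_interval = suggest_review_interval(7, rate)   # review orders more often if needed
```

The point is that the decision rests on the history of snapshots: a system holding only the current stock level could not show that the article runs out repeatedly.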
A data warehouse is nothing more than a sophisticated series of snapshots, each taken at a moment in time. The effect created by the series of snapshots is that the data warehouse has a historical sequence of activities and events, something not at all apparent in a current-value environment where only the most current value can be found. Succeeding in today’s competitive business environment requires good decisions, not just at the top level of the organization. Operational data warehousing allows all users in the organization to access and respond to information in a timely manner. Data warehousing is a vast field of study, and combined with the power of Hadoop it provides solutions to complex situations with a flexibility never imagined before.
Several other related topics, such as the granularity maintained in a data warehouse, the distributed data warehouse, and executive information systems in data warehouses, are explained in detail in various internet sources and digital libraries. Paul Westerman’s “Data Warehousing” describes in detail the data warehouse maintained at Wal-Mart, which it reports as about 70 terabytes in size and as the world’s largest and most successful commercial database, and still growing.
Awadallah, A., & Graham, D. (2011). Hadoop and the Data Warehouse: When to Use Which. Cloudera Inc.; Teradata Corporation.
Inmon, W. H. (2002). Building the Data Warehouse (3rd ed.). USA: John Wiley & Sons, Inc.
Mathen, M. P. (2010, March). Data Warehouse Testing. Building Tomorrow's Enterprise. India: DeveloperIQ Magazine.
Oracle. (2012). Data Integration Architectures for Operational Data Warehousing. Oracle Fusion Middleware GoldenGate.
Westerman, P. (2001). Data Warehousing: Using the Wal-Mart Model. San Francisco: Morgan Kaufmann Publishers.
Yalla, C., Chandramouly, A., & Eden, C. (2013, March). Using a Multiple Data Warehouse Strategy to Improve BI Analytics. BI Data Warehouse Strategy.