Business Intelligence and enterprise data mining management
A data warehouse is a kind of database designed to enable business intelligence solutions. Data warehouses are used to understand and enhance an organization's performance. The warehouse is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can also include data from other sources. A data warehouse separates the analysis workload from the transaction workload, giving the organization the ability to maintain historical records and analyze its data to gain a better understanding of the business and improve its practices. A data warehouse environment includes extraction, transportation, transformation, and loading (referred to as the ETL solution), statistical analysis, reporting, analytic tools, and data mining. Other applications manage the process of gathering data, transforming it into useful information, deriving actionable insight, and delivering it to business users.
A key aspect differentiates data warehouses from online transaction processing systems: in a data warehouse, the analysis workload is separated from the transaction workload. Data warehouses are strongly read-oriented, with reads far outnumbering writes. This facilitates analytic work and avoids any impact on transaction systems. With proper optimization, a data warehouse can consolidate data from numerous sources toward a given objective. It serves as an organization's single source of truth, adding great value to the many users who rely on the data.
In the 1980s, the relational database revolution began and ushered in an era of improved access to the valuable resources held in a company's information systems. However, great improvements were still required to facilitate reporting and analytics. It was soon realized that databases modeled for efficient transactional processing were poorly optimized for complex reporting and analytical processes.
The first systems to offer decision-support functionality were built on the early relational database models and SQL. The data mart was introduced by market research and television-rating magnate ACNielsen in the 1970s to improve sales efforts.
Founders of data warehousing
Data warehousing came to the limelight in the 1980s, when an IBM Systems Journal article, "An architecture for a business and information system," coined the term business data warehouse. The term had been used earlier, in the 1970s, by future progenitor Bill Inmon. According to the IBM article, the processing environment in which companies maintained their operational databases had been the original motive for computerization, and large-scale access to company information for analysis and reporting was a relatively new phenomenon. IBM's motive in advancing this informational-system paradigm was the combination of business needs and the availability of improved tools for accessing and analyzing business data. IBM introduced what it referred to as the Business Information System architecture in Europe, the Middle East, and Africa (EBIS) in a bid to draw together the pieces of information-system activity within companies. EBIS proposed an integrated warehouse of data based on relational databases. Users could easily access this warehouse through a consistent set of tools, via an interface supported by a business data directory that made clear what the user could access.
Bill Inmon has been called the father of data warehousing and was named by Computerworld in 2007 as one of the ten IT people who mattered in the last forty years. Inmon worked extensively through the 1970s and 1980s to hone his expertise in all matters of relational data modeling. In the 1990s, Inmon ventured into data warehousing solutions; the first such tool was Prism Warehouse Manager, developed by his company Prism Solutions. In 1992, Inmon published "Building the Data Warehouse," a seminal volume for the industry that is currently in its fourth edition. The book continues to be a pillar of fine-tuned theoretical and practical examples in the field. Inmon also established the Corporate Information Factory (CIF), documenting enterprise-level information on organizational data and warehousing; the CIF website hosts numerous Inmon writings and white papers on the data profession.
Inmon's approach to data warehouse design centers on a centralized data repository modeled in normalized form. According to Inmon, strong relational modeling is required for the enterprise-wide consistency that is essential for developing the individual data marts that serve business needs. This perspective differs markedly from those of other data warehousing pioneers.
Ralph Kimball, another father of data warehousing, gave a robust theoretical underpinning to the concepts behind data warehousing. In 1996, Kimball published "The Data Warehouse Toolkit," which included a host of industry-honed practical examples of OLAP-based modeling. Kimball's career traces from Xerox, where he was a workstation designer, through Metaphor Computer Systems in a decision-support capacity, to his own company, Red Brick Systems, founded in 1986. Red Brick was developed as a relational system suited to high-speed data warehousing applications. Kimball has produced a wide range of data warehousing resources, including toolkits, web-based solutions, books dealing with ETL in warehousing, and editions for Microsoft SQL Server and the Microsoft Business Intelligence toolset.
The differences between Inmon's and Kimball's perspectives are apparent. Inmon advocated the development of one centralized data warehouse that leverages a relational database; an example of this top-down philosophy is the Corporate Information Factory. Kimball's approach favors the development of individual data marts that can be integrated together using an Information Bus architecture; it is a bottom-up approach that pairs well with star-schema modeling.
Irrespective of their differences, both approaches are core to data warehousing architectures. Smaller businesses might choose Kimball's approach because of its easier implementation and their constrained budgets, while bigger firms might adopt Inmon's approach.
21st century data warehouse
The many changes in today's industry also affect data warehouses. Cloud computing and real-time data analytics have played a significant role in their evolution, while on the end-user side, web-based and mobile computing lead the requirements of the century. Likewise, advances in the practice of ontology have magnified the capability of ETL systems to parse information out of both structured and unstructured sources. Big data has emerged as a broad term in today's computing landscape, and industrial-strength data warehouses have emerged to support this new trend. The key principle in data warehousing, however, remains solid enterprise integration: irrespective of the infrastructure, architecture is paramount and should be preserved as warehousing enters the third decade of its history.
A data warehouse usually stores large amounts of data, often months or years' worth, to support historical analysis. Data is loaded from varied sources via an extraction, transformation, and loading (ETL) process. Modern deployments employ ETL infrastructures that perform data transformation on the database that hosts the warehouse. Defining the ETL architecture is thus an important part of the data warehouse design effort; the speed and reliability of ETL operations are founded in the warehouse design, but they can only be measured once the system is up and running.
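The extract, transform, and load steps can be sketched in a few lines of Python. This is a minimal illustration, not a production ETL tool: the source records, field names, and derived columns are all hypothetical.

```python
# Minimal ETL sketch: extract rows from a hypothetical operational source,
# transform them into a warehouse-friendly shape, and load them into a target.

def extract():
    # Stand-in for reading from an operational database or flat file.
    return [
        {"order_id": 1, "sku": "A-100", "qty": "3", "price": "9.99", "day": "2024-01-15"},
        {"order_id": 2, "sku": "B-200", "qty": "1", "price": "24.50", "day": "2024-01-16"},
    ]

def transform(rows):
    # Cast types, derive measures, and conform the date to a dimension key.
    out = []
    for r in rows:
        qty = int(r["qty"])
        price = float(r["price"])
        out.append({
            "order_id": r["order_id"],
            "sku": r["sku"],
            "qty": qty,
            "revenue": round(qty * price, 2),        # derived measure
            "date_key": r["day"].replace("-", ""),   # conformed time key
        })
    return out

def load(rows, warehouse):
    # Append-only load into the (here, in-memory) fact table.
    warehouse.extend(rows)

fact_sales = []
load(transform(extract()), fact_sales)
```

In a real pipeline the extract step would query the operational systems and the load step would write to the warehouse database, but the separation of concerns is the same.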
Use of data warehouses for data analysis is typically time-related: for instance, consolidating last year's sales data, profit by product and customer, or inventory at a certain period. Additionally, with or without a time focus, users want to continuously slice and dice their data to see whether it fits a given scenario and perspective, and hence a well-designed data warehouse is desirable. When requirements go beyond highly aggregated data and drill-down to details, more sophisticated analysis involving data mining, trend analysis, and forecasting may be needed. A data warehouse is the underlying engine used by middleware business-intelligence environments to produce reports, dashboards, and interactive user interfaces.
Data warehouses have varied uses in analyzing the status and development of an organization. They are based on large pools of data integrated from heterogeneous sources. The data is modeled in a multidimensional schema that comes naturally to human analysts. Multidimensional schemas are made of facts, measures, and dimensions: facts represent decision-making subjects such as sales and orders; measures are numerical KPIs such as quantity sold and price; and dimensions represent the context in which those measures are analyzed. Owing to this specification, the development of data warehouses is complex and requires ad-hoc methodologies and dedicated life cycles.
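The relationship between facts, measures, and dimensions can be sketched with plain Python structures. The table and attribute names here are illustrative assumptions, not from any specific system.

```python
# Sketch of a multidimensional (star-style) schema as plain Python structures.
# Dimension tables describe the context of analysis...
dim_product = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Hardware"},
}
dim_time = {
    20240115: {"month": "2024-01", "quarter": "2024-Q1"},
    20240220: {"month": "2024-02", "quarter": "2024-Q1"},
}
# ...while fact rows reference dimensions by key and carry numeric measures.
fact_sales = [
    {"product_key": 1, "date_key": 20240115, "qty": 3, "revenue": 30.0},
    {"product_key": 2, "date_key": 20240220, "qty": 1, "revenue": 25.0},
]

# A measure aggregated along a dimension: total revenue per quarter.
revenue_by_quarter = {}
for row in fact_sales:
    quarter = dim_time[row["date_key"]]["quarter"]
    revenue_by_quarter[quarter] = revenue_by_quarter.get(quarter, 0.0) + row["revenue"]
```

Rolling a measure up along a dimension (here, dates up to quarters) is the basic operation every multidimensional query builds on.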
In the development of data warehouses, conceptual design and requirement analysis form the core activities.
The data warehouse is regarded as one of the most complex information-system modules in IT, and its design and maintenance are characterized by that complexity; this is why a high rate of project failure was noted in the early stages of DW adoption. Authors have documented the critical factors of data warehousing projects: Demarest (1997) found that the socio-technical aspects (the effect of the DW on decisional processes and political equilibrium), technological aspects, and design aspects of the systems are the determining factors. The choice of a specific lifecycle for the development of a data warehouse must take into account the peculiarities of this type of system. In summary:
- Data warehouses rely on the operational databases in place that feed them with data
- User requirements are dynamic and need to be analyzed at every phase
- Data warehouses are huge projects that take a long time to develop, usually 12 to 36 months, and can cost tens of millions of dollars
- The managerial side of an organization demands reliable results in a time frame compatible with business requirements
While it is hard to document how to handle the first two aspects, the data warehouse community has crafted an approach that cuts costs and delivers a satisfactory solution. Instead of approaching data warehouse development in a top-down fashion, it is more efficient to build it bottom-up from single data marts. A data mart is a part of a data warehouse that serves a single analytic problem or department in an organization, with a restricted scope of content and support for analytic processes. Under a bottom-up approach, building the DW amounts to integrating the individual data marts.
Developing individual data marts is an iterative approach that promises to fulfill user requirements. The benefits are apparent and include reduced development cost and time, because design and implementation effort is limited to what is needed for results. Likewise, the fourth requirement can be fulfilled if the designer prioritizes the data marts most relevant to stakeholders.
A pure bottom-up approach, though less costly and less complex, poses risks originating from the partial vision of the business domain available at each stage of the design. This risk can be minimized by first developing a data mart that plays a founding role within the data warehouse, so that other modules can be easily integrated into this backbone. This solution is referred to as the bus architecture.
Thus, the main phases of the DW lifecycle can be summarized as:
Data warehouse planning – the scope and goals of the data warehouse are defined, along with the number of data marts and the order in which they are to be implemented, as per business priorities and technical constraints. The physical architecture of the design is also defined, so that the designer knows the size of the system and hence the appropriate software and hardware. Project staffing is also done at the planning phase.
Data mart design and implementation – this phase is iterated for every data mart developed; at each iteration, a new data mart is designed and deployed.
Data warehouse maintenance and evolution – maintenance usually concerns performance optimization and is conducted periodically. Evolution refers to keeping the data warehouse up to date with respect to the business domain and its dynamic requirements, which change according to the problems and opportunities that come along. For example, a new manager may require a new dimension for analyzing an existing fact schema, or the addition of a new level of classification following a change in a business process; left unaddressed, such changes can cause early obsolescence of the system.
Data warehouse design methodologies
Data warehouse methodologies are those concerned with the second of these phases. Although much literature exists on the subject, there is little consensus on a preferred methodology. Nevertheless, most methods agree on the need to distinguish the following stages:
Requirement analysis – the identification of which information is relevant to the decisional processes, either by considering user needs or by examining the data available in the operational sources.
Conceptual design – based on a conceptual model, this phase aims at deriving an implementation-independent conceptual schema for the data mart.
Logical design – this phase takes the conceptual schema and outputs a logical schema based on the preferred logical model. Currently, most DWs are based on the relational logical model (ROLAP), while an increasing number of software vendors propose pure or mixed multidimensional solutions (MOLAP/HOLAP).
ETL process design – designing the mappings and transformations that load data from the operational sources into the logical schema of the data warehouse.
Physical design – is meant to address the issues related to the suite of tools adopted for implementation, for example, indexing and allocation.
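The logical and physical design stages above can be illustrated with a small ROLAP schema built in SQLite: dimension tables, a fact table keyed by them, and an index as a nod to physical design. Table and column names are illustrative assumptions.

```python
# A toy ROLAP-style star schema in SQLite: logical design (tables and keys)
# plus one physical-design decision (an index on the fact table).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    qty         INTEGER,
    revenue     REAL
);
-- Physical design: index the fact table on its most-queried dimension key.
CREATE INDEX idx_fact_date ON fact_sales(date_key);
""")

conn.execute("INSERT INTO dim_date VALUES (20240101, '2024-01', 2024)")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
conn.execute("INSERT INTO fact_sales VALUES (20240101, 1, 3, 30.0)")

# A typical star-join query: aggregate a measure along a dimension attribute.
row = conn.execute("""
    SELECT d.month, SUM(f.revenue)
    FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.month
""").fetchone()
```

The same schema could be created in any relational engine; SQLite is used here only because it needs no server.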
Requirement analysis techniques
In database development, requirements are classed as functional and non-functional. Functional requirements answer the question of what information the system is expected to provide, while non-functional requirements describe how this information is provided for correct utilization. Functional requirements are covered by three design approaches: supply-driven, goal-driven, and user-driven. User requirements are determined using these techniques, each of which presents its own advantages and disadvantages.
It is sometimes argued that the data warehouse is a glorified workaround of the database model. In relational databases, information is modeled and entity classes are defined. Entity classes represent the categories into which the data in the database is organized, and they are taken to represent a static view of the world; consider, for example, the entity model of a university comprising entity classes for students and courses. On Heraclitus's view of reality, a student may show up for a class only once: after a session, the student's knowledge of the subject matter has changed, and the student may have greater or lesser enthusiasm for it due to events in the course of their lives. The lecturer, too, updates the course content incrementally. Yet all these changes are ignored in order to preserve the stability of the way nature is viewed: in the end, the student takes the course and gets a grade. Relational databases act as a categorical model of reality presented in Structured Query Language. The aggregate functions found in relational databases are limited to those expected for analyzing categorical data; for instance, sum, average, and maximum provide a limited capability to summarize numerical attributes but offer no temporal or correlation functions. There is no answer to the question of whether a particular student is more likely to get a higher grade in a particular course, and no way of asking whether the average of a certain course is higher or lower than it was previously. In short, relational databases have limited analytic capabilities. Additional problems presented by relational databases include the handling of times and dates.
Conversely, data warehouses begin with a dimensional model representing a key business process. A common way of presenting a dimensional model is the star schema, where the business process is represented in a fact table at the center of the star, while the surrounding dimension tables hold the attributes that might affect the measurements. Because dimensional models are temporal, time is presented as one of the dimensions. In a data warehouse, it is possible to ask for information about a specific variable on a given day, such as sales data, and to predict future values based on the factors that affect the business process. Thus, data warehouses are instrumental in analyzing data and predicting future trends, a functionality not provided by relational databases.
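The period-over-period question the previous discussion raised ("is this average higher than it was previously?") becomes a straightforward aggregation once time is an explicit dimension. A minimal sketch, with illustrative data:

```python
# With time as an explicit dimension, period-over-period comparisons are
# simple aggregations over the fact rows. Data below is illustrative.
fact_sales = [
    {"month": "2024-01", "product": "Widget", "revenue": 100.0},
    {"month": "2024-01", "product": "Gadget", "revenue": 80.0},
    {"month": "2024-02", "product": "Widget", "revenue": 130.0},
    {"month": "2024-02", "product": "Gadget", "revenue": 70.0},
]

def total_by_month(rows):
    # Roll revenue up along the time dimension.
    totals = {}
    for r in rows:
        totals[r["month"]] = totals.get(r["month"], 0.0) + r["revenue"]
    return totals

totals = total_by_month(fact_sales)
# "Is this month higher than last month?" -- the kind of temporal question
# that plain aggregates over a transactional schema do not answer directly.
trend_up = totals["2024-02"] > totals["2024-01"]
```

Trend analysis and forecasting tools in real warehouses generalize exactly this pattern: aggregate a measure along the time dimension, then compare or extrapolate across periods.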
Demarest, M. (1997). The politics of data warehousing. Retrieved August 2008 from http://www.noumenal.com/marc/dwpoly.html
Desouza, K. C. (2003). Knowledge management barriers: Why the technology imperative seldom works. Business Horizons, 46(1), 25-29.
Majumdar, B. D. (2006). ESB – A bandwagon worth jumping on. Elsevier.
Reeves, L. (2009). A manager's guide to data warehousing. John Wiley & Sons.
Williams, P. (2012, August 12). A Short History of Data Warehousing. Dataversity.