This paper reviews the journal article “Data Mining with Big Data” by Xindong Wu (Fellow, IEEE), Xingquan Zhu (Senior Member, IEEE), Gong-Qing Wu, and Wei Ding (Senior Member, IEEE), which was published in January 2014. The review is divided into subsections to organize the points being discussed.
Objectives and Domain
The objective of the article was to discuss the limitations of using the HACE theorem as a tool to model Big Data, or large data sets. The article also emphasized the importance of using Big Data properly, through careful analysis, in many fields, including the physical and biological sciences. Since users bring different perspectives to the same data, the HACE theorem is meant to describe the data generally enough that everyone can read them with the same meaning and interpretation.
The domain, or topic area, of the article is data engineering. It includes an in-depth discussion of the three components of the HACE theorem: heterogeneity and diversity, autonomy and decentralization, and complexity and evolution. Heterogeneity and diversity concern how data are interpreted in different ways. This can be seen in applications such as medicine, where CT scans and X-rays rely on image data to detect a disease. The second component is autonomy and decentralization. As the words suggest, data sources are organized independently and do not need to rely on other domains to function. Finally, the complexity and continuous evolution of data mean that storage requirements keep growing, and available storage cannot keep pace with that growth. Besides this, data are becoming difficult to analyze and decode because they keep evolving. These issues challenge experts to work out how data can be centralized for general purposes.
The intended audience of the article is readers with a keen interest in using information that is readily available but too voluminous for relevant content to be extracted easily. Among them are company executives who wish to improve profitability by understanding human behavior, researchers in the biological and physical sciences who generate tremendous amounts of data that would be more useful if consolidated efficiently, IT professionals, statisticians, and the like. The article requires the reader to grasp concepts in data engineering and processing, such as costs and privacy issues, and to understand the basic relevance of data algorithms. This is because the article takes a comprehensive approach that does not define each concept in detail, but instead assumes that the reader has already read articles that provide the necessary foundation.
The article appeared in IEEE Transactions on Knowledge and Data Engineering. According to its website, the journal aims to disseminate information that may improve knowledge in areas such as strategic planning and research by means of data engineering (“IEEE Transactions on Knowledge and Data Engineering”, n.d.). Thus, it is appropriate that the article belongs in the journal.
In my opinion, the article is conceptual in nature. In applying the HACE theorem to Big Data, it does not provide empirical evidence backed by substantial statistical analysis, experiments, or field studies, as an empirical article would. Instead, it deals with the concept of the HACE theorem and the challenges faced when dealing with Big Data. It discusses three tiers to inform readers how data can be handled: by computing and analysis mechanisms such as parallel computing and collective data mining (tier I); by existing regulations and policies on the management of huge volumes of data, including confidentiality and the possible applications of such data (tier II); and by the models and algorithms that govern and analyze kinds of data such as partial and compound data (tier III).
The paper discusses the use of the HACE theorem to model Big Data. The theorem characterizes Big Data as huge and heterogeneous, drawn from autonomous sources that are decentralized, and complex and evolving. The article discusses each characteristic in turn. Big Data are not only huge but also have heterogeneous dimensionalities, because users devise their own protocols for recording data. Autonomous sources lead to decentralization, which makes the data difficult to assess. Big Data are also complex and evolving, so the volume and complexity of the content become harder and harder to assess.
The problems mentioned in the previous sections may be addressed through the different tiers. Tier I makes use of MapReduce, a programming model for parallel computing. Tier II proposes a model that uses a third-party system to analyze the data without the need to unlock data security. This tier also prevents the use of data patterns, since experts can decode them easily. Lastly, tier III uses data pattern analysis, which aims to find models that can be used on a global scale. Data streaming is also avoided in order to reduce data loss.
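To make the Tier I discussion more concrete, the following is a minimal single-process sketch of the MapReduce pattern, written in Python for this review. It is not taken from the article: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-counting task are assumptions chosen only for illustration, and a real deployment would distribute the map and reduce steps across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle, and reduce in sequence (hypothetical driver)."""
    pairs = []
    for document in documents:
        # In a real cluster each split would be mapped on a separate worker.
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["big data", "data mining"]))
```

Because each call to `map_phase` depends only on its own input split, the map work can in principle be spread across independent workers, which is the property that makes the pattern attractive for large data sets.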
However, these solutions had already been proposed in past research, and the article proposed no new solution for utilizing Big Data as a whole. It does mention some solutions based on the HACE model, such as the centralization of Big Data. Centralization may be achieved by eliminating autonomous and independent data sets and by making the privacy guidelines uniform.
Moreover, the only evidence offered to support centralization as the key was a set of alternative scenarios in which decentralization is minimized. Centralization is not a simple task. The data would need to be gathered into a single database so that they can be managed more easily in a single location, and the security of such a database must be taken into consideration as well.
The management and utilization of Big Data are still a challenge for experts in data engineering and information technology. The present level of technology is not yet sufficient to solve problems related to data loss and data complexity. Fortunately, the issue of privacy is being addressed so that private information will be highly secured. The article concluded that Big Data is a major revolutionary factor in the future development of society, which is why experts are continuously developing systems that can manage and store data efficiently and use them generally rather than only in particular practical applications. Data must also be classified in order to determine what will be secured and what will be disseminated to the public.
Big Data may be analyzed, but only inefficiently, since the input volume per time frame exceeds the capacity of most software. From an IT point of view, data management should be addressed with high priority, because lost data have many other applications that could be significant to other fields if only they were managed efficiently and used to the fullest.
The paper does not really add new knowledge to the field of data engineering. The era of Big Data had already been emphasized by many IT and “nontechnical executives” as a way to make their decision process logical (“Opening Our Eyes in the Era of Big Data”, n.d.). What the article has done is to relate Big Data to the HACE theorem and discuss its components and underlying issues. Previous solutions that were not successful enough are also discussed. Aside from that, the article does not propose any solution to the problems it addresses.
The journal article drew on many works by past researchers to support its claims, such as dynamic game theory (a game is dynamic because the interactions between users are repeated) and the theory of local pattern analysis (which describes patterns of the world). These works are reliable, which implies that the findings in the article are accurate and can be referenced by future research as well. Experts are also cited within the article, which adds to its reliability.
The journal article focused on the HACE model and its components, which are linked to the issues faced in managing Big Data. The HACE theorem uses a conceptual framework that specifies the problems stated previously. However, revealing the problems that are apparent in the model is not useful on its own. Rather, the researchers could have provided solutions beyond those they offered, such as statistical and computational improvements, since these are the future of Big Data, and not just facts acquired from past work that used the same approach as the HACE theorem (Shaw, 2014). A solution is more important than statistical facts and computational improvements. There are other approaches that could be taken, and the article should have gone more deeply into finding a possible solution to the problems of handling and managing Big Data.
The authors conducted their study fairly, giving greater depth to the relevance of Big Data and modeling it with the HACE theorem. They also managed to discuss the dilemmas that may be encountered when trying to address the limitations of the HACE theorem, such as privacy issues and technological limitations.
The paper could have presented other models for Big Data so that readers might see a different approach to mining it. Nevertheless, the researchers used a sound approach to engage the audience in its domain.
Issues Listed by the Author
In relation to the HACE theorem, issues such as the heterogeneity, diversity, autonomy, decentralization, complexity, and evolution of data emerged. These issues are important to address because, as the authors emphasized, they make the mining of Big Data problematic. They were not really resolved, but the authors proposed a change in global protocols at the system, model, and data levels to minimize the negative implications of decentralization. Decentralization poses a huge problem, especially in data management, because some data are wasted and lost, which reduces the efficiency and storage capacity of other systems and databases.
Among the issues that the authors did not discuss in full detail are the repercussions of mining Big Data for society itself. Data mining does not only provide society with the benefits it promises. It also creates privacy risks and may violate certain ethical guidelines. Moreover, once patterns of behavior are analyzed, it is highly likely that many other issues will arise. These issues are equally important, especially for data providers, because the leakage of private information is unethical and puts their safety and security at risk. Among the examples of Big Data that the article provided are the multitude of tweets, the terabytes of images shared on social networking sites in a single day, and the data from radio astronomy dishes.
The researchers placed subsections in the article that make the work organized and relatively easy to understand. This allows one to take each problem of Big Data mining one at a time without losing grasp of the whole concept.
A citation analysis shows that the journal article has been cited on many web pages other than that of the journal that published it, as well as in some books.
Reading the journal article raised questions that it did not answer in full detail. To what extent will the information extracted from Big Data be accessible? Do the benefits of the extracted information and predictions from Big Data outweigh the importance of privacy? Can the analysis of Big Data allow precise prediction of matters such as calamities by analyzing data from many previous years?
IEEE Transactions on Knowledge and Data Engineering. (n.d.). Retrieved January 28, 2016, from http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?reload=true
The article is a statement from the IEEE Transactions on Knowledge and Data Engineering. The article was also about the purpose of the journal that published the journal article that was reviewed. They had stated that the purpose of the journal was to cater to data engineering and its applications on the physical and biological sciences. The article was likewise used to assess whether the journal was appropriate in terms of publishing the journal article that was reviewed.
Opening Our Eyes in the Era of Big Data. (n.d.). Retrieved January 29, 2016, from http://www.information-management.com/digital_edition/opening-our-eyes-in-the-era-of-big-data-10023762-1.html
The article is about the possible applications of Big Data in society today. Among the uses mentioned are those in information technology and entrepreneurship. The article was used to support the claim that the era of Big Data is not a new concept. The message of the journal article is not itself a new contribution but rather a reiteration of known opinion.
Shaw, J. (2014, March/April). Why “Big Data” Is a Big Deal. Harvard Magazine. Retrieved January 29, 2016, from http://harvardmagazine.com/2014/03/why-big-data-is-a-big-deal
The paper discusses the importance of data and databases. With databases in wide use, data have a huge impact on the scientific world. Today, the use of data is so widespread that its applications are not limited to technology. However, the security provided by data systems does not ensure safety and protection against potential dangers such as hacking and data dredging.
The article was used to provide outside perspective on Big Data. Likewise, the article also offers facts from experts in the field.
Wu, X., Zhu, X., Wu, G., & Ding, W. (2014). Data Mining with Big Data. IEEE Transactions on Knowledge and Data Engineering, 26(1), 97-107.
The spread of data across the applications of many industries has provided great help in functions such as networking, storage, and the collection of data. The sudden increase in the use of data supports this claim, since communication among systems makes it possible to gather more accurate data. The introduction of HACE describes the data revolution and its processes. The paper also discusses possible challenges raised by HACE with respect to its methods and analysis.
The article was used as the topic of interest in this journal article review.