Big data is a term describing datasets so large and complex that traditional data processing applications are inadequate. The size that qualifies as "big" varies with time. At such colossal scales, tasks such as analysis and storage become cumbersome and slow. This proposal seeks to establish a better way of managing data by setting up an architecture in which data is stored and distributed across multiple servers.
Datasets are ever growing in size, a trend driven by the increasing number of cheap ways of gathering information. Mobile devices, cameras, microphones, radio-frequency identification (RFID) readers, and aerials are examples of capture devices that have increased the volume of data traffic. As of 2012, an estimated 2.5 exabytes of data were created each day. With this exponential growth in data generation, challenges have arisen in storing, analyzing, and managing data at a bearable speed. To address the problem of speed, this proposal puts forward a better way of storing and accessing data that will increase speed and efficiency.
This project aims to employ a multiple-layer architecture when dealing with big data. According to recent research, distributed parallel processing plays a significant role in ensuring an adequate supply of processing units. In this type of architecture, data is inserted into parallel database management systems. Queries are split and distributed across parallel nodes, processed in parallel, and the results are gathered and delivered as a whole. The architecture makes the processing power transparent to the end user through a front-end application server.
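The split-process-gather pattern described above can be sketched in Python. The partitions, the sample records, the 100-unit filter threshold, and all function names here are illustrative assumptions, not part of the proposed system; threads stand in for the parallel nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data partitions, one per parallel node.
PARTITIONS = [
    [{"id": 1, "amount": 250}, {"id": 2, "amount": 40}],
    [{"id": 3, "amount": 900}, {"id": 4, "amount": 15}],
    [{"id": 5, "amount": 430}],
]

def run_subquery(partition):
    """Each node answers the same query over its own slice of the data."""
    return [row for row in partition if row["amount"] > 100]

def parallel_query(partitions):
    """Split the query, run it on every partition in parallel, gather results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_results = pool.map(run_subquery, partitions)
    # The front-end layer merges the partial answers into a single result set.
    return [row for part in partial_results for row in part]
```

Calling `parallel_query(PARTITIONS)` returns the matching rows from every partition as one result, which is what makes the distribution transparent to the end user.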
3. Literature review.
In 2000, Seisint Inc. developed a C++-based distributed file-sharing framework for data storage and querying. This framework acted as the basis for the later development of large-scale data management, and its approach relates strongly to the present project.
Seisint was acquired by LexisNexis in 2004 and was later merged with ChoicePoint, Inc. to form HPCC Systems, which provides a high-speed parallel processing platform for analyzing exabytes of data.
Google also published a paper in 2004 on a process called MapReduce, which uses a parallel processing architecture and presents the combined outcome as a single result. This project seeks to work on the same principles as these predecessors in big data management.
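The MapReduce idea referenced above can be illustrated with the classic word-count example. This is a minimal single-machine sketch of the map, shuffle, and reduce phases, not Google's implementation; the function names and sample input are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit (word, 1) pairs from one input split."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all intermediate values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine every value for one key into a single result."""
    return key, sum(values)

def map_reduce(documents):
    """Run the three phases in sequence and present one combined outcome."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real deployment each map and reduce call would run on a different node; the shuffle phase is what gathers the scattered intermediate results so they can be delivered as a whole.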
4. Research questions and objectives.
This proposed project aims to increase processing and analysis speeds. It will investigate whether dividing processing tasks among parallel processors reduces processing and analysis time. The project will also explore partitioned storage as an alternative way of storing gigantic amounts of data without losing its integrity.
Unique reference keys are added to each data type that is input. The data is then categorized into different groups and sent to parallel processors. The parallel architecture distributes data across several processing units, providing faster processing and analysis. This structure uses the MapReduce framework and a big data lake. A data lake allows an organization to shift focus from centralized control to a shared model, and it enables quick segregation of data, reducing overhead time.
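The keying and categorization step above can be sketched as follows. The key format, the processor count, and the hash-based routing are assumptions chosen for illustration; the proposal does not specify them.

```python
import hashlib

NUM_PROCESSORS = 4  # assumed number of parallel processing units

def reference_key(record_type, serial):
    """Build a unique reference key; the prefix identifies the data type."""
    return f"{record_type}-{serial:06d}"

def assign_partition(key, num_partitions=NUM_PROCESSORS):
    """Route a key to a processor with a stable hash, so the same key
    always lands on the same unit."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

def distribute(records):
    """Key each incoming (type, payload) record and group records by
    the processor they are routed to."""
    partitions = {i: [] for i in range(NUM_PROCESSORS)}
    for serial, (rtype, payload) in enumerate(records):
        key = reference_key(rtype, serial)
        partitions[assign_partition(key)].append((key, payload))
    return partitions
```

Because the routing is deterministic, a record can later be located by recomputing the hash of its reference key rather than searching every processor, which is what reduces lookup overhead.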
a. Research strategy.
Research strategies will include interviews, online reading, and questionnaires.
Participants in this research will include technical staff, software engineers, and computer specialists at various companies. These professionals will be questioned to acquire data for the project.
A team including a manager, four computer engineers, and two software engineers will be required.
Referencing – unique attribute keys are added to each type of data; a particular key represents an individual type of data.
Massively parallel processing (MPP) – relational MPP databases can store and manage huge amounts of data, and can load, monitor, back up, and optimize the use of large data tables.
Direct-attached storage (DAS) – this method of storage is preferred for its high capacity and speed.
d. Ethical considerations.
When dealing with an organization's data, privacy is a highly sensitive and controversial issue. The data handled will include information on age, income, and bank accounts. All data will be encrypted as it is distributed to the various processors, and firewalls and passwords will be used to protect it.
5. Expected results.
Real-time, fast processing is expected. Dividing processing tasks so that they run at the same time reduces the processing load, thereby increasing speed, so analysis of big data can be done in a shorter period. The cost of processing, however, will also increase, owing to the extra technological expense of MPP databases and DAS. Because the benefits of distributed processing outweigh this cost, the project is classified as realistic.
The primary data source for this project was an analysis of data management methods in large businesses and banks. Airports provide an optimal sample of big data management: the amount of data an airport handles in a day is exceedingly large, with flight timetables, departures, arrivals, bookings, employee records, and security information all handled in real time.
Secondary sources include questionnaires given to employees in companies' IT departments. Accessing airport information was not simple, as many procedures must first be followed. The credibility of questionnaire answers is also not completely error-free, as it depends on human recollection.
Boja, C. and Pocovnicu, A. (2012). "Distributed Parallel Architecture for Big Data."
De Mauro, A. (2015). "What is big data? A consensual definition and a review of key research topics." AIP Conference Proceedings.
Hellerstein, J. (2008). "Parallel Programming in the Age of Big Data." Gigaom blog.
Magoulas, R. (2009). "Introduction to Big Data."
An inquiry to determine the challenges of, and solutions to, big data management, storage, and analysis.
I. General information
I am a worker at Big Data Solutions collecting information on how big data affects your life and work. Big data refers to information of such velocity, volume, and variety that it requires new ways of processing. The information gathered will be handled with strict confidentiality and will be used to carry out a project that offers a safe and fast way of storing, analyzing, and managing large volumes of data.
II. Problem addressed
Scientists, business executives, and governments, among others, regularly encounter problems when dealing with large datasets. Datasets grow because of increasingly cheap gathering methods, such as mobile phones. With this enormous growth in the size of the data to be handled, analysis and storage become increasingly difficult.
III. Socio-economic profile (Answer the questions below)
4. What are your responsibilities at work?
IV. Data management
1. In your work, what kind of information do you handle?
2. What is the average size (in GB) of data processed in a month?
3. What problems do you experience when processing data using your computer?
4. Is your work computer connected to a larger office network?
5. Do you consider the speed of analysis and information sharing in your workplace slow, medium, or fast?
6. Prior to this, what was the scope of your knowledge of big data?
7. What aspect of managing enormous datasets do you find most taxing?
8. How often do you experience technical problems with data analysis in the workplace?
1. What part of data management, either in the workplace or at home, would you most like to see improved? Why?