## Literature Review

## Introduction

Machine learning techniques are based on an explicit or implicit model that enables the analyzed patterns to be categorized. A distinctive feature of these schemes is the need for labeled data to train the behavioral model, a process that demands considerable resources. Many machine learning-based schemes have been applied to network intrusion detection systems (NIDS). Some of the most important are Bayesian networks, Markov models, neural networks, fuzzy logic techniques, genetic algorithms, and clustering and outlier detection (Carbonneau et al., 2008, pp. 1140-54).

Automatic text classification is based on building and using supervised learning models. Creating an automatic text classifier involves identifying variables that are useful in discriminating between texts belonging to different pre-existing classes. The main contributions to the subject are automatic classification strategies based on different categorization algorithms. In order to analyze the advantages and disadvantages of the categorization algorithms used for this task, we mention a few that have been extensively tested by scholars in different contexts, such as Naive Bayes classifiers, support vector machines (SVM) and decision trees. The research also found some works on hybrid strategies that combine, for example, Naive Bayes and decision trees, support vector machines and decision trees, cross-domain text classification and sentiment classification, sentiment classification and Naive Bayes, or a fusion of various methods combined through a decision strategy. Bayesian classifiers are statistical classifiers that can predict class membership probabilities, such as the probability that a given sample belongs to a particular class. Such probabilistic classifiers based on Bayes' theorem have shown high accuracy and speed when applied to large textual databases (Li & Zhou, 2007, pp. 1088-98).

Support Vector Machine (SVM) is a classification and regression method from statistical learning theory. The basis of the SVM methodology can be summarized as follows: if it is required to classify a data set (represented in an n-dimensional space) that is not linearly separable, the data set is mapped to a space of higher dimension where a linear separation is possible (this is done by functions called kernels). In this new space, a hyperplane capable of separating the input data into two classes is sought; the hyperplane should have the greatest possible distance to the points of both classes (the points closest to the separating hyperplane are the support vectors). If regression is required, the data set is transformed into a space of higher dimension (in which a linear regression can be performed), and in this new space the linear regression is carried out without penalizing small errors (Fan et al., 2008, pp. 1871-74).
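To illustrate the mapping idea, the following minimal Python sketch uses a hypothetical one-dimensional data set that no single threshold can separate. The function names, the map phi(x) = (x, x^2), the matching kernel and the hand-picked hyperplane are all assumptions made for illustration; no SVM solver is run, and nothing here is taken from the cited works.

```python
# Toy 1-D data set: the inner points belong to class -1 and the outer ones
# to class +1, so no single threshold on x separates the two classes.
points = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]


def feature_map(x):
    """Hypothetical explicit map into two dimensions: phi(x) = (x, x^2)."""
    return (x, x * x)


def kernel(x, z):
    """Kernel equal to the dot product phi(x) . phi(z) for the map above."""
    return x * z + (x * z) ** 2


def score(w, b, x):
    """Signed value of the hyperplane w . phi(x) + b at the mapped point."""
    phi = feature_map(x)
    return w[0] * phi[0] + w[1] * phi[1] + b


def separates(w, b, samples, targets):
    """True when the sign of the score matches the label of every sample."""
    return all(score(w, b, x) * y > 0 for x, y in zip(samples, targets))


def margin(w, b, samples):
    """Smallest distance from a mapped point to the hyperplane."""
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    return min(abs(score(w, b, x)) / norm for x in samples)


# Hand-picked hyperplane x^2 = 2.5 in the mapped space (assumed, not fitted).
w, b = (0.0, 1.0), -2.5
is_separated = separates(w, b, points, labels)
smallest_distance = margin(w, b, points)
```

Under these assumptions `is_separated` is `True` and `smallest_distance` is 1.5; the points x = ±1 and x = ±2 lie closest to the hyperplane and would play the role of support vectors.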

Tsai et al. (2009, pp. 11994-12000) highlighted that the way in which the problem of finding the hyperplane that maximizes the margin to the points of both classes (the set of input data) is solved defines whether the SVM is traditional or a least squares SVM (LS-SVM). The traditional SVM solves this problem according to the structural risk minimization principle, while the LS-SVM solves it through a set of linear equations. While in the traditional SVM many support values are zero (the nonzero values correspond to the support vectors), in the LS-SVM the support values are proportional to the errors.

Its detailed definition is presented in the next section. The constants that appear in this formulation depend on the particular problem, and there are no methods for estimating them, so their values must be fixed heuristically by the expert; the references reviewed offer no suggestions on this point. This aspect greatly complicates the process of developing a prediction model for a time series using SVM.

## Naive Bayes

Nguyen and Armitage (2008, pp. 56-76) argued that one of the first text categorizers was presented by Maron and Kuhns (1960). The categorizer is based on the Naive Bayes model, which requires the assumption of statistical independence in the co-occurrence of terms in a document. Each document is represented by the set of terms that constitute its text. Each term is represented by a node in a directed graph, with an additional node representing the categories. From each node representing a term, an arc extends to the node representing the categories. Each arc is labeled with a coefficient whose value represents the probability of observing that term in the document, assuming the document belongs to a particular category. The document is assigned to the category that obtains the highest probability of membership.

McCallum and Nigam (1998) tested two formulations of the Naive Bayes model. The first, the Bernoulli model, weights the probability of terms with a factor of 1 for those that occur in the text of a document and 0 for those that do not. The second formulation, known as the multinomial model, weights the probability of terms according to the frequency they have in each document (Chu et al., 2007, p. 281). The accuracy of these models was studied on different evaluation collections with varying results, the multinomial model generally achieving superior performance. Among the factors that explain the results is the length of each document, which was studied by Bennett (2000), who attributed to this factor the tendency of the Bayesian probability estimates to be close to 0 or 1.
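The multinomial formulation can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the tiny corpus and the add-one (Laplace) smoothing constant `alpha` are assumptions introduced here, not details taken from McCallum and Nigam.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(documents, alpha=1.0):
    """documents: list of (terms, category); alpha is an assumed smoothing constant."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for terms, category in documents:
        class_counts[category] += 1
        term_counts[category].update(terms)  # keeps term frequencies, not just presence
        vocabulary.update(terms)
    return class_counts, term_counts, vocabulary, alpha


def log_posteriors(model, terms):
    """Unnormalized log-probability of each category given the terms."""
    class_counts, term_counts, vocabulary, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for category, n_docs in class_counts.items():
        total_terms = sum(term_counts[category].values())
        score = math.log(n_docs / total_docs)  # class prior
        for term in terms:  # every occurrence contributes (independence assumption)
            if term not in vocabulary:
                continue  # terms never seen in training are ignored
            count = term_counts[category][term]
            score += math.log((count + alpha) / (total_terms + alpha * len(vocabulary)))
        scores[category] = score
    return scores


def classify(model, terms):
    """Assign the category with the highest posterior score."""
    scores = log_posteriors(model, terms)
    return max(scores, key=scores.get)


# Hypothetical training corpus.
corpus = [
    (["cheap", "pills", "cheap"], "spam"),
    (["win", "money", "now"], "spam"),
    (["meeting", "schedule", "monday"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
model = train_multinomial_nb(corpus)
first = classify(model, ["cheap", "money"])
second = classify(model, ["meeting", "notes"])
```

With this corpus `first` is `"spam"` and `second` is `"ham"`. A Bernoulli variant would replace the term frequencies with the presence or absence of each vocabulary term in the document.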

## Neural Networks

Neural networks, which belong to the family of connectionist models, are formed by a set of simple computing elements called artificial neurons. These neurons are interconnected through connections with associated weights, which represent the knowledge in the network. Each neuron computes the sum of its inputs, weighted by the weights of the connections, subtracts a threshold value and applies a non-linear function (e.g. a sigmoid); the result is the input to the neurons in the following layer (in networks such as the Multilayer Perceptron).
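The computation performed by each neuron can be written out directly. The following is a minimal sketch; the function names, weights and thresholds are arbitrary values chosen for illustration.

```python
import math


def sigmoid(z):
    """Non-linear activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, threshold):
    """Weighted sum of the inputs, minus the threshold, passed through the sigmoid."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(weighted_sum - threshold)


def layer_output(inputs, weight_rows, thresholds):
    """Outputs of one layer; they become the inputs of the following layer."""
    return [neuron_output(inputs, w, t) for w, t in zip(weight_rows, thresholds)]


# Two hidden neurons feeding one output neuron (illustrative values only).
hidden = layer_output([1.0, 0.0], [[0.5, -0.5], [1.0, 1.0]], [0.5, 0.0])
output = neuron_output(hidden, [1.0, -1.0], 0.0)
```

Here the first hidden neuron receives a net input of zero and therefore outputs 0.5, and the final `output` is a value between 0 and 1, as the sigmoid guarantees.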

One of the most widely used algorithms for training neural networks is back-propagation, which uses an iterative method to propagate the error terms (the difference between the values obtained and the desired values) needed to modify the weights of the interneuronal connections. The back-propagation method can be considered a non-linear regression that applies gradient descent in the space of parameters (weights) to find local minima of the error function (Collobert & Weston, 2008, pp. 160-67). Neural networks have been used successfully in different types of problems:

Auto-association: the network generates an internal representation of the examples provided, and responds with the closest approximation to its "memory". Example: Boltzmann machine.

Pattern classification: the network is able to classify each input into a predefined set of classes. Example: back-propagation.

Detection of regularities: the network adapts to the input examples, extracting from them several features with which to classify; in this case, the set of classes is not defined in advance, so learning is unsupervised. Examples: MAXNET, ART1, Kohonen maps, Oja's network, etc. (Feng et al., 2009, pp. 1352-57).
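Returning to the training procedure described before the list of problem types, the gradient descent step that back-propagation relies on can be sketched for the simplest possible case: a single sigmoid neuron. With no hidden layer there is nothing to propagate backwards, so the sketch shows only the weight update; the squared-error function, learning rate, number of epochs and the logical OR data set are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Output of a single sigmoid neuron."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(samples, learning_rate=0.5, epochs=5000):
    """Gradient descent on the error 0.5 * (output - target)^2."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = predict(weights, bias, inputs)
            # Error term: derivative of the error with respect to the net input.
            delta = (output - target) * output * (1.0 - output)
            # Move each weight against the gradient of the error.
            weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * delta
    return weights, bias


# Logical OR: a linearly separable problem that one neuron can learn.
or_samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
weights, bias = train_neuron(or_samples)
predictions = [round(predict(weights, bias, inputs)) for inputs, _ in or_samples]
```

In a multilayer network the same kind of error term is computed for the output layer and then passed back, layer by layer, through the connection weights.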

## Random Forests and Decision Trees

A decision tree is a great help for choosing between several courses of action. It provides a highly effective structure within which we can lay out the options and investigate the possible consequences of selecting each one. It also helps build a balanced picture of the risks and rewards associated with each possible course of action. All decision trees require the following four components:

Decision alternatives at each decision point.

Events that may occur as a result of each decision alternative.

Likelihood of possible events as a result of decisions.

Results of the possible interactions between decision alternatives and events (Chan & Paelinckx, 2008, pp. 2999-3011).

A decision tree is a graphical and analytical way of representing all the events that may arise from a decision made at some point. It assists in making the "right" decision, from a probabilistic point of view, among a range of possible choices. These trees make it possible to examine the results and determine visually how the model flows (Strobl et al., 2009, p. 323). The visual results can help to find specific subgroups and relationships that might not be found with more traditional statistics.
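The four components listed above can be combined in a short calculation. The sketch below assumes that the "right" decision is the alternative with the highest expected payoff, which is one common criterion and not necessarily the one intended by the cited authors; the alternatives, probabilities and payoffs are invented for illustration.

```python
def expected_payoff(events):
    """events: list of (probability, payoff) pairs for one decision alternative."""
    total_probability = sum(p for p, _ in events)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("event probabilities must sum to 1")
    return sum(p * payoff for p, payoff in events)


def best_alternative(tree):
    """Return the alternative with the highest expected payoff and all the values."""
    values = {name: expected_payoff(events) for name, events in tree.items()}
    best = max(values, key=values.get)
    return best, values


# Hypothetical decision point with two alternatives.
decision_tree = {
    "launch product": [(0.6, 100.0), (0.4, -50.0)],  # success or failure
    "do nothing": [(1.0, 0.0)],
}
choice, values = best_alternative(decision_tree)
```

With these figures the expected payoff of launching is 0.6 × 100 + 0.4 × (-50) = 40, so `choice` is `"launch product"`.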

## Major Steps for Classification

Consider a robot trying to learn how to determine whether a certain region in its field of vision is safe or not: in the first case it can move towards the region, and in the second it should avoid it. The computational learning strategy for this problem is to collect examples of safe and unsafe regions (a training set) and from these 'learn' a function that, given a new region, classifies it correctly, so that the robot can make the right decision to move or not. Each training example is represented by a series of values (called attributes or features) that depend on the particular problem. For the robot, the attributes can be calculated from the sensing devices it has, for example a video camera or a laser capable of measuring the distance to different points in the field of view (Banfield et al., 2007, pp. 173-80).

The training data are collected from real robot navigations and assigned to the two classes by an external agent, for example a human, who determines whether a particular region is safe or not. As noted above, these data are the input for training the robot to learn a classification model (also called a discriminant function or prediction function) that receives the sensing data, the vector of attributes, as input and outputs one of the two classes: safe or unsafe region.

The motivation for finding this prediction function may come from other problems. For example, the attributes of a case may be the characteristics of a potential buyer to whom advertising for a product is sent, and the class indicates whether the person finally purchases the product or not (Huang et al., 2011, pp. 107-22). The prediction model induced from the training data can be used to predict whether a new potential buyer will acquire the product or not. Another example is the detection of unwanted e-mail, commonly called spam. In this case, the attributes correspond to features of an e-mail (frequency of certain words, length of the message, sender domain, etc.) and the class is the classification as spam or non-spam. The prediction function can classify incoming e-mails so that unwanted ones are sent to a special folder.

A further example illustrates the construction of a model to classify handwritten digits. The input to the training algorithm is a set of digits represented by their pixels, together with their respective classes. The training algorithm generates a classification model (e.g. a neural network), which is evaluated on test examples (usually different from those used for training). This process can be iterated multiple times, changing the parameters of the training algorithm, to obtain a classification model with acceptable performance (Hinton et al., 2012, pp. 82-97). This model is then used to predict the class of new input examples.
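The cycle of training, evaluating on separate examples and adjusting a parameter can be sketched as follows. To keep the sketch short it swaps the neural network for a k-nearest-neighbour classifier and the digit images for hypothetical two-dimensional points, one of which is deliberately mislabeled; all names and data are assumptions made for illustration.

```python
from collections import Counter


def knn_predict(training, point, k):
    """Majority class among the k training examples closest to `point`."""
    by_distance = sorted(
        training,
        key=lambda ex: (ex[0][0] - point[0]) ** 2 + (ex[0][1] - point[1]) ** 2,
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy(training, evaluation, k):
    """Fraction of evaluation examples classified correctly."""
    hits = sum(1 for point, label in evaluation if knn_predict(training, point, k) == label)
    return hits / len(evaluation)


# Hypothetical training set; the last example is a mislabeled (noisy) point.
training_set = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.8, 1.1), "a"),
    ((3.0, 3.0), "b"), ((3.2, 2.9), "b"), ((2.9, 3.3), "b"),
    ((1.1, 1.0), "b"),
]
# Evaluation examples kept apart from the training set.
evaluation_set = [
    ((1.1, 0.9), "a"), ((3.1, 3.1), "b"), ((0.9, 1.2), "a"), ((2.8, 3.0), "b"),
]

# Iterate over the parameter k and keep the setting with the best accuracy.
accuracies = {k: accuracy(training_set, evaluation_set, k) for k in (1, 3, 5)}
best_k = max(accuracies, key=accuracies.get)
```

With this data the setting k = 1 is misled by the mislabeled point and reaches an accuracy of 0.75, whereas k = 3 classifies all four evaluation examples correctly, so `best_k` is 3.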

The main activity of the learning process is the training algorithm, which should be able to find a good classification model from the data. The strategy for finding a good prediction function is to quantify how well the function classifies the labeled training examples. This is accomplished with a loss function, which quantifies the error that the classification function produces when applied to a set of training data (Miche et al., 2010, pp. 158-62). The learning process then consists of finding a classification function that minimizes the loss function, a problem that is usually addressed using optimization methods.
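As a minimal sketch of this idea, the code below defines a zero-one loss (the fraction of misclassified training examples) and minimizes it by exhaustive search over a one-parameter family of classifiers. The threshold classifier, the distance readings and the labels are hypothetical, loosely following the robot example; practical methods use more capable optimization procedures.

```python
def zero_one_loss(predict, examples):
    """Fraction of training examples that the function misclassifies."""
    errors = sum(1 for x, label in examples if predict(x) != label)
    return errors / len(examples)


def make_threshold_classifier(threshold):
    """Hypothetical rule: a region is safe when the measured distance exceeds the threshold."""
    return lambda distance: "safe" if distance > threshold else "unsafe"


def fit_threshold(examples):
    """Pick the candidate threshold with the lowest loss on the training data."""
    values = sorted(x for x, _ in examples)
    # Candidate thresholds: midpoints between consecutive observed values.
    candidates = [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: zero_one_loss(make_threshold_classifier(t), examples))


# Hypothetical distance readings labeled by an external agent.
training = [
    (0.2, "unsafe"), (0.5, "unsafe"), (0.9, "unsafe"),
    (1.4, "safe"), (2.0, "safe"), (3.1, "safe"),
]
best_threshold = fit_threshold(training)
best_loss = zero_one_loss(make_threshold_classifier(best_threshold), training)
```

For this data the search settles on the midpoint between 0.9 and 1.4, where the loss on the training set is zero.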

## Learning Problems

The example presented above illustrates a particular type of learning problem, a classification task, which belongs to the general category of supervised learning. There are different types of learning problems, which are primarily determined by the information available for training and how it is provided to the learning algorithm. In a supervised learning problem it is assumed that some of the attributes in the training data are functionally dependent on the values of other attributes (García et al., 2009, pp. 959-77). In the case of a classification problem, the dependent variable is the class attribute, which takes a discrete value. If the attribute to be predicted is continuous, we speak of a regression problem.

In an unsupervised learning problem, no attributes are considered to be predicted from other attributes. That is, the training examples are not labeled. In this case, the learning algorithm has no supervision to let it know whether a certain prediction is correct or not. In fact, in this context it does not make much sense to speak of prediction, since there is no particular attribute to be predicted. It is nevertheless possible to find patterns in the data, and this is indeed the task of an unsupervised learning algorithm (Olden et al., 2008, pp. 171-93). There are various unsupervised learning tasks; the two main ones are clustering and density estimation.
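Clustering can be illustrated with a short sketch of the k-means procedure applied to unlabeled one-dimensional data. The data, the initial centroids and the fixed number of iterations are assumptions chosen for illustration.

```python
def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and updating centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Each centroid moves to the mean of its cluster (unchanged if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Unlabeled observations: no class attribute is given to the algorithm.
observations = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centroids, clusters = k_means(observations, centroids=[0.0, 6.0])
```

The procedure groups the observations around 1.0 and 5.0 without ever being told which group each one belongs to.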

In both supervised and unsupervised learning it is assumed that the training data are available before the learning process and that, once learned, the model is applied without modification to all new test examples (Huang et al., 2010, pp. 155-63). A more realistic scenario is one in which the agent learns continuously from the perceptions it receives from the environment and adapts its actions accordingly. This type of learning is called reinforcement learning. It aims to find a policy, a decision function used by the agent to make decisions at all times, that enables the agent to maximize its long-term profit (rewards minus penalties). The agent learns this function progressively through interaction with the environment, in which each reward is a positive reinforcement and each punishment a negative reinforcement. In the same vein, Zhang et al. (2009, pp. 3218-29) noted that reinforcements allow the agent to modify its decision function accordingly.
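One way to make the notion of a policy learned from reinforcements concrete is tabular Q-learning, a technique that the text does not name and that is introduced here only as an illustration. The corridor environment, the reward values, the learning rate and the discount factor are all hypothetical, and for simplicity the agent tries every action in every state in turn instead of exploring on its own.

```python
ACTIONS = {"left": -1, "right": +1}
GOAL = 3  # hypothetical corridor with states 0..3; reaching state 3 ends the episode


def step(state, action):
    """Environment: returns the next state and the reinforcement received."""
    next_state = min(max(state + ACTIONS[action], 0), GOAL)
    reward = 1.0 if next_state == GOAL else -0.1  # reward at the goal, small penalty elsewhere
    return next_state, reward


def q_learning(sweeps=200, learning_rate=0.5, discount=0.9):
    """Estimate the long-term value of each action in each non-terminal state."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(GOAL)}
    for _ in range(sweeps):
        for state in range(GOAL):
            for action in ACTIONS:
                next_state, reward = step(state, action)
                future = 0.0 if next_state == GOAL else max(q[next_state].values())
                target = reward + discount * future
                # Each reinforcement nudges the estimate towards the observed target.
                q[state][action] += learning_rate * (target - q[state][action])
    return q


q_table = q_learning()
# The policy picks, in every state, the action with the highest estimated value.
policy = [max(q_table[s], key=q_table[s].get) for s in range(GOAL)]
```

In this corridor the learned policy is to move right in every state, since that route collects the fewest penalties before reaching the reward.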

## Learning Methods

There is a variety of methods, reflecting the diverse areas of knowledge underlying computational learning and its various applications. Some early methods had a clear biological inspiration, for example the perceptron, proposed by Frank Rosenblatt in 1957, a model for image recognition inspired by the functioning of the brain, which later gave rise to neural networks. However, Chen et al. (2009, pp. 5432-35) highlighted that modern computational learning brings together methods from areas such as statistics, probabilistic modeling, pattern recognition and data mining, among others. Some of the most representative methods include decision trees, graphical models, support vector machines, logistic regression, nearest neighbor classification, induction of association rules, k-means clustering and hidden Markov models.

In this regard, Challagulla et al. (2008, pp. 389-400) noted that neural networks were one of the first computational learning methods proposed. However, these early models had several restrictions, as the efficient training algorithms existing at the time could only handle very simple networks (with no intermediate layers of neurons), which could only solve very simple problems. This caused them to fall into disuse for several years. In the eighties, a new training algorithm, back-propagation, was proposed, which made it possible to train multi-layer networks. This produced a rebirth of neural networks, with multiple success stories in their application to different problems ranging from computer vision to the prediction of financial time series. It likewise generated renewed interest in the scientific community, which resulted in a large number of publications, conferences and journals. Tan et al. (2009, pp. 337-49) highlighted that during the nineties new computational learning algorithms with a stronger mathematical foundation appeared, such as kernel methods, with support vector machines as their representative technique, which showed better performance in many tasks. This generated a further reduction of interest in neural networks. However, in recent years there has been a renewed interest in neural networks thanks to the emergence of so-called deep learning models, which have achieved unprecedented results in computer vision and speech recognition, among others (Ye et al., 2009, pp. 6527-35).

## Conclusion

The literature review introduced a procedure to compare various criteria for selecting features and feature extraction methods. The main objective is to select the best settings for the pre-processing of data and to achieve good performance by reducing the dimension of the feature space and selecting the state-of-the-art technique best suited to the particular situation. The same procedure was applied to five different classifiers to establish the best settings and which technique is best in a given situation. Every technique, whether SVM or neural network, has its own pros and cons, which have been highlighted comprehensively in the literature. Two methods for selecting the components were also compared: choosing the most used components and using the components in the order in which they are estimated.

However, the literature review suggests that the state of the art and mathematics should be approached from two perspectives: (a) one related to the intra-mathematical context and (b) one related to the extra-mathematical context. In applying the procedure, the best settings in the pre-processing of data were obtained for each classifier. One of the main purposes is to promote the development, implementation and evaluation of mathematical techniques at different levels of the state-of-the-art methods, making it possible to implement the wide range of theoretical principles set out repeatedly by many authors in the didactics of mathematics.


## References

Banfield, R. E., Hall, L. O., Bowyer, K. W., & Kegelmeyer, W. P. (2007). A comparison of decision tree ensemble creation techniques. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(1), 173-180.

Carbonneau, R., Laframboise, K., & Vahidov, R. (2008). Application of machine learning techniques for supply chain demand forecasting. European Journal of Operational Research, 184(3), 1140-1154.

Challagulla, V. U. B., Bastani, F. B., Yen, I. L., & Paul, R. A. (2008). Empirical assessment of machine learning based software defect prediction techniques. International Journal on Artificial Intelligence Tools, 17(02), 389-400.

Chan, J. C. W., & Paelinckx, D. (2008). Evaluation of Random Forest and Adaboost tree-based ensemble classification and spectral band selection for ecotope mapping using airborne hyperspectral imagery. Remote Sensing of Environment, 112(6), 2999-3011.

Chen, J., Huang, H., Tian, S., & Qu, Y. (2009). Feature selection for text classification with Naïve Bayes. Expert Systems with Applications, 36(3), 5432-5435.

Chu, C., Kim, S. K., Lin, Y. A., Yu, Y., Bradski, G., Ng, A. Y., & Olukotun, K. (2007). Map-reduce for machine learning on multicore. Advances in neural information processing systems, 19, 281.

Collobert, R., & Weston, J. (2008, July). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning (pp. 160-167). ACM.

Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., & Lin, C. J. (2008). LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9, 1871-1874.

Feng, G., Huang, G. B., Lin, Q., & Gay, R. (2009). Error minimized extreme learning machine with growth of hidden nodes and incremental learning. Neural Networks, IEEE Transactions on, 20(8), 1352-1357.

García, S., Fernández, A., Luengo, J., & Herrera, F. (2009). A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability. Soft Computing, 13(10), 959-977.

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., & Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6), 82-97.

Huang, G. B., Ding, X., & Zhou, H. (2010). Optimization method based extreme learning machine for classification. Neurocomputing, 74(1), 155-163.

Huang, G. B., Wang, D. H., & Lan, Y. (2011). Extreme learning machines: a survey. International Journal of Machine Learning and Cybernetics, 2(2), 107-122.

Li, M., & Zhou, Z. H. (2007). Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 37(6), 1088-1098.

Miche, Y., Sorjamaa, A., Bas, P., Simula, O., Jutten, C., & Lendasse, A. (2010). OP-ELM: optimally pruned extreme learning machine. Neural Networks, IEEE Transactions on, 21(1), 158-162.

Nguyen, T. T., & Armitage, G. (2008). A survey of techniques for internet traffic classification using machine learning. Communications Surveys & Tutorials, IEEE, 10(4), 56-76.

Olden, J. D., Lawler, J. J., & Poff, N. L. (2008). Machine learning methods without tears: a primer for ecologists. The Quarterly review of biology, 83(2), 171-193.

Strobl, C., Malley, J., & Tutz, G. (2009). An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological methods, 14(4), 323.

Tan, S., Cheng, X., Wang, Y., & Xu, H. (2009). Adapting naive bayes to domain adaptation for sentiment analysis. In Advances in Information Retrieval (pp. 337-349). Springer Berlin Heidelberg.

Tsai, C. F., Hsu, Y. F., Lin, C. Y., & Lin, W. Y. (2009). Intrusion detection by machine learning: A review. Expert Systems with Applications, 36(10), 11994-12000.

Ye, Q., Zhang, Z., & Law, R. (2009). Sentiment classification of online reviews to travel destinations by supervised machine learning approaches. Expert Systems with Applications, 36(3), 6527-6535.

Zhang, M. L., Peña, J. M., & Robles, V. (2009). Feature selection for multi-label naive Bayes classification. Information Sciences, 179(19), 3218-3229.