Data mining is the process of extracting meaningful information from data, and there is much misinformation about it. The data itself is important, but data mining is the larger process applied to that data. Data mining need not be perfectly accurate; what matters is deriving patterns and hypotheses. In short, data mining can be described as the automated extraction of information from large data sets. Various data mining tools and algorithms enable this extraction, though some statistical expertise is required to use them. Common data mining techniques include decision trees, rule induction, and nearest-neighbor classification. Data mining is distinct from data warehousing; it can be described as an amalgamation of computing power, data collection, and statistical algorithms. Common uses of data mining include web site personalization, direct mail marketing, bioinformatics, credit card fraud detection, and market basket analysis.
Statistical learning problems fall into one of two categories: supervised or unsupervised learning. Each problem naturally belongs to one of the two methods. For example, if the spending patterns of customers are available together with a known outcome to predict, the problem comes under supervised learning. Linear regression and classification are among the approaches used in supervised learning.
In unsupervised learning, features are measured on a set of observations, but no response is measured on those observations. The goal is not to predict a response but to discover interesting patterns in the measurements. This kind of learning has growing importance in fields such as cancer research, search engines, genomics, and online shopping sites.
Supervised learning, on the other hand, is more clearly defined: it consists of building a model to estimate or predict an output based on inputs. Supervised learning has applications in fields such as medicine, business, astrophysics, and aerospace (James et al.).
(1) K-Nearest Neighbors Approach
The k-nearest neighbors algorithm (k-NN) is an approach to pattern recognition. k-NN is a lazy, instance-based method: most of the computation is postponed until classification. It is among the most basic of all machine learning algorithms and is useful in both classification and regression, where the contributions of the neighbors can be weighted. A common weighting scheme gives each neighbor a weight of 1/d, where d is its distance from the query point. The k-NN algorithm is sensitive to the local structure of the data.
An example of k-NN use is the study of diverse flowers. A botanist wants to observe the diversity of flowers on a large farm, but examining each flower one by one would take a significant amount of time. Instead, the botanist measures characteristics of a sample of flowers, such as number of petals, stamen size, color, and height, and stores the data in a computer system. The botanist then uses this pre-classified data to predict the variety of each new flower from its characteristics.
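The botanist's workflow can be sketched as a minimal k-NN classifier. The flower measurements and variety names below are hypothetical, and the distance metric is plain Euclidean distance:

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; features are numeric tuples.
    """
    # Sort training points by Euclidean distance to the query point.
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical pre-classified flowers: (petal count, height in cm) -> variety.
flowers = [
    ((5, 10), "daisy"), ((6, 12), "daisy"), ((5, 11), "daisy"),
    ((30, 40), "rose"), ((28, 45), "rose"), ((32, 42), "rose"),
]
print(knn_classify(flowers, (6, 13)))   # the three nearest points are daisies
```

The 1/d weighting mentioned above could replace the plain vote by summing 1/d per label instead of counting neighbors.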
(2) Classification Tree Approach
Classification trees are used to predict responses from data. A decision tree is evaluated from the root node down to a leaf node. In classification trees the responses are discrete values such as 'true' or 'false'. A classification tree consists of edges and nodes, and there are two types of node: intermediate nodes, each holding a single attribute and a predicate, and leaf nodes, each containing a prediction value. Attributes near the top of the tree rank higher in terms of classification power.
An example classification tree can be built from a database of car drivers. The attributes are driver age, driver location, and car type, and the prediction attribute is 'lives in suburb'. The classification tree provides a mapping from the input attributes to the prediction attribute.
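Such a tree amounts to nested attribute tests ending in prediction leaves. The sketch below hand-codes one possible tree over the driver example; the split rules and attribute values are invented for illustration, not learned from real data:

```python
def lives_in_suburb(driver):
    """A hand-built classification tree over hypothetical driver records.

    Each intermediate node tests one attribute; each leaf holds a prediction.
    """
    # Root (highest-ranked attribute in this sketch): car type.
    if driver["car_type"] == "minivan":
        return True                           # leaf
    if driver["age"] > 40:                    # intermediate node: driver age
        return driver["location"] == "outer"  # leaf depends on location
    return False                              # leaf

print(lives_in_suburb({"car_type": "minivan", "age": 30, "location": "city"}))
```

A learned tree would choose the split attributes and thresholds automatically, but the evaluation from root to leaf proceeds exactly like these nested tests.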
(3) Regression Tree Approach
Regression trees, like classification trees, are used to predict responses from data. The regression tree approach is used when a continuous value must be predicted from a set of predictor variables, which may themselves be continuous and/or categorical.
An example use is predicting the selling prices of single-family homes. The selling price is the continuous dependent variable; continuous predictors include square footage, and categorical predictors include home type (e.g., two-story), zip code, and property location.
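A regression tree differs from a classification tree only in its leaves, which hold numeric values rather than class labels. The split points and prices below are illustrative placeholders, not fitted to real housing data:

```python
def predict_price(home):
    """A hand-built regression tree: leaves hold average selling prices ($k).

    All split points and leaf values are invented for illustration.
    """
    if home["sqft"] < 1500:              # root split on a continuous predictor
        return 180 if home["type"] == "one-story" else 210
    # Larger homes: split again on a categorical predictor (zip-code group).
    return 320 if home["zip_group"] == "suburban" else 390

print(predict_price({"sqft": 1200, "type": "one-story", "zip_group": "urban"}))
```

In a fitted tree each leaf value would typically be the mean response of the training homes that fall into that leaf.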
(4) Ensembles Approach
The ensemble approach uses a combination of different learning algorithms, and the predictive results are obtained from the combination rather than from any single participant. An ensemble is itself a supervised learning method, as it can be trained to make predictions. One example is stacking, in which a second-level learning algorithm combines the predictions of several different base algorithms. The base algorithms are all trained on the same data, and a combiner is then trained to make the final prediction using all of their predictions as additional inputs. Stacking typically gives better performance than any single trained model.
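The idea can be sketched with a deliberately simplified combiner: instead of training a full second-level model, the snippet below weights each base learner by its training accuracy and takes a weighted vote. The base classifiers and data are toy examples:

```python
def train_stack(base_learners, data):
    """Toy combiner training: weight each base learner by its accuracy on
    the training data (a minimal stand-in for stacking's second-level model)."""
    weights = []
    for learn in base_learners:
        correct = sum(1 for x, y in data if learn(x) == y)
        weights.append(correct / len(data))
    return weights

def stacked_predict(base_learners, weights, x):
    # Weighted vote over the base learners' boolean predictions.
    score = sum(w * (1 if learn(x) else -1)
                for learn, w in zip(base_learners, weights))
    return score > 0

# Three hypothetical base classifiers for "is this number large?".
learners = [lambda x: x > 5, lambda x: x % 2 == 0, lambda x: x > 3]
data = [(6, True), (2, False), (8, True), (1, False)]

weights = train_stack(learners, data)
print(stacked_predict(learners, weights, 7))   # True
```

Real stacking would train an actual model (e.g., a logistic regression) on the base predictions rather than a simple accuracy-weighted vote.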
(5) Cluster Analysis Approach
Cluster analysis is a multivariate approach that classifies a set of subjects, based on the measured variables, into groups such that similar subjects are placed in the same group. The classification methods can be hierarchical or non-hierarchical.
An example use of this approach is in psychiatry, where patients are characterized on the basis of symptoms so that the right therapy can be identified. In marketing, it is useful for identifying different groups of potential customers for targeted advertising.
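A standard non-hierarchical method is k-means (Lloyd's algorithm), sketched below on hypothetical two-dimensional symptom-severity scores. The points and starting centers are invented for illustration:

```python
import math

def kmeans(points, centers, iters=10):
    """Lloyd's algorithm sketch: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[i].append(p)
        # Recompute each center; keep the old one if its cluster is empty.
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else ctr
            for cl, ctr in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical symptom-severity scores for two groups of patients.
pts = [(1, 2), (1, 1), (2, 2), (8, 9), (9, 8), (9, 9)]
centers, clusters = kmeans(pts, centers=[(0, 0), (10, 10)])
print(clusters)   # the two tight groups of points separate cleanly
```

Hierarchical methods instead merge (or split) clusters step by step, producing a dendrogram rather than a fixed number of groups.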
(6) Dimension Reduction Approach
Dimension reduction is the approach of reducing the number of random variables under consideration. It consists of feature extraction and feature selection. Feature extraction converts the data to fewer dimensions; the conversion may be linear or nonlinear. Feature selection extracts a subset of the original variables using a filter or a wrapper approach; an example is a search guided by the accuracy of results. Principal component analysis is a widely accepted technique for dimension reduction.
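A minimal filter-style feature selection can be sketched by ranking columns on a simple criterion; here, variance (near-constant columns carry little information). The data values are hypothetical, and real filters often use more elaborate criteria such as correlation with the response:

```python
from statistics import pvariance

def select_top_features(rows, k):
    """Filter-style feature selection: keep the k columns with the highest
    variance (a simple stand-in for more elaborate filter criteria)."""
    n_cols = len(rows[0])
    variances = [pvariance([r[c] for r in rows]) for c in range(n_cols)]
    # Rank column indices by variance, highest first, then keep the top k.
    keep = sorted(range(n_cols), key=lambda c: variances[c], reverse=True)[:k]
    return sorted(keep)

data = [(1.0, 100, 5.0), (1.1, 200, 5.0), (0.9, 150, 5.0)]
print(select_top_features(data, k=2))   # column 2 is constant, so it is dropped
```

A wrapper approach would instead retrain a model on candidate subsets and keep whichever subset yields the best accuracy.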
(7) Market-Basket Analysis
Market basket analysis is a technique based on the theory that when a person buys a certain group of items, there is a higher probability of also buying another group of items. Market basket analysis is an application of affinity analysis, a data analysis and data mining technique that discovers relationships among activities performed by specific groups or individuals. In retail, affinity analysis is used to understand the purchasing behavior of customers.
Market basket analysis might reveal purchasing behaviors such as customers often buying shampoo and conditioner together. Using this information, a retailer can put both items on promotion to boost sales. Amazon uses affinity analysis for cross-selling, recommending products based on a customer's purchase history and the purchase histories of other people who bought the same item.
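The two standard measures behind such rules are support (how often an itemset appears) and confidence (how often the consequent appears given the antecedent). A sketch over hypothetical transactions:

```python
baskets = [  # hypothetical transactions
    {"shampoo", "conditioner", "soap"},
    {"shampoo", "conditioner"},
    {"bread", "milk"},
    {"shampoo", "soap"},
]

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"shampoo", "conditioner"}))        # 2 of 4 baskets -> 0.5
print(confidence({"shampoo"}, {"conditioner"}))   # 2 of the 3 shampoo baskets
```

Here the rule "shampoo ⇒ conditioner" has support 0.5 and confidence about 0.67, which is the kind of signal a retailer could use for joint promotions.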
(8) Collaborative Filtering
Collaborative filtering (CF) is the approach of filtering information using techniques that involve collaboration among multiple sources, and it is widely used by recommender systems. Collaborative filtering is typically applied to large data sets of many kinds: monitoring and sensing data from mineral exploration or environmental sensing, financial data, or user data from electronic commerce.
Recommendation algorithms are most commonly used on e-commerce web sites, where customer inputs such as purchases, items viewed, demographics, or interests are used to generate a list of recommendations.
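A minimal user-based collaborative filter can be sketched with cosine similarity between users' rating vectors. The users, items, and ratings below are hypothetical:

```python
import math

ratings = {  # hypothetical user -> {item: rating on a 1-5 scale}
    "ann": {"book": 5, "lamp": 1, "mug": 4},
    "bob": {"book": 4, "lamp": 2, "mug": 5},
    "cat": {"book": 1, "lamp": 5},
}

def cosine(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def recommend(user):
    """Suggest the item the most similar other user rated highest that
    `user` has not rated yet."""
    others = sorted((cosine(ratings[user], ratings[o]), o)
                    for o in ratings if o != user)
    best = others[-1][1]                       # most similar other user
    unseen = {i: r for i, r in ratings[best].items() if i not in ratings[user]}
    return max(unseen, key=unseen.get) if unseen else None

print(recommend("cat"))
```

Production systems use the same idea at scale, typically with item-item similarities and heavy precomputation rather than this direct user-user scan.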
Simulation and optimization are sometimes confused. Optimization is a technique for maximizing the performance of methods: it compares different sets of options and choices in a given situation in order to select the best one, with a focus on maximizing profit and minimizing cost through optimal design and minimal error. Optimization is thus concerned with enhancing the productivity of current tools and methods in terms of their output and results. Simulation, on the other hand, is more of a tool for safety. In complex models, simulation is the only practical alternative for identifying undesirable outcomes such as a stock market crash, and it is beneficial in developing policies and processes to deter such outcomes.
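The safety role of simulation can be illustrated with a small Monte Carlo sketch: rather than optimizing anything, it repeatedly plays out a toy market model to estimate how often an undesirable outcome (a large loss) occurs. The return distribution and thresholds are purely illustrative assumptions:

```python
import random

def simulate_crash_risk(trials=2_000, seed=42):
    """Monte Carlo sketch: estimate how often a toy 250-day market loses
    more than 20% of its value. The daily-return model is illustrative."""
    random.seed(seed)
    crashes = 0
    for _ in range(trials):
        value = 1.0
        for _ in range(250):                          # one trading year
            value *= 1 + random.gauss(0.0003, 0.02)   # hypothetical daily return
        if value < 0.8:                               # "crash" threshold
            crashes += 1
    return crashes / trials

print(simulate_crash_risk())   # estimated probability of a >20% annual loss
```

An optimizer, by contrast, would search over parameters (e.g., a portfolio mix) to maximize an objective, rather than measuring the frequency of bad outcomes.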
The wages of 10 staff members are given below (k = thousand):
Salary 15k, 18k, 16k, 14k, 15k, 15k, 12k, 17k, 90k, 95k
(1) Mean or average salary of the staff: (15+18+16+14+15+15+12+17+90+95)/10 = 307/10 = 30.7k
(2) An outlier is a value that lies at a large distance from the other values in the sample.
Ordered Salary Set: 12k, 14k, 15k, 15k, 15k, 16k, 17k, 18k, 90k, 95k
Median: (15+16)/2 = 15.5k
Based on the ordered set, the outliers are 90k and 95k, as they lie far beyond the range of the remaining values.
(3) In this example the calculated mean comes to 30.7k; however, most salaries lie between 12k and 18k, and only two salaries (90k and 95k) are much higher than the others. These two high salaries inflate the mean, so it does not reflect the typical salary.
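The calculations above can be checked with Python's standard library; the common 1.5 × IQR rule makes the informal "way beyond boundaries" outlier judgment precise:

```python
from statistics import mean, median, quantiles

salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]   # in thousands

print(mean(salaries))     # 30.7
print(median(salaries))   # 15.5

# Flag outliers with the common 1.5 * IQR rule.
q1, _, q3 = quantiles(salaries, n=4)
iqr = q3 - q1
outliers = [s for s in salaries
            if s < q1 - 1.5 * iqr or s > q3 + 1.5 * iqr]
print(outliers)           # [90, 95]
```

Both 90k and 95k exceed the upper fence, confirming them as outliers, while the mean of 30.7k sits well above the 15.5k median precisely because of them.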
(4) Interval scales are numeric scales that are ordered and in which the exact difference between values is known. An example of an interval scale is Celsius temperature. All levels of statistical analysis can be performed on interval-scale data, such as mean, mode, median, and standard deviation.
Ratio scales are similar to interval scales except that they have a clearly defined zero. Examples of ratio data include height and weight.
Based on the above definitions, the salary in the given example can be classified as ratio data: apart from mean, median, and outlier calculations, an absolute zero is meaningful for salary, namely the case of no salary.
Thearling, Kurt. "An introduction to data mining." Whitepaper. http://www3.shore.net/~kht/dmwhite/dmwhite.htm (1999).
James, Gareth, et al. An introduction to statistical learning. New York: Springer, 2013.