How to Select the Right Features for Better Machine Learning Results

Authors:
(1) Mahdi Goldani;
(2) Soraya Asadi Tirvan.
Table of Links
Abstract and Introduction
Methodology
Dataset
Similarity methods
Feature selection methods
Measure the performance of methods
Result
Discussion
Conclusion and References
Feature selection methods
Once a database without missing values is obtained, the next step is to apply feature selection (FS) and similarity methods to choose the most relevant variables. Feature selection involves the study of algorithms aimed at reducing the dimensionality of data to enhance machine learning performance. In a dataset with N data samples and M features, feature selection aims to decrease M to M′, where M′ ≤ M. Subset selection entails evaluating a group of features together for their suitability. The general procedure for feature selection comprises four key steps: subset generation, subset evaluation, stopping criteria, and result validation. Subset generation involves a heuristic search, where each state specifies a candidate subset for evaluation within the search space. Two fundamental issues determine the nature of the subset generation process. First, successor generation determines the search's starting point, which influences its direction; various methods, such as forward, backward, compound, weighting, and random methods, may be used to decide the starting point at each state [13]. Second, the search organization drives the feature selection process with a specific strategy, such as sequential, exponential, or random search. Any newly generated subset must be evaluated against specific criteria, and numerous evaluation criteria have been proposed in the literature to assess the suitability of candidate feature subsets. These criteria can be categorized into two groups based on their dependency on mining algorithms: independent and dependent criteria [14]. Independent criteria exploit the training data's essential characteristics without employing mining algorithms to evaluate the goodness of a feature or feature set.

[Figure: experimental workflow — start → historical data (100 companies) → feature selection → reduce sample size by 1% until only 20% remains → linear regression (training) → 10-day forecast → performance evaluation → documented results → end]
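To make the four-step procedure concrete, the sketch below implements a sequential forward search with a model-dependent evaluation criterion. It is a minimal illustration, not the paper's exact configuration: the choice of linear regression as the scoring model, 5-fold cross-validated R² as the criterion, and the no-improvement stopping rule are all assumptions made here for demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features):
    """Greedy forward subset generation: start from the empty set, add the
    feature that most improves cross-validated R^2, and stop when no
    candidate improves the score or max_features is reached."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        # Subset generation: candidates are the current subset plus one feature.
        scores = [
            (cross_val_score(LinearRegression(), X[:, selected + [j]], y,
                             cv=5, scoring="r2").mean(), j)
            for j in remaining
        ]
        score, j = max(scores)   # Subset evaluation (dependent criterion)
        if score <= best_score:  # Stopping criterion: no further improvement
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected, best_score  # Result validation is then done on held-out data
```

Backward elimination follows the same skeleton with the search direction reversed: start from the full feature set and repeatedly drop the feature whose removal hurts the criterion least.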
Based on the selection strategies and/or criteria, there are three main types of feature selection techniques: wrappers, filters, and embedded methods [15]. Wrappers use a search algorithm to explore the space of possible feature subsets and evaluate each subset by running a model on it; they can be computationally expensive and risk overfitting to the chosen model. Filters follow a similar search approach, but instead of evaluating subsets against a model, they apply a simpler, model-independent measure. Embedded techniques perform selection inside, and specific to, the training of a model. The table below illustrates the most well-known methods in each category.
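The following sketch contrasts the three families using scikit-learn. It is a hedged illustration under assumptions not stated in the paper: the synthetic regression data, the choice of k = 10 features, and the particular estimators (an F-test filter, recursive feature elimination around linear regression, and L1-regularized Lasso) are stand-ins for whichever methods from each category a practitioner would actually use.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression, Lasso

# Synthetic data for illustration: 30 features, only 8 truly informative.
X, y = make_regression(n_samples=200, n_features=30, n_informative=8,
                       noise=5.0, random_state=0)

# Filter: rank features by a model-independent statistic (univariate F-test).
filter_idx = SelectKBest(f_regression, k=10).fit(X, y).get_support(indices=True)

# Wrapper: recursive feature elimination repeatedly fits the model and drops
# the weakest feature — model-aware but costlier than a filter.
wrapper_idx = RFE(LinearRegression(),
                  n_features_to_select=10).fit(X, y).get_support(indices=True)

# Embedded: L1 regularization performs selection during training itself;
# features with nonzero coefficients form the selected subset.
embedded_idx = np.flatnonzero(Lasso(alpha=1.0).fit(X, y).coef_)

print("filter:  ", filter_idx)
print("wrapper: ", wrapper_idx)
print("embedded:", embedded_idx)
```

Note the trade-off the code makes visible: the filter scores each feature once, the wrapper refits the model many times, and the embedded method pays no extra cost beyond training but ties the selection to one model family.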
This paper is available on arxiv under CC BY-SA 4.0 by Deed (Attribution-Sharealike 4.0 International) license.