The Power of Filter Methods in Feature Selection for Machine Learning
Choosing the right features to include in a model is one of the most important steps a data scientist takes when building a successful machine-learning model. Feature selection is the process of identifying and selecting the most relevant and informative features from a dataset for use in a predictive model. Several kinds of feature selection methods are available, including filter methods, wrapper methods, and embedded methods.
In this blog post, we will focus on filter methods, a type of feature selection that examines the importance of each feature independently of the model being used. Filter techniques score features using statistical criteria, such as their correlation with the target variable, and rank them by relevance.
The key benefit of filter approaches is their speed and simplicity, since they do not require training a machine learning model. As a result, filter techniques are an excellent first step in feature selection: they can quickly reduce the dimensionality of the dataset and often improve model performance.
Some of the most common filter techniques are described below (a short scikit-learn sketch of each follows the list):
i. Correlation-based Feature Selection: The correlation between each feature and the target variable is calculated using this technique. The features with the strongest correlation to the target are then selected for building the model.
ii. Chi-Squared Test: For categorical features, this approach evaluates the independence between the feature and the target variable. The attributes with the greatest chi-squared statistic are chosen for building the model.
iii. Mutual Information: This approach evaluates the mutual dependence between each attribute and the target variable. The attributes with the highest mutual information scores are chosen for building the model.
iv. Variance Threshold: This strategy eliminates features that have a low variance since they are unlikely to have a major influence on the model’s performance.
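Before working through the correlation-based example below, here is a minimal, self-contained sketch of how each of these techniques maps to a scikit-learn scorer or transformer. It assumes scikit-learn is installed and uses its bundled iris dataset purely for illustration:
# Minimal sketch: each filter technique as a scikit-learn scorer or transformer.
# X is a numeric feature matrix and y a categorical target (iris, for illustration).
from sklearn.datasets import load_iris
from sklearn.feature_selection import (SelectKBest, f_regression, chi2,
                                        mutual_info_classif, VarianceThreshold)
X, y = load_iris(return_X_y=True)
# i.  Correlation-based: f_regression scores features by their linear
#     relationship with the target (an F-test on the correlation).
X_corr = SelectKBest(f_regression, k=2).fit_transform(X, y)
# ii. Chi-squared test: requires non-negative (e.g. count or one-hot) features.
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)
# iii. Mutual information: also captures non-linear dependencies.
X_mi = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)
# iv. Variance threshold: drops features whose variance falls below a cutoff.
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)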
Grasping the concept with Python
Here’s a Python code sample that demonstrates how to perform correlation-based feature selection with the f_regression scoring function:
# Importing relevant libraries.
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
- pandas is used for data manipulation and analysis.
- numpy is a numerical computing library used for handling arrays and matrices.
- SelectKBest from sklearn.feature_selection is used for selecting the top K features based on a given scoring function.
- train_test_split from sklearn.model_selection is used for splitting the dataset into training and testing sets.
- LogisticRegression from sklearn.linear_model is used to create a logistic regression model for classification.
- accuracy_score from sklearn.metrics is used to calculate the accuracy of the model's predictions.
# Creating a dummy dataset for student performance classification.
data = {'Age': [18, 19, 20, 21, 22, 23, 24, 25],
'Gender': ['Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Male'],
'StudyHours': [2, 4, 6, 8, 10, 12, 14, 16],
'Attendance': ['Low', 'Low', 'High', 'High', 'High', 'High', 'Low', 'High'],
'Pass': [0, 0, 1, 1, 1, 1, 0, 1]}
df = pd.DataFrame(data)
# Encode categorical variables
df = pd.get_dummies(df, columns=['Gender', 'Attendance'])
# Separate the features and target variable
X = df.drop(columns=['Pass'])
y = df['Pass']
# Apply the Pearson correlation coefficient for feature selection
from sklearn.feature_selection import f_regression
selector = SelectKBest(f_regression, k=2)
X_new = selector.fit_transform(X, y)
# Print the selected features
selected_features = X.columns[selector.get_support()]
print("Selected Features: ", selected_features)
The f_regression function in scikit-learn is a feature selection scoring function that computes an F-value (F-test statistic) and a corresponding p-value for each feature with respect to the target variable. The F-value reflects the strength of the feature’s linear dependence on the target, while the p-value assesses the statistical significance of that association. With the f_regression scoring function, SelectKBest picks the K features with the greatest F-values, indicating the strongest linear relationships with the target variable.
This code block selects features based on the Pearson correlation coefficient. It uses the SelectKBest function with f_regression as the scoring function, which is derived from the Pearson correlation between each feature and the target variable. It keeps the top two features with the highest scores and transforms the dataset to include only those features. The resulting dataset X_new contains only the top two features, and their names are displayed using selected_features.
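If you want to look at the statistics behind the selection, the fitted selector exposes them as attributes; the following is a small sketch that continues from the code above:
# Inspect the F-value and p-value that f_regression computed for each feature.
scores = pd.DataFrame({
    'Feature': X.columns,
    'F-value': selector.scores_,
    'p-value': selector.pvalues_
}).sort_values('F-value', ascending=False)
print(scores)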
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=42)
# Train a logistic regression model using the selected features
model = LogisticRegression()
model.fit(X_train, y_train)
Using the train_test_split function from sklearn.model_selection, this code block divides the dataset into training and testing sets. The data is split into 80% for training and 20% for testing, with the random_state parameter set to 42 for reproducible results. The LogisticRegression class from sklearn.linear_model is then used to train a logistic regression model on the training data. The model learns to predict the target variable y using only the features chosen during feature selection and stored in X_new. The fit method optimizes the model parameters on the training data.
# Make predictions on the testing set
y_pred = model.predict(X_test)
# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
The key point is that filter methods are a fast and effective way to carry out feature selection and enhance a machine learning model’s performance. To get the best results, however, it is often necessary to combine filter methods with additional techniques such as wrapper or embedded methods, as sketched below.
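As a rough illustration of that idea, the sketch below first prunes the feature space with a filter and then lets a wrapper method (recursive feature elimination, RFE) make the final choice on our toy dataset; the specific estimators and parameter values are only examples:
# Illustrative only: prune features with a filter (SelectKBest), then refine the
# selection with a wrapper method (RFE) built around a logistic regression.
from sklearn.feature_selection import RFE
# Step 1: filter method keeps the 4 features with the highest F-values.
filter_step = SelectKBest(f_regression, k=4)
X_filtered = filter_step.fit_transform(X, y)
# Step 2: wrapper method refines the choice down to 2 features by repeatedly
# fitting a logistic regression and dropping the weakest feature.
wrapper_step = RFE(LogisticRegression(), n_features_to_select=2)
wrapper_step.fit(X_filtered, y)
kept = X.columns[filter_step.get_support()][wrapper_step.get_support()]
print("Features kept after filter + wrapper:", list(kept))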
Finally, I emphasized the significance of feature selection in machine learning and highlighted three types of feature selection methods: filter, wrapper, and embedded methods. I also walked through a correlation-based filter approach, along with code snippets to implement it. While I could not cover every feature selection technique here, I will cover more methods such as the chi-squared test, mutual information, and variance threshold in upcoming posts. I hope this blog post has given you a better understanding of feature selection methods and how to implement them in Python. If you have any questions or feedback, please feel free to get in touch with me via LinkedIn.