.. _feature_shuffling:

.. currentmodule:: feature_engine.selection

SelectByShuffling
=================

:class:`SelectByShuffling()` selects features whose random value permutation reduces
model performance. If a feature is predictive, shuffling its values across rows will
result in predictions that deviate significantly from the actual outcomes. Conversely,
if the feature is not predictive, altering the order of its values will have little to
no impact on the model's predictions.

Procedure
---------

The algorithm operates as follows (a minimal sketch of the procedure is shown after
the list):

1. Train a machine learning model using all available features.
2. Establish a baseline performance metric for the model.
3. Shuffle the values of a single feature while keeping all other features unchanged.
4. Use the model from step 1 to generate predictions with the shuffled feature.
5. Measure the model's performance based on these new predictions.
6. If the performance drops beyond a predefined threshold, retain the feature.
7. Repeat steps 3-6 for each feature until all have been evaluated.
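To make these steps concrete, here is a minimal sketch of the shuffling procedure
written with scikit-learn alone. It illustrates the idea, but it is not
Feature-engine's implementation: for brevity it scores on the training data, whereas
:class:`SelectByShuffling()` evaluates the performance drift with cross-validation.

.. code:: python

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    X, y = load_diabetes(return_X_y=True, as_frame=True)

    # Steps 1-2: train a model on all features and record the baseline performance.
    model = LinearRegression().fit(X, y)
    baseline = r2_score(y, model.predict(X))

    rng = np.random.default_rng(0)
    drifts = {}
    for feature in X.columns:
        # Step 3: shuffle a single feature, leaving all the others unchanged.
        X_shuffled = X.copy()
        X_shuffled[feature] = rng.permutation(X_shuffled[feature].to_numpy())
        # Steps 4-5: predict with the shuffled feature and measure the performance drop.
        drifts[feature] = baseline - r2_score(y, model.predict(X_shuffled))

    # Step 6: retain features whose shuffling degraded performance more than the
    # threshold; here, the mean drift across all features, mirroring the default
    # behaviour when `threshold` is left as None.
    threshold = np.mean(list(drifts.values()))
    selected = [f for f, drift in drifts.items() if drift > threshold]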
Python Example
--------------

Let's see how to use :class:`SelectByShuffling()` with the diabetes dataset that comes
with Scikit-learn. First, we load the data:

.. code:: python

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from feature_engine.selection import SelectByShuffling

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    print(X.head())

In the following output, we see the diabetes dataset:

.. code:: python

            age       sex       bmi        bp        s1        s2        s3  \
    0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401
    1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412
    2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356
    3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038
    4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142

             s4        s5        s6
    0 -0.002592  0.019907 -0.017646
    1 -0.039493 -0.068332 -0.092204
    2 -0.002592  0.002861 -0.025930
    3  0.034309  0.022688 -0.009362
    4 -0.002592 -0.031988 -0.046641

Now, we set up a machine learning model. We'll use a linear regression:

.. code:: python

    linear_model = LinearRegression()

Next, we set up :class:`SelectByShuffling()` to select features by shuffling. We'll
examine the change in the `r2` using 3-fold cross-validation. The parameter `threshold`
is left as None, which means that features will be selected if their performance drop
is bigger than the mean drop caused by all features.

.. code:: python

    tr = SelectByShuffling(
        estimator=linear_model,
        scoring="r2",
        cv=3,
        random_state=0,
    )

The `fit()` method identifies the important variables, that is, those whose value
permutations lead to a decline in model performance. The `transform()` method then
removes the non-important variables from the dataset.

.. code:: python

    Xt = tr.fit_transform(X, y)

:class:`SelectByShuffling()` stores the performance of the model trained with all the
features in its `initial_model_performance_` attribute:

.. code:: python

    tr.initial_model_performance_

In the following output, we see the r2 of the linear regression trained and evaluated
on the entire dataset, without shuffling, using cross-validation:

.. code:: python

    0.488702767247119

In the following sections, we'll explore some of the additional useful data stored by
:class:`SelectByShuffling()`.

Evaluating feature importance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:class:`SelectByShuffling()` stores the change in the model performance caused by
shuffling every feature:

.. code:: python

    tr.performance_drifts_

In the following output, we see the change in the linear regression r2 after shuffling
each feature:

.. code:: python

    {'age': -0.0054698043007869734,
     'sex': 0.03325633986510784,
     'bmi': 0.184158237207512,
     'bp': 0.10089894421748086,
     's1': 0.49324432634948095,
     's2': 0.21163252880660438,
     's3': 0.02006839198785859,
     's4': 0.011098050006761673,
     's5': 0.4828781996541602,
     's6': 0.003963360084439538}

:class:`SelectByShuffling()` also stores the standard deviation of the performance
change:

.. code:: python

    tr.performance_drifts_std_

In the following output, we see the variability of the change in r2 after feature
shuffling:

.. code:: python

    {'age': 0.012788500580799392,
     'sex': 0.040792331972680645,
     'bmi': 0.042212436355346106,
     'bp': 0.05397012536801143,
     's1': 0.35198797776358015,
     's2': 0.167636042355086,
     's3': 0.03455158514716544,
     's4': 0.007755675852874145,
     's5': 0.1449579162698361,
     's6': 0.011193022434166025}

We can plot the performance change together with the standard deviation to get a better
idea of how shuffling each feature affects the model's performance:

.. code:: python

    r = pd.concat([
        pd.Series(tr.performance_drifts_),
        pd.Series(tr.performance_drifts_std_),
    ], axis=1)
    r.columns = ['mean', 'std']

    r['mean'].plot.bar(yerr=[r['std'], r['std']], subplots=True)

    plt.title("Performance drift elicited by shuffling a feature")
    plt.ylabel('Mean performance drift')
    plt.xlabel('Features')
    plt.show()

In the following image, we see the change in performance resulting from shuffling each
feature:

.. figure:: ../../images/shuffle-features-std.png

With this setup, features that elicited a mean performance drop smaller than the mean
performance drift across all features will be removed. If this threshold turns out to
be too conservative or too permissive, the bar plot above can give you a better idea
of how each feature affects the model's predictions, helping you select a different
threshold.

Checking out the eliminated features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:class:`SelectByShuffling()` stores the features that will be dropped based on the
threshold:

.. code:: python

    tr.features_to_drop_

The following features were deemed non-important, because their performance drift was
smaller than the mean performance drift of all features:

.. code:: python

    ['age', 'sex', 'bp', 's3', 's4', 's6']

If we now print the transformed data, we see that the features above were removed:

.. code:: python

    print(Xt.head())

In the following output, we see the dataframe with the selected features:

.. code:: python

            bmi        s1        s2        s5
    0  0.061696 -0.044223 -0.034821  0.019907
    1 -0.051474 -0.008449 -0.019163 -0.068332
    2  0.044451 -0.045599 -0.034194  0.002861
    3 -0.011595  0.012191  0.024991  0.022688
    4 -0.036385  0.003935  0.015596 -0.031988

Additional resources
--------------------

For more details about this and other feature selection methods check out these
resources:

.. figure:: ../../images/fsml.png
   :width: 300
   :figclass: align-center
   :align: left
   :target: https://www.trainindata.com/p/feature-selection-for-machine-learning

   Feature Selection for Machine Learning

|
|
|
|
|
|
|
|
|
|

Or read our book:

.. figure:: ../../images/fsmlbook.png
   :width: 200
   :figclass: align-center
   :align: left
   :target: https://leanpub.com/feature-selection-in-machine-learning

   Feature Selection in Machine Learning

|
|
|
|
|
|
|
|
|
|
|
|
|
|

Both our book and course are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.