Feature Engineering with Feature Engine

Ashok kumar Palivela
5 min read · Apr 5, 2021

List of categorical encoding techniques for machine learning.


Feature engineering is one of the most important steps in any ML project or hackathon for getting good predictions. In this article, I’m going to discuss different categorical encoding techniques using the feature-engine package. Categorical encoding refers to converting string/categorical features into useful numerical features.

Feature-engine is an open-source Python library with multiple transformers to engineer features for use in machine learning models. Feature-engine’s transformers follow scikit-learn’s API, with a fit() method that learns the transformation parameters from the data and a transform() method that applies them. With feature-engine, we can select only the features we want to transform, and the transformed features are returned as a pandas DataFrame.

Categorical encoding methods

  • Rare Label Encoding
  • One Hot Encoding
  • Ordinal/Label Encoding
  • Count/Frequency Encoding
  • Mean Encoding
  • WoE Encoding
  • Decision Tree Encoding

In this article, I’m using the Titanic dataset from the OpenML repository.
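
One way to load it is with scikit-learn’s fetch_openml. This is a minimal sketch; the exact loading and clean-up steps shown here are assumptions, and your preprocessing may differ:

from sklearn.datasets import fetch_openml

# Load the Titanic dataset from OpenML as a pandas DataFrame
data = fetch_openml('titanic', version=1, as_frame=True).frame
data['survived'] = data['survived'].astype(int)  # target as 0/1 integers

# The encoders below expect string categories without missing values, so cast
# the categorical columns to str (NaN becomes the string 'nan') and keep only
# the deck letter of 'cabin', as is common for this dataset
for col in ['cabin', 'sex', 'embarked']:
    data[col] = data[col].astype(str)
data['cabin'] = data['cabin'].str[0]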


Let’s separate the data into training and testing sets.

from sklearn.model_selection import train_test_split

X = data.drop('survived', axis=1)
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

1. Rare Label Encoding

Rare label encoding groups the labels that appear in only a small number of observations into a new category, “Rare”. Grouping infrequent labels helps prevent over-fitting.

from feature_engine.encoding import RareLabelEncoder

cat_features = ['cabin', 'sex', 'embarked']

rare_encoder = RareLabelEncoder(
    tol=0.02,                # labels with frequency < tol will be grouped into "Rare"
    n_categories=4,          # minimum number of distinct categories a variable needs
    variables=cat_features)  # list of categorical features to encode

rare_encoder.fit(X_train)
train_t = rare_encoder.transform(X_train)
test_t = rare_encoder.transform(X_test)

Now that we have handled the rare categories, let’s convert the categorical features into useful numerical features.

2. One Hot Encoding

The process of replacing each categorical variable with a set of boolean variables that take the value 0 or 1, indicating whether a category is present in each observation.

from feature_engine.encoding import OneHotEncoder

ohe_encoder = OneHotEncoder(variables=cat_features)

ohe_encoder.fit(train_t)  # fit on the rare-label-encoded data
ohe_train_t = ohe_encoder.transform(train_t)
ohe_test_t = ohe_encoder.transform(test_t)

3. Ordinal/Label Encoding

The process of replacing categories by ordinal numbers (0, 1, 2, 3, etc.).

Ordinal Encoding: the numbers are ordered based on the mean of the target per category, in ascending order. For example, take a feature ‘colour’: if the mean of the target for blue, red and grey is 0.8, 0.5 and 0.1 respectively, grey is replaced by 0, red by 1 and blue by 2.

Label Encoding: the numbers are assigned to the categories arbitrarily, on a first come, first served basis.

from feature_engine.encoding import OrdinalEncoder

ordinal_encoder = OrdinalEncoder(encoding_method='ordered',
                                 variables=cat_features)

ordinal_encoder.fit(train_t, y_train)  # provide the target to fit
ord_train_t = ordinal_encoder.transform(train_t)
ord_test_t = ordinal_encoder.transform(test_t)

Note: for label encoding, use encoding_method='arbitrary'; in that case y_train is not required in the fit() method.
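
Feature-engine encoders store the learned category-to-number mappings in the encoder_dict_ attribute, which is handy for checking what 'ordered' actually learned; for instance:

# Inspect the learned mapping (category -> ordinal number) for one variable
print(ordinal_encoder.encoder_dict_['sex'])
# e.g. {'male': 0, 'female': 1}, since female passengers had the higher survival rate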


4. Count/Frequency Encoding

The process of replacing categories by the count of observations per category, or by the percentage of observations per category.
For example, in the variable colour, if 10 observations are blue, blue will be replaced by 10. Alternatively, if 10% of the observations are blue, blue will be replaced by 0.1.

from feature_engine.encoding import CountFrequencyEncoder

freq_encoder = CountFrequencyEncoder(encoding_method='frequency',
                                     variables=cat_features)

freq_encoder.fit(train_t)
frq_train_t = freq_encoder.transform(train_t)
frq_test_t = freq_encoder.transform(test_t)

Note: for count encoding, use encoding_method='count'.


5. Mean Encoding

The process of replacing the labels of a categorical feature by the mean value of the target for each label.
For example, in the variable colour, if the mean of the target for blue, red and grey is 0.5, 0.8 and 0.1 respectively, blue is replaced by 0.5, red by 0.8 and grey by 0.1.

from feature_engine.encoding import MeanEncoder

mean_encoder = MeanEncoder(variables=cat_features)

mean_encoder.fit(train_t, y_train)  # provide the target variable
mean_train_t = mean_encoder.transform(train_t)
mean_test_t = mean_encoder.transform(test_t)

6. WoE Encoding

The process of replacing categories by the weight of evidence (WoE). The weight of evidence is given by:

WoE = ln( P(X = xj | Y = 1) / P(X = xj | Y = 0) )

For example, in the variable colour, if the proportion of positive observations (target = 1) that are blue is 0.8, and the proportion of negative observations (target = 0) that are blue is 0.2, then blue will be replaced by np.log(0.8/0.2) = 1.386.

Note: this encoding is designed exclusively for binary classification.

from feature_engine.encoding import WoEEncoder

woe_encoder = WoEEncoder(variables=cat_features)

woe_encoder.fit(train_t, y_train)  # provide the target to find the WoE values
woe_train_t = woe_encoder.transform(train_t)
woe_test_t = woe_encoder.transform(test_t)
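
To see where these numbers come from, here is a minimal sketch that computes the WoE for one variable by hand (assuming the train_t and y_train objects from above):

import numpy as np

# P(category | Y = 1) and P(category | Y = 0) for the 'sex' variable
pos = train_t['sex'][y_train == 1].value_counts(normalize=True)
neg = train_t['sex'][y_train == 0].value_counts(normalize=True)

# Should match the mapping stored in woe_encoder.encoder_dict_['sex']
print(np.log(pos / neg))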

7. Decision Tree Encoding

This method encodes the categorical variables with the predictions of a decision tree model.

The categorical variable is first encoded into integers (label/ordinal encoding). A decision tree is then fit on the resulting numerical variable to predict the target variable. Finally, the original category values are replaced by the predictions of the decision tree.

See the reference at the end of this article to learn more about this technique.

from feature_engine.encoding import DecisionTreeEncoder

dt_encoder = DecisionTreeEncoder(
    variables=cat_features,
    encoding_method='arbitrary',  # label-encode the categories first
    cv=3,                         # 3-fold cross-validation to train the tree
    scoring='accuracy',
    param_grid={'max_depth': [1, 2, 3, 4]},  # grid search parameters
    regression=False)             # classification

dt_encoder.fit(train_t, y_train)
dt_train_t = dt_encoder.transform(train_t)
dt_test_t = dt_encoder.transform(test_t)

Conclusion

In this article, we discussed several categorical encoding techniques, with examples and implementations in Python.

Try different methods and find out the best encoding method for your ML problem.
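
For instance, one quick way to compare encodings is to cross-validate a simple model on each encoded training set. Here is a minimal sketch (assuming the encoded datasets from above, and using only the encoded categorical columns):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Compare three of the encodings with 5-fold cross-validation
encodings = {'ordinal': ord_train_t, 'mean': mean_train_t, 'woe': woe_train_t}
for name, encoded in encodings.items():
    scores = cross_val_score(LogisticRegression(max_iter=1000),
                             encoded[cat_features], y_train, cv=5)
    print(f'{name}: {scores.mean():.3f} (+/- {scores.std():.3f})')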

Feature-engine project: https://github.com/feature-engine/feature_engine

References: Niculescu-Mizil, A., et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection.” JMLR: Workshop and Conference Proceedings 7: 23–34, KDD Cup 2009. http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf

Thanks for reading!
