Feature Engineering with Feature Engine

Ashok kumar Palivela
5 min read · Apr 5, 2021

List of categorical encoding techniques for machine learning.


Feature engineering is one of the most important steps in any ML project or hackathon for getting good predictions. In this article, I’m going to discuss different categorical encoding techniques using the feature-engine package. Categorical encoding refers to converting string/categorical features into useful numerical features.

Feature-engine is an open-source Python library with multiple transformers to engineer features for use in machine learning models. Feature-engine’s transformers follow scikit-learn’s API, with a fit() method that learns the transformation parameters from the data and a transform() method that applies them. With feature-engine, we can select only the features we want to transform, and the transformed features are returned as a pandas DataFrame.

Categorical encoding methods

  • Rare Label Encoding
  • One Hot Encoding
  • Ordinal/Label Encoding
  • Count/Frequency Encoding
  • Mean Encoding
  • WoE Encoding
  • Decision Tree Encoding

In this article, I’m using the Titanic dataset from the OpenML repository.
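
One way to load it is with scikit-learn’s fetch_openml. This is a minimal sketch; the exact loading and clean-up steps shown here are assumptions, and your preprocessing may differ:

from sklearn.datasets import fetch_openml

# Load the Titanic dataset from OpenML as a pandas DataFrame
data = fetch_openml('titanic', version=1, as_frame=True).frame
data['survived'] = data['survived'].astype(int)  # target as 0/1 integers

# The encoders below expect string categories without missing values, so cast
# the categorical columns to str (NaN becomes the string 'nan') and keep only
# the deck letter of 'cabin', as is common for this dataset
for col in ['cabin', 'sex', 'embarked']:
    data[col] = data[col].astype(str)
data['cabin'] = data['cabin'].str[0]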


Let’s separate the data into training and testing sets.

from sklearn.model_selection import train_test_split

X = data.drop('survived', axis=1)
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

1. Rare Label Encoding

Rare label encoding groups the labels that appear in only a small number of observations into a new category, “Rare”. Grouping infrequent labels helps prevent over-fitting.

from feature_engine.encoding import RareLabelEncoder

cat_features = ['cabin', 'sex', 'embarked']

rare_encoder = RareLabelEncoder(
    tol=0.02,                # labels with frequency < tol will be grouped into "Rare"
    n_categories=4,          # minimum number of distinct categories a variable needs
    variables=cat_features)  # list of categorical features to encode

rare_encoder.fit(X_train)
train_t = rare_encoder.transform(X_train)
test_t = rare_encoder.transform(X_test)

Now that we have handled the rare categories, let’s convert the categorical features into useful numerical features.

2. One Hot Encoding

The process of replacing each categorical variable with a set of boolean variables that take the value 0 or 1, indicating whether a category is present in each observation.

from feature_engine.encoding import OneHotEncoder

ohe_encoder = OneHotEncoder(variables=cat_features)

ohe_encoder.fit(train_t)  # fit on the rare-label-encoded data
ohe_train_t = ohe_encoder.transform(train_t)
ohe_test_t = ohe_encoder.transform(test_t)

3. Ordinal/Label Encoding

The process of replacing categories by ordinal numbers (0, 1, 2, 3, etc.).

Ordinal Encoding: the numbers are ordered based on the mean of the target per category, in ascending order. For example, take a feature ‘colour’: if the mean of the target for blue, red and grey is 0.8, 0.5 and 0.1 respectively, grey is replaced by 0, red by 1 and blue by 2.

Label Encoding: the numbers are assigned to the categories arbitrarily, on a first come, first served basis.

from feature_engine.encoding import OrdinalEncoder

ordinal_encoder = OrdinalEncoder(encoding_method='ordered',
                                 variables=cat_features)

ordinal_encoder.fit(train_t, y_train)  # provide the target to fit
ord_train_t = ordinal_encoder.transform(train_t)
ord_test_t = ordinal_encoder.transform(test_t)

Note: for label encoding, use encoding_method='arbitrary'; in that case y_train is not required in the fit() method.
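
Feature-engine encoders store the learned category-to-number mappings in the encoder_dict_ attribute, which is handy for checking what 'ordered' actually learned; for instance:

# Inspect the learned mapping (category -> ordinal number) for one variable
print(ordinal_encoder.encoder_dict_['sex'])
# e.g. {'male': 0, 'female': 1}, since female passengers had the higher survival rate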


4. Count/Frequency Encoding

The process of replacing categories by the count of observations per category, or by the percentage of observations per category.
For example, in the variable colour, if 10 observations are blue, blue will be replaced by 10. Alternatively, if 10% of the observations are blue, blue will be replaced by 0.1.

from feature_engine.encoding import CountFrequencyEncoder

freq_encoder = CountFrequencyEncoder(encoding_method='frequency',
                                     variables=cat_features)

freq_encoder.fit(train_t)
frq_train_t = freq_encoder.transform(train_t)
frq_test_t = freq_encoder.transform(test_t)

Note: for count encoding, use encoding_method='count'.


5. Mean Encoding

The process of replacing the labels of a categorical feature by the mean value of the target for each label.
For example, in the variable colour, if the mean of the target for blue, red and grey is 0.5, 0.8 and 0.1 respectively, blue is replaced by 0.5, red by 0.8 and grey by 0.1.

from feature_engine.encoding import MeanEncoder

mean_encoder = MeanEncoder(variables=cat_features)

mean_encoder.fit(train_t, y_train)  # provide the target variable
mean_train_t = mean_encoder.transform(train_t)
mean_test_t = mean_encoder.transform(test_t)

6. WoE Encoding

The process of replacing categories by the weight of evidence (WoE). The weight of evidence is given by:

WoE = ln( P(X = xj | Y = 1) / P(X = xj | Y = 0) )

For example, in the variable colour, if the proportion of positive observations (target = 1) that are blue is 0.8, and the proportion of negative observations (target = 0) that are blue is 0.2, then blue will be replaced by np.log(0.8/0.2) = 1.386.

Note: this encoding is designed exclusively for binary classification.

from feature_engine.encoding import WoEEncoder

woe_encoder = WoEEncoder(variables=cat_features)

woe_encoder.fit(train_t, y_train)  # provide the target to find the WoE values
woe_train_t = woe_encoder.transform(train_t)
woe_test_t = woe_encoder.transform(test_t)
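
To see where these numbers come from, here is a minimal sketch that computes the WoE for one variable by hand (assuming the train_t and y_train objects from above):

import numpy as np

# P(category | Y = 1) and P(category | Y = 0) for the 'sex' variable
pos = train_t['sex'][y_train == 1].value_counts(normalize=True)
neg = train_t['sex'][y_train == 0].value_counts(normalize=True)

# Should match the mapping stored in woe_encoder.encoder_dict_['sex']
print(np.log(pos / neg))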

7. Decision Tree Encoding

This method encodes the categorical variables with the predictions of a decision tree model.

The categorical variable is first encoded into integers (label/ordinal encoding). A decision tree is then fit on the resulting numerical variable to predict the target variable. Finally, the original category values are replaced by the predictions of the decision tree.

See the reference at the end of this article to learn more about this technique.

from feature_engine.encoding import DecisionTreeEncoder

dt_encoder = DecisionTreeEncoder(
    variables=cat_features,
    encoding_method='arbitrary',  # label-encode the categories first
    cv=3,                         # 3-fold cross-validation to train the tree
    scoring='accuracy',
    param_grid={'max_depth': [1, 2, 3, 4]},  # grid search parameters
    regression=False)             # classification

dt_encoder.fit(train_t, y_train)
dt_train_t = dt_encoder.transform(train_t)
dt_test_t = dt_encoder.transform(test_t)

Conclusion

In this article, we discussed several categorical encoding techniques, with examples and implementations in Python.

Try different methods and find out the best encoding method for your ML problem.
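
For instance, one quick way to compare encodings is to cross-validate a simple model on each encoded training set. Here is a minimal sketch (assuming the encoded datasets from above, and using only the encoded categorical columns):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Compare three of the encodings with 5-fold cross-validation
encodings = {'ordinal': ord_train_t, 'mean': mean_train_t, 'woe': woe_train_t}
for name, encoded in encodings.items():
    scores = cross_val_score(LogisticRegression(max_iter=1000),
                             encoded[cat_features], y_train, cv=5)
    print(f'{name}: {scores.mean():.3f} (+/- {scores.std():.3f})')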

Feature-engine project: https://github.com/feature-engine/feature_engine

References: Niculescu-Mizil, A., et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection.” JMLR: Workshop and Conference Proceedings 7: 23–34, KDD Cup 2009. http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf

Thanks for reading!
