ML | How to deal with categorical data and the dummy variable trap

Chandima Jayamina
4 min read · Aug 21, 2024

--

What is categorical data

Categorical data represents discrete values or labels that fall into distinct categories or groups. This data is often non-numerical, though it can also include numbers used as labels (like “1” for “Yes” and “0” for “No”). It doesn’t have a mathematical meaning, and operations like addition or averaging don’t make sense for categorical data. For example, in a dataset with columns like Manufacturer, Target, and Shelf, those columns hold categorical data.

So we have 3 types of categorical (qualitative) data:

  • Nominal data : no order or rank (hometown, gender, colour, etc.)
  • Ordinal data : there is an order (customer satisfaction, education level, quality)
  • Binary data : only two values (e.g. Yes/No); we can also treat it as a special case of ordinal data
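To make the distinction concrete, here is a toy sketch (the column names and values are made up for illustration):

import pandas as pd

# Toy columns, one per type (names and values are illustrative)
data = pd.DataFrame({
    'hometown': ['Colombo', 'Kandy', 'Galle'],    # nominal: no order
    'satisfaction': ['Low', 'High', 'Medium'],    # ordinal: Low < Medium < High
    'subscribed': ['Yes', 'No', 'Yes'],           # binary: two values
})

# An ordinal column can be mapped straight to ranked numbers
data['satisfaction_rank'] = data.satisfaction.map({'Low': 0, 'Medium': 1, 'High': 2})
print(data)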

Why we need to deal with categorical data

Most machine learning models can only work with numeric variables. So we need to convert these categorical values into numeric form before feeding them to the model. This step is part of data preprocessing.

How to deal with categorical data

Method 1: Using dummy columns

In this dataset we have a nominal column called town. Let’s see how to deal with it using dummies.

# Loading data
import pandas as pd

df = pd.read_csv('./datafiles/homeprices_town.csv')
df.head()

# Get dummy columns for the categorical column
dummies = pd.get_dummies(df.town)
dummies.head()

Here, get_dummies creates one column per category, so we have numeric values we can feed to the ML model.

# Concat dummy columns to the original dataset, then drop the town column
# and one dummy column ('west windsor')
df_preprocessed_manual = pd.concat([df, dummies], axis='columns')
df_preprocessed_manual = df_preprocessed_manual.drop(['town', 'west windsor'], axis='columns')
df_preprocessed_manual.head()

Here we drop both the town column and one of the dummy columns. Why do we do that 🤔? We drop one dummy column to avoid the dummy variable trap (explained below). Now we have a dataset ready for our machine learning model.
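As a side note, pandas can drop the first dummy for you with the drop_first parameter, which makes the manual drop above unnecessary. A minimal sketch (drop_first removes the first category alphabetically rather than 'west windsor', but either choice avoids the trap):

# Equivalent shortcut: let pandas drop the first dummy column itself
dummies_reduced = pd.get_dummies(df.town, drop_first=True)
df_preprocessed = pd.concat([df.drop('town', axis='columns'), dummies_reduced], axis='columns')
df_preprocessed.head()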

Method 2: Using sklearn OneHotEncoder

# First we use a label encoder to encode the categorical column
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
dfle = df.copy()  # work on a copy so the original dataframe stays intact
dfle.town = le.fit_transform(dfle.town)
dfle.head()

Here the label encoder encodes the categorical values as 0, 1, 2, 3, 4 and so on. In this workflow we label-encode the column first and then pass it to the one-hot encoder. (Recent versions of scikit-learn’s OneHotEncoder can also encode string categories directly, so this step is optional there.)

# Separate the dataset into features and labels
x = dfle[['town', 'area']].values
y = dfle.price

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

column_transformer = ColumnTransformer(
    transformers=[
        # One-hot encode column 0 (town) and drop the first dummy column
        ('encoder', OneHotEncoder(drop='first'), [0])
    ],
    remainder='passthrough'  # leave the other columns (area) unchanged
)
x = column_transformer.fit_transform(x)
x

Like before, this creates a dummy variable for each town. Here we avoid the dummy variable trap by passing drop='first' 😈.
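And as noted above, with a recent scikit-learn you can skip the LabelEncoder step entirely and let OneHotEncoder consume the string column directly. A minimal sketch against the original df:

# Alternative sketch: one-hot encode the raw string column (no LabelEncoder)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(drop='first'), ['town'])],
    remainder='passthrough'
)
x_alt = ct.fit_transform(df[['town', 'area']])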

So now we can feed our data to an ML model, for example:

# Train a linear regression model on the encoded features
from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(x, y)
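As a hypothetical usage example (the area value here is made up; 'west windsor' is one of the towns in this dataset), a new row has to go through the same transformer before predicting:

import numpy as np

# Hypothetical prediction for a 2800 sq ft home in west windsor.
# The transformer was fitted on label-encoded towns, so encode the town first.
town_code = le.transform(['west windsor'])[0]
new_home = column_transformer.transform(np.array([[town_code, 2800]]))
print(reg.predict(new_home))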

Dummy variable trap

The dummy variable trap is a scenario in regression analysis, particularly in models involving categorical variables that have been one-hot encoded, where highly correlated (multicollinear) features are created. This situation occurs when one category can be perfectly predicted from the others, leading to redundancy in the model.

When categorical variables are one-hot encoded, they are transformed into binary (0 or 1) columns called dummy variables. For example, suppose we have a categorical variable with three categories: Red, Blue, and Green. One-hot encoding would result in three columns:
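Color    Red   Blue   Green
Red       1     0      0
Blue      0     1      0
Green     0     0      1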

The dummy variable trap occurs because these three columns are linearly dependent (they add up to 1). For instance, if you know the values of two columns, you can determine the value of the third column.

This perfect multicollinearity can cause issues in linear regression models, leading to unstable estimates of coefficients and making it difficult for the model to determine the unique effect of each variable.

To avoid the dummy variable trap, one of the dummy variables is usually dropped. This ensures that the remaining columns are independent and no longer perfectly correlated.

For example, if you drop the “Green” column:
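Color    Red   Blue
Red       1     0
Blue      0     1
Green     0     0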

Now, if both “Red” and “Blue” are 0, the observation must be “Green.” The information remains intact, and multicollinearity is avoided.
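A minimal sketch of this dependence in pandas (the colour column is illustrative):

import pandas as pd

colors = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Red']})

full = pd.get_dummies(colors.color)
print(full.sum(axis=1))   # every row sums to 1 -> perfect multicollinearity

reduced = pd.get_dummies(colors.color, drop_first=True)
print(reduced)            # 'Blue' (first alphabetically) is dropped; dependence gone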

That’s what I wanted to share with you all about categorical data in machine learning models. Let’s meet in another blog like this one. ✌️
