Label Encoding
Most data frames contain continuous data, categorical data, or both. Machine learning algorithms accept only numbers, so any categorical data must first be converted to numbers. Label encoding is one technique for converting categorical data to numerical data so that machine learning algorithms can work with it.
Let’s look at an example to understand this better.
The example data frame has 4 columns: ‘Customer ID’, ‘App User’, ‘Duration Category’ and ‘App Used Duration’.
The first 2 columns, ‘Customer ID’ and ‘App User’, are PII (Personally Identifiable Information) columns. When we feed this table to a machine learning algorithm, we discard these 2 columns, so we don’t need to worry about them.
The last 2 columns, ‘Duration Category’ and ‘App Used Duration’, need to be handled.
In the ‘App Used Duration’ column, if we remove ‘hour’ and ‘mins’, the remaining data is numeric and can be used for computation (of course, the hours have to be converted to minutes).
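The conversion to minutes could be sketched as below. The exact text format of the ‘App Used Duration’ values is not shown here, so the sample strings (such as "2 hour 30 mins") and the helper name `duration_to_minutes` are assumptions made for illustration only.

```python
import re

def duration_to_minutes(text):
    """Convert a string such as '2 hour 30 mins' to total minutes.

    The input format is an assumption; adjust the patterns to the real data.
    """
    hours = re.search(r"(\d+)\s*hour", text)  # optional hours part
    mins = re.search(r"(\d+)\s*min", text)    # optional minutes part
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if mins:
        total += int(mins.group(1))
    return total

# Hypothetical sample values, not taken from the original table
samples = ["2 hour 30 mins", "45 mins", "1 hour"]
minutes = [duration_to_minutes(s) for s in samples]
print(minutes)  # [150, 45, 60]
```

If the real column uses a different layout (for example "2h 30m"), only the two regular expressions would need to change.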
‘Duration Category’ is the column where we need Label Encoding. After Label Encoding is applied, each category in this column is replaced by a number.
The codes are assigned in alphabetical order of the labels, so Long is assigned 0, Medium is assigned 1 and Short is assigned 2.
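To make the alphabetical ordering concrete, here is a minimal plain-Python sketch of the same idea. It is not how any library implements the encoding; it only mimics the result by sorting the unique labels and numbering them.

```python
# The labels from the 'Duration Category' column
labels = ["Long", "Medium", "Short", "Long"]

# Sort the unique labels alphabetically, then number them from 0
classes = sorted(set(labels))
mapping = {label: code for code, label in enumerate(classes)}

# Replace every label with its code
encoded = [mapping[label] for label in labels]

print(mapping)  # {'Long': 0, 'Medium': 1, 'Short': 2}
print(encoded)  # [0, 1, 2, 0]
```

Note that the numbers carry no meaning of their own: Long gets 0 only because ‘L’ sorts before ‘M’ and ‘S’.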
Now let’s work through the same example in Python.
Label Encoding can be performed in 2 ways.
1. Label Encoding using scikit-learn library
2. Label Encoding with Category codes
Label Encoding using scikit-learn library:
import pandas as pd
# Initialise the data as a dictionary of lists
data = {'Customer ID': [1, 2, 3, 4],
        'App User': ['Name1', 'Name2', 'Name3', 'Name4'],
        'Duration Category': ['Long', 'Medium', 'Short', 'Long'],
        }
# Create a data frame
df = pd.DataFrame(data)
The data frame now holds 4 rows with the columns ‘Customer ID’, ‘App User’ and ‘Duration Category’.
# Created a copy of data
df1 = df.copy()
# Importing Label Encoder from sklearn
from sklearn.preprocessing import LabelEncoder
# Creating an instance of the LabelEncoder class from sklearn
le = LabelEncoder()
# Transforming the data using the label encoder
df1['Duration Encoded'] = le.fit_transform(df1['Duration Category'])
After applying sklearn Label Encoding, the table gains a ‘Duration Encoded’ column holding the values 0, 1, 2 and 0.
The ‘Duration Category’ column can now be deleted, and the ‘Duration Encoded’ column can be used for modelling.
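Before deleting the original column, it can help to check which number stands for which label. The sketch below rebuilds a one-column version of the example data frame (a simplification made here for brevity) and uses the encoder's `classes_` attribute and `inverse_transform` method to inspect and reverse the encoding.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# One-column version of the example data, kept small for illustration
df1 = pd.DataFrame({"Duration Category": ["Long", "Medium", "Short", "Long"]})

le = LabelEncoder()
df1["Duration Encoded"] = le.fit_transform(df1["Duration Category"])

# classes_ holds the sorted labels; each label's position is its code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original labels
decoded = list(le.inverse_transform(df1["Duration Encoded"]))
print(decoded)

# Drop the text column once the encoded one is in place
df1 = df1.drop(columns=["Duration Category"])
print(df1)
```

Keeping the fitted `le` object around is what makes the codes reversible later, for example when reporting predictions.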
Label Encoding with Category codes:
# Created a copy of data
df2 = df.copy()
# Convert the column to the category dtype
df2['Duration Category'] = df2['Duration Category'].astype('category')
# Convert the categories to numbers
df2['Duration Encoded'] = df2['Duration Category'].cat.codes
The ‘Duration Category’ column can now be deleted, and the ‘Duration Encoded’ column can be used for modelling.
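As with the sklearn approach, the code-to-label mapping can be read back from the column itself. This is a small sketch on a one-column copy of the example data (a simplification made here); it uses the `cat.categories` accessor, whose order matches the codes.

```python
import pandas as pd

# One-column version of the example data, kept small for illustration
df2 = pd.DataFrame({"Duration Category": ["Long", "Medium", "Short", "Long"]})
df2["Duration Category"] = df2["Duration Category"].astype("category")
df2["Duration Encoded"] = df2["Duration Category"].cat.codes

# cat.categories lists the labels in code order: position 0 is code 0, etc.
code_to_label = dict(enumerate(df2["Duration Category"].cat.categories))
print(code_to_label)                     # {0: 'Long', 1: 'Medium', 2: 'Short'}
print(df2["Duration Encoded"].tolist())  # [0, 1, 2, 0]
```

One detail worth knowing with this method: missing values in the column are given the code -1 by `cat.codes`, so they should be handled before modelling.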
Either method can be used. When I timed the 2 methods above, the sklearn Label Encoding method took 0 microseconds and the category codes method took 996 microseconds. These timings are for a table with only 4 columns and 4 rows, and a reading of 0 microseconds means the run finished below the timer’s resolution, so treat them as a rough indication only. If the data frame size increases, the category codes method may take more time to calculate the categories. So the choice is yours! 😂🤣😁
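For a steadier comparison, each method can be run many times and averaged. The sketch below uses the standard-library `timeit` module. The row count (4,000) and the number of runs (20) are arbitrary assumptions, and the function names are made up for this example. Results will differ from machine to machine, so no expected timings are shown.

```python
import timeit

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A larger version of the example column; the size is an arbitrary choice
df = pd.DataFrame({"Duration Category": ["Long", "Medium", "Short", "Long"] * 1000})

def encode_sklearn():
    # Method 1: scikit-learn LabelEncoder
    return LabelEncoder().fit_transform(df["Duration Category"])

def encode_cat_codes():
    # Method 2: pandas category codes
    return df["Duration Category"].astype("category").cat.codes

runs = 20  # repeat each method and average, instead of timing a single call
t_sklearn = timeit.timeit(encode_sklearn, number=runs) / runs
t_codes = timeit.timeit(encode_cat_codes, number=runs) / runs

print(f"sklearn LabelEncoder: {t_sklearn * 1e6:.1f} microseconds per run")
print(f"category codes:       {t_codes * 1e6:.1f} microseconds per run")
```

Trying a few different row counts shows how each method scales on your own data.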