Mean Removal / Standardisation
Feature scaling is a crucial step in the pre-processing pipeline. The two main feature scaling techniques are Normalization and Standardization.
In this post I am explaining Standardization, which is also called Z-Distribution Conversion, Normal Distribution Conversion, Variance Scaling or Mean Removal.
Any distribution can be standardized (rescaled to mean 0 and standard deviation 1) using the formula below:
z = (x - μ) / σ
where μ is the mean and σ is the standard deviation of the feature. The Z-distribution is a normal distribution with a mean of 0 and a standard deviation of 1.
Standardization can be more practical than Normalization (min-max scaling) for many machine learning algorithms, especially optimization-based algorithms such as gradient descent. With Bagging and Boosting algorithms we do not need to worry about feature scaling, because the tree-based models they typically use are scale invariant. Principal component analysis often works better with standardization, while min-max scaling is often recommended for neural networks.
The reason is that many linear models initialize the weights to 0 or to small random values close to 0. Using standardization, we centre the feature columns at mean 0 with standard deviation 1, so that all features are on a comparable scale, which makes it easier to learn the weights.
If our data has significant outliers, they can negatively impact standardization by distorting the feature's mean and variance. In that scenario standardization might not be appropriate, and it is often helpful to rescale the feature using the median and interquartile range, or to use Normalization instead.
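One option in that case, sketched below with made-up numbers, is scikit-learn's RobustScaler, which centres each feature on its median and scales by the interquartile range, so a single extreme value has little effect on the scaling statistics.
import numpy as np
from sklearn.preprocessing import RobustScaler

# A made-up feature with one extreme outlier
X = np.array([[1.], [2.], [3.], [4.], [500.]])

# RobustScaler subtracts the median and divides by the interquartile range,
# so the outlier barely influences how the other values are rescaled
X_robust = RobustScaler().fit_transform(X)
print(X_robust)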
If the data is sparse and we standardize it, most of the zero values become non-zero and the data turns dense. This in turn can create a huge computational burden for the classifier: bag-of-words, for example, is a sparse representation, and most classification libraries are optimized for sparse inputs.
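If you still want to scale sparse data, one workaround (a minimal sketch, assuming scikit-learn and SciPy are available) is to skip mean removal so the zero entries stay zero:
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

# A small made-up sparse matrix (e.g. bag-of-words counts)
X_sparse = sp.csr_matrix([[0., 1., 0.],
                          [3., 0., 0.],
                          [0., 0., 2.]])

# with_mean=False divides by the standard deviation only, so sparsity is preserved
X_scaled = StandardScaler(with_mean=False).fit_transform(X_sparse)
print(X_scaled.toarray())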
Normally distributed data:
The left-hand plot below shows normally distributed data whose mean and standard deviation are something other than 0 and 1. After standardization the mean and standard deviation become 0 and 1, which can be observed in the right-hand plot.
Right Skewed Data:
The same applies to right-skewed data: after standardization the mean is 0 and the standard deviation is 1, while the skewed shape of the distribution is unchanged.
Left Skewed Data:
The same applies to left-skewed data (see the short sketch below).
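As a quick illustration of the skewed cases, here is a minimal sketch using a synthetic right-skewed (exponential) sample in place of the data shown in the pictures; after standardization the mean and standard deviation become 0 and 1, but the skewness itself does not change.
import numpy as np

rng = np.random.default_rng(42)
right_skewed = rng.exponential(scale=2.0, size=1000)   # synthetic right-skewed sample

standardized = (right_skewed - right_skewed.mean()) / right_skewed.std()
print(standardized.mean())   # approximately 0
print(standardized.std())    # approximately 1 (the skewed shape is unchanged)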
Python code Example 1:
import numpy as np
import pandas as pd
import seaborn as sns

data = {'Name': ['Mahesh', 'Harish', 'Suresh', 'Krish'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)
# Data frame created with the above data
sns.displot(df, x="Age", kind="kde")
# Plot to see the original distribution
df_age_mean = np.mean(df.Age)
# Calculate the mean
df["Age"] = (df["Age"] - df_age_mean) / np.std(df["Age"])
# Subtract the mean and divide by the standard deviation of the same column.
# This is the crucial step in standardization.
sns.displot(df, x="Age", kind="kde")
# Plot again to see the standardized distribution, its mean and standard deviation
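As a quick check (not part of the original snippet), the standardized column should now have a mean of roughly 0 and a standard deviation of roughly 1:
print(np.mean(df["Age"]), np.std(df["Age"]))   # approximately 0.0 and 1.0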
Python code Example 2:
For this I have taken the example explained in the scikit-learn documentation itself. The main advantage of this method is that all the features of a data frame can be standardized at once.
from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
X_scaled
# array([[ 0.  ..., -1.22...,  1.33...],
#        [ 1.22...,  0.  ..., -0.26...],
#        [-1.22...,  1.22..., -1.06...]])
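As the scikit-learn documentation also shows, the scaled data now has zero mean and unit variance in every column, which we can verify directly:
X_scaled.mean(axis=0)
# array([0., 0., 0.])
X_scaled.std(axis=0)
# array([1., 1., 1.])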
Let me explain one more thing before closing this short note. When different features are on different scales, applying standardization brings all of them to the same scale. Suppose we have two features, where one is measured on a scale from 1 to 10 and the second on a scale from 1 to 100,000. If we calculate the mean squared error, the algorithm will mostly be busy optimizing the weights corresponding to the second feature instead of treating both features equally. The same applies when the algorithm uses distance calculations such as Euclidean or Manhattan distance: the second feature will dominate the result.
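To make this concrete, here is a minimal sketch with made-up numbers showing how the large-scale feature dominates a Euclidean distance until both features are standardized:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features: the first on a 1-10 scale, the second on a much larger scale
X = np.array([[2., 20000.],
              [3., 80000.],
              [9., 30000.],
              [5., 50000.]])

# Euclidean distance between the first two samples before scaling is
# driven almost entirely by the second feature
print(np.linalg.norm(X[0] - X[1]))

X_std = StandardScaler().fit_transform(X)

# After standardization both features contribute on a comparable scale
print(np.linalg.norm(X_std[0] - X_std[1]))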
Thanks for reading my post.