Understanding the Differences Between Scaling, Normalization, and Log Transformations
In this article, we will explore the distinctions between Standardization (Scaling), Normalization, and Log transformations. You'll gain a comprehensive understanding of when to implement each method and the rationale behind selecting one approach over another.
What does it mean to adhere to statistical methodologies?
You may have encountered the phrase in various articles or courses:
The features in a dataset must align with the statistical assumptions of the models.
But what does this adherence imply? Several algorithms in Sklearn may yield suboptimal results if the numeric features deviate significantly from a standard Gaussian (normal) distribution. This is especially true for non-tree-based models, as their objective functions presume that the features conform to a normal distribution.
In fact, using the term "presume" might be an understatement. For distance-based algorithms such as K-Nearest-Neighbors, transforming the features is critical for the algorithm to function as expected.
In practice, feature transformations can sometimes improve performance by 5% or more. Numerous techniques exist to nudge your features toward a normal distribution, and which one to apply depends on the distribution of each feature.
This article will introduce you to three methods: Scaling, normalization, and logarithmic transformations. You will acquire a practical understanding of their differences and how to apply them effectively in your workflow.
Identifying the Underlying Distribution of Features
Before implementing any of the methods discussed, it's crucial to visually examine each feature. Creating perfect plots isn't necessary; simple histograms and boxplots with default settings will suffice for distribution identification. Consider the following histograms from the Diamonds dataset in Seaborn:
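If you want to reproduce the plots yourself, here is a minimal sketch, assuming the built-in Diamonds dataset that ships with Seaborn and default histogram settings:

import seaborn as sns
import matplotlib.pyplot as plt

# Load the built-in Diamonds dataset as a pandas DataFrame
diamonds = sns.load_dataset("diamonds")

# Simple histograms of the numeric features; default settings are enough
diamonds.hist(figsize=(10, 8), bins=50)
plt.tight_layout()
plt.show()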
Pay close attention to the shapes of the distributions and the ranges on the X-axes. The subplots indicate that the price and carat features exhibit a skewed distribution, while depth, table, and x may somewhat resemble a normal distribution.
Besides the distributions, the value ranges are also a significant factor. Boxplots effectively illustrate this using the 5-number summary:
>>> diamonds.boxplot()
This plot clearly shows that the features have varying scales. The much larger scale of diamond prices compresses the other boxplots into thin lines. Here are the summary statistics for further comparison:
>>> diamonds.describe().T.round(2)
Now that you know how to identify distribution shapes and scale differences among features, here is the goal: by the end of this article, all features will share a similar scale and approximate a normal distribution.
Scaling or Standardization with StandardScaler
One common solution for situations where one feature exhibits significantly greater variance than others is scaling (also known as standardization):
According to the official Sklearn documentation on scaling:
Many elements in the objective function of a learning algorithm expect all features to be centered around zero with similar variances. If a feature's variance is significantly larger than others, it can overshadow the objective function, leading to inaccurate learning from the other features.
Thus, scaling is often essential for achieving optimal performance with various models. Sklearn implements this through the StandardScaler() transformer, which adjusts numerical features to have a mean of 0 and a variance of 1.
The process involves two steps:
1. Centering: subtract the mean from each value in the distribution.
2. Scaling: divide each result by the standard deviation.
These operations leave the feature with a mean of 0 and a standard deviation of 1; note that they shift and rescale the distribution but do not change its shape. Here's how we would achieve this manually:
> We're excluding the price and carat features because they follow a skewed distribution, which we will discuss later.
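A minimal sketch of the manual version (the column list matches the roughly bell-shaped features mentioned above; the variable names are mine):

# The roughly bell-shaped numeric columns (price and carat are left out)
cols = ["depth", "table", "x", "y", "z"]

# Centering: subtract the mean; scaling: divide by the standard deviation.
# ddof=0 matches the population standard deviation that StandardScaler uses.
scaled_manually = (diamonds[cols] - diamonds[cols].mean()) / diamonds[cols].std(ddof=0)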
Let's replicate this using StandardScaler():
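A sketch of the same operation with the transformer, reusing the cols list from above:

import pandas as pd
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
scaled_ss = pd.DataFrame(ss.fit_transform(diamonds[cols]), columns=cols)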
The results are consistent.
Verifying the mean and variances:
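For example, using the scaled_ss frame from the sketch above:

>>> scaled_ss.mean().round(2)  # ~0 for every column
>>> scaled_ss.var().round(2)   # ~1 for every column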
Now, let's evaluate the success of the scaling. Depth and x features now resemble a Gaussian distribution. However, features like table, y, and z remain compressed at the edges of their plots, hinting at outliers (otherwise, the majority of the histograms would cluster in the center). This indicates that scaling was more effective for depth and x features than for the others. We should keep this in mind for subsequent sections.
Log Transformation with PowerTransformer
When a feature does not follow a normal distribution, using the mean and standard deviation for scaling is ill-advised. For instance, consider the outcome of scaling the skewed distributions of price and carat:
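A quick way to see this for yourself, reusing pd and StandardScaler from the earlier sketches:

skewed = ["price", "carat"]
scaled_skewed = pd.DataFrame(
    StandardScaler().fit_transform(diamonds[skewed]), columns=skewed
)

# The histograms keep their original, heavily right-skewed shape
scaled_skewed.hist(bins=50)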
The fact that these features remain skewed confirms that standardization is ineffective for them.
To handle such skewed distributions, Sklearn offers the PowerTransformer class, which applies power transforms (the Box-Cox and Yeo-Johnson methods, both closely related to the logarithm) to reduce skewness and map a distribution as closely as possible to a normal one. For reference, here are the variances of the raw carat and price features:
>>> diamonds[["carat", "price"]].var()
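Here is a sketch of how the transformation could be applied (PowerTransformer defaults to the Yeo-Johnson method and standardizes its output; the variable names are mine):

from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer()  # uses the Yeo-Johnson method by default
transformed = pd.DataFrame(
    pt.fit_transform(diamonds[["carat", "price"]]), columns=["carat", "price"]
)

# The skew is mostly gone and the histograms look far more bell-shaped
transformed.hist(bins=50)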
The new features exhibit a significant improvement over the original skewed ones. Thus, whenever dealing with a skewed distribution, the PowerTransformer class is the recommended approach.
Normalization with MinMaxScaler
An alternative to scaling is known as normalization. Instead of utilizing variance and mean, normalization relies on the minimum and maximum values of the distribution. The following equation is used for each value:
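X_scaled = (X - X_min) / (X_max - X_min)

where X_min and X_max are the minimum and maximum of the feature (for the default 0 to 1 range).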
This transformation confines the distribution within a defined minimum and maximum value, typically ranging from 0 to 1. Sklearn offers a corresponding MinMaxScaler transformer for this purpose:
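A minimal sketch, reusing the cols list and pd from earlier:

from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()  # feature_range=(0, 1) by default
normalized = pd.DataFrame(mms.fit_transform(diamonds[cols]), columns=cols)

normalized.min().round(2)  # 0.0 for every column
normalized.max().round(2)  # 1.0 for every column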
While this method forces feature values into a fixed range, they will not end up with unit variance or a mean of 0.
However, there are limitations associated with this approach. For example, if the maximum value in the training set is lower than that in the test set, the scaling may yield unexpected predictions. The same applies to minimum values.
Additionally, MinMaxScaler does not perform well with features that contain outliers. Consider why this might be the case (hint: pay attention to the MinMaxScaler formula above).
Furthermore, MinMaxScaler does not alter the distribution shape at all. After normalization, values fall within the specified range, but the distribution's shape remains unchanged.
For these reasons, StandardScaler is more frequently employed.
Bringing it All Together
In this segment, we will attempt to predict diamond cuts using a LogisticRegression algorithm. We will integrate the StandardScaler and PowerTransformer within a pipeline. If you're unfamiliar with how Sklearn pipelines and ColumnTransformers function, refer to this post:
How to Use Sklearn Pipelines For Ridiculously Neat Code
- towardsdatascience.com
Let's begin. First, we will construct the feature/target arrays and extract the names of the columns to which we will apply the transformers:
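Here is a sketch of what that could look like. The target is cut; the column lists follow the earlier sections, but the exact split is my assumption:

# Target: the diamond cut; features: everything else
X = diamonds.drop("cut", axis=1)
y = diamonds["cut"]

# Columns each transformer will be applied to
skewed_cols = ["price", "carat"]                  # PowerTransformer
scaled_cols = ["depth", "table", "x", "y", "z"]   # StandardScaler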
Next, we will create a ColumnTransformer object that maps the transformers to the appropriate columns:
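A sketch, assuming the column lists defined above; note that any column not listed (the categorical ones) is dropped by ColumnTransformer's default remainder behavior:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, StandardScaler

preprocessor = ColumnTransformer(
    transformers=[
        ("power", PowerTransformer(), skewed_cols),
        ("scale", StandardScaler(), scaled_cols),
    ]
)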
We will then integrate this transformer into a Pipeline concluding with a LogisticRegression model:
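A sketch, building on the preprocessor from the previous step:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)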
Finally, we will split the data into training and testing sets and evaluate the classifier's performance:
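A sketch of the evaluation. The split parameters and the one-vs-rest averaging are my choices, so the exact score you get may differ from the one reported below:

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

clf.fit(X_train, y_train)

# cut has five classes, so we score the predicted probabilities one-vs-rest
probs = clf.predict_proba(X_test)
print(roc_auc_score(y_test, probs, multi_class="ovr"))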
We achieved a ROC AUC score of 0.83. At this stage, hyperparameter tuning can commence to enhance this score.
Data Leakage Considerations For the Transformers
Whenever preprocessing is undertaken, it is essential to be vigilant about data leakage. Since all the transformers covered today derive metrics from the underlying distributions of the features, there is potential for them to inadvertently leak information from the test data.
Therefore, it is advisable to partition the data into training and test sets prior to preprocessing. Additionally, avoid using the fit_transform() method on the test set. Transformers should be fitted exclusively to the training data, with transformations applied solely through the transform method.
However, when these transformers are used inside a pipeline, you need not worry about this: the pipeline fits the transformers only on the data passed to fit() (the training set, or the training folds during cross-validation) and applies only transform() at prediction time, which prevents leakage.
Summary
In this article, you learned how to engineer numeric features to conform to the statistical assumptions required by many models. Specifically, we covered:
- Scaling data with StandardScaler, a transformer that rescales features to a mean of 0 and unit variance, primarily suited for distributions without excessive outliers.
- Log transforming data with PowerTransformer, a transformer suitable for converting highly skewed features into a distribution that closely resembles a normal one.
- Normalizing data using MinMaxScaler, a transformer intended for ensuring feature values fall within specified minimum and maximum values. It is less effective with numerous outliers and can behave unexpectedly if values exceed the defined range in the test set, making it a less popular alternative to scaling.
Thank you for reading!
Enjoyed this article and its unique writing style? Imagine having access to many more like it, all authored by a brilliant, charming, and witty writer (that’s me, by the way :)
For just $4.99, you can join and gain access not only to my stories but also a wealth of knowledge from the brightest minds on Medium. Plus, using my referral link will earn my heartfelt gratitude and a virtual high-five for your support.