+34 616 71 29 85 carsten@dataz4s.com

Transformation of data

Data can be transformed to achieve linearity, which makes it possible to use the tools of simple linear regression.

 

Key points on data transformations in simple linear regression (SLR)

  • Transformation of data can help achieve linearity
  • Typical transformations for regression models are the log, the reciprocal, powers and the square root
  • Transformation can be applied to X, to Y, or to both

 

Why transform?

The value of a linear model is that it lets us use the tools of linear regression. It is therefore worth checking whether linearity can be achieved, and one way to achieve it is to transform the data.

 

What data can be transformed and how?

Transformation can be carried out on the explanatory variable (X), the response variable (Y), or both. Typical choices are the square root, the reciprocal, the natural log or a log of any other base.

It is important to keep in mind that if a relationship is nonlinear, it is and stays nonlinear, even though we achieve linearity on the transformed scale. We transform the data so that we can use the tools of linear regression, but we must remember to undo the transformation for the final calculations.
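As a sketch of what undoing the transformation means, the two transformations used in the examples on this page can be written out as follows. The symbols b_0 and b_1 (the fitted intercept and slope on the transformed scale) are notation assumed here for illustration.

```latex
\begin{align*}
\ln \hat{Y} = b_0 + b_1 \ln X
  \quad &\Longrightarrow \quad \hat{Y} = e^{b_0} X^{b_1} \\
\sqrt{\hat{Y}} = b_0 + b_1 X
  \quad &\Longrightarrow \quad \hat{Y} = (b_0 + b_1 X)^2
\end{align*}
```

In both cases the model is a straight line only on the transformed scale. On the original scale the fitted relationship is still a curve.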

 

Example of transforming both X and Y with log

The dataset: The following example is based on the dataset from Sacher, G.A. and Staffeldt, E. (1974), The American Naturalist, 108:593–615. Each of the 99 observations is one pair of brain weight and body weight measurements for one of 99 species. The dataset is available here: https://rdrr.io/rforge/Sleuth3/man/ex0333.html.

The following example is a hand-drawn illustration and only an approximation of the dataset mentioned above.

The example: A straight line does not summarize these points adequately:

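[Figure: Transformation of data to achieve linearity. Hand-drawn scatter plot of brain weight (Y) against body weight (X) with a fitted straight line]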

 

If we take the log of both the explanatory variable (body weight (X)) and the response variable (brain weight (Y)) of the same dataset, the points fall close to a straight line:

 

[Figure: Transformation of data to achieve linearity. The same data after taking the log of both X and Y, with a fitted straight line]

 

 

So, in this case we achieved a linear relationship by transforming both X and Y with the natural log.
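A minimal sketch of the log-log approach, using only the Python standard library. The numbers are made up for illustration (an exact power law, brain = 10 * body ** 0.75) and are not the Sacher and Staffeldt data. The helper names `fit_line` and `predict_brain` are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data following brain = 10 * body ** 0.75 exactly (no noise).
body = [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
brain = [10.0 * w ** 0.75 for w in body]

# Transform both variables with the natural log, then fit a straight line.
log_body = [math.log(w) for w in body]
log_brain = [math.log(w) for w in brain]
intercept, slope = fit_line(log_body, log_brain)

def predict_brain(body_weight):
    """Predict on the log scale, then undo the log with exp."""
    return math.exp(intercept + slope * math.log(body_weight))

print(f"slope on the log-log scale: {slope:.3f}")
print(f"predicted brain weight at body weight 50: {predict_brain(50.0):.1f}")
```

The slope on the log-log scale is the exponent of the power law on the original scale, which is why the prediction has to go back through `math.exp`.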

 

Example of transforming Y with the square root

In the following example we achieve linearity by taking the square root of Y. The dataset holds Janka hardness (Y) and density (X) for 36 Australian trees and is available in the PASWR2 package for the R statistical programming language.

The dataset explores the relationship between density (X) and hardness (Y). Density is easy to measure, whereas hardness, understood here as the durability of the wood, is difficult to measure. An adequate regression model therefore lets us estimate and predict hardness from density.

This example is based on Jeremy Balka’s video Simple linear regression: Transformations.

At a glance, the model looks satisfactory, with a coefficient of determination (r²) ≈ 0.95.

[Figure: Australian trees, regression analysis of hardness (Y) on density (X)]

 

But when plotting the residuals, a curvature is immediately revealed:

[Figure: residual plot showing curvature and differing variance]

 

 

Let’s try transforming data by taking the log of hardness (Y):

[Figure: regression with the log of hardness (Y); over-transformed]

 

 

 

The residual plot again shows a clear curvature, so the log transformation has “over-transformed” our data.

 

Then, let’s try taking the square root of the response variable (hardness (Y)):

[Figure: quantile-quantile plot and residual plot for the square root model]

 

This seems to remove the curvature, and although the variance still appears to differ across the levels of X, we take it as an “improved” model. The coefficient of determination (r²) has also increased from 0.95 to 0.97. So, in this case the model improves and becomes “reasonable” when we take the square root of Y.
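The comparison of the raw, log and square root versions of Y can be sketched numerically as well as by eye. The snippet below is an illustration on made-up, noise-free data where the square root of hardness is exactly linear in density; it is not the Janka data. The function `curvature_score` is a hypothetical, crude stand-in for reading a residual plot.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def curvature_score(xs, ys):
    """Crude residual-curvature check for a straight-line fit.

    Returns (mean residual in the outer thirds - mean residual in the
    middle third) divided by the standard deviation of ys.
    Clearly positive: U-shaped residuals. Clearly negative: inverted U.
    Near zero: no obvious curvature. Assumes xs are sorted ascending.
    """
    intercept, slope = fit_line(xs, ys)
    res = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    third = len(xs) // 3
    outer = res[:third] + res[-third:]
    middle = res[third:-third]
    mean_y = sum(ys) / len(ys)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (len(ys) - 1))
    return (sum(outer) / len(outer) - sum(middle) / len(middle)) / sd_y

# Made-up data: sqrt(hardness) = density - 5, so hardness is convex in density.
density = list(range(25, 70))
hardness = [(x - 5.0) ** 2 for x in density]

scores = {
    "raw Y": curvature_score(density, hardness),
    "log Y": curvature_score(density, [math.log(y) for y in hardness]),
    "sqrt Y": curvature_score(density, [math.sqrt(y) for y in hardness]),
}
for name, score in scores.items():
    print(f"{name:>6}: {score:+.3f}")
```

On this synthetic data the raw fit gives a positive score, the log fit flips the sign (the curvature now bends the other way, which is what “over-transformed” means), and the square root fit gives a score of essentially zero. With real, noisy data the residual plot remains the better diagnostic.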

 

Calculations in the transformed model

For any calculation in the transformed model, we need to undo the transformation. In this case the Y variable (hardness) was square rooted, so we need to square the predicted value to get back to the original scale. For example, if we aim to predict the hardness (Y) at a density (X) of 50:

[Figure: worked prediction of hardness (Y) at a density (X) of 50]
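The back-transformation step can be sketched in a few lines of Python. The intercept and slope below are hypothetical values chosen for easy arithmetic; they are not the coefficients fitted to the Janka hardness data, and `predict_hardness` is an illustrative helper name.

```python
def predict_hardness(density, intercept, slope):
    """Predict sqrt(hardness) from the fitted line, then square it."""
    sqrt_hardness = intercept + slope * density
    return sqrt_hardness ** 2

# Hypothetical coefficients for illustration only; NOT the fitted values
# from the Janka hardness data.
intercept = -5.0
slope = 1.0

prediction = predict_hardness(50, intercept, slope)
print(prediction)  # (-5.0 + 1.0 * 50) ** 2 = 2025.0
```

Forgetting the final squaring would report the prediction in square root units (45 in this made-up case) instead of hardness units.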

 

 

Transformation of data in MS Excel

The =LN function calculates the natural log and the =SQRT function the square root. Plots and summary statistics can then be obtained through Data >> Data Analysis >> Regression (the Data Analysis menu requires the Analysis ToolPak add-in):

[Figure: Transformation of data in MS Excel]

 

 

 

 

Transformation of data in R

Coming

 

Learning resources on transformation of data

 

My preferred materials for learning about transformation of data:

 

Carsten Grube

Freelance Data Analyst
