Transformation of data
Data can be transformed to achieve linearity, which makes it possible to use the tools of simple linear regression (SLR).
Key points on data transformations in simple linear regression (SLR)
- Transformation of data can help achieve linearity
- Typical transformations are the log, reciprocal, powers and square root
- Transformation can be done for X, Y, or for both
The value of a linear model is that we can then use the tools of linear regression:
- plotting and reading the least squares regression line
- calculating the regression model
- finding the strength of the relationship through the coefficient of determination (r²)
- calculating confidence intervals for the slope
- testing the relationship through a hypothesis test for the slope
- calculating and using intervals for the mean response and for a single response for inference
It is therefore worth checking whether linearity can be achieved by transforming the data.
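As a minimal sketch of the first tools in the list above (the least squares line and r²), the calculation can be written out by hand. Python is used here only for illustration, the function name fit_line is hypothetical, and the data are made up so that they lie exactly on a line; they do not come from any dataset in this chapter.

```python
def fit_line(x, y):
    """Return (intercept, slope, r_squared) of the least squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # In simple linear regression r^2 is the squared correlation of x and y
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Made-up data lying exactly on y = 1 + 2x (illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

intercept, slope, r_squared = fit_line(x, y)
print(intercept, slope, r_squared)
```

The same function can be applied to transformed values, for example the logs or square roots of the original data, which is the idea behind the examples in this chapter.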
What data can be transformed and how?
Transformation can be carried out on the explanatory variable (X), the response variable (Y), or both. Typical choices are the square root, the reciprocal, and the natural log or a log of any other base.
It is important to keep in mind that if a relationship is non-linear, it IS and STAYS non-linear, even though we achieve linearity on the transformed scale. We transform the data so that we can use the tools of linear regression, but we must remember to undo the transformation for the final calculations.
Example of transforming both X and Y with log
The dataset: The following example is based on the dataset from Sacher, G.A. and Staffeldt, E. (1974). The American Naturalist, 108:593–615. Each of the 99 observations is one pair of brain and body weight measurements for one of 99 species. The dataset is available here: https://rdrr.io/rforge/Sleuth3/man/ex0333.html.
The following example is a hand drawn illustration and only an approximation of the dataset mentioned above.
The example: A straight line fitted to the untransformed data does not summarize the points adequately:
If we take the log of both the explanatory variable (body weight (X)) and the response variable (brain weight (Y)) of the same dataset, we get a “good looking” straight line:
So, in this case we achieved a linear relationship by transforming both X and Y with the natural log.
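A log-log fit of this kind can be sketched as follows. The numbers are made up (an exact power law, brain = 2 · body^0.75) and are not the Sacher and Staffeldt measurements; the function and variable names are hypothetical. The sketch shows that a straight line on the log-log scale corresponds to a power relationship on the original scale, and that predictions have to be transformed back with the exponential function.

```python
import math


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up body weights and brain weights following an exact power law
body = [0.1, 1.0, 10.0, 100.0, 1000.0]
brain = [2.0 * w ** 0.75 for w in body]

# Transform both X and Y with the natural log, then fit a straight line
log_body = [math.log(w) for w in body]
log_brain = [math.log(w) for w in brain]
intercept, slope = fit_line(log_body, log_brain)


def predict_brain(weight):
    """Predict on the log scale, then undo the log with exp()."""
    return math.exp(intercept + slope * math.log(weight))


print(slope, math.exp(intercept), predict_brain(50.0))
```

Here the slope on the log-log scale recovers the exponent of the power law, and exp(intercept) recovers its multiplicative constant.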
Example transforming by square rooting (Y)
In the following example we will achieve linearity by taking the square root of Y. The dataset contains Janka hardness (Y) vs. density (X) for 36 Australian trees and is available in the PASWR2 package for the R statistical programming language.
The dataset explores the relationship between density (X), which is easy to measure, and hardness (Y), which is understood as the durability of the wood and is hard to measure. Finding an adequate regression model therefore lets us estimate and predict a hardness that would otherwise be difficult to obtain.
I am running this example based on Jeremy Balka’s video Simple linear regression: Transformations.
At a glance, the linear model fitted to the untransformed data looks satisfactory, with a coefficient of determination (r²) ≈ 0.95.
But when plotting the residuals, a curvature is immediately revealed:
Let’s try transforming the data by taking the log of hardness (Y):
The residual plot still shows a clear curvature, so the log transformation has “over-transformed” our data.
Then, let’s try taking the square root of the response variable (hardness (Y)):
This seems to overcome the curvature, and although the variance appears to differ across the levels of X, we take it as an “improved” model. The coefficient of determination (r²) has also increased from 0.95 to 0.97. So, in this case the model improves and becomes “reasonable” when we take the square root of Y.
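The comparison of residual patterns can be sketched numerically. The data below are made up so that the square root of hardness is exactly linear in density (they are not the Janka measurements), and the helper name residuals is hypothetical. With such data, the untransformed fit leaves residuals that are positive at both ends and negative in the middle, the log fit leaves the opposite pattern, and the square root fit leaves essentially none.

```python
import math


def residuals(x, y):
    """Residuals y - (a + b*x) from the least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Made-up data: sqrt(hardness) = -5 + 0.8 * density holds exactly
density = [30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0]
hardness = [(-5.0 + 0.8 * d) ** 2 for d in density]

res_raw = residuals(density, hardness)                           # no transformation
res_log = residuals(density, [math.log(h) for h in hardness])    # log(Y)
res_sqrt = residuals(density, [math.sqrt(h) for h in hardness])  # sqrt(Y)

# Print first, middle and last residual of each fit to show the pattern
for name, res in [("raw", res_raw), ("log", res_log), ("sqrt", res_sqrt)]:
    print(name, round(res[0], 3), round(res[4], 3), round(res[-1], 3))
```

Real data will never behave this cleanly, so in practice the residual plots are judged by eye, as in the example above.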
Calculations in the transformed model
For any calculation in this transformed model we need to “tie up”, or undo, the transformation. Here the Y variable (hardness) was square rooted, so we need to square the predicted value to get back to the original hardness scale. For example, if we aim to predict the hardness (Y) at a density (X) of 50:
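The back-transformation step can be sketched as follows. The data are made up (sqrt(hardness) = -5 + 0.8 · density holds exactly), so the coefficients and the resulting prediction are NOT the fitted values for the Janka dataset; they only illustrate the order of the steps: fit on the square root scale, predict on that scale, then square.

```python
import math

# Made-up data, not the Janka measurements
density = [30.0, 40.0, 50.0, 60.0, 70.0]
hardness = [(-5.0 + 0.8 * d) ** 2 for d in density]

# Step 1: transform the response
sqrt_hardness = [math.sqrt(h) for h in hardness]

# Step 2: least squares fit of sqrt(hardness) on density
n = len(density)
mean_x = sum(density) / n
mean_y = sum(sqrt_hardness) / n
sxx = sum((x - mean_x) ** 2 for x in density)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(density, sqrt_hardness))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Step 3: predict on the transformed (square root) scale
predicted_sqrt = intercept + slope * 50.0

# Step 4: undo the transformation by squaring
predicted_hardness = predicted_sqrt ** 2

print(predicted_sqrt, predicted_hardness)
```

With these made-up numbers the prediction on the square root scale is 35, so the predicted hardness is 35² = 1225. Forgetting the last step would report a value on the wrong scale.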
Transformation of data in MS Excel
The =LN function can be used to calculate the natural log and the =SQRT function the square root. Plots and summary statistics can then be obtained via Data >> Data Analysis >> Regression:
Transformation of data in R
Learning resources on transformation of data
My preferred materials for learning about transformation of data:
- jbstatistics video (7:26): I find Jeremy Balka’s video on transformations a very well explained tutorial for learning about transformation in SLR. Some of the sections in this chapter are based on his video: Simple linear regression: Transformations
- Khan Academy:
- Video 2:54: Transforming nonlinear data
- Video 7:39: Worked example of linear regression when using transformed data
- Liz Minton:
- James H. Steiger, Department of Psychology and Human Development, Vanderbilt University: (online pdf – more details also on when to use what type of transformation): Transforming to linearity