# Transformation of data

Data can be transformed to **achieve linearity, which enables the use of the simple linear regression tools**.


## Key points on data transformations in simple linear regression (SLR)

- Transformation of data can help **achieve linearity**
- Typical tools for regression models are **log, reciprocal, powers and square root**
- Transformation can be done for **X, Y, or both**

## Why transform?

**The value of the linear model is that we can then use** the tools for linear regression:

- plotting and reading the least squares regression line
- calculating the regression model
- finding strength of relationship through the coefficient of determination (r²)
- calculating confidence intervals for the slope
- testing the relationship through a hypothesis test for the slope
- calculating and using mean and single response intervals for inference

**It is therefore worth checking** whether linearity can be achieved by transforming the data.

## What data can be transformed and how?

Transformation can be carried out on the explanatory variable (**X**), the response variable (**Y**), or **both**. Typical choices are the **square root**, the **reciprocal**, and the natural **log** or a log of any other base.
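As a minimal sketch of what these transformations look like in practice, the snippet below applies each of them to a handful of made-up positive values. The numbers and variable names are assumptions chosen for illustration only; they do not come from any dataset on this page.

```python
import math

# Hypothetical positive measurements, chosen only for illustration.
x = [1.0, 4.0, 9.0, 16.0, 25.0]

sqrt_x = [math.sqrt(v) for v in x]      # square root (requires v >= 0)
recip_x = [1.0 / v for v in x]          # reciprocal (requires v != 0)
ln_x = [math.log(v) for v in x]         # natural log (requires v > 0)
log10_x = [math.log10(v) for v in x]    # log base 10 (requires v > 0)

# Logs of different bases differ only by a constant factor, so the choice
# of base does not change whether the transformed relationship is linear.
# The first value is skipped because log(1) is 0 in every base.
ratio = [a / b for a, b in zip(ln_x[1:], log10_x[1:])]
print(ratio)
```

The same list comprehensions work for Y, or for both X and Y at once.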

It is important to keep in mind that **if a relationship is non-linear, it IS and STAYS non-linear**, even though we achieve linearity through transforming. We transform the data so that we can use the tools of linear regression, but we must remember to undo the transformation for the final calculations.

## Example of transforming both X and Y with log

**The dataset:** The following example is based on the dataset from Sacher, G.A. and Staffeldt, E. (1974). The American Naturalist, 108:593–615. Each of the 99 observations is one pair of measurements, **brain weight and body weight**, for one of **99 species**. The dataset is available here: https://rdrr.io/rforge/Sleuth3/man/ex0333.html.

The following example is a hand-drawn illustration and only an approximation of the dataset mentioned above.

**The example**: A straight line fitted to the untransformed data does not summarize the points adequately:

If we take the **log of both the explanatory variable (body weight (X)) and the response variable (brain weight (Y))** of the same dataset shown above, we get a “good looking” straight line:

So, in this case we achieved a linear relationship by transforming X and Y with the natural log.
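The log-log idea can be sketched in a few lines of Python. The data below are synthetic stand-ins generated from an assumed power law with exponent 0.75 and a small deterministic wiggle; they are not the Sacher and Staffeldt measurements, and the helper names `fit_line` and `predict_brain` are made up for this sketch.

```python
import math

def fit_line(xs, ys):
    """Least squares line of ys on xs; returns (intercept, slope, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope, sxy ** 2 / (sxx * syy)

# Synthetic stand-in data (NOT the real measurements): an assumed power law
# brain = 10 * body^0.75, perturbed by a small deterministic wiggle.
body = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
wiggle = [1.1, 0.9, 1.05, 0.95, 1.1, 0.9, 1.0]
brain = [10.0 * x ** 0.75 * w for x, w in zip(body, wiggle)]

# Transform both X and Y with the natural log, then fit a straight line.
log_body = [math.log(x) for x in body]
log_brain = [math.log(y) for y in brain]
b0, b1, r2 = fit_line(log_body, log_brain)

def predict_brain(x):
    """Predict on the original scale by undoing the log: exp(b0 + b1 * ln(x))."""
    return math.exp(b0 + b1 * math.log(x))

print(f"slope on the log-log scale: {b1:.3f}, r^2: {r2:.4f}")
print(f"predicted brain weight at body weight 1: {predict_brain(1.0):.2f}")
```

A straight line on the log-log scale corresponds to a power relationship on the original scale, which is why the fitted slope lands close to the exponent used to generate the data.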

## Example of transforming Y with the square root

In the following example we will achieve linearity by taking the square root of Y. The dataset is for Janka **hardness (Y) vs. density (X)** for 36 **Australian trees** and is available in the **PASWR2 package for R** statistical programming.

The dataset explores the relationship between **density (X)** and hardness, where **density (X) is easy to measure** and **hardness (Y)**, which is understood as the durability of the wood, is **hard to measure**. An adequate regression model therefore lets us estimate and predict hardness, which would otherwise be difficult to obtain.

I am running this example based on Jeremy Balka’s video Simple linear regression: Transformations.

At first glance the model **looks satisfying**, with a coefficient of determination (r²) ≈ 0.95.

**But** when plotting the residuals, a **curvature** is immediately revealed:

Let’s try transforming data by taking the **log of hardness (Y):**

The residual plot still shows a clear curvature, so the log transformation has **“over-transformed”** our data.

Then, let’s try by **square rooting** the response variable (hardness (Y)):

This seems to **overcome the curvature**, and despite the apparent difference in variance across the X levels, we take it as an “improved” model. Also, the coefficient of determination **(r²) has increased** from 0.95 to 0.97. So the model improves and becomes **“reasonable” when square rooting the Y** variable in this case.
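The comparison of the three candidate models (raw Y, log Y and square root Y) can be sketched numerically. The data below are synthetic stand-ins constructed so that the square root of hardness is linear in density; they are not the Janka measurements. The `curvature_score` helper is a made-up, rough summary of what one would otherwise read off a residual plot, not a standard diagnostic.

```python
import math

def residuals(xs, ys):
    """Residuals from a least squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

def curvature_score(resid):
    """Mean residual of the outer thirds minus the middle, in residual SD units.

    Assumes resid is ordered by increasing x. Clearly positive suggests
    U-shaped residuals, clearly negative an inverted U, near zero no bend.
    """
    n = len(resid)
    k = n // 3
    outer = resid[:k] + resid[n - k:]
    middle = resid[k:n - k]
    sd = math.sqrt(sum(r * r for r in resid) / n)
    return (sum(outer) / len(outer) - sum(middle) / len(middle)) / sd

# Synthetic stand-in data (NOT the Janka measurements): sqrt(hardness) is
# linear in density by construction, plus a small deterministic wiggle.
density = [25.0 + 5.0 * i for i in range(10)]
wiggle = [0.2 if i % 2 == 0 else -0.2 for i in range(10)]
hardness = [(2.0 + 0.8 * x + w) ** 2 for x, w in zip(density, wiggle)]

scores = {
    "raw Y": curvature_score(residuals(density, hardness)),
    "log Y": curvature_score(residuals(density, [math.log(y) for y in hardness])),
    "sqrt Y": curvature_score(residuals(density, [math.sqrt(y) for y in hardness])),
}
for name, score in scores.items():
    print(f"{name:7s} curvature score: {score:+.2f}")
```

On data built this way, the raw fit leaves U-shaped residuals, the log fit bends them the other way (the “over-transformed” case), and the square root fit leaves no systematic bend. Real data will be noisier, so the residual plots remain the primary check.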

## Calculations in the transformed model

For any calculations in this transformed model, we need to “tie up”, or **undo the transformation**. In this case the Y variable (hardness) was square rooted, so we need to square the predicted value to get back to the original units. For example, to predict the hardness (Y) at a density (X) of 50, we compute the prediction on the square root scale and then square it.
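A minimal sketch of that back-transformation is shown below. The coefficients `b0` and `b1` and the interval endpoints are hypothetical placeholders, not estimates from the Janka data, and `back_transform_sqrt` is a helper name made up for this illustration.

```python
def back_transform_sqrt(value_on_sqrt_scale):
    """Undo a square root transformation of Y by squaring."""
    if value_on_sqrt_scale < 0:
        raise ValueError("a negative value on the sqrt scale cannot be back-transformed")
    return value_on_sqrt_scale ** 2

# Hypothetical coefficients for the model sqrt(hardness) = b0 + b1 * density.
# They are placeholders, NOT estimates from the Janka data.
b0, b1 = 2.0, 0.8
density = 50.0

sqrt_hardness_hat = b0 + b1 * density                   # prediction on the sqrt scale
hardness_hat = back_transform_sqrt(sqrt_hardness_hat)   # prediction in original units

# Interval endpoints computed on the sqrt scale are squared in the same way
# (hypothetical endpoints, for illustration only).
lower, upper = (back_transform_sqrt(v) for v in (40.0, 44.0))

print(hardness_hat, (lower, upper))
```

Squaring the endpoints works here because squaring preserves order for non-negative values, so the interval keeps its meaning on the original scale.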

## Transformation of data in MS Excel

The **=LN function** calculates the natural log and the **=SQRT function** the square root. Plots and summary statistics can then be obtained via **Data >> Data Analysis >> Regression** (this requires the Analysis ToolPak add-in to be enabled):


## Transformation of data in R

Coming


## Learning materials on transformation of data

My preferred materials for learning about transformation of data:

- **Jbstatistics** video (7:26): I find Jeremy Balka’s video on transformation a very well explained tutorial for learning about transformation in SLR. Some of the sections in this chapter are based on his video: Simple linear regression: Transformations
- **Khan Academy**:
  - Video 2:54: Transforming nonlinear data
  - Video 7:39: Worked example of linear regression when using transformed data
- **Liz Minton**:
  - Video 5:14: Transforming to achieve linearity
  - Video 12:21: The logarithm transformation
  - Video 6:14: Transforming with powers
- **James H. Steiger**, Department of Psychology and Human Development, Vanderbilt University (online pdf, with more details on when to use what type of transformation): Transforming to linearity

#### Carsten Grube

Freelance Data Analyst
