# Linear regression in R and python

Fortunately (or unfortunately), I haven’t had the need to learn python in detail to carry out my main job functions. However, python has gained quite a lot of traction for data science and bioinformatics, thus I’ve decided to get up to speed during the few breaks I get.

Today I wanted to see how simple is to translate some very basic R code to python. Specifically, I wanted to see how to do the following in python:

- Work with data frames
- Simulate data
- Perform linear regressions
- Plot data

For this particular exercise I will create a data frame with simulated data drawn from a multivariate normal distribution for which I will know the real regression coefficients, and check these through a linear regression.

# R code

In R, achieving this is very simple with the help of the MASS library. The code for a simple simulation and linear regression looked like this:

Briefly, in the snippet above, I first simulate some data using the function `mvrnorm()`

that will draw random numbers from a multivariate distribution for two variables with mean=0, var(*x1*)=var(*y*)=1 and cov(*y,x1*)=0.5. The linear regression is then fit with the `lm()`

function, and the `summary()`

allows to inspect the results:

`Coefficients:`

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.00428 0.01229 -0.348 0.728

x1 0.50926 0.01218 41.801 <2e-16 ***

Here we see that the regression coefficient for x1 = 0.509 which is in line with what we simulated *b* = cov(*y,x1*)/var(*x1*)=0.5/1=0.5

The data and regression line are then plotted with `plot()`

and `lines()`

function:

Code for a multivariable regression looks very similar, here I am adding an extra variable, changing the mean of my outcome variable *y* to mean=2 and assigning cov(*y,x1*)=0.4 and cov(*y,x2*)=0.2. Given that variances = 1 across variable, we expect the regression coefficients to be ~0.4 for *x1* and ~ 0.2 for *x2*.

And the result:

`....`

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.99545 0.01261 158.29 <2e-16 ***

x1 0.41296 0.01255 32.90 <2e-16 ***

x2 0.20444 0.01267 16.13 <2e-16 ***

....

# Python code

The first time I googled “linear regression in python”, I got prompted to look into `sklearn.linear_model.LinearRegression`

. Then I realized that it was an overkill to use the `sklearn`

library for this particular exercise (gladly, because there were too many lines of code). However, turns out that the python code is very similar to that from R if you use the library `statsmodels`

. The following code replicates what was done in R in the previous section:

And this is the result:

`coef std err t P>|t| [0.025 0.975]`

------------------------------------------------------------------------------

x1 0.5185 0.012 43.433 0.000 0.495 0.542

Which seemed fine until I realized that there was no estimate for the intercept! — Turns out that unlike R, adding the intercept in the regression is not the default functionality. This has to be added manually by including a constant to the predictors data frame:

After adding the constant to the data frame of predictors with `sm.add_constant(d['x1'])`

now the linear regression returns the intercept estimate as well:

`coef std err t P>|t| [0.025 0.975]`

------------------------------------------------------------------------------

const 0.0057 0.012 0.471 0.638 -0.018 0.029

x1 0.5187 0.012 43.426 0.000 0.495 0.542

Finally, the example of a multivariable regression in R “translated” to python looks like this: