Sign in

Linear regression in R and python

Fortunately (or unfortunately), I haven’t had the need to learn python in detail to carry out my main job functions. However, python has gained quite a lot of traction for data science and bioinformatics, thus I’ve decided to get up to speed during the few breaks I get.

Today I wanted to see how simple is to translate some very basic R code to python. Specifically, I wanted to see how to do the following in python:

  • Work with data frames
  • Simulate data
  • Perform linear regressions
  • Plot data

For this particular exercise I will create a data frame with simulated data drawn from a multivariate normal distribution for which I will know the real regression coefficients, and check these through a linear regression.

R code

In R, achieving this is very simple with the help of the MASS library. The code for a simple simulation and linear regression looked like this:

Briefly, in the snippet above, I first simulate some data using the function mvrnorm() that will draw random numbers from a multivariate distribution for two variables with mean=0, var(x1)=var(y)=1 and cov(y,x1)=0.5. The linear regression is then fit with the lm() function, and the summary() allows to inspect the results:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.00428 0.01229 -0.348 0.728
x1 0.50926 0.01218 41.801 <2e-16 ***

Here we see that the regression coefficient for x1 = 0.509 which is in line with what we simulated b = cov(y,x1)/var(x1)=0.5/1=0.5

The data and regression line are then plotted with plot() and lines() function:

Code for a multivariable regression looks very similar, here I am adding an extra variable, changing the mean of my outcome variable y to mean=2 and assigning cov(y,x1)=0.4 and cov(y,x2)=0.2. Given that variances = 1 across variable, we expect the regression coefficients to be ~0.4 for x1 and ~ 0.2 for x2.

And the result:

....
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.99545 0.01261 158.29 <2e-16 ***
x1 0.41296 0.01255 32.90 <2e-16 ***
x2 0.20444 0.01267 16.13 <2e-16 ***
....

Python code

The first time I googled “linear regression in python”, I got prompted to look into sklearn.linear_model.LinearRegression . Then I realized that it was an overkill to use the sklearn library for this particular exercise (gladly, because there were too many lines of code). However, turns out that the python code is very similar to that from R if you use the library statsmodels . The following code replicates what was done in R in the previous section:

And this is the result:

coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1 0.5185 0.012 43.433 0.000 0.495 0.542

Which seemed fine until I realized that there was no estimate for the intercept! — Turns out that unlike R, adding the intercept in the regression is not the default functionality. This has to be added manually by including a constant to the predictors data frame:

After adding the constant to the data frame of predictors with sm.add_constant(d['x1']) now the linear regression returns the intercept estimate as well:

coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const 0.0057 0.012 0.471 0.638 -0.018 0.029
x1 0.5187 0.012 43.426 0.000 0.495 0.542

Finally, the example of a multivariable regression in R “translated” to python looks like this:

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store