Working with data frames in R and python: basics

6 min readMay 28, 2021

I am thrilled with how smooth the transition from R to python has been. Certainly the syntax similarities to do basic work in R and python has helped.

This is the first post out of a few where I will describe how to work with data frames in R and python. Although there are many ways to achieve the same results for the following examples, my goal was to keep the syntax and product as similar as possible between R and python. At least for me, doing this has facilitated my learning.

Generating dataframes

First off, a data frame is a 2-dimensional data structure where the columns may be of different types.

R

df <- data.frame(days=c("Mon","Tue","Wed","Thu","Fri"), 
    letters=letters[1:5], 
    frequency=c(0.1,0.3,0.2,0,0.5))
df
#   days letters frequency
# 1  Mon       a       0.1
# 2  Tue       b       0.3
# 3  Wed       c       0.2
# 4  Thu       d       0.0
# 5  Fri       e       0.5

Python

In python, we need to use the pandas library to achieve a very similar experience.

import pandas as pd
import string #Just to get letters as in the R example
df = pd.DataFrame({'days': ['Mon','Tue','Wed','Thu','Fri'], 
    'letters': list(string.ascii_lowercase)[1:6], 
    'frequency': [0.1,0.3,0.2,0,0.5] })
df
#   days letters  frequency
# 0  Mon       b        0.1
# 1  Tue       c        0.3
# 2  Wed       d        0.2
# 3  Thu       e        0.0
# 4  Fri       f        0.5

Writing and reading data frames

R

Lets use the dataset `iris` to play around in these examples.

data(iris)#Writing a csv file
write.csv(iris, file="iris.csv", quote=F, row.names=FALSE)#Using a more flexible function (i.e. specify delimiter)
write.table(iris, file="iris.tsv", quote=F, col.names=TRUE, sep="\t", row.names=FALSE)#Similarly, to read this dataframe:
#Reading a csv file
df <- read.csv(file="iris.csv")
#Using a flexible function (i.e. specify things such as delimiter)
df <- read.table(file="iris.tsv", header=TRUE, sep="\t")

The above is not an exhaustive list by any means. There are others such as read.delim(), read.delim2() that are wrappers for read.table().

If you have a very large data frame or matrix, I recommend to use fread/fwrite from the library data.table. As it is considerably faster:

library(data.table)
fwrite(iris, file="iris.tsv", quote=F, col.names=TRUE, sep="\t", row.names=FALSE)
df <- fread(file="iris.tsv", header=T, data.table=FALSE) #data.table=FALSE to return a standard data.frame

Python

Lets first see how to read the iris dataset from the files we just created above with R

import pandas as pd
df = pd.read_table("iris.tsv", sep="\t") #The one we wrote from R
#Or read the .csv file
df = pd.read_csv("iris.csv")#To write the data
df.to_csv("iris_python.csv", index=False) #index=False to remove index (row names)
#Or separator can be specified
df.to_csv("iris_python.tsv", sep="\t", index=False)

Likewise, python also offers the data.table library which may increase the speed while reading data

import datatable as dt
df = dt.fread('iris.tsv', sep="\t")
df.to_csv("iris.csv")

Inspecting the data frame

R

Getting a summary of the data, the structure and types

Some of the operations I use more often to get an idea of the data in a data frame include:

summary(df)
#   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
#  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
#  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
#  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
#  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
#  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
#  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
#    Species         
#  Length:150        
#  Class :character  
#  Mode  :characterstr(df)
# 'data.frame': 150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...

Getting the data frame dimensions

dim(df) 
#[1] 150   5
nrow(df)
#[1] 150
ncol(df)
#[1] 5

Python

Getting a summary of the data, the structure and types

df.describe()
#        Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
# count    150.000000   150.000000    150.000000   150.000000
# mean       5.843333     3.057333      3.758000     1.199333
# std        0.828066     0.435866      1.765298     0.762238
# min        4.300000     2.000000      1.000000     0.100000
# 25%        5.100000     2.800000      1.600000     0.300000
# 50%        5.800000     3.000000      4.350000     1.300000
# 75%        6.400000     3.300000      5.100000     1.800000
# max        7.900000     4.400000      6.900000     2.500000df.dtypes
#Sepal.Length    float64
#Sepal.Width     float64
#Petal.Length    float64
#Petal.Width     float64
#Species          object
#dtype: object

Getting the data frame dimensions

df.shape
#(150, 5)
df.shape[0]
#150
df.shape[1]
#5

Subsetting the data frame

R

There are too many ways to access/get specific sections of a data frame. Here, I’ll just mention a few that I use most often.

Getting data from indexes or column/row names

#Getting the first row
df[1,]
#  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1          5.1         3.5          1.4         0.2  setosa#Getting rows 1:2 and columns 1:4
df[1:2,1:4]
#  Sepal.Length Sepal.Width Petal.Length Petal.Width
#1          5.1         3.5          1.4         0.2
#2          4.9         3.0          1.4         0.2#Getting the first column
df[,1]#Getting columns through names (can be done for rows too if they have names)df[,c("Species","Petal.Length")]
#       Species Petal.Length
#1       setosa          1.4
#2       setosa          1.4
#....#Getting data from a single column
df$Species
#[1] "setosa"     "setosa"     "setosa"....

Subsetting data

You can subset data using indexes and names for the columns. There are also functions such as head() and tail() to get the first or last rows of a data frame. E.g:

head(df,2)
#  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1          5.1         3.5          1.4         0.2  setosa
#2          4.9         3.0          1.4         0.2  setosa
tail(df,2)
#    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#149          6.2         3.4          5.4         2.3 virginica
#150          5.9         3.0          5.1         1.8 virginica

Much more useful is subsetting data based on specific conditions using subset(). For example, lets get all the rows where Species is “setosa” and Sepal.Width > 3.0.

subset(df, Sepal.Width > 3.0 & Species == 'setosa')
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1           5.1         3.5          1.4         0.2  setosa
#3           4.7         3.2          1.3         0.2  setosa
#4           4.6         3.1          1.5         0.2  setosa
#....

You can also uses “masks” to return rows where a condition is met (i.e. it returns TRUE). However, I normally dislike this method to subset data, as I find it slightly harder to read. For example, doing the same as above with this method:

df[df$Sepal.Width > 3.0 & df$Species == 'setosa',]
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1           5.1         3.5          1.4         0.2  setosa
#3           4.7         3.2          1.3         0.2  setosa
#4           4.6         3.1          1.5         0.2  setosa
#.....

Finding row indexes

Another way to achieve the above, is by finding the rows where the condition is met using which() :

#Get row indexes where condition is met
row_idx <- which(df$Sepal.Width > 3.0 & df$Species == 'setosa')
row_idx
# [1]  1  3  4  5  6  7  8 10 11 12 15 16 17 18 19 20 21 22 23 24 25 #27 28 29 30
#[26] 31 32 33 34 35 36 37 38 40 41 43 44 45 47 48 49 50#Using it to access the desired data
df[row_idx,]
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1           5.1         3.5          1.4         0.2  setosa
#3           4.7         3.2          1.3         0.2  setosa
#4           4.6         3.1          1.5         0.2  setosa

Python

Getting data from indexes or column/row names

#Getting the first row
#Note that python indexes start in 0 while in R they start in 1df.iloc[0,:]
#Sepal.Length       5.1
#Sepal.Width        3.5
#Petal.Length       1.4
#Petal.Width        0.2
#Species         setosa#Getting rows 1:2 and columns 1:4 which in python is 0:1 and 1:3 #respectively
#In contrast to R, 0:2 means row [0,2) i.e. does not include row with index 2 and 0:5 is [0,5) i.e. does not include column 5df.iloc[0:2,0:5]
#   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
#0           5.1          3.5           1.4          0.2  setosa
#1           4.9          3.0           1.4          0.2  setosa#Getting the first column
df.iloc[:,0]
#....#Getting columns through names (can be done for rows too if they have names)df.loc[:,["Species","Petal.Length"]]
#       Species  Petal.Length
#0       setosa           1.4
#1       setosa           1.4#Or simply:
df[["Species","Petal.Length"]]#Getting data from a single column
df.Species
#0         setosa
#1         setosa
#2         setosa

Subsetting data

Getting the first or last rows of a data frame:

df.head(2)
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
df.tail(2)
#  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
#148        6.2          3.4           5.4          2.3  virginica
#149        5.9          3.0           5.1          1.8  virginica

Subsetting data based on specific conditions can be achieved with:

df[(df["Sepal.Width"] > 3.0) & (df["Species"] == 'setosa')]
#    Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
#0            5.1          3.5           1.4          0.2  setosa
#2            4.7          3.2           1.3          0.2  setosa
#3            4.6          3.1           1.5          0.2  setosa

Finding row indexes

In order to find the indexes of rows that meet the conditions from the example above:

df[(df["Sepal.Width"] > 3.0) & (df["Species"] == 'setosa')].index
#Int64Index([ 0,  2,  3,  4,  5,  6,  7,  9, 10, 11, 14, 15, 16, 17, #18, 19, 20,
#            21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, #36, 37, 39,
#            40, 42, 43, 44, 46, 47, 48, 49]

And that’s it for this post. In the next one I plan to explore the options to loop through data frames in python. For some of the ways on how to do thin in R, here is another post.

Working with data frames in R and python: basics

Generating dataframes

R

Python

Writing and reading data frames

R

Python

Inspecting the data frame

R

Python

Subsetting the data frame

R

Python

Written by Gabriel