Working with data frames in R and python: basics

Gabriel
6 min readMay 28, 2021

I am thrilled with how smooth the transition from R to python has been. Certainly the syntax similarities to do basic work in R and python has helped.

This is the first post out of a few where I will describe how to work with data frames in R and python. Although there are many ways to achieve the same results for the following examples, my goal was to keep the syntax and product as similar as possible between R and python. At least for me, doing this has facilitated my learning.

Generating dataframes

First off, a data frame is a 2-dimensional data structure where the columns may be of different types.

R

df <- data.frame(days=c("Mon","Tue","Wed","Thu","Fri"), 
letters=letters[1:5],
frequency=c(0.1,0.3,0.2,0,0.5))
df
# days letters frequency
# 1 Mon a 0.1
# 2 Tue b 0.3
# 3 Wed c 0.2
# 4 Thu d 0.0
# 5 Fri e 0.5

Python

In python, we need to use the pandas library to achieve a very similar experience.

import pandas as pd
import string #Just to get letters as in the R example
df = pd.DataFrame({'days': ['Mon','Tue','Wed','Thu','Fri'],
'letters': list(string.ascii_lowercase)[1:6],
'frequency': [0.1,0.3,0.2,0,0.5] })
df
# days letters frequency
# 0 Mon b 0.1
# 1 Tue c 0.3
# 2 Wed d 0.2
# 3 Thu e 0.0
# 4 Fri f 0.5

Writing and reading data frames

R

Lets use the dataset `iris` to play around in these examples.

data(iris)#Writing a csv file
write.csv(iris, file="iris.csv", quote=F, row.names=FALSE)
#Using a more flexible function (i.e. specify delimiter)
write.table(iris, file="iris.tsv", quote=F, col.names=TRUE, sep="\t", row.names=FALSE)
#Similarly, to read this dataframe:
#Reading a csv file
df <- read.csv(file="iris.csv")
#Using a flexible function (i.e. specify things such as delimiter)
df <- read.table(file="iris.tsv", header=TRUE, sep="\t")

The above is not an exhaustive list by any means. There are others such as read.delim(), read.delim2() that are wrappers for read.table().

If you have a very large data frame or matrix, I recommend to use fread/fwrite from the library data.table. As it is considerably faster:

library(data.table)
fwrite(iris, file="iris.tsv", quote=F, col.names=TRUE, sep="\t", row.names=FALSE)
df <- fread(file="iris.tsv", header=T, data.table=FALSE) #data.table=FALSE to return a standard data.frame

Python

Lets first see how to read the iris dataset from the files we just created above with R

import pandas as pd
df = pd.read_table("iris.tsv", sep="\t") #The one we wrote from R
#Or read the .csv file
df = pd.read_csv("iris.csv")
#To write the data
df.to_csv("iris_python.csv", index=False) #index=False to remove index (row names)
#Or separator can be specified
df.to_csv("iris_python.tsv", sep="\t", index=False)

Likewise, python also offers the data.table library which may increase the speed while reading data

import datatable as dt
df = dt.fread('iris.tsv', sep="\t")
df.to_csv("iris.csv")

Inspecting the data frame

R

Getting a summary of the data, the structure and types

Some of the operations I use more often to get an idea of the data in a data frame include:

summary(df)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
# 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
# Median :5.800 Median :3.000 Median :4.350 Median :1.300
# Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
# 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
# Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
# Species
# Length:150
# Class :character
# Mode :character
str(df)
# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : chr "setosa" "setosa" "setosa" "setosa" ...

Getting the data frame dimensions

dim(df) 
#[1] 150 5
nrow(df)
#[1] 150
ncol(df)
#[1] 5

Python

Getting a summary of the data, the structure and types

df.describe()
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# count 150.000000 150.000000 150.000000 150.000000
# mean 5.843333 3.057333 3.758000 1.199333
# std 0.828066 0.435866 1.765298 0.762238
# min 4.300000 2.000000 1.000000 0.100000
# 25% 5.100000 2.800000 1.600000 0.300000
# 50% 5.800000 3.000000 4.350000 1.300000
# 75% 6.400000 3.300000 5.100000 1.800000
# max 7.900000 4.400000 6.900000 2.500000
df.dtypes
#Sepal.Length float64
#Sepal.Width float64
#Petal.Length float64
#Petal.Width float64
#Species object
#dtype: object

Getting the data frame dimensions

df.shape
#(150, 5)
df.shape[0]
#150
df.shape[1]
#5

Subsetting the data frame

R

There are too many ways to access/get specific sections of a data frame. Here, I’ll just mention a few that I use most often.

Getting data from indexes or column/row names

#Getting the first row
df[1,]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#Getting rows 1:2 and columns 1:4
df[1:2,1:4]
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 5.1 3.5 1.4 0.2
#2 4.9 3.0 1.4 0.2
#Getting the first column
df[,1]
#Getting columns through names (can be done for rows too if they have names)df[,c("Species","Petal.Length")]
# Species Petal.Length
#1 setosa 1.4
#2 setosa 1.4
#....
#Getting data from a single column
df$Species
#[1] "setosa" "setosa" "setosa"....

Subsetting data

You can subset data using indexes and names for the columns. There are also functions such as head() and tail() to get the first or last rows of a data frame. E.g:

head(df,2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
tail(df,2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#149 6.2 3.4 5.4 2.3 virginica
#150 5.9 3.0 5.1 1.8 virginica

Much more useful is subsetting data based on specific conditions using subset(). For example, lets get all the rows where Species is “setosa” and Sepal.Width > 3.0.

subset(df, Sepal.Width > 3.0 & Species == 'setosa')
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#....

You can also uses “masks” to return rows where a condition is met (i.e. it returns TRUE). However, I normally dislike this method to subset data, as I find it slightly harder to read. For example, doing the same as above with this method:

df[df$Sepal.Width > 3.0 & df$Species == 'setosa',]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#.....

Finding row indexes

Another way to achieve the above, is by finding the rows where the condition is met using which() :

#Get row indexes where condition is met
row_idx <- which(df$Sepal.Width > 3.0 & df$Species == 'setosa')
row_idx
# [1] 1 3 4 5 6 7 8 10 11 12 15 16 17 18 19 20 21 22 23 24 25 #27 28 29 30
#[26] 31 32 33 34 35 36 37 38 40 41 43 44 45 47 48 49 50
#Using it to access the desired data
df[row_idx,]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa

Python

Getting data from indexes or column/row names

#Getting the first row
#Note that python indexes start in 0 while in R they start in 1
df.iloc[0,:]
#Sepal.Length 5.1
#Sepal.Width 3.5
#Petal.Length 1.4
#Petal.Width 0.2
#Species setosa
#Getting rows 1:2 and columns 1:4 which in python is 0:1 and 1:3 #respectively
#In contrast to R, 0:2 means row [0,2) i.e. does not include row with index 2 and 0:5 is [0,5) i.e. does not include column 5
df.iloc[0:2,0:5]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#Getting the first column
df.iloc[:,0]
#....
#Getting columns through names (can be done for rows too if they have names)df.loc[:,["Species","Petal.Length"]]
# Species Petal.Length
#0 setosa 1.4
#1 setosa 1.4
#Or simply:
df[["Species","Petal.Length"]]
#Getting data from a single column
df.Species
#0 setosa
#1 setosa
#2 setosa

Subsetting data

Getting the first or last rows of a data frame:

df.head(2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
df.tail(2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#148 6.2 3.4 5.4 2.3 virginica
#149 5.9 3.0 5.1 1.8 virginica

Subsetting data based on specific conditions can be achieved with:

df[(df["Sepal.Width"] > 3.0) & (df["Species"] == 'setosa')]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#3 4.6 3.1 1.5 0.2 setosa

Finding row indexes

In order to find the indexes of rows that meet the conditions from the example above:

df[(df["Sepal.Width"] > 3.0) & (df["Species"] == 'setosa')].index
#Int64Index([ 0, 2, 3, 4, 5, 6, 7, 9, 10, 11, 14, 15, 16, 17, #18, 19, 20,
# 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, #36, 37, 39,
# 40, 42, 43, 44, 46, 47, 48, 49]

And that’s it for this post. In the next one I plan to explore the options to loop through data frames in python. For some of the ways on how to do thin in R, here is another post.

--

--