How to Interface R Using Python for Bioinformatics

January 17th 2021

@varunsendilrajVarun Sendilraj

I am an 11th grader interested in machine learning, bioinformatics, website development, and robotics.

Introduction

In Bioinformatics, all tasks can be done using one of 2 programming languages:

0 reactions

Python

Python is a high-level programming language that is known for its “Easy to read and learn” format. Python can do tasks from automation, to data science and machine learning.

0 reactions

On the other hand, R is a programming language that was made for statistics and data processing.

0 reactions

Through-out the past few years, R was adopted by the bioinformatics community as the number one programming language for the release of new packages, partially because of Bioconductor (a collection of mature libraries for next-generation sequencing analysis) and the ggplot2 library for advanced plotting.

0 reactions

But from personal experience with Bioinformatics, most of my data wrangling and manipulation is done in python and endpoint analysis, and plotting is done in R. But R can do almost anything that Python can in terms of statistics and even more. My only problem with R is the sometimes — unintuitive syntax (<-, %>%, variable$attribute), but there is a way around this using the rpy2 library in python.

In this post, I want to try to show an example of using this library for bioinformatics.

0 reactions

Setting up the environment

Download the Anaconda Package
Download Python 3 and R
Install rpy2 from here
Open up Jupyter Notebook

Downloading the Data

For this demonstration, we are going to use data from the 1000 genome project. To download the data for the project use the wget command below:

0 reactions

Linux:

0 reactions

# Get the data
!wget https://raw.githubusercontent.com/VarunSendilraj/Bioinformatics/main/rpy2_tutorial/1000-genomes_other_sample_info_sample_info.csv

If you are on windows or mac, the wget command won't work so use the urlib library instead :

0 reactions

Windows/Mac:

0 reactions

import urllib.request
url = 'https://raw.githubusercontent.com/VarunSendilraj/Bioinformatics/main/rpy2_tutorial/1000-genomes_other_sample_info_sample_info.csv'
filename = 'genomes_other_sample_info_sample_info.csv'
urllib.request.urlretrieve(url, filename)

Converting Python DataFrame to R DataFrame

Let’s start by importing the necessary libraries to open and convert the python DataFrame to an R DataFrame:

0 reactions

#Import Libraries
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

Now let's open the file that we have downloaded using pandas read_csv function:

0 reactions

df = pd.read_csv('1000-genomes_other_sample_info_sample_info.csv')
df.head()

Using the .head() method we can view the data of the first few rows. Lets Double check that this is a pandas DataFrame before we continue on:

0 reactions

print(type(df))
#expected output: <class 'pandas.core.frame.DataFrame'>

The conversion of the pandas DataFrame to R DataFrame has trouble encoding float and integer values so let’s start by converting the entire DataFrame into a string.

0 reactions

#converts values in df into string 
df = df.applymap(str)
# Double check that values have changed
print(type(df['In_Final_Phase_Variant_Calling'][0]))
#Expected output: <class 'str'>

Next, let’s convert the Pandas DataFrame to an R DataFrame using localcoverter from the rpy2 library:

0 reactions

#conversion
with localconverter(ro.default_converter + pandas2ri.converter):
  r_df = ro.conversion.py2rpy(df)
#Check if conversion Worked
print(type(r_df))
#Expected output: rpy2.robjects.vectors.DataFrame

Converting Datatypes using rpy2

Before we move forward, let’s get a better understanding of our data:

0 reactions

print(f'This dataframe has {r_df.ncol} columns and {r_df.nrow} rows\n')
print(r_df.colnames)

Here we are just printing out the different columns in our DataFrame. The output should look like this:

0 reactions

This dataframe has 62 columns and 3500 rows

 [1] "Sample"                              "Family_ID"                          
 [3] "Population"                          "Population_Description"             
 [5] "Gender"                              "Relationship"                       
 [7] "Unexpected_Parent_Child"             "Non_Paternity"                      
 [9] "Siblings"                            "Grandparents"                       
[11] "Avuncular"                           "Half_Siblings"                      
[13] "Unknown_Second_Order"                "Third_Order"                        
[15] "In_Low_Coverage_Pilot"               "LC_Pilot_Platforms"                 
[17] "LC_Pilot_Centers"                    "In_High_Coverage_Pilot"             
[19] "HC_Pilot_Platforms"                  "HC_Pilot_Centers"                   
[21] "In_Exon_Targetted_Pilot"             "ET_Pilot_Platforms"                 
[23] "ET_Pilot_Centers"                    "Has_Sequence_in_Phase1"             
[25] "Phase1_LC_Platform"                  "Phase1_LC_Centers"                  
[27] "Phase1_E_Platform"                   "Phase1_E_Centers"                   
[29] "In_Phase1_Integrated_Variant_Set"    "Has_Phase1_chrY_SNPS"               
[31] "Has_phase1_chrY_Deletions"           "Has_phase1_chrMT_SNPs"              
[33] "Main_project_LC_Centers"             "Main_project_LC_platform"           
[35] "Total_LC_Sequence"                   "LC_Non_Duplicated_Aligned_Coverage" 
[37] "Main_Project_E_Centers"              "Main_Project_E_Platform"            
[39] "Total_Exome_Sequence"                "X_Targets_Covered_to_20x_or_greater"
[41] "VerifyBam_E_Omni_Free"               "VerifyBam_E_Affy_Free"              
[43] "VerifyBam_E_Omni_Chip"               "VerifyBam_E_Affy_Chip"              
[45] "VerifyBam_LC_Omni_Free"              "VerifyBam_LC_Affy_Free"             
[47] "VerifyBam_LC_Omni_Chip"              "VerifyBam_LC_Affy_Chip"             
[49] "LC_Indel_Ratio"                      "E_Indel_Ratio"                      
[51] "LC_Passed_QC"                        "E_Passed_QC"                        
[53] "In_Final_Phase_Variant_Calling"      "Has_Omni_Genotypes"                 
[55] "Has_Axiom_Genotypes"                 "Has_Affy_6_0_Genotypes"             
[57] "Has_Exome_LOF_Genotypes"             "EBV_Coverage"                       
[59] "DNA_Source_from_Coriell"             "Has_Sequence_from_Blood_in_Index"   
[61] "Super_Population"                    "Super_Population_Description"

Now, we need to perform some data cleanup. For example, some columns should be interpreted as Integers or Floats, but they are read as strings:

0 reactions

Let’s start by defining the as_numeric and match functions:

0 reactions

#define the as_numeric function
as_numeric = ro.r('as.numeric')
#define the Match function
match = ro.r.match

Functions:

0 reactions

as_numeric: converts string to integer or float value
match: works like python index function

Since there are a lot of columns that need to be converted into numbers, let's store those column names in a list and iterate through it:

0 reactions

#columns
numCol = ['Has_Sequence_in_Phase1','In_Phase1_Integrated_Variant_Set','Has_Phase1_chrY_SNPS','Has_phase1_chrY_Deletions','Has_phase1_chrMT_SNPs','LC_Passed_QC','E_Passed_QC','In_Final_Phase_Variant_Calling','Has_Omni_Genotypes', 'Has_Axiom_Genotypes','Has_Affy_6_0_Genotypes','Has_Exome_LOF_Genotypes', 'Has_Sequence_from_Blood_in_Index', 'Total_LC_Sequence', 'LC_Non_Duplicated_Aligned_Coverage','Total_Exome_Sequence', 'X_Targets_Covered_to_20x_or_greater', 'VerifyBam_E_Omni_Free', 'VerifyBam_E_Affy_Free', 'VerifyBam_E_Omni_Chip', 'VerifyBam_E_Affy_Chip', 'VerifyBam_LC_Omni_Free', 'VerifyBam_LC_Affy_Free','VerifyBam_LC_Omni_Chip','VerifyBam_LC_Affy_Chip', 'LC_Indel_Ratio', 'E_Indel_Ratio', 'EBV_Coverage']
#loop
for col in numCol:
    my_col = match(col, r_df.colnames)[0] #returned as a vector 
    print('Type of read count before as.numeric: %s' % r_df[my_col - 1].rclass[0])
    r_df[my_col - 1] = as_numeric(r_df[my_col - 1])
    print('Type of read count after as.numeric: %s' % r_df[my_col - 1].rclass[0])

The results should look like this (make sure all columns are converted to numbers):

0 reactions

Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
...
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric

Using ggplot2 to plot Data

Now let's use ggplot2 from the rpy2 library to plot our data.

0 reactions

Let's start by making a bar graph plotting the count of people per country that participated in the 1000 genome project:

0 reactions

#import ggplot2
import rpy2.robjects.lib.ggplot2 as ggplot2
from rpy2.robjects.functions import SignatureTranslatedFunction
#set theme
ggplot2.theme = SignatureTranslatedFunction(ggplot2.theme, init_prm_translate = {'axis_text_x': 'axis.text.x'})
#plot
bar = ggplot2.ggplot(r_df) + ggplot2.geom_bar() + ggplot2.aes_string(x='Population') + ggplot2.theme(axis_text_x=ggplot2.element_text(angle=90, hjust=1))
#save to img
ro.r.png('out.png', type='cairo-png')
bar.plot()
dev_off = ro.r('dev.off')
dev_off()

This may look a little bit intimidating but its mainly just boilerplate Code:

0 reactions

Code for Plotting:

bar = ggplot2.ggplot(seq_data) + ggplot2.geom_bar() + ggplot2.aes_string(x=’CENTER_NAME’) + ggplot2.theme(axis_text_x=ggplot2.element_text(angle=90, hjust=1))

Format:

variableName = ggplot2.ggplot(**DATAFRAME**) + ggplot.**GRAPH_TYPE** + ggplot2.aes_string(x=’**X-AXIS**’) + ggplot2.theme(**ADJUST THE THEME HOWEVER NEEDED**)

In the end, the graph will be stored in a pdf file, but you can also view the file in your Jupyter Notebook:

0 reactions

from IPython.display import Image
Image(filename='out.png')

Graph:

0 reactions

Similarly, let's create a scatterplot to compare the Total_LC_Sequence and the LC_Non_Duplicated_Aligned_Coverage with relation to the population and gender:

0 reactions

#plot
pp = ggplot2.ggplot(r_df) + ggplot2.aes_string(x='Total_LC_Sequence', y='LC_Non_Duplicated_Aligned_Coverage', col='Population', shape='Gender') + ggplot2.geom_point()
#save img
ro.r.png('scatter.png', type='cairo-png')
pp.plot()
dev_off = ro.r('dev.off')
dev_off()
#veiw img
Image(filename='scatter.png')

Graph:

0 reactions

We can also plot a Box Plot using the same data above:

0 reactions

#plot
bp = ggplot2.ggplot(r_df) + ggplot2.aes_string(x='Total_LC_Sequence', y='LC_Non_Duplicated_Aligned_Coverage', fill='Population') + ggplot2.geom_boxplot()
#save img
ro.r.png('box.png', type='cairo-png')
bp.plot()
dev_off = ro.r('dev.off')
dev_off()
#veiw img
Image(filename='box.png')

Graph:

0 reactions

Completed Jupyter Notebook:

0 reactions

Bioinformatics/rpy2.ipynb at main · VarunSendilraj/Bioinformatics (github.com)

0 reactions

Also published at https://varunsendilraj.medium.com/interfacing-r-using-python-for-bioinformatics-9387c17344bd

0 reactions

Share this story

Read my stories

I am an 11th grader interested in machine learning, bioinformatics, website development, and robotics.

Join Hacker Noon

Create your free account to unlock your custom reading experience.

How to Interface R Using Python for Bioinformatics

How to Interface R Using Python for Bioinformatics

@varunsendilrajVarun Sendilraj

Introduction

Setting up the environment

Downloading the Data

Converting Python DataFrame to R DataFrame

Converting Datatypes using rpy2

Using ggplot2 to plot Data

Recommend

GitHub - bioinformatics-ua/gto: The genomics toolkit

Choosing a Language for Bioinformatics

A comparison of three programming languages for a full-fledged next-generation s...

What’s a Bioinformatics Systems Engineer?

sequence alignment - Why does the SARS-Cov2 coronavirus genome end in aaaaaaaaaa...

Bioinformatics and Information Science Center

Bioinformatics is the Combination of Data Science and Biology

AWS Rust Lambdas and Bioinformatics

DevOps in BioInformatics with Jillian Rowe - DevOps 074

Functional Bioinformatics Algorithms: Part 2 | Jamie Dixon's Home

About Joyk