6

How to Interface R Using Python for Bioinformatics

 4 years ago
source link: https://hackernoon.com/how-to-interface-r-using-python-for-bioinformatics-ho133178
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

How to Interface R Using Python for Bioinformatics

@varunsendilrajVarun Sendilraj

I am an 11th grader interested in machine learning, bioinformatics, website development, and robotics.

Introduction

In Bioinformatics, all tasks can be done using one of 2 programming languages:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
  1. Python

Python is a high-level programming language that is known for its “Easy to read and learn” format. Python can do tasks from automation, to data science and machine learning.

0 reactions
heart.png
light.png
thumbs-down.png
money.png

On the other hand, R is a programming language that was made for statistics and data processing.

0 reactions
heart.png
light.png
thumbs-down.png
money.png

Through-out the past few years, R was adopted by the bioinformatics community as the number one programming language for the release of new packages, partially because of Bioconductor (a collection of mature libraries for next-generation sequencing analysis) and the ggplot2 library for advanced plotting.

0 reactions
heart.png
light.png
thumbs-down.png
money.png

But from personal experience with Bioinformatics, most of my data wrangling and manipulation is done in python and endpoint analysis, and plotting is done in R. But R can do almost anything that Python can in terms of statistics and even more. My only problem with R is the sometimes — unintuitive syntax (<-, %>%, variable$attribute), but there is a way around this using the rpy2 library in python.

In this post, I want to try to show an example of using this library for bioinformatics.

0 reactions
heart.png
light.png
thumbs-down.png
money.png

Setting up the environment

  1. Download the Anaconda Package
  2. Download Python 3 and R
  3. Install rpy2 from here
  4. Open up Jupyter Notebook

Downloading the Data

For this demonstration, we are going to use data from the 1000 genome project. To download the data for the project use the wget command below:

0 reactions
heart.png
light.png
thumbs-down.png
money.png

Linux:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
# Get the data
!wget https://raw.githubusercontent.com/VarunSendilraj/Bioinformatics/main/rpy2_tutorial/1000-genomes_other_sample_info_sample_info.csv

If you are on windows or mac, the wget command won't work so use the urlib library instead :

0 reactions
heart.png
light.png
thumbs-down.png
money.png

Windows/Mac:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
import urllib.request
url = 'https://raw.githubusercontent.com/VarunSendilraj/Bioinformatics/main/rpy2_tutorial/1000-genomes_other_sample_info_sample_info.csv'
filename = 'genomes_other_sample_info_sample_info.csv'
urllib.request.urlretrieve(url, filename)

Converting Python DataFrame to R DataFrame

Let’s start by importing the necessary libraries to open and convert the python DataFrame to an R DataFrame:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
#Import Libraries
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

Now let's open the file that we have downloaded using pandas read_csv function:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
df = pd.read_csv('1000-genomes_other_sample_info_sample_info.csv')
df.head()

Using the .head() method we can view the data of the first few rows. Lets Double check that this is a pandas DataFrame before we continue on:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
print(type(df))
#expected output: <class 'pandas.core.frame.DataFrame'>

The conversion of the pandas DataFrame to R DataFrame has trouble encoding float and integer values so let’s start by converting the entire DataFrame into a string.

0 reactions
heart.png
light.png
thumbs-down.png
money.png
#converts values in df into string 
df = df.applymap(str)
# Double check that values have changed
print(type(df['In_Final_Phase_Variant_Calling'][0]))
#Expected output: <class 'str'>

Next, let’s convert the Pandas DataFrame to an R DataFrame using localcoverter from the rpy2 library:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
#conversion
with localconverter(ro.default_converter + pandas2ri.converter):
  r_df = ro.conversion.py2rpy(df)
#Check if conversion Worked
print(type(r_df))
#Expected output: rpy2.robjects.vectors.DataFrame

Converting Datatypes using rpy2

Before we move forward, let’s get a better understanding of our data:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
print(f'This dataframe has {r_df.ncol} columns and {r_df.nrow} rows\n')
print(r_df.colnames)

Here we are just printing out the different columns in our DataFrame. The output should look like this:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
This dataframe has 62 columns and 3500 rows

 [1] "Sample"                              "Family_ID"                          
 [3] "Population"                          "Population_Description"             
 [5] "Gender"                              "Relationship"                       
 [7] "Unexpected_Parent_Child"             "Non_Paternity"                      
 [9] "Siblings"                            "Grandparents"                       
[11] "Avuncular"                           "Half_Siblings"                      
[13] "Unknown_Second_Order"                "Third_Order"                        
[15] "In_Low_Coverage_Pilot"               "LC_Pilot_Platforms"                 
[17] "LC_Pilot_Centers"                    "In_High_Coverage_Pilot"             
[19] "HC_Pilot_Platforms"                  "HC_Pilot_Centers"                   
[21] "In_Exon_Targetted_Pilot"             "ET_Pilot_Platforms"                 
[23] "ET_Pilot_Centers"                    "Has_Sequence_in_Phase1"             
[25] "Phase1_LC_Platform"                  "Phase1_LC_Centers"                  
[27] "Phase1_E_Platform"                   "Phase1_E_Centers"                   
[29] "In_Phase1_Integrated_Variant_Set"    "Has_Phase1_chrY_SNPS"               
[31] "Has_phase1_chrY_Deletions"           "Has_phase1_chrMT_SNPs"              
[33] "Main_project_LC_Centers"             "Main_project_LC_platform"           
[35] "Total_LC_Sequence"                   "LC_Non_Duplicated_Aligned_Coverage" 
[37] "Main_Project_E_Centers"              "Main_Project_E_Platform"            
[39] "Total_Exome_Sequence"                "X_Targets_Covered_to_20x_or_greater"
[41] "VerifyBam_E_Omni_Free"               "VerifyBam_E_Affy_Free"              
[43] "VerifyBam_E_Omni_Chip"               "VerifyBam_E_Affy_Chip"              
[45] "VerifyBam_LC_Omni_Free"              "VerifyBam_LC_Affy_Free"             
[47] "VerifyBam_LC_Omni_Chip"              "VerifyBam_LC_Affy_Chip"             
[49] "LC_Indel_Ratio"                      "E_Indel_Ratio"                      
[51] "LC_Passed_QC"                        "E_Passed_QC"                        
[53] "In_Final_Phase_Variant_Calling"      "Has_Omni_Genotypes"                 
[55] "Has_Axiom_Genotypes"                 "Has_Affy_6_0_Genotypes"             
[57] "Has_Exome_LOF_Genotypes"             "EBV_Coverage"                       
[59] "DNA_Source_from_Coriell"             "Has_Sequence_from_Blood_in_Index"   
[61] "Super_Population"                    "Super_Population_Description"

Now, we need to perform some data cleanup. For example, some columns should be interpreted as Integers or Floats, but they are read as strings:

0 reactions
heart.png
light.png
thumbs-down.png
money.png

Let’s start by defining the as_numeric and match functions:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
#define the as_numeric function
as_numeric = ro.r('as.numeric')
#define the Match function
match = ro.r.match

Functions:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
  • as_numeric: converts string to integer or float value
  • match: works like python index function

Since there are a lot of columns that need to be converted into numbers, let's store those column names in a list and iterate through it:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
#columns
numCol = ['Has_Sequence_in_Phase1','In_Phase1_Integrated_Variant_Set','Has_Phase1_chrY_SNPS','Has_phase1_chrY_Deletions','Has_phase1_chrMT_SNPs','LC_Passed_QC','E_Passed_QC','In_Final_Phase_Variant_Calling','Has_Omni_Genotypes', 'Has_Axiom_Genotypes','Has_Affy_6_0_Genotypes','Has_Exome_LOF_Genotypes', 'Has_Sequence_from_Blood_in_Index', 'Total_LC_Sequence', 'LC_Non_Duplicated_Aligned_Coverage','Total_Exome_Sequence', 'X_Targets_Covered_to_20x_or_greater', 'VerifyBam_E_Omni_Free', 'VerifyBam_E_Affy_Free', 'VerifyBam_E_Omni_Chip', 'VerifyBam_E_Affy_Chip', 'VerifyBam_LC_Omni_Free', 'VerifyBam_LC_Affy_Free','VerifyBam_LC_Omni_Chip','VerifyBam_LC_Affy_Chip', 'LC_Indel_Ratio', 'E_Indel_Ratio', 'EBV_Coverage']
#loop
for col in numCol:
    my_col = match(col, r_df.colnames)[0] #returned as a vector 
    print('Type of read count before as.numeric: %s' % r_df[my_col - 1].rclass[0])
    r_df[my_col - 1] = as_numeric(r_df[my_col - 1])
    print('Type of read count after as.numeric: %s' % r_df[my_col - 1].rclass[0])

The results should look like this (make sure all columns are converted to numbers):

0 reactions
heart.png
light.png
thumbs-down.png
money.png
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
...
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric

Using ggplot2 to plot Data

Now let's use ggplot2 from the rpy2 library to plot our data.

0 reactions
heart.png
light.png
thumbs-down.png
money.png

Let's start by making a bar graph plotting the count of people per country that participated in the 1000 genome project:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
#import ggplot2
import rpy2.robjects.lib.ggplot2 as ggplot2
from rpy2.robjects.functions import SignatureTranslatedFunction
#set theme
ggplot2.theme = SignatureTranslatedFunction(ggplot2.theme, init_prm_translate = {'axis_text_x': 'axis.text.x'})
#plot
bar = ggplot2.ggplot(r_df) + ggplot2.geom_bar() + ggplot2.aes_string(x='Population') + ggplot2.theme(axis_text_x=ggplot2.element_text(angle=90, hjust=1))
#save to img
ro.r.png('out.png', type='cairo-png')
bar.plot()
dev_off = ro.r('dev.off')
dev_off()

This may look a little bit intimidating but its mainly just boilerplate Code:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
  • Code for Plotting:
bar = ggplot2.ggplot(seq_data) + ggplot2.geom_bar() + ggplot2.aes_string(x=’CENTER_NAME’) + ggplot2.theme(axis_text_x=ggplot2.element_text(angle=90, hjust=1))
  • Format:
variableName = ggplot2.ggplot(**DATAFRAME**) + ggplot.**GRAPH_TYPE** + ggplot2.aes_string(x=’**X-AXIS**’) + ggplot2.theme(**ADJUST THE THEME HOWEVER NEEDED**)

In the end, the graph will be stored in a pdf file, but you can also view the file in your Jupyter Notebook:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
from IPython.display import Image
Image(filename='out.png')

Graph:

0 reactions
heart.png
light.png
thumbs-down.png
money.png

Similarly, let's create a scatterplot to compare the Total_LC_Sequence and the LC_Non_Duplicated_Aligned_Coverage with relation to the population and gender:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
#plot
pp = ggplot2.ggplot(r_df) + ggplot2.aes_string(x='Total_LC_Sequence', y='LC_Non_Duplicated_Aligned_Coverage', col='Population', shape='Gender') + ggplot2.geom_point()
#save img
ro.r.png('scatter.png', type='cairo-png')
pp.plot()
dev_off = ro.r('dev.off')
dev_off()
#veiw img
Image(filename='scatter.png')

Graph:

0 reactions
heart.png
light.png
thumbs-down.png
money.png

We can also plot a Box Plot using the same data above:

0 reactions
heart.png
light.png
thumbs-down.png
money.png
#plot
bp = ggplot2.ggplot(r_df) + ggplot2.aes_string(x='Total_LC_Sequence', y='LC_Non_Duplicated_Aligned_Coverage', fill='Population') + ggplot2.geom_boxplot()
#save img
ro.r.png('box.png', type='cairo-png')
bp.plot()
dev_off = ro.r('dev.off')
dev_off()
#veiw img
Image(filename='box.png')

Graph:

0 reactions
heart.png
light.png
thumbs-down.png
money.png

Completed Jupyter Notebook:

0 reactions
heart.png
light.png
thumbs-down.png
money.png

Bioinformatics/rpy2.ipynb at main · VarunSendilraj/Bioinformatics (github.com)

0 reactions
heart.png
light.png
thumbs-down.png
money.png

Also published at https://varunsendilraj.medium.com/interfacing-r-using-python-for-bioinformatics-9387c17344bd

0 reactions
heart.png
light.png
thumbs-down.png
money.png
5
heart.pngheart.pngheart.pngheart.png
light.pnglight.pnglight.pnglight.png
boat.pngboat.pngboat.pngboat.png
money.pngmoney.pngmoney.pngmoney.png
Share this story
Read my stories

I am an 11th grader interested in machine learning, bioinformatics, website development, and robotics.

Join Hacker Noon

Create your free account to unlock your custom reading experience.


Recommend

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK