Python Tutorial For Researchers Who Use R

Installation, Loading Data, Visualization, Linear Regression, Rpy2


This tutorial is aimed at researchers who are used to R. Data is at the center of any research project. To me, every researcher now has access to more data than ever before, which puts a researcher squarely in the role of a data scientist. If you are a researcher who has been using R all this time, you may be missing out: Python offers powerful models and tools for handling larger amounts of data. To upgrade your skills as a data scientist, give Python a try. Like me, you might find that you still love R for some tasks. Gradually, as you get used to Python for other tasks, you may find that it is more robust in handling larger amounts of data.

I will go over Python installation, data loading, simple visualization, and a linear regression example. Toward the end, just for novelty's sake, I will show you how to use R in Python.

The Case for Using Python

For researchers who are using R, switching to Python might seem a daunting task. On the contrary, today’s Python analysis packages such as pandas, numpy and sklearn make it very easy to load, explore and analyze data as you would in R. You don’t need to write extensive code for simple analysis.

Using pandas and numpy together: The combination of the two lets you handle most data management tasks. You create dataframes, and with these packages you can handle missing data and manage columns and rows.

Using sklearn: Most of your machine learning needs are covered by this package. You can find models for classification, regression, and clustering, as well as tools for dimensionality reduction, model selection and preprocessing.

Using matplotlib: Most of your graphing needs are taken care of by this package. There are simple graphs such as bar charts and line charts, and more complex graphs such as gradients, contours, and heatmaps.
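To give a feel for how pandas and numpy work together, here is a minimal sketch. The numbers and column names are made up for illustration, and the R lines in the comments are rough equivalents rather than exact translations.

```python
import numpy as np
import pandas as pd

# Hypothetical toy measurements: one row per subject.
values = np.array([[34, 148.0], [51, 85.0], [29, 89.0], [42, 137.0]])
df = pd.DataFrame(values, columns=["age", "glucose"])

# Add a derived column with a vectorized NumPy function
# (R: df$log_glucose <- log(df$glucose)).
df["log_glucose"] = np.log(df["glucose"])

# Rename one column and drop another
# (R: names(df)[1] <- "age_years"; df$glucose <- NULL).
df = df.rename(columns={"age": "age_years"}).drop(columns=["glucose"])

print(df.shape)
print(df.columns.tolist())
```

Most pandas methods return a new dataframe instead of changing the old one, which is why the result of rename and drop is assigned back to df.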

Python Installation on Mac and Windows

Now, let’s get started by installing Python on your desktop. The Anaconda distribution of Python is recommended: it contains not only Python but also the Spyder editor for development.

Download Anaconda for Windows or Mac
Choose Python 3.7 version, Download, Install
Open a terminal and run the command below

conda list

You should get back a list of the packages installed in your conda environment. If you do, congratulations, you just installed Python. If you receive an error, try the following and check again. Errors are usually caused by the shell not being able to find your Anaconda installation in your path. On a Mac, the Anaconda installer should already have appended the path to your .profile file, so sourcing that file once more in your terminal should do the trick.

source ~/.profile
conda list
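Once conda works, you can also check from inside Python that the packages used in this tutorial are visible. This is a small sketch using only the standard library; the list of package names is my own choice based on what the rest of the tutorial imports.

```python
import importlib.util
import sys

# Import names of the packages used later in this tutorial
# (note: scikit-learn is imported as "sklearn").
packages = ["pandas", "numpy", "sklearn", "matplotlib"]

print("Python", sys.version.split()[0])

# find_spec returns None when a package cannot be found on the import path.
status = {name: importlib.util.find_spec(name) is not None for name in packages}
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If a package is reported as missing, installing it with conda install followed by the package name is the usual fix.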

Spyder

Anaconda comes with the free Spyder editor, which is a good default editor for Python. You can start Spyder from the command line.

spyder

Once your editor starts up, you want to create a new file.

Use File -> New File, then File -> Save As to name your new Python file.

Commenting out code in Python

You can comment out a block of code by wrapping it in triple quotes, or you can comment out one line of code using “#”.

"""
diabetes=pd.read_csv('data/diabetes.csv')
"""

#diabetes=pd.read_csv('data/diabetes.csv')
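One detail worth knowing: a triple-quoted block is really a string that Python never uses, not a true comment, and when it is the first thing in a function it becomes that function's documentation. The sketch below shows the difference; load_note is a made-up function name for illustration.

```python
def load_note():
    """A triple-quoted string placed first in a function is its docstring."""
    # This line is a real comment: Python ignores it completely.
    """
    This block is an unused string, not a comment. Nothing inside it runs,
    which is why wrapping code in triple quotes disables that code.
    """
    return "loaded"


print(load_note())
print(load_note.__doc__)
```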

Loading Datasets

In R, you can load the test datasets that ship with the language. In Python, scikit-learn (sklearn) also ships with test datasets. For completeness' sake, there are many more datasets on Kaggle.

Let’s grab our test data from Kaggle. We’ll use the “Pima Indians Diabetes Database” dataset. Download it and save it as data/diabetes.csv, the path used in the code in this tutorial.

Load the data, then print the column names and the first few rows.

import pandas as pd

data = pd.read_csv('data/diabetes.csv')

# Print the column names
print(data.columns)

# Preview the first 5 rows of the loaded data
print(data.head())

Summary Statistics by Jun Wu
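If you want to try read_csv before downloading anything, the sketch below reads a tiny inline sample instead of a file. The three rows and four columns are a cut-down stand-in for data/diabetes.csv, and the R functions in the comments are approximate counterparts.

```python
import io

import pandas as pd

# Small inline sample standing in for data/diabetes.csv (illustration only).
csv_text = """Glucose,BMI,Age,Outcome
148,33.6,50,1
85,26.6,31,0
183,23.3,32,1
"""

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)    # number of rows and columns (R: dim(data))
print(data.dtypes)   # type of each column (R: str(data))
print(data.head(2))  # first two rows (R: head(data, 2))
```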

Summary Statistics

Summary statistics can be run using pandas. The describe() function is similar to R's summary() function and outputs the result as a table.

print(data.describe())

The output is here in a table format.
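Beyond describe(), individual statistics and grouped statistics are one-liners. Here is a minimal sketch on made-up data; the values are hypothetical and the R comments are rough equivalents.

```python
import pandas as pd

# Hypothetical toy data standing in for the diabetes dataframe.
data = pd.DataFrame({
    "Age": [50, 31, 32, 21],
    "Outcome": [1, 0, 1, 0],
})

# Mean of one column (R: mean(data$Age)).
mean_age = data["Age"].mean()

# Mean of Age within each Outcome group
# (R: aggregate(Age ~ Outcome, data, mean)).
mean_by_outcome = data.groupby("Outcome")["Age"].mean()

# describe() can report percentiles of your choosing.
summary = data["Age"].describe(percentiles=[0.1, 0.9])

print(mean_age)
print(mean_by_outcome)
print(summary)
```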

Creating a sub dataframe to explore

Our columns are: [Pregnancies (number of), Glucose (level of), BloodPressure (level of), SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome]

We can create a sub dataframe by indexing our dataframe (data) with a list of column names.

age = data[["Age", "Glucose"]]

print(age)

Age vs Glucose by Jun Wu
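Selecting rows works in a similar way to selecting columns. The sketch below uses a few made-up rows to show the most common patterns next to their approximate R counterparts.

```python
import pandas as pd

# Hypothetical toy data with three of the diabetes columns.
data = pd.DataFrame({
    "Age": [50, 31, 32, 21],
    "Glucose": [148, 85, 183, 89],
    "Outcome": [1, 0, 1, 0],
})

# A list of columns gives a DataFrame (R: data[, c("Age", "Glucose")]).
age_glucose = data[["Age", "Glucose"]]

# A single column name gives a Series (R: data$Age).
age_only = data["Age"]

# Rows matching a condition (R: data[data$Age > 30, ]).
over_30 = data[data["Age"] > 30]

# Rows and a column together with .loc
# (R: data[data$Outcome == 1, "Glucose"]).
glucose_positive = data.loc[data["Outcome"] == 1, "Glucose"]

print(over_30)
print(glucose_positive.tolist())
```

Note the double brackets in the first selection: the inner pair is a Python list of column names, the outer pair is the indexing operation.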

Plotting and Visualization

Plotting and Visualization for models can be done using matplotlib.

Histogram

A good first step is to plot a histogram of every column, split by “Outcome”, to see how the distributions differ. A simple groupby on the dataframe (data) by the column “Outcome” accomplishes this: calling “.hist” on the grouped data draws one set of histograms for each value of “Outcome”.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('data/diabetes.csv')

data.groupby('Outcome').hist(figsize=(9, 9))
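If you would rather compare the two groups on one set of axes, you can loop over the groups yourself. This is a sketch under my own assumptions: the data is randomly generated rather than the real dataset, the Agg backend is selected so the script runs without a display, and the figure is written to an in-memory buffer (pass a file name to savefig to save it to disk instead).

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Glucose and Outcome columns (not real data).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Glucose": np.concatenate([rng.normal(110, 20, 100),
                               rng.normal(140, 25, 100)]),
    "Outcome": [0] * 100 + [1] * 100,
})

fig, ax = plt.subplots(figsize=(6, 4))

# Draw one semi-transparent histogram per Outcome value on the same axes.
for outcome, group in data.groupby("Outcome"):
    ax.hist(group["Glucose"], bins=20, alpha=0.5, label=f"Outcome={outcome}")

ax.set_xlabel("Glucose")
ax.set_ylabel("Count")
ax.legend()

# Render the figure as PNG into memory.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```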

Boxplot

A simple boxplot can be created using the “by=” parameter of the boxplot function on the dataframe (data). The “Outcome” column is either positive (1) or negative (0). In this case, “Age” is grouped by “Outcome”, so the plot shows one box of ages for people with Outcome=0 and one for people with Outcome=1.

data.boxplot(column=['Age'], by=['Outcome'])

Missing Data Handling

Checking for missing data is important in any data analysis. You can use the functions below to check for missing data; isna() and isnull() are aliases, so both lines print the same counts. Pandas has a great tutorial on missing data for more information.

In this case, the check shows that no values are null or NA.

print(data.isnull().sum())
print(data.isna().sum())

Missing Data by Jun Wu
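A count of zero nulls does not always mean nothing is missing, because some datasets use a placeholder value instead. The sketch below shows one way to handle that on made-up data. Treating 0 as "not recorded" in the Insulin column is my assumption for illustration, so check your dataset's documentation before recoding anything.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; a 0 in Insulin is treated here as "not recorded".
data = pd.DataFrame({
    "Insulin": [0, 94, 168, 0, 88],
    "Age": [50, 31, 32, 21, 33],
})

# Recode the placeholder as NaN so pandas recognises it as missing
# (R: data$Insulin[data$Insulin == 0] <- NA).
data["Insulin"] = data["Insulin"].replace(0, np.nan)
n_missing = data["Insulin"].isna().sum()

# Option 1: drop incomplete rows (R: na.omit(data)).
complete = data.dropna()

# Option 2: fill gaps with the column median (median() skips NaN by default).
filled = data.fillna({"Insulin": data["Insulin"].median()})

print(n_missing)
print(complete)
print(filled)
```

Dropping rows is simpler but shrinks the sample; filling keeps every row but flattens the variation in that column. Which one is appropriate depends on your analysis.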

Linear Regression

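Linear regression is a great example to start to see the power of sklearn. The package provides a wide range of models that all follow the same fit and predict pattern.

Instead of using the Kaggle dataset that we grabbed earlier, we can use a diabetes dataset that ships with sklearn. Note that it is a different dataset: it has ten numeric features and a continuous target measuring disease progression, which suits regression.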

The process of running linear regression is as follows:

- split the feature data (X) into training data and test data

- split the target data (y) into training data and test data

- create the linear model object (linear_model.LinearRegression())

- train the model using the training sets

- make predictions

- output the scores

- plot the linear regression

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

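# Load the diabetes dataset bundled with sklearn
diabetes = datasets.load_diabetes()

# Use only one feature (column 2), kept as a 2-D array
diabetes_X = diabetes.data[:, np.newaxis, 2]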

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Coefficient of determination (R^2): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

Linear Regression by Jun Wu
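The example above holds out the last 20 rows as the test set. A random split is a common alternative, and sklearn has a helper for it. The sketch below is a variation on the same example, not a replacement for it: the choice of random_state=0 is arbitrary, and the last two lines recompute the two scores with plain NumPy to show what they measure.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load features and target as two arrays.
X, y = datasets.load_diabetes(return_X_y=True)

# Keep only column 2 as a 2-D array, the same feature as the example above.
X_one = X[:, [2]]

# Hold out 20 randomly chosen rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X_one, y, test_size=20, random_state=0
)

regr = linear_model.LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# The same two scores written out by hand.
mse_manual = np.mean((y_test - y_pred) ** 2)
r2_manual = 1 - np.sum((y_test - y_pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)

print("Mean squared error: %.2f" % mse)
print("R^2: %.2f" % r2)
```

With only 20 test rows the scores move around quite a bit from one split to another, so it is worth trying a few values of random_state before reading much into a single number.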

Using R inside Python

Just for novelty, did you know that you can also use R inside Python?

The rpy2 package allows you to do just that. Go back to your terminal and run this command.

conda install -c r rpy2

Then, upon success, close Spyder and reopen it from this terminal.

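import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

diabetes = pd.read_csv('data/diabetes.csv')

# pandas.rpy was removed from pandas; rpy2 now does the conversion itself.
# Assumes the rpy2 3.x API; check the rpy2 docs for your installed version.
with localconverter(ro.default_converter + pandas2ri.converter):
    dia = ro.conversion.py2rpy(diabetes)

# Make the R dataframe visible to R code under the name "dia"
ro.globalenv['dia'] = dia

print(ro.r('summary(dia)'))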

You should get the summary() statistics from R. To find out more about using R inside Python, see the tutorials on the rpy2 website.

I hope that this tutorial helps you get started with Python. As data plays an ever larger part in our research lives, learning to use Python and R side by side will give us more tools to analyze the data from our research projects.
