Python Tutorial For Researchers Who use R
source link: https://www.tuicool.com/articles/NzEjauV
Installation, Loading Data, Visualization, Linear Regression, Rpy2
This tutorial is aimed at researchers who are used to working in R. Data is at the center of any research project, and researchers now have access to more data than ever before, which puts them squarely in the role of data scientist. If you have been using only R all this time, you are missing out: Python offers powerful models and tools for handling larger amounts of data. To upgrade your skills as a data scientist, give Python a try. Like me, you might find that you still love R for some tasks. As you get used to Python for other tasks, you may find that it is more robust in handling larger amounts of data.
I will go over Python installation, data loading, simple visualization, and a linear regression example. Toward the end, just for novelty's sake, I will show you how to use R from within Python.
The Case for Using Python
For researchers who use R, switching to Python might seem like a daunting task. In fact, today's Python analysis packages, such as pandas, numpy, and sklearn, make it easy to load, explore, and analyze data much as you would in R. You don't need to write extensive code for simple analyses.
Using pandas and numpy together: The combination of the two covers most data management tasks. You create dataframes, and with these packages you can handle missing data and manage columns and rows.
Using sklearn: This package covers most machine learning needs. You can find models for classification, regression, and clustering, as well as tools for dimensionality reduction, model selection, and preprocessing.
Using matplotlib: This package covers most graphing needs. There are simple graphs such as bar charts and line charts, and more complex ones such as gradients, contours, and heatmaps.
Installing Python on Mac and Windows
Now, let's get started by installing Python on your desktop. The Anaconda distribution of Python is recommended. It contains not only Python but also the Spyder editor for development.
- Download Anaconda for Windows or Mac
- Choose the Python 3.7 version, download it, and install it
- Open a terminal and run the command below
conda list
You should get back a list of the packages installed in your conda environment. If you do, congratulations, you just installed Python. If you receive an error instead, it is usually because your Anaconda installation cannot be found on your path. The Anaconda installer should already have appended the necessary line to your .profile file, so sourcing that file once more in your terminal and checking again should do the trick.
source ~/.profile
conda list
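If you want to double-check the installation from inside Python as well, here is a minimal sketch that reports which of the packages used in this tutorial can be imported. The helper name missing_packages is my own, not part of any library, and the package list is just the set of packages mentioned above.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the names that cannot be found in the current Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages this tutorial relies on (sklearn is the import name of scikit-learn).
required = ["pandas", "numpy", "sklearn", "matplotlib"]

print("Python version:", sys.version.split()[0])
print("Missing packages:", missing_packages(required) or "none")
```

If anything shows up as missing, installing it with conda from the same terminal is usually enough.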
Spyder
Anaconda includes the Spyder editor for free. It's a good default editor for Python. You can start Spyder from the command line:
spyder
Once the editor starts up, create a new file.
Use File -> New File, then File -> Save As to name your new Python file.
Commenting out code in Python
You can comment out a block of code by wrapping it in triple quotes, or comment out a single line by starting it with "#".
"""
diabetes=pd.read_csv('data/diabetes.csv')
"""
#diabetes=pd.read_csv('data/diabetes.csv')
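Here is a small sketch of how the two styles behave when the script runs. Strictly speaking, a triple-quoted block is a string that Python evaluates and throws away rather than a true comment, but the effect is the same: the code inside it never runs. The variable names are made up for illustration.

```python
values = [1, 2, 3]

# total = sum(values) * 100   (disabled with "#", so it is never executed)

"""
total = -1
Everything between the triple quotes is a string literal that Python
evaluates and discards, so the assignment above never runs either.
"""

total = sum(values)
print(total)  # prints 6
```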
Loading Datasets
R ships with built-in test datasets. In Python, scikit-learn (sklearn) also bundles a few test datasets. For completeness, many more datasets are available on Kaggle.
Let's grab our test data from Kaggle. We'll use the "Pima Indians Diabetes Database" dataset. Download it and save it as data/diabetes.csv.
Load the data, then print the column names and the first few rows.
import pandas as pd
data = pd.read_csv('data/diabetes.csv')
# Print the column names
print(data.columns)
# Preview the first 5 rows of the loaded data
print(data.head())
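If you want to try the loading step before downloading anything, the sketch below reads a tiny CSV from an in-memory string instead of a file. The three rows are made-up values for illustration only, and only a few of the real column names are used. The comments note the rough R equivalent of each call.

```python
import io

import pandas as pd

# Made-up rows for illustration only; the column names follow the Kaggle file.
csv_text = """Pregnancies,Glucose,BMI,Age,Outcome
2,120,30.5,35,0
0,150,41.2,52,1
5,98,24.8,29,0
"""

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)          # roughly dim(data) in R
print(list(data.columns))  # roughly names(data)
print(data.dtypes)         # column types, a little like str(data)
print(data.head(2))        # roughly head(data, 2)
```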
Summary Statistics
Summary statistics can be computed with pandas. The describe() function is similar to R's summary() function and returns the result as a table.
print(data.describe())
The output is printed in table format.
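Because describe() returns an ordinary dataframe, you can pull individual statistics out of it instead of only printing it. A minimal sketch on a small made-up dataframe (the values are illustrative, not taken from the Kaggle file):

```python
import pandas as pd

# Made-up values for illustration only.
data = pd.DataFrame({
    "Age": [21, 35, 52, 29],
    "Glucose": [85, 120, 150, 98],
})

summary = data.describe()
print(summary)

# Rows are the statistics, columns are the variables.
print(summary.loc["mean", "Age"])                      # 34.25
print(summary.loc[["min", "50%", "max"], "Glucose"])   # min, median, max
```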
Creating a sub dataframe to explore
Our columns are: [Pregnancies (number of), Glucose (level of), BloodPressure (level of), SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome]
We can create a sub dataframe by indexing our dataframe (data) with a list of column names.
age = data[["Age", "Glucose"]]
print(age)
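For R users, it may help to see column and row selection side by side. The sketch below uses a small made-up dataframe, and the R expressions in the comments are rough equivalents rather than exact translations.

```python
import pandas as pd

# Made-up values for illustration only.
data = pd.DataFrame({
    "Age": [21, 35, 52, 29],
    "Glucose": [85, 120, 150, 98],
    "Outcome": [0, 0, 1, 0],
})

# Select columns by name: data[, c("Age", "Glucose")]
sub = data[["Age", "Glucose"]]

# Filter rows with a boolean condition: data[data$Age > 30, ]
older = data[data["Age"] > 30]

# Filter rows and select columns in one step with .loc
positive = data.loc[data["Outcome"] == 1, ["Age", "Glucose"]]

# Select rows by position (0-based, end excluded): data[1:2, ]
first_two = data.iloc[:2]

print(sub, older, positive, first_two, sep="\n\n")
```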
Plotting and Visualization
Plotting and visualization can be done using matplotlib.
Histogram
As a first step, plotting histograms of every column, split by "Outcome", shows you how each variable is distributed. A simple groupby on the dataframe (data) by the "Outcome" column accomplishes this: calling ".hist" on the grouped data generates one set of histograms of all the other columns for each value of "Outcome".
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('data/diabetes.csv')
data.groupby('Outcome').hist(figsize=(9, 9))
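If you would rather compare the two outcome groups on a single set of axes, one option is to loop over the groups and overlay their histograms. This is a sketch on made-up ages, not the Kaggle data. The matplotlib.use("Agg") line is an assumption so the example also runs without a display; inside Spyder you can drop it and call plt.show() at the end.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: draw off-screen; remove this line in Spyder

import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only.
data = pd.DataFrame({
    "Age": [21, 35, 52, 29, 61, 44, 33, 58],
    "Outcome": [0, 0, 1, 0, 1, 1, 0, 1],
})

fig, ax = plt.subplots(figsize=(6, 4))
counts_by_outcome = {}

# One semi-transparent histogram of Age per value of Outcome.
for outcome, group in data.groupby("Outcome"):
    counts, edges, _ = ax.hist(
        group["Age"], bins=5, alpha=0.5, label=f"Outcome = {outcome}"
    )
    counts_by_outcome[outcome] = counts  # bar heights, one per bin

ax.set_xlabel("Age")
ax.set_ylabel("Count")
ax.legend()
```

ax.hist also returns the bin counts, which is handy when you want the numbers behind the picture.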
Boxplot
A simple boxplot can be created using the "by=" parameter of the dataframe's boxplot function. "Outcome" is either positive (1) or negative (0). Here, "Age" is grouped by "Outcome", so the plot shows the distribution of ages for people with Outcome=0 and for people with Outcome=1.
data.boxplot(column=['Age'], by='Outcome')
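To read off the numbers that such a boxplot draws (median and quartiles per group), you can combine groupby with describe(). A minimal sketch on made-up ages:

```python
import pandas as pd

# Made-up values for illustration only.
data = pd.DataFrame({
    "Age": [21, 35, 52, 29, 61, 44, 33, 58],
    "Outcome": [0, 0, 1, 0, 1, 1, 0, 1],
})

# One row per Outcome; the 25%, 50% and 75% columns are the box edges and median.
stats = data.groupby("Outcome")["Age"].describe()
print(stats[["min", "25%", "50%", "75%", "max"]])
```

This is roughly what tapply(data$Age, data$Outcome, summary) would give you in R.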
Missing Data Handling
Checking for missing data is important in any data analysis. You can use the functions below to check for it. The pandas documentation has a great tutorial on missing data for more information.
In this case, the check shows that no values are null or NA.
print(data.isnull().sum())
print(data.isna().sum())
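When a dataset does contain missing values, the usual next steps are to drop or fill them. The sketch below uses a small made-up dataframe. Treating a Glucose value of 0 as "not measured" is an assumption made for this example; check how your own data encodes missing values before doing the same.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
data = pd.DataFrame({
    "Glucose": [85, 0, 150, 98],
    "BMI": [30.5, np.nan, 41.2, 24.8],
})

# Count missing values per column, roughly colSums(is.na(data)) in R.
print(data.isna().sum())

# Assumption for this sketch: a Glucose of 0 stands for "not measured".
cleaned = data.copy()
cleaned["Glucose"] = cleaned["Glucose"].replace(0, np.nan)

# Option 1: drop every row with a missing value, like na.omit(data).
dropped = cleaned.dropna()

# Option 2: fill missing values with each column's median.
filled = cleaned.fillna(cleaned.median())

print(dropped)
print(filled)
```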
Linear Regression
Linear regression is a great example to start seeing the power of sklearn, which contains a wide range of models for machine learning.
Instead of the Kaggle diabetes dataset we grabbed earlier, we can use the diabetes dataset bundled with sklearn. Note that it is a different dataset, with ten baseline features and a quantitative measure of disease progression as the target.
The process of running linear regression is as follows:
- split the feature data (X) into training and test sets
- split the target data (y) into training and test sets
- create the linear model object (linear_model.LinearRegression())
- train the model using training sets
- make predictions
- output the scores
- plot the linear regression
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset bundled with sklearn
diabetes = datasets.load_diabetes()
# Use only one feature (column index 2), kept as a 2-D array
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Coefficient of determination (R^2): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
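The example above holds out the last 20 rows by slicing. A common alternative, sketched here, is to let sklearn's train_test_split pick the test rows at random. The choice of 20 test rows mirrors the example above, and random_state=0 is an arbitrary seed chosen only to make the split repeatable.

```python
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()
X = diabetes.data[:, [2]]   # same single feature as above, kept 2-D
y = diabetes.target

# Hold out 20 randomly chosen rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=20, random_state=0
)

model = linear_model.LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Roughly what coef(lm(y ~ x)) reports in R.
print("Intercept:", model.intercept_)
print("Slope:", model.coef_[0])
print("Mean squared error:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```

A random split avoids depending on the order of the rows, which matters whenever a file happens to be sorted.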
Using R inside Python
Just for novelty, did you know that you can also use R inside Python?
The rpy2 package allows you to do just that. Go back to your terminal and run this command:
conda install -c r rpy2
Once it installs successfully, close Spyder and reopen it from this terminal.
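import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
diabetes = pd.read_csv('data/diabetes.csv')
# pandas.rpy has been removed from pandas; this version assumes rpy2 3.x and its pandas2ri converter
with localconverter(ro.default_converter + pandas2ri.converter):
    # Assigning into R's global environment converts the pandas dataframe to an R data.frame
    ro.globalenv['dia'] = diabetes
print(ro.r('summary(dia)'))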
You should get the summary() statistics from R. To find out more about using R inside Python, see the tutorials on the rpy2 website.
I hope this tutorial helps you get started with Python. As data proliferates in our research lives, learning to use Python and R side by side will only give us more tools for analyzing the data from our research projects.