Tricks in R to Boost Your Productivity (Part 2)

 5 years ago
source link: https://www.tuicool.com/articles/mUzmQjr

Photo by Brad Neathery on Unsplash

I am always keen on tools and tricks that help me get things done faster. Higher productivity does not by itself guarantee good or correct results, but it does cut your working hours, which leaves less time to make mistakes. Time spent learning tricks and tools pays for itself later in your work and is well worth it. Following my previous article on tricks in R and RStudio, this article shares some more tricks that I find useful.

Set up R Profiles

When an R session starts, it looks for a .Rprofile file in the current working directory (usually your project directory) and, if there is none, in your home directory. It sources the file it finds and then runs the function .First if one is defined. A few things you can have .Rprofile do at the start of each session to save time are:

  • Load packages that you constantly work with (such as tidyverse, data.table, ggplot2, etc.).
  • Set environment variables (such as a development environment identifier (“sandbox”, “staging”, “prod”, etc.), AWS credentials, your database connection credentials, etc.).
  • Set Java parameters (you definitely need to set these if you pull large data from a database through the DBI and rJava packages).
  • Set ggplot2 theme and color palette (useful if you need to use a brand-specific color palette).

An example of my .Rprofile is shown below:
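
The embedded example did not survive in this copy, so here is a minimal sketch of what such a .Rprofile might look like, covering the items listed above. The variable name APP_ENV, the 4 GB heap size, and the choice of packages and theme are illustrative assumptions, not the author's actual settings.

```r
# Hypothetical .Rprofile (in the project directory or the home directory).
# R runs .First() at the start of each session if it is defined.
.First <- function() {
  # Development environment identifier; APP_ENV and "sandbox" are placeholders.
  Sys.setenv(APP_ENV = "sandbox")

  # JVM heap size for rJava; it has to be set before rJava is loaded.
  # "-Xmx4g" is an assumed value, adjust it to your machine.
  options(java.parameters = "-Xmx4g")

  # Attach frequently used packages in interactive sessions only, so that
  # scripts run non-interactively stay explicit about their dependencies.
  if (interactive()) {
    for (pkg in c("data.table", "ggplot2")) {
      if (requireNamespace(pkg, quietly = TRUE)) {
        suppressPackageStartupMessages(library(pkg, character.only = TRUE))
      }
    }
    # Default ggplot2 theme; a brand-specific theme could be set here instead.
    if (requireNamespace("ggplot2", quietly = TRUE)) {
      ggplot2::theme_set(ggplot2::theme_minimal())
    }
  }
}
```

Secrets such as AWS or database credentials are often kept in a .Renviron file, which R also reads at startup, rather than written into code.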

Make your Database Query Wrapper Functions

If you often query a database from R, consider organizing your files better and writing wrapper functions to make your life easier. First, put all your R files in one folder and all your SQL files in another. Some people write SQL commands directly inside R functions or scripts. I don’t think that is good practice: it makes the R code ugly, and the SQL cannot be reused by other R code. With the files organized, you can write the following wrapper functions:

Wrapper Functions for database query
  • connect_db is the wrapper function that returns a database connection handler after you call the driver and establish the database connection with the url and credentials.
  • get_sql is the SQL file parser.
  • db_query is the function you call to execute a query. It accepts either a path to a SQL file or a raw SQL string, and any additional arguments are substituted into the query's parameters through the DBI::sqlInterpolate function. Most importantly, it closes the connection after executing the query.
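
A minimal sketch of the three wrappers described above could look like the following. It is not the author's original code: it assumes the DBI and RSQLite packages, uses a SQLite file as a stand-in for the real database, and reads its location from a hypothetical DB_PATH environment variable.

```r
# Return a database connection. Assumption: a SQLite file stands in for the
# real database; replace the driver, URL and credentials with your own.
connect_db <- function(path = Sys.getenv("DB_PATH", unset = "example.sqlite")) {
  DBI::dbConnect(RSQLite::SQLite(), path)
}

# Read a SQL file and return its contents as a single string.
get_sql <- function(file) {
  paste(readLines(file, warn = FALSE), collapse = "\n")
}

# Run a query given either a path to a SQL file or a raw SQL string.
# Named arguments in ... fill the ?name placeholders in the query.
db_query <- function(query, ...) {
  # Anything that is not an existing file is treated as raw SQL.
  sql <- if (file.exists(query)) get_sql(query) else query

  con <- connect_db()
  # Close the connection when the function exits, even on error.
  on.exit(DBI::dbDisconnect(con), add = TRUE)

  params <- list(...)
  if (length(params) > 0) {
    # sqlInterpolate quotes the values safely before substituting them.
    sql <- DBI::sqlInterpolate(con, sql, .dots = params)
  }
  DBI::dbGetQuery(con, sql)
}
```

Registering the disconnect with on.exit means the connection is released even when the query fails.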

With the above wrapper functions, you can run a SQL query with a single call, for example:

# String-based query
accounts <- db_query('select count(*) from accounts where created_at < ?time', time = '2019-01-01')

# File-based query
accounts <- db_query('sql/accounts.sql', time = '2019-01-01')

You can easily extend the above wrapper functions with more functionality such as delete, load, etc. to meet your needs.

Never Save Workspace Data on Exit

If you use the R command-line tool in a terminal for all your work, you will be asked whether to save your workspace when you exit the R session. If you choose to save it, a hidden “.RData” file is created in your working directory. That is fine when the workspace holds only a small amount of data. However, if it holds a large amount of data (512MB or more), saving can take a long time because the data has to be compressed into the file. My personal suggestion is therefore never to save the workspace on exit. In RStudio, you can set this under Tools > Global Options > General by setting “Save workspace to .RData on exit” to “Never”. If you really do want to keep the data on some occasion, you can simply call the save command before you exit.
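
For the occasional explicit save, a minimal sketch is to store only the objects you need with save() and restore them later with load(). The object name results and the file location are placeholders.

```r
# Hypothetical object worth keeping between sessions.
results <- data.frame(id = 1:3, score = c(0.2, 0.5, 0.9))

# Save only the selected object instead of the whole workspace.
path <- file.path(tempdir(), "results.RData")
save(results, file = path)

# Later, or in a new session: load it into a separate environment so that
# it does not silently overwrite objects with the same name.
restored <- new.env()
load(path, envir = restored)
identical(restored$results, results)

# To leave a session without writing .RData:
# quit(save = "no")
```

In a terminal, starting R with R --no-save skips the save prompt on exit.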

Package Development

Package Reload

If you are writing an R package and want to reload the package under development without restarting the R session (and so losing your data), you can use devtools::reload() or devtools::load_all() from the devtools package. In RStudio, the shortcut CMD + SHIFT + L runs devtools::load_all().

Package Documentation

With the cursor inside a function of your package, press CMD + Option + Shift + R in RStudio to insert a roxygen2 documentation skeleton above the function, as shown below.

[Figure: Package documentation skeleton generated by Roxygen2]
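
For illustration, a filled-in version of such a skeleton for a hypothetical function add_numbers (not from the original article) might look like this. The tags are standard roxygen2 tags, and the text after each tag is what you would write yourself.

```r
#' Add two numbers
#'
#' @param x A numeric vector.
#' @param y A numeric vector, recycled against `x`.
#'
#' @return A numeric vector holding the element-wise sum of `x` and `y`.
#' @export
#'
#' @examples
#' add_numbers(1, 2)
add_numbers <- function(x, y) {
  x + y
}
```

Running devtools::document() in the package then turns these comments into the help file for the function.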

RStudio Code Setting

I am a heavy RStudio user and have been using it for over 4 years. Below are my recommended code settings for RStudio. The aim is to make coding more pleasant and efficient.

[Figure: Display settings]

[Figure: Saving settings]

Development Environment

I believe that 99% of R users use RStudio as their primary R IDE. The RStudio IDE comes in two major products: RStudio Desktop and RStudio Server. Most R users start with the desktop version, mainly for ad-hoc analysis and model development. Once their needs expand to model deployment, task automation, Shiny websites, dashboards, reports, and so on, the server version becomes the better choice. If you use the server version, it is best to set up separate dev, staging, and production environments on your servers for better software management. I only have one EC2 server, not three servers dedicated to dev, staging, and production. As a workaround, you can create three R projects in three different folders, each version controlled with git and GitHub, to mimic the dev, staging, and production environments as shown below:

Environment Management
  • dev : my sandbox for modeling and analysis.
  • stg : a testing environment for new features.
  • prod : contains the code and cron jobs that generate reports, dashboards, and automated tasks. In the production environment, you should change code only through git pull, never by hand.
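
One way to let the same code know which of the three projects it is running in is to derive the environment name from the project folder. This is only a sketch, assuming the folders are literally named dev, stg, and prod. The function detect_env and the send_reports setting are hypothetical.

```r
# Map a project folder to an environment name, falling back to "dev"
# when the folder name is not one of the known environments.
detect_env <- function(path = getwd(), known = c("dev", "stg", "prod")) {
  name <- basename(path)
  if (name %in% known) name else "dev"
}

# Hypothetical per-environment settings: only prod sends reports.
settings <- list(
  dev  = list(send_reports = FALSE),
  stg  = list(send_reports = FALSE),
  prod = list(send_reports = TRUE)
)

# Example with an assumed folder layout.
env <- detect_env("/home/me/projects/prod")
config <- settings[[env]]
```

Falling back to dev keeps an unrecognized folder from being treated as production by accident.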

If you use the free version of RStudio Server, switching between projects restarts your R session, so you lose the data held in memory. With RStudio Server Pro, you can run three R sessions side by side in your browser, which is more convenient.

