Easy error handling in R with purrr’s possibly

See how the purrr package’s possibly() function helps you flag errors and keep going when applying a function over multiple objects in R.

Executive Editor, Data & Analytics, InfoWorld | Dec 17, 2020 3:00 am PST

It’s frustrating to see your code choke part of the way through while trying to apply a function in R. You may know that something in one of those objects caused a problem, but how do you track down the offender?

The purrr package’s possibly() function is one easy way.

In this example, I’ll demo code that imports multiple CSV files. Most files’ value columns import as characters, but one of these comes in as numbers. Running a function that expects characters as input will cause an error.

For setup, the code below loads several libraries I need and then uses base R’s list.files() function to return a sorted vector with names of all the files in my data directory.

library(purrr)
library(readr)
library(rio)
library(dplyr)
my_data_files <- list.files("data_files", full.names = TRUE) %>%
  sort()

I can then import the first file and look at its structure.

x <- rio::import("data_files/file1.csv")
str(x)
'data.frame':	3 obs. of  3 variables:
 $ Category     : chr  "A" "B" "C"
 $ Value        : chr  "$4,256.48 " "$438.22" "$945.12"
 $ MonthStarting: chr  "12/1/20" "12/1/20" "12/1/20"

Both the Value and MonthStarting columns are importing as character strings. What I ultimately want is Value as numbers and MonthStarting as dates.

I sometimes deal with issues like this by writing a small function, such as the one below, to make changes in a file after import. It uses dplyr’s transmute() to create a new Month column from MonthStarting as Date objects, and a new Total column from Value as numbers. I also make sure to keep the Category column (transmute() drops all columns not explicitly mentioned).

library(dplyr)
library(lubridate)
process_file <- function(myfile) {
  rio::import(myfile) %>%
    dplyr::transmute(
      Category = as.character(Category),
      Month = lubridate::mdy(MonthStarting),
      Total = readr::parse_number(Value)
    )
}

I like to use readr’s parse_number() function for converting values that come in as character strings because it deals with commas, dollar signs, or percent signs in numbers. However, parse_number() requires character strings as input. If a value is already a number, parse_number() will throw an error.
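A minimal sketch of that behavior, assuming the readr package is installed. The sample strings are made up for illustration, and the error on numeric input reflects the readr versions I'm aware of, so the numeric call is wrapped in tryCatch() rather than relied on blindly.

```r
library(readr)

# Character input: currency symbols, commas, percent signs, and stray
# whitespace are dropped, leaving just the number
parsed <- parse_number(c("$4,256.48 ", "$438.22", "45%"))
print(parsed)

# Numeric input: capture the error instead of letting it stop the script
numeric_attempt <- tryCatch(
  parse_number(3738),
  error = function(e) "parse_number() failed on numeric input"
)
print(numeric_attempt)
```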

My new function works fine when I test it on the first two files in my data directory using purrr’s map_df() function.

my_results <- map_df(my_data_files[1:2], process_file)

But if I try running my function on all the files, including the one where Value imports as numbers, it will choke.

all_results <- map_df(my_data_files, process_file)
 Error: Problem with `mutate()` input `Total`.
x is.character(x) is not TRUE
ℹ Input `Total` is `readr::parse_number(Value)`.
Run `rlang::last_error()` to see where the error occurred.

That error tells me Value is not a character column in one of the files, so parse_number() can’t compute Total, but I’m not sure which file is responsible. Ideally, I’d like to run through all the files, marking the one(s) with problems as errors but still processing all of them instead of stopping at the error.

possibly() lets me do this by creating a brand new function from my original function:

safer_process_file <- possibly(process_file, otherwise = "Error in file")

The first argument for possibly() is my original function, process_file. The second argument, otherwise, tells possibly() what to return if there’s an error.
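Here is a small self-contained sketch of the same pattern that runs without any data files. The half() function and its inputs are hypothetical stand-ins for process_file() and the CSV files; only possibly() and map() come from purrr.

```r
library(purrr)

# Hypothetical stand-in for process_file(): fails on non-numeric input
half <- function(x) {
  if (!is.numeric(x)) stop("input is not numeric")
  x / 2
}

# Wrap it so an error yields the `otherwise` value instead of halting
safer_half <- possibly(half, otherwise = "Error in input")

inputs <- list(10, "ten", 4)
results <- map(inputs, safer_half)
str(results)
```

The second element of results is the otherwise string, while the first and third hold the computed values, so the loop finishes and the failure stays visible.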

To apply my new safer_process_file() function to all my files, I’ll use purrr’s map() function and not map_df(). That’s because the results need to be collected in a list, not combined into a single data frame: if there’s an error, that file’s result won’t be a data frame; it will be the character string that I told otherwise to generate.

all_results <- map(my_data_files, safer_process_file)
str(all_results, max.level = 1) 
List of 5
 $ :'data.frame':	3 obs. of  3 variables:
 $ :'data.frame':	3 obs. of  3 variables:
 $ :'data.frame':	3 obs. of  3 variables:
 $ : chr "Error in file"
 $ :'data.frame':	3 obs. of  3 variables:

You can see here that the fourth item, from my fourth file, is the one with the error. That’s easy to see with only five items, but wouldn’t be quite so easy if I had a thousand files to import and three had errors.

If I name the list with my original file names, it’s easier to identify the problem file:

names(all_results) <- my_data_files
str(all_results, max.level = 1) 
List of 5
 $ data_files/file1.csv:'data.frame':	3 obs. of  3 variables:
 $ data_files/file2.csv:'data.frame':	3 obs. of  3 variables:
 $ data_files/file3.csv:'data.frame':	3 obs. of  3 variables:
 $ data_files/file4.csv: chr "Error in file"
 $ data_files/file5.csv:'data.frame':	3 obs. of  3 variables:
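With many files, scanning str() output by eye gets tedious. One possible approach, sketched below with a toy list standing in for all_results (the data frames and their contents are invented for illustration), is to test each element with is.data.frame() and pull out the names of the ones that failed.

```r
library(purrr)

# Toy stand-in for all_results: data frames for successes, a string for the failure
all_results_demo <- list(
  "data_files/file1.csv" = data.frame(Category = "A", Total = 1),
  "data_files/file2.csv" = "Error in file",
  "data_files/file3.csv" = data.frame(Category = "B", Total = 2)
)

# TRUE where the import worked, FALSE where possibly() returned the fallback
is_ok <- map_lgl(all_results_demo, is.data.frame)

# Names of the files that need a closer look
failed_files <- names(all_results_demo)[!is_ok]
print(failed_files)

# Combine only the successful imports into one data frame
good_results <- do.call(rbind, unname(all_results_demo[is_ok]))
print(good_results)
```

This assumes the otherwise value is never itself a data frame, which holds for the "Error in file" string used here.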

I can even save the results of str() to a text file for further examination.

str(all_results, max.level = 1) %>%
  capture.output(file = "results.txt")

Now that I know file4.csv is the problem, I can import just that one and confirm what the issue is.

x4 <- rio::import(my_data_files[4])
str(x4)
'data.frame':	3 obs. of  3 variables:
 $ Category     : chr  "A" "B" "C"
 $ Value        : num  3738 723 5494
 $ MonthStarting: chr  "9/1/20" "9/1/20" "9/1/20"

Ah, Value is indeed coming in as numeric. I’ll revise my process_file() function to account for the possibility that Value isn’t a character string. Because is.character(Value) returns a single TRUE or FALSE for the whole column, I use a plain if/else check here; the vectorized ifelse() would return only one value for a length-one test, and that value would be recycled down the column.

process_file2 <- function(myfile) {
  rio::import(myfile) %>%
    dplyr::transmute(
      Category = as.character(Category),
      Month = lubridate::mdy(MonthStarting),
      # One TRUE/FALSE decision for the whole column, so if/else keeps every row's value
      Total = if (is.character(Value)) readr::parse_number(Value) else Value
    )
}

Now if I use purrr’s map_df() with my new process_file2() function, it should work and give me a single data frame.

all_results2 <- map_df(my_data_files, process_file2)
str(all_results2)
'data.frame':	15 obs. of  3 variables:
 $ Category: chr  "A" "B" "C" "A" ...
 $ Month   : Date, format: "2020-12-01" "2020-12-01" "2020-12-01" ...
 $ Total   : num  4256 438 945 3156 ...
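One detail worth checking in a type test like this is the difference between ifelse() and a plain if/else. The base R sketch below uses a made-up character vector to illustrate it: ifelse() returns a result the same length as its test, and is.character() on a whole column is a single TRUE or FALSE, so only the first converted element comes back and would then be recycled across the rows of a data frame.

```r
value_chr <- c("10", "20", "30")

# ifelse() shapes its result like the test; the test here has length one,
# so only the first converted element is returned
from_ifelse <- ifelse(is.character(value_chr), as.numeric(value_chr), value_chr)
print(from_ifelse)

# if/else makes one decision and returns the whole converted vector
from_if_else <- if (is.character(value_chr)) as.numeric(value_chr) else value_chr
print(from_if_else)
```

Repeated values at the start of a Total column in str() output are a sign that the ifelse() form was used on the column as a whole.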

That’s just the data and format I wanted, thanks to wrapping my original function in possibly() to create a new, error-handling function.

For more R tips, head to the “Do More With R” page on InfoWorld or check out the “Do More With R” YouTube playlist.
