Cleaning Up Ruby Strings 13 Times Faster

source link: https://blog.appsignal.com/2019/08/20/clean-up-strings.html

Maud de Vries on Aug 20, 2019

When translating your thoughts into code, you most likely use the methods that you are most familiar with. These are the methods that are top of mind and come to you automatically: you see a string that needs cleaning up and your fingers type the methods that will get the result.

Often, the methods that you type automatically are the most generic Ruby methods, because they are the ones that we read and write more than others. For example, #gsub is a generic method to substitute characters in strings. But Ruby has so much more to offer, with more specialized convenience methods for standard operations.

I love Ruby’s rich idiom mostly because it makes code more elegant and easier to read. If we want to benefit from this richness, we need to spend time refactoring even the simplest parts of our code (for instance, cleaning up a string), and it takes a bit of effort to expand our vocabulary. The question is: is the extra effort worth it?

Four Ways to Remove Spaces

Here’s a string that represents a credit card number: “055 444 285”. To work with it, we want to remove the spaces. #gsub can do this; with #gsub you can substitute anything with anything. But there are other options.

string = "055 444 285"
string.gsub(/ /, '')
string.gsub(' ', '')
string.tr(' ', '')
string.delete(' ')

# => "055444285"

It’s the expressiveness that I like most about the convenience methods. The last one is a good example of this: it doesn’t get more obvious than “delete spaces”. When weighing the trade-offs between options, readability is my first priority, unless, of course, it causes performance problems. So, let’s see how much pain my favorite solution, #delete, really causes.
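As a quick sanity check, here is a minimal sketch showing that the four calls return the same value and leave the original string alone. The `results` array is only an illustration helper, not part of the original article.

```ruby
string = "055 444 285"

# Collect the return value of each approach (illustration helper).
results = [
  string.gsub(/ /, ''),  # regex pattern, empty replacement
  string.gsub(' ', ''),  # plain-string pattern, empty replacement
  string.tr(' ', ''),    # translating to an empty set deletes the characters
  string.delete(' ')     # removes every character in the given set
]

puts results.uniq.inspect  # one unique value means all four agree
puts string.inspect        # unchanged: each call returns a new string
```

None of these methods mutate the receiver; the bang variants (#gsub!, #tr!, #delete!) would.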

I benchmarked the examples above. Which one of these methods do you think is the fastest?

require 'benchmark/ips' # provided by the benchmark-ips gem

string = "055 444 285"

Benchmark.ips do |x|
  x.config(time: 30, warmup: 2)

  x.report('gsub')           { string.gsub(/ /, '') }
  x.report('gsub, no regex') { string.gsub(' ', '') }
  x.report('tr')             { string.tr(' ', '') }
  x.report('delete')         { string.delete(' ') }

  x.compare!
end
Guess the order from most to least performant.

I wasn’t surprised about the order, but the differences in speed still surprised me. #gsub is not only slower, but it also requires an extra effort for the reader to ‘decode’ the arguments. Let’s see how this comparison works out when cleaning up more than just spaces.
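If you want to check the order on your own machine without installing a gem, a rough sketch using only core Ruby could look like this. The `time_it` helper and the iteration count are assumptions made for illustration, and the numbers will be far less rigorous than those from a dedicated benchmarking tool.

```ruby
# Hypothetical helper: time a block over many iterations with a monotonic clock.
def time_it(iterations)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

string = "055 444 285"
iterations = 100_000 # arbitrary count, chosen only for illustration

timings = {
  'gsub'           => time_it(iterations) { string.gsub(/ /, '') },
  'gsub, no regex' => time_it(iterations) { string.gsub(' ', '') },
  'tr'             => time_it(iterations) { string.tr(' ', '') },
  'delete'         => time_it(iterations) { string.delete(' ') }
}

# Print the fastest approach first; results vary per machine and Ruby version.
timings.sort_by { |_label, seconds| seconds }.each do |label, seconds|
  puts format('%-15s %.4fs', label, seconds)
end
```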

Pick Your Numbers

Take the following phone number: '(408) 974-2414'. Let’s say we only need the digits: 4089742414. I added #scan as well because it expresses more clearly that we are aiming for particular characters, instead of trying to remove everything we don’t want.

require 'benchmark/ips' # provided by the benchmark-ips gem

string = '(408) 974-2414'

Benchmark.ips do |x|
  x.config(time: 30, warmup: 2)

  x.report('gsub')         { string.gsub(/[^0-9]/, '') }
  x.report('tr')           { string.tr("^0-9", "") }
  x.report('delete_chars') { string.delete("^0-9") }
  x.report('scan')         { string.scan(/[0-9]/).join }

  x.compare!
end
Again, guess the order.

Using a regex slows things down, which is not surprising. And the intention-revealing expressiveness of #scan costs us dearly. But looking at how Ruby’s specialized methods handle cleaning up gave me a taste for more.
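To make the arguments in that benchmark easier to read, here is a small sketch of what each call returns. The variable names are illustrative only; the input is the phone number from the text.

```ruby
phone = '(408) 974-2414'

# In #tr and #delete, a leading ^ negates the set: "everything except 0-9".
via_gsub   = phone.gsub(/[^0-9]/, '')
via_tr     = phone.tr('^0-9', '')
via_delete = phone.delete('^0-9')

# #scan returns an array of matches, so it needs a #join afterwards.
digits   = phone.scan(/[0-9]/)
via_scan = digits.join

puts [via_gsub, via_tr, via_delete, via_scan].uniq.inspect
puts digits.length
```

The extra array allocated by #scan is one plausible reason it costs more than the set-based methods, though the benchmark itself is the real evidence.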

On the Money

Let’s try some ways of removing the substring "€ " from the string "€ 300". Some of the following solutions specify the exact substring "€ ", some will simply remove all currency symbols or all non-numerical characters.

require 'benchmark/ips' # provided by the benchmark-ips gem

string = "€ 300"

Benchmark.ips do |x|
  x.config(time: 30, warmup: 2)

  x.report('delete specific chars')  { string.delete("€ ") }
  x.report('delete non-numericals')  { string.delete("^0-9") }
  x.report('delete prefix')          { string.delete_prefix("€ ") }
  x.report('delete prefix, strip')   { string.delete_prefix("€").strip }

  x.report('gsub')                   { string.gsub(/€ /, '') }
  x.report('gsub-non-nums')          { string.gsub(/[^0-9]/, '') }
  x.report('tr')                     { string.tr("€ ", "") }
  x.report('slice array')            { string.chars.slice(2..-1).join }
  x.report('split')                  { string.split.last }
  x.report('scan nums')              { string.scan(/\d/).join }

  x.compare!
end

You may expect, and correctly so, that the winner is one of the #deletes. But which one of the #delete variants do you expect to be the fastest? Plus: one of the other methods is faster than some of the #deletes. Which one?

I was surprised that even slicing an array is faster than #gsub, and I’m always pleased to see how fast #split is. Also note that deleting all non-numericals is faster than deleting a specific substring.
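Speed aside, these approaches are only interchangeable for an input as simple as "€ 300". The sketch below uses a second, hypothetical input ("€ 1.300,50", not from the original article) to show where the prefix-based and the digits-only approaches diverge. It assumes Ruby 2.5 or later for #delete_prefix.

```ruby
price = "€ 300"       # sample input from the text
other = "€ 1.300,50"  # hypothetical input, for illustration only

# For the simple sample, every approach returns the same value.
same_for_sample = [
  price.delete("€ "),
  price.delete("^0-9"),
  price.delete_prefix("€ "),
  price.split.last,
  price.scan(/\d/).join
]

# These target the prefix only, so the rest of the string is left alone.
exact = [
  other.delete_prefix("€ "),
  other.chars.slice(2..-1).join,
  other.split.last
]

# These keep digits only, so the separators disappear as well.
digits_only = [
  other.delete("^0-9"),
  other.scan(/\d/).join
]

puts same_for_sample.uniq.inspect
puts exact.uniq.inspect
puts digits_only.uniq.inspect
```

Which behavior is right depends on what the cleaned-up value is used for, so the fastest option is not automatically the correct one.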

Follow the Money

Let’s remove the currency after the number. (I skipped the slower #gsub variants.)

require 'benchmark/ips' # provided by the benchmark-ips gem

string = "300 USD" # assumed sample input; the original does not show it

Benchmark.ips do |x|
  x.config(time: 30, warmup: 2)

  x.report('gsub')          { string.gsub(/ USD/, '') }
  x.report('tr')            { string.tr(" USD", "") }
  x.report('delete_chars')  { string.delete("^0-9") }
  x.report('delete_suffix') { string.delete_suffix(" USD") }
  x.report('to_i.to_s')     { string.to_i.to_s }
  x.report('split')         { string.split.first }

  x.compare!
end

There’s a draw between winners. Which two do you expect to compete for being the fastest?

And: guess how much slower #gsub is here.

There isn’t always a specialized method that will suit your needs. You can’t use #to_i if you need to keep a leading “0”. And #delete_suffix leans heavily on the assumption that the currency is US Dollars.
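Both caveats are easy to see in a short sketch. The two inputs are hypothetical examples chosen to trigger each case, and #delete_suffix assumes Ruby 2.5 or later.

```ruby
with_leading_zero = "0300 USD"  # hypothetical input with a leading zero
in_euros          = "300 EUR"   # hypothetical input in another currency

# #to_i parses the leading digits as an integer, so the leading zero is lost.
via_to_i = with_leading_zero.to_i.to_s

# #delete_suffix keeps the zero, but only when the suffix matches exactly.
zero_kept = with_leading_zero.delete_suffix(" USD")
untouched = in_euros.delete_suffix(" USD")

# #split makes no assumption about the currency.
via_split = in_euros.split.first

puts via_to_i.inspect
puts zero_kept.inspect
puts untouched.inspect
puts via_split.inspect
```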

The specialized methods are like precision tools: suitable for a specific task in a specific context. So there will always be cases where #gsub is exactly what we need. It is versatile, and it’s always top of mind. But it can be a bit harder to read and is often slower, even slower than I expected. To me, Ruby’s richness is also one of the reasons it is so much fun to work with. The speed wins are a nice bonus.

Guest author Maud de Vries is a freelance Ruby on Rails developer, a Coach for (solo) entrepreneurs and she used to be an editor as well. The writer inside sometimes escapes.
