
Ruby's external malloc problem


In this post I would like to cover a severe, extremely hard to debug vector for memory bloat in Ruby which can be triggered by the PG gem.

I have blogged a bit about the Ruby GC previously and covered some basics about malloc triggering GC runs. Over the years much in that blog post has been addressed in Ruby, including dynamically growing malloc limits, which means we very rarely need to amend malloc related GC vars.

As an aside, the only GC var Discourse still overrides is RUBY_GLOBAL_METHOD_CACHE_SIZE, for reasons specified in the Shopify blog post by Scott Francis.

The GC in Ruby can be triggered by two different types of conditions:

  1. We are out of space in our managed heaps.

  2. We detected that data associated with Ruby objects via malloc calls has grown beyond a certain threshold.

In this blog post I am covering (2) and demonstrating what happens when Ruby is not aware of malloc calls.
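If you want to see which of the two conditions fired most recently, GC.latest_gc_info reports it via the :gc_by key (:newobj when we ran out of slots in the managed heaps, :malloc when the malloc threshold was crossed). A minimal sketch; depending on heap state the last GC may have been triggered either way, so treat the output as indicative:

data = Array.new(5_000) { "y" * 10_000 } # ~50MB of malloc backed string data
p GC.latest_gc_info(:gc_by)              # => :malloc (typically) or :newobj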

Why can malloc calls trigger a GC?

When reading through GC.stat we may be a bit surprised to see the amount of malloc related accounting:

malloc_increase_bytes
malloc_increase_bytes_limit
oldmalloc_increase_bytes
oldmalloc_increase_bytes_limit
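
For reference, a quick way to peek at just these counters (Hash#slice needs Ruby 2.5+; on older versions read the keys one by one):

p GC.stat.slice(
  :malloc_increase_bytes,
  :malloc_increase_bytes_limit,
  :oldmalloc_increase_bytes,
  :oldmalloc_increase_bytes_limit
)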

We keep track of the amount of memory allocated using malloc; if it hits the malloc_increase_bytes_limit we will trigger a minor GC.

When we promote an object to the old generation we also try to estimate how much malloc increased since the last major GC. This way, when we promote large objects from a young heap to an old heap, we have a chance to GC as soon as oldmalloc_increase_bytes_limit is hit.

The oldmalloc_increase_bytes_limit and malloc_increase_bytes_limit dynamically size themselves, growing as we hit GCs due to malloc limits.

Seeing this in action

Having this in place allows us to run code like this without bloating memory:

def count_malloc(desc)
  start = GC.stat[:malloc_increase_bytes]
  yield
  delta = GC.stat[:malloc_increase_bytes] - start
  puts "#{desc} allocated #{delta} bytes"
end

def process_rss
  puts 'RSS is: ' + `ps -o rss -p #{$$}`.chomp.split("\n").last
end

def malloc_limits
  s = GC.stat
  puts "malloc limit #{s[:malloc_increase_bytes_limit]}, old object malloc limit #{s[:oldmalloc_increase_bytes_limit]}"
end

puts "start RSS/limits"
process_rss
malloc_limits

count_malloc("100,000 byte string") do
  "x" * 100_000
end

x = []
10_000.times do |i|
  x[i%10]  = "x" * 100_000
end

puts "RSS/limits after allocating 10k 100,000 byte string"
malloc_limits
process_rss

Result is:

start RSS/limits
RSS is: 11692
malloc limit 16777216, old object malloc limit 16777216
100,000 byte string allocated 103296 bytes

RSS/limits after allocating 10k 100,000 byte string
malloc limit 32883343, old object malloc limit 78406160
RSS is: 42316

The key figures to watch here are:

  1. malloc_increase_bytes_limit starts at 16MB and moves up to 32MB

  2. oldmalloc_increase_bytes_limit starts at 16MB and moves up to 78MB

  3. RSS moves up from 11MB to 42MB

To recap, this is a fairly well behaved, non-bloated process, despite allocating pretty gigantic objects (strings that contain 100,000 bytes) and retaining a handful (10) of them.

This is what we want and it gets a stamp of approval!


Where malloc accounting falls over!

Ruby does not “monkey patch” the libc malloc function to figure out how much memory got allocated.

It requires C extension authors to be very careful about how they allocate memory; in particular, extension authors are expected to use all sorts of helper macros and functions when allocating and converting memory that will be tied to Ruby objects.

Unfortunately, some gems that package up C libraries do not use these helpers in some cases. This is often nobody’s explicit fault, but the culmination of a very sad series of coincidences.

I have been looking at improving Active Record performance recently and was very surprised to see this pattern everywhere:

Every time we are running a piece of SQL and getting a perfectly good PG::Result back, we convert it to an array of arrays that is 100% materialized and manually discard the PG::Result object. Why is this?
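
Roughly, this is the shape of the pattern when sketched with the raw PG gem directly (the table name here is purely illustrative and the real Active Record code differs, but the idea is the same):

require 'pg'

conn   = PG.connect(dbname: 'test_db')  # assumes a local database we can query
result = conn.exec_params('select * from posts', [])
rows   = result.values   # fully materialize every cell as plain Ruby objects
result.clear             # immediately hand the libpq allocated memory back
rows                     # only the Ruby copy lives on from here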

Turns out, this is there for a very good reason ™

If we adapt our sample to use the PG gem to allocate the strings we see this:

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'pg'
end

require 'pg'

conn = PG.connect(dbname: 'test_db')
sql = "select repeat('x', $1)"

# count_malloc, process_rss and malloc_limits are the helpers defined in the
# first snippet above; reuse them when running this one.
# (Later we also simulate a Rails app by long term retaining 400_000 objects.)

puts "start RSS/limits"
process_rss
malloc_limits

count_malloc("100,000 bytes PG") do
  conn.exec(sql, [100_000])
end

x = []
10_000.times do |i|
  r = x[i%10] = conn.exec(sql, [100_000])
  r.clear
end

puts "RSS/limits after allocating 10k 100,000 byte strings in libpq (and clearing)"
malloc_limits
process_rss

10_000.times do |i|
  x[i%10] = conn.exec(sql, [100_000])
end

puts "RSS/limits after allocating 10k 100,000 byte strings in libpq (and NOT clearing)"
malloc_limits
process_rss

We get this:

start RSS/limits
RSS is: 27392
malloc limit 16777216, old object malloc limit 16777216
100,000 bytes PG allocated 960 bytes
RSS/limits after allocating 10k 100,000 byte strings in libpq (and clearing)
malloc limit 16777216, old object malloc limit 16777216
RSS is: 27636
RSS/limits after allocating 10k 100,000 byte strings in libpq (and NOT clearing)
malloc limit 16777216, old object malloc limit 16777216
RSS is: 295500

Our RSS just jumped to 295MB when we forgot to run #clear on the results PG gave us!

Furthermore, we can make the problem WAY worse if we simulate a Rails app by growing our Ruby heaps first with:

$long_term = []

400_000.times do
  $long_term << +""
end

If we run that code first we reach 1GB of RSS after “forgetting” to clear our PG::Result object!

We can see PG allocated 100,000 bytes but Ruby was only aware of 960.

Aaron Patterson has been aware of this issue for many years; in fact, he has attempted to patch libpq, the library that powers the PG gem, so it can handle this exact case gracefully.

See: PostgreSQL: Custom allocators in libpq

So where does this leave us?

At Discourse we notice occasional bloat in our Sidekiq process. This is despite being extremely careful to run a specific version of jemalloc that tames memory quite a bit.

Now that I am aware of this vector I do have my suspicion that some “Raw SQL” helpers we have lurking in Discourse can cause this issue. In particular we have places that return results directly in a PG::Result object. In Sidekiq, under heavy concurrency with a very large heap these objects can sneak into the old generation and be retained for way too long leading to process bloat.

This thorn also makes it very hard for us to tame Active Record memory usage, because we are stuck relying on copying entire result sets so we can stay safe, which is a very high priority for Rails.

That said, I have not given up quite yet and see quite a few paths forward (none of which conflict):

It would be nice to drive Aaron’s patch home; if libpq provided better hooks for memory allocation we could nip this problem in the bud.

Advantages

  • This would resolve the problem at the source

Disadvantages

  • Even if this is accepted today, it will be many years until people can lean on it: it requires a new version of libpq, and many people run 5-year-old versions of it.

It would be nice to have an API in libpq that allows us to interrogate how many bytes are allocated to a result it returns.

Advantages

  • This would resolve the problem at the source.

Disadvantages

  • Same as above, it will take many years until people can use it.
  • More complicated to adapt to Ruby; it may require a new API to feed the number into the GC statistics.

The PG gem can add a lazy result set object.

In this case we simply extend the PG gem API to return a copy of the results provided by libpq that allocates significantly fewer Ruby objects. Then, once we have the copy, we can clear the result we get from libpq.

For example:

r = conn.exec('select * from table')
rs = r.result_set
r.clear

# at this point only 2 RVALUEs are allocated;
# the new ResultSet object has internal C level storage pointing at an
# array of strings, and an API for access where it defer-creates objects

row = rs[1]

# a ResultSetRow is allocated, it also only allocates 1 RVALUE

row["abc"] # allocates a new RVALUE or returns a cached internal instance
row[1]     # same

rs.get(1, 100) # same as above

Advantages

  • This drops into Active Record and other ORMs as the best practice for grabbing data if #clear is not guaranteed

  • Reasonably efficient, only allocates a very minimal number of Ruby objects

  • We can start using this very soon

Disadvantages

  • We are forced to make memory copies of results returned via PG; this has a non-zero cost (though I suspect it is not too high compared to the thousands of Ruby objects that need to be garbage collected with #values calls)

Build tooling to detect this problem in production apps! It would be amazing if, when we saw a Ruby app that is bloated in memory, we could run a simple diagnostic on it to figure out where the bloat is coming from:

  • Is the bloat there due to glibc arenas?

  • Is the bloat there because Ruby is not aware of a bunch of allocated memory?

  • Is the bloat there due to a simple managed leak, eg: an ever growing array?

It is a hard problem to solve though. jemalloc does provide a lot of internal diagnostics, so we could look at the delta between what jemalloc has allocated and what Ruby knows about!
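
As a rough sketch of that idea using only what stock Ruby ships with (no jemalloc statistics here; ObjectSpace.memsize_of_all only counts memory that objects and extensions report to Ruby, which is exactly the blind spot in question):

require 'objspace'

known_bytes = ObjectSpace.memsize_of_all          # what Ruby can account for
rss_bytes   = `ps -o rss= -p #{$$}`.to_i * 1024   # what the OS says we hold

puts "Ruby accounts for ~#{known_bytes / 1024} KB of #{rss_bytes / 1024} KB RSS"
# A large gap hints at memory Ruby is not aware of: external mallocs,
# allocator fragmentation, glibc arenas and so on.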

Advantages

  • Would increase visibility of this problem and the family of related problems and allow us to alert various gem authors if they are impacted by it.

Disadvantages

  • Hard to build and may require a custom startup.

What are we doing?

I have invested many hours investigating these issues. Discourse is actively investing in improving the memory story in Ruby. Together with Shopify and Appfolio we are sponsoring Eric Wong to experiment and improve Ruby for the next few months.

Discourse is also looking to throw more dollars behind a project to heavily improve Active Record for the 6.0 release, which I plan to blog about soon. We also plan to extract, improve, formalize and share our built-in raw SQL helpers.

I hope you found this helpful and as always, enjoy!

