
Fixing a Tough Memory Leak in Python


A few of our power users reported that long-running backtests would sometimes run out of memory. These power users are the people who often find new trading strategies, so we wanted to work with them to improve the performance of our backtesting tools. Over the past couple of weeks, our senior engineer found that the problem wasn’t in our code, but in one of the popular Python libraries that we use.

We found the problem in numpy and numba. The leak was ultimately caused by how we were using these libraries. We made the correction and, as you can see from the following chart, it substantially improved the memory utilization of our trade simulator.

The following is the write-up by our senior engineer so that others can learn from our engineering efforts.

[Chart: trade simulator memory usage over time, before and after the fix]

Always iterate to completion with Numpy

Finding Python memory leaks using LD_PRELOAD and libunwind

You have a Python process that is consuming memory for an unintended and unknown reason (“leaking”) and you want to investigate. The process may import extension modules that have memory management issues of their own, outside the Python interpreter. You have access to a Linux environment for debugging.

We will:

  • identify the C call stack where memory is allocated but not freed in a timely manner
  • LD_PRELOAD a custom library to raise a signal when this stack is encountered
  • handle the signal in Python to dump a Python call stack

Identify problematic C call stacks

Since we will be using libunwind in step 2, and to avoid writing code we don’t need, we will use memleax (which is itself based on libunwind) to get started.

This utility attaches to a running process and produces a report of C call stacks where the allocation is not matched by a deallocation for a given period of time. If the analysis is still running and a matching deallocation happens eventually (even after the configured interval), the associated call stack is pruned from the interesting results.
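memleax attaches to an already-running process by pid. One small convenience, not from the original write-up, is to have the Python program print its own pid so you can attach from another terminal (check memleax --help for the exact options, such as its expire-time setting):

import os

# print this process's pid so memleax can be attached from another shell,
# e.g.:  memleax <pid>
print("attach memleax to pid", os.getpid())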

In our case the stack of interest passed through numpy’s npy_alloc_cache, reached from array_boolean_subscript a few frames up, with no matching free. With that C call stack identified, we can move on to step 2.

LD_PRELOAD a custom library to raise a signal

The library below overrides malloc(3) and free(3), walks the C stack with libunwind on each allocation, and raises SIGUSR1 whenever the problematic stack is seen:

// ...
#define UNW_LOCAL_ONLY
#include <libunwind.h>
#include <dlfcn.h>   // dlsym, RTLD_NEXT
#include <cassert>
#include <csignal>
#include <cstdio>
#include <cstring>
#include <map>
#include <mutex>
#include <set>
#include <string>
#include <vector>
// ...
std::set<void *> s; // keep track of allocations that have not been freed
std::mutex mut;     // protect s

extern "C" {

// LD_PRELOAD will cause the process to call this instead of malloc(3)
void *malloc(size_t size)
{
    // on first call, get a function pointer for malloc(3)
    static void *(*real_malloc)(size_t) = NULL;
    if(!real_malloc)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    assert(real_malloc);

    // call malloc(3)
    void *retval = real_malloc(size);

    static __thread int dont_recurse = 0; // init to zero on first call
    if(dont_recurse)
        return retval;
    dont_recurse=1; // if anything below calls malloc, skip analysis

    // on first call, create cache for symbol at each address
    static thread_local std::map<unw_word_t, std::string> *m = NULL;
    if(!m)
        m = new std::map<unw_word_t, std::string>();

    // collect stack symbols, updating cache as needed 
    unw_cursor_t cursor;
    unw_context_t context;
    unw_getcontext(&context);
    unw_init_local(&cursor, &context);
    std::vector<std::string *> trace;
    while (unw_step(&cursor) > 0)
    {
        unw_word_t pc;
        unw_get_reg(&cursor, UNW_REG_IP, &pc);
        if (pc == 0)
            break;

        std::string &str = (*m)[pc];
        if(str=="") // build cache
        {
            unw_word_t offset;
            // started as C, feel free to use std::string/ostringstream
            char sym[1024], line[1024];
            sprintf(line,"0x%lx:", pc);
            if (!unw_get_proc_name(&cursor, sym, sizeof(sym), &offset))
                sprintf(&line[strlen(line)], " (%s+0x%lx)\n",
                        sym, offset);
            else
                sprintf(&line[strlen(line)], " -- no symbol\n");
            str = line;
        }
        trace.push_back(&str);
    }

    // look for our particular stack context
    // - log it
    // - save for free()
    // - raise signal
    if(trace.size() >=4)
    if(strstr(trace[0]->c_str(),"npy_alloc_cache"))
    if(strstr(trace[3]->c_str(),"array_boolean_subscript"))
    {
        fprintf(stderr,"malloc @ %p for %lu\n", retval, size);
        std::lock_guard<std::mutex> g(mut);
        s.insert(retval);
        raise(SIGUSR1);
    }

    dont_recurse=0;
    return retval;
}

// report matching free() calls
void free(void *ptr)
{
    static void (*real_free)(void *) = NULL;
    if(!real_free)
        real_free = (void (*)(void *))dlsym(RTLD_NEXT, "free");
    assert(real_free);
    real_free(ptr);

    static __thread int dont_recurse = 0;
    if(dont_recurse)
        return;

    dont_recurse=1;
    mut.lock(); // in case lock_guard would call malloc/free?
    if(s.find(ptr) != s.end())
    {
        fprintf(stderr,"free @ %p\n", ptr);
        s.erase(ptr); // b/c addr will get reused
    }
    mut.unlock(); // before the following line, unlike with lock_guard
    dont_recurse=0;
}

} // extern C


Compile with something like:

g++ -std=c++11 -g -Wall -fPIC -shared -o stack_signal.so \
stack_signal.cpp -ldl -lpthread -lunwind

Run the Python program with the library preloaded:

LD_PRELOAD=./stack_signal.so python my.py # or with -m pdb
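If you want to confirm that the preload actually took effect, one quick Linux-only sanity check (a hypothetical addition, not part of the original write-up) is to look for the library in the process’s memory map from inside Python:

# sanity check: the preloaded library should appear in /proc/self/maps
with open("/proc/self/maps") as maps:
    assert "stack_signal.so" in maps.read(), "stack_signal.so was not preloaded"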

You should see output like the following:

malloc @ 0x55c2db93e360 for 36020
malloc @ 0x55c2d9691b30 for 84100
free @ 0x55c2db93e360
free @ 0x55c2d9691b30

In Python, handle the signal by printing the (Python!) stack

import signal
import traceback

def debug_signal_handler(signum, frame):
    traceback.print_stack(frame)

signal.signal(signal.SIGUSR1, debug_signal_handler)
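Install this handler early in my.py, before the long-running work starts, so it is already registered by the time the preloaded library raises SIGUSR1; Python runs the handler in the main thread the next time the interpreter checks for pending signals.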

Now your “malloc @” log lines should be followed by stacks from your Python code. For the “malloc @” lines that never get a corresponding “free @” line, the stack identified the following as the caller of array_boolean_subscript:

... = chunk[to_keep] # a section of a numpy array identified by the boolean array

…now that we know where the problem starts

It turned out that the numpy array resulting from the above operation was being passed to a numba generator compiled in “nopython” mode. This generator was not being iterated to completion, which caused the leak. As per the numba documentation, generators must be compiled with forceobj=True in order for the generator finalizer to handle this case. In our case it made sense to add code to ensure that the generator is always iterated to completion, and we retained the “nopython” compilation.
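For illustration only, here is a minimal sketch of the pattern and the fix, using hypothetical function names rather than our actual trading code: the result of the boolean subscript is handed to a nopython numba generator, and a small wrapper drains the generator so that it is always iterated to completion even when only the first few values are needed.

import numpy as np
from itertools import islice
from numba import njit

@njit  # nopython mode, as in our real code
def scaled_values(arr):
    # hypothetical stand-in for the real nopython generator
    for i in range(arr.shape[0]):
        yield arr[i] * 2.0

def first_n_scaled(chunk, to_keep, n):
    selected = chunk[to_keep]      # the boolean subscript whose buffer was leaking
    gen = scaled_values(selected)
    wanted = list(islice(gen, n))  # take only what we need...
    for _ in gen:                  # ...then drain, so the generator is iterated to completion
        pass
    return wanted

Abandoning the generator half-consumed, for example by breaking out of a loop over it, was what kept that allocation alive in our case.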

Hopefully, you now have an additional tool at your disposal when confronted with perplexing Python memory leaks! While this methodology could not directly point the finger at numba for this particular problem, it certainly helped to know the line of Python where the problem originated.

