
Pointers Are Complicated, or: What's in a Byte?


This summer, I am again working on Rust full-time, and again I will work (amongst other things) on a “memory model” for Rust/MIR. However, before I can talk about the ideas I have for this year, I have to finally take the time and dispel the myth that “pointers are simple: they are just integers”. Both parts of this statement are false, at least in languages with unsafe features like Rust or C: Pointers are neither simple nor (just) integers.

I also want to define a piece of the memory model that has to be fixed before we can even talk about some of the more complex parts: Just what is the data that is stored in memory? It is organized in bytes, the minimal addressable unit and the smallest piece that can be accessed (at least on most platforms), but what are the possible values of a byte? Again, it turns out “it’s just an 8-bit integer” does not actually work as the answer.

I hope that by the end of this post, you will agree with me on both of these statements. :)

Pointers Are Complicated

What is the problem with “pointers are just integers”? Let us consider the following example:

(I am using C++ code here mostly because writing unsafe code is easier in C++, and unsafe code is where these problems really show up.)

int test() {
    auto x = new int[8];
    auto y = new int[8];
    y[0] = 42;
    int i = /* some side-effect-free computation */;
    auto x_ptr = &x[i];
    *x_ptr = 23;
    return y[0];
}

It would be beneficial to be able to optimize the final read of y[0] to just return 42. The justification for this optimization is that writing to x_ptr, which points into x, cannot change y.

However, given how low-level a language C++ is, we can actually break this assumption by setting i to y-x. Since &x[i] is the same as x+i, this means we are actually writing 23 to &y[0].

Of course, that does not stop C++ compilers from doing these optimizations. To allow this, the standard declares our code to have undefined behavior.

First of all, it is not allowed to perform pointer arithmetic (like &x[i] does) that goes beyond either end of the array it started in. Our program violates this rule: x[i] is outside of x, so this is undefined behavior. To be clear: just the computation of x_ptr is already UB; we don’t even get to the part where we want to use this pointer!

But we are not done yet: This rule has a special exception that we can exploit to our advantage. If the arithmetic ends up computing a pointer just past the end of an allocation, that computation is fine. (This exception is necessary to permit computing vec.end() for the usual kind of C++98 iterator loop.)

So let us change this example a bit:

int test() {
    auto x = new int[8];
    auto y = new int[8];
    y[0] = 42;
    auto x_ptr = &x[8]; // one past the end
    if (x_ptr == &y[0])
      *x_ptr = 23;
    return y[0];
}

Now, imagine that x and y have been allocated right next to each other, with y having the higher address. Then x_ptr actually points right at the beginning of y! The conditional is true and the write happens. Still, there is no UB due to out-of-bounds pointer arithmetic.

This seems to break the optimization. However, the C++ standard has another trick up its sleeve to help compiler writers: it doesn’t actually allow us to use our x_ptr above. According to what the standard says about addition on pointers, x_ptr points “one past the last element of the array object”. It does not point at an actual element of another object even if they have the same address. (At least, that is the common interpretation of the standard, based on which LLVM optimizes this code.)

The key point here is that just because x_ptr and &y[0] point to the same address, that does not make them the same pointer, i.e., they cannot be used interchangeably: &y[0] points to the first element of y; x_ptr points past the end of x. If we replace *x_ptr = 23 by *&y[0] = 23, we change the meaning of the program, even though the two pointers have been tested for equality.

This is worth repeating:

Just because two pointers point to the same address, does not mean they are equal and can be used interchangeably.

If this sounds subtle, it is. In fact, this still causes miscompilations in both LLVM and GCC.

Also notice that this one-past-the-end rule is not the only part of C/C++ where this effect can be witnessed. Another example is the restrict keyword in C, which can be used to express that pointers do not alias:

int foo(int *restrict x, int *restrict y) {
    *x = 42;
    if (x == y) {
        *y = 23;
    }
    return *x;
}

int test() {
    int x;
    return foo(&x, &x);
}

Calling test() triggers UB because the two accesses in foo must not alias. Replacing *y by *x in foo changes the meaning of the program such that it no longer has UB. So, again, even though x and y have the same address, they cannot be used interchangeably.

Pointers are definitely not integers.

A Simple Pointer Model

So, what is a pointer? I don’t know the full answer to this. In fact, this is an open area of research.

Here’s a simple proposal (in fact, this is the model of pointers used in CompCert and my RustBelt work, and it is also how miri implements pointers): a pointer is a pair of some kind of ID uniquely identifying the allocation, and an offset into the allocation. Adding/subtracting an integer to/from a pointer just acts on the offset, and can thus never leave the allocation. Subtracting a pointer from another is only allowed when both point to the same allocation (matching C++).
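To make this concrete, here is a minimal Rust sketch of such an allocation-offset pair. (The names `AllocId`, `add`, and `sub` are made up for illustration; miri's actual types differ.)

```rust
// Sketch of the pointer model: an allocation ID plus an offset.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct AllocId(u64);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Pointer {
    alloc: AllocId,
    offset: usize,
}

impl Pointer {
    /// Pointer + integer only moves the offset; it can never change
    /// which allocation the pointer points into.
    fn add(self, n: usize) -> Pointer {
        Pointer { alloc: self.alloc, offset: self.offset + n }
    }

    /// Pointer - pointer is only defined when both pointers are in the
    /// same allocation, matching the C++ rule.
    fn sub(self, other: Pointer) -> Option<isize> {
        if self.alloc == other.alloc {
            Some(self.offset as isize - other.offset as isize)
        } else {
            None
        }
    }
}
```

Note that a one-past-the-end pointer of one allocation and a pointer to the start of the next allocation compare unequal in this model even if their machine addresses coincide: their `AllocId`s differ.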

It turns out (and miri shows) that this model can get us very far. We always remember which allocation a pointer points to, so we can differentiate a pointer “one past the end” of one allocation from a pointer to the beginning of another allocation. That’s how miri can detect that our second example (with &x[8]) is UB.

In this model, pointers are not integers, but they are at least simple. However, this simple model starts to fall apart once you consider pointer-integer casts. In miri, casting a pointer to an integer does not actually do anything; we now just have an integer variable whose value is a pointer (i.e., an allocation-offset pair). Multiplying that integer by 2 leads to an error, because it is entirely unclear what it means to multiply such a pair by 2. A full definition of a language like C++ or Rust of course cannot take this shortcut; it has to explain what really happens here. To my knowledge, no satisfying solution exists, but we are getting closer. This is why pointers are not simple, either.
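A hypothetical sketch of where the arithmetic breaks down: an integer-typed value in such an interpreter is either real bits or a cast pointer, and only the former supports multiplication. (The names `IntValue` and `mul` are invented for illustration, not miri's API.)

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct Pointer { alloc_id: u64, offset: usize }

/// An integer-typed value: either actual bits, or a pointer that was
/// cast to an integer (the cast itself does nothing).
#[derive(Clone, Copy, PartialEq, Debug)]
enum IntValue {
    Bits(u64),
    Ptr(Pointer),
}

/// Multiplication is only defined on real bits; there is no sensible
/// answer for "an allocation-offset pair times 2".
fn mul(v: IntValue, n: u64) -> Result<IntValue, &'static str> {
    match v {
        IntValue::Bits(b) => Ok(IntValue::Bits(b * n)),
        IntValue::Ptr(_) => Err("cannot multiply a pointer value"),
    }
}
```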

From Pointers to Bytes

I hope I made a convincing argument that integers are not the only data one has to consider when formally specifying low-level languages such as C++ or (the unsafe parts of) Rust. However, this means that a simple operation like loading a byte from memory cannot just return a u8. What if that byte is part of a pointer? When a pointer is a pair of allocation and offset, what is its first byte? We cannot represent this as a u8. Instead, we will remember both the pointer and which byte of the pointer we got. If we were to implement our memory model in Rust, this might look as follows:

enum ByteV1 {
  Bits(u8),
  PtrFragment(Pointer, u8),
}

For example, a PtrFragment(ptr, 0) represents the first byte of ptr. This way, we can “take apart” a pointer into the individual bytes that represent this pointer in memory, and assemble it back together. On a 32-bit architecture, the full value representing ptr consists of the following 4 bytes:

[PtrFragment(ptr, 0), PtrFragment(ptr, 1), PtrFragment(ptr, 2), PtrFragment(ptr, 3)]

Such a representation supports performing all byte-level “data moving” operations on pointers, like implementing memcpy by copying one byte at a time. Arithmetic or bit-level operations are not fully supported; as already mentioned above, that requires a more sophisticated pointer representation.
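As a sketch of this, reusing the ByteV1 type from above (with a hypothetical Pointer struct), a byte-wise memcpy just moves ByteV1 values without ever inspecting them, so a pointer survives being copied one byte at a time:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct Pointer { alloc_id: u64, offset: usize }

#[derive(Clone, Copy, PartialEq, Debug)]
enum ByteV1 {
    Bits(u8),
    PtrFragment(Pointer, u8),
}

/// Decompose a pointer into the 4 bytes that represent it on a
/// 32-bit architecture.
fn ptr_to_bytes(p: Pointer) -> [ByteV1; 4] {
    [0u8, 1, 2, 3].map(|i| ByteV1::PtrFragment(p, i))
}

/// A memcpy need not care what a byte *is*: it copies opaque `ByteV1`
/// values one at a time, pointer fragments included.
fn memcpy(dst: &mut [ByteV1], src: &[ByteV1]) {
    for (d, s) in dst.iter_mut().zip(src.iter()) {
        *d = *s;
    }
}
```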

Uninitialized Memory

However, we are not done yet with our definition of a “byte”. To fully describe program behavior, we need to take one more possibility into account: a byte in memory could be uninitialized. The final definition for a byte (assuming we have a type Pointer for pointers) thus looks as follows:

enum Byte {
  Bits(u8),
  PtrFragment(Pointer, u8),
  Uninit,
}

Uninit is the value we use for all bytes that have been allocated, but not yet written to. Reading uninitialized memory is fine, but actually doing anything with those bytes (like, using them in integer arithmetic) is undefined behavior.

This is very similar to the rules LLVM has for its special value called poison. Note that LLVM also has a value called undef, which it uses for uninitialized memory and which works somewhat differently – however, compiling our Uninit to undef is actually correct (undef is in some sense “weaker”), and there are proposals to remove undef from LLVM and use poison instead.

You may wonder why we have a special Uninit value at all. Couldn’t we just pick some arbitrary b: u8 for each newly allocated byte, and then use Bits(b) as the initial value? That would indeed also be an option. However, first of all, pretty much all compilers have converged to having a sentinel value for uninitialized memory. Not doing that would not only pose trouble when compiling through LLVM, it would also require reevaluating many optimizations to make sure they work correctly with this changed model. The key point is that it is always safe, during compilation, to replace Uninit by any value: Any operation that actually observes this value is UB anyway.
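That replacement step can be sketched as a tiny rewrite function over the Byte type from above. (The name refine_uninit and the hypothetical Pointer struct are illustrative; a real compiler performs this refinement on its IR, not on such an enum.)

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct Pointer { alloc_id: u64, offset: usize }

#[derive(Clone, Copy, PartialEq, Debug)]
enum Byte {
    Bits(u8),
    PtrFragment(Pointer, u8),
    Uninit,
}

/// A compiler may refine `Uninit` to any concrete value it likes:
/// any execution that could observe the difference has UB anyway,
/// so no defined behavior changes.
fn refine_uninit(b: Byte, choice: u8) -> Byte {
    match b {
        Byte::Uninit => Byte::Bits(choice),
        other => other,
    }
}
```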

For example, the following C code becomes easier to optimize with Uninit:

int test() {
    int x;
    if (condA()) x = 1;
    // Lots of hard to analyze code that will definitely return when condA()
    // does NOT hold, but will not change x.
    use(x); // want to optimize x to 1.
}

With Uninit, we can easily argue that x is either Uninit or 1, and since replacing Uninit by 1 is okay, the optimization is easily justified. Without Uninit, however, x is either “some arbitrary bit pattern” or 1, and doing the same optimization becomes much harder to justify.

Finally, Uninit is also a better choice for interpreters like miri. Such interpreters have a hard time dealing with operations of the form “just choose any of these values” (i.e., non-deterministic operations), because if they want to fully explore all possible program executions, they have to try every possible value. Using Uninit instead of an arbitrary bit pattern means miri can, in a single execution, reliably tell you if your program uses uninitialized values incorrectly.

Conclusion

We have seen that pointers can be different even when they point to the same address, and that a byte is more than just a number in 0..256. With this, I think we are ready to look at a first draft of my “2018 memory model” (working title ;) – in the next post. :)

Thanks to @rkruppe and @nagisa for help in finding arguments for why Uninit is needed. If you have any questions, feel free to ask in the forums!

