Falsehoods programmers believe about undefined behavior

source link: https://predr.ag/blog/falsehoods-programmers-believe-about-undefined-behavior/

November 27, 2022 compilers intro

Undefined behavior (UB) is a tricky concept in programming languages and compilers. Over the many years I've been an industry mentor for MIT's 6.172 Performance Engineering course, I've heard many misconceptions about what the compiler guarantees in the presence of UB. This is unfortunate but not surprising!

For a primer on undefined behavior and why we can't just "define all the behaviors," I highly recommend Chandler Carruth's talk "Garbage In, Garbage Out: Arguing about Undefined Behavior with Nasal Demons."

You might also be familiar with my Compiler Adventures blog series on how compiler optimizations work. An upcoming episode is about implementing optimizations that take advantage of undefined behavior like dividing by zero, where we'll see UB "from the other side."

Undefined behavior != implementation-defined behavior

Undefined behavior is not the same as implementation-defined behavior.

Program behaviors fall into three buckets, not two:

  • Specification-defined: The programming language itself defines what happens. This is the vast majority of every program.
  • Implementation-defined: The exact behavior is defined by your compiler, operating system, or hardware. For example: how many bits exactly are in a char or int in C++.

  • Undefined behavior: Anything is allowed to happen, and you might no longer have a computer left after it all happens. No outcome is a bug if caused by UB. For example: signed integer overflow in C, or using unsafe to create two &mut references to the same data in Rust.

Here's the list of guarantees compilers make about the outcomes of undefined behavior:

That's the whole list. No, I didn't forget any items. Yes, seriously.

It is possible to analyze how UB affects a specific program when compiled by a specific compiler or executed on a specific target platform. For example, there exist exotic compilers, operating systems, and hardware that offer additional guarantees relative to most common platforms, which only guarantee OS-level process isolation. We aren't talking about those in this post.

The mindset for this post is this: "If my program contains UB, and the compiler produced a binary that does X, is that a compiler bug?"

It's not a compiler bug.

All of the following assumptions are wrong

Falsehoods about when UB "happens"

  1. Undefined behavior only "happens" at high optimization levels like -O2 or -O3.
  2. If I turn off optimizations with a flag like -O0, then there's no UB.
  3. If I include debug symbols in the build, there's no UB.
  4. If I run the program under a debugger, there's no UB.
  5. Okay, there's still UB with all of these, but my code will "do the right thing" regardless.
  6. It will either "do the right thing" or crash with a Segmentation Fault (SIGSEGV signal).
  7. It will either "do the right thing" or crash somehow.
  8. It will either "do the right thing" or crash or infinite-loop or deadlock.
  9. At least it won't run some unrelated code from elsewhere in the program.
  10. At least it won't run any unreachable code the program might contain.

Falsehoods around "if the UB isn't executed"

  1. If a line with UB previously "did the right thing," then it will continue to "do the right thing" the next time we run the program.
  2. The UB line will at least continue to "do the right thing" while the program is still running.
  3. But if the line with UB isn't executed, then the program will work normally as if the UB wasn't there.
  4. Okay, but if the line with UB is unreachable (dead) code, then it's as if the UB wasn't there.

  5. If the line with UB is unreachable code, then the program won't crash because of the UB.
  6. If the line with UB is unreachable code, then the program will at least stop running somehow and at some point.

Falsehoods about the possible outcomes of UB

  1. At least it won't corrupt the memory of the program.
  2. At least it won't corrupt the memory of the program other than where the UB-affected data was located.
  3. At least it won't corrupt the heap.
  4. At least it won't corrupt the stack.
  5. At least it won't corrupt the current stack frame. (My name for this is the "local variables are safely in registers" fallacy.)
  6. At least it won't corrupt the stack pointer.
  7. At least it won't corrupt the CPU flags register / any other CPU state.
  8. At least it won't corrupt the executable memory of the program.

  9. At least it won't corrupt streams like stdout or stderr.
  10. At least it won't overwrite any files the program already had open.
  11. At least it won't open new files and overwrite them.
  12. At least it won't completely wipe the drive.
  13. At least it won't damage or destroy any hardware components.

  14. At least it won't start playing Doom if the program didn't already have the Doom source code in it.

Falsehoods like "but it worked fine before"

  1. If a UB-containing program "worked fine" previously, recompiling the program without any code changes will still produce a binary that "works fine."
  2. Recompiling without code changes and with the same compiler and flags will produce a binary that still "works fine."
  3. Recompiling as above + on the same machine will produce a binary that still "works fine."
  4. Recompiling as above + if you haven't rebooted the machine since the last compilation will produce a binary that still "works fine."
  5. Recompiling as above + with the same environment variables will produce a binary that still "works fine."
  6. Recompiling as above + at the same time of day and day of week as before, during a lunar eclipse, having first sacrificed a fresh stick of RAM to the binary gods, will produce a binary that still "works fine."

Falsehoods about self-consistent behavior of UB

  1. Multiple runs of the program compiled as above and with the same inputs will produce the same behavior in each run.
  2. Those multiple runs will produce the same behavior if the program, ignoring the UB, is deterministic.
  3. But they will if the program is also single-threaded.
  4. But they will if the program also doesn't read any external data (files, network, environment variables, etc.).

False expectations around UB, in general

  1. Any kind of reasonable or unreasonable behavior happening with any consistency or any guarantee of any sort.

The moment your program contains UB, all bets are off. Even if it's just one little UB. Even if it's never executed. Even if you don't know it's there at all. Probably even if you wrote the language spec and compiler yourself.

This is not to say that all outcomes in the list above are equally likely, or even plausible.

But they are all allowed, valid, spec-compliant behavior.

It's perfectly possible that your program has UB, and it's been running fine for years without issues. That's great! I'm happy to hear it! I'm not even saying you need to go back and rewrite it to remove the UB. But as you make your decisions, it's good to know the full picture of what the compiler will or won't guarantee for your program.

Honorable mention for one special assumption

"If the program compiles without errors then it doesn't have UB."

This is 100% false in C and C++.

It's also false as stated in Rust, but with one tweak it's almost true. If your Rust program never uses unsafe, then it should be free of UB. In other words: causing UB without unsafe is considered a bug in the Rust compiler. These are rare and you are quite unlikely to run into them.

When Rust unsafe is used, then all bets are off just as in C or C++. But the assumption that "Safe Rust programs that compile are free of UB" is mostly true.

This is not an easy feat. We owe a debt of gratitude to the folks who cumulatively put engineer-centuries into making it so. It's Thanksgiving, and I thank you!

Thanks to arriven, Conrad Ludgate, sharnoff, Brian Graham, and a few folks who preferred to remain unnamed, for feedback on drafts of this post. Any mistakes are mine alone.