
LLVM internals, part 3: from bitcode to IR

source link: https://blog.yossarian.net/2021/09/14/LLVM-internals-part-3-from-bitcode-to-IR

Sep 14, 2021

Tags: llvm, rust

Series: llvm-internals

Preword

The last two posts of this series have focused on the nitty-gritty of LLVM’s bitstream format.

This post marks a turning point: now that we have reasonable abstractions for the bitstream container itself, we can focus on mapping it into a form that resembles LLVM’s IR. We’ll cover some of the critical steps in that process1 below, introducing a new crate (llvm-mapper) along the way.

Also, critically: this post is the first in the series where our approach to parsing and interpreting LLVM’s bitcode differs significantly from how LLVM’s own codebase does things. The post should still be interesting to anyone who wants to learn how an IR-level LLVM module is constructed, but it will not reflect how LLVM itself does that construction2.


Step 1: Unrolling the bitstream

At the bitstream layer, our top-level API looks like this:

// `from` parses both "raw" bitstreams and ones that have a bitcode wrapper,
// so discard the wrapper as _ if we see one.
let (_, stream) = Bitstream::from(raw_bitcode)?;

for entry in stream {
  // This is a lazy iterator, so each entry's parse can fail.
  println!("{:?}", entry?);
}

This produces a linear sequence of stream entries, each of which is one of the core bitstream entities (the start of a new block, a record, or the end of a block):

[BLOCK(n), RECORD, RECORD, BLOCK(k), RECORD, RECORD, END_BLOCK, RECORD, ..., END_BLOCK]
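
Conceptually, each of those entries is one of three shapes. A minimal illustrative enum (not llvm-bitstream’s exact API) looks something like this:

// Illustrative only: the three kinds of entry the flat iterator yields.
// llvm-bitstream's real entry type differs in its details.
#[derive(Debug)]
pub enum StreamEntry {
    /// Entry into a new block, identified by its block ID.
    EnterBlock { block_id: u64 },
    /// A record in the current block: its code plus its decoded fields.
    Record { code: u64, fields: Vec<u64> },
    /// The end of the current block.
    EndBlock,
}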

A flat sequence like this is easy to emit, but it isn’t easy to consume! Instead, we want a hierarchical structure that reflects the owning relationship between records and blocks, like so:

BLOCK(n)
  RECORD
  RECORD
  BLOCK(k)
    RECORD
    RECORD
  END_BLOCK
  RECORD
  ...
END_BLOCK
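
In Rust terms, one plausible shape for that hierarchy is a block that owns both its records and its sub-blocks. The sketch below is illustrative; llvm-mapper’s actual UnrolledBlock and UnrolledRecord differ in their details:

// Sketch only: an unrolled block owns its records and its sub-blocks, giving
// consumers the hierarchy directly instead of a flat entry stream.
pub struct UnrolledRecord {
    /// The record's code (what kind of record it is).
    pub code: u64,
    /// The record's decoded operand fields.
    pub fields: Vec<u64>,
}

pub struct UnrolledBlock {
    /// The block's ID.
    pub id: u64,
    /// The records directly contained by this block.
    pub records: Vec<UnrolledRecord>,
    /// The sub-blocks directly contained by this block.
    pub blocks: Vec<UnrolledBlock>,
}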

To get to the latter, we have to “unroll” the bitstream into a hierarchy of blocks and records. More accurately, we have to conform to LLVM’s understanding of the minimal context associated with every IR module:

  • A top-level IDENTIFICATION_BLOCK, providing version and epoch metadata
  • A top-level MODULE_BLOCK, providing the core structure of the ultimate llvm::Module
  • A top-level STRTAB_BLOCK, providing the string table for the module3
  • An (optional) top-level SYMTAB_BLOCK, providing an indirect symbol table for the module3

Oh, and a bitstream can contain multiple IR modules to boot (thanks to llvm-cat4 and LTO). So our data flow and model really look like this:

Bitstream -> UnrolledBitstream
UnrolledBitstream -> UnrolledBitcode
UnrolledBitcode -> [BitcodeModule]

The code for actually performing this unrolling is a little too long to paste inline, but you can view it here. To summarize:

  1. We enter the bitstream, starting with an empty set of “partial”5 bitcode modules, each of which looks like this:

     struct PartialBitcodeModule {
         identification: UnrolledBlock,
         module: Option<UnrolledBlock>,
         strtab: Option<UnrolledBlock>,
         symtab: Option<UnrolledBlock>,
     }
    
  2. We visit each top-level block, collecting all of its sub-blocks (and their records) as we go; a simplified sketch of this grouping appears just after this list

  3. We attempt to reify6 each “partial” module into a well-formed BitcodeModule:

     pub(self) fn reify(self) -> Result<BitcodeModule, Error> {
         let mut ctx = MapCtx::default();
    
         // Grab the string table early, so that we can move it into our mapping context and
         // use it for the remainder of the mapping phase.
         let strtab = Strtab::try_map(
             &self.strtab.ok_or_else(|| {
                 Error::BadUnroll("missing STRTAB_BLOCK for bitcode module".into())
             })?,
             &mut ctx,
         )?;
    
         ctx.strtab = Some(strtab);
    
         let identification = Identification::try_map(&self.identification, &mut ctx)?;
         let module = Module::try_map(
             &self.module.ok_or_else(|| {
                 Error::BadUnroll("missing MODULE_BLOCK for bitcode module".into())
             })?,
             &mut ctx,
         )?;
         let symtab = self
             .symtab
             .map(|s| Symtab::try_map(&s, &mut ctx))
             .transpose()?;
    
         #[allow(clippy::unwrap_used)]
         Ok(BitcodeModule {
             identification: identification,
             module: module,
             // Unwrap safety: we unconditionally assign `strtab` to `Some(...)` above.
             strtab: ctx.strtab.unwrap(),
             symtab: symtab,
         })
     }
    
  4. We collect all of the well-formed BitcodeModules into an UnrolledBitcode container:

     let modules = partial_modules
         .into_iter()
         .map(|p| p.reify())
         .collect::<Result<Vec<_>, _>>()?;
     let unrolled = UnrolledBitcode { modules };
    

The end result: a mapped hierarchy of blocks and records, corresponding to one or more LLVM IR modules.
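
For a sense of what step 2 looks like in code, here’s a simplified sketch of the grouping pass. It uses the illustrative UnrolledBlock from earlier (with a plain numeric id), the PartialBitcodeModule from step 1, and a plain String error for brevity. The numeric block IDs follow LLVM’s LLVMBitCodes.h to the best of my knowledge, and the real implementation is more careful about which top-level blocks may appear where (and how STRTAB/SYMTAB are shared between modules):

// Hypothetical block IDs for the four top-level blocks we care about; consult
// LLVMBitCodes.h for the authoritative assignments.
const IDENTIFICATION_BLOCK_ID: u64 = 13;
const MODULE_BLOCK_ID: u64 = 8;
const STRTAB_BLOCK_ID: u64 = 23;
const SYMTAB_BLOCK_ID: u64 = 25;

fn group_top_level(blocks: Vec<UnrolledBlock>) -> Result<Vec<PartialBitcodeModule>, String> {
    let mut partials: Vec<PartialBitcodeModule> = Vec::new();

    for block in blocks {
        match block.id {
            // Each IDENTIFICATION_BLOCK starts a new partial module...
            IDENTIFICATION_BLOCK_ID => partials.push(PartialBitcodeModule {
                identification: block,
                module: None,
                strtab: None,
                symtab: None,
            }),
            // ...and every other top-level block attaches to the most recent one.
            id => {
                let current = partials
                    .last_mut()
                    .ok_or_else(|| "top-level block before any IDENTIFICATION_BLOCK".to_string())?;
                match id {
                    MODULE_BLOCK_ID => current.module = Some(block),
                    STRTAB_BLOCK_ID => current.strtab = Some(block),
                    SYMTAB_BLOCK_ID => current.symtab = Some(block),
                    // Other top-level blocks (e.g. BLOCKINFO) aren't modeled in this sketch.
                    _ => {}
                }
            }
        }
    }

    Ok(partials)
}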

Step 2: From unrolled blocks to IR models

In the process above, one of our end results is a BitcodeModule for each appropriate sequence of top-level blocks in the bitstream. That BitcodeModule looks something like this:

pub struct BitcodeModule {
    /// The identification block associated with this module.
    pub identification: Identification,

    /// The module block associated with this module.
    pub module: Module,

    /// The string table associated with this module.
    pub strtab: Strtab,

    /// The symbol table associated with this module, if it has one.
    pub symtab: Option<Symtab>,
}

But wait: where did Identification, Module, &c come from? We’ve skipped a critical step!

As hinted in the reification step during unrolling, we have to map various features in the bitstream into their corresponding IR-level concepts. This process comes with its own subtleties:

  • Correct mapping is context-sensitive. Records and sub-blocks within the MODULE_BLOCK can’t be mapped in an arbitrary order: record layouts depend on version metadata within the MODULE_BLOCK, and various fields depend on layout information that must be extracted from the datalayout specification. Records and blocks that reference types can only be correctly mapped after the TYPE_BLOCK within the module has been parsed.
  • Similarly, correct mapping requires extra-modular state: individual records are allowed to reference the string and symbol tables, which exist outside of their associated MODULE_BLOCK for reasons (tm)7. This state must be available during mapping.
  • Mapping of bitstream entities onto IR features is not 1:1. Information about functions is spread across both MODULE_CODE_FUNCTION records and FUNCTION_BLOCK sub-blocks within the MODULE_BLOCK, as well as references to a fully parsed type table. Correct mapping requires that each of the above be interpreted and associated with each other into a unified model.

Each of these considerations requires significant, dirty-handed engineering, particularly in the context of a pure Rust implementation: we can’t8 leverage god-like global context objects the way LLVM’s C++ internals can.

Mapping individual blocks and records

The heart of our mapping abstraction in llvm-mapper is the Mappable trait, which (currently) looks something like this:

/// A trait for mapping some raw `T` into a model type.
pub(crate) trait Mappable<T>: Sized {
    type Error;

    /// Attempt to map `T` into `Self` using the given [`MapCtx`](MapCtx).
    fn try_map(raw: &T, ctx: &mut MapCtx) -> Result<Self, Self::Error>;
}

That’s it! All it does is take an input T by reference, and (fallibly) produce the implementing type from it, using the associated error type when the operation fails. The associated MapCtx holds the ugly context details that I mentioned above: copies of the module-level string table, version information, and anything else that might influence mapping decisions9.
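
The exact contents of MapCtx are an internal detail (and, per the footnote, still in flux), but a rough sketch of the kind of state it carries looks like this; the field set below is illustrative rather than llvm-mapper’s exact layout:

// Sketch only: the kind of state a MapCtx carries around. The real struct's
// fields differ; the important part is that it's threaded through every
// `try_map` call.
#[derive(Default)]
pub(crate) struct MapCtx {
    /// The module-level string table, once it has been mapped.
    pub strtab: Option<Strtab>,
    /// The module's version, which changes how some later records are interpreted.
    pub version: Option<u64>,
}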

This abstraction ends up being reasonable for both blocks and records: thanks to unrolling, blocks now contain (and own) all of their constituent records. Records in turn should be independent of their adjacent records, with sparing exceptions handled by the MapCtx.
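
To make the record half of that concrete, here’s a purely hypothetical record mapping: a newtype over a single-field “version” record. The UnrolledRecord shape is the illustrative one sketched earlier, and VersionError is invented for this example; llvm-mapper’s real record handling looks different:

// Hypothetical: mapping a single-field record into a model newtype.
#[derive(Debug)]
pub enum VersionError {
    /// The record didn't have the field we expected.
    MissingField,
}

/// Models a hypothetical "module version" record.
pub struct ModuleVersion(pub u64);

impl Mappable<UnrolledRecord> for ModuleVersion {
    type Error = VersionError;

    fn try_map(record: &UnrolledRecord, _ctx: &mut MapCtx) -> Result<Self, Self::Error> {
        // The version record carries exactly one integer field.
        record
            .fields
            .first()
            .copied()
            .map(ModuleVersion)
            .ok_or(VersionError::MissingField)
    }
}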

Because blocks are uniquely identified by their block ID (and all block IDs share the same namespace), we use an additional IrBlock trait with a blanket implementation to strongly enforce that the block ID is checked:

/// A trait implemented by all blocks that correspond to IR models, allowing them
/// to be mapped into their corresponding model.
pub(crate) trait IrBlock: Sized {
    /// The `IrBlockId` that corresponds to this IR model.
    const BLOCK_ID: IrBlockId;

    /// Attempt to map the given block to the implementing type, returning an error if mapping fails.
    ///
    /// This is an interior trait that shouldn't be used directly.
    fn try_map_inner(block: &UnrolledBlock, ctx: &mut MapCtx) -> Result<Self, BlockMapError>;
}

impl<T: IrBlock> Mappable<UnrolledBlock> for T {
    type Error = BlockMapError;

    fn try_map(block: &UnrolledBlock, ctx: &mut MapCtx) -> Result<Self, Self::Error> {
        if block.id != BlockId::Ir(T::BLOCK_ID) {
            return Err(BlockMapError::BadBlockMap(format!(
                "can't map {:?} into {:?}",
                block.id,
                T::BLOCK_ID
            )));
        }

        IrBlock::try_map_inner(block, ctx)
    }
}

In plain English: individual blocks don’t implement Mappable<UnrolledBlock> directly. Instead, they implement IrBlock, and the blanket implementation ensures that the block ID check is always performed.
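
As a quick, hypothetical illustration of what that buys us: handing Identification::try_map a block with any other ID fails before try_map_inner ever runs. (The helper below isn’t from llvm-mapper’s test suite; it just assumes we already hold an UnrolledBlock for some non-identification block and a MapCtx.)

// Hypothetical: the blanket impl rejects a mismatched block up front.
fn rejects_mismatched_block(module_block: &UnrolledBlock, ctx: &mut MapCtx) {
    let result = Identification::try_map(module_block, ctx);
    assert!(matches!(result, Err(BlockMapError::BadBlockMap(_))));
}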

The simplest example of this is the IDENTIFICATION_BLOCK, corresponding to the Identification struct that we saw referenced earlier:

/// Models the `IDENTIFICATION_BLOCK` block.
#[non_exhaustive]
#[derive(Debug)]
pub struct Identification {
    /// The name of the "producer" for this bitcode.
    pub producer: String,
    /// The compatibility epoch.
    pub epoch: u64,
}

impl IrBlock for Identification {
    const BLOCK_ID: IrBlockId = IrBlockId::Identification;

    fn try_map_inner(block: &UnrolledBlock, _ctx: &mut MapCtx) -> Result<Self, BlockMapError> {
        let producer = {
            let producer = block.one_record(IdentificationCode::ProducerString as u64)?;

            producer.try_string(0)?
        };

        let epoch = {
            let epoch = block.one_record(IdentificationCode::Epoch as u64)?;

            epoch.get_field(0)?
        };

        Ok(Self { producer, epoch })
    }
}

…and rinse and repeat for the dozen-plus other blocks that can be present in a valid bitcode file. Plus their internal records, which vary widely in terms of needed context and parsing strategy10. Correctly mapping each of these (and testing them) is going to be a task on the order of months, if not years. LLVM is big!

Next steps

As of this post, here’s what the various mollusc crates are capable of producing:

$ ./target/debug/examples/unroll-bitstream ./sum.bc

UnrolledBitcode {
    modules: [
        BitcodeModule {
            identification: Identification {
                producer: "LLVM10.0.0",
                epoch: 0,
            },
            module: Module {
                version: 2,
                triple: "x86_64-pc-linux-gnu",
                datalayout: DataLayout {
                    endianness: Little,
                    natural_stack_alignment: Some(Align { byte_align: 16 }),
                    program_address_space: AddressSpace(0),
                    global_variable_address_space: AddressSpace(0),
                    alloca_address_space: AddressSpace(0),
                    type_alignments: TypeAlignSpecs(
                        [
                            TypeAlignSpec {
                                aligned_type: Aggregate,
                                abi_alignment: Align { byte_align: 8 },
                                preferred_alignment: Align { byte_align: 8 },
                            },
                            /* many more */
                        ],
                    ),
                    pointer_alignments: PointerAlignSpecs(
                        [
                            PointerAlignSpec {
                                address_space: AddressSpace(0),
                                abi_alignment: Align { byte_align: 8 },
                                preferred_alignment: Align { byte_align: 8 },
                                pointer_size: 64,
                                index_size: 64,
                            },
                            PointerAlignSpec {
                                address_space: AddressSpace(270),
                                abi_alignment: Align { byte_align: 4 },
                                preferred_alignment: Align { byte_align: 4 },
                                pointer_size: 32,
                                index_size: 32,
                            },
                            PointerAlignSpec {
                                address_space: AddressSpace(271),
                                abi_alignment: Align { byte_align: 4 },
                                preferred_alignment: Align { byte_align: 4 },
                                pointer_size: 32,
                                index_size: 32,
                            },
                            PointerAlignSpec {
                                address_space: AddressSpace(272),
                                abi_alignment: Align { byte_align: 8 },
                                preferred_alignment: Align { byte_align: 8 },
                                pointer_size: 64,
                                index_size: 64,
                            },
                        ],
                    ),
                    aggregate_alignment: Align { byte_align: 1 },
                    function_pointer_alignment: None,
                    mangling: Some(Elf),
                    native_integer_widths: [ 8, 16, 32, 64 ],
                    non_integral_address_spaces: [],
                },
                asm: [],
                deplibs: [],
                type_table: TypeTable(
                    [
                        Integer(IntegerType { bit_width: 32 }),
                        Function(
                            FunctionType {
                                return_type: Integer(IntegerType { bit_width: 32 }),
                                param_types: [
                                    Integer(IntegerType { bit_width: 32 }),
                                    Integer(IntegerType { bit_width: 32 }),
                                ],
                                is_vararg: false,
                            },
                        ),
                        Pointer(
                            PointerType {
                                pointee: Function(
                                    FunctionType {
                                        return_type: Integer(IntegerType { bit_width: 32 }),
                                        param_types: [
                                            Integer(IntegerType { bit_width: 32 }),
                                            Integer(IntegerType { bit_width: 32 }),
                                        ],
                                        is_vararg: false,
                                    },
                                ),
                                address_space: AddressSpace(0),
                            },
                        ),
                        Pointer(
                            PointerType {
                                pointee: Integer(IntegerType { bit_width: 32 }),
                                address_space: AddressSpace(0),
                            },
                        ),
                        Metadata,
                        Void,
                    ],
                ),
            },
            strtab: Strtab( [ /* lots of bytes */ ], ),
            symtab: Some(Symtab( [ /* lots of bytes */ ])),
        },
    ],
}

That’s a lot, and we still haven’t even gotten to functions, blocks, and instructions yet!

I’m not 100% sure what the next post in this series will look like, or even if there will be one: the path forward is less about LLVM’s internals and more about a series of grueling engineering efforts to bring the various mollusc crates into compatibility with LLVM’s ability to parse its own bitcode. There are lots of details involved in that, but they boil down to the same fundamental mapping steps discussed above.

Longer term, I don’t really have a goal for mollusc. It’s not intended to be a drop-in replacement for various LLVM internals, and it’s not intended to be a replacement for LLVM itself. Maybe it’ll be useful for doing static analysis in pure Rust; I don’t know! To wit: if the work here interests you, please help me with it! But don’t confuse this for the RIIR to top all RIIRs — it is not and will never be a rewrite of LLVM in Rust.

In other news, I’ve been accumulating LLVM patchsets throughout this work. To highlight a few:

  • D107536 modifies llvm-bcanalyzer to include a flag that dumps the BLOCKINFO_BLOCK. This is undesirable for normal users, but is very desirable when writing your own bitstream parser.
  • D108441 fixes the Clang frontend’s -ast-dump=json behavior when used in conjunction with -ast-dump-filter=... — previous versions would emit invalid JSON due to logging information that was spat out on the same output stream. The fix still produces invalid JSON, but in an easier-to-fix format: the output stream is now a contiguous sequence of JSON documents, meaning that a parser can be told to pick up on the next document immediately after the preceding one ends.
  • D108962 fixes the LangRef documentation for datalayout, making it more clear which fields are optional and what constraints LLVM puts on them during parsing.
  • D109655 fixes a small logic error in LLVM’s BitcodeReader that could cause a vector in the type table to be misconstructed from a malformed IR file.

All of these need review, but I’m a complete novice when it comes to the LLVM review process. If any readers would be kind enough to point me in the right direction for getting them reviewed and merged, I would be extremely grateful!


  1. Critically, our process, not LLVM’s. LLVM’s bitcode reader is much lazier (and in some ways, much lower level) than the one I’ve implemented; you can find it here.

  2. Maybe in the future. 

  3. I’m still not 100% clear on these: LLVM’s code makes it seem like they were introduced for speeding up LTO, but the STRTAB_BLOCK at the very least seems to now be the standard string table representation. SYMTAB_BLOCK appears to be a serialized representation of a C structure that provides indirect references into the corresponding string table, which I’m sure will be absolutely ~~painful~~ delightful to model in Rust.

  4. Which, curiously, is increasingly hard to find references to in LLVM’s documentation. But it’s still installed with my distribution’s build of LLVM. 

  5. Why partial? Because we want to make invalid states (e.g., a module that’s missing an IDENTIFICATION_BLOCK) unrepresentable in the final API that users are expected to consume. Rust makes us jump through some extra hoops to do that, but the end result is that every BitcodeModule is correct by construction.

  6. This reification step includes mapping into IR models. More on that in the next section of the post. 

  7. Presumably LTO reasons. 

  8. Of course, we actually can. But it wouldn’t be very idiomatic of us. 

  9. Well, not quite: I haven’t fully figured out whether this is the best abstraction to use, particularly when only a small subset of blocks need context from other blocks. This may change in the future, but any changes will be API-internal ones. 

  10. My current woe being LLVM’s type table (i.e., TYPE_BLOCK_ID_NEW). 


