

LLVM internals, part 3: from bitcode to IR
source link: https://blog.yossarian.net/2021/09/14/LLVM-internals-part-3-from-bitcode-to-IR

ENOSUCHBLOG
Programming, philosophy, pedaling.
Sep 14, 2021
Series: llvm-internals
Preword
The last two posts of this series have focused on the nitty-gritty of LLVM’s bitstream format:
- We performed a conceptual overview of the bitcode file structure and the bitstream container format that it’s based on, culminating in a bitcursor abstraction for Rust.
- We then applied those concepts to the task of actually parsing arbitrary bitstreams, culminating in a crate for turning a bitstream into a sequence of blocks and records.
This post marks a turning point: now that we have reasonable abstractions for the bitstream container itself, we can focus on mapping it into a form that resembles LLVM’s IR. We’ll cover some of the critical steps in that process1 below, introducing a new crate (llvm-mapper) in the process.
Also, critically: this post is the first in the series where our approach to parsing and interpreting LLVM’s bitcode differs significantly from how LLVM’s own codebase does things. The details of the post should still be interesting to anyone who wants to learn the details of how an IR-level LLVM module is constructed, but will not reflect how LLVM itself does that construction2.
Step 1: Unrolling the bitstream
At the bitstream layer, our top-level API looks like this:
// `from` parses both "raw" bitstreams and ones that have a bitcode wrapper,
// so discard the wrapper as _ if we see one.
let (_, stream) = Bitstream::from(raw_bitcode)?;
for entry in stream {
    // This is a lazy iterator, so each entry's parse can fail.
    println!("{:?}", entry?);
}
This produces a linear sequence of stream entries, each of which can be one of the core bitstream types: a new block, a record, or the end of a block:
[BLOCK(n), RECORD, RECORD, BLOCK(k), RECORD, RECORD, END_BLOCK, RECORD, ..., END_BLOCK]
This is easy to emit, but it isn’t easy to consume! Instead, we want a hierarchical structure that reflects the owning relationship between records and blocks, like so:
BLOCK(n)
RECORD
RECORD
BLOCK(k)
RECORD
RECORD
END_BLOCK
RECORD
...
END_BLOCK
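Conceptually, this unrolling step is just bracket matching over the entry stream: push a new block on BLOCK, attach records to the innermost open block, and pop (attaching to the parent) on END_BLOCK. Here's a minimal, self-contained sketch of that loop, using simplified stand-in types; the real entry and block types in llvm-mapper carry more state (abbreviation definitions, sub-block metadata, etc.) than these do:

```rust
// Simplified stand-ins for the real stream entry and unrolled block types.
#[derive(Debug, PartialEq)]
enum Entry {
    EnterBlock(u64),  // BLOCK(n)
    Record(Vec<u64>), // RECORD
    EndBlock,         // END_BLOCK
}

#[derive(Debug, Default, PartialEq)]
struct Block {
    id: u64,
    records: Vec<Vec<u64>>,
    blocks: Vec<Block>,
}

/// Unroll a flat entry sequence into the top-level blocks it encodes.
fn unroll(entries: impl IntoIterator<Item = Entry>) -> Result<Vec<Block>, String> {
    let mut stack: Vec<Block> = Vec::new(); // currently-open blocks
    let mut top_level = Vec::new();
    for entry in entries {
        match entry {
            Entry::EnterBlock(id) => stack.push(Block { id, ..Default::default() }),
            Entry::Record(fields) => stack
                .last_mut()
                .ok_or("record outside of any block")?
                .records
                .push(fields),
            Entry::EndBlock => {
                let done = stack.pop().ok_or("END_BLOCK without matching block")?;
                match stack.last_mut() {
                    // A finished sub-block belongs to its parent...
                    Some(parent) => parent.blocks.push(done),
                    // ...and a finished top-level block to the output.
                    None => top_level.push(done),
                }
            }
        }
    }
    if !stack.is_empty() {
        return Err("stream ended with unclosed blocks".into());
    }
    Ok(top_level)
}
```

The error paths matter: a record outside any block, or mismatched BLOCK/END_BLOCK pairs, means the bitstream is malformed and unrolling must fail rather than guess.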
To get to the latter, we have to “unroll” the bitstream into a hierarchy of blocks and records. More accurately, we have to conform to LLVM’s understanding of the minimal context associated with every IR module:
- A top-level IDENTIFICATION_BLOCK, providing version and epoch metadata
- A top-level MODULE_BLOCK, providing the core structure of the ultimate llvm::Module
- A top-level STRTAB_BLOCK, providing the string table for the module3
- An (optional) top-level SYMTAB_BLOCK, providing an indirect symbol table for the module3
Oh, and a bitstream can contain multiple IR modules to boot (thanks to llvm-cat4 and LTO).
So our data flow and model really look like this:
Bitstream -> UnrolledBitstream
UnrolledBitstream -> UnrolledBitcode
UnrolledBitcode -> [BitcodeModule]
The code for actually performing this unrolling is a little too long to paste inline, but you can view it here. To summarize:
- We enter the bitstream, starting with an empty set of “partial”5 bitcode modules, each of which looks like this:

struct PartialBitcodeModule {
    identification: UnrolledBlock,
    module: Option<UnrolledBlock>,
    strtab: Option<UnrolledBlock>,
    symtab: Option<UnrolledBlock>,
}

- We visit each top-level block, collecting all sub-blocks in each top-level block.
- We attempt to reify6 each “partial” module into a well-formed BitcodeModule:

pub(self) fn reify(self) -> Result<BitcodeModule, Error> {
    let mut ctx = MapCtx::default();

    // Grab the string table early, so that we can move it into our mapping context and
    // use it for the remainder of the mapping phase.
    let strtab = Strtab::try_map(
        &self.strtab.ok_or_else(|| {
            Error::BadUnroll("missing STRTAB_BLOCK for bitcode module".into())
        })?,
        &mut ctx,
    )?;
    ctx.strtab = Some(strtab);

    let identification = Identification::try_map(&self.identification, &mut ctx)?;
    let module = Module::try_map(
        &self.module.ok_or_else(|| {
            Error::BadUnroll("missing MODULE_BLOCK for bitcode module".into())
        })?,
        &mut ctx,
    )?;
    let symtab = self
        .symtab
        .map(|s| Symtab::try_map(&s, &mut ctx))
        .transpose()?;

    #[allow(clippy::unwrap_used)]
    Ok(BitcodeModule {
        identification: identification,
        module: module,
        // Unwrap safety: we unconditionally assign `strtab` to `Some(...)` above.
        strtab: ctx.strtab.unwrap(),
        symtab: symtab,
    })
}

- We collect all of the well-formed BitcodeModules into an UnrolledBitcode container:

let modules = partial_modules
    .into_iter()
    .map(|p| p.reify())
    .collect::<Result<Vec<_>, _>>()?;

let unrolled = UnrolledBitcode { modules };
The end result: a mapped hierarchy of blocks and records, corresponding to one or more LLVM IR modules.
Step 2: From unrolled blocks to IR models
In the process above, one of our end results is a BitcodeModule for each appropriate sequence of top-level blocks in the bitstream. That BitcodeModule looks something like this:
pub struct BitcodeModule {
    /// The identification block associated with this module.
    pub identification: Identification,
    /// The module block associated with this module.
    pub module: Module,
    /// The string table associated with this module.
    pub strtab: Strtab,
    /// The symbol table associated with this module, if it has one.
    pub symtab: Option<Symtab>,
}
But wait: where did Identification, Module, &c come from? We’ve skipped a critical step!
As hinted in the reification step during unrolling, we have to map various features in the bitstream into their corresponding IR-level concepts. This process comes with its own subtleties:
- Correct mapping is context sensitive. Records and sub-blocks within the MODULE_BLOCK can’t be mapped in an arbitrary order: record layouts depend on version metadata within the MODULE_BLOCK, and various fields depend on layout information that must be extracted from the datalayout specification. Records and blocks that reference types can only be correctly mapped after the TYPE_BLOCK within the module has been parsed.
- Similarly: correct mapping requires extra-modular state. Individual records are allowed to reference the string and symbol tables, which exist outside of their associated MODULE_BLOCK for reasons (tm)7. This state must be available during mapping.
- Mapping of bitstream entities onto IR features is not 1:1. Information about functions is spread across both MODULE_CODE_FUNCTION records and FUNCTION_BLOCK sub-blocks within the MODULE_BLOCK, as well as references to a fully parsed type table. Correct mapping requires that each of the above be interpreted and associated with each other into a unified model.
Each of these considerations requires significant, dirty-handed engineering, particularly in the context of a pure Rust implementation: we can’t8 leverage god-like global context objects the way LLVM’s C++ internals can.
Mapping individual blocks and records
The heart of our mapping abstraction in llvm-mapper is the Mappable trait, which (currently) looks something like this:
/// A trait for mapping some raw `T` into a model type.
pub(crate) trait Mappable<T>: Sized {
    type Error;

    /// Attempt to map `T` into `Self` using the given [`MapCtx`](MapCtx).
    fn try_map(raw: &T, ctx: &mut MapCtx) -> Result<Self, Self::Error>;
}
That’s it! All it does is take an input T by reference and (fallibly) produce the implementing type from it, using the associated error type when the operation fails. The associated MapCtx holds the ugly context details that I mentioned above: copies of the module-level string table, version information, and anything else that might influence mapping decisions9.
This abstraction ends up being reasonable for both blocks and records: thanks to unrolling, blocks now contain (and own) all of their constituent records. Records in turn should be independent of their adjacent records, with sparing exceptions handled by the MapCtx.
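To make the pattern concrete, here's a self-contained toy use of a Mappable-style trait: mapping a raw record (just a list of u64 fields) into a model type by consulting a string table held in the context. The MapCtx, Function, and record layout here are illustrative, not the crate's actual API, though (offset, length) pairs into the string table are roughly how newer module records reference names:

```rust
// A toy `MapCtx`: just the module-level string table, as raw bytes.
struct MapCtx {
    strtab: Vec<u8>,
}

// The same shape as the `Mappable` trait above.
trait Mappable<T>: Sized {
    type Error;
    fn try_map(raw: &T, ctx: &mut MapCtx) -> Result<Self, Self::Error>;
}

struct Function {
    name: String,
}

impl Mappable<Vec<u64>> for Function {
    type Error = String;

    // Assume fields 0 and 1 are an (offset, length) pair into the string
    // table; everything else about this record layout is made up.
    fn try_map(raw: &Vec<u64>, ctx: &mut MapCtx) -> Result<Self, Self::Error> {
        let (off, len) = match raw.as_slice() {
            [off, len, ..] => (*off as usize, *len as usize),
            _ => return Err("record too short".into()),
        };
        let bytes = ctx
            .strtab
            .get(off..off + len)
            .ok_or("string table range out of bounds")?;
        Ok(Function {
            name: String::from_utf8(bytes.to_vec()).map_err(|e| e.to_string())?,
        })
    }
}
```

The design point: the record alone is meaningless without the context, and the context alone can't tell you how to interpret the fields. Mapping is exactly the marriage of the two.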
Because blocks are uniquely identified by their block ID (and all block IDs share the same namespace), we use an additional IrBlock trait with a blanket implementation to strongly enforce that the block ID is checked:
/// A trait implemented by all blocks that correspond to IR models, allowing them
/// to be mapped into their corresponding model.
pub(crate) trait IrBlock: Sized {
    /// The `IrBlockId` that corresponds to this IR model.
    const BLOCK_ID: IrBlockId;

    /// Attempt to map the given block to the implementing type, returning an error if mapping fails.
    ///
    /// This is an interior trait that shouldn't be used directly.
    fn try_map_inner(block: &UnrolledBlock, ctx: &mut MapCtx) -> Result<Self, BlockMapError>;
}

impl<T: IrBlock> Mappable<UnrolledBlock> for T {
    type Error = BlockMapError;

    fn try_map(block: &UnrolledBlock, ctx: &mut MapCtx) -> Result<Self, Self::Error> {
        if block.id != BlockId::Ir(T::BLOCK_ID) {
            return Err(BlockMapError::BadBlockMap(format!(
                "can't map {:?} into {:?}",
                block.id,
                T::BLOCK_ID
            )));
        }

        IrBlock::try_map_inner(block, ctx)
    }
}
In plain English: individual blocks don’t implement Mappable<UnrolledBlock>. Instead, they implement IrBlock, which ensures that the blanket block ID check is always performed.
The simplest example of this is the IDENTIFICATION_BLOCK, corresponding to the Identification struct that we saw referenced earlier:
/// Models the `IDENTIFICATION_BLOCK` block.
#[non_exhaustive]
#[derive(Debug)]
pub struct Identification {
    /// The name of the "producer" for this bitcode.
    pub producer: String,
    /// The compatibility epoch.
    pub epoch: u64,
}

impl IrBlock for Identification {
    const BLOCK_ID: IrBlockId = IrBlockId::Identification;

    fn try_map_inner(block: &UnrolledBlock, _ctx: &mut MapCtx) -> Result<Self, BlockMapError> {
        let producer = {
            let producer = block.one_record(IdentificationCode::ProducerString as u64)?;
            producer.try_string(0)?
        };

        let epoch = {
            let epoch = block.one_record(IdentificationCode::Epoch as u64)?;
            epoch.get_field(0)?
        };

        Ok(Self { producer, epoch })
    }
}
…and rinse and repeat for the dozen plus other blocks that can be present in a valid bitcode file. Plus their internal records, which vary widely in terms of needed context and parsing strategy10. Correctly mapping each of these (and testing them) is going to be a task on the order of months, if not years. LLVM is big!
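As an aside: record helpers like try_string lean on the fact that many bitcode records (the producer string above included) store a string as one character per record field. A sketch of such a helper, with illustrative names rather than the crate's actual implementation:

```rust
// A toy unrolled record: just its fields, each a u64.
struct Record {
    fields: Vec<u64>,
}

impl Record {
    /// Interpret the fields from `idx` onward as a UTF-8 string,
    /// failing if any field doesn't fit in a byte.
    fn try_string(&self, idx: usize) -> Result<String, String> {
        let bytes = self
            .fields
            .get(idx..)
            .ok_or("field index out of range")?
            .iter()
            .map(|&f| u8::try_from(f).map_err(|_| "field out of byte range".to_string()))
            .collect::<Result<Vec<u8>, _>>()?;
        String::from_utf8(bytes).map_err(|e| e.to_string())
    }
}
```

The `idx` parameter exists because some records prefix their string payload with other fields, so the caller decides where the string starts.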
Next steps
As of this post, here’s what the various mollusc crates are capable of producing:
$ ./target/debug/examples/unroll-bitstream ./sum.bc
UnrolledBitcode {
modules: [
BitcodeModule {
identification: Identification {
producer: "LLVM10.0.0",
epoch: 0,
},
module: Module {
version: 2,
triple: "x86_64-pc-linux-gnu",
datalayout: DataLayout {
endianness: Little,
natural_stack_alignment: Some(Align { byte_align: 16 }),
program_address_space: AddressSpace(0),
global_variable_address_space: AddressSpace(0),
alloca_address_space: AddressSpace(0),
type_alignments: TypeAlignSpecs(
[
TypeAlignSpec {
aligned_type: Aggregate,
abi_alignment: Align { byte_align: 8 },
preferred_alignment: Align { byte_align: 8 },
},
/* many more */
],
),
pointer_alignments: PointerAlignSpecs(
[
PointerAlignSpec {
address_space: AddressSpace(0),
abi_alignment: Align { byte_align: 8 },
preferred_alignment: Align { byte_align: 8 },
pointer_size: 64,
index_size: 64,
},
PointerAlignSpec {
address_space: AddressSpace(270),
abi_alignment: Align { byte_align: 4 },
preferred_alignment: Align { byte_align: 4 },
pointer_size: 32,
index_size: 32,
},
PointerAlignSpec {
address_space: AddressSpace(271),
abi_alignment: Align { byte_align: 4 },
preferred_alignment: Align { byte_align: 4 },
pointer_size: 32,
index_size: 32,
},
PointerAlignSpec {
address_space: AddressSpace(272),
abi_alignment: Align { byte_align: 8 },
preferred_alignment: Align { byte_align: 8 },
pointer_size: 64,
index_size: 64,
},
],
),
aggregate_alignment: Align { byte_align: 1 },
function_pointer_alignment: None,
mangling: Some(Elf),
native_integer_widths: [ 8, 16, 32, 64 ],
non_integral_address_spaces: [],
},
asm: [],
deplibs: [],
type_table: TypeTable(
[
Integer(IntegerType { bit_width: 32 }),
Function(
FunctionType {
return_type: Integer(IntegerType { bit_width: 32 }),
param_types: [
Integer(IntegerType { bit_width: 32 }),
Integer(IntegerType { bit_width: 32 }),
],
is_vararg: false,
},
),
Pointer(
PointerType {
pointee: Function(
FunctionType {
return_type: Integer(IntegerType { bit_width: 32 }),
param_types: [
Integer(IntegerType { bit_width: 32 }),
Integer(IntegerType { bit_width: 32 }),
],
is_vararg: false,
},
),
address_space: AddressSpace(0),
},
),
Pointer(
PointerType {
pointee: Integer(IntegerType { bit_width: 32 }),
address_space: AddressSpace(0),
},
),
Metadata,
Void,
],
),
},
strtab: Strtab( [ /* lots of bytes */ ], ),
symtab: Some(Symtab( [ /* lots of bytes */ ])),
},
],
}
That’s a lot, and we still haven’t even gotten to functions, blocks, and instructions yet!
I’m not 100% sure what the next post in this series will look like, or even if there will be one: the path forwards is less about LLVM’s internals and more a series of grueling engineering efforts to bring the various mollusc crates into compatibility with LLVM’s ability to parse its own bitcode. There are lots of details involved in that, but they boil down to the same fundamental mapping steps discussed above.
Longer term, I don’t really have a goal for mollusc. It’s not intended to be a drop-in replacement for various LLVM internals, and it’s not intended to be a replacement for LLVM itself. Maybe it’ll be useful for doing static analysis in pure Rust; I don’t know! To wit: if the work here interests you, please help me with it! But don’t confuse this for the RIIR to top all RIIRs — it is not and will never be a rewrite of LLVM in Rust.
In other news, I’ve been accumulating LLVM patchsets throughout this work. To highlight a couple:
- D107536 modifies llvm-bcanalyzer to include a flag that dumps the BLOCKINFO_BLOCK. This is undesirable for normal users, but is very desirable when writing your own bitstream parser.
- D108441 fixes the Clang frontend’s -ast-dump=json behavior when used in conjunction with -ast-dump-filter=...: previous versions would emit invalid JSON due to some logging information that was spat out on the same output stream. The fix still produces invalid JSON, but in an easier-to-fix format: the output stream is now a contiguous sequence of JSON documents, meaning that a parser can be told to pick up on the next document immediately after the preceding one ends.
- D108962 fixes the LangRef documentation for datalayout, making it clearer which fields are optional and what constraints LLVM puts on them during parsing.
- D109655 fixes a small logic error in LLVM’s BitcodeReader that could cause a vector in the type table to be misconstructed from a malformed IR file.
All of these need review, but I’m a complete novice when it comes to the LLVM review process. If any readers would be kind enough to point me in the right direction for getting them reviewed and merged, I would be extremely grateful!
1. Critically, our process, not LLVM’s. LLVM’s bitcode reader is much lazier (and in some ways, much lower level) than the one I’ve implemented; you can find it here. ↩
2. Maybe in the future. ↩
3. I’m still not 100% clear on these: LLVM’s code makes it seem like they were introduced for speeding up LTO, but the STRTAB_BLOCK at the very least seems to now be the standard string table representation. SYMTAB_BLOCK appears to be a serialized representation of a C structure that provides indirect references into the corresponding string table, which I’m sure will be absolutely ~~painful~~ delightful to model in Rust. ↩ ↩2
4. Which, curiously, is increasingly hard to find references to in LLVM’s documentation. But it’s still installed with my distribution’s build of LLVM. ↩
5. Why partial? Because we want to make invalid states (e.g., a module that’s missing an IDENTIFICATION_BLOCK) unrepresentable in the final API that users are expected to consume. Rust makes us jump through some extra hoops to do that, but the end result is that every BitcodeModule is correct by construction. ↩
6. This reification step includes mapping into IR models. More on that in the next section of the post. ↩
7. Presumably LTO reasons. ↩
8. Of course, we actually can. But it wouldn’t be very idiomatic of us. ↩
9. Well, not quite: I haven’t fully figured out whether this is the best abstraction to use, particularly when only a small subset of blocks need context from other blocks. This may change in the future, but any changes will be API-internal ones. ↩
10. My current woe being LLVM’s type table (i.e., TYPE_BLOCK_ID_NEW). ↩
Reddit discussion