Background
This is part 1 in a multi-part series. Part 2 can be found here.
Over the past eight years, I've been writing two projects: OILexer and Abstraction. Abstraction's focus was initially on outlining a Compiler framework for the .NET CLI.
As time went on, it became clear that reflection just wasn't enough for me to get the nitty-gritty details that I needed to roll my own compiler.
"But wait", you say, ".NET has Reflection, Reflection.Emit and AssemblyBuilder built in!"
Yes, it is true: .NET by default allows you to emit assemblies as you please. However, doing so requires a pretty deep understanding of its Common Intermediate Language (CIL). You are also limited to building for the version of .NET you're running under.
You can adjust the resulting assembly's configuration file, or use the underlying CLR Native APIs, but the former doesn't really generate a 2.0 library, and the latter is a bit more complex because at that point you're not really using the AssemblyBuilder / ILGenerator any longer.
Introduction
Let's say you want to write your own compiler, and you want complete control. That's our focus today. The first thing you must understand is this: it's not going to be quick or easy.
As it stands, I've written a library that understands .NET Metadata in all its tables and does basic ILDasm-level disassembly on method bodies. I can translate object graphs into code, but I haven't yet gotten to the point of taking those graphs and writing Metadata out of them. So, we'll focus this series on what I have done.
There's a lot to understand. I won't repeat the exact nature of things here, because ECMA-335 explains it more completely than I do; however, I will give a basic overview:
- PE Headers - The parts that make up the portable executable, which is the executable format used by Windows operating systems.
- CLI Header - The extension to the PE headers that outlines where the Metadata (the third item in this list) is, where the entrypoint is, the version of the framework you're targeting, and so on.
- CLI Data Streams - The actual data that says what constants you use, the metadata tables, the names of things within the metadata tables, and the binary blobs.
PE Headers
MS DOS Header
This section starts out with details that many may not know still exist as a part of Windows executables: the MS DOS header. Most .NET Applications use a fixed set of bytes to represent this MS-DOS based program, with the exception of a field named 'lfanew', which is the offset to the Portable Executable signature. That signature is then followed by the Windows version of the application.
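As a rough illustration of following 'lfanew', here is a minimal Python sketch. It assumes the standard PE layout, in which the 4-byte little-endian 'lfanew' field sits at offset 0x3C of the MS DOS header. The function name and the synthetic image are hypothetical and are not taken from Abstraction.

```python
import struct

PE_SIGNATURE = b"PE\x00\x00"
LFANEW_OFFSET = 0x3C  # position of the 4-byte 'lfanew' field in the MS DOS header


def find_pe_signature(image: bytes) -> int:
    """Return the file offset of the PE signature inside `image`."""
    if image[:2] != b"MZ":
        raise ValueError("missing MS-DOS 'MZ' magic")
    if len(image) < LFANEW_OFFSET + 4:
        raise ValueError("image is too short to contain 'lfanew'")
    # 'lfanew' is stored as an unsigned 32-bit little-endian integer.
    (lfanew,) = struct.unpack_from("<I", image, LFANEW_OFFSET)
    if image[lfanew:lfanew + 4] != PE_SIGNATURE:
        raise ValueError("'lfanew' does not point at a PE signature")
    return lfanew


# Synthetic image (not a real executable): just enough bytes to exercise the lookup.
fake_image = bytearray(0x90)
fake_image[0:2] = b"MZ"
struct.pack_into("<I", fake_image, LFANEW_OFFSET, 0x80)
fake_image[0x80:0x84] = PE_SIGNATURE

print(hex(find_pe_signature(bytes(fake_image))))  # prints 0x80
```

The PE file header starts immediately after the four signature bytes, so a reader would continue parsing at the returned offset plus 4.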
Windows Headers
The Windows headers tend to be a bit more well defined within the ECMA-335 specification. Since they're not the major focus of this article, we'll explain them simply:
- PEHeader - Lays out the structure of the PE image
- Coff (Common Object File Format) Sections
- PE Optional Header - For .NET Applications, this is mandatory, as the Data Directories point to the details we're interested in. If you're writing your own metadata reader, you must understand the PE Optional header in some detail. You must fully grok it if you plan on writing a compiler (or at least fake it well enough that the .NET Apps you write don't fail to load).
  - Standard Fields
  - NT Specific Fields
  - Data Directories - A series of address/size pairs which point to data within the PE image.
    - We're interested in the CLI Header pointer, which will give us the location of the data for .NET Applications.
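To make the address/size pairs concrete, the following Python sketch pulls the CLI Header entry out of a raw PE Optional Header. It assumes the standard layout: the pairs begin 96 bytes into a PE32 optional header (112 bytes for PE32+), the 4 bytes before them hold the number of pairs, and the CLI Header is entry 14. The function name and sample values are hypothetical.

```python
import struct
from typing import Tuple

CLI_HEADER_DIRECTORY = 14  # index of the CLI Header entry among the address/size pairs


def cli_header_directory(optional_header: bytes) -> Tuple[int, int]:
    """Return (rva, size) of the CLI Header from a raw PE Optional Header."""
    (magic,) = struct.unpack_from("<H", optional_header, 0)
    if magic == 0x10B:    # PE32
        directories_start = 96
    elif magic == 0x20B:  # PE32+
        directories_start = 112
    else:
        raise ValueError(f"unexpected optional header magic {magic:#x}")
    # The count of address/size pairs is stored just before the pairs themselves.
    (count,) = struct.unpack_from("<I", optional_header, directories_start - 4)
    if count <= CLI_HEADER_DIRECTORY:
        raise ValueError("image has no CLI Header entry")
    entry = directories_start + 8 * CLI_HEADER_DIRECTORY
    rva, size = struct.unpack_from("<II", optional_header, entry)
    return rva, size


# Synthetic PE32 optional header: 96 bytes of fields plus 16 address/size pairs.
fake_header = bytearray(96 + 16 * 8)
struct.pack_into("<H", fake_header, 0, 0x10B)
struct.pack_into("<I", fake_header, 92, 16)
struct.pack_into("<II", fake_header, 96 + 8 * CLI_HEADER_DIRECTORY, 0x2008, 72)

rva, size = cli_header_directory(bytes(fake_header))
print(hex(rva), size)  # prints 0x2008 72
```

An entry whose address and size are both zero means the image carries no CLI Header, which is how a reader can tell a native executable from a .NET one.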
CLI Header
The CLI Header contains:
- Cb - the size of the header
- MajorRuntimeVersion
- MinorRuntimeVersion
- MetaData - the RVA and Size of the metadata sections
- Flags - CLI-relevant details about the image, such as 'Requires 32-bit process, Signed, native entry-point defined, IL-Only'
- EntryPointToken - Token to the 'Main' method for applications (non-libraries)
- Resources
- StrongNameSignature
- CodeManagerTable - Always Zero
- VTableFixups
- ExportAddressTableJumps
- ManagedNativeHeader
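The fields above map onto a fixed 72-byte little-endian structure. The Python sketch below decodes it, assuming the field order listed above with each RVA/Size pair stored as two 32-bit integers. The flag values are the ones ECMA-335 gives for the runtime flags; the tuple and function names are hypothetical.

```python
import struct
from collections import namedtuple

CliHeader = namedtuple("CliHeader", [
    "cb", "major_runtime_version", "minor_runtime_version",
    "metadata_rva", "metadata_size", "flags", "entry_point_token",
    "resources_rva", "resources_size",
    "strong_name_signature_rva", "strong_name_signature_size",
    "code_manager_table_rva", "code_manager_table_size",
    "vtable_fixups_rva", "vtable_fixups_size",
    "export_address_table_jumps_rva", "export_address_table_jumps_size",
    "managed_native_header_rva", "managed_native_header_size",
])

# Cb (4 bytes), two 2-byte versions, then sixteen 4-byte values: 72 bytes in total.
CLI_HEADER_FORMAT = "<IHH" + "I" * 16

FLAG_IL_ONLY = 0x01
FLAG_32BIT_REQUIRED = 0x02
FLAG_STRONG_NAME_SIGNED = 0x08
FLAG_NATIVE_ENTRY_POINT = 0x10


def parse_cli_header(data: bytes) -> CliHeader:
    """Decode the CLI Header found at the start of `data`."""
    header = CliHeader._make(struct.unpack_from(CLI_HEADER_FORMAT, data))
    if header.cb != struct.calcsize(CLI_HEADER_FORMAT):
        raise ValueError(f"unexpected CLI Header size {header.cb}")
    return header


# Synthetic header with made-up values, for illustration only.
raw = struct.pack(CLI_HEADER_FORMAT, 72, 2, 5, 0x2050, 0x400,
                  FLAG_IL_ONLY, 0x06000001, *([0] * 12))
header = parse_cli_header(raw)
print(hex(header.metadata_rva), bool(header.flags & FLAG_IL_ONLY))  # prints 0x2050 True
```

The top byte of the EntryPointToken identifies the metadata table it refers to; 0x06 in the sample value is the method definition table.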
RVA
Relative Virtual Address: A relative virtual address is a 'pointer' to where the data should be in a virtual sense. Once a library or application is loaded, the locations of things may change based on how the operating system wants to arrange the data. These virtual addresses give you a sense of where the data is, and usually require you to 'resolve' them. Since we're not actually loading the library to execute it, we can simplify our resolution rules.
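To resolve an RVA, scan through the Coff sections of the PE image until you find the one whose range contains it: the RVA must be at or above that section's Virtual Address and below that address plus the size of its raw data. The file offset is then the RVA minus the section's Virtual Address, plus the section's pointer to its raw data.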
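The simplified resolution rule can be sketched in a few lines of Python. This is an illustration only, not the ResolveRelativeVirtualAddress method from Abstraction; the Section type, the function name, and the sample section values are all hypothetical.

```python
from typing import NamedTuple, Sequence


class Section(NamedTuple):
    name: str
    virtual_address: int      # RVA at which the section starts once loaded
    size_of_raw_data: int     # number of bytes the section occupies in the file
    pointer_to_raw_data: int  # file offset at which the section's bytes start


def resolve_rva(rva: int, sections: Sequence[Section]) -> int:
    """Translate an RVA into a file offset by finding the section that contains it."""
    for section in sections:
        start = section.virtual_address
        end = start + section.size_of_raw_data
        if start <= rva < end:
            return rva - start + section.pointer_to_raw_data
    raise ValueError(f"RVA {rva:#x} is not inside any section")


# Made-up section table resembling a small .NET assembly.
sections = [
    Section(".text", 0x2000, 0x1000, 0x200),
    Section(".rsrc", 0x4000, 0x400, 0x1200),
]

print(hex(resolve_rva(0x2008, sections)))  # prints 0x208
```

Because the file is only being read and not loaded for execution, no image base or relocation handling is needed; the section table alone is enough.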
Getting to the Point of it All
We resolve the RVA of the CLI Header (see ResolveRelativeVirtualAddress in PEImage), read it in, then resolve the RVA in the CLI Header's MetaData field to get the location of the data streams within the CLI Application.
It took knowing about all of the above just to know where the data is! We haven't even begun grokking it yet.
To Be Continued...
We'll start breaking into the metadata streams and their meanings in the next installment.
- EDIT (June 12, 2016): Added links to the relevant areas
- EDIT (June 15, 2016): Replaced links with (fixed) Github targets.
- EDIT (June 16, 2016): Added link to part 2