
Whole Program Optimization with Visual C++ .NET

Brandon Bray (MSFT)


10 Dec 2001


Introduction

The charter of compiler optimization has always been to produce the fastest running programs possible. Developers trying to write performance-tuned programs are in a continuous endurance trial of writing code in ways that lead to optimization opportunities for the compiler. Historically, compilers have introduced scalar optimizations that work only on isolated pieces of a program, usually only inside functions. Visual C++ .NET goes to an entirely new level with Whole Program Optimization. This article discusses everything that the compiler can do using this new framework for optimization and how little the developer must do.

Before Visual C++ .NET

The first things that a person typically learns in a C++ course are using the compiler to compile code and understanding that the compiler is responsible for creating object files for each source file. The linker then brings all object files together into something useful, such as an executable or a DLL. From the beginning, the compiler is at a disadvantage - it can only see a small piece of the program at any point in time. Unable to see other pieces, the compiler must use a conservative approach that results in slowing a program down. A classic example of this is calling conventions.

The interfaces to each module of a program need to remain consistent. Common calling conventions are cdecl, stdcall, and fastcall. Mixing and matching calling conventions inside a program was possible, but it required the developer to annotate the function signature with keywords. The developer was not necessarily the best person to make the decision for what would be the best calling convention for each function. The compiler could not really make this decision either, however, without breaking the interface to other modules. There are many similar examples where programs could be improved if the compiler had access to the whole program. For example, inlining could only happen inside individual object files. This program generates two unnecessary functions:

myclass.h
class MyClass {
private:
     int i;
public:
     void set_i(int n);
     void print_i();
};

myclass.cpp
#include <stdio.h>

#include "myclass.h"


void MyClass::set_i(int n) {
     i = n;
}

void MyClass::print_i() {
     printf("\"i\" is %i.\n", i);
}

main.cpp

#include "myclass.h"


int main(int argc, char* argv[])
{
     MyClass myclass;

     myclass.set_i(42);
     myclass.print_i();

     return 0;
}

Both set_i and print_i are inline candidates. Unfortunately, when the compiler is working on main.cpp, it does not have access to the implementation in myclass.cpp. Developers can work around this by putting inline candidates in the header file, but it is better coding practice to leave the header file free of implementation details. In addition, not every function that the developer marks with the inline keyword should be inlined, and some functions that are not marked with the inline keyword are great inline candidates. Again, the compiler is always at a disadvantage because it does not have access to the entire program.
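
To make the header workaround concrete, here is a minimal sketch of the same class with both member functions defined inside the class body, which makes them implicitly inline and visible to every source file that includes the header. The names MyClassInline and get_i are hypothetical and were introduced only for illustration; they are not part of the original sample.

```cpp
// myclass_inline.h (hypothetical variant of myclass.h, for illustration only)
#include <cstdio>

class MyClassInline {
private:
    int i = 0;
public:
    // Defined in the class body, so implicitly inline: any source file
    // that includes this header can see the implementation.
    void set_i(int n) { i = n; }
    void print_i() const { std::printf("\"i\" is %i.\n", i); }
    // Hypothetical accessor, added so the stored value can be checked.
    int get_i() const { return i; }
};
```

The cost is the one described above: the implementation now lives in the header, and every change to it forces each including source file to be recompiled.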

With Visual C++ .NET

Link time code generation (LTCG), the Visual C++ .NET framework that makes whole program optimization possible, mitigates the difficulty a compiler has in performing optimizations. As the name implies, code generation does not occur until the linking stage. The steps that the compiler uses during an LTCG build can be summarized as follows:

1. The compiler takes each source file and does the usual parsing and type checking. It then generates an intermediate representation of the source file and hands it off to the optimizer and the code generator.
2. Instead of optimizing the intermediate representation, as it would normally do without LTCG, the compiler puts the intermediate representation in an object file. Note that the compiler basically does nothing to the code. Instead of containing assembly language, the object file has a higher level view of the program.
3. The linker now starts as usual, trying to pull all the object files together to form a program. Because the object files do not contain assembly code, the linker must invoke the compiler to finish the job of compiling the code. The linker has the compiler optimize and generate code for one function at a time. The compiler can ask the linker for information about other parts of the program and thus make informed decisions rather than always assuming the worst case.

The linking stage will take longer than usual, but the compiling stage will be much faster. Also note that the object files produced by the compiler through LTCG are not as portable as object files that contain assembly code. The intermediate representation stored in LTCG object files is likely to change with each version of Visual C++, so these object files would need to be regenerated every time that the compiler is upgraded. This situation only presents itself if the developer is trying to produce a .lib file. For that reason, unless the plan is to regenerate a new library for each future version of Visual C++, publicly distributing static-link libraries using LTCG for the object files is not recommended. Another consequence of including intermediate representations of the code in the object files, rather than assembly code, is that tools such as dumpbin.exe and editbin.exe do not work on these object files.

Optimizations Available to Whole Program Optimization

Cross-module inlining

As the previous example showed, cross-module inlining is perhaps the best reason to use whole program optimization. Instead of placing implementation details in header files, developers can now keep things neatly organized in an appropriate source file. It is not necessary to mark functions with the inline keyword, because the compiler can determine if it is beneficial to inline that function. This will happen when using the /Ob2 switch, which is implied by both /O1 and /O2. Sometimes, the release build in Visual Studio .NET will include /Ob1 on the command line; to enable cross-module inlining, do not include /Ob1, which only allows user-declared inline candidates to be inlined.

Cross-module bottom-up information

Often, the individual optimizations that the compiler can do are completely safe, but the information about the program is too conservative, and the compiler opts not to do the optimization in favor of accuracy. The compiler always generates information from the bottom of the call-tree. With whole program optimization, the scope of the information includes the entire program, including information collected about each function's register usage, memory usage, and information to improve inlining heuristics. With accurate information, the compiler does not need to make pessimistic decisions about whether a certain optimization is done.

Region based stack double alignment

Just as integers and pointers should be 4-byte aligned, doubles should be 8-byte aligned. By default, the stack in Win32 is 4-byte aligned. Misaligning data types results in significant performance loss. Without whole program optimization, the compiler has to generate code to dynamically align doubles on a per-function basis. Doing this is a challenge; the compiler cannot assume the position of the current stack frame. With whole program optimization, the compiler knows much more about the call-tree, and therefore, it can align the stack frame in a root function and keep things aligned through nested calls. Individual functions are then not penalized with figuring out the position of their stack frames.
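
The arithmetic behind the alignment problem can be sketched in a few lines. The helper names is_aligned and align_down are hypothetical, and the code is a conceptual illustration of the check and adjustment involved, not the code that the compiler actually emits.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true when addr is a multiple of alignment.
// alignment is assumed to be a power of two.
inline bool is_aligned(std::uintptr_t addr, std::size_t alignment) {
    return (addr & (static_cast<std::uintptr_t>(alignment) - 1)) == 0;
}

// Rounds addr down to a multiple of alignment (power of two). Because the
// x86 stack grows toward lower addresses, aligning a frame means rounding
// the stack pointer down in this way.
inline std::uintptr_t align_down(std::uintptr_t addr, std::size_t alignment) {
    return addr & ~(static_cast<std::uintptr_t>(alignment) - 1);
}
```

An address such as 0x1004 satisfies the 4-byte guarantee of the stack but not the 8-byte requirement of a double, which is why a function that cannot see its callers has to adjust its own frame.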

Custom calling convention

As previously mentioned, a single calling convention is not the best for every function. For example, functions passing only a few small arguments benefit greatly from fastcall, but using fastcall also strains the optimizer. The compiler is certainly a better judge of when to use a particular calling convention. With whole program optimization, the compiler knows about all the call sites for a particular function. This lets the compiler customize the calling convention. For example, function arguments could be passed through an available register rather than on the stack. Functions that are exposed outside the program, as would happen in a DLL, will necessarily retain their default calling convention.
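
For comparison, this is roughly what choosing a calling convention by hand looks like. The sketch assumes the Microsoft-specific __fastcall and __cdecl keywords; the macro names WPO_FASTCALL and WPO_CDECL and both functions are hypothetical, and the macros expand to nothing on other compilers so the code still builds there.

```cpp
// Portability shim: the calling-convention keywords are Microsoft extensions.
// WPO_FASTCALL and WPO_CDECL are hypothetical macro names for this sketch.
#if defined(_MSC_VER)
#  define WPO_FASTCALL __fastcall
#  define WPO_CDECL    __cdecl
#else
#  define WPO_FASTCALL
#  define WPO_CDECL
#endif

// Two small integer arguments: on 32-bit x86, __fastcall passes them in
// registers instead of on the stack.
int WPO_FASTCALL add_small(int a, int b) { return a + b; }

// The default C convention: on 32-bit x86, arguments are pushed on the stack.
int WPO_CDECL add_default(int a, int b) { return a + b; }
```

With whole program optimization, neither annotation should be needed for functions that are only called from inside the program, because the compiler can see every call site and pick the convention itself.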

Improved memory disambiguation for non-address taken globals

Before whole program optimization, the compiler had a hard time optimizing global variables. Optimizing them is worthwhile because global variables live in memory and are highly susceptible to cache misses. Unfortunately, because it does not have access to the whole program, the compiler often must assume that a global variable can be written to through an assignment to a pointer. With whole program optimization, the compiler and the linker can determine with better accuracy whether the address of a global variable is taken, so the compiler knows about any pointers to the global variable. If the variable cannot be changed through such a pointer, it can be treated more like a local variable and opened to standard code optimizations.
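
A small sketch shows the aliasing problem. The names g_scale, scale_all, and scale_all_cached are hypothetical. In the first function, a compiler that sees only this source file cannot rule out that out points at g_scale, so it may have to reload the global on every iteration; the second function shows the manual workaround of copying the global into a local first.

```cpp
#include <cstddef>

// Global with external linkage. Another module could take its address,
// and a single-file compile has no way of knowing whether one does.
int g_scale = 3;

// out might alias g_scale as far as this file alone can tell, so g_scale
// may need to be reloaded from memory after each store through out.
void scale_all(int* out, const int* in, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k) {
        out[k] = in[k] * g_scale;
    }
}

// Manual workaround: a local copy cannot be modified through out.
// Note that this differs in behavior if out really does point at g_scale.
void scale_all_cached(int* out, const int* in, std::size_t n) {
    const int scale = g_scale;
    for (std::size_t k = 0; k < n; ++k) {
        out[k] = in[k] * scale;
    }
}
```

If the whole program shows that the address of g_scale is never taken, the compiler is free to treat the first function like the second without any change to the source.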

Small TLS offset encoding

The x86 instruction set uses smaller instruction encodings when an offset is within 128 bytes of a pointer. When organizing the layout of variables in thread-local storage, it is better to place frequently used variables in the first 128 bytes of storage. The linker is the utility that organizes the layout for thread-local variables. Determining which variables are more frequently used requires knowing about the whole program. Knowing the position of the variables in thread-local storage allows the compiler to use a smaller instruction encoding for the variable offset. If a program is heavily threaded, whole program optimization could dramatically reduce the image size.
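
As a rough illustration, the sketch below declares one frequently used and one rarely used thread-local variable and expresses the one-byte displacement range as a predicate. All names are hypothetical, and the standard thread_local keyword is used here as an assumption; compilers of the Visual C++ .NET era would spell this __declspec(thread). The actual placement of the variables is decided by the linker, not by this code.

```cpp
#include <cstddef>

// Hypothetical thread-local data for illustration.
thread_local int  t_hot_counter = 0;      // touched on every call
thread_local char t_cold_scratch[4096];   // large and rarely used

// x86 can encode a displacement in a single signed byte when it lies in
// the range [-128, 127]; larger offsets need a wider encoding.
constexpr bool fits_in_disp8(std::ptrdiff_t offset) {
    return offset >= -128 && offset <= 127;
}

// Every call touches the hot variable, so its offset matters most.
int bump_hot_counter() { return ++t_hot_counter; }
```

If t_cold_scratch were laid out first, t_hot_counter would sit at an offset of 4096 or more and every access to it would need the longer encoding, which is why knowing usage across the whole program helps the layout.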

Using Whole Program Optimization

Fortunately, developers need to do very little to enable whole program optimization. On the command line, adding the /GL switch is all that is needed. When the /c switch is used to separate the compiling and linking stages, the linker will need the /LTCG switch if any object files were compiled with the /GL switch. When using the Visual Studio integrated development environment, enable whole program optimization by setting the whole program optimization property in the General property page of the project properties' configuration folder.
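
The command lines described above might look like the following sketch, shown as comments in a small listing. The file names, demo.exe, and the functions square and sum_of_squares are hypothetical; /GL, /c, /O2, and /LTCG are the switches named in this article.

```cpp
// Build sketch for the Visual C++ command line (hypothetical file names).
//
//   Compile and link in one step:
//     cl /O2 /GL caller.cpp square.cpp
//
//   Compile and link separately:
//     cl /c /O2 /GL caller.cpp square.cpp
//     link /LTCG /OUT:demo.exe caller.obj square.obj
//
// The two functions below are shown in one listing for brevity. In the
// scenario sketched here, each would live in its own source file, and
// caller.cpp would also define main, which is omitted.

// square.cpp
int square(int x) { return x * x; }

// caller.cpp
int square(int x);  // declaration only; the body is in another module
int sum_of_squares(int a, int b) { return square(a) + square(b); }
```

Built this way, square becomes a candidate for cross-module inlining into sum_of_squares even though its body is in a different source file.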

Using whole program optimization restricts the ability to use other features of Visual C++ .NET. When compiling with the /GL switch, edit and continue (/ZI), automatic precompiled headers (/YX), and targeting the .NET common language runtime (/clr) are not available.

In real-world code, whole program optimizations have boosted performance as much as 10% to 15%. Of course, this can vary; some programs will benefit more than others. On x86 architectures, 3% to 5% improvement is common.

Common Questions About Whole Program Optimization

Can whole program optimization be used on some files, but not others?

Yes. Each source file that is compiled with the /GL switch produces an object file that will use whole program optimization. If an object file is not compiled with /GL, it will contain optimized assembly code using the traditional approach to compiling. Mixing object files built with and without /GL does not have any known issues.

Can I generate assembly files? What do they look like?

Assembly files (.asm) can be generated with LTCG, but because code generation is not done until link time, the assembly files will not be produced until link time either. The .asm files produced with LTCG look just like those produced without LTCG, but they cannot be consumed by MASM.

What does this do to overall build time?

Overall build time does not change significantly. The shorter time in the compiling stage is shuffled to the linking stage, which now includes optimization and assembly code generation.

Conclusion

Link time code generation is a framework that enables whole program optimization. For developers, this means that the Visual C++ team is continuously examining even more ways to improve code through this framework. At the moment, whole program optimizations in Visual C++ .NET provide a significant advance toward making C and C++ programs the best that they can be.

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.
