How does C++ Compilation Work?

Feb 25, 2024
#cs


Overview

To describe how C++ compilation works, I'll use an example with these 3 files:

// helper.hpp
#define PI 3.14
double times_two(double x);

// helper.cpp
double times_two(double x) {
    return x * 2;
}

// main.cpp
#include <iostream>
#include "helper.hpp"

int main() {
    std::cout << "tau=" << times_two(PI) << std::endl;
    return 0;
}

I can compile and run the code like so:

$ clang++ main.cpp helper.cpp -o main
$ ./main
tau=6.28

Even though I ran just a single command to compile the code into an executable program, there are actually 3 steps happening behind the scenes:

  1. Preprocessing
  2. Code Generation
  3. Linking

The word build is often used to refer to all 3 steps. The word compile can refer to all 3 steps, or just to steps 1 and 2. Next, I'll describe each step in more detail.

I am using the Clang C++ compiler since it's the default on my MacBook, but the information should be applicable to GCC (the GNU Compiler Collection) as well.

Preprocessing

The preprocessor performs text manipulation on source code. It looks at lines of code that start with # (the hash symbol). These lines are called preprocessor directives. There are 3 main types of directives:

  1. Definitions
  2. Includes
  3. Conditionals

Definitions

In our example, #define PI 3.14 in helper.hpp is a definition. This directive tells the preprocessor to replace the token "PI" with the token "3.14" in the source code. Note that this will not replace every occurrence of the string "PI" in the source code, because sometimes the string "PI" appears in a token that has other characters. For example, "PI" will not be replaced with "3.14" in the following lines:

int PIN = 1; // the token is PIN
std::cout << "I LIKE PI"; // the token is "I LIKE PI"
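
To make the token rule concrete, here is a small sketch. It assumes the same #define PI 3.14 as helper.hpp; the names STRINGIFY, tau, message and pi_text are made up for illustration and are not part of the example program. The stringification trick turns whatever PI expands to into a string, so we can inspect the replacement text.

```cpp
#include <string>

#define PI 3.14

// Two-level stringification: the outer macro expands its argument first,
// then the inner one turns the resulting tokens into a string literal.
#define STRINGIFY_IMPL(x) #x
#define STRINGIFY(x) STRINGIFY_IMPL(x)

int PIN = 1;                              // PIN is one token, so PI is not replaced
const std::string message = "I LIKE PI";  // a string literal is one token too

double tau() { return 2 * PI; }           // becomes: return 2 * 3.14;

// The text that the preprocessor substitutes for PI.
const std::string pi_text = STRINGIFY(PI);
```

Here pi_text holds the characters 3.14, while PIN and the string literal are left untouched.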

Includes

In our example, #include "helper.hpp" in main.cpp is an include directive. This directive essentially tells the preprocessor to copy the contents of the file helper.hpp into main.cpp. If helper.hpp itself had include statements, those would also get copied into main.cpp. In the libraries section, I'll describe how the preprocessor finds the file to include for a directive like #include <iostream>.
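
Because an include is a textual copy, a header that is reachable through two different include chains gets pasted twice, and any definition inside it would then appear twice in the same translation unit. The usual remedy is an include guard, built from the conditional directives covered in the next section. The sketch below simulates a double include inside a single file; the Point struct, the sum function and the HELPER_HPP guard name are hypothetical and not part of the example program.

```cpp
// First copy of the (hypothetical) header text.
#ifndef HELPER_HPP
#define HELPER_HPP
struct Point {
    double x;
    double y;
};
#endif

// Second copy, as if the header had been included again indirectly.
// HELPER_HPP is already defined, so the preprocessor drops this block
// and the compiler never sees a second definition of Point.
#ifndef HELPER_HPP
#define HELPER_HPP
struct Point {
    double x;
    double y;
};
#endif

double sum(const Point& p) { return p.x + p.y; }
```

Without the guard lines, the second struct Point would be a redefinition error at the code generation step.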

Conditionals

There are no conditional directives in our example, but it would look something like:

#ifdef ENVVAR
{code block}
#endif

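This tells the preprocessor to keep the code block only if the macro ENVVAR is defined, either by an earlier #define or by passing -DENVVAR to the compiler. Despite the name, ENVVAR here is a preprocessor macro, not a shell environment variable. Often, conditional directives are used to tweak the program for different platforms (e.g. Linux vs Windows) or different build modes (e.g. DEBUG vs RELEASE).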
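
As a rough sketch of both uses, the functions below (platform_name and is_release_build are names I made up) switch on macros that the compiler predefines for each platform, and on NDEBUG, the standard macro that disables assert().

```cpp
#include <string>

// Platform detection using macros that compilers predefine.
std::string platform_name() {
#if defined(_WIN32)
    return "Windows";
#elif defined(__APPLE__)
    return "macOS or another Apple platform";
#elif defined(__linux__)
    return "Linux";
#else
    return "unknown";
#endif
}

// Build-mode switch: pass -DNDEBUG on the command line to define NDEBUG.
bool is_release_build() {
#ifdef NDEBUG
    return true;
#else
    return false;
#endif
}
```

Only one branch of each conditional survives preprocessing, so the code generator never sees the others.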

Running the preprocessor

We can run the preprocessor by passing the flag -E to clang++:

$ clang++ -E helper.cpp -o helper.i && \
    clang++ -E main.cpp -o main.i

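The output of the preprocessor is called a translation unit. By convention, it has a .i file extension (strictly speaking, .i is the convention for preprocessed C and .ii for preprocessed C++; the commands in this post use .i). Note that the preprocessor takes a single source file as input, so it can be run in parallel on multiple source files.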

A translation unit is a text file, not a binary file, so we can read it without special tools:

$ cat main.i
...
the contents of iostream
...
# 2 "main.cpp" 2
# 1 "./helper.hpp" 1

double times_two(double x);
# 3 "main.cpp" 2

int main() {
  std::cout << "tau=" << times_two(3.14) << std::endl;
  return 0;
}

As we can see, the preprocessor copied the contents of the included files <iostream> and "helper.hpp", and it also replaced PI with 3.14.

Code Generation

The code generation step turns the translation unit into an object file. Like preprocessing, code generation is performed on each translation unit separately, so it can be parallelized.

This step translates our C++ code into machine code. However, the code generator does not know how to translate symbols (e.g. variables, classes, functions) that are defined in another translation unit, so it'll leave a placeholder. In the next section we'll talk about the linker, whose job it is to resolve these placeholders.

The object file contains the generated machine code, as well as metadata like the location of the placeholders. To generate the object file, we provide the flag -c to clang++:

$ clang++ -c helper.i -o helper.o && \
    clang++ -c main.i -o main.o

By convention, object files have a .o file extension. Object files are binary files, not text files, so we use specialized tools like objdump to read them:

$ objdump --reloc -dC main.o

In the output, you will see the disassembled machine code as well as a relocation entry (the placeholder) for the times_two symbol, which is not defined in main.o.

Linking

The linker takes all of the object files and turns them into a single executable program. Unlike the previous steps, linking cannot be split into independent per-file jobs, because the linker needs to see all of the object files together.

When combining all of the object files, the linker will also resolve the placeholders in the object files for references to external symbols. If the linker cannot find any definition for a symbol, or if it finds multiple definitions for a symbol (that is not marked inline), it'll report an error. The linker also adds code for the C/C++ runtime (e.g. initializing global variables).
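
The multiple-definition rule matters most for code that lives in headers, since a header's text ends up in every translation unit that includes it. As a sketch, suppose helper.hpp also wanted to ship a small function together with its body (times_three is a hypothetical name, not part of the example program). Marking it inline tells the toolchain that identical definitions coming from different object files should be merged rather than treated as a conflict. The "files" are shown in one listing here for brevity.

```cpp
// helper.hpp (hypothetical extension of the example header)

// Declaration only: exactly one .cpp file must provide the definition,
// otherwise the linker reports an undefined symbol.
double times_two(double x);

// Definition in a header: without `inline`, every translation unit that
// includes this header would emit its own times_three, and the linker
// would report a duplicate symbol.
inline double times_three(double x) {
    return x * 3;
}

// helper.cpp: the single definition of times_two.
double times_two(double x) {
    return x * 2;
}
```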

To perform the linking step, we can run clang++ without any extra flags. The compiler toolchain infers that the inputs are object files, so it knows it doesn't need to run the preprocessor or the code generator:

$ clang++ main.o helper.o -o main
$ ./main
tau=6.28

Most compiler optimizations are performed at the code generation step, but there are some optimizations that can only be done by the linker because it has visibility into the whole program. For example, the linker has a better idea of how to arrange the code in memory to improve cache locality, and which functions to inline. These are known as link-time optimizations (LTO). Even though LTO will probably make your program faster, it can greatly increase the amount of time and memory required to build your program. To enable LTO, you have to pass the flag -flto to both the code generator and the linker.

Libraries

Speaking informally, a library is code written by someone else that you can use in your own program. There are two types of libraries, characterized by how you link them into your program: static libraries get linked into your program at build time, and dynamic libraries get linked into your program at run time.

Static Libraries

A static library is essentially just a collection of object files. To create a static library, you first compile the library code to object files, and then you use the ar command line tool to create an archive that contains all the object files. By convention, static libraries start with lib and have the .a file extension:

$ clang++ -c helper.cpp -o helper.o
$ ar -rs libhelper.a helper.o

Then, to use the library in your program, you just have to pass it to the linker:

$ clang++ main.o libhelper.a -o main_static
$ ./main_static
tau=6.28

Dynamic Libraries

Dynamic libraries are also called shared libraries, shared objects, and (on Windows) dynamic-link libraries (DLLs). Conceptually, a dynamic library contains the definitions of symbols. The instructions for creating a dynamic library can differ based on your operating system and compiler. For clang++ on macOS, you pass the flag -dynamiclib and a list of object files. By convention, dynamic libraries on macOS start with lib and have the .dylib file extension. On Linux they have the .so file extension, and on Windows they have the .dll file extension.

$ clang++ -dynamiclib helper.o -o libhelper.dylib

There are two ways to use the dynamic library in your program. The first way, called load time dynamic linking, is to pass the flag -l{library name} to the linker:

$ clang++ main.o -L./ -lhelper -o main_dynamic_load
$ ./main_dynamic_load
tau=6.28

Note that the linker automatically prepends lib and appends the library extension (.dylib here) to the name given to -l, which is why it's important to follow the naming conventions. The flag -L./ tells the linker to add the current directory to its library search path. The executable contains metadata about the locations of the dynamic libraries it requires, which can be viewed using otool:

$ otool -L ./main_dynamic_load
./main_dynamic_load:
        libhelper.dylib (compatibility version 0.0.0, current version 0.0.0)
        /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1300.36.0)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.0.0)

When you run the executable, the operating system will read this metadata and import the needed symbols from the dynamic libraries, before calling the main function. Note that the metadata can contain relative or absolute file paths. Generally, you should not use relative file paths, because then the behavior of the executable depends on where it is run. For example, if we run our program from a different directory, it will fail because it can't find libhelper.dylib in that directory.

$ mkdir dummy
$ cd dummy
$ ../main_dynamic_load
dyld[69753]: Library not loaded: libhelper.dylib
...
zsh: abort      ../main_dynamic_load

The second way to use a dynamic library, called runtime dynamic linking, requires modifying the source code of the program:

// helper.cpp
// extern "C" prevents the compiler from mangling the name of this symbol
extern "C" double times_two(double x) { return x * 2; }
    
// main_dynamic_run.cpp
#include <iostream>
#include <dlfcn.h> // for dlopen, dlsym

// Do not include helper.hpp because the symbol times_two will be found at runtime
// #include "helper.hpp"
#define PI 3.14

int main() {
    void* handle;
    double (*times_two)(double);
    char *error;

    handle = dlopen("./libhelper.dylib", RTLD_LAZY);
    if (!handle) {
        std::cerr << "Cannot open library: " << dlerror() << std::endl;
        return 1;
    }

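    dlerror(); // clear any stale error message before calling dlsym
    *(void**)(&times_two) = dlsym(handle, "times_two");
    if ((error = dlerror()) != nullptr) {
        std::cerr << "Cannot load symbol 'times_two': " << error << std::endl;
        dlclose(handle);
        return 1;
    }

    std::cout << "tau=" << times_two(PI) << std::endl;
    dlclose(handle); // unload the library once we are done with it
    return 0;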
}

In this approach, we use the dlopen function to load our dynamic library from a file path, and the dlsym function to look up the address of a symbol in the dynamic library using its name. Both are library functions provided by the dynamic loader (declared in dlfcn.h), not system calls. To compile and run this program:

$ clang++ -c helper.cpp -o helper.o
$ clang++ -dynamiclib helper.o -o libhelper.dylib
$ clang++ main_dynamic_run.cpp -o main_dynamic_run
$ ./main_dynamic_run
tau=6.28

Load time dynamic linking requires that the dynamic library exists when you build your program, because the linker needs to find the symbols you are referencing in the dynamic library, while runtime dynamic linking just requires that the dynamic library exists when you run your program.
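
The manual dlopen/dlclose bookkeeping can be wrapped so that the library is closed automatically. This is only a sketch of one possible design, not something the example program needs: LibraryCloser, LibraryHandle, open_library and find_function are names I invented, and the only real API used is dlopen, dlsym and dlclose from dlfcn.h.

```cpp
#include <dlfcn.h>  // dlopen, dlsym, dlclose

#include <memory>
#include <string>

// Deleter so that std::unique_ptr calls dlclose when the handle goes away.
struct LibraryCloser {
    void operator()(void* handle) const {
        if (handle != nullptr) dlclose(handle);
    }
};
using LibraryHandle = std::unique_ptr<void, LibraryCloser>;

// Returns an empty handle if the library cannot be loaded.
LibraryHandle open_library(const std::string& path) {
    return LibraryHandle(dlopen(path.c_str(), RTLD_LAZY));
}

// Looks up `name` and converts the address to the function pointer type Fn.
// Returns nullptr if the library is not loaded or the symbol is missing.
template <typename Fn>
Fn find_function(const LibraryHandle& library, const std::string& name) {
    if (!library) return nullptr;
    return reinterpret_cast<Fn>(dlsym(library.get(), name.c_str()));
}
```

With this wrapper, the lookup in main_dynamic_run.cpp could be written as find_function<double (*)(double)>(library, "times_two"), where library comes from open_library("./libhelper.dylib"). The returned function pointer is only valid while library is still alive.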

Pros and Cons

One benefit of using dynamic libraries is that the operating system only needs to keep one copy of each dynamic library in memory, which can be shared by all the running programs that depend on that library. Another benefit of using dynamic libraries is that you can upgrade the version without rebuilding your program.

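On the flip side, programs that use static libraries are self-contained. Also, LTO can optimize across a static library (provided the library itself was compiled with -flto), but it cannot optimize across the boundary of a dynamic library.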

A drawback of both approaches is that the code for the library comes pre-compiled, so you don't have control over the compilation options used. If you want more granular control, you have to download the source files for the library and compile them yourself.

Standard Libraries

The C++ standard library is a collection of useful classes and functions provided by the standard (such as iostream). The headers and library files for the standard library can be installed using the package manager for your operating system. For example, on macOS, the standard library is installed as part of the Xcode command line tools.

In general, header files are in directories named include and library files are in directories named lib. On my MacBook, the iostream header file is at /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/iostream. Even though this file does not have a .h suffix, it is a normal header file. Clang has a list of directories that it looks in to search for header files. You can use the flag -v (verbose) to see that list:

$ clang++ -E main.cpp -o main.i -v
...
#include "..." search starts here:
#include <...> search starts here:
 /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1
 /Library/Developer/CommandLineTools/usr/lib/clang/14.0.0/include
 /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include
 /Library/Developer/CommandLineTools/usr/include
 /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks (framework directory)
End of search list.
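
Since C++17, the preprocessor can also report whether a header is reachable through this search list, using __has_include. A minimal sketch (the constant names and the made-up header name are mine):

```cpp
// __has_include is evaluated by the preprocessor and is true when the
// named header can be found on the current search path.
#if __has_include(<iostream>)
constexpr bool has_iostream = true;
#else
constexpr bool has_iostream = false;
#endif

// A header name that should not exist anywhere on the search path
// (the name is made up for this example).
#if __has_include(<definitely_not_a_real_header_12345.hpp>)
constexpr bool has_made_up_header = true;
#else
constexpr bool has_made_up_header = false;
#endif

static_assert(has_iostream, "the standard library headers should be on the search path");
static_assert(!has_made_up_header, "the made-up header should not be found");
```

Adding a directory with -I extends the search list, which can change the result of such a check.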

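There are several implementations of the C++ standard library: libc++ comes from the LLVM project and is what Clang uses by default on macOS, while libstdc++ comes from GCC (and is also what Clang typically uses by default on Linux). By default, Clang will use load time dynamic linking to link the standard library into your program: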

$ otool -L main
main:
    /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1300.36.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.0.0)

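The file /usr/lib/libc++.1.dylib does not actually exist as a separate file on my system. Since macOS 11 (Big Sur), the system libraries are stored in the dyld shared cache rather than as individual files on disk, and the dynamic loader resolves these paths from that cache.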

Debug Symbols

The symbol table is a data structure generated by the linker. It maps symbol names to their corresponding addresses in the program. The symbol table is stored in the final executable, and can be viewed using the nm command line tool. When you set a breakpoint on a function in a debugger like gdb, it uses the symbol table to find the address of the function.

Debug symbols contain more information than the symbol table, such as the line number and source file where symbols are defined. They make debuggers a lot more useful, but they are not generated by default. You have to pass the flag -g to the compiler. The most common data format for debug symbols is called DWARF. On Linux, the debug symbols are stored in the executable, but on macOS they are stored in a separate directory with the .dSYM suffix. You can view the debug symbols using the dwarfdump command line tool. For example:

$ clang++ main.cpp helper.cpp -o main -g
$ dwarfdump -a main.dSYM/Contents/Resources/DWARF/main
...
0x00006df5:   DW_TAG_subprogram
                DW_AT_low_pc    (0x0000000100003cc0)
                DW_AT_high_pc   (0x0000000100003cdc)
                DW_AT_APPLE_omit_frame_ptr      (true)
                DW_AT_frame_base        (DW_OP_reg31 WSP)
                DW_AT_linkage_name      ("_Z9times_twod")
                DW_AT_name      ("times_two")
                DW_AT_decl_file ("/Users/antoncao/Documents/code/helper.cpp")
                DW_AT_decl_line (1)
                DW_AT_type      (0x0000000000006e22 "double")
                DW_AT_external  (true)
...
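
Note the two names in this entry: DW_AT_name is the name as written in the source, while DW_AT_linkage_name is the mangled name that the linker sees. As a small sketch, a mangled name can be decoded with abi::__cxa_demangle, which is provided by the C++ runtime used by both Clang and GCC; the demangle wrapper around it is my own helper.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (Itanium C++ ABI; GCC and Clang)

#include <cstdlib>
#include <string>

// Returns the human-readable form of a mangled name, or the input
// unchanged if the demangler rejects it.
std::string demangle(const std::string& mangled) {
    int status = 0;
    char* result = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || result == nullptr) {
        return mangled;
    }
    std::string readable(result);
    std::free(result);  // the buffer is allocated with malloc
    return readable;
}
```

Calling demangle("_Z9times_twod") gives times_two(double): the 9 is the length of the name and the trailing d encodes the double parameter. This is also why the runtime dynamic linking example needed extern "C" to look the function up under its plain name.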



