Tuesday, 24 June 2014

The C Building Process

Hey guys, most of you might have wondered how a C/C++ program actually gets transformed into executable code. Well, that involves a number of steps and a few components, viz.

1. The source code – The textual version of the code, written by the programmer. It includes the C or C++ statements, along with the preprocessor directives (those starting with a '#'). The source file has the extension .c or .cpp (and some others, like .C, ...).
2. The Preprocessor – The source code is fed to the preprocessor, which replaces the # directives with the text they stand for (included files, expanded macros) to form the expanded source code. Generally this has the extension *.i
3. The Compiler – The *.i file is fed to the compiler, which checks the code for syntax and type errors and reports them, if any. Remember that the compiler looks at one file at a time: it never checks whether a function that is used here is actually defined somewhere else. It produces the assembly code (*.s or *.asm).
4. The Assembler – The assembler converts the assembly code to “relocatable object code”, with the extension *.o or *.obj
5. The Linker – The final and one of the most important blocks of the entire procedure. It checks for the dependencies between object files and libraries, resolves them, and hence combines one or more object files into the final executable code, or the *.exe file (a.out or *.out on some OSes).


This entire conversion from source code to executable code is called the build process.

 
Now let us take a very simple example. We will be using the Linux operating system, as it has simple and useful tools to show the process step by step.
Here we write a small C program named “sample.c”.


Here you may get stuck at two points:
  • We have not used any preprocessor directives like #include, etc. This is just to keep the code and the description simple. So remember that we can skip preprocessing here. We could even save our file as sample.i, which tells gcc that it does not need to be preprocessed.
  • There is no definition (body) of the function “func(int)”. This is deliberate, to show how the work is divided between the compiler and the linker. It will become clear below.
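
The listing itself is not reproduced in this copy of the post, so here is a minimal sketch of what sample.c could look like. It assumes only what the two points above say: no preprocessor directives, and a function func(int) that is used but has no body. The prototype, the return type of func and the argument value are guesses, not the original code.

```c
/* sample.c: a hypothetical reconstruction, not the original listing */

int func(int x);        /* declaration (prototype) only, no body anywhere */

int main(void)
{
    return func(5);     /* fine for the compiler; the linker must find func later */
}
```

A file like this is meant to compile but not to link, which is exactly the situation the rest of the walkthrough relies on.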

Now compile the code using the following command:

gcc -c sample.c

Remember, the command “gcc -o sample.out sample.c” will do the entire build process. But gcc with the -c option only compiles and assembles the code without linking it, producing the sample.o file (this is the object code).

Voilà! Compilation was successful. We know that the function “func” has no body, but our code is still syntactically correct. Hence it compiled properly.

Now type the following command in your terminal:

nm sample.o

The nm command lists the symbols in the object code, which is what gets fed to the linker.
In the output, look at the letter printed next to 'func':

'U' means undefined, i.e. an unresolved dependency. This shows that 'func' is used in sample.o but not defined there.

Now try this:

gcc -o sample.out sample.c

This will do the entire build process.
As expected, you will get an “undefined reference to func” error from ld.

ld is the GNU linker. It tries to find the body of 'func' in our object code and in the standard C library, fails, and hence generates this error.

Hence it is the linker which searches for and resolves the dependencies of the functions in our code.

Now let us come to a standard question: why do we use #include, and how does it work?
Let us take #include <stdio.h> as an example.

Have you guys ever opened the file stdio.h?

If you open it, you will find that for functions such as printf() and scanf() it contains only the declarations (prototypes). It does not contain the bodies of those functions.

The compiler actually needs only the prototype, so that it can check whether we are supplying the correct arguments to the function. The compiler never looks for the body of the printf() function.
Hence we use the #include <stdio.h> statement to get the prototype included in our code.

The body of the function exists as object code in the C library [here, the GNU C library, glibc].

At link time, the linker searches for the body of the printf() function in the standard C library, which is already in object code form. This library comes installed alongside the compiler toolchain.
When the linker finds the object code for printf(), it combines it with the object code of our C program to generate the final executable file. (With dynamic linking, which is the usual default on Linux, it records a reference to the shared library rather than copying the code in.)

Now, what are the loader and the debugger?

The loader loads the executable code from secondary memory into RAM so that it can run.

The debugger is a tool, generally included with the IDE, that lets us insert breakpoints in our code. The program then runs under the debugger and pauses at each breakpoint, so that we can inspect what it is doing. This facilitates removing user-made errors (bugs) from the code, hence the name debugger.

Final interesting point:

Try the following command in your terminal:

objdump --disassemble sample.o

The “OBJect DUMP” command with the disassemble option shows you the object code in assembly language [remember the *.s or *.asm file?].



Isn't it interesting?




2 comments:

  1. Very informative! Did all this, and never cared too see what happens behind the scenes :D

    Replies
    1. Thanks, sometimes getting to know "behind the scenes" is more interesting ;)
