Friday, 4 November 2016

Technical Blog 4- More Compilers

In this post I am going to continue on directly from my previous post and continue to talk about the inner workings of compilers. Specifically today I will describe each of the phases of the actual compilation of the code and how they relate to each other.

Error Handler
It is worth mentioning that much of what a compiler ends up doing is checking for errors within the code, most of the stages in the process are able to identify errors of various kinds in the code. When error s do occur, depending on the severity the compiler will either continue or will fail and feedback to the user. It can be thought of as a side process of the compiler that is the error handler which receives errors from the various stages and ends the compilation process. This phase is connected to all the other phases and is not necessarily used at all during a successful compilation.

Lexical Analysis
This is the first stage of compiling, it involves evaluating the source code at its simplest level, It goes through the code and divides statements up based on the presence of white-space or some other divider. It then outputs these as tokens to the syntactical analyser. This process effectively removes spaces from the code and separates out each relevant item for example in the statement "x=4", '4', 'x' and '=' would each be a separate token.

Syntactical Analysis
This stage of the compiling process takes in the tokens from the lexical analysis and checks the grammar of the statements based on a predefined grammar for the language being compiled. It checks the validity of statements as a whole assuming the tokens themselves have been validated by the lexical analysis. In addition to checking for grammatical errors this stage outputs a parse tree which can then be used by the semantic analyser. The parse tree is made up of all of the tokens that were provided by the lexical phase, they are positioned based on the grammar of the language and a series of production rules.

Semantic Analysis
Semantic analysis is the phase were the actual usability of the code begins to be evaluated. Whereas previous phases were not concerned with the functionality of the code, merely the structure of it, semantics checks things like appropriate variable assignments and usage of reserved identifiers. For example the statement "int x = G" would not produce any errors in lexical or syntactical analysis, but would be picked out as an error by semantics as G is not an integer. This phase also checks that variables are properly defined if necessary depending on the language.

Code Optimization
This stage is the first where errors are not really checked for anymore. It is assumed that most errors have already been picked up by this point. Instead it begins the process of actually converting the code into a more machine-friendly language. This intermediary code is then evaluated for efficiency and optimization. That is to say it removes unnecessary duplicity and optimises memory usage where possible. This stage is one of the main differences between various compilers for the same language, as it contains the most room for actual improvement. This stage should help to speed up the program's execution and reduce memory usage. Frequently compilers will allow the user to choose a level of optimisation for their code which can drastically affect compilation time so can be useful for debugging and similar.

Machine Code Generation
This stage finally generates some actual machine code from the intermediary code used for optimisation. Frequently this will actually be assembly code which can then be rapidly assembled into machine code but either way it is still usable. It is often machine independent at this stage, that is to say it is designed for a specific machine code language but not optimised for use on the specific machine. It is only after this has been created that it is optimised for the machine that is currently being used. This enables the code to run as fast as possible whilst also having a relocatable stage which can allow it to run on similar but not identical processors.

So ultimately it is clear that compilers are highly complex pieces of software with many different layers and phases. There is also the potential for a variation in quality of compilers for the same language. What this actually shows is that high-level languages are remarkable in their ability to enhance human interaction with computers relative to the original interaction which involved hand-written assembly and machine code. 

References
1. Bolton D. What is a Compiler? [Internet]. About.com Tech. [cited 2 November 2016]. Available from: http://cplus.about.com/od/introductiontoprogramming/p/compiler.htm
2. Compiler Design Tutorial [Internet]. www.tutorialspoint.com. [cited 2 November 2016]. Available from: https://www.tutorialspoint.com/compiler_design/index.htm

Saturday, 22 October 2016

Technical Blog 3-Compilers

I have up to this point mentioned compilers a few times in previous blog posts, this is because they are a core part of understanding how computers are able to function and utilize the code we actually write. As I have said before, compilers are programs which convert the code in other programs into assembly language and machine code which can be read and executed by the computer itself. However whilst it is fairly obvious why we need compilers, most people do not understand how compilers actually go about doing this in a practical sense, which is what I will attempt to cover in this post today.

Ultimately a compiler takes in source code which is written in a high-level language and converts it to machine code. However this is a non-trivial exercise as unlike with assembly code there is no easy one to one conversion between high-level languages and machine code. Therefore many stages have to be individually performed by the compiler before outputting anything useful. Notably due to the nature of most coding languages allowing referencing to various sections of code throughout a program, often a compiler will have to pass over the code multiple times to accurately understand what is happening.

It is also important to understand that compilers do not exist in a vacuum. They are actually just one part of the overall language processing system. In general after a compiler has actually been through the code and produced the assembly language version of the original program, there are still a few stages that remain before the code itself is actually usable:

  • An assembler is required to convert the assembly code into machine code which will actually be readable by the computer. This process is minimal in effort but still notable as most compilers will output in assembly rather than machine code.
  • A linker is a very important part of this process as it collates together the various files associated and referenced within the program and makes sure they are all present and accessible. It will generally attempt to collect this data into a single executable file where possible. It also deals with memory allocation and ensuring that memory is available.
  • The loader loads the program into the machine's available memory and calculates the size of the program itself. It is generally a part of the operating system, which means the surrounding parts of the system need to be able to communicate with it and account for the specific demands of the loader.
This shows that the actual process of getting any code to run on a machine extends beyond the compiler itself. This is a relevant consideration when thinking about the optimisation of running code through any machine. It is even more interesting when you consider that each of these additional processes, the linker, the loader and the assembler each have to at some point have been compiled in some form or another. This means that somewhere in the past they were all written in machine code of some form, likely built up over time as I described in my last blog post. My next blog post will go into greater detail about compilers themselves, which are the most complex part of this language processing system. I will look into breaking down the various phases the compiler goes through and why each of them is important.

References
1. Bolton D. What is a Compiler? [Internet]. About.com Tech. [cited 20 October 2016]. Available from: http://cplus.about.com/od/introductiontoprogramming/p/compiler.htm
2. Compiler Design Tutorial [Internet]. www.tutorialspoint.com. [cited 20 October 2016]. Available from: https://www.tutorialspoint.com/compiler_design/index.htm

Thursday, 6 October 2016

Technical Blog 2- Assembly and Assemblers

Introduction
Assembly is at its simplest the first computer programming language designed to be written by people. It generally translates to machine code on a one for one basis with very simple compilers called "assemblers" being used. The primary advantage of assembly over directly writing in machine code is the lack of requirement for remembering lots of long numerical codes for each action to be performed. Instead, assembly has a series of simple words which each correspond directly to a line of machine code. For example:
mov [2352], 245
For some assembly variants this small snippet of code tells the computer to transfer the number 245 to the memory location 2352. This is much more readable than the binary equivalent which is a simple string of ones and zeros. However this does translate directly to an operation in machine code which makes it very simple to assemble. This contrasts to the high-level programming languages people are used to today where the logic being written is far removed from the machine operations being performed and required large amounts of complex compiling to become readable by a computer.

The main advantage then that assembly has over machine code is a much greater ability to be written and read directly by people. The key though is that there is minimal downside. Assuming an assembler is available, you do not lose any of the control gained from programming directly in machine code which enables you to develop the incredibly tight and efficient programs associated with direct memory manipulation as discussed in my first post. This means that in any practical sense if that was desired, assembly would likely be as low-level as anyone would want to go.

History

Assembly of various forms was historically very quick to follow the development of computers. Even back when people  first began using computers machine code was considered finicky and highly impractical. The question that springs to my mind when considering this is how the first assemblers were created as surely they must have been written in machine code of some form. It turns out that the earliest assemblers were themselves written in assembly, they were just translated by hand into machine code. Due to the relative simplicity of translating assembly to machine code on a one to one basis, hand conversion was common when computers were such that machine code had to input directly, through punch cards or similar. This meant that programming time was split between writing the code in assembly and then translating it word for word into binary.

In the modern world of course, assemblers are well developed and commonly used to perform the final translations in a compiler to actually output the bit code to be read by the computer. In general these compilers are written in languages for different operating systems or processors than the code being assembled. This is known as cross-compiling and enables quick access from high-level languages to machine code without having to build up the assembler in machine code which often takes many iterations in which the next level of assembler has to be converted by the current assembler until an actually useful language is built up over time.

References
1. Landley R. Introduction to cross-compiling for Linux [Internet]. Landley.net. [cited 6 October 2016]. Available from: https://landley.net/writing/docs/cross-compiling.html

2. Fomin A. Introduction to Assembly Language [Internet]. Swansontec.com. 2001 [cited 6 October 2016]. Available from: http://www.swansontec.com/sprogram.html

Thursday, 22 September 2016

Technical Blog 1-Machine code and Assembly

Introduction
In this blog I aim to learn and describe the lowest level coding language there is: Machine code. Invariably tied to machine code is the idea of compilers which act as a translator from the high level code that we can write and read relatively easily into the dense and impractical language of the machines. Therefore I will also be learning about compilers and exactly how they function on a practical level.

The most basic question that arises when talking about machine code is “why?”. Why would anyone in this day and age want to learn or even understand machine code? Computing as a whole has essentially developed beyond the point where anyone will ever write a program in machine code or even assembly for any practical reason. These languages are incredibly error prone, inflexible, difficult to debug and take forever to write due to having to look up the binary codes required to do each command. However there are reasons to at least understand the basics of these languages, primarily it is good from a general understanding point of view. It shows higher-level coding in a new light and gives a better appreciation for how easy coding can be these days. In a more practical sense there are times when low level languages can achieve efficiency that compiler-driven high-level code cannot as the code can be tailored to perform very specific tasks in very specific ways. Personally though I just find the whole topic interesting, like a dark and mysterious cavern filled with 1s and 0s where most people do not dare to enter.

Decimal
Hexadecimal
Binary
00
00
000000
12
0C
001100
32
20
100000
45
2D
101111







Table 1: Shows a series of numbers being expressed in different forms including binary

So what is machine code? Machine code is the language which a computer can actually read. It is transmitted to the computer in binary form which is incredibly difficult to comprehend by humans. Normally when being directly interacted with by people it is viewed or written in hexadecimal form or similar. Table 1 shows how binary ends up being very unwieldy as lots of digits are required to show relatively simple numbers. Whilst it also looks alien at first, hexadecimal is generally preferred to decimal as it is base 16 which means it lines up better with binary being base 2 and enables much cleaner correlation between the two languages than decimal which is base ten.

Instruction sets
Different processors read machine code differently, that is to say that even at the lowest level there are a variety of different languages that can be written in, although in this case they the choice is based entirely on which language the processor is designed to read. This does mean that machine code that functions on one machine will likely be completely unusable on another, which is another one of the many reasons to not write code directly in machine language. These languages are known as instruction sets, they are a collection of instructions which essentially tell the computer: “if given this string of binary digits, perform this task.”

Instruction sets are themselves divided into many sub-categories such as Complex instruction sets and Reduced instruction sets which each have different attitudes to different types of instructions. For example the reduced instruction set only implements simple, common commands directly and performs complex commands using combinations of the commons commands. Contrastingly, complex instruction sets will be able to directly implement some of the more specialised tasks. This leads to being able to more efficiently perform those specialised tasks at the cost of being less efficient on average for the simple common tasks.

Opcode
Addresses
Other Information
Defines the operation type and how to divide the other 26 bits
Most operations will require some number of addresses, each 5 bits long they point to a specific memory address to either read or write
Different operations will require additional inputs of varying lengths to describe exactly what to do.
001100
10011
01010
0101000110101010
Table 2: Shows an example of an instruction in MIPS form.





Instruction sets are set up to take in a set number of bits per instruction to enable them to be read as a simple binary stream i.e. the computer will know that every so many bits is a new instruction. The most common instruction sets use either 32-bit or 64-bit instructions although pretty much any number can be used. These instructions will generally be divided into sub sequences of information depending on the language. For example, in MIPS which is a 32-bit reduced instruction set, the first 6 bits will determine the type of instruction being performed and how to use the rest of the bits, this includes defining the addresses to be read from and stored to, any additional information required and what to do with this information, Table 2 shows the structure of a MIPS instruction. All possible tasks for the computer to perform including file reading and writing, arithmetic operations and even jumping about within the program are defined within the 32 bit instructions in this way, although as I touched on before, many instructions will be combined to perform more complex operations as needed depending on the instruction set.

Conclusion

I have only begun to scratch the surface of machine code in this post. But it seems clear to me that there is a lot to talk about relating to the lowest level languages. Over this series I will continue to develop my own understanding of this topic. I will be investigating assembly language, compilers and other related subjects in order to break through the rift between people and machines.

References
1. MIPS® Architecture For Programmers Volume I-A: Introduction to the MIPS32® Architecture [Internet]. 1st ed. Imagination Technologies; 2014 [cited 22 September 2016]. Available from: https://imagination-technologies-cloudfront-assets.s3.amazonaws.com/documentation/MD00082-2B-MIPS32INT-AFP-06.01.pdf

2. Fairhead H. Hexadecimal [Internet]. I-programmer.info. 2016 [cited 22 September 2016]. Available from: http://www.i-programmer.info/babbages-bag/478-hexadecimal.html?start=1