What Is Floating-Point Arithmetic?

Many people consider floating-point arithmetic an esoteric subject, which is surprising given how ubiquitous it is in computer systems. The name comes from exponential notation: the decimal point can "float" to different positions among the digits of a number as the exponent changes.

Almost every programming language has a floating-point data type. Computers from PCs to supercomputers have floating-point accelerators, most compilers are called upon to compile floating-point algorithms from time to time, and virtually every operating system must respond to floating-point exceptions such as overflow.

In 1914, Leonardo Torres y Quevedo designed an electro-mechanical version of Charles Babbage's Analytical Engine, and included floating-point arithmetic. In 1985, the IEEE 754 Standard for Floating-Point Arithmetic was established, and since the 1990s, the most commonly encountered representations are those defined by the IEEE.

A floating-point system can be used to represent, with a fixed number of digits, numbers of different orders of magnitude: e.g. the distance between galaxies or the diameter of an atomic nucleus can be expressed with the same unit of length. The result of this dynamic range is that the numbers that can be represented are not uniformly spaced; the difference between two consecutive representable numbers grows with the chosen scale. The speed of floating-point operations, commonly measured in terms of FLOPS, is an important characteristic of a computer system, especially for applications that involve intensive mathematical calculations. The first commercial computer with floating-point hardware was Zuse's Z4 computer, designed in 1942–1945.
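The non-uniform spacing of representable numbers mentioned above can be observed directly. The following is a minimal sketch, assuming Python 3.9 or later and that Python's float is an IEEE 754 double-precision value (the case on mainstream platforms). The helper name spacing_table and the sample values are illustrative choices, not part of any standard.

```python
import math

# Assumption: float is IEEE 754 binary64, and math.ulp is available (Python 3.9+).
# math.ulp(x) gives the value of the least significant bit of x, which for
# these positive samples is the distance to the next larger representable float.


def spacing_table(values):
    """Return (value, gap to the next representable float) pairs."""
    return [(x, math.ulp(x)) for x in values]


samples = [1e-15, 1.0, 1e6, 1e16]

for value, gap in spacing_table(samples):
    print(f"{value:>8g}  gap to next float = {gap:.3e}")

# Near 1e16 neighbouring floats are 2.0 apart, so adding 1.0 is lost to rounding.
print(1e16 + 1.0 == 1e16)
```

The gap grows in step with the magnitude of the value, so the relative precision stays roughly constant while the absolute precision does not.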

Consider the number 158. It can be written in exponential notation in many ways:

1.58 * 10^2

15.8 * 10^1

158 * 10^0

.158 * 10^3

1580 * 10^-1 etc.


All of these representations of the number 158 are numerically equivalent. They differ only in their normalization, that is, in where the decimal point appears in the leading number. In each case, the number before the multiplication operator "*" holds the significant figures of the value; we call this number the significand. The power to which 10 is raised is the exponent.
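The equivalence of these forms can be checked mechanically. The sketch below uses Python's exact rational type so that the check itself involves no rounding. The function names scaled_value and normalize are hypothetical helpers introduced for illustration, and the choice of a significand between 1 and 10 is one common normalization convention, assumed here.

```python
from fractions import Fraction


def scaled_value(significand, exponent):
    """Exact value of significand * 10^exponent (significand given as a string)."""
    return Fraction(significand) * Fraction(10) ** exponent


def normalize(value):
    """Return (significand, exponent) with 1 <= significand < 10 for value > 0."""
    if value <= 0:
        raise ValueError("this sketch only handles positive values")
    exponent = 0
    while value >= 10:   # move the decimal point to the left
        value /= 10
        exponent += 1
    while value < 1:     # move the decimal point to the right
        value *= 10
        exponent -= 1
    return value, exponent


# The five ways of writing 158 listed in the text.
forms = [("1.58", 2), ("15.8", 1), ("158", 0), ("0.158", 3), ("1580", -1)]

for significand, exponent in forms:
    print(significand, exponent, scaled_value(significand, exponent) == 158)

significand, exponent = normalize(Fraction(158))
print(float(significand), exponent)
```

Every pair in forms denotes the same value; normalize simply picks the one representative whose significand has a single nonzero digit before the decimal point.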

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits.


In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation.
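A small experiment makes the rounding error visible. This is a sketch assuming Python's float is an IEEE 754 double-precision binary value; the decimal fraction 0.1 has no finite binary expansion, so it is rounded when stored, and the sum of two rounded values is rounded again.

```python
import math
from fractions import Fraction

total = 0.1 + 0.2

# The rounded sum is not the same float as the one nearest to 0.3.
print(total == 0.3)

# Fraction(float) recovers the exact value that was actually stored,
# which differs slightly from the true one tenth.
stored = Fraction(0.1)
print(stored == Fraction(1, 10))
print(float(stored - Fraction(1, 10)))   # size of the representation error

# A tolerance-based comparison is the usual way to allow for rounding.
print(math.isclose(total, 0.3))
```

The exact comparison fails while the tolerance-based one succeeds, which is why floating-point results are normally compared to within a small tolerance rather than for equality.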
