Floating Point Arithmetic

Representation: ±Significand * Base ^ ±Exponent

The significand uses a sign bit (sign-magnitude).
The exponent is stored with a bias.
In a normalized number, the leftmost bit of the significand is 1.

Example: 0.1bbbb * 2^E
Because the leftmost bit of the significand is always 1, it is "implicit" (there is no need to store this bit).

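Range of representable numbers:

representable negative numbers
representable positive numbers
negative overflow, negative underflow
positive overflow, positive underflow

(overflow: the magnitude is too large to represent; underflow: the magnitude is nonzero but too small to represent)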

The numbers represented are not spaced evenly along the number line: the larger the magnitude, the wider the gap between adjacent representable numbers.
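
The uneven spacing can be observed directly. The sketch below is only an illustration: it assumes the IEEE 754 single format introduced in the next section, and the helper names next_float32_up and spacing32 are made up for this example. It finds the next representable value by adding 1 to the bit pattern of a positive number.

```python
import struct

def next_float32_up(x: float) -> float:
    """Return the next single-precision value above a positive float32 x."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # For positive finite values, adding 1 to the bit pattern gives the next float.
    return struct.unpack(">f", struct.pack(">I", bits + 1))[0]

def spacing32(x: float) -> float:
    """Gap between x and the next larger single-precision number."""
    return next_float32_up(x) - x

for x in (1.0, 1024.0, 2.0 ** 20):
    print(x, spacing32(x))
```

The gap doubles each time the exponent increases by one, because the same number of significand bits has to cover an interval twice as wide.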

IEEE standard 754

single 32 bits, double 64 bits, double extended >= 79 bits
exponent base 2
                           single    double    double extended
word width (bits)          32        64        >= 79
significand width (bits)   23        52        >= 63
exponent width (bits)      8         11        >= 15
exponent bias              127       1023      unspecified
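
Special values (E = 00... means all exponent bits are 0, E = 11... means all are 1):
E = 00...  S = 0    zero
E = 11...  S = 0    positive or negative infinity (according to the sign bit)
E = 00...  S != 0   denormalized number
E = 11...  S != 0   Not a Number (NaN)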
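
As an illustration of the single format and the special values, here is a minimal sketch that splits a 32-bit pattern into its fields and classifies it. The function names fields32 and classify32 are hypothetical; the field widths (1 sign bit, 8 exponent bits, 23 significand bits) come from the table above.

```python
import struct

def fields32(x: float):
    """Return (sign, E, S) of x encoded as an IEEE 754 single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31            # 1 sign bit
    e = (bits >> 23) & 0xFF      # 8-bit biased exponent
    s = bits & 0x7FFFFF          # 23 stored significand bits (implicit bit not stored)
    return sign, e, s

def classify32(x: float) -> str:
    sign, e, s = fields32(x)
    if e == 0:
        return "zero" if s == 0 else "denormalized"
    if e == 0xFF:
        if s != 0:
            return "NaN"
        return "neg infinity" if sign else "pos infinity"
    return "normalized"

print(fields32(1.0), classify32(1.0))
print(fields32(1e-40), classify32(1e-40))
```

For a normalized number the value is (1 + S / 2^23) * 2^(E - 127), with the sign bit selecting the sign.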

Floating point arithmetic

x = xs * B^xe
y = ys * B^ye
let xe <= ye
x + y = (xs * B^(xe - ye) + ys) * B^ye
x - y = (xs * B^(xe - ye) - ys) * B^ye
x * y = (xs * ys) * B^(xe + ye)
x / y = (xs / ys) * B^(xe - ye)
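
The four identities can be checked with exact rational arithmetic. This is a small sketch, assuming B = 2 and that the caller respects xe <= ye as the notes do; the function names are made up for the example.

```python
from fractions import Fraction

B = 2  # base, assumed to be 2 for this example

def value(s, e):
    """Exact value of s * B^e."""
    return Fraction(s) * Fraction(B) ** e

def add(xs, xe, ys, ye):
    assert xe <= ye
    return (xs * Fraction(B) ** (xe - ye) + ys, ye)

def sub(xs, xe, ys, ye):
    assert xe <= ye
    return (xs * Fraction(B) ** (xe - ye) - ys, ye)

def mul(xs, xe, ys, ye):
    return (xs * ys, xe + ye)

def div(xs, xe, ys, ye):
    return (xs / ys, xe - ye)

x = (Fraction(3, 4), 1)   # 0.75 * 2^1 = 1.5
y = (Fraction(5, 8), 3)   # 0.625 * 2^3 = 5
for op in (add, sub, mul, div):
    print(op.__name__, value(*op(*x, *y)))
```

Each function returns a (significand, exponent) pair; the result is not normalized, which is what the step-by-step procedures below take care of.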

Addition and Subtraction
msd = most significant digit
S = significand
E = exponent

make the implicit bit explicit
check for a zero operand
align by shifting the S of the smaller number to the right (incrementing its E) until the two E are equal
check for zero (the shifted S may have become 0)
add the signed S
check for zero (the sum may be 0)
check for S overflow; if so, shift S right and increment E
check for E overflow; if so, report an error
normalize the result: shift S left until the msd is not zero, decrementing E for each shift (E may underflow)
round the result
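
The steps above can be sketched on a toy format. Everything specific here is an assumption made for illustration: an 8-bit significand held as a signed integer with the leading bit already explicit, an unbiased exponent limited to -16..15, value = S * 2^E, and truncation in place of a real rounding step.

```python
P = 8                     # significand width in bits (hypothetical toy format)
E_MIN, E_MAX = -16, 15    # hypothetical exponent range

def shift_right(s, n):
    """Shift the magnitude of s right by n bits, keeping the sign (truncates)."""
    sign = -1 if s < 0 else 1
    return sign * (abs(s) >> n)

def fp_add(x, y):
    """Add toy floats (S, E); value = S * 2**E, normalized when 2**(P-1) <= |S| < 2**P."""
    (xs, xe), (ys, ye) = x, y
    if xs == 0:                        # check for a zero operand
        return y
    if ys == 0:
        return x
    if xe > ye:                        # make x the operand with the smaller exponent
        (xs, xe), (ys, ye) = (ys, ye), (xs, xe)
    xs = shift_right(xs, ye - xe)      # align: shift right, raising xe up to ye
    if xs == 0:                        # smaller operand was shifted out entirely
        return (ys, ye)
    s, e = xs + ys, ye                 # add the signed significands
    if s == 0:
        return (0, 0)
    if abs(s) >= 1 << P:               # significand overflow: shift right once
        s = shift_right(s, 1)
        e += 1
    if e > E_MAX:
        raise OverflowError("exponent overflow")
    while abs(s) < 1 << (P - 1):       # normalize: shift left until the msd is 1
        if e == E_MIN:
            raise ArithmeticError("exponent underflow")
        s <<= 1
        e -= 1
    return (s, e)

print(fp_add((192, 0), (128, 2)))   # 192 + 512
print(fp_add((129, 0), (-128, 0)))  # 129 - 128, needs normalization
```

Subtraction uses the same routine with the sign of the second operand flipped. A real implementation would keep guard bits during alignment and round properly instead of truncating.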

Multiplication

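check for a zero operand
add the two E
subtract the bias (each stored E already includes the bias, so the sum contains it twice)
check for E overflow and underflow
multiply the S (sign-magnitude multiplication)
normalize and round the result (E may underflow)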
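
A minimal sketch of the multiplication steps, under assumptions that are not in the notes: a toy format with an 8-bit significand (leading 1 stored explicitly, read as 1.bbbbbbb), a bias of 15, biased exponents restricted to 1..30, and truncation instead of rounding. The names fp_mul and check_exponent are hypothetical.

```python
P = 8                   # significand bits, leading 1 explicit (hypothetical toy format)
BIAS = 15               # hypothetical exponent bias
E_MIN, E_MAX = 1, 30    # hypothetical range of the biased exponent

def check_exponent(e):
    if e > E_MAX:
        raise OverflowError("exponent overflow")
    if e < E_MIN:
        raise ArithmeticError("exponent underflow")

def fp_mul(x, y):
    """Multiply toy floats (sign, S, E); value = (-1)**sign * S / 2**(P-1) * 2**(E - BIAS)."""
    (x_sign, xs, xe), (y_sign, ys, ye) = x, y
    sign = x_sign ^ y_sign             # sign-magnitude: signs combine by XOR
    if xs == 0 or ys == 0:             # check for a zero operand
        return (sign, 0, 0)
    e = xe + ye - BIAS                 # add exponents, remove the doubled bias
    check_exponent(e)
    product = xs * ys                  # 2P-bit product of the significands
    if product >= 1 << (2 * P - 1):    # product is in [2, 4): renormalize
        s = product >> P
        e += 1
        check_exponent(e)
    else:                              # product is in [1, 2)
        s = product >> (P - 1)
    return (sign, s, e)                # low bits were truncated, not rounded

print(fp_mul((0, 192, 15), (1, 160, 16)))  # 1.5 * -2.5
```

In this normalized toy format the product of two significands only ever needs one right shift, so normalization can raise E but not lower it.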

Division

check for a zero operand
subtract the exponents: xe - ye
add the bias (the subtraction cancels it)
check for E overflow and underflow
divide the S
normalize and round the result
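
The division steps can be sketched the same way. The format is again a made-up toy (8-bit significand with explicit leading 1, bias 15, biased exponents 1..30), the quotient is truncated rather than rounded, and a zero divisor is simply reported as an error, which is one possible policy among others.

```python
P = 8                   # significand bits, leading 1 explicit (hypothetical toy format)
BIAS = 15               # hypothetical exponent bias
E_MIN, E_MAX = 1, 30    # hypothetical range of the biased exponent

def check_exponent(e):
    if e > E_MAX:
        raise OverflowError("exponent overflow")
    if e < E_MIN:
        raise ArithmeticError("exponent underflow")

def fp_div(x, y):
    """Divide toy floats (sign, S, E); value = (-1)**sign * S / 2**(P-1) * 2**(E - BIAS)."""
    (x_sign, xs, xe), (y_sign, ys, ye) = x, y
    sign = x_sign ^ y_sign
    if ys == 0:                        # check for a zero divisor
        raise ZeroDivisionError("zero divisor")
    if xs == 0:                        # zero dividend gives zero
        return (sign, 0, 0)
    e = xe - ye + BIAS                 # subtraction cancels the bias, so add it back
    check_exponent(e)
    q = (xs << (P - 1)) // ys          # quotient scaled to P bits
    if q < 1 << (P - 1):               # quotient is in (0.5, 1): normalize left
        q = (xs << P) // ys
        e -= 1
        check_exponent(e)
    return (sign, q, e)                # remainder is discarded (truncation)

print(fp_div((0, 160, 16), (0, 160, 15)))  # 2.5 / 1.25
print(fp_div((0, 128, 15), (0, 192, 15)))  # 1.0 / 1.5
```

Here normalization can only lower E, which is why the exponent is checked again after the left shift.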

Precision
-- Guard bits

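The registers that hold S during an operation are longer than the stored S. The extra bits (guard bits) pad out the right end of S with 0s, so that bits shifted right during alignment are not lost immediately.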
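
A small sketch of why guard bits matter, using integer significands with value = S * 2^E. The operands and the helper name sub_aligned are invented for the example: x = 128 * 2^1 = 256 and y = 255 * 2^0 = 255, so the exact difference is 1.

```python
from fractions import Fraction

def sub_aligned(xs, xe, ys, ye, guard_bits):
    """Compute x - y (requires xe >= ye) with the given number of guard bits.

    Returns the computed difference as an exact Fraction so the error is visible.
    """
    xs_wide = xs << guard_bits                  # pad both significands with 0s
    ys_wide = (ys << guard_bits) >> (xe - ye)   # align y; bits shifted off the end are lost
    diff = xs_wide - ys_wide
    return Fraction(diff, 1 << guard_bits) * Fraction(2) ** xe

print(sub_aligned(128, 1, 255, 0, guard_bits=0))  # low bit of y lost during alignment
print(sub_aligned(128, 1, 255, 0, guard_bits=4))  # guard bits keep it
```

Without guard bits the low bit of y is shifted out before the subtraction and the result is twice the correct value; with guard bits the shifted-out bit still takes part in the subtraction.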

-- Rounding

round to nearest
round toward positive infinity
round toward negative infinity
round toward 0 (truncation)
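
The four modes can be compared on a sign-magnitude significand that carries extra low-order bits. This is a sketch under assumptions: the function name and mode labels are made up, and ties in round to nearest are broken toward the even result, which is the IEEE 754 default but is not stated in the notes.

```python
def round_significand(sign, mag, drop, mode):
    """Drop the `drop` low bits (drop >= 1) of magnitude `mag`, rounding per `mode`.

    sign is 0 for positive, 1 for negative. Returns the rounded magnitude.
    """
    q, r = divmod(mag, 1 << drop)    # q: kept bits, r: discarded bits
    half = 1 << (drop - 1)
    if mode == "zero":               # truncate
        up = False
    elif mode == "pos_inf":          # magnitude grows only for positive values
        up = sign == 0 and r != 0
    elif mode == "neg_inf":          # magnitude grows only for negative values
        up = sign == 1 and r != 0
    elif mode == "nearest":          # ties go to the even result (assumption)
        up = r > half or (r == half and q % 2 == 1)
    else:
        raise ValueError(mode)
    return q + 1 if up else q

for mode in ("nearest", "pos_inf", "neg_inf", "zero"):
    print(mode, round_significand(0, 0b10110, 2, mode), round_significand(1, 0b10110, 2, mode))
```

If rounding up carries out of the top bit, the significand has to be shifted right once and E incremented, so rounding can itself trigger a renormalization.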

-- Denormalized numbers
To handle E underflow, the result is denormalized by right-shifting S and incrementing E until E is within
the representable range. This method is also referred to as "gradual underflow".
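
Gradual underflow can be sketched as a loop over a toy (S, E) pair with value = S * 2^E. The function name and the minimum exponent of -16 are assumptions chosen for the example.

```python
def denormalize(s, e, e_min):
    """Right-shift S and increment E until E reaches e_min (or S becomes 0)."""
    while e < e_min and s != 0:
        s >>= 1          # low-order bits are lost one at a time
        e += 1
    if s == 0:
        return (0, e_min)    # everything was shifted out: the result is zero
    return (s, e)

print(denormalize(0b10000000, -19, -16))  # value preserved, leading zeros appear
print(denormalize(0b10000001, -18, -16))  # low bit lost
print(denormalize(0b10000000, -30, -16))  # too small: becomes zero
```

Precision is lost bit by bit as the value shrinks, rather than the result dropping to zero as soon as E underflows.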