Floating Point Arithmetic


Representation   +-Significand * Base ^ +-Exponent

The significand carries a sign bit.
The exponent is stored with a bias.
In a normalised number the leftmost bit of the significand is 1.
Example    0.1bbbb * 2^E
The leftmost bit of the significand is therefore always 1 and can be "implicit" (no need to store this bit).
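
As a small illustration of normalisation, the sketch below uses Python's standard math.frexp, which happens to return the significand in the same 0.1bbbb form (0.5 <= |s| < 1). The helper name normalise is hypothetical, introduced only for this example.

```python
import math

def normalise(x: float):
    """Split x into (s, e) with x == s * 2**e and 0.5 <= |s| < 1 (for x != 0)."""
    return math.frexp(x)

s, e = normalise(6.0)
print(s, e)   # 0.75 3, i.e. 0.11 (binary) * 2^3
```

Because the first bit after the binary point is 1 for every non-zero input, a storage format can leave that bit out.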

Range of representable numbers:

negative numbers
positive numbers
negative overflow, negative underflow
positive overflow, positive underflow
The representable numbers are not spaced evenly along the number line: the larger the magnitude, the wider the gap between neighbouring numbers.
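
The uneven spacing can be observed directly. A minimal sketch, assuming Python 3.9 or later (for math.ulp) and IEEE double precision floats; the helper name spacing is hypothetical.

```python
import math

def spacing(x: float) -> float:
    """Gap between x and the next representable double of larger magnitude."""
    return math.ulp(x)

for x in (1.0, 1024.0, 1e16):
    print(x, spacing(x))
```

Near 1e16 the gap is already 2.0, so adding 1.0 to 1e16 leaves the value unchanged.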

IEEE standard 754

single 32 bits, double 64 bits, double extended >= 79 bits
exponent base 2
                            single    double    double extended
word width (bits)           32        64        >= 79
significand width (bits)    23        52        >= 63
exponent width (bits)       8         11        >= 15
exponent bias               127       1023      unspecified
E = 00...  S = 0   zero
E = 11...  S = 0   pos infinity or neg infinity, depending on the sign bit
E = 00...  S /= 0  denormalized number
E = 11...  S /= 0  Not a Number (NaN)
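
The field layout and the special cases can be checked by unpacking a single precision value. This is a sketch using the standard struct module; the function name decode_single is hypothetical. Note that IEEE 754 places the binary point after the implicit leading 1 (1.bbbb), so 1.0 is stored with E = 127 and S = 0.

```python
import struct

def decode_single(x: float):
    """Return (sign, E, S, kind) for x rounded to IEEE single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31              # 1 bit
    E = (bits >> 23) & 0xFF        # 8-bit biased exponent
    S = bits & 0x7FFFFF            # 23-bit stored significand
    if E == 0:
        kind = "zero" if S == 0 else "denormalized"
    elif E == 0xFF:
        kind = "infinity" if S == 0 else "NaN"
    else:
        kind = "normalized"
    return sign, E, S, kind

print(decode_single(1.0))            # (0, 127, 0, 'normalized')
print(decode_single(float("-inf")))  # (1, 255, 0, 'infinity')
```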

Floating point arithmetic

x = xs * B^xe
y = ys * B^ye
let xe <= ye
x + y = (xs * B^(xe - ye) + ys) * B^ye
x - y = (xs * B^(xe - ye) - ys) * B^ye
x * y = (xs * ys) * B^(xe + ye)
x / y = (xs / ys) * B^(xe - ye)
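
A quick numeric check of the four identities, taking x = xs * B^xe and y = ys * B^ye with xe <= ye. The sketch uses exact rationals from the standard fractions module so that no rounding interferes; the function names and the sample operands are assumptions made for illustration.

```python
from fractions import Fraction

B = 2  # base

def power(e):
    return Fraction(B) ** e

def add(xs, xe, ys, ye):
    return (xs * power(xe - ye) + ys) * power(ye)

def sub(xs, xe, ys, ye):
    return (xs * power(xe - ye) - ys) * power(ye)

def mul(xs, xe, ys, ye):
    return (xs * ys) * power(xe + ye)

def div(xs, xe, ys, ye):
    return (xs / ys) * power(xe - ye)

x = (Fraction(3, 4), 2)   # 0.75  * 2^2 = 3
y = (Fraction(5, 8), 4)   # 0.625 * 2^4 = 10
print(add(*x, *y), sub(*x, *y), mul(*x, *y), div(*x, *y))   # 13 -7 30 3/10
```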

Addition and Subtraction
msd = most significant digit
S = significand
E = exponent

  1. make the implicit bit explicit
  2. check whether either operand is 0
  3. align: shift the S of the number with the smaller E to the right (incrementing its E) until the two E are equal
  4. check whether the shifted S has become 0
  5. add the signed S
  6. check whether the result is 0
  7. check for S overflow; if so, shift S right and increment E
  8. check for E overflow; if so, report an error
  9. normalize the result: shift S left until the msd is not zero, decrementing E for each shift (E may underflow)
  10. round off the result
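
The addition steps can be sketched on a toy format. This is an illustration only, not the IEEE algorithm: a number is a pair (s, e) standing for s * 2**e, where s is a signed integer that already holds the leading bit explicitly (step 1). The width P, the exponent limits and all names are assumptions, and rounding is plain truncation.

```python
P = 8                       # toy significand width in bits (assumption)
E_MIN, E_MAX = -126, 127    # toy exponent range (assumption)

def _shift_right(s, n):
    """Shift the magnitude of signed s right by n bits, keeping the sign."""
    m = abs(s) >> n
    return m if s >= 0 else -m

def fp_add(xs, xe, ys, ye):
    """Add toy floats (s, e) = s * 2**e with 2**(P-1) <= |s| < 2**P, or s == 0."""
    if xs == 0:                              # step 2: operand is 0
        return ys, ye
    if ys == 0:
        return xs, xe
    if xe < ye:                              # let x be the operand with the larger E
        xs, xe, ys, ye = ys, ye, xs, xe
    ys = _shift_right(ys, xe - ye)           # step 3: align, low bits are dropped
    if ys == 0:                              # step 4: smaller operand vanished
        return xs, xe
    s, e = xs + ys, xe                       # step 5: add signed S
    if s == 0:                               # step 6: exact cancellation
        return 0, 0
    if abs(s) >= 1 << P:                     # step 7: S overflow
        s, e = _shift_right(s, 1), e + 1
        if e > E_MAX:                        # step 8: E overflow
            raise OverflowError("exponent overflow")
    while abs(s) < 1 << (P - 1):             # step 9: normalize
        s, e = s * 2, e - 1
        if e < E_MIN:
            raise ArithmeticError("exponent underflow")
    return s, e                              # step 10: dropped bits = round toward 0

print(fp_add(192, -6, 160, -4))   # (208, -4): 3 + 10 = 13
```

Subtraction is the same routine with the sign of the second significand flipped.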
Multiplication
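  1. check whether either operand is 0
  2. add the two E
  3. subtract the bias (the sum of two biased exponents carries the bias twice)
  4. check for E overflow and underflow
  5. multiply S as sign-magnitude numbers
  6. normalize and round the result (E may underflow)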
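
A sketch of the multiplication steps on a toy sign-magnitude format, assuming a number is (sign, E, S) with value (-1)^sign * (S / 2^P) * 2^(E - BIAS) and S normalised to 0.1bbbbbbb. P, BIAS, the exponent limits and the function names are illustrative assumptions; rounding is truncation.

```python
P, BIAS = 8, 127        # toy significand width and exponent bias (assumptions)
E_LO, E_HI = 1, 254     # usable biased exponents in this toy format

def check_exponent(e):
    if e > E_HI:
        raise OverflowError("exponent overflow")
    if e < E_LO:
        raise ArithmeticError("exponent underflow")

def to_float(x):
    sign, E, S = x
    return (-1) ** sign * (S / 2 ** P) * 2.0 ** (E - BIAS)

def fp_mul(x, y):
    (sx, ex, mx), (sy, ey, my) = x, y
    sign = sx ^ sy
    if mx == 0 or my == 0:            # step 1: operand is 0
        return (sign, 0, 0)
    e = ex + ey - BIAS                # steps 2-3: add E, remove the extra bias
    check_exponent(e)                 # step 4
    p = mx * my                       # step 5: 2P-bit product of the magnitudes
    if p < 1 << (2 * P - 1):          # step 6: product below 0.1..., shift left once
        p, e = p << 1, e - 1
        check_exponent(e)             # E may underflow here
    return (sign, e, p >> P)          # keep the top P bits (round toward 0)

three, ten = (0, 129, 192), (0, 131, 160)   # 0.75 * 2^2 and 0.625 * 2^4
print(to_float(fp_mul(three, ten)))         # 30.0
```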
Division
  1. check whether either operand is 0
  2. subtract the exponents: xe - ye
  3. add the bias back (the subtraction cancels it)
  4. check for E overflow and underflow
  5. divide S
  6. normalize and round the result
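
The division steps admit a similar sketch. As before this is a toy format chosen for illustration: (sign, E, S) stands for (-1)^sign * (S / 2^P) * 2^(E - BIAS) with S normalised to 0.1bbbbbbb; all names and constants are assumptions, and the quotient is truncated.

```python
P, BIAS = 8, 127        # toy significand width and exponent bias (assumptions)
E_LO, E_HI = 1, 254     # usable biased exponents in this toy format

def check_exponent(e):
    if e > E_HI:
        raise OverflowError("exponent overflow")
    if e < E_LO:
        raise ArithmeticError("exponent underflow")

def to_float(x):
    sign, E, S = x
    return (-1) ** sign * (S / 2 ** P) * 2.0 ** (E - BIAS)

def fp_div(x, y):
    (sx, ex, mx), (sy, ey, my) = x, y
    sign = sx ^ sy
    if my == 0:                       # step 1: divisor is 0
        raise ZeroDivisionError("division by zero")
    if mx == 0:                       # step 1: dividend is 0
        return (sign, 0, 0)
    e = ex - ey + BIAS                # steps 2-3: subtract E, restore the bias
    check_exponent(e)                 # step 4
    q = (mx << P) // my               # step 5: quotient in units of 2^-P
    if q >= 1 << P:                   # step 6: quotient >= 1, shift right once
        q, e = q >> 1, e + 1
        check_exponent(e)
    return (sign, e, q)               # discarded bits = round toward 0

thirty, ten = (0, 132, 240), (0, 131, 160)
print(to_float(fp_div(thirty, ten)))   # 3.0
```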


Precision
-- Guard bits
extra bits that pad out the right end of S with 0s during an operation, so that bits shifted out while aligning are not lost
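
The effect of guard bits shows up when two nearly equal numbers are subtracted. A minimal sketch on a toy format, where (s, e) stands for s * 2**e with an 8-bit integer s; the function name and operands are assumptions chosen for illustration.

```python
def sub_aligned(xs, xe, ys, ye, guard):
    """x - y for positive toy floats s * 2**e with xe >= ye,
    padding both significands with `guard` zero bits before aligning."""
    a = xs << guard                       # pad the right end with 0s
    b = (ys << guard) >> (xe - ye)        # align y; padded bits catch the shifted-out bit
    return (a - b) * 2.0 ** (xe - guard)

x = (128, -6)    # 2.0
y = (255, -7)    # 1.9921875
print(sub_aligned(*x, *y, guard=0))   # 0.015625  (twice the true difference)
print(sub_aligned(*x, *y, guard=4))   # 0.0078125 (exact)
```

Without guard bits the lowest bit of y is lost during alignment, and the result is off by a factor of two.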
-- Rounding
round to nearest
round toward pos inf
round toward neg inf
round toward 0 (truncation)
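
The four modes can be compared by dropping the k low bits of an integer significand. A sketch under stated assumptions: the function name and mode labels are hypothetical, and "nearest" breaks ties toward the even neighbour, which is the IEEE 754 default.

```python
def round_bits(s, k, mode):
    """Drop the k (>= 1) low bits of the signed integer s under a rounding mode."""
    q, r = divmod(s, 1 << k)      # q = floor(s / 2^k), 0 <= r < 2^k
    if r == 0:
        return q                  # exact, nothing to round
    if mode == "neg_inf":
        return q
    if mode == "pos_inf":
        return q + 1
    if mode == "zero":
        return q if s > 0 else q + 1
    if mode == "nearest":
        half = 1 << (k - 1)
        if r > half or (r == half and q % 2 == 1):
            return q + 1          # above the midpoint, or tie with odd q
        return q
    raise ValueError(mode)

for mode in ("nearest", "pos_inf", "neg_inf", "zero"):
    print(mode, round_bits(23, 2, mode), round_bits(-23, 2, mode))
```

Here 23 / 4 = 5.75 and -23 / 4 = -5.75, so the modes give 6, 6, 5, 5 and -6, -5, -6, -5 respectively.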
-- Denormalized number
To handle E underflow, the result is denormalized by right-shifting S and incrementing E until E is within the
representable range. This method is also referred to as "gradual underflow".
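
Gradual underflow can be observed by repeatedly halving the smallest normalized double. A sketch assuming IEEE double precision without flush-to-zero, which is the usual case for Python floats; the function name denormal_steps is hypothetical.

```python
import sys

def denormal_steps(x=sys.float_info.min):
    """Count how many times x can be halved below the smallest normalized
    value before the result finally becomes 0."""
    steps = 0
    x /= 2
    while x > 0.0:
        steps += 1
        x /= 2
    return steps

print(sys.float_info.min)    # smallest normalized double, 2^-1022
print(denormal_steps())      # 52, one step per bit of the stored significand
```

Each halving shifts S one place to the right, so precision is lost one bit at a time instead of the result dropping straight to zero.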