## Floating Point Arithmetic

Representation: $\pm S \times B^{\pm E}$ (S = significand, B = base, E = exponent)

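The significand uses a sign bit (sign-magnitude).
The exponent uses a bias.
In a normalised number, the leftmost bit of the significand is 1.
Example: $0.1bbbb \times 2^{E}$, where each b is a binary digit.
The leftmost bit of the significand is therefore always 1 and is "implicit" (there is no need to store this bit).
IEEE 754 places the implicit 1 to the left of the binary point instead: $1.bbbb \times 2^{E}$.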
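As a concrete illustration of the sign bit, the biased exponent and the implicit bit, the sketch below splits a value stored as an IEEE 754 single into its three fields and rebuilds it. It assumes the IEEE layout (1 sign bit, 8 exponent bits, 23 stored significand bits, bias 127, implicit 1 left of the binary point); the function names `decode_float32` and `value_of_normalized` are made up for this example.

```python
import struct

def decode_float32(x):
    """Return (sign, biased_exponent, fraction) of x stored as an IEEE 754 single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                       # 1 bit
    biased_exponent = (bits >> 23) & 0xFF   # 8 bits, stored with bias 127
    fraction = bits & 0x7FFFFF              # 23 stored bits, implicit 1 not included
    return sign, biased_exponent, fraction

def value_of_normalized(sign, biased_exponent, fraction, bias=127):
    """Rebuild the value of a normalized single from its fields."""
    significand = 1 + fraction / 2**23      # restore the implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (biased_exponent - bias)

# -6.5 = -1.625 * 2^2, so the stored exponent is 2 + 127 = 129
print(decode_float32(-6.5))
print(value_of_normalized(*decode_float32(-6.5)))
```

Only the 23 fraction bits are stored; the leading 1 is added back when the value is reconstructed.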

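### Range of representable numbers

negative numbers and positive numbers each cover a limited range
neg overflow, pos overflow: the magnitude of the result is too large to represent
neg underflow, pos underflow: the magnitude of the result is too small to represent (too close to 0)
The representable numbers are not spaced evenly along the number line: the larger the magnitude, the wider the spacing between neighbouring numbers.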
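The uneven spacing, overflow and underflow can be observed directly with Python's built-in floats, which are IEEE 754 doubles on common platforms. This is a small sketch that assumes Python 3.9 or later for `math.ulp`; the helper name `spacing` is invented for the example.

```python
import math

def spacing(x):
    """Gap between x and the next larger representable double."""
    return math.ulp(x)

for x in (1.0, 1024.0, 2.0**53):
    print(x, spacing(x))

# At 2^53 the gap is 2, so adding 1 cannot produce a new value
print(2.0**53 + 1.0 == 2.0**53)

# Positive overflow: the result is too large and becomes infinity
print(1e308 * 10 == math.inf)

# Positive underflow: halving the smallest positive double gives 0
print(5e-324 / 2 == 0.0)
```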

### IEEE standard 754

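Formats: single (32 bits), double (64 bits), double extended (>= 79 bits). The exponent base is 2.

| | single | double | double extended |
|---|---|---|---|
| word width (bits) | 32 | 64 | >= 79 |
| significand width (bits) | 23 | 52 | >= 63 |
| exponent width (bits) | 8 | 11 | >= 15 |
| exponent bias | 127 | 1023 | unspecified |

Special values are encoded by the exponent field E and the significand field S:

| E | S | meaning |
|---|---|---|
| all 0s | = 0 | zero |
| all 1s | = 0 | positive or negative infinity |
| all 0s | ≠ 0 | denormalized number |
| all 1s | ≠ 0 | Not a Number (NaN) |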
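The E/S rules for special values can be written as a small classifier over single-precision bit patterns. This is only a sketch; `classify_single` and `bits_of` are names chosen for the example, not part of any library.

```python
import struct

def classify_single(bits):
    """Classify a 32-bit pattern by its exponent field E and significand field S."""
    sign = "neg" if bits >> 31 else "pos"
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0x00:
        return f"{sign} zero" if fraction == 0 else "denormalized"
    if exponent == 0xFF:
        return f"{sign} infinity" if fraction == 0 else "NaN"
    return "normalized"

def bits_of(x):
    """Bit pattern of x when stored as an IEEE 754 single."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

for x in (0.0, -0.0, 1.0, float("inf"), float("-inf"), float("nan")):
    print(x, classify_single(bits_of(x)))
print(classify_single(0x00000001))   # smallest positive single, E = 0 and S != 0
```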

### Floating point arithmetic

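$$x = x_s \times B^{x_e}, \qquad y = y_s \times B^{y_e}$$

Let $x_e \le y_e$. Then

$$
\begin{aligned}
x + y &= (x_s \times B^{x_e - y_e} + y_s) \times B^{y_e} \\
x - y &= (x_s \times B^{x_e - y_e} - y_s) \times B^{y_e} \\
x \times y &= (x_s \times y_s) \times B^{x_e + y_e} \\
x / y &= (x_s / y_s) \times B^{x_e - y_e}
\end{aligned}
$$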
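The four identities can be checked with exact rational arithmetic. The sketch below assumes B = 2 and represents a number as a (significand, exponent) pair; the function names are invented for the example, and no normalization or rounding is attempted.

```python
from fractions import Fraction

B = Fraction(2)  # base, assumed to be 2 here

def value(s, e):
    """Exact value of significand s times B to the power e."""
    return s * B**e

def add(xs, xe, ys, ye):
    assert xe <= ye, "the formula assumes xe <= ye"
    return xs * B**(xe - ye) + ys, ye

def sub(xs, xe, ys, ye):
    assert xe <= ye, "the formula assumes xe <= ye"
    return xs * B**(xe - ye) - ys, ye

def mul(xs, xe, ys, ye):
    return xs * ys, xe + ye

def div(xs, xe, ys, ye):
    return xs / ys, xe - ye

# x = 0.75 * 2^1 = 1.5 and y = 0.5 * 2^3 = 4
x = (Fraction(3, 4), 1)
y = (Fraction(1, 2), 3)
for op in (add, sub, mul, div):
    s, e = op(*x, *y)
    print(op.__name__, s, e, value(s, e))
```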

msd = most significant digit
S = significand
E = exponent

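Addition and subtraction
2. check whether either operand is 0
3. align the significands: shift the S of the number with the smaller exponent to the right (incrementing its E) until the two E are equal
4. check whether the shifted S has become 0
6. check whether the resulting S is 0
7. check for S overflow; if so, shift S right and increment E
9. normalize the result: shift S left, decrementing E, until the msd is not zero; E may underflow
10. round off the result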
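The addition steps can be sketched on a toy format. The assumptions here are not from the notes: the significand is a signed integer of at most `P` = 8 bits, the value is s × 2^e, alignment truncates toward zero instead of rounding, and exponent range checks are left out. The names `fp_add` and `shift_right` are made up for the example.

```python
P = 8  # toy significand width in bits (an assumption for this sketch)

def shift_right(s):
    """Shift a signed significand right by one bit, truncating toward zero."""
    return s >> 1 if s >= 0 else -((-s) >> 1)

def fp_add(xs, xe, ys, ye):
    """Add x = xs * 2^xe and y = ys * 2^ye; return a normalized (s, e) pair."""
    # check for a zero operand
    if xs == 0:
        return ys, ye
    if ys == 0:
        return xs, xe
    # make x the operand with the smaller exponent, then align it
    if xe > ye:
        xs, xe, ys, ye = ys, ye, xs, xe
    while xe < ye:
        xs = shift_right(xs)
        xe += 1
        if xs == 0:              # smaller operand was shifted out entirely
            return ys, ye
    s, e = xs + ys, ye           # add the signed significands
    if s == 0:                   # operands cancelled exactly
        return 0, 0
    if abs(s) >= 2**P:           # significand overflow: shift right, bump E
        s = shift_right(s)
        e += 1
    while abs(s) < 2**(P - 1):   # normalize: shift left until the msd is 1
        s *= 2
        e -= 1
    return s, e

print(fp_add(192, 0, 128, 1))    # 192 + 256
print(fp_add(255, 0, 255, 0))    # significand overflow case
print(fp_add(129, 0, -128, 0))   # cancellation, result needs normalizing
```

Subtraction can reuse the same routine by negating the second significand before the call.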
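Multiplication
1. check whether either operand is 0
3. subtract the bias (the sum of two biased exponents contains the bias twice)
4. check for E overflow and underflow
5. multiply the significands as sign-magnitude numbers
6. normalize and round the result (E may underflow)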
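Division
1. check whether either operand is 0 (a zero dividend gives 0; a zero divisor gives an error or infinity)
2. compute xe - ye
4. check for E overflow and underflow
5. divide the significands
6. normalize and round the result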
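The exponent handling in multiplication and division can be sketched with biased exponents. The assumptions are illustrative only: a number is a (sign, biased exponent, significand) triple with the significand in [1, 2), the bias is 127, normalized biased exponents run from 1 to 254, and the significand is an exact `Fraction`, so no rounding step is needed. The separate exponent checks of the lists above are folded into one check after normalization. All function names are hypothetical.

```python
from fractions import Fraction

BIAS = 127               # single-precision bias
E_MIN, E_MAX = 1, 254    # biased exponent range of normalized numbers
ZERO = Fraction(0)

def normalize(s, e):
    """Bring s into [1, 2), adjusting e, then check the exponent range."""
    while s >= 2:
        s /= 2
        e += 1
    while s < 1:
        s *= 2
        e -= 1
    if e > E_MAX:
        raise OverflowError("exponent overflow")
    if e < E_MIN:
        raise ArithmeticError("exponent underflow")
    return s, e

def fp_mul(x, y):
    (xsign, xe, xs), (ysign, ye, ys) = x, y
    sign = xsign ^ ysign                       # sign-magnitude: signs combine by XOR
    if xs == 0 or ys == 0:
        return sign, 0, ZERO
    s, e = normalize(xs * ys, xe + ye - BIAS)  # the sum holds the bias twice
    return sign, e, s

def fp_div(x, y):
    (xsign, xe, xs), (ysign, ye, ys) = x, y
    sign = xsign ^ ysign
    if ys == 0:
        raise ZeroDivisionError("division by zero")
    if xs == 0:
        return sign, 0, ZERO
    s, e = normalize(xs / ys, xe - ye + BIAS)  # the difference has lost the bias
    return sign, e, s

def to_value(n):
    sign, e, s = n
    return (-1) ** sign * s * Fraction(2) ** (e - BIAS)

x = (0, 128, Fraction(3, 2))   # +1.5  * 2^1 =  3
y = (1, 129, Fraction(5, 4))   # -1.25 * 2^2 = -5
print(fp_mul(x, y), to_value(fp_mul(x, y)))
print(fp_div(x, y), to_value(fp_div(x, y)))
```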

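### Precision

-- Guard bits
pad out the right end of S with extra 0 bits during the operation, so that bits shifted right during alignment are not lost immediately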
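A small sketch shows what guard bits buy during a subtraction. It assumes integer significands of 4 bits, x = 1.000 × 2^1 and y = 1.111 × 2^0 written as integer significands with adjusted exponents, and truncation when y is shifted for alignment; `subtract_aligned` is a name invented for the example.

```python
def subtract_aligned(xs, xe, ys, ye, guard_bits):
    """Compute x - y for xe >= ye and non-negative integer significands.

    Both significands are padded on the right with guard_bits zeros,
    then y is shifted right to align it; bits shifted past the guard
    bits are lost.
    """
    xs <<= guard_bits
    ys <<= guard_bits
    ys >>= xe - ye
    diff = xs - ys
    return diff * 2.0 ** (xe - guard_bits)

# x = 0b1000 * 2^-2 = 2.0 and y = 0b1111 * 2^-3 = 1.875, exact difference 0.125
print(subtract_aligned(0b1000, -2, 0b1111, -3, guard_bits=0))
print(subtract_aligned(0b1000, -2, 0b1111, -3, guard_bits=1))
```

Without a guard bit the lowest bit of y is dropped during alignment and the result is off by a factor of two; one guard bit is enough to recover the exact difference in this case.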
-- Rounding
round to nearest (the default; a tie goes to the value whose least significant bit is 0)
round toward pos infinity
round toward neg infinity
round toward 0 (truncation)
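The four rounding directions can be compared using Python's `decimal` module. This illustrates them on decimal digits, rounding to an integer, not on binary significands; mapping "round to nearest" to `ROUND_HALF_EVEN` reflects the IEEE 754 default of breaking ties toward the even neighbour. The helper `round_all` is made up for the example.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN

MODES = {
    "nearest": ROUND_HALF_EVEN,
    "toward +inf": ROUND_CEILING,
    "toward -inf": ROUND_FLOOR,
    "toward 0": ROUND_DOWN,
}

def round_all(text):
    """Round the decimal literal in text to an integer under each mode."""
    value = Decimal(text)
    return {name: int(value.quantize(Decimal("1"), rounding=mode))
            for name, mode in MODES.items()}

for literal in ("2.5", "-2.5", "-2.7"):
    print(literal, round_all(literal))
```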
-- Denormalized numbers
To handle E underflow, the result is denormalized by right-shifting S and incrementing E until E is within the representable range. This method is also referred to as "gradual underflow".
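Gradual underflow is visible with Python floats, assuming they are IEEE 754 doubles as on common platforms. Halving the smallest normalized double does not jump straight to 0; it passes through the denormalized numbers first. The function name `halvings_until_zero` is invented for the example.

```python
import sys

def halvings_until_zero(x):
    """Count how many times x can be halved before the result becomes 0."""
    steps = 0
    while x > 0.0:
        x /= 2
        steps += 1
    return steps

smallest_normal = sys.float_info.min        # 2^-1022
print(smallest_normal / 2 > 0.0)            # still representable, as a denormalized number
print(halvings_until_zero(smallest_normal))

# Two distinct small numbers still have a nonzero difference
a, b = 1.5 * smallest_normal, smallest_normal
print(a != b and a - b != 0.0)
```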