Floating Point Arithmetic
Representation: ±Significand * Base ^ ±Exponent
The significand carries a sign bit (sign-magnitude).
The exponent is stored with a bias.
In a normalised number, the leftmost bit of the significand is 1.
Example: ±0.1bbb...b * 2^(±E)
Because the leftmost bit of the significand is always 1, it can be "implicit"
(there is no need to store this bit).
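Range of representable numbers:
negative numbers
positive numbers
negative overflow (magnitude too large), negative underflow (magnitude too small)
positive overflow (magnitude too large), positive underflow (magnitude too small)
The representable numbers are not spaced evenly along the number line: the
larger the magnitude, the wider the spacing between adjacent numbers.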
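The uneven spacing can be observed directly. The sketch below assumes Python 3.9 or later (for math.ulp) and uses Python's float, which is an IEEE double on common platforms; the helper name spacing is only illustrative.

```python
import math

def spacing(x: float) -> float:
    """Gap between x and the next larger representable double."""
    return math.ulp(x)

# The gap grows with the magnitude of the number.
for x in (1.0, 1024.0, 1e15):
    print(x, spacing(x))
```

Each time the exponent goes up by one, the gap between neighbouring numbers doubles.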
IEEE standard 754
single 32 bits, double 64 bits, double extended >= 79 bits
exponent base 2
                            single   double   double extended
word width (bits)           32       64       >= 79
significand width (bits)    23       52       >= 63
exponent width (bits)       8        11       >= 15
exponent bias               127      1023     unspecified
E = 00...0, S = 0     zero
E = 11...1, S = 0     positive infinity, negative infinity
E = 00...0, S != 0    denormalized number
E = 11...1, S != 0    Not a Number (NaN)
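The special encodings can be checked by pulling a single-precision value apart into its fields. This is a minimal sketch using Python's struct module; decode_single and classify are illustrative names, not part of any standard API. Note that IEEE 754 normalizes to the form 1.bbb...b, so 1.0 is stored with a biased exponent of 127 and an all-zero fraction.

```python
import struct

def decode_single(x: float):
    """Return (sign, biased exponent, fraction) of x rounded to single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, bias 127
    fraction = bits & 0x7FFFFF        # 23 stored significand bits
    return sign, exponent, fraction

def classify(x: float) -> str:
    """Apply the four special-case rules to the E and S fields."""
    _, e, s = decode_single(x)
    if e == 0x00:
        return "zero" if s == 0 else "denormalized"
    if e == 0xFF:
        return "infinity" if s == 0 else "NaN"
    return "normalized"

print(decode_single(1.0))    # (0, 127, 0)
print(classify(1e-40))       # too small for a normalized single
```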
Floating point arithmetic
x = xs * B^xe
y = ys * B^ye
let xe <= ye
x + y = (xs * B^(xe - ye) + ys) * B^ye
x - y = (xs * B^(xe - ye) - ys) * B^ye
x * y = (xs * ys) * B^(xe + ye)
x / y = (xs / ys) * B^(xe - ye)
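The four identities can be checked with exact rational arithmetic. In this sketch a number is a pair (s, e) standing for s * B^e; the pair layout and the function names are assumptions made for illustration, and no rounding or normalization is performed.

```python
from fractions import Fraction

B = 2  # base

def value(n):
    """Exact value of the pair (s, e), i.e. s * B^e."""
    s, e = n
    return Fraction(s) * Fraction(B) ** e

def add(x, y):
    # Order the operands so that xe <= ye, as the formulas require.
    (xs, xe), (ys, ye) = sorted([x, y], key=lambda n: n[1])
    return (Fraction(xs) * Fraction(B) ** (xe - ye) + ys, ye)

def sub(x, y):
    ys, ye = y
    return add(x, (-ys, ye))  # x - y = x + (-y)

def mul(x, y):
    return (x[0] * y[0], x[1] + y[1])

def div(x, y):
    return (Fraction(x[0]) / y[0], x[1] - y[1])

x, y = (3, 1), (5, 3)   # 3 * 2^1 = 6 and 5 * 2^3 = 40
print(value(add(x, y)), value(sub(x, y)), value(mul(x, y)), value(div(x, y)))
```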
Addition and Subtraction
msd = most significant digit
S = significand
E = exponent
- make the implicit bit explicit
- check for a zero operand
- align by shifting the S of the smaller number to the right (incrementing its E)
  until the two E are equal
- check whether the shifted S has become 0
- add the signed S
- check for a zero result
- check for S overflow; if so, shift S right and increment E
- check for E overflow; if so, report an error
- normalize the result: shift S left until the msd is not zero, decrementing E
  (E may underflow)
- round off the result
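The addition steps can be sketched on a toy format. Everything about the format here is an assumption made for illustration: a number is (sign, sig, exp) with value (-1)^sign * sig * 2^exp, sig is a P-bit integer whose top bit is set when normalized (so the implicit bit is already explicit), the exponent is unbiased, and bits shifted out during alignment are simply dropped (round toward 0, no guard bits).

```python
P = 8                   # significand width in bits (hypothetical)
EMIN, EMAX = -20, 20    # exponent range (hypothetical)

def to_float(n):
    sign, sig, exp = n
    return (-1) ** sign * sig * 2.0 ** exp

def fp_add(x, y):
    (xsign, xs, xe), (ysign, ys, ye) = x, y
    # Check for a zero operand.
    if xs == 0:
        return y
    if ys == 0:
        return x
    # Align: make x the operand with the smaller exponent, then shift it right.
    if xe > ye:
        (xsign, xs, xe), (ysign, ys, ye) = (ysign, ys, ye), (xsign, xs, xe)
    xs >>= ye - xe
    if xs == 0:                      # smaller operand was shifted away entirely
        return (ysign, ys, ye)
    # Add the signed significands.
    total = (-xs if xsign else xs) + (-ys if ysign else ys)
    if total == 0:
        return (0, 0, 0)
    sign, sig, exp = int(total < 0), abs(total), ye
    # Significand overflow: shift right and increment the exponent.
    if sig >= 1 << P:
        sig >>= 1
        exp += 1
        if exp > EMAX:
            raise OverflowError("exponent overflow")
    # Normalize: shift left until the most significant bit is set.
    while sig < 1 << (P - 1):
        sig <<= 1
        exp -= 1
        if exp < EMIN:
            raise ArithmeticError("exponent underflow")
    return (sign, sig, exp)

one_and_half = (0, 192, -7)   # 192 * 2^-7 = 1.5
quarter = (0, 128, -9)        # 128 * 2^-9 = 0.25
print(to_float(fp_add(one_and_half, quarter)))
```

Subtraction uses the same routine with the sign of the second operand flipped.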
Multiplication
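- check for a zero operand
- add the two E
- subtract the bias (the sum of two biased exponents contains the bias twice)
- check for E overflow or underflow
- multiply the S as sign-magnitude numbers
- normalize and round the result (E may underflow)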
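A sketch of the multiplication steps on a toy format follows. The format is an assumption for illustration only: a number is (sign, sig, e), sig is a P-bit fixed-point significand with P - 1 fraction bits (so 128 means 1.0 when P = 8), e is a biased exponent, and rounding is done by truncation.

```python
P = 8         # significand bits including the leading 1 (hypothetical)
BIAS = 15     # exponent bias (hypothetical)
E_MAX = 30    # largest usable biased exponent (hypothetical)

def to_float(n):
    sign, sig, e = n
    return (-1) ** sign * (sig / 2 ** (P - 1)) * 2.0 ** (e - BIAS)

def check_exponent(e):
    if e > E_MAX:
        raise OverflowError("exponent overflow")
    if e < 1:
        raise ArithmeticError("exponent underflow")

def fp_mul(x, y):
    (xsign, xs, xe), (ysign, ys, ye) = x, y
    sign = xsign ^ ysign
    # Check for a zero operand.
    if xs == 0 or ys == 0:
        return (sign, 0, 0)
    e = xe + ye        # add the exponents: the bias is now counted twice
    e -= BIAS          # subtract the bias once
    check_exponent(e)
    # Multiply the magnitudes; drop the extra P - 1 fraction bits (truncation).
    sig = (xs * ys) >> (P - 1)
    # The product of two values in [1, 2) lies in [1, 4): normalize if >= 2.
    if sig >= 1 << P:
        sig >>= 1
        e += 1
        check_exponent(e)
    return (sign, sig, e)

a = (0, 192, 15)   # 1.5  * 2^0
b = (0, 160, 16)   # 1.25 * 2^1 = 2.5
print(to_float(fp_mul(a, b)))
```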
Division
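- check for a zero operand (a zero divisor is an error or gives infinity; a zero
  dividend gives 0)
- subtract the exponents: xe - ye
- add the bias (the subtraction of two biased exponents cancels it)
- check for E overflow or underflow
- divide the S
- normalize and round the result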
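The division steps can be sketched in the same way. The toy format is again an assumption for illustration: (sign, sig, e) with a P-bit fixed-point significand holding P - 1 fraction bits, a biased exponent e, truncation as the rounding rule, and an exception for a zero divisor.

```python
P = 8         # significand bits including the leading 1 (hypothetical)
BIAS = 15     # exponent bias (hypothetical)
E_MAX = 30    # largest usable biased exponent (hypothetical)

def to_float(n):
    sign, sig, e = n
    return (-1) ** sign * (sig / 2 ** (P - 1)) * 2.0 ** (e - BIAS)

def check_exponent(e):
    if e > E_MAX:
        raise OverflowError("exponent overflow")
    if e < 1:
        raise ArithmeticError("exponent underflow")

def fp_div(x, y):
    (xsign, xs, xe), (ysign, ys, ye) = x, y
    sign = xsign ^ ysign
    # Check for a zero operand.
    if ys == 0:
        raise ZeroDivisionError("division by zero")
    if xs == 0:
        return (sign, 0, 0)
    e = xe - ye        # subtract the exponents: the bias cancels out
    e += BIAS          # add the bias back
    check_exponent(e)
    # Divide the magnitudes, keeping P - 1 fraction bits (truncation).
    sig = (xs << (P - 1)) // ys
    # The quotient of two values in [1, 2) lies in (0.5, 2): normalize if < 1.
    if sig < 1 << (P - 1):
        sig = (xs << P) // ys   # recompute with one more bit instead of shifting
        e -= 1
        check_exponent(e)
    return (sign, sig, e)

a = (0, 240, 16)   # 1.875 * 2^1 = 3.75
b = (0, 160, 16)   # 1.25  * 2^1 = 2.5
print(to_float(fp_div(a, b)))
```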
Precision
-- Guard bits
pad out the right end of S with extra 0 bits, so that bits shifted right
during alignment are not lost
-- Rounding
round to nearest
round toward positive infinity
round toward negative infinity
round toward 0 (truncation)
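The four modes can be compared on a significand that carries extra low-order bits (for example guard bits) which must be dropped. This is a sketch under stated assumptions: the function name and mode strings are illustrative, the significand is a sign-magnitude integer, and ties in "nearest" go to the even neighbour, which is the IEEE 754 default but is not spelled out in these notes.

```python
def round_significand(sign, sig, drop, mode):
    """Drop the `drop` low-order bits of the magnitude `sig` under `mode`."""
    kept = sig >> drop
    lost = sig & ((1 << drop) - 1)
    if lost == 0:
        return kept                    # exact, nothing to round
    half = 1 << (drop - 1)
    if mode == "nearest":              # ties go to the even neighbour
        up = lost > half or (lost == half and kept & 1 == 1)
    elif mode == "pos_inf":            # round up only if the number is positive
        up = sign == 0
    elif mode == "neg_inf":            # round up in magnitude only if negative
        up = sign == 1
    elif mode == "zero":               # truncate
        up = False
    else:
        raise ValueError(mode)
    return kept + 1 if up else kept

sig = 0b10110110                       # keep 5 bits, drop the low 3 (110)
for mode in ("nearest", "pos_inf", "neg_inf", "zero"):
    print(mode, bin(round_significand(0, sig, 3, mode)))
```

If rounding up carries out of the top bit, the result has to be renormalized (shift right, increment E), which this sketch leaves to the caller.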
-- Denormalized number
To handle E underflow, the result is denormalized by right-shifting
S and incrementing E until E is within the
representable range. This method is also referred to as "gradual underflow".
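Gradual underflow can be observed with ordinary Python floats, assuming they are IEEE doubles (true on common platforms). The sketch halves the smallest normalized double until it reaches zero; the function name count_halvings is illustrative.

```python
import math
import sys

smallest_normal = sys.float_info.min         # 2^-1022
smallest_denormal = math.ldexp(1.0, -1074)   # 2^-1074

def count_halvings(x: float) -> int:
    """Number of halvings needed before x becomes exactly 0.0."""
    n = 0
    while x > 0.0:
        x /= 2
        n += 1
    return n

# Below the smallest normalized number the result does not jump to zero:
# it passes through the denormalized numbers, losing one bit of precision
# per halving.
print(smallest_normal / 2 > 0.0)
print(count_halvings(smallest_normal))
```

Without denormalized numbers, the first halving would already have produced 0.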