95% of the Folks Out There Are Completely Clueless About Floating-Point

Quote of the day

“95% of the folks out there are completely clueless about floating-point.”

James Gosling
Sun Fellow
Java Inventor
CS 314 Chapter 3.1 CSE, 2016
Goals for Floating Point
• Standard arithmetic for reals for all computers
  - Like two’s complement
• Keep as much precision as possible in formats
• Help the programmer with errors in real arithmetic
  - +∞, -∞, Not-a-Number (NaN), exponent overflow, exponent underflow
• Keep an encoding that is somewhat compatible with two’s complement
  - E.g., 0 in floating point is 0 in two’s complement
  - Make it possible to sort without needing to do a floating-point comparison


Scientific Notation (e.g., Base 10)

• Normalized scientific notation (aka standard form or exponential notation)
  - r × E^i, where E is the base (usually 10), i is a positive or negative integer, and r is a real number with 1.0 ≤ r < 10
  - Normalized => no leading 0s
  - 61 is 6.10 × 10^2; 0.000061 is 6.10 × 10^-5


Scientific Notation (e.g., Base 10)
• (r × e^i) × (s × e^j) = (r × s) × e^(i+j)
    (1.999 × 10^2) × (5.5 × 10^3) = (1.999 × 5.5) × 10^5
                                  = 10.9945 × 10^5
                                  = 1.09945 × 10^6
• (r × e^i) / (s × e^j) = (r / s) × e^(i-j)
    (1.999 × 10^2) / (5.5 × 10^3) = 0.3634545… × 10^-1
                                  = 3.634545… × 10^-2
• For addition/subtraction, you must first align the exponents:
    (1.999 × 10^2) + (5.5 × 10^3)
    = (0.1999 × 10^3) + (5.5 × 10^3) = 5.6999 × 10^3


Floating Point:
Representing Very Small Numbers

• Zero: the bit pattern of all 0s is the encoding for 0.000
• But 0 in the exponent field should mean the most negative exponent (we want 0 to be next to the smallest real)
• Can’t use two’s complement (the most negative exponent would be 1000 0000_two)
• Bias notation: subtract the bias from the stored exponent
  - Single precision uses a bias of 127; double precision uses 1023
• 0 uses 0000 0000_two => 0 - 127 = -127;
  ∞ and NaN use 1111 1111_two => 255 - 127 = +128
• Smallest normalized SP real that can be represented: 1.00…00 × 2^-126
• Largest SP real that can be represented: 1.11…11 × 2^+127


Bias Notation (+127)
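[Figure: exponent values in bias-127 notation, comparing how each value is interpreted with how it is encoded; the largest encoding is reserved for ∞ and NaN]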


What About Real Numbers in Base 2?
r × E^i, where the base E is 2, i is a positive or negative integer, and r is a real number with 1.0 ≤ r < 2
• The computer’s version of normalized scientific notation is called floating-point notation


Floating Point Numbers
• A 32-bit word has 2^32 patterns, so it can only approximate the real numbers (significand ≥ 1.0, < 2)
• IEEE 754 Floating Point Standard:
  - 1 bit for the sign (s) of the floating-point number
  - 8 bits for the exponent (E)
  - 23 bits for the fraction (F)
    (get 1 extra bit of precision if the leading 1 is implicit)
    (-1)^s × (1 + F) × 2^E
  - Can represent magnitudes from about 1.2 × 10^-38 (smallest normalized) to about 3.4 × 10^38


Floating Point Numbers

• What about bigger or smaller numbers?
• IEEE 754 Floating Point Standard: Double Precision (64 bits)
  - 1 bit for the sign (s) of the floating-point number
  - 11 bits for the exponent (E)
  - 52 bits for the fraction (F)
    (get 1 extra bit of precision if the leading 1 is implicit)
    (-1)^s × (1 + F) × 2^E
  - Can represent magnitudes from about 2.2 × 10^-308 (smallest normalized) to about 1.8 × 10^308
  - The 32-bit format is called Single Precision


Representing Big (and Small) Numbers
• What if we want to encode the approximate age of the earth?
    4,600,000,000 or 4.6 × 10^9
  or the mass in kg of one a.m.u. (atomic mass unit)?
    0.00000000000000000000000000166 or 1.66 × 10^-27
  There is no way we can encode either of the above in a 32-bit integer.
• Floating-point representation: (-1)^sign × F × 2^E
  - Still have to fit everything in 32 bits (single precision)

      s       E (exponent)   F (fraction)
      1 bit   8 bits         23 bits

  - The base (2, not 10) is hardwired in the design of the FPALU
  - More bits in the fraction (F) or the exponent (E) is a trade-off between precision (accuracy of the number) and range (size of the number)
Exception Events in Floating Point
• Overflow (floating point) happens when a positive exponent becomes too large to fit in the exponent field
• Underflow (floating point) happens when a negative exponent becomes too large (in magnitude) to fit in the exponent field

  [Figure: number line from -∞ to +∞ with four marked points: +largestE -largestF, -largestE -smallestF, -largestE +smallestF, +largestE +largestF]

• One way to reduce the chance of underflow or overflow is to offer another format that has a larger exponent field
  - Double precision takes two MIPS words

      s       E (exponent)   F (fraction)
      1 bit   11 bits        20 bits
      F (fraction continued)
      32 bits
“Father” of the Floating point standard

IEEE Standard 754 for Binary Floating-Point Arithmetic.

Prof. Kahan, ACM Turing Award winner
IEEE 754 FP Standard
• Most (all?) computers these days conform to the IEEE 754 floating-point standard: (-1)^sign × (1 + F) × 2^(E - bias)
  - Formats for both single and double precision
  - F is stored in normalized format, where the msb of the significand is always 1 (so there is no need to store it!); this is called the hidden bit
  - To simplify sorting FP numbers, E comes before F in the word, and E is represented in excess (biased) notation where the bias is 127 (1023 for double precision), so the most negative exponent is 00000001 = 2^(1-127) = 2^-126 and the most positive is 11111110 = 2^(254-127) = 2^+127
• Examples (in normalized format)
  - Smallest+: 0 00000001 1.00000000000000000000000 = 1 × 2^(1-127)
  - Zero:      0 00000000 00000000000000000000000 = true 0
  - Largest+:  0 11111110 1.11111111111111111111111 = (2 - 2^-23) × 2^(254-127)
  - 1.0_two × 2^-1 = 0 01111110 1.00000000000000000000000
  - 0.75_ten × 2^4 = 0 10000010 1.10000000000000000000000
Ex: Converting Binary FP to Decimal
BEE00000H is the hex representation of an IEEE 754 SP FP number

1 0111 1101 110 0000 0000 0000 0000 0000

(-1)^S × (1 + Significand) × 2^(Exponent - 127)
• Sign: 1 => negative
• Exponent: 0111 1101_two = 125_ten
• Bias adjustment: 125 - 127 = -2
• Significand:
    1 + 1×2^-1 + 1×2^-2 + 0×2^-3 + 0×2^-4 + 0×2^-5 + ...
    = 1 + 2^-1 + 2^-2 = 1 + 0.5 + 0.25 = 1.75
• Represents: -1.75_ten × 2^-2 = -0.4375 (= -4.375 × 10^-1)
Ex: Converting Decimal to FP
-1.275 × 10^1
1. Denormalize: -12.75
2. Convert integer part:
   12 = 8 + 4 = 1100_two
3. Convert fractional part:
   .75 = .5 + .25 = .11_two
4. Put parts together and normalize:
   1100.11 = 1.10011 × 2^3
5. Convert exponent: 127 + 3 = 128 + 2 = 1000 0010_two

1 1000 0010 100 1100 0000 0000 0000 0000

The hex representation is C14C0000H
Representation for 0

How to represent 0?
exponent: all zeros
significand: all zeros
What about sign? Both cases valid.
+0: 0 00000000 00000000000000000000000
-0: 1 00000000 00000000000000000000000


Representation for +∞/-∞ (∞: infinity)

How to represent +∞/-∞?

• Exponent : all ones (11111111B = 255)
• Significand: all zeros
+∞ : 0 11111111 00000000000000000000000
-∞ : 1 11111111 00000000000000000000000
5 / 0 = +∞, -5 / 0 = -∞
5+(+∞) = +∞, (+∞)+(+∞) = +∞
5 - (+∞) = -∞, (-∞) - (+∞) = -∞ etc


Representation for “Not a Number”
sqrt(-4.0) = ?   0/0 = ?
• Called Not a Number (NaN)
How to represent NaN:
  Exponent = 255
  Significand: nonzero
NaNs can help with debugging:
  sqrt(-4.0) = NaN      0/0 = NaN
  op(NaN, x) = NaN      +∞ + (-∞) = NaN
  +∞ - (+∞) = NaN       ∞/∞ = NaN
Representation for Denorms (Denormalized Numbers)

What have we defined so far? (for SP)

Exponent   Significand   Object
0          0             +/- 0
0          nonzero       Denorms
1-254      anything      Norms (implicit leading 1)
255        0             +/- infinity
255        nonzero       NaN


Group Discussion 1: Questions about IEEE 754
Groups of four students discuss the following:
• What about the following type conversions (i is an int, f is a float): will each test print “true”?
    if ( i == (int) ((float) i) ) {
        printf("true");
    }
    if ( f == (float) ((int) f) ) {
        printf("true");
    }
Question II about IEEE 754

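• Is FP addition associative? Does (x+y)+z = x+(y+z)?

x = -1.5 × 10^38, y = 1.5 × 10^38, z = 1.0
(x+y)+z = (-1.5 × 10^38 + 1.5 × 10^38) + 1.0 = 1.0
x+(y+z) = -1.5 × 10^38 + (1.5 × 10^38 + 1.0) = 0.0
No: the two groupings give different results, so FP addition is not associative.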


IEEE 754 FP Standard Encoding
• Special encodings are used to represent unusual events
  - ± infinity for division by zero
  - NaN (not a number) for the results of invalid operations such as 0/0
  - True zero is the bit string all zero

       Single Precision                     Double Precision
E (8)                    F (23)     E (11)                       F (52)     Object represented
0000 0000                0          0000 … 0000                  0          true zero (0)
0000 0000                nonzero    0000 … 0000                  nonzero    ± denormalized number
0000 0001 to 1111 1110   anything   0000 … 0001 to 1111 … 1110   anything   ± floating-point number
 (exponent -126 to +127)             (exponent -1022 to +1023)
1111 1111                0          1111 … 1111                  0          ± infinity
1111 1111                nonzero    1111 … 1111                  nonzero    not a number (NaN)


Support for Accurate Arithmetic
• IEEE 754 FP rounding modes
  - Always round up (toward +∞)
  - Always round down (toward -∞)
  - Truncate (toward 0)
  - Round to nearest; on a tie (Guard, Round, Sticky = 100), round to even, which always leaves a 0 in the least significant (kept) bit of F
• Rounding (except for truncation) requires the hardware to include extra F bits during calculations
  - Guard bit: used to provide one F bit when shifting left to normalize a result (e.g., when normalizing F after division or subtraction)
  - Round bit: used to improve rounding accuracy
  - Sticky bit: used to support round to nearest even; is set to 1 whenever a 1 bit shifts (right) through it (e.g., when aligning F during addition/subtraction)
    F = 1 . xxxxxxxxxxxxxxxxxxxxxxx G R S
Floating Point Addition
• Addition (and subtraction)
    (±F1 × 2^E1) + (±F2 × 2^E2) = ±F3 × 2^E3
  - Step 0: Restore the hidden bit in F1 and in F2
  - Step 1: Align fractions by right shifting F2 by E1 - E2 positions (assuming E1 ≥ E2), keeping track of (three of) the bits shifted out in G, R, and S
  - Step 2: Add the resulting F2 to F1 to form F3
  - Step 3: Normalize F3 (so it is in the form 1.XXXXX…)
      - If F1 and F2 have the same sign => F3 ∈ [1,4) => 1-bit right shift of F3 and increment E3 (check for overflow)
      - If F1 and F2 have different signs => F3 may require many left shifts, each time decrementing E3 (check for underflow)
  - Step 4: Round F3 and possibly normalize F3 again
  - Step 5: Rehide the most significant bit of F3 before storing the result


Floating Point Addition Example
• Add
    (0.5 = 1.0000 × 2^-1) + (-0.4375 = -1.1100 × 2^-2)
  - Step 0: Hidden bits restored in the representation above
  - Step 1: Shift the significand with the smaller exponent (1.1100) right until its exponent matches the larger exponent (so once)
  - Step 2: Add significands
      1.0000 + (-0.111) = 1.0000 - 0.111 = 0.001
  - Step 3: Normalize the sum, checking for exponent over/underflow
      0.001 × 2^-1 = 0.010 × 2^-2 = … = 1.000 × 2^-4
  - Step 4: The sum is already rounded, so we’re done
  - Step 5: Rehide the hidden bit before storing


• Given A = 2.6125 × 10^1 and B = 4.150390625 × 10^-1, calculate the sum of A and B by hand, assuming A and B are stored in the following format. Assume 1 guard bit, 1 round bit, and 1 sticky bit, and round to the nearest even. Show all the steps.

    Sign     Exponent   Fraction
    1 bit    5 bits     10 bits


• Solution:
    2.6125 × 10^1 + 4.150390625 × 10^-1
    2.6125 × 10^1 = 26.125 = 11010.001 = 1.1010001000 × 2^4
    4.150390625 × 10^-1 = .4150390625 = .0110101001 = 1.1010100100 × 2^-2
    Shift the binary point 6 to the left to align exponents:
        1.1010001000 00
      + 0.0000011010 10 01   (Guard = 1, Round = 0, Sticky = 1)
        1.1010100010 10      (Sticky = 1)
    The extra bits (G, R, S = 101) are more than half of the least significant bit, so the value is rounded up:
    1.1010100011 × 2^4 = 11010.100011 × 2^0 = 26.546875
                       = 2.6546875 × 10^1


Floating Point Multiplication
• Multiplication
    (±F1 × 2^E1) × (±F2 × 2^E2) = ±F3 × 2^E3
  - Step 0: Restore the hidden bit in F1 and in F2
  - Step 1: Add the two (biased) exponents and subtract the bias from the sum, so E1 + E2 - 127 = E3; also determine the sign of the product (which depends on the signs of the operands (most significant bits))
  - Step 2: Multiply F1 by F2 to form a double precision F3
  - Step 3: Normalize F3 (so it is in the form 1.XXXXX…)
      - Since F1 and F2 come in normalized => F3 ∈ [1,4) => 1-bit right shift of F3 and increment E3
      - Check for overflow/underflow
  - Step 4: Round F3 and possibly normalize F3 again
  - Step 5: Rehide the most significant bit of F3 before storing the result
Floating Point Multiplication Example
• Multiply
    (0.5 = 1.0000 × 2^-1) × (-0.4375 = -1.1100 × 2^-2)
  - Step 0: Hidden bits restored in the representation above
  - Step 1: Add the exponents: unbiased, -1 + (-2) = -3; biased, (-1 + 127) + (-2 + 127) - 127 = (-1 - 2) + (127 + 127 - 127) = -3 + 127 = 124. The operand signs differ, so the product is negative.
  - Step 2: Multiply the significands
      1.0000 × 1.110 = 1.110000
  - Step 3: Normalize the product, checking for exponent over/underflow
      1.110000 × 2^-3 is already normalized
  - Step 4: The product is already rounded, so we’re done
  - Step 5: Rehide the hidden bit before storing


MIPS Floating Point Instructions
• MIPS has a separate floating-point register file ($f0, $f1, …, $f31) (whose registers are used in pairs for double-precision values) with special instructions to load to and store from them
    lwc1  $f1,54($s2)     # $f1 = Memory[$s2+54]
    swc1  $f1,58($s4)     # Memory[$s4+58] = $f1
• And supports IEEE 754 single-precision operations
    add.s $f2,$f4,$f6     # $f2 = $f4 + $f6
  and double-precision operations
    add.d $f2,$f4,$f6     # $f2||$f3 = $f4||$f5 + $f6||$f7
  similarly for sub.s, sub.d, mul.s, mul.d, div.s, div.d
MIPS Floating Point Instructions, Con’t
• And floating-point single-precision comparison operations
    c.x.s $f2,$f4         # if ($f2 < $f4) cond=1; else cond=0   (shown for x = lt)
  where x may be eq, neq, lt, le, gt, ge
  and double-precision comparison operations
    c.x.d $f2,$f4         # if ($f2||$f3 < $f4||$f5) cond=1; else cond=0
• And floating-point branch operations
    bc1t  25              # if (cond==1) go to PC+4+25
    bc1f  25              # if (cond==0) go to PC+4+25
Frequency of Common MIPS Instructions
• Only included those with >3% and >1%

Instr.   Integer   Fl. Pt.      Instr.   Integer   Fl. Pt.
addu      5.2%      3.5%        add.d     0.0%     10.6%
addiu     9.0%      7.2%        sub.d     0.0%      4.9%
or        4.0%      1.2%        mul.d     0.0%     15.0%
sll       4.4%      1.9%        add.s     0.0%      1.5%
lui       3.3%      0.5%        sub.s     0.0%      1.8%
lw       18.6%      5.8%        mul.s     0.0%      2.4%
sw        7.6%      2.0%        l.d       0.0%     17.5%
lbu       3.7%      0.1%        s.d       0.0%      4.9%
beq       8.6%      2.2%        l.s       0.0%      4.2%
bne       8.4%      1.4%        s.s       0.0%      1.1%
slt       9.9%      2.3%        lhu       1.3%      0.0%
slti      3.1%      0.3%
sltu      3.4%      0.8%
Assignment III
 3.6, 3.8, 3.11, 3.14
 Coding Assignment
 Objective: Understanding the applications of IEEE 754 floating points in real-
world machine
 Task 1: In your machine, what is the accuracy for single precision and
double precision (or the number of bits required for single/double precision
floating)? Please use a simple program to demonstrate it.
 Task 2: Run a program to obtain the results of “-8.0/0”and“sqrt(-4.0)”in
your machine.
 Reports:
 1. Submit your codes and execution results by printing your screen.
 2. Answer the following questions:
 1)What are the accuracy of float and double in your machine.
 2)How to represent infinite and NAN in your machine.

 Due: Nov. 17

