Floating-point
The most important floating-point representation is the one defined in IEEE 754 (ANSI/IEEE Std 754-1985).
This standard was developed to facilitate the portability of programs from one processor to another and to encourage the development of sophisticated numerical programs.
It has been widely adopted and is used in practically all current processors and arithmetic coprocessors.
The standard defines two basic floating-point formats:
- Single format (32 bit)
- Double format (64 bit)
Single format (32 bit)
The single format stores a number in one word (32 bits) with the following layout:
Sign Bit | Exponent | Mantissa |
1 bit | 8 bits | 23 bits |
Let E_1=\text{exponent} + 127 be the stored exponent (127 is the bias, which is also the largest exponent of a normal number). Then:
00000000_{(2}\leq E_1\leq 11111111_{(2}
If we denote by (E_1)_{(10} the decimal value of E_1, we have:
0\leq (E_1)_{(10}\leq 2^7+2^6+2^5+2^4+2^3+2^2+2^1+2^0=255
Or, equivalently:
-127\leq (E_1)_{(10} -127\leq 128
If we denote \alpha=(E_1)_{(10} -127, we have:
-127\leq\alpha\leq 128
(E_1)=\alpha+127
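As an illustration of the layout and of the relation between E_1 and \alpha, here is a minimal Python sketch. It assumes the standard `struct` module is used to reinterpret the bytes of a 32-bit float; the function name `split_single` is hypothetical and chosen only for this example.

```python
import struct

def split_single(x):
    """Return (s, E1, M) for x stored in the 32-bit single format."""
    # Pack x as a big-endian 32-bit float, then reread the same 4 bytes
    # as an unsigned integer so the individual bits can be inspected.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31             # 1 sign bit
    e1 = (bits >> 23) & 0xFF   # 8 exponent bits (stored, biased value)
    m = bits & 0x7FFFFF        # 23 mantissa bits
    return s, e1, m

s, e1, m = split_single(-6.5)
alpha = e1 - 127               # remove the bias to recover the exponent
print(s, format(e1, "08b"), format(m, "023b"), alpha)
```

For -6.5 = -1.101_{(2} \cdot 2^2 the sign bit is 1, the stored exponent is 129 and \alpha is 2.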
We denote by s the sign bit and by M the mantissa. Depending on the value of \alpha, three cases arise:
- Normal numbers
-127 < \alpha < 128
Conversion to normal number: (-1)^s \cdot 1.M \cdot 2^\alpha
- Subnormal numbers
\alpha = -127
Conversion to subnormal number: (-1)^s \cdot 0.M \cdot 2^{-126} (M = 0 gives +0 or -0)
- Special values
\alpha = 128
Gives rise to the special values +\infty and -\infty (when M = 0) and NaN, not a number (when M \neq 0)
A hidden bit is used in the mantissa: the leading binary digit of a normal number is always a_1 = 1, so it does not need to be stored. This frees one bit of the word and therefore allows more numbers to be represented.
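The cases described above (normal, subnormal, and the exponent reserved for infinities and NaN) can be sketched as a small decoder. This is only an illustrative sketch under the assumption that the 32 bits are given as a Python integer; `decode_single` is a hypothetical name, not part of any library.

```python
import math

def decode_single(bits):
    """Decode a 32-bit pattern (given as an int) into a Python float."""
    s = bits >> 31
    e1 = (bits >> 23) & 0xFF
    m = bits & 0x7FFFFF
    sign = -1.0 if s else 1.0          # (-1)^s
    alpha = e1 - 127
    if alpha == 128:                   # reserved exponent: infinity or NaN
        return sign * math.inf if m == 0 else math.nan
    if alpha == -127:                  # subnormal (or zero): no hidden bit
        return sign * (m / 2**23) * 2.0**-126
    return sign * (1 + m / 2**23) * 2.0**alpha   # normal: hidden bit is 1

print(decode_single(0x3F800000))   # pattern 0 01111111 000...0
print(decode_single(0x00000001))   # smallest positive subnormal
print(decode_single(0xFF800000))   # pattern 1 11111111 000...0
```

The division by 2^{23} turns the 23 stored bits into the fractional part that follows the binary point.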
Number | Decimal | Binary |
\tiny \text{Largest positive normal number} | \tiny 3.40282347 \cdot 10^{38} | \tiny\text{0 11111110 11111111111111111111111} |
\tiny \text{Most negative normal number} | \tiny -3.40282347 \cdot 10^{38} | \tiny \text{1 11111110 11111111111111111111111} |
\tiny \text{Smallest positive normal number} | \tiny 1.17549435 \cdot 10^{-38} | \tiny \text{0 00000001 00000000000000000000000} |
\tiny \text{Negative normal number closest to zero} | \tiny -1.17549435 \cdot 10^{-38} | \tiny \text{1 00000001 00000000000000000000000} |
\tiny \text{Largest positive subnormal number} | \tiny 1.17549421 \cdot 10^{-38} | \tiny \text{0 00000000 11111111111111111111111} |
\tiny \text{Most negative subnormal number} | \tiny -1.17549421 \cdot 10^{-38} | \tiny \text{1 00000000 11111111111111111111111} |
\tiny \text{Smallest positive subnormal number} | \tiny 1.40129846 \cdot 10^{-45} | \tiny \text{0 00000000 00000000000000000000001} |
\tiny \text{Negative subnormal number closest to zero} | \tiny -1.40129846 \cdot 10^{-45} | \tiny \text{1 00000000 00000000000000000000001} |
\tiny \text{+0} | \tiny 0.0 | \tiny \text{0 00000000 00000000000000000000000} |
\tiny \text{-0} | \tiny -0.0 | \tiny \text{1 00000000 00000000000000000000000} |
\tiny +\infty | \tiny +\infty | \tiny \text{0 11111111 00000000000000000000000} |
\tiny -\infty | \tiny -\infty | \tiny \text{1 11111111 00000000000000000000000} |
\tiny \text{NaN} | \tiny \text{NaN} | \tiny \text{(0 or 1) 11111111 (mantissa not all zeros)} |
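The rows of the table can be checked by feeding the bit patterns back through the hardware format. The following sketch assumes Python's `struct` module; the helper `single_from_bits` and the dictionary `rows` are names introduced only for this example.

```python
import struct

def single_from_bits(text):
    """Interpret a string such as '0 11111110 111...1' as a 32-bit single."""
    bits = int(text.replace(" ", ""), 2)
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

rows = {
    "largest normal":     "0 11111110 11111111111111111111111",
    "smallest normal":    "0 00000001 00000000000000000000000",
    "largest subnormal":  "0 00000000 11111111111111111111111",
    "smallest subnormal": "0 00000000 00000000000000000000001",
    "+infinity":          "0 11111111 00000000000000000000000",
}
for name, pattern in rows.items():
    # Nine significant digits, matching the precision used in the table.
    print(f"{name:20s} {single_from_bits(pattern):.8e}")
```

The printed values should agree with the decimal column of the table up to rounding in the last digit.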
Conversions
Example: converting 3737 to single format (32 bit)
The number is positive, so the sign bit is 0
Division | Quotient | Remainder |
\frac{3737}{2} | 1868 | 1 |
\frac{1868}{2} | 934 | 0 |
\frac{934}{2} | 467 | 0 |
\frac{467}{2} | 233 | 1 |
\frac{233}{2} | 116 | 1 |
\frac{116}{2} | 58 | 0 |
\frac{58}{2} | 29 | 0 |
\frac{29}{2} | 14 | 1 |
\frac{14}{2} | 7 | 0 |
\frac{7}{2} | 3 | 1 |
\frac{3}{2} | 1 | 1 |
Reading the last quotient followed by the remainders from bottom to top, the binary representation is:
3737_{(10} = 111010011001_{(2}
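The repeated division by 2 shown in the table can be sketched in a few lines of Python. The function name `to_binary` is hypothetical. Unlike the table, which stops when the quotient reaches 1, this version performs one more division (1 / 2 = 0, remainder 1) so that every digit comes from a remainder.

```python
def to_binary(n):
    """Convert a non-negative integer to a binary string by repeated division."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)        # quotient and remainder of n / 2
        remainders.append(str(r))
    # The first remainder is the least significant digit, so reverse the list.
    return "".join(reversed(remainders))

print(to_binary(3737))
```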
For the exponent, the binary point must be moved 11 places to the left: the number has 12 binary digits, and the leading one is the hidden bit, which is not counted. Therefore the stored exponent is 127 + 11 = 138:
Division | Quotient | Remainder |
\frac{138}{2} | 69 | 0 |
\frac{69}{2} | 34 | 1 |
\frac{34}{2} | 17 | 0 |
\frac{17}{2} | 8 | 1 |
\frac{8}{2} | 4 | 0 |
\frac{4}{2} | 2 | 0 |
\frac{2}{2} | 1 | 0 |
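So the exponent is:
138_{(10} = 10001010_{(2}
The mantissa is the binary part without its leading 1, padded with zeros on the right to 23 bits. So we have:
0 10001010 11010011001000000000000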
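The whole conversion of this example can be reproduced with a short sketch and compared against the bits that the machine produces. It assumes a positive integer small enough to fit in 24 significant bits, so no rounding is needed; `encode_single_int` is a hypothetical helper written for this example only.

```python
import struct

def encode_single_int(n):
    """Return (sign, exponent, mantissa) bit strings for a positive integer n."""
    if not 0 < n < 2**24:
        raise ValueError("sketch only handles 0 < n < 2**24 (no rounding)")
    digits = format(n, "b")                # binary digits of n
    alpha = len(digits) - 1                # places the binary point moves
    exponent = format(alpha + 127, "08b")  # add the bias, write 8 bits
    mantissa = digits[1:].ljust(23, "0")   # drop the hidden bit, pad to 23 bits
    return "0", exponent, mantissa

fields = encode_single_int(3737)
print(" ".join(fields))

# Cross-check against the 32 bits produced by packing 3737.0 as a single.
(reference,) = struct.unpack(">I", struct.pack(">f", 3737.0))
print(format(reference, "032b") == "".join(fields))
```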
Double format (64 bit)
The double format stores a number in two words (64 bits) with the following layout:
Sign Bit | Exponent | Mantissa |
1 bit | 11 bits | 52 bits |
Let E_1=\text{exponent} + 1023 be the stored exponent (1023 is the bias, which is also the largest exponent of a normal number). Then:
00000000000_{(2}\leq E_1\leq 11111111111_{(2}
If we denote by (E_1)_{(10} the decimal value of E_1, we have:
0\leq (E_1)_{(10}\leq 2^{10}+2^9+2^8+2^7+2^6+2^5+2^4+2^3+2^2+2^1+2^0=2047
Or, equivalently:
-1023\leq (E_1)_{(10} -1023\leq 1024
If we denote \alpha=(E_1)_{(10} -1023, we have:
-1023\leq\alpha\leq 1024
(E_1)=\alpha+1023
We denote by s the sign bit and by M the mantissa. Depending on the value of \alpha, three cases arise:
- Normal numbers
-1023 < \alpha < 1024
Conversion to normal number: (-1)^s \cdot 1.M \cdot 2^\alpha
- Subnormal numbers
\alpha = -1023
Conversion to subnormal number: (-1)^s \cdot 0.M \cdot 2^{-1022} (M = 0 gives +0 or -0)
- Special values
\alpha = 1024
Gives rise to the special values +\infty and -\infty (when M = 0) and NaN, not a number (when M \neq 0)
A hidden bit is used in the mantissa: the leading binary digit of a normal number is always a_1 = 1, so it does not need to be stored. This frees one bit and therefore allows more numbers to be represented.
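For the double format, the same case analysis on \alpha can be used to classify a value rather than decode it. The sketch below assumes that Python's `float` is stored in this 64-bit format, which holds on common platforms; `classify_double` and its return labels are hypothetical names for this example, and zero is reported separately although it falls in the \alpha = -1023 case.

```python
import math
import struct

def classify_double(x):
    """Classify x by looking at its 11 exponent bits and 52 mantissa bits."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    e1 = (bits >> 52) & 0x7FF          # 11 exponent bits (stored value)
    m = bits & ((1 << 52) - 1)         # 52 mantissa bits
    alpha = e1 - 1023
    if alpha == 1024:                  # reserved exponent
        return "infinity" if m == 0 else "NaN"
    if alpha == -1023:                 # no hidden bit
        return "zero" if m == 0 else "subnormal"
    return "normal"

for value in (3737.0, 2.0**-1074, -0.0, -math.inf, math.nan):
    print(value, classify_double(value))
```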
Number | Decimal | Binary |
\tiny \text{Largest positive}\\ \text{normal number} | \tiny 1.7976931 \cdot 10^{308} | \tiny \text{0 11111111110 1111111111111111111111111111111111111111111111111111} |
\tiny \text{Most negative}\\ \text{normal number} | \tiny -1.7976931 \cdot 10^{308} | \tiny \text{1 11111111110 1111111111111111111111111111111111111111111111111111} |
\tiny \text{Smallest positive}\\ \text{normal number} | \tiny 2.2250738 \cdot 10^{-308} | \tiny \text{0 00000000001 0000000000000000000000000000000000000000000000000000} |
\tiny \text{Negative normal number}\\ \text{closest to zero} | \tiny -2.2250738 \cdot 10^{-308} | \tiny \text{1 00000000001 0000000000000000000000000000000000000000000000000000} |
\tiny \text{Largest positive}\\ \text{subnormal number} | \tiny 2.2250738 \cdot 10^{-308} | \tiny \text{0 00000000000 1111111111111111111111111111111111111111111111111111} |
\tiny \text{Most negative}\\ \text{subnormal number} | \tiny -2.2250738 \cdot 10^{-308} | \tiny \text{1 00000000000 1111111111111111111111111111111111111111111111111111} |
\tiny \text{Smallest positive}\\ \text{subnormal number} | \tiny 4.9406564 \cdot 10^{-324} | \tiny \text{0 00000000000 0000000000000000000000000000000000000000000000000001} |
\tiny \text{Negative subnormal number}\\ \text{closest to zero} | \tiny -4.9406564 \cdot 10^{-324} | \tiny \text{1 00000000000 0000000000000000000000000000000000000000000000000001} |
\tiny +0 | \tiny 0.0 | \tiny \text{0 00000000000 0000000000000000000000000000000000000000000000000000} |
\tiny -0 | \tiny -0.0 | \tiny \text{1 00000000000 0000000000000000000000000000000000000000000000000000} |
\tiny +\infty | \tiny +\infty | \tiny \text{0 11111111111 0000000000000000000000000000000000000000000000000000} |
\tiny -\infty | \tiny -\infty | \tiny \text{1 11111111111 0000000000000000000000000000000000000000000000000000} |
\tiny \text{NaN} | \tiny \text{NaN} | \tiny \text{(0 or 1) 11111111111 (mantissa not all zeros)} |
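The extreme values in this table can be compared with the constants that Python itself reports in `sys.float_info`. This is a sketch that assumes Python's `float` uses the 64-bit format described here; `double_from_bits` is a hypothetical helper. The largest and smallest normal numbers share the same eight leading decimal digits with their subnormal neighbours, so the comparison is done on exact values.

```python
import struct
import sys

def double_from_bits(text):
    """Interpret a string 'sign exponent mantissa' of 64 bits as a double."""
    bits = int(text.replace(" ", ""), 2)
    (value,) = struct.unpack(">d", struct.pack(">Q", bits))
    return value

largest = double_from_bits("0 " + "1" * 10 + "0 " + "1" * 52)
smallest_normal = double_from_bits("0 " + "0" * 10 + "1 " + "0" * 52)
largest_subnormal = double_from_bits("0 " + "0" * 11 + " " + "1" * 52)
smallest_subnormal = double_from_bits("0 " + "0" * 11 + " " + "0" * 51 + "1")

print(largest == sys.float_info.max)           # largest normal number
print(smallest_normal == sys.float_info.min)   # smallest positive normal number
print(smallest_subnormal == 2.0**-1074)        # one unit in the last mantissa bit
print(largest_subnormal < smallest_normal)     # subnormals lie below the normals
```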
Conversions
Example: converting 3737 to double format (64 bit)
The number is positive, so the sign bit is 0
Division | Quotient | Remainder |
\frac{3737}{2} | 1868 | 1 |
\frac{1868}{2} | 934 | 0 |
\frac{934}{2} | 467 | 0 |
\frac{467}{2} | 233 | 1 |
\frac{233}{2} | 116 | 1 |
\frac{116}{2} | 58 | 0 |
\frac{58}{2} | 29 | 0 |
\frac{29}{2} | 14 | 1 |
\frac{14}{2} | 7 | 0 |
\frac{7}{2} | 3 | 1 |
\frac{3}{2} | 1 | 1 |
Reading the last quotient followed by the remainders from bottom to top, the binary representation is:
3737_{(10} = 111010011001_{(2}
For the exponent, the binary point must be moved 11 places to the left: the number has 12 binary digits, and the leading one is the hidden bit, which is not counted. Therefore the stored exponent is 1023 + 11 = 1034:
Division | Quotient | Remainder |
\frac{1034}{2} | 517 | 0 |
\frac{517}{2} | 258 | 1 |
\frac{258}{2} | 129 | 0 |
\frac{129}{2} | 64 | 1 |
\frac{64}{2} | 32 | 0 |
\frac{32}{2} | 16 | 0 |
\frac{16}{2} | 8 | 0 |
\frac{8}{2} | 4 | 0 |
\frac{4}{2} | 2 | 0 |
\frac{2}{2} | 1 | 0 |
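So the exponent is:
1034_{(10} = 10000001010_{(2}
The mantissa is the binary part without its leading 1, padded with zeros on the right to 52 bits. So we have:
0 10000001010 1101001100100000000000000000000000000000000000000000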
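The result of this example can be cross-checked from Python, again assuming that `float` is stored in the 64-bit double format. The built-in `float.hex` method shows the hidden bit, the mantissa in hexadecimal and the unbiased exponent, while `struct` exposes the raw 64 bits. The variable names are illustrative only.

```python
import struct

x = 3737.0
print(x.hex())   # hidden bit, hexadecimal mantissa, and exponent alpha

# Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
(bits,) = struct.unpack(">Q", struct.pack(">d", x))
text = format(bits, "064b")
sign, exponent, mantissa = text[0], text[1:12], text[12:]

print(sign, exponent, mantissa)
print(int(exponent, 2), int(exponent, 2) - 1023)   # stored exponent and alpha
```

The stored exponent printed at the end should be 1034, and subtracting the bias gives the 11 places computed above.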