Floating-point

The most important floating-point representation is the one defined in IEEE 754 (ANSI/IEEE Std 754-1985).

This standard was developed to make programs easier to port from one processor to another and to encourage the development of sophisticated numerical programs.

It has been widely adopted and is used in practically all current processors and arithmetic coprocessors.

The standard defines two basic floating-point formats:

  • Single format (32 bit)
  • Double format (64 bit)

Single format (32 bit)

The single format stores a number in one word (32 bits), laid out as follows:

Sign     Exponent   Mantissa
1 bit    8 bits     23 bits
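
As a rough illustration of this layout, the sketch below splits a number into the three fields of the 32-bit format. It assumes Python's standard `struct` module; the function name `split_single` is hypothetical and not part of any standard.

```python
import struct

def split_single(x: float) -> tuple:
    """Return (sign, stored exponent, mantissa) of x rounded to the 32-bit format."""
    # Pack as a big-endian 32-bit float, then reinterpret the 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits
    mantissa = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, mantissa

sign, exponent, mantissa = split_single(1.0)
print(sign, format(exponent, "08b"), format(mantissa, "023b"))
# 0 01111111 00000000000000000000000
```

The shifts and masks follow directly from the field widths 1, 8 and 23.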

Let E_1=\text{exponent} + 127 be the stored exponent, where 127 is the bias (and also the largest exponent of a normal number). Since E_1 occupies 8 bits:

00000000_{(2}\leq E_1\leq 11111111_{(2}

If we denote by (E_1)_{(10} the decimal value of E_1, then:

0\leq (E_1)_{(10}\leq 2^7+2^6+2^5+2^4+2^3+2^2+2^1+2^0=255

Equivalently:

-127\leq (E_1)_{(10} -127\leq 128

If we denote \alpha=(E_1)_{(10} -127, then:

-127\leq\alpha\leq 128
(E_1)_{(10}=\alpha+127

Let s denote the sign bit and M the mantissa. Three cases arise:

  • Normal numbers
    -127 < \alpha < 128

    Value of a normal number: (-1)^s \cdot 1.M \cdot 2^\alpha

  • Subnormal numbers
    \alpha = -127

    Value of a subnormal number: (-1)^s \cdot 0.M \cdot 2^{-126} (M = 0 gives \pm 0)

  • Special values
    \alpha = 128

    Represents +\infty and -\infty (when M = 0) and NaN, not a number (when M \neq 0)

A hidden bit is used in the mantissa: the leading digit of a normal number is always a_1 = 1, so it does not need to be stored. This frees one bit and therefore gives one extra bit of precision.
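
The three cases above can be sketched as a small decoder. This is a minimal illustration, not a reference implementation: the name `decode_single` and its argument names are assumptions, and the result is returned as an ordinary Python float.

```python
import math

def decode_single(sign: int, e1: int, m: int) -> float:
    """Value of a 32-bit number from its sign bit, stored exponent E_1 and 23-bit mantissa M."""
    alpha = e1 - 127
    fraction = m / 2**23              # the bits of M read as 0.M
    if alpha == 128:                  # E_1 = 255: special values
        if m != 0:
            return math.nan
        return -math.inf if sign else math.inf
    if alpha == -127:                 # E_1 = 0: subnormal (or zero when M = 0)
        return (-1) ** sign * fraction * 2.0 ** -126
    # Normal number: the hidden bit supplies the leading 1 of 1.M.
    return (-1) ** sign * (1 + fraction) * 2.0 ** alpha

print(decode_single(1, 128, 1 << 22))  # -3.0
```

In the usage line, M = 100...0 reads as 0.5, so the value is -(1 + 0.5) \cdot 2^1 = -3.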

Number (extremes by magnitude)   Decimal                       Binary
Maximum normal (positive)        3.40282347 \cdot 10^{38}      0 11111110 11111111111111111111111
Maximum normal (negative)        -3.40282347 \cdot 10^{38}     1 11111110 11111111111111111111111
Minimum normal (positive)        1.17549435 \cdot 10^{-38}     0 00000001 00000000000000000000000
Minimum normal (negative)        -1.17549435 \cdot 10^{-38}    1 00000001 00000000000000000000000
Maximum subnormal (positive)     1.17549421 \cdot 10^{-38}     0 00000000 11111111111111111111111
Maximum subnormal (negative)     -1.17549421 \cdot 10^{-38}    1 00000000 11111111111111111111111
Minimum subnormal (positive)     1.40129846 \cdot 10^{-45}     0 00000000 00000000000000000000001
Minimum subnormal (negative)     -1.40129846 \cdot 10^{-45}    1 00000000 00000000000000000000001
+0                               0.0                           0 00000000 00000000000000000000000
-0                               -0.0                          1 00000000 00000000000000000000000
+\infty                          +\infty                       0 11111111 00000000000000000000000
-\infty                          -\infty                       1 11111111 00000000000000000000000
NaN                              NaN                           (0 or 1) 11111111 (any nonzero pattern)
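
One way to check entries of this table is to let the machine interpret the bit patterns. The sketch below assumes Python's `struct` module; the helper name `single_from_bits` and the `rows` dictionary are hypothetical.

```python
import struct

def single_from_bits(pattern: str) -> float:
    """Interpret a 'sign exponent mantissa' bit string as a 32-bit floating-point number."""
    bits = int(pattern.replace(" ", ""), 2)
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

rows = {
    "maximum normal":    "0 11111110 11111111111111111111111",
    "minimum normal":    "0 00000001 00000000000000000000000",
    "maximum subnormal": "0 00000000 11111111111111111111111",
    "minimum subnormal": "0 00000000 00000000000000000000001",
    "+infinity":         "0 11111111 00000000000000000000000",
    "NaN":               "0 11111111 10000000000000000000000",
}
for name, pattern in rows.items():
    print(f"{name:18} {single_from_bits(pattern)!r}")
```

Python prints the values with more digits than the table because it converts each result to its own 64-bit float before display.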

Conversions

Example: converting 3737 to single format (32 bit)

The number is positive, so the sign bit is 0.

Division          Quotient   Remainder
\frac{3737}{2}    1868       1
\frac{1868}{2}    934        0
\frac{934}{2}     467        0
\frac{467}{2}     233        1
\frac{233}{2}     116        1
\frac{116}{2}     58         0
\frac{58}{2}      29         0
\frac{29}{2}      14         1
\frac{14}{2}      7          0
\frac{7}{2}       3          1
\frac{3}{2}       1          1

Reading the final quotient first and then the remainders from bottom to top, the binary representation is:

3737_{(10} = 111010011001_{(2}
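
The repeated division above can be sketched in a few lines. The function name `to_binary` is hypothetical; it stops when the quotient reaches 1, exactly as the table does.

```python
def to_binary(n: int) -> str:
    """Binary digits of a positive integer, found by repeated division by 2."""
    remainders = []
    while n > 1:
        n, remainder = divmod(n, 2)
        remainders.append(str(remainder))
    # n is now the final quotient (1); it becomes the leading digit,
    # followed by the remainders read from last to first.
    return str(n) + "".join(reversed(remainders))

print(to_binary(3737))  # 111010011001
```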

The binary number has 12 digits. Normalizing it to 1.11010011001_{(2} \cdot 2^{11} moves the binary point 11 places (11 rather than 12, because the leading 1 is the hidden bit and is not counted), so the stored exponent is 127 + 11 = 138:

Division         Quotient   Remainder
\frac{138}{2}    69         0
\frac{69}{2}     34         1
\frac{34}{2}     17         0
\frac{17}{2}     8          1
\frac{8}{2}      4          0
\frac{4}{2}      2          0
\frac{2}{2}      1          0

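So the exponent is:

138_{(10} = 10001010_{(2}

Putting together the sign, the exponent and the mantissa (the binary digits after the hidden bit, padded with zeros to 23 bits):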

3737_{(10} = \text{0 10001010 11010011001000000000000}_{(2}
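
The steps of this example can be sketched as follows. This is a minimal illustration limited to positive integers below 2^24, which fit exactly in the 32-bit format; the name `encode_single` is hypothetical, and the cross-check assumes Python's `struct` module.

```python
import struct

def encode_single(n: int) -> str:
    """32-bit pattern 'sign exponent mantissa' for an integer 0 < n < 2**24."""
    digits = format(n, "b")                # e.g. 3737 -> '111010011001'
    shift = len(digits) - 1                # places the binary point moves (hidden bit excluded)
    exponent = format(127 + shift, "08b")  # stored exponent E_1
    mantissa = digits[1:].ljust(23, "0")   # drop the hidden bit, pad with zeros to 23 bits
    return f"0 {exponent} {mantissa}"

encoded = encode_single(3737)
print(encoded)  # 0 10001010 11010011001000000000000

# Cross-check against the machine's own encoding of 3737.0.
(bits,) = struct.unpack(">I", struct.pack(">f", 3737.0))
print(format(bits, "032b") == encoded.replace(" ", ""))  # True
```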

Double format (64 bit)

The double format stores a number in two words (64 bits), laid out as follows:

Sign     Exponent   Mantissa
1 bit    11 bits    52 bits

Let E_1=\text{exponent} + 1023 be the stored exponent, where 1023 is the bias (and also the largest exponent of a normal number). Since E_1 occupies 11 bits:

00000000000_{(2}\leq E_1\leq 11111111111_{(2}

If we denote by (E_1)_{(10} the decimal value of E_1, then:

0\leq (E_1)_{(10}\leq 2^{10}+2^9+2^8+2^7+2^6+2^5+2^4+2^3+2^2+2^1+2^0=2047

Equivalently:

-1023\leq (E_1)_{(10} -1023\leq 1024

If we denote \alpha=(E_1)_{(10} -1023, then:

-1023\leq\alpha\leq 1024
(E_1)_{(10}=\alpha+1023

Let s denote the sign bit and M the mantissa. Three cases arise:

  • Normal numbers
    -1023 < \alpha < 1024

    Value of a normal number: (-1)^s \cdot 1.M \cdot 2^\alpha

  • Subnormal numbers
    \alpha = -1023

    Value of a subnormal number: (-1)^s \cdot 0.M \cdot 2^{-1022} (M = 0 gives \pm 0)

  • Special values
    \alpha = 1024

    Represents +\infty and -\infty (when M = 0) and NaN, not a number (when M \neq 0)

A hidden bit is used in the mantissa: the leading digit of a normal number is always a_1 = 1, so it does not need to be stored. This frees one bit and therefore gives one extra bit of precision.

Number (extremes by magnitude)   Decimal                                Binary
Maximum normal (positive)        1.7976931348623157 \cdot 10^{308}      0 11111111110 1111111111111111111111111111111111111111111111111111
Maximum normal (negative)        -1.7976931348623157 \cdot 10^{308}     1 11111111110 1111111111111111111111111111111111111111111111111111
Minimum normal (positive)        2.2250738585072014 \cdot 10^{-308}     0 00000000001 0000000000000000000000000000000000000000000000000000
Minimum normal (negative)        -2.2250738585072014 \cdot 10^{-308}    1 00000000001 0000000000000000000000000000000000000000000000000000
Maximum subnormal (positive)     2.2250738585072009 \cdot 10^{-308}     0 00000000000 1111111111111111111111111111111111111111111111111111
Maximum subnormal (negative)     -2.2250738585072009 \cdot 10^{-308}    1 00000000000 1111111111111111111111111111111111111111111111111111
Minimum subnormal (positive)     4.9406564584124654 \cdot 10^{-324}     0 00000000000 0000000000000000000000000000000000000000000000000001
Minimum subnormal (negative)     -4.9406564584124654 \cdot 10^{-324}    1 00000000000 0000000000000000000000000000000000000000000000000001
+0                               0.0                                    0 00000000000 0000000000000000000000000000000000000000000000000000
-0                               -0.0                                   1 00000000000 0000000000000000000000000000000000000000000000000000
+\infty                          +\infty                                0 11111111111 0000000000000000000000000000000000000000000000000000
-\infty                          -\infty                                1 11111111111 0000000000000000000000000000000000000000000000000000
NaN                              NaN                                    (0 or 1) 11111111111 (any nonzero pattern)
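
The 64-bit formulas can be checked against these extremes with exact rational arithmetic. The sketch below assumes that Python's `float` is the IEEE 754 64-bit format (true on common platforms, but an assumption here); the name `decode_double` is hypothetical and it handles finite patterns only.

```python
import sys
from fractions import Fraction

def decode_double(pattern: str) -> Fraction:
    """Exact value of a finite 64-bit pattern written as 'sign exponent mantissa'."""
    sign, e1, m = pattern.split()
    alpha = int(e1, 2) - 1023
    fraction = Fraction(int(m, 2), 2**52)            # the bits of M read as 0.M
    if alpha == -1023:                               # subnormal (or zero)
        magnitude = fraction * Fraction(2) ** -1022
    else:                                            # normal: hidden bit gives 1.M
        magnitude = (1 + fraction) * Fraction(2) ** alpha
    return -magnitude if sign == "1" else magnitude

largest = decode_double("0 11111111110 " + "1" * 52)
smallest_normal = decode_double("0 00000000001 " + "0" * 52)
smallest_subnormal = decode_double("0 00000000000 " + "0" * 51 + "1")

print(largest == Fraction(sys.float_info.max))          # True
print(smallest_normal == Fraction(sys.float_info.min))  # True
print(float(smallest_subnormal))
```

Using `Fraction` avoids any rounding in the check itself, so the comparisons with `sys.float_info` are exact.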

Conversions

Example: converting 3737 to double format (64 bit)

The number is positive, so the sign bit is 0.

Division          Quotient   Remainder
\frac{3737}{2}    1868       1
\frac{1868}{2}    934        0
\frac{934}{2}     467        0
\frac{467}{2}     233        1
\frac{233}{2}     116        1
\frac{116}{2}     58         0
\frac{58}{2}      29         0
\frac{29}{2}      14         1
\frac{14}{2}      7          0
\frac{7}{2}       3          1
\frac{3}{2}       1          1

Reading the final quotient first and then the remainders from bottom to top, the binary representation is:

3737_{(10} = 111010011001_{(2}

The binary number has 12 digits. Normalizing it to 1.11010011001_{(2} \cdot 2^{11} moves the binary point 11 places (11 rather than 12, because the leading 1 is the hidden bit and is not counted), so the stored exponent is 1023 + 11 = 1034:

Division          Quotient   Remainder
\frac{1034}{2}    517        0
\frac{517}{2}     258        1
\frac{258}{2}     129        0
\frac{129}{2}     64         1
\frac{64}{2}      32         0
\frac{32}{2}      16         0
\frac{16}{2}      8          0
\frac{8}{2}       4          0
\frac{4}{2}       2          0
\frac{2}{2}       1          0

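So the exponent is:

1034_{(10} = 10000001010_{(2}

Putting together the sign, the exponent and the mantissa (the binary digits after the hidden bit, padded with zeros to 52 bits):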

3737_{(10} = \text{0 10000001010 1101001100100000000000000000000000000000000000000000}_{(2}
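
This result can be cross-checked against the machine's own encoding. The sketch assumes Python's `struct` module and that Python's `float` is the 64-bit format; the variable names are illustrative.

```python
import struct

# Reinterpret the 8 bytes of the 64-bit float 3737.0 as an unsigned integer.
(bits,) = struct.unpack(">Q", struct.pack(">d", 3737.0))
pattern = format(bits, "064b")
sign, exponent, mantissa = pattern[0], pattern[1:12], pattern[12:]

print(sign, exponent, mantissa)
print(int(exponent, 2) - 1023)  # 11
print(hex(bits))                # 0x40ad320000000000
print((3737.0).hex())           # 0x1.d320000000000p+11
```

The last line shows the same information in hexadecimal: the hidden bit `1.`, the mantissa digits `d32...` and the exponent `+11`.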