Content

1 Floating-point
- 1.1 Simple format (32 bit)
  - 1.1.1 Conversions
    - 1.1.1.1 Example: converting 3737 to simple format (32 bit)
- 1.2 Dual format (64 bit)
  - 1.2.1 Conversions
    - 1.2.1.1 Example: converting 3737 to double format (64 bit)

Floating-point

The most important floating-point representation is the one defined in IEEE 754 (ANSI/IEEE Std 754-1985).

This standard was developed to make programs portable from one processor to another and to encourage the development of sophisticated numerical programs.

The standard has been widely adopted and is used in practically all current processors and arithmetic coprocessors.

There are two floating-point representations:

Simple format (32 bit)
Dual format (64 bit)

Simple format (32 bit)

In the simple (single-precision) format, a number is stored in one 32-bit word with the following layout:

Sign Bit	Exponent	Mantissa
1 bit	8 bits	23 bits
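
As an illustration of the layout above, the following Python sketch splits a value into its three fields. It relies on the standard `struct` module to obtain the 32-bit pattern; the helper name `split_fields32` is hypothetical and only serves this example.

```python
import struct

def split_fields32(x: float):
    """Return the (sign, exponent, mantissa) fields of x stored as a 32-bit float."""
    # Pack x as big-endian single precision, then reread the 4 bytes as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with the bias already added
    mantissa = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, mantissa

sign, exponent, mantissa = split_fields32(1.0)
print(sign, format(exponent, "08b"), format(mantissa, "023b"))
```

For 1.0 the stored exponent field is 127, because the true exponent 0 is stored with 127 added to it.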

Given the stored exponent $E_1=\text{ exponent } + 127$ (127 is the bias, which is also the largest exponent of a normal number), we have:

$00000000_{(2}\leq E_1\leq 11111111_{(2}$

If we denote by $(E_1)_{(10}$ the decimal value of $E_1$, we have:

$0\leq (E_1)_{(10}\leq 2^7+2^6+2^5+2^4+2^3+2^2+2^1+2^0=255$

Or, equivalently:

$-127\leq (E_1)_{(10} -127\leq 128$

If we denote $\alpha=(E_1)_{(10} -127$ we have:

$-127\leq\alpha\leq 128$
$(E_1)=\alpha+127$

We denote by s the sign bit and by M the mantissa. Two types of numbers arise, plus a set of special values:

Normal numbers
$-127 < \alpha < 128$

Value of a normal number: $(-1)^s \cdot 1.M \cdot 2^\alpha$

Subnormal numbers
$\alpha = -127$

Value of a subnormal number: $(-1)^s \cdot 0.M \cdot 2^{-126}$

Special values
$\alpha = 128$

This case gives rise to the special values $+\infty$, $-\infty$ and NaN (not a number).

A hidden bit is used in the mantissa. The leading digit of a normal number is always $a_1 = 1$, so this bit does not need to be stored. This frees one more bit of the mantissa and allows more numbers to be represented.
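
The three cases can be sketched in Python as follows. This is a minimal illustration, assuming the sign factor is $(-1)^s$ and that the mantissa field is read as the fraction M / 2^23; the function name `decode32` is hypothetical. Exact fractions are used so that no rounding happens before the final conversion.

```python
from fractions import Fraction

def decode32(sign: int, e1: int, m: int) -> float:
    """Value encoded by the fields s, E_1 and M of a 32-bit word."""
    alpha = e1 - 127
    if alpha == 128:                       # E_1 = 255: infinities and NaN
        if m != 0:
            return float("nan")
        return float("-inf") if sign else float("inf")
    fraction = Fraction(m, 2**23)          # the digits after the point, as an exact fraction
    if alpha == -127:                      # E_1 = 0: subnormal, no hidden bit (zero when M = 0)
        magnitude = fraction * Fraction(1, 2**126)
    else:                                  # normal: the hidden bit contributes the leading 1
        magnitude = (1 + fraction) * Fraction(2) ** alpha
    value = float(magnitude)
    return -value if sign else value

print(decode32(0, 138, 0b11010011001 << 12))
```

The call at the end decodes the fields obtained for 3737 in the conversion example of this section.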

Number	Decimal	Binary
Largest positive normal	$3.40282347 \cdot 10^{38}$	0 11111110 11111111111111111111111
Largest negative normal (in magnitude)	$-3.40282347 \cdot 10^{38}$	1 11111110 11111111111111111111111
Smallest positive normal	$1.17549435 \cdot 10^{-38}$	0 00000001 00000000000000000000000
Smallest negative normal (in magnitude)	$-1.17549435 \cdot 10^{-38}$	1 00000001 00000000000000000000000
Largest positive subnormal	$1.17549421 \cdot 10^{-38}$	0 00000000 11111111111111111111111
Largest negative subnormal (in magnitude)	$-1.17549421 \cdot 10^{-38}$	1 00000000 11111111111111111111111
Smallest positive subnormal	$1.40129846 \cdot 10^{-45}$	0 00000000 00000000000000000000001
Smallest negative subnormal (in magnitude)	$-1.40129846 \cdot 10^{-45}$	1 00000000 00000000000000000000001
+0	$0.0$	0 00000000 00000000000000000000000
-0	$-0.0$	1 00000000 00000000000000000000000
$+\infty$	$+\infty$	0 11111111 00000000000000000000000
$-\infty$	$-\infty$	1 11111111 00000000000000000000000
NaN	NaN	(0 or 1) 11111111 (at least one 1)
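
The bit patterns in the table can be checked with a short Python sketch that reinterprets each pattern as a 32-bit float through the standard `struct` module. The helper name `bits_to_float32` is hypothetical; only the positive rows are listed, and Python prints each value with the digits of its 64-bit float, so more decimals appear than in the table.

```python
import struct

def bits_to_float32(pattern: str) -> float:
    """Interpret a 'sign exponent mantissa' bit string as a 32-bit float."""
    bits = int(pattern.replace(" ", ""), 2)
    # Reinterpret the 32-bit integer as a big-endian single-precision value.
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

table = {
    "largest normal": "0 11111110 11111111111111111111111",
    "smallest normal": "0 00000001 00000000000000000000000",
    "largest subnormal": "0 00000000 11111111111111111111111",
    "smallest subnormal": "0 00000000 00000000000000000000001",
    "+infinity": "0 11111111 00000000000000000000000",
}
for name, pattern in table.items():
    print(f"{name:20s} {bits_to_float32(pattern)!r}")
```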

Conversions

Example: converting 3737 to simple format (32 bit)

The number is positive, so the sign bit is 0

Division	Quotient	Remainder
$\frac{3737}{2}$	1868	1
$\frac{1868}{2}$	934	0
$\frac{934}{2}$	467	0
$\frac{467}{2}$	233	1
$\frac{233}{2}$	116	1
$\frac{116}{2}$	58	0
$\frac{58}{2}$	29	0
$\frac{29}{2}$	14	1
$\frac{14}{2}$	7	0
$\frac{7}{2}$	3	1
$\frac{3}{2}$	1	1

Reading the last quotient followed by the remainders from bottom to top, the binary representation is:

$3737_{(10} = 111010011001_{(2}$
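
The repeated division in the table above can be sketched in Python as follows. The function name `to_binary` is hypothetical, and the sketch assumes a positive integer as input.

```python
def to_binary(n: int) -> str:
    """Binary digits of a positive integer, obtained by repeated division by 2."""
    remainders = []
    while n > 1:
        n, remainder = divmod(n, 2)   # one row of the table: quotient and remainder
        remainders.append(str(remainder))
    # The last quotient (1) is the leading digit; the remainders are read bottom to top.
    return str(n) + "".join(reversed(remainders))

print(to_binary(3737))  # 111010011001
```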

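The number has 12 binary digits, so to normalize it we move the binary point 11 places to the left, leaving only the hidden bit in front of it. The stored exponent is therefore 127 + 11 = 138 (we add 11 rather than 12 because the hidden bit is not counted)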

Division	Quotient	Remainder
$\frac{138}{2}$	69	0
$\frac{69}{2}$	34	1
$\frac{34}{2}$	17	0
$\frac{17}{2}$	8	1
$\frac{8}{2}$	4	0
$\frac{4}{2}$	2	0
$\frac{2}{2}$	1	0

So the exponent in binary is:

$138_{(10} = 10001010_{(2}$

Dropping the hidden bit and padding the mantissa with zeros up to 23 bits, we have:

$3737_{(10} = \text{0 10001010 11010011001000000000000}_{(2}$
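
The steps of this example can be put together in a short Python sketch and compared with the pattern produced by the standard `struct` module. The function name `encode_positive_int32` is hypothetical, and the sketch assumes a positive integer of at most 24 binary digits, so that it fits in the mantissa without rounding.

```python
import struct

def encode_positive_int32(n: int) -> str:
    """Build the 'sign exponent mantissa' string for a small positive integer."""
    digits = bin(n)[2:]                    # binary digits, e.g. '111010011001' for 3737
    shift = len(digits) - 1                # places the binary point moves (hidden bit excluded)
    exponent = format(127 + shift, "08b")  # stored exponent: bias plus shift
    mantissa = digits[1:].ljust(23, "0")   # drop the hidden bit, pad with zeros to 23 bits
    return f"0 {exponent} {mantissa}"

encoded = encode_positive_int32(3737)
print(encoded)

# Cross-check against the bit pattern the machine format actually uses.
(bits,) = struct.unpack(">I", struct.pack(">f", 3737.0))
print(encoded.replace(" ", "") == format(bits, "032b"))
```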

Dual format (64 bit)

In the dual (double-precision) format, a number is stored in two words (64 bits) with the following layout:

Sign Bit	Exponent	Mantissa
1 bit	11 bits	52 bits

Given the stored exponent $E_1=\text{ exponent } + 1023$ (1023 is the bias, which is also the largest exponent of a normal number), we have:

$00000000000_{(2}\leq E_1\leq 11111111111_{(2}$

If we denote by $(E_1)_{(10}$ the decimal value of $E_1$, we have:

$0\leq (E_1)_{(10}\leq 2^{10}+2^9+2^8+2^7+2^6+2^5+2^4+2^3+2^2+2^1+2^0=2047$

Or, equivalently:

$-1023\leq (E_1)_{(10} -1023\leq 1024$

If we denote $\alpha=(E_1)_{(10} -1023$ we have:

$-1023\leq\alpha\leq 1024$
$(E_1)=\alpha+1023$

We denote by s the sign bit and by M the mantissa. Two types of numbers arise, plus a set of special values:

Normal numbers
$-1023 < \alpha < 1024$

Value of a normal number: $(-1)^s \cdot 1.M \cdot 2^\alpha$

Subnormal numbers
$\alpha = -1023$

Value of a subnormal number: $(-1)^s \cdot 0.M \cdot 2^{-1022}$

Special values
$\alpha = 1024$

This case gives rise to the special values $+\infty$, $-\infty$ and NaN (not a number).

A hidden bit is used in the mantissa. The leading digit of a normal number is always $a_1 = 1$, so this bit does not need to be stored. This frees one more bit of the mantissa and allows more numbers to be represented.

Number	Decimal	Binary
Largest positive normal	$1.7976931 \cdot 10^{308}$	0 11111111110 1111111111111111111111111111111111111111111111111111
Largest negative normal (in magnitude)	$-1.7976931 \cdot 10^{308}$	1 11111111110 1111111111111111111111111111111111111111111111111111
Smallest positive normal	$2.2250738585072014 \cdot 10^{-308}$	0 00000000001 0000000000000000000000000000000000000000000000000000
Smallest negative normal (in magnitude)	$-2.2250738585072014 \cdot 10^{-308}$	1 00000000001 0000000000000000000000000000000000000000000000000000
Largest positive subnormal	$2.2250738585072009 \cdot 10^{-308}$	0 00000000000 1111111111111111111111111111111111111111111111111111
Largest negative subnormal (in magnitude)	$-2.2250738585072009 \cdot 10^{-308}$	1 00000000000 1111111111111111111111111111111111111111111111111111
Smallest positive subnormal	$4.9406564 \cdot 10^{-324}$	0 00000000000 0000000000000000000000000000000000000000000000000001
Smallest negative subnormal (in magnitude)	$-4.9406564 \cdot 10^{-324}$	1 00000000000 0000000000000000000000000000000000000000000000000001
+0	$0.0$	0 00000000000 0000000000000000000000000000000000000000000000000000
-0	$-0.0$	1 00000000000 0000000000000000000000000000000000000000000000000000
$+\infty$	$+\infty$	0 11111111111 0000000000000000000000000000000000000000000000000000
$-\infty$	$-\infty$	1 11111111111 0000000000000000000000000000000000000000000000000000
NaN	NaN	(0 or 1) 11111111111 (at least one 1)
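
Python exposes several of these limits directly, which allows a quick check of the table. The sketch below assumes that Python's `float` is the 64-bit format described here, which is the case on mainstream platforms; the helper name `float_to_bits64` is hypothetical.

```python
import struct
import sys

def float_to_bits64(x: float) -> str:
    """Return the 'sign exponent mantissa' bit string of a 64-bit float."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    raw = format(bits, "064b")
    return f"{raw[0]} {raw[1:12]} {raw[12:]}"

smallest_subnormal = 2.0 ** -1074   # only the last mantissa bit set
largest_subnormal = sys.float_info.min - smallest_subnormal

for value in (sys.float_info.max, sys.float_info.min,
              largest_subnormal, smallest_subnormal, float("inf")):
    print(f"{value!r:>24} {float_to_bits64(value)}")
```

Here `sys.float_info.max` and `sys.float_info.min` are the largest and smallest positive normal numbers; the subnormal limits are derived from them rather than read from the library.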

Conversions

Example: converting 3737 to double format (64 bit)

The number is positive, so the sign bit is 0

Division	Quotient	Remainder
$\frac{3737}{2}$	1868	1
$\frac{1868}{2}$	934	0
$\frac{934}{2}$	467	0
$\frac{467}{2}$	233	1
$\frac{233}{2}$	116	1
$\frac{116}{2}$	58	0
$\frac{58}{2}$	29	0
$\frac{29}{2}$	14	1
$\frac{14}{2}$	7	0
$\frac{7}{2}$	3	1
$\frac{3}{2}$	1	1

Reading the last quotient followed by the remainders from bottom to top, the binary representation is:

$3737_{(10} = 111010011001_{(2}$

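The number has 12 binary digits, so to normalize it we move the binary point 11 places to the left, leaving only the hidden bit in front of it. The stored exponent is therefore 1023 + 11 = 1034 (we add 11 rather than 12 because the hidden bit is not counted)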

Division	Quotient	Remainder
$\frac{1034}{2}$	517	0
$\frac{517}{2}$	258	1
$\frac{258}{2}$	129	0
$\frac{129}{2}$	64	1
$\frac{64}{2}$	32	0
$\frac{32}{2}$	16	0
$\frac{16}{2}$	8	0
$\frac{8}{2}$	4	0
$\frac{4}{2}$	2	0
$\frac{2}{2}$	1	0

So the exponent in binary is:

$1034_{(10} = 10000001010_{(2}$

Dropping the hidden bit and padding the mantissa with zeros up to 52 bits, we have:

$3737_{(10} = \text{0 10000001010 1101001100100000000000000000000000000000000000000000}_{(2}$
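
The result can be cross-checked in Python, whose `float` is assumed here to be the 64-bit format (true on mainstream platforms). The sketch reads the three fields of 3737.0 with the standard `struct` module and also prints the hexadecimal form given by `float.hex()`.

```python
import struct

value = 3737.0
# Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
(bits,) = struct.unpack(">Q", struct.pack(">d", value))

sign = bits >> 63                   # 1 bit
exponent = (bits >> 52) & 0x7FF     # 11 bits, stored with the bias 1023 added
mantissa = bits & ((1 << 52) - 1)   # 52 bits, hidden bit not included

print(sign, format(exponent, "011b"), format(mantissa, "052b"))
print(value.hex())                  # mantissa in hexadecimal, exponent as a power of 2
```

The hexadecimal form ends in `p+11`, the unbiased exponent, which matches 1034 - 1023.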

Secarcam's Computer Science Web

Web page of Sergio Cárcamo Garcia dedicated to computing and related topics such as programming languages, statistics and mathematics.
