
An Introduction to the IEEE 754 Standard for Binary Floating-Point Arithmetic

Single-precision 32 bit

A single-precision binary floating-point number is stored in 32 bits.

[Figure: bit values for the IEEE 754 32-bit float 0.15625]

The exponent is biased by $2^{8-1} - 1 = 127$ in this case (exponents in the range −126 to +127 are representable; see the above explanation to understand why biasing is done). An exponent of −127 would be biased to the value 0, but this is reserved to encode that the value is a denormalized number or zero. An exponent of 128 would be biased to the value 255, but this is reserved to encode an infinity or not a number (NaN). See the chart above.

For normalized numbers, the most common case, Exp is the biased exponent and Fraction is the significand without its most significant bit.

The number has value v:

$v = s \times 2^{e} \times m$

Where

s = +1 (positive numbers) when the sign bit is 0

s = −1 (negative numbers) when the sign bit is 1

e = Exp − 127 (in other words, the exponent is stored with 127 added to it, also called "biased with 127")

m = 1.fraction in binary (that is, the significand is the binary number 1 followed by the radix point followed by the binary bits of the fraction). Therefore, $1 \le m < 2$.

In the example shown above, the sign bit is zero, the exponent is −3, and the significand is 1.01 (in binary, which is 1.25 in decimal). The represented number is therefore $+1.25 \times 2^{-3}$, which is +0.15625.
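The decoding rule above can be checked with a short Python sketch. The helper name `decode_float32` is hypothetical (it is not from the original text), and the function only handles normalized numbers, as assumed in the formula for v.

```python
import struct

def decode_float32(bits: int):
    """Split a 32-bit pattern into its fields and rebuild the value.

    Assumption: the pattern encodes a normalized number
    (Exp field is neither 0 nor 255).
    """
    sign_bit = bits >> 31              # 1 bit
    exp_field = (bits >> 23) & 0xFF    # 8 bits, stored with bias 127
    fraction = bits & 0x7FFFFF         # 23 bits
    s = -1 if sign_bit else 1
    e = exp_field - 127                # remove the bias
    m = 1 + fraction / 2**23           # implicit leading 1
    return sign_bit, exp_field, fraction, s * m * 2.0**e

# Obtain the bit pattern of 0.15625 as a big-endian 32-bit integer.
bits = struct.unpack(">I", struct.pack(">f", 0.15625))[0]
sign_bit, exp_field, fraction, value = decode_float32(bits)

print(f"{bits:032b}")
print(sign_bit, exp_field - 127, value)
```

The Exp field holds 124, which is −3 after removing the bias, and the fraction bits `0100...0` give m = 1.25.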

Notes:

1. Denormalized numbers are the same except that e = −126 and m is 0.fraction. (e is NOT −127: the fraction has to be shifted to the right by one more bit in order to include the leading bit, which is not always 1 in this case. This is balanced by incrementing the exponent to −126 for the calculation.)

2. −126 is the smallest exponent for a normalized number.

3. There are two zeroes, +0 (s is 0) and −0 (s is 1).

4. There are two infinities, +∞ (s is 0) and −∞ (s is 1).

5. NaNs may have a sign and a fraction, but these have no meaning other than for diagnostics; the first bit of the fraction is often used to distinguish signaling NaNs from quiet NaNs.

6. NaNs and infinities have all 1s in the Exp field.

7. The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are

$\pm 2^{-149} \approx \pm 1.4012985 \times 10^{-45}$

8. The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the Exp field and 0 in the Fraction field) are

$\pm 2^{-126} \approx \pm 1.175494351 \times 10^{-38}$

9. The finite positive and finite negative numbers furthest from zero (represented by the value with 254 in the Exp field and all 1s in the Fraction field) are

$\pm\left(1 - (1/2)^{24}\right) \times 2^{128}$ [2] $\approx \pm 3.4028235 \times 10^{38}$
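The limits in the notes can be reproduced by assembling the three fields directly. The following is a minimal sketch; `float32_from_fields` is a hypothetical helper introduced here for illustration, and it relies on Python's `struct` module to reinterpret the 32-bit pattern as a float.

```python
import math
import struct

def float32_from_fields(sign: int, exp_field: int, fraction: int) -> float:
    """Build a float from raw sign (1 bit), Exp (8 bits) and Fraction (23 bits)."""
    bits = (sign << 31) | (exp_field << 23) | fraction
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest_denormal = float32_from_fields(0, 0, 1)          # Exp all 0s, Fraction = 1
smallest_normal = float32_from_fields(0, 1, 0)            # Exp = 1, Fraction = 0
largest_finite = float32_from_fields(0, 254, 0x7FFFFF)    # Exp = 254, Fraction all 1s
negative_zero = float32_from_fields(1, 0, 0)              # sign bit set, everything else 0
positive_infinity = float32_from_fields(0, 255, 0)        # Exp all 1s, Fraction = 0
quiet_nan = float32_from_fields(0, 255, 0x400000)         # Exp all 1s, Fraction non-zero

print(smallest_denormal, smallest_normal, largest_finite)
print(negative_zero, positive_infinity, quiet_nan)
```

Each of the first three values is exactly representable as a Python float (a double), so it can be compared directly against the closed-form expressions in notes 7 to 9.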


Here is the summary table from the previous section with some 32-bit single-precision examples:

| Type | Exponent | Significand | Value |
|---|---|---|---|
| Zero | 0000 0000 | 000 0000 0000 0000 0000 0000 | 0.0 |
| One | 0111 1111 | 000 0000 0000 0000 0000 0000 | 1.0 |
| Denormalized number | 0000 0000 | 100 0000 0000 0000 0000 0000 | $5.9 \times 10^{-39}$ |
| Large normalized number | 1111 1110 | 111 1111 1111 1111 1111 1111 | $3.4 \times 10^{38}$ |
| Small normalized number | 0000 0001 | 000 0000 0000 0000 0000 0000 | $1.18 \times 10^{-38}$ |
| Infinity | 1111 1111 | 000 0000 0000 0000 0000 0000 | Infinity |
| NaN | 1111 1111 | non zero | NaN |
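The rows of the table follow a simple decision rule on the Exp and Fraction fields, which can be sketched as below. The function names `classify_float32` and `bits_from_fields` are hypothetical, and the fraction used for the NaN row is an arbitrary non-zero pattern chosen for illustration, since the table only says "non zero".

```python
def classify_float32(bits: int) -> str:
    """Classify a 32-bit pattern using only its Exp and Fraction fields."""
    exp_field = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exp_field == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exp_field == 0xFF:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"

def bits_from_fields(exponent: str, fraction: str, sign: str = "0") -> int:
    """Join binary strings for sign, exponent and fraction into one integer."""
    return int(sign + exponent + fraction, 2)

ZEROS = "0" * 23
# (label, 8-bit exponent field, 23-bit fraction field), following the table rows.
rows = [
    ("Zero", "00000000", ZEROS),
    ("One", "01111111", ZEROS),
    ("Denormalized number", "00000000", "1" + "0" * 22),
    ("Large normalized number", "11111110", "1" * 23),
    ("Small normalized number", "00000001", ZEROS),
    ("Infinity", "11111111", ZEROS),
    ("NaN", "11111111", "1" + "0" * 22),  # assumed example of a non-zero fraction
]

classes = {label: classify_float32(bits_from_fields(e, f)) for label, e, f in rows}
for label, kind in classes.items():
    print(f"{label}: {kind}")
```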

A more complex example

[Figure: bit values for the IEEE 754 32-bit float −118.625]

Let us encode the decimal number −118.625 using the IEEE 754 system.

1. First we need to get the sign, the exponent and the fraction. Because it is a negative number, the sign is "1".

2. Now, we write the number (without the sign; i.e. unsigned, no two's complement) using binary notation. The result is 1110110.101.

3. Next, let's move the radix point left, leaving only a 1 at its left: $1110110.101 = 1.110110101 \times 2^{6}$. This is a normalized floating-point number. The fraction is the part to the right of the radix point, padded with 0s on the right until we get all 23 bits. That is 11011010100000000000000.

4. The exponent is 6, but we need to convert it to binary and bias it (so the most negative exponent is 0, and all exponents are non-negative binary numbers). For the 32-bit IEEE 754 format, the bias is 127, and so 6 + 127 = 133. In binary, this is written as 10000101.
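The four steps above can be followed in code. This is a minimal sketch that assumes the input is a normalized value exactly representable in single precision (as −118.625 is), so no rounding logic is included; the variable names are illustrative only.

```python
import struct

value = -118.625

# Step 1: the sign bit.
sign_bit = 1 if value < 0 else 0

# Steps 2 and 3: normalize the magnitude to 1.xxx * 2**exponent.
m = abs(value)
exponent = 0
while m >= 2:
    m /= 2
    exponent += 1
while m < 1:
    m *= 2
    exponent -= 1

# The fraction is everything after the leading 1, scaled to 23 bits.
# int() is exact here only because the value needs no rounding (assumption).
fraction = int((m - 1) * 2**23)

# Step 4: bias the exponent.
exp_field = exponent + 127

bits = (sign_bit << 31) | (exp_field << 23) | fraction

print(sign_bit, format(exp_field, "08b"), format(fraction, "023b"))
print(hex(bits))
# Cross-check against the platform's own single-precision encoding.
print(struct.pack(">f", value) == bits.to_bytes(4, "big"))
```

Concatenating the sign bit, the 8-bit biased exponent and the 23-bit fraction gives the complete 32-bit pattern, which the last line compares with the encoding produced by `struct.pack`.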