
Math 451: Computational Mathematics

3.2 Floating Point Arithmetic


3.2.1 IEEE 754 Floating Point

We need to understand how floating point numbers are stored and manipulated in a computer in order to write better numerical algorithms. Modern computers use the Institute of Electrical and Electronics Engineers (IEEE) 754 standard for storing floating point numbers.

  • Single Precision means the number is stored with 32 bits, 4 bytes, 2 words, or 1 dword
  • Double Precision means the number is stored with 64 bits, 8 bytes, 4 words, or 2 dwords

Modern digital computers and MATLAB store floating point numbers using the double precision, or 64 bit, format. If you are interested in reading more about the IEEE 754 double precision floating point specification, you will find that it is a hardware specification as well as a software specification.

In this class we are interested in only the basics of the software format: primarily what the 64 bits represent and what operations we can perform on instances of this data type. The topics that interest us are

  1. The layout of the bits in the 8 bytes and what they represent
  2. The accuracy and range of floating point numbers
  3. The operations that are implemented in hardware to be accurate to the greatest possible precision
  4. Exception handling

We will talk about parts one, two, and three in this section. We will include exception handling in our MATLAB functions in future chapters.

3.2.2 Format of Floating Point Numbers

When we count the bits in a binary string we count them from right-to-left. The lowest order bit is the one farthest to the right and the highest order bit is the one farthest to the left. We also number them from 0, so a 64 bit binary string has bits 0-63.

Figure 1: The layout of the sign, exponent, and mantissa bits in a 64 bit double precision floating point number.



bits    usage      range
0-51    mantissa   $1$ to $2 - 2^{-52}$
52-62   exponent   $-1022$ to $1023$
63      sign       $0$ to $1$

These bit ranges encode the sign, the exponent of 2, and the binary mantissa. The stored bit patterns are not equal to the binary values they represent.
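As a quick sanity check of this layout, here is a minimal MATLAB sketch (the variable names are ours, and the bias of 1023 used on the last line is explained in the exponent subsection below) that reinterprets the 8 bytes of a double as a 64 bit unsigned integer and slices out the three fields:

x = -6.25;                                      % -6.25 = -1.1001b x 2^2
u = typecast(x, 'uint64');                      % reinterpret the 8 bytes as an integer
bits = char(double(bitget(u, 64:-1:1)) + '0');  % bit 63 first, bit 0 last
sign_bit  = bits(1)                             % '1' means negative
exp_bits  = bits(2:12)                          % biased exponent: 100 0000 0001b
frac_bits = bits(13:end)                        % 52 mantissa bits: 1001 then 48 zeros
bin2dec(exp_bits) - 1023                        % subtract the bias 1023: ans = 2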

3.2.3 The Sign Bit

The encoding for the sign of the floating point number is straightforward.

$$ \begin{align*} 0b &= (-1)^0 = \ \ 1 = \text{ positive} \\ 1b &= (-1)^1 = -1 = \text{ negative} \end{align*} $$

3.2.4 The Mantissa

The encoding for the mantissa is in base 2. As the name suggests, it represents the fractional part of a number written in engineering or exponential notation. Recall Avogadro's number in scientific notation,

$$ 6.022\ \times\ 10^{23} $$
The mantissa is $6.022$ and the exponent is $23$. The first digit must not be zero. For example
$$ \begin{align*} \frac{1}{2} &= 0.5 = 5.0\times 10^{-1} \\ 13.5 &= 1.35\times 10^{1} \\ 0 &= 0\times 10^0 \\ 0.000001754 &= 1.754\times 10^{-6} \\ 5280 &= 5.280\times 10^3 \end{align*} $$
There is exactly one digit to the left of the decimal point in the mantissa, and it must not be zero unless the number is zero.

Likewise the first digit of the binary mantissa must not be zero unless the number it represents is zero (or a special value!), and it is the only digit to the left of the decimal point. In binary the only nonzero digit is 1.

Therefore this digit is not stored in the 52 bits!


Let's look at some examples with real numbers and their base 2 exponential notation.

$$ \begin{align*} 0.5 &= 1.00 b\times 2^{-1} = \frac{1}{2} \\ 1.5 &= 1.10 b\times 2^0 = 1 + \frac{1}{2} \\ 1.25 &= 1.010 b\times 2^0 = 1 + \frac{1}{4} \\ .75 &= 1.100 b\times 2^{-1} = \frac{1}{2} + \frac{1}{4} \\ 4.25 &= 1.0001 b\times 2^2 = 2^2 + 0\cdot2 + 0\cdot1 + 0\cdot\frac{1}{2} + \frac{1}{4} \\ 0 &= 0.0 b\times 2^0 = 0 \\ \end{align*} $$

Notice that only dyadic rationals have terminating expansions in binary notation. Every other rational number has a repeating, nonterminating binary expansion. Now let's consider the encoding of the mantissa in IEEE 754.

$$ \begin{align*} 1.00 b\times 2^{-1} &= 0000000000000000000000000000000000000000000000000000b\cdot2^{-1} \\ 1.10 b\times 2^0 &= 1000000000000000000000000000000000000000000000000000b\cdot2^0 \\ 1.01 b\times 2^0 &= 0100000000000000000000000000000000000000000000000000b\cdot2^0 \\ 1.10 b\times 2^{-1} &= 1000000000000000000000000000000000000000000000000000b\cdot2^{-1} \\ 1.0001 b\times 2^{2} &= 0001000000000000000000000000000000000000000000000000b\cdot2^2 \\ 0 b\times 2^{0} &= 0000000000000000000000000000000000000000000000000000b\cdot2^0 \\ \end{align*} $$
Some of the mantissas look the same because we are not yet including the exponent, and the most significant digit is not part of the representation. The use of a separate sign bit allows the mantissa to encode only absolute values of numbers with a leading 1. The encoding for the mantissa $f$ is given by

$$ \begin{align*} f &= 1.b_{51}b_{50}b_{49}\dots b_1b_0 \\ &= 1 + \displaystyle\sum_{k=1}^{52} b_{52-k}2^{-k} \end{align*} $$
Zero, Infinity, and NaN are special values. The encoding for mantissa $f$ differs for these special values. Finally, 52 is evenly divisible by 4 so we usually group our 52 mantissa bits into nibbles and write the mantissa in hexadecimal.

$$ \begin{align*} 1.00 b\cdot 2^{-1} &= 0x0.0000000000000\cdot 2^{-1} = 0.0_{16}\cdot 2^{-1} \\ 1.10 b\cdot 2^0 &= 0x0.8000000000000\cdot 2^0 \ \ = 0.8_{16}\cdot 2^0 \\ 1.01 b\cdot 2^0 &= 0x0.4000000000000\cdot 2^0 \ \ = 0.4_{16}\cdot 2^0 \\ 1.10 b\cdot 2^{-1} &= 0x0.8000000000000\cdot 2^{-1} = 0.8_{16}\cdot 2^{-1} \\ 1.0001 b\cdot 2^{2} &= 0x0.1000000000000\cdot 2^2 \ \ = 0.1_{16}\cdot 2^2 \\ 0 b\cdot 2^{0} &= 0x0.0000000000000\cdot 2^0 \ \ = 0.0_{16}\cdot 2^0 \\ \end{align*} $$
Recall that in the encoding, the most significant digit (1) is assumed and is not part of the 52 bits of the mantissa stored by the computer.

Notice also that when using the $0x$ prefix one customarily lists all of the hexadecimal digits, while the subscript notation does not require listing all of the significant digits. Since the mantissa encodes the significant digits (in binary), it is also called the significand.
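MATLAB can display these encodings directly: format hex prints all 16 hexadecimal digits of the 64 bit representation, with the sign and exponent packed into the first three hex digits and the 13 mantissa nibbles following. A short demonstration matching the encodings above:

format hex
0.5      % 3fe0000000000000 : mantissa nibbles 0000000000000
1.5      % 3ff8000000000000 : mantissa nibbles 8000000000000
1.25     % 3ff4000000000000 : mantissa nibbles 4000000000000
4.25     % 4011000000000000 : mantissa nibbles 1000000000000
format short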

3.2.5 The Exponent

The encoding for the exponent must allow for negative exponents, so the encoding uses a bias. The eleven (11) bits used for the exponent are treated as an 11 bit unsigned integer with 1023 (the bias) subtracted from it. For example

$$ 2^0 \rightarrow 0\ {\color{red}=}\ 01111111111b = 011 1111 1111b = 3\text{f}\text{f}_{16} $$
The equal sign ${\color{red}=}$ is justified because the binary string $01111111111b = 3\text{f}\text{f}_{16}$ encodes the exponent 0. That is because the formula for computing the base 2 exponent from the bits is given by

$$ 0 = 3\text{f}\text{f}_{16} - 1023_{10} = 3\text{f}\text{f}_{16} - 3\text{f}\text{f}_{16} $$
To compute the encoding $e$ for other exponents we have

$$ \begin{array}{rcrcrcrrll} -1022 & = & 1 - 1023 & \rightarrow & e & = & 1_{16} & \rightarrow & 2^{-1022} & \text{ the smallest exponent} \\ -12 & = & 1011 - 1023 & \rightarrow & e & = & 3\text{f}3_{16} & \rightarrow & 2^{-12} \\ 0 & = & 1023 - 1023 & \rightarrow & e & = & 3\text{ff}_{16} & \rightarrow & 2^0 \\ 8 & = & 1031 - 1023 & \rightarrow & e & = & 407_{16} & \rightarrow & 2^8 \\ 1023 & = & 2046 - 1023 & \rightarrow & e & = & 7\text{fe}_{16} & \rightarrow & 2^{1023} & \text{ the largest exponent} \end{array} $$
The exponents $000_{16}$ and $7\text{ff}_{16}$ have a special meaning:

$000\ 0000\ 0000_2 = 000_{16}$ is used to represent a signed zero (if $f = 0$) and subnormals (if $f \neq 0$).
$111\ 1111\ 1111_2 = 7\text{ff}_{16}$ is used to represent $\infty$ (if $f = 0$) and NaN (if $f \neq 0$).
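Here is a minimal MATLAB sketch of extracting the exponent field, reusing the typecast idea from above; the mask value $2047 = 7\text{ff}_{16}$ keeps only the 11 exponent bits:

x = 256;                                     % 256 = 1.0b x 2^8
u = typecast(x, 'uint64');
e = bitand(bitshift(u, -52), uint64(2047));  % shift out the mantissa, mask 11 bits
dec2hex(e)                                   % ans = '407', matching the table above
double(e) - 1023                             % subtract the bias: ans = 8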

3.2.6 Double Precision Floating Point Representation

All other bit patterns are valid encodings of normal floating point numbers, described by

$$ \begin{align*} (−1)^{\text{sign}} \cdot 2^{e − 1023} \cdot f &= (-1)^{\text{sign}}\cdot 2^{e-1023}\cdot \left(1 + \displaystyle\sum_{k=1}^{52} b_{52-k}2^{-k} \right) \\ \\ &= (-1)^{\text{sign}}\cdot 2^{e-1023}\cdot\left(1.b_{51}b_{50}b_{49}\dots b_1b_0\right) \end{align*} $$

In the case of subnormals ($e = 000_{16}$) the double-precision number is described by:

$$ \begin{align*} (−1)^{\text{sign}}\cdot 2^{1−1023} \cdot f &= (−1)^{\text{sign}} \cdot 2^{−1022} \cdot \left( 0 + \displaystyle\sum_{k=1}^{52} b_{52-k}2^{-k} \right) \\ \\ &= (-1)^{\text{sign}}\cdot 2^{-1022}\cdot\left(0.b_{51}b_{50}b_{49}\dots b_1b_0\right) \end{align*} $$
Notice that the exponent for a subnormal number is $-1022$, just as in the case $e = 001_{16}$. However the most significant digit is 0 instead of 1. Subnormal numbers prevent underflow from occurring as often when adding or multiplying numbers near zero, that is, in numerical operations that produce numbers smaller than $2^{-1022}$. An underflow condition creates an exception. We will introduce exceptions and exception handling in a future section.

Computing with subnormal numbers in MATLAB can reduce the performance of your functions by as much as 50 times. Thus we will try to avoid performing computations too close to zero.
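A few experiments in MATLAB illustrate the subnormal range; realmin is MATLAB's name for the smallest normal double, $2^{-1022}$:

format long
realmin             % 2.225073858507201e-308 = 2^-1022, the smallest normal number
realmin / 2         % 1.112536929253601e-308, a subnormal: the exponent field is 000
realmin / 2^52      % 4.940656458412465e-324 = 2^-1074, the smallest subnormal
realmin / 2^53      % underflows to 0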

The special values are encoded as follows.

$0\ 00000000000\ 0000000000000000000000000000000000000000000000000000b = +0$
$1\ 00000000000\ 0000000000000000000000000000000000000000000000000000b = -0$
$0\ 11111111111\ 0000000000000000000000000000000000000000000000000000b = +\infty$
$1\ 11111111111\ 0000000000000000000000000000000000000000000000000000b = -\infty$

If the exponent is $7\text{ff}_{16} = 111\ 1111\ 1111b$ and $f\neq 0$, then the result is not a number, or NaN.

$0\ 11111111111\ 1000000000000000000000000000000000000000000000000001b = $ NaN
$0\ 11111111111\ 0000000000000000000000000000000000000000000000000001b = $ NaN
$0\ 11111111111\ 1111111111111111111111111111111111111111111111111111b = $ NaN
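In MATLAB these special values arise naturally from arithmetic rather than raising an error:

1/0             % Inf
-1/0            % -Inf
0/0             % NaN
Inf - Inf       % NaN
NaN == NaN      % logical 0: NaN compares unequal to everything, even itself
isnan(0/0)      % logical 1: use isnan to test for NaN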

3.2.7 Errors in Numerical Computations

There will always be error in numerical computations in MATLAB and similar computational environments. This error comes from rounding off our results, at a precision that depends on the data type we are using. The round-off error for integer arithmetic is 0.5 or 1, depending on whether we round or truncate. For example, we may truncate results

$$ 5 \div 2 = 2 = \left\lfloor \frac{5}{2} \right\rfloor $$

We may also compute the next digit of our result and round it to the nearest integer

$$ 5 \div 2 = 2.5 \approx 3 $$

Double precision numbers round in the $52^{\text{nd}}$ binary digit after the point. However the size of this digit depends on the exponent! The MATLAB function eps returns the step size, that is, the value of the $52^{\text{nd}}$ binary digit of its argument. This step tells us the gap between the input number and the next larger number available to round to in computations.

In [1]:
format long
e_1 = eps(1)
e_2 = eps(2)
e_4 = eps(4)
e_8 = eps(8)
e_1 =

     2.220446049250313e-16


e_2 =

     4.440892098500626e-16


e_4 =

     8.881784197001252e-16


e_8 =

     1.776356839400250e-15


Each gap doubles because each increment of the exponent shifts the last digit into the next larger position. Let's consider a base 10 analogy, where the last significant digit of each successive number below is worth ten times as much as the one before it:

$$ 0.1 \cdot 10^{-15},\qquad 0.1 \cdot 10^{-14},\qquad 0.1 \cdot 10^{-13},\qquad 0.1 \cdot 10^{-12} $$
The step size or gap for each floating point number grows exponentially!


We will need to design our numerical algorithms so that they do not compute numbers so large that accuracy is lost. We also need to realize that any subnormal number is, for all practical purposes, equal to zero. That is why we never check whether a floating point number is equal to zero. Instead we check whether it is smaller than some error bound that we decide we can accept in our mathematical computations.
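For example, here is a sketch of the comparison we will prefer; the tolerance tol is our choice for the computation at hand, not a universal constant:

x = 0.1 + 0.1 + 0.1 - 0.3;   % mathematically 0, but x = 5.5511e-17 in MATLAB
tol = 1e-12;                 % an error bound we decide we can accept
x == 0                       % logical 0: the exact test fails
abs(x) < tol                 % logical 1: x is zero for all practical purposes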

3.2.8 Rounding

There are four types of rounding specified by the IEEE 754 standard. If a real number $x$ is in the range of this standard but cannot be stored exactly, then an approximate floating point number $\hat{x}$ is stored according to one of the following rules:

  • Round down - $\hat{x}$ is the largest exact (dyadic) floating point number less than or equal to $x$.

  • Round up - $\hat{x}$ is the smallest exact floating point number greater than or equal to $x$.

  • Round towards 0 - $\hat{x}$ is round down if $x$ is positive, and round up if $x$ is negative. This ensures that $\hat{x}$ is between $x$ and 0.

  • Round to nearest - $\hat{x}$ is either round down or round up, whichever is closer. In case of a tie, $\hat{x}$ is the exact floating point number whose least significant bit (rightmost) is 0.

The default is round to nearest.
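We can observe the tie breaking rule in MATLAB. The value $1 + \frac{\epsilon}{2}$ (with $\epsilon$ = eps) lies exactly halfway between the neighboring floating point numbers 1 and $1 + \epsilon$, so round to nearest chooses the neighbor whose last mantissa bit is 0:

a = 1 + eps/2            % a tie: rounds down to 1, whose mantissa ends in 0
a == 1                   % logical 1
b = 1 + (eps/2 + eps/4)  % past the halfway point: rounds up to 1 + eps
b == 1 + eps             % logical 1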

Example 1

To compute the IEEE representation for $\frac{1}{10}$ we can divide using long division

$$ \begin{align*} &\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 0.00011001\overline{1001} \\ \frac{1}{10} &= 1010\ |\overline{1.00000000000000000\dots} \\ &\ \ \ \ \ \ \ \ \ \ \ \ \ \ - 1010 \\ &\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \overline{\ \ 0110}{\color{magenta}0} \\ &\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ - 1010 \\ &\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \overline{\ \ \ \ 0010}{\color{magenta}0} \\ &\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ - 1010 \\ &\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \overline{\ \ 0110}{\color{magenta}0} \\ &\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ - 1010 \\ &\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \overline{\ \ \ \ 0010}{\color{magenta}0} {\color{magenta}0} \\ &\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \dots \\ \\ &= 0.0001100110011001100110011001100110011001100110011001{\color{magenta}1}\dots \\ &= 0.0001\overline{1001}_2 \\ &= 1.\overline{1001}_2\ \cdot\ 2^{-4} \\ \end{align*} $$
The double precision number $\frac{1}{10}$ is approximated by

$$ \begin{align*} \hat{x} &= 0\ 01111111011\ 1001100110011001100110011001100110011001100110011001 \end{align*} $$
when using round down or round toward 0. The error here is

$$ \begin{align*} x - \hat{x} &= \dots1001{\color{magenta}1}001\overline{1001}_2 - \dots10010000\overline{0000}_2 \\ &= \dots0000\overline{1001}_2 \\ &\gt {\color{magenta}1}\cdot 2^{-53} \end{align*} $$

However $\frac{1}{10}$ is represented by

$$ \begin{align*} \hat{x} &= 0\ 01111111011\ 1001100110011001100110011001100110011001100110011010 \end{align*} $$
using round up. The error is given by

$$ \begin{align*} \hat{x} - x &= \dots10100000\overline{0000}_2 - \dots1001{\color{magenta}1}001\overline{1001}_2 \\ &= 00000110\overline{0110}_2 = 1.10\overline{0110}\cdot 2^{-54} \\ &\lt {\color{magenta}1}\cdot 2^{-53} \end{align*} $$
Hence round to nearest produces the same $\hat{x}$ as round up.
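We can confirm this in MATLAB with format hex. The exponent field $3\text{fb}_{16}$ encodes $1019 - 1023 = -4$, and the final mantissa nibble $\text{a}_{16} = 1010b$ shows the rounded up last digit:

format hex
0.1        % 3fb999999999999a : the trailing a is the rounded up nibble
format long
0.1        % 0.100000000000000 : the 15 digit display hides the rounding error
format short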

Definition

The absolute rounding error is defined by

$$ \left|\hat{x}-x\right| $$
In double precision the rounding error is less than $2^{-52}\cdot 2^{E}$, where $-1022 \le E \le 1023$ is the exponent of $x$ in binary.

This error computation is relative to the size or magnitude of each floating point number, so we will usually use the relative rounding error.

Definition

The relative rounding error is defined by

$$ \dfrac{\left|\hat{x} - x\right|}{|x|} \le \dfrac{\epsilon}{2} $$

where $\epsilon = 2^{-52}\approx 2.22\cdot 10^{-16}$. This value $\epsilon$ is referred to as machine epsilon for double precision floating point. The error is at most $\frac{\epsilon}{2}$ because each real number lies in an interval between two dyadic rationals of length $\epsilon\cdot2^E$, so the relative error lies in an interval of length $\epsilon$. Since any real number in that interval gets rounded to the dyadic rational closest to it, the relative error is bounded by $\frac{\epsilon}{2}$.

We can write

$$ \hat{x} = x(1 + \delta) $$
where $|\delta|\le\frac{\epsilon}{2}$ for round to nearest.
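Printing extra decimal digits makes $\delta$ visible for $x = \frac{1}{10}$:

fprintf('%.20f\n', 0.1)   % prints 0.10000000000000000555
% The stored value exceeds 1/10 by about 5.55e-18, a relative error of about
% 5.55e-17, comfortably below eps/2 = 2^-53.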

3.2.9 Floating Point Arithmetic

The IEEE 754 specification requires that the result of an algebraic operation on two floating point numbers be the exact result, correctly rounded. We usually write new operators for addition $\oplus$, subtraction $\ominus$, multiplication $\otimes$, and division $\oslash$ as

$$ \begin{array}{rcccccc} a\oplus b & = & \text{round}(a+b) & = & a(1 + \delta_1) + b(1 + \delta_2) & = & (a+b)(1 + \delta_3) \\ a\ominus b & = & \text{round}(a-b) & = & a(1 + \delta_1) - b(1 + \delta_2) & = & (a-b)(1 + \delta_4) \\ a\otimes b & = & \text{round}(a\cdot b) & = & a(1 + \delta_1)\cdot b(1 + \delta_2) & = & (a\cdot b)(1 + \delta_5) \\ a\oslash b & = & \text{round}\left(\frac{a}{b}\right) & = & \dfrac{a(1 + \delta_1)}{b(1 + \delta_2)} & = & \left(\frac{a}{b}\right)(1 + \delta_6) \\ \end{array} $$
where $\left|\delta_i\right|\le\frac{\epsilon}{2}$, $i=1,\dots,6$.

In MATLAB, floating point approximations can cause surprising results. When people compute the following

$$ 1 - 3*(4/3 - 1) $$
we expect the value 0. However in MATLAB

In [2]:
format short
1 - 3*(4/3 - 1)
ans =

   2.2204e-16


The result is due to rounding error: the value $\frac{4}{3}$ is not a dyadic rational, so its repeating, nonterminating binary expansion must be rounded to the nearest dyadic rational number.

As we saw above, $\frac{1}{10}$ also has a repeating, nonterminating binary expansion, therefore if we add $\frac{1}{10}(1 + \delta_1)$ to itself ten (10) times we obtain $1(1 + \delta_2)$, not 1. This is in spite of the fact that the decimal display is 1 followed by 15 zeros. Notice $|\delta_2| \approx 1.110223024625157e-16$.

In [3]:
a = 0.0;
for i = 1:10
  a = a + 0.1;
end
a == 1
format long
a
abs(a-1)
ans =

  logical

   0


a =

   1.000000000000000


ans =

     1.110223024625157e-16


Example 2 - Catastrophic Cancellation

When we add 1 to $1\cdot 10^{-16}$ we obtain a binary number equal to 1, because 1e-16 is less than $\frac{\epsilon}{2}$ and none of its bits survive in the sum

$\ \ \ \ $1.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
$+\ \ $ .0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 1


$=\ $1.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

No sum, difference, product, or quotient may span more than 52 stored bits, or 13 hexadecimal digits.

In [4]:
format long
(1e-16 + 1) - 1
1e-16 + (1 - 1)
ans =

     0


ans =

     1.000000000000000e-16


Order of operations matters! We will pay particular attention to these details when designing numerical algorithms and implementing them in MATLAB.
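One common remedy, shown here as a sketch rather than a recipe from this section, is to rewrite an expression algebraically so that nearly equal numbers are never subtracted. For small $x$ the expression $\sqrt{1+x} - 1$ suffers catastrophic cancellation, while the algebraically identical form $x/(\sqrt{1+x} + 1)$ does not:

format long
x = 1e-12;
sqrt(1 + x) - 1           % about 5.0004e-13 : wrong in the fifth significant digit
x / (sqrt(1 + x) + 1)     % about 5.0000e-13 : accurate to nearly full precision

The true value is $5\cdot 10^{-13} - 1.25\cdot 10^{-25}$, so the second form keeps essentially all of its significant digits.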

Definition

The accuracy of an algorithm is the expected relative error incurred in performing the computations in the algorithm.

The measure of accuracy is sometimes called stability. In the next course in this sequence, Math 551, we will introduce forward stability and backward stability. In this course we will measure accuracy from the order and number of numerical computations and by comparing computed results against known results.

Your use of this self-initiated mediated course material is subject to our Creative Commons License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0.