
# Binary Arithmetic
## and Floating Point Numbers

---

CS 130 // 2021-11-08

## Administrivia
- Midterm Exam 2 was returned today
    + Be sure to look at comments I gave on problems
- Quiz 4 today (administered on Gradescope, unlimited time)
- Assignment 4
    + Due Wednesday
    + Am letting you turn it in by midnight instead of before class

# Questions
## ...about anything?

# Binary Arithmetic

## Counting in Binary
- Recall that circuits perform all computation in **binary** using the contrast of low/high voltages to mean 0 and 1

---

$437_\text{ten}$

</div>
<div>

$110110101_\text{two}$
</div>
</div>

## Counting in Binary

<table>
  <tbody>
    <tr> <th>Decimal</th> <th>Binary</th><th>Hex</th></tr>
    <tr> <td>0</td> <td>0</td> <td>0</td></tr>
    <tr> <td>1</td> <td>1</td> <td>1</td></tr>
    <tr> <td>2</td> <td>10</td><td>2</td></tr>
    <tr> <td>3</td> <td>11</td> <td>3</td></tr>
    <tr> <td>4</td> <td>100</td><td>4</td></tr>
    <tr> <td>5</td> <td>101</td><td>5</td></tr>
    <tr> <td>6</td> <td>110</td><td>6</td></tr>
    <tr> <td>7</td> <td>111</td><td>7</td></tr>
    <tr> <td>8</td> <td>1000</td> <td>8</td></tr>
  </tbody>
</table>

</div>
<div>

<table>
  <tbody>
    <tr> <th>Decimal</th> <th>Binary</th><th>Hex</th></tr>
    <tr> <td>9</td> <td>1001</td> <td>9</td></tr>
    <tr> <td>10</td> <td>1010</td><td>A</td></tr>
    <tr> <td>11</td> <td>1011</td> <td>B</td></tr>
    <tr> <td>12</td> <td>1100</td><td>C</td></tr>
    <tr> <td>13</td> <td>1101</td><td>D</td></tr>
    <tr> <td>14</td> <td>1110</td><td>E</td></tr>
    <tr> <td>15</td> <td>1111</td><td>F</td></tr>
    <tr> <td>16</td> <td>10000</td><td>10</td></tr>
    <tr> <td>17</td> <td>10001</td><td>11</td></tr>
  </tbody>
</table>

</div>
</div>
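A small sketch for checking entries in the tables above. The helper name `to_bases` is made up for illustration; `format` and `int` are Python built-ins.

```python
def to_bases(n):
    """Return (decimal, binary, hex) strings for a non-negative integer."""
    return str(n), format(n, "b"), format(n, "X")

# Print a few rows in the same column order as the tables.
for n in (0, 9, 15, 16, 17, 437):
    print(*to_bases(n))

# int() with an explicit base converts a binary string back to an integer.
print(int("110110101", 2))  # prints 437
```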

## Addition in Binary
- Recall that we can add two decimal numbers by appropriately "carrying" to the next significant digit:

$$
\begin{array}{r}
137\\\\
\underline{\text{+ }934}
\end{array}
$$

## Addition in Binary
- We can do **exactly** the same thing in binary:

$$
\begin{array}{r}
0010001001\\\\
\underline{\text{+ }1110100110}
\end{array}
$$
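The column-by-column carrying can be sketched in a few lines. This is an illustration on bit strings, not how hardware is built; the function name `add_binary` is hypothetical.

```python
def add_binary(a, b):
    """Add two bit strings column by column, carrying into the next column."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry, digits = 0, []
    # Walk from the least significant bit (rightmost) to the most significant.
    for x, y in zip(reversed(a), reversed(b)):
        total = int(x) + int(y) + carry
        digits.append(str(total % 2))  # bit written in this column
        carry = total // 2             # bit carried to the next column
    if carry:
        digits.append("1")
    return "".join(reversed(digits))

# The two operands are 137 and 934 from the decimal example.
print(add_binary("0010001001", "1110100110"))  # prints 10000101111
```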

# Negative Numbers

## Negative Numbers in Binary
- Recall that most computers use **two's complement** to encode signed integer values
- A 32-bit, signed, two's complement number changes the meaning of the most significant bit so that it contributes $-2^{31}$ instead of $+2^{31}$
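A minimal sketch of that rule, assuming the 32-bit pattern is held in an ordinary Python integer. The name `to_signed32` is hypothetical.

```python
def to_signed32(bits):
    """Interpret a 32-bit pattern as a two's complement value."""
    bits &= 0xFFFFFFFF              # keep only the low 32 bits
    msb = bits >> 31                # most significant bit, 0 or 1
    # The top bit contributes -2^31; the other 31 bits keep their usual weights.
    return -msb * 2**31 + (bits & 0x7FFFFFFF)

print(to_signed32(0x7FFFFFFF))  # largest positive value
print(to_signed32(0x80000000))  # most negative value
print(to_signed32(0xFFFFFFFD))  # prints -3
```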

## Binary Addition Revisited
- Suppose we have two signed 32-bit numbers in two's complement encoding
- Do we need any special machinery to do addition?
    $$
    \begin{array}{r}
    0111\\;1111\\;1111\\;1111\\;1111\\;1111\\;1111\\;1111\\\\
    \underline{\text{+ }0000\\;0000\\;0000\\;0000\\;0000\\;0000\\;0000\\;0011}
    \end{array}
    $$
- Here an "overflow" occurred: adding two positive numbers produced a negative result, which is to be expected because the true sum does not fit in a signed 32-bit number

## Binary Addition Revisited
- What if one of the numbers is negative?
    $$
    \begin{array}{r}
    0000\\;0000\\;0000\\;0000\\;0000\\;0000\\;0000\\;1010\\\\
    \underline{\text{+ }1111\\;1111\\;1111\\;1111\\;1111\\;1111\\;1111\\;1101}
    \end{array}
    $$
- It still works great!
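Both additions can be replayed in software. This sketch assumes 32-bit patterns stored in Python integers and uses two hypothetical helpers, `add32` and `to_signed32`.

```python
MASK = 0xFFFFFFFF  # 32 one-bits

def to_signed32(bits):
    """Interpret a 32-bit pattern as a two's complement value."""
    return (bits & 0x7FFFFFFF) - (bits & 0x80000000)

def add32(a, b):
    """Plain unsigned addition, keeping only the low 32 bits."""
    return (a + b) & MASK

# Positive + positive that wraps around to a negative result.
print(to_signed32(add32(0x7FFFFFFF, 3)))   # prints -2147483646

# 10 + (-3): the same adder gives the right answer for a negative operand.
print(to_signed32(add32(10, 0xFFFFFFFD)))  # prints 7
```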

## Subtraction in Binary
- Using two's complement, it is possible to compute $X-Y$ by doing the following:
    1. Compute $-Y$ by flipping all bits and adding one
    2. Add $X$ to the computed $-Y$ value
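The two steps can be sketched as follows, assuming 32-bit patterns in Python integers. The names `negate32`, `sub32`, and `to_signed32` are hypothetical.

```python
MASK = 0xFFFFFFFF  # 32 one-bits

def to_signed32(bits):
    """Interpret a 32-bit pattern as a two's complement value."""
    return (bits & 0x7FFFFFFF) - (bits & 0x80000000)

def negate32(y):
    """Step 1: flip all 32 bits, then add one."""
    return ((y ^ MASK) + 1) & MASK

def sub32(x, y):
    """Step 2: add X to the computed -Y, keeping the low 32 bits."""
    return (x + negate32(y)) & MASK

print(format(negate32(3), "032b"))  # bit pattern of -3
print(to_signed32(sub32(10, 3)))    # prints 7
print(to_signed32(sub32(3, 10)))    # prints -7
```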

# Overflow

## Overflow
- **Overflow** occurs when one of these things happens:
    + Adding two positive numbers results in a negative
    + Adding two negative numbers results in a positive
- Overflow can also occur during subtraction, but it reduces to one of the above two cases
- Some processors (for example MIPS, with its `add` instruction) generate an **exception** (or **interrupt**) when this occurs to handle the error; others, such as x86, only set an overflow flag
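The two overflow cases amount to one sign check: the operands share a sign and the result has the other one. A sketch, with the hypothetical name `add32_checked`:

```python
MASK = 0xFFFFFFFF  # 32 one-bits

def add32_checked(a, b):
    """Add two 32-bit patterns; return (result pattern, overflow flag)."""
    a &= MASK
    b &= MASK
    result = (a + b) & MASK
    sign_a, sign_b, sign_r = a >> 31, b >> 31, result >> 31
    # Overflow: both operands have the same sign, but the result's sign differs.
    overflow = sign_a == sign_b and sign_r != sign_a
    return result, overflow

print(add32_checked(0x7FFFFFFF, 3))           # positive + positive
print(add32_checked(0x80000000, 0xFFFFFFFF))  # negative + negative
print(add32_checked(10, 0xFFFFFFFD))          # mixed signs never overflow
```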

# Fractional Numbers

## Fractional Numbers
- We've seen how integers are represented in binary
    + $17_\text{ten} = 10001_\text{two}$
- But how are fractional numbers represented?
    + $3.1415_\text{ten} = \ldots$

## Idea 1: Fixed-Point Representation
- We have 32 bits to represent a number
- One idea is to place a fixed binary point somewhere within the 32 bits
- For example, we might put it here:
    + `00000000000000000000000000.000000`

## Idea 1: Fixed-Point Representation
- Pros:
    + Hardware is simple and cheap to implement
    + Integer arithmetic still works on this representation
    + Very clear what numbers can be represented
- Cons:
    + Cannot represent very large or small numbers
    + Limited precision
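A rough sketch of fixed-point with 6 fractional bits, matching the digits after the point in the example pattern. It handles non-negative values only, and the names `to_fixed` and `from_fixed` are made up for illustration.

```python
FRACTION_BITS = 6
SCALE = 1 << FRACTION_BITS  # 64: one unit in the last place is 1/64

def to_fixed(x):
    """Store x as an integer count of 1/64ths (rounded to the nearest)."""
    return round(x * SCALE)

def from_fixed(raw):
    """Convert a stored integer back to a real value."""
    return raw / SCALE

# Pro: ordinary integer addition works on the stored values.
total = to_fixed(9.25) + to_fixed(0.5)
print(from_fixed(total))             # prints 9.75

# Con: precision is limited to multiples of 1/64.
print(from_fixed(to_fixed(3.1415)))  # prints 3.140625
```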

## Idea 2: Rational Representation
- Can represent a rational $\frac{p}{q}$ by storing two integers: the numerator $p$ and the denominator $q$
- Closely related to **arbitrary precision** representation
- Libraries use these representations to avoid the rounding error of hardware arithmetic
    + `fractions.Fraction` in Python (stores $p$ and $q$)
    + `java.math.BigDecimal` in Java (arbitrary-precision decimal)
    + `decimal.Decimal` in Python (decimal with adjustable precision)

## Idea 2: Rational Representation
- Pros:
    + Ensures perfect precision, even for infinite repeating decimal expansions
- Cons:
    + Difficult to implement in hardware, so it is usually implemented at the software level instead
    + Numerators and denominators can grow quickly, for example:
    + $\frac{1}{99} - \frac{1}{100} = \frac{1}{9900}$
    + $\frac{1}{9900} + \frac{1}{101} = \frac{10001}{999900}$
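The two sums above can be reproduced with Python's standard `fractions.Fraction`, which stores a numerator and a denominator. This is one possible illustration, not the only library that works this way.

```python
from fractions import Fraction

step1 = Fraction(1, 99) - Fraction(1, 100)
step2 = step1 + Fraction(1, 101)
print(step1)  # prints 1/9900
print(step2)  # prints 10001/999900

# Exact arithmetic: no rounding error, unlike hardware floats.
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # prints True
print(0.1 + 0.2 == 0.3)                                      # prints False
```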

## Idea 3: Scientific Notation
- We can represent a number like $0.00312$ in **scientific notation**
    + $3.12\times 10^{-3}$
- We can reserve bits for the $3.12$ part and reserve other bits for the exponent $-3$
- Notice how the location of the decimal point is **floating** in this representation: its location varies based on the exponent
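A quick way to see the two parts is to split Python's `e` format, which prints decimal scientific notation. The helper name `scientific_parts` is hypothetical, and it keeps only three significant digits.

```python
def scientific_parts(x):
    """Split x into (significand, exponent) of decimal scientific notation."""
    significand, exponent = f"{x:.2e}".split("e")
    return float(significand), int(exponent)

print(scientific_parts(0.00312))  # prints (3.12, -3)
print(scientific_parts(312.0))    # prints (3.12, 2)
```

The significand stays the same while the exponent moves the decimal point.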

# Floating Point Numbers

## Floating Point Numbers
- Almost all general-purpose computers use the **IEEE 754 Standard for Floating Point Arithmetic**
- Uses binary scientific notation

$$
\begin{align}
1.010_\text{two} \times 2^{-3}
    &= 0.001010_\text{two}\\\\
    &= 1/8 + 1/32
\end{align}
$$

- In general: $(-1)^{X} \times 1.Y \times 2^{Z}$ where $X, Y, Z$ are the **sign**, **mantissa**, and **exponent**, respectively

## Floating Point Numbers
- Suppose I have a 32-bit floating point number:
    + `00111110001000000000000000000000`
- What is the sign?
    + Leftmost bit: `0`
- What is the mantissa?
    + Last 23 bits: `01000000000000000000000`
    + Actually means `1.01`
- What is the exponent?
    + Remaining 8 bits: `01111100`

## Exponent Bias
- To simplify comparing two floating point numbers, it is convenient to make `11111111` the largest exponent and `00000000` the smallest
- Therefore a **bias** of 127 is subtracted from the stored exponent value
    + `00000000` means $2^{-127}$
    + `01111111` means $2^{0}$
    + `11111111` means $2^{128}$
    + In IEEE 754 the all-zeros and all-ones exponents are actually reserved for special values
- Therefore `01111100` means $2^{-3}$
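Putting the sign, mantissa, and biased exponent together, the example pattern can be decoded by hand and checked against Python's `struct` module. This sketch covers normal numbers only, and the name `decode_float32` is hypothetical.

```python
import struct

def decode_float32(bits):
    """Decode a 32-bit pattern as (sign, unbiased exponent, value)."""
    sign = bits >> 31                      # leftmost bit
    stored_exponent = (bits >> 23) & 0xFF  # next 8 bits
    fraction = bits & 0x7FFFFF             # last 23 bits
    exponent = stored_exponent - 127       # remove the bias
    # The implicit leading 1 gives 1.Y; the fraction is Y / 2^23.
    value = (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** exponent
    return sign, exponent, value

bits = 0b00111110001000000000000000000000
print(decode_float32(bits))  # prints (0, -3, 0.15625)

# Cross-check: let the standard library reinterpret the same 4 bytes.
print(struct.unpack(">f", struct.pack(">I", bits))[0])  # prints 0.15625
```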

# Decimal to Float Conversion

## Step 1: Convert to Binary
- Example: $9.25_\text{ten}$
- Break number into integer and fraction part
    + $9_\text{ten} = 1001_\text{two}$
    + $0.25_\text{ten} = 0.01_\text{two}$
- Therefore $9.25_\text{ten} = 1001.01_\text{two}$
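One common way to convert the fraction part is repeated doubling: each doubling pushes the next binary digit to the left of the point. A sketch for non-negative inputs, with hypothetical helper names:

```python
def fraction_to_binary(frac, max_bits=23):
    """Convert 0 <= frac < 1 to binary digits by repeated doubling."""
    digits = []
    while frac and len(digits) < max_bits:
        frac *= 2
        bit = int(frac)      # the digit that moved left of the point
        digits.append(str(bit))
        frac -= bit
    return "".join(digits)

def to_binary(x, max_bits=23):
    """Convert a non-negative number to an 'integer.fraction' bit string."""
    whole = int(x)
    return f"{whole:b}.{fraction_to_binary(x - whole, max_bits)}"

print(to_binary(9.25))  # prints 1001.01
```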

## Step 2: Normalize
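- Shift the binary point so that the number is normalized (exactly one `1` to the left of the point)
- $1001.01_\text{two} = 1.00101_\text{two} \times 2^{3}$
- The mantissa is then $00101$, but since it needs to be 23 bits it is padded with zeros on the right:
    + $00101000000000000000000$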
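Normalizing can be sketched as string manipulation. This assumes the input is at least 1 (so the integer part starts with a `1`) and truncates rather than rounds if more than 23 bits remain. The name `normalize` is hypothetical.

```python
def normalize(binary):
    """Turn 'iii.fff' into (23-bit mantissa string, exponent)."""
    whole, frac = binary.split(".")
    exponent = len(whole) - 1  # how far the point moves to the left
    # Drop the leading 1 (it is implicit), then pad with zeros to 23 bits.
    mantissa = (whole[1:] + frac).ljust(23, "0")[:23]
    return mantissa, exponent

mantissa, exponent = normalize("1001.01")
print(mantissa, exponent)
```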

## Step 3: Determine Exponent
- Remember that the bias is $127$
- Our desired exponent is $3$
- Exponent + bias is: $130$
- $130_\text{ten} = 10000010_\text{two}$

## Step 4: Combine Pieces
- Original number: $9.25$
- Sign bit: $0$
- Exponent: $10000010$
- Mantissa: $00101000000000000000000$

---

$0\\;10000010\\;00101000000000000000000$
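To double-check the assembled pattern, the pieces can be concatenated and handed to Python's `struct` module, which reinterprets the 4 bytes as a float. The helper names are hypothetical.

```python
import struct

def assemble_float32(sign_bit, exponent, mantissa_bits):
    """Concatenate sign, biased 8-bit exponent, and 23-bit mantissa."""
    stored_exponent = format(exponent + 127, "08b")  # add the bias
    return sign_bit + stored_exponent + mantissa_bits.ljust(23, "0")

def bits_to_float(bits):
    """Reinterpret a 32-character bit string as a single-precision float."""
    return struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]

bits = assemble_float32("0", 3, "00101")
print(bits)
print(bits_to_float(bits))  # prints 9.25
```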

## Exercises
- Convert the following number into IEEE 754 floating point representation
    + $-3.75$
    + $11000000011100000000000000000000$
- Convert the following 32-bit number into its floating point decimal value
    + $11000001111000000000000000000000$
    + $-28.0$
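Answers to exercises like these can be checked with Python's `struct` module. The helper names `float_to_bits` and `bits_to_float` are made up for this sketch.

```python
import struct

def float_to_bits(x):
    """Return the 32-bit IEEE 754 single-precision pattern of x as a string."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    return format(raw, "032b")

def bits_to_float(bits):
    """Return the float encoded by a 32-character bit string."""
    return struct.unpack(">f", struct.pack(">I", int(bits, 2)))[0]

print(float_to_bits(-3.75))
print(bits_to_float("11000001111000000000000000000000"))  # prints -28.0
```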

## Special Numbers
- Zero is represented as 32 zeros: `000...0`
- $+\infty$ is represented as `0111111110...0`
- $-\infty$ is represented as `1111111110...0`
- `NaN` is represented with an exponent of `11111111` and a non-zero mantissa
    + Stands for "Not a Number"
    + Used to handle cases like $0/0$
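These patterns can be inspected from Python. The hex constants below are the patterns from this slide, plus one example NaN pattern chosen for illustration. Python itself raises `ZeroDivisionError` for `0.0 / 0.0`, so the sketch uses $\infty - \infty$ as the undefined operation instead.

```python
import math
import struct

def bits_to_float(raw):
    """Reinterpret a 32-bit integer pattern as a single-precision float."""
    return struct.unpack(">f", struct.pack(">I", raw))[0]

ZERO = 0x00000000     # 32 zeros
POS_INF = 0x7F800000  # sign 0, exponent 11111111, mantissa all zeros
NEG_INF = 0xFF800000  # sign 1, exponent 11111111, mantissa all zeros
A_NAN = 0x7FC00000    # exponent 11111111, non-zero mantissa (one of many)

print(bits_to_float(ZERO), bits_to_float(POS_INF), bits_to_float(NEG_INF))
print(math.isnan(bits_to_float(A_NAN)))    # prints True
print(math.isnan(math.inf - math.inf))     # prints True
```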

## Double Precision
- Frequently, we need more precision than a 32-bit floating point number provides
- The IEEE 754 standard also has a **double precision** representation which uses 64 bits instead of 32
    + 1 bit for sign
    + 52 bits for mantissa
    + 11 bits for exponent
    + Exponent bias of 1023
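The same field splitting works for double precision with the wider fields. A sketch for normal numbers, using Python's `struct` module; the name `double_fields` is hypothetical.

```python
import struct

def double_fields(x):
    """Return (sign, unbiased exponent, 52-bit mantissa string) of a double."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = raw >> 63                        # 1 bit
    stored_exponent = (raw >> 52) & 0x7FF   # 11 bits
    fraction = raw & ((1 << 52) - 1)        # 52 bits
    return sign, stored_exponent - 1023, format(fraction, "052b")

sign, exponent, mantissa = double_fields(9.25)
print(sign, exponent, mantissa)
```

For $9.25$ the sign and exponent match the single precision example; only the mantissa is padded out to 52 bits.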