# Binary Arithmetic
## and Floating Point Numbers

---

CS 130 // 2021-11-08

## Administrivia

- Midterm Exam 2 was returned today
  + Be sure to look at comments I gave on problems
- Quiz 4 today (administered on Gradescope, unlimited time)
- Assignment 4
  + Due Wednesday
  + Am letting you turn it in by midnight instead of before class

# Questions
## ...about anything?

# Binary Arithmetic

## Counting in Binary

- Recall that circuits perform all computation in **binary** using the contrast of low/high voltages to mean 0 and 1

---

$437_\text{ten}$
$110110101_\text{two}$
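
As a quick check that the two values above are the same number, the binary form can be expanded by place value, one power of two per `1` bit (a worked expansion added for illustration):

$$
110110101_\text{two} = 2^{8} + 2^{7} + 2^{5} + 2^{4} + 2^{2} + 2^{0} = 256 + 128 + 32 + 16 + 4 + 1 = 437_\text{ten}
$$
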
## Counting in Binary
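
| Decimal | Binary | Hex | Decimal | Binary | Hex |
|--------:|-------:|----:|--------:|-------:|----:|
| 0 | 0 | 0 | 9 | 1001 | 9 |
| 1 | 1 | 1 | 10 | 1010 | A |
| 2 | 10 | 2 | 11 | 1011 | B |
| 3 | 11 | 3 | 12 | 1100 | C |
| 4 | 100 | 4 | 13 | 1101 | D |
| 5 | 101 | 5 | 14 | 1110 | E |
| 6 | 110 | 6 | 15 | 1111 | F |
| 7 | 111 | 7 | 16 | 10000 | 10 |
| 8 | 1000 | 8 | 17 | 10001 | 11 |
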
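
The counting table can be reproduced with a few lines of Python. This is a minimal sketch for illustration; the helper names `to_binary` and `table_rows` are made up here, and the repeated-division method is one common way to do the conversion by hand, not necessarily the one used in lecture.

```python
def to_binary(n):
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # the remainder is the next least-significant bit
        n //= 2
    return "".join(reversed(digits))


def table_rows(limit):
    """Return (decimal, binary, hex) rows for 0..limit, like the table above."""
    return [(n, to_binary(n), format(n, "X")) for n in range(limit + 1)]


for dec, binary, hexa in table_rows(17):
    print(f"{dec:>7}  {binary:>6}  {hexa:>3}")

print(to_binary(437))         # 110110101
print(int("110110101", 2))    # 437 (the built-in goes the other way)
```

Python's built-in `format(n, "b")` gives the same binary strings; the loop is spelled out only to show where each bit comes from.
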
## Addition in Binary

- Recall that we can add two decimal numbers by appropriately "carrying" to the next significant digit:

$$
\begin{array}{r}
137\\\\
\underline{\text{+ }934}
\end{array}
$$

## Addition in Binary

- We can do **exactly** the same thing in binary:

$$
\begin{array}{r}
0010001001\\\\
\underline{\text{+ }1110100110}
\end{array}
$$

# Negative Numbers

## Negative Numbers in Binary

- Recall that most computers use **two's complement** to encode signed integer values
- A 32-bit, signed, two's complement number changes the meaning of the most significant bit so that it contributes $-2^{31}$ instead of $+2^{31}$

## Binary Addition Revisited

- Suppose we have two signed 32-bit numbers in two's complement encoding
- Do we need any special machinery to do addition?

$$
\begin{array}{r}
0111\\;1111\\;1111\\;1111\\;1111\\;1111\\;1111\\;1111\\\\
\underline{\text{+ }0000\\;0000\\;0000\\;0000\\;0000\\;0000\\;0000\\;0011}
\end{array}
$$

- Here an "overflow" occurred, which means when we added two positive numbers the result was negative, but that is to be expected

## Binary Addition Revisited

- What if one of the numbers is negative?

$$
\begin{array}{r}
0000\\;0000\\;0000\\;0000\\;0000\\;0000\\;0000\\;1010\\\\
\underline{\text{+ }1111\\;1111\\;1111\\;1111\\;1111\\;1111\\;1111\\;1101}
\end{array}
$$

- It still works great!

## Subtraction in Binary

- Using two's complement, it is possible to compute $X-Y$ by doing the following:

1. Compute $-Y$ by flipping all bits and adding one
2.
   Add $X$ with the computed $-Y$ value

# Overflow

## Overflow

- **Overflow** occurs when one of these things happens:
  + Adding two positive numbers results in a negative
  + Adding two negative numbers results in a positive
- Overflow can also occur during subtraction, but it reduces to one of the above two cases
- Some processors generate an **exception** (or **interrupt**) when this occurs to handle the error; others only set a status flag that software can check

# Fractional Numbers

## Fractional Numbers

- We've seen how integers are represented in binary
  + $17_\text{ten} = 10001_\text{two}$
- But how are fractional numbers represented?
  + $3.1415_\text{ten} = \ldots$

## Idea 1: Fixed-Point Representation

- We have 32 bits to represent a number
- One idea is to place a fixed decimal point somewhere within the 32 bits
- For example, we might put it here:
  + `00000000000000000000000000.000000`

## Idea 1: Fixed-Point Representation

- Pros:
  + Hardware is simple and cheap to implement
  + Integer arithmetic still works on this representation
  + Very clear what numbers can be represented
- Cons:
  + Cannot represent very large or small numbers
  + Limited precision

## Idea 2: Rational Representation

- Can represent a rational $\frac{p}{q}$ by storing two integers: the numerator $p$ and the denominator $q$
- Also called **arbitrary precision** representation
- Used in various libraries to ensure exact arithmetic
  + `fractions.Fraction` in Python stores an exact numerator and denominator
  + `java.math.BigDecimal` in Java and `decimal.Decimal` in Python are related arbitrary-precision *decimal* types; they store decimal digits rather than a fraction $\frac{p}{q}$, so a value like $\frac{1}{3}$ is still rounded

## Idea 2: Rational Representation

- Pros:
  + Ensures perfect precision, even for infinite decimal expansions
- Cons:
  + Difficult to implement in hardware; usually implemented at the software level instead
  + Numerators and denominators can grow quickly as operations accumulate
    * $\frac{1}{99} - \frac{1}{100} = \frac{1}{9900}$
    * $\frac{1}{9900} + \frac{1}{101} = \frac{10001}{999900}$

## Idea 3: Scientific Notation

- We can represent a number like $0.000312$ in **scientific notation**
  + $3.12\times 10^{-4}$
- We can reserve bits for the $3.12$ part and reserve other bits for the exponent $-4$
- Notice how the location of the
  decimal point is **floating** in this representation: its location varies based on the exponent

# Floating Point Numbers

## Floating Point Numbers

- Almost all general-purpose computers use the **IEEE 754 Standard for Floating Point Arithmetic**
- Uses binary scientific notation

$$
\begin{align}
1.010_\text{two} \times 2^{-3} &= 0.001010_\text{two}\\\\
&= 1/8 + 1/32
\end{align}
$$

- In general: $(-1)^{X} \times 1.Y \times 2^{Z}$ where $X, Y, Z$ are the **sign**, **mantissa**, and **exponent**, respectively

## Floating Point Numbers

- Suppose I have a 32-bit floating point number:
  + `00111110001000000000000000000000`
- What is the sign?
  + Leftmost bit: `0`
- What is the mantissa?
  + Last 23 bits: `01000000000000000000000`
  + Actually means `1.01`
- What is the exponent?
  + Remaining 8 bits: `01111100`

## Exponent Bias

- To simplify comparing two floating point numbers, it is convenient to make `11111111` the largest exponent and `00000000` the smallest
- Therefore a **bias** of 127 is subtracted from the stored exponent value
  + `00000000` means $2^{-127}$
  + `01111111` means $2^{0}$
  + `11111111` means $2^{128}$
- Therefore `01111100` ($124_\text{ten}$) means $2^{124-127} = 2^{-3}$
- In IEEE 754 the all-zeros and all-ones exponent patterns are reserved for special values such as zero, $\pm\infty$, and `NaN`

# Decimal to Float Conversion

## Step 1: Convert to Binary

- Example: $9.25_\text{ten}$
- Break number into integer and fraction part
  + $9_\text{ten} = 1001_\text{two}$
  + $0.25_\text{ten} = 0.01_\text{two}$
- Therefore $9.25_\text{ten} = 1001.01_\text{two}$

## Step 2: Normalize

- Shift the binary point so that a single $1$ is to its left
- $1001.01 = 1.00101 \times 2^{3}$
- The mantissa is then $00101$, but since it needs to be 23 bits it actually is:
  + $00101000000000000000000$

## Step 3: Determine Exponent

- Remember that the bias is $127$
- Our desired exponent is $3$
- Exponent + bias is: $130$
- $130_\text{ten} = 10000010_\text{two}$

## Step 4: Combine Pieces

- Original number: $9.25$
- Sign bit: $0$
- Exponent: $10000010$
- Mantissa: $00101000000000000000000$

---

$0\\;10000010\\;00101000000000000000000$

## Exercises

- Convert the following
  number into IEEE 754 floating point representation
  + $-3.75$
  + $11000000011100000000000000000000$
- Convert the following 32-bit number into its floating point decimal value
  + $11000001111000000000000000000000$
  + $-28.0$

## Special Numbers

- Zero is represented as 32 zeros: `000...0`
- $+\infty$ is represented as `0111111110...0`
- $-\infty$ is represented as `1111111110...0`
- `NaN` is represented with an exponent of `11111111` and a non-zero mantissa
  + Stands for "Not a Number"
  + Used to handle cases like $0/0$

## Double Precision

- Frequently, we need more than 32 bits of precision when doing floating point arithmetic
- The IEEE 754 standard also has a **double precision** representation which uses 64 bits instead of 32
  + 1 bit for sign
  + 52 bits for mantissa
  + 11 bits for exponent
  + Exponent bias of 1023
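
The conversion steps and exercises above can be checked in Python with the standard `struct` module. This is a minimal sketch for self-checking; the helper names `float32_fields`, `float32_from_bits`, and `float64_fields` are made up for this example and are not part of any library.

```python
import struct


def float32_fields(x):
    """Split x, stored as a 32-bit float, into (sign, exponent, mantissa) bit strings."""
    # Pack as big-endian single precision, then reinterpret the 4 bytes as an unsigned int.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]


def float32_from_bits(bits):
    """Interpret a 32-character bit string as a single-precision float."""
    (value,) = struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))
    return value


def float64_fields(x):
    """Same split for double precision: 1 sign, 11 exponent, 52 mantissa bits."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = format(raw, "064b")
    return bits[0], bits[1:12], bits[12:]


sign, exponent, mantissa = float32_fields(9.25)
print(sign, exponent, mantissa)          # 0 10000010 00101000000000000000000
print(int(exponent, 2) - 127)            # 3  (stored exponent minus the bias)
print("".join(float32_fields(-3.75)))    # 11000000011100000000000000000000
print(float32_from_bits("11000001111000000000000000000000"))  # -28.0
print(int(float64_fields(9.25)[1], 2) - 1023)                 # 3
```

The same value $9.25$ has true exponent $3$ in both formats; only the stored exponent differs, because single precision adds a bias of 127 and double precision adds 1023.
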