- Designing a digital filter means determining its coefficients, as we first saw when designing IIR filters. The values of these coefficients are stored in binary registers, which are simply digital memories in the DSP system.
- In theory, we describe filter coefficients with infinite precision in the interest of accuracy. In practice, a register cannot hold an arbitrarily long chain of bits, so we need a way to fit the coefficients into a register of fixed word size.
- So we give up infinite precision and use the fixed-point representation of binary numbers. In the fixed-point representation, the number of bits before and after the binary point is fixed, and the MSB (Most Significant Bit) represents the sign of the number.
- Within fixed-point representation, we have three different ways of representing numbers. This is just for your information. The three different formats are:
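- Sign magnitude: The leading binary digit represents the sign and the remaining bits hold the magnitude. (If MSB = 1, the number is negative. If MSB = 0, the number is positive.)
- 1s complement: A negative number is formed by complementing all the bits of the corresponding positive number.
- 2s complement: A negative number is formed by adding ‘1’ to the LSB of its 1s complement.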
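
To make the three formats concrete, here is a minimal Python sketch that encodes a signed integer into a fixed number of bits. The function names are made up for this illustration, and the sketch assumes the value fits in the chosen word size (no range checking is done).

```python
def sign_magnitude(value: int, bits: int) -> str:
    """Sign bit followed by the magnitude in the remaining (bits - 1) bits."""
    sign = "1" if value < 0 else "0"
    return sign + format(abs(value), f"0{bits - 1}b")


def ones_complement(value: int, bits: int) -> str:
    """Positive values are unchanged; negative values flip every bit of |value|."""
    if value >= 0:
        return format(value, f"0{bits}b")
    mask = (1 << bits) - 1          # e.g. 0b1111 for a 4-bit word
    return format(~abs(value) & mask, f"0{bits}b")


def twos_complement(value: int, bits: int) -> str:
    """Negative values are the 1s complement plus one, i.e. value modulo 2**bits."""
    mask = (1 << bits) - 1
    return format(value & mask, f"0{bits}b")


# Assumption: the value fits in the word size; overflow is not checked.
for encode in (sign_magnitude, ones_complement, twos_complement):
    print(encode.__name__, encode(5, 4), encode(-5, 4))
```

All three formats agree on positive numbers (+5 is `0101` in each); they differ only in how −5 is written.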

- Now that we have a fixed number of bits, we need to make sure that they match the word size of the register we use to store the coefficient values. If a coefficient has more bits than the register can hold, we quantize it.
- Thus, quantization is the process of reducing the number of bits so that the filter coefficients fit in the registers of the Digital Signal Processing system.
- In this post, we will study two types of Quantization methods:
- Truncation
- Rounding

**What is Truncation?**

- Truncation is a type of quantization where extra bits get ‘truncated.’
- Basically, in the truncation process, all bits less significant than the desired LSB (Least Significant Bit) are discarded.
- For example, suppose we wish to truncate the following 8-bit number to 4 bits.
- X = 0.01101011 truncates to X = 0.0110
- Converting both to decimal, we can see a noticeable change in value: 0.01101011 equals 0.418 and 0.0110 equals 0.375, an error of about 0.043.
- Thus, truncation is the cruder method of quantization: the discarded bits are simply lost, so the error can be almost as large as one LSB of the shortened number.
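
The worked example above can be reproduced with a short Python sketch. It treats the digits after the binary point as a string of bits; the helper names `fraction_to_decimal` and `truncate` are invented for this illustration and only cover positive fractions.

```python
def fraction_to_decimal(bits: str) -> float:
    """Value of the bits after the binary point, e.g. '0110' -> 0.375."""
    return sum(int(bit) * 2.0 ** -(i + 1) for i, bit in enumerate(bits))


def truncate(bits: str, keep: int) -> str:
    """Discard every bit less significant than the desired LSB."""
    return bits[:keep]


original = "01101011"               # X = 0.01101011
truncated = truncate(original, 4)   # X = 0.0110

x = fraction_to_decimal(original)   # 0.41796875, about 0.418
q = fraction_to_decimal(truncated)  # 0.375
error = q - x                       # -0.04296875

print(truncated, x, q, error)
```

For a positive number the truncated value can never be larger than the original, so the error computed this way is zero or negative.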

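- The error from quantization using truncation is $e = Q(x) - x$, where $Q(x)$ is the truncated value. If $b$ bits are kept after the binary point, the error lies in the range:
- $-2^{-b} < e \le 0$ (For a positive number/2s complement)
- $0 \le e < 2^{-b}$ (For a negative number/1s complement)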

**What is Rounding?**

- Rounding is a quantization method where we ‘round off’ a number to the nearest value that fits in the desired number of bits.
- Basically, rounding is the process of reducing the size of a binary number to some desirable finite size. This is done in such a way that the rounded off number is as close to the original unquantized number as possible.
- Interestingly, the rounding process is a *combination of truncation and addition.*
- In rounding a number to, say, b bits, the number is first truncated to the desired number of bits. Then, depending on the bit that sat immediately to the right of the new LSB before truncation, an addition to the LSB is performed.
- If that particular bit (previously next to the LSB) was 0, then 0 is added to the LSB. If that bit was 1, then a 1 is added to the LSB.
- Consider the same example as above: suppose we wish to round the following 8-bit number to 4 bits.
- X = 0.01101011 truncates to X = 0.0110
- Since the bit next to the current LSB was 1, we add 1 to the current LSB.
- Thus X is now 0.0111
- Converting both the unquantized and rounded-off numbers to decimal, we notice that the magnitude of the error is smaller than with truncation: 0.01101011 equals 0.418 and 0.0111 equals 0.438, an error of about 0.020 instead of 0.043.
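
The truncate-then-add procedure described above can be sketched in Python as follows. The helper names are invented for this illustration, the sketch only covers positive fractions, and it does not handle the case where adding 1 carries out of the kept bits (for example rounding 0.11111 to 4 bits).

```python
def fraction_to_decimal(bits: str) -> float:
    """Value of the bits after the binary point, e.g. '0111' -> 0.4375."""
    return sum(int(bit) * 2.0 ** -(i + 1) for i, bit in enumerate(bits))


def round_fraction(bits: str, keep: int) -> str:
    """Truncate to `keep` bits, then add the first discarded bit to the LSB."""
    kept = int(bits[:keep], 2)                        # step 1: truncation
    next_bit = int(bits[keep]) if len(bits) > keep else 0
    kept += next_bit                                  # step 2: addition to the LSB
    # Assumption: no carry out of the kept bits; overflow is not handled here.
    return format(kept, f"0{keep}b")


original = "01101011"                    # X = 0.01101011
rounded = round_fraction(original, 4)    # X = 0.0111

x = fraction_to_decimal(original)        # 0.41796875, about 0.418
q = fraction_to_decimal(rounded)         # 0.4375, about 0.438
error = q - x                            # 0.01953125

print(rounded, x, q, error)
```

Here the error is positive because the value was rounded up; when the first discarded bit is 0, the result is the same as plain truncation.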

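- Thus rounding is preferable to truncation.
- The magnitude of the error in rounding, with $e = Q(x) - x$ and $b$ bits kept after the binary point, is bounded by half of one LSB: $-\frac{2^{-b}}{2} \le e \le \frac{2^{-b}}{2}$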