
# Quantization in DSP – Truncation and Rounding

• Designing a digital filter means determining its coefficients, a fact we first saw when designing IIR filters. We store the values of these coefficients in binary registers, which are simply digital memories in the DSP system.
• In theory, we describe filter coefficients with infinite-precision arithmetic in the interest of accuracy. In practice, a register cannot hold an arbitrarily long string of bits. Thus we need a way to fit these filter coefficients into a register with a fixed word size.
• So we give up infinite-precision arithmetic and use the fixed-point representation of binary numbers. In the fixed-point representation, the number of digits before and after the binary point is fixed. In this representation, the MSB represents the sign of the number.
• Within fixed-point representation, we have three different ways of representing numbers. This is just for your information. The three different formats are:
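• Sign magnitude: The leading binary digit represents the sign and the remaining bits hold the magnitude. (If MSB = 1, the number is negative. If MSB = 0, the number is positive.)
• 1s complement: A negative number is formed by complementing all the bits of the corresponding positive number.
• 2s complement: A negative number is formed by adding 1 to the LSB of its 1s complement.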
• Now that we have a fixed number of bits, we need to make sure that these bits match the word size of the register we use to store the coefficient values. If a coefficient has more bits than the register can hold, we quantize it.
• Thus, quantization is the process of reducing the number of bits so that the filter coefficients fit in the registers of the Digital Signal Processing system.
• In this post, we will study two types of Quantization methods:
• Truncation
• Rounding
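To make the three fixed-point formats listed above concrete, here is a minimal Python sketch that encodes a fraction as one sign bit followed by a fixed number of fractional bits. The function name `to_fixed_point`, its parameters, and the format labels are illustrative assumptions, not part of any library, and the sketch assumes the magnitude is already exactly representable (no quantization happens here).

```python
def to_fixed_point(x, frac_bits, fmt="twos"):
    """Encode x (|x| < 1) as a sign bit plus frac_bits fractional bits.

    Hypothetical helper for illustration. The magnitude must be exactly
    representable in frac_bits bits; this sketch does not quantize.
    """
    scaled = abs(x) * (1 << frac_bits)
    mag = int(scaled)
    if mag != scaled or mag >= (1 << frac_bits):
        raise ValueError("magnitude not representable in frac_bits bits")

    width = 1 + frac_bits              # sign bit + fractional bits
    mask = (1 << width) - 1

    if x >= 0:
        code = mag                     # positive numbers are identical in all three formats
    elif fmt == "sign_magnitude":
        code = (1 << frac_bits) | mag  # set the sign bit, keep the magnitude
    elif fmt == "ones":
        code = ~mag & mask             # complement every bit
    elif fmt == "twos":
        code = (-mag) & mask           # 1s complement plus 1 at the LSB
    else:
        raise ValueError("unknown format: " + fmt)
    return format(code, f"0{width}b")


# -0.375 with 4 fractional bits (magnitude 0.0110)
sm = to_fixed_point(-0.375, 4, "sign_magnitude")   # '10110'
oc = to_fixed_point(-0.375, 4, "ones")             # '11001'
tc = to_fixed_point(-0.375, 4, "twos")             # '11010'
pos = to_fixed_point(0.375, 4, "twos")             # '00110'
```

The first character of each string is the sign bit and the remaining four are the fractional bits, so `'11010'` reads as 1.1010 in 2s complement.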

## What is Truncation?

• Truncation is a type of quantization where extra bits get ‘truncated.’
• Basically, in the truncation process, all bits less significant than the desired LSB (Least Significant Bit) are discarded.
• For example, suppose we wish to truncate the following number, which has 8 fractional bits, to 4 fractional bits.
• X = 0.01101011 truncates to X = 0.0110
• Converting both to decimal, we can see that the value changes noticeably. (0.01101011 equals about 0.418 and 0.0110 equals 0.375, so the error is about 0.043 in magnitude.)
• Thus, truncation is the cruder method of quantization: the error can approach one full LSB, that is, ${2}^{-b}$.
• The error from quantization using truncation, $e = x_{truncated} - x$, where $b$ is the number of fractional bits kept, lies in the range:
• $-{2}^{-b} \le e \le 0$ (for positive numbers in any format, and for all numbers in 2s complement)
• $0 \le e \le {2}^{-b}$ (for negative numbers in sign magnitude and 1s complement)
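The truncation example above can be checked numerically. The following is a small Python sketch for a positive (unsigned) binary fraction. The helper names `bin_frac_to_float` and `truncate` are made up for this illustration, and the fraction is passed as a string of the bits that follow the binary point.

```python
def bin_frac_to_float(bits):
    """Value of the binary fraction 0.b1b2b3..., given the string 'b1b2b3...'."""
    return sum(int(bit) * 2.0 ** -(i + 1) for i, bit in enumerate(bits))


def truncate(bits, b):
    """Keep the b most significant fractional bits and discard the rest."""
    return bits[:b]


b = 4
x_bits = "01101011"                 # X = 0.01101011
xt_bits = truncate(x_bits, b)       # '0110'

x = bin_frac_to_float(x_bits)       # 0.41796875
xt = bin_frac_to_float(xt_bits)     # 0.375
error = xt - x                      # -0.04296875

# For a positive number the truncation error is never positive
# and never exceeds one LSB in magnitude.
assert -(2.0 ** -b) <= error <= 0
```

Because every value here is a sum of powers of two, the floating-point results are exact, which is why the error matches the hand calculation digit for digit.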

## What is Rounding?

• Rounding is a quantization method where we round a number to the nearest value that can be represented with the desired number of bits. The result may be rounded up or down.
• Basically, rounding is the process of reducing the size of a binary number to some desirable finite size. This is done in such a way that the rounded off number is as close to the original unquantized number as possible.
• Interestingly, the rounding process is a combination of truncation and addition.
• To round a number to, say, b bits, the number is first truncated to b bits. Then, depending on the bit that sat immediately to the right of the new LSB before truncation (the first discarded bit), a value is added to the LSB.
• If that bit was 0, then 0 is added to the LSB. If that bit was 1, then 1 is added to the LSB.
• Consider the same example as above: suppose we wish to round the following number, which has 8 fractional bits, to 4 fractional bits.
• X = 0.01101011 first truncates to X = 0.0110
• Since the first discarded bit was 1, we add 1 to the current LSB.
• Thus X is now 0.0111
• Converting both the unquantized and rounded-off numbers to decimal, we notice that the magnitude of the error is smaller than with truncation. (0.01101011 equals about 0.418 and 0.0111 equals about 0.438, an error of about 0.020 compared with about 0.043 for truncation.)
• Thus rounding is preferable to truncation.
• The error in rounding, $e = x_{rounded} - x$, lies in the range:
• $-\frac{{2}^{-b}}{2} \le e \le \frac{{2}^{-b}}{2}$
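The truncate-then-add procedure described above can be sketched in Python as follows. The function name `round_to_bits` is an assumption made for this illustration, and the sketch handles only positive (unsigned) binary fractions given as a string of the bits after the binary point. It returns an integer count of LSBs so that a carry out of the fraction (for example, 0.11111 rounding to 1.0000) is not lost.

```python
def round_to_bits(bits, b):
    """Round the binary fraction 0.bits to b fractional bits.

    Truncate to b bits, then add the first discarded bit to the LSB.
    Returns the result as an integer number of LSBs (value = result / 2**b).
    """
    truncated = int(bits[:b], 2)
    first_dropped = int(bits[b]) if len(bits) > b else 0
    return truncated + first_dropped


b = 4
x_bits = "01101011"                        # X = 0.01101011
x = int(x_bits, 2) / 2 ** len(x_bits)      # 0.41796875

q = round_to_bits(x_bits, b)               # 7, i.e. 0.0111
xr_bits = format(q, f"0{b}b")              # '0111'
xr = q / 2 ** b                            # 0.4375
error = xr - x                             # 0.01953125

# The rounding error never exceeds half an LSB in magnitude.
assert abs(error) <= 2 ** -b / 2
```

For comparison, truncating the same number gives an error of 0.04296875 in magnitude, a little more than twice the rounding error in this example.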