Friday, November 16, 2007

Fundamentals of embedded audio, part 3

Audio Processing Methods
Getting data to the processor's core
There are a number of ways to get audio data into the processor's core. For example, a foreground program can poll a serial port for new data, but this type of transfer is uncommon in embedded media processors because it makes inefficient use of the core.

Instead, a processor connected to an audio codec usually uses a DMA engine to transfer the data from the codec link (like a serial port) to some memory space available to the processor. This transfer of data occurs in the background without the core's intervention. The only overhead is in setting up the DMA sequence and handling the interrupts once the buffer of data has been received or transmitted.

Block processing versus sample processing
Sample processing and block processing are two approaches for dealing with digital audio data. In the sample-based method, the processor crunches the data as soon as it's available. Here, the processing function incurs overhead during each sample period. Many filters (like FIR and IIR, described later) are implemented this way because the effective latency is lower for sample-based processing than for block processing.

In block processing, a buffer of a specific length must be filled before the data is passed to the processing function. Some filters are implemented with block processing because it is more efficient than sample processing. For one, the processing function does not need to be called for each sample, which greatly reduces overhead. Also, many embedded processors contain multiple processing units, such as multipliers or full ALUs, that can crunch blocks of data in parallel. What's more, some algorithms must, by their nature, be processed in blocks. A well-known one is the Fourier Transform (and its practical counterpart, the Fast Fourier Transform, or FFT), which accepts blocks of temporal or spatial data and converts them into frequency-domain representations.
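To make the distinction concrete, here is a minimal C sketch of the two approaches. The block size N and the filter_one() kernel are illustrative assumptions, not part of any particular framework:

    #define N 256   /* illustrative block size */

    /* Hypothetical per-sample filter kernel shared by both approaches. */
    short filter_one(short x);

    /* Sample processing: invoked once per sample period (for example,
       from a serial-port interrupt), so call overhead recurs every sample. */
    short process_sample(short x)
    {
        return filter_one(x);
    }

    /* Block processing: invoked once per N samples, amortizing the call
       overhead and letting parallel compute units stream through the data. */
    void process_block(const short *in, short *out)
    {
        for (int i = 0; i < N; i++)
            out[i] = filter_one(in[i]);
    }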

Double-Buffering
In a block-based processing system that uses DMA to transfer data to and from the processor core, a "double buffer" must exist to arbitrate between the DMA transfers and the core. This is done so that the processor core and the core-independent DMA engine do not access the same data at the same time and cause a data coherency problem.

For example, to facilitate the processing of a buffer of length N, simply create a buffer of length 2N. For a bi-directional system, two buffers of length 2N must be created. As shown in Figure 1a, the core processes the in1 buffer and stores the result in the out1 buffer, while the DMA engine is filling in0 and transmitting the data from out0. It can be seen in Figure 1b that once the DMA engine is done with the left half of the double buffers, it starts transferring data into in1 and out of out1, while the core processes data from in0 and into out0. This configuration is sometimes called "ping-pong buffering," because the core alternates between processing the left and right halves of the double buffers.

Note that in real-time systems, the serial port DMA (or another peripheral's DMA tied to the audio sampling rate) dictates the timing budget. For this reason, the block processing algorithm must be optimized in such a way that its execution time is less than or equal to the time it takes the DMA to transfer data to/from one half of a double-buffer.
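The following C fragment sketches one way to structure the ping-pong scheme, reusing process_block() from the earlier sketch. The actual DMA setup and interrupt plumbing are platform-specific, so the handler shown here just flips an ownership flag; the buffer names mirror Figure 1 and everything else is an illustrative assumption:

    short in[2][N], out[2][N];      /* the two double-buffer halves */
    volatile int dma_half;          /* half currently owned by the DMA engine */

    /* Hypothetical DMA-complete handler: the DMA has finished one half,
       so it moves on to the other, releasing the finished half to the core. */
    void dma_isr(void)
    {
        dma_half ^= 1;
    }

    void audio_loop(void)
    {
        int core_half = 1;          /* start opposite the DMA's half */
        for (;;) {
            while (core_half == dma_half)
                ;                   /* processing must finish within one
                                       half-buffer period, or we starve here */
            process_block(in[core_half], out[core_half]);
            core_half ^= 1;
        }
    }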


http://i.cmpnet.com/dspdesignline/2007/09/adifigure11_big.gif

Figure 1. Double-buffering ("ping-pong") scheme: (a) the core processes in1/out1 while the DMA engine fills in0 and drains out0; (b) the halves swap roles.

Two-dimensional (2D) DMA
When data is transferred across a digital link like I²S, it may contain several channels. These may all be multiplexed onto one data line going into the same serial port. In such a case, 2D DMA can be used to de-interleave the data so that each channel is arranged linearly in memory. Take a look at Figure 2 for a graphical depiction of this arrangement, where samples from the left and right channels are de-multiplexed into two separate blocks. This automatic data arrangement is extremely valuable for systems that employ block processing.
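For reference, this is the rearrangement a 2D DMA engine performs for free; done in software (using the same illustrative block size N as above), it would look something like this:

    /* Split an interleaved I2S stream L0,R0,L1,R1,... of N frames into
       two linear buffers -- exactly what 2D DMA automates in hardware. */
    void deinterleave(const short *interleaved, short *left, short *right)
    {
        for (int i = 0; i < N; i++) {
            left[i]  = interleaved[2 * i];
            right[i] = interleaved[2 * i + 1];
        }
    }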


http://i.cmpnet.com/dspdesignline/2007/09/adifigure12_big.gif

Figure 2. A 2D DMA engine used to de-interleave (a) I²S stereo data into (b) separate left and right buffers.

Basic Operations
There are three fundamental operations in audio processing: summing, multiplication, and time delay. Many more complicated effects and algorithms can be implemented using these three elements. A summer has the obvious duty of adding two signals together. Multiplication can be used to boost or attenuate an audio signal. On most media processors, these operations can be executed in a single cycle.
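As a tiny illustration, mixing two channels is just multiplies feeding a summer. The Q15 fixed-point format here is an assumption typical of 16-bit audio processors:

    /* Mix two Q15 samples with Q15 gains: two multiplies feeding a summer.
       The 32-bit accumulator holds the Q30 products; >> 15 rescales to Q15.
       (Saturation is omitted for brevity.) */
    short mix(short x1, short g1, short x2, short g2)
    {
        long acc = (long)x1 * g1 + (long)x2 * g2;
        return (short)(acc >> 15);
    }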

A time delay is a bit more complicated. The delay is accomplished with a delay line, which is really nothing more than an array in memory that holds previous data. For example, an echo algorithm might hold 500 ms of input samples for each channel. For a simple delay effect, the current output value is computed by adding the current input value to a slightly attenuated previous sample. If the audio system is sample-based, the programmer can simply keep track of an input pointer and an output pointer (spaced 500 ms worth of samples apart) and increment them after each sampling period.

Since delay lines are meant to be reused for subsequent sets of data, the input and output pointers will need to wrap around from the end of the delay line buffer back to the beginning. In C/C++, this is usually done by applying the modulus operator (%) to the pointer increment.

This wrap-around may incur no extra processing cycles for a processor that supports circular buffering (see Figure 3). In this case, the beginning location and length of a circular buffer must be provided only once. During processing, the software increments or decrements the current pointer within the buffer, but the hardware takes care of wrapping around to the beginning of the buffer if the current pointer falls outside of the bounds. Without this automated address generation, the programmer would have to manually keep track of the buffer, thus wasting valuable processing cycles.
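A minimal software delay line with the modulo wrap-around might look like this in C. The delay length, the 0.5 attenuation, and the Q15 arithmetic are illustrative assumptions; on a processor with hardware circular buffering, the % falls away:

    #define DELAY_LEN 22050            /* 500 ms at 44.1 kHz, one channel */

    static short delay_line[DELAY_LEN];
    static int   idx;                  /* current position in the delay line */

    /* Simple delay effect: y[n] = x[n] + a * x[n - D], a = 0.5 in Q15. */
    short delay_effect(short x)
    {
        short old = delay_line[idx];                      /* x[n - D] */
        short y   = x + (short)(((long)old * 0x4000) >> 15);
        delay_line[idx] = x;                              /* store newest input */
        idx = (idx + 1) % DELAY_LEN;                      /* software wrap-around */
        return y;
    }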


http://i.cmpnet.com/dspdesignline/2007/09/adifigure13_big.gif



Figure 3. (a) Graphical representation of a delay line using a circular buffer (b) Layout of a circular buffer in memory.

A delay line structure can give rise to an important audio building block called the comb filter, which is essentially a delay with a feedback element. When multiple comb filters are used simultaneously, they can create the effect of reverberation.
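Changing one line of the sketch above turns the feedforward delay into a feedback comb filter (the gain value is again an illustrative assumption):

    /* Feedback comb filter: y[n] = x[n] + g * y[n - D], g ~ 0.7 in Q15.
       Writing the *output* back into the delay line is what closes the
       feedback loop; several of these in parallel approximate reverb. */
    short comb_filter(short x)
    {
        short old = delay_line[idx];                      /* y[n - D] */
        short y   = x + (short)(((long)old * 0x5999) >> 15);
        delay_line[idx] = y;                              /* feed back the output */
        idx = (idx + 1) % DELAY_LEN;
        return y;
    }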

Signal generation
In some audio systems, a signal (for example, a sine wave) might need to be synthesized. Taylor Series function approximations can emulate trigonometric functions. Uniform random number generators are handy for creating white noise.

However, synthesis might not fit into a given system's processing budget. On fixed-point systems with ample memory, you can use a table lookup instead of generating a signal. This has the side effect of taking up precious memory resources, so hybrid methods can be used as a compromise. For example, you can store a coarse lookup table to save memory. During runtime, the exact values can be extracted from the table using interpolation, an operation that can take significantly less time than evaluating a full Taylor series approximation. This hybrid approach provides a good balance between computation time and memory resources.
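Here is one way the hybrid approach might look for a sine generator: a coarse table plus linear interpolation. The table size and the [0, 1) phase format are illustrative assumptions (shown in floating point for clarity; a fixed-point version would follow the same structure):

    #include <math.h>

    #define TABLE_LEN 256
    #define TWO_PI    6.28318530718f

    static float sine_table[TABLE_LEN];

    /* Fill the coarse lookup table once at startup. */
    void init_sine_table(void)
    {
        for (int i = 0; i < TABLE_LEN; i++)
            sine_table[i] = sinf(TWO_PI * i / TABLE_LEN);
    }

    /* sin(2*pi*phase) for phase in [0, 1): index the coarse table, then
       linearly interpolate between the two nearest entries. */
    float fast_sin(float phase)
    {
        float pos  = phase * TABLE_LEN;
        int   i    = (int)pos;
        float frac = pos - (float)i;
        float a = sine_table[i & (TABLE_LEN - 1)];       /* power-of-2 wrap */
        float b = sine_table[(i + 1) & (TABLE_LEN - 1)];
        return a + frac * (b - a);
    }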

Filtering and Algorithms
Digital filters are used in audio systems for attenuating or boosting the energy content of a sound wave at specific frequencies. The most common filter forms are high-pass, low-pass, band-pass, and notch. Any of these filters can be implemented in one of two ways: as a finite impulse response (FIR) filter or as an infinite impulse response (IIR) filter. These are often used as building blocks for more complicated filtering algorithms like parametric and graphic equalizers.

Finite Impulse Response (FIR) filter
The FIR filter's output is determined by the sum of the current and past inputs, each of which is first multiplied by a filter coefficient. The FIR summation equation, shown in Figure 4a, is also known as "convolution," one of the most important operations in signal processing. In this notation, x is the input vector, y is the output vector, and h holds the filter coefficients. Figure 4a also shows a graphical representation of the FIR implementation.

The convolution is such a common operation in media processing that many processors can execute a multiply-accumulate (MAC) instruction along with multiple data accesses (reads and writes) in one cycle.
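In C, the FIR convolution of Figure 4a reduces to a multiply-accumulate loop like the one below. The Q15 data format and tap count are illustrative assumptions; on a DSP, the multiply, accumulate, and both data fetches of each iteration map onto a single-cycle MAC:

    #define NUM_TAPS 64

    /* FIR: y[n] = sum over k of h[k] * x[n - k], in Q15.
       x points at the newest sample of a linear history buffer that holds
       at least NUM_TAPS - 1 earlier samples before it. */
    short fir_filter(const short *x, const short *h)
    {
        long acc = 0;
        for (int k = 0; k < NUM_TAPS; k++)
            acc += (long)h[k] * x[-k];      /* multiply-accumulate */
        return (short)(acc >> 15);          /* rescale the Q30 sum to Q15 */
    }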

http://i.cmpnet.com/dspdesignline/2007/09/adifgure14_big.gif


Figure 4. (a) FIR filter equation and structure (b) IIR filter equation and structure.

Infinite Impulse Response (IIR) filter
Unlike the FIR, whose output depends only on inputs, the IIR filter relies on both inputs and past outputs. The basic equation for an IIR filter is a difference equation, as shown in Figure 4b. Because of the current output's dependence on past outputs, IIR filters are often referred to as "recursive filters." Figure 4b also gives a graphical perspective on the structure of the IIR filter.
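The workhorse IIR form is the second-order section, or "biquad." Here is a floating-point sketch of its difference equation, y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]; the coefficient names follow common convention, and the values would come from a separate filter-design step:

    /* Direct Form I biquad: the feedback through y1/y2 is what makes
       the filter recursive (IIR). State persists across calls. */
    typedef struct {
        float b0, b1, b2, a1, a2;   /* coefficients from filter design */
        float x1, x2, y1, y2;       /* previous inputs and outputs */
    } biquad_t;

    float biquad(biquad_t *f, float x)
    {
        float y = f->b0 * x + f->b1 * f->x1 + f->b2 * f->x2
                            - f->a1 * f->y1 - f->a2 * f->y2;
        f->x2 = f->x1;  f->x1 = x;      /* shift the input history */
        f->y2 = f->y1;  f->y1 = y;      /* shift the output history */
        return y;
    }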

Fast Fourier Transform
Quite often, we can do a better job describing an audio signal by characterizing its frequency composition. A Fourier Transform takes a time-domain signal and rearranges it into the frequency domain; the inverse Fourier Transform achieves the opposite, converting a frequency-domain representation back into the time domain. Mathematically, there are some nice property relationships between operations in the time domain and those in the frequency domain. Specifically, a time-domain convolution (or an FIR filter) is equivalent to a multiplication in the frequency domain. This tidbit would not be too practical if it weren't for a special optimized implementation of the Fourier transform called the Fast Fourier Transform (FFT). In fact, it is often more efficient to implement a sufficiently long FIR filter by transforming the input signal and coefficients into the frequency domain with an FFT, multiplying the transforms, and then transforming the result back into the time domain with an inverse FFT.
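As a sketch of this frequency-domain shortcut, assume hypothetical helper routines fft() and ifft() that transform complex arrays in place (any real FFT library supplies equivalents; the length and calling conventions here are assumptions):

    #include <complex.h>

    #define FFT_LEN 1024

    /* Hypothetical in-place transforms supplied by an FFT library. */
    void fft(float complex *buf, int n);
    void ifft(float complex *buf, int n);

    /* Fast convolution: transform both sequences, multiply bin by bin,
       transform back. X and H must be zero-padded to FFT_LEN so the
       circular convolution matches the desired linear one. */
    void fast_convolve(float complex *X, float complex *H)
    {
        fft(X, FFT_LEN);
        fft(H, FFT_LEN);
        for (int i = 0; i < FFT_LEN; i++)
            X[i] *= H[i];           /* multiplication in the frequency domain */
        ifft(X, FFT_LEN);           /* result is back in the time domain */
    }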

There are other transforms that are used often in audio processing. Among them, the most common is the modified discrete cosine transform (MDCT), which is the basis for many audio compression algorithms.

Sample Rate Conversion
There are times when you will need to convert a signal sampled at one frequency to a different sampling rate. One situation where this is useful is when you want to decode an audio signal sampled at, say 8 kHz, but the DAC you're using does not support that sampling frequency. Another scenario is when a signal is oversampled, and converting it to a lower frequency can lead to a reduction in computation time. The process of converting the sampling rate of a signal from one rate to another is called sampling rate conversion (or SRC).

Increasing the sampling rate is called interpolation, and decreasing it is called decimation. Decimating a signal by a factor of M is achieved by keeping only every Mth sample and discarding the rest. Interpolating a signal by a factor of L is accomplished by padding the original signal with L-1 zeros between each sample.
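In code, the two core operations are nearly one-liners; note that, as discussed below Figure 5, each must be paired with a low-pass filter in a real converter:

    /* Decimate by M: keep every Mth sample (an anti-aliasing low-pass
       filter must run first; see below). Returns the output length. */
    int decimate(const short *in, short *out, int n, int M)
    {
        int j = 0;
        for (int i = 0; i < n; i += M)
            out[j++] = in[i];
        return j;
    }

    /* Interpolate by L: insert L-1 zeros after each sample (the result
       must then be low-pass filtered to remove images; see below). */
    int interpolate(const short *in, short *out, int n, int L)
    {
        int j = 0;
        for (int i = 0; i < n; i++) {
            out[j++] = in[i];
            for (int k = 1; k < L; k++)
                out[j++] = 0;
        }
        return j;
    }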

Even though interpolation and decimation factors are integers, you can apply them in series to an input signal to get a rational conversion factor. If you upsample by 5 and then downsample by 3, the resulting factor is 5/3 ≈ 1.67.


http://i.cmpnet.com/dspdesignline/2007/09/adifigure15_big.gif

Figure 5. Sample-rate conversion through upsampling and downsampling.

To be honest, we've oversimplified the SRC process a bit. In order to prevent artifacts due to zero-padding a signal (which creates images in the frequency domain), an interpolated signal must be low-pass-filtered before being used as an output or as an input into a decimator. This anti-imaging low-pass filter can operate at the input sample rate, rather than at the faster output sample rate, by using a special FIR filter structure that recognizes that the inputs associated with the L-1 inserted samples have zero values.

Similarly, before they're decimated, all input signals must be low-pass-filtered to prevent aliasing. The anti-aliasing low-pass filter may be designed to operate at the decimated sample rate, rather than at the faster input sample rate, by using an FIR filter structure that realizes that the output samples associated with the discarded samples need not be computed. Figure 5 shows a flow diagram of a sample rate converter. Note that it is possible to combine the anti-imaging and anti-aliasing filters into one component for computational savings.

Obviously, we've only been able to scratch the surface of these embedded audio topics, but hopefully we've provided a useful template for the kinds of considerations necessary for developing an embedded audio processing application.

This series is adapted from the book "Embedded Media Processing" (Newnes 2005) by David Katz and Rick Gentile. See the book's web site for more information.
