Background
At present, the application range of high-speed digital signal processing is also increasingly wider, and particularly in new technical fields of radar positioning, sonar, oil exploration, optical fiber sensors, mobile communication, software radio, intelligent antennas and the like, strong support of real-time digital signal processing is needed to complete tasks of target detection, estimation and tracking. And, through carrying out certain processing to the signal, extract the characteristic, analyze and discern, find out relevant parameter and result as required.
In digital signal processing, a signal is represented by a sequence of numbers of finite precision, and the processing is implemented by digital operations. Spectral analysis, Discrete Fourier Transform (DFT), is one of the important contents of signal processing, and has wide application in engineering practice. The Fast Fourier Transform (FFT) algorithm for realizing Discrete Fourier Transform (DFT) is an efficient method for signal conversion between time domain and frequency domain, and is an essential core tool in various fields related to waveform analysis.
The advent of efficient Fast Fourier Transforms (FFT) has been crucial to the advancement of digital signal processing. Fast fourier transform FFT for calculating DFT, which was studied by Cooley Turkey, greatly reduces the amount of computation, so that the FFT is widely applied in the fields of digital signal processing, sensing and positioning, earthquake prediction, medical fault diagnosis, coding theory, quantum physics, probability theory, etc.
In actual engineering practice, a Digital Signal Processor (DSP) is generally used to implement the fast fourier FFT algorithm. The fast fourier FFT operation generally has two implementation methods, which are a dedicated chip and a general processor. Although the special chip has high operation efficiency, the flexibility is poor; the general processor has high flexibility in algorithm and processing mode, and can realize various algorithms. The high-performance DSP from TI company can complete the functions in real time with software, which must be completed with several special chips, and this can simplify design, lower cost, raise design flexibility and portability.
The operational efficiency of the fast fourier transform FFT directly affects the performance of the entire system. The signal processor DSP chip usually takes the operation speed of the fast fourier transform FFT as the most important technical index. The DSP chip internal memory of the signal processor is limited, and the realization is easier under the condition that the FFT point number is not large and the storage space is large enough. Usually, the point number and the precision required in the spectrum analysis are not large, and the signal processor DSP chip can finish the analysis at high speed by calling a conventional Fast Fourier Transform (FFT) function algorithm.
However, in some specific applications, such as optical fiber sensor positioning, power spectrum estimation by periodogram method, and radar precise positioning, in order to obtain high spectral resolution, fast fourier transform FFT with an ultra-large number of points is often adopted. Compared with the small-point fast Fourier transform FFT, the implementation of the large-point fast Fourier transform FFT is much more complex, the real-time performance of the operation is ensured, the direct calling of the universal algorithm in the DSP chip of the signal processor cannot be realized, and the fast Fourier transform FFT is completed to a processing system through a multi-stage board-level subsystem, namely a plurality of DSP signal processing chips.
FIG. 1 is a schematic diagram of a hierarchical system using multiple DSP processors. The large-point number original data is stored in an off-chip memory of the floating-point DSP, the first few-stage butterfly operation is performed on the 4 floating-point DSPs in parallel, the operation result is transmitted to the fixed-point DSP103 through the double-port memory, and the last few-stage butterfly operation is completed. In a multi-DSP processor hierarchical system including high-end C6000-series DSP chips of Texas Instruments (TI), C6701 may be used for 4 floating-point DSPs, and C6202 may be used for fixed-point DSP 1.
In the system, the FFT algorithm adopts a time-based decimation radix-two algorithm with natural input and reverse-order output. As shown in fig. 2, a radix-2 FFT flow graph is shown with N-32 decimated over time. The algorithm principle is as follows:
1. splitting a large number of points N into 2-level FFTs, where N is N1×N2Wherein N is1=2r1,N2=2r2。N2For the number of parallel processors DSP, N-point data is cyclically allocated to N2A stage processor.
2. First make N2N is1FFT of the point, then N1The twiddle factors generate r 2-level butterfly operations, and the previous processing calculation stages are completed on a parallel processor. N can be separated by the same method1Of dotsThe FFT is split into a smaller number of points of FFT. The law of the twiddle factor is as follows: all twiddle factors are bit-reversed, and then the required twiddle factor is divided into r2 parts:
(1)1 twiddle factor, obtained by sequentially taking 1 in sequence in the whole twiddle factor;
(2)2 twiddle factors, for sequentially taking 2 results at a time in the whole twiddle factor;
……
(r2)2r2-1a twiddle factor, for sequentially taking 2 at a time in the whole twiddle factorr2-1And (4) obtaining.
3. And transmitting the parallel processing result to the last summary DSP to finish the final 2-stage butterfly operation and the sequencing.
As shown in fig. 3, a flow chart for a 256 k-point FFT implemented in the multi-DSP processor hierarchical system shown in fig. 1.
As shown in the figure, floating-point DSPs-1-4 (C6701)101 a-101 d perform type conversion on the sampled fixed-point data to convert the data into floating-point data (see step 301), wherein the fixed-point data are sampled in columns;
converting the complex sequence FFT of the large point number into the small point number which can be processed on the DSP at one time (see steps 302 and 303), wherein floating point data is put back into the C6701 from the C6701 by columns and is put back into the C6701 from the C6701 by rows;
converting floating point data into fixed point data and transmitting the fixed point data to a fixed point DSP C6202 (see step 304), wherein the floating point data is put back into a C6701 from the C6701 according to rows and transmitted through DPRAN communication;
then, the last butterfly operation is done on the fixed-point DSP (C6202)103 (see step 305), where the fixed-point data is fetched from outside C6202 into inside C6202;
sorting in a reverse order (see step 306), wherein the fixed-point data is put back from the C6202 to the outside of the C6202 according to rows, and the fixed-point data is fetched in the order of bit reverse rolling between the rows;
then, double real number sequence separation is performed, wherein the fixed point data is put back from C6202 to the outside of C6202 according to columns, the fixed point data is taken from the outside of C6202 to the inside of C6202 according to rows, and finally modulo and data detection are performed (see steps 307 to 308), wherein the result is put back from the inside of C6202 to the outside of C6202 according to rows.
However, a drawback with the above systems and algorithms is that: the technical cost is very high, and 5 pieces of TI high-end c6000 series DSP chips are needed for 256k points;
the hardware design is complex, the synchronous communication of the 5 DSP processors is difficult to realize and unstable, which is very important for the realization of the FFT algorithm, otherwise, the asynchronous time occurs, which directly causes the error of the frequency spectrum analysis;
the processing plate cannot be made very small and is not suitable for large-scale integration; in some special occasions, the board is too large to meet the product requirements;
in software, algorithm splitting is also complex, parallel algorithms are difficult originally, and because C6202 is a fixed-point DSP processor and C6701 is a floating-point DSP, the complexity is increased by the conversion from floating point to fixed point in the later stages of FFT.
In addition, a single-chip floating-point DSP small-point algorithm can be adopted to realize FFT with large data volume. The large data amount FFT is mapped into a series of small point number FFT by a data extraction mode, and a Winograd algorithm is adopted, and the implementation on a single chip C6701DSP is taken as an example for explanation. The algorithm principle is as follows: dividing the N-point sequence into N1N 2-point FFTs and N2N 1-point FFTs; as shown in fig. 4, a flow chart for implementing a large data volume FFT for a single-chip floating-point DSP small-point algorithm. Wherein,
step 401, performing linear mapping from a one-dimensional time domain sequence to a two-dimensional time domain sequence, and decomposing an N point sequence into a two-dimensional sequence; namely, it is
<math>
<mfenced open='[' close=']'>
<mtable>
<mtr>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mn>0</mn>
<mo>)</mo>
</mrow>
</mtd>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>N</mi>
<mn>1</mn>
</msub>
<mo>)</mo>
</mrow>
</mtd>
<mtd>
<mo>·</mo>
<mo>·</mo>
<mo>·</mo>
</mtd>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mrow>
<mo>(</mo>
<msub>
<mi>N</mi>
<mn>2</mn>
</msub>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<msub>
<mi>N</mi>
<mn>1</mn>
</msub>
<mo>)</mo>
</mrow>
</mtd>
</mtr>
<mtr>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
</mtd>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>N</mi>
<mn>1</mn>
</msub>
<mo>+</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
</mtd>
<mtd>
<mo>·</mo>
<mo>·</mo>
<mo>·</mo>
</mtd>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>N</mi>
<mn>1</mn>
</msub>
<msub>
<mi>N</mi>
<mn>2</mn>
</msub>
<mo>-</mo>
<msub>
<mi>N</mi>
<mn>1</mn>
</msub>
<mo>+</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>·</mo>
<mo>·</mo>
<mo>·</mo>
</mtd>
<mtd>
<mo>·</mo>
<mo>·</mo>
<mo>·</mo>
</mtd>
<mtd>
<mo>·</mo>
<mo>·</mo>
<mo>·</mo>
</mtd>
<mtd>
<mo>·</mo>
<mo>·</mo>
<mo>·</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>N</mi>
<mn>1</mn>
</msub>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
</mtd>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mn>2</mn>
<msub>
<mi>N</mi>
<mn>1</mn>
</msub>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
</mtd>
<mtd>
<mo>·</mo>
<mo>·</mo>
<mo>·</mo>
</mtd>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>N</mi>
<mn>1</mn>
</msub>
<msub>
<mi>N</mi>
<mn>2</mn>
</msub>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
</mtd>
</mtr>
</mtable>
</mfenced>
</math>
Step 402, performing FFT on each row of the two-dimensional time domain sequence, and performing N in total1N is2The FFT of the points yields an intermediate matrix t (n)1,k2)};
Step 403, give matrix { t (n)1,k2) Multiplying each element of the multiplication by a corresponding rotation factorObtain an intermediate matrix g (n)1,k2) And (c) the step of (c) in which,
step 404, for matrix { g (n)1,k2) FFT for each column, N2N is1FFT of the points, a two-dimensional matrix { X (k) } of the frequency domain is obtained1,k2)};
Step 405, for the frequency domain two-dimensional matrix { X (k) }1,k2) Linearly mapping to obtain one-dimensional matrix of frequency domain, i.e.Large data volume FFT
<math>
<mfenced open='[' close=']'>
<mtable>
<mtr>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mn>0</mn>
<mo>)</mo>
</mrow>
</mtd>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
</mtd>
<mtd>
<mo>·</mo>
<mo>·</mo>
<mo>·</mo>
</mtd>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>N</mi>
<mn>2</mn>
</msub>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
</mtd>
</mtr>
<mtr>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>N</mi>
<mn>2</mn>
</msub>
<mo>)</mo>
</mrow>
</mtd>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>N</mi>
<mn>2</mn>
</msub>
<mo>+</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
</mtd>
<mtd>
<mo>·</mo>
<mo>·</mo>
<mo>·</mo>
</mtd>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mn>2</mn>
<msub>
<mi>N</mi>
<mn>2</mn>
</msub>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>·</mo>
<mo>·</mo>
<mo>·</mo>
</mtd>
<mtd>
<mo>·</mo>
<mo>·</mo>
<mo>·</mo>
</mtd>
<mtd>
<mo>·</mo>
<mo>·</mo>
<mo>·</mo>
</mtd>
<mtd>
<mo>·</mo>
<mo>·</mo>
<mo>·</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mrow>
<mo>(</mo>
<msub>
<mi>N</mi>
<mn>1</mn>
</msub>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<msub>
<mi>N</mi>
<mn>2</mn>
</msub>
<mo>)</mo>
</mrow>
</mtd>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mrow>
<mo>(</mo>
<msub>
<mi>N</mi>
<mn>1</mn>
</msub>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<msub>
<mi>N</mi>
<mn>2</mn>
</msub>
<mo>+</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
</mtd>
<mtd>
<mo>·</mo>
<mo>·</mo>
<mo>·</mo>
</mtd>
<mtd>
<mi>x</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>N</mi>
<mn>1</mn>
</msub>
<msub>
<mi>N</mi>
<mn>2</mn>
</msub>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
</mtd>
</mtr>
</mtable>
</mfenced>
</math>
The off-chip memory is divided into 3 blocks of regions: raw data area, intermediate matrix area, twiddle factor. The final result of the data is that it is stored sequentially in the original data area.
The defect of realizing the FFT with large data volume by adopting the single-chip floating-point DSP small-point algorithm is as follows:
the floating-point DSP processor is used, so that the operation speed is relatively low; the number of points which can be calculated efficiently is not large enough and is not easy to expand; the on-chip memory capacity of the floating-point processor is small because N is satisfied1And N2The data volume of FFT can be directly realized in the chip, and the number of FFT points which can be calculated is limited;
according to the on-chip resources, for C6701, N1 and N2 are not more than 4096, if the maximum point number calculated in this way is only 224;
The algorithm is complex to split, the data flow is complex, and the used twiddle factor rule is complex; the program is complex, the number of subprocesses is large, and the detailed design is complicated.
Detailed Description
The invention provides a signal processing method, a signal processing device and a signal processing system. The present invention will be described in detail below with reference to the accompanying drawings.
Example one
The invention provides a signal processing system, as shown in fig. 5, the system includes a data acquisition unit 501 and a data conversion unit 502 connected with the data acquisition unit 501;
in addition, a digital signal processing device 503 is also included for receiving and storing the N-point input data transmitted by the data conversion unit 502, storing the processed output data and the twiddle factor required for processing, and adopting a decimation base-2 according to frequencymFFT and utilize twiddle factor to process N point input data;
wherein the number of stages of the N-point input data is Each stage comprising N/2mAnd each butterfly unit, wherein N is the number of points of input data, M is an integer, and M is a positive integer.
As shown in fig. 5, the digital signal processing apparatus 503 includes: a digital signal processor DSP503a and an external memory 503 b; the external memory 503b is configured to receive and store the N-point input data transmitted by the data conversion unit 502, and further store the output data processed by the digital signal processor 503a and the twiddle factor required for processing;
the DSP503a is connected to the external memory 503b and uses a decimation by frequency of-2mThe FFT processes the N-point input data stored in the external memory 503b by using the twiddle factor, and transfers the processed output data to the external memory 503b for storage.
As shown in fig. 6, which is a schematic configuration diagram of the digital signal processing apparatus 503 according to the embodiment of the present invention, the digital signal processor DSP503a at least includes: a central processor 603, an internal memory 602, a Direct Memory Access (DMA) controller 601, and an interrupt controller 604; wherein,
a DMA controller 601 connected to the external memory 503b and the internal memory 602, for transferring data between the external memory 503b and the internal memory 602;
an internal memory 602, configured to receive, temporarily store and transmit data transmitted by the external memory 503b to the central processor 603 through the DMA controller 601; or after receiving and temporarily storing the processed data transmitted by the central processor 603, the data is transmitted to the external memory 503b through the DMA controller 601; and the internal memory 503b is also used to store programs;
a central processing unit 603, connected to the internal memory 602, for processing the data transmitted from the internal memory 602 by using the program stored in the internal memory 602, and transmitting the processed output data to the internal memory 602 for temporary storage;
the interrupt controller 604 is connected to the DMA controller 601, and is configured to send an interrupt request signal to switch between data transfer between the external memory 503b and the internal memory 601 and processing of the internal memory 602 by the central processor 603.
In addition, in this embodiment, the digital signal processor 503b may further include a first-level internal program memory 605 and a first-level internal data memory 606; wherein,
the first-level internal program memory 605 is disposed between the internal memory 602 and the central processing unit 603, and is used for temporarily storing the program transmitted by the internal memory 602 for the central processing unit 603 to use;
the first-level internal data memory 606 is connected to the internal memory 602 and the central processor 603, and is configured to temporarily store data transmitted by the internal memory 602 and processed by the central processor 603 or processed data.
The system also comprises a spectrum analysis unit 504, a target identification unit 505 and a terminal 506 which are all connected with the digital signal processing device 503; wherein,
a spectrum analysis unit 504 that performs spectrum analysis based on the output data stored in the digital signal processing device 503; a target recognition unit 505 that performs target recognition based on the output data stored in the digital signal processing apparatus 503; the terminal 506 displays the output data stored in the digital signal processing device 503.
In this embodiment, the DSP503b is a single-chip fixed-point DSP. TMS320C6416 developed by TI corporation can be adopted, a 2-level memory structure is adopted, and a first-level memory, namely a first-level internal program memory 605 and a first-level internal data memory 606 stores mutually independent program cache and data cache which are used as cache to be accessed by the CPU. The second level memory, i.e., the internal memory 602, is a unified program/data space that can be mapped as a whole to the memory space of the external memory 503b as an SRAM.
In the present invention, the DMA controller 601 may employ an EDMA controller.
In this embodiment, the external memory 503b is divided into three parts, and stores input data, processed output data, and twiddle factors used when processing data.
Fig. 9 shows an example of a specific application of the signal processing system of the present invention. I.e. a fiber optic sensor positioning system. The system is composed of a light emitting and receiving unit 702, a sensing optical fiber unit 701 and a signal processor. Wherein,
the light emitting and receiving unit 702 is used for completing signal acquisition and analog-to-digital conversion; the sensing optical fiber unit 701 is used for completing the physical positions of wiring and signal acquisition of sensing optical fibers; the signal processing device may be the signal processing device described above, and is not described herein again. The processed data can be used for spectrum analysis, target recognition or terminal display.
According to the embodiment, the design of large points is completed by adopting the single DSP, so that the cost is reduced; the single-chip DSP system has a small circuit board, a small corresponding product and strong usability; the hardware design is simplified, the design of a single DSP chip is realized, and the development period is shortened; a high-speed fixed-point processor is adopted, so that a larger number of points can be efficiently calculated; an Enhanced DMA Controller (EDMA) which is unique to TI C64 series is fully utilized to realize high-efficiency access to an off-chip memory; the algorithm is simplified, and the FFT point number is easy to expand and upgrade; due to the characteristic of the algorithm, the frequency spectrum after FFT calculation is easy to analyze, and the maximum peak value in the frequency spectrum is easy to find.
Example two
The signal processing method of the present invention will be described in detail below, taking as an example the case of performing signal processing using the signal processing system of the first embodiment.
The invention provides a signal processing method, which comprises the following steps:
the data acquisition unit 501 acquires signals and transmits the acquired signals to the data conversion unit 502 for conversion into digital data;
the digital data is input as input data to a digital signal processing apparatus 503, the apparatus 503 including an external memory 503b and a signal processor 503 a; wherein,
an external memory 503b for receiving and storing the N-point input data transmitted from the data conversion unit 502, and also storing the output data processed by the digital signal processor 503a and the twiddle factor required for processing;
the digital signal processor 503a is connected to the external memory 503b and uses a decimation base-2 according to frequencymThe FFT processes the N-point input data stored in the external memory 503b by using the twiddle factor, and transfers the processed output data to the external memory 503b for storage;
the number of stages of the N-point input data is Each stage comprising N/2mAnd each butterfly unit, wherein N is the number of points of input data, M is an integer, and M is a positive integer.
In this embodiment, the data conversion unit 502 is an analog-to-digital converter, and the N-point input data after analog-to-digital (a/D) conversion is a 12-bit complex sequence, and the data is stored with a 16-bit word length.
Before specifically describing the signal processing method of the present invention, first, fourier transform employed in the present invention will be described.
Fourier transform is an important analysis tool in signal processing and data processing, and the essence of the fourier transform is to study the time domain problem into the frequency domain problem, and the transform can greatly simplify the studied problem. Fast Fourier Transform (FFT) is a mathematical operation commonly used in the art.
Before performing the FFT operation, it is assumed that the acquired signal is a periodic signal, that is, it must have a fixed period at the time of signal generation, the acquired signal is based on the sub-period, and the real signal is assumed that an infinitely long time domain signal combination is generated from this basic period.
Next, the FFT algorithm will be explained.
Commonly used FFTs for implementing DFT are radix-2 FFT, radix-4 FFT, and higher radix. And is divided into time-wise decimation DIT and frequency-wise decimation DIF.
Generally, the higher the radix, the less the total computation.
In this embodiment, a Decimation in Frequency (Decimation) basis-4 FFT is used to process the input data, but the FFT is not limited thereto, and other FFT may be used. The decimation by frequency basis-4 FFT will now be described.
The FFT operation is very regular, and each stage of butterfly operation adopts a radix-4 algorithm to extract 4 operands each time, and 4 results are obtained through operation.
<math>
<mrow>
<mi>X</mi>
<mrow>
<mo>(</mo>
<mi>k</mi>
<mo>)</mo>
</mrow>
<mo>=</mo>
<munderover>
<mi>Σ</mi>
<mrow>
<mi>n</mi>
<mo>=</mo>
<mn>0</mn>
</mrow>
<mrow>
<mi>N</mi>
<mo>-</mo>
<mn>1</mn>
</mrow>
</munderover>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>)</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mi>nk</mi>
</msubsup>
</mrow>
</math>
<math>
<mrow>
<mo>=</mo>
<munderover>
<mi>Σ</mi>
<mrow>
<mi>n</mi>
<mo>=</mo>
<mn>0</mn>
</mrow>
<mrow>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>-</mo>
<mn>1</mn>
</mrow>
</munderover>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>)</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mi>nk</mi>
</msubsup>
<mo>+</mo>
<munderover>
<mi>Σ</mi>
<mrow>
<mi>n</mi>
<mo>=</mo>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
</mrow>
<mrow>
<mfrac>
<mi>N</mi>
<mn>2</mn>
</mfrac>
<mo>-</mo>
<mn>1</mn>
</mrow>
</munderover>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>)</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mi>nk</mi>
</msubsup>
<mo>+</mo>
<munderover>
<mi>Σ</mi>
<mrow>
<mi>n</mi>
<mo>=</mo>
<mfrac>
<mi>N</mi>
<mn>2</mn>
</mfrac>
</mrow>
<mrow>
<mfrac>
<mrow>
<mn>3</mn>
<mi>N</mi>
</mrow>
<mn>4</mn>
</mfrac>
<mo>-</mo>
<mn>1</mn>
</mrow>
</munderover>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>)</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mi>nk</mi>
</msubsup>
<mo>+</mo>
<munderover>
<mi>Σ</mi>
<mrow>
<mi>n</mi>
<mo>=</mo>
<mfrac>
<mrow>
<mn>3</mn>
<mi>N</mi>
</mrow>
<mn>4</mn>
</mfrac>
</mrow>
<mrow>
<mi>N</mi>
<mo>-</mo>
<mn>1</mn>
</mrow>
</munderover>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>)</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mi>nk</mi>
</msubsup>
</mrow>
</math>
<math>
<mrow>
<mo>=</mo>
<munderover>
<mi>Σ</mi>
<mrow>
<mi>n</mi>
<mo>=</mo>
<mn>0</mn>
</mrow>
<mrow>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>-</mo>
<mn>1</mn>
</mrow>
</munderover>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>)</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mi>nk</mi>
</msubsup>
<mo>+</mo>
<munderover>
<mi>Σ</mi>
<mrow>
<mi>n</mi>
<mo>=</mo>
<mn>0</mn>
</mrow>
<mrow>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>-</mo>
<mn>1</mn>
</mrow>
</munderover>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mrow>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mi>k</mi>
</mrow>
</msubsup>
<mo>+</mo>
<munderover>
<mi>Σ</mi>
<mrow>
<mi>n</mi>
<mo>=</mo>
<mn>0</mn>
</mrow>
<mrow>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>-</mo>
<mn>1</mn>
</mrow>
</munderover>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mi>N</mi>
<mn>2</mn>
</mfrac>
<mo>)</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mrow>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mi>N</mi>
<mn>2</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mi>k</mi>
</mrow>
</msubsup>
<mo>+</mo>
<munderover>
<mi>Σ</mi>
<mrow>
<mi>n</mi>
<mo>=</mo>
<mn>0</mn>
</mrow>
<mrow>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>-</mo>
<mn>1</mn>
</mrow>
</munderover>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mrow>
<mn>3</mn>
<mi>N</mi>
</mrow>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mrow>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mrow>
<mn>3</mn>
<mi>N</mi>
</mrow>
<mi>k</mi>
</mfrac>
<mo>)</mo>
</mrow>
<mi>k</mi>
</mrow>
</msubsup>
</mrow>
</math>
<math>
<mrow>
<mo>=</mo>
<munderover>
<mi>Σ</mi>
<mrow>
<mi>n</mi>
<mo>=</mo>
<mn>0</mn>
</mrow>
<mrow>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>-</mo>
<mn>1</mn>
</mrow>
</munderover>
<mrow>
<mo>[</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>)</mo>
</mrow>
<mo>+</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mrow>
<mi>k</mi>
<mrow>
<mo>(</mo>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
</mrow>
</msubsup>
<mo>+</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mi>N</mi>
<mn>2</mn>
</mfrac>
<mo>)</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mrow>
<mrow>
<mo>(</mo>
<mfrac>
<mi>N</mi>
<mn>2</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mi>k</mi>
</mrow>
</msubsup>
<mo>+</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mrow>
<mn>3</mn>
<mi>N</mi>
</mrow>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mrow>
<mrow>
<mo>(</mo>
<mfrac>
<mrow>
<mn>3</mn>
<mi>N</mi>
</mrow>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mi>k</mi>
</mrow>
</msubsup>
<mo>]</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mi>nk</mi>
</msubsup>
</mrow>
</math>
And because of
<math>
<mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mrow>
<mi>k</mi>
<mrow>
<mo>(</mo>
<mi>N</mi>
<mo>/</mo>
<mn>4</mn>
<mo>)</mo>
</mrow>
</mrow>
</msubsup>
<mo>=</mo>
<msup>
<mrow>
<mo>(</mo>
<msup>
<mi>e</mi>
<mrow>
<mo>-</mo>
<mi>j</mi>
<mn>2</mn>
<mi>π</mi>
<mo>/</mo>
<mi>N</mi>
</mrow>
</msup>
<mo>)</mo>
</mrow>
<mrow>
<mi>k</mi>
<mrow>
<mo>(</mo>
<mi>N</mi>
<mo>/</mo>
<mn>4</mn>
<mo>)</mo>
</mrow>
</mrow>
</msup>
<mo>=</mo>
<msup>
<mrow>
<mo>(</mo>
<mo>-</mo>
<mi>j</mi>
<mo>)</mo>
</mrow>
<mi>k</mi>
</msup>
</mrow>
</math>
<math>
<mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mrow>
<mi>k</mi>
<mrow>
<mo>(</mo>
<mi>N</mi>
<mo>/</mo>
<mn>2</mn>
<mo>)</mo>
</mrow>
</mrow>
</msubsup>
<mo>=</mo>
<msup>
<mrow>
<mo>(</mo>
<msup>
<mi>e</mi>
<mrow>
<mo>-</mo>
<mi>j</mi>
<mn>2</mn>
<mi>π</mi>
<mo>/</mo>
<mi>N</mi>
</mrow>
</msup>
<mo>)</mo>
</mrow>
<mrow>
<mi>k</mi>
<mrow>
<mo>(</mo>
<mi>N</mi>
<mo>/</mo>
<mn>2</mn>
<mo>)</mo>
</mrow>
</mrow>
</msup>
<mo>=</mo>
<msup>
<mrow>
<mo>(</mo>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<mi>k</mi>
</msup>
</mrow>
</math>
<math>
<mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mrow>
<mi>k</mi>
<mrow>
<mo>(</mo>
<mn>3</mn>
<mi>N</mi>
<mo>/</mo>
<mn>4</mn>
<mo>)</mo>
</mrow>
</mrow>
</msubsup>
<mo>=</mo>
<msup>
<mrow>
<mo>(</mo>
<msup>
<mi>e</mi>
<mrow>
<mo>-</mo>
<mi>j</mi>
<mn>2</mn>
<mi>π</mi>
<mo>/</mo>
<mi>N</mi>
</mrow>
</msup>
<mo>)</mo>
</mrow>
<mrow>
<mi>k</mi>
<mrow>
<mo>(</mo>
<mn>3</mn>
<mi>N</mi>
<mo>/</mo>
<mn>4</mn>
<mo>)</mo>
</mrow>
</mrow>
</msup>
<mo>=</mo>
<msup>
<mrow>
<mo>(</mo>
<mi>j</mi>
<mo>)</mo>
</mrow>
<mi>k</mi>
</msup>
</mrow>
</math>
Therefore, it is not only easy to use
<math>
<mrow>
<mi>X</mi>
<mrow>
<mo>(</mo>
<mi>k</mi>
<mo>)</mo>
</mrow>
<mo>=</mo>
<munderover>
<mi>Σ</mi>
<mrow>
<mi>n</mi>
<mo>=</mo>
<mn>0</mn>
</mrow>
<mrow>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>-</mo>
<mn>1</mn>
</mrow>
</munderover>
<mrow>
<mo>[</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>)</mo>
</mrow>
<mo>+</mo>
<msup>
<mrow>
<mo>(</mo>
<mo>-</mo>
<mi>j</mi>
<mo>)</mo>
</mrow>
<mi>k</mi>
</msup>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mo>+</mo>
<msup>
<mrow>
<mo>(</mo>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<mi>k</mi>
</msup>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mi>N</mi>
<mn>2</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mo>+</mo>
<msup>
<mrow>
<mo>(</mo>
<mi>j</mi>
<mo>)</mo>
</mrow>
<mi>k</mi>
</msup>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mrow>
<mn>3</mn>
<mi>N</mi>
</mrow>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mo>]</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mi>nk</mi>
</msubsup>
</mrow>
</math>
According to the derivation, let k be 4r, k be 4r +2, k be 4r +1, k be 4r +3,
<math>
<mrow>
<mi>r</mi>
<mo>=</mo>
<mn>0,1</mn>
<mo>,</mo>
<mo>·</mo>
<mo>·</mo>
<mo>·</mo>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>-</mo>
<mn>1</mn>
<mo>,</mo>
</mrow>
</math>
fig. 7 shows a signal flow diagram of a radix-4 butterfly unit in the FFT.
<math>
<mrow>
<mi>X</mi>
<mrow>
<mo>(</mo>
<mn>4</mn>
<mi>r</mi>
<mo>)</mo>
</mrow>
<mo>=</mo>
<munderover>
<mi>Σ</mi>
<mrow>
<mi>n</mi>
<mo>=</mo>
<mn>0</mn>
</mrow>
<mrow>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>-</mo>
<mn>1</mn>
</mrow>
</munderover>
<mrow>
<mo>[</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>)</mo>
</mrow>
<mo>+</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mo>+</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mi>N</mi>
<mn>2</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mo>+</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mrow>
<mn>3</mn>
<mi>N</mi>
</mrow>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mo>]</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mrow>
<mi>N</mi>
<mo>/</mo>
<mn>4</mn>
</mrow>
<mi>nr</mi>
</msubsup>
</mrow>
</math>
<math>
<mrow>
<mi>X</mi>
<mrow>
<mo>(</mo>
<mn>4</mn>
<mi>r</mi>
<mo>+</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<mo>=</mo>
<munderover>
<mi>Σ</mi>
<mrow>
<mi>n</mi>
<mo>=</mo>
<mn>0</mn>
</mrow>
<mrow>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>-</mo>
<mn>1</mn>
</mrow>
</munderover>
<mrow>
<mo>[</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>)</mo>
</mrow>
<mo>-</mo>
<mi>jx</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mo>-</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mi>N</mi>
<mn>2</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mo>+</mo>
<mi>jx</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mrow>
<mn>3</mn>
<mi>N</mi>
</mrow>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mo>]</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mi>n</mi>
</msubsup>
<msubsup>
<mi>W</mi>
<mrow>
<mi>N</mi>
<mo>/</mo>
<mn>4</mn>
</mrow>
<mi>nr</mi>
</msubsup>
</mrow>
</math>
<math>
<mrow>
<mi>X</mi>
<mrow>
<mo>(</mo>
<mn>4</mn>
<mi>r</mi>
<mo>+</mo>
<mn>2</mn>
<mo>)</mo>
</mrow>
<mo>=</mo>
<munderover>
<mi>Σ</mi>
<mrow>
<mi>n</mi>
<mo>=</mo>
<mn>0</mn>
</mrow>
<mrow>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>-</mo>
<mn>1</mn>
</mrow>
</munderover>
<mrow>
<mo>[</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>)</mo>
</mrow>
<mo>-</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mo>+</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mi>N</mi>
<mn>2</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mo>-</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mrow>
<mn>3</mn>
<mi>N</mi>
</mrow>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mo>]</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mrow>
<mn>2</mn>
<mi>n</mi>
</mrow>
</msubsup>
<msubsup>
<mi>W</mi>
<mrow>
<mi>N</mi>
<mo>/</mo>
<mn>4</mn>
</mrow>
<mi>nr</mi>
</msubsup>
</mrow>
</math>
<math>
<mrow>
<mi>X</mi>
<mrow>
<mo>(</mo>
<mn>4</mn>
<mi>r</mi>
<mo>+</mo>
<mn>3</mn>
<mo>)</mo>
</mrow>
<mo>=</mo>
<munderover>
<mi>Σ</mi>
<mrow>
<mi>n</mi>
<mo>=</mo>
<mn>0</mn>
</mrow>
<mrow>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>-</mo>
<mn>1</mn>
</mrow>
</munderover>
<mrow>
<mo>[</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>)</mo>
</mrow>
<mo>+</mo>
<mi>jx</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mi>N</mi>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mo>-</mo>
<mi>x</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mi>N</mi>
<mn>2</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mo>-</mo>
<mi>jx</mi>
<mrow>
<mo>(</mo>
<mi>n</mi>
<mo>+</mo>
<mfrac>
<mrow>
<mn>3</mn>
<mi>N</mi>
</mrow>
<mn>4</mn>
</mfrac>
<mo>)</mo>
</mrow>
<mo>]</mo>
</mrow>
<msubsup>
<mi>W</mi>
<mi>N</mi>
<mrow>
<mn>3</mn>
<mi>n</mi>
</mrow>
</msubsup>
<msubsup>
<mi>W</mi>
<mrow>
<mi>N</mi>
<mo>/</mo>
<mn>4</mn>
</mrow>
<mi>nr</mi>
</msubsup>
</mrow>
</math>
Each point is a complex pair, divided into real and imaginary parts, and the butterfly graph can be decomposed into a complex signal flow graph as shown in fig. 10. Wherein,
xa′=xa+xb+xc+xd
ya′=ya+yb+yc+yd
xb′=(xa+xb+xc-xd)Cb-(ya-yb-yc+yd)(-Sb)
xb′=(xa+yb-xc-yd)(-Sb)-(ya-xb-yc+xd)(Cb)
xc′=(xa-xb+xc-xd)Cc-(ya-yb+yc-yd)(-Sc)
yc′=(ya-yb+yc-yd)Cc-(xa-xb+xc-yd)(-Sc)
xd′=(xa-xb+xc-xd)Cd-(ya-yb+yc-yd)(-Sd)
xd′=(ya+xb-yc-xd)Cd+(xa-yb-xc+yd)(-Sd)
for example, when N is 64-point FFT, FFT streams are used which are sequentially output and inversely output, as shown in fig. 11. It can be seen that all twiddle factors are used in the first few stages of butterfly operation, and partial twiddle factors are used in the latter few stages of algorithm repetition. The very regular twiddle factor is very beneficial to the splitting of a large point number.
To facilitate the algorithm analysis, 3 concepts in FFT are introduced:
1. stage
The N-point DFT is divided into 4N/4-point DFTs, and then 16N/16-point DFTs until the N/4-point DFTs are divided. Each division is called a "stage";
the N-point DFT can be divided into And (4) stages.
2. Group of
As shown in fig. 11, each of the N/4 butterfly units in each stage may be divided into several groups, each group having the same structure and twiddle factor distribution, for example, m is 0 stage, and is divided into 16 groups; m is 1, and is divided into 4 groups; m is 2 stages and is divided into 1 group. In the case of the radix-two algorithm, the number of groups in the mth stage is N/2m+1M is 0, 1, …, M-1; if the algorithm is the radix-four algorithm, the number of groups of the mth stage is N/4m+1。
3. Code bit inversion (digit reverse)
x (n) normal input, and x (k) reverse output, which is no longer the original natural order. This is because the sequence numbers of the frequency domain are divided by parity and base 4 decimates the data.
For N64, its natural numbers are 0, 1, 2, 3, 4 … …
The final output is 0, 16, 32, 48 … …
Writing to binary can see
Inputting: x (000000), x (000001), x (000010), x (000011), x (000100)
And (3) outputting: x (000000), x (010000), x (100000), x (110000), x (000100)
I.e. if N is 64 or 26Point, x ═ a5a4a3a2a1a0If there are 6 bits, less than 0, the process of digt reverse is:
x1=a5a4a1a0a3a2
x2=a1a0a3a2a5a4until every 2 bits are completed in reverse order. N points require reverse order log4 NNext, the process is carried out.
Then, the FFT operation storage space is analyzed
In this embodiment, the high-speed memory on C6416 for the user to freely command has 1M-byte, and the data and the program share this space. In order to ensure the operation precision, the input signal after sampling A \ D conversion is a 12-bit complex sequence, and each data is stored by adopting a 16-bit word length.
For N-point complex FFT operation, when performing the FFT splitting algorithm within a slice, 2 parts of data space are required. Due to on-chip space limitations, only a portion of data N can be computed at a time1The data input and output occupy the same memory area 4 XN1Byte, in which the real and imaginary parts each occupy 2 XN1A Byte space; twiddle factor requires 4 XN1Byte, data area requires 8 XN in total1And (4) Byte. Considering that a part of the memory space is also required to be divided to store programs, and the number of points satisfies an integral multiple of 2, the maximum number of points that can be processed at one time in the DSP503a is: 216=65536。
Therefore, on C6416, the number of FFT points performed at one time is maximum: 65536. this is much larger than the number of points that can complete an FFT at one time using a general DSP chip.
Occupation of the external memory 503 b: the input data and output data each occupy 4 × N, the twiddle factor requires 4 × N, and therefore requires a total off-chip resource of 12 × N Byte.
In the invention, the key point for realizing the FFT algorithm of the N-point input data (large-point data) is as follows:
data is efficiently moved and utilized because it takes time to move data in and out each time. Therefore, in each stage of operation, the related data is carried in as much as possible, and multiple butterfly operations of the same stage can be performed. Meanwhile, the moved data can complete multi-stage butterfly operation as much as possible. Therefore, it is important to control the data carried in each time for efficient pipeline calculation.
When the method is specifically implemented on C6416, the EDMA601 efficiently moves data, a ping-pong algorithm is adopted, the data and the CPU603 CPU work in parallel, a large number of FFT can be calculated more efficiently, and the effect of running-line parallel operation is realized.
The signal processing method of the present invention will be described in detail below.
In this embodiment, the processing of the input data by the digital signal processor 503a includes the steps of:
step 1, carrying out grading treatment;
processing and storing the N-point input data step by step until the N-point input data is processed into maximum small point data; the maximum point number data is less than or equal to the maximum small point number data which can be processed by the digital signal processor;
step 2, dividing the N-point input data into N-point input data at the level of the maximum small point number data A number of subgroups, wherein n represents the number of subgroups; a represents maximum dot count data; n represents the number of points of the input data;
step 3, grouping processing is carried out;
and processing the maximum dot data of each group step by step, and storing the processed data in an external processor.
In step 1, the N point input data are processed step by step and stored, a ping-pong algorithm is adopted in each stage of processing, and the data of the 1 st to k th groups of data of the stage where the N point input data are located are processed in sequence, wherein The method comprises the following steps:
a. the signal processor 503a moves the 1 st group of data and corresponding twiddle factor in the external memory 503b to the internal memory 602 of the signal processor 503a, and the 1 st group of data is processed by the central processing unit 603 of the signal processor 503 a; at the same time, the 1 st group of data and corresponding twiddle factor in the external memory 503b are moved into the internal memory 602;
b. after the processing of the group 1 data is completed, the data is transferred back to the external memory 503 b;
c. when the group 2 data is completely moved, the central processing unit 603 processes the group 2 data;
d. after the processed 1 st group of data is completely moved back to the external memory 503b, the movement of the 3 rd group of data and the corresponding twiddle factor is started;
e. and (d) repeating the steps a to d until the kth group of data is processed.
The data of the group 1, group 2, … …, or k is transferred from the external memorymThe equal intervals are moved in sequence.
The processed 1 st, 2 nd, … … or k th group of data is moved to the original storage place of the 1 st, 2 nd, … … or k th group of data in the external memory.
And when the 1 st data, the 2 nd data, … … or the kth data are processed, dynamic overflow control is adopted.
In step 3, the maximum dot count data of each group is processed step by step and stored in the external processor 503b, including the steps of:
moving the maximum point data and the corresponding twiddle factors of each group to a central processor 603 of a signal processor 503a, and processing the maximum small point data from the stage to the M stage;
and after the M stage finishes processing, the processed data is subjected to sorting, and then the sorted data is output to an external memory for storage. Wherein, the reverse order algorithm is adopted for the order.
In the present embodiment, N is 222The present invention is illustrated by a decimation by frequency basis-4 FFT.
Input data is N-2224194304 is 4M large dot data, level M is 11, butterfly unit is 220And (4) respectively.
The twiddle factors are generated according to each level, wherein the twiddle factors can be generated and acquired by any existing method, and are sequentially stored in the off-chip memory 503b, so that the twiddle factors are convenient to call. All data, each complex point, with the real part preceding and the imaginary part following.
External memory 503 b: inputting data X2N, outputting data Y2N, twiddle factor W2N
The internal memory 602: x 2N4],w[2N4]。
1 butterfly unit needs 4 specific data to participate, and in order to enable the moved data to be calculated efficiently, each small group of data is taken:the data of each group is input into the array from the external memory 503b and sequentially accessed at 4 equally spaced places N5(N5×4=N4) And (4) point obtaining. After each group finishes one butterfly operation, the result is only moved to the outside of the chip according to the original position. When the new twiddle factor is moved next time, the previous twiddle factor group in the memory can be automatically flushed. The parameter settings are different for each stage.
From the above analysis, the maximum number of small-point FFTs implemented in the signal processor DSP503a is 21665536-64 k. In order to achieve the purpose of efficiently utilizing the CPU, a ping-pong algorithm is adopted and the requirement of a radix-4 FFT algorithm on the number of points is met.
Order N1=222,N2=220,N3=218,N4=216,N5=214Wherein in this example, take N5Is the maximum size point number.
As shown in fig. 10 and 12, the digital signal processing method includes the steps of:
step 1, carrying out grading operation;
at this time, 4-level independent butterfly operations are required to be divided into small dot operations.
The first 4-stage butterfly operation has 256 groups, and each group completes one-stage operation and then carries out the next-stage operation. The method comprises the following specific steps:
a first stage: divided into 1 large groups of N
1A plurality of dots, N
2Each butterfly unit has a different twiddle factor. 4 input data intervals N per butterfly unit
2A plurality of dots. The twiddle factors for the first set of butterfly elements are stored from external memory 503bW [0]Begin to move and get
Values are accumulated sequentially thereafter. After the butterfly operation is completed, the result is stored in the
external memory 503 b.
And a second stage: divided into 4 groups of N
2A plurality of dots, N
3Each butterfly unit in the same group has different twiddle factors. But the twiddle factors used for each large group are the same. 4 input data intervals N for butterfly units within a group
3And (4) points. Twiddle factor from external memory 503bW [6N
2]Is sequentially taken
A plurality of points. But every calculation of N
3After a number of butterfly units, and after a group has been computed, it is again necessary to access the external memory 503bW [6N ]
2]And sequentially taking twiddle factors.
And a third stage: divided into 16 large groups of N
3A plurality of dots, N
4Each butterfly unit in the same group has different twiddle factors. But the twiddle factors used for each large group are the same. 4 input data intervals N for butterfly units within a group
4And (4) points. The twiddle factors are stored from external memory 503bW [6 (N)
2+N
3)]Is sequentially taken
A plurality of points. But every calculation of N
4After each butterfly unit, i.e. after a group of butterfly units is calculated, it is necessary to start from W [6 (N) again
2+N
3)]And sequentially taking twiddle factors.
Fourth stage: divided into 64 large groups of N
4A plurality of dots, N
5Each butterfly unit in the same group has different twiddle factors. But the twiddle factors used for each large group are the same. 4 input data intervals N for butterfly units within a group
5And (4) points. Twiddle factor from W6 (N)
2+N
3+N
4)]Is sequentially taken
A plurality of points. But every calculation of N
5After each butterfly unit, i.e. after a group of butterfly units is calculated, it is necessary to start from W [6 (N) again
2+N
3+N
4)]And sequentially taking twiddle factors.
As can be seen from the above, the number of stages required in this section is different for different points N. The larger the number of points is, the more the number of stages of the required early-stage splitting is, and the more the moving times is.
The application of the ping-pong algorithm to the hierarchical operation is illustrated below:
step a. assume that the first set of data and twiddle factor has been stored in internal memory 602 from x1[0]To x1[2N5]And from w1[0 ]]To w1[2N5]In the unit, an interrupt request is submitted to the CPU603, and the CPU603 is requested to process the data of the unit;
step b. Simultaneous EDMA601 gets the second set of data and twiddle factors from the input array X [2N ] of external memory 503b]From x2[0 ] into the internal memory 602]To x2[2N5]And from w2[0 ]]To w2[2N5]In the unit;
step c. when stored in internal memory 602 from x1[0]To x1[2N5]And from w1[0 ]]To w1[2N5]The data and twiddle factors in the cell have been processed by CPU603, at which point an interrupt request is submitted, stored by EDMA601 in internal memory 602 from x1[0 ]]To x1[2N5]The data of the cell is transferred to the input array X [2N ] of the external memory 503b]The original storage position of (a);
step d. when the second set of data and twiddle factor have been moved to internal memory 602 from x2[0 ]]To x2[2N5]And from w2[0 ]]To w2[2N5]In the unit, an interrupt request is submitted to request the CPU603 to process the data;
step e. when step c is completed, i.e. EDMA601 completes storing data from x1[0 ] in internal memory 602]To x1[2N5]The data of the cell is transferred to the input array X [2N ] of the external memory 503b]Starting the transfer of the third set of data and twiddle factors from the original storage location of (1), and inputting the array X [2N ] from the external memory 503b]From x1[0 ] into the internal memory 602]To x1[2N5]And from w1[0 ]]To w1[2N5]In the unit;
and f, repeating the steps a to e until all the group data of the current stage complete the data processing.
In step 2, the N-point complex numbers are divided into 256(256 ═ N/a) subgroups.
Step 3, performing grouping operation;
each of the 256 groups is processed from 5 to 11 stages, the processed output data is sorted when the 11 th stage is completed, and then the sorted data is output to the external memory 503b for storage. Wherein, the reverse order algorithm is adopted for the order.
At this time, the N point data is divided into the maximum size point N which can be calculated in the slice5。
At this time, the number of N can be divided into 2565And (6) point FFT.
The entire data is loaded into the signal processor 503a, and after the calculation from the 5 th stage to the 11 th stage is completed for each sub-group, the data is unloaded. At this time, the twiddle factors of each group are the same, and after the first carry-in, the carry-in is not required to be repeated.
Input of X [0 ] from external memory 503b by EDMA]At the beginning, move 2N in sequence5Data of points to internal memory 602 x1[2N5],x2[2N5]Level 5 requires shifting the twiddle factor from W [6 (N)2+N3+N4+N5)]Moving 2N5Point to w1[2N5]. Since the twiddle factors used by each group are the same at this time, after the twiddle factors required by each stage after the first move-in, the move-in is not repeated.
The last stage of each group needs to store the result from the on-chip L2 Cache to the input Y [2N ] of the off-chip SDRAM in the original bit order.
Output data Y2N]Divided into 4 equal parts Y0,Y1,Y2,Y3. Each group being at the last stageCalculating the output position a corresponding to the first input of each butterfly according to a bit sequence algorithm (code bit inversion), and sequentially storing 4 outputs to Y0[a],Y[N/2+a]1,Y2[N+a],Y3[3N/4+a]。
In the above embodiment, an "overflow" phenomenon may occur when performing the fixed-point FFT operation. In this embodiment, a dynamic compression mode in fixed-point overflow control is adopted.
In the FFT, when the entire sequence is calculated at one stage, the modulus of the value in the sequence is generally increased, so that fixed point overflow control is necessary. Generally there are 3 types of methods for overflow control.
1. Input control: before FFT calculation, the input data is reduced according to a certain proportion, and the intermediate and final results are ensured not to overflow. This method is very fast, but the accuracy is very low because a relatively large right shift is required at the beginning, especially when the number of points is relatively large. Even for large points, this cannot be achieved at all by this method. The input control is more suitable for FFT with a small number of points.
2. Static compression: after each stage of FFT computation, the output is fixed and shifted right by N bits so that no overflow occurs in the next stage of computation. The method is simple and convenient to implement, each stage is fixed, but the precision loss is large. The fixed right shift N bits cannot be small because it is considered that there is no overflow in the whole flow. The suitable point number is not very small or very large, and the point number for realizing FFT in the chip is generally enough.
3. Dynamic compression: after each stage of the FFT, the output is analyzed for whether a right shift is required, and how many bits to right shift. Since each time it is decided from the output whether a right shift is required and a shift to the right by the minimum number of bits guarantees no overflow. This dynamic adjustment allows the FFT to maintain relatively high accuracy. This algorithm is suitable for large points.
In specific implementation, the input can be set to be in a Q.12 format, namely 3 protection bits are provided except for a sign bit, so that overflow of the next stage of butterfly operation is avoided. Q.12 also accords with signals collected by the current universal A/D chip and has rationality.
After each stage, it is checked whether the bi t increase makes the data exceed 12 bits, and the maximum bit number Maxbit of this time is recorded, and then all the outputs are right-shifted by (Maxbit-12) bits. At the last stage, no further inspection is required. Of course in a system the compression control needs to be checked if there is also an associated FFT or IFFT.
Further, the division into different groups (into 256 groups in the above example) in step 3 completes the FFT calculation of the number of small dots of the last part. Due to the different data input size of each section, it is not possible to guarantee the same number of reduction bits per subgroup at each stage. Therefore, the total right shift number of each small group is counted and accumulated, and after the final sorting, the data of each small group is adjusted according to the right shift number of the phase difference.
As shown in fig. 5, after the input data is processed by the above method, the processed output data may be displayed on the terminal 506, recognized by the target recognition unit 505, and subjected to spectrum analysis by the spectrum analysis unit 504.
In spectral analysis, many times, not every spectral component is required. For special cases, only the largest spectral component needs to be found, so the algorithm can be simplified. Searching the largest frequency spectrum component in each group, finally adjusting the maximum value of each group to the same right shift number, and searching the maximum frequency of the whole large-point FFT.
In the prior art, if a parallel processor is adopted for realizing FFT with large data volume, the cost is high, the design period is long, the parallel algorithm is complex, the processing board card is too large, and the method is not suitable for occasions with special requirements on small design scale. If a floating-point DSP is adopted, the number of data points is limited, the operation speed of a floating-point processor is not high enough, and the data splitting is complex due to a winogrd algorithm.
According to the invention, a single DSP is adopted to complete the design of large points, so that the cost is reduced; the single-chip DSP system has a small circuit board, a small corresponding product and strong usability; the hardware design is simplified, the design of a single DSP chip is realized, and the development period is shortened; a fixed-point processor is adopted, so that a larger number of points can be efficiently calculated; an extended direct memory access controller (EDMA: Enhanced DMAcontroller) specific to the TI C64 series is fully utilized to realize high-efficiency access to an off-chip memory; the algorithm is simplified, and the FFT point number is easy to expand and upgrade; due to the characteristic of the algorithm, the frequency spectrum after FFT calculation is easy to analyze, and the maximum peak value in the frequency spectrum is easy to find.
The above examples are intended to illustrate the invention, but not to limit the invention.