I'm designing a high-speed FIR filter on an FPGA. My sampling rate is 3600 MSPS, but the maximum clock supported by the device is 350 MHz. Please suggest how to go about multiple instantiation
or a parallel implementation of the FIR filter so that it meets the design requirement.
Also, please suggest how to pass samples to the parallel implementation.
It's difficult to answer your question based on the information you have given.
The first question I would ask myself is: can you reduce the sample rate at all? 3600 MSPS is very high. The sample rate only needs to be this high if you are truly supporting data requiring that bandwidth.
Assuming you do really need that rate, then in order to implement an FIR filter running at such a high sample rate, you will need to parallelise the architecture as you suggested. It's generally very easy to implement such a structure. An example approach is shown here:
http://en.wikipedia.org/wiki/Parallel_Processing_%28DSP_implementation%29#Parallel_FIR_Filters
Each clock cycle you will pass a parallel word into each filter section, and extract a word from the combined filter output.
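To make the sample handling concrete, here is a minimal NumPy sketch of the polyphase arithmetic (this is not synthesizable code, just the math a parallel hardware structure implements; with your numbers you would need at least ceil(3600/350) = 11 parallel lanes, so M = 12 or 16 would be natural choices):

```python
import numpy as np

def parallel_fir(x, h, M):
    """Polyphase parallel FIR: consume M input samples and produce M
    output samples per (slow) clock, so each sub-filter only has to
    run at sample_rate / M."""
    xp = np.pad(x, (0, -len(x) % M))     # pad to a whole number of blocks
    n = len(xp) // M
    X = xp.reshape(n, M).T               # X[q][i] = x[M*i + q]  (input phases)
    H = [h[p::M] for p in range(M)]      # H[p][j] = h[M*j + p]  (filter phases)
    Y = np.zeros((M, n))
    for m in range(M):                   # output phase
        for p in range(M):               # filter phase
            if len(H[p]) == 0:
                continue
            q = (m - p) % M              # input phase that lines up with it
            s = np.convolve(X[q], H[p])[:n]
            if m < p:                    # phase wrap-around needs a one-block delay
                s = np.concatenate(([0.0], s[:-1]))
            Y[m] += s
    return Y.T.ravel()[:len(x)]          # re-interleave the output phases

# Sanity check against a plain serial FIR.
rng = np.random.default_rng(0)
x, h = rng.standard_normal(64), rng.standard_normal(9)
assert np.allclose(parallel_fir(x, h, 4), np.convolve(x, h)[:len(x)])
```

In hardware, each `H[p]` sub-filter becomes its own MAC pipeline running at the slow clock, and the M input phases are simply the M samples of the wide input word delivered each cycle.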
But only you know the requirements and constraints of your FPGA design; you will have to craft the FIR filter according to your requirements.
I applied a discrete wavelet transform to horizontal wind speed data to obtain the plot below. I'm basically trying to use the information from the detail coefficients (the turbulent flow) for further analysis, but I'm not sure of the best direction to go in. I don't have much experience with the wavelet transform, so forgive me if there are obvious options, but the examples I've seen usually discard the higher-frequency information since it's the noise of the signal. Is there anything further I can do with this discrete wavelet transform, like statistical analysis or forecasting?
The path to pursue really depends on the question that you are trying to answer.
First of all, I would suggest double checking that your DWT is actually doing what you expect it to do. The plot that you shared suggests that it is successful in separating the low frequency coherent (laminar?) flow from the high frequency turbulent flow, but it would be helpful to figure out which frequencies are present in the high frequency component in order to confirm that the processing parameters (e.g. decomposition level) were properly chosen.
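For example, in a dyadic DWT the level-j detail coefficients cover roughly the band [fs/2^(j+1), fs/2^j]. A quick sanity check might look like this (a sketch assuming PyWavelets and a uniform sampling rate fs; your toolchain may differ):

```python
import numpy as np
import pywt

fs = 1.0  # sampling rate in Hz (assumption: uniformly sampled data)

# Approximate band covered by each detail level of a dyadic DWT.
for j in range(1, 6):
    print(f"D{j}: ~{fs / 2**(j + 1):.4f} to {fs / 2**j:.4f} Hz")

# Inspect the spectrum of one reconstructed detail component directly.
x = np.random.standard_normal(4096)        # stand-in for the wind speed series
coeffs = pywt.wavedec(x, "db4", level=5)   # [cA5, cD5, cD4, cD3, cD2, cD1]
keep = [np.zeros_like(c) for c in coeffs]
keep[-1] = coeffs[-1]                      # keep only the finest detail, D1
d1 = pywt.waverec(keep, "db4")
power = np.abs(np.fft.rfft(d1)) ** 2       # should concentrate in fs/4..fs/2
```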
Once convinced that your wavelet decomposition provides you with useful information about the turbulent flow, what should you do with these high pass filtered data?
I suggest computing their variance over 1-hour-long intervals. This is a measure of the "energy" of the signal over the chosen interval. If you are dealing with large amounts of data, this would allow you to boil down your time series to a single sample per hour. Maybe you will be able to spot diurnal variations in the turbulent flow (e.g. maybe turbulence is higher at dawn). If you have multiple stations, it would be interesting to study whether the turbulence variations share the same behavior.
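A minimal sketch of that computation (assuming a uniformly sampled 1-D array of detail-signal values and a known sampling rate):

```python
import numpy as np

def hourly_variance(detail, fs):
    """Variance of the high-pass (detail) signal over 1-hour windows.

    detail : 1-D array of high-pass filtered wind speed samples
    fs     : sampling rate in Hz (assumption: uniform sampling)
    """
    n = int(3600 * fs)                  # samples per hour
    m = len(detail) // n                # whole hours available
    blocks = detail[:m * n].reshape(m, n)
    return blocks.var(axis=1)           # one "turbulent energy" value per hour
```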
Before venturing into time series forecasting, I would really take a closer look at your data and try to identify trends or nail down possible outliers.
Last but not least, I would suggest posting your question on Physics Stack Exchange (e.g. https://physics.stackexchange.com/) rather than on SO.
Background: I am trying to find a set of floating-point parameters for a low-level controller that will keep a robot balanced while it is walking.
Question: Can anybody recommend any local search algorithms that will perform well for the domain I just described? The main criterion for me is the speed of convergence to the right solution.
Any help will be greatly appreciated!
P.S. Also, I conducted some research and found out that "Evolutionary Strategy" algorithms are a good fit for continuous state spaces. However, I am not entirely sure whether they will fit my particular problem well.
More info: I am trying to optimize 8 parameters (although it is possible for me to reduce the number of parameters to 4). I do have a simulator, and the cost that matters to me is the number of trials, because simulation resets are expensive (they take 10-15 seconds on average).
One of the best local search algorithms for a low number of dimensions (up to about 10 or so) is the Nelder-Mead simplex method. By the way, it is used as the default optimizer in MATLAB's fminsearch function. I personally used this method for finding the parameters of a textbook 2nd- or 3rd-order dynamic system (though a very simple one).
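If you are in Python, SciPy exposes the same method. A minimal sketch, with a toy quadratic standing in for your expensive simulator call (the `walking_cost` wrapper and the tolerance settings are my assumptions, not anything standard):

```python
import numpy as np
from scipy.optimize import minimize

def walking_cost(params):
    # Stand-in for the expensive simulator call (10-15 s per trial);
    # replace with: reset simulator, run one episode, return e.g. -distance.
    return np.sum((params - 0.3) ** 2)   # toy cost with a known optimum

x0 = np.zeros(8)                         # initial guess for the 8 parameters
res = minimize(walking_cost, x0, method="Nelder-Mead",
               options={"xatol": 1e-3, "fatol": 1e-3, "maxfev": 200})
print(res.x, res.nfev)                   # res.nfev = number of trials spent
```

Since your budget is measured in trials, `res.nfev` is the number that matters; cap it with `maxfev`.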
Another option is the already-mentioned evolutionary strategies. Currently the best one is the Covariance Matrix Adaptation ES, or CMA-ES. There are variations of this algorithm, e.g. BIPOP-CMA-ES, that are probably better than the vanilla version.
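A corresponding sketch using the pycma reference implementation (again with the toy cost; note that each `ask()` returns a whole population, roughly 10 candidates for 8 parameters by default, and every candidate costs one simulation):

```python
import numpy as np
import cma                               # pip install cma (pycma)

def walking_cost(params):
    return np.sum((params - 0.3) ** 2)   # same toy stand-in as above

es = cma.CMAEvolutionStrategy(8 * [0.0], 0.5)   # x0 and initial step size
while not es.stop():
    candidates = es.ask()                # one population of parameter vectors
    es.tell(candidates, [walking_cost(c) for c in candidates])
print(es.result.xbest, es.result.evaluations)
```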
You just have to try what works best for you.
In addition to evolutionary algorithms, I recommend you also look into reinforcement learning.
The right method depends a lot on the details of your problem. How many parameters? Do you have a simulator? Do you work in simulation only, or also with real hardware? Is speed measured in the number of trials or in CPU time?
When designing an embedded system, how can I tell in general when the floating point processing required will be too much for a standard microcontroller?
In case anyone is curious, the system I am designing is a Kalman filter and some motor control. However, I am looking for an engineering methodology for the general case.
The general way to find out whether a given processor can handle your problem is to estimate the number of floating-point operations that have to run per second and compare it to what the processor can deliver.
This ideal estimate will be affected by memory access times, I/O interrupts, etc. In practice, you'll have to run it (although you don't want to hear that).
For the Kalman filter case:
1. Know the sample rate and the sizes of the state vector and the measurement vector.
2. The complexity of the Kalman filter is dominated by the matrix inversion and multiple matrix multiplications: O(d^3), where d is the size of the state vector (or, for the information filter, the inverse formulation, O(z^3), where z is the size of the measurement vector). Online and in textbooks you'll find detailed analyses of the operations required for Kalman filters.
3. Find out which operations the algorithm actually runs and add up the number of operations required for each part; a worked example follows.
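As a back-of-envelope illustration of steps 1-3 (every number below is an assumption chosen only to show the method):

```python
# Rough FLOP budget for a conventional Kalman filter update.
d = 6        # state vector size (assumed)
z = 3        # measurement vector size (assumed)
fs = 500     # filter update rate in Hz (assumed)

# One update is dominated by a handful of d x d matrix multiplies
# (~2*d^3 FLOPs each) plus a z x z inversion (~z^3 FLOPs).
flops_per_update = 6 * d**3 + z**3
flops_per_second = flops_per_update * fs
print(f"{flops_per_second / 1e6:.2f} MFLOPS required")   # ~0.66 MFLOPS here

# Compare with the target: e.g. a Cortex-M4F retires very roughly one
# single-precision FLOP per cycle, so a 100 MHz part offers ~100 MFLOPS
# peak. Leave a large margin for memory access, I/O, and interrupts.
```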
The analysis is essentially the same for a general microcontroller or a DSP, except that some things come for free on the DSP.
I am interested in making a simple digital synthesizer to be implemented on an 8-bit MCU. I would like to make wavetables for accurate representations of the sound. Standard wavetables seem either to have a table for each of several frequencies, or to have a single sample stepped through in fractional increments, with the missing data interpolated by the program to create different frequencies.
Would it be possible to create a single table for a given waveform, likely of a low frequency, and change the rate at which the program polls the table to generate different frequencies? My MCU (a free one; no budget) is rather slow, so I have space for neither lots of wavetables nor large amounts of processing, and I am trying to skimp where I can. Has anyone seen this implementation?
You should consider using a single table with a phase accumulator and linear interpolation. See this question on DSP.SE for many useful suggestions.
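A minimal fixed-point sketch of that structure (Python for readability; on an 8-bit MCU the same arithmetic maps to a 24-bit integer accumulator and shifts in C; the table size and bit widths here are my assumptions):

```python
import numpy as np

TABLE_BITS = 8                       # 256-entry table
FRAC_BITS = 16                       # fractional phase resolution
PHASE_MASK = (1 << (TABLE_BITS + FRAC_BITS)) - 1

# One cycle of a sine, stored as signed 8-bit-ish integers.
table = np.round(127 * np.sin(2 * np.pi * np.arange(256) / 256)).astype(int)

def make_increment(freq, sample_rate):
    # Each output sample advances the phase by freq/sample_rate of a cycle.
    return int(freq / sample_rate * (1 << (TABLE_BITS + FRAC_BITS)))

def next_sample(state, inc):
    state = (state + inc) & PHASE_MASK
    idx = state >> FRAC_BITS                     # integer table index
    frac = state & ((1 << FRAC_BITS) - 1)        # fractional part of the index
    a, b = table[idx], table[(idx + 1) % 256]
    sample = a + ((b - a) * frac >> FRAC_BITS)   # linear interpolation
    return state, sample

# 440 Hz tone at an 8 kHz output rate:
state, inc = 0, make_increment(440.0, 8000.0)
out = []
for _ in range(100):
    state, sample = next_sample(state, inc)
    out.append(sample)
```

Changing pitch only means loading a different `inc`, so a single table covers every frequency; the frequency resolution is `sample_rate / 2**24` with these widths.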
I have to apply a convolution filter to each row of many images. The classic case is 360 images of 1024x1024 pixels; in my use case it is 720 images of 560x600 pixels.
The problem is that my code is much slower than what is advertised in articles.
I implemented the naive convolution, and it takes 2m 30s. I then switched to the FFT using FFTW. I used complex-to-complex transforms, filtering two rows per transform, and I'm now at around 20 s.
The thing is that articles advertise around 10 s, or even less, for the classic case.
So I'd like to ask the experts here if there could be a faster way to compute the convolution.
Numerical Recipes suggests avoiding the bit-reversal reordering done in the DFT and adapting the frequency-domain filter function accordingly, but there is no code example of how this could be done.
Maybe I lose time copying data. With a real-to-real transform I wouldn't have to copy the data into complex values, but I have to zero-pad anyway.
EDIT: see my own answer below for progress feedback and further information on solving this issue.
Question (precise reformulation):
I'm looking for an algorithm or piece of code to apply a very fast convolution to a discrete, non-periodic function (512 to 2048 values). Apparently the discrete Fourier transform (via the FFT) is the way to go. However, I'd like to avoid the data copy and conversion to complex, and to avoid the butterfly reordering.
FFT is the fastest technique known for convolving signals, and FFTW is the fastest free library available for computing the FFT.
The key for you to get maximum performance (outside of hardware; the GPU is a good suggestion) will be to pad your signals to a power-of-two length. When creating your FFTW plan, use the 'patient' setting to get the best performance. It's highly unlikely that you will hand-roll a faster implementation than what FFTW provides (forget about N.R.). Also, be sure to use the real version of the forward 1D FFT rather than the complex version, and use single precision if you can.
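For reference, the batched real-FFT row convolution looks like this (NumPy shown for brevity; with FFTW you would create the analogous r2c plan with FFTW_PATIENT once and reuse it across rows):

```python
import numpy as np

def filter_rows(img, h):
    """FFT convolution of every image row with kernel h, using the
    real-input FFT and zero-padding to the next power of two."""
    n_rows, n_cols = img.shape
    full = n_cols + len(h) - 1               # linear (not circular) length
    nfft = 1 << (full - 1).bit_length()      # next power of two >= full
    H = np.fft.rfft(h, nfft)                 # kernel spectrum, computed once
    Y = np.fft.rfft(img, nfft, axis=1) * H   # all rows in one batched call
    return np.fft.irfft(Y, nfft, axis=1)[:, :full]

rng = np.random.default_rng(0)
img, h = rng.standard_normal((600, 560)), rng.standard_normal(31)
out = filter_rows(img, h)
assert np.allclose(out[0], np.convolve(img[0], h))
```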
If FFTW is not cutting it for you, then I would look at Intel's (very affordable) IPP library. They have hand-tuned FFTs for Intel processors, optimized for images with various bit depths.
Paul
CenterSpace Software
You may want to add image processing as a tag.
But this article may be of interest, especially given the assumption that the image size is a power of 2. You can also see where they optimize the FFT. I expect that the articles you are looking at made some assumptions and then optimized the equations for those assumptions.
http://www.gamasutra.com/view/feature/3993/sponsored_feature_implementation_.php
If you want to go faster you may want to use the GPU to actually do the work.
This book may be helpful for you, if you go with the GPU:
http://www.springerlink.com/content/kd6qm361pq8mmlx2/
This answer is to collect progress report feedback on this issue.
Edit, 11 Oct.:
The execution time I measured doesn't reflect the actual time of the FFT. I noticed that when my program ends, the CPU is still busy in system time, up to 42%, for about 10 s. When I wait until the CPU is back to 0% before restarting my program, I get a 15.35 s execution time, which comes from the GPU processing. I get the same time if I comment out the FFT filtering.
So the FFT is in fact currently faster than the GPU and was simply hindered by a competing system task. I don't know yet what this system task is. I suspect it results from the allocation of a huge heap block where I copy the processing result before writing it to disk. For the input data I use a memory map.
I'll now change my code to get an accurate measurement of the FFT processing time. Making it faster is still relevant, because there is room to optimize the GPU processing, for instance by pipelining the transfer of the data to process.