How to perform output binarization of a Torch model (Lua)

I have to binarize the output o of a Torch model (Lua script). The value range is [-1, +1], and I want to threshold those values such that:
0 if o[i] < 0
1 if o[i] >= 0
The output is composed of 32 layers of 1x1 float tensors, so 32 floats; I want to get 32 bits from those 32 floats, but I cannot find a layer that allows me to do that.
At the moment I have a for loop that checks the value of each layer, but it is very slow.
Maybe I can use the Threshold layer or implement one on my own. Do you have any advice?

You can use the element-wise 'greater than or equal' operator: https://github.com/torch/torch7/blob/master/doc/maths.md#torchgea-b
local threshold_tensor = o:ge(0) -- a ByteTensor of 0s and 1s, no explicit loop needed
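If it helps to see the idea outside Torch, here is the same vectorized thresholding in numpy (purely an illustration of replacing the per-element loop with one comparison):

import numpy as np

# Toy stand-in for the 32 float outputs in [-1, +1].
o = np.array([-0.7, 0.2, -0.1, 0.9], dtype=np.float32)

# One vectorized comparison: 0 where o[i] < 0, 1 where o[i] >= 0.
bits = (o >= 0).astype(np.uint8)
print(bits)  # [0 1 0 1]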

Related

How to calculate a 512-point FFT using a 2048-point FFT hardware module

I have a 2048-point FFT IP core. How may I use it to calculate a 512-point FFT?
There are different ways to accomplish this, but the simplest is to replicate the input data 4 times to obtain a signal of 2048 samples. Note that the DFT (which is what the FFT computes) can be seen as assuming that the input signal is replicated infinitely. Thus, we are just providing a larger "view" of this infinitely long periodic signal.
The resulting FFT will have 512 non-zero values, with zeros in between. Each of the non-zero values will also be four times as large as the 512-point FFT would have produced, because there are four times as many input samples (that is, if the normalization is as commonly applied: no normalization in the forward transform and 1/N normalization in the inverse transform).
Here is a proof of principle in MATLAB:
data = randn(1,512);
ft = fft(data); % 512-point FFT
data = repmat(data,1,4);
ft2 = fft(data); % 2048-point FFT
ft2 = ft2(1:4:end) / 4; % keep every 4th bin, undo the factor of 4: the 512-point FFT
assert(all(ft2==ft))
(Very surprisingly, the values are exactly equal; no differences due to numerical precision appeared in this case!)
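The same proof of principle in numpy, for those who prefer Python:

import numpy as np

data = np.random.randn(512)
ft = np.fft.fft(data)        # 512-point FFT

data4 = np.tile(data, 4)     # replicate the input 4 times -> 2048 samples
ft2 = np.fft.fft(data4)      # 2048-point FFT
ft2 = ft2[::4] / 4           # keep every 4th bin, undo the factor of 4

assert np.allclose(ft2, ft)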
An alternative to the solution provided by Cris Luengo, which does not require any rescaling, is to pad the data with zeros to the required length of 2048 samples. You then get your result by reading every 2048/512 = 4th output (i.e. output[0], output[4], output[8], ... in a 0-based indexing system).
Since you mention making use of a hardware module, this could be implemented in hardware by connecting the first 512 input pins and grounding all other inputs, and reading every 4th output pin (ignoring all other output pins).
Note that this works because the FFT of the zero-padded signal is an interpolation in the frequency domain of the original signal's FFT. In this case you do not need the interpolated values, so you can just ignore them. Here's an example computing a 4-point FFT using a 16-point module (I've reduced the size of the FFT for brevity, but kept the same ratio of 4 between the two):
x = [1, 2, 3, 4]
fft(x)
ans> 10.000+0.000j,
     -2.000+2.000j,
     -2.000+0.000j,
     -2.000-2.000j
x = [1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
fft(x)
ans> 10.000+0.000j,  6.499-6.582j, -0.414-7.242j, -4.051-2.438j,
     -2.000+2.000j,  1.808+1.804j,  2.414-1.242j, -0.257-2.339j,
     -2.000+0.000j, -0.257+2.339j,  2.414+1.242j,  1.808-1.804j,
     -2.000-2.000j, -4.051+2.438j, -0.414+7.242j,  6.499+6.582j
As you can see in the second output, the first column (which corresponds to outputs 0, 4, 8 and 12) is identical to the desired output from the first, smaller-sized FFT.
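The same zero-padding check in numpy:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
small = np.fft.fft(x)       # 4-point FFT

padded = np.zeros(16)       # zero-pad to 16 samples
padded[:4] = x
large = np.fft.fft(padded)  # 16-point FFT

# Every 4th output of the padded FFT matches the 4-point FFT, no rescaling needed.
assert np.allclose(large[::4], small)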

How are matrices stored in memory?

Note - may be more related to computer organization than software, not sure.
I'm trying to understand something related to data compression, say for JPEG photos. Essentially a very dense matrix is converted (via discrete cosine transforms) into a much sparser matrix. Supposedly it is this sparse matrix that is stored. Take a look at this link:
http://en.wikipedia.org/wiki/JPEG
Compare the original 8x8 sub-block image example to matrix "B", which has been transformed to have lower-magnitude values overall and many more zeros throughout. How is matrix B stored such that it saves much more memory than the original matrix?
The original matrix clearly needs 8x8 (number of entries) x 8 bits/entry, since values can range randomly from 0 to 255. OK, so I think it's pretty clear we need 64 bytes of memory for this. Matrix B, on the other hand, hmmm. The best-case scenario I can think of is that values range from -26 to +5, so at most an entry (like -26) needs 6 bits (5 bits to encode 26, plus 1 bit for the sign, I guess). So then you could store 8x8x6 bits = 48 bytes.
The other possibility I see is that the matrix is stored in "zig zag" order from the top left. Then we can specify a start and an end address and just keep storing along the diagonals until we're left with only zeros. Let's say it's a 32-bit machine; then the 2 addresses (start + end) take 8 bytes; for the other non-zero entries, at 6 bits each, we have to go along almost all the top diagonals, storing a total of 28 elements. In total this scheme would take 29 bytes.
To summarize my question: if JPEG and other image encoders claim to save space by using algorithms to make the image matrix less dense, how is this extra space actually realized on my hard disk?
Cheers
The DCT needs to be accompanied by other compression schemes that take advantage of the zeros and highly frequent values it produces. A simple example is run-length encoding.
JPEG uses a variant of Huffman coding.
As it says under "Entropy coding", a zig-zag pattern is used, together with RLE, which already reduces the size in many cases. However, as far as I know the DCT doesn't give a sparse matrix per se, but it does concentrate the block's energy into a few coefficients, which makes the entropy coding more effective. The compression becomes lossy at the quantization step: the input matrix is transformed with the DCT, then the values are quantized, and then Huffman encoding is applied.
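A toy sketch of that quantization step (all values are made up for illustration; real JPEG uses 8x8 quantization tables):

import numpy as np

# DCT coefficients are divided by a quantization table and rounded;
# small (mostly high-frequency) values collapse to 0, which is the lossy part.
coeffs = np.array([[-415.0, -30.0],
                   [-61.0,   56.0]])   # made-up DCT values
qtable = np.array([[16.0, 11.0],
                   [12.0, 12.0]])      # made-up table entries
quantized = np.round(coeffs / qtable)
print(quantized)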
The simplest compression would take advantage of repeated sequences of symbols (zeros here). A matrix in memory may look like this (suppose decimal digits):
0000000000000100000000000210000000000004301000300000000004
After compression it may look like this
(0,13)1(0,11)21(0,12)43010003(0,11)4
(Symbol,Count)...
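A toy Python sketch of this run-length idea (my own illustration, not JPEG's actual entropy coder; the minimum run length of 4 is chosen so that short zero runs stay literal, as in the example above):

def rle_zeros(s, min_run=4):
    # Encode runs of '0' of length >= min_run as (0,count);
    # everything else is copied through as a literal symbol.
    out, i = [], 0
    while i < len(s):
        if s[i] == '0':
            j = i
            while j < len(s) and s[j] == '0':
                j += 1
            run = j - i
            out.append('(0,%d)' % run if run >= min_run else '0' * run)
            i = j
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)

# Prints the run-length encoded form of the example string above.
print(rle_zeros('0000000000000100000000000210000000000004301000300000000004'))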
As I understand it, JPEG doesn't only compress, it also drops data. After the 8x8 block is transformed to the frequency domain, it drops the insignificant (high-frequency) data, which means it only has to store the significant 6x6 or even 4x4 portion. That is how it can achieve a higher compression rate than lossless methods (like GIF).

Understanding FFT in aurioTouch2

I've been looking at aurioTouch 2 from Apple's sample code (found here). At the end of the day I want to analyze the frequencies myself. For now I'm trying to understand some of what's going on here. My apologies if this is trivial; I'm just trying to understand some of the uncommented magic numbers floating around in the source. My main points of confusion right now are:
Why do they zero out the Nyquist value in FFTBufferManager::ComputeFFT? Can this value really just be thrown away? (~line 112 of FFTBufferManager.cpp).
They scale everything down by -128 dB, so I'm assuming that the results are thus in the range (-128, 0). However, later in aurioTouchAppDelegate.mm (~line 807), they convert this to a value between 0 and 1 by adding 80 and dividing by 64, then clamping to 0 and 1. Why the fuzziness? Also, am I right in assuming values will be in the vicinity of (-128, 0)?
Well, it's not trivial for me either, but this is how I understand it. If I've oversimplified, it is purely for my benefit; I don't mean to be patronising.
Zeroing the result corresponding to the Nyquist frequency:
I'm going to suppose we are computing the forward FFT of 1024 input samples. At 44100 Hz input this is usually true in my case (but isn't what aurioTouch is doing, which I find a bit weird, but I'm no expert). It's easier for me to understand with specific values.
Given 1024 (n) input samples, arranged as needed (even indexes first, then odd indexes: { in[0], in[2], in[4], …, in[1], in[3], in[5], … }) (use vDSP_ctoz() to order your input)
The output of the FFT of 1024 (n) input samples is 513 ((n/2)+1) complex values, i.e. 513 real components and 513 imaginary components, a total of 1026 values.
However, imaginary[0] and imaginary[512] (n/2) are always, necessarily, zero. So by placing real[512] (the real component of the Nyquist frequency bin) at imaginary[0] and forgetting imaginary[512] - which is always zero and can be inferred - the results are packed into a 1024 (n) length buffer.
So, for the returned results to be valid, you must at least set imaginary[0] back to zero. If you require all 513 ((n/2)+1) frequency bins, you need to append another complex value to the result and set it thus:
unpackedVal = imaginary[0]   // the packed Nyquist real component
real[512] = unpackedVal
imaginary[512] = 0
imaginary[0] = 0
In aurioTouch I always supposed they just don't bother. n/2 results are obviously more convenient to work with, and you can hardly tell from the visualizer: "Oh look, it's missing one magnitude at the Nyquist frequency."
The UsingFourierTransforms docs explain the packing.
NB: the specific values 1024, 513, 512, etc. are examples, not the actual values of n, (n/2)+1, n/2 from aurioTouch.
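Here is a small numpy sketch of why the packing works (numpy rather than vDSP, purely for illustration):

import numpy as np

n = 1024
x = np.random.randn(n)
spec = np.fft.rfft(x)                  # (n/2)+1 complex bins

# For real input, the DC and Nyquist bins are purely real,
# which is what lets vDSP stash real[n/2] in imaginary[0].
assert abs(spec[0].imag) < 1e-9
assert abs(spec[n // 2].imag) < 1e-9

# Simulate the packed layout: n floats instead of n+2.
packed_real = spec[:n // 2].real.copy()
packed_imag = spec[:n // 2].imag.copy()
packed_imag[0] = spec[n // 2].real     # Nyquist real stashed in imag[0]

# Unpacking, as described above:
nyquist_real = packed_imag[0]
packed_imag[0] = 0.0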
They scale everything down by -128 dB
Not quite: the range of the output values is relative to the number of input samples, so it has to be normalised. The scale is 1.0/(2*inNumberFrames).
After scaling, the range is -1.0 to +1.0. The magnitude of the complex vector is then taken (the phase is ignored), giving a scalar value for each frequency bin between 0 and 1.0.
This value is then interpreted as a decibel value between -128 and 0.
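As a sketch of that magnitude-to-decibel step (the -128 floor is taken from the range discussed here, not read out of the aurioTouch source):

import math

def to_db(magnitude, floor_db=-128.0):
    # Map a normalized magnitude in (0, 1] to decibels, clamped at a floor.
    if magnitude <= 0.0:
        return floor_db
    return max(20.0 * math.log10(magnitude), floor_db)

print(to_db(1.0))    # 0.0 dB (full scale)
print(to_db(0.001))  # -60.0 dB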
The drawing stuff… +80 / 64. …*120… …I'm not sure. I may be completely wrong, or it may be …artistic license?

How to normalize OpenCV feature descriptors to an integer scale?

The OpenCV SURF implementation returns a sequence of 64/128 32-bit float values (the descriptor) for each feature point found in the image. Is there a way to normalize these float values and take them to an integer scale (for example, [0, 255])? That would save significant space (1 or 2 bytes per value instead of 4). Besides, the conversion should ensure that the descriptors remain meaningful for other uses, such as clustering.
Thanks!
There are other feature extractors than SURF. The BRIEF extractor uses only 32 bytes per descriptor, with 32 unsigned bytes [0-255] as its elements. You can create one like this:
Ptr<DescriptorExtractor> ptrExtractor = DescriptorExtractor::create("BRIEF");
Be aware that a lot of image processing routines in OpenCV need or assume that the data is stored as floating-point numbers.
You can treat the float features as an ordinary image (Mat or CvMat) and then use cv::normalize(). Another option is using cv::norm() to find the range of descriptor values and then Mat::convertTo() to convert to CV_8U. Look up the OpenCV documentation for these functions.
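A minimal sketch of that idea in Python/numpy (min-max scaling chosen for illustration; cv2.normalize with NORM_MINMAX does the equivalent per-matrix scaling):

import numpy as np

def quantize_descriptors(desc):
    # desc: float32 array of shape (num_keypoints, 64 or 128).
    # Min-max scale to [0, 1], then quantize to unsigned bytes.
    lo, hi = desc.min(), desc.max()
    scaled = (desc - lo) / (hi - lo)
    return np.round(scaled * 255.0).astype(np.uint8)

desc = np.random.randn(10, 64).astype(np.float32)  # toy descriptors
print(quantize_descriptors(desc).dtype)            # uint8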
The descriptor returned by cv::SurfFeatureDetector is already normalized. You can verify this by taking the L2 Norm of the cv::Mat returned, or refer to the paper.

Kohonen SOM Maps: Normalizing the input with unknown range

According to "Introduction to Neural Networks with Java By Jeff Heaton", the input to the Kohonen neural network must be the values between -1 and 1.
It is possible to normalize inputs where the range is known beforehand:
For instance RGB (125, 125, 125) where the range is know as values between 0 and 255:
1. Divide by 255: (125/255) = 0.5 >> (0.5,0.5,0.5)
2. Multiply by two and subtract one: ((0.5*2)-1)=0 >> (0,0,0)
The question is how can we normalize the input where the range is unknown like our height or weight.
Also, some other papers mention that the input must be normalized to the values between 0 and 1. Which is the proper way, "-1 and 1" or "0 and 1"?
You can always use a squashing function to map an infinite interval to a finite interval, e.g. tanh.
You might want to use tanh(x * l) with a manually chosen l, though, in order not to put too many objects in the same region. So if you have a good guess that the maximal values of your data are +/- 500, you might want to use tanh(x / 1000) as a mapping, where x is the value of your object. It might even make sense to subtract your guess of the mean from x first, yielding tanh((x - mean) / max).
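A small numpy sketch of this squashing (the mean and scale guesses are hand-picked, as suggested above):

import numpy as np

def squash(x, mean=0.0, scale=1000.0):
    # Map unbounded inputs into (-1, 1); 'scale' is a guess at the
    # data's characteristic magnitude, chosen by hand.
    return np.tanh((x - mean) / scale)

heights_cm = np.array([150.0, 170.0, 210.0])
print(squash(heights_cm, mean=170.0, scale=50.0))  # values well inside (-1, 1)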
From what I know about Kohonen SOMs, the specific normalization does not really matter.
Well, it might matter through specific choices for the values of the learning algorithm's parameters, but the most important thing is that the different dimensions of your input points have to be of the same order of magnitude.
Imagine that each data point is not a pixel with the three RGB components but a vector with statistical data for a country, e.g. area, population, ....
It is important for the convergence of the learning part that all these numbers are of the same magnitude.
Therefore, it does not really matter if you don't know the exact range, you just have to know approximately the characteristic amplitude of your data.
For weight and size, I'm sure that if you divide them respectively by 200 kg and 3 meters, all your data points will fall in the ]0, 1] interval. You could even use 50 kg and 1 meter; the important thing is that all coordinates would be of order 1.
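For instance, a minimal numpy sketch of that per-dimension scaling (the 200 kg and 3 m scales are the rough bounds suggested above):

import numpy as np

# Each row is one data point: (weight in kg, height in m) -- toy values.
data = np.array([[70.0, 1.75],
                 [95.0, 1.62],
                 [55.0, 1.80]])

# Divide each column by its characteristic scale so all coordinates are O(1).
scales = np.array([200.0, 3.0])
print(data / scales)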
Finally, you could also consider running a linear analysis tool such as POD (proper orthogonal decomposition) on the data, which would automatically give you a way to normalize your data and a subspace for the initialization of your map.
Hope this helps.
