How many bootstraps does rfsrc in the R package randomForestSRC perform? - random-forest

Does anyone know how many bootstraps or subsamples a standard call to rfsrc performs?
rf1<-rfsrc(Surv(time,status)~., data=myTable)
Also, rf1$err.rate, which is described as the "cumulative OOB error rate", is at the above settings a vector of length 1000, with 999 elements being NA and only the last element being an error rate (between 0 and 0.5). Is that the expected behaviour? Is this last value the average error over all bootstraps?
Update: I have found a setting block.size, which regulates how many of the 1000 OOB error rates are returned. If you set it to e.g. 10, every tenth OOB error rate is filled in. However, what I am still not sure about is how many bootstraps each of these error rates is calculated on. Is each simply the error rate from a single bootstrap or subsample, or is it somehow averaged?

Per the documentation:
sampsize: Function specifying size of bootstrap data when by.root is in effect. For sampling without replacement, it is the requested size of the sample, which by default is .632 times the sample size. For sampling with replacement, it is the sample size. Can also be specified by using a number.

So by default, sampling is done without replacement and 63.2% of observations are sampled randomly for each tree in the forest.
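As a rough illustration of that default (a NumPy sketch of my own rather than R code, with made-up numbers): each tree draws floor(0.632 * n) distinct observations without replacement, and the remaining rows form that tree's out-of-bag set.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                       # number of observations in the data set
ntree = 5                      # a few illustrative trees

size = int(0.632 * n)          # default size for sampling without replacement
for _ in range(ntree):
    idx = rng.choice(n, size=size, replace=False)   # in-bag rows for this tree
    oob = np.setdiff1d(np.arange(n), idx)           # out-of-bag rows
print(size, n - size)          # 632 368
```

The OOB error for a tree is then computed only on its 368 held-out rows, which is why err.rate is reported per tree (or per block of trees, via block.size).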


Change in value of one frequency bin affects FFT and IFFT values of non-changing bins

I have a 3001x577 matrix. I want to apply an operation to the first 120 samples. I have applied it to the first 120 samples, which correspond to 20 Hz of frequency. The sampling rate is 2 ms, so I have Fnyq = 250 Hz. Now I have taken out the first 120 samples. I noticed that after applying the filter and replacing the older 120 samples, the values of bins greater than 120 had changed after I applied an IFFT, and this is evident in my final result. I got the desired filter result, but it ends up changing values of samples which I want untouched.
Can someone explain why a change in the value of a few frequency bins affects the IFFT or FFT of non-changing bins? I am using MATLAB. And how can I prevent it?
You took part of the spectrum (the first 120 samples), changed this part somehow and transformed the outcome back into the time domain using an IFFT. It is to be expected that the signal has changed beyond the 120 samples, since you manipulated frequency components, which alters all samples in the time domain. Think of it this way: you changed the amplitude (and phase) of 120 sinusoids and then expect the outcome to be limited to a certain time extent. Maybe you can post a new question where you describe what you actually want to achieve, instead of the experiment you perform to get the job done.
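To illustrate (a NumPy sketch of my own, not the asker's MATLAB): doubling even a single FFT bin perturbs essentially every time-domain sample, because each bin corresponds to a sinusoid spanning the whole record.

```python
import numpy as np

n = np.arange(64)
x = np.cos(2 * np.pi * 3 * n / 64) + 0.5 * np.cos(2 * np.pi * 10 * n / 64)

X = np.fft.fft(x)
X[3] *= 2.0                        # change a single frequency bin
x2 = np.fft.ifft(X).real           # back to the time domain

# The edit adds a sinusoid spanning the whole record, so nearly every
# time-domain sample moves, not just a local region:
print(int(np.sum(np.abs(x2 - x) > 1e-6)))   # 62 of 64 samples changed
```

The only unchanged samples are the two points where the added sinusoid happens to cross zero; there is no way to confine the effect of a bin edit to a limited time span.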

AudioKit FFT conversion to dB?

First time posting, thanks for the great community!
I am using AudioKit and trying to add frequency weighting filters to the microphone input and so I am trying to understand the values that are coming out of the AudioKit AKFFTTap.
Currently I am trying to just print the FFT buffer converted into dB values
for i in 0..<self.bufferSize {
    let db = 20 * log10((self.fft?.fftData[Int(i)])!)
    print(db)
}
I was expecting values ranging in the range of about -128 to 0, but I am getting strange values of nearly -200dB and when I blow on the microphone to peg out the readings it only reaches about -60. Am I not approaching this correctly? I was assuming that the values being output from the EZAudioFFT engine would be plain amplitude values and that the normal dB conversion math would work. Anyone have any ideas?
Thanks in advance for any discussion about this issue!
You need to sum all of the values from self.fft?.fftData (consider converting negative values to positive before summing) and then convert that total to decibels.
The values in the array correspond to the values of the bins in the FFT. Having a single bin contain a magnitude value close to 1 would mean that a great amount of energy is in that narrow frequency band e.g. a very loud sinusoid (a signal with a single frequency).
Normal sounds, such as the one caused by you blowing on the mic, spread their energy across the entire spectrum, that is, in many bins instead of just one. For this reason, usually the magnitudes get lower as the FFT size increases.
A magnitude of -40 dB in a single bin is quite loud. If you try to play a tone, you should see a clear peak in one of the bins.
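A small NumPy sketch (not AudioKit code; the normalization and the 1e-12 floor are my own choices) of why a tone peaks near 0 dB in a single bin while broadband noise at the same RMS level sits tens of dB lower in every bin:

```python
import numpy as np

nfft = 1024
n = np.arange(nfft)

# Full-scale tone centered exactly on bin 23: energy lands in one bin
tone = np.sin(2 * np.pi * 23 * n / nfft)

# Broadband noise scaled to the same RMS level as the tone
rng = np.random.default_rng(1)
noise = rng.standard_normal(nfft)
noise *= np.sqrt(np.mean(tone**2) / np.mean(noise**2))

def bin_db(x):
    # Normalize so a full-scale tone on a bin center reads about 0 dB
    mag = np.abs(np.fft.rfft(x)) / (len(x) / 2)
    return 20 * np.log10(mag + 1e-12)

print(bin_db(tone).max())    # ~0 dB: a single loud bin
print(bin_db(noise).max())   # tens of dB lower: energy spread over ~512 bins
```

This is why blowing on the mic, a broadband signal, never pushes a single bin anywhere near 0 dB even though the overall level is high.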

How does Caffe determine test set accuracy?

Using the BVLC reference AlexNet file, I have been training a CNN against a training set I created.  In order to measure the progress of training, I have been using a rough method to approximate the accuracy against the training data.  My batch size on the test net is 256.  I have ~4500 images.  I perform 17 calls to solver.test_nets[0].forward() and record the value of solver.test_nets[0].blobs['accuracy'].data (the accuracy of that forward pass).  I take the average across these.  My thought was that I was taking 17 random samples of 256 from my validation set and getting the accuracy of these random samplings.  I would expect this to closely approximate the true accuracy against the entire set.  However, I later went back and wrote a script to go through each item in my LMDB so that I could generate a confusion matrix for my entire test set.  I discovered that the true accuracy of my model was significantly lower than the estimated accuracy.  For example, my expected accuracy of ~75% dropped to ~50% true accuracy.  This is a far worse result than I was expecting.
My assumptions match the answer given here.
Have I made an incorrect assumption somewhere? What could account for the difference? I had assumed that the forward() function gathered a random sample, but I'm not so sure that is the case. blobs['accuracy'].data returned a different result (though usually within a small range) every time, which is why I assumed this.
I had assumed that the forward() function gathered a random sample, but I'm not so sure that is the case. blobs['accuracy'].data returned a different result (though usually within a small range) every time, which is why I assumed this.
The forward() function from Caffe does not perform any random sampling; it only fetches the next batch according to your DataLayer. E.g., in your case forward() will pass the next 256 images through your network. Performing this 17 times will sequentially pass 17x256=4352 images.
Have I made an incorrect assumption somewhere? What could account for the difference?
Check that the script that goes through your whole LMDB performs the same data pre-processing as during training.
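A toy NumPy illustration (my own made-up numbers, not Caffe code) of a related pitfall: because forward() walks the database in order, averaging 17 sequential batch accuracies can misestimate the whole-set accuracy whenever the LMDB is ordered (e.g. hard or under-represented examples stored at the end).

```python
import numpy as np

n, batch = 4500, 256

# Hypothetical per-image correctness: suppose the model gets the first
# 4000 DB entries right and the last 500 wrong (e.g. an under-represented
# class stored at the end of the LMDB).
correct = np.ones(n)
correct[4000:] = 0

# 17 sequential batches cover only images 0..4351
batch_accs = [correct[i*batch:(i+1)*batch].mean() for i in range(17)]
print(round(float(np.mean(batch_accs)), 3))   # 0.919 - sequential-batch estimate
print(round(float(correct.mean()), 3))        # 0.889 - true accuracy over all 4500
```

The gap here is mild because only 148 images are skipped; a mismatch as large as 75% vs 50% points more strongly at a pre-processing difference, as suggested above.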

how to understand "interchange of filtering with compressor/expander"

In Sec. 4.7 of the classic textbook "Discrete-Time Signal Processing" (3rd ed.), the efficient implementation of multi-rate processing is discussed in detail. The first method deals with the "interchange of filtering with compressor/expander", and the following figure shows the interchange in downsampling.
Since downsampling can cause aliasing, pre-filtering is necessary. In the figure, we can notice H(z) in (a) and H(z^M) in (b); however, if aliasing has occurred after downsampling in (a), can H(z) eliminate the aliasing? Thank you!
Yes, as long as the original filter was of the form H(z^M), meaning that only every Mth coefficient of the filter is non-zero.
The reason this is possible comes down to the fact that only every Mth input sample actually factors into the output sequence in this configuration. It is a special case, since input samples at indices that are not multiples of M are always cancelled out either by the filter's zero coefficients or by the decimator. It is unnecessary to even consider input samples at indices other than multiples of M.
This means you can decimate the input first and then apply the filter with its zero coefficients removed.
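A quick numerical check of this identity (a NumPy sketch of my own, with arbitrary filter taps and data): filtering with H(z^M) and then decimating by M matches decimating first and filtering with the prototype H(z).

```python
import numpy as np

rng = np.random.default_rng(2)
M = 3
h = rng.standard_normal(4)          # prototype filter H(z)

# Build H(z^M): the same taps with M-1 zeros inserted between them
hM = np.zeros(len(h) * M - (M - 1))
hM[::M] = h

x = rng.standard_normal(60)

# (a) filter with H(z^M), then keep every Mth output sample
ya = np.convolve(x, hM)[::M]
# (b) keep every Mth input sample, then filter with H(z)
yb = np.convolve(x[::M], h)

print(np.allclose(ya, yb))          # True: the two orderings agree
```

Ordering (b) is the efficient one, since the filter runs at the lower rate with M times fewer multiplies.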

Can FFT length affect filtering accuracy?

I am designing a fractional delay filter, and my Lagrange coefficients of order 5, h(n), have 6 taps in the time domain. I have tested convolving h(n) with x(n), a 5000-sample signal, using MATLAB, and the result seems OK. When I tried the FFT and IFFT method, the output was totally wrong. My FFT is computed with 8192 points in the frequency domain, which is the nearest power of 2 for the 5000-sample signal. For the IFFT portion, I convert the 8192 frequency-domain points back to 5000-length data in the time domain. So the problem is: why does this work in convolution, but not in FFT multiplication? Does converting my 6-tap h(n) to 8192 points in the frequency domain cause this problem?
Actually, I have tried the overlap-save method, which performs the FFT and multiplication on smaller chunks of x(n), doing it 5 times separately. The result seems slightly better than before, and at least I can see the waveform pattern, but it is still slightly distorted. So, any idea where this goes wrong, and what is the solution? Thank you.
The reason I am implementing the circular convolution in the frequency domain instead of the time domain is that I am trying to merge the Lagrange filter with another low-pass filter in the frequency domain, so that the implementation can be more efficient. Of course, I do believe implementing the filtering in the frequency domain will be much faster than convolution in the time domain. The LP filter has 120 taps in the time domain. Due to memory constraints, the raw data including the padding will be limited to 1024 in length, and so will the FFT bins.
Because my Lagrange filter has only 6 taps, which is hugely different from 1024 taps, I suspect that the FFT of the 6 taps to 1024 bins in the frequency domain may cause the error. Here is my MATLAB code for the Lagrange filter only. This is just test code, not implementation code. It's a bit messy, sorry about that. I would really appreciate any further advice on this problem. Thank you.
t = 1:5000;
fs = 2.5*(10^12);
A = 70000;
x = A*sin(2*pi*10.*t.*(10^6).*t./fs);    % chirp-like test signal
delay = 0.4;                             % fractional delay in samples
N = 5;                                   % Lagrange order (N+1 = 6 taps)
n = 0:N;
h = ones(1,N+1);
for k = 0:N                              % Lagrange fractional-delay coefficients
    index = find(n ~= k);
    h(index) = h(index) * (delay-k) ./ (n(index)-k);
end
pad = zeros(1,length(h)-1);              % overlap carried between blocks
out = [];
H = fft([h zeros(1,1024-length(h))]);    % filter zero-padded to the FFT length
for i = 0:1:ceil(length(x)/(1024-length(h)+1))-1
    if (i ~= ceil(length(x)/(1024-length(h)+1))-1)
        a = x(1,i*(1024-length(h)+1)+1:(i+1)*(1024-length(h)+1));
    else                                 % last block: zero-pad to the block length
        temp = x(1,i*(1024-length(h)+1)+1:length(x));
        a = [temp zeros(1,1024-length(h)+1-length(temp))];
    end
    xx = [pad a];
    X = fft(xx,1024);
    Y = H.*X;
    y = abs(ifft(Y,1024));               % abs() discards sign information here
    out = [out y(1,length(h):length(y))];
    pad = y(1,length(a)+1:length(y));
end
Some comments:
The nearest power of two is actually 4096. Do you expect the remaining 904 samples to contribute much? I would guess that they are significant only if you are looking for relatively low-frequency features.
How did you pad your signal out to 8192 samples? Padding your sample out to 8192 implies that approximately 40% of your data is "fictional". If you used zeros to lengthen your dataset, you likely injected a step change at the pad point - which implies a lot of high-frequency content.
A short code snippet demonstrating your methods couldn't hurt.
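For what it's worth, zero-padding by itself is harmless as long as the FFT length is at least N+M-1 and the real part of the inverse transform is trimmed to the linear-convolution length. A NumPy sketch with stand-in random data (not the asker's signal):

```python
import numpy as np

rng = np.random.default_rng(3)
h = rng.standard_normal(6)       # stand-in for the 6-tap Lagrange h(n)
x = rng.standard_normal(5000)    # stand-in for the 5000-sample x(n)

nfft = 8192                      # >= 5000 + 6 - 1, so no circular wrap-around
Y = np.fft.fft(x, nfft) * np.fft.fft(h, nfft)   # fft() zero-pads both inputs
y_fft = np.fft.ifft(Y).real[:len(x) + len(h) - 1]

y_ref = np.convolve(x, h)        # direct linear convolution, length 5005
print(np.allclose(y_fft, y_ref)) # True
```

So if the FFT route disagrees with direct convolution, the likely culprits are how the padded region is trimmed and how the complex IFFT output is converted back to a real signal, not the padding itself.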
