Update 3/5/14: RACSequence is deprecated in ReactiveCocoa 3.0 (see comments), but I'm still curious about the best way to process data in batches, either using RACSequence or RACSignal.
I'm trying to process a potentially large array of values using ReactiveCocoa. My idea is to split the input sequence into smaller sequences up to a maximum size, and then process them separately (potentially in parallel).
I've written a category method on RACSequence to do the chunking:
// Returns a sequence of sequences of at most `size` objects
- (RACSequence *)chunk:(NSInteger)size
{
    if ([self head] == nil) {
        return [RACSequence empty];
    }

    RACSequence *chunk = [self take:size];
    RACSequence *rest = [self skip:size];

    return [RACSequence sequenceWithHeadBlock:^id {
        return chunk;
    } tailBlock:^RACSequence *{
        return [rest chunk:size];
    }];
}
This works, but it's extremely slow for large sequences. I wrote an iterative version and a utility to compare the two approaches. Here's the difference for a sequence of 10000 numbers broken into smaller sequences of up to 30 objects:
$ time ./main i 10000 30 # Iterative
2014-03-04 10:48:28.845 main[59637:507] Number of chunks: 334
real 0m0.012s
user 0m0.007s
sys 0m0.004s
$ time ./main r 10000 30 # Recursive
2014-03-04 10:48:45.423 main[59645:507] Number of chunks: 334
real 0m15.513s
user 0m15.133s
sys 0m0.378s
Is there a better way to process a RACSequence in smaller batches?
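For reference, the iterative utility used for the timings above isn't shown; the underlying idea is simply to walk the collection once and cut it into fixed-size pieces. A minimal sketch of that iterative approach, written in Go purely for illustration (it is not the Objective-C utility used for the benchmark):

package main

import "fmt"

// chunk splits xs into consecutive slices of at most size elements.
func chunk(xs []int, size int) [][]int {
    var chunks [][]int
    for len(xs) > 0 {
        n := size
        if len(xs) < n {
            n = len(xs)
        }
        chunks = append(chunks, xs[:n])
        xs = xs[n:]
    }
    return chunks
}

func main() {
    xs := make([]int, 10000)
    fmt.Println("Number of chunks:", len(chunk(xs, 30))) // prints 334
}

Splitting 10000 elements into pieces of at most 30 gives the same 334 chunks reported above, without any recursive sequence construction.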
The following code declares two arrays and then iterates over stdin (it just blindly iterates over the input; there is no interaction with the arrays).
This causes a continuous increase in memory.
However, if I just declare the two arrays and sleep, there is no increase in memory.
Similarly, if I just iterate over stdin, there is no increase in memory.
But together (beyond the memory allocated for the arrays), there is a continuous increase.
I measure this by watching the RES memory in the top tool.
I have commented out the first few lines in func doSomething() to show that there is no memory increase while they are commented out. Uncommenting those lines and running again causes the increase.
NOTE: This was run on Go 1.4.2, 1.5.3, and 1.6.
NOTE: You will need to reproduce this on a machine with at least 16 GB of RAM, as I have only observed it with an array size of 1 billion.
package main

import (
    "bufio"
    "fmt"
    "io"
    "os"
)

type MyStruct struct {
    arr1 []int
    arr2 []int
}

func (ms *MyStruct) Init(size int, arr1 []int, arr2 []int) error {
    fmt.Printf("initializing mystruct arr1...\n")
    ms.arr1 = arr1
    if ms.arr1 == nil {
        ms.arr1 = make([]int, size, size)
    }

    fmt.Printf("initializing mystruct arr2...\n")
    ms.arr2 = arr2
    if ms.arr2 == nil {
        ms.arr2 = make([]int, size, size)
    }

    fmt.Printf("done initializing ...\n")
    for i := 0; i < size; i++ {
        ms.arr1[i] = 0
        ms.arr2[i] = 0
    }
    return nil
}

func doSomething() error {
    fmt.Printf("starting...\n")
    fmt.Printf("allocating\n")
    /* NOTE WHEN UNCOMMENTED CAUSES MEMORY INCREASE
    ms := &MyStruct{}
    size := 1000000000
    ms.Init(size, nil, nil)
    */
    fmt.Printf("finished allocating..%d %d\n", len(ms.arr1), len(ms.arr2))
    fmt.Printf("reading from stdin...\n")
    reader := bufio.NewReader(os.Stdin)

    var line string
    var readErr error
    var lineNo int = 0
    for {
        if lineNo%1000000 == 0 {
            fmt.Printf("read %d lines...\n", lineNo)
        }
        lineNo++
        line, readErr = reader.ReadString('\n')
        if readErr != nil {
            fmt.Printf("break at %s\n", line)
            break
        }
    }
    if readErr == io.EOF {
        readErr = nil
    }
    if readErr != nil {
        return readErr
    }
    return nil
}

func main() {
    if err := doSomething(); err != nil {
        panic(err)
    }
    fmt.Printf("done...\n")
}
Is this an issue with my code? Or is the Go runtime doing something unintended?
If it's the latter, how can I go about debugging this?
To make it easier to reproduce, here are pastebin files for the good case (the commented version of the code above) and the bad case (with the block uncommented):
wget http://pastebin.com/raw/QfG22xXk -O badcase.go
yes "1234567890" | go run badcase.go
wget http://pastebin.com/raw/G9xS2fKy -O goodcase.go
yes "1234567890" | go run goodcase.go
Thank you, Volker, for your comments above. I wanted to capture the process of debugging this as an answer.
The RES column in top/htop only tells you, at the process level, what is going on with memory. GODEBUG="gctrace=1" gives you more insight into how the heap is being handled.
A simple run with gctrace set gives the following:
root@localhost ~ # yes "12345678901234567890123456789012" | GODEBUG="gctrace=1" go run badcase.go
starting...
allocating
initializing mystruct arr1...
initializing mystruct arr2...
gc 1 @0.050s 0%: 0.19+0.23+0.068 ms clock, 0.58+0.016/0.16/0.25+0.20 ms cpu, 7629->7629->7629 MB, 7630 MB goal, 8 P
done initializing ...
gc 2 @0.100s 0%: 0.070+2515+0.23 ms clock, 0.49+0.025/0.096/0.24+1.6finished allocating..1000000000 1000000000
ms cpu, 15258->15258reading from stdin...
->15258 MB, 15259read 0 lines...
MB goal, 8 P
gc 3 @2.620s 0%: 0.009+0.32+0.23 ms clock, 0.072+0/0.20/0.11+1.8 ms cpu, 15259->15259->15258 MB, 30517 MB goal, 8 P
read 1000000 lines...
read 2000000 lines...
read 3000000 lines...
read 4000000 lines...
....
read 51000000 lines...
read 52000000 lines...
read 53000000 lines...
read 54000000 lines...
What does this mean?
As you can see, the GC hasn't run for a while now. This means that all the garbage generated by reader.ReadString hasn't been collected and freed.
Why isn't the garbage collector collecting this garbage?
From the Go GC documentation:
Instead we provide a single knob, called GOGC. This value controls
the total size of the heap relative to the size of reachable objects.
The default value of 100 means that total heap size is now 100% bigger
than (i.e., twice) the size of the reachable objects after the last
collection.
Since GOGC wasn't set, the default of 100 applied. So the garbage would only have been collected once the heap reached roughly twice the live set. The two 1-billion-element int arrays are about 7629 MB each, giving a live heap of ~15258 MB, so the next collection would only trigger at around 30 GB (the 30517 MB goal in the gc 3 line above).
How can I change this?
Try setting GOGC=25.
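The same knob can also be set from inside the program via runtime/debug instead of the environment variable; a minimal sketch (the debugging session below sticks with GOGC):

package main

import "runtime/debug"

func main() {
    // Equivalent to running with GOGC=25: trigger a collection once the heap
    // has grown 25% beyond the live set left by the previous collection.
    previous := debug.SetGCPercent(25)
    _ = previous // the old setting is returned, in case you want to restore it

    // ... rest of the program ...
}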
With GOGC set to 25 via the environment variable:
root@localhost ~ # yes "12345678901234567890123456789012" | GODEBUG="gctrace=1" GOGC=25 go run badcase.go
starting...
allocating
initializing mystruct arr1...
initializing mystruct arr2...
gc 1 @0.051s 0%: 0.14+0.30+0.11 ms clock, 0.42+0.016/0.31/0.094+0.35 ms cpu, 7629->7629->7629 MB, 7630 MB goal, 8 P
done initializing ...
finished allocating..1000000000 1000000000
gc 2 @0.102s reading from stdin...
12%: 0.058+2480+0.26 ms clock, 0.40+0.022/2480/0.10+1.8 ms cpu, 15258->15258->15258 MB, 15259 MB goal, 8 P
read 0 lines...
gc 3 @2.584s 12%: 0.009+0.20+0.22 ms clock, 0.075+0/0.24/0.046+1.8 ms cpu, 15259->15259->15258 MB, 19073 MB goal, 8 P
read 1000000 lines...
read 2000000 lines...
read 3000000 lines...
read 4000000 lines...
....
read 19000000 lines...
read 20000000 lines...
gc 4 @6.539s 4%: 0.019+2.3+0.23 ms clock, 0.15+0/2.1/12+1.8 ms cpu, 17166->17166->15258 MB, 19073 MB goal, 8 P
As you can see, another GC was triggered.
But top/htop shows the process stable at ~20 GB instead of the calculated 16 GB.
The garbage collector doesn't have to give memory back to the OS. It will sometimes keep it for efficient reuse in the future rather than repeatedly taking memory from the OS and handing it back. The extra ~4 GB sits in its pool of free space, to be used before it asks the OS again.
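If you want to see how much of that heap the runtime is holding versus what it has handed back, runtime.ReadMemStats and debug.FreeOSMemory can show you; a small illustrative sketch (not part of the original debugging session):

package main

import (
    "fmt"
    "runtime"
    "runtime/debug"
)

func main() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    fmt.Printf("HeapAlloc=%d MB HeapIdle=%d MB HeapReleased=%d MB\n",
        m.HeapAlloc>>20, m.HeapIdle>>20, m.HeapReleased>>20)

    // Ask the runtime to run a GC and return as much memory to the OS as it can.
    debug.FreeOSMemory()

    runtime.ReadMemStats(&m)
    fmt.Printf("after FreeOSMemory: HeapReleased=%d MB\n", m.HeapReleased>>20)
}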
I have implemented an FFT on an AT32UCB-series microcontroller using the KissFFT library, and I am currently struggling with the FFT output.
My intention is to analyse sound coming from a piezo speaker.
Currently, the frequency of the sounder is 420 Hz, which I successfully recover from the FFT output (cross-checked with an oscilloscope). However, the reported frequency is just half of what I expect when I feed a function-generator waveform into the system.
I suspect I have the frequency-bin calculation formula wrong; I am currently using fft_peak_magnitude_index * sampling_frequency / fft_size.
My input is real and I am doing a real FFT (output samples = N/2).
I am also doing IIR filtering and windowing before the FFT.
Any suggestion would be a great help!
// IIR filter calculation, n = 256 fft points
for (ctr = 0; ctr < n; ctr++)
{
    // filter calculation
    y[ctr] = num_coef[0]*x[ctr];
    y[ctr] += (num_coef[1]*x[ctr-1]) - (den_coef[1]*y[ctr-1]);
    y[ctr] += (num_coef[2]*x[ctr-2]) - (den_coef[2]*y[ctr-2]);
    y1[ctr] = y[ctr] - 510; // eliminate dc offset

    // hamming window
    hamming[ctr] = (0.54-((0.46) * cos(2*M_PI*ctr/n)));
    window[ctr] = hamming[ctr]*y1[ctr];

    fft_input[ctr].r = window[ctr];
    fft_input[ctr].i = 0;
    fft_output[ctr].r = 0;
    fft_output[ctr].i = 0;
}

kiss_fftr_cfg fftConfig = kiss_fftr_alloc(n, 0, NULL, NULL);
kiss_fftr(fftConfig, (kiss_fft_scalar * )fft_input, fft_output);

peak = 0;
freq_bin = 0;
for (ctr = 0; ctr < n1; ctr++)
{
    fft_mag[ctr] = 10*(sqrt((fft_output[ctr].r * fft_output[ctr].r) + (fft_output[ctr].i * fft_output[ctr].i)))/(0.5*n);

    if (fft_mag[ctr] > peak)
    {
        peak = fft_mag[ctr];
        freq_bin = ctr;
    }

    frequency = (freq_bin*(10989/n)); // 10989 is the sampling freq

    //************************************
    // Usart write
    char filtResult[10];
    //sprintf(filtResult, "%04d %04d %04d\n", (int)peak, (int)freq_bin, (int)frequency);
    sprintf(filtResult, "%04d %04d %04d\n", (int)x[ctr], (int)fft_mag[ctr], (int)frequency);

    char c;
    char *ptr = &filtResult[0];
    do
    {
        c = *ptr;
        ptr++;
        usart_bw_write_char(&AVR32_USART2, (int)c);
        // sendByte(c);
    } while (c != '\n');
}
The main problem is likely to be how you declared fft_input.
Based on your previous question, you are allocating fft_input as an array of kiss_fft_cpx. The function kiss_fftr, on the other hand, expects an array of scalars. By casting the input array to kiss_fft_scalar * with:
kiss_fftr(fftConfig, (kiss_fft_scalar * )fft_input, fft_output);
KissFFT essentially sees an array of real-valued data which contains a zero at every second sample (what you filled in as the imaginary parts). This is effectively an upsampled version (although without interpolation) of your original signal, i.e. a signal with effectively twice the sampling rate (which is not accounted for in your freq_bin-to-frequency conversion). To fix this, I suggest you pack your data into a kiss_fft_scalar array:
kiss_fft_scalar fft_input[n];
...
for (ctr = 0; ctr < n; ctr++)
{
    ...
    fft_input[ctr] = window[ctr];
    ...
}
kiss_fftr_cfg fftConfig = kiss_fftr_alloc(n, 0, NULL, NULL);
kiss_fftr(fftConfig, fft_input, fft_output);
Note also that while looking for the peak magnitude, you are probably only interested in the final largest peak, not the running maximum. As such, you could limit the loop to only computing the peak (using freq_bin instead of ctr as an array index in the subsequent sprintf statements if needed):
for (ctr = 0; ctr < n1; ctr++)
{
    fft_mag[ctr] = 10*(sqrt((fft_output[ctr].r * fft_output[ctr].r) + (fft_output[ctr].i * fft_output[ctr].i)))/(0.5*n);

    if (fft_mag[ctr] > peak)
    {
        peak = fft_mag[ctr];
        freq_bin = ctr;
    }
} // close the loop here before computing "frequency"
Finally, when computing the frequency associated with the bin with the largest magnitude, you need to ensure the computation is done using floating-point arithmetic. If, as I suspect, n is an integer, your formula would compute the 10989/n factor using integer arithmetic, resulting in truncation. This can be simply remedied with:
frequency = (freq_bin*(10989.0/n)); // 10989 is the sampling freq
Specifically, I would like nn.LogSoftMax not to use OMP when the input tensor is small. I have a small script to test the run time.
require 'nn'

my_lsm = function(t)
    o = torch.zeros((#t)[1])
    sum = 0.0
    for i = 1,(#t)[1] do
        o[i] = torch.exp(t[i])
        sum = sum + o[i]
    end
    o = o / sum
    return torch.log(o)
end

ii = torch.randn(arg[1])
m = nn.LogSoftMax()

timer = torch.Timer()
timer:stop()
timer:reset()
timer:resume()
my_lsm(ii)
print(timer:time().real)

timer:stop()
timer:reset()
timer:resume()
m:forward(ii)
print(timer:time().real)
If arg[1] is 10, then my basic log softmax function runs much faster:
0.00021696090698242
0.033425092697144
But once arg[1] is 10,000,000, OMP really helps a lot:
29.561321973801
0.11547803878784
So I suspect that the OMP overhead is very high. If my code has to call log softmax many times with small inputs (say, a tensor of size 3), it will cost too much time. Is there a way to manually disable OMP in some cases (but not always)?
Is there a way to manually disable OMP in some cases (but not always)?
If you really want to do that, one possibility is to use torch.setnumthreads and torch.getnumthreads like this:
local nth = torch.getnumthreads()
torch.setnumthreads(1)
-- do something
torch.setnumthreads(nth)
So you can monkey-patch nn.LogSoftMax as follows:
nn.LogSoftMax.updateOutput = function(self, input)
    local nth = torch.getnumthreads()
    torch.setnumthreads(1)
    local out = input.nn.LogSoftMax_updateOutput(self, input)
    torch.setnumthreads(nth)
    return out
end
I’ve been looking at a few example programs in order to find better ways to code with Dart.
Not that this example (below) is of any particular importance; it is taken from rosettacode.org, with alterations by me to (hopefully) bring it up to date.
The point of this post concerns benchmarks, and how the speed of printing to the console may hurt Dart's results in some benchmarks compared to other languages. I don't know how Dart compares to other languages here, but its console output (at least on Windows) appears to be quite slow, even when using a StringBuffer.
As an aside, in my test, if n1 is allowed to grow to 11, the total recursion count exceeds 238 million, and Example 1 takes about 2.9 seconds to run on my laptop.
Also of possible interest: if the String assignment is changed to an int, with no printing, no elapsed time is recorded at all (Example 2).
Typical times on my low-spec laptop (run from the console on Windows):
Elapsed Microseconds (Print) = 26002
Elapsed Microseconds (StringBuffer) = 9000
Elapsed Microseconds (no Printing) = 3000
Obviously, in this case console print time is a significant factor relative to the computation time.
So, can anyone advise how this compares to, e.g., Java console output times? That would at least indicate whether Dart is particularly slow in this area, which may be relevant to some benchmarks. Incidentally, runs inside the Dart Editor incur a negligible penalty for printing.
// Example 1. The base code for the test (Ackermann).
main() {
  for (int m1 = 0; m1 <= 3; ++m1) {
    for (int n1 = 0; n1 <= 4; ++n1) {
      print("Acker(${m1}, ${n1}) = ${fAcker(m1, n1)}");
    }
  }
}

int fAcker(int m2, int n2) => m2 == 0 ? n2 + 1 : n2 == 0 ?
    fAcker(m2 - 1, 1) : fAcker(m2 - 1, fAcker(m2, n2 - 1));
The altered code for the test.
// Example 2 //
main() {
  fRunAcker(1); // print
  fRunAcker(2); // StringBuffer
  fRunAcker(3); // no printing
}

void fRunAcker(int iType) {
  String sResult;
  StringBuffer sb1;
  Stopwatch oStopwatch = new Stopwatch();
  oStopwatch.start();
  List lType = ["Print", "StringBuffer", "no Printing"];

  if (iType == 2) // Use StringBuffer
    sb1 = new StringBuffer();

  for (int m1 = 0; m1 <= 3; ++m1) {
    for (int n1 = 0; n1 <= 4; ++n1) {
      if (iType == 1) // print
        print("Acker(${m1}, ${n1}) = ${fAcker(m1, n1)}");
      if (iType == 2) // StringBuffer
        sb1.write("Acker(${m1}, ${n1}) = ${fAcker(m1, n1)}\n");
      if (iType == 3) // no printing
        sResult = "Acker(${m1}, ${n1}) = ${fAcker(m1, n1)}\n";
    }
  }

  if (iType == 2)
    print(sb1.toString());

  oStopwatch.stop();
  print("Elapsed Microseconds (${lType[iType-1]}) = " +
      "${oStopwatch.elapsedMicroseconds}");
}

int fAcker(int m2, int n2) => m2 == 0 ? n2 + 1 : n2 == 0 ?
    fAcker(m2 - 1, 1) : fAcker(m2 - 1, fAcker(m2, n2 - 1));
//Typical times on my low-spec laptop (run from the console).
// Elapsed Microseconds (Print) = 26002
// Elapsed Microseconds (StringBuffer) = 9000
// Elapsed Microseconds (no Printing) = 3000
I tested using Java, which was an interesting exercise.
The results from this small test indicate that Dart takes about 60% longer than Java for console output, using the fastest result for each. I really need to do a larger test with more terminal output, which I will do.
In terms of "computational" speed with no output, using this test with m = 3 and n = 10, the comparison is consistently around 530 milliseconds for Java versus 580 milliseconds for Dart. That is 59.5 million calls. Java bombs out with n = 11 (238 million calls), which I presume is a stack overflow. I'm not saying this is a definitive benchmark of much, but it is an indication of something. Dart appears to be very close in computational time, which is pleasing to see. I also altered the Dart code from using the conditional (?:) operator to using "if" statements, the same as Java, and that appears to be consistently about 10% faster.
I ran a further test for console printing, as shown below (Example 1 – Dart, Example 2 – Java).
The best times for each are as follows (100,000 iterations) :
Dart 47 seconds.
Java 22 seconds.
Dart Editor 2.3 seconds.
While it is not earth-shattering, it does appear to illustrate that, for some reason, (a) Dart is slow with console output, (b) the Dart Editor is extremely fast with console output, and (c) this needs to be taken into account when evaluating any performance test that involves console output, which is what initially drew my attention to it.
Perhaps when they have time :) the Dart team could look at this if it is considered worthwhile.
Example 1 - Dart
// Dart - Test 100,000 iterations of console output //
Stopwatch oTimer = new Stopwatch();
main() {
// "warm-up"
for (int i1=0; i1 < 20000; i1++) {
print ("The quick brown fox chased ...");
}
oTimer.reset();
oTimer.start();
for (int i2=0; i2 < 100000; i2++) {
print ("The quick brown fox chased ....");
}
oTimer.stop();
print ("Elapsed time = ${oTimer.elapsedMicroseconds/1000} milliseconds");
}
Example 2 - Java
public class console001
{
// Java - Test 100,000 iterations of console output
public static void main (String [] args)
{
// warm-up
for (int i1=0; i1<20000; i1++)
{
System.out.println("The quick brown fox jumped ....");
}
long tmStart = System.nanoTime();
for (int i2=0; i2<100000; i2++)
{
System.out.println("The quick brown fox jumped ....");
}
long tmEnd = System.nanoTime() - tmStart;
System.out.println("Time elapsed in microseconds = "+(tmEnd/1000));
}
}
I am implementing the Naive Bayes algorithm for text classification. I have ~1000 documents for training and 400 documents for testing. I think I've implemented the training part correctly, but I am confused about the testing part. Here is what I've done, briefly:
In my training function:
vocabularySize = GetUniqueTermsInCollection(); // get all unique terms in the entire collection
spamModelArray[vocabularySize];
nonspamModelArray[vocabularySize];

for each training_file {
    class = GetClassLabel();   // 0 = spam, 1 = non-spam
    document = GetDocumentID();
    counterTotalTrainingDocs++;
    if (class == 0) {
        counterTotalSpamTrainingDocs++;
    }

    for each term in document {
        freq = GetTermFrequency;  // how many times this term appears in this document
        id = GetTermID;           // unique id of the term
        if (class == 0) {         // SPAM
            spamModelArray[id] += freq;
            totalNumberofSpamWords++;     // total number of terms marked as spam in the training docs
        } else {                  // NON-SPAM
            nonspamModelArray[id] += freq;
            totalNumberofNonSpamWords++;  // total number of terms marked as non-spam in the training docs
        }
    } // for

    for i in vocabularySize {
        spamModelArray[i] = spamModelArray[i] / totalNumberofSpamWords;
        nonspamModelArray[i] = nonspamModelArray[i] / totalNumberofNonSpamWords;
    } // for

    priorProb = counterTotalSpamTrainingDocs / counterTotalTrainingDocs; // prior probability of the spam documents
}
I think I understood and implemented the training part correctly, but I am not sure I implemented the testing part properly. Here, I go through each test document and calculate logP(spam|d) and logP(non-spam|d) for it. Then I compare these two quantities to determine the class (spam/non-spam).
In my testing function:
vocabularySize = GetUniqueTermsInCollection; // get all unique terms in the entire collection

for each testing_file {
    document = GetDocumentID;
    logProbabilityofSpam = 0;
    logProbabilityofNonSpam = 0;

    for each term in document {
        freq = GetTermFrequency;  // how many times this term appears in this document
        id = GetTermID;           // unique id of the term
        // logP(w1 w2 ... wn) = sum over j of C(wj) * logP(wj)
        logProbabilityofSpam    += freq * log(spamModelArray[id]);
        logProbabilityofNonSpam += freq * log(nonspamModelArray[id]);
    } // for

    // Now decide whether this document is spam or not
    if (logProbabilityofNonSpam + log(1 - priorProb) > logProbabilityofSpam + log(priorProb)) { // argmax[logP(d|ck) + logP(ck)]
        newclass = 1; // not spam
    } else {
        newclass = 0; // spam
    }
} // for
My problem is: I want to return the probability of each class instead of exact 1s and 0s (spam/non-spam). I want to see, e.g., newclass = 0.8684212, so I can apply a threshold later on. But I am confused here. How can I calculate the probability for each document? Can I use the log probabilities to calculate it?
The probability of the data described by a set of features {F1, F2, ..., Fn} belonging to class C, according to the naïve Bayes probability model, is
P(C|F) = P(C) * (P(F1|C) * P(F2|C) * ... * P(Fn|C)) / P(F1, ..., Fn)
You have all the terms (in logarithmic form) except for the 1 / P(F1, ..., Fn) term, since that's not used in the naïve Bayes classifier that you're implementing. (Strictly, the MAP classifier.)
You'd have to collect frequencies of the features as well, and from them calculate
P(F1, ..., Fn) = P(F1) * ... * P(Fn)
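Since P(F1, ..., Fn) is only a normalising constant, another way to get a number between 0 and 1 for thresholding is to normalise the two unnormalised log scores you already compute (logProbabilityofSpam + log(priorProb) and logProbabilityofNonSpam + log(1 - priorProb)) over the two classes. A minimal sketch in Go, with made-up scores and a hypothetical helper name:

package main

import (
    "fmt"
    "math"
)

// posteriorSpam converts the two unnormalised log scores into P(spam|d),
// subtracting the larger score first for numerical stability.
func posteriorSpam(logScoreSpam, logScoreNonSpam float64) float64 {
    m := math.Max(logScoreSpam, logScoreNonSpam)
    pSpam := math.Exp(logScoreSpam - m)
    pNonSpam := math.Exp(logScoreNonSpam - m)
    return pSpam / (pSpam + pNonSpam)
}

func main() {
    // Hypothetical scores for one test document.
    logScoreSpam := -310.2 + math.Log(0.4)    // logP(d|spam) + logP(spam)
    logScoreNonSpam := -305.9 + math.Log(0.6) // logP(d|non-spam) + logP(non-spam)
    fmt.Printf("P(spam|d) = %f\n", posteriorSpam(logScoreSpam, logScoreNonSpam))
}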