Huge memory usage with circular pipeline

Update 2:
As @Valle Lukas pointed out, it looks like this is due to a leak that is being addressed.
Update 1:
OK, I got around to trying this again and have a much simpler bit of code that demonstrates the issue I'm having:
my $channel = Channel.new;    # create a new channel
$channel.send(0);             # kickstart the circular pipeline

react {
    whenever $channel {
        say $_;
        $channel.send($_ + 1);    # send back to the same pipeline
        # sit back and watch the memory usage grow
    }
}
Basically, I create a single-stage pipeline via a Channel, send a single message to it, then set up a react/whenever block to process the message (add 1) and send it back to the same channel again. Once started, the pipeline never stops.
The growth in memory usage (I get to about 600 MB and climbing in about 10 seconds) isn't due to message buffering; there is only ever one message in the queue.
Is this simply a bug, or how can I address the memory usage of a channel?

This seems to be addressed by Jonathan Worthington's commits d5044de and 25b486d.

Related

How do I debug a memory issue in Rust?

I hope this question isn't too open-ended. I ran into a memory issue with Rust, where I got an "out of memory" from calling next on an Iterator trait object. I'm unsure how to debug it. Prints have only brought me to the point where the failure occurs. I'm not very familiar with other tools such as ltrace, so although I could create a trace (231MiB, pff), I didn't really know what to do with it. Is a trace like that useful? Would I do better to grab gdb/lldb? Or Valgrind?
In general, I would try the following approach:
Boilerplate reduction: Try to narrow down the OOM problem so that you don't have too much additional code around. In other words: the quicker your program crashes, the better. Sometimes it is also possible to rip out a specific piece of code and put it into an extra binary, just for the investigation.
Problem size reduction: Reduce the problem from an OOM to a simple "too much memory" case, so that you can actually tell that some part wastes memory even when it does not lead to an OOM. If it is too hard to tell whether you see the issue or not, you can lower the memory limit. On Linux, this can be done using ulimit:
ulimit -Sv 500000 # that's 500MB
./path/to/exe --foo
Information gathering: If your problem is small enough, you are ready to collect information with a lower noise level. There are multiple approaches you can try. Just remember to compile your program with debug symbols. It can also help to turn off optimizations, since they usually lead to information loss. Both can be achieved by NOT using the --release flag during compilation.
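For example, a minimal sketch of combining steps 2 and 3 (the binary name and --foo flag are placeholders):
cargo build                  # debug profile: symbols kept, optimizations off
ulimit -Sv 500000            # cap virtual memory at roughly 500 MB
./target/debug/exe --foo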
Heap profiling: One way is to use gperftools:
LD_PRELOAD="/usr/lib/libtcmalloc.so" HEAPPROFILE=/tmp/profile ./path/to/exe --foo
pprof --gv ./path/to/exe /tmp/profile/profile.0100.heap
This shows you a graph visualizing which parts of your program allocate how much memory. See the official docs for more details.
rr: Sometimes it's very hard to figure out what is actually happening, even after you have created a profile. Assuming you did a good job in step 2, you can use rr:
rr record ./path/to/exe --foo
rr replay
This will spawn a GDB with superpowers. The difference from a normal debug session is that you can not only continue but also reverse-continue. Basically, your program is executed from a recording where you can jump back and forth as you want. This wiki page provides some additional examples. One thing to point out is that rr only seems to work with GDB.
Good old debugging: Sometimes you get traces and recordings that are still way too large. In that case you can (in combination with the ulimit trick) just use GDB and wait until the program crashes:
gdb --args ./path/to/exe --foo
You should now get a normal debugging session where you can examine the current state of the program. GDB can also be launched with coredumps. The general problem with that approach is that you cannot go back in time and you cannot continue with execution. So you only see the current state, including all stack frames and variables. Here you could also use LLDB if you want.
(Potential) fix + repeat: After you have a clue about what might be going wrong, you can try to change your code. Then try again. If it's still not working, go back to step 3 and try again.
Valgrind and other tools work fine, and should work out of the box as of Rust 1.32. Earlier versions of Rust require changing the global allocator from jemalloc to the system's allocator so that Valgrind and friends know how to monitor memory allocations.
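On those older toolchains, switching to the system allocator is a small change; a minimal sketch (only needed before Rust 1.32):
use std::alloc::System;

// Route all heap allocations through the system allocator so that
// Valgrind and similar tools can observe them (pre-1.32 toolchains only).
#[global_allocator]
static GLOBAL: System = System;

fn main() {
    let v = vec![0u8; 1024]; // this allocation now goes through the system allocator
    println!("{}", v.len());
}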
In this answer, I use the macOS developer tool Instruments, as I'm on macOS, but Valgrind / Massif / Cachegrind work similarly.
Example: An infinite loop
Here's a program that "leaks" memory by pushing 1MiB Strings into a Vec and never freeing it:
use std::{thread, time::Duration};

fn main() {
    let mut held_forever = Vec::new();
    loop {
        held_forever.push("x".repeat(1024 * 1024));
        println!("Allocated another");
        thread::sleep(Duration::from_secs(3));
    }
}
In the Instruments output you can see the memory growth over time, as well as the exact stack trace that allocated the memory.
Example: Cycles in reference counts
Here's an example of leaking memory by creating an infinite reference cycle:
use std::{cell::RefCell, rc::Rc};

struct Leaked {
    data: String,
    me: RefCell<Option<Rc<Leaked>>>,
}

fn main() {
    let data = "x".repeat(5 * 1024 * 1024);
    let leaked = Rc::new(Leaked {
        data,
        me: RefCell::new(None),
    });

    let me = leaked.clone();
    *leaked.me.borrow_mut() = Some(me);
}
See also:
Why does Valgrind not detect a memory leak in a Rust program using nightly 1.29.0?
Handling memory leak in cyclic graphs using RefCell and Rc
Minimal `Rc` Dependency Cycle
In general, to debug, you can use either a log-based approach (either by inserting the logs yourself, or by having a tool such as ltrace, ptrace, ... generate the logs for you) or you can use a debugger.
Note that ltrace-, ptrace-, or debugger-based approaches require that you be able to reproduce the problem; I tend to favor manual logs because I work in an industry where bug reports are generally too imprecise to allow immediate reproduction (and thus we use logs to create the reproducer scenario).
Rust supports both approaches, and the standard toolset that one uses for C or C++ programs works well for it.
My personal approach is to have some logging in place to quickly narrow down where the issue occurs, and to fire up a debugger for a more fine-grained inspection if logging turns out to be insufficient. In this case I would recommend going straight for the debugger.
A panic is generated, which means that by breaking on the call to the panic hook, you get to see both the call stack and memory state at the moment where things go awry.
Launch your program with the debugger, set a breakpoint on the panic hook, run the program, profit.
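For example, a minimal sketch using the rust-gdb wrapper that ships with Rust (rust_panic is the unmangled symbol the runtime exposes specifically so debuggers can break on panics; the path and --foo flag are placeholders):
rust-gdb --args ./path/to/exe --foo
(gdb) break rust_panic
(gdb) run
(gdb) backtrace    # inspect the call stack at the moment of the panic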

Call to CFReadStreamRead stops execution in thread

NB: The entire code base for this project is so large that posting any meaningful amount would render this question too localised, so I have tried to distil the code down to the bare essentials. I'm not expecting anyone to solve my problems directly, but I will upvote those answers I find helpful or intriguing.
This project uses a modified version of AudioStreamer to play back audio files that are saved locally to the device (iPhone).
The stream is set up and scheduled on the current run loop using this code (unaltered from the standard AudioStreamer project as far as I know):
CFStreamClientContext context = {0, self, NULL, NULL, NULL};
CFReadStreamSetClient(
    stream,
    kCFStreamEventHasBytesAvailable | kCFStreamEventErrorOccurred | kCFStreamEventEndEncountered,
    ASReadStreamCallBack,
    &context);
CFReadStreamScheduleWithRunLoop(stream, CFRunLoopGetCurrent(), kCFRunLoopCommonModes);
The ASReadStreamCallBack calls:
- (void)handleReadFromStream:(CFReadStreamRef)aStream
                   eventType:(CFStreamEventType)eventType
On the AudioStreamer object, this all works fine until the stream is read using this code:
BOOL hasBytes = NO; //Added for debugging
hasBytes = CFReadStreamHasBytesAvailable(stream);
length = CFReadStreamRead(stream, bytes, kAQDefaultBufSize);
hasBytes is YES, but when CFReadStreamRead is called, execution stops. The app does not crash; it just stops executing. Any breakpoints below the CFReadStreamRead call are not hit, and ASReadStreamCallBack is not called again.
I am at a loss as to what might cause this; my best guess is that the thread is being terminated? But the hows and whys are why I'm asking SO.
Has anyone seen this behaviour before? How can I track it down? Ideas on how I might solve it will be very much welcome!
Additional Info Requested via Comments
This is 100% repeatable
CFReadStreamHasBytesAvailable was added by me for debugging but removing it has no effect
First, I assume that CFReadStreamScheduleWithRunLoop() is running on the same thread as CFReadStreamRead()?
Is this thread processing its run loop? Failure to do so is my main suspicion. Do you have a call like CFRunLoopRun() or equivalent on this thread?
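For reference, a minimal sketch of what processing the run loop on the scheduling thread could look like (not the asker's code; error handling omitted):
// The thread that schedules the stream must also run its run loop,
// otherwise kCFStreamEventHasBytesAvailable callbacks are never delivered.
CFReadStreamScheduleWithRunLoop(stream, CFRunLoopGetCurrent(), kCFRunLoopCommonModes);
if (CFReadStreamOpen(stream)) {
    CFRunLoopRun();   // blocks here, servicing stream events for this thread
}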
Typically there is no reason to spawn a separate thread for reading streams asynchronously, so I'm a little confused about your threading design. Is there really a background thread involved here? Also, typically CFReadStreamRead() would be in your client callback, when you receive the kCFStreamEventHasBytesAvailable event (which it appears to be in the linked code), but you're suggesting ASReadStreamCallBack is never called. How have you modified AudioStreamer?
It is possible that the stream pointer is just corrupt in some way. CFReadStreamRead should certainly not block if bytes are available (it certainly would never block for more than a few milliseconds for local files). Can you provide the code you use to create the stream?
Alternatively, CFReadStreams send messages asynchronously, but it is possible (though not likely) that it's blocking because the run loop isn't being processed.
If you prefer, I've uploaded my AudioPlayer inspired by Matt's AudioStreamer hosted at https://code.google.com/p/audjustable/. It supports local files (as well as HTTP). I think it does what you wanted (stream files from more than just HTTP).

Debugging Erlang heart timeouts

I use the heart program to restart an Erlang node when it becomes unresponsive. However, I am finding it hard to understand why the node freezes. SASL logs don't show any errors, and my own logs don't seem to show anything remarkable happening at those times. Can anybody give advice on debugging this sort of thing?
By default the heart program issues a SIGKILL to kill off the unresponsive VM so it can quickly start a new one. This makes getting any useful information about the VM pretty much impossible. Something I've tried in the past is to patch the heart program to avoid the hard kill and instead get the VM to create a crash dump and a coredump. I used a patch like this (this one is for Erlang/OTP R14B02):
--- erts/etc/common/heart.c.orig   2011-04-17 12:11:24.000000000 -0400
+++ erts/etc/common/heart.c        2011-04-17 12:12:36.000000000 -0400
@@ -559,10 +559,11 @@
     int res;
     if(heart_beat_kill_pid != 0){
         pid = (pid_t) heart_beat_kill_pid;
-        res = kill(pid,SIGKILL);
+        res = kill(pid,SIGUSR1);
+        sleep(4);
         for(i=0; i < 5 && res == 0; ++i){
             sleep(1);
-            res = kill(pid,SIGKILL);
+            res = kill(pid,i < 2 ? SIGQUIT : SIGKILL);
         }
         if(errno != ESRCH){
             print_error("Unable to kill old process, "
As you can see, with this patch heart will first issue a SIGUSR1 to try to get the VM to create a crash dump. Since this can take a while, heart then sleeps for 4 seconds. You might have to increase this sleep time if you're not getting full crash dumps. After that, heart tries twice to issue a SIGQUIT in the hope of getting a coredump, and if that fails, issues a SIGKILL.
Note that this patch will slow down heart's VM restart due to the time required to wait for the crash dumps and coredumps. If you use it in production, be aware of this limitation.
You could try to call erlang:halt/1 from your HEART_COMMAND, thus creating a crash dump from the unresponsive node.
You can try using the erl_call tool, e.g. with -a erlang halt 123.
If the Erlang node can't respond to this, that is also interesting information.
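A rough sketch of what that invocation could look like (the node name and cookie are placeholders; the argument list is passed as an Erlang term):
erl_call -c mycookie -n mynode -a 'erlang halt [123]'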
Did you try increasing HEART_BEAT_TIMEOUT? Maybe the node is just bogged down a bit and misses the timeout but doesn't freeze.
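For reference, a sketch of raising the timeout via the environment variable that heart reads at startup (the value is in seconds; 300 and the node name are just examples):
export HEART_BEAT_TIMEOUT=300
erl -heart -sname mynode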
If you have any idea why it is freezing, you could try to trace the module using dbg.
http://www.erlang.org/doc/man/dbg.html
In short try
dbg:tracer(), dbg:p(all,c), dbg:tpl(Module, Function, x).
If you want to stop this tracing, issue
dbg:ctpl()
See documentation for more info.
Note: Change Module and Function to whatever you want to trace, leave x as it is. You can also skip Function and only give Module, x.
Warning: Running this on a live system can be dangerous as the amount of information that is going to be printed to the shell can be enormous.

node.js process out of memory error

FATAL ERROR: CALL_AND_RETRY_2 Allocation Failed - process out of memory
I'm seeing this error and not quite sure where it's coming from. The project I'm working on has this basic workflow:
Receive XML post from another source
Parse the XML using xml2js
Extract the required information from the newly created JSON object and create a new object.
Send that object to connected clients (using socket.io)
Node Modules in use are:
xml2js
socket.io
choreographer
mysql
When I receive an XML packet, the first thing I do is write it to a log.txt file in case something needs to be reviewed later. I first call fs.readFile to get the current contents, then write the new contents plus the old. The log.txt file was probably around 2400 KB at the last crash, but upon restarting the server it's working fine again, so I don't believe this to be the issue.
I don't see a packet in the log right before the crash happened, so I'm not sure what's causing the crash... No new clients connected, no messages were being sent... nothing was being parsed.
Edit
Seeing as Node is running constantly, should I be using delete <object> after every object I'm using has served its purpose, such as var now = new Date(), which I use to compare to things that happened in the past? Or the result object from step 3 after I've passed it to the callback?
Edit 2
I am keeping a master object so that when a new client connects they can see past messages. Objects are deleted, though; they don't stay for the life of the server, just until they're completed on the client side. Currently, I'm doing something like this:
function parsingFunction(callback) {
    // Construct Object
    callback(theConstructedObject);
}

parsingFunction(function (data) {
    masterObject[someIdentifier] = data;
});
Edit 3
As another troubleshooting step, I dumped process.memoryUsage().heapUsed right before the parser starts and at parser.on('end', function() {..});, then parsed several XML packets. The highest heap used was around 10-12 MB throughout the test, although during normal conditions the program rests at about 4-5 MB. I don't think this is particularly a deal breaker, but it may help in finding the issue.
Perhaps you are accidentally closing over objects recursively. A crazy example:
function f() {
    var shouldBeDeleted = function (x) { return x; };
    return function g() { return shouldBeDeleted(shouldBeDeleted); };
}
To find out what is happening, fire up node-inspector and set a breakpoint just before the suspected out-of-memory error. Then click on "Closure" (below Scope Variables, near the right border). Perhaps if you click around, something will click and you'll realize what is happening.
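For reference, a rough sketch of how node-inspector was typically attached at the time (app.js is a placeholder; current Node versions use the built-in --inspect flag with Chrome DevTools instead):
npm install -g node-inspector
node-inspector &             # serves the debugger UI, by default on port 8080
node --debug-brk app.js      # start the app paused so breakpoints can be set first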

trouble reading from __global memory after atom_inc in OpenCL

OpenCL doesn't have a global barrier that will stop all threads, so I'm trying to create a workaround with the following code:
void barrier(__global uint* scratch) {
    uint nThreads = get_global_size(0);
    atom_inc(scratch);
    /* this loop never terminates */
    while (scratch[0] < nThreads) {
        continue;
    }
}
The idea is that each thread loops until all of them increment that one piece of memory.
However, the value read from scratch[0] never changes for the threads once it's been read, and it loops forever. I know it's being incremented because it's the correct value when I read it back to the host.
Is the global memory being locally cached? What's going on here?
Found the problem: the order in which work groups are executed is implementation-defined. This means that some threads might start only after others have finished.
In the code I gave, the work groups that are started first will loop forever waiting on the others to hit the 'barrier'. And the work groups that would have started later never start, because they're waiting for the first ones to finish.
If the implementation (I'm on a Radeon 5750, using Stream SDK 2.2) executes all work groups concurrently, then it probably wouldn't be an issue. But that's not the case for my setup.
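Not part of the original answer, but the usual way to get a device-wide synchronization point is to split the work into two kernel enqueues and let the boundary between them act as the global barrier. A rough host-side sketch (queue, kernelA, kernelB, and the work sizes are placeholders):
/* On an in-order queue, everything in kernelA completes before kernelB starts,
   which is the effect the spin-wait "barrier" was trying to achieve. */
clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &globalSize, &localSize, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &globalSize, &localSize, 0, NULL, NULL);
clFinish(queue);   /* block until both kernels have finished */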

Resources