Memory profiling on Google Cloud Dataflow - google-cloud-dataflow

What would be the best way to debug memory issues of a dataflow job?
My job was failing with a GC OOM error, but when I profile it locally I cannot reproduce the exact scenarios and data volumes.
I'm running it now on 'n1-highmem-4' machines, and I don't see the error anymore, but the job is very slow, so obviously using a machine with more RAM is not the solution :)
Thanks for any advice,
G

Please use the options --dumpHeapOnOOM and --saveHeapDumpsToGcsPath (see the docs).
These will only help if one of your workers actually OOMs. Additionally, you can try running jmap -dump against the harness process on the worker to obtain a heap dump at runtime, if the process isn't OOMing but you nevertheless observe high memory usage.
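For example, passing the options when launching the pipeline might look like this (the bucket path is illustrative):

    --dumpHeapOnOOM=true --saveHeapDumpsToGcsPath=gs://my-bucket/heap-dumps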

Related

Application slowdown due to zombie processes?

We face application downtime while uploading files to Azure storage via Arc.
There is no specific code error, but we face a timeout issue.
It gets resolved once the Azure web app is restarted.
This happens intermittently.
Since we could not find the root cause, we asked whether there is an issue on the Azure side.
The Microsoft team says the system health is OK but points towards accumulated zombie processes: EPMD and inet_gethost.
On searching, I understand that these are created by the Erlang runtime.
Is there a way to kill these zombie processes from time to time?
Also, do they contribute to the application downtime?
Thanks
Is there a way to kill these zombie processes from time to time?
If you're running a sensible init process, these zombie processes should be correctly reaped. This can often be a problem if you run Erlang as the top-level process inside a container, for instance. Can you give more detail about your environment?
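For reference, a quick way to check for zombies (their process state starts with Z) is something like:

    ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'

And if Erlang runs as PID 1 inside a Docker container, starting the container with docker run --init puts a minimal init in front of it that will reap defunct children.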
Also, do they really contribute to the application downtime?
Depends on how many of them there are, but probably not, no.

"The Dataflow appears to be stuck" for a job usually working

So I had a job for downloading some files that usually takes about 10 minutes. This one ran for more than an hour before it finally failed with the following (and only) error message:
Workflow failed. Causes: (3f03d0279dd2eb98): The Dataflow appears to be stuck. Please reach out to the Dataflow team at http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
So here I am :-)
The jobId: 2017-08-29_13_30_03-3908175820634599728
Just out of curiosity, will we be billed for the hour it was stuck? And what was the problem?
I'm working with Dataflow-Version 1.9.0
Thanks Google Dataflow Team
It seems as though the job had all its workers spending almost 100% of their time doing Java garbage collection (full GCs of about 7 seconds occurring roughly every 7 seconds).
Your next best step is to get a heap dump of the job by logging into one of the machines and using jmap. Use a heap dump analysis tool to inspect where all the memory is allocated. It is best to compare the heap dump of a properly functioning job against the heap dump of the broken job. If you would like further help from Google, feel free to contact Google Cloud Support and share this SO question and the heap dumps. This would be especially useful if you suspect the issue is somewhere within Google Cloud Dataflow.
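If you haven't done this before, a typical session on a worker looks something like the following (the PID and file path are illustrative):

    # quick histogram of which classes dominate the live heap
    jmap -histo:live <pid> | head -n 30

    # full heap dump for analysis in a tool such as Eclipse MAT or jhat
    jmap -dump:live,format=b,file=/tmp/worker-heap.hprof <pid>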

Jenkins server java.exe memory is growing very fast

We're running a Jenkins server with a few slaves that run the builds. Lately, more and more builds are running at the same time.
I see the memory usage of the java.exe process on the Jenkins server only increasing, never decreasing, even after the jobs have finished.
Any idea why this happens?
We're running Jenkins ver. 1.501.
Is there maybe a way to make the Jenkins server service wait until the last job is finished and then restart automatically?
I can't seem to find a reference on this (still posting an answer because it's too long for comments ;-) ) but this is what I've observed using the Oracle JVM:
If more memory than currently reserved is needed, the JVM reserves more. So far so good. What it doesn't seem to do is release the memory once it's not needed anymore. You can watch this behaviour by turning on the heap size indicator in Eclipse.
I'd say the same happens with Jenkins. A running Jenkins with only a few projects can easily jump the 1 GB mark. If you have a lot of concurrent builds, Jenkins needs a lot of memory at some point. After the builds are done and heap usage has decreased, the JVM keeps the memory reserved. It is practically "empty" but still claimed by the JVM, so it's unavailable to other processes.
Again: it's just an observation. I'd be happy if someone with deeper insight into Java memory management would back this up (or disprove it).
As for a practical solution, I'd say you're going to have to live with it to some extent. Jenkins IS very memory-hungry. Restarting it solves the problem only temporarily. At least it should stop claiming memory at some point, because the "empty" reserved memory should be reused. If it's not, that really sounds like a memory leak and would be a bug.
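If you want to at least cap how much the JVM can claim, you can bound the heap and, depending on the collector, encourage it to shrink back when usage drops (the values below are illustrative, not recommendations):

    java -Xms256m -Xmx1024m -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -jar jenkins.war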
Jenkins' [excessive] use of memory without bounds seems to be a common observation. The Jenkins Wiki gives some suggestions for "I'm getting OutOfMemoryErrors".
We have also found that the Monitoring Plugin is useful for keeping an eye on the memory usage and helping us know if we might need to restart Jenkins soon.
Is there maybe a way to make the Jenkins server service wait until the last job is finished and then restart automatically?
Check out the Restart Safely Plugin

Clojure STM Out of Memory

I have a small program that should perform parallel banking transfers using the STM, so I am testing it on different machines, a 2-core and a 1-core. On the 2-core machine everything works, but on the 1-core machine a Java Out of Memory error is thrown when I perform 1 million parallel transactions.
The error is the following: "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space
I also have a Java-synchronized version of the same program, which works; even though it is slower, it can reach a million transactions.
What can I do to make my Clojure application work on the 1-core machine? I am afraid the garbage collector can't handle so many Refs... what do you think?
Thanks a lot for your help!
Update:
It works now; I ran java -Xmx1000m -jar myprog.jar and it worked perfectly!
I didn't know it was possible to increase the heap size of the JVM, and that was exactly my problem.
Thanks a lot to "sw1nn" for the great comment ;)
You can also add :jvm-opts to your Leiningen project.clj like below:
:jvm-opts ["-Xmx1500m"]
so that it is applied when you run your program through Leiningen (e.g. when testing).
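In context, a minimal project.clj might look like this (the project name and versions are illustrative):

    (defproject myprog "0.1.0"
      :dependencies [[org.clojure/clojure "1.3.0"]]
      ;; passed to the JVM that Leiningen starts for run/test
      :jvm-opts ["-Xmx1500m"]
      :main myprog.core)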

Delayed Jobs leaking memory?

I'm using collectiveidea's delayed_job with my Ruby on Rails app (v2.3.8), and running about 40 background jobs with it on an 8GB RAM Slicehost machine (Ubuntu 10.04 LTS, Apache 2).
Let's say I ssh into my server with no workers running. When I run free -m, I see I'm generally using about 1 GB of RAM out of 8. Then, after starting the workers and waiting about a minute for the code to start utilizing them, I'm up to about 4 GB. If I come back in an hour or two, I'll be at 8 GB and into swap, and my website will be generating 502 errors.
So far I've just been killing the workers and restarting them, but I'd rather fix the root of the problem. Any thoughts? Is this a memory leak? Or, as a friend suggested, do I need to figure out a way to run garbage collection?
Actually, Delayed::Job 3.0 leaks memory in Ruby 1.9.2 if your models have serialized attributes. (I'm in the process of researching a solution.)
Here's someone who seems to have solved it: http://spacevatican.org/2012/1/26/memory-leak-in-yaml-on-ruby-1-9-2
Here's the issue from Delayed::Job: https://github.com/collectiveidea/delayed_job/issues/336
Just about every time someone asks about this, the problem is in their code. Try using one of the available profiling tools to find where your job is leaking. ( https://github.com/wycats/ruby-prof or similar.)
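If you haven't used ruby-prof before, a minimal sketch of wrapping a suspect job looks something like this (the job class name is illustrative):

    require 'ruby-prof'

    RubyProf.start
    MyJob.new.perform   # the code path you suspect of leaking
    result = RubyProf.stop

    # print a flat profile sorted by time spent in each method
    RubyProf::FlatPrinter.new(result).print(STDOUT)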
Triggering GC at the end of each job will reduce your maximum memory usage at the cost of thrashing your throughput. It won't stop Ruby from bloating to the maximum size required by any individual job, however, since Ruby can't release memory back to the OS. I don't recommend taking this approach.
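For completeness, if you do want to experiment with that approach anyway: delayed_job calls lifecycle hooks on custom job objects, so a sketch (the class name is illustrative) could be:

    class MyJob
      def perform
        # the actual work
      end

      # delayed_job invokes this hook after perform; forcing a GC here
      # trades throughput for lower steady-state memory usage
      def after(job)
        GC.start
      end
    end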
