We have an Erlang/Elixir application (on Erlang 18/ERTS 7.3.1) that processes large JSON payloads.
Here's a typical workflow:
A listener gets a token from RabbitMQ and sends it to a gen_server.
The gen_server puts the token into an ETS table with a future time (current + n seconds). A scheduled job in the gen_server picks up expired tokens from ETS and launches several short-lived processes with these tokens.
These short-lived processes download 30-50k JSON payloads from Elasticsearch (using hackney), process them, and upload the results back to Elasticsearch; after that, each process dies immediately. We keep track of these processes and have confirmed they die. We process 5-10 of these requests per second.
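For context, here is a simplified sketch of that scheduling flow (module, table, and message names are illustrative, not our actual code):

defmodule TokenScheduler do
  use GenServer

  @check_interval 5_000

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, opts)
  end

  def init(:ok) do
    table = :ets.new(:pending_tokens, [:set])
    Process.send_after(self(), :check_expired, @check_interval)
    {:ok, %{table: table}}
  end

  # The RabbitMQ listener calls this for each incoming token.
  def handle_cast({:schedule, token, delay_secs}, %{table: table} = state) do
    :ets.insert(table, {token, :os.system_time(:seconds) + delay_secs})
    {:noreply, state}
  end

  def handle_info(:check_expired, %{table: table} = state) do
    now = :os.system_time(:seconds)

    # Pick up expired tokens and hand each one to a short-lived worker.
    for {token, due_at} <- :ets.tab2list(table), due_at <= now do
      :ets.delete(table, token)
      spawn(fn -> process_token(token) end)
    end

    Process.send_after(self(), :check_expired, @check_interval)
    {:noreply, state}
  end

  # Download the JSON payload from Elasticsearch, process it, upload the
  # result, then let the process die.
  defp process_token(_token), do: :ok
end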
The Problem: We see an ever-growing binary space that reaches a couple of GB within 48 hours (seen via observer and debug prints). A manual GC also has no impact.
We have already added recon and run recon:bin_leak, but this only frees up a few KB and has no impact on the ever-growing binary space.
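Roughly the kind of calls we ran (the exact arguments here are illustrative):

# Show the 10 processes that released the most binary references after a GC;
# this reclaimed only a few KB.
:recon.bin_leak(10)

# Force a GC on every process; this also had no effect on the binary space.
Enum.each(Process.list(), &:erlang.garbage_collect/1)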
Stack: Erlang 18/ERTS 7.3.1, Elixir 1.3.4, hackney 1.4.4, poison 2.2.0, timex 3.1.13, etc.; none of these apps is holding the memory either.
Has anyone come across a similar issue in the past? Would appreciate any solutions.
Update 9/15/2017:
We updated our app to Erlang 19/ERTS 8.3 and the hackney and poison libs to the latest versions, but there is still no progress. Here is some logging from a GenServer that periodically sends a message to itself, using spawn/receive or send_after. At each handle_info, it looks up an ETS table, and if it finds any "eligible" entries it spawns new processes; if not, it just returns {:noreply, state}. We print the VM's binary space (in KB) at entry to the function; the log is listed below, followed by a sketch of how that figure is obtained. This is a "quiet" time of the day, yet you can see the gradual increase of binary space. Once again, :recon.bin_leak(N) and :erlang.garbage_collect() had no impact on this growth.
11:40:19.896 [warn] binary 1: 3544.1328125
11:40:24.897 [warn] binary 1: 3541.9609375
11:40:29.901 [warn] binary 1: 3541.9765625
11:40:34.903 [warn] binary 1: 3546.2109375
--- some processing ---
12:00:47.307 [warn] binary 1: 7517.515625
--- some processing ---
12:20:38.033 [warn] binary 1: 15002.1328125
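(For clarity, the "binary 1" figure above is just the VM's binary allocation converted to KB, roughly as sketched below; the actual logging code differs slightly.)

require Logger

# Total bytes currently allocated for binaries by the VM, converted to KB.
binary_kb = :erlang.memory(:binary) / 1024
Logger.warn("binary 1: #{binary_kb}")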
We never had a situation like this in our old Scala/Akka app, which processes 30x the volume and has run for years without an issue or restarts. I wrote both apps.
We found that the memory leak came from a private reusable library that sends messages to Graylog and uses the function below to compress the data before sending it via gen_udp.
defp compress(data) do
  zip = :zlib.open()
  :zlib.deflateInit(zip)
  output = :zlib.deflate(zip, data, :finish)
  :zlib.deflateEnd(zip)
  :zlib.close(zip) # <--- was missing, hence the slow memory leak.
  output
end
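With :zlib.close/1 in place the leak is gone. A slightly more defensive variant (my own suggestion, not the library's actual code) releases the stream even if deflate raises:

defp compress(data) do
  zip = :zlib.open()

  try do
    :zlib.deflateInit(zip)
    output = :zlib.deflate(zip, data, :finish)
    :zlib.deflateEnd(zip)
    output
  after
    # Always release the zlib stream, even when compression fails, so the
    # binaries it references can be garbage collected.
    :zlib.close(zip)
  end
end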
Had I simply used term_to_binary(data, [:compressed]) instead, I could have saved myself some headaches.
Thanks for all the inputs and comments. Much appreciated!
Our application uses Neo4j 3.5.x (we tried both Community and Enterprise editions) to store some data.
No matter how we set up memory in conf/neo4j.conf (we tried lots of combinations of initial/max heap settings from 4 to 16 GB), the GC process runs every 3 seconds, bringing the machine to its knees and slowing the whole system down.
There is one combination (8g initial / 16g max) that seems to make things more stable, but 20-30 minutes after our system starts being used, GC kicks in again on Neo4j and goes into this "deadly" loop.
If we restart the Neo4j server without restarting our system, GC starts again as soon as our system starts querying Neo4j (we've noticed this behavior consistently).
We had a 3.5.x Community instance that had been working fine until last week (when we tried to switch to Enterprise). We copied the data/ folder from the Enterprise instance to the Community instance and started the latter... only to have it behave the same way the Enterprise instance did, running GC every 3 seconds.
Any help is appreciated. Thanks.
Screenshot of jvisualvm with 8g/16g of heap
In debug.log, only these entries look significant:
2019-03-21 13:44:28.475+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:50376]: client produced 301 messages on the worker queue, auto-read is being disabled.
2019-03-21 13:45:15.136+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:50376]: consumed messages on the worker queue below 100, auto-read is being enabled.
2019-03-21 13:45:15.140+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:50376]: client produced 301 messages on the worker queue, auto-read is being disabled.
And I have a neo4j.log excerpt from around the time shown in the jvisualvm screenshot, but it's 3500 lines long... so here it is on Pastebin:
neo4j.log excerpt from around the time the jvisualvm screenshot was taken
Hope this helps. I also have the logs for the Enterprise edition if needed, though they are a bit more 'chaotic' (Neo4j restarts in between) and I have no jvisualvm screenshot for them.
I have a Neo4j server running that periodically stalls out for tens of seconds. The web frontend will say it's "disconnected" with the red warning bar at the top, and a normally instant REST query in my application will apparently hang until the stall ends, after which everything returns to normal: the web frontend becomes usable and my REST query completes fine.
Is there any way to debug what is happening during one of these stall periods? Can you get a list of currently running queries? Or a list of what hosts are connected to the server? Or any kind of indication of server load?
Most likely this is JVM garbage collection kicking in because you haven't allocated enough heap space.
There are a number of ways to debug this. You can, for example, enable GC logging (uncomment the appropriate lines in neo4j-wrapper.conf), or use a profiler (e.g. YourKit) to see what's going on and why the pauses occur.
I'm running a job which reads about 70 GB of compressed data.
In order to speed up processing, I tried to start the job with a large number of instances (500), but after 20 minutes of waiting it doesn't seem to have started processing the data (I have a counter for the number of records read). The reason for the large number of instances is that, as one of the steps, I need to produce an output similar to an inner join, which results in a much bigger intermediate dataset for later steps.
What is the average delay between when a job is submitted and when it starts executing? Does it depend on the number of machines?
While I might have a bug that causes that behavior, I still wonder what that number/logic is.
Thanks,
G
The time necessary to start VMs on GCE grows with the number of VMs you start, and in general VM startup/shutdown performance can have high variance. 20 minutes would definitely be much higher than normal, but it is somewhere in the tail of the distribution we have been observing for similar sizes. This is a known pain point :(
To verify whether VM startup is actually at fault this time, you can look at Cloud Logs for your job ID, and see if there's any logging going on: if there is, then some VMs definitely started up. Additionally you can enable finer-grained logging by adding an argument to your main program:
--workerLogLevelOverrides=com.google.cloud.dataflow#DEBUG
This will cause workers to log detailed information, such as receiving and processing work items.
Meanwhile, I suggest enabling autoscaling instead of specifying a large number of instances manually; it should gradually scale to the appropriate number of VMs at the appropriate moment in the job's lifetime.
Another possible (and probably more likely) explanation is that you are reading a compressed file that needs to be decompressed before it is processed. It is impossible to seek in the compressed file (since gzip doesn't support it directly), so even though you specify a large number of instances, only one instance is being used to read from the file.
The best way to approach this problem would be to split the single compressed file into many files that are compressed separately.
The best way to debug this problem would be to try it with a smaller compressed input and take a look at the logs.
We are using ActiveMQ 5.6 with the following configuration:
- Flow control on
- Memory limit for topics 1MB
- Mirror Queues enabled (no explicit Virtual Topics defined)
There are persistent messages being sent to a queue, QueueA. Obviously, each message is copied to Mirror.QueueA, which is a non-persistent, automatically created topic.
On this topic, there are no consumers. If there are consumers once in a while, they are non-durable subscribers.
After a while, the producer blocks and we get the following error:
Usage Manager memory limit reached for topic://Mirror.QueueA
According to various sources, including the ActiveMQ documentation, messages in a topic without durable subscribers will be dropped, which is what I want and what I had expected. But this is obviously not the case.
There is one related Stack Overflow question, but the accepted solution suggests using flow control while disabling disk-spooling:
That would not use the disk, and block producers when the memoryLimit is hit.
But I do not want to block producers, because they would block indefinitely since no consumer is coming. Why are these messages being persisted?
I see a few options:
- This is a bug and probably fixed in later AMQ versions
- This is some configuration issue (which I don't know how to resolve)
- There is some option to simply drop the oldest message when the memory limit is hit (I couldn't find any such option)
I hope someone can help!
Thanks,
//J
[Update]
Although we have already deployed versions of 5.6 out in the field, I am currently running the same endurance/load test on a 5.8 installation of AMQ with the same configuration. Right now I have already transmitted 10 times as many messages as on the 5.6 system without any issues. I will let this test run overnight or even over the next few days to see if there is some other limit.
OK, as stated in the update above, I ran the same load test on a 5.8 installation of ActiveMQ with the same configuration that caused the storage exceedance on 5.6.
On 5.6, this happened after sending approximately 450 transactions into 3 queues with a topic memory limit of 1 MB. You could even watch the KahaDB database file growing.
With AMQ 5.8, I stopped the load test after 4 days, resulting in about 280,000 transactions sent. No storage issues, no stuck producers, and the KahaDB file stayed approximately the same size the whole time.
So, although I cannot say for sure that this is a bug in ActiveMQ 5.6, 5.8 obviously behaves differently, as expected and documented: it is not storing messages in the mirrored queues persistently when no subscriber is registered.
For existing installations of AMQ 5.6, we used a little hack to avoid changing the application code.
Since the application was consuming from topics prefixed with "Mirror." (the default prefix), some of them via wildcards, we simply defined a topic at start-up in the configuration using the <destinations> XML tag. Where wildcards were used, we just used a hardcoded name like all-devices. This was unfortunately required for the next step:
We defined a <compositeQueue> within the <destinationInterceptors> section of the config that routes copies of all messages (<forwardTo>) from the actual (mirrored) queue to that one topic. This topic needs to be defined in advance or created manually, since simply defining the compositeQueue does not also create it. Plus, you cannot use wildcards in the forwarded-to topic, which is why the hardcoded name from the previous step was needed.
Then we removed the mirrored queue feature from the config.
To sum it up, it looks a bit like this:
<destinations>
  <topic name="Mirror.QueueA.all-devices" physicalName="Mirror.all-devices" />
</destinations>

<destinationInterceptors>
  <virtualDestinationInterceptor>
    <virtualDestinations>
      <compositeQueue name="QueueA.*" forwardOnly="false">
        <forwardTo>
          <topic physicalName="Mirror.QueueA.all-devices" />
        </forwardTo>
      </compositeQueue>
    </virtualDestinations>
  </virtualDestinationInterceptor>
</destinationInterceptors>
Hope this helps. This "hack" may not be possible in every situation but since we never consumed on individual Mirror topics, this was possible.
I have a BigCouch cluster with Q=256, N=3, R=2, W=2. Everything seems to be up and running and I can read and write small test documents. The application is in Python and uses the CouchDB library. The cluster has 3 nodes, each running CentOS on VMware with 3 cores and 6 GB RAM. BigCouch 0.4.0, CouchDB 1.1.1, Erlang R14B04; CentOS Linux release 6.0 (Final) on EC2 and CentOS release 6.2 (Final) on VMware 5.0.
Starting the application attempts a bulk insert with 412 documents and a total of 490 KB of data. This works fine with N=1, so there isn't an issue with the contents. But with N=3 I seem to randomly get one of these results:
- write completes in about 9 sec
- write completes in about 24 sec (nothing in between)
- write fails after about 30 sec (some documents were inserted)
- Erlang crashes after about 30 sec (some documents were inserted)
vmstat shows near 100% CPU utilization, top shows this is mostly the Erlang process, truss shows this is mostly spent in "futex" calls. Disk usage jumps up and down during the operation, but CPU remains pegged.
The logs show lovely messages like:
"could not load validation funs {{badmatch, {error, timeout}}, [{couch_db, '-load_validation_funs/1-fun-1-', 1}]}"
"Error in process <0.13489.10> on node 'bigcouch-test02#bigcouch-test02.oceanobservatories.org' with exit value: {{badmatch,{error,timeout}},[{couch_db,'-load_validation_funs/1-fun-1-',1}]}"
And of course there are Erlang dumps.
From reading about other people's use of BigCouch, this certainly isn't a large update. Our VMs seem beefy enough for the job. I can reproduce with cURL and a JSON file, so it isn't the application. (Can post that too if it helps.)
Can anyone explain why 9 cores and 18GB RAM can't handle a (3x) 490KB write?
More info in case it helps:
- bigcouch.log entries including a longer crash report
- JSON entries that repeatably cause the failure
- erl_crash.dump from an EC2 m1.small machine trying to allocate a 500 MB heap
Can reproduce with these commands (after downloading the above JSON entries as file.json):
url=http://yourhost:5984
curl -X PUT $url/test
curl -X POST $url/test/_bulk_docs -d @file.json -H "Content-Type: application/json"
I got a suggestion that Q=256 may be the issue, and found that BigCouch does slow down a lot as Q grows. This was a surprise to me: I would have thought the hashing and delegation would be pretty lightweight. But perhaps it dedicates too many resources to each DB shard.
As Q grows from too small to allow any real cluster growth to maybe big enough for big data, the time to do my 490 KB update grows from uncomfortably slow to unreasonably slow, and finally into the realm of BigCouch crashes. Here is the insert time as Q varies, with N=3, R=W=2, and the 3 nodes as originally described:
Q    sec
4     6.4
8     7.7
16   10.8
32   16.9
64   37.0   <-- the specific suggestion in adam@cloudant's webcast
This seems like an Achilles heel for BigCouch: in spite of the suggestions to overshard to support later growth of your cluster, you can't have enough shards unless you already have a moderate-sized cluster or some powerful hardware!