ActiveMQ - Memory limit reached for topic (Mirror Queue)

We are using ActiveMQ 5.6 with the following configuration (a rough sketch of the broker XML follows the list):
- Flow control on
- Memory limit for topics 1MB
- Mirror Queues enabled (no explicit Virtual Topics defined)
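For reference, the relevant pieces of such a broker configuration look roughly like this (a sketch rather than our literal activemq.xml, assuming the mirror feature is switched on via useMirroredQueues; the policy entry carries the flow-control and memory settings listed above):
<broker xmlns="http://activemq.apache.org/schema/core" useMirroredQueues="true">
  <destinationPolicy>
    <policyMap>
      <policyEntries>
        <!-- producer flow control on, 1MB memory limit per topic -->
        <policyEntry topic=">" producerFlowControl="true" memoryLimit="1mb"/>
      </policyEntries>
    </policyMap>
  </destinationPolicy>
</broker>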
There are persistent messages being sent to a queue QueueA. Obviously, these messages are copied to Mirror.QueueA, which is a non-persistent and automatically created topic.
This topic usually has no consumers, and when consumers do connect once in a while, they are non-durable subscribers.
After a while, the producer blocks and we get the following error:
Usage Manager memory limit reached for topic://Mirror.QueueA
According to various sources, including the ActiveMQ documentation, messages in a topic without durable subscribers will be dropped, which is what I want and what I had expected. But this is obviously not the case.
There is one related Stack Overflow question, but the accepted solution suggests using flow control while disabling disk-spooling:
That would not use the disk, and block producers when the memoryLimit is hit.
But I do not want to block producers, because with no consumer ever coming they would block indefinitely. Why are these messages being persisted at all?
I see a few options:
- This is a bug that is probably fixed in later AMQ versions
- This is some configuration issue (which I don't know how to resolve)
- There is some option to simply drop the oldest message when the memory limit is hit (I couldn't find any such option)
I hope someone can help!
Thanks,
//J
[Update]
Although we have already deployed 5.6 installations in the field, I am currently running the same endurance/load test on a 5.8 installation of AMQ with the same configuration. Right now, I have already transmitted 10 times as many messages as on the 5.6 system without any issues. I will let this test run overnight or even over the next few days to see if there is some other limit.

OK, as stated in the update above, I ran the same load test on a 5.8 installation of ActiveMQ with the same configuration that caused the storage problem.
On 5.6, this happened after sending approximately 450 transactions into 3 queues with a topic memory limit of 1MB. You could even watch the KahaDB database file growing.
With AMQ 5.8, I stopped the load test after 4 days, by which point about 280,000 transactions had been sent. No storage issues, no stuck producer, and the KahaDB file stayed approximately the same size the whole time.
So, although I cannot say for sure that this is a bug in ActiveMQ 5.6, 5.8 is obviously behaving differently, and as expected and documented: it does not persistently store messages for the mirrored queues when no subscriber is registered.
For existing installations of AMQ 5.6, we used a little hack to avoid changing the application code.
Since the application was consuming from topics prefixed with "Mirror." (the default prefix) and some wildcards, we simply defined a topic at start-up in the configuration using the <destinations> XML tag. Where wildcards were used, we just used a hardcoded name like all-devices. This was unfortunately required for the next step:
We defined a <compositeQueue> within the <destinationInterceptors> section of the config that routed copies of all messages (<forwardTo>) from the actual (mirrored) queue to that one topic. The topic needs to be defined in advance or created manually, since simply defining the compositeQueue does not also create the topic. Plus, you cannot use wildcards in the forwardTo destination, which is why the hardcoded topic name was needed.
Then we removed the mirrored queue feature from the config.
To sum it up, it looks a bit like this:
<destinations>
  <topic name="Mirror.QueueA.all-devices" physicalName="Mirror.all-devices" />
</destinations>
<destinationInterceptors>
  <virtualDestinationInterceptor>
    <virtualDestinations>
      <compositeQueue name="QueueA.*" forwardOnly="false">
        <forwardTo>
          <topic physicalName="Mirror.QueueA.all-devices" />
        </forwardTo>
      </compositeQueue>
    </virtualDestinations>
  </virtualDestinationInterceptor>
</destinationInterceptors>
Hope this helps. This "hack" may not be feasible in every situation, but since we never consumed from individual Mirror topics, it worked for us.

Beam Runner hooks for Throughput-based autoscaling

I'm curious if anyone can point me towards greater visibility into how various Beam Runners manage autoscaling. We seem to be experiencing hiccups during both the 'spin up' and 'spin down' phases, and we're left wondering what to do about it. Here's the background of our particular flow:
1- Binary files arrive on gs://, and object notification duly notifies a PubSub topic.
2- Each file requires about 1Min of parsing on a standard VM to emit about 30K records to downstream areas of the Beam DAG.
3- 'Downstream' components include things like inserts to BigQuery, storage in GS:, and various sundry other tasks.
4- The files in step 1 arrive intermittently, usually in batches of 200-300 every hour, making this - we think - an ideal use case for autoscaling.
What we're seeing, however, has us a little perplexed:
1- It looks like when 'workers=1', Beam bites off a little more than it can chew, eventually causing some out-of-RAM errors, presumably as the first worker tries to process a few of the PubSub messages which, again, take about 60 seconds each to complete, because the 'message' in this case is really a binary file in gs:// that needs to be deserialized.
2- At some point, the runner (in this case, Dataflow with jobId 2017-11-12_20_59_12-8830128066306583836) gets the message that additional workers are needed and real work can now get done. During this phase, errors decrease and throughput rises. Not only are there more deserializers for step 1, but the step 3/downstream tasks are evenly spread out.
3- Alas, the previous step gets cut short when Dataflow senses (I'm guessing) that enough of the PubSub messages are 'in flight' to begin cooling down a little. That seems to come a little too soon, and workers are pulled while they are still chewing through the PubSub messages themselves - even before the messages are ACK'd.
We're still thrilled with Beam, but I'm guessing the less-than-optimal spin-up/spin-down phases are resulting in 50% more VM usage than needed. What do the runners look for besides PubSub consumption? Do they look at RAM/CPU/etc.? Is there anything a developer can do, besides ACKing a PubSub message, to provide feedback to the runner that more or fewer resources are required?
Incidentally, in case anyone doubted Google's commitment to open-source, I spoke about this very topic with an employee there yesterday, and she expressed interest in hearing about my use case, especially if it ran on a non-Dataflow runner! We hadn't yet tried our Beam work on Spark (or elsewhere), but would obviously be interested in hearing if one runner has superior abilities to accept feedback from the workers for THROUGHPUT_BASED work.
Thanks in advance,
Peter
CTO,
ATS, Inc.
Generally, streaming autoscaling in Dataflow works like this:
Upscale: If the pipeline's backlog is more than a few seconds, based on current throughput, the pipeline is upscaled. Here CPU utilization does not directly affect the amount of upscaling. Knowing the CPU usage (say it is at 90%) does not help answer the question 'how many more workers are required'. CPU does have an indirect effect, since pipelines fall behind when they don't have enough CPU, thus increasing the backlog.
Downscale: When the backlog is low (i.e. < 10 seconds), the pipeline is downscaled based on current CPU consumption. Here, CPU does directly influence the downsizing.
I hope the above basic description helps.
Due to the inherent delays involved in starting up new GCE VMs, the pipeline pauses for a minute or two during resizing events. This is expected to improve in the near future.
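For completeness, the knobs most closely related to this behaviour are exposed as pipeline options. Here is a minimal sketch, assuming the Beam Dataflow runner's standard worker-pool options; the cap of 20 workers is an arbitrary example, not a recommendation:
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class AutoscalingOptionsSketch {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);

    // Same effect as passing --autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=20
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
    options.setMaxNumWorkers(20); // upper bound on upscaling; tune to the workload

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline as usual, then run it
    pipeline.run();
  }
}
These options only bound what the autoscaler may do; the upscale/downscale decisions themselves follow the backlog and CPU logic described above.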
I will ask specific questions about the job you mentioned in the description.

Cloud Dataflow Running really slow when reading/writing from Cloud Storage (GCS)

Since the release of the latest build of Cloud Dataflow (0.4.150414), our jobs have been running really slowly when reading from Cloud Storage (GCS). After running for 20 minutes with 10 VMs we were only able to read about 20 records, when previously we could read millions without issue.
It seems to be hanging, although no errors are being reported back to the console.
We received an email informing us that the latest build would be slower and that it could be countered by using more VMs but we got similar results with 50 VMs.
Here is the job id for reference: 2015-04-22_22_20_21-5463648738106751600
Instance: n1-standard-2
Region: us-central1-a
Your job seems to be using side inputs to a DoFn. Since there has been a recent change in how Cloud Dataflow SDK for Java handles side inputs, it is likely that your performance issue is related to that. I'm reposting my answer from a related question.
The evidence seems to indicate that there is an issue with how your pipeline handles side inputs. Specifically, it's quite likely that side inputs may be getting re-read from BigQuery again and again, for every element of the main input. This is completely orthogonal to the changes to the type of virtual machines used by Dataflow workers, described below.
This is closely related to the changes made in the Dataflow SDK for Java, version 0.3.150326. In that release, we changed the side input API to apply per window. Calls to sideInput() now return values only in the specific window corresponding to the window of the main input element, and not the whole side input PCollectionView. Consequently, sideInput() can no longer be called from startBundle and finishBundle of a DoFn because the window is not yet known.
For example, the following code snippet has an issue that would cause the side input to be re-read for every input element.
@Override
public void processElement(ProcessContext c) throws Exception {
  Iterable<String> uniqueIds = c.sideInput(iterableView);
  for (String item : uniqueIds) {
    [...]
  }
  c.output([...]);
}
This code can be improved by caching the side input in a List member variable of the transform (assuming it fits into memory) during the first call to processElement, and using that cached List instead of the side input in subsequent calls.
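Here is a rough sketch of that caching pattern, assuming the pre-Beam Dataflow SDK for Java used in the snippet above; the view name, element types, and output are placeholders taken from that snippet:
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.values.PCollectionView;
import java.util.ArrayList;
import java.util.List;

class CachedSideInputFn extends DoFn<String, String> {
  private final PCollectionView<Iterable<String>> iterableView;
  // Filled on the first call to processElement and reused for every later element.
  private transient List<String> cachedUniqueIds;

  CachedSideInputFn(PCollectionView<Iterable<String>> iterableView) {
    this.iterableView = iterableView;
  }

  @Override
  public void processElement(ProcessContext c) throws Exception {
    if (cachedUniqueIds == null) {
      // Materialize the side input once instead of re-reading it per element.
      cachedUniqueIds = new ArrayList<>();
      for (String id : c.sideInput(iterableView)) {
        cachedUniqueIds.add(id);
      }
    }
    for (String item : cachedUniqueIds) {
      // ... same per-element work as in the original snippet
    }
    c.output(c.element()); // placeholder output
  }
}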
This workaround should restore the performance you were seeing before, when side inputs could have been called from startBundle. Long-term, we will work on better caching for side inputs. (If this doesn't help fully resolve the issue, please reach out to us via email and share the relevant code snippets.)
Separately, there was, indeed, an update to the Cloud Dataflow Service around 4/9/15 that changed the default type of virtual machines used by Dataflow workers. Specifically, we reduced the default number of cores per worker because our benchmarks showed it as cost effective for typical jobs. This is not a slowdown in the Dataflow Service of any kind -- it just runs with less resources per worker, by default. Users are still given the options to override both the number of workers as well as the type of the virtual machine used by workers.
We had a similar issue. It occurs when the side input reads from a BigQuery table that has had its data streamed in, rather than bulk loaded. When we copy the table(s) and read from the copies instead, everything works fine.
If your tables are streamed, try copying them and reading the copies instead. This is a workaround.
See: Dataflow performance issues

Deploying an SQS Consumer

I am looking to run a service that will be consuming messages that are placed into an SQS queue. What is the best way to structure the consumer application?
One thought would be to create a bunch of threads or processes that run this:
import socket

def run(q, delete_on_error=False):
    while True:
        try:
            # Long-poll the queue for the next message.
            m = q.read(VISIBILITY_TIMEOUT, wait_time_seconds=MAX_WAIT_TIME_SECONDS)
            if m is not None:
                try:
                    process(m.id, m.get_body())
                except TransientError:
                    # Leave the message alone; it becomes visible again once
                    # the visibility timeout expires.
                    continue
                except Exception as ex:
                    log_exception(ex)
                    if not delete_on_error:
                        continue
                # Processing succeeded (or delete_on_error is set): remove the message.
                q.delete_message(m)
        except StopIteration:
            break
        except socket.gaierror:
            continue
Am I missing anything else important? What other exceptions do I have to guard against in the queue read and delete calls? How do others run these consumers?
I did find this project, but it seems stalled and has some issues.
I am leaning toward separate processes rather than threads to avoid the GIL. Is there some container process that can be used to launch and monitor these separate running processes?
There are a few things:
The SQS API allows you to receive more than one message with a single API call (up to 10 messages, or up to 256k worth of messages, whichever limit is hit first). Taking advantage of this feature allows you to reduce costs, since you are charged per API call. It looks like you're using the boto library - have a look at get_messages.
In your code right now, if processing a message fails due to a transient error, the message won't be able to be processed again until the visibility timeout expires. You might want to consider returning the message to the queue straight away. You can do this by calling change_visibility with 0 on that message. The message will then be available for processing straight away. (It might seem that if you do this then the visibility timeout will be permanently changed on that message - this is actually not the case. The AWS docs state that "the visibility timeout for the message the next time it is received reverts to the original timeout value". See the docs for more information.)
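To make both points concrete, here is a rough sketch, written with the AWS SDK for Java purely to illustrate the shape of the API (the boto counterparts are Queue.get_messages(num_messages=...) and Message.change_visibility(...)); the queue URL and process method are placeholders:
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class SqsBatchConsumerSketch {
  public static void main(String[] args) {
    AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    String queueUrl = args[0]; // placeholder queue URL

    while (true) {
      // Point 1: fetch up to 10 messages per API call, using long polling.
      ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
          .withMaxNumberOfMessages(10)
          .withWaitTimeSeconds(20);

      for (Message m : sqs.receiveMessage(request).getMessages()) {
        try {
          process(m.getBody()); // placeholder for the real work
          sqs.deleteMessage(queueUrl, m.getReceiptHandle());
        } catch (RuntimeException transientError) {
          // Point 2: make the message visible again immediately instead of
          // waiting for the visibility timeout to expire.
          sqs.changeMessageVisibility(queueUrl, m.getReceiptHandle(), 0);
        }
      }
    }
  }

  private static void process(String body) { /* placeholder */ }
}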
If you're after an example of a robust SQS message consumer, you might want to check out NServiceBus.AmazonSQS (of which I am the author). (C# - sorry, I couldn't find any python examples.)

MSMQ max count message notification

We are in the process of implementing MSMQ for quick storage of messages so that they can be processed later in disconnected mode; typical usage of any message broker.
One of the administration requirements is to send an automatic notification to administrators/developers if the count of unprocessed messages in a queue reaches 1000.
Can it be done out of the box? If yes then how?
If not, do I need to write a Windows service (or some sort of scheduler) to check the count every x seconds?
Any suggestions or past experience are welcome.
The only (partially) built-in solution would be to set up the MSMQ Queue performance counter which gives you this information for private queues on the server.
There are a number of other solutions, including a SCOM management pack and some third-party solutions like evtools, or you could roll your own using System.Messaging.
Hope this is of help.
There's a commercial solution for this: QueueMonitor.
Disclaimer: I'm the author of that software.
Edit
A few tips for this scenario:
Set the message's UseDeadLetterQueue property to true; this way, if there's any issue delivering messages, at least they won't be lost but will be moved to the system's dead-letter queue.
Set the message's Recoverable property to true. It does reduce performance, but for this kind of long-running scenario there's too much risk that a restart or failure would lose messages that are only stored in memory.
If messages are no longer valid after some period, you can use TimeToReachQueue to automatically delete them.

Icinga - check_yum - Socket Timeout?

I'm using the check_yum plugin in my Icinga monitoring environment to check whether security-critical updates are available. This works quite well, but sometimes I get a "CHECK_NRPE: Socket timeout after xx seconds" error while executing the check. Currently my NRPE timeout is 30 seconds.
If I re-schedule the check a few times, or execute it directly from my Icinga server with a higher NRPE timeout value, everything works fine, at least after a few executions of the check. All other checks via NRPE are not throwing any errors, so I don't think there is a general problem with my NRPE config or the plugins I'm using. Is there some explanation for this strange behaviour of the check_yum plugin? Maybe some caching issues on the monitored servers?
First, be sure you are using the 1.0 version of this check from: https://code.google.com/p/check-yum/downloads/detail?name=check_yum_1.0.0&can=2&q=
The changes I've seen in that version could fix this issue, depending on its root cause.
Second, if your server(s) are not configured to use only 'local' cache repos, then this check will likely time out before the 30-second deadline, because: 1) the amount of data from the refresh/update is pretty large and may take a long time to download from remote (including RH proper) servers, and 2) most of the 'official' update servers tend to go offline a lot.
The best solution I've found is to have a cronjob perform the update check at a set interval (I use weekly) and write a log file containing the security patches the system(s) require. Then use a Nagios check, via a simple shell script, to see whether that file has any new items in it.
