Why would SQS ApproximateNumberOfMessagesVisible & ApproximateAgeOfOldestMessage go up even when Received/Deleted metrics match Sent metrics?

In the CloudWatch Metrics graph below, the purple line is ApproximateNumberOfMessagesVisible and the red line is ApproximateAgeOfOldestMessage. They are trending up even though NumberOfMessagesReceived (orange) and NumberOfMessagesDeleted (green) match NumberOfMessagesSent (blue).
How is this possible?
In my code, I process each message in a new thread, so the message is deleted from the queue almost immediately. (This is not good practice in production, but this is a load-testing script, so I don't expect or care about exceptions.)
// The delete runs on the polling thread right after handing the message
// off, i.e. possibly before handleSqsMessage has finished.
sqsClient.receiveMessage(queueUrl).getMessages().forEach(msg -> {
    pool.execute(() -> handleSqsMessage(msg));
    sqsClient.deleteMessage(queueUrl, msg.getReceiptHandle());
});

If ApproximateAgeOfOldestMessage is increasing, it indicates that there is a poison pill: a malformed message that the consumer cannot process, so it keeps returning to the queue.
What is your redrive policy? You will have to set the max-receive-count to a small value (say 3, for example). After the message has been received 3 times without the consumer processing and deleting it, it will be moved to the dead-letter queue, where you can analyze the poison pill.
If the number of visible messages is increasing consistently, it indicates that your consumers cannot keep up and messages are piling up in the queue. This is not necessarily a bad sign, but the backlog shouldn't grow very large; yours seems OK to me. You can increase the number of consumers to bring it down.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html#sqs-dead-letter-queues-when-to-use
https://aws.amazon.com/message-queue/features/

Related

Performance of selective `receive` in Elixir/Erlang

When I have code like this:
receive do
  {:hello, msg} -> msg
end
And let's say I have N messages in my mailbox. Is the performance of finding this particular message O(1), O(N), or something in between?
receive performs a linear scan of the mailbox and returns the first message that matches. There is one exception (since R14A):
OTP-8623 == compiler erts hipe stdlib ==
Receive statements that can only read out a newly created
reference are now specially optimized so that it will execute
in constant time regardless of the number of messages in the
receive queue for the process. That optimization will benefit
calls to gen_server:call(). (See gen:do_call/4 for an example
of a receive statement that will be optimized.)
So in your case it is an O(N) operation.
Messaging in Erlang, and hence Elixir, is first in, first out: receive browses the mailbox one message at a time and handles the first one that matches any clause. In the worst case you can choke up your mailbox.
The cost grows linearly with the number of messages in the mailbox, so it is O(N).
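For concreteness, here is a hedged sketch (with my own function and message names, modeled on gen_server:call) of the receive shape that the OTP-8623 optimization recognizes: a freshly created reference is matched in every clause, so the runtime can skip all messages that arrived before make_ref/0 was called:
call(Pid, Request) ->
    Ref = make_ref(),
    Pid ! {request, self(), Ref, Request},
    receive
        %% Only messages carrying Ref can match, and Ref did not exist
        %% before make_ref/0 ran, so older mailbox entries are skipped.
        {reply, Ref, Reply} ->
            Reply
    after 5000 ->
        exit(timeout)
    end.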

Apache Storm - use multiple spouts?

So I'm trying to configure my spout(s) to read from an Amazon SQS queue. Now, I want a situation wherein I can share the load across multiple spouts.
I understand it's possible to have multiple threads, but can I have two or more different spout instances/applications which are reading from the same queue and emitting to the same topology? For eg. Spout A and Spout B read from the SQS and then both emit to bolt C?
Of course you can have multiple spouts, but you have to configure them so that the same element is not emitted twice (unless your topology accepts that by design). Processing the same element more than once skews counters, for instance.
As a starting point, look at Storm's concurrency model: executors (threads) and tasks (instances) per spout/bolt, and choose the numbers you want.
In your code, you have to make sure you never process the same tuples twice or more. You can handle this before Storm (for instance a queue that never hands out the same element twice and is drained by many spouts, or one queue per spout, being careful with transactions), or inside Storm (for instance one spout only processes messages with attribute x, another only those with attribute y, and no message is both x and y at once).
SQS Queue -----> Spout (N Number of Executors).
This model will work perfectly fine: as soon as any executor instance picks up a message, the message becomes invisible in SQS.
Keep the message invisibility time (visibility timeout) much higher than the message processing time within the Storm topology.
You can put the SQS delete-message logic inside the spout's ack method.

Could you overflow the message queue of an Erlang process?

I'm still in the learning phase of Erlang, so I might be wrong, but this is how I understood a process's message queue.
A process could be in its main receive loop, receiving certain types of messages, while later it could be put in a waiting loop to deal with a different kind of message. If the process receives messages intended for the first loop while in the second loop, it just leaves them in the queue, ignoring them for the time being, and only processes the messages it can match against in its current loop. When it enters the first receive loop again, it starts from the beginning of the queue and again processes the messages it can match against.
Now my question would be: if this is how Erlang works and I understood it correctly, what happens when a malicious process sends a bunch of messages that the receiving process will never match? Will the queue eventually overflow, resulting in a crash of the process, or how should I deal with this? I'll type out an example to illustrate what I mean.
Now, if a malicious program got hold of the Pid and ran Pid ! {maliciousdata, LotsOfData} repeatedly, would those messages be filtered out, since they can never be matched, or would they just stack up in the queue?
startproc() -> firstloop(InitValues).   % InitValues and the helper functions are assumed to exist

firstloop(Values) ->
    receive
        retrieveinformation ->
            WaitingList = askforinformation(),
            retrieveloop(WaitingList, Values);   % retrieveloop/2 needs Values as well
        dostuff ->
            NewValues = doingstuff(),
            firstloop(NewValues);
        sendmeyourdata ->
            sendingdata(Values),
            firstloop(Values)
    end.

retrieveloop([], Values) ->
    firstloop(Values);                           % clauses joined with ';', not '.'
retrieveloop(WaitingList, Values) ->
    receive
        {hereismyinformation, Id, MyInfo} ->
            NewValues = dosomethingwithinfo(Id, MyInfo),
            %% lists:remove/3 does not exist; keydelete assumes {Id, ...} entries
            retrieveloop(lists:keydelete(Id, 1, WaitingList), NewValues)
    end.
There is not a hard limit on message counts, and there is not a fixed amount of memory you are limited to, but you can certainly run out of memory if you have billions of messages (or a few super huge ones, maybe).
Long before you OOM because of a huge mailbox, you will notice either selective receives taking a long time (not that "selective receive" is a good pattern to follow much of the time...) or you will innocently peek into a process's mailbox and realize you've opened Pandora's box in your terminal.
This is usually treated as a throttling and monitoring issue in the Erlang world. If you aren't able to keep up and your problem is parallelizable then you need more workers. If you are maxing out your hardware then you need more efficiency. If you are still maxing out your hardware, can't get any more, and you're still overwhelmed then you need to decide how to implement pushback or load shedding.
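As a hedged sketch of the monitoring side (the function name and threshold handling are mine), you can watch a process's mailbox length with erlang:process_info/2 and decide to shed load when it passes a limit:
check_mailbox(Pid, Limit) ->
    case erlang:process_info(Pid, message_queue_len) of
        {message_queue_len, N} when N > Limit ->
            overloaded;    % caller decides: push back, shed load, add workers
        {message_queue_len, _N} ->
            ok;
        undefined ->
            dead           % the process has already exited
    end.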
Unfortunately there is no "message queue overflow": the queue will grow until the VM crashes with a memory allocation error.
The solution is to drop any invalid message in the main loop. Given the blocking nature of your process, you should never receive {hereismyinformation, _, _} there, nor anything that belongs to askforinformation().
startproc() -> firstloop(InitValues).

firstloop(Values) ->
    receive
        retrieveinformation ->
            WaitingList = askforinformation(),
            retrieveloop(WaitingList, Values);
        dostuff ->
            NewValues = doingstuff(),
            firstloop(NewValues);
        sendmeyourdata ->
            sendingdata(Values),
            firstloop(Values);
        _ ->
            %% {hereismyinformation, _, _} can never arrive here,
            %% so any unexpected message is safe to drop
            firstloop(Values)
    end.

retrieveloop([], Values) ->
    firstloop(Values);
retrieveloop(WaitingList, Values) ->
    receive
        {hereismyinformation, Id, MyInfo} ->
            NewValues = dosomethingwithinfo(Id, MyInfo),
            retrieveloop(lists:keydelete(Id, 1, WaitingList), NewValues)
    end.
The real problem is not unexpected messages, which are easily avoided, but a process queue that grows faster than it is processed. For that specific problem there is the nice jobs framework for production systems.
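A hedged sketch of jobs usage, based on the project's README (the queue name, rate, and do_work/0 are placeholders):
%% Register a queue that admits at most one job per second, then run
%% work through it; excess callers are queued by the regulator.
ok = jobs:add_queue(my_jobs, [{standard_rate, 1}]),
Result = jobs:run(my_jobs, fun() -> do_work() end).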

Erlang/OTP How to notify parent process that child processes are idle and no messages in their mailbox

I would like to design a process hierarchy where there is a parent process P which acts as a gatekeeper and delegates the work (messages/events from its client processes) to its children processes C1, C2, ..., Cn, which collaborate with each other and may send the result back to P. The children processes cannot talk to any process outside, only to P.
The challenge is that, though P may have multiple messages from its clients, it should accept only one message, delegate the work to C1..Cn, and ONLY accept the next message from its clients
when all its children processes are done (or idle) and there are no more messages circulating between C1 and Cn.
P then finishes accepting messages from C1..Cn so that it can return the result to its clients.
Constraints:
Idle, for me, is when they are waiting in a receive (blocking) or have simply exited.
C1 to Cn are finite state machines. Some or all of them may send messages back to P, or there may be no messages to send back. Even if no messages are sent back, P has to figure out that all of them are done, with no messages still in flight between them.
If any of C1 to Cn has been pre-empted, it is considered busy (this may be obvious, but I thought I'd put it here for completeness) and P will not receive the next message.
Is there an OTP pattern or library which will do this for me (before I hack something together)? I know that process_info can tell me whether the mailbox of a process is empty, and I could keep checking the children's mailboxes from P, but that would be unnecessary polling.
EDIT GENERAL: I am trying to implement a reactive variant of Flow-Based Programming on the Erlang platform. This has the notion of 'hierarchical processes' or composites, which themselves may contain composite processes, until we reach some boxes of actual code. I am going to research further (looking at monitor, process_info, process_flag), but I wanted to respond to your excellent answers.
EDIT RECURSIVE PARENTS: Each of C1 to Cn can themselves be parent/composite processes. If I just spawn processes and let them exit immediately, I'll have to recreate the chain of composites every time, as C1..Cn may themselves be composites (which spawn composites, and so on). Finally, when we reach a leaf box (which is not a composite node), they are supposed to be finite state machines, so I'm not sure about spawning them and making them exit quickly if they are FSMs.
EDIT TKOWAL: Since I am trying to create a generic parent/composite process, it does not know 'when' the task ends. All it does is relay the messages it receives from its children to its siblings, with the 'constraint' that it will not accept the next message from its clients/siblings until its children are 'done'. The children C1..Cn may send not just one but many messages. I understand from your proposal that wait_for_task_finish will stop blocking the moment it gets the first message, but more messages may be emitted by P's children, and P should wait for all of them. Also, having a task_end symbol will not work for the same reason (i.e. multiple messages are possible from the children).
Given how inexpensive it is to start up Erlang processes, your gatekeeper could start new children for each incoming task, and then wait for them all to exit normally once they complete their work.
But in general, it sounds like you're looking for a process pool. There are a few of these already available, such as poolboy and sidejob. Pools can be harder to get right than you think, so I advise using an existing proven pool implementation before attempting to write your own.
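For instance, a minimal hedged sketch with poolboy (the pool name, worker protocol, and the surrounding supervisor setup are placeholders; see poolboy's README for the real configuration):
with_worker(Task) ->
    %% Checks a worker out of the pool, runs one call through it,
    %% and checks it back in when the fun returns.
    poolboy:transaction(my_pool, fun(Worker) ->
        gen_server:call(Worker, {work, Task})
    end).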
After the edits this became an entirely different question, so I am posting a new answer.
If you are trying to write Flow-Based Programming, then you are probably solving the wrong problem. FBP is effective because almost everything is asynchronous and you start processing the next request immediately after finishing the previous one.
So the answer is: don't wait for the children to finish.
In FBP, there is no time dependencies between the components. So if I
have a chunk of data, it should be able to flow from one end of the
diagram to the other regardless of how any other pieces of data are
being handled. In order to program an FBP system, you have to minimize
your dependencies.
source
When creating the parent and children, you know all the connections between blocks, so just configure the children to send processed data directly to the next block. For example: P1 has children C1 and C2. You send a message to P1, it delegates it to C1, the packet flows a couple of times between C1 and C2, and after that C1 or C2 sends it directly to P2.
Blocks should be stateless. Their output should not depend on previous requests, so even if C1 and C2 are processing data from two different requests to P1, that is OK. There could be situations where P1 gets data packet D1 and then D2, but outputs the answers in a different order, R2 and then R1. That is also OK. You can use an Erlang reference to tag the messages and then check which response belongs to which request.
I don't think there is a ready-made library for that, but it is really easy to hack together, unless I missed something. Your P process should look like this:
ready_for_next_task() ->
    receive
        {task, Task, CallerPid} ->
            send_task_to_workers(Task)
    end,
    wait_for_task_finish(CallerPid).

wait_for_task_finish(CallerPid) ->
    receive
        {task_end, Response} ->
            CallerPid ! Response
    end,
    ready_for_next_task().
In wait_for_task_finish/1 you have only one receive clause, so the process will not accept the next task until the current one is finished. If you are waiting for multiple responses from the workers, you can simply add a second receive clause for partial responses and call wait_for_task_finish/1 recursively.
It is always better to have an explicit indicator that processing has ended, because you have no guarantees on message delivery time. Otherwise you could observe that all processes are currently waiting for a message and conclude that they have finished, when in fact one of them has not started yet, or one has sent a message to another and you checked before it reached the recipient's mailbox.
If the processes C1..Cn each handle only part of the work and don't know about the overall progress, then the gatekeeper P should know how many parts there are, receive them all one by one, and only then call ready_for_next_task/0.
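A hedged sketch of that counting variant, building on the code above (the message shapes and the wait_for_parts name are mine):
%% P tracks how many partial results are still outstanding and only
%% returns to ready_for_next_task/0 once the count reaches zero.
wait_for_parts(0, Acc, CallerPid) ->
    CallerPid ! {result, lists:reverse(Acc)},
    ready_for_next_task();
wait_for_parts(N, Acc, CallerPid) when N > 0 ->
    receive
        {part, Part} ->
            wait_for_parts(N - 1, [Part | Acc], CallerPid)
    end.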

Missing master heartbeat does not cause node to react in a CANopen system

I have a strange finding about the heartbeat-protocol in CANopen. Maybe somebody else has seen something like this and maybe it is supposed to work like this... Anyway, here's what it's about:
In CANopen there are two timeout-based life-guarding mechanisms: the first is node guarding, which I will not mention further, since it's considered old news.
The other one is called heartbeat. It is pretty simple: Any participant on the network sends a regular message stating its node ID and its state. The frequency is defined by object 0x1017sub0 and is called heartbeat-producer-time. If it is set to zero, no heartbeat is being sent.
Any other participant can then define a number of nodes it wants to find on the network plus the maximum time there may be between two consecutive heartbeat-messages. This information is stored in object 0x1016sub1..n as 32-bit entries for as many nodes as this particular node wants to listen to.
The entries consist of the node ID (bits 22 to 16) and the mentioned maximum time that may elapse between heartbeats, called the heartbeat-consumer-time (bits 15..0). Again, if an entry is zero, it is ignored.
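To make that layout concrete, here is a hedged sketch in Erlang bit syntax (matching the rest of this page; the function name is mine) that decodes such an entry:
%% Bits 31..23 are reserved, bits 22..16 carry the node ID,
%% and bits 15..0 the consumer time in milliseconds.
decode_heartbeat_consumer(<<_Reserved:9, NodeId:7, TimeMs:16>>) ->
    {NodeId, TimeMs}.
%% decode_heartbeat_consumer(<<16#000107D0:32>>) returns {1, 2000}.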
As you may have gathered, there is no distinction between network-master (node ID 1) and slaves (node IDs 2 to 127).
So far the theory, now for my problem:
I configure one of the slave nodes in my network as a heartbeat consumer for the master, so there's an entry in object 0x1016sub1 that looks like this: 0x000107D0, meaning that a heartbeat message from the master is expected at least every two seconds (0x07D0 = 2000 ms).
I have observed that this works in two examples. If I send a master-heartbeat for a time and then stop, the node either returns to pre-operational mode or sends an appropriate emergency-message.
If I don't send any master heartbeat messages, I would expect that after I start the node (send it into operational mode), it takes at most two seconds for it to return to pre-operational mode, send an appropriate emergency message, or both. But in the two examples I tried, nothing happened: if I never send any heartbeat, the node never expects one and just keeps on running.
The two examples are very different from each other. I am not sure whether they use the same CANopen-stack library perhaps.
Is there an explanation?
If you read the CANopen User Manual, section 1.3.1.6, page 39, you will notice that the heartbeat consumer is first activated upon reception of a heartbeat from the producer. Since in your example the first heartbeat is never sent, I would assume the consumer is never activated.
