Apache Storm Timeout Issues

We are using Storm 1.0.2 and have a problem with spurious timeouts in two completely unrelated projects. We have verified the same behavior in 0.10 as well.
We have 2 different scenarios:
First: no matter what we set the tuple timeout to, when that time elapses we get a number of tuples back as failed, but ALL of them are younger than the timeout. For example, if we set the timeout to 15 minutes (crazy high), the topology runs fine for 15 minutes processing thousands of tuples per minute, but at exactly 15 minutes we suddenly get a thousand or more fails. When we trace the message IDs back to the original emitted tuples, we find that all of them were emitted in the previous few minutes; none of them exceed the 15-minute timeout. It's almost as if, at the timeout, the system just randomly flushes out in-flight tuples for no reason.
Second: we have a topology that gets a fail back at the spout BEFORE the ack of the same message. The sequence of events: emit a tuple, then sleep in the spout for x seconds. During those x seconds, the final bolt acks the tuple. The spout wakes from the sleep, gets a fail for the tuple, and the very next call to the spout is the ack for the same tuple. During this testing some timeouts did elapse while the spout was sleeping, but the ack from the bolt came BEFORE the timeout. It's as if messages to the spout are queued up as fails ahead of acks, and there is no mechanism for pulling a queued, pending fail message when the ack arrives. It doesn't seem to be consistent: sometimes we get the fail message and sometimes we don't, and we could not figure out a pattern.
In both scenarios the workaround is to not have a timeout at all. We have been testing different things for a week now, and without a timeout every single one of our messages is processed just fine; nothing is lost. In one experiment we even ignored all the fails and everything was still processed, but we had thousands upon thousands of false fails. The problem is that the data we are processing is vetted to be 100% clean and the system itself is error free. In the real world, we want fails to re-emit tuples for a few retries and then go to an error log. But if we cannot count on Storm's built-in timeout mechanism to handle timeouts properly, we are stuck building our own timeout mechanism in the spout ourselves.
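To make that concrete, the retry-then-error-log behavior we would like to lean on Storm's ack/fail callbacks for looks roughly like the spout sketch below. This is our own illustration, not code from our topology; MAX_RETRIES, the pending/retries maps, and nextMessageFromSource() are placeholders.

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Sketch only: retry a failed tuple a few times, then log and drop it.
public class RetryingSpout extends BaseRichSpout {
    private static final int MAX_RETRIES = 3;                      // our policy, not a Storm setting
    private SpoutOutputCollector collector;
    private final Map<Object, String> pending = new HashMap<>();   // msgId -> payload
    private final Map<Object, Integer> retries = new HashMap<>();  // msgId -> attempts so far

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String payload = nextMessageFromSource();                  // placeholder for our real input
        if (payload == null) { return; }
        Object msgId = UUID.randomUUID().toString();
        pending.put(msgId, payload);
        collector.emit(new Values(payload), msgId);                // emit with a message id so it is tracked
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);
        retries.remove(msgId);
    }

    @Override
    public void fail(Object msgId) {
        String payload = pending.get(msgId);
        if (payload == null) { return; }
        int attempt = retries.merge(msgId, 1, Integer::sum);
        if (attempt <= MAX_RETRIES) {
            collector.emit(new Values(payload), msgId);            // re-emit for another try
        } else {
            System.err.println("giving up on " + msgId);           // stand-in for a real error log
            pending.remove(msgId);
            retries.remove(msgId);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("payload"));
    }

    private String nextMessageFromSource() {
        return null;                                               // hypothetical source of messages
    }
}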
Has anyone else experienced time-out issues like this? Is this a "known" issue? Are we maybe setting something not quite right in our Topologies?
Thanks
UPDATE...
Through trial and error, and a test topology that does nothing but run a number of bolts with a fixed sleep time, I believe I may have figured out a pattern in what is happening.
The problem lies in the fact that Storm does not guarantee message (tuple) processing order. Not only does it not guarantee it, it seems to randomize the messages to some degree. This randomization only occurs when we add parallelism to the bolts, and it is exacerbated by the fact that our spout pending setting was way too high. The number of workers does not seem to affect this; only the parallelism of the bolts does.
What I don't know (yet) is where the messages are getting delayed in the system. Is it from the spout to the first bolt, or in one of the bolts in between (I have 3 layers), or is the message really getting processed but not passed back to the spout immediately?
What my test topology has not shown yet is how messages are timing out BEFORE the timeout. The tests do clearly show messages that ack back AFTER the timeout, though. At this point I have to assume that those messages made their way through the bolts but just didn't get acked back to the spout in time. I need to set up a way to record each message's passage through the system.
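My plan for recording that passage is a pass-through bolt that stamps every tuple it sees before acking and forwarding it. Roughly something like the sketch below (the logged fields and the declared output field name are placeholders):

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;

// Pass-through bolt that stamps every tuple it sees, so a tuple's path and
// timing through the topology can be reconstructed from the logs.
public class TraceBolt extends BaseRichBolt {
    private OutputCollector collector;
    private String where;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.where = context.getThisComponentId() + "-" + context.getThisTaskId();
    }

    @Override
    public void execute(Tuple input) {
        System.out.println(where + " saw " + input.getMessageId()
                + " at " + System.currentTimeMillis());            // crude trace record
        collector.emit(input, input.getValues());                  // anchored pass-through
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("payload"));                   // placeholder field name
    }
}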
So this puts us between a rock and a hard place. The reason we have so many pending tuples is that we have a topology with 7 or 8 layers of bolts. When the pending number is low we of course get no timeouts, but the bolts run nowhere near capacity and our throughput (messages/second) is not very good. We try to parallelize the bolts so that each runs about equally in the capacity calculations, and once we reached that equalization (no hot spots) we started tuning other things. One of those tuning measures is to increase the spout's max pending: the idea is that the queue for each bolt always has messages in it, so a bolt instance never goes idle because there is nothing in its queue to do.
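For reference, these are the knobs we are juggling, shown as a trimmed-down sketch that reuses the spout and trace bolt sketched above. The component layout and numbers are made up for illustration, not our real topology:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class PipelineTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new RetryingSpout(), 1);
        builder.setBolt("layer1", new TraceBolt(), 8).shuffleGrouping("spout");
        builder.setBolt("layer2", new TraceBolt(), 8).shuffleGrouping("layer1");
        builder.setBolt("layer3", new TraceBolt(), 8).shuffleGrouping("layer2");

        Config conf = new Config();
        conf.setNumWorkers(4);
        // A high pending count keeps the bolts busy, but the message timeout then
        // has to cover queueing time behind all the other pending tuples, not just
        // a single tuple's own processing time.
        conf.setMaxSpoutPending(500);
        conf.setMessageTimeoutSecs(900);   // the 15-minute setting discussed above

        StormSubmitter.submitTopology("pipeline", conf, builder.createTopology());
    }
}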
The problem for us is not that the messages are out of order; it's that we cannot seem to use ANY timeout setting. If we do, we get fails back at the spout that are not really fails.
Does anybody have any idea why we are experiencing these issues, and what we could do to... well... not experience them? Aside from running with no timeouts.

Related

Speed up the process of requesting messages from SQS

We need to process a large number of messages stored in SQS (the messages originate from the Amazon store, and SQS is the only place we can save them to) and save the results to our database. The problem is that SQS can only return 10 messages at a time. Considering we can have up to 300,000 messages in SQS, even if requesting and processing 10 messages takes little time, the whole process takes forever, with the main culprit being actually requesting and receiving the messages from SQS.
We're looking for a way to speed this up. The intended result would be dumping the results to our database. The process would probably run a few times per day (the number of messages would likely be less per run in that scenario).
As Michael-sqlbot wrote, parallel requests were the solution. By rewriting our code to be asynchronous and making 10 requests at the same time, we managed to reduce the execution time to something much more reasonable.
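For anyone else hitting this, the shape of the fix is roughly the following. Our real code is async rather than thread-pooled; this sketch uses the AWS SDK for Java with a plain thread pool, and the queue URL and database call are placeholders:

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelSqsDrain {
    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue";   // placeholder

    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        ExecutorService pool = Executors.newFixedThreadPool(10);           // 10 receives in flight at once

        for (int i = 0; i < 10; i++) {
            pool.submit(() -> {
                while (true) {
                    ReceiveMessageRequest req = new ReceiveMessageRequest(QUEUE_URL)
                            .withMaxNumberOfMessages(10)                   // SQS per-request limit
                            .withWaitTimeSeconds(5);                       // long polling
                    List<Message> batch = sqs.receiveMessage(req).getMessages();
                    if (batch.isEmpty()) {
                        break;                                             // queue looks drained
                    }
                    for (Message m : batch) {
                        saveToDatabase(m.getBody());                       // placeholder DB write
                        sqs.deleteMessage(QUEUE_URL, m.getReceiptHandle());
                    }
                }
            });
        }
        pool.shutdown();
    }

    private static void saveToDatabase(String body) {
        // hypothetical: persist the message to our database here
    }
}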
I guess it's because I rarely use multithreading directly in my job that I hadn't thought of using it to solve this problem.

How to do an operation, and if it doesn't complete in 6 seconds to stop it?

I am trying to receive information from a telnet connection in Lua using LuaSocket. I have all of that up and running, except that when I receive, if I ask for anything less than the maximum number of bytes it takes 5 seconds, and if I ask for anything more than the number of bytes on the screen it takes upwards of half an hour.
My current idea for a solution is to try receiving, for instance, 750 bytes, and if that doesn't complete within 6-7 seconds try 700 bytes, then 650, and so on until I can receive quickly. I need to parse the information and find two specific phrases, so if it's possible to do that inside my telnet connection and just return that instead of the entire screen, that would work as well. I also don't need ALL of it, but I need as much of the received information as possible to raise the chances that my information is in that block, hence why I'm only decrementing by 50 in my example.
I can't find any functions that allow you to start reading something (doing a function) and then quit it after a certain time interval. If anybody knows how to do this, or has any other solutions to my problem, let me know please! :) Thanks!
here is what I need repeated:
info = conn:receive(x)
with x decrementing each time it takes longer than 6 seconds to complete.
The solution you are proposing looks a bit strange, as there are more straightforward ways to deal with asynchronous communication. First, you can use settimeout to limit the amount of time that send and receive calls will wait for results (be careful, as receive may return partial results in this case). The second option is to use select, which allows you to check whether a socket has something to read or write before issuing a blocking call.

Is it guaranteed that mnesia event listeners will get each state of a record, if it changes fast?

Let's say I have some record like {my_table, Id, Value}.
I constantly overwrite the value so that it holds consecutive integers like 1, 2, 3, 4, 5 etc.
In a distributed environment, is it guaranteed that my event listeners will receive all of the values? (I don't care about ordering)
I haven't verified this by reading that part of the source yet, but it appears that sending a message out is part of the update process, so messages should always come out, even on very fast changes. (The alternative would be for Mnesia to either queue messages or queue changes and run them in batches. I'm almost positive this is not what happens -- it would be too hard to predict the variability of advantageous moments to start batching jobs or queueing messages. Sending messages is generally much cheaper than making a change in the db.)
Since Erlang guarantees delivery of messages to a live destination, this is as close as you're likely to get to a promise that every Mnesia change will eventually be seen. The order of messages can't be guaranteed on the receiving end (as it appears you expect), and of course a network failure could cause a set of messages to be missed (rendering the destination something other than live from the sender's perspective).

Delayed Job forgets about jobs that have been sitting on the queue for several minutes and have no attempts

I'm using delayed_job to create large numbers of jobs, nearly all at one time, to be done at a later time. If the number of jobs gets too high, then after a certain amount of time every job is cleared from the queue regardless of its state.
The following Rails project illustrates the issue:
https://github.com/hayksaakian/taskbreaker
to recreate the issue, create several tasks (say 5 to 15), each with around 100 goals
(from the web interface, or console)
then in console attempt to do these tasks with:
Task.attempt_tasks
What will happen is the following:
Many jobs will be created, the workers do their thing for several minutes, and then, poof, every job disappears from the queue.
To verify this is the case, check any task: you'll notice that each accomplishment may not have an arbitrary_number equal to 10 (it should be 10, since we increment by one for each of the delayed attempt_string calls). The arbitrary_array of each accomplishment is also not of length 10 (which it should be, given that we delayed 10 calls to attempt_array).
I'm not sure why this is happening, as I'm seeing no errors, but I'm sure that it is happening.
See an example of the bad work at taskbreaker.herokuapp.com.
Note: I'm hosting on Heroku, if that's of any help. Also, you'll need at least 5 workers to recreate the issue in any reasonable amount of time.
This was due to a race condition. Since the methods were happening concurrently, their reads and writes were overlapping, resulting in unexpected output.

Anyone know average HL7 clinical message response times?

I'm designing a .NET interface for sending and receiving an HL7 message and noticed on this forum there are a few people with this experience.
My question is: would anyone be able to share their experience of how long it can take to get a message response back from a hospital HL7 server (particularly when requesting patient demographics)? Seconds, minutes, hours?
My dilemma is whether to design my application to make the user wait for the message to come back.
(Sorry if this is a little off topic; it's still kinda programming related. I searched the web for HL7 forums but got stuck, so if anyone knows of any, please let me know.)
cheers,
Jason
In my experience, you should receive an ACK or NAK back within a few seconds. The receiving application shouldn't do something like making you wait while it performs operations on the message. We have timeouts set to 30 seconds, and we almost never wait that long for a response.
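The "don't make the user wait forever" part usually comes down to a read timeout on the connection while waiting for the ACK. Here is a minimal sketch of that idea in Java (the question is about .NET, but it translates directly; the host, port, and message are placeholders, and plain MLLP framing is assumed):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class MllpQueryExample {
    private static final byte START = 0x0B;   // MLLP <VT>
    private static final byte END1  = 0x1C;   // MLLP <FS>
    private static final byte END2  = 0x0D;   // MLLP <CR>

    public static void main(String[] args) throws Exception {
        // placeholder message and endpoint
        String hl7 = "MSH|^~\\&|MYAPP|MYFAC|THEIRAPP|THEIRFAC|20240101120000||QRY^A19|MSG00001|P|2.3\r";

        try (Socket socket = new Socket("hl7.example-hospital.org", 6661)) {
            socket.setSoTimeout(30_000);       // stop waiting for the ACK after 30 seconds

            OutputStream out = socket.getOutputStream();
            out.write(START);
            out.write(hl7.getBytes(StandardCharsets.US_ASCII));
            out.write(END1);
            out.write(END2);
            out.flush();

            // Read until the end-of-block marker; a SocketTimeoutException is thrown
            // if no ACK/NAK arrives within the timeout, and the UI can move on.
            InputStream in = socket.getInputStream();
            StringBuilder response = new StringBuilder();
            int b;
            while ((b = in.read()) != -1 && b != END1) {
                if (b != START) {
                    response.append((char) b);
                }
            }
            System.out.println("Response:\n" + response);
        }
    }
}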
This is quite dependent on the kind of HL7 message sent. Messages like ADTs are typically sent as essentially updates to the server and are acknowledged almost immediately if the hospital system is behaving well. This results in a protocol-level acknowledgement, which indicates that the peer has received the message but not necessarily processed it yet.
Typically, most systems will employ a broker or message queue in their integration engines so you get your ack almost immediately.
Other messages like lab request messages may actually send another non-ack message back which contains the information requested. These requests can take longer.
You can check with the peer you're communicating with to see what integration engine they are using, and if a queue sits on that end which would help ensure the response times are short.
In the HL7 integration tool I work on, we use queues for inbound data so we can respond immediately. For our outbound connections, 10-second timeouts are the default and seem to work fine for most of our customers.
When sending a Query type event in HL7, it could take a number of seconds to get the proper response back. You also need to code for the possibility that you will never get a response back, and the possibility that connected systems "don't do" queries.
Most HL7 nets that I have worked on, assume that all interested systems are listening for demographic updates at all times. Usually, receiving systems process these updates into a patient database that documents both the Person and Encounter (Stay) information on the fly.
In my location, my system usually gets about 10-20 thousand messages a day, most of which are patient demographic updates.
It depends whether the response is generated automatically by a system or only after a user does something on that system. An automatic response might take less than a second, depending of course on the processing being done and the system's current workload. If the system is not too busy and processing is just a couple of queries and verification of some conditions, then, allowing for network delays, the response time should be a few seconds or less.
