NodeMCU resets with mqtt.client:close() or mqtt.client:connect() - lua

I have an application that uses mqtt for communication between modules and with a mobile terminal.
In some situations when messages do not arrive, the node performs a self-test of MQTT (sending a message to itself), and when the self-test fails it tries to reconnect to the broker (the MQTT offline event does not always arrive). Two problems may then arise:
If I perform a mqtt.client:close() to ensure that the client is closed (to avoid the second problem) and the client is already closed, the node resets.
If I perform a mqtt.client:connect() and the client is still connected, an exception and a reset occur.
Is there a way to know whether the MQTT client is connected or not?
Thanks for your comment. I am going to describe what I am doing, to see if you can help me:
I have two independent systems, a master and a slave. The master publishes a test message every 10 minutes. If there is no answer from the slave, it publishes a test message to itself. If this self-test does not arrive, a disconnection from the broker is assumed and a reconnection is initiated.
And here is where the problem comes in: sometimes the client is disconnected and everything goes well, but sometimes it is still connected but unresponsive, and the node resets with an "already connected" exception.
Performing a mqtt:close() before reconnecting should be safe, but if I send it and the client is truly disconnected, the node resets for no reason known to me.
All this happens without receiving any offline message.

Instead of waiting for messages manually sent by the master client (which could fail to send for various reasons, leading a listening device to the wrong conclusion about the state of its connection to the broker), I recommend using MQTT's built-in connection management.
First, you can make sure that each client's initial connection has succeeded by including an error handler in :connect(). Once the client has actually connected, nothing in the NodeMCU documentation indicates that it will close itself, though it may go offline.
Once connected, the client only knows that something is wrong when it sends a message and does not receive a response. It sounds like you are not calling :publish() much (which would otherwise let you know by returning false), so pinging may be best. If you expect to receive a message from the broker every n seconds, set a keepalive time slightly higher than that on the client.
Then, failure to get a response to those messages should trigger an event that you can respond to. That might be something like the following (not tested, may work better called outside the callback):
m:on("offline", function(client) m:close() end)

What happens when a message is published with QoS=1?

I would like to have a better understanding of the behavior of this library.
Specifically: let's say I have an open connection (over WSS, if this changes anything) with an MQTT server.
I publish a message with QoS=1.
My understanding is that mqtt waits for a PUBACK message. After the ack has been received, the done callback is called and the flow ends.
What is not clear to me is the low-level stuff: how long does the library wait for the ack? What happens if the ack doesn't come? Is the message resent? Is the connection closed/reopened? Something else?
Is this behavior tunable?
Prior to answering your specific question, I feel it's worth outlining what the protocol requires (I'll highlight the key term). The MQTT 3.1.1 spec says:
When a Client reconnects with CleanSession set to 0, both the Client and Server MUST re-send any unacknowledged PUBLISH Packets (where QoS > 0) and PUBREL Packets using their original Packet Identifiers [MQTT-4.4.0-1]. This is the only circumstance where a Client or Server is REQUIRED to redeliver messages.
The v5 spec tightens this:
When a Client reconnects with Clean Start set to 0 and a session is present, both the Client and Server MUST resend any unacknowledged PUBLISH packets (where QoS > 0) and PUBREL packets using their original Packet Identifiers. This is the only circumstance where a Client or Server is REQUIRED to resend messages. Clients and Servers MUST NOT resend messages at any other time [MQTT-4.4.0-1].
So the only time the spec requires that the publish be resent is when the client reconnects. v3.1.1 does not prohibit resending at other times but I would not recommend doing this (see this answer for more info).
Looking specifically at mqtt.js I have scanned through the code and the only resend mechanism I can see is when the connection is established (backed up by this issue). So to answer your specific questions:
how much "time" do the library awaits for the ack?
There is no limit; the callback is stored and called when the flow completes (for example).
what happens if the ack doesn't come? Is the message resent? Is the connection closed/reopened? Something else?
Nothing. However, in reality the use of TCP/IP means that if a message is not delivered then the connection should drop (and if the broker receives the message but is unable to process it, then it should really drop the connection).
Is this behavior tunable?
I guess you could implement a timed resend but this is unlikely to be a good idea (and doing so would breach the v5 spec). A better approach might be to drop the connection if a message is not acknowledged within a set time frame. However there really should be no need to do this.
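For completeness, here is a rough, untested sketch of that "drop the connection on a missed ack" idea with mqtt.js (the deadline, client id and broker URL are assumptions, not library defaults; clean: false plus a stable clientId is what makes the unacknowledged QoS 1 message go out again on reconnect):

import mqtt from "mqtt";

// clean: false + a stable clientId gives a resumable session, so
// unacknowledged QoS > 0 messages are resent when we reconnect.
const client = mqtt.connect("mqtt://broker.example.com", {
  clean: false,
  clientId: "my-stable-client-id",
});

const ACK_TIMEOUT_MS = 30_000; // arbitrary deadline, not a library default

function publishWithDeadline(topic: string, payload: string): void {
  const timer = setTimeout(() => {
    // No PUBACK within the deadline: force-close and reconnect, so the
    // stored message is redelivered on the new connection.
    client.end(true, () => client.reconnect());
  }, ACK_TIMEOUT_MS);

  client.publish(topic, payload, { qos: 1 }, (err) => {
    clearTimeout(timer); // PUBACK arrived (or the publish errored out)
    if (err) console.error("publish failed:", err);
  });
}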

Twilio WebRTC TURN relay randomly stops working after a few minutes

I am using the Twilio Network Traversal Service as part of a native application I am working on to perform peer-to-peer remote desktop connections. We implement a subset of the WebRTC protocol stack that is equivalent to the WebRTC data channels (not the WebRTC video and audio protocols). When using a TURN relay, the TURN allocation seems to be invalidated randomly somewhere between a few minutes and a maximum of 12 minutes from the session start. This issue looks very similar to this one, but the proposed workaround (sending silent audio) is not acceptable in my case, since I do not implement the WebRTC audio/video protocols.
I have been pulling my hair out over this problem for the last two weeks, and have isolated the issue as being the Twilio service itself. To compare, I used a web-based WebRTC data channel demo in firefox with the Xirsys TURN server cloud. I have wireshark captures showing firefox getting disconnected with Twilio just like my native application, while the exact same firefox demo doesn't get disconnected when using the Xirsys servers.
I was using Xirsys originally, but I experienced some instability with their service that made me switch to Twilio, which is why I would rather have Twilio fix this issue instead of going back with Xirsys. At the bare minimum, I would rather have two WebRTC hosting providers I can choose from that I know should work fine. This is why I am taking the time to explain the issue in detail so it can get fixed.
Here are two wireshark captures (with the peer-to-peer data messages filtered out) showing firefox using WebRTC data channels and the Twilio TURN relay servers:
The traffic stops being relayed after 4 minutes in the first capture, and after about 11 minutes in the second capture. In both captures, firefox detects that traffic stops being relayed (at the data channel level) and attempts a graceful disconnection by sending a Refresh request packet with a lifetime of zero. Both graceful disconnections result in a 437 Allocation Mismatch error, indicating that the server doesn't even know about the allocation firefox is trying to close gracefully.
With my native application, this would often take the form of a CreatePermission Request message that fails with a 438 "Wrong nonce" error, which is basically what should happen if a client tries to update the permission on an allocation that no longer exists. The error code 438 usually means "Stale nonce", which is not really an error, but an indication that the nonce has expired and the client should try again using the new nonce contained in the "error" message. It took me a while to figure out, but even if the error code is 438, the error string is not the same. I have observed a true stale nonce error with Xirsys and successfully updated my permission with the new nonce from the error response, so I know I can properly handle this case in my implementation.
Here is the source code for the WebRTC data channel demo I have used:
https://github.com/devolutions/webrtc-demo
For comparison, here is the same firefox data channel demo using the Xirsys TURN server cloud:
In this capture, I have let the demo run for about 16 minutes (it works for much longer than that, the longest I have tried is two hours). We can see that the traffic keeps getting relayed for the entire duration of the session, and CreatePermission requests keep getting sent by firefox with success. At the end, the graceful disconnection is triggered by firefox closing the WebRTC data channel (instead of being closed due to traffic no longer being relayed). As opposed to the Twilio captures, the Refresh request with a lifetime of zero is successful: the Xirsys TURN server still knows about the allocation and sends back a success response, as expected.
It should be noted that the ICMP unreachable errors are normal because I think in this case firefox is no longer listening on the given port when the response comes back. In other words, it sends the Refresh request with a lifetime of zero and doesn't wait for the answer to come back.
For the time being, I have no other choice but to go back to Xirsys, but I would really like the Twilio Network Traversal Service to be fixed. Let me know if you have more questions regarding the issue.
I have uploaded the wireshark captures here for reference.
EDIT: I have modified the webrtc demo page such that it doesn't close the connection when the ice connection state is set to 'disconnected'. Now I get the real disconnection when the ice connection state goes to 'failed'. However, it effectively didn't change anything, since in this case it takes just a few seconds more for the state to go from 'disconnected' to 'failed'.
Since I have new relevant screenshots and captures, I am updating the original question to clarify certain problems pointed out by Philipp Hancke:
First, here is a new capture with the ice connection state fix (the browser closes the connection only when the state goes to 'failed'):
It's interesting to see that this time, the session stayed up for a whole 18 minutes. This was taken on a Saturday morning, so I'm guessing that the issue could be related to the current workload on the Twilio servers. However, it failed in exactly the same way it always does for me. As a bonus, we even have a valid stale nonce response that is correctly handled by firefox.
However, if we take a different view of the same capture, we can see that the traffic stops being relayed for a solid 30 seconds before firefox considers the connection as being dropped and sends the Refresh request with a lifetime of zero. As in previous captures, the server responds with an Allocation Mismatch error, indicating it doesn't know which allocation firefox is talking about.
The last eight packets being sent are of the same size, so my guess is that they are retransmissions. After 30 seconds of retransmissions, it is likely that SCTP considers the transport as being dropped.
With regards to the refresh request with a lifetime of zero, I did a test where I close the connection early on, from the browser. In this case, the server recognizes the allocation and returns a success response:
The allocation mismatch is the easiest symptom to observe, but in my testing with my native application, I have seen similar errors with Refresh requests for non-zero lifetimes, and with CreatePermission requests (438 "Wrong nonce" error). However, since the browser closes the connection after 30 seconds of data not being relayed, it is hard to observe these errors with the current webrtc demo. If we could change that timeout to 10 minutes, we would see those errors as well.
Excellent problem description!
Without the server logs it is hard to determine what goes wrong. I tried with the appear.in TURN servers, which run an up-to-date version of coturn, and they show the same behaviour as the Twilio servers. Xirsys seems to be running a custom version of coturn (Coturn-0.5 'Xirsys Turn Services' in the software field, but coturn never had such a version).
In both captures, firefox detects that traffic stops being relayed (at the data channel level) and attempts a graceful disconnection by sending a Refresh request packet with a lifetime of zero.
Not quite. A refresh request with a lifetime of 0 is used to discard an allocation. At that point it does not matter what the server returns as the connection is beyond repair anyway.
This is caused by peerjs closing the peerconnection if the iceconnectionstate changes to disconnected, here in your bundled library version.
This is overly aggressive (and does not even fix things), and we've had a discussion about what the specification should do wrt trying to fix things with an ice restart here, which also links to a great explanation of the disconnected state.
The disconnected state probably happens because a few packets get lost. But this is something that can happen when there is minor congestion. I'd recommend removing the pc.close() in the disconnected case.
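In code, the suggested change amounts to something like this (a sketch; pc stands in for the demo's RTCPeerConnection):

const pc = new RTCPeerConnection();

pc.oniceconnectionstatechange = () => {
  // "disconnected" is deliberately ignored: a little packet loss can
  // trigger it, and ICE will usually recover on its own.
  if (pc.iceConnectionState === "failed") {
    pc.close(); // tear down only on a hard failure
  }
};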
If you are looking for other TURN providers, Tokbox provides the same service. For datachannels the latency of a properly run distributed TURN network does not matter as much as for VoIP so you might run your own servers in a single location instead.

Losing messages over lost connection xmpp

I went through this question:
Lost messages over XMPP on device disconnected
but there is no answer.
When a connection is lost due to some network issue, the server is not able to recognize it and keeps on sending messages to the disconnected receiver; those messages are permanently lost.
I have a workaround in which I ping the client from the server; when the client gets disconnected, the server recognizes it after 10 seconds and saves further messages in a queue, preventing them from being lost.
My question is: can 100% fail-safe message delivery be achieved some other way? I know Psi and many other XMPP clients are doing it.
On the iOS side I am using xmppframework.
One way is to employ Advanced Message Processing (AMP) on your server; another is to employ Message Delivery Receipts in your clients.
The former requires an AMP-enabled server implementation, and the initiating client has to be able to tell the server what kind of delivery status reports it wants (here, that an error be returned if delivery is not possible). Note that this is not bullet-proof anyway, as there is a window between the moment the target client loses its connectivity with the server and the moment the TCP stack on the server's machine detects this and tells the server about it. During this window, everything sent to the client is considered by the server to have been sent okay, because there is no concept of message boundaries in the TCP layer: if the server process managed to stuff a message stanza's XML into the system buffers of its TCP connection, it considers that stanza sent, and there is no way for it to know which bits of its stream did not reach the receiver once the TCP stack says the connection is lost.
The latter is bullet-proof, as the clients rely on explicit notifications about message reception. This does increase chattiness, though. In return, no server support for this feature is required; it is implemented solely in the clients.
go with XEP-0198 and enjoy...
http://xmpp.org/extensions/xep-0198.html
For a XMPP client I'm working on, the following mechanism is used:
Add Reachability to the project, to detect quickly when the phone is having connectivity problems.
Use a modified version of XEP-0198, adding a confirmation sent by the server. So the client sends a message, and the server confirms with a receipt. Later on, the receiving user will also confirm with a receipt. For each message you send, you get two confirmations: one from the server, one from the client. This requires modifications on the server, of course.
When the app is not connected to the XMPP server, messages are queued.
When the app is logged in again to the XMPP server, the app takes all messages which were not confirmed by the server and sends them again.
For this to work, you have to store the messages locally in the app with three possible states: "Not sent", "Confirmed by server", "Confirmed by user"
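A sketch of that local bookkeeping (TypeScript for illustration; the original client is iOS/xmppframework, and all names are made up):

// The three states mirror the answer; receipts move a message forward.
type MessageState = "Not sent" | "Confirmed by server" | "Confirmed by user";

interface OutgoingMessage {
  id: string; // stanza id, echoed back in the receipts
  body: string;
  state: MessageState;
}

const outbox = new Map<string, OutgoingMessage>();

function onServerReceipt(id: string): void {
  const msg = outbox.get(id);
  if (msg && msg.state === "Not sent") msg.state = "Confirmed by server";
}

function onUserReceipt(id: string): void {
  const msg = outbox.get(id);
  if (msg) msg.state = "Confirmed by user";
}

// On (re)login, resend everything the server never confirmed.
function resendUnconfirmed(send: (m: OutgoingMessage) => void): void {
  for (const msg of outbox.values()) {
    if (msg.state === "Not sent") send(msg);
  }
}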

Suggestions for sending e-mail notifications from a 2-tier application with the client potentially not connected to the internet

I have to add e-mail notifications to a client-server application.
Notifications happen as the user performs particular actions in the client UI.
If I had a middle tier or a service running on the server, I can imagine how to do it:
1) I simply create a DB table with "pending notifications"
2) as a user does an action that generates a notification, I add a record to the table
3) server-side, I would continuously try to send those mails, removing them from the table once sending is successful
I cannot do this now; I have a plan to add a service later on, but for now I must go the quick and dirty way.
So somehow what I was thinking to is to implement something like this:
1) as a notification-worthy event occurs at the client, the client (my exe) tries to send the notification; upon failure it logs the notification in the "pending notifications" table (failure can be because of a lack of internet connection or any other problem)
2) I add a Timer that will work from any client machine to check for pending notifications. If there are any, the client will try to send the e-mail (using a transaction: I will mark a field as "TryingToSendFromClientX" and in case of failure I will reset that field to NULL); a sketch of this loop follows the notes below
I think this approach would work, but it has obvious limitations (if after a failure no one logs into the system, no notification will be sent; the same would happen if a service went down). Can you comment on this approach and suggest a better one?
Additional notes (to better understand the scenario):
a) Note: all notifications are sent from the same e-mail account.
b) I don't need to keep track of who sent the e-mail.
c) the problem with creating the service now is that it would significantly complicate deployment, and I would need to create tools for monitoring the status of the service. That is something I will do in the future, but not now; later I plan to add more functionality (not only sending notifications) to the service, so at that point it makes more sense to create it.
d) I will send e-mails by using Indy components and SMTP server.
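A minimal sketch of the loop from points 1) and 2), with in-memory stand-ins for the "pending notifications" table and the SMTP send (the real application is Delphi with Indy, so everything here is illustrative only):

interface Pending { id: number; claimedBy: string | null; to: string; body: string; }

// Stand-in for the DB table; in the real app the claim must be an
// atomic UPDATE so two clients cannot grab the same row.
const pendingTable: Pending[] = [];

// Placeholder for the real SMTP send (Indy in the original application).
async function sendMail(p: Pending): Promise<void> { /* ... */ }

// Timer-driven pass: claim a row, try to send, delete on success,
// release the claim on failure so another client can retry later.
async function drainPending(clientId: string): Promise<void> {
  for (const row of [...pendingTable]) {
    if (row.claimedBy !== null) continue; // another client is on it
    row.claimedBy = clientId; // the "TryingToSendFromClientX" marker
    try {
      await sendMail(row);
      pendingTable.splice(pendingTable.indexOf(row), 1); // remove once sent
    } catch {
      row.claimedBy = null; // reset to NULL for a later retry
    }
  }
}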
If you are not willing to create the service now, I think you are stuck with the scenario you describe. There are some things though you could do to circumvent the problem of no user firing up the client anymore while there are still pending messages.
You could add a commandline utility (or commandline parameter as bepe4711 suggested) that will only check for pending messages and try to send them.
Add this commandline utility to the StartUp folder or the Run key in the registry. This way messages will at least get sent when the computer restarts, even if the user does not fire up your app.
Add a scheduled task to run this utility at least once every day. The scheduled task can be added by code or by your installer.
If you do both, you will only have to worry about pending messages of users that never start their computer again.
Perhaps you can add a parameter to your client which causes it to just look at the pending notifications and send them. After this it can terminate itself. It will just act like some kind of service.
Then you install the client on the server and start it every x minutes.
I do something very similar to the approach you describe. Instead of sending emails I need to call a web service. My application is installed on several laptops and they are commonly not connected to any network.
When my application raises an exception, I collect various bits of information, including user comments and screenshots. Then I attempt to send this to our web service. If by chance the web service is not available (i.e. not connected to the internet, or the web service is down), I write the results to an XML file on disk in the User Profile (App_Data) directory.
The one major difference is that I don't poll to check whether the server is up; I attempt to send the stored results again at application startup.
If both systems are running on Windows, have a look at MS Message Queue. It is designed to send notifications to systems which are not always online. I did it in .Net; there are easy-to-use classes already implemented. Not sure about Delphi.
The latest versions of Windows make much more use of the Windows Task Scheduler, and tasks can now be fired on an event (i.e. when a network card gets connected...). You could write a separate utility that tries to send pending notifications, even if no one is logged in.

A message queue model in Erlang (Comet chat)?

I am doing Comet chat with Erlang. I only use one connection (long-polling) for message transport. But, as you know, a long-polling connection cannot stay connected all the time: every time a new message arrives or the timeout is reached, it breaks and then connects to the server again. If a message is sent before the connection has been re-established, keeping the chat intact is a problem.
Also, if a user opens more than one window with Comet chat, all the chat messages have to stay in sync, which means a user can have many long-polling connections. So it is hard to keep every message delivered on time.
Should I build a message queue for every connection? Or is there a better way to solve this?
To me, the simplest way seems to be one process/message queue per user connected to the chat (even if they have more than one chat window open). Then keep track of the timestamp of the last message in the chat window application, and when reconnecting, ask for the messages after this timestamp. The message queue process should keep messages only for a reasonable time span. In this scenario, reconnecting is entirely up to the client. In another scenario you could send some sort of heartbeat from the server, but that seems less reliable to me, and it does not solve disconnections caused by anything other than a timeout. There are many variants of server-side queuing: one queue per client, per user, per chat room, per ...
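A sketch of that per-user queue (the answer describes an Erlang process per user; TypeScript is used here only to show the logic, and all names are illustrative):

interface ChatMessage { ts: number; from: string; text: string; }

// One queue per user; messages are kept for a bounded time span only.
class UserQueue {
  private messages: ChatMessage[] = [];
  constructor(private retainMs: number) {}

  push(msg: ChatMessage): void {
    this.messages.push(msg);
    const cutoff = Date.now() - this.retainMs;
    this.messages = this.messages.filter((m) => m.ts >= cutoff);
  }

  // On reconnect, the client asks for everything after the timestamp
  // of the last message its chat window saw.
  since(lastSeenTs: number): ChatMessage[] {
    return this.messages.filter((m) => m.ts > lastSeenTs);
  }
}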
