rdma unexpected CM event 3 - RDMA_CM_EVENT_ROUTE_ERROR - network-programming

I am hosting an InfiniBand server on a Linux machine, and I have created a client that connects to that service on the same machine. This works fine most of the time.
But in one instance, when I tried to connect to the server from the same client (with no prior connection to that server), it threw RDMA_CM_EVENT_ROUTE_ERROR and the connection could not be established.
I don't know the root cause of this error, and it is not reliably reproducible.
This makes my application unreliable. I want to know the root cause.

Related

Connection reset by peer on Azure

I have a web application running on an App Service on Azure cloud.
On the back end I'm using a TCP connection to our database (Neo4j graph DB). The best practice is to open the TCP connection and keep it alive so that queries respond faster.
The issue I encountered is that the database is logging the exception "Connection reset by peer".
Reading on the web, I found that Azure may have a TCP timeout configured by default, reportedly 4 minutes, which could be the root cause of my issue.
Does anyone know how to configure TCP keep-alive to be always on for App Services on Azure?
I found out how to do it on Google Cloud, but nothing about Azure.
Thank you in advance.
OaicStef
From everything I can find that is not an adjustable setting. Here is the forum link that says it will not be changing and that is a couple years old at this point. https://social.msdn.microsoft.com/Forums/en-US/32b76114-67a4-4e6b-ac45-61b0f0a0829f/changing-the-4-minute-request-time-out-for-app-services?forum=windowsazurewebsitespreview
I think you are going to have to add logic to your app that tests the connection and, if it has been closed, either reopens it or creates a new one. I don't know what language you are using, so I can't make any specific suggestions there.
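As a rough illustration of that reconnect logic, here is a minimal Python sketch (Python is an assumption, since the question does not name a language). The class name `ReconnectingClient` and its parameters are hypothetical, and a real Neo4j driver will have its own connection handling; this only shows the reopen-on-failure idea for a raw TCP socket.

```python
import socket
import time


class ReconnectingClient:
    """Keeps one TCP connection open and reopens it when a send fails.

    Hypothetical helper for illustration; not part of any Azure or Neo4j API.
    """

    def __init__(self, host, port, retries=3, delay=0.5):
        self.host = host
        self.port = port
        self.retries = retries  # how many attempts per send
        self.delay = delay      # pause between attempts, in seconds
        self.sock = None

    def _connect(self):
        self.sock = socket.create_connection((self.host, self.port), timeout=5)

    def close(self):
        if self.sock is not None:
            try:
                self.sock.close()
            finally:
                self.sock = None

    def send(self, payload: bytes):
        last_error = None
        for _ in range(self.retries):
            try:
                if self.sock is None:
                    self._connect()
                self.sock.sendall(payload)
                return
            except OSError as exc:
                # The connection is unusable: drop it and try a fresh one.
                last_error = exc
                self.close()
                time.sleep(self.delay)
        raise ConnectionError(
            f"giving up after {self.retries} attempts"
        ) from last_error
```

Note that a write on a connection the other side has already dropped can appear to succeed, so the failure may only surface on a later send or on the next read.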
Edit
I will add that the total number of TCP connections that can be open on a single App Service is about 6k, at least on the S1. Keep that in mind, because if you don't have pooling on the server side or you are not disposing of connections, you will exhaust the TCP pool and start getting errors. I recommend you configure an alert for that.

How to ensure the receiving application is up and running in client-server communication?

I am presently working on a client-server solution to transfer files to another machine via a socket network connection. Since I intend to do some evaluation on the receiving end as well I am assuming that I will need to have some kind of client or server programme running there, too.
I am fairly new to the whole client-server thing and therefore have the following elementary question:
My present understanding is that client and server will be two independent programmes running on two different machines. How would one typically ensure that the communication partner (i.e., the server when sending from a client and the client when sending from a server) is actually up and running on the remote machine that I want to transfer a file to?
So far, I have been looking into the following options:
1) In the sending programme, include ssh access to the remote machine and start an instance of the receiving programme on the remote machine.
2) Have the receiving programme run as a daemon process on the remote machine. This would mean that the receiving programme should always be running on the remote machine. However, how would I know whether the process has crashed or has been shut down for some reason, and how would one recover from that without option 1) above?
So, my main question is: Are there any additional options that might be worth considering?
Thanks for your view on this!
Depending on how your client server messages are setup, a ping (I don't mean the ICMP ping, but the basic idea) message, where the server can respond with "I am alive" would help. This way at least you know the server end is running.
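A minimal sketch of such an application-level ping, in Python, might look like this. The `PING`/`ALIVE` wording and the names `AliveHandler` and `is_alive` are made up for illustration; any request/reply pair your protocol already has would do.

```python
import socket
import socketserver


class AliveHandler(socketserver.StreamRequestHandler):
    """Server side: answer a one-line PING with ALIVE."""

    def handle(self):
        line = self.rfile.readline().strip()
        if line == b"PING":
            self.wfile.write(b"ALIVE\n")


def is_alive(host, port, timeout=2.0):
    """Client side: True only if the server answers the ping in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING\n")
            reply = sock.makefile("rb").readline().strip()
            return reply == b"ALIVE"
    except OSError:
        # Refused, unreachable, or timed out: treat the server as down.
        return False
```

The sender would call `is_alive(...)` before starting a transfer and fall back to whatever recovery it has (for example, option 1) from the question) when it returns False.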
In production environments it is common to put monitoring systems in place for services like this. Another option worth considering is xinetd, which starts a service when an incoming connection arrives.
There are probably newer ways to achieve automatic start/restart, or start on connection, with systemd/systemctl, but I am not familiar enough with them to give you the specifics.
A somewhat crude, but effective means may be a cron job that periodically runs a script to enforce keeping the service up.
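In the same crude spirit, a small supervisor loop can stand in for the cron job: it starts the receiving programme and starts it again whenever it exits. This is a different technique from cron (one long-running parent process instead of a periodic check), and the function name `keep_running` and its parameters are hypothetical.

```python
import subprocess
import time


def keep_running(command, max_starts=3, pause=1.0):
    """Run `command`, starting it again each time it exits.

    Stops after `max_starts` starts so the sketch always terminates;
    a real watchdog would probably loop forever. Returns the number of starts.
    """
    starts = 0
    while starts < max_starts:
        starts += 1
        exit_code = subprocess.call(command)  # blocks until the process exits
        print(f"start #{starts} exited with code {exit_code}")
        if starts < max_starts:
            time.sleep(pause)  # avoid a tight restart loop
    return starts
```

For example, `keep_running(["python3", "receiver.py"])` would supervise a (hypothetical) `receiver.py`.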

TCP/IP long-term connections

I have a server application which runs on a Linux machine. I can connect to this application from Windows/Linux machines and can send/receive data. After a few hours, something occurs and I get the following error on the client side.
On Windows: An existing connection was forcibly closed by the remote host
On Linux: Connection timed out
I searched the web and found some posts that suggest increasing or decreasing the OS's keep-alive time. However, it didn't work for me.
Can I find a solution to this problem, or should I simply try to reconnect to the server when the connection is forcibly closed?
EDIT: I have tracked the situation. I sent data to the remote node and sent more data after waiting 5 hours. The sending side sent the first data, but when the sender sent the second data, the remote node didn't respond. The sender's TCP/IP stack retried 5 times, increasing the time between retries. Finally, the sender reset the connection. I can't be sure why this is happening (maybe because of a firewall or NAT - see Section 2.4), but I applied two different approaches to solve this problem:
Use TCP/IP keep alive using setsockopt (Section 4.2)
Make an application level keep alive. This is more reliable since the first approach is OS related.
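For the first approach, a minimal Python sketch of enabling TCP keep-alive with setsockopt could look like this. The helper name `enable_keepalive` and the timing values are illustrative assumptions; the per-socket `TCP_KEEP*` options are not available on every platform, so they are only set where Python exposes them.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keep-alive probes for an existing socket.

    idle:     seconds of silence before the first probe
    interval: seconds between probes
    count:    unanswered probes before the connection is dropped
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning; without it the OS-wide defaults apply
    # (the default idle time on Linux is typically two hours).
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
```

If a firewall or NAT is dropping idle connections, the idle time has to be shorter than that device's timeout for the probes to help.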
It depends on what your application is supposed to do. A little more information and perhaps the code you use for listening and handling connections could be of help.
Regardless, a longer keep-alive time should technically prevent the OS from cutting you off, so perhaps something else is causing the trouble.
That could be a router malfunction, or traffic causing your keep-alive packets to get lost.
If you aren't already testing it on a LAN (without heavy traffic), I suggest doing so.
It might also be due to how your socket is handled (which I can't determine from your question).
This article might help.
Non blocking socket with timeout
I'm not used to how connections are handled on Linux, but I expect the OS won't cut off a connection unnecessarily.
You can re-establish the connection as a recovery, but you need to take into account that not all disconnects are gentle, so you could end up recovering a connection you actually wish to be closed.
Since it is TCP, it will do its best to make a gentle disconnect, but you can send a custom message telling the server or client not to re-establish the connection right before disconnecting. That way you can be absolutely sure, even though it should be unnecessary.
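A small Python sketch of that "do not reconnect" message, assuming a made-up `BYE` marker and hypothetical helper names, could be:

```python
import socket

GOODBYE = b"BYE\n"  # hypothetical marker meaning "this close is intentional"


def close_gracefully(sock):
    """Tell the peer the disconnect is deliberate, then close."""
    sock.sendall(GOODBYE)
    sock.shutdown(socket.SHUT_WR)
    sock.close()


def should_reconnect(sock):
    """Read until the peer goes away.

    Returns False only if the peer said goodbye before closing;
    any other end of stream, or a socket error, counts as unintended.
    """
    buffer = b""
    try:
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break  # peer closed its side
            buffer += chunk
    except OSError:
        return True  # reset or timeout: not a deliberate close
    return not buffer.endswith(GOODBYE)
```

In a real protocol the marker would need framing so that it cannot be confused with ordinary payload that happens to end in the same bytes.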

TCP/IP connectivity via DataSnap

I wrote a multi-tier application suite in Delphi XE, using DataSnap (VCL application).
This will be used internally, in my company, mostly to replace the outdated fax communication.
Everything works fine, but I came across an unpleasant situation: the server machine is behind a router, so it has an internal network IP. I forwarded (in the router) all incoming connections on port 211 (the DataSnap default) to the server's internal IP, and about 8 times out of 10 all the clients connect to the server without any problems.
The problem is that the other 2 times I get all sorts of connection errors (mostly connection timed out). When this happens I have to close and reopen either the server application or (some of) the clients, and then it works.
Right now I'm still in the design phase, so it's only a bother, but when I do release it I don't want to either tell everyone NOT to EVER close the application (once it works, it works, no more problems) or close and reopen the applications each time there is a connection problem.
How can I eliminate this problem?
I had (only) a look at NetCat and SoCat, but (to me) it seems overkill for this situation. Is there another way to solve this?
The solution was switching off the router's internal firewall.

Datasnap Service application fails

I have created a DataSnap service, using Bob Swart's white paper as a guide. I have been debugging and have deployed successfully using the VCL Forms application as a server. But when I try to deploy the service version, it installs OK, but when I try to start the service it immediately stops. The error in the event log suggests that the port is already in use; I have tried different port numbers for both the TCPServerTransport and the HTTPService without any joy. The DSServer is not set to Autostart, as I want to set the port number from a configuration file. The error message displayed in the event log is:
Service failed on start: Could not bind socket. Address and port are already in use..
I have also tried writing to a log file on start up and execute but it looks as if it is not getting this far.
Solution needed asap, before I have to revert back to a thick client which I do not really want to do.
Thanks
Firstly get a copy of TCPView from the Sysinternals suite (now run by Microsoft) and use it to monitor which app is using the port you want to use.
I would hazard a guess that if the app works fine as a stand alone (as you say it does) and you are trying to use the same port in the service then perhaps the service app is opening up the port at startup without you realizing it and then when you try to open the port manually the app finds it already in use. Or somehow the app is trying to open the port twice. The first time is successful but, maybe due to an event or an unexpected code path, the app tries to open it a second time and fails. TCPView will help spot this.
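If you want a quick scripted check alongside TCPView, a minimal Python sketch that tries to bind the port tells you whether anything already holds it. The function name `port_is_free` is hypothetical, and this only reports whether a bind succeeds right now; it does not tell you which process owns the port.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use", or a permissions problem.
            return False
    return True
```

Running it just before the service starts, and again from inside the service right before it opens its listener, would help confirm whether the service itself is the one binding the port twice.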
If you are sure that the port you have configured is actually free and not in use by any other software on the machine, then there might be some anti-virus / security software running that is blocking all software from listening on either specific ports or on any port except a few configured ones. The message you are getting could be one of the symptoms of how the anti-virus / security software handles attempts by apps to start listening on a port.
