I have an Erlang cluster of 4 nodes running on different hosts. I recently encountered an issue where the beam process was running on one of the nodes but the application was not. I could see that my OTP application was still holding client connections; the client connections are TCP connections.
Every node was able to ping the others, but this node appeared to be stopped from the other nodes' point of view.
Very weird, but pretty much the same thing happened. Do you have any clue what might be wrong with these nodes?
Can TCP retransmissions cause this?
Related
I'm sitting with a new issue that you might also face soon, and I need a little help if possible. I've spent almost two working weeks on this.
I have 2 possible solutions for my problem.
CONTEXT
I have two Kubernetes clusters, called FS and TC.
The Jenkins I am using runs on TC.
The slaves do deploy in FS from the TC Jenkins; however, the slaves in FS will not connect to the Jenkins master in TC.
The slaves use a TCP connection that requires a HOST and PORT. However, the exposed JNLP service on TC is HTTP (http://jenkins-jnlp.tc.com/), which uses nginx to auto-generate the URL.
Even if I use
HOST: jenkins-jnlp.tc.com
PORT: 80
It will still complain that it's getting serial data instead of binary data.
The complaint
For TC I made use of the local JNLP service HOST (jenkins-jnlp.svc.cluster.local) with PORT 50000. This works well for our current TC environment.
SOLUTIONS
Solution #1
A possible solution would involve an HTTP-to-TCP relay container running between the slave and master on FS. It would be pointed at the HTTP URL in TC (http://jenkins-jnlp.tc.com/), encapsulating the HTTP connection as TCP (localhost:50000) and vice versa.
The slaves on FS could then connect to the TC master using the TCP port exposed by that container in the middle.
Diagram to understand better
Solution #2
People kept complaining, and eventually new functionality was added to Jenkins around 20 Feb 2020: WebSocket support, which can run over HTTP and be converted to TCP on the slave.
I did set it up, but it seems too new and is not working for me. Even though the slave on FS says it's connected, it's still not properly communicating with the Jenkins master on TC, which still shows the agent/slave pod as offline.
Here are the links I used
Original post
Update note on Jenkins
Details on Jenkins WebSocket
Jenkins inbound-agent github
DockerHub jenkins-inbound-agent
CONCLUSION
After a lot of fiddling, research, and banging my head against the wall, I think the only workable option is solution #1. The problem with solution #1 is that a simple tool or service to encapsulate HTTP to TCP and back does not exist (as far as I know; I searched for days). This means I'll have to make one myself.
Solution #2 is still too new: there are next to no docs to help me out or make setting it up easy, and it seems to come with some bugs. The only way to fix those bugs would apparently be to modify both Jenkins and the JNLP agent's code, and I have no idea where to even start with that.
UPDATE #1
I'm halfway done with the code for the intermediate container. I can now get a downstream from HTTP to TCP; I just have to set up the upstream from TCP to HTTP.
Also, considering the amount of multi-threading required to run a single central Docker container converting the protocols, I figured I would add the HTTP-to-TCP container as a sidecar to the Jenkins agent once it's done.
This way, every time a slave spins up in a different cluster it will automatically be able to connect, and I don't have to worry about multiple connections. That is the theory, but obviously I want results, and so do you guys.
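For anyone who wants to follow along, here is a minimal sketch of the relay's core byte-pump, assuming plain TCP on both legs; the listen port, the upstream endpoint, and the structure are placeholders, and the HTTP encapsulation described above (downstream HTTP to TCP, upstream TCP to HTTP) would replace the plain upstream connection.

    import asyncio

    LISTEN_PORT = 50000                     # what the FS agent uses as HOST:PORT
    UPSTREAM = ("jenkins-jnlp.tc.com", 80)  # placeholder for the TC-side endpoint

    async def pump(reader, writer):
        """Copy bytes in one direction until the source side closes."""
        try:
            while True:
                data = await reader.read(65536)
                if not data:
                    break
                writer.write(data)
                await writer.drain()
        finally:
            writer.close()

    async def handle_agent(agent_reader, agent_writer):
        # One upstream connection per incoming agent connection.
        up_reader, up_writer = await asyncio.open_connection(*UPSTREAM)
        # Run both directions concurrently: agent -> upstream and upstream -> agent.
        await asyncio.gather(pump(agent_reader, up_writer),
                             pump(up_reader, agent_writer))

    async def main():
        server = await asyncio.start_server(handle_agent, "0.0.0.0", LISTEN_PORT)
        async with server:
            await server.serve_forever()

    if __name__ == "__main__":
        asyncio.run(main())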
We have a bare-metal Docker Swarm cluster with a lot of containers.
Recently we had a full stop on the physical server.
The main problem happened on Docker startup, when all the containers tried to start at the same time.
I would like to know if there is a way to limit the number of containers starting at once?
Or if there is another way to avoid overloading the physical server.
At present, I'm not aware of any ability to limit how fast swarm mode will start containers. There is a TODO entry in the code to add an exponential backoff, and various open issues in swarmkit, e.g. 1201, that may eventually help with this scenario. Ideally, you would have an HA cluster with nodes spread across different AZs, so that when one node fails the workload migrates to another node and you do not end up with one overloaded node.
What you can use are resource constraints. You can configure each service with a minimum CPU and memory reservation. This would prevent swarm mode from scheduling more containers on a node than it could handle during a significant outage. The downside is that some services may go unscheduled during an outage and you cannot prioritize which are more important to schedule.
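As a rough illustration of the reservation idea (not a drop-in config: the image, service name, and numbers are placeholders), here is what it could look like with the Docker SDK for Python; the same values map to --reserve-cpu/--reserve-memory on docker service create, or to deploy.resources.reservations in a compose file.

    import docker
    from docker.types import Resources

    client = docker.from_env()

    # Reserve half a CPU and 256 MiB for the service, so the scheduler will not
    # pack more containers onto a node than it can actually handle.
    client.services.create(
        image="myorg/myservice:latest",         # placeholder image
        name="myservice",                       # placeholder service name
        resources=Resources(
            cpu_reservation=500_000_000,        # 0.5 CPU, expressed in nanoCPUs
            mem_reservation=256 * 1024 * 1024,  # 256 MiB, expressed in bytes
        ),
    )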
I'm thinking about the following high-availability solution for my environment:
A datacenter with one powered-on Jenkins master node.
A datacenter for disasters with one powered-off Jenkins master node.
Datacenter one is always powered on; the second is only for disasters. My idea is to install the two Jenkins masters using the same IP but with a shared NFS. If the first goes down, the second starts with the same IP and I still have my service running.
My question is: can this solution work?
Thanks all for the help ;)
I don't see any particular reason why it should not work. But you still have to monitor after a switch-over, because I have faced a situation where jobs that were running when Jenkins abruptly shut down were still in the queue when the service was recovered, but they never completed afterwards; I had to manually delete the builds using the script console.
On the Jenkins forums a lot of people have reported such bugs. Most of them seem to have been fixed, but there are still cases where this can happen, because every time Jenkins is started or restarted the configuration is reloaded from disk, so there can be inconsistency between the in-memory config from before and the reloaded config.
So in your case it might happen that an executor thread is still blocked when the service is recovered. You have to make sure that everything is running fine after recovery.
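For the stuck-job symptom above, something along these lines can help spot and clear the leftovers after recovery. The author used the script console; this is only an illustrative sketch using the python-jenkins client instead, and the URL, credentials, and the reliance on the queue item's "stuck" flag are assumptions.

    import jenkins

    # Placeholder URL and credentials.
    server = jenkins.Jenkins("http://jenkins.example.com",
                             username="admin", password="api-token")

    # Look for queue items that survived the restart but are flagged as stuck.
    for item in server.get_queue_info():
        if item.get("stuck"):
            print("cancelling stuck queue item:", item["task"]["name"])
            server.cancel_queue(item["id"])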
I have been having this issue for a while and I am not sure how to fix it. I have a Docker container running PHP+Apache with an application. The MySQL and MongoDB servers are on the same network as the host. So:
MySQL DB Server IP: 192.168.1.98
Mongo DB Server IP: 192.168.1.98
Host: 192.168.1.90
For some reason the connectivity between the application running in the container and the DB servers is pretty slow, and long queries sometimes take more than a minute.
I can say the problem is not the DB server, because running the same application directly on that server is fast, so I think it is something network-related, but I am not sure what or why.
Can anyone give me some advice on this?
You have not given much information, but based on what you have described:
The simplest explanation could be that the amount of data being transferred across the network is large. Even though the hosts are on the same network, moving a large amount of data between two machines is considerably slower than copying it on the same host.
Since it seems you are running both MongoDB and MySQL on the same host, they could easily be interfering with each other. While containers provide isolation at the operating-system level, the hardware is not aware of containers; when both databases hit the disk at the same time, performance can degrade.
I have personally run into both of these issues at different times, and while they seem simple they can have a significant impact on performance. It would be nice if you could provide some additional information to help us better understand your problem.
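One quick way to narrow it down is to time the same query once from inside the container and once directly on the host, and compare the two numbers. A sketch, assuming the MySQL side and PyMySQL; the credentials, database, and query are placeholders, and 192.168.1.98 is the DB host from the question.

    import time
    import pymysql

    # Placeholder credentials and database name.
    conn = pymysql.connect(host="192.168.1.98", user="app",
                           password="secret", database="appdb")
    try:
        with conn.cursor() as cur:
            start = time.perf_counter()
            cur.execute("SELECT 1")  # swap in one of the slow queries
            cur.fetchall()
            print(f"round trip: {time.perf_counter() - start:.3f}s")
    finally:
        conn.close()

If the run inside the container is dramatically slower than the run on the host for the same query, the container's network path is the place to look; if both are slow, the database side deserves another look.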
I'm evaluating a strategy for introducing Docker at a small company with 2 servers. We want both of them working as a cluster, to load-balance the work, but also to act as a fail-safe for one another in case of failure.
From what I understand, etcd requires a minimum of 3 up hosts or you lose the ability to put/get keys. That would not be possible with 2 machines, and with 3 machines none could fail. Is this assessment correct?
The only solution would be to run a single etcd node, but that would mean that if the machine that failed was the "etcd" one, then both would stop working correctly...
Just to clarify, I want the benefits of something like fleetd's scheduling and clustering abilities, but with a small-sized deployment. Moving containers/systemd units and data manually between hosts is my backup plan, but it is less than ideal.
You can run CoreOS with only 2 hosts; however, you will lose your etcd cluster as soon as you don't have a quorum, and with only 2 machines that happens whenever either of them is down, for example when both are rebooted. With 3 hosts, you have a much better chance of keeping a quorum when machines are rebooted.
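To make the quorum arithmetic concrete: a Raft-backed store like etcd needs a strict majority, floor(n/2) + 1, of its members reachable to accept writes.

    # Majority quorum: floor(n/2) + 1 members must be up for writes to succeed.
    for n in (1, 2, 3, 5):
        quorum = n // 2 + 1
        print(f"{n} members: quorum {quorum}, tolerates {n - quorum} failure(s)")

With 2 members the quorum is both of them, so losing either one stops writes; with 3 members one can be down and the cluster keeps serving.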
If you are willing to have one host always be considered the master, you can do this; you just have to be sure you understand how to make an etcd peer consider itself the master if quorum is lost.
If you have static IPs, then you have more control over your cluster and should be fine setting the cluster IPs explicitly; then, even if both servers are restarted, they should be able to discover each other and reach a stable state.
Take a look at the docs.