PBS jobs not starting after startup

I am new to PBS. Our system admin left for another workplace, and our cluster was shut down for a weekend for some reason.
Now I'm trying to restart it. There is one server machine, and the other machines load their operating system via DRBL.
There are some files we need to update manually after every restart - I did that.
Then I started pbs_mom and trqauthd on all machines.
Then I started pbs_server and maui on the server machine.
When a job is submitted with qsub, it does not run: it remains queued. It is possible to start it with qrun. pbsnodes -a shows the nodes are free, except one, which is down (unknown reason). The showq, qstat and qdel commands work on the cluster.
Thanks for any help: Gergo
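Since jobs only run when forced with qrun, the scheduler (Maui) is the usual suspect here. A few hedged diagnostic commands, assuming a standard Torque + Maui setup (the job ID is a placeholder and the Maui log path may differ on your install):

# Ask Maui why it is not starting a specific queued job
checkjob <jobid>
# Ask Maui for its view of the nodes and compare with pbsnodes -a
diagnose -n
# Full job details as pbs_server sees them
qstat -f <jobid>
# Maui's own log usually says why nothing is being scheduled
tail -f /var/spool/maui/log/maui.log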

Related

Docker install of AZCore results in authserver+worldserver doesn't exist error

I'm trying to spin up a fresh server using the AzerothCore Docker installation guide. I have completed all of the early installation steps, up until running the containers. Upon running the containers (for worldserver and authserver), I see the following output from the containers. It appears the worldserver and authserver binaries expected in dist/bin are missing; how can I resolve this issue?
Check your Docker settings and make sure you have allocated enough memory. If the containers have too little memory, they will not finish the compile. Check whether you have build issues.
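A hedged way to check this, assuming the docker-compose setup from the installation guide (the service names are guesses and may differ in your compose file):

# See how far the compile got and whether the build was killed mid-way
docker-compose logs worldserver
docker-compose logs authserver
# Check how much memory the Docker engine actually has available
docker info | grep -i "total memory"
# After raising the memory limit in Docker Desktop settings, rebuild and start again
docker-compose build --no-cache
docker-compose up -d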

Running Docker Desktop on Windows 10, cannot restart containers after system restart

I am running Docker Desktop version 2.1.0.0 (36874) in a Windows 10 environment.
I am using two separate container compositions, one of these binding to port 8081 on my machine, and the other binding to 9990 and 8787.
After a system restart, I am unable to start these container compositions again, because the ports are already bound.
So far, I have tried multiple approaches to solve this:
manually stop all containers prior to system shutdown
manually stop and remove all containers prior to system shutdown
the above, plus explicitly stopping the docker application prior to system shutdown
removing all containers after system startup and prior to restart
pruning the networks after container removal
restart docker app prior to restarting containers (this worked up until the last update)
I did fiddle around with the compose files and the configuration, but that would be too much detail to go into right now; none of these helped.
What I recently found is that, directly after a system startup and prior to starting any container, the process com.docker.backend was already listening on the bound ports. This is confusing, as the containers were shut down prior to system shutdown and are not run with a restart command.
So I explicitly quit the Docker Desktop app, and the process still remained, and it still bound the ports.
After manually killing the process as administrator from the power shell, and restarting the docker desktop application, my containers were able to start again.
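For reference, a rough sketch of that workaround from an elevated PowerShell, using port 8081 as an example:

# Find the process that is holding the port
Get-NetTCPConnection -LocalPort 8081 | Select-Object LocalPort, State, OwningProcess
# Confirm it is com.docker.backend
Get-Process -Id (Get-NetTCPConnection -LocalPort 8081).OwningProcess
# Kill it, then restart Docker Desktop
Stop-Process -Id (Get-NetTCPConnection -LocalPort 8081).OwningProcess -Force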
Has anyone else had this problem? Does anyone know a "fix" for this at all?
And, of course, is this even the right page to ask? As this is not strictly programming, I am unsure.
Docker setup gets screwed up sometimes, so try deleting %appdata%\Docker.
The problem went away after the update to version 2.1.0.1 (37199)

How to interact with an already running instance via terminal in MongooseIM?

I am using MongooseIM 3.2.0 built from source on an Ubuntu server. Below are my concerns:
What is the best way to run mongooseim as a service, so that it automatically restarts if mongooseim crashes or the system restarts?
How can I interact via the terminal with an already running mongooseim instance on the Ubuntu server, like "mongooseimctl live"? My guess is that running "mongooseimctl live" will try to create another instance. I just want to see the live logs and interactions, and I don't want to keep scrolling through long log files for this purpose.
I apologize if the answer to the above is obvious, but I just want to follow the best guidance.
mongooseimctl live or mongooseimctl foreground is mostly useful for development or smoke testing a deployment (unless you're running inside a container). For real world use cases you should start the server in the background with mongooseimctl start.
Back to the container - the best approach for containerised applications is to run them in the foreground, therefore in a container startup script use mongooseimctl foreground.
Once the server is running (no matter how it was started), attaching a shell to troubleshoot issues can be done with mongooseimctl debug. This is the command to use when you get the Protocol 'inet_tcp': the name mongooseim@localhost seems to be in use by another Erlang node error. Be careful if it's a production environment - you can easily take the server down with access to this shell.
If you're just interested in watching logs, with no interactive access to the server internals that the shell offers, a simple tail -f /your-configured-mongooseim-log-dir/* should be enough.
Ubuntu nowadays uses systemd for managing its services' lifetimes. A systemd .service file can be found at https://github.com/esl/MongooseIM/blob/master/tools/pkg/platforms/debian_stretch/files/build/mongooseim.service - we use it for packaging into Debian/Ubuntu .deb packages.
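A hedged sketch of using it, assuming you copy that unit file to /etc/systemd/system/mongooseim.service and adjust its paths for a source build:

# Pick up the new unit file
sudo systemctl daemon-reload
# Start MongooseIM now and on every boot (crash restarts depend on the unit's Restart= setting)
sudo systemctl enable --now mongooseim
# Verify it is running and follow its journal
sudo systemctl status mongooseim
journalctl -u mongooseim -f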

Spin down unused Dokku containers (and spin them up upon access)

Heroku spins down containers for free accounts when the app isn't accessed for a day. For our system, deployed on Dokku, we have production, staging, as well as developer containers running the same app. Today I noticed a Dokku app hang indefinitely mid-deploy on our dev VM. After investigating, I discovered that the issue was due to insufficient VM memory. After I killed a few containers, the container started successfully. For reference, there are almost 60 containers deployed on our dev box now, but only about 5 of them are being actively used. Often, our devs deploy multiple versions of the same app when testing. Sometimes these apps are no longer needed (in which case we can simply remove them), but more often than not, they'll need to be accessed again a week or two later.
To save resources on our VMs, we would like to spin down dev containers, especially since there are likely to be multiple instances of the same app.
Is this possible with Dokku? If I simply stop containers that haven't been accessed for a while (using the docker stop command), then the user accessing the app later will be greeted with a 404 page. What I would like to do instead is show the loading icon to the user until the container is spun up again.
With Dokku commands alone this is not possible at the moment. Maybe you can use ps:stop, and if you then find a 502 error on nginx, run a shell script that starts the application - but that will of course give the 502 error to the user the first time.
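As a hedged sketch of the stopping side, one option is a cron job on the Dokku host that stops everything except a hand-maintained keep list (KEEP is a hypothetical variable, and the dokku apps:list output format can differ between versions):

#!/bin/bash
# Hypothetical keep list of apps that must stay running
KEEP="production staging"
# Stop every other Dokku app (tail skips the apps:list header line)
for app in $(dokku apps:list | tail -n +2); do
  if ! echo "$KEEP" | grep -qw "$app"; then
    dokku ps:stop "$app"
  fi
done
# An app can later be brought back manually (or from a 502 handler) with:
# dokku ps:start <app>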

Docker swarm mode load balancing

I've set up a docker swarm mode cluster, with two managers and one worker. This is on Centos 7. They're on machines dkr1, dkr2, dkr3. dkr3 is the worker.
I was upgrading to v1.13 the other day, and wanted zero downtime. But it didn't work exactly as expected. I'm trying to work out the correct way to do it, since this is one of the main goals of having a cluster.
The swarm is in 'global' mode. That is, one replica per machine. My method for upgrading was to drain the node, stop the daemon, yum upgrade, start daemon. (Note that this wiped out my daemon config settings for ExecStart=...! Be careful if you upgrade.)
Our client/ESB hits dkr2, which does its load-balancing magic over the swarm. dkr2 is the leader; dkr1 is 'reachable'.
I brought down dkr3. No issues. Upgraded docker. Brought it back up. No downtime from bringing down the worker.
Brought down dkr1. No issue at first. Still working when I brought it down. Upgraded docker. Brought it back up.
But during startup, it 404'ed. Once up, it was OK.
Brought down dkr2. I didn't actually record what happened then, sorry.
Anyway, while my app was starting up on dkr1, it 404'ed, since the server hadn't started yet.
Any idea what I might be doing wrong? I would suppose I need a health check of some sort, because the container is obviously ok, but the server isn't responding yet. So that's when I get downtime.
You are correct -- you need to specify a healthcheck to run against your app inside the container in order to make sure it is ready. Your container will not receive traffic until this healthcheck has passed.
A simple curl to an endpoint should suffice. Use the HEALTHCHECK instruction in your Dockerfile to specify a healthcheck to perform.
An example of the healthcheck line in a Dockerfile to check if an endpoint returned 200 OK would be:
HEALTHCHECK CMD curl -f 'http://localhost:8443/somepath' || exit 1
If you can't modify your Dockerfile, then you can also specify your healthcheck manually at deployment time using the compose file healthcheck format.
If that's not possible either and you need to update a running service, you can do a service update and use a combination of the health flags to specify your healthcheck.
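For example, a hedged sketch of adding a healthcheck to an already-running service with docker service update (the service name, port and path are placeholders, and curl must exist inside the image):

docker service update \
  --health-cmd "curl -f http://localhost:8443/somepath || exit 1" \
  --health-interval 10s \
  --health-timeout 5s \
  --health-retries 3 \
  my_service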
