I have a Kubernetes cluster (1.22.3) with a Harbor installation (2.5.0, installed via Helm chart 1.9.0).
Harbor is configured to use the internal database, and everything worked fine.
Some time ago I removed Docker from all nodes and reconfigured Kubernetes to use containerd directly (based on https://kruyt.org/migrate-docker-containerd-kubernetes/).
All services work normally after that, but PostgreSQL for Harbor crashes periodically.
In the pod's log I can see the following:
2022-04-26 09:26:35.794 UTC [1] LOG: database system is ready to accept connections
2022-04-26 09:31:42.391 UTC [1] LOG: server process (PID 361) exited with exit code 141
2022-04-26 09:31:42.391 UTC [1] LOG: terminating any other active server processes
2022-04-26 09:31:42.391 UTC [374] WARNING: terminating connection because of crash of another server process
2022-04-26 09:31:42.391 UTC [374] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-04-26 09:31:42.391 UTC [374] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2022-04-26 09:31:42.391 UTC [364] WARNING: terminating connection because of crash of another server process
2022-04-26 09:31:42.391 UTC [364] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-04-26 09:31:42.391 UTC [364] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2022-04-26 09:31:42.391 UTC [245] WARNING: terminating connection because of crash of another server process
2022-04-26 09:31:42.391 UTC [245] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-04-26 09:31:42.391 UTC [245] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2022-04-26 09:31:42.391 UTC [157] WARNING: terminating connection because of crash of another server process
2022-04-26 09:31:42.391 UTC [157] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-04-26 09:31:42.391 UTC [157] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2022-04-26 09:31:42.391 UTC [22] WARNING: terminating connection because of crash of another server process
2022-04-26 09:31:42.391 UTC [22] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-04-26 09:31:42.391 UTC [22] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2022-04-26 09:31:42.391 UTC [123] WARNING: terminating connection because of crash of another server process
2022-04-26 09:31:42.391 UTC [123] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-04-26 09:31:42.391 UTC [123] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2022-04-26 09:31:42.391 UTC [244] WARNING: terminating connection because of crash of another server process
2022-04-26 09:31:42.391 UTC [244] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-04-26 09:31:42.391 UTC [244] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2022-04-26 09:31:42.392 UTC [243] WARNING: terminating connection because of crash of another server process
2022-04-26 09:31:42.392 UTC [243] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-04-26 09:31:42.392 UTC [243] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2022-04-26 09:31:42.392 UTC [246] WARNING: terminating connection because of crash of another server process
2022-04-26 09:31:42.392 UTC [246] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-04-26 09:31:42.392 UTC [246] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2022-04-26 09:31:42.432 UTC [69] WARNING: terminating connection because of crash of another server process
2022-04-26 09:31:42.432 UTC [69] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-04-26 09:31:42.432 UTC [69] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2022-04-26 09:31:43.031 UTC [375] FATAL: the database system is in recovery mode
2022-04-26 09:31:43.532 UTC [376] LOG: PID 243 in cancel request did not match any process
2022-04-26 09:31:46.992 UTC [1] LOG: all server processes terminated; reinitializing
2022-04-26 09:31:47.545 UTC [377] LOG: database system was interrupted; last known up at 2022-04-26 09:26:35 UTC
2022-04-26 09:31:47.545 UTC [378] LOG: PID 245 in cancel request did not match any process
2022-04-26 09:31:50.472 UTC [388] FATAL: the database system is in recovery mode
2022-04-26 09:31:50.505 UTC [398] FATAL: the database system is in recovery mode
2022-04-26 09:31:52.283 UTC [399] FATAL: the database system is in recovery mode
2022-04-26 09:31:56.528 UTC [400] LOG: PID 246 in cancel request did not match any process
2022-04-26 09:31:58.357 UTC [377] LOG: database system was not properly shut down; automatic recovery in progress
2022-04-26 09:31:59.367 UTC [377] LOG: redo starts at 0/63EFC050
2022-04-26 09:31:59.385 UTC [377] LOG: invalid record length at 0/63F6D038: wanted 24, got 0
2022-04-26 09:31:59.385 UTC [377] LOG: redo done at 0/63F6D000
2022-04-26 09:32:00.480 UTC [410] FATAL: the database system is in recovery mode
2022-04-26 09:32:00.511 UTC [420] FATAL: the database system is in recovery mode
2022-04-26 09:32:00.523 UTC [1] LOG: received smart shutdown request
2022-04-26 09:32:04.946 UTC [1] LOG: abnormal database system shutdown
2022-04-26 09:32:05.139 UTC [1] LOG: database system is shut down
In the pod's events I see messages about liveness/readiness probe failures.
There is no resource problem (no memory limit, no storage limit, CPU almost idle),
so I think there is some misconfiguration in containerd, because with Docker everything worked fine (see the probe-inspection sketch after the environment info).
env info:
k8s: 1.22.3
os: ubuntu 20.04
containerd: 1.5.5
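For reference, exit code 141 is 128 + 13, i.e. the PostgreSQL backend was killed by SIGPIPE. Since the pod events point at the probes, one way to start is to dump the probe configuration and compare it with the Docker-era behaviour (pod name and namespace are assumptions; adjust to your Helm release):

kubectl -n harbor get pod harbor-database-0 -o yaml | grep -A10 -E 'livenessProbe|readinessProbe'

If the probes run an exec command, raising their timeoutSeconds or periodSeconds is a cheap way to test whether they are the trigger.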
Related
I am new to Crunchy Postgres, and recently I installed a Crunchy PostgresCluster in an OpenShift environment. After the cluster started, I had a look at the container log messages.
I also checked the script startup.sh, which is called during PostgreSQL startup. This shell script has some lines (beginning with echo_info) used for log messages, for example:
echo_info "Starting PostgreSQL.."
But I could not see this message in the logs.
NAME READY STATUS RESTARTS AGE ROLE
demo-instance1-4vtv-0 5/5 Running 0 7h36m replica
demo-instance1-dg7j-0 5/5 Running 0 7h36m replica
demo-instance1-f696-0 5/5 Running 0 7h36m master
:~$ oc logs -f demo-instance1-f696-0 -c database | more
2022-07-08 07:42:31,064 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-07-08 07:42:31,068 INFO: Lock owner: None; I am demo-instance1-f696-0
2022-07-08 07:42:31,383 INFO: trying to bootstrap a new cluster
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf-8".
The default text search configuration will be set to "english".
Data page checksums are enabled.
fixing permissions on existing directory /pgdata/pg14 ... ok
creating directory /pgdata/pg14_wal ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
Success. You can now start the database server using:
/usr/pgsql-14/bin/pg_ctl -D /pgdata/pg14 -l logfile start
2022-07-08 07:42:35.953 UTC [92] LOG: pgaudit extension initialized
2022-07-08 07:42:35,955 INFO: postmaster pid=92
/tmp/postgres:5432 - no response
2022-07-08 07:42:35.998 UTC [92] LOG: redirecting log output to logging collector process
2022-07-08 07:42:35.998 UTC [92] HINT: Future log output will appear in directory "log".
/tmp/postgres:5432 - accepting connections
/tmp/postgres:5432 - accepting connections
2022-07-08 07:42:37,038 INFO: establishing a new patroni connection to the postgres cluster
2022-07-08 07:42:37,334 INFO: running post_bootstrap
2022-07-08 07:42:37,754 INFO: initialized a new cluster
2022-07-08 07:42:38,039 INFO: no action. I am (demo-instance1-f696-0), the leader with the lock
2022-07-08 07:42:48,504 INFO: no action. I am (demo-instance1-f696-0), the leader with the lock
2022-07-08 07:42:58,476 INFO: no action. I am (demo-instance1-f696-0), the leader with the lock
2022-07-08 07:43:08,497 INFO: no action. I am (demo-instance1-f696-0), the leader with the lock
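A side note on the missing echo_info lines: the log above is written by Patroni, which manages the instance, so startup.sh's messages may simply never run; and PostgreSQL itself reports "redirecting log output to logging collector process", so the server's own messages land in files rather than on stdout. A sketch to list those files, with the data directory taken from the initdb output above (pod and container names as in the listing; the log path is an assumption based on the HINT):

oc exec demo-instance1-f696-0 -c database -- ls -l /pgdata/pg14/log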
I am running GitLab with Docker, but it always exits after a period of time:
==> /var/log/gitlab/redis/current <==
2019-06-21_18:00:08.72435 459:signal-handler (1561140008) Received SIGTERM scheduling shutdown...
2019-06-21_18:00:08.81864 459:M 21 Jun 18:00:08.817 # User requested shutdown...
2019-06-21_18:00:08.81866 459:M 21 Jun 18:00:08.817 * Saving the final RDB snapshot before exiting.
2019-06-21_18:00:08.83736 459:M 21 Jun 18:00:08.837 * DB saved on disk
2019-06-21_18:00:08.83741 459:M 21 Jun 18:00:08.837 * Removing the pid file.
2019-06-21_18:00:08.83817 459:M 21 Jun 18:00:08.838 * Removing the unix socket file.
2019-06-21_18:00:08.83935 459:M 21 Jun 18:00:08.839 # Redis is now ready to exit, bye bye...
ok: down: redis-exporter: 0s, normally up
==> /var/log/gitlab/gitlab-rails/sidekiq.log <==
2019-06-21_18:00:09.57615 2019-06-21T18:00:09.576Z 807 TID-oviw2sgmf INFO: Shutting down
2019-06-21_18:00:09.57625 2019-06-21T18:00:09.576Z 807 TID-ovivo05i7 INFO: Scheduler exiting...
2019-06-21_18:00:09.57655 2019-06-21T18:00:09.576Z 807 TID-oviw2sgmf INFO: Terminating quiet workers
This was reported in gitlab-org/omnibus-gitlab issue 4137: "runsv send SIGTERM to redis in docker version"
runsv sends SIGTERM to redis every 60 secs
gitlab-org/omnibus-gitlab issue 1611 suggests a docker restart first.
But the general issue is not conclusively resolved yet.
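A minimal sketch of the suggested workaround, assuming the container is named gitlab (adjust to your docker run/compose setup):

docker restart gitlab
docker exec gitlab gitlab-ctl status

The second command shows whether runit brings all services back up after the restart.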
I have installed neo4j on Debian 8.1 thanks to these instructions : http://debian.neo4j.org/
Now if, as root, I start neo4j with neo4j-service like this
service neo4j-service start
Sometimes it works correctly, but most of the time the neo4j-service will time out. The interesting fact is that Neo4j has indeed started; I can go to the browser and make some queries. But neo4j-service tells me that it failed:
root@ns***:~# service neo4j-service start
Job for neo4j-service.service failed. See 'systemctl status neo4j-service.service' and 'journalctl -xn' for details.
root@ns***:~# systemctl status neo4j-service.service
● neo4j-service.service - LSB: Neo4j Graph Database server
Loaded: loaded (/etc/init.d/neo4j-service)
Active: failed (Result: timeout) since Fri 2015-10-16 19:03:08 CEST; 6min ago
Process: 24556 ExecStop=/etc/init.d/neo4j-service stop (code=exited, status=0/SUCCESS)
Process: 29730 ExecStart=/etc/init.d/neo4j-service start (code=killed, signal=TERM)
Oct 16 18:58:08 ns***.ip-91-***-***.eu neo4j-service[29730]: WARNING: Max 1024 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Oct 16 18:58:08 ns***.ip-91-***-***.eu neo4j-service[29730]: Starting Neo4j Server...WARNING: not changing user
Oct 16 19:03:08 ns***.ip-91-***-***.eu systemd[1]: neo4j-service.service start operation timed out. Terminating.
Oct 16 19:03:08 ns***.ip-91-***-***.eu systemd[1]: Failed to start LSB: Neo4j Graph Database server.
Oct 16 19:03:08 ns***.ip-91-***-***.eu systemd[1]: Unit neo4j-service.service entered failed state.
And sometimes it tells me that the service started correctly, but then it will not manage to stop it.
Most of the time, I have to kill the process myself to "reset everything" correctly.
Do you know why this is happening?
Are you aware of any issues with neo4j-service on Debian 8.1?
This approach to running Neo4j is deprecated; you should use the neo4j command instead.
Or you can write your own service wrapper; for that I suggest using http://supervisord.org/
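A minimal supervisord program section for such a wrapper, assuming the Debian package's /usr/bin/neo4j binary and a neo4j service user (both paths and the user are assumptions; adjust to your install):

[program:neo4j]
command=/usr/bin/neo4j console
user=neo4j
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile=/var/log/supervisor/neo4j.log

neo4j console keeps the server in the foreground, which is exactly what supervisord expects from a managed process.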
I have an app setup with Rails 3.1, Mongo 1.4.0, Mongoid 2.2.4.
What I am experiencing is this:
Mongo::ConnectionFailure: Failed to connect to a master node at localhost:27017
I've had this problem before, but it went away on a computer restart... this time it did not.
I don't understand; I didn't do anything. I just put my computer in sleep mode, went home, woke it up, and there it was.
Here is the output of sudo mongod:
Fri Nov 25 21:47:14 [initandlisten] MongoDB starting : pid=1963 port=27017 dbpath=/data/db/ 64-bit host=xxx.local
Fri Nov 25 21:47:14 [initandlisten] db version v2.0.0, pdfile version 4.5
Fri Nov 25 21:47:14 [initandlisten] git version: 695c67dff0ffc361b8568a13366f027caa406222
Fri Nov 25 21:47:14 [initandlisten] build info: Darwin erh2.10gen.cc 9.6.0 Darwin Kernel Version 9.6.0: Mon Nov 24 17:37:00 PST 2008; root:xnu-1228.9.59~1/RELEASE_I386 i386 BOOST_LIB_VERSION=1_40
Fri Nov 25 21:47:14 [initandlisten] options: {}
Fri Nov 25 21:47:14 [initandlisten] journal dir=/data/db/journal
Fri Nov 25 21:47:14 [initandlisten] recover : no journal files present, no recovery needed
Fri Nov 25 21:47:15 [websvr] admin web console waiting for connections on port 28017
Fri Nov 25 21:47:15 [initandlisten] waiting for connections on port 27017
And I am able to connect with mongo in the terminal.
After two hours of Googling, I hope the SO community is able to figure this out.
Please, if you need more information about my app setup, just ask.
Thanks!
What you see is that the connection times out... that happens either after a long period of inactivity, or if you put your computer to sleep.
You can change/increase the timeout value, but that way you can't prevent the connection from eventually timing out.
Some MongoDB drivers allow setting :timeout => false, but Mongoid seems to still have problems with that
(see the last three links in the list below).
Hope this helps.
See also:
Mongodb server goes down, how to prevent Rails app from timing out?
MongoDB: What is connection pooling and timeout?
https://github.com/mongodb/mongo-ruby-driver
How can I query mongodb using mongoid/rails without timing out?
http://groups.google.com/group/mongoid/browse_thread/thread/b5c94e7047b42f8a
https://github.com/mongoid/mongoid/issues/455
Try to change localhost to 127.0.0.1!
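If that helps, you can persist it in config/mongoid.yml (a sketch in the flat Mongoid 2.x format; key names may differ across versions, and the database name is a placeholder):

development:
  host: 127.0.0.1
  port: 27017
  database: myapp_development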
I have a Mongrel server running behind Apache. It works fine; however, every now and then the Apache server shuts down, seemingly by itself. I'm not sure if there is a configuration issue or if it's an attack. Here is the Apache error log:
[Thu Apr 30 02:15:07 2009] [notice] SIGHUP received. Attempting to restart
[Thu Apr 30 02:15:07 2009] [warn] NameVirtualHost *:0 has no VirtualHosts
[Thu Apr 30 02:15:07 2009] [notice] Apache/2.2.3 (Debian) PHP/5.2.0-8+etch13 configured -- resuming normal operations
[Thu Apr 30 02:17:13 2009] [error] [client 61.139.105.163] File does not exist: /var/www/fastenv
[Thu Apr 30 02:24:06 2009] [error] [client 61.139.105.163] File does not exist: /var/www/fastenv
[Thu Apr 30 10:49:18 2009] [warn] pid file /var/run/apache2.pid overwritten -- Unclean shutdown of previous Apache run?
[Thu Apr 30 10:49:18 2009] [notice] Apache/2.2.3 (Debian) PHP/5.2.0-8+etch13 configured -- resuming normal operations
[Thu Apr 30 12:53:08 2009] [notice] SIGHUP received. Attempting to restart
[Thu Apr 30 12:53:08 2009] [warn] NameVirtualHost *:0 has no VirtualHosts
[Thu Apr 30 12:53:08 2009] [notice] Apache/2.2.3 (Debian) PHP/5.2.0-8+etch13 configured -- resuming normal operations
[Thu Apr 30 12:59:15 2009] [notice] SIGHUP received. Attempting to restart
[Thu Apr 30 12:59:15 2009] [warn] NameVirtualHost *:0 has no VirtualHosts
[Thu Apr 30 12:59:15 2009] [notice] Apache/2.2.3 (Debian) PHP/5.2.0-8+etch13 configured -- resuming normal operations
[Thu Apr 30 13:58:49 2009] [notice] SIGHUP received. Attempting to restart
[Thu Apr 30 13:58:49 2009] [warn] NameVirtualHost *:0 has no VirtualHosts
[Thu Apr 30 13:58:49 2009] [notice] Apache/2.2.3 (Debian) PHP/5.2.0-8+etch13 configured -- resuming normal operations
[Fri May 01 10:59:07 2009] [warn] pid file /var/run/apache2.pid overwritten -- Unclean shutdown of previous Apache run?
[Fri May 01 10:59:07 2009] [notice] Apache/2.2.3 (Debian) PHP/5.2.0-8+etch13 configured -- resuming normal operations
[Fri May 01 17:51:15 2009] [warn] pid file /var/run/apache2.pid overwritten -- Unclean shutdown of previous Apache run?
[Fri May 01 17:51:15 2009] [notice] Apache/2.2.3 (Debian) PHP/5.2.0-8+etch13 configured -- resuming normal operations
I am not quite sure what /var/www/fastenv is, but I don't think anything in my application calls it. Also, the website is still in beta with few users, and I don't think any of them has the IP address 61.139.105.163, though it's possible.
Any ideas? It would be good if you could give me hints on where to look or how to go about analysing this problem.
I have the exact same log entries from the same IP. Looking it up shows it belongs to the Chinese government. It appears to be a scan using server-side includes to find out as much as they can about your server. I banned the IP.
Not sure this is entirely programming-related, but anyway... none of those look like serious errors to me. The accesses to /var/www/fastenv just mean that the computer at IP address 61.139.105.163 sent a request for http://www.example.com/fastenv or something like that (it depends on exactly how you've configured your virtual hosts); I'd look at the access log for more information, to see what other requests have been coming from that IP address. It's probably not anything to worry about.
The line about NameVirtualHost *:0 means that somewhere in your configuration file you have an incorrect NameVirtualHost directive, maybe with no arguments. You should probably look for that and remove it, but if the server is running fine anyway, it's not a big deal.
The reason your server is terminating (restarting, actually) appears to be a SIGHUP - that is, something on the system is sending Apache a signal telling it to restart. It's basically the same thing that happens if you run apache2 restart, I think. Without knowing what's sending that signal, there's not more I can say.
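One thing worth ruling out: on Debian, logrotate's postrotate step commonly reloads Apache with a SIGHUP. A quick check of the packaged rotation config (the path is the usual Debian default):

grep -A6 postrotate /etc/logrotate.d/apache2

If the restart times lined up with rotation, that would explain the periodic SIGHUPs, though the timestamps above do not look strictly daily.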
61.139.105.163 is known for doing all kinds of hacking-type things; just google the IP address. You should definitely ban this IP address.
In the XAMPP control panel, click on Apache Config --> Apache (httpd.conf).
Search for #Listen 12.34.56.78:80 and replace it with #Listen 12.34.56.78:8081.
Search for Listen 80 and replace it with Listen 8081.
Now you can start Apache and run it with this URL: localhost:8081/xampp/
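After the edit, the relevant httpd.conf lines should look like this (the first one stays commented out; it is only an example entry in the stock file):

#Listen 12.34.56.78:8081
Listen 8081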