Monitoring that processes are running and active - monitoring

We have a number of RabbitMQ consumers running, and I want to make sure that every one of those processes is working.
What are the best industry-standard approaches for that?
We are considering using Prometheus for that. Is that the right direction?

You can go the Prometheus route. Use this link to get started; it should fairly quickly give you the ability to monitor RabbitMQ with Prometheus.
You can use AlertManager to set up alerts on failed processes, and top that with automation to make sure you have service continuity.
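As a rough illustration of what "alert on failed processes" can mean for consumers, here is a minimal sketch of a heartbeat check: each consumer records a timestamp when it finishes a unit of work, and a checker reports the ones that have gone quiet. The names (`stale_consumers`, `max_age_seconds`) and the 60-second threshold are assumptions for illustration only, not part of RabbitMQ or Prometheus. In a real setup the timestamps would be exported as a metric and the threshold would live in an alert rule.

```python
def stale_consumers(last_heartbeat, now, max_age_seconds=60.0):
    """Return the names of consumers whose last heartbeat is too old.

    last_heartbeat maps a (hypothetical) consumer name to the Unix
    timestamp of its most recent heartbeat.
    """
    return sorted(
        name
        for name, seen in last_heartbeat.items()
        if now - seen > max_age_seconds
    )


# Example: "billing" last reported 300 seconds ago, "email" 5 seconds ago.
heartbeats = {"billing": 1000.0, "email": 1295.0}
stale = stale_consumers(heartbeats, now=1300.0)
```

A check like this also catches a consumer that is still running but stuck, which a plain "is the process alive" check would miss.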


Cloud Run Don't Wait For Scale Up

I have a web application. The cold start time of the backend service is about 10 seconds, which is very high, and I was not able to reduce it. As a second solution, I am wondering whether the requests that cause the Cloud Run service to scale up can be handled by already running instances. Once the new containers are ready, new requests would be handled by them. Does Google Cloud support that?
There is a brand new feature for that: health probes. You can put one on your service to detect when an instance is ready to serve traffic, or when it is unhealthy so that no new requests are routed to it.
Give it a try, it should solve your issue!
As a second solution, I am wondering whether the requests that cause the Cloud Run service to scale up can be handled by already running instances.
I think what you really want is min-instances. This means you will always have an instance that is ready to serve requests.
Otherwise, I don't think there is any solution to the problem you have. If new requests come in, you are going to need to scale up either way, and there is no way around the 10 second cold start. So implement min-instances with a baseline that is appropriate for your traffic.
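To pick a baseline, one rough approach is to estimate how many instances your steady traffic keeps busy. The sketch below uses Little's law (requests in flight = arrival rate × time per request). The function name and the example numbers are assumptions for illustration; measure your own request rate, latency, and per-instance concurrency before settling on a value.

```python
import math


def estimate_min_instances(requests_per_second, avg_latency_seconds,
                           concurrency_per_instance):
    """Estimate how many instances the baseline traffic keeps busy.

    Little's law: requests in flight = arrival rate * time per request.
    """
    in_flight = requests_per_second * avg_latency_seconds
    # Never go below one warm instance, since the goal is to avoid cold starts.
    return max(1, math.ceil(in_flight / concurrency_per_instance))


# Hypothetical numbers: 40 req/s, 0.5 s per request,
# 10 concurrent requests per instance.
baseline = estimate_min_instances(40, 0.5, 10)
```

The result is only a starting point for the min-instances setting. Bursts above the baseline will still trigger cold starts.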

How does Celery discover new Nodes?

I'm running Celery, RabbitMQ and Gunicorn in Docker.
My question is this: I understand that Celery is designed for distributed processing. What I have seen no docs on at all is, assuming that I have several machines/nodes on the same LAN, how do they discover each other? Does RabbitMQ play a role? Do Celery instances somehow discover each other? Is there a list of suitable hosts somewhere? If so, how do I edit it?
Also, assuming I'm going to use only one node to handle the HTTP requests, do I still need to have gunicorn running on all nodes? I ask this because in the gunicorn start command, it has a setting for the number of workers. And, is this setting applicable only to that node, or as a max total for all connected nodes?
After the first answer, I started working on this. It seems that I need some sort of networking setup, either swarm or bridging etc. I should clarify that I'm using docker-compose to bring up the solution, and I see that a normal swarm setup doesn't work, and I have to use something slightly different if I go that route.
To be clear: I need a way in which I can add celery workers on separate hosts and have them be able to communicate with the "main" host so that I can increase the capacity of the system. If someone could provide a clear process for achieving this or a link to such, it'd be most helpful.
I hope I've expressed this clearly, please let me know if you need any further info.
I feel like @ffledgling didn't fully answer the question, so I am adding a note:
Here is a list of all events sent by the worker to the broker (in your case RabbitMq):
As you can see, there are few worker self-related messages/events:
All of them contain the hostname as a signature. Therefore a successful handshake flow may look like this (it is not exactly a handshake, because the master doesn't respond with a message, but I'm using it as a metaphor here):
new worker online --> worker sends worker-online message to the queue --> master receives it and starts to read logs from the worker host --> master schedules tasks --> ...
Beyond that, the hostname is a standard body field in every event (both task and worker self-related). Here is the documentation:
For example, if you look at the task-started event, it also contains the hostname as a signature. This is how the master knows who picked up the task and where to read the task's log from.
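To make the flow above concrete, here is a minimal sketch of how a consumer of those events could keep track of which workers are online, keyed by hostname. The event type names (`worker-online`, `worker-heartbeat`, `worker-offline`) are Celery's; the plain dicts, the `track_workers` function and the hostnames are assumptions for illustration, not Celery's own implementation.

```python
def track_workers(events):
    """Replay worker events and return the set of hostnames currently online."""
    online = set()
    for event in events:
        if event["type"] in ("worker-online", "worker-heartbeat"):
            online.add(event["hostname"])
        elif event["type"] == "worker-offline":
            online.discard(event["hostname"])
    return online


# Hypothetical event stream; the hostnames are made up.
events = [
    {"type": "worker-online", "hostname": "celery@host-a"},
    {"type": "worker-online", "hostname": "celery@host-b"},
    {"type": "worker-heartbeat", "hostname": "celery@host-a"},
    {"type": "worker-offline", "hostname": "celery@host-b"},
]
online = track_workers(events)
```

Nothing here requires a preconfigured host list: a worker becomes known the first time one of its events arrives.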
I understand that Celery is designed for distributed processing. What
I have seen no docs on at all is, assuming that I have several
machines/nodes on the same LAN, how do they discover each other? Does
RabbitMQ play a role? Do Celery instances somehow discover each other?
Is there a list of suitable hosts somewhere? If so, how do I edit it?
Celery is a distributed task queue that works using a message broker such as RabbitMQ.
What essentially happens is that all Celery workers connect to a shared queue such as RabbitMQ. The master(s) dispatch work by pushing it onto the queue. Workers, which are connected to the queue as well, pull work off the queue and then attempt to execute it. Once a worker has finished (successfully or otherwise), it pushes the results back onto the queue, which the master(s) can then query.
Given this architecture, you do not need to add a list of hosts; they "auto-detect" work. You simply need to start them up and ensure they can talk to the queue.
A slightly more detailed explanation from another SO answer.
Link to the architecture with a diagram.
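The idea can be sketched without Celery at all. In the toy example below, Python's standard `queue.Queue` stands in for the broker and threads stand in for worker nodes. This is an analogy under those assumptions, not how Celery is implemented. The point is that the "master" never holds a list of workers; each worker only needs to know where the queue is.

```python
import queue
import threading


def worker(name, tasks, results):
    """Pull tasks off the shared queue until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            break
        results.put((name, item * item))


tasks = queue.Queue()      # stands in for the broker (RabbitMQ in the question)
results = queue.Queue()    # stands in for the result backend

# "Adding a node" is just starting another worker that knows the queue.
workers = [
    threading.Thread(target=worker, args=(f"worker-{i}", tasks, results))
    for i in range(3)
]
for t in workers:
    t.start()

for n in range(5):         # the "master" only talks to the queue
    tasks.put(n)
for _ in workers:          # one sentinel per worker to shut them down
    tasks.put(None)
for t in workers:
    t.join()

squares = sorted(results.get()[1] for _ in range(5))
```

With real Celery the equivalent of "knowing the queue" is pointing every worker at the same broker URL, which is why the workers on other hosts only need network access to RabbitMQ.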
Also, assuming I'm going to use only one node to handle the HTTP
requests, do I still need to have gunicorn running on all nodes? I ask
this because in the gunicorn start command, it has a setting for the
number of workers. And, is this setting applicable only to that node,
or as a max total for all connected nodes?
No, you do not need gunicorn running on all the nodes, just the one you're using to serve HTTP requests via Python. Celery workers do not need gunicorn. The worker setting in gunicorn refers to the number of workers in the HTTP listener pool. This is separate from, independent of and unrelated to the set of workers that Celery uses.

Nagios: Make sure x out of y services are running

I'm introducing 24/7 monitoring for our systems. To avoid unnecessary pages in the middle of the night, I want Nagios NOT to page me if only one or two of the service checks fail. The other servers run the same service and the impact on users is almost zero, so fixing the problem can wait until the next day.
But: I want to get paged if too many of the checks fail.
For example: 50 servers run the same service, 2 fail -> I can still sleep.
The service fails on 15 servers -> I get paged because the impact is getting too high.
What I could do is add a lot (!) of notification dependencies that only trigger if numerous hosts are down. The problem: even though I can specify that I get paged if 15 hosts are down, I still have to define exactly which hosts need to be down for this alert to be sent. I would rather specify that a page is sent if ANY 15 hosts are down.
I'd be glad if somebody could help me with that.
Personally I'm using Shinken, which has business rules just for that. Shinken is backward compatible with Nagios, so it's easy to drop your Nagios configuration into Shinken.
There seems to be a similar addon for Nagios, the Nagios Business Process Intelligence Addon, but I have no experience with it.
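Whatever tool you use, the "any N out of M" rule boils down to counting failures and comparing against thresholds. Here is a minimal sketch of that logic. The function names, host names and thresholds are assumptions for illustration; only the exit code values (0 OK, 1 WARNING, 2 CRITICAL) follow the standard Nagios plugin convention.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def aggregate_status(host_states, warn_at, crit_at):
    """Map the number of failed hosts to a single Nagios status.

    host_states maps a host name to True (service up) or False (down).
    """
    failed = sum(1 for up in host_states.values() if not up)
    if failed >= crit_at:
        return CRITICAL
    if failed >= warn_at:
        return WARNING
    return OK


def fleet(down, total=50):
    """Build a hypothetical fleet where the first `down` hosts are failing."""
    return {f"web{i:02d}": i >= down for i in range(total)}


two_down = aggregate_status(fleet(2), warn_at=1, crit_at=15)
fifteen_down = aggregate_status(fleet(15), warn_at=1, crit_at=15)
```

If paging is tied only to CRITICAL on this aggregate check, two failures show up as a warning to look at the next day, while fifteen failures on any hosts trigger the page.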

Monitoring distributed nodes

We have a master Erlang node that has an application with a supervisor and multiple, dynamically added worker processes. For each worker process, another Erlang node is dynamically started. We would like to monitor all nodes on one screen and detect failures so that corrective action can be taken.
Is there a utility that can let us do this?
I think almost every distributed Erlang application has similar node management requirements. The pman and appmon webtools are, I think, too basic for this.
I have read the RabbitMQ source code; it includes a management website that seems suitable for your requirements.
In addition, I have started reading the Riak source code. Its node management code seems better than RabbitMQ's, and it is also suitable for your requirements.
I think you could read both of them and build a new tool for your application based on them.

How to monitor a remote erlang node which was down and is restarting

My application runs in an Erlang cluster, usually with two or more nodes. There's active monitoring between the nodes (using erlang:monitor_node) which works fine: I can detect and react to the fact that a node that was up is now down.
But how do I then find out that the node has restarted and is back in business? I can of course periodically ping the node until it is back up, but is there a better way that I've simply missed? Are process groups a better way of achieving this?
(Edited to add)
I think the answer to perform a technique like election of a supervisor is the thought process I was missing. I'll look into that and mark this question as done....
But how do I then find out that the node has restarted and is back in business? I can of course periodically ping the node until it is back up, but is there a better way that I've simply missed? Are process groups a better way of achieving this?
Just an idea, but how about having the restarting node itself explicitly inform the supervisor/monitoring node that it has finished restarting and that it is available again?
You could use a recurring "heartbeat message" for this purpose, or come up with a custom message specifically meant to be sent once after successful initialization. Something along the lines of:
start(SupervisorPID) ->
    %% Announce this process to the supervisor once initialization is done.
    SupervisorPID ! {hello, self()}.
You could create a global_group and then use global_group:monitor_nodes(true) to monitor the other nodes within the same global group. The process that is monitoring the nodes will get nodeup and nodedown messages.
