I want to use the uwsgi emperor to run multiple services. However, there is some interdependency between services, such that e.g. service A requires service B to be up and running before A is brought up. Is there any way to encode a startup ordering in the uwsgi emperor? I tried naming the vassal files so that they sort lexicographically in the startup order I wanted, but this doesn't appear to have the desired effect.
I think the --wait-for-socket option could be the best approach. The instance using it will not start until the socket of the other one is ready to accept requests.
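A minimal sketch of what the two vassal configs could look like (the file names, socket paths, and module names here are assumptions, not taken from your setup):

    ; vassals/b.ini -- the service that must be up first
    [uwsgi]
    module = service_b
    socket = /tmp/b.sock

    ; vassals/a.ini -- held back until b's socket accepts connections
    [uwsgi]
    module = service_a
    socket = /tmp/a.sock
    wait-for-socket = /tmp/b.sock

With this, the lexicographic order in which the emperor scans the vassal directory stops mattering: a simply waits on b's socket before it starts serving.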
I have a docker-compose setup consisting of four containers, each of which performs a single function:
An nginx proxy that forwards UI and API requests to the corresponding containers (node container, flask container), as depicted in the image below.
There is also a separate container which executes long-running Python scripts and works independently of the other containers. I'd now like to create the ability to execute scripts in the "long running scripts" (LRS) container via the API.
What is the best way to do this?
I've seen a few other questions that are somewhat similar to this, but raise more questions than they answer. Amongst the suggestions I've seen are:
Pass docker.sock into the API container; from the API container, exec into LRS and execute the intended script
Doesn't this create serious security vulnerabilities?
Doesn't this require that docker be installed on the API container in order to exec, violating the separation of concerns principle of docker?
HTTP listener on the LRS container, listening for commands from the API in order to execute the script on LRS
Again, doesn't this violate separation of concerns, since I'll now essentially need a lightweight API in the LRS container to listen for actions from the principal API?
None of these solutions seems ideal. Am I missing something? How do I achieve the intended functionality?
Generally, the solution for running long-running scripts has been a pub-sub model. Your API drops a message onto an execution message queue. The worker instance subscribes to that queue and, when messages appear, executes your long-running script/query/etc. When the execution is complete, either a message goes back onto a different queue, or the results are placed in a predetermined location (URL). A minimal sketch follows the list below.
This has a few advantages:
The two solutions are effectively isolated from each other
You can scale out the LRS (worker) solution if you need more capacity by adding additional workers
If the LRS instance goes down, the API does not depend on it being up; work is simply queued until an instance becomes available.
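As a rough illustration of the pattern (I'm assuming Celery over RabbitMQ as the queue layer; any broker/worker library works the same way, and the service name, script path, and task name below are hypothetical):

    # lrs/tasks.py -- lives in the LRS container
    import subprocess
    from celery import Celery

    # 'rabbitmq' is assumed to be the broker's service name in docker-compose
    app = Celery('lrs', broker='amqp://guest:guest@rabbitmq:5672//')

    @app.task
    def run_script(script_name):
        # Run the long-running script and report its exit code back
        return subprocess.call(['python', '/scripts/{}.py'.format(script_name)])

The API container only needs the same broker URL to queue work; it never execs into, or even knows about, the LRS container:

    # inside the flask/API container
    from lrs.tasks import run_script
    run_script.delay('nightly_report')  # returns immediately; a worker in LRS picks it up

If you'd rather not share the task module between the two images, Celery's send_task lets the API enqueue by task name alone.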
I'm running Celery, RabbitMQ, and Gunicorn in Docker.
My question is this: I understand that Celery is designed for distributed processing. What I have seen no docs on at all is: assuming that I have several machines/nodes on the same LAN, how do they discover each other? Does RabbitMQ play a role? Do Celery instances somehow discover each other? Is there a list of suitable hosts somewhere? If so, how do I edit it?
Also, assuming I'm going to use only one node to handle the HTTP requests, do I still need to have gunicorn running on all nodes? I ask this because the gunicorn start command has a setting for the number of workers. Is this setting applicable only to that node, or is it a maximum total for all connected nodes?
EDIT:
After the first answer, I started working on this. It seems that I need some sort of networking setup (swarm, bridge networking, etc.). I should clarify that I'm using docker-compose to bring up the solution, and I see that a normal swarm setup doesn't work, so I have to use something slightly different if I go that route.
To be clear: I need a way in which I can add celery workers on separate hosts and have them be able to communicate with the "main" host so that I can increase the capacity of the system. If someone could provide a clear process for achieving this or a link to such, it'd be most helpful.
I hope I've expressed this clearly, please let me know if you need any further info.
Thanks!
I feel like #ffledgling didn't fully answer the question, so I am adding a note:
Here is a list of all events sent by the worker to the broker (in your case RabbitMQ): http://docs.celeryproject.org/en/latest/userguide/monitoring.html#event-reference
As you can see, there are a few events related to the worker itself:
worker-online
worker-heartbeat
worker-offline
All of them carry the worker's hostname as a signature. Therefore a successful handshake flow (not exactly a handshake, because the master doesn't respond with a message, but the metaphor works) may look like this:
new worker comes online --> worker sends a worker-online message to the queue --> master receives it and starts reading logs from the worker host --> master schedules tasks --> ...
Beyond that, hostname is a standard body field in every event (both task- and worker-related); here is the documentation: http://docs.celeryproject.org/en/latest/internals/protocol.html?highlight=event%20reference#standard-body-fields
For example, if you look at the task-started event, it also contains the hostname as a signature; this is how the master knows who picked up the task and where to read that task's log from.
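For example, a small monitor built on Celery's event receiver (the broker URL is an assumption) will show those hostnames as they arrive:

    # monitor.py -- prints the hostname carried by worker and task events
    from celery import Celery

    app = Celery(broker='amqp://guest:guest@rabbitmq:5672//')

    def show(event):
        print(event['type'], 'from', event.get('hostname'))

    with app.connection() as connection:
        recv = app.events.Receiver(connection, handlers={
            'worker-online': show,
            'worker-heartbeat': show,
            'task-started': show,  # task events require workers started with -E
        })
        recv.capture(limit=None, timeout=None, wakeup=True)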
I understand that Celery is designed for distributed processing. What I have seen no docs on at all is: assuming that I have several machines/nodes on the same LAN, how do they discover each other? Does RabbitMQ play a role? Do Celery instances somehow discover each other? Is there a list of suitable hosts somewhere? If so, how do I edit it?
Celery is a distributed task queue that works using a message brokering system such as RabbitMQ.
What essentially happens is that all Celery workers connect to a shared queue such as RabbitMQ. The master(s) dispatch work by pushing it onto the queue. Workers, which are also connected to the queue, pull work off it and attempt to execute it. Once a task is finished (successfully or otherwise), the worker pushes the result back onto the queue, which the master(s) can then query.
Given this architecture, you do not need to maintain a list of hosts; workers "auto-detect" work. You simply need to start them up and ensure they can talk to the queue.
A slightly more detailed explanation from another SO answer.
Link to the architecture with a diagram.
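A sketch of how little the nodes actually share (the broker hostname is an assumption; it only has to be reachable from every machine on the LAN):

    # proj/celery_app.py -- identical on every node
    from celery import Celery

    app = Celery('proj',
                 broker='amqp://guest:guest@rabbitmq:5672//',
                 backend='rpc://')

    @app.task
    def add(x, y):
        return x + y

    # On each additional machine, just start a worker against the same module:
    #   celery -A proj.celery_app worker --loglevel=info
    # No host list is configured anywhere; capacity grows with each worker started.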
Also, assuming I'm going to use only one node to handle the HTTP requests, do I still need to have gunicorn running on all nodes? I ask this because the gunicorn start command has a setting for the number of workers. Is this setting applicable only to that node, or is it a maximum total for all connected nodes?
No, you do not need gunicorn running on all the nodes, just the one you're using to serve HTTP requests via Python. Celery workers do not need gunicorn. The workers setting in gunicorn refers to the number of workers in the HTTP listener pool. It is separate, independent, and unrelated to the set of workers that Celery uses.
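For instance, with a gunicorn config file (the file name and numbers are just placeholders), the workers value is purely local to the HTTP-serving node:

    # gunicorn.conf.py -- only present on the node that serves HTTP
    bind = '0.0.0.0:8000'
    workers = 4  # HTTP worker processes on this node only; unrelated to how
                 # many Celery workers exist elsewhere on the LAN

    # started with, e.g.:  gunicorn -c gunicorn.conf.py myproject.wsgi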
We're planning to run an Express-based server on Node.js in "cluster mode" using Node.js' cluster support. So there will be 1 master process and 'n' (where 'n' is calculated based on the number of CPUs) child processes running on a single machine. We already have a testbed set up using Faye for pubsub in non-cluster mode and it works great.
Are there any additional considerations we need to be aware of when using Faye on top of a Node cluster? For example, since there will be 'n' HTTP server instances, will it be a problem creating a Faye NodeAdapter in each Node process and attaching it to the HTTP server instance in that process?
Thanks.
-brian
I just realized that the answer to my question is fairly obvious. One thing to be aware of is that Faye will need to access shared state across multiple server instances (processes). In a single-server config, you could probably get away with using Faye's memory engine. In a clustered config, you'd need to use Faye's redis engine or some other engine that allows state to be shared by different processes. I'd prefer not to introduce another persistence component just for this purpose so I may look into implementing my own on top of my current persistent store (Neo4j).
We run an application with a half dozen mongrels. A new feature that we've added is a scheduler (rufus-scheduler) that runs within a mongrel and provides cron-like background task processing. We want to run this scheduler in only one of our mongrels, but we can't figure out how, during startup (environment.rb), to identify the specific mongrel to start the scheduler in.
We have set up a YAML file with a setting for the port number of the mongrel in which we would like the scheduler to start. During startup, in environment.rb, we would like to read the YAML file, get the port, and then compare it to the instance that is booting -- if they match, launch the scheduler.
Someone answered recently that we should look at request.port -- but there is no request object when you are booting. Where else is the port number stored? Or, how can we pass a parameter to an individual mongrel, or have it compare itself to a setting in order to identify itself?
Thanks in advance...
Russell
I asked the same question a couple of weeks back.
Gist:
A plugin named 'Rooster' addresses this problem.
Use a shared resource like a file as a way to synchronize.
I am setting up an Apache2 webserver running multiple Ruby on Rails web applications with Phusion Passenger. I know that Passenger spawns Ruby processes for handling requests. I have the following questions:
If more than one request has to be handled at the same time, will Passenger spawn multiple processes or multiple (Ruby) threads? How do I configure it so it always spawns single-threaded processes?
If I have two Rails applications, imagine that a request for app A goes to process 1, and later a request for app B arrives. Is it possible that process 1 will handle this request as well? When and how is this possible? In other words, is one process allowed to handle requests for multiple Rails applications?
I have the same Rails application exported at multiple URLs and multiple virtual hosts (such as http:// and https://). Will the same process be able to serve different virtual hosts? (The answer to this seems to be yes: I set a global variable while answering a request to virtual host A, and I was able to retrieve the value in virtual host B.)
Generally speaking, Passenger spawns new processes by forking an ApplicationSpawner, which has the framework and application code pre-loaded into memory, or a FrameworkSpawner, which just has the framework code.
Passenger, as far as I know, doesn't deal in threads. Instead, as the load increases on an application, it will fork that Application's ApplicationSpawner and initialize another instance. When load decreases, one or more application instances are killed off.
If Passenger is configured in a certain way (I believe by choosing the "smart" spawn method), it will create a FrameworkSpawner, which loads the Rails code but no application code, and which can then be forked to load an application using that version of Rails.
So to answer your questions:
It will serve them sequentially, then spawn additional processes if it decides the load is high enough.
No. One process can only belong to a single Rails Application.
I'm kind of sketchy on this one, but your experiment makes sense. Passenger should be smart enough to figure out that even though it's running from different places in the server config, you're talking about the same application. It's probably based on the application's filesystem path.
EDIT: I went and read up on this a bit. Turns out I was mostly right, but the technical details were a bit off. See the Passenger documentation
Yup, Burke is right. In case of the third question, Phusion Passenger recognizes applications by their application root path. So even if you have two virtual hosts, if they both point to the same DocumentRoot then Phusion Passenger will think that they're the same app.