Cowboy Message Queue Design - erlang

I have a web service written in Cowboy, and I am planning to use RabbitMQ as the DB layer. My Cowboy service will be one of the producers that write to the queue, and a consumer will write to the database. A couple more asynchronous tasks will come from another service (not Cowboy).
Now the question is where these consumers should go. Should they be part of a single Erlang app, or should I create a separate Erlang app for all the consumers?
Any advice would be highly appreciated.

Since Erlang is not the exclusive producer, and since one can usually imagine consumers running without knowledge of the producers, having separate applications is not a bad idea at all. You can have multiple top-level applications in a single Erlang release (that's what the dependencies are, really), so you can always put all the code in the same repository (I usually have a top-level apps/ directory for these) and split them out into separate repos later if needed.
Having them as separate applications certainly makes it easier to decide later to distribute the system across multiple Erlang nodes: just start the relevant producer applications on some nodes and the consumer applications on others.
So while either way will probably work, separate apps make for a cleaner design and keep the door open for future expansion in a slightly nicer way.
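As a minimal sketch of that layout with rebar3 (the app names and version here are hypothetical):

%% rebar.config at the repository root; the top-level apps live in apps/
{project_app_dirs, ["apps/*"]}.

{relx, [
    {release, {my_service, "0.1.0"},
     [cowboy_producer,     %% the Cowboy web service (producer)
      rabbit_consumer]}    %% the RabbitMQ consumer that writes to the DB
]}.

Splitting one of these out later is then mostly a matter of moving its apps/<name> directory into its own repo and pulling it back in as a dependency.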

Related

Durable tasks sub-orchestration with microservices

I'm attempting to use Azure Durable Tasks to orchestrate some microservices, but I'm running into a small gap in understanding how task hubs work, as well as how to coordinate the projects correctly.
I'm trying to create a main orchestrator that is in charge of kicking off sub-orchestrations that do the actual work. Below is a diagram of what I'm trying to achieve.
The idea is that each .NET project will be able to scale independently of the others, so if .NET project 2 were under quite a bit of load I'd be able to scale that project only and not have to worry about the other two. The problem I'm running into is that, from what I understand, the task hub queue is shared by all the services, so there is no way to have each process focus only on its own work: each project can see everything in the queue, and one project may dequeue a message intended for project 2. Is this correct?
From reading the documentation it isn't clear how I can send project 2 its sub-orchestration messages while also sending project 3 its specific orchestration.
Am I thinking about this problem incorrectly? Is there a different way I might want to approach this?
What you want cannot be achieved.
As of now, Azure Functions only allows orchestrator functions to call activity and sub-orchestrator functions that exist in the same function app. The main reason is a technical one: queues within a task hub are shared across all functions, so there's no way to guarantee that a message intended for FunctionAppA does not get picked up by FunctionAppB.
If cross-project communication is required, the correct method is to use HTTP or queues.

Extract, transform, load within Rabbit?

One of the things I do pretty often is transform SQL data into cache and document-based stores, for performance reasons. I don't want my frontend applications hitting my database, so I have high-speed cache solutions, as well as efficient Solr and other solutions.
I use RabbitMQ as the central communication hub to achieve this ETL flow, which looks like this: the backend application sends a message to Rabbit with the new data, or with changes made to existing data. I then have a Node.js script which consumes the queue, makes small batches of data, and populates all the necessary systems: Redis, Mongo, Solr, etc.
However, I'm wondering if there's a better way of doing this. Maybe Rabbit has some kind of scripting support for creating Erlang logic for queues?
It doesn't; it's just a message queueing system.
Personally, I think your current design sounds good.
The only thing I would wonder is whether each of your target systems has a queue of its own, so that any one of them can go down without affecting the others.
I would probably do something like this:
the back-end produces a data message and sends it through RMQ
RMQ is configured with a fanout exchange and has one bound queue per target system
each system receives the message in its own queue
Otherwise, what you have sounds about right to me!
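To make the fanout topology concrete, here is a minimal sketch using the RabbitMQ Erlang client (amqp_client); the exchange and queue names are hypothetical, and error handling is omitted:

-include_lib("amqp_client/include/amqp_client.hrl").

setup_fanout() ->
    {ok, Conn} = amqp_connection:start(#amqp_params_network{}),
    {ok, Ch} = amqp_connection:open_channel(Conn),
    %% One fanout exchange for all ETL messages...
    amqp_channel:call(Ch, #'exchange.declare'{exchange = <<"etl">>,
                                              type = <<"fanout">>}),
    %% ...and one bound queue per target system, so that a slow or
    %% dead consumer never affects the others.
    Targets = [<<"etl.redis">>, <<"etl.mongo">>, <<"etl.solr">>],
    [begin
         amqp_channel:call(Ch, #'queue.declare'{queue = Q, durable = true}),
         amqp_channel:call(Ch, #'queue.bind'{queue = Q, exchange = <<"etl">>})
     end || Q <- Targets],
    {Conn, Ch}.

The producer then publishes each message once to the "etl" exchange, and RabbitMQ copies it into every bound queue.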

Erlang OTP based application - architecture ideas

I'm trying to write an Erlang application (OTP) that would parse a list of users and then launch workers that will work 24x7 to collect user data (using three different APIs) from remote servers and store it in ETS.
What would be the ideal architecture for this kind of application? Do I launch a bunch of workers, one for each user (assuming a small number of users)? What will happen if the number of users increases very rapidly?
Also, to call the different APIs I need to set up a timer mechanism in the worker process.
Any hint will be really appreciated.
Spawning a new process for each user is not such a bad idea. There are HTTP servers that do this for each connection, and they are doing quite fine.
First of all, the cost of creating a new process is minimal, and the cost of maintaining processes is even smaller. If one of them has nothing to do, it won't do anything; there is (almost) no runtime overhead from inactive processes, which in the end means that you are doing only the work you have to do (this is in fact the source of the reactivity of Erlang systems).
One issue might be memory usage. Each process has its own stack and heap, so in a use case where the processes do not actually need to store any internal data, you might be allocating some unnecessary memory. But this too can be tuned (even at runtime), and in most cases such memory will be garbage collected anyway.
Actually, I would not worry about such things too soon. The issues you might encounter depend on many things, mostly the amount of outside data and user activity, and you cannot really design for them up front. Most probably you won't encounter any of them for quite some time, and there's no need for premature optimization, especially if it would bind you to a design that slows down the rest of your development. In Erlang, with processes being the main source of abstraction, you can easily swap this process-per-user approach for a pool of workers, and ETS for an external service. But only if you really need to.
What's most important is the fact that representing a "user" as a process is closest to the problem domain. "Users" are independent entities, and they deserve separate processes (they have their own state, and they can act or react independently of each other). It is quite similar to using objects and classes in other languages (this is an over-simplification, but it should get you going).
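As a minimal sketch of that idea, here is what a per-user worker might look like, including the timer mechanism the question asks about (the module name, poll interval, and fetch_and_store/1 are hypothetical stand-ins):

-module(user_worker).
-behaviour(gen_server).
-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(POLL_INTERVAL, 60000).  %% hypothetical: poll the APIs once a minute

start_link(UserId) ->
    gen_server:start_link(?MODULE, UserId, []).

init(UserId) ->
    %% One lightweight process per user; schedule the first poll.
    erlang:send_after(?POLL_INTERVAL, self(), poll),
    {ok, #{user => UserId}}.

handle_info(poll, #{user := UserId} = State) ->
    %% fetch_and_store/1 stands in for the three API calls plus the
    %% ets:insert/2 that the question describes.
    ok = fetch_and_store(UserId),
    erlang:send_after(?POLL_INTERVAL, self(), poll),
    {noreply, State}.

handle_call(_Request, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.

fetch_and_store(_UserId) -> ok.  %% stub for illustration

Each worker keeps its own state and wakes up only when its timer fires, which is why thousands of them cost very little.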
If you were writing this in Python or C++ would you worry about how many objects you were creating? Only in extreme cases. In Erlang the same general rule applies for processes. Don't worry about how many you are creating.
As for architecture, the only element that is an architectural issue in your question is whether you should design a fixed worker pool or a 1-for-1 worker pool. The shape of the supervision tree would be an outcome of whichever way you choose.
If you are scraping data, your real bottleneck isn't going to be how many processes you have; it will be how many network requests per second you can make against each API you are trying to access. You will almost certainly get throttled.
(A few months ago I wrote a test demonstration of a very similar system to the one you are describing. The limiting factor was the API request limits of providers like Facebook, YouTube, Google+, and Yahoo, not the number of processes.)
As always with Erlang, write some system first, and then benchmark it for real before worrying about performance. You will usually find that performance isn't an issue, and when it is, you will discover that it is much easier to optimize one small part of an existing system than to design an optimized system from scratch. So just write something that basically does what you want right now, and worry about optimization tweaks once it works. After getting some concrete performance data (memory, request latency, etc.) is the time to start thinking about performance.
Your problem will almost certainly be on the API providers' side or in your network latency, not congestion within the Erlang VM.

Erlang supervisor processes

I have been learning Erlang intensively, and after finishing 'Programming Erlang' by Joe Armstrong, there is one thing that I keep coming back to.
In my mind, a supervisor spawns one process per child handler, so each declared gen_server-type handler will run as a separate process.
What happens if you are building a tiny web server and you want each request to be its own process? Do you still conform to OTP principles and use a gen_server somehow (and if so, how?), or do you create your own behaviour?
How does Cowboy handle this, for example? Does it still use gen_server?
tl;dr: I find that trying to figure out the "correct" supervision structure at the beginning of a project is a form of premature optimization.
The "right" way to design your supervision tree depends on what the worker parts of your application are doing. In the case of a web server I would probably first explore something along the lines of:
top supervisor (singular)
    data service supervisor (one per service type)
        worker pool (all workers under the service sup)
    client connection supervisor (one)
        connection worker pool (or one per connection; have to play with it to decide)
    logical supervisor (as appropriate -- massive variance here, depending on problem domain)
        workers or supervisors (as appropriate -- have to explore/know the problem domain to have any idea how this should be structured)
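A rough sketch of just the top of that tree, assuming hypothetical module names for the three second-level supervisors:

-module(top_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 3, period => 10},
    Children = [#{id => data_service_sup,
                  start => {data_service_sup, start_link, []},
                  type => supervisor},
                #{id => client_conn_sup,
                  start => {client_conn_sup, start_link, []},
                  type => supervisor},
                #{id => logic_sup,
                  start => {logic_sup, start_link, []},
                  type => supervisor}],
    {ok, {SupFlags, Children}}.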
So that's several workers per supervisor type at the lower level. I haven't used Cowboy so I don't know how it is organized. The point I'm trying to make is that while the mechanics of handling data services serving web pages are relatively trivial, the part of the system that actually does the core problem-solving work might not be and this is going to dictate everything interesting about the system.
It is a bad thing to have your problem-solving bits mixed into the same module as your web-displaying or connection-handling bits. Ideally you should be able to use the same logic units in a native application, a web application, and a networked service without any changes.
Ultimately the answer to whether you should have 1:1 supervisors to workers or 1:n depends on what you're doing and what restart strategy gives you the best balance among recovery to a known consistent state, latency felt by the user, and resource usage.
One of my favorite things about Erlang is that I can start with a naive supervisor structure like the one above, play with it until I see where it's not so good, and rather easily switch things around and experiment with alternatives without fundamentally altering my system much. (The same goes for playing with alternative data representations if you write proper abstractions around them.) So first, get something that works in testing. Then load it up and see if you can break it. Then start worrying about the details, after you understand where the problems actually are.
It is a common pattern in Erlang to spawn one server per client. You then use a supervisor with the simple_one_for_one strategy for the child servers, which allows you to ask the supervisor to start a child on demand. Generally this is used when you don't know how many processes you will need and when the processes are independent (a crash of one process should not impact the others).
There is very good information on the site learnyousomeerlang.com (the LYSE supervisor chapter); the whole site is worth reading.
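For concreteness, here is a minimal sketch of that pattern; the client_handler module and its socket argument are hypothetical:

-module(client_sup).
-behaviour(supervisor).
-export([start_link/0, start_client/1, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Called once per incoming client; the argument list is appended to
%% the child spec's start arguments by supervisor:start_child/2.
start_client(Socket) ->
    supervisor:start_child(?MODULE, [Socket]).

init([]) ->
    SupFlags = #{strategy => simple_one_for_one,
                 intensity => 5, period => 10},
    Child = #{id => client_handler,
              start => {client_handler, start_link, []},
              restart => temporary},  %% a finished or crashed client is not restarted
    {ok, {SupFlags, [Child]}}.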

Simple_one_for_one application

I have a supervisor which starts simple_one_for_one children. Each child is in fact a supervisor which has its own tree. Each child is started with a unique ID, so I can distinguish them. Each gen_server is then started with start_link(Id), where:
-define(SERVER(Id), {global, {Id, ?MODULE}}).
start_link(Id) ->
    gen_server:start_link(?SERVER(Id), ?MODULE, [Id], []).
So, each gen_server can easily be addressed with {global, {Id, module_name}}.
Now I'd like to turn this child supervisor into an application, so my mother supervisor should start applications instead of supervisors. That should be straightforward, except for one part: passing the ID to an application. Starting a supervisor with an ID is easy: supervisor:start_child(?SERVER, [Id]). How do I do the same for an application? How can I start several applications with the same name (so they share the same .app file) but with different IDs (so I can start my children with supervisor:start_child(?SERVER, [Id]))?
If my question is not clear enough, here is my code. So, currently, es_simulator_dispatcher starts es_simulator_sup. I'd like to have this: es_simulator_dispatcher starts es_simulator_app which starts es_simulator_sup. That's all there is to it :-)
Thanks in advance,
dijxtra
Applications don't run under anything else; they are a top-level abstraction. When you start an application with application:start/1, it is started by the application controller, which manages applications. Applications contain code and data, and maybe, at runtime, a supervision tree of processes doing the application's work. Running multiple invocations of an application does not really make sense, given this nature of applications.
I would suggest reading the OTP Design Principles User's Guide for a description of the components of OTP, how they relate, and how they are intended to be used.
I don't think applications were meant for dynamic construction like you want. I'd make a single application, because in Erlang applications are bundles of code more than they are bundles of running processes (you could say they are an artifact of compile time more so than of runtime).
Usually you feed configuration to an application through the built-in configuration system. That is, you use application:get_env(Key) to read something it should use. There is also application:set_env(...) to feed specific configuration into one, but the preferred way is the config file on disk. This may or may not work in your case.
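A minimal sketch of the config-file route, reusing the module names from the question and a hypothetical simulator_id key:

%% In sys.config:
%%   [{es_simulator, [{simulator_id, sim_01}]}].

-module(es_simulator_app).
-behaviour(application).
-export([start/2, stop/1]).

start(_Type, _Args) ->
    %% Read the ID this instance should use from the application environment.
    {ok, Id} = application:get_env(es_simulator, simulator_id),
    es_simulator_sup:start_link(Id).

stop(_State) -> ok.

Note that this still yields one ID per node/config file, which is the point above: for many IDs at once, keep a single application and start many supervised children inside it.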
In some sense, what you are up to corresponds to creating 200 Apache configuration files and then spawning 200 Apache instances next to each other, rather than running a single one and handling the multiple domains inside it.
