Hi, I want to implement distributed caches as an exercise. The cache module is based on gen_server. The caches are started by a CacheSupervisor module. At first I tried running it all on one node, which worked well. Now I am trying to distribute my caches over two nodes, which live in two open console windows on my laptop.
PS:
While writing this question I realised that I had forgotten to connect my third window to the other nodes. I fixed it, but I am still getting the original error.
After connecting my nodes I call CacheSupervisor.start_link() in my third window, which results in the following error message.
Error:
** (EXIT from #PID<0.112.0>) shutdown: failed to start child: :de
** (EXIT) an exception was raised:
** (ArgumentError) argument error
erlang.erl:2619: :erlang.spawn(:node1@DELL_XPS, {:ok, #PID<0.128.0>})
(stdlib) supervisor.erl:365: :supervisor.do_start_child/2
(stdlib) supervisor.erl:348: :supervisor.start_children/3
(stdlib) supervisor.erl:314: :supervisor.init_children/2
(stdlib) gen_server.erl:328: :gen_server.init_it/6
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
I am guessing that the error indicates that the :gen_server.start_link(..) inside start_link(name) of my Cache module resolves to {:ok, #PID<0.128.0>}, which seems to be incorrect, but I have no idea where else to put the Node.spawn().
Module Cache:
defmodule Cache do
  use GenServer

  def handle_cast({:put, url, page}, {pages, size}) do
    new_pages = Dict.put(pages, url, page)
    new_size = size + byte_size(page)
    {:noreply, {new_pages, new_size}}
  end

  def handle_call({:get, url}, _from, {pages, size}) do
    {:reply, pages[url], {pages, size}}
  end

  def handle_call({:size}, _from, {pages, size}) do
    {:reply, size, {pages, size}}
  end

  def start_link(name) do
    IO.puts(elem(name, 0))
    Node.spawn(String.to_atom(elem(name, 0)), :gen_server.start_link({:local, elem(name, 1)}, __MODULE__, {HashDict.new, 0}, []))
  end

  def put(name, url, page) do
    :gen_server.cast(name, {:put, url, page})
  end

  def get(name, url) do
    :gen_server.call(name, {:get, url})
  end

  def size(name) do
    :gen_server.call(name, {:size})
  end
end
Module CacheSupervisor:
defmodule CacheSupervisor do
  use Supervisor

  def init(_args) do
    workers = Enum.map(
      [{"node1@DELL_XPS", :de}, {"node1@DELL_XPS", :edu}, {"node2@DELL_XPS", :com},
       {"node2@DELL_XPS", :it}, {"node2@DELL_XPS", :rest}],
      fn(n) -> worker(Cache, [n], id: elem(n, 1)) end)
    supervise(workers, strategy: :one_for_one)
  end

  def start_link() do
    :supervisor.start_link(__MODULE__, [])
  end
end
:erlang.spawn(:node1@DELL_XPS, {:ok, #PID<0.128.0>})
:erlang.spawn/2 is the same function as Node.spawn/2. The function expects a node name (which you have provided) and a function. Your :gen_server.start_link call returned {:ok, pid} as it should, but since a tuple can't be treated like a function, Node.spawn/2 crashes.
I would not recommend spawning processes on separate nodes like this. If the remote node goes down, not only will you lose that node in your cluster, you will also have to deal with the fallout from all your spawned processes. This results in an app that is more brittle than it would otherwise be. If you want your cache GenServers running on multiple nodes, I'd suggest running the application you are building on multiple nodes and having an instance of your CacheSupervisor on each node. Each CacheSupervisor then starts up its own GenServers underneath it. This is more robust, because if a node goes down the remaining nodes are unaffected. Of course your application logic will need to take this into account: losing a node could mean losing cache data or client connections. See this answer for more details: How does an Erlang gen_server start_link a gen_server on another node?
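For illustration, a minimal sketch of that layout, assuming Cache.start_link/1 is reduced to a plain :gen_server.start_link({:local, name}, ...) with no Node.spawn, and reusing the node/name pairs from the question (the module name here is made up):

defmodule LocalCacheSupervisor do
  use Supervisor

  # Same node/name assignment as in the question.
  @caches [{:node1@DELL_XPS, :de}, {:node1@DELL_XPS, :edu},
           {:node2@DELL_XPS, :com}, {:node2@DELL_XPS, :it},
           {:node2@DELL_XPS, :rest}]

  def start_link() do
    :supervisor.start_link(__MODULE__, [])
  end

  def init(_args) do
    # Start only the caches assigned to the node we are running on.
    workers = for {node, name} <- @caches, node == Node.self() do
      worker(Cache, [name], id: name)
    end
    supervise(workers, strategy: :one_for_one)
  end
end

Run the same modules on both nodes and call LocalCacheSupervisor.start_link() on each; a node that goes down then only takes its own caches with it.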
If you really, really want to spawn a process on a remote node like this, you could do this:
Node.spawn_link(:node1@DELL_XPS, fn ->
  {:ok, _pid} = :gen_server.start_link({:local, elem(name, 1)}, __MODULE__, {HashDict.new, 0}, [])
  # Block forever so the remote process (and its link to the supervisor) stays alive
  receive do
    :exit -> :ok
  end
end)
Note that you must use spawn_link, as supervisors expect to be linked to their children. If the supervisor is not linked, it will not know when the child crashes and won't be able to restart the process.
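Also note that once the GenServer is registered with {:local, name} on the remote node, you don't need its pid at all: gen_server accepts a {name, node} tuple wherever it accepts a name. For example (node and cache name taken from the question):

# Talk to the :de cache registered locally on node1:
:gen_server.cast({:de, :node1@DELL_XPS}, {:put, "http://example.com", "<html>...</html>"})
:gen_server.call({:de, :node1@DELL_XPS}, {:size})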
Related
Is it possible to register a :poolboy pool in registry (:gproc or Registry in elixir 1.4) after it was started?
I need to implement some kind of pub/sub architecture on pools. And I'd like to register multiple pools under the same alias.
Registry has :duplicate, and :gproc has :p, but it looks like neither of them works with :via tuples, so I can't use them in the name of my pool.
It looks like Registry only allows registering the current process under a given name.
So to use Registry, you'd need to start a process to act as a proxy for each poolboy pool, e.g. a GenServer module PoolProxy:
defmodule PoolProxy do
  use GenServer

  def init(poolname) do
    {:ok, _} = Registry.register(Registry.PoolPubSub, "PoolPubSub", nil)
    {:ok, poolname}
  end

  def handle_call(:notify_pool, _from, poolname) do
    # interact with the poolboy pool here...
    {:reply, :ok, poolname}
  end
end
Once registered, you can then pub-sub to the proxy processes, with
Registry.dispatch(Registry.PoolPubSub, "PoolPubSub", fn entries ->
  for {pid, _} <- entries, do: GenServer.call(pid, :notify_pool)
end)
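For completeness, the registry itself has to be started with duplicate keys before any proxy can register; a minimal sketch using the Elixir 1.4 API (the pool names are made up):

# Start a duplicate-key registry, then one proxy per poolboy pool:
{:ok, _} = Registry.start_link(:duplicate, Registry.PoolPubSub)
{:ok, _} = GenServer.start_link(PoolProxy, :pool_a)
{:ok, _} = GenServer.start_link(PoolProxy, :pool_b)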
I'm reading Seven Concurrency Models in Seven Weeks and the author asks the reader to implement a
cache that distributes cache entries across multiple actors according
to a hash function. Create a supervisor that starts multiple cache
actors and routes incoming messages to the appropriate cache worker.
What action should this supervisor take if one of the cache workers
fails?
defmodule CacheSupervisor do
  def start(number_of_caches) do
    spawn(__MODULE__, :init, [number_of_caches])
  end

  def init(number_of_caches) do
    Process.flag(:trap_exit, true)
    pids = Enum.into(Enum.map(0..number_of_caches-1, fn(id) ->
      {id, Cache.start}
    end), %{})
    loop(pids, number_of_caches)
  end

  def loop(pids, number_of_caches) do
    receive do
      {:put, url, page} ->
        send(compute_cache_pid(pids, url, number_of_caches), {:put, url, page})
        loop(pids, number_of_caches)
      {:get, sender, ref, url} ->
        send(compute_cache_pid(pids, url, number_of_caches), {:get, sender, ref, url})
        loop(pids, number_of_caches)
      {:EXIT, pid, reason} ->
        IO.puts("Cache #{inspect pid} failed with reason #{inspect reason} - restarting it")
        loop(repair(pids, pid), number_of_caches)
    end
  end

  def repair(pids, pid) do
    # not clear how to repair, since we don't know the index of the failed process
  end

  def compute_cache_pid(pids, url, number_of_caches) do
    pids[rem(:erlang.phash2(url), number_of_caches)]
  end
end
When a process fails, the supervisor only gets its pid. Using just the pid, I can't tell which bucket needs a new process. I could probably create a second map pid -> bucket_index, but it seems ugly.
Can I somehow attach additional attributes when I spawn a process? Can I read those attributes when the process exits?
What would be the best way to solve this (without OTP)?
This implementation also makes the supervisor too "smart". As I understand it, supervisors should be simple and only care about restarting failed processes.
I'm not sure how to separate the restarting and routing logic into two separate processes: when the supervisor restarts a process, the router should be aware of it and have up-to-date pids.
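For what it's worth, a minimal sketch of one way to fill in repair/2 without a second map: search the existing index-to-pid map for the dead pid (this assumes Cache.start links the new cache to the supervisor, which trapping exits already requires):

def repair(pids, dead_pid) do
  # Find which bucket the dead pid belonged to, then restart just that cache.
  {index, ^dead_pid} = Enum.find(pids, fn {_index, pid} -> pid == dead_pid end)
  Map.put(pids, index, Cache.start)
end

As for splitting restarting and routing into two processes: one option is to let the router hold the id-to-pid map and have the supervisor send it an {index, new_pid} message after every restart, so the routing state never goes stale.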
Given a module mymodule, how do I start it multiple times under the supervisor behaviour?
I need, for example, 2 instances of the same process (mymodule) to be started concurrently. I called the child identifiers child1 and child2. They both point to the mymodule module that I want to start. I have specified two different functions to start each instance of the worker process mymodule (start_link1 and start_link2).
-module(my_supervisor).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, _Arg = []).

init([]) ->
    {ok, {{one_for_one, 10, 10},
          [{child1,
            {mymodule, start_link1, []},
            permanent,
            10000,
            worker,
            [mymodule]},
           {child2,
            {mymodule, start_link2, []},
            permanent,
            10000,
            worker,
            [mymodule]}]}}.
The worker has two distinct start_link functions (start_link1 and start_link2) for testing purposes:
-module(mymodule).
-behaviour(gen_server).

start_link1() ->
    log_something("at link 1"),
    gen_server:start_link({global, child1}, ?MODULE, [], []).

start_link2() ->
    log_something("at link 2"),
    gen_server:start_link({global, child2}, ?MODULE, [], []).

init([]) ->
    ....
With the above I can see the message "at link 1" in my log, but it does not reveal "at link 2" anywhere. It also does not perform anything in the instance of link 1: it just dies, apparently.
The only scenario that works is when the name "child1" matches the worker module name "mymodule".
As @MilleBessö asks: are you trying to start two processes which have the same registered name? Does mymodule:start_link register the mymodule process under a fixed name? If so, then trying to start a second one will cause a clash. Or are you trying to start multiple my_supervisor supervisors? Then you will also get a name clash. You have not included the code for mymodule.
Remember that you can only have one process registered under a given name. This holds both for locally registered processes and for those registered using global.
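The clash is easy to reproduce in isolation; a minimal sketch in Elixir syntax (MyModule stands for any GenServer callback module):

# The second registration under the same global name fails:
{:ok, pid} = GenServer.start_link(MyModule, [], name: {:global, :child1})
{:error, {:already_started, ^pid}} = GenServer.start_link(MyModule, [], name: {:global, :child1})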
EDIT: Does the supervisor die as well?
A gen_server, like all other behaviours, isn't considered to be properly started until the init callback has completed and returned a correct value ({ok,State}). So if there is an error in mymodule:init/1, this will crash the child process before it has been initialised, and the supervisor will give up. While a supervisor can and will restart children when they die, it does require that they all start correctly. From the docs for supervisor:start_link/3:
If the supervisor and its child processes are successfully created (i.e. if all child process start functions return {ok,Child}, {ok,Child,Info}, or ignore) the function returns {ok,Pid}, where Pid is the pid of the supervisor. If there already exists a process with the specified SupName the function returns {error,{already_started,Pid}}, where Pid is the pid of that process.
If Module:init/1 returns ignore, this function returns ignore as well and the supervisor terminates with reason normal. If Module:init/1 fails or returns an incorrect value, this function returns {error,Term} where Term is a term with information about the error, and the supervisor terminates with reason Term.
I don't know if this is the problem, but it gives the same behaviour as you are seeing.
Check the docs for supervisor:start_link(). The first parameter you pass in here is the name it uses to register with global, which provides a global name -> pid lookup. Since this must be unique, your second process fails to start, since the name is already taken.
Edit: Here is a link to the docs: http://erldocs.com/R15B/stdlib/supervisor.html?i=5&search=supervisor:start#start_link/3
Check also the simple_one_for_one supervisor restart strategy. It allows you to start multiple processes with the same child specification in a more automated way.
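A minimal sketch of that strategy, in Elixir syntax (it assumes mymodule exports start_link/1):

defmodule MySimpleSup do
  use Supervisor

  def start_link() do
    Supervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  def init([]) do
    # One child template; every child started later reuses this spec.
    supervise([worker(:mymodule, [])], strategy: :simple_one_for_one)
  end

  # Each call appends args to the template and starts another instance.
  def start_child(args) do
    Supervisor.start_child(__MODULE__, [args])
  end
end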
It seems that an Erlang process will stay alive until the 5 sec default timeout even if it has finished its work.
I have a gen_server call that issues a command to the Windows CLI. The command completes in less than 1 sec, but the process waits 5 sec before I see the result of the operation. What's going on? Is it something to do with the timeout, or might it be something else?
EDIT: This call doesn't do anything for 5 seconds (the default timeout!):
handle_call({create_app, Path, Name, Args}, _From, State) ->
    case filelib:ensure_dir(Path) of
        {error, Reason} ->
            {reply, Reason, State};
        _ ->
            file:set_cwd(Path),
            Response = os:cmd(string:join(["Rails", Name, Args], " ")),
            {reply, Response, State}
    end;
I'm guessing the os:cmd is taking that long to return the results. It's possible that os:cmd is having trouble telling when the Rails command has completed, and doesn't return until the process triggers the timeout. But from your code I'd say the most likely culprit is the os:cmd call.
Does the return contain everything you expect it to?
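One more thing worth knowing about the 5 seconds: gen_server:call/2 uses a 5000 ms default timeout, and a call that takes longer makes the caller exit rather than wait. If the command legitimately needs more time, pass an explicit timeout as the third argument; for example, in Elixir syntax (server name and value made up):

# Raise the call timeout from the 5000 ms default to 30 seconds:
:gen_server.call(:my_server, {:create_app, path, name, args}, 30_000)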
You still have not added any information on what the problem is. But I see some other things I'd like to comment on.
Current working dir
You are using file:set_cwd(Path), so the started command will inherit that path. The cwd of the file server is global; you should probably not be using it at all in application code. It's useful for setting the cwd to where you want Erlang crash dumps to be written, etc.
Your desire to let Rails execute with the cwd set according to Path is better served by something like this:
_ ->
    Response = os:cmd(string:join(["cd", Path, "&&", "Rails", Name, Args], " ")),
    {reply, Response, State}
That is, start a shell to parse the command line, have the shell change the cwd, and then start Rails.
Blocking a gen_server
The gen_server is there to serialize processing. That is, it processes one message after the other. It doesn't handle them all concurrently. It is its reason for existence to not handle them concurrently.
You are (in relation to other costs) doing some very costly computation in the gen_server: starting an external process that runs this Rails application. Is it your intention to have at most one Rails application running at any one time? (I've heard that Ruby on Rails requires tons of memory per process, so it might be a sensible decision.)
If you don't need to update the State with any values from the costly call, as in your example code, then you can use an explicit gen_server:reply/2 call:
_ ->
    %% Note: the handle_call head must bind From (not _From) for this to work.
    spawn_link(fun () -> rails_cmd(From, Path, Name, Args) end),
    {noreply, State}
And then you have
rails_cmd(From, Path, Name, Args) ->
    Response = os:cmd(string:join(["cd", Path, "&&", "Rails", Name, Args], " ")),
    gen_server:reply(From, Response).
After seeing this article, I have been tinkering with mochiweb, trying to replicate what's done in the article: basically setting up a mochiweb server, having two Erlang nodes, and then calling a function defined in one node from the other (after running net_adm:ping() between the two nodes so they know each other).
I was able to follow everything up to that function call part. In n1@localhost, which is the mochiweb server, I call (just as done in the article):
router:login(IdInt, self()).
And then, in n2@localhost, which runs the router.erl script, I have defined the login function:
login(Id, Pid) when is_pid(Pid) ->
    gen_server:call(?SERVER, {login, Id, Pid}).

handle_call({login, Id, Pid}, _From, State) when is_pid(Pid) ->
    ets:insert(State#state.pid2id, {Pid, Id}),
    ets:insert(State#state.id2pid, {Id, Pid}),
    link(Pid), % tell us if they exit, so we can log them out
    io:format("~w logged in as ~w\n", [Pid, Id]),
    {reply, ok, State};
I have pasted only the relevant parts of the code. However, when I now access the webserver from the browser, I get this error report on n1@localhost:
=CRASH REPORT==== 11-Jun-2009::12:39:49 ===
crasher:
initial call: mochiweb_socket_server:acceptor_loop/1
pid: <0.62.0>
registered_name: []
exception error: undefined function router:login/2
in function mochiconntest_web:loop/2
in call from mochiweb_http:headers/5
ancestors: [mochiconntest_web,mochiconntest_sup,<0.59.0>]
messages: []
links: [<0.61.0>,#Port<0.897>]
dictionary: [{mochiweb_request_path,"/test/123"}]
trap_exit: false
status: running
heap_size: 1597
stack_size: 24
reductions: 1551
neighbours:
=ERROR REPORT==== 11-Jun-2009::12:39:49 ===
{mochiweb_socket_server,235,{child_error,undef}}
After googling around, I got a basic gist of what the error is trying to say - basically that the login function being called on n1@localhost is not defined there - but it is defined on n2@localhost (and both nodes know each other - I ran nodes(). to check)!! Please tell me where I am going wrong!
You are right - the code for router:login is not actually available on your host n1@localhost - it is the code within that function (the gen_server:call) that routes the call to n2@localhost (via that ?SERVER macro), and that's where the real implementation is. The top-level function is simply a way of wrapping up the call to the appropriate node.
But you do need at least the implementation of login
login(Id, Pid) when is_pid(Pid) ->
    gen_server:call(?SERVER, {login, Id, Pid}).

available on n1@localhost.
(Updated)
You'd need to define, replace, or use a ?SERVER macro as well. In the sample code in the article this is
-define(SERVER, global:whereis_name(?MODULE)).
but this uses the ?MODULE macro, which would be wrong in your case. Basically, when the gen_server process (router) is started, it registers itself as ?MODULE, which in this case maps to the atom 'router' that other nodes can see (using global:whereis_name(router)). So you should be able to just write:
login(Id, Pid) when is_pid(Pid) ->
    gen_server:call(global:whereis_name(router), {login, Id, Pid}).
So the effect of calling login on n1@localhost is to make a gen_server call to the router:handle_call function on n2@localhost, assuming the router gen_server process is running and has registered itself. The return value of that call comes back to your process on n1@localhost.
In the examples in your question it looks like you only loaded the router module on one node. Nodes do not automatically load code from each other by default, so just because the function is defined on n2 doesn't mean you can call it locally on n1 (n1 would need to be able to load it in the normal way).
The code given looks like it copes properly with running in a distributed system (you can start the router server on one node and calling the API functions on other nodes will send router requests to the correct place). So you just need to put a copy of the router module on n1 and it should just work. It's possible to have n1 load the router module from n2, but it's a bit of a hassle compared to just giving n1 a copy of the module in its code path.
Interestingly, in the router code it's unnecessary to call gen_server:call(global:whereis_name(?MODULE), Message) at all, as gen_server:call/2 knows how to look up global registrations itself. -define(SERVER, {global, ?MODULE}). would work fine if you changed the start_link function to start_link() -> gen_server:start_link(?SERVER, ?MODULE, [], []).