Mnesia Query Cursors - Working with them in Practical applications

Mnesia Query Cursors - Working with them in Practical applications - erlang

In most applications, its hard to avoid the need to query large amounts of information which a user wants to browse through. This is what led me to cursors. With mnesia, cursors are implemented using qlc:cursor/1 or qlc:cursor/2. After working with them for a while and facing this problem many times,
11> qlc:next_answers(QC,3).
** exception error: {qlc_cursor_pid_no_longer_exists,<0.59.0>}
in function qlc:next_loop/3 (qlc.erl, line 1359)
12>
It has occured to me that the whole cursor thing has to be within one mnesia transaction: executes as a whole once. like this below
E:\>erl
Eshell V5.9 (abort with ^G)
1> mnesia:start().
ok
2> rd(obj,{key,value}).
obj
3> mnesia:create_table(obj,[{attributes,record_info(fields,obj)}]).
{atomic,ok}
4> Write = fun(Obj) -> mnesia:transaction(fun() -> mnesia:write(Obj) end) end.
#Fun<erl_eval.6.111823515>
5> [Write(#obj{key = N,value = N * 2}) || N <- lists:seq(1,100)],ok.
ok
6> mnesia:transaction(fun() ->
QC = cursor_server:cursor(qlc:q([XX || XX <- mnesia:table(obj)])),
Ans = qlc:next_answers(QC,3),
io:format("\n\tAns: ~p~n",[Ans])
end).
Ans: [{obj,20,40},{obj,21,42},{obj,86,172}]
{atomic,ok}
7>
When you attempt to call say: qlc:next_answers/2 outside a mnesia transaction, you will get an exception. Not only just out of the transaction, but even if that method is executed by a DIFFERENT process than the one which created the cursor, problems are bound to happen.
Another intresting finding is that, as soon as you get out of a mnesia transaction, one of the processes which are involved in a mnesia cursor (apparently mnesia spawns a process in the background), exits, causing the cursor to be invalid. Look at this below:
-module(cursor_server).
-compile(export_all).
cursor(Q)->
case mnesia:is_transaction() of
false ->
F = fun(QH)-> qlc:cursor(QH,[]) end,
mnesia:activity(transaction,F,[Q],mnesia_frag);
true -> qlc:cursor(Q,[])
end.
%% --- End of module -------------------------------------------
Then in shell, i use that method:
7> QC = cursor_server:cursor(qlc:q([XX || XX <- mnesia:table(obj)])).
{qlc_cursor,{<0.59.0>,<0.30.0>}}
8> erlang:is_process_alive(list_to_pid("<0.59.0>")).
false
9> erlang:is_process_alive(list_to_pid("<0.30.0>")).
true
10> self().
<0.30.0>
11> qlc:next_answers(QC,3).
** exception error: {qlc_cursor_pid_no_longer_exists,<0.59.0>}
in function qlc:next_loop/3 (qlc.erl, line 1359)
12>
So, this makes it very Extremely hard to build a web application in which a user needs to browse a particular set of results, group by group say: give him/her the first 20, then next 20 e.t.c. This involves, getting the first results, send them to the web page, then wait for the user to click NEXT then ask qlc:cursor/2 for the next 20 and so on. These operations cannot be done, while hanging inside a mnesia transaction !!! The only possible way, is by spawning a process which will hang there, receiving and sending back next answers as messages and receiving the next_answers requests as messages like this:
-define(CURSOR_TIMEOUT,timer:hours(1)).
%% initial request is made here below
request(PageSize)->
Me = self(),
CursorPid = spawn(?MODULE,cursor_pid,[Me,PageSize]),
receive
{initial_answers,Ans} ->
%% find a way of hidding the Cursor Pid
%% in the page so that the subsequent requests
%% come along with it
{Ans,pid_to_list(CursorPid)}
after ?CURSOR_TIMEOUT -> timedout
end.
cursor_pid(ParentPid,PageSize)->
F = fun(Pid,N)->
QC = cursor_server:cursor(qlc:q([XX || XX <- mnesia:table(obj)])),
Ans = qlc:next_answers(QC,N),
Pid ! {initial_answers,Ans},
receive
{From,{next_answers,Num}} ->
From ! {next_answers,qlc:next_answers(QC,Num)},
%% Problem here ! how to loop back
%% check: Erlang Y-Combinator
delete ->
%% it could have died already, so we be careful here !
try qlc:delete_cursor(QC) of
_ -> ok
catch
_:_ -> ok
end,
exit(normal)
after ?CURSOR_TIMEOUT -> exit(normal)
end
end,
mnesia:activity(transaction,F,[ParentPid,PageSize],mnesia_frag).
next_answers(CursorPid,PageSize)->
list_to_pid(CursorPid) ! {self(),{next_answers,PageSize}},
receive
{next_answers,Ans} ->
{Ans,pid_to_list(CursorPid)}
after ?CURSOR_TIMEOUT -> timedout
end.
That would create a more complex problem of managing process exits, tracking / monitoring e.t.c. I wonder why the mnesia implementers didnot see this !
Now, that brings me to my questions. I have been walking around the web for solutions and you could please check out these links from which the questions arise: mnemosyne, Ulf Wiger's Solution to Cursor Problems, AMNESIA - an RDBMS implementation of mnesia.
1. Does anyone have an idea on how to handle mnesia query cursors in a different way from what is documented, and is worth sharing ?
2. What are the reasons why mnesia implemeters decided to force the cursors within a single transaction: even the calls for the next_answers ?
3. Is there anything, from what i have presented, that i do not understand clearly (other than my bad buggy illustration code - please ignore those) ?
4. AMNESIA (on section 4.7 of the link i gave above), has a good implementation of cursors, because the subsequent calls for the next_answers, do not need to be in the same transaction, NOR by the same process. Would you advise anyone to switch from mnesia to amnesia due to this and also, is this library still supported ?
5. Ulf Wiger , (the author of many erlang libraries esp. GPROC), suggests the use of mnesia:select/4. How would i use it to solve cursor problems in a web application ? NOTE: Please do not advise me to leave mnesia and use something else, because i want to use mnesia for this specific problem. I appreciate your time to read through all this question.

The motivation is that transaction grabs (in your case) read locks.
Locks can not be kept outside of transactions.
If you want, you can run it in a dirty_context, but you loose the
transactional properties, i.e. the table may change between invocations.
make_cursor() ->
QD = qlc:sort(mnesia:table(person, [{traverse, select}])),
mnesia:activity(async_dirty, fun() -> qlc:cursor(QD) end, mnesia_frag).
get_next(Cursor) ->
Get = fun() -> qlc:next_answers(Cursor,5) end,
mnesia:activity(async_dirty, Get, mnesia_frag).
del_cursor(Cursor) ->
qlc:delete_cursor(Cursor).

I think this may help you :
use async_dirty instead of transaction
{Record,Cont}=mnesia:activity(async_dirty, fun mnesia:select/4,[md,[{Match_head,[Guard],[Result]}],Limit,read])
then read next Limit number of records:
mnesia:activity(async_dirty, fun mnesia:select/1,[Cont])
full code:
-record(md,{id,name}).
batch_delete(Id,Limit) ->
Match_head = #md{id='$1',name='$2'},
Guard = {'<','$1',Id},
Result = '$_',
{Record,Cont} = mnesia:activity(async_dirty, fun mnesia:select/4,[md,[{Match_head,[Guard],[Result]}],Limit,read]),
delete_next({Record,Cont}).
delete_next('$end_of_table') ->
over;
delete_next({Record,Cont}) ->
delete(Record),
delete_next(mnesia:activity(async_dirty, fun mnesia:select/1,[Cont])).
delete(Records) ->
io:format("delete(~p)~n",[Records]),
F = fun() ->
[ mnesia:delete_object(O) || O <- Records]
end,
mnesia:transaction(F).
remember you can not use cursor out of one transaction

Related

Are erlang:send_after/3 and timer:send_after/3 intended to behave differently?

I wanted to send a message to a process after a delay, and discovered erlang:send_after/4.
When looking at the docs it looked like this is exactly what I wanted:
erlang:send_after(Time, Dest, Msg, Options) -> TimerRef
Starts a timer. When the timer expires, the message Msg is sent to the
process identified by Dest.
However, it doesn't seem to work when the destination is running on another node - it tells me one of the arguments are bad.
1> P = spawn('node#host', module, function, [Arg]).
<10585.83.0>
2> erlang:send_after(1000, P, {123}).
** exception error: bad argument
in function erlang:send_after/3
called as erlang:send_after(1000,<10585.83.0>,{123})
Doing the same thing with timer:send_after/3 appears to work fine:
1> P = spawn('node#host', module, function, [Arg]).
<10101.10.0>
2> timer:send_after(1000, P, {123}).
{ok,{-576458842589535,#Ref<0.1843049418.1937244161.31646>}}
And, the docs for timer:send_after/3 state almost the same thing as the erlang version:
send_after(Time, Pid, Message) -> {ok, TRef} | {error, Reason}
Evaluates Pid ! Message after Time milliseconds.
So the question is, why do these two functions, which on the face of it do the same thing, behave differently? Is erlang:send_after broken, or mis-advertised? Or maybe timer:send_after isn't doing what I think it is?

TL;DR
Your assumption is correct: these are intended to do the same thing, but are implemented differently.
Discussion
Things in the timer module such as timer:send_after/2,3 work through the gen_server that defines that as a service. Like any other service, this one can get overloaded if you assign a really huge number of tasks (timers to track) to it.
erlang:send_after/3,4, on the other hand, is a BIF implemented directly within the runtime and therefore have access to system primitives like the hardware timer. If you have a ton of timers this is definitely the way to go. In most programs you won't notice the difference, though.
There is actually a note about this in the Erlang Efficiency Guide:
3.1 Timer Module
Creating timers using erlang:send_after/3 and erlang:start_timer/3 , is much more efficient than using the timers provided by the timer module in STDLIB. The timer module uses a separate process to manage the timers. That process can easily become overloaded if many processes create and cancel timers frequently (especially when using the SMP emulator).
The functions in the timer module that do not manage timers (such as timer:tc/3 or timer:sleep/1), do not call the timer-server process and are therefore harmless.
A workaround
A workaround to gain the efficiency of the BIF without the same-node restriction is to have a process of your own that does nothing but wait for a message to forward to another node:
-module(foo_forward).
-export([send_after/3, cancel/1]).
% Obviously this is an example only. You would want to write this to
% be compliant with proc_lib, write a proper init/N and integrate with
% OTP. Note that this snippet is missing the OTP service functions.
start() ->
spawn(fun() -> loop(self(), [], none) end).
send_after(Time, Dest, Message) ->
erlang:send_after(Time, self(), {forward, Dest, Message}).
loop(Parent, Debug, State) ->
receive
{forward, Dest, Message} ->
Dest ! Message,
loop(Parent, Debug, State);
{system, From, Request} ->
sys:handle_msg(Request, From, Parent, ?MODULE, Debug, State);
Unexpected ->
ok = log(warning, "Received message: ~tp", [Unexpected]),
loop(Parent, Debug, State)
end.
The above example is a bit shallow, but hopefully it expresses the point. It should be possible to get the efficiency of the BIF erlang:send_after/3,4 but still manage to send messages across nodes as well as give you the freedom to cancel a message using erlang:cancel_timer/1
But why?
The puzzle (and bug) is why erlang:send_after/3,4 does not want to work across nodes. The example you provided above looks a bit odd as the first assignment to P was the Pid <10101.10.0>, but the crashed call was reported as <10585.83.0> -- clearly not the same.
For the moment I do not know why erlang:send_after/3,4 doesn't work, but I can say with confidence that the mechanism of operation between the two is not the same. I'll look into it, but I imagine that the BIF version is actually doing some funny business within the runtime to gain efficiency and as a result signalling the target process by directly updating its mailbox instead of actually sending an Erlang message on the higher Erlang-to-Erlang level.
Maybe it is good that we have both, but this should definitely be clearly marked in the docs, and it evidently is not (I just checked).

There is some difference in timeout order if you have many timers.
The example below shows erlang:send_after does not guarantee order, but
timer:send_after does.
1> A = lists:seq(1,10).
[1,2,3,4,5,6,7,8,9,10]
2> [erlang:send_after(100, self(), X) || X <- A].
...
3> flush().
Shell got 2
Shell got 3
Shell got 4
Shell got 5
Shell got 6
Shell got 7
Shell got 8
Shell got 9
Shell got 10
Shell got 1
ok
4> [timer:send_after(100, self(), X) || X <- A].
...
5> flush().
Shell got 1
Shell got 2
Shell got 3
Shell got 4
Shell got 5
Shell got 6
Shell got 7
Shell got 8
Shell got 9
Shell got 10
ok

Extract values from list of tuples continuously in Erlang

I am learning Erlang from a Ruby background and having some difficulty grasping the thought process. The problem I am trying to solve is the following:
I need to make the same request to an api, each time I receive a unique ID in the response which I need to pass into the next request until there is not ID returned. From each response I need to extract certain data and use it for other things as well.
First get the iterator:
ShardIteratorResponse = kinetic:get_shard_iterator(GetShardIteratorPayload).
{ok,[{<<"ShardIterator">>,
<<"AAAAAAAAAAGU+v0fDvpmu/02z5Q5OJZhPo/tU7fjftFF/H9M7J9niRJB8MIZiB9E1ntZGL90dIj3TW6MUWMUX67NEj4GO89D"...>>}]}
Parse out the shard_iterator..
{_, [{_, ShardIterator}]} = ShardIteratorResponse.
Make the request to kinesis for the streams records...
GetRecordsPayload = [{<<"ShardIterator">>, <<ShardIterator/binary>>}].
[{<<"ShardIterator">>,
<<"AAAAAAAAAAGU+v0fDvpmu/02z5Q5OJZhPo/tU7fjftFF/H9M7J9niRJB8MIZiB9E1ntZGL90dIj3TW6MUWMUX67NEj4GO89DETABlwVV"...>>}]
14> RecordsResponse = kinetic:get_records(GetRecordsPayload).
{ok,[{<<"NextShardIterator">>,
<<"AAAAAAAAAAFy3dnTJYkWr3gq0CGo3hkj1t47ccUS10f5nADQXWkBZaJvVgTMcY+nZ9p4AZCdUYVmr3dmygWjcMdugHLQEg6x"...>>},
{<<"Records">>,
[{[{<<"Data">>,<<"Zmlyc3QgcmVjb3JkISEh">>},
{<<"PartitionKey">>,<<"BlanePartitionKey">>},
{<<"SequenceNumber">>,
<<"49545722516689138064543799042897648239478878787235479554">>}]}]}]}
What I am struggling with is how do I write a loop that keeps hitting the kinesis endpoint for that stream until there are no more shard iterators, aka I want all records. Since I can't re-assign the variables as I would in Ruby.

WARNING: My code might be bugged but it's "close". I've never ran it and don't see how last iterator can look like.
I see you are trying to do your job entirely in shell. It's possible but hard. You can use named function and recursion (since release 17.0 it's easier), for example:
F = fun (ShardIteratorPayload) ->
{_, [{_, ShardIterator}]} = kinetic:get_shard_iterator(ShardIteratorPayload),
FunLoop =
fun Loop(<<>>, Accumulator) -> % no clue how last iterator can look like
lists:reverse(Accumulator);
Loop(ShardIterator, Accumulator) ->
{ok, [{_, NextShardIterator}, {<<"Records">>, Records}]} =
kinetic:get_records([{<<"ShardIterator">>, <<ShardIterator/binary>>}]),
Loop(NextShardIterator, [Records | Accumulator])
end,
FunLoop(ShardIterator, [])
end.
AllRecords = F(GetShardIteratorPayload).
But it's too complicated to type in shell...
It's much easier to code it in modules.
A common pattern in erlang is to spawn another process or processes to fetch your data. To keep it simple you can spawn another process by calling spawn or spawn_link but don't bother with links now and use just spawn/3.
Let's compile simple consumer module:
-module(kinetic_simple_consumer).
-export([start/1]).
start(GetShardIteratorPayload) ->
Pid = spawn(kinetic_simple_fetcher, start, [self(), GetShardIteratorPayload]),
consumer_loop(Pid).
consumer_loop(FetcherPid) ->
receive
{FetcherPid, finished} ->
ok;
{FetcherPid, {records, Records}} ->
consume(Records),
consumer_loop(FetcherPid);
UnexpectedMsg ->
io:format("DROPPING:~n~p~n", [UnexpectedMsg]),
consumer_loop(FetcherPid)
end.
consume(Records) ->
io:format("RECEIVED:~n~p~n",[Records]).
And fetcher:
-module(kinetic_simple_fetcher).
-export([start/2]).
start(ConsumerPid, GetShardIteratorPayload) ->
{ok, [ShardIterator]} = kinetic:get_shard_iterator(GetShardIteratorPayload),
fetcher_loop(ConsumerPid, ShardIterator).
fetcher_loop(ConsumerPid, {_, <<>>}) -> % no clue how last iterator can look like
ConsumerPid ! {self(), finished};
fetcher_loop(ConsumerPid, ShardIterator) ->
{ok, [NextShardIterator, {<<"Records">>, Records}]} =
kinetic:get_records(shard_iterator(ShardIterator)),
ConsumerPid ! {self(), {records, Records}},
fetcher_loop(ConsumerPid, NextShardIterator).
shard_iterator({_, ShardIterator}) ->
[{<<"ShardIterator">>, <<ShardIterator/binary>>}].
As you can see both processes can do their job concurrently.
Try from your shell:
kinetic_simple_consumer:start(GetShardIteratorPayload).
Now your see that your shell process turns to consumer and you will have your shell back after fetcher will send {ItsPid, finished}.
Next time instead of
kinetic_simple_consumer:start(GetShardIteratorPayload).
run:
spawn(kinetic_simple_consumer, start, [GetShardIteratorPayload]).
You should play with spawning processes - it's erlang main strength.

In Erlang, you can write loop using tail recursive functions. I don't know the kinetic API, so for simplicity, I just assume, that kinetic:next_iterator/1 return {ok, NextIterator} or {error, Reason} when there are no more shards.
loop({error, Reason}) ->
ok;
loop({ok, Iterator}) ->
do_something_with(Iterator),
Result = kinetic:next_iterator(Iterator),
loop(Result).
You are replacing loop with iteration. First clause deals with case, where there are no more shards left (always start recursion with the end condition). Second clause deals with case, where we got some iterator, we do something with it and call next.
The recursive call is last instruction in the function body, which is called tail recursion. Erlang optimizes such calls - they don't use call stack, so they can run infinitely in constant memory (you will not get anything like "Stack level too deep")

Optimized way of getting values from an erlang process loop

Hi I'm a newbie in Erlang and I just started learning about processes. Here I have a typical process loop:
loop(X,Y,Z) ->
receive
{do} ->
NewX = X+1,
NewY = Y+1,
NewZ = Z+1,
Product = NewX * NewY * NewZ,
% do something
loop(NewX,NewY,NewZ)
end.
How do I get the latest value of Product from a function let's say get_product()? I know that message passing will be the logical option but is there a more optimal way of extracting the value?

Here are methods to communicate between Erlang processes I am aware of, and my (possibly wrong) assessment of theirs relative performance.
Message passing. This method will suit most of your needs. I don't know how it is actually implemented, but from my point of view it should be as fast as putting a pointer into a queue and retrieving it back.
Exterior methods, e.g. sockets, files, pipes. These methods might be faster for communicating between different nodes, depending on a problem you solve, your solution and environment your program will be executed in. Inter-node communication in Erlang is done via TCP connections, so if you want to use self written code to communicate via TCP sockets, you should try really hard to outperform Erlang's implementation.
ETS, Dets. These methods won't be faster than message passing (ETS) or file (Dets) assuming best possible implementation.
NIF. You can write one method to save value in your NIF library and one to retrieve it. This one has a potential to outperform message passing since you can just save a value into a variable and return it back when needed and it has no overhead on pattern matching in receive.
Process dictionary. You can get another process dictionary using erlang:process_info(Pid, dictionary) call, in the Pid process you can put value in that dictionary using put(Key, Value) call.
Also, if you want to speed up your Erlang application take a look at HiPE, it might help.
Before switching from message passing to anything from this list to gain in speed you should measure it first!

I assumed this is what you want:
-module(lab).
-compile(export_all).
start() ->
InitialState = {1,1,1},
Pid = spawn(?MODULE, loop, [InitialState]),
register(server, Pid).
loop(State) ->
{X, Y, Z} = State,
receive
tick ->
NewX = X+1,
NewY = Y+1,
NewZ = Z+1,
NewState = {NewX, NewY, NewZ},
loop(NewState);
{get_product, From} ->
Product = X * Y * Z,
From ! Product,
loop(State);
_ ->
io:format("Unknown message received.~n"),
loop(State)
end.
get_product() ->
server ! {get_product, self()},
receive
Product ->
Product
end.
tick() ->
server ! tick.
From within the Erlang shell:
1> c(lab).
{ok,lab}
2> lab:start().
true
3> lab:get_product().
1
4> lab:tick().
tick
5> lab:get_product().
8
6> lab:tick().
tick
7> lab:tick().
tick
8> lab:get_product().
64

erlang supervisor best way to handle ibrowse:send_req conn_failed

new to Erlang and just having a bit of trouble getting my head around the new paradigm!
OK, so I have this internal function within an OTP gen_server:
my_func() ->
Result = ibrowse:send_req(?ROOTPAGE,[{"User-Agent",?USERAGENT}],get),
case Result of
{ok, "200", _, Xml} -> %<<do some stuff that won't interest you>>
,ok;
{error,{conn_failed,{error,nxdomain}}} -> <<what the heck do I do here?>>
end.
If I leave out the case for handling the connection failed then I get an exit signal propagated to the supervisor and it gets shut down along with the server.
What I want to happen (at least I think this is what I want to happen) is that on a connection failure I'd like to pause and then retry send_req say 10 times and at that point the supervisor can fail.
If I do something ugly like this...
{error,{conn_failed,{error,nxdomain}}} -> stop()
it shuts down the server process and yes, I get to use my (try 10 times within 10 seconds) restart strategy until it fails, which is also the desired result however the return value from the server to the supervisor is 'ok' when I would really like to return {error,error_but_please_dont_fall_over_mr_supervisor}.
I strongly suspect in this scenario that I'm supposed to handle all the business stuff like retrying failed connections within 'my_func' rather than trying to get the process to stop and then having the supervisor restart it in order to try it again.
Question: what is the 'Erlang way' in this scenario ?

I'm new to erlang too.. but how about something like this?
The code is long just because of the comments. My solution (I hope I've understood correctly your question) will receive the maximum number of attempts and then do a tail-recursive call, that will stop by pattern-matching the max number of attempts with the next one. Uses timer:sleep() to pause to simplify things.
%% #doc Instead of having my_func/0, you have
%% my_func/1, so we can "inject" the max number of
%% attempts. This one will call your tail-recursive
%% one
my_func(MaxAttempts) ->
my_func(MaxAttempts, 0).
%% #doc This one will match when the maximum number
%% of attempts have been reached, terminates the
%% tail recursion.
my_func(MaxAttempts, MaxAttempts) ->
{error, too_many_retries};
%% #doc Here's where we do the work, by having
%% an accumulator that is incremented with each
%% failed attempt.
my_func(MaxAttempts, Counter) ->
io:format("Attempt #~B~n", [Counter]),
% Simulating the error here.
Result = {error,{conn_failed,{error,nxdomain}}},
case Result of
{ok, "200", _, Xml} -> ok;
{error,{conn_failed,{error,nxdomain}}} ->
% Wait, then tail-recursive call.
timer:sleep(1000),
my_func(MaxAttempts, Counter + 1)
end.
EDIT: If this code is in a process which is supervised, I think it's better to have a simple_one_for_one, where you can add dinamically whatever workers you need, this is to avoid delaying initialization due to timeouts (in a one_for_one the workers are started in order, and having sleep's at that point will stop the other processes from initializing).
EDIT2: Added an example shell execution:
1> c(my_func).
my_func.erl:26: Warning: variable 'Xml' is unused
{ok,my_func}
2> my_func:my_func(5).
Attempt #0
Attempt #1
Attempt #2
Attempt #3
Attempt #4
{error,too_many_retries}
With 1s delays between each printed message.

How to save state in an Erlang process?

I am learning Erlang and trying to figure out how I can, and should, save state inside a process.
For example, I am trying to write a program that given a list of numbers in a file, tells me whether a number appears in that file. My approach is to uses two processes
cache which reads the content of the file into a set, then waits for numbers to check, and then replies whether they appear in the set.
is_member_loop(Data_file) ->
Numbers = read_numbers(Data_file),
receive
{From, Number} ->
From ! {self(), lists:member(Number, Numbers)},
is_member_loop(Data_file)
end.
client which sends numbers to cache and waits for the true or false response.
check_number(Number) ->
NumbersPid ! {self(), Number},
receive
{NumbersPid, Is_member} ->
Is_member
end.
This approach is obviously naive since the file is read for every request. However, I am quite new at Erlang and it is unclear to me what would be the preferred way of keeping state between different requests.
Should I be using the process dictionary? Is there a different mechanism I am not aware of for that sort of process state?
Update
The most obvious solution, as suggested by user601836, is to pass the set of numbers as a param to is_member_loop instead of the filename. It seems to be a common idiom in Erlang and there is a good example in the fantastic online book Learn you some Erlang.
I think, however, that the question still holds for more complex state that I'd want to preserve in my process.

Simple solution, you can pass to your function is_member_loop(Data_file) the list of numbers rather then the file name.
The best solution when you deal with a state consists in using a gen_server. To learn more you should take a look at records and gen_server behaviour (this may also be useful).
In practice:
1) start with a module (yourmodule.erl) based on gen_server behaviour
2) read your file in the init function of the gen_server and pass it as state field:
init([]) ->
Numbers = read_numbers(Data_file),
{ok, #state{numbers=Numbers}}.
3) write a function which will be used to trigger a call to the gen_server
check_number(Number) ->
gen_server:call(?MODULE, {check_number, Number}).
4) write the code in order to handle messages triggered from your function
handle_call({check_number, Number}, _From, #state{numbers=Numbers} = State) ->
Reply = lists:member(Number, Numbers)},
{reply, Reply, State};
handle_call(_Request, _From, State) ->
Reply = ok,
{reply, Reply, State}.
5) export from yourmodule.erl function check_number
-export([check_number/1]).
Two things to be explained about point 4:
a) we extract values inside the record State using pattern matching
b) As you may see I left the generic handle call, otherwise your gen_server will fail due to wrong pattern matching whenever a message different from {check_number, Number} is received
Note: if you are new to erlang, don't use process dictionary

Not sure how idiomatic this is, since I'm not exactly an Erlang pro yet, but I'd handle this by using ETS. Basically,
read_numbers_to_ets(DataFile) ->
Table = ets:new(numbers, [ordered_set]),
insert_numbers(Table, DataFile),
Table.
insert_numbers(Table, DataFile) ->
case read_next_number(DataFile) of
eof -> ok;
Num -> ets:insert(numbers, {Num})
end.
you could then define your is_member as
is_member(TableId, Number) ->
case ets:match(TableId, {Number}) of
[] -> false; %% no match from ets
[[]] -> true %% ets found the number you're looking for in that table
end.
Instead of taking a Data_file, your is_member_loop would take the id of the table to do a lookup on.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart