Getting lots of data from Mnesia - fastest way - erlang

I have a record:
-record(bigdata, {mykey,some1,some2}).
Is doing a
mnesia:match_object({bigdata, mykey, some1, '_'})
the fastest way to fetch more than 5000 rows?
Clarification:
Creating "custom" keys is an option (so I can do a read), but is doing 5000 reads faster than a match_object on one single key?

I'm curious as to the problem you are solving, how many rows are in the table, etc.; without that information this might not be a relevant answer, but...
If you have a bag, then it might be better to use read/2 on the key and then traverse the list of records returned. It would be best, if possible, to structure your data to avoid selects and matches.
In general select/2 is preferred to match_object, as it tends to better avoid full table scans. Also, dirty_select is going to be faster than select/2, assuming you do not need transactional support. And, if you can live with the constraints, Mnesia allows you to go against the underlying ets table directly, which is very fast, but look at the documentation, as it is appropriate only in very rarefied situations.
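As a rough illustration, here is a minimal sketch of both ideas, assuming the -record(bigdata, ...) definition from the question is in scope, the table is a bag keyed on mykey, and the match specification below is only an example:

%% read every record sharing one key in a bag table, inside a transaction
read_by_key(Key) ->
    mnesia:activity(transaction, fun() -> mnesia:read(bigdata, Key) end).

%% dirty select with a match specification: no transaction overhead;
%% returns only the some1 field of rows whose key matches Key
dirty_some1_for_key(Key) ->
    MatchSpec = [{#bigdata{mykey = Key, some1 = '$1', some2 = '_'},
                  [],
                  ['$1']}],
    mnesia:dirty_select(bigdata, MatchSpec).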

Mnesia is essentially a key-value store, and matching on non-key fields will traverse all of its records.
To fetch quickly, you should design the storage structure to directly support the query: make some1 the key, or add an index on it, and then fetch with read or index_read.
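For instance, a minimal sketch of the index approach (assuming the bigdata table and record from the question already exist, and the secondary index is added on some1):

%% add a secondary index on some1 (a one-off schema change)
add_some1_index() ->
    mnesia:add_table_index(bigdata, some1).

%% fetch every record whose some1 field equals Value via the index
read_by_some1(Value) ->
    mnesia:activity(transaction,
                    fun() -> mnesia:index_read(bigdata, Value, #bigdata.some1) end).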

The statement "fastest way to return more than 5000 rows" depends on the problem in question. What is the database structure? What do we want? What is the record structure? Once those are answered, it boils down to how you write your read functions. If we are sure about the primary key, then we use mnesia:read/1 or mnesia:read/2; if not, it is better (and more beautiful) to use query list comprehensions (QLC). They are more flexible for searching nested records and for complex conditional queries. See the usage below:
-include_lib("stdlib/include/qlc.hrl").
-record(bigdata, {mykey, some1, some2}).

%% query list comprehensions
select(Q) ->
    %% to prevent nested transactions, and to ensure this also
    %% works whether the table is fragmented or not, we use
    %% mnesia:activity/4
    case mnesia:is_transaction() of
        false ->
            F = fun(QH) -> qlc:e(QH) end,
            mnesia:activity(transaction, F, [Q], mnesia_frag);
        true ->
            qlc:e(Q)
    end.

%% to read by a given field (or even several), you use a
%% list comprehension and pass the guards to filter the
%% records accordingly
read_by_field(some2, Value) ->
    QueryHandle = qlc:q([X || X <- mnesia:table(bigdata),
                              X#bigdata.some2 == Value]),
    select(QueryHandle).

%% selecting by several conditions
read_by_several() ->
    %% you can pass as many guard expressions as you need
    QueryHandle = qlc:q([X || X <- mnesia:table(bigdata),
                              X#bigdata.some2 =< 300,
                              X#bigdata.some1 > 50]),
    select(QueryHandle).

%% it is also possible to pass a fun which does the record
%% selection in the query list comprehension
auto_reader(ValidatorFun) ->
    QueryHandle = qlc:q([X || X <- mnesia:table(bigdata),
                              ValidatorFun(X) == true]),
    select(QueryHandle).

read_using_auto() ->
    F = fun({bigdata, _SomeKey, _, _Some2}) -> true;
           (_) -> false
        end,
    auto_reader(F).
So I think if you want the fastest way, we need more clarification and problem detail. Speed depends on many factors, my dear!


Ecto multiple streams in 1 transaction

Background
PS: the following situation describes a hypothetical scenario, where I own a company that sells things to customers.
I have an Ecto query that is so big, that my machine cannot handle it. With billions of results returned, there is probably not enough RAM in the world that can handle it.
The solution here (or so my research indicates) is to use streams. Streams were made for potentially infinite sets of results, which would fit my use case.
https://hexdocs.pm/ecto/Ecto.Repo.html#c:stream/2
Problem
So let's imagine that I want to delete all users that bought a given item. Maybe that item was not really legal in their country, and now I, the poor guy in IT, have to fix things so the world doesn't come crashing down.
Naive way:
item_id = "123asdasd123"

purchase_ids =
  Purchases
  |> where([p], p.item_id == ^item_id)
  |> select([p], p.id)
  |> Repo.all()

Users
|> where([u], u.purchase_id in ^purchase_ids)
|> Repo.delete_all()
This is the naive way. I call it naive because of 2 issues:
We have so many purchases that the machine's memory will overflow (looking at the purchase_ids query)
purchase_ids will likely have more than 100K ids, so the second query (where we delete things) will fail as it hits Postgres' parameter limit of 32K: https://stackoverflow.com/a/42251312/1337392
What can I say, our product is highly addictive and very well priced!
Our customers simply can't get enough of it. Don't know why. Nope. No reason comes to mind. None at all.
With these problems in mind, I cannot help my customers and grow my empire... I mean, my little home-owned business.
I did find this possible solution:
Stream way:
item_id = "123asdasd123"

purchase_ids =
  Purchases
  |> where([p], p.item_id == ^item_id)
  |> select([p], p.id)

stream = Repo.stream(purchase_ids)

Repo.transaction(fn ->
  ids = Enum.to_list(stream)

  Users
  |> where([u], u.purchase_id in ^ids)
  |> Repo.delete_all()
end)
Questions
However, I am not convinced this will work:
I am using Enum.to_list and saving everything into a variable, placing everything into memory again. So I am not gaining any advantage by using Repo.stream.
I still have too many ids for my Repo.delete_all to work without blowing up
I guess the one advantage here is that this is now a transaction, so either everything goes or nothing goes.
So, the following questions arise:
How do I properly make use of streams in this scenario?
Can I delete items by streaming parameters (ids) or do I have to manually batch them?
Can I stream ids to Repo.delete_all?
One cannot directly feed Repo.delete_all/1 with a stream, but Stream.chunk_every/2 is your friend here. One can do something like the below (500 is the default value for :max_rows, hence we use it in chunk_every/2 as well).
Repo.transaction(fn ->
  max_rows = 500

  purchase_ids
  |> Repo.stream(max_rows: max_rows)
  |> Stream.chunk_every(max_rows)
  |> Stream.each(fn ids ->
    Users
    |> where([u], u.purchase_id in ^ids)
    |> Repo.delete_all()
  end)
  |> Stream.run()
end, timeout: :infinity)
I know this isn't an answer to your question about using streams, but in this scenario, streams might not be necessary depending on the amount of data you are trying to delete. You should be able to delete all of these matching users with a subquery in one query without passing any variables into memory aside from the item_id:
sq =
  Purchases
  |> where(item_id: ^item_id)
  |> select([p], p.id)

Users
|> where([u], u.purchase_id in subquery(sq))
|> Repo.delete_all(timeout: :infinity)

How do I select a random element from an ets set in Erlang/Elixir?

I have a large number of processes that I need to keep track of in an ets set, and then randomly select single processes. So I created the set like this:
:ets.new(:pid_lookup, [:set, :protected, :named_table])
then for argument's sake let's just stick self() in it 1000 times:
Enum.map 1..1000, fn x -> :ets.insert(:pid_lookup, {x, self()}) end
Now I need to select one at random. I know I could just select a random one using :ets.lookup(:pid_lookup, :rand.uniform(1000)), but what if I don't know the size of the set (in the above case, 1000) in advance?
How do I find out the size of an ets set? And/or is there a better way to choose a random pid from an ets data structure?
If the keys are sequential numbers:
tab = :ets.new(:tab, [])
Enum.each(1..1000, &:ets.insert(tab, {&1, :value}))
size = :ets.info(tab, :size)
# size = 1000
value_picked_randomly = :ets.lookup(tab, Enum.random(1..size))
:ets.info(tab, :size) returns the size of the table, which is the number of records inserted into it.
If you don't know what the keys are:
first = :ets.first(tab)

# an anonymous function cannot call itself by name,
# so we pass the fun to itself in order to recurse
func = fn key, func ->
  if function_that_may_return_true() do
    case :ets.next(tab, key) do
      :"$end_of_table" -> throw(:reached_end_of_table)
      next_key -> func.(next_key, func)
    end
  else
    :ets.lookup(tab, key)
  end
end

func.(first, func)
func will iterate over the ETS table and return a pseudo-randomly chosen value.
This is time-consuming, so it is not an ideal solution for tables with a large number of records.
As I understood from the comments, this is an XY Problem.
What you essentially need is to track a changing list and pick one of its elements at random. ETS in general, and set tables in particular, are by no means intended to be queried for size; they serve different purposes.
Spawn an Agent within your supervision tree holding the list of PIDs of already started servers, and use Kernel.length/1 to query its size, or even use Enum.random/1 if the list is not really huge (the latter traverses the whole enumerable to get a random element).

Mnesia Errors case_clause in QLC query without a case clause

I have the following function for a hacky project:
% The Record variable is some known record with an associated table.
Query = qlc:q([Existing ||
    Existing <- mnesia:table(Table),
    ExistingFields = record_to_fields(Existing),
    RecordFields = record_to_fields(Record),
    ExistingFields == RecordFields
]).
The function record_to_fields/1 simply drops the record name and ID from the tuple so that I can compare the fields themselves. If anyone wants context, it's because I pre-generate a unique ID for a record before attempting to insert it into Mnesia, and I want to make sure that a record with identical fields (but different ID) does not exist.
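For reference, a minimal sketch of what such a record_to_fields/1 could look like (hypothetical; the actual implementation is not shown in the question):

%% drop the record tag and the id field, keep the remaining fields
record_to_fields(Record) when is_tuple(Record) ->
    [_Tag, _Id | Fields] = tuple_to_list(Record),
    Fields.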
This results in the following (redacted for clarity) stack trace:
{aborted, {{case_clause, {stuff}},
[{db, '-my_func/2-fun-1-',8, ...
Which points to the line where I declare Query; however, there is no case clause in sight. What is causing this error?
(Will answer myself, but I appreciate a comment that could explain how I could achieve what I want)
EDIT: this wouldn't be necessary if I could simply mark certain fields as unique, and Mnesia had a dedicated insert/1 or create/1 function.
For your example, I think your solution is clearer anyway (although it seems you can pull the record_to_fields(Record) portion outside the comprehension so it isn't getting calculated over and over).
Yes, list comprehensions can only have generators and assignments. But you can cheat a little by writing an assignment as a one-element generator. For instance, you can re-write your expression as this:
RecordFields = record_to_fields(Record),
Query = qlc:q([Existing ||
    Existing <- mnesia:table(Table),
    ExistingFields <- [record_to_fields(Existing)],
    ExistingFields == RecordFields
]).
As it turns out, the QLC DSL does not allow assignments, only generators and filters; as per the documentation (emphasis mine):
Syntactically QLCs have the same parts as ordinary list comprehensions:
[Expression || Qualifier1, Qualifier2, ...]
Expression (the template) is any Erlang expression. Qualifiers are either filters or generators. Filters are Erlang expressions returning boolean(). Generators have the form Pattern <- ListExpression, where ListExpression is an expression evaluating to a query handle or a list.
Which means we cannot have variable assignments within a QLC query.
Thus my only option, insofar as I know, is to simply write out the query as:
Query = qlc:q([Existing ||
    Existing <- mnesia:table(Table),
    record_to_fields(Existing) == record_to_fields(Record)
]).

Determine if record with field of certain value exists in list

I need to determine if a record with a given value exists in a list, what is the most efficient way to do this?
I think like this:
[ L || L = #record{state=determined} <- List ].
And the most efficient way is:
lists:any(fun(#record{state=determined}) -> true; (_) -> false end, List).
The first approach is applicable if your list contains only a few records with the determined field, and it returns all of them.
The second approach is the most efficient because we are using the standard library, and once the needed record is found, iteration over the list stops.

erlang - how can I match tuple contents with qlc and mnesia?

I have a mnesia table for this record.
-record(peer, {
    peer_key,    %% key is the tuple {FileId, PeerId}
    last_seen,
    last_event,
    uploaded = 0,
    downloaded = 0,
    left = 0,
    ip_port,
    key
}).
peer_key is the tuple {FileId, ClientId}; now I need to extract the ip_port field from all peers that have a specific FileId.
I came up with a workable solution, but I'm not sure if this is a good approach:
qlc:q([IpPort || #peer{peer_key={FileId,_}, ip_port=IpPort} <- mnesia:table(peer), FileId=:=RequiredFileId])
Thanks.
Using an ordered_set table type with a tuple primary key like {FileId, PeerId}, and then partially binding a prefix of the tuple like {RequiredFileId, _}, will be very efficient, as only the range of keys with that prefix will be examined rather than the full table. You can use qlc:info/1 to examine the query plan and ensure that any selects that occur are binding the key prefix.
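For example, a minimal sketch of inspecting the plan with qlc:info/1 (assuming the peer record from the question is in scope and the stdlib qlc.hrl header is included):

-include_lib("stdlib/include/qlc.hrl").

print_query_plan(RequiredFileId) ->
    Q = qlc:q([IpPort || #peer{peer_key = {FileId, _}, ip_port = IpPort}
                             <- mnesia:table(peer),
                         FileId =:= RequiredFileId]),
    io:format("~s~n", [qlc:info(Q)]).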
Your query time will grow linearly with the table size, as it requires scanning through all rows. So benchmark it with realistic table data to see if it really is workable.
If you need to speed it up, you should focus on being able to quickly find all peers that carry the file id. This could be done with a bag-type table with [fileid, peerid] as attributes. Given a file id, you would get all peer ids; with those you could construct your peer-table keys to look up (a rough sketch is given below).
Of course, you would also need to maintain that bag-type table inside every transaction that changes the peer table.
Another option would be to repeat the fileid in its own field and add an mnesia index on that column. I am just not that into mnesia's own secondary indexes.
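A rough sketch of that bag-type lookup table (the file_peers name and record below are hypothetical, introduced only for illustration):

-record(file_peers, {file_id, peer_id}).

%% one-off: create a bag table mapping file id -> peer ids
create_file_peers_table() ->
    mnesia:create_table(file_peers,
                        [{type, bag},
                         {attributes, record_info(fields, file_peers)}]).

%% given a file id, collect the peer ids stored for it
peer_ids_for_file(FileId) ->
    F = fun() ->
            [P#file_peers.peer_id || P <- mnesia:read(file_peers, FileId)]
        end,
    mnesia:activity(transaction, F).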
