How to effectively delete a lot of mnesia tables - erlang

I need to delete a lot of Mnesia tables on a node (about 20,000). Since the table names follow a pattern, I can collect and delete them this way:
Tables = [Table || Table <- mnesia:system_info(tables), re:run(atom_to_list(Table), "<pattern>") /= nomatch],
lists:foreach(
fun (Table) ->
mnesia:delete_table(Table)
end,
Tables).
However, deleting them one by one is very slow, and it takes a very long time to delete 20k tables.
Is there any way to do it more efficiently?

You can spawn one process per table so that the delete_table calls run concurrently:
lists:foreach(
fun (Table) ->
spawn(mnesia, delete_table, [Table])
end,
Tables).
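The spawn calls above return immediately, so the caller never learns when (or whether) each deletion finished. A minimal sketch of one way to wait for the results, using spawn_monitor/1, is shown below. The module name table_cleanup and the functions delete_matching/1 and run_concurrently/2 are hypothetical names chosen for illustration. Mnesia may still serialize schema changes internally, so this is not guaranteed to be faster than the sequential loop.

```erlang
-module(table_cleanup).
-export([delete_matching/1, run_concurrently/2]).

%% Hypothetical helper: delete every table whose name matches Pattern
%% and return the result of each mnesia:delete_table/1 call.
delete_matching(Pattern) ->
    Tables = [T || T <- mnesia:system_info(tables),
                   T =/= schema,
                   re:run(atom_to_list(T), Pattern) =/= nomatch],
    run_concurrently(fun mnesia:delete_table/1, Tables).

%% Run Fun on each item in its own monitored process and wait for all
%% of them. Results come back in the same order as Items.
run_concurrently(Fun, Items) ->
    Refs = [element(2, spawn_monitor(fun() -> exit({done, Fun(I)}) end))
            || I <- Items],
    [receive
         {'DOWN', Ref, process, _Pid, {done, Result}} -> Result;
         {'DOWN', Ref, process, _Pid, Reason} -> {error, Reason}
     end || Ref <- Refs].
```

Calling table_cleanup:delete_matching("<pattern>") would then return a list with one {atomic, ok} or {aborted, Reason} entry per matching table.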


Ecto multiple streams in 1 transaction

Background
PS: the following situation describes a hypothetical scenario, where I own a company that sells things to customers.
I have an Ecto query that is so big, that my machine cannot handle it. With billions of results returned, there is probably not enough RAM in the world that can handle it.
The solution here (or so my research indicates) is to use streams. Streams were made for potentially infinite sets of results, which would fit my use case.
https://hexdocs.pm/ecto/Ecto.Repo.html#c:stream/2
Problem
So let's imagine that I want to delete all users that bought a given item. Maybe that item was not really legal in their country, and now I, the poor guy in IT, have to fix things so the world doesn't come crashing down.
Naive way:
item_id = "123asdasd123"
purchase_ids =
Purchases
|> where([p], p.item_id == ^item_id)
|> select([p], p.id)
|> Repo.all()
Users
|> where([u], u.purchase_id in ^purchase_ids)
|> Repo.delete_all()
This is the naive way. I call it naive because of two issues:
We have so many purchases that the machine's memory will overflow (looking at the purchase_ids query).
purchase_ids will likely have more than 100K ids, so the second query (where we delete things) will fail as it hits Postgres's parameter limit of 32K: https://stackoverflow.com/a/42251312/1337392
What can I say, our product is highly addictive and very well priced!
Our customers simply can't get enough of it. Don't know why. Nope. No reason comes to mind. None at all.
With these problems in mind, I cannot help my customers and grow my [s]empire[/s], I mean, little home owned business.
I did find this possible solution:
Stream way:
item_id = "123asdasd123"
purchase_ids =
Purchases
|> where([p], p.item_id == ^item_id)
|> select([p], p.id)
stream = Repo.stream(purchase_ids)
Repo.transaction(fn ->
ids = Enum.to_list(stream)
Users
|> where([u], u.purchase_id in ^ids)
|> Repo.delete_all()
end)
Questions
However, I am not convinced this will work:
I am using Enum.to_list and saving everything into a variable, which places everything into memory again. So I am not gaining any advantage by using Repo.stream.
I still have too many ids for my Repo.delete_all to work without blowing up.
I guess the one advantage here is that this is now a transaction, so either everything goes or nothing goes.
So, the following questions arise:
How do I properly make use of streams in this scenario?
Can I delete items by streaming parameters (ids) or do I have to manually batch them?
Can I stream ids to Repo.delete_all ?
One cannot feed a stream directly to Repo.delete_all/1, but Stream.chunk_every/2 is your friend here. One can do something like the following (500 is the default value for :max_rows, hence we use it in chunk_every/2 as well):
Repo.transaction(fn ->
max_rows = 500
purchase_ids
|> Repo.stream(max_rows: max_rows)
|> Stream.chunk_every(max_rows)
|> Stream.each(fn ids ->
Users
|> where([u], u.purchase_id in ^ids)
|> Repo.delete_all()
end)
|> Stream.run()
end, timeout: :infinity)
I know this isn't an answer to your question about using streams, but in this scenario, streams might not be necessary depending on the amount of data you are trying to delete. You should be able to delete all of these matching users with a subquery in one query without passing any variables into memory aside from the item_id:
sq =
Purchases
|> where(item_id: ^item_id)
|> select([p], p.id)
Users
|> where([u], u.purchase_id in subquery(sq))
|> Repo.delete_all(timeout: :infinity)

How do I select a random element from an ets set in Erlang/Elixir?

I have a large number of processes that I need to keep track of in an ets set, and then randomly select single processes. So I created the set like this:
:ets.new(:pid_lookup, [:set, :protected, :named_table])
then for argument's sake let's just stick self() in it 1000 times:
Enum.map 1..1000, fn x -> :ets.insert(:pid_lookup, {x, self()}) end
Now I need to select one at random. I know I could just select a random one using :ets.lookup(:pid_lookup, :rand.uniform(1000)), but what if I don't know the size of the set (in the above case, 1000) in advance?
How do I find out the size of an ets set? And/or is there a better way to choose a random pid from an ets data structure?
If the keys are sequential numbers:
tab = :ets.new(:tab, [])
Enum.each(1..1000, & :ets.insert(tab, {&1, :value}))
size = :ets.info(tab, :size)
# size = 1000
value_picked_randomly = :ets.lookup(tab, Enum.random(1..size))
:ets.info(tab, :size) returns the size of the table, that is, the number of objects stored in it.
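If you don't know what the keys are:
first = :ets.first(tab)
:ets.lookup(tab, first)
# function_that_may_return_true/0 is a placeholder for your own random predicate.
# An anonymous function cannot call itself by name, so it is passed to itself.
walk = fn walk, key ->
  if function_that_may_return_true() do
    # Move on to the next key, or give up at the end of the table.
    case :ets.next(tab, key) do
      :"$end_of_table" -> throw(:reached_end_of_table)
      next_key -> walk.(walk, next_key)
    end
  else
    :ets.lookup(tab, key)
  end
end
walk.(walk, first)
walk iterates over the ETS table and returns the object stored at a randomly chosen key.
This is time-consuming, so it is not an ideal solution for tables with a large number of records.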
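As a variation on the walk above, the table size can be used to decide up front how many steps to take, so that every object is equally likely to be picked. The following is a minimal Erlang sketch under the assumption that the table is not modified while it is being walked. The module name ets_random and the function random_object/1 are hypothetical. It is still linear in the table size.

```erlang
-module(ets_random).
-export([random_object/1]).

%% Pick one object uniformly at random from an ETS table without
%% knowing its keys. Returns 'empty' for an empty table, otherwise the
%% result of ets:lookup/2 for the chosen key.
random_object(Tab) ->
    case ets:info(Tab, size) of
        0 ->
            empty;
        Size ->
            %% Take between 0 and Size - 1 steps from the first key.
            Steps = rand:uniform(Size) - 1,
            Key = nth_key(Tab, ets:first(Tab), Steps),
            ets:lookup(Tab, Key)
    end.

%% Follow ets:next/2 N times starting from Key.
nth_key(_Tab, Key, 0) -> Key;
nth_key(Tab, Key, N) -> nth_key(Tab, ets:next(Tab, Key), N - 1).
```

From Elixir the same function would be called as :ets_random.random_object(:pid_lookup).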
As I understood from the comments, this is an XY Problem.
What you essentially need is to keep track of a changing list and pick one of its elements at random. ETS in general, and :set tables in particular, are by no means intended to be queried for size. They serve different purposes.
Spawn an Agent within your supervision tree, holding the list of PIDs of already started servers and use Kernel.length/1 to query its size, or even use Enum.random/1 if the list is not really huge (the latter traverses the whole enumerable to get a random element.)

How to extract data from mnesia backup file

Problem statement
I have a Mnesia backup file and would like to extract values from it. There are 3 tables (to keep it simple): Employee, Skills, and Attendance. The Mnesia backup file contains all the data from these three tables.
Employee table is:
Empid (Key)
Name
SkillId
AttendanceId
Skill table is
SkillId (Key)
Skill Name
Attendance table is
Code (Key)
AttendanceId
Percentage
What I have tried
I have used
ets:foldl(Fetch,OutputFile,Table)
Fetch : a separate function that traverses the fetched records and brings them into the desired output format.
OutputFile : the file it writes to
Table : name of the table
Expecting
I am getting records with AttendanceId (as this is the key), whereas I want to get the Code only. It displays employee information and the attendance id.
Help me out.
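Backup and restore are described in the Mnesia User's Guide.
To read an existing backup without restoring it, use mnesia:traverse_backup/4 or mnesia:traverse_backup/6. The example below uses the 6-argument form and passes read_only as the target module so that no new backup is written.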
1> mnesia:backup(backup_file).
ok
2> Fun = fun(BackupItems, Acc) -> {[], []} end.
#Fun<erl_eval.12.90072148>
3> mnesia:traverse_backup(backup_file, mnesia_backup, [], read_only, Fun, []).
{ok,[]}
Now add something to the Fun to get what you want.
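As a rough illustration of "adding something to the Fun", the sketch below collects the keys of one table while leaving the backup untouched. It assumes that data items in the backup are tuples whose first element is the table name and whose second element is the key, and that the table is named attendance. The module name backup_reader and the functions collector/1 and keys_in_backup/2 are hypothetical.

```erlang
-module(backup_reader).
-export([collector/1, keys_in_backup/2]).

%% Build a fun for mnesia:traverse_backup/6. The fun is called once per
%% backup item and must return {ItemsToWrite, NewAcc}. Items that belong
%% to Tab contribute their key (element 2) to the accumulator.
collector(Tab) ->
    fun(Item, Acc) when element(1, Item) =:= Tab ->
            {[Item], [element(2, Item) | Acc]};
       (Item, Acc) ->
            {[Item], Acc}
    end.

%% Read File with the default backup module and return {ok, Keys}.
%% With read_only as the target module no new backup is written,
%% so the target argument ([]) is only a placeholder.
keys_in_backup(File, Tab) ->
    mnesia:traverse_backup(File, mnesia_backup, [], read_only,
                           collector(Tab), []).
```

For the question's tables, backup_reader:keys_in_backup(backup_file, attendance) would be expected to return {ok, Codes}, since Code is the key of that table.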

Getting lots of data from Mnesia - fastest way

I have a record:
-record(bigdata, {mykey,some1,some2}).
Is doing a
mnesia:match_object({bigdata, mykey, some1,'_'})
the fastest way fetching more than 5000 rows?
Clarification:
Creating "custom" keys is an option (so I can do a read), but is doing 5000 reads faster than match_object on one single key?
I'm curious as to the problem you are solving, how many rows are in the table, etc., without that information this might not be a relevant answer, but...
If you have a bag, then it might be better to use read/2 on the key and then traverse the list of records being returned. It would be best, if possible, to structure your data to avoid selects and match.
In general, select/2 is preferred to match_object as it tends to do a better job of avoiding full table scans. Also, dirty_select is going to be faster than select/2, assuming you do not need transactional support. And, if you can live with the constraints, Mnesia allows you to go against the underlying ets table directly, which is very fast, but look at the documentation, as it is appropriate only in very rarefied situations.
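A minimal sketch of the dirty_select variant for the bigdata record from the question follows. The module name bigdata_select and the function by_some1/1 are hypothetical. The match specification returns whole records ('$_') whose some1 field equals the given value.

```erlang
-module(bigdata_select).
-export([by_some1/1]).

-record(bigdata, {mykey, some1, some2}).

%% Return all bigdata records whose some1 field equals Value,
%% without transactional protection.
by_some1(Value) ->
    MatchHead = #bigdata{some1 = Value, _ = '_'},
    mnesia:dirty_select(bigdata, [{MatchHead, [], ['$_']}]).
```

Because some1 is not the key here, Mnesia still has to scan the table unless an index on that field exists.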
Mnesia is more of a key-value storage system, and it will traverse all its records to find a match.
To fetch quickly, you should design the storage structure to directly support the query: make some1 the key or an index, then fetch the records with read or index_read.
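A minimal sketch of the index approach, assuming the bigdata table from the question already exists. The module name bigdata_index and its two functions are hypothetical names for illustration.

```erlang
-module(bigdata_index).
-export([add_index/0, by_some1/1]).

-record(bigdata, {mykey, some1, some2}).

%% Add a secondary index on the some1 attribute of an existing table.
%% Returns {atomic, ok} on success.
add_index() ->
    mnesia:add_table_index(bigdata, some1).

%% Look up records through the index instead of scanning the table.
%% #bigdata.some1 is the position of the field in the record tuple.
by_some1(Value) ->
    mnesia:dirty_index_read(bigdata, Value, #bigdata.some1).
```

Inside a transaction, mnesia:index_read/3 takes the same arguments as the dirty variant used here.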
What counts as the fastest way to return more than 5000 rows depends on the problem in question. What is the database structure? What do we want? What is the record structure? After that, it boils down to how you write your read functions. If we are sure about the primary key, we use mnesia:read/1 or mnesia:read/2; if not, it is better and more elegant to use Query List Comprehensions (QLC). They are more flexible for searching nested records and for complex conditional queries. See the usage below:
-include_lib("stdlib/include/qlc.hrl").
-record(bigdata, {mykey,some1,some2}).
%% query list comprehensions
select(Q)->
%% to prevent against nested transactions
%% to ensure it also works whether table
%% is fragmented or not, we will use
%% mnesia:activity/4
case mnesia:is_transaction() of
false ->
F = fun(QH)-> qlc:e(QH) end,
mnesia:activity(transaction,F,[Q],mnesia_frag);
true -> qlc:e(Q)
end.
%% to read by a given field or even several
%% you use a list comprehension and pass the guards
%% to filter those records accordingly
read_by_field(some2,Value)->
QueryHandle = qlc:q([X || X <- mnesia:table(bigdata),
X#bigdata.some2 == Value]),
select(QueryHandle).
%% selecting by several conditions
read_by_several()->
%% you can pass as many guard expressions
QueryHandle = qlc:q([X || X <- mnesia:table(bigdata),
X#bigdata.some2 =< 300,
X#bigdata.some1 > 50
]),
select(QueryHandle).
%% It's possible to pass a 'fun' which will do the
%% record selection in the query list comprehension
auto_reader(ValidatorFun)->
QueryHandle = qlc:q([X || X <- mnesia:table(bigdata),
ValidatorFun(X) == true]),
select(QueryHandle).
read_using_auto()->
F = fun({bigdata,_SomeKey,_,_Some2}) -> true;
(_) -> false
end,
auto_reader(F).
So I think that to find the fastest way, we need more clarification and detail about the problem. Speed depends on many factors, my dear!

Create several Mnesia tables with the same columns

I want to create the following schema in Mnesia. Have three tables, called t1, t2 and t3, each of them storing elements of the following record:
-record(pe, {pid, event}).
I tried creating the tables with:
Attrs = record_info(fields, pe),
Tbls = [t1, t2, t3],
[mnesia:create_table(Tbl, [{attributes, Attrs}]) || Tbl <- Tbls],
and then write some content using the following line (P and E have values):
mnesia:write(t1, #pe{pid=P, event=E}, write)
but I got a bad type error. (Relevant commands were passed to transactions, so it's not a sync problem.)
All the textbook examples of Mnesia show how to create different tables for different records. Can someone please reply with an example for creating different tables for the same record?
Regarding your "DDT" for creating the tables, I don't see any mistake at first sight. Just remember that using tables whose names differ from the record name makes you lose the "simple" commands (like mnesia:write/1), because they use element(1, RecordTuple) to retrieve the table name.
When defining tables, you can use the option {record_name, RecordName} (in your case {record_name, pe}) to tell Mnesia that the first atom in the tuples stored in the table is not the table name but the atom you passed with record_name. For your table t1, this makes Mnesia expect 'pe' records when inserting or looking up records.
If you want to insert a record in all tables, you might use a script similar to the one used to create the tables (but in a function wrapper for the Mnesia transaction context):
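Applied to the tables from the question, the record_name option could be used roughly as follows. This is a sketch that assumes Mnesia is already started. The module name pe_tables and the functions create/0 and insert/3 are hypothetical.

```erlang
-module(pe_tables).
-export([create/0, insert/3]).

-record(pe, {pid, event}).

%% Create t1, t2 and t3, all storing #pe{} records. record_name tells
%% Mnesia that the stored tuples are tagged 'pe', not the table name.
create() ->
    Opts = [{record_name, pe}, {attributes, record_info(fields, pe)}],
    [{atomic, ok} = mnesia:create_table(Tbl, Opts) || Tbl <- [t1, t2, t3]],
    ok.

%% Write one #pe{} record into the given table inside a transaction.
%% The table must be named explicitly, so mnesia:write/3 is used.
insert(Tbl, Pid, Event) ->
    mnesia:transaction(
      fun() -> mnesia:write(Tbl, #pe{pid = Pid, event = Event}, write) end).
```

Reads need the explicit table name as well, for example mnesia:read(t1, Pid) inside a transaction.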
insert_record_in_all_tables(Pid, Event, Tables) ->
mnesia:transaction(fun() -> [mnesia:write(T, #pe{pid=Pid, event=Event}, write) || T <- Tables] end).
Hope this helps!
