erlang - how can I match tuple contents with qlc and mnesia? - erlang

I have a mnesia table for this record.
-record(peer, {
peer_key, %% key is the tuple {FileId, PeerId}
last_seen,
last_event,
uploaded = 0,
downloaded = 0,
left = 0,
ip_port,
key
}).
Peer_key is a tuple {FileId, ClientId}, now I need to extract the ip_port field from all peers that have a specific FileId.
I came up with a workable solution, but I'm not sure if this is a good approach:
qlc:q([IpPort || #peer{peer_key={FileId,_}, ip_port=IpPort} <- mnesia:table(peer), FileId=:=RequiredFileId])
Thanks.

Using on ordered_set table type with a tuple primary key like { FileId, PeerId } and then partially binding a prefix of the tuple like { RequiredFileId, _ } will be very efficient as only the range of keys with that prefix will be examined, not a full table scan. You can use qlc:info/1 to examine the query plan and ensure that any selects that are occurring are binding the key prefix.

Your query time will grow linearly with the table size, as it requires scanning through all rows. So benchmark it with realistic table data to see if it really is workable.
If you need to speed it up you should focus on being able to quickly find all peers that carry the file id. This could be done with a table of bag-type with [fileid, peerid] as attributes. Given a file-id you would get all peers ids. With that you could construct your peer table keys to look up.
Of course, you would also need to maintain that bag-type table inside every transaction that change the peer-table.
Another option would be to repeat fileid and add a mnesia index on that column. I am just not that into mnesia's own secondary indexes.

Related

How do I select a random element from an ets set in Erlang/Elixir?

I have a large number of processes that I need to keep track of in an ets set, and then randomly select single processes. So I created the set like this:
:ets.new(:pid_lookup, [:set, :protected, :named_table])
then for argument's sake let's just stick self() in it 1000 times:
Enum.map 1..1000, fn x -> :ets.insert(:pid_lookup, {x, self()}) end
Now I need to select one at random. I know I could just select a random one using :ets.lookup(:pid_lookup, :rand.uniform(1000)), but what if I don't know the size of the set (in the above case, 1000) in advance?
How do I find out the size of an ets set? And/or is there a better way to choose a random pid from an ets data structure?
If keys are sequential number
tab = :ets.new(:tab, [])
Enum.each(1..1000, & :ets.insert(tab, {&1, :value}))
size = :ets.info(tab, :size)
# size = 1000
value_picked_randomly = :ets.lookup(tab, Enum.random(1..1000))
:ets.info(tab, :size) returns a size of a table; which is a number of records inserted on given table.
If you don't know that the keys are
first = :ets.first(tab)
:ets.lookup(tab, first)
func = fn key->
if function_that_may_return_true() do
key = case :ets.next(tab, key) do
:'$end_of_table' -> throw :reached_end_of_table
key -> func.(key)
end
else
:ets.lookup(tab, key)
end
end
func.()
func will iterate over the ets table and returns a random value.
This will be time consuming, so it will not be an ideal solution for tables with large number of records.
As I understood from the comments, this is an XY Problem.
What you essentially need is to track down the changing list and pick up one of its elements randomly. ETS in general and :ets.set in particular are by no mean intended to be queried for size. They serve different purposes.
Spawn an Agent within your supervision tree, holding the list of PIDs of already started servers and use Kernel.length/1 to query its size, or even use Enum.random/1 if the list is not really huge (the latter traverses the whole enumerable to get a random element.)

How do I remove rows of an RDD whose key is not in another RDD?

Let's say I have a PairRDD, students (id, name). I would like to only keep rows where id is in another RDD, activeStudents (id).
The solution I have is to create a PairDD from activeStudents, (id, id), and the do a join with students.
Is there a more elegant way of doing this?
Thats a pretty good solution to start with. If active students is small enough you could collect the ids as a map and then filter with the id presence (this avoids having to a do a shuffle).
Much like you thought, you can do an outer join if both RDDs contain keys and values.
val students: RDD[(Long, String)]
val activeStudents: RDD[Long]
val activeMap: RDD[(Long, Unit)] = activeStudents.map(_ -> ())
val activeWithName: RDD[(Long, String)] =
students.leftOuterJoin(activeMap).flatMapValues {
case (name, Some(())) => Some(name)
case (name, None) => None
}
If you don't have to join those two data sets then you should definitely avoid it.
I had a similar problem recently and I successfully solved it using a broadcasted Set, which I used in UDF to check whether each RDD row (rather value from one of its columns) is in that Set. That UDF is than used as the basis for the filter transformation.
More here: whats-the-most-efficient-way-to-filter-a-dataframe.
Hope this helps. Ask if it's not clear.

Getting lots of data from Mnesia - fastest way

I have a record:
-record(bigdata, {mykey,some1,some2}).
Is doing a
mnesia:match_object({bigdata, mykey, some1,'_'})
the fastest way fetching more than 5000 rows?
Clarification:
Creating "custom" keys is an option (so I can do a read) but is doing 5000 reads fastest than match_object on one single key?
I'm curious as to the problem you are solving, how many rows are in the table, etc., without that information this might not be a relevant answer, but...
If you have a bag, then it might be better to use read/2 on the key and then traverse the list of records being returned. It would be best, if possible, to structure your data to avoid selects and match.
In general select/2 is preferred to match_object as it tends to better avoid full table scans. Also, dirty_select is going to be faster then select/2 assuming you do not need transactional support. And, if you can live with the constraints, Mensa allows you to go against the underlying ets table directly which is very fast, but look at the documentation as it is appropriate only in very rarified situations.
Mnesia is more a key-value storage system, and it will traverse all its records for getting match.
To fetch in a fast way, you should design the storage structure to directly support the query. To Make some1 as key or index. Then fetch them by read or index_read.
The statement Fastest Way to return more than 5000 rows depends on the problem in question. What is the database structure ? What do we want ? what is the record structure ? After those, then, it boils down to how you write your read functions. If we are sure about the primary key, then we use mnesia:read/1 or mnesia:read/2 if not, its better and more beautiful to use Query List comprehensions. Its more flexible to search nested records and with complex conditional queries. see usage below:
-include_lib("stdlib/include/qlc.hrl").
-record(bigdata, {mykey,some1,some2}).
%% query list comprehenshions
select(Q)->
%% to prevent against nested transactions
%% to ensure it also works whether table
%% is fragmented or not, we will use
%% mnesia:activity/4
case mnesia:is_transaction() of
false ->
F = fun(QH)-> qlc:e(QH) end,
mnesia:activity(transaction,F,[Q],mnesia_frag);
true -> qlc:e(Q)
end.
%% to read by a given field or even several
%% you use a list comprehension and pass the guards
%% to filter those records accordingly
read_by_field(some2,Value)->
QueryHandle = qlc:q([X || X <- mnesia:table(bigdata),
X#bigdata.some2 == Value]),
select(QueryHandle).
%% selecting by several conditions
read_by_several()->
%% you can pass as many guard expressions
QueryHandle = qlc:q([X || X <- mnesia:table(bigdata),
X#bigdata.some2 =< 300,
X#bigdata.some1 > 50
]),
select(QueryHandle).
%% Its possible to pass a 'fun' which will do the
%% record selection in the query list comprehension
auto_reader(ValidatorFun)->
QueryHandle = qlc:q([X || X <- mnesia:table(bigdata),
ValidatorFun(X) == true]),
select(QueryHandle).
read_using_auto()->
F = fun({bigdata,SomeKey,_,Some2}) -> true;
(_) -> false
end,
auto_reader(F).
So i think if you want fastest way, we need more clarification and problem detail. Speed depends on many factors my dear !

Lua table C api

I know of:
http://lua-users.org/wiki/SimpleLuaApiExample
It shows me how to build up a table (key, value) pair entry by entry.
Suppose instead, I want to build a gigantic table (say something a 1000 entry table, where both key & value are strings), is there a fast way to do this in lua (rather than 4 func calls per entry:
push
key
value
rawset
What you have written is the fast way to solve this problem. Lua tables are brilliantly engineered, and fast enough that there is no need for some kind of bogus "hint" to say "I expect this table to grow to contain 1000 elements."
For string keys, you can use lua_setfield.
Unfortunately, for associative tables (string keys, non-consecutive-integer keys), no, there is not.
For array-type tables (where the regular 1...N integer indexing is being used), there are some performance-optimized functions, lua_rawgeti and lua_rawseti: http://www.lua.org/pil/27.1.html
You can use createtable to create a table that already has the required number of slots. However, after that, there is no way to do it faster other than
for(int i = 0; i < 1000; i++) {
lua_push... // key
lua_push... // value
lua_rawset(L, tableindex);
}

Combining table, web service data in Grails

I'm trying to figure out the best approach to display combined tables based on matching logic and input search criteria.
Here is the situation:
We have a table of customers stored locally. The fields of interest are ssn, first name, last name and date of birth.
We also have a web service which provides the same information. Some of the customers from the web service are the same as the local file, some different.
SSN is not required in either.
I need to combine this data to be viewed on a Grails display.
The criteria for combination are 1) match on SSN. 2) For any remaining records, exact match on first name, last name and date of birth.
There's no need at this point for soundex or approximate logic.
It looks like what I should do is extract all the records from both inputs into a single collection, somehow making it a set on SSN. Then remove the blank ssn.
This will handle the SSN matching (once I figure out how to make that a set).
Then, I need to go back to the original two input sources (cached in a collection to prevent a re-read) and remove any records that exist in the SSN set derived previously.
Then, create another set based on first name, last name and date of birth - again if I can figure out how to make a set.
Then combine the two derived collections into a single collection. The collection should be sorted for display purposes.
Does this make sense? I think the search criteria will limit the number of record pulled in so I can do this in memory.
Essentially, I'm looking for some ideas on how the Grails code would look for achieving the above logic (assuming this is a good approach). The local customer table is a domain object, while what I'm getting from the WS is an array list of objects.
Also, I'm not entirely clear on how the maxresults, firstResult, and order used for the display would be affected. I think I need to read in all the records which match the search criteria first, do the combining, and display from the derived collection.
The traditional Java way of doing this would be to copy both the local and remote objects into TreeSet containers with a custom comparator, first for SSN, second for name/birthdate.
This might look something like:
def localCustomers = Customer.list()
def remoteCustomers = RemoteService.get()
TreeSet ssnFilter = new TreeSet(new ClosureComparator({c1, c2 -> c1.ssn <=> c2.ssn}))
ssnFilter.addAll(localCustomers)
ssnFilter.addAll(remoteCustomers)
TreeSet nameDobFilter = new TreeSet(new ClosureComparator({c1, c2 -> c1.firstName + c1.lastName + c1.dob <=> c2.firstName + c2.lastName + c2.dob}))
nameDobFilter.addAll(ssnFilter)
def filteredCustomers = nameDobFilter as List
At this point, filteredCustomers has all the records, except those that are duplicates by your two criteria.
Another approach is to filter the lists by sorting and doing a foldr operation, combining adjacent elements if they match. This way, you have an opportunity to combine the data from both sources.
For example:
def combineByNameAndDob(customers) {
customers.sort() {
c1, c2 -> (c1.firstName + c1.lastName + c1.dob) <=>
(c2.firstName + c2.lastName + c2.dob)
}.inject([]) { cs, c ->
if (cs && c.equalsByNameAndDob(cs[-1])) {
cs[-1].combine(c) //combine the attributes of both records
cs
} else {
cs << c
}
}
}

Resources