Mnesia: time and space efficiency of read, match_object, select, and qlc queries - erlang

Mnesia has four methods of reading from the database: read, match_object, select, and qlc (besides their dirty counterparts, of course). Each of them is more expressive than the previous ones.
Which of them use indices?
Given a query expressed in one of these methods, will the same query expressed in a more expressive method be less efficient in time or memory usage? By how much?
UPD.
As I GIVE CRAP ANSWERS mentioned, read is just a key-value lookup, but after a while of exploration I also found the functions index_read and index_match_object, which work in the same manner but use indices instead of the primary key.
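A minimal sketch of the difference, assuming a hypothetical user table keyed on name, with a secondary index added via mnesia:add_table_index(user, age):

```erlang
-record(user, {name, age}).

%% Primary-key lookup: read/1 goes straight to the keypos (name here).
by_name(Name) ->
    mnesia:transaction(fun() -> mnesia:read({user, Name}) end).

%% Secondary-index lookup: index_read/3 uses the index on the age field
%% instead of scanning the whole table.
by_age(Age) ->
    mnesia:transaction(fun() -> mnesia:index_read(user, Age, #user.age) end).
```

Both are direct lookups rather than scans; the second simply goes through the secondary index structure instead of the primary key.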

One at a time, though from memory:
read always uses a key lookup on the keypos. It is basically a key-value lookup.
match_object and select will optimize the query, if they can, on the keypos key. That is, they only use that key for optimization; they never utilize additional index types.
qlc has a query-compiler and will attempt to use additional indexes if possible, but it all depends on the query planner and if it triggers. erl -man qlc has the details and you can ask it to output its plan.
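To make the comparison concrete, here is the same kind of age lookup written against each API (table and field names are hypothetical; the calls belong inside a transaction fun):

```erlang
-include_lib("stdlib/include/qlc.hrl").
-record(person, {id, name, age}).

lookups() ->
    %% read: primary-key lookup only.
    R1 = mnesia:read({person, 1}),
    %% match_object: pattern match; optimized only when the keypos is bound.
    R2 = mnesia:match_object(#person{age = 30, _ = '_'}),
    %% select: match specification with guards.
    R3 = mnesia:select(person,
            [{#person{age = '$1', _ = '_'}, [{'>', '$1', 30}], ['$_']}]),
    %% qlc: query list comprehension; its planner may use secondary indexes.
    R4 = qlc:e(qlc:q([P || P <- mnesia:table(person), P#person.age > 30])),
    {R1, R2, R3, R4}.
```

qlc:info/1 on the query handle will print the plan the compiler settled on, which is the easiest way to check whether an index was actually picked up.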
Mnesia tables are basically key-value maps from terms to terms. Usually, this means that if the key part is something the query can latch onto and use, then it is used. Otherwise, you will be looking at a full-table scan. This may be expensive, but do note that the scan is in-memory and thus usually fairly fast.
Also, take note of the table type: set is a hash-table and can't utilize a partial key match. ordered_set is a tree and can do a partial match:
Example - if we have a key {Id, Timestamp}, querying on {Id, '_'} as the key is reasonably fast on an ordered_set because the lexicographic ordering means we can utilize the tree for a fast walk. This is the equivalent of specifying a composite INDEX/PRIMARY KEY in a traditional RDBMS.
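A sketch of that pattern, assuming a hypothetical event table of type ordered_set whose key is {Id, Timestamp}:

```erlang
-record(event, {key, payload}).  %% key = {Id, Timestamp}

%% On an ordered_set the {Id, '_'} pattern can walk just the matching part
%% of the tree; on a set (hash table) the same call becomes a full scan.
events_for(Id) ->
    mnesia:transaction(fun() ->
        mnesia:match_object(#event{key = {Id, '_'}, _ = '_'})
    end).
```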
If you can arrange the data such that you can do simple queries without additional indexes, then that representation is preferred. Also note that additional indexes are implemented as bags, so if an index value has many matches, the index is very inefficient. In other words, you should probably not index on a tuple position with few distinct values. It is better to index on things with many (mostly) distinct values, like an e-mail address in a user column, for instance.

Related

How to interface Z3 with a database backend?

I would like to use Z3 to reason about a configuration problem using some data from a relational database containing physical properties of materials.
As suggested in this post, I could use an outer loop around the solver. But this works only for sorts with finite domains: I don't see how it would work on infinite domains.
I could represent the whole data tables by Z3 functions from primary keys to attributes, using the if-then-else construct, but the reasoning might use only a few rows in the table: it does not seem efficient.
Another approach would be to create a custom background theory solver that would determine the truth values of atoms by database lookup: has that been done before?
Do you see other ways to do it?

What kind of sort does Cocoa use?

I'm always amazed by the abstractions our modern languages or frameworks create, even the ones considered relatively low level such as Objective-C/Cocoa.
Here I'm interested in the type of sort executed when one calls sortedArrayUsingComparator: on an NSArray. Is it dynamic, i.e. does it analyze the current constraints of the environment (particularly free memory) and the attributes of the array (length, unique values) and pick the best sort accordingly, or does it always use the same one, like Quicksort or Merge Sort?
It should be possible to test this by analyzing the running time of the method relative to N; I'm just wondering if anyone has already bothered to.
This has been described at a developers conference. The sort doesn't need any extra memory. It checks whether there is a sorted range of numbers at the start or the end (or both) and takes advantage of that. You can ask yourself how you would sort a 100,000-entry array if the first 50,000 entries are sorted in descending order.
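The idea of exploiting an existing run can be sketched as follows (a simplification for illustration, not Cocoa's actual implementation): peel off the leading descending run, reverse it into ascending order, sort the remainder, and merge the two sorted lists.

```erlang
%% sort_with_run([5,4,3,1,2]) -> [1,2,3,4,5]
sort_with_run([X, Y | _] = L) when X >= Y ->
    {Run, Rest} = desc_run(L, []),
    lists:merge(Run, lists:sort(Rest));
sort_with_run(L) ->
    lists:sort(L).

%% Accumulating the descending run in reverse yields it in ascending order.
desc_run([X, Y | T], Acc) when X >= Y -> desc_run([Y | T], [X | Acc]);
desc_run([X | T], Acc) -> {[X | Acc], T};
desc_run([], Acc) -> {Acc, []}.
```

For the 100,000-entry example above, the 50,000 sorted entries cost only a reversal and a merge instead of being re-sorted from scratch.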

Neo4j 2.0: Indexing array-valued properties with schema indexing

I have nodes, each with multiple source IDs stored in one array-valued property called "sourceIds", because there could be multiple resources a node could be derived from (I'm assembling multiple databases into one Neo4j model).
I want to be able to look up nodes by any of their source IDs. With legacy indexing this was no problem, I would just add a node to the index associated with each element of the sourceIds property array.
Now I wanted to switch to indexing with labels and I'm wondering how that kind of index works here. I can do
CREATE INDEX ON :<label>(sourceIds)
but what does that actually do? I hoped it would just create index entries for each array element, but that doesn't seem to be the case. With
MATCH n:<label> WHERE "testid" in n.sourceIds RETURN n
the query takes between 300ms and 500ms which is too long for an index lookup (other schema indexes work three to five times faster). With
MATCH n:<label> WHERE n.sourceIds="testid" RETURN n
I don't get a result. That's clear because it's an array property, but I gave it a try since it would make sense if array properties were broken down into their elements for indexing purposes.
So, is there a way to handle array properties with schema indexing, are there plans for it, or will I just have to stick to legacy indexing here? My problem with the legacy Lucene index was that I hit the maximum number of boolean clauses (1024). Another question, then: can I raise this number? Lucene allows it, but can I do this with the Lucene index used by Neo4j?
Thanks and best regards!
Edit: A bit more elaboration on why I hit the boolean clauses max limit: I need to export specific parts of the database into custom file formats for text processing pipelines. These pipelines use components I cannot (be it for the sake of accessibility or time) change to query Neo4j directly, so I'd rather stay with the defined required file format(s). I do the export via the pattern "give me all IDs in the DB; now, for batches of IDs, query the desired information (e.g. specific paths) from Neo4j and store the results to file". Why I use batches at all? Well, if I don't, things are slowed down significantly via the connection overhead. Thus, large batches are a kind of optimization here.
Schema indexes can only do exact matches right now. Your "testid" in n.sourceIds does not use the index (as shown by your query times). I think there are plans to make this behave better, but I'm waiting for them just as eagerly as you are.
I've actually hit a lower maximum in the Lucene query: 512. If there is a way to increase it, I'd love to hear of it. The way I got around it is just doing more than one query in the rare cases that actually go over 512 IDs. What query are you doing where you need more?

Mnesia: how to use indexed operations correctly when selecting rows based on criteria involving multiple, indexed columns

Problem:
How to select records efficiently from a table where the select is based on criteria involving two indexed columns.
Example
I have a record,
-record(rec, {key, value, type, last_update, other_stuff}).
I have indexes on key (default), type and last_update columns
type is typically an atom or string
last_update is an integer (unix-style milliseconds since 1970)
I want, for example, all records whose type = Type and which have been updated since a specific timestamp.
I do the following (wrapped in a non-dirty transaction)
lookup_by_type(Type, Since) ->
    MatchHead = #rec{type = Type, last_update = '$1', _ = '_'},
    Guard = {'>', '$1', Since},
    Result = '$_',
    case mnesia:select(rec, [{MatchHead, [Guard], [Result]}]) of
        [] -> {error, not_found};
        Rslts -> {ok, Rslts}
    end.
Question
Is the lookup_by_type function even using the underlying indexes?
Is there a better way to utilize indexes in this case?
Is there an entirely different approach I should be taking?
Thank you all
One way, which will probably help you, is to look at QLC queries. These are more SQL-like/declarative, and IIRC they will utilize indexes by themselves if possible.
But the main problem is that indexes in mnesia are hashes and thus do not support range queries. Thus you can only efficiently index on the type field currently and not on the last_update field.
One way around that is to make the table ordered_set and then shove the last_update to be the primary key. The key parameter can then be indexed if you need fast access to it. One storage possibility is something like: {{last_update, key}, key, type, ...}. Thus you can quickly answer queries because last_update is orderable.
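The first workaround might be sketched like this (record and table names are hypothetical):

```erlang
-record(rec2, {ts_key, key, type, value}).  %% ts_key = {LastUpdate, Key}

create_table() ->
    mnesia:create_table(rec2,
        [{type, ordered_set},
         {attributes, record_info(fields, rec2)},
         {index, [key]}]).  %% secondary index for direct lookups by Key

%% The table is ordered on {LastUpdate, Key}, so a time-range select walks
%% the keys in order; with mnesia:select/4 it can be fetched in chunks.
updated_since(Since) ->
    mnesia:transaction(fun() ->
        MatchHead = #rec2{ts_key = {'$1', '_'}, _ = '_'},
        mnesia:select(rec2, [{MatchHead, [{'>', '$1', Since}], ['$_']}])
    end).
```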
Another way around it is to store last_update separately. Keep a table {last_update, key} of type ordered_set and use it to limit the amount of data to scan on the larger table in a query.
Remember that mnesia is best used as a small in-memory database. Thus scans are not necessarily a problem due to them being in-memory and thus pretty fast. Its main power though is the ability to do key/value lookups in a dirty way on data for quick query.

Using multiple key value stores

I am using Ruby on Rails and have a situation where I am wondering whether some sort of key-value store would be more appropriate than MySQL. I have users that have_many lists, and each list has_many words. Some lists have hundreds of words, and I want users to be able to copy a list. This is a heavy MySQL task because it has to create those hundreds of word objects at one time.
As an alternative, I am considering some sort of key-value store where the key would just be the word. A list of words could be stored in a text field in MySQL, or each list could be a new key-value db? It seems like it would be faster to copy a key-value db this way rather than go through the database. It also seems like this might be faster in general. Thoughts?
The general way to solve this with a relational database would be to have a list table, a word table, and a join table relating the two. You are correct that there would be some overhead, but don't overestimate it: because the table structure is defined, there is very little actual storage overhead per record, and records can be inserted very quickly.
If you want very fast copies, you could allow lists to be copied-on-write. Meaning a single list could be referred to by multiple users, or multiple times by the same user. You only actually duplicate the list when the user tries to add, remove, or change an entry. Of course, this is premature optimization, start simple and only add complications like this if you find they are necessary.
You could use a key-value store as you suggest. I would avoid trying to build one on top of a MySQL text field unless you have a very good reason; it will make any sort of searching by key very slow, as it would require string searching. A key-value data store like CouchDB or Tokyo Cabinet could do this very well, but it would most likely take up more space (as each record has to have its own structure defined, and each word has to be recorded separately in each list). The only dimension of performance I would expect to be better is massively scalable reads and writes, but that's only relevant for the largest of systems.
I would use MySQL naively, and only make changes such as this if you need the performance and can prove that this method will actually be faster.
