Optimize finding matches in array of strings using array of regex - ruby-on-rails

I have two arrays: one populated with values pulled from the database, and another that is a user-uploaded array of regex patterns to match:
db_array = ["ABCDEFG", "HIJKLMN", "OPQRSTU", "VWXYZ", etc...]
matching_array = ["(OPQRST)(3)?(.{1,})?", "(WXY)(1)?(.{1,})?", "(HIJKLMN)(3)?(.{1,})?", etc...]
Is there a better way to find any/all the matches in the db_array using the matching_array rather than iterating through the matching array and then the db_array and pulling any matches?
matching_array.map{|regex| db_array.select{|a| /#{regex}/.match(a)}}
The issue is that both of these arrays can contain 3,000+ records, so this takes a substantial amount of time, especially since the matching_array is rebuilt multiple times using different pattern criteria. I'm also trying to limit the number of DB calls I make, since I don't want to constantly be hitting the server.

If you can be sure that the regular expressions don't use PCRE extensions (look-behind, etc.), then you can use a much faster regex library.
Google maintains one called re2 (https://github.com/google/re2).
Ruby bindings are available (https://github.com/stefanor/ruby-re2, among others).
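Whichever engine you end up with, it also helps to avoid recompiling each pattern for every string. A minimal sketch using only the Ruby standard library (not the re2 API; Regexp#match? needs Ruby 2.4+):

# Compile each uploaded pattern once instead of rebuilding /#{regex}/ for every element of db_array.
compiled = matching_array.map { |pattern| Regexp.new(pattern) }
matches = db_array.select { |value| compiled.any? { |re| re.match?(value) } }

# Or collapse all patterns into a single alternation so each string is scanned only once.
combined = Regexp.union(compiled)
matches = db_array.grep(combined)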

Related

Faceting in Solr when index contains millions of documents

I'm working on a project that uses a Solr index with a few million documents, and we've recently hit a memory problem. Faceting has become unusable on a couple of our fields because Solr runs out of heap memory, due to the number of documents containing those fields.
What options do we have besides increasing the memory? We see adding memory only as a temporary solution, because the number of documents grows by a few hundred thousand per day.
I'm currently looking into SolrCloud, but I'm not sure it's the right solution.
Any suggestions?
Thanks!
FacetFields: Allow for facet counts based on distinct values in a field. There are two methods for FacetFields, one that performs well with few distinct values in a field, and the other for when a field contains many distinct values (generally, thousands and up – you should test what works best for you).
The first method, facet.method=enum, works by issuing a FacetQuery for every unique value in the field. As mentioned, this is an excellent method when the number of distinct values in a field is small. It requires excessive memory though, and breaks down when the number of distinct values gets large. When using this method, be careful to ensure that your FilterCache is large enough to contain at least one filter for every distinct value you plan on faceting on.
The second method uses the Lucene FieldCache (future version of Solr will actually use a different non-inverted structure – the UnInvertedField). This method is actually slower and more memory intensive for fields with a low number of unique values, but if you have a lot of uniques, this is the way to go. This method uses the FieldCache to look up the values for the given field for each document, and every time a document with a given value is found, the value has its count incremented.
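For illustration, the facet method can also be chosen per field at query time. A sketch using the rsolr Ruby gem (the core URL and field names are hypothetical):

require 'rsolr'

solr = RSolr.connect(url: 'http://localhost:8983/solr/mycore')   # hypothetical core
response = solr.get('select', params: {
  q: '*:*',
  rows: 0,
  facet: true,
  'facet.field' => ['status', 'tags'],
  'f.status.facet.method' => 'enum',  # few distinct values: enum, with a large enough filterCache
  'f.tags.facet.method' => 'fc'       # many distinct values: the FieldCache-based method
})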
Please check the memory allotted to each cache and whether you can tweak the FieldCache to handle the situation (as you mentioned, type3 and type4 have a large number of documents).
The source for the above information is Scaling Lucene and Solr. I also found another article on Solr faceting: You are faceting it wrong.
Before SolrCloud, you can also consider Solr multi-core.
On a single instance, Solr has something called a SolrCore that is essentially a single index. If you want multiple indexes, you create multiple SolrCores.
With SolrCloud, a single index can span multiple Solr instances.
This means that a single index can be made up of multiple SolrCore's on different machines.
The SolrCores that make up one logical index are called a collection.
A collection is essentially a single index that spans many SolrCores, providing both index scaling and redundancy.
If you wanted to move your 2 SolrCore Solr setup to SolrCloud, you would have 2 collections, each made up of multiple individual SolrCores.
SolrCloud adds the distributed capabilities in Solr.
With this enabled, you can have a highly available, fault-tolerant cluster of Solr servers.
Use SolrCloud when you want high-scale, fault-tolerant, distributed indexing and search capabilities.
You can get more info about SolrCloud here
https://cwiki.apache.org/confluence/display/solr/SolrCloud

What kind of sort does Cocoa use?

I'm always amazed by the abstractions our modern languages or frameworks create, even the ones considered relatively low level such as Objective-C/Cocoa.
Here I'm interested in the type of sort executed when one calls sortedArrayUsingComparator: on an NSArray. Is it adaptive, analyzing the current constraints of the environment (particularly free memory) and the attributes of the array (length, unique values) and picking the best sort accordingly, or does it always use the same algorithm, like quicksort or merge sort?
It should be possible to test this by analyzing the method's running time relative to N; I'm just wondering if anyone has already bothered to.
This has been described at a developer conference. The sort doesn't need any extra memory. It checks whether there is an already-sorted range at the start or the end (or both) and takes advantage of it. Ask yourself how you would sort a 100,000-entry array if the first 50,000 entries were already sorted in descending order.
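As a toy illustration of the "take advantage of existing runs" idea (in Ruby, and emphatically not Apple's actual implementation): reverse a descending prefix in place, sort the rest, and merge the two sorted halves.

def run_aware_sort(array)
  return array if array.empty?
  # Find how far the initial descending run extends.
  run = 1
  run += 1 while run < array.size && array[run] <= array[run - 1]
  prefix = array[0...run].reverse   # now ascending
  rest = array[run..-1].sort        # fall back to a normal sort for the remainder
  # Standard linear-time merge of two sorted arrays.
  merged = []
  merged << (prefix.first <= rest.first ? prefix.shift : rest.shift) until prefix.empty? || rest.empty?
  merged + prefix + rest
end

run_aware_sort([5, 4, 3, 2, 1, 9, 7, 8])   # => [1, 2, 3, 4, 5, 7, 8, 9]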

Trying to replace this block-based NSSortDescriptor with a Core Data-friendly one

I've got an entity type which, as one of its properties, has a single character. I want to retrieve all such entities which match a predicate, and I want them to be sorted first by that character, and then by an index number (which is another of its properties).
This is simple if I just use the built-in sort descriptors; however, the single character can be anything from a letter to a number to punctuation to an emoji. With the built-in sort, I get punctuation first, then numbers, and so on. What I want is A-Z first, then numbers, then punctuation, then finally emoji or other non-alphanumeric, non-punctuation characters (I don't really care about the order of those last ones).
This is easy enough to implement as a block-based NSSortDescriptor, but I can't figure out how to do it in a way that I can send it off to Core Data as part of a fetch request (i.e., no blocks allowed). I'd be fine with breaking it into a couple different requests, if that's the only way to do it, and then joining the resulting arrays afterward; but I'd prefer to do it in one fetch if possible.
Thanks!
When you create the objects in the first place, run your sort logic and save the resulting 'characterType' into another property. Then, on your fetch request, use three sort descriptors: this character-type identifier first, then the character, and then the index number.
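The rank itself is a tiny classification computed once, at save time. Sketched here in Ruby (only because that's the language used elsewhere on this page); a Cocoa version would do the same with NSCharacterSet, and the exact buckets are assumptions:

# Hypothetical characterType values: store this alongside the character when the object is created.
def character_type(char)
  case char
  when /[A-Za-z]/ then 0      # letters first
  when /[0-9]/ then 1         # then numbers
  when /[[:punct:]]/ then 2   # then punctuation
  else 3                      # emoji and everything else last
  end
end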

Lucene partial word matching

Lucene does not support partial word matching out of the box, so I need some help building my query.
Let's say I have a document with a field value "Develop".
I would like this document to be returned for the searches "Dev" and "lop".
Maybe creating two queries?
"*keyword"
and
"keyword*"
and
"keyword"
?
How would you go about doing this with multiple words? Would you split the sentence/search into a words list and do the previous example for each word?
If I understand you correctly, what you're asking for is not feasible on any large-scale search engine.
Lucene creates an index over keywords using term-document matrix and inverted-file techniques (see the links at the bottom). Fully fledged substring matching might be very nice to have, but it does not scale: you would never be able to query a decently sized index (say, more than a couple dozen or a few hundred documents) in acceptable time.
Still, here are two ideas that might help...
Syllable tokenization
To come back to your example with 'Develop': as long as you are happy with letting users search for syllables, I guess you can do something.
You would have to use a tokenizer that splits the words in your index into their syllables and build an index over those syllables. (I am not sure there are built-in tokenizers for the English language that can do that, and writing one on your own might be tricky...)
An important thing to note:
If you index the full words AND the separate syllables, your index will be much larger than if you index only one of the two.
However, I would not suggest indexing only syllables. If you also want to allow your users to search for the full word 'Develop' (which I guess you do), that search would turn into two queries joined by a logical AND, namely <'dev' AND 'lop'>. Although Lucene supports such logical constructs in queries, they are very expensive; I have personally had trouble with logical queries in Lucene in the past.
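To make that tradeoff concrete, here is a purely illustrative Ruby sketch of the stored tokens for the field value 'Develop' under each choice (the dev/lop split mirrors the example above; a real tokenizer would emit these at index time):

full_word_only = ["develop"]              # only whole-word queries match
syllables_only = ["dev", "lop"]           # "dev" or "lop" match, but "develop" becomes <"dev" AND "lop">
both = full_word_only + syllables_only    # largest index, single-term queries either way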
Stemming
Another way to get close to what you're trying to do could be a brutal form of word stemming (http://en.wikipedia.org/wiki/Stemming) that stems words to their first syllable. (This would allow searching for 'dev' but not for 'lop'...)
Again, I don't think such a word-stemming feature is already in Lucene. Writing one yourself would be a pain and would involve working with/importing huge dictionaries.
Links
These might be worth looking into if you don't know about search engine internals:
http://en.wikipedia.org/wiki/Index_%28search_engine%29
http://en.wikipedia.org/wiki/Vector_space_model
http://en.wikipedia.org/wiki/Inverted_file
http://en.wikipedia.org/wiki/Term-document_matrix
http://en.wikipedia.org/wiki/Tf-idf

Using multiple key value stores

I am using Ruby on Rails and have a situation where I am wondering whether some sort of key-value store would be more appropriate than MySQL. I have users that have_many lists, and each list has_many words. Some lists have hundreds of words, and I want users to be able to copy a list. This is a heavy MySQL task because it has to create hundreds of word objects at one time.
As an alternative, I am considering some sort of key-value store where the key would just be the word. A list of words could be stored in a text field in MySQL. Or each list could be a new key-value DB? It seems like it would be faster to copy a key-value DB this way rather than having to go through the database. It also seems like this might be faster in general. Thoughts?
The general way to solve this with a relational database would be to have a list table, a word table, and a list-words join table relating the two. You are correct that there would be some overhead, but don't overestimate it; because the table structure is defined up front, there is very little actual storage overhead per record, and records can be inserted very quickly.
If you want very fast copies, you could make lists copy-on-write, meaning a single list could be referred to by multiple users, or multiple times by the same user, and you only actually duplicate the list when a user tries to add, remove, or change an entry. Of course, this is premature optimization; start simple and only add complications like this if you find they are necessary.
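For reference, a minimal sketch of the join-table approach with a bulk copy, assuming Rails 6+ (for insert_all) and hypothetical model and column names:

class Word < ApplicationRecord; end

class ListWord < ApplicationRecord
  belongs_to :list
  belongs_to :word
end

class List < ApplicationRecord
  has_many :list_words
  has_many :words, through: :list_words

  # Copy a list's word associations with one multi-row INSERT instead of
  # instantiating hundreds of ActiveRecord objects one at a time.
  # Note: insert_all skips validations, callbacks and (before Rails 7) timestamps.
  def copy_for(user)
    copy = user.lists.create!(name: "#{name} (copy)")
    rows = list_words.pluck(:word_id).map { |word_id| { list_id: copy.id, word_id: word_id } }
    ListWord.insert_all(rows) if rows.any?
    copy
  end
end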
You could use a key-value store as you suggest. I would avoid trying to build one on top of a MySQL text field unless you have a very good reason; it will make any sort of searching by key very slow, since it would require string searching. A key-value data store like CouchDB or Tokyo Cabinet could do this very well, but it would most likely take up more space (as each record has to have its own structure defined, and each word has to be recorded separately in each list). The only dimension of performance I would expect to be better is massively scalable reads and writes, but that's only relevant for the largest of systems.
I would use MySQL naively, and only make changes like this if you need the performance and can prove that this method will actually be faster.
