Analyze similarities in model data using Elasticsearch and Rails - ruby-on-rails

I would like to use Elasticsearch to analyze data and display it to the user.
When a user views a record for a model, I want to display a list of 'similar' records in the database for that model, and the percentage of similarity. This would match against every field on the model.
I am aware that with the Searchkick gem I can use a command to find similar records:
product = Product.first
product.similar(fields: ["name"], where: {size: "12 oz"})
I would like to take this further and compare entire records (and eventually associations).
Is this feasible with Elasticsearch / Searchkick in Rails, or should I use another method to analyze the data?

There is a feature built exactly for this purpose in Elasticsearch called more_like_this. The documentation for the mlt query goes into great detail about how you can achieve exactly what you want to do.
The content you provide to the like field will be analyzed, and the most relevant terms for each field will be used to retrieve documents containing as many of those relevant terms as possible. If you have all your records stored in Elasticsearch, you can use the Multi GET syntax to specify a document already in your index as the content of the like field, like this:
"like" : [
{
"_index" : "model",
"_type" : "model",
"_id" : "1"
}
]
Remember that you cannot use index aliases when using this syntax (so you'll have to do a document lookup first if you are not sure which index your document is currently residing in).
If you don't specify the fields field, all fields in the source document will be used. To avoid bad surprises, my suggestion is to always specify the list of fields you want your similar documents to match.
If you have non-textual fields that you want to match perfectly with the source document, you might want to consider using a bool query, programmatically creating the filter section to limit documents returned by the mlt query to only a filtered subset of your entire index.
You can build these queries in Searchkick using the advanced search feature, manually specifying the body of search requests.
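A minimal sketch of how such a request body could be assembled in Ruby. The helper name build_similar_body, the index name products_development, and the size filter are all assumptions for illustration, not part of Searchkick or the original question's schema. The _type key is left out on the assumption of a recent Elasticsearch version without mapping types, and the term filter assumes the filtered field is indexed as an exact (keyword) value.

```ruby
# Build a bool query: more_like_this for text similarity, plus exact-match filters.
def build_similar_body(index:, id:, fields:, filters: {})
  {
    query: {
      bool: {
        must: {
          more_like_this: {
            fields: fields,
            like: [{ _index: index, _id: id.to_s }],
            min_term_freq: 1,
            min_doc_freq: 1
          }
        },
        # One term filter per non-textual field that must match exactly.
        filter: filters.map { |field, value| { term: { field => value } } }
      }
    }
  }
end

body = build_similar_body(
  index: "products_development", # hypothetical index name
  id: 1,
  fields: ["name", "description"],
  filters: { size: "12 oz" }
)

# With Searchkick loaded, the body could then be passed through as-is:
#   Product.search(body: body)
```

Listing the fields explicitly keeps the similarity match predictable, while the filter section narrows the candidates without affecting the score.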

Read up on using More Like This Query. This is the query produced by product.similar(). It operates only on text fields. If you also want to compare numeric or date fields, you'll have to incorporate these rules into a scoring script to do what you're asking.

Related

Dynamic Queries using Couch_Potato

The documentation for creating a fairly straightforward view is easy enough to find:
view :completed, :key => :name, :conditions => 'doc.completed === true'
How, though, does one construct a view with a condition created on the fly? For example, if I want to use a query along the lines of
doc.owner_id == my_var
where my_var is set programmatically.
Is this even possible? I'm very new to NoSQL so apologies if I'm making no sense.
Views in CouchDB are incrementally built / indexed as data is inserted / updated into that particular database. So in order to take full advantage of the power behind views you won't want to dynamically query them. You'll want to construct your views in such a way that you can efficiently access the data based on the expected usage patterns of the application. In my experience it's not uncommon to have multiple views each giving you a different way to access / query the same data. I find it helpful to think of CouchDB views as a way to systematically denormalize your documents.
On the other hand there are also ways to generalize your indexes in your views so you can use a single view for endless combinations of queries.
For example, you have an "articles" database, and each article document contains a list of tags. If you want to set up a query to dynamically retrieve all articles tagged with a handful of tags, you could emit multiple entries to the view on the same document:
// this article is tagged with "tag1","tag2","tag3"
emit("tag1",doc._id);
emit("tag2",doc._id);
emit("tag3",doc._id);
....
Now you have a way to query: Give me all articles tagged with these words: ["tag1","tag2",etc]
For more info on how to query multiple keys see "Parameter -> keys" in the table of Querying Options here:
http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
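As a rough sketch, a multi-key query against such a view can be issued by POSTing a JSON body with a keys array to the view URL. The database name, design document name, and view name below are hypothetical, and the request is only built here, not sent.

```ruby
require "json"
require "net/http"
require "uri"

# Build (but do not send) a POST request that asks a view for several keys at once.
def build_keys_request(base_url, db, design_doc, view, keys)
  uri = URI("#{base_url}/#{db}/_design/#{design_doc}/_view/#{view}")
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/json"
  request.body = JSON.generate({ "keys" => keys })
  [uri, request]
end

# "articles" (database and design doc) and "by_tag" (view) are assumed names.
uri, request = build_keys_request(
  "http://localhost:5984", "articles", "articles", "by_tag", ["tag1", "tag2"]
)

# Sending it requires a running CouchDB instance:
#   response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
#   rows = JSON.parse(response.body)["rows"]
```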
One problem with the above example is it would produce duplicates if a single document was tagged with both or all of the tags you were querying for. You can easily de-dupe the results of the view by using a CouchDB "List Function". More info about list functions can be found here:
http://guide.couchdb.org/draft/transforming.html
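If a list function feels like too much machinery, the duplicates can also be dropped in the application after the rows come back. This is a small sketch of that alternative, using hand-written rows shaped like a view response rather than real data.

```ruby
# Rows shaped like a CouchDB view response: one row per emitted key.
rows = [
  { "id" => "article-1", "key" => "tag1", "value" => "article-1" },
  { "id" => "article-2", "key" => "tag1", "value" => "article-2" },
  { "id" => "article-1", "key" => "tag2", "value" => "article-1" }
]

# Keep the first row seen for each document id.
unique_rows = rows.uniq { |row| row["id"] }
unique_ids = unique_rows.map { |row| row["id"] }
```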
Another way to construct views for even more robust "dynamic" access to the data would be to compose your indexes out of complex data types such as JavaScript arrays. Also incorporating "range queries" can help. So for example if you have a 3-item array in your index, but only have the first 2 values, you can set up a range query to pull all documents that match the first 2 items of the array. Some useful info about that can be found here:
http://guide.couchdb.org/draft/views.html
Refer to the "startkey", and "endkey" options under "Querying Options" table here:
http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
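One way to sketch the "first 2 values of a 3-item array" range query: use the known prefix as startkey, and the same prefix followed by an empty object as endkey. In CouchDB's view collation, objects sort after all other JSON types, so {} works as a high sentinel. The helper name and the sample prefix are assumptions for illustration.

```ruby
require "json"
require "uri"

# Build startkey/endkey so that every key beginning with `prefix` falls in range.
def prefix_range_query(prefix)
  URI.encode_www_form({
    "startkey" => JSON.generate(prefix),
    "endkey" => JSON.generate(prefix + [{}])
  })
end

# Hypothetical keys of the form [year, month, day]; only year and month are known.
query = prefix_range_query(["2011", "10"])
```

The resulting string would be appended to the view URL after a question mark.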
It's good to know how CouchDB indexes itself. It uses a "B+ tree" data structure:
http://guide.couchdb.org/draft/btree.html
Keep this in mind when thinking about how to compose your indexes. This has specific implications about how you need to construct your indexes. For example, you can't expect to get good performance on a view if you query with a range on the first item in the array. For example:
startkey = [a,1,2]
endkey = [z,1,2]
You'll get the performance you'd expect if your query is:
startkey = [1,2,a]
endkey = [1,2,z]
In more general terms, this means that index order matters when querying views, not just for performance but also for which documents are returned. If you index a document in a view with [1,2,3], you can't expect it to show up in a query for [3,2,1], [2,1,3], or any other combination.
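The effect of key order can be sketched with plain sorted arrays. Ruby's Array#<=> is used here as a stand-in for CouchDB's collation, which is only an approximation, and the sample keys are made up.

```ruby
# Return the keys that fall between startkey and endkey in sorted order,
# roughly as a range query over a view index would.
def keys_in_range(keys, startkey, endkey)
  keys.sort.select { |k| (k <=> startkey) >= 0 && (k <=> endkey) <= 0 }
end

varying_last = [[1, 2, "a"], [1, 2, "m"], [1, 3, "b"], [2, 1, "a"]]
varying_first = [["a", 1, 2], ["b", 9, 9], ["c", 1, 2], ["z", 1, 2]]

# Range on the last element: only keys sharing the [1, 2] prefix come back.
tight = keys_in_range(varying_last, [1, 2, "a"], [1, 2, "z"])

# Range on the first element: everything between "a" and "z" comes back,
# including ["b", 9, 9], whose later elements were never meant to match.
loose = keys_in_range(varying_first, ["a", 1, 2], ["z", 1, 2])
```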
In my experience, most data-access problems can be solved elegantly and efficiently with CouchDB and the basic tools it provides. If / when your project needs true dynamic access to the data, I generally still use CouchDB for common data access needs, but I'll also integrate ElasticSearch using an ElasticSearch plugin which streams your data from CouchDB into ElasticSearch as it becomes available:
http://www.elasticsearch.org/
https://github.com/elasticsearch/elasticsearch-river-couchdb

Rails Reporting Object - Map to New Object

I have a complex multi-object query that is being used for reporting to customers and isn't dependent on new transactional data. With the volume of data I'm dealing with, it seems more efficient to use a custom query instead of loading all of my models up (which I was originally doing and even with effective filtering, it was getting slow).
I've gone as far as creating a class to represent my report (combining attributes from several models), but I'm stuck on something that seems fairly easy now.
I have a new class named AttendeeReport
It has the following attributes:
attendee_id, first_name, last_name, email_address, rsvp_id, response_id, etc
This is used for reporting (in jquery datatables, and possibly elsewhere) and I'm using this query to pull from multiple models correctly into an array of my correct attributes:
list = Attendee.joins(:event).joins(:focus_group_prospect).joins(:post_focus_group_survey).joins(:lead).joins(:vendor).where("events.focus_group_client_id = ?", client_id).select("attendees.id as attendee_id, leads.first_name, etc").all
How would I map that result to a new array of AttendeeReport without listing each field individually? Is there a merge or map feature that I'm not familiar with that does this dynamically?
Note: I've dramatically decreased the number of attributes shown since they don't matter.
Update: Reflection?
Thinking about this further: if there isn't an easy way to do this via a map or merge, I wonder whether a reflection approach would work too (iterate over all attributes and match them up).
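One possible shape for the reflection idea, as a sketch only: define the report as a Struct and fill it from each row's attribute hash, keeping just the keys the report knows about. The Struct definition below lists only a few attributes, and the rows are plain hashes standing in for what record.attributes would return; the helper name to_report is hypothetical.

```ruby
# Hypothetical report class; only a few of the attributes are listed.
AttendeeReport = Struct.new(
  :attendee_id, :first_name, :last_name, :email_address,
  keyword_init: true
)

# Build a report from any hash of attributes, ignoring keys the report lacks.
def to_report(attributes)
  known = attributes.transform_keys(&:to_sym).slice(*AttendeeReport.members)
  AttendeeReport.new(**known)
end

# Stand-in for what `record.attributes` returns for each row of the query.
rows = [
  { "attendee_id" => 7, "first_name" => "Ann", "last_name" => "Lee",
    "email_address" => "ann@example.com", "id" => nil }
]

reports = rows.map { |row| to_report(row) }

# In the Rails app this might read:
#   reports = list.map { |attendee| to_report(attendee.attributes) }
```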

Caching results of query when filters are unchanged

I am building a contact management system of sorts. I have a list page with several filters to narrow the results, such as "area", "category", etc. I also have search fields for name, address and contact info.
Suppose I set area to "Chicago" and category to "Family" and then press "apply filters" (the filters and search fields are submitted together); I get the result. If I had entered something in the name field, I attach a where query to the resulting ActiveRecord relation.
Suppose I've got a result with the above filters in one request. If I then want to search for a different name, I'll have to query the database with the area and category filters again, which is not necessary. Is there a way to cache the results from the previous search?
I recommend not worrying about this until you can show you have a problem.
If you did have a problem you could:
Return all results and do the filtering in JavaScript
Cache all results on the server and do the filtering in Ruby there
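A rough sketch of the second option, using a tiny in-memory cache keyed by the filter values. The class name FilterCache and the sample contacts are assumptions; in a Rails app, Rails.cache.fetch would typically play the role of this class.

```ruby
# Tiny in-memory cache keyed by the filter values (hypothetical class name).
class FilterCache
  def initialize
    @store = {}
  end

  # Runs the block only the first time a given filter combination is seen.
  def fetch(filters)
    @store[filters] ||= yield
  end
end

contacts = [
  { name: "Alice", area: "Chicago", category: "Family" },
  { name: "Bob", area: "Chicago", category: "Family" },
  { name: "Carol", area: "Boston", category: "Work" }
]

cache = FilterCache.new
query_count = 0

search = lambda do |filters, name|
  base = cache.fetch(filters) do
    query_count += 1 # stands in for the database query
    contacts.select { |c| filters.all? { |k, v| c[k] == v } }
  end
  # The name search runs in Ruby against the cached rows.
  base.select { |c| c[:name].downcase.include?(name.downcase) }
end

first = search.call({ area: "Chicago", category: "Family" }, "ali")
second = search.call({ category: "Family", area: "Chicago" }, "bob")
```

Both calls share one cached result because Ruby hashes compare equal regardless of key order.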

How do database indices make search faster

I was reading through the Rails tutorial (http://ruby.railstutorial.org/book/ruby-on-rails-tutorial#sidebar-database_indices) but am confused about the explanation of database indices. Basically, the author proposes that rather than searching in O(n) time through a list of emails (for login), it's much faster to create an index, giving the following example:
To understand a database index, it’s helpful to consider the analogy
of a book index. In a book, to find all the occurrences of a given
string, say “foobar”, you would have to scan each page for “foobar”.
With a book index, on the other hand, you can just look up “foobar” in
the index to see all the pages containing “foobar”.
source:
http://ruby.railstutorial.org/chapters/modeling-users#sidebar:database_indices
So what I understand from that example is that words can be repeated in text, so the "index page" consists of unique entries. However, in the railstutorial site, the login is set such that each email address is unique to an account, so how does having an index make it faster when we can have at most one occurrence of each email?
Thanks
Indexing isn't (much) about duplicates. It's about order.
When you do a search, you want to have some kind of order that lets you (for example) do a binary search to find the data in logarithmic time instead of searching through every record to find the one(s) you care about (that's not the only type of index, but it's probably the most common).
Unfortunately, you can only arrange the records themselves in a single order.
An index contains just the data (or a subset of it) that you're going to use to search on, and pointers (of some sort) to the records containing the actual data. This allows you to (for example) do searches based on as many different fields as you care about, and still be able to do binary searching on all of them, because each index is arranged in order by that field.
Because the index in the DB and in the given example is sorted alphabetically. The raw table / book is not. Then think: how do you search an index knowing it is sorted? I guess you don't start reading at "A" up to the point of your interest. Instead you skip roughly to the point of interest and start searching from there. Basically, a DB can do the same with an index.
It is faster because the index contains only values from the column in question, so it is spread across a smaller number of pages than the full table. Also, indexes usually include additional optimizations such as hash tables to limit the number of reads required.
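The idea of "sorted keys plus pointers back to the rows" can be sketched in a few lines. The table contents are made up, and a sorted array with binary search stands in for whatever structure a real database uses.

```ruby
# A "table" in insertion order, and an index: sorted [email, row position] pairs.
table = [
  { email: "mia@example.com", name: "Mia" },
  { email: "abe@example.com", name: "Abe" },
  { email: "zoe@example.com", name: "Zoe" },
  { email: "kai@example.com", name: "Kai" }
]
index = table.each_with_index.map { |row, pos| [row[:email], pos] }.sort

# Binary search the index, then follow the pointer back to the row.
def lookup(table, index, email)
  entry = index.bsearch { |(key, _pos)| email <=> key }
  entry && table[entry[1]]
end

found = lookup(table, index, "kai@example.com")
missing = lookup(table, index, "nobody@example.com")
```

Even though every email is unique, the lookup touches only a logarithmic number of index entries instead of scanning every row.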

Building a (simple) twitter-clone with CouchDB

I'm trying to build a (simple) twitter-clone which uses CouchDB as Database-Backend.
Because of its reduced feature set, I'm almost finished with coding, but there's one thing left I can't solve with CouchDB - the per user timeline.
As with twitter, the per user timeline should show the tweets of all people I'm following, in a chronological order. With SQL it's a quite simple Select-Statement, but I don't know how to reproduce this with CouchDBs Map/Reduce.
Here's the SQL-Statement I would use with an RDBMS:
SELECT * FROM tweets WHERE user_id IN (1,5,20,33,...) ORDER BY created_at DESC;
CouchDB schema details
user-schema:
{
"_id":"xxxxxxx",
"_rev":"yyyyyy",
"type":"user",
"user_id":1,
"username":"john",
...
}
tweet-schema:
{
"_id":"xxxx",
"_rev":"yyyy",
"type":"tweet",
"text":"Sample Text",
"user_id":1,
...
"created_at":"2011-10-17 10:21:36 +000"
}
With view collations it's quite simple to query CouchDB for a list of "all tweets with user_id = 1 ordered chronologically".
But how do I retrieve a list of "all tweets which belong to the users with the ID 1,2,3,... ordered chronologically"? Do I need another schema for my application?
The best way of doing this would be to save the created_at as a timestamp and then create a view, and map all tweets to the user_id:
function(doc){
if(doc.type == 'tweet'){
emit(doc.user_id, doc);
}
}
Then query the view with the user IDs as keys, and sort the results in your application however you want (most languages have a sort method for arrays).
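The application-side sort could look like the sketch below. The rows are hand-written in the shape a multi-key view query returns, and created_at is assumed to be stored as a Unix timestamp, as suggested above.

```ruby
# Rows as a multi-key view query might return them, grouped by user_id.
rows = [
  { "key" => 1, "value" => { "text" => "first", "user_id" => 1, "created_at" => 1318846896 } },
  { "key" => 1, "value" => { "text" => "third", "user_id" => 1, "created_at" => 1318850000 } },
  { "key" => 5, "value" => { "text" => "second", "user_id" => 5, "created_at" => 1318848000 } }
]

# Newest first, like ORDER BY created_at DESC.
timeline = rows.map { |row| row["value"] }.sort_by { |tweet| -tweet["created_at"] }
texts = timeline.map { |tweet| tweet["text"] }
```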
Is that a CouchDB-only app? Or do you use something in between for additional business logic? In the latter case, you could achieve this by running multiple queries.
This might include merging different views. Another approach would be to add a list of "private readers" for each tweet. It allows user-specific (partial) views, but also introduces the complexity of adding the list of readers for each new tweet, or even updating the list in case of new followers or unfollow operations.
It's important to think about the possible operations and their frequencies. If you're mostly generating lists of tweets, it's better to shift the complexity into how the reader information is stored in your documents (i.e. integrating the readers into your tweet doc) and then easily build efficient view indices.
If you have many changes to your data, it's better to design your database not to update too many existing documents at the same time. Instead, try to add data by adding new documents and aggregate via complex views.
But you have shown an edge case where the simple (1-dimensional) list-based index is not enough. You'd actually need secondary indices to filter by time and user IDs (given the fact that you also need partial ranges for both). But this is not possible in CouchDB, so you need to work around it by shifting "query" data into your docs and using it when building the view.
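To make the "readers in the tweet doc" workaround concrete, here is a Ruby simulation of what a view's map function could emit. The readers field name and the sample tweets are assumptions, and a real view would be written as a JavaScript map function rather than Ruby.

```ruby
# Tweets carrying a hypothetical "readers" list (ids of followers at write time).
tweets = [
  { "text" => "hello", "user_id" => 1, "readers" => [2, 3], "created_at" => 100 },
  { "text" => "world", "user_id" => 5, "readers" => [2], "created_at" => 200 },
  { "text" => "again", "user_id" => 1, "readers" => [2, 3], "created_at" => 300 }
]

# What a map function could emit: one [reader_id, created_at] key per reader.
index = tweets.flat_map do |tweet|
  tweet["readers"].map { |reader_id| [[reader_id, tweet["created_at"]], tweet["text"]] }
end.sort

# Timeline for one reader: the contiguous run of keys starting with that id,
# read back to front so the newest tweet comes first.
def timeline(index, reader_id)
  index.select { |(key, _text)| key[0] == reader_id }.reverse.map { |(_key, text)| text }
end

for_reader_2 = timeline(index, 2)
for_reader_3 = timeline(index, 3)
```

The cost noted above still applies: the readers list has to be written with every tweet and kept up to date when follow relationships change.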
