I have an alphabetized index of people. My goal is to find what page of that index a person is listed on. For instance, "Tim Curry" might be listed on page 5 of the "T" section. Currently I'm getting the page number with ActiveRecord; Elasticsearch results are 20 per page, so I can work out the page number based on the index. But it seems wiser to get the page number directly from Elasticsearch if at all possible to ensure that I'm getting the right page. Is there a way to get this data from ES?
def page_index
  letter = self.name[0].downcase
  # Position of this person among everyone whose name starts with the same letter
  index = Person.where("lower(name) LIKE ?", "#{letter}%").order("lower(name)").pluck(:id).index(self.id)
  page = index / 20 + 1
end
This functionality does not come bundled with ElasticSearch. Using the results per page and index is the correct approach if that is the functionality you are looking for.
Since it's not clear exactly which document you need, or what the overall UX you are trying to achieve is, I would keep in mind that you can always search your index(es) for a specific document by various means (a filtered term query on name if you need "Tim Curry", the id or _uid, etc.).
Also, ES is a full-text search engine; finding a single object and its properties might be better served by a database call.
Again, this is somewhat speculative since I don't know exactly what you need or are trying to achieve overall, but finding the page of a specific result within your returned results is best done by taking its index in the results and doing the simple math.
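If you do want to derive the page directly from Elasticsearch, a minimal sketch with the elasticsearch-ruby gem could mirror the ActiveRecord approach: fetch the sorted ids for the letter and compute the page from the position. The index name "people", the "name.keyword" sub-field, lowercase-indexed names, and the page size of 20 are all assumptions here; adjust them (and the size cap) to your own mapping.

require 'elasticsearch'

client = Elasticsearch::Client.new

# Hypothetical helper: find the index page of a person within their letter section.
# Assumes an index named "people" with a sortable "name.keyword" sub-field holding
# lowercased names. "size" caps how many results are considered, so very large
# sections would need scrolling or search_after instead.
def page_for(client, person_id, letter, per_page: 20)
  response = client.search(
    index: 'people',
    body: {
      query: { prefix: { 'name.keyword' => letter.downcase } },
      sort:  [{ 'name.keyword' => 'asc' }],
      _source: false,
      size: 10_000
    }
  )
  ids = response['hits']['hits'].map { |hit| hit['_id'] }
  position = ids.index(person_id.to_s)
  position && (position / per_page + 1)
end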
I was reading through the Rails tutorial (http://ruby.railstutorial.org/book/ruby-on-rails-tutorial#sidebar-database_indices) but I'm confused by the explanation of database indices. Basically, the author proposes that rather than searching in O(n) time through a list of emails (for login), it's much faster to create an index, giving the following example:
To understand a database index, it’s helpful to consider the analogy
of a book index. In a book, to find all the occurrences of a given
string, say “foobar”, you would have to scan each page for “foobar”.
With a book index, on the other hand, you can just look up “foobar” in
the index to see all the pages containing “foobar”.
Source: http://ruby.railstutorial.org/chapters/modeling-users#sidebar:database_indices
So what I understand from that example is that words can be repeated in text, so the "index page" consists of unique entries. However, on the railstutorial site, login is set up so that each email address is unique to an account, so how does having an index make things faster when there can be at most one occurrence of each email?
Thanks
Indexing isn't (much) about duplicates. It's about order.
When you do a search, you want some kind of order that lets you (for example) do a binary search to find the data in logarithmic time, instead of scanning every record to find the one(s) you care about. (That's not the only type of index, but it's probably the most common.)
Unfortunately, you can only arrange the records themselves in a single order.
An index contains just the data (or a subset of it) that you're going to search on, plus pointers (of some sort) to the records containing the actual data. This lets you (for example) search on as many different fields as you care about and still do a binary search on each of them, because each index is arranged in order by its field.
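To make the ordering point concrete, here's a purely illustrative Ruby sketch (not how a database actually implements it) contrasting a linear scan with a binary search over a sorted list of emails:

emails = %w[zoe@example.com ann@example.com tim@example.com bob@example.com]

# Without an index: a linear scan touches every entry in the worst case
linear = emails.find { |e| e == 'tim@example.com' }

# An index keeps the values in sorted order, so a binary search works
sorted = emails.sort
found  = sorted.bsearch { |e| 'tim@example.com' <=> e }  # O(log n) instead of O(n)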
Because the index in the DB, as in the given example, is sorted alphabetically. The raw table / book is not. Then think: how do you search an index knowing it is sorted? You don't start reading at "A" and continue up to the point of interest. Instead you skip roughly to the point of interest and start searching from there. Basically a DB can do the same with an index.
It is faster because the index contains only values from the column in question, so it is spread across a smaller number of pages than the full table. Also, indexes usually include additional optimizations such as hash tables to limit the number of reads required.
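In the Rails tutorial's case this boils down to a one-line migration; something like the following (using the tutorial's users table and email column):

class AddIndexToUsersEmail < ActiveRecord::Migration
  def change
    # Lets the database find a user by email via an index lookup instead of a
    # full table scan, and also enforces uniqueness at the database level
    add_index :users, :email, unique: true
  end
end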
I'm trying to build a (simple) twitter-clone which uses CouchDB as Database-Backend.
Because of its reduced feature set, I'm almost finished with coding, but there's one thing left I can't solve with CouchDB - the per user timeline.
As with twitter, the per-user timeline should show the tweets of all people I'm following, in chronological order. With SQL it's quite a simple SELECT statement, but I don't know how to reproduce this with CouchDB's Map/Reduce.
Here's the SQL-Statement I would use with an RDBMS:
SELECT * FROM tweets WHERE user_id IN (1,5,20,33,...) ORDER BY created_at DESC;
CouchDB schema details
user-schema:
{
  "_id": "xxxxxxx",
  "_rev": "yyyyyy",
  "type": "user",
  "user_id": 1,
  "username": "john",
  ...
}
tweet-schema:
{
  "_id": "xxxx",
  "_rev": "yyyy",
  "type": "tweet",
  "text": "Sample Text",
  "user_id": 1,
  ...
  "created_at": "2011-10-17 10:21:36 +000"
}
With view collations it's quite simple to query CouchDB for a list of "all tweets with user_id = 1 ordered chronologically".
But how do I retrieve a list of "all tweets which belong to the users with the IDs 1,2,3,... ordered chronologically"? Do I need another schema for my application?
The best way of doing this would be to save created_at as a timestamp and then create a view that maps all tweets to their user_id:
function(doc) {
  // Emit each tweet keyed by its author's user_id
  if (doc.type == 'tweet') {
    emit(doc.user_id, doc);
  }
}
Then query the view with the user ids as keys, and sort the results in your application however you want (most languages have a sort method for arrays).
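A rough sketch of that query from Ruby, assuming plain HTTP against a local CouchDB, a database named twitter, and a design document _design/tweets containing the view above as by_user (all of those names are placeholders for your setup):

require 'net/http'
require 'json'
require 'uri'

# POSTing a "keys" array asks CouchDB for only the rows of those users
uri = URI('http://127.0.0.1:5984/twitter/_design/tweets/_view/by_user')
request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
request.body = { keys: [1, 5, 20, 33] }.to_json

response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
rows = JSON.parse(response.body)['rows']

# Each row's value is the tweet doc emitted by the map function;
# sort the combined timeline chronologically in the application
timeline = rows.map { |row| row['value'] }
               .sort_by { |tweet| tweet['created_at'] }
               .reverse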
Edited one last time - Was trying to make it all in couchDB... see revisions :)
Is this a CouchDB-only app? Or do you use something in between for additional business logic? In the latter case, you could achieve this by running multiple queries.
This might include merging different views. Another approach would be to add a list of "private readers" for each tweet. It allows user-specific (partial) views, but also introduces the complexity of adding the list of readers for each new tweet, or even updating the list in case of new followers or unfollow operations.
It's important to think about the possible operations and their frequencies. If you're mostly generating lists of tweets, it's better to shift the complexity into how the reader information is integrated into your documents (i.e. embedding the readers in each tweet doc) so you can easily build efficient view indexes.
If you have many changes to your data, it's better to design your database not to update too many existing documents at the same time. Instead, try to add data by adding new documents and aggregate via complex views.
But you have shown an edge case where a simple (one-dimensional) list-based index is not enough. You'd actually need secondary indexes to filter by time and user ids (given that you also need partial ranges for both). That is not possible in CouchDB, so you need to work around it by shifting "query" data into your docs and using it when building the view.
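One way to shift that query data into the view, as a sketch rather than a recipe: assume the map function emits the composite key [doc.user_id, doc.created_at] (that view, named by_user_and_time below, is an assumption, not part of your current schema). Then you can issue one ranged query per followed user and merge the slices in the application:

require 'net/http'
require 'json'
require 'uri'

# Assumes a view emitting [doc.user_id, doc.created_at] as the key;
# one startkey/endkey range per followed user
BASE = 'http://127.0.0.1:5984/twitter/_design/tweets/_view/by_user_and_time'

def tweets_for(user_id)
  params = {
    startkey: [user_id].to_json,
    endkey:   [user_id, {}].to_json   # {} collates after any timestamp string
  }
  uri = URI("#{BASE}?#{URI.encode_www_form(params)}")
  JSON.parse(Net::HTTP.get(uri))['rows'].map { |row| row['value'] }
end

following = [1, 5, 20, 33]
timeline = following.flat_map { |id| tweets_for(id) }
                    .sort_by { |tweet| tweet['created_at'] }
                    .reverse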
The problem with your typical rails pagination gem is that it does 2 queries: one for the page you're on and one for the total count. When you don't care about how many pages there are (e.g. in an endless scroll), that 2nd query is unnecessary (just add 1 to your LIMIT clause in the 1st query and you know if there are more or not).
Is there a gem that'll do pagination without the 2nd query? The 2nd query is expensive when applying non-indexed filters in my WHERE clause on large datasets and indexing all my various filters is unacceptable because I need my inserts to be fast.
Thanks!
Figured it out. When using the will_paginate gem, you can supply your own total_entries option to AR::Base.paginate. This makes it so the 2nd query doesn't run.
This works for sufficiently large datasets where you only care about recent entries.
This isn't necessarily acceptable if you actually expect to hit the end of your list because if the list size is divisible by per_page you're going to query an empty set on your last query. With endless scroll, this is fine. With a manual "load more" button, you'll be displaying "load more" at the very end when there are no more items to load.
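A minimal sketch of the total_entries trick (the Post model is a placeholder; will_paginate just needs some number so that it skips the COUNT):

# Supplying :total_entries stops will_paginate from issuing the COUNT query.
# The number only needs to be "big enough" when you never show a real last page.
@posts = Post.paginate(
  page:          params[:page],
  per_page:      20,
  total_entries: 1_000_000   # placeholder; avoids the second query
)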
The standard approach, as you've identified, is to fetch N+1 records when you need N and if you get more than N records in the response, there is at least one additional page of results you can display.
The only reason you'd want to do an explicit COUNT(*) call is if you need to know specifically how many more records you will need to fetch. On some engines this can take a good chunk of time to compute so it is best avoided especially if the value is never directly used.
Since this is so simple, you really don't need a plugin to do it. Plugins like will_paginate are more concerned with the number of pages available, so they do the count operation.
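A hand-rolled sketch of the fetch-N+1 approach with plain ActiveRecord (Post is again a placeholder model):

per_page = 20
page     = params.fetch(:page, 1).to_i

# Ask for one extra record: if it comes back, there is at least one more page
records  = Post.order(created_at: :desc)
               .offset((page - 1) * per_page)
               .limit(per_page + 1)
               .to_a
has_more = records.size > per_page
records  = records.first(per_page)
# Render `records`, and show the "load more" link only while has_more is true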
I am implementing a complex search module whose results page supports paging. Most of the examples I've seen just pass the page number as a parameter to the Index action, and the action uses the page number to run the query again each time the user hits a different page.
My problem is that my search takes many more criteria (more than 10) than just a simple page number. Therefore, I would like to preserve either the search criteria or the search result data after the user's first submission, so that I only have to pass the page number back and forth.
So I don't know which way is better: preserve the search criteria, so that every time the user clicks to a new page the controller action runs the search again? Or preserve the search result data, so the application doesn't need to query the database again and again, though the data being preserved will be big. If you have any ideas on how to implement this, thanks in advance.
Preserving the search criteria in the querystring is generally best. It will allow users to bookmark the search.
Preserving search result data brings up issues of potential stale data and consumes more resources server-side. This wouldn't work well with large data sets anyway, as you would only be selecting one page at a time, so caching in memory wouldn't help much when the user navigates to the next page.
I'd suggest you generate a unique key for each search and store an object that contains all the search criteria, in memory or in the DB, under that unique key. Then pass the unique key on the querystring.
So you mean: save the search criteria in the DB with a unique key, and any time I need the same search results again (including when the page index changes), get the unique key from the querystring and run the query again. Is that your suggestion? Thank you very much for your advice, very helpful.
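A rough sketch of that saved-search approach (the SavedSearch and Product models, search_params, the column names, and the routes are all hypothetical): store the criteria once under a random token, then on every page request look the criteria up by token and re-run the query with the new page number.

require 'securerandom'

# Hypothetical table: saved_searches(token:string, criteria:text)
class SavedSearch < ActiveRecord::Base
  serialize :criteria, Hash
end

# First submission: persist the criteria, redirect with just the token
def create
  search = SavedSearch.create!(token: SecureRandom.hex(8),
                               criteria: search_params.to_h)
  redirect_to results_path(key: search.token, page: 1)
end

# Every page click: rebuild the query from the stored criteria
def index
  criteria = SavedSearch.find_by!(token: params[:key]).criteria
  @results = Product.where(criteria).paginate(page: params[:page], per_page: 20)
end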
I've been using Ferret as my full-text search engine in a small project I'm working on.
Through the documentation and a few examples online, I've been able to pull together a tag cloud generator that uses the full-text index, via the IndexReader.terms method.
It's worked quite well up to now, when I want to get term data based on a search result.
For example, if the user searches for "cake", I want to show them a tag cloud of terms used in association with the term "cake".
I've been looking for examples of how the terms method can be used in association with a search result set, or something similar.
Currently I'm using the following method to generate my list of tags:
# Open a reader on the most recent index and collect every term in the
# :all_quotes field along with its document frequency
reader = Ferret::Index::IndexReader.new(Scrape.find_last_index_version)
terms = []
reader.terms(:all_quotes).each do |term, doc_freq|
  terms << [term, doc_freq]
end
Cheers.
It's more like a term frequency chart (like a Wordle) than a tag cloud? Or are these in a tag field? Anyway, the index doesn't keep track of term frequency within each possible document subset (such as the results of a search), so that method wouldn't be fast even if it existed. For a single document, you can get the TermFreqVector and provide suggested documents that are good matches for other frequent terms in that document. So you could take some of the top results, grab the term vectors from each one, and just add them up, but those aggregate functions don't exist natively (they generally try not to put slow operations in there).
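If you want to try that "add up the term vectors from the top results" idea, a rough sketch is below. It assumes the :all_quotes field was indexed with term vectors stored, that Scrape.find_last_index_version returns the index path as in your snippet, and that Ferret's Index#search_each and IndexReader#term_vector behave as described in its docs; treat it as a starting point to adapt rather than drop-in code.

require 'ferret'

# Tally, across the top matches for a query, how many matching documents
# contain each term in :all_quotes (a per-result-set "related terms" count)
index  = Ferret::Index::Index.new(path: Scrape.find_last_index_version)
counts = Hash.new(0)

index.search_each('cake', limit: 25) do |doc_id, _score|
  tv = index.reader.term_vector(doc_id, :all_quotes)   # nil if vectors not stored
  next unless tv
  tv.terms.each { |t| counts[t.text] += 1 }
end

# Sort descending and keep the most frequent terms for the cloud
tag_cloud = counts.sort_by { |_term, n| -n }.first(50)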