I am using Lucene.net to perform faceted searches for an MVC-based web app hosted on Azure.
The index consists of approx. 2 million entries.
Each entry has 1 Analysed field and about 25 Non Analysed fields.
All fields need to be stored.
Currently the entire app works fine with a 25%-complete sample index, but it falls over when the full index is created. At that point I start getting an OutOfMemoryException from this line:
sfs = new SimpleFacetedSearch(Newreader, "Product_Id");
Each document is an SKU and each product has about 44 SKUs.
My intention (and what was working before the full index was created) was to perform a facet search on "Product_Id", giving the total number of unique products, in order to allow for paging and to only create object models for the required number of products (24 products per page, for example).
The layout of the page is such that I need all the SKU data for a product but need to limit by unique products (i.e. 24 products per page, not 24 SKUs per page).
So in essence I either need to figure out why I am getting the OutOfMemoryException (Lucene seems to handle much larger indexes for other people, so maybe I am doing something wrong),
OR
I need to filter the SKUs down by unique product ID in another way.
I tried iterating through the hits in a loop, tracking the product IDs already seen and only building the full model when required. That worked well for the first few results pages (where you don't have to iterate very far to fill the 24-product quota) but was awful for the last few pages.
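Roughly, the iterate-and-track approach I mean looks like the sketch below (Lucene.Net 3.x API; the method name, the query and the exact numbers are placeholders rather than my real code, but it shows why late pages get slow):

using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Search;

public static List<Document> GetSkusForProductPage(
    IndexSearcher searcher, Query query, int pageNumber, int productsPerPage)
{
    // Walk the matching SKU hits; for late pages this walks almost all of them.
    var hits = searcher.Search(query, searcher.IndexReader.MaxDoc).ScoreDocs;

    int skip = (pageNumber - 1) * productsPerPage;
    var seen = new HashSet<string>();      // every distinct product encountered so far
    var onPage = new HashSet<string>();    // products that fall on the requested page
    var skusForPage = new List<Document>();

    foreach (var hit in hits)
    {
        var doc = searcher.Doc(hit.Doc);
        var productId = doc.Get("Product_Id");

        // First time we see this product: decide whether it belongs on this page.
        if (seen.Add(productId) && seen.Count > skip && seen.Count <= skip + productsPerPage)
            onPage.Add(productId);

        // Only materialise the full model for SKUs of products on this page.
        if (onPage.Contains(productId))
            skusForPage.Add(doc);
    }
    return skusForPage;
}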
I have an alphabetized index of people. My goal is to find what page of that index a person is listed on. For instance, "Tim Curry" might be listed on page 5 of the "T" section. Currently I'm getting the page number with ActiveRecord; Elasticsearch results are 20 per page, so I can work out the page number based on the index. But it seems wiser to get the page number directly from Elasticsearch if at all possible to ensure that I'm getting the right page. Is there a way to get this data from ES?
def page_index
  letter = self.name[0].downcase
  index = Person.where("lower(name) like ?", "#{letter}%").order("lower(name)").pluck(:id).index(self.id)
  page = index / 20 + 1
end
This functionality does not come bundled with Elasticsearch. Using the results per page and the index is the correct approach if that is the functionality you are looking for.
Since it's not clear exactly which document you need, or what overall UX you are trying to achieve, keep in mind that you can always search your index(es) for a specific document by various means (a filtered query or term on name if you need "Tim Curry", by id or _uid, etc.).
Also, ES is a full-text search engine; finding one object and its properties might be better served by a database call.
Again, this is slightly speculative, as I don't know exactly what you need or are trying to achieve overall, but finding the page of a specific result in your set of returned results is best done by taking its index within the results and doing simple math.
I am trying to solve a problem with Mahout. The problem is: we have users and courses, and a user can view a course or take a course. If a user is viewing a course frequently, then I have to recommend taking the course. I have data consisting of userid and itemid, with no preferences associated with them.
Example:
1 2
1 7
2 4
2 8
3 5
4 6
where in the first column 1 is a user ID and in the second column 2 is a course ID. The twist is that the second column can hold either a view or a completion of a particular course. Suppose course A, when viewed, has ID 2, and the same course A, when taken, has ID 7 for user 1. If a user other than user 1 comes and views course A, then I have to recommend that course A be taken. The problem here is that if all users are viewing a course but not taking it, then user-based recommendation in Mahout will fail, because from a business perspective we have to tell them that the course they are viewing should be taken. Do I need to restructure my dataset here, or which algorithm is best suited for this kind of problem?
One problem is that viewing may not predict (and certainly won't predict as well) that the user wants to take the course. You should look at the new cross-cooccurrence recommender in Mahout v1. It's part of a complete revamp of Mahout on Spark using a new Scala DSL and a built-in optimizer for linear algebra. The command-line job you are looking for is spark-itemsimilarity, and it can ingest your user and item IDs directly without translating them into cardinal non-negative numbers.
The algorithm takes the action you know you want to recommend (a user takes a course); these are the strongest "indicators" that can be used in your recommender. It then finds correlated views: views that led to the user taking that course. This is done with the spark-itemsimilarity job, which can take two actions at a time, finding correlations, filtering out noise, and producing two "indicators". From the job you get two sparse matrices; each row is an item from the "user takes a course" action dataset, and the values are an ordered list of the most similar item IDs. The first output will be items similar by other people taking the course; the second will be items similar by other people viewing and then taking the course.
Input uses application-specific IDs. You can leave your data mixed if you include a filter term that identifies the action. It looks something like:
user-id-1,item-id1,user-took-class
user-id-1,item-id2,user-viewed-class-page
user-id-1,item-id5,user-viewed-class-page
...
The output is text-delimited (think CSV, but you can control the format) and consists entirely of item-ID tokens; by default it looks like this:
item-id-1,item-id-100 item-id-200 item-id-250 ...
This is an item ID, a comma, and an ordered list of similar items separated by spaces. Index this with a search engine and use the current user's history of action 1 to query against the primary indicator and the user's history of action 2 against the secondary cross-cooccurrence indicator. These can be indexed together as two fields of the same doc, so there is only one query against two fields. This also gives you a server that is as scalable as Solr or Elasticsearch; you just create the data models with Mahout, then index and query them with a search engine.
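As a purely illustrative sketch of that single two-field query, suppose the two indicator outputs were indexed into a Solr core named indicators as fields took_indicator and viewed_indicator (all of these names are assumptions for the example, not something the job produces for you). The lookup is then just an ordinary search built from the current user's history:

using System;
using System.Net.Http;
using System.Threading.Tasks;

class RecommendQuery
{
    static async Task Main()
    {
        // The current user's history: courses taken (action 1) and viewed (action 2).
        var taken = "course-101 course-202";
        var viewed = "course-303 course-404";

        // One query against the two indicator fields of the same documents.
        var q = Uri.EscapeDataString(
            $"took_indicator:({taken}) OR viewed_indicator:({viewed})");

        using var http = new HttpClient();
        var json = await http.GetStringAsync(
            $"http://localhost:8983/solr/indicators/select?q={q}&rows=10");

        // The ranked documents that come back are the recommended course IDs.
        Console.WriteLine(json);
    }
}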
Mahout docs: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
Presentation on the theory and other things you can do with these techniques: http://www.slideshare.net/pferrel/unified-recommender-39986309
Using this technique you can take virtually the entire user clickstream, recorded as separate actions, and use it to make better recs. The actions don't even have to be on the same items. You can use the user's search-term history, for instance, and get a cross-cooccurrence indicator. In this case the output would contain search terms that led users to take the course, and so your query would be the current user's search-term history.
Basically, here is the setup:
You have a number of marketplace items and you want to sort them by price. If the cache expires while someone is browsing, they will suddenly be presented with potential duplicate entries. This seems like a really terrible public API experience, and we are looking to avoid this problem.
Some basic philosophies I have seen include:
Reddit's, in which they track the last ID seen by the client, but they still have to handle duplicates.
Will Paginate, which is a simple implementation that basically returns results based on a multiple of the number of items you want returned and an offset.
Then there are many varied solutions that involve Redis sorted sets, etc., but these also don't really solve the problem of how to remove the duplicate entries.
Does anyone have a fairly reliable way to deal with paginating sorted, dynamic lists without duplicates?
If the items you need to paginate are sorted properly (on unique values), then the only thing you need to do is select the results by that value instead of by offset.
A simple SQL example:
SELECT * FROM items ORDER BY id DESC LIMIT 10; /* page 1 */
Let's say row #10 has id = 42 (and id is the primary key):
SELECT * FROM items WHERE id < 42 ORDER BY id DESC LIMIT 10; /* page 2 */
If you are using PostgreSQL (MySQL probably has the same problem), this also solves the performance problem of using OFFSET (OFFSET N LIMIT M needs to scan past N rows!).
If the sort is not unique (e.g. sorting on a creation timestamp can lead to multiple items created at the same time), you are going to have the duplication problem.
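A common way around that last case (not spelled out above, but standard keyset-pagination practice) is to add a unique tiebreaker column, such as the primary key, to both the sort and the comparison. A minimal sketch, assuming PostgreSQL, the Npgsql driver, and an items(id, price) table:

using System;
using Npgsql;

// Values taken from the last row of the previous page.
var lastPrice = 19.99m;
var lastId = 42;

await using var conn = new NpgsqlConnection("Host=localhost;Database=marketplace");
await conn.OpenAsync();

// Row-value comparison: everything strictly "after" (lastPrice, lastId) in
// (price, id) order, so ties on price are broken by id and nothing is
// duplicated or skipped between pages.
await using var cmd = new NpgsqlCommand(
    @"SELECT id, price
        FROM items
       WHERE (price, id) > (@lastPrice, @lastId)
       ORDER BY price, id
       LIMIT 10", conn);
cmd.Parameters.AddWithValue("lastPrice", lastPrice);
cmd.Parameters.AddWithValue("lastId", lastId);

await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
    Console.WriteLine($"{reader.GetInt32(0)}  {reader.GetDecimal(1)}");

A composite index on (price, id) keeps this fast even for deep pages.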
After hearing about NoSQL for a couple of years, I finally started playing with RavenDB today in a .NET MVC app (a simple blog). Getting the embedded database up and running was pretty quick and painless.
However, I've found that after inserting objects into the document store, they are not always there when the subsequent page loads; when I refresh the page, they do show up. I read somewhere that this is due to stale indexes.
My question is: how are you supposed to use this in production on a site with inserts happening all the time (example: e-commerce)? Isn't this always going to result in stale indexes and unreliable query results?
Think of what actually happens with a traditional database like SQL Server.
When an item is created, updated, or deleted from a table, any indexes associated with that table also have to be updated.
The more indexes you have on a table, the slower your write operations will be.
If you create a new index on an existing table, it isn't used at all until it is fully built. If no other index can answer a query, then a slow table scan occurs.
If others attempt to query an existing index while it is being modified, the reader will block until the modification is complete, because Consistency is given a higher priority than Availability.
This can often lead to slow reads, timeouts, and deadlocks.
The NoSQL concept of "Eventual Consistency" is designed to alleviate these concerns. It optimizes reads by prioritizing Availability over Consistency. RavenDB is not unique in this regard, but it is somewhat special in that it still has the ability to be consistent. If you are retrieving a single document, such as reviewing an order or an end user viewing their profile, these operations are ACID-compliant and are not affected by the "eventual consistency" design.
To understand "eventual consistency", think about a typical user looking at a list of products on your web site. At the same time, the sales staff of your company is modifying the catalog, adding new products, changing prices, etc. One could argue that it's probably not super important that the list be fully consistent with these changes. After all, a user visiting the site a couple of seconds earlier would have received data without the changes anyway. The most important thing is to deliver product results quickly. Blocking the query because a write was in progress would mean a slower response time to the customer, and thus a poorer experience on your web site, and perhaps a lost sale.
So, in RavenDB:
Writes occur against the document store.
Single Load operations go directly to the document store.
Queries occur against the index store.
As documents are being written, data is being copied from the document store to the index store, for those indexes that are already defined.
At any time you query an index, you will get whatever is already in that index, regardless of the state of the copying that's going on in the background. This is why sometimes indexes are "stale".
If you query without specifying an index, and Raven needs a new index to answer your query, it will start building an index on the fly and return you some of those results right away. It only blocks long enough to give you one page of results. It then continues building the index in the background so next time you query you will have more data available.
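As a side note, an "already defined" index is just a class you register at startup. Here is a minimal sketch (the Products_ByName class and its Name field are my own example; the namespace is from the older Raven.Client API, while newer RavenDB versions use Raven.Client.Documents.Indexes):

using System.Linq;
using Raven.Client.Indexes;

public class Products_ByName : AbstractIndexCreationTask<Product>
{
    public Products_ByName()
    {
        // Only the fields you query on are copied from the document store
        // into the index store; that copying is what runs in the background.
        Map = products => from p in products
                          select new { p.Name };
    }
}

// Register once at application startup:
// IndexCreation.CreateIndexes(typeof(Products_ByName).Assembly, documentStore);

A query can then target it explicitly with session.Query<Product, Products_ByName>().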
So now let's give an example that shows the downside of this eventually consistent approach.
A sales person goes to a "products list" page that is sorted alphabetically.
On the first page, they see that "Apples" aren't currently being sold.
So they click "add product", and go to a new page where they enter "Apples".
They are then returned to the "products list" page and they still don't see any Apples because the index is stale. WTF - right?
Addressing this problem requires the understanding that not all viewers of data should be considered equal. That particular sales person might demand to see the newly added product, but a customer isn't going to know or care about it with the same level of urgency.
So on the "products list" page that the sales person is viewing, you might do something like:
var results = session.Query<Product>()
    .Customize(x => x.WaitForNonStaleResultsAsOfLastWrite())
    .OrderBy(x => x.Name)
    .Skip((pageNumber - 1) * pageSize).Take(pageSize);
While on the customer's view of the catalog, you would not want to add that customization line.
If you wanted to get super precise, you could use a slightly more optimized strategy:
When going back from the "add product" page to the "list products" page, pass along the ProductID that was just added.
Just before you query on that page, if the ProductID was passed in then change your query code to:
var product = session.Load<Product>(productId);
var etag = session.Advanced.GetEtagFor(product);

var results = session.Query<Product>()
    .Customize(x => x.WaitForNonStaleResultsAsOf(etag))
    .OrderBy(x => x.Name)
    .Skip((pageNumber - 1) * pageSize).Take(pageSize);
This will ensure that you only wait as long as absolutely necessary to get just that one product's changes included in the results list along with the other results from the index.
You could optimize this slightly by passing the etag back instead of the ProductId, but that might be less reusable from other places in your application.
But do keep in mind that if the list is sorted alphabetically, and we added "Plums" instead of "Apples", then you might not have seen these results instantly anyway. By the time the user had skipped to the page that includes that product, it would likely have been there already.
You are running into stale queries.
That is a by-design part of RavenDB. You need to make a distinction between queries (BASE) and loading by ID (ACID).
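A minimal sketch of that distinction, using the same session API as the answer above (the document ID and the Price property are just illustrative):

using (var session = documentStore.OpenSession())
{
    // ACID: loading by ID goes straight to the document store and is never stale.
    var product = session.Load<Product>("products/1");

    // BASE: a query goes through an index and may briefly return stale results,
    // unless you customize it to wait as shown earlier.
    var cheapProducts = session.Query<Product>()
        .Where(p => p.Price < 10)
        .ToList();
}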
We need to find all the courses for a user whose startDate is less than today's date and whose endDate is greater than today's date. We are using the API
/d2l/api/lp/{ver}/enrollments/myenrollments/?orgUnitTypeId=3
In one particular case I have more than 18 thousand courses against one user. The service cannot return 18 thousand records in one go; I can only get 100 records at a time, so I need to use the bookmark field to fetch data in sets of 100 records. The bookmark is the course ID of the last (100th) record that we fetched, used to get the next set of 100 records.
/d2l/api/lp/{ver}/enrollments/myenrollments/?orgUnitTypeId=3&bookmark=12528
I need to repeat the loop 180 times, which results in a "Request timed out" error.
I need to filter the records on the basis of startDate and endDate, and no sorting criterion is available that can sort the data on the basis of startDate or endDate. Can anyone help me find a way to sort this data, or suggest another API that can do this type of sorting?
Note: all 18 thousand records have the property "IsActive": true.
Rather than getting to the list of org units by user, you can try getting to the user by the list of org units. You could try using /d2l/api/lp/{ver}/orgstructure/{orgUnitId}/descendants/?ouTypeId={courseOfferingType} to retrieve the entire list of course offering IDs descended from the highest common ancestor known for the user's enrollments. You can then loop through /d2l/api/lp/{ver}/courses/{orgUnitId} to fetch back the course offering info for each one of those org units to pre-filter and cut out all the course offerings you don't care about based on dates. Then, for the ones left, you can check for the user's enrollment in each one of those to figure out which of your smaller set the user matches with.
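A rough sketch of that flow in C#: the three Fetch*/Check* helpers below are placeholders for your own HTTP calls to the routes mentioned above (they are not part of any Valence client library), and the CourseOffering shape with StartDate/EndDate is an assumption about how you deserialize the course offering JSON.

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public record CourseOffering(int OrgUnitId, string Name, DateTime? StartDate, DateTime? EndDate);

public class ActiveCourseLookup
{
    // GET /d2l/api/lp/{ver}/orgstructure/{orgUnitId}/descendants/?ouTypeId={courseOfferingType}
    Task<IReadOnlyList<int>> FetchDescendantCourseIdsAsync(int ancestorOrgUnitId)
        => throw new NotImplementedException("call the orgstructure descendants route");

    // GET /d2l/api/lp/{ver}/courses/{orgUnitId}
    Task<CourseOffering> FetchCourseOfferingAsync(int orgUnitId)
        => throw new NotImplementedException("call the course offering route");

    // One small enrollment check per remaining candidate org unit.
    Task<bool> CheckEnrollmentAsync(int userId, int orgUnitId)
        => throw new NotImplementedException("call an enrollment route");

    public async Task<List<CourseOffering>> GetActiveCoursesAsync(int userId, int ancestorOrgUnitId)
    {
        var now = DateTime.UtcNow;
        var dateFiltered = new List<CourseOffering>();

        // Pre-filter by dates before doing any per-user enrollment checks.
        foreach (var id in await FetchDescendantCourseIdsAsync(ancestorOrgUnitId))
        {
            var course = await FetchCourseOfferingAsync(id);
            if (course.StartDate < now && course.EndDate > now)
                dateFiltered.Add(course);
        }

        // Only the much smaller date-filtered set needs enrollment checks.
        var results = new List<CourseOffering>();
        foreach (var course in dateFiltered)
            if (await CheckEnrollmentAsync(userId, course.OrgUnitId))
                results.Add(course);

        return results;
    }
}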
This approach will certainly result in more calls to the service, not fewer, so it only has two advantages I can see:
You should be able to get the entire starting set of course offerings you need off the hop rather than getting it back in pages (although it's entirely possible that this call will get turned into a paging call in the future, and the "fetch all the org units at once" nature it currently has will be deprecated).
If you need to do this entire use-case for more than one user, you can fetch the org structure data once, cache it, and then only do queries for users on the subset of the data.
In the meantime, I think it's totally reasonable to request an enhancement to the enrollments calls to provide better filtering (active/inactive, start dates, end dates, etc.): I suspect that such a request might see more traction than a request to give clients control over paging (i.e. the number of responses in each page frame).