Azure Search Features - bury, elevate, blacklist

Does Azure Search provide the features below?
Bury scenario with negative boost: I was going through the Scoring profile page. It says that to bury a document, a boost factor between 0 and 1 can be given. But this is not always enough: if document_1 gets a score of 5 and document_2 a score of 2, and we give a boost factor of 0.5, document_1's score (5 × 0.5 = 2.5) is still greater than document_2's. So does Azure Search provide a negative boost?
Let's say I want to give a negative boost to a specific brand name for my products, for any search.
Does Azure Search provide elevate-IDs/exclude-IDs features like Solr does, for elevating or blacklisting products?
Solr Elevate/Blacklist feature
Does Azure Search provide the minimum-match feature of Solr's edismax parser, i.e. the minimum number of query keywords that must match for a document to appear in the results?
Solr Minimum Match
If I make any change to a scoring profile, does that mean an index rebuild is required before the profile can be used? As I can see, it's defined with the schema.

@Bikas, to your questions:
This topic discusses the idea of leveraging a negative boost.
For the elevation, I wonder if tag boosting would work for you, where you would tag the specific documents that should be elevated as needed?
There is a good discussion about this here. But no, Azure Search does not support dismax or edismax.
If you modify a scoring profile, it will be leveraged instantly by queries, and no index rebuild is required.
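To make the tag-boosting suggestion concrete, here is a minimal sketch of an index definition with a tag scoring function, sent via the Azure Search REST API from Python. The service endpoint, key, index and field names are placeholders, and the exact payload shape should be checked against the current REST reference:

    import requests

    # Placeholders: substitute your own service endpoint, admin key, and names.
    SERVICE = "https://<your-service>.search.windows.net"
    HEADERS = {"api-key": "<admin-key>", "Content-Type": "application/json"}

    index_definition = {
        "name": "products",
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},
            {"name": "name", "type": "Edm.String", "searchable": True},
            {"name": "promoTags", "type": "Collection(Edm.String)",
             "filterable": True},
        ],
        "scoringProfiles": [{
            "name": "elevateTagged",
            "functions": [{
                # Tag scoring function: documents whose promoTags field matches
                # the tag values passed at query time get their score boosted.
                "type": "tag",
                "fieldName": "promoTags",
                "boost": 5,
                "tag": {"tagsParameter": "promoted"},
            }],
        }],
    }

    # Updating a scoring profile is an index-definition change only;
    # the indexed documents themselves are not rebuilt.
    resp = requests.put(
        f"{SERVICE}/indexes/products?api-version=2020-06-30",
        headers=HEADERS,
        json=index_definition,
    )
    resp.raise_for_status()

At query time you would then pass scoringProfile=elevateTagged plus a scoringParameter naming the tag values to promote. Note this elevates; for burying, the limitation raised in the question stands, since factors below 1 only dampen a score.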

Related

Recommender System: Is it content-based filtering?

Can someone please help me clarify this?
I am currently using collaborative filtering (ALS) which returns a recommendation list with scores corresponding to the recommended items. In addition to this, I am boosting the scores (+0.1) if the items contain a tag that corresponds with what the user has specified they prefer such as "romantic movies". To me, this is considered a hybrid collaborative approach since it's boosting the Collaborative filtering results with content-based filtering (Please correct me if I am wrong).
Now, what if I did the same approach without doing collaborative filtering? Would it be considered content-based filtering, since I would still be recommending dishes based on the content and attributes of each dish, corresponding to what the user has specified they like (such as "romantic movies")?
The reason I'm confused is that I've seen content-based filtering where an algorithm such as Naive Bayes is applied, whereas this approach would be more like a simple search over the items (on their contents).
Not sure you can do what you suggest because you have no score to boost without CF.
You are indeed using a hybrid, much the same as the Universal Recommender. To do purely content-based recommendations you have to implement two methods:
Personalized recommendations: here you look at the content of items the user preferred and find items with similar content. This can be done by using something like the Mahout spark-rowsimilarity job to create a model of item: list-of-similar-items, then indexing the results with a search engine and using the user's preferred item ids as the query (a minimal sketch of the model-building step follows after this list). This is being added to the Universal Recommender.
"People who liked this also liked these": these are items similar to the one being viewed, for example, and are the same for all users. They are not personalized and so are useful even for anonymous users with no history. This can be done with the same indexed ids as above, but using the items similar to the one being viewed as the query. One might think to use only the similar items themselves, but by using them as a query you can put the categorical boost in the search-engine query and have boosted items returned. This already works in the Universal Recommender, but the similar items are not in the model yet.
That said, mixing content with collaborative filtering will almost surely give better results, since CF works better when the data is available. The only time to rely on content-based recommendations is when your catalog is of one-off items that never get enough CF interactions, or when you have rich content with a short lifetime, like breaking news.
BTW anyone who wants to help add the pure content-based part to the Universal Recommender can contact the new maintainers of it at ActionML.com

Mahout trust-aware collaborative filtering

I'm trying to develop a trust-aware collaborative filtering approach. I have two Epinions datasets. One with who trusts whom: <ID_truster, ID_trusted>. And one with ratings: <ID_truster, ITEM, RATING>.
How can I make recommendations (user-user based) using only ratings from people whom I trust?
At the moment I only make recommendations using the second dataset, taking every user into consideration.
Thank you
The closest thing I can think of is to use a user-neighborhood-based approach, and only include trusted users in the neighborhood. You would need to write some extra code for that, to disqualify untrusted users, by returning a very negative similarity value for them. Look at the UserSimilarity interface.
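Mahout's UserSimilarity is a Java interface, so the disqualifying wrapper would be written in Java there, but here is a minimal sketch of the same idea in Python, with invented in-memory data:

    from math import sqrt

    # Hypothetical data: ratings[user][item] = rating; trust[user] = trusted users.
    ratings = {
        "alice": {"item1": 5, "item2": 3},
        "bob":   {"item1": 4, "item2": 2},
        "eve":   {"item1": 1, "item2": 5},
    }
    trust = {"alice": {"bob"}}

    def similarity(u, v):
        """Cosine similarity over co-rated items; untrusted users are disqualified."""
        if v not in trust.get(u, set()):
            return -1.0  # the "very negative similarity" trick from the answer
        common = ratings[u].keys() & ratings[v].keys()
        if not common:
            return 0.0
        dot = sum(ratings[u][i] * ratings[v][i] for i in common)
        norm_u = sqrt(sum(ratings[u][i] ** 2 for i in common))
        norm_v = sqrt(sum(ratings[v][i] ** 2 for i in common))
        return dot / (norm_u * norm_v)

    print(similarity("alice", "bob"))  # trusted: real similarity
    print(similarity("alice", "eve"))  # untrusted: -1.0

With a similarity like this, neighborhood formation never picks untrusted users, which is exactly the effect described above.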

How to gauge or compare relative frequency of arbitrary words without a search engine API?

More than a few times I've wanted to programmatically pick the better of two words or phrases, using frequency of use on the Internet as a heuristic.
The obvious way, and the way to do it manually, is to enter each term into a search engine and note how many "hits" each gets.
But the big search engines have deprecated their search APIs, or limit them to 100 free queries per day even with an API key. That's not great if you're working on a free project. The big search engines also have a "no scraping" clause in their terms of service.
I need it to work for arbitrary, perhaps even unidentified languages, and from a device with limited storage. This rules out having a local corpus or database.
One area of application is tools for Wiktionary editors, helping them choose the main spelling among several variants even if they don't know the language. The one I have in mind right now uses frequency as a heuristic to help choose the best conversion between a spelling in a foreign script and a lossy transliteration in the Latin alphabet.
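One possible workaround (a suggestion, not from the question): MediaWiki wikis expose a free search API whose response includes a totalhits count, which can stand in for search-engine hit counts; since the use case is Wiktionary tooling, that corpus may even be appropriate. A minimal sketch, with the usual caveats about rate limits and corpus bias:

    import requests

    API = "https://en.wiktionary.org/w/api.php"  # any MediaWiki wiki works

    def hit_count(term: str) -> int:
        """Number of pages matching `term` via MediaWiki full-text search."""
        params = {
            "action": "query",
            "list": "search",
            "srsearch": f'"{term}"',  # quoted for phrase search
            "srlimit": 1,             # only the count is needed
            "format": "json",
        }
        r = requests.get(API, params=params, timeout=10)
        r.raise_for_status()
        return r.json()["query"]["searchinfo"]["totalhits"]

    def better_of(a: str, b: str) -> str:
        """Crude frequency heuristic: prefer the term with more hits."""
        return a if hit_count(a) >= hit_count(b) else b

    print(better_of("colour", "color"))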

How to find abnormal ids among many ids

We run an affiliate program. Users who sign up can gain points when they successfully recruit other users. However, spammers are abusing this program, and automatically signing up large numbers of accounts. We want to prevent this from happening by closing down clearly machine-generated accounts. My idea for this is to write a program to identify machine-generated account names, or at least select a subset for manual inspection.
So far, we have found that there are two types of abnormal ids:
The first type is ids that look very similar to others, such as:
wss12345
wss12346
wss12347
test1
test2
...
The second type is ids that look randomly generated, without any rules, such as:
MiDjiSxxiDekiE
NiMjKhJixLy
DAFDAB7643
...
For the first type, I use the Levenshtein (edit) distance. This method can find the kind of ids illustrated in type 1. (I have done this, and I get good performance.)
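For concreteness, a minimal sketch of that pairwise edit-distance screen (the threshold value is a guess and would need tuning):

    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    ids = ["wss12345", "wss12346", "wss12347", "test1", "test2", "normaluser"]

    # Flag pairs within a small edit distance as near-duplicates.
    THRESHOLD = 2
    suspicious = {
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if levenshtein(a, b) <= THRESHOLD
    }
    print(suspicious)  # the wss* and test* pairs come out; normaluser does not

For real volumes you would not compare all pairs; sorting or blocking by prefix first keeps it tractable.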
For the second type, I can calculate the probability of each id, like this:
id = "DAFDAB7643"
p(id) = p(D)*p(A|D)*p(F|A)*p(D|F)*...*p(3|4)
So I can use the probability to filter out the abnormal ids. (Just an idea; I haven't tried it out.)
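A minimal sketch of that character-bigram idea, assuming a training list of known-good usernames (the names and cutoff here are invented) and add-one smoothing so unseen transitions don't zero the product:

    from collections import defaultdict
    from math import log

    # Hypothetical training data: usernames believed to be human-created.
    good_ids = ["john_doe", "mary88", "soccerfan", "alice", "bobsmith"]

    # Count character transitions to estimate p(next_char | prev_char).
    counts = defaultdict(lambda: defaultdict(int))
    for uid in good_ids:
        for prev, nxt in zip(uid, uid[1:]):
            counts[prev][nxt] += 1

    ALPHABET_SIZE = 96  # roughly printable ASCII; an assumption

    def avg_log_prob(uid: str) -> float:
        """Length-normalized log p(id), add-one smoothed."""
        total = 0.0
        for prev, nxt in zip(uid, uid[1:]):
            row = counts[prev]
            total += log((row[nxt] + 1) / (sum(row.values()) + ALPHABET_SIZE))
        return total / max(len(uid) - 1, 1)

    for uid in ["johnny", "MiDjiSxxiDekiE"]:
        print(uid, round(avg_log_prob(uid), 2))  # lower = more random-looking

Ids scoring below some empirically chosen cutoff would go to manual inspection.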
Can anyone give me other suggestions about this topic? How else could I approach this problem? Can you see flaws or omissions in my attempts?
1. Assuming that these new accounts refer back to the recruiter's ID, I'd look at the rate and/or sheer number of new accounts associated with a given recruiter.
2. Some analysis of IP addresses or similar may also indicate whether multiple users are coming from the same computer.
3. I'd use a dictionary of words and essentially do the reverse of detecting poor passwords -- human user names should contain dictionary words or personal names, lack punctuation, not include repeated characters, be mostly lower case, etc. (a rough sketch follows below).
4. Sort of going back to 1. above -- if a recruiter has an anomalously tight cluster of IDs, using the features you've already identified, that would be a good flag. I think this might be, essentially, #larsmans' comment directly under the question.
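A rough sketch of item 3, the reversed password check (the word list and the individual checks are invented for illustration):

    import re

    # Tiny stand-in for a real dictionary / personal-name list.
    WORDS = {"test", "john", "mary", "soccer", "fan", "smith", "alice", "bob"}

    def looks_human(uid: str) -> bool:
        """Reverse of a password-strength check: humans pick 'weak' names."""
        lowered = uid.lower()
        has_word = any(w in lowered for w in WORDS)
        mostly_lower = sum(c.islower() for c in uid) >= 0.8 * len(uid)
        simple_chars = re.fullmatch(r"[a-z0-9_]+", lowered) is not None
        return has_word and mostly_lower and simple_chars

    for uid in ["test1", "MiDjiSxxiDekiE", "DAFDAB7643"]:
        print(uid, looks_human(uid))  # True, False, False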
I'd be curious to know if re-purposing password checking algorithms (item 3) provides any benefit.
You're not telling us what sort of site you are running, so this is a bit on the speculative side; but consider Stack Overflow as a prime example of successfully promoting good behavior through the use of a user reputation system, and weeding out many kinds of unwanted behaviors.
A quick, hackish fix might be to progressively deduct from the score as the number of dormant recruit accounts grows, but a more rewarding and compelling fix is to award higher reputation scores for actually contributing to the site's content. However, this depends on the type of site you have; a stock market tips site, say, obviously works quite differently from a technical discussion forum.

Dynamic business rules engine for ruby on rails

I have an application which will require a "dynamic business rules" engine. Some of the business rules change very frequently. Some of them apply only to a limited set of business accounts. For example: my customer has a process where they qualify stores based on their size, number of salespeople, number of products, location, etc. But they manage different accounts, and each account gives different "weights" to each attribute.
How do I implement this engine using Ruby? I know Java has Drools, but I find Drools annoying and complex. And I'd prefer not to use JRuby...
Regards,
Rubem
If you're sure a rule engine is what you need, you will need to find one you can use in Ruby. A quick Google search brought up Rools (http://rools.rubyforge.org/) and Ruby Rules (http://xircles.codehaus.org/projects/ruby-rules). I'm not sure of the status of either project though. Using JRuby with Drools might be your best bet but then again, I'm a Java developer and a big Drools advocate. :)
Without knowing all the details, it's a little hard to say how that should be implemented. It also depends on how you want the rules to be updated. One approach is to write a collection of rules similar to this: "if a store exists with more than 50 sales people and the store hasn't had its weight updated to reflect that, then update the store's weight." However, in some way you could compare that to hardcoding.
A better approach might be to create Weight objects with criteria that need to be met for the weight to apply. Then you could write one rule that matches on both Weights and Stores: "if a Store exists that matches a Weight's criteria and the Store doesn't already have that Weight assigned to it, then add the Weight to the Store." Then the business folks could just create and update Weights, possibly in a web front-ended database, instead of maintaining rules. A minimal sketch of the pattern follows below.
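The question asks about Ruby and the answer sketches the design in the abstract; this Python sketch of the Weight-criteria pattern (class and attribute names invented) is just one way to make it concrete, and it ports directly to Ruby:

    from dataclasses import dataclass, field
    import operator

    OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "==": operator.eq}

    @dataclass
    class Store:
        name: str
        attrs: dict                  # e.g. {"sales_people": 60, "location": "NY"}
        weights: set = field(default_factory=set)

    @dataclass
    class Weight:
        name: str
        value: float
        criteria: list               # data rows like ("sales_people", ">", 50)

        def matches(self, store):
            return all(OPS[op](store.attrs.get(attr), val)
                       for attr, op, val in self.criteria)

    # The single generic rule: assign any matching, not-yet-assigned Weight.
    def apply_weights(stores, weights):
        for store in stores:
            for w in weights:
                if w.name not in store.weights and w.matches(store):
                    store.weights.add(w.name)

    stores = [Store("MegaMart", {"sales_people": 60, "num_products": 900})]
    weights = [Weight("big_sales_team", 1.5, [("sales_people", ">", 50)])]
    apply_weights(stores, weights)
    print(stores[0].weights)  # {'big_sales_team'}

Because each criterion is a plain data row, business users can maintain Weights in a database table instead of code, which is the point of this design.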
