I went through tutorials on implementing recommender systems, and most of them take a single variable (a rating/rank).
I want to implement an item-based recommender system that takes multiple variables.
E.g.: let's say an item (a bar) has the following variables (values ranging from -10 to +10 to express opposite polarities):
- price (cheap to expensive)
- environment (casual to fine)
- age range (young to adults)
Now I want to recommend items (bars) based on the list of bars registered in a user's history.
Is this kind of "multi-dimensional recommender system" possible to implement using Mahout or any other framework?
You want the multi-modal, multi-indicator, multi-variable (however you want to describe it) Universal Recommender. It can handle all this data. We've tested it on real datasets and see a significant boost in precision tests because of what we call "secondary indicators".
Good intuition. Give the UR a look: blog.actionml.com (check out the slides in one of the posts). Code here: https://github.com/actionml/template-scala-parallel-universal-recommendation/tree/v0.3.0 It's built on the new Spark version of Mahout: http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
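For intuition only, here is a minimal, hypothetical sketch of the multi-variable item matching described in the question (plain item-attribute similarity, not the Universal Recommender, which learns from usage events such as the "secondary indicators" mentioned above): represent each bar by its attribute vector and recommend the bars closest to the centroid of the user's history. All names and values below are invented.

import numpy as np

# Each bar described by (price, environment, age_range), all on a -10..+10 scale.
bars = {
    "dive_bar":   np.array([-8.0, -7.0, -5.0]),
    "wine_bar":   np.array([ 6.0,  7.0,  4.0]),
    "sports_bar": np.array([-2.0, -4.0, -1.0]),
    "rooftop":    np.array([ 7.0,  5.0,  2.0]),
}

def cosine(a, b):
    # Cosine similarity between two attribute vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def recommend(history, candidates, top_n=2):
    # Score candidate bars by similarity to the centroid of the user's history.
    profile = np.mean([candidates[name] for name in history], axis=0)
    scored = [(name, cosine(profile, vec))
              for name, vec in candidates.items() if name not in history]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_n]

print(recommend(["dive_bar", "sports_bar"], bars))

In practice you would also weight and normalize the attributes; the point is just that item-to-item similarity over several variables is straightforward to compute.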
Background
I'm writing a Swift application that requires the classification of user events by categories. These categories can be things like:
Athletics
Cinema
Food
Work
However, I have a fixed list of these categories and do not wish to create any more than the minimum I believe is needed to classify any type of event.
Question
Is there a machine learning (NLP) procedure that does the following?
Takes a block of text (in my case, a description of an event).
Creates a "percentage match" to each possible classification.
For instance, suppose the description of an event is as follows:
Fun, energetic bike ride for people of all ages.
The algorithm to which this description is passed would return an object that looks something like this:
{
  athletics: 0.8,
  cinema: 0.1,
  food: 0.06,
  work: 0.04
}
where the value of each key in the object is a confidence score.
If anyone can guide me in the right direction (or even send some general resources or solutions specific to iOS dev), I'd be super appreciative!
You are talking about a typical classification model. I believe iOS offers APIs to do this inside your app; look for the natural language processing (NLP) part.
Also, you are probably being downvoted because this forum typically aims to solve specific programming queries rather than generic ones (this is an assumption; there could be another reason for the downvotes).
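Not iOS-specific, but to make the classification idea concrete, here is a minimal scikit-learn sketch that returns a confidence per category. The training sentences are invented and far too few for real use; on iOS, Apple's Natural Language / Create ML tooling supports training a comparable text classifier.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set: one example per category (you would need many more).
train_texts = [
    "marathon training run and weekly soccer practice",    # athletics
    "late night movie premiere at the downtown theater",    # cinema
    "tasting menu and street food festival this weekend",   # food
    "quarterly planning meeting with the project team",     # work
]
train_labels = ["athletics", "cinema", "food", "work"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

# Score a new event description: predict_proba gives one confidence per category.
event = "Fun, energetic bike ride for people of all ages."
probs = clf.predict_proba(vectorizer.transform([event]))[0]
print(dict(zip(clf.classes_, probs.round(2))))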
How do people deal with problems where the legal actions differ between states? In my case I have about 10 actions total and the legal actions do not overlap, meaning that in certain states the same 3 actions are always legal, and those actions are never legal in other types of states.
I'm also interested in seeing whether the solutions would be different if the legal actions were overlapping.
For Q-learning (where my network gives me the values for state/action pairs), I was thinking maybe I could just be careful about which Q value to choose when I'm constructing the target value (i.e., instead of choosing the max, I choose the max among legal actions...).
For policy-gradient methods I'm less sure what the appropriate setup is. Is it okay to just mask the output layer when computing the loss?
There are two closely related works from the last two years:
[1] Boutilier, Craig, et al. "Planning and learning with stochastic action sets." arXiv preprint arXiv:1805.02363 (2018).
[2] Chandak, Yash, et al. "Reinforcement Learning When All Actions Are Not Always Available." AAAI. 2020.
Currently this problem does not seem to have one universal, straightforward answer. Maybe because it is not that much of an issue?
Your suggestion of choosing the best Q value among legal actions is actually one of the proposed ways to handle this. For policy-gradient methods you can achieve a similar result by masking the illegal actions and properly renormalizing the probabilities of the remaining actions.
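As a rough, framework-free sketch of both tricks, assuming you already have a legal-action mask per state (all numbers below are illustrative):

import numpy as np

q_values = np.array([1.2, -0.3, 0.8, 2.5])       # network output for the next state
legal    = np.array([True, False, True, False])   # legal-action mask for that state

# Q-learning: take the max over legal actions only when building the TD target.
reward, gamma = 1.0, 0.99
td_target = reward + gamma * q_values[legal].max()

# Policy gradient: mask the logits before the softmax so illegal actions get
# probability 0 and the remaining actions are renormalized automatically.
logits = np.array([0.4, 1.1, -0.2, 0.7])
masked_logits = np.where(legal, logits, -1e9)     # large negative value stands in for -inf
exp = np.exp(masked_logits - masked_logits.max())
policy = exp / exp.sum()

print(td_target, policy.round(3))

The large negative stand-in for minus infinity means illegal actions receive essentially zero probability after the softmax, so the gradient never pushes mass onto them.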
Another approach would be giving a negative reward for choosing an illegal action, or ignoring the choice and making no change in the environment, returning the same reward as before. In one of my own experiments (Q-learning) I chose the latter; the agent learned what it had to learn, but it used the illegal actions as a 'no-op' from time to time. That wasn't really a problem for me, but negative rewards would probably eliminate this behaviour.
As you see, these solutions don't change or differ when the actions are 'overlapping'.
Answering what you asked in the comments: I don't believe you can train the agent under the described conditions without it learning the legal/illegal action rules. That would require, for example, something like a separate network for each set of legal actions, which doesn't sound like the best idea (especially if there are lots of possible legal action sets).
But is the learning of these rules hard?
You have to answer some questions yourself: is the condition that makes an action illegal hard to express/articulate? It is, of course, environment-specific, but I would say it is usually not that hard to express, and agents simply learn the rules during training. If it is hard, does your environment provide enough information about the state?
Not sure if I understand your question correctly, but if you mean that in certain states some actions are impossible, then you simply reflect it in the reward function (a big negative value). You can even decide to end the episode if it is not clear what state the illegal action would result in. The agent should then learn that those actions are not desirable in the specific states.
In exploration mode, the agent might still choose to take the illegal actions. However, in exploitation mode it should avoid them.
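A hypothetical sketch of that reward-shaping idea; the wrapper, its penalty value and the env methods legal_actions() and current_state() are all assumptions, not part of any particular RL library:

class IllegalActionPenaltyWrapper:
    """Intercepts illegal actions, applies a penalty, and leaves the state unchanged."""

    def __init__(self, env, penalty=-10.0):
        self.env = env
        self.penalty = penalty

    def step(self, action):
        # Assumes the wrapped env exposes legal_actions() and current_state().
        if action not in self.env.legal_actions():
            # No state change, no episode end; the agent just sees a big negative reward.
            return self.env.current_state(), self.penalty, False
        return self.env.step(action)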
I recently built a DDQ agent for connect-four and had to address this. Whenever a column was chosen that was already full with tokens, I set the reward equivalent to losing the game. This was -100 in my case and it worked well.
In connect four, allowing an illegal move (effectively skipping a turn) can in some cases be advantageous for the player. This is why I set the reward equivalent to losing and not a smaller negative number.
So if you set the penalty to something milder than losing (i.e., a reward greater than the losing reward), you'll have to consider what the implications are in your domain of allowing illegal moves to happen during exploration.
{
  "blogid": 11,
  "blog_authorid": 2,
  "blog_content": "(this is blog complete content: html encoded on base64 such as) PHNlY3Rpb24+PGRpdiBjbGFzcz0icm93Ij4KICAgICAgICA8ZGl2IGNsYXNzPSJjb2wtc20tMTIiIGRhdGEtdHlwZT0iY29udGFpbmVyLWNvbnRlbn",
  "blog_timestamp": "2018-03-17 00:00:00",
  "blog_title": "Amazon India Fashion Week: Autumn-",
  "blog_subtitle": "",
  "blog_featured_img_link": "link to image",
  "blog_intropara": "Introductory para to article",
  "blog_status": 1,
  "blog_lastupdated": "\"Mar 19, 2018 7:42:23 AM\"",
  "blog_type": "Blog",
  "blog_tags": "1,4,6",
  "blog_uri": "Amazon-India-Fashion-Week-Autumn",
  "blog_categories": "1",
  "blog_readtime": "5",
  "ViewsCount": 0
}
Above is one sample blog as per my API. I have a JsonArray of such blogs.
I am trying to predict 3 similar blogs based on a blog's properties (e.g. tags, categories, author, keywords in the title/subtitle) and contents. I have no user data, i.e. there is no logged-in-user data (such as ratings or reviews). I know that without user data it will not be accurate, but I'm just getting started with data science/ML. Any suggestion/link is appreciated. I prefer Java, but Python, PHP or any other language also works for me. I need an easy-to-implement model as I am a beginner. Thanks in advance.
My intuition is that this question might not be at the right address.
BUT
I would do the following:
Create a dataset of sites that serves as an inventory from which to predict. For each site you will need to list one or more features: number of tags, number of posts, average time between posts in days, etc.
Sounds like this is for training and you are not worried about accuracy too much; numeric features should suffice.
Work back from a k-NN algorithm. Don't worry about the classifiers. Instead of classifying a blog, you list the 3 closest neighbors (k = 3). A good implementation of the algorithm is here. Have fun simplifying it for your purposes.
Your algorithm should be a step or two shorter than k-NN, which is considered to be among the simpler ML methods and a good place to start.
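For example, a stripped-down sketch of that neighbor listing over a few numeric features (the blogs, features and values are invented, and in practice you would scale the features first):

import math

blogs = {
    "blog_11": [3, 120, 2.5],   # [num_tags, num_posts, avg_days_between_posts]
    "blog_12": [5,  40, 7.0],
    "blog_13": [2, 110, 3.0],
    "blog_14": [4,  35, 6.5],
    "blog_15": [3, 100, 2.0],
}

def euclidean(a, b):
    # Straight-line distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def three_nearest(target, items):
    # Skip the classification step entirely: just list the 3 closest neighbors (k = 3).
    others = [(name, euclidean(items[target], feats))
              for name, feats in items.items() if name != target]
    return sorted(others, key=lambda x: x[1])[:3]

print(three_nearest("blog_11", blogs))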
Good luck.
EDIT:
You want to build a recommender engine using text, tags, numeric and maybe time-series data. This is a broad request. Just like you, when faced with this request, I'd need to dive into the data and research the best approach. Some approaches require different sets of data, e.g. collaborative vs. content-based filtering.
A few things may have been missed on the user side that can be used as a sort of rating: you do not need a login feature to get information; cookie ID or IP-based DMA, geo and viewing duration should be available to the web server.
On the blog side: you need to process the texts to identify related terms. I gave examples of other blog features above.
I am aware that this is a lot of hand-waving, but there's no actual code question here. To reiterate, my intuition is that this question might not be at the right address.
I really want to help but this is the best I can do.
EDIT 2:
If I understand your new comments correctly, each blog has the following for each other blog:
A Jaccard similarity coefficient.
A set of TF-IDF-generated words with scores.
A Euclidean distance based on numeric data.
I would create a heuristic from these and allow the process to adjust the importance of each statistic.
The challenge would be to quantify the word-score TF-IDF output. You can treat the words (over a certain score) as tags and run another similarity analysis, or count the overlap.
You have already started on this path, and this answer assumes you are going to continue. IMO the best path is to see which dedicated recommender engines can help you without constructing statistics piecemeal (numeric with Euclidean, tags with Jaccard, text with TF-IDF).
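If you do continue piecemeal, a tiny sketch of such a weighted heuristic might look like this (the weights, the distance-to-similarity conversion and the example statistics are all hypothetical and would need tuning):

def combined_score(jaccard, tfidf_cosine, euclidean_dist,
                   w_tags=0.4, w_text=0.4, w_numeric=0.2):
    # Jaccard and cosine are already in [0, 1]; turn the distance into a
    # similarity in (0, 1] so that larger is always better.
    numeric_sim = 1.0 / (1.0 + euclidean_dist)
    return w_tags * jaccard + w_text * tfidf_cosine + w_numeric * numeric_sim

# Hypothetical precomputed (jaccard, tfidf_cosine, euclidean) per candidate blog.
pair_stats = {
    "blog_12": (0.50, 0.62, 3.1),
    "blog_13": (0.25, 0.80, 1.4),
    "blog_14": (0.10, 0.20, 6.0),
    "blog_15": (0.66, 0.55, 2.2),
}
ranked = sorted(pair_stats, key=lambda b: combined_score(*pair_stats[b]), reverse=True)
print(ranked[:3])   # the 3 most similar blogs under this heuristic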
We plan to use Mahout for a movie recommendation system, and we also plan to use SVD for model building.
When a new user comes we will require him/her to rate a certain number of movies (say 10).
The problem is that, in order to make a recommendation to this new user we have to rebuild the entire model again.
Is there a better way to do this?
Thanks
Yes... though not in Mahout. The implementations there are by nature built around periodic reloading and rebuilding of a data model. Some implementations still let you use new data on the fly, like the neighborhood-based ones. I don't think the SVD-based in-memory one does this (I didn't write it).
In theory, you can start making recommendations from the very first click or rating by projecting the new user's ratings into the feature space via fold-in. To greatly simplify: if your rank-k approximate factorization of the input A is Ak = Uk * Sk * Vk', then for a new user u you want a new row Uk_u for your update. You have A_u.
Uk = Ak * (Vk')^-1 * (Sk)^-1. The good news is that those two inverses on the right are trivial: (Vk')^-1 = Vk because it has orthonormal columns, and (Sk)^-1 is just a matter of taking the reciprocal of Sk's diagonal elements.
So Uk_u = Ak_u * (Vk')^-1 * (Sk)^-1. You don't have Ak_u, but you have A_u, which is approximately the same, so you use that.
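Purely to illustrate the algebra (this is not the Mahout implementation), a minimal numpy sketch of that fold-in on a tiny made-up ratings matrix:

import numpy as np

# Rows = users, columns = movies, 0 = unrated. Values are invented.
A = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [1., 0., 5., 4.],
              [0., 1., 4., 5.]])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Vk, Sk = Vt[:k].T, np.diag(s[:k])            # A ~= Uk @ Sk @ Vk.T

# A new user rates a few movies; fold that row into the user-feature space:
# Uk_u = A_u @ Vk @ inv(Sk)  (the "trivial inverses" from above).
a_u = np.array([5., 0., 0., 1.])
u_k = a_u @ Vk @ np.diag(1.0 / s[:k])

# Reconstruct the new user's predicted ratings without refitting the whole model.
predicted = u_k @ Sk @ Vk.T
print(predicted.round(2))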
If you like Mahout, and like matrix factorization, I suggest you consider the ALS algorithm. It's a simpler process, so is faster (but makes the fold-in a little harder -- see the end of a recent explanation I gave). It works nicely for recommendations.
This also exists in Mahout, though the fold-in isn't implemented. Myrrix, which is where I am continuing work from Mahout, implements all of this.
I'm doing a university project that must gather and combine data on a user-provided topic. The problem I've encountered is that Google search results for many terms are polluted with low-quality auto-generated pages, and if I use them, I can end up with wrong facts. How is it possible to estimate the quality/trustworthiness of a page?
You may think, "nah, Google engineers have been working on this problem for 10 years and he's asking for a solution", but if you think about it, a search engine must provide up-to-date content, and if it marks a good page as a bad one, users will be dissatisfied. I don't have such limitations, so if the algorithm accidentally marks some good pages as bad, that wouldn't be a problem.
Here's an example:
Say the input is buy aspirin in south la. Try a Google search for it. The first 3 results have already been deleted from their sites, but the fourth one is interesting: radioteleginen.ning.com/profile/BuyASAAspirin (I don't want to make an active link).
Here's the first paragraph of the text:
The bare of purchasing prescription drugs from Canada is big in the U.S. at this moment. This is because in the U.S. prescription drug prices bang skyrocketed making it arduous for those who bang limited or concentrated incomes to buy their much needed medications. Americans pay more for their drugs than anyone in the class.
The rest of the text is similar, and then a list of related keywords follows. This is what I think is a low-quality page. While this particular text sort of makes sense (even though it's horrible), the other examples I've seen (but can't find now) are just rubbish whose purpose is to get some users from Google and get banned a day after creation.
N-gram Language Models
You could try training one n-gram language model on the autogenerated spam pages and one on a collection of other non-spam webpages.
You could then simply score new pages with both language models to see if the text looks more similar to the spam webpages or regular web content.
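As a toy illustration of the two-model idea, here is a hand-rolled word-bigram model with add-one smoothing trained on invented snippets; for real use you would train on large corpora with a proper toolkit (see Tools below):

import math
from collections import Counter

def train_bigram(texts):
    # Count unigrams and bigrams over lower-cased, whitespace-tokenized text.
    unigrams, bigrams = Counter(), Counter()
    for t in texts:
        words = ["<s>"] + t.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def log_prob(text, model, vocab_size):
    # Add-one smoothing: P(cur|prev) = (count(prev,cur) + 1) / (count(prev) + V)
    unigrams, bigrams = model
    words = ["<s>"] + text.lower().split()
    return sum(math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
               for prev, cur in zip(words, words[1:]))

spam_model = train_bigram(["buy cheap prescription drugs online now",
                           "cheap drugs buy now limited incomes"])
ham_model = train_bigram(["the museum opens a new exhibition this week",
                          "local council approves new cycling routes"])
vocab = len(set(w for m in (spam_model, ham_model) for w in m[0]))

page_text = "buy cheap drugs online"
print(log_prob(page_text, spam_model, vocab), log_prob(page_text, ham_model, vocab))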
Better Scoring through Bayes Law
When you score a text with the spam language model, you get an estimate of the probability of finding that text on a spam web page, P(Text|Spam). The notation reads as the probability of Text given Spam (page). The score from the non-spam language model is an estimate of the probability of finding the text on a non-spam web page, P(Text|Non-Spam).
However, the term you probably really want is P(Spam|Text) or, equivalently P(Non-Spam|Text). That is, you want to know the probability that a page is Spam or Non-Spam given the text that appears on it.
To get either of these, you'll need to use Bayes Law, which states
P(A|B) = P(B|A) P(A) / P(B)
Using Bayes law, we have
P(Spam|Text)=P(Text|Spam)P(Spam)/P(Text)
and
P(Non-Spam|Text)=P(Text|Non-Spam)P(Non-Spam)/P(Text)
P(Spam) is your prior belief that a page selected at random from the web is a spam page. You can estimate this quantity by counting how many spam web pages there are in some sample, or you can even use it as a parameter that you manually tune to trade off precision and recall. For example, giving this parameter a high value will result in fewer spam pages being mistakenly classified as non-spam, while giving it a low value will result in fewer non-spam pages being accidentally classified as spam.
The term P(Text) is the overall probability of finding Text on any webpage. If we ignore that P(Text|Spam) and P(Text|Non-Spam) were determined using different models, this can be calculated as P(Text)=P(Text|Spam)P(Spam) + P(Text|Non-Spam)P(Non-Spam). This sums out the binary variable Spam/Non-Spam.
Classification Only
However, if you're not going to use the probabilities for anything else, you don't need to calculate P(Text). Rather, you can just compare the numerators P(Text|Spam)P(Spam) and P(Text|Non-Spam)P(Non-Spam). If the first one is bigger, the page is most likely a spam page, while if the second one is bigger the page is most likely non-spam. This works because the equations above for both P(Spam|Text) and P(Non-Spam|Text) are normalized by the same P(Text) value.
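Continuing the sketch above, the comparison itself is only a few lines if you work with log-probabilities (the scores and the prior here are hypothetical):

import math

def classify(log_p_text_given_spam, log_p_text_given_ham, p_spam=0.2):
    # Compare the numerators in log space so long texts don't underflow.
    spam_score = log_p_text_given_spam + math.log(p_spam)
    ham_score = log_p_text_given_ham + math.log(1.0 - p_spam)
    return "spam" if spam_score > ham_score else "non-spam"

# e.g. with hypothetical language-model scores of -41.7 (spam LM) and -48.3 (non-spam LM):
print(classify(-41.7, -48.3, p_spam=0.2))

The prior p_spam is the tunable knob described earlier for trading off precision against recall.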
Tools
In terms of software toolkits you could use for something like this, SRILM would be a good place to start and it's free for non-commercial use. If you want to use something commercially and you don't want to pay for a license, you could use IRST LM, which is distributed under the LGPL.
Define the 'quality' of a web page. What is the metric?
If someone were looking to buy fruit, then searching for 'big sweet melons' would give many results containing images of a 'non-textile' slant.
The markup and hosting of those pages may, however, be sound engineering...
But the page of a dirt farmer presenting his high-quality, tasty and healthy produce might be visible only in IE 4.5, since the HTML is 'broken'...
For each result in a keyword query's result set, do a separate Google query to find the number of sites linking to that site; if no other site links to it, then exclude it. I think this would be a good start, at least.
If you are looking for performance-related metrics, then YSlow [a plugin for Firefox] could be useful.
http://developer.yahoo.com/yslow/
You can use a supervised learning model to do this type of classification. The general process goes as follows (a rough code sketch follows the list):
Get a sample set for training. This will need to provide examples of the documents you want to cover. The more general you want to be, the larger the example set you need. If you just want to focus on websites related to aspirin, that shrinks the necessary sample set.
Extract features from the documents. This could be the words pulled from the website.
Feed the features into a classifier such as the ones provided in MALLET or WEKA.
Evaluate the model using something like k-fold cross validation.
Use the model to rate new websites.
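MALLET and WEKA are Java toolkits; purely to illustrate the workflow above, here is an equivalent scikit-learn sketch covering all five steps with made-up training pages:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Step 1: a (toy) sample set of labeled documents.
pages = [
    "buy cheap prescription drugs online no license needed",    # spam
    "cheap pills fast shipping buy now discount aspirin",        # spam
    "best deals buy drugs online limited time offer",            # spam
    "aspirin dosage guidelines published by the health agency",  # good
    "pharmacy opening hours and prescription refill policy",     # good
    "study on aspirin use for cardiovascular prevention",        # good
]
labels = ["spam", "spam", "spam", "good", "good", "good"]

# Steps 2 and 3: word features plus a classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Step 4: k-fold cross-validation (k kept small because the toy set is tiny).
scores = cross_val_score(model, pages, labels, cv=3)

# Step 5: fit on everything and rate a new website's text.
model.fit(pages, labels)
print(scores.mean(), model.predict(["discount aspirin buy online now"]))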
When you talk about not caring if you mark a good site as a bad site, this relates to recall. Recall measures, of the ones you should have gotten back, how many you actually got back. Precision measures, of the ones you marked as 'good' or 'bad', how many were correct. Since you state that your goal is to be more precise and that recall isn't as important, you can tweak your model to have higher precision.