Find similar items based on item attributes - mahout

Most of the recommendation algorithms in Mahout require user-item preferences. But I want to find similar items for a given item, and my system doesn't have any user input. For example, for a movie these attributes could be used to compute a similarity coefficient:
Genre
Director
Actor
The attribute list may be modified in the future to build a more effective system. But to compute item similarity with a Mahout DataModel, a user preference for each item is required. Instead, these movies could be clustered, and the closest items in the cluster returned for a given item.
Later on, after introducing user-based recommendation, the above result can be used to boost the results.
If a product attribute has a fixed set of values, like Genre, do I have to convert those values to numerical values? If yes, how will the system calculate the distance between two items when genre-1 and genre-2 don't have any numeric relation?
Edit:
I have found a few examples that use the command line, but I want to do it in Java and save the pre-computed values for later use.

I think that for feature vectors like these, the best similarity measures are the ones based on exact matches, for example Jaccard similarity.
With Jaccard, the similarity between two item vectors is calculated as:
(number of features in the intersection) / (number of features in the union).
So converting the genre to a numerical value will not make a difference, since the exact match (which is what determines the intersection) works the same way on non-numerical values.
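As a minimal sketch of the Jaccard calculation described above (plain Python, not Mahout's API), each movie can be represented as a set of attribute tokens. The movie names, the "field:value" token format, and the helper names `jaccard` and `most_similar` are assumptions made for illustration only.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty attribute sets
    return len(a & b) / len(union)


# Hypothetical catalogue: every movie is a set of "field:value" tokens,
# so categorical values never need a numeric encoding.
movies = {
    "movie_a": {"genre:action", "director:d1", "actor:x", "actor:y"},
    "movie_b": {"genre:action", "director:d2", "actor:x"},
    "movie_c": {"genre:drama", "director:d3", "actor:z"},
}


def most_similar(item, catalogue, top_n=2):
    """Rank all other items by Jaccard similarity to `item`."""
    scores = [
        (other, jaccard(catalogue[item], features))
        for other, features in catalogue.items()
        if other != item
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


print(most_similar("movie_a", movies))
```

Prefixing each value with its field name keeps a director and an actor with the same name from being counted as a match. The ranked lists could be computed once for the whole catalogue and stored for later lookups.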
Take a look at this question for how to do it in mahout:
Does Mahout provide a way to determine similarity between content (for content-based recommendations)?

It sounds like Mahout's spark-rowsimilarity algorithm, available since version 0.10.0, would be the perfect solution to your problem. It compares the rows of a given matrix (i.e., row vectors representing movies and their properties), looking for cooccurrences of values across those rows, or in your case cooccurrences of Genres, Directors, and Actors. No user history or item interaction is needed. The end result is another matrix mapping each of your movies to the top n most similar other movies in your collection, based on cooccurrence of genre, director, or actor.
The Apache Mahout site has a great write-up regarding how to do this from the command line, but if you want a deeper understanding of what's going on under the covers, read Pat Ferrel's machine learning blog Occam's Machete. He calls this type of similarity content or metadata similarity.

Related

Hybrid recommendation system with matrix factorization and linear regression

I'm following a tutorial for creating a recommendation system in BigQuery ML. The tutorial first uses matrix factorization to calculate user and item factors. In the end I have a model that can be queried with user ids or item ids to get recommendations.
The next step is feeding the factors and additional item + user features into a linear regression model to incorporate more context.
"Essentially, we have a couple of attributes about the movie, the
product factors array corresponding to the movie, a couple of
attributes about the user, and the user factors array corresponding to
the user. These form the inputs to our “hybrid” recommendations model
that builds off the matrix factorization model and adds in metadata
about users and movies."
I just don't understand why the dataset for linear regression excludes the user and item ids:
SELECT
p.* EXCEPT(movieId),
u.* EXCEPT(userId),
rating
FROM productFeatures p, userFeatures u
JOIN movielens.ratings r
ON r.movieId = p.movieId AND r.userId = u.userId
My question is:
How will I be able to get recommendations for a user from the linear model, when I don't have the user or item ids in the model?
Here you can find the full code:
https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive2/recommendation_systems/solutions/als_bqml_hybrid.ipynb
In the example you have shared, the goal is to fit a linear regression to the discovered factor values so that a novel set of factor values can be used to predict the rating. In this kind of setup, you don't want information about which samples are being used; the only crucial information is the training features (the factor scores) and the rating (the training/test label). For more on this topic, take a look at "Dimensionality reduction using non-negative matrix factorization for information retrieval."
If you included the movie ids and user ids as features, your regression would try to learn from them, which would either add noise to the model or teach it something spurious, such as low ids = lower score. This is possible, especially if the ids are in some kind of order you're not aware of, such as chronological or by genre.
Note: You could use movie-specific or user-specific information to build a model, but you would have many, many dimensions of data, and that tends to create poorly performing models. The idea here is to avoid the problem of dimensionality by first reducing the dimensionality of the problem space. Matrix factorization is just one method among many to do this. See, for example, PCA, LDA, and word2vec.
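To connect this back to the question of getting recommendations without ids in the model, here is a minimal sketch in plain Python (not BigQuery ML). It assumes the ids are used only as lookup keys to assemble a feature row, which is then scored. The factor tables, the `toy_predict` weights, and all function names are made up for illustration; a real model would learn its weights from the ratings.

```python
# Hypothetical factor tables, as they might come out of the matrix factorization step.
user_factors = {"u1": [0.9, 0.1], "u2": [0.2, 0.8]}
item_factors = {"m1": [1.0, 0.0], "m2": [0.0, 1.0], "m3": [0.5, 0.5]}


def feature_row(user_id, movie_id):
    """The ids select the rows; only the factor values become model inputs."""
    return item_factors[movie_id] + user_factors[user_id]


def toy_predict(row):
    """Stand-in for the trained regression; the weights and bias are invented."""
    weights = [1.0, 2.0, 0.5, 0.5]
    bias = 0.0
    return bias + sum(w * x for w, x in zip(weights, row))


def recommend(user_id, predict, top_n=2):
    """Score every candidate movie for one user and keep the best ones."""
    scored = [(m, predict(feature_row(user_id, m))) for m in item_factors]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


print(recommend("u1", toy_predict))
```

The id never reaches `toy_predict`; it only decides which factors (and, in the tutorial's setup, which user and movie attributes) end up in the row.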

Is it possible to cluster data with grouped rows of data in unsupervised learning?

I am working to set up data for an unsupervised learning algorithm. The goal of the project is to group (cluster) different customers together based on their behavior on the website. Obviously, some sort of clustering algorithm is best for discovering patterns in the data that we can't see as humans.
However, the database contains multiple rows for each customer (in chronological order), one for each action the customer took on the website during that visit. For example, the customer with ID# 123 clicked on page 1 at time X, and that is one row in the database; then the same customer clicked another page at time Y, which makes another row.
My question is: what algorithm or approach would you use for clustering in this scenario? K-means is really popular for this type of problem, but I don't know if it's possible to use it in this situation because of the grouping. Is it somehow possible to do cluster analysis around one specific ID that spans multiple rows?
Any help or direction on which unsupervised learning approach to take is appreciated.
In short:
(1) Learn a fixed-length embedding (representation) of each event;
(2) Learn a way to combine a sequence of such embeddings into a single representation for each customer, then use your favorite unsupervised method.
For (1), you can either do it manually or use an encoder/decoder.
For (2), there is a range of things you can do, from simply averaging the embeddings of the events to training an encoder-decoder to reconstruct the original sequence of events and taking the intermediate representation (the one the decoder uses to reconstruct the original sequence).
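A minimal sketch of the simplest variant of the two steps above: a manual one-hot embedding per event for (1), and averaging per customer for (2). The click log, the page names, and the helper names are assumptions for illustration, and the averaging discards the order of events.

```python
from collections import defaultdict

# Hypothetical click log: one row per (customer_id, page) event, in time order.
events = [
    (123, "home"), (123, "pricing"), (123, "pricing"),
    (456, "home"), (456, "blog"),
]

# Fixed vocabulary of event types, so every embedding has the same length.
pages = sorted({page for _, page in events})


def embed(page):
    """Step (1), done manually: a one-hot vector per event."""
    return [1.0 if p == page else 0.0 for p in pages]


def customer_vectors(rows):
    """Step (2), simplest variant: average the event embeddings per customer."""
    grouped = defaultdict(list)
    for customer_id, page in rows:
        grouped[customer_id].append(embed(page))
    return {
        cid: [sum(column) / len(vectors) for column in zip(*vectors)]
        for cid, vectors in grouped.items()
    }


vectors = customer_vectors(events)
print(vectors)
```

Each customer now has exactly one fixed-length row, which is the shape that k-means and most other clustering algorithms expect.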
A good read on this topic (though a bit old; you now also have the option of Transformer networks):
Representations for Language: From Word Embeddings to Sentence Meanings

Can a list of websites be considered a corpus for a particular category?

I am trying to build my own corpus for particular categories such as Engineering, Business, Math, Science, etc. This will be used for automatic web page categorization. Let's say I manually collect 100 websites that are related to Math. Can these 100 websites be considered a corpus for Math?
Another related question: how does this differ from a lexicon, which instead of a list of websites contains a list of words with weights such as 0 or 1 for particular categories? An example would be a sentiment lexicon with words that have weights for positive and negative, but with categories such as Math and Science used instead of positive and negative.
You say you want to do web page categorization, so the problem you're facing is a supervised learning problem. The data you get are web pages, so I guess you actually extract their content as text, which means you work with textual input data. Since you want to categorize the pages, each input has one or more corresponding labels, which are the outputs you want to predict. Because a page can have several labels, you want to do multi-label classification.
To tackle this problem, since most machine learning algorithms work with numerical vectors, you need to transform your corpus of texts into vectors (or into one matrix). To do so, you can use the bag-of-words technique, which first builds a dictionary (or lexicon) and then counts the occurrences of each dictionary word in each text. You can transform your output labels in the same way, assigning one index of your output vector to each category.
The final pipeline would be something like this:
[input_text] --bag_of_words--> [input_vector] --prediction--> [output_vector] --label_matching--> [labels]
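The two encoding ends of that pipeline can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a very simple tokenizer (lowercase ASCII letters only); the sample texts, category names, and function names are invented for the example, and the prediction step in the middle is left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(texts):
    """The dictionary (lexicon): every distinct word seen in the corpus."""
    return sorted({token for text in texts for token in tokenize(text)})


def bag_of_words(text, vocabulary):
    """Count how often each dictionary word occurs in one text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def encode_labels(labels, categories):
    """Output vector: one index per category, 1 if the page has that label."""
    return [1 if category in labels else 0 for category in categories]


texts = ["Algebra and geometry", "Geometry of markets and markets"]
vocabulary = build_vocabulary(texts)
print(vocabulary)
print(bag_of_words(texts[1], vocabulary))
print(encode_labels({"Math"}, ["Business", "Math"]))
```

The manually collected websites play the role of `texts` here; the lexicon is derived from them rather than being an alternative to them.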

How to classify text with Knime

I'm trying to classify some data using KNIME with the KNIME Labs deep learning plugin.
I have about 16,000 products in my DB, but I only know the category of about 700 of them.
I'm trying to classify as many as possible using some DM (data mining) technique. I've downloaded some plugins for KNIME, so I now have some deep learning tools as well as some text tools.
Here is my workflow, I'll use it to explain what I'm doing:
I'm transforming the product name into a vector, then feeding that into the learner.
Then I train a DL4J learner with DeepMLP (I don't really understand it all; it was the one I thought gave me the best results). Then I try to apply the model to the same data set.
I thought I would get a result with the predicted classes. But I'm getting a column with output_activations that appears to contain a pair of doubles. When sorting by this column I get some related data close to each other, but I was expecting to get the classes.
Here is a print of the result table, here you can see the output with the input.
In the column selection it takes just the converted_document, with des_categoria selected as the Label Column (learner node config). And in the Predictor node I checked "Append SoftMax Predicted Label?".
nom_produto is the text column that I'm trying to use to predict the des_categoria column, which is the product category.
I'm really new to DM and DL. If you could help me solve what I'm trying to do, that would be awesome. Also feel free to suggest some learning material about what I'm attempting to achieve.
PS: I also tried to apply it to the unclassified data (17,000 products), but I got the same result.
I won't answer with a workflow on this one because it is not going to be a simple one. However, be sure to find the text mining example on the KNIME server, i.e. the one that makes use of the bag of words approach.
The task
Product mapping to categories should be a straight-forward data mining task because the information that explains the target variable is available in a quasi-exhaustive manner. Depending on the number of categories to train though, there is a risk that you might need more than 700 instances to learn from.
Some resources
Here are some resources, only the first one being truly specialised in text mining:
(1) Introduction to Information Retrieval, in particular chapter 13;
(2) Data Science for Business is an excellent introduction to data mining, including text mining (chapter 10); also do not forget the chapter about similarity (chapter 6);
(3) Machine Learning with R has the advantage of being accessible enough (chapter 4 provides an example of text classification with R code).
Preprocessing
First, you will have to preprocess your product labels a bit. Use KNIME's text analytics preprocessing nodes for that purpose, after you've transformed the product labels with Strings to Document:
Case Convert, Punctuation Erasure and Snowball Stemmer;
you probably won't need Stop Word Filter; however, there may be quasi-stop words such as "product", which you may need to remove manually with Dictionary Filter;
be careful not to use any of the following without testing their impact first: N Chars Filter (g may be a useful word), Number Filter (numbers may indicate quantities, which may be useful for classification).
Should you encounter any trouble with the relevant nodes (e.g. Punctuation Erasure can be surprisingly tricky because of the tokenizer), you can always apply String Manipulation with regex before converting the Strings to Document.
Keep it short and simple: the lookup table
You could build a lookup table based on the 700 training instances. The book Data Mining Techniques as well as resource (2) present this approach in some detail. If a model performs worse than the lookup table, you should abandon that model.
Nearest neighbors
Neural networks are probably overkill for this task.
Start with a K Nearest Neighbor node (applying a string distance such as Cosine, Levenshtein or Jaro-Winkler). This approach requires the least amount of data wrangling. At the very least, it will provide an excellent baseline model, so it is most definitely worth a shot.
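To make the idea concrete outside of KNIME, here is a minimal sketch of nearest-neighbor classification over product names with the Levenshtein edit distance, in plain Python. It does not reproduce the KNIME node; the product names, categories, and function names are invented for illustration.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def knn_predict(name, labelled, k=3):
    """Majority vote among the k labelled names closest to `name`."""
    nearest = sorted(labelled, key=lambda pair: levenshtein(name, pair[0]))[:k]
    votes = Counter(category for _, category in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled products (name, category).
labelled = [
    ("milk 1l", "dairy"), ("milk 2l", "dairy"), ("yogurt", "dairy"),
    ("nails 5cm", "hardware"), ("nails 3cm", "hardware"), ("hammer", "hardware"),
]

print(knn_predict("milk 3l", labelled))
```

The value of `k` and the choice of distance are exactly the two things that need tuning on real data.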
You'll need to tune the parameter k and to experiment with the distance types. The Parameter Optimization Loop pair will help you with optimizing k; you can include a Cross-Validation meta node inside that loop to obtain a cross-validated estimate of the expected performance for a given k, instead of only a single point estimate per value of k. Use Cohen's Kappa as the optimization criterion, as proposed by resource (3) and available via the Scorer node.
After the parameter tuning, you'll have to evaluate the relevance of your model using yet another Cross-Validation meta node, then follow up with a Loop pair including Scorer to calculate descriptive statistics on the performance metric(s) per iteration, and finally use Statistics. Kappa is a convenient metric for this task because the target variable consists of many product categories.
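For reference, Cohen's Kappa compares the observed agreement between predicted and actual categories with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below follows that standard definition in plain Python; it is not the Scorer node's implementation, and the function name and sample labels are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(actual, predicted):
    """Cohen's Kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_freq = Counter(actual)
    predicted_freq = Counter(predicted)
    # Chance agreement: product of the marginal frequencies, summed over categories.
    expected = sum(
        actual_freq[c] * predicted_freq[c] for c in actual_freq
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # convention chosen here when only a single category occurs
    return (observed - expected) / (1 - expected)


actual = ["a", "a", "b", "b"]
predicted = ["a", "a", "b", "a"]
print(cohens_kappa(actual, predicted))
```

In this example three of four predictions are correct (accuracy 0.75), but half of that agreement is expected by chance, so kappa is lower than the accuracy.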
Don't forget to test its performance against the lookup table.
What next?
Should the lookup table or k-NN work well for you, then there's nothing else to add.
Should either of those approaches fail, you might want to analyse the precise cases on which it fails. In addition, the training set may be too small, so you could manually classify another few hundred or thousand instances.
If, after increasing the training set size, you are still dealing with a bad model, you can try the bag of words approach together with a Naive Bayes classifier (see chapter 13 of the Information Retrieval reference). There is no room here to elaborate on the bag of words approach and Naive Bayes, but you'll find the resources above useful for that purpose.
One last note. Personally, I find KNIME's Naive Bayes node to perform poorly, probably because it does not implement Laplace smoothing. However, KNIME's R Learner and R Predictor nodes will allow you to use R's e1071 package, as demonstrated by resource (3).
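To show what Laplace smoothing changes, here is a minimal multinomial Naive Bayes sketch over word tokens in plain Python. It is neither KNIME's nor e1071's implementation; the tiny training set, the `alpha` parameter name, and the function names are assumptions for illustration. Words not seen during training are simply skipped at prediction time, which is one of several possible choices.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, alpha=1.0):
    """docs: list of (tokens, category). alpha=1.0 is Laplace (add-one) smoothing."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, category in docs:
        class_counts[category] += 1
        word_counts[category].update(tokens)
        vocabulary.update(tokens)
    return {"classes": class_counts, "words": word_counts,
            "vocab": vocabulary, "alpha": alpha}


def predict(model, tokens):
    """Return the category with the highest log-posterior for the tokens."""
    total_docs = sum(model["classes"].values())
    vocab_size = len(model["vocab"])
    best, best_score = None, -math.inf
    for category, n_docs in model["classes"].items():
        counts = model["words"][category]
        denominator = sum(counts.values()) + model["alpha"] * vocab_size
        score = math.log(n_docs / total_docs)
        for token in tokens:
            if token in model["vocab"]:  # words never seen in training are skipped
                score += math.log((counts[token] + model["alpha"]) / denominator)
        if score > best_score:
            best, best_score = category, score
    return best


docs = [
    (["whole", "milk"], "dairy"),
    (["milk", "yogurt"], "dairy"),
    (["steel", "nails"], "hardware"),
]
model = train_naive_bayes(docs)
print(predict(model, ["milk", "nails"]))
```

Without the `alpha` term, a word that never occurred in a category during training would get probability zero there and rule that category out regardless of the other words, which hurts most with small training sets.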

Is it Item based or content based Collaborative filtering?

I am currently working on an existing system that recommends items that are similar to previous items that the user has liked.
It uses Alternating Least Squares collaborative filtering to find feature vectors of users and items. It then takes the feature vectors of the items and uses the cosine similarity measure to find items similar to a given one.
However, I would like some clarification as to whether this is item-based CF or content-based filtering. My inclination is that it is both, since it uses a similarity measure to compare items, but the comparison is based on the content of the feature vectors?
Thanks,
If I understand correctly that you extract the feature vectors for the items from users-like-items data, then it is pure item-based CF.
To be content-based filtering, features of the item itself would have to be used: for example, if the items are movies, content-based filtering should use features such as the length of the movie or its director, not features derived from other users' preferences.
I guess your inclination is right: you are combining both content-based and collaborative filtering. If you are using content-based filtering, then the vectors of items and users can be termed the x_i's of your data (the data points), whereas A_ij, the cell in the input array stating what rating user i has given to item j, can be termed y_i.
You are using cosine similarity to find item-item and user-user similarity.
I guess in your scenario you should go for collaborative filtering.
Try to build the item-item matrix and then calculate the cosine similarity.
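A minimal sketch of that last step in plain Python, assuming the item feature vectors have already been produced by ALS. The vectors, item ids, and function names below are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical item feature vectors, e.g. the item factors learned by ALS.
item_factors = {"i1": [1.0, 0.0], "i2": [2.0, 0.0], "i3": [0.0, 3.0]}


def item_item_matrix(factors):
    """Cosine similarity between every pair of items, as a nested dict."""
    return {
        a: {b: cosine(vec_a, vec_b) for b, vec_b in factors.items()}
        for a, vec_a in factors.items()
    }


similarity = item_item_matrix(item_factors)
print(similarity["i1"])
```

Because cosine similarity ignores vector length, `i1` and `i2` count as identical here even though their magnitudes differ.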
