ALS methods - train, trainImplicit and fit - machine-learning

What is the difference between als.train(), als.fit(), and als.trainImplicit()?

First of all, we should know the difference between implicit and explicit feedback.
Explicit preference (also referred to as "explicit feedback"), such as a "rating" given to an item by users.
Implicit preference (also referred to as "implicit feedback"), such as "view" and "buy" history.
For a better understanding you can look at the two links below:
Why does ALS.trainImplicit give better predictions for explicit ratings?
https://stats.stackexchange.com/questions/133565/how-to-set-preferences-for-als-implicit-feedback-in-collaborative-filtering
train and trainImplicit are used in the mllib package, which works on RDD data. For Spark DataFrames, Spark has a newer module named ml. The ml package uses Spark DataFrames for calculating ratings, and its training method is called fit. The fit method in ml uses matrix factorization; for more detail, check the docs for the ALS (ml) class.
https://github.com/apache/spark/blob/926e3a1efe9e142804fcbf52146b22700640ae1b/python/pyspark/ml/recommendation.py
Also, the ml module is faster than mllib.
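As a hedged sketch of the API difference (assuming a running SparkContext sc and SparkSession spark; the tiny ratings and hyperparameters are invented for illustration):
from pyspark.mllib.recommendation import ALS as MLlibALS, Rating
from pyspark.ml.recommendation import ALS as MlALS

# mllib (RDD-based): explicit feedback uses train(), implicit uses trainImplicit()
ratings_rdd = sc.parallelize([Rating(0, 0, 4.0), Rating(0, 1, 2.0), Rating(1, 1, 3.0)])
explicit_model = MLlibALS.train(ratings_rdd, rank=10, iterations=10)
implicit_model = MLlibALS.trainImplicit(ratings_rdd, rank=10, iterations=10)

# ml (DataFrame-based): one fit() method; implicit feedback is just a parameter
ratings_df = spark.createDataFrame([(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0)],
                                   ["user", "item", "rating"])
als = MlALS(rank=10, maxIter=10, implicitPrefs=False,  # set True for implicit data
            userCol="user", itemCol="item", ratingCol="rating")
model = als.fit(ratings_df)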
What's the difference between Spark ML and MLLIB packages

Related

Strategies to assign specific weights to training instances

I am working on a machine learning classification model in which the user can provide labeled instances that should help improve the model.
More relevance needs to be given to the latest instances provided by the user than to those that were previously available for training.
In particular, I am developing my machine learning models in Python using scikit-learn.
So far I've only found oversampling particular instances as a possible solution to the problem. With this strategy I would create multiple copies of the instances I want to give higher relevance to.
Another strategy I've found, which does not seem to help under these conditions, is:
Strategies that focus on assigning weights to each class. This approach is widely supported in libraries like scikit-learn, but it generalizes the idea to the class level and doesn't help me put focus on particular instances.
I've looked for strategies that provide specific weights for individual instances, but most of what I found focused on class-level rather than instance-level weights.
I read some suggestions to multiply the loss function by some factor for particular instances in TensorFlow models, but this seems mostly applicable to neural network models in TensorFlow.
I wonder if anyone has information on other approaches that might help with this problem.
I've looked for strategies that provide specific weights for individual instances, but most of what I found focused on class-level rather than instance-level weights.
This is not accurate; most scikit-learn classifiers provide a sample_weight argument in their fit methods, which does exactly that. For example, here is the documentation reference for Logistic Regression:
sample_weight : array-like, shape (n_samples,) optional
Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.
Similar arguments exist for most scikit-learn classifiers, e.g. decision trees, random forests, etc., and even for linear regression (not a classifier). Be sure to check the SVM: Weighted samples example in the docs.
The situation is roughly similar for other frameworks; see for example my own answer in Is there in PySpark a parameter equivalent to scikit-learn's sample_weight?
What's more, scikit-learn also provides a utility function to compute sample_weight in cases of imbalanced datasets: sklearn.utils.class_weight.compute_sample_weight
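As a short sketch (standard scikit-learn API; the random data and the weight value 5.0 are invented for illustration), up-weighting the user's newest instances looks like this:
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = rng.randint(0, 2, size=100)

# Suppose the last 20 rows are the user's newest labeled instances.
weights = np.ones(100)
weights[-20:] = 5.0  # give the latest instances five times the influence

clf = LogisticRegression()
clf.fit(X, y, sample_weight=weights)
For imbalanced data, the weights returned by compute_sample_weight('balanced', y) can simply be multiplied with recency weights like these.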

Figuring out why scikit-learn DecisionTreeClassifier decided to exclude a feature from the resulting decision tree?

I'm using scikit-learn's DecisionTreeClassifier to construct a decision tree for a particular feature set. To my surprise, a feature that was thought to be significant was excluded.
Is there a way to take a peek under the hood, and figure out why the algorithm chose to exclude that feature?
Or really, get more information / analytics about any part of the decision-tree construction process?
Regarding the ignored feature, it's hard to tell why without more detail, but I can suggest "playing" with the sample_weight argument to change the weight each sample gets: by up-weighting the samples where that feature is informative, you give the feature more influence on the splits. You can read an excellent explanation here.
Also, for debugging, there is a way to save an image of the trained tree, as demonstrated in the documentation:
The export_graphviz exporter supports a variety of aesthetic options, including coloring nodes by their class (or value for regression) and using explicit variable and class names if desired. IPython notebooks can also render these plots inline using the Image() function:
from IPython.display import Image
from sklearn import tree
import pydotplus

dot_data = tree.export_graphviz(clf, out_file=None,  # clf: the trained classifier
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
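Beyond visualization, a quick way to check why a feature was excluded is the fitted tree's feature_importances_ attribute; here is a hedged sketch on the iris data (used only as a stand-in for your feature set):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# An importance of 0.0 means the feature was never chosen for a split,
# typically because other features gave a better impurity reduction everywhere.
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(name, importance)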

What is the difference between Hashing vectorizer and Count vectorizer, when each to be used?

I am trying various SVM variants in scikit-learn along with CountVectorizer and HashingVectorizer. Different examples use fit or fit_transform, and I am confused about which to use when.
Any clarification would be much appreciated.
They serve a similar purpose. The documentation lists some pros and cons of the HashingVectorizer:
This strategy has several advantages:
it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory
it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):
there is no way to compute the inverse transform (from feature indices to string feature names), which can be a problem when trying to introspect which features are most important to a model.
there can be collisions: distinct tokens can be mapped to the same feature index. However, in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
no IDF weighting as this would render the transformer stateful.
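To make the fit/fit_transform question concrete, here is a small sketch (standard scikit-learn API; the toy documents are made up). CountVectorizer is stateful: fit() learns the vocabulary and transform() applies it, while fit_transform() does both in one call on the training data. HashingVectorizer is stateless, so its fit() is a no-op and transform() alone is enough:
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["the cat sat", "the dog barked"]

cv = CountVectorizer()
X_counts = cv.fit_transform(docs)   # learn the vocabulary, then encode the docs
print(cv.get_feature_names_out())   # the vocabulary is recoverable (scikit-learn >= 1.0)

hv = HashingVectorizer(n_features=2 ** 18)
X_hashed = hv.transform(docs)       # no state to fit; indices come from hashing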

machine learning from words found in text

I would like to use a supervised machine learning algorithm to predict a binary function (true or false) for a set of sentences based on the presence or absence of words in the sentences.
Ideally, I would like to avoid having to hardcode the set of words used to decide on the output so that the algorithm automatically learns which words are (together ?) most likely to trigger specific outputs.
Programming Collective Intelligence (http://shop.oreilly.com/product/9780596529321.do) has a nice section in chapter 4 titled "Learning From Clicks" which describes how to do this using one layer of hidden nodes in a neural network, with one new hidden node for each new combination of input words.
Similarly, it is possible to create a feature for each word in the training data set and train pretty much any classic machine learning algorithm on these features. Adding new training data will generate new features, which will require me to retrain the algorithm from scratch.
Which brings me to my questions:
is it actually a problem if I have to retrain everything from scratch whenever the training data set is extended?
what kind of algorithm would more experienced machine learning users recommend for this kind of problem?
what criteria should I use to pick one algorithm over another? (other than actually trying them all and seeing which performs better with precision/recall metrics)
if you have worked on similar problems, what about extending the features with 2-grams (1 if a specific 2-gram is present, 0 if not)? 3-grams?
You could look into the general area of topic modelling if you want to find words which are generally found together.
The most simple approach would be to use latent semantic analysis ( http://en.wikipedia.org/wiki/Latent_semantic_analysis ), which is just applying SVD to a term document matrix. You'd then need to do some additional post hoc analysis to fit this to your particular outcome.
A more involved, and much more complex approach would be to use latent dirichlet allocation ( http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation )
In terms of just adding new features (words), that is fine as long as you are going to retrain. You can also use TF-IDF to give each word a value when representing the matrix (instead of just a 1 or 0).
I don't know what programming language you are trying to do this in, but I know there are libraries in Java and Python that do all of the above.
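As a rough sketch (scikit-learn; the toy sentences and labels are invented for the example), the word/2-gram features, TF-IDF weighting, and LSA mentioned above can be chained in one pipeline:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = ["this product is great", "terrible, do not buy",
             "works great", "do not bother"]
labels = [True, False, True, False]

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigram and 2-gram features, TF-IDF weighted
    TruncatedSVD(n_components=2),         # LSA: truncated SVD of the term-document matrix
    LogisticRegression(),
)
pipeline.fit(sentences, labels)           # retraining from scratch is just refitting this
print(pipeline.predict(["great product"]))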

A few implementation details for a Support-Vector Machine (SVM)

In a particular application I needed machine learning (I know the things I studied in my undergraduate course). I used Support Vector Machines and got the problem solved. It's working fine.
Now I need to improve the system. Problems here are
I get additional training examples every week. Right now the system starts training from scratch with the updated examples (old examples + new examples). I want to make it incremental learning: using previous knowledge (instead of previous examples) with new examples to get a new model (knowledge).
Right now my training examples have 3 classes, so every training example is fitted into one of these 3 classes. I want the functionality of an "Unknown" class: anything that doesn't fit these 3 classes must be marked as "unknown". But I can't treat "Unknown" as a new class and provide examples for it too.
Assuming the "unknown" class is implemented: when the class is "unknown", the user of the application inputs what he thinks the class might be. Now I need to incorporate the user input into the learning. I have no idea how to do this either. Would it make any difference if the user inputs a new class (i.e. a class that is not already in the training set)?
Do I need to choose a new algorithm or Support Vector Machines can do this?
PS: I'm using libsvm implementation for SVM.
I just wrote my Answer using the same organization as your Question (1., 2., 3).
Can SVMs do this--i.e., incremental learning? Multi-Layer Perceptrons of course can, because subsequent training instances don't affect the basic network architecture; they just cause adjustments in the values of the weight matrices. But SVMs? It seems to me that (in theory) one additional training instance could change the selection of the support vectors. But again, I don't know.
I think you can solve this problem quite easily by configuring LIBSVM in one-against-many mode--i.e., as a cascade of one-class-versus-rest classifiers. At heart an SVM is a binary classifier; applying an SVM to a multi-class problem means it has been coded to perform multiple, step-wise one-against-many classifications, but the algorithm is still trained (and tested) one class at a time. If you do this, then what's left after step-wise execution against the test set is "unknown"--in other words, whatever data is not classified after performing multiple, sequential one-class classifications is by definition in that 'unknown' class.
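As a hedged sketch of that one-class idea (substituting scikit-learn's libsvm-backed OneClassSVM for raw LIBSVM; the data and the nu value are invented for illustration), you can train a novelty detector on the known classes and route its outliers to 'unknown':
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_known = rng.rand(100, 4)            # training data drawn from the known classes

detector = OneClassSVM(nu=0.05).fit(X_known)

x_new = rng.rand(1, 4) * 10           # far outside the training distribution
if detector.predict(x_new)[0] == -1:  # -1 flags an outlier / novel sample
    print("unknown")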
Why not make the user's guess a feature (i.e., just another input variable)? The only other option is to make it the class label itself, and you don't want that. So you would, for instance, add a column to your data matrix "user class guess" and populate it with some value most likely to have no effect for those data points not in the 'unknown' category (and therefore for which the user will not offer a guess); this value could be '0' or '1', but really it depends on how you have your data scaled and normalized.
Your first item will likely be the most difficult, since there are essentially no good incremental SVM implementations in existence.
A few months ago, I also researched online or incremental SVM algorithms. Unfortunately, the current state of implementations is quite sparse. All I found was a Matlab example, OnlineSVR (a thesis project only implementing regression support), and SVMHeavy (only binary class support).
I haven't used any of them personally. They all appear to be at the "research toy" stage. I couldn't even get SVMHeavy to compile.
For now, you can probably get away with doing periodic batch training to incorporate updates. I also use LibSVM, and it's quite fast, so it should be a good substitute until a proper incremental version is implemented.
I also don't think SVMs can model the concept of an "unknown" sample by default. They typically work as a series of boolean classifiers, so a sample always ends up being positively classified as something, even if that sample is drastically different from anything seen previously. A possible workaround would be to model the ranges of your features, randomly generate samples that fall outside of these ranges, and then add these to your training set.
For example, if you have an attribute called "color", which has a minimum value of 4 and a maximum value of 123, then you could add these to your training set
[({'color':3},'unknown'),({'color':125},'unknown')]
to give your SVM an idea of what an "unknown" color means.
There are algorithms to train an SVM incrementally, but I don't think libSVM implements this. I think you should consider whether you really need this feature. I see no problem with your current approach, unless the training process is really too slow. If it is, could you retrain in batches (i.e. after every 100 new examples)?
You can get libSVM to produce probabilities of class membership. I think this can be done for multiclass classification, but I'm not entirely sure about that. You will need to decide some threshold at which the classification is not certain enough and then output 'Unknown'. I suppose something like setting a threshold on the difference between the most likely and second most likely class would achieve this.
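A hedged sketch of that thresholding idea (using scikit-learn's libsvm-backed SVC instead of calling libSVM directly; the data and the 0.2 threshold are invented for illustration):
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train = rng.rand(60, 4)
y_train = rng.choice(["A", "B", "C"], size=60)

clf = SVC(probability=True).fit(X_train, y_train)  # enables predict_proba

def classify(x, threshold=0.2):
    proba = clf.predict_proba([x])[0]
    top, second = np.sort(proba)[::-1][:2]
    if top - second < threshold:  # best class not clearly ahead of the runner-up
        return "Unknown"
    return clf.classes_[np.argmax(proba)]

print(classify(rng.rand(4)))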
I think libSVM scales to any number of new classes. The accuracy of your model may well suffer by adding new classes, however.
Even though this question is probably out of date, I feel obliged to give some additional thoughts.
Since your first question has been answered by others (there is no production-ready SVM which implements incremental learning, even though it is possible), I will skip it. ;)
Adding 'Unknown' as a class is not a good idea. The reasons differ depending on how it is used.
If you are using the 'Unknown' class as a tag for "this instance has not been classified, but belongs to one of the known classes", then your SVM is in deep trouble. The reason is that libsvm builds several binary classifiers and combines them. So if you have three classes - let's say A, B and C - the SVM builds the first binary classifier by splitting the training examples into "classified as A" and "any other class". The latter will obviously contain all examples from the 'Unknown' class. When trying to build a hyperplane, examples in 'Unknown' (which really belong to the class 'A') will probably cause the SVM to build a hyperplane with a very small margin, and it will poorly recognize future instances of A, i.e. its generalization performance will diminish. That's due to the fact that the SVM will try to build a hyperplane which separates most instances of A (those officially labeled 'A') onto one side of the hyperplane and some instances (those officially labeled 'Unknown') onto the other side.
Another problem occurs if you are using the 'Unknown' class to store all examples whose class is not yet known to the SVM. For example, the SVM knows the classes A, B and C, but you recently got example data for two new classes D and E. Since these examples are not classified and the new classes are not known to the SVM, you may want to temporarily store them in 'Unknown'. In that case the 'Unknown' class may cause trouble, since it possibly contains examples with enormous variation in the values of its features. That will make it very hard to create good separating hyperplanes, and therefore the resulting classifier will poorly recognize new instances of D or E as 'Unknown'. The classification of new instances belonging to A, B or C will probably be hindered as well.
To sum up: Introducing an 'Unknown' class which contains examples of known classes or examples of several new classes will result in a poor classifier. I think it's best to ignore all unclassified instances when training the classifier.
I would recommend that you solve this issue outside the classification algorithm. I was asked for this feature myself and implemented a single webpage which shows an image of the object in question and a button for each known class. If the object in question belongs to a class which is not yet known, the user can fill out another form to add a new class. If he goes back to the classification page, another button for that class will magically appear. After the instances have been classified, they can be used for training the classifier. (I used a database to store the known classes and to record which example belongs to which class. I implemented an export function to make the data SVM-ready.)
