Strange predictions using SVD in Mahout

I'm trying to build an SVDRecommender using Mahout. The code is simple:
DataModel model = new FileDataModel(new File("C:\\data.csv"));
SVDRecommender recommender = new SVDRecommender(model, new SVDPlusPlusFactorizer(model, 10, 20));
All my ratings are doubles between 0 and 1. However, in most cases the recommender predicts values above 1. How can this happen? Is it a feature of the SVD algorithm?

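SVDRecommender approximates the ratings matrix as the product of two smaller matrices (user factors and item factors). That product is only fitted to the known ratings, and nothing constrains its entries to the original rating range, so predictions can fall outside [0, 1].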
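A minimal sketch of why this happens, in plain Python rather than Mahout. The factor values below are hypothetical, chosen only for illustration, and `predict` and `clip` are illustrative helper names, not Mahout API. Clipping to the known rating range is one common workaround, not something the original answer prescribes.

```python
def predict(user_factors, item_factors):
    """Predicted rating as the dot product of two latent factor vectors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))


def clip(value, low=0.0, high=1.0):
    """Force a prediction back into the known rating range."""
    return max(low, min(high, value))


# Hypothetical factors: every individual value lies in [0, 1],
# yet the dot product does not.
user = [0.9, 0.8]
item = [0.9, 0.7]

raw = predict(user, item)   # 0.9*0.9 + 0.8*0.7, roughly 1.37
bounded = clip(raw)         # 1.0
```

If out-of-range predictions are a problem downstream, clipping the estimate after the fact is the simplest option; ranking recommendations by the raw score is unaffected either way.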

Related

XGBoost feature importance (TFIDF + TruncatedSVD)

I have an XGBoost model that runs TF-IDF vectorization and TruncatedSVD reduction on text features. I want to understand the model's feature importance.
This is how I process text features in my dataset:
.......
tfidf = TfidfVectorizer(tokenizer=tokenize)
tfs = tfidf.fit_transform(token_dict)
svd = TruncatedSVD(n_components=15)
temp = pd.DataFrame(svd.fit_transform(tfs))
temp.rename(columns=lambda x: text_feature+'_'+str(x), inplace=True)
dataset=dataset.join(temp,how='inner')
.......
It works okay-ish, and now I'm trying to understand the importance of the features in the dataset. I generate the charts using:
xgb.plot_importance(model, max_num_features=15)
pyplot.show()
And I get something similar to this chart.
What would be the right way to "map" the importance of the SVD dimensions back to the dimensions of the initial dataset? I'd like to know the importance of summary as a whole, not of summary_1, summary_2, summary_X.
Thanks
One thing you can try is measuring how much each original feature contributes to the new SVD features. You can get that with the following:
feature_importance_scores = np.abs(svd.components_).sum(axis=0)
feature_importance_scores /= feature_importance_scores.sum() # normalize to make it more clear
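To get an overall importance, weight each row of np.abs(svd.components_) by the importance XGBoost assigns to the corresponding SVD feature (for example model.feature_importances_ on a scikit-learn style estimator) before summing. The scores above have one entry per TF-IDF term rather than one per SVD component, so they cannot be multiplied element by element with the XGBoost importances.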
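A rough sketch of both mappings in plain Python, under some assumptions: `components` stands in for `svd.components_.tolist()`, `component_importance` for the importances XGBoost gives the SVD columns, and `column_importance` for a name-to-score dictionary such as the one plotted by the importance chart. The function names and toy numbers are hypothetical. Treat the result as a heuristic, since absolute loadings only approximate how much a term drives a component.

```python
def term_importance(components, component_importance):
    """Spread each SVD component's importance over the original terms.

    components: one row per SVD component, one column per original term.
    component_importance: one importance value per SVD component.
    """
    scores = [0.0] * len(components[0])
    for row, weight in zip(components, component_importance):
        for j, loading in enumerate(row):
            scores[j] += weight * abs(loading)
    total = sum(scores)
    return [s / total for s in scores]  # normalize so the scores sum to 1


def grouped_importance(column_importance, separator="_"):
    """Sum importances of columns like summary_0, summary_1 into 'summary'."""
    totals = {}
    for column, value in column_importance.items():
        prefix, _, suffix = column.rpartition(separator)
        # Only strip a numeric suffix; other column names are kept as-is.
        key = prefix if prefix and suffix.isdigit() else column
        totals[key] = totals.get(key, 0.0) + value
    return totals


# Toy data: 2 SVD components over 3 original terms.
components = [[1.0, 0.0, 0.0], [0.0, -3.0, 1.0]]
per_term = term_importance(components, [0.5, 0.25])

per_feature = grouped_importance(
    {"summary_0": 0.25, "summary_1": 0.5, "price": 0.25}
)
```

`grouped_importance` answers the question as asked (one number for summary), while `term_importance` goes one level deeper, down to individual TF-IDF terms.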

Text Classification with scikit-learn: how to get a new document's representation from a pickle model

I have a document binomial classifier that uses a tf-idf representation of a training set of documents and applies Logistic Regression to it:
lr_tfidf = Pipeline([('vect', tfidf),('clf', LogisticRegression(random_state=0))])
lr_tfidf.fit(X_train, y_train)
I saved the model with pickle and used it to classify new documents:
text_model = pickle.load(open('text_model.pkl', 'rb'))
results = text_model.predict_proba(new_document)
How can I get the representation (features + frequencies) used by the model for this new document without explicitly computing it?
EDIT: I am trying to explain better what I want to get.
When I use predict_proba, I guess that the new document is represented as a vector of term frequencies (according to the rules stored in the model) and that those frequencies are multiplied by the coefficients learnt by the logistic regression model to predict the class. Am I right? If so, how can I get the terms and term frequencies of this new document, as used by predict_proba?
I am using sklearn v 0.19
As I understand from the comments, you need to access the TfidfVectorizer from inside the pipeline. This can be done easily by:
tfidfVect = text_model.named_steps['vect']
Now you can use the transform() method of the vectorizer to get the tfidf values.
tfidf_vals = tfidfVect.transform(new_document)
The tfidf_vals will be a single-row sparse matrix containing the tf-idf weights of the terms found in new_document. To check which terms the columns of this matrix correspond to, use tfidfVect.get_feature_names() (in scikit-learn 1.0 and later this method is called get_feature_names_out()).
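To line up the non-zero weights with their terms, something along these lines may help. It is a sketch in plain Python: `indices` and `data` are assumed to come from the `.indices` and `.data` attributes of the single-row CSR matrix `tfidf_vals`, `feature_names` from the vectorizer, and the toy values and the helper name `nonzero_terms` are hypothetical.

```python
def nonzero_terms(indices, data, feature_names):
    """Pair each stored column index of a one-row sparse matrix with its term,
    sorted from the highest tf-idf weight to the lowest."""
    pairs = [(feature_names[i], weight) for i, weight in zip(indices, data)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for tfidfVect.get_feature_names(), tfidf_vals.indices
# and tfidf_vals.data.
feature_names = ["bad", "good", "movie", "plot"]
indices = [2, 1]
data = [0.6, 0.8]

terms = nonzero_terms(indices, data, feature_names)
# [('good', 0.8), ('movie', 0.6)]
```

Terms that never appeared in the training vocabulary are absent from the matrix altogether, so they will not show up in this list.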

How to get paragraph vector for a new paragraph?

I have a set of users and their content (1 document per user containing that user's tweets). I am planning to use a distributed vector representation of some size N for each user. One way is to take word vectors pre-trained on Twitter data and average them to get a distributed vector for a user. I am planning to use doc2vec for better results. But I am not quite sure I understood the DM model given in Distributed Representations of Sentences and Documents.
I understand that we assign one vector per paragraph, use it while predicting the next word, and then backpropagate the error to update the paragraph vector as well as the word vectors. How do I use this to infer the paragraph vector of a new paragraph?
Edit : Any toy code for gensim to compute paragraph vector of new document would be appreciated.
The following code is based on gensim's doc2vec tutorial. We can instantiate and train a doc2vec model to generate embeddings of size 300 with a context window of size 10 as follows:
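from gensim.models.doc2vec import Doc2Vec
# Parameter names follow the older gensim API (gensim 4 renamed size/iter to vector_size/epochs)
model = Doc2Vec(size=300, window=10, min_count=2, iter=64, workers=16)
model.build_vocab(train_corpus)  # must run before train(); it also sets model.corpus_count
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)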
Having trained our model, we can compute a vector for a new unseen document as follows:
import random
doc_id = random.randrange(len(test_corpus))  # randint(0, len(test_corpus)) could index one past the end
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
This will return a 300-dimensional representation of our test document and compute top-N most similar documents from the training set based on cosine similarity.
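For intuition, the similarity ranking can be sketched in plain Python as below. This is not gensim's implementation; the 2-dimensional vectors, document ids and function names are made up for illustration, and the vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, doc_vectors, topn=2):
    """Rank stored document vectors by cosine similarity to the query vector."""
    scored = [
        (doc_id, cosine_similarity(query, vector))
        for doc_id, vector in doc_vectors.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]


# Toy "training" vectors and an "inferred" vector for a new document.
doc_vectors = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [1.0, 1.0]}
ranking = most_similar([2.0, 0.0], doc_vectors, topn=2)
```

Because cosine similarity ignores vector length, the query [2.0, 0.0] matches doc_a perfectly even though the two vectors differ in magnitude.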

How to get jlibsvm prediction probability in multi-class classification

I am new to SVMs. I am using jlibsvm for a multi-class sentence classification problem with 3 classes. As I understand it, I am doing one-against-all classification. My training set is comparatively small: 75 sentences in total, 25 per class.
I am building 3 SVMs (so 3 different models). When training SVM_A, sentences that belong to CLASS A get a positive label, i.e., 1, and all other sentences get a -1 label. SVM_B and SVM_C are trained correspondingly.
While testing, to get the label of a sentence, I give the sentence to all 3 models and take the prediction probability returned by each. Whichever model returns the highest probability determines the class the sentence belongs to.
This is how I am doing it. But I am getting the same prediction probability for every sentence in the test set from all models.
A predicted:0.012820514
B predicted:0.012820514
C predicted:0.012820514
These values repeat for all sentences in the training set.
The following is how I set parameters for training:
C_SVC svm = new C_SVC();
MutableBinaryClassificationProblemImpl problem;
ImmutableSvmParameterGrid.Builder builder = ImmutableSvmParameterGrid.builder();
// create training parameters ------------
HashSet<Float> cSet;
HashSet<LinearKernel> kernelSet;
cSet = new HashSet<Float>();
cSet.add(1.0f);
kernelSet = new HashSet<LinearKernel>();
kernelSet.add(new LinearKernel());
// configure finetuning parameters
builder.eps = 0.001f; // epsilon
builder.Cset = cSet; // C values used
builder.kernelSet = kernelSet; //Kernel used
builder.probability=true; // To get the prediction probability
ImmutableSvmParameter params = builder.build();
What am I doing wrong?
Is there any other better way to do multi-class classification other than this?
You are getting the same output because you generate the same model three times.
The reason is that jlibsvm can perform multiclass classification out of the box based on the provided data (LIBSVM itself supports this too). If it detects that more than two class labels are present in the data, it automatically performs multiclass classification. So there is no need for a manual 1-vs-N approach. Just supply the data with a class label for each category.
However, jlibsvm is still in beta and relies on a rather old version of LIBSVM (2.88). A lot has changed since. For a more intuitive Java binding (in comparison to the default LIBSVM version), you can take a look at zlibsvm, which is available via Maven Central and based on the latest LIBSVM version.

Too small RMSE. Recommender systems

Sorry, I'm a newbie at recommender systems, but I wrote a few lines of code using the Apache Mahout library. My dataset is pretty small: 500x100 with 8102 known cells.
My dataset is actually a subset of the Yelp dataset from the "Yelp business rating prediction" competition. I just took the 100 most commented restaurants, and then took the 500 most active customers.
I created an SVDRecommender and then evaluated its RMSE. The result is about 0.4. Why is it so small? Maybe I just don't understand something and my dataset is not that sparse, but then I tried a larger and sparser dataset and the RMSE became even smaller (about 0.18)! Could anyone explain this behaviour?
DataModel model = new FileDataModel(new File("datamf.csv"));
final RatingSGDFactorizer factorizer = new RatingSGDFactorizer(model, 20, 200);
final Factorization f = factorizer.factorize();
RecommenderBuilder builder = new RecommenderBuilder() {
    public Recommender buildRecommender(DataModel model) throws TasteException {
        //build here whatever existing or customized recommendation algorithm
        return new SVDRecommender(model, factorizer);
    }
};
RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();
double score = evaluator.evaluate(builder, null, model, 0.6, 1);
System.out.println(score);
RMSE is calculated by looking at predicted ratings versus their hidden ground-truth. So a sparse dataset may only have very few hidden ratings to predict, or your algorithm may not be able to predict for many hidden ratings because there's no correlation to other ratings. This means that even though your RMSE is low ("better"), your coverage will be low because you aren't predicting very many items.
There's another issue: RMSE is completely dataset dependent. On the MovieLens ratings dataset which has star ratings 0.5 to 5.0 stars, an RMSE of roughly 0.9 is common. But on another dataset with 0.0 to 1.0 points, I've observed an RMSE of around 0.2. Look at the properties of your dataset and see if 0.4 makes sense.
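The scale dependence can be checked with a small sketch in plain Python. The ratings below are invented, the 0 to 5 scale is a hypothetical choice for the example, and dividing by the rating range is just one common way to normalize, not something Mahout's evaluator does.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predictions and hidden true ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def normalized_rmse(predicted, actual, min_rating, max_rating):
    """RMSE expressed as a fraction of the rating range."""
    return rmse(predicted, actual) / (max_rating - min_rating)


# The same predictions expressed on a 0-5 scale and rescaled to 0-1.
stars_pred = [4.0, 2.5, 3.5, 1.0]
stars_true = [5.0, 3.5, 2.5, 2.0]
unit_pred = [0.8, 0.5, 0.7, 0.2]
unit_true = [1.0, 0.7, 0.5, 0.4]

stars_rmse = rmse(stars_pred, stars_true)   # about 1.0
unit_rmse = rmse(unit_pred, unit_true)      # about 0.2

stars_norm = normalized_rmse(stars_pred, stars_true, 0.0, 5.0)
unit_norm = normalized_rmse(unit_pred, unit_true, 0.0, 1.0)
```

The raw RMSE differs by a factor of five between the two scales, while the range-normalized values agree, which is why an RMSE is only meaningful next to the rating scale it was measured on.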
