Mahout precomputed item-item similarity - slow recommendation

I am having performance issues with precomputed item-item similarities in Mahout.
I have 4 million users and roughly the same number of items, with around 100M user-item preferences. I want to do content-based recommendation based on the cosine similarity of the TF-IDF vectors of the documents.
Since computing this on the fly is slow, I precomputed the pairwise similarities of the top 50 most similar documents as follows:
I used seq2sparse to produce TF-IDF vectors.
I used mahout rowId to produce a Mahout matrix.
I used mahout rowSimilarity -i INPUT/matrix -o OUTPUT -r 4587604 --similarityClassname SIMILARITY_COSINE -m 50 -ess to produce the top 50 most similar documents
I used Hadoop to precompute all of this. For 4 million items, the output was only 2.5GB.
Then I loaded the content of the files produced by the reducers into Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix = ..., using the docIndex to decode the ids of the documents. The ids were already integers, but rowId had re-numbered them starting from 1, so I had to map them back.
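Roughly, the loading step looks like this (a sketch, not my exact code: the OUTPUT path and the loadDocIndex() helper are stand-ins, and iterateNonZero() assumes a Mahout 0.7-era Vector):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// hypothetical helper: maps the rowId-assigned integer index back to the original document id
Map<Integer, Long> docIndex = loadDocIndex();
Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();

// rowSimilarity writes its output as SequenceFile<IntWritable, VectorWritable> parts
for (FileStatus status : fs.globStatus(new Path("OUTPUT/part-r-*"))) {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
    IntWritable row = new IntWritable();
    VectorWritable vec = new VectorWritable();
    while (reader.next(row, vec)) {
        long itemID1 = docIndex.get(row.get());
        Iterator<Vector.Element> it = vec.get().iterateNonZero();
        while (it.hasNext()) {
            Vector.Element e = it.next();
            long itemID2 = docIndex.get(e.index());
            corrMatrix.add(new GenericItemSimilarity.ItemItemSimilarity(itemID1, itemID2, e.get()));
        }
    }
    reader.close();
}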
For recommendation, I use the following code:
ItemSimilarity similarity = new GenericItemSimilarity(correlationMatrix);
CandidateItemsStrategy candidateItemsStrategy = new SamplingCandidateItemsStrategy(1, 1, 1, model.getNumUsers(), model.getNumItems());
MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy = new SamplingCandidateItemsStrategy(1, 1, 1, model.getNumUsers(), model.getNumItems());
Recommender recommender = new GenericItemBasedRecommender(model, similarity, candidateItemsStrategy, mostSimilarItemsCandidateItemsStrategy);
I am trying it with a limited data model (1.6M items), but I loaded all the item-item pairwise similarities into memory. I managed to load everything into main memory using 40GB.
When I want to produce recommendations for one user:
Recommender cachingRecommender = new CachingRecommender(recommender);
List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);
The elapsed time for the recommendation process was 554.938583083 seconds, and on top of that it did not produce any recommendations. Right now I am really concerned about the performance of the recommendation. I played with the parameters of CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy, but I didn't get any improvement in performance.
Isn't the whole idea of precomputing everything supposed to speed up the recommendation process?
Could someone please tell me where I went wrong and what I am doing wrong?
Also, why does loading the pairwise similarities into main memory blow up so much? 2.5GB of files turned into 40GB of main memory in a Collection<GenericItemSimilarity.ItemItemSimilarity>. I know that the files are serialized as IntWritable, VectorWritable key-value pairs, and that the key has to be repeated for every vector value in the ItemItemSimilarity matrix, but isn't that still a little too much?
Thank you in advance.

I stand corrected about the time needed to compute the recommendation using a Collection of precomputed values. Apparently I had put long startTime = System.nanoTime(); at the top of my code, not just before List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);, so it also counted the time needed to load the data set and the precomputed item-item similarities into main memory.
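With the timer wrapped only around the call itself, the measurement looks like this:

long startTime = System.nanoTime();
List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);
double elapsedSeconds = (System.nanoTime() - startTime) / 1e9;  // time for the recommend() call only
System.out.println("recommend() took " + elapsedSeconds + " s");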
However, I stand behind the memory consumption numbers. I improved them by using a custom ItemSimilarity and loading a HashMap<Long, HashMap<Long, Double>> of the precomputed similarities; I used the Trove library to reduce the space requirements.
Here is the code in detail. The custom ItemSimilarity:
import java.util.Collection;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

import gnu.trove.map.hash.TLongDoubleHashMap;
import gnu.trove.map.hash.TLongObjectHashMap;

public class TextItemSimilarity implements ItemSimilarity {

    private final TLongObjectHashMap<TLongDoubleHashMap> correlationMatrix;

    public TextItemSimilarity(TLongObjectHashMap<TLongDoubleHashMap> correlationMatrix) {
        this.correlationMatrix = correlationMatrix;
    }

    @Override
    public void refresh(Collection<Refreshable> alreadyRefreshed) {
    }

    @Override
    public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
        TLongDoubleHashMap similarToItemId1 = correlationMatrix.get(itemID1);
        if (similarToItemId1 != null && similarToItemId1.containsKey(itemID2)) {
            return similarToItemId1.get(itemID2);
        }
        return 0;  // unknown pair: treat as no similarity
    }

    @Override
    public double[] itemSimilarities(long itemID1, long[] itemID2s) throws TasteException {
        double[] result = new double[itemID2s.length];
        for (int i = 0; i < itemID2s.length; i++) {
            result[i] = itemSimilarity(itemID1, itemID2s[i]);
        }
        return result;
    }

    @Override
    public long[] allSimilarItemIDs(long itemID) throws TasteException {
        TLongDoubleHashMap similar = correlationMatrix.get(itemID);
        return similar != null ? similar.keys() : new long[0];  // guard against unknown items
    }
}
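Wiring the custom similarity into the recommender is the same as before (a sketch; correlationMatrix is the Trove map loaded from the precomputed files):

ItemSimilarity similarity = new TextItemSimilarity(correlationMatrix);
Recommender recommender = new GenericItemBasedRecommender(model, similarity,
        candidateItemsStrategy, mostSimilarItemsCandidateItemsStrategy);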
The total memory consumption together with my data set using Collection<GenericItemSimilarity.ItemItemSimilarity> is 30GB, and when using TLongObjectHashMap<TLongDoubleHashMap> and the custom TextItemSimilarity the space requirement is 17GB.
The time performance is 0.05 sec using Collection<GenericItemSimilarity.ItemItemSimilarity>, and 0.07 sec using TLongObjectHashMap<TLongDoubleHashMap>. I also believe that the choice of CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy plays a big role in the performance.
In short: if you want to save some space, use the Trove hash maps; if you want slightly better performance, use Collection<GenericItemSimilarity.ItemItemSimilarity>.

Related

Does GPU accelerate data preprocessing in ML tasks?

I am doing a machine learning (value prediction) task, and the data preprocessing takes a very long time. I have a CSV file with around 640000 rows, and I am trying to subtract the dates of consecutive rows to calculate the time duration. The CSV file looks as attached. For example, 2011-08-17 to 2011-08-19 takes 2 days, and I would like to write 2 to the "time duration" column. I've used the Python datetime functions to do this, and it costs a lot of time.
import pandas as pd
from datetime import datetime

data = pd.read_csv(f'{proj_dir}/raw data/measures.csv', encoding="cp1252")
file = data[['ID', 'date', 'value1', 'value2', 'duration']]

def time_subtraction(date, prev_date):
    diff = datetime.strptime(date, '%Y-%m-%d') - datetime.strptime(prev_date, '%Y-%m-%d')
    return diff.days

def calculate_time_duration(dataframe, set_0_indices):
    for i in range(dataframe.shape[0]):
        # For each patient, set "time duration" at the first measurement to 0
        if i in set_0_indices.values:
            dataframe.iloc[i, 4] = 0  # beginning of this patient
        else:  # date subtraction with the previous row
            dataframe.iloc[i, 4] = time_subtraction(date=dataframe.iloc[i, 1],
                                                    prev_date=dataframe.iloc[i - 1, 1])
    return dataframe

# I am running on Google Colab. This line takes very long.
result = calculate_time_duration(dataframe=file, set_0_indices=set_time_0_indices)
I wonder if there is any way to accelerate this process. Does using a GPU help? I have access to a remote GPU, but I don't know if GPUs help with data preprocessing. By the way, under what scenarios can GPUs really make things faster? Thanks in advance!
[attached image: what my data looks like]
Regarding updating your data in a faster fashion, please see this post.
Regarding speed improvements using the GPU: you can only benefit from a GPU if there are optimized operations that can actually run on it, and preprocessing like yours is normally not in that scope. You must also consider that you would need to transfer the data to the GPU first before computing anything, and then transfer the results back; in your case this would take much longer than the actual speedup, especially since your operation on the data is quite simple. I'm sure using the right pandas syntax will give you the desired speedup in preprocessing.
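For illustration, a sketch of a vectorized version (assuming the column layout from your snippet, a default integer index, and your set_time_0_indices variable):

import pandas as pd

# parse the whole column once, then take row-to-row differences in days
dates = pd.to_datetime(file['date'], format='%Y-%m-%d')
file['duration'] = dates.diff().dt.days

# first measurement of each patient gets duration 0, as in the original loop
file.loc[set_time_0_indices.values, 'duration'] = 0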

Find the importance of each column to the model

I have an ML.NET project and so far everything has gone great. I have a motor that collects a power reading 256 times per rotation, and I push that into a model. Right now it determines the state of the motor nearly perfectly. The motor itself only has room for 38 values at a time, so I have been spending several rotations collecting the full 256 samples for my training data.
I would like to cut the sample size down to 38 so I can determine the motor's state every rotation. If I just evenly space the samples down to 38, my model degrades by a lot. I know I am not feeding the model the features it considers most important; I am just making a guess and randomly selecting data.
Is there a way I can see the importance of each value in the array during the training process? I was thinking I could use IDataView for this, and I found the statement below about it (link).
Standard ML schema: The IDataView system does not define, nor prescribe, standard ML schema representation. For example, it does not dictate representation of nor distinction between different semantic interpretations of columns, such as label, feature, score, weight, etc. However, the column metadata support, together with conventions, may be used to represent such interpretations.
Does this mean I can print out things such as the weight of each column, and how would I do that?
I have actually only been working with ML.NET for a couple of weeks, so I apologize if the question is naive; I assure you I have googled this as many ways as I can think of. Any advice would be appreciated. Thanks in advance.
EDIT:
Thank you for the answer; I was going down a completely useless path. I have been trying to get it to work following the example you linked. I have 260 columns of numbers and one column with the condition as one of five text strings; this is the condition I am trying to predict.
The first time I tried it, it threw the error "expecting Single but got String". No problem, I used .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")) to convert to key values, and then it threw the error "expected Single, got Key UInt32". Any ideas on how to push that into this function?
At any rate, thank you for the reply, but I guess my upvotes don't count yet, sorry; hopefully I can upvote it later or someone else here can. Below is the code example.
// Create MLContext
MLContext mlContext = new MLContext();

// Load data
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(TRAIN_DATA_FILEPATH, separatorChar: ',', hasHeader: true);

// 1. Get the column names of the input features.
string[] featureColumnNames =
    data.Schema
        .Select(column => column.Name)
        .Where(columnName => columnName != "Label")
        .ToArray();

// 2. Define an estimator with the data pre-processing steps.
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", featureColumnNames)
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label"));

// 3. Create a transformer using the data pre-processing estimator.
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data); // error here

// 4. Pre-process the training data.
IDataView preprocessedTrainData = dataPrepTransformer.Transform(data);

// 5. Define the Stochastic Dual Coordinate Ascent (SDCA) estimator.
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();

// 6. Train the model.
var sdcaModel = sdcaEstimator.Fit(preprocessedTrainData);

ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
    mlContext.Regression.PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount: 3);

// Order features by importance
var featureImportanceMetrics =
    permutationFeatureImportance
        .Select((metric, index) => new { index, metric.RSquared })
        .OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));

Console.WriteLine("Feature\tPFI");
foreach (var feature in featureImportanceMetrics)
{
    Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.RSquared.Mean:F6}");
}
I believe what you are looking for is called Permutation Feature Importance (PFI). It tells you which features are most important by perturbing each feature in isolation and then measuring how much that change affects the model's performance metrics. You can use this to see which features matter most to the model.
Interpret model predictions using Permutation Feature Importance is the doc that describes how to use this API in ML.NET.
You can also use open-source packages that are much more sophisticated than what is found in ML.NET. I have an example on my GitHub of how to use R with advanced explainer packages to explain ML.NET models. You can get local instance-level explanations as well as global model breakdowns/details/diagnostics/feature interactions, etc.
https://github.com/bartczernicki/BaseballHOFPredictionWithMlrAndDALEX
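Regarding the edit's "expected Single, got Key UInt32" error: the linked example trains a regression model, and regression expects a Single label. With a five-class text condition you would instead use a multiclass trainer together with the multiclass PFI overload. A sketch of that change (column names follow your code; this is a hedged adaptation, not the documented example):

// map the string label to a key type, as you already do
IEstimator<ITransformer> dataPrep =
    mlContext.Transforms.Concatenate("Features", featureColumnNames)
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .Append(mlContext.Transforms.Conversion.MapValueToKey("Label"));
IDataView preprocessed = dataPrep.Fit(data).Transform(data);

// a multiclass trainer accepts the key-typed label that Sdca() (regression) rejects
var trainer = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy();
var model = trainer.Fit(preprocessed);

// multiclass PFI overload; rank features by the drop in micro-accuracy
var pfi = mlContext.MulticlassClassification
    .PermutationFeatureImportance(model, preprocessed, permutationCount: 3);
var ranked = pfi
    .Select((metric, index) => new { index, metric.MicroAccuracy })
    .OrderByDescending(f => Math.Abs(f.MicroAccuracy.Mean));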

How do I speed up adding two big vectors of tuples?

Recently I have been implementing an algorithm from a paper that I will be using in my master's work, but I've run into problems regarding the time some operations take.
Before I get into details, my data set comprises roughly 4 million entries of data points.
I have two lists of tuples that I get from a framework (Annoy) that calculates the cosine similarity between a vector and every other vector in the dataset. The final format is like this:
[(name1, cosine), (name2, cosine), ...]
Because of the algorithm, I have two such lists with the same names (the first value of each tuple) but two different cosine similarities. What I have to do is sum the cosines from both lists, then order the result and take the top-N highest cosine values.
My issue is: it is taking too long. My current implementation is as follows:
def topN(self, user, session):
    upref = self.m2vTN.get_user_preference(user)
    spref = self.sm2vTN.get_user_preference(session)
    # list of tuples 1
    most_su = self.indexer.most_similar(upref, len(self.m2v.wv.vocab))
    # list of tuples 2
    most_ss = self.indexer.most_similar(spref, len(self.m2v.wv.vocab))
    # concatenate both lists and accumulate into a dict
    d = defaultdict(int)
    for l, v in (most_ss + most_su):
        d[l] += v
    # convert the dict into a list, then sort it
    _list = list(d.items())
    _list.sort(key=lambda x: x[1], reverse=True)
    return [x[0] for x in _list[:self.N]]
How do I make this code faster? I've tried using threads, but I'm not sure they would help. Getting the lists is not the problem here; the concatenation and sorting are.
Thanks! English is not my native language, so sorry for any misspellings.
What do you mean by "too long"? How large are the two lists? Is there a chance your model and interim results are larger than RAM, forcing virtual-memory paging (which would create frustrating slowness)?
If you are in fact getting the cosine similarity with all vectors in the model, the Annoy indexer isn't helping any. (Its purpose is to get a small subset of nearest neighbors much faster, at the expense of perfect accuracy; if you're calculating the similarity to every candidate, there's no speedup or advantage to using ANNOY.)
Further, if you're going to combine all of the distances from two such calculations, there's no need for the sorting that most_similar() usually does; it just makes combining the values more complex later. For the gensim vector models, you can supply a False-ish topn value to get the unsorted distances to all model vectors, in order. Then you'd have two large arrays of the distances, in the model's same native order, which are easy to add together elementwise. For example:
udists = self.m2v.most_similar(positive=[upref], topn=False)
sdists = self.m2v.most_similar(positive=[spref], topn=False)
combined_dists = udists + sdists
The combined_dists aren't labeled, but they will be in the same order as self.m2v.index2entity. You could then sort them, in a manner similar to what the most_similar() method itself does, to find the ranked closest. See, for example, the gensim source code for that part of most_similar():
https://github.com/RaRe-Technologies/gensim/blob/9819ce828b9ed7952f5d96cbb12fd06bbf5de3a3/gensim/models/keyedvectors.py#L557
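For instance, a minimal sketch of that ranking step, inside your topN() method (assuming combined_dists from above and the index2entity ordering mentioned earlier):

import numpy as np

# positions of the largest combined similarities, best first
best_idx = np.argsort(-combined_dists)[:self.N]
# map positions back to names via the model's native ordering
return [self.m2v.index2entity[i] for i in best_idx]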
Finally, you might not need to do this calculation yourself at all. You can provide more than one vector to most_similar() as the positive target, and it will return the vectors closest to the average of those vectors. For example:
sims = self.m2v.most_similar(positive=[upref, spref], topn=len(self.m2v))
This won't give the same values/ranking as your sum, but it may behave very similarly. (If you wanted fewer than all of the similarities, it might make sense to use the ANNOY indexer this way as well.)

How to improve memory usage when using GenericItemSimilarity in Mahout (Taste)

As we know, in GenericItemSimilarity the similarity between item1 and item2 is precomputed.
When we use GenericItemBasedRecommender to get recommendations, the recommender needs the data model and the similarity in memory at the same time. GenericItemSimilarity offers a constructor like this:
public GenericItemSimilarity(ItemSimilarity otherSimilarity, DataModel dataModel) throws TasteException {
    long[] itemIDs = GenericUserSimilarity.longIteratorToList(dataModel.getItemIDs());
    initSimilarityMaps(new DataModelSimilaritiesIterator(otherSimilarity, itemIDs));
}
It just uses the dataModel to generate the similarity maps on the fly.
Is it necessary to store the similarity maps to a DB or file?
I see that Mahout 0.7 has a class named FileItemItemSimilarityIterator that can be helpful for reading similarity maps from a file. Are FileItemItemSimilarityIterator and AbstractJDBCInMemoryItemSimilarity (Mahout 0.5) redundant, or are they useful here?
You don't have to put the similarities in memory at all if they can be re-computed quickly on the fly.
If not, I suggest you simply prune similarities that have a small absolute value; these affect the computation the least.
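For example, a minimal sketch of such pruning while building the precomputed similarity list (allSimilarities and the 0.1 threshold are assumptions, not Mahout defaults):

double threshold = 0.1;  // hypothetical cutoff; tune for your data
Collection<GenericItemSimilarity.ItemItemSimilarity> pruned =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
for (GenericItemSimilarity.ItemItemSimilarity sim : allSimilarities) {
    if (Math.abs(sim.getValue()) >= threshold) {  // keep only the strong similarities
        pruned.add(sim);
    }
}
ItemSimilarity similarity = new GenericItemSimilarity(pruned);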

Does the test file in Weka require the same number of features as the train file, or fewer?

I have prepared two different .arff files from two different datasets, one for testing and the other for training. Each of them has an equal number of instances but different features, so the dimensionality of the feature vector differs between the two files. When I did cross-validation on each of these files separately, they worked perfectly. This shows the .arff files are properly prepared and do not have any errors.
Now, if I use the train file, which has lower dimensionality than the test file, for evaluation, I get the following error:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 5986
at weka.classifiers.bayes.NaiveBayesMultinomial.probOfDocGivenClass(NaiveBayesMultinomial.java:295)
at weka.classifiers.bayes.NaiveBayesMultinomial.distributionForInstance(NaiveBayesMultinomial.java:254)
at weka.classifiers.Evaluation.evaluationForSingleInstance(Evaluation.java:1657)
at weka.classifiers.Evaluation.evaluateModelOnceAndRecordPrediction(Evaluation.java:1694)
at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:1574)
at TrainCrossValidateARFF.main(TrainCrossValidateARFF.java:44)
Does the test file in Weka require the same number of features as the train file, or fewer?
Code for evaluation:
public class TrainCrossValidateARFF {

    private static DecimalFormat df = new DecimalFormat("#.##");

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {  // both a train file and a test file are required
            System.out.println("USAGE: TrainCrossValidateARFF <train_arff_file> <test_arff_file>");
            System.exit(-1);
        }
        String trainArffFilePath = args[0];
        DataSource ds = new DataSource(trainArffFilePath);
        Instances train = ds.getDataSet();
        train.setClassIndex(train.numAttributes() - 1);

        String testArffFilePath = args[1];
        DataSource ds1 = new DataSource(testArffFilePath);
        Instances test = ds1.getDataSet();
        // setting the class attribute
        test.setClassIndex(test.numAttributes() - 1);

        System.out.println("-----------" + trainArffFilePath + "--------------");
        System.out.println("-----------" + testArffFilePath + "--------------");

        NaiveBayesMultinomial naiveBayes = new NaiveBayesMultinomial();
        naiveBayes.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(naiveBayes, test);
        System.out.println(eval.toSummaryString("\nResults\n======\n", false));
    }
}
The same number of features is necessary. You may also need to insert ? for the class attribute.
According to Weka architect Mark Hall:
To be compatible, the header information of the two sets of instances needs to be the same: the same number of attributes, with the same names in the same order. Furthermore, any nominal attributes must have the same values declared in the same order in both sets of instances.
For unknown class values in your test set, just set the value of each to missing, i.e. "?".
According to Weka's wiki, the number of features needs to be the same for both the training and test sets, and the types of those features (e.g., nominal, numeric) need to match as well.
Also, I assume you didn't apply any Weka filters to either of your datasets. Datasets often become incompatible if you apply filters separately to each dataset (even if it is the same filter).
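If you do need a filter (for example StringToWordVector on text data), the usual way to keep the two sets compatible is to initialize the filter on the training data only and reuse the same instance for the test data. A sketch, with the filter choice being an assumption and train/test as in the question's code:

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(train);                             // header/dictionary learned from the training set only
Instances filteredTrain = Filter.useFilter(train, filter);
Instances filteredTest = Filter.useFilter(test, filter);  // same attributes, same order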

How do I divide a dataset into training and test sets?

You can use the RemovePercentage filter (package weka.filters.unsupervised.instance).
In the Explorer, just do the following:

Training set:
- Load the full dataset
- Select the RemovePercentage filter in the preprocess panel
- Set the correct percentage for the split
- Apply the filter
- Save the generated data as a new file

Test set:
- Load the full dataset (or just use undo to revert the changes to the dataset)
- Select the RemovePercentage filter if not yet selected
- Set the invertSelection property to true
- Apply the filter
- Save the generated data as a new file
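The same split can also be done programmatically. A sketch using the RemovePercentage API (the file name and percentage are assumptions):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.RemovePercentage;

Instances data = DataSource.read("dataset.arff");  // hypothetical file
RemovePercentage split = new RemovePercentage();
split.setPercentage(30);                           // remove 30%: the remaining 70% is the training set
split.setInputFormat(data);
Instances train = Filter.useFilter(data, split);

split = new RemovePercentage();
split.setPercentage(30);
split.setInvertSelection(true);                    // keep only the removed 30%: the test set
split.setInputFormat(data);
Instances test = Filter.useFilter(data, split);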

Resources