How to post process preferences? (I am using spark ALS implicit) - machine-learning

I am familiar with the article "Collaborative Filtering for Implicit Feedback Datasets" http://yifanhu.net/PUB/cf.pdf.
I use Spark ML ALS (implicit) to recommend items to users, with parameters Alpha = 30, Rank = 10, RegParam = 0.1. My dataset contains only users who have used more than one item (following the advice given in "How to improve my recommendation result? I am using spark ALS implicit").
I use .recommendForAllUsers and get preferences p_ui. Then I filter only "new" recommendations (combinations user-item that were not in the input dataset). I also filter preferences > 0.01 to get only the most preferable items.
The question is: how can I post-process the preferences to make them more like "probabilities"?
(My program is required to output some kind of "probability".)
Is it a good idea to scale preferences to [0.5, 1.0], using the formula below?
scaled_preference = ($"preference"*0.5 + max_preference*0.5 - min_preference)/(max_preference - min_preference)
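A minimal sketch of the scaling formula from the question, written in plain Python rather than as a Spark column expression. The function name `scale_preference` and the sample values are illustrative assumptions. This is a monotonic min-max rescaling into [0.5, 1.0]: it preserves the ranking of items, but it does not by itself turn the values into calibrated probabilities.

```python
def scale_preference(preference, min_preference, max_preference):
    """Linearly map [min_preference, max_preference] onto [0.5, 1.0]."""
    if max_preference == min_preference:
        raise ValueError("max_preference must differ from min_preference")
    return (preference * 0.5 + max_preference * 0.5 - min_preference) / (
        max_preference - min_preference
    )


# Hypothetical preferences that survived the "> 0.01" filter.
preferences = [0.02, 0.35, 0.9]
lo, hi = min(preferences), max(preferences)
scaled = [scale_preference(p, lo, hi) for p in preferences]
```

The smallest preference maps to 0.5 and the largest to 1.0; everything else falls in between in the same order.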

Related

Can you search for related database tables/fields using text similarity?

I am doing a college project where I need to compare a string with a list of other strings. Is there a library that can do this?
Suppose I have a table called DOCTORS_DETAILS.
Other table names are HOSPITAL_DEPARTMENTS, DOCTOR_APPOINTMENTS, PATIENT_DETAILS, PAYMENTS, etc.
Now I want to calculate which of those are more relevant to DOCTOR_DETAILS.
The expected output could be:
DOCTOR_APPOINTMENTS - more relevant because the term DOCTOR appears in both strings
PATIENT_DETAILS - the term DETAILS is present in both strings
HOSPITAL_DEPARTMENTS - least relevant
PAYMENTS - least relevant
So I want to compute relevance based on the number of terms shared by the two strings in question.
Ex: DOCTOR_DETAILS -> DOCTOR_APPOINTMENT (1/2) > DOCTOR_ADDRESS_INFORMATION (1/3) > DOCTOR_SPECIALIZATION_DEGREE_INFORMATION (1/4) > PATIENT_INFO (0/2)
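The term-overlap scoring described in the question can be sketched in a few lines of plain Python. This is only an illustration of the (shared terms / candidate terms) ratio from the example; the function names `term_overlap` and `rank_by_overlap` are assumptions, and terms are compared by exact match after splitting on underscores.

```python
def term_overlap(main_name, candidate_name):
    """Fraction of the candidate's terms that also appear in the main name."""
    main_terms = set(main_name.upper().split("_"))
    candidate_terms = candidate_name.upper().split("_")
    shared = sum(1 for term in candidate_terms if term in main_terms)
    return shared / len(candidate_terms)


def rank_by_overlap(main_name, candidates):
    """Sort candidates from most to least relevant."""
    return sorted(candidates, key=lambda c: term_overlap(main_name, c), reverse=True)


ranked = rank_by_overlap(
    "DOCTOR_DETAILS",
    ["PATIENT_INFO", "DOCTOR_ADDRESS_INFORMATION", "DOCTOR_APPOINTMENT"],
)
```

Because matching is exact, DOCTORS and DOCTOR count as different terms here; handling such variants needs stemming or a subword-aware method.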
Semantic similarity is a common NLP problem. There are multiple approaches to look into, but at their core they all boil down to:
1. Turn each piece of text into a vector
2. Measure distance between vectors, and call closer vectors more similar
Three possible ways to do step 1 are:
- tf-idf
- fasttext
- bert-as-service
To do step 2, you almost certainly want to use cosine distance. It is pretty straightforward in Python; here is an implementation from a blog post:
import numpy as np

def cos_sim(a, b):
    """Takes 2 vectors a, b and returns the cosine similarity according
    to the definition of the dot product
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)
For your particular use case, my instincts say to use fasttext. The official site shows how to download some pretrained word vectors, but you will want to download a pretrained model (see this GH issue, use https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.bin.zip).
Then you'd want to do something like:
import fasttext

model = fasttext.load_model("model_filename.bin")

def order_tables_by_name_similarity(main_table, candidate_tables):
    '''Note: we use a fasttext model, not just pretrained vectors, so we get subword information
    you can modify this to also output the distances if you need them
    '''
    main_v = model[main_table]
    similarity_to_main = lambda w: cos_sim(main_v, model[w])
    return sorted(candidate_tables, key=similarity_to_main, reverse=True)

order_tables_by_name_similarity("DOCTORS_DETAILS", ["HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS", "PATIENT_DETAILS", "PAYMENTS"])
# outputs: ['PATIENT_DETAILS', 'DOCTOR_APPOINTMENTS', 'HOSPITAL_DEPARTMENTS', 'PAYMENTS']
If you need to put this in production, the giant model size (6.7GB) might be an issue. At that point, you'd want to build your own model, and constrain the model size. You can probably get roughly the same accuracy out of a 6MB model!
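As a dependency-free baseline for the same two steps (vectorize, then compare with cosine similarity), here is a rough sketch that uses character trigram counts as the vectors. This is a stand-in chosen for illustration, not fasttext and not the author's method; it only captures surface overlap between names, and the helper names are assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_ngram_similarity(main_table, candidate_tables):
    """Order candidates by trigram cosine similarity to the main table name."""
    main_v = char_ngrams(main_table)
    return sorted(
        candidate_tables,
        key=lambda name: cosine_similarity(main_v, char_ngrams(name)),
        reverse=True,
    )


ranking = rank_by_ngram_similarity(
    "DOCTORS_DETAILS",
    ["HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS", "PATIENT_DETAILS", "PAYMENTS"],
)
```

Unlike a trained embedding, this cannot tell that two differently spelled words mean similar things, so treat it as a quick sanity check rather than a replacement.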

Find the importance of each column to the model

I have a ML.net project and as of right now everything has gone great. I have a motor that collects a power reading 256 times around each rotation and I push that into a model. Right now it determines the state of the motor nearly perfectly. The motor itself only has room for 38 values on it at a time so I have been spending several rotations to collect the full 256 samples for my training data.
I would like to cut the sample size down to 38 so every rotation I can determine its state. If I just evenly space the samples down to 38 my model degrades by a lot. I know I am not feeding the model the features it thinks are most important but just making a guess and randomly selecting data for the model.
Is there a way I can see the importance of each value in the array during the training process? I was thinking I could use IDataView for this and I found the below statement about it (link).
Standard ML schema: The IDataView system does not define, nor prescribe, standard ML schema representation. For example, it does not dictate representation of nor distinction between different semantic interpretations of columns, such as label, feature, score, weight, etc. However, the column metadata support, together with conventions, may be used to represent such interpretations.
Does this mean I can print out such things as weight for each column and how would I do that?
I have actually only been working with ML.net for a couple weeks now so I apologize if the question is naive, I assure you I have googled this as many ways as I can think to. Any advice would be appreciated. Thanks in advance.
EDIT:
Thank you for the answer; I was going down a completely useless path. I have been trying to get it to work following the example you linked to. I have 260 columns with numbers and one column with the condition as one of five text strings. This is the condition I am trying to predict.
The first time I tried it, it threw the error "expecting single but got string". No problem: I used .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")) to convert to key values, and it then threw the error "expected Single, got Key UInt32". Any ideas on how to push that into this function?
At any rate, thank you for the reply. I guess my upvotes don't count yet, sorry; hopefully I can upvote it later or someone else here can upvote it. Below is the code example.
//Create MLContext
MLContext mlContext = new MLContext();
//Load Data
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(TRAIN_DATA_FILEPATH, separatorChar: ',', hasHeader: true);
// 1. Get the column name of input features.
string[] featureColumnNames =
    data.Schema
        .Select(column => column.Name)
        .Where(columnName => columnName != "Label").ToArray();
// 2. Define estimator with data pre-processing steps
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", featureColumnNames)
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label"));
// 3. Create transformer using the data pre-processing estimator
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data);//error here
// 4. Pre-process the training data
IDataView preprocessedTrainData = dataPrepTransformer.Transform(data);
// 5. Define Stochastic Dual Coordinate Ascent machine learning estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();
// 6. Train machine learning model
var sdcaModel = sdcaEstimator.Fit(preprocessedTrainData);
ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
    mlContext
        .Regression
        .PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount: 3);
// Order features by importance
var featureImportanceMetrics =
    permutationFeatureImportance
        .Select((metric, index) => new { index, metric.RSquared })
        .OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));
Console.WriteLine("Feature\tPFI");
foreach (var feature in featureImportanceMetrics)
{
    Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.RSquared.Mean:F6}");
}
I believe what you are looking for is called Permutation Feature Importance. This will tell you which features are most important by changing each feature in isolation, and then measuring how much that change affected the model's performance metrics. You can use this to see which features are the most important to the model.
Interpret model predictions using Permutation Feature Importance is the doc that describes how to use this API in ML.NET.
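To make the idea concrete, here is a small language-agnostic sketch of permutation feature importance in plain Python. It is not the ML.NET API: the toy model, the data, and the function names are all assumptions made for illustration. Each feature column is shuffled on its own, and the importance is the average drop in accuracy compared with the unshuffled data.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the right label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=3, seed=0):
    """Mean drop in accuracy when each feature column is shuffled on its own."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[col] for row in rows]
            rng.shuffle(column)
            permuted = [
                row[:col] + [value] + row[col + 1:]
                for row, value in zip(rows, column)
            ]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy stand-in for a trained model: it only ever looks at feature 0.
def toy_model(row):
    return int(row[0] > 0.5)


data_rng = random.Random(42)
rows = [[float(i % 2), data_rng.random(), data_rng.random()] for i in range(200)]
labels = [toy_model(row) for row in rows]
importances = permutation_importance(toy_model, rows, labels)
```

Features the model ignores get an importance of zero, while shuffling the one feature it relies on costs accuracy. Applied to the 256 power readings, the same ranking would suggest which 38 sample positions to keep.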
You can also use a set of open-source packages that are much more sophisticated than what is found in ML.NET. I have an example on my GitHub of how to use R with advanced explainer packages to explain ML.NET models. You can get local instance explanations as well as global model breakdowns, details, diagnostics, feature interactions, etc.
https://github.com/bartczernicki/BaseballHOFPredictionWithMlrAndDALEX

Which reinforcement algorithm to use for binary classification

I am new to machine learning, but I've read a lot about Reinforcement Learning in the past 2 days. I have an application that fetches a list of projects (e.g. from Upwork). There is a moderator that manually accepts or rejects a project (based on some parameters explained below). If a project is accepted, I want to send a project proposal and if it is rejected, I'll ignore it. I am looking to replace that moderator with AI (among other reasons) so I want to know which Reinforcement Algorithm should I use for this.
Parameters:
Some of the parameters that should decide whether the agent accepts or rejects the project are listed below. Assuming I only want to accept projects related to web development (specifically backend/server-side) here is how the parameters should influence the agent.
Sector: If the project is in the IT sector, it should have a higher chance of being accepted.
Category: If the project is in the Web Development category it should have more chances of being accepted.
Employer Rating: Employers having a rating of over 4 (out of 5) should have more chances of being accepted.
I thought Q-Learning or SARSA would be able to help me out, but most of the examples I saw were related to the Cliff Walking problem, where the states depend on each other. That is not applicable in my case, since each project is different from the previous one.
Note: I want the agent to be self-learning so that if in the future I start rewarding it for front-end projects too, it should learn that behavior. Therefore, suggesting a "pure" supervised learning algorithm won't work.
Edit 1: I would like to add that I have data (sector, category, title, employer rating etc.) of 3000 projects along with whether that project was accepted or rejected by my moderator.
Your problem should be easy to solve using Q-learning; it just depends on how you design the problem. Reinforcement learning is a framework in which an agent receives states from an environment and then performs actions given those states. Depending on those actions, it is rewarded accordingly. For your problem, the structure will look like this:
State
States: 3 x 1 matrix. [Sector, Category, Employer Rating]
The sector state is an integer, where each integer represents a different sector. For example, 1 = IT Sector, 2 = Energy, 3 = Pharmaceuticals, 4 = Automotives, etc.
The category state is likewise an integer, where each integer represents a different category. Ex: 1 = Web Development, 2 = Hardware, 3 = etc.
The employer rating is again an integer between 1 and 5, representing the rating.
Action
Action: Output is an integer.
The action space would be binary. 1 or 0. 1 = Take the project, 0 = Don't take the project.
Reward
The reward provides feedback to your system. In your case, you would only evaluate the reward if the action = 1, i.e., you took the project. This will then allow your RL to learn how good of a job it did taking the project.
Reward would be a function that looks something like this:
def reward(states):
    sector, category, emp_rating = states
    rewards = 0
    if sector == 1:  # The IT sector
        rewards += 1
    if category == 1:  # The web development category
        rewards += 1
    if emp_rating == 5:  # Highest rating
        rewards += 2
    elif emp_rating == 4:  # 2nd highest rating
        rewards += 1
    return rewards
To enhance this reward function, you can give some sectors negative rewards, so the agent is penalized when it takes those projects. I left that out here to avoid further complexity.
You can also edit the reward function in the future to let the agent learn new things, such as making some sectors better than others.
Edit: yes, regarding lejlot's comment, it basically is a multi-armed bandit problem, where there is no sequential decision making. The setup of the bandit problem is basically the same as Q-learning minus the sequential part. All you're concerned with is that you have a project proposal (state), make a decision (action), and then receive your reward. It does not matter what happens next in your case.
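A rough sketch of that bandit-style setup, assuming an epsilon-greedy agent with a tabular value estimate per (state, action) pair. The reward function mirrors the one in the answer, with the added assumption that rejecting a project always yields 0; the hyperparameters and the state grid are illustrative, not tuned.

```python
import random
from collections import defaultdict

ACTIONS = (0, 1)  # 0 = reject the project, 1 = accept it


def reward(state, action):
    """Hypothetical reward mirroring the answer; rejecting always yields 0."""
    if action == 0:
        return 0
    sector, category, emp_rating = state
    total = 0
    if sector == 1:  # IT sector
        total += 1
    if category == 1:  # web development category
        total += 1
    if emp_rating == 5:
        total += 2
    elif emp_rating == 4:
        total += 1
    return total


def best_action(q, state):
    """Greedy choice; ties go to the first action (reject)."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


def train(states, episodes=5000, epsilon=0.2, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(episodes):
        state = rng.choice(states)
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore
        else:
            action = best_action(q, state)  # exploit
        # One-step update: there is no next state, so no discounted future term.
        q[(state, action)] += learning_rate * (reward(state, action) - q[(state, action)])
    return q


states = [(s, c, r) for s in (1, 2, 3) for c in (1, 2) for r in range(1, 6)]
q_table = train(states)
```

Changing `reward` later (for example, to start paying out for front-end projects) and continuing to train lets the estimates drift toward the new behavior, which is the self-learning property the question asks for.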

Why does ALS.trainImplicit give better predictions for explicit ratings?

Edit: I tried a standalone Spark application (instead of PredictionIO) and my observations are the same. So this is not a PredictionIO issue, but still confusing.
I am using PredictionIO 0.9.6 and the Recommendation template for collaborative filtering. The ratings in my data set are numbers between 1 and 10. When I first trained a model with defaults from the template (using ALS.train), the predictions were horrible, at least subjectively. Scores ranged up to 60.0 or so but the recommendations seemed totally random.
Somebody suggested that ALS.trainImplicit did a better job, so I changed src/main/scala/ALSAlgorithm.scala accordingly:
val m = ALS.trainImplicit( // instead of ALS.train
  ratings = mllibRatings,
  rank = ap.rank,
  iterations = ap.numIterations,
  lambda = ap.lambda,
  blocks = -1,
  alpha = 1.0, // also added this line
  seed = seed)
Scores are much lower now (below 1.0) but the recommendations are in line with the personal ratings. Much better, but also confusing. PredictionIO defines the difference between explicit and implicit this way:
explicit preference (also referred as "explicit feedback"), such as "rating" given to item by users.
implicit preference (also referred as "implicit feedback"), such as "view" and "buy" history.
and:
By default, the recommendation template uses ALS.train() which expects explicit rating values which the user has rated the item.
source
Is the documentation wrong? I still think that explicit feedback fits my use case. Maybe I need to adapt the template with ALS.train in order to get useful recommendations? Or did I just misunderstand something?
A lot of it depends on how you gathered the data. Ratings that seem explicit can often actually be implicit. For instance, suppose you let users rate items that they have purchased or used before. The very fact that they spent time evaluating a particular item then suggests that the item is of reasonably high quality, while items of poor quality are often not rated at all because people do not even bother to use them. So even though the dataset is intended to be explicit, you may get better results if you treat the ratings as implicit. Again, this varies significantly based on how the data is obtained.
Explicit data (like ratings) normally comes with bias: people go and rate a product because they like it! Think about your experience shopping and then rating on Amazon.com :-)
In contrast, implicit information, such as viewing duration or comment length, can often reflect a user's real preference for a product. Even a like/dislike is better than a rating because it provides a very simple 'bad' option without making the user think "should I give a 3, 3.5, or 4?".
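Part of the score difference in the question follows from what the implicit variant models. In the formulation of "Collaborative Filtering for Implicit Feedback Datasets" (Hu, Koren and Volinsky), which Spark's implicit ALS follows, each observed value r is split into a binary preference (1 if r > 0, else 0) and a confidence 1 + alpha * r, and the factorization is fit to the preference rather than to r itself. That is why predictions land roughly between 0 and 1 instead of on the 1 to 10 rating scale. A small sketch of that transformation, with an illustrative function name and sample ratings:

```python
def to_preference_and_confidence(rating, alpha=1.0):
    """Split a raw rating into (binary preference, confidence weight)."""
    preference = 1.0 if rating > 0 else 0.0
    confidence = 1.0 + alpha * rating
    return preference, confidence


# Hypothetical ratings: unobserved (0), low (3), and high (10).
ratings = [0, 3, 10]
pairs = [to_preference_and_confidence(r, alpha=1.0) for r in ratings]
```

A high rating therefore does not raise the target value; it only makes the model more confident that the user prefers the item.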

Multi-variable Recommender System

I went through tutorials on implementing a recommender system, and most of them take one variable (rank).
I want to implement an item-based recommender system which takes multiple variables.
E.g.: let's say an item (bar) has the following variables (values ranging from -10 to +10, to express opposite polarities)
- price (cheap to expensive)
- environment (casual to fine)
- age range (young to adults)
Now I want to recommend items (bar) looking at the list of bars registered in User's history.
Is this kind of a "multi dimensional recommender system" possible to implement using Mahout or any other framework ?
You want the multi-modal, multi-indicator, multi-variable (however you want to describe it) Universal Recommender. It can handle all this data. We've tested it on real datasets and get a significant boost in precision tests because of what we call "secondary indicators".
Good intuition. Give the UR a look: blog.actionml.com, check out the slides in one post. Code here: https://github.com/actionml/template-scala-parallel-universal-recommendation/tree/v0.3.0 Built on the new Spark version of Mahout: http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
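For comparison, the setup in the question can also be prototyped without any framework as a simple content-based baseline. This is not the Universal Recommender and not Mahout; it is a sketch that averages the attribute vectors of the bars in a user's history and ranks unseen bars by Euclidean distance to that profile. The catalog, bar names, and function names are all hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def euclidean(a, b):
    """Straight-line distance between two attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recommend(history, catalog, top_n=2):
    """Rank unseen bars by distance to the user's average profile."""
    profile = mean_vector([catalog[name] for name in history])
    unseen = [name for name in catalog if name not in history]
    return sorted(unseen, key=lambda name: euclidean(profile, catalog[name]))[:top_n]


# Hypothetical bars: (price, environment, age range), each in [-10, 10].
catalog = {
    "dive_bar": (-8, -9, -6),
    "student_pub": (-7, -6, -8),
    "cocktail_lounge": (7, 8, 5),
    "wine_bar": (8, 9, 7),
    "sports_bar": (-5, -7, -2),
}
recommendations = recommend(["dive_bar", "student_pub"], catalog)
```

A user whose history is cheap, casual, and young gets the nearest unseen bar first. This only uses item attributes, so it cannot exploit what other users did, which is where a collaborative approach adds value.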
