I've created a recommender system that works this way:
- each user selects some filters, and a score is generated based on those filters
- each user is clustered using k-means based on those scores
- whenever a user receives a recommendation, I use Pearson's correlation to find which user in the same cluster correlates best with them
My problem is that I'm not really sure what would be the best way to evaluate this system. I've seen that one way to do it is by hiding some values of the dataset, but that doesn't apply in my case because I'm not predicting scores.
Are there any metrics or something else that I could use?
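For reference, the pipeline described above can be sketched roughly like this; all data, cluster counts, and variable names are hypothetical placeholders, not the actual system:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.cluster import KMeans

# Hypothetical data: one filter-derived score vector per user
rng = np.random.default_rng(0)
scores = rng.random((20, 5))  # 20 users, 5 filter-based scores

# Cluster users on their scores with k-means
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)

def best_match(user_idx):
    """Find the most Pearson-correlated user within the same cluster."""
    peers = [i for i in np.where(labels == labels[user_idx])[0]
             if i != user_idx]
    if not peers:
        return None  # user is alone in its cluster
    return max(peers, key=lambda i: pearsonr(scores[user_idx], scores[i])[0])

match = best_match(0)
```

A sketch like this also suggests one evaluation route: hold out some users, assign them to clusters, and check whether their best-correlated peer's recommendations match what they actually selected.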
Scenario - I have data that does not have labels, but I can create a function to label the data based on behavior, then deploy the model so I don't have to keep labeling the data. Is this considered machine learning?
Objective: classify accounts with volume spikes using high/medium/low labels, to deploy on big data (trillions of lines of data).
Data: the data I have includes the following attributes:
Account, Time, Date, Volume amount.
Method:
Create a new feature column called "spike" and write a pandas function to identify a spike greater than 5. Is this feature engineering?
Next, I create my label column and classify it as low, medium, or high spike.
Next, I train a machine learning classifier and deploy it to label future accounts with similar patterns in big data.
Thoughts on this process? Is this approach correct for machine learning?
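The spike and label columns described above can be sketched in pandas like this; the bin edges and the sample data are assumptions for illustration, only the threshold of 5 comes from the question:

```python
import pandas as pd

# Hypothetical account volume data
df = pd.DataFrame({
    "account": ["A", "A", "B", "B"],
    "volume":  [2.0, 7.5, 4.9, 12.0],
})

# Rule-based feature: flag volumes above the spike threshold of 5
df["spike"] = df["volume"] > 5

# Rule-based labels: bin volume into low/medium/high
# (the bin edges here are assumptions, not from the original post)
df["label"] = pd.cut(df["volume"], bins=[0, 5, 10, float("inf")],
                     labels=["low", "medium", "high"])
```

Note that if a classifier is later trained on these labels, it can only learn to reproduce this rule, which is part of what the answer below questions.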
1st question:
If your algorithm takes the decision, that is, puts a label on a sample, based on the set of samples that you have, I'd say it's a machine learning algorithm. But if you write code that encodes your own experience with the data, I think it's not an ML method. In brief, ML looks at the data to get patterns and insights from it. I don't know why you're doing this, but does it need to be an ML algorithm? Sometimes you can solve the problem in a very simple way, without using ML.
2nd question: I'm afraid not. Selecting your data attributes (e.g. Account, Time, Date, Volume amount), checking their correlations, trying to figure out whether you have a dominant one, etc., is a pre-ML process. Feature engineering selects the best features to present to the algorithm so that it can perform the classification (in your case).
3rd question: I think it's fair enough to start playing with some ML algorithms, such as KNN, SVM, neural networks, decision trees, etc.
I am currently working on software that can connect users to jobs based on their user profiles. I ran text analytics on the job descriptions and derived important keywords from them. I have also collected user information from their profiles. Matching the jobs to the user profiles seems to be a challenging task. Are there any machine-learning-based algorithms that can be used for matchmaking?
OK, so basically, you have keywords for each job description, and then you have some sort of text data (user profiles) to which you try to match those keywords.
Since your training data (user profiles) is not labeled, supervised learning will not help you here. Unsupervised learning (clustering) could maybe help you find certain patterns (keywords) across loads of user profiles, but you would certainly need to experiment with different techniques (such as Gaussian mixture models) and observe possible patterns.
A simpler thing you could do is to derive/find keywords for each user profile as well (in other words, identify how many of your job keywords also exist in the user profile) and then compute the cosine similarity between them. You would then only need to determine the minimal angle threshold; this is a parameter to play with. Of course, you would need to vectorize your text data using bigrams or similar (if you use Python, scikit-learn already provides feature extraction). You could possibly also use a tf-idf vectorizer on both the job description and the user profile, but with a substantial, well-chosen stop-word list.
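The tf-idf plus cosine-similarity idea can be sketched like this; the job text, profiles, and the 0.2 threshold are made-up examples, and the threshold is exactly the parameter to tune:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical job description and user profiles
job = "python machine learning data analysis"
profiles = [
    "experienced python developer interested in machine learning",
    "graphic designer specializing in branding and illustration",
]

# Vectorize the job and the profiles in the same tf-idf space,
# with sklearn's built-in English stop-word list
vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform([job] + profiles)

# Cosine similarity of each profile to the job;
# keep matches above a tunable threshold
sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
matches = [p for p, s in zip(profiles, sims) if s > 0.2]
```

Bigrams could be added via `ngram_range=(1, 2)` on the vectorizer, at the cost of a larger vocabulary.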
I have two kinds of profiles in a database: one is the candidate profile, the other is a job profile posted by a recruiter. Both profiles have three common fields, say location, skill, and experience.
I know the algorithm, but I am having trouble creating the training data set. My input features will be location, skill, and salary chosen from the candidate profile, but I don't see how to choose the output (the relevant job profile).
As far as I know, the output can only be a single variable, so how do I choose a relevant job profile as the output in my training set? Or should I choose some other method? Another thought is clustering.
As I understand it, you want to predict a job profile given a candidate profile using some prediction algorithm.
Well, if you want to use regression you need some historical data, i.e. which candidates were given which jobs; then you can build a model on that history. If you don't have such training data, you need some other algorithm. For example, you could treat location, skill, and experience as features in 3D and use clustering/nearest neighbors to find the candidate profile closest to a job profile.
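The nearest-neighbor idea can be sketched like this; the numeric encodings of location, skill, and experience are assumed to exist already (in practice, categorical fields would need encoding and scaling first):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical encodings: (location, skill, experience) as numbers
candidates = np.array([[1.0, 3.0, 5.0],
                       [2.0, 1.0, 2.0]])
jobs = np.array([[1.0, 3.0, 4.0],    # close to candidate 0
                 [5.0, 5.0, 9.0]])

# For each job profile, find the nearest candidate profile
nn = NearestNeighbors(n_neighbors=1).fit(candidates)
_, idx = nn.kneighbors(jobs)
```

This sidesteps the "single output variable" problem entirely: there is no label column, just a distance ranking of candidates per job.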
You could look at "recommender systems"; they can be an answer to your problem.
Starting with a content-based algorithm (you will have to find a way to automate the labeling of the jobs, or do it manually), you can then improve it by gathering which jobs your users were actually interested in, and so become a hybrid recommender.
I am an ML noob. I have the task of predicting click probability given user information like city, state, OS version, OS family, device, browser family, browser version, etc.
I have been recommended to try logit since logit seems to be what MS and Google are using too.
I have some questions regarding logistic regression like:
Click vs. non-click is a very, very unbalanced class, and simple glm predictions do not look good. How can I make the model work through this?
All the variables I have are categorical, and things like device and city can take numerous values. Also, some devices or cities occur very, very rarely. So how do I deal with what I can only describe as a very random variety of categorical variables?
One of the variables we get is device ID. This is a unique feature that can be translated into a user's identity. How can it be used in logit, or should it be used in a completely different model based on user identity?
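One common way to address the first two issues is one-hot encoding with unknown-category handling plus class weighting; this is a minimal sketch with made-up data, not a claim about what MS or Google do:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical, heavily imbalanced click data: (browser, city) -> clicked?
X = [["chrome", "nyc"], ["chrome", "sf"], ["firefox", "nyc"],
     ["chrome", "nyc"], ["safari", "la"], ["chrome", "sf"]]
y = [0, 0, 0, 0, 0, 1]  # clicks are rare

# handle_unknown="ignore" copes with rare or unseen devices/cities;
# class_weight="balanced" reweights the rare click class
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(class_weight="balanced"),
)
model.fit(X, y)
```

Rare categories can also be collapsed into an "other" bucket before encoding. A high-cardinality identifier like device ID usually does not belong in the one-hot features at all, for the reason the question suspects.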
I'm working on a project and I have a subset of a user's keystroke time data. This means that the user makes n attempts, and I will use this recorded attempt-time data in various kinds of classification algorithms to verify, for future attempts, that the login is being done by the user and not by someone else. (Simply put, this is biometrics.)
I have 3 different timings from the user's login attempt process; of course, this is a subset of the infinite data.
Up to this point it is an easy classification problem. I decided to use WEKA, but as far as I understand, I have to create some fake data to feed the classification algorithm: the user's measured attempts will be labeled 1 and the fake data 0.
Can I use some optimization algorithms? Or is there any way to create this fake data so as to minimize false positives?
Thanks
There are a couple of different ways you could go about approaching this.
Collect Negative Examples - One easy solution would be to just gather keystroke timing data from other people that could be used as negative examples. If you want to gather a large sample very cheaply, as in about 1000 samples for about $10, you could use a service like Amazon Mechanical Turk.
That is, you could put together a human intelligence task (HIT) that has people type in randomized, password-like sequences. To get the timing information you'll need to use an External Question, since the restricted HTML for regular questions doesn't support JavaScript.
Use a Generative Model - Alternatively, you could train a generative probability model of a user's keystroke behavior. For example, you could fit a Gaussian mixture model (GMM) to the user's delays between keystrokes.
Such a model will give you a probability estimate of keystroke timing information being generated by a specific user. You would then just need to set a threshold of how likely the timing information should be in order for the user to be authenticated.
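A minimal sketch of the GMM approach, using scikit-learn rather than WEKA; the delay distribution, component count, and 5th-percentile threshold are all assumptions to illustrate the thresholding step:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical inter-keystroke delays (seconds) from the genuine user
rng = np.random.default_rng(0)
user_delays = rng.normal(0.15, 0.03, size=(100, 1))

# Fit a generative model of the user's typing rhythm
gmm = GaussianMixture(n_components=2, random_state=0).fit(user_delays)

# Threshold: accept attempts at least as likely as the bottom 5%
# of the user's own training samples (an assumed choice)
threshold = np.percentile(gmm.score_samples(user_delays), 5)

def is_genuine(delays):
    """Average log-likelihood of the attempt vs. the threshold."""
    return gmm.score(np.asarray(delays).reshape(-1, 1)) >= threshold

genuine = is_genuine([0.14, 0.16, 0.15])
impostor = is_genuine([0.60, 0.70, 0.65])
```

The threshold directly trades false accepts against false rejects, which is exactly the knob the answer describes.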
Use 1-class SVMs - Finally, one-class SVMs allow you to train an SVM-like classifier using only positive examples. To learn one-class SVMs in WEKA, use the LibSVM wrapper if you're using v3.6. If you're using the bleeding-edge developer version, there's weka.classifiers.meta.OneClassClassifier.
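The same one-class idea is also available outside WEKA; here is a sketch with scikit-learn's `OneClassSVM` on made-up timing vectors (the `nu` value and data are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical positive-only training data:
# the genuine user's 3-dimensional timing vectors
rng = np.random.default_rng(0)
genuine_attempts = rng.normal(0.15, 0.02, size=(50, 3))

# nu bounds the fraction of training points treated as outliers
clf = OneClassSVM(nu=0.1, gamma="scale").fit(genuine_attempts)

# predict returns +1 for "looks like the genuine user", -1 otherwise
same_user = clf.predict([[0.15, 0.14, 0.16]])
intruder = clf.predict([[0.60, 0.55, 0.70]])
```

No fake negative data is needed, which directly answers the question's worry about having to fabricate the 0-labeled class.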