Data mining, Machine Learning : Click prediction using Logit - machine-learning

I am new to ML. My task is to predict click probability from user information such as city, state, OS family, OS version, device, browser family, browser version, etc.
I have been advised to try logistic regression (logit), since that seems to be what MS and Google use too.
I have some questions about logistic regression:
Click vs. non-click is a highly imbalanced class, and the predictions from a simple glm do not look good. How should I handle the imbalance?
All my variables are categorical, and some, like device and city, have a very large number of levels. Some devices and cities also occur very rarely. How should I deal with such high-cardinality, sparsely populated categorical variables?
One of the variables we receive is the device ID. It is nearly unique and effectively identifies a user. How can it be used in logit, or should it go into a completely different model based on user identity?

Related

Options on evaluating Recommender System

I've created a recommender system that works this way:
-each user selects some filters, and a score is generated based on those filters
-each user is clustered using k-means based on those scores
-whenever a user receives a recommendation, I use Pearson's correlation to see which user has the best correlation to other users from the same cluster
My problem is that I'm not sure what the best way to evaluate this system would be. I've seen that one way to do it is by hiding some values of the dataset, but that doesn't apply to me because I'm not predicting scores.
Are there any metrics I could use?

Algorithm to classify instances from a dataset similar to another smaller dataset, where this smaller dataset represents a single class

I have a dataset that represents instances from a binary class. The twist here is that there are only instances from the positive class and I have none of the negative one. Or rather, I want to extract those from the negatives which are closer to the positives.
To get more concrete let's say we have data of people who bought from our store and asked for a loyalty card at the moment or later of their own volition. Privacy concerns aside (it's just an example) we have different attributes like age, postcode, etc.
The other set of clients, continuing the example, are clients who did not apply for the card.
What we want is to find a subset of those that are most similar to the ones that applied for the loyalty card in the first group, so that we can send them an offer to apply for the loyalty program.
It's not exactly a classification problem because we are trying to get instances from within the group of "negatives".
It's not exactly clustering, which is typically unsupervised, because we already know a cluster (the loyalty card clients).
I thought about using kNN, but I don't really know what my options are here.
I would also like to know whether this can be achieved with Weka or another Java library, and whether I should normalize all the attributes.
You could use anomaly detection algorithms. These algorithms tell you whether your new client belongs to the group of clients who got a loyalty card or not (in which case they would be an anomaly).
There are two basic ideas (coming from the article I linked below):
You transform the feature vectors of your positive labelled data (clients with card) to a vector space with a lower dimensionality (e.g. by using PCA). Then you can calculate the probability distribution for the resulting transformed data and find out whether a new client belongs to the same statistical distribution or not. You can also compute the distance of a new client to the centroid of the transformed data and decide by using the standard deviation of the distribution whether it is still close enough.
The Machine Learning Approach: You train an auto-encoder network on the clients with card data. An auto-encoder has a bottleneck in its architecture. It compresses the input data into a new feature vector with a lower dimensionality and tries afterwards to reconstruct the input data from that compressed vector. If the training is done correctly, the reconstruction error for input data similar to the clients with card dataset should be smaller than for input data which is not similar to it (hopefully these are clients who do not want a card).
Have a look at this tutorial for a start: https://towardsdatascience.com/how-to-use-machine-learning-for-anomaly-detection-and-condition-monitoring-6742f82900d7
Both methods require standardizing the attributes first.
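As a rough illustration of the centroid-distance idea, here is a minimal sketch in plain Python. It standardizes the attributes but skips the PCA step for brevity. The feature names, the sample data, and the "mean + 2 standard deviations" threshold are assumptions for the example, not something taken from the linked article.

```python
import math
import statistics

def fit_standardizer(rows):
    """Return per-column (means, stds) computed from the positive examples."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant columns.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def centroid_distance(row, means, stds):
    """Euclidean distance to the centroid of the standardized positives.

    After standardization the centroid is the origin, so the distance
    is just the norm of the standardized row.
    """
    z = [(x - m) / s for x, m, s in zip(row, means, stds)]
    return math.sqrt(sum(v * v for v in z))

# Hypothetical data: (age, yearly_spend) of clients who hold a card.
card_holders = [(34, 1200.0), (41, 1500.0), (29, 1100.0), (38, 1400.0), (45, 1600.0)]
means, stds = fit_standardizer(card_holders)

# Assumed cutoff: mean + 2 standard deviations of the positives' own distances.
train_dist = [centroid_distance(r, means, stds) for r in card_holders]
threshold = statistics.fmean(train_dist) + 2 * statistics.pstdev(train_dist)

# Hypothetical clients without a card.
candidates = {"similar": (36, 1350.0), "dissimilar": (70, 100.0)}
is_similar = {
    name: centroid_distance(row, means, stds) <= threshold
    for name, row in candidates.items()
}
print(is_similar)
```

The cutoff is the parameter to tune: a larger multiple of the standard deviation lets more clients through as "similar".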
Also try a one-class support vector machine.
This approach tries to model the boundary and gives you a binary decision on whether or not a point belongs to the class. It can be seen as a simple density estimation. The main benefit is that the set of support vectors will be much smaller than the training data.
Or simply use the nearest-neighbor distances to rank users.
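The nearest-neighbor ranking could look roughly like the sketch below. It assumes the features are already standardized, and it scores each candidate by the mean distance to its k nearest positives. The data, the names, and the choice of k=3 are made up for illustration.

```python
import math

def knn_score(candidate, positives, k=3):
    """Mean Euclidean distance from a candidate to its k nearest positives."""
    distances = sorted(math.dist(candidate, p) for p in positives)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)

# Hypothetical, already standardized features of the positive class.
positives = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2)]

# Hypothetical users outside the positive class that we want to rank.
non_members = {"a": (0.1, 0.0), "b": (2.0, 2.0), "c": (0.5, 0.5)}

# Smallest score first: the users most similar to the positives come first.
ranking = sorted(non_members, key=lambda name: knn_score(non_members[name], positives))
print(ranking)
```

Taking the top of this ranking gives the subset to target, with no need to pick a hard decision boundary.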

which ML algorithm to choose

In my lab, I have 10 devices, which I monitor using device-specific features such as:
heat-generated
power consumed
patterns in power consumption
Using a supervised classification model, I can classify these devices.
My problem is this: if we add more devices of different types, how do I classify them? The trained model will assign each new device to one of the existing classes, which is wrong, because the new devices might have their own patterns.
Is there a way to handle this, and if so, how?
If you look at it, it seems like when a new type of device is added to your data-set, you are actually adding a new "Class".
In that case, you might have to retrain your model to accommodate the new Classes added to your dataset.

Matching user profiles with employment opportunities

I am currently working on software that connects users to jobs based on their user profiles. I ran text analytics on the job descriptions and derived important keywords from them. I have also collected user information from the profiles. Matching the jobs to the user profiles seems to be a challenging task. Are there any machine-learning-based algorithms that can be used for this kind of matchmaking?
OK, so basically, you have keywords for each job description and then you have some sort of text data (user profiles) to which you try to match those keywords.
Since your training data (user profiles) is not labeled, supervised learning will not help you here. Unsupervised learning (clustering) might help you find certain patterns (keywords) across a large number of user profiles, but you would need to experiment with different techniques (such as Gaussian mixture models) and observe what patterns emerge.
A simpler option is to derive keywords for each user profile as well (in other words, to identify how many of your job keywords also appear in the user profile) and then measure the distance between the two using cosine similarity. You would then only need to choose the angle threshold for a match, which is a parameter to play with. Of course, you would need to vectorize your text data using bigrams or similar (if you use Python, scikit-learn already provides feature extraction). You could also apply a tf-idf vectorizer to both the job descriptions and the user profiles, but with a substantial, carefully chosen stop-word list.
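To make the cosine-similarity idea concrete, here is a minimal sketch in plain Python. It uses simple unigram counts instead of bigrams or tf-idf. The texts, the profile names, and the 0.3 threshold are invented for the example.

```python
import math
from collections import Counter

def term_counts(text):
    """Very naive vectorizer: lowercase, split on whitespace, count terms."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical keywords extracted from one job description.
job_keywords = term_counts("python machine learning sql")

# Hypothetical user profile texts.
profiles = {
    "alice": term_counts("python developer with machine learning and sql experience"),
    "bob": term_counts("graphic designer photoshop illustrator"),
}

# Assumed threshold; a higher cosine means a smaller angle, i.e. a closer match.
THRESHOLD = 0.3
matches = [
    name for name, profile in profiles.items()
    if cosine_similarity(job_keywords, profile) >= THRESHOLD
]
print(matches)
```

In practice you would swap `term_counts` for a proper vectorizer with a stop-word list, but the comparison and the threshold work the same way.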

Machine Learning Algorithm Suggestion?

I'm a novice in ML. I'm short on time and need to choose an algorithm for the following task:
Travelers visit my website. I have them fill in a form, which gives me all the necessary signals (attributes): whether they have booked a flight, whether the email is genuine, whether a phone number is given, whether the trip date is fixed, and whether the destination is fixed.
However, many visitors don't fill in the form completely or just use a fake phone number.
To reiterate, I have a lot of signals available, and I need to pick out the travelers who are certain to travel so that I can contact them personally. I also need a score on a scale of 10.
Which ML algorithm is best suited for this job, and why?
I have previously worked with WEKA.
You'll need to create an ensemble model (composition of many different algorithms).
