ML algorithm suggestion for databases that change a lot with time after model training - machine-learning

I have a classification problem and I'm using logistic regression (I tested it against other models and it performed best). I collect information from gaming sites and predict whether a user is a potential buyer of certain games.
The problem is that lately some of the sites I get this information from (the same sites the training data came from) change weekly, so the database I use for prediction is now partially different from the one used for training (with different information for each user, in this case). Since these sites started changing, the model's predictive performance has dropped considerably.
One obvious option is to retrain the model. We're considering it, but we would have to do it frequently, since the sites change considerably every couple of weeks.
Another option we considered was using algorithms that can adapt to these changes, so that we could retrain the model less frequently.
Two options raised were neural networks for classification or adapting some genetic algorithm. However, I have read that genetic algorithms are very expensive and not a good fit for classification problems, since they may not converge.
Does anyone have any suggestions for a modeling approach that we can test?
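One option you might test, since logistic regression already works well for you, is an incrementally trained (online) variant that can be updated on each week's freshly labelled data instead of being retrained from scratch. A minimal sketch with scikit-learn's SGDClassifier (log loss, i.e. logistic regression fit by SGD); the feature shapes and labels are placeholders:

```python
# Hedged sketch: incremental (online) logistic regression via SGD.
# Use loss="log" instead of "log_loss" on scikit-learn versions older than 1.1.
import numpy as np
from sklearn.linear_model import SGDClassifier

classes = np.array([0, 1])  # 0 = non-buyer, 1 = potential buyer (assumed labels)
model = SGDClassifier(loss="log_loss", penalty="l2", random_state=0)

# Initial fit on the original training data (placeholder arrays here).
X_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, 1000)
model.partial_fit(X_train, y_train, classes=classes)

# Each week, as newly labelled examples from the changed sites arrive,
# update the existing model instead of retraining it from scratch.
X_new = np.random.rand(200, 20)
y_new = np.random.randint(0, 2, 200)
model.partial_fit(X_new, y_new)
```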

Related

How to adjust feature importance in Azure AutoML

I am hoping to build a low-code model with Azure AutoML, which really just means going to the AutoML tab, running a classification experiment with my dataset, and, once it's done, deploying the best selected model.
The model kind of works (meaning I publish the endpoint, do some manual validation, and it seems accurate), but I'm not confident enough in it, because when I look at the explanation I see something like this:
The top 4 features are not really the ones I consider important. The most "important" one is really not the one I'd prefer it to use; I was hoping it would use the Title feature more.
Is there a way to adjust the importance of individual features, such as ranking all the features before the experiment starts?
I would love to do more reading, but I only found this:
Increase feature importance
The only answer seems to be about how to measure if a feature is important.
So does this mean that if I want to customize the experiment, such as selecting which features to "focus" on, I should learn how to use the "designer" part of Azure ML? Or is it something I can't do even with the designer? I guess my confusion is that, with ML being such a big topic, I am looking for a direction of learning for this particular situation, so I can improve my current model.
Here is a link to the documentation for feature customization.
Using the SDK you can specify "featurization": 'auto' / 'off' / 'FeaturizationConfig' in your AutoMLConfig object. Learn more about enabling featurization.
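As a minimal, hedged sketch (Azure ML Python SDK v1; module paths and argument names can differ across SDK versions), a FeaturizationConfig could be used to steer AutoML toward treating the Title column as free text; the dataset and column names here are assumptions:

```python
# Hedged sketch: customizing featurization in an AutoML classification run.
from azureml.train.automl import AutoMLConfig
from azureml.automl.core.featurization import FeaturizationConfig

featurization_config = FeaturizationConfig()
# Tell AutoML to treat the Title column as free text so its text featurizers
# are applied, rather than treating it as categorical or dropping it.
featurization_config.add_column_purpose("Title", "Text")

automl_config = AutoMLConfig(
    task="classification",
    training_data=train_dataset,       # an Azure ML TabularDataset (assumed to exist)
    label_column_name="Label",         # hypothetical label column name
    primary_metric="AUC_weighted",
    featurization=featurization_config,
)
```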
Automated ML tries out different ML models with different settings that control for overfitting, and it picks the best configuration based on the score (e.g. accuracy) it gets on hold-out data. The kinds of overfitting controls these models use include:
Explicitly penalizing overly-complex models in the loss function that the ML model is optimizing
Limiting model complexity before training, for example by limiting the size of trees in an ensemble tree learning model (e.g. gradient boosting trees or random forest)
https://learn.microsoft.com/en-us/azure/machine-learning/concept-manage-ml-pitfalls
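As a rough illustration (not AutoML's actual internals), the two controls listed above correspond to familiar knobs in plain scikit-learn:

```python
# Hedged sketch of the two overfitting controls described above, expressed as
# ordinary scikit-learn estimators; AutoML tunes analogous settings internally.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# 1) Explicit complexity penalty in the loss: smaller C = stronger L2 penalty.
logit = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)

# 2) Capping model complexity up front: shallow trees and a limited ensemble size.
gbt = GradientBoostingClassifier(max_depth=3, n_estimators=100, learning_rate=0.1)
```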

When true positives are rare

Suppose you're trying to use machine learning for a classification task like, let's say, looking at photographs of animals and distinguishing horses from zebras. This task would seem to be within the state of the art.
But if you take a bunch of labelled photographs and throw them at something like a neural network or support vector machine, what happens in practice is that zebras are so much rarer than horses that the system just ends up learning to say 'always a horse' because this is actually the way to minimize its error.
That may minimize error, but it's not a very useful result. What is the recommended way to tell the system 'I want the best guess at which photographs are zebras, even if this does create some false positives'? There doesn't seem to be much discussion of this problem.
One of the things I usually do with imbalanced classes (or skewed data sets) is simply to generate more data. I think this is the best approach. You could go out into the real world and gather more data for the minority class (e.g. find more pictures of zebras), or you could generate more data by making copies or duplicating samples with transformations (e.g. flipping horizontally).
You could also pick a classifier that uses an alternative evaluation (performance) metric instead of the usual one, accuracy. Look at precision, recall, and the F1 score.
Week 6 of Andrew Ng's ML course talks about this topic: link
Here is another good web page I found on handling imbalanced classes: link
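To make the class-weighting and precision/recall idea concrete, here is a small sketch on synthetic data; LinearSVC and the horse/zebra labels are stand-ins for whatever classifier and classes you actually use:

```python
# Hedged sketch: weight the rare class more heavily and judge the model with
# precision/recall/F1 rather than plain accuracy. Labels: 0 = horse, 1 = zebra.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" penalizes mistakes on the rare class more,
# so "always predict horse" is no longer the cheapest strategy.
clf = LinearSVC(class_weight="balanced")
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test), target_names=["horse", "zebra"]))
```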
With this type of unbalanced data problem, it is a good approach to learn the patterns associated with each class rather than simply comparing the classes; this can be done with unsupervised learning first (such as with autoencoders). A good article on this is available at https://www.r-bloggers.com/autoencoders-and-anomaly-detection-with-machine-learning-in-fraud-analytics/amp/. Another suggestion: after running the classifier, the confusion matrix can be used to determine where additional data should be pursued (i.e. many zebra errors).
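A minimal sketch of that confusion-matrix check, assuming clf, X_test and y_test already exist from a trained classifier as in the earlier example:

```python
# Hedged sketch: inspect the confusion matrix to see where the rare-class
# errors concentrate, and hence where extra data collection would pay off.
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(f"missed zebras (false negatives): {fn}, false alarms (false positives): {fp}")
```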

How to scale up a model in a training dataset to cover all aspects of training data

I was asked in an interview to solve a use case with the help of machine learning: I have to use a machine learning algorithm to identify fraud in transactions. My training dataset has, let's say, 100,200 transactions, of which 100,000 are legal transactions and 200 are fraudulent.
I cannot use the dataset as a whole to make the model because it would be a biased dataset and the model would be a very bad one.
Let's say, for example, I take a sample of 200 good transactions that represent the good transactions well, plus the 200 fraudulent ones, and build the model using this as the training data.
The question I was asked was how I would scale up from the 200 good transactions to the whole set of 100,000 good records, so that my results map to all types of transactions. I have never solved this kind of scenario, so I did not know how to approach it.
Any kind of guidance as to how I can go about it would be helpful.
This is a general question thrown out in an interview. Information about the problem is succinct and vague (we don't know, for example, the number of features!). The first thing you need to ask yourself is: what does the interviewer want me to answer? So, based on this context, the answer has to be formulated in a similarly general way. This means that we don't have to find 'the solution' but instead give arguments showing that we actually know how to approach the problem.
The problem we are presented with is that the minority class (fraud) makes up only ~0.2% of the total. This is obviously a huge imbalance. A predictor that simply labelled all cases as 'non fraud' would get a classification accuracy of 99.8%! Therefore, something definitely has to be done.
We will define our main task as a binary classification problem where we want to predict whether a transaction is labelled as positive (fraud) or negative (not fraud).
The first step would be to consider what techniques we have available to reduce the imbalance. This can be done either by reducing the majority class (undersampling) or by increasing the number of minority samples (oversampling). Both have drawbacks, though: the first implies a severe loss of potentially useful information from the dataset, while the second can lead to overfitting. Techniques such as SMOTE and ADASYN mitigate this overfitting risk by using strategies that increase the variety of the newly generated synthetic samples.
Of course, cross-validation in this case becomes paramount. Additionally, in case we are finally doing oversampling, this has to be 'coordinated' with the cross-validation approach to ensure we are making the most of these two ideas. Check http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation for more details.
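One way to get that coordination is the imbalanced-learn package, whose Pipeline re-fits the oversampler inside each training fold, so no synthetic samples leak into the validation folds. A hedged sketch (X and y are placeholders for the transaction features and labels):

```python
# Hedged sketch: SMOTE applied only within each training fold via an imblearn Pipeline.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(random_state=0)),
])

# X, y stand in for the 100,200 transactions and their fraud labels.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```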
Apart from these sampling ideas, when selecting our learner we should remember that many ML methods can be trained/optimised for specific metrics. In our case, we definitely do not want to optimise accuracy. Instead, we want to train the model to optimise either ROC-AUC or, more specifically, a high recall even at the cost of precision, since we want to catch all the positive 'frauds', or at least raise an alarm even though some will turn out to be false alarms. Models can adapt internal parameters (thresholds) to find the optimal balance between these two metrics. Have a look at this nice blog for more info about metrics: https://www.analyticsvidhya.com/blog/2016/02/7-important-model-evaluation-error-metrics/
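A small sketch of that threshold idea: score with predicted probabilities and lower the decision threshold until recall on the fraud class is acceptable, accepting the drop in precision (model, X_test and y_test are assumed to exist from an earlier fit):

```python
# Hedged sketch: trading precision for recall by moving the decision threshold.
from sklearn.metrics import precision_score, recall_score

proba = model.predict_proba(X_test)[:, 1]   # probability of the fraud class
for threshold in (0.5, 0.3, 0.1):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}  "
          f"precision={precision_score(y_test, pred):.3f}  "
          f"recall={recall_score(y_test, pred):.3f}")
```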
Finally, it is only a matter of evaluating the model empirically to check which options and parameters are the most suitable for the dataset. Following these ideas does not guarantee 100% that we will be able to tackle the problem at hand, but it ensures we are in a much better position to learn from the data and get rid of those evil fraudsters out there, while perhaps getting a nice job along the way ;)
In this problem you want to classify transactions as good or fraudulent. However, your data is really imbalanced, so you will probably be interested in anomaly detection. I will let you read the whole article for more details, but I will quote a few parts in my answer.
I think this will convince you that this is what you are looking for to solve this problem:
Is it not just Classification?
The answer is yes if the following three conditions are met: you have labeled training data; anomalous and normal classes are balanced (say at least 1:5); and the data is not autocorrelated (that is, one data point does not depend on earlier data points, which often breaks in time series data). If all of the above is true, we do not need anomaly detection techniques and we can use an algorithm like Random Forests or Support Vector Machines (SVM).
However, often it is very hard to find training data, and even when you can find it, most anomalies are 1:1000 to 1:10^6 events where classes are not balanced.
Now to answer your question:
Generally, the class imbalance is solved using an ensemble built by resampling the data many times. The idea is to first create new datasets by taking all anomalous data points and adding a subset of normal data points (e.g. 4 times as many as the anomalous data points). Then a classifier is built for each dataset using SVM or Random Forest, and those classifiers are combined using ensemble learning. This approach has worked well and produced very good results.
If the data points are autocorrelated with each other, then simple classifiers would not work well. We handle those use cases using time series classification techniques or recurrent neural networks.
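A hedged sketch of the resampling ensemble described in that quote, written with plain scikit-learn rather than any particular library helper (X and y are placeholders for the transaction features and labels as NumPy arrays):

```python
# Hedged sketch: each ensemble member sees every fraud example plus a fresh
# random subset of normal examples (about 4x as many); predictions are averaged.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
fraud_idx = np.flatnonzero(y == 1)
normal_idx = np.flatnonzero(y == 0)

members = []
for _ in range(10):
    sampled_normal = rng.choice(normal_idx, size=4 * len(fraud_idx), replace=False)
    idx = np.concatenate([fraud_idx, sampled_normal])
    members.append(RandomForestClassifier(random_state=0).fit(X[idx], y[idx]))

def ensemble_proba(X_new):
    # Average the fraud probabilities across ensemble members.
    return np.mean([m.predict_proba(X_new)[:, 1] for m in members], axis=0)
```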
I would also suggest another approach to the problem. In the same article the author says:
If you do not have training data, it is still possible to do anomaly detection using unsupervised learning and semi-supervised learning. However, after building the model, you will have no idea how well it is doing, as you have nothing to test it against. Hence, the results of those methods need to be tested in the field before placing them in the critical path.
However, you do have a few fraud examples to test whether your unsupervised algorithm is doing well or not, and if it does a good enough job, it can be a first solution that will help gather more data to train a supervised classifier later.
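For instance, a minimal sketch of an unsupervised detector sanity-checked against the few known frauds; IsolationForest is one possible choice (not the article's specific method), and X, y are placeholders:

```python
# Hedged sketch: unsupervised anomaly detection, validated on the ~200 known frauds.
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

iso = IsolationForest(contamination=0.002, random_state=0)  # ~200 frauds / 100,200 rows
iso.fit(X)

# IsolationForest returns -1 for anomalies; map to 1 = fraud, 0 = normal.
pred = (iso.predict(X) == -1).astype(int)
print("recall on known frauds:", recall_score(y, pred))
print("precision:", precision_score(y, pred))
```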
Note that I am not an expert and this is just what I've come up with after mixing my knowledge and some articles I read recently on the subject.
For more questions about machine learning, I suggest you use this Stack Exchange community.
I hope it will help you :)

Multiple Naive Bayes classifiers

I'm looking at implementing a Naive Bayes classifier for a review site in order to identify spam reviews, and I have a couple of questions.
It occurs to me that there are multiple types of spam, such as outright marketing rubbish with nothing to do with the thing being reviewed, versus a deceptive review. Would it be wise to implement multiple classifiers for different purposes, so that one gets better at general spam detection while the other learns deceptive reviews?
In a similar vein, there are multiple categories of items being reviewed, so for the "deceptive review" classifier, would it be best to have just one classifier that tries to learn from all reviews? Or would it be better to have a classifier per category, so that it can learn the nuances within those categories?
I know these won't be foolproof; it's just about flagging potential reviews for manual checking, but I'm unclear on the best setup.
As long as you're using any sufficiently complex algorithm, you should be able to discriminate "good" vs "bad" data with either method. If you do it all with one model, you'll simply need to increase the model size so that the comprehensive model can build (at worst) independent paths to the two decisions, "spam" and "deception".
If you're training this on three separate classifications (good, spam, and deceptive), then you're doing fine either way. Note, however, that your model size is smaller with separate trainings, and your training times will be shorter, as there will be fewer inaccurate guesses along the way.
On the other hand, using two models for later actual use will likely slow down detection, since each output that passes the first model must be run through the second. For most applications, this time is not a significant factor.
Most of all, I would start with a separate model for each class: any problems with implementation and training will be faster to find and easier to isolate.
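To make the two setups concrete, here is a hedged sketch of both options with scikit-learn; the `reviews` list of (category, text, label) tuples and its labels ("ok", "spam", "deceptive") are hypothetical:

```python
# Hedged sketch: one text classifier over all reviews vs. one per category.
from collections import defaultdict
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def new_model():
    return make_pipeline(TfidfVectorizer(), MultinomialNB())

# Option A: a single three-class classifier trained across all categories.
texts = [text for _, text, _ in reviews]
labels = [label for _, _, label in reviews]
single = new_model().fit(texts, labels)

# Option B: a separate classifier per category, free to learn category-specific nuances.
per_category = {}
grouped = defaultdict(list)
for category, text, label in reviews:
    grouped[category].append((text, label))
for category, rows in grouped.items():
    cat_texts, cat_labels = zip(*rows)
    per_category[category] = new_model().fit(list(cat_texts), list(cat_labels))
```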

Use feedback or reinforcement in machine learning?

I am trying to solve a classification problem. It seems many classical approaches follow a similar paradigm: train a model on some training set and then use it to predict the class labels for new instances.
I am wondering if it is possible to introduce some feedback mechanism into the paradigm. In control theory, introducing a feedback loop is an effective way to improve system performance.
Currently, a straightforward approach on my mind is this: first we start with an initial set of instances and train a model on them. Then, each time the model makes a wrong prediction, we add the misclassified instance to the training set. This is different from blindly enlarging the training set because it is more targeted. It can be seen as a kind of negative feedback in the language of control theory.
Is there any research going on with the feedback approach? Could anyone shed some light?
There are two areas of research that spring to mind.
The first is Reinforcement Learning. This is an online learning paradigm that allows you to get feedback and update your policy (in this instance, your classifier) as you observe the results.
The second is active learning, where the classifier gets to select examples from a pool of unclassified examples to get labelled. The key is to have the classifier choose the examples for labelling which best improve its accuracy by choosing difficult examples under the current classifier hypothesis.
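A minimal sketch of pool-based active learning with uncertainty sampling; the `oracle_label` labelling function and the labelled/unlabelled arrays are hypothetical placeholders:

```python
# Hedged sketch: repeatedly query labels for the examples the current
# classifier is least sure about, then retrain with that feedback.
import numpy as np
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_labelled, y_labelled)          # small initial labelled set (assumed)

for _ in range(10):                        # 10 feedback rounds
    proba = model.predict_proba(X_pool)    # unlabelled pool (assumed)
    uncertainty = 1 - proba.max(axis=1)    # low max-probability = uncertain
    query = np.argsort(uncertainty)[-20:]  # 20 most uncertain examples

    X_labelled = np.vstack([X_labelled, X_pool[query]])
    y_labelled = np.concatenate([y_labelled, oracle_label(X_pool[query])])
    X_pool = np.delete(X_pool, query, axis=0)

    model.fit(X_labelled, y_labelled)      # retrain with the new feedback
```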
I have used such feedback in every machine-learning project I have worked on. It lets you train on less data (so training is faster) than selecting data randomly, and model accuracy also improves faster than with randomly selected training data. I work on image processing (computer vision) data, so one other type of selection I do is to add clustered false (wrong) examples instead of adding every single false example. This is because I assume there will always be some failures, so my definition of positive data is data that is clustered in the same area of the image.
I saw this paper some time ago, which seems to be what you are looking for.
They basically model classification problems as Markov decision processes and solve them using the ACLA algorithm. The paper is much more detailed than what I could write here, but ultimately they get results that outperform a multilayer perceptron, so this looks like a pretty efficient method.
