How to build target variable for supervised machine learning project - machine-learning

I am quite new to machine learning with small experience and I did some projects.
Now I have a project relates to insurance. So I have databases about clients that I will merge to get all possible information about the clients and I have one database for the claims. I need to build a model to identify how risky the client based on ranks.
My question: I need to build my target variable that ranks the clients based on how risky they are, counting on the claims. I could have different strategies to do that, but I am confused about how I will deal with the following:
- Shall I do a specific type of analysis before building the ranks such as clustering, or I need to have a strong theoretical assumption matching with the project provider vision.
- If I use some variables in the claims database to build up the ranks, how shall I deal with them later. In other words, shall I remove them from the final data set for training, to avoid correlation with target variable, or I can treat them in a different way and keep them.
- If I will keep them, is there a special treatment for them depending on whether they are categorical or continuous variables.

Every machine learning project's starting point is EDA. First create some feature, like how often do they get bad claims or how many do they get. Then do some EDA to find which features are more useful. Secondly, the problem looks like classification. Clustering is usually harder to evaluate.

In data sciences when you make a business model, EDA exploratory data analytics play a major role which includes data cleaning, feature engineering, filtering data. As mentioned how to build target variable, it all depends on the attributes you have and what model do you want to apply say linear regression or logistic or make a decision tree. You need to use those algorithms. But most importantly you need to find out the impacting variable. That's probably the core elation between the output and the given input and priority must be given accordingly. Also attributes which add no value must be removed as that would contribute to overfitting.
You can do clustering too. And interesting thing is any unspervisoned learning could be converted to a form of supervised learning. Probably you can try to do logistic regression or do linear regression etc... And find out which model fits best to your project.

Related

Transfer Learning for small datasets of structured data

I am looking to implement machine learning for a problems that are built on small data sets related to approvals of expenses in a specific supply chain domain. Typically labelled data is unavailable
I was looking to build models in one data set that I have labelled data and then use that model developed in similar contexts- where the feature set is very similar, but not identical. The expectation is that this allows the starting point for recommendations and gather labelled data in the new context.
I understand this is the essence of Transfer Learning. Most of the examples I read in this domain speak of image data sets- any guidance how this can be leveraged in small data sets using standard tree-based classification algorithms
I can’t really speak to tree-based algos, I don’t know how to do transfer learning with them. But, for deep learning models, the customary method for transfer learning is to load up a pretrained model, then retrain the last layer of the dataset using your new data, and then fine-tune the rest of the network.
If you don’t have much data to go on, you might look into creating synthetic data.
raghu, I believe you are looking for a kernel method when you are saying abstraction layer in deep learning. There are several ML algorithms that support kernel functions. With kernel functions, you might be able to do it; but using kernel functions might be more complex than solving your original problem. I would lean toward Tdoggo's suggestion of using Decision Tree.
Sorry, I want to add a comment, but they won't allow me, so I posted a new answer.
Ok with tree-based algos you can do just what you said: train the tree on one dataset and apply it to another similar dataset. All you would need to do is change the terms/nodes on the second tree.
For instance, let’s say you have a decision tree trained for filtering expenses for a construction company. You will outright deny any reimbursements for workboots, because workers should provide those themselves.
You want to use the trained tree on your accounting firm, and so instead of workboots, you change that term to laptops, because accountants should be buying their own.
Does that make sense, and is that helpful to you?
After some research, we have decided to proceed with random forest models with the intuition that trees in the original model that have common features will form the starting point for decisions.
As we gain more labelled data in the new context, we will start replacing the original trees with new trees that comprise of (a)only new features and (b) combination of old and new features
This has worked to provide reasonable results in initial trials

In a regression task, how do I find which independent variables are to be ignored or are not important?

In the regression problem I'm working with, there are five independent columns and one dependent column. I cannot share the data set details directly due to privacy, but one of the independent variables is an ID field which is unique for each example.
I feel like I should not be using ID field in estimating the dependent variable. But this is just a gut feeling. I have no strong reason to do this.
What shall I do? Is there any way I decide which variables to use and which to ignore?
Well, I agree with #desertnaut. Id attribute does not seem relevant when creating a model and provides no help in prediction.
The term you are looking for is feature selection. Since it's a comprehensive section so I would just tell you the methods that are mostly used by data scientists.
As for regression problems you can try correlation heatmap to find the features that are highly correlated with the target.
sns.heatmap(df.corr())
There are several other ways too like PCA,using tree inbuilt feature selection methods to find the right features for your model.
You can also try James Phillips method. This approach is limited since model time complexity will increase linearly with the features. But in your case where you've only four features to compare you can try it out. You can compare the regression model trained with all the four features with the model trained with only three features by dropping one of the four features recursively. This would mean training four regression models and comparing them.
According to you, the ID variable is unique for each example. So the model won't be able to learn anything from this variable as with every example, you get a new ID and hence no general patterns to learn as every ID only occurs once.
Regarding feature elimination, it depends. If you have domain knowledge, based on that alone you can engineer/ remove features as needed. If you don't know much about the domain, you can try out some basic techniques like Backward Selection, Forward Selection, etc via Cross Validation to get the model with the best value of the metric that you're working with.

When true positives are rare

Suppose you're trying to use machine learning for a classification task like, let's say, looking at photographs of animals and distinguishing horses from zebras. This task would seem to be within the state of the art.
But if you take a bunch of labelled photographs and throw them at something like a neural network or support vector machine, what happens in practice is that zebras are so much rarer than horses that the system just ends up learning to say 'always a horse' because this is actually the way to minimize its error.
Minimal error that may be but it's also not a very useful result. What is the recommended way to tell the system 'I want the best guess at which photographs are zebras, even if this does create some false positives'? There doesn't seem to be a lot of discussion of this problem.
One of the things I usually do with imbalanced classes (or skewed data sets) is simply generate more data. I think this is the best approach. You could go out in the real world and gather more data of the imbalanced class (e.g. find more pictures of zebras). You could also generate more data by simply making copies or duplicating it with transformations (e.g. flip horizontally).
You could also pick a classifier that uses an alternate evaluation (performance) metric over the one usually used - accuracy. Look at precision/recall/F1 score.
Week 6 of Andrew Ng's ML course talks about this topic: link
Here is another good web page I found on handling imbalanced classes: link
With this type of unbalanced data problem, it is a good approach to learn patterns associated with each class as opposed to simply comparing classes - this can be done via unsupervised learning learning first (such as with autoencoders). A good article with this available at https://www.r-bloggers.com/autoencoders-and-anomaly-detection-with-machine-learning-in-fraud-analytics/amp/. Another suggestion - after running the classifier, the confusion matrix can be used to determine where additional data should be pursued (I.e. many zebra errors)

How to build a good training data set for machine learning and predictions?

I have a school project to make a program that uses the Weka tools to make predictions on football (soccer) games.
Since the algorithms are already there (the J48 algorithm), I need just the data. I found a website that offers football game data for free and I tried it in Weka but the predictions were pretty bad so I assume my data is not structured properly.
I need to extract the data from my source and format it another way in order to make new attributes and classes for my model. Does anyone know of a course/tutorial/guide on how to properly create your attributes and classes for machine learning predictions? Is there a standard that describes the best way of choosing the attributes of a data set for training a machine learning algorithm? What's the approach on this?
here's an example of the data that I have at the moment: http://www.football-data.co.uk/mmz4281/1516/E0.csv
and here is what the columns mean: http://www.football-data.co.uk/notes.txt
The problem may be that the data set you have is too small. Suppose you have ten variables and each variable has a range of 10 values. There are 10^10 possible configurations of these variables. It is unlikely your data set will be this large let alone cover all of the possible configurations. The trick is to narrow down the variables to the most relevant to avoid this large potential search space.
A second problem is that certain combinations of variables may be more significant than others.
The J48 algorithm attempts to to find the most relevant variable using entropy at each level in the tree. each path through the tree can be thought of as an AND condition: V1==a & V2==b ...
This covers the significance due to joint interactions. But what if the outcome is a result of A&B&C OR W&X&Y? The J48 algorithm will find only one and it will be the one where the the first variable selected will have the most overall significance when considered alone.
So, to answer your question, you need to not only find a training set which will cover the most common variable configurations in the "general" population but find an algorithm which will faithfully represent these training cases. Faithful meaning it will generally apply to unseen cases.
It's not an easy task. Many people and much money are involved in sports betting. If it were as easy as selecting the proper training set, you can be sure it would have been found by now.
EDIT:
It was asked in the comments how to you find the proper algorithm. The answer is the same way you find a needle in a haystack. There is no set rule. You may be lucky and stumble across it but in a large search space you won't ever know if you have. This is the same problem as finding the optimum point in a very convoluted search space.
A short-term answer is to
Think about what the algorithm can really accomplish. The J48 (and similar) algorithms are best suited for classification where the influence of the variables on the result are well known and follow a hierarchy. Flower classification is one example where it will likely excel.
Check the model against the training set. If it does poorly with the training set then it will likely have poor performance with unseen data. In general, you should expect the model to performance against the training to exceed the performance against unseen data.
The algorithm needs to be tested with data it has never seen. Testing against the training set, while a quick elimination test, will likely lead to overconfidence.
Reserve some of your data for testing. Weka provides a way to do this. The best case scenario would be to build the model on all cases except one (Leave On Out Approach) then see how the model performs on the average with these.
But this assumes the data at hand are not in some way biased.
A second pitfall is to let the test results bias the way you build the model.For example, trying different models parameters until you get an acceptable test response. With J48 it's not easy to allow this bias to creep in but if it did then you have just used your test set as an auxiliary training set.
Continue collecting more data; testing as long as possible. Even after all of the above, you still won't know how useful the algorithm is unless you can observe its performance against future cases. When what appears to be a good model starts behaving poorly then it's time to go back to the drawing board.
Surprisingly, there are a large number of fields (mostly in the soft sciences) which fail to see the need to verify the model with future data. But this is a matter better discussed elsewhere.
This may not be the answer you are looking for but it is the way things are.
In summary,
The training data set should cover the 'significant' variable configurations
You should verify the model against unseen data
Identifying (1) and doing (2) are the tricky bits. There is no cut-and-dried recipe to follow.

How to deploy machine learning algorithm in production environment?

I'm new to machine learning algorithm. I'm learning basic algorithms like regression, classification, clustering, sequence modelling, on-line algorithms. All the article that are available on internet shows how to use these algorithm with specific data. There is no article regarding deployment of those algorithm in production environment. So my questions are
1) How to deploy machine learning algorithm in production environment?
2) The typical approach follows in machine learning tutorial is to build the model using some training data, use it for testing data. But is it advisable to use that kind of model in production environment? Incoming data may keep changing so the model will be ineffective. What should be duration for the model refresh cycle to accommodate such changes?
I am not sure if this is a good question (since it is too general and not formulated good), but I suggest you to read about bias - variance tradeoff. Long story short, you could have low bias\high variance machine-learning model and get 100% accurate results on your test data (the data you used to implement a model), but you could cause your model to overfit the training data. As result, when you will try to use it on data which you haven't used during training it will lead to poor performance. On the other hand, you may have high bias\low variance model, which will be poorly fit to your training data and will also perform just as bad on new production data. Keeping this in mind general guideline will be:
1) Obtain some good amount of data which you could use to build a prototype of machine-learning system
2) Split your data into train set, cross-validation set and test set
3) Create a model which will have relatively low bias (good accuracy, actually - good F1 score) on your test data. Then try this model on cross-validation set to see the results. If the results are bad - you have a high variance problem, you used a model which overfit the data and can't generalize well. Re-write your model, play with model parameters or use different algorithm. Repeat until you get a good result on CV set
4) Since we played with the model in order to get a good result on CV set, you want to test your final model on test set. If it is good - that's it, you have a final version of model and could use it on prod environment.
Second question has no answer, it is based on your data and your application. But 2 general approaches might be used:
1) Do everything I mentioned earlier to build a model with a good performance on test set. Re-train your model on new data once in some period (try different periods, but you could try to re-train your model once you see that performance of model dropped down).
2) Use online-learning approach. This is not applicable for many algorithms, but for some cases it could be used. Generally, if you see that you could use stochastic gradient descent learning method - you could use online-learning and just keep your model up-to-date with the newest production data.
Keep in mind that even if you use #2 (online-learning approach) you can't be sure that your model will be good forever. Sooner or later the data you get may change significantly and you may want to use whole different model (for example switch to ANN instead of SWM or logistic regression).
DISCLAIMER: I work for this company, Datmo building a better workflow for ML. We’re always looking to help fellow developers working on ML so feel free to reach out to me at anand#datmo.com if you have any questions.
1) In order to deploy, you should first split up your code into preprocessing, training and test. This way you can easily encapsulate the required components for deployment. Usually, you will then want to take your preprocessing, test, as well as your weights file (the output of your training process) and put them in one folder. Next, you will want to host this on a server and wrap an API server around this. I would suggest a Flask Restful API so that you can use query parameters as your inputs and output your response in standard JSON blobs.
To host it on a server, you can use this article which talks about how you can deploy a Flask API on EC2.
You can load and model and serve it as API as given in this code.
2) Hard for me to answer without more details. It's highly dependent on the type of data and the type of model. For example, for deep learning, there is no such thing as online learning.
I am sorry that my comments does not include too much detail* since I am also a newbie in "deployment" of ML. But since the author is also new in ML, I hope these basic guidance could be helpful as well.
For "deployment", you should
Have ML algorithms: You may use free-tools, or develop your own tool using libraries in Python, R, Java, .Net, .. or use a system on cloud..)
Train those ML models using training datasets
Save those trained models (You should search this topic based on your development environment. There are some file formats that Tensorflow/Keras provide, or formats like pickle, ONNX,.. I would like to write a whole list here, with their supporting language & environment, advantage&disadvantage and loadability but I am also trying to investigate this topic, as a newbie)
And THEN, you can deploy these saved-models on production. On production you should either have your own-developed application to run the saved model (For example: an application that you developed with Python that takes trained&saved .pickle file and TestData as input; and simply gives "prediction for the test data" as output) or you should have an environment/framework that runs the saved models (search for ML environments/frameworks on cloud). At first, you should clarify your need: Do you need a stand-alone program on production, or will you serve a internal web-service, or via-cloud, etc.
For the second question; as above answers indicate the issue is "online training ability" of the models. Please additionally note that; for "online learning", your production environment has to feed your production tool/system with the real-correct label of the test data as well. Will you have that capability?
Note: All above are just small "comments" instead of a clear answer, but technically I am not able to write comments yet. Thanks for not de-voting :)
Regarding the first question, my service mlrequest makes deploying models to production simple. You can get started with a free API key that provides 50k model transactions a month.
This code will train and deploy, or update your model across 5 global data centers.
from mlrequest import Classifier
classifier = Classifier('my-api-key')
features = {'feature1': 'val1', 'feature2': 45}
training_data = {'features': features, 'label': 2}
r = classifier.learn(training_data=training_data, model_name='my-model', class_count=2)
This is how you make predictions, latency-routed to the nearest data center to get the quickest response.
features = {'feature1': 'val1', 'feature2': 77}
r = classifier.predict(features=features, model_name='my-model', class_count=2)
r.predict_result
Regarding your second question, it completely depends on the problem you are solving. Some models need to be frequently updated, while others almost never need to be updated.

Resources