What's the best way to construct the data for and train a model with user-specific data? Should each user get their own model to ensure it learns specifics of their data? Or is it possible to create a global model that keeps users' data mostly separate (e.g. with a feature that identifies the user)?
The prediction task I have in mind is either regression or classification.
You should use both. First train a global model to capture general behavior; once that is done, copy it to every user and fine-tune each copy on that user's individual data.
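As a rough sketch of that two-stage setup (the toy data, the choice of `SGDRegressor`, and the five fine-tuning passes are my own assumptions for illustration, not part of the question):

```python
import copy
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

# Toy data: 3 users whose targets share a global trend plus a per-user offset.
user_data = {}
for user_id, offset in enumerate([0.0, 2.0, -1.5]):
    X_u = rng.normal(size=(200, 4))
    y_u = X_u @ np.array([1.0, -2.0, 0.5, 0.0]) + offset + rng.normal(scale=0.1, size=200)
    user_data[user_id] = (X_u, y_u)

# 1) Train a global model on everyone's data to capture general behaviour.
X_global = np.vstack([X for X, _ in user_data.values()])
y_global = np.concatenate([y for _, y in user_data.values()])
global_model = SGDRegressor(random_state=0).fit(X_global, y_global)

# 2) Copy the global model to each user and 3) fine-tune it on that user's own data.
user_models = {}
for user_id, (X_u, y_u) in user_data.items():
    m = copy.deepcopy(global_model)
    for _ in range(5):
        m.partial_fit(X_u, y_u)
    user_models[user_id] = m
```

The alternative from the question, a single global model with a user-identifying feature, also works; the sketch only illustrates the copy-then-fine-tune route described above.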
Consider use cases like
lending money - the ML model predicts that it is safe to lend money to an individual.
predictive maintenance, in which a machine learning model predicts that a piece of equipment will not fail.
In the above cases it is easy to tell whether the ML model's prediction was correct: either the money was paid back or it wasn't, and either the equipment part failed or it didn't.
How is a model's performance evaluated in the following scenarios? Am I correct that it is not possible to evaluate performance in these cases?
lending money - the ML model predicts that lending money to an individual is NOT safe, and the money is not lent.
predictive maintenance, in which a machine learning model predicts that a piece of equipment will fail, and the equipment is thus replaced.
In general, would I be correct in saying that some predictions can be evaluated while others can't? For scenarios where the performance can't be evaluated, how do businesses ensure that they are not losing opportunities due to incorrect predictions? I am guessing there is no way to do this, as this problem exists in general even without the use of ML models. Just putting my doubt/question here to validate my thought process.
If you think about it, both groups are referring to the same model, just different use cases. If you take the model predicting whether it's safe to lend money and invert its prediction, you get a prediction of whether it's NOT safe to lend money.
And if you use your model to predict safe lending, you would still care about increasing recall (i.e. reducing the number of safe cases that are classified as unsafe).
Some predictions can't be evaluated if we act on them (if we denied lending, we can't tell whether the model was right). Another related problem is gathering a good dataset to train the model further: usually we would train the model on the data we observed, and if we deny 90% of applications based on the current model's predictions, then in the future we can only train the next model on the remaining 10% of applications.
However, there are some ways to work around this:
Turning the model off for some percentage of applications. Let's say a random 1% of applications are approved regardless of the model's prediction. This gives us an unbiased dataset for evaluating the model (see the sketch after this list).
Using historical data that was gathered before the model was introduced.
Finding a proxy metric that correlates with the business metric but is easier to evaluate. As an example, you could measure the percentage of applicants who made late payments (with other lenders, not us) within 1 year of their application, among the applicants approved vs. rejected by our model. The larger the difference in this metric between the rejected and approved groups, the better our model performs. But for this to work, you have to show that this metric correlates with the probability of our lending being unsafe.
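A minimal sketch of the first workaround, the random holdout (the `decide` helper, the 1% `holdout_rate`, and the model interface are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def decide(application_features, model, holdout_rate=0.01):
    """Approve a small random fraction regardless of the model,
    so their outcomes give an unbiased sample for later evaluation."""
    if rng.random() < holdout_rate:
        return "approve", True                         # flagged as a holdout case
    unsafe = model.predict([application_features])[0]  # 1 = unsafe to lend
    return ("reject" if unsafe else "approve"), False
```

Only the flagged holdout cases, whose repayment outcome is always observed, go into the unbiased evaluation set.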
The basic process for most supervised machine learning problems is to divide the dataset into a training set and test set and then train a model on the training set and evaluate its performance on the test set. But in many (most) settings, disease diagnosis for example, more data will be available in the future. How can I use this to improve upon the model? Do I need to retrain from scratch? When might be the appropriate time to retrain if this is the case (e.g., a specific percent of additional data points)?
Let’s take the example of predicting house prices. House prices change all the time. A model trained on data from six months ago could produce terrible predictions today. For house prices, it’s imperative that you have up-to-date information to train your models.
When designing a machine learning system it is important to understand how your data is going to change over time. A well-architected system should take this into account, and a plan should be put in place for keeping your models updated.
Manual retraining
One way to maintain models with fresh data is to train and deploy your models using the same process you used to build your models in the first place. As you can imagine this process can be time-consuming. How often do you retrain your models? Weekly? Daily? There is a balance between cost and benefit. Costs in model retraining include:
Computational Costs
Labor Costs
Implementation Costs
On the other hand, as you are manually retraining your models you may discover a new algorithm or a different set of features that provide improved accuracy.
Continuous learning
Another way to keep your models up-to-date is to have an automated system to continuously evaluate and retrain your models. This type of system is often referred to as continuous learning, and may look something like this:
Save new training data as you receive it. For example, if you are receiving updated prices of houses on the market, save that information to a database.
When you have enough new data, evaluate your model's accuracy on it.
If you see the accuracy of your model degrading over time, use the new data, or a combination of the new data and old training data to build and deploy a new model.
The benefit to a continuous learning system is that it can be completely automated.
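A rough sketch of the evaluation-and-retrain step in such a system (the model type, the MAE metric, and the 10% degradation tolerance are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def maybe_retrain(model, X_old, y_old, X_new, y_new, tolerance=1.10):
    """Rebuild the model only if its error on fresh data is >10% worse than before."""
    baseline_err = mean_absolute_error(y_old, model.predict(X_old))
    fresh_err = mean_absolute_error(y_new, model.predict(X_new))
    if fresh_err > tolerance * baseline_err:      # accuracy is degrading over time
        X = np.vstack([X_old, X_new])             # combine old and new training data
        y = np.concatenate([y_old, y_new])
        model = LinearRegression().fit(X, y)      # build (and later deploy) a new model
    return model
```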
I just have a general question:
In a previous job, I was tasked with building a series of non-linear models to quantify the impact of certain factors on the number of medical claims filed. We had a set of variables we would use in all models (eg: state, year, Sex, etc.). We used all of our data to build these models; meaning we never split the data into training and test data sets.
If I were to go back in time to this job and split the data into training and test sets, what would the advantages of that approach be, besides assessing the prediction accuracy of our models? What is an argument for not splitting the data before fitting the model? I never really thought about it much until now, and I'm curious why we didn't take that approach.
Thanks!
The sole purpose of setting aside a test set is to assess prediction accuracy. However, there is more to this than just checking the number and thinking "huh, that's how my model performs"!
Knowing how your model performs at a given moment gives you an important benchmark for potential improvements of the model. How will you know otherwise whether adding a feature increases model performance? Moreover, how do you know otherwise whether your model is at all better than mere random guessing? Sometimes, extremely simple models outperform the more complex ones.
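For example, a quick comparison against a trivial baseline on a held-out test set could look like this (the synthetic data and the choice of classifier are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)  # "mere guessing"

print("model accuracy:   ", accuracy_score(y_test, model.predict(X_test)))
print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
```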
Another thing is removal of features or observations. This depends a bit on the kind of models you use, but some models (e.g., k-Nearest-Neighbors) perform significantly better if you remove unimportant features from the data. Similarly, suppose you add more training data and suddenly your model's test performance drops significantly. Perhaps there is something wrong with the new observations? You should be aware of these things.
The only argument I can think of for not using a test set is that otherwise you'd have too little training data for the model to perform optimally.
I am new to the field of computer vision, so I apologise if the question is inappropriate in any way.
I have created a segmentation model using the PascalVOC 2012 data set, and so far I have only been able to test it on the train and val data sets. Now I want to test my model on the test set; however, it does not provide any annotations, so I am unsure what I could do to measure my model's performance on the test data.
I have noticed that other data sets, such as COCO, do not provide the annotations for the test data.
I am curious how researchers who have published papers on models trained on these data sets have evaluated them on the test data in such a case, and what I can do in order to do the same.
Thanks in advance for your help!
The main reason why many of the major datasets do not release the test set annotations is to avoid people reporting unreliable results due to overfitting to the test set.
For model selection and "informal" evaluation, you should split your training set into train and validation splits, and evaluate on the latter while training only on the former.
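For instance, assuming the standard VOCdevkit directory layout (the path and the 90/10 ratio are my assumptions), you could carve a validation split out of the official training image IDs like this:

```python
import random

# Assumes the standard VOCdevkit layout; adjust the path to your setup.
with open("VOCdevkit/VOC2012/ImageSets/Segmentation/train.txt") as f:
    ids = [line.strip() for line in f if line.strip()]

random.seed(0)
random.shuffle(ids)
cut = int(0.9 * len(ids))
train_ids, val_ids = ids[:cut], ids[cut:]   # train on the 90%, validate on the 10%
```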
So how is it that researchers report results on the test set in their papers?
Once you have a definitive model you want to evaluate, you can upload your results to an evaluation server; in this way you can benchmark yourself w.r.t. the state-of-the-art without having explicit access to the test set.
Examples:
For the COCO dataset, you can find the guidelines here on how to upload your results (for each task).
For the Cityscapes dataset, you can submit your results through this page.
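These servers expect predictions in a fixed file format defined by each benchmark. As a hedged sketch, COCO-style detection results are submitted as a single JSON file along these lines (the IDs, boxes, and scores below are made up; check the task-specific guidelines for the exact schema and file naming):

```python
import json

# Hypothetical detections: one dict per predicted object, in COCO's results format.
results = [
    {"image_id": 42, "category_id": 18, "bbox": [258.2, 41.3, 348.3, 243.6], "score": 0.93},
    {"image_id": 73, "category_id": 1,  "bbox": [61.0, 22.8, 504.0, 609.7], "score": 0.87},
]

with open("results.json", "w") as f:
    json.dump(results, f)   # this file is what gets uploaded to the evaluation server
```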
As a side note: VOC2012 is pretty old, so maybe you can also find the test set if you really need it. Take a look at this mirror by Joseph Redmon for example.
The machine learning model I downloaded from Apple's website can be used to recognise lots of different objects in images.
Say I need to detect all different kinds of trees. Some can be recognised by the model whereas others cannot. How would I know how many kinds of trees this model was trained on? In other words, how would I find out all the tree classes this model can recognise? Can I decode the model, or parse it, or something?
These models are trained on the ImageNet dataset. Here is a list of classes that these models can detect: https://github.com/HoldenCaulfieldRye/caffe/blob/master/data/ilsvrc12/synset_words.txt
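If you just want to see which of those 1000 ImageNet classes mention trees, a small script can download and filter that list (the raw-file URL below is derived from the GitHub link above; the keyword filter is only a rough heuristic):

```python
import urllib.request

URL = ("https://raw.githubusercontent.com/HoldenCaulfieldRye/caffe/"
       "master/data/ilsvrc12/synset_words.txt")

with urllib.request.urlopen(URL) as resp:
    classes = resp.read().decode("utf-8").splitlines()

print(len(classes), "ImageNet classes in total")
tree_classes = [c for c in classes if "tree" in c.lower()]   # rough keyword match
print(len(tree_classes), "class names mention 'tree':")
print("\n".join(tree_classes))
```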