I have been learning ML with TensorFlow for a few weeks, following the tutorials on the TensorFlow website (here). I started training the model and it has been running on a system with the following specifications (the snapshot was taken before training started, which is why it shows minimal usage).
It has completed more than 200,000 steps, so how much longer should it keep running, or is there anything I am missing here?
Also, I found a similar question on the forum here. I could not find any reference on the TensorFlow website saying that you have to terminate training yourself once you reach the desired loss. Even if that is the case, how do you determine the loss value at which you can stop training?
As of now, there is no fixed loss value at which you can say you have the best model. It depends on the training samples and on how simple or complex the task is to learn. Sometimes 200k steps are more than enough and sometimes they are not. Too many iterations cause the model to over-fit, too few cause it to under-fit. What you can do is use a validation set and a test set to evaluate the model as training progresses.
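If it helps, here is a minimal Keras sketch of that idea (it is not the training script from the tutorial, and the data below is just a random placeholder): instead of running for a fixed number of steps, training stops once the validation loss stops improving.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in data; replace with your own dataset.
x = np.random.rand(1000, 10).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Stop once the validation loss has not improved for 5 epochs,
# and keep the weights from the best epoch.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```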
I am currently comparing several NLU configurations in the Rasa framework.
I would like to know what the it in it/s means in the log output while training or evaluating a configuration in Rasa. Iterations, maybe? What exactly does that figure tell us?
Example from a recent log (metric used at the end of the line):
2020-11-22 17:04:37 INFO rasa.nlu.test - Running model for predictions: 100%|██| 83/83 [00:00<00:00, 182.72it/s]
I could not find anything by googling and searching the forums.
I have also asked the same question in the Rasa forums before, but there seems to be little to no activity there.
it/s is not a (Rasa- or ML-specific) metric; it is just the number of iterations per second performed by the system (here, during the prediction phase).
As a general rule, you may want to keep in mind that nothing really important (and certainly not any metric) is expected to be reported through an INFO message in the logs; it/s is just such an informational indication and, as already said, not a metric.
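For what it's worth, the 83/83 [00:00<00:00, 182.72it/s] format matches the default progress bar of the tqdm library, which Rasa appears to use for its progress output. A tiny reproduction (the loop body is just a placeholder delay):

```python
import time

from tqdm import tqdm

# 83 "iterations", like the 83 prediction batches in the log line above.
for _ in tqdm(range(83)):
    time.sleep(0.005)  # prints something like: 100%|██| 83/83 [00:00<00:00, ~190it/s]
```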
I am a beginner in the neural network field and I want to understand a certain statement. A friend said that a neural network gets slower after you feed a lot of data into it.
So far, I have just done the Coursera ML course by Andrew Ng, where I implemented backpropagation. As I understood it, backpropagation simply adapts the model towards the expected output using various calculations; it did not look like any training history was used to adapt the model. Only the current state of the neurons was checked, and their weights were adapted backwards in combination with regularisation.
Is my assumption correct, or am I wrong? Are there libraries that use historical data in a way that could result in a more slowly adapting model after a certain amount of training?
I want to use a simple neural network for reinforcement learning, and I want to get an idea of whether I need to reset my model if the target environment changes for some reason. Otherwise, my model would adapt more and more slowly over time.
Thanks in advance for any links and explanations!
As you said, neural networks adapt by modifying their weights during the backpropagation step. Modifying these weights does not get slower as training goes on, since the number of operations needed to update them always stays the same. The cost of running one example through your model also stays the same, so the network does not slow down according to how many examples it was fed during training.
However, you can decide to change your learning rate during training (generally decreasing it as the epochs go on). Depending on how the learning rate of your model evolves, the weights will be modified in a different manner, generally resulting in smaller changes in each successive epoch.
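As an illustration of that last point, here is a small sketch of a decaying learning rate in Keras (the numbers are arbitrary). Note that the schedule changes how far each update moves the weights, not how long an update takes:

```python
import tensorflow as tf

# Multiply the learning rate by 0.9 every 1,000 optimizer steps.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=1_000,
    decay_rate=0.9)

optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)

# The step count, not the amount of data already seen, drives the decay.
print(float(lr_schedule(0)), float(lr_schedule(5_000)))  # 0.01  ~0.0059
```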
I was asked in an interview to solve a use case with the help of machine learning. I have to use a machine learning algorithm to identify fraud from transactions. My training dataset has, let's say, 100,200 transactions, out of which 100,000 are legal transactions and 200 are fraudulent.
I cannot use the dataset as a whole to make the model because it would be a biased dataset and the model would be a very bad one.
Let's say, for example, that I take a sample of 200 good transactions that represent the good transactions well, plus the 200 fraudulent ones, and build the model using this as the training data.
The question I was asked was how I would scale up from the 200 sampled good transactions to the whole set of 100,000 good records, so that my results can be mapped to all types of transactions. I have never solved this kind of scenario, so I did not know how to approach it.
Any kind of guidance as to how I can go about it would be helpful.
This is a general question thrown at you in an interview. The information about the problem is succinct and vague (we don't know, for example, the number of features!). The first thing you need to ask yourself is: what does the interviewer want me to answer? Based on this context, the answer has to be formulated in a similarly general way. This means we don't have to find 'the solution', but instead give arguments that show we actually know how to approach the problem.
The problem we are presented with is that the minority class (fraud) is only ~0.2% of the total. This is obviously a huge imbalance. A predictor that simply labelled all cases as 'non fraud' would get a classification accuracy of 99.8%! So something definitely has to be done.
We will define our main task as a binary classification problem where we want to predict whether a transaction is labelled as positive (fraud) or negative (not fraud).
The first step would be to consider what techniques we have available to reduce the imbalance. This can be done either by reducing the majority class (undersampling) or by increasing the number of minority samples (oversampling). Both have drawbacks, though: the first implies a potentially severe loss of useful information from the dataset, while the second can lead to overfitting. Techniques such as SMOTE and ADASYN mitigate this by generating new synthetic minority samples with more variety than plain duplication.
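As a hedged illustration of the oversampling route, using the imbalanced-learn package on synthetic data shaped like the 100,000 / 200 split (not the actual interview dataset):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for 100,000 legal and ~200 fraudulent transactions.
X, y = make_classification(n_samples=100_200, n_features=20,
                           weights=[0.998, 0.002], random_state=0)
print("before:", Counter(y))

# SMOTE generates new synthetic minority samples instead of plain duplicates.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # both classes now have ~100,000 samples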
Of course, cross-validation becomes paramount in this case. Additionally, if we do end up oversampling, it has to be 'coordinated' with the cross-validation approach (i.e. resampling only within each training fold) to make sure we get the most out of both ideas. Check http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation for more details.
Apart from these sampling ideas, when selecting our learner, many ML methods can be trained/optimised for specific metrics. In our case, we definitely do not want to optimise accuracy. Instead, we want to train the model to optimise either ROC-AUC or, more specifically, to aim for high recall even at the cost of some precision, because we want to catch all the positive 'frauds', or at least raise an alarm even though some will prove to be false alarms. Models can adapt internal parameters (thresholds) to find the optimal balance between these two metrics. Have a look at this nice blog for more info about metrics: https://www.analyticsvidhya.com/blog/2016/02/7-important-model-evaluation-error-metrics/
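Putting the last two points together, a rough sketch of how this could look with scikit-learn and imbalanced-learn: the oversampling happens inside each cross-validation fold, and the model is scored on ROC-AUC, recall and precision rather than accuracy. The data and the learner here are illustrative choices, not 'the' answer.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the imbalanced transaction data.
X, y = make_classification(n_samples=100_200, n_features=20,
                           weights=[0.998, 0.002], random_state=0)

# The pipeline ensures SMOTE is fitted only on the training part of each fold.
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_validate(pipe, X, y, cv=5,
                        scoring=["roc_auc", "recall", "precision"])
print({k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test_")})
```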
Finally, it is only a matter of evaluating the model empirically to check which options and parameters are the most suitable for the given dataset. Following these ideas does not guarantee 100% that we will be able to tackle the problem at hand, but it puts us in a much better position to learn from the data and to get rid of those evil fraudsters out there, while perhaps landing a nice job along the way ;)
In this problem you want to classify transactions as good or fraudulent. However, your data is really imbalanced, so you will probably be interested in anomaly detection. I will let you read the whole article for more details, but I will quote a few parts in my answer.
I think this will convince you that this is what you are looking for to solve this problem:
Is it not just Classification?
The answer is yes if the following three conditions are met.
You have labeled training data.
Anomalous and normal classes are balanced (say at least 1:5).
Data is not autocorrelated (that is, one data point does not depend on earlier data points; this often breaks in time series data).
If all of the above is true, we do not need anomaly detection techniques and we can use an algorithm like Random Forests or Support Vector Machines (SVM).
However, often it is very hard to find training data, and even when you can find it, most anomalies are 1:1000 to 1:10^6 events where classes are not balanced.
Now to answer your question:
Generally, the class imbalance is solved using an ensemble built by resampling the data many times. The idea is to first create new datasets by taking all anomalous data points and adding a subset of normal data points (e.g. about 4 times as many as the anomalous data points). Then a classifier is built for each dataset using SVM or Random Forest, and those classifiers are combined using ensemble learning. This approach has worked well and produced very good results.
If the data points are autocorrelated with each other, then simple classifiers would not work well. We handle those use cases using time series classification techniques or recurrent neural networks.
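A rough sketch of that resampling-ensemble idea, written against scikit-learn on synthetic data (my own illustration, not code from the quoted article): each classifier sees all the fraud cases plus a fresh random subset of legal ones, and their predicted probabilities are averaged.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the transaction data (about 0.2% fraud).
X, y = make_classification(n_samples=100_200, n_features=20,
                           weights=[0.998, 0.002], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rng = np.random.default_rng(0)
fraud_idx = np.flatnonzero(y_tr == 1)
legal_idx = np.flatnonzero(y_tr == 0)

probas = []
for i in range(10):
    # All fraud samples plus a fresh random subset of legal ones (ratio ~1:4).
    subset = np.concatenate([
        fraud_idx,
        rng.choice(legal_idx, size=4 * len(fraud_idx), replace=False),
    ])
    clf = RandomForestClassifier(n_estimators=100, random_state=i)
    clf.fit(X_tr[subset], y_tr[subset])
    probas.append(clf.predict_proba(X_te)[:, 1])

# Average the member probabilities to get the ensemble's fraud score.
print("ensemble ROC-AUC:", roc_auc_score(y_te, np.mean(probas, axis=0)))
```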
I would also suggest another approach to the problem. In this article the author says:
If you do not have training data, still it is possible to do anomaly detection using unsupervised learning and semi-supervised learning. However, after building the model, you will have no idea how well it is doing as you have nothing to test it against. Hence, the results of those methods need to be tested in the field before placing them in the critical path.
However, you do have a few fraud examples to test whether your unsupervised algorithm is doing well or not, and if it does a good enough job, it can be a first solution that will help gather more data to train a supervised classifier later.
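To make that concrete, a small hedged sketch with scikit-learn's IsolationForest on synthetic data: the detector is fitted without labels, and the few known fraud cases are only used afterwards to check whether the anomaly scores rank them highly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the transaction table (labels unused during fitting).
X, y = make_classification(n_samples=100_200, n_features=20,
                           weights=[0.998, 0.002], random_state=0)

iso = IsolationForest(contamination=0.002, random_state=0).fit(X)

# score_samples: lower means more anomalous, so negate to rank likely fraud first.
anomaly_score = -iso.score_samples(X)
print("ROC-AUC against the known labels:", roc_auc_score(y, anomaly_score))
```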
Note that I am not an expert and this is just what I've come up with after mixing my knowledge and some articles I read recently on the subject.
For more questions about machine learning, I suggest you use this Stack Exchange community.
I hope it will help you :)
I am working with the machine learning workbench LightSide for my MA thesis. I have successfully trained some models, and now I would like to use a trained model to predict on new data. However, when I try to do so, the system stops after a few seconds with the pop-up message "prediction has been stopped" and no hint as to why. It happens with different data sets as well as with different algorithms used for training...
Has anyone encountered the same problem and found a solution for it?
Thank you for your help :)
Edit: I tried to export the feature table to WEKA and train models there, but WEKA gets lost in an endless training loop; I assume it has to do with the built-in UNIGRAM feature I use from LightSide. But I am still no closer to predicting on new data...
Edit II: LightSide throws an error saying that one feature is not part of the model, when in fact it is.
If you have a huge feature space, then LightSIDE might end up saying the prediction has stopped. This happened to me when I tried to use a column (e.g. problem_name) that had too many different values as one of the features (during the feature extraction phase). I ended up computing another column (e.g. problem_difficulty, derived from problem_name) that had fewer values, and LightSIDE could then predict new data successfully.
Thanks to Dr. Rose for the solution.
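For what it's worth, the same workaround can be done outside LightSIDE, e.g. in pandas, before loading the feature table: collapse the high-cardinality column into a coarser one. The column names follow the example above and the mapping is made up.

```python
import pandas as pd

# Toy feature table with a high-cardinality column.
df = pd.DataFrame({"problem_name": ["p1", "p2", "p3", "p1", "p4"]})

# Hypothetical mapping from many problem names to a few difficulty levels.
difficulty_of = {"p1": "easy", "p2": "easy", "p3": "medium", "p4": "hard"}
df["problem_difficulty"] = df["problem_name"].map(difficulty_of)

# Keep only the low-cardinality feature for training/prediction.
df = df.drop(columns=["problem_name"])
print(df["problem_difficulty"].value_counts())
```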
I'm training a neural network in TensorFlow (using tflearn) on data that I generate. From what I can tell, in each epoch we use all of the training data. Since I can control how many examples I have, it seems like it would be best to just generate more training data until one epoch is enough to train the network.
So my question is: Is there any downside to only using one epoch, assuming I have enough training data? Am I correct in assuming that 1 epoch of a million examples is better than 10 epochs of 100,000?
Following a discussion with @Prune:
Suppose you have the possibility to generate an infinite number of labeled examples, sampled from a fixed underlying probability distribution, i.e. from the same manifold.
The more examples the network sees, the better it will learn, and especially the better it will generalize. Ideally, if you train it long enough, it could reach 100% accuracy on this specific task.
The conclusion is that only running 1 epoch is fine, as long as the examples are sampled from the same distribution.
The limitations to this strategy could be:
if you need to store the generated examples, you might run out of memory
to handle unbalanced classes (cf. @jorgemf's answer), you just need to sample the same number of examples for each class
e.g. if you have two classes, with a 10% chance of sampling the first one, you should create batches with a 50% / 50% distribution (see the sketch after this list)
it's possible that running multiple epochs might make it learn some uncommon cases better.
I disagree: reusing the same example multiple times is always worse than generating new, unseen examples. However, you might want to generate harder and harder examples over time to make your network better on uncommon cases.
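A small sketch of the 50% / 50% batch idea mentioned in the list above (plain NumPy, names are illustrative): every batch draws the same number of freshly generated examples from each class, regardless of how rare the class is in the real distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_example(label):
    """Stand-in for your data generator: returns one synthetic feature vector."""
    return rng.normal(loc=label, scale=1.0, size=8)

def balanced_batch(batch_size=32, n_classes=2):
    per_class = batch_size // n_classes
    xs, ys = [], []
    for label in range(n_classes):
        xs.extend(make_example(label) for _ in range(per_class))
        ys.extend([label] * per_class)
    return np.stack(xs), np.array(ys)

x, y = balanced_batch()
print(x.shape, np.bincount(y))  # (32, 8) [16 16]
```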
You need training examples in order to make the network learn. Usually you don't have enough examples to make the network converge in a single pass, so you need to run more than one epoch.
It is OK to use only one epoch if you have a very large number of examples and they cover the data well. But if you have 100 classes and some of them have only very few examples, you are not going to learn those classes with one epoch alone. So you need balanced classes.
Moreover, it is a good idea to have a variable learning rate that decreases with the number of examples seen, so the network can fine-tune itself. It starts with a high learning rate and then decreases over time; if you only run for one epoch, you need to bear this in mind when tweaking the graph.
My suggestion is to run more than one epoch, mostly because the more examples you have, the more memory you need to store them. But if memory is fine and the learning rate is adjusted based on the number of examples rather than epochs, then it is fine to run one epoch.
Edit: I am assuming you are using a learning algorithm which updates the weights of the network every batch or similar.
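To tie the two answers together, here is a hedged sketch (an assumed setup, not the asker's tflearn code) of streaming freshly generated examples through tf.data: nothing needs to be stored in memory, the weights are still updated batch by batch, and "one epoch" simply means a chosen number of fresh batches.

```python
import numpy as np
import tensorflow as tf

def generate_examples():
    # Endless stream of newly generated samples with a toy labelling rule.
    while True:
        x = np.random.rand(10).astype("float32")
        y = np.float32(x.sum() > 5.0)
        yield x, y

ds = tf.data.Dataset.from_generator(
    generate_examples,
    output_signature=(tf.TensorSpec(shape=(10,), dtype=tf.float32),
                      tf.TensorSpec(shape=(), dtype=tf.float32)),
).batch(64)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# "One epoch" here is just 500 fresh batches drawn from the generator.
model.fit(ds, steps_per_epoch=500, epochs=1)
```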