I'm using model.predict() to make predictions batch by batch.
By doing so, I find that Keras seems to normalize the input on a batch basis, which means that if I combine the results they become meaningless: they are not normalized together, and the probabilities vary a lot from batch to batch.
So what should I do to tackle this? Loading all the data into one batch is not an option, as the dataset is too large. Should I pass in one sample at a time, at the cost of performance?
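Here is a minimal sketch of the loop I mean (the model and data below are just placeholder stand-ins for my real ones):

    import numpy as np
    from tensorflow import keras

    # Placeholder stand-ins for my real trained model and input data.
    model = keras.Sequential([keras.Input(shape=(20,)),
                              keras.layers.Dense(3, activation="softmax")])
    data = np.random.rand(10000, 20).astype("float32")

    batch_size = 256
    outputs = []
    for start in range(0, len(data), batch_size):
        outputs.append(model.predict(data[start:start + batch_size], verbose=0))

    predictions = np.concatenate(outputs, axis=0)  # combined results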
If a decision tree determines splits based on the largest amount of data belonging to the same class, why can't it keep splitting this particular data until each split contains only 1 element,
which would lead to 100% precision?
Having only 1 data point/case in every terminal node causes over-fitting on the training dataset. To guard against this, test the constructed model using a summary statistic (e.g. RMSE) on both the training and validation datasets. In a random forest, the 'out-of-bag' sample can be used as a validation set: this is the portion of the data (roughly 37%) that is not used in the construction of each tree. The RMSE should be relatively similar between the training and validation sets.
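As an illustration, scikit-learn's random forest can report this out-of-bag score directly. A minimal sketch on synthetic data (the dataset and hyperparameters are placeholder choices, not a recommendation):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=1000, n_features=10, noise=10.0,
                           random_state=0)

    # oob_score=True evaluates each tree on the ~37% of samples it never saw.
    rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X, y)

    print("Train R^2:", rf.score(X, y))   # typically near 1.0 (over-fit)
    print("OOB R^2:  ", rf.oob_score_)    # honest estimate on unseen data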
I am new to TensorFlow and machine learning and have come across the concept of a batch.
What is the purpose of splitting datasets into batches, and how does TensorFlow perform an optimization task on the variables using the different subsets?
You are confusing a few things, as far as I understand.
First, you need to split the dataset into two (or more) distinct sets. One is the set that you train your system on, and the second is used to test your model.
These are the basics of ML, and you can easily find more on the internet. Look for "cross-validation" or "train, validation, test sets".
A batch is something that is usually important for Neural Networks (NN). You do not use one example at each training step (then the algorithm would be called Stochastic Gradient Descent), nor every example at each training step (that would be Batch Gradient Descent). Usually it is best to train an NN using mini-batches (Mini-batch Gradient Descent).
It is a trade-off in optimization between accuracy and training speed.
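To make the distinction concrete, here is a minimal NumPy sketch of Mini-batch Gradient Descent on a toy linear model (all the numbers are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                  # 1000 examples, 5 features
    true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
    y = X @ true_w + rng.normal(scale=0.1, size=1000)

    w = np.zeros(5)
    lr, batch_size = 0.1, 32

    for epoch in range(20):
        order = rng.permutation(len(X))             # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            b = order[start:start + batch_size]     # indices of one mini-batch
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad                          # one small step per batch

    print(w)  # close to true_w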
TensorFlow is just a library for NNs. You can easily find how sets and batches are split in many tutorials. Remember to learn the basic concepts first, for example in this great class:
https://www.coursera.org/specializations/deep-learning
The purpose of splitting the dataset into batches is typically to speed up learning. Instead of processing the entire set of training examples at once, it processes only a batch, a small subset of the entire training set, at a time. There are various techniques for how to form such batches and process them to obtain the final trained model.
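As an illustration, in TensorFlow this batching is typically expressed with the tf.data pipeline; a minimal sketch on placeholder arrays:

    import numpy as np
    import tensorflow as tf

    # Placeholder arrays standing in for real training data.
    features = np.random.rand(1000, 5).astype("float32")
    labels = np.random.randint(0, 2, size=1000)

    # Shuffle the full set, then emit mini-batches of 32 examples.
    dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
               .shuffle(buffer_size=1000)
               .batch(32))

    for batch_x, batch_y in dataset.take(1):
        print(batch_x.shape, batch_y.shape)  # (32, 5) (32,)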
I have time series data of size 100000×5: 100,000 samples and five variables. I have labeled each of the 100,000 samples as either 0 or 1, i.e. binary classification.
I want to train it using an LSTM because of the time series nature of the data. I have seen examples of LSTMs for time series prediction; is it suitable to use one in my case?
I'm not sure about your needs.
An LSTM is best suited to sequence models, like the time series you mention, but your description doesn't sound like a time series.
In any case, you can use an LSTM on time series not for prediction but for classification, as in this article.
In my experience, for binary classification with only 5 features you could find better methods; an LSTM will consume more memory than other methods and could get worse results.
First of all, you can see it from a different perspective: instead of having 100,000 labeled samples of 5 variables, you can treat it as 100,000 unlabeled samples of 6 variables, where the 6th variable is the label.
You can therefore train your LSTM as a multivariate predictor of the 6th variable, i.e. the sample label, and compare its output against the ground truth during testing to evaluate performance.
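One way to realize this: feed windows of the 5 variables to an LSTM and have it output the 6th variable (the label). A minimal Keras sketch, where the window length, layer sizes, and random data are placeholder choices:

    import numpy as np
    from tensorflow import keras

    timesteps, n_features = 50, 5
    # Placeholder windows of the series; y is the 6th variable (the 0/1 label).
    X = np.random.rand(200, timesteps, n_features).astype("float32")
    y = np.random.randint(0, 2, size=200)

    model = keras.Sequential([
        keras.Input(shape=(timesteps, n_features)),
        keras.layers.LSTM(32),                       # summarizes each window
        keras.layers.Dense(1, activation="sigmoid"), # outputs the 6th variable
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(X, y, epochs=2, batch_size=32, verbose=0)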
I'm currently trying to implement multi-GPU training with a TensorFlow network. One solution for this would be to run one model per GPU, each with its own data batches, and combine the weights after each training iteration; in other words, "data parallelism".
So, for example, if I use 2 GPUs, train with them in parallel, and combine their weights afterwards, shouldn't the resulting weights be different compared to training on those two data batches in sequence on one GPU? Both GPUs start from the same input weights, whereas the single GPU has already-modified weights when it processes the second batch.
Is this difference just marginal, and therefore not relevant for the end result after many iterations?
The order of the batches fed into training makes some difference, but the difference may be small if you have a large number of batches. Each batch pulls the variables in the model a bit towards the minimum of the loss, and a different order may make the path towards the minimum a bit different. But as long as the loss is decreasing, your model is training and its evaluation gets better and better.
Sometimes, to prevent the same batches from always pulling the model in the same direction, and to avoid it becoming good only on some of the input data, the input for each model replica is randomly shuffled before being fed into the training program.
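Here is a toy sketch of the comparison on a quadratic loss (everything is illustrative): two replicas each take one step from the same starting weights and are averaged, versus one worker applying the same two batches in sequence.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 3))
    y = X @ np.array([1.0, -1.0, 2.0])
    w0, lr = np.zeros(3), 0.05

    def grad(w, Xb, yb):
        # Gradient of the mean squared error of a linear model.
        return 2 * Xb.T @ (Xb @ w - yb) / len(Xb)

    batch1, batch2 = (X[:32], y[:32]), (X[32:], y[32:])

    # Data parallelism: both replicas start from w0; weights are averaged.
    w_parallel = ((w0 - lr * grad(w0, *batch1))
                  + (w0 - lr * grad(w0, *batch2))) / 2

    # Sequential: the second batch sees weights already updated by the first.
    w_seq = w0 - lr * grad(w0, *batch1)
    w_seq = w_seq - lr * grad(w_seq, *batch2)

    # The results differ, but both moves reduce the loss; over many
    # iterations both schemes keep descending towards the minimum.
    print(w_parallel, w_seq)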
I am considering using a random forest for a classification problem. The data comes in sequences. I plan to use the first N (500) to train the classifier, then use the classifier to classify the data that comes after. It will make mistakes, and those mistakes can sometimes be recorded.
My question is: can I use that misclassified data to retrain the original classifier, and how? If I simply add the misclassified samples to the original training set of size N, their importance will be exaggerated, because the correctly classified ones are ignored. Do I have to retrain the classifier using all the data? What other classifiers can do this kind of learning?
What you describe is a basic version of the Boosting meta-algorithm.
It works better if your underlying learner has a natural way to handle sample weights. I have not tried boosting random forests (boosting is generally used on individual shallow decision trees with a depth limit between 1 and 3); it might work, but it will likely be very CPU intensive.
Alternatively, you can train several independent boosted decision stumps in parallel with different PRNG seed values and then aggregate the final decision function as you would with a random forest (e.g. voting or averaging class probability assignments).
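For illustration, boosted decision stumps in scikit-learn look roughly like this (synthetic data; note that the estimator argument is named base_estimator in scikit-learn releases before 1.2):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Boost shallow trees (stumps); sample weights are re-adjusted so that
    # later trees focus on the examples earlier trees got wrong.
    clf = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=200,
        random_state=0,
    )
    clf.fit(X[:500], y[:500])           # train on the first 500, as in the question
    print(clf.score(X[500:], y[500:]))  # evaluate on the rest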
If you are using Python, you should have a look at the scikit-learn documentation on the topic.
Disclaimer: I am a scikit-learn contributor.
Here is my understanding of your problem.
You have a dataset and create two subsets from it, say a training dataset and an evaluation dataset. How can you use the evaluation dataset to improve classification performance?
The point of this problem isn't to find a better classifier, but to find a good way to evaluate, and then to have a good classifier in the production environment.
Evaluation purpose
As the evaluation dataset has been set aside for evaluation, there is no way to do this. You must use another approach for training and evaluation.
A common way to do this is cross-validation:
Randomize the samples in your dataset and create ten partitions from the initial dataset. Then do ten iterations of the following:
take all partitions except the n-th for training, and do the evaluation with the n-th.
After this, take the median of the errors of the ten runs.
This will give you the error rate of your classifier.
The worst run gives you the worst case.
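In scikit-learn, this whole procedure amounts to one call; a sketch on placeholder data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, random_state=0)

    # 10-fold cross-validation: train on 9 partitions, evaluate on the 10th,
    # rotating the held-out partition each time.
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)

    print("median accuracy:", np.median(scores))
    print("worst fold:     ", scores.min())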
Production purpose
(no more evaluation)
You no longer care about evaluation. So take all the samples from your whole dataset and give them to your classifier for training (re-run a complete, simple training). The result can be used in the production environment, but it can no longer be evaluated with any of your data. You can expect it to do at least as well as the worst case observed across the previous partitions.
Flow sample processing
(production or learning)
When you are in a flow where new samples are produced over time, you will face cases where some samples correct earlier errors. This is the desired behavior, because we want the system to improve itself. But if you just patch the erroneous leaves in place, after some time your classifier will have nothing in common with the original random forest; you will be doing a form of greedy learning, like a meta tabu search. Clearly we don't want this.
If we try to reprocess the whole dataset plus the new samples every time a new sample becomes available, we will suffer terrible latency. The solution is to do it the way humans do: every so often a background process runs (when the service is under low load) and all the data gets a complete re-training; at the end, swap the old and new classifiers.
Sometimes the idle window is too short for a complete re-training. In that case you have to distribute the computation across a cluster of nodes. That costs a lot of development effort, because you will probably need to rewrite the algorithms, but by then you already have the biggest computer you could find.
Note: the swap process is very important to master. You should already have it in your production plan. What do you do if you want to change algorithms? Backups? Benchmarks? Power cuts? etc.
I would simply add the new data and retrain the classifier periodically if it weren't too expensive.
A simple way to keep things in balance is to add weights.
If you weight all positive samples by 1/n_positive and all negative samples by 1/n_negative (including all the new negative samples you're getting), then you don't have to worry about the classifier getting out of balance.
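A minimal sketch of that weighting using scikit-learn's sample_weight argument (the dataset and model are placeholders):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Placeholder unbalanced dataset: ~90% negatives, ~10% positives.
    X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

    # Weight each class by the inverse of its frequency so both classes
    # contribute equally, however unbalanced the stream becomes.
    n_pos, n_neg = (y == 1).sum(), (y == 0).sum()
    sample_weight = np.where(y == 1, 1.0 / n_pos, 1.0 / n_neg)

    clf = RandomForestClassifier(random_state=0)
    clf.fit(X, y, sample_weight=sample_weight)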