So I've got some data from 700-some smart meters. Data from each meter includes electricity usage taken in intervals of 15 mins, outside temperature, humidity, if it's a national holiday...
The goal is to predict combined electricity usage of the users on the grid.
When I combine the data by summing all the electricity and train my model (normalization, some batching, 3 lstm layers with 512 nodes, some dropout, relu activation, adam optimizer, absolute loss, default lr) I get good results which I am happy with.
But when I do it in federated, with each user training on his private data, using the same model I did (server lr = 1.0, because its less confusing i think) i get really bad results.
Unsystematically I messed around with batch size, switching adam for SGD, changing learning rates, upping the epochs, changing the number of users calculating gradients in each round. Nothing seemed to work.
Should i just up the epochs in some order of magnitude? Do i have any theoretical assurance that there exists a set of parameters under which the same model that converged on the sum of data should should converge in federated?
It's more of a soft question, but i may post the code or the results if needed.
One thing that strikes me in the question is the mention of normalization. In TFF's default federated averaging implementation, only trainable variables are averaged across clients.
Keras' BatchNorm implementation uses non-trainable variables to track mean and standard deviation of the parameters across batches, so there is bad composition here. There has been success using TFF's GroupNorm implementation, however, as a drop-in replacement. Are you using BatchNorm? If so, I might recommend trying to simply swap for GroupNorm.
Related
I'm currently training a random forest on some data I have and I'm finding that the model performs better on the validation set, and even better on the test set, than on the train set. Here are some details of what I'm doing - please let me know if I've missed any important information and I will add it in.
My question
Am I doing anything obviously wrong and do you have any advice for how I should improve my approach because I just can't believe that I'm doing it right when my model predicts significantly better on unseen data than training data!
Data
My underlying data consists of tables of features describing customer behaviour and a binary target (so this is a binary classification problem). Technically I have one such table per month and I tend to use several months of data to train and then a different month to predict (e.g. Train on Apr, May and Predict on Jun)
Generally this means I end up with a training dataset of about 100k rows and 20 features (I've previously looked into feature selection and found a set of 7 features which seem to perform best, so have been using these lately). My prediction set generally has around 50k rows.
My dataset is heavily unbalanced (approximately 2% incidence of target feature), so I'm using oversampling techniques - more on that below.
Method
I've searched around online quite a lot and this has led me to the following approach:
Take scaleable (continuous) features in the training data and standardise them (currently using sklearn StandardScaler)
Take categorical features and encode them into separate binary columns (one-hot) using Pandas get_dummies function
Remove 10% of the training data to form a validation set (I'm currently using a random seed in this process for comparability whilst I vary different things such as hyperparameters in the model)
Take the remaining 90% of training data and perform a grid search across a few parameters of the RandomForestClassifier() (currently min_samples_split, max_depth, n_estimators and max_features)
Within each hyperparameter combination from the grid I perform kfold validation with 5 folds and using a random state
Within each fold I oversample my minority class for training data only (sometimes using imbalanced-learn's RandomOverSampler() and sometimes using SMOTE() from the same package), train the model on the training data and then apply the model to the kth fold and record performance metrics (precision, recall, F1 and AUC)
Once I've been through 5 folds on each hyperparameter combination I find the best F1 score (and best precision if two combinations are tied on F1 score) and retrain a random forest on the entire 90% training data using those hyperparameters. During this step I use the same oversampling technique as I did in the kfold process
I then use this model to make predictions on the 10% of training data that I put aside earlier as a validation set, evaluating the same metrics as above
Finally I have a test set, which is actually based on data from another month, which I apply the already trained model to and evaluate the same metrics
Outcome
At the moment I'm finding that my training set achieves an F1 score of around 30%, the validation set is consistently slightly higher than this at around 36% (mostly driven by a much better precision than the training data e.g. 60% vs. 30%) and then the testing set is getting an F1 score of between 45% and 50% which is again driven by a better precision (around 65%)
Notes
Please do ask about any details I haven't mentioned; I've had my stuck in this for weeks and so have doubtless omitted some details
I've had a brief look (not a systematic analysis) of the stability of metrics between folds in the kfold validation and it seems that they aren't varying very much, so I'm fairly happy with the stability of the model here
I'm actually performing the grid search manually rather than using a Python pipeline because try as I might I couldn't get imbalanced-learn's Pipeline function to work with the oversampling functions and so I run a loop with combinations of hyperparameters, but I'm confident that this isn't impacting the results I've talked about above in an adverse way
When I apply the final model to the prediction data (and get an F1 score around 45%) I also apply it back to the training data itself out of interest and get F1 scores around 90% - 100%. I suppose this is to be expected as the model is trained and predicts on almost exactly the same data (except the 10% holdout validation set)
I've trained a deep neural network of a few hundreds of features which analyzes geo data of a city, and calculate a score per sample based on the profile between the observer and the target location. That is, the longer the distance between the observer and target, the more features I will have for this sample. When I train my NN with samples from part of a city and test with other parts of the same city, the NN works very well, but when I apply my NN to other cities, the NN starts to give high standard deviation of errors, especially on cases which the samples of the city I'm applying the NN to generally has more features than samples of the city I used to train this NN. To deal with that, I've appended 10% of empty samples in training which was able to reduce the errors by half, but the remaining errors are still too large compare to the solutions calculated by hand. May I have some advise of generalize a regression neural network? Thanks!
I was going to ask for more examples of your data, and your network, but it wouldn't really matter.
How to improve the generalization of a regression neural network?
You can use exactly the same things you would use for a classification neural network. The only difference is what it does with the numbers that are output from the penultimate layer!
I've appended 10% of empty samples in training which was able to reduce the errors by half,
I didn't quite understand what that meant (so I'd still be interested if you expanded your question with some more concrete details), but it sounds a bit like using dropout. In Keras you append a Dropout() layer between your other layers:
...
model.append(Dense(...))
model.append(Dropout(0.2))
model.append(Dense(...))
...
0.2 means 20% dropout, which is a nice starting point: you could experiment with values up to about 0.5.
You could read the original paper or this article seems to be a good introduction with keras examples.
The other generic technique is to add some L1 and/or L2 regularization, here is the manual entry.
I typically use a grid search to experiment with each of these, e.g. trying each of 0, 1e-6, 1e-5 for each of L1 and L2, and each of 0, 0.2, 0.4 (usually using the same value between all layers, for simplicity) for dropout. (If 1e-5 is best, I might also experiment with 5e-4 and 1e-4.)
But, remember that even better than the above are more training data. Also consider using domain knowledge to add more data, or more features.
Based on data that our business department supplied to us, I used the sklearn decision tree algorithm to determine the ROC_AUC for a binary classification problem.
The data consists of 450 rows and there are 30 features in the data.
I used 10 times StratifiedKFold repetition/split of training and test data. As a result, I got the following ROC_AUC values:
0.624
0.594
0.522
0.623
0.585
0.656
0.629
0.719
0.589
0.589
0.592
As I am new in machine learning, I am unsure whether such a variation in the ROC_AUC values can be expected (with minimum values of 0.522 and maximum values of 0.719).
My questions are:
Is such a big variation to be expected?
Could it be reduced with more data (=rows)?
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Well, you do k-fold splits to actually evaluate how well your model generalizes.
Therefore, from your current results I would assume the following:
This is a difficult problem, the AUCs are usually low.
0.71 is an outlier, you were just lucky there (probably).
Important questions that will help us help you:
What is the proportion of the binary classes? Are they balanced?
What are the features? Are they all continuous? If categorical, are they ordinal or nominal?
Why Decision Tree? Have you tried other methods? Logistic Regression for instance is a good start before you move on to more advanced ML methods.
You should run more iterations, instead of k fold use the ShuffleSplit function and run at least 100 iterations, compute the Average AUC with 95% Confidence Intervals. That will give you a better idea of how well the models perform.
Hope this helps!
Is such a big variation to be expected?
This is a textbook case of high variance.
Depending on the difficulty of your problem, 405 training samples may not be enough for it to generalize properly, and the random forest may be too powerful.
Try adding some regularization, by limiting the number of splits that the trees are allowed to make. This should reduce the variance in your model, though you might expect a potentially lower average performance.
Could it be reduced with more data (=rows)?
Yes, adding data is the other popular way of lowering the variance of your model. If you're familiar with deep learning, you'll know that deep models usually need LOTS of samples to learn properly. That's because they are very powerful models with an intrinsically high variance, and therefore a lot of data is needed for them to generalize.
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Variance will decrease with regularization and adding data, it has no relation to the actual performance "number" that you get.
Cheers
I implement the ResNet for the cifar 10 in accordance with this document https://arxiv.org/pdf/1512.03385.pdf
But my accuracy is significantly different from the accuracy obtained in the document
My - 86%
Pcs daughter - 94%
What's my mistake?
https://github.com/slavaglaps/ResNet_cifar10
Your question is a little bit too generic, my opinion is that the network is over fitting to the training data set, as you can see the training loss is quite low, but after the epoch 50 the validation loss is not improving anymore.
I didn't read the paper in deep so I don't know how did they solved the problem but increasing regularization might help. The following link will point you in the right direction http://cs231n.github.io/neural-networks-3/
below I copied the summary of the text:
Summary
To train a Neural Network:
Gradient check your implementation with a small batch of data and be aware of the pitfalls.
As a sanity check, make sure your initial loss is reasonable, and that you can achieve 100% training accuracy on a very small portion of
the data
During training, monitor the loss, the training/validation accuracy, and if you’re feeling fancier, the magnitude of updates in relation to
parameter values (it should be ~1e-3), and when dealing with ConvNets,
the first-layer weights.
The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or
whenever the validation accuracy tops off.
Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges,
training only for 1-5 epochs), to fine (narrower rangers, training for
many more epochs)
Form model ensembles for extra performance
I would argue that the difference in data pre processing makes the difference in performance. He is using padding and random crops, which in essence increases the amount of training samples and decreases the generalization error. Also as the previous poster said you are missing regularization features, such as the weight decay.
You should take another look at the paper and make sure you implement everything like they did.
The Situation:
I am wondering how to use TensorFlow optimally when my training data is imbalanced in label distribution between 2 labels. For instance, suppose the MNIST tutorial is simplified to only distinguish between 1's and 0's, where all images available to us are either 1's or 0's. This is straightforward to train using the provided TensorFlow tutorials when we have roughly 50% of each type of image to train and test on. But what about the case where 90% of the images available in our data are 0's and only 10% are 1's? I observe that in this case, TensorFlow routinely predicts my entire test set to be 0's, achieving an accuracy of a meaningless 90%.
One strategy I have used to some success is to pick random batches for training that do have an even distribution of 0's and 1's. This approach ensures that I can still use all of my training data and produced decent results, with less than 90% accuracy, but a much more useful classifier. Since accuracy is somewhat useless to me in this case, my metric of choice is typically area under the ROC curve (AUROC), and this produces a result respectably higher than .50.
Questions:
(1) Is the strategy I have described an accepted or optimal way of training on imbalanced data, or is there one that might work better?
(2) Since the accuracy metric is not as useful in the case of imbalanced data, is there another metric that can be maximized by altering the cost function? I can certainly calculate AUROC post-training, but can I train in such a way as to maximize AUROC?
(3) Is there some other alteration I can make to my cost function to improve my results for imbalanced data? Currently, I am using a default suggestion given in TensorFlow tutorials:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
I have heard this may be possible by up-weighting the cost of miscategorizing the smaller label class, but I am unsure of how to do this.
(1)It's ok to use your strategy. I'm working with imbalanced data as well, which I try to use down-sampling and up-sampling methods first to make the training set even distributed. Or using ensemble method to train each classifier with an even distributed subset.
(2)I haven't seen any method to maximise the AUROC. My thought is that AUROC is based on true positive and false positive rate, which doesn't tell how well it works on each instance. Thus, it may not necessarily maximise the capability to separate the classes.
(3)Regarding weighting the cost by the ratio of class instances, it similar to Loss function for class imbalanced binary classifier in Tensor flow
and the answer.
Regarding imbalanced datasets, the first two methods that come to mind are (upweighting positive samples, sampling to achieve balanced batch distributions).
Upweighting positive samples
This refers to increasing the losses of misclassified positive samples when training on datasets that have much fewer positive samples. This incentivizes the ML algorithm to learn parameters that are better for positive samples. For binary classification, there is a simple API in tensorflow that achieves this. See (weighted_cross_entropy) referenced below
https://www.tensorflow.org/api_docs/python/tf/nn/weighted_cross_entropy_with_logits
Batch Sampling
This involves sampling the dataset so that each batch of training data has an even distribution positive samples to negative samples. This can be done using the rejections sampling API provided from tensorflow.
https://www.tensorflow.org/api_docs/python/tf/contrib/training/rejection_sample
I'm one who struggling with imbalanced data. What my strategy to counter imbalanced data are as below.
1) Use cost function calculating 0 and 1 labels at the same time like below.
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(_pred) + (1-y)*tf.log(1-_pred), reduction_indices=1))
2) Use SMOTE, oversampling method making number of 0 and 1 labels similar. Refer to here, http://comments.gmane.org/gmane.comp.python.scikit-learn/5278
Both strategy worked when I tried to make credit rating model.
Logistic regression is typical method to handle imbalanced data and binary classification such as predicting default rate. AUROC is one of the best metric to counter imbalanced data.
1) Yes. This is well received strategy to counter imbalanced data. But this strategy is good in Neural Nets only if you using SGD.
Another easy way to balance the training data is using weighted examples. Just amplify the per-instance loss by a larger weight/smaller when seeing imbalanced examples. If you use online gradient descent, it can be as simple as using a larger/smaller learning rate when seeing imbalanced examples.
Not sure about 2.