Why can averaging network models improve performance on the test set?


People often train several network models and then average them to improve the performance of the final network. I'd like to know why model averaging works. Is there a paper or explanation on this?
Dropout is also a form of model averaging, so why does dropout work?

People average models so that if any single model overfits the data, the combined average can still give a more general prediction. As long as the models do not all make the same mistakes, their individual errors partly cancel out in the average.
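The variance-reduction argument can be illustrated with a small simulation. This is only a sketch under a strong assumption: each "model" is a hypothetical stand-in (`noisy_model`) whose prediction is the true value plus independent Gaussian noise. Real networks trained on the same data have correlated errors, so the gain in practice is usually smaller than shown here.

```python
import random
import statistics

def noisy_model(true_value, noise_sd, rng):
    """Hypothetical stand-in for one trained network: truth plus its own error."""
    return true_value + rng.gauss(0.0, noise_sd)

def mean_squared_error(predictions, true_value):
    """Average squared distance between predictions and the true value."""
    return statistics.fmean((p - true_value) ** 2 for p in predictions)

rng = random.Random(0)
true_value = 1.0
n_models, n_trials, noise_sd = 5, 2000, 0.5  # illustrative values, not from the question

single, averaged = [], []
for _ in range(n_trials):
    preds = [noisy_model(true_value, noise_sd, rng) for _ in range(n_models)]
    single.append(preds[0])                  # prediction of one model alone
    averaged.append(statistics.fmean(preds)) # prediction of the model average

mse_single = mean_squared_error(single, true_value)
mse_average = mean_squared_error(averaged, true_value)
print(f"single model MSE:    {mse_single:.3f}")
print(f"5-model average MSE: {mse_average:.3f}")
```

With independent errors of equal variance, averaging n models divides the error variance by roughly n, which is why the averaged prediction comes out closer to the truth in this toy setting.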

Related

How to properly score a DNN model during hyperparameter tuning?

I'm using a grid search to tune the hyperparameters of my DNN, which is 2 layers deep. I'm currently scoring each model based on the average loss on the test set, but I'm not sure if this is the best approach. Would it be better to use the accuracy, or both the loss and accuracy, as a scoring metric? How do other people typically score their models during hyperparameter tuning? Any advice or insights would be greatly appreciated.

The first problem in your experimental setup is that you are using the test set for hyperparameter tuning. You should train your model on the training set and tune the hyperparameters on a validation set. Only after this process is finished should you use the test set to get the final model score; that is the correct way to split and use the dataset.
The second part of your question is very open-ended, but you may benefit from the following tips:
Different metrics suit different tasks, so it is important to choose the right one. For instance, in some classification tasks you want to track accuracy, and in others recall or precision (or you can track multiple metrics to understand your model's behavior more deeply).
Recent work on this topic is generally referred to as AutoML, and there are many applications, libraries, and methodologies for hyperparameter tuning, so you may also want to look at methods other than grid search. If you want to continue with grid search, you can switch to GridSearchCV, which evaluates each configuration several times on different parts of the dataset and so makes your hyperparameter tuning more robust.
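The train/validation/test discipline described above can be sketched in plain Python. Everything here is an illustrative assumption: the data is synthetic, `split_indices` and `accuracy` are hypothetical helpers, and the "model" is a toy threshold rule whose threshold plays the role of the hyperparameter. The point is only the order of operations: choose on validation, report once on test.

```python
import random

def split_indices(n, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle 0..n-1 and cut it into train / validation / test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]

def accuracy(threshold, xs, ys, indices):
    """Accuracy of the toy rule 'predict 1 when x > threshold' on the given rows."""
    hits = sum((xs[i] > threshold) == bool(ys[i]) for i in indices)
    return hits / len(indices)

# Synthetic data: label is 1 when x > 0.6, with about 10% of labels flipped.
rng = random.Random(1)
xs = [rng.random() for _ in range(500)]
ys = [int(x > 0.6) ^ int(rng.random() < 0.1) for x in xs]

train, val, test = split_indices(len(xs))

grid = [0.2, 0.4, 0.6, 0.8]  # candidate hyperparameter values
# A real model would be fitted on `train` here for every grid point.
best = max(grid, key=lambda t: accuracy(t, xs, ys, val))

# The test set is touched exactly once, after the choice has been made.
print("chosen on validation:", best)
print("reported test accuracy:", round(accuracy(best, xs, ys, test), 3))
```

Cross-validated search (as in GridSearchCV) replaces the single validation split with several rotating ones, but the test set is still held out until the end.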

Is Naive Bayes biased?

I have a use case in which text needs to be classified into one of three categories. I started with Naive Bayes [Apache OpenNLP, Java], but I was told that the algorithm is biased: if my training data is 60% classA, 30% classB, and 10% classC, the algorithm tends to be biased towards classA and predicts texts from the other classes as classA.
If this is true, is there a way to overcome this issue?
I also came across other algorithms, such as an SVM classifier or logistic regression (maximum entropy model), but I am not sure which will be more suitable for my use case. Please advise.

Is there a way to overcome this issue?
Yes, there is. But first you need to understand why it happens.
Basically, your dataset is imbalanced.
An imbalanced dataset is one in which the number of observations is not the same for all classes, so some classes have many more instances than others.
In this scenario, your model becomes biased towards the majority class, because you have more training data for that class.
Solutions
Undersampling:
Randomly remove samples from the majority class to balance the dataset.
Oversampling:
Add more samples of the minority classes to balance the dataset.
Change Performance Metrics
Use F1-score, recall, or precision to measure the performance of your model.
There are a few more solutions; if you want to know more, refer to this blog
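Two of the options above, oversampling and a per-class metric, can be sketched in plain Python. This is a minimal illustration, not OpenNLP code: `random_oversample` and `f1_per_class` are hypothetical helper names, and the documents are placeholders that only reproduce the 60/30/10 split from the question.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    is as large as the biggest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

def f1_per_class(y_true, y_pred):
    """F1 score of each class from its true positives, false positives, false negatives."""
    scores = {}
    for cls in set(y_true):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores

# Placeholder corpus with the class proportions from the question.
labels = ["A"] * 60 + ["B"] * 30 + ["C"] * 10
texts = [f"doc{i}" for i in range(len(labels))]

_, balanced = random_oversample(texts, labels)
print(Counter(balanced))  # every class now has 60 samples

# A classifier that always answers "A" is 60% accurate but useless for B and C.
always_a = ["A"] * len(labels)
print(f1_per_class(labels, always_a))
```

Resampling should be applied to the training split only; the evaluation data should keep its natural class proportions so the reported scores stay meaningful.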
There are other algorithms that I came across, like SVM Classifier or logistic regression (maximum entropy model), however I am not sure which will be more suitable for my use case
You will never know unless you try. I would suggest you try 3-4 different algorithms on your data.

Test accuracy vs Training time on Weka

From what I know, test accuracy should increase as training time increases (up to some point), but experimenting with Weka yielded the opposite. I am wondering if I misunderstood something.
I used diabetes.arff for classification, with 70% for training and 30% for testing. I used the MultilayerPerceptron classifier and tried training times 100, 500, 1000, 3000, 5000, and 10000.
Here are my results:
Training time    Accuracy
100              75.2174 %
500              75.2174 %
1000             74.7826 %
3000             72.6087 %
5000             70.4348 %
10000            68.6957 %
What can be the reason for this? Thank you!

You got a very nice example of overfitting.
Here is the short explanation of what happened:
Your model (it doesn't matter whether it is a multilayer perceptron, a decision tree, or literally anything else) can fit the training data in two ways.
The first is generalization: the model tries to find patterns and trends and uses them to make predictions. The second is memorizing the exact data points of the training dataset.
Imagine a computer vision task: classify images into two categories, humans vs. trucks. A good model will find common features that are present in pictures of humans but not in pictures of trucks (smooth curves, skin-colored surfaces). This is generalization. Such a model will be able to handle new pictures pretty well. A bad, overfitted model will just memorize the exact images, the exact pixels of the training dataset, and will have no idea what to do with new images in the test set.
What can you do to prevent overfitting?
There are a few common approaches to deal with overfitting:
Use simpler models. With fewer parameters, it is harder for a model to memorize the dataset.
Use regularization. Constrain the weights of the model and/or use dropout in your perceptron.
Stop the training process early. Split your training data once more, so that you have three parts: training, dev, and test. Then train your model using the training data only, and stop training when the error on the dev set stops decreasing.
A good starting point for reading about overfitting is Wikipedia: https://en.wikipedia.org/wiki/Overfitting
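The early-stopping approach above can be sketched as a small function over the dev-set error recorded after each epoch. This is a minimal sketch, not Weka's implementation: the `patience` rule (stop after a fixed number of epochs without improvement) is one common variant, and the error values are made up for illustration.

```python
def early_stopping_epoch(dev_errors, patience=3):
    """Return (best_epoch, stop_epoch), both 0-based.

    Training stops once `patience` epochs have passed without the dev error
    improving on its best value; the model from `best_epoch` is the one to keep.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(dev_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Ran out of epochs before the patience limit was reached.
    return best_epoch, len(dev_errors) - 1

# Hypothetical dev-set errors: improving at first, then rising as the model overfits.
dev_errors = [0.40, 0.31, 0.27, 0.25, 0.26, 0.28, 0.30, 0.33]
best, stopped = early_stopping_epoch(dev_errors, patience=3)
print(f"keep the model from epoch {best}, training stopped at epoch {stopped}")
```

In a real training loop the same bookkeeping runs after every epoch, and the weights from the best epoch are saved so they can be restored once training stops.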

Early stopping : neural networks

I'm working on relation classification with the SemEval2010 Task 8 dataset. The dataset is already split into 8,000 samples for training and 2,717 for testing. In order to be as fair as possible, I use the test set only at the end, to compute the performance (F1-Score) of my model.
In order to tune my convolutional neural network, I keep 6,400 samples for training and 1,600 for validation. I train the model, and after each epoch (about 10 minutes of computation) I compute the F1-Score of my predictions.
I read the paper http://page.mi.fu-berlin.de/prechelt/Biblio/stop_tricks1997.pdf and stop training when the last 3 performances were increasing (similar to UP in the paper). In the paper, they return the model corresponding to the best performance seen so far.
My question is: in order to be as accurate as possible, we need the whole 8,000 samples for training. Is it correct to say that we should retrain on all of them up to the epoch that had the best performance on the validation set, and then do the predictions? Or should we save the model corresponding to the best performance and "waste" 1,600 samples?

Deep learning Training dataset with Caffe

I am a deep-learning newbie working on a vehicle classifier for images using Caffe, and I have a 3-part question:
Are there any best practices for organizing classes when training a CNN, i.e. the number of classes and the number of samples for each class? For example, would I be better off this way:
(a) Vehicles - Car-Sedans/Car-Hatchback/Car-SUV/Truck-18-wheeler/.... (note this could mean several thousand classes), or
(b) have a higher-level model that classifies between car/truck/2-wheeler and so on, and if the result is a car, then query the Car Model to get the car type (sedan/hatchback etc.)
How many training images per class is a typical best practice? I know there are several other variables that affect the accuracy of the CNN, but what rough number is good to shoot for in each class? Should it be a function of the number of classes in the model? For example, if I have many classes in my model, should I provide more samples per class?
How do we ensure we are not overfitting to a class? Is there a way to measure heterogeneity in the training samples for a class?
Thanks in advance.

Well, the first choice you mentioned corresponds to a very challenging task in the computer vision community: fine-grained image classification, where you want to classify the subordinate categories of a base class, say Car. To get more info on this, you may see this paper.
According to the literature on image classification, high-level classes such as car/truck are much simpler for CNNs to learn, since more discriminative features may exist between them. I suggest following the second approach, that is, classifying all types of cars vs. trucks and so on.
The number of training samples needed grows mainly with the number of parameters, so if you want to train a shallow model, far fewer samples are required. It also depends on whether you decide to fine-tune a pre-trained model or train a network from scratch. When sufficient samples are not available, you have to fine-tune a model on your task.
Over-fitting has always been a problematic issue in machine learning, and even CNNs are not free of it. The literature offers some practical suggestions for reducing over-fitting, such as dropout layers and data-augmentation procedures.
It may not be part of your questions, but it seems that you should follow the fine-tuning procedure, that is, initialize the network with the pre-computed weights of a model trained on another task (say ILSVRC 201X) and adapt the weights to your new task. This procedure is known as transfer learning (and sometimes domain adaptation) in the community.
