Setting up multiclass decision forest/neural network on smaller dataset - machine-learning

So I have a set of data: 1900 rows and 22 columns. 21 of the columns are just numbers, but the crucial one that I want to train on has 3 classes: a, b, and c.
I have tried both decision trees/jungles and neural networks, and no matter how I set them up I can't get more than 55% precision.
Usually it's around 50% accuracy; the best I was ever able to get was 55% overall accuracy and around 70% average.
Should I even use a NN on such a small dataset? As I said, I tried other ML algorithms but they don't yield anything better.

I think there is no clear answer to your question. A low accuracy score may come from a few reasons; I will state some of them in the following points:
When you use decision trees / neural networks, low accuracy may be a result of a wrong choice of hyperparameters (like the maximum depth of a tree or the number of trees in the DT case, or a wrong topology or data preparation in the NN case). What I advise is to use a grid or random search over hyperparameters for both NN and DT (for "static", i.e. non-sequential, data, packages like h2o in R or scikit-learn in Python may do a great job), and in the neural network case, normalize your data properly (e.g. subtract the mean and divide by the standard deviation for every column of x). A rough sketch follows these points.
Your dataset might be inconsistent. If, for example, your data does not have the property that there is a functional dependency between x and y (i.e. y = f(x) for some f), then what is learnt during training is the probability that a given x belongs to some specified class. This inconsistency might seriously harm your accuracy. What I advise in this case is to check whether that phenomenon occurs and then, for example, try to segment your data to solve the problem.
Your dataset might simply be too small. Try to get more data in this case.
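As a rough illustration of the grid-search and normalization points above, here is a minimal scikit-learn sketch. The data is a random placeholder with the question's shape (1900 rows, 21 numeric predictors, a 3-class target), and the parameter ranges are invented rather than recommended values:

# Sketch: standardize features, then grid-search forest hyperparameters.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1900, 21))              # placeholder features
y = rng.choice(["a", "b", "c"], size=1900)   # placeholder 3-class target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),             # subtract mean, divide by std
    ("clf", RandomForestClassifier(random_state=0)),
])

param_grid = {                               # illustrative ranges, not tuned values
    "clf__n_estimators": [100, 300],
    "clf__max_depth": [3, 5, 10, None],
    "clf__min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, "test accuracy:", search.score(X_test, y_test))

When the grid gets large, RandomizedSearchCV can be swapped in for GridSearchCV to sample a fixed number of settings instead of trying them all.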

Related

Testing the maximum theoretical accuracy for a data set?

I am applying several machine learning methods to a real-world medical data set but I can't achieve high accuracy (it's around 80% now) on the test data set. The problem is to predict whether the disease is present or not.
Is there any way to determine the maximum accuracy that can be achieved? Or something similar that can tell me the expected accuracy of a particular machine learning model on the data set?
If not, how can I prove the accuracy I am getting is the best (or near best) accuracy possible from the data set?
It depends on how deterministic your data is. I will illustrate with two variables, y as a function of x.
If y = x, then the theoretical best accuracy is 100%. It should be possible to get a perfect result.
Now suppose that y = x + rnorm(n, 0, sigma) where n is the number of points and you get to choose sigma. You can predict x, but you cannot predict the random part. The bigger sigma is, the worse your predictions. You can make the best possible accuracy arbitrarily low by choosing a large enough sigma.
With real data, you don't usually know how well your input variables determine the output, so you cannot state a meaningful theoretical limit beyond the trivial one that accuracy is between 0 and 1.
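To make that concrete, here is a small simulation (my own sketch, not part of the original answer): the label is the sign of x plus noise, and the best attainable test accuracy drops as the noise grows, no matter which model you fit.

# Sketch: the larger the irreducible noise, the lower the best achievable accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=(n, 1))

for sigma in [0.0, 0.5, 2.0, 5.0]:
    y = (x[:, 0] + rng.normal(0, sigma, size=n) > 0).astype(int)
    x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)
    acc = LogisticRegression().fit(x_tr, y_tr).score(x_te, y_te)
    print(f"sigma={sigma:>4}: test accuracy ~ {acc:.2f}")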
What is the accuracy rate for the detections done by humans?
If it is close to the accuracy you get from the machine, you are doing great! Even if the machine is doing a bit worse, it can still be considered good.
In industry, such a question is mostly a product-management question rather than a scientific one.

Using Multilayer Perceptron (MLP) to categorise images and its performance

I am new to Machine/Deep learning area!
If I understood correctly, when I am using images as an input,
the number of neurons at the input layer = the number of pixels (i.e. resolution).
The weights and biases are updated through back-propagation to achieve as low an error rate as possible.
Question 1.
So, even a single image will adjust the values of the weights & biases (through the back-propagation algorithm), so how does adding more similar images to this MLP improve the performance?
(I must be missing something big... but to me it seems like the network will only be optimised for the given single image, and if I input the next one (a similar image), it will only be optimised for that next one.)
Question 2.
If I want to train my MLP to recognise certain types of images (let's say clothes / animals), what is a good number of training examples for each label (i.e. clothes, animals)? I know more training data will produce a better result, but how many examples would be ideal for good enough performance?
Question 3. (continue)
A bit different angle question,
There is the Google Cloud Vision API, which takes images as input and produces labels/probabilities as output. So this API will give me an output of (let's say) 100 labels and the probability of each label.
(E.g., when I put in an online game screenshot, it will produce something like the list below.)
Can this type of data be used as input to an MLP to categorise certain types of images?
(Assuming I know all the possible labels that the Google API produces and use all of them as input neurons.)
Pixel values represent an image, but I think this type of API output can also represent an image from a different angle.
If so, what would be the performance difference ?
e.g. when classifying 10 different types of images:
(pixel-trained model) vs (output-label-trained model)
I can help you with the "intuitive" picture.
First, it may be worth looking at convolutional neural networks and deep learning to see how images are handled as input in order to reduce the number of weights. It will not be 1 weight per pixel.
Also, what exactly do you mean by "performance"? That is not a well-defined question. If you use 1 image, say a cat, do you mean by performance that you can identify cats in other pictures, or how well you can fit that particular cat image?
Imagine you have a table of 3 weights, 1 input and 1 output, you trained your network to an error of < 0.01, and the desired output is 0.5:
W1  | W2  | W3   | Output
0.1 | 0.2 | 0.05 | 0.5006
If you retrain the network, you may get a different set of weights:
W1  | W2  | W3   | Output
0.3 | 0.2 | 0.08 | 0.49983
Since the weights are way different, you can imagine that there are several solutions.
Then, if you add another input, you can imagine that some of the weight sets which worked for the first input will also work for the second.
Then you add another input. A subset of the solutions that worked for 2 inputs will work for 3 inputs. Etc.
When you have enough unrelated or noisy inputs, you won't find a subset of weights which meet your error criterion. Either you need to add weights (more degrees of freedom) or increase the error target, or both.
Now, you have a learning rate when you train a network. Say you are doing online training (you update the weights after each input), not batch training (you compute the error over a batch (subset) of the inputs and update your weights once per batch).
Now, suppose your learning rate was 0.01 and weight of 0.1. Intuitively:
If, for the first input, that weight had a derivative of 5, then the weight gets a new value of 0.1 - 0.01*5 = 0.05.
If you feed in the next input and the derivative is -5, that means the second input "disagrees" with the first change and tries to push the weight back toward 0.1.
If the derivative for the second input was 5, that means the second input "agrees" with the first.
If you have 20 inputs, some will pull the value up and some will push it down. You keep looping through the training data, and the weight approaches a value that most of the inputs agree on, minimizing the error contributed by that weight.
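As a toy illustration of those online updates (a sketch with made-up numbers, not the poster's actual setup): a single weight is fit to several examples whose targets disagree, and looping over them settles the weight on a compromise value.

# Sketch of online updates: each example nudges the weight by
# -learning_rate * gradient of its own squared error.
learning_rate = 0.01
w = 0.1

# Made-up 1-D training pairs (x, y); a perfect fit would need w = y / x,
# which differs between examples, so they "disagree".
examples = [(1.0, 0.5), (1.0, 0.9), (2.0, 1.2), (1.0, 0.6)]

for epoch in range(200):
    for x, y in examples:
        grad = 2 * x * (w * x - y)   # d/dw of (w*x - y)^2
        w -= learning_rate * grad    # online update after every example

print(f"learned w = {w:.3f}")        # ends up near the least-squares compromise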
For question 2:
My mathematical gut feeling tells me you definitely need at least 2x as many training examples as weights for the training to mean anything, but you should make that at least 10x the number of weights as a bare minimum to draw any conclusion about your network, unless you are not trying to generalize to something new (for example, for an XOR gate you can probably get away with far fewer inputs than weights, but that is a longer discussion).
Note:
With 1 image, you can rotate it, stretch it, mix it with other images... to create additional images and increase your training set.
If you have a simple input like an XOR gate, you can create inputs like (0.3, 0.7), (0.3, 0.6), (0.2, 0.8)... to expand your training set.
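For the augmentation note above, a minimal sketch (placeholder image data, plain numpy transforms) might look like this:

# Sketch of simple augmentation on an image stored as an H x W x channels array.
# Each transform yields a new training example from the same underlying image.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))            # placeholder image, not real data

augmented = [
    image,
    np.fliplr(image),                      # mirror left-right
    np.rot90(image, k=1, axes=(0, 1)),     # rotate 90 degrees
    np.clip(image + rng.normal(0, 0.05, image.shape), 0.0, 1.0),  # add noise
]
print(len(augmented), "training examples from one image")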
For question 3:
This is equivalent to chaining Google's network with a network you create, in series, but training each part separately.
Basically: You have Pictures --> 10 labels input to your network --> your classification
The problem I see there is that you may not know all the possible outputs of Google's classification. But suppose they are consistent:
Is your label the same as one of the 10 labels? If so, use the given label. If it is a different type of label, you can still use the API output to simplify your network. What are the consequences, or what is the performance?
That is beyond me. In neural nets, while there are good mathematical theories telling us what they can do, many posed problems such as the one you asked require either a special mathematical analysis (perhaps a PhD's worth of insight into that class of problems) or, as most people do, showing empirical results.
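For question 3, a rough sketch of the "labels as input features" idea (the label vocabulary, probabilities, and categories below are invented placeholders, not real Google Cloud Vision output):

# Sketch: represent each image as a fixed-length vector of label
# probabilities returned by some external vision API, then train an
# ordinary classifier on those vectors.
import numpy as np
from sklearn.neural_network import MLPClassifier

all_api_labels = ["game", "screenshot", "text", "cartoon", "person"]  # assumed fixed vocabulary

def to_feature_vector(api_result: dict) -> np.ndarray:
    """Map {label: probability} to a vector ordered by all_api_labels."""
    return np.array([api_result.get(label, 0.0) for label in all_api_labels])

# Invented examples: API output per image, plus your own target category.
X = np.vstack([
    to_feature_vector({"game": 0.9, "screenshot": 0.8}),
    to_feature_vector({"person": 0.95, "text": 0.1}),
    to_feature_vector({"game": 0.7, "cartoon": 0.6}),
    to_feature_vector({"person": 0.8}),
])
y = ["game", "photo", "game", "photo"]

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)
print(clf.predict([to_feature_vector({"game": 0.85})]))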

Why does VGG19 subtract the mean RGB values of inputs?

This is found in most implementations I've seen, but I don't really understand the purpose. I've heard it's a preprocessing step that helps with classification accuracy? Is it necessary, particularly for non-classification tasks, e.g. generating new images or working with image activations?
One of the most popular ways to normalize data is to make it have mean 0 and variance 1. It's usually done because:
Computational reasons: most training algorithms need your data points to have a small norm in order to run properly, e.g. for gradient stability.
Dataset bias: if your data doesn't have 0 mean, it constantly pushes the network in a certain direction. This must be compensated for by the network weights and biases, which may slow down training (especially when the norms of the outputs are relatively large).
When data is not normalized/scaled, some input coordinates (the ones with bigger means and norms) have a much greater impact on the training process. Imagine, for example, two variables: age and a binary indicator of whether someone has had a heart attack. If you don't normalize your data, the fact that age has a higher norm than the binary indicator will make that coordinate influence the training process much more than the other one. Is that plausible, e.g. for predicting whether someone will have another heart attack?
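A minimal sketch of that normalization step, assuming images stored as arrays and per-channel statistics computed over the training set (the data here is a random placeholder):

# Sketch: subtract the per-channel mean (and optionally divide by the
# per-channel std) computed over the training set, so inputs are centered.
import numpy as np

rng = np.random.default_rng(0)
train_images = rng.integers(0, 256, size=(100, 64, 64, 3)).astype(np.float32)  # placeholder data

channel_mean = train_images.mean(axis=(0, 1, 2))   # one value per RGB channel
channel_std = train_images.std(axis=(0, 1, 2))

normalized = (train_images - channel_mean) / channel_std
print(channel_mean, normalized.mean(axis=(0, 1, 2)).round(4))  # means ~0 after centering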

big number of attributes best classifiers

I have a dataset which is built from 940 attributes and 450 instances, and I'm trying to find the best classifier to get the best results.
I have used every classifier that WEKA suggests (such as J48, costSensitive, combinations of several classifiers, etc.).
The best solution I have found is a J48 tree with an accuracy of 91.7778%,
and the confusion matrix is:
   a    b   <-- classified as
 394   27 |  a = NON_C
  10   19 |  b = C
I want to get better results in the confusion matrix, with at least 90% accuracy for each of TN and TP.
Is there something I can do to improve this (such as long-running classifiers that scan all options, or another idea I didn't think of)?
Here is the file:
https://googledrive.com/host/0B2HGuYghQl0nWVVtd3BZb2Qtekk/
Please help!!
I'd guess that you got a data set and just tried all possible algorithms...
Usually, it is good to think about the problem:
Find and work only with relevant features (attributes); otherwise the task can be noisy. Relevant features = features that have a high correlation with the class (NON_C, C).
Your dataset is biased, i.e. the number of NON_C instances is much higher than the number of C instances. Sometimes it can be helpful to train your algorithm on equal portions of positive and negative examples (in your case NON_C and C) and cross-validate it on the natural (real) proportions.
The size of your training data is small in comparison with the number of features. Maybe increasing the number of instances would help ...
...
There are quite a few things you can do to improve the classification results.
First, it seems that your training data is severely imbalanced. By training with that imbalance you are creating a significant bias in almost any classification algorithm.
Second, you have a larger number of features than examples. Consider using L1 and/or L2 regularization to improve the quality of your results.
Third, consider projecting your data into a lower-dimensional PCA space, say one containing 90% of the variance. This will remove much of the noise in the training data.
Fourth, be sure you are training and testing on different portions of your data. From your description it seems like you are training and evaluating on the same data, which is a big no-no.
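A rough sketch combining those suggestions (hold-out split, PCA keeping about 90% of the variance, and an L2-regularized classifier with class reweighting). The features and labels are random placeholders shaped like the dataset described in the question:

# Sketch: split, standardize, reduce dimensionality, and fit a regularized
# classifier that reweights the minority class.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((450, 940))                       # placeholder features
y = np.array(["NON_C"] * 421 + ["C"] * 29)       # roughly the imbalance in the confusion matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.90)),             # keep ~90% of the variance
    ("clf", LogisticRegression(penalty="l2", C=1.0, class_weight="balanced", max_iter=5000)),
])
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))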

Echo state neural network?

Is anyone here familiar with echo state networks? I created an echo state network in C#. The aim is just to classify inputs into GOOD and NOT GOOD ones. The input is an array of double numbers. I know that maybe an echo state network isn't the best choice for this classification, but I have to do it with this method.
My problem is that after training, the network cannot generalize. When I run the network on foreign data (not the training input), I get only around 50-60% good results.
More details: my echo state network must work like a function approximator. The input of the function is an array of 17 double values, and the output is 0 or 1 (I have to classify the input as bad or good).
So I have created a network. It contains an input layer with 17 neurons, a reservoir layer whose neuron count is adjustable, and an output layer containing 1 neuron for the required 0 or 1 output. In the simpler setup, no output feedback is used (I tried output feedback as well, but nothing changed).
The inner matrix of the reservoir layer is adjustable too. I generate weights between two double values (min, max) with an adjustable sparseness ratio. If the values are too big, it normalizes the matrix to have a spectral radius lower than 1. The reservoir layer can have sigmoid and tanh activation functions.
The input layer is fully connected to the reservoir layer with random values. So in the training stage I calculate the inner X(n) reservoir activations on the training data, collecting them row-wise into a matrix. Using the desired output data matrix (which here is a vector of 1 or 0 values), I calculate the output weights (from reservoir to output); the reservoir is fully connected to the output. Anyone who has used echo state networks knows what I'm talking about. I use the pseudo-inverse method for this.
The question is: how can I adjust the network so it generalizes better, hitting more than 50-60% of the desired outputs on a foreign dataset (not the training one)? If I run the network again on the training dataset, it gives very good results, 80-90%, but what I want is for it to generalize better.
I hope someone has had this issue with echo state networks too.
If I understand correctly, you have a set of known, classified data that you train on, and then you have some unknown data which you subsequently classify. You find that after training you can reclassify your known data well, but can't do well on the unknown data. This is, I believe, called overfitting; you might want to think about being less stringent with your network, reducing the number of nodes, and/or training against a held-out dataset.
The way people do it is: they have a training set A, a validation set B, and a test set C. You know the correct classification of A and B but not C (because you split your known data into A and B, and C is the data you want the network to classify for you). When training, you only show the network A, but at each iteration you use both A and B to measure success. So while training, the network tries to capture a relationship present in both A and B by looking only at A. Because it never sees the actual input and output values in B, but only knows whether its current state describes B accurately, this helps reduce overfitting.
Usually people seem to split 4/5 of data into A and 1/5 of it into B, but of course you can try different ratios.
In the end, you finish training, and see what the network will say about your unknown set C.
Sorry for the very general and basic answer, but perhaps it will help describe the problem better.
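A minimal sketch of that A / B split of the known data (placeholder arrays, plain numpy):

# Sketch of the A / B / C split described above: A is used to fit,
# B to monitor generalization during training, and C is the unknown
# data the finished network is finally run on.
import numpy as np

rng = np.random.default_rng(0)
known_X = rng.random((1000, 17))            # placeholder labelled data (17 inputs, as in the question)
known_y = rng.integers(0, 2, size=1000)

indices = rng.permutation(len(known_X))
split = int(0.8 * len(known_X))             # 4/5 train, 1/5 validation
A_idx, B_idx = indices[:split], indices[split:]

A_X, A_y = known_X[A_idx], known_y[A_idx]   # training set
B_X, B_y = known_X[B_idx], known_y[B_idx]   # validation set, only used to measure error
print(len(A_X), "training examples,", len(B_X), "validation examples")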
If your network doesn't generalize that means it's overfitting.
To reduce overfitting on a neural network, there are two ways:
get more training data
decrease the number of neurons
You might also think about the features you are feeding the network. For example, if it is a time series that repeats every week, then one useful feature is something like the 'day of the week', the 'hour of the week', or the 'minute of the week'.
Neural networks need lots of data. Lots and lots of examples. Thousands. If you don't have thousands, you should choose a network with just a handful of neurons, or else use something else, like regression, that has fewer parameters and is therefore less prone to overfitting.
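As a small sketch of the 'day of the week' idea (an assumed timestamp column, using pandas):

# Sketch: derive 'day of week' and 'hour of week' columns from a timestamp
# so a weekly pattern becomes easy for a small model to pick up.
import pandas as pd

df = pd.DataFrame({"timestamp": pd.date_range("2023-01-01", periods=5, freq="6h")})
df["day_of_week"] = df["timestamp"].dt.dayofweek          # 0 = Monday ... 6 = Sunday
df["hour_of_week"] = df["day_of_week"] * 24 + df["timestamp"].dt.hour
print(df)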
Like the other answers here have suggested, this is a classic case of overfitting: your model performs well on your training data, but it does not generalize well to new test data.
Hugh's answer has a good suggestion, which is to reduce the number of parameters in your model (i.e., by shrinking the size of the reservoir), but I'm not sure whether it would be effective for an ESN, because the problem complexity that an ESN can solve grows proportional to the logarithm of the size of the reservoir. Reducing the size of your model might actually make the model not work as well, though this might be necessary to avoid overfitting for this type of model.
Superbest's solution is to use a validation set to stop training as soon as performance on the validation set stops improving, a technique called early stopping. But, as you noted, because you use offline regression to compute the output weights of your ESN, you cannot use a validation set to determine when to stop updating your model parameters: early stopping only works for online training algorithms.
However, you can use a validation set in another way: to regularize the coefficients of your regression! Here's how it works:
Split your training data into a "training" part (usually 80-90% of the data you have available) and a "validation" part (the remaining 10-20%).
When you compute your regression, instead of using vanilla linear regression, use a regularized technique like ridge regression, lasso regression, or elastic net regression. Use only the "training" part of your dataset for computing the regression.
All of these regularized regression techniques have one or more "hyperparameters" that balance the model fit against its complexity. The "validation" dataset is used to set these parameter values: you can do this using grid search, evolutionary methods, or any other hyperparameter optimization technique. Generally speaking, these methods work by choosing values for the hyperparameters, fitting the model using the "training" dataset, and measuring the fitted model's performance on the "validation" dataset. Repeat N times and choose the model that performs best on the "validation" set.
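A minimal sketch of that procedure, assuming the reservoir states have already been collected into a matrix (the states and targets below are random placeholders) and using ridge regression from scikit-learn for the readout:

# Sketch of a regularized ESN readout: fit ridge regression on the
# collected reservoir states and choose the regularization strength
# alpha on a held-out validation split.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
states = rng.random((500, 100))                  # 500 training examples x 100 reservoir units (placeholder)
targets = rng.integers(0, 2, size=500).astype(float)

split = int(0.85 * len(states))
train_s, val_s = states[:split], states[split:]
train_t, val_t = targets[:split], targets[split:]

best_alpha, best_err = None, np.inf
for alpha in [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]:
    readout = Ridge(alpha=alpha).fit(train_s, train_t)
    err = np.mean((readout.predict(val_s) - val_t) ** 2)
    if err < best_err:
        best_alpha, best_err = alpha, err

print("chosen alpha:", best_alpha)
# Refit on all the training data with the chosen alpha before running on unseen data.
final_readout = Ridge(alpha=best_alpha).fit(states, targets)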
You can learn more about regularization and regression at http://en.wikipedia.org/wiki/Least_squares#Regularized_versions, or by looking it up in a machine learning or statistics textbook.
Also, read more about cross-validation techniques at http://en.wikipedia.org/wiki/Cross-validation_(statistics).

Resources