Majority layer in Keras?

Is there any kind of majority layer in Keras? That is, the input would be some 1D vector, and the output a single number: the value that occurs most often in the input vector?
My use case is this: I'm building an ensemble of neural networks, but let's say that I want to end up with a single network. So I'm building a new network with the previous models as inputs, and I want to add a single output layer that simply runs a majority vote. Is that possible with Keras?

Note that such a layer would not make much sense with float activations, since the probability of two floats being exactly equal is 0. Consequently, this only makes sense if your inputs are categorical. Categorical outputs are usually one-hot encoded, so imagine having N networks, each producing a one-hot encoding over K possible values, giving an N x K tensor:
Net 1: 0 0 1 0 0 0
Net 2: 1 0 0 0 0 0
Net 3: 0 0 1 0 0 0
Now all we have to do is sum over the N rows:
1 0 2 0 0 0
and take an argmax to find what you want. So you just compose two standard operations available in every NN library: summation and argmax. Note that this solution also works if your inputs are floats: you "sum" the votes and take an argmax.
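A minimal sketch of this in Keras (assuming TensorFlow 2's bundled Keras; the three member models here are untrained placeholders standing in for your pre-trained networks, and the sizes are illustrative):

    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers

    K_CLASSES = 6    # number of categories (illustrative)
    N_FEATURES = 20  # input size of each member network (illustrative)

    def make_member(name):
        # Placeholder for one pre-trained ensemble member that outputs a
        # length-K probability / one-hot vector.
        inp = keras.Input(shape=(N_FEATURES,))
        out = layers.Dense(K_CLASSES, activation="softmax")(inp)
        return keras.Model(inp, out, name=name)

    members = [make_member(f"net_{i}") for i in range(3)]

    ensemble_in = keras.Input(shape=(N_FEATURES,))
    votes = [m(ensemble_in) for m in members]   # N tensors of shape (batch, K)
    summed = layers.Add()(votes)                # sum the votes over the N nets
    winner = layers.Lambda(lambda t: tf.argmax(t, axis=-1))(summed)  # majority class

    ensemble = keras.Model(ensemble_in, winner)

`layers.Add()` does the summation over the N votes, and the `Lambda` wraps the argmax; if you need a one-hot output rather than a class index, you could follow it with `tf.one_hot`.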

Related

Implementing a neural network classifier for my data, but is it solvable this way?

I will try to explain what the problem is.
I have 5 materials, each composed of 3 minerals drawn from a set of 10 different minerals. For each material I have measured intensity vs wavelength, and each intensity-vs-wavelength vector can be mapped to a binary vector of ones and zeros corresponding to the minerals the material is composed of.
So material 1 has an intensity of [0.51 0.53 0.57 0.68...... ] measured at different wavelengths [470 480 490 500 510 ......] and a binary vector
[1 0 0 0 1 0 0 1 0 0]
and so on for each material.
For each material I have 5000 examples, so 25000 examples in total. Each example will have a 'similar' intensity vs wavelength behaviour but will give the 'same' binary vector.
I want to design a NN classifier so that if I give it as an input the intensity vs wavelength, it gives me the corresponding binary vector.
The intensity vs wavelength vector has a length of 450, so I will have 450 units in the input layer;
the binary vector has a length of 10, so 10 output neurons;
the hidden layer(s) will have, as a starting point, 200 neurons.
Can I simply design a NN classifier this way, and would it solve the problem, or do I need something else?
You can do that; however, be careful to use the right cost function and output-layer activation. In your case, you should use sigmoid units for your output layer and binary cross-entropy as the cost function.
Another way to go about this would be to one-hot encode the outputs so that you can use standard multi-class classification (this will probably not make sense, since your output is probably sparse).
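For concreteness, a minimal Keras sketch of the first setup (assuming TensorFlow 2; the layer sizes come from the question, the optimizer is just a reasonable default):

    from tensorflow import keras
    from tensorflow.keras import layers

    # 450 intensity values in, 10 independent mineral probabilities out.
    model = keras.Sequential([
        keras.Input(shape=(450,)),
        layers.Dense(200, activation="relu"),    # hidden layer from the question
        layers.Dense(10, activation="sigmoid"),  # one probability per mineral
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",    # multi-label cost function
                  metrics=["binary_accuracy"])
    # model.fit(X_train, y_train, ...)  # X_train: (n, 450), y_train: (n, 10) binary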

How can I use one-hot encoded labels with some sklearn classifiers?

I have a multiclass classification task with 10 classes. As such, I used sklearn's OneHotEncoder to transform the one-column labels into 10-column labels. I was then trying to fit the training data. Although I was able to do this with RandomForestClassifier, I got the error message below when fitting with GaussianNB:
ValueError: bad input shape (1203L, 10L)
I understand the allowed shape of y in these two classifiers is different:
GaussianNB:
y : array-like, shape (n_samples,)
RandomForest:
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The question is, why is this? Wouldn't this be contradictory to "All classifiers in scikit-learn do multiclass classification out-of-the-box"? Any way to go around it? Thanks!
The question is, why is this?
It is because of a slight misunderstanding: in scikit-learn you do not one-hot encode labels; you pass them as a one-dimensional vector of labels. Thus, instead of
1 0 0
0 1 0
0 0 1
you literally pass
1 2 3
So why does random forest accept a different scheme? Because it is not for the multiclass setting! It is for the multi-label setting, where each instance can have many labels at once, like
1 1 0
1 1 1
0 0 0
Wouldn't this be contradictory to "All classifiers in scikit-learn do multiclass classification out-of-the-box"?
On the contrary: it is the easiest solution to never ask for one-hot encoding unless the task is multi-label.
Any way to go around it?
Yup: just do not encode, and pass the raw labels :-)
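A quick sketch with toy arrays, just to show the shapes involved:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    X = np.random.rand(6, 4)                  # toy feature matrix
    y_onehot = np.eye(3)[[0, 1, 2, 0, 1, 2]]  # shape (6, 3): what GaussianNB rejects

    y = y_onehot.argmax(axis=1)               # back to raw labels, shape (6,)
    clf = GaussianNB().fit(X, y)              # works: y has shape (n_samples,)
    print(clf.predict(X[:2]))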

Torch - Large input and output sizes into neural network

I'm new to machine learning and am seeking some help.
I would like to train a network to predict the next values I expect, as follows:
reference: [val1 val2 ... val15]
val = 0 if it doesn't exist, 1 if it does.
Input: [1 1 1 0 0 0 0 0 1 1 1 0 0 0 0]
Output: [1 1 1 0 0 0 0 0 1 1 1 0 0 1 1] (last two values appear)
So my neural network would have 15 inputs and 15 outputs
I would like to know if there is a better way to do this kind of prediction. Would my data need normalization as well?
Now the problem is, I don't have 15 values, but actually 600,000 of them. Can a neural network handle such big tensors? Also, I've heard I would need twice that number of hidden-layer units.
Thanks a lot for your help, you machine learning experts!
Best
This is not a problem for the concept of a neural network: the question is whether your computing configuration and framework implementation deliver the required memory. Since you haven't described your topology, there's not a lot we can do to help you scope this out. What do you have for parameter and weight counts? Each of those is at least a short float (4 bytes). For instance, a direct FC (fully-connected) layer would give you (6e5)^2 weights, or 3.6e11 * 4 bytes => 1.44e12 bytes. Yes, that's pushing 1.5 terabytes for that layer's weights.
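The arithmetic, spelled out:

    # Memory for the weights of a single 600K -> 600K fully-connected layer,
    # at 4 bytes per float32 weight (biases ignored).
    n = 600_000
    n_weights = n * n            # 3.6e11 weights
    n_bytes = n_weights * 4      # 1.44e12 bytes
    print(n_bytes / 1e12, "TB")  # ~1.44 TB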
You can get around some of this with the style of NN you choose. For instance, splitting into separate channels (say, 60 channels of 1000 features each) can give you significant memory savings, albeit at the cost of speed in training (more layers) and perhaps some accuracy (although crossover can fix a lot of that). Convolutions can also save you overall memory, again at the cost of training speed.
600K => 4 => 600K
That clarification takes care of my main worries: you have 600,000 * 4 weights in each of two places, i.e. 4.8M weights, plus 1,200,004 units across the three layers. That's about 6M floats in total, which shouldn't stress the RAM of any modern general-purpose computer.
The channelling idea is for when you're trying to have a fatter connection between layers, such as 600K => 600K FC. In that case, you break the data up into smaller groups (usually just 2-12) and build a bunch of parallel fully-connected streams. For instance, you could take your input and make 10 streams, each of which is a 60K => 60K FC. In the next layer, you swap the organization, "dealing out" each set of 60K so that 1/10 goes into each of the next channels.
This way, each stage has only 10 * 60K * 60K weights, only 10% as many as before ... but now there are 3 layers (two channelled stages). Still, that's a 5x saving on the memory required for weights, which is where you have the combinatorial explosion.
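Here is a sketch of that channelling idea in PyTorch; the `ChannelledFC` module and the sizes are purely illustrative (scaled down so the example actually runs), not a standard library component:

    import torch
    import torch.nn as nn

    class ChannelledFC(nn.Module):
        # One channelled stage: split the features into groups, run a separate
        # fully-connected layer per group, then "deal out" the results so the
        # next stage mixes information across groups.
        def __init__(self, width, groups):
            super().__init__()
            assert width % groups == 0
            self.groups = groups
            chunk = width // groups
            self.fcs = nn.ModuleList(nn.Linear(chunk, chunk) for _ in range(groups))

        def forward(self, x):
            parts = x.chunk(self.groups, dim=-1)
            y = torch.cat([fc(p) for fc, p in zip(self.fcs, parts)], dim=-1)
            # Interleave so each next-stage group sees a slice of every stream.
            b = y.shape[0]
            return y.view(b, self.groups, -1).transpose(1, 2).reshape(b, -1)

    # Two channelled stages in place of one giant FC layer.
    net = nn.Sequential(ChannelledFC(600, 10), nn.Tanh(), ChannelledFC(600, 10))
    out = net(torch.randn(8, 600))  # batch of 8, 600 features each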

How can I speed up learning for feed-forward, gradient-based backpropagation neural networks?

I am using tanh as an activation function.
Let's take one problem for example.
XOR Problem:
1 1 0
0 1 1
1 0 1
0 0 0
When I train my neural network for 500 epochs, the results look like this:
1 1 0.001015
0 1 0.955920
1 0 0.956590
0 0 0.001293
After another 500 epochs:
1 1 0.000428
0 1 0.971866
1 0 0.971468
0 0 0.000525
After another 500 epochs:
1 1 0.000193
0 1 0.980982
1 0 0.981241
0 0 0.000227
It seems that the learning is slowing down a lot.
My neural network is taking forever to get precise enough for my custom problems.
Is there any way to speed up the learning after it starts slowing down like that?
Thanks
This kind of learning curve is perfectly normal in neural network training (or even in real-life learning). That said, while the general shape of the curve is typical, we can improve on its steepness. In that respect, I suggest that you implement momentum in your training algorithm. If that does not seem to be enough, your next step would be to implement an adaptive learning-rate algorithm such as Adadelta, Adagrad, or RMSprop. Finally, a last thing you may want to try is batch normalization.
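As an illustration, here is a minimal numpy sketch of classical momentum on the XOR problem (a 2-4-1 net with tanh hidden units and a sigmoid output; the hyperparameters are plausible guesses, not tuned values from your setup):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[1., 1.], [0., 1.], [1., 0.], [0., 0.]])
    y = np.array([[0.], [1.], [1.], [0.]])

    W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros((1, 4))
    W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros((1, 1))
    vW1, vb1 = np.zeros_like(W1), np.zeros_like(b1)
    vW2, vb2 = np.zeros_like(W2), np.zeros_like(b2)
    lr, mu = 0.5, 0.9  # learning rate and momentum coefficient

    for epoch in range(500):
        # Forward pass.
        h = np.tanh(X @ W1 + b1)
        out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
        # Backward pass (cross-entropy + sigmoid gives delta = out - y).
        d_out = (out - y) / len(X)
        dW2, db2 = h.T @ d_out, d_out.sum(0, keepdims=True)
        d_h = (d_out @ W2.T) * (1.0 - h ** 2)
        dW1, db1 = X.T @ d_h, d_h.sum(0, keepdims=True)
        # Momentum update: the velocity accumulates past gradients.
        for p, v, g in ((W1, vW1, dW1), (b1, vb1, db1),
                        (W2, vW2, dW2), (b2, vb2, db2)):
            v *= mu
            v -= lr * g
            p += v

    print(out.round(3))  # should end up close to [[0], [1], [1], [0]]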
If the net you are building has sigmoids applied to the neurons in the output layer (it seems it does, judging from your results), you might consider removing them and just using a linear output. Your net might become a bit more unstable, so a smaller step size is advised, but you will be able to reach better accuracy.

Intuition behind standard deviation as a threshold and why

I have a set of input-output training data; a few samples are:
Input        Output
[1 0 0 0 0] [1 0 1 0 0]
[1 1 0 0 1] [1 1 0 0 0]
[1 0 1 1 0] [1 1 0 1 0]
and so on. I need to use the standard deviation of the entire output as a threshold, so I calculate the mean standard deviation over the outputs. The application is that the model, when presented with this data, should be able to learn and predict the output. There is a condition in my objective function design: the distance, i.e. the sum of the square roots of the Euclidean distances between the model output and the desired target corresponding to each input, should be less than a threshold.
My question is: how should I justify the use of this threshold? Is it justified? I read an article which says that it is common to take the standard deviation as the threshold.
For my case, what does it mean to take the standard deviation of the output of the training data?
There is no deep intuition/philosophy behind the standard deviation (or variance); statisticians like these measures largely because they are mathematically easy to work with due to various nice properties. See https://math.stackexchange.com/questions/875034/does-expected-absolute-deviation-or-expected-absolute-deviation-range-exist
There are quite a few other ways to perform various forms of outlier detection, belief revision, etc., but they can be more mathematically challenging to work with.
I am not sure this idea applies. You are looking at the definition of standard deviation for a univariate value, but your output is multivariate. There are multivariate analogs, but it's not clear why you would need to apply one here.
It sounds like you are minimizing the squared error, or Euclidean distance, between the output and the known correct output. That's fine, and makes me think you're predicting the multivariate output shown here. What is the threshold doing, then? It is less than what measure, of what, from what?
