How to add more outputs to neural networks? - machine-learning

Of course, no one building a neural network for image recognition and classification can make room for every possible output class. So suppose I build a neural network that takes an array as input and outputs whether the image is a bird or not a bird. Can I add more outputs for more image classes after I finish training the first network, or will that destroy what it has already learned?
In other words: I keep the number of inputs fixed and start with 1 output, then add 1 more, and 1 more after that. Is that workable or not?

Retrain
If you can spend the resources, re-training your network (or, to be more specific, training something from scratch) is a good option. But read the approaches below for cases where you might achieve something better (or at least less costly).
Transfer-learning
But if you are using one of the huge popular NNs which take weeks to train on very costly hardware, there is a way built on the idea of transfer learning.
There are at least two different approaches then:
Using the pretrained NN as a feature extractor
Here you remove the final dense layers and just use the trained NN to extract features from your images. Then you can build an arbitrary new classifier on your new dataset, which maps OLD-NN-OUTPUT = FEATURES-INPUT -> classes (a new softmax NN, an SVM/kernel SVM, or anything else). This is quite robust if we assume that your pretrained NN is of high quality and your new class is not too different from the learned ones.
In general this approach is favorable if your new class + dataset is small and similar to the original one.
If the new data is not that similar, you might take the features from some earlier layer (which is more generic).
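A minimal sketch of this feature-extractor approach, assuming a TensorFlow/Keras setup; VGG16 merely stands in for whichever pretrained network you actually have, and the data is a placeholder:

```python
# Use a pretrained CNN as a fixed feature extractor, then train a classic
# classifier on top of the extracted features.
import numpy as np
from tensorflow.keras.applications import VGG16
from sklearn.svm import SVC

# include_top=False drops the final dense layers; pooling='avg' gives one
# fixed-length feature vector per image. (In practice you would also apply
# the network's preprocess_input to your images first.)
extractor = VGG16(weights="imagenet", include_top=False, pooling="avg",
                  input_shape=(224, 224, 3))

images = np.random.rand(8, 224, 224, 3)        # placeholder images
labels = np.array([0, 1, 0, 1, 2, 2, 0, 1])    # old classes plus a new one

features = extractor.predict(images)           # OLD-NN-OUTPUT = FEATURES-INPUT
clf = SVC(kernel="rbf").fit(features, labels)  # arbitrary new classifier on top
print(clf.predict(features[:2]))
```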
Continuing training
Here you would continue training the weights of your original NN, probably keeping the first layers fixed (maybe even all but the final dense ones). As above, the general idea is that a good NN is very general in its first layers (which extract features) and more specific in the last ones.
This approach is favorable if you have a lot of data for your new class. Depending on the similarity, you might either continue training all weights or, if the data is quite similar, fix the weights of some (early) layers.
There might be technical issues in implementing this approach (such as different image-size inputs and the like), so it needs some work if some constraints of the original NN are broken. It is also important to tune the hyper-parameters for this second round of learning (the learning rate should probably be lower!).
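A minimal sketch of this fine-tuning approach, again assuming Keras; the number of frozen layers and the reduced learning rate are illustrative choices, not prescriptions:

```python
# Continue training a pretrained network on the enlarged set of classes,
# keeping the early (generic) layers fixed.
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models, optimizers

base = VGG16(weights="imagenet", include_top=False, pooling="avg",
             input_shape=(224, 224, 3))

# Freeze everything except the last few layers so the generic feature
# extractors stay intact while the more specific layers can adapt.
for layer in base.layers[:-4]:
    layer.trainable = False

num_classes = 3  # e.g. bird / not-bird / the newly added class (assumption)
model = models.Sequential([
    base,
    layers.Dense(num_classes, activation="softmax"),  # new output layer
])

# A learning rate well below the original one helps avoid destroying the
# pretrained weights.
model.compile(optimizer=optimizers.Adam(learning_rate=1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(new_images, new_labels, epochs=5)
```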

Related

Machine learning with my car dataset

I’m very new to machine learning.
I have a dataset given to me by an F1 racing game. A user plays this game, which generates this dataset.
Using machine learning, I have to work with this data so that, when a user plays the game (I know there are 10 of them), I can recognize who is playing.
The data consists of datagram packets arriving at a frequency of 10 per second; each packet contains the following fields: time, lap time, lap distance, total distance, speed, car position, traction control, last lap time, fuel, gear, ...
I've thought about using k-means in a supervised way.
Which algorithm could be better?
The task calls for multiclass classification. The very first step in any machine learning activity is to define a score metric (https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/). That allows you to compare models against each other and decide which is better. Then build a baseline model with random forest and/or logistic regression, as suggested in the other answer - they perform well out of the box. Then try to play with the features and understand which of them are more informative. And don't forget about visualizations - they give many hints for data wrangling, etc.
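A minimal sketch of that workflow, assuming scikit-learn; the generated data stands in for your per-packet feature matrix (speed, gear, fuel, ...) with the driver id as the label:

```python
# Define a score metric first, then compare out-of-the-box baselines.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=12, n_informative=8,
                           n_classes=10, random_state=0)  # placeholder data

for model in (RandomForestClassifier(n_estimators=200, random_state=0),
              LogisticRegression(max_iter=1000)):
    # Macro-averaged F1 is a reasonable first choice for 10 roughly balanced
    # classes; pick whatever matches your actual goal.
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    print(type(model).__name__, scores.mean())
```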
This is a somewhat broad question, so I'll try my best.
k-means is an unsupervised algorithm, meaning it finds the classes itself; it is best used when you know there are multiple classes but you don't know exactly what they are. Using it with labeled data just means you compute the distance of a new vector v to each vector in the dataset and pick the one (or, with a majority vote, the ones) giving the minimum distance - this is essentially a nearest-neighbour lookup rather than a learned model.
In this case, where you do have the labels, a supervised approach will yield much better results.
I suggest trying random forest and logistic regression first; those are the most basic and common algorithms, and they give pretty good results.
If you haven't achieved the desired accuracy with those, you can use deep learning and build a neural network whose input layer is as big as your packet's list of values and whose output layer has one node per class; in between you can use one or more hidden layers with various numbers of nodes. But this is a more advanced approach, and you should pick up some experience in the machine learning field before pursuing it. A sketch of such a network follows.
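A minimal sketch of the network described above, assuming Keras; the number of packet values and the hidden-layer size are assumptions:

```python
# Input as wide as one packet's values, one output per driver.
from tensorflow.keras import layers, models

n_packet_values = 12   # time, lap time, speed, gear, fuel, ... (assumption)
n_drivers = 10

model = models.Sequential([
    layers.Input(shape=(n_packet_values,)),
    layers.Dense(64, activation="relu"),            # one or more hidden layers
    layers.Dense(n_drivers, activation="softmax"),  # one node per class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=20, validation_split=0.2)
```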
Note: the data is a time series, and every driver has their own driving behaviour, so the data should be treated as chunks of consecutive points. With that view you can apply pattern-matching techniques; there are also several neural network architectures built exactly for this kind of data (such as RNNs), but that is far more advanced and much more difficult to implement.

Can a neural network be trained while it changes in size?

Are there known methods of continuous training and graceful degradation of a neural net while it shrinks or grows in size (by number of nodes, connections, whatever)?
To the best of my memory, everything I've read about neural networks is from a static perspective. You define the net and then train it.
If there is some neural network X with N nodes (neurons, whatever), is it possible to train the network (X) so that while N increases or decreases, the network is still useful and capable of performing?
In general, changing the network architecture (adding new layers, adding more neurons to existing layers) once the network has already been trained makes sense and is a rather common operation in the deep learning domain. One example is dropout: during training, half of the neurons are randomly switched off completely and only the remaining half participates in training during a specific iteration (each iteration, or 'epoch' as it is often called, has a different random list of switched-off neurons). Another example is transfer learning, where you train a network on one set of input data, cut off part of the outgoing layers, replace them with new layers, and re-train the model on another dataset.
To better explain why this makes sense, let's step back for a moment. In deep networks with many hidden layers, each layer learns some abstraction from the incoming data. Each additional layer uses the abstract representations learned by the previous layer and builds upon them, combining such abstractions to form a higher level of data representation. For instance, you could be trying to classify images with a DNN. The first layer will learn rather simple concepts from the images, like edges or points. The next layer could combine these simple concepts to learn primitives, like triangles, circles or squares. The next layer could drive it further and combine these primitives to represent objects that can be found in images, like 'a car' or 'a house', and, using softmax, it calculates the probabilities of the answer you are looking for (what to actually output). I should mention that these learned representations can actually be checked: you can visualize the activations of your hidden layers and see what they learned. For example, this was done in Google's 'Inceptionism' project. With that in mind, let's get back to what I mentioned earlier.
Dropout is used to improve the generalization of the network. It forces each neuron 'not to be so sure' that some pieces of information from the previous layer will be available, and makes it try to learn representations that also rely on the less favorable and less informative pieces of the previous layer's abstractions. It forces the neuron to consider all of the representations from the previous layer when making its decision, instead of putting all of its weight onto the couple of neurons it 'likes most of all'. As a result, the network is usually better prepared for new data whose inputs differ from the training set.
Q: "As far as you're aware is the quality of the stored knowledge (whatever training has done to the net) still usable following the dropout? Maybe random halves could be substituted by random 10ths with a single 10th dropping, that might result in less knowledge loss during the transition period."
A: Unfortunately I can't properly explain why precisely half of the neurons are switched off and not 10% (or any other number). Maybe there is an explanation, but I haven't seen it. In general it just works, and that's it.
Also, I should mention that the job of dropout is to ensure that each neuron doesn't rely on just a few of the neurons from the previous layer and is ready to make a decision even when the neurons which usually helped it to make the correct decision are not available. It is used for generalization only and helps the network cope better with data it hasn't seen previously; nothing else is achieved with dropout.
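A minimal sketch of dropout in practice, assuming Keras; the layer sizes are illustrative, and rate=0.5 matches the "half the neurons" convention discussed above:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),   # each training step randomly zeroes 50% of these activations
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
# At inference time Keras disables dropout automatically (activations are
# rescaled during training instead), so the full network makes predictions.
```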
Now let's consider transfer learning again. Say you have a network with 4 layers. You train it to recognize specific objects in pictures (cat, dog, table, car, etc.). Then you cut off the last layer, replace it with three additional layers, and train the resulting 6-layer network on a dataset which, for instance, produces short sentences about what is shown in the image ('a cat is on the car', 'house with windows and a tree nearby', etc.). What did we do with this operation? Our original 4-layer network was capable of telling whether some specific object is in the image we feed it. Its first 3 layers learned good representations of images: the first layer learned about edges, points and other extremely primitive geometric shapes in images; the second layer learned some more elaborate geometric figures like circles or squares; the last one knows how to combine them to form higher-level objects such as 'car', 'cat', 'house'. Now we can re-use this good representation in a different domain and just add several more layers. Each of them will use the abstractions from the last (3rd) layer of the original network and learn how to combine them to create meaningful descriptions of images. While you perform learning on the new dataset, with images as input and sentences as output, the first 3 layers taken from the original network will be adjusted too, but these adjustments will be mostly minor, while the 3 new layers will be adjusted significantly. What we achieve with transfer learning is:
1) We can learn much better data representations. We can create a network which is very good at a specific task and then build upon that network to perform something different.
2) We can save training time: the first layers of the network will already be trained well enough that the layers closer to the output receive a rather good data representation from the start, so training should finish much faster with pre-trained first layers.
So the bottom line is that pre-training a network and then re-using part or all of it in another network makes perfect sense and is not uncommon. A sketch of this kind of layer surgery follows.
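A minimal sketch of that layer surgery, assuming the Keras functional API; the dense layers stand in for the convolutional stacks of a real image network, and all sizes and names are illustrative:

```python
from tensorflow.keras import layers, models

# Pretend this 4-layer network was already trained on object recognition.
inp = layers.Input(shape=(256,))
h = layers.Dense(128, activation="relu", name="layer1")(inp)
h = layers.Dense(128, activation="relu", name="layer2")(h)
h = layers.Dense(128, activation="relu", name="layer3")(h)
out = layers.Dense(20, activation="softmax", name="old_head")(h)
old_net = models.Model(inp, out)

# Cut off the old head after layer3 and graft three new layers for the
# new task onto the learned representations.
x = old_net.get_layer("layer3").output
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(128, activation="relu")(x)
new_out = layers.Dense(50, activation="softmax")(x)
new_net = models.Model(old_net.input, new_out)
# Training new_net adjusts the new head strongly, while the reused layers
# change only slightly (especially with a small learning rate).
```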
This is something I have seen in the likes of this video...
https://youtu.be/qv6UVOQ0F44
There are links to further resources in the video description.
It is based on a process called NEAT: NeuroEvolution of Augmenting Topologies.
NEAT uses a genetic algorithm and an evolutionary process to design and evolve a neural net from scratch, with no prior assumptions about the structure or complexity of the net.
I believe this is what you are looking for.

How to discover new classes in a classification machine learning algorithm?

I'm using a multiclass classifier (a Support Vector Machine, via One-Vs-All) to classify data samples. Let's say I currently have n distinct classes.
However, in the scenario I'm facing, it is possible that a new data sample may belong to a new class n+1 that hasn't been seen before.
So I guess you can say that I need a form of Online Learning, as there is no distinct training set in the beginning that suits all data appearing later. Instead I need the SVM to adapt dynamically to new classes that may appear in the future.
So I'm wondering about if and how I can...
identify that a new data sample does not quite fit into the existing classes but instead should result in creating a new class.
integrate that new class into the existing classifier.
I can vaguely think of a few ideas that might be approaches to solve this problem:
If none of the binary SVM classifiers (as I have one for each class in the OVA case) predicts a fairly high probability (e.g. > 0.5) for the new data sample, I could assume that this new data sample may represent a new class.
I could train a new binary classifier for that new class and add it to the multiclass SVM.
However, these are just my naive thoughts. I'm wondering if there is some "proper" approach for this instead, e.g. using a clustering algorithm to find all the classes.
Or maybe my approach of trying to use an SVM for this is not even appropriate for this kind of problem?
Help on this is greatly appreciated.
As in any other machine learning problem, if you do not have a quality criterion, you suck.
When people say "classification", they have supervised learning in mind: there is some ground truth against which you can train and check your algorithms. If new classes can appear, this ground truth is ambiguous. Imagine one class is "horse", and you see many horses: black horses, brown horses, even white ones. And suddenly you see a zebra. Whoa! Is it a new class or just an unusual horse? The answer will depend on how you are going to use your class labels. The SVM itself cannot decide, because SVM does not use these labels, it only produces them. The decision is up to a human (or to some decision-making algorithm which knows what is "good" and "bad", that is, has its own "loss function" or "utility function").
So you need a supervisor. But how can you assist this supervisor? Two options come to mind; a small sketch of both follows below:
Anomaly detection. This can help you with early occurrences of new classes. After the very first zebra, your algorithm can raise an alarm: "There is something unusual!" In sklearn, for example, various algorithms from isolation forest to one-class SVM can be used to detect unusual observations. Your supervisor can then look at them and decide whether they deserve to form an entirely new class.
Clustering. This can help you make decisions about splitting your classes. For example, after the first zebra you decided it was not worth creating a new class. But over time, your algorithm has accumulated dozens of zebra images. If you run a clustering algorithm on all the observations labeled "horse", you might end up with two well-separated clusters. It will again be up to the supervisor to decide whether the striped horses should be detached from the plain ones into a new class.
If you want this decision to be made purely automatically, you can split classes when the ratio of within-cluster mean distance to between-cluster distance is low enough. But this will work well only if you have a good distance metric in the first place, and what counts as "good" is again defined by how you use your algorithms and what your ultimate goal is.
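A minimal sketch of both options, assuming scikit-learn; the Gaussian blobs stand in for "horses" and "zebras", and isolation forest is just one possible detector:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_horses = rng.normal(0, 1, size=(200, 5))      # samples labeled "horse"
X_new = np.vstack([rng.normal(0, 1, (5, 5)),    # more ordinary horses
                   rng.normal(6, 1, (5, 5))])   # five "zebras"

# 1) Anomaly detection: flag observations unlike the known class so a
# supervisor can inspect them (-1 = anomaly, 1 = normal).
detector = IsolationForest(random_state=0).fit(X_horses)
print(detector.predict(X_new))

# 2) Clustering: check whether the accumulated "horse" samples actually
# fall into two well-separated groups worth splitting.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(np.vstack([X_horses, X_new]))
print(np.bincount(km.labels_))
```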

Advantages of RNN over DNN in prediction

I am going to work on a problem that needs to be addressed with either an RNN or a deep neural net. In general, the problem is predicting financial values. Because I am given a sequence of financial data as input, I thought an RNN would be better. On the other hand, I think that if I can fit the data into some fixed structure, I can train a DNN much more easily, because the training phase is easier for a DNN than for an RNN. For example, I could take the last month's info, keep 30 inputs, and predict the 31st day with a DNN.
I don't understand the advantage of an RNN over a DNN from this perspective. My first question is about the proper usage of an RNN or a DNN for this problem.
My second question is somewhat basic. While training an RNN, isn't it possible for the network to get "confused"? I mean, consider the input 10101111, read as 2-step sequences (1-0, 1-0, 1-1, 1-1). Here a 1 is followed by a 0 several times, and then at the end a 1 is followed by a 1. Wouldn't this become a major problem during training? That is, why does the system not get confused while training on this sequence?
I think your question is phrased a bit problematically. First, DNNs are a class of architectures: a convolutional neural network differs greatly from a deep belief network or a simple deep MLP. There are feed-forward architectures (e.g. TDNNs) fit for time-series prediction, but it depends on whether you're more interested in research or just in solving your problem.
Second, RNNs are as "deep" as it gets. Consider the most basic RNN, the Elman network: during training with backpropagation through time (BPTT), it is unfolded in time, backpropagating over T timesteps. Since this backpropagation is done not only vertically, as in a standard DNN, but also horizontally over T-1 context layers, the past activations of the hidden layers from the preceding T-1 timesteps are actually taken into account for the activation at the current timestep. (The original post included an illustration of such an unfolded net, which is not reproduced here.)
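For reference, the standard Elman update (my notation, not taken from the original post) is what BPTT unfolds over those T timesteps:

```latex
h_t = \sigma\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right), \qquad
y_t = \operatorname{softmax}\left(W_{hy}\, h_t + b_y\right)
```

The $W_{hh}\, h_{t-1}$ term is the horizontal, recurrent connection: the hidden state at time t depends on the hidden state one step earlier, and unfolding this recursion over T steps yields the T-1 context layers mentioned above.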
This is what makes RNNs so powerful for time-series prediction (and it should answer both your questions). If you have more questions, read about Elman networks; LSTMs etc. will only confuse you. Understanding Elman networks and BPTT is the foundation needed to understand any other RNN.
One last thing you'll need to look out for: the vanishing gradient problem. While it's tempting to set T = infinity and give our RNN as much memory as possible, it doesn't work that way. There are many ways of working around this problem; LSTMs are quite popular at the moment, and there are even some proper LSTM implementations around nowadays. But it's important to know that even a basic Elman network could really struggle with T = 30.
As you answered yourself, RNNs are for sequences. If the data has a sequential nature (e.g. a time series), then it is preferable to use such a model over a DNN or other "static" models. The main reason is that an RNN can model the process that generates each sequence; so, for example, given the sequences
0011100
0111000
0001110
an RNN will be able to build a model along the lines of "after seeing a 1, I will see two more" and correctly build a prediction when seeing
0000001**** -> 0000001110
Meanwhile, for a DNN (and other non-sequential models) there is no relation between these three sequences; in fact, the only thing they have in common is that "there is a 1 in the fourth position, so I guess it is always like that".
Regarding the second question: why won't it get confused? Because it models sequences, because it has memory. It makes its decisions based on everything that was observed before, and, assuming your signal has any kind of regularity, there is always some event in the past that differentiates between two possible continuations of the signal. Once again, such phenomena are much better addressed by RNNs than by non-recurrent models. See, for example, natural language and the enormous progress made by LSTM-based models in recent years. A small sketch of such next-step prediction follows.
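A minimal sketch of this kind of next-step prediction, assuming Keras; SimpleRNN is an Elman-style layer trained with BPTT as described in the other answer, and convergence on such a tiny dataset is not guaranteed:

```python
import numpy as np
from tensorflow.keras import layers, models

# Next-step prediction: the input is each sequence, the target is the same
# sequence shifted by one position.
seqs = np.array([[0, 0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1, 0]], dtype="float32")
X = seqs[:, :-1, None]   # shape (batch, timesteps, 1)
y = seqs[:, 1:, None]    # the next bit at every timestep

model = models.Sequential([
    layers.Input(shape=(6, 1)),
    layers.SimpleRNN(8, return_sequences=True),  # hidden state carries the memory
    layers.Dense(1, activation="sigmoid"),       # per-timestep prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=500, verbose=0)
print(model.predict(X).round().squeeze())
```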

Model selection with dropout training neural network

I've been studying neural networks for a bit and recently learned about the dropout training algorithm. There are excellent papers out there explaining how it works, including those by its authors.
So I built a neural network with dropout training (it was fairly easy), but I'm a bit confused about how to perform model selection. From what I understand, dropout is a method to be used when training the final model obtained through model selection.
As for the test part, papers always talk about using the complete network with halved weights, but they do not mention how to use it in the training/validation part (at least the ones I read).
I was thinking about using the network without dropout for the model selection part. Say that leads me to find that the net performs well with N neurons. Then, for the final training (the one I use to train the network for the test part), I use 2N neurons with dropout probability p = 0.5. That ensures I have exactly N neurons active on average, thus using the network at the right capacity most of the time.
Is this a correct approach?
By the way, I'm aware of the fact that dropout might not be the best choice with small datasets. The project I'm working on has academic purposes, so it's not strictly necessary that I use the best model for the data, as long as I stick to good machine learning practices.
First of all, model selection and the training of a particular model are completely different issues. For model selection, you usually need a data set that is completely independent of both the training set used to build the model and the test set used to estimate its performance. So if you're doing, for example, cross-validation, you need a nested scheme: an inner cross-validation to train the candidate models and select among them, and an outer cross-validation to estimate the performance of the whole selection procedure.
To see why, consider the following thought experiment (shamelessly stolen from this paper). You have a model that makes completely random predictions. It has a number of parameters that you can set, but they have no effect. If you try different parameter settings long enough, you'll eventually get a model that performs better than all the others, simply because you're sampling from a random distribution. If you use the same data for all of these models, this is the model you will choose. If you have a separate test set, it will quickly tell you that there is no real effect, because the performance of the parameter setting that achieved good results during the model-building phase is no better on the separate set.
Now, back to neural networks with dropout. You didn't refer to any particular paper; I'm assuming you mean Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". I'm not an expert on the subject, but the method seems to me similar to what's used in random forests or bagging: mitigating the flaws an individual learner may exhibit by applying it repeatedly in slightly different contexts. If I understood the method correctly, what you essentially end up with is an average over several possible models, very similar to random forests.
This is a way to make an individual model better, but not a method for model selection. Dropout is a way of adjusting the learned weights of a single neural network model.
To do model selection on top of this, you would need to train and test neural networks with different parameters and then evaluate them on completely different sets of data, as described in the paper I referenced above. A sketch of such a nested scheme follows.
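A minimal sketch of such a nested scheme, assuming scikit-learn; MLPClassifier and its parameter grid are illustrative stand-ins for your dropout networks and their hyper-parameters:

```python
# Nested cross-validation: the inner loop selects the model, the outer loop
# estimates the performance of the selected model on data it was never
# chosen with.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

inner = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid={"hidden_layer_sizes": [(50,), (100,)], "alpha": [1e-4, 1e-2]},
    cv=3,
)
# Each outer fold re-runs the full selection on its own training split.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```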

Resources