I am working on a basic decision-making algorithm: based on the time of a parallel loop iteration, a decision is made to either increase or decrease the number of threads assigned to a process. My initial approach was to take the average time of ten iterations and compare it to the previous (average) time, every 5 seconds. This approach failed... left to itself it would always drive the thread count down to 1.
So I've turned to unsupervised learning, using clustering as a way to decide whether a time x should be classified as: increase, stick with, or decrease the number of threads to assign.
Based on the kind of data I am classifying, I believe k-means is a good starting point for unsupervised learning. Am I on the right track here?
If you have an objective, use supervised learning.
Unsupervised methods cluster according to some internal mathematical criterion, not according to your objective. You have no way to force k-means to group points by that objective (e.g. "increase, stick with, or decrease"); instead, k-means may yield clusters that have no relationship to it at all!
Try labeling some data (which should be fairly easy in retrospect, e.g. "should I have increased the number of threads at t minus 10?") and then training a classifier on that.
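To make that concrete, here is a minimal sketch of the idea; the features (average iteration time, current thread count), the labels, and the choice of a decision tree are all assumptions for illustration, not a prescription.

```python
# Sketch: label past measurements in retrospect and train a classifier on them.
# Feature names and labels are hypothetical; any sklearn classifier would do.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# One row per measurement window: [avg_iteration_time, current_thread_count]
X = np.array([
    [0.42, 2],
    [0.35, 4],
    [0.33, 8],
    [0.51, 16],   # oversubscribed: time went back up
])

# Labels assigned in retrospect: -1 = should have decreased,
# 0 = should have kept, +1 = should have increased the thread count.
y = np.array([1, 1, 0, -1])

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

# At run time, ask the classifier what to do with the latest measurement.
action = clf.predict([[0.45, 8]])[0]
print(action)  # e.g. 1 -> increase threads
```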
I am reading a deep learning with Python book.
After reading chapter 4, Fighting Overfitting, I have two questions.
Why might increasing the number of epochs cause overfitting?
I know increasing the number of epochs will involve more attempts at gradient descent; will this cause overfitting?
During the process of fighting overfitting, will the accuracy be reduced?
I'm not sure which book you are reading, so some background information may help before I answer the questions specifically.
Firstly, increasing the number of epochs won't necessarily cause overfitting, but it certainly can do. If the learning rate and model parameters are small, it may take many epochs to cause measurable overfitting. That said, it is common for more training to do so.
To keep the question in perspective, it's important to remember that we most commonly use neural networks to build models we can use for prediction (e.g. predicting whether an image contains a particular object or what the value of a variable will be in the next time step).
We build the model by iteratively adjusting weights and biases so that the network can act as a function to translate between input data and predicted outputs. We turn to such models for a number of reasons, often because we just don't know what the function is/should be or the function is too complex to develop analytically. In order for the network to be able to model such complex functions, it must be capable of being highly-complex itself. Whilst this complexity is powerful, it is dangerous! The model can become so complex that it can effectively remember the training data very precisely but then fail to act as an effective, general function that works for data outside of the training set. I.e. it can overfit.
You can think of it as being a bit like someone (the model) who learns to bake by only baking fruit cake (training data) over and over again – soon they'll be able to bake an excellent fruit cake without using a recipe (training), but they probably won't be able to bake a sponge cake (unseen data) very well.
Back to neural networks! Because the risk of overfitting is high with a neural network there are many tools and tricks available to the deep learning engineer to prevent overfitting, such as the use of dropout. These tools and tricks are collectively known as 'regularisation'.
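For instance, in Keras dropout is just another layer. A minimal sketch (the layer sizes and input shape are made up for the example):

```python
# Minimal Keras sketch of dropout as a regularisation tool (sizes are arbitrary).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(100,)),
    layers.Dropout(0.5),   # randomly zero out 50% of activations during training
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
```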
This is why we use development and training strategies involving test datasets – we pretend that the test data is unseen and monitor it during training. You can see an example of this in the plot below (image credit). After about 50 epochs the test error begins to increase as the model has started to 'memorise the training set', despite the training error remaining at its minimum value (often training error will continue to improve).
So, to answer your questions:
Allowing the model to continue training (i.e. more epochs) increases the risk of the weights and biases being tuned to such an extent that the model performs poorly on unseen (or test/validation) data. The model is now just 'memorising the training set'.
Continued epochs may well increase training accuracy, but this doesn't necessarily mean the model's predictions from new data will be accurate – often it actually gets worse. To prevent this, we use a test data set and monitor the test accuracy during training. This allows us to make a more informed decision on whether the model is becoming more accurate for unseen data.
We can use a technique called early stopping, whereby we stop training the model once test accuracy has stopped improving after a small number of epochs. Early stopping can be thought of as another regularisation technique.
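In Keras, for instance, early stopping is available as a callback; the patience value below is only an illustration, and the commented-out fit call assumes you already have training and validation arrays.

```python
# Early stopping sketch: stop once validation loss has not improved for `patience` epochs.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# history = model.fit(x_train, y_train,
#                     epochs=200,
#                     validation_data=(x_val, y_val),
#                     callbacks=[early_stop])
```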
More attempts at descent (a large number of epochs) can, ideally, take you very close to the global minimum of the loss function. But since we don't know anything about the test data, fitting the model so precisely to the class labels of the training data may cause it to lose its generalisation capability (its error on unseen data grows). No doubt we want to learn the input-output relationship from the training data, but we must not forget that the end goal is for the model to perform well on unseen data. So it is a good idea to stay close, but not too close, to the global minimum.
But still, we can ask: what if I do reach the global minimum? What would be the problem with that, and why would it cause the model to perform badly on unseen data?
The answer is that in order to reach the global minimum we would be trying to fit as much of the training data as possible, and this results in a very complex model (since it is unlikely that the particular training data we happen to have has a simple spatial distribution). But we can assume that the much larger body of unseen data (say, for facial recognition) has a simpler spatial distribution and needs a simpler model for good classification. In other words, the entire world of unseen data will have a pattern that we cannot observe simply because we only have access to a small fraction of it in the form of training data.
If you incrementally observe points from a distribution (say 50, 100, 500, 1000, ...), the structure of the data will look complex until you have observed a sufficiently large number of points (at the limit, the entire distribution); once you have observed enough points, you can expect to see the simpler pattern in the data, which can be classified more easily.
In short, a small fraction of the training data will tend to have a complex structure compared to the entire dataset, and overfitting to the training data may cause our model to perform worse on the test data.
An analogous example of this phenomenon from day-to-day life:
Say we have met N people so far in our lifetime. While meeting them we naturally learn from them (we become what we are surrounded with). Now, if we are heavily influenced by each individual and try to tune to the behaviour of all of these people very closely, we develop a personality that closely resembles the people we have met, but on the other hand we start judging every individual who is unlike the people we have already met. Becoming judgemental takes a toll on our ability to tune in with new groups, since we trained very hard to minimise the differences with the people we had already met (the training data). To me this is a good example of overfitting and loss of generalisation capability.
I am currently conducting some analysis using the NTSB aviation accident database. There are cause statements for most of the aviation incidents in this dataset that describe the factors that led to each event.
One of my objectives here is to try to group the causes, and clustering seems to be a feasible way to solve this kind of problem. I performed the following steps prior to the k-means clustering:
Stop-word removal, i.e. removing common functional words from the text
Text stemming, i.e. removing a word's suffix and, if necessary, transforming the term into its simplest form
Vectorising the documents into TF-IDF vectors, to scale up less-common but more-informative words and scale down highly-common but less-informative words
Applying SVD to reduce the dimensionality of the vectors
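In sklearn terms, these steps plus the k-means fit described below might look roughly like this sketch; the placeholder documents and parameter values are purely illustrative, not the actual analysis.

```python
# Rough sklearn sketch of the preprocessing described above plus the k-means fit.
# The documents and parameter values are placeholders, not the real analysis.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

documents = [
    "pilot failed to maintain adequate airspeed",
    "loss of engine power for undetermined reasons",
    "improper fuel management by the pilot",
]

# TF-IDF vectorisation (stop-word removal folded in; stemming would happen before this).
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(documents)

# SVD (LSA) to reduce dimensionality, then re-normalise the reduced vectors.
lsa = make_pipeline(TruncatedSVD(n_components=2), Normalizer(copy=False))
X_reduced = lsa.fit_transform(X)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_reduced)
print(km.labels_)
```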
After these steps, k-means clustering is applied to the vectors. Using the events that occurred from Jan 1985 to Dec 1990, I get the following result with the number of clusters k = 3:
(Note: I am using Python and sklearn to work on my analysis)
... some output omitted ...
Clustering sparse data with KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=3, n_init=1,
n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
verbose=True)
Initialization complete
Iteration 0, inertia 8449.657
Iteration 1, inertia 4640.331
Iteration 2, inertia 4590.204
Iteration 3, inertia 4562.378
Iteration 4, inertia 4554.392
Iteration 5, inertia 4548.837
Iteration 6, inertia 4541.422
Iteration 7, inertia 4538.966
Iteration 8, inertia 4538.545
Iteration 9, inertia 4538.392
Iteration 10, inertia 4538.328
Iteration 11, inertia 4538.310
Iteration 12, inertia 4538.290
Iteration 13, inertia 4538.280
Iteration 14, inertia 4538.275
Iteration 15, inertia 4538.271
Converged at iteration 15
Silhouette Coefficient: 0.037
Top terms per cluster:
**Cluster 0: fuel engin power loss undetermin exhaust reason failur pilot land**
**Cluster 1: pilot failur factor land condit improp accid flight contribute inadequ**
**Cluster 2: control maintain pilot failur direct aircraft airspe stall land adequ**
and I generated a plot of the data as follows:
The result doesn't seem to make sense to me. I wonder why all of the clusters contain common terms like "pilot" and "failure".
One possibility I can think of (though I am not sure it is valid in this case) is that documents with these common terms are located at the very centre of the plot, and therefore cannot be efficiently assigned to the right cluster. I believe this problem cannot be addressed by increasing the number of clusters, as I have just tried that and the problem persists.
I just want to know if there are any other factors that could cause the scenario I am facing. Or, more broadly, am I using the right clustering algorithm?
Thanks SO.
I do not want to be a carrier of bad news, but ...
Clustering is a very bad exploration technique, mostly because without a clear, task-oriented aim, clustering techniques actually focus on optimisation of some mathematical criterion, which rarely has anything to do with what you want to achieve. K-means in particular looks for a minimisation of the (squared) Euclidean distances from cluster centres to all points inside each cluster. Is this in any way related to the task you want to achieve? Usually the answer is "no", or in the best case "I have no idea".
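Concretely, the criterion k-means optimises is the within-cluster sum of squared distances,

$$\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,$$

where $\mu_j$ is the centre of cluster $C_j$; nothing in this objective knows anything about accident causes.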
Representing documents as bags of words gives a very general view of your data, so it is not a good approach for distinguishing between similar objects. Such an approach can distinguish texts about guns from texts about hockey, but not specialised texts from the very same domain (which seems to be the case here).
In the end, you cannot really evaluate a clustering, and this is the biggest issue. Consequently there are no well-established techniques for fitting the best clustering.
So, to answer your final questions
I just want to know if there is any other factors that could cause the scenario that I am facing?
There are thousands of such factors. Finding actual clusters that are reasonable from a human perspective is extremely hard. Finding any clusters is extremely simple, because every clustering technique will find something. But in order to find what is important here, one would have to go through a full exploration of the data.
Or more broadly, am I using the right clustering algorithm?
Probably not, as k-means is simply a method of minimising the within-cluster sum of squared Euclidean distances, so it will not work well in most real-world scenarios.
Unfortunately, this is not the kind of problem where you can just ask "which algorithm should I use?" and someone will offer you an exact solution.
You have to dig into your data and figure out:
the representation: is TF-IDF really good? Have you preprocessed the vocabulary? Removed meaningless words? Maybe it is worth considering some modern word/document representation learning?
the structure in your data: in order to find the best model you should visualise your data, investigate it, run statistical analyses, and try to figure out what the underlying metric is. Is there any reasonable distribution of points? Are these Gaussians? Gaussian mixtures? Is your data sparse?
whether you can provide some expert knowledge: maybe you can label part of the dataset yourself? Semi-supervised techniques are much better defined than any unsupervised ones, so you might easily get much better results.
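To illustrate that last point, one possible semi-supervised route in sklearn is label spreading; everything in this sketch (the random feature matrix, the handful of hand-assigned labels, the kernel settings) is hypothetical.

```python
# Sketch: semi-supervised labelling from a few hand-labelled documents.
# X stands in for the SVD-reduced TF-IDF vectors; -1 marks unlabelled rows.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

X = np.random.RandomState(0).rand(100, 10)   # placeholder feature matrix
y = np.full(100, -1)                          # -1 = unlabelled
y[:5] = [0, 0, 1, 1, 2]                       # a handful of hand-assigned cause categories

model = LabelSpreading(kernel='knn', n_neighbors=7)
model.fit(X, y)
print(model.transduction_[:10])               # inferred labels for the first few documents
```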
I want my neural network to be trained on every new data point that it classifies incorrectly. Assuming that I somehow label the data correctly every time the network makes a mistake, how many backpropagation passes do I need to run on this single instance of new data in order to train my network for that particular case? Is there a better way to train a neural network on real-time scenarios?
It depends on the optimization algorithm you use. The backpropagation by itself calculates only the gradient, which is used by the next iteration of the algorithm.
In the simplest case you can use a self-developed gradient descent and check the behavior of your cost function. If the cost function decreases less than some threshold epsilon, you might break the optimization loop for the current instance. You can also limit the maximum number of iterations.
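A rough sketch of that stopping logic, with placeholder cost and gradient functions standing in for whatever your network actually computes:

```python
# Sketch of a gradient-descent loop that stops when the cost improvement
# falls below epsilon or a maximum number of iterations is reached.
import numpy as np

def cost(w, x, y):                 # placeholder cost: squared error of a linear model
    return float(np.mean((x @ w - y) ** 2))

def gradient(w, x, y):             # its gradient
    return 2 * x.T @ (x @ w - y) / len(y)

def train_on_instance(w, x, y, lr=0.01, epsilon=1e-6, max_iters=1000):
    prev = cost(w, x, y)
    for _ in range(max_iters):
        w = w - lr * gradient(w, x, y)
        current = cost(w, x, y)
        if prev - current < epsilon:   # improvement too small -> stop
            break
        prev = current
    return w

# Usage: a single new (relabelled) instance treated as a one-row batch.
x_new = np.array([[1.0, 2.0, 3.0]])
y_new = np.array([1.0])
w = train_on_instance(np.zeros(3), x_new, y_new)
```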
It is worth using an advanced optimizer such as fminunc in MATLAB, which will stop by itself when it reaches an optimum.
You may find this post about different termination conditions of gradient descent very useful.
I think learning from only one single instance is not really efficient; the cost function can behave erratically. You may consider batch learning instead, where you learn using small batches of new instances. It should make the learning more stable.
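One way to realise this, sketched under the assumption that misclassified instances are collected in a small buffer and that your model exposes some incremental-update call:

```python
# Sketch: collect hard (misclassified) instances in a small buffer and update
# the model once the buffer forms a mini-batch, instead of after every mistake.
import numpy as np

buffer_x, buffer_y = [], []
BATCH_SIZE = 32

def on_misclassified(x, y_true, model):
    """Call this whenever the network gets an instance wrong (after relabelling)."""
    buffer_x.append(x)
    buffer_y.append(y_true)
    if len(buffer_x) >= BATCH_SIZE:
        X = np.asarray(buffer_x)
        y = np.asarray(buffer_y)
        model.partial_fit(X, y)   # assumes an incremental-update API such as sklearn's
                                  # partial_fit; a few gradient steps on the batch
                                  # would play the same role
        buffer_x.clear()
        buffer_y.clear()
```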
In order to illustrate how a network's accuracy depends on the iteration count and on the batch size, I experimented a bit with a neural network used to recognise handwritten digits. I had 4000 examples in the training set and 1000 examples in the validation set. Then I ran the learning algorithm with different parameters and measured the resulting accuracy. You can see the result here:
Of course this plot describes only my particular case, but you can get some intuition on what to expect and on how to validate network parameters.
I'm new to Artificial Neural Networks and NeuroEvolution algorithms in general. I'm trying to implement the algorithm called NEAT (NeuroEvolution of Augmenting Topologies), but the description in the original paper misses the method of how to evolve the weights of a network; it says
Connection weights mutate as in any NE system, with each connection either perturbed or not at each generation
I've done some searching about how to mutate weights in NE systems, but can't find any detailed description, unfortunately.
I know that while training a neural network, usually the backpropagation algorithm is used to correct the weights, but it only works if you have a fixed topology (structure) through generations and you know the answer to the problem. In NeuroEvolution, you don't know the answer, you have only the fitness function, so it's not possible to use backpropagation here.
I have some experience with training a fixed-topology NN using a genetic algorithm (What the paper refers to as the "traditional NE approach"). There are several different mutation and reproduction operators we used for this and we selected those randomly.
Given two parents, our reproduction operators (could also call these crossover operators) included:
Swap either single weights or all weights for a given neuron in the network. For example, given two parents selected for reproduction, either choose a particular weight in the network and swap the value (for our swaps we produced two offspring and then chose the one with the better fitness to survive into the next generation of the population), or choose a particular neuron in the network and swap all the weights for that neuron to produce two offspring.
Swap an entire layer's weights. So given parents A and B, choose a particular layer (the same layer in both) and swap all the weights between them to produce two offspring. This is a large move, so we set it up so that this operation would be selected less often than the others. Also, this may not make sense if your network only has a few layers.
Our mutation operators operated on a single network and would select a random weight and either do one of the following (a rough sketch follows this list):
completely replace it with a new random value
change the weight by some percentage, i.e. multiply it by a random number between 0 and 2 (practically speaking we would constrain that a bit and multiply by a random number between 0.5 and 1.5). This has the effect of scaling the weight so that it doesn't change as radically. You could also do this kind of operation by scaling all the weights of a particular neuron.
add or subtract a random number between 0 and 1 to/from the weight
change the sign of the weight
swap weights on a single neuron
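Here is a rough NumPy sketch of these operators, treating the network simply as a flat vector of weights (an assumption made only for the example; a real genome would carry more structure):

```python
# Rough sketch of the weight-mutation operators described above, applied to a
# network represented as a flat NumPy vector of weights.
import numpy as np

rng = np.random.default_rng()

def mutate(weights, rate=0.1):
    w = weights.copy()
    for i in range(len(w)):
        if rng.random() > rate:
            continue                           # leave most weights untouched
        op = rng.integers(4)
        if op == 0:
            w[i] = rng.uniform(-1.0, 1.0)      # replace with a new random value
        elif op == 1:
            w[i] *= rng.uniform(0.5, 1.5)      # scale by a constrained random factor
        elif op == 2:
            w[i] += rng.uniform(-1.0, 1.0)     # add/subtract a small random amount
        else:
            w[i] = -w[i]                       # flip the sign
    return w

parent = rng.normal(size=20)
child = mutate(parent)
```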
You can certainly get creative with mutation operators, you may discover something that works better for your particular problem.
IIRC, we would choose two parents from the population based on random proportional selection, run mutation operations on each of them, then run these mutated parents through the reproduction operation, and finally run the two offspring through the fitness function to select the fittest one to go into the next generation's population.
Of course, in your case since you're also evolving the topology some of these reproduction operations above won't make much sense because two selected parents could have completely different topologies. In NEAT (as I understand it) you can have connections between non-contiguous layers of the network, so for example you can have a layer 1 neuron feed another in layer 4, instead of feeding directly to layer 2. That makes swapping operations involving all the weights of a neuron more difficult - you could try to choose two neurons in the network that have the same number of weights, or just stick to swapping single weights in the network.
I know that while training a NE, usually the backpropagation algorithm is used to correct the weights
Actually, in NE backprop isn't used. It's the mutations performed by the GA that are training the network as an alternative to backprop. In our case backprop was problematic due to some "unorthodox" additions to the network which I won't go into. However, if backprop had been possible, I would have gone with that. The genetic approach to training NNs definitely seems to proceed much more slowly than backprop probably would have. Also, when using an evolutionary method for adjusting weights of the network, you start needing to tweak various parameters of the GA like crossover and mutation rates.
In NEAT, everything is done through the genetic operators. As you already know, the topology is evolved through crossover and mutation events.
The weights are evolved through mutation events. As in any evolutionary algorithm, there is some probability that a weight is changed randomly (you can either generate a brand-new number or, for example, add a normally distributed random number to the original weight).
Implementing NEAT might seem an easy task, but there are a lot of small details that make it fairly complicated in the end. You might want to look at existing implementations and use one of them, or at least be inspired by them. Everything important can be found at the NEAT Users Page.
My problem is that I have a large unlabeled dataset, but over time I want it to become labeled and build a confident classifier.
This can be done by active learning, but active learning needs an initial classifier to be built for it to then estimate and rank the remaining unlabeled instances by how informative they are expected to be to the classifier.
To build the initial classifier, I need to label some examples by hand. My question is: are there methods to find likely informative examples in the initial unlabeled dataset, without the help of an initial classifier?
I thought about just using k-means with some number of clusters, running it, labeling one example from each cluster, and then training the classifier on these.
Is there a better way?
I have to disagree with Edward Raff.
k-means may turn out to be useful here (if your data is continuous).
Just use a rather large value of k.
The idea is to avoid picking too similar objects, but get a sample that covers the data reasonably well. k-means may fail to "cluster" complex data, but it works reasonably well for quantization. So it will return a "less random, more representative" sample from your data.
But beware: k-means centers do not correspond to data points. You could either use a medoid-based algorithm, or just find the closest instance to each center.
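A small sklearn sketch of that last option, with a random placeholder feature matrix and an arbitrary labeling budget:

```python
# Sketch: run k-means, then label the real instance closest to each cluster center.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

X = np.random.RandomState(0).rand(500, 8)   # placeholder for your feature matrix
k = 20                                      # number of examples you can afford to label

km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Index of the real data point nearest to each center.
closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
to_label = X[closest]                       # hand these instances to the annotator
```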
Some alternatives:
if you can afford to label "a" objects, run k-means with k=a
run k-means with k=5*a, and select 20% of the centers (maybe preferring those with highest density)
choose 0.5*a by k-means, 0.5*a randomly
do either, but choose only 0.5*a objects to label. Train a classifier, find the 0.5*a unlabeled objects that the classifier had the lowest confidence on
No. If you don't have any labeled data, you have no way of determining which points are the most informative. k-means does not necessarily help either, as you don't know where the decision surface lives.
You are overthinking the problem. Just randomly sample some data and get it labeled. Once you have a few hundred to a thousand points labeled, you can start to look at the labeled data and make some decisions about where to head next.