Machine learning algorithms for numerical sequences of variable length - machine-learning

I'm working on a machine learning project at the moment, and my problem is that I have sequences of varying length as input and would like a numerical sequence of the same length as output. Are there machine learning algorithms that can be used for these kinds of problems?
Thanks in advance!

Usually in such cases you bring all your input sequences (more commonly called feature vectors) to the same size; how you do that depends on the nature of your data.
The simplest approach is to zero-pad every vector to the length of the longest one, or the opposite: trim every vector down to the length of the shortest one.
Once all your feature vectors are the same size, you can proceed with any machine learning algorithm out there.
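For example, a minimal numpy sketch of both options (the helper names are just illustrative):

```python
import numpy as np

def pad_sequences(sequences, pad_value=0.0):
    """Zero-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return np.array([list(s) + [pad_value] * (max_len - len(s)) for s in sequences])

def trim_sequences(sequences):
    """Truncate every sequence to the length of the shortest one."""
    min_len = min(len(s) for s in sequences)
    return np.array([list(s)[:min_len] for s in sequences])

X = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
print(pad_sequences(X))   # shape (3, 3): shorter rows padded with 0.0
print(trim_sequences(X))  # shape (3, 1): longer rows cut down
```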

Take a look at Recurrent Neural Networks (RNNs).
A well-known variant of these networks is the LSTM, which can be implemented easily with TensorFlow.
https://en.wikipedia.org/wiki/Long_short-term_memory
https://www.tensorflow.org/versions/r0.11/tutorials/recurrent/index.html
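Roughly, a sequence-to-sequence setup that emits one number per time step might look like the sketch below, written against the current tf.keras API rather than the r0.11 tutorial linked above; the layer sizes and toy data are arbitrary, and padded steps are ignored via a Masking layer:

```python
import numpy as np
import tensorflow as tf

# Toy data: 3 sequences padded with zeros to length 5, one feature per step.
X = np.zeros((3, 5, 1), dtype="float32")
X[0, :5, 0] = [1, 2, 3, 4, 5]
X[1, :3, 0] = [6, 7, 8]
X[2, :2, 0] = [9, 10]
y = 2.0 * X  # dummy target: one output value per input time step

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 1)),                  # variable-length sequences
    tf.keras.layers.Masking(mask_value=0.0),          # skip the padded steps
    tf.keras.layers.LSTM(16, return_sequences=True),  # one hidden state per step
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X).shape)  # (3, 5, 1): an output for every (padded) step
```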

Related

Machine learning with a variable-sized real vector of inputs?

I have a collection of objects with properties that I measure. For each object, I obtain a vector of real numbers describing it. The vector is always incomplete: there are usually numbers missing from the beginning or end of what would be the complete vector, and sometimes information is missing in the middle. Hence, each object results in a vector of a different length. I also measure, say, the mass of each object, and I now want to relate the vector of measured values to the mass.
It's common in my field (astrophysics) to extract features from this vector of real numbers, e.g. take an average or some linear combinations of the values, and then use those extracted features to infer the mass (or whatever) using, for example, neural networks. It was recently shown, however, that a very complex combination of the elements of the vector results in a much better model of the mass.
There are still residuals in this model, however, even when working on simulated data. Presumably then there is a better way out there to manipulate these variable-length vectors in order to get a better model.
I am wondering if it is possible to do machine learning with real-valued input vectors of all different lengths. I know for text mining there are things like the bag-of-words approach, but it is unclear how such a method would work on real-valued vectors. I know recurrent neural networks work on sentences of variable length, but I'm not sure they work for real-valued vectors. I have also considered imputing the missing data; however, sometimes it is missing for physical reasons, i.e. a value in such-and-such place cannot exist, and so imputing it would violate the physicality of the situation.
Is there any research in this area?
Recurrent Neural Networks (RNNs) are capable of taking a variable-sized input vector of length n and producing a variable sized output vector of length m.
There are many ways to make RNNs work. The most common cell types are called Long short-term memory (LSTM) and Gated Recurrent Unit (GRU).
You might want to read:
The Unreasonable Effectiveness of Recurrent Neural Networks: nice for getting an idea of what RNNs are capable of, especially character predictors. It is easy to read, but not exactly what you're searching for.
Understanding LSTM Networks: More technical; very well written
Sepp Hochreiter, Jürgen Schmidhuber: Long Short-Term Memory
RNNs in TensorFlow
However, training RNNs takes a lot of training data. You might be better off computing a fixed-size feature vector from each input instead. But you never know until you try it ;-)
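If you go the fixed-size route, a sketch might look like this (the choice of summary statistics here is purely illustrative):

```python
import numpy as np

def summary_features(values):
    """Collapse a variable-length real-valued vector into fixed-size features."""
    v = np.asarray(values, dtype=float)
    return np.array([v.mean(), v.std(), v.min(), v.max(), float(len(v))])

objects = [[0.1, 0.4, 0.3], [1.2, 0.9, 1.1, 1.0, 0.8], [2.0]]
X = np.vstack([summary_features(v) for v in objects])
print(X.shape)  # (3, 5): every object now has the same number of features
```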

Do all machine learning algorithms use word frequency as a feature?

I use scikit-learn for classification and mainly work with Naive Bayes, SVM, and neural networks, each of which comes in several variants.
I see that the training algorithms create vectors. What do these vectors contain?
Do all of the algorithms use word frequency as a feature? If so, how do they differ?
For text classification you usually create a vector of word frequencies, or tf-idf weights, so that you can compute distances between two documents. You can use all kinds of methods to create these word weights.
The words (features) can be extracted by simply splitting the documents on separators, but you can also use more complex methods such as stemming (keeping only the root of each word).
You will find lots of examples in the scikit-learn documentation, for instance:
http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html
This IPython notebook could be a good start too.
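As a minimal sketch of the word-frequency / tf-idf idea (the toy documents and labels are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["the cat sat on the mat", "dogs chase cats", "stock prices fell sharply"]
labels = ["animals", "animals", "finance"]

# TfidfVectorizer splits each document into words and weights every word by its
# frequency in the document and its rarity across the corpus; that weight vector
# is what the classifier actually sees.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(docs, labels)
print(clf.predict(["cats sat on dogs"]))  # all matching words come from the "animals" docs
```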

How To Fight Randomness Caused By KMeans Clustering

I'm developing an algorithm to classify different types of dogs based off of image data. The steps of the algorithm are:
Go through all training images, detect image features (e.g. SURF), and extract descriptors. Collect all descriptors for all images.
Cluster within the collected image descriptors and find k "words" or centroids within the collection.
Go through all images again, extract SURF descriptors, and match each extracted descriptor with the closest "word" found via clustering.
Represent each image as a histogram of the words found in clustering.
Feed these image representations (feature vectors) to a classifier and train...
Now, I have run into a bit of a problem. Finding the "words" within the collection of image descriptors is a very important step. Due to the random nature of clustering, different clusters are found each time I run my program. The unfortunate result is that sometimes the accuracy of my classifier will be very good, and other times, very bad. I have chalked this up to the clustering algorithm finding "good" words sometimes, and "bad" words other times.
Does anyone know how I can prevent the clustering algorithm from finding "bad" words? Currently I just cluster several times and take the mean accuracy of my classifier, but there must be a better way.
Thanks for taking time to read through this, and thank you for your help!
EDIT:
I am not using KMeans for classification; I am using a Support Vector Machine for classification. I am using KMeans for finding image descriptor "words", and then using these words to create histograms which describe each image. These histograms serve as feature vectors that are fed to the Support Vector Machine for classification.
There are many possible ways of making the clustering repeatable:
The most basic way of dealing with k-means randomness is simply running it multiple times and selecting the best run (the one that minimizes the within-cluster distances / maximizes the between-cluster distances), as in the sketch after this list.
You can use a fixed initialization for your data instead of a random one; there are many heuristics for seeding k-means. Or at least reduce the variance by using an algorithm such as k-means++.
Use a modification of k-means that guarantees a global minimum of a regularized objective, e.g. convex k-means.
Use a different, deterministic clustering method, e.g. Data Nets.
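A sketch of the first two points using scikit-learn (random placeholder descriptors stand in for the pooled SURF descriptors):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the pooled SURF descriptors from all training images
# (rows = descriptors, columns = descriptor dimensions).
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(1000, 64))

# k-means++ initialisation, 20 random restarts, keep the run with the lowest
# inertia (within-cluster sum of squares); fixing random_state makes the chosen
# dictionary reproducible from run to run.
kmeans = KMeans(n_clusters=50, init="k-means++", n_init=20, random_state=42)
kmeans.fit(descriptors)

def image_histogram(image_descriptors, kmeans_model):
    """Represent one image as a normalised histogram of visual words."""
    words = kmeans_model.predict(image_descriptors)
    hist = np.bincount(words, minlength=kmeans_model.n_clusters).astype(float)
    return hist / hist.sum()

print(image_histogram(rng.normal(size=(30, 64)), kmeans).shape)  # (50,)
```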
I would offer two possible suggestions, in addition to those provided.
K-means optimises an objective related to the distance between cluster points and their centroids. You care about classification accuracy. Depending on the computational cost, a simple brute-force approach is to induce multiple clusterings on a subset of your training data, and evaluate the performance of each on some held-out development set for the task you care about. Then use the highest performing variant as the final model. I don't like the use of non-random initialisation because this is only a solution to avoid the randomness, not find the true global minimum of the objective, and your chosen initialisation may be useless and just produce consistently bad classifiers.
The other approach, which is much harder, is to view the k-means step as a dimensionality reduction to enable classification, and incorporate this into the classifier directly. If you use a deep neural net, the layer(s) closest to the input are essentially dimensionality reducers in the same way as the k-means clustering you induce: the difference is their weights are set wrt the error of the net on the classification problem, rather than some unrelated intermediate step. The downside is that this is much closer to a current research problem: training deep nets is hard. You could start with a standard one-hidden-layer architecture (with binary activations on the hidden layer, and using cross-entropy loss on the output layer with outputs coded as one-of-n categories), and attempt to add layers incrementally, but as far as I'm aware standard training algorithms start to behave poorly beyond the single hidden layer, so you'd need to investigate layer-wise training to initialise, or some of the Hessian-Free stuff coming out of Geoff Hinton's group in Toronto.
This is actually an important problem with the bag-of-words approach, and it deserves to be stated prominently: SIFT-style descriptor data may actually not have meaningful k-means clusters at all. However, due to the nature of the algorithm, k-means will always produce k clusters. One of the things to test with k-means is whether the results are stable; if you get a completely different result each time, the clusters are not much better than random.
Nevertheless, if you just want working results, you can fix the dictionary once and choose one that works well.
Or you might look into more advanced clustering methods (in particular, ones that are more robust with respect to noise).

Machine Learning: Unsupervised Backpropagation

I'm having trouble with some of the concepts in machine learning through neural networks. One of them is backpropagation. In the weight updating equation,
delta_w = a*(t - y)*g'(h)*x
t is the "target output", which would be your class label, or something, in the case of supervised learning. But what would the "target output" be for unsupervised learning?
Can someone kindly provide an example of how you'd use BP in unsupervised learning, specifically for clustering or classification?
Thanks in advance.
The most common thing to do is train an autoencoder, where the desired outputs are equal to the inputs. This makes the network try to learn a representation that best "compresses" the input distribution.
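For instance, here is a bare-bones numpy sketch of that idea (sizes, learning rate, and epoch count are arbitrary): the same backpropagation update as in your question, but with the target output t set to the input x itself.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # unlabeled data
n_hidden, lr = 4, 0.01                  # 10 -> 4 -> 10 "compression"

W1 = rng.normal(scale=0.1, size=(10, n_hidden))
W2 = rng.normal(scale=0.1, size=(n_hidden, 10))

for epoch in range(500):
    H = np.tanh(X @ W1)                 # encoder
    Y = H @ W2                          # decoder (linear output)
    err = Y - X                         # the target t is the input X itself
    # Backpropagate the squared reconstruction error to both weight matrices.
    grad_W2 = H.T @ err / len(X)
    grad_H = err @ W2.T * (1 - H ** 2)  # tanh'(h) = 1 - tanh(h)^2
    grad_W1 = X.T @ grad_H / len(X)
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

print(np.mean((X - np.tanh(X @ W1) @ W2) ** 2))  # reconstruction MSE
```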
Here's a patent describing a different approach, where the output labels are assigned randomly and then sometimes flipped based on convergence rates. It seems weird to me, but okay.
I'm not familiar with other methods that use backpropagation for clustering or other unsupervised tasks. Clustering approaches with ANNs seem to use other algorithms (example 1, example 2).
I'm not sure which unsupervised machine learning algorithm uses backpropagation specifically; if there is one I haven't heard of it. Can you point to an example?
Backpropagation is used to compute the derivatives of the error function, with respect to the weights in the network, for training an artificial neural network. It's named as such because the "errors" are "propagated" through the network "backwards". You need it in this case because the final error with respect to the target depends on a function of functions (of functions ..., depending on how many layers your ANN has). The derivatives then allow you to adjust the weights to reduce the error, tempered by the learning rate (this is gradient descent).
In unsupervised algorithms, you don't need to do this. For example, in k-Means, where you are trying to minimize the mean squared error (MSE), you can minimize the error directly at each step given the assignments; no gradients needed. In other clustering models, such as a mixture of Gaussians, the expectation-maximization (EM) algorithm is much more powerful and accurate than any gradient-descent based method.
What you might be asking is about unsupervised feature learning and deep learning.
Feature learning is the only unsupervised method I can think of with respect to neural networks and their recent variants. (There is a variant, a mixture of RBMs, analogous to a mixture of Gaussians, and you can build many models based on these two.) The two basic models I am familiar with are RBMs (restricted Boltzmann machines) and autoencoders.
Autoencoders (optionally with sparse activations encoded in the optimization function) are just feedforward neural networks that tune their weights so that the output is a reconstruction of the input. Multiple hidden layers can be used, with the weights initialized by greedy layer-wise training for a better starting point. So, to answer the question, the target output is the input itself.
RBMs are stochastic networks, usually interpreted as graphical models with restrictions on the connections. In this setting there is no output layer, and the connection between the input layer and the latent layer is bidirectional, like an undirected graphical model. What the RBM tries to learn is a distribution over the inputs (observed and unobserved variables). Here, too, the answer is that the input is the target.
A mixture of RBMs (analogous to a mixture of Gaussians) can be used for soft clustering, or a KRBM (analogous to k-means) for hard clustering, which in effect amounts to learning multiple non-linear subspaces.
http://deeplearning.net/tutorial/rbm.html
http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
An alternative approach is to use something like generative backpropagation. In this scenario, you train a neural network by updating the weights AND the input values. The given values are used as the output values, since you can compute an error value directly. This approach has been used in dimensionality reduction and matrix completion (missing-value imputation), among other applications. For more information, see non-linear principal component analysis (NLPCA) and unsupervised backpropagation (UBP), which builds on the idea of generative backpropagation; UBP extends NLPCA by introducing a pre-training stage. Implementations of UBP and NLPCA can be found in the Waffles machine learning toolkit, and their documentation can be found under the nlpca command.
To use back-propagation for unsupervised learning, it is merely necessary to set t, the target output, at each stage of the algorithm to the class whose members (before the update) have the smallest average distance to the input. In short, we always train the ANN to place its input into the class whose members are most similar to it. Because this process is sensitive to input scale, it is necessary to first normalize the input data in each dimension, subtracting the average and dividing by the standard deviation for each component, so that distances are computed in a scale-invariant manner.
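That normalization step is just a per-dimension z-score; as a quick sketch, assuming the inputs are the rows of a numpy array:

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [3.0, 220.0]])

# Subtract the per-dimension mean and divide by the per-dimension standard
# deviation so that no single feature dominates the distance computation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))  # approximately [0, 0]
print(X_std.std(axis=0))   # [1, 1]
```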
The advantage to using a back-prop neural network rather than a simple distance from a center definition of the clusters is that neural networks can allow for more complex and irregular boundaries between clusters.

How can HMMs be used for handwriting recognition?

The problem is a bit different from traditional handwriting recognition. I have a dataset consisting of thousands of examples of the following: for one drawn character, I have several sequential (x, y) coordinates recording where the pen was pressed down. So, this is a sequential (temporal) problem.
I want to be able to classify handwritten characters based on this data, and would love to implement HMMs for learning purposes. But, is this the right approach? How can they be used to do this?
I think HMMs can be used for both problems mentioned by #jens. I'm working on online handwriting too, and HMMs are used in many articles. The simplest approach is like this:
Select a feature.
If the selected feature is continuous, convert it to a discrete one.
Choose the HMM parameters: topology and number of states.
Train character models using HMMs, one model per class.
Test using the test set.
For each step:
The simplest feature is the angle of the vector connecting consecutive points; you can use more complicated features, such as the angles of the vectors obtained by the Douglas-Peucker algorithm.
The simplest way to discretize is to use Freeman codes, but clustering algorithms like k-means and GMM can be used too.
Common HMM topologies are ergodic, left-right, Bakis, and linear. The number of states can be found by trial and error, and the HMM parameters can differ for each model. The number of observation symbols is determined by the discretization; observation sequences can have variable length.
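To make the feature and discretization choices concrete, here is a small numpy sketch of the simplest options above: angles between consecutive pen positions, quantized into 8 Freeman-style chain-code directions. The helper names are illustrative; the resulting symbol sequences are what you would train the per-class HMMs on.

```python
import numpy as np

def angle_features(points):
    """Angle of each vector connecting consecutive pen positions, in radians."""
    pts = np.asarray(points, dtype=float)
    d = np.diff(pts, axis=0)              # (dx, dy) displacement per step
    return np.arctan2(d[:, 1], d[:, 0])

def freeman_codes(angles, n_directions=8):
    """Quantize angles into n_directions chain-code symbols (0 = east, CCW)."""
    codes = np.round(angles / (2 * np.pi) * n_directions).astype(int)
    return codes % n_directions

stroke = [(0, 0), (1, 0), (2, 1), (2, 2), (1, 2)]  # toy pen trajectory
print(freeman_codes(angle_features(stroke)))       # [0 1 2 4]
```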
I recommend Kevin Murphy's HMM toolbox.
Good luck.
This problem is actually a mix of two problems:
recognizing one character from your data
recognizing a word from a (noisy) sequence of characters
An HMM is used for finding the most likely sequence of a finite number of discrete states from noisy measurements. This is exactly problem 2, since noisy measurements of discrete states (a-z, 0-9) follow each other in a sequence.
For problem 1, an HMM is useless because you aren't interested in the underlying sequence. What you want is to augment your handwritten character with information about how you wrote it.
Personally, I would start by implementing regular state-of-the-art handwriting recognition which already is very good (with convolutional neural networks or deep learning). After that, you can add information about how it was written, for example clockwise/counterclockwise.
