Let's say you have 10 features of 500 categories. A category can only occur once per input. The features can be in any particular order, e.g. [1,2,3,4,5,...], [5,3,4,2,1,...], etc. And the order does not matter, so [1,2,3,4,5,...] = [5,3,4,2,1,...]. So you shuffle your training data to train the network on the unordered data.
Now you want to feed this to your neural network. 3 architectures come to my mind:
MLP (Input: embedding_dim x n_features)
LSTM with embedding (Input: embedding_dim, Sequence Len: n_features)
LSTM with one hot encoding (Input: feature_dim, Sequence Len: n_features)
Which of these performs better on unordered data, from your evidence-based research?
Do you have any other architectures in mind that perform well on unordered data (maybe where shuffling the training data is not even necessary)?
What are you actually trying to model? This information may give some clues on how to approach the problem.
If I am understanding correctly, you are trying to learn from unordered multisets of size 10. Each element may assume one of 500 categories.
It may help to do some preprocessing of your data. Two approaches that come to my mind are:
Each sample could be encoded as a vector of size 500, where each component represents the multiplicity of the respective element, e.g. [1,1,1,1,3,3,3,4,4,5] would be represented as [4,0,3,2,1,0,...].
Another simple approach could be to sort each sample into a canonical order. Otherwise, the number of distinct inputs would be extremely high, i.e. up to 500^10. Both ideas are sketched below.
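A minimal sketch of both preprocessing ideas, assuming the categories are integer ids in the range 1..500 (NumPy only; all names here are illustrative):

```python
import numpy as np

N_CATEGORIES = 500  # assumption: categories are integer ids 1..500

def to_multiplicity_vector(sample):
    """Encode a (multi)set of category ids as a length-500 count vector."""
    counts = np.zeros(N_CATEGORIES, dtype=np.float32)
    for category in sample:
        counts[category - 1] += 1
    return counts

def to_canonical_order(sample):
    """Sort the sample so every permutation maps to the same input."""
    return sorted(sample)

sample = [1, 1, 1, 1, 3, 3, 3, 4, 4, 5]
print(to_multiplicity_vector(sample)[:6])   # [4. 0. 3. 2. 1. 0.]
print(to_canonical_order([5, 3, 4, 2, 1]))  # [1, 2, 3, 4, 5]
```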
I want to find the opinion of a sentence, either positive or negative. For example, take just one sentence:
The play was awesome
If I change it to vector form:
[0,0,0,0]
After searching through the Bag of words
bad
naughty
awesome
The vector form becomes
[0,0,0,1]
The same goes for other sentences. Now I want to pass these vectors to a machine learning algorithm for training, so that it can find the opinion of unseen sentences. Can I train the network using these multiple vectors? Obviously not as-is, because the input size of a neural network is fixed. Is there any way? The above procedure is just my thinking; kindly correct me if I am wrong. Thanks in advance.
Your intuitive input format is a "sentence", which is a string of tokens of arbitrary length. Representing sentences as token sequences is not a good choice here, because many existing algorithms only work on a fixed input format.
Hence, I suggest running a tokenizer over your entire training set. This will give you vectors whose length is the size of the dictionary, which is fixed for a given training set.
Even when the lengths of the sentences vary drastically, the size of the dictionary stays stable.
Then you can apply neural networks (or other algorithms) to the tokenized vectors.
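As a rough sketch of that tokenize-then-train idea (assuming scikit-learn; the sentences, labels and layer size are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

sentences = ["The play was awesome", "The play was bad"]  # toy training data
labels = [1, 0]                                           # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)  # fixed-length vectors, one column per dictionary word

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000)
clf.fit(X.toarray(), labels)

print(clf.predict(vectorizer.transform(["The play was awesome"]).toarray()))
```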
However, the vectors generated by the tokenizer are extremely sparse, because you only work on sentences rather than articles.
You can try LDA (linear discriminant analysis, which is supervised, unlike PCA) to reduce the dimensionality as well as amplify the differences between classes.
That will keep the essential information of your training data and express your data at a fixed size, while this "size" is not too large.
By the way, you may not have to label each word by its attitude, since the opinion of a sentence also depends on other kinds of words.
Simple arithmetic on the number of opinion-expressing words may leave your model highly biased. It is better to label the sentences and leave the rest of the job to the classifier.
To clear up the confusion:
PCA and LDA are both dimensionality reduction techniques.
The difference:
Let's assume each sample is denoted as x (a 1-by-p vector). p is too large, and we don't like that. Let's find a matrix A (p-by-k) in which k is pretty small. Then we get reduced_x = x*A, and, most importantly, reduced_x must be able to represent x's characteristics. Given labeled data, LDA can provide a proper A that maximizes the distance between the reduced_x of different classes and also minimizes the distance within identical classes. In simple words: compress the data, keep the information. Once you have reduced_x, you can define the training data as (reduced_x | y), where y is 0 or 1.
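A minimal sketch of that reduced_x = x*A step using scikit-learn's LinearDiscriminantAnalysis (note that with two classes LDA can reduce to at most one dimension; the data here is random and purely illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.random((50, 20))           # 50 samples, p = 20 features (too large in the real case)
y = rng.integers(0, 2, size=50)    # labels: 0 or 1

lda = LinearDiscriminantAnalysis(n_components=1)  # k must be <= n_classes - 1
reduced_x = lda.fit_transform(X, y)               # shape (50, 1)

# training data in the (reduced_x | y) form described above
training_data = np.hstack([reduced_x, y.reshape(-1, 1)])
print(training_data.shape)
```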
I have a set of 3-5 black box scoring functions that assign positive real value scores to candidates.
Each is decent at ranking the best candidate highest, but they don't always agree. I'd like to find out how to combine the scores into an optimal meta-score such that, among a pool of candidates, the one with the highest meta-score is usually the actual correct candidate.
So they are plain R^n vectors, but each dimension individually tends to have higher value for correct candidates. Naively I could just multiply the components, but I hope there's something more subtle to benefit from.
If the highest score is too low (or perhaps the two highest are too close), I just give up and say 'none'.
So for each trial, my input is a set of these score-vectors, and the output is which vector corresponds to the actual right answer, or 'none'. This is kind of like tech interviewing where a pool of candidates are interviewed by a few people who might have differing opinions but in general each tend to prefer the best candidate. My own application has an objective best candidate.
I'd like to maximize correct answers and minimize false positives.
More concretely, my training data might look like many instances of
{[0.2, 0.45, 1.37], [5.9, 0.02, 2], ...} -> i
where i is the ith candidate vector in the input set.
So I'd like to learn a function that tends to maximize the actual best candidate's score vector from the input. There are no degrees of bestness. It's binary right or wrong. However, it doesn't seem like traditional binary classification because among an input set of vectors, there can be at most 1 "classified" as right, the rest are wrong.
Thanks
Your problem doesn't exactly belong in the machine learning category. The multiplication method might work better. You can also try different statistical models for your output function.
ML, and more specifically classification, problems need training data from which your network can learn any existing patterns in the data and use them to assign a particular class to an input vector.
If you really want to use classification, then I think your problem can fit into the category of one-vs-all classification. You will need a network (or just a single output layer) with the number of cells/sigmoid units equal to your number of candidates (each cell representing one candidate). Note that here your number of candidates will be fixed.
You can use your entire candidate vector as input to all the cells of your network. The output can be specified using one-hot encoding, i.e. 00100 if candidate no. 3 was the actual correct candidate; in the case of no correct candidate, the output will be 00000.
For this to work, you will need a big data set containing your candidate vectors and the corresponding actual correct candidate. For this data you will either need a function (again, like multiplication) or you can assign the outputs yourself, in which case the system will learn how you classify the output given different inputs and will classify new data the same way you did. This way it will maximize the number of correct outputs, but the definition of "correct" here will be how you classified the training data.
You can also use a different type of output where each cell of the output layer corresponds to one of your scoring functions, and 00001 means that the candidate selected by your 5th scoring function was the right one. This way your number of candidates does not have to be fixed. But again, you will have to set the outputs of the training data manually for your network to learn from.
One-vs-all is a classification technique where there are multiple cells in the output layer, each performing binary classification between one of the classes and all the others. At the end, the sigmoid with the highest probability is assigned 1 and the rest are assigned 0.
Once your system has learned how you classify data through your training data, you can feed your new data in and it will give you output in the same way i.e. 01000 etc.
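A minimal sketch of the fixed-candidate, one-hot-output setup described above (assuming Keras; the counts of 5 candidates and 3 scoring functions, the layer sizes, and the random data are all illustrative assumptions):

```python
import numpy as np
from tensorflow import keras

n_candidates, n_scores = 5, 3  # assumed sizes, for illustration only

model = keras.Sequential([
    keras.Input(shape=(n_candidates * n_scores,)),            # all candidate score vectors, concatenated
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(n_candidates, activation="sigmoid"),   # one output cell per candidate
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# toy targets: a 1 marks the correct candidate (e.g. 00100); an all-zero row would mean 'none'
X = np.random.rand(100, n_candidates * n_scores)
y = np.zeros((100, n_candidates))
y[np.arange(100), np.random.randint(0, n_candidates, 100)] = 1.0

model.fit(X, y, epochs=5, verbose=0)
probs = model.predict(X[:1])
print(np.argmax(probs), probs.max())  # pick the winner; threshold probs.max() to decide 'none'
```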
I hope my answer was able to help you.:)
The inputs I am using are 2xN, where the first 1xN row are continuous numbers, and the second 1xN row are discrete numbers (that encodes a specific class out of 7 possible classes). I expect there to be a relation between vertically adjacent pairs.
I am looking to use a neural net for a multi-class classifier on this input, but am unsure of how to reshape my data for forward propagation in a way that makes sense.
What is a feasible way to reshape my data into 1x2N for forward propagation that makes sense?
edit:
Example input:
input_features = [[99.3, 22.1, 41.7], [1, 3, 4]]
Unless you know something more than "there might be some kind of relation", you should just flatten the array and pass it as a vector; a NN can (in theory) find such relations on its own (given enough data).
What are the other options? If you suspect that there is a single relation that holds for every single column, then you might want to construct a specific neural net. One option is to have a convolution of size 2x1 (a single column) in the input layer. On the other hand, if you create a large enough set of kernels, this will be able to model more complex relations too. In such a case, leave it as a matrix (think of it as an image). There is nothing wrong with discrete values, as long as they are on a reasonable scale.
In general, you will actually just work with the specific wiring of the net, not with reshaping the array (however, implementations of conv nets actually use the shape to do the work for you, as described).
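A small sketch of the two options above (flatten to a 1x2N vector, or keep the 2xN shape so a 2x1 kernel sees each column pair), using the example input from the question; NumPy only, with the conv wiring left to whatever framework you use:

```python
import numpy as np

# example from the question: first row continuous, second row discrete class ids
input_features = np.array([[99.3, 22.1, 41.7],
                           [1.0,  3.0,  4.0]])

# option 1: flatten to 1x2N and feed a plain dense network
flat = input_features.reshape(1, -1)            # shape (1, 6)

# option 2: keep it as a 2xN, single-channel "image" so a 2x1 convolution
# looks at each (continuous, discrete) column pair together
as_image = input_features.reshape(1, 2, -1, 1)  # (batch, height=2, width=N, channels=1)

print(flat.shape, as_image.shape)
```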
I have a training set where the input vectors are speed, acceleration and turn angle change. The output is a crisp class: an activity state from the given set {rest, walk, run}, e.g. for input vector [3.1 1.2 2] --> run; [2.1 1 1] --> walk, and so on.
I am using Weka to develop a neural network model. I define the outputs as crisp ones (or rather qualitative ones in words: categorical values). After training, the model can classify test data fairly well.
I was wondering how the internal process (the mapping function) takes place. Do the qualitative output states get some numerical value inside the model, which is converted back to categorical data after processing? A NN model cannot map float input values to categorical data through hidden neurons directly, so what is actually happening, even though the model works fine?
If the model converts the categorical outputs into numerical ones and then starts processing, on what basis does it convert the categorical values into arbitrary numerical values?
Yes, categorical values are usually converted to numbers, and the network learns to associate the input data with these numbers. However, these numbers are often encoded further rather than being mapped to a single output neuron. The most common way to do this, for unordered labels, is to add a dummy output neuron for each category and use 1-of-C encoding, with 0.1 and 0.9 as target values. The output is then interpreted using the winner-take-all paradigm.
Using only one neuron and encoding categories with different numbers for unordered labels often leads to problems - as the network will treat middle categories as "averages" of the boundary categories. This however may sometimes be desired, if you have ordered categorical data.
You can find a very good explanation of this issue in this part of the online Neural Network FAQ.
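A small sketch of the 1-of-C encoding with 0.1/0.9 targets and winner-take-all decoding described above (plain NumPy; the class order is an arbitrary assumption):

```python
import numpy as np

classes = ["rest", "walk", "run"]  # unordered labels from the question

def encode(label):
    """1-of-C target vector using 0.1/0.9 instead of 0/1."""
    target = np.full(len(classes), 0.1)
    target[classes.index(label)] = 0.9
    return target

def decode(outputs):
    """Winner-take-all: the output unit with the largest value wins."""
    return classes[int(np.argmax(outputs))]

print(encode("run"))               # [0.1 0.1 0.9]
print(decode([0.15, 0.22, 0.81]))  # 'run'
```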
The neural net's computations all take place on continuous values. To do multiclass classification with discrete output, its final layer produces a vector of such values, one for each class. To make a discrete class prediction, take the index of the maximum element in that vector.
So if the final layer in a classification network for four classes predicts [0 -1 2 1], then the third element of the vector is the largest and the third class is selected. Often, these values are also constrained to form a probability distribution by means of a softmax activation function.
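For concreteness, a quick NumPy check of the numbers in that example:

```python
import numpy as np

logits = np.array([0.0, -1.0, 2.0, 1.0])
probs = np.exp(logits) / np.exp(logits).sum()  # softmax
print(probs.round(2))     # roughly [0.09 0.03 0.64 0.24]
print(np.argmax(logits))  # 2 -> the third class is selected
```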
I am working on classifying some reviews (paragraphs) consisting of multiple sentences. I classified them with bag-of-words features in Weka via libSVM. However, I had another idea which I don't know how to implement:
I thought creating syntactic and shallow-semantics-based features per sentence in the reviews is worth trying. However, I couldn't find any way to encode those features sequentially, since the number of sentences per paragraph varies. The reason I wanted to keep those features in order is that the order of sentence features may give a better clue for classification. For example, if I have two instances P1 (with 3 sentences) and P2 (2 sentences), I would have a space like this (assume each sentence has one binary feature, a or b):
P1 -> a b b /classX
P2 -> b a /classY
So, my question is whether I can implement that classification with feature vectors of different sizes in the feature space or not. If yes, is there any kind of classifier that I can use in Weka, scikit-learn or Mallet? I would appreciate any responses.
Thanks
Regardless of the implementation, an SVM with the standard kernels (linear, polynomial, RBF) requires fixed-length feature vectors. You can encode any information in those feature vectors by encoding it as booleans; e.g. collect all syntactic/semantic features that occur in your corpus, then introduce booleans that represent that "feature such-and-such occurred in this document". If it's important to capture the fact that these features occur in multiple sentences, count them and put the frequency in the feature vector (but be sure to normalize your frequencies by document length, as SVMs are not scale-invariant).
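A rough sketch of that encoding, assuming scikit-learn and using the made-up feature names a/b from the question (counts normalized by the number of sentences per document):

```python
import numpy as np
from sklearn.svm import SVC

# assumed: every syntactic/semantic feature seen in the corpus, in a fixed order
all_features = ["a", "b"]

def encode(sentence_features):
    """sentence_features: per-sentence feature names for one document, e.g. ['a', 'b', 'b']."""
    counts = np.array([sentence_features.count(f) for f in all_features], dtype=float)
    return counts / len(sentence_features)  # normalize by document length

X = np.array([encode(["a", "b", "b"]),   # P1
              encode(["b", "a"])])       # P2
y = ["classX", "classY"]

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([encode(["a", "b"])]))
```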
In case you are classifying textual data, I would suggest looking at "rational kernels", which are built on weighted finite-state transducers for classifying natural language texts. Rational kernels can be applied to variable-length vectors and are already implemented as an open source project (OpenFST).
It is the library's limitation: the SVM itself does not require fixed-length feature vectors, it only needs a kernel function. If you can provide a kernel function that works on variable-length vectors, it should be fine for an SVM.
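One way to act on that in practice is a precomputed kernel in scikit-learn: compute the Gram matrix over your variable-length sequences yourself and hand it to SVC. The similarity function below is only a toy placeholder, not a recommendation of a specific sequence kernel:

```python
import numpy as np
from sklearn.svm import SVC

sequences = [["a", "b", "b"], ["b", "c"], ["c", "c", "b"]]  # variable-length inputs
y = ["classX", "classY", "classY"]

def seq_kernel(s, t):
    """Toy set-overlap similarity; replace with a real sequence kernel."""
    return len(set(s) & set(t)) / len(set(s) | set(t))

# Gram matrix over the training sequences
K = np.array([[seq_kernel(s, t) for t in sequences] for s in sequences])
clf = SVC(kernel="precomputed").fit(K, y)

# to predict, compute similarities of the new sequence against the training sequences
new = ["a", "a", "b"]
k_new = np.array([[seq_kernel(new, t) for t in sequences]])
print(clf.predict(k_new))
```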