Qualitative Classification in Neural Network on Weka - machine-learning

I have a training set where the input vectors are speed, acceleration and turn angle change. The output is a crisp class: an activity state from the given set {rest, walk, run}. For example, the input vector [3.1 1.2 2]-->run; [2.1 1 1]-->walk, and so on.
I am using Weka to develop a neural network model. I am defining the outputs as crisp ones (or rather qualitative ones in words, i.e. categorical values). After training, the model classifies the test data fairly well.
I was wondering how the internal process (the mapping function) takes place. Are the qualitative output states assigned some numerical value inside the model, and converted back to categorical data after processing? As far as I understand, an NN model cannot map float input values to categorical data through hidden neurons, so what is actually happening, given that the model is working fine?
If the model converts the categorical outputs into numerical ones before processing, on what basis does it choose those arbitrary numerical values?

Yes, categorical values are usually converted to numbers, and the network learns to associate the input data with these numbers. However, these numbers are often encoded further, so that more than a single output neuron is used. The most common way to do this for unordered labels is to add a dummy output neuron dedicated to each category and use 1-of-C encoding, with 0.1 and 0.9 as target values. The output is interpreted using the winner-take-all paradigm.
Using only one neuron and encoding categories with different numbers for unordered labels often leads to problems - as the network will treat middle categories as "averages" of the boundary categories. This however may sometimes be desired, if you have ordered categorical data.
You can find a very good explanation of this issue in the online Neural Network FAQ.

The neural net's computations all take place on continuous values. To do multiclass classification with discrete output, its final layer produces a vector of such values, one for each class. To make a discrete class prediction, take the index of the maximum element in that vector.
So if the final layer in a classification network for four classes predicts [0 -1 2 1], then the third element of the vector is the largest and the third class is selected. Often, these values are also constrained to form a probability distribution by means of a softmax activation function.
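As a minimal sketch of the two steps just described (plain Python; the function names are my own and are not part of Weka or any library), softmax turns the raw scores into a probability distribution and the argmax picks the class:

```python
import math

def softmax(scores):
    """Turn raw output-layer scores into a probability distribution."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(scores):
    """Return the 0-based index of the largest score (winner-take-all)."""
    return max(range(len(scores)), key=lambda i: scores[i])

scores = [0.0, -1.0, 2.0, 1.0]   # the example vector from the text
probs = softmax(scores)          # non-negative values that sum to 1
winner = predict_class(scores)   # index 2, i.e. the third class
```

Softmax preserves the ordering of the scores, so taking the argmax before or after it selects the same class.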

Related

What's the major difference between glove and word2vec?

What is the difference between word2vec and glove?
Are both of them ways to train a word embedding? If yes, then how can we use both?
Yes, they're both ways to train a word embedding. They both provide the same core output: one vector per word, with the vectors in a useful arrangement. That is, the vectors' relative distances/directions roughly correspond with human ideas of overall word relatedness, and even relatedness along certain salient semantic dimensions.
Word2Vec does incremental, 'sparse' training of a neural network, by repeatedly iterating over a training corpus.
GloVe works to fit vectors to model a giant word co-occurrence matrix built from the corpus.
Working from the same corpus, creating word-vectors of the same dimensionality, and devoting the same attention to meta-optimizations, the quality of their resulting word-vectors will be roughly similar. (When I've seen someone confidently claim one or the other is definitely better, they've often compared some tweaked/best-case use of one algorithm against some rough/arbitrary defaults of the other.)
I'm more familiar with Word2Vec, and my impression is that Word2Vec's training better scales to larger vocabularies, and has more tweakable settings that, if you have the time, might allow tuning your own trained word-vectors more to your specific application. (For example, using a small-versus-large window parameter can have a strong effect on whether a word's nearest-neighbors are 'drop-in replacement words' or more generally words-used-in-the-same-topics. Different downstream applications may prefer word-vectors that skew one way or the other.)
Conversely, some proponents of GloVe tout that it does fairly well without needing metaparameter optimization.
You probably wouldn't use both, unless comparing them against each other, because they play the same role for any downstream applications of word-vectors.
Word2vec is a predictive model: it trains by trying to predict a target word given a context (CBOW method) or the context words from the target (skip-gram method). It uses trainable embedding weights to map words to their corresponding embeddings, which are used to help the model make predictions. The loss function for training the model is related to how good the model’s predictions are, so as the model trains to make better predictions it will result in better embeddings.
GloVe is based on matrix factorization techniques applied to the word-context matrix. It first constructs a large matrix of (words x context) co-occurrence information, i.e. for each “word” (the rows), you count how frequently (the matrix values) that word appears in some “context” (the columns) in a large corpus. The number of “contexts” would be very large, since it is essentially combinatorial in size. So we factorize this matrix to yield a lower-dimensional (word x features) matrix, where each row now yields a vector representation for each word. In general, this is done by minimizing a “reconstruction loss”. This loss tries to find the lower-dimensional representations which can explain most of the variance in the high-dimensional data.
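To make the counting step concrete, here is a rough sketch of how word-context co-occurrence counts could be gathered from a tokenized corpus. The toy sentence, the window size and the function name are assumptions made for illustration; this is not GloVe's actual preprocessing code.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each (word, context) pair occurs within `window` tokens."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                counts[(word, tokens[j])] += 1
    return counts

# Toy corpus, invented for this example.
tokens = "ice is solid and steam is gas".split()
counts = cooccurrence_counts(tokens, window=2)
```

Each key is a (row, column) cell of the sparse word-context matrix; the factorization step would then compress those rows into short dense vectors.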
Before GloVe, algorithms for word representations could be divided into two main streams: statistics-based (LSA) and learning-based (Word2Vec). LSA produces low-dimensional word vectors by singular value decomposition (SVD) on the co-occurrence matrix, while Word2Vec employs a three-layer neural network to perform a center-context word pair classification task, where the word vectors are just a by-product.
The most amazing point from Word2Vec is that similar words are located together in the vector space and arithmetic operations on word vectors can express semantic or syntactic relationships, e.g., “king” - “man” + “woman” -> “queen” or “better” - “good” + “bad” -> “worse”. However, LSA cannot maintain such linear relationships in vector space.
The motivation of GloVe is to force the model to learn such linear relationships explicitly, based on the co-occurrence matrix. Essentially, GloVe is a log-bilinear model with a weighted least-squares objective. In other words, it is a hybrid method that applies machine learning to the statistics matrix, and this is the general difference between GloVe and Word2Vec.
If we dive into the derivation of the GloVe equations, we find that the difference is inherent in the intuition. GloVe observes that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Take the example from StanfordNLP (Global Vectors for Word Representation), which considers the co-occurrence probabilities for the target words ice and steam with various probe words from the vocabulary:
As one might expect, ice co-occurs more frequently with solid than it does with gas, whereas steam co-occurs more frequently with gas than it does with solid.
Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently.
Only in the ratio of probabilities does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.
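The ratio argument can be illustrated with a small sketch. The counts below are hypothetical numbers invented for this example (they are not the figures from the GloVe paper), and the helper names are my own.

```python
# Hypothetical co-occurrence counts, invented purely for illustration.
counts = {
    "ice":   {"solid": 80, "gas": 10, "water": 300, "fashion": 2},
    "steam": {"solid": 10, "gas": 80, "water": 300, "fashion": 2},
}
totals = {"ice": 1000, "steam": 1000}  # assumed total context counts per word

def cooccurrence_prob(word, probe):
    """P(probe | word): share of the word's contexts that are the probe."""
    return counts[word][probe] / totals[word]

def probability_ratio(probe, a="ice", b="steam"):
    """P(probe | a) / P(probe | b)."""
    return cooccurrence_prob(a, probe) / cooccurrence_prob(b, probe)

ratios = {probe: probability_ratio(probe) for probe in counts["ice"]}
# solid   -> well above 1 (specific to ice)
# gas     -> well below 1 (specific to steam)
# water   -> close to 1   (shared property)
# fashion -> close to 1   (unrelated to both)
```

With these made-up counts, water has a high raw probability for both words yet a ratio near 1, which is the cancellation the quoted passage describes.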
Word2Vec, by contrast, works on the raw co-occurrence probabilities, maximizing the probability that the words surrounding the target word are its context.
In practice, to speed up training, Word2Vec employs negative sampling, which replaces the softmax function with a sigmoid function operating on the real data and noise data. This implicitly results in the words clustering into a cone in the vector space, while GloVe’s word vectors are located more discretely.
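As a sketch of what replacing the softmax with sigmoids looks like, the following computes the usual skip-gram negative-sampling loss for one (center, context) pair and a couple of noise words. The tiny 2-dimensional vectors are invented for the example, and the function names are my own.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(center, context, negatives):
    """Loss for one real (center, context) pair plus sampled noise words."""
    # Real pair: the loss shrinks as the dot product grows.
    loss = -math.log(sigmoid(dot(context, center)))
    # Noise pairs: the loss shrinks as the dot product becomes more negative.
    for noise in negatives:
        loss -= math.log(sigmoid(-dot(noise, center)))
    return loss

# Invented vectors: the context points the same way as the center word,
# the noise words do not.
center = [1.0, 0.5]
context = [0.8, 0.4]
negatives = [[-0.9, -0.2], [0.1, -1.0]]

good = negative_sampling_loss(center, context, negatives)
# Swapping the true context with a noise word should be penalised.
bad = negative_sampling_loss(center, negatives[0], [context, negatives[1]])
```

Only a handful of sigmoid terms are evaluated per training pair, instead of a normalisation over the whole vocabulary, which is where the speed-up comes from.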

Machine learning multi-classification: Why use 'one-hot' encoding instead of a number

I'm currently working on a classification problem with TensorFlow, and I'm new to the world of machine learning, but there is something I don't get.
I have successfully tried to train models that output the y tensor like this:
y = [0,0,1,0]
But I can't understand the principle behind it...
Why not just train the same model to output classes such as y = 3 or y = 4?
This seems much more flexible, because I can imagine having a multi-classification problem with 2 million possible classes, and it would be much more efficient to output a number between 0-2,000,000 than to output a tensor of 2,000,000 items for every result.
What am I missing?
Ideally, you could train your model to classify input instances by producing a single output, something like
y=1 means input=dog, y=2 means input=airplane. An approach like that, however, brings a lot of problems:
How do I interpret the output y=1.5?
Why am I trying to regress a number, as if I were working with continuous data, when in reality I'm working with discrete data?
In fact, what you are doing is treating a multi-class classification problem like a regression problem.
This is logically wrong (unless you're doing binary classification; in that case, a positive and a negative output are everything you need).
To avoid these (and other) issues, we use a final layer of neurons, one per class, and associate a high activation with the right class.
One-hot encoding represents the fact that you want to force your network to have a single high-activation output when a certain input is present.
Thus, every input=dog will have 1, 0, 0 as output, and so on.
In this way, you're correctly treating a discrete classification problem, producing a discrete and easily interpretable output. In fact, you'll always extract the output neuron with the highest activation using tf.argmax, so even if your network hasn't learned to produce a perfect one-hot encoding, you'll still be able to extract the most likely correct output.
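A small plain-Python sketch of this encode/decode round trip (the label set and the imperfect network output are made up for the example; in TensorFlow the decoding step is what tf.argmax does):

```python
CLASSES = ["dog", "airplane", "cat"]  # hypothetical label set

def one_hot(label):
    """Target vector: 1.0 for the right class, 0.0 everywhere else."""
    return [1.0 if c == label else 0.0 for c in CLASSES]

def decode(output):
    """Pick the class whose output neuron has the highest activation."""
    best = max(range(len(output)), key=lambda i: output[i])
    return CLASSES[best]

target = one_hot("dog")      # [1.0, 0.0, 0.0]
noisy = [0.62, 0.30, 0.08]   # an imperfect network output (invented)
label = decode(noisy)        # still "dog"
```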
The answer lies in how that final tensor, or single value, is calculated. In an NN, your y=3 would be built by a weighted sum over the values of the previous layer.
Trying to train towards single values would then imply a linear relationship between the category IDs where none exists: for the true value y=4, the output y=3 would be considered better than y=1, even though the categories are arbitrary and may be 1: dogs, 3: cars, 4: cats.
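To see the implied ordering in numbers, here is a sketch comparing a squared-error loss on the integer IDs with a cross-entropy loss on one output per class. The probability vectors are invented, and both loss functions are simplified to a single example.

```python
import math

def squared_error(true_id, predicted_id):
    """Regression-style loss on the raw category IDs."""
    return (true_id - predicted_id) ** 2

def cross_entropy(true_index, probs):
    """Classification loss: only the probability of the true class matters."""
    return -math.log(probs[true_index])

# IDs from the text: 1 = dogs, 3 = cars, 4 = cats. The true class is 4 (cats).
err_cars = squared_error(4, 3)   # 1: "cars" looks like a near miss
err_dogs = squared_error(4, 1)   # 9: "dogs" looks like a big miss

# One output per class, in the order (dogs, cars, cats); true class is index 2.
loss_says_cars = cross_entropy(2, [0.1, 0.8, 0.1])
loss_says_dogs = cross_entropy(2, [0.8, 0.1, 0.1])
# Both wrong answers are penalised equally: no ordering is implied.
```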
Neural networks use gradient descent to optimize a loss function. In turn, this loss function needs to be differentiable.
A discrete output would be (indeed is) a perfectly valid and valuable output for a classification network. Problem is, we don't know how to optimize this net efficiently.
Instead, we rely on a continuous loss function. This loss function is usually based on something that is more or less related to the probability of each label -- and for this, you need a network output that has one value per label.
Typically, the output that you describe is then deduced from this soft, continuous output by taking the argmax of these pseudo-probabilities.

Where do dimensions in Word2Vec come from?

I am using a word2vec model to train a neural network and build a neural embedding for finding similar words in the vector space. My question is about the dimensions of the word and context embeddings (matrices), which we initialise with random numbers (vectors) at the beginning of training, like this: https://iksinc.wordpress.com/2015/04/13/words-as-vectors/
Let's say we want to display the words {book,paper,notebook,novel} on a graph. First of all we should build a matrix with dimensions 4x2 or 4x3 or 4x4 etc. I know the first dimension of the matrix is the size of our vocabulary |v|. But what about the second dimension (the number of dimensions of each vector)? For example, if the vector for the word “book" is [0.3,0.01,0.04], what are these numbers? Do they have any meaning? For example, is the 0.3 related to the relation between the words “book" and “paper” in the vocabulary, the 0.01 to the relation between book and notebook, etc.?
Just like in TF-IDF or co-occurrence matrices, where each dimension (column) Y has a meaning: it's a word or document related to the word in row X.
The word2vec model uses a network architecture to represent the input word(s) and most likely associated output word(s).
Assuming there is one hidden layer (as in the example linked in the question), the two matrices introduced represent the weights and biases that allow the network to compute its internal representation of the function mapping the input vector (e.g. “cat” in the linked example) to the output vector (e.g. “climbed”).
The weights of the network are a sub-symbolic representation of the mapping between the input and the output – any single weight doesn’t necessarily represent anything meaningful on its own. It’s the connection weights between all units (i.e. the interactions of all the weights) in the network that gives rise to the network’s representation of the function mapping. This is why neural networks are often referred to as “black box” models – it can be very difficult to interpret why they make particular decisions and how they learn. As such, it's very difficult to say what the vector [0.3,0.01,0.04] represents exactly.
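A minimal sketch may help show where the second dimension comes from. It assumes the question's four-word vocabulary and an embedding size of 3 chosen by hand; the names and the random range are illustrative, not word2vec's actual implementation.

```python
import random

vocab = ["book", "paper", "notebook", "novel"]
dim = 3  # embedding size: a hyper-parameter we pick, not tied to the vocabulary

random.seed(0)  # fixed seed so the sketch is repeatable
# Input weight matrix of shape |V| x dim, randomly initialised before training.
W = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

def hidden_layer(x):
    """Multiply the one-hot input by W; this simply selects one row of W."""
    return [sum(x[i] * W[i][j] for i in range(len(vocab))) for j in range(dim)]

vector = hidden_layer(one_hot("book"))  # identical to W[0]
```

So the vector for “book" is just a row of hidden-layer weights. Its columns are hidden units, not other vocabulary words, which is why individual entries have no fixed meaning.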
Network weights are traditionally initialised to random values for two main reasons:
It prevents a bias being introduced to the model before training begins
It allows the network to start from different points in the search space after initialisation (helping reduce the impact of local minima)
A network’s ability to learn can be very sensitive to the way its weights are initialised. There are more advanced ways of initialising weights today e.g. this paper (see section: Weights initialization scaling coefficient).
The way in which weights are initialised and the dimension of the hidden layer are often referred to as hyper-parameters and are typically chosen according to heuristics and prior knowledge of the problem space.
I have wondered the same thing and put in a vector like (1 0 0 0 0 0...) to see what terms it was nearest to. The answer is that the results returned didn't seem to cluster around any particular meaning, but were just kind of random. This was using Mikolov's 300-dimensional vectors trained on Google News.
Look up NNSE semantic vectors for a vector space where the individual dimensions do seem to carry specific human-graspable meanings.

labelling of dataset in machine learning

I have a question about some basic concepts of machine learning. The examples I have seen only gave a brief overview. To train the system, a feature vector is given as input. In supervised learning, the dataset is labelled. I am confused about labelling. For example, if I have to distinguish between two types of pictures, I will provide a feature vector and, on the output side for testing, I'll provide 1 for type A and 2 for type B. But what if I want to extract a region of interest (ROI) from a dataset of images? How should I label my data to extract the ROI using an SVM? I hope I have conveyed my confusion. Thanks in anticipation.
In supervised learning, such as SVMs, the dataset should be composed as follows:
<i-th feature vector><i-th label>
where i goes from 1 to the number of patterns (also called examples or observations) in your training set. Each such pair is a single record in your training set that can be used to train the SVM classifier.
So you basically have a set composed of such tuples, and if you have just 2 labels (a binary classification problem) you can easily use an SVM. The SVM model is trained on the training set and the training labels, and once the training phase has finished you can use another set (called the validation set or test set), structured in the same way as the training set, to test the accuracy of your SVM.
In other words the SVM workflow should be structured as follows:
train the SVM using the training set and the training labels
predict the labels for the validation set using the model trained in the previous step
if you know what the actual validation labels are, you can match the predicted labels with the actual labels and check how many labels have been correctly predicted. The ratio between the number of correctly predicted labels and the total number of labels in the validation set is a scalar in [0, 1], called the accuracy of your SVM model.
if you're interested in the ROI, you might want to check the trained SVM parameters (mainly the weights and bias) to reconstruct the separation hyperplane
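The accuracy step of the workflow above can be sketched as follows. The label lists are invented, and the predicted labels stand in for the output of whatever SVM implementation you use.

```python
def accuracy(predicted, actual):
    """Fraction of validation labels that were predicted correctly."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Invented validation labels (type A / type B pictures) and model predictions.
actual    = ["A", "B", "A", "A", "B"]
predicted = ["A", "B", "B", "A", "B"]
acc = accuracy(predicted, actual)  # 4 of 5 correct
```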
It is also important to know that the training set records should be correctly, a priori labelled: if the training labels are not correct, the SVM will never be able to correctly predict the output for previously unseen patterns. You do not have to label your data according to the ROI you want to extract, the data must be correctly labelled a priori: the SVM will have the entire set of type A pictures and the set of type B pictures and will learn the decision boundary to separate pictures of type A and pictures of type B. You do not have to trick the labels: if you do, you're not doing classification and/or machine learning and/or pattern recognition. You're basically tricking the results.

How to pre-process dataset for maximum effectiveness with LibSVM Weka implementation

So I read a paper that said that processing your dataset correctly can increase LibSVM classification accuracy dramatically...I'm using the Weka implementation and would like some help making sure my dataset is optimal.
Here are my (example) attributes:
Power Numeric (real numbers, range is from 0 to 1.5132, 9000+ unique values)
Voltage Numeric (similar to Power)
Light Numeric (0 and 1 are the only 2 possible values)
Day Numeric (1 through 20 are the possible values, equal number of each value)
Range Nominal {1,2,3,4,5} <----these are the classes
My question is: which Weka pre-processing filters should I apply to make this dataset more effective for LibSVM?
Should I normalize and/or standardize the Power and Voltage data values?
Should I use a Discretization filter on anything?
Should I be binning the Power/Voltage values into a lot smaller number of bins?
Should I make the Light value Binary instead of numeric?
Should I normalize the Day values? Does it even make sense to do that?
Should I be using the Nominal to Binary or Nominal to something else filter for the classes "Range"?
Please advise on these questions and anything else you think I might have missed...
Thanks in advance!!
Normalization is very important, as it influences the concept of distance which is used by SVM. The two main approaches to normalization are:
Scale each input dimension to the same interval, for example [0, 1]. This is the most common approach by far. It is necessary to prevent some input dimensions from completely dominating others. It is recommended by the LIBSVM authors in their beginner's guide (see Appendix B for examples).
Scale each instance to a given length. This is common in text mining / computer vision.
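The two approaches can be sketched in plain Python as follows. The sample values are made up to resemble the Power and Day attributes from the question, and in Weka itself you would apply a filter rather than hand-written code.

```python
import math

def min_max_scale(column):
    """Approach 1: rescale one input dimension (a column) to [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: nothing to spread out
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def unit_length(instance):
    """Approach 2: rescale one instance (a row) to Euclidean length 1."""
    norm = math.sqrt(sum(v * v for v in instance))
    return [v / norm for v in instance] if norm else list(instance)

power = [0.0, 0.7566, 1.5132]         # invented sample of the Power attribute
days = [1, 10, 20]                    # invented sample of the Day attribute

scaled_power = min_max_scale(power)   # now spans 0 to 1
scaled_days = min_max_scale(days)     # now spans 0 to 1 as well
row = unit_length([3.0, 4.0])         # length-1 version of one instance
```

After column-wise scaling, Day no longer dominates distance computations merely because its raw range (1 to 20) is larger than that of Power (0 to about 1.5).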
As to handling types of inputs:
Continuous: no work needed, SVM works on these implicitly.
Ordinal: treat as continuous variables. For example cold, lukewarm, hot could be modeled as 1, 2, 3 without implicitly defining an unnatural structure.
Nominal: perform one-hot encoding, e.g. for an input with N levels, generate N new binary input dimensions. This is necessary because you must avoid implicitly defining a varying distance between nominal levels. For example, modelling cat, dog, bird as 1, 2 and 3 implies that a dog and bird are more similar than a cat and bird which is nonsense.
Normalization must be done after substituting inputs where necessary.
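The cat/dog/bird point can be checked with a quick distance computation. This is a sketch with hand-written encodings, not a Weka filter.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Integer coding: one input dimension with an arbitrary order.
integer_code = {"cat": [1.0], "dog": [2.0], "bird": [3.0]}
# One-hot coding: N = 3 binary input dimensions.
one_hot_code = {
    "cat":  [1.0, 0.0, 0.0],
    "dog":  [0.0, 1.0, 0.0],
    "bird": [0.0, 0.0, 1.0],
}

int_dog_bird = euclidean(integer_code["dog"], integer_code["bird"])  # 1.0
int_cat_bird = euclidean(integer_code["cat"], integer_code["bird"])  # 2.0

hot_dog_bird = euclidean(one_hot_code["dog"], one_hot_code["bird"])  # sqrt(2)
hot_cat_bird = euclidean(one_hot_code["cat"], one_hot_code["bird"])  # sqrt(2)
```

With the integer coding the SVM sees dog as closer to bird than cat is; with one-hot coding every pair of levels is equally far apart.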
To answer your questions:
Should I normalize and/or standardize the Power and Voltage data values?
Yes, scale all (final) input dimensions to the same interval (including dummies!).
Should I use a Discretization filter on anything?
No.
Should I be binning the Power/Voltage values into a lot smaller number of bins?
No. Treat them as continuous variables (e.g. one input each).
Should I make the Light value Binary instead of numeric?
No, SVM has no concept of binary variables and treats everything as numeric. So converting it will just lead to an extra type-cast internally.
Should I normalize the Day values? Does it even make sense to do that?
If you want to use 1 input dimension, you must normalize it just like all others.
Should I be using the Nominal to Binary or Nominal to something else filter for the classes "Range"?
Nominal to binary, using one-hot encoding.
