I sort of understand what features are, say a ML algorithm that learns SPAM, certain keywords could be a feature?
But in the famous MNIST digits data set, I see a matrix of numbers, is the entire matrix one single feature? Or is a feature each number in the matrix?
In my opinion, you are lacking some critical literature review.
Here are some good papers about RNN and CNN that can be used for image recognition appications :
https://pdfs.semanticscholar.org/86ef/e7769f2b8a0e15ca213ab09881e6705caeb0.pdf
https://arxiv.org/pdf/1506.00019.pdf
What is a feature? A feature represents one of the elements of the input vector which will be used to train the model and produce output.
The feature set is to be determined depending on the application.
Each element of the input vector is a different (dependent or independent) feature.
Look at this tutorial for example using the MNIST digit data set:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/recurrent_network.py
It says:
'''
To classify images using a recurrent neural network, we consider every image
row as a sequence of pixels. Because MNIST image shape is 28*28px, we will then
handle 28 sequences of 28 steps for every sample.
'''
The RNN is built on sequences, hence if the image is 28 by 28 you can break it in 28 sequences of 28 features.
# Network Parameters
num_input = 28 # MNIST data input (img shape: 28*28)
timesteps = 28 # timesteps
This is what you see in the network parameters. The 28 features (num_input = 28) representing one sequence of the image.
To repeat again, each element of the input vector is considered a feature. Furthermore, is the analyst's responsibility to properly define these features.
Technically, a feature is a numerical value which discriminatively represents (or attempts to discriminatively represent) input or some part(s) of input. In case of MNIST, where image size is 28 x 28, the entire image matrix is flattened (generally row-wise) into a 1D feature vector, each element of this feature vector is a feature (in this case, simply image intensity). The type or kind of feature which one wants to use is completely problem specific. For e.g., instead of flattening the entire MNIST digit image, you could have used number of white pixels as your feature; however, it boils down to how discriminative such a feature could be for the given problem.
In case of spam classification, generally the features are frequency of words (there are several other things involved, such as stop word elimination, stemming, etc.).
One can off-course select or design multiple features for a given problem, such as stroke length, curvature, number of edges, etc. which you mentioned in the comment above. However, the main idea is that features should be discriminative enough for all the classes and they should not be derived from each other (this point leads us to another problem called feature or dimensionality reduction). I suggest you to read this Wikipedia page here and then go on to read an academic presentation on feature extraction and dimensionality reduction, such as this (this one is specific to images). This would help you to understand the overall idea.
An additional note, the features are combined into a compact representation called a feature vector. In this particular case, as mentioned before, you have a 1-D feature vector, which contains image intensities as a features.
Related
I've seen some posts say that the average of the word vectors perform better in some tasks than the doc vectors learned through PV_DBOW. What is the relationship between the document's vector and the average/sum of its words' vectors? Can we say that vector d
is approximately equal to the average or sum of its word vectors?
Thanks!
No. The PV-DBOW vector is calculated by a different process, based on how well the PV-DBOW-vector can be incrementally nudged to predict each word in the text in turn, via a concurrently-trained shallow neural network.
But, a simple average-of-word-vectors often works fairly well as a summary vector for a text.
So, let's assume both the PV-DBOW vector and the simple-average-vector are the same dimensionality. Since they're bootstrapped from exactly the same inputs (the same list of words), and the neural-network isn't significantly more sophisticated (in its internal state) than a good set of word-vectors, the performance of the vectors on downstream evaluations may not be very different.
For example, if the training data for the PV-DBOW model is meager, or meta-parameters not well optimized, but the word-vectors used for the average-vector are very well-fit to your domain, maybe the simple-average-vector would work better for some downstream task. On the other hand, a PV-DBOW model trained on sufficient domain text could provide vectors that outperform a simple-average based on word-vectors from another domain.
Note that FastText's classification mode (and similar modes in Facebook's StarSpace) actually optimizes word-vectors to work as parts of a simple-average-vector used to predict known text-classes. So if your end-goal is to have a text-vector for classification, and you have a good training dataset with known-labels, those techniques are worth considering as well.
While I was classifying and clustering the documents written in natural language, I came up with a question ...
As word2vec and glove, and or etc, vectorize the word in distributed spaces, I wonder if there are any method recommended or commonly used for document vectorization USING word vectors.
For example,
Document1: "If you chase two rabbits, you will lose them both."
can be vectorized as,
[0.1425, 0.2718, 0.8187, .... , 0.1011]
I know about the one also known as doc2vec, that this document has n dimensions just like word2vec. But this is 1 x n dimensions and I have been testing around to find out the limits of using doc2vec.
So, I want to know how other people apply the word vectors for applications with steady size.
Just stacking vectors with m words will be formed m x n dimensional vectors. In this case, the vector dimension will not be uniformed since dimension m will depends on the number of words in document.
If: [0.1018, ... , 0.8717]
you: [0.5182, ... , 0.8981]
..: [...]
m th word: [...]
And this form is not favorable form to run some machine learning algorithms such as CNN. What are the suggested methods to produce the document vectors in steady form using word vectors?
It would be great if it is provided with papers as well.
Thanks!
The most simple approach to get a fixed-size vector from a text, when all you have is word-vectors, to average all the word-vectors together. (The vectors could be weighted, but if they haven't been unit-length-normalized, their raw magnitudes from training are somewhat of an indicator of their strength-of-single-meaning – polysemous/ambiguous words tend to have vectors with smaller magnitudes.) It works OK for many purposes.
Word vectors can be specifically trained to be better at composing like this, if the training texts are already associated with known classes. Facebook's FastText in its 'classification' mode does this; the word-vectors are optimized as much or more for predicting output classes of the texts they appear in, as they are for predicting their context-window neighbors (classic word2vec).
The 'Paragraph Vector' technique, often called 'doc2vec', gives every training text a sort-of floating pseudoword, that contributes to every prediction, and thus winds up with a word-vector-like position that may represent that full text, rather than the individual words/contexts.
There are many further variants, including some based on deeper predictive networks (eg 'Skip-thought Vectors'), or slightly different prediction targets (eg neighboring sentences in 'fastSent'), or other genericizations that can even include a mixture of symbolic and numeric inputs/targets during training (an option in Facebook's StarSpace, which explores other entity-vectorization possibilities related to word-vectors and FastText-like classification needs).
If you don't need to collapse a text to fixed-size vectors, but just compare texts, there are also techniques like "Word Mover's Distance" which take the "bag of word-vectors" for one text, and another, and give a similarity score.
I have created a DCGAN and already trained it for CIFAR-10 dataset. Now, i would like to train it for custom dataset.
I have already gathered around 1200 images, it is practicly impossible to gather more. What should i do?
We are going to post a paper in a coming week(s) about stochastic deconvolutions for generator, that can improve stability and variety for such a problem. If you are interested, I can send a current version of a paper right now. But generally speaking, the idea is simple:
Build a classic GAN
For deep layers of generator (let's say for a half of them) use stochastic deconvolutions (sdeconv)
sdeconv is just a normal deconv layer, but filters are being selected on a fly randomly from a bank of filters. So your filter bank shape can be, for instance, (16, 128, 3, 3) where 16 - number of banks, 128 - number of filters in each, 3x3 - size. Your selection of a filter set at each training step is [random uniform 0-16, :, :, :]. Unselected filters remain untrained. In tensorflow you want to select different filter sets for a different images in batch as well as tf keeps training variables even if it is not asked for (we believe it is a bug, tf uses last known gradients for all variables even if they are not being used in a current dynamic sub-graph, so you have to utilize as much variables as you can).
That's it. Having 3 layers with sdeconv of 16 sets in each bank, practically you'll have 16x16x16 = 4096 combinations of different internal routes to produce an output.
How is it helping on a small dataset? - Usually small datasets have relative large "topics" variance, but generally dataset is of one nature (photos of cats: all are realistc photos, but with different types of cats). In such datasets GAN collapses very quickly, however with sdeconv:
Upper normal deconv layers learns how to reconstruct a style "realistic photo"
Lower sdevond learns sub-distributions: "dark cat", "white cat", "red cat" and so on.
Model can be seen as ensemble of weak-generators, each sub-generator is weak and can collapse, but will be "supported" by another sub-generator that temorarily outperforms discriminator.
MNIST is a great example of such a dataset: high "topics" variance, but the same style of digits.
GAN+weight norm+prelu (collapsed after 1000 steps, died after 2000, can only describe one "topic"):
GAN+weight norm+prelu+sdeconv, 4388 steps (local variety degradation of sub-topics is seen, however not collapsed globally, global visual variety preserved):
I am using word2vec model for training a neural network and building a neural embedding for finding the similar words on the vector space. But my question is about dimensions in the word and context embeddings (matrices), which we initialise them by random numbers(vectors) at the beginning of the training, like this https://iksinc.wordpress.com/2015/04/13/words-as-vectors/
Lets say we want to display {book,paper,notebook,novel} words on a graph, first of all we should build a matrix with this dimensions 4x2 or 4x3 or 4x4 etc, I know the first dimension of the matrix its the size of our vocabulary |v|. But the second dimension of the matrix (number of vector's dimensions), for example this is a vector for word “book" [0.3,0.01,0.04], what are these numbers? do they have any meaning? for example the 0.3 number related to the relation between word “book" and “paper” in the vocabulary, the 0.01 is the relation between book and notebook, etc.
Just like TF-IDF, or Co-Occurence matrices that each dimension (column) Y has a meaning - its a word or document related to the word in row X.
The word2vec model uses a network architecture to represent the input word(s) and most likely associated output word(s).
Assuming there is one hidden layer (as in the example linked in the question), the two matrices introduced represent the weights and biases that allow the network to compute its internal representation of the function mapping the input vector (e.g. “cat” in the linked example) to the output vector (e.g. “climbed”).
The weights of the network are a sub-symbolic representation of the mapping between the input and the output – any single weight doesn’t necessarily represent anything meaningful on its own. It’s the connection weights between all units (i.e. the interactions of all the weights) in the network that gives rise to the network’s representation of the function mapping. This is why neural networks are often referred to as “black box” models – it can be very difficult to interpret why they make particular decisions and how they learn. As such, it's very difficult to say what the vector [0.3,0.01,0.04] represents exactly.
Network weights are traditionally initialised to random values for two main reasons:
It prevents a bias being introduced to the model before training begins
It allows the network to start from different points in the search space after initialisation (helping reduce the impact of local minima)
A network’s ability to learn can be very sensitive to the way its weights are initialised. There are more advanced ways of initialising weights today e.g. this paper (see section: Weights initialization scaling coefficient).
The way in which weights are initialised and the dimension of the hidden layer are often referred to as hyper-parameters and are typically chosen according to heuristics and prior knowledge of the problem space.
I have wondered the same thing and put in a vector like (1 0 0 0 0 0...) to see what terms it was nearest to. The answer is that the results returned didn't seem to cluster around any particular meaning, but were just kind of random. This was using Mikolov's 300-dimensional vectors trained on Google News.
Look up NNSE semantic vectors for a vector space where the individual dimensions do seem to carry specific human-graspable meanings.
I am working on developing a object classifier by using 3 different features i.e SIFT, HISTOGRAM and EGDE.
However these 3 features have different dimensional vector e.g. SIFT = 128 dimension. HIST = 256.
Now these features cannot be concatenated into once vector due to different sizes. What I am planning to do but I am not sure if it is going to be correct way is this:
For each features i train the classifier separately and than i apply classification separately for 3 different features and than count the majority and finally declare the image with majority votes.
Do you think this is a correct way?
There are several ways to get classification results that take into account multiple features. What you have suggested is one possibility where instead of combining features you train multiple classifiers and through some protocol, arrive at a consensus between them. This is typically under the field of ensemble methods. Try googling for boosting, random forests for more details on how to combine the classifiers.
However, it is not true that your feature vectors cannot be concatenated because they have different dimensions. You can still concatenate the features together into a huge vector. E.g., joining your SIFT and HIST features together will give you a vector of 384 dimensions. Depending on the classifier you use, you will likely have to normalize the entries of the vector so that no one feature dominate simply because by construction it has larger values.
EDIT in response to your comment:
It appears that your histogram is some feature vector describing a characteristic of the entire object (e.g. color) whereas your SIFT descriptors are extracted at local interest keypoints of that object. Since the number of SIFT descriptors may vary from image to image, you cannot pass them directly to a typical classifier as they often take in one feature vector per sample you wish to classify. In such cases, you will have to build a codebook (also called visual dictionary) using the SIFT descriptors you have extracted from many images. You will then use this codebook to help you derive a SINGLE feature vector from the many SIFT descriptors you extract from each image. This is what is known as a "bag of visual words (BOW)" model. Now that you have a single vector that "summarizes" the SIFT descriptors, you can concatenate that with your histogram to form a bigger vector. This single vector now summarizes the ENTIRE image/(object in the image).
For details on how to build the bag of words codebook and how to use it to derive a single feature vector from the many SIFT descriptors extracted from each image, look at this book (free for download from author's website) http://programmingcomputervision.com/ under the chapter "Searching Images". It is actually a lot simpler than it sounds.
Roughly, just run KMeans to cluster the SIFT descriptors from many images and take their centroids (which is a vector called a "visual word") as the codebook. E.g. for K = 1000 you have a 1000 visual word codebook. Then, for each image, create a result vector the same size as K (in this case 1000). Each element of this vector corresponds to a visual word. Then, for each SIFT descriptor extracted from an image, find its closest matching vector in the codebook and increment the count in the corresponding cell in the result vector. When you are done, this result vector essentially counts how often the different visual words appear in the image. Similar images will have similar counts for the same visual words and hence this vector effectively represents your images. You will also need to "normalize" this vector to make sure that images with different number of SIFT descriptors (and hence total counts) are comparable. This can be as simple as simply dividing each entry by the total count in the vector or through a more sophisticated measure such as tf/idf as described in the book.
I believe the author also provide python code on his website to accompany the book. Take a look or experiment with them if you are unsure.
More sophisticated method for combining features include Multiple Kernel Learning (MKL). In this case, you compute different kernel matrices, each using one feature. You then find the optimal weights to combine the kernel matrices and use the combined kernel matrix to train a SVM. You can find the code for this in the Shogun Machine Learning Library.