Differences between Training Data and Vocabulary - Bag Of Words - opencv

When creating a Bag of Words model, you need to create a Vocabulary to give to the BOWImgDescriptorExtractor, which you then use on the images you wish to input. This creates the Testing Data.
So where does the Training Data come from, and where do you use it?
What's the difference between Vocabulary and Training Data?
Isn't the Vocabulary the same thing as the Training Data?

Training data is the set of images you collected for your application; it is the input of the BOWTrainer, and the vocabulary is the output of the BOWTrainer. Once you have the vocabulary, you can extract features of images using BOWImgDescriptorExtractor with the words defined in the vocabulary.
An image can be described by tons of features (words); however, only some of them are important. The first job is to find those important words, that is, to train a vocabulary. After the vocabulary is obtained, images can be described more precisely.
So where does the Training Data come from, and where do you use it?
You should provide the Training data and use it to train the vocabulary with BOWTrainer. The Training data is a set of images (descriptors); what it contains depends on your application domain.
What's the difference between Vocabulary and Training Data?
Vocabulary is cooked, while training data is raw, unorganized.
Isn't the Vocabulary the same thing as the Training Data?
No.

There is an add function that is used to specify the training data; see the docs on the OpenCV BOW module.
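For concreteness, here is a minimal sketch of that workflow using the OpenCV Python bindings (the question does not name a language; the image paths and the cluster count are placeholders):

    import cv2

    sift = cv2.SIFT_create()                 # on older builds: cv2.xfeatures2d.SIFT_create()
    bow_trainer = cv2.BOWKMeansTrainer(100)  # vocabulary of 100 visual "words"

    # 1) Training data: raw descriptors extracted from your own training images.
    for path in ["train_01.jpg", "train_02.jpg"]:           # hypothetical files
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, descriptors = sift.detectAndCompute(img, None)
        if descriptors is not None:
            bow_trainer.add(descriptors)                     # add() supplies the training data

    # 2) Vocabulary: the clustered words produced from that training data.
    vocabulary = bow_trainer.cluster()

    # 3) Describe any image as a histogram over the vocabulary words.
    bow_extractor = cv2.BOWImgDescriptorExtractor(sift, cv2.BFMatcher(cv2.NORM_L2))
    bow_extractor.setVocabulary(vocabulary)
    test_img = cv2.imread("test_01.jpg", cv2.IMREAD_GRAYSCALE)
    bow_descriptor = bow_extractor.compute(test_img, sift.detect(test_img, None))

The training images go in at step 1, the vocabulary comes out at step 2, and only then can step 3 turn any image (training or testing) into a fixed-length descriptor.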

Data augmentation for text classification

What is the current state-of-the-art data augmentation technique for text classification?
I did some research online on how I can extend my training set by applying data transformations, the same way we do in image classification.
I found some interesting ideas such as:
Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random place in the sentence. Do this n times.
Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times.
Random Deletion: Randomly remove each word in the sentence with probability p. (A rough sketch of the last two operations is shown below.)
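For illustration, a rough Python sketch of the swap and deletion operations above (synonym replacement and insertion would additionally need a thesaurus such as WordNet):

    import random

    def random_swap(words, n=1):
        # Swap the positions of two randomly chosen words, n times.
        words = words[:]
        for _ in range(n):
            i, j = random.sample(range(len(words)), 2)
            words[i], words[j] = words[j], words[i]
        return words

    def random_deletion(words, p=0.1):
        # Remove each word independently with probability p, keeping at least one word.
        kept = [w for w in words if random.random() > p]
        return kept if kept else [random.choice(words)]

    sentence = "the quick brown fox jumps over the lazy dog".split()
    print(random_swap(sentence, n=2))
    print(random_deletion(sentence, p=0.2))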
But I found nothing about using a pre-trained word vector representation model such as word2vec. Is there a reason?
Data augmentation using word2vec might help the model get more data based on external information. For instance, randomly replacing a token of a toxic comment with its closest token in a pre-trained vector space trained specifically on external online comments.
Is it a good method, or am I missing some important drawbacks of this technique?
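A hedged sketch of that idea with gensim (the model file name is a placeholder; any pre-trained word2vec vectors, ideally trained on online comments as suggested above, would do):

    import random
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("comments_word2vec.bin", binary=True)  # hypothetical file

    def embedding_replace(words, n=1):
        # Replace n randomly chosen in-vocabulary words with their nearest neighbour.
        words = words[:]
        candidates = [i for i, w in enumerate(words) if w in vectors]
        for i in random.sample(candidates, min(n, len(candidates))):
            words[i] = vectors.most_similar(words[i], topn=1)[0][0]
        return words

    print(embedding_replace("you are such an idiot".split(), n=1))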
Your idea of using word2vec embeddings usually helps. However, that is a context-free embedding. To go one step further, the state of the art (SOTA) as of today (2019-02) is to use a language model trained on a large corpus of text and fine-tune your own classifier with your own training data.
The two SOTA models are:
GPT-2 https://github.com/openai/gpt-2
BERT https://github.com/google-research/bert
The data augmentation methods you mentioned might also help (depending on your domain and the number of training examples you have). Some of them are actually used in language model training (for example, BERT has a task that randomly masks out words in a sentence at pre-training time). If I were you, I would first adopt a pre-trained model and fine-tune my own classifier with the current training data. Taking that as a baseline, you could try each of the data augmentation methods you like and see if they really help.
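For reference, a minimal sketch of the "pre-trained model plus fine-tuned classifier" route using the Hugging Face transformers library (an assumption on my part; the links above point to the original GPT-2 and BERT releases):

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    # Hypothetical two-class toy data; in practice use your own labelled comments.
    texts = ["what a helpful explanation", "you are such an idiot"]
    labels = torch.tensor([0, 1])

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # One illustrative fine-tuning step; a real run loops over many batches and epochs.
    model.train()
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()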

Neural Network training data normalisation vs. runtime input data

I'm starting to learn about neural networks and I came across data normalisation. I understand the need for it but I don't quite know what to do with my data once my model is trained and in the field.
Let's say I take my input data, subtract its mean, and divide by the standard deviation. Then I use that as the input to train my neural network.
Once in the field, what do I do with my input sample on which I want a prediction?
Do I need to keep my training data mean and standard deviation and use that to normalise?
Correct. The mean and standard deviation that you use to normalize the training data will be the same ones that you use to normalize the testing data (i.e., don't compute a separate mean and standard deviation for the test data).
Hopefully this link will give you more helpful info: http://cs231n.github.io/neural-networks-2/
An important point to make about the preprocessing is that any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation / test data. E.g. computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake. Instead, the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test).
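A minimal NumPy sketch of this (the training data here is a random placeholder): the statistics are computed on the training set only, saved, and reused on every new sample at run time.

    import numpy as np

    X_train = np.random.randn(1000, 3) * 5.0 + 2.0       # placeholder training inputs
    train_mean = X_train.mean(axis=0)
    train_std = X_train.std(axis=0)

    X_train_norm = (X_train - train_mean) / train_std     # used to train the network

    # In the field: normalise a new sample with the SAVED training statistics.
    x_new = np.array([1.7, -0.3, 4.2])
    x_new_norm = (x_new - train_mean) / train_std

The same applies if you use something like scikit-learn's StandardScaler: fit it on the training data and only call transform on new data.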

How to train a caffemodel on our own dataset?

I tried using the pre-trained bvlc_reference_caffenet.caffemodel for object recognition from images. I got good results for images containing only a single object. For images with multiple objects, I removed the argmax() term from prediction, which gives the class label with the maximum probability.
Still, the accuracy of the labels I am getting is very low. So I am thinking of training the same caffemodel on my own dataset (containing images with multiple objects). How should I proceed? Is there any way to retrain a pre-trained caffemodel with a different dataset?
What you are after is called "finetuning": taking a deep net trained for task A, reusing its weights and re-train it to accomplish task B.
You can start with this tutorial, but you will find much more information simply by googling "finetune caffe model".
You may also be interested in this post regarding training caffe with multiple categories per input image.
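As a rough sketch of what fine-tuning looks like through the pycaffe interface (the solver and weight paths are placeholders; the solver's network definition should have its final layer renamed and resized for your own classes, as the tutorial describes):

    import caffe

    caffe.set_mode_gpu()

    solver = caffe.SGDSolver("models/my_finetune/solver.prototxt")   # hypothetical path
    # Initialise from the pre-trained CaffeNet; layers whose names match are copied over.
    solver.net.copy_from("models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel")
    solver.solve()  # runs the training loop defined in the solver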

Algorithm for Multi-Class Classification of News Article

I want to classify news articles into the category they belong to. I have 4 categories of news, e.g. Technology, Sports, Politics and Health, and I have collected around 50 documents for each category as a training set.
Is the training data enough for classification? And which algorithm should I use for classification: SVM, Random Forest, kNN?
I am using Scikit-learn http://scikit-learn.org/ [python] library for my task
Thanks
There are many ways to attack this problem, from CRFs to Random Forests.
With your limited training data, I would suggest going with a model with high bias such as the linear SVM. Start by training one-vs-all models for each class and predicting the class with the highest probability. This will give you a baseline for how hard your problem is with the given training data.
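Since the question mentions scikit-learn, here is a minimal sketch of that one-vs-rest linear SVM baseline on TF-IDF features (the toy texts below stand in for your 50 documents per category):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    train_texts = [
        "new smartphone chip announced",     # Technology
        "team wins the championship game",   # Sports
        "parliament passes a new bill",      # Politics
        "study links diet to heart health",  # Health
    ]
    train_labels = ["Technology", "Sports", "Politics", "Health"]

    clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                        OneVsRestClassifier(LinearSVC()))
    clf.fit(train_texts, train_labels)
    print(clf.predict(["the senator gave a speech on the new health policy"]))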
I would suggest you use Naive Bayes classification. There is a tool called LingPipe where this is already implemented. What you want to do is just refer to
http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html
There you have a small sample program, Classifynews.java. Run that program by training on the data and then applying the test data. A sample training dataset is the "20 newsgroups" corpus:
http://qwone.com/~jason/20Newsgroups/
Training is done on the training data; if needed, you can build an intermediate model and then apply the test data to that model. Naive Bayes works well when the training data is small.
But its accuracy increases as the size of the training data increases, so try to include more newsgroups. Good luck. Try this and let me know.
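In scikit-learn (which the question already uses) rather than LingPipe, the same Naive Bayes idea can be tried directly on the 20 newsgroups corpus mentioned above; a small sketch:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train = fetch_20newsgroups(subset="train")
    test = fetch_20newsgroups(subset="test")

    clf = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
    clf.fit(train.data, train.target)
    print(accuracy_score(test.target, clf.predict(test.data)))

Swapping in your own four-category articles is just a matter of replacing train.data / train.target with your documents and labels.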

Representing documents in vector space model

I have a very fundamental question. I have two sets of documents, one for training and one for testing. I would like to train a Logistic regression classifier with the training documents. I want to know if I'm doing the right thing.
First, find the list of all unique words in the training documents and call it the vocabulary.
For each word in the vocabulary, find its TFIDF in every training document. A document is then represented as vector of these TFIDF scores.
My question is:
1. How do I represent the test documents? Say one of the test documents does not have any word that is in the vocabulary. In that case, the TFIDF scores will be zero for all words in the vocabulary for that document.
I'm trying to use LIBSVM which uses the sparse vector format. For the case of the above document, which has all entries set to 0 in its vector representation, how do I represent it?
You have to store enough information about the training corpus to do the TF-IDF transform on unseen documents. This means you'll need the document frequencies of the terms in the training corpus. Ignoring unseen words in test docs is fine; your SVM won't learn a weight for them anyway. Note that unseen terms should be rare in the test corpus if your training and test distributions are similar, so even if a few terms are dropped, you'll still have plenty of terms to classify the doc.
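A short sketch of that train/test split with scikit-learn's TfidfVectorizer (toy documents; LIBSVM would consume the resulting sparse rows, and an all-zero row is simply a label with no feature entries):

    from sklearn.feature_extraction.text import TfidfVectorizer

    train_docs = ["the cat sat on the mat", "dogs and cats are friends"]
    test_docs = ["a zebra appeared"]                  # only words unseen in training

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_docs)    # vocabulary + document frequencies from TRAIN only
    X_test = vectorizer.transform(test_docs)          # unseen words are silently dropped

    print(X_test.nnz)  # 0: the test document becomes an all-zero sparse vector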
