How much data is required to train SyntaxNet? - machine-learning

I know that more data is better, but what would be a reasonable amount of data required to train SyntaxNet?

Based on some trial and error, I have arrived at the following minimums:
Train corpus - 18,000 tokens (anything less than that and step 2 - Preprocessing with the Tagger - fails)
Test corpus - 2,000 tokens (anything less than that and step 2 - Preprocessing with the Tagger - fails)
Dev corpus - 2,000 tokens
But please note that with this I have only managed to get the steps in the NLP pipeline to run; I haven't actually managed to get anything usable out of it.
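If you want to check whether your own corpora clear these thresholds, a rough token count is enough. Here is a minimal sketch, assuming CoNLL-style input (one token per non-blank line, comment lines starting with '#') and placeholder file names:

```python
# Rough token count for a CoNLL-style corpus: one token per non-blank,
# non-comment line (comment lines start with '#').
def count_tokens(path):
    tokens = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                tokens += 1
    return tokens

# Hypothetical file names; compare against the minimums listed above.
for name, minimum in [("train.conllu", 18000), ("test.conllu", 2000), ("dev.conllu", 2000)]:
    n = count_tokens(name)
    print(f"{name}: {n} tokens ({'OK' if n >= minimum else 'below minimum'})")
```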

Why is using batches to predict when applying Batch Normalization cheating?

In a post on Quora, someone says:
At test time, the layer is supposed to see only one test data point at
a time, hence computing the mean / variance along a whole batch is
infeasible (and is cheating).
But as long as the testing data has not been seen by the network during training, isn't it OK to use several test images?
I mean, our network has been trained to predict using batches, so what is the issue with giving it batches?
If someone could explain what information our network gets from batches that it is not supposed to have, that would be great :)
Thank you
But as long as the testing data has not been seen by the network during training, isn't it OK to use several test images?
First of all, it's OK to use batches for testing. Second, in test mode, batch norm doesn't compute the mean or variance of the test batch. It uses the mean and variance it already has (call them mu and sigma**2), which were computed solely from the training data. The result of batch norm in test mode is that every tensor x is normalized to (x - mu) / sigma.
At test time, the layer is supposed to see only one test data point at a time, hence computing the mean / variance along a whole batch is infeasible (and is cheating)
I just skimmed through the Quora discussion; maybe this quote has a different context. But taken on its own, it's simply wrong. No matter what the batch is, all tensors go through the same transformation, because mu and sigma are not changed during testing, just like all other variables. So there's no cheating there.
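To see why the batch composition cannot leak anything at test time, here is a minimal NumPy sketch of inference-mode batch norm (the learned scale and shift are omitted for brevity, and the mu/sigma values are made up): every example is transformed with the frozen training statistics, independently of whatever else is in the batch.

```python
import numpy as np

# Frozen statistics accumulated during training (illustrative values).
mu, sigma = 0.5, 2.0

def batchnorm_test(x, mu, sigma, eps=1e-5):
    # Inference-mode batch norm: uses stored mu/sigma, never the batch's own statistics.
    return (x - mu) / np.sqrt(sigma**2 + eps)

single = np.array([[1.0, -2.0]])
batch  = np.array([[1.0, -2.0], [10.0, 3.0], [-7.0, 0.0]])

# The first example gets exactly the same output whether it is alone or in a batch.
print(batchnorm_test(single, mu, sigma)[0])
print(batchnorm_test(batch,  mu, sigma)[0])
```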
The claim is very simple: you train your model so that it is useful for some task. In classification the task is usually that you get a single data point and you output its class; there is no batch. Of course, in some practical applications you can have batches (say, many images from the same user, etc.). So it is application dependent: if you want to claim something "in general" about the learning model, you cannot assume that one has access to batches at test time, that's all.

Overfitting my model on training data consisting of a single sample

I am trying to overfit my model on training data that consists of only a single sample. The training accuracy comes out to be 1.00, but when I predict the output for my test data, which consists of the same single training sample, the results are not accurate. The model has been trained for 100 epochs and the loss is ~1e-4.
What could be the possible sources of error?
As mentioned in the comments of your post, it isn't possible to give specific advice without you first providing more details.
Generally speaking, your approach to overfitting a tiny batch (in your case one image) is in essence providing three sanity checks, i.e. that:
backprop is functioning
the weight updates are doing their job
the learning rate is in the correct order of magnitude
As Andrej Karpathy points out in Lecture 5 of the CS231n course at Stanford: "if you can't overfit on a tiny batch size, things are definitely broken".
This suggests, given your description, that your implementation is incorrect. I would start by checking each of the three points listed above. For example, alter your test by picking several different images, or a batch size of 5 images instead of one. You could also revise your predict function, as that is where there is clearly some discrepancy, given that you are getting zero error during training (and presumably during validation).
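As an illustration of that sanity check, here is a minimal sketch assuming PyTorch and a made-up 10-feature, 5-class sample: it overfits the single sample and then verifies that the prediction path reproduces the training label. If the training loss reaches ~0 but this final check fails, the problem is usually in the predict/preprocessing code rather than in backprop itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One made-up sample: 10 features, class label 3 out of 5 classes.
x = torch.randn(1, 10)
y = torch.tensor([3])

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 5))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Overfit the single sample: the loss should drop to ~0 within a few hundred steps.
model.train()
for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

# Prediction on the very same sample must now reproduce the label.
model.eval()
with torch.no_grad():
    pred = model(x).argmax(dim=1)
print(float(loss), int(pred))  # loss near 0, pred == 3
```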

Spam classification - machine learning

I have to build a spam detection application using a few classifiers (e.g. Naive Bayes, SVM and one more) and compare their performance, but unfortunately I don't know exactly what I should do.
Is this correct?
First, I should get a spam corpus such as TREC 2005, SpamAssassin or Enron-Spam.
Then I do text pre-processing such as stemming, stop-word removal, tokenization, etc.
After that I can weight my features/terms in the spam e-mails using tf-idf.
Next I remove the features with very low and very high frequencies.
And then I can classify my e-mails. Right?
After that I can measure my correct classifications by true positives, false positives, etc.
Is 10-fold cross-validation required for anything? How should I use it?
Could you tell me whether these steps for e-mail classification are OK?
If not, please explain what the correct steps for spam classification are.
Here are, roughly, the steps you need to build a spam classifier:
1- Input: a labeled training set that contains enough samples of spam and legitimate e-mails
2- Feature extraction: convert your e-mail text into useful features for training, e.g. stemming, stop-word removal, word frequencies. Then evaluate these features (i.e. apply an attribute selection method) to select the most significant ones.
3- If you have a large enough dataset, split it into training, validation and test sets. If not, you can use your entire dataset for training and do cross-validation to evaluate the classifier's performance.
4- Train your classifier and either use the test dataset to evaluate its performance or do cross-validation.
5- Use the trained model to classify new e-mails. Done.
The purpose of cross-validation is to evaluate your model's performance on new/unseen data. So if you have an independent test dataset, you might not need cross-validation at all, because you can evaluate the model's performance on that test set. However, when your dataset is small you can divide it into subsets (e.g. 10 folds) and repeat the training 10 times; each time you train on 90% of your data and test on the remaining 10%.
You will end up with 10 estimates of the classifier's error; average them to get the mean squared or absolute error.
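As a rough sketch of steps 2-4, assuming scikit-learn and a list of raw e-mail strings with 0/1 labels (load_my_corpus is a hypothetical loader, and the min_df/max_df thresholds are placeholders), tf-idf weighting, frequency cut-offs, and 10-fold cross-validation can be combined like this:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# emails: list of raw message strings, labels: 0 = ham, 1 = spam.
emails, labels = load_my_corpus()  # hypothetical loader for TREC/SpamAssassin/Enron data

for name, clf in [("Naive Bayes", MultinomialNB()), ("Linear SVM", LinearSVC())]:
    pipe = Pipeline([
        # tf-idf weighting with stop-word removal; min_df/max_df drop very rare
        # and very frequent terms (the frequency cut-offs from the question).
        ("tfidf", TfidfVectorizer(stop_words="english", min_df=5, max_df=0.5)),
        ("clf", clf),
    ])
    scores = cross_val_score(pipe, emails, labels, cv=10, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```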

Evaluating recommenders - unable to recommend in x cases

I'm exploring some of the code examples in Mahout in Action in more detail. I have built a small test that computes the RMS of various algorithms applied to my data.
Of course, multiple parameters impact the RMS, but I don't understand the "unable to recommend in ... cases" message that is generated while running an evaluation.
Looking at StatsCallable.java, this is generated when an evaluator encounters a NaN response; perhaps there is not enough data in the training set or in the user's preferences to provide a recommendation.
It seems like the RMS score isn't impacted by a very large set of "unable to recommend" cases. Is that assumption correct? Should I be evaluating my algorithm not only on RMS but also on the ratio of "unable to recommend" cases to my overall training set?
I'd appreciate any feedback.
Yes, this essentially means there was no data at all on which to base an estimate. That's generally a symptom of data sparseness. It should be rare, and happen only for users whose data is very small or disconnected from everyone else's.
I personally think it's not such a big deal unless it's a really significant percentage (20%+?). I'd worry more if you couldn't generate any recommendations at all for many users.
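One way to make that trade-off explicit is to report coverage alongside the error. Here is a minimal NumPy sketch, assuming the evaluator marks the "unable to recommend" cases as NaN (the numbers are made up):

```python
import numpy as np

# Predicted ratings, with NaN where the recommender could not produce an estimate.
predicted = np.array([4.1, np.nan, 3.2, 5.0, np.nan, 2.8])
actual    = np.array([4.0, 3.5,    3.0, 4.5, 2.0,    3.0])

covered = ~np.isnan(predicted)
rmse = np.sqrt(np.mean((predicted[covered] - actual[covered]) ** 2))
coverage = covered.mean()

# RMSE only reflects the cases that *could* be scored; coverage shows how many were skipped.
print(f"RMSE = {rmse:.3f}, coverage = {coverage:.0%}")
```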

Where does the verification data go when training an ANN?

The need to use part of the training set as verification data is straightforward, but I am not really clear on how and at what stage of the training it should be incorporated.
Is it at the end of training (after reaching a good minimum for the training data)? If so, what should be done if the verification data yields a large error?
Is it throughout the training (keep looking for a minimum while the errors for both the training and verification data aren't satisfactory)?
No matter what I try, it seems that the network has trouble learning both the training and verification data once the verification set reaches a certain size (I recall reading somewhere that 70% training / 30% verification is a common ratio; I get stuck at a much smaller one), while it has no problem learning the same data when it is used entirely for training.
The important thing is that your verification set must have no feedback on the training. You can plot the error rate on the verification set, but the training algorithm can only use the error rate on the training set to correct itself.
The validation data set is mostly used for early stopping:
1. Train the network for epoch i on the training data. Let the training error be e(t, i).
2. Evaluate the network on the validation set. Let that error be e(v, i).
3. If e(v, i) > e(v, i-1), stop training. Otherwise go to step 1.
So it helps you see when the network overfits, which means that it models the specifics of the training data too much. The idea is that with an ANN you want to achieve good generalization from the training data to unseen data; the validation set helps you determine the point at which the network specializes too much on the training data.
This indicates over-training. I advise checking the verification set's MSE during training; see the Overtraining Caution System of FannTool:
http://fanntool.googlecode.com/files/FannTool_Users_Guide.zip
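Here is a minimal sketch of the early-stopping loop described above; the four callables (train_one_epoch, validation_error, get_weights, set_weights) are assumptions standing in for your own training code, not a real library API:

```python
def early_stopping_train(train_one_epoch, validation_error, get_weights, set_weights,
                         max_epochs=1000, patience=1):
    """Run training epochs until the validation error stops improving.

    Assumed callables: train_one_epoch() does one pass of weight updates on the
    training set, validation_error() computes e(v, i) without updating weights,
    get_weights()/set_weights() snapshot and restore the network parameters.
    """
    best_val, best_weights, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch()                 # step 1: training pass, gives e(t, i)
        val_err = validation_error()      # step 2: e(v, i), no feedback into training
        if val_err < best_val:
            best_val, best_weights, bad_epochs = val_err, get_weights(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:    # step 3: stop once e(v, i) keeps rising
                break
    set_weights(best_weights)             # roll back to the best validation point
    return best_val
```

In practice a patience larger than 1 is common, since the validation error rarely decreases monotonically.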
