transformer classification w/ many classes (Pytorch) - machine-learning

I have a transformer (for recommendation) and a (sequence) classification task with a lot of classes. Usually for classification tasks, probabilities or logits are calculated for all classes and the max value is selected. However, I think there would be too many parameters in this case, so I was wondering whether the following approach is valid:
I want to predict 10 classes based on an input sequence drawn from x classes
- convert all classes to ints
- build a transformer with mean pooling and a linear output layer of size 10 (then round all values), predicting the class ints directly
I am not sure about this because the ordering of the class ints has no meaning, so will the model be able to learn the correct weights?
Maybe a basic question, but I couldn't find anything on it.
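To make the comparison concrete, here is a rough sketch of the two output heads I am weighing up (all names, sizes, and hyperparameters are made up):

```python
import torch
import torch.nn as nn

class SeqModel(nn.Module):
    def __init__(self, num_classes, d_model=64, regress_ids=False):
        super().__init__()
        self.embed = nn.Embedding(num_classes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.num_classes = num_classes
        self.regress_ids = regress_ids
        # Proposed head: 10 real-valued outputs, one "class int" per slot, rounded later.
        # Usual head: logits over all classes for each of the 10 slots.
        self.head = nn.Linear(d_model, 10 if regress_ids else 10 * num_classes)

    def forward(self, x):                      # x: (batch, seq_len) of class ints
        h = self.encoder(self.embed(x))        # (batch, seq_len, d_model)
        pooled = h.mean(dim=1)                 # mean pooling over the sequence
        out = self.head(pooled)
        if self.regress_ids:
            return out                         # (batch, 10), round at inference time
        return out.view(-1, 10, self.num_classes)  # (batch, 10, num_classes) logits
```

The regression head is exactly what worries me: it forces numerically close ints to be treated as similar classes, even though the ordering of the ints carries no meaning.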

Related

Hyperparameter optimisation in Python with a separate validation set

I am trying to optimise the hyperparameters of a random forest regressor in Python.
I have 3 separate datasets: train/validate/test. Therefore, rather than using a cross-validation method, I want to use the specific validation set to tune the hyperparameters, i.e. the "First Approach" described in this stackoverflow post.
Now, sklearn has some nice built-in methods for hyperparameter optimisation using cross-validation (e.g. this tutorial), but what if I want to tune my hyperparameters with a specific validation set? Is it still possible to use a method like RandomizedSearchCV?
It is indeed possible with the cv option. As the documentation suggests, one of the possible inputs is an iterable of train/test index tuples:
An iterable yielding (train, test) splits as arrays of indices.
So, a list of size one with train and validation indices packed as a tuple would be ok.
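For example (a rough sketch, assuming X_train/y_train and X_val/y_val are already loaded as arrays):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Stack train and validation; the search will only ever evaluate on the validation part.
X = np.concatenate([X_train, X_val])
y = np.concatenate([y_train, y_val])

# One (train, test) split: training indices first, validation indices after.
train_idx = np.arange(len(X_train))
val_idx = np.arange(len(X_train), len(X))
cv = [(train_idx, val_idx)]

param_dist = {"n_estimators": [100, 300, 500], "max_depth": [None, 10, 20]}
search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                            param_dist, n_iter=5, cv=cv, random_state=0)
search.fit(X, y)
print(search.best_params_)
```

sklearn's PredefinedSplit achieves the same thing if you prefer not to build the index list by hand.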
I think we should just have some wording clarified:
'Validation set'
A validation set is used to evaluate your model on an unseen set of data, i.e. data not used for training. This simulates how your model would behave on new data. We use the validation set to tune our hyperparameters, such as the number of trees, max depth etc., and choose the hyperparameters which work best on the validation set.
'Cross-validate'
When you CV (cross-validate) with, say, 5 folds, you divide your data into 5 sets, where sets [1,2,3,4] are used for training and set 5 is used for validation. Then you use [2,3,4,5] for training and set 1 for validation; you repeat this until all sets (i.e. 5 times when using 5 folds) have been used as a validation set, and then you average your 5 validation scores (e.g. accuracy) to get one score which you (often) want to maximize.
Answer
So, to answer your question: yes, you can use GridSearchCV on your validation set, but that wouldn't often be the case. You would usually do one of the following:
a) Use a (i.e. one) validation set to tune your hyperparameters against, as explained in 'Validation set'
b) Use all your data, i.e. train+validation, as one dataset and then run a, say, 5-fold grid-CV search as explained in 'Cross-validate' (see the sketch below)
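For completeness, a rough sketch of option (b), again assuming X_train/y_train and X_val/y_val already exist as arrays:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Pool train + validation into one dataset and let 5-fold CV do the splitting.
X_all = np.concatenate([X_train, X_val])
y_all = np.concatenate([y_train, y_val])

grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    {"n_estimators": [100, 300], "max_depth": [None, 10]},
                    cv=5)
grid.fit(X_all, y_all)
print(grid.best_params_)
```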

Caffe's way of representing negative examples on benchmark dataset for binary classification

I would like to know how to define or represent a negative training set if I wanted to train a binary classifier from a pre-trained model, say AlexNet on the ILSVRC12 (or ImageNet) dataset. What I am currently thinking of is to take one of the classes which is not related as the negative training set, and the one which is related as the positive set. Is there a better, more elegant way?
The CNNs trained on the ILSVRC data set are already discriminating among 1000 classes of images. Yes, you can use one of those topologies to train a binary classifier, but I suggest that you start with an untrained model and run it through your two chosen classes. If you start with a trained model, you have to unlearn a lot, and your result is still trying to discriminate among 1000 classes: that last FC layer is going to give you trouble.
There are ways to work around the 1000-class problem. If your application already overlaps one or more of the trained classes, then simply add a layer that maps those classes to label "1" and all the others to label "0".
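As a rough illustration of this mapping idea in plain Python (not Caffe), assuming you can get the 1000-class argmax predictions out of the network; the class ids below are placeholders, not a real mapping:

```python
import numpy as np

# Hypothetical set of ImageNet class indices that overlap with the "positive" concept.
positive_ids = {281, 282, 283}            # placeholder ids
pred_ids = np.array([281, 7, 283, 999])   # 1000-class argmax predictions from the net

# Map overlapping classes to label 1, everything else to label 0.
binary_labels = np.isin(pred_ids, list(positive_ids)).astype(int)
print(binary_labels)  # -> [1 0 1 0]
```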
If you're insistent on retaining the trained kernels, then try replacing the final FC layer (1000) with a 2-class FC layer. Then choose your two classes (applicable images vs everything else) and run your training.
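The answer above is about Caffe; just to illustrate the same "keep the trained kernels, swap the final 1000-class FC layer for a 2-class one" idea, here is a rough sketch in PyTorch/torchvision (a different framework, not the Caffe recipe itself):

```python
import torch.nn as nn
from torchvision.models import alexnet, AlexNet_Weights

# Load an AlexNet pre-trained on ImageNet (ILSVRC12).
model = alexnet(weights=AlexNet_Weights.IMAGENET1K_V1)

# Optionally freeze the convolutional kernels so only the new head is trained.
for p in model.features.parameters():
    p.requires_grad = False

# Replace the final 1000-class FC layer with a 2-class one (positive vs. everything else).
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 2)
```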

Baseline classification accuracy with Mahout

In his Data Mining with Weka class, Prof. Witten stresses the importance of checking your classifier against simpler ones, like the ZeroR classifier which picks the most common class (if your fancy machine learning algorithm is barely beating ZeroR's accuracy, it's probably not working very well).
Is there a way to check the baseline accuracy of a classifier built with Apache Mahout, either using ZeroR or something else?
Take your data, count how often the classes occur.
And that's literally what ZeroR does. Since it is so simple, I don't think Mahout includes it in its framework.
Writing a MapReduce job to do this is rather simple:
Mapper:
emit the class as key and 1 as value (let the mapper pre-aggregate the count over its whole input for network efficiency, or use a combiner)
Reducer:
sum over all keys, take the max, and divide by the sum over all classes
Then you would know what baseline accuracy you would get from predicting the majority class.
The Spark implementation is similar:
Group by the class, count per class, and divide by the sum over all classes. Pick the max; that's the baseline.
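For example, a rough PySpark sketch of that baseline computation (the toy labels below are made up; `labels` stands in for your class column):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
labels = sc.parallelize(["spam", "ham", "ham", "ham", "spam"])  # toy data

# Count occurrences per class, then take the most frequent one.
counts = labels.map(lambda c: (c, 1)).reduceByKey(lambda a, b: a + b)
majority_class, majority_count = counts.max(key=lambda kv: kv[1])

baseline_accuracy = majority_count / labels.count()
print(majority_class, baseline_accuracy)  # ham 0.6
```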

Methods to ignore missing word features on test data

I'm working on a text classification problem, and I have problems with missing values on some features.
I'm calculating class probabilities of words from labeled training data.
For example:
Suppose the word foo belongs to class A 100 times and to class B 200 times. In this case, I find the class probability vector to be [0.33, 0.67], and give it, along with the word itself, to the classifier.
The problem is that, in the test set, there are some words that have not been seen in the training data, so they have no probability vectors.
What could I do about this problem?
I've tried giving the average class probability vector over all words for missing values, but it did not improve accuracy.
Is there a way to make the classifier ignore some features during evaluation, just for specific instances which do not have a value for the given feature?
Regards
There are many ways to achieve that:
Create and train classifiers for all the feature subsets you have. You can train each sub-classifier on the same data as the training of the main classifier, restricted to its feature subset.
For each sample, just look at the features it has and use the classifier that fits it best. Don't try to do boosting with those classifiers.
Alternatively, just create a special class for samples that can't be classified, or for which your experiments show results that are too poor with so few features.
Sometimes humans can't successfully classify samples either. In many cases, samples that can't be classified should just be ignored; the problem is not in the classifier but in the input, or can be explained by the context.
From an NLP point of view, many words have a meaning/usage that is very similar across many applications, so you can use stemming/lemmatization to create classes of words (a rough sketch of this fallback is below).
You can also use syntactic corrections, synonyms, and translations (does the word come from another part of the world?).
If this problem is important enough for you, then you will end up with a combination of the three previous points.
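As a rough sketch of the probability-vector computation with a lemma fallback for unseen words (the toy data and the stand-in "lemmatizer" below are made up):

```python
from collections import Counter, defaultdict

# Toy training data: (word, class) pairs, mirroring the foo example above.
train = [("foo", "A")] * 100 + [("foo", "B")] * 200 + [("bar", "A")] * 50
classes = ["A", "B"]

counts = defaultdict(Counter)
for word, label in train:
    counts[word][label] += 1

def class_probs(word, lemmatize=lambda w: w.rstrip("s")):  # stand-in lemmatizer
    c = counts.get(word) or counts.get(lemmatize(word))
    if c is None:
        return None  # still unseen: ignore the feature or map it to a special class
    total = sum(c.values())
    return [c[k] / total for k in classes]

print(class_probs("foo"))   # [0.333..., 0.666...]
print(class_probs("bars"))  # unseen, but its lemma "bar" was seen: [1.0, 0.0]
print(class_probs("qux"))   # None: genuinely unknown word
```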

Classification in weka fails, caused by case sensitiveness of nominal values?

I made a classifier to classify search queries into one of the following classes: {Artist, Actor, Politician, Athlete, Facility, Geo, Definition, QA}. I have two CSV files: one for training the classifier (containing 300 queries) and one for testing it (currently containing about 200 queries). When I use the training set and test set for training/evaluating the classifier with the Weka KnowledgeFlow, most classes reach pretty good accuracy. Setup of the Weka KnowledgeFlow training/testing situation:
After training, I saved the MultilayerPerceptron classifier from the KnowledgeFlow into classifier.model, which I used in Java code to classify queries.
When I deserialize this model in the Java code and use it to classify all the queries of the test-set CSV file (using the distributionForInstance() method on the deserialized classifier), it classifies all 'Geo' queries as 'Facility' queries and all 'QA' queries as 'Definition' queries. This surprised me a bit, as the ClassifierPerformanceEvaluator in the KnowledgeFlow showed me a confusion matrix in which 'Geo' and 'QA' queries scored really well, and the test queries are the same (the same CSV file was used). All other query classifications using the distributionForInstance() method seem to work normally and show the behaviour that could be expected from the confusion matrix in the KnowledgeFlow. Does anyone know possible causes for the classification difference between the distributionForInstance() method in the Java code and the KnowledgeFlow evaluation results?
One thing that I can think of is the following:
The testing CSV file contains, among other attributes, a lot of nominal-value attributes in all-capital casing. When I print out the values of all attributes of the instances before classification in the Java code, these values seem to have been converted to lowercase letters (it seems like the DataSource.getDataSet() method behaves like this). Could it be that the casing of these attributes is the cause that some instances of my testing CSV file get classified differently? I read in the Weka specification that nominal-value attributes are case sensitive. I do change these values back to uppercase in the Java code, though, as Weka otherwise throws an exception that these values are not predefined for the nominal attribute.
Weka is likely using the same class in the KnowledgeFlow as in your Java code to interpret the CSV. This is why it works (produces data sets -- Instances objects -- that match) without tweaking, and fails when you change things: the items don't match any more. That is to say, Weka is handling the case of the input strings consistently and does not require you to change it.
Check that you are looking at the Error on Test Data value and not the Error on Training Data value in the knowledge flow output, because the second one will be artificially high given that you built the model using those exact examples. It is possible that your classifier is performing the same in both places, but you are looking at different statistics.
