WEKA LibSVM doesn't work with my dataset - machine-learning

The LibSVM on WEKA isn't loading with my dataset.
I am using WEKA and LibSVM. Every time I open my dataset and then try to choose an algorithm, the LibSVM algorithm isn't enabled (the option is grayed out). But if, for instance, I load the weather.arff example dataset that comes with WEKA, then the LibSVM algorithm works.
I don't know if there is anything wrong with my dataset. Are there any limitations that I should be aware of when dealing with LibSVM? For instance, number of attributes, etc.
The strange thing is that when I run my dataset with the SMO algorithm that comes with WEKA, it works without any problem.
In my dataset I have 76 attributes and my class attribute has 100 possible values.
Am I doing anything wrong? Thanks, very appreciated.

Your dataset doesn't match the input format required by LibSVM. The capabilities are as follows:
CAPABILITIES
Class -- Nominal class, Missing class values, Binary class
Attributes -- Empty nominal attributes, Nominal attributes, Unary attributes, Binary attributes, Date attributes, Numeric attributes
Additional
min # of instances: 1
So the class in your .arff file should be either nominal or binary (missing class values are allowed), and your attributes should be nominal, unary, binary, date or numeric (empty nominal attributes are allowed).
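A quick way to check how the class is declared, sketched in Python. This is not part of WEKA: the helper names are hypothetical, the class is assumed to be the last attribute in the ARFF header, and quoted attribute names containing spaces are not handled.

```python
import re

def arff_attribute_types(arff_text):
    """Return (name, kind) pairs parsed from the @attribute lines of an ARFF header."""
    attributes = []
    for line in arff_text.splitlines():
        line = line.strip()
        # The header ends where the data section begins.
        if line.lower().startswith("@data"):
            break
        match = re.match(r"@attribute\s+(\S+)\s+(.+)", line, re.IGNORECASE)
        if not match:
            continue
        name, declared = match.group(1), match.group(2).strip()
        # Nominal attributes list their values in braces, e.g. {yes,no}.
        kind = "nominal" if declared.startswith("{") else declared.split()[0].lower()
        attributes.append((name, kind))
    return attributes

def class_is_nominal(arff_text):
    """Assumption: the class attribute is the last one declared."""
    attributes = arff_attribute_types(arff_text)
    return bool(attributes) and attributes[-1][1] == "nominal"

sample = """@relation demo
@attribute width numeric
@attribute label {a,b,c}
@data
1.0,a
"""
print(arff_attribute_types(sample))
print(class_is_nominal(sample))
```

If the last attribute comes back as numeric rather than nominal, that would be consistent with the class capabilities listed above not being met.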

Related

Transformer classification w/ many classes (PyTorch)

I have a transformer (for rec.) and a (sequence) classification task with a lot of classes. Usually, for classification tasks, probabilities or logits for all classes are calculated and the max value is selected. However, I think there will be too many parameters in this case, so I was wondering if the following approach is valid:
I want to predict 10 classes based on an input sequence of x classes.
- convert all classes to ints
- build a transformer with mean pooling and a linear output layer of size 10 (then round all values), predicting the classes directly
I am not sure, because the ordering of the class ints has no meaning, so will the model be able to learn correct weights?
Maybe a basic question, but I couldn't find anything on that.

In Weka - do not normalize all the numeric attributes

I am working on the KDD99 dataset using WEKA. There are three types of attributes in the dataset: nominal, binary and numeric. But WEKA also treats binary data as numeric.
I tried to use the Unsupervised-attribute-Normalize tool to normalize the data. However, it also normalizes the binary data. I have two questions here.
Do I need to normalize the binary attributes? Binary data is not continuous, after all.
If I do not need to normalize the binary attributes, how can I select attributes in the Normalize tool in WEKA? The Normalize tool always applies to all the numeric attributes (including the binary attributes).
Thanks!
Weka has interpreted the binary attributes from your input file as numeric because their values are all numbers (i.e. 0 and 1), but if you're going to use classifiers that can handle nominal attributes you probably want to convert the binary attributes into nominal ones instead.
You can do this with the weka.filters.unsupervised.attribute.Discretize filter. Just specify the numeric indices of the attributes that are binary and specify the number of bins to be 2.
This will give you attributes with nominal value labels of (-inf-0.5] and (0.5-inf), but if you'd rather see them as 0 and 1 you can rename the values using weka.filters.unsupervised.attribute.RenameNominalValues.
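To see why two bins on 0/1 data produce those labels, here is a minimal Python sketch of equal-width binning followed by a rename. It only mimics the idea and is not Weka's implementation; the function name and the rename mapping are made up for illustration.

```python
def discretize_two_bins(values):
    """Equal-width binning into two bins; returns one nominal label per value."""
    low, high = min(values), max(values)
    cut = (low + high) / 2  # for 0/1 data the cut point is 0.5
    lower_label = f"(-inf-{cut:g}]"
    upper_label = f"({cut:g}-inf)"
    return [lower_label if v <= cut else upper_label for v in values]

# A binary attribute that was read in as numeric.
binary_column = [0, 1, 1, 0]
labels = discretize_two_bins(binary_column)
print(labels)

# Optional rename step, so the nominal values read as 0 and 1 again.
rename = {"(-inf-0.5]": "0", "(0.5-inf)": "1"}
print([rename[label] for label in labels])
```

Once the binary columns are nominal, a filter that only touches numeric attributes leaves them alone.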

What splitting criterion does Random Tree in Weka 3.7.11 use for numerical attributes?

I'm using RandomForest from Weka 3.7.11 which in turn is bagging Weka's RandomTree. My input attributes are numerical and the output attribute(label) is also numerical.
When training the RandomTree, K attributes are chosen at random for each node of the tree. Several splits based on those attributes are attempted and the "best" one is chosen. How does Weka determine what split is best in this (numerical) case?
For nominal attributes I believe Weka is using the information gain criterion which is based on conditional entropy.
IG(T|a) = H(T) - H(T|a)
Is something similar used for numerical attributes? Maybe differential entropy?
When a tree is split on a numerical attribute, it is split on a condition like a>5. This condition effectively becomes a binary variable, so the criterion (information gain) is exactly the same.
P.S. For regression, the sum of squared errors is commonly used (computed for each leaf, then summed over the leaves). But I do not know specifically about Weka.
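Both criteria can be sketched in a few lines of Python. This illustrates the general idea only and makes no claim about what Weka's RandomTree actually computes; the function names and the toy data are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """IG of the binary split 'value > threshold' for a nominal target."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(labels)
    conditional = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - conditional

def sse(targets):
    """Sum of squared errors around the mean (0 for an empty leaf)."""
    if not targets:
        return 0.0
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets)

def sse_after_split(values, targets, threshold):
    """Total SSE over the two leaves produced by 'value > threshold'."""
    left = [t for x, t in zip(values, targets) if x <= threshold]
    right = [t for x, t in zip(values, targets) if x > threshold]
    return sse(left) + sse(right)

a = [1, 2, 6, 7]
print(information_gain(a, ["n", "n", "y", "y"], threshold=5))  # classification case
print(sse([1.0, 1.2, 5.0, 5.2]), sse_after_split(a, [1.0, 1.2, 5.0, 5.2], threshold=5))  # regression case
```

In either case, candidate thresholds would be compared by the gain (or by the SSE reduction) they achieve, and the best one kept.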

One class SVM to detect outliers

My problem is
I want to build a one-class SVM classifier to identify the nouns/aspects from a test file.
The training file has a list of nouns. The test file has a list of words.
This is what I've done:
I'm using the Weka GUI and I've trained a one-class SVM (LibSVM) to get a model.
Now the model classifies as nouns only those words in the test file that appeared as nouns in the training data. Others are classified as outliers. (So it is just working like a lookup: if a word is identified as a noun in the trained model, then 'yes', else 'no'.)
So how do I build a proper classifier? (I mean the format of the input and what information it should contain.)
Note:
I don't give negative examples in training file since it is one class.
My input format is arff
Format of training file is a set of word,yes
Format of test file is a set of word,?
EDIT
My test file will have noun phrases. So my classifier's job is to pick the noun words from the candidates in the test file.
Your data is not formatted appropriately for this problem.
If you put
word,class
pairs into an SVM, what you are really putting into the SVM are sparse vectors that consist of a single one, corresponding to your word, i.e.
0,0,0,0,0,...,0,0,1,0,0,0,...,0,0,0,0,yes
All a classifier can do on such data is overfit and memorize. On unknown new words, the result will be useless.
If you want your classifier to be able to abstract and generalize, then you need to carefully extract features from your words.
Possible features would be n-grams. So the word "example" could be represented as
exa:1, xam:1, amp:1, mpl:1, ple:1
Now your classifier/SVM could learn that having the n-gram "ple" is typical for nouns.
Results will likely be better if you add "beginning-of-word" and "end-of-word" symbols,
^ex:1, exa:1, xam:1, amp:1, mpl:1, ple:1, le$:1
and maybe also use more than one n-gram length, e.g.
^ex:1, ^exa:1, exa:1, exam:1, xam:1, xamp:1, amp:1, ampl:1, mpl:1, mple:1, ple:1, ple$:1, le$:1
but of course, the more you add the larger your data set and search space grows, which again may lead to overfitting.
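The n-gram features described above can be generated with a short Python sketch. The function name and its parameters are hypothetical, and the output is a plain dictionary of binary features that would still need to be written out as ARFF attributes.

```python
def char_ngrams(word, sizes=(3,), mark_boundaries=True):
    """Map each character n-gram of the word to 1 (binary presence feature)."""
    # '^' and '$' stand for beginning-of-word and end-of-word.
    text = f"^{word}$" if mark_boundaries else word
    features = {}
    for n in sizes:
        for i in range(len(text) - n + 1):
            features[text[i:i + n]] = 1
    return features

print(char_ngrams("example", mark_boundaries=False))  # plain trigrams
print(char_ngrams("example"))                         # trigrams with boundary symbols
print(sorted(char_ngrams("example", sizes=(3, 4))))   # trigrams and 4-grams
```

Every distinct n-gram seen in the training words becomes one attribute, which is why longer or mixed n-gram lengths quickly enlarge the feature space.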

Classification in weka fails, caused by case sensitiveness of nominal values?

I made a classifier to classify search queries into one of the following classes: {Artist, Actor, Politician, Athlete, Facility, Geo, Definition, QA}. I have two csv files: one for training the classifier (contains 300 queries) and one for testing the classifier (currently contains about 200 queries). When I use the trainingset and testset for training/evaluating the classifier with weka knowledgeflow, most classes reach a pretty good accuracy. Setup of Weka knowledge flow training/testing situation:
After training I saved the MultiLayer Perceptron classifier from the knowledgeflow into classifier.model, which I used in java code to classify queries.
When I deserialize this model in Java code and use it to classify all the queries of the testing set CSV file (using the distributionForInstance() method on the deserialized classifier), it classifies all 'Geo' queries as 'Facility' queries and all 'QA' queries as 'Definition' queries. This surprised me a bit, as the ClassifierPerformanceEvaluator in the knowledgeflow showed me a confusion matrix in which 'Geo' and 'QA' queries scored really well, and the testing queries are the same (the same CSV file was used). All other query classifications using the distributionForInstance() method seem to work normally and show the behavior that could be expected from the confusion matrix in the knowledgeflow. Does anyone know what could cause the classification difference between the distributionForInstance() method in the Java code and the knowledgeflow evaluation results?
One thing that I can think of is the following:
The testing CSV file contains, among other attributes, a lot of nominal attributes whose values are in all capitals. When I print out the values of all attributes of the instances before classification in the Java code, these values seem to have been converted to lowercase letters (it seems like the DataSource.getDataSet() method behaves like this). Could it be that the casing of these attributes is the reason that some instances of my testing CSV file get classified differently? I read in the Weka specification that nominal attribute values are case sensitive. I can't change these values to uppercase in the Java code, though, as Weka then throws an exception saying that these values are not predefined for the nominal attribute.
Weka is likely using the same class in the knowledge flow as in your weka code to interpret the csv. This is why it works (produces data sets -- Instances objects -- that match) without tweaking and fails when you change things: the items don't match any more. This is to say that weka is handling the case of the input strings consistently, and does not require you to change it.
Check that you are looking at the Error on Test Data value and not the Error on Training Data value in the knowledge flow output, because the second one will be artificially low (that is, the accuracy will look artificially high) given that you built the model using those exact examples. It is possible that your classifier is performing the same in both places, but you are looking at different statistics.
