weka - normalize nominal values - machine-learning
I have this data set:
Instance num 0 : 300,24,'Social worker','Computer sciences',Music,10,5,5,1,5,''
Instance num 1 : 1000,20,Student,'Computer engineering',Education,10,5,5,5,5,Sony
Instance num 2 : 450,28,'Computer support specialist',Business,Programming,10,4,1,0,4,Lenovo
Instance num 3 : 1000,20,Student,'Computer engineering','3d Design',1,1,2,1,3,Toshiba
Instance num 4 : 1000,20,Student,'Computer engineering',Programming,2,5,1,5,4,Dell
Instance num 5 : 800,16,Student,'Computer sciences',Education,8,4,3,4,4,Toshiba
I want to classify it using SMO and other multi-class classifiers, so I convert all the nominal values to numeric using this code:
int[] indices = {2, 3, 4, 10}; // indices of nominal columns
for (int i = 0; i < indices.length; i++) {
    int attInd = indices[i];
    Attribute att = data.attribute(attInd);
    for (int n = 0; n < att.numValues(); n++) {
        data.renameAttributeValue(att, att.value(n), "" + n);
    }
}
and the result is:
Instance num 0 : 300,24,0,0,0,10,5,5,1,5,0
Instance num 1 : 1000,20,1,1,1,10,5,5,5,5,1
Instance num 2 : 450,28,2,2,2,10,4,1,0,4,2
Instance num 3 : 1000,20,1,1,3,1,1,2,1,3,3
Instance num 4 : 1000,20,1,1,2,2,5,1,5,4,4
Instance num 5 : 800,16,1,0,1,8,4,3,4,4,3
After applying the "Normalize" filter, the result looks like this:
Instance num 0 : 0,0.666667,0,0,0,1,1,1,0.2,1,0
Instance num 1 : 1,0.333333,1,1,1,1,1,1,1,1,1
Instance num 2 : 0.214286,1,2,2,2,1,0.75,0,0,0.5,2
Instance num 3 : 1,0.333333,1,1,3,0,0,0.25,0.2,0,3
Instance num 4 : 1,0.333333,1,1,2,0.111111,1,0,1,0.5,4
Instance num 5 : 0.714286,0,1,0,1,0.777778,0.75,0.5,0.8,0.5,3
The problem is that the converted columns are still nominal (their values are strings such as "0" and "1"), so the "Normalize" filter does not normalize them.
Any ideas?
My second question: what should I use as a multi-class classifier besides SMO?
Don't convert nominals/categoricals into floats (or integers) and then normalize them. It's meaningless: garbage in, garbage out. Treating them as continuous numbers or numeric vectors gives nonsense results like "the average of 'Engineering' and 'Nursing' is 'Architecture'".
The right way to treat nominals/categoricals is to convert each one into dummy variables (also known as 'dummy coding' or 'dichotomizing'). If the Occupation column (or Major, or Elective, or whatever) has K levels, you create either K or K-1 binary variables. For each row they are 0 everywhere except for a 1 in the column that corresponds to that row's level (with K-1 variables, the omitted reference level is all zeros).
Look up the Weka documentation to find the right function call; the filter weka.filters.unsupervised.attribute.NominalToBinary performs this conversion.
cf. e.g. SO: Dummy Coding of Nominal Attributes (for Logistic Regression)
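As a rough illustration of dummy coding, independent of Weka, the sketch below one-hot encodes a single nominal column in plain Python. The helper name dummy_code and the sample values are only illustrative assumptions, not part of any library.

```python
def dummy_code(values, drop_first=False):
    """One-hot encode a list of nominal values.

    Returns (levels, rows): the level names that received a column, and one
    0/1 row per input value. With drop_first=True the first level acts as the
    reference level and gets no column (K-1 coding).
    """
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical excerpt of the Occupation column from the question.
occupations = ["Social worker", "Student", "Computer support specialist", "Student"]
levels, rows = dummy_code(occupations)
for value, row in zip(occupations, rows):
    print(value, row)
```

Each binary column can then be fed to a classifier as an ordinary numeric attribute; no ordering between the levels is implied.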
I believe the best way to convert a string attribute into numeric ones is the filter weka.filters.unsupervised.attribute.StringToWordVector.
After doing so, you can apply the "Normalize" filter (weka.filters.unsupervised.attribute.Normalize) and then train a classifier such as weka.classifiers.functions.LibSVM.
Related
Multilayer Perceptron prediction result not only as a double, also as String (using Weka (Java))
I would like to make a prediction using a multilayer perceptron. For this purpose, I have created test data to be predicted. Now I go through all records in a for loop and want to append the prediction:
for (int i1 = 0; i1 < datapredict1.numInstances(); i1++) {
    double clsLabel1 = mlp.classifyInstance(datapredict1.instance(i1));
    datapredict1.instance(i1).setClassValue(clsLabel1);
    String s = datapredict1.instance(i1) + "," + clsLabel1;
    writer11.write(s.toString());
    writer11.newLine();
    System.out.println(datapredict1.instance(i1) + "," + clsLabel1);
}
The result output is as follows:
0.178571,0.2,0.181818,0.333333,0,09:15,0.849899,0.8498991728827364
0.414835,0,0.454545,0.666667,0,16:15,0.850662,0.85066198399766
How can I display not only the probability but also the string value, as for example:
0.178571,0.2,0.181818,0.333333,0,09:15,"Value2",0.8498991728827364
0.414835,0,0.454545,0.666667,0,16:15,"Value4",0.85066198399766
The classifyInstance method of a classifier returns the regression value for numeric class attributes or the index of the most likely class label for nominal ones. In the latter case, cast the returned double to an int and use the value(int) method of the class attribute of your dataset to obtain the label string.
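The index-to-label lookup described above can be sketched outside Weka as follows. This is only an illustration in plain Python: the label list and the helper name label_for are hypothetical stand-ins for the class attribute and its value(int) method.

```python
# Hypothetical labels of a nominal class attribute, in declaration order.
class_labels = ["Value1", "Value2", "Value3", "Value4"]


def label_for(prediction, labels):
    """Map the double returned for a nominal class to its label string."""
    return labels[int(prediction)]


predicted = 3.0  # a classifier returns the class index as a double
line = "16:15,{},{}".format(label_for(predicted, class_labels), predicted)
print(line)
```

For a numeric class attribute the returned double is the regression value itself, so no lookup applies.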
Why VoxelGrid after filtering gives me only 1 point in the cloud?
I am receiving a ROS message of type sensor_msgs::PointCloud2ConstPtr in my callback function, then I transform it to a pointer of type pcl::PointCloud<pcl::PointXYZ>::Ptr using the function pcl::fromROSMsg. After that I use this code from the PCL tutorials for normal estimation:
void OrganizedCloudToNormals(
    const pcl::PointCloud<pcl::PointXYZ>::Ptr &_inputCloud,
    pcl::PointCloud<pcl::PointNormal>::Ptr &cloud_normals)
{
    pcl::console::print_highlight ("Estimating scene normals...\n");
    pcl::NormalEstimationOMP<pcl::PointXYZ,pcl::PointNormal> nest;
    nest.setRadiusSearch (0.001);
    nest.setInputCloud (_inputCloud);
    nest.compute (*cloud_normals);
    // write 0 wherever the value is NaN
    for (int i = 0; i < cloud_normals->points.size(); i++) {
        cloud_normals->points.at(i).normal_x = isnan(cloud_normals->points.at(i).normal_x) ? 0 : cloud_normals->points.at(i).normal_x;
        cloud_normals->points.at(i).normal_y = isnan(cloud_normals->points.at(i).normal_y) ? 0 : cloud_normals->points.at(i).normal_y;
        cloud_normals->points.at(i).normal_z = isnan(cloud_normals->points.at(i).normal_z) ? 0 : cloud_normals->points.at(i).normal_z;
        cloud_normals->points.at(i).curvature = isnan(cloud_normals->points.at(i).curvature) ? 0 : cloud_normals->points.at(i).curvature;
    }
}
After that I have a point cloud of type pcl::PointNormal and try to downsample it:
const float leaf = 0.001f; //0.005f;
pcl::VoxelGrid<pcl::PointNormal> gridScene;
gridScene.setLeafSize(leaf, leaf, leaf);
gridScene.setInputCloud(_scene);
gridScene.filter(*_scene);
where _scene is declared as
pcl::PointCloud<pcl::PointNormal>::Ptr _scene (new pcl::PointCloud<pcl::PointNormal>);
After filtering, my point cloud _scene has only 1 point inside. I have tried changing the leaf size, but that doesn't change the outcome. Does anyone know what I am doing wrong? Thanks in advance.
I have found where the problem was. The type pcl::PointNormal has the fields x, y, z, normal_x, normal_y and normal_z, but in my function OrganizedCloudToNormals I filled only normal_x, normal_y and normal_z, so x, y and z were 0 for every point. When I filled x, y and z from the input point cloud, the problem with filtering (downsampling) disappeared and the filtered cloud had more than 1 point inside. The missing values in the x, y and z fields probably caused the problem later in the filter method of the voxel grid object.
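Why all-zero coordinates collapse to a single point can be seen with a simplified voxel grid. The sketch below is plain Python and is not PCL's implementation; it only assumes the usual idea of bucketing points by floor(coordinate / leaf) and replacing each bucket by its centroid. The function name voxel_downsample is hypothetical.

```python
import math
from collections import defaultdict


def voxel_downsample(points, leaf):
    """Replace all points that fall into the same leaf-sized voxel by their centroid."""
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (math.floor(x / leaf), math.floor(y / leaf), math.floor(z / leaf))
        voxels[key].append((x, y, z))
    return [tuple(sum(axis) / len(pts) for axis in zip(*pts)) for pts in voxels.values()]


leaf = 0.001
# Normals were computed, but x, y, z were never copied: every point sits at the origin.
unfilled = [(0.0, 0.0, 0.0)] * 1000
# The same number of points with real coordinates, spread along a line.
filled = [(i * 0.01, 0.0, 0.0) for i in range(1000)]

print(len(voxel_downsample(unfilled, leaf)))
print(len(voxel_downsample(filled, leaf)))
```

With every coordinate at zero, all points share one voxel regardless of the leaf size, which matches the observation that changing the leaf size did not help.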
Total sum from a set (logic)
I have a logic problem for an iOS app, but I don't want to solve it using brute force. I have a set of integers whose values are not unique:
[3,4,1,7,1,2,5,6,3,4........]
How can I get a subset from it that meets these 3 conditions?
1. I can only pick a defined number of values.
2. The sum of the picked elements is equal to a given value.
3. The selection must be random, so if there is more than one solution for the value, it will not always return the same one.
Thanks in advance!
This is the subset sum problem. It is a known NP-complete problem, so there is no known efficient (polynomial) solution to it.
However, if you are dealing with only relatively small integers, there is a pseudo-polynomial time solution using dynamic programming.
The idea is to build a matrix bottom-up that follows these recursive formulas (arr is 1-indexed here):
D(x,i) = false                            if x < 0
D(0,i) = true
D(x,0) = false                            if x != 0
D(x,i) = D(x,i-1) OR D(x-arr[i],i-1)
The idea is to mimic an exhaustive search: at each point you "guess" whether the element is chosen or not.
To get the actual subset, you need to trace back through your matrix. Starting from D(SUM,n) (assuming its value is true), you do the following once the matrix is filled:
if D(x-arr[i],i-1) == true:
    add arr[i] to the set
    modify x <- x - arr[i]
    modify i <- i-1
else // that means D(x,i-1) must be true
    just modify i <- i-1
To get a random subset each time, if both D(x-arr[i],i-1) == true AND D(x,i-1) == true, choose randomly which course of action to take.
Python code (if you don't know Python, read it as pseudo-code; it is very easy to follow). Python lists are 0-indexed, so element i is arr[i-1] in the code:
from random import randint

arr = [1, 2, 4, 5]
n = len(arr)
SUM = 6
# pre-processing:
D = [[True] * (n + 1)]
for x in range(1, SUM + 1):
    D.append([False] * (n + 1))
# DP solution to populate D:
for x in range(1, SUM + 1):
    for i in range(1, n + 1):
        D[x][i] = D[x][i-1]
        if x >= arr[i-1]:
            D[x][i] = D[x][i] or D[x-arr[i-1]][i-1]
print(D)
# get a random solution:
if not D[SUM][n]:
    print('no solution')
else:
    sol = []
    x = SUM
    i = n
    while x != 0:
        possibleVals = []
        if D[x][i-1]:
            possibleVals.append(x)
        if x >= arr[i-1] and D[x-arr[i-1]][i-1]:
            possibleVals.append(x - arr[i-1])
        # by here possibleVals contains 1 or 2 candidates, depending on how many choices we have.
        # choose one of them at random
        r = possibleVals[randint(0, len(possibleVals) - 1)]
        # if we decided to add the element:
        if r != x:
            sol.append(x - r)
        # modify i and x accordingly
        x = r
        i = i - 1
    print(sol)
P.S. The above gives you a random choice, but NOT with a uniform distribution over the solutions. To achieve a uniform distribution, you need to count the number of possible ways to build each number. The formulas will be:
D(x,i) = 0                                if x < 0
D(0,i) = 1
D(x,0) = 0                                if x != 0
D(x,i) = D(x,i-1) + D(x-arr[i],i-1)
When generating the subset, you follow the same logic, but you decide to add element i with probability D(x-arr[i],i-1) / D(x,i).
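The counting variant from the P.S. can be sketched as below. This is a minimal illustration that assumes strictly positive integers; the function names count_table and sample_subset are hypothetical, and the result is a list of indices into arr so that duplicate values stay distinguishable.

```python
import random


def count_table(arr, total):
    """C[x][i] = number of subsets of the first i elements that sum to x."""
    n = len(arr)
    C = [[0] * (n + 1) for _ in range(total + 1)]
    for i in range(n + 1):
        C[0][i] = 1  # the empty subset is the only way to reach 0
    for x in range(1, total + 1):
        for i in range(1, n + 1):
            C[x][i] = C[x][i - 1]
            if x >= arr[i - 1]:
                C[x][i] += C[x - arr[i - 1]][i - 1]
    return C


def sample_subset(arr, total, rng=random):
    """Return indices of a uniformly random subset summing to total, or None."""
    C = count_table(arr, total)
    x, i = total, len(arr)
    if C[x][i] == 0:
        return None
    chosen = []
    while i > 0:
        # Number of solutions that use element i (0-based index i-1).
        take = C[x - arr[i - 1]][i - 1] if x >= arr[i - 1] else 0
        # Add element i with probability take / C[x][i].
        if rng.randrange(C[x][i]) < take:
            chosen.append(i - 1)
            x -= arr[i - 1]
        i -= 1
    return chosen


arr = [1, 2, 4, 5]
picked = sample_subset(arr, 6)
print([arr[k] for k in picked])
```

Because a branch with zero solutions is never taken, the walk always ends at x = 0, and each solution is reached with probability 1 / C[total][n].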
How to get the decision function from svm_model
Say I have a feature vector [v1,v2,v3] and a decision function a*v1+b*v2+c*v3 = d. How do I get the values (a,b,c,d) using the information in svm_model? I saw these two fields in svm_model:
public double[][] sv_coef; // coefficients for SVs in decision functions (sv_coef[k-1][l])
public double[] rho; // constants in decision functions (rho[k*(k-1)/2])
I suspect they could be essential for getting the decision function.
There is also an SVs field in svm_model. Your decision function is w*v + b = 0, where v = [v1,v2,v3]. Then, for a linear kernel:
w = model.SVs' * model.sv_coef;
b = -model.rho;
The sign depends on the order of the labels, so you may also need another field called Label:
if model.Label(1) == -1
    w = -w;
    b = -b;
end
Check the LIBSVM FAQ for more details.
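The same computation can be sketched in plain Python. It assumes a two-class model with a linear kernel whose support vectors, coefficients and rho have already been copied into ordinary lists and a float; the function name and the sample numbers are hypothetical.

```python
def linear_decision_function(svs, sv_coef, rho):
    """Collapse a two-class linear-kernel SVM into (w, b) with f(v) = w.v + b."""
    dim = len(svs[0])
    # w is the coefficient-weighted sum of the support vectors.
    w = [sum(coef * sv[j] for coef, sv in zip(sv_coef, svs)) for j in range(dim)]
    b = -rho
    return w, b


# Hypothetical model values, for illustration only.
svs = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
sv_coef = [0.5, -0.5]
rho = 0.25

w, b = linear_decision_function(svs, sv_coef, rho)
print(w, b)
```

In the notation of the question, (a, b, c) are the three entries of w, and d equals rho, since w.v + b = 0 is the same as w.v = rho.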
WEKA classification likelihood of the classes
I would like to know if there is a way in WEKA to output a number of 'best guesses' for a classification. My scenario is: I classify the data with cross-validation, for instance, and then in Weka's output I get something like "these are the 3 best guesses for the classification of this instance". What I want is that, even if an instance isn't correctly classified, I get an output of the 3 or 5 best guesses for that instance.
Example:
Classes: A,B,C,D,E
Instances: 1...10
And the output would be: instance 1 is 90% likely to be class A, 75% likely to be class B, 60% likely to be class C...
Thanks.
Weka's API has a method called Classifier.distributionForInstance() that can be used to get the classification prediction distribution. You can then sort the distribution by decreasing probability to get your top-N predictions.
Below is a function that prints out: (1) the test instance's ground truth label; (2) the predicted label from classifyInstance(); and (3) the prediction distribution from distributionForInstance(). I have used this with J48, but it should work with other classifiers.
The input parameters are the serialized model file (which you can create during the model training phase by applying the -d option) and the test file in ARFF format.
import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public void test(String modelFileSerialized, String testFileARFF) throws Exception {
    // Deserialize the classifier.
    Classifier classifier = (Classifier) weka.core.SerializationHelper.read(modelFileSerialized);
    // Load the test instances.
    Instances testInstances = DataSource.read(testFileARFF);
    // Mark the last attribute in each instance as the true class.
    testInstances.setClassIndex(testInstances.numAttributes() - 1);
    int numTestInstances = testInstances.numInstances();
    System.out.printf("There are %d test instances\n", numTestInstances);
    // Loop over each test instance.
    for (int i = 0; i < numTestInstances; i++) {
        // Get the true class label from the instance's own classIndex.
        String trueClassLabel = testInstances.instance(i).toString(testInstances.classIndex());
        // Make the prediction here.
        double predictionIndex = classifier.classifyInstance(testInstances.instance(i));
        // Get the predicted class label from the predictionIndex.
        String predictedClassLabel = testInstances.classAttribute().value((int) predictionIndex);
        // Get the prediction probability distribution.
        double[] predictionDistribution = classifier.distributionForInstance(testInstances.instance(i));
        // Print out the true label, predicted label, and the distribution.
        System.out.printf("%5d: true=%-10s, predicted=%-10s, distribution=", i, trueClassLabel, predictedClassLabel);
        // Loop over all the prediction labels in the distribution.
        for (int predictionDistributionIndex = 0; predictionDistributionIndex < predictionDistribution.length; predictionDistributionIndex++) {
            // Get this distribution index's class label.
            String predictionDistributionIndexAsClassLabel = testInstances.classAttribute().value(predictionDistributionIndex);
            // Get the probability.
            double predictionProbability = predictionDistribution[predictionDistributionIndex];
            System.out.printf("[%10s : %6.3f]", predictionDistributionIndexAsClassLabel, predictionProbability);
        }
        System.out.printf("\n");
    }
}
I don't know if you can do it natively, but you can just get the probabilities for each class, sort them and take the first three. The function you want is distributionForInstance(Instance instance), which returns a double[] giving the probability for each class.
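The sort-and-take-the-first-N step can be sketched independently of Weka. The snippet below is plain Python; it assumes the distribution is already available as a list of probabilities in the same order as the class labels, and the helper name top_n and the sample numbers are hypothetical.

```python
def top_n(distribution, labels, n=3):
    """Return the n most probable (label, probability) pairs, best first."""
    ranked = sorted(zip(labels, distribution), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


labels = ["A", "B", "C", "D", "E"]
distribution = [0.05, 0.50, 0.10, 0.30, 0.05]  # hypothetical classifier output

best = top_n(distribution, labels)
for label, probability in best:
    print("{}: {:.0%}".format(label, probability))
```

Since a distribution sums to 1, the reported percentages cannot be 90%, 75% and 60% at the same time as in the question's example; they are shares of one total.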
Not in general. The information you want is not available with all classifiers -- in most cases (for example for decision trees), the decision is clear (albeit potentially incorrect) without a confidence value. Your task requires classifiers that can handle uncertainty (such as the naive Bayes classifier). Technically the easiest thing to do is probably to train the model and then classify an individual instance, for which Weka should give you the desired output. In general you can of course also do it for sets of instances, but I don't think that Weka provides this out of the box. You would probably have to customise the code or use it through an API (for example in R).
When you calculate a probability for the instance, how exactly do you do this? I have posted my PART rules and data for the new instance here, but I am not sure how to do the calculation manually. Thanks.
EDIT: now calculated:
private float[] getProbDist(String split) {
    // Takes in something such as "(52/2)", meaning 52 instances are covered
    // by the rule and 2 of them are misclassified.
    // Assumption: the counts are obtained by stripping the parentheses and splitting on "/".
    String[] prob_dis = split.replaceAll("[()\\s]", "").split("/");
    if (prob_dis.length > 2)
        return null;
    if (prob_dis.length == 1) {
        // No second number: no instance is misclassified.
        prob_dis = new String[] { prob_dis[0], "0" };
    }
    float p1 = Float.parseFloat(prob_dis[0]);
    float p2 = Float.parseFloat(prob_dis[1]);
    // assumes two tags
    float[] tag_prob = new float[2];
    tag_prob[0] = p2 / p1;          // share of misclassified instances
    tag_prob[1] = 1 - tag_prob[0];  // share of correctly classified instances
    // returns float[] holding the two probabilities
    return tag_prob;
}