Missing value replacement based on class - machine-learning

I've been reading an article on Random Forests, and in missing value replacement section (https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1) they say:
If the mth variable is not categorical, the method computes the median of all values of this variable in class j, then it uses this value to replace all missing values of the mth variable in class j.
Wouldn't that undermine the entire process? If most values in some column are missing, then after this procedure the new values could be used to easily identify the class, and the resulting classifier would be useless. Am I missing something here?

The resulting classifier is not necessarily useless, it depends on the characteristics of the 'missingness' (the event that a feature value is missing). If its distribution is identical between train and test set (which is a prevailing implicit assumption in ML), it is doing the right thing. However it is indeed problematic if there is a discrepancy, e.g., if missing values are an artifact of the way the training data was generated and mostly associated with one class, while at test time feature values are always fully known. In this case, imputation might lead to incorrect conclusions, especially if the number of missing values is large.

Related

Missing Values in WEKA output

I'm trying to compare J48 and MLP on a variety of datasets using WEKA. One of these is: https://archive.ics.uci.edu/ml/datasets/primary+tumor. I have converted this to CSV form which can be easily imported into WEKA. You can download this file here: https://ufile.io/8nj13
I used the "numeric to nominal" on the class and all the attributes to fit the natural structure of the data. However, when I ran J48 (and MLP), I got a bunch of question marks "?" in my output, presumably due to not having enough observations/instances of the appropriate type.
How can I get around this? I'm sure there must be a filter for this kind of thing. I've attached a picture below.
The detailed accuracy table is displaying a question mark since no instance was actually classified as that specific class. This for example means that since no instance was classified as class 16, WEKA can not provide you with details regarding said class 16 classifications. This image might help you understand.
In regards to the amount of instances of the appropriate class, you can use the ClassBalancer filter under, found at weka/filters/supervised/instance/ClassBalancer. This should help balance out the amount of the various classes.
Also note that your dataset contains some missing values, this could be solved by either discarding the instances with missing data or running the ReplaceMissingValues filter, found at weka/filters/unsupervised/attribute/ReplaceMissingValues.

Treat missing value as it is in Decision Tree

I have a dataset in which some variable (categorical variable and numerical variable) has missing values. Example, i have a variable "area" with numerical value which divided into two categories, "area (today)" and "area (-1 day)". If a data row categorized as "new comer" then it will have no value on "area (-1 day)". So, normal missing value handling like removal or mean not working here. Do i have to label no value on "area (-1 day)" as a category where the variable is originally numeric? Or, is there any other suggestions?
Treating the newcomer as a separate class makes sense, because that's how you are treating it in your dataset - you have a separate area column for it.
Otherwise you can check various other Imputation techniques to suit your use case. Regression imputation might suit your case.
HTH

Problem with PMML generation of Random Forest in R

I am trying to generate a PMML from a random forest model I obtained using R. I am using the randomForest package 4.6-12 and the last version of PMML for R. But every time I try to generate the PMML obtain an error. Here is the code:
data_train.rf <- randomForest( TARGET ~ ., data = train, ntree=100, na.action=na.omit, importance=TRUE)
pmml_file = pmml(data_train.rf)
[1] "Now converting tree 1 to PMML"
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I haven't been able to find the origin of the problem, any thoughts?
Thanks in advance,
Alvaro
Looks like the variable splitNode has not been initialized inside the "pmml" package. The initialization pathway depends on the data type of the split variable (eg. numeric, logical, factor). Please see the source code of the /R/pmml.randomForest.R file inside "pmml" package.
So, what are the columns in your train data.frame object?
Alternatively, you could try out the r2pmml package as it is much better at handling the randomForest model type.
The pmml code assumes the data type of the variables are numeric, simple logical or factor. It wont work if the data you use are some other type; DateTime for example.
It would help if your problem is reproducible; ideally you would provide the dataset you used. If not, at least a sample of it or a description of it...maybe summarize it.
You should also consider emailing the package maintainers directly.
I may have found the origin for this problem. In my dataset I have approx 500000 events and 30 variables, 10 of these variables are factors, and some of them have weakly populated levels in some cases having as little as 1 event.
I built several Random Forest models, each time including and extra variable to the model. I started adding to the model the numerical variables without a problem to generate a PMML, the same happened for the categorical variables with all levels largely populated, when I tried to include categorical variables with levels weakly populated I got the error:
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I suppose that the origin of the problem is that in some situations when building a tree where the levels is weakly populated then there is no split as there is only one case and although the randomForest package knows how to handle these cases, the pmml package does not.
My tests show that this problem appears when the number of levels of a categorical variable goes beyond the maximum number allowed by the randomForest function. The split defined in the forest sublist is no longer a positive integer which is required by the split definition for categorical objects. Reducing the number of levels fixed the problem.

Best practices for handling non decimal variables. [ACM KDD 2009 CUP]

For practice I decided to use neural network to solve problem of classification (2 classes) stated by ACM Special Interest Group on Knowledge Discovery and Data Mining at 2009 cup. The problem I have found is that the data set contains a lot of "empty" variables and I am not sure how to handle them. Furthermore second question appears. How to handle with other non decimals like strings. What are Your best practices?
Most approaches require numerical features, so the categorical ones have to be converted into counts. E.g. if a certain string is present among the attributes of an instance, it's count is 1, otherwise 0. If it occurs more than once, it's count increases correspondingly. From this point of view any feature that is not present (or "empty" as you put it) has a count of 0. Note that the attribute names have to be unique.

Weka java library: how to get string representation of classified instance?

Currently I'm working on a project of classifying search queries into the following eight types: {athlete, actor, artist, politician, geo, facility, QA, definition}. After a bit of work I managed to score 78% correctly classified instances for my set of 300 sample queries using a Multilayer Perceptron classifier when I evaluate the classifier with a stratified 10-fold cross validation, which is reasonably good I think.
Using the weka java library I implemented the whole thing into java code, so I can write a program that dynamically feeds a query to the classifier and retrieves it's query type. I managed to implement the whole classifier training part successfully. The next step would be to use either the classifyInstance() or distributionForInstance() to determine the class to which the query is classified.
classifyInstance() however does only return a double value for which I do not know to get the actual query-type out of it. The weka wikispaces tell me I can use
unlabeled.classAttribute().value((int) clsLabel);
After calling classifyInstance() to get the String representation of the class, this however seems to always return the empty string in my case.
Using distributionForInstance() I'm able to successfully retrieve an array with eight double values between 0 and 1 (which is good, as I classify into eight query types). However, what is the order of this array? Is the first element in the result array the first class that occurs in my training file? Or is there some other predefined element order in this result array (e.g. alphabetically)? The weka documentation does not give any information on this.
I hope someone will be able to help me out!
Internally, Weka handles all values as doubles. When you create the Attribute, you pass it an array of strings that lists the possible nominal values. The double that classification returns is the index of the chosen attribute in the original array. So if you had code that looked like this:
String[] attributeValues = {"a", "b", "c"};
Attribute a = new Attribute("attributeName", attributeValues);
and classifyInstance() returned 2, then the class it chose would be attributeValues[2] or c.
With the distributionForInstance() method, the indexes of the two arrays match, so attributeValues[0] is the string name for the first element of the array returned.
UPDATE (because of downvote)
The above method won't work if you're letting weka create the Instances object itself (e.g. if you're reading from an arff file). That doesn't seem to be the case given your question, but if it is, then please post code so we can see what's going on.

Resources