I have a dataset in which some variables (both categorical and numerical) have missing values. For example, I have a numerical variable "area" that is split into two columns, "area (today)" and "area (-1 day)". If a row is categorized as a "new comer", it has no value for "area (-1 day)". So the usual missing-value handling, such as removing rows or imputing the mean, does not work here. Do I have to label the missing "area (-1 day)" value as a category even though the variable is originally numeric? Or are there any other suggestions?
Treating the newcomer as a separate class makes sense, because that's how you are treating it in your dataset - you have a separate area column for it.
Otherwise you can check various other Imputation techniques to suit your use case. Regression imputation might suit your case.
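One way to combine both ideas is to keep "area (-1 day)" numeric, add a 0/1 flag column that marks newcomers, and put a placeholder in the gap. Below is a minimal Java sketch of that; the class and method names are hypothetical, and using the observed mean as the placeholder is just an assumption for illustration (a regression-based fill would slot into the same place).

```java
final class MissingIndicator {
    // 1 where the value is missing (NaN), 0 otherwise; this becomes the "newcomer" flag column.
    static int[] missingFlags(double[] values) {
        int[] flags = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            flags[i] = Double.isNaN(values[i]) ? 1 : 0;
        }
        return flags;
    }

    // Mean of the non-missing values; NaN if every value is missing.
    static double observedMean(double[] values) {
        double sum = 0.0;
        int count = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    // Copy of the column with every missing entry replaced by the given placeholder.
    static double[] fill(double[] values, double replacement) {
        double[] out = values.clone();
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                out[i] = replacement;
            }
        }
        return out;
    }
}
```

The model then sees both the filled column and the flag, so it can learn to treat the placeholder differently for newcomers.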
HTH
I am trying to generate a PMML file from a random forest model I built in R. I am using the randomForest package 4.6-12 and the latest version of the pmml package for R. But every time I try to generate the PMML, I get an error. Here is the code:
data_train.rf <- randomForest( TARGET ~ ., data = train, ntree=100, na.action=na.omit, importance=TRUE)
pmml_file = pmml(data_train.rf)
[1] "Now converting tree 1 to PMML"
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I haven't been able to find the origin of the problem, any thoughts?
Thanks in advance,
Alvaro
It looks like the variable splitNode has not been initialized inside the "pmml" package. The initialization pathway depends on the data type of the split variable (e.g. numeric, logical, factor). Please see the source code of the /R/pmml.randomForest.R file inside the "pmml" package.
So, what are the columns in your train data.frame object?
Alternatively, you could try out the r2pmml package as it is much better at handling the randomForest model type.
The pmml code assumes that the data type of each variable is numeric, simple logical or factor. It won't work if the data you use are of some other type, DateTime for example.
It would help if your problem is reproducible; ideally you would provide the dataset you used. If not, at least a sample of it or a description of it...maybe summarize it.
You should also consider emailing the package maintainers directly.
I may have found the origin of this problem. My dataset has approximately 500000 events and 30 variables. Ten of these variables are factors, and some of them have sparsely populated levels, in some cases with as few as 1 event.
I built several random forest models, each time adding an extra variable to the model. I started by adding the numerical variables and could generate a PMML without a problem. The same was true for the categorical variables whose levels are all well populated. When I tried to include categorical variables with sparsely populated levels, I got the error:
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I suppose the origin of the problem is that, when a tree is built on a sparsely populated level, there is sometimes no split because there is only one case. The randomForest package knows how to handle these cases, but the pmml package does not.
My tests show that this problem appears when the number of levels of a categorical variable goes beyond the maximum number allowed by the randomForest function. The split defined in the forest sublist is no longer a positive integer which is required by the split definition for categorical objects. Reducing the number of levels fixed the problem.
I've been reading an article on Random Forests, and in the missing value replacement section (https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1) they say:
If the mth variable is not categorical, the method computes the median of all values of this variable in class j, then it uses this value to replace all missing values of the mth variable in class j.
Wouldn't that undermine the entire process? If most values in some column are missing, then after this procedure the new values could be used to easily identify the class, and the resulting classifier would be useless. Am I missing something here?
The resulting classifier is not necessarily useless; it depends on the characteristics of the 'missingness' (the event that a feature value is missing). If its distribution is identical between the train and test sets (which is a prevailing implicit assumption in ML), it is doing the right thing. However, it is indeed problematic if there is a discrepancy, e.g. if missing values are an artifact of the way the training data was generated and are mostly associated with one class, while at test time feature values are always fully known. In this case, imputation might lead to incorrect conclusions, especially if the number of missing values is large.
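For concreteness, here is a minimal Java sketch of the class-wise median fill quoted in the question. It is not Breiman's code: the class name, the NaN-as-missing convention and the choice to leave a value missing when its class has no observed values are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ClassMedianImputer {
    // Replaces each NaN in `values` with the median of the observed values that share its class label.
    static double[] impute(double[] values, int[] labels) {
        // Collect the observed (non-NaN) values per class.
        Map<Integer, List<Double>> observed = new HashMap<>();
        for (int i = 0; i < values.length; i++) {
            if (!Double.isNaN(values[i])) {
                observed.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(values[i]);
            }
        }
        // Compute one median per class.
        Map<Integer, Double> medians = new HashMap<>();
        for (Map.Entry<Integer, List<Double>> e : observed.entrySet()) {
            medians.put(e.getKey(), median(e.getValue()));
        }
        // Fill the gaps; a class with no observed values keeps its NaNs.
        double[] out = values.clone();
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i]) && medians.containsKey(labels[i])) {
                out[i] = medians.get(labels[i]);
            }
        }
        return out;
    }

    // Median of a non-empty list; averages the two middle values when the size is even.
    static double median(List<Double> xs) {
        List<Double> sorted = new ArrayList<>(xs);
        Collections.sort(sorted);
        int n = sorted.size();
        return n % 2 == 1 ? sorted.get(n / 2) : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }
}
```

Note that the fill needs the class label, which is exactly why it is only available on the training set and why the concern about train/test discrepancy applies.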
For practice I decided to use a neural network to solve the two-class classification problem posed by the ACM Special Interest Group on Knowledge Discovery and Data Mining for its 2009 cup. The problem I have found is that the data set contains a lot of "empty" variables and I am not sure how to handle them. A second question follows: how should I handle other non-numeric values such as strings? What are your best practices?
Most approaches require numerical features, so the categorical ones have to be converted into counts. E.g. if a certain string is present among the attributes of an instance, its count is 1, otherwise 0. If it occurs more than once, its count increases correspondingly. From this point of view, any feature that is not present (or "empty" as you put it) has a count of 0. Note that the attribute names have to be unique.
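A minimal Java sketch of that count encoding is below. The class name and the list-of-strings representation of an instance are assumptions for illustration; the vocabulary is simply every distinct attribute string seen in the data, kept in sorted order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

final class CountEncoder {
    // Sorted list of every distinct attribute string seen when the encoder was built.
    private final List<String> vocabulary;

    CountEncoder(List<List<String>> instances) {
        TreeSet<String> names = new TreeSet<>();
        for (List<String> instance : instances) {
            names.addAll(instance);
        }
        vocabulary = new ArrayList<>(names);
    }

    List<String> vocabulary() {
        return vocabulary;
    }

    // One count per vocabulary entry; absent ("empty") attributes stay at 0.
    int[] encode(List<String> instance) {
        int[] counts = new int[vocabulary.size()];
        for (String attribute : instance) {
            int index = Collections.binarySearch(vocabulary, attribute);
            if (index >= 0) {          // strings not in the vocabulary are ignored
                counts[index]++;
            }
        }
        return counts;
    }
}
```

An instance with no attributes at all encodes to a vector of zeros, which is how the "empty" variables end up being handled.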
If one calculates the recommendations for a boolean DataModel, the RecommendedItems have some numbers in their value field.
What does it represent? (Understandably, it can't be the calculated preference).
The class GenericRecommendedItem-API only says: "A value expressing the strength of the preference for the recommended item. The range of the values depends on the implementation. Implementations must use larger values to express stronger preference."
It's intentionally opaque so you don't rely on any particular value. It happens to be a sum of similarities if I recall correctly -- all similarities between the user's items and that item.
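To illustrate what a "sum of similarities" score could look like for boolean data, here is a small self-contained Java sketch. This is not the library's code: the class, the method names and the use of Jaccard (Tanimoto) similarity over the sets of users who have each item are all assumptions chosen for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SimilaritySumScore {
    // Jaccard similarity of two user-id sets: |intersection| / |union|, or 0 if both are empty.
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / union;
    }

    // Score of a candidate item: sum of its similarity to each item the user already has.
    // Assumes every item id passed in is a key of usersByItem.
    static double score(String candidate, List<String> userItems, Map<String, Set<Integer>> usersByItem) {
        double sum = 0.0;
        for (String owned : userItems) {
            sum += jaccard(usersByItem.get(owned), usersByItem.get(candidate));
        }
        return sum;
    }
}
```

A score built this way grows with the number of items the user has, so it is only meaningful for ranking candidates for the same user, not as a preference estimate.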
I was wondering how to implement the following problem: Say I have a 'set' of Strings and I wish to know which one is the most related to a given value.
Example:
String value= "ABBCCE";
Set contains: {"JJKKLL", "ABBCC", "AAPPFFEE", "AABBCCDD", "ABBCEE", "AABBCCEE"}
By 'most related' I assume there could be many options (the last two would both be valid), but at least we can ignore some items (JJKKLL).
What approach should I take to solve this kind of problem (at minimum, a result like AABBCCEE would be acceptable)?
Any Java code would be appreciated :-)
You could try using the Levenshtein Distance between your "target" string (e.g. "ABBCCE") and each element in your set. Pick a maximum threshold above which you will consider items to be unrelated (in your example here, a threshold of one or two perhaps), and reject everything in the set that has a Levenshtein Distance greater than that from the target string.
An example implementation of the Levenshtein Distance computation in Java can be found here.
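For reference, a minimal Java sketch of the approach follows: the standard dynamic-programming Levenshtein distance plus a threshold filter. The class and method names are hypothetical, and the threshold is something you would tune for your data.

```java
import java.util.ArrayList;
import java.util.List;

final class LevenshteinFilter {
    // Minimum number of single-character insertions, deletions and substitutions turning a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                       // empty prefix of a -> j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                       // empty prefix of b -> i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;                  // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keeps the candidates whose distance to the target is at most maxDistance, in input order.
    static List<String> related(String target, List<String> candidates, int maxDistance) {
        List<String> result = new ArrayList<>();
        for (String candidate : candidates) {
            if (distance(target, candidate) <= maxDistance) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

With the strings from the question and a threshold of 2, `related("ABBCCE", ..., 2)` keeps "ABBCC", "ABBCEE" and "AABBCCEE" and drops the rest; to get a single best match, pick the candidate with the smallest distance instead.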
You may be interested in the Levenshtein distance metric, which measures the similarity between two strings by counting the insertions, removals and substitutions needed to turn one into the other.