Missing Values in WEKA output

I'm trying to compare J48 and MLP on a variety of datasets using WEKA. One of these is the Primary Tumor dataset: https://archive.ics.uci.edu/ml/datasets/primary+tumor. I have converted it to CSV form, which can be easily imported into WEKA. You can download this file here: https://ufile.io/8nj13
I used the "numeric to nominal" on the class and all the attributes to fit the natural structure of the data. However, when I ran J48 (and MLP), I got a bunch of question marks "?" in my output, presumably due to not having enough observations/instances of the appropriate type.
How can I get around this? I'm sure there must be a filter for this kind of thing.

The detailed accuracy table displays a question mark when no instance was classified as that specific class. For example, since no instance was classified as class 16, WEKA cannot provide you with any details regarding class 16 classifications.
Regarding the number of instances of each class, you can use the ClassBalancer filter, found at weka.filters.supervised.instance.ClassBalancer. This should help balance out the weight carried by the various classes.
Also note that your dataset contains some missing values. This could be solved by either discarding the instances with missing data or running the ReplaceMissingValues filter, found at weka.filters.unsupervised.attribute.ReplaceMissingValues.
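For reference, here is a minimal sketch of how these steps could be chained with WEKA's Java API. The file name and the assumption that the class is the last column are placeholders for your setup:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.ClassBalancer;
import weka.filters.unsupervised.attribute.NumericToNominal;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class PrimaryTumorPrep {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("primary-tumor.csv").getDataSet(); // placeholder path

        // Convert every numeric column to nominal to match the data's natural structure
        NumericToNominal toNominal = new NumericToNominal();
        toNominal.setAttributeIndices("first-last");
        toNominal.setInputFormat(data);
        data = Filter.useFilter(data, toNominal);
        data.setClassIndex(data.numAttributes() - 1); // assumes the class is the last column

        // Fill "?" cells with the mode (nominal) or mean (numeric) of each attribute
        ReplaceMissingValues fillMissing = new ReplaceMissingValues();
        fillMissing.setInputFormat(data);
        data = Filter.useFilter(data, fillMissing);

        // Reweight instances so each class carries the same total weight
        ClassBalancer balancer = new ClassBalancer();
        balancer.setInputFormat(data);
        data = Filter.useFilter(data, balancer);

        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}

Note that ClassBalancer reweights rather than resamples, so the instance counts stay the same; if a class still receives no predictions, its precision and recall will remain undefined.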

Related

BERT Certainty (iOS)

I am currently integrating the BERT model listed on https://developer.apple.com/machine-learning/models/#text into an iOS application and have had difficulty removing answers that have low certainty.
I have used the sample code found at the link above but because I wanted to answer questions based on larger volumes of text, I loop over an array of paragraphs and predict an answer for each one. However, the model does not return nil or "No Answer" if an answer is not found and instead returns a (seemingly) random substring. I suppose what I am trying to ask is: is it possible to access the certainty of BERT's response to filter out unlikely results? Or is there another way to get BERT to only return results above a set certainty threshold?
After hours of searching, I've now found a solution. Ironically it only took three lines of code, but here it is anyway:
// Discard answers whose best start/end logit sum falls below a chosen threshold
if bestSum < 7.5 {
    return nil
}
I implemented this in the findBestLogitPair() method in the BERTOutput.swift file, as provided in Apple's sample code for question answering with BERT. I have since discovered that a logit does relate to probability in statistics (it is the log-odds of an event), but being a programmer, I had no idea!

DL4J - When using a ComputationGraph, is it possible to get the Class labels from it?

I saw how to do this from a DataSet object, and I saw a setLabel method, and I saw a getLabelMaskArrays, but none of these are what I'm looking for.
Am I just blind or is there not a way?
Thanks
Masking is for variable-length time series in RNNs. Most of the time you don't need it. Our built-in sequence dataset iterators also tend to handle these cases. For more details see our RNN page: https://deeplearning4j.org/usingrnns
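As a side note on the original question: the ComputationGraph itself only works with label indices, so the human-readable class names usually come from the iterator (many DataSetIterator implementations expose getLabels(), though not all populate it) or from wherever you defined the classes. A minimal sketch of mapping a prediction back to a name, with the label list supplied by the caller:

import java.util.List;
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.nd4j.linalg.api.ndarray.INDArray;

public final class LabelLookup {
    // Map the graph's numeric prediction back to a human-readable class name.
    // Assumes a graph with a single output layer producing one probability per class.
    public static String predictLabel(ComputationGraph graph, INDArray features, List<String> labels) {
        INDArray out = graph.outputSingle(features); // one probability per class
        int idx = out.argMax(1).getInt(0);           // index of the most likely class
        return labels.get(idx);
    }
}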

Problem with PMML generation of Random Forest in R

I am trying to generate a PMML from a random forest model I obtained using R. I am using the randomForest package 4.6-12 and the latest version of the pmml package for R, but every time I try to generate the PMML I obtain an error. Here is the code:
library(randomForest)
library(pmml)

data_train.rf <- randomForest(TARGET ~ ., data = train, ntree = 100,
                              na.action = na.omit, importance = TRUE)
pmml_file <- pmml(data_train.rf)
[1] "Now converting tree 1 to PMML"
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I haven't been able to find the origin of the problem, any thoughts?
Thanks in advance,
Alvaro
Looks like the variable splitNode has not been initialized inside the pmml package. The initialization path depends on the data type of the split variable (e.g. numeric, logical, factor). Please see the source code of the R/pmml.randomForest.R file inside the pmml package.
So, what are the columns in your train data.frame object? You can check with sapply(train, class).
Alternatively, you could try out the r2pmml package as it is much better at handling the randomForest model type.
The pmml code assumes the data type of the variables is numeric, simple logical, or factor. It won't work if the data you use are of some other type, DateTime for example.
It would help if your problem were reproducible; ideally you would provide the dataset you used. If not, at least a sample of it or a description of it; maybe summarize it.
You should also consider emailing the package maintainers directly.
I may have found the origin of this problem. My dataset has approximately 500,000 events and 30 variables; 10 of these variables are factors, and some of them have weakly populated levels, in some cases with as little as 1 event.
I built several random forest models, each time including an extra variable. I started with the numerical variables, which produced PMML without a problem; the same happened for the categorical variables whose levels were all well populated. When I tried to include categorical variables with weakly populated levels, I got the error:
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I suppose the origin of the problem is that when a level is weakly populated there may be no split to record (there is only one case); the randomForest package knows how to handle these cases, but the pmml package does not.
My tests show that this problem appears when the number of levels of a categorical variable goes beyond the maximum number allowed by the randomForest function. The split stored in the forest sublist is then no longer a positive integer, which the categorical split definition requires. Reducing the number of levels (for example by collapsing rare levels before training, or dropping unused ones with droplevels()) fixed the problem.

RapidMiner - unable to apply learning algorithm because Process Documents changes the regular attribute back to text

I have the following process:
Process Documents from Files (where I load the text files with their respective 6 classes) connects to Set Role (which changes the text attribute to a regular attribute, to allow machine learning), which connects to Process Documents from Data (I don't need the word vectors so I uncheck that; I keep the text, and within this process I tokenize, remove stopwords, stem, etc.). I then feed this into a validation operator (Bayes/SVM).
What is happening is that in the example set the text column goes back to type TEXT from regular after running Process Documents from Data, and hence I get the error "Input ExampleSet has no attributes", as there are zero regular attributes. This causes the process to fail, and I have no idea why. I tried to set the role again after this step, but then the error says "No examples in example set".
Please help, I have been stuck on this for two days!
I think I know the issue: I was applying a 10-fold cross-validation on a dataset with very few examples. Presumably some folds ended up with no examples at all, which is exactly what the "No examples in example set" error complains about.

How to use Bayesian analysis to compute and combine weights for multiple rules to identify books

I am experimenting with machine learning in general, and Bayesian analysis in particular, by writing a tool to help me identify my collection of e-books. The input data consist of a set of e-book files, whose names and in some cases contents contain hints as to the book they correspond to.
Some are obvious to the human reader, like:
Artificial Intelligence - A Modern Approach 3rd.pdf
Microsoft Press - SharePoint Foundation 2010 Inside Out.pdf
The Complete Guide to PC Repair 5th Ed [2011].pdf
Hamlet.txt
Others are not so obvious:
Vsphere5.prc (Actually 'Mastering VSphere 5' by Scott Lowe)
as.ar.pdf (Actually 'Atlas Shrugged' by Ayn Rand)
Rather than try to code various parsers for different formats of file names, I thought I would build a few dozen simple rules, each with a score.
For example, one rule would look in the first few pages of the file for something resembling an ISBN number, and if found would propose a hypothesis that the file corresponds to the book identified by that ISBN number.
Another rule would look to see if the file name is in 'Author - Title' format and, if so, would propose a hypothesis that the author is 'Author' and the title is 'Title'. Similar rules for other formats.
I thought I could also get a list of book titles and authors from Amazon or an ISBN database, and search the file name and first few pages of the file for any of these; any matches found would result in a hypothesis being suggested by that rule.
In the end I would have a set of tuples like this:
[rulename,hypothesis]
I expect that some rules, such as the ISBN match, will have a high probability of being correct, when they are available. Other rules, like matches based on known book titles and authors, would be more common but not as accurate.
My questions are:
Is this a good approach for solving this problem?
If so, is Bayesian analysis a good candidate for combining all of these rules' hypotheses into a compound score, to help determine which hypothesis is the strongest or most likely?
Is there a better way to solve this problem, or some research paper or book which you can suggest I turn to for more information?
It depends on the size of your collection and the time you want to spend training the classifier. It will be difficult to get generalization good enough to actually save you time. For any type of classifier you will have to create a large training set, and also find a lot of rules, before you get good accuracy. It will probably be more efficient (fewer false positives) to create the rules and use them only to suggest title alternatives for you to choose from, rather than to implement a full classifier. But if the purpose is learning, then go ahead.
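If you do want to experiment with the Bayesian combination from question 2, a simple treatment is naive Bayes over the rules: assign each rule a likelihood ratio, assume the rules fire independently, and sum log-likelihood ratios per hypothesis. A minimal sketch; the rule names, weights, and tuples below are all made up for illustration:

import java.util.HashMap;
import java.util.Map;

public final class HypothesisScorer {
    // Log-likelihood ratio per rule: how much one firing shifts the odds.
    // These weights are illustrative; in practice you would estimate them
    // from a hand-labeled sample of your files.
    private static final Map<String, Double> RULE_LLR = Map.of(
            "isbn-match", Math.log(50.0),          // ISBN hits are rarely wrong
            "author-title-format", Math.log(5.0),
            "known-title-substring", Math.log(2.0));

    public static void main(String[] args) {
        // Each [rulename, hypothesis] tuple votes for a candidate book.
        String[][] tuples = {
                {"known-title-substring", "Atlas Shrugged / Ayn Rand"},
                {"isbn-match", "Atlas Shrugged / Ayn Rand"},
                {"author-title-format", "AS AR / unknown"}};

        // Naive-Bayes style combination: independent rules multiply odds,
        // so their log-likelihood ratios add per hypothesis.
        Map<String, Double> score = new HashMap<>();
        for (String[] t : tuples) {
            score.merge(t[1], RULE_LLR.getOrDefault(t[0], 0.0), Double::sum);
        }
        score.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .ifPresent(e -> System.out.println("Best: " + e.getKey() + " (" + e.getValue() + ")"));
    }
}

The independence assumption across rules is what makes this "naive", but it is usually a reasonable starting point before investing in a full classifier.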
