Weka java library: how to get string representation of classified instance? - machine-learning

Currently I'm working on a project of classifying search queries into the following eight types: {athlete, actor, artist, politician, geo, facility, QA, definition}. After a bit of work I managed to get 78% correctly classified instances for my set of 300 sample queries, using a Multilayer Perceptron classifier evaluated with stratified 10-fold cross-validation, which I think is reasonably good.
Using the Weka Java library I implemented the whole thing in Java, so I can write a program that dynamically feeds a query to the classifier and retrieves its query type. I managed to implement the whole classifier-training part successfully. The next step would be to use either classifyInstance() or distributionForInstance() to determine the class to which the query is assigned.
classifyInstance(), however, only returns a double value, and I do not know how to get the actual query type out of it. The Weka wikispaces tell me I can use
unlabeled.classAttribute().value((int) clsLabel);
after calling classifyInstance() to get the String representation of the class; however, this seems to always return the empty string in my case.
Using distributionForInstance() I'm able to successfully retrieve an array of eight double values between 0 and 1 (which is good, as I classify into eight query types). However, what is the order of this array? Is the first element in the result array the first class that occurs in my training file? Or is there some other predefined element order (e.g. alphabetical)? The Weka documentation does not give any information on this.
I hope someone will be able to help me out!

Internally, Weka handles all values as doubles. When you create the Attribute, you pass it an array of strings that lists the possible nominal values. The double that classification returns is the index of the chosen value in the original array. So if you had code that looked like this:
String[] attributeValues = {"a", "b", "c"};
Attribute a = new Attribute("attributeName", attributeValues);
and classifyInstance() returned 2, then the class it chose would be attributeValues[2] or c.
With the distributionForInstance() method, the indexes of the two arrays match, so attributeValues[0] is the string name for the first element of the array returned.
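In code, the mapping in both directions might look like this; a minimal sketch, assuming the classifier is already trained and the instance knows its dataset (the method names classify and printDistribution are just illustrative):
import weka.classifiers.Classifier;
import weka.core.Instance;

public class ClassNameLookup {
    // Map classifyInstance()'s numeric label back to the nominal value's name.
    static String classify(Classifier classifier, Instance query) throws Exception {
        double clsLabel = classifier.classifyInstance(query);
        return query.classAttribute().value((int) clsLabel);
    }

    // Print each class name with its probability from distributionForInstance().
    static void printDistribution(Classifier classifier, Instance query) throws Exception {
        double[] dist = classifier.distributionForInstance(query);
        for (int i = 0; i < dist.length; i++) {
            // dist[i] belongs to the i-th value of the class attribute, i.e.
            // the order in which the values were declared when the attribute
            // was created (or the order in the ARFF header).
            System.out.println(query.classAttribute().value(i) + " -> " + dist[i]);
        }
    }
}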
UPDATE (because of downvote)
The above method won't work if you're letting Weka create the Instances object itself (e.g. if you're reading from an ARFF file). That doesn't seem to be the case given your question, but if it is, then please post code so we can see what's going on.
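For the ARFF route, a minimal sketch of the usual pattern follows (the file names are placeholders). Note that setClassIndex() must be called on both datasets; without it the class attribute is undefined:
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class QueryTypeDemo {
    public static void main(String[] args) throws Exception {
        // Load data; the last attribute is assumed to be the class.
        Instances train = DataSource.read("train.arff");
        train.setClassIndex(train.numAttributes() - 1);
        Instances unlabeled = DataSource.read("unlabeled.arff");
        unlabeled.setClassIndex(unlabeled.numAttributes() - 1);

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.buildClassifier(train);

        for (int i = 0; i < unlabeled.numInstances(); i++) {
            double clsLabel = mlp.classifyInstance(unlabeled.instance(i));
            // Resolve the numeric label against the class attribute's values.
            System.out.println(unlabeled.classAttribute().value((int) clsLabel));
        }
    }
}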

Related

Will one-hot encoded data in H2O affect the model somehow?

I have one-hot encoded my data separately (there are multiple categories under a single main variable, across 30 variables). I want to know if this will affect GBM, GLM, and DRF in H2O. The documentation says that XGBoost internally one-hot encodes. For deep learning models I can maybe use the all-factor-levels parameter, but I cannot find how to stop the implicit one-hot encoding, or whether I can leave it as is because the results will be the same.
A little detail on why I needed to preprocess and encode: the original data consists of 30 columns, each row is one participant's answers, and each cell holds multiple categories as strings separated by newlines. The logical solution was to one-hot encode using dummy coding, splitting each cell and encoding it into separate columns. The columns are now 150 and the rows 250. I want to find out whether the one-hot encoded data is handled automatically in H2O.
I have read the documentation and a tutorial published on amazonaws; maybe I am missing something.
If you have categorical columns, you don't need to encode them. You just need to make sure that each such column is read in as enum and not int. For Deep Learning, if you want to use all factor levels of the categorical columns, you just need to set the parameter use_all_factor_levels=True/true/TRUE for Python, Java, or R respectively.

Missing Values in WEKA output

I'm trying to compare J48 and MLP on a variety of datasets using WEKA. One of these is: https://archive.ics.uci.edu/ml/datasets/primary+tumor. I have converted this to CSV form which can be easily imported into WEKA. You can download this file here: https://ufile.io/8nj13
I used the "numeric to nominal" filter on the class and all the attributes to fit the natural structure of the data. However, when I ran J48 (and MLP), I got a bunch of question marks ("?") in my output, presumably due to not having enough observations/instances of the appropriate type.
How can I get around this? I'm sure there must be a filter for this kind of thing. I've attached a picture below.
The detailed accuracy table displays a question mark when no instance was actually classified as that specific class. This means, for example, that since no instance was classified as class 16, WEKA cannot provide you with details regarding class 16 classifications. This image might help you understand.
Regarding the number of instances of the appropriate class, you can use the ClassBalancer filter, found at weka/filters/supervised/instance/ClassBalancer. This should help balance out the number of instances across the various classes.
Also note that your dataset contains some missing values; this could be addressed either by discarding the instances with missing data or by running the ReplaceMissingValues filter, found at weka/filters/unsupervised/attribute/ReplaceMissingValues.
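As a sketch of how both filters might be applied programmatically, assuming a Java workflow ("tumor.csv" is a placeholder; note that ClassBalancer reweights instances rather than resampling them):
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.ClassBalancer;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class PreprocessTumor {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("tumor.csv");
        data.setClassIndex(data.numAttributes() - 1);

        // Fill in missing values with the modes/means of the data.
        ReplaceMissingValues rmv = new ReplaceMissingValues();
        rmv.setInputFormat(data);
        data = Filter.useFilter(data, rmv);

        // Reweight instances so every class carries the same total weight.
        ClassBalancer balancer = new ClassBalancer();
        balancer.setInputFormat(data);
        data = Filter.useFilter(data, balancer);
    }
}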

What are buckets in terms of hash functions?

Looking at the book Mining of Massive Datasets, section 1.3.2 has an overview of Hash Functions. Without a computer science background, this is quite new to me; Ruby was my first language, where a hash seems to be equivalent to Dictionary<object, object>. And I had never considered how this kind of datastructure is put together.
The book mentions hash functions, as a means of implementing these dictionary data structures. This paragraph:
First, a hash function h takes a hash-key value as an argument and produces a bucket number as a result. The bucket number is an integer, normally in the range 0 to B − 1, where B is the number of buckets. Hash-keys can be of any type. There is an intuitive property of hash functions that they "randomize" hash-keys.
What exactly are buckets in terms of a hash function? It sounds like buckets are array-like structures, and that the hash function is some kind of algorithm / array-like-structure search that produces the same bucket number every time. What is inside this metaphorical bucket?
I've always read that JavaScript objects / Ruby hashes / etc. don't guarantee order. In practice I've found that key order doesn't change (actually, I think with an older version of Mozilla's Rhino interpreter the JS object order DID change, but I can't be sure...).
Does that mean that hashes (Ruby) / objects (JS) ARE NOT resolved by these hash functions?
Does the word hashing take on different meanings depending on the level at which you are working with computers? i.e. it would seem that a Ruby hash is not the same as a C++ hash...
When you hash a value, any useful hash function generally has a smaller range than the domain. This means that out of a large list of input values (for example all possible combinations of letters) it will output any of a smaller list of values (a number capped at a certain length). This means that more than one input value can map to the same output value.
When this is the case, the output values are referred to as buckets.
Consider the function f(x) = x mod 2
This generates the following outputs;
1 => 1
2 => 0
3 => 1
4 => 0
In this case there are two buckets (1 and 0), with a bunch of input values that fall into each.
A good hash function will fill all of these 'buckets' equally, and so enable faster searching. If you take the mod of any number, you get the bucket to look into, and thus have to search through fewer results than if you searched the whole input set, since each bucket holds fewer results than the full set of inputs. In the ideal situation the hash is fast to calculate and there is only one result in each bucket; this lets lookups take only as long as applying the hash function.
This is a simplified example of course but hopefully you get the idea?
The concept of a hash function is always the same. It's a function that calculates some number to represent an object. The properties of this number should be:
it's relatively cheap to compute
it's as different as possible for all objects.
Let's give a really artificial example to show what I mean by this and why/how hashes are usually used.
Take all natural numbers. Now let's assume it's expensive to check if 2 numbers are equal.
Let's also define a relatively cheap hash function as follows:
hash = number % 10
The idea is simple, just take the last digit of the number as the hash. In the explanation you got, this means we put all numbers ending in 1 into an imaginary 1-bucket, all numbers ending in 2 in the 2-bucket etc...
Those buckets don't really exist as a data structure. They just make it easy to reason about the hash function.
Now that we have this cheap hash function, we can use it to reduce the cost of other things. For example, we want to create a new data structure that enables cheap searching of numbers. Let's call this data structure a hashmap.
Here we actually put all the numbers with hash=1 together in a list/set/..., we put the numbers with hash=5 into their own list/set ... etc.
And if we then want to look up some number, we first calculate its hash value. Then we check the list/set corresponding to this hash, and compare only "similar" numbers to find the exact number we want. This means we only had to do a cheap hash calculation and then check 1/10th of the numbers with the expensive equality check.
Note here that we use the hash function to define a new datastructure. The hash itself isn't a datastructure.
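A minimal sketch of this last-digit bucketing in Java (the numbers are arbitrary; the "buckets" exist only inside the map we build):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LastDigitBuckets {
    public static void main(String[] args) {
        int[] numbers = {7, 42, 311, 90, 17, 1234};

        // Group the numbers by the cheap hash: their last digit.
        Map<Integer, List<Integer>> buckets = new HashMap<>();
        for (int n : numbers) {
            buckets.computeIfAbsent(n % 10, k -> new ArrayList<>()).add(n);
        }

        // Looking up 17 only scans bucket 7, not the whole list.
        int target = 17;
        List<Integer> bucket = buckets.getOrDefault(target % 10, new ArrayList<>());
        System.out.println(bucket.contains(target)); // prints: true
    }
}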
Consider a phone book.
Imagine that you wanted to look for Donald Duck in a phone book.
It would be very inefficient to have to look at every page, and every entry on that page. So rather than doing that, we do the following:
We create an index
We create a way to obtain an index key from a name
For a phone book, the index goes from A-Z, and the function used to get the index key, is just getting first letter from the Surname.
In this case, the hashing function takes Donald Duck and gives you D.
Then you take D and go to the index where all the people with Surnames starting with D are.
That would be a very oversimplified way to put it.
Let me explain in simple terms. Buckets come into the picture when handling collisions with the chaining technique (open hashing, also called closed addressing).
Here, each array entry corresponds to a bucket, and each non-empty array entry holds a pointer to the head of a linked list (the bucket is implemented as a linked list).
The hash table uses the hash function to calculate an index into the array of buckets, from which the desired value can be found.
That is, while checking whether an element is in the hash table, the key is first hashed to find the correct bucket to look into. Then, the corresponding linked list is traversed to locate the desired element.
Similarly, for any element addition or deletion, hashing is used to find the appropriate bucket; the bucket is then checked for the presence or absence of the required element, which is added or removed by traversing the corresponding linked list.
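A minimal sketch of such a chained hash table (fixed at 16 buckets, String keys only, no resizing):
import java.util.LinkedList;

public class ChainedHashTable {
    private static final int B = 16; // number of buckets

    @SuppressWarnings("unchecked")
    private final LinkedList<String>[] buckets = new LinkedList[B];

    // Map the key's hash code into the bucket range 0 .. B-1.
    private int bucketIndex(String key) {
        return Math.floorMod(key.hashCode(), B);
    }

    public void add(String key) {
        int i = bucketIndex(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        if (!buckets[i].contains(key)) buckets[i].add(key);
    }

    public boolean contains(String key) {
        LinkedList<String> chain = buckets[bucketIndex(key)];
        // Only this one chain is traversed, never the whole table.
        return chain != null && chain.contains(key);
    }

    public void remove(String key) {
        LinkedList<String> chain = buckets[bucketIndex(key)];
        if (chain != null) chain.remove(key);
    }
}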

Problem with PMML generation of Random Forest in R

I am trying to generate a PMML file from a random forest model I obtained using R. I am using the randomForest package 4.6-12 and the latest version of the pmml package for R, but every time I try to generate the PMML I obtain an error. Here is the code:
data_train.rf <- randomForest( TARGET ~ ., data = train, ntree=100, na.action=na.omit, importance=TRUE)
pmml_file = pmml(data_train.rf)
[1] "Now converting tree 1 to PMML"
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I haven't been able to find the origin of the problem, any thoughts?
Thanks in advance,
Alvaro
Looks like the variable splitNode has not been initialized inside the "pmml" package. The initialization pathway depends on the data type of the split variable (e.g. numeric, logical, factor). Please see the source code of the /R/pmml.randomForest.R file inside the "pmml" package.
So, what are the columns in your train data.frame object?
Alternatively, you could try out the r2pmml package as it is much better at handling the randomForest model type.
The pmml code assumes the data types of the variables are numeric, simple logical, or factor. It won't work if the data you use are of some other type; DateTime, for example.
It would help if your problem were reproducible; ideally you would provide the dataset you used. If not, at least a sample of it or a description of it... maybe summarize it.
You should also consider emailing the package maintainers directly.
I may have found the origin of this problem. In my dataset I have approximately 500,000 events and 30 variables; 10 of these variables are factors, and some of them have weakly populated levels, in some cases with as little as one event.
I built several random forest models, each time including an extra variable in the model. I started by adding the numerical variables, generating a PMML without a problem; the same happened for the categorical variables with all levels largely populated. When I tried to include categorical variables with weakly populated levels, I got the error:
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I suppose the origin of the problem is that, when building a tree, a weakly populated level may produce no split, as there is only one case; the randomForest package knows how to handle these cases, but the pmml package does not.
My tests show that this problem appears when the number of levels of a categorical variable goes beyond the maximum number allowed by the randomForest function. The split defined in the forest sublist is then no longer a positive integer, which is required by the split definition for categorical variables. Reducing the number of levels fixed the problem.

How do I efficiently search through an ordered list?

I have a function that predicts a word being typed and returns the possibilities in an array. Unfortunately those aren't sorted by frequency of use. I do have a list of 10K words ordered from most frequent to least frequent. What would be an efficient way to compare the words in the array against the ordered list and return the most frequent one (i.e. the one encountered first)?
I was tipped off by a friend to use a binary search tree, but I really don't see how that helps me. From what I understood from the website I read, only numerical values can be used. Am I wrong in thinking so? Is there a better way of doing the aforementioned task?
Thanks in advance
You could create a dictionary with words as keys and frequencies as values. Then iterate over your result array, use the dictionary to obtain the frequency value for each item, and return the item with the highest frequency.
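Sketched in Java for concreteness (wordsByFrequency and candidates are placeholder names for your ordered list and the prediction array):
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BestCandidate {
    // Return the candidate that appears earliest in the frequency-ordered list.
    static String mostFrequent(List<String> wordsByFrequency, String[] candidates) {
        // Build the dictionary once: word -> rank (0 = most frequent).
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < wordsByFrequency.size(); i++) {
            rank.putIfAbsent(wordsByFrequency.get(i), i);
        }

        // Each lookup is O(1), so this pass is linear in the candidate count.
        String best = null;
        int bestRank = Integer.MAX_VALUE;
        for (String c : candidates) {
            Integer r = rank.get(c);
            if (r != null && r < bestRank) {
                bestRank = r;
                best = c;
            }
        }
        return best;
    }
}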
I wouldn't use a vanilla binary search tree here. It would be possible - as Taylor Kirkpatrick says, you could just create a tree with words as keys and frequencies as values, and use that to find the frequency for each result word, in much the same way as the dictionary solution.
The problem is that you cannot guarantee that a simple binary tree will be balanced. From the sound of it your data would probably be OK, since your words are in frequency order. The worst case would be if the words were in alphabetical order - then your binary tree would end up identical to a linked list. It would never branch, since every node would attach to the right of the previous one. So the computational complexity of a search would be the same as iterating over the array of words - O(n) instead of O(log2 N) (the best case for binary trees).
Of course, you could guard against this by randomising the list of words before doing the insert. But to my mind it's just easier to use a dictionary. I don't know what the actual implementation of Swift dictionaries is (and we won't until they open-source it in a couple of months), but you can take it as read that it will outperform a vanilla BST for value retrieval.
I don't know what the background to this problem is - if you are learning CS it might be worth implementing the BST just for intellectual growth - in this case, with only 10,000 items you might find the performance differences are ultimately quite small. But if you are a working programmer trying to solve a problem, go with the dictionary approach.
You put all your words into a dictionary or a set. That's it. Dictionary if you have data associated with the words, set if you have no data and just want to know if the word is in the list or not.
You might want to use a Trie.
Put your word list into it. For every character entered, you traverse the Trie as deeply as you can and then show all paths to leaf nodes as possible completions.
Since the word list you have is likely static, you can precompute the Trie and load it from disk/network/whatever at startup if performance is a concern.
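A minimal Trie sketch, again in Java (restricted to lowercase a-z for brevity):
import java.util.ArrayList;
import java.util.List;

public class Trie {
    private final Trie[] children = new Trie[26];
    private boolean isWord;

    public void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) node.children[i] = new Trie();
            node = node.children[i];
        }
        node.isWord = true;
    }

    // Return every stored word that starts with the given prefix.
    public List<String> complete(String prefix) {
        Trie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children[c - 'a'];
            if (node == null) return new ArrayList<>(); // no word has this prefix
        }
        List<String> out = new ArrayList<>();
        collect(node, new StringBuilder(prefix), out);
        return out;
    }

    private void collect(Trie node, StringBuilder path, List<String> out) {
        if (node.isWord) out.add(path.toString());
        for (int i = 0; i < 26; i++) {
            if (node.children[i] != null) {
                path.append((char) ('a' + i));
                collect(node.children[i], path, out);
                path.deleteCharAt(path.length() - 1);
            }
        }
    }
}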
You can use a binary search tree with anything as the actual value. To actually make use of the tree, use the frequency of each word as the numerical key. This is actually a pretty good solution to your problem. Each node of the tree will contain the word and a numeric value representing its frequency.
Hope that helps.
