Will one-hot encoded data in H2O affect the model somehow? - machine-learning

I have one-hot encoded my data separately (there are 30 variables, each with multiple categories under a single main variable). I want to know whether this will affect GBM, GLM and DRF in H2O. The documentation says that XGBoost internally encodes categoricals as one-hot. For deep learning models I can maybe use the use_all_factor_levels parameter, but I cannot find how to stop the implicit one-hot encoding. Or can I leave it as is because the results will be the same?
A little detail on why I needed to preprocess and encode: the original data consists of 30 columns, each row holds the answers from one participant, and each cell contains multiple categories as strings separated by newlines. The logical solution was to split each cell and use dummy coding (one-hot encoding) to turn the categories into columns. There are now 150 columns and 250 rows. I want to find out whether one-hot encoded data is handled automatically in H2O.
I have read the documentation and the tutorial published on amazonaws; maybe I am missing something.
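The preprocessing described in the question (splitting newline-separated cells and dummy coding the result) could look roughly like the following. This is a minimal sketch in plain Python; the function name `dummy_code`, the column name `q1` and the sample answers are hypothetical and only illustrate the idea.

```python
def dummy_code(rows, column):
    """Split newline-separated cells and return (column names, 0/1 indicator rows)."""
    # Each cell may hold several categories separated by newlines.
    split_cells = [
        [value.strip() for value in row[column].split("\n") if value.strip()]
        for row in rows
    ]
    # One indicator column per distinct category, in sorted order.
    levels = sorted({value for cell in split_cells for value in cell})
    names = [f"{column}_{level}" for level in levels]
    indicators = [[1 if level in cell else 0 for level in levels] for cell in split_cells]
    return names, indicators


# Hypothetical answers from three participants to one question.
rows = [{"q1": "red\nblue"}, {"q1": "blue"}, {"q1": "green\nred"}]
names, indicators = dummy_code(rows, "q1")
print(names)
print(indicators)
```

Because a cell can contain more than one category, a row may have several 1s for the same original variable, which a single enum column cannot represent.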

If you have categorical columns, you don't need to encode them. You just need to make sure that each such column is read in as enum and not int. For Deep Learning, if you want to use all factor levels of the categorical columns, set the parameter use_all_factor_levels to True, true or TRUE in Python, Java or R respectively.
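In the Python client, the advice above might be sketched as follows. The function name, `csv_path`, `response` and `categorical_columns` are hypothetical placeholders, and the sketch assumes the data still has one column per variable with a single category per cell. The `categorical_encoding` argument is, as far as I know, the H2O option that controls how tree models treat enum columns; treat the chosen value as an assumption to check against the documentation for your version.

```python
def train_on_raw_categoricals(csv_path, response, categorical_columns):
    """Train a GBM and a Deep Learning model on un-encoded categorical columns."""
    # Imported inside the function so the sketch can be loaded
    # without h2o installed or a cluster running.
    import h2o
    from h2o.estimators import H2ODeepLearningEstimator, H2OGradientBoostingEstimator

    h2o.init()
    frame = h2o.import_file(csv_path)

    # Make sure every categorical column is an enum (factor), not int.
    for column in categorical_columns:
        frame[column] = frame[column].asfactor()

    predictors = [name for name in frame.columns if name != response]

    # Assumption: "enum" asks the tree model to use the categorical levels directly.
    gbm = H2OGradientBoostingEstimator(categorical_encoding="enum")
    gbm.train(x=predictors, y=response, training_frame=frame)

    # Keep all factor levels instead of dropping one level per column.
    dl = H2ODeepLearningEstimator(use_all_factor_levels=True)
    dl.train(x=predictors, y=response, training_frame=frame)
    return gbm, dl
```

A call such as `train_on_raw_categoricals("survey.csv", "target", ["q1", "q2"])` (file and column names invented) would need a local H2O installation.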

Related

Calculation of Internet Checksum of two 16-bit streams

I want to calculate the Internet checksum of two bit streams of 16 bits each. Do I need to break these strings into segments, or can I directly sum them?
Here are the strings:
String 1 = 1010001101011111
String 2 = 1100011010000110
Short answer
No. You don't need to split them.
Somewhat longer answer
Not sure exactly what you mean by "internet" checksum (a hash or checksum is just the result of a mathematical operation, and has no direct relation or dependence on the internet), but anyway:
The checksum of any value should not depend on the length of the input. In theory, your input strings could be of any length at all.
You can test this with a basic online checksum generator such as this one, for instance. That appears to generate a whole slew of checksums using lots of different algorithms. The names of the algorithms appear on the left in the list.
If you want to do this in code, a good starting point might be to search for examples using one of them in whatever language / environment you are working in.
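If the question refers to the Internet checksum used by IP, TCP and UDP (RFC 1071), which is an assumption here, the two strings are already 16-bit words and can be summed directly: add them with one's-complement arithmetic (fold any carry back into the low 16 bits) and then invert the result. A minimal sketch, with the function name `internet_checksum` chosen for illustration:

```python
def internet_checksum(words):
    """One's-complement sum of 16-bit words, then bitwise inverted (RFC 1071 style)."""
    total = 0
    for word in words:
        total += word
        # Fold the carry out of bit 16 back into the low 16 bits.
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


word1 = int("1010001101011111", 2)
word2 = int("1100011010000110", 2)
checksum = internet_checksum([word1, word2])
print(format(checksum, "016b"))  # prints 1001011000011001
```

Here the plain sum overflows 16 bits, so the carry is wrapped around before the final inversion.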

Problem with PMML generation of Random Forest in R

I am trying to generate PMML from a random forest model I built in R. I am using the randomForest package 4.6-12 and the latest version of the pmml package for R, but every time I try to generate the PMML I get an error. Here is the code:
data_train.rf <- randomForest( TARGET ~ ., data = train, ntree=100, na.action=na.omit, importance=TRUE)
pmml_file = pmml(data_train.rf)
[1] "Now converting tree 1 to PMML"
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I haven't been able to find the origin of the problem, any thoughts?
Thanks in advance,
Alvaro
It looks like the variable splitNode has not been initialized inside the "pmml" package. The initialization pathway depends on the data type of the split variable (e.g. numeric, logical, factor). Please see the source code of the /R/pmml.randomForest.R file inside the "pmml" package.
So, what are the columns in your train data.frame object?
Alternatively, you could try out the r2pmml package as it is much better at handling the randomForest model type.
The pmml code assumes the data type of each variable is numeric, simple logical or factor. It won't work if the data you use is of some other type, DateTime for example.
It would help if your problem is reproducible; ideally you would provide the dataset you used. If not, at least a sample of it or a description of it...maybe summarize it.
You should also consider emailing the package maintainers directly.
I may have found the origin of this problem. My dataset has approximately 500000 events and 30 variables. 10 of these variables are factors, and some of them have weakly populated levels, in some cases with as few as 1 event.
I built several Random Forest models, each time adding an extra variable to the model. Adding the numerical variables caused no problem when generating the PMML, and the same was true for the categorical variables whose levels were all well populated. When I tried to include categorical variables with weakly populated levels, I got the error:
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I suppose the origin of the problem is that, in some situations, when a tree is built on a weakly populated level there is no split because there is only one case. The randomForest package knows how to handle these cases, but the pmml package does not.
My tests show that this problem appears when the number of levels of a categorical variable goes beyond the maximum number allowed by the randomForest function. The split defined in the forest sublist is no longer a positive integer which is required by the split definition for categorical objects. Reducing the number of levels fixed the problem.
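One way to reduce the number of levels before training is to keep the most frequent levels and merge the rest into a single catch-all level. The thread itself uses R; the sketch below shows the idea in plain Python only, and the function name `collapse_rare_levels`, the `max_levels` parameter and the `"OTHER"` label are all hypothetical.

```python
from collections import Counter


def collapse_rare_levels(values, max_levels, other="OTHER"):
    """Keep the most frequent levels and merge the rest into one catch-all level."""
    counts = Counter(values)
    if len(counts) <= max_levels:
        return list(values)  # already within the limit, nothing to merge
    # Reserve one slot for the catch-all level.
    keep = {level for level, _ in counts.most_common(max_levels - 1)}
    return [value if value in keep else other for value in values]


values = ["a", "a", "a", "b", "b", "c", "d"]
collapsed = collapse_rare_levels(values, max_levels=3)
print(collapsed)
```

The appropriate value for `max_levels` depends on the limit of the modelling package in use.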

Best practices for handling non decimal variables. [ACM KDD 2009 CUP]

For practice I decided to use a neural network to solve the two-class classification problem set by the ACM Special Interest Group on Knowledge Discovery and Data Mining for its 2009 cup. The problem I have found is that the data set contains a lot of "empty" variables and I am not sure how to handle them. This raises a second question: how should I handle other non-decimal values such as strings? What are your best practices?
Most approaches require numerical features, so the categorical ones have to be converted into counts. E.g. if a certain string is present among the attributes of an instance, its count is 1, otherwise 0. If it occurs more than once, its count increases correspondingly. From this point of view any feature that is not present (or "empty" as you put it) has a count of 0. Note that the attribute names have to be unique.
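The count conversion described above can be sketched in a few lines. This is an illustration only: the function name `count_features` and the sample instances are hypothetical, and each instance is assumed to be a list of string attributes.

```python
from collections import Counter


def count_features(instances):
    """Turn lists of string attributes into rows of per-attribute counts."""
    # One numeric column per distinct attribute string.
    vocabulary = sorted({attr for instance in instances for attr in instance})
    rows = []
    for instance in instances:
        counts = Counter(instance)
        # Attributes that are absent ("empty") get a count of 0.
        rows.append([counts[attr] for attr in vocabulary])
    return vocabulary, rows


instances = [["red", "large", "red"], ["small"], []]
vocabulary, rows = count_features(instances)
print(vocabulary)
print(rows)
```

The last instance has no attributes at all and therefore becomes a row of zeros.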

Weka java library: how to get string representation of classified instance?

Currently I'm working on a project of classifying search queries into the following eight types: {athlete, actor, artist, politician, geo, facility, QA, definition}. After a bit of work I managed to score 78% correctly classified instances for my set of 300 sample queries using a Multilayer Perceptron classifier when I evaluate the classifier with a stratified 10-fold cross validation, which is reasonably good I think.
Using the Weka Java library I implemented the whole thing in Java code, so I can write a program that dynamically feeds a query to the classifier and retrieves its query type. I managed to implement the whole classifier training part successfully. The next step would be to use either classifyInstance() or distributionForInstance() to determine the class to which the query is classified.
However, classifyInstance() only returns a double value, and I do not know how to get the actual query type out of it. The Weka wikispaces tell me I can use
unlabeled.classAttribute().value((int) clsLabel);
after calling classifyInstance() to get the String representation of the class. However, this always seems to return the empty string in my case.
Using distributionForInstance() I'm able to successfully retrieve an array with eight double values between 0 and 1 (which is good, as I classify into eight query types). However, what is the order of this array? Is the first element in the result array the first class that occurs in my training file? Or is there some other predefined element order in this result array (e.g. alphabetically)? The weka documentation does not give any information on this.
I hope someone will be able to help me out!
Internally, Weka handles all values as doubles. When you create the Attribute, you pass it an array of strings that lists the possible nominal values. The double that classification returns is the index of the chosen attribute in the original array. So if you had code that looked like this:
String[] attributeValues = {"a", "b", "c"};
Attribute a = new Attribute("attributeName", attributeValues);
and classifyInstance() returned 2, then the class it chose would be attributeValues[2] or c.
With the distributionForInstance() method, the indexes of the two arrays match, so attributeValues[0] is the string name for the first element of the array returned.
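The index mapping described above can be illustrated without Weka. The sketch below is plain Python rather than the Weka API: `class_values` stands in for the array of nominal values passed to the class attribute, and the ordering of the eight query types as well as the probabilities are made up for the example.

```python
def label_from_index(class_values, predicted):
    """Map the double returned by a classifier to its nominal label."""
    return class_values[int(predicted)]


def label_from_distribution(class_values, distribution):
    """Pick the label whose position holds the highest probability."""
    best = max(range(len(distribution)), key=distribution.__getitem__)
    return class_values[best], distribution[best]


# Assumed order in which the nominal values were declared.
class_values = ["athlete", "actor", "artist", "politician",
                "geo", "facility", "QA", "definition"]

print(label_from_index(class_values, 2.0))
distribution = [0.05, 0.10, 0.05, 0.50, 0.10, 0.10, 0.05, 0.05]
print(label_from_distribution(class_values, distribution))
```

The point is that both results are positions in the declared list of nominal values, not positions in the training file or in alphabetical order.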
UPDATE (because of downvote)
The above method won't work if you're letting weka create the Instances object itself (e.g. if you're reading from an arff file). That doesn't seem to be the case given your question, but if it is, then please post code so we can see what's going on.

How to detect tabular data from a variety of sources

In an experimental project I am playing with I want to be able to look at textual data and detect whether it contains data in a tabular format. Of course there are a lot of cases that could look like tabular data, so I was wondering what sort of algorithm I'd need to research to look for common features.
My first thought was to write a long switch/case statement that checked for data separated by tabs, then another case for data separated by pipe symbols, then yet another case for data separated in some other way, and so on. Of course I realize that I would have to come up with a list of different things to detect, but I wondered if there was a more intelligent way of detecting these features than doing a relatively slow search for each type.
I realize this question isn't especially eloquently put so I hope it makes some sense!
Any ideas?
(no idea how to tag this either - so help there is welcomed!)
The only reliable scheme would be to use machine-learning. You could, for example, train a perceptron classifier on a stack of examples of tabular and non-tabular materials.
A mixed solution might be appropriate, i.e. one where you handle the most common or obvious cases with simple heuristics (in a "switch-like" manner) as you suggested, and leave the harder cases to automated learning and other types of classifier logic.
This assumes that you do not already have defined types stored in the TSV.
A TSV file is typically
[Value1]\t[Value..N]\n
My suggestion would be to:
1. Count all the tabs
2. Count all the newlines
3. Count the tabs in the first row
4. Divide the total number of tabs by the number of tabs in the first row
If the division in step 4 leaves a remainder of 0, then you have a candidate TSV file. From there you may want to do one of the following:
You can continue reading the data, ignoring lines with fewer or more tabs than predicted
You can scan each line before reading to make sure all are consistent
You can read up to the first line that does not fit the format and then throw an error
Once you have a good prediction of the number of tab-separated values you can use a regular expression to parse out the values [as a group].
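The tab-counting steps above can be sketched as follows. This is a rough illustration under stated assumptions: the function names are hypothetical, lines are assumed to end in "\n", and empty lines are ignored. The remainder test only nominates candidates, so the second helper implements the "scan each line" option to confirm consistency.

```python
def looks_like_tsv(text):
    """Remainder test: total tabs must be a multiple of the first row's tab count."""
    lines = [line for line in text.split("\n") if line]
    if not lines:
        return False
    total_tabs = text.count("\t")
    first_row_tabs = lines[0].count("\t")
    if first_row_tabs == 0:
        return False  # no tabs in the first row, so not a TSV candidate
    return total_tabs % first_row_tabs == 0


def inconsistent_lines(text):
    """Return 1-based numbers of lines whose tab count differs from the first row."""
    lines = [line for line in text.split("\n") if line]
    if not lines:
        return []
    expected = lines[0].count("\t")
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.count("\t") != expected
    ]


print(looks_like_tsv("a\tb\tc\n1\t2\t3\n"))
print(inconsistent_lines("a\tb\n1\t2\n3\n"))
```

The remainder test can pass for inputs whose rows differ in length but happen to sum to a multiple of the first row's tab count, which is why the per-line scan is worth running on any candidate.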
