Predict a alphanumeric value - machine-learning

is it possible to predict a alphanumeric value by using ml.net (or any other ML-Framework)?
For example: i got a large list of alphanumeric codes like: "245ab67k"
Is it possible to resolve the method of how those codes got generated or is it possible to use ML.NET (or any other ML-Framework) to predict a possible alphanumeric code.
Appreciate your help!

Related

Will Hot encoded data in H2O effect the model somehow?

I have hot encoded data separately (there are multiple categories under a single main variable and 30 variables). I want to know if this will effect GB, GL, DRF in H2O . the documentation says for XGBOOST it internally encodes to one-hot For deep learning models i can may be use All factor parameter but I cannot find how to stop implicit hot encoding or let it be as the results will be same?
A little detail on why i needed to preprocess and encode. The original data i have consist of 30 columns and each row is answer from participant and each column row data has multiple categories as string separated by new line . The logical solution was to use hot encode suing dummy coding to split each cell and encode to get columns. The columns are not 150 and rows are 250. I want to find out if the hot encoded data is handled automatically in H2O ?
I have read documentation and tutorial published by amazonaws, may be I am missing something.
If you have categorical columns, you don't need to encode it. You just need to make sure that that column is read in as enum and not int. For Deeplearning, if you want to use all factors of the categorical columns, you just need to set the parameter use_all_factor_levels=True/true/TRUE for Python, Java or R.

How to validate a negative value being passed in as a alphanumeric literal?

Basically, I've been given an input file that gets passed in too a mix of fields that are alphanumeric and numeric. My goal is to test each field for valid data. The first field is an alphanumeric with a Pic X(3) description that should represent a number. However, because i'm testing for data validity there will be instances where the value could have a letter in there such as 0R1 or a negative value -001.
When testing for data, using the "Is Numeric" test works perfectly for finding values that aren't numeric. However, it fails the test when there is a negative sign being passed in. I assume this is happening because it recognizes that a dash or negative symbol (-) is not a numeric character. My overall goal is to test if the number is both numeric and positive, but given the circumstances above, i'm having trouble achieving a proper test.
Any recommendations on how to get past this?
Thanks
I suggest to use FUNCTION TEST-NUMVAL(some-data) if you need a good check of anything that is defined as alphanumeric/alphabetic; if this passes you can then use FUNCTION NUMVAL(some-data) to get the actual value.
For more details you could check on those in the current draft of the COBOL standard.

Problem with PMML generation of Random Forest in R

I am trying to generate a PMML from a random forest model I obtained using R. I am using the randomForest package 4.6-12 and the last version of PMML for R. But every time I try to generate the PMML obtain an error. Here is the code:
data_train.rf <- randomForest( TARGET ~ ., data = train, ntree=100, na.action=na.omit, importance=TRUE)
pmml_file = pmml(data_train.rf)
[1] "Now converting tree 1 to PMML"
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I haven't been able to find the origin of the problem, any thoughts?
Thanks in advance,
Alvaro
Looks like the variable splitNode has not been initialized inside the "pmml" package. The initialization pathway depends on the data type of the split variable (eg. numeric, logical, factor). Please see the source code of the /R/pmml.randomForest.R file inside "pmml" package.
So, what are the columns in your train data.frame object?
Alternatively, you could try out the r2pmml package as it is much better at handling the randomForest model type.
The pmml code assumes the data type of the variables are numeric, simple logical or factor. It wont work if the data you use are some other type; DateTime for example.
It would help if your problem is reproducible; ideally you would provide the dataset you used. If not, at least a sample of it or a description of it...maybe summarize it.
You should also consider emailing the package maintainers directly.
I may have found the origin for this problem. In my dataset I have approx 500000 events and 30 variables, 10 of these variables are factors, and some of them have weakly populated levels in some cases having as little as 1 event.
I built several Random Forest models, each time including and extra variable to the model. I started adding to the model the numerical variables without a problem to generate a PMML, the same happened for the categorical variables with all levels largely populated, when I tried to include categorical variables with levels weakly populated I got the error:
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I suppose that the origin of the problem is that in some situations when building a tree where the levels is weakly populated then there is no split as there is only one case and although the randomForest package knows how to handle these cases, the pmml package does not.
My tests show that this problem appears when the number of levels of a categorical variable goes beyond the maximum number allowed by the randomForest function. The split defined in the forest sublist is no longer a positive integer which is required by the split definition for categorical objects. Reducing the number of levels fixed the problem.

opennlp does not recognized twiiter input

I have a file that contain twitter post and I am trying to identify the structure of the twitter post per line, like get the noun ,verb and stuff, using opennlp.
it work perfectly until it reach line that contain hashtag and link only
example :
#birthday www.mybirthday/test/mypi.com
and give error com.cybozu.labs.langdetect.LangDetectException: no features in text
when I write a sentence next to the line it just work. any idea how to handle it?? there are more then thousand line that almost like the example.
To use the POS tagger, you need to pass tokens, (in laymen terms say individual words). The link contains multiple words separated by a slash /. The link in itself is not associated with any Part Of Speech. See here the list of tags and how they are assigned to a word. If you want it to identify your link, and give a separate tag to it, say LN either give your own training data, here you will know how to create the training dataor separate the words in the link as separate token (you can separate a link by slash/, question mark?, equal to sign = or ampersand (&)) to get the underlying words and then use the POSTagger to get Part Of Speech (similar case for the hash tag.) For tokenization also, you can use opennlp tokenizer and for your special case, train it. Go through the documentation, it will help you a lot.

Weka java library: how to get string representation of classified instance?

Currently I'm working on a project of classifying search queries into the following eight types: {athlete, actor, artist, politician, geo, facility, QA, definition}. After a bit of work I managed to score 78% correctly classified instances for my set of 300 sample queries using a Multilayer Perceptron classifier when I evaluate the classifier with a stratified 10-fold cross validation, which is reasonably good I think.
Using the weka java library I implemented the whole thing into java code, so I can write a program that dynamically feeds a query to the classifier and retrieves it's query type. I managed to implement the whole classifier training part successfully. The next step would be to use either the classifyInstance() or distributionForInstance() to determine the class to which the query is classified.
classifyInstance() however does only return a double value for which I do not know to get the actual query-type out of it. The weka wikispaces tell me I can use
unlabeled.classAttribute().value((int) clsLabel);
After calling classifyInstance() to get the String representation of the class, this however seems to always return the empty string in my case.
Using distributionForInstance() I'm able to successfully retrieve an array with eight double values between 0 and 1 (which is good, as I classify into eight query types). However, what is the order of this array? Is the first element in the result array the first class that occurs in my training file? Or is there some other predefined element order in this result array (e.g. alphabetically)? The weka documentation does not give any information on this.
I hope someone will be able to help me out!
Internally, Weka handles all values as doubles. When you create the Attribute, you pass it an array of strings that lists the possible nominal values. The double that classification returns is the index of the chosen attribute in the original array. So if you had code that looked like this:
String[] attributeValues = {"a", "b", "c"};
Attribute a = new Attribute("attributeName", attributeValues);
and classifyInstance() returned 2, then the class it chose would be attributeValues[2] or c.
With the distributionForInstance() method, the indexes of the two arrays match, so attributeValues[0] is the string name for the first element of the array returned.
UPDATE (because of downvote)
The above method won't work if you're letting weka create the Instances object itself (e.g. if you're reading from an arff file). That doesn't seem to be the case given your question, but if it is, then please post code so we can see what's going on.

Resources