What should we do when a tree contains only a label in a Random Forest? - random-forest

When using random forest to create trees, I found that one tree contains only a label and has no other nodes; it looks like this: Tree: {yes}. This leads to an error when I use my forest to predict.
I would like to know what the common solution for this problem is.
Is it to add a check so that, when a tree contains only a label, a new tree is created?
Or do I need to modify the createTree function?
Here is my full code on GitHub:
https://github.com/suchunxie/Random_forest_sample.git
Thank you if you can give some suggestions.
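A common approach, as far as I know, is to keep the single-label tree (a bootstrap sample containing only one class legitimately produces a leaf-only tree) and make the prediction step handle it, rather than regrowing a new tree. Below is a minimal sketch, assuming the trees are stored as nested dicts of the form {feature: {value: subtree}} and that a label-only tree is just the bare label; the function name and structure are hypothetical, since they depend on the repository code:
def predict_tree(tree, sample):
    # A tree that is only a label (e.g. "yes") is a valid single-leaf tree:
    # the whole bootstrap sample had one class, so return that label directly.
    if not isinstance(tree, dict):
        return tree
    feature, branches = next(iter(tree.items()))
    subtree = branches.get(sample[feature])
    if subtree is None:
        # Unseen feature value: fall back to an arbitrary branch (or a majority label).
        subtree = next(iter(branches.values()))
    return predict_tree(subtree, sample)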

Related

How to predict a textual field on the basis of input features

I'm stuck with a problem statement of predicting an identifier for a product on the basis of a couple of product features. A sample of the data available to me looks like the one shown below:
ABC10L 20.0 34 XYZ G345F FG MKD -> 000000DEF_VYA
Here, ABC10L, 20.0, 34, XYZ, G345F, FG, MKD are the features and 000000DEF_VYA is the unique identifier associated with the product. Initially I tried to formulate this as a regression problem, but I'm not sure how to generate textual output from my model or what my cost function should be. I'm also not sure whether regression is the right tool for this issue.
Please suggest the right approach and how I might proceed to solve this!
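One common way to frame this is as multi-class classification, with the identifier string as the class label rather than a numeric regression target. A hedged sketch with scikit-learn; the column names and the tiny inline dataset are made up for illustration:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the real data: categorical and numeric features plus the identifier.
df = pd.DataFrame({
    "f1": ["ABC10L", "ABC10L", "QRS20M"],
    "f2": [20.0, 21.5, 3.4],
    "f3": [34, 34, 12],
    "f4": ["XYZ", "XYZ", "TUV"],
    "identifier": ["000000DEF_VYA", "000000DEF_VYA", "000000GHI_ABC"],
})
X, y = df.drop(columns="identifier"), df["identifier"]

# One-hot encode the categorical columns, pass the numeric ones through unchanged.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["f1", "f4"])],
    remainder="passthrough",
)
model = Pipeline([("pre", pre), ("clf", RandomForestClassifier(n_estimators=100))])
model.fit(X, y)
print(model.predict(X.iloc[[0]]))  # predicts an identifier string, not a number
The usual classification criteria (e.g. Gini impurity inside the trees) then replace the regression cost function, and the model outputs one of the identifiers seen during training.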

Entities in my gazette are not recognized

I would like to create a custom NER model. This is what I did:
TRAINING DATA (stanford-ner.tsv):
Hello O
! O
My O
name O
is O
Damiano PERSON
. O
PROPERTIES (stanford-ner.prop):
trainFile = stanford-ner.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
maxLeft=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useGazettes=true
gazette=gazzetta.txt
cleanGazette=true
GAZETTE (gazzetta.txt):
PERSON John
PERSON Andrea
I build the model via command line with:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop stanford-ner.prop
And test with:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile test.txt
I did two tests with the following texts:
>>> TEST 1 <<<
TEXT:
Hello! My name is Damiano and this is a fake text to test.
OUTPUT
Hello/O !/O
My/O name/O is/O Damiano/PERSON and/O this/O is/O a/O fake/O text/O to/O test/O ./O
>>> TEST 2 <<<
TEXT:
Hello! My name is John and this is a fake text to test.
OUTPUT
Hello/O !/O
My/O name/O is/O John/O and/O this/O is/O a/O fake/O text/O to/O test/O ./O
As you can see, only the "Damiano" entity is found. This entity is in my training data, but "John" (second test) is in the gazette. So the question is:
Why is the "John" entity not recognized?
Thank you so much in advance.
As the Stanford FAQ says,
If a gazette is used, this does not guarantee that words in the
gazette are always used as a member of the intended class, and it does
not guarantee that words outside the gazette will not be chosen. It
simply provides another feature for the CRF to train against. If the
CRF has higher weights for other features, the gazette features may be
overwhelmed.
If you want something that will recognize text as a member of a class
if and only if it is in a list of words, you might prefer either the
regexner or the tokensregex tools included in Stanford CoreNLP. The
CRF NER is not guaranteed to accept all words in the gazette as part
of the expected class, and it may also accept words outside the
gazette as part of the class.
Btw, it is not good practice to test machine learning pipelines in a 'unit-test' way, i.e. with only one or two examples, because they are supposed to work on much larger volumes of data and, more importantly, they are probabilistic by nature.
If you want to check whether your gazette file is actually used, it may be better to take existing examples (see the bottom of the page linked above for the austen.gaz.prop and austen.gaz.txt examples) and replace several names with your own, then check. If that fails, first try to change your test, e.g. add more names, reformulate the text, and so on.
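If strict dictionary-style matching is what you actually need, here is a hedged sketch of the regexner route mentioned in the FAQ; the file name person_rules.txt is made up, and this assumes the full Stanford CoreNLP distribution rather than the standalone stanford-ner.jar:
MAPPING (person_rules.txt), one tab-separated "tokens<TAB>class" entry per line:
John	PERSON
Andrea	PERSON
Run the pipeline from the CoreNLP directory with:
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,regexner -regexner.mapping person_rules.txt -file test.txt
Unlike the CRF gazette feature, this tags every occurrence of the listed tokens with the given class.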
The gazette only helps by extracting extra features from the training data; if these words never occur in your training data, or have no connection to labeled tokens, your model will not benefit from them. One experiment I would suggest is to add Damiano to your gazette.
Why is the "John" entity not recognized?
It looks to me that your minimal example should probably also add "Damiano" to the gazetteer under the PERSON category. Currently, the training data allows the model to learn that "Damiano" has the PERSON label, but I think this is not related to the gazetteer categories (i.e. having PERSON on both sides is not sufficient).

Parse/ tree diagram for phrasal constituents of a specific sentence

I'm trying to do some language analysis on the opening paragraph of The Kite Runner by Khaled Hosseini, specifically looking at phrasal constituents. The first sentence is as follows:
"I became what I am today at the age of twelve, on a frigid overcast day in the winter of 1975."
I've got a pretty good idea of what the phrasal constituents are, but I'm a bit unsure how to draw the tree, as it seems like it should be split into two distinct branches, splitting at the comma after "twelve". I've uploaded an image of my tree so far, but I'm not sure whether it's correct. Any help would be greatly appreciated.
Thanks in advance :)
There is a library called constituent-treelib that can be used to construct, process and visualize constituent trees. First, we must install the library:
pip install constituent-treelib
Then, we can use it as follows to parse the sentence into a constituent tree, visualize it, and finally export the result to a PDF file:
from constituent_treelib import ConstituentTree
# Define a sentence
sentence = "You must construct additional pylons!"
# Define the language that should be considered
language = ConstituentTree.Language.English
spacy_model_size = ConstituentTree.SpacyModelSize.Medium
# Construct the necessary NLP pipeline by downloading and installing the required models (benepar and spaCy)
nlp = ConstituentTree.create_pipeline(language, spacy_model_size, download_models=True)
# Instantiate a ConstituentTree object and pass it both the sentence and the NLP pipeline
tree = ConstituentTree(sentence, nlp)
# Export the visualization of the tree into a PDF file
tree.export_tree('my_constituent_tree.pdf')
The result...
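Applied to the sentence from the question, the same pipeline object can be reused; only the input string (and the arbitrary output file name) changes:
sentence = ("I became what I am today at the age of twelve, "
            "on a frigid overcast day in the winter of 1975.")
tree = ConstituentTree(sentence, nlp)
tree.export_tree('kite_runner_sentence.pdf')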

What is the difference between a Parse Tree, an Annotated Parse Tree and an Activation Tree? (compiler)

I know what a Parse Tree and an Abstract Syntax Tree are, but after reading a bit about Annotated Parse Trees (we draw a detailed tree that looks the same as a Parse Tree), I feel that they are the same as Parse Trees.
Can anyone please explain the differences among these three in detail?
Thanks.
An annotated parse tree is a parse tree showing the values of the attributes at each node. The process of computing the attribute values at the nodes is called annotating or decorating the parse tree.
For example, the link below shows an annotated parse tree for 3*5+4n:
https://i.stack.imgur.com/WAwdZ.png
A parse tree is a representation of how a source text (of a program) has been decomposed to demonstrate that it matches a grammar for a language. Interior nodes in the tree are language grammar nonterminals (BNF rule left-hand-side tokens), while leaves of the tree are grammar terminals (all the other tokens) in the order required by the grammar rules.
An annotated parse tree is one in which various facts about the program have been attached to parse tree nodes. For example, one might compute the set of identifiers that each subtree mentions and attach that set to the subtree. Compilers have to store the information they collect about the program somewhere; this is a convenient place to store information which is derivable from the tree.
An activation tree is a conceptual snapshot of the result of a set of procedures calling one another at runtime. Nodes in such a tree represent procedures that have run; children represent procedures called by their parent.
So a key difference between (annotated) parse trees and activation trees is what they are used to represent: compile time properties vs. runtime properties.
An annotated parse tree lets you integrate the entire compilation into the parse tree structure. CM Modula-3 does that, if I'm not mistaken.
To build an APT, simply declare an abstract base class of nodes, subclass it for each production, and declare the child nodes as field variables.
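To make that concrete, here is a minimal, hypothetical Python sketch (class and method names are made up) that builds the parse tree for 3*5+4 and annotates each node with a synthesized value attribute, bottom-up:
class Node:
    """Abstract base class for parse tree nodes; 'value' is the annotated attribute."""
    def annotate(self):
        raise NotImplementedError

class Num(Node):
    def __init__(self, value):
        self.value = value                    # at a leaf the attribute is known directly
    def annotate(self):
        return self.value

class Mul(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right   # children stored as field variables
    def annotate(self):
        # Synthesized attribute: computed from the children's attributes.
        self.value = self.left.annotate() * self.right.annotate()
        return self.value

class Add(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def annotate(self):
        self.value = self.left.annotate() + self.right.annotate()
        return self.value

# Annotated parse tree for 3*5+4: the Mul node is decorated with 15, the Add root with 19.
root = Add(Mul(Num(3), Num(5)), Num(4))
print(root.annotate())  # 19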

Problem with PMML generation of Random Forest in R

I am trying to generate a PMML from a random forest model I obtained using R. I am using the randomForest package 4.6-12 and the latest version of the pmml package for R, but every time I try to generate the PMML I obtain an error. Here is the code:
data_train.rf <- randomForest( TARGET ~ ., data = train, ntree=100, na.action=na.omit, importance=TRUE)
pmml_file = pmml(data_train.rf)
[1] "Now converting tree 1 to PMML"
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I haven't been able to find the origin of the problem, any thoughts?
Thanks in advance,
Alvaro
Looks like the variable splitNode has not been initialized inside the "pmml" package. The initialization pathway depends on the data type of the split variable (e.g. numeric, logical, factor). Please see the source code of the /R/pmml.randomForest.R file inside the "pmml" package.
So, what are the columns in your train data.frame object?
Alternatively, you could try out the r2pmml package as it is much better at handling the randomForest model type.
The pmml code assumes the data types of the variables are numeric, logical, or factor. It won't work if the data you use are some other type, DateTime for example.
It would help if your problem were reproducible; ideally you would provide the dataset you used. If not, at least provide a sample of it or a description of it... maybe summarize it.
You should also consider emailing the package maintainers directly.
I may have found the origin of this problem. In my dataset I have approx. 500,000 events and 30 variables; 10 of these variables are factors, and some of them have weakly populated levels, in some cases with as little as 1 event.
I built several Random Forest models, each time including an extra variable in the model. I started by adding the numerical variables and could generate a PMML without a problem; the same happened for categorical variables whose levels were all well populated. When I tried to include categorical variables with weakly populated levels, I got the error:
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I suppose that the origin of the problem is that, when building a tree, a weakly populated level may produce no split because there is only one case; the randomForest package knows how to handle these cases, but the pmml package does not.
My tests show that this problem appears when the number of levels of a categorical variable goes beyond the maximum number allowed by the randomForest function. The split defined in the forest sublist is then no longer a positive integer, which is required by the split definition for categorical variables. Reducing the number of levels fixed the problem.
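A hedged R sketch of that workaround, reusing the question's train and TARGET names; the threshold of 10 and the column name cat_var are made up for illustration:
library(randomForest)
library(pmml)

# Inspect how many levels each factor has and how well populated they are
sapply(Filter(is.factor, train), nlevels)
table(train$cat_var)

# Collapse rarely populated levels into a single "other" level and drop unused ones
rare <- names(which(table(train$cat_var) < 10))
levels(train$cat_var)[levels(train$cat_var) %in% rare] <- "other"
train$cat_var <- droplevels(train$cat_var)

# Refit and retry the PMML conversion
data_train.rf <- randomForest(TARGET ~ ., data = train, ntree = 100,
                              na.action = na.omit, importance = TRUE)
pmml_file <- pmml(data_train.rf)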
