PMML // tree based embedding by modifying a random forest model file

Following this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7712003/ I have a question on how to apply, e.g. XSLT, to modify a PMML file with a random forest to do the following:
tree based embedding
label each tree leaf with an id
return the leaf id used for prediction
"individualise" trees in the forest: use for example multipleModelMethod="modelChain" to create individual output from each tree
So if I have a random forest with 2 trees, each with 5 terminal leaves, I would like the output to be a one-hot encoding of "t.l", where t = tree id and l = leaf id:
0.0 0.1 0.2 0.3 0.4 1.0 1.1 1.2 1.3 1.4
1 0 0 0 0 0 1 0 0 0
Thanks

how to apply, e.g. XSLT, to modify a PMML file
Consider using a proper PMML library such as JPMML-Model for this. It provides a special-purpose Visitor API for traversing and modifying PMML data structures.
return the leaf id used for prediction
Append the following OutputField element to all TreeModel/Output elements:
<OutputField name="id(node)" feature="entityId" dataType="string" optype="categorical"/>
This uses the "entity identifier" mechanism to extract the identifier of the winning Node element.
There is no need to label nodes manually. If the Node#id attribute is missing, then implicit 1-based indices are returned.
"individualise" trees in the forest
The trees in the random forest model can be identified by the Segment#id attribute.
For example, the following OutputField element will return the id of the winning node of the seventh decision tree:
<OutputField name="id(node, 7)" segmentId="7" feature="entityId" dataType="string" optype="categorical"/>

Related

Ternary Classification of the type 'A', 'B' or 'any'?

For a general machine learning model (though I am currently working with neural networks), for the task of classifying the elements of a set into three groups ('A', 'B', or 'any'), where labeling as 'A' means that the only valid label is 'A' (similarly for 'B'), and 'any' means that both the tags 'A' and 'B' are equally valid, what kind of loss function should be used?
This can be solved using the techniques related to the more general problem of "ternary classification," but I think I'll lose some information by this generalization.
For the sake of example, let's say we want to classify verbs (English language) according to their tense forms (let us only consider the present and past tense)
Then the model should classify
{"work", "eat", "sing", ...} as "present tense"
{"worked", "ate", "sang", ...} as "past tense"
and,
{"read", "put", "cut", ...} as "any"
(note that the pronunciation is different for the present and past tense of 'read', but we are considering text-based classification)
This is different from the task that I am working on but probably should work as a valid example for this particular question.
PS: I am a student, and only have a basic understanding of this field, so if needed, please ask for any clarification regarding the question.
I think that you are in the situation of multi-label classification and not multi-class classification.
As stated here:
In machine learning, multi-label classification and the strongly
related problem of multi-output classification are variants of the
classification problem where multiple labels may be assigned to each
instance
Which means that instances can have more than one class associated with them.
Usually, when you work with binary classification (e.g. classes 0 and 1), the final layer of your network can be a single neuron that outputs continuous values between 0 and 1, using the sigmoid activation function and binary cross-entropy as the loss.
Given your situation you could decide to use:
two neurons as output of your neural network
for each one you can use the sigmoid activation function
and as loss the binary cross-entropy
In this way, the model can associate each instance with both classes, each with its own probability.
This means that for each instance you should provide two classes, or rather "labels".
For example, for your verbs you should have "past", "present" classes:
            present   past
work:          1        0
worked:        0        1
read:          1        1
And your model will try to output two probabilities, with the architecture explained before:
            present   past   sum
work:         0.9      0.3    1.2
worked:       0.21     0.8    1.01
read:         0.86     0.7    1.56
Basically, you get two independent probabilities (if you check, the rows do not sum to 1), and therefore the model can associate both classes with one instance.
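A minimal sketch of this multi-label setup (assuming Keras; the hidden-layer size and input dimension are placeholders, not from the question):
# Two independent sigmoid outputs trained with binary cross-entropy.
from tensorflow import keras

n_features = 10  # placeholder input dimension

model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(2, activation="sigmoid"),  # one neuron per label: present, past
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# targets have shape (n_samples, 2), e.g. "read" -> [1, 1]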
Instead, if you wanted a mutually exclusive classification with more than 2 classes, you would use categorical cross-entropy as the loss and the softmax activation function in your last layer, which turns the outputs into a vector of probabilities that sums to 1. Example:
            present   past   both   sum
work:         0.7      0.2    0.1    1
worked:       0.21     0.7    0.09   1
read:         0.33     0.33   0.33   1
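And a corresponding sketch of this mutually exclusive setup (same placeholder assumptions as above):
# Three mutually exclusive classes via softmax + categorical cross-entropy.
from tensorflow import keras

n_features = 10  # placeholder input dimension

model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),  # present, past, both
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
# targets are one-hot over the 3 classes, e.g. "read" -> [0, 0, 1]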
Check here to see an extensive example

Store textual dataset for binary classification

I am currently working on a machine learning project, and am in the process of building the dataset. The dataset will be comprised of a number of different textual features, of varying length from 1 sentence to around 50 sentences (including punctuation). What is the best way to store this data to then pre-process and use for machine learning using Python?
In most cases, you can use a method called Bag of Words; however, in some cases, when you are performing a more complicated task like similarity extraction or want to make comparisons between sentences, you should use Word2Vec.
Bag of Words
You may use the classical Bag-of-Words representation, in which you encode each sample into a long vector indicating the counts of all the words from all samples. For example, if you have two samples:
"I like apple, and she likes apple and banana.",
"I love dogs but Sara prefer cats.".
Then all the possible words are (order doesn't matter here):
I she Sara like likes love prefer and but apple banana dogs cats , .
Then the two samples will be encoded to
First: 1 1 0 1 1 0 0 2 0 2 1 0 0 1 1
Second: 1 0 1 0 0 1 1 0 1 0 0 1 1 0 1
If you are using sklearn, the task would be as simple as:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
# Now you can feed X into any other machine learning algorithms.
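If you want to inspect the result, a quick illustrative follow-up (not part of the original snippet):
print(vectorizer.vocabulary_)  # maps each word to its column index
print(X.toarray())             # dense count matrix, one row per document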
Word2Vec
Word2Vec is a more complicated method, which attempts to find relationships between words by training an embedding neural network underneath. An embedding, in plain English, can be thought of as the mathematical representation of a word in the context of all the samples provided. The core idea is that words are similar if their contexts are similar.
The result of Word2Vec is a vector representation (embedding) for every word that appears in the samples. The amazing thing is that we can perform arithmetic operations on these vectors. A cool example is: Queen - Woman + Man = King reference here
To use Word2Vec, we can use a package called gensim; here is a basic setup:
from gensim.models import Word2Vec
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
model.most_similar(positive=['woman', 'king'], negative=['man'])
>>> [('queen', 0.50882536), ...]
Here sentences is your data and size is the dimension of the embeddings; the larger size is, the more space is used to represent a word, and the more we have to worry about overfitting. window is the size of the context we care about: the number of words around the target word that are looked at when predicting the target from its context during training.
One common way is to create your dictionary (all the possible words) and then encode each of your examples with respect to this dictionary. For example (this is a very small and limited dictionary, just for illustration), you could have the dictionary: hello, world, from, python. Every word is associated with a position, and for each of your examples you define a vector with 0 for absence and 1 for presence; for example, "hello python" would be encoded as: 1, 0, 0, 1.
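A minimal sketch of this presence/absence encoding (my own illustration, reusing the toy dictionary above and assuming scikit-learn's CountVectorizer with binary=True):
# Presence/absence encoding with a fixed toy vocabulary (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(vocabulary=["hello", "world", "from", "python"], binary=True)
X = vectorizer.fit_transform(["hello python"])
print(X.toarray())  # [[1 0 0 1]]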

Delta component doesn't show in weight learning rule of sigmoid activation MLP

As a basic proof of concept, consider a network that classifies K classes with input x, bias b, output y, S samples, weights v, and teacher signal t, in which t(k) equals 1 if the matching sample belongs to class k.
Variables
Let x_(is) represent the i-th input feature in the s-th sample.
v_(ks) represents the vector that holds the weights of the connections to the k-th output from all inputs within the s-th sample.
t_(s) represents the teacher signal for the s-th sample.
If we extend the above variables to consider multiple samples, the changes below have to be applied while declaring the variable z_(k), the activation function f(.), and using the cross-entropy as a cost function:
Derivation
Typically in a learning rule the delta term ( t_(k) - y_(k) ) is always included. Why doesn't the delta show up in this equation? Have I missed something, or is it not a must for the delta rule to show up?
I managed to find the solution. It becomes clear when we consider the Kronecker delta, where δ_(ck) = 1 if the class matches the classifier and δ_(ck) = 0 otherwise, which means the derivation takes this shape:
Derivation
which leads to the delta rule.
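For completeness, a sketch of the derivation as I reconstruct it (the original was posted as images), assuming softmax outputs z_k and the cross-entropy cost mentioned above:
$$
y_k = \frac{e^{z_k}}{\sum_c e^{z_c}}, \qquad
E = -\sum_c t_c \ln y_c, \qquad
\frac{\partial y_c}{\partial z_k} = y_c\,(\delta_{ck} - y_k)
$$
$$
\frac{\partial E}{\partial z_k}
= -\sum_c \frac{t_c}{y_c}\, y_c\,(\delta_{ck} - y_k)
= -t_k + y_k \sum_c t_c
= y_k - t_k
$$
so that with z_k = v_k · x the weight gradient is (y_k - t_k) x_i, i.e. the delta term reappears (up to sign) once the Kronecker delta is applied.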

Decision tree completeness and unclassified data

I made a program that trains a decision tree built on the ID3 algorithm using an information gain function (Shanon entropy) for feature selection (split).
Once I trained a decision tree I tested it to classify unseen data and I realized that some data instances cannot be classified: there is no path on the tree that classifies the instance.
An example (this is an illustration example but I encounter the same problem with a larger and more complex data set):
With f1 and f2 being the predictor variables (features) and y the categorical target variable, the value ranges are:
f1: [a1; a2; a3]
f2: [b1; b2; b3]
y : [y1; y2; y3]
Training data:
("a1", "b1", "y1");
("a1", "b2", "y2");
("a2", "b3", "y3");
("a3", "b3", "y1");
Trained tree:
          [f2]
        /  |  \
      b1   b2   b3
      /    |     \
    y1    y2     [f1]
                 /  \
               a2    a3
               /      \
             y3        y1
The instance ("a1", "b3") cannot be classified with the given tree.
Several questions came up:
Does this situation have a name? Tree incompleteness or something like that?
Is there a way to know whether a decision tree will cover all combinations of unknown instances (all combinations of feature values)?
Does the reason for this "incompleteness" lie in the topology of the data set, in the algorithm used to train the decision tree (ID3 in this case), or somewhere else?
Is there a method to classify these unclassifiable instances with the given decision tree, or must one use another tool (random forests, neural networks, ...)?
This situation cannot occur with the ID3 decision-tree learner---regardless of whether it uses information gain or some other heuristic for split selection. (See, for example, ID3 algorithm on Wikipedia.)
The "trained tree" in your example above could not have been returned by the ID3 decision-tree learning algorithm.
This is because when the algorithm selects a d-valued attribute (i.e. an attribute with d possible values) on which to split the given leaf, it will create d new children (one per attribute value). In particular, in your example above, the node [f1] would have three children, corresponding to attribute values a1,a2, and a3.
It follows from the previous paragraph (and, in general, from the way the ID3 algorithm works) that any well-formed vector---of the form (v1, v2, ..., vn, y), where vi is a value of i-th attribute and y is the class value---should be classifiable by the decision tree that the algorithm learns on a given train set.
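To make that concrete, a minimal sketch (my own, not the asker's program) of an ID3-style split that creates one child per declared attribute value:
# One child per *declared* attribute value, so every well-formed instance
# always has a branch to follow (illustrative, not the asker's program).
from collections import Counter

def split_on(examples, attr, declared_values):
    # An empty child becomes a leaf predicting the parent's majority class
    # instead of being dropped.
    majority = Counter(y for _, y in examples).most_common(1)[0][0]
    children = {}
    for value in declared_values:
        subset = [(x, y) for x, y in examples if x[attr] == value]
        children[value] = subset if subset else majority  # leaf label if empty
    return children

# Toy data from the question: (features, class)
train = [({"f1": "a1", "f2": "b1"}, "y1"),
         ({"f1": "a1", "f2": "b2"}, "y2"),
         ({"f1": "a2", "f2": "b3"}, "y3"),
         ({"f1": "a3", "f2": "b3"}, "y1")]

# Splitting the b3 branch on f1 still yields an "a1" child (a majority-class
# leaf), so the instance ("a1", "b3") from the question remains classifiable.
b3_branch = [(x, y) for x, y in train if x["f2"] == "b3"]
print(split_on(b3_branch, "f1", ["a1", "a2", "a3"]))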
Would you mind providing a link to the software you used to learn the "incomplete" trees?
To answer your questions:
Not that I know of. It doesn't make sense to learn such "incomplete trees." If we knew that some attribute values will never occur then we would not include them in the specification (the file where you list attributes and their values) in the first place.
With the ID3 algorithm, you can prove---as I sketched in the answer---that every tree returned by the algorithm will cover all possible combinations.
You're using the wrong algorithm. Data has nothing to do with it.
There is no such thing as an unclassifiable instance in decision-tree learning. One usually defines a decision-tree learning problem as follows. Given a train set S of examples x1,x2,...,xn of the form xi=(v1i,v2i,...,vni,yi) where vji is the value of the j-th attribute and yi is the class value in example xi, learn a function (represented by a decision tree) f: X -> Y, where X is the space of all possible well-formed vectors (i.e. all possible combinations of attribute values) and Y is the space of all possible class values, which minimizes an error function (e.g. the number of misclassified examples). From this definition, you can see that one requires that the function f is able to map any combination to a class value; thus, by definition, each possible instance is classifiable.

Modeling features of Relation Extraction in the SVMlight input format

I am currently working on a project that focuses on relation extraction from a corpus of Wikipedia text, and I plan to use an SVM to extract these relations. To model this, I plan to use Word features, POS Tag features, Entity features, Mention features and so on as mentioned in the following paper - https://gate.ac.uk/sale/eswc06/eswc06-relation.pdf (Page 6 onwards)
Now, I have set up the pipeline for feature extraction and got the corpus annotated and I wish to use a package like SVM-Light for the purpose of the project. According to the input file format of the SVM-Light package, this is the requisite format -
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
Example (from the SVM-Light webpage) -
In classification mode, the target value denotes the class of the example. +1 as the target value marks a positive example, -1 a negative example respectively. So, for example, the line
-1 1:0.43 3:0.12 9284:0.2 # abcdef
specifies a negative example for which feature number 1 has the value 0.43, feature number 3 has the value 0.12, feature number 9284 has the value 0.2, and all the other features have value 0. In addition, the string abcdef is stored with the vector, which can serve as a way of providing additional information for user defined kernels.
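To make the format concrete, here is a small illustrative sketch (mine, not from the SVM-Light page) that builds that example line programmatically:
# Writing one training example in the SVM-Light format described above.
# Feature numbers and values are taken from the example line for illustration.
target = -1
features = {1: 0.43, 3: 0.12, 9284: 0.2}
info = "abcdef"
line = f"{target} " + " ".join(f"{i}:{v}" for i, v in sorted(features.items())) + f" # {info}"
print(line)  # -1 1:0.43 3:0.12 9284:0.2 # abcdef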
Now, I wish to know how to model the features that I am using, whose values include words, POS tags, and entity types and subtypes, as the feature vectors accepted by the SVM-Light package, where each feature has a real-number value associated with it. How is the mapping from my choice of features to these real values done?
It would be of great help if someone who has worked on a similar problem before could prod me in the right direction.
Thanks.
