There is a well-known problem in Tom Mitchell's Machine Learning book: build a decision tree based on the following data, where PlayBall is the target variable.
The resulting tree is the following:
I wonder whether it's possible to build this tree with scikit-learn. I found several examples where a decision tree can be depicted as
from sklearn.tree import export_graphviz
from graphviz import Source

export_graphviz(clf)
Source(export_graphviz(clf, out_file=None))
However, it looks like scikit-learn doesn't work well with categorical data; the data has to be binarized into several columns. As a result, it is impossible to build the tree exactly as in the picture. Is that correct?
Yes, it is correct that it is impossible to build such a tree with scikit-learn.
The primary reason is that this is a ternary tree (nodes with up to three children) but scikit-learn implements only binary trees - nodes have exactly two or no children:
cdef class Tree:
    """Array-based representation of a binary decision tree.
    ...
However, it is possible to get an equivalent binary tree of the form:

Outlook == Sunny
    true  => Humidity == High
        true  => no
        false => yes
    false => Outlook == Overcast
        true  => yes
        false => Wind == Strong
            true  => no
            false => yes
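To illustrate, here is a minimal sketch of that approach (the table below is the standard play-tennis data from the book, and pd.get_dummies is just one convenient way to binarize the categorical columns):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Play-tennis data from Mitchell's book.
df = pd.DataFrame({
    'Outlook':     ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
                    'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
                    'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity':    ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
                    'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind':        ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
                    'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'PlayBall':    ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                    'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
})

# Binarize the categorical features into indicator columns.
X = pd.get_dummies(df.drop(columns='PlayBall'))
y = df['PlayBall']

clf = DecisionTreeClassifier(criterion='entropy').fit(X, y)
print(export_graphviz(clf, out_file=None, feature_names=list(X.columns)))

The resulting splits are of the form "Outlook_Sunny <= 0.5", i.e. binary tests on the indicator columns rather than the three-way splits shown in the book's picture.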
I am building a tool from scratch that takes a sample of text and turns it into a list of categories. I am not using any libraries for this at the moment, but I'm interested if anyone has experience in this territory, as the hardest part I'm struggling with is building sentiment into the search. It's easy to match words, but sentiment is much more challenging.
The goal would be to take something like this paragraph;
"Whenever I am out walking with my son, I like to take portrait photographs of him to see how he changes over time. My favourite is a pic of him when we were on holiday in Spain and when his face was covered in chocolate from a cake we had baked"
and turn it into
categories = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']
If possible I'd like to end up adding a filter for negative sentiment so that if the text said;
"I hate cooking"
'cooking' wouldn't be included in the categories.
Any help is greatly appreciated. TIA 👍
You seem to have at least two tasks: 1. Sequence classification by topics; 2. Sentiment analysis. [Edit: I only noticed now that you are using Ruby/Rails, but the code below is in Python. Maybe this answer is still useful for some people, and the steps can be applied in any language.]
1. For sequence classification by topics, you can either define categories simply with a list of words, as you said, or use a pre-trained zero-shot classifier. Depending on the use case, the word list might be the easiest option; if it would be too time-intensive to create, I would recommend the zero-shot classifier from HuggingFace, see details with code here.
Applied to your use-case, this would look like this:
# pip install transformers # pip install in terminal
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
sequence = ["Whenever I am out walking with my son, I like to take portrait photographs of him to see how he changes over time. My favourite is a pic of him when we were on holiday in Spain and when his face was covered in chocolate from a cake we had baked"]
candidate_labels = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']
classifier(sequence, candidate_labels, multi_class=True)
# output:
{'labels': ['photography', 'spain', 'chocolate', 'travel', 'father', 'cooking'],
'scores': [0.9802802205085754, 0.7929317951202393, 0.7469273805618286, 0.6030028462409973, 0.08006269484758377, 0.005216470453888178]}
The classifier returns scores depending on how certain it is that each candidate_label is represented in your sequence. It doesn't catch everything, but it works quite well and is fast to put into practice.
2. For sentiment analysis you can use HuggingFace's sentiment classification pipeline. In your use-case, this would look like this:
classifier = pipeline("sentiment-analysis")
sequence = ["I hate cooking"]
classifier(sequence)
# Output
[{'label': 'NEGATIVE', 'score': 0.9984041452407837}]
Putting 1. and 2. together:
I would probably (a) first take your entire text and split it into sentences (see here how to do that); then (b) run the sentiment classifier on each sentence and discard those that have a high negative sentiment score (see step 2. above); and then (c) run your labeling/topic classification on the remaining sentences (see 1. above).
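A rough sketch of how (a)-(c) could be wired together (NLTK's sent_tokenize is just one way to split sentences; the 0.9 negativity cutoff and the 0.5 label threshold are arbitrary values you would tune, and newer transformers versions call the multi_class argument multi_label):

# pip install nltk transformers
from nltk.tokenize import sent_tokenize  # may require nltk.download('punkt') first
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
zero_shot = pipeline("zero-shot-classification")

text = ("Whenever I am out walking with my son, I like to take portrait photographs "
        "of him to see how he changes over time. I hate cooking.")
candidate_labels = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']

# (a) split the text into sentences
sentences = sent_tokenize(text)

# (b) drop sentences with a strongly negative sentiment
kept = []
for sentence in sentences:
    result = sentiment(sentence)[0]
    if not (result['label'] == 'NEGATIVE' and result['score'] > 0.9):
        kept.append(sentence)

# (c) run the zero-shot topic classification on what remains
output = zero_shot(" ".join(kept), candidate_labels, multi_class=True)
categories = [label for label, score in zip(output['labels'], output['scores']) if score > 0.5]
print(categories)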
My Weka OneR models are all returning what seems like an overfit rule set, concluding with a question mark that leads to a fixed result, like so:
FollowersMeanCoords_Col:
< 0.33340000000000003 -> False
>= 0.33340000000000003 -> True
? -> False
(114357/163347 instances correct)
Is this OneR simply saying "I can't find anything, so we assume the rest is false"? But then, why is there a clear cut in the data (everything below 0.33 is False, everything above or equal is True)? And is there a way to prevent this?
Thanks in advance!
The ? refers to missing values - your training data must have some values of FollowersMeanCoords_Col missing for some instances.
The model in your question says that if FollowersMeanCoords_Col for an instance (data point) is less than 0.3334..., or is missing, it will classify the instance as False, otherwise it will classify it as True.
OneR is a very simple classification algorithm which works by finding the one attribute from the training data that gives the least error when used to make a classification rule. For OneR to overfit there would need to be an attribute that happened to classify the training data well, but didn't generalise to future test data. It's more likely that OneR will give you models that are robust but inaccurate.
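To make that concrete, here is a rough sketch of the OneR idea in plain Python (not Weka's implementation - Weka additionally discretizes numeric attributes, which is where the < / >= cut on FollowersMeanCoords_Col comes from): for each attribute, predict the majority class per attribute value, keep a separate rule for missing values (the ? above), and select the attribute whose rule makes the fewest training errors.

from collections import Counter, defaultdict

def one_r(rows, target):
    """rows: list of dicts mapping attribute name -> value; target: class attribute name.
    Returns (best_attribute, rule) where rule maps attribute value -> predicted class.
    Missing values (None) get their own bucket, like Weka's '?'."""
    best_attr, best_rule, best_correct = None, None, -1
    for attr in (a for a in rows[0] if a != target):
        buckets = defaultdict(Counter)
        for row in rows:
            buckets[row.get(attr)][row[target]] += 1  # None stands for a missing value
        rule = {value: counts.most_common(1)[0][0] for value, counts in buckets.items()}
        correct = sum(counts[rule[value]] for value, counts in buckets.items())
        if correct > best_correct:
            best_attr, best_rule, best_correct = attr, rule, correct
    return best_attr, best_rule

rows = [{'Outlook': 'Sunny', 'Windy': True,  'Play': 'No'},
        {'Outlook': 'Rain',  'Windy': None,  'Play': 'Yes'},
        {'Outlook': 'Sunny', 'Windy': False, 'Play': 'No'},
        {'Outlook': 'Rain',  'Windy': True,  'Play': 'Yes'}]
print(one_r(rows, 'Play'))  # ('Outlook', {'Sunny': 'No', 'Rain': 'Yes'})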
This code works OK, but not great, because it fails to account for some non-linear (e.g. quadratic) behavior at the extremes of the function:
LinearRegression finalClassifier = new LinearRegression();
finalClassifier.buildClassifier(adjInstances);
double predictedValue = finalClassifier.classifyInstance(finalInstance);
and this code produces completely bogus results
MultilayerPerceptron finalClassifier = new MultilayerPerceptron();
finalClassifier.buildClassifier(adjInstances);
double predictedValue = finalClassifier.classifyInstance(finalInstance);
I believe a MultilayerPerceptron should always outperform LinearRegression. There are just certain function shapes a LinearRegression cannot handle (e.g. f(x) = x ^ 2) which are a piece of cake for a MultilayerPerceptron neural network.
So I'm probably using the API incorrectly or there are some undocumented requirements on the acceptable inputs for a MultilayerPerceptron. What could it be?
My data instances consist of a combination of 20 numeric and nominal attributes, for example:
A01 750
A02 1
A03 1
A04 true
A05 false
A06 false
A07 false
A08 false
A09 true
A10 false
A11 true
A12 false
A13 false
A14 false
A15 true
A16 false
A17 false
A18 false
A19 Yes
A20 34
The only part of your question that can be answered is
I believe a MultilayerPerceptron should always outperform LinearRegression. There are just certain function shapes a LinearRegression cannot handle (e.g. f(x) = x ^ 2) which are a piece of cake for a MultilayerPerceptron neural network.
This is simply false. Why can LR be better?
Your data can be well represented by a linear model, in which case an MLP will likely overfit while LR will work just fine. This is a very common misconception - more complex models are not "better"; they are simply required if your relationship is complex, and for simple problems complex models will fail.
You did not fit your model well. LR is trivial to fit; actually, without regularization (Ridge regression) it is one of the simplest possible models to fit - you have a closed-form solution (OLS) and no hyperparameters. However, for even the simplest MLP there is no training method that guarantees an optimal solution, and you have to fit multiple hyperparameters (number of hidden units, activation function, learning rate, momentum rate, ...). In real life you nearly never train a neural network well; this is actually the greatest problem with NNs - they are extremely hard to train, and so they should never be used by inexperienced machine learners. There are numerous other regressors which can be used by someone without a deep understanding of the field, such as SVR or Ridge Regression (and its kernelized version).
If the code provided is your actual code, then the most likely reason for your result is the second point above - you cannot simply say "build me a neural network!" and expect it to work well, it does not work this way :)
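To illustrate the first point: on data that really is linear, plain linear regression fits almost perfectly, while an untuned neural network typically does worse. This is not Weka, just a small scikit-learn sketch with synthetic data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.1, size=200)  # truly linear relationship

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_tr, y_tr)
mlp = MLPRegressor(hidden_layer_sizes=(100,), max_iter=200, random_state=0).fit(X_tr, y_tr)

print("LR  R^2:", lr.score(X_te, y_te))   # close to 1.0
print("MLP R^2:", mlp.score(X_te, y_te))  # typically worse without scaling and tuning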
I was using Vowpal Wabbit and saving the trained classifier as a readable model.
My dataset had 22 features, and the readable model gave the following output:
Version 7.2.1
Min label:-50.000000
Max label:50.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:0
101143:0.035237
101144:0.033885
101145:0.013357
101146:-0.007537
101147:-0.039093
101148:-0.013357
101149:0.001748
116060:0.499471
157941:-0.037318
157942:0.008038
157943:-0.011337
196772:0.138384
196773:0.109454
196774:0.118985
196775:-0.022981
196776:-0.301487
196777:-0.118985
197006:-0.000514
197007:-0.000373
197008:-0.000288
197009:-0.004444
197010:-0.006072
197011:0.000270
Can somebody please explain to me how to interpret the last portion of the file (after options:)? I was using logistic regression, and I need to check how iterating over the training data updates my classifier so that I can understand when I reach convergence...
Thanks in advance :)
The values you see are the hash-values and weights of all your 22 features and one additional "Constant" feature (its hash value is 116060) in the resulting trained model.
The format is:
hash_value:weight
In order to see your original feature names instead of the hash value, you may use one of two methods:
Use the utl/vw-varinfo (in the source tree) utility on your training set with the same options you used for training. Try utl/vw-varinfo for a help/usage message
Use the relatively new --invert_hash readable.model option
BTW: inverting the hash values back to the original feature names is not the default due to the large performance penalty. By default, vw applies the one way hash on each feature string it sees. It doesn't maintain a hash-map between feature-names and their hash-values at all.
Edit:
Another little tidbit that may be of interest is the first entry after options: which reads:
:0
It essentially means that any "other" feature (all those which are not in the training set, and thus not hashed into the weight vector) defaults to a weight of 0. This means that it is redundant in vowpal wabbit to train on features with a value of zero, since that is the default anyway; explicit :0 value features simply won't contribute anything to the model. When you leave out a value in your training set, as in feature_name without a trailing :<value>, vowpal wabbit implicitly assumes that it is a binary feature with a TRUE value. IOW: it defaults all value-less features to a value of one (:1) rather than a value of zero (:0). HTH.
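For example, with hypothetical feature names, the first two input lines below are identical as far as vw is concerned, and the third differs only by an explicit :0 feature, which contributes nothing and could just as well be left out:

1 | feature_a feature_b:2.5
1 | feature_a:1 feature_b:2.5
1 | feature_a:1 feature_b:2.5 feature_c:0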
Vowpal Wabbit also now has an --invert_hash option, which will give you a readable model with the actual variables, as well as just the hashes.
It consumes a LOT more memory, but since your model seems to be pretty small it will probably work.
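For example, a typical invocation might look like this (train.vw is a hypothetical training file; keep the rest of the options the same as in your original training run):

vw -d train.vw --loss_function logistic -f model.vw --invert_hash readable_model.txt

readable_model.txt will then contain the original feature names instead of bare hash values.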
Ie: "college" and "schoolwork" and "academy" belong in the same cluster,
the words "essay", "scholarships" , "money" also belong in the same cluster. Is this a ML or NLP problem?
It depends on how strict your definition of similar is.
Machine Learning Techniques
As others have pointed out, you can use something like latent semantic analysis or the related latent Dirichlet allocation.
Semantic Similarity and WordNet
As was pointed out, you may wish to use an existing resource for something like this.
Many research papers (example) use the term semantic similarity. The basic idea is usually to compute the distance between two words in a graph, where a word is a child of another word if it is a type of it. Example: "songbird" would be a child of "bird". Semantic similarity can be used as a distance metric for creating clusters, if you wish.
Example Implementation
In addition, if you put a threshold on the value of some semantic similarity measure, you can get a boolean True or False. Here is a Gist I created (word_similarity.py) that uses NLTK's corpus reader for WordNet. Hopefully that points you in the right direction and gives you a few more search terms.
def sim(word1, word2, lch_threshold=2.15, verbose=False):
    """Determine if two (already lemmatized) words are similar or not.

    Call with verbose=True to print the WordNet senses from each word
    that are considered similar.

    The documentation for the NLTK WordNet Interface is available here:
    http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
    """
    from nltk.corpus import wordnet as wn

    results = []
    for net1 in wn.synsets(word1):
        for net2 in wn.synsets(word2):
            try:
                lch = net1.lch_similarity(net2)
            except Exception:
                # lch_similarity is undefined across different parts of speech
                continue
            # The value to compare the LCH to was found empirically.
            # (The value is very application dependent. Experiment!)
            if lch is not None and lch >= lch_threshold:
                results.append((net1, net2))
    if not results:
        return False
    if verbose:
        for net1, net2 in results:
            print(net1)
            print(net1.definition())
            print(net2)
            print(net2.definition())
            print('path similarity:')
            print(net1.path_similarity(net2))
            print('lch similarity:')
            print(net1.lch_similarity(net2))
            print('wup similarity:')
            print(net1.wup_similarity(net2))
            print('-' * 79)
    return True
Example output
>>> sim('college', 'academy')
True
>>> sim('essay', 'schoolwork')
False
>>> sim('essay', 'schoolwork', lch_threshold=1.5)
True
>>> sim('human', 'man')
True
>>> sim('human', 'car')
False
>>> sim('fare', 'food')
True
>>> sim('fare', 'food', verbose=True)
Synset('fare.n.04')
the food and drink that are regularly served or consumed
Synset('food.n.01')
any substance that can be metabolized by an animal to give energy and build tissue
path similarity:
0.5
lch similarity:
2.94443897917
wup similarity:
0.909090909091
-------------------------------------------------------------------------------
True
>>> sim('bird', 'songbird', verbose=True)
Synset('bird.n.01')
warm-blooded egg-laying vertebrates characterized by feathers and forelimbs modified as wings
Synset('songbird.n.01')
any bird having a musical call
path similarity:
0.25
lch similarity:
2.25129179861
wup similarity:
0.869565217391
-------------------------------------------------------------------------------
True
>>> sim('happen', 'cause', verbose=True)
Synset('happen.v.01')
come to pass
Synset('induce.v.02')
cause to do; cause to act in a specified manner
path similarity:
0.333333333333
lch similarity:
2.15948424935
wup similarity:
0.5
-------------------------------------------------------------------------------
Synset('find.v.01')
come upon, as if by accident; meet with
Synset('induce.v.02')
cause to do; cause to act in a specified manner
path similarity:
0.333333333333
lch similarity:
2.15948424935
wup similarity:
0.5
-------------------------------------------------------------------------------
True
I suppose you could build your own database of such associations using ML and NLP techniques, but you might also consider querying existing resources such as WordNet to get the job done.
If you have a sizable collection of documents related to the topic of interest, you might want to look at Latent Dirichlet Allocation. LDA is a fairly standard NLP technique that automatically clusters words into topics, where similarity between words is determined by co-occurrence in the same document (you can treat a single sentence as a document if that serves your needs better).
You'll find a number of LDA toolkits available. We'd need more detail on your exact problem before recommending one over another. I'm not enough of an expert to make that recommendation anyway, but I can at least suggest you look at LDA.
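As a starting point, here is a tiny sketch with gensim, one of the available LDA toolkits (the toy tokenized documents and num_topics=2 are placeholders):

from gensim import corpora, models

# Toy "documents" (already tokenized); treat each sentence as a document if needed.
texts = [['college', 'schoolwork', 'academy', 'essay'],
         ['essay', 'scholarships', 'money'],
         ['college', 'scholarships', 'money', 'academy']]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=50, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)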
The famous quote regarding your question is by John Rupert Firth in 1957:
You shall know a word by the company it keeps
To start delving into this topic you can look into this presentation.
Word2Vec can play a role in finding similar words (contextually/semantically). In word2vec, words are represented as vectors in an n-dimensional space, and we can calculate the distance between words (e.g. Euclidean distance) or simply build clusters.
After this, we can come up with a numerical value for the similarity between two words.
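A minimal gensim sketch of that idea (the tiny tokenized corpus is a placeholder - in practice you would train on a large corpus or load pre-trained vectors; note that vector_size was called size before gensim 4.0):

from gensim.models import Word2Vec

# Placeholder corpus: a list of tokenized sentences.
sentences = [['college', 'students', 'write', 'essays', 'for', 'schoolwork'],
             ['scholarships', 'provide', 'money', 'for', 'college'],
             ['the', 'academy', 'awarded', 'scholarships']]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200, seed=0)

# Similarity between two word vectors, and the nearest neighbours of a word.
print(model.wv.similarity('college', 'academy'))
print(model.wv.most_similar('college', topn=3))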