Using C4.5 classifier with multiple outcomes - machine-learning

I'm looking at the C4.5 classifier for a machine learning task. I have a large dataset containing city names, and need to differentiate between e.g. London Ontario, London England, or even London in Burgundy in France, by looking at features from the surrounding text: e.g. zip codes and state names, even when "Canada" or "England" are not mentioned. I also have access to metadata such as dialing codes, which can help determine which country it is.
Once trained, I want to run the classifier on the large dataset.
In all the examples I have found here, there are only two possible result classes (in the golf example: play or don't play).
Can the C4.5 classifier handle London (Canada), London (England), London (France) as result classes, or do I need separate classifiers for London (Canada) true/false, etc.?

I see two options in your case.
The first approach is a straightforward extension to C4.5. In each leaf node, you keep all the labels instead of just the majority label. For example, as shown in the figure below, red labels actually appear in three different leaves. When you query the data point indicated by the arrow, the output is three labels (green, red, and blue) together with their corresponding conditional probabilities p(c|v) (given features x1 and x2, the probability that data point x belongs to class c).
The second approach is to generate multiple decision trees, hence a random forest. The randomness can be injected by randomly sampling the subset of training data made available to each individual tree. At classification time, you can aggregate the votes from all decision trees to get multi-class classification results.
The figures are borrowed from this excellent tutorial on multi-class classification by Andrew Zisserman.
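For illustration, here is a minimal multi-class sketch using scikit-learn. Note this is an assumption on my part: sklearn's trees implement CART rather than C4.5, but multi-class labels and per-class probabilities p(c|v) work the same way, and the feature encoding below is hypothetical.

from sklearn.ensemble import RandomForestClassifier

# Hypothetical features extracted from the surrounding text/metadata:
# [has_zip_code, has_state_name, dialing_code]
X = [[1, 1, 1], [0, 1, 44], [0, 0, 33], [1, 0, 1]]
y = ["London (Canada)", "London (England)", "London (France)", "London (Canada)"]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# predict_proba returns one probability per class, not a binary yes/no.
print(clf.classes_)
print(clf.predict_proba([[0, 1, 44]]))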

Related

Word Embedding Model

I have been searching for and attempting to implement a word embedding model to predict similarity between words. I have a dataset made up of 3,550 company names; the idea is that the user can provide a new word (which would not be in the vocabulary) and calculate the similarity between the new name and the existing ones.
During preprocessing I got rid of stop words and punctuation (hyphens, dots, commas, etc.). In addition, I applied stemming and separated prefixes in the hope of getting more precision. Words such as BIOCHEMICAL then ended up as BIO CHEMIC, i.e. the word divided in two (prefix and stemmed word).
The average company name is made up of 3 words. (A frequency histogram of name lengths was shown here.)
The tokens that result from preprocessing are sent to word2vec:
from gensim.models import Word2Vec

# window: maximum distance between the current and predicted word within a sentence
# min_count: ignores all words with total frequency lower than this
# workers: use this many worker threads to train the model
# sg: the training algorithm, either CBOW (0) or skip-gram (1); default is 0
word2vec_model = Word2Vec(prepWords, size=300, window=2, min_count=1, workers=7, sg=1)
After the model has included all the words in the vocab, the average sentence vector is calculated for each company name:
df['avg_vector'] = df2.apply(lambda row: avg_sentence_vector(row, model=word2vec_model, num_features=300, index2word_set=set(word2vec_model.wv.index2word)).tolist())
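(The avg_sentence_vector helper is not shown in the question; a typical implementation, sketched here under the assumption of the gensim 3.x index2word API used above, looks like this:)

import numpy as np

def avg_sentence_vector(words, model, num_features, index2word_set):
    # Average the vectors of all in-vocabulary words; zero vector if none match.
    feature_vec = np.zeros(num_features, dtype="float32")
    n_words = 0
    for word in words:
        if word in index2word_set:
            n_words += 1
            feature_vec = np.add(feature_vec, model.wv[word])
    if n_words > 0:
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec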
Then, the vector is saved for further lookups:
# Save the name and vector values to a file
df.to_csv('name-submission-vectors.csv',encoding='utf-8', index=False)
If a new company name is not included in the vocab after preprocessing (removing stop words and punctuation), I rebuild the model, calculate the average sentence vector for the new name, and save it again.
I have found that this model is not working as expected. As an example, asking for the words most similar to pet returns the following results:
ms = word2vec_model.most_similar('pet')
('fastfood', 0.20879755914211273)
('hammer', 0.20450574159622192)
('allur', 0.20118337869644165)
('wright', 0.20001833140850067)
('daili', 0.1990675926208496)
('mgt', 0.1908089816570282)
('mcintosh', 0.18571510910987854)
('autopart', 0.1729743778705597)
('metamorphosi', 0.16965581476688385)
('doak', 0.16890916228294373)
In the dataset I have words such as paws and petcare, but it is other, unrelated words that end up most similar to pet.
(A plot of the distribution of the nearest words to pet was shown here.)
On the other hand, when I used GoogleNews-vectors-negative300.bin.gz, I could not add new words to the vocab, but the similarity between pet and the words around it was as expected:
ms = word2vec_model.most_similar('pet')
('pets', 0.771199643611908)
('Pet', 0.723974347114563)
('dog', 0.7164785265922546)
('puppy', 0.6972636580467224)
('cat', 0.6891531348228455)
('cats', 0.6719794869422913)
('pooch', 0.6579219102859497)
('Pets', 0.636363685131073)
('animal', 0.6338439583778381)
('dogs', 0.6224827170372009)
(A plot of the distribution of the nearest words was shown here.)
I would like to get your advice about the following:
Is this dataset appropriate to proceed with this model?
Is the dataset large enough to allow word2vec to "learn" the relationships between the words?
What can I do to improve the model so that word2vec creates relationships of the same kind as GoogleNews, where for instance the word pet is correctly placed among similar words?
Is it feasible to implement another alternative such as fastText, considering the nature of the current dataset?
Do you know any public dataset that can be used along with the current dataset to create those relationships?
Thanks
3500 texts (company names) of just ~3 words each is only around 10k total training words, with a much smaller vocabulary of unique words.
That's very, very small for word2vec & related algorithms, which rely on lots of data, and sufficiently-varied data, to train-up useful vector arrangements.
You may be able to squeeze some meaningful training from limited data by using far more training epochs than the default epochs=5, and far smaller vectors than the default size=100. With those sorts of adjustments, you may start to see more meaningful most_similar() results.
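For example, a minimal sketch of those adjustments (assuming gensim 4.x parameter names, vector_size and epochs; in gensim 3.x these were size and iter):

from gensim.models import Word2Vec

# Tiny corpus: shrink the vectors and train far more epochs than the default 5.
word2vec_model = Word2Vec(
    prepWords,        # the same preprocessed token lists as above
    vector_size=32,   # far smaller than the default 100
    window=2,
    min_count=1,
    sg=1,
    epochs=200,       # far more than the default 5
)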
But, it's unclear that word2vec, and specifically word2vec in your averaging-of-a-name's-words comparisons, is matched to your end goals.
Word2vec needs lots of data, doesn't look at subword units, and can't say anything about word-tokens not seen during training. An average-of-many-word-vectors can often work as an easy baseline for comparing multiword texts, but it can also dilute an individual word's influence compared to other methods.
Things to consider might include:
Word2vec-related algorithms like FastText that also learn vectors for subword units, and can thus bootstrap not-so-bad guess vectors for words not seen in training. (But these are also data-hungry, and to use them on a small dataset you'd again want to reduce the vector size, increase the epochs, and additionally shrink the number of buckets used for subword learning; see the sketch after this list.)
More sophisticated comparisons of multi-word texts, like "Word Mover's Distance". (That can be quite expensive on longer texts, but for names/titles of just a few words may be practical.)
Finding more data that's compatible with your aims for a stronger model. A larger database of company names might help. If you just want your analysis to understand English words/roots, more generic training texts might work too.
For many purposes, a mere lexicographic comparison (edit distances, counts of shared character n-grams) may be helpful too, though it won't detect all synonyms/semantically-similar words.
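A minimal sketch of the FastText option from the list above, assuming gensim's FastText implementation and gensim 4.x parameter names:

from gensim.models import FastText

# Small dataset: shrink vectors, raise epochs, shrink the subword hash buckets.
ft_model = FastText(
    prepWords,
    vector_size=32,
    window=2,
    min_count=1,
    sg=1,
    epochs=200,
    min_n=3, max_n=5,  # character n-gram lengths used for subwords
    bucket=20000,      # far fewer than the default 2,000,000
)

# FastText can synthesize a vector even for an out-of-vocabulary word:
print(ft_model.wv.most_similar('petcare'))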
Word2vec does not generalize to unseen words.
It does not even work well for words that are seen but rare. It really depends on having many, many examples of word usage. Furthermore, you need enough context to the left and right, but you only use company names, and these are too short. That is likely why your embeddings perform so poorly: too little data and too short texts.
Hence, it is the wrong approach for you. Retraining the model with the new company name is not enough; you still only have one data point for it. You may as well leave out unseen words; word2vec cannot do better than that even if you retrain.
If you only want to compute similarity between words, probably you don't need to insert new words in your vocabulary.
At first glance, I think you can also use FastText without the need to stem the words. It also computes vectors for unknown words.
From FastText FAQ:
One of the key features of fastText word representation is its ability to produce vectors for any words, even made-up ones. Indeed, fastText word vectors are built from vectors of substrings of characters contained in it. This allows to build vectors even for misspelled words or concatenation of words.
FastText seems to be useful for your purpose.
For your task, you can follow the FastText supervised tutorial.
If your corpus proves to be too small, you can build your model starting from available pretrained vectors (the pretrainedVectors parameter).
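A minimal sketch of that, assuming the official fasttext Python bindings; the file names are placeholders, and train.txt must use fastText's __label__ format:

import fasttext

# train.txt: one example per line, e.g. "__label__chemicals bio chemic corp"
model = fasttext.train_supervised(
    input='train.txt',
    dim=300,
    pretrainedVectors='cc.en.300.vec',  # dimensionality must match dim
)
print(model.predict('bio chemic corp'))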

Document classification using word vectors

While I was classifying and clustering the documents written in natural language, I came up with a question ...
As word2vec, GloVe, etc. vectorize words in distributed spaces, I wonder if there is any method recommended or commonly used for document vectorization USING word vectors.
For example,
Document1: "If you chase two rabbits, you will lose them both."
can be vectorized as,
[0.1425, 0.2718, 0.8187, .... , 0.1011]
I know about the one also known as doc2vec, where a document gets an n-dimensional vector just like word2vec. But this is 1 x n dimensions, and I have been testing around to find out the limits of using doc2vec.
So, I want to know how other people apply word vectors for applications that need a steady size.
Just stacking the vectors of m words forms an m x n dimensional matrix. In this case, the dimensions will not be uniform, since m depends on the number of words in the document.
If: [0.1018, ... , 0.8717]
you: [0.5182, ... , 0.8981]
..: [...]
m th word: [...]
And this form is not favorable for running some machine learning algorithms such as CNNs. What are the suggested methods to produce document vectors of a steady size using word vectors?
It would be great if it is provided with papers as well.
Thanks!
The simplest approach to get a fixed-size vector from a text, when all you have is word-vectors, is to average all the word-vectors together. (The vectors could be weighted, but if they haven't been unit-length-normalized, their raw magnitudes from training are somewhat of an indicator of their strength-of-single-meaning: polysemous/ambiguous words tend to have vectors with smaller magnitudes.) It works OK for many purposes.
Word vectors can be specifically trained to be better at composing like this, if the training texts are already associated with known classes. Facebook's FastText in its 'classification' mode does this; the word-vectors are optimized as much or more for predicting output classes of the texts they appear in, as they are for predicting their context-window neighbors (classic word2vec).
The 'Paragraph Vector' technique, often called 'doc2vec', gives every training text a sort-of floating pseudoword, that contributes to every prediction, and thus winds up with a word-vector-like position that may represent that full text, rather than the individual words/contexts.
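As an illustration, a minimal Paragraph Vector sketch, assuming gensim's Doc2Vec implementation (the toy texts are placeholders):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each training text gets a tag that acts as its floating pseudoword.
texts = ["if you chase two rabbits you will lose them both",
         "a bird in the hand is worth two in the bush"]
docs = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]

model = Doc2Vec(docs, vector_size=100, min_count=1, epochs=40)

# A fixed-size vector for any new text, regardless of its word count:
vec = model.infer_vector("chase two rabbits".split())
print(len(vec))  # 100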
There are many further variants, including some based on deeper predictive networks (e.g. 'Skip-Thought Vectors'), or slightly different prediction targets (e.g. neighboring sentences in 'FastSent'), or other generalizations that can even include a mixture of symbolic and numeric inputs/targets during training (an option in Facebook's StarSpace, which explores other entity-vectorization possibilities related to word-vectors and FastText-like classification needs).
If you don't need to collapse a text to fixed-size vectors, but just compare texts, there are also techniques like "Word Mover's Distance" which take the "bag of word-vectors" for one text, and another, and give a similarity score.
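For instance, a tiny sketch of Word Mover's Distance with gensim (assumes a trained Word2Vec or FastText model as in the earlier sketches; older gensim versions additionally require the pyemd package):

# Lower distance = more similar texts; operates on bags of word-vectors.
d = model.wv.wmdistance("obama speaks to the media".split(),
                        "the president greets the press".split())
print(d)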

multi-label text classification with zero or more labels

I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc.). My problem is handling text that doesn't fit any of these labels.
I tried ML libraries (MaxEnt, naive Bayes), but they incorrectly match "other" text to one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad that it's not possible to pick a representative sample.
Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term frequency count, using a predefined list of terms to match for each label. But with the counts, how do I determine a relevancy score, i.e. whether the text actually belongs to that label? I don't have a corpus and can't use tf-idf, etc.
Another idea is to use a neural network with a softmax output function. Softmax gives you a probability for every class: when the network is very confident about a class, it assigns that class a high probability and lower probabilities to the others, but when it is unsure, the differences between the probabilities will be small and none of them will be very high. What if you define a threshold like: if the probability for every class is less than 70%, predict "other"?
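A minimal sketch of that thresholding step (the label set and the 0.7 cutoff are just the example values from this thread):

import numpy as np

LABELS = ["finance", "tech", "sports", "health", "politics"]

def predict_with_other(probs, threshold=0.7):
    # probs: softmax output, one probability per label
    probs = np.asarray(probs)
    if probs.max() < threshold:
        return "other"
    return LABELS[probs.argmax()]

print(predict_with_other([0.22, 0.25, 0.18, 0.20, 0.15]))  # -> "other"
print(predict_with_other([0.85, 0.05, 0.04, 0.03, 0.03]))  # -> "finance"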
Whew! Classic ML algorithms don't combine multi-class classification and "in/out" detection at the same time. Perhaps what you could do is train five models, one for each class, each with one-against-the-world training. Then use an uber-model to look for any of those five claiming the input; if none claims it, it's "other".
Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set, and train a second one as a 5-class SVM (for instance) within those five classes. The first model finds "other"; everything else gets passed to the second.
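A rough sketch of that two-stage setup with scikit-learn; the random toy features stand in for whatever text vectorization you use:

import numpy as np
from sklearn.svm import SVC

# Toy feature vectors (in practice: term counts or similar).
X_all = np.random.rand(100, 20)
y_gate = np.random.randint(0, 2, 100)   # 1 = one of the 5 topics, 0 = other
X_in = X_all[y_gate == 1]
y_topic = np.random.choice(["finance", "tech", "sports", "health", "politics"],
                           size=len(X_in))

gate = SVC().fit(X_all, y_gate)          # stage 1: in/out
topic_clf = SVC().fit(X_in, y_topic)     # stage 2: which of the 5

def classify(x):
    if gate.predict([x])[0] == 0:
        return "other"
    return topic_clf.predict([x])[0]

print(classify(np.random.rand(20)))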
What about creating histograms? You could use a bag-of-words approach using significant indicators for, e.g., Tech and Finance. You could try to identify such indicators by analyzing certain websites' tags and articles, or just browse the web for such indicators:
http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html
Let's say your input vector X has n dimensions, where n represents the number of indicators. For example, Xi would then hold the count of occurrences of the word "asset", and Xi+k the count of the phrase "big data" in the current article.
Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.
If you must support the zero-or-more-labels case, train a model that returns probability scores (such as a neural net, as Luis Leal suggested) per label/class. You could then rank your output by that score and say that every class with a score higher than some threshold t is a matching category.
Try this NBayes implementation.
For identifying the "Other" category, don't bother much. Just train on your required categories, which clearly identifies them, and introduce a threshold in the classifier.
If the values for a label do not cross the threshold, then the classifier adds the "Other" label.
It's all in the training data.
AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.
Classifier4J appears to be the best solution for our needs because the model looks easy to train and it doesn't require training on non-matches.
http://classifier4j.sourceforge.net/usage.html

5 input and 3 output features for machine learning

I need some advice here.
I am trying to build a model that can predict 3 different output features when 5 input features are given.
for example,
5 input features: size of the house, house floor, house condition, number of rooms, parking.
3 output features: price for selling, price for buying, price for renting
What I am confused about right now is: is it possible for a trained model to predict all 3 outputs? From the examples/tutorials I found, people mostly try to do only one thing with their model.
Sorry if my explanation is bad; I am new to TensorFlow and machine learning.
A neural network can definitely predict/approximate multiple outputs. I have experience with a neural controller where the net produced control signals for two motors.
I don't have experience with TensorFlow myself, but this framework is from Google and quite popular, so I'm almost sure there is multi-output functionality.
There is a nice example of such a thing.
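As an illustration (not the example linked above), a minimal multi-output regression sketch with TensorFlow's Keras API, using toy data: 5 inputs in, 3 outputs out.

import numpy as np
import tensorflow as tf

# Toy data: 5 input features -> 3 output prices.
X = np.random.rand(200, 5)
y = np.random.rand(200, 3)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(5,)),
    tf.keras.layers.Dense(3),   # one output unit per predicted feature
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, verbose=0)

print(model.predict(X[:1]))     # three predicted values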
In common practice, we build a model to predict only one output. That is because in supervised learning we input certain kinds of variables and find a relation between them and one wanted output; this relation generally cannot also hold between the input and another wanted output.
But we can use a special technique to fit your problem:
If we have four input variables I1, I2, I3, I4, and we want three output labels (generally discrete) O1, O2, O3, then we can create a new label O4 by merging the original three outputs. For example, if O1, O2, O3 can only be 0 or 1, then O4 has 2^3 = 8 possible values in total. So we can build a prediction model between the four input variables and the output O4, and once the value of O4 is known, O1-O3 are all known as well.
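A minimal sketch of that merge-and-split trick for binary labels (plain Python, just the encoding arithmetic):

# Merge three binary labels into one 8-valued label, and split it back.
def merge_labels(o1, o2, o3):
    return o1 * 4 + o2 * 2 + o3         # O4 in 0..7

def split_label(o4):
    return (o4 >> 2) & 1, (o4 >> 1) & 1, o4 & 1

o4 = merge_labels(1, 0, 1)
print(o4, split_label(o4))              # -> 5 (1, 0, 1)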
However, if the output variables are not all discrete, especially when a regression technique is used, the technique above will not work. So, to predict three outputs, we normally train three times and make three models.

Which Stanford NLP package to use for content categorization

I have about 5000 terms in a table and I want to group them into categories that make sense.
For example some terms are:
Nissan
Ford
Arrested
Jeep
Court
The result should be that Nissan, Ford, and Jeep get grouped into one category, and that Arrested and Court are in another category. I looked at the Stanford Classifier. Am I right to assume that this is the right one to choose to do this for me?
I would suggest using NLTK if there weren't so many proper nouns. You can use the semantic similarity from WordNet as features and try to cluster the words. Here's a discussion about how to do that.
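For instance, a minimal WordNet similarity sketch with NLTK (proper nouns like Nissan typically have no synsets, which is the caveat above; requires nltk.download('wordnet') once):

from nltk.corpus import wordnet as wn

def wn_similarity(w1, w2):
    # Path similarity between first-sense synsets; None if a word has no synset.
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    if not s1 or not s2:
        return None
    return s1[0].path_similarity(s2[0])

print(wn_similarity("court", "arrest"))  # legal terms
print(wn_similarity("jeep", "car"))      # vehicle terms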
To use the Stanford Classifier, you need to know how many buckets (classes) of words you want. Besides, I think it is intended for documents rather than single words.
That's an interesting problem that the word2vec model that Google released may help with.
In a nutshell, a word is represented by an N-dimensional vector generated by a model. Google provides a great model that returns 300-dimensional vectors, trained on over 100 billion words from their news division.
The interesting thing is that there are semantics encoded in these vectors. Suppose you have the vectors for the words King, Man, and Woman. A simple expression (King - Man) + Woman will yield a vector that is exceedingly close to the vector for Queen.
This is done via a distance calculation (cosine distance is their default, but you can use your own on the vectors) to determine similarity between words.
For your example, the distance between Jeep and Ford would be much smaller than between Jeep and Arrested. Through this you could group terms 'logically'.
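A minimal sketch of both tricks with gensim and the pretrained GoogleNews vectors (the file path is a placeholder, the download is around 1.6 GB, and token casing in the GoogleNews vocabulary may matter):

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# The classic analogy: (King - Man) + Woman is closest to Queen.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Grouping by similarity: vehicle terms score higher together.
print(wv.similarity("jeep", "ford"))      # relatively high
print(wv.similarity("jeep", "arrested"))  # relatively low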
