How to decide numClasses parameter to be passed to Random Forest algorithm in SPark MLlib with pySpark - machine-learning

I am working on Classification using Random Forest algorithm in Spark have a sample dataset that looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
Here the last value in each row will serve as a label and rest serve as features. But I want to treat label as a category and not a number. So 165.22352099 will denote a category and so will -552.8234. For this I have encoded my features as well as label into categorical data. Now what I am having difficulty in is deciding what should I pass for numClasses parameter in Random Forest algorithm in Spark MlLib? I mean should it be equal to number of unique values in my label? My label has like 10000 unique values so if I put 10000 as value of numClasses then wouldn't it decrease the performance dramatically?
Here is the typical signature of building a model for Random Forest in MlLib:
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
numTrees=3, featureSubsetStrategy="auto",
impurity='gini', maxDepth=4, maxBins=32)

The confusion comes from the fact that you are doing something that you should not do. You problem is clearly a regression/ranking, not a classification. Why would you think about it as a classification? Try to answer these two questions:
Do you have at least 100 samples per each value (100,000 * 100 = 1,000,000)?
Is there completely no structure in the classes, so for example - are objects with value "200" not more similar to those with value "100" or "300" than to those with value "-1000" or "+2300"?
If at least one answer is no, then you should not treat this as a classification problem.
If for some weird reason you answered twice yes, then the answer is: "yes, you should encode each distinct value as a different class" thus leading to 10000 unique classes, which leads to:
extremely imbalanced classification (RF, without balancing meta-learner will nearly always fail in such scenario)
extreme number of classes (there are no models able to solve it, for sure RF will not solve it)
extremely small dimension of the problem- looking at as small is your number of features I would be surprised if you could predict from that binary classifiaction. As you can see how irregular are these values, you have 3 points which only diverge in first value and you get completely different results:
Level1,Male,New York,New York,352.888890
Level2,Male,New York,New York,-495.8001345
Level3,Male,New York,New York,495.8
So to sum up, with nearly 100% certainty this is not a classification problem, you should either:
regress on last value (keyword: reggresion)
build a ranking (keyword: learn to rank)
bucket your values to at most 10 different values and then - classify (keywords: imbalanced classification, sparse binary representation)

Related

Transforming Features to increase similarity

I have a large dataset (~20,000 samples x 2,000 features-- each sample w/ a corresponding y-value) that I'm constructing a regression ML model for.
The input vectors are bitvectors with either 1s or 0s at each position.
Interestingly, I have noticed that when I 'randomly' select N samples such that their y-values are between two arbitrary values A and B (such that B-A is much smaller than the total range of values in y), the subsequent model is much better at predicting other values with the A-->B range not used in the training of the model.
However, the overall similarity of the input X vectors for these values are in no way more similar than any random selection of X values across the whole dataset.
Is there an available method to transform the input X-vectors such that those with more similar y-values are "closer" (I'm not particular the methodology, but it could be something like cosine similarity), and those with not similar y-values are separated?
After more thought, I believe this question can be re-framed as a supervised clustering problem. What might be able to accomplish this might be as simple as:
import umap
print(df.shape)
>> (23,312, 2149)
print(len(target))
>> 23,312
embedding = umap.UMAP().fit_transform(df, y=target)

How to squish a continuous cosine-theta score to a discrete (0/1) output?

I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar then the words should contain quite some overlap. However, a cosine theta score of 0.54 does not mean "related" or "not related". I should end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I should find the optimal parameters to give to such functions and I do not know if these functions are satisfactory solutions. I was thinking that I have the cosine theta score, I can calculate the percentage of overlap between two sentences two (e.g. the amount of overlapping words divided by the amount of words in the article) and maybe some more interesting things. Then with the data, I could maybe write a function (what type of function I do not know and is part of the question!), after which I can minimize the error via the SciPy library. This means that I should do some sort of supervised learning, and I am willing to label article pairs with labels (0/1) in order to train a network. Is this worth the effort?
# Count words of two strings.
v1, v2 = self.word_count(s1), self.word_count(s2)
# Calculate the intersection of the words in both strings.
v3 = set(v1.keys()) & set(v2.keys())
# Calculate some sort of ratio between the overlap and the
# article length (since 1 overlapping word on 2 words is more important
# then 4 overlapping words on articles of 492 words).
p = min(len(v1), len(v2)) / len(v3)
numerator = sum([v1[w] * v2[w] for w in v3])
w1 = sum([v1[w]**2 for w in v1.keys()])
w2 = sum([v2[w]**2 for w in v2.keys()])
denominator = math.sqrt(w1) * math.sqrt(w2)
# Calculate the cosine similarity
if not denominator:
return 0.0
else:
return (float(numerator) / denominator)
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. It is up to you to choose how the overlap affects whether or not two strings are "matching" unless you have a labelled data set. If you have a labelled data set (I.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or SVM due to the potentially high dimensional, categorical nature of your problem.
Even the optimisation, however, is a subjective measure. For example, in theory let's pretend you have a model which out of 100 samples only predicts 1 answer (Giving 99 unknowns). Technically if that one answer is correct, that is a model with 100% accuracy, but which has a very low recall. Generally in machine learning you will find a trade off between recall and accuracy.
Some people like to go for certain metrics which combine the two (The most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about accuracy - I would only want to target consumers who are likely to buy my product. If however, we are looking to test for a deadly disease or markers for bank fraud, then it's feasible for that test to be accurate only 10% of the time - if its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is just to define some cut off value which you believe indicates a good match. This is would then be more analogous to a binary clustering problem, and you could use some more abstract measure such as distance to a centroid to test which cluster (Either the "related" or "unrelated" cluster) the point belongs to. Note however that here your features feel like they would be incredibly hard to define.

how to do Classification based on the correlation of multiple features for a Supervised scenario

I have 2 features: 'Contact_Last_Name' and 'Account_Last_Name' based on which I want to Classify my data:
The logic is that if the 2 features are same i.e. Contact_Last_Name is same as Account_Last_Name - then the result is 'Success' or else it is 'Denied'.
So. for example: if Contact_Last_Name is 'Johnson' and Account_Last_Name is 'Eigen' - the result is classified as 'Denied'. If both are equal say - 'Edison' - then the result is 'Success'.
How, can I have a Classification algorithm for this set of data?
[please note that usually we discard High Correlation columns but over here the correlation between columns seems to have the logic for Classification]
I have tried to use Decision Tree(C5.0) and Naive Bayes(naiveBayes) in R but both of these fail to Classify the dataset correctly.
First of all its not a good use case for machine learning, because this can be done with just string match, but still if you want to give to a classification algorith, then create a table with values as 'Contact_Last_Name' and 'Account_Last_Name' and 'Result' and give it for decision tree and predict the third column.
Note that you partition your data for training and testing.

How to normalize tf-idf vectors for SVMs?

I am using Support Vector Machines for document classification. My feature set for each document is a tf-idf vector. I have M documents with each tf-idf vector of size N.
Giving M * N matrix.
The size of M is just 10 documents and tf-idf vector is 1000 word vector. So my features are much larger than number of documents. Also each word occurs in either 2 or 3 documents. When i am normalizing each feature ( word ) i.e. column normalization in [0,1] with
val_feature_j_row_i = ( val_feature_j_row_i - min_feature_j ) / ( max_feature_j - min_feature_j)
It either gives me 0, 1 of course.
And it gives me bad results. I am using libsvm, with rbf function C = 0.0312, gamma = 0.007815
Any recommendations ?
Should i include more documents ? or other functions like sigmoid or better normalization methods ?
The list of things to consider and correct is quite long, so first of all I would recommend some machine-learning reading before trying to face the problem itself. There are dozens of great books (like ie. Haykin's "Neural Networks and Learning Machines") as well as online courses, which will help you with such basics, like those listed here: http://www.class-central.com/search?q=machine+learning .
Getting back to the problem itself:
10 documents is rows of magnitude to small to get any significant results and/or insights into the problem,
there is no universal method of data preprocessing, you have to analyze it through numerous tests and data analytics,
SVMs are parametrical models, you cannot use a single C and gamma values and expect any reasonable results. You have to check dozens of them to even get a clue "where to search". The most simple method for doing so is so called grid search,
1000 of features is a great number of dimensions, this suggest that using a kernel, which implies infinitely dimensional feature space is quite... redundant - it would be a better idea to first analyze simplier ones, which have smaller chance to overfit (linear or low degree polynomial)
finally is tf*idf a good choice if "each word occurs in 2 or 3 documents"? It can be doubtfull, unless what you actually mean is 20-30% of documents
finally why is simple features squashing
It either gives me 0, 1 of course.
it should result in values in [0,1] interval, not just its limits. So if this is a case you are probably having some error in your implementation.

How are binary classifiers generalised to classify data into arbitrarily large sets?

How can algorithms which partition a space in to halves, such as Suport Vector Machines, be generalised to label data with labels from sets such as the integers?
For example, a support vector machine operates by constructing a hyperplane and then things 'above' the hyperplane take one label, and things below it take the other label.
How does this get generalised so that the labels are, for example, integers, or some other arbitrarily large set?
One option is the 'one-vs-all' approach, in which you create one classifier for each set you want to partition into, and select the set with the highest probability.
For example, say you want to classify objects with a label from {1,2,3}. Then you can create three binary classifiers:
C1 = 1 or (not 1)
C2 = 2 or (not 2)
C3 = 3 or (not 3)
If you run these classifiers on a new piece of data X, then they might return:
C1(X) = 31.6% chance of being in 1
C2(X) = 63.3% chance of being in 2
C3(X) = 89.3% chance of being in 3
Based on these outputs, you could classify X as most likely being from class 3. (The probabilities don't add up to 1 - that's because the classifiers don't know about each other).
If your output labels are ordered (with some kind of meaningful, rather than arbitrary ordering). For example, in finance you want to classify stocks into {BUY, SELL, HOLD}. Although you can't legitimately perform a regression on these (the data is ordinal rather than ratio data) you can assign the values of -1, 0 and 1 to SELL, HOLD and BUY and then pretend that you have ratio data. Sometimes this can give good results even though it's not theoretically justified.
Another approach is the Cramer-Singer method ("On the algorithmic implementation of multiclass kernel-based vector machines").
Svmlight implements it here: http://svmlight.joachims.org/svm_multiclass.html.
Classification into an infinite set (such as the set of integers) is called ordinal regression. Usually this is done by mapping a range of continuous values onto an element of the set. (see http://mlg.eng.cam.ac.uk/zoubin/papers/chu05a.pdf, Figure 1a)

Resources