How can algorithms which partition a space in to halves, such as Suport Vector Machines, be generalised to label data with labels from sets such as the integers?
For example, a support vector machine operates by constructing a hyperplane and then things 'above' the hyperplane take one label, and things below it take the other label.
How does this get generalised so that the labels are, for example, integers, or some other arbitrarily large set?
One option is the 'one-vs-all' approach, in which you create one classifier for each set you want to partition into, and select the set with the highest probability.
For example, say you want to classify objects with a label from {1,2,3}. Then you can create three binary classifiers:
C1 = 1 or (not 1)
C2 = 2 or (not 2)
C3 = 3 or (not 3)
If you run these classifiers on a new piece of data X, then they might return:
C1(X) = 31.6% chance of being in 1
C2(X) = 63.3% chance of being in 2
C3(X) = 89.3% chance of being in 3
Based on these outputs, you could classify X as most likely being from class 3. (The probabilities don't add up to 1 - that's because the classifiers don't know about each other).
If your output labels are ordered (with some kind of meaningful, rather than arbitrary ordering). For example, in finance you want to classify stocks into {BUY, SELL, HOLD}. Although you can't legitimately perform a regression on these (the data is ordinal rather than ratio data) you can assign the values of -1, 0 and 1 to SELL, HOLD and BUY and then pretend that you have ratio data. Sometimes this can give good results even though it's not theoretically justified.
Another approach is the Cramer-Singer method ("On the algorithmic implementation of multiclass kernel-based vector machines").
Svmlight implements it here: http://svmlight.joachims.org/svm_multiclass.html.
Classification into an infinite set (such as the set of integers) is called ordinal regression. Usually this is done by mapping a range of continuous values onto an element of the set. (see http://mlg.eng.cam.ac.uk/zoubin/papers/chu05a.pdf, Figure 1a)
Related
I have two text datasets. Each dataset consists of multiple sequences and each sequence can have more than one sentence.
How do I measure if both datasets are from same distribution?
The purpose is to verify transfer learning from one distribution to another only if the difference between the distributions is statistically significant.
I am panning to use chi-square test but not sure if it will help for text data considering the high degrees of freedom.
update:
Example:
Supppose I want to train a sentiment classification model. I train a model on IMDb dataset and evaluate on IMDb and Yelp datasets. I found that my model trained on IMDb still does well on Yelp. But the question is how different these datasets are?
Train Dataset : https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format?select=Train.csv
Eval 1: https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format?select=Valid.csv
Eval 2: https://www.kaggle.com/omkarsabnis/sentiment-analysis-on-the-yelp-reviews-dataset
Now,
How different are train and eval 1?
How different are train and eval 2?
Is the dissimilarity between train and eval 2 by chance ? What is the statistical significance and p value?
The question "are text A and text B coming from the same distribution?" is somehow poorly defined. For example, these two questions (1,2) can be viewed as generated from the same distribution (distribution of all questions on StackExchange) or from different distributions (distribution of two different subdomains of StackExchange). So it's not clear what is the property that you want to test.
Anyway, you can come up with any test statistic of your choice, approximate its distribution in case of "single source" by simulation, and calculate the p-value of your test.
As a toy example, let's take two small corpora: two random articles from English Wikipedia. I'll do it in Python
import requests
from bs4 import BeautifulSoup
urls = [
'https://en.wikipedia.org/wiki/Nanjing_(Liao_dynasty)',
'https://en.wikipedia.org/wiki/United_States_Passport_Card'
]
texts = [BeautifulSoup(requests.get(u).text).find('div', {'class': 'mw-parser-output'}).text for u in urls]
Now I use a primitive tokenizer to count individual words in texts, and use root mean squared difference in word relative frequencies as my test statistic. You can use any other statistic, as long as you calculate it consistently.
import re
from collections import Counter
from copy import deepcopy
TOKEN = re.compile(r'([^\W\d]+|\d+|[^\w\s])')
counters = [Counter(re.findall(TOKEN, t)) for t in texts]
print([sum(c.values()) for c in counters])
# [5068, 4053]: texts are of approximately the same size
def word_freq_rmse(c1, c2):
result = 0
vocab = set(c1.keys()).union(set(c2.keys()))
n1, n2 = sum(c1.values()), sum(c2.values())
n = len(vocab)
for word in vocab:
result += (c1[word]/n1 - c2[word]/n2)**2 / n
return result**0.5
print(word_freq_rmse(*counters))
# rmse is 0.001178, but is this a small or large difference?
I get a value of 0.001178, but I don't know whether it's a large difference. So I need to simulate the distribution of this test statistic under the null hypothesis: when both texts are from the same distribution. To simulate it, I merge two texts into one, and then split them randomly, and calculate my statistic when comparing these two random parts.
import random
tokens = [tok for t in texts for tok in re.findall(TOKEN, t)]
split = sum(counters[0].values())
distribution = []
for i in range(1000):
random.shuffle(tokens)
c1 = Counter(tokens[:split])
c2 = Counter(tokens[split:])
distribution.append(word_freq_rmse(c1, c2))
Now I can see how unusual is the value of my observed test statistic under the null hypothesis:
observed = word_freq_rmse(*counters)
p_value = sum(x >= observed for x in distribution) / len(distribution)
print(p_value) # it is 0.0
print(observed, max(distribution), sum(distribution) / len(distribution)) # 0.0011 0.0006 0.0004
We see that when texts are from the same distribution, my test statistic is on average 0.0004 and almost never exceeds 0.0006, so the value of 0.0011 is very unusual, and the null hypothesis that two my texts originate from the same distribution should be rejected.
I wrote an article which is similar to your problem but not exactly the same.
https://towardsdatascience.com/a-new-way-to-bow-analysis-feature-engineering-part1-e012eba90ef
The problem that I was trying to solve is to check if a word has different (significant) distributions across categories or labels.
There are a few similarities between your problem and the one I had mentioned above.
You want to compare two sources of datasets, which can be taken as two different categories
Also, to compare the data sources, you will have to compare the words as sentences can't be directly compared
So, my proposed solution to this will be as:
Create words features across the two datasets using count-vectorizer and get top X words from each
Let's say you have total distinct words as N, now initialize count=0 and start to compare the distribution for each word and if the differences are significant increment the counter. Also, there could be cases where a word only exists in one of the datasets and that is a good new, by that I mean it shows that it is a distinguishing feature, so, for this also increment the count
Let's say the total count is n. Now, the lower is the n/N ratio, similar two texts are and vice-a-versa
Also, to verify this methodology - Split the data from a single source into two (random sampling) and run the above analysis, if the n/N ratio is closer to 0 which indicates that the two data sources are similar which also is the case.
Please let me know if this approach worked or not, also if you think there are any flaws in this, I would love to think and try evolving it.
I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar then the words should contain quite some overlap. However, a cosine theta score of 0.54 does not mean "related" or "not related". I should end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I should find the optimal parameters to give to such functions and I do not know if these functions are satisfactory solutions. I was thinking that I have the cosine theta score, I can calculate the percentage of overlap between two sentences two (e.g. the amount of overlapping words divided by the amount of words in the article) and maybe some more interesting things. Then with the data, I could maybe write a function (what type of function I do not know and is part of the question!), after which I can minimize the error via the SciPy library. This means that I should do some sort of supervised learning, and I am willing to label article pairs with labels (0/1) in order to train a network. Is this worth the effort?
# Count words of two strings.
v1, v2 = self.word_count(s1), self.word_count(s2)
# Calculate the intersection of the words in both strings.
v3 = set(v1.keys()) & set(v2.keys())
# Calculate some sort of ratio between the overlap and the
# article length (since 1 overlapping word on 2 words is more important
# then 4 overlapping words on articles of 492 words).
p = min(len(v1), len(v2)) / len(v3)
numerator = sum([v1[w] * v2[w] for w in v3])
w1 = sum([v1[w]**2 for w in v1.keys()])
w2 = sum([v2[w]**2 for w in v2.keys()])
denominator = math.sqrt(w1) * math.sqrt(w2)
# Calculate the cosine similarity
if not denominator:
return 0.0
else:
return (float(numerator) / denominator)
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. It is up to you to choose how the overlap affects whether or not two strings are "matching" unless you have a labelled data set. If you have a labelled data set (I.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or SVM due to the potentially high dimensional, categorical nature of your problem.
Even the optimisation, however, is a subjective measure. For example, in theory let's pretend you have a model which out of 100 samples only predicts 1 answer (Giving 99 unknowns). Technically if that one answer is correct, that is a model with 100% accuracy, but which has a very low recall. Generally in machine learning you will find a trade off between recall and accuracy.
Some people like to go for certain metrics which combine the two (The most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about accuracy - I would only want to target consumers who are likely to buy my product. If however, we are looking to test for a deadly disease or markers for bank fraud, then it's feasible for that test to be accurate only 10% of the time - if its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is just to define some cut off value which you believe indicates a good match. This is would then be more analogous to a binary clustering problem, and you could use some more abstract measure such as distance to a centroid to test which cluster (Either the "related" or "unrelated" cluster) the point belongs to. Note however that here your features feel like they would be incredibly hard to define.
Data: When I have N rows of data like this: (x,y,z) where logically f(x,y)=z, that is z is dependent on x and y, like in my case (setting1, setting2 ,signal) . Different x's and y's can lead to the same z, but the z's wouldn't mean the same thing.
There are 30 unique setting1, 30 setting2 and 1 signal for each (setting1, setting2)-pairing, hence 900 signal values.
Data set: These [900,3] data points are considered 1 data set. I have many samples of these data sets.
I want to make a classification based on these data sets, but I need to flatten the data (make them all into one row). If I flatten it, I will duplicate all the setting values (setting1 and setting2) 30 times, i.e. I will have a row with 3x900 columns.
Question:
Is it correct to keep all the duplicate setting1,setting2 values in the data set? Or should I remove them and only include the unique values a single time?, i.e. have a row with 30 + 30 + 900 columns. I'm worried, that the logical dependency of the signal to the settings will be lost this way. Is this relevant? Or shouldn't I bother including the settings at all (e.g. due to correlations)?
If I understand correctly, you are training NN on a sample where each observation is [900,3].
You are flatning it and getting an input layer of 3*900.
Some of those values are a result of a function on others.
It is important which function, as if it is a liniar function, NN might not work:
From here:
"If inputs are linearly dependent then you are in effect introducing
the same variable as multiple inputs. By doing so you've introduced a
new problem for the network, finding the dependency so that the
duplicated inputs are treated as a single input and a single new
dimension in the data. For some dependencies, finding appropriate
weights for the duplicate inputs is not possible."
Also, if you add dependent variables you risk the NN being biased towards said variables.
E.g. If you are running LMS on [x1,x2,x3,average(x1,x2)] to predict y, you basically assign a higher weight to the x1 and x2 variables.
Unless you have a reason to believe that those weights should be higher, don't include their function.
I was not able to find any link to support, but my intuition is that you might want to decrease your input layer in addition to omitting the dependent values:
From professor A. Ng's ML Course I remember that the input should be the minimum amount of values that are 'reasonable' to make the prediction.
Reasonable is vague, but I understand it so: If you try to predict the price of a house include footage, area quality, distance from major hub, do not include average sun spot activity during the open home day even though you got that data.
I would remove the duplicates, I would also look for any other data that can be omitted, maybe run PCA over the full set of Nx[3,900].
I am working on Classification using Random Forest algorithm in Spark have a sample dataset that looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
Here the last value in each row will serve as a label and rest serve as features. But I want to treat label as a category and not a number. So 165.22352099 will denote a category and so will -552.8234. For this I have encoded my features as well as label into categorical data. Now what I am having difficulty in is deciding what should I pass for numClasses parameter in Random Forest algorithm in Spark MlLib? I mean should it be equal to number of unique values in my label? My label has like 10000 unique values so if I put 10000 as value of numClasses then wouldn't it decrease the performance dramatically?
Here is the typical signature of building a model for Random Forest in MlLib:
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
numTrees=3, featureSubsetStrategy="auto",
impurity='gini', maxDepth=4, maxBins=32)
The confusion comes from the fact that you are doing something that you should not do. You problem is clearly a regression/ranking, not a classification. Why would you think about it as a classification? Try to answer these two questions:
Do you have at least 100 samples per each value (100,000 * 100 = 1,000,000)?
Is there completely no structure in the classes, so for example - are objects with value "200" not more similar to those with value "100" or "300" than to those with value "-1000" or "+2300"?
If at least one answer is no, then you should not treat this as a classification problem.
If for some weird reason you answered twice yes, then the answer is: "yes, you should encode each distinct value as a different class" thus leading to 10000 unique classes, which leads to:
extremely imbalanced classification (RF, without balancing meta-learner will nearly always fail in such scenario)
extreme number of classes (there are no models able to solve it, for sure RF will not solve it)
extremely small dimension of the problem- looking at as small is your number of features I would be surprised if you could predict from that binary classifiaction. As you can see how irregular are these values, you have 3 points which only diverge in first value and you get completely different results:
Level1,Male,New York,New York,352.888890
Level2,Male,New York,New York,-495.8001345
Level3,Male,New York,New York,495.8
So to sum up, with nearly 100% certainty this is not a classification problem, you should either:
regress on last value (keyword: reggresion)
build a ranking (keyword: learn to rank)
bucket your values to at most 10 different values and then - classify (keywords: imbalanced classification, sparse binary representation)
Any idea on the recommended parameters for OpenCV RTrees? I have read the documentation and I'm trying to apply it to MNIST dataset, i.e. 60000 training images, with 10000 testing images. I'm trying to optimize MaxDepth, MinSampleCount, setMaxCategories, and setPriors? e.g.
Ptr<RTrees> model = RTrees::create();
/* Depth of the tree.
A low value will likely underfit and conversely
a high value will likely overfit.
The optimal value can be obtained using cross validation
or other suitable methods.
*/
model->setMaxDepth(?); // letter_recog.cpp uses 10
/* minimum samples required at a leaf node for it to be split.
A reasonable value is a small percentage of the total data e.g. 1%.
MNIST 70000 * 0.01 = 700
*/
model->setMinSampleCount(700?); letter_recog.cpp uses 10
/* regression_accuracy – Termination criteria for regression trees.
If all absolute differences between an estimated value in a node and
values of train samples in this node are less than this parameter
then the node will not be split. */
model->setRegressionAccuracy(0); // I think this is already correct
/*
use_surrogates – If true then surrogate splits will be built.
These splits allow to work with missing data and compute variable importance correctly.'
To compute variable importance correctly, the surrogate splits must be enabled in
the training parameters, even if there is no missing data.
*/
model->setUseSurrogates(true); // I think this is already correct
/*
Cluster possible values of a categorical variable into K \leq max_categories clusters
to find a suboptimal split. If a discrete variable, on which the training procedure
tries to make a split, takes more than max_categories values, the precise best subset
estimation may take a very long time because the algorithm is exponential.
Instead, many decision trees engines (including ML) try to find sub-optimal split
in this case by clustering all the samples into max_categories clusters that is
some categories are merged together. The clustering is applied only in n>2-class
classification problems for categorical variables with N > max_categories possible values.
In case of regression and 2-class classification the optimal split can be found
efficiently without employing clustering, thus the parameter is not used in these cases.
*/
model->setMaxCategories(?); letter_recog.cpp uses 15
/*
priors – The array of a priori class probabilities, sorted by the class label value.
The parameter can be used to tune the decision tree preferences toward a certain class.
For example, if you want to detect some rare anomaly occurrence, the training base will
likely contain much more normal cases than anomalies, so a very good classification
performance will be achieved just by considering every case as normal.
To avoid this, the priors can be specified, where the anomaly probability is
artificially increased (up to 0.5 or even greater), so the weight of the misclassified
anomalies becomes much bigger, and the tree is adjusted properly. You can also think about
this parameter as weights of prediction categories which determine relative weights that
you give to misclassification. That is, if the weight of the first category is 1 and
the weight of the second category is 10, then each mistake in predicting the
second category is equivalent to making 10 mistakes in predicting the first category.
*/
model->setPriors(Mat()); // ?
/* If true then variable importance will be calculated and
then it can be retrieved by CvRTrees::get_var_importance().
*/
model->setCalculateVarImportance(true); // I think this is already correct
/*
The size of the randomly selected subset of features at each tree node and
that are used to find the best split(s). If you set it to 0 then the size
will be set to the square root of the total number of features.
*/
model->setActiveVarCount(0); // I think this is already correct
/*
CV_TERMCRIT_ITER Terminate learning by the max_num_of_trees_in_the_forest;
CV_TERMCRIT_EPS Terminate learning by the forest_accuracy;
CV_TERMCRIT_ITER | CV_TERMCRIT_EPS Use both termination criteria.
*/
model->setTermCriteria(TC(100,0.01f)); // I think this is already correct