How to verify if two text datasets are from different distribution? - machine-learning

I have two text datasets. Each dataset consists of multiple sequences and each sequence can have more than one sentence.
How do I measure if both datasets are from same distribution?
The purpose is to verify transfer learning from one distribution to another only if the difference between the distributions is statistically significant.
I am planning to use a chi-square test, but I am not sure it will help for text data, considering the high degrees of freedom.
update:
Example:
Suppose I want to train a sentiment classification model. I train a model on the IMDb dataset and evaluate it on the IMDb and Yelp datasets. I found that my model trained on IMDb still does well on Yelp. But the question is: how different are these datasets?
Train Dataset : https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format?select=Train.csv
Eval 1: https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format?select=Valid.csv
Eval 2: https://www.kaggle.com/omkarsabnis/sentiment-analysis-on-the-yelp-reviews-dataset
Now,
How different are train and eval 1?
How different are train and eval 2?
Is the dissimilarity between train and eval 2 due to chance? What is the statistical significance and p-value?

The question "are text A and text B coming from the same distribution?" is somewhat poorly defined. For example, these two questions (1,2) can be viewed as generated from the same distribution (the distribution of all questions on StackExchange) or from different distributions (the distributions of two different subdomains of StackExchange). So it's not clear which property you want to test.
Anyway, you can come up with any test statistic of your choice, approximate its distribution in case of "single source" by simulation, and calculate the p-value of your test.
As a toy example, let's take two small corpora: two random articles from English Wikipedia. I'll do it in Python
import requests
from bs4 import BeautifulSoup
urls = [
'https://en.wikipedia.org/wiki/Nanjing_(Liao_dynasty)',
'https://en.wikipedia.org/wiki/United_States_Passport_Card'
]
# name the parser explicitly so BeautifulSoup does not have to guess one
texts = [BeautifulSoup(requests.get(u).text, 'html.parser').find('div', {'class': 'mw-parser-output'}).text for u in urls]
Now I use a primitive tokenizer to count individual words in texts, and use root mean squared difference in word relative frequencies as my test statistic. You can use any other statistic, as long as you calculate it consistently.
import re
from collections import Counter
from copy import deepcopy
TOKEN = re.compile(r'([^\W\d]+|\d+|[^\w\s])')
counters = [Counter(re.findall(TOKEN, t)) for t in texts]
print([sum(c.values()) for c in counters])
# [5068, 4053]: texts are of approximately the same size
def word_freq_rmse(c1, c2):
    result = 0
    vocab = set(c1.keys()).union(set(c2.keys()))
    n1, n2 = sum(c1.values()), sum(c2.values())
    n = len(vocab)
    for word in vocab:
        result += (c1[word]/n1 - c2[word]/n2)**2 / n
    return result**0.5
print(word_freq_rmse(*counters))
# rmse is 0.001178, but is this a small or large difference?
I get a value of 0.001178, but I don't know whether it's a large difference. So I need to simulate the distribution of this test statistic under the null hypothesis: when both texts are from the same distribution. To simulate it, I merge two texts into one, and then split them randomly, and calculate my statistic when comparing these two random parts.
import random
tokens = [tok for t in texts for tok in re.findall(TOKEN, t)]
split = sum(counters[0].values())
distribution = []
for i in range(1000):
    random.shuffle(tokens)
    c1 = Counter(tokens[:split])
    c2 = Counter(tokens[split:])
    distribution.append(word_freq_rmse(c1, c2))
Now I can see how unusual the value of my observed test statistic is under the null hypothesis:
observed = word_freq_rmse(*counters)
p_value = sum(x >= observed for x in distribution) / len(distribution)
print(p_value) # it is 0.0
print(observed, max(distribution), sum(distribution) / len(distribution)) # 0.0011 0.0006 0.0004
We see that when the texts are from the same distribution, my test statistic is on average 0.0004 and almost never exceeds 0.0006, so the value of 0.0011 is very unusual, and the null hypothesis that my two texts originate from the same distribution should be rejected.
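One caveat about a printed p-value of 0.0: with 1000 random splits, the simulation can only show that the p-value is below roughly 1/1000, not that it is exactly zero. A common convention (an addition here, not part of the code above) is to count the observed value as one of the permutations, which keeps the estimate strictly positive. A minimal sketch, where `permutation_p_value` is a hypothetical helper name and the numbers are illustrative only:

```python
def permutation_p_value(observed, null_samples):
    """Estimate a one-sided p-value from simulated null statistics.

    Adds 1 to numerator and denominator so the estimate is never exactly 0.
    """
    exceed = sum(x >= observed for x in null_samples)
    return (exceed + 1) / (len(null_samples) + 1)


# Illustrative numbers only: 1000 null values, none reaching the observed one.
null_samples = [0.0004] * 1000
p = permutation_p_value(0.0011, null_samples)
print(p)  # 1/1001, i.e. just under 0.001
```

With this convention the result would be reported as "p is about 0.001" rather than "p = 0", which better reflects the resolution of 1000 permutations.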

I wrote an article which is similar to your problem but not exactly the same.
https://towardsdatascience.com/a-new-way-to-bow-analysis-feature-engineering-part1-e012eba90ef
The problem that I was trying to solve was to check whether a word has significantly different distributions across categories or labels.
There are a few similarities between your problem and the one mentioned above.
You want to compare two sources of data, which can be treated as two different categories
Also, to compare the data sources, you will have to compare words, as sentences can't be compared directly
So, my proposed solution is as follows:
Create word features across the two datasets using a count vectorizer and get the top X words from each
Let's say you have N distinct words in total. Initialize count=0 and compare the distribution of each word across the two datasets; if the difference is significant, increment the counter. There may also be cases where a word exists in only one of the datasets. That is good news, in the sense that it is a distinguishing feature, so increment the count for it as well
Let's say the total count is n. The lower the n/N ratio, the more similar the two texts are, and vice versa
Also, to verify this methodology, split the data from a single source into two parts (random sampling) and run the above analysis. The n/N ratio should come out close to 0, indicating that the two parts are similar, which is indeed the case.
Please let me know whether this approach works for you. If you think there are any flaws in it, I would love to think it over and try to improve it.
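The n/N procedure above can be sketched with the standard library alone. This is one possible reading, not the method from the linked article: the per-word test is assumed to be a 2x2 chi-square test at the 5% level, the names `word_chi2` and `distinguishing_ratio` are hypothetical, and words that occur in only one dataset are tested like any other word instead of being counted automatically. No multiple-testing correction is applied, so a few words may be flagged by chance even for a single source.

```python
from collections import Counter

CHI2_CRIT_DF1_05 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05


def word_chi2(k1, n1, k2, n2):
    """Chi-square statistic of the 2x2 table (this word vs. other words) x (dataset 1 vs. 2)."""
    a, b = k1, n1 - k1
    c, d = k2, n2 - k2
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return (n1 + n2) * (a * d - b * c) ** 2 / denom


def distinguishing_ratio(tokens1, tokens2):
    """Return n/N: the share of distinct words whose frequency differs significantly."""
    c1, c2 = Counter(tokens1), Counter(tokens2)
    n1, n2 = len(tokens1), len(tokens2)
    vocab = set(c1) | set(c2)
    if not vocab:
        return 0.0
    n_significant = sum(
        word_chi2(c1[w], n1, c2[w], n2) > CHI2_CRIT_DF1_05 for w in vocab
    )
    return n_significant / len(vocab)


# Tiny made-up corpora: "good" is shared, "movie" and "food" are source-specific.
tokens_a = ["good"] * 50 + ["movie"] * 50
tokens_b = ["good"] * 50 + ["food"] * 50
print(distinguishing_ratio(tokens_a, tokens_b))  # 2 of 3 words differ
print(distinguishing_ratio(tokens_a, tokens_a))  # identical sources give 0.0
```

The chi-square approximation is unreliable for very rare words, which is one reason to restrict the comparison to the top X words as suggested above.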

Related

Why doc2vec is giving different and un-reliable results?

I have a set of 20 small documents which talk about a particular kind of issue (training data). Now I want to identify, out of 10K documents, those which talk about the same issue.
For this purpose I am using the doc2vec implementation:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
# Tokenize_and_stem is creating the tokens and stemming and returning the list
# documents_prb store the list of 20 docs
tagged_data = [TaggedDocument(words=tokenize_and_stem(_d.lower()), tags=[str(i)]) for i, _d in enumerate(documents_prb)]
max_epochs = 20
vec_size = 20
alpha = 0.025
model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
model.save("d2v.model")
print("Model Saved")
model= Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
def doc2vec_score(s):
    s_list = tokenize_and_stem(s)
    v1 = model.infer_vector(s_list)
    similar_doc = model.docvecs.most_similar([v1])
    original_match = (X[int(similar_doc[0][0])])
    score = similar_doc[0][1]
    match = similar_doc[0][0]
    return score,match
final_data = []
# df_ws is the list of 10K docs for which i want to find the similarity with above 20 docs
for index, row in df_ws.iterrows():
    print(row['processed_description'])
    data = (doc2vec_score(row['processed_description']))
    L1=list(data)
    L1.append(row['Number'])
    final_data.append(L1)
with open('file_cosine_d2v.csv','w',newline='') as out:
    csv_out=csv.writer(out)
    csv_out.writerow(['score','match','INC_NUMBER'])
    for row in final_data:
        csv_out.writerow(row)
But I am facing a strange issue: the results are highly unreliable (the score is 0.9 even when there is not the slightest match), and the score changes by a large margin every time I run the doc2vec_score function. Can someone please help me understand what is wrong here?
First & foremost, try not using the anti-pattern of calling train multiple times in your own loop.
See this answer for more details: My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?
If there's still a problem after that fix, edit your question to show the corrected code, and a more clear example of the output you consider unreliable.
For example, show the actual doc-IDs & scores, and explain why you think the probe document you're testing should be "not a slightest match" for any documents returned.
And note that if a document is truly nothing like the training documents, for example by using words that weren't in the training documents, it's not really possible for a Doc2Vec model to detect that. When it infers vectors for new documents, all unknown words are ignored. So you'll be left with a document using only known words, and it will return the best matches for that subset of your document's words.
More fundamentally, a Doc2Vec model is really only learning ways to contrast the documents that are in the universe demonstrated by the training set, by their words' cooccurrences. If presented with a document with either totally different words, or words whose frequencies/cooccurrences are totally unlike anything seen before, its output will be essentially random, without much meaningful relationship to other more-typical documents. (That'll be maybe-close, maybe-far, because in a way the training on the 'known universe' tends to fill the whole available space.)
So, you wouldn't want to use a Doc2Vec model trained only on positive examples of what you want to recognize, if you also want to recognize negative examples. Rather, include all kinds, then remember the subset that's relevant for certain in/out decisions, and use that subset for downstream comparisons, or multiple subsets to feed a more-formal classification or clustering algorithm.
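The point about unknown words can be illustrated without gensim. The sketch below is not gensim's code; it only mimics the described effect with a hypothetical `known_words` helper and a made-up vocabulary, showing how little of a probe document may be left once unknown words are dropped:

```python
def known_words(tokens, vocabulary):
    """Keep only tokens seen during training; the rest would be ignored at inference."""
    return [tok for tok in tokens if tok in vocabulary]


# Hypothetical vocabulary built from the training documents.
vocabulary = {"printer", "driver", "install", "error", "the", "on"}

probe = "the coffee machine on floor three leaks".split()
kept = known_words(probe, vocabulary)
coverage = len(kept) / len(probe)

print(kept)      # ['the', 'on']
print(coverage)  # share of the probe document that remains
```

When the coverage is this low, a high similarity score says more about the few surviving filler words than about the document itself, so checking coverage before trusting a match can be a cheap sanity test.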

How to squish a continuous cosine-theta score to a discrete (0/1) output?

I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar then the words should contain quite some overlap. However, a cosine theta score of 0.54 does not mean "related" or "not related". I should end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I would have to find the optimal parameters to give to such functions, and I do not know whether these functions are satisfactory solutions. I was thinking that, besides the cosine theta score, I can calculate the percentage of overlap between two articles (e.g. the number of overlapping words divided by the number of words in the article) and maybe some more interesting things. Then with that data, I could maybe write a function (what type of function I do not know, and that is part of the question!), after which I can minimize the error via the SciPy library. This means that I should do some sort of supervised learning, and I am willing to label article pairs with labels (0/1) in order to train a network. Is this worth the effort?
# Count words of two strings.
v1, v2 = self.word_count(s1), self.word_count(s2)
# Calculate the intersection of the words in both strings.
v3 = set(v1.keys()) & set(v2.keys())
# Calculate some sort of ratio between the overlap and the
# article length (since 1 overlapping word on 2 words is more important
# than 4 overlapping words on articles of 492 words).
p = min(len(v1), len(v2)) / len(v3)
numerator = sum([v1[w] * v2[w] for w in v3])
w1 = sum([v1[w]**2 for w in v1.keys()])
w2 = sum([v2[w]**2 for w in v2.keys()])
denominator = math.sqrt(w1) * math.sqrt(w2)
# Calculate the cosine similarity
if not denominator:
    return 0.0
else:
    return (float(numerator) / denominator)
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. It is up to you to choose how the overlap affects whether or not two strings are "matching" unless you have a labelled data set. If you have a labelled data set (i.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or SVM due to the potentially high dimensional, categorical nature of your problem.
Even the optimisation, however, is a subjective measure. For example, let's pretend you have a model which, out of 100 samples, gives an answer for only 1 (leaving 99 unknowns). Technically, if that one answer is correct, that is a model with 100% precision, but with very low recall. Generally in machine learning you will find a trade-off between recall and precision.
Some people like to go for metrics which combine the two (the most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about precision: I would only want to target consumers who are likely to buy my product. If, however, we are looking to test for a deadly disease or markers for bank fraud, then it's feasible for that test to be precise only 10% of the time, as long as its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is just to define some cut-off value which you believe indicates a good match. This would then be more analogous to a binary clustering problem, and you could use some more abstract measure such as distance to a centroid to test which cluster (either the "related" or "unrelated" cluster) the point belongs to. Note however that here your features feel like they would be incredibly hard to define.
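If a small labelled set is available, the simplest supervised option is to tune a single cut-off on the cosine score. The sketch below is one way to do that, assuming F1 is the metric you care about; the function names are hypothetical and the scores and labels are made up for illustration.

```python
def precision_recall_f1(scores, labels, threshold):
    """Treat score >= threshold as 'related' (1) and compare with the 0/1 labels."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold(scores, labels):
    """Try every observed score as a cut-off and keep the one with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: precision_recall_f1(scores, labels, t)[2])


# Made-up cosine scores with hand-assigned labels (1 = related, 0 = not related).
scores = [0.91, 0.75, 0.62, 0.54, 0.40, 0.22, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0]

cutoff = best_threshold(scores, labels)
print(cutoff, precision_recall_f1(scores, labels, cutoff))
```

Swapping the F1 key for precision or recall changes which cut-off wins, which is exactly the application-dependent choice described above. With more features than the cosine score alone, a proper classifier would replace the single threshold.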

Feedback in NaiveBayes Text Classification

I am a newbie in machine learning. I am building a complaint categorizer and I want to provide a feedback model so that it can improve over time.
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
value=[
'drought',
'robber',
]
targets=[
'water_department',
'police_department',
]
classifier = MultinomialNB()
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(value)
classifier.partial_fit(counts[:1], targets[:1],classes=numpy.unique(targets))
for c,t in zip(counts[1:],targets[1:]):
    classifier.partial_fit(c, t.split())
value.append('dogs') #new value to train
targets.append('animal_department') #new target
vectorize = CountVectorizer()
counts = vectorize.fit_transform(value)
print(counts)
print(targets)
print(vectorize.vocabulary_)
####problem lies here
classifier.partial_fit(counts["""dont know the index of new value"""], targets[-1])
####problem lies here
Even if I somehow find the index of the newly inserted value, it gives the error
ValueError: Number of features 3 does not match previous data 2.
even though I made it insert one value at a time.
I will try to answer the question from a general point of view. There are two sources of problem in the Naive Bayes (NB) approach described here:
Out-of-vocabulary (OOV) problem
Incremental training of NB
OOV problem: The simplest way to tackle the OOV problem is to decompose every word into character 3-grams. How many such 3-grams are possible? Assuming lower-casing, there are only 26 possible ways to fill each place, and hence the total number of possible character 3-grams is 26^3=17576, which is significantly lower than the number of possible English words that you're likely to see in text.
Hence, generally speaking, while training NB, a good idea is to use probabilities of character n-grams (n=3,4,5). This will drastically reduce the OOV problem.
Incremental training: For incremental training, given a new sentence, decompose it into terms (character n-grams). Update the count of each term for its corresponding observed class label. For example, if count(t,c) denotes how many times the term t was observed in class c, simply update the count if you see t in class 0 (or class 1) during incremental training. Updating the counts will update the maximum likelihood probability estimates as well.
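A minimal sketch of both ideas together, using only the standard library rather than scikit-learn. The class name `IncrementalNB` and its methods are hypothetical, and add-one smoothing is an assumption added here so that unseen 3-grams do not produce zero probabilities; the answer itself only mentions maximum likelihood counts.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split lower-cased text into overlapping character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class IncrementalNB:
    """Multinomial Naive Bayes over character n-grams, trained one example at a time."""

    def __init__(self, n=3):
        self.n = n
        self.term_counts = defaultdict(Counter)  # count(t, c)
        self.class_counts = Counter()            # number of examples seen per class
        self.vocab = set()

    def update(self, text, label):
        """Incremental step: bump count(t, c) for every term of the new example."""
        terms = char_ngrams(text, self.n)
        self.term_counts[label].update(terms)
        self.class_counts[label] += 1
        self.vocab.update(terms)

    def predict(self, text):
        """Return the class with the highest add-one-smoothed log probability."""
        terms = char_ngrams(text, self.n)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in terms)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = IncrementalNB()
clf.update("drought", "water_department")
clf.update("robber", "police_department")
clf.update("dogs", "animal_department")  # a new class arrives later, no refit needed
print(clf.predict("robbery"))  # police_department
print(clf.predict("dog"))      # animal_department
```

Because the model is just a table of counts, neither a new word nor a new class changes the shape of anything already stored, which is what avoids the feature-count mismatch from refitting a vectorizer.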

How to decide numClasses parameter to be passed to Random Forest algorithm in Spark MLlib with PySpark

I am working on classification using the Random Forest algorithm in Spark and have a sample dataset that looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
Here the last value in each row serves as the label and the rest serve as features. But I want to treat the label as a category and not a number. So 165.22352099 will denote a category, and so will -552.8234. For this I have encoded my features as well as the label into categorical data. What I am having difficulty with is deciding what I should pass for the numClasses parameter of the Random Forest algorithm in Spark MLlib. Should it be equal to the number of unique values in my label? My label has about 10000 unique values, so if I set numClasses to 10000, wouldn't that decrease the performance dramatically?
Here is the typical signature for building a Random Forest model in MLlib:
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
numTrees=3, featureSubsetStrategy="auto",
impurity='gini', maxDepth=4, maxBins=32)
The confusion comes from the fact that you are doing something that you should not do. Your problem is clearly a regression/ranking problem, not a classification one. Why would you think of it as classification? Try to answer these two questions:
Do you have at least 100 samples for each value (10,000 * 100 = 1,000,000)?
Is there completely no structure in the classes? For example, are objects with value "200" no more similar to those with value "100" or "300" than to those with value "-1000" or "+2300"?
If at least one answer is no, then you should not treat this as a classification problem.
If for some weird reason you answered yes twice, then the answer is: "yes, you should encode each distinct value as a different class", giving 10000 unique classes, which leads to:
extremely imbalanced classification (RF without a balancing meta-learner will nearly always fail in such a scenario)
an extreme number of classes (there are no models able to solve it; RF certainly will not)
an extremely small dimension of the problem: given how few features you have, I would be surprised if you could predict even a binary classification from them. You can see how irregular these values are: you have 3 points which differ only in the first value, yet they get completely different results:
Level1,Male,New York,New York,352.888890
Level2,Male,New York,New York,-495.8001345
Level3,Male,New York,New York,495.8
So to sum up, with nearly 100% certainty this is not a classification problem, you should either:
regress on the last value (keyword: regression)
build a ranking (keyword: learn to rank)
bucket your values to at most 10 different values and then - classify (keywords: imbalanced classification, sparse binary representation)
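The bucketing option can be sketched as follows. This is only an illustration with hypothetical helper names; quantile-based cut points are an assumption chosen to keep the buckets roughly balanced, and 3 buckets are used because the sample has only 10 rows. The number of buckets would then be the value to pass as numClasses, since MLlib expects class labels 0, 1, ..., numClasses-1.

```python
from bisect import bisect_right


def quantile_edges(values, n_buckets=10):
    """Pick up to n_buckets - 1 cut points so each bucket holds a similar number of values."""
    ordered = sorted(values)
    edges = []
    for k in range(1, n_buckets):
        edge = ordered[k * len(ordered) // n_buckets]
        if not edges or edge > edges[-1]:  # skip duplicate cut points
            edges.append(edge)
    return edges


def to_bucket(value, edges):
    """Map a numeric label to a bucket index in 0..len(edges)."""
    return bisect_right(edges, value)


# Label column from the sample rows in the question.
labels = [352.888890, 495.8001345, -495.8001345, 165.22352099, 495.8,
          652.8, 495.8, 495.8001345, -552.8234, 7000]

edges = quantile_edges(labels, n_buckets=3)
classes = [to_bucket(v, edges) for v in labels]
num_classes = len(edges) + 1  # candidate value for numClasses
print(edges, classes, num_classes)
```

Note that the cut points should be computed on the training split only and then reused unchanged for evaluation data.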

How are binary classifiers generalised to classify data into arbitrarily large sets?

How can algorithms which partition a space into halves, such as Support Vector Machines, be generalised to label data with labels from sets such as the integers?
For example, a support vector machine operates by constructing a hyperplane and then things 'above' the hyperplane take one label, and things below it take the other label.
How does this get generalised so that the labels are, for example, integers, or some other arbitrarily large set?
One option is the 'one-vs-all' approach, in which you create one classifier for each set you want to partition into, and select the set with the highest probability.
For example, say you want to classify objects with a label from {1,2,3}. Then you can create three binary classifiers:
C1 = 1 or (not 1)
C2 = 2 or (not 2)
C3 = 3 or (not 3)
If you run these classifiers on a new piece of data X, then they might return:
C1(X) = 31.6% chance of being in 1
C2(X) = 63.3% chance of being in 2
C3(X) = 89.3% chance of being in 3
Based on these outputs, you could classify X as most likely being from class 3. (The probabilities don't add up to 1 - that's because the classifiers don't know about each other).
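The selection step is just an argmax over the per-class scores. A minimal sketch, where `one_vs_all_predict` is a hypothetical name and the three lambdas are stand-ins for trained binary classifiers that return the probabilities from the example above:

```python
def one_vs_all_predict(x, classifiers):
    """Run every binary scorer on x and return the label with the highest score."""
    scores = {label: score_fn(x) for label, score_fn in classifiers.items()}
    return max(scores, key=scores.get), scores


# Stand-in scorers that return the probabilities from the example above.
classifiers = {
    1: lambda x: 0.316,
    2: lambda x: 0.633,
    3: lambda x: 0.893,
}

label, scores = one_vs_all_predict("X", classifiers)
print(label)  # 3
```

For an SVM, the score would typically be the signed distance to each hyperplane rather than a probability; the argmax works the same way.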
If your output labels are ordered (with a meaningful rather than arbitrary ordering), you have another option. For example, in finance you might want to classify stocks into {BUY, SELL, HOLD}. Although you can't legitimately perform a regression on these (the data is ordinal rather than ratio data), you can assign the values -1, 0 and 1 to SELL, HOLD and BUY and then pretend that you have ratio data. Sometimes this can give good results even though it's not theoretically justified.
Another approach is the Crammer-Singer method ("On the algorithmic implementation of multiclass kernel-based vector machines").
Svmlight implements it here: http://svmlight.joachims.org/svm_multiclass.html.
Classification into an infinite set (such as the set of integers) is called ordinal regression. Usually this is done by mapping a range of continuous values onto an element of the set. (see http://mlg.eng.cam.ac.uk/zoubin/papers/chu05a.pdf, Figure 1a)
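The mapping from a continuous value onto an ordered label can be sketched with a sorted list of thresholds. The helper name `to_ordinal` and the cut points are hypothetical; in a real ordinal regression model the thresholds would be learned along with the rest of the model.

```python
from bisect import bisect_right


def to_ordinal(latent_value, thresholds, labels):
    """Return the label whose interval contains latent_value.

    thresholds must be sorted and have exactly len(labels) - 1 entries.
    """
    if len(thresholds) != len(labels) - 1:
        raise ValueError("need one threshold between each pair of adjacent labels")
    return labels[bisect_right(thresholds, latent_value)]


labels = ["SELL", "HOLD", "BUY"]
thresholds = [-0.5, 0.5]  # hypothetical cut points on the continuous output

decisions = [to_ordinal(v, thresholds, labels) for v in (-0.9, 0.1, 0.7)]
print(decisions)  # ['SELL', 'HOLD', 'BUY']
```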

Resources