Store textual dataset for binary classification - machine-learning

I am currently working on a machine learning project, and am in the process of building the dataset. The dataset will be comprised of a number of different textual features, of varying length from 1 sentence to around 50 sentences(including punctuation). What is the best way to store this data to then pre-process and use for machine learning using python?

In most cases, you can use a method called Bag of Word, however, in some cases when you are performing more complicated task like similarity extraction or want to make comparison between sentences, you should use Word2Vec
Bag of Word
You may use the classical Bag-Of-Word representation, in which you encode each sample into a long vector indicating the count of all the words from all samples. For example, if you have two samples:
"I like apple, and she likes apple and banana.",
"I love dogs but Sara prefer cats.".
Then all the possible words are(order doesn't matter here):
I she Sara like likes love prefer and but apple banana dogs cats , .
Then the two samples will be encoded to
First: 1 1 0 1 1 0 0 2 0 2 1 0 0 1 1
Second: 1 0 1 0 0 1 1 0 1 0 0 1 1 0 1
If you are using sklearn, the task would be as simple as:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
# Now you can feed X into any other machine learning algorithms.
Word2Vec
Word2Vec is a more complicated method, which attempts to find the relationship between words by training a embedding neural network underneath. An embedding, in plain english, can be thought of the mathematical representation of a word, in the context of all the samples provided. The core idea is that words are similar if their contexts are similar.
The result of Word2Vec are the vector representation(embeddings) of all the words shown in all the samples. The amazing thing is that we can perform algorithmic operations on the vector. A cool example is: Queen - Woman + Man = King reference here
To use Word2Vec, we can use a package called gensim, here is a basic setup:
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
model.most_similar(positive=['woman', 'king'], negative=['man'])
>>> [('queen', 0.50882536), ...]
Here sentences is your data, size is the dimension of the embeddings, the larger size is, the more space is used to represent a word, and there is always overfitting we should think about. window is the size of the context we are cared about, it is the number of words before the target word we are looking at when we are predicting the target from its context, when training.

One common way is to create your dictionary(all the posible words) and then encode every of your examples in function of this dictonary, for example(this is a very small and limited dictionary just for example) you could have a dictionary : hello ,world, from, python . Every word will be associated to a position, and in every of your examples you define a vector with 0 for inexistence and 1 for existence, for example for the example "hello python" you would encode it as: 1,0,0,1

Related

String classification, how to encode character-by-character and train?

I am trying to build a classifier to classify some files into 150 categories based on the name of those files. Here are some examples of file names in my dataset (~700k files):
104932489 - urgent - contract validation for xyz limited.msg
treatment - an I l - contract n°4934283 received by partner.pdf
- invoice_8843238_1_europe services_business 8592342sid paris.xls
140159498736656.txt
140159498736843.txt
fsk_000000001090296_sdiselacrusefeyre_2000912.xls
fsk_000000001091293_lidlsnd1753mdeas_2009316.xls
You can see that the filenames can really be anything, but that however there is always some pattern that is respected for the same categories. It can be in the numbers (that are sometimes close), in the special characters (spaces, -, °), sometimes the length, etc.
Extracting all those patterns one by one will take ages because I have approximately 700k documents. Also, I am not interested in 100% accuracy, 70% can be good enough.
The real problem is that I don't know how to encode this data. I have tried many methods:
Tokenizing character by character and feeding them to an LSTM model with an embedding layer. However, I wasn't able to implement it and got dimension errors.
Adapting Word2Vec to convert the characters into vectors. However, this automatically drops all punctuation and space characters, also, I lose the numeric data. Another problem is that it creates more useless dimensions: if the size is 20, I will have my data in 20 dimensions but if I look closely, there are always the same 150 vectors in those 20 dimensions so it's really useless. I could use a 2 dimensions size but still, I need the numeric data and the special characters.
Generating n-grams from each path, in the range 1-4, then using a CountVectorizer to compute the frequencies. I checked and special characters were not dropped but it gave me like 400,000 features! I am running a dimensionality reduction using UMAP (n_components=5, metric='hellinger') but the reduction runs for 2 hours and then the kernel crashes.
Any ideas?
I am currently also working on a character level lstm. And it works exactly the same like when you would use words. You need a vocabulary, for example a - z and then you just take the index of the letter as its integer representation. For example:
"bad" -> "b", "a", "d" -> [1, 0, 3]
Now you could create an embedding lookup table (for example using pytorchs nn.Embedding function). You just have to create a random vector for every index of your vocab. For example:
"a" -> 0 > [-0.93, 0.024, -.0.73, ..., -0.12]
You said that you tried this but encountered dimension errors? Maybe show us the code!
Or you could create non-random embedding using word2vec using the Gensim libary:
from gensim.models import Word2Vec
# 'total_words' is a list containing every word of your dataset split into its characters
total_words = [...]
model = Word2Vec(total_words , min_count=1, size=32)
model.save(save_model_file)
# lets test it for the character 'a'
embedder = Word2Vec.load(save_model_file)
v = embedder["a"]
# v now will be a the embedding vector of a with size 32x1
I hope I could make clear how to create embeddings for characters.
You can treat characters in single-word-classification the exact same way you would treat words in sentence-classification.

How to verify if two text datasets are from different distribution?

I have two text datasets. Each dataset consists of multiple sequences and each sequence can have more than one sentence.
How do I measure if both datasets are from same distribution?
The purpose is to verify transfer learning from one distribution to another only if the difference between the distributions is statistically significant.
I am panning to use chi-square test but not sure if it will help for text data considering the high degrees of freedom.
update:
Example:
Supppose I want to train a sentiment classification model. I train a model on IMDb dataset and evaluate on IMDb and Yelp datasets. I found that my model trained on IMDb still does well on Yelp. But the question is how different these datasets are?
Train Dataset : https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format?select=Train.csv
Eval 1: https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format?select=Valid.csv
Eval 2: https://www.kaggle.com/omkarsabnis/sentiment-analysis-on-the-yelp-reviews-dataset
Now,
How different are train and eval 1?
How different are train and eval 2?
Is the dissimilarity between train and eval 2 by chance ? What is the statistical significance and p value?
The question "are text A and text B coming from the same distribution?" is somehow poorly defined. For example, these two questions (1,2) can be viewed as generated from the same distribution (distribution of all questions on StackExchange) or from different distributions (distribution of two different subdomains of StackExchange). So it's not clear what is the property that you want to test.
Anyway, you can come up with any test statistic of your choice, approximate its distribution in case of "single source" by simulation, and calculate the p-value of your test.
As a toy example, let's take two small corpora: two random articles from English Wikipedia. I'll do it in Python
import requests
from bs4 import BeautifulSoup
urls = [
'https://en.wikipedia.org/wiki/Nanjing_(Liao_dynasty)',
'https://en.wikipedia.org/wiki/United_States_Passport_Card'
]
texts = [BeautifulSoup(requests.get(u).text).find('div', {'class': 'mw-parser-output'}).text for u in urls]
Now I use a primitive tokenizer to count individual words in texts, and use root mean squared difference in word relative frequencies as my test statistic. You can use any other statistic, as long as you calculate it consistently.
import re
from collections import Counter
from copy import deepcopy
TOKEN = re.compile(r'([^\W\d]+|\d+|[^\w\s])')
counters = [Counter(re.findall(TOKEN, t)) for t in texts]
print([sum(c.values()) for c in counters])
# [5068, 4053]: texts are of approximately the same size
def word_freq_rmse(c1, c2):
result = 0
vocab = set(c1.keys()).union(set(c2.keys()))
n1, n2 = sum(c1.values()), sum(c2.values())
n = len(vocab)
for word in vocab:
result += (c1[word]/n1 - c2[word]/n2)**2 / n
return result**0.5
print(word_freq_rmse(*counters))
# rmse is 0.001178, but is this a small or large difference?
I get a value of 0.001178, but I don't know whether it's a large difference. So I need to simulate the distribution of this test statistic under the null hypothesis: when both texts are from the same distribution. To simulate it, I merge two texts into one, and then split them randomly, and calculate my statistic when comparing these two random parts.
import random
tokens = [tok for t in texts for tok in re.findall(TOKEN, t)]
split = sum(counters[0].values())
distribution = []
for i in range(1000):
random.shuffle(tokens)
c1 = Counter(tokens[:split])
c2 = Counter(tokens[split:])
distribution.append(word_freq_rmse(c1, c2))
Now I can see how unusual is the value of my observed test statistic under the null hypothesis:
observed = word_freq_rmse(*counters)
p_value = sum(x >= observed for x in distribution) / len(distribution)
print(p_value) # it is 0.0
print(observed, max(distribution), sum(distribution) / len(distribution)) # 0.0011 0.0006 0.0004
We see that when texts are from the same distribution, my test statistic is on average 0.0004 and almost never exceeds 0.0006, so the value of 0.0011 is very unusual, and the null hypothesis that two my texts originate from the same distribution should be rejected.
I wrote an article which is similar to your problem but not exactly the same.
https://towardsdatascience.com/a-new-way-to-bow-analysis-feature-engineering-part1-e012eba90ef
The problem that I was trying to solve is to check if a word has different (significant) distributions across categories or labels.
There are a few similarities between your problem and the one I had mentioned above.
You want to compare two sources of datasets, which can be taken as two different categories
Also, to compare the data sources, you will have to compare the words as sentences can't be directly compared
So, my proposed solution to this will be as:
Create words features across the two datasets using count-vectorizer and get top X words from each
Let's say you have total distinct words as N, now initialize count=0 and start to compare the distribution for each word and if the differences are significant increment the counter. Also, there could be cases where a word only exists in one of the datasets and that is a good new, by that I mean it shows that it is a distinguishing feature, so, for this also increment the count
Let's say the total count is n. Now, the lower is the n/N ratio, similar two texts are and vice-a-versa
Also, to verify this methodology - Split the data from a single source into two (random sampling) and run the above analysis, if the n/N ratio is closer to 0 which indicates that the two data sources are similar which also is the case.
Please let me know if this approach worked or not, also if you think there are any flaws in this, I would love to think and try evolving it.

Feedback in NaiveBayes Text Classification

I am a newbie in machine Learning, i am building a complaint categorizer and i want to provide a feedback model so that it can improve over time
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
value=[
'drought',
'robber',
]
targets=[
'water_department',
'police_department',
]
classifier = MultinomialNB()
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(value)
classifier.partial_fit(counts[:1], targets[:1],classes=numpy.unique(targets))
for c,t in zip(counts[1:],targets[1:]):
classifier.partial_fit(c, t.split())
value.append('dogs') #new value to train
targets.append('animal_department') #new target
vectorize = CountVectorizer()
counts = vectorize.fit_transform(value)
print counts
print targets
print vectorize.vocabulary_
####problem lies here
classifier.partial_fit(counts["""dont know the index of new value"""], targets[-1])
####problem lies here
Even if i somehow find the index of newly inserted value, it is giving the error
ValueError: Number of features 3 does not match previous data 2.
even thought i made it to insert one value at a time
I will try to answer the question from a general point of view. There are two sources of problem in the Naive Bayes (NB) approach described here:
Out-of-vocabulary (OOV) problem
Incremental training of NB
OOV problem: The simplest way to tackle the OOV problem is to decompose every word into character 3 grams. How many such 3-grams are possible? Assuming lower-casing there are only 26 possible ways to fill each place and hence the total number of possible character 3-grams is 26^3=17576, which is significantly lower than the number of possible English words that you're likely to see in text.
Hence, generally speaking, while training NB, a good idea is to use probabilities of character n-grams (n=3,4,5). This will drastically reduce the OOV problem.
Incremental training: For incremental training, given a new sentence decompose it into terms (character n-grams). Update the count of of each term for its corresponding observed class label. For example, if count(t,c) denotes how many times was the term t observed in class c, simply update the count if you see t in class 0 (or class 1) during incremental training. Updating the counts will update the maximum likelihood probability estimates as well.

How to tackle classification with string features?

I am working on a ad-click recommendation system in which I have to predict whether a user will click on a Advertisement. I have 98 features in total having both USER features and ADVERTISEMENT features. Some of the features which are very important for the prediction are having string values like this.
**FEATURE**
Inakdtive Kunmden
Stammkfunden
Stammkdunden
Stammkfunden
guteg Quartialskunden
gutes Quartialskunden
guteg Quartialskunden
gutes Quartialskunden
There are 14 different string value like this in whole data column. My model cannot take string values as input so I have to convert them to categorical int values. I have no idea how to do this and make these features useful. I am using K-MEANS CLUSTERING & RANDOMFOREST ALGORITHM.
Be careful in turning a list of string values into categorical ints as the model will likely interpret the integers as being numerically significant, but they probably are not.
For instance, if:
'Dog'=1,'Cat'=2,'Horse'=3,'Mouse'=4,'Human'=5
Then the distance metric in your clustering algorithm would think that humans are more like mice than they are like dogs. It is usually more useful to turn them into 14 binary values e.g.
Turn this:
'Dog'
'Cat'
'Human'
'Mouse'
'Dog'
Into this:
'Dog' 'Cat' 'Mouse' 'Human'
1 0 0 0
0 1 0 0
0 0 0 1
0 0 1 0
1 0 0 0
Not this:
'Species'
1
2
5
4
1
However, if the data are going to be the 'targets' that you are classifying and not the data 'features', you can leave them as ints in most multi-classification algorithms in SciKit-Learn.
I like user1745038's answer and it should give you reasonably good results. However, if you want to extract more meaningful features out of your strings (specially if the number of strings increases significantly), consider using some NLP techniques. For example, 'Dog' and 'Cat' are more similar than 'Dog' and 'Mouse'.
Good luck

SVM Classifying binary data DNA

I'm working SVM in R software and I would appreaciate any input you may provide.
I have a data set that I need to train with SVM, the format of the data is the following
ToPredict Data1 Data2 Data3 Data4 DNA
S 1 12 1 11 000000000100
B -1 17 14 3 11011110111110111
S 1 4 0 4 0000
The question that I have is regarding the DNA column.
SVM is able to get an input like DNA and still calculate reliable predictions?
For my data set, 0≠00 or 1≠001 therefore, it cannot be taken as integers.Every value represents information that needs to be processed and the order is very important, it's a string of binary values, either is 1 or 0.
The 0101 information could be displayed as ABAB etc. (A=0, B=1)
How can I train a SVM with the data above?
Thank you.
For SVMs to work, "all" you need to have a kernel function.
So what is a sensible kernel function for your "DNA strings"? You probably don't need to be able to prove it is a proper kernel, but you can get away with a good similarity measure.
How would you evaluate similarity of your sequences? I cannot help you on that, because I don't know what the data means; this is up to the user (i.e. you) to specify.

Resources