I have been reading more modern posts about sentiment classification (analysis) such as this.
Taking the IMDB dataset as an example, I find that I get a similar accuracy using Doc2Vec (88%), but a far better result using a simple tf-idf vectoriser with tri-grams for feature extraction (91%). I think this is similar to Table 2 in Mikolov's 2015 paper.
I thought that using a bigger dataset would change this, so I re-ran my experiment using a split of 1 million training and 1 million test reviews from here. Unfortunately, in that case my tf-idf vectoriser feature extraction method increased to 93% but Doc2Vec fell to 85%.
I was wondering if this is to be expected, and whether others find tf-idf to be superior to Doc2Vec even for a large corpus?
My data-cleaning is simple:
from bs4 import BeautifulSoup

def clean_review(review):
    # Strip HTML, pad punctuation with spaces, then lowercase and collapse whitespace
    temp = BeautifulSoup(review, "lxml").get_text()
    punctuation = """.,?!:;(){}[]"""
    for char in punctuation:
        temp = temp.replace(char, ' ' + char + ' ')
    words = " ".join(temp.lower().split()) + "\n"
    return words
And I have tried vector sizes of 400 and 1200 for the Doc2Vec model:
model = Doc2Vec(min_count=2, window=10, size=model_feat_size, sample=1e-4, negative=5, workers=cores)
Whereas my tfidf vectoriser has 40,000 max features:
vectorizer = TfidfVectorizer(max_features = 40000, ngram_range = (1, 3), sublinear_tf = True)
For classification I experimented with a few linear methods, but found simple logistic regression to do OK...
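For completeness, here is a minimal sketch of the tf-idf plus logistic regression pipeline I mean; the toy reviews and labels below are just placeholders, not my data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the cleaned reviews and 0/1 sentiment labels
train_texts = ["great film , loved it", "terrible plot , awful acting",
               "wonderful performances", "boring and far too long"]
train_labels = [1, 0, 1, 0]
test_texts = ["loved the performances", "awful , boring film"]

vectorizer = TfidfVectorizer(max_features=40000, ngram_range=(1, 3), sublinear_tf=True)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = LogisticRegression()
clf.fit(X_train, train_labels)
print(clf.predict(X_test))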
The example code Mikolov once posted (https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ) used options -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1 – which in gensim would be similar to dm=0, dbow_words=1, size=100, window=10, hs=0, negative=5, sample=1e-4, iter=20, min_count=1, workers=cores.
My hunch is that optimal values might involve a smaller window and higher min_count, and maybe a size somewhere between 100 and 400, but it's been a while since I've run those experiments.
It can also sometimes help a little to re-infer vectors against the final model, using a larger-than-default number of inference passes, rather than re-using the bulk-trained vectors; a sketch of both follows below. Still, these may just converge on similar performance to tf-idf – they're all dependent on the same word-features, and not very much data.
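As an illustration only, here is a rough gensim sketch of that DBOW setup plus re-inference. Parameter names follow older gensim (size, iter, steps; newer releases use vector_size and epochs instead), and the tiny token lists are just placeholders:

import multiprocessing
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

cores = multiprocessing.cpu_count()

# Toy stand-in for the tokenised training reviews
train_docs = [["great", "film", "loved", "it"],
              ["terrible", "plot", "awful", "acting"],
              ["wonderful", "performances", "throughout"],
              ["boring", "and", "far", "too", "long"]]
documents = [TaggedDocument(words, [i]) for i, words in enumerate(train_docs)]

# Pure DBOW with concurrent word training, roughly matching the quoted options
model = Doc2Vec(documents, dm=0, dbow_words=1, size=100, window=10,
                hs=0, negative=5, sample=1e-4, min_count=1,
                iter=20, workers=cores)

# Re-infer a vector with more inference passes than the default,
# instead of re-using the bulk-trained vector
vec = model.infer_vector(train_docs[0], steps=50)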
Going to a semi-supervised approach, where some of the document tags represent known sentiments, sometimes also helps.
Related
I am fairly new to machine learning and have been tasked with building a machine learning model to predict whether a review is good (1) or bad (0). I have already tried a RandomForestClassifier, which gave an accuracy of 50%. I switched to the Naive Bayes classifier but am still not getting any improvement, even after conducting a grid search.
My data looks as follows (I am happy to share the data with anyone):
Reviews Labels
0 For fans of Chris Farley, this is probably his... 1
1 Fantastic, Madonna at her finest, the film is ... 1
2 From a perspective that it is possible to make... 1
3 What is often neglected about Harold Lloyd is ... 1
4 You'll either love or hate movies such as this... 1
... ...
14995 This is perhaps the worst movie I have ever se... 0
14996 I was so looking forward to seeing this film t... 0
14997 It pains me to see an awesome movie turn into ... 0
14998 "Grande Ecole" is not an artful exploration of... 0
14999 I felt like I was watching an example of how n... 0
[15000 rows x 2 columns]
My code to preprocess the text and use TfidfVectorizer before training the classifier is as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# stopwords is defined earlier (an English stop-word list)
vect = TfidfVectorizer(stop_words=stopwords, max_features=5000)
X_train = vect.fit_transform(all_train_set['Reviews'])
y_train = all_train_set['Labels']

clf = MultinomialNB()
clf.fit(X_train, y_train)
X_test = vect.transform(all_test_set['Reviews'])
y_test = all_test_set['Labels']

print(classification_report(y_test, clf.predict(X_test), digits=4))
The results of the classification report seem to indicate that whilst one label is predicted very well, the other is extremely poor, bringing the whole thing down.
              precision    recall  f1-score   support

           0     0.5000    0.8546    0.6309      2482
           1     0.5000    0.1454    0.2253      2482

    accuracy                         0.5000      4964
   macro avg     0.5000    0.5000    0.4281      4964
weighted avg     0.5000    0.5000    0.4281      4964
I have now followed 8 different tutorials on this and tried each different way of coding it, but I can't seem to get above 50%, which makes me think it may be a problem with my features.
If anyone has any idea or suggestions, I'd greatly appreciate it.
EDIT:
Okay, so I have added several preprocessing steps, including removing HTML tags, removing punctuation and single letters, and removing multiple spaces, with the code below:
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

def preprocess_text(sen):
    # Removing html tags
    sentence = remove_tags(sen)
    # Remove punctuation and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence
I believe TfidfVectorizer automatically puts everything in lower case and lemmatizes it. The end result is still only 0.5
Text preprocessing is very important here. Removing stop words alone is not enough; I think you should also consider the following:
convert the text to lowercase
removal of punctuation
Apostrophe lookup ("'ll" -> " will", "'ve" -> " have")
Removal of numbers
lemmatization and/or stemming for the reviews
etc.
Have a look at standard text preprocessing methods; a minimal sketch of these steps follows below.
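A minimal sketch of those steps, assuming NLTK's WordNetLemmatizer is available (the contraction map is deliberately tiny and only illustrative):

import re
from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet') once

lemmatizer = WordNetLemmatizer()

# Tiny, illustrative contraction map; extend as needed
CONTRACTIONS = {"'ll": " will", "'ve": " have", "n't": " not", "'re": " are"}

def preprocess(text):
    text = text.lower()                          # lowercase
    for short, full in CONTRACTIONS.items():     # apostrophe lookup
        text = text.replace(short, full)
    text = re.sub(r'[^a-z\s]', ' ', text)        # drop punctuation and numbers
    tokens = [lemmatizer.lemmatize(tok) for tok in text.split()]  # lemmatize
    return ' '.join(tokens)

print(preprocess("I've watched 2 films; they'll disappoint you!"))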
I have 38 variables, like oxygen, temperature, pressure, etc., and my task is to determine the total yield produced every day from these variables. When I calculate the regression coefficients and intercept value, they seem abnormal and very high (impractical). For example, if the 'temperature' coefficient is +375.456, I cannot give it a meaning by saying that an increase of one unit in temperature increases yield by 375.456 g; that is impractical in my scenario. However, the prediction accuracy seems right.
I would like to know how to interpret this huge intercept (-5341.27355) and the huge beta values shown below. One other important point: I removed multicollinear columns, and I am not scaling or normalizing the variables because I need the beta coefficients to have meaning, such that I could say an increase in temperature by one unit increases yield by 10 g or so. Your inputs are highly appreciated!
modl.intercept_
Out[375]: -5341.27354961415
modl.coef_
Out[376]:
array([ 1.38096017e+00, -7.62388829e+00, 5.64611255e+00, 2.26124164e-01,
4.21908571e-01, 4.50695302e-01, -8.15167717e-01, 1.82390184e+00,
-3.32849969e+02, 3.31942553e+02, 3.58830763e+02, -2.05076898e-01,
-3.06404757e+02, 7.86012402e+00, 3.21339318e+02, -7.00817205e-01,
-1.09676321e+04, 1.91481734e+00, 6.02929848e+01, 8.33731416e+00,
-6.23433431e+01, -1.88442804e+00, 6.86526274e+00, -6.76103795e+01,
-1.11406021e+02, 2.48270706e+02, 2.94836048e+01, 1.00279016e+02,
1.42906659e-02, -2.13019683e-03, -6.71427100e+02, -2.03158515e+02,
9.32094007e-03, 5.56457014e+01, -2.91724945e+00, 4.78691176e-01,
8.78121854e+00, -4.93696073e+00])
It's very unlikely that the target depends linearly on all of these variables, so I would suggest that you have a look at simple non-linear regression techniques, such as Decision Trees or Kernel Ridge Regression. These are, however, more difficult to interpret.
Going back to your issue, these high weights might well be due to a high amount of correlation between the variables, or to simply not having very much training data.
If, instead of plain linear regression, you use Lasso regression, the solution is biased away from high regression coefficients, and the fit will likely improve as well.
A small example on how to do this in scikit-learn, including cross validation of the regularization hyper-parameter:
import numpy as np
from sklearn.linear_model import LassoCV

# Make up some data
n_samples = 100
n_features = 5
X = np.random.random((n_samples, n_features))
# Make y linearly dependent on the features
y = np.sum(np.random.random((1, n_features)) * X, axis=1)

model = LassoCV(cv=5, n_alphas=100, fit_intercept=True)
model.fit(X, y)
print(model.intercept_)
print(model.coef_)
If you have a linear regression, the formula looks like this (y = target, x = input features):
y = x1*b1 + x2*b2 + x3*b3 + x4*b4 + ... + c
where b1, b2, b3, b4, ... are your modl.coef_. As you already realized, one of your biggest coefficients is 3.319e+02, about 332, and the intercept is also quite big at about -5341.
As you already mentioned, a coefficient tells you how much the target variable changes when the corresponding feature changes by one unit while all other features are held constant.
So for your interpretation: the higher the absolute coefficient, the higher the influence of that feature. But it is important to note that the model uses many high coefficients, which means it does not depend on only one variable.
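As a small illustration of that kind of interpretation (a sketch only, with made-up data standing in for your 38 process variables and fitted model), you can pair each coefficient with its column name and sort by absolute value:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-ins: 38 process variables and a daily yield target
rng = np.random.default_rng(0)
feature_names = [f"var_{i}" for i in range(38)]
X = pd.DataFrame(rng.random((200, 38)), columns=feature_names)
y = X @ rng.normal(size=38) + rng.normal(scale=0.1, size=200)

modl = LinearRegression().fit(X, y)

# Rank features by the absolute size of their coefficients
coefs = pd.Series(modl.coef_, index=feature_names)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(10))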
I have recently started experimenting with OneClassSVM (using sklearn) for unsupervised learning, and I followed this example.
I apologize for the silly questions, but I'm a bit confused about two things:
Should I train my SVM on both the regular example cases and the outliers, or should training be on regular examples only?
Which of the labels predicted by the OneClassSVM represents outliers: 1 or -1?
Once again I apologize for those questions, but for some reason I cannot find this documented anywhere.
As the example you reference is about novelty detection, the docs say:
novelty detection:
The training data is not polluted by outliers, and we are interested in detecting anomalies in new observations.
Meaning: you should train on regular examples only.
The approach is based on:
Schölkopf, Bernhard, et al. "Estimating the support of a high-dimensional distribution." Neural computation 13.7 (2001): 1443-1471.
Extract:
Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a “simple” subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1.
We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement.
The above docs also say:
Inliers are labeled 1, while outliers are labeled -1.
This can also be seen in your example code, extracted:
# Generate some regular novel observations
X = 0.3 * np.random.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
...
# all regular = inliers (defined above)
y_pred_test = clf.predict(X_test)
...
# -1 = outlier <-> error as assumed to be inlier
n_error_test = y_pred_test[y_pred_test == -1].size
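Putting both points together, here is a minimal, self-contained sketch (toy data; the nu and gamma values are just illustrative, not tuned) of training on regular examples only and reading off the 1/-1 predictions:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)

# Train on regular (inlier) observations only
X_train = 0.3 * rng.randn(100, 2)

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)  # illustrative hyper-parameters
clf.fit(X_train)

# Mix of new regular observations and obvious outliers
X_new = np.vstack([0.3 * rng.randn(20, 2), rng.uniform(low=-4, high=4, size=(20, 2))])
y_pred = clf.predict(X_new)  # +1 = inlier, -1 = outlier

print("predicted outliers:", np.sum(y_pred == -1))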
I have crafted a little program for gender classification based on images of faces. I used the Yale face database (175 images of males and the same number of females), converted them to grayscale and equalized the histograms, so after preprocessing the images look like this:
I ran the following code to test the results (it uses an SVM with a linear kernel):
def run_gender_classifier():
    Xm, Ym = mkdataset('gender/male', 1)    # mkdataset just preprocesses images,
    Xf, Yf = mkdataset('gender/female', 0)  # flattens them and stacks into a matrix
    X = np.vstack([Xm, Xf])
    Y = np.hstack([Ym, Yf])
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                        test_size=0.1,
                                                        random_state=100)
    model = svm.SVC(kernel='linear')
    model.fit(X_train, Y_train)
    print("Results:\n%s\n" % (
        metrics.classification_report(
            Y_test, model.predict(X_test))))
And got 100% precision!
In [22]: run_gender_classifier()
Results:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        16
          1       1.00      1.00      1.00        19

avg / total       1.00      1.00      1.00        35
I could expect different results, but 100% correct for image classification looks really suspicious to me.
Furthermore, when I changed the kernel to RBF, the results became totally bad:
In [24]: run_gender_classifier()
Results:
             precision    recall  f1-score   support

          0       0.46      1.00      0.63        16
          1       0.00      0.00      0.00        19

avg / total       0.21      0.46      0.29        35
Which seems even stranger to me.
So my questions are:
Is there any mistake in my approach or code?
If not, how can results for linear kernel be so good, and for RBF so bad?
Note that I also got 100% correct results with logistic regression, and very poor results with deep belief networks, so it's not specific to SVM but rather to linear vs. non-linear models.
Just for completeness, here's my code for preprocessing and making dataset:
import cv2
import numpy as np
from sklearn import linear_model, svm, metrics
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

def preprocess(im):
    im = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
    im = cv2.resize(im, (100, 100))
    return cv2.equalizeHist(im)

def mkdataset(path, label):
    images = (cv2.resize(cv2.imread(fname), (100, 100))
              for fname in list_images(path))
    images = (preprocess(im) for im in images)
    X = np.vstack([im.flatten() for im in images])
    Y = np.repeat(label, X.shape[0])
    return X, Y
All of the described models require parameter tuning:
Linear SVM : C
RBF SVM : C, gamma
DBN : Layers count, Neurons count, Output classifier, Training rate ...
And you simply omitted this element. So it is quite natural that the models with the smallest number of tunable parameters behaved better, as there is simply a higher probability that the default parameters actually worked.
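To give a rough idea of what that tuning looks like, here is a minimal grid-search sketch using the current scikit-learn API; the digits data and the grid values below are just illustrative stand-ins for your image matrix and a real search range:

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy stand-in for the flattened image matrix and labels
X, Y = datasets.load_digits(n_class=2, return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=100)

# Illustrative grid; in practice search wider, on a log scale
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, "scale"]}
search = GridSearchCV(svm.SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, Y_train)

print(search.best_params_, search.score(X_test, Y_test))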
A 100% score always looks suspicious and you should double check it "by hand": physically split the data into train and test (put them in different directories), train on one part, and save your model to a file. Then, in separate code, load the model and test it on the test files, displaying each image together with the label from the model. This way you make sure there is no implementation error (you don't really care whether there is any processing error if you have physical proof that your model recognizes those faces, right?). This is a purely "psychological" method, which makes it obvious that there is no error in the data splitting and further evaluation.
UPDATE
As suggested in the comment, I also checked your dataset, and as it is stated on the official website:
The extended Yale Face Database B contains 16128 images of 28 human subjects under 9 poses and 64 illumination conditions.
So this is for sure a problem: this is not a dataset for gender recognition. Your classifier simply memorizes these 28 subjects, which are easily split into male/female. It simply won't work on any image of other subjects. The only "valuable" part of this dataset is the set of 28 faces of distinct individuals, which you could extract by hand, but 28 images seems at least an order of magnitude too small to be useful.
Friend, from what I understand of your description of the problem, I think the explanation is simple: for this problem the linear kernel works better than the RBF kernel. I believe your logic is correct, but the way you are using the RBF kernel is somewhat wrong for your problem, so keep developing an approach that just uses the linear kernel.
I am using Support Vector Machines for document classification. My feature set for each document is a tf-idf vector. I have M documents, each with a tf-idf vector of size N,
giving an M x N matrix.
M is just 10 documents, and each tf-idf vector has 1000 entries (words), so I have far more features than documents. Also, each word occurs in either 2 or 3 documents. When I normalize each feature (word), i.e. column normalization into [0,1], with
val_feature_j_row_i = ( val_feature_j_row_i - min_feature_j ) / ( max_feature_j - min_feature_j)
it gives me either 0 or 1, of course.
And it gives me bad results. I am using libsvm with an RBF kernel, C = 0.0312, gamma = 0.007815.
Any recommendations?
Should I include more documents? Or other kernel functions like sigmoid, or better normalization methods?
The list of things to consider and correct is quite long, so first of all I would recommend some machine-learning reading before trying to face the problem itself. There are dozens of great books (e.g. Haykin's "Neural Networks and Learning Machines") as well as online courses that will help you with such basics, like those listed here: http://www.class-central.com/search?q=machine+learning .
Getting back to the problem itself:
10 documents is orders of magnitude too small to get any significant results and/or insights into the problem,
there is no universal method of data preprocessing; you have to analyze it through numerous tests and data analytics,
SVMs are parametric models: you cannot use a single pair of C and gamma values and expect reasonable results. You have to check dozens of them to even get a clue "where to search". The simplest method for doing so is a so-called grid search,
1000 features is a great number of dimensions; this suggests that using a kernel which implies an infinite-dimensional feature space is quite... redundant. It would be a better idea to first analyze simpler kernels, which have a smaller chance to overfit (linear or low-degree polynomial),
finally, is tf*idf a good choice if "each word occurs in 2 or 3 documents"? It can be doubtful, unless what you actually mean is 20-30% of the documents,
finally, regarding the simple feature squashing: you say
"it gives me either 0 or 1, of course"
but it should result in values across the whole [0,1] interval, not just its endpoints, so if that is what you are seeing, you probably have some error in your implementation.
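As a small sanity check, here is a sketch (using scikit-learn's MinMaxScaler on a toy dense matrix, not your libsvm pipeline) showing that correct per-column min-max scaling yields values spread across [0,1], not only the endpoints:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for a small document-term matrix (10 docs x 6 terms)
rng = np.random.RandomState(0)
X = rng.random_sample((10, 6))

X_scaled = MinMaxScaler().fit_transform(X)  # column-wise (x - min) / (max - min)

print(X_scaled.min(axis=0))          # 0.0 for each column
print(X_scaled.max(axis=0))          # 1.0 for each column
print(np.unique(X_scaled).size > 2)  # True: plenty of intermediate values, not just {0, 1}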