scikit-learn: clustering text documents using DBSCAN - machine-learning

I'm tryin to use scikit-learn to cluster text documents. On the whole, I find my way around, but I have my problems with specific issues. Most of the examples I found illustrate clustering using scikit-learn with k-means as clustering algorithm. Adopting these example with k-means to my setting works in principle. However, k-means is not suitable since I don't know the number of clusters. From what I read so far -- please correct me here if needed -- DBSCAN or MeanShift seem the be more appropriate in my case. The scikit-learn website provides examples for each cluster algorithm. The problem is now, that with both DBSCAN and MeanShift I get errors I cannot comprehend, let alone solve.
My minimal code is as follows:
docs = []
for item in [database]:
docs.append(item)
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(docs)
X = X.todense() # <-- This line was needed to resolve the isse
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
...
(My documents are already processed, i.e., stopwords have been removed and an Porter Stemmer has been applied.)
When I run this code, I get the following error when instatiating DBSCAN and calling fit():
...
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 248, in fit
clust = dbscan(X, **self.get_params())
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 86, in dbscan
n = X.shape[0]
IndexError: tuple index out of range
Clicking on the line in dbscan_.py that throws the error, I noticed the following line
...
X = np.asarray(X)
n = X.shape[0]
...
When I use these to lines directly in my code for testing, I get the same error. I don't really know what np.asarray(X) is doing here, but after the command X.shape = (). Hence X.shape[0] bombs -- before, X.shape[0] correctly refers to the number of documents. Out of curiosity, I removed X = np.asarray(X) from dbscan_.py. When I do this, something is computing heavily. But after some seconds, I get another error:
...
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 214, in extractor
(min_indx,max_indx) = check_bounds(indices,N)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 198, in check_bounds
max_indx = indices.max()
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 17, in _amax
out=out, keepdims=keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity
In short, I have no clue how to get DBSCAN working, or what I might have missed, in general.

It looks like sparse representations for DBSCAN are supported as of Jan. 2015.
I upgraded sklearn to 0.16.1 and it worked for me on text.

The implementation in sklearn seems to assume you are dealing with a finite vector space, and wants to find the dimensionality of your data set. Text data is commonly represented as sparse vectors, but now with the same dimensionality.
Your input data probably isn't a data matrix, but the sklearn implementations needs them to be one.
You'll need to find a different implementation. Maybe try the implementation in ELKI, which is very fast, and should not have this limitation.
You'll need to spend some time in understanding similarity first. For DBSCAN, you must choose epsilon in a way that makes sense for your data. There is no rule of thumb; this is domain specific. Therefore, you first need to figure out which similarity threshold means that two documents are similar.
Mean Shift may actually need your data to be vector space of fixed dimensionality.

Related

Why doc2vec is giving different and un-reliable results?

I have a set of 20 small document which talks about a particular kind of issue (training data). Now i want to identify those docs out of 10K documents, which are talking about the same issue.
For the purpose i am using the doc2vec implementation:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
# Tokenize_and_stem is creating the tokens and stemming and returning the list
# documents_prb store the list of 20 docs
tagged_data = [TaggedDocument(words=tokenize_and_stem(_d.lower()), tags=[str(i)]) for i, _d in enumerate(documents_prb)]
max_epochs = 20
vec_size = 20
alpha = 0.025
model = Doc2Vec(size=vec_size,
alpha=alpha,
min_alpha=0.00025,
min_count=1,
dm =1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
print('iteration {0}'.format(epoch))
model.train(tagged_data,
total_examples=model.corpus_count,
epochs=model.iter)
# decrease the learning rate
model.alpha -= 0.0002
# fix the learning rate, no decay
model.min_alpha = model.alpha
model.save("d2v.model")
print("Model Saved")
model= Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
def doc2vec_score(s):
s_list = tokenize_and_stem(s)
v1 = model.infer_vector(s_list)
similar_doc = model.docvecs.most_similar([v1])
original_match = (X[int(similar_doc[0][0])])
score = similar_doc[0][1]
match = similar_doc[0][0]
return score,match
final_data = []
# df_ws is the list of 10K docs for which i want to find the similarity with above 20 docs
for index, row in df_ws.iterrows():
print(row['processed_description'])
data = (doc2vec_score(row['processed_description']))
L1=list(data)
L1.append(row['Number'])
final_data.append(L1)
with open('file_cosine_d2v.csv','w',newline='') as out:
csv_out=csv.writer(out)
csv_out.writerow(['score','match','INC_NUMBER'])
for row in final_data:
csv_out.writerow(row)
But, I am facing the strange issue, the results are highly un-reliable (Score is 0.9 even if there is not a slightest match) and score is changing with great margin every time. I am running the doc2vec_score function. Can someone please help me what is wrong here ?
First & foremost, try not using the anti-pattern of calling train multiple times in your own loop.
See this answer for more details: My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?
If there's still a problem after that fix, edit your question to show the corrected code, and a more clear example of the output you consider unreliable.
For example, show the actual doc-IDs & scores, and explain why you think the probe document you're testing should be "not a slightest match" for any documents returned.
And note that if a document is truly nothing like the training documents, for example by using words that weren't in the training documents, it's not really possible for a Doc2Vec model to detect that. When it infers vectors for new documents, all unknown words are ignored. So you'll be left with a document using only known words, and it will return the best matches for that subset of your document's words.
More fundamentally, a Doc2Vec model is really only learning ways to contrast the documents that are in the universe demonstrated by the training set, by their words' cooccurrences. If presented with a document with either totally different words, or words whose frequencies/cooccurrences are totally unlike anything seen before, its output will be essentially random, without much meaningful relationship to other more-typical documents. (That'll be maybe-close, maybe-far, because in a way the training on the 'known universe' tends to fill the whole available space.)
So, you wouldn't want to use a Doc2Vec model trained only only positive examples of what you want to recognize, if you also want to recognize negative examples. Rather, include all kinds, then remember the subset that's relevant for certain in/out decisions – and use that subset for downstream comparisons, or multiple subsets to feed a more-formal classification or clustering algorithm.

Scikit_learn's PolynomialFeatures with logistic regression resulting in lower scores

I have a dataset X whose shape is (1741, 61). Using logistic regression with cross_validation I was getting around 62-65% for each split (cv =5).
I thought that if I made the data quadratic, the accuracy is supposed to increase. However, I'm getting the opposite effect (I'm getting each split of cross_validation to be in the 40's, percentage-wise) So,I'm presuming I'm doing something wrong when trying to make the data quadratic?
Here is the code I'm using,
from sklearn import preprocessing
X_scaled = preprocessing.scale(X)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(3)
poly_x =poly.fit_transform(X_scaled)
classifier = LogisticRegression(penalty ='l2', max_iter = 200)
from sklearn.cross_validation import cross_val_score
cross_val_score(classifier, poly_x, y, cv=5)
array([ 0.46418338, 0.4269341 , 0.49425287, 0.58908046, 0.60518732])
Which makes me suspect, I'm doing something wrong.
I tried transforming the raw data into quadratic, then using preprocessing.scale, to scale the data, but it was resulting in an error.
UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.
warnings.warn("Numerical issues were encountered "
So I didn't bother going this route.
The other thing that's bothering is the speed of the quadratic computations. cross_val_score is taking around a couple of hours to output the score when using polynomial features. Is there any way to speed this up? I have an intel i5-6500 CPU with 16 gigs of ram, Windows 7 OS.
Thank you.
Have you tried using the MinMaxScaler instead of the Scaler? Scaler will output values that are both above and below 0, so you will run into a situation where values with a scaled value of -0.1 and those with a value of 0.1 will have the same squared value, despite not really being similar at all. Intuitively this would seem to be something that would lower the score of a polynomial fit. That being said I haven't tested this, it's just my intuition. Furthermore, be careful with Polynomial fits. I suggest reading this answer to "Why use regularization in polynomial regression instead of lowering the degree?". It's a great explanation and will likely introduce you to some new techniques. As an aside #MatthewDrury is an excellent teacher and I recommend reading all of his answers and blog posts.
There is a statement that "the accuracy is supposed to increase" with polynomial features. That is true if the polynomial features brings the model closer to the original data generating process. Polynomial features, especially making every feature interact and polynomial, may move the model further from the data generating process; hence worse results may be appropriate.
By using a 3 degree polynomial in scikit, the X matrix went from (1741, 61) to (1741, 41664), which is significantly more columns than rows.
41k+ columns will take longer to solve. You should be looking at feature selection methods. As Grr says, investigate lowering the polynomial. Try L1, grouped lasso, RFE, Bayesian methods. Try SMEs (subject matter experts who may be able to identify specific features that may be polynomial). Plot the data to see which features may interact or be best in a polynomial.
I have not looked at it for a while but I recall discussions on hierarchically well-formulated models (can you remove x1 but keep the x1 * x2 interaction). That is probably worth investigating if your model behaves best with an ill-formulated hierarchical model.

Artificial Neural Network for formula classification/calculation

I am trying to create an ANN for calculating/classifying a/any formula.
I initially tried to replicate Fibonacci Sequence. I using the inputs:
[1,2] output [3]
[2,3] output [5]
[3,5] output [8]
etc...
The issue I am trying to overcome is how to normalize the data that could be potentially infinite or scale exponentially? I then tried to create an ANN to calculate the slope-intercept formula y = mx+b (2x+2) with inputs
[1] output [4]
[2] output [6]
etc...
Again I do not know how to normalize the data. If I normalize only the training data how would the network be able to calculate or classify with inputs outside of what was used for normalization?
So would it be possible to create an ANN to calculate/classify the formula ((a+2b+c^2+3d-5e) modulo 2), where the formula is unknown, but the inputs (some) a,b,c,d,and e are given as well as the output? Essentially classifying whether the calculations output is odd or even and the inputs are between -+infinity...
Okay, I think I understand what you're trying to do now. Basically, you are going to have a set of inputs representing the coefficients of a function. You want the ANN to tell you whether the function, with those coefficients, will produce an even or an odd output. Let me know if that's wrong. There are a few potential issues here:
First, while it is possible to use a neural network to do addition, it is not generally very efficient. You also need to set your ANN up in a very specific way, either by using a different node type than is usually used, or by setting up complicated recurrent topologies. This would explain your lack of success with the Fibonacci sequence and the line equation.
But there's a more fundamental problem. You might have heard that ANNs are general function approximators. However, in this case, the function that the ANN is learning won't be your formula. When you have an ANN that is learning to output either 0 or 1 in response to a set of inputs, it's actually trying to learn a function for a line (or set of lines, or hyperplane, depending on the topology) that separates all of the inputs for which the output should be 0 from all of the inputs for which the output should be 1. (see the answers to this question for a more thorough explanation, with pictures). So the question, then, is whether or not there is a hyperplane that separates coefficients that will result in an even output from coefficients that will result in an odd output.
I'm inclined to say that the answer to that question is no. If you consider the a coefficient in your example, for instance, you will see that every time you increment or decrement it by 1, the correct output switches. The same is true for the c, d, and e terms. This means that there aren't big clumps of relatively similar inputs that all return the same output.
Why do you need to know whether the output of an unknown function is even or odd? There might be other, more appropriate techniques.

How to normalize tf-idf vectors for SVMs?

I am using Support Vector Machines for document classification. My feature set for each document is a tf-idf vector. I have M documents with each tf-idf vector of size N.
Giving M * N matrix.
The size of M is just 10 documents and tf-idf vector is 1000 word vector. So my features are much larger than number of documents. Also each word occurs in either 2 or 3 documents. When i am normalizing each feature ( word ) i.e. column normalization in [0,1] with
val_feature_j_row_i = ( val_feature_j_row_i - min_feature_j ) / ( max_feature_j - min_feature_j)
It either gives me 0, 1 of course.
And it gives me bad results. I am using libsvm, with rbf function C = 0.0312, gamma = 0.007815
Any recommendations ?
Should i include more documents ? or other functions like sigmoid or better normalization methods ?
The list of things to consider and correct is quite long, so first of all I would recommend some machine-learning reading before trying to face the problem itself. There are dozens of great books (like ie. Haykin's "Neural Networks and Learning Machines") as well as online courses, which will help you with such basics, like those listed here: http://www.class-central.com/search?q=machine+learning .
Getting back to the problem itself:
10 documents is rows of magnitude to small to get any significant results and/or insights into the problem,
there is no universal method of data preprocessing, you have to analyze it through numerous tests and data analytics,
SVMs are parametrical models, you cannot use a single C and gamma values and expect any reasonable results. You have to check dozens of them to even get a clue "where to search". The most simple method for doing so is so called grid search,
1000 of features is a great number of dimensions, this suggest that using a kernel, which implies infinitely dimensional feature space is quite... redundant - it would be a better idea to first analyze simplier ones, which have smaller chance to overfit (linear or low degree polynomial)
finally is tf*idf a good choice if "each word occurs in 2 or 3 documents"? It can be doubtfull, unless what you actually mean is 20-30% of documents
finally why is simple features squashing
It either gives me 0, 1 of course.
it should result in values in [0,1] interval, not just its limits. So if this is a case you are probably having some error in your implementation.

How to optimize affinity propagation results?

I was wondering if anybody knows anything deeper than what I do about the Affinity Propagation clustering algorithm in the python scikit-learn package?
All I know right now is that I input an "affinity matrix" (affmat), which is calculated by applying a heat kernel transformation to the "distance matrix" (distmat). The distmat is the output of a previous analysis step I do. After I obtain the affmat, I pass that into the algorithm by doing something analogous to:
af = AffinityPropagation(affinity = 'precomputed').fit(affmat)
On one hand, I appreciate the simplicity of the inputs. On the other hand, I'm still not sure if I've got the input done correctly?
I have tried evaluating the results by looking at the Silhouette scores, and frequently (but not a majority of the time) I am getting scores close to zero or in the negative ranges, which indicate to me that I'm not getting clusters that are 'naturally grouped'. This, to me, is a problem.
Additionally, when I observe some of the plots, I'm getting weird patterns of clustering, such as the image below, where I clearly see 3 clusters of points, but AP is giving me 10:
I also see other graphs like this, where the clusters overlap a lot.:
How are the cluster centers' euclidean coordinates identified? I notice when I put in percent values (99.2, 99.5 etc.), rather than fraction values (0.992, 0.995 etc.), the x- and y-axes change in scale from [0, 1] to [0, 100] (but of course the axes are scaled appropriately such that I'm getting, on occasion, [88, 100] or [92, 100] and so on).
Sorry for the rambling here, but I'm having quite a hard time understanding the underpinnings of the python implementation of the algorithm. I've gone back to the original paper and read it, but I don't know how the manipulate it in the scikit-learn package to get better clustering.

Resources