I am classifying small texts (tweets) using Naive Bayes (MultinominalNB) in scikit-learn.
My train data has 1000 features, and my test data has 1200 features.
Let's say 500 features are common for both train and test data.
I wonder why MultinominalNB in scikit learn does not handle unseen features, and gives me an error:
Traceback (most recent call last):
File "/Users/osopova/Documents/00_KSU_Masters/01_2016_Spring/Twitter_project/mda_project_1/step_4.py", line 60, in <module>
predict_Y = classifiers[i].predict(test_X)
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 65, in predict
jll = self._joint_log_likelihood(X)
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 672, in _joint_log_likelihood
return (safe_sparse_dot(X, self.feature_log_prob_.T)
File "/Library/Python/2.7/site-packages/sklearn/utils/extmath.py", line 184, in safe_sparse_dot
return fast_dot(a, b)
ValueError: matrices are not aligned
It does not handle unseen features because you do not pass any reference naming features. Why do you have 1200 features in one case and 1000 in another? Probably because there were objects in the test setting not present in the training - but how Naive Bayes is supposed to figure out which ones of these 1200 are missing in 1000? In this implementation (which is the only possible when you assume arrays as input) it is your duty to remove all columns, which do not correspond to the ones in the training set, add columns of zeros (in valid spots) if it is the other way around, and most importantly - make sure that "ith" column in one set is the same (captures occurence of the same word/object) as "ith" column in the second one. Consequently in your case there are just 500 columns which can actually be used, and Naive Bayes has no information how to find these. You have to provide, in test scenario, the same 1000 features which were used in train, thus in your case it means removing 700 columns not seen during train, and adding (in valid spots!) 500 columns of zeros.
In particular, scikit-learn gives you plenty of data preprocessing utilities, which do this for you (like CountVectorizer etc.).
Related
I have trained a set of documents using Doc2vecc.
https://github.com/mchen24/iclr2017
I am trying to generate the embedding vector for the unseen documents.I have trained the documents as mentioned in the go.sh.
"""
time ./doc2vecc -train ./aclImdb/alldata-shuf.txt -word
wordvectors.txt -output docvectors.txt -cbow 1 -size 100 -window 10 -
negative 5 -hs 0 -sample 0 -threads 4 -binary 0 -iter 20 -min-count 10
-test ./aclImdb/alldata.txt -sentence-sample 0.1 -save-vocab
alldata.vocab
"""
I get the docvectors.txt and wordvectors.txt for the train set. Now from here how do I generate vectors for unseen test using the same model without retraining.
As far as I can tell, the author (https://github.com/mchen24) of that doc2vecc.c code (and paper) just made minimal changes to some example 'paragraph vector' code that was itself a minimal change to the original Google/Mikolov word2vec.c (https://github.com/tmikolov/word2vec/blob/master/word2vec.c).
Neither the 'paragraph vector' changes nor the subsequent doc2vecc changes appear to include any functionality for inferring vectors for new documents.
Because these are unsupervised algorithms, for some purposes it may be appropriate to calculate the document-vectors for some downstream classification task, for both training and test texts, in the same combined bulk training. (Your ultimate goals may in fact have unlabeled examples to help learn the document-vectorization, even if your classifier should be trained an evaluated on some subset of known-label texts.)
Doc2VecC is expressly designed to create document vectors as averages of the word-vectors in each document. This is unlike Doc2Vec where document embeddings are trained alongside the word embeddings making it impossible to handle unseen documents. The amount of trained vectors is also enormous in Doc2Vec.
To build the vector for an unseen document, just count all the words from your vocabulary in it and compute an average of the word-vectors.
I am using LSTM neural networks (stateful) for time series prediction.
I'm hoping that the stateful LSTM can capture the hidden patterns and make a satisfactory prediction (the physical law that cause the variation of the time series is not clear).
I have a time series X with a length of 1500 (actual observational data), and my purpose is to predict the future 100.
I suppose predict the next 10 will be more promising than predict the next 100 (is that right?).
So, I prepare the training data like this (always using 100 values to predict the next 10; x_n denotes the n-th element in X):
shape of trainX: [140, 100, 1]
shape of trainY: [140, 10, 1]
---
0: [x_0, x_1, ..., x_99] -> [x_100, x_101, ..., x_109]
1: [x_10, x_11, ..., x_109] -> [x_110, x_111, ..., x_119]
2: [x_20, x_21, ..., x_119] -> [x_120, x_121, ..., x_129]
...
139: [x_1390, x_1391, ..., x_1489] -> [x_1490, x_1491, ..., x_1499]
---
After the training, I want to use the model to predict the next 10 values [x_1500 - x_1509] with [x_1400 - x_1499], and then predict the next 10 values [x_1510 - x_1519] with [x_1410 - x_1509].
Is this the right way?
After a lot of reading of documents and examples, I can train a model and make the prediction, but the result seems not satisfactory.
To validate the method, I assume that the last 100 (x_1400 - x_1499) values are unknown, and remove them from trainX and trainY, then try to train a model and predict them. Lastly, compare the predicted values with the observed values.
Any suggestions or comments will be appreciated.
The time series looks like this:
Your question is really complexed. Before I will try to answer it - I'll share my doubts with you about is it sensible to use LSTM for your task. You want to use a really advanced model (LSTM are capable to learn really complexed patterns) to a time series which seems to be relatively easy. Moreover - you have a really small amout of data. To be honest - I would try to train simpler and easier methods first (like ARMA or ARIMA).
To answer your question - if your approach is good - it seems to be reasonable. Other reasonable methods are predicting all 100 steps or e.g. 50 steps twice. With 10 steps you might come across error cumulation - but still it might be a good method.
As I mentioned earlier - I would rather try easier ML method for this task but if you really want to use LSTM you may tackle this problem in a following way:
Define metaparameters like number of steps ahead you want to predict, the size of input fed to network.
Try to use e.g. grid search in order to find the best value of this metaparameters. Evaluate each setup using k-fold crossvalidation.
Retrain final model using the best metaparameter setup.
You have relatively small amount of data so you may easily find the best values of hyperparameters. This will also show you if your approach is good or not - simply check the results provided by the best solution.
I'm tryin to use scikit-learn to cluster text documents. On the whole, I find my way around, but I have my problems with specific issues. Most of the examples I found illustrate clustering using scikit-learn with k-means as clustering algorithm. Adopting these example with k-means to my setting works in principle. However, k-means is not suitable since I don't know the number of clusters. From what I read so far -- please correct me here if needed -- DBSCAN or MeanShift seem the be more appropriate in my case. The scikit-learn website provides examples for each cluster algorithm. The problem is now, that with both DBSCAN and MeanShift I get errors I cannot comprehend, let alone solve.
My minimal code is as follows:
docs = []
for item in [database]:
docs.append(item)
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(docs)
X = X.todense() # <-- This line was needed to resolve the isse
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
...
(My documents are already processed, i.e., stopwords have been removed and an Porter Stemmer has been applied.)
When I run this code, I get the following error when instatiating DBSCAN and calling fit():
...
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 248, in fit
clust = dbscan(X, **self.get_params())
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 86, in dbscan
n = X.shape[0]
IndexError: tuple index out of range
Clicking on the line in dbscan_.py that throws the error, I noticed the following line
...
X = np.asarray(X)
n = X.shape[0]
...
When I use these to lines directly in my code for testing, I get the same error. I don't really know what np.asarray(X) is doing here, but after the command X.shape = (). Hence X.shape[0] bombs -- before, X.shape[0] correctly refers to the number of documents. Out of curiosity, I removed X = np.asarray(X) from dbscan_.py. When I do this, something is computing heavily. But after some seconds, I get another error:
...
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 214, in extractor
(min_indx,max_indx) = check_bounds(indices,N)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 198, in check_bounds
max_indx = indices.max()
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 17, in _amax
out=out, keepdims=keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity
In short, I have no clue how to get DBSCAN working, or what I might have missed, in general.
It looks like sparse representations for DBSCAN are supported as of Jan. 2015.
I upgraded sklearn to 0.16.1 and it worked for me on text.
The implementation in sklearn seems to assume you are dealing with a finite vector space, and wants to find the dimensionality of your data set. Text data is commonly represented as sparse vectors, but now with the same dimensionality.
Your input data probably isn't a data matrix, but the sklearn implementations needs them to be one.
You'll need to find a different implementation. Maybe try the implementation in ELKI, which is very fast, and should not have this limitation.
You'll need to spend some time in understanding similarity first. For DBSCAN, you must choose epsilon in a way that makes sense for your data. There is no rule of thumb; this is domain specific. Therefore, you first need to figure out which similarity threshold means that two documents are similar.
Mean Shift may actually need your data to be vector space of fixed dimensionality.
I am using Support Vector Machines for document classification. My feature set for each document is a tf-idf vector. I have M documents with each tf-idf vector of size N.
Giving M * N matrix.
The size of M is just 10 documents and tf-idf vector is 1000 word vector. So my features are much larger than number of documents. Also each word occurs in either 2 or 3 documents. When i am normalizing each feature ( word ) i.e. column normalization in [0,1] with
val_feature_j_row_i = ( val_feature_j_row_i - min_feature_j ) / ( max_feature_j - min_feature_j)
It either gives me 0, 1 of course.
And it gives me bad results. I am using libsvm, with rbf function C = 0.0312, gamma = 0.007815
Any recommendations ?
Should i include more documents ? or other functions like sigmoid or better normalization methods ?
The list of things to consider and correct is quite long, so first of all I would recommend some machine-learning reading before trying to face the problem itself. There are dozens of great books (like ie. Haykin's "Neural Networks and Learning Machines") as well as online courses, which will help you with such basics, like those listed here: http://www.class-central.com/search?q=machine+learning .
Getting back to the problem itself:
10 documents is rows of magnitude to small to get any significant results and/or insights into the problem,
there is no universal method of data preprocessing, you have to analyze it through numerous tests and data analytics,
SVMs are parametrical models, you cannot use a single C and gamma values and expect any reasonable results. You have to check dozens of them to even get a clue "where to search". The most simple method for doing so is so called grid search,
1000 of features is a great number of dimensions, this suggest that using a kernel, which implies infinitely dimensional feature space is quite... redundant - it would be a better idea to first analyze simplier ones, which have smaller chance to overfit (linear or low degree polynomial)
finally is tf*idf a good choice if "each word occurs in 2 or 3 documents"? It can be doubtfull, unless what you actually mean is 20-30% of documents
finally why is simple features squashing
It either gives me 0, 1 of course.
it should result in values in [0,1] interval, not just its limits. So if this is a case you are probably having some error in your implementation.
I'm trying to use sk-learn's RandomForestClassifier for a binary classification task (positive and negative examples). My training data contains 1.177.245 examples with 40 features, in SVM-light format (sparse vectors) which I load using sklearn.dataset's load_svmlight_file. It produces a sparse matrix of 'feature values' (1.177.245 * 40) and one array of 'target classes' (1s and 0s, 1.177.245 of them). I don't know whether this is worrysome, but the trainingdata has 3552 positives and the rest are all negative.
As the sk-learn's RFC doesn't accept sparse matrices, I convert the sparse matrix to a dense array (if I'm saying that right? Lots of 0s for absent features) using .toarray(). I print the matrix before and after converting to arrays and that seems to be going all right.
When I initiate the classifier and start fitting it to the data, it takes this long:
[Parallel(n_jobs=40)]: Done 1 out of 40 | elapsed: 24.7min remaining: 963.3min
[Parallel(n_jobs=40)]: Done 40 out of 40 | elapsed: 27.2min finished
(is that output right? Those 963 minutes take about 2 and a half...)
I then dump it using joblib.dump.
When I re-load it:
RandomForestClassifier: RandomForestClassifier(bootstrap=True, compute_importances=True,
criterion=gini, max_depth=None, max_features=auto,
min_density=0.1, min_samples_leaf=1, min_samples_split=1,
n_estimators=1500, n_jobs=40, oob_score=False,
random_state=<mtrand.RandomState object at 0x2b2d076fa300>,
verbose=1)
And test it on real trainingdata (consisting out of 750.709 examples, exact same format as training data) I get "unexpected" results. To be exact; only one of the examples in the testingdata is classified as true. When I train on half the initial trainingdata and test on the other half, I get no positives at all.
Now I have no reason to believe anything is wrong with what's happening, it's just that I get weird results, and furthermore I think it's all done awfully quick. It's probably impossible to make a comparison, but training a RFClassifier on the same data using rt-rank (also with 1500 iterations, but with half the cores) takes over 12 hours...
Can anyone enlighten me whether I have any reason to believe something is not working the way it's supposed to? Could it be the ratio of positives to negatives in the training data? Cheers.
Indeed this dataset is very very imbalanced. I would advise you to subsample the negative examples (e.g. pick n_positive_samples of them at random) or to oversample the positive example (the latter is more expensive and but might yield better models).
Also are you sure that all your features are numerical features (larger values means something in real life)? If some of them are categorical integer markers, those feature should be exploded as one-of-k boolean encodings instead as scikit-learn implementation of random forest s cannot directly deal with categorical data.