This question already has answers here:
Does the SVM in sklearn support incremental (online) learning?
(6 answers)
Closed 4 years ago.
I'm trying to perform sentiment analysis over the twitter dataset "Sentiment140" which consists of 1.6 million labelled tweets . I'm constructing my feature vector using Bag Of Words ( Unigram ) model , so each tweet is represented by about 20000 features . Now to train my sklearn model (SVM,Logistic Regression,Naive Bayes) using this dataset , i have to load the entire 1.6m x 20000 feature vectors into one variable and then feed it to the model . Even on my server machine which has a total of 115GB of memory , it causes the process to be killed .
So i wanted to know if i can train the model instance by instance , rather than loading the entire dataset into one variable ?
If sklearn does not have this flexibility , then is there any other libraries that you could recommend (which support sequential learning) ?
It is not really necessary (let alone efficient) to go to the other extreme and train instance by instance; what you are looking for is actually called incremental or online learning, and it is available in scikit-learn's SGDClassifier for linear SVM and logistic regression, which indeed contains a partial_fit method.
Here is a quick example with dummy data:
import numpy as np
from sklearn import linear_model
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
clf = linear_model.SGDClassifier(max_iter=1000, tol=1e-3)
clf.partial_fit(X, Y, classes=np.unique(Y))
X_new = np.array([[-1, -1], [2, 0], [0, 1], [1, 1]])
Y_new = np.array([1, 1, 2, 1])
clf.partial_fit(X_new, Y_new)
The default values for the loss and penalty arguments ('hinge' and 'l2' respectively) are these of a LinearSVC, so the above code essentially fits incrementally a linear SVM classifier with L2 regularization; these settings can of course be changed - check the docs for more details.
It is necessary to include the classes argument in the first call, which should contain all the existing classes in your problem (even though some of them might not be present in some of the partial fits); it can be omitted in subsequent calls of partial_fit - again, see the linked documentation for more details.
Related
My question concerns the ability of UMAP to transform new data in a way that intuitively makes sense. There is a nice example of this in the documentation: https://umap-learn.readthedocs.io/en/latest/transform.html. Here MNIST data is used to train a model, and new (withheld) MNIST data is passed to the model, with the result that the new data is mapped into the expected regions of learned space (i.e. same as training data). However, when I try this on a synthetic dataset I'm unable to reproduce this behavior.
I first created a set of 500 training examples, X1, each with 32 features. These are generated randomly using np.random.rand and have values between 0 and 1. I then created a small test set of 10 examples, X2, where all examples are identical to one of the training samples, except it is multiplied by 10. It is therefore much different than training data. See plot comparing a single example from X1 and X2.
import numpy as np
import random
from sklearn.decomposition import PCA
import umap
from matplotlib import pyplot as plt
X1 = np.random.rand(500, 32)
X2 = np.copy(X1[0:10])
X2[0:10] = X1[0]*10
plt.plot(X1[0], 'go-', label='train')
plt.plot(X2[0], 'bo-', label='test')
plt.legend();
Next I compare the performance of PCA and UMAP trained on X1 and used to transform X2. The results are shown below.
pca = PCA(n_components=2, random_state=11)
X1_pca = pca.fit_transform(X1)
X2_pca = pca.transform(X2)
umap = umap.UMAP(n_components=2, random_state=11)
X1_umap = umap.fit_transform(X1)
X2_umap = umap.transform(X2)
plt.subplot(1, 2, 1)
plt.scatter(X1_pca[:, 0], X1_pca[:, 1], s=40, c='blue', alpha=0.25, label='train')
plt.scatter(X2_pca[:, 0], X2_pca[:, 1], s=40, c='green', alpha=0.25, label='test')
plt.legend()
plt.title('PCA')
plt.subplot(1, 2, 2)
plt.scatter(X1_umap[:, 0], X1_umap[:, 1], s=40, c='blue', alpha=0.25, label='train')
plt.scatter(X2_umap[:, 0], X2_umap[:, 1], s=40, c='green', alpha=0.25, label='test')
plt.legend()
plt.title('UMAP')
plt.show()
PCA generates a result that one would naively expect. The training data is clustered together, the test data is clearly separated, and all the points within the test data are mapped to the same point in 2D space. UMAP, on the other hand, does neither. That is, the test data is not clearly differentiated from the training data, and even though the test data consists of identical features, they are not even mapped to the same region in the 2D space.
Am I doing something wrong, or is my understanding of how UMAP should work incorrect? Any help is greatly appreciated.
I have tried using the OneVsRest with Logistic Regression from Sklearn, but it gives empty labels for some samples (i.e. doesn't predict any out), even though I do not have any unlabelled training data.
Any idea what might be causing this or how to fix this?
clf = OneVsRestClassifier(LogisticRegression(multi_class='ovr',max_iter=1000,solver='lbfgs'))
clf.fit(X,Y)
self.classifier=clf
self.classifier.predict(test_data)
Whenever you are performing MultiLabel classification, according to the OneVsRestClassifier the targets need to be "a sequence of sequences of labels".
Moreover, depending on how you encode this labels you may get the following warning: "DeprecationWarning: Direct support for sequence of sequences multilabel representation will be unavailable from version 0.17. Use sklearn.preprocessing.MultiLabelBinarizer to convert to a label indicator representation."
So, neat way to encode your labels:
from sklearn import preprocessing
mlb = preprocessing.MultiLabelBinarizer()
Y = mlb.fit_transform([(1, 2), (1,2), (1,2),(4,)])
# this means sample one belongs to classes {1,2} and so on.
# Take into account the format if only one class is needed, (4,) not (4)
so Y turns out to be:
array([[1, 1, 0],
[1, 1, 0],
[1, 1, 0],
[0, 0, 1]])
I have a problem and at this point I'm completely lost as to how to solve it. I'm using Keras with an LSTM layer to project a time series. I'm trying to use the previous 10 data points to predict the 11th.
Here's the code:
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM
def _load_data(data):
"""
data should be pd.DataFrame()
"""
n_prev = 10
docX, docY = [], []
for i in range(len(data)-n_prev):
docX.append(data.iloc[i:i+n_prev].as_matrix())
docY.append(data.iloc[i+n_prev].as_matrix())
if not docX:
pass
else:
alsX = np.array(docX)
alsY = np.array(docY)
return alsX, alsY
X, y = _load_data(df_test)
X_train = X[:25]
X_test = X[25:]
y_train = y[:25]
y_test = y[25:]
in_out_neurons = 2
hidden_neurons = 300
model = Sequential()
model.add(LSTM(in_out_neurons, hidden_neurons, return_sequences=False))
model.add(Dense(hidden_neurons, in_out_neurons))
model.add(Activation("linear"))
model.compile(loss="mean_squared_error", optimizer="rmsprop")
model.fit(X_train, y_train, nb_epoch=10, validation_split=0.05)
predicted = model.predict(X_test)
So I'm taking the input data (a two column dataframe), creating X which is an n by 10 by 2 array, and y which is an n by 2 array which is one step ahead of the last row in each array of X (labeling the data with the point directly ahead of it.
predicted is returning
[[ 7.56940445, 5.61719704],
[ 7.57328357, 5.62709032],
[ 7.56728049, 5.61216415],
[ 7.55060187, 5.60573629],
[ 7.56717342, 5.61548522],
[ 7.55866942, 5.59696181],
[ 7.57325984, 5.63150951]]
but I should be getting
[[ 73, 48],
[ 74, 42],
[ 91, 51],
[102, 64],
[109, 63],
[ 93, 65],
[ 92, 58]]
The original data set only has 42 rows, so I'm wondering if there just isn't enough there to work with? Or am I missing a key step in the modeling process maybe? I've seen some examples using Embedding layers etc, is that something I should be looking at?
Thanks in advance for any help!
Hey Ryan!
I know it's late but I just came across your question hope it's not too late or that you still find some knowledge here.
First of all, Stackoverflow may not be the best place for this kind of question. First reason to that is you have a conceptual question that is not this site's purpose. Moreover your code runs so it's not even a matter of general programming. Have a look at stats.
Second from what I see there is no conceptual error. You're using everything necessary that is:
lstm with propper dimensions
return_sequences=false just before your Dense layer
linear activation for your output
mse cost/loss/objective function
Third I however find it extremely unlikely that your network learns anything with so few pieces of data. You have to understand that you have less data than parameters here! For the great majority of supervised learning algorithm, the first thing you need is not a good model, it's good data. You can not learn from so few examples, especially not with a complex model such as LSTM networks.
Fourth It seems like your target data is made of relatively high values. First step of pre-processing here could be to standardize the data : center it around zero - that is translate your data by its mean - and rescale by ists standard deviation. This really helps learning!
Fifth In general here are a few things you should look into to improve learning and reduce overfitting :
Dropout
Batch Normalization
Other optimizer (such as Adam)
Gradient clipping
Random hyper parameter search
(This is not exhaustive, if you're reading this and think something should be added, comment it so it's useful for future readers!)
Last but NOT least I suggest you look at this tutorial on Github, especially the recurrent tutorial for time series with keras.
PS: Daniel Hnyk updated his post ;)
I am trying to solve the regression task. I found out that 3 models are working nicely for different subsets of data: LassoLARS, SVR and Gradient Tree Boosting. I noticed that when I make predictions using all these 3 models and then make a table of 'true output' and outputs of my 3 models I see that each time at least one of the models is really close to the true output, though 2 others could be relatively far away.
When I compute minimal possible error (if I take prediction from 'best' predictor for each test example) I get a error which is much smaller than error of any model alone. So I thought about trying to combine predictions from these 3 diffent models into some kind of ensemble. Question is, how to do this properly? All my 3 models are build and tuned using scikit-learn, does it provide some kind of a method which could be used to pack models into ensemble? The problem here is that I don't want to just average predictions from all three models, I want to do this with weighting, where weighting should be determined based on properties of specific example.
Even if scikit-learn not provides such functionality, it would be nice if someone knows how to property address this task - of figuring out the weighting of each model for each example in data. I think that it might be done by a separate regressor built on top of all these 3 models, which will try output optimal weights for each of 3 models, but I am not sure if this is the best way of doing this.
This is a known interesting (and often painful!) problem with hierarchical predictions. A problem with training a number of predictors over the train data, then training a higher predictor over them, again using the train data - has to do with the bias-variance decomposition.
Suppose you have two predictors, one essentially an overfitting version of the other, then the former will appear over the train set to be better than latter. The combining predictor will favor the former for no true reason, just because it cannot distinguish overfitting from true high-quality prediction.
The known way of dealing with this is to prepare, for each row in the train data, for each of the predictors, a prediction for the row, based on a model not fit for this row. For the overfitting version, e.g., this won't produce a good result for the row, on average. The combining predictor will then be able to better assess a fair model for combining the lower-level predictors.
Shahar Azulay & I wrote a transformer stage for dealing with this:
class Stacker(object):
"""
A transformer applying fitting a predictor `pred` to data in a way
that will allow a higher-up predictor to build a model utilizing both this
and other predictors correctly.
The fit_transform(self, x, y) of this class will create a column matrix, whose
each row contains the prediction of `pred` fitted on other rows than this one.
This allows a higher-level predictor to correctly fit a model on this, and other
column matrices obtained from other lower-level predictors.
The fit(self, x, y) and transform(self, x_) methods, will fit `pred` on all
of `x`, and transform the output of `x_` (which is either `x` or not) using the fitted
`pred`.
Arguments:
pred: A lower-level predictor to stack.
cv_fn: Function taking `x`, and returning a cross-validation object. In `fit_transform`
th train and test indices of the object will be iterated over. For each iteration, `pred` will
be fitted to the `x` and `y` with rows corresponding to the
train indices, and the test indices of the output will be obtained
by predicting on the corresponding indices of `x`.
"""
def __init__(self, pred, cv_fn=lambda x: sklearn.cross_validation.LeaveOneOut(x.shape[0])):
self._pred, self._cv_fn = pred, cv_fn
def fit_transform(self, x, y):
x_trans = self._train_transform(x, y)
self.fit(x, y)
return x_trans
def fit(self, x, y):
"""
Same signature as any sklearn transformer.
"""
self._pred.fit(x, y)
return self
def transform(self, x):
"""
Same signature as any sklearn transformer.
"""
return self._test_transform(x)
def _train_transform(self, x, y):
x_trans = np.nan * np.ones((x.shape[0], 1))
all_te = set()
for tr, te in self._cv_fn(x):
all_te = all_te | set(te)
x_trans[te, 0] = self._pred.fit(x[tr, :], y[tr]).predict(x[te, :])
if all_te != set(range(x.shape[0])):
warnings.warn('Not all indices covered by Stacker', sklearn.exceptions.FitFailedWarning)
return x_trans
def _test_transform(self, x):
return self._pred.predict(x)
Here is an example of the improvement for the setting described in #MaximHaytovich's answer.
First, some setup:
from sklearn import linear_model
from sklearn import cross_validation
from sklearn import ensemble
from sklearn import metrics
y = np.random.randn(100)
x0 = (y + 0.1 * np.random.randn(100)).reshape((100, 1))
x1 = (y + 0.1 * np.random.randn(100)).reshape((100, 1))
x = np.zeros((100, 2))
Note that x0 and x1 are just noisy versions of y. We'll use the first 80 rows for train, and the last 20 for test.
These are the two predictors: a higher-variance gradient booster, and a linear predictor:
g = ensemble.GradientBoostingRegressor()
l = linear_model.LinearRegression()
Here is the methodology suggested in the answer:
g.fit(x0[: 80, :], y[: 80])
l.fit(x1[: 80, :], y[: 80])
x[:, 0] = g.predict(x0)
x[:, 1] = l.predict(x1)
>>> metrics.r2_score(
y[80: ],
linear_model.LinearRegression().fit(x[: 80, :], y[: 80]).predict(x[80: , :]))
0.940017788444
Now, using stacking:
x[: 80, 0] = Stacker(g).fit_transform(x0[: 80, :], y[: 80])[:, 0]
x[: 80, 1] = Stacker(l).fit_transform(x1[: 80, :], y[: 80])[:, 0]
u = linear_model.LinearRegression().fit(x[: 80, :], y[: 80])
x[80: , 0] = Stacker(g).fit(x0[: 80, :], y[: 80]).transform(x0[80:, :])
x[80: , 1] = Stacker(l).fit(x1[: 80, :], y[: 80]).transform(x1[80:, :])
>>> metrics.r2_score(
y[80: ],
u.predict(x[80:, :]))
0.992196564279
The stacking prediction does better. It realizes that the gradient booster is not that great.
Ok, after spending some time on googling 'stacking' (as mentioned by #andreas earlier) I found out how I could do the weighting in python even with scikit-learn. Consider the below:
I train a set of my regression models (as mentioned SVR, LassoLars and GradientBoostingRegressor). Then I run all of them on training data (same data which was used for training of each of these 3 regressors). I get predictions for examples with each of my algorithms and save these 3 results into pandas dataframe with columns 'predictedSVR', 'predictedLASSO' and 'predictedGBR'. And I add the final column into this datafrane which I call 'predicted' which is a real prediction value.
Then I just train a linear regression on this new dataframe:
#df - dataframe with results of 3 regressors and true output
from sklearn linear_model
stacker= linear_model.LinearRegression()
stacker.fit(df[['predictedSVR', 'predictedLASSO', 'predictedGBR']], df['predicted'])
So when I want to make a prediction for new example I just run each of my 3 regressors separately and then I do:
stacker.predict()
on outputs of my 3 regressors. And get a result.
The problem here is that I am finding optimal weights for regressors 'on average, the weights will be same for each example on which I will try to make prediction.
What you describe is called "stacking" which is not implemented in scikit-learn yet, but I think contributions would be welcome. An ensemble that just averages will be in pretty soon: https://github.com/scikit-learn/scikit-learn/pull/4161
Late response, but I wanted to add one practical point for this sort of stacked regression approach (which I use this frequently in my work).
You may want to choose an algorithm for the stacker which allows positive=True (for example, ElasticNet). I have found that, when you have one relatively stronger model, the unconstrained LinearRegression() model will often fit a larger positive coefficient to the stronger and a negative coefficient to the weaker model.
Unless you actually believe that your weaker model has negative predictive power, this is not a helpful outcome. Very similar to having high multi-colinearity between features of a regular regression model. Causes all sorts of edge effects.
This comment applies most significantly to noisy data situations. If you're aiming to get RSQ of 0.9-0.95-0.99, you'd probably want to throw out the model which was getting a negative weighting.
I beieve SGDClassifier() with loss='log' supports Multilabel classification and I do not have to use OneVsRestClassifier. Check this
Now, my dataset is quite big and I am using HashingVectorizer and passing result as input to SGDClassifier. My target has 42048 features.
When I run this, as follows:
clf.partial_fit(X_train_batch, y)
I get: ValueError: bad input shape (300000, 42048).
I have also used classes as the parameter as follows, but still same problem.
clf.partial_fit(X_train_batch, y, classes=np.arange(42048))
In the documentation of SGDClassifier, it says y : numpy array of shape [n_samples]
No, SGDClassifier does not do multilabel classification -- it does multiclass classification, which is a different problem, although both are solved using a one-vs-all problem reduction.
Then, neither SGD nor OneVsRestClassifier.fit will accept a sparse matrix for y. The former wants an array of labels, as you've already found out. The latter wants, for multilabel purposes, a list of lists of labels, e.g.
y = [[1], [2, 3], [1, 3]]
to denote that X[0] has label 1, X[1] has labels {2,3} and X[2] has labels {1,3}.