I am currently trying to use SGDRegressor from scikit-learn to solve a multivariate-target problem over a large dataset, X ~= (10^6, 10^4). As such, I am generating the design matrix (X) in parts with the following code, where each iteration produces a batch of size roughly (10^3, 10^4):
design = self.__iterX__(events)
# one SGDRegressor per target column
reglins = [linear_model.SGDRegressor(fit_intercept=True) for i in range(nTargets)]
for X, times in design:
    for i in range(nTargets):
        reglins[i].partial_fit(X, y.ix[times].values[:, i])
However, I get the following stack trace:
File ".../Enthought/Canopy_64bit/User/lib/python2.7/site- packages/sklearn/linear_model/stochastic_gradient.py", line 841, in partial_fit
coef_init=None, intercept_init=None)
File ".../Enthought/Canopy_64bit/User/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py", line 812, in _partial_fit
sample_weight, n_iter)
File ".../Enthought/Canopy_64bit/User/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py", line 948, in _fit_regressor
intercept_decay)
File "sgd_fast.pyx", line 508, in sklearn.linear_model.sgd_fast.plain_sgd (sklearn/linear_model/sgd_fast.c:8651)
ValueError: floating-point under-/overflow occurred.
Looking around, it seems that this can be caused by not normalizing X properly. I understand scikit-learn has a variety of functions for this, but given that I generate X in blocks, is it enough to simply normalize each block, or would I need to figure out a way to normalize whole columns at a time?
Incidentally, is there a particular reason that the partial_fit function does not allow multivariate targets?
You can fit the scaler on one block and apply it to the others:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
x1 = scaler.fit_transform(X_block_1)
xn = scaler.transform(X_block_n)
You can choose other normalization methods from the sklearn.preprocessing documentation.
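If single-block statistics are not representative of the whole dataset, recent versions of scikit-learn also let you fit the scaler incrementally with StandardScaler.partial_fit. A minimal sketch, assuming the block generator from the question (design / __iterX__) can be recreated for a second pass:

from sklearn import preprocessing

scaler = preprocessing.StandardScaler()

# Pass 1: accumulate the global per-column mean/variance over all blocks
for X, times in self.__iterX__(events):
    scaler.partial_fit(X)

# Pass 2: transform each block using the global statistics
for X, times in self.__iterX__(events):
    X_scaled = scaler.transform(X)
    # ... call partial_fit on the regressors with X_scaled here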
I'm scripting a toy model, both to practice PyTorch and GAN models, and I'm making sure I understand each step as much as possible.
That led me to check my understanding of the BCELoss function, and apparently I understand it... up to a factor of 2.3.
To check the results, I export the intermediate values to Excel:
tmp1 = y_pred.tolist()  # predicted values as a list (to copy/paste into Excel)
tmploss = nn.BCELoss(reduction='none')  # redefine the loss so it returns the whole BCELoss tensor
tmp2 = tmploss(y_pred, y_real).tolist()  # BCELoss values as a list (to copy/paste into Excel)
Then I copy tmp1 into Excel and calculate -log(x) for each value, which is the BCELoss formula for y_target = y_real = 1.
Then I compare the resulting values with the values of tmp2: those values are 2.3x higher than "mine".
(Sorry, I couldn't figure out how to format tables on this site...)
Can you please tell me what is happening? I feel a PEBCAK coming :-)
This is because in Excel the LOG function computes the logarithm to base 10 by default.
The standard definition of binary cross-entropy uses the logarithm to base e.
The ratio you're seeing is just ln(10) ≈ 2.302585, since ln(x) = ln(10) · log10(x).
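A quick way to verify this from Python (a minimal sketch with made-up predictions, not the question's data):

import math
import torch
import torch.nn as nn

# All targets are 1, so BCELoss reduces to -ln(p) per element
y_pred = torch.tensor([0.9, 0.5, 0.1])
y_real = torch.ones(3)
bce = nn.BCELoss(reduction='none')(y_pred, y_real)

# The Excel-style computation uses -log10(p) instead
excel = torch.tensor([-math.log10(p) for p in [0.9, 0.5, 0.1]])

print(bce / excel)  # every element is ln(10) ~= 2.302585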
I have searched for an answer to this question on the internet, including the suggestions offered while writing the title, but still to no avail, so hopefully someone can help!
I am trying to construct a confusion matrix using scikit-learn. This comes after a Keras model.
This is bizarre because I am having the following problem: for the training and test set of the original data, I can construct the confusion matrix as follows (please note this is a multi-label problem, so the data has to be subset for the different labels).
The following works fine:
cm = confusion_matrix(y_train[:,0:6].argmax(axis=1), trainpred[:,0:6].argmax(axis=1))
and likewise with 6:18 etc., until all classes have been subset. The confusion matrix that forms as a result reflects the true outcome of the Keras model.
The problem arises when I deploy the model on completely unseen data.
I deploy the model by calling model.predict() and get results as above. However, now I cannot subset the confusion matrices in the same way.
The code cm = confusion_matrix(...) etc. causes the output of the CM to have the wrong dimensions, even when specifying 0:6 etc.
I therefore used the code from above, but with the labels argument modified:
age = [0, 1, 2, 3, 4]
organ = [5, 6, 7, 8]
cm = confusion_matrix(y_train[:,0:6].argmax(axis=1), trainpred[:,0:6].argmax(axis=1), labels=age)
The FIRST label set (1:5) works perfectly... However, the next labels do not! I don't get the right values in the confusion matrices, and the matching is also incorrect for the values that are there.
To put this into context: there are over 400 samples in the unseen test data.
model.predict() shows very high and correct classification scores for most labels.
Calling cm = confusion_matrix(ytest[:,4:8], ...) etc. does indeed produce a 4x4 matrix; however, the counts sum to about 5, not 400, and the values that are there are not matched correctly.
Also, with the age labels being 0,1,2,3,4,5, subsetting ytest to 0:6 causes the correct confusion matrix to form (I am unsure why the 6 has to be included in the subset... nevertheless, I have tried different combinations with the same issue).
I have searched high and low for this answer, so I would really appreciate some assistance, as this is incredibly frustrating. Any more code/information I can provide, I will be happy to!
Many thanks!
This is happening because you are trying to subset the generated confusion matrix, but you actually have to generate a new confusion matrix manually for the specified class. If you have classes A, B, and C, you will get a 3x3 matrix. If you want to create a matrix focusing only on class A, the other classes become the negative class, but the false positives and false negatives change, and hence you cannot just sample the initial matrix. (Incidentally, the 6 has to be included in the subset because Python slices exclude the end index: y_train[:,0:6] selects columns 0 through 5.)
This is how you should actually do it:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def generate_matrix(y_true, predict, class_name):
    # Count one-vs-rest outcomes for the given class
    TP, FP, FN, TN = 0, 0, 0, 0
    for i in range(len(y_true)):
        if y_true[i] == class_name:
            if predict[i] == class_name:
                TP += 1
            else:
                FN += 1
        else:
            # true label is another class: a false positive only occurs
            # when the prediction is class_name itself
            if predict[i] == class_name:
                FP += 1
            else:
                TN += 1
    return np.array([[TP, FP],
                     [FN, TN]])

# Build and plot the new matrix
matrix = generate_matrix(actual_labels,
                         predicted_labels,
                         class_name='A')
sns.heatmap(matrix, annot=True, fmt='d')
plt.show()
This will generate a confusion matrix for class A.
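For comparison, scikit-learn itself can produce the same one-vs-rest counts if you binarize the labels first. A sketch using the actual_labels/predicted_labels names from above; note that with labels=[1, 0], confusion_matrix lays the result out as [[TP, FN], [FP, TN]], the transpose of the manual version:

from sklearn.metrics import confusion_matrix

# Binarize against class 'A': 1 = belongs to A, 0 = any other class
y_true_bin = [1 if label == 'A' else 0 for label in actual_labels]
y_pred_bin = [1 if label == 'A' else 0 for label in predicted_labels]

cm = confusion_matrix(y_true_bin, y_pred_bin, labels=[1, 0])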
In an autoregressive continuous problem, when zeros take up too much of the target, it is possible to treat the situation as a zero-inflated problem (i.e. ZIB). In other words, instead of working to fit f(x), we want to fit g(x)*f(x), where f(x) is the function we want to approximate, i.e. y, and g(x) is a function that outputs a value between 0 and 1 depending on whether a value is zero or non-zero.
Currently, I have two models: one model which gives me g(x) and another model which fits g(x)*f(x).
The first model gives me a set of weights, and this is where I need your help. I can use the sample_weight argument with model.fit(), but as I work with a tremendous amount of data, I need to use model.fit_generator(). However, fit_generator() does not have a sample_weight argument.
Is there a workaround for using sample weights inside fit_generator()? Otherwise, how can I fit g(x)*f(x), knowing that I already have a trained model for g(x)?
You can provide the sample weights as the third element of the tuple returned by the generator. From the Keras documentation on fit_generator:
generator: A generator or an instance of Sequence (keras.utils.Sequence) object in order to avoid duplicate data when using multiprocessing. The output of the generator must be either
a tuple (inputs, targets)
a tuple (inputs, targets, sample_weights).
Update: Here is a rough sketch of a generator that returns the input samples and targets as well as the sample weights obtained from model g(x):
def gen(args):
    while True:
        for i in range(num_batches):
            # get the i-th batch of data
            inputs = ...
            targets = ...
            # get the per-sample weights from the trained g model
            weights = g.predict(inputs)
            yield inputs, targets, weights

model.fit_generator(gen(args), steps_per_epoch=num_batches, ...)
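If you train with multiprocessing, a keras.utils.Sequence object (the other option mentioned in the documentation quoted above) does the same job. A rough sketch under the same assumptions, where X and y are arrays, g is the trained weighting model, and the class name and batch size are mine:

import numpy as np
from keras.utils import Sequence

class WeightedBatches(Sequence):
    def __init__(self, X, y, g, batch_size):
        self.X, self.y, self.g, self.batch_size = X, y, g, batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.X) / float(self.batch_size)))

    def __getitem__(self, i):
        sl = slice(i * self.batch_size, (i + 1) * self.batch_size)
        inputs, targets = self.X[sl], self.y[sl]
        # flatten in case g outputs shape (n, 1): Keras expects one weight per sample
        weights = self.g.predict(inputs).ravel()
        return inputs, targets, weights

model.fit_generator(WeightedBatches(X, y, g, batch_size=32), ...)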
I'm trying to use scikit-learn to cluster text documents. On the whole, I find my way around, but I have problems with specific issues. Most of the examples I found illustrate clustering using scikit-learn with k-means as the clustering algorithm. Adapting these k-means examples to my setting works in principle. However, k-means is not suitable since I don't know the number of clusters. From what I have read so far -- please correct me here if needed -- DBSCAN or MeanShift seem to be more appropriate in my case. The scikit-learn website provides examples for each clustering algorithm. The problem is that with both DBSCAN and MeanShift I get errors I cannot comprehend, let alone solve.
My minimal code is as follows:
docs = []
for item in [database]:
    docs.append(item)

vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(docs)
X = X.todense()  # <-- This line was needed to resolve the issue
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
...
(My documents are already preprocessed, i.e., stopwords have been removed and a Porter stemmer has been applied.)
When I run this code, I get the following error when instantiating DBSCAN and calling fit():
...
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 248, in fit
clust = dbscan(X, **self.get_params())
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 86, in dbscan
n = X.shape[0]
IndexError: tuple index out of range
Clicking on the line in dbscan_.py that throws the error, I noticed the following lines:
...
X = np.asarray(X)
n = X.shape[0]
...
When I use these two lines directly in my code for testing, I get the same error. I don't really know what np.asarray(X) is doing here, but after that command, X.shape == (). Hence X.shape[0] bombs -- before, X.shape[0] correctly referred to the number of documents. Out of curiosity, I removed X = np.asarray(X) from dbscan_.py. When I do that, something computes heavily, but after some seconds I get another error:
...
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 214, in extractor
(min_indx,max_indx) = check_bounds(indices,N)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 198, in check_bounds
max_indx = indices.max()
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 17, in _amax
out=out, keepdims=keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity
In short, I have no clue how to get DBSCAN working, or what I might have missed, in general.
It looks like sparse representations for DBSCAN are supported as of Jan. 2015.
I upgraded sklearn to 0.16.1 and it worked for me on text.
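With a sufficiently recent version, the dense conversion becomes unnecessary. A minimal sketch, where docs is the list of preprocessed documents from the question, and metric='cosine' is my choice (a common one for tf-idf vectors), not something the question prescribes:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(docs)  # stays sparse; no .todense() needed

db = DBSCAN(eps=0.3, min_samples=10, metric='cosine').fit(X)
print(db.labels_)  # -1 marks noise points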
The implementation in sklearn seems to assume you are dealing with a finite vector space and wants to find the dimensionality of your data set. Text data is commonly represented as sparse vectors, not as a dense data matrix.
Your input data probably isn't a data matrix, but the sklearn implementation needs it to be one.
You'll need to find a different implementation. Maybe try the implementation in ELKI, which is very fast, and should not have this limitation.
You'll need to spend some time understanding similarity first. For DBSCAN, you must choose epsilon in a way that makes sense for your data. There is no rule of thumb; this is domain-specific. You first need to figure out which similarity threshold means that two documents are similar.
Mean Shift may actually need your data to be a vector space of fixed dimensionality.
I am struggling to use Random Forest in Python with scikit-learn. My problem is that I use it for text classification (in 3 classes - positive/negative/neutral), and the features I extract are mainly words/unigrams, so I need to convert these to numerical features. I found a way to do it with DictVectorizer's fit_transform:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
rf = RandomForestClassifier(n_estimators=100)
trainFeatures1 = vec.fit_transform(trainFeatures)

# Fit the training data to the training output and create the decision trees
# (with sparse=False the output is already a dense array, so no .toarray() is needed)
rf = rf.fit(trainFeatures1, LabelEncoder().fit_transform(trainLabels))

testFeatures1 = vec.fit_transform(testFeatures)

# Take the same decision trees and run on the test data
Output = rf.score(testFeatures1, LabelEncoder().fit_transform(testLabels))
print "accuracy: " + str(Output)
My problem is that the fit_transform method works on the train dataset, which contains around 8,000 instances, but when I try to convert my test set to numerical features too, which contains around 80,000 instances, I get a memory error:
testFeatures1 = vec.fit_transform(testFeatures)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 143, in fit_transform
return self.transform(X)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 251, in transform
Xa = np.zeros((len(X), len(vocab)), dtype=dtype)
MemoryError
What could possibly cause this and is there any workaround? Many thanks!
You are not supposed to call fit_transform on your test data, only transform. Otherwise, you will get a different vectorization than the one used during training.
For the memory issue, I recommend TfidfVectorizer, which has numerous options for reducing the dimensionality (by removing rare unigrams, etc.).
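For the vectorization mismatch, the pattern looks like this (a sketch reusing the variable names from the question; sparse=True is my assumption, and it also helps with memory, since the traceback shows the dense np.zeros allocation is what fails):

vec = DictVectorizer(sparse=True)                  # keep the sparse representation
trainFeatures1 = vec.fit_transform(trainFeatures)  # learn the vocabulary on the training set only
testFeatures1 = vec.transform(testFeatures)        # reuse that vocabulary for the test set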
UPDATE
If the only problem is transforming the test data, simply split it into small chunks. Instead of something like
x = vect.transform(test)
eval(x)
you can do
K = 10
size = len(test) / K
for i in range(K):
    x = vect.transform(test[i*size : (i+1)*size])
    eval(x)
and record the results/stats and analyze them afterwards.
In particular:
from sklearn.metrics import accuracy_score

predictions = []
K = 10
size = len(test) / K
for i in range(K):
    x = vect.transform(test[i*size : (i+1)*size])
    predictions += list(rf.predict(x))  # predict returns an array, so convert it to a list
print accuracy_score(true_labels, predictions)