sklearn logistic regression parameter in GridSearch - machine-learning

I'm wondering how to split parameters into groups and pass them to GridSearchCV.
I want to pass the penalties l1 and l2 to the grid search, with the solver newton-cg paired only with l2.
However, when I run the code below, the grid search first tries l1 with newton-cg and fails with this error:
ValueError: Solver newton-cg supports only l2 penalties, got l1 penalty.
Thanks
param_grid = [
{'penalty':['l1','l2'] ,
'solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}
]

Try this example:
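param_grid = [
    {'penalty': ['l1'], 'solver': ['liblinear', 'saga']},
    {'penalty': ['l2'], 'solver': ['newton-cg']},
]
Here l1 is tried only with 'liblinear' and 'saga', since of the five solvers in the question only those two accept an l1 penalty ('lbfgs' and 'sag', like 'newton-cg', support only l2 or no penalty). l2 is tried only with 'newton-cg'; add other solvers to the second dictionary if you also want them tried with l2.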

The official doc says:
... or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
So just supply a list of dictionaries, each containing a set of arguments that work together.
The GridSearchCV User Guide also includes an explicit example of this.
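To see which combinations a list of dictionaries produces, here is a minimal stdlib-only sketch of the expansion. The helper name expand_grid is made up for illustration; scikit-learn's own ParameterGrid performs an equivalent expansion, and the grid used here assumes each penalty is paired only with solvers that accept it.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield every parameter combination from a list of grid dictionaries.

    Illustrative helper only (not part of scikit-learn): each dictionary is
    expanded on its own, so values are never mixed across dictionaries.
    """
    for grid in param_grid:
        names = sorted(grid)
        for values in product(*(grid[name] for name in names)):
            yield dict(zip(names, values))


param_grid = [
    {'penalty': ['l1'], 'solver': ['liblinear', 'saga']},
    {'penalty': ['l2'], 'solver': ['newton-cg']},
]

combos = list(expand_grid(param_grid))
for combo in combos:
    print(combo)
```

Because each dictionary is expanded separately, the invalid pair of l1 with newton-cg is never generated.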

Related

Sklearn: Found input variables with inconsistent numbers of samples:

I have built a model.
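from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# `alternative` is a classifier defined elsewhere (not shown in the question)
est1_pre = ColumnTransformer(
    [('catONEHOT', OneHotEncoder(dtype='int', handle_unknown='ignore'), ['Var1'])],
    remainder='drop')
est2_pre = ColumnTransformer(
    [('BOW', TfidfVectorizer(ngram_range=(1, 3), max_features=1000), ['Var2'])],
    remainder='drop')
m1 = Pipeline([('FeaturePreprocessing', est1_pre), ('clf', alternative)])
m2 = Pipeline([('FeaturePreprocessing', est2_pre), ('clf', alternative)])
model_combo = StackingClassifier(
    estimators=[('cate', m1), ('text', m2)],
    final_estimator=RandomForestClassifier(n_estimators=10, random_state=42))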
I can successfully fit and predict using m1 and m2.
However, any attempt to call .fit or .predict on the combined model, model_combo, results in ValueError: Found input variables with inconsistent numbers of samples:
model_fitted=model_combo.fit(x_train,y_train)
x_train contains Var1 and Var2.
How do I fit model_combo?
The problem is that sklearn's text preprocessors (TfidfVectorizer in this case) operate on one-dimensional data, not two-dimensional data as most other preprocessors do. When the column is given as a list, ColumnTransformer passes a two-dimensional, single-column selection, and the vectorizer iterates over its columns, so it sees only one "document". The fix is to specify the column in the ColumnTransformer as a plain string rather than a list:
est2_pre = ColumnTransformer([('BOW', TfidfVectorizer(ngram_range=(1, 3),max_features=1000),'Var2')],remainder='drop')
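As a rough stdlib-only illustration of why the list form yields a single document: the dict below is a hypothetical stand-in for a one-column DataFrame (iterating either one yields the column labels, not the rows), and count_documents is an illustrative helper, not part of scikit-learn.

```python
def count_documents(raw_documents):
    """Count the items a text vectorizer would see when iterating its input."""
    return sum(1 for _ in raw_documents)


# Stand-in for a one-column table: iterating it yields column labels
table = {'Var2': ['red apple', 'green pear', 'ripe plum']}

# Analogous to selecting ['Var2']: the whole table is iterated
as_table = count_documents(table)
# Analogous to selecting 'Var2': the column's values are iterated
as_column = count_documents(table['Var2'])

print(as_table, as_column)
```

With one "document" on the text branch and one row per sample on the other branch, the sample counts disagree, which is what the ValueError reports.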

Ambiguity in recurrent neural network training in Julia Flux

I'm using Julia's Flux library to learn about neural networks. According to the documentation for train! (where train! takes arguments (loss, params, data, opt)):
For each datapoint d in data, compute the gradient of loss with respect to params through backpropagation and call the optimizer opt.
(see source for train!: https://github.com/FluxML/Flux.jl/blob/master/src/optimise/train.jl)
For a conventional NN based on Dense -- let's say with a one-dimensional input and output, i.e. with one feature -- this is easy to understand. Each element in data is a pair of single numbers, an independent sample of 1-d input/output values. train! does forward- and backpropagation on each pair of 1-d samples one at a time. In the process, the loss function is evaluated on each sample. (Do I have this right?)
My question is: how does this extend to a recurrent NN? Take the case of an RNN with 1-d (i.e. one feature) input and output. It seems like there's some ambiguity in how to structure the input and output data, and the results change based on the structure. As one example:
x = [[1], [2], [3]]
y = [4, 5, 6]
data = zip(x, y)
m = RNN(1, 1)
opt = Descent()
loss(x, y) = sum((Flux.stack(m.(x), 1) .- y) .^ 2)
train!(loss, params(m), data, opt)
(loss function taken from: https://github.com/FluxML/Flux.jl/blob/master/docs/src/models/recurrence.md)
In this example, when train! loops through each sample (for d in data), each value of d is a pair of single values from x and y, e.g. ([1], 4). loss is evaluated based on these single values. This is the same as in the Dense case.
On the other hand, consider:
x = [[[1], [2], [3]]]
y = [[4, 5, 6]]
m = RNN(1, 1)
opt = Descent()
loss(x, y) = sum((Flux.stack(m.(x), 1) .- y) .^ 2)
train!(loss, params(m), zip(x, y), opt)
Note that the only difference here is that x and y are nested in an extra pair of square brackets. As a result there's only one d in data, and it's a pair of sequences: ([[1], [2], [3]], [4, 5, 6]). loss can be evaluated on this version of d, and it returns a 1-d value, as required for training. But the value returned by loss is different than in any of the three results from the previous case, so the training process turns out differently.
The point is that both structures are valid in the sense that loss and train! handle them without error. Conceptually, I can make an argument for both structures being correct. But the results are different, and I assume that only one way is right. In other words, for training an RNN, should each d in data be a whole sequence, or a single element from a sequence?

Do PCA transform/project functions from sklearn/opencv libraries retain the order of the original data set?

I was wondering whether transform() from sklearn.decomposition.PCA or project() from OpenCV's C++ PCA class reorders the results relative to the original data.
pca.py:
import pandas as pd
from sklearn.decomposition import PCA
# read csv into dataframe
df = pd.read_csv('sample.csv')
columns=['x', 'y']
pca = PCA()
pca.fit(df)
t = pca.transform(df)
print(t)
sample.csv:
x,y
2.5,2.4
0.5,0.7
2.2,2.9
1.9,2.2
3.1,3.0
2.3,2.7
2,1.6
1,1.1
1.5,1.6
1.1,0.9
transformed data:
[[-0.82797019 -0.17511531]
 [ 1.77758033  0.14285723]
 [-0.99219749  0.38437499]
 [-0.27421042  0.13041721]
 [-1.67580142 -0.20949846]
 [-0.9129491   0.17528244]
 [ 0.09910944 -0.3498247 ]
 [ 1.14457216  0.04641726]
 [ 0.43804614  0.01776463]
 [ 1.22382056 -0.16267529]]
I wanted to ensure that (2.5, 2.4) corresponds to (-0.82797019, -0.17511531), (0.5,0.7) to (1.77758033,0.14285723) and so on.
The project() method of OpenCV's PCA class outputs the same values. However, OpenCV seems to sort the eigenvalues in descending order automatically, so I was wondering whether the transformed/projected data are rearranged accordingly as well.
In short, yes: the order of samples is preserved, so the i-th row of the transformed/projected output corresponds to the i-th row of the input. Eigenvalues, on the other hand, are always sorted in descending order in both sklearn and OpenCV, and in pretty much all PCA implementations, since that is the gist of PCA: to retain the directions with the largest eigenvalues. That sorting determines the order of the output columns (the components), not the order of the rows.
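To check the row correspondence without either library, here is a minimal pure-Python sketch of PCA for 2-D data, using the closed-form eigendecomposition of a 2x2 covariance matrix. The function pca_2d is a hypothetical helper written for this illustration, not the sklearn or OpenCV implementation, and the sign of each component may differ from what those libraries return.

```python
import math


def pca_2d(points):
    """Project 2-D points onto their principal axes.

    Returns (eigenvalues, projections); illustrative helper only.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum(cx * cx for cx, _ in centered) / (n - 1)
    syy = sum(cy * cy for _, cy in centered) / (n - 1)
    sxy = sum(cx * cy for cx, cy in centered) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = [half_trace + spread, half_trace - spread]

    # Unit eigenvector for the largest eigenvalue (assumes sxy != 0);
    # the second axis is perpendicular to it
    vx, vy = sxy, eigenvalues[0] - sxx
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    axes = [(vx, vy), (-vy, vx)]

    # One output row per input row, produced in input order
    projections = [tuple(cx * ax + cy * ay for ax, ay in axes)
                   for cx, cy in centered]
    return eigenvalues, projections


# The rows of sample.csv from the question
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

eigenvalues, projected = pca_2d(data)
# Feeding the rows in reverse order should reverse the output rows
_, projected_reversed = pca_2d(data[::-1])

print(eigenvalues)
print(projected[0])
```

Reversing the input rows reverses the output rows and nothing else, which is the behavior the answer describes: only the components are ordered by eigenvalue.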

Doc2vec: Only 10 docvecs in gensim doc2vec model?

I used gensim to fit a doc2vec model, with tagged documents (length>10) as training data. The goal is to get the doc vectors of all training docs, but only 10 vectors can be found in model.docvecs.
An example of the training data (length>10):
docs = ['This is a sentence', 'This is another sentence', ....]
with some preprocessing:
doc_=[d.strip().split(" ") for d in doc]
doc_tagged = []
for i in range(len(doc_)):
    tagd = TaggedDocument(doc_[i],str(i))
    doc_tagged.append(tagd)
tagged docs
TaggedDocument(words=array(['a', 'b', 'c', ..., ],
dtype='<U32'), tags='117')
fit a doc2vec model
model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(doc_tagged)
model.train(doc_tagged, total_examples= model.corpus_count, epochs= model.iter)
Then I check the final model:
len(model.docvecs)
The result is 10.
I tried other datasets (length>100, 1000) and got the same result for len(model.docvecs).
So, my questions are:
How can I use model.docvecs to get all the vectors (without using model.infer_vector)?
Is model.docvecs designed to provide all training docvecs?
The bug is in this line:
tagd = TaggedDocument(doc_[i],str(i))
Gensim's TaggedDocument expects a sequence of tags as its second argument. A string is itself a sequence of characters, so a tag such as '123' is treated as the three tags '1', '2' and '3'. As a result, every document is tagged with some combination of just 10 tags, '0' through '9', which is why model.docvecs holds only 10 vectors.
Another issue: the snippet defines docs but builds doc_ from doc, so make sure the tokenized list you pass to TaggedDocument really comes from your full corpus.
Here's the proper solution:
from gensim.models.doc2vec import TaggedDocument
docs = [doc.strip().split(' ') for doc in docs]
# Wrap each tag in a list so it stays a single tag
tagged_docs = [TaggedDocument(doc, [str(i)]) for i, doc in enumerate(docs)]
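The string-tag pitfall can be reproduced in plain Python, without gensim. This is only a sketch of the counting argument: it assumes 1000 documents tagged with str(i), and the variable names are illustrative.

```python
# A string is a sequence of characters, so treating the tag '117' as a
# sequence of tags splits it into single digits.
tags_from_string = list('117')    # ['1', '1', '7']
tags_from_list = list(['117'])    # ['117']

# Distinct tags produced for 1000 documents under each scheme
char_tags = {tag for i in range(1000) for tag in str(i)}
whole_tags = {tag for i in range(1000) for tag in [str(i)]}

print(len(char_tags), len(whole_tags))
```

However many documents there are, character-level tags can never exceed the ten digits, while list-wrapped tags give one distinct tag per document.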

How to use a ValidationMonitor for an Estimator in TensorFlow 1.0?

TensorFlow lets you combine a ValidationMonitor with several predefined estimators, such as tf.contrib.learn.DNNClassifier.
But I want to use a ValidationMonitor with my own estimator, which I have created based on 1.
For my own estimator, I first initialize a ValidationMonitor:
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(testX,testY,every_n_steps=50)
estimator = tf.contrib.learn.Estimator(model_fn=model,model_dir=direc,config=tf.contrib.learn.RunConfig(save_checkpoints_secs=1))
input_fn = tf.contrib.learn.io.numpy_input_fn({"x": x}, y, 4, num_epochs=1000)
Here I pass the monitor as shown in 2 for tf.contrib.learn.DNNClassifier:
estimator.fit(input_fn=input_fn, steps=1000,monitors=[validation_monitor])
This fails with the following error:
ValueError: Features are incompatible with given information. Given features: Tensor("input:0", shape=(?, 1), dtype=float64), required signatures: {'x': TensorSignature(dtype=tf.float64, shape=TensorShape([Dimension(None)]), is_sparse=False)}.
How can I use monitors for my own estimators?
Thanks.
The problem is solved by passing ValidationMonitor an input_fn that supplies testX and testY, instead of passing the tensors testX and testY directly.
For the record, the error occurred because ValidationMonitor expects x to be a dictionary like { 'feature_name_as_a_string' : feature_tensor }, matching the feature signature the estimator was trained with. In your training input_fn, that dictionary is built internally by the call to tf.contrib.learn.io.numpy_input_fn(...).
More information about how to build features dictionaries can be found in the Building Input Functions with tf.contrib.learn article of the documentation.
