Pipeline using CountVectorizer (max_df) before tfidf - machine-learning

Currently i am not sure if this quation is for stackoverflow or another more theoretical statistical QA. But im confused about the following.
I am doing a binairy tekst classification task. For this task i use a pipeline, one of the example codes is below:
pipeline = Pipeline([
('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', LogisticRegression())
])
parameters = {
'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
'vect__stop_words': [None, stopwords.words('dutch'), stopwordList],
'clf__C': [0.1, 1, 10, 100, 1000]
}
So nothing really strange about this, but then i started playing with the parameter options/settings and noted that the code below (so the steps and parameters in the code) had the highest accuracy score (f1 score):
pipeline = Pipeline([
('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', LogisticRegression())
])
parameters = {
'vect__ngram_range': [(1,1)],
'vect__stop_words': [None],
'vect__max_df': [0.2],
'vect__max_features': [10000],
'clf__C': [100]
}
So im pleased to sort of find out with which parameter settings and methods i get the highest score, but i cant figure out the exact meaning. As with the 'vectorizor'-step the settings for max_df (ignoring terms that appear in more 20% of the documents) seems to be strange to apply before tfidf (or somehow double)
Furthermore it also uses max_features of 10.000. What step is used before the max_df or the max_features? and how do i interpret max_features setting this parameter and doing tfidf afterwards. Does it then perform a tfidf over the 10.000 features?
For me it seems rather strange to do a tfidf after using parameters such as max_df and max_features? Am i correct? and why? or should i just do what gives the highest outcome..
I hope someone can help me in the right direction, thanks a lot in advance.

Related

Complex nesting within imblearn pipelines

I have been trying to find a solution to this but unsuccessfully so far.
I am working with some data for which I need to adopt a resampling procedure within a (scikit-learn/imblearn) pipeline, meaning that the size of both the samples and targets has to change within the pipeline. In order to do this I am using FunctionSampler from imblearn.
My problem is that the main pipeline is composed of steps which are, actually, pipelines themselves, which is giving me some problems.
The code below shows an extremely simplified version of the scenario I am working in. Please note this is not the actual code I am using (the transformers/classifiers are different and many more in the original code), only the structure is similar.
# pipeline definition
from sklearn.preprocessing import StandardScaler, Normalizer, PolynomialFeatures
from sklearn.feature_selection import VarianceThreshold, SelectKBest
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
# from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline
from imblearn import FunctionSampler
def outlier_extractor(X, y):
# just an example
return X, y
pipe = Pipeline(steps=[("feature_engineering", PolynomialFeatures()),
("variance_threshold", VarianceThreshold()),
("outlier_correction", FunctionSampler(func=outlier_extractor)),
("classifier", QuadraticDiscriminantAnalysis())])
# definition of the feature engineering options
feature_engineering_options = [
Pipeline(steps=[
("scaling", StandardScaler()),
("PCA", PCA(n_components=3))
]),
Pipeline(steps=[ # add div and prod features
("polynomial", PolynomialFeatures()),
("kBest", SelectKBest())
])
]
outlier_correction_options = [
FunctionSampler(func=outlier_extractor),
Pipeline(steps=[
("center_scaling", StandardScaler()),
("normalisation", Normalizer(norm="l2"))
])
]
# definition of the parameters to optimize in the pipeline
params = [ # support vector machine
{"feature_engineering": feature_engineering_options,
"variance_threshold__threshold": [0, 0.5, 1],
"outlier_correction": outlier_correction_options,
"classifier": [SVC()],
"classifier__C": [0.1, 1, 10, 50],
"classifier__kernel": ["linear", "rbf"],
},
# quadratic discriminant analysis
{"feature_engineering": feature_engineering_options,
"variance_threshold__threshold": [0, 0.5, 1],
"outlier_correction": outlier_correction_options,
"classifier": [QuadraticDiscriminantAnalysis()]
}
]
When using GridSearchCV(pipe, param_grid=params) I receive the error TypeError: All intermediate steps of the chain should be estimators that implement fit and transform or fit_resample. I know that I should unpack the pipelines, and I have also tried to follow this and this in order to solve the problem but my case seems (to me, at least) more complicated and I could not get these workarounds to work.
Any help/suggestion is very much appreciated. Thanks

RandomizedSearchCV does not work in a Pipeline when evaluating alternative classifiers

In Scikit Learn, RandomizedSearchCV can work for evaluating different parameters in a pipeline, but only in some cases where the classifiers share similar/same parameters. When you pass blocks of parameters for different classifiers it fails when GridSearchCV succeeds.
You will notice in the code below, the problem setup is the same for gridsearch and random search but only random search fails.
numpy.random.seed(52)
MY_RAND_SEED=numpy.random.seed(52)
pipe = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler' , StandardScaler()),
('classify', LogisticRegression())
])
X, y = make_classification(n_samples= 500, n_features=58, n_redundant=13, n_informative=7, n_clusters_per_class=2)
param_grid_linear = [
{'classify' : [LogisticRegression(),],
'classify__penalty' : ['l1', 'l2'],
'classify__C' : numpy.logspace(-4, 4, 50),
'classify__solver' : ['liblinear']},
{'classify' : [LogisticRegression(),],
'classify__penalty' : ['l2'],
'classify__C' : numpy.logspace(-4, 4, 50),
'classify__solver' : ['lbfgs']},
{'classify': [SVC(),],
'classify__kernel': ['linear',],
'classify__C': numpy.linspace(0.001,200, 10),},
]
innercv=StratifiedKFold(n_splits=5, shuffle=True, random_state=numpy.random.seed(52))
gridA = GridSearchCV(pipe, param_grid_linear, scoring='accuracy', iid=False, verbose=1, n_jobs=12)
gridA.fit(X, y)
print("finished grid search")
gridB = RandomizedSearchCV(pipe, param_grid_linear, scoring='accuracy', n_iter=5, iid=False, verbose=1, n_jobs=12)
gridB.fit(X, y)
Apparently, usually one passes only lists of dictionaries as parameters {dict, dict, dict}, but to do what I suggested above requires passing a list of lists of dictionaries, which is currently only accepted by GridSearchCV.
RandomizedSearchCV does not accept this now but will in a future release of sklearn. Here is a response I received on GitHub:
From: Thomas J Fan
Date: Wednesday, August 14, 2019 at 7:13 PM
This was addressed in #14549
This feature is not released yet, but you can try it out by installing the nightly build of scikit-learn:
pip install --pre -f https://sklearn-nightly.scdn8.secure.raxcdn.com scikit-learn

Constructing discrete table-based CPDs in tensorflow-probablity?

I'm trying to construct the simplest example of Bayesian network with several discrete random variables and conditional probabilities (the "Student Network" from Koller's book, see 1)
Although a bit unwieldy, I managed to build this network using pymc3. Especially, creating the CPDs is not that straightforward in pymc3, see the snippet below:
import pymc3 as pm
...
with pm.Model() as basic_model:
# parameters for categorical are indexed as [0, 1, 2, ...]
difficulty = pm.Categorical(name='difficulty', p=[0.6, 0.4])
intelligence = pm.Categorical(name='intelligence', p=[0.7, 0.3])
grade = pm.Categorical(name='grade',
p=pm.math.switch(
theano.tensor.eq(intelligence, 0),
pm.math.switch(
theano.tensor.eq(difficulty, 0),
[0.3, 0.4, 0.3], # I=0, D=0
[0.05, 0.25, 0.7] # I=0, D=1
),
pm.math.switch(
theano.tensor.eq(difficulty, 0),
[0.9, 0.08, 0.02], # I=1, D=0
[0.5, 0.3, 0.2] # I=1, D=1
)
)
)
letter = pm.Categorical(name='letter', p=pm.math.switch(
...
But I have no idea how to build this network using tensoflow-probability (versions: tfp-nightly==0.7.0.dev20190517, tf-nightly-2.0-preview==2.0.0.dev20190517)
For the unconditioned binary variables, one can use categorical distribution, such as
from tensorflow_probability import distributions as tfd
from tensorflow_probability import edward2 as ed
difficulty = ed.RandomVariable(
tfd.Categorical(
probs=[0.6, 0.4],
name='difficulty'
)
)
But how to construct the CPDs?
There are few classes/methods in tensorflow-probability that might be relevant (in tensorflow_probability/python/distributions/deterministic.py or the deprecated ConditionalDistribution) but the documentation is rather sparse (one needs deep understanding of tfp).
--- Updated question ---
Chris' answer is a good starting point. However, things are still a bit unclear even for a very simple two-variable model.
This works nicely:
jdn = tfd.JointDistributionNamed(dict(
dist_x=tfd.Categorical([0.2, 0.8], validate_args=True),
dist_y=lambda dist_x: tfd.Bernoulli(probs=tf.gather([0.1, 0.9], indices=dist_x), validate_args=True)
))
print(jdn.sample(10))
but this one fails
jdn = tfd.JointDistributionNamed(dict(
dist_x=tfd.Categorical([0.2, 0.8], validate_args=True),
dist_y=lambda dist_x: tfd.Categorical(probs=tf.gather_nd([[0.1, 0.9], [0.5, 0.5]], indices=[dist_x]))
))
print(jdn.sample(10))
(I'm trying to model categorical explicitly in the second example just for learning purposes)
-- Update: solved ---
Obviously, the last example wrongly used tf.gather_nd instead of tf.gather as we only wanted to select the first or the second row based on the dist_x outome. This code works now:
jdn = tfd.JointDistributionNamed(dict(
dist_x=tfd.Categorical([0.2, 0.8], validate_args=True),
dist_y=lambda dist_x: tfd.Categorical(probs=tf.gather([[0.1, 0.9], [0.5, 0.5]], indices=[dist_x]))
))
print(jdn.sample(10))
The tricky thing about this, and presumably the reason it's subtler than expected in PyMC, is -- as with almost everything in vectorized programming -- handling shapes.
In TF/TFP, the (IMO) nicest way to solve this is with one of the new TFP JointDistribution{Sequential,Named,Coroutine} classes. These let you naturally represent hierarchical PGM models, and then sample from them, evaluate log probs, etc.
I whipped up a colab notebook demoing all 3 approaches, for the full student network: https://colab.research.google.com/drive/1D2VZ3OE6tp5pHTsnOAf_7nZZZ74GTeex
Note the crucial use of tf.gather and tf.gather_nd to manage the vectorization of the various binary and categorical switching.
Have a look and let me know if you have any questions!

Why does cross_val_score always report the same score for each fold?

I'm running a Scikit-learn pipeline using cross validation via cross_val_score.
But after running it a couple of times the results for each fold are always the same. I'm bothered by this because shouldn't the splits be random?
This is the relevant part of my code:
pipeline = Pipeline([
('vect', CountVectorizer(preprocessor=clean_text_custom, max_features=MAX_NB_WORDS, strip_accents='unicode')),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC(),n_jobs=-1)),
])
cross_val_score(pipeline, data, binary_label_data, cv=5,scoring='f1_micro')
# array([ 0.25129587, 0.37780563, 0.33195376, 0.31269861, 0.14555337])
# then i run it again and I get he exact same scores for each fold
cross_val_score(pipeline, data, binary_label_data, cv=5,scoring='f1_micro')
# array([ 0.25129587, 0.37780563, 0.33195376, 0.31269861, 0.14555337])

TensorFlow Classification Using Dataset

I need to utilize TensorFlow for a project to classify items based on their attributes to a certain class (either 1, 2, or 3).
Only problem is almost every TF tutorial or example I find online is about image recognition or text classification. I can't find anything about classification based on numbers. I guess what I'm asking for is where to get started. If anyone knows of a relevant example, or if I'm just thinking about this completely wrong.
We are given the 13 attributes for each item, and need to use the TF neural network to classify each item correctly (or mark the margin of error). But nothing online is showing me even how to start with this kind of dataset.
Example of dataset: (first value is class, other values are attributes)
2, 11.84, 2.89, 2.23, 18, 112, 1.72, 1.32, 0.43, 0.95, 2.65, 0.96, 2.52, 500
3, 13.69, 3.26, 2.54, 20, 107, 1.83, 0.56, 0.5, 0.8, 5.88, 0.96, 1.82, 680
3, 13.84, 4.12, 2.38, 19.5, 89, 1.8, 0.83, 0.48, 1.56, 9.01, 0.57, 1.64, 480
2, 11.56, 2.05, 3.23, 28.5, 119, 3.18, 5.08, 0.47, 1.87, 6, 0.93, 3.69, 465
1, 14.06, 1.63, 2.28, 16, 126, 3, 3.17, 0.24, 2.1, 5.65, 1.09, 3.71, 780
Suppose you have the data in a file, data.txt. You can use Numpy to read this:
import numpy as np
xy = np.loadtxt('data.txt', unpack=True, dtype='float32')
x_data = xy[1:]
y_data = xy[0];
More information: http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.loadtxt.html
Perhaps, you may need 'np.transpose' depends on the shape of your weights and operations.
x_data = np.transpose(xy[1:])
Then, use 'placeholders' and 'feed_dict' to train/test your model:
X = tf.placeholder("float", ...
Y = tf.placeholder("float", ...
....
with tf.Session() as sess:
....
sess.run(optimizer, feed_dict={X:x_data, Y:y_data})
for this kind problem TensorFlow have an in depth tutorial here
or in toward data science here
if your looking for videos to start i think sentdex's tutorials on the titanic data-set
is what your looking for although he is using k means to do the classification
(actually I think his entire deep learning/machine learning playlist is great to start with)
you can find it here
otherwise if your looking for basic how to start
first prepossessing:
try first separating the data into class labels and inputs (pandas lib should be able to help you with this)
make your class labels into a one-hot array
than normalize the data:
it looks like your different data attributes have wildly different ranges, make sure to get them all in the same range between 0 and 1
build your model:
a simple fully connected net should do the trick
remember to make the output layer the same size as the number of classes you have
use an argmax function on the output of the finale layer to decide which class the model thinks is the proper classification

Resources