I have a machine-learning classification task that trains from the concatenation of various fixed-length vector representations. How can I perform auto feature selection or grid search or any other established technique in scikit-learn to find the best combination of transformers for my data?
Take this text classification flow as an example:
model = Pipeline([
('vectorizer', FeatureUnion(transformer_list=[
('word-freq', TfidfVectorizer()), # vocab-size dimensional
('doc2vec', MyDoc2VecVectorizer()), # 32 dimensional (custom transformer)
('doc-length', MyDocLengthVectorizer()), # 1 dimensional (custom transformer)
('sentiment', MySentimentVectorizer()), # 3 dimensional (custom transformer)
... # possibly many other transformers
])),
('classifier', SVC())
])
I suspect this may fall under the dynamic-pipeline functionality requested in scikit-learn's SLEP002. If so, how can it be handled in the interim?
While this does not quite "choose the best (all-or-nothing) subset of transformers", we can use scikit-learn's feature selection or dimensionality reduction modules to "choose/simplify the best feature subset across ALL transformers" as an extra step before classification:
model = Pipeline([
('vectorizer', FeatureUnion(transformer_list=[...])),
('feature_selector', GenericUnivariateSelect(
mode='percentile',
param=20, # percent of features to keep (0-100); hyper-tunable parameter
)),
('classifier', SVC())
])
In a feature discovery context (i.e., finding the most expressive signals), this technique is more powerful than cherry-picking transformers. However, in an architecture discovery context (i.e., finding the optimal pipeline layout and choice of transformers), the problem seems to remain open.
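As a minimal sketch of how the percentile marked "hyper-tunable" above could be searched, the snippet below wraps a similar pipeline in GridSearchCV. The synthetic data, the two stand-in transformers (StandardScaler and PCA) and the candidate percentiles are assumptions made for illustration only; they are not the transformers from the question.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import GenericUnivariateSelect, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 200 samples, 20 numeric features
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

model = Pipeline([
    # Stand-in transformers (assumption); the union yields 20 + 5 = 25 columns
    ("vectorizer", FeatureUnion(transformer_list=[
        ("scaled", StandardScaler()),
        ("pca", PCA(n_components=5)),
    ])),
    # Keep the top `param` percent of the concatenated features
    ("feature_selector", GenericUnivariateSelect(
        score_func=f_classif, mode="percentile", param=20)),
    ("classifier", SVC()),
])

# Candidate percentiles to try (illustrative values)
param_grid = {"feature_selector__param": [10, 20, 50, 100]}

search = GridSearchCV(model, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The selector is refit inside every cross-validation fold, so the chosen percentile is evaluated without leaking validation labels into feature selection.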
Does anyone have experience training a support vector machine (SVM) in Julia (1.4.1)?
I tried the LIBSVM interface, but the example on the GitHub page gave an error:
# Load Fisher's classic iris data
iris = dataset("datasets", "iris")
# LIBSVM handles multi-class data automatically using a one-against-one strategy
labels = convert(Vector, iris[:Species])
# First dimension of input data is features; second is instances
instances = convert(Array, iris[:, 1:4])'
# Train SVM on half of the data using default parameters. See documentation
# of svmtrain for options
model = svmtrain(instances[:, 1:2:end], labels[1:2:end]);
ERROR: MethodError: no method matching LIBSVM.SupportVectors(::Int32, ::Array{Int32,1}, ::CategoricalArray{String,1,UInt8,String,CategoricalValue{String,UInt8},Union{}}, ::Array{Float64,2}, ::Array{Int32,1}, ::Array{LIBSVM.SVMNode,1})
Closest candidates are:
LIBSVM.SupportVectors(::Int32, ::Array{Int32,1}, ::Array{T,1}, ::AbstractArray{U,2}, ::Array{Int32,1}, ::Array{LIBSVM.SVMNode,1}) where {T, U} at /home/benny/.julia/packages/LIBSVM/5Z99T/src/LIBSVM.jl:18
LIBSVM.SupportVectors(::LIBSVM.SVMModel, ::Any, ::Any) at /home/benny/.julia/packages/LIBSVM/5Z99T/src/LIBSVM.jl:27
It looks like the LIBSVM.jl documentation is rather outdated and the package was not updated accordingly, so it is worth opening an issue (or at least a pull request to update the README).
The error you see is not caused by the package itself, but by the fact that in current versions of DataFrames.jl and RDatasets.jl the labels column is no longer a Vector (as it was when LIBSVM.jl was developed) but a CategoricalArray. You can avoid this problem by converting the CategoricalArray to a plain Vector{String}. A complete example looks like this:
using RDatasets, LIBSVM
using StatsBase, Printf # `mean` and `printf` are no longer in Base, and should be used explicitly
# Load Fisher's classic iris data
iris = dataset("datasets", "iris")
# LIBSVM handles multi-class data automatically using a one-against-one strategy
labels = string.(convert(Vector, iris[:Species]))
# First dimension of input data is features; second is instances
instances = convert(Array, iris[:, 1:4])'
# Train SVM on half of the data using default parameters. See documentation
# of svmtrain for options
model = svmtrain(instances[:, 1:2:end], labels[1:2:end]);
# Test model on the other half of the data.
(predicted_labels, decision_values) = svmpredict(model, instances[:, 2:2:end]);
# Compute accuracy
@printf "Accuracy: %.2f%%\n" mean((predicted_labels .== labels[2:2:end]))*100
Alternatively, you can use MLJ.jl or ScikitLearn.jl, which should wrap LIBSVM.jl correctly on their own.
Oskin's answer is for an older version.
With the current version, it should be modified as follows:
using RDatasets, LIBSVM
using StatsBase, Printf # `mean` and `printf` are no longer in Base, and should be used explicitly
# Load Fisher's classic iris data
iris = dataset("datasets", "iris")
# LIBSVM handles multi-class data automatically using a one-against-one strategy
labels = string.(convert(Vector, iris[:,:Species]))
# First dimension of input data is features; second is instances
instances = Matrix(iris[:, 1:4])'
# Train SVM on half of the data using default parameters. See documentation
# of svmtrain for options
model = svmtrain(instances[:, 1:2:end], labels[1:2:end]);
# Test model on the other half of the data.
(predicted_labels, decision_values) = svmpredict(model, instances[:, 2:2:end]);
# Compute accuracy
@printf "Accuracy: %.2f%%\n" mean((predicted_labels .== labels[2:2:end]))*100
Is there an ML library with an implementation of ensemble trees (random forest or boosted) that allows randomized feature selection (max_features in the sklearn implementations of RFR and GBR, or colsample_bytree in xgboost) based on feature groups, rather than randomizing over the complete feature set, for each tree?
For example, say I have 10 independent features (F1, ..., F10), but based on subject knowledge these features can be grouped into 4 broad groups:
{## FG short for feature_group
FG1 : [F1, F2],
FG2: [F3, F4, F5],
FG3: [F6,F7,F8],
FG4: [F9, F10]
}
With max_features = 0.2 under the current randomized feature selection, each tree gets any 2 of the 10 features (one of the "10 choose 2" combinations). But I want to constrain feature selection to the group level, such that if FG1 and FG4 are chosen for a tree, then all features in those groups are selected: [F1, F2, F9, F10].
P.S. I have already created a random forest classifier that handles this using the ML-From-Scratch library and found much more robust results than with randomized feature selection, but applying post-modeling steps such as MLI (SHAP or LIME) is a challenge.
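To make the requested behaviour concrete, here is a rough sketch of group-level feature sampling built from scikit-learn decision trees. It is not an existing library feature: the helper names fit_group_forest and predict_proba_group_forest, the synthetic data and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Column indices per feature group (F1..F10 mapped to columns 0..9)
feature_groups = {
    "FG1": [0, 1],
    "FG2": [2, 3, 4],
    "FG3": [5, 6, 7],
    "FG4": [8, 9],
}

def fit_group_forest(X, y, groups, n_trees=25, groups_per_tree=2, seed=0):
    """Hypothetical helper: each tree sees all features of a few random groups."""
    rng = np.random.default_rng(seed)
    names = list(groups)
    forest = []
    for _ in range(n_trees):
        # Sample whole groups, never individual features
        picked = rng.choice(len(names), size=groups_per_tree, replace=False)
        cols = np.array(sorted(c for i in picked for c in groups[names[i]]))
        # Bootstrap sample of the rows, as in a random forest
        rows = rng.integers(0, len(X), size=len(X))
        tree = DecisionTreeClassifier(random_state=int(rng.integers(1 << 31)))
        tree.fit(X[rows][:, cols], y[rows])
        forest.append((cols, tree))
    return forest

def predict_proba_group_forest(forest, X):
    """Average the class probabilities of all trees.

    Assumes every bootstrap sample contained every class, which is
    practically certain for the balanced synthetic data used here.
    """
    return np.mean([tree.predict_proba(X[:, cols]) for cols, tree in forest],
                   axis=0)

# Synthetic stand-in data (assumption): 300 samples, 10 features, 2 classes
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
forest = fit_group_forest(X, y, feature_groups)
proba = predict_proba_group_forest(forest, X)
print(proba.shape)
```

Because each member is an ordinary DecisionTreeClassifier trained on a known column subset, per-tree explanations can in principle be computed on that subset and mapped back to the original feature names.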
I trained a logistic regression classifier in sklearn. My base feature file has 65 features; I expanded them to about 1000 by also considering quadratic combinations (using PolynomialFeatures()), and then reduced them back to 100 with SelectKBest.
However, once my model is trained and I get a new test file, it only has the 65 base features, while my model expects 100 of them.
So how can I apply SelectKBest to my test set when I do not know the labels, which are required by SelectKBest.fit()?
You shouldn't fit SelectKBest again on test data. Use the same (already fitted) SelectKBest instance as in training instead, i.e., call only the .transform method on test data, not .fit. The same applies to PolynomialFeatures.
scikit-learn provides a utility that makes managing multiple steps like this easier; it is called Pipeline. In your case it should look something like this (using the make_pipeline helper):
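from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
pipe = make_pipeline(
PolynomialFeatures(2),
SelectKBest(k=100), # k must be passed by keyword; the first positional argument is score_func
LogisticRegression()
)
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)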
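For comparison, the manual equivalent of that pipeline can be sketched as follows: every step is fitted on the training data only, and the test data goes through .transform alone. The synthetic data and the smaller sizes (10 base features, k=20) are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): 10 base features instead of 65
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit each step on the training data only
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)        # 10 -> 65 columns

selector = SelectKBest(score_func=f_classif, k=20)
X_train_sel = selector.fit_transform(X_train_poly, y_train)  # labels needed here

clf = LogisticRegression(max_iter=2000)
clf.fit(X_train_sel, y_train)

# Test data: only .transform, so no test labels are required
X_test_sel = selector.transform(poly.transform(X_test))
y_pred = clf.predict(X_test_sel)
print(X_test_sel.shape)
```

The selector remembers which columns it kept during fit, which is why the unlabeled test set ends up with the same 20 columns the classifier was trained on.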
I have time series data consisting of a vector
v=(x_1,…, x_n)
of binary categorical variables and the probabilities for four outcomes
p_1, p_2, p_3, p_4.
Given a new vector of categorical variables I want to predict the probabilities
p_1,…,p_4
The probabilities are very unbalanced with
p_1>.99 and p_2, p_3, p_4 < .01.
For example
v_1= (1,0,0,0,1,0,0,0) , p_1=.99, p_2=.005, p_3=.0035, p_4= .0015
v_2=(0,0,1,0,0,0,0,1), p_1=.99, p_2=.006, p_3=.0035, p_4= .0005
v_3=(0,1,0,0,1,1,1,0), p_1=.99, p_2=.005, p_3=.003, p_4= .002
v_4=(0,0,1,0,1,0,0,1), p_1=.99, p_2=.0075, p_3=.002, p_4= .0005
Given a new vector
v_5= (0,0,1,0,1,1,0,0)
I want to predict
p_1, p_2, p_3, p_4.
I should also note that the new vector could be identical to one of the input vectors, i.e.,
v_5=(0,0,1,0,1,0,0,1)= v_4.
My initial approach is to turn this into 4 regression problems.
The first would predict p_1, the second would predict p_2, the third would predict p_3, and the fourth would predict p_4. The problem with this is that I need
p_1+p_2+p_3+p_4=1
I’m not classifying, but should I also be worried about the unbalanced probabilities? Any ideas would be welcome.
Your suggestion of treating this as multiple regression problems plus a final normalization makes some sense, but it's known to be problematic in many cases (see, e.g., the problem of masking).
What you're describing here is multiclass (soft) classification, and there are many many known techniques for doing so. You didn't specify which language/tool/library you're using, or if you're planning on rolling your own (which only makes sense for didactic purposes). I'd suggest starting with Linear Discriminant Analysis which is very simple to understand and implement, and - despite its strong assumptions - is known to often work well in practice (see the classical book by Hastie & Tibshirani).
Irrespective of the underlying algorithm you use for soft classification (LDA or otherwise), it is not very difficult to transform aggregate input into labeled input.
Consider for example the instance
v_1= (1,0,0,0,1,0,0,0) , p_1=.99, p_2=.005, p_3=.0035, p_4= .0015
If your classifier supports instance weights, feed it 4 instances, labeled 1, 2, ..., with weights given by p_1, p_2, ..., respectively.
If it does not support instance weights, simply simulate what the law of large numbers says would happen: generate some large number n of copies of this input and, for each copy, choose a label at random with probability proportional to p_1, ..., p_4.
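The instance-weight variant can be sketched as below, using the four example vectors from the question. Note the substitution: multinomial LogisticRegression stands in for LDA here because scikit-learn's LogisticRegression.fit accepts sample_weight; this is an illustrative choice, not a recommendation from the answer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Input vectors and outcome probabilities from the question
V = np.array([
    [1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 1, 0, 0, 1],
])
P = np.array([
    [0.99, 0.005, 0.0035, 0.0015],
    [0.99, 0.006, 0.0035, 0.0005],
    [0.99, 0.005, 0.003, 0.002],
    [0.99, 0.0075, 0.002, 0.0005],
])
n_classes = P.shape[1]

# Each vector becomes 4 labeled instances (labels 0..3 stand for outcomes 1..4),
# weighted by the corresponding probability
X = np.repeat(V, n_classes, axis=0)
y = np.tile(np.arange(n_classes), len(V))
weights = P.ravel()

# Swapped-in classifier (assumption): any soft classifier with sample_weight works
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y, sample_weight=weights)

v_new = np.array([[0, 0, 1, 0, 1, 1, 0, 0]])
probs = clf.predict_proba(v_new)   # one row with 4 probabilities
print(probs, probs.sum())
```

The predicted probabilities come from a softmax, so they sum to 1 by construction and no separate normalization step is needed.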
I always have trouble understanding the significance of chi-squared test and how to use it for feature selection. I tried reading the wiki page but I didn't get a practical understanding. Can anyone explain?
The chi-squared test helps you determine the most significant features among a list of available features by measuring how strongly each feature variable is associated with the target variable.
The example below is taken from https://chrisalbon.com/machine-learning/chi-squared_for_feature_selection.html
It selects the two best features (since we assign 2 to the "k" parameter) among the 4 initially available features.
# Load libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Load iris data
iris = load_iris()
# Create features and target
X = iris.data
y = iris.target
# Convert to categorical data by converting data to integers
X = X.astype(int)
# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
X_kbest = chi2_selector.fit_transform(X, y)
type(X_kbest)
# Show results
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_kbest.shape[1])
Chi-squared feature selection is a univariate feature selection technique for categorical variables. It can also be used for continuous variables, but they need to be categorized (binned) first.
How does it work?
It tests the null hypothesis that the outcome class is independent of the categorical variable by calculating the chi-squared statistic from a contingency table; a large statistic (small p-value) is evidence against independence, i.e., that the feature carries information about the outcome. For more details on contingency tables and the chi-squared test, check the video: https://www.youtube.com/watch?v=misMgRRV3jQ
To categorize continuous data, a range of techniques is available, from simple frequency-based binning to advanced approaches such as Minimum Description Length and entropy-based binning methods.
An advantage of using the chi-squared test on a binned continuous variable is that it can capture a non-linear relationship with the outcome variable.
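As a small worked example of the contingency-table calculation, the sketch below computes the statistic by hand and checks it against scipy.stats.chi2_contingency. The counts in the table are made up for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = feature value (0/1), columns = class (A/B)
observed = np.array([
    [30, 10],
    [20, 40],
])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected
statistic = ((observed - expected) ** 2 / expected).sum()

# Cross-check with SciPy (Yates' continuity correction disabled to match)
stat_scipy, p_value, dof, _ = chi2_contingency(observed, correction=False)
print(statistic, stat_scipy, p_value, dof)
```

A small p-value means the observed counts are unlikely under independence, which is the sense in which a feature is considered informative.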
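The binning step can be sketched as follows, assuming simple quantile (frequency-based) binning with three bins per feature on the iris data, followed by a chi-squared test on each feature's contingency table. The bin count and the helper contingency_table are illustrative choices, not part of the answer.

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer

iris = load_iris()
X, y = iris.data, iris.target

# Frequency-based binning: roughly equal numbers of samples per bin
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X)

def contingency_table(feature_bins, labels):
    """Hypothetical helper: count samples per (bin, class) pair."""
    bin_ids, bin_idx = np.unique(feature_bins, return_inverse=True)
    class_ids, class_idx = np.unique(labels, return_inverse=True)
    table = np.zeros((len(bin_ids), len(class_ids)), dtype=int)
    np.add.at(table, (bin_idx, class_idx), 1)
    return table

# One chi-squared test per binned feature
stats = []
for j, name in enumerate(iris.feature_names):
    table = contingency_table(X_binned[:, j], y)
    stat, p_value, dof, _ = chi2_contingency(table)
    stats.append(stat)
    print(f"{name}: chi2={stat:.1f}, p={p_value:.2e}")
```

Features can then be ranked by their statistic (or p-value), which mirrors what SelectKBest does with its score function.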