Please explain the difference between scikit-learn's ColumnTransformer and make_column_transformer. Also, where should each be used?
There is no major difference between the two; they both give the same result.
As you can see in the docs, ColumnTransformer takes a list of (name, transformer, columns) tuples, while make_column_transformer takes the same tuples without a name. The name given to each tuple is helpful when you use GridSearchCV or RandomizedSearchCV: the estimator there can be a nested pipeline of transformers plus a classifier or regressor, and if you want to pass a param_grid to it, you address the nested parameters through those names. See the Stack Overflow question on nested pipelines and ColumnTransformer in GridSearchCV for how naming helps (a minimal sketch also follows the links below). Generally, I use make_column_transformer when I don't need GridSearchCV.
sklearn docs
stackoverflow question
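As a minimal sketch of why the tuple names matter in a search (the step names "preprocess", "scale", and "model" and the toy data are illustrative choices):

from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

pipe = Pipeline([
    ("preprocess", ColumnTransformer(
        [("scale", StandardScaler(), [0, 1])],  # "scale" is the tuple name
        remainder="passthrough",
    )),
    ("model", LogisticRegression()),
])

# The names "preprocess" and "scale" make the nested parameter addressable:
param_grid = {
    "preprocess__scale__with_mean": [True, False],
    "model__C": [0.1, 1.0],
}
GridSearchCV(pipe, param_grid, cv=3).fit(X, y)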
This is well described in the scikit-learn API reference:
This is a shorthand for the ColumnTransformer constructor; it does not require, and does not permit, naming the transformers. Instead, they will be given names automatically based on their types. It also does not allow weighting with transformer_weights.
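A minimal sketch of the shorthand, assuming illustrative column names "num" and "cat":

from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Explicit names, chosen by you:
ct = ColumnTransformer([
    ("scale", StandardScaler(), ["num"]),
    ("onehot", OneHotEncoder(), ["cat"]),
])

# Equivalent shorthand: names are generated automatically from the types,
# here "standardscaler" and "onehotencoder".
mct = make_column_transformer(
    (StandardScaler(), ["num"]),
    (OneHotEncoder(), ["cat"]),
)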
I am developing a classification base model. I have used ColumnTransformer and Pipeline for feature engineering and selection, model selection, and everything else. I wanted to encode my categorical target (dependent) variable to numeric inside the pipeline. I learned that we cannot use LabelEncoder inside either the ColumnTransformer or the Pipeline, because its fit only takes (y) and it throws the error 'TypeError: fit_transform() takes 2 positional arguments but 3 were given.' What are the alternatives for the target variable? I found a lot of questions about similar issues, but they were about features, and the recommendations were to use OneHotEncoder or OrdinalEncoder!
Basically, don't.
All (or at least most) sklearn classifiers will encode internally, and produce more useful information for you when they've been trained directly on the "real" target values. (E.g. predict will give the actual target values without you having to decode the mapping.)
(As for regression, if the target is actually ordinal in nature, you may be able to use TransformedTargetRegressor. Whether this makes sense probably depends on the model type.)
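For example, a quick sketch with illustrative toy data, showing that a classifier trained directly on string labels also predicts string labels:

from sklearn.linear_model import LogisticRegression

X = [[0.0], [0.1], [1.0], [1.1]]
y = ["cat", "cat", "dog", "dog"]  # string targets, no LabelEncoder needed

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.05], [1.05]]))  # -> ['cat' 'dog'], no decoding required
print(clf.classes_)                   # the internal encoding is exposed here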
I'm trying to create an ensemble out of a given regressor. With this in mind, I've searched for a way to use sklearn's existing ensemble methods and change the base estimator of the ensemble. The bagging documentation is clear, because it says you can change the base estimator by passing your regressor as the "base_estimator" parameter, but with GradientBoosting you can pass a regressor in the "init" parameter.
My question is: will passing my regressor in the init parameter of GradientBoosting make it use the regressor I've specified as the base estimator instead of trees? The documentation says the init value must be "An estimator object that is used to compute the initial predictions", so I don't know whether the estimator I pass in init will in fact be used as the weak learner enhanced by the boosting method, or whether it will just be used at the beginning, with all the work afterwards done by decision trees.
No.
GradientBoostingRegressor can only use regression trees as base estimators; from the docs (emphasis mine):
In each stage a regression tree is fit
And as pointed out in a relevant GitHub thread (HT to Ben Reiniger for pointing this out in the comment below):
the implementation is entirely tied to the assumption that the base estimators are trees
In order to boost arbitrary base regressors (similar to bagging), you need AdaBoostRegressor, which, again like bagging, also takes a base_estimator argument. But before doing so, you may want to have a look at my own answer in Execution time of AdaBoost with SVM base classifier; quoting:
AdaBoost (and similar ensemble methods) were conceived using decision trees as base classifiers (more specifically, decision stumps, i.e. DTs with a depth of only 1); there is good reason why, still today, if you don't specify the base_estimator argument explicitly, it assumes a value of DecisionTreeClassifier(max_depth=1). DTs are suitable for such ensembling because they are essentially unstable classifiers, which is not the case with SVMs, hence the latter are not expected to offer much when used as base classifiers.
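As a minimal sketch of boosting an arbitrary base regressor (the SVR choice, data, and hyperparameters are illustrative; note that base_estimator has been renamed to estimator in newer scikit-learn versions):

from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

# Default: shallow regression trees (DecisionTreeRegressor(max_depth=3)).
ada_default = AdaBoostRegressor(n_estimators=50, random_state=0).fit(X, y)

# Arbitrary base regressor instead of trees:
ada_svr = AdaBoostRegressor(base_estimator=SVR(kernel="rbf"),
                            n_estimators=10, random_state=0).fit(X, y)

print(ada_default.score(X, y), ada_svr.score(X, y))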
I'm using scikit-learn's DecisionTreeClassifier to construct a decision tree for a particular feature-set. To my surprise, one feature which was thought to be significant - was excluded.
Is there a way to take a peek under the hood, and figure out why the algorithm chose to exclude that feature?
Or really, get more information / analytics about any part of the decision-tree construction process?
Regarding the feature being ignored, it's hard to tell why without the data, but I can suggest "playing" with the sample_weight argument to change the weight each sample gets, and thereby give more influence to the samples for which the mentioned feature is informative; you can read an excellent explanation of sample weighting here.
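A minimal sketch of passing per-sample weights at fit time; the weighting rule is purely illustrative, with the iris data standing in for yours:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative rule: double the weight of samples where feature 3 is large,
# nudging the splits toward regions where that feature is informative.
weights = np.where(X[:, 3] > X[:, 3].mean(), 2.0, 1.0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y, sample_weight=weights)
print(clf.feature_importances_)  # how much each feature contributed to the splits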
Also, for debugging, there is a way to save an image of the trained tree, as demonstrated in the documentation:
The export_graphviz exporter supports a variety of aesthetic options, including coloring nodes by their class (or value for regression) and using explicit variable and class names if desired. IPython notebooks can also render these plots inline using the Image() function:
from IPython.display import Image
import pydotplus
from sklearn import tree

# clf: the trained classifier; iris: the dataset it was trained on
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
I have used sklearn's Pipeline and FeatureUnion in some of my projects and find them extremely useful. I was wondering if there is any WEKA equivalent for them.
Thanks.
Short answer: no. Details below.
In weka, there's KnowledgeFlow, but this is a GUI element (weka.gui.knowledgeflow).
What you can use instead is FilteredClassifier, which is a classifier that works on filtered data. If you want to apply several filters before the classifier, you can use MultiFilter in place of a single filter.
If you want more flexibility, you can wrap FilteredClassifier: create a field List<Object> filters and apply those filters before applying the classifier (in buildClassifier, classifyInstance), depending on which types of filters they are, for example AttributeSelection or Filter.
I have 20 numeric input parameters (or more) and a single output parameter, and I have thousands of such records. I need to find the relation between the input parameters and the output parameter; some of the input parameters, or even all of them, might turn out to be unrelated to the output. I want some magic system that can statistically calculate the output parameter when I provide all the input parameters, and it would be even better if this system also provided a confidence rate with the output result.
What technique (in machine learning) do I need to use to solve this problem? I think it should be a neural network, a genetic algorithm, or something related, but I'm not sure. Beyond that, I need to know the limitations of the technique.
Thanks,
Your question simply describes the regression problem, which can be solved by numerous algorithms and models, not just neural networks:
Support Vector Regression
Neural Networks
Linear regression (and many modifications and generalizations), fitted for example by the ordinary least squares (OLS) method
Nearest Neighbours Regression
Decision Tree Regression
many, many more!
Simply search for "regression methods", "regression models", etc.; in particular, the sklearn library implements many such methods.
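As a minimal sketch of the workflow (synthetic data stands in for your "20 numeric inputs, one output"; the model choice is illustrative), a random forest gives a prediction plus a rough per-sample confidence measure from the spread of its trees:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for thousands of records with 20 numeric inputs;
# only 8 of the features actually influence the output.
X, y = make_regression(n_samples=5000, n_features=20, n_informative=8,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
# Spread of the individual trees' predictions as a rough uncertainty estimate.
per_tree = np.stack([t.predict(X_test) for t in model.estimators_])
uncertainty = per_tree.std(axis=0)
print(model.score(X_test, y_test), pred[0], uncertainty[0])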
I would recommend Genetic Programming (GP), a genetic-based machine learning approach in which the learned model is a single mathematical expression/equation that best fits your data. Most GP packages out there come with a standard regression suite that you can run "as is" on your data, with minimal setup cost.