I have used sklearn's Pipeline and FeatureUnion in some of my projects and find them extremely useful. I was wondering if there is any Weka equivalent for them.
Thanks.
Short answer: no. Details below.
In Weka, there is the KnowledgeFlow, but this is a GUI component (weka.gui.knowledgeflow).
What you can use instead is the FilteredClassifier, which is a classifier that works on filtered data. If you want to apply several filters before the classifier, you can use a MultiFilter in place of a single filter.
If you want more flexibility, you can write your own wrapper around FilteredClassifier: keep a field List<Object> filters and apply each of these filters before invoking the classifier (buildClassifier, classifyInstance), depending on its type, for example AttributeSelection or Filter.
I have recently been looking into different filter feature selection approaches and have noted that some are better suited for numerical data (Pearson) and some are better suited for categorical data (Chi-Square).
I am working with a dataset with a mixture of both data types and am unsure about what the best practice is in terms of applying the filter methods.
Is it best to split the dataset into categorical and numerical, performing different filter methods on each set and then joining the results?
Or should only one filter method be applied to the whole dataset?
You can have a look at permutation importance. The idea is to randomly shuffle the values of one feature and observe the change in the model's error; if the feature is important, the error should increase. Unlike some statistical tests, it does not depend on the data type of the feature, and it is very straightforward to implement and analyze. link1, link2
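If you are working in Python, a minimal sketch of this idea with scikit-learn's permutation_importance (available in sklearn.inspection since version 0.22; the dataset and model below are illustrative assumptions, not from the question):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# shuffle each feature n_repeats times and measure the drop in held-out score
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")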
I'm using scikit-learn's DecisionTreeClassifier to construct a decision tree for a particular feature set. To my surprise, one feature that was thought to be significant was excluded.
Is there a way to take a peek under the hood, and figure out why the algorithm chose to exclude that feature?
Or really, get more information / analytics about any part of the decision-tree construction process?
Regarding your problem with the ignored feature, it's hard to tell why, but I can suggest "playing" with the sample_weight parameter to change the weight each sample gets, and thereby give more weight to the samples where the mentioned feature matters; you can read an excellent explanation here.
Also, for debugging, there is a way to save an image of the trained tree, as demonstrated in the documentation:
The export_graphviz exporter supports a variety of aesthetic options, including coloring nodes by their class (or value for regression) and using explicit variable and class names if desired. IPython notebooks can also render these plots inline using the Image() function:
from IPython.display import Image
from sklearn import tree
import pydotplus

# clf: the trained DecisionTreeClassifier; iris: the dataset it was fitted on
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
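You can also check directly which features the fitted tree actually used; a small sketch, assuming the same clf and iris objects as above (export_text needs scikit-learn >= 0.21):

from sklearn.tree import export_text

# impurity-based importances; 0.0 means the feature was never used in any split
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")

# plain-text dump of every split, useful to see which features/thresholds were preferred instead
print(export_text(clf, feature_names=list(iris.feature_names)))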
I have around 300 features and I want to find the best subset of features by using feature selection techniques in Weka. Can someone please tell me which method to use to remove redundant features in Weka :)
There are mainly two types of feature selection techniques that you can use in Weka:
Feature selection with wrapper method:
"Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model us used to evaluate a combination of features and assign a score based on model accuracy.
The search process may be methodical such as a best-first search, it may stochastic such as a random hill-climbing algorithm, or it may use heuristics, like forward and backward passes to add and remove features.
An example if a wrapper method is the recursive feature elimination algorithm." [From http://machinelearningmastery.com/an-introduction-to-feature-selection/]
Feature selection with filter method:
"Filter feature selection methods apply a statistical measure to assign a scoring to each feature. The features are ranked by the score and either selected to be kept or removed from the dataset. The methods are often univariate and consider the feature independently, or with regard to the dependent variable.
Example of some filter methods include the Chi squared test, information gain and correlation coefficient scores." [From http://machinelearningmastery.com/an-introduction-to-feature-selection/]
If you are using Weka GUI, then you can take a look at two of my video casts here and here.
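For comparison outside Weka, the same two families exist in scikit-learn as well; a minimal sketch of one filter method (chi-squared scores) and one wrapper method (recursive feature elimination), using an illustrative toy dataset:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# filter method: score each feature independently with a chi-squared test, keep the best k
filter_selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("chi2 scores:", filter_selector.scores_)

# wrapper method: search for a subset by repeatedly fitting a model and dropping the weakest feature
wrapper_selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print("kept by RFE:", wrapper_selector.support_)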
"Weka: training and test set are not compatible" can be solved using batch filtering but at the time of training a model I don't have test.arff. My problem caused in the command "stringToWord vector" (on CLI).
So my question is: does the caret package (R) or scikit-learn (Python) provide any alternative for this?
Note:
1. The functionality provided by StringToWordVector is a must-have requirement.
2. I don't want to retrain my model while testing because it takes a lot of time.
Given the requirements you mentioned, you can use Weka's FilteredClassifier during training and testing. I am not going to reiterate what I have already recorded as video casts here and here.
But the basic idea is not to apply StringToWordVector as a standalone filter, but rather to set it as the filter inside FilteredClassifier. You generate the model just once, and then you can apply it directly to your unlabelled data without retraining and without applying StringToWordVector again to the unlabelled data; FilteredClassifier takes care of these concerns for you.
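As for the scikit-learn alternative asked about in the question, the closest equivalent pattern is a Pipeline in which the text vectorizer (playing the role of StringToWordVector) is fitted once together with the classifier and then persisted, so unlabelled data can be scored later without refitting; a minimal sketch (the file name, vectorizer and classifier choices are illustrative assumptions):

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train_texts = ["buy cheap pills now", "let us meet for lunch", "cheap pills buy now", "see you at lunch"]
train_labels = [1, 0, 1, 0]

# fitting the pipeline fits the vectorizer's vocabulary once, on the training data only
model = Pipeline([("vec", TfidfVectorizer()), ("clf", LogisticRegression())])
model.fit(train_texts, train_labels)
joblib.dump(model, "text_model.joblib")

# later, at prediction time: no retraining, the stored vocabulary is reused automatically
model = joblib.load("text_model.joblib")
print(model.predict(["pills are cheap, buy now"]))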
I am working on a project to classify short text.
One requirement I have is that, along with vectorizing the short text, I would like to add additional features such as the length of the text, the number of URLs, etc. for each input.
Is this supported in scikit-learn?
A link to any example notebook or video would be very helpful.
Thanks,
Romit.
You can combine features extracted by different transformers (e.g. one that extracts Bag of Words (BoW) features with one that extracts other statistics) by using the FeatureUnion class.
The normalization of those features, and their small number with respect to the number of distinct BoW features, could be problematic. Whether or not this is a problem depends on the assumptions made by the models trained downstream and on the specific data and target task.
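A minimal sketch of that combination, assuming the extra statistics are the text length and a URL count as in the question (the TextStats transformer and the toy data are illustrative):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

class TextStats(BaseEstimator, TransformerMixin):
    """Extracts simple per-document statistics: text length and URL count."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array([[len(text), text.count("http")] for text in X], dtype=float)

pipeline = Pipeline([
    ("features", FeatureUnion([
        ("bow", CountVectorizer()),   # Bag-of-Words features
        ("stats", TextStats()),       # hand-crafted statistics
    ])),
    ("clf", LogisticRegression()),
])

texts = ["check this out http://example.com", "short text", "buy now http://a.io http://b.io", "see you soon"]
labels = [1, 0, 1, 0]
pipeline.fit(texts, labels)
print(pipeline.predict(["new text with a link http://c.net"]))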
I haven't used the FeatureUnion class. However, my approach was simpler and rather straightforward: extract the features from your custom pipeline and append them to what you extracted from the scikit-learn pipeline. This is nothing but appending arrays in numpy/scipy.
Precautions:
a) You must remember which feature IDs were extracted from your custom pipeline. This will help you append the arrays without mixing things up.
b) You would have to normalize your custom pipeline features (as required).
Solution:
Write a custom feature extractor class and wrap functionality like feature extraction, normalization, etc. into it.
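A minimal sketch of this appending approach, assuming the BoW matrix comes from a CountVectorizer and the custom features are the text length and a URL count (all names and data here are illustrative):

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

texts = ["short text with a url http://example.com", "another short text"]

# BoW features from the scikit-learn side (sparse, shape: n_samples x n_bow_features)
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(texts)

# custom features computed outside scikit-learn: [text length, number of URLs]
custom = np.array([[len(t), t.count("http")] for t in texts], dtype=float)

# precaution b): normalize the custom columns
custom = custom / (custom.max(axis=0) + 1e-9)

# precaution a): the custom features occupy the last two columns after appending
X = hstack([X_bow, csr_matrix(custom)])
print(X.shape)  # (n_samples, n_bow_features + 2)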