Does the bagging of bagging models make sense? [duplicate] - machine-learning

This question already has answers here:
Execution time of AdaBoost with SVM base classifier
(2 answers)
using random forest as base classifier with adaboost
(2 answers)
Closed 1 year ago.
Would it make sense to apply bagging of random forests? For example:
from sklearn.ensemble import BaggingRegressor

brf = BaggingRegressor(base_estimator=RandomForestModel,  # a fitted-or-unfitted random forest instance
                       n_estimators=10,
                       max_samples=1.0,
                       bootstrap=True,  # samples are drawn with replacement
                       n_jobs=n_jobs,
                       random_state=random_state).fit(X_train, Y_train)
Or to have, on a stacking model, a Random Forest as the base/final estimator?
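For reference, the stacking variant described above might look roughly like this in scikit-learn. This is a minimal sketch only; the Ridge and SVR base estimators are arbitrary placeholders, not part of the original question:
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

# Stacking with a Random Forest as the final (meta) estimator
stack = StackingRegressor(
    estimators=[('ridge', Ridge()), ('svr', SVR())],  # placeholder base estimators
    final_estimator=RandomForestRegressor(n_estimators=100),
).fit(X_train, Y_train)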


Why do we store X_test to y_preds variable in Scikit learn? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I am currently working on a machine learning project with no prior hands-on experience of machine learning or Python. I just encountered the following code online, but I don't understand why it works the way it does.
Where is the trained data stored? Is it stored in X_train or X_test?
Why did we call predict on X_test and store the result in the y_preds variable? Since the name is y_preds, I was expecting something like this:
y_preds = clf.predict(y_test)
Code:
from sklearn.model_selection import train_test_split
# Using the train_test_split() function, defining the test data size and
# storing the resulting train/test splits in separate variables
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Fitting the data into the training model defined above
clf.fit(X_train, y_train);
# Making predictions on the test data with the trained model
y_preds = clf.predict(X_test)
In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.
Learning problems fall into a few categories:
A) supervised learning, in which the data comes with additional attributes that we want to predict (see the scikit-learn supervised learning page). This problem can be either:
classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem would be handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning where one has a limited number of categories and for each of the n samples provided, one is to try to label them with the correct category or class.
regression: if the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.
B) unsupervised learning, in which the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data, which is called clustering; to determine the distribution of data within the input space, known as density estimation; or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization (see the scikit-learn unsupervised learning page).
Basically, machine learning is about learning some properties of a data set and then testing those properties against another data set. A common practice in machine learning is to evaluate an algorithm by splitting a data set into two. We call one of those sets the training set, on which we learn some properties; we call the other set the testing set, on which we test the learned properties.
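To make that last step concrete, here is a minimal sketch (assuming clf is a classifier, as in the question's code) of how y_preds is then scored against the held-out labels:
from sklearn.metrics import accuracy_score

# y_preds holds predictions for the rows of X_test, and y_test holds the
# true labels for those same rows, so the two can be compared directly.
print(accuracy_score(y_test, y_preds))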
Take a look at the link below.
https://scikit-learn.org/stable/user_guide.html
That is an excellent resource for learning all about scikit-learn. It's hard to get your mind around some of these things at first, but it's a great learning experience, and it really does work!

How to view a regression tree? [duplicate]

This question already has answers here:
How to visualize a Regression Tree in Python
(4 answers)
Visualizing decision tree in scikit-learn
(11 answers)
Python Decision Tree GraphViz
(3 answers)
Visualizing a decision tree ( example from scikit-learn )
(2 answers)
interpreting Graphviz output for decision tree regression
(1 answer)
Closed 2 years ago.
I want to visualize my decision tree applied to regression (this plot only worked for classification). What is going wrong, such that only the values appear but not the tree itself?
For a simpler approach, try:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# ... (load your data into X and y) ...

clf = DecisionTreeClassifier()
clf.fit(X, y)
tree.plot_tree(clf)  # plot_tree is a function in sklearn.tree, not a method of the classifier
The plot_tree() function uses matplotlib tools to draw the tree.
For a fancier approach, you can plot the tree using graphviz. Check out this article about this topic:
https://towardsdatascience.com/visualizing-decision-trees-with-python-scikit-learn-graphviz-matplotlib-1c50b4aa68dc
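Since the question is specifically about regression, here is a minimal sketch (with made-up toy data) using DecisionTreeRegressor and plot_tree; regression trees render the same way, with predicted values in the leaves:
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

# Toy data for illustration only
X = [[1], [2], [3], [4], [5]]
y = [1.2, 1.9, 3.1, 3.9, 5.2]

reg = DecisionTreeRegressor(max_depth=2).fit(X, y)
plot_tree(reg)  # leaves show the predicted continuous values
plt.show()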

What if I train a classifier two times? [duplicate]

This question already has an answer here:
How can i train multiple times an SVM classifier from sklearn in Python?
(1 answer)
Closed 4 years ago.
If I train a classifier two times, like:
clf.fit(X,y)
clf.fit(X,y)
Will it overwrite the existing classifier or will it just train it one time?
Yes, clf will be fit with the last data you try to fit it with. See the answer here https://stackoverflow.com/a/28884168/9458191 for more information.
Whenever you call .fit(...) on a classifier, it will only retain the new fit, essentially overwriting any previous training.
If you are using an entirely different dataset, the resulting classifier will obviously be different from what it was before the second .fit(...) call. If you are using the same dataset, then the classifier may or may not be any different. Some classifiers are deterministic in training; if that is the case, they should not be any different. Some classifiers are non-deterministic, however, and those could produce different results after the second training.
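A minimal sketch illustrating the overwriting behavior (LogisticRegression is an arbitrary choice here, not from the original question):
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.], [1.]])
y1 = np.array([0, 1])
y2 = np.array([1, 0])  # same inputs, labels flipped

clf = LogisticRegression()
clf.fit(X, y1)
print(clf.predict([[0.]]))  # [0], from the first fit
clf.fit(X, y2)              # retrains from scratch; the first fit is discarded
print(clf.predict([[0.]]))  # [1], reflecting only the second fit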

Identifying the most useful words in differentiating between classes [duplicate]

This question already has answers here:
How to get most informative features for scikit-learn classifiers?
(9 answers)
Closed 1 year ago.
Is it possible to use tf-idf (TfidfVectorizer in Python) to figure out which words are most important when trying to distinguish between two text classes (i.e., positive or negative sentiment, etc.)? For example, which words were most important for identifying the positive class, and then separately, which were most useful for identifying the negative class?
You can let scikit-learn do the heavy lifting: train a random forest on your binary classification problem, extract the classifier's feature importance ranking, and use it to get the most important words:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(data, labels)
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]  # feature indices, most important first
feature_names = vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn
top_words = []
for i in range(100):  # range, not the Python 2 xrange
    top_words.append(feature_names[indices[i]])
Note that this will only tell you which words are the most important, not what they indicate for each category. To see what each word says about each class, you can classify the individual words and examine their predicted classes.
Another option is to take all positive/negative data samples, remove the word you are trying to understand from them, and see how this affects the classification of the sample.
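A third, hedged option (not from the original answer): fit a linear classifier on the tf-idf features; its signed coefficients indicate which class each word pushes toward. A minimal sketch with made-up toy data:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great movie", "terrible film", "loved it", "awful acting"]  # toy data
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

names = vec.get_feature_names_out()
coefs = clf.coef_[0]          # positive -> class 1, negative -> class 0
order = np.argsort(coefs)
print("most negative words:", names[order[:3]])
print("most positive words:", names[order[-3:]])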

Max/min of trained TensorFlow NN [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I would like to understand what the best way is to conduct further analysis on a trained TensorFlow neural network for regression.
Specifically, I am looking at how to find further maxima/minima of a trained neural network (equivalent to finding the max/min of a regression curve). The obvious easy way is to "try out" all possible combinations and check the result set for a max/min, but testing all combinations quickly becomes a huge resource sink when there are multiple inputs and dependent variables.
Is there any way to use a trained TensorFlow neural network to conduct these further analyses?
As networks are trained incrementally, you can find the maximum incrementally.
Suppose you have a neural network with an input size of 100 (e.g. a 10x10 image) and a scalar output of size 1 (e.g. the score of the image for a given task).
You can incrementally modify the input, starting from random noise, until you obtain a local maximum of the output. All you need is the gradients of the output with respect to the input:
# TF1-style graph code (tf.compat.v1 in TensorFlow 2.x)
input = tf.Variable(tf.truncated_normal([100], mean=127.5, stddev=127.5/2.))
output = model(input)
grads = tf.gradients(output, input)[0]  # tf.gradients returns a list
learning_rate = 0.1
update_op = input.assign_add(learning_rate * grads)  # gradient ascent on the output
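To actually climb to a local maximum, you would run the update repeatedly; here is a TF1-style sketch under the same assumptions as the snippet above:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):   # the number of ascent steps is arbitrary
        sess.run(update_op)
    best_input = sess.run(input)  # an input that locally maximizes the output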
ANNs are not something that can be checked analytically. They can have millions of weights and thousands of neurons, non-linear activation functions of different types, and convolution and max-pooling layers. There is no way to analytically determine anything about them; actually, that is why networks are trained incrementally in the first place.

Resources