I ran the Gradient Boosted Tree classifier on multiclass data and printed the resulting trees. Some of the trees don't show any condition to split the node and predict a value directly. See the example below (only a few trees are shown):
println(gradient_boosted_tree_classifier1_model.toDebugString)
Tree 3 (weight 0.2):
Predict: 0.07481560563940803
Tree 4 (weight 0.2):
If (feature 10 <= 18.7)
Predict: 0.386736373979052
Else (feature 10 > 18.7)
Predict: -0.26852945052867006
Tree 5 (weight 0.2):
Predict: 0.05051789101360473
The same issue has been observed with the Random Forest classifier as well.
Any thoughts on this?
Related
I'm new to machine learning. I noticed that a random forest classifier is composed of decision trees, which rely on statistics to classify a sample. Is it possible for a random forest to erroneously classify a sample that was in its training set?
Yes.
If the depth of the decision trees is not big enough to capture the essence of the data.
For example, let's consider data with two features, X1 and X2, where
target = 1 if X1 > 5 and X2 > 10, else target = 0
With depth one, a decision tree has to rely on only one of the features. For example, if the decision tree uses feature X1 to construct the split, both samples (7, 15) and (7, 7) will be classified as 1, which is wrong for (7, 7), because X2 = 7 < 10.
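A quick way to see this is to fit a depth-1 tree on data generated from exactly that rule; the stump misclassifies some of its own training points, while a slightly deeper tree fits them all. This is only a sketch with scikit-learn, and the sampled points are made up for illustration:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(500, 2))              # columns: X1, X2
y = ((X[:, 0] > 5) & (X[:, 1] > 10)).astype(int)   # target = 1 if X1 > 5 and X2 > 10, else 0

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
deeper = DecisionTreeClassifier(max_depth=2).fit(X, y)

# A single split cannot represent the AND of two thresholds, so the stump
# misclassifies some of its own training samples; two levels are enough here.
print("depth 1 training accuracy:", stump.score(X, y))
print("depth 2 training accuracy:", deeper.score(X, y))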
I'd like to classify a set of 3D images (MRI). There are 4 classes (i.e. grades of disease A, B, C, D), where the distinction between the 4 grades is not trivial; therefore the labels I have for the training data are not one class per image. Each label is a set of 4 probabilities, one per class, e.g.
0.7 0.1 0.05 0.15
0.35 0.2 0.45 0.0
...
... would basically mean that
The first image belongs to class A with a probability of 70%, class B with 10%, C with 5% and D with 15%
etc., I'm sure you get the idea.
I don't understand how to fit a model with these labels, because scikit-learn classifiers expect only one label per training sample. Using just the class with the highest probability results in miserable results.
Can I train my model with scikit-learn multilabel classification (and how)?
Please note:
Feature extraction is not the problem.
Prediction is not the problem.
Can I handle this somehow with the multilabel classification framework?
For predict_proba to return the probability for each class A, B, C, D the classifier needs to be trained with one label per image.
If yes: How?
Use the image class as the label (Y) in your training set. That is, your input dataset will look something like this:
F1 F2 F3 F4 Y
1 0 1 0 A
0 1 1 1 B
1 0 0 0 C
0 0 0 1 D
(...)
where the F# columns are the features for each image and Y is the class as classified by the doctors.
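As a rough sketch of that setup (the feature values and the choice of RandomForestClassifier are just placeholders), training on one label per image still gives you the per-class probabilities at prediction time via predict_proba:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# F1..F4 are the per-image features from the table above; Y is the doctors' class.
X = np.array([[1, 0, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 0, 0],
              [0, 0, 0, 1]])
y = np.array(["A", "B", "C", "D"])

clf = RandomForestClassifier(random_state=0).fit(X, y)

# One probability per class A, B, C, D for a new (hypothetical) image.
print(clf.classes_)
print(clf.predict_proba([[1, 0, 1, 1]]))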
If no: Any other approaches?
For the case where you have more than one label per image (that is, multiple potential classes or their respective probabilities), multilabel models might be a more appropriate choice, as documented in Multiclass and multilabel algorithms.
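If you do go the multilabel route, one admittedly lossy option is to threshold the probability vectors into a 0/1 indicator matrix and fit a one-vs-rest model; the features, probability vectors, and the 0.25 cutoff below are all placeholders:

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))              # stand-in image features
P = rng.dirichlet(np.ones(4), size=100)    # stand-in per-class probability labels
Y = (P >= 0.25).astype(int)                # multilabel targets: classes above the cutoff

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(clf.predict(X[:2]))                  # one 0/1 flag per class A, B, C, D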
I am new to neural networks.
I have a training dataset of 1K examples; each example contains 5 features.
Initially, I assigned some values to the weights.
So, are 1K sets of weight values stored, one for each example, or do the weight values remain the same for all 1K examples?
For example:
example1 => [f1,f2,f3,f4,f5] -> [w1e1,w2e1,w3e1,w4e1,w5e1]
example2 => [f1,f2,f3,f4,f5] -> [w1e2,w2e2,w3e2,w4e2,w5e2]
Here w1 means the first weight, and e1, e2 denote different examples.
Or: example1, example2, ... -> [gw1,gw2,gw3,gw4,gw5]
Here g means global, w1 means the weight for feature one, and so on.
Start with a single node in the neural network. Its output is the sigmoid function applied to a linear combination of the inputs plus a bias:
output = sigmoid(w1*f1 + w2*f2 + w3*f3 + w4*f4 + w5*f5 + b)
So for 5 features you will have 5 weights + 1 bias for each node of the neural network. While training, a batch of inputs is fed in, the output at the end of the network is calculated, the error is computed with respect to the actual outputs, and the gradients are backpropagated based on that error. In simple words, the weights are adjusted based on the error.
So for each node you have 6 weights (counting the bias), and depending on the number of nodes (which depends on the number of layers and the size of the layers) you can calculate the total number of weights. The weights are global: there is one set per node, not one per training example, and all of them are updated once per batch (since you are doing batch training).
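A minimal numpy sketch of one such sigmoid node trained on batches (the toy data, learning rate, and batch size are made up) shows that there is a single global weight vector shared by all 1K examples, not one per example:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                      # 1K examples, 5 features each
y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 1.0]) > 0).astype(float)    # toy targets

w = rng.normal(size=5)     # 5 weights, shared by every example
b = 0.0                    # plus 1 bias -> 6 parameters for this node
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for X_batch, y_batch in zip(np.split(X, 10), np.split(y, 10)):      # 10 batches of 100 examples
    p = sigmoid(X_batch @ w + b)                                    # same w and b used for all examples
    grad_w = X_batch.T @ (p - y_batch) / len(y_batch)               # gradient of the cross-entropy loss
    grad_b = np.mean(p - y_batch)
    w -= lr * grad_w                                                # one update per batch
    b -= lr * grad_b

print(w, b)   # still only 6 numbers, no matter how many training examples there are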
The computational complexity of the algorithm given training set D is O(n × |D| × log(|D|)), where n is the number of attributes describing the tuples in D and |D| is the number of training tuples in D. This means that the computational cost of growing a tree grows at most as n × |D| × log(|D|) with |D| tuples. I am not able to understand the log(|D|) part specifically.
Reference: Data Mining: Concepts and Techniques, 2nd edition, page 296, topic: Classification and Prediction (Chapter 6).
The height of a balanced tree on |D| tuples is at most O(log(|D|)). At each level the algorithm examines all |D| tuples over the n attributes, and there are at most O(log(|D|)) levels, which is where the log(|D|) factor in n × |D| × log(|D|) comes from. Is your tree balanced?
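As a rough back-of-the-envelope check (the concrete numbers are made up for illustration), the bound just multiplies the number of levels by the work done per level:

import math

n = 20           # attributes per tuple
D = 1_000_000    # training tuples

levels = math.log2(D)       # a balanced tree has about log2(|D|) levels
work_per_level = n * D      # each level scans every tuple over every attribute
print(levels)                     # roughly 20 levels
print(work_per_level * levels)    # roughly 4e8 units of work, i.e. n * |D| * log(|D|)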
I previously asked for an explanation of linearly separable data. Still reading Mitchell's Machine Learning book, I have some trouble understanding why exactly the perceptron rule only works for linearly separable data.
Mitchell defines a perceptron as follows:
o(x1, ..., xn) = 1 if w0 + w1*x1 + ... + wn*xn > 0, and -1 otherwise.
That is, the output is 1 or -1 depending on whether the weighted sum of the inputs exceeds some threshold.
Now, the problem is to determine a weight vector that causes the perceptron to produce the correct output (1 or -1) for each of the given training examples. One way of achieving this is through the perceptron rule:
One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example. This process is repeated, iterating through the training examples as many times as needed until the perceptron classifies all training examples correctly. Weights are modified at each step according to the perceptron training rule, which revises the weight wi associated with input xi according to the rule:
wi <- wi + eta * (t - o) * xi
where eta is the learning rate, t is the target output, and o is the perceptron's output.
So, my question is: Why does this only work with linearly separable data? Thanks.
Because the dot product of w and x is a linear combination of the x_i's, you are, in fact, splitting your data into 2 classes with a hyperplane: a_1*x_1 + ... + a_n*x_n > 0.
Consider a 2D example: X = (x, y) and W = (a, b); then X * W = a*x + b*y. sgn returns 1 if its argument is greater than 0, that is, for class #1 you have a*x + b*y > 0, which is equivalent to y > -(a/b)*x (assuming b > 0; the inequality flips if b < 0). This equation is linear and divides the 2D plane into 2 half-planes. The perceptron rule can only stop updating once every training example lies on the correct side of such a line (or hyperplane in higher dimensions), so it only converges when the data are linearly separable.
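A small sketch of the perceptron training rule quoted above makes this concrete (the data sets, learning rate, and epoch cap are illustrative): on linearly separable data the updates eventually stop, while on XOR-like data, which no line can separate, mistakes keep occurring:

import numpy as np

def train_perceptron(X, t, eta=0.1, max_epochs=100):
    # Mitchell-style rule: w_i <- w_i + eta * (t - o) * x_i on each misclassified example.
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend x0 = 1 so w0 acts as the threshold
    w = np.zeros(X.shape[1])
    for epoch in range(max_epochs):
        errors = 0
        for x_i, t_i in zip(X, t):
            o = 1 if np.dot(w, x_i) > 0 else -1
            if o != t_i:
                w += eta * (t_i - o) * x_i     # update only when the example is misclassified
                errors += 1
        if errors == 0:                        # every example is on the correct side of the hyperplane
            return w, epoch
    return w, None                             # never converged within max_epochs

# Linearly separable data: a line can split the classes, so training stops.
X_sep = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
t_sep = np.array([1, 1, -1, -1])
print(train_perceptron(X_sep, t_sep))

# XOR-like data: no line separates the classes, so the rule keeps making mistakes.
X_xor = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
t_xor = np.array([-1, -1, 1, 1])
print(train_perceptron(X_xor, t_xor))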