I am new to the data science field, and I have a question about applying a random forest to new data.
I have this table.
Y prop_A prop_B
A 0.8 0.2
A 0.7 0.3
B 0.5 0.5
B 0.4 0.6
B 0.1 0.9
I assumed that if a group's proportion is high, the chances are high that the observation belongs to that group. I built a random forest model and tested it on a validation set (80/20 split).
I thought the above model could be used for new data. Below is an example of that data. The data structure and the meaning of the variables are the same, but the number of variables is different.
Y prop_C prop_D prop_E prop_F
- 0.8 0.1 0.05 0.05
- 0.6 0.3 0.05 0.05
- 0.5 0.4 0.05 0.05
- 0.4 0.2 0.4 0
- 0.1 0.5 0.4 0.4
The new data is unlabeled, so I would like to label it using the random forest trained on the previous data. Is this the right approach to labeling the new data?
As it stands, the model doesn't work on the new data (because the independent variables are different).
How can I label the new data using a model trained on labelled data whose variables are different?
The number and names of the independent variables must be the same at training and prediction time. If you want to give it a try anyway, just omit prop_E and prop_F and rename prop_C and prop_D to prop_A and prop_B; then the model will run.
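For illustration, here is a minimal sketch of that workaround with pandas and scikit-learn (the tiny data frames simply mirror the tables above; the model and everything else in the snippet are assumptions for illustration, not code from the question):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Training data: the labelled table with two proportion columns
train = pd.DataFrame({
    "Y":      ["A", "A", "B", "B", "B"],
    "prop_A": [0.8, 0.7, 0.5, 0.4, 0.1],
    "prop_B": [0.2, 0.3, 0.5, 0.6, 0.9],
})
model = RandomForestClassifier(random_state=0)
model.fit(train[["prop_A", "prop_B"]], train["Y"])

# New data: four proportion columns, no labels
new = pd.DataFrame({
    "prop_C": [0.8, 0.6, 0.5, 0.4, 0.1],
    "prop_D": [0.1, 0.3, 0.4, 0.2, 0.5],
    "prop_E": [0.05, 0.05, 0.05, 0.4, 0.4],
    "prop_F": [0.05, 0.05, 0.0, 0.0, 0.4],
})

# Keep two columns and rename them so they match the training schema
X_new = new[["prop_C", "prop_D"]].rename(columns={"prop_C": "prop_A", "prop_D": "prop_B"})
print(model.predict(X_new))

Note that this only makes the model run; whether prop_C and prop_D really carry the same meaning as prop_A and prop_B is a question about the data, not the code.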
After training my CatBoostClassifier model I call the get_proba function, which returns a list of probabilities. The problem starts at another point... I transfer that data into a dataframe and then to Excel, after which I sum all the floats in my list and get numbers approximately equal to 2.
(Example: 0.980831511, 0.99695788, 2.99173E-13, 1.63919E-15, 7.35072E-14, 4.82846E-16; their sum is equal to 1.977789391.)
Parameters which were used:
'loss_function': 'MultiClassOneVsAll',
'eval_metric': 'ZeroOneLoss',
The problem is that I need dependent probabilities, i.e. something more like 0.2 0.5 0.1 0.2, where the sum equals 1 and the highest probability (obviously) is in the second category (0.5).
I've completed several tests.
I've tried different objectives (a.k.a. loss functions) and metrics. If you need "dependent" probabilities, you can use any loss function (correct me if I'm wrong) except MultiClassOneVsAll (multiclassova). Using multiclassova only as the eval metric was fine; everything still looked right.
With OneVsAll (loss_function multiclassova), the per-row sums of the class probabilities varied between roughly 0.5 and 2.0 in my tests. With any other loss_function, the sum of the probabilities over all classes equals 1.
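Roughly, in code (a sketch assuming the catboost package is installed; the toy data is made up just to show the row sums):

import numpy as np
from catboost import CatBoostClassifier

# Made-up 3-class data, only for demonstrating the shape of predict_proba
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 3, size=300)

# One-vs-all: each class gets an independent probability, so rows need not sum to 1
ova = CatBoostClassifier(loss_function='MultiClassOneVsAll', iterations=50, verbose=False)
ova.fit(X, y)
print(ova.predict_proba(X[:3]).sum(axis=1))      # sums generally != 1

# Softmax multiclass: probabilities are "dependent" and each row sums to 1
soft = CatBoostClassifier(loss_function='MultiClass', iterations=50, verbose=False)
soft.fit(X, y)
print(soft.predict_proba(X[:3]).sum(axis=1))     # sums == 1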
It is often said that L1 regularization helps with feature selection. How does the L1 norm do that?
And also, why is L2 regularization not able to do that?
To begin with, please note that L1 and L2 regularization may not always work like this; there are various quirks, and the effect depends on the regularization strength and other factors.
First of all, we will consider linear regression as the simplest case.
Secondly, it's easiest to consider only two weights for this problem in order to get some intuition.
Now, let's introduce a simple constraint: the sum of both weights has to be equal to 1.0 (e.g. w1=0.2 and w2=0.8, or any other such combination).
And the last assumptions:
the x1 feature has a perfect positive correlation with the target (e.g. 1.0 * x1 = y, where y is our target)
x2 has an almost perfect positive correlation with the target (e.g. 0.99 * x2 = y)
(alpha, meaning the regularization strength, will be set to 1.0 for both L1 and L2 so as not to clutter the picture further).
L2
Weights values
For two variables (weights) and L2 regularization we would have the following formula:
alpha * (w1^2 + w2^2)/2 (mean of their squares)
Now, we would like to minimize the above equation as it's part of the cost function.
One can easily see that both have to be set to 0.5 (remember, their sum has to equal 1.0!), because 0.5^2 + 0.5^2 = 0.5. For any other two values summing to 1 we would get a greater value (e.g. 0.8^2 + 0.2^2 = 0.64 + 0.04 = 0.68), hence 0.5 and 0.5 is the optimal solution.
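A quick numeric check of that claim (plain Python, not part of the original argument):

# Among pairs (w1, w2) with w1 + w2 = 1, the L2 penalty (w1**2 + w2**2) / 2
# is smallest at w1 = w2 = 0.5.
pairs = [(i / 10, 1 - i / 10) for i in range(11)]
penalty = {pair: (pair[0] ** 2 + pair[1] ** 2) / 2 for pair in pairs}
print(min(penalty, key=penalty.get))   # (0.5, 0.5)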
Target predictions
In this case we are pretty close for all data points, because:
0.5 * 1.0 + 0.5 * 0.99 = 0.995 (of `y`)
So we are "off" only by 0.005 for each sample. What this means is that the regularization on the weights has a greater effect on the cost function than this small difference (that's why w1 wasn't chosen as the only variable and the values were "split").
BTW, the exact values above will differ slightly (e.g. w1 ~0.49), but it's easier to follow along this way I think.
Final insight
With L2 regularization, two similar weights tend to be "split" in half, as that minimizes the regularization penalty.
L1
Weights values
This time it will be even easier: for two variables (weights) and L1 regularization we would have the following formula:
alpha * (|w1| + |w2|)/2 (mean of their absolute values)
This time it doesn't matter what w1 or w2 are set to (as long as their sum equals 1.0): |0.5| + |0.5| = |0.2| + |0.8| = |1.0| + |0.0| = 1.0 (and so on).
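The same quick check for the L1 penalty (again plain Python, just to verify the equivalence):

# Among pairs (w1, w2) with w1 + w2 = 1 and both non-negative, the L1 penalty
# (|w1| + |w2|) / 2 is identical for every pair: always 0.5.
pairs = [(i / 10, 1 - i / 10) for i in range(11)]
print({(abs(w1) + abs(w2)) / 2 for w1, w2 in pairs})   # {0.5}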
In this case L1 regularization will still prefer the (1.0, 0.0) solution; the reason is below.
Target predictions
As the distribution of the weights does not change the penalty in this case, it's the prediction loss we are after (still under the sum-to-1.0 constraint). For a perfect prediction it would be:
1.0 * 1.0 + 0.0 * 0.99 = 1.0
This time we are not "off" at all and it's "best" to choose just w1, no need for w2 in this case.
Final insight
With L1 regularization, similar weights tend to be zeroed out in favor of the one connected to the feature that predicts the target best with the lowest coefficient.
BTW, if we had an x3 that was once again positively correlated with the values we want to predict and described by the equation
0.1 * x3 = y
then only x3 would be chosen, with a weight equal to 0.1.
Reality
In reality there is almost never a "perfect correlation" between variables; there are many features interacting with each other, hyperparameters, and imperfect optimizers, among many other factors.
This simplified view should give you an intuition to "why" though.
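As a rough illustration of both behaviours with scikit-learn (a sketch, not part of the original reasoning: two almost-duplicate features, and the exact coefficients will depend on alpha, the noise level, and scaling):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # almost perfectly correlated with x1
X = np.column_stack([x1, x2])
y = x1                                        # the target is exactly x1

print(Ridge(alpha=1.0).fit(X, y).coef_)   # roughly equal weights, "split" between x1 and x2
print(Lasso(alpha=0.01).fit(X, y).coef_)  # one weight driven to (or very near) zero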
A common application of your question is in different types of regression. Here is a link that explains the difference between Ridge (L2) and Lasso (L1) regression:
https://stats.stackexchange.com/questions/866/when-should-i-use-lasso-vs-ridge
I am having trouble with a classification problem.
I have almost 400k vectors in the training data with two labels, and I'd like to train an MLP that classifies the data into two classes.
However, the dataset is very imbalanced: 95% of the vectors have label 1 and the rest have label 0. The accuracy grows as training progresses and stops after reaching 95%. I guess this is because the network predicts label 1 for all vectors.
So far, I have tried dropout layers with a 0.5 probability, but the result is the same. Are there any ways to improve the accuracy?
I think the best way to deal with unbalanced data is to use class weights. For example, you can weight your classes such that the sum of weights for each class is equal.
import pandas as pd

df = pd.DataFrame({'x': range(7),
                   'y': [0] * 2 + [1] * 5})
# weight = n_samples / (n_classes * class_count), so each class's weights sum to the same total
df['weight'] = df['y'].map(len(df) / 2 / df['y'].value_counts())
print(df)
print(df.groupby('y')['weight'].agg(samples='count', weight='sum'))
output:
x y weight
0 0 0 1.75
1 1 0 1.75
2 2 1 0.70
3 3 1 0.70
4 4 1 0.70
5 5 1 0.70
6 6 1 0.70
   samples  weight
y
0        2     3.5
1        5     3.5
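To actually use those weights in training, one option (an assumption, since the question does not say which library the MLP is built with) is scikit-learn's helper for "balanced" class weights, which reproduces the same 1.75 / 0.70 values:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 2 + [1] * 5)
classes = np.unique(y)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
print({int(c): float(w) for c, w in zip(classes, weights)})   # {0: 1.75, 1: 0.7}

Most frameworks accept such weights either as a class_weight dictionary (e.g. Keras' model.fit(..., class_weight={0: 1.75, 1: 0.7})) or as per-sample weights via a sample_weight argument.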
You could try another classifier on a subset of the examples. SVMs may work well with small data, so you could take, let's say, only 10k examples with a 5/1 class proportion.
You could also oversample the small class and undersample the other one.
You can also simply weight your classes.
Also think about a proper metric. It's good that you noticed that your model predicts only one label; that, however, is not easily seen using accuracy.
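For example, a confusion matrix or a per-class F1 score exposes the "always predict 1" behaviour that 95% accuracy hides (a small sketch with scikit-learn; the arrays are made up to match the 95/5 split):

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = np.array([1] * 95 + [0] * 5)   # 95% of labels are 1
y_pred = np.ones_like(y_true)           # a model that always predicts 1

print(accuracy_score(y_true, y_pred))          # 0.95 -- looks great
print(f1_score(y_true, y_pred, pos_label=0))   # 0.0  -- the minority class is never found
print(confusion_matrix(y_true, y_pred))        # all 5 minority samples are misclassified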
Some nice ideas about unbalanced datasets here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Remember not to change your test set.
That's a common situation: the network learns a constant and can't get out of this local minimum.
When the data is very unbalanced, like in your case, one possible solution is a weighted cross-entropy loss function. For instance, in TensorFlow you can apply the built-in tf.nn.weighted_cross_entropy_with_logits function. There is also a good discussion of this idea in this post.
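A minimal sketch of that function (TensorFlow 2.x; the labels, logits, and pos_weight below are illustrative assumptions, not values from the question):

import tensorflow as tf

# Illustrative values only: binary labels and raw model outputs (logits)
labels = tf.constant([[1.0], [0.0], [1.0], [1.0]])
logits = tf.constant([[0.3], [1.2], [-0.4], [2.0]])

# pos_weight > 1 penalizes errors on the positive (label 1) class more heavily;
# with the rare class encoded as 1 and a 95/5 split, a value around 95/5 = 19
# is a common starting point.
loss = tf.nn.weighted_cross_entropy_with_logits(labels=labels, logits=logits, pos_weight=19.0)
print(tf.reduce_mean(loss))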
But I should say that getting more data to balance both classes (if that's possible) will always help.
I am trying to figure out this whole machine learning thing, so I have been doing some testing. I wanted to make a network learn the sine function (with an angle in radians). The neural network is:
1 input (angle in radians) / 2 hidden layers / 1 output (the predicted sine)
For the squash activation I am using ReLU, and it's important to note that when I used the logistic function instead of ReLU the script worked.
To do that, I've made a loop that starts at 0 and finishes at 180; it converts the index to radians (radian = loop_index*Math.PI/180), then takes the sine of this angle and stores both the radians and the sine result.
So an entry in my table looks like this: {input:[RADIAN ANGLE], output:[sin(radian)]}
for (var i = 0; i <= 180; i++) {
    radian = i * (Math.PI / 180);
    train_table.push({input: [radian], output: [Math.sin(radian)]});
}
I use this table to train my Neural Network using Cross Entropy and a learning rate of 0.3 with 20000 iterations.
The problem is that it fails: when I try to predict anything, it returns "NaN".
I am using the framework Synaptic (https://github.com/cazala/synaptic) and here is a JSfiddle of my code: https://jsfiddle.net/my7xe9ks/2/
The learning rate must be carefully tuned; this parameter matters a lot, especially when the gradients explode and you get a NaN. When that happens, you have to reduce the learning rate, usually by a factor of 10.
In your specific case, the learning rate is too high; if you use 0.05 or 0.01, the network trains and works properly.
Another important detail is that you are using cross-entropy as the loss. That loss is meant for classification, but you have a regression problem, so you should prefer a mean squared error loss instead.
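The effect of an overly large learning rate can be seen even in a one-parameter toy example (plain Python, independent of Synaptic):

# Gradient descent on f(w) = w**2, whose gradient is 2*w.
def gradient_descent(lr, steps=50, w=1.0):
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(gradient_descent(lr=0.01))   # ~0.36, steadily approaching the minimum at 0
print(gradient_descent(lr=1.5))    # ~1.1e15, each step overshoots and the iterate blows up

In a real network the same runaway growth shows up as NaN activations, which is why lowering the rate (here, from 0.3 to 0.05 or 0.01) fixes the problem.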
If I have a trained binary classifier, what is the probability of making a correct prediction by chance?
For example, let's say that I want to make 5 predictions. What is the probability of getting all 5 predictions correct by chance?
Is it: 0.5 * 0.5 * 0.5 * 0.5 * 0.5 = 0.0313 ?
You are correct, but only under the assumption that both classes are equally probable (so a random guess is right half the time).
As a similar thought experiment, if you have a model with 99% accuracy (meaning that for any randomly chosen sample it provides the correct label 99% of the time), it still does not have a high probability of classifying all samples correctly. For 100 samples it is only about 36%, for 300 it is less than 5%, and for 1000 it is about 0.004%.
In general, the probability of many independent events all happening falls off very quickly (exponentially) as their number grows, if the probability of each success is constant.
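These numbers are easy to reproduce (a quick check, not part of the original answer):

print(0.5 ** 5)       # 0.03125  -- all 5 fair coin-flip predictions correct
print(0.99 ** 100)    # ~0.366   -- all 100 samples correct at 99% per-sample accuracy
print(0.99 ** 300)    # ~0.049
print(0.99 ** 1000)   # ~4.3e-05 (about 0.004%)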