H2ORandomForestEstimator with min_samples_split? - random-forest

What is the analogue of min_samples_split for H2ORandomForestEstimator and H2OGradientBoostingEstimator?
(H2O's min_rows corresponds to sklearn's min_samples_leaf.)

It looks like the closest thing to min_samples_split is min_split_improvement: the minimum relative improvement in squared error reduction required for a split to happen.
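A minimal sketch of where those parameters go in the H2O Python API (the CSV path and the "target" column name are just placeholders):

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()

# placeholder frame and response column, for illustration only
train = h2o.import_file("train.csv")
predictors = [c for c in train.columns if c != "target"]

# min_rows plays the role of sklearn's min_samples_leaf;
# min_split_improvement is the closest knob to min_samples_split,
# but it gates splits on error reduction rather than on a sample count
rf = H2ORandomForestEstimator(
    ntrees=100,
    min_rows=10,                  # fewest observations allowed in a leaf
    min_split_improvement=1e-5,   # minimum relative error reduction for a split
)
rf.train(x=predictors, y="target", training_frame=train)

The same two parameters are accepted by H2OGradientBoostingEstimator.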

Related

Scikit-learn precision and recall computed incorrectly

I have an unbelievably stupid problem. Calculating precision and recall with scikit-learn gives me crazy values, totally different from the ones I calculated myself from the confusion matrix.
Here's my code:
I also tried average='weighted' and 'macro', and the separate functions f1_score, precision_score and recall_score. Nothing helped.
I got these results:
First there are the y_test values, then y_pred (as you can see, there is only one true positive prediction), then recall and precision calculated from the confusion matrix (a precision of 0.14 is what I expected). At the end there are precision and recall calculated by the sklearn functions and... I don't understand! Why the difference?!
Does anyone have an idea why these results look like this?
Yeah, that was a veeery stupid problem. The solution was changing average='micro' to average='binary'. Then the results are correct.
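A small reproduction of the effect with made-up labels (the real y_test/y_pred from the question are not shown, so these arrays are only illustrative): for a binary target, average='micro' pools true/false positives over both classes and collapses precision and recall into plain accuracy, while average='binary' scores the positive class only, which is what the hand calculation from the confusion matrix does.

import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# illustrative labels, not the ones from the question
y_test = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 0, 0, 1, 0])

print(confusion_matrix(y_test, y_pred))

# 'micro' pools both classes, so precision == recall == accuracy here
print(precision_score(y_test, y_pred, average='micro'))   # 0.5
print(recall_score(y_test, y_pred, average='micro'))      # 0.5

# 'binary' scores only the positive class, matching the confusion-matrix arithmetic
print(precision_score(y_test, y_pred, average='binary'))  # 0.25  (1 TP / 4 predicted positives)
print(recall_score(y_test, y_pred, average='binary'))     # 0.333... (1 TP / 3 actual positives)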

Can intercept and regression coefficients (Beta values) be very high?

I have 38 variables, like oxygen, temperature, pressure, etc., and the task of determining the total yield produced every day from these variables. When I calculate the regression coefficients and the intercept, they seem abnormal and very high (impractical). For example, if the 'temperature' coefficient came out as +375.456, I could not give it a meaning by saying that an increase of one unit in temperature would increase yield by 375.456 g; that's impractical in my scenario. However, the prediction accuracy seems right. I would like to know how to interpret this huge intercept (-5341.27355) and the huge beta values shown below. One other important point is that I removed multicollinear columns, and I am not scaling/normalizing the variables because I need the beta coefficients to keep their meaning, so that I could say an increase in temperature by one unit increases yield by 10 g or so. Your inputs are highly appreciated!
modl.intercept_
Out[375]: -5341.27354961415
modl.coef_
Out[376]:
array([ 1.38096017e+00, -7.62388829e+00, 5.64611255e+00, 2.26124164e-01,
4.21908571e-01, 4.50695302e-01, -8.15167717e-01, 1.82390184e+00,
-3.32849969e+02, 3.31942553e+02, 3.58830763e+02, -2.05076898e-01,
-3.06404757e+02, 7.86012402e+00, 3.21339318e+02, -7.00817205e-01,
-1.09676321e+04, 1.91481734e+00, 6.02929848e+01, 8.33731416e+00,
-6.23433431e+01, -1.88442804e+00, 6.86526274e+00, -6.76103795e+01,
-1.11406021e+02, 2.48270706e+02, 2.94836048e+01, 1.00279016e+02,
1.42906659e-02, -2.13019683e-03, -6.71427100e+02, -2.03158515e+02,
9.32094007e-03, 5.56457014e+01, -2.91724945e+00, 4.78691176e-01,
8.78121854e+00, -4.93696073e+00])
It's very unlikely that all of these variables are linearly related to the target, so I would suggest that you have a look at simple non-linear regression techniques, such as decision trees or kernel ridge regression. These are, however, more difficult to interpret.
Going back to your issue, these high weights might well be due to a high amount of correlation between the variables, or simply to not having very much training data.
If you use Lasso regression instead of plain linear regression, the solution is biased away from high regression coefficients, and the fit will likely improve as well.
A small example of how to do this in scikit-learn, including cross-validation of the regularization hyper-parameter:
import numpy as np
from sklearn.linear_model import LassoCV

# Make up some data
n_samples = 100
n_features = 5
X = np.random.random((n_samples, n_features))
# Make y linearly dependent on the features
y = np.sum(np.random.random((1, n_features)) * X, axis=1)

model = LassoCV(cv=5, n_alphas=100, fit_intercept=True)
model.fit(X, y)
print(model.intercept_)
If you have a linear regression, the formula looks like this (y = target, x = feature inputs):
y = x1*b1 + x2*b2 + x3*b3 + x4*b4 + ... + c
where b1, b2, b3, b4, ... are your modl.coef_ and c is the intercept. As you already realized, one of your biggest coefficients is 3.319e+02 ≈ 332, and the intercept is also quite big at about -5341.
As you already mentioned, a coefficient tells you how much the target variable changes if the corresponding feature changes by 1 unit and all other features are held constant.
So, for your interpretation: the higher the absolute coefficient, the higher the influence of that feature in your analysis. But it is important to note that the model uses a lot of high coefficients, which means your model does not depend on only one variable.
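To make the formula concrete, here is a small sketch with made-up data (the real modl in the question appears to be a fitted scikit-learn linear model, judging by the modl.coef_/modl.intercept_ output above), showing that the prediction is exactly the weighted sum of the features plus the intercept:

import numpy as np
from sklearn.linear_model import LinearRegression

# made-up data just to illustrate the formula
X = np.random.random((50, 3))
y = 2.0 * X[:, 0] - 5.0 * X[:, 1] + 0.5 * X[:, 2] + 7.0

modl = LinearRegression().fit(X, y)

# manual prediction: x1*b1 + x2*b2 + x3*b3 + c, for every row of X
manual = X @ modl.coef_ + modl.intercept_
print(np.allclose(manual, modl.predict(X)))   # True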

Is Total Error Mean an adequate performance metric for regression models?

I'm working on a regression model and to evaluate the model performance, my boss thinks that we should use this metric:
Total Absolute Error Mean = mean(y_predicted) / mean(y_true) - 1
Where mean(y_predicted) is the average of all the predictions and mean(y_true) is the average of all the true values.
I have never seen this metric being used in machine learning before and I convinced him to add Mean Absolute Percentage Error as an alternative, yet even though my model is performing better regarding MAPE, some areas underperform when we look at Total Absolute Error Mean.
My gut feeling is that this metric does not reflect the real accuracy of the model, but I can't seem to understand why.
Is Total Absolute Error Mean a valid performance metric? If not, then why? If it is, why would a regression model's accuracy increase in terms of MAPE, but not in terms of Total Absolute Error Mean?
Thank you in advance!
I would kindly suggest informing your boss that, when one wishes to introduce a new metric, it is on him/her to demonstrate why it is useful on top of the existing ones, not the other way around (i.e., on us to demonstrate why it is not). This is, by the way, exactly the standard procedure when someone actually proposes a new metric in a research paper, as with the recent proposal of the Maximal Information Coefficient (MIC).
That said, it is not difficult to demonstrate in practice that this proposed metric is a poor one with some dummy data:
import numpy as np
from sklearn.metrics import mean_squared_error
# your proposed metric:
def taem(y_true, y_pred):
    return np.mean(y_pred) / np.mean(y_true) - 1
# dummy true data:
y_true = np.array([0,1,2,3,4,5,6])
Now, suppose that we have a really awesome model, which predicts perfectly, i.e. y_pred1 = y_true; in this case both MSE and your proposed TAEM will indeed be 0:
y_pred1 = y_true # PERFECT predictions
mean_squared_error(y_true, y_pred1)
# 0.0
taem(y_true, y_pred1)
# 0.0
So far so good. But let's now consider the output of a really bad model, which predicts high values when it should have predicted low ones, and vice versa; in other words, consider a different set of predictions:
y_pred2 = np.array([6,5,4,3,2,1,0])
which is actually y_pred1 in reverse order. Now, it is easy to see that here we will also have a perfect TAEM score:
taem(y_true, y_pred2)
# 0.0
while of course MSE would have warned us that we are very far indeed from perfect predictions:
mean_squared_error(y_true, y_pred2)
# 16.0
Bottom line: any metric that ignores element-wise differences in favor of averages alone suffers from similar limitations, namely that it takes identical values for any permutation of the predictions, a property which is highly undesirable for a useful performance metric.
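The same trick works for any reshuffling of those values; continuing with the taem, y_true and mean_squared_error already defined above (the particular permutation below is arbitrary):

y_pred3 = np.array([3, 2, 6, 0, 5, 1, 4])  # yet another permutation of the same values
taem(y_true, y_pred3)
# 0.0 -- TAEM cannot tell the permutations apart
mean_squared_error(y_true, y_pred3)
# 8.0 -- MSE can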

Scikit-learn's PolynomialFeatures with logistic regression resulting in lower scores

I have a dataset X whose shape is (1741, 61). Using logistic regression with cross-validation I was getting around 62-65% for each split (cv=5).
I thought that if I made the data quadratic, the accuracy was supposed to increase. However, I'm getting the opposite effect (each cross-validation split is in the 40s, percentage-wise), so I'm presuming I'm doing something wrong when trying to make the data quadratic?
Here is the code I'm using:
from sklearn import preprocessing
X_scaled = preprocessing.scale(X)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(3)
poly_x = poly.fit_transform(X_scaled)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(penalty='l2', max_iter=200)
from sklearn.cross_validation import cross_val_score
cross_val_score(classifier, poly_x, y, cv=5)
array([ 0.46418338, 0.4269341 , 0.49425287, 0.58908046, 0.60518732])
Which makes me suspect I'm doing something wrong.
I also tried transforming the raw data with PolynomialFeatures first and then scaling it with preprocessing.scale, but that resulted in this warning:
UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.
warnings.warn("Numerical issues were encountered "
So I didn't bother going this route.
The other thing that's bothering me is the speed of the polynomial computations. cross_val_score takes a couple of hours to output the score when using polynomial features. Is there any way to speed this up? I have an Intel i5-6500 CPU with 16 GB of RAM, running Windows 7.
Thank you.
Have you tried using MinMaxScaler instead of scale? scale outputs values both above and below 0, so you run into the situation where a value scaled to -0.1 and a value scaled to 0.1 have the same square, despite not really being similar at all. Intuitively this seems like something that would lower the score of a polynomial fit. That being said, I haven't tested this, it's just my intuition. Furthermore, be careful with polynomial fits. I suggest reading this answer to "Why use regularization in polynomial regression instead of lowering the degree?". It's a great explanation and will likely introduce you to some new techniques. As an aside, @MatthewDrury is an excellent teacher and I recommend reading all of his answers and blog posts.
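A minimal sketch of that suggestion (the X and y below are placeholders with the same shape as the question's data; degree 2 is used here just to keep the expansion small). MinMaxScaler keeps every scaled value in [0, 1], so squaring is order-preserving and two different values never collapse onto the same squared value the way -0.1 and 0.1 do after standardization:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# placeholder data with the same shape as in the question
X = np.random.random((1741, 61))
y = np.random.randint(0, 2, size=1741)

# scale to [0, 1] first, then expand
pipe = make_pipeline(
    MinMaxScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(penalty='l2', max_iter=200),
)
print(cross_val_score(pipe, X, y, cv=5))

Putting the scaler and the expansion inside a Pipeline also means they are re-fit on every training fold, so no information from the validation fold leaks into the preprocessing.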
There is a statement that "the accuracy is supposed to increase" with polynomial features. That is true if the polynomial features bring the model closer to the original data-generating process. Polynomial features, especially making every feature interact and polynomial, may move the model further from the data-generating process; hence worse results may be the appropriate outcome.
By using a degree-3 polynomial in scikit-learn, the X matrix went from (1741, 61) to (1741, 41664), which is significantly more columns than rows.
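A quick sanity check on that count (PolynomialFeatures(3) includes the bias/constant column by default): the number of monomials of degree at most 3 in 61 variables is C(61+3, 3).

from math import comb
# all monomials of degree <= 3 in 61 variables, bias column included
print(comb(61 + 3, 3))   # 41664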
41k+ columns will take longer to solve. You should be looking at feature selection methods. As Grr says, investigate lowering the polynomial degree. Try L1 regularization, the grouped lasso, RFE, or Bayesian methods. Ask SMEs (subject matter experts) who may be able to identify the specific features that are likely to act polynomially. Plot the data to see which features may interact or be best as polynomial terms. A small sketch of one L1-based selection approach follows below.
I have not looked at it in a while, but I recall discussions of hierarchically well-formulated models (can you remove x1 but keep the x1*x2 interaction?). That is probably worth investigating if your model behaves best as an ill-formulated hierarchical model.
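A hedged sketch of the L1-based selection idea (placeholder data again, and degree 2 to keep it fast; SelectFromModel keeps only the expanded columns whose L1-penalized coefficients are non-zero, shrinking the matrix before the final model is fit):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# placeholder data with a made-up target that depends on an interaction
X = np.random.random((1741, 61))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 1.0).astype(int)

poly_x = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# keep only the expanded columns with non-zero L1 coefficients
selector = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear', C=0.1))
poly_x_reduced = selector.fit_transform(poly_x, y)
print(poly_x.shape, '->', poly_x_reduced.shape)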

How to check if gradient descent with multiple variables converged correctly?

In linear regression with 1 variable I can clearly see the prediction line on a plot and check whether it properly fits the training data: I just plot the single variable against the output and construct the prediction line from the found values of Theta 0 and Theta 1, so it looks like this:
But how can I check the validity of gradient descent results when it is run on multiple variables/features? For example, if the number of features is 4 or 5, how do I check that it works correctly and that the found values of all the thetas are valid? Do I have to rely only on the cost function plotted against the number of iterations carried out?
Gradient descent converges to a local minimum, meaning that the first derivative (the gradient) should be zero there and the second derivative (the Hessian) should be positive semi-definite. Checking the gradient and the Hessian will tell you whether the algorithm has converged.
We can think of gradient descent as solving the problem f'(x) = 0, where f' denotes the gradient of f. To check convergence of this problem, as far as I know, the standard approach is to calculate the discrepancy on each iteration and see whether it converges to 0.
That is, check whether ||f'(x)|| (or its square) converges to 0.
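For the linear-regression case from the question, that check is cheap to compute. A small sketch with made-up data (here theta comes from a least-squares solve, standing in for whatever gradient descent returned):

import numpy as np

# made-up design matrix (first column of ones for the intercept) and targets
m, n = 200, 5
X = np.hstack([np.ones((m, 1)), np.random.random((m, n))])
y = X @ np.random.random(n + 1)

theta = np.linalg.lstsq(X, y, rcond=None)[0]   # stand-in for the gradient-descent result

# gradient of the cost J(theta) = (1/2m) * ||X theta - y||^2
grad = X.T @ (X @ theta - y) / m
print(np.linalg.norm(grad))   # should be (close to) 0 at convergence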
There are some things you can try.
1) Check whether your cost/energy function stops improving as the iterations progress. Use something like abs(E_after - E_before) < 0.00001*E_before, i.e. check whether the relative difference is very low.
2) Check whether your variables have stopped changing. You can adopt a very similar strategy to the one above to check this.
There is actually no perfect way to fully make sure that your function has converged, but the checks mentioned above are what people usually try; a small sketch of both is below.
Good luck!
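A minimal sketch of both checks inside a multivariate gradient-descent loop (made-up data; the learning rate and tolerance are only illustrative):

import numpy as np

# made-up data: 200 samples, 4 features plus an intercept column, with some noise
m = 200
X = np.hstack([np.ones((m, 1)), np.random.random((m, 4))])
y = X @ np.array([3.0, 1.5, -2.0, 0.7, 4.2]) + 0.1 * np.random.randn(m)

def cost(t):
    return np.mean((X @ t - y) ** 2) / 2

theta = np.zeros(X.shape[1])
alpha, tol = 0.1, 1e-5

for it in range(100000):
    grad = X.T @ (X @ theta - y) / m
    theta_new = theta - alpha * grad
    # 1) the relative change in the cost is tiny
    cost_converged = abs(cost(theta_new) - cost(theta)) < tol * cost(theta)
    # 2) the parameters have (almost) stopped changing
    theta_converged = np.linalg.norm(theta_new - theta) < tol
    theta = theta_new
    if cost_converged and theta_converged:
        print(f"converged after {it + 1} iterations, cost = {cost(theta):.3g}")
        break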
