How does H2O return the MCC value? - machine-learning

I created a random forest model. The MCC value is a list of two values. Why?
mRF = H2ORandomForestEstimator(nfolds=10, keep_cross_validation_models=True, seed=12345, model_id="RF0", ntrees=1000)
mRF.train(x=column_use, y=target, training_frame=train, validation_frame=valid)
print("MCC valid", mRF.model_performance(valid).mcc())
MCC valid [[0.35743618321170406, 0.21239407659849494]]

If you don't set the thresholds parameter, it returns an array with one result for the optimal threshold: [[optimal threshold, mcc]]. So the two values you see are the threshold and the MCC at that threshold. You can also pass your own array of thresholds, [threshold1, threshold2, ...], and it will return [[threshold1, mcc1], [threshold2, mcc2], ...].
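For example (a sketch assuming the thresholds argument of the metric accessors in the H2O Python API, and the mRF and valid objects from the question):
perf = mRF.model_performance(valid)
# No thresholds given: [[optimal_threshold, mcc_at_that_threshold]]
print(perf.mcc())
# Custom thresholds: [[t1, mcc1], [t2, mcc2], ...]
print(perf.mcc(thresholds=[0.3, 0.5, 0.7]))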

Related

How to manually normalize data containing positive and negative numbers into the 0-1 range (without the sklearn.preprocessing.MinMaxScaler package)?

I want to normalize data without using a package, so I implemented min-max scaling based on the formula, but when I try to normalize the data I get the error below.
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Normalization code:
minInsole = min(SmartInsole)
maxInsole = max(SmartInsole)
norm_data = (SmartInsole - minInsole) / ( maxInsole - minInsole )
Data Shape:
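This error typically means Python's built-in min()/max() are comparing whole rows of a multi-dimensional array; a minimal sketch using NumPy's array-aware reductions instead (assuming SmartInsole is a 2-D NumPy array; the sample values below are purely illustrative):
import numpy as np

# Illustrative stand-in for SmartInsole: a 2-D array of sensor readings
SmartInsole = np.array([[-2.0, 5.0, 1.5],
                        [ 3.0, -4.0, 0.0]])

minInsole = SmartInsole.min()   # array-aware min over every element
maxInsole = SmartInsole.max()
norm_data = (SmartInsole - minInsole) / (maxInsole - minInsole)
print(norm_data)                # all values now lie in [0, 1]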

Using cv.matchTemplate to find multiple best matches

I am using the function cv.matchTemplate to try to find template matches.
result = cv.matchTemplate(img, templ, match_method)
After I run the function I have a bunch of answers in the array result. I want to filter it to find the best n matches. The data in result is just a large array of numbers, so I don't know what criteria to filter on. Using extremes = cv.minMaxLoc(result, None) filters the result in an undesired way before converting the values to locations.
The match_method is cv.TM_SQDIFF. I want to:
filter the results down to the best n matches
use the results to obtain the match locations
How can I achieve this?
You can threshold the result of matchTemplate to find locations with a sufficiently good match. This tutorial should get you started; read the bottom of the page for finding multiple matches.
import cv2
import numpy as np

threshold = 0.2
loc = np.where(result <= threshold)  # filter: keep positions with a good (low) SQDIFF score
for pt in zip(*loc[::-1]):           # pt marks the top-left corner of a match
    cv2.rectangle(img_rgb, pt, (pt[0] + w, pt[1] + h), (0, 0, 255), 2)
Keep in mind that the matching method you use determines how you filter. cv.TM_SQDIFF tends toward zero as match quality increases, so setting the threshold closer to zero filters out worse matches. The opposite is true for the cv.TM_CCORR, cv.TM_CCORR_NORMED, cv.TM_CCOEFF and cv.TM_CCOEFF_NORMED matching methods (for the normalized variants, better matches tend toward 1).
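For example, with a normalized correlation method the comparison flips (a small sketch reusing result, img_rgb, w and h from the snippet above; the 0.8 threshold is only illustrative):
# With cv.TM_CCORR_NORMED or cv.TM_CCOEFF_NORMED, higher values are better,
# so keep positions *above* the threshold instead:
threshold = 0.8
loc = np.where(result >= threshold)
for pt in zip(*loc[::-1]):
    cv2.rectangle(img_rgb, pt, (pt[0] + w, pt[1] + h), (0, 0, 255), 2)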
The above answer does not find the best N matches as the question asked. It filters answers based on a threshold, leaving open the (likely) possibility that you still have more than N results, or zero results that beat the threshold.
To find the N 'best matches' we're looking for the N highest numbers in a 2d array and retrieving their indexes so we know the location. We can use numpy.argpartition to find the indexes of the highest N values in a 1d array, and numpy.ndarray.flatten with numpy.unravel_index to go back and forth between a 2d and 1d array, like so:
find_num = 5
result = cv.matchTemplate(img, templ, match_method)
idx_1d = np.argpartition(result.flatten(), -find_num)[-find_num:]
idx_2d = np.unravel_index(idx_1d, result.shape)
From here, idx_2d holds the row and column (i.e. y, x) locations of the top 5 matches.
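Note that the question uses cv.TM_SQDIFF, where lower values mean better matches, so in that case you would take the N smallest values instead; a small sketch of that variant:
# For cv.TM_SQDIFF / cv.TM_SQDIFF_NORMED the best matches are the *smallest* values
idx_1d = np.argpartition(result.flatten(), find_num)[:find_num]
idx_2d = np.unravel_index(idx_1d, result.shape)  # (row, column) locations of the best N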

Compute annual mean using xarray

I have a python xarray dataset with time,x,y for its dimensions and value1 as its variable. I'm trying to compute annual mean of value1 for each x,y coordinate pair.
I've run into this function while reading the docs:
ds.groupby('time.year').mean()
This seems to compute a single annual mean over all x,y coordinate pairs in value1, rather than the annual mean of each individual x,y coordinate pair.
While the code snippet above produces the wrong output, I'm very interested in its simple form. I would really like to figure out the "xarray trick" for computing the annual mean per x,y coordinate pair rather than hacking it together myself.
Can someone point me in the right direction? Should I temporarily turn this into a pandas object?
To avoid the default of averaging over all dimensions, you simply need to supply the dimension you want to average over explicitly:
ds.groupby('time.year').mean('time')
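A minimal sketch showing that the per-pixel structure is preserved (using a small synthetic dataset; the variable and coordinate names mirror the question):
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative dataset: daily values on a 2x3 grid over two years
time = pd.date_range("2000-01-01", "2001-12-31", freq="D")
ds = xr.Dataset(
    {"value1": (("time", "x", "y"), np.random.rand(time.size, 2, 3))},
    coords={"time": time, "x": [0, 1], "y": [10, 20, 30]},
)

annual = ds.groupby("time.year").mean("time")
print(annual.value1.dims)  # dims now include 'year', 'x', 'y' -- one annual mean per x,y pair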
Note that calling ds.groupby('time.year').mean('time') will be incorrect if you are working with monthly rather than daily data: taking the mean places equal weight on months of different lengths, e.g. February and July, which is wrong.
Instead, use the function below from NCAR:
import numpy as np
import xarray as xr

def weighted_temporal_mean(ds, var):
    """Annual mean of `var`, weighting each month by its number of days."""
    # Determine the month length
    month_length = ds.time.dt.days_in_month
    # Calculate the weights
    wgts = month_length.groupby("time.year") / month_length.groupby("time.year").sum()
    # Make sure the weights in each year add up to 1
    np.testing.assert_allclose(wgts.groupby("time.year").sum(xr.ALL_DIMS), 1.0)
    # Subset our dataset for our variable
    obs = ds[var]
    # Set up masking for NaN values
    cond = obs.isnull()
    ones = xr.where(cond, 0.0, 1.0)
    # Calculate the numerator
    obs_sum = (obs * wgts).resample(time="AS").sum(dim="time")
    # Calculate the denominator
    ones_out = (ones * wgts).resample(time="AS").sum(dim="time")
    # Return the weighted average
    return obs_sum / ones_out

average_weighted_temp = weighted_temporal_mean(ds_first_five_years, 'TEMP')

What is the meaning of the GridSearchCV best_score_ attribute? (the value is different from the mean of the cross validation array)

I'm confused by the results; probably I'm not getting the concept of cross-validation and GridSearchCV right. I followed the logic behind this post:
https://randomforests.wordpress.com/2014/02/02/basics-of-k-fold-cross-validation-and-gridsearchcv-in-scikit-learn/
argd = CommandLineParser(argv)
folder,fname=argd['dir'],argd['fname']
df = pd.read_csv('../../'+folder+'/Results/'+fname, sep=";")
explanatory_variable_columns = set(df.columns.values)
response_variable_column = df['A']
explanatory_variable_columns.remove('A')
y = np.array([1 if e else 0 for e in response_variable_column])
X =df[list(explanatory_variable_columns)].as_matrix()
kf_total = KFold(len(X), n_folds=5, indices=True, shuffle=True, random_state=4)
dt=DecisionTreeClassifier(criterion='entropy')
min_samples_split_range=[x for x in range(1,20)]
dtgs=GridSearchCV(estimator=dt, param_grid=dict(min_samples_split=min_samples_split_range), n_jobs=1)
scores=[dtgs.fit(X[train],y[train]).score(X[test],y[test]) for train, test in kf_total]
# SAME AS DOING: cross_validation.cross_val_score(dtgs, X, y, cv=kf_total, n_jobs = 1)
print scores
print np.mean(scores)
print dtgs.best_score_
RESULTS OBTAINED:
# score [0.81818181818181823, 0.78181818181818186, 0.7592592592592593, 0.7592592592592593, 0.72222222222222221]
# mean score 0.768
# .best_score_ 0.683486238532
ADDITIONAL NOTE:
I ran it using another combination of the explanatory variables (using only some of them) and I got the inverse problem. Now the .best_score_ is higher than all the values in the cross validation array.
# score [0.74545454545454548, 0.70909090909090911, 0.79629629629629628, 0.7407407407407407, 0.64814814814814814]
# mean score 0.728
# .best_score_ 0.802752293578
The code is confusing several things.
dtgs.fit(X[train], y[train]) does internal 3-fold cross-validation for every parameter combination in param_grid, producing a grid of 19 results (one per min_samples_split value), which you can inspect via dtgs.grid_scores_.
[dtgs.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf_total] therefore fits the grid search five times and scores each fitted model on the corresponding held-out fold, so the result is an array of five outer-fold scores.
When you call dtgs.best_score_ you get the best mean score from the internal 3-fold validation of the hyperparameters for the last of those five fits.
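A rough sketch of the same nested setup with the current scikit-learn API (synthetic data from make_classification standing in for the question's CSV) that makes the difference visible:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Illustrative data standing in for the question's CSV
X, y = make_classification(n_samples=300, n_features=10, random_state=4)

dt = DecisionTreeClassifier(criterion='entropy')
dtgs = GridSearchCV(estimator=dt,
                    param_grid={'min_samples_split': list(range(2, 20))},
                    cv=3)  # the *inner* 3-fold CV

kf_total = KFold(n_splits=5, shuffle=True, random_state=4)  # the *outer* 5-fold CV
scores = [dtgs.fit(X[train], y[train]).score(X[test], y[test])
          for train, test in kf_total.split(X)]

print(scores, np.mean(scores))   # outer-fold test scores and their mean
print(dtgs.best_score_)          # best inner 3-fold mean score of the last (5th) fit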

How calculate the mean of Mean Squared Errors?

I have an array A where each element is a Mean Squared Error. How can I calculate the mean of A?
If I simply take the mean of the elements of A (which would give a mean of means), is that a correct operation? If not, why not, and what is the solution?
Note: the elements of A are real numbers in the range 0 to 1.
If you're after the total mean squared error you'll need the number of values that contributed to each element, n[i][j]. You can then compute
total_err2 = (Σ (n[i][j] * err2[i][j])) / (Σ n[i][j])
where Σ is the sum over all of the elements.
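A small NumPy sketch of that computation (the err2 and n values below are made up purely for illustration):
import numpy as np

err2 = np.array([[0.10, 0.25], [0.40, 0.05]])   # MSE of each element
n = np.array([[100, 50], [200, 25]])            # number of values behind each MSE

total_err2 = (n * err2).sum() / n.sum()          # weighted mean of the MSEs
print(total_err2)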
