H2O Gain/Lift table - response

My question is about the H2O Gain/Lift table. I understand that the response rate is the proportion of all the events that fall into a group/bin. How do I get the pieces of data that fall into bin 1, bin 2, etc.? I want to see how the key variables look in each group/bin with respect to the response rate.
It would be great to have a full description of how the measures in the Gain/Lift table are calculated (formulas).

The equations for the Gains and Lift Chart can be found in this file: https://github.com/h2oai/h2o-3/blob/master/h2o-core/src/main/java/hex/GainsLift.java
That file shows:
E = total number of events
N = number of observations
G = number of groups (10 for deciles or 20 for demi-deciles)
P = overall proportion of observations that are events (P = E/N)
ei = number of events in group i, i=1,2,...,G
ni = number of observations in group i
pi = proportion of observations in group i that are events (pi = ei/ni)
groups: the number of groups defaults to a hard-coded 16; if there are fewer than 16 unique probability values, the number of groups is reduced to the number of unique quantile thresholds.
cumulative_data_fraction = (n1 + ... + ni)/N
lower_threshold = set by the quantile bins
lift = pi/P
cumulative_lift = ((e1 + ... + ei)/(n1 + ... + ni))/P
response_rate = 100*pi
cumulative_response_rate = 100*(e1 + ... + ei)/(n1 + ... + ni)
capture_rate = 100*ei/E
cumulative_capture_rate = 100*(e1 + ... + ei)/E
gain = 100*(lift - 1)
cumulative_gain = 100*(cumulative_lift - 1)
average_response_rate = E/N
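To make these formulas concrete, here is a small standalone sketch that computes each measure from made-up group counts (the e and n values below are hypothetical, not taken from any real model):
import numpy as np
# hypothetical per-group event and observation counts (5 groups)
e = np.array([30, 25, 20, 15, 10])   # ei: events in group i
n = np.array([50, 50, 50, 50, 50])   # ni: observations in group i
E, N = e.sum(), n.sum()
P = E / N                            # overall proportion of events
p = e / n                            # pi: proportion of events in group i
lift = p / P
cumulative_lift = (e.cumsum() / n.cumsum()) / P
response_rate = 100 * p
cumulative_response_rate = 100 * e.cumsum() / n.cumsum()
capture_rate = 100 * e / E
cumulative_capture_rate = 100 * e.cumsum() / E
gain = 100 * (lift - 1)
cumulative_gain = 100 * (cumulative_lift - 1)
average_response_rate = E / N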
So here is an example walkthrough using the H2O-3 Python API:
import h2o
import pandas as pd
import numpy as np
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# import and split the dataset
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
# convert the response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
# set the predictor names and the response column name
predictors = ["displacement","power","weight","acceleration","year"]
response = "economy_20mpg"
# split dataset
train, valid = cars.split_frame(ratios=[.7],seed=1234)
# Initialize and train a GBM
cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
cars_gbm.train(x = predictors, y = response, training_frame = train, validation_frame=valid)
# Generate Gains and Lift Table
# documentation on this parameter can be found here:
# http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/model_categories.html?#h2o.model.H2OBinomialModel.gains_lift
gainslift = cars_gbm.gains_lift(train=False, valid=True, xval=False)
Table Overview
As expected, we have 16 groups because that is the default behavior described above. The table contains the following columns:
Cumulative data fractions
Threshold probability value
Response rates (proportion of observations that are events in a group)
Cumulative response rate
Event capture rate
Cumulative capture rate
Gain (the percentage difference between the group's response rate and the overall response rate, i.e. 100*(lift - 1))
Cumulative gain
What if I Want Just the Deciles
By default the Gains and Lift table gives you more than just the deciles or ventiles, which means you have more flexibility to pick out the percentiles you are interested in.
Let's take the example of getting our deciles. In this case we can start at row 6, skip row 7, and then take the remaining rows to get our deciles.
Since the Gains and Lift Table returns a TwoDimTable we can use our group numbers as selection indices.
# show gains and lift table data type
print('H2O Gains Lift Table is of type: ', type(gainslift))
H2O Gains Lift Table is of type: <class 'h2o.two_dim_table.H2OTwoDimTable'>
# since this table is small, and for ease of use, let's convert it to a pandas dataframe
pandas_gl = gainslift.as_data_frame()
pandas_gl = pandas_gl.set_index('group')
# np.r_ replaces the deprecated pd.np.r_
gainslift_deciles = pandas_gl.iloc[np.r_[5, 7:16], :]
gainslift_deciles
What if I Want Just the Ventiles
Those are available to select out as well, so let's do that next.
gainslift_ventiles = pandas_gl.iloc[np.r_[7, 9, 11, 13, 15], :]
gainslift_ventiles
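If you also want a quick visual check, the cumulative lift chart can be plotted from the same pandas dataframe. This is just a sketch: matplotlib is an extra dependency here, and the column names are the ones shown in the table above.
import matplotlib.pyplot as plt
# cumulative lift versus cumulative data fraction, with the random-model baseline at lift = 1
plt.plot(pandas_gl['cumulative_data_fraction'], pandas_gl['cumulative_lift'], marker='o', label='model')
plt.axhline(1.0, linestyle='--', color='grey', label='baseline (lift = 1)')
plt.xlabel('cumulative data fraction')
plt.ylabel('cumulative lift')
plt.legend()
plt.show()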

Related

How can I cluster my weighted directed graph?

I've been trying to cluster my graph of jobs.
The edge weights are the counts of the transitions between two nodes (jobs).
I've been reading up on this and I've based my code on this paper: https://hal.archives-ouvertes.fr/hal-01887680/document
Code:
import networkx as nx
import numpy as np
from sklearn.cluster import SpectralClustering

G = nx.DiGraph()  # full transitions graph
G.add_weighted_edges_from(list(transitions_df.itertuples(index=False, name=None)))
# subgraph with only two sub-family jobs (clusters)
H = nx.subgraph(G, list(df.query("sub_family_desc == 'ClientSupport' | sub_family_desc == 'Consulting' ").code.unique()))
pos = nx.kamada_kawai_layout(H)
weights = nx.get_edge_attributes(H, "weight")
a = nx.spectral_graph_forge(H, 0.7)
adj_mat = nx.to_numpy_matrix(a)
sc = SpectralClustering(2, affinity='precomputed', n_init=100, assign_labels="kmeans", random_state=np.random.seed(1234))
sc.fit(adj_mat)
I also tried to add random walks, but I failed and couldn't pass the random walks to the scikit-learn SpectralClustering:
from stellargraph import StellarGraph
#converting it to stellar graph format so we can leverage biased random walk from this library
sg_graph = StellarGraph.from_networkx(H)
print(sg_graph.info())
from stellargraph.data import BiasedRandomWalk
from gensim.models import Word2Vec
# from each node/job we generate weight-biased random walks with a maximum length of 10
rw = BiasedRandomWalk(sg_graph)
# 100 walks per node (n=100)
weighted_walks = rw.run(nodes=sg_graph.nodes(), length=10, n=100, p=5, q=0.05, weighted=True, seed=1234)
print("Number of random walks: {}".format(len(weighted_walks)))
Should I have added the random walks to the model? How can I do that?
Use Louvain community detection.
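A minimal sketch of that suggestion, assuming a recent networkx (>= 3.0, where nx.community.louvain_communities is available) and the weighted subgraph H built above:
# Louvain community detection on the weighted transition graph
communities = nx.community.louvain_communities(H, weight="weight", seed=1234)
# communities is a list of sets of nodes; map each node to a community label
labels = {node: i for i, comm in enumerate(communities) for node in comm}
print("found {} communities".format(len(communities)))
Unlike spectral clustering, you don't have to fix the number of clusters up front; the resolution parameter controls the granularity instead.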

Can I calculate the confidence bound of a Prophet model that would contain a certain value?

Can I use the y-hat variance, the bounds, and the point estimate from the forecast data frame to calculate the confidence level that would contain a given value?
I've seen that I can change the interval level prior to fitting, but doing that programmatically feels like a LOT of expensive trial and error.
Is there a way to estimate the confidence bound using only the information from the model parameters and the forecast data frame?
Something like:
for level in [.05, .1, .15, ..., .95]:
    if value_in_question in (yhat - Z_{level}*yhat_variance/N, yhat + Z_{level}*yhat_variance/N):
        print(f'im in the bound level {level}')
# this is pseudo-code, not meant to run in a console
EDIT: working prophet example
# csv from fbprohets working examples https://github.com/facebook/prophet/blob/master/examples/example_wp_log_peyton_manning.csv
import pandas as pd
import numpy as np
import scipy.stats as st
from fbprophet import Prophet
df = pd.read_csv('example_wp_log_peyton_manning.csv')
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=30)
forecast = m.predict(future)
# the smallest confidence level s.t. the confidence interval of the 30th prediction contains 9
## My current approach
def __probability_calculation(estimate, forecast, j=30):
    # -1.28 is roughly the z-score of yhat_lower for Prophet's default 80% interval
    sd_residuals = (forecast.yhat_lower[j] - forecast.yhat[j]) / (-1.28)
    for alpha in np.arange(.5, .95, .01):
        z_val = st.norm.ppf(alpha)
        if forecast.yhat[j] - z_val*sd_residuals < estimate < forecast.yhat[j] + z_val*sd_residuals:
            return alpha

prob = __probability_calculation(9, forecast)
fbprophet uses the numpy.percentile method to estimate the percentiles as you can see here in the source code:
https://github.com/facebook/prophet/blob/0616bfb5daa6888e9665bba1f95d9d67e91fed66/python/prophet/forecaster.py#L1448
How to inverse calculate percentiles for values is already answered here:
Map each list value to its corresponding percentile
Combining everything based on your code example:
import pandas as pd
import numpy as np
import scipy.stats as st
from fbprophet import Prophet
url = 'https://raw.githubusercontent.com/facebook/prophet/master/examples/example_wp_log_peyton_manning.csv'
df = pd.read_csv(url)
# put the amount of uncertainty samples in a variable so we can use it later.
uncertainty_samples = 1000 # 1000 is the default
m = Prophet(uncertainty_samples=uncertainty_samples)
m.fit(df)
future = m.make_future_dataframe(periods=30)
# You need to replicate some of the preparation steps which are part of the predict() call internals
tmpdf = m.setup_dataframe(future)
tmpdf['trend'] = m.predict_trend(tmpdf)
sim_values = m.sample_posterior_predictive(tmpdf)
The sim_values object contains 1000 simulations for every data point; these are what the confidence interval is based on.
Now you can call the scipy.stats.percentileofscore method with any target value:
target_value = 8
st.percentileofscore(sim_values['yhat'], target_value, 'weak') / uncertainty_samples
# returns 44.26
To prove this works backwards and forwards, you can take the output of the np.percentile method and feed it into the scipy.stats.percentileofscore method. This works to an accuracy of 4 decimals:
ACCURACY = 4
for test_percentile in np.arange(0, 100, 0.5):
    target_value = np.percentile(sim_values['yhat'], test_percentile)
    if not np.round(st.percentileofscore(sim_values['yhat'], target_value, 'weak') / uncertainty_samples, ACCURACY) == np.round(test_percentile, ACCURACY):
        print(test_percentile)
        raise ValueError('This doesnt work')
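Tying this back to the original question, here is a small follow-up sketch. It is my own reasoning on top of the answer above (not something from the Prophet codebase), and it assumes sim_values['yhat'] holds one row of posterior draws per prediction row: a value whose empirical quantile is q is covered by any central interval of level at least |2q - 1|.
# smallest central interval level that would contain target_value for prediction row j
j = 30                                          # row index used in the question
samples_j = np.asarray(sim_values['yhat'])[j]   # posterior draws for that row
q = (samples_j <= target_value).mean()          # empirical CDF of the target value
required_level = 2 * abs(q - 0.5)               # central interval level needed to cover it
print('covered by any interval_width >= {:.2f}'.format(required_level))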

How to calculate feature_log_prob_ in the naive_bayes MultinomialNB

Here's my code:
# Load libraries
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
# Create text
text_data = np.array(['Tim is smart!',
'Joy is the best',
'Lisa is dumb',
'Fred is lazy',
'Lisa is lazy'])
# Create target vector
y = np.array([1,1,0,0,0])
# Create bag of words
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)
# Create feature matrix
X = bag_of_words.toarray()
mnb = MultinomialNB(alpha = 1, fit_prior = True, class_prior = None)
mnb.fit(X,y)
print(count.get_feature_names())
# output:['best', 'dumb', 'fred', 'is', 'joy', 'lazy', 'lisa', 'smart', 'the', 'tim']
print(mnb.feature_log_prob_)
# output
[[-2.94443898 -2.2512918 -2.2512918 -1.55814462 -2.94443898 -1.84582669
-1.84582669 -2.94443898 -2.94443898 -2.94443898]
[-2.14006616 -2.83321334 -2.83321334 -1.73460106 -2.14006616 -2.83321334
-2.83321334 -2.14006616 -2.14006616 -2.14006616]]
My question is:
Let's say for the word "best": the probability for class 1 is -2.14006616.
What is the formula used to calculate this score?
I am using log(P(best | y = class 1)) -> log(1/2) and can't get the -2.14006616.
From the documentation we can infer that feature_log_prob_ corresponds to the empirical log probability of features given a class. Let's take the feature "best" for the purpose of this illustration: the log probability of this feature for class 1 is -2.14006616 (as you pointed out). If we convert it back into an actual probability we get np.exp(-2.14006616) = 0.11764. Let's take one more step back to see how and why the probability of "best" in class 1 is 0.11764. As per the documentation of Multinomial Naive Bayes, these probabilities are computed using the smoothed estimate
theta_yi = (N_yi + alpha) / (N_y + alpha*n)
where the numerator corresponds to the number of times the feature "best" appears in class 1 (the class of interest in this example) in the training set, and the denominator corresponds to the total count of all features for class 1. We also add a small smoothing value alpha to prevent the probabilities from going to zero, and n is the total number of features, i.e. the size of the vocabulary. Computing these numbers for the example at hand:
N_yi = 1   # "best" appears only once in class `1`
N_y = 7    # total count of all word occurrences in class `1`
alpha = 1  # default value as per sklearn
n = 10     # size of the vocabulary
Required_probability = (1 + 1)/(7 + 1*10) = 0.11764
You can do the math in a similar fashion for any given feature and class.
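If you want to verify this numerically, here is a short sketch (reusing X, y, and mnb from the code above) that recomputes feature_log_prob_ from the raw counts:
# recompute the smoothed per-class log probabilities and compare with sklearn
alpha = 1
counts_per_class = np.array([X[y == c].sum(axis=0) for c in mnb.classes_])  # N_yi for every feature and class
manual_log_prob = np.log((counts_per_class + alpha) /
                         (counts_per_class.sum(axis=1, keepdims=True) + alpha * X.shape[1]))
print(np.allclose(manual_log_prob, mnb.feature_log_prob_))  # expect True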

How to add cluster label columns back into original dataframe- python, for supervised learning

I have a column in my data frame which contains URL information. It has 1200+ unique values. I wanted to use text mining to generate features from these values. I used TfidfVectorizer to generate vectors and then used KMeans to identify clusters. I now want to assign these cluster labels back to my original dataframe, so that I can bin the URL information into these clusters.
Below is the code to generate the vectors and the cluster labels:
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, lowercase=False, ngram_range=(1, 1), use_idf=True, stop_words='english')
X = vectorizer.fit_transform(sample['lead_lead_source_modified'])
X = X.toarray()

# elbow method: compute distortions for k = 1..9
distortions = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(X)
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])

# append cluster labels
km = KMeans(n_clusters=4, random_state=0)
km.fit_transform(X)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel_lead_lead_source'])
cluster_labels
Through the elbow method, I decided on 4 clusters. I now have cluster labels, but I am not sure how to add them back to the dataframe at the respective indices. Concatenating along axis=1 is creating NaNs due to indexing issues. Below is a sample of the output after concatenation:
lead_lead_source_modified ClusterLabel_lead_lead_source
0 NaN 3.0
1 NaN 0.0
2 NaN 0.0
3 ['direct', 'salesline', 'website', ''] 0.0
I want to know if this approach is the right way to do this and, if so, how to solve this issue. If not, is there a better way to do it?
Adding the index value during the dataframe conversion solved the issue.
But I still want to know if this is the right approach.
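For reference, a minimal sketch of what "adding the index value during the dataframe conversion" looks like, assuming sample is the original dataframe the TF-IDF matrix was built from:
# build the label column with the original dataframe's index so that concat aligns correctly
cluster_labels = pd.DataFrame(km.labels_,
                              columns=['ClusterLabel_lead_lead_source'],
                              index=sample.index)
sample_with_clusters = pd.concat([sample, cluster_labels], axis=1)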

Compute annual mean using x-arrays

I have a python xarray dataset with time,x,y for its dimensions and value1 as its variable. I'm trying to compute annual mean of value1 for each x,y coordinate pair.
I've run into this function while reading the docs:
ds.groupby('time.year').mean()
This seems to compute a single annual mean across all x,y coordinate pairs in value1, rather than an annual mean for each individual x,y coordinate pair.
While the code snippet above produces the wrong output, I'm very interested in its oversimplified form. I would really like to figure out the "xarray trick" for computing the annual mean per x,y coordinate pair rather than hacking it together myself.
Can someone point me in the right direction? Should I temporarily turn this into a pandas object?
To avoid the default of averaging over all dimensions, you simply need to supply the dimension you want to average over explicitly:
ds.groupby('time.year').mean('time')
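As a quick sanity check, here is a small synthetic example (the variable name, grid size, and dates are made up for illustration) showing that the x and y dimensions are preserved:
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2002-12-31", freq="D")
ds_demo = xr.Dataset(
    {"value1": (("time", "x", "y"), np.random.rand(len(time), 3, 4))},
    coords={"time": time, "x": range(3), "y": range(4)},
)
annual = ds_demo.groupby("time.year").mean("time")
print(annual["value1"].dims)  # ('year', 'x', 'y') -- one mean per x,y pair per year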
Note that calling ds.groupby('time.year').mean('time') will be incorrect if you are working with monthly rather than daily data: taking an unweighted mean places equal weight on months of different lengths, e.g. February and July, which is wrong.
Instead, use the function below from NCAR:
import numpy as np
import xarray as xr

def weighted_temporal_mean(ds, var):
    """
    weight by days in each month
    """
    # determine the length of each month
    month_length = ds.time.dt.days_in_month
    # calculate the weights
    wgts = month_length.groupby("time.year") / month_length.groupby("time.year").sum()
    # make sure the weights in each year add up to 1
    np.testing.assert_allclose(wgts.groupby("time.year").sum(xr.ALL_DIMS), 1.0)
    # subset our dataset for our variable
    obs = ds[var]
    # set up masking for nan values
    cond = obs.isnull()
    ones = xr.where(cond, 0.0, 1.0)
    # calculate the numerator
    obs_sum = (obs * wgts).resample(time="AS").sum(dim="time")
    # calculate the denominator
    ones_out = (ones * wgts).resample(time="AS").sum(dim="time")
    # return the weighted average
    return obs_sum / ones_out
average_weighted_temp = weighted_temporal_mean(ds_first_five_years, 'TEMP')

Resources