How to calculate feature_log_prob_ in the naive_bayes MultinomialNB - machine-learning

Here's my code:
# Load libraries
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
# Create text
text_data = np.array(['Tim is smart!',
'Joy is the best',
'Lisa is dumb',
'Fred is lazy',
'Lisa is lazy'])
# Create target vector
y = np.array([1,1,0,0,0])
# Create bag of words
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data) #
# Create feature matrix
X = bag_of_words.toarray()
mnb = MultinomialNB(alpha = 1, fit_prior = True, class_prior = None)
mnb.fit(X,y)
print(count.get_feature_names())
# output:['best', 'dumb', 'fred', 'is', 'joy', 'lazy', 'lisa', 'smart', 'the', 'tim']
print(mnb.feature_log_prob_)
# output
[[-2.94443898 -2.2512918 -2.2512918 -1.55814462 -2.94443898 -1.84582669
-1.84582669 -2.94443898 -2.94443898 -2.94443898]
[-2.14006616 -2.83321334 -2.83321334 -1.73460106 -2.14006616 -2.83321334
-2.83321334 -2.14006616 -2.14006616 -2.14006616]]
My question is:
Let's say for word: "best": the probability for class 1 : -2.14006616.
What is the formula to calculate to get this score.
I am using LOG (P(best|y=class=1)) -> Log(1/2) -> can't get the -2.14006616

From the documentation we can infer that feature_log_prob_ corresponds to the empirical log probability of features given a class. Let's take an example feature "best" for the purpose of this illustration, the log probability of this feature for class 1 is -2.14006616 (as you pointed out), now if we were to convert it into actual probability score it will be np.exp(1)**-2.14006616 = 0.11764. Let's take one more step back to see how and why the probability of "best" in class 1 is 0.11764. As per the documentation of Multinomial Naive Bayes, we see that these probabilities are computed using the formula below:
Where, the numerator roughly corresponds to the number of times feature "best" appears in the class 1 (which is of our interest in this example) in the training set, and the denominator corresponds to the total count of all features for class 1. Also, we add a small smoothing value, alpha to prevent from the probabilities going to zero and n corresponds to the total number of features i.e. size of vocabulary. Computing these numbers for the example we have,
N_yi = 1 # "best" appears only once in class `1`
N_y = 7 # There are total 7 features (count of all words) in class `1`
alpha = 1 # default value as per sklearn
n = 10 # size of vocabulary
Required_probability = (1+1)/(7+1*10) = 0.11764
You can do the math in a similar fashion for any given feature and class.

Related

Can I calculate the confidence bound of a Prophet model that would contain a certain value?

can I use the y-hat variance, the bounds, and the point estimate from the forecast data frame to calculate the confidence level that would contain a given value?
I've seen that I can change my interval level prior to fitting but programmatically that feels like a LOT of expensive trial and error.
Is there a way to estimate the confidence bound using only the information from the model parameters and the forecast data frame?
Something like:
for level in [.05, .1, .15, ... , .95]:
if value_in_question in (yhat - Z_{level}*yhat_variance/N, yhat + Z_{level}*yhat_variance/N):
print 'im in the bound level {level}'
# This is sudo code not meant to run in console
EDIT: working prophet example
# csv from fbprohets working examples https://github.com/facebook/prophet/blob/master/examples/example_wp_log_peyton_manning.csv
import pandas as pd
from fbprophet import Prophet
import os
df = pd.read_csv('example_wp_log_peyton_manning.csv')
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=30)
forecast = m.predict(future)
# the smallest confidence level s.t. the confidence interval of the 30th prediction contains 9
## My current approach
def __probability_calculation(estimate, forecast, j = 30):
sd_residuals = (forecast.yhat_lower[j] - forecast.yhat[j])/(-1.28)
for alpha in np.arange(.5, .95, .01):
z_val = st.norm.ppf(alpha)
if (forecast.yhat[j]-z_val*sd_residuals < estimate < forecast.yhat[j]+z_val*sd_residuals):
return alpha
prob = __probability_calculation(9, forecast)
fbprophet uses the numpy.percentile method to estimate the percentiles as you can see here in the source code:
https://github.com/facebook/prophet/blob/0616bfb5daa6888e9665bba1f95d9d67e91fed66/python/prophet/forecaster.py#L1448
How to inverse calculate percentiles for values is already answered here:
Map each list value to its corresponding percentile
Combining everything based on your code example:
import pandas as pd
import numpy as np
import scipy.stats as st
from fbprophet import Prophet
url = 'https://raw.githubusercontent.com/facebook/prophet/master/examples/example_wp_log_peyton_manning.csv'
df = pd.read_csv(url)
# put the amount of uncertainty samples in a variable so we can use it later.
uncertainty_samples = 1000 # 1000 is the default
m = Prophet(uncertainty_samples=uncertainty_samples)
m.fit(df)
future = m.make_future_dataframe(periods=30)
# You need to replicate some of the preparation steps which are part of the predict() call internals
tmpdf = m.setup_dataframe(future)
tmpdf['trend'] = m.predict_trend(tmpdf)
sim_values = m.sample_posterior_predictive(tmpdf)
The sim_values object contains for every datapoint 1000 simulations on which the confidence interval is based.
Now you can call the scipy.stats.percentileofscore method with any target value
target_value = 8
st.percentileofscore(sim_values['yhat'], target_value, 'weak') / uncertainty_samples
# returns 44.26
To prove this works backwards and forwards you can get the output of the np.percentile method and put it in the scipy.stats.percentileofscore method.
This works for an accuracy of 4 decimals:
ACCURACY = 4
for test_percentile in np.arange(0, 100, 0.5):
target_value = np.percentile(sim_values['yhat'], test_percentile)
if not np.round(st.percentileofscore(sim_values['yhat'], target_value, 'weak') / uncertainty_samples, ACCURACY) == np.round(test_percentile, ACCURACY):
print(test_percentile)
raise ValueError('This doesnt work')

evaluate the output of autoML results

How do I interpret following results? What is the best possible algorithm to train based on autogluon summary?
*** Summary of fit() ***
Estimated performance of each model:
model score_val fit_time pred_time_val stack_level
19 weighted_ensemble_k0_l2 -0.035874 1.848907 0.002517 2
18 weighted_ensemble_k0_l1 -0.040987 1.837416 0.002259 1
16 CatboostClassifier_STACKER_l1 -0.042901 1559.653612 0.083949 1
11 ExtraTreesClassifierGini_STACKER_l1 -0.047882 7.307266 1.057873 1
...
...
0 RandomForestClassifierGini_STACKER_l0 -0.291987 9.871649 1.054538 0
The code to generate the above results:
import pandas as pd
from autogluon import TabularPrediction as task
from sklearn.datasets import load_digits
digits = load_digits()
savedir = "otto_models/" # where to save trained models
train_data = pd.DataFrame(digits.data)
train_target = pd.DataFrame(digits.target)
train_data = pd.merge(train_data, train_target, left_index=True, right_index=True)
label_column = "0_y"
predictor = task.fit(
train_data=train_data,
label=label_column,
output_directory=savedir,
eval_metric="log_loss",
auto_stack=True,
verbosity=2,
visualizer="tensorboard",
)
results = predictor.fit_summary() # display detailed summary of fit() process
Which algorithm seems to work in this case?
weighted_ensemble_k0_l2 is the best result in terms of validation score (score_val) because it has the highest value. You may wish to do predictor.leaderboard(test_data) to get the test scores for each of the models.
Note that the result shows a negative score because AutoGluon always considers higher to be better. If a particular metric such as logloss prefers lower values to be better, AutoGluon flips the sign of the metric. I would guess a val_score of 0 would be a perfect score in your case.

H2O Gain/Lift table

My question is about H2O Gain/Lift table. I understand that the response rate is the proportion of all the events that fall into the group/bin. HOW to get that pieces of data that fall into bin 1, bin 2, etc.? I want to see how the key variables look in each group/bin in respect to the Response Rate.
It would be great to have a full description of how the measures in Gain/Lift table are calculated (formulas)
The equations for the Gains and Lift Chart can be found in this file: https://github.com/h2oai/h2o-3/blob/master/h2o-core/src/main/java/hex/GainsLift.java
Which shows:
E = total number of events
N = number of observations
G = number of groups (10 for deciles or 20 for demi-deciles)
P = overall proportion of observations that are events (P = E/N)
ei = number of events in group i, i=1,2,...,G
ni = number of observations in group i
pi = proportion of observations in group i that are events (pi = ei/ni)
groups: are hard coded to 16; if there are fewer than 16 unique probability values, then the number of groups is reduced to the number of unique quantile thresholds.
cumulative data fraction = sum_n_i/N
lower_threshold = set by quantile bins
lift = pi/P
cumulative_lift = (Σiei/Σini)/P
response_rate = 100*pi
cumulative_response_rate = 100*Σiei/Σini
capture_rate = 100*ei/E
cumulative_capture_rate = 100*Σiei/E
gain = 100*(lift-1)
cumulative_gain = 100*(sum_lift-1)
average_response_rate = E/N
So here is a example walkthrough using the H2O-3 Python API:
import h2o
import pandas as pd
import numpy as np
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# import and split the dataset
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
convert response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
# set the predictor names and the response column name
predictors = ["displacement","power","weight","acceleration","year"]
response = "economy_20mpg"
# split dataset
train, valid = cars.split_frame(ratios=[.7],seed=1234)
# Initialize and train a GBM
cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
cars_gbm.train(x = predictors, y = response, training_frame = train, validation_frame=valid)
# Generate Gains and Lift Table
# documentation on this parameter can be found here:
# http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/model_categories.html?#h2o.model.H2OBinomialModel.gains_lift
gainslift = cars_gbm.gains_lift(train=False, valid=True, xval=False)
Table Overview
As expected we have 16 groups, because this is the hardcoded default behavior.
Cumulative data fractions
Threshold probability value
Response rates (proportion of observations that are events in a group)
Cumulative response rate
Event capture rate
Cumulative capture rate
Gain (difference in percentages between the overall proportion of events and the observed proportion of observations that are events in the group)
Cumulative gain
What if I Want Just the Deciles
By default the Gains and Lift Table provides you with more then just the deciles or ventiles, what this means is you have more flexibilty to pick out the percentiles in which you are interested.
Let's take the example of getting our deciles. In this example we see that we can start at row 6, skip row 7 and then take the rest of the rows to get our deciles.
Since the Gains and Lift Table returns a TwoDimTable we can use our group numbers as selection indices.
# show gains and lift table data type
print('H2O Gains Lift Table is of type: ', type(gainslift))
H2O Gains Lift Table is of type: <class 'h2o.two_dim_table.H2OTwoDimTable'>
# since this table is small and for ease of use let's covert to a pandas dataframe
pandas_gl = gainslift.as_data_frame()
pandas_gl.set_index('group')
gainslift_deciles = pandas_gl.iloc[pd.np.r_[5,7:16], :]
gainslift_deciles
What if I Want Just the Ventiles
Those are available to select out as well, so let's do that next.
gainslift_ventiles = pandas_gl.iloc[pd.np.r_[7,9,11,13,15], :]
gainslift_ventiles

Scikit-learn PCA .fit_transform shape is inconsistent (n_samples << m_attributes)

I am getting different shapes for my PCA using sklearn. Why isn't my transformation resulting in an array of the same dimensions like the docs say?
fit_transform(X, y=None)
Fit the model with X and apply the dimensionality reduction on X.
Parameters:
X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.
Returns:
X_new : array-like, shape (n_samples, n_components)
Check this out with the iris dataset which is (150, 4) where I'm making 4 PCs:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})
%matplotlib inline
np.random.seed(0)
# Iris dataset
DF_data = pd.DataFrame(load_iris().data,
index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
columns = load_iris().feature_names)
Se_targets = pd.Series(load_iris().target,
index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
name = "Species")
# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data),
index = DF_data.index,
columns = DF_data.columns)
# Sklearn for Principal Componenet Analysis
# Dims
m = DF_standard.shape[1]
K = m
# PCA (How I tend to set it up)
M_PCA = decomposition.PCA()
A_components = M_PCA.fit_transform(DF_standard)
#DF_standard.shape, A_components.shape
#((150, 4), (150, 4))
but then when I use the same exact approach on my actual dataset (76, 1989) as in 76 samples and 1989 attributes/dimensions I get a (76, 76) array instead of (76, 1989)
DF_centered = normalize(DF_mydata, method="center", axis=0)
m = DF_centered.shape[1]
# print(m)
# 1989
M_PCA = decomposition.PCA(n_components=m)
A_components = M_PCA.fit_transform(DF_centered)
DF_centered.shape, A_components.shape
# ((76, 1989), (76, 76))
normalize is just a wrapper I made that subtracts the mean from each dimension.
(Note: this answer is adapted from my answer on Cross Validated here: Why are there only n−1 principal components for n data points if the number of dimensions is larger or equal than n?)
PCA (as most typically run) creates a new coordinate system by:
shifting the origin to the centroid of your data,
squeezes and/or stretches the axes to make them equal in length, and
rotates your axes into a new orientation.
(For more details, see this excellent CV thread: Making sense of principal component analysis, eigenvectors & eigenvalues.) However, step 3 rotates your axes in a very specific way. Your new X1 (now called "PC1", i.e., the first principal component) is oriented in your data's direction of maximal variation. The second principal component is oriented in the direction of the next greatest amount of variation that is orthogonal to the first principal component. The remaining principal components are formed likewise.
With this in mind, let's examine a simple example (suggested by #amoeba in a comment). Here is a data matrix with two points in a three dimensional space:
X = [ 1 1 1
2 2 2 ]
Let's view these points in a (pseudo) three dimensional scatterplot:
So let's follow the steps listed above. (1) The origin of the new coordinate system will be located at (1.5,1.5,1.5). (2) The axes are already equal. (3) The first principal component will go diagonally from what used to be (0,0,0) to what was originally (3,3,3), which is the direction of greatest variation for these data. Now, the second principal component must be orthogonal to the first, and should go in the direction of the greatest remaining variation. But what direction is that? Is it from (0,0,3) to (3,3,0), or from (0,3,0) to (3,0,3), or something else? There is no remaining variation, so there cannot be any more principal components.
With N=2 data, we can fit (at most) N−1=1 principal components.

Supervised Latent Dirichlet Allocation for Document Classification?

I have a bunch of already human-classified documents in some groups.
Is there a modified version of lda which I can use to train a model and then later classify unknown documents with it?
For what it's worth, LDA as a classifier is going to be fairly weak because it's a generative model, and classification is a discriminative problem. There is a variant of LDA called supervised LDA which uses a more discriminative criterion to form the topics (you can get source for this in various places), and there's also a paper with a max margin formulation that I don't know the status of source-code-wise. I would avoid the Labelled LDA formulation unless you're sure that's what you want, because it makes a strong assumption about the correspondence between topics and categories in the classification problem.
However, it's worth pointing out that none of these methods use the topic model directly to do the classification. Instead, they take documents, and instead of using word-based features use the posterior over the topics (the vector that results from inference for the document) as its feature representation before feeding it to a classifier, usually a Linear SVM. This gets you a topic model based dimensionality reduction, followed by a strong discriminative classifier, which is probably what you're after. This pipeline is available
in most languages using popular toolkits.
You can implement supervised LDA with PyMC that uses Metropolis sampler to learn the latent variables in the following graphical model:
The training corpus consists of 10 movie reviews (5 positive and 5 negative) along with the associated star rating for each document. The star rating is known as a response variable which is a quantity of interest associated with each document. The documents and response variables are modeled jointly in order to find latent topics that will best predict the response variables for future unlabeled documents. For more information, check out the original paper.
Consider the following code:
import pymc as pm
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
train_corpus = ["exploitative and largely devoid of the depth or sophistication ",
"simplistic silly and tedious",
"it's so laddish and juvenile only teenage boys could possibly find it funny",
"it shows that some studios firmly believe that people have lost the ability to think",
"our culture is headed down the toilet with the ferocity of a frozen burrito",
"offers that rare combination of entertainment and education",
"the film provides some great insight",
"this is a film well worth seeing",
"a masterpiece four years in the making",
"offers a breath of the fresh air of true sophistication"]
test_corpus = ["this is a really positive review, great film"]
train_response = np.array([3, 1, 3, 2, 1, 5, 4, 4, 5, 5]) - 3
#LDA parameters
num_features = 1000 #vocabulary size
num_topics = 4 #fixed for LDA
tfidf = TfidfVectorizer(max_features = num_features, max_df=0.95, min_df=0, stop_words = 'english')
#generate tf-idf term-document matrix
A_tfidf_sp = tfidf.fit_transform(train_corpus) #size D x V
print "number of docs: %d" %A_tfidf_sp.shape[0]
print "dictionary size: %d" %A_tfidf_sp.shape[1]
#tf-idf dictionary
tfidf_dict = tfidf.get_feature_names()
K = num_topics # number of topics
V = A_tfidf_sp.shape[1] # number of words
D = A_tfidf_sp.shape[0] # number of documents
data = A_tfidf_sp.toarray()
#Supervised LDA Graphical Model
Wd = [len(doc) for doc in data]
alpha = np.ones(K)
beta = np.ones(V)
theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])
z = pm.Container([pm.Categorical('z_%s' % d, p = theta[d], size=Wd[d], value=np.random.randint(K, size=Wd[d])) for d in range(D)])
#pm.deterministic
def zbar(z=z):
zbar_list = []
for i in range(len(z)):
hist, bin_edges = np.histogram(z[i], bins=K)
zbar_list.append(hist / float(np.sum(hist)))
return pm.Container(zbar_list)
eta = pm.Container([pm.Normal("eta_%s" % k, mu=0, tau=1.0/10**2) for k in range(K)])
y_tau = pm.Gamma("tau", alpha=0.1, beta=0.1)
#pm.deterministic
def y_mu(eta=eta, zbar=zbar):
y_mu_list = []
for i in range(len(zbar)):
y_mu_list.append(np.dot(eta, zbar[i]))
return pm.Container(y_mu_list)
#response likelihood
y = pm.Container([pm.Normal("y_%s" % d, mu=y_mu[d], tau=y_tau, value=train_response[d], observed=True) for d in range(D)])
# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d,i), p = pm.Lambda('phi_z_%i_%i' % (d,i), lambda z=z[d][i], phi=phi: phi[z]),
value=data[d][i], observed=True) for d in range(D) for i in range(Wd[d])])
model = pm.Model([theta, phi, z, eta, y, w])
mcmc = pm.MCMC(model)
mcmc.sample(iter=1000, burn=100, thin=2)
#visualize topics
phi0_samples = np.squeeze(mcmc.trace('phi_0')[:])
phi1_samples = np.squeeze(mcmc.trace('phi_1')[:])
phi2_samples = np.squeeze(mcmc.trace('phi_2')[:])
phi3_samples = np.squeeze(mcmc.trace('phi_3')[:])
ax = plt.subplot(221)
plt.bar(np.arange(V), phi0_samples[-1,:])
ax = plt.subplot(222)
plt.bar(np.arange(V), phi1_samples[-1,:])
ax = plt.subplot(223)
plt.bar(np.arange(V), phi2_samples[-1,:])
ax = plt.subplot(224)
plt.bar(np.arange(V), phi3_samples[-1,:])
plt.show()
Given the training data (observed words and response variables), we can learn the global topics (beta) and regression coefficients (eta) for predicting the response variable (Y) in addition to topic proportions for each document (theta).
In order to make predictions of Y given the learned beta and eta, we can define a new model where we do not observe Y and use the previously learned beta and eta to obtain the following result:
Here we predicted a positive review (approx 2 given review rating range of -2 to 2) for the test corpus consisting of one sentence: "this is a really positive review, great film" as shown by the mode of the posterior histogram on the right.
See ipython notebook for a complete implementation.
Yes you can try the Labelled LDA in the stanford parser at
http://nlp.stanford.edu/software/tmt/tmt-0.4/

Resources