predict conditional mean of one potential outcome in causal forest? - random-forest

I fit a causal_forest() and used predict() to get estimates of tau(X) = E[Y(1) - Y(0) | X].
How can I get estimates of E[Y(0) | X] alone ?
Thank you !

Related

How to choose the order of variable elimination in bayes net?

I have the following bayes net with me.
I want to find P(+h|+e), so I have to find A = P(+h,+e) and B = P(+e). I want to use variable elimination to find the probability, but different elimination orders are giving me different probabilities. How should I choose the order of variable elimination for an accurate calculation of P(+h|+e)?
Would it be okay to calculate P(+h,+u,+e) and eliminate +u, instead of finding P(+i, +h, +t, +u, +e) and eliminating +i, +t and +u, to find P(+h,+e)?
How do I calculate P(+e)?
1. P(h|e) is a conditional probability of the form P(cause | effect): we are using an effect to infer the cause (the diagnostic direction).
P(c | e) = P(e | c)P(c) / P(e), i.e. P(h | e) = P(e | h)P(h) / P(e) = P(h, e) / P(e)
So to calculate P(h,e) you would have to calculate the joint distribution over all the variables and marginalise out each of the others, since they are relevant to the query and evidence variables.
P(+i, +h, +t, +u, +e) would be the correct choice.
To calculate P(+e) we need only its parents, i.e. Good test taker (t) and Understands the material (u). So we need the underlying conditional distribution P(e | t, u), and we marginalise out the variables t and u.
P(+e)
= Sum_t( Sum_u( P(+e, t, u)))
= P( +e | +t,+u)P(+t)P(+u) + P( +e | +t,-u)P(+t)P(-u) + P( +e | -t,+u)P(-t)P(+u)+ P( +e | -t,-u)P(-t)P(-u)
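On the question of order: exact variable elimination gives the same probability for every elimination order; the order only changes how large the intermediate factors get. If different orders give different numbers, the likely cause is an arithmetic or bookkeeping slip. The sketch below checks this for the P(+e) sum above. The CPT values are hypothetical (the question's tables are not shown), and it follows the answer's factorisation P(t)P(u), which treats t and u as independent; if they share a parent in the actual network, the joint P(t, u) would be needed instead.

```python
# Hypothetical CPT values, chosen only for illustration.
P_T = {True: 0.7, False: 0.3}      # P(t)
P_U = {True: 0.6, False: 0.4}      # P(u)
P_E_GIVEN_TU = {                   # P(+e | t, u)
    (True, True): 0.9,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}
VALUES = (True, False)


def p_e_eliminate_t_then_u():
    # Sum out t first to get a factor over u, then sum out u.
    factor_u = {
        u: sum(P_E_GIVEN_TU[(t, u)] * P_T[t] for t in VALUES) for u in VALUES
    }
    return sum(factor_u[u] * P_U[u] for u in VALUES)


def p_e_eliminate_u_then_t():
    # Sum out u first to get a factor over t, then sum out t.
    factor_t = {
        t: sum(P_E_GIVEN_TU[(t, u)] * P_U[u] for u in VALUES) for t in VALUES
    }
    return sum(factor_t[t] * P_T[t] for t in VALUES)


p_order_1 = p_e_eliminate_t_then_u()
p_order_2 = p_e_eliminate_u_then_t()
print(p_order_1, p_order_2)
```

Both orders produce the same P(+e) up to floating-point rounding.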

Is the majority class treated as positive in Sklearn? Sklearn calculates the false positive rate as the false negative rate

I'm working with classification with imbalanced data set using Sklearn. Sklearn has calculated the false_positive_rate and true_positive_rate wrong; when I want to calculate the AUC score, the result is different from what I have gotten from the confusion matrix.
From Sklearn I got the following confusion matrix:
confusion = confusion_matrix(y_test, y_pred)
array([[ 9100, 4320],
[109007, 320068]], dtype=int64)
Of course, I understand the output as:
+--------+-------------------------+------------------------+
|        | Predicted               | Predicted              |
+--------+-------------------------+------------------------+
| Actual | True positive = 9100    | False negative = 4320  |
| Actual | False positive = 109007 | True negative = 320068 |
+--------+-------------------------+------------------------+
However, for FPR and TPR, I got the following result:
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
(false_positive_rate, true_positive_rate)
(array([0. , 0.3219076, 1. ]),
array([0. , 0.7459488, 1. ]))
The result is different from the confusion_matrix. According to my table, the FPR is actually the FNR, and the TPR is actually the TNR. Then I checked the confusion matrix documentation and found that:
Thus, in binary classification, the count of true negatives is C_{0,0}, false negatives is C_{1,0}, true positives is C_{1,1} and false positives is C_{0,1}.
This means that the confusion_matrix, according to Sklearn, looks like this:
+--------+------------------------+-------------------------+
|        | Predicted              | Predicted               |
+--------+------------------------+-------------------------+
| Actual | True positive = 320068 | False negative = 109007 |
| Actual | False positive = 4320  | True negative = 9100    |
+--------+------------------------+-------------------------+
According to the theory, for binary classification, the rare class is denoted as the positive class.
Why does Sklearn treat the majority class as positive?
After some experiments, I found that when IsolationForest from sklearn is used on imbalanced data, the confusion_matrix shows that IsolationForest treats the majority (Normal) class as the positive class, whereas the minority class should be the positive class in Fraud/Outlier/Anomaly detection tasks.
To overcome this there are 2 solutions:
1. Interpret the confusion matrix the other way round: read FP instead of FN and TP instead of TN.
2. If you want to pass the results on correctly despite the way IF labels imbalanced data, you can use the following trick:
Typically IF returns -1 for outliers and 1 for inliers, so if you swap 1 and -1 in the output of the IsolationForest, you can then use the standard metric calculations correctly.
import numpy as np
from sklearn.ensemble import IsolationForest

IF_model = IsolationForest(max_samples="auto",
                           random_state=11,
                           contamination=0.1,
                           n_estimators=100,
                           n_jobs=-1)
IF_model.fit(X_train_sf, y_train_sf)  # y is accepted but ignored by IsolationForest
y_pred_test = IF_model.predict(X_test_sf)
counts = np.unique(y_pred_test, return_counts=True)
# (array([-1, 1]), array([44914, 4154]))
# Replace 1 with -1 and -1 with 1 so that the rarer label ends up as 1.
if (counts[1][0] < counts[1][1] and counts[0][0] == -1) or \
        (counts[1][0] > counts[1][1] and counts[0][0] == 1):
    y_pred_test = -y_pred_test
Considering the confusion matrix documentation and the problem definition here, the trick above should work, and the right form of the confusion matrix for Fraud/Outlier/Anomaly detection or binary classifiers, based on the literature (Ref.1, Ref.2, Ref.3), is as follows:
+----------------------------+---------------+--------------+
| | Predicted | Predicted |
+----------------------------+---------------+--------------+
| Actual (Positive class)[1] | TP | FN |
| Actual (Negative class)[-1]| FP | TN |
+----------------------------+---------------+--------------+
tn, fp, fn, tp = confusion_matrix(y_test_sf, y_pred_test).ravel()
print("TN: ",tn,"\nFP: ", fp,"\nFN: " ,fn,"\nTP: ", tp)
print("Number of positive class instances: ",tp+fn,"\nNumber of negative class instances: ", tn+fp)
Check the evaluation:
print(classification_report(y_test_sf, y_pred_test, target_names=["Anomaly", "Normal"]))
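For the numbers in the original question, the rates can be checked by hand. This is a minimal sketch that assumes the labels are 0 and 1: confusion_matrix orders rows (actual) and columns (predicted) by sorted label, and roc_curve treats label 1 as the positive class by default, so the second row and column belong to the positive class regardless of which class is more frequent.

```python
# Confusion matrix from the question.
# Rows are actual labels [0, 1]; columns are predicted labels [0, 1].
cm = [[9100, 4320],
      [109007, 320068]]

# With label 1 as the positive class, the cells unpack as TN, FP, FN, TP.
(tn, fp), (fn, tp) = cm

fpr = fp / (fp + tn)  # share of actual negatives predicted positive
tpr = tp / (tp + fn)  # share of actual positives predicted positive

print(f"FPR = {fpr:.7f}, TPR = {tpr:.7f}")
```

These agree with the middle values that roc_curve returned in the question. If the rare class should be the positive one, relabelling it as 1 (or passing pos_label to roc_curve and labels=[1, 0] to confusion_matrix) keeps the two outputs consistent.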

What are some specific examples of Ensemble Learning?

What are some concrete real life examples which can be solved using Boosting/Bagging algorithms? Code snippets would be greatly appreciated.
Ensembles are used to fight overfitting / improve generalization or to fight specific weaknesses / use strength of different classifiers. They can be applied in any classification task.
I used ensembles in my masters thesis. The code is on Github.
Example 1
For example, think of a binary problem where you have to tell if a data point is of class A or B. This could be an image and you have to decide if there is (A) a dog or (B) a cat on it. Now you have two classifiers (1) and (2) (e.g. two neural networks, but trained in different ways; or one SVM and a decision tree, or ...). They make the following errors:
(1): Predicted
T | A B
R ------------
U A | 90% 10%
E B | 50% 50%
(2): Predicted
T | A B
R ------------
U A | 60% 40%
E B | 40% 60%
You could, for example, combine them to an ensemble by first using (1). If it predicts B, then you can use (2). Otherwise you stick with it.
Now, what would be the expected error (falsely assuming both are independent)?
If the true class is A, then we predict with 90% the true result. In 10% of the cases we predict B and use the second classifier. This one gets it right in 60% of the cases. This means if we have A, we predict A in 0.9 + 0.1*0.6 = 0.96 = 96% of the cases.
If the true class is B, we predict in 50% of the cases B. But we also need to get it right the second time, so only in 0.5*0.6 = 0.3 = 30% of the cases we get it right there.
So in this simple example we made the situation better for one class, but worse for the other.
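The arithmetic of Example 1 can be reproduced in a few lines. This is only a sketch of the cascade described above (use classifier (1); if it says B, defer to classifier (2)), under the same independence assumption; the dictionary and function names are made up for illustration.

```python
# P(classifier predicts A | true class), read off the two tables above.
p_a_clf1 = {"A": 0.9, "B": 0.5}
p_a_clf2 = {"A": 0.6, "B": 0.4}


def p_final_a(true_class):
    # The cascade outputs A if (1) says A, or if (1) says B and (2) says A.
    p1 = p_a_clf1[true_class]
    return p1 + (1 - p1) * p_a_clf2[true_class]


acc_a = p_final_a("A")      # 0.9 + 0.1 * 0.6
acc_b = 1 - p_final_a("B")  # B needs both classifiers to say B: 0.5 * 0.6

print(f"accuracy on A: {acc_a:.2f}, accuracy on B: {acc_b:.2f}")
```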
Example 2
Now, let's say we have 3 classifiers with
Predicted
T | A B
R ------------
U A | 60% 40%
E B | 40% 60%
each, but the classifications are independent. What do you get when you make a majority vote?
If you have class A, the probability that at least two say it is class A is
0.6 * 0.6 * 0.6 + 0.6 * 0.6 * 0.4 + 0.6 * 0.4 * 0.6 + 0.4 * 0.6 * 0.6
= 1*0.6^3 + 3*(0.6^2 * 0.4^1)
= (3 nCr 3) * 0.6^3 + (3 nCr 2) * (0.6^2 * 0.4^1)
= 0.648
The same goes for the other class. So we improved the classifier to
Predicted
T | A B
R ------------
U A | 65% 35%
E B | 35% 65%
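The 0.648 figure is a binomial tail probability, so it generalises to any odd number of independent classifiers. A small sketch (the function name is made up, and independence between classifiers is assumed, which real models rarely satisfy):

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent classifiers,
    each correct with probability p, give the right answer (n odd)."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


acc_3 = majority_vote_accuracy(0.6, 3)
acc_5 = majority_vote_accuracy(0.6, 5)
print(f"3 voters: {acc_3:.3f}, 5 voters: {acc_5:.5f}")
```

With 60% individual accuracy, three voters give 0.648 and five give about 0.683, so under independence more voters keep helping.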
Code
See sklearn's page on Ensembles for code.
The most specific example of ensemble learning is the random forest.
Ensembling is the art of combining a diverse set of learners (individual models) to improve the stability and predictive power of the model.
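As a rough idea of what such code looks like, here is a minimal sketch comparing a single decision tree with a bagged and a boosted ensemble from scikit-learn. The synthetic data set and all parameter values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bagging: many trees, each fitted on a bootstrap sample, then averaged.
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    # Boosting: weak learners fitted in sequence on reweighted samples.
    "boosting": AdaBoostClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # test-set accuracy
    print(f"{name}: {scores[name]:.3f}")
```

Which of the three scores best depends on the data, so it is worth comparing them on your own problem with cross-validation.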
Ensemble Learning Techniques:
Bagging : Bagging tries to implement similar learners on small sample populations and then takes a mean of all the predictions. In generalized bagging, you can use different learners on different populations.
Boosting : Boosting is an iterative technique which adjusts the weight of an observation based on the last classification. If an observation was classified incorrectly, it tries to increase the weight of this observation, and vice versa. Boosting in general decreases the bias error and builds strong predictive models.
Stacking : This is a very interesting way of combining models. Here we use a learner to combine the output of different learners. This can lead to a decrease in either bias or variance error, depending on the combining learner we use.
For further reference:
Basics of Ensemble Learning Explained
Here's Python based pseudo code for basic Ensemble Learning:
# Imports assumed for this sketch (Keras functional API, SciPy for the mode).
from typing import List
import numpy
from scipy.stats import mode
from tensorflow.keras.layers import Average
from tensorflow.keras.models import Model

# 3 ML/DL models -> first_model, second_model, third_model
# Assumed to exist already: the three models (built on one shared input
# tensor, model_input), their weight files, and x_test / y_test.
all_models = [first_model, second_model, third_model]
first_model.load_weights(first_weight_file)
second_model.load_weights(second_weight_file)
third_model.load_weights(third_weight_file)

def ensemble_average(models: List[Model]):  # averaging
    outputs = [model.outputs[0] for model in models]
    y = Average()(outputs)
    model = Model(model_input, y, name='ensemble_average')
    pred = model.predict(x_test, batch_size=32)
    pred = numpy.argmax(pred, axis=1)
    # Error rate: fraction of test samples predicted incorrectly.
    E = numpy.sum(numpy.not_equal(pred, y_test)) / y_test.shape[0]
    return E

def ensemble_vote(models: List[Model]):  # max-voting
    pred = []
    yhats = [model.predict(x_test) for model in models]
    yhats = numpy.argmax(yhats, axis=2)  # shape: (n_models, n_samples)
    for i in range(len(x_test)):
        m = mode(yhats[:, i])  # most common class among the models
        pred = numpy.append(pred, m[0])
    E = numpy.sum(numpy.not_equal(pred, y_test)) / y_test.shape[0]
    return E

# Errors calculation
E1 = ensemble_average(all_models)
E2 = ensemble_vote(all_models)

REPEATED in SPSS linear mixed model

I am analyzing data from an experiment.
I have three groups (GROUP, 1 between-subject factor) to compare via a cognitive task.
The task follows a 3-way full factorial design (2x3x3): all subjects are presented with two stimuli (factor1); for each stimulus there are three conditions (factor2), and for each condition three positions on the screen (factor3). For each combination of factors, there are N trials that are averaged to give average accuracy (ACC) and average reaction time (RT).
I want to build a model in SPSS using a linear mixed model.
I tried in SPSS 22 the following syntax:
MIXED ACC BY GROUP FACTOR1 FACTOR2 FACTOR3 GENDER WITH RT Age
/FIXED = GROUP FACTOR1 FACTOR2 FACTOR3 GROUP*FACTOR1 GROUP*FACTOR2 GROUP*FACTOR3 GENDER AGE RT | SSTYPE(3)
/RANDOM= INTERCEPT | SUBJECT(SUBID) COVTYPE(VC)
Given that I have averaged accuracy rates across trials for each combination, should I include a REPEATED statement as well? If so, what is the difference between the following
/REPEATED= FACTOR1 FACTOR2 FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
and the following nomenclature?
/REPEATED= FACTOR1*FACTOR2*FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
In other words, what is the difference between including and omitting the asterisks?
Thanks for your comments,
Alessandro
You have two questions here: (1) a statistical question about what type of analysis is appropriate, and (2) a code question.
(1) Very briefly, if you're going to use linear mixed models, I think you should use all the data, and not average across your N trials within each combination of factors. Those N trials are your repeated measurements.
(2) The IBM KnowledgeCenter page on the REPEATED subcommand states
Specify a list of variable names (of any type) connected by asterisks
(repeated measure) following the REPEATED subcommand.
which suggests that
/REPEATED= FACTOR1 FACTOR2 FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
should be a syntax error. It isn't, so I looked at the Model Information table in the output. For both REPEATED specifications, the Repeated Effects section of that table lists FACTOR1*FACTOR2*FACTOR3 as the effect.
Based on this, it's safe to say that the SPSS syntax parser interprets
/REPEATED= FACTOR1 FACTOR2 FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
to be equivalent to
/REPEATED= FACTOR1*FACTOR2*FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)

How to find clusters in a matrix

I have no clue about data mining or data analysis or statistical analysis but I think what I need is finding "clusters in a matrix". I have a data set of ~20k records and each has ~40 characteristics all of which are either turned on or off.
+--------+------+------+------+------+------+------+
| record | hasA | hasB | hasC | hasD | hasE | hasF |
+--------+------+------+------+------+------+------+
| foo | 1 | 0 | 1 | 0 | 0 | 0 |
| bar | 1 | 1 | 0 | 0 | 1 | 1 |
| baz | 1 | 1 | 1 | 0 | 0 | 0 |
+--------+------+------+------+------+------+------+
I'm quite convinced most of those 20k records have characteristics that fall into one of several categories. There must be means to determine how similar record 'foo' is to record 'bar'.
So, what is it that I'm actually looking at? What algorithm am I looking for?
Transform each record r into a binary vector v(r) so that the i-th component of v(r) is set to 1 if r has the i-th characteristic, and 0 otherwise.
Now run a hierarchical clustering algorithm on this set of vectors under the Hamming distance or Jaccard distance, whichever you think is more appropriate; also make sure there's a notion of distance between clusters defined in terms of the underlying distance (see linkage criteria).
Then decide where to cut the resulting dendrogram based on common sense. Where you cut the dendrogram will affect the number of clusters.
One downside of hierarchical clustering is that it's rather slow. It takes O(n^3) time in general, so it would take quite a while on a large data set. For single- and complete-linkages you can bring the time down to O(n^2).
Hierarchical clustering is very easy to implement in languages such as Python. You can also use the implementation from the scipy library.
Example: Hierarchical Clustering in Python
Here's a code snippet to get you started. I assume S is the set of records transformed into binary vectors (i.e. each list in S corresponds to a record from your data set).
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt

# This is the set of binary vectors, each of which would
# correspond to a record in your case.
S = [
    [0, 0, 0, 1, 1],  # 0
    [0, 0, 0, 0, 1],  # 1
    [0, 0, 0, 1, 0],  # 2
    [1, 1, 1, 0, 0],  # 3
    [1, 0, 1, 0, 0],  # 4
    [0, 1, 1, 0, 0],  # 5
]

# Use Hamming distance with complete linkage.
Z = sch.linkage(pdist(S, metric='hamming'), 'complete')

# Compute and draw the dendrogram.
P = sch.dendrogram(Z)
plt.show()
The result is as you'd expect: cut at 0.5 to get two clusters, one of the first three vectors (which have zeros at the beginning, ones at the end) and the other of the last three vectors (which have ones at the beginning, zeros at the end).
[Image: the resulting dendrogram]
Hierarchical clustering starts with each vector being its own cluster. In each successive step it merges the closest clusters. It repeats this until there is a single cluster left.
The dendrogram essentially encodes the whole clustering process. At the beginning each vector is its own cluster. Then {3} and {5} merge into {3,5} and {0} and {2} merge into {0,2}. Next, {4} and {3,5} merge into {3,4,5}, and {1} and {0,2} merge into {0,1,2}. Finally, {0,1,2} and {3,4,5} merge into {0,1,2,3,4,5}.
From the dendrogram you can usually see at which point it makes the most sense to cut---this will define your clusters.
I encourage you to experiment with various distances (e.g. Hamming distance, Jaccard distance) and linkages (e.g. single linkage, complete linkage), and various representations (e.g. binary vectors).
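Cutting the dendrogram can also be done in code rather than by eye. The sketch below reuses the six example vectors and cuts at a distance of 0.5 with scipy's fcluster; the threshold is the one suggested above for this toy data and would need to be chosen afresh for real records.

```python
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Same six binary vectors as in the example above.
S = [
    [0, 0, 0, 1, 1],  # 0
    [0, 0, 0, 0, 1],  # 1
    [0, 0, 0, 1, 0],  # 2
    [1, 1, 1, 0, 0],  # 3
    [1, 0, 1, 0, 0],  # 4
    [0, 1, 1, 0, 0],  # 5
]

Z = sch.linkage(pdist(S, metric='hamming'), 'complete')

# Assign a cluster label to each vector: vectors end up in the same
# cluster if they were merged at a linkage distance of at most 0.5.
labels = sch.fcluster(Z, t=0.5, criterion='distance')

for index, label in enumerate(labels):
    print(f"vector {index} -> cluster {label}")
```

Vectors 0 to 2 end up in one cluster and vectors 3 to 5 in the other.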
Are you sure you want cluster analysis?
To find similar records you don't need cluster analysis. Simply find similar records with any distance measure such as Jaccard similarity or Hamming distance (both of which are for binary data). Or cosine distance, so that you can use e.g. Lucene to find similar records fast.
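A minimal sketch of such a lookup, using the three records from the question's table and plain Python. The helper names are made up for illustration, and the linear scan is only meant to show the idea; for 20k records you would want an index or a vectorised library.

```python
# The three example records from the question (hasA .. hasF).
records = {
    "foo": [1, 0, 1, 0, 0, 0],
    "bar": [1, 1, 0, 0, 1, 1],
    "baz": [1, 1, 1, 0, 0, 0],
}


def hamming_distance(a, b):
    # Fraction of positions where the two vectors differ.
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jaccard_distance(a, b):
    # 1 - |intersection| / |union| of the characteristics that are on.
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1 - both / either if either else 0.0


def most_similar(name, distance):
    # Linear scan over all other records; returns the closest one.
    others = (other for other in records if other != name)
    return min(others, key=lambda other: distance(records[name], records[other]))


print(most_similar("foo", hamming_distance))
print(most_similar("foo", jaccard_distance))
```

Under both distances, 'baz' is the record closest to 'foo'.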
To find common patterns, the use of frequent itemset mining may yield much more meaningful results, because these can work on a subset of attributes only. For example, in a supermarket, the columns Noodles, Tomato, Basil, Cheese may constitute a frequent pattern.
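To make the idea concrete, here is a brute-force sketch that counts how often each pair of items occurs together and keeps the pairs seen in at least three baskets. The baskets are invented for illustration, and real frequent itemset miners (Apriori, FP-growth) prune the search instead of enumerating every combination as this does.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets (one set of items per customer).
baskets = [
    {"noodles", "tomato", "basil", "cheese"},
    {"noodles", "tomato", "cheese"},
    {"tomato", "basil"},
    {"noodles", "tomato", "basil", "cheese", "milk"},
    {"milk", "bread"},
]


def frequent_itemsets(baskets, size, min_support):
    # Count every item combination of the given size, then keep those
    # that appear in at least min_support baskets.
    counts = Counter()
    for basket in baskets:
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    return {items: n for items, n in counts.items() if n >= min_support}


pairs = frequent_itemsets(baskets, size=2, min_support=3)
for items, n in sorted(pairs.items()):
    print(items, n)
```

Note that a pattern only needs to hold on a subset of the columns, which is what distinguishes this from clustering whole records.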
Most clustering algorithms attempt to divide the data into k groups. While this at first appears a good idea (get k target groups) it rarely matches what real data contains. For example customers: why would every customer belong to exactly one audience? What if the audiences are e.g. car lovers, gun lovers, football lovers, soccer moms - are you sure you don't want to allow overlap of these groups?
Furthermore, a problem with cluster analysis is that it's incredibly easy to use badly. It does not "fail hard" - you always get a result, and you might not realize that it's a bad result...
