I tried to classification problem for fun with the scikit-learn library. I got 10000x10 dimension data, and I found very weird phenomenon (for me).
pca = PCA(n_components = 2)
ss = StandardScaler()
X = pca.fit_transform(X) # explained_variance_ratio_ = 0.8
X = ss.fit_transform(X)
in this case, i got a wonderfull explained_variance_ratio_ almost 99%. but when I apply scaling first, suddely PCA's performence is dropped drastically and explained_variance_ratio decreased to 20%.
pca = PCA(n_components = 2)
ss = StandardScaler()
X = ss.fit_transform(X)
X = pca.fit_transform(X) # explained_variance_ratio_ = 0.2
What makes this difference? Standard Scaler is just rescaling process, so I suppose no information loss. Can I apply the PCA before for visualizing conveniency? Or I must select Standardization for mathematical insurance?
Suppose, you have two features A and B that measure distance and both are in metres. Feature A has a greater range of numbers in it (suppose, 1 - 1000) as compared to a Feature B , which has a range( suppose, 1-10).
Then, the feature A will capture greater variance in the data as compared to B, and hence it is not a good idea to scale the features in this case .
But if , the features are having two different units,(say, kg and metre), then it will be wise to scale the features.
P.S: PCA preserves those components along which there is max. variance.
Related
my anomaly detection algorithm gave me an array of predictions where all the values greater than 0 should be of the positive class (= 0) and all the other should be classified as anomalies (= 1). I built my classifier as well: (I have three datasets, the one with only non-anomaly values and the other with all anomaly values):
normal = np.load('normal_score.pkl')
anom_1 = np.load('anom1_score.pkl')
anom2_ = np.load('anom2_score.pkl')
y_normal = np.asarray([0]*len(normal)) # I know they are normal
y_anom_1 = np.asarray([1]*len(anom_1)) # I know they are anomaly
y_anom_2 = np.asarray([1]*len(anom_2)) # I know they are anomaly
score = np.concatenate([normal, anom_1, anom_2])
y = np.concatenate([y_normal, y_anom_1, y_anom_2])
auc = roc_auc_score(y, score)
fpr, tpr, thresholds = roc_curve(y, score)
display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc)
The AUC score I get is 0.02 and the plot looks like:
From what I understood this result is great because I should just reverse the labels to make it almost 0.98, but my question is: is there a way to specify it and automatically reverse it through a function?
The values in my normal score data are all in the range (21;57) and the anomalies values are in the range (-1090; -1836) so it should be easy to spot them.
"I should just reverse the labels to make it almost 0.98"
That's not how it should be done. It is because if you can predict "normal", let's say, with 95% confidence, you can not infer from this that you can also predict "anomaly" with the same confidence.
It becomes crucial in case of heavily imbalanced data which is probably the case here.
You should define which of these two you want to predict with high confidence and what are the target prediction metrics. For example, if you have a target on the precision and recall for predicting the "anomaly" then that should be your class "1" and calculate the metrics accordingly, and vice versa.
When we train neural networks, we typically use gradient descent, which relies on a continuous, differentiable real-valued cost function. The final cost function might, for example, take the mean squared error. Or put another way, gradient descent implicitly assumes the end goal is regression - to minimize a real-valued error measure.
Sometimes what we want a neural network to do is perform classification - given an input, classify it into two or more discrete categories. In this case, the end goal the user cares about is classification accuracy - the percentage of cases classified correctly.
But when we are using a neural network for classification, though our goal is classification accuracy, that is not what the neural network is trying to optimize. The neural network is still trying to optimize the real-valued cost function. Sometimes these point in the same direction, but sometimes they don't. In particular, I've been running into cases where a neural network trained to correctly minimize the cost function, has a classification accuracy worse than a simple hand-coded threshold comparison.
I've boiled this down to a minimal test case using TensorFlow. It sets up a perceptron (neural network with no hidden layers), trains it on an absolutely minimal dataset (one input variable, one binary output variable) assesses the classification accuracy of the result, then compares it to the classification accuracy of a simple hand-coded threshold comparison; the results are 60% and 80% respectively. Intuitively, this is because a single outlier with a large input value, generates a correspondingly large output value, so the way to minimize the cost function is to try extra hard to accommodate that one case, in the process misclassifying two more ordinary cases. The perceptron is correctly doing what it was told to do; it's just that this does not match what we actually want of a classifier. But the classification accuracy is not a continuous differentiable function, so we can't use it as the target for gradient descent.
How can we train a neural network so that it ends up maximizing classification accuracy?
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
tf.set_random_seed(1)
# Parameters
epochs = 10000
learning_rate = 0.01
# Data
train_X = [
[0],
[0],
[2],
[2],
[9],
]
train_Y = [
0,
0,
1,
1,
0,
]
rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]
# Inputs and outputs
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)
# Weights
W = tf.Variable(tf.random_normal([cols]))
b = tf.Variable(tf.random_normal([]))
# Model
pred = tf.tensordot(X, W, 1) + b
cost = tf.reduce_sum((pred-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()
# Train
for epoch in range(epochs):
# Print update at successive doublings of time
if epoch&(epoch-1) == 0 or epoch == epochs-1:
print('{} {} {} {}'.format(
epoch,
cost.eval({X: train_X, Y: train_Y}),
W.eval(),
b.eval(),
))
optimizer.run({X: train_X, Y: train_Y})
# Classification accuracy of perceptron
classifications = [pred.eval({X: x}) > 0.5 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = perceptron accuracy'.format(correct, rows))
# Classification accuracy of hand-coded threshold comparison
classifications = [x[0] > 1.0 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = threshold accuracy'.format(correct, rows))
How can we train a neural network so that it ends up maximizing classification accuracy?
I'm asking for a way to get a continuous proxy function that's closer to the accuracy
To start with, the loss function used today for classification tasks in (deep) neural nets was not invented with them, but it goes back several decades, and it actually comes from the early days of logistic regression. Here is the equation for the simple case of binary classification:
The idea behind it was exactly to come up with a continuous & differentiable function, so that we would be able to exploit the (vast, and still expanding) arsenal of convex optimization for classification problems.
It is safe to say that the above loss function is the best we have so far, given the desired mathematical constraints mentioned above.
Should we consider this problem (i.e. better approximating the accuracy) solved and finished? At least in principle, no. I am old enough to remember an era when the only activation functions practically available were tanh and sigmoid; then came ReLU and gave a real boost to the field. Similarly, someone may eventually come up with a better loss function, but arguably this is going to happen in a research paper, and not as an answer to a SO question...
That said, the very fact that the current loss function comes from very elementary considerations of probability and information theory (fields that, in sharp contrast with the current field of deep learning, stand upon firm theoretical foundations) creates at least some doubt as to if a better proposal for the loss may be just around the corner.
There is another subtle point on the relation between loss and accuracy, which makes the latter something qualitatively different than the former, and is frequently lost in such discussions. Let me elaborate a little...
All the classifiers related to this discussion (i.e. neural nets, logistic regression etc) are probabilistic ones; that is, they do not return hard class memberships (0/1) but class probabilities (continuous real numbers in [0, 1]).
Limiting the discussion for simplicity to the binary case, when converting a class probability to a (hard) class membership, we are implicitly involving a threshold, usually equal to 0.5, such as if p[i] > 0.5, then class[i] = "1". Now, we can find many cases whet this naive default choice of threshold will not work (heavily imbalanced datasets are the first to come to mind), and we'll have to choose a different one. But the important point for our discussion here is that this threshold selection, while being of central importance to the accuracy, is completely external to the mathematical optimization problem of minimizing the loss, and serves as a further "insulation layer" between them, compromising the simplistic view that loss is just a proxy for accuracy (it is not). As nicely put in the answer of this Cross Validated thread:
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
Enlarging somewhat an already broad discussion: Can we possibly move completely away from the (very) limiting constraint of mathematical optimization of continuous & differentiable functions? In other words, can we do away with back-propagation and gradient descend?
Well, we are actually doing so already, at least in the sub-field of reinforcement learning: 2017 was the year when new research from OpenAI on something called Evolution Strategies made headlines. And as an extra bonus, here is an ultra-fresh (Dec 2017) paper by Uber on the subject, again generating much enthusiasm in the community.
I think you are forgetting to pass your output through a simgoid. Fixed below:
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
tf.set_random_seed(1)
# Parameters
epochs = 10000
learning_rate = 0.01
# Data
train_X = [
[0],
[0],
[2],
[2],
[9],
]
train_Y = [
0,
0,
1,
1,
0,
]
rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]
# Inputs and outputs
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)
# Weights
W = tf.Variable(tf.random_normal([cols]))
b = tf.Variable(tf.random_normal([]))
# Model
# CHANGE HERE: Remember, you need an activation function!
pred = tf.nn.sigmoid(tf.tensordot(X, W, 1) + b)
cost = tf.reduce_sum((pred-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()
# Train
for epoch in range(epochs):
# Print update at successive doublings of time
if epoch&(epoch-1) == 0 or epoch == epochs-1:
print('{} {} {} {}'.format(
epoch,
cost.eval({X: train_X, Y: train_Y}),
W.eval(),
b.eval(),
))
optimizer.run({X: train_X, Y: train_Y})
# Classification accuracy of perceptron
classifications = [pred.eval({X: x}) > 0.5 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = perceptron accuracy'.format(correct, rows))
# Classification accuracy of hand-coded threshold comparison
classifications = [x[0] > 1.0 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = threshold accuracy'.format(correct, rows))
The output:
0 0.28319069743156433 [ 0.75648874] -0.9745011329650879
1 0.28302448987960815 [ 0.75775659] -0.9742625951766968
2 0.28285878896713257 [ 0.75902224] -0.9740257859230042
4 0.28252947330474854 [ 0.76154679] -0.97355717420578
8 0.28187844157218933 [ 0.76656926] -0.9726400971412659
16 0.28060704469680786 [ 0.77650583] -0.970885694026947
32 0.27818527817726135 [ 0.79593837] -0.9676888585090637
64 0.2738055884838104 [ 0.83302218] -0.9624817967414856
128 0.26666420698165894 [ 0.90031379] -0.9562843441963196
256 0.25691407918930054 [ 1.01172411] -0.9567816257476807
512 0.2461051195859909 [ 1.17413962] -0.9872989654541016
1024 0.23519910871982574 [ 1.38549554] -1.088881492614746
2048 0.2241383194923401 [ 1.64616168] -1.298340916633606
4096 0.21433120965957642 [ 1.95981205] -1.6126530170440674
8192 0.2075471431016922 [ 2.31746769] -1.989408016204834
9999 0.20618653297424316 [ 2.42539024] -2.1028473377227783
4/5 = perceptron accuracy
4/5 = threshold accuracy
I know that a decision tree doesn't get affected by scaling the data but when I scale the data within my decision tree it gives me a bad performance (bad recall, precision and accuracy)
But when I don't scale all the performance metrics the decision tree gives me an amazing result. How can this be?
Note: I use GridSearchCV but I don't think that the cross validation is the reason for my problem. Here is my code:
scaled = MinMaxScaler()
pca = PCA()
bestK = SelectKBest()
combined_transformers = FeatureUnion([ ("scale",scaled),("best", bestK),
("pca", pca)])
clf = tree.DecisionTreeClassifier(class_weight= "balanced")
pipeline = Pipeline([("features", combined_transformers), ("tree", clf)])
param_grid = dict(features__pca__n_components=[1, 2,3],
features__best__k=[1, 2,3],
tree__min_samples_split=[4,5],
tree__max_depth= [4,5],
)
grid_search = GridSearchCV(pipeline, param_grid=param_grid,scoring='f1')
grid_search.fit(features,labels)
With the scale function MinMaxScaler() my performance is:
f1 = 0.837209302326
recall = 1.0
precision = 0.72
accuracy = 0.948148148148
But without scaling:
f1 = 0.918918918919
recall = 0.944444444444
precision = 0.894736842105
accuracy = 0.977777777778
I am not familiar with scikit-learn, so excuse me if I misunderstand something.
First of all, does PCA standardize features? If it does not, it will give different results for scaled and non-scaled input.
Second, due to the randomness in splitting the samples, CV may give different results on each run. This will affect the results especially for small sample size. In addition, in case you have small sample size, the results may not be that different after all.
I have the following suggestions:
Scaling can be treated as an additional hyperparameter, which can be optimized by CV.
Perform an extra CV (called nested CV) or hold-out to estimate performance. This is done by keeping a test set, selecting your model using CV on the training data and then evaluate its performance on the test set (in case of nested CV you do this repeatedly for all folds and average the performance estimates). Of course, your final model should be trained on the whole dataset. In general, you should not use the performance estimate of the CV used for model selection, as it will be overly optimistic.
As part of my assignment, I am working on couple of datasets, and finding their training errors with linear Regression. I was wondering whether the standardization has any effect on the training error or not? My correlation, and RMSE is coming out to be equal for datasets before and after the standardization.
Thanks,
It is easy to show that for linear regression it does not matter if you just transform input data through scaling (by a; the same applies for translation, meaning that any transformation of the form X' = aX + b for real a != 0,b have the same property).
X' = aX
w = (X^TX)X^Ty
w' = (aX^TaX)^-1 aX^Ty
w' = 1/a w
Thus
X^Tw = 1/a aX^T w = aX^T 1/a w = X'^Tw'^T
Consequently the projection, where the error is computed is exactly the same before and after scaling, so any type of loss function (independent on x) yields the exact same results.
However, if you scale output variable, then errors will change. Furthermore, if you standarize your dataset in more complex way then by just multiplying by a number (for example - by whitening or by nearly any rotation) then your results will depend on the preprocessing. If you use regularized linear regression (ridge regression) then even scaling the input data by a constant matters (as it changes the "meaning" of regularization parameter).
When doing regression or classification, what is the correct (or better) way to preprocess the data?
Normalize the data -> PCA -> training
PCA -> normalize PCA output -> training
Normalize the data -> PCA -> normalize PCA output -> training
Which of the above is more correct, or is the "standardized" way to preprocess the data? By "normalize" I mean either standardization, linear scaling or some other techniques.
You should normalize the data before doing PCA. For example, consider the following situation. I create a data set X with a known correlation matrix C:
>> C = [1 0.5; 0.5 1];
>> A = chol(rho);
>> X = randn(100,2) * A;
If I now perform PCA, I correctly find that the principal components (the rows of the weights vector) are oriented at an angle to the coordinate axes:
>> wts=pca(X)
wts =
0.6659 0.7461
-0.7461 0.6659
If I now scale the first feature of the data set by 100, intuitively we think that the principal components shouldn't change:
>> Y = X;
>> Y(:,1) = 100 * Y(:,1);
However, we now find that the principal components are aligned with the coordinate axes:
>> wts=pca(Y)
wts =
1.0000 0.0056
-0.0056 1.0000
To resolve this, there are two options. First, I could rescale the data:
>> Ynorm = bsxfun(#rdivide,Y,std(Y))
(The weird bsxfun notation is used to do vector-matrix arithmetic in Matlab - all I'm doing is subtracting the mean and dividing by the standard deviation of each feature).
We now get sensible results from PCA:
>> wts = pca(Ynorm)
wts =
-0.7125 -0.7016
0.7016 -0.7125
They're slightly different to the PCA on the original data because we've now guaranteed that our features have unit standard deviation, which wasn't the case originally.
The other option is to perform PCA using the correlation matrix of the data, instead of the outer product:
>> wts = pca(Y,'corr')
wts =
0.7071 0.7071
-0.7071 0.7071
In fact this is completely equivalent to standardizing the data by subtracting the mean and then dividing by the standard deviation. It's just more convenient. In my opinion you should always do this unless you have a good reason not to (e.g. if you want to pick up differences in the variation of each feature).
You need to normalize the data first always. Otherwise, PCA or other techniques that are used to reduce dimensions will give different results.
Normalize the data at first. Actually some R packages, useful to perform PCA analysis, normalize data automatically before performing PCA.
If the variables have different units or describe different characteristics, it is mandatory to normalize.
the answer is the 3rd option as after doing pca we have to normalize the pca output as the whole data will have completely different standard. we have to normalize the dataset before and after PCA as it will more accuarate.
I got another reason in PCA objective function.
May you see detail in this link
enter link description here
Assuming the X matrix has been normalized before PCA.