Need a way in Pandas to perform a robust standard deviation

I need pandas to calculate a robust standard deviation
I'm performing outlier analysis on electrical measurements in Python and refactoring the code into a pandas environment. An issue I have is calculating the standard deviation: if outliers are present in the population when calculating std, the resulting value is inflated by those outliers. In my original Python code, I wrote robust mean and standard deviation functions to get back to a more normal population before calculating the outlier limits. Note that I also use this normalized population to calculate skewness and kurtosis, because they are highly affected by outliers.
What I've been looking at is trimming the population at the 95% quantile of the data set and calculating the outlier limits from there. Does anyone know whether anyone else in the pandas community has worked on robust statistics functions? If not, I'll forge ahead.
df["#18.1355"].describe()
count 2694.000000
mean 1.808318
std 6.426645
min 0.920686
25% 1.357991
50% 1.521781
75% 1.801604
max 334.196900
Name: #18.1355, dtype: float64
Note the far outlier in the max value.
The standard deviation of a normalized population for the above measurement is ~0.8
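For reference, a minimal pandas sketch of the quantile-trimming approach described above, assuming the df from the describe() output (the variable names are mine):
s = df["#18.1355"]
# keep only the values at or below the 95% quantile, then compute the std of the remainder
trimmed_std = s[s <= s.quantile(0.95)].std()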

This answer is not specific to pandas, but have you considered using the biweight midvariance? (See http://docs.astropy.org/en/stable/api/astropy.stats.biweight_midvariance.html for an example implementation.)
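For example, a minimal sketch using the astropy function documented at that link, assuming the same df as above (the biweight midvariance is a robust variance estimate, so its square root plays the role of a robust standard deviation):
import numpy as np
from astropy.stats import biweight_midvariance
x = df["#18.1355"].to_numpy()
# square root of the robust variance estimate, as a robust analogue of std
robust_std = np.sqrt(biweight_midvariance(x))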

Related

How to replace the anomalous data in time-series analysis?

I applied an isolation forest algorithm to identify the anomalous data in my time series. Now I want to replace those outliers before feeding the data into a machine learning model. How can we replace outliers in time series analysis?
It actually depends on the kind of data and on what you want to do. Consider two time-dependent scenarios.
Predicting a target variable from sensor measurements.
In this case, you can discard or replace the sensor transmission errors to create a cleaner dataset for the other algorithms to use.
Fraud detection.
In this case, you want to detect the pattern of how the anomaly arises, so you cannot drop or replace the outlier, because you are analyzing the outlier itself.
The forecast package in R provides tsclean(). The tsclean() function fits a robust trend using loess (for non-seasonal series), or a robust trend and seasonal components using STL (for seasonal series).
For non-seasonal time series, outliers are replaced by linear interpolation. For seasonal time series, the seasonal component from the STL fit is removed and the seasonally adjusted series is linearly interpolated to replace the outliers, before adding back the trend and seasonal components to the result.
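In Python, a comparable effect for a non-seasonal series can be sketched with pandas and scikit-learn: flag the anomalies, set them to NaN, and linearly interpolate (a rough sketch; the series name y and the contamination value are hypothetical):
from sklearn.ensemble import IsolationForest
# y is a pandas Series holding the time series (hypothetical name)
iso = IsolationForest(contamination=0.01, random_state=0)
flags = iso.fit_predict(y.to_frame())           # -1 marks anomalies
cleaned = y.mask(flags == -1)                   # set the outliers to NaN
cleaned = cleaned.interpolate(method="linear")  # replace them by linear interpolation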

Feature engineering gaussian distributed input

I am designing an NN classifier where most of the input features are estimates of gaussian distributions, i.e. each such feature has a mu and a sigma value.
The classifier has about 30 input features, or 60 if you count each mu and sigma as its own feature.
The number of outputs is 15, i.e. there are 15 possible classifications.
I have about 50k examples to use for training/verification.
I can think of a few different scenarios of how to transform these features into something useful but I am not clever enough to come to any conclusions on how they would impact my results.
The first scenario is to just scale and blindly pass each mu and sigma individually. I don't really see how sigma would help the classifier in this case, since it's just a measure of uncertainty. Ideally this would lead to slightly "fuzzier" classifications, which could possibly be used to estimate some certainty metric for a classification result.
The second scenario is to generate more training cases by drawing a value from the gaussian of each of the 30 input features and then normalizing these random values. This would give me more training data, which could be useful.
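A minimal numpy sketch of that second scenario (the array names and shapes are my assumption: mu and sigma of shape (n_examples, 30) holding the per-feature estimates):
import numpy as np
rng = np.random.default_rng(0)
# draw one synthetic example per original example by sampling each feature's gaussian
synthetic = rng.normal(loc=mu, scale=sigma)
# standardize the synthetic features before feeding them to the classifier
synthetic = (synthetic - synthetic.mean(axis=0)) / synthetic.std(axis=0)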
As a side note, I have the possibility of getting more data (about 50k more examples), but I am not sure how accurate that data is, so I would like to try with this smaller set first to see if it converges.
The question is: Is there any consensus or interesting paper in the community, describing how to deal with estimated uncertainty in input features?
Thanks!
P.S. Sorry for my bad wording, ML is not my professional domain nor is English my native language.

Getting a low ROC AUC score but a high accuracy

I'm using the LogisticRegression class in scikit-learn on a version of the flight delay dataset.
I use pandas to select some columns:
df = df[["MONTH", "DAY_OF_MONTH", "DAY_OF_WEEK", "ORIGIN", "DEST", "CRS_DEP_TIME", "ARR_DEL15"]]
I fill in NaN values with 0:
df = df.fillna({'ARR_DEL15': 0})
Make sure the categorical columns are marked with the 'category' data type:
df["ORIGIN"] = df["ORIGIN"].astype('category')
df["DEST"] = df["DEST"].astype('category')
Then call get_dummies() from pandas:
df = pd.get_dummies(df)
Now I train and test my data set:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
lr = LogisticRegression()
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
train_set_x = train_set.drop('ARR_DEL15', axis=1)
train_set_y = train_set["ARR_DEL15"]
test_set_x = test_set.drop('ARR_DEL15', axis=1)
test_set_y = test_set["ARR_DEL15"]
lr.fit(train_set_x, train_set_y)
Once I call the score method I get around 0.867. However, when I call the roc_auc_score method I get a much lower number, around 0.583:
from sklearn.metrics import roc_auc_score
probabilities = lr.predict_proba(test_set_x)
roc_auc_score(test_set_y, probabilities[:, 1])
Is there any reason why the ROC AUC is much lower than what the score method provides?
To start with, saying that an AUC of 0.583 is "lower" than a score* of 0.867 is exactly like comparing apples with oranges.
[* I assume your score is mean accuracy, but this is not critical for this discussion - it could be anything else in principle]
According to my experience at least, most ML practitioners think that the AUC measures something different from what it actually does: the common (and unfortunate) practice is to use it just like any other higher-is-better metric, such as accuracy, which may naturally lead to puzzles like the one you describe yourself.
The truth is that, roughly speaking, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds.
The (decision) threshold in binary classification is the value above which we decide to label a sample as 1 (recall that probabilistic classifiers actually return a value p in [0, 1], usually interpreted as a probability - in scikit-learn it is what predict_proba returns).
Now, this threshold, in methods like scikit-learn's predict which return labels (1/0), is set to 0.5 by default, but this is not the only possibility, and it may not even be desirable in some cases (imbalanced data, for example).
The point to take home is that:
when you ask for score (which under the hood uses predict, i.e. labels and not probabilities), you have also implicitly set this threshold to 0.5
when you ask for AUC (which, in contrast, uses probabilities returned with predict_proba), no threshold is involved, and you get (something like) the accuracy averaged across all possible thresholds
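As a minimal sketch of the difference, reusing lr, test_set_x, and test_set_y from the question:
from sklearn.metrics import accuracy_score, roc_auc_score
probs = lr.predict_proba(test_set_x)[:, 1]
# predict() is equivalent to thresholding the probabilities at 0.5,
# so this accuracy matches lr.score(test_set_x, test_set_y)
acc_default = accuracy_score(test_set_y, (probs >= 0.5).astype(int))
# a different threshold gives a different accuracy...
acc_strict = accuracy_score(test_set_y, (probs >= 0.8).astype(int))
# ...whereas the AUC involves no threshold at all
auc = roc_auc_score(test_set_y, probs)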
Given these clarifications, your particular example provides a very interesting case in point:
I get a good-enough accuracy ~ 87% with my model; should I care that, according to an AUC of 0.58, my classifier does only slightly better than mere random guessing?
Provided that the class representation in your data is reasonably balanced, the answer should by now hopefully be obvious: no, you should not care. For all practical purposes, what you care about is a classifier deployed with a specific threshold, and what this classifier does in the purely theoretical and abstract situation of being averaged across all possible thresholds should be of very little interest to a practitioner (it does interest a researcher coming up with a new algorithm, but I assume that this is not your case).
(For imbalanced data, the argument changes; accuracy here is practically useless, and you should consider precision, recall, and the confusion matrix instead).
For this reason, AUC has started receiving serious criticism in the literature (don't misread this - the analysis of the ROC curve itself is highly informative and useful); the Wikipedia entry and the references provided therein are highly recommended reading:
Thus, the practical value of the AUC measure has been called into question, raising the possibility that the AUC may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution.
[...]
One recent explanation of the problem with ROC AUC is that reducing the ROC Curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system
Emphasis mine - see also On the dangers of AUC...
I don't know what exactly ARR_DEL15 is, which you use as your label (it is not in the original data). My guess is that it is an imbalanced feature, i.e. there are many more 0's than 1's; in such a case, accuracy as a metric is not meaningful, and you should use precision, recall, and the confusion matrix instead (see also this thread).
Just as an extreme example, if 87% of your labels are 0's, you can have an 87% accuracy "classifier" simply (and naively) by classifying all samples as 0; in such a case, you would also have a low AUC (fairly close to 0.5, as in your case).
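To make the extreme example concrete, a sketch using scikit-learn's majority-class baseline (reusing the train/test split from the question):
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(train_set_x, train_set_y)
# accuracy equals the share of the majority class in the test set...
print(dummy.score(test_set_x, test_set_y))
# ...while the AUC of a constant predictor is exactly 0.5
print(roc_auc_score(test_set_y, dummy.predict_proba(test_set_x)[:, 1]))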
For a more general (and much needed, in my opinion) discussion of what exactly AUC is, see my other answer.

Suggestions to improve my normalized accuracy with libsvm

I have a problem when I try to classify my data using libsvm. My training and test data are highly unbalanced. When I do the grid search for the SVM parameters and train my data with class weights, the testing gives an accuracy of 96.8113%. But because the test data is unbalanced, all the correctly predicted values are from the negative class, which is larger than the positive class.
I have tried a lot of things, from changing the weights to changing the gamma and cost values, but my normalized accuracy (which takes into account both the positive and the negative classes) gets lower with each try. Training with 50% positives and 50% negatives using the default grid.py parameters, I get a very low accuracy (18.4234%).
I want to know whether the problem is in my description (how I build the feature vectors), in the unbalancing (should I use balanced data in a different way?), or whether I should change my classifier.
Better data always helps.
I think that imbalance is part of the problem. But a more significant part of the problem is how you're evaluating your classifier. Evaluating accuracy given the distribution of positives and negatives in your data is pretty much useless. So is training on 50% and 50% and testing on data that is distributed 99% vs 1%.
There are problems in real life that resemble the one you're studying (a great imbalance of positives to negatives). Let me give you two examples:
Information retrieval: given all documents in a huge collection, return the subset that is relevant to the search term q.
Face detection: given a large image, mark all locations where there are human faces.
Many approaches to these types of systems are classifier-based. To compare classifiers, a few tools are commonly used: ROC curves, precision-recall curves, and the F-score. These tools give a more principled way to decide when one classifier is working better than another.
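For illustration, a minimal scikit-learn sketch of those evaluation tools (y_true and y_score are hypothetical placeholders for your labels and your classifier's decision values):
from sklearn.metrics import precision_recall_curve, roc_curve, f1_score, auc
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
pr_auc = auc(recall, precision)     # area under the precision-recall curve
f1 = f1_score(y_true, y_score > 0)  # F-score at a chosen decision threshold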

Why should we perform cosine normalization for SVM feature vectors?

I was recently playing around with the well known movie review dataset used in binary sentiment analysis. It consists of 1,000 positive and 1,000 negative reviews. While exploring various feature-encodings with unigram features, I noticed that all previous research publications normalize the vectors by their Euclidean norm in order to scale them to unit-length.
In my experiments using Liblinear, however, I found that such length-normalization decreases the classification accuracy significantly. I studied the vectors, and I think this is the reason: the dimension of the vector space is, say, 10,000. As a result, the Euclidean norm of the vectors is very high compared to the individual projections. Therefore, after normalization, all the vectors get very small numbers on each axis (i.e., the projection on an axis).
This surprised me, because all publications in this field claim that they perform cosine normalization, whereas I found that NOT normalizing yields better classification.
Thus my question: is there any specific disadvantage if we don't perform cosine normalization for SVM feature vectors? (Basically, I am seeking a mathematical explanation for this need for normalization).
After perusing the LibSVM manual, I realized why normalization was yielding much lower accuracy compared to not normalizing: they recommend scaling the data to the [0,1] or [-1,1] interval, which is something I had not done. Scaling up resolves the issue of having too many data points very close to zero, while retaining the advantages of length-normalization.
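A minimal scikit-learn sketch of that combination, assuming a feature matrix X that is not from the original post:
from sklearn.preprocessing import MinMaxScaler, Normalizer
# scale every feature to [0, 1], as the LibSVM guide recommends...
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
# ...and then length-normalize each sample to unit Euclidean norm
X_unit = Normalizer(norm="l2").fit_transform(X_scaled)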
