Cross-Validation Performance Aggregation with Undefined Values - machine-learning

I am wondering what the correct way is to calculate average performance over several folds in cross-validation.
For example, I have 5 folds of F1 with values
[0.5 0.3 0.25 null 0.7]
What is the average F1 of this system?
I could treat the null as 0, or simply output null as the average result.
Alternatively, I could take only the four defined values and divide by 4, but that is not correct either: a system that scored 0.1 on that fold would then end up with a poorer average than the one with null, even though 0.1 is much better than null.

It really depends on the context. (In the following I'm including references to numpy, for future reference, for those using it.)
If the null occurred because the CV fold was somehow undefined for the problem, then you could ignore it (e.g., by calling np.nanmean). Presumably, in "real life", you just wouldn't have a dataset equivalent to such a fold.
If the null occurred because the predictor utterly failed for this fold, then the result could either be (it's a matter of your interpretation):
nan, because the overall predictor behavior is undefined (in this case, you might just use np.mean).
The average with the worst possible value substituted (0 for the F1 score), if, upon spotting that the predictor is malfunctioning on a given set, you would output just some arbitrary result (in this case, you might use np.nan_to_num).
By far, the best thing you could do is figure out the reason for this value, and then eliminate it. This should ideally just never happen, and probably should be considered a bug; before solving the bug, just consider your estimator unsuitable for performance estimation.
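For completeness, here is a minimal numpy sketch of the three aggregation options above, using the scores from the question (the choice of np.nan to represent the null is just for illustration):
import numpy as np

scores = np.array([0.5, 0.3, 0.25, np.nan, 0.7])

print(np.nanmean(scores))               # ignore the undefined fold -> 0.4375
print(np.mean(scores))                  # propagate the failure -> nan
print(np.mean(np.nan_to_num(scores)))   # treat the failure as the worst case (0) -> 0.35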

Related

Force a neural network to have 0-sum outputs

I have a PyTorch neural net with an n-dimensional output which I want to have zero sum during training (my training data, i.e. the true outputs, have zero sum). Of course I could just add a line computing the sum s and then subtracting s/n from each element of the output. But this way, the network would be driven even less to actually find outputs with zero sum, as this would get taken care of anyway (I've been getting worse test results with this approach). Also, since the true outputs in the training data have zero sum, the network obviously converges to having almost zero-sum outputs, but not quite. Hence, I was wondering whether there is a smart way to force the network to have outputs that sum to 0, without just brute-force subtracting the sum at the end (which would undermine learning to output zero sums)? I.e. some sort of solution directly incorporated in the network? (Probably there isn't, at least I couldn't think of any...)
Your approach of explicitly subtracting the mean is the correct way. In the same way we use softmax to nicely parametrise distributions, you could complain that "this makes the network not learn about probability even more!", but in fact it does, it simply does so in its own, unnormalised space. The same holds in your case: by subtracting the mean you make sure that you match the target variable while allowing your network to focus on the hard problems, and not waste its compute on having to learn that the sum is zero. If you do anything else, your network will literally have to learn to compute the mean somewhere and subtract it. There are some potential corner cases where a deep representational reason for the mean to be zero could be argued for, but these cases are rare enough that the chances of this actually happening "magically" in the network are zero (and if you knew it was happening, there would be better ways of targeting it than by enforcing the zero sum at the output).
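As an illustration only (the class name and layer sizes are made up), the mean subtraction can be baked into the model itself so that every output vector sums to exactly zero:
import torch
import torch.nn as nn

class ZeroSumHead(nn.Module):
    # wraps any backbone and subtracts the per-sample mean of its output,
    # so each output vector sums to exactly zero
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone

    def forward(self, x):
        out = self.backbone(x)                       # shape: (batch, n)
        return out - out.mean(dim=-1, keepdim=True)  # enforce zero sum per sample

# usage: model = ZeroSumHead(nn.Linear(16, 8))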
What happens if you add an explicit loss?
# assumes model, criterion, optim, input, target and weight are defined elsewhere
optim.zero_grad()
pred = model(input)
original_loss = criterion(pred, target)
# add this auxiliary loss: the squared mean of the outputs,
# which drives the mean of the outputs (and hence their sum) towards zero
zero_sum_loss = pred.mean() ** 2
loss = original_loss + weight * zero_sum_loss
loss.backward()
optim.step()
# ...

Sigmoid output - can it be interpreted as probability?

The sigmoid function outputs a number between 0 and 1. Is this a probability, or is it merely a 'yes or no' depending on whether it's above or below 0.5?
Minimal example:
Cats vs dogs binary classification. 0 is cat, 1 is dog.
Can I perform the following interpretation of the sigmoid output values:
0.9 - it's most certainly a dog
0.52 - it's more likely to be a dog than a cat, but still quite unsure
0.5 - completely undecided, could be either a cat or a dog
0.48 - it's more likely to be a cat than a dog, but still quite unsure
0.1 - it's most certainly a cat
Or would this be the right way to interpret the results:
0.9 - it's a dog
0.52 - it's a dog
0.5 - completely undecided, could be either a cat or a dog
0.48 - it's a cat
0.1 - it's a cat
Note how in the first case we utilise the numeric value to also express probabilities, while in the second case we completely ignore the probability interpretation and collapse the answers to binary. Which is correct? Can you explain why?
Background context, feel free to skip this:
I've found a number of sources that suggest that yes, sigmoid output can be interpreted as probability:
Source yes 1 - (...) sigmoid(z) will yield a value (a probability) between 0 and 1.
Source yes 2 - The "output" must come from a function that satisfies the properties of a distribution function in order for us to interpret it as probabilities. (...) The "sigmoid function" satisfies these properties.
Source yes 3 - tf.sigmoid(logits) gives you the probabilities.
And a number of sources that suggest contrary, that sigmoid output cannot be interpreted as probabilities:
Source no 1 - (...) the raw values cannot necessarily be interpreted as raw probabilities!
Source no 2 - Sigmoid (...) is not a probability distribution function
Source no (and also yes) 3 - the short answer is no, however, depending on the loss you use, it may be closer to truth than you may think.
(bonus questions, answer to win a car!) Why are there so many contradicting answers? What do these answers differ in? I find it unlikely that it's just a lot of people being completely wrong about it - I'm thinking they're just talking about different cases or some different fundamental assumptions. What's the difference that I'm missing?
I know I can just use a softmax. I also know that sigmoid can be used for non-exclusive multi-class classification (Source multi 1, Source multi 2, Source multi 3) - although even then it's unclear whether such multiple sigmoids output probabilities of various classes or again simply a 'yes or no', but for multiple classes. In my case though, I'm interested in exclusive two-class (binary) classification, and whether sigmoid can be used to determine its probabilities, or should two-class softmax be used.
A sigmoid function is not a probability density function (PDF), as its integral over the real line diverges. However, it corresponds to the cumulative distribution function (CDF) of the logistic distribution.
Regarding your interpretation of the results, even though the sigmoid is not a PDF, given that its values lie in the interval [0,1], you can still interpret them as a confidence index. With that in mind, I would say that your first interpretation is the most appropriate one, although you are free to implement whichever classifier suits your purposes better.
I think the contradiction between your provided links comes from a semantic definition of probability vs an intuitive one. The intuitive interpretation of "an output closer to 1 is more likely to be correct" is the right intuition, but the number does not correspond directly to the probability. For example, we couldn't say that an output of 1 is twice as likely as 0.5 to be a dog.
There are problems like overfitting that make the purely mathematical probability viewpoint incorrect. However, since you have to pick one of the two options for your program, it makes sense to interpret the result with the binary greater-or-less-than-0.5 approach, or perhaps to allow an adjustable margin of error (for example, 0.5 +/- x is undecided).
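If you want to check empirically how probability-like your particular model's sigmoid outputs are, a reliability (calibration) curve is one way to do it. A minimal sketch with scikit-learn, where the labels and scores are invented purely for illustration:
import numpy as np
from sklearn.calibration import calibration_curve

# y_true: 0/1 labels (1 = dog), y_prob: sigmoid outputs on a held-out set
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.10, 0.35, 0.52, 0.48, 0.90, 0.20, 0.80, 0.45, 0.70, 0.60])

# fraction of actual dogs within each bin of predicted "dog-ness"
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
print(np.c_[mean_pred, frac_pos])  # well-calibrated outputs lie near the diagonal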

Do I need to add ReLU function before last layer to predict a positive value?

I am developing a model using linear regression to predict age. I know that age ranges from 0 to 100 and is always a positive value. I used a 1x1 convolution in the last layer to predict the real value. Do I need to add a ReLU function after the output of the 1x1 convolution to guarantee that the predicted value is positive? Currently I do not add a ReLU, and some predicted values become negative, like -0.02 or -0.4…
There's no compelling reason to use an activation function for the output layer; typically you just want to use a reasonable/suitable loss function directly on the penultimate layer's output. Specifically, a ReLU doesn't solve your problem (or at most only solves 'half' of it), since it can still predict above 100. In this case (predicting a continuous outcome) there are a few standard loss functions like squared error or the L1 norm.
If you really want to use an activation function for this final layer and are concerned about always predicting within a bounded interval, you could always try scaling up the sigmoid function (to between 0 and 100). However, there's nothing special about the sigmoid here: any bounded function, e.g. any CDF of a signed, continuous random variable, could be used similarly. Though for optimization, something easily differentiable is important.
Why not start with something simple like squared-error loss? It's always possible to just 'clamp' out-of-range predictions to within [0, 100] (we can give this a fancy name like 'doubly ReLU') when you need to actually make predictions (as opposed to during training/testing), as sketched below. But if you're getting lots of such errors, the model might have more fundamental problems.
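For illustration, that clamping step at prediction time might look like the following sketch (assuming a trained PyTorch model producing raw regression outputs; the function name is made up):
import torch

def predict_age(model, x):
    # train with a plain regression loss (e.g. MSE) on the raw outputs,
    # then clamp to the valid range only when reporting predictions
    with torch.no_grad():
        raw = model(x)
    return torch.clamp(raw, min=0.0, max=100.0)  # the 'doubly ReLU'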
Even for a regression problem, it can be good (for optimisation) to use a sigmoid layer before the output (giving a prediction in the [0, 1] range), followed by a denormalization (here, if you think the maximum age is 100, just multiply by 100).
This tip is explained in this fast.ai course.
I personally think these lessons are excellent.
You should use a sigmoid activation function, and then normalize the target outputs to the [0, 1] range. This solves both issues of being positive and having an upper limit.
You can then easily denormalize the neural network outputs to get an output in the [0, 100] range.
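A minimal sketch of that idea in PyTorch (the class name and layer sizes are made up for illustration):
import torch
import torch.nn as nn

MAX_AGE = 100.0

class BoundedAgeHead(nn.Module):
    # predicts an age in [0, MAX_AGE] via a sigmoid followed by rescaling
    def __init__(self, in_features):
        super().__init__()
        self.linear = nn.Linear(in_features, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x)) * MAX_AGE

# during training, either regress directly against ages in [0, 100],
# or divide the targets by MAX_AGE and drop the rescaling from forward()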

Catboost: what are reasonable values for l2_leaf_reg?

Running catboost on a large-ish dataset (~1M rows, 500 columns), I get:
Training has stopped (degenerate solution on iteration 0, probably too small l2-regularization, try to increase it).
How do I guess what the l2 regularization value should be? Is it related to the mean values of y, number of variables, tree depth?
Thanks!
I don't think you will find an exact answer to your question, because every dataset is different.
However, based on my experience, values in the range between 2 and 30 are a good starting point.
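If it helps, one way to probe that range is simply to cross-validate a handful of values; a rough sketch (X and y stand in for your own features and target, and the iteration count and scoring choice are arbitrary):
from catboost import CatBoostRegressor
from sklearn.model_selection import cross_val_score

# X, y: your feature matrix and target (assumed to be defined already)
for l2 in [2, 5, 10, 20, 30]:
    model = CatBoostRegressor(l2_leaf_reg=l2, iterations=200, verbose=False)
    score = cross_val_score(model, X, y, cv=3, scoring="neg_mean_squared_error").mean()
    print(f"l2_leaf_reg={l2}: mean CV neg-MSE={score:.4f}")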

scikit-learn RandomForestClassifier produces 'unexpected' results

I'm trying to use sk-learn's RandomForestClassifier for a binary classification task (positive and negative examples). My training data contains 1,177,245 examples with 40 features, in SVM-light format (sparse vectors), which I load using sklearn.datasets' load_svmlight_file. It produces a sparse matrix of feature values (1,177,245 x 40) and one array of target classes (1s and 0s, 1,177,245 of them). I don't know whether this is worrisome, but the training data has 3,552 positives and the rest are all negative.
As sk-learn's RFC doesn't accept sparse matrices, I convert the sparse matrix to a dense array (if I'm saying that right? lots of 0s for absent features) using .toarray(). I print the matrix before and after converting to arrays, and that seems to be going all right.
When I initiate the classifier and start fitting it to the data, it takes this long:
[Parallel(n_jobs=40)]: Done 1 out of 40 | elapsed: 24.7min remaining: 963.3min
[Parallel(n_jobs=40)]: Done 40 out of 40 | elapsed: 27.2min finished
(is that output right? Those 963 minutes take about 2 and a half...)
I then dump it using joblib.dump.
When I re-load it:
RandomForestClassifier: RandomForestClassifier(bootstrap=True, compute_importances=True,
criterion=gini, max_depth=None, max_features=auto,
min_density=0.1, min_samples_leaf=1, min_samples_split=1,
n_estimators=1500, n_jobs=40, oob_score=False,
random_state=<mtrand.RandomState object at 0x2b2d076fa300>,
verbose=1)
And when I test it on real test data (consisting of 750,709 examples, exact same format as the training data), I get "unexpected" results. To be exact: only one of the examples in the test data is classified as true. When I train on half the initial training data and test on the other half, I get no positives at all.
Now I have no reason to believe anything is wrong with what's happening; it's just that I get weird results, and furthermore I think it's all done awfully quickly. It's probably impossible to make a comparison, but training an RF classifier on the same data using rt-rank (also with 1500 iterations, but with half the cores) takes over 12 hours...
Can anyone enlighten me whether I have any reason to believe something is not working the way it's supposed to? Could it be the ratio of positives to negatives in the training data? Cheers.
Indeed this dataset is very, very imbalanced. I would advise you to subsample the negative examples (e.g. pick n_positive_samples of them at random) or to oversample the positive examples (the latter is more expensive but might yield better models).
Also, are you sure that all your features are numerical features (larger values mean something in real life)? If some of them are categorical integer markers, those features should be expanded into one-of-k boolean encodings instead, as the scikit-learn implementation of random forests cannot directly deal with categorical data.
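A rough sketch of both suggestions (subsampling the negatives and one-of-k encoding); the column indices and array names are made up for illustration:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# X: dense (n_samples, 40) feature array, y: 0/1 labels (assumed already loaded)
rng = np.random.default_rng(0)
pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)
# keep all positives and an equally sized random subset of negatives
neg_sub = rng.choice(neg_idx, size=len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, neg_sub])
X_bal, y_bal = X[keep], y[keep]

# if, say, columns 0-4 were categorical integer markers, one-hot encode them
cat_cols = [0, 1, 2, 3, 4]
encoder = OneHotEncoder(handle_unknown="ignore")
X_cat = encoder.fit_transform(X_bal[:, cat_cols]).toarray()
X_num = np.delete(X_bal, cat_cols, axis=1)
X_final = np.hstack([X_num, X_cat])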
