Question about a new type of confidence interval - machine-learning

I came up with the following result, tested on many data sets, but I do not have a formal proof yet:
Theorem: The width L of any confidence interval is asymptotically equal (as n tends to infinity) to a power function of n, namely L=A / n^B where A and B are two positive constants depending on the data set, and n is the sample size.
See here and here for details. The B exponent seems to be very similar to the Hurst exponent in time series, not only in terms of what it represents, but also in the values that it takes: B=1/2 corresponds to perfect data (no auto-correlation or undesirable features) and B=1 corresponds to "bad data" typically with strong auto-correlations.
Note that B=1/2 is what everyone uses nowadays, assuming observations are independently and identically distributed with an underlying normal distribution. I also devised a method to make the interval width converge to zero faster: O(1/n) rather than O(1/sqrt(n)). This is also described in section 3.3 of my article on re-sampling (here), and my approach in this context seems very much related to what are called second-order accurate intervals (usually achieved with modern versions of bootstrapping, see here).
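One simple way to check the claimed power law on a given data set is to regress log(width) on log(n): the slope estimates -B and the intercept estimates log(A). Below is a minimal sketch under an illustrative i.i.d. normal setup (not the data sets from the article), for which B should come out near 1/2.
```python
# A minimal sketch (not the author's method): estimate the exponent B in
# L ~ A / n^B by regressing log(CI width) on log(n) for simulated data.
# The data-generating process and sample sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
sample_sizes = [100, 200, 400, 800, 1600, 3200]
widths = []

for n in sample_sizes:
    x = rng.normal(size=n)             # i.i.d. case, so B should be near 1/2
    se = x.std(ddof=1) / np.sqrt(n)    # standard error of the mean
    widths.append(2 * 1.96 * se)       # width of a 95% CI for the mean

# log L = log A - B log n  =>  the slope of the fit is -B
slope, intercept = np.polyfit(np.log(sample_sizes), np.log(widths), 1)
print("estimated B:", -slope, "estimated A:", np.exp(intercept))
```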
My question is whether my theorem is original, ground-breaking, and correct, and how one would prove it (or refute it).
Example of Confidence Interval
Perl code to produce confidence intervals for the correlation

The first problem is: what do you mean by a confidence interval?
Let's say I do non-parametric estimation of a probability density function with a kernel density estimator.
A confidence interval has no meaning in this setting. However, you can compute something which is the "speed" of convergence of your kernel density estimator to your target function. Depending on the distance you choose between functions, you get different speeds of convergence. For example, the best speed with the $L^{\infty}$ distance involves a $\log(n)$ factor.
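As an illustration of that point, here is a minimal sketch (assuming a standard-normal target and scipy's Gaussian KDE) that measures the sup-norm error for a few sample sizes and fits an empirical rate; since the true rate carries a $\log(n)$ factor, it is not a pure power of $n$.
```python
# A minimal sketch, under illustrative assumptions: measure the sup-norm
# (L-infinity) error of a Gaussian KDE against a known standard-normal target
# and fit an empirical rate.
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(1)
grid = np.linspace(-4, 4, 801)
sample_sizes = [200, 400, 800, 1600, 3200]
errors = []

for n in sample_sizes:
    sample = rng.normal(size=n)
    kde = gaussian_kde(sample)     # bandwidth chosen by Scott's rule
    errors.append(np.max(np.abs(kde(grid) - norm.pdf(grid))))

slope, _ = np.polyfit(np.log(sample_sizes), np.log(errors), 1)
print("empirical rate exponent:", -slope)   # generally not 1/2
```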
By the way, you give yourself a counterexample in your first article.
So for me your theorem cannot hold, for two reasons:
It is not clear: you need to specify exactly what you mean by a confidence interval, and what you mean by "depending on the data set" (does it depend on $n$, the number of observations?).
There are counterexamples, since the asymptotic speed of convergence of estimators can be more complicated than what you describe.

Related

Is Choosing a Model based on F1 Score (Computed at threshold = 0.5) equivalent to choosing a model based on Area Under Precision Recall Curve?

https://neptune.ai/blog/f1-score-accuracy-roc-auc-pr-auc provides a good summary on Accuracy vs AUROC vs F1 vs AUPR.
When comparing performances of different models over the same dataset, depending on the use case one might choose Accuracy, AUROC, AUPR, or F1.
One thing I'm not quite clear on though is: "does choosing based on F1 (harmonic mean between precision and recall) over a threshold of 0.5 result in the same choice compared to choosing based on Area Under PR Curve?"
If so, why?
It is most certainly not, due to a very simple and fundamental reason: AUC scores (either of ROC or PR curves) actually give the performance of the model averaged over a whole range of thresholds; looking closely at the linked document, you'll notice the following regarding the PR AUC (emphasis in the original):
"You can also think of PR AUC as the average of precision scores calculated for each recall threshold. You can also adjust this definition to suit your business needs by choosing/clipping recall thresholds if needed."
and you may use PR AUC
"when you want to choose the threshold that fits the business problem."
The moment you choose any specific threshold (in precision, recall, F1, etc.), you have left the realm of the AUC scores (ROC or PR) altogether - you are at a single point on the curve, and the average area under the curve is no longer useful (or even meaningful).
I have argued elsewhere that the AUC scores can be misleading, in the sense that most people think they give something other than what they actually give, i.e. the performance of the model over a whole range of thresholds, while what one is going to deploy (and whose performance one is thus interested in) will necessarily involve a specific threshold.
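To make the difference concrete, here is a minimal sketch (with made-up labels and scores, not data from the question) computing both F1 at a 0.5 threshold and PR AUC (average precision) for two hypothetical models; the two metrics measure different things and need not rank the models the same way.
```python
# A minimal sketch with made-up scores: F1 at a fixed 0.5 threshold vs PR AUC.
import numpy as np
from sklearn.metrics import f1_score, average_precision_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
scores_a = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.55, 0.8, 0.35])
scores_b = np.array([0.2, 0.1, 0.4, 0.3, 0.9, 0.52, 0.51, 0.6, 0.49, 0.25])

# With these numbers both models tie on F1@0.5 (0.75) yet differ on PR AUC
# (roughly 0.95 vs 0.80), because PR AUC averages over all thresholds.
for name, s in [("model A", scores_a), ("model B", scores_b)]:
    print(name,
          "F1@0.5 =", f1_score(y_true, (s >= 0.5).astype(int)),
          "PR AUC =", average_precision_score(y_true, s))
```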

What is Distance-Sensitive Data and how does it differ from other data? Any examples would be helpful

I was reading about the classification algorithm kNN and came across the term "distance-sensitive data". I was not able to find out what exactly distance-sensitive data is, what its classifications are, or how to tell whether our data is distance-sensitive or not.
Suppose that $x_i$ and $x_j$ are vectors of observed features in cases $i$ and $j$. Then, as you probably know, kNN is based on distances $\|x_i - x_j\|$, such as the Euclidean one.
Now if $x_i$ and $x_j$ contain just a single feature, an individual's height in meters, we are fine, as there are no other "competing" features. Suppose that next we add annual salary (say, in dollars). Consequently, we look at distances between vectors like (1.7, 50000) and (1.8, 100000).
Then, in the case of the Euclidean distance, the salary feature clearly dominates height and it is almost as if we were using the salary feature alone. That is,
$\|x_i - x_j\|_2 \approx |50000 - 100000|$.
However, if the two features actually have similar importance, then we are doing a poor job. It is even worse if salary is actually irrelevant and we should be using height alone. Interestingly, under weak conditions our classifier still has nice properties, such as universal consistency, even in such bad situations. The problem is that in finite samples the performance of our classifier is very bad, so the convergence is very slow.
So, to deal with that, one may want to consider different distances, ones that do something about the scale. Commonly people standardize each feature (set the mean to zero and the variance to 1), but that is not a complete solution either. There are various proposals for what could be done (see, e.g., here).
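For illustration, here is a minimal sketch (using the made-up (height, salary) values from above plus one extra row) showing how the raw Euclidean distance is driven almost entirely by salary, and how per-feature standardization changes that.
```python
# A minimal sketch of the scaling problem: with raw units the salary
# coordinate dominates the Euclidean distance, while standardizing each
# feature puts height and salary on a comparable footing.
import numpy as np

X = np.array([[1.7, 50000.0],
              [1.8, 100000.0],
              [1.6, 52000.0]])

def euclid(a, b):
    return np.sqrt(((a - b) ** 2).sum())

print("raw distance (rows 0,1):", euclid(X[0], X[1]))   # ~50000, driven by salary

# standardize: zero mean, unit variance per feature
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print("standardized distance (rows 0,1):", euclid(Z[0], Z[1]))
```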
On the other hand, algorithms based on decision trees do not suffer from this. In those cases we just look for a point at which to split the variable. For instance, if salary takes values in [0, 100000] and the split is at 40000, then Salary/10 would be split at 4000, so the results would not change.

When should one set staircase=True when decaying the learning rate in TensorFlow?

Recall that when exponentially decaying the learning rate in TensorFlow one does:
decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
the docs mention this staircase option as:
If the argument staircase is True, then global_step / decay_steps is an integer division and the decayed learning rate follows a staircase function.
When is it better to decay every X steps and follow a staircase function, rather than a smoother version that decays a bit more with every step?
The existing answers didn't seem to describe this. There are two different behaviors being described as 'staircase' behavior.
From the feature request for staircase, the behavior is described as being a hand-tuned piecewise constant decay rate, so that a user could provide a set of iteration boundaries and a set of decay rates, to have the decay rate jump to the specified value after the iterations pass a given boundary.
If you look into the actual code for this feature pull request, you'll see that the PR isn't related much to the staircase option in the function arguments. Instead, it defines a wholly separate piecewise_constant operation, and the associated unit test shows how to define your own custom learning rate as a piecewise constant with learning_rate_decay.piecewise_constant.
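As a plain-Python sketch of that hand-tuned behavior (the boundary and rate values below are illustrative, not taken from the PR), a piecewise-constant schedule just maps the current step to the rate for whichever interval it falls in:
```python
# A minimal sketch (plain Python, not the TensorFlow op) of a hand-tuned
# piecewise-constant schedule: the user supplies iteration boundaries and the
# learning rates to use between them. All values are illustrative assumptions.
def piecewise_constant(step, boundaries=(10000, 20000), rates=(0.1, 0.01, 0.001)):
    for boundary, rate in zip(boundaries, rates):
        if step < boundary:
            return rate
    return rates[-1]

for step in (0, 9999, 10000, 25000):
    print(step, piecewise_constant(step))
```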
From the documentation on decaying the learning rate, the behavior is described as treating global_step / decay_steps as integer division, so for the first set of decay_steps steps, the division results in 0, and the learning rate is constant. Once you cross the decay_steps-th iteration, you get the decay rate raised to a power of 1, then a power of 2, etc. So you only observe decay rates at the particular powers, rather than smoothly varying across all the powers if you treated the global step as a float.
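The difference is easy to see in a small sketch of the decay formula itself (plain Python with illustrative values, not the TensorFlow op):
```python
# A minimal sketch of the two behaviors of exponential decay: with
# staircase=True the exponent uses integer division, so the rate is held
# constant for decay_steps steps; otherwise it decays smoothly.
def decayed_lr(step, lr=0.1, decay_rate=0.5, decay_steps=1000, staircase=False):
    exponent = step // decay_steps if staircase else step / decay_steps
    return lr * decay_rate ** exponent

for step in (0, 500, 999, 1000, 1500):
    print(step, decayed_lr(step, staircase=True), decayed_lr(step, staircase=False))
```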
As to advantages, this is just a hyperparameter decision you should make based on your problem. Using the staircase option allows you to hold the decay rate constant for a while, essentially like maintaining a higher temperature in simulated annealing for a longer time. This can let you explore more of the solution space by taking bigger strides in the gradient direction, at the cost of possibly noisy or unproductive updates. Meanwhile, smoothly increasing the power on the decay rate will steadily "cool" the exploration, which can limit you by leaving you stuck near a local optimum, but it can also prevent you from wasting time on noisily large gradient steps.
Whether one approach or the other is better (a) often doesn't matter very much and (b) usually needs to be specially tuned in the cases when it might matter.
Separately, as the feature request link mentions, the piecewise constant operation seems to be for very specifically tuned use cases, when you have separate evidence in favor of a hand-tuned decay rate based on collecting training metrics as a function of iteration. I would generally not recommend that for general use.
Good question.
For all I know, it is a preference of the research group.
Back in the old days, it was computationally more efficient to reduce the learning rate only every epoch. That's why some people still prefer to use it nowadays.
Another, hand-wavy, story that people may tell is that it prevents getting caught in local optima: by "suddenly" changing the learning rate, the weights might jump to a better basin. (I don't agree with this, but add it for completeness.)

what does Maximum Likelihood Estimation exactly mean?

When we are training our model we usually use MLE to estimate its parameters. I know it means that the most probable data for such a learned model is our training set. But I'm wondering whether its probability is exactly 1 or not.
You almost have it right. The Likelihood of a model (theta) for the observed data (X) is the probability of observing X, given theta:
L(theta|X) = P(X|theta)
For Maximum Likelihood Estimation (MLE), you choose the value of theta that provides the greatest value of P(X|theta). This does not necessarily mean that the observed value of X is the most probable for the MLE estimate of theta. It just means that there is no other value of theta that would provide a higher probability for the observed value of X.
In other words, if T1 is the MLE estimate of theta, and if T2 is any other possible value of theta, then P(X|T1) >= P(X|T2). However, there could still be another possible value of the data (Y), different from the observed data (X), such that P(Y|T1) > P(X|T1).
The probability of X for the MLE estimate of theta is not necessarily 1 (and probably never is except for trivial cases). This is expected since X can take multiple values that have non-zero probabilities.
To build on what bogatron said with an example, the parameters learned from MLE are the ones that explain the data you see (and nothing else) the best. And no, the probability is not 1 (except in trivial cases).
As an example (one that has been used billions of times) of what MLE does:
If you have a simple coin-toss problem, and you observe 5 results of coin tosses (H, H, H, T, H) and you do MLE, you will end up giving p(coin_toss == H) a high probability (0.80) because you see heads far too many times. There are good and bad things about MLE, obviously...
Pros: It is an optimization problem, so it is generally quite fast to solve (even if there isn't an analytical solution).
Cons: It can overfit when there isn't a lot of data (like our coin-toss example).
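A minimal sketch of that coin-toss example (evaluating the Bernoulli likelihood on a grid rather than solving analytically) shows both the MLE of 0.8 and the fact that the likelihood of the observed data is far from 1:
```python
# A minimal sketch: Bernoulli MLE for 4 heads out of 5 tosses.
import numpy as np

tosses = np.array([1, 1, 1, 0, 1])           # H, H, H, T, H
p_grid = np.linspace(0.01, 0.99, 99)
likelihood = p_grid ** tosses.sum() * (1 - p_grid) ** (len(tosses) - tosses.sum())

p_mle = p_grid[likelihood.argmax()]
print("MLE of p:", p_mle)                    # ~0.8
print("P(data | MLE):", likelihood.max())    # ~0.08, far from 1
```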
The example I got in my stat classes was as follows:
A suspect is on the run! Nothing is known about them, except that they are approximately 1.80 m tall. Should the police look for a man or a woman?
The idea here is that you have a parameter for your model (M/F), and probabilities given that parameter. There are tall men, tall women, short men and short women. However, in the absence of any other information, the probability of a man being 1.80 m is larger than the probability of a woman being 1.80 m. Likelihood (as bogatron very well explained) is a formalisation of that, and maximum likelihood is the estimation method based on favouring the parameters that are more likely to result in the actual observations.
But that's just a toy example, with a single binary variable... Let's expand it a bit: I threw two identical dice, and the sum of their values is 7. How many sides do my dice have? Well, we all know that the probability of two D6 summing to 7 is quite high. But it might as well be D4, D20, D100, ... However, P(7 | 2D6) > P(7 | 2D20), and P(7 | 2D6) > P(7 | 2D100), ..., so you might estimate that my dice are 6-sided. That doesn't mean it's true, but it's a reasonable estimate in the absence of any additional information.
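A minimal sketch of that dice computation (enumerating all ordered pairs of faces for a few hypothetical die sizes):
```python
# A minimal sketch: probability that two fair d-sided dice sum to 7, for a
# few candidate values of d. Among these, d = 6 maximizes the likelihood.
from itertools import product

def p_sum_is_7(sides):
    rolls = list(product(range(1, sides + 1), repeat=2))
    return sum(a + b == 7 for a, b in rolls) / len(rolls)

for sides in (4, 6, 20, 100):
    print(f"P(sum = 7 | two d{sides}) = {p_sum_is_7(sides):.4f}")
```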
That's better, but we're not in machine-learning territory yet... Let's get there: if you want to fit your umptillion-layer neural network on some empirical data, you can consider all possible parameterisations, and how likely each of them is to return the empirical data. That's exploring an umptillion-dimensional space, each dimension having infinitely many possibilities, but you can map every single one of these points to a likelihood. It is then reasonable to fit your network using the most likely parameters: given that the empirical data did occur, it is reasonable to assume that they should be likely under your model.
That doesn't mean that your parameters are likely! Just that, under these parameters, the observed values are likely. Statistical estimation is usually not a closed problem with a single solution (like solving an equation might be, where you would have a probability of 1); rather, we need to find a best solution according to some metric. Likelihood is such a metric, and is used widely because it has some interesting properties:
It makes intuitive sense
It's reasonably simple to compute, fit and optimise, for a large family of models
For normal variables (which tend to crop up everywhere) MLE gives the same results as other methods, such as least-squares estimation (see the sketch below)
Its formulation in terms of conditional probabilities makes it easy to use and manipulate in Bayesian frameworks
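As a minimal sketch of the least-squares point (with an assumed known unit variance, so only the mean is estimated), numerically maximizing the Gaussian likelihood recovers the sample mean, which is exactly the least-squares estimate:
```python
# A minimal sketch: for normal data, maximizing the likelihood of the mean
# gives the same answer as least squares, namely the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=1.0, size=200)

neg_log_lik = lambda mu: -norm.logpdf(x, loc=mu, scale=1.0).sum()
mu_mle = minimize_scalar(neg_log_lik, bounds=(-10, 10), method="bounded").x

print("MLE of the mean:   ", mu_mle)
print("Least-squares mean:", x.mean())   # the two agree up to optimizer tolerance
```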

Is Maximum Likelihood Estimation (MLE) a parametric approach?

There are two main probabilistic approaches to novelty detection: parametric and non-parametric. The non-parametric approach derives the distribution or density function from the training data, as in kernel density estimation (e.g. Parzen windows), while the parametric approach assumes the data come from a known distribution.
I am not familiar with the parametric approach. Could anyone show me some well-known algorithms? And is MLE a kind of parametric approach (the form of the density curve is known, and one then finds the parameters corresponding to the maximum value)?
Yes, MLE is by definition a parametric approach. You are estimating the parameters of a distribution so as to maximize the probability of observing the data. By the way, a non-parametric approach generally means an infinite number of parameters rather than an absence of parameters; a Dirichlet process is one example.
I think you're confused about a lot of terminology here.
Maximum likelihood makes sense for a parametric model (say a Gaussian distribution) because the number of parameters is fixed a priori, so it makes sense to ask what the 'best' estimates are. Maximum likelihood gives you one (of many) possible answers.
One way of understanding non-parametric methods is that the number of parameters they have grows as the number of samples increases. Thus a KDE has one parameter for each data point, plus a bandwidth parameter. What would MLE mean here? You increase the number of parameters as more data arrive, so there is no sense of a best value for a fixed set of parameters.
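To make the contrast concrete, here is a minimal sketch (with simulated normal data) fitting a parametric Gaussian by MLE next to a non-parametric KDE; the former always has two parameters, while the latter effectively keeps every data point plus a bandwidth.
```python
# A minimal sketch contrasting the two settings: a parametric Gaussian fit has
# a fixed number of parameters (2), while a KDE keeps every data point plus a
# bandwidth, so its "parameter count" grows with n.
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=2.0, size=500)

mu_hat, sigma_hat = norm.fit(x)   # Gaussian MLE: always 2 parameters
kde = gaussian_kde(x)             # non-parametric: n points + a bandwidth

print("parametric MLE:", mu_hat, sigma_hat)
print("KDE 'parameters':", kde.n, "data points, bandwidth factor", kde.factor)
```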
