MCMC when there are only minor variations in the likelihood

When implementing MCMC, the change in likelihood when altering the parameters is small, so the acceptance rate is very high. How can I cope with this?
I have tried altering the standard deviation used to propose new points, but this does not work, as any point within the range of possible parameter values gives a similar likelihood and is therefore accepted. Expanding the range of possible values for the parameters is undesirable.
For example, the range of a parameter could be 0 to 1, and across that range the log-likelihood only varies from about -11.2 to -10.6. Taking the two extremes, the acceptance ratio for a move from the better point to the worse one would be
exp(-11.2 - (-10.6)) = exp(-0.6) ≈ 0.549.
Because of this, the vast majority of points are accepted. Is there some method by which this can be overcome when working with Gibbs sampling over 3 parameters?
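For reference, a minimal sketch of the Metropolis-style acceptance step in question (plain Python; the two log-likelihood values are just the extremes quoted above):

import math
import random

def accept(loglik_proposed, loglik_current):
    # accept with probability min(1, exp(loglik_proposed - loglik_current))
    log_ratio = loglik_proposed - loglik_current
    return random.random() < math.exp(min(0.0, log_ratio))

# moving from the best point (-10.6) to the worst (-11.2):
print(math.exp(-11.2 - (-10.6)))   # ~0.549, so even the worst possible move is accepted over half the time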

Related

Machine learning algorithm to predict/find/converge to correct parameters in mathematical model

I am currently trying to find a machine learning algorithm that can predict about 5-15 parameters used in a mathematical model (MM). The MM has 4 different ordinary differential equations (ODEs), and a few more will be added, so more parameters will be needed. Most of the parameters can be measured, but others need to be guessed. We know all 15 parameters, but we want the computer to guess 5 or even 10 of them. To test whether the guessed parameters are correct, we plug them into the MM and solve the ODEs with a numerical method. We then calculate the error between the model output with the parameters we know (and want to guess) and the model output with the guessed parameters. Evaluating the model's ODEs is done many times: each evaluation represents one minute in real time and we simulate 24 hours, thus 1440 calculations.
Currently we are using a particle filter to guess the variables; this works okay, but we want to see if there are better methods out there to guess parameters in a model. The particle filter takes a random value for each parameter within a range we know for that parameter, e.g. 0.001-0.01. This is done for each parameter that needs to be guessed.
If you can run a lot of full simulations (tens of thousands), you can try black-box optimization. I'm not sure whether black-box optimization is the right approach for you (I'm not familiar with particle filters), but if it is, CMA-ES is a clear match here and easy to try.
You have to specify a loss function (e.g. the total sum of squared errors for a whole simulation) and an initial guess (mean and sigma) for your parameters. Among black-box algorithms, CMA-ES is a well-established baseline. It is hard to beat if you have only a few (at most a few hundred) continuous parameters and no gradient information. However, anything less black-box-ish that can, for example, exploit the ODE nature of your problem will do better.
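As a rough illustration, here is a minimal sketch using the Python cma package (pip install cma); the toy two-parameter ODE, the Euler integration, the initial guess, and the step size are all placeholders standing in for the real model and its known ranges:

import numpy as np
import cma

def simulate(params, n_steps=1440, dt=1.0):
    # toy stand-in for the real model: Euler steps of dy/dt = -k*y + c
    k, c = params
    y, traj = 1.0, np.empty(n_steps)
    for t in range(n_steps):
        y += dt * (-k * y + c)
        traj[t] = y
    return traj

observed = simulate([0.003, 0.001])     # pretend these are the measured values

def loss(params):
    # total sum of squared errors over the whole simulation
    return float(np.sum((simulate(params) - observed) ** 2))

x0 = [0.005, 0.005]   # initial guess inside the known ranges (e.g. 0.001-0.01)
sigma0 = 0.002        # initial step size, roughly the scale of those ranges

es = cma.CMAEvolutionStrategy(x0, sigma0)
es.optimize(loss)
print("estimated parameters:", es.result.xbest)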

Question about a new type of confidence interval

I came up with the following result, tested on many data sets, but I do not have a formal proof yet:
Theorem: The width $L$ of any confidence interval is asymptotically equal (as $n$ tends to infinity) to a power function of $n$, namely $L = A / n^B$, where $A$ and $B$ are two positive constants depending on the data set and $n$ is the sample size.
See here and here for details. The B exponent seems to be very similar to the Hurst exponent in time series, not only in terms of what it represents, but also in the values that it takes: B=1/2 corresponds to perfect data (no auto-correlation or undesirable features) and B=1 corresponds to "bad data" typically with strong auto-correlations.
Note that $B = 1/2$ is what everyone uses nowadays, assuming observations are independently and identically distributed with an underlying normal distribution. I also devised a method to make the interval width converge faster to zero: $O(1/n)$ rather than $O(1/\sqrt{n})$. This is also described in Section 3.3 of my article on re-sampling (here), and my approach in this context seems very much related to what are called second-order accurate intervals (usually achieved with modern versions of bootstrapping, see here).
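For instance, the textbook i.i.d. normal case already has exactly this form: a two-sided interval for the mean with known standard deviation $\sigma$ has width $L = 2 z_{1-\alpha/2}\,\sigma/\sqrt{n}$, which is $A/n^B$ with $A = 2 z_{1-\alpha/2}\,\sigma$ and $B = 1/2$.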
My question is whether my theorem is original, ground-breaking, and correct, and how one would prove it (or refute it).
[Figure: example of a confidence interval]
[Perl code to produce confidence intervals for the correlation]
The first problem is, what do you mean by confidence interval?
Let's say I do nonparametric estimation of a probability density function with a kernel density estimator.
A confidence interval has no meaning in this setting; however, you can compute something like the "speed" of convergence of your kernel density estimator to the target function. Depending on the distance you choose between functions, you get different speeds of convergence. For example, the best speed with the $L^{\infty}$ distance involves a $\log(n)$ factor.
By the way, you give a counterexample yourself in your first article.
So for me your theorem cannot hold, for two reasons:
1. It is not clear: you need to specify exactly what you mean by a confidence interval, and what you mean by "depending on the data set" (do $A$ and $B$ depend on $n$, the number of observations?).
2. There is a "counterexample", since the asymptotic speed of convergence of estimators can be more complicated than what you describe.

How to identify the modes in a (multimodal) continuous variable

What is the best method for finding all the modes of a continuous variable? I'm trying to develop a Java or Python algorithm for doing this.
I was thinking about using kernel density estimation to estimate the probability density function of the variable, and then identifying the peaks of that density. But I don't know whether this makes sense, or how to implement it in concrete code in Java or Python.
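A minimal Python sketch of this KDE-plus-peak-finding idea, using SciPy's gaussian_kde and find_peaks (the bimodal toy sample, the grid resolution, and the default bandwidth are all placeholder choices):

import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

x = np.concatenate([np.random.normal(-2, 0.5, 500),
                    np.random.normal(3, 1.0, 500)])   # toy bimodal sample

kde = gaussian_kde(x)                        # kernel density estimate (default bandwidth)
grid = np.linspace(x.min(), x.max(), 1000)   # evaluation grid
density = kde(grid)

peaks, _ = find_peaks(density)               # indices of local maxima of the estimated density
print("estimated modes:", grid[peaks])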
Any answer to the question "how many modes" must involve some prior information about what you consider a likely answer, and any result must be of the form "p(number of modes = k | data) = nnn". Given such a result, you can figure out how to use it; there are at least three possibilities: pick the one with greatest probability, pick the one that minimizes some cost function, or average any other results over these probabilities.
With that prologue, I'll recommend a mixture density model, with varying numbers of components. E.g. mixture with 1 component, mixture with 2 components, 3, 4, 5, etc. Note that with k components, the maximum possible number of modes is k, although, depending on the locations and scales of the components, there might be fewer modes.
There are probably many libraries which can find parameters for a mixture density with a fixed number of components. My guess is that you will need to bolt on the stuff to work with the posterior probability of the number of components. Without looking, I don't know a formula for the posterior probability of the number of modes, although it is probably straightforward to work it out.
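For example, here is a minimal sketch with scikit-learn's GaussianMixture, using BIC as a simple stand-in for the posterior over the number of components (the toy data and the range of component counts are placeholders; a fully Bayesian treatment would replace the BIC step):

import numpy as np
from sklearn.mixture import GaussianMixture

x = np.concatenate([np.random.normal(-2, 0.5, 500),
                    np.random.normal(3, 1.0, 500)]).reshape(-1, 1)

scores = {}
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, n_init=5).fit(x)
    scores[k] = gm.bic(x)          # lower BIC = better trade-off between fit and complexity

best_k = min(scores, key=scores.get)
print("BIC per component count:", scores)
print("selected number of components:", best_k)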
I wrote some Java code for mixture distributions; see: http://riso.sourceforge.net and look for the source code. No doubt there are many others.
Follow-up questions are best directed to stats.stackexchange.com.

When should one set staircase=True when decaying the learning rate in TensorFlow?

Recall that when exponentially decaying the learning rate in TensorFlow one does:
decayed_learning_rate = learning_rate *
decay_rate ^ (global_step / decay_steps)
The docs describe the staircase option as:
If the argument staircase is True, then global_step / decay_steps is an
integer division and the decayed learning rate follows a staircase
function.
When is it better to decay every X number of steps and follow a staircase function, rather than a smoother version that decays more and more with every step?
The existing answers didn't seem to describe this. There are two different behaviors being described as 'staircase' behavior.
From the feature request for staircase, the behavior is described as being a hand-tuned piecewise constant decay rate, so that a user could provide a set of iteration boundaries and a set of decay rates, to have the decay rate jump to the specified value after the iterations pass a given boundary.
If you look into the actual code for this feature pull request, you'll see that the PR isn't related much to the staircase option in the function arguments. Instead, it defines a wholly separate piecewise_constant operation, and the associated unit test shows how to define your own custom learning rate as a piecewise constant with learning_rate_decay.piecewise_constant.
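A minimal sketch of that piecewise-constant usage, assuming the TF 1.x-era API the pull request targeted (the boundaries and rates below are made-up examples):

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
boundaries = [100000, 200000]    # iteration boundaries at which the rate changes
values = [0.1, 0.01, 0.001]      # learning rate before, between, and after the boundaries
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)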
From the documentation on decaying the learning rate, the behavior is described as treating global_step / decay_steps as integer division, so for the first set of decay_steps steps, the division results in 0, and the learning rate is constant. Once you cross the decay_steps-th iteration, you get the decay rate raised to a power of 1, then a power of 2, etc. So you only observe decay rates at the particular powers, rather than smoothly varying across all the powers if you treated the global step as a float.
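To make the difference concrete, here is a small plain-Python illustration (the hyperparameter values are arbitrary examples):

learning_rate, decay_rate, decay_steps = 0.1, 0.5, 1000

def decayed(global_step, staircase):
    # integer division freezes the exponent between multiples of decay_steps
    exponent = global_step // decay_steps if staircase else global_step / decay_steps
    return learning_rate * decay_rate ** exponent

for step in (0, 500, 1000, 1500, 2000):
    print(step, round(decayed(step, True), 4), round(decayed(step, False), 4))

# staircase holds 0.1 for the first 1000 steps, then drops to 0.05;
# the smooth version is already down to about 0.0707 by step 500.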
As to advantages, this is just a hyperparameter decision you should make based on your problem. Using the staircase option allows you to hold the decay rate constant, essentially like maintaining a higher temperature in simulated annealing for a longer time. This can allow you to explore more of the solution space by taking bigger strides in the gradient direction, at the cost of possibly noisy or unproductive updates. Meanwhile, smoothly increasing the decay-rate power will steadily "cool" the exploration, which can limit you by leaving you stuck near a local optimum, but it can also prevent you from wasting time on noisily large gradient steps.
Whether one approach or the other is better (a) often doesn't matter very much and (b) usually needs to be specially tuned in the cases when it might matter.
Separately, as the feature request link mentions, the piecewise constant operation seems to be for very specifically tuned use cases, when you have separate evidence in favor of a hand-tuned decay rate based on collecting training metrics as a function of iteration. I would generally not recommend that for general use.
Good question.
As far as I know, it is a preference of the research group.
Back in the old days, it was computationally more efficient to reduce the learning rate only once per epoch. That's why some people still prefer to use it nowadays.
Another, hand-wavy, story people may tell is that it helps escape local optima: by "suddenly" changing the learning rate, the weights might jump to a better basin. (I don't agree with this, but I add it for completeness.)

SPSS two way repeated measures ANOVA

I am fairly new to statistics.
I ran an experiment and analysed it with a two-way repeated-measures ANOVA. The calculation was done in SPSS. In most papers I have seen, the F-value and the degrees of freedom are reported as well. Is it normal to report those values? If so, which values do I take from the SPSS output?
How do I interpret these values? What do they mean?
When does the F-value support a significant result and when not?
What are good values for the F-value and the degrees of freedom?
In some articles I also read about critical F-values; how do I get this value?
Most articles describe how to calculate those values but do not explain their meaning for the experiment.
Some clarification on these issues would be greatly appreciated.
My English is not very good, but I will try to answer your question.
The main purpose of ANOVA is to test statistically whether the measured groups have the same mean or not. So we form a null hypothesis and an alternative hypothesis, and then compute a test statistic from the data. You can use ANOVA if the groups have the same variance (squared standard deviation).
You need to test this assumption. This is a hypothesis test too: the null hypothesis is that the groups have the same variance, and the alternative hypothesis is that they don't.
You make the decision from the Sig. value: if the value is higher than 0.05, we usually accept the null hypothesis. If the variances are equal, we can use ANOVA. (I assume the data follow a normal distribution.) The null hypothesis of the ANOVA is that the groups have equal means; the alternative hypothesis is that at least one group has a different mean. You make your decision from the Sig. value as before: if the value is higher than 0.05, we accept the null hypothesis.
The F-critical value is not important if you are calculating on a computer. You can build an acceptance interval from the lower and upper F-critical values, and if the F-value falls inside that interval you accept the null hypothesis, but I only used this method in statistics class. You don't need the F-value and the degrees of freedom in the report, because they don't explain anything on their own.
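As an aside on the critical F-value mentioned above: on a computer it can be looked up directly, for example with SciPy (the degrees of freedom below are made-up examples, not values from this experiment):

from scipy.stats import f

df_effect, df_error = 2, 27    # numerator and denominator degrees of freedom
alpha = 0.05
f_critical = f.ppf(1 - alpha, df_effect, df_error)
print(f_critical)              # the effect is significant at level alpha if F > this value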
