How does STAN computation time scale with number of iterations? - mcmc

I'm using MCMC to fit a bayesian mixed model. I have not a particular strong background in coding/computer science and I was wondering about something which seems very weird to me.
When I set the number of iterations for my MCMC sampling through STAN to 100 the sampling is complete in like 20" , when I set it to 1000, it takes like an hour to get to 100.
Is it something about the sampling method or what?
Thank you in advance

It has to do with how adaptation works. See the manual.

Related

Efficient way to elevate signal over time if it drops significantly over time

I am wondering what would be the most efficient way to correct signal if it drops out significantly in some period of time. Like in the figure green signal dropped out between around 16:26 and 19:16 and I would like to elevate it to the same level like before 16:26 and after 19:16 using statistics.
Please find here the figure
Thanks in advance!
Try the Bayesian blocks method, here is the paper: J. D. Scargle, J. P. Norris, B. Jackson, J. Chiang, (2012) arXiv:1207.5578.
It is rather long paper, but you can skip to the place where they included a MatLab implementation.
What it does is that it splits your time series into time blocks in which the values fluctuate around some mean, that mean being different in each block.
Once you have the blocks, the ones which are low can be scaled up to remove the drops.
NOTE: there is a parameter ncp_prior , by varying it you can change the sensitivity of the method, so that it doesn't get fooled by the fluctuations, but reproduces the drop.

How to fine tune input parameters for ALWRS Factorizer in Apache Mahout?

So I have been using Apache Mahout for building a recommendation system. I am interested in using the SVD matrix factorization method.
I would like to know how I can fine tune the input paramter for :
ALSWRFactorizer(dataModel, no_of_hidden_features, lambda, iterations)
I have tried varying the values of lamda from 0.05 - 0.065 and my recommendation scores increased and then decreased. I thus selected 0.05945 as the value where the scores had reached the peak.
Is this the only approach I can use to estimate no_of_iterations and hidden_features. (values are rising and then decreasing, I expect no-of_features to be between 20-30).
Moreover is this the right approach even?
EDIT: Well I ran a couple more tests, and I seem to have zeroed in on using 20 hidden features, lambda = 0.0595, 20 iterations.
However I'd appreciate any answers explaining how I can do it in a better way.
So I came across this paper:
Application of Dimensionality Reduction in Recommender System
In section 4.3 they have essentially followed the same steps that I have done. After spending a day or two going through google results, iterative experimentation seems to be the only answer to fine tune these paramters.
Not sure what you mean that your "scores" increased then decreased. If you are describing running a precision type cross-validation test after applying each parameter iteration then you did the right thing. The values you came up with are very close to the "rules of thumb" for ALS-WR.

Genetic Algorithm, large population vs small one

Im wondering if there is a general rule of thumb for population sizing. Ive read in a book that 2x the chromosome length is a good starting point. Am i correct in assuming then that if i had an equation with 5 variables, i should have a population of 10?
Im also wondering if the following is correct:
Larger Population Size.
Pros:
Larger diversity so more likely to pick up on traits which return a good fitness.
Cons:
Requires longer to process.
vs
Smaller Population Size.
Pros:
Larger number of generations experienced per unit time.
Cons:
Mutation will have to be more prominent in order to compensate for smaller population??
EDIT
A little additional info, say i have an equation which has 5 unknown parameters. For each parameter i have anywhere between 10-50 values i would like to try assign to each of these variables. So for example
variable1 = 20 different values
variable2 = 15 different values
...
I thought a GA would be a decent approach to such a problem as the search space is quite large, ie worst case for the above would be 312,500,000 permutations (unless i have screwed up?) n!/(n-k)! where n = 50 and k = 1 => 50 * 50 * 50 * 50 * 50
unfortunately the number of parameters/range of values to check can vary alot so i was looking for some sort of rule of thumb as to how large i should set the population.
Thanks for ur help + if there is any more info you need/prefer to discuss in one of the chatrooms, just give me a shout.
I'm not sure where you read that 2x the chromosome length is a good starting point, but I'm guessing it's a book that concentrated on larger problems.
If you only have five variables, a genetic algorithm is probably not the right choice for converging upon a solution. With a chromosome length of five you're probably going to find that you very quickly reach a non-deterministic(this will change in subsequent runs) local minimum and slowly iterate around that space until you find the true local minimum.
However, if you are insistent on using a GA I would suggest abandoning that rule of thumb for this problem and really think about starting population as a measure of how far from the final solution you expect a random solution to be.
The reason that many rule of thumbs is dependent on chromosome length is because that's a decent proxy for this, if I have a hundred variables, and given randomly generating dna sequence is going to be further from ideal than if I had only one variable.
Additionally, if you're worried about computation intensity I'm going to go ahead and say that it shouldn't be an issue since you're dealing with such a small solution set. I think a better rule of thumb for smaller sets like this would be along the lines of:
(ln(chromosome_length*(solution_space/granularity)/mutation_rate))^2
Probably with a constant thrown in to scale for the particular problem.
It's definitely not a great rule of thumb (no rule is) but here's my logic for it:
Chromosome length is just a proxy for size of solution space, so taking into account the size of the solution space will necessarily increase the accuracy of this proxy
A smaller mutation rate necessitates a larger population size to compensate for the fact that you are more prone to get caught in local minima
Any rule of thumb should scale logarithmically since a genetic algorithm is akin to a tree search of your solution space.
The squared term was mostly the result of trying this out, but it looks like the logarithmic scaling was a little aggressive, though the general shape seemed right.
However I think a better choice would be to start at a reasonable number (100) and try iterating up and down until you find a population size that seems to balance accuracy with execution speed.
As with most genetic algorithm parameters population size is highly dependant on the problem. There are certain factors that can help to point in the direction of whether you should have a large or small population size but a lot of the time testing different values against a known solution before running it on your problem is a good idea (if this is possible of course).
A population size of 10 does seem rather small though. You say you have an equation with five variables. Is your problem represented by a chromosome of 5 values? It seems small for a chromosome and if this is the case it's likely that using a genetic algorithm may not be the best way to solve the problem. Perhaps if you give a bit more detail on your problem and how you are representing it people may have a better idea of how to advise you.
I'd also add that your cons for large and small population sizes aren't exactly correct. A larger population size does take longer to process than a small one but since it can often solve the problem quicker then overall the processing time isn't necessarily longer. gain, it's highly dependant on the problem. With a smaller population size mutation shouldn't have to be more prominent. Mutation is generally used to stop the genetic algorithm from becoming stuck in a local maximum and should usually be a very small value. A small population is more likely to become stuck in a local maximum but if you have a mutation value which is too high you may be nullifying the natural improvement of the genetic algorithm.

When to use geometric vs arithmetic mean?

So I guess this isn't technically a code question, but it's something that I'm sure will come up for other folks as well as myself while writing code, so hopefully it's still a good one to post on SO.
The Google has directed me to plenty of nice lengthy explanations of when to use one or the other as regards financial numbers, and things like that.
But my particular context doesn't fit in, and I'm wondering if anyone here has some insight. I need to take a whole bunch of individual users' votes on how "good" a particular item is. I.e., some number of users each give a particular item a score between 0 and 10, and I want to report on what the 'typical' score is. What would be the intuitive reasons to report the geometric and/or arithmetic mean as the typical response?
Or, for that matter, would I be better off reporting the median instead?
I imagine there's some psychology involved in what the "best" method might be...
Anyway, there you have it.
Thanks!
Generally speaking, the arithmetic mean will suffice. It is much less computationally intensive than the geometric mean (which involves taking an n-th root).
As for the psychology involved, the geometric mean is never greater than the arithmetic mean, so arithmetic is the best choice if you'd prefer higher scores in general.
The median is most useful when the data set is relatively small and the chance of a massive outlier relatively high. Depending on how much precision these votes can take, the median can sometimes end up being a bit arbitrary.
If you really really want the most accurate answer possible, you could go for calculating the arithmetic-geomtric mean. However, this involved calculating both arithmetic and geometric means repeatedly, so it is very computationally intensive in comparison.
you want the arithmetic mean. since you aren't measuring the average change in average or something.
Arithmetic mean is correct.
Your scale is artificial:
It is bounded, from 0 and 10
8.5 is intuitively between 8 and 9
But for other scales, you would need to consider the correct mean to use.
Some other examples
In counting money, it has been argued that wealth has logarithmic utility. So the median between Bill Gates' wealth and a bum in the inner city would be a moderately successful business person. (Arithmetic average would hive you Larry Page.)
In measuring sound level, decibels already normalizes the effect. So you can take arithmetic average of decibels.
But if you are measuring volume in watts, then use quadratic means (RMS).
The answer depends on the context and your purpose. Percent changes were mentioned as a good time to use geometric mean. I use geometric mean when calculating antennas and frequencies since the percentage change is more important than the average or middle of the frequency range or average size of the antenna is concerned. If you have wildly varying numbers, especially if most are similar but one or two are "flyers" (far from the range of the others) the geometric mean will "smooth" the results (not let the different ones exert a change in the results more than they should). This method is used to calculate bullet group sizes (the "flyer" was probably human error, not the equipment, so the average is ""unfair" in that case). Another variation similar to geometric mean is the root mean square method. First you take the square root of the numbers, take THAT mean, and then square your answer (this provides even more smoothing). This is often used in electrical calculations and most electical meters are calculated in "RMS" (root mean square), not average readings. Hope this helps a little. Here is a web site that explains it pretty well. standardwisdom.com

Issues in Convergence of Sequential minimal optimization for SVM

I have been working on Support Vector Machine for about 2 months now. I have coded SVM myself and for the optimization problem of SVM, I have used Sequential Minimal Optimization(SMO) by Dr. John Platt.
Right now I am in the phase where I am going to grid search to find optimal C value for my dataset. ( Please find details of my project application and dataset details here SVM Classification - minimum number of input sets for each class)
I have successfully checked my custom implemented SVM`s accuracy for C values ranging from 2^0 to 2^6. But now I am having some issues regarding the convergence of the SMO for C> 128.
Like I have tried to find the alpha values for C=128 and it is taking long time before it actually converges and successfully gives alpha values.
Time taken for the SMO to converge is about 5 hours for C=100. This huge I think ( because SMO is supposed to be fast. ) though I`m getting good accuracy?
I am screwed right not because I can not test the accuracy for higher values of C.
I am actually displaying number of alphas changed in every pass of SMO and getting 10, 13, 8... alphas changing continuously. The KKT conditions assures convergence so what is so weird happening here?
Please note that my implementation is working fine for C<=100 with good accuracy though the execution time is long.
Please give me inputs on this issue.
Thank You and Cheers.
For most SVM implementations, training time can increase dramatically with larger values of C. To get a sense of how training time in a reasonably good implementation of SMO scales with C, take a look at the log-scale line for libSVM in the graph below.
SVM training time vs. C - From Sentelle et al.'s A Fast Revised Simplex Method for SVM Training.
alt text http://dmcer.net/StackOverflowImages/svm_scaling.png
You probably have two easy ways and one not so easy way to make things faster.
Let's start with the easy stuff. First, you could try loosening your convergence criteria. A strict criteria like epsilon = 0.001 will take much longer to train, while typically resulting in a model that is no better than a looser criteria like epsilon = 0.01. Second, you should try to profile your code to see if there are any obvious bottlenecks.
The not so easy fix, would be to switch to a different optimization algorithm (e.g., SVM-RSQP from Sentelle et al.'s paper above). But, if you have a working implementation of SMO, you should probably only really do that as a last resort.
If you want complete convergence, especially if C is large, it takes a very long time.You can consider defining a large stop criterion, and give the maximum number of iterations, the default in Libsvm is 1000000, if the number of iterations is more, the time will multiply, but the loss is not worth the cost, but the result may not fully meet the KKT condition, some support vectors are in the band, non-support vectors are out of the band, but the error is small and acceptable.In my opinion, it is recommended to use other quadratic programming algorithms instead of SMO algorithm if the accuracy is higher

Resources