Term_evals when finding hyperparameters for XGBoost with #mlr3 - mlr3

I'm new to gradient boosting (XGBoost). I've read the manual for mlr3, and if I understand it right, I want to first tune my hyperparameters. I'm unsure how to set term_evals for this. In the tutorial, a very small number is used and it says a higher number is needed for real applications. How do I know whether the number I've picked is high enough?
Many thanks
I ran instance = tune(...) with different values of term_evals. Of course, it takes longer with more evaluations. I inspected instance$result, but from this I have no idea how to tell whether term_evals was large enough or not.

Choosing the right budget for tuning is a difficult practical challenge. You can find out more in section "6.5 When to Terminate HPO" in our hyperparameter optimization paper. You could use a time limit instead of a number of evaluations: set a time limit that is feasible for you, e.g. one day if you run it on your local machine. Usually it is better to evaluate more configurations.

Related

How to determine the number of rounds in a TFF context

In TFF, it is necessary to determine the number of rounds. So, to obtain optimal performance of our model, how can we know the optimal number of rounds?
TFF does not necessarily need you to specify the number of rounds for federated training beforehand. TFF is more about specifying the federated aspect of your computation (which you can essentially think of as specifying the communication), and considers actually "running" the rounds to be at the system level.
When you write TFF, generally you are writing at three levels (explanation of this statement here); the question you are asking (and every concern TFF considers a "system concern") is at the Python level. Since Python controls the actual invocation of your computation written in TFF, you can stop training with any criterion expressible in Python. E.g. if you want to monitor performance on a validation set and use that as a stopping criterion, this is entirely doable. If you have a tff.utils.IterativeProcess ip and an evaluation function eval_fn (see here for an example), this could be implemented as something like:
state = ip.initialize()                    # create the initial server state
while True:
    data = sample_client_data()            # sample clients for this round
    state, metrics = ip.next(state, data)  # run one federated round
    eval_metrics = eval_fn(state)          # evaluate on a validation set
    if condition(eval_metrics):            # any stopping criterion you like
        break
Abstractly: since Python drives the experiment process, you can stop whenever you want, based on any observable characteristic of the training procedure. Therefore you do not in fact need to know how many rounds you will be running beforehand.
A more direct answer to the original question is, I think at this point in the history of FL, not quite achievable for the general case; nobody (as far as I am aware) knows of reliable system-level settings for FL at this point. This is not surprising; it is somewhat akin to knowing beforehand how many epochs one should specify in datacenter training, which I think tends to be quite problem-dependent. FL is similar in this regard. Practically speaking, my advice tends to be: monitor performance on a validation set, run for as long as you can, and keep the state of your highest-performing model on the val set around. I think a more general answer than this may be quite difficult.
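To make that last piece of advice concrete, here is a rough sketch (reusing the hypothetical ip, sample_client_data and eval_fn from above, and additionally assuming that eval_fn returns a single scalar validation score) that trains until the validation score stops improving and keeps the best state seen so far:

import copy

state = ip.initialize()                        # assumed: ip exposes initialize()/next()
best_state, best_score = None, float("-inf")
patience, rounds_without_improvement = 20, 0   # stop after 20 stagnant rounds

while rounds_without_improvement < patience:
    data = sample_client_data()                # hypothetical helper, as above
    state, metrics = ip.next(state, data)      # one federated round
    score = eval_fn(state)                     # assumed scalar validation metric
    if score > best_score:
        best_state, best_score = copy.deepcopy(state), score
        rounds_without_improvement = 0
    else:
        rounds_without_improvement += 1

# best_state now holds the highest-scoring model seen on the validation set.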

Optimize deep Q network with long episode

I am working on a problem that we aim to solve with deep Q-learning. However, training just takes too long for each episode, roughly 83 hours. We are envisioning solving the problem within, say, 100 episodes.
So we are gradually learning a matrix (100 * 10), and within each episode we need to perform 100*10 iterations of certain operations. Basically, we select a candidate from a pool of 1000 candidates, put this candidate into the matrix, and compute a reward function by feeding the whole matrix as the input.
The central hurdle is that the reward function computation at each step is costly, roughly 2 minutes, and each time we update one entry in the matrix.
All the elements in the matrix depend on each other in the long term, so the whole procedure does not seem suitable for a "distributed" system, if I understood correctly.
Could anyone shed some light on how to look at the potential optimization opportunities here, like some extra engineering effort or so? Any suggestions and comments would be appreciated very much. Thanks.
======================= update of some definitions =================
0. initial stage:
a 100 * 10 matrix, with every element empty
1. action space:
at each step I select one element from a candidate pool of 1000 elements and insert it into the matrix, one element at a time.
2. environment:
at each step I have an updated matrix to learn from.
An oracle function F returns a quantitative value ranging from 5000 to 30000, the higher the better (one computation of F takes roughly 120 seconds).
This function F takes the matrix as the input, performs a very costly computation, and returns a quantitative value indicating the quality of the synthesized matrix so far.
This function is essentially used to measure some performance of the system, so it does take a while to compute a reward value at each step.
3. episode:
By saying "we are envisioning to solve it within 100 episodes", that's just an empirical estimation. But it shouldn't be fewer than 100 episodes, at least.
4. constraints
Ideally, as I mentioned, "all the elements in the matrix depend on each other in the long term", and that's why the reward function F computes the reward by taking the whole matrix as the input rather than only the latest selected element.
Indeed, as more and more elements are appended to the matrix, the reward could increase, but it could also decrease.
5. goal
The synthesized matrix should make the oracle function F return a value greater than 25000. Whenever it reaches this goal, I terminate the learning step (a minimal sketch of this loop is shown right after this list).
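Here is a minimal sketch of one episode (all names, in particular candidate_pool, policy and the oracle F, are placeholders for the real components, which I cannot share):

import numpy as np

N_ROWS, N_COLS, GOAL = 100, 10, 25000

def run_episode(candidate_pool, policy, F):
    # Fill the 100 x 10 matrix entry by entry; F is the costly oracle
    # (matrix -> score roughly in [5000, 30000], ~120 s per call).
    matrix = np.zeros((N_ROWS, N_COLS), dtype=object)  # start "empty"
    reward = None
    for i in range(N_ROWS):
        for j in range(N_COLS):
            candidate = policy(matrix, (i, j))   # pick one of the 1000 candidates
            matrix[i, j] = candidate             # insert it into the matrix
            reward = F(matrix)                   # costly reward, ~2 minutes
            if reward > GOAL:                    # stopping criterion
                return matrix, reward
    return matrix, reward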
Honestly, there is no effective way to know how to optimize this system without knowing specifics, such as which computations are in the reward function or which programming design decisions you have made that we could help with.
You are probably right that the episodes are not suitable for distributed calculation, meaning we cannot parallelize this, as they depend on previous search steps. However, it might be possible to throw more computing power at the reward function evaluation, reducing the total time required to run.
I would encourage you to share more details on the problem, for example by profiling the code to see which component takes up most of the time, by sharing a code excerpt or, as the standard for doing science gets higher, by sharing a reproducible code base.
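As a concrete starting point for the profiling, a generic Python sketch using the standard library profiler (train_one_episode is a placeholder for your own training entry point):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
train_one_episode()                    # placeholder: run one (short) episode
profiler.disable()

# Print the 20 most expensive calls by cumulative time; the reward function
# should show up near the top if it really dominates the run time.
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(20)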
Not a solution to your question, just some general thoughts that maybe are relevant:
One of the biggest obstacles to applying Reinforcement Learning to "real world" problems is the astoundingly large amount of data/experience required to achieve acceptable results. For example, OpenAI's Dota 2 agents collected the equivalent of 900 years of experience per day. In the original Deep Q-network paper, in order to achieve performance close to that of a typical human, hundreds of millions of game frames were required, depending on the specific game. In other benchmarks where the inputs are not raw pixels, such as MuJoCo, the situation isn't a lot better. So, if you don't have a simulator that can generate samples (state, action, next state, reward) cheaply, maybe RL is not a good choice. On the other hand, if you have a ground-truth model, maybe other approaches can easily outperform RL, such as Monte Carlo Tree Search (e.g., Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning or Simple random search provides a competitive approach to reinforcement learning). All these ideas and much more are discussed in this great blog post.
The previous point is especially true for deep RL. Approximating value functions or policies with a deep neural network with millions of parameters usually implies that you'll need a huge quantity of data, or experience.
And regarding your specific question:
In the comments, I've asked a few questions about the specific features of your problem. I was trying to figure out whether you really need RL to solve the problem, since it's not the easiest technique to apply. On the other hand, if you really need RL, it's not clear whether you should use a deep neural network as the approximator or whether a shallower model (e.g., random trees) would suffice. However, these questions and other potential optimizations require more domain knowledge. Here, it seems you are not able to share the domain of the problem, which could be due to numerous reasons, and I perfectly understand.
You have estimated the number of required episodes to solve the problem based on some empirical studies using a smaller version with a 20*10 matrix. Just a cautionary note: due to the curse of dimensionality, the complexity of the problem (or the experience needed) could grow exponentially as the state space dimensionality grows, although maybe that is not your case.
That said, I'm looking forward to seeing an answer that really helps you to solve your problem.

How to differentiate between real improvement and random noise?

I am building an automatic translator in Moses. To improve its performance, I use log-linear weight optimisation. This technique has a random component, which can slightly affect the final result (but I do not know exactly by how much).
Suppose that the current performance of the model is 25 BLEU.
Suppose now I modify the language model (e.g. change the smoothing), and I get a performance of 26 BLEU.
My question is: how can I know whether the improvement is due to the modification, or is just noise from the random component?
This is pretty much what statistics is all about. You can basically do one of two things (from the basic set of solutions; of course there are many more advanced options):
try to measure/model/quantify the effect of randomness. If you know what is causing it, you might be able to actually compute how much it can affect your model. If an analytical solution is not possible, you can always train 20 models with the same data/settings, gather the results and estimate the noise distribution. Once you have this, you can perform statistical tests to check whether the improvement is statistically significant (for example an ANOVA test; see the sketch after these two options).
a simpler approach (but more expensive in terms of data/time) is to simply reduce the variance by averaging. In short, instead of training one model (or evaluating a model once), which has this hard-to-determine noise component, do it many times, 10 or 20, and average the results. This way you reduce the variance of the results in your analysis. This can (and should) be combined with the previous option, since you then have 20 results per run, so you can again use statistical tests to see whether these are significantly different.
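As a concrete illustration of combining the two options above (a generic Python sketch, not Moses-specific; with only two systems the ANOVA reduces to a two-sample t-test, and the BLEU scores below are placeholder values standing in for your repeated tuning runs):

from scipy import stats

# BLEU scores from repeated tuning runs of the baseline and the modified system
# (placeholder values; replace with your own repeated-run results).
baseline = [25.1, 24.8, 25.3, 24.9, 25.2, 25.0, 24.7, 25.1]
modified = [26.0, 25.7, 26.2, 25.9, 25.8, 26.1, 25.6, 26.0]

# Welch's two-sample t-test: is the mean BLEU difference larger than what the
# tuning noise alone would produce?
t_stat, p_value = stats.ttest_ind(modified, baseline, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests the improvement is not just noise.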

SPSS two way repeated measures ANOVA

I am fairly new to statistics.
I made an experiment and used the two-way ANOVA with repeated measures. The calculation was done in SPSS. In most papers I have seen, the F-value and the degrees of freedom were reported as well. Is it normal to report those values as well? If so, which values do I take from the SPSS output?
How do I interpret these values? What do they mean?
When does the F-value support a significant result, and when does it not?
What are good values for the F-value and the degrees of freedom?
In some articles I also read about critical F-values; how do I get this value?
Most articles describe how to calculate those values but do not explain their meaning for the experiment.
Some clarification on these issues would be greatly appreciated.
My English is not very good, but I will try to answer your question.
The main purpose of ANOVA is to get statistical evidence of whether the measured groups have the same mean or not. So we set up a null hypothesis and an alternative hypothesis, then we compute a test statistic on the data. You can use ANOVA if the groups have the same variance (the squared standard deviation).
You need to test this first. This is a hypothesis test too: the null hypothesis is that the groups have the same variance, and the alternative hypothesis is that they don't.
You make your decision from the Sig. value: if the value is higher than 0.05, we usually accept the null hypothesis. If the variances are equal, we can use ANOVA. (I assume that the data follow a normal distribution.) The null hypothesis is that the groups have equal means; the alternative hypothesis is that at least one group has a different mean. You can make your decision from the Sig. value, as I said before: if the value is higher than 0.05, we accept the null hypothesis. The critical F-value is not important if you are calculating on a computer. You can construct an acceptance interval from the lower and upper critical F-values, and if the F-value is in the interval you accept the null hypothesis, but I only used this method in statistics class. You don't need the F-value and the df in the report, because they don't explain anything on their own.
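If you still want the critical F-value (for reporting, say), it comes from the F distribution. A small illustration outside SPSS, in Python, with placeholder degrees of freedom and the usual one-sided test:

from scipy import stats

alpha = 0.05
df_effect, df_error = 2, 18     # placeholder degrees of freedom from your design

# Upper critical value: reject the null hypothesis of equal means if the
# observed F statistic exceeds it.
f_crit = stats.f.ppf(1 - alpha, df_effect, df_error)

# The p-value ("Sig." in SPSS) for an observed F statistic, e.g. F = 4.3 (placeholder).
f_observed = 4.3
p_value = stats.f.sf(f_observed, df_effect, df_error)

print(f"critical F at alpha={alpha}: {f_crit:.2f}, p-value: {p_value:.4f}")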

Interpreting the parameters of the evaluate() function of an item-based recommender in Mahout

I am working with boolean values, trying to evaluate a recommendation engine in Mahout. My questions are about the selection of the "correct" parameters for the evaluate function. Apologies in advance for the lengthy post.
IRStatistics evaluate(RecommenderBuilder recommenderBuilder,
                      DataModelBuilder dataModelBuilder,
                      DataModel dataModel,
                      IDRescorer rescorer,
                      int at,
                      double relevanceThreshold,
                      double evaluationPercentage) throws TasteException;
1) Can you think of an example in which the following two parameters must be used:
- DataModelBuilder dataModelBuilder
- IDRescorer rescorer
2) For the double relevanceThreshold variable, I set the value GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD; however, I was wondering whether a "better" model could be built by setting a different value.
3) In my project, I need to recommend at most 10 items per user. Does this mean that it doesn't make sense to set a value bigger than 10 for the variable int at?
4) Given that I don't mind waiting a long time to build the model, is it good practice to set the variable double evaluationPercentage equal to 1? Can you think of any case where 1 will not give the optimum model?
5) Why does precision/recall (note that I am working with boolean data) increase as the number of recommendations (i.e. the variable int at) increases? (I found this experimentally.)
6) Where does the splitting into training and test sets take place within Mahout, and how could I change that percentage (unless this is not the case for item-based recommendations)?
Accurate recommendations alone do not guarantee users of recommender systems an effective and satisfying experience, so measurements should be taken only as a reference point. That said, ideally real users would use your system against a baseline you set (like random recommendations) in an A/B test to see which performs better. But that can be troublesome and not quite practical.
Precision and recall at N recommendations are not great metrics for recommenders. You are better off using a metric like AUC (area under the curve).
Have a look at the Mahout in Action book example (link).
Letting Mahout choose a threshold is fine, but it will be more computationally expensive.
Yes, if you are making 10 recommendations, evaluating at 10 makes a lot of sense.
Depends on the size of your data, really. If using 100% (that is, 1.0) is fast enough, I would use that. But if you do use something different (less), I would strongly suggest you use RandomUtils.useTestSeed(); when testing, so you know the sampling will be done in the same manner every time you evaluate. (Don't use it in production, though.)
Not sure; it depends on what your data looks like. But normally if precision increases, recall decreases, and vice versa. See the F1 score, also available from Mahout's IRStatistics (a small illustration follows after these points).
For IRStatistics I'm not entirely sure where it happens (or if it happens at all). Notice it doesn't even take a percentage for the split into training and test sets, although there might be a default somewhere. If I were you, I would go through the Mahout code and find out.
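Regarding the precision/recall trade-off mentioned above: the F1 score is just their harmonic mean. A tiny illustration in Python (the numbers are made up):

def f1_score(precision, recall):
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up values: as 'at' grows, precision often drops while recall rises;
# F1 summarizes both in one number.
print(f1_score(0.30, 0.10))   # 0.15
print(f1_score(0.20, 0.40))   # ~0.267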
