Select best training results from all history using ray - machine-learning

I'm trying to find optimal hyperparams with ray:
tuner = tune.Tuner(
run_config=RunConfig(stop={"score": 290},
Sometimes my model overfits and its results become worse over time, i.e., I get something like 100, 200, 220, 140, 90, 80. Ray shows me the current "best result" in its status, but it selects the best value only from the last iterations (i.e., the best value for the mentioned result is 80).
I'm sure that results with higher values are better, so it would be nice to select best results based on the whole history, not on the last value.
Is there a way to force it to use the whole training history for selecting the best result? Or should I interrupt training manually when I see that the model is no longer improving? Or is it already saving all the results, so that all I need to do is filter them after it finishes?
I've seen this Checkpoint best model for a trial in ray tune and have added CheckpointConfig to my code, but it seems like it doesn't help: I still see the last result

The entire history of reported metrics is being saved within each Result, and you can access it via result.metrics_dataframe. See this section of the user guide for an example of what you can access within each result.
It's possible to filter the entire history to show only the maximum accuracy (or another metric you define) using the ResultGrid output. The ResultGrid.get_dataframe(filter_metric=<your-metric>, filter_mode=<min/max>) API will return a DataFrame that keeps, for each trial, only the best reported result according to that metric. See the bottom of this section for an example of doing this.
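To make the idea of filtering a trial's whole history concrete, here is a minimal plain-Python sketch. It does not call Ray; the helper `best_per_trial` and the `history` dictionary are hypothetical stand-ins for the per-trial reports that Ray stores, and the scores for `trial_a` are the ones from the question.

```python
from typing import Dict, List


def best_per_trial(history: Dict[str, List[dict]], metric: str, mode: str = "max") -> Dict[str, dict]:
    """Return, for each trial, the reported result with the best `metric`."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    pick = max if mode == "max" else min
    return {
        trial: pick(reports, key=lambda r: r[metric])
        for trial, reports in history.items()
        if reports  # skip trials that never reported anything
    }


# Hypothetical history; trial_a uses the scores quoted in the question.
history = {
    "trial_a": [{"training_iteration": i + 1, "score": s}
                for i, s in enumerate([100, 200, 220, 140, 90, 80])],
    "trial_b": [{"training_iteration": i + 1, "score": s}
                for i, s in enumerate([50, 120, 110])],
}

best = best_per_trial(history, metric="score", mode="max")
print(best["trial_a"])  # the iteration-3 report (score 220), not the final 80
```

With Ray itself, the same selection is what the filter_metric and filter_mode arguments mentioned above are meant to do.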


How to avoid data averaging when logging to metric across multiple runs?

I'm trying to log data points for the same metric across multiple runs (wandb.init is called repeatedly in between each data point) and I'm unsure how to avoid the behavior seen in the attached screenshot...
Instead of getting a line chart with multiple points, I'm getting a single data point with associated statistics. In the attached example, the 1st data point was generated at step 1,470 and the 2nd at step 2,940. Rather than seeing two points, I'm getting a single point that's the average and appears at step 2,205.
My hunch is that using the resume run feature may address my problem, but even testing out this hunch is proving to be cumbersome given the constraints of the system I'm working with...
Before I invest more time in my hypothesized solution, could someone confirm that the behavior I'm seeing is, indeed, the result of logging data to the same metric across separate runs without using the resume feature?
If this is the case, can you confirm or deny my conception of how to use resume?
Initial run:
1. run = wandb.init()
2. wandb_id =
3. cache wandb_id for successive runs
Successive run:
1. retrieve wandb_id from cache
2. wandb.init(id=wandb_id, resume="must")
Is it also acceptable / preferable to replace 1. and 2. of the initial run with:
wandb_id = wandb.util.generate_id()
It looks like you’re grouping runs so that could be why it’s appearing as averaging across step - this might not be the case but it’s worth trying. Turn off grouping by clicking the button in the centre above your runs table on the left - it’s highlighted in purple in the image below.
Both of the ways you’re suggesting resuming runs seem fine.
My hunch is that using the resume run feature may address my problem,
Indeed, providing a cached id in combination with resume="must" fixed the issue.
Corresponding snippet:
import wandb
# wandb run associated with evaluation after first N epochs of training.
wandb_id = wandb.util.generate_id()
wandb.init(id=wandb_id, project="alrichards", name="test-run-3/job-1", group="test-run-3")
wandb.log({"mean_evaluate_loss_epoch": 20}, step=1)
# wandb run associated with evaluation after second N epochs of training.
wandb.init(id=wandb_id, resume="must", project="alrichards", name="test-run-3/job-2", group="test-run-3")
wandb.log({"mean_evaluate_loss_epoch": 10}, step=5)
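When the two jobs run as separate processes, the id has to be cached somewhere between them. The following is a rough sketch of that caching step only, using a JSON file. The helper `load_or_create_run_id`, the file name, and the use of `uuid` as an id generator are all assumptions made for illustration; no wandb call is executed here.

```python
import json
import tempfile
import uuid
from pathlib import Path


def load_or_create_run_id(cache_path):
    """Return (run_id, is_new); the id is generated once and reused afterwards."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())["run_id"], False
    # Stand-in generator; wandb.util.generate_id() from the snippet above could be used here.
    run_id = uuid.uuid4().hex[:8]
    cache_path.write_text(json.dumps({"run_id": run_id}))
    return run_id, True


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "wandb_run_id.json"  # hypothetical cache location
    first_id, first_is_new = load_or_create_run_id(cache)    # initial job
    second_id, second_is_new = load_or_create_run_id(cache)  # successive job
    print(first_id == second_id, first_is_new, second_is_new)
```

The first job would then pass the id to wandb.init(id=...) and later jobs would add resume="must", as in the snippet above.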

Build vocab in doc2vec

I have a list of approximately 500 abstracts and articles in a CSV file, and each paragraph contains approximately 800 to 1000 words. Whenever I build the vocab and print it, I get None. Also, how can I improve the results?
lst_doc = doc.translate(str.maketrans('', '', string.punctuation))
target_data = word_tokenize(lst_doc)
train_data = list(read_data())
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
train_vocab = model.build_vocab(train_data)
train = model.train(train_data, total_examples=model.corpus_count,
                    epochs=model.epochs)
A call to build_vocab() only builds the vocabulary inside the model, for further usage. That function call doesn't return anything, so your train_vocab variable will be Python None.
So, the behavior you're seeing is as expected, and you should say more about what your ultimate aims are, and what you'd want to see as steps towards those aims, if you're stuck.
If you want to see reporting of the progress of your calls to build_vocab() or train(), you can set the logging level to INFO. This is usually a good idea when working to learn a new library: even if initially the copious info shown is hard to understand, by reviewing it you'll start to see the various internal steps, and internal counts/timings/etc, that hint whether things are doing well or poorly.
You can also examine the state of the model and its various internal properties after the code has run.
For example, the model.wv property contains, after build_vocab(), a Gensim KeyedVectors structure holding all the untrained ready-for-training vectors. You can ask for its length (len(model.wv)) or examine the discovered active list of words (model.wv.index_to_key).
Other comments:
It's not clear that your first two lines (assigning into lst_doc and target_data) affect anything further, since it's unclear what read_data() might be doing to fill train_data.
Often low min_count values worsen results, by including more words that have so few usage examples that they're little more than noise during training.
Only 500 documents is rather small compared to most published work showing impressive results with this algorithm, which typically uses tens of thousands of documents (if not millions). So, keep in mind that results on such a small dataset may be unrepresentative of what's possible with a larger corpus, in terms of quality, optimal parameters, etc.
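To illustrate the min_count remark above, here is a small sketch of how a frequency threshold prunes a vocabulary. It uses only the standard library rather than Gensim; the helper `surviving_vocab` and the toy documents are made up for illustration, and the rule assumed is that words with a total count below min_count are dropped.

```python
from collections import Counter


def surviving_vocab(tokenized_docs, min_count):
    """Return the set of words whose total frequency is at least `min_count`."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {word for word, count in counts.items() if count >= min_count}


# Toy corpus: three already-tokenized documents, one containing a misspelling.
docs = [
    ["neural", "network", "training"],
    ["neural", "network", "pruning"],
    ["network", "latency", "typo_wrod"],
]

print(sorted(surviving_vocab(docs, min_count=1)))  # every word, including the one-off typo
print(sorted(surviving_vocab(docs, min_count=2)))  # ['network', 'neural']
```

On a real corpus, checking len(model.wv) for a few min_count values gives the same kind of picture.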

How to scale % change based features so that they are viewed "similarly" by the model

I have some features that are zero-centered values and are supposed to represent the change between a current value and a previous value. Generally speaking, I believe there should be some symmetry between these values, i.e., there should be roughly the same number of positive values as negative values, and these values should operate on roughly the same scale.
When I try to scale my samples using MaxAbsScaler, I notice that my negative values for this feature get almost completely drowned out by the positive values. And I don't really have any reason to believe my positive values should be that much larger than my negative values.
So what I've noticed is that, fundamentally, the magnitudes of percentage change values are not symmetrical in scale. For example, if I have a value that goes from 50 to 200, that would result in a 300.0% change. If I have a value that goes from 200 to 50, that would result in a -75.0% change. I get that there is a reason for this, but in terms of my feature, I don't see a reason why a change of 50 to 200 should be 3x+ more "important" than the same change in value in the opposite direction.
Given this information, I do not believe there would be any reason to want my model to treat a change of 200 to 50 as a "lesser" change than a change of 50 to 200. Since I am trying to represent the change of a value over time, I want to abstract this pattern so that my model can "visualize" the change of a value over time the same way a person would.
Right now I am solving this by using this formula:
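def symmetric_change(prev, curr):
    # Mirror the percent change so a rise and the reverse fall have equal magnitude.
    if curr > prev:
        return curr / prev - 1
    return (prev / curr - 1) * -1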
And this does seem to treat changes in value similarly regardless of the direction, i.e., from the example above, 50>200 = 300 and 200>50 = -300. Is there a reason why I shouldn't be doing this? Does this accomplish my goal? Has anyone run into similar dilemmas?
This is a discussion question and it's difficult to know the right answer to it without knowing the physical relevance of your feature. You are calculating a percentage change, and a percent change is dependent on the original value. I am not a big fan of a custom formula only to make percent change symmetric since it adds a layer of complexity when it is unnecessary in my opinion.
If you want change to be symmetric, you can try direct difference or factor change. There's nothing to suggest that difference or factor change are less correct than percent change. So, depending on the physical relevance of your feature, each of the following symmetric measures would be correct ways to measure change -
Difference change -> 50 to 200 yields 150, 200 to 50 yields -150
Factor change with logarithm -> 50 to 200 yields log(4), 200 to 50 yields log(1/4) = -log(4)
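As a quick comparison, the measures discussed above can be computed side by side for the 50 and 200 example. This is only a sketch; the function names are chosen here for illustration, and the mirrored variant is the formula from the question.

```python
import math


def percent_change(prev, curr):
    """Ordinary percent change as a fraction; not symmetric in direction."""
    return (curr - prev) / prev


def mirrored_percent_change(prev, curr):
    """The question's formula: always divide by the smaller of the two values."""
    if curr > prev:
        return curr / prev - 1
    return -(prev / curr - 1)


def difference_change(prev, curr):
    """Direct difference; symmetric, but depends on the absolute scale."""
    return curr - prev


def log_factor_change(prev, curr):
    """Log of the factor change; symmetric and scale-free for positive values."""
    return math.log(curr / prev)


for prev, curr in [(50, 200), (200, 50)]:
    print(prev, curr,
          percent_change(prev, curr),
          mirrored_percent_change(prev, curr),
          difference_change(prev, curr),
          round(log_factor_change(prev, curr), 4))
```

Only the plain percent change gives different magnitudes for the two directions (3.0 versus -0.75). Note that the log version requires strictly positive values, and the mirrored version fails when the smaller value is zero.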
You're having trouble because you haven't brought the abstract questions into your paradigm.
"... my model can "visualize" ... same way a person would."
In this paradigm, you need a metric for "same way". There is no such empirical standard. You've dropped both of the simple standards -- relative error and absolute error -- and you posit some inherently "normal" standard that doesn't exist.
Yes, we run into these dilemmas: choosing a success metric. You've chosen a classic example from "How To Lie With Statistics"; depending on the choice of starting and finishing proportions and the error metric, you can "prove" all sorts of things.
This brings us to your central question:
Does this accomplish my goal?
We don't know. First of all, you haven't given us your actual goal. Rather, you've given us an indefinite description and a single example of two data points. Second, you're asking the wrong entity. Make your changes, run the model on your data set, and examine the properties of the resulting predictions. Do those properties satisfy your desired end result?
For instance, given your posted data points, (200, 50) and (50, 200), how would other examples fit in, such as (1, 4), (1000, 10), etc.? If you're simply training on the proportion of change over the full range of values involved in that transaction, your proposal is just what you need: use the higher value as the basis. Since you didn't post any representative data, we have no idea what sort of distribution you have.

How does Machine Learning algorithm retain learning from previous execution?

I am reading the Hands-On Machine Learning book. The author talks about the random seed used during the train/test split, and at one point says that, over time, the machine will see your whole dataset.
The author uses the following function to create the train/test split:
import numpy as np

def split_train_test(data, test_ratio):
    # Shuffle row positions, then take the first test_ratio share as the test set.
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
Usage of the function like this:
>>> train_set, test_set = split_train_test(housing, 0.2)
>>> len(train_set)
>>> len(test_set)
Well, this works, but it is not perfect: if you run the program again, it will generate a different test set! Over time, you (or your Machine Learning algorithms) will get to see the whole dataset, which is what you want to avoid.
Sachin Rastogi: Why and how will this impact my model's performance? I understand that my model's accuracy will vary on each run, as the train set will always be different. How will my model see the whole dataset over time?
The author is also providing a few solutions,
One solution is to save the test set on the first run and then load it in subsequent runs. Another option is to set the random number generator’s seed (e.g., np.random.seed(42)) before calling np.random.permutation(), so that it always generates the same shuffled indices.
But both these solutions will break next time you fetch an updated dataset. A common solution is to use each instance’s identifier to decide whether or not it should go in the test set (assuming instances have a unique and immutable identifier).
Sachin Rastogi: Will it be a good train/test division? I think not: the train and test sets should contain elements from across the dataset to avoid any bias from the train set.
The author is giving an example,
You could compute a hash of each instance’s identifier and put that instance in the test set if the hash is lower or equal to 20% of the maximum hash value. This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset.
The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.
Sachin Rastogi: I am not able to understand this solution. Could you please help?
For me, these are the answers:
The point here is that you should set aside part of your data (which will constitute your test set) before training the model. What you want to achieve is to generalize well on unseen examples. By running the code that you have shown, you'll get different test sets over time; in other words, you'll always train your model on different subsets of your data (and possibly on data that you've previously marked as test data). This in turn will affect training and, taken to the limit, there will be nothing left to generalize to.
This will be indeed a solution satisfying the previous requirement (of having a stable test set) provided that new data are not added.
As said in the comments to your question, by hashing each instance's identifier you can be sure that old instances always get assigned to the same subsets.
Instances that were put in the training set before the update of the dataset will remain there (as their hash value won't change - and so their left-most bit - and it will remain higher than 0.2*max_hash_value);
Instances that were put in the test set before the update of the dataset will remain there (as their hash value won't change and it will remain lower than 0.2*max_hash_value).
The updated test set will contain 20% of the new instances and all of the instances associated with the old test set, letting it remain stable.
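The stability argument above can be checked with a small sketch. It assumes integer identifiers and uses CRC32 from the standard library as the hash; the helper names `in_test_set` and `split_by_id` are made up here, and the comparison uses "strictly lower than 20% of the maximum hash value" as a simplification.

```python
import zlib


def in_test_set(identifier, test_ratio=0.2):
    """Decide test membership from a hash of the identifier alone."""
    h = zlib.crc32(str(identifier).encode("utf-8")) & 0xFFFFFFFF
    return h < test_ratio * 2**32


def split_by_id(ids, test_ratio=0.2):
    """Split identifiers into (train, test) lists using the hash rule."""
    test = [i for i in ids if in_test_set(i, test_ratio)]
    train = [i for i in ids if not in_test_set(i, test_ratio)]
    return train, test


old_ids = list(range(1000))
new_ids = list(range(1500))  # refreshed dataset: the old instances plus 500 new ones

old_train, old_test = split_by_id(old_ids)
new_train, new_test = split_by_id(new_ids)

# Old test instances stay in the test set, and no old training instance leaks into it.
print(set(old_test) <= set(new_test), set(old_train).isdisjoint(new_test))
```

Because membership depends only on the identifier, rerunning the program or appending rows never moves an existing instance between the two sets. The test share is only approximately 20%, since it depends on how the hash values happen to fall.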
I would also suggest to see here for an explanation from the author:

Why does ALS.trainImplicit give better predictions for explicit ratings?

Edit: I tried a standalone Spark application (instead of PredictionIO) and my observations are the same. So this is not a PredictionIO issue, but still confusing.
I am using PredictionIO 0.9.6 and the Recommendation template for collaborative filtering. The ratings in my data set are numbers between 1 and 10. When I first trained a model with defaults from the template (using ALS.train), the predictions were horrible, at least subjectively. Scores ranged up to 60.0 or so but the recommendations seemed totally random.
Somebody suggested that ALS.trainImplicit did a better job, so I changed src/main/scala/ALSAlgorithm.scala accordingly:
val m = ALS.trainImplicit( // instead of ALS.train
  ratings = mllibRatings,
  rank = ap.rank,
  iterations = ap.numIterations,
  lambda = ap.lambda,
  blocks = -1,
  alpha = 1.0, // also added this line
  seed = seed)
Scores are much lower now (below 1.0) but the recommendations are in line with the personal ratings. Much better, but also confusing. PredictionIO defines the difference between explicit and implicit this way:
explicit preference (also referred as "explicit feedback"), such as
"rating" given to item by users. implicit preference (also referred
as "implicit feedback"), such as "view" and "buy" history.
By default, the recommendation template uses ALS.train() which expects explicit rating values which the user has rated the item.
Is the documentation wrong? I still think that explicit feedback fits my use case. Maybe I need to adapt the template with ALS.train in order to get useful recommendations? Or did I just misunderstand something?
A lot of it depends on how you gathered the data. Often ratings that seem explicit can actually be implicit. For instance, assume you give users the option of rating items that they have purchased or used before. Then the very fact that they spent their time evaluating a particular item suggests that the item is of high quality, while items of poor quality are not rated at all because people do not even bother to use them. As such, even though the dataset is intended to be explicit, you may get better results if you treat the ratings as implicit. Again, this varies significantly based on how the data is obtained.
The explicit data (like ratings) normally comes with bias - people go and rate a product because they like it! Think about your experience shopping and then rating on :-)
On the contrary, implicit info can often truly reflect a user's preference for a product, like viewing duration, comment length, etc. Even a like/dislike is better than a rating because it provides a very simple 'bad' option without bothering a user to think "if I should give a 3, 3.5, or 4?".
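One way to see why the scores drop below 1.0 with the implicit variant: in the usual implicit-feedback ALS formulation, the raw value is not the prediction target. It is split into a binary preference (the target) and a confidence weight. The sketch below illustrates that mapping in plain Python, assuming the common form confidence = 1 + alpha * |rating|; the function name is hypothetical and this is not Spark code.

```python
def implicit_targets(rating, alpha=1.0):
    """Map a raw rating to (preference, confidence) under the assumed formulation."""
    preference = 1.0 if rating > 0 else 0.0   # what the model tries to predict
    confidence = 1.0 + alpha * abs(rating)    # how strongly that observation is weighted
    return preference, confidence


for r in (0, 1, 5, 10):
    print(r, implicit_targets(r))
```

Under this assumption, a rating of 10 and a rating of 1 both mean "the user interacted with the item"; the higher rating only makes the model more confident about it. Predictions are therefore preference-like scores near the 0 to 1 range rather than values on the 1 to 10 rating scale.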
