Python DEAP - Getting the Pareto Front for every generation

I am using DEAP to run a multi-objective optimization with eaSimple. The code returns the ParetoFront() after the last generation.
Is there any way to get a set of ParetoFront() for each generation? I would like to see the evolution of the fronts with every generation.

Simply run one generation at a time. Each time, run the algorithm on the population that was output by the previous run.
Something like
ngen = 50
pop = toolbox.population(n=100)
for i in range(ngen):
    pop, logbook = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=1)
You just need to add whatever you are doing with the Pareto front to the above code.
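For example, a minimal sketch (assuming the fitness, individual, and toolbox registrations from your existing setup are already in place) that keeps one ParetoFront per generation:
from deap import algorithms, tools

ngen = 50
pop = toolbox.population(n=100)
fronts = []  # one ParetoFront per generation
for gen in range(ngen):
    # advance the evolution by a single generation
    pop, logbook = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2,
                                       ngen=1, verbose=False)
    front = tools.ParetoFront()
    front.update(pop)  # keep only the non-dominated individuals of this population
    fronts.append(front)
# fronts[g] holds the Pareto front as of generation g + 1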

How to avoid data averaging when logging to a metric across multiple runs?

I'm trying to log data points for the same metric across multiple runs (wandb.init is called repeatedly between each data point) and I'm unsure how to avoid the behavior seen in the attached screenshot...
Instead of getting a line chart with multiple points, I'm getting a single data point with associated statistics. In the attached example, the 1st data point was generated at step 1,470 and the 2nd at step 2,940... rather than seeing two points, I'm instead getting a single point that is their average and appears at step 2,205.
My hunch is that using the resume run feature may address my problem, but even testing out this hunch is proving to be cumbersome given the constraints of the system I'm working with...
Before I invest more time in my hypothesized solution, could someone confirm that the behavior I'm seeing is, indeed, the result of logging data to the same metric across separate runs without using the resume feature?
If this is the case, can you confirm or deny my conception of how to use resume?
Initial run:
1. run = wandb.init()
2. wandb_id = run.id
3. Cache wandb_id for successive runs.
Successive run:
1. Retrieve wandb_id from the cache.
2. wandb.init(id=wandb_id, resume="must")
Is it also acceptable / preferable to replace 1. and 2. of the initial run with:
wandb_id = wandb.util.generate_id()
wandb.init(id=wandb_id)
It looks like you're grouping runs, so that could be why the values appear averaged across steps - this might not be the case, but it's worth trying. Turn off grouping by clicking the button in the centre above your runs table on the left - it's highlighted in purple in the image below.
Both of the ways you’re suggesting resuming runs seem fine.
My hunch is that using the resume run feature may address my problem,
Indeed, providing a cached id in combination with resume="must" fixed the issue.
Corresponding snippet:
import wandb
# wandb run associated with evaluation after first N epochs of training.
wandb_id = wandb.util.generate_id()
wandb.init(id=wandb_id, project="alrichards", name="test-run-3/job-1", group="test-run-3")
wandb.log({"mean_evaluate_loss_epoch": 20}, step=1)
wandb.finish()
# wandb run associated with evaluation after second N epochs of training.
wandb.init(id=wandb_id, resume="must", project="alrichards", name="test-run-3/job-2", group="test-run-3")
wandb.log({"mean_evaluate_loss_epoch": 10}, step=5)
wandb.finish()

Dask - Understanding diagnostics - memory:list

I am working on a fairly complex application that makes use of the Dask framework, and I am trying to increase its performance. To that end I am looking at the diagnostics dashboard. I have two use cases: in the first I have a 1 GB Parquet file split into 50 parts, and in the second I have the first part of the above file, split over 5 parts, which is what was used for the following charts:
The red node is called "memory:list" and I do not understand what it is.
When running the bigger input this seems to block the whole operation.
Finally this is what I see when I go inside those nodes:
I am not sure where I should start looking to understand what is generating this memory:list node, especially given that there is no stack button inside the task, as there often is. Any suggestions?
Red nodes are in memory. So this computation has occurred, and the result is sitting in memory on some machine.
It looks like the type of the piece of data is a Python list object. Also, the name of the task is list-159..., so probably this is the result of calling the list Python function.
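As a rough illustration (hypothetical code, not from the question), wrapping intermediate results with the built-in list through dask.delayed creates a task keyed list-<hash>; once it is computed and kept with persist, it appears as a red, in-memory node in the dashboard's task graph:
from dask import delayed
from dask.distributed import Client

client = Client()  # start a local cluster so the diagnostics dashboard is available

@delayed
def load_part(i):
    # stand-in for loading one partition of the Parquet file
    return list(range(i * 10, (i + 1) * 10))

parts = [load_part(i) for i in range(5)]
combined = delayed(list)(parts)  # creates a task keyed "list-<hash>"
combined = combined.persist()    # the resulting Python list stays in worker memory,
                                 # which is what shows up as a red node in the graph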

Question about SPSS Modeler (there is an obstacle to making the stream run automatically)

I have an SPSS Modeler stream that is currently used and updated every week to generate a certain dataset. The raw data for this stream is also renewed on a weekly basis.
In part of this stream there is a chunk of nodes that has to be modified and updated manually every week, and the sequence of this part is: Type Node => Restructure Node => Aggregate Node
To simplify the explanation of those nodes' role, I drew an image of them as below.
Because the original raw data changes on a weekly basis, the range of the Unit value above always varies, sometimes more than 6 (maybe 100), other times less than 6 (maybe 3). That is why, until now, somebody has had to modify and update that chunk of nodes every week. *The Unit value has a certain limitation (300 for now).
However, we are now aiming to run this stream automatically, without any human intervention, so that part needs to be customized to work perfectly and automatically. Please help; your efforts will be appreciated, thanks!
To automate this, I suggest trying global nodes combined with CLEM scripts inside the execution (default script). I have a stream that calculates the first and last dates, and those variables are used to rename files at the end of the execution. I think you could use something similar, as explained here:
1) Create derive nodes to bring in the unit values used in the weekly stream.
2) Save this information in a table named 'count_variable'.
3) Use a Global node named Global with a query similar to this:
#GLOBAL_MAX(variable created in (2)) (only to record the number of variables; step 2 created a table with only one value, so GLOBAL_MAX will simply return the number of variables).
4) The query inside the execution tab will be similar to this:
execute count_variable
var tabledata
var fn
set tabledata = count_variable.output
set count_variable = value tabledata at 1 1
execute Global
5) You can now use the information about the variables simply by using the already created 'count_variable'.
It's not easy to explain just by typing, but I hope this has been helpful.
Please mark this answer as +1 if it was relevant.
I think there is a better, simpler, and more effective (yet risky, due to the node's requirements on the input data) solution to your problem. It is called the Transpose node and it does exactly that - it pivots your table. But only from version 18.1 on. Here's an example:
https://developer.ibm.com/answers/questions/389161/how-does-new-feature-partial-transpose-work-in-sps/

XCode 6 Playground Measuring Code Performance

Is there any quick way of evaluating the performance / runtime of a certain code part written in the new XCode 6 playground?
I want to start learning Swift by comparing different coding styles for certain solutions and their impact on the code performance.
We strongly discourage using playgrounds to measure performance, at least using time as your measure of performance. By far the majority of the time taken during a playground is the logging of results to display in the sidebar; the actual time your code takes doesn't contribute as much. So the runtime of your code in a playground will mostly depend on how many lines of code are run / results are logged.
If you want to do performance measurements, check out the XCTest framework. You can create a test bundle for your swift code.
One thing you can measure in a playground is the number of times your lines of code are run. So if, for example, you're trying to measure the algorithmic complexity of some code, you could do that based on how many times it needs to run lines of code to e.g. complete a sort, or whatever it is you're trying to do. Lines of code that are run more than once display the number of times they are run in the results sidebar.
I built this little tool that allows you to have performance testing in your Playground.
I'll continue to update and enhance it, but for now, it'll give you the basic ability to measure how long a function takes to run.
https://github.com/sebastienpeek/swift-performance
I have found one (maybe not so elegant) solution:
var start = TickCount()
var implicitInteger = 0
for (var i = 1; i < 500; i++) {
    implicitInteger += i;
}
var end = TickCount()
var dur = end - start
The variable 'dur' gives you the ticks your code needed to execute.

Tesseract appears to be learning characters as you perform more OCRs; how do I save the learning data between uses?

I have a particular set of 10 images to perform OCR on. They are all digits, fairly short, about 20 digits in each image. There is one particular image: if I run it first, it has some mismatches; however, if I run the other tests first and then come back to it, all characters match.
I am inclined to conclude that Tesseract is learning the characters as more OCR operations are performed, which makes me very happy. Now the question is: is it possible for me to save the learning data, so Tesseract would know to pick it up the next time I use it?
You can set classify_save_adapted_templates to 1 in your Tesseract config file to save the adapted templates, and set classify_use_pre_adapted_templates to 1 to load the templates the next time you run Tesseract.
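For example (the file name and invocation below are just an illustration), both variables could go into a small config file:
classify_save_adapted_templates 1
classify_use_pre_adapted_templates 1
You would then pass that config file (by name, if it is placed in your tessdata configs directory) after the output base when invoking Tesseract, e.g. tesseract digits.png output adapted, so the templates adapted in one run are written out and picked up again in the next.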
The code that specifies the behavior of these options is here:
http://code.google.com/p/tesseract-ocr/source/browse/trunk/classify/classify.cpp?r=570
