Possible to assign dask_key_name to dask dataframe tasks?

In the course of debugging issues, I've found it hard to decipher exactly which tasks are causing problems. I've successfully used the dask_key_name kwarg in delayed tasks to assign a human-readable name to their keys (based on the documentation here: https://docs.dask.org/en/latest/delayed-api.html). I tried the following in the hope that it would do the same for the read_parquet tasks, but it appears Dask still uses a hashed value to create the key (e.g., ('read-parquet-ed9e6c4c474e851e176e7eafb8753490', 5)):

    item = 'custom_string'
    self.all_pfs_dict['read'][item] = dd.read_parquet(
        item_to_read, index=False, gather_statistics=False,
        dask_key_name=item + '-read')
Am I doing something wrong or is there an alternative way to name dask dataframe tasks?
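For reference, the delayed naming that does work looks like this (a minimal sketch; the function and key name are illustrative):

    import dask

    @dask.delayed
    def load(path):
        return path

    # dask_key_name replaces the hashed key for this one delayed task
    task = load('data.parquet', dask_key_name='custom_string-read')
    print(task.key)  # 'custom_string-read'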

There is no way to rename dataframe tasks like this today.

I had a similar question previously; Dask does not seem to support this for anything except the from_pandas() method.
from_pandas() has a name parameter for setting the key name, but other readers like read_parquet() don't.
So if you want this, you would need to change the Dask source code itself.
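For reference, a minimal sketch of the from_pandas() naming that does exist (the DataFrame and key name here are illustrative):

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({'x': range(10)})

    # `name` becomes the key prefix, so keys look like
    # ('custom-read', 0), ('custom-read', 1), ...
    ddf = dd.from_pandas(pdf, npartitions=2, name='custom-read')
    print(ddf.__dask_keys__())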

Related

Same name among FreeRTOS tasks

I'm a FreeRTOS newbie, and I don't think this is well documented. From xTaskCreate():
pcName A descriptive name for the task. This is mainly used to facilitate debugging, but can also be used to obtain a task handle.
The maximum length of a task’s name is set using the configMAX_TASK_NAME_LEN parameter in FreeRTOSConfig.h.
Must a task be associated with a name?
What happens if pcName is NULL?
What happens if I create multiple tasks with the same name?
Should I maintain the mapping between a task's handle and its name, or does FreeRTOS maintain this relation?
In short, the official documentation is not clear about the relationship between a task's handle and its name.
No, a task does not need a name.
If pcName is NULL, the task simply does not have a name. Nothing happens.
You will get several tasks with the same name, so you cannot identify them by name. The behaviour of xTaskGetHandle() is then undefined (as per the documentation).
FreeRTOS handles this.
The documentation also states that this function takes a long time to complete and should be used sparingly. Personally, I don't think there is any reason to use it at all; just use task names for debugging purposes only (useful in a debugger or when using Percepio Tracealyzer).

NiftyNet 'evaluation' action output is incorrect

I'm trying to use the new 'evaluation' action after inference to generate some metrics for my output. However, the .csv files just show scores of '0' for average_distance and '1' for Jaccard and Dice for each of my data volumes. I can't find any documentation for the evaluation action, so I'm not sure what I'm doing wrong. Also, the --dataset_to_infer=Validation option doesn't seem to work; both inference and evaluation are applied to all the data rather than just the validation set.
Thanks!
For the evaluation issue, we're working on the documentation. The dataset_to_infer option is only tested for the applications in NiftyNet/niftynet/application; applications from the model zoo have not been upgraded to support it yet (please file an issue with more details at https://github.com/NifTK/NiftyNet/issues if you believe it's a bug).
For the time being, pointing directly to the inference result in the config.ini worked for me.
e.g.

    [inferred]
    csv_file = model_dir/save_seg_dir/inferred.csv
I believe this file is currently not being found, and evaluation then defaults to comparing labels to labels. See the issue on GitHub.

Cache XMLProvider generated model(s)

Using XmlProvider from the FSharp.Data package like this:

    type internal MyProvider = XmlProvider<Sample = @"C:\test.xml">
The test.xml file contains 151,838 lines in total, which make up 15 types.
Working in the same project as the MyProvider type declaration is a pain, as the XmlProvider seems to be triggered every time I hit CTRL+SPACE (Edit.CompleteWord) and therefore regenerates all the models, which can take up to 10 seconds.
Is there any known workaround, or a setting to cache the models generated by XmlProvider?
I'm afraid F# Data does not currently have any caching mechanism for the inferred schema. It sounds like something that should not be too hard to add - if anyone is interested in contributing, please open an issue on GitHub to start the discussion!
My recommendation for the time being would be to try to simplify the sample XML, so that it is shorter and contains just a few representative records of all the different kinds.

How do I create a single script file for when I do and don't want to collect TensorBoard statistics?

I want a single script that either collects TensorBoard data or not, depending on how I run it. I am aware that I can pass flags to tell the script how I want it to be run. I could even hard-code it in the script and just manually change the script.
Either solution has a bigger problem: I find myself having to write an if statement everywhere in my script to control whether the summary writer operations are run. For example, I would have to do something like:
    if tb_sys_arg == 'tensorboard':
        merged = tf.merge_all_summaries()

and then, depending on the value of tb_sys_arg, run the summaries or not, as in:

    if tb_sys_arg == 'tensorboard':
        merged = tf.merge_all_summaries()
    else:
        train_writer = tf.train.SummaryWriter(tensorboard_data_dump_train, sess.graph)
This seems really silly to me, and I'd rather not have to do it. Is this the right way to do this? I just don't want to collect statistics every time I run my main script, but I don't want to have two separate scripts either.
As an anecdote, a few months ago I started using TensorBoard, and it seems I have been running my main file as follows:

    python main.py --logdir=/tmp/mdl_logs

so that it collects TensorBoard data. But I realized that I don't think I need that last flag to collect TensorBoard data. It's been so long that I now forget whether I actually need it. I've been reading the documentation and tutorials, and it seems I don't need that last flag (it's only needed to run the web app, as in tensorboard --logdir=path/to/log-directory, right?). Have I been doing this wrong all this time?
You can launch the Supervisor without the "summary" service so that it won't run the summary nodes; see the "Launching fewer services" section of the Supervisor docs: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/api_docs/python/functions_and_classes/shard6/tf.train.Supervisor.md#launching-fewer-services
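A minimal sketch of that approach against the old (pre-tf.estimator) API; passing summary_op=None is what disables the summary service, and the graph and logdir here are illustrative:

    import tensorflow as tf

    # A trivial graph so the sketch is self-contained: one variable
    # and an op that increments it as the "training" step.
    x = tf.Variable(0.0)
    train_op = x.assign_add(1.0)

    # summary_op=None tells the Supervisor not to launch the summary
    # service, so the summary nodes are never run.
    sv = tf.train.Supervisor(logdir='/tmp/mdl_logs', summary_op=None)

    with sv.managed_session() as sess:
        for _ in range(10):
            sess.run(train_op)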

Processing multiline events from a text file in Dataflow

I am attempting to build a Dataflow pipeline to process a text file that contains events spanning multiple lines. The Dataflow SDK TextIO class assumes each line is a new event.
My plan is to create a new TextReader and register it with the DataPipelineRunner. This new reader will know how to aggregate multiple lines into a single line.
I am pretty sure that this approach will work, but I am wondering whether it is the right way to do it or whether there is a simpler solution.
The text I am trying to parse is:
    ==============> len:45 pktype:4 mtype:2
    SYMBOL: USOCSTIA151632.00
    OPEN_INT: 212
    PR_OPEN_INTEREST: 212
    TIME_STAMP: 04/10/2015 06:30:17:420 val:1428661817
The result should be the last 4 lines concatenated together and the first line dropped.
Note that TextReader is an internal implementation-detail class, so subclassing it is highly discouraged and would be challenging to do properly.
The recommended way to define a new file-based format like yours is to subclass FileBasedSource using the user-defined source API.
In your case, I would recommend basing your class on the LineIO example from the documentation, and wrapping the LineReader defined there into your own class that uses LineReader as a helper for reading individual lines, but:
In startReading(), it would skip lines until the one starting with "====>".
In readNextRecord(), it would read lines until the next "====>" and bundle them into a single record.
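The SDK classes are Java, but the bundling logic itself is simple; here is a language-agnostic sketch in Python (the delimiter test is an assumption based on the sample above, not the Dataflow SDK API):

    def bundle_records(lines):
        # Group lines into records; a record starts after a delimiter
        # line such as '==============> len:45 pktype:4 mtype:2'.
        record, started = [], False
        for line in lines:
            if line.startswith('===='):          # delimiter line: drop it
                if started and record:
                    yield ' '.join(record)       # emit the previous record
                record, started = [], True
            elif started:
                record.append(line.strip())
        if started and record:
            yield ' '.join(record)

Applied to the sample above, this yields one record containing the SYMBOL, OPEN_INT, PR_OPEN_INTEREST, and TIME_STAMP lines, with the delimiter line dropped.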
Please make sure to read the documentation for FileBasedSource and FileBasedReader carefully: the parallelization mechanism relies on the consistency properties described there, which your format has to satisfy to ensure that records are neither duplicated nor omitted on the boundaries between adjacent processing shards. The XmlSource tests are a good example of how to unit-test these properties.
Please tell us how it goes and report back with any problems or questions - we are very interested in feedback on this API.
