Dask dashboard profile tab usage (a.k.a. Flame graph) - dask

How do you interpret this piece of tetris to study the usage of your application?

The display is a flame graph produced by statistical profiling of the workers. The x-axis represents the total CPU effort across the set of colourful blocks, and as you travel upwards you move deeper into the call stack. For example:
a block for Function A that takes up 50% of the horizontal space used up 50% of the CPU time across all the Dask threads in the cluster;
if, above that block, two blocks each take up 20% of the whole and the rest of Function A's width is not covered, then the time spent calling Function A consisted of the time to call those two functions, plus a bit of internal time within Function A itself.
You can get information about the function corresponding to each block by mousing over.
Note that some function call stacks can be very deep, e.g., pandas processing.
You can also select which submitted function you are looking at the profile for (top), or select from the overall CPU utilisation timeline (bottom).
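As a minimal sketch (assuming a dask.distributed setup): the dashboard carrying this Profile tab is served as soon as a Client is running, and the same statistical profile can also be pulled programmatically.

from dask.distributed import Client

client = Client()             # local cluster; the dashboard is served alongside it
print(client.dashboard_link)  # open this URL in a browser and pick the Profile tab
prof = client.profile()       # the aggregated profile data, for inspection in code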

Related

Merging a huge list of dataframes using dask delayed

I have a function which returns a dataframe to me. I am trying to use this function in parallel by using dask.
I append the delayed objects of the dataframes into a list. However, the run-time of my code is the same with and without dask.delayed.
I use the reduce function from functools along with pd.merge to merge my dataframes.
Any suggestions on how to improve the run-time?
The visualized graph and code are as below.
from functools import reduce
from dask import delayed  # needed for delayed(); missing from the original snippet

d = []
for lot in lots:
    lot_data = data[data["LOTID"] == lot]
    # build each per-lot transition-matrix dataframe lazily
    trmat = delayed(LOT)(lot, lot_data).transition_matrix(lot)
    d.append(trmat)

# merge all of the lazy dataframes; df is itself a Delayed object,
# so nothing runs until df.compute() is called
df = delayed(reduce)(lambda x, y: x.merge(y, how='outer', on=['from', 'to']), d)
Visualized graph of the operations
General rule: if your data comfortably fits into memory (including the base size times a small factor for possible intermediates), then there is a good chance that Pandas is fast and efficient for your use case.
Specifically for your case, there is a good chance that the tasks you are trying to parallelise do not release Python's internal lock, the GIL, in which case, although you have independent threads, only one can run at a time. The solution would be to use the "distributed" scheduler instead, which can have any mix of multiple threads and processes; however, using processes comes at a cost for moving data between the client and the processes, and you may find that the extra cost dominates any time saving. You would certainly want to ensure that you load the data within the workers rather than passing it from the client.
Short story: you should do some experimentation, measure well, and read the dataframe and distributed-scheduler documentation carefully.
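A minimal sketch of that suggestion, assuming the distributed scheduler with process-based workers and that each lot's data can be read inside the workers (read_lot_from_disk is a hypothetical loader; LOT and lots come from the question):

from functools import reduce
from dask import delayed
from dask.distributed import Client

client = Client(processes=True, n_workers=4)  # process workers sidestep the GIL

def load_and_transform(lot):
    # read this lot's data inside the worker instead of slicing a big
    # dataframe on the client and shipping it across
    lot_data = read_lot_from_disk(lot)        # placeholder loader
    return LOT(lot, lot_data).transition_matrix(lot)

parts = [delayed(load_and_transform)(lot) for lot in lots]
df = delayed(reduce)(lambda x, y: x.merge(y, how='outer', on=['from', 'to']), parts)
result = df.compute()

Whether this beats plain Pandas depends on how much time is spent inside transition_matrix versus moving data around, hence the advice to measure.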

Does autodiff in tensorflow work if I have a for loop involved in constructing my graph?

I have a situation where I have a batch of images and in each image I have to perform some operation over a tiny patch in that image. Now the problem is the patch size is variable in each image in the batch. So this implies that I cannot vectorize it. I could vectorize by considering the entire range of pixels in an image but my patch size per image is really a small fraction and I don't want to waste my memory here by performing the operation and storing the results for all the pixels in each image.
So in short, I need to use a loop. Now I see that TensorFlow has just a while loop defined and no for loops. So my question is: if I use a plain Python-style for loop to perform operations over my tensor, will autodiff fail to calculate gradients in my graph?
TensorFlow does not know (and thus does not care) how the graph was constructed; you could even write each node by hand, as long as you use the proper functions to do so. So in particular a Python for loop has nothing to do with TF. tf.while_loop, on the other hand, gives you the ability to express dynamic computation inside the graph, so if you want to process data in a sequence and only need the current element in memory, only a while loop can achieve that. If you create a huge graph by hand (through the Python loop), it will always be executed in full and everything will be stored in memory; as long as this fits on your machine you should be fine. Another thing is dynamic length: if sometimes you need to run a loop 10 times and sometimes 1000, you have to use tf.while_loop; you cannot do this with a for loop (unless you create separate graphs for each possible length).
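A small sketch of the difference, assuming TensorFlow 1.x graph mode (in TF 2.x you would use tf.GradientTape rather than tf.gradients):

import tensorflow as tf

x = tf.constant(2.0)

# Python for loop: the graph is unrolled at construction time, so autodiff
# simply sees three multiply nodes and differentiates through them.
y = x
for _ in range(3):
    y = y * x
grad_unrolled = tf.gradients(y, x)

# tf.while_loop: a single dynamic loop node; the trip count may depend on
# data, and gradients are still computed through it.
def cond(i, acc):
    return i < 3

def body(i, acc):
    return [i + 1, acc * x]

_, y_dyn = tf.while_loop(cond, body, [tf.constant(0), x])
grad_dyn = tf.gradients(y_dyn, x)

with tf.Session() as sess:
    print(sess.run([grad_unrolled, grad_dyn]))   # both give d(x**4)/dx = 32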

Wrapper for slow indicator for backtesting

If a technical indicator runs very slowly, and I wish to include it in an EA (using iCustom()), is there some "wrapper" that could cache the indicator results to a file based on the particular indicator inputs?
This way I could get a better speed next time when I backtest it using the same set of parameters, since the "wrapper" could read the result from file rather than recalculate the result from the indicator.
I heard that some developers did that for their needs in order to speed up backtesting, but as far as I know, there's no publicly available solution.
If I had to solve this problem, I would create a class with two fields (datetime and indicator value, or N buffers of the indicator), and a collection class similar to CArrayObj.mqh but with an option to apply binary search, or to start looking for an element from a specific index rather than from the very beginning of the array.
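Just to illustrate that layout (in Python rather than MQL4, purely as a sketch; a real implementation would be an MQL4 class built along the lines of CArrayObj.mqh): one cache file per parameter set, holding time-sorted (datetime, value) pairs that can be binary-searched instead of recalculated.

import bisect
import pickle

class IndicatorCache:
    # sketch only: one file per parameter set, rows sorted by timestamp
    def __init__(self, params):
        # hypothetical naming scheme: the indicator inputs identify the cache file
        self.path = "cache_" + "_".join(str(p) for p in params) + ".pkl"
        try:
            with open(self.path, "rb") as f:
                self.times, self.values = pickle.load(f)
        except FileNotFoundError:
            self.times, self.values = [], []

    def get(self, ts):
        i = bisect.bisect_left(self.times, ts)   # binary search by datetime
        if i < len(self.times) and self.times[i] == ts:
            return self.values[i]
        return None                              # not cached: compute, then add()

    def add(self, ts, value):
        i = bisect.bisect_left(self.times, ts)
        self.times.insert(i, ts)
        self.values.insert(i, value)

    def save(self):
        with open(self.path, "wb") as f:
            pickle.dump((self.times, self.values), f)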
Recent MT4 Builds added VERY restrictive conditions for Indicators
In the early years of MT4, this was not as harsh as it is these days.
FACT#1: fileIO is 10.000x ~ 100.000x slower than memIO:
This means, there is no benefit from "pre-caching" values to disk.
FACT#2: Processing Performance has HARD CEILING:
All, yes ALL, Custom Indicators that are being used in the MetaTrader4 Terminal (be it directly in the GUI, indirectly via Template(s), called via iCustom() calls, or in the Strategy Tester via .tpl + iCustom()) SHARE A SINGLE THREAD ...
FACT#3: Strategy Tester has the most demanding needs for speed:
Thus eliminate all, indeed ALL, non-core indicators from the tester.tpl template and save it as "blank", to avoid any part of such non-core processing.
Next, re-design the Custom Indicator, where possible, so as to avoid any CPU-ops & MEM-allocation(s) that are not necessary.
I remember a Custom Indicator design with indeed deep convolutions, which could be re-engineered so as to keep just a triangular sparse-matrix with the necessary updates; that increased the speed of the Indicator processing more than 10.000x, so code-revision is the way.
So rather run a separate MetaTrader4 Terminal, just for BackTesting, than having to wait for many hours due to the incompressible nature of numerical processing under traffic-jam congestion in the shared use of the CustomIndicator-solo-Thread, which no scheduling could improve.
FACT#4: O/S can increase a process priority:
Having got to the Devil's zone, it is a common practice to spin-up the PRIO for the StrategyTester MT4, up to the "RealTime PRIO" in the O/S tools.
One may even additionally "lock" this MT4 process onto certain CPU-core(s) and set up all other processes with adjacent CPU-core-AFFINITY, so that these two distinct groups of processes do not jump onto the other group's CPU-core(s). Hard, but if squeezing performance to the bleeding edge, this is a must.
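One way to script that priority/affinity setup (an assumption on my part: the answer refers to the O/S tools such as Task Manager, but the same can be done with, e.g., Python's psutil package on Windows; the process name and core numbers below are placeholders):

import psutil

# find the back-testing terminal by process name (the name is an assumption)
mt4 = next(p for p in psutil.process_iter(["name"])
           if p.info["name"] == "terminal.exe")

mt4.nice(psutil.HIGH_PRIORITY_CLASS)   # or REALTIME_PRIORITY_CLASS, with care
mt4.cpu_affinity([0, 1])               # pin the tester to cores 0 and 1

# pin everything else to the remaining cores so the two groups do not overlap
for p in psutil.process_iter():
    if p.pid == mt4.pid:
        continue
    try:
        p.cpu_affinity([2, 3])
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        pass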

How does OpenTSDB downsample data

I have a 2 part question regarding downsampling on OpenTSDB.
The first is: I was wondering if anyone knows whether OpenTSDB treats the last end point as inclusive or exclusive when it calculates downsampling, or whether it counts the end data point twice?
For example, if my time interval is 12:30pm-1:30pm, I get DPs every 5 min starting at 12:29:44pm, and my downsample interval is summing every 10-minute block, does the system take the DPs from 12:30-12:39 and sum them, 12:40-12:49 and sum them, etc., or does it take the DPs from 12:30-12:40, then from 12:40-12:50, etc.? Yes, I know my data is off by 15 sec, but I don't control that.
I've tried to calculate it by hand but the data I have isn't helping me. The numbers I'm calculating aren't adding up to the above, nor is it matching what the graph is showing. I don't have access to the system that's pushing numbers into OpenTSDB so I can't setup dummy data to check.
The second question is how downsampling plots its points on the graph given my time range and downsample interval. I set downsample to sum 10-minute blocks. I set my range to be 12:30pm-1:30pm. The graph shows the first point of the downsampled graph at 12:35pm. That makes logical sense. I changed the range to 12:24pm-1:29pm and expected the first point to start at 12:30, but the first point shown is 12:25pm.
Hopefully someone can answer these questions for me. In the meantime, I'll continue trying to find some data in my system that helps show/prove how downsampling should work.
Thanks in advance for your help.
Downsampling isn't currently working the way you expect, although since this is a reasonable and commonly made expectation, we are thinking of changing this in a later release of OpenTSDB.
You're assuming that if you ask for a "10 min sum", the data points will be summed up within each "round" (or "aligned") 10 minute block (e.g. 12:30-12:39 then 12:40-12:49 in your example), but that's not what happens. What happens is that the code will start a 10-minute block from whichever data point is the first one it finds. So if the first one is at time 12:29:44, then the code will sum all subsequent data points until 600 seconds later, meaning until 12:39:44.
Within each 600 second block, there may be a varying number of data points. Some blocks may have more data points than others. Some blocks may have unevenly spaced data points, e.g. maybe all the data points are within one second of each other at the beginning of the 600s block. So in order to decide what timestamp will result from the downsampling operation, the code uses the average timestamp of all the data points of the block.
So if all your data points are evenly spaced throughout your 600s block, the average timestamp will fall somewhere in the middle of the block. But if, say, all the data points are within one second of each other at the beginning of the 600s block, then the timestamp returned will reflect that by virtue of being an average. Just to be clear, the code takes an average of the timestamps regardless of which downsampling function you picked (sum, min, max, average, etc.).
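A rough Python sketch of that behaviour (the real OpenTSDB implementation is in Java; this just mimics the block logic described above, and the inclusive/exclusive handling of the block boundary is an assumption):

def downsample_sum(points, interval=600):
    # points: time-sorted list of (timestamp_in_seconds, value) pairs
    out = []
    i = 0
    while i < len(points):
        block_start = points[i][0]   # a block begins at the first data point found
        block = []
        while i < len(points) and points[i][0] < block_start + interval:
            block.append(points[i])
            i += 1
        ts = sum(t for t, _ in block) / len(block)   # average timestamp of the block
        val = sum(v for _, v in block)               # the 'sum' aggregator
        out.append((ts, val))
    return out

# e.g. points every 5 minutes starting at 12:29:44 fall into a 12:29:44 block,
# a 12:39:44 block, and so on, rather than into 'round' 12:30 and 12:40 blocks.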
If you want to experiment quickly with OpenTSDB without writing to your production system, consider setting up a single-node OpenTSDB instance. It's very easy to do as is shown in the getting started guide.

How does retiming work in systolic arrays?

How does retiming work in systolic arrays (used in signal processors)? I read that there is some notion of negative delay which is used, but how can a delay be negative and if that is just an abstraction then how does it help?
The basic model of retiming is that you have wavefronts of registers interconnected by a bunch of combinational logic, and you are improving the timing or area of the resulting circuit by repositioning the registers at different points in the circuit such that every path through the logic still goes through the same number of registers. For a simple example, let's say that you have an AND gate feeding a register, the longest path to the input of the register is 12ns, the longest path from the output of the register is 6ns, the delay of the AND gate is 3ns, and you need to get the clock cycle time down to 10ns. You could achieve this by deleting the register and replacing it with two registers, one at each input of the AND gate, clocked by the same clock as the original register. Now you have reduced the longest input path to 9ns, expanded the output path to 9ns, and met your clock cycle objective. In effect, you have added -3ns to the effective arrival time at the register (and added +3ns to the effective output time).
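The arithmetic of that example, spelled out as a quick sketch (numbers taken from the paragraph above):

# Before retiming: the register sits at the AND gate's output.
input_path  = 12   # ns, longest path to the register's input
output_path = 6    # ns, longest path from the register's output
and_delay   = 3    # ns
target      = 10   # ns, required clock period

assert input_path > target                   # 12 ns > 10 ns: timing is violated

# After retiming: the register is moved backwards through the AND gate,
# i.e. one register on each AND input (the "-3ns" effective arrival time).
new_input_path  = input_path  - and_delay    # 12 - 3 = 9 ns
new_output_path = output_path + and_delay    #  6 + 3 = 9 ns

assert new_input_path <= target and new_output_path <= target   # both meet 10 ns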
A modified version of Leiserson and Saxe's original paper on retiming is available here. Wikipedia has a decent, though short, article on the subject with a few links. If you have access to IEEE Xplore or the ACM Digital Library, a search through the proceedings of the Design Automation Conference or the International Conference on Computer-Aided Design looking for retiming should yield lots of articles - this has been an active research area for years.

Resources