I have a LeafSystem (controller) with two output ports, each of which depends on the solution to the same MathematicalProgram. My initial idea was to solve the program and store the solution as a discrete state, which the output port callbacks can access and copy appropriately.
My interpretation of the documentation (https://drake.mit.edu/doxygen_cxx/group__discrete__systems.html), and what I see when implementing this, however, is that the output callbacks use the discrete state from before the PerStepDiscreteUpdateEvent runs.
Now for my questions -
Is this behavior that I've described above consistent with how the Simulator handles update events or am I missing something there?
Is there a way to update the discrete state before the output calculation and have the updated state be used in the output?
Is there a different design that would be more appropriate here?
The simple solution to your problem is a cache entry.
Declare a cache entry whose calculation callback does your mathematical program work and stores the results in the cache. When each output port is evaluated, they both "Eval" the cache entry and draw whatever data they need from the stored result. Then, no matter which port is evaluated first, the second one will always benefit from the pre-computation.
You can look at the cache entry notes for more detail.
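If it helps, here is a rough pydrake sketch of that pattern; the solve step, port names, and sizes below are placeholders rather than your actual program, and the C++ DeclareCacheEntry API is analogous:

import numpy as np
from pydrake.common.value import AbstractValue
from pydrake.systems.framework import LeafSystem, ValueProducer

class Controller(LeafSystem):
    def __init__(self):
        super().__init__()
        # One cache entry whose calc callback solves the program and stores
        # the result; both output ports Eval() it on demand.
        self._solution_cache = self.DeclareCacheEntry(
            "mathematical program solution",
            ValueProducer(
                allocate=lambda: AbstractValue.Make(np.zeros(2)),
                calc=self._solve_program))
        self.DeclareVectorOutputPort("port_a", 1, self._calc_port_a)
        self.DeclareVectorOutputPort("port_b", 1, self._calc_port_b)

    def _solve_program(self, context, output):
        # Placeholder for building and solving the real MathematicalProgram.
        output.set_value(np.array([1.0, 2.0]))

    def _calc_port_a(self, context, output):
        solution = self._solution_cache.Eval(context)
        output.SetFromVector(solution[:1])

    def _calc_port_b(self, context, output):
        solution = self._solution_cache.Eval(context)
        output.SetFromVector(solution[1:])

Whichever port is evaluated first triggers the solve; the other port then reads the cached result without re-solving.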
I'm trying to log data points for the same metric across multiple runs (wandb.init is called repeatedly in between each data point) and I'm unsure how to avoid the behavior seen in the attached screenshot...
Instead of getting a line chart with multiple points, I'm getting a single data point with associated statistics. In the attached example, the first data point was generated at step 1,470 and the second at step 2,940; rather than seeing two points, I'm getting a single point that is their average and appears at step 2,205.
My hunch is that using the resume run feature may address my problem, but even testing out this hunch is proving to be cumbersome given the constraints of the system I'm working with...
Before I invest more time in my hypothesized solution, could someone confirm that the behavior I'm seeing is, indeed, the result of logging data to the same metric across separate runs without using the resume feature?
If this is the case, can you confirm or deny my conception of how to use resume?
Initial run:
1. run = wandb.init()
2. wandb_id = run.id
3. Cache wandb_id for successive runs.
Successive run:
1. Retrieve wandb_id from the cache.
2. wandb.init(id=wandb_id, resume="must")
Is it also acceptable / preferable to replace 1. and 2. of the initial run with:
wandb_id = wandb.util.generate_id()
wandb.init(id=wandb_id)
It looks like you're grouping runs, so that could be why it's appearing as an average across steps - this might not be the case, but it's worth trying. Turn off grouping by clicking the button in the centre above your runs table on the left - it's highlighted in purple in the image below.
Both of the ways you're suggesting to resume runs seem fine.
My hunch is that using the resume run feature may address my problem,
Indeed, providing a cached id in combination with resume="must" fixed the issue.
Corresponding snippet:
import wandb
# wandb run associated with evaluation after first N epochs of training.
wandb_id = wandb.util.generate_id()
wandb.init(id=wandb_id, project="alrichards", name="test-run-3/job-1", group="test-run-3")
wandb.log({"mean_evaluate_loss_epoch": 20}, step=1)
wandb.finish()
# wandb run associated with evaluation after second N epochs of training.
wandb.init(id=wandb_id, resume="must", project="alrichards", name="test-run-3/job-2", group="test-run-3")
wandb.log({"mean_evaluate_loss_epoch": 10}, step=5)
wandb.finish()
I am very impressed with Dask and I am trying to determine if it is the right tool for my problem. I am building a project for interactive data exploration where users can interactively change parameters of a figure. Sometimes these changes require re-computing the entire pipeline to make the graph (e.g. "show data from a different time interval"), but sometimes not. For instance, "change the smoothing parameter" should not require the system to reload the raw unsmoothed data, because the underlying data is the same; only the processing changes. The system should instead use the existing raw data that has already been loaded. I would like my system to be able to keep around the intermediate data objects and intelligently determine which tasks in the graph need to be re-run based on which parameters of the data visualization have been changed. It looks like the caching system in Dask is close to what I need, but it was designed with a bit of a different use case in mind. I see there is a persist method, but I'm not sure if that would work either. Is there an easy way to accomplish this in Dask, or is there another project that would be more appropriate?
"change the smoothing parameter" should not require the system to reload the raw unsmoothed data
Two options:
The built-in functools.lru_cache will cache the result for every unique input. Memory use is kept in check with the maxsize parameter, which controls how many input/output pairs are stored (see the sketch below).
Using persist in the right places will compute that object and keep it in memory, as mentioned at https://distributed.dask.org/en/latest/manage-computation.html#client-persist. Later computations that use the object will not have to re-run its task graph; functionally, it serves the same purpose as lru_cache.
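For the first option, here is a minimal sketch; load_raw and the smoothing step are hypothetical stand-ins for your real pipeline:

from functools import lru_cache
import numpy as np

@lru_cache(maxsize=8)  # maxsize bounds how many (start, end) results stay in memory
def load_raw(start, end):
    # Stand-in for the expensive load of raw, unsmoothed data.
    return np.arange(start, end, dtype=float)

def make_figure(start, end, smoothing):
    raw = load_raw(start, end)  # cached after the first call with these arguments
    # Only this cheap step re-runs when the smoothing parameter changes.
    return np.convolve(raw, np.ones(smoothing) / smoothing, mode="same")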
For the persist option, for example, this code will read from disk twice:
>>> import dask.dataframe as dd
>>> df = dd.read_csv(...)
>>> # df = df.persist() # uncommenting this line → only read from disk once
>>> df[df.x > 0].mean().compute()
24.9
>>> df[df.y > 0].mean().compute()
0.1
Uncommenting that line means this code only reads from disk once, because the task graph for the CSV is computed once and the value is stored in memory. For your application, it sounds like I would use persist intelligently: https://docs.dask.org/en/latest/best-practices.html#persist-when-you-can
What if two smoothing parameters need to be visualized? In that case, I'd avoid calling compute repeatedly: https://docs.dask.org/en/latest/best-practices.html#avoid-calling-compute-repeatedly
lower, upper = dask.compute(df.x.min(), df.x.max())
This will share the task graph for min and max so unnecessary computation is not performed.
I would like my system to be able to keep around the intermediate data objects and intelligently determine what tasks in the graph need to be re-run based on what parameters of the data visualization have been changed.
Dask has a smart, opportunistic caching ability: https://docs.dask.org/en/latest/caching.html#automatic-opportunistic-caching. Part of the documentation says:
Another approach is to watch all intermediate computations, and guess which ones might be valuable to keep for the future. Dask has an opportunistic caching mechanism that stores intermediate tasks that show the following characteristics:
Expensive to compute
Cheap to store
Frequently used
I think this is what you're looking for; it'll store values depending on those attributes.
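A minimal way to switch it on (a sketch; the 2 GB budget is an arbitrary choice, and it relies on the optional cachey package being installed):

from dask.cache import Cache

cache = Cache(2e9)  # keep up to roughly 2 GB of "valuable" intermediate results
cache.register()    # applies globally to subsequent computations

After registering, repeated .compute() calls that share expensive intermediates (like the parsed CSV above) can reuse them automatically.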
I have an SPSS Modeler stream which is used and updated constantly, every week, to generate a certain dataset. The raw data for this stream is also renewed on a weekly basis.
In part of this stream, there is a chunk of nodes that has to be modified and updated manually every week; the sequence of this part is: Type Node => Restructure Node => Aggregate Node
To simplify the explanation of those nodes' roles, I drew an image of them as below.
Because the original raw data changes on a weekly basis, the range of the Unit value above always varies: sometimes more than 6 (maybe 100), other times fewer than 6 (maybe 3). That is why, until now, somebody has had to modify and update that chunk of nodes every week. *The Unit value has a certain limit (300 for now).
However, we are now aiming to run this stream automatically, without any human intervention, so we need to customize that part so it works perfectly and automatically. Please help; your efforts will be appreciated, thanks!
To automate this, I suggest trying Global nodes combined with CLEM scripts inside the execution (default script). I have a stream that calculates the first date and the last date, and those variables are used to rename files at the end of the execution. I think you could use something similar, as explained here:
1) Create derive nodes to bring the unit values used in the weekly stream
2) Save this information in a table named 'count_variable'
3) Use a Global node named Global with a query similar to this:
#GLOBAL_MAX(variable created in (2)) (this is only to record the number of variables; step (2) creates a table with a single value, so GLOBAL_MAX will simply return the number of variables).
4) The query inside the execution tab will be similar to this:
execute count_variable
var tabledata
var fn
set tabledata = count_variable.output
set count_variable = value tabledata at 1 1
execute Global
5) You can now use the information about the number of variables via the already created "count_variable".
It's not easy to explain just by typing, but I hope to have been helpful.
Please mark this answer as +1 if it was relevant.
I think there is a better, simpler, and more effective (yet risky, due to the node's requirements on input data) solution to your problem. It is called the Transpose node and it does exactly that - it pivots your table. But only from version 18.1 on. Here's an example:
https://developer.ibm.com/answers/questions/389161/how-does-new-feature-partial-transpose-work-in-sps/
I want to de-dupe a stream of data based on an ID in a windowed fashion. Each record we receive carries an ID, and we want to remove records with matching IDs within N-hour time windows. A straightforward approach is to use an external key store (BigTable or something similar) where we look up keys and write if required, but our QPS is extremely large, making maintaining such a service pretty hard. The alternative approach I came up with was to groupBy within a time window so that all data for a user within that time window falls within the same group, and then, in each group, use a separate key-store service where we look up duplicates by the key. So, I have a few questions about this approach:
[1] If I run a groupBy transform, is there any guarantee that each group will be processed in the same slave? If guaranteed, we can group by the userid and then within each group compare the sessionid for each user
[2] If it is feasible, my next question is whether we can run such other services on each of the slave machines that run the job - in the example above, I would like to have a local Redis running which can then be used by each group to look up or write an ID.
The idea seems a bit outside of what Dataflow is supposed to do, but I believe such use cases should be common - so if there is a better model to approach this problem, I am looking forward to that too. We essentially want to avoid external lookups as much as possible given the amount of data we have.
1) In the Dataflow model, there is no guarantee that the same machine will see all the groups across windows for the key. Imagine that a VM dies or new VMs are added and work is split across them for scaling.
2) You're welcome to run other services on the Dataflow VMs since they are general purpose, but note that you will have to contend with the resource requirements of the other applications on the host, potentially causing out-of-memory issues.
Note that you may want to take a look at RemoveDuplicates and use that if it fits your use case.
It also seems like you might want to be using session windows to dedupe elements. You would call:
PCollection<T> pc = ...;
PCollection<T> windowed_pc = pc.apply(
Window.<T>into(Sessions.withGapDuration(Duration.standardHours(N))));
Each new element will keep extending the length of the window, so it won't close until the gap passes. If you also apply a speculative early-firing trigger of AfterPane.elementCountAtLeast(1) together with an AfterWatermark trigger on a downstream GroupByKey, the trigger will fire as soon as it can: once after it has seen at least one element, and then once more when the session closes. After the GroupByKey you would have a DoFn that filters out any element which isn't from an early firing, based upon the pane information ([3], [4]).
DoFn(T -> KV<session key, T>)
|
\|/
Window.into(Session window)
|
\|/
Group by key
|
\|/
DoFn(Filter based upon pane information)
It is sort of unclear from your description; can you provide more details?
Sorry for not being clear. I gave the setup you mentioned a try, except for the early and late firings part, and it is working on smaller samples. I have a couple of follow up questions, related to scaling this up. Also, I was hoping I could give you more information on what the exact scenario is.
So, we have an incoming data stream, each item of which can be uniquely identified by its fields. We also know that duplicates occur pretty far apart and, for now, we care about those within a 6-hour window. Regarding the volume of data, we have at least 100K events every second, spanning a million different users - so within this 6-hour window, we could get a few billion events into the pipeline.
Given this background, my questions are
[1] For the sessioning to happen by key, I should run it on something like
PCollection<KV<key, T>> windowed_pc = pc.apply(
Window.<KV<key, T>>into(Sessions.withGapDuration(Duration.standardHours(6))));
where key is a combination of the 3 ids I had mentioned earlier. Based on the definition of Sessions, only if I run it on this KV would I be able to manage sessions per key. This would mean that Dataflow would have a very large number of open sessions at any given time waiting for them to close, and I was worried whether it would scale or whether I would run into any bottlenecks.
[2] Once I perform sessioning as above, I have already removed the duplicates based on the firings, since I will only care about the first firing in each session, which already discards duplicates. I no longer need the RemoveDuplicates transform, which I found is a combination of the (WithKeys, Combine.PerKey, Values) transforms in order, essentially performing the same operation. Is this the right assumption to make?
[3] If the solution in [1] is going to be a problem, the alternative is to reduce the key for sessioning to just user-id and session-id, ignoring the sequence-id, and then run a RemoveDuplicates by sequence-id on top of each resulting window. This might reduce the number of open sessions, but it would still leave a lot of them (#users * #sessions per user), which can easily run into the millions. FWIW, I don't think we can session only by user-id, since then the session might never close, as different sessions for the same user could keep coming in, and determining the session gap in that scenario becomes infeasible.
Hope my problem is a little clearer this time. Please let me know if any of my approaches make the best use of Dataflow or if I am missing something.
Thanks
I tried out this solution at a larger scale and as long as I provide sufficient workers and disks, the pipeline scales well although I am seeing a different problem now.
After this sessionization, I run a Combine.perKey on the key and then perform a ParDo which looks into c.pane().getTiming() and only rejects anything other than an EARLY firing. I tried counting both EARLY and ONTIME firings in this ParDo and it looks like the ontime-panes are actually deduped more precisely than the early ones. I mean, the #early-firings still has some duplicates whereas the #ontime-firings is less than that and has more duplicates removed. Is there any reason this could happen? Also, is my approach towards deduping using a Combine+ParDo the right one or could I do something better?
events.apply(
WithKeys.<String, EventInfo>of(new SerializableFunction<EventInfo, String>() {
@Override
public java.lang.String apply(EventInfo input) {
return input.getUniqueKey();
}
})
)
.apply(
Window.named("sessioner").<KV<String, EventInfo>>into(
Sessions.withGapDuration(mSessionGap)
)
.triggering(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(1))
)
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes()
);
I am using the ELKI MiniGUI to run LOF. I have found out how to normalize the data before running by -dbc.filter, but I would like to look at the original data records and not the normalized ones in the output.
It seems that there is some flag called -normUndo, which can be set if using the command-line, but I cannot figure out how to use it in the MiniGUI.
This functionality used to exist in ELKI, but has effectively been removed (for now).
only a few normalizations ever supported this, most would fail.
there is no longer a well defined "end" with the visualization. Some users will want to visualize the normalized data, others not.
it requires carrying normalization information along, which makes the data structures more complex (albeit the hierarchical approach we have now would allow this again)
due to numerical imprecision of floating point math, you would frequently not get out the exact same values as you put in
keeping the original data in memory may be too expensive for some use cases, so we would need to add another parameter "keep non-normalized data"; furthermore you would need to choose which (normalized or non-normalized) to use for analysis, and which for visualization. This would not be hard with a full-blown GUI, but you are looking at a command line interface. (This is easy to do with Java, too...)
We would of course appreciate patches that contribute such functionality to ELKI.
The easiest way is this: add a (non-numerical) label column, and you can then identify the original objects in your original data by this label.
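For example (the values here are made up), a whitespace-separated input file whose last column is non-numeric will have that column read as a label, and the label is carried through to the results unchanged:

0.28 0.71 object_001
0.53 0.12 object_002
0.95 0.44 object_003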