Avoid reshuffling when using state or timers? - google-cloud-dataflow

I often find myself writing Beam pipelines where I want to use state or timers, but the data may already be sharded a certain way by a previous GroupByKey that I do not want to disturb. However, the API says state and timers require KV inputs to the PTransform. Perhaps stateful/timerful transforms do a GroupByKey internally.
Is there a way to use state/timers without resharding/reshuffling?
Here is a concrete use case: I am running into a performance problem when implementing my own metrics collection instead of using the beam built-in Stackdriver metrics, where the system lag of transforms involving shuffling starts shooting up and in general does not recover after some time.
Here is the relevant code where I funnel into one key the metric values from various places where the data is potentially sharded differently, simply because I needed to use timers.
metricsFromVariousPlaces
.apply(Flatten.pCollections())
.apply(WithKeys.of(null.asInstanceOf[Void]))
.apply("write influx points",
new InfluxSinkTransform(...))
InfluxSinkTransform requires timers in order to flush writes to InfluxDB in a timely fashion.
I understand this causes reshuffling because now the data is all under one shard. I expect this reshuffling to be expensive and hope to avoid it if possible.
I tried preserving the keys from the previous transform, but it looks like there is still shuffling:
"stage_id": "S68",
"stage_name": "F503",
"fused_steps": [
...
{
"name": "s29.org.apache.beam.sdk.values.PCollection.<init>:402#b70c45c110743c2b-write-streaming-shuffle430",
"originalTransform": "snapshot/MapElements/Map.out0",
"userName": "snapshot/MapElements/Map.out0/FromValue/WriteStream"
}
]
...
{
"stage_id": "S74",
"stage_name": "F509",
"fused_steps": [
{
"name": "s29.org.apache.beam.sdk.values.PCollection.<init>:402#b70c45c110743c2b-read-streaming-shuffle431",
"originalTransform": "snapshot/MapElements/Map.out0",
"userName": "snapshot/MapElements/Map.out0/FromValue/ReadStream"
},
{
"name": "s30",
"originalTransform": "snapshot/write influx points/ParDo(Do)",
"userName": "snapshot/write influx points/ParDo(Do)"
}
]

There is no way to use state or timers without inducing shuffling, even when the keys stay unchanged.

Related

What should I use as the Key for GroupIntoBatches.withShardedKey

I want to batch the calls to an external service in my streaming dataflow job for unbounded sources. I used windowing + attach a dummy key + GroupByKey as below
messages
// 1. Windowing
.apply("window-5-seconds",
Window.<Message>into(FixedWindows.of(Duration.standardSeconds(5)))
.triggering(
Repeatedly.forever(AfterPane.elementCountAtLeast(1000)
.orFinally(AfterWatermark.pastEndOfWindow())))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes()
)
// 2. attach arbitrary key
.apply("attach-arbitrary-key", ParDo.of(new MySink.AttachDummyKey()))
// 3. group by key
.apply(GroupByKey.create())
// 4. call my service
.apply("call-my-service",
ParDo.of(new MySink(myClient)));
This implementation caused performance issues as I attached a dummy key to all the messages that caused the transform to not execute in parallel at all. After reading this answer, I switched to GroupIntoBatches transform as below.
messages
// 1. Windowing
.apply("window-5-seconds",
Window.<Message>into(FixedWindows.of(Duration.standardSeconds(5)))
.triggering(
Repeatedly.forever(AfterPane.elementCountAtLeast(1000)
.orFinally(AfterWatermark.pastEndOfWindow())))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes()
)
// 2. attach sharding key
.apply("attach-sharding-key", ParDo.of(new MySink.AttachShardingKey()))
// 3. group by key into batches
.apply("group-into-batches",
GroupIntoBatches.<String, MessageWrapper>ofSize(1000)
.withMaxBufferingDuration(Duration.standardSeconds(5))
.withShardedKey())
// 4. call my service
.apply("call-my-service",
ParDo.of(new MySink(myClient)));
The documentation states that withShardedKey increases parallelism by spreading one key over multiple threads, but what would be a good key to use with withShardedKey?
If this truly is runner-determined sharding, would it make sense to use a single dummy key, or would the same problem occur as with GroupByKey? Currently I do not have a good key to use; I was thinking of creating a hash based on some fields of the message. If I do pick a key that evenly distributes the traffic, would it still make sense to use withShardedKey? Or might each shard then hold too little data for GroupIntoBatches to be useful?
Usually the key would be a natural key, but since you mentioned that there is no such key, I think there are a few trade-offs to consider.
You can apply a static key, but the parallelism will then depend only on the number of threads (per the GroupIntoBatches semantics), which is runner-specific:
Outputs batched elements associated with sharded input keys. By default, keys are sharded to such that the input elements with the same key are spread to all available threads executing the transform. Runners may override the default sharding to do a better load balancing during the execution time.
If your pipeline can afford more calls (possibly with batches that are not full, depending on the distribution), applying a random key drawn from a small range instead of a static key may provide better guarantees. You would have to experiment to find a good balance for the range.
I recommend watching this session which provides some relevant information: Beam Summit 2021 - Autoscaling your transforms with auto-sharded GroupIntoBatches
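The random-key idea above can be sketched in a few lines. This is plain Python rather than Beam code, and the names attach_shard_key and num_shards are hypothetical; the point is only that each message gets a key drawn from a small fixed range, so the number of distinct keys (and therefore the upper bound on partially filled batches) stays bounded.

```python
import random


def attach_shard_key(message, num_shards=8, rng=random):
    """Pair a message with a random key in [0, num_shards).

    num_shards is an assumed tuning knob: larger values allow more
    parallelism but spread elements thinner, so batches fill more slowly.
    """
    return (rng.randrange(num_shards), message)


def key_histogram(messages, num_shards=8, rng=random):
    """Count how many messages land on each key, to eyeball the balance."""
    counts = [0] * num_shards
    for message in messages:
        key, _ = attach_shard_key(message, num_shards, rng)
        counts[key] += 1
    return counts
```

With a batch size of 1000 and a 5 second buffering limit, a small range such as 8 or 16 keys is a reasonable starting point to try, but the right value depends on throughput and has to be measured.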

Can Dask computational graphs keep intermediate data so re-compute is not necessary?

I am very impressed with Dask and I am trying to determine if it is the right tool for my problem. I am building a project for interactive data exploration where users can interactively change parameters of a figure. Sometimes these changes require re-computing the entire pipeline to make the graph (e.g. "show data from a different time interval"), but sometimes not. For instance, "change the smoothing parameter" should not require the system to reload the raw unsmoothed data, because the underlying data is the same and only the processing changes. The system should instead use the existing raw data that has already been loaded. I would like my system to be able to keep around the intermediate data objects and intelligently determine what tasks in the graph need to be re-run based on what parameters of the data visualization have been changed. It looks like the caching system in Dask is close to what I need, but was designed with a slightly different use case in mind. I see there is a persist method, but I'm not sure if that would work either. Is there an easy way to accomplish this in Dask, or is there another project that would be more appropriate?
"change the smoothing parameter" should not require the system to reload the raw unsmoothed data
Two options:
The builtin functools.lru_cache will cache every unique input. The check on memory is with the maxsize parameter, which controls how many input/output pairs are stored.
Using persist in the right places will compute that object as mentioned at https://distributed.dask.org/en/latest/manage-computation.html#client-persist. It will not require re-running computation to get the object in later computation; functionally, it's the same as lru_cache.
For example, this code will read from disk twice:
>>> import dask.dataframe as dd
>>> df = dd.read_csv(...)
>>> # df = df.persist() # uncommenting this line → only read from disk once
>>> df[df.x > 0].mean().compute()
24.9
>>> df[df.y > 0].mean().compute()
0.1
Uncommenting the line means this code only reads from disk once, because the task graph for the CSV is computed and the value is stored in memory. For your application it sounds like I would use persist intelligently: https://docs.dask.org/en/latest/best-practices.html#persist-when-you-can
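For the first option, here is a minimal sketch of how functools.lru_cache keeps the raw data around while only the smoothing changes. The functions load_raw and smooth and the LOAD_CALLS list are hypothetical stand-ins for the real loading and processing steps; the fake loader just generates numbers so the example is self-contained.

```python
from functools import lru_cache

# Records every real (uncached) load so the caching is observable.
LOAD_CALLS = []


@lru_cache(maxsize=8)
def load_raw(start, stop):
    """Stand-in for an expensive load of raw data for one time interval."""
    LOAD_CALLS.append((start, stop))
    # Return a tuple so the cached value cannot be mutated by callers.
    return tuple(float(i) for i in range(start, stop))


def smooth(start, stop, window):
    """Moving average over the cached raw data."""
    raw = load_raw(start, stop)
    return [
        sum(raw[i:i + window]) / window
        for i in range(len(raw) - window + 1)
    ]
```

Calling smooth(0, 10, 2) and then smooth(0, 10, 3) loads the interval once; only a different interval triggers a new load. The maxsize argument bounds how many intervals are kept.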
What if you want to visualize two smoothing parameters at once? In that case, I'd avoid calling compute repeatedly: https://docs.dask.org/en/latest/best-practices.html#avoid-calling-compute-repeatedly
lower, upper = dask.compute(df.x.min(), df.x.max())
This will share the task graph for min and max so unnecessary computation is not performed.
I would like my system to be able to keep around the intermediate data objects and intelligently determine what tasks in the graph need to be re-run based on what parameters of the data visualization have been changed.
Dask Distributed has a smart caching ability: https://docs.dask.org/en/latest/caching.html#automatic-opportunistic-caching. Part of the documentation says
Another approach is to watch all intermediate computations, and guess which ones might be valuable to keep for the future. Dask has an opportunistic caching mechanism that stores intermediate tasks that show the following characteristics:
Expensive to compute
Cheap to store
Frequently used
I think this is what you're looking for; it'll store values depending on those attributes.

Guarantee Print Order After Parallelism

I have X amount of cores doing unique work in parallel, however, their output needs to be printed in order.
Object {
Data data
int order
}
I've tried putting the objects in a min heap after they're done with their parallel work, however, even that is too much of a bottleneck.
Is there any way I could have work done in parallel and guarantee the print order? Is there a known term for my problem? Have others encountered it before?
Is there any way I could have work done in parallel and guarantee the print order?
Needless to say, we design parallelized routines with a focus on efficiency, not on constraining the order of the calculations. The printing of the results at the end, when everything is done, should dictate the ordering. In fact, parallel routines often do calculations in such a way that they’re conspicuously not in order (e.g., striding on each thread) to minimize thread and synchronization overhead.
The only question is how you structure the results to allow efficient storage and efficient, ordered retrieval. I often just use a mutable buffer or a pre-populated array. It’s very efficient in terms of both storage and retrieval. Or you can use a dictionary, too. It depends upon the nature of your Data. But I’d avoid the order property pattern in your result Object.
Just make sure you’re using optimized build if using standard Swift collections, as this can have a material impact on performance.
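The pre-populated array approach can be sketched as follows. This is Python rather than Swift, and work and run_ordered are hypothetical names: each task is submitted together with its index, results are stored into a pre-sized list at that index as they complete in whatever order, and the list is read in order only once all the work is done.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(item):
    """Stand-in for the real per-item computation."""
    return item * item


def run_ordered(items, max_workers=4):
    """Run work() in parallel and return results in input order."""
    results = [None] * len(items)  # pre-populated output buffer
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which slot each future belongs to.
        futures = {pool.submit(work, item): i for i, item in enumerate(items)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    for value in run_ordered([3, 1, 2]):
        print(value)
```

No heap or sorting is needed, because the slot index already encodes the order; each worker writes to a distinct slot, so there is nothing to contend over.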
Q : Is there a known term for my problem?
Yes, there is. A contradiction:
Definition of contradiction … 2a: a proposition, statement, or phrase that asserts or implies both the truth and falsity of something ("… both parts of a contradiction cannot possibly be true …", Thomas Hobbes)
2b: a statement or phrase whose parts contradict each other ("a round square is a contradiction in terms")
3a: logical incongruity
3b: a situation in which inherent factors, actions, or propositions are inconsistent or contrary to one another
(source: Merriam-Webster)
Computer science, having borrowed the terms { PARALLEL | SERIAL | CONCURRENT } from the theory of systems, respects the distinctive ( and never overlapping ) properties of each such class of operations, where:
[PARALLEL] orchestration of units-of-work implies, that any and every work-unit: a) starts and b) gets executed and c) gets finished at the same time, i.e. all get into/out-of [PARALLEL]-section at once and being elaborated at the very same time, not otherwise.
[SERIAL] orchestration of units-of-work implies, that all work-units be processed in a one, static, known, particular order, starting work-unit(s) in such an order, just a (known)-next one after previous one has finished its work - i.e. one-after-another, not otherwise.
[CONCURRENT] orchestration of units-of-work permits to start more than one unit-of-work, if resources and system conditions permit (scheduler priorities obeyed), resulting in unknown order of execution and unknown time of completion, as both the former and the latter depend on unknown externalities (system conditions and (non)-availability of resources, that are/will be needed for a particular work-unit elaboration)
Whereas there is an a-priori known, inherently embedded sense of an ORDER in [SERIAL]-type of processing ( as it was already pre-wired into the units-of-work processing-orchestration-code ), it has no such meaning in either [CONCURRENT], where opportunistic scheduling makes a wished-to-have order an undeterministically random result from the system states, skewed by the coincidence of all other externalities, and the same wished-to-have order is principally singular value in true [PARALLEL] by definition, as all start/execute/finish at-the-same-time - so all units-of-work being executed in [PARALLEL] fashion have no other chance, but be both 1st and last at the same time.
Q : Is there any way I could have work done in parallel and guarantee the print order?
No, unless you intentionally or unknowingly violate the [PARALLEL] orchestration rules and re-enter a re-[SERIAL]-iser logic into the work-units, so as to imperatively enforce any such wished-to-have ordering, that is not known, the less natural for the originally [PARALLEL] work-units' orchestration ( as is a common practice in python - using a GIL-monopolist indoctrinated stepping - as an example of such step )
Q : Have others encountered it before?
Yes. Since 2011, each and every semester this or similar questions reappear here, on Stack Overflow at growing amounts every year.

Stream de-duplication on Dataflow | Running services on Dataflow services

I want to de-dupe a stream of data based on an ID in a windowed fashion. The stream we receive has records identified by a set of IDs, and we want to remove records with matching IDs within N-hour time windows. A straightforward approach is to use an external key store (BigTable or something similar) where we look up keys and write if required, but our QPS is extremely large, which makes maintaining such a service pretty hard. The alternative approach I came up with was to groupBy within a time window so that all data for a user within a time window falls within the same group and then, in each group, we use a separate key-store service where we look up duplicates by the key. So, I have a few questions about this approach:
[1] If I run a groupBy transform, is there any guarantee that each group will be processed in the same slave? If guaranteed, we can group by the userid and then within each group compare the sessionid for each user
[2] If it is feasible, my next question is to whether we can run such other services in each of the slave machines that run the job - in the example above, I would like to have a local Redis running which can then be used by each group to look up or write an ID too.
The idea seems off what Dataflow is supposed to do but I believe such use cases should be common - so if there is a better model to approach this problem, I am looking forward to that too. We essentially want to avoid external lookups as much as possible given the amount of data we have.
1) In the Dataflow model, there is no guarantee that the same machine will see all the groups across windows for the key. Imagine that a VM dies or new VMs are added and work is split across them for scaling.
2) You're welcome to run other services on the Dataflow VMs since they are general purpose, but note that you will have to contend with the resource requirements of the other applications on the host, which could cause out-of-memory issues.
Note that you may want to take a look at RemoveDuplicates and use that if it fits your use case.
It also seems like you might want to be using session windows to dedupe elements. You would call:
PCollection<T> pc = ...;
PCollection<T> windowed_pc = pc.apply(
Window.<T>into(Sessions.withGapDuration(Duration.standardHours(N))));
Each new element will keep extending the length of the window, so it won't close until the gap closes. If you also apply an AfterCount speculative trigger of 1 together with an AfterWatermark trigger on a downstream GroupByKey, the trigger would fire as soon as it could, which would be once it has seen at least one element, and then once more when the session closes. After the GroupByKey you would have a DoFn that filters out any element which isn't an early firing, based upon the pane information ([3], [4]).
DoFn(T -> KV<session key, T>)
|
\|/
Window.into(Session window)
|
\|/
Group by key
|
\|/
DoFn(Filter based upon pane information)
It is sort of unclear from your description, can you provide more details?
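To make the session-window semantics above concrete, here is a small simulation in plain Python. It is not Beam code, and dedupe_by_session is a hypothetical name; it assumes the events arrive sorted by timestamp and keeps only the first element of each per-key session, which is what filtering on the first firing amounts to.

```python
def dedupe_by_session(events, gap):
    """Keep the first event of every per-key session.

    events: iterable of (key, timestamp) pairs, assumed sorted by timestamp.
    gap: session gap duration, in the same units as the timestamps.
    """
    last_seen = {}  # key -> timestamp of the most recent event for that key
    kept = []
    for key, ts in events:
        previous = last_seen.get(key)
        # A new session starts when the key is new or the gap has elapsed.
        if previous is None or ts - previous >= gap:
            kept.append((key, ts))
        # Every event, duplicate or not, extends the session.
        last_seen[key] = ts
    return kept
```

Note that duplicates keep the session open: a key that repeats more often than the gap never starts a new session, so only its very first occurrence is emitted.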
Sorry for not being clear. I gave the setup you mentioned a try, except for the early and late firings part, and it is working on smaller samples. I have a couple of follow up questions, related to scaling this up. Also, I was hoping I could give you more information on what the exact scenario is.
So, we have an incoming data stream, each item of which can be uniquely identified by its fields. We also know that duplicates occur pretty far apart, and for now we care about those within a 6-hour window. Regarding the volume of data, we have at least 100K events every second, spanning a million different users, so within this 6-hour window we could get a few billion events into the pipeline.
Given this background, my questions are
[1] For the sessioning to happen by key, I should run it on something like
PCollection<KV<key, T>> windowed_pc = pc.apply(
Window.<KV<key, T>>into(Sessions.withGapDuration(Duration.standardHours(6))));
where key is a combination of the 3 ids I had mentioned earlier. Based on the definition of Sessions, only if I run it on this KV would I be able to manage sessions per-key. This would mean that Dataflow would have too many open sessions at any given time waiting for them to close and I was worried if it would scale or I would run into any bottle-necks.
[2] Once I perform Sessioning as above, I have already removed the duplicates based on the firings since I will only care about the first firing in each session which already destroys duplicates. I no longer need the RemoveDuplicates transform which I found was a combination of (WithKeys, Combine.PerKey, Values) transforms in order, essentially performing the same operation. Is this the right assumption to make?
[3] If the solution in [1] is going to be a problem, the alternative is to reduce the key for sessioning to just user-id and session-id, ignoring the sequence-id, and then run a RemoveDuplicates on top of each resulting window by sequence-id. This might reduce the number of open sessions but would still leave a lot of open sessions (#users * #sessions per user), which can easily run into millions. FWIW, I don't think we can session only by user-id, since then the session might never close as different sessions for the same user could keep coming in, and determining the session gap in this scenario becomes infeasible.
Hope my problem is a little more clear this time. Please let me know any of my approaches make the best use of Dataflow or if I am missing something.
Thanks
I tried out this solution at a larger scale and as long as I provide sufficient workers and disks, the pipeline scales well although I am seeing a different problem now.
After this sessionization, I run a Combine.perKey on the key and then perform a ParDo which looks into c.pane().getTiming() and only rejects anything other than an EARLY firing. I tried counting both EARLY and ONTIME firings in this ParDo and it looks like the ontime-panes are actually deduped more precisely than the early ones. I mean, the #early-firings still has some duplicates whereas the #ontime-firings is less than that and has more duplicates removed. Is there any reason this could happen? Also, is my approach towards deduping using a Combine+ParDo the right one or could I do something better?
events.apply(
WithKeys.<String, EventInfo>of(new SerializableFunction<EventInfo, String>() {
@Override
public java.lang.String apply(EventInfo input) {
return input.getUniqueKey();
}
})
)
.apply(
Window.named("sessioner").<KV<String, EventInfo>>into(
Sessions.withGapDuration(mSessionGap)
)
.triggering(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(1))
)
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes()
);

How does a data model affect neo4j write performance with CYPHER?

I have been really struggling to achieve acceptable performance for my application with Neo4J 3.0.3. Here is some background:
I am trying to replace Apache Solr with Neo4j for an application to extend its capabilities, while maintaining or improving performance.
In Solr I have documents that essentially look like this:
{
"time": "2015-08-05T00:16:00Z",
"point": "45.8300018311,-129.759994507",
"sea_water_temperature": 18.49,
"sea_water_temperature_depth": 4,
"wind_speed": 6.48144,
"eastward_wind": 5.567876,
"northward_wind": -3.3178043,
"wind_depth": -15,
"sea_water_salinity": 32.19,
"sea_water_salinity_depth": 4,
"platform": 1,
"mission": 1,
"metadata": "KTDQ_20150805v20001_0016"
}
Since Solr is a key-value data store, my initial translation to Neo4J was going to be simple so I could get a feel for working with the API.
My method was essentially to have each Solr record equate to a Neo4J node, where every key-value would become a node-property.
Obviously a few tweaks were required (changing None to 'None' (python), changing ISO times to epoch times (neo4j doesn't support indexing datetimes), changing point to lat/lon (neo4j spatial indexing), etc).
My goal was to load up Neo4J using this model, regardless of how naive it might be.
Here is an example of a REST call I make when loading in a single record (using http://localhost:7474/db/data/cypher as my endpoint):
{
"query" :
"CREATE (r:record {lat : {lat}, SST : {SST}, meta : {meta}, lon : {lon}, time : {time}}) RETURN id(r);",
"params": {
"lat": 40.1021614075,
"SST": 6.521100044250488,
"meta": "KCEJ_20140418v20001_1430",
"lon": -70.8780212402,
"time": 1397883480
}
}
Note that I have actually removed quite a few parameters for testing neo4j.
Currently I have serious performance issues. Loading a document like this into Solr for me takes about 2 seconds. For Neo4J it takes:
~20 seconds using REST API
~45 seconds using BOLT
~70 seconds using py2neo
I have ~50,000,000 records I need to load. Doing this in Solr usually takes 24 hours, so Neo4J could take almost a month!!
I recorded these times without using a uniqueness constraint on my 'meta' attribute, and without adding each node into the spatial index. The timings in this scenario were extremely poor.
Running into this issue, I tried searching for performance tweaks online. The following things have not improved my situation:
-increasing the open file limit from 1024 to 40000
-using ext4, and tweaking it as documented here
-increasing the page cache size to 16 GB (my system has 32)
So far I have only addressed load times. After I had loaded about 50,000 nodes overnight, I attempted queries on my spatial index like so:
CALL spatial.withinDistance('my_layer', {lon: 34.0, lat: 20.0}, 1000)
as well as on my time index like so:
MATCH (r:record) WHERE r.time > {} AND r.time < {} RETURN r;
These simple queries would take literally several minutes just to return possibly a few nodes.
In Apache Solr, the spatial index is extremely fast and responds within 5 seconds (even with all 50,000,000 docs loaded).
At this point, I am concerned as to whether or not this performance lag is due to the nature of my data model, the configuration of my server, etc.
My goal was to extrapolate from this model, and move several measurement types to their own class of Node, and create relationships from my base record node to these.
Is it possible that I am abusing Neo4j, and need to recreate this model to use relationships and several different Node types? Should I expect to see dramatic improvements?
As a side note, I originally planned to use a triple store (specifically Parliament) to store this data, and after struggling to work with RDF, decided that Neo4J looked promising and much easier to get up and running. Would it be worth while to go back to RDF?
Any advice, tips, comments are welcome. Thank you in advance.
EDIT:
As suggested in the comments, I have changed the behavior of my loading script.
Previously I was using python in this manner:
from neo4j.v1 import GraphDatabase
driver = GraphDatabase('http://localhost:7474/db/data')
session = driver.session()
for tuple in mydata:
    statement = build_statement(tuple)
    session.run(statement)
session.close()
With this approach, the actual .run() statements run in virtually no time. The .close() statement is where all the run time occurs.
My modified approach:
transaction = ''
for tuple in mydata:
    statement = build_statement(tuple)
    transaction += ('\n' + statement)
with session.begin_transaction() as tx:
    tx.run(transaction)
session.close()
I'm a bit confused because the behavior is pretty much the same: .close() still takes around 45 seconds, except that now it doesn't commit. Since I am reusing the same identifier in each of my statements (CREATE (r:record {...}) .... CREATE (r:record {...}) ...), I get a CypherError about the redeclared identifier. I don't really know how to avoid this problem at the moment, and furthermore, the run time did not seem to improve at all (I would expect an error to make this terminate much faster).
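One way the identifier clash could be sidestepped, offered only as a sketch and not something tested against this setup, is to send the rows as a single list parameter and let one UNWIND statement create a node per row, so the identifier r is declared once per query instead of once per record. The helpers chunked and build_batch_payload are hypothetical; the payload shape and the {param} syntax follow the REST example earlier in the question.

```python
import json


def chunked(records, size):
    """Yield lists of at most `size` records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_batch_payload(rows):
    """Build one request body that creates a node for every row.

    Each row is a dict of properties, e.g. {"lat": ..., "lon": ..., "time": ...}.
    """
    query = "UNWIND {rows} AS row CREATE (r:record) SET r = row"
    return {"query": query, "params": {"rows": list(rows)}}


if __name__ == "__main__":
    sample = [{"meta": "m%d" % i, "time": 1397883480 + i} for i in range(5)]
    for batch in chunked(sample, 2):
        # Each printed body would be POSTed to the cypher endpoint.
        print(json.dumps(build_batch_payload(batch)))
```

The batch size is an assumption to tune; a few thousand rows per request is a common starting point, but that would need measuring here.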
