Chaining another transform after DataStoreIO.Write - google-cloud-dataflow

I am creating a Google Dataflow pipeline using the Apache Beam Java SDK. After a few transforms, I end up with a collection of entities (PCollection<Entity>). I need to write this collection to Google Datastore and then perform another transform AFTER all entities have been written (such as broadcasting the IDs of the saved objects through a Pub/Sub message to multiple subscribers).
Now, the way to store a PCollection is:
entities.apply(DatastoreIO.v1().write().withProjectId("abc"))
This returns a PDone object, and I am not sure how to chain another transform to run after the write has completed. Since the DatastoreIO.v1().write() transform does not return a PCollection, I cannot extend the pipeline any further. I have 2 questions:
How can I get the Ids of the objects written to datastore?
How can I attach another transform that will act after all entities are saved?

We don't have a good way to do either of these things (returning IDs of written Datastore entities, or waiting until entities have been written), though this is far from the first similar request (people have asked for this for BigQuery, for example) and we're thinking about it.
Right now your only option is to wait until the entire pipeline finishes, e.g. via pipeline.run().waitUntilFinish(), and then do what you wanted in your main program (e.g. you can run another pipeline).

Related

Saving data across APIs in Gramex

I am calling the same database query in multiple FormHandlers. I want to fetch the data once for processing and store it for use across those FormHandlers.
FormHandler caches the data after your first query, so you are not querying the DB again as long as the query remains the same.
And if you are firing the same query through multiple FormHandlers, you could write a single transform function that does all the different processing after fetching the data (FormHandler will take care of caching, so you will not query the DB from different patterns).
/dataapi?mode=getsalesdata&otherparams=.......
/dataapi?mode=getavgsales&otherparams=........
You could also use the query function in FormHandler to control the dynamic behaviour of your query.
Please provide more details about the use case for a more tailored response.

Is Reactor Context used only for statically initialised data?

Consider the following 4 lines of code:
Mono<Void> result = personRepository.findByNameStartingWith("Alice")
    .map(...)
    .flatMap(...)
    .subscriberContext()
Fictional Use Case which I hope you will immediately map to your real task requirement:
How does one add "Alice" to the context so that, after .map(), where "Alice" is no longer a Person.class but a Cyborg.class (assuming an irreversible transformation), I can access the original "Alice" Person.class in .flatMap()? We want to compare the strength of "Alice" the person versus "Alice" the cyborg inside .flatMap() and then send them both to the moon on a ship to build a colony.
I've read this about 3 times:
https://projectreactor.io/docs/core/release/reference/#context
I've read a dozen articles on subscriberContext.
I've looked at a colleague's code that uses subscriberContext, but only for Tracing Context and MDM, which are statically initialised outside of pipelines at the top of the code.
So the conclusion I am coming to is that something else was named "context", something the majority can't use for the overwhelming use case above.
Do I have to stick to tuples and wrappers? Or am I a total dummy and there is a way? I need this context to work in the entirely opposite direction :-), unless "this" context is not the context I need.
I will await the Reactor developers' attention (or, failing that, go to GitHub to raise an issue about the conceptual naming error, if I am correct), but in the meantime: I believed that Reactor Context could solve this:
What is the efficient/proper way to flow multiple objects in reactor
But what it actually resembles is some kind of mega closure over the reactive pipeline, propagating down->up and accepting values from outside in an imperative way, which IMO is too narrow and limited a use case to be called a "context", and which will confuse more people to come.
Context and subscriberContext in the posts you refer to are indeed one and the same...
The goal of the Context is more along the lines of attaching some information to a given subscription.
This works because upon subscription, a chain of Subscribers is constructed to "materialize" the processing, and by nature each given operator (or step) has a reference to its downstream in order to be able to push data to it.
As a result, it can also query it for its view of what the current subscription Context is, hence the down-to-up approach.
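The down-to-up lookup described above can be illustrated with a toy model in plain Java. This is only a sketch: Sub, FinalSub, PassThroughSub and ContextWriteSub are hypothetical names invented for the illustration, not Reactor classes, and the context is modelled as a plain Map. It shows that an operator above a context-writing step sees the written value, while an operator below it does not.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextChainSketch {
    /** Toy stand-in for a subscriber in the chain; NOT a Reactor type. */
    interface Sub {
        Map<String, Object> currentContext();
    }

    /** The final subscriber at the bottom of the chain starts with an empty context. */
    static class FinalSub implements Sub {
        public Map<String, Object> currentContext() {
            return new HashMap<>();
        }
    }

    /** An ordinary operator: it holds its downstream and forwards the context query to it. */
    static class PassThroughSub implements Sub {
        final Sub downstream;

        PassThroughSub(Sub downstream) {
            this.downstream = downstream;
        }

        public Map<String, Object> currentContext() {
            return downstream.currentContext();
        }
    }

    /** A context-writing step: it copies its downstream's view and adds one entry for everything above it. */
    static class ContextWriteSub implements Sub {
        final Sub downstream;
        final String key;
        final Object value;

        ContextWriteSub(Sub downstream, String key, Object value) {
            this.downstream = downstream;
            this.key = key;
            this.value = value;
        }

        public Map<String, Object> currentContext() {
            Map<String, Object> ctx = new HashMap<>(downstream.currentContext());
            ctx.put(key, value);
            return ctx;
        }
    }

    public static void main(String[] args) {
        // The chain is built from the bottom (final subscriber) upwards, as on subscription.
        Sub end = new FinalSub();
        Sub belowWrite = new PassThroughSub(end);
        Sub write = new ContextWriteSub(belowWrite, "name", "Alice");
        Sub aboveWrite = new PassThroughSub(write);

        System.out.println("above the write: " + aboveWrite.currentContext().get("name"));
        System.out.println("below the write: " + belowWrite.currentContext().get("name"));
    }
}
```

In this model the value is attached at subscription time and is only visible to steps upstream of the write, which is why it cannot carry a value produced by an earlier step such as .map() down to a later .flatMap().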

Get all subnode keys and values from zookeeper

I am attempting to implement ZooKeeper as a shared state engine for an application I am creating in Erlang. The structure for the state would be like the following:
/appRoot
    /parent1:{json}
        /child1:{json}
        /child2:{json}
    /parent2:{json}
        /child1:{json}
        /child2:{json}
I would like a single method that, given /appRoot, returns all parent nodes along with their data, say as a list of tuples [{parent1,{json}}, {parent2,{json}}]. Or, given /appRoot/parent1, a list of its subnodes with their data. So far, all I see is a way to get the keys (getChildren) and then recursively retrieve the data with each key. It seems like I should be able to do this in just one call.
I am currently using the ezk Erlang client library. If anybody knows a better solution, that would be appreciated as well.
TIA
AFAIK there is no way to atomically list a node's children along with their data in the ZooKeeper wire protocol. It seems the only way to do this is the following:
List all of the node's children and watch for changes to the children.
For each child node, read its data and watch for data changes.
Update the corresponding values on each change that happens after you started the traversal.
This can easily be done with ezk, but I have not provided any example code because it depends heavily on the atomicity guarantees your app logic demands.
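Purely as an illustration of the first two steps (list the children, then read each child's data), here is a minimal sketch in Java. It does not use ezk or any real ZooKeeper client: NodeStore, InMemoryStore and childrenWithData are hypothetical names, the in-memory map merely stands in for the server, and watches (and therefore the third step) are left out entirely.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ChildSnapshot {
    /** Hypothetical stand-in for a ZooKeeper client; not a real library interface. */
    interface NodeStore {
        List<String> getChildren(String path);
        String getData(String path);
    }

    /** In-memory store keyed by full path, used only to make the sketch runnable. */
    static class InMemoryStore implements NodeStore {
        private final TreeMap<String, String> nodes = new TreeMap<>();

        void put(String path, String data) {
            nodes.put(path, data);
        }

        public List<String> getChildren(String path) {
            String prefix = path + "/";
            List<String> children = new ArrayList<>();
            for (String candidate : nodes.keySet()) {
                // Keep direct children only: nothing after the prefix may contain another '/'.
                if (candidate.startsWith(prefix) && candidate.indexOf('/', prefix.length()) < 0) {
                    children.add(candidate.substring(prefix.length()));
                }
            }
            return children;
        }

        public String getData(String path) {
            return nodes.get(path);
        }
    }

    /** One getChildren call followed by one getData call per child; the result is NOT an atomic snapshot. */
    static Map<String, String> childrenWithData(NodeStore store, String path) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String child : store.getChildren(path)) {
            String data = store.getData(path + "/" + child);
            if (data != null) { // the child may have been deleted between the two calls
                result.put(child, data);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        store.put("/appRoot/parent1", "{\"p\":1}");
        store.put("/appRoot/parent1/child1", "{\"c\":1}");
        store.put("/appRoot/parent2", "{\"p\":2}");
        System.out.println(childrenWithData(store, "/appRoot"));
    }
}
```

With a real client, each read would also register a watch, and the map would be patched whenever a watch fires after the traversal started.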

Core Data multiple context uniqueness

I am using the Core Data stack shown in the image below. I want to design a structure where objects can be created in both worker contexts.
What I am observing in this setup is that if both contexts try to create the same object (for a unique key) at around the same time, the DB ends up with two rows in the table. Is there a way to solve this? Thanks in advance for your response.
The only way you can ensure uniqueness would be to have a coordinating object that all contexts turn to to verify their operation (a "uniqueness enforcer" if you will).
The general algorithm is described HERE, however you fall under the "multi-threaded/context" category and this will complicate things.
In a multi-threaded environment, your enforcer would have to perform a save to the store (using its own managed object context) before returning results to the calling object.
The general flow would be (no cache version):
A context requests objects for a set of keys from the enforcer
The enforcer issues the request "under lock" (either taking an actual lock or using a serial dispatch queue)
The enforcer queries the store for existing objects
It creates objects for the missing keys and saves them
You might want to mark the objects as stubs, as the caller might not eventually save, and the flag lets you ignore them in the fetch requests in your views
It builds the results array with the objects it created
The results might be NSManagedObjectIDs, or objects imported into the caller's context; otherwise you risk cross-context access of managed objects
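The flow above is independent of Core Data, so here is a minimal sketch of it in Java rather than with Core Data APIs. Everything in it is an assumption made for illustration: UniquenessEnforcer, idsForKeys and rowCount are hypothetical names, a HashMap stands in for the persistent store, a synchronized method plays the role of the lock or serial queue, and the stub flag is omitted. The enforcer returns IDs rather than objects, in the spirit of handing back NSManagedObjectIDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UniquenessEnforcer {
    // Stand-in for the persistent store: unique key -> row ID.
    private final Map<String, Long> store = new HashMap<>();
    private long nextId = 1;

    /** Runs "under lock": looks up existing rows, creates the missing ones, returns one ID per key. */
    public synchronized List<Long> idsForKeys(List<String> keys) {
        List<Long> ids = new ArrayList<>();
        for (String key : keys) {
            Long id = store.get(key);       // query the store for an existing object
            if (id == null) {
                id = nextId++;              // create the missing object...
                store.put(key, id);         // ...and "save" it before returning
            }
            ids.add(id);
        }
        return ids;
    }

    public synchronized int rowCount() {
        return store.size();
    }

    public static void main(String[] args) throws InterruptedException {
        UniquenessEnforcer enforcer = new UniquenessEnforcer();
        // Two "worker contexts" ask for overlapping keys at around the same time.
        Thread a = new Thread(() -> enforcer.idsForKeys(Arrays.asList("k1", "k2")));
        Thread b = new Thread(() -> enforcer.idsForKeys(Arrays.asList("k2", "k3")));
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println("rows: " + enforcer.rowCount());
    }
}
```

Because every lookup-and-create happens inside the same lock, the overlapping key ends up with a single row no matter which worker asks first.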

Breeze projection query from already-loaded entity

If I use breeze to load a partial entity:
var query = EntityQuery.from('material')
    .select('Id, MaterialName, MaterialType, MaterialSubType')
    .orderBy(orderBy.material);
return manager.executeQuery(query)
    .then(querySucceeded)
    .fail(queryFailed);
function querySucceeded(data) {
    var list = partialMapper.mapDtosToEntities(
        manager, data.results, entityNames.material, 'id');
    if (materialsObservable) {
        materialsObservable(list);
    }
    log('Retrieved Materials from remote data source',
        data, true);
}
...and I also want another, slightly different partial query on the same entity (with a few other fields, for example), then I'm assuming that I need to do a separate query, as those fields weren't retrieved in the first query?
OK, so what if I want to use the same fields retrieved in the first query (Id, MaterialName, MaterialType, MaterialSubType) but give them different names in the second query (MaterialName becomes just "name", MaterialType becomes "masterType" and so on)? Is it possible to clone the partial entity I already have in memory (assuming it is in memory) and rename the fields, or do I still need to do a completely separate query?
I think I would "union" the two cases into one projection if I could afford to do so. That would simplify things dramatically. But it's really important to understand the following point:
You do not need to turn query projection results into entities!
Background: the CCJS example
You probably learned about the projection-into-entities technique from the CCJS example in John Papa's superb Pluralsight course "Single Page Apps JumpStart". CCJS uses this technique for a very specific reason: to simplify list update without making a trip to the server.
Consider the CCJS "Sessions List" which is populated by a projection query. John didn't have to turn the query results into entities. He could have bound directly to the projected results. Remember that Knockout happily binds to raw data values. The user never edits the sessions on that list directly. If displayed session values can't change, turning them into observable properties is a waste of CPU.
When you tap on a Session, you go to a Session view/edit screen with access to almost every property of the complete session entity. CCJS needs the full entity there so it looks for the full (not partial) session in cache and, if not found, loads the entity from the server. Even to this point there is no particular value in having previously converted the original projection results into (partial) session entities.
Now edit the Session - change the title - and save it. Return to the "Sessions List".
Question
How do you make sure that the updated title appears in the Sessions List?
If we bound the Sessions List HTML to the projection data objects, those objects are not entities. They're just objects. The entity you edited in the session view is not an object in the collection displayed in the Sessions List. Yes, there is a corresponding object in the list - one that has the same session id. But it is not the same object.
Choices
#1: Refresh the list from the server by reissuing the projection query. Bind directly to the projection data. Note that the data consist of raw JavaScript objects, not entities; they are not in the Breeze cache.
#2: Publish an event after saving the real session entity; the subscribing "Sessions List" ViewModel hears the event, extracts the changes, and updates its copy of the session in the list.
#3: Use the projection-into-entity technique so that you can use a session entity everywhere.
Pros and Cons
#1 is easy to implement. But it requires a server trip every time you enter the Sessions List view.
One of the CCJS design goals was that, once loaded, it should be able to operate entirely offline with zero access to the server. It should work crisply when connectivity is intermittent and poor.
CCJS is your always-ready guide to the conference. It tells you instantly what sessions are available, when and where so you can find the session you want, as you're walking the halls, and get there. If you've been to a tech conference or hotel you know that the wifi is generally awful and the app is almost useless if it only works when it has direct access to the server.
#1 is not well suited to the intended operating environment for CCJS.
The CCJS Jumpstart is part way down that "server independent" path; you'll see something much closer to a full offline implementation soon.
You'll also lose the ability to navigate to related entities. The Sessions List displays each session's track, timeslot and room. That's repetitive information found in the "lookup" reference entities. You'll either have to expand the projection to include this information in a "flattened" view of the session (fatter payload) or get clever on the client-side and patch in the track, timeslot and room data by hand (complexity).
#2 helps with offline/intermittent connectivity scenarios. Of course you'll have to set up the messaging system, establish a protocol about saved entities and teach the Sessions List to find and update the affected session projection object. That's not super difficult - the Breeze EntityManager publishes an event that may be sufficient - but it would take even more code.
#3 is good for "server independence", has a small projection payload, is super-easy, and is a cool demonstration of breeze. You have to manage the isPartial flag so you always know whether the session in cache is complete. That's not hard.
It could get more complicated if you needed multiple flavors of "partial entity" ... which seems to be where you are going. That was not an issue in CCJS.
John chose #3 for CCJS because it fit the application objectives.
That doesn't make it the right choice for every application. It may not be the right choice for you.
For example, if you always have a fast, low latency connection, then #1 may be your best choice. I really don't know.
I like the cast-to-entity approach myself because it is easy and works so well most of the time. I do think carefully about that choice before I make it.
Summary
You do not have to turn projection query results into entities
You can bind to projected data directly, without Knockout observable properties, if they are read-only
Make sure you have a good reason to convert projected data into (partial) entities.
CCJS has a good reason to convert projected query data into entities. Do you?
