Dynamic query usage whole streaming with google dataflow? - google-cloud-dataflow

I have a dataflow pipeline that is set up to receive information(JSON) and transform it into a DTO and then insert it into my database. This works great for insert, but where I am running into issues is with handling delete records. With the information I am receiving there is a deleted tag in the JSON to specify when that record is actually being deleted. After some research/experimenting, I am at a loss as whether or not this is possible.
My question: Is there a way to dynamically choose(or change) what sql statement the pipeline is using, while streaming?

To achieve this with Dataflow you need to think more in terms of water flowing through pipes than in terms of if-then-else coding.
You need to classify your records into INSERTs and DELETEs and route each set to a different sink that will do what you tell them to. You can use tags for that.
In this pipeline design example, instead of startsWithATag and startsWithBTag you can use tags for Insert and Delete.

Related

Is there a way in SumoLogic to store some data and use it in queries?

I have a list of IPs that I want to filter out of many queries that I have in sumo logic. Is there a way to store that list of IPs somewhere so it can be referenced, instead of copy pasting it in every query?
For example, in a perfect world it would be nice to define a list of things like:
things=foo,bar,baz
And then in another query reference it:
where mything IN things
Right now I'm just copying/pasting. I think there may be a way to do this by setting up a custom data source and putting the IPs in there, but that seems like a very round-about way of doing it, and wouldn't help to re-use parts of a query that aren't data (eg re-use statements). Also their template feature is about parameterizing a query, not re-use across many queries.
Yes. There's a notion of Lookup Tables in Sumo Logic. Consult:
https://help.sumologic.com/docs/search/lookup-tables/create-lookup-table/
for details.
It allows to store some values (either manually once, or in a scheduled way as as a result of some query) with | save operator.
And then you can refer to these values using | lookup which is conceptually similar to SQL's JOIN.
Disclaimer: I am currently employed by Sumo Logic.

Event Store DB : temporal queries

regarding to asked question here :
suppose that we have ProductCreated and ProductRenamed events which both contain the title of the product.now we want to query EventStoreDB for all events of type ProductCreated and ProductRenamed with the given title.i want all these events to check whether there is any product in the system which has been created or renamed to the given title, so that i could throw the exception of repetitive title in the domain
i am using MongoDB for creating UI reports from all the published events and everything is fine there.but for checking some invariants, like checking for unique values, i have to either query the event store for some events along with their criteria and by iterating over them, decide whether there is a product created with the same title which has not renamed or a product renamed with the same title.
for such queries, the only way that event store provides is creating a one-time projection with the proper java script code which filters and emits required events to a new stream.and then all i have to do is to fetch events from the new generated stream which is filled by the projection
no the odd thing is, projections are great for subscriptions and generating new streams, but they seem to be odd for doing real time queries.immediately after i create a projection with the HTTP api, i check the new resulting stream for the query result, but it seems that the workers has not got the chance to elaborate on the result and i get 404 response.but after waiting for a bunch of seconds, the new streams pops out and gets filled with the result.
there are too many things wrong with this approach:
first, it seems that if the event store is filled with millions of events across many streams, it wont be able to process and filter all of them immediately to the resulting stream.it does not create the stream immediately, let alone the population.so i have to wait for some time and check for the result hoping the the projection is done
second, i have to fetch multiple times and issue multiple GET HTTP commands which seems to be slow.the new JVM client is not ready yet.
Third, i have to delete the resulting stream after i'm done with the result and failing to do so will leave event store with millions of orphan query result streams
i wish i could pass the java script to some api and get the result page by page like querying MongoDB without worrying about the projection, new streams and timing issues.
i have seen a query section in the Admin UI, but i dont know whats that for, and unfortunetly the documentation doesn't help much
am i expecting the event store to do something that is impossible?
do i have to create a bounded context inner read model for doing such checks?
i am using my events to dehyderate the aggregates and willing to use the same events for such simple queries without acquiring other techniques
I believe it would not be a separate bounded context since the check you want to perform belongs to the same bounded context where your Product aggregate lives. So, the projection that is solely used to prevent duplicate product names would be a part of the same context.
You can indeed use a custom projection to check it but I believe the complexity of such a solution would be higher than having a simple read model in MongoDB.
It is also fine to use an existing projection if you have one to do the check. It might be not what you would otherwise prefer if the aim of the existing projection is to show things in the UI.
For the collection that you could use for duplicates check, you can have the document schema limited to the id only (string), which would be the product title. Since collections are automatically indexed by the id, you won't need any additional indexes to support the duplicate check query. When the product gets renamed, you'd need to delete the document for the old title and add a new one.
Again, you will get a small time window when the duplicate can slip in. It's then up to the business to decide if the concern is real (it's not, most of the time) and what's the consequence of the situation if it happens one day. You'd be able to find a duplicate when projecting events quite easily and decide what to do when it happens.
Practically, when you have such a projection, all it takes is to build a simple domain service bool ProductTitleAlreadyExists.

Compare two PCollections for removal

Every day latest data available in the CloudSQL table, so while writing data into another CloudSQL table, I need to compare the existing data and perform the actions like, remove the deleted data and update the existing data and insert new data.
Could you please suggest best way to do this scenario using Dataflow pipeline (preferable Java).
One thing I identified that using upsert function in CloudSQL, we could do the insert/update the records with the help of jdbc.JdbcIO. But I do not know how to identified collection for removal.
You could read the old and new tables and do a Join followed by a DoFn that compares the two and only outputs changed elements, which can then be written wherever you like.

How do I write to BigQuery a schema computed during execution of the same Dataflow pipeline?

My scenario is a variation on the one discussed here:
How do I write to BigQuery using a schema computed during Dataflow execution?
In this case, the goal is that same (read a schema during execution, then write a table with that schema to BigQuery), but I want to accomplish it within a single pipeline.
For example, I'd like to write a CSV file to BigQuery and avoid fetching the file twice (once to read schema, once to read data).
Is this possible? If so, what's the best approach?
My current best guess is to read the schema into a PCollection via a side output and then use that to create the table (with a custom PTransform) before passing the data to BigQueryIO.Write.
If you use BigQuery.Write to create the table then the schema needs to known when the table is created.
Your proposed solution of not specifying the schema when you create the BigQuery.Write transform might work, but you might get an error because the table doesn't exist and you aren't configuring BigQueryIO.Write to create it if needed.
You might want to consider reading just enough of your CSV files in your main program to determine the schema before running your pipeline. This would avoid the complexity of determining the schema at runtime. You would still incur the cost of the extra read but hopefully that's minimal.
Alternatively you create a custom sink
to write your data to BigQuery. Your Sinks could write the data to GCS. Your finalize method could then create a BigQuery load job. Your custom sink could infer the schema by looking at the records and create the BigQuery table with the appropriate schema.

Querying temporal data in Neo4j

There are several possible ways I can think of to store and then query temporal data in Neo4j. Looking at an example of being able to search for recurring events and any exceptions, I can see two possibilities:
One easy option would be to create a node for each occurrence of the event. Whilst easy to construct a cypher query to find all events on a day, in a range, etc, this could create a lot of unnecessary nodes. It would also make it very easy to change individual events times, locations etc, because there is already a node with the basic information.
The second option is to store the recurrence temporal pattern as a property of the event node. This would greatly reduce the number of nodes within the graph. When searching for events on a specific date or within a range, all nodes that meet the start/end date (plus any other) criteria could be returned to the client. It then boils down to iterating through the results to pluck out the subset who's temporal pattern gives a date within the search range, then comparing that to any exceptions and merging (or ignoring) the results as necessary (this could probably be partially achieved when pulling the initial result set as part of the query).
Whilst the second option is the one I would choose currently, it seems quite inefficient in that it processes the data twice, albeit a smaller subset the second time. Even a plugin to Neo4j would probably result in two passes through the data, but the processing would be done on the database server rather than the requesting client.
What I would like to know is whether it is possible to use Cypher or Neo4j to do this processing as part of the initial query?
Whilst I'm not 100% sure I understand you requirement, I'd have a look at this blog post, perhaps you'll find a bit of inspiration there: http://graphaware.com/neo4j/2014/08/20/graphaware-neo4j-timetree.html

Resources