Compare two PCollections for removal - google-cloud-dataflow

Every day the latest data becomes available in a CloudSQL table, so while writing it into another CloudSQL table I need to compare it against the existing data and perform the appropriate actions: remove deleted records, update existing records, and insert new records.
Could you please suggest the best way to handle this scenario with a Dataflow pipeline (preferably Java)?
One thing I have identified is that with CloudSQL's upsert functionality we can insert/update records with the help of jdbc.JdbcIO, but I do not know how to identify the collection of records that need to be removed.

You could read the old and new tables and do a Join followed by a DoFn that compares the two and only outputs changed elements, which can then be written wherever you like.
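For example, here is a minimal sketch in Java using CoGroupByKey (not the asker's actual pipeline): the Row type parameter stands for whatever POJO your JdbcIO row mapper produces, and keying both tables by primary key is assumed to happen upstream. Keys that appear only in the old table are emitted as deletion candidates, which could then feed a JdbcIO DELETE statement.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

public class DiffTables {
  // Returns the primary keys that exist in the old table but not in the new one.
  public static <Row> PCollection<String> keysToDelete(
      PCollection<KV<String, Row>> oldKeyed,
      PCollection<KV<String, Row>> newKeyed) {

    final TupleTag<Row> oldTag = new TupleTag<>();
    final TupleTag<Row> newTag = new TupleTag<>();

    return KeyedPCollectionTuple.of(oldTag, oldKeyed)
        .and(newTag, newKeyed)
        .apply(CoGroupByKey.<String>create())
        .apply(ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            CoGbkResult grouped = c.element().getValue();
            boolean presentInNew = grouped.getAll(newTag).iterator().hasNext();
            if (!presentInNew) {
              // The key exists only in the old table, so the source row was deleted.
              c.output(c.element().getKey());
            }
          }
        }));
  }
}

The same joined result can drive the insert/update branches: for keys present in both tables, compare the old and new rows and only output the ones that actually changed.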

Related

Is there any standard stored procedure to capture table refresh details in snowflake

I am trying to log table refresh details in a Snowflake DWH.
The details include the following:
Batch Date, Source Table Name, Target Table Name, rows loaded, timestamp, status, err.Message.
Is there any standard SQL/Snowflake stored procedure that could serve as a common one for the entire DWH to trace/audit table refresh details and log them into a single table?
I have variables that capture the batch date, target table name, source table name, etc.
A standard stored procedure that can log the start and the end of the activity would be really helpful.
Regards,
Srinivas
If you are looking for some ideas moving forward, here are a couple of things that can help you out:
Query History is useful, but hard to filter. If you use a query_tag in your batch processes, then you can reference query_history for information.
In addition, if you want to capture information as it's running, you could use Streams and Tasks on your tables to capture counts of updates/inserts/deletes, etc. for each batch in the background.
There is no standard stored procedure that you can leverage within Snowflake to query this information, but there is a lot of data available in the snowflake.account_usage share.
Not sure what exactly you're trying to achieve here, but
you can use last_altered on a table to see when the data was last modified
you can filter the query_history view to see what queries modified the table:
https://docs.snowflake.com/en/sql-reference/functions/query_history.html
You can take advantage of Snowflake Streams https://docs.snowflake.com/en/sql-reference/sql/create-stream.html
When you create a stream, you point it at a target table. Your stream then records the changes produced on the target table (INSERTs, UPDATEs and DELETEs) between two points in time.
You can select from your stream like any other table to look for changes.
What's great about streams is that after a successful DML operation consumes data from a stream, the stream is purged, so when you query it afterwards it will be empty.
Use them guilt-free: streams don't duplicate your data, they just store the offset and the CDC metadata, so the data remains in your table.
Some useful guides that cover something close to what you need:
- Part 1: https://www.snowflake.com/blog/building-a-type-2-slowly-changing-dimension-in-snowflake-using-streams-and-tasks-part-1/
- Part 2: https://www.snowflake.com/blog/building-a-type-2-slowly-changing-dimension-in-snowflake-using-streams-and-tasks-part-2/

"Transactional safety" in influxDB

We have a scenario where we want to frequently change the tag of a (single) measurement value.
Our goal is to create a database that stores prognosis values. It should never lose data, and it should track changes to already written data, such as modifications or overwrites.
Our current plan is to have an additional field "write_ts", which indicates at which point in time the measurement value was inserted or changed, and a tag "version" which is updated with each change.
Furthermore the version '0' should always contain the latest value.
name: temperature
-----------------
time                  write_ts (val)  current_mA (val)  version (tag)  machine (tag)
2015-10-21T19:28:08Z  1445506564      25                0              injection_molding_1
So let's assume I have an updated prognosis value for this example value.
So, I do:
SELECT curr_measurement
INSERT curr_measurement with new tag (version = 1)
DROP curr_measurement
//then
INSERT new_measurement with version = 0
Now my question:
If I lose the connection for whatever reason in between the SELECT, INSERT, DROP:
I would get duplicate records.
(Or if I do SELECT, DROP, INSERT: I lose data.)
Is there any method to prevent that?
Transactions don't exist in InfluxDB
InfluxDB is a time-series database, not a relational database. Its main use case is not one where users are editing old data.
In a relational database that supports transactions, you are protecting yourself against UPDATE and similar operations. Data comes in, existing data gets changed, you need to reliably read these updates.
The main use case in time-series databases is a lot of raw data coming in, followed by some filtering or transforming to other measurements or databases. Picture a one-way data stream. In this scenario, there isn't much need for transactions, because old data isn't getting updated much.
How you can use InfluxDB
In cases like yours, where there is additional data being calculated based on live data, it's common to place this new data in its own measurement rather than as a new field in a "live data" measurement.
As for version tracking and reliably getting updates:
1) Does the version number tell you anything the write_ts number doesn't? Consider not using it if it's simply a proxy for write_ts. If version only ever increases, it might be duplicating the info given by write_ts, minus the usefulness of knowing when the change was made. If version is expected to decrease from time to time, then it makes sense to keep it.
2) Similarly, if you're keeping old records: does write_ts tell you anything that the time value doesn't?
3) Logging. Do you need to overwrite (update) values? Or can you get what you need by adding new lines, increasing write_ts or version as appropriate? The latter is a more "InfluxDB-ish" approach.
4) Reading values. You can read all values as they change with updates. If a client app only needs to know the latest value of something that's being updated (and the time it was updated), querying becomes something like:
SELECT LAST(write_ts), current_mA, machine FROM temperature
You could also try grouping the machine values together:
SELECT LAST(*) FROM temperature GROUP BY machine
So what happens instead of transactions?
In InfluxDB, inserting a point with the same tag keys and timestamp over-writes any existing data with the same field keys, and adds new field keys. So when duplicate entries are written, the last write "wins".
So instead of the traditional SELECT-then-UPDATE approach, it's more like: SELECT A, calculate on A, then INSERT the results as B, possibly with a new timestamp.
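If it helps to see the "last write wins" behaviour from code, here is a minimal sketch using the influxdb-java client (the URL, credentials, database name and retention policy are placeholders, not from the question): two writes with the same timestamp and tag set end up as a single point carrying the fields of the second write.

import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Point;
import java.util.concurrent.TimeUnit;

public class OverwriteDemo {
  public static void main(String[] args) {
    InfluxDB influxDB = InfluxDBFactory.connect("http://localhost:8086", "user", "pass");
    long ts = 1445455688000L; // same timestamp for both writes (ms)

    // First write.
    influxDB.write("prognosis", "autogen", Point.measurement("temperature")
        .time(ts, TimeUnit.MILLISECONDS)
        .tag("machine", "injection_molding_1")
        .addField("current_mA", 25L)
        .addField("write_ts", 1445506564L)
        .build());

    // Second write with an identical timestamp and tag set: its fields replace
    // the earlier ones, so no duplicate row is created.
    influxDB.write("prognosis", "autogen", Point.measurement("temperature")
        .time(ts, TimeUnit.MILLISECONDS)
        .tag("machine", "injection_molding_1")
        .addField("current_mA", 27L)
        .addField("write_ts", 1445510000L)
        .build());

    influxDB.close();
  }
}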
Personally, I've found InfluxDB excellent for its ability to accept streams of data from all directions, and its simple protocol and schema-free storage mean that new data sources are almost trivial to add. But if my use case has old data being regularly updated, I use a relational database.
Hope that clears up the differences.

Is there any way to update multiple child values at once?

I need to set all the child distances to 0 (check the photo of the Firebase DB below) in one operation. Is there any way I can do this? The usual update function for Firebase generally works for only one userID.
To write a value, the client must specify the complete path to that value in the database. Firebase does not support the equivalent of SQL's update queries.
So you will need to first load the data, and then update each child. You can perform those updates in a big batch if you want, using multi-location updates. For more on those, see the blog post introducing them and the answer here: Firebase - atomic write of multiple values to multiple locations
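For illustration, here is a rough sketch with the Firebase Admin SDK for Java (the "users" path and the "distance" child name are assumptions based on the description, not taken from the screenshot): it reads the children once, builds one map entry per user, and commits everything as a single multi-location update.

import com.google.firebase.database.DataSnapshot;
import com.google.firebase.database.DatabaseError;
import com.google.firebase.database.DatabaseReference;
import com.google.firebase.database.FirebaseDatabase;
import com.google.firebase.database.ValueEventListener;
import java.util.HashMap;
import java.util.Map;

public class ResetDistances {
  // Assumes FirebaseApp has already been initialised elsewhere.
  public static void resetAllDistances() {
    final DatabaseReference usersRef = FirebaseDatabase.getInstance().getReference("users");
    usersRef.addListenerForSingleValueEvent(new ValueEventListener() {
      @Override
      public void onDataChange(DataSnapshot snapshot) {
        Map<String, Object> updates = new HashMap<>();
        for (DataSnapshot user : snapshot.getChildren()) {
          // One entry per user: "<uid>/distance" -> 0
          updates.put(user.getKey() + "/distance", 0);
        }
        // Single multi-location write; all paths are applied together.
        usersRef.updateChildrenAsync(updates);
      }

      @Override
      public void onCancelled(DatabaseError error) {
        System.err.println("Read failed: " + error.getMessage());
      }
    });
  }
}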

Avoiding round-trips when importing data from Excel

I'm using EF 4.1 (Code First). I need to add/update products in a database based on data from an Excel file. As discussed here, one way to achieve this is to use dbContext.Products.ToList() to force loading all products from the database, then use db.Products.Local.FirstOrDefault(...) to check whether a product from Excel exists in the database and proceed accordingly with an insert or update. This is only one round-trip.
Now, my problem is that there are too many products in the database, so it's not possible to load them all into memory. What's the way to achieve this without multiplying round-trips to the database? My understanding is that if I just do a search with db.Products.FirstOrDefault(...) for each Excel product to process, this will perform a round-trip each time, even if I issue the statement for the exact same product several times! What's the purpose of EF caching objects and returning the cached value if it goes to the database anyway?
There is actually no way to make this better. EF is not a good solution for this kind of task. You must know whether a product already exists in the database to choose the correct operation, so you always need an additional query. You can group multiple products into a single query using .Contains (like SQL IN), but that only solves the existence check. The worse problem is that each INSERT or UPDATE is executed in a separate round-trip as well, and there is no way to solve this because EF doesn't support command batching.
Create a stored procedure and pass the product information to it. The stored procedure will perform an insert or update based on the existence of the record in the database.
You can even use more advanced features like table-valued parameters to pass multiple records from Excel into the procedure in a single call, or import the Excel data into a temporary table (for example with SSIS) and process it all directly on the SQL Server. Lastly, you can use bulk insert to get all records into a special import table and again process them with a single stored procedure call.

Fetch data from multiple tables and sort all by their time

I'm creating a history page. I was wondering if there is any way to fetch all rows from multiple tables and then sort them by their time? Every table has a field called "created_at".
So is there any way to fetch from all the tables and sort without having Rails sort them for me?
You may get a better answer, but I would presume you would need to
Create a History table with a Created date column, an autogenerated Id column, and any other contents you would like to expose [eg Name, Description]
Modify all tables that generate a "history" item to consume this new table via Foreign Key relationship on History.Id
"Mashing up" tables [ie merging different result sets into a single result set] is a very difficult problem, but you would effectively be doing the above anyway - just in the application layer, so why not do it correctly and more efficiently in the data layer.
Hope this helps :)
You would need to perform SQL like:
SELECT * FROM table ORDER BY created_at ASC
Store the result in an array. Do this for each of the data sources, then perform a merge sort on all the arrays in Ruby. Of course this will work well for small data sets, but once you get a data set that is large (i.e. larger than will fit into memory), you will have to use a different collect/merge algorithm.
So I guess the answer is that you do need to do some sorting in Ruby, unless you resort to the UNION method described in another answer.
Depending on whether these databases are all on the same machine or not:
On the same machine: use ORDER BY and UNION statements in your SQL to return your result set.
On different machines: you'll want to test this for performance, but you could use linked servers with UNION and ORDER BY. Alternatively, you could have Ruby get the results from each DB, then combine and sort them.
EDIT: From your last comment about different tables rather than different DBs, use something like this:
SELECT Created FROM table1
UNION
SELECT Created FROM table2
ORDER BY created
