Consume only the data I need in a Task - Snowflake - stored procedures

I have 3 streams, one for each of 3 different tables that are batch-loaded once per day. On any given day one of the tables may not load any data while the others do.
I have a task that is triggered when all 3 streams have data; with a cursor and a FOR loop it iterates over each process date (a column) and calls a stored procedure ONLY if all 3 streams have data for that same process date.
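Roughly, the trigger condition looks like this (a simplified sketch; the warehouse, schedule and object names are placeholders, not the real ones):
CREATE OR REPLACE TASK LOAD_ALL_THREE
  WAREHOUSE = MY_WH
  SCHEDULE = 'USING CRON 0 6 * * * UTC'
  WHEN SYSTEM$STREAM_HAS_DATA('STREAM_A')
   AND SYSTEM$STREAM_HAS_DATA('STREAM_B')
   AND SYSTEM$STREAM_HAS_DATA('STREAM_C')
AS
  CALL LOAD_FINAL_TABLE();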
All of that works fine, but the problem appears when one date has data in every stream while another date has data in only two of the streams. When the SP is called, the data for the process date that is present in all 3 streams is loaded into the final table (with a MERGE INTO) and the data for the other process date, present in only two of the streams, is not, BUT the data of all 3 streams is deleted!
How can I resolve this? By copying the contents of the streams into persistent tables and working with those? I was hoping that streams had a feature I wasn't aware of.
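To be concrete, the workaround I have in mind inside the SP would look roughly like this (just a sketch; the table, stream and column names are made up): consume each stream exactly once into a persistent staging table, then run the per-date merges against the staging tables instead of the streams.
-- one staging table per stream, reused every run
CREATE TABLE IF NOT EXISTS STAGE_A LIKE TABLE_A;
TRUNCATE TABLE STAGE_A;
-- a single INSERT consumes the whole stream, so no pending rows are lost
INSERT INTO STAGE_A (ID, AMOUNT, PROCESS_DATE)
  SELECT ID, AMOUNT, PROCESS_DATE FROM STREAM_A;
-- later, for each process date, merge from the staging table (not the stream);
-- :P_DATE is the FOR-loop variable holding the current process date
MERGE INTO FINAL_TABLE F
USING (SELECT * FROM STAGE_A WHERE PROCESS_DATE = :P_DATE) S
  ON F.ID = S.ID AND F.PROCESS_DATE = S.PROCESS_DATE
WHEN MATCHED THEN UPDATE SET AMOUNT = S.AMOUNT
WHEN NOT MATCHED THEN INSERT (ID, PROCESS_DATE, AMOUNT)
  VALUES (S.ID, S.PROCESS_DATE, S.AMOUNT);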

Related

Why does my InfluxDB insert a new row of data only every 10 seconds?

I wrote a timed task in C# to insert one row of data per second, but I found that only one row of data is inserted every 10 seconds.
I also noticed that new insert requests within 10 seconds will only update the same row of data and not insert a new one.
What is the setting that causes this and how do I change it?
The version of InfluxDB is 2.2; I downloaded it from the website and started it directly without changing any configuration.
You are probably using the query builder, which aggregates data (it prepares a query with aggregation); the window-period setting in the InfluxDB v2 web GUI is an example of this.
Setting the period to 1s or writing your own query without any aggregation should solve your problem.
What is more: writing data to InfluxDB with the same tag set, the same timestamp and the same field name will overwrite the existing value in InfluxDB. So the described behaviour is normal.

Accumulate delayed data and trigger it after 5 minutes

I have a Google Dataflow job that reads data from PubSub, aggregates the data and, in the end, sends it to InfluxDB. What I want to achieve is to aggregate the data in windows of 1 minute but to have only one entry in the DB for each minute. The problem is that I want to allow late data, so I need to accumulate the data over a period of 5 minutes and then send a single entry to the DB.
Is it possible? I tried to do that with the below code, but I don't get what I want:
input.apply(Window
    .<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(
        AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(5)))
    .withAllowedLateness(Duration.standardMinutes(5))
    .discardingFiredPanes());
I have already contributed to a similar question. You can use .triggering(Never.ever()) to omit sending the ON TIME panes. Then, as you are already doing, set the allowed lateness to 5 minutes for late records.
It's also important to change the Window.ClosingBehavior to FIRE_ALWAYS. This way we account for the case where there is no late data but we haven't emitted the on-time records. Once the window is closed it will always emit a final pane with PaneInfo.isLast set to true.
So, for your case, the code would be something like:
input.apply(Window
    .<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(Never.ever())
    .withAllowedLateness(Duration.standardMinutes(5), Window.ClosingBehavior.FIRE_ALWAYS)
    .discardingFiredPanes());

How to get the daily data of "Microsoft.VSTS.Scheduling.CompletedWork"?

We need to get the daily data from the "Microsoft.VSTS.Scheduling.CompletedWork" field (which is detailed in Workload, scheduling and time tracking field references). However, when I get data from the Analysis database, I find that it only records the latest value, so I can't get the historical data.
For example, take the task with ID 3356, whose "CompletedWork" is 3 hours on 2016/8/4; I get exactly that 3-hour value from the Analysis database on the following day, 2016/8/5, as the pictures in this post show.
Then, on 2016/8/5, I update the "CompletedWork" from 3 hours to 4 hours, and I get exactly the 4-hour value from the Analysis database on the following day, 2016/8/6. However, the 3-hour value from 2016/8/4 is lost. So, how can I get the historical data of "Microsoft.VSTS.Scheduling.CompletedWork"?
First of all, it's important to understand that CompletedWork is a cumulative data field. So when one user enters 3 and another enters 4, the total number of hours recorded in the field is 4, not 7.
The warehouse has a granularity of a day and keeps that data in the cube, though the relational warehouse tables store all the changes to the reportable fields on a per-revision basis. You can't easily query this data using the cube or Excel Power Pivot, and the per-revision details are lost in the Dim* and Fact* tables, but you can write a SQL query against tfs_warehouse and iterate through the tables containing the work item data (tbl_workitems[are|were|latest]). This is much slower and much harder to build, unfortunately.
Your other alternative is to use the TFS Client Object Model and query the WorkItemStore object directly. You'll be able to query all work items of interest and iterate through them and their revisions. The API for workitems is relatively easy to use and is well documented.
If you're on TFS 2015 you can also use the new REST API to query work item data and revisions.

Is there a way to tell Google Cloud Dataflow that the data coming in is already ordered?

We have an input data source that is approximately 90 GB (it can be either a CSV or XML, it doesn't matter) that contains an already ordered list of data. For simplicity, you can think of it as having two columns: time column, and a string column. The hundreds of millions of rows in this file are already ordered by the time column in ascending order.
In our Google Cloud Dataflow pipeline, we have modeled each row as an element in our PCollection, and we apply DoFn transformations to the string field (e.g. counting the number of uppercase characters in the string). This works fine.
However, we then need to apply functions that are supposed to be calculated for a block of time (e.g. five minutes) with a one minute overlap. So, we are thinking about using a sliding windowing function (even though the data is bounded).
However, the calculation logic that needs to be applied over these five-minute windows assumes that the data is ordered (i.e. ascending) by the time field. My understanding is that even when using these windowing functions, one cannot assume that the PCollection elements within each window are ordered in any way, so one would need to manually iterate through each window and reorder them, right? However, this seems like a huge waste of computational power, since the incoming data is already ordered. So is there a way to teach/inform Google Cloud Dataflow that the input data is ordered, so that it maintains that order even within the windows?
On a minor note, I had another question: my understanding is that if the data source is unbounded, there is never an "overall aggregation" function that would ever execute, as it never really makes sense (since there is no end to the incoming data); however, if one uses a windowing function for bounded data, there is a true end state, which corresponds to when all the data has been read from the CSV file. Therefore, is there a way to tell Google Cloud Dataflow to do a final calculation once all the data has been read in, even though we are using a windowing function to divide the data up?
SlidingWindows sounds like the right solution for your problem. The ordering of the incoming data is not preserved across a GroupByKey, so informing Dataflow of that would not be useful currently. However, the batch Dataflow runner does already sort by timestamp in order to implement windowing efficiently, so for simple windowing like SlidingWindows, your code will see the data in order.
If you want to do a final calculation after doing some windowed calculations on a bounded data set, you can re-window your data into the global window again, and do your final aggregation after that:
p.apply(Window.into(new GlobalWindows()));

iOS and MySQL Events

I'm working on an app that connects to a MySQL backend. It's a little similar to Snapchat in that once the current user gets the pics from the users they follow and views them, they can never see those pics again. However, I can't just delete the pics from the database; the user who uploaded a pic still needs to see it. So I've come up with an interesting design and I want to know if it's good or not.
When uploading the pic I would also create a MySQL event that would run exactly one day after the pic was uploaded, deleting it. If people upload pics all the time, events would be created all the time. How does this affect the MySQL database? Is this even scalable?
No, not scalable: deleting single records is quick, but as your volume increases, you run into trouble. You do, however, have a classic case for using partitioning:
Create table your_images (insert_date DATE, some_image BLOB, some_owner INT)
ENGINE=InnoDB /* row_format=compressed key_block_size=4 */
PARTITION BY RANGE COLUMNS (insert_date) (
PARTITION p01 VALUES LESS THAN ('2015-07-12'),
PARTITION p02 VALUES LESS THAN ('2015-07-13'),
PARTITION p0x VALUES LESS THAN (ETC),
PARTITION p0n VALUES LESS THAN (MAXVALUE));
You can then insert just as you are used to, drop partitions once per day (using one event for all your data), and create new partitions also once per day (using the same event that drops your old partitions), as sketched below.
To make certain a photo lives for at least 24 hours, the partition cleanup has to occur with a 1-day delay (so clean up the day before yesterday, not yesterday itself).
A date filter in the query that fetches images from the database is still needed to prevent images older than a day from being displayed.
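A rough sketch of that daily rotation (the event name and the helper procedure are hypothetical; the partition names follow the example table above):
-- 1) make room for the next day by splitting the catch-all MAXVALUE partition
ALTER TABLE your_images REORGANIZE PARTITION p0n INTO (
  PARTITION p03 VALUES LESS THAN ('2015-07-14'),
  PARTITION p0n VALUES LESS THAN (MAXVALUE));
-- 2) drop the partition holding the day-before-yesterday's photos (the 1-day delay above)
ALTER TABLE your_images DROP PARTITION p01;
-- 3) have a single daily event run both steps, e.g. through a small helper procedure
--    (rotate_image_partitions() is a hypothetical procedure that builds the two
--    ALTER statements for the current date)
CREATE EVENT ev_rotate_image_partitions
  ON SCHEDULE EVERY 1 DAY
  DO CALL rotate_image_partitions();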
