How to stream output from a continuous view in PipelineDB?

I have set up PipelineDB and it works great! I would like to know if it's possible to stream data out of a continuous view after the value in the view has been updated? That is, have some external process act on changes to a view.
I wish to stream metrics generated from the views into a dashboard, and I do not want to poll the database to achieve this.

As of 0.9.5, continuous triggers have been removed in favour of using output streams and continuous transforms. (First suggested by DidacticTactic). The output of a continuous view is essentially a stream, which means you can create continuous views or transforms based on it.
Simple Example:
First, create a stream and a continuous view.
CREATE STREAM s (
    x int
);

CREATE CONTINUOUS VIEW hourly_cv AS
SELECT
    hour(arrival_timestamp) AS ts,
    SUM(x) AS sum
FROM s GROUP BY ts;
Every continuous view now has an output stream. You can create a transform based on the output of the view using output_of. In the transform you have access to the tuples old and new, which represent the old and new values respectively (0.9.7 adds a third called delta). So you can create a transform that uses the output of 'hourly_cv' like so:
CREATE CONTINUOUS TRANSFORM hourly_ct AS
SELECT
    (new).sum
FROM output_of('hourly_cv')
THEN EXECUTE PROCEDURE update();
In this example I'm calling update(), which we still need to define. It needs to be a function that returns a trigger.
CREATE OR REPLACE FUNCTION update()
RETURNS trigger AS
$$
BEGIN
    -- Do anything you want here.
    RETURN NEW;
END;
$$
LANGUAGE plpgsql;
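For example, a minimal sketch of a body that pushes each update to a listening dashboard process via pg_notify (the channel name 'hourly_totals' is arbitrary, and this assumes the transform's output row arrives as NEW):
CREATE OR REPLACE FUNCTION update()
RETURNS trigger AS
$$
BEGIN
    -- Push the latest sum to any LISTENing clients; 'hourly_totals' is
    -- an arbitrary channel name chosen for this sketch.
    PERFORM pg_notify('hourly_totals', (NEW.sum)::text);
    RETURN NEW;
END;
$$
LANGUAGE plpgsql;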
I found the 0.9.5 release notes blog post helpful to understand output streams and why continuous triggers are no more.

Check out the sections in our technical docs on output streams and continuous transforms for help on how to do this, and feel free to ping us in our Gitter channel if you need help beyond what you find in the docs.

I felt like a bit of an idiot trying to figure out what the answer to this could be using the tools Didactic provided. Maybe I am blind, but I still have not found a way. I found the 0.9.3 version of the DB, which included continuous triggers, but they have since been removed and I don't wish to switch to an older version of the DB.
This is a bit sad, but I suppose they were moved out of the open source version of the project to accommodate the real-time analytics dashboard product that the same company provides.
Either way, I solved this issue by using a stored procedure. It's probably slightly inefficient compared to what a built-in function would provide, but I am hitting the DB a few thousand times a minute and my VM's CPU and RAM just yawn at me.
CREATE OR REPLACE FUNCTION all_insert(text, text)
RETURNS void AS
$BODY$
DECLARE
    result text;
BEGIN
    INSERT INTO all_in (streamid, generalinput) VALUES ($1, $2);
    SELECT array_to_json(array_agg(json_build_object('streamId', streamid, 'total', count)))::text
        INTO result
        FROM totals;
    PERFORM pg_notify('totals', result);
END
$BODY$
LANGUAGE plpgsql;
So my insert and notify are done by calling this single stored procedure. Then my application simply has to listen for PostgreSQL NOTIFY events and handle them appropriately. In the example above, the application would receive a JSON object with the particular stream id and the total associated with it.
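For completeness, a minimal sketch of the listening side using the pgjdbc driver (the connection details and the bare polling loop are illustrative only):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.postgresql.PGConnection;
import org.postgresql.PGNotification;

public class TotalsListener {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/pipeline", "user", "pass");
        try (Statement st = conn.createStatement()) {
            st.execute("LISTEN totals"); // same channel used by pg_notify() above
        }
        PGConnection pgConn = conn.unwrap(PGConnection.class);
        while (true) {
            // Block for up to 10 seconds waiting for a NOTIFY to arrive.
            PGNotification[] notifications = pgConn.getNotifications(10000);
            if (notifications == null) continue;
            for (PGNotification n : notifications) {
                // n.getParameter() is the JSON payload built in all_insert().
                System.out.println("totals: " + n.getParameter());
            }
        }
    }
}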

Related

Apache Beam: read from UnboundedSource with fixed windows

I have an UnboundedSource that generates N items (it's not in batch mode, it's a stream -- one that only generates a certain number of items and then stops emitting new items, but a stream nonetheless). Then I apply a certain PTransform to the collection I'm getting from that source. I also apply the Window.into(FixedWindows.of(...)) transform and then group the results by window using Combine. So it's kind of like this:
pipeline.apply(Read.from(new SomeUnboundedSource(...))) // extends UnboundedSource
    .apply(Window.into(FixedWindows.of(Duration.millis(5000))))
    .apply(new SomeTransform())
    .apply(Combine.globally(new SomeCombineFn()).withoutDefaults());
And I assumed that would mean new events are generated for 5 seconds, then SomeTransform is applied to the data in the 5-second window, then a new set of data is polled and therefore generated. Instead all N events are generated first, and only after that is SomeTransform applied to the data (but the windowing works as expected). Is it supposed to work like this? Does Beam and/or the runner (I'm using the Flink runner, but the Direct runner seems to exhibit the same behavior) have some sort of queue where it stores items before passing them on to the next operator? Does that depend on what kind of UnboundedSource is used? In my case it's a generator of sorts. Is there a way to achieve the behavior that I expected, or is it unreasonable? I am very new to working with streaming pipelines in general, let alone Beam. I assume, however, it would be somewhat illogical to try to read everything from the source first, seeing as it's, you know, unbounded.
An important thing to note is that windows in Beam operate on event time, not processing time. Adding 5-second windows to your data is not a way to prescribe how the data should be processed, only how the results of aggregations over that data are grouped. Further, windows only affect the data once an aggregation is reached, like your Combine.globally. Until that point in your pipeline, the windowing you applied has no effect.
As to whether it is supposed to work that way: the Beam model doesn't specify any particular processing behavior, so other runners may process elements slightly differently. However, this is still a correct implementation. It isn't trying to read everything from the source; generally, streaming sources in Beam will attempt to read all elements available before moving on and coming back to the source later. If you were to adjust your stream to emit elements slowly over a long period of time, you would likely see more processing happen in between reads from the source, as sketched below.
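If it helps to see event time in action, here is a minimal sketch using Beam's TestStream, which lets you control watermark advancement explicitly (assuming the usual Beam imports; element values and timestamps are arbitrary):
TestStream<Integer> source = TestStream.create(VarIntCoder.of())
    .addElements(TimestampedValue.of(1, new Instant(0L)),
                 TimestampedValue.of(2, new Instant(1000L)))
    .advanceWatermarkTo(new Instant(5000L)) // the first 5s window may now fire
    .addElements(TimestampedValue.of(3, new Instant(6000L)))
    .advanceWatermarkToInfinity();

pipeline.apply(source)
    .apply(Window.<Integer>into(FixedWindows.of(Duration.millis(5000))))
    .apply(Sum.integersGlobally().withoutDefaults());
With the Direct runner this emits a sum for the [0s, 5s) window before the element at 6s is even added, because the watermark, not the order of reads, decides when a window closes.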

How do I get the TimeFrame for an open order in MT mq4?

I'm scanning through the order list using the standard OrderSelect() function. Since there is a great function to get the current _Symbol for an order, I expected to find the equivalent for finding the timeframe (_Period). However, there is no such function.
Here's my code snippet.
...
for (int i = orderCount() - 1; i >= 0; i--) {
    if (OrderSelect(i, SELECT_BY_POS, MODE_TRADES)) {
        if (OrderMagicNumber() == magic && OrderSymbol() == _Symbol) j++;
        // Get the timeframe here
    }
}
...
Q: How can I get an open order's timeframe given its ticket number?
In other words, how can I roll my own OrderPeriod() or something like it?
There is no such function, but a few approaches might be helpful here.
The first and most reasonable approach is to have a unique magic number for each timeframe. This usually helps to avoid unexpected behavior and errors. You can update the input magic number so that the timeframe is automatically added to it: if your input magic is 123 and the timeframe is M5, the new magic number will be 1235 or something similar, and you will use this new magic when sending orders and when checking whether a particular order is from your timeframe. Or make it depend on both the input magic and the timeframe, if you need that. A sketch of one possible encoding is below.
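A minimal sketch of such an encoding (the multiplier and helper names are mine, not a standard API; note that _Period is expressed in minutes, so M5 = 5, H1 = 60, MN1 = 43200):
int MagicForTimeframe(const int baseMagic) {
    // Reserve the last five digits of the magic for the timeframe in minutes.
    return baseMagic * 100000 + _Period;
}

int PeriodFromMagic(const int magic) {
    return magic % 100000;
}

// When sending:  OrderSend(..., MagicForTimeframe(123), ...);
// When scanning: int tf = PeriodFromMagic(OrderMagicNumber());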
The second approach is to create a comment for each order, and that comment should include the timeframe, e.g. "myRobot_5"; you then parse OrderComment() to get the timeframe value. I doubt it makes sense, as you'll have to do useless string parsing many times per tick. Another problem here is that the comment can usually be changed by the broker, e.g. if a stop loss or take profit is executed (and you need to analyze history), or if an order was partially closed. A parsing sketch follows anyway.
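A hedged sketch of that parsing, assuming comments like "myRobot_5" and that the broker left them intact:
string parts[];
int tf = 0;
if (StringSplit(OrderComment(), '_', parts) >= 2)
    tf = (int)StringToInteger(parts[1]);   // "myRobot_5" -> 5, i.e. M5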
One more way is to have instances of some structure, or of a class inherited from CObject, kept in a CArrayObj or an array. You will be able to add as much data as needed into such structures, and even change the timeframe when needed (e.g., you open a deal at M5, trail it at M5, it performs fine so you close part of it and virtually move the deal to M15 and trail it on the M15 chart). That is probably the most convenient approach for complex systems, even though it requires some coding (do not forget to serialize the list of existing deals to a file in OnDeinit() and deserialize it back in OnInit()). A rough sketch is below.
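A rough sketch of that idea, under the assumption that you register each deal yourself when sending orders (all names here are illustrative):
#include <Arrays\ArrayObj.mqh>

class CDealInfo : public CObject {
public:
    int m_ticket;
    int m_timeframe;
    void Set(const int ticket, const int timeframe) {
        m_ticket    = ticket;
        m_timeframe = timeframe;
    }
};

CArrayObj g_deals;

void RememberDeal(const int ticket) {
    CDealInfo *info = new CDealInfo;
    info.Set(ticket, _Period);  // store the chart timeframe with the ticket
    g_deals.Add(info);          // CArrayObj owns and later frees the pointer
}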

Question about SPSS Modeler (there is an obstacle to making the stream run automatically)

I have an SPSS Modeler stream which is currently used and updated every week to generate a certain dataset. The raw data for this stream is also renewed on a weekly basis.
In part of this stream there is a chunk of nodes that has to be modified and updated manually every week. The sequence of this part is: Type Node => Restructure Node => Aggregate Node
To simplify the explanation of those nodes' roles, I drew an image of them as below.
Because the original raw data changes on a weekly basis, the range of the Unit value above always varies, sometimes more than 6 (maybe 100), other times less than 6 (maybe 3). That is why somebody has had to modify and update that chunk of nodes every week until now. *The Unit value has a certain upper limit (300 for now)
However, we are now aiming to run this stream automatically, without any human operations on it, so we need to customize that part so it works perfectly and automatically. Please help; your efforts will be much appreciated, thanks!
To automate this, I suggest using Global nodes combined with CLEM scripts inside the execution (default script). I have a stream that calculates the first date and the last date, and those variables are used to rename files at the end of execution. I think you could use something similar, as explained here:
1) Create Derive nodes to bring in the unit values used in the weekly stream
2) Save this information in a table named 'count_variable'
3) Use a Global node named Global with a query similar to this:
#GLOBAL_MAX(variable created in (2)) (only to record the number of variables; step 2 created a table with only one value, so GLOBAL_MAX will simply bring back the number of variables)
4) The script inside the execution tab will be similar to this:
execute count_variable
var tabledata
var fn
set tabledata = count_variable.output
set count_variable = value tabledata at 1 1
execute Global
5) You can now use the number of variables just by referring to the already created "count_variable"
It's not easy to explain just by typing, but I hope this helps.
Please mark this answer as +1 if it was relevant.
I think there is a better, simpler and more effective (yet risky, due to the node's requirements on input data) solution to your problem. It is called the Transpose node and does exactly that: it pivots your table. But only from version 18.1 onwards. Here's an example:
https://developer.ibm.com/answers/questions/389161/how-does-new-feature-partial-transpose-work-in-sps/

Apache Beam - Delta between windows

Apologies: in trying to be concise and clear, my previous description of my question turned into a special case of the general case I'm trying to solve.
New Description
I'm trying to compare the last emitted value of an aggregation function (let's say Sum()) with each element that I aggregate over in the current window.
Worth noting that the ideal (I think) solution would include:
The T2(from t-1) element used at time = t is the one that was created during the previous window.
I've been playing with several ideas/experiments, but I'm struggling to find a way to accomplish this that is elegant and "empathetic" to Beam's compute model (which I'm still trying to fully grok after many an article/blog/doc and book :)
Side inputs seem unwieldy, because it looks like I have to shift the emitted 5M#T-1 aggregation's timestamp into the 5M#T window in order to align it with the current 5M window.
In attempting this with side inputs (as I understand them), I ended up with some nasty code that was quite "circularly referential", but not in an elegant recursive way :)
Any help in the right direction would be much appreciated.
Edit:
Modified diagram and improved description to more clearly show:
the intent of using emitted T2(from t-1) to calculate T2 at t
that the desired T2(from t-1) used to calculate T2 is the one with the correct key
Instead of modifying the timestamps of records that are materialized so that they appear in the current window, you should supply a window mapping fn which just maps the current window onto the past one.
You'll want to create a custom WindowFn which implements the window mapping behavior that you want, paying special attention to overriding the getDefaultWindowMappingFn function.
Your pipeline would look something like:
PCollection<T> mySource = /* data */;
PCollectionView<SumT> view = mySource
    .apply(Window.into(myCustomWindowFnWithNewWindowMappingFn))
    .apply(Combine.globally(myCombiner).asSingletonView());
mySource.apply(ParDo.of(/* DoFn that consumes side input */).withSideInputs(view));
Pay special attention to the default value the combiner will produce since this will be the default value when the view has had no data emitted to it.
Also, the easiest way to write your own custom window function is to copy an existing one.
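For illustration, a minimal sketch of such a mapping for 5-minute fixed windows (class and field names are mine; your custom WindowFn would return this from getDefaultWindowMappingFn, and the usual Beam imports are assumed):
class PreviousWindowMappingFn extends WindowMappingFn<IntervalWindow> {
    private final Duration size = Duration.standardMinutes(5);

    @Override
    public IntervalWindow getSideInputWindow(BoundedWindow mainWindow) {
        IntervalWindow current = (IntervalWindow) mainWindow;
        // Map the main-input window [t, t + 5m) to the previous window
        // [t - 5m, t), so the side input seen now is last window's sum.
        return new IntervalWindow(current.start().minus(size), current.start());
    }
}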

How can I emit summary data for each window even if a given window was empty?

It is really important for my application to always emit a "window finished" message, even if the window was empty. I cannot figure out how to do this. My initial idea was to output an int for each record processed and use Sum.integersGlobally, giving me a singleton per window; I could then simply emit one summary record per window, with 0 if the window was empty. Of course, this fails: you have to use withoutDefaults, which will then emit nothing if the window was empty.
Cloud Dataflow is built around the notion of processing data that is likely to be highly sparse. By design, it does not conjure up data to fill in those gaps of sparseness, since this will be cost prohibitive for many cases. For a use case like yours where non-sparsity is practical (creating non-sparse results for a single global key), the workaround is to join your main PCollection with a heartbeat PCollection consisting of empty values. So for the example of Sum.integersGlobally, you would Flatten your main PCollection<Integer> with a secondary PCollection<Integer> that contains exactly one value of zero per window. This assumes you're using an enumerable type of window (e.g. FixedWindows or SlidingWindows; Sessions are by definition non-enumerable).
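A minimal sketch of that workaround, assuming the heartbeat elements already carry event-time timestamps that place exactly one zero in every window (the usual Beam imports are assumed):
PCollection<Integer> main = /* your real data */;
PCollection<Integer> heartbeat = /* exactly one 0 per window */;
PCollection<Integer> perWindowSums =
    PCollectionList.of(main).and(heartbeat)
        .apply(Flatten.pCollections())
        .apply(Window.<Integer>into(FixedWindows.of(Duration.standardMinutes(1))))
        // withoutDefaults is still required for non-global windows, but the
        // heartbeat zero guarantees that every window emits a sum.
        .apply(Sum.integersGlobally().withoutDefaults());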
Currently, the only way to do this would be to write a data generator program that injects the necessary stream of zeroes into Pub/Sub with timestamps appropriate for the type of windows you will be using. If you write to the same Pub/Sub topic as your main input, you won't even need to add a Flatten to your code. The downside is that you have to run this as a separate job somewhere.
In the future (once our Custom Source API is available), we should be able to provide a PSource that accepts an enumerable WindowFn plus a default value and generates an appropriate unbounded PCollection.
