Pentaho: block this step until steps finish - memory

I'm trying to use "Block this step until steps finish" in a transformation, but it doesn't seem to work:
The way it is set up in the picture, I assume that "total EPOs. DAT, VSE, ESP" shouldn't run until "Filtrar GESTIONADO ny" and "Select values Kibana 2" have finished. Am I right? If not, how can I achieve this?
Thank you.

ALL steps in a transformation start running at initialization. Then they either start processing their input or wait for rows to come in.
A "Block this step..." step does NOT prevent the next step from running, it only blocks rows going through to that step. This does exactly what you expect for steps that need incoming rows (like a Text File Output or Database Lookup) but doesn't do anything for steps that generate new rows from an input source.
Your next step after the block looks like a Text File or CSV input. That step will just start reading the file right away and generate rows.
With a Text File Input (perfectly usable for most CSV files) you can tell it to accept a filename from an incoming field. That way it will wait until the blocking step allows the single row with the filename to pass.

Related

Beam CoGroupByKey with fixed window and event time based trigger generates random elements

I have a pipeline in Beam that uses CoGroupByKey to combine 2 PCollections: the first one reads from a Pub/Sub subscription, and the second one uses the same PCollection but enriches the data by looking up additional information from a table using JdbcIO.readAll. So there is no way there would be data in the second PCollection without it also being in the first one.
There is a fixed window of 10 seconds with an event-time based trigger like the one below:
Repeatedly.forever(
    AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(40)))
        .withLateFirings(AfterPane.elementCountAtLeast(1))
);
The issue I am seeing is that when I stop the pipeline using Drain mode, it seems to randomly generate elements for the second PCollection even when no messages have been coming in to the input Pub/Sub topic. This also happens occasionally while the pipeline is running, though not consistently; when draining the pipeline I have been able to reproduce it consistently.
Please find the variation in input vs output below:
You are using a non-deterministic triggering, which means the output is sensitive to the exact ordering in which events come in. Another way to look at this is that CoGBK does not wait for both sides to come in; the trigger starts ticking as soon as either side comes in.
For example, let's call your PCollections A and A' respectively, and assume they each have two elements: a1 and a2 in A, and a1' and a2' in A' (of common provenance).
Suppose a1 and a1' come into the CoGBK, 39 seconds pass, and then a2 comes in (on the same key); another 2 seconds pass, then a2' comes in. The CoGBK will output ([a1, a2], [a1']) when the 40-second mark hits, and then when the window closes ([], [a2']) will get emitted. (Even if everything is on the same key, this could happen occasionally whenever there is more than a 40-second walltime delay going through the longer path, and it will almost certainly happen for any late data, since each side will fire separately.)
Draining makes things worse, e.g. I think all processing time triggers fire immediately.
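For reference, here is a minimal Java sketch of the kind of setup the question describes: the 10-second fixed window with that trigger applied to both sides before the CoGroupByKey. The PCollection names, element types, allowed lateness, and accumulation mode are assumptions for illustration, since they are not shown in the question; the early/late firings are exactly what makes each pane's contents order-sensitive, as described above.

import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.transforms.windowing.*;
import org.apache.beam.sdk.values.*;
import org.joda.time.Duration;

public class WindowedJoinSketch {
    // "pubsubSide" and "enrichedSide" stand in for the two keyed PCollections in the
    // question; the KV<String, String> element type is an assumption for illustration.
    static PCollection<KV<String, CoGbkResult>> windowAndJoin(
            PCollection<KV<String, String>> pubsubSide,
            PCollection<KV<String, String>> enrichedSide) {

        Window<KV<String, String>> windowing =
            Window.<KV<String, String>>into(FixedWindows.of(Duration.standardSeconds(10)))
                .triggering(Repeatedly.forever(
                    AfterWatermark.pastEndOfWindow()
                        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                            .plusDelayOf(Duration.standardSeconds(40)))
                        .withLateFirings(AfterPane.elementCountAtLeast(1))))
                .withAllowedLateness(Duration.standardMinutes(1)) // assumption, not shown in the question
                .discardingFiredPanes();                          // assumption; matches the ([], [a2']) pane above

        TupleTag<String> pubsubTag = new TupleTag<>();
        TupleTag<String> enrichedTag = new TupleTag<>();

        // Each side is windowed and triggered independently; when a pane fires, the
        // CoGBK emits whatever has arrived on either side so far.
        return KeyedPCollectionTuple
            .of(pubsubTag, pubsubSide.apply("WindowPubsub", windowing))
            .and(enrichedTag, enrichedSide.apply("WindowEnriched", windowing))
            .apply(CoGroupByKey.create());
    }
}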

NetLogo BehaviorSpace: how to save data not per tick but based on a reporter

I have a NetLogo model for which a run takes about 15 minutes but goes through a lot of ticks, because not much happens per tick. I want to do quite a few runs in a BehaviorSpace experiment. The output (table output only) will be all the output and input variables per tick. However, not all of this data is relevant: it is only relevant once a day (the length of a day is variable; a run lasts 1095 days).
The result is that the model gets very slow when running experiments via BehaviorSpace. Not only would it be nicer to have output data with just 1095 rows, the per-tick output perhaps also causes the experiment to slow down tremendously.
How to fix this?
It is possible to write your own output file in a BehaviorSpace experiment. Program your code to create and open an output file that contains only the results you want.
The problem is to keep BehaviorSpace from trying to open the same output file from different model runs running on different processors, which causes a runtime error. I have tried two solutions.
Tell BehaviorSpace to only use one processor for the experiment. Then you can use the same output file for all model runs. If you want the output lines to include which model run it's on, use the primitive behaviorspace-run-number.
Have each model run create its own output file with a unique name. Open the file using something like:
file-open (word "Output-for-run-" behaviorspace-run-number ".csv")
so the output files will be named Output-for-run-1.csv etc.
(If you are not familiar with it, the CSV extension is very useful for writing output files. You can put everything you want to output on a big list, and then when the model finishes write the list into a CSV file with:
csv:to-file (word "Output-for-run-" behaviorspace-run-number ".csv") the-big-list
)

Question about SPSS Modeler (there is an obstacle to making the stream run automatically)

I have an SPSS Modeler stream which is used and updated every week to generate a certain dataset. The raw data for this stream is also renewed on a weekly basis.
In part of this stream there is a chunk of nodes that has to be modified and updated manually every week; the sequence of this part is: Type Node => Restructure Node => Aggregate Node
To simplify the explanation of those nodes' role, I drew an image of them as below.
Because the original raw data changes on a weekly basis, the range of the Unit value above always varies, sometimes more than 6 (maybe 100), other times less than 6 (maybe 3). That is why, until now, somebody has had to modify and update that chunk of nodes every week. *The Unit value has a certain upper limit (300 for now)
However, we are now aiming to run this stream fully automatically, without any human intervention, so we need to customize that part to work perfectly on its own. Please help; I will appreciate your efforts, thanks!
In order to automate this, I suggest trying Global nodes combined with CLEM scripts inside the execution tab (default script). I have a stream that calculates the first date and the last date, and those variables are used to rename files at the end of execution. I think you could use something similar, as explained here:
1) Create derive nodes to bring the unit values used in the weekly stream
2) Save this information in a table named 'count_variable'
3) Use a Global node named Global with a query similar to this:
#GLOBAL_MAX(variable created in (2)) (only to record the number of variables; step (2) created a table with only one value, so the GLOBAL_MAX will simply return the number of variables).
4) The query inside the execution tab will be similar to this:
execute count_variable
var tabledata
var fn
set tabledata = count_variable.output
set count_variable = value tabledata at 1 1
execute Global
5) You can now use the information about the number of variables simply by referring to the already created "count_variable"
It's not easy to explain just by typing, but I hope to have been helpful.
Please mark this answer with +1 if it was relevant.
I think there is a better, simpler, and more effective (yet risky, due to the node's requirements on the input data) solution to your problem. It is called the Transpose node and it does exactly that: pivots your table. But only from version 18.1 onwards. Here's an example:
https://developer.ibm.com/answers/questions/389161/how-does-new-feature-partial-transpose-work-in-sps/

Kill logstash when finished parsing

I'm only outputting my parsed data into MongoDB from Logstash, but is there any way to tell when the logs are finished parsing, so that I can kill Logstash? As a lot of logs are being processed, I cannot write my data to stdout.
Since you are using a file input, there should be a .sincedb file somewhere. That file keeps track of how far into each file Logstash has already read. As far as I understand it, it is structured this way:
INODE_NUMBER BYTE_OFFSET
The inode number identifies a file (so if you are parsing several files, or if your file is being rolled over, there will be several lines). The offset is like a bookmark for Logstash to remember what it has already read (in case you process the same file in several passes). So basically, when this number stops moving up, it should mean that Logstash is done parsing the file.
Alternatively, if you have no multiline filter set up, you could simply compare the number of lines in the file to the number of records in MongoDB.
A third possibility: you can set up another output, not necessarily stdout; this could be, for example, a pipe to a script that simply drops the data and prints a message once it has received nothing new for some time, or some other alternative, see the docs.
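For the count-comparison option, here is a rough Java sketch, assuming the official mongodb-driver-sync client and that every log line produces exactly one document; the connection URI, database, collection, and log path are placeholders:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CheckLogstashProgress {
    public static void main(String[] args) throws Exception {
        long linesInFile;
        // Placeholder path to the log file Logstash is reading.
        try (Stream<String> lines = Files.lines(Paths.get("/var/log/myapp.log"))) {
            linesInFile = lines.count();
        }
        // Placeholder connection string, database, and collection names.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            long docsInMongo = client.getDatabase("logs")
                                     .getCollection("events")
                                     .countDocuments();
            if (docsInMongo >= linesInFile) {
                System.out.println("All " + linesInFile + " lines appear to be indexed; Logstash can be stopped.");
            } else {
                System.out.println(docsInMongo + " of " + linesInFile + " lines indexed so far.");
            }
        }
    }
}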

Parsing file in parallel

I am thinking about a way to parse a FASTA file in parallel. For those of you who don't know the FASTA format, here is an example:
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
So lines starting with '>' are header lines containing an identifier for the sequence that follows.
I suppose you load the entire file into memory, but after that I am having trouble finding a way to process the data.
The problem is: threads cannot start at an arbitrary position, because they could cut a sequence apart that way.
Does anyone have experience parsing files in parallel when the lines depend on each other? Any idea is appreciated.
Should be easy enough, since the dependence of lines on each other is very simple in this case: just make the threads start at an arbitrary position and then skip lines until they get to one that starts with '>' (i.e. one that starts a new sequence).
To make sure no sequence gets processed twice, keep a set of all sequence IDs that have been processed (or you could do it by line number if the sequence IDs aren't unique, but they really should be!).
Do a preprocessing step, walk through the data once, and determine all valid start points. Let's call these tasks. Then you can simply use a worker-crew model, where each worker repeatedly asks for a task (a starting point), and parses it.
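Here is a rough Java sketch of that second suggestion: one sequential pre-scan to find the record start points, then a worker crew that gets one complete record per task. The file name is a placeholder, and parseRecord stands in for whatever per-sequence work you actually need:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelFastaParser {
    public static void main(String[] args) throws Exception {
        // Placeholder file name; the whole file is loaded into memory, as assumed in the question.
        List<String> lines = Files.readAllLines(Paths.get("sequences.fasta"));

        // Pass 1: one sequential walk to find every record start (lines beginning with '>').
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).startsWith(">")) starts.add(i);
        }
        starts.add(lines.size()); // sentinel marking the end of the last record

        // Pass 2: worker crew -- each task is one complete record, so no sequence is ever split.
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (int t = 0; t < starts.size() - 1; t++) {
            List<String> recordLines = lines.subList(starts.get(t), starts.get(t + 1));
            pool.submit(() -> parseRecord(recordLines));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Placeholder for the real per-sequence work.
    static void parseRecord(List<String> recordLines) {
        String id = recordLines.get(0).substring(1); // header without the leading '>'
        String sequence = String.join("", recordLines.subList(1, recordLines.size()));
        System.out.println(id + ": " + sequence.length() + " residues");
    }
}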
