Question about SPSS Modeler (an obstacle to making the stream run automatically) - spss

I have an SPSS Modeler stream that is used and updated every week to generate a certain dataset. The raw data for this stream is also renewed on a weekly basis.
In part of this stream there is a chunk of nodes that has to be modified and updated manually every week; the sequence of this part is: Type Node => Restructure Node => Aggregate Node
To simplify the explanation of those nodes' roles, I drew an image of them as below.
Because the original raw data changes on a weekly basis, the range of the Unit value above always varies, sometimes more than 6 (maybe 100), other times less than 6 (maybe 3). That is why, until now, somebody has had to modify and update that chunk of nodes every week. *The Unit value has a certain upper limit (300 for now)
However, we are now aiming to run this stream automatically, without any manual intervention, so that part needs to be customized to work perfectly and automatically. Please help; I will appreciate your efforts, thanks!

To automate this, I suggest trying global (Set Globals) nodes combined with CLEM scripts inside the execution (default script). I have a stream that calculates the first date and the last date, and those variables are used to rename files at the end of execution. I think you could use something similar, as explained here:
1) Create Derive nodes to bring in the unit values used in the weekly stream
2) Save this information in a table named 'count_variable'
3) Use a Set Globals node named Global with an expression similar to this:
@GLOBAL_MAX(variable created in (2)) (only to record the number of variables; step 2 created a table with only one value, so @GLOBAL_MAX will simply return the number of variables).
4) The script on the Execution tab will be similar to this:
execute count_variable
var tabledata
var count_value
set tabledata = count_variable.output
# read the single value in row 1, column 1 of the table output
set count_value = value tabledata at 1 1
execute Global
5) You can now use the recorded count simply through the script variable created above ('count_value').
It's not easy to explain just by typing, but I hope this has been helpful.
Please mark this answer as +1 if it was relevant.

I think there is a better, simpler and more effective (yet risky, due to the node's requirements on the input data) solution to your problem. It is called the Transpose node and it does exactly that: it pivots your table. But only from version 18.1 on. Here's an example:
https://developer.ibm.com/answers/questions/389161/how-does-new-feature-partial-transpose-work-in-sps/
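Separately, if it helps to see what such a pivot does outside of Modeler, here is a toy sketch in pandas (the column names and data are made up; it is only meant to show why the reshape does not need to know the number of Unit values in advance):
import pandas as pd

# Toy stand-in for the weekly raw data: one row per (id, Unit) pair.
raw = pd.DataFrame({
    "id":    [1, 1, 2, 2, 2],
    "Unit":  ["A", "B", "A", "B", "C"],
    "value": [10, 20, 30, 40, 50],
})

# Pivot to one column per distinct Unit value, however many there are this
# week, then aggregate (sum) per id. This is the reshape that the
# Restructure + Aggregate chain performs.
wide = raw.pivot_table(index="id", columns="Unit", values="value",
                       aggfunc="sum", fill_value=0)
print(wide)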

Related

How to efficiently extract all values in a large master table that start with a specified string

I'm designing the UI for a Lua program, one element of which requires the user to either select an existing value from a master table or create a new value in that table.
I would normally use an IUP list with EDITBOX = "YES".
However, the number of items that the user can select may run into many hundreds or possibly thousands, and the performance when populating the list in IUP (and also selecting from it) is unacceptably slow. I cannot control the number of items in the table.
My current thinking is to create a list with an editbox, but without any values. As the user types into the editbox (after perhaps 2-3 characters) the list would populate with the subset of table items that start with the characters typed. The user could then select an item from the list or keep typing to narrow the options or create a new item.
For this to work, I need to be able to create a new table with the items from the master table that start with the entered characters.
One option would be to iterate through the master table using the Penlight 'startswith' function to create the new table:
require "pl.init"
local subtable = {} --empty result table
local startstring = "xyz" -- will actually be set by the iup control
for _, v in ipairs (mastertable) do
if stringx.startswith(v, startstring) then
table.insert(subtable,v)
end
end
However, I'm worried about the performance of doing that if the master table is huge. Is there a more efficient way to code this, or a different way I could implement the UI?
There are various approaches you can take to improve the big-O performance of your prefix search, at the cost of increased code complexity; that said, given the size of your dataset (thousands of items) and the intended use (triggered by user interaction, rather than e.g. game logic that needs to run every frame), I think a simple linear search over the options is almost certainly going to be fast enough.
To test this theory, I timed the following code:
local dict = {}
for word in io.lines('/usr/share/dict/words') do
    table.insert(dict, word)
end

local matched = {}
local search = "^" .. (...)
for _, word in ipairs(dict) do
    if word:match(search) then
        table.insert(matched, word)
    end
end
print('Found '..#matched..' words.')
I used /usr/bin/time -v and tried it with both Lua 5.2 and LuaJIT.
Note that this is fairly pessimistic compared to your code:
no attempt made to localize library functions that are repeatedly called, or use # instead of table.insert
timing includes not just the search but also the cost of loading the dictionary into memory in the first place
string.match is almost certainly slower than stringx.startswith
dictionary contains ~100k entries rather than the "hundreds to thousands" you expect in your application
Even with all those caveats, it costs 50-100 ms in Lua 5.2 and 30-50 ms in LuaJIT, over 50 runs.
If I use os.clock() to time just the actual search, it consistently costs about 10 ms in Lua 5.2 and 3-4 ms in LuaJIT.
Now, this is on a fairly fast laptop (Core i7), but also non-optimized code running on a dataset 10-100x larger than you expect to process; given that, I suspect that the naïve approach of just looping over the entries calling startswith will be plenty fast for your purposes, and result in code that's significantly simpler and easier to debug.
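For completeness: if the dataset ever grows far beyond that, the usual way to get a better big-O bound is to keep the candidate values in a sorted array and binary-search for the prefix range. Here is a rough sketch of the idea (in Python rather than Lua, purely to illustrate the technique; the names are made up):
import bisect

def prefix_range(sorted_words, prefix):
    # Two binary searches bound the block of entries starting with the prefix,
    # so the cost is O(log n) plus the size of the result.
    lo = bisect.bisect_left(sorted_words, prefix)
    # "\xff" sorts after any plain ASCII character, so prefix + "\xff" is an
    # upper bound for every string starting with the prefix (ASCII data only).
    hi = bisect.bisect_left(sorted_words, prefix + "\xff", lo)
    return sorted_words[lo:hi]

words = sorted(["apple", "apricot", "xylophone", "xyz_alpha", "xyz_beta"])
print(prefix_range(words, "xyz"))   # ['xyz_alpha', 'xyz_beta']
The same trick can be written in Lua with a sorted table and a hand-rolled binary search, but it only pays off once a plain linear scan is actually too slow.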

How to define Alerts with exception in InfluxDB/Kapacitor

I'm trying to figure out the best or a reasonable approach to defining alerts in InfluxDB. For example, I might use the CPU batch tickscript that comes with telegraf. This could be setup as a global monitor/alert for all hosts being monitored by telegraf.
What is the approach when you want to deviate from the above setup for a host, i.e. instead of X% for a specific server we want to alert on Y%?
I'm happy that a distinct tickscript could be created for the custom values but how do I go about excluding the host from the original 'global' one?
This is a simple scenario, but it needs to scale to 10,000 hosts, of which there will be hundreds of exceptions, and it will also encompass tens or hundreds of global alert definitions.
I'm struggling to see how you could use the platform as the primary source of monitoring/alerting.
As said in the comments, you can use the sideload node to achieve that.
Say you want to ensure that your InfluxDB servers are not overloaded. You may want to allow 100 measurements by default. Only on one server, which happens to receive a massive number of datapoints, do you want to limit it to 10 (a value which is easily exceeded by the _internal database, but good for our example).
Given the following excerpt from a TICKscript
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "numMeasurements")
        .as('value')

var customized = data
    |sideload()
        .source('file:///etc/kapacitor/customizations/demo/')
        .order('hosts/host-{{.hostname}}.yaml')
        .field('maxNumMeasurements', 100)
    |log()

var trigger = customized
    |alert()
        .crit(lambda: "value" > "maxNumMeasurements")
and assuming the name of the server with the exception is influxdb and the file /etc/kapacitor/customizations/demo/hosts/host-influxdb.yaml looks as follows
maxNumMeasurements: 10
A critical alert will be triggered if value (and hence numMeasurements) exceeds 10 AND the hostname tag equals influxdb, OR if value exceeds 100 on any other host.
There is an example in the documentation handling scheduled downtimes using sideload.
Furthermore, I have created an example, available on GitHub, using docker-compose.
Note that there is a caveat with the example: the alert flaps because of a second, dynamically generated database. But it should be sufficient to show how to approach the problem.
What is the cost of using sideload nodes in terms of performance and computation if you have over 10 thousand servers?
Managing alerts manually and directly in Chronograf/Kapacitor is not feasible for a big number of custom alerts.
At AMMP Technologies we need to manage alerts per database, customer and customer_objects. The number can go into the thousands. We've opted for a custom solution where we keep a standard set of template tickscripts (not to be confused with Kapacitor templates), and we provide an interface to the user that only exposes the relevant variables. After that, a service (written in Python) combines the values for those variables with a tickscript and, using the Kapacitor API, deploys (updates, or deletes) the task on the Kapacitor server. This is then automated so that data for new customers/objects is combined with the templates and automatically deployed to Kapacitor.
You obviously need to design your tasks to be specific enough so that they don't overlap and generic enough so that it's not too much work to create tasks for every little thing.
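To make that approach a little more concrete, here is a rough Python sketch of how such a deployment service could talk to the Kapacitor HTTP API (this is not AMMP's actual code; the template placeholder convention, file names and host/port are assumptions for illustration):
import requests

KAPACITOR_TASKS_URL = "http://localhost:9092/kapacitor/v1/tasks"

def render_template(template_path, variables):
    # Naive text substitution of per-customer values into a template TICKscript;
    # the "{{ name }}" placeholder convention is just an assumption for this sketch.
    with open(template_path) as f:
        script = f.read()
    for name, value in variables.items():
        script = script.replace("{{ %s }}" % name, str(value))
    return script

def deploy_task(task_id, script, db="telegraf", rp="autogen"):
    body = {
        "id": task_id,
        "type": "stream",
        "dbrps": [{"db": db, "rp": rp}],
        "script": script,
        "status": "enabled",
    }
    # If the task already exists, update it in place; otherwise create it.
    existing = requests.get("%s/%s" % (KAPACITOR_TASKS_URL, task_id))
    if existing.status_code == 200:
        resp = requests.patch("%s/%s" % (KAPACITOR_TASKS_URL, task_id), json=body)
    else:
        resp = requests.post(KAPACITOR_TASKS_URL, json=body)
    resp.raise_for_status()

if __name__ == "__main__":
    # Hypothetical template and per-customer value
    script = render_template("cpu_alert.tick", {"crit_threshold": 95})
    deploy_task("cpu_alert-customer42", script)
The same API also supports DELETE on a task's URL, which is how a task would be removed when a customer or object goes away.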

Apache Beam - Delta between windows

Apologies, in trying to be concise and clear my previous description of my question turned into a special case of the general case I'm trying to solve.
New Description
I'm trying to compare the last emitted value of an aggregation function (let's say Sum()) with each element that I aggregate over in the current window.
Worth noting that the ideal (I think) solution would ensure that the T2 (from t-1) element used at time = t is the one that was created during the previous window.
I've been playing with several ideas/experiments but I'm struggling to find a way to accomplish this that is elegant and "empathetic" to Beam's compute model (which I'm still trying to fully grok after many an article/blog/doc and book :)
Side inputs seem unwieldy because it looks like I have to shift the emitted 5M#T-1 aggregation's timestamp into 5M#T's window in order to align it with the current 5M window.
In attempting this with side inputs (as I understand them), I ended up with some nasty code that was quite "circularly referential", but not in an elegant recursive way :)
Any help in the right direction would be much appreciated.
Edit:
Modified diagram and improved description to more clearly show:
the intent of using emitted T2(from t-1) to calculate T2 at t
that the desired T2(from t-1) used to calculate T2 is the one with the correct key
Instead of modifying the timestamp of records that are materialized so that they appear in the current window, you should supply a window mapping fn which just maps the current window onto the past one.
You'll want to create a custom WindowFn which implements the window mapping behavior that you want, paying special attention to overriding the getDefaultWindowMappingFn function.
Your pipeline would be like:
PCollection<T> mySource = /* data */;

PCollectionView<SumT> view = mySource
    .apply(Window.into(myCustomWindowFnWithNewWindowMappingFn))
    .apply(Combine.globally(myCombiner).asSingletonView());

mySource.apply(ParDo.of(/* DoFn that consumes side input */).withSideInputs(view));
Pay special attention to the default value the combiner will produce since this will be the default value when the view has had no data emitted to it.
Also, the easiest way to write your own custom window function is to copy an existing one.

Delete variables based on the number of observations

I have an SPSS file that contains about 1000 variables and I have to delete the ones having 0 valid values. I can think of a loop with an if statement but I can't find how to write it.
The simplest way would be to use the spssaux2.FindEmptyVars Python function like this:
begin program.
import spssaux2
spssaux2.FindEmptyVars(delete=True)
end program.
If you don't already have the spssaux2 module installed, you would need to get it from the SPSS Community website or the IBM Predictive Analytics site and save it in the python\lib\site-packages directory under your Statistics installation.
Otherwise, the VALIDATEDATA command, if you have it, will identify the variables violating such rules as a maximum percentage of missing values, but you would have to turn that output into a DELETE VARIABLES command. You could also look at the number of valid values per variable using, say, DESCRIPTIVES and select out the ones with N=0.
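If you would rather not depend on spssaux2 at all, a rough sketch of the same check using only the base spss module might look like the following (untested here; it counts system-missing numerics and blank strings as invalid, which may or may not match your definition of a valid value):
begin program.
import spss

nvars = spss.GetVariableCount()
names = [spss.GetVariableName(i) for i in range(nvars)]
valid = [0] * nvars

# Read every case once and count non-missing values per variable.
cur = spss.Cursor()
for row in cur.fetchall():
    for i, value in enumerate(row):
        if value is None:
            continue                      # system-missing numeric
        if isinstance(value, str) and not value.strip():
            continue                      # blank string treated as empty
        valid[i] += 1
cur.close()

empty = [name for name, n in zip(names, valid) if n == 0]
if empty:
    spss.Submit("DELETE VARIABLES " + " ".join(empty) + ".")
end program.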
If you've never worked with Python in SPSS, here's a way to get the job done without it (not as elegant, but it should do the job):
This will count the valid cases in each variable, and select only those that have 0 valid cases. Then you'll manually copy the names of these variables into a syntax command that will delete them.
DATASET NAME Orig.
DATASET DECLARE VARLIST.
AGGREGATE /OUTFILE='VARLIST'/BREAK=
/**list_all_the_variable_names_here = NU(*FirstVarName to *LastVarName).
DATASET ACTIVATE VARLIST.
VARSTOCASES /MAKE NumValid FROM *FirstVarName to *LastVarName/INDEX=VarName(NumValid).
SELECT IF NumValid=0.
EXECUTE.
Pause here to copy the remaining names in the list and complete the syntax, then continue:
DATASET ACTIVATE Orig.
DELETE VARIABLES *paste_here_all_the_remaining_variable_names_from_varlist .
Notes:
* I put stars where you have to replace my text with your variable names.
** If the variables are neatly named like Q1, Q2, Q3 .... Q1000, you can use the "FirstVarName to LastVarName" form (Q1 to Q1000) instead of listing all the variable names.
BTW it is of course possible to do this completely automatically without manually copying those names (using only syntax, no Python), but the added complexity is not worth bothering with for a single use...

How to combine two files and create a report with matched fields in COBOL

I have two files.
The first file contains jobname and start time, and looks like this:
ZPUDA13V STARTED - TIME=00.13.30
ZPUDM00V STARTED - TIME=03.26.54
ZPUDM01V STARTED - TIME=03.26.54
ZPUDM02V STARTED - TIME=03.26.54
ZPUDM03V STARTED - TIME=03.26.56
and the second file contains jobname and end time, and looks like this:
ZPUDA13V ENDED - TIME=00.13.37
ZPUDM00V ENDED - TIME=03.27.38
ZPUDM01V ENDED - TIME=03.27.34
ZPUDM02V ENDED - TIME=03.27.29
ZPUDM03V ENDED - TIME=03.27.27
Now I am trying to combine these two files to get a report like JOBNAME STARTTIME ENDTIME. I have used ICETOOL to get the report: if I get JOBNAME and START TIME, ENDTIME is spaces; if I get ENDTIME, JOBNAME and START TIME are spaces.
Please let me know how to code the OUTREC fields, as I have tried almost all possibilities to get the desired result, but my output is still not what I require.
I have no idea what ICETOOL is (nor the inclination to even look it up in Google :-) but this is a classic COBOL data processing task.
Based on your simple data input, the algorithm would be:
for every record S in startfile:
    for every record E in endfile:
        if S.jobname = E.jobname:
            output S.jobname S.time E.time
            next S
        endif
    endfor
endfor
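If it helps to see the matching logic concretely, here is a small Python sketch of the same idea (the file names are hypothetical, and the column positions assume the fixed layout shown in the sample records):
def load_times(path):
    # Map each job name (columns 1-8) to the first TIME= value seen for it;
    # keeping only the first occurrence sidesteps duplicates for this sketch,
    # but see the caveats below about jobs that run more than once.
    times = {}
    with open(path) as f:
        for line in f:
            if "TIME=" not in line:
                continue
            jobname = line[:8].strip()
            time = line.rstrip().split("TIME=")[1]
            times.setdefault(jobname, time)
    return times

starts = load_times("start.txt")
ends = load_times("end.txt")

print("%-8s %-8s %-8s" % ("JOBNAME", "START", "END"))
for jobname in sorted(starts):
    end_time = ends.get(jobname, "")   # blank when no matching ENDED record
    print("%-8s %-8s %-8s" % (jobname, starts[jobname], end_time))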
However, you may need to take into account the fact that:
multiple jobs of the same name may run during the day (multiple entries in the file).
multiple jobs of the same name may run at the same time.
You could get around the first problem by ensuring the E record was the one immediately following the S record (based on time). The second problem is a doozy.
If you're running on z/OS (and you probably are, given the job names), have you considered using information from the SMF records to do this collection and analysis? I'm pretty certain SMF type 30 records hold everything you need.
And assuming this is a mainframe question, here's a shameless plug for a book one of my friends at work has written, check out What On Earth is a Mainframe? by David Stephens (ISBN-13 = 978-1409225355).
I know I'm too late with my resolution, but it may be helpful for newcomers to Stack Overflow.
You can make use of the JOINKEYS feature of DFSORT using JCL.
JOINKEYS FILE=F1,FIELDS=(01,08,CH,A)
JOINKEYS FILE=F2,FIELDS=(01,08,CH,A)
REFORMAT FIELDS=(F1:01,33,F2:25,08)
SORT FIELDS=COPY
OUTREC FIELDS=(01,08,25,08,34,08)
The OUTREC will hold the data as you need!
