How to schedule SPSS Modeler streams to run automatically?

I'm very new to SPSS Modeler; I previously used SAS Guide & Miner to set up batches and scheduled runs.
We currently run our streams manually each week and then export the results to Excel for reporting.
Is there any way to schedule these streams to run automatically?
Thank you.

ModelerBatch can be used only if you have SPSS Modeler Server installed. I use the Windows Task Scheduler, but first you have to write a .bat file.
Example of .bat:
call "C:\Program Files\IBM\SPSS\Modeler\18.1\bin\clemb.exe" -stream "E:\path\stream.str" -execute -log "E:\path\stream.str.log"
The .bat file will call SPSS Modeler and execute the selected stream. After that, you can schedule the task with the Windows Task Scheduler (https://msdn.microsoft.com/en-us/library/windows/desktop/aa446802(v=vs.85).aspx).
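If you prefer to register the weekly run from a script instead of clicking through the Task Scheduler UI, here is a rough sketch that wraps the schtasks command in Python; the task name, schedule and .bat path are placeholders, not part of the original answer:

import subprocess

# Register a weekly scheduled task that runs the .bat file shown above.
# Task name, day, time and path are placeholders.
subprocess.run(
    [
        "schtasks", "/Create",
        "/SC", "WEEKLY", "/D", "MON",      # every Monday
        "/ST", "06:00",                    # at 06:00
        "/TN", "RunModelerStream",         # hypothetical task name
        "/TR", r"E:\path\run_stream.bat",  # the .bat that calls clemb.exe
    ],
    check=True,
)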

There is a product called ModelerBatch which could meet your requirement precisely. Just search for it on Google.

Related

ordering message pub/sub GCP

I am new to Dataflow and the pub/sub tools in GCP.
I need to migrate a current on-prem process to GCP.
The current process is as follows:
We have two types of data feeds:
Full Feed – an ad hoc job – the full XML is ~100 GB (a single, very complex XML containing the complete data; an ETL job processes this XML and loads it into ~60 tables)
Separate ETL jobs process the full feed: they create load-ready files, and all tables are truncated and re-loaded.
Delta Feed – every 30 minutes we need to process delta files (XML files containing only the changes from the last 30 minutes)
The source system pushes XML files every 30 minutes (more than one; each file has a timestamp). A scheduled ETL process picks up all the files produced by the source system, processes all the XML files, and creates three load-ready files (insert, delete and update) for each table.
Schedule – ETL jobs are scheduled to run every 5 minutes; if the current run takes longer than 5 minutes, the next run is not triggered until the current one completes.
The order of file processing is very important (the ETL job takes care of this). All files need to be processed in sequence.
At the end of the ETL process, the load-ready files are loaded into tables (Mainframe).
I was asked to propose a design to migrate this to GCP. We need both processes, full and delta, in GCP as well, and my proposed solution should handle both feeds.
Initially I thought of the design below.
Pub/sub -> DataFlow -> mySQL/BigQuery
Then I came to know that Pub/Sub does not guarantee that the files are processed in sequence/order. After doing some research I learned that Google recently introduced an ordering key concept for Pub/Sub, which makes sure messages are processed in order. The Google Cloud docs mention that this feature is in Beta.
I have two questions:
Has anyone used the ordering key concept in Pub/Sub in a production environment? If yes, did you face any challenges while implementing it?
Is this design suitable for the above requirement, or is there a better solution in GCP?
Is there any alternative to Dataflow?
I also came to know that Pub/Sub can handle messages of at most 10 MB; for us, each XML is more than ~5 GB.
As @guillaume blaquiere mentioned, the Beta launch phase brings some restrictions, but they are mostly related to product support:
At beta, products or features are ready for broader customer testing and use. Betas are often publicly announced. There are no SLAs or technical support obligations in a beta release unless otherwise specified in product terms or the terms of a particular beta program. The average beta phase lasts about six months.
In general, the Cloud Pub/Sub message ordering feature works as intended; if you find something that needs the developers' attention, it is highly appreciated if you send a report via the Google Issue Tracker.
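For reference, a minimal Python sketch of publishing with an ordering key; the project, topic, endpoint and bucket names below are placeholders, and since Pub/Sub messages are capped at 10 MB the message carries only a GCS URI pointing at the XML file rather than the XML itself (a common workaround, not something prescribed in the question or this answer). The subscription must also be created with message ordering enabled.

from google.cloud import pubsub_v1

# Message ordering requires enabling ordering on the publisher and using a
# regional endpoint; names below are placeholders.
publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True),
    client_options={"api_endpoint": "us-east1-pubsub.googleapis.com:443"},
)
topic_path = publisher.topic_path("my-project", "delta-feed")

# Messages that share an ordering key are delivered in publish order,
# provided the subscription has message ordering enabled as well.
for uri in [
    "gs://my-bucket/delta/2020-10-01T00-00.xml",
    "gs://my-bucket/delta/2020-10-01T00-30.xml",
]:
    publisher.publish(topic_path, uri.encode("utf-8"), ordering_key="delta-feed").result()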

Using Dataflow vs. Cloud Composer

I'd like to get some clarification on whether Cloud Dataflow or Cloud Composer is the right tool for the job, since that wasn't clear to me from the Google documentation.
Currently, I'm using Cloud Dataflow to read a non-standard csv file -- do some basic processing -- and load it into BigQuery.
Let me give a very basic example:
# file.csv
type\x01date
house\x0112/27/1982
car\x0111/9/1889
From this file we detect the schema and create a BigQuery table, something like this:
`table`
type (STRING)
date (DATE)
And we also format our data (in Python) to insert into BigQuery:
DATA = [
    ("house", "1982-12-27"),
    ("car", "1889-11-09")
]
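For concreteness, here is a minimal, hypothetical Apache Beam sketch of this kind of pipeline; the bucket, project, dataset and table names are placeholders, and the real pipeline is of course more involved:

from datetime import datetime
import apache_beam as beam

def parse_line(line):
    # Split on the \x01 delimiter and convert MM/DD/YYYY to BigQuery's DATE format.
    kind, date = line.split("\x01")
    return {"type": kind, "date": datetime.strptime(date, "%m/%d/%Y").strftime("%Y-%m-%d")}

with beam.Pipeline() as p:
    (
        p
        | beam.io.ReadFromText("gs://my-bucket/file.csv", skip_header_lines=1)
        | beam.Map(parse_line)
        | beam.io.WriteToBigQuery(
            "my-project:my_dataset.table",
            schema="type:STRING,date:DATE",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        )
    )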
This is a vast simplification of what's going on, but this is how we're currently using Cloud Dataflow.
My question then is, where does Cloud Composer come into the picture? What additional features could it provide on the above? In other words, why would it be used "on top of" Cloud Dataflow?
Cloud Composer (which is backed by Apache Airflow) is designed for task scheduling at a relatively small scale.
Here is an example to help you understand:
Say you have a CSV file in GCS and, using your example, say you use Cloud Dataflow to process it and insert the formatted data into BigQuery. If this is a one-off thing, you have just finished it and it's perfect.
Now let's say your CSV file is overwritten at 01:00 UTC every day, and you want to run the same Dataflow job to process it every time it is overwritten. If you don't want to manually run the job at exactly 01:00 UTC regardless of weekends and holidays, you need something to run the job for you periodically (in our example, at 01:00 UTC every day). Cloud Composer can help you in this case. You provide a configuration to Cloud Composer, which includes what jobs to run (operators), when to run them (a job start time) and at what frequency (daily, weekly or even yearly).
That already seems cool; however, what if the CSV file is overwritten not at 01:00 UTC but at any time of day? How would you choose the daily run time? Cloud Composer provides sensors, which can monitor a condition (in this case, the CSV file's modification time). Cloud Composer can guarantee that it kicks off the job only when the condition is satisfied.
There are a lot more features that Cloud Composer/Apache Airflow provides, including having a DAG to run multiple jobs, failed task retry, failure notification and a nice dashboard. You can also learn more from their documentation.
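To make that concrete, here is a minimal Airflow DAG sketch along the lines described above; the DAG name, bucket, CSV object and Dataflow template path are placeholders, and the operator/sensor classes come from the Google provider package for Airflow 2:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.sensors.gcs import GCSObjectUpdateSensor
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

with DAG(
    dag_id="daily_csv_to_bigquery",      # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 1 * * *",       # run at 01:00 UTC every day
    catchup=False,
) as dag:
    # Wait until file.csv in the bucket has been updated since the last run.
    wait_for_csv = GCSObjectUpdateSensor(
        task_id="wait_for_csv",
        bucket="my-bucket",              # hypothetical bucket
        object="file.csv",
    )

    # Launch a pre-built Dataflow template that parses the CSV and loads BigQuery.
    run_dataflow = DataflowTemplatedJobStartOperator(
        task_id="run_dataflow",
        template="gs://my-bucket/templates/csv_to_bq",  # hypothetical template path
        location="us-central1",
    )

    wait_for_csv >> run_dataflow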
For the basic task you describe, Cloud Dataflow is a good choice. Big data that can be processed in parallel is a good fit for Cloud Dataflow.
The real world of processing big data is usually messy. Data is usually somewhat to very dirty, arrives constantly or in big batches, and needs to be processed in time-sensitive ways. Usually it takes the coordination of more than one task/system to extract the desired data. Think of load, transform, merge, extract and store types of tasks. Big data processing is often glued together with shell scripts and/or Python programs. This makes automation, management, scheduling and control processes difficult.
Google Cloud Composer is a big step up from Cloud Dataflow. Cloud Composer is a cross platform orchestration tool that supports AWS, Azure and GCP (and more) with management, scheduling and processing abilities.
Cloud Dataflow handles tasks. Cloud Composer manages entire processes coordinating tasks that may involve BigQuery, Dataflow, Dataproc, Storage, on-premises, etc.
My question then is, where does Cloud Composer come into the picture? What additional features could it provide on the above? In other words, why would it be used "on top of" Cloud Dataflow?
If you need / require more management, control, scheduling, etc. of your big data tasks, then Cloud Composer adds significant value. If you are just running a simple Cloud Dataflow task on demand once in a while, Cloud Composer might be overkill.
Cloud Composer = Apache Airflow, designed for task scheduling.
Cloud Dataflow = Apache Beam, designed to handle the tasks themselves (the data processing).
For me, Cloud Composer is a step up (a big one) from Dataflow. If I had one task, let's say processing my CSV file from Storage into BQ, I would/could use Dataflow. But if I wanted to run the same job daily, I would use Composer.

iOS - Parse.com Export Data automation

I am using Parse.com as the backend for my iOS app. Parse has a big Export Data button for backing up your database that sends an email with a zip containing each table and its data in JSON format. That's great, but is there any way to automate this task? I want to be able to do this every night, and I know you can use Background Jobs for automated tasks, but is it possible to hook into this particular feature? I couldn't find an answer on Parse's forums, and searching didn't turn up anything except old threads talking about how this feature was on the horizon.
The best I can work out, without Parse providing a true way of achieving this, is to have a job creating File objects in a "backup" table, and then use an external service (with the REST API) to pull these out into S3 or similar.
It's not ideal, but it would work. Also, it will count against your API requests, so you may want to optimise with the updated flag.
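As a rough sketch of this approach (the class, column and bucket names are placeholders; paging and error handling omitted), an external script could pull the File objects out via the Parse REST API and copy them into S3:

import requests
import boto3

# Placeholder Parse credentials for the REST API.
PARSE_HEADERS = {
    "X-Parse-Application-Id": "YOUR_APP_ID",
    "X-Parse-REST-API-Key": "YOUR_REST_API_KEY",
}

s3 = boto3.client("s3")

# Query the hypothetical "Backup" class that the background job populates.
resp = requests.get("https://api.parse.com/1/classes/Backup", headers=PARSE_HEADERS)
for obj in resp.json().get("results", []):
    file_url = obj["dump"]["url"]            # assumes a File column named "dump"
    data = requests.get(file_url).content
    s3.put_object(Bucket="my-backup-bucket", Key=obj["objectId"] + ".json", Body=data)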
What I do for this is run a simple Windows Server instance on AWS EC2 to run a scheduled task.
Create a simple .bat file that runs the command node parse-backup.js
Create a basic scheduled task using the Windows-provided scheduler and run the .bat file
You can use this Node code: https://github.com/mkim871/parse-node-backup

Possibilities to accept input for COBOL batch program

I have a batch COBOL program which needs input in the form of a flat file. It works when I FTP a single file to the batch using an FTP tool.
The problem is that, in the final solution, many concurrent users need to access the batch program, together or individually. For example, let's say 10 users need to run the batch.
They can FTP all of their files to a shared directory from where the mainframe can access them.
Now the questions are:
How can the mainframe job be triggered?
Since there will be 10 or more files, the job needs to process each one of them individually and generate a report.
How should the files be named? For example, if two files have the same name, one will be overwritten when they are FTPed into the shared directory in the first place. On the other hand, if the file names are unique, the mainframe will not be able to differentiate between them.
The user will receive the report through e-mail, as coded in the batch program; the user's ID is present in the input flat file.
Previously the CICS functionality was driven through an Excel macro (screen scraping). The whole point of this exercise is to eliminate the CICS usage in order to reduce MIPS.
Any help is appreciated.
Riffing off what @SaggingRufus said, if you have Control-M for scheduling you can use CTMAPI to set an auto-edit variable to the name of your file and then order a batch job. You could do this via a web service in CICS using the SPOOLWRITE API to submit the job, or you could try FTPing to the JES spool.
@BillWoodger is absolutely correct: get your production scheduling folks and your security folks involved. Don't roll your own architecture; use what your shop has decided is right for it.
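To illustrate the "FTP to the JES spool" idea, here is a hedged Python sketch using ftplib; it assumes the z/OS FTP server allows FILETYPE=JES (which causes a stored data set to be submitted as a job), and the host, credentials and JCL names are placeholders:

from ftplib import FTP

with FTP("mainframe.example.com") as ftp:        # placeholder host
    ftp.login("USERID", "PASSWORD")              # placeholder credentials
    ftp.sendcmd("SITE FILETYPE=JES")             # switch the session to job-submission mode
    with open("runbatch.jcl", "rb") as jcl:      # placeholder JCL that runs the COBOL batch
        ftp.storlines("STOR RUNBATCH.JCL", jcl)  # the stored JCL is submitted to JES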

Mule ESB - Clear Memory of a batch process

My scenario is that I have 4 batch jobs in my Mule flow. One of the batches, say batch 1, loads about 10,000 records, but then I decided to force-stop it. I then decided to run batch 2 in the same XML. Batch 2 runs, but the batch 1 records that were loaded earlier also get processed. Is this a bug, or is there a configuration to prevent this?
Are you running the batch in Studio?
If yes, go to the Run Configurations in Studio, look for your project's configuration, scroll down to 'Clear Application Data', and set it to 'Always'.
