How to prioritize one package over another inside a Foreach Loop Container in SSIS?

I have two Execute Package tasks inside a Foreach Loop Container in my main package, and I want one package to always execute before the other. How can I achieve this?
The script reads the number in the file name: if the file name contains 1, then Package 1 executes; if it contains 2, then Package 2, and so on. But for that specific case I want Package 3 to always execute before any other package, regardless of the file number.

Related

How to run a Python script in the background on Azure

I have a uni project in which I have to run a number of machine learning algorithms like SVM, ME, Naive Bayes, etc., and perform a grid search on them to find the optimal sets of hyper-parameters. Running all of these would take an exceedingly long time (48-168 hours total, but run in batches), and considering my computer becomes more or less unusable while I run them, I was trying to find a solution that would let me run my code externally. The scripts I have to run are in Python, and my plan was to run them on Azure to make use of its "Azure for students" $100 credit.
My original plan was to use Azure's ML notebook section and then run the Python scripts in the terminal it provides. My problem with this route is that, as far as I can tell, the computation stops when the browser closes, which is a problem. I looked into it and found some articles mentioning a combination of 'ctrl-z', 'bg', and 'disown' to disconnect the process from the shell, but I thought there should definitely be a better way to do it. (I also wasn't sure how this would work in my case, where there are 8 processes running at once because of GridSearchCV's n_jobs=-1 feature.)
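To give an idea of the sort of search being run, here is a minimal sketch; the dataset, estimator, and parameter grid are purely illustrative, not my actual setup:
# Minimal illustrative sketch of one of the grid searches (not the real data or grid).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1], "kernel": ["rbf"]}

# n_jobs=-1 uses every available core, which is what makes my machine unusable.
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1, verbose=1)
search.fit(X, y)
print(search.best_params_, search.best_score_)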
I then realized a better way to do this would be to use pipelines. My intent was to create a number of pipelines of the form:
(Import data in xlsx file) -> (python script to run ML) -> (export data to working directory)
And then run them until all the work is completed. In the first stage I used the parameters, and I got an error.
My intention was to have the Excel file pipe into the Python script as a data frame, but this implementation (and all the others I've tried) isn't working.
My first question is: how do I get the Excel data to pipe into the Python script properly?
My second question is: is there a better way to go about doing this? Would running it from the shell be an easier way to do it? If so, how do I ensure it keeps running while my browser is closed? Are there other services that would be better? My main criteria are price (cheap) and time limit (the ability to run for a long time), but any suggestions would be greatly appreciated.
I also tried using Google Colab; this worked, but it felt slower than running on my own computer.
To run a grid search with AzureML, you would use a Sweep job. The simplest way to kick off a Sweep is via the CLI. See here for an example.
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  command: >-
    python hello-sweep.py
    --A ${{inputs.A}}
    --B ${{search_space.B}}
    --C ${{search_space.C}}
  code: src
  environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
inputs:
  A: 0.5
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
  B:
    type: choice
    values: ["hello", "world", "hello_world"]
  C:
    type: uniform
    min_value: 0.1
    max_value: 1.0
objective:
  goal: minimize
  primary_metric: random_metric
limits:
  max_total_trials: 4
  max_concurrent_trials: 2
  timeout: 3600
display_name: hello-sweep-example
experiment_name: hello-sweep-example
description: Hello sweep job example.
You can start that job using the AzureML v2 CLI with the following command:
az ml job create -f hello-sweep.yml
That will create max_total_trials jobs for different parameter combinations, as defined in the search_space and governed by the sampling_algorithm, which can be random, grid, or bayesian.
The actual job that is started is defined under trial. You need a program or script of some sort that you can execute via a command line and that can take parameters via that command line. command is the command that is executed, code is a folder on the local machine that contains the script/program you want to run, and environment is a registered environment in your workspace. azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest is one that is predefined in AzureML, but you can also create your own.
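For illustration, a hypothetical hello-sweep.py could be little more than an argparse script that accepts the swept parameters and logs the metric named under objective.primary_metric (random_metric here) via MLflow:
# Hypothetical hello-sweep.py: accept the swept parameters and log the metric
# that the sweep's objective refers to (here just a random value, as in the example).
import argparse
import random
import mlflow

parser = argparse.ArgumentParser()
parser.add_argument("--A", type=float)
parser.add_argument("--B", type=str)
parser.add_argument("--C", type=float)
args = parser.parse_args()

# ... train/evaluate something with args.A, args.B and args.C here ...
mlflow.log_metric("random_metric", random.random())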
If you prefer Python, here is the same thing done in Python.
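A rough sketch of what that might look like with the v2 Python SDK (azure-ai-ml) follows; the subscription, resource group, and workspace names are placeholders you would fill in:
from azure.ai.ml import MLClient, command
from azure.ai.ml.sweep import Choice, Uniform
from azure.identity import DefaultAzureCredential

# Placeholders for your own workspace details.
ml_client = MLClient(DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>")

# The trial: same command, code folder and environment as in the YAML above.
command_job = command(
    code="src",
    command="python hello-sweep.py --A ${{inputs.A}} --B ${{inputs.B}} --C ${{inputs.C}}",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    inputs={"A": 0.5, "B": "hello", "C": 0.5},
)

# Override the inputs to sweep over with search-space distributions.
command_job_for_sweep = command_job(
    B=Choice(values=["hello", "world", "hello_world"]),
    C=Uniform(min_value=0.1, max_value=1.0),
)

sweep_job = command_job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm="random",
    primary_metric="random_metric",
    goal="Minimize",
)
sweep_job.set_limits(max_total_trials=4, max_concurrent_trials=2, timeout=3600)

# SDK counterpart of "az ml job create".
ml_client.jobs.create_or_update(sweep_job)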
See here for a blog post on How to do hyperparameter tuning using Azure ML.

Drake Installation Freeze

I am trying to install the Python bindings of Drake. After make -j it freezes. I believe I have done everything correctly in the previous steps. Can anyone help? I am running on Ubuntu 18.04 with Python 3.6.9.
Thank you in advance. It looks like this (screenshot of the frozen terminal).
Use make (no -j flag) or make -j1, because Bazel (which is called internally during the build) handles the parallelism of the build (and of tests) and will set the number of jobs to the number of cores by default (which appears to be 8 in your case).
To adjust the parallelism to reduce the number of jobs to less than the number of cores, create a file named user.bazelrc at the root of the repository (same level as the WORKSPACE file) with the content
test --jobs=N
for some N less than the number of cores that you have.
See also https://docs.bazel.build/versions/master/guide.html#bazelrc.
From the screenshot, it doesn't look like the Drake build system is doing anything wrong, but make -j is probably trying to do too many things in parallel. Try starting with -j4, and if it still freezes, go down to 2, etc. It is possibly running out of memory.
A hacky solution is to change the CMakeLists.txt file to set the maximum number of jobs Bazel uses by adding --jobs N (where N is the number of jobs you allow to run concurrently) after ${BAZEL_TARGETS}, like so:
ExternalProject_Add(drake_cxx_python
  SOURCE_DIR "${PROJECT_SOURCE_DIR}"
  CONFIGURE_COMMAND :
  BUILD_COMMAND
    ${BAZEL_ENV}
    "${Bazel_EXECUTABLE}"
    ${BAZEL_STARTUP_ARGS}
    build
    ${BAZEL_ARGS}
    ${BAZEL_TARGETS}
    --jobs 1
  BUILD_IN_SOURCE ON
  BUILD_ALWAYS ON
  INSTALL_COMMAND
    ${BAZEL_ENV}
    "${Bazel_EXECUTABLE}"
    ${BAZEL_STARTUP_ARGS}
    run
    ${BAZEL_ARGS}
    ${BAZEL_TARGETS}
    --
    ${BAZEL_TARGETS_ARGS}
  USES_TERMINAL_BUILD ON
  USES_TERMINAL_INSTALL ON
)

NetLogo BehaviorSpace: how to save data not per tick but based on a reporter

I have a NetLogo model for which a run takes about 15 minutes but goes through a lot of ticks, because not much happens per tick. I want to do quite a few runs in a BehaviorSpace experiment. The output (table output only) will be all the output and input variables per tick. However, not all of this data is relevant: it's only relevant once a day (day is a variable; a run lasts 1095 days).
The result is that the model becomes very slow when running experiments via BehaviorSpace. Not only would it be nicer to have output data with just 1095 rows, the extra output perhaps also causes the experiment to slow down tremendously.
How can I fix this?
It is possible to write your own output file in a BehaviorSpace experiment. Program your code to create and open an output file that contains only the results you want.
The problem is to keep BehaviorSpace from trying to open the same output file from different model runs running on different processors, which causes a runtime error. I have tried two solutions.
Tell BehaviorSpace to only use one processor for the experiment. Then you can use the same output file for all model runs. If you want the output lines to include which model run it's on, use the primitive behaviorspace-run-number.
Have each model run create its own output file with a unique name. Open the file using something like:
file-open (word "Output-for-run-" behaviorspace-run-number ".csv")
so the output files will be named Output-for-run-1.csv etc.
(If you are not familiar with it, the CSV extension is very useful for writing output files. You can put everything you want to output on a big list, and then when the model finishes write the list into a CSV file with:
csv:to-file (word "Output-for-run-" behaviorspace-run-number ".csv") the-big-list
)

Delete variables based on the number of observations

I have an SPSS file that contains about 1000 variables and I have to delete the ones having 0 valid values. I can think of a loop with an if statement but I can't find how to write it.
The simplest way would be to use the spssaux2.FindEmptyVars Python function like this:
begin program.
import spssaux2
spssaux2.FindEmptyVars(delete=True)
end program.
If you don't already have the spssaux2 module installed, you would need to get it from the SPSS Community website or the IBM Predictive Analytics site and save it in the python\lib\site-packages directory under your Statistics installation.
Otherwise, the VALIDATEDATA command, if you have it, will identify the variables violating such rules as maximum percentage of missing values, but you would have to turn that output into a DELETE VARIABLES command. You could also look for variables with zero missing values using, say, DESCRIPTIVES and select out the ones with N=0.
If you've never worked with Python in SPSS, here's a way to get the job done without it (not as elegant, but it should do the job):
This will count the valid cases in each variable, and select only those that have 0 valid cases. Then you'll manually copy the names of these variables into a syntax command that will delete them.
DATASET NAME Orig.
DATASET DECLARE VARLIST.
AGGREGATE /OUTFILE='VARLIST'/BREAK=
/**list_all_the_variable_names_here = NU(*FirstVarName to *LastVarName).
DATASET ACTIVATE VARLIST.
VARSTOCASES /MAKE NumValid FROM *FirstVarName to *LastVarName/INDEX=VarName(NumValid).
SELECT IF NumValid=0.
EXECUTE.
Pause here to copy the remaining names in the list and complete the syntax, then continue:
DATASET ACTIVATE Orig.
DELETE VARIABLES *paste_here_all_the_remaining_variable_names_from_varlist .
Notes:
* I put stars where you have to replace my text with your variable names.
** If the variables are neatly named like Q1, Q2, Q3 .... Q1000, you can use the "FirstVarName to LastVarName" form (Q1 to Q1000) instead of listing all the variable names.
BTW it is of course possible to do this completely automatically without manually copying those names (using only syntax, no Python), but the added complexity is not worth bothering with for a single use...

Being clever when copying artifacts with Jenkins and multi-configurations

Suppose that I have a (fictional) set of projects: FOO and BAR. Both of these projects have some sort of multi-configuration option.
FOO has a matrix on axis X which takes values in { x1, ..., xn } (so there are n builds of FOO). BAR has a matrix on axis Y which takes values in { y1, ..., ym } (so there are m builds of BAR).
However, BAR needs to copy some artifacts from FOO. It turns out that Y is a strictly finer partition than X. For example, X might take the values { WINDOWS, LINUX } and Y might be { WINDOWS_XP, WINDOWS_7, DEBIAN_TESTING, FEDORA } or whatever.
Is it possible to get BAR to do some sort of table lookup to work out what configuration of FOO it needs when it copies artifacts across? I can easily write a shell script to spit out the mapping, but I can't work out how to invoke it when Jenkins is working out what it needs to copy.
At the moment, a hacky solution is to have two axes on FOO, one for X and one for Y, and then filter out combinations that don't make sense. But the resulting combination filter is ridiculous and the matrix is very sparse. Yuck.
A solution that I don't like is to parametrise FOO on Y instead: this would be a huge waste of compile time. And, worse, the generated artefacts are pretty big, so even if you did some sort of caching, you'd still have to keep unnecessary copies floating around.
I can't say I fully understand the intricacies of your matrices, but I think I can help you with your actual question:
"I can easily write a shell script to spit out the mapping, but I can't work out how to invoke it when Jenkins is working out what it needs to copy"
The Archive the artifacts and Copy artifacts from another project post-build actions can take Java-style wildcards, like module/dist/**/*.zip, as well as environment variables/parameters, like ${PARAM}, for the list of artifacts. You can use commas to add more artifacts.
The on-page help for Copy artifacts from another project states how to copy artifacts of a specific matrix configuration: to copy from a particular configuration, enter JOBNAME/AXIS=VALUE in the Project Name attribute. That Project Name attribute can also contain parameters such as ${PARAM}.
So, in your BAR job, have a Copy Artifacts build step, with Project Name being FOO/X=${mymapping}. What this will do is: every time a configuration of BAR is run, it will copy artifacts only from FOO with configuration of X=${mymapping}.
Now you need to set the value of ${mymapping} dynamically every time BAR is run. A simple script like this may do the trick:
[[ ${Y:0:7} == "WINDOWS" ]] && mymapping=WINDOWS || mymapping=LINUX
Finally, you need to use EnvInject plugin to make this variable available to the rest of the build steps, including the Copy Artifacts step.
So, every time a BAR configuration runs, it will look at its own configuration axis Y, and if that axis value starts with WINDOWS, it will set ${mymapping} to WINDOWS, otherwise to LINUX. This ${mymapping} is then made available to the rest of the build steps. When the Copy Artifacts build step is executed, it will only copy artifacts from FOO where the X axis matches ${mymapping} (i.e. either WINDOWS or LINUX).
Full Setup
Install EnvInject plugin.
In BAR job configuration, tick Prepare an environment for the run (part of EnvInject plugin).
Make sure both checkboxes for keeping existing variables are checked.
In Script Content copy your script:
[[ ${Y:0:7} == "WINDOWS" ]] && mymapping=WINDOWS || mymapping=LINUX
Under Build steps, configure Copy Artifacts build step.
Set Project name parameter to FOO/X=${mymapping}
Configure the rest as usual.
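If the Y-to-X mapping ever becomes more involved than a simple prefix check, the same idea works with a small lookup script instead; a hypothetical Python version that writes a properties file for EnvInject to read (via its Properties File Path setting) might look like this:
# Hypothetical lookup script: map BAR's Y axis value to FOO's X axis value and
# write it as a properties file that the EnvInject plugin can inject.
import os

MAPPING = {
    "WINDOWS_XP": "WINDOWS",
    "WINDOWS_7": "WINDOWS",
    "DEBIAN_TESTING": "LINUX",
    "FEDORA": "LINUX",
}

y = os.environ["Y"]  # the axis value of the BAR configuration currently running
with open("mymapping.properties", "w") as handle:
    handle.write("mymapping=%s\n" % MAPPING[y])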
