TravisCI environment matrix options per branch

I have the following in my .travis.yml file.
env:
  matrix:
    - BROWSER=Chrome VERSION=53.0 PLATFORM="Windows 10"
    - BROWSER=Firefox VERSION=45.0 PLATFORM="Windows 7"
    - BROWSER=IE VERSION=11.0 PLATFORM="Windows 8.1"
    - BROWSER=Safari VERSION=9.0 PLATFORM="OS X 10.11" RESOLUTION="1280x960"
The above matrix produces four builds. I want to run all four configurations on one branch, but only a single one of them on the other branches. How can I accomplish this?
After spending a lot of time searching the web for a solution, I'm asking here.
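Not part of the original question, but one possible direction, assuming a Travis CI version that supports conditional jobs (the if: keyword): move the entries into jobs.include and attach branch conditions. A minimal sketch, with "master" as a placeholder for the full-matrix branch:

jobs:
  include:
    # Full set of configurations, only on the main branch
    - if: branch = master
      env: BROWSER=Chrome VERSION=53.0 PLATFORM="Windows 10"
    - if: branch = master
      env: BROWSER=Firefox VERSION=45.0 PLATFORM="Windows 7"
    - if: branch = master
      env: BROWSER=IE VERSION=11.0 PLATFORM="Windows 8.1"
    - if: branch = master
      env: BROWSER=Safari VERSION=9.0 PLATFORM="OS X 10.11" RESOLUTION="1280x960"
    # Single configuration for every other branch
    - if: branch != master
      env: BROWSER=Chrome VERSION=53.0 PLATFORM="Windows 10"

This is a sketch rather than a verified configuration; a similar effect can also be approximated by branching on $TRAVIS_BRANCH inside the build script.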

Related

How to run a Python script in the background on Azure

I have a uni project in which I have to run a number of machine learning algorithms like SVM, ME, Naive Bayes, etc., and perform a grid search on them to find the optimal sets of hyper-parameters. Running all of these would take an exceedingly long time (48-168 hours total, but run in batches), and since my computer becomes more or less unusable while they run, I was trying to find a way to run my code externally. The scripts are in Python, and my plan was to run them on Azure to make use of its "Azure for students" $100 credit.
My original plan was to use Azure's ML notebook section and then run the Python scripts in the terminal it provides. The problem with this route is that, as far as I can tell, the computation stops when the browser closes. I looked into it and found some articles mentioning a combination of 'ctrl-z', 'bg', and 'disown' to disconnect the process from the shell, but I thought there should be a better way to do it. (I also wasn't sure how this would work in my case, where there are 8 processes running at once via GridSearchCV's n_jobs=-1 feature.)
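As a hedged aside (not from the original post), a common alternative to the ctrl-z/bg/disown dance is nohup, which keeps a process running after the terminal or browser session ends. A minimal sketch, assuming the training script is called train.py:

# Start the script detached from the terminal; output goes to train.log
nohup python train.py > train.log 2>&1 &
# Check progress later
tail -f train.log

With GridSearchCV's n_jobs=-1, the worker processes are children of that one Python process, so they keep running as long as the parent does.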
I then realized a better way to do this would be to use pipelines. My intent was to create a number of pipelines of the form:
(Import data in xlsx file) -> (python script to run ML) -> (export data to working directory)
And then run them until all the work is completed. In the first stage I used the parameters,
And I got the error,
My intention was to have the Excel file pipe into the Python script as a data frame, but this implementation (and all the others I've tried) isn't working.
My first question is: how do I get the Excel data to pipe into the Python script properly?
My second question is: is there a better way to go about this? Would running it in the shell be easier? If so, how do I ensure it keeps running while my browser is closed? Are there other services that would be better? My main criteria are price (cheap) and time limit (the ability to run for a long time), but any suggestions would be greatly appreciated.
I also tried using Google Colab; it worked, but it felt slower than running on my own computer.
To run a grid search with AzureML, you would use the Sweep job. The simplest way to kick off a sweep is via the CLI. See here for an example.
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  command: >-
    python hello-sweep.py
    --A ${{inputs.A}}
    --B ${{search_space.B}}
    --C ${{search_space.C}}
  code: src
  environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
inputs:
  A: 0.5
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
  B:
    type: choice
    values: ["hello", "world", "hello_world"]
  C:
    type: uniform
    min_value: 0.1
    max_value: 1.0
objective:
  goal: minimize
  primary_metric: random_metric
limits:
  max_total_trials: 4
  max_concurrent_trials: 2
  timeout: 3600
display_name: hello-sweep-example
experiment_name: hello-sweep-example
description: Hello sweep job example.
You can start that job using the AzureML v2 CLI with the following command:
az ml job create -f hello-sweep.yml
That will create max_total_trials number of jobs for different parameter combinations, as defined in the search_space and governed by the sampling_algorithm, which can be random, grid or bayesian.
The actual job that is started is defined under trial. You need a program or script of some sort that you can execute via a command line and that can take parameters via that command line. command is the command that is executed, code is a folder on the local machine that contains the script/program you want to run, and environment is a registered environment in your workspace. azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest is one that is predefined in AzureML, but you can also create your own.
If you prefer Python, here is the same thing done in Python.
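For reference, a minimal sketch of what that could look like with the azure-ai-ml v2 Python SDK (this is not the code behind the link; the script, environment and compute names simply mirror the YAML above, and the workspace details are placeholders):

# Hedged sketch: submit roughly the same sweep with the azure-ai-ml v2 SDK.
from azure.ai.ml import MLClient, command
from azure.ai.ml.sweep import Choice, Uniform
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# The trial: one command job, parameterized by A, B and C.
trial = command(
    code="src",
    command="python hello-sweep.py --A ${{inputs.A}} --B ${{inputs.B}} --C ${{inputs.C}}",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    inputs={"A": 0.5, "B": "hello", "C": 0.1},
    compute="cpu-cluster",
)

# Replace B and C with search-space distributions and turn the job into a sweep.
trial_for_sweep = trial(
    B=Choice(values=["hello", "world", "hello_world"]),
    C=Uniform(min_value=0.1, max_value=1.0),
)
sweep_job = trial_for_sweep.sweep(
    sampling_algorithm="random",
    primary_metric="random_metric",
    goal="Minimize",
)
sweep_job.set_limits(max_total_trials=4, max_concurrent_trials=2, timeout=3600)

ml_client.jobs.create_or_update(sweep_job)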
See here for a blog post on How to do hyperparameter tuning using Azure ML.

Experiment tracking for multiple independent ML models using WandB in a single main evaluation

Can you recommend, from your experience, a convenient experiment-tracking and versioning tool for multiple independent models, i.e. one input -> multiple models -> one output, so that I get a single main evaluation and can conveniently compare the sub-evaluations? See the project example in the diagram.
I have tried W&B, MLflow, DVC, Neptune.ai, DagsHub, and TensorBoard for a single model, but I'm not sure which of them is convenient for multiple independent models. I also did not find anything on Google for the approximate phrase "ML tracking experiment and management for multi models".
Disclaimer: I'm a co-founder at Iterative; we are the authors of DVC. My response doesn't come from experience with all the tools mentioned above. I took this as an opportunity to try to build a template for this use case in the DVC ecosystem and share it in case it's useful for anyone.
Here is the GitHub repo I've built (note: it's a template, not a real ML project; the scripts are artificially simplified to show the essence of multi-model evaluation):
DVC Model Ensemble
I've put together an extensive README with a few videos of the CLI, VS Code, and Studio tools.
The core part of the repo is this DVC pipeline, which "trains" multiple models, collects their metrics, and then runs an evaluation stage to "reduce" those metrics into a final one.
stages:
  train:
    foreach:
      - model-1
      - model-2
    do:
      cmd: python train.py
      wdir: ${item}
      params:
        - params.yaml:
      deps:
        - train.py
        - data
      outs:
        - model.pkl:
            cache: false
      metrics:
        - ../dvclive/${item}/metrics.json:
            cache: false
      plots:
        - ../dvclive/${item}/plots/metrics/acc.tsv:
            cache: false
            x: step
            y: acc
  evaluate:
    cmd: python evaluate.py
    deps:
      - dvclive
    metrics:
      - evaluation/metrics.json:
          cache: false
It describes how the different parts of the project are built and connected, and it also makes the project "runnable" and reproducible. It can scale to any number of models (via the foreach clause in the train stage).
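For illustration only, a minimal sketch of what such an evaluate.py "reduce" step could look like (this is not the actual script from the template repo; the file layout and the averaging of an "acc" metric are assumptions):

# Hypothetical reduce step: combine per-model dvclive metrics into one file.
import json
from pathlib import Path

# Each training stage writes dvclive/<model>/metrics.json (assumed layout).
metric_files = sorted(Path("dvclive").glob("*/metrics.json"))
per_model = {f.parent.name: json.loads(f.read_text()) for f in metric_files}

# "Reduce": here simply the mean accuracy across models (assumed metric name).
final = {"mean_acc": sum(m["acc"] for m in per_model.values()) / len(per_model)}

Path("evaluation").mkdir(exist_ok=True)
Path("evaluation/metrics.json").write_text(json.dumps(final, indent=2))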
Please let me know if that fits your scenario and/or you have more requirements; happy to learn more and iterate on it :)

Jenkins plugin with viewing/aggregating possibilities depending on one of the parameters

I'm looking for a plugin that gives me aggregated settings and a view for many cases, the same way a multi-branch pipeline does. But instead of basing it on various branches, I want to base it on one branch, varying a parameter. The picture below is from the mentioned multi-branch pipeline; instead of "Branches" I'm looking for "Cases", and instead of the "Name" column I need a configurable parameter.
In addition, I need various periodic build triggers of the form
H 22 * * 5 %param1=value1 %param2=value3
H 22 * * 5 %param1=value2 %param2=value3
The second part could be done in a standard job, but there will be many such cases, launched periodically every week, every two weeks, or every month, and the difference in param1 is crucial: it needs to be readable and easily visible so I can quickly see which case has failed.
I was looking for such a plugin but couldn't find one. Maybe someone knows of such a plugin or another way to solve this.
One alternative is to create a "super" job whose build steps launch my current job with specific parameters. But then the view would change from many rows to many columns, and since there are over 20 cases, that would IMHO significantly decrease readability. Additionally, not all cases would be launched with the same periodicity, so I would need ready-made sets selected by a parameter, and most of the super build's cases would usually be skipped, which means one might not see the last result for some of the cases.
Note that param2 always has the same value for periodic launches; other values are used only with a manual trigger. param2 can, but doesn't have to, be visible in the "multi-branch pipeline"-like solution.
I hope my explanation of the issue is clear. Looking forward to answers/suggestions, etc. :)
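Not from the original thread, but for the scheduling part specifically: the "cron %param=value" lines above match the syntax of the Parameterized Scheduler plugin, which in a declarative pipeline could look roughly like this sketch (job structure and parameter values are placeholders):

// Hypothetical Jenkinsfile sketch using the Parameterized Scheduler plugin.
pipeline {
    agent any
    parameters {
        choice(name: 'param1', choices: ['value1', 'value2'], description: 'Case selector')
        string(name: 'param2', defaultValue: 'value3', description: 'Fixed value for periodic runs')
    }
    triggers {
        // One line per periodic case; parameters are separated by semicolons.
        parameterizedCron('''
            H 22 * * 5 %param1=value1;param2=value3
            H 22 * * 5 %param1=value2;param2=value3
        ''')
    }
    stages {
        stage('Run case') {
            steps {
                echo "Running case ${params.param1} with ${params.param2}"
            }
        }
    }
}

This only covers the per-parameter periodic triggers; it does not by itself provide the multi-branch-style "Cases" view the question asks for.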

How to use startPhase in Mahout

I am running a RecommenderJob (org.apache.mahout.cf.taste.hadoop.item.RecommenderJob) based job from Mahout 0.7 and noticed that there are options like startPhase and endPhase. I am guessing these are for running only portions of the pipeline, assuming you have the necessary input data from prior run(s). But I am having a hard time understanding what kinds of phases there are in RecommenderJob. I am in the middle of reading the source code, but it looks like it will take a while. In the meantime, I am wondering if anybody can shed light on how to use these options (startPhase in particular) with the RecommenderJob class?
Here is what I found:
phase 0 is about PreparePreferenceMatrixJob and it has 3 Hadoop jobs:
PreparePreferenceMatrixJob-ItemIDIndexMapper-Reducer
PreparePreferenceMatrixJob-ToItemPrefsMapper-Reducer
PreparePreferenceMatrixJob-ToItemVectorsMapper-Reducer
phase 1 is about RowSimilarityJob and it has 3 jobs:
RowSimilarityJob-VectorNormMapper-Reducer
RowSimilarityJob-CooccurrencesMapper-Reducer
RowSimilarityJob-UnsymmetrifyMapper-Reducer
phase 2 is about RecommenderJob and it has 3 jobs:
RecommenderJob-SimilarityMatrixRowWrapperMapper-Reducer
RecommenderJob-UserVectorSplitterMapper-Reducer
RecommenderJob-Mapper-Reducer
phase 3 is the last one and it has only one job:
RecommenderJob-PartialMultiplyMapper-Reducer
Also, the output from phase 1 here in the RecommenderJob class is exactly the same as the output from phases 0 and 1 of ItemSimilarityJob (but the temp directory names are different).
Yes, that's correct. It's a fairly crude mechanism. Really it controls which of a series of MapReduce jobs are run. You have to read the code to know what they are, yes. They vary by job.
If I'd done it over again I would have just made it detect the presence of output to know to skip the jobs. (That's what I've done in my next-gen recommender project.)
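For completeness, a hedged sketch of how these flags are usually passed on the command line (paths, the similarity measure, and the chosen phase numbers are placeholders, assuming the standard Mahout 0.7 driver script):

# Hypothetical invocation: skip phases 0-1 and re-run only phases 2-3,
# reusing the temp data produced by an earlier full run.
mahout org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  --input /user/me/prefs.csv \
  --output /user/me/recommendations \
  --similarityClassname SIMILARITY_COOCCURRENCE \
  --tempDir /user/me/temp \
  --startPhase 2 \
  --endPhase 3

As the answer above notes, which MapReduce jobs fall into which phase is only visible in the source code, so the phase numbers here are illustrative.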

How do I plot benchmark data in a Jenkins matrix project

I have several Jenkins matrix projects in which I output benchmark results (i.e. execution times) to a CSV file. I'd like to plot these execution times as a function of the build number, so I can see whether my projects are regressing over time.
I can confirm the Plot Plugin is a correct and quite useful approach. BTW, it supports CSV as well: plot configuration example
I've been using it for several years without any problem. Benchmark results were generated as a property file. The benchmark id (series id) was used as a key and the result as a value. One build produces one result for each benchmark. Having that data, it is quite easy to create a plot configuration and track performance.
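For illustration, a made-up example of such a property file (benchmark IDs and timings are invented):

# benchmark-results.properties, written by one build:
# benchmark id as the key, execution time in seconds as the value
sort_large_array=12.3
parse_log_file=4.7
db_bulk_insert=30.1

In the setup described above, each key corresponds to one plot series, and every build contributes one new data point per series.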
This may help you:
https://wiki.jenkins-ci.org/display/JENKINS/Plot+Plugin
It adds plotting capabilities to Jenkins.
