Assume we have a Publish pipeline on the master branch of a repository hosted on Microsoft Azure. When a Pull Request is completed on this branch, an artefact is automatically built. When clicking on the artefact, we can see a window "Provenance" that shows the exact commit of the code used to publish the artefact (see image below).
Now assume an artefact is built using not only code, but also some heavy data. This is the case for Machine Learning models, which are created from training code and training data. I want to be able to link the versions of both to an artefact (the Machine Learning model) that, ideally, is created automatically after a PR on master.
Currently, I upload the artefact manually, so not only do I lack the identifier of the data used to produce the ML model, but also the commit ID of the code.
Is there a way on Azure to automatically produce the (heavy) artefact? Is there a way to keep track of the IDs of the code and the data used to produce an artefact?
I suppose I will need a data versioning system plus storage. Are these provided by Azure?
If I understand you correctly, you could try using Universal Packages to pack those JSON files as an artifact.
As a test, we could use the Universal Packages task to create and publish the artifact:
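In YAML, such a publish step could look roughly like the sketch below; the feed name, package name and directory are placeholders rather than values from the question.

    # Publish the build output (e.g. the trained model) as a Universal Package
    - task: UniversalPackages@0
      displayName: Publish model artifact
      inputs:
        command: publish
        publishDirectory: '$(Build.ArtifactStagingDirectory)'   # folder containing the files to publish
        vstsFeedPublish: 'my-feed'                               # hypothetical Azure Artifacts feed
        vstsFeedPackagePublish: 'ml-model'                       # hypothetical package name
        versionOption: patch                                     # bump the patch version automatically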
After the build completes, we can get the artifact in our feed:
You could check the documents Publish and download Universal Packages in Azure Pipelines and Universal packages with Azure DevOps Artifacts for more details.
On TFS 2018.2, I am building a release pipeline that involves the use of:
Applications configuration files
PowerShell scripts
HTML/Markdown templates (for release notes)
My application configuration files are located on a network share for now, and that works fine, but I would like to version them later on.
I was about to store the other files in my existing TFVC repository, but I did not find a way to get them (with their directory) without adding the entire repository as a release input artifact.
I do not want to add them to my build artifacts, since these files will be used for all my releases, no matter which application I am building.
What is the recommended way:
to store these files,
to get them at release execution?
I have been tempted to use the Library, but I feel this would be a misuse of it since it has been designed for secure files...
The correct solution to this problem is something you've already hit upon: Add them to your build artifacts. In fact, it's better than pulling them from a separate repo for a very important reason:
Your deployment scripts are going to evolve along with your application. You lose the connection between "this version of the application was deployed with these particular scripts" if the scripts come from a separate location.
You have a lot of options to control the circumstances under which they get published/downloaded:
You can use conditions on the publish artifacts tasks to control when they get published (see the YAML sketch after this list)
You can use artifact filters on the release definition to control when they get downloaded as part of a release
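As an illustration of the first option, custom conditions use the same expression syntax whether they are set in the classic editor's control options or in YAML; a minimal sketch with made-up path and artifact names:

    # Only publish the shared scripts/templates when the build succeeded on master (illustrative values)
    - task: PublishBuildArtifacts@1
      condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/master'))
      inputs:
        PathtoPublish: '$(Build.SourcesDirectory)/deploy'   # hypothetical folder holding scripts and templates
        ArtifactName: 'deployment-scripts'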
I am building a VSTS build pipeline for continuous integration and deployment of an MVC web project. My client wants zero downtime in the case of continuous deployment, so we have considered restructuring the source control strategy and splitting the single code repository into the following:
Core features
Feature 1
Feature 2 .....
Feature n
We are planning to keep features as child branches of the core feature branch and place individual build templates for each of the branches and sub-branches. So the ideal scenario is that if there is any change in the core feature branch, the build should be deployed with the full code (branch + sub-branches), but if only one feature branch is changed, continuous deployment will be executed only for that branch or the feature in that branch.
So the questions which need some guidance are:
Is the idea of feature branching fine, and can it be used in production?
The .NET MVC application is an n-tier application with web, service and repository tiers. Should I also split the service and repository layers into the core and feature branches to keep them separated?
If I split the service and repository, how should communication happen between the different features:
Via service-to-service calls? For example, if feature 1 requires some functionality of feature 2, the feature 1 service calls the feature 2 service and merges the result to send it to the feature 1 GUI?
The feature 1 repository calls the feature 2 repository, but this approach introduces a dependency of feature 1 on feature 2, meaning that if feature 2 is down at the time of deployment, feature 1 also experiences errors.
Is splitting the repository into several features a good idea?
Thanks
Splitting the repository into several features is OK, because they could be used in other apps (e.g. a mobile app).
I recommend that you consider the VSTS Packages feature or another third-party package feed. The workflow:
Push changes to the server > trigger a CI build > pack and publish the package to a VSTS feed using the NuGet task (see the YAML sketch after this list)
Install the necessary packages in the web project and do the coding.
Push the web project changes to the server > trigger a CI build with the currently installed package (do not update the package)
Update the necessary package in the web project for the new feature
Push the web project changes to the server > trigger a CI build
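A rough YAML sketch of the pack-and-publish step; the project path and feed name are placeholders, not values from the question:

    # Pack the feature library and push it to a VSTS/Azure Artifacts feed (illustrative values)
    - task: NuGetCommand@2
      displayName: Pack feature package
      inputs:
        command: pack
        packagesToPack: 'src/Feature1/Feature1.csproj'     # hypothetical project to pack
    - task: NuGetCommand@2
      displayName: Push feature package
      inputs:
        command: push
        packagesToPush: '$(Build.ArtifactStagingDirectory)/*.nupkg'
        nuGetFeedType: internal
        publishVstsFeed: 'my-feed'                         # hypothetical feed in this account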
How do you implement a complex value stream with multiple pipelines in Jenkins Workflow? Something similar to what you can do with Go CD: How do I do CD with Go?: Part 2: Pipelines and Value Streams.
For a distributed system, I would like each dev team and the operations team to start with their own delivery pipeline. A change needs to trigger only the pipeline of the team that made the change. That pipeline in turn needs to trigger a new pipeline that takes the latest successful artifacts from each of the teams' pipelines and moves on from there. This means that the artifacts from the other teams are not rebuilt or retested, as they were not changed. After the fan-in, we can run a set of automated tests to verify the correct behaviour of the distributed system with the change.
In the documentation I only find that you can pull from multiple VCSs, but I assume everything is then built and tested on every change, which is something I want to avoid.
If each delivery pipeline is in its own Jenkins job, how can I visualize the complete pipeline, and what is the best way to pull in the last successful artifacts or versions from the other pipelines?
There is no direct equivalent in Jenkins for value streams, and Workflow jobs do not behave any differently in that respect: you can have upstream jobs and downstream jobs correlated with triggers (in this case the build step, or the core ReverseBuildTrigger), and use (for example) the Copy Artifact plugin to transfer artifacts to downstream builds. Similarly, you could use an external repository manager as the “source of truth” and define job triggers based on snapshots pushed to the repository.
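For instance, a minimal sketch of such a job chain, assuming the Copy Artifact plugin is installed; the job names, artifact paths and test command are hypothetical:

    // Tail end of one team's Workflow script: archive the artifact and trigger the fan-in job.
    node {
        archiveArtifacts artifacts: 'build/libs/*.jar'     // whatever this team's build produced
        build job: 'integration-pipeline', wait: false     // kick off the shared downstream pipeline
    }

    // Script of the 'integration-pipeline' job: pull the last successful artifact from each team's pipeline.
    node {
        copyArtifacts projectName: 'team-a-pipeline', selector: lastSuccessful()
        copyArtifacts projectName: 'team-b-pipeline', selector: lastSuccessful()
        sh './run-system-tests.sh'                         // system tests against the combined artifacts
    }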
That said, part of the purpose of Workflow is to avoid the need for complex job chains in most situations¹, since it is usually easier to reason about, debug, and customize a single script with standard control flow operators and local variables than to manage a set of interdependent jobs. If the main problem with a single flow is that you need to avoid rebuilding unmodified parts, one solution would be to use something like JENKINS-30412 to check the changelog of particular repository checkouts and skip build steps if empty. I think there would be more features needed to make such a system work in the general case that workspaces are clobbered or discarded by other builds.
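As a sketch of that skip-if-unchanged idea in a single Workflow script, assuming a Jenkins version that exposes the current build's changelog to the script; the repository URL and build command are placeholders:

    node {
        dir('module-a') {
            git url: 'https://example.com/module-a.git'    // hypothetical module repository
        }
        // currentBuild.changeSets lists the SCM changes that went into this build;
        // if the module saw no changes, skip its rebuild and reuse the previous artifacts.
        if (currentBuild.changeSets.isEmpty()) {
            echo 'module-a unchanged; skipping rebuild'
        } else {
            dir('module-a') {
                sh './gradlew build test'                  // hypothetical build command
            }
        }
    }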
¹One case where you definitely need separate jobs is that for security reasons the teams contributing to different projects must not be able to see one another’s sources or build logs.
Assuming that each of your dev teams works on a different module of your project and "one change needs to trigger only the pipeline of the team that made the change", I'd use Git Submodules:
Submodules allow you to keep a Git repository as a subdirectory of another Git repository.
with one repo per team that becomes a submodule of a main module repo. This will be transparent to the teams, since they just work on their designated repos only.
The main module is also the aggregator project for your module projects in terms of the build tool. So, you have the options:
to build each repo/pipeline individually or
to build the whole (main) project at once.
A build pipeline that comprises one or more build jobs is associated with every team/repo/module.
The main pipeline is merely a collection of downstream jobs which represent the starting points of the team/repo/module pipelines.
The build triggers can be any of manually, timed or on source changes.
A decision also has to be made:
whether you version your modules individually, such that other modules depend on release versions only.
Advantages:
Others rely on released, usually more stable versions.
Modules can decide which version of a dependency they want to use.
Disadvantages:
Releases have to be prepared for each module.
It may take longer until the latest changes are available to others.
Modules have to decide which version of a dependency they want to use. And they have to adapt it every time they need functionality added in a newer version.
or whether you use one version for the entire project (which is inherited by the modules then): ...-SNAPSHOT during the development cycle, a release version when releasing the project.
In this case, if there are modules that are essential for others, e.g. a core module, a successful build of it should trigger a build of the dependent modules, as well, so that incompatibilities are recognized as early as possible.
Advantages:
Latest changes are immediately available to others.
A release is prepared for the whole project only once it is to be delivered.
Disadvantages:
Latest changes immediately available to others may introduce not so stable (snapshot) code.
Re "How can I visualize the complete pipeline"
I'm not aware of any plugin that can do this with Workflows at the moment.
There's the Build Graph View Plugin, which was originally created for Build Flows, but it's more than two years old now:
Downstream builds are identified by DownStreamRunDeclarer extension point.
Default one is using Jenkins dependencyGraph and UpstreamCause and as such can detect common build chain.
build-flow plugin is contributing one to render flow execution as a graph
some Jenkins plugins may later contribute dedicated solutions.
(You know, "may" and "later" often become "will not" and "never" in development. ;)
There's the Build Pipeline Plugin but it apparently is also not suitable for Workflows:
This plugin provides a Build Pipeline View of upstream and downstream connected jobs [...]
Re "way to pull in the last successful artifacts"
Apparently it's not that smooth with Gradle:
By default, Gradle does not define any repositories.
I'm using Maven and there exist local and remote repositories where the latter can also be:
[...] internal repositories set up on a file or HTTP server within your company, used to share private artifacts between development teams and for releases.
Have you considered using a binary repository manager like Artifactory or Nexus?
From what I have seen, people are moving towards smaller, independent pieces of code delivery rather than monolithic deployments. But clearly, there will still be dependencies between different components. At the very least, for example, if you had one script that provisioned your infrastructure and another that built and deployed your app, you would want to be sure your infrastructure update script was run before your app deployment. On the other hand, your infrastructure does not depend on deploying your app code - it can be updated at its own pace, so long as it ideally passes some testing.
As mentioned in another post, you really have two options to accomplish this dependency:
Have a single pipeline (workflow script) that checks out code from both repos and puts them through the same pipeline simultaneously. Any change to one requires running the full pipeline for everything.
Have two pipelines and this would allow each to go at its own pace independent of what the other does. This isn't a problem for the infrastructure code, but it very well could be for the app code. If you pushed your app code to production without the infrastructure update having happened first, the results may not be pleasant.
What I've started to do with Jenkins Workflow is establish a dependency between my flows. Basically, I declare that one flow is dependent on a particular version (in this case, simply BUILD_NUM), and so before I do a production deploy I verify that the last successful build of the other pipeline has completed first. I'm able to do this using the Jenkins API as part of my flow script, which waits for that build or greater to succeed, like so:
import hudson.EnvVars
import hudson.model.*

// Build number of the other pipeline that this flow depends on.
int independentBuildNum = 16

// Block until the dependent pipeline has a successful build at or above that number.
waitUntil {
    verifyDependentPipelineCompletion("FLDR_CM/WorkflowDepedencyTester2", independentBuildNum)
}

boolean verifyDependentPipelineCompletion(String jobName, int buildNum) {
    def hi = jenkins.model.Jenkins.instance
    Item dep2 = hi.getItemByFullName(jobName)
    hi = null // drop the Jenkins reference so the flow can be suspended/serialized safely

    def jobs = dep2.getAllJobs().toArray()
    def onlyJob = jobs[0] // always 1 job...I think?
    def targetedBuild = onlyJob.getLastSuccessfulBuild()

    // Read BUILD_ID from the last successful build's characteristic environment variables.
    EnvVars me = targetedBuild.getCharacteristicEnvVars()
    def es = me.entrySet()
    int targetBuildNum = 0
    def vars = es.iterator()
    while (vars.hasNext()) {
        def envVar = vars.next()
        if (envVar.getKey().equals("BUILD_ID")) {
            targetBuildNum = Integer.parseInt(envVar.getValue())
        }
    }

    // True once the dependency's last successful build has reached the required number.
    if (buildNum > targetBuildNum) {
        return false
    }
    return true
}
Disclaimer: I am just beginning this process, so I do not have much real-world experience with it yet, but I will update this thread if I have more relevant information. Any feedback welcome.
Our project group stored binary files of the project we are working on in an SVN repository for over a year. In the end our repository grew out of control; taking backups of the SVN repo became impossible at one point, since each binary that is checked in is around 20 MB.
Now we have switched to TFS. We are not responsible for backing the repository up; our IT team takes care of it, and we have more network and storage capacity for backups because of that, but we still want to decide what to do with the binaries. As far as I know TFS stores deltas, and for binary files the deltas will be huge, so we might end up reaching our disk space quota one day. I would like to plan things better from the start; I don't want to get caught in a bad situation when it's too late to fix the problem.
I would prefer not to keep builds in source control, but our project group insists on keeping a copy of every binary for reproducing the problems that we see in the production system. I can't get them to get the source code from TFS, build it, and create the binary, because it is not straightforward according to them.
Does TFS offer a better build versioning method? If someone can share some insight, I'd be really grateful.
As a general rule you should not be storing build output in TFS. Occasionally you may want to store binaries for common libraries used by many applications, but tools such as NuGet get around that.
Build output has a few phases in its life, and each phase should be stored in a separate place, e.g.:
Build output: When code is built (by TFS / Jenkins / Hudson etc.) the output is stored in a drop location. This storage should be considered volatile as you'll be producing a lot of builds, many of which will be discarded.
Builds that have been passed to testers: These are builds that have passed some very basic QA, e.g. it compiles, static code analysis tools are happy, unit tests pass. Once a build has been deemed good enough to be given to test, it should be moved from the drop location to another area. This could be a network share (non-production, as the build can be reproduced). There may be a number of builds that get promoted during the lifetime of a project, and you will want to keep track of what versions the testers are using in each environment.
Builds that have passed test and are in production: Your test team deem the build to be of a high enough quality to ship. As part of your go live process, you should take the build that has been signed off by test and store it in a 3rd location. In ITIL speak this is a Definitive Media Library. This can be a simple file share, but it should be considered to be "production" and have the same backup and resilience criteria as any other production system.
The DML is the place where you store the binaries that are in production (and associated configuration items such as install instructions, symbol files etc.) The tool producing the build should also have labelled the source in TFS so that you can work out what code was used to produce the binary. Your branching strategy will also help with being able to connect the binary to the code.
It's also a good idea to have a "live like" environment, this should be separate from your regular dev and test environments. As the name suggests it contains only the code that has been released to production. This enables you to quickly reproduce bugs in production
Two methods that may help you:
Use Team Foundation Build System. One of the advantages is that you can set up retention periods for finished builds. For example, you can order TFS to store the 10 latest successful builds, and the two latest failed ones. You can also tell TFS to store certain builds (e.g. "production builds"/final releases) indefinitely. These binaries folders can of course also be backed up externally, if needed.
Use a different collection for your binaries, with another (less frequent) backup schedule. TFS needs to back up whole collections, but by separating data that doesn't change as frequently as the source, you can lower the backup cost. This of course depends on how frequently you are required to have the binaries backed up.
You might want to look into creating build definitions in TFS to give your project group an easy 'one button' push to grab the source code from a particular branch and then build it and drop it to a location. That way they get to have their binaries, and you don't have to source control them.
If you are using a branching strategy where you create Release or RTM branches when you push something to production, then you can point your build definitions at those branches and they can manually trigger them from the TFS portal or from within Visual Studio.
In 2009, there was a SO question on the same topic.
I'm wondering if later versions of Team Foundation Server are better at longer build pipelines. Refer to the features of Jenkins, TeamCity, and ThoughtWorks' Go (my employer).
The visualization of the build pipelines is important to me, as well as notifications about individual stages passing or failing. That, and the eminent clone-ability of, say, a 'trunk' pipeline into one for a release branch as that branch leaps into being.
Secondly, a personal holy grail is the CI server storing its config in the SCM that's holding the buildable thing itself, and even silently picking up on the creation of branches to provision new pipelines. Can TFS be configured to store the CI definitions/scripts on its SCM side rather than in its accompanying SQL Server?
TFS build consists of three components:
The build definition - stored on the SQL server data tier.
The build workflow - an XAML file stored in source control.
The supporting MSBuild scripts - usually containing user-defined actions, also stored in source control.
As the build progresses, you can see visualization of the build steps and you also get a different log for the main build and the MSBuild output.
The build definition in TFS is merely a collection of build settings, similar to CC.Net's config file and TeamCity's build configuration tab, both of which are stored on the file system as well. Assuming there's a backup plan for the database, you don't really need to store the build definitions in source control, but if you must, it's possible by exporting the tbl_BuildDefinition table.
The TFS Power Tools add cloning functionality for build definitions.
There's no OOTB support for provisioning build definitions from a new branch, though it's fairly feasible using the TFS API.
Bit late to the party, but just don't bother with TFS if you want advanced build pipeline automation. It simply doesn't cut it.
I have used both Jenkins and TFS extensively. TFS is just. pure. crap. Here's why.
No downstream/upstream builds.
No pipeline/orchestration builds (like Jenkins has).
Obscure ways of adding build steps, and it falls back to using MSBuild.
Slow, and it still polls source control.
Ties you to MSTest.
And please don't point me to "Oh look you can do everything if you write a custom activity". I'm not wasting time doing development for a closed source, sub-par platform. If I am going to contribute something, it's to a FREE. OPEN SOURCE platform.