TFS Build Agent - Waiting for an agent to be requested

I am in the process of testing a TFS 2013 to TFS 2018 onprem upgrade. I have installed 2018.1 on a new system (and upgraded a copy of my TFS databases). I have installed a build agent on a new host which shows up under Agent Queues (as online and enabled).
I'm now trying to create a build. I set things up as I feel they should be and it sits at this screen:
Build
Waiting for an available agent
Console
Waiting for an agent to be requested
The VSTS Agent service is running on the build agent system, so I feel that part is OK. I'm somewhat at a loss. Any assistance is appreciated.

Try the items below to narrow down the issue:
Check the build definition requirements (the Demands section) against the agent's capabilities, and make sure the demanded capabilities are actually installed on the agent machine.
When a build is queued, the system sends the job only to agents that have the capabilities demanded by the build definition.
Check whether the service "Visual Studio Team Foundation Background Job Agent" is running on the TFS application tier server.
If it's not started, start the service.
If the status is Running, try restarting the service. (A scripted version of this check follows this list.)
Make sure the account the agent runs under is in the "Agent Pool Service Account" role.
Try switching to a domain account that is a member of the Build Agent Service Accounts group and holds the "Agent Pool Service Account" role, to see whether the agent picks up jobs then.
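For the service check in particular, here is a minimal PowerShell sketch, assuming you run it directly on the application tier (verify the display name with Get-Service first, as it can vary slightly between TFS versions):

```powershell
# Check the TFS background job service and start/restart it as appropriate.
$jobAgent = Get-Service -DisplayName "Visual Studio Team Foundation Background Job Agent" -ErrorAction SilentlyContinue
if ($null -eq $jobAgent) {
    Write-Warning "Background Job Agent service not found on this machine."
} elseif ($jobAgent.Status -ne 'Running') {
    Start-Service -InputObject $jobAgent     # start it if stopped
} else {
    Restart-Service -InputObject $jobAgent   # bounce it if already running
}
```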

We have just spent five days trying to diagnose this issue and believe we have finally nailed the cause (and the solution!).
TL;DR version:
We're using TFS 2017 Update 3, YMMV. We believe the problem is the result of a badly configured old version of an Elastic Search component that is used by the Code Search extension. If you do not use the Code Search feature, please disable or uninstall this extension and report back - we have seen huge improvements as a result.
Detailed explanation:
So what we discovered was that MS have repurposed an Elastic Search component to provide the code search facility within TFS - the service is installed when TFS is installed if you choose to include the search feature.
For those unfamiliar with Elastic, one particularly important aspect is that it uses a multi-node architecture, shifting load between nodes and balancing the workload across the cluster and herein lies the MS Code Search problem.
The Elastic Search component installed in TFS is (badly) configured to be single node, with a variety of features intentionally suppressed or disabled. With the high water-mark setting set to 85%, as soon as the search data reaches 85% of the available disk space on the data drive, the node stops creating new indexes and will only accept data to existing indexes.
In a normal Elastic cluster, this would cause another node to create a new index to accept the new data but, since MS have lobotomised the cluster down to one node, the fall-back... is the same node - rinse and repeat.
The behaviour we saw, looking at the communications between the build agent and the build controller, suggests that the Build Controller tries to communicate with Elastic and eventually fails. Over time, Elastic becomes more unresponsive and chokes this communication which manifests as the controller taking longer and longer to respond to build requests.
It is only because we actually use Elastic Search that we were able to interpret the behaviour and logs to come to this conclusion. Without that knowledge it would be almost impossible to determine the actual cause.
How to fix this?
There are a number of ways that you can fix this:
Don't install the TFS Search feature
If you don't want to use the Code Search feature, don't install it. The problem will not occur.
Remove the TFS Search feature [what we did]
If you don't use the Code Search feature, uninstall it. The problem will go away - you can either just disable the extension in all collections or you can use the server installer to fully remove it. I have detailed instructions from MS for anyone who wants to eradicate it completely, just ask.
Point the Search feature to a properly configured, real Elastic cluster
If you use Elastic properly, rather than stuffing it in a small box on its own, the problem will not occur.
Ensure the search data disk never hits the 85% water-mark
Elastic will continue to function "properly" and should return search results as expected, within the limited parameters.
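If you want to verify whether your search node has tripped the watermark, a quick probe of the Elasticsearch endpoint works. A minimal PowerShell sketch, assuming the TFS search service exposes Elasticsearch on its default port 9200 on the search host (adjust host/port, and note that some setups require credentials):

```powershell
$es = "http://localhost:9200"

# Disk usage per node: once disk.percent reaches the low watermark (85% by
# default), the node stops allocating new shards/indexes.
Invoke-RestMethod "$es/_cat/allocation?v"

# Cluster health: a yellow/red status here correlates with the symptoms above.
(Invoke-RestMethod "$es/_cluster/health").status
```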
Hope this helps someone out there avoid the pain we suffered.

The TF Background Job Agent wasn't running on the application tier, because that account didn't have 'log on as a service'.
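In case it helps anyone else hitting the same cause: granting that right can be scripted with secedit. A rough PowerShell sketch; the account name is a placeholder, it must run elevated, and it assumes the SeServiceLogonRight line already exists in the exported policy:

```powershell
$account = "DOMAIN\tfsservice"   # hypothetical service account
$cfg = "$env:TEMP\ura.inf"

# Export current user-rights assignments, append the account to
# "Log on as a service" (SeServiceLogonRight), and re-apply.
secedit /export /cfg $cfg /areas USER_RIGHTS
(Get-Content $cfg) -replace '^(SeServiceLogonRight *= *)(.*)$', "`$1`$2,$account" |
    Set-Content $cfg -Encoding Unicode
secedit /configure /db "$env:TEMP\ura.sdb" /cfg $cfg /areas USER_RIGHTS
```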

I was facing the same issue and in my case, it was resolved by restarting the TFS server (TFS is hosted locally in our office).

Related

How to organize deployment process of a product from devs to testers?

We're a small team of 4 developers and 2 testers, and I'm the team lead. Developers do their tasks each in a separate branch. Our stack is ASP.NET MVC, ASP.NET Core, Entity Framework 6, MSSQL, IIS, and Windows Server. We also use Bitbucket and Jira to store code and manage issues.
For example, take a task like "add an about window". A developer creates a branch named "add-about-window" and puts all the code there. Once the task is done, I do a code review and, if all is good, merge the branch into an accumulating branch, let's name it "main". As the next step, I manually deploy the updated "main" branch to a test server with IIS and MSSQL installed. Once done, I notify the testers to test the freshly uploaded app to make sure "add an about window" is done correctly and works well. If the testers find a bug, I have to revert the task branch merge from "main" and tell the developer to fix the bug in the task's branch. Once the developer fixes it, I merge the branch into "main" again and ask the testers to check again. In the end, the task branch gets deleted.
This is really inconvenient, time-consuming, and frustrating. I have heard about git flow (maybe what we have now is kind of that).
Ideally, I would like this process to be as this:
Each developer still does work in a separate branch.
Once a task is done and all the task code is in the task branch, I do a code review.
Once the code review is done and all found issues are fixed, I just click "deploy".
There is a Docker image which contains IIS, MSSQL, and Windows, along with some base version of the application we work on, fully tested and stable. Let's say it reflects a state as of some date, like the start of the year.
The Docker image is taken and a new container starts.
This Docker container gets fully initialized, and then the code from the branch gets installed on the running container.
This container then has its own domain name like "proj-100.branches.ourcompany.com" ("proj-100" is the task's ID in Jira), which testers can visit and test.
This would definitely decrease time I spend on deployment and also will make the process more convenient and comfortable.
Can someone advise some resources where I can learn about similar deployment models? Or maybe someone can share info on this. Any info will be very much appreciated.
Regardless of your stack, and before talking about solutions: what you describe is the basic use case of any CI/CD process. All the exhausting manual steps you described can be done with any CI tool.
Now, let's consider what you already have and talk about the steps toward your desired solution. You're using Bitbucket, which already gives you at least steps 1 and 2: merging only approved PRs into master/main.
Step 3 is where the CI automation starts: you define a webhook on certain actions in the Bitbucket repo, which triggers a CI job/pipeline (a Jenkins server, gitlab-ci, or many other CI solutions). This way you won't even need a "deploy" button, since the merge action can trigger the job, which can automatically run unit tests, integration tests (if you define them), build artifacts/Docker images, and finally deploy.
Step 4 needs some basic understanding of Docker container design: a Docker image is not a VM. It has its use cases and relevant scenarios and, more importantly, an advised architecture guideline to follow.
To keep it short, I'll only mention the principle of separation: each service should live in a separate container. That allows upscaling, easier debugging, and much more. Which means what you need is not one Docker image that contains your entire system, but an orchestration of containers, each containing an independent software unit with a clear responsibility. And here Kubernetes comes into play.
Back to the CI job: after the PR merge, the job starts, running the pre-defined unit tests, building the container, and uploading it to your registry. (A rough sketch of such a job follows below.)
Moving to CD: depending on your process, once the updated and tested Docker images are in your registry (could be Artifactory/GitLab registry/Docker registry...), the CD job can take any image it needs and deploy it to your Kubernetes cluster. And that's it: the process is done.
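To make the CI part concrete, here is a rough PowerShell sketch of what such a webhook-triggered job could do. The registry URL, image name, and test command are assumptions for illustration, not prescriptions:

```powershell
param(
    [string]$Branch   = "main",
    [string]$Registry = "registry.ourcompany.com",   # hypothetical registry
    [string]$Image    = "ourapp"
)

# Get the merged code.
git checkout $Branch
git pull

# Run the test suite first; abort the pipeline on failure.
dotnet test ./OurApp.sln                     # placeholder test command
if ($LASTEXITCODE -ne 0) { throw "Tests failed; aborting deploy." }

# Bake the tested code into an image and publish it for the CD stage.
$tag = "${Registry}/${Image}:$(git rev-parse --short HEAD)"
docker build -t $tag .
docker push $tag
```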
A word of advice: if you don't have a professional DevOps team or a good understanding of Docker, the CI/CD process, and Kubernetes, and if your dev team is small (and unfortunately it seems so), you may want to consider hiring a DevOps company to build the DevOps/CI-CD infrastructure for you, preferably as a completely managed DevOps solution, and then do a handover. Everything I wrote is just a guideline and the basic points, to give you the big picture. Good luck!
All the other answers here are great; still, I would like to add my piece of advice.
Recently I was also working on a product, and we were three team members. It was a Node.js project. If you are on AWS, you can use AWS CodePipeline. It detects pushes to a specific GitHub branch, and the changes get deployed to the server. The pipeline service has a build stage too. You can also configure Slack notifications.
But you should have at least two environments, production and dev, so you can check on dev whether a deployment works properly.
AWS also has services like AWS CodeCommit and AWS CodeDeploy.
This is just a basic solution; you don't actually need fancy software to set up CI/CD.
This kind of setup is usually supported by a CICD tool coupled with Kubernetes.
Either an on-premise one, like Jenkins+Kubernetes, or its Jenkins Kubernetes plugin, which runs dynamic agents in a Kubernetes cluster.
You can see an example in "How to Setup Jenkins Build Agents on Kubernetes Pods" by Bibin Wilson.
Or a Cloud one, like Bitbucket pipeline deploying a containerized application to Kubernetes
In both instances, the idea remains the same: create an ephemeral execution environment (a Docker container with the right components in it) for each pushed branch, in order to execute tests.
That way, said tests can take place before any merge between a feature branch and an integration branch like main.
If the tests pass, Jenkins itself could trigger an automatic merge (assuming the feature branch was first rebased on top of the target branch, main in your case).
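As an illustration of that ephemeral, per-branch idea (and of the asker's "proj-100.branches.ourcompany.com" wish), here is a hedged PowerShell sketch. It assumes a wildcard DNS record and a reverse proxy that routes by container label; every name in it is illustrative:

```powershell
$branch = "proj-100"                          # e.g. the Jira ticket ID
$image  = "registry.ourcompany.com/ourapp:$branch"

# Build an image from the branch and (re)start its dedicated container.
docker build -t $image .
docker rm -f "env-$branch" 2>$null            # drop any previous instance
docker run -d --name "env-$branch" `
    --label "vhost=$branch.branches.ourcompany.com" `
    $image

# Tear down once the branch is merged or the ticket is closed:
# docker rm -f "env-$branch"
```

The label is only meaningful if something (e.g. a label-aware reverse proxy) consumes it; the point is that each branch gets an isolated, disposable environment.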
We have a similar process in our team.
We use gitlab-ci.
Since we have some infrastructure outside Docker (nginx with DNS for the test stands),
we just create dev1, dev2, ... stands (5 stands for a team of 10 developers and more than 6 microservices). For each devX stand and each microservice we have a "deploy to devX" button in our CI/CD. We simply reserve devX in Slack for feature Y for the duration of testing after a deploy. When tests are done and bugs are fixed, we merge to the main branch, and another feature branch can be deployed and tested on the devX stand.
As a next step, I then manually deploy the updated "main" branch to test server with installed IIS, MSSQL.
Once done I notify testers to test freshly uploaded app to make sure "add about window" is done correctly and works good.
If testers find a bug, I have to revert the task branch merge from "main" branch and tell the developer to fix the bug in task's branch.
If there are multiple environments then devs could deploy themselves into them. Even a single "dev" environment they can deploy to would greatly help. The devs should be able to deploy themselves and notify the testers without going through you.
That the deploy is "manual" is suspicious. How manual? Ideally it should just be a few button clicks. Sometimes you can even have it so that pushing to a branch does a deploy (through webhooks).
You should be able to deploy from branches besides main. What that means or looks like can vary a lot, but the point is that if you're forcing everything through main and having to revert when it doesn't work, you're creating a lot of unneeded work. Ideally there should be some way to test locally. If there really can't be, then you at least need a way to deploy from any branch (or something like force-pushing to a branch called 'dev').
From another angle, unless the application gets horribly broken, you don't necessarily need to roll back changes unless a release is coming soon. You can just have it fixed in another pull request.
All in all, the main problem here sounds like there's only a single environment for testing, the process to deploy to it is far too manual, and the devs have no way to deploy to it themselves. This sort of thing is a massive bottleneck. Having a burdensome process to even begin to test things takes a big toll on everyone's morale -- which can be worse than the loss in velocity. You don't necessarily need every dev to be able to spin up as many environments as they want at the push of a button, but devs do need some autonomy to be able to test.
Having the application run in Docker containers can greatly help with running it locally as well as making the deployment process simpler. I've tried to stay away from specific product suggestions because it sounds like this is more of a process problem.

Release powerapp solution to new environment with devops

I am interested in any information about or experiences with deploying PowerApps solutions to new environments within the same tenant.
In my solution I have a canvas-app and several flows between the app and sharepoint. I have used connection references to all connections (sharepoint, mail, etc.). On the devops side I have a build pipeline from my development environment, very much in line with Microsoft's recommendations for ALM. In addition, I have a release pipeline to publish the solution in another environment, e.g. a test environment. I can publish the release but when I access the solution in the new environment all flows have been turned off and all connections to sharepoint have been severed. When I inspect the flows it throws an error that it was unable to locate the connection Id. What strikes me as odd here is that the connection references that are visible in the new solution cannot be selected. However, what I can do is to add a new connection (from each flow), whereafter I can turn the flow back on and activate each of them in the canvas app.
What I am asking for here, is any documentation, guide, tutorial, help, etc. to make this release a little more automatic, so I won't have to re-add connections for every single action from each of my flows.
I think you are in luck 😊 and you should check out the latest PA community call. I think the last demo is the thing you are looking for (especially from that moment I suppose🤔) and is now one of the targets in Power Platform.
If you are considering introducing source control as well (like git), there is currently a cool experiment going on in the community in that direction which I think is quite promising, and you may check this article. But please consider this pack/unpack tool an experiment and don't just remove the original .msapp files yet 😉.
I think I have finally found a working solution. I'll document my steps here for other ALM hopefuls.
When pushing to the target environment for the first time, I need to click on each of the connection references, click on Solution Layers, use the breadcrumb path to go one step back, and from there I can assign the correct connection. Subsequent deployments now work without any hassle.
Also, I have learned that a first-time deployment cannot activate workflows. Future deployments, however, can activate workflows via the corresponding setting in the Import Solution build tool.
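For anyone automating this further: if I read the ALM documentation correctly, the Import Solution step can also consume a deployment settings file that maps each connection reference to a connection in the target environment, which should remove the manual re-pointing. A PowerShell sketch that writes such a file; the logical name and IDs are placeholders you look up in the target environment:

```powershell
# Hypothetical deployment settings file for the target environment.
$settings = @'
{
  "EnvironmentVariables": [],
  "ConnectionReferences": [
    {
      "LogicalName": "new_sharedsharepointonline_12345",
      "ConnectionId": "<connection-id-from-target-environment>",
      "ConnectorId": "/providers/Microsoft.PowerApps/apis/shared_sharepointonline"
    }
  ]
}
'@
Set-Content -Path .\deploymentSettings.json -Value $settings
```

The release pipeline would then point its import step at deploymentSettings.json; treat the exact option name in your tooling as something to verify.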

TFS 2017 - BuildAgent in Queue is ignored

I have two build agents in my Default queue. Both "active" checkboxes are set, both are online and running, and both are shown as online and running. When I start a build on that queue, it is only ever sent to one agent, never to the other. If I stop that one agent, I get the error message that there is no agent available. But there is!
Does anyone have an idea what's going on here?
Specs: I have an on-premise TFS 2017 (upgraded from 2012). The build agents were installed the way TFS 2017 describes on its interface.
Getting that message means the build agent's Capabilities don't meet the conditions set in the build definition (the Demands settings) or the build requirements.
Please try the things below to narrow down the issue:
Check the Demands in your build definition and make sure every demand you added exists in the build agent's Capabilities. If one is missing, try to Add Capability for the agent manually.
Sometimes the agent cannot automatically identify some components as system Capabilities. In that case, try comparing the Capabilities of the two build agents to identify the differences, then add the missing Capability to the failing agent manually and try again. (A scripted comparison follows below.)
Deploy a new agent to check that.
Reference this article: TFS/VSTS Build – System Capabilities and Demands
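A scripted way to do that comparison, as a PowerShell sketch against the TFS REST API. The collection URL and pool id are placeholders, and the endpoint/api-version shown is what I'd expect for TFS 2017, so verify it against your instance:

```powershell
$base   = "http://tfs:8080/tfs/DefaultCollection"
$agents = (Invoke-RestMethod -UseDefaultCredentials `
    "$base/_apis/distributedtask/pools/1/agents?includeCapabilities=true&api-version=3.0").value

# Diff the system capabilities of the first two agents in the pool.
$a, $b = $agents[0], $agents[1]
$names = (($a.systemCapabilities | Get-Member -MemberType NoteProperty) +
          ($b.systemCapabilities | Get-Member -MemberType NoteProperty)).Name |
         Sort-Object -Unique

foreach ($n in $names) {
    if ($a.systemCapabilities.$n -ne $b.systemCapabilities.$n) {
        "{0}: '{1}' vs '{2}'" -f $n, $a.systemCapabilities.$n, $b.systemCapabilities.$n
    }
}
```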

How to version assemblies—pre-build—based on work items

I'd like to automatically increment my assembly versions based on this ruleset:
Revision is always 0
Build is incremented when the only WIT in the release is a Bug fix
Minor is incremented when the release contains any WIT other than a Bug fix; Build is then always set to 0
Major is never automatically incremented
Naturally this will require a build step that can interact in some way with the project.
My first thought was to build a small Windows Service that utilizes the TFS SDK to construct the version number based on these rules and return it via a WCF call, etc. But I run into a problem there with a business requirement that all code and functionality must be replicated into a VSTS project as well (the customer owns the code and must be able to proceed without me). There's no installing such a service there, of course.
I then considered installing the service on his server, in turn making it available to VSTS. This would pass the Rube Goldberg test with flying colors.
Is there an easier way of accomplishing this task? One that can work in both environments?
EDIT
I found this, but it's doubtful that the TFS SDK is registered in the GAC for VSTS.
Can someone confirm? Is the TFS SDK available to build scripts running on VSTS?
Well now that didn't take long.
I found this and this for using PowerShell to query the REST API. No GAC/SDK needed.
-- EDIT -----------------
I've intentionally excluded content from the pages behind these links as the solutions provided are exceedingly complex; it's not possible to cover the concepts here in a single post. In case the pages disappear or the URLs change, here are the links at archive.org:
1. PowerShell and vNext Builds
2. VSTS/TFS REST API: The basics and working with builds and releases
In any case, the concept is popular and well-covered—in the event these two become inaccessible, there are many others available on the same subject matter. As quickly as I found these, someone could find more.
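For completeness, the ruleset itself is only a few lines once the work item types are in hand. A PowerShell sketch, assuming $workItemTypes has already been pulled from the REST API (the values below are mocked):

```powershell
$current       = [version]"2.3.1.0"       # last released version (placeholder)
$workItemTypes = @("Bug", "Bug")          # e.g. from a WIQL query via Invoke-RestMethod

# True when every WIT in the release is a Bug fix.
$onlyBugs = -not ($workItemTypes | Where-Object { $_ -ne "Bug" })

$new = if ($onlyBugs) {
    # Bug-fix-only release: bump Build; Revision is always 0.
    [version]::new($current.Major, $current.Minor, $current.Build + 1, 0)
} else {
    # Any non-Bug WIT: bump Minor, reset Build; Major is never auto-incremented.
    [version]::new($current.Major, $current.Minor + 1, 0, 0)
}
"New version: $new"
```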

How to Sandbox Ant Builds within Hudson

I am evaluating the Hudson build system for use as a centralized, "sterile" build environment for a large company with very distributed development (from both a geographical and managerial perspective). One goal is to ensure that builds are only a function of the contents of a source control tree and a build script (also part of that tree). This way, we can be certain that the code placed into a production environment actually originated from our source control system.
Hudson seems to provide an ant script with the full set of rights assigned to the user invoking the Hudson server itself. Because we want to allow individual development groups to modify their build scripts without administrator intervention, we would like a way to sandbox the build process to (1) limit the potential harm caused by an errant build script, and (2) avoid all the games one might play to insert malicious code into a build.
Here's what I think I want (at least for Ant, we aren't using Maven/Ivy right now):
The Ant build script only has access to its workspace directory
It can only read from the source tree (so that svn updates can be trusted and no other code is inserted).
It could perhaps be allowed read access to certain directories (Ant distribution, JDK, etc.) that are required for the build classpath.
I can think of three ways to implement this:
Write an Ant wrapper that uses the Java security model to constrain access.
Create a user for each build and assign the rights described above, then launch builds in that user space (a rough sketch of this option follows the list).
(Updated) Use Linux "jails" to avoid the burden of creating a new user account for each build process. I know little about these, though; we will be running our builds on a Linux box with a recent RedHat EL distro.
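To make option 2 concrete, here is a rough sketch (written in PowerShell Core on the Linux build host, for consistency with the other snippets on this page, shelling out to the usual tools; every name and path is illustrative, and it needs sudo rights):

```powershell
# One throwaway, unprivileged user per build.
$buildUser = "build-" + (Get-Random -Maximum 99999)
$workspace = "/var/hudson/workspaces/$buildUser"

sudo useradd --no-create-home --shell /sbin/nologin $buildUser
New-Item -ItemType Directory -Path $workspace | Out-Null
sudo chown "${buildUser}:${buildUser}" $workspace

# The build can only write to its own workspace; the source tree and the
# Ant/JDK directories stay read-only via normal filesystem permissions.
sudo -u $buildUser ant -f "$workspace/build.xml"

sudo userdel $buildUser   # throw the account away afterwards
```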
Am I thinking about this problem correctly? What have other people done?
Update: This guy considered the chroot jail idea:
https://www.thebedells.org/blog/2008/02/29/l33t-iphone-c0d1ng-ski1lz
Update 2: Trust is an interesting word. Do we think that any developers might attempt anything malicious? Nope. However, I'd bet that, with 30 projects building over the course of a year with developer-updated build scripts, there will be several instances of (1) accidental clobbering of filesystem areas outside of the project workspace, and (2) build corruptions that take a lot of time to figure out. Do we trust all our developers to not mess up? Nope. I don't trust myself to that level, that's for sure.
With respect to malicious code insertion, the real goal is to be able to eliminate the possibility from consideration if someone thinks that such a thing might have happened.
Also, with controls in place, developers can modify their own build scripts and test them without fear of catastrophe. This will lead to more build "innovation" and higher levels of quality enforced by the build process (unit test execution, etc.)
This may not be something you can change, but if you can't trust the developers, then you have a larger problem than what they can or cannot do to your build machine.
You could go about this a different way: if you can't trust what is going to be run, you may need one or more dedicated people to act as build masters, verifying not only the changes to your SCM but also executing the builds.
Then you have a clear path of responsibility: builds are not modified after the build and only come from that build system.
Another option is to firewall off outbound requests from the build machine, allowing only certain resources like your SCM server and your other operational network resources (e-mail, OS updates, etc.).
This would prevent Ant scripts from fetching resources off the build system that are not in source control.
When using Hudson you can set up a Master/Slave configuration and then not allow builds to be performed on the Master. If you configure the Slaves as virtual machines that can be easily snapshotted and restored, you don't have to worry about a person messing up the build environment. If you apply a firewall to these Slaves, it should solve your isolation needs.
I suggest you have 1 Hudson master instance, which is an entry point for everyone to see/configure/build the projects. Then you can set up multiple Hudson slaves, which might very well be virtual machines or (not 100% sure if this is possible) simply unprivileged users on the same machine.
Once you have this set up, you can tie builds to specific nodes, which are not allowed - either by virtual machine boundaries or by Linux filesystem permissions - to modify other workspaces.
How many projects will Hudson be building? Perhaps one Hudson instance would be too big, given the security concerns you are expressing. Have you considered distributing the Hudson instances out, one per team? This avoids the permission issue entirely.
