Diagnosing TFS Build Hanging after 'Copy Files to Drop Location' step

I need some advice on how to diagnose a hanging build. It has only been happening for the last week or two, and I have good reason to suspect it’s something that I’ve done recently and not just a coincidence.
Setup
TFS 2013
4 machine setup - 2 app tiers (in process of deprecating one of them), 1 sql server, 1 build server running 2 agents.
Build Controller is running on 2nd app tier along with the Job Agent
1st App tier is serving the website (although that machine will soon be shutdown and everything will be passed to the 2nd app tier as the machine is getting old)
Symptoms
All executed builds (it doesn’t appear to matter which build process template) never get marked as done. The last step always seems to be the same one: “Copy Files to Drop Location”/“Workspace and Copy Files to Drop Location”/“Copy Binaries to drop, Reset the environment” (named differently in each build template)
The files appear to be getting dropped successfully in the build drop folder
Looking at Task Manager, it appears that all the build processes on the build server have exited (only TFSBuildServiceHost remains)
Builds show their normal steps/logging while executing
Primary app tier has related warnings in the event logs (see warnings below)
Recent Changes
Installed Xamarin Android/iOS on the build server
Installed a few custom built plugins for Job Agent, Message Queue, and Web Services (been using them for years, just had them disabled the last few weeks due to an app tier migration)
Installed Tiago’s Task Board Enhancer (again been using this for a long time, just had it disabled recently)
About a month ago we added the 2nd app tier and moved the sql off to another machine
What I’ve Tried
Rebooting both App tiers and build server
Uninstalling Xamarin (although I suspect some parts are still floating around as the Bonjour service appears to still be installed)
Removing the custom plugins
Turned logging diagnostics right up on one of the builds – nothing particularly of interest seems to turn up
Run the Best Practice Analyzer (nothing too unusual shows up)
Multiple build process templates (defaulttemplate, defaulttemplate.11.1, tfvctemplate.12.xaml)
Multiple build definitions
Checked the event logs of both AppTiers and Build server
The Team Foundation service host request monitor has detected the
following condition: Date (UTC): 3/02/2014 12:54:06 a.m. Machine:
CODEBASE Application Domain: /LM/W3SVC/1/ROOT/tfs-1-130357641583538280
Assembly: Microsoft.TeamFoundation.Framework.Server, Version=12.0.0.0,
Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a; v4.0.30319 Service
Host: 0dc282b5-59a8-4941-b541-a4f7d314cd0f Process Details: Process
Name: w3wp Process Id: 2508 Thread Id: 2504
Detailed Message: A request for service host XXXX has been executing
for 37 seconds, exceeding the warning threshold of 30.
Request details: Request Context Details
Url: /tfs/XXXX/XXXX/_api/_build/stop?__v=4
Method: ApiBuild.stop
Parameters: uri = vstfs:///Build/Build/34064
User Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36
Unique Id: 00000000-0000-0000-0000-000000000000
The Team Foundation service host request monitor has detected the
following condition: Date (UTC): 30/01/2014 11:10:01 p.m. Machine:
CODEBASE Application Domain: /LM/W3SVC/1/ROOT/tfs-1-130355232548668648
Assembly: Microsoft.TeamFoundation.Framework.Server, Version=12.0.0.0,
Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a; v4.0.30319 Service
Host: 0dc282b5-59a8-4941-b541-a4f7d314cd0f Process Details: Process
Name: w3wp Process Id: 70320 Thread Id: 14540
Detailed Message: A request for service host XXXX has been executing
for 37 seconds, exceeding the warning threshold of 30.
Request details: Request Context Details
Url: /tfs/XXXX/Build/v4.0/BuildService.asmx
Method: StopBuilds
Parameters: uris[0] = vstfs:///Build/Build/34051 uris = Count = 1
User Agent: Team Foundation (devenv.exe, 12.0.21005.1, Premium, SKU:16)
Unique Id: 4d2d3213-fd41-4c4d-8ab0-b87619c96a42
The Team Foundation service host request monitor has detected the
following condition: Date (UTC): 31/01/2014 3:14:17 a.m. Machine:
CODEBASE Application Domain: /LM/W3SVC/1/ROOT/tfs-1-130355232548668648
Assembly: Microsoft.TeamFoundation.Framework.Server, Version=12.0.0.0,
Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a; v4.0.30319 Service
Host: Process Details: Process Name: w3wp Process Id: 70320
Thread Id: 14540
Detailed Message: There are no active requests for service host XXXX
that exceed the warning threshold of 30.
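When there are many of these request monitor warnings, it can help to reduce each one to the method that was slow and how long it ran. The following is a minimal sketch that assumes the event text has been copied out of Event Viewer as plain text in the layout shown above; the function name summarize_warning is made up for illustration and is not part of TFS.

```python
import re

# Matches the "has been executing for N seconds ... threshold of M" sentence.
# \s+ is used so that the hard line wraps in the event text do not matter.
DURATION_RE = re.compile(
    r"executing\s+for\s+(\d+)\s+seconds,\s+exceeding\s+the\s+warning\s+threshold\s+of\s+(\d+)"
)
# "Url:" and "Method:" are each followed by a single token in the events above.
URL_RE = re.compile(r"Url:\s+(\S+)")
METHOD_RE = re.compile(r"Method:\s+(\S+)")


def summarize_warning(text):
    """Return url, method, seconds and threshold for one warning.

    Returns None when the event does not report a slow request
    (for example the "There are no active requests" follow-up event).
    """
    duration = DURATION_RE.search(text)
    if duration is None:
        return None
    url = URL_RE.search(text)
    method = METHOD_RE.search(text)
    return {
        "url": url.group(1) if url else None,
        "method": method.group(1) if method else None,
        "seconds": int(duration.group(1)),
        "threshold": int(duration.group(2)),
    }
```

Run over the warnings above, this would show that both slow requests are stop-build calls (ApiBuild.stop and StopBuilds), which is consistent with the builds never being marked as done.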
A quick Google search suggests upping the timeout in the TFS registry (http://xavierdilipkumar.com/post/2013/07/04/TFS-event-7005-and-7006-warning.aspx). I've tried that and it doesn't appear to change anything.

Can you look in the TFS build service logs at
Event Viewer -> Applications and Services Logs -> Microsoft -> Team Foundation Server -> Build-Services -> Operational
These timeouts generally relate to permissions. You should look for TF215106 access denied events. Although the files appear to be there, do they all have the current date, or are there some with different (older) dates? Also, are there any alerts/steps happening when the file drop occurs?
Other than that, it could be timing out because one of the dependencies is being used by another service.

You might fire up Sysinternals Process Monitor to see when the processes actually exit and what they were doing (Process Monitor monitors "real-time file system, Registry and process/thread activity").

The best course of action is to call Microsoft Support and open a Service Request. Make sure it gets priority A - your TFS production environment is not working - and be prepared to give them support and access.
The only hint from the log is that call to ApiBuild.stop. It suggests that the build workflow completed, so the code hosting it is calling back to the AT to mark the build completed. As you have no warnings from previous calls, it could be some problem at the database level. You may try activating SQL Tracing, but it's not a trivial task, as you need to be able to compare the trace with one from a working build.
Good luck

I'm reluctant to mark this as an answer because I'm not entirely sure why it worked.
Suspecting something was wrong with the build machine I created a new Build Agent on a fresh install - the hanging issue still occurred.
I then added a Build Controller to that machine and noticed that new builds using that controller would complete. This suggested that there was a communication issue between either the BA and the BC, or the BA and the primary AT.
Given that our primary AT had other issues, we decided to remove it from the picture: we switched the DNS to point at the second AT and disabled all services on the old primary. Instantly builds started to complete (including the ones that had been stuck for a number of days).
I still don't know which component was broken or why, especially since it worked fine in this configuration for a month prior. I can only assume there was either another change that I am not aware of, or the corruption of the primary AT was causing bigger issues.

We were having the same problem here: the builds were kept open even after successfully passing all workflow stages.
I logged into the build machine and noticed the build controller was "running 6 builds" for some reason, even though there were no builds at all showing in the queue in Visual Studio.
After restarting the controller, the next build worked the first time.
Just wanted to leave this one here as a possible answer. I'm not sure yet why the controller had those stuck builds, though.

I had this issue when an activity tried to log a huge message in the build log (namely the FxCopCmd activity from the CodePlex TFS Build Extensions project).
The build agent would successfully finish the build but the controller had to chew the huge message into the build log, and it was silently crashing/hanging.
I was able to track the issue down by navigating to C:\Users\[TfsServiceAccount]\AppData\Local\Temp\BuildAgent\[AgentNumber]\Logs\[BuildNumber]\ActivityLog.xml.
The last build message was truncated and by looking at the content, I recognized the FxCop output. In my case, I just set the LogToConsole parameter to False for the FxCop activity in the build process template, and the build completed successfully.
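A quick way to look for this kind of oversized message is to list the longest lines in the log file. This is only a sketch under the assumption that a huge message shows up as an unusually long line in ActivityLog.xml, which may not hold for every log; the file is read as plain text rather than parsed as XML because a truncated log may not be well-formed. The function name longest_lines is made up for illustration.

```python
import heapq


def longest_lines(path, top=3):
    """Return up to `top` (length, line_number) pairs for the longest lines.

    The file is read as text with undecodable bytes replaced, so a
    truncated or partly corrupt log does not stop the scan.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        lengths = (
            (len(line.rstrip("\r\n")), number)
            for number, line in enumerate(handle, start=1)
        )
        # nlargest consumes the generator while the file is still open.
        return heapq.nlargest(top, lengths)
```

Opening the file at the reported line numbers should then show which activity produced the message.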

This also appears to happen if the build agent cannot connect to the build controller server on port 9191.
This is easily testable with a telnet client.
It appears that my server decided it was on an unknown network and kicked the firewall into overdrive. (This was the second time I got this issue; I'm not sure if this was the reason I got it the first time, but it seems reasonable.)
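If no telnet client is installed on the build agent, the same reachability check can be sketched in a few lines of Python. This only tests that a TCP connection can be opened; it says nothing about whether the build service behind the port is healthy. The function name can_connect is made up for illustration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

From the build agent, something like can_connect("buildcontroller", 9191) would be the call to try, where "buildcontroller" is a placeholder for the controller's real host name.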

How to set up SSL certificates for containerized EventHubs message processors?

I've been writing an EventHubs message processor that just connects to EventHubs and processes messages on the EventHub. I've been developing in Visual Studio on Windows using .NET 6. Things work as expected on Windows; I can:
Connect to EventHubs
Receive messages
Do the message processing I want
Great. I then wanted to scale my message processor horizontally and decided that I would Dockerize it, and since .NET 6 runs on Linux, I would cross-compile it for Linux and eventually deploy multiple instances of my message processor on Docker Desktop as a next step. I eventually want to stick it on Kubernetes to scale up by an order of magnitude or two.
It was easy to Dockerize my Project in Visual Studio. I simply right-clicked the Project and selected Add -> Docker Support. Visual Studio detected I had Docker Desktop installed and generated all the config files I needed, and added an appropriate build configuration so that I could compile a binary, build a Docker image with it, and automatically deploy it to my local Docker Desktop instance.
.NET 6 also compiled without errors, which was great. However, when my container spins up, I get hit with the following runtime error:
System.Security.Authentication.AuthenticationException: The remote certificate is invalid because of errors in the certificate chain: PartialChain
and there is a stack trace (omitted here for brevity) stemming from something in the EventHubs processor library:
<...many layers...> at Azure.Messaging.EventHubs.Primitives.EventProcessor`1.RunProcessingAsync(CancellationToken cancellationToken)
I am correctly passing my EventHubs connection string to my container, but what I surmise is that my container is missing an SSL certificate or has a misconfigured SSL certificate. I suppose Visual Studio has helpfully silently gone ahead and installed a development certificate when I developed my message processor on Windows so that EventHubs connections "just work" in my development environment, but that SSL certificate is not available to my container, since it isn't part of the build output.
I know I probably should be using Azure key vault or whatever secret management service they provide, but how else can I resolve this SSL certificate issue as quickly or painlessly as possible? It would be nice if I can just keep my connection string in my appsettings.json (It's fine. Toy project, only using Azure free credits anyway.)
The easiest way forward would be to register a handler that participates in certificate validation and can, if desired, override normal handling and force acceptance. This, of course, comes with the warning that you're bypassing standard security checks and may be putting your network and host in danger.
You don't mention which client you're using, but each takes a set of options in its constructor. The options for each type have a member named ConnectionOptions, which returns an EventHubsConnectionOptions instance that allows you to register a CertificateValidationCallback.
The Event Hubs Influencing SSL certificate validation sample demonstrates how to use it. More information is also available in the .NET documentation for RemoteCertificateValidationCallback.

Force TFS to use CredSSP when accessing network resources

I am using TFS 2017 Update 2 Release processing. I have a functioning deploy process that works within a domain (it runs successfully against 10 different deployment environments)... and now I need to deploy into a different environment, which lives in a different A/D domain.
Unfortunately, the domain trust is one way between the domains - and the destination domain ("Production") does not trust the domain I am installing from ("Dev")
The problem I'm seeing seems to be the infamous "double hop" credential problem.
My TFS app tier can see (and trigger activity on) the release server running TFS vNext Agent 2.117.2. Further, I can execute inline PowerShell and locally hosted PowerShell scripts on the release server just fine.
However, as soon as I try to access a PowerShell script not on the release server (be it in the Production domain with the release server, or in the Dev domain) I get an error:
2018-02-13T19:03:32.6611149Z ##[error]. : AuthorizationManager check failed.
At line:1 char:3
+ . '\\unc\path\to\share\TFSScripts\Emit-Variables2. ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : SecurityError: (:) [], PSSecurityException
+ FullyQualifiedErrorId : UnauthorizedAccess
The account running the TFS release service has been confirmed to have access to the script file when running from the desktop of the release server, so access should not be an issue.
Further testing of the issue has identified that if we manually create a PSSession using -Authentication CredSSP and pass a credentials object, we can successfully access the off-server resources.
However, I can see no way to configure TFS to use CredSSP as the authentication mechanism.
The servers involved are W2K8R2, so we can't use the constrained delegation functionality that W2K12 introduced. We have also tried SPNs with similar unsuccessful results. Kerberos has been forced to use TCP by setting the max packet size to 0 (thus also preventing fragmented UDP packets and related problems). Our max Kerberos packet size is set to 48000.
In the ultimate end state, the TFS app server and all the TFS artifacts and release scripts will sit in the "dev" domain on one side of a firewall, and the production release server and the set of servers to release to will exist in the "production" domain on the other side of the firewall.
CredSSP seems to be the only way to make this work - but I see no way for TFS to be configured for it.
This can't be a unique problem. Can someone provide some insight on how to get around this?
Sorry, it's not possible to force TFS to use CredSSP when accessing network resources, and there is no TFS configuration option for using CredSSP as the authentication mechanism.
You must manually enable CredSSP in PowerShell.
Alternatively, take a look at this solution, which may do the trick: TFS2015 Release Management: Deploying to an untrusted domain by having the deployment agent run under a shadow account.

Windows Service Install Ends in Rollback

When I try to install a Windows service:
c:\Windows\Microsoft.NET\Framework64\v4.0.30319\installutil
I get, what looks to be, some success messages and some failure messages. Part way down:
An exception occurred during the Install phase.
System.ComponentModel.Win32Exception: The specified service has been marked for deletion
At the end:
The Rollback phase completed successfully.
The transacted install has completed.
The installation failed, and the rollback has been performed.
The service is given an entry in the Services applet, but it is marked as "Disabled". When I attempt to change it to another state, I get a "marked for deletion" error message.
There are no messages in the Event Log. There is nothing useful in the log file created by installutil.exe (I believe it's written to the current working directory).
I have no direction to go with this. What do I do?
It turns out that the install might, or probably will, fail if that service is highlighted in the Services applet. It's safest to just close the Services applet, install the service, and then re-open the Services applet. It's really stupid.
Also, make sure to run the console as admin.
I experienced the same thing, and the issue for me was that a service with the same name was already installed, so in order to install the new service I had to uninstall the older one. I am learning how to create and set up Windows services, hence the naming conflict. Try uninstalling the service first with:
c:\Windows\Microsoft.NET\Framework64\v4.0.30319\installutil -u servicename.exe
Once this statement executes successfully, install your service and it should succeed without any rollbacks.
Right Click on Command Prompt and choose RUN AS ADMINISTRATOR
Then copy and paste in: C:\Windows\Microsoft.NET\Framework64\v4.0.30319\InstallUtil.exe C:\TestService\bin\Debug\TestService.exe
Result in TestService.InstallLog is:
Installing service TestService...
Service TestService has been successfully installed.
Sometimes this happens due to permission issues.
Run the "Developer Command Prompt for VS 2012" as Administrator.
Then it will work.
Here are a few more checks that may help with the issue above.
Build the service in Release mode and copy the files from the Release folder to a different path.
Copy that path, open the Visual Studio command prompt, and run the sample command below to install the service.
Close the services.msc window if it is open, then run C:\Program Files (x86)\Microsoft Visual Studio 11.0>InstallUtil.exe C:\RunLocationServices\TestService.exe
Open services.msc, select the service, and click Start. If its status changes to "Started", the service is running fine.
If the issue still exists, then
Another Checkpoint & SOLUTION
When a service starts, it tells the Service Control Manager how long it needs to start (the time-out period for the service).
If the Service Control Manager does not receive a "service started" notice from the service within this time-out period,
the Service Control Manager terminates the process that hosts the service.
This time-out period is typically less than 30 seconds.
If you do not adjust this time-out period, the Service Control Manager ends the process.
To adjust this time-out period, follow these steps:
1. Go to Start > Run and type regedit.
2. Navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control.
3. With the Control key selected, right-click in the pane on the right and select New > DWORD Value.
4. Name the new DWORD ServicesPipeTimeout.
5. Right-click ServicesPipeTimeout, and then click Modify.
6. Click Decimal, type 180000, and then click OK.
7. Restart the computer.
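The registry steps above can also be written down as a .reg file so the change can be reviewed before it is imported. The sketch below only produces the file text and does not touch the registry. It assumes the value is in milliseconds (so 180000 is three minutes), which the steps above do not state, and the function name render_reg_file is made up for illustration.

```python
def render_reg_file(timeout_ms=180000):
    """Return .reg file text that sets ServicesPipeTimeout to timeout_ms.

    Assumption: the value is a DWORD interpreted as milliseconds.
    """
    if not 0 <= timeout_ms <= 0xFFFFFFFF:
        raise ValueError("timeout_ms must fit in a 32-bit DWORD")
    return (
        "Windows Registry Editor Version 5.00\r\n"
        "\r\n"
        "[HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Control]\r\n"
        # .reg files store DWORD values as eight hexadecimal digits.
        f'"ServicesPipeTimeout"=dword:{timeout_ms:08x}\r\n'
    )
```

Note that the decimal 180000 typed into regedit appears as dword:0002bf20 in the file, and a restart is still needed after importing it.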
If the issue still exists, the problem is in your service code; an infinite loop may be occurring in the methods or classes the service calls. Review the code line by line.
This problem is due to security, you'd better open developer command prompt for VS 2012:
RUN AS ADMINISTRATOR
and install your service. It will surely fix your problem.
I tried and the issue was resolved.

MSDeploy WMSVC is not working in .net environment

I have a build/test server which is currently running Jenkins for my continuous integration, and it is also acting as my test server where code will be deployed once built (I hope to rectify this and separate these at a later date when budget allows).
I have a .NET web solution (nothing complex, just Umbraco essentially) that I have in SVN, and Jenkins is now building it correctly. I now want to deploy it onto the same server using MSDeploy. After the build completes the package is generated, but the deploy fails with the error
ERROR_DESTINATION_NOT_REACHABLE: Web deployment task failed. (Could not connect to the remote computer ("xxxxx.xxxxxxx.xxx.xxxx"). On the remote computer, make sure that Web Deploy is installed and that the required process ("Web Management Service") is started
Here are the MSBuild parameters that Jenkins uses
/P:Configuration=Release
/P:DeployOnBuild=True
/P:MSDeployPublishMethod=WMSVC
/P:DeployTarget=MSDeployPublish
/P:PublishProfile=GetSomePixels
/P:MsDeployServiceUrl=https://build.########
/P:AllowUntrustedCertificate=True
/P:CreatePackageOnPublish=True
/P:UserName=#######
/P:Password=########
I've checked the server and the Web Management Service is running and is starting up manually
I've also gone into IIS 8 manager (server 2012) and checked the "Allow Remote Connections" box under "Management Service". Restarted IIS and the WMSVC and still not working.
If I go to https://myserver.co.uk:8172/MsDeploy.axd in a browser it resolves (gives you the warning about an untrusted cert) and then displays a blank page.
Anyone got any ideas as to what I can do? I thought that it may be firewall related, and even though an exception for 8172 had been added to Windows Firewall, I have turned the entire firewall off to completely rule that out, and still no luck.
I have run this on the server to check it's listening on the correct port
C:\Users\Administrator>netstat -a | findstr 8172
TCP 0.0.0.0:8172 GSP-BUILD:0 LISTENING
TCP [::]:8172 GSP-BUILD:0 LISTENING
OK, I've resolved this. It appears you have to activate the Web Management Service first and then install Web Deploy, and I'd done it the other way round. I uninstalled Web Deploy, re-installed it, and restarted the server, and it's working.
Agree with comment.
We had a similar issue. Initial installation even post Web Management Service activation appeared to be incomplete. In our case, even though the service said it was started we couldn't achieve the "green tick" when testing the connection from the Publish dialog when defining a profile.
Reinstalling WebDeploy 3.6 made it function properly.

TFS 2012 Team Build and Web Application Deployment - ERROR_USER_NOT_ADMIN

We have a solution consisting of several class libraries, and a Web
Application Project. We are using TFS 2012 with Team Build. The solution
compiles correctly on the build server.
I am currently trying to deploy the web application via MSBuild Arguments.
/p:DeployOnBuild=True /p:DeployTarget=MsDeployPublish
/p:CreatePackageOnPublish=False /p:MSDeployPublishMethod=RemoteAgent
/p:MsDeployServiceUrl=https://testWebServer:8172/MsDeploy.axd?site=direct /p:AllowUntrustedCertificate=True
/p:DeployIisAppPath="direct"
/p:AuthType=NTLM
The solution builds but does not deploy. I get the following error message:
msdeploy error ERROR_DESTINATION_INVALID: Web deployment task failed.
( Could not connect to the remote computer ("https"). Make sure that
the remote computer name is correct and that you are able to connect
to that computer. Learn more at:
http://go.microsoft.com/fwlink/?LinkId=221672#ERROR_DESTINATION_INVALID.)
[C:\Builds\1\ProjectName\Solution General Build\Sources\Temp
Source\ProjectName\Solution\Project.csproj]
Is there another argument I should be passing to specify the server? I did
not intend for https to be the server name... I have tried omitting the
https:// to no avail, error is the same, so it is getting the value from
somewhere.
I have tried this with the following values for MsDeployServiceUrl:
https://testWebServer:8172/MsDeploy.axd?site=direct
https://testWebServer:8172/MsDeploy.axd
"https://testWebServer:8172/MsDeploy.axd?site=direct"
https://192.168.X.X:8172/MsDeploy.axd?site=direct
"https://192.168.X.X:8172/MsDeploy.axd?site=direct"
testWebServer:8172/MsDeploy.axd?site=direct
Update
Alright, the following is at least connecting:
/p:MsDeployServiceUrl=testWebServer
I have seen numerous posts concerning that particular argument, and almost invariably they are a URL, not just a hostname (the ones that appear to be a hostname I thought were just written that way for brevity).
I am now, however, faced with a new problem. I have made the Build Service Account (domain account) local admin on the webserver, and I am getting msdeploy error ERROR_USER_NOT_ADMIN as well as an Audit failure in the Security log.
Resolution
These are the MSBuild arguments I am currently going with.
/p:DeployOnBuild=True /p:DeployTarget=MsDeployPublish /p:CreatePackageOnPublish=False /p:MSDeployPublishMethod=WMSvc /p:MsDeployServiceUrl="https://SERVER:8172/MsDeploy.axd" /p:AllowUntrustedCertificate=True /p:DeployIisAppPath="siteName"
I am now getting ERROR_USER_UNAUTHORIZED. Apparently I have either not set up the delegation correctly or the IIS Manager User I have created is somehow incorrect. Regardless that will go in a different post if necessary.
What server and IIS version are you using?
IIS 6 uses the Web Deployment Agent Service (MsDepSvc), whereas IIS 7 usually uses the Web Management Service (WMSvc); the two have different URLs (besides, you have to be an admin on the target server to use MsDepSvc).
Can you try specifying
/P:MSDeployPublishMethod=WMSvc
Based on this article from Troy Hunt, the Web Management Service (WMSvc) uses .axd URLs (the kind you specify), whereas you are trying to force it to use the RemoteAgent publish method, which seems inconsistent.
See this article for the complete set of differences between the WMSvc and RemoteAgent publish methods.
I had a similar issue. To resolve it I tried the following steps:
Making sure that port 8172 was open (obviously), as it was a hosted server.
Creating a new login and setting it up in IIS -> Deploy -> Configure -> Configure Web Deploy Publishing on the target server. I made sure that the password didn't contain any spaces, to avoid the quotes issue.
Actually running a manual deployment from the build server.
Finally, specifying an IP address in MsDeployServiceUrl:
/p:MsDeployServiceUrl=xxx.xxx.xxx.xxx:8172/msdeploy.axd
None of the web site names worked for me either. None of my parameters had quotes in. Of course if you leave a space in incorrectly in one of your parameters you will get the error:
MSBUILD : error MSB1008: Only one project can be specified
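One way to sidestep the stray-space problem when the build is launched from a script is to pass each property to MSBuild as its own list element instead of composing one long command string. The following is a minimal sketch, not the way Team Build itself passes arguments; the function name msbuild_args and the "Default Web Site" value are made up for illustration.

```python
def msbuild_args(project, properties):
    """Build an argument list with one /p:Name=Value element per property.

    Passing this list to subprocess.run keeps a value that contains
    spaces inside a single argument, so MSBuild does not read the part
    after the space as a second project.
    """
    args = ["msbuild", project]
    for name, value in properties.items():
        args.append(f"/p:{name}={value}")
    return args


# Hypothetical values for illustration; the property names are the ones used above.
example = msbuild_args(
    "Project.csproj",
    {"DeployOnBuild": "True", "DeployIisAppPath": "Default Web Site"},
)
```

The resulting list can be handed to subprocess.run(example, check=True) on a machine where msbuild is on the PATH.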
