Jenkins performance issue - jenkins

I have Jenkins version 1.6.11 installed on a Windows server. The number of configured jobs is huge, and the load is distributed among multiple masters and slaves. There are a couple of issues occurring very frequently:
The whole Jenkins UI becomes so slow that either the Jenkins service or the whole server has to be restarted to bring it back to normal.
Certain jobs take far too long to load. To fix this, the affected job has to be abandoned and a new one created in its place.
It would be really helpful if you could provide possible solutions for the two issues.

Use the /node_modules/ pattern to remove the node modules folder, but before you do this, you should exclude the .git folder using /.git/.

Related

Jenkins jobs disappeared from GUI, but are still present on the disk

I had 2 folders containing several jobs, and a lot of other jobs placed directly in the Jenkins jobs root.
I was the only one using the 2 folders, the rest of my colleagues only used the jobs in the root.
I haven't accessed Jenkins for 2 weeks. Today, when I logged in, the 2 folders containing my jobs were gone. The root jobs are still present.
On the disk the jobs are still present. I can find the configuration, the builds etc.
What could be the cause? My colleagues say they haven't touched my folders. There are no related errors in the Jenkins log. All my plugins are still enabled. Indeed, many have updates available, but I guess that shouldn't matter. Also, if the problem were related to one of the jobs, why are the whole folders missing? I would expect only the jobs inside the folders not to show up.
Any ideas how to approach this and recover my jobs?
Did someone uninstall the folders plug-in?
Have you tried reloading configuration from disk?
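If it turns out to be a reload you need, it can also be triggered without the UI. Here is a minimal sketch against the Jenkins REST API; the URL and the user/API-token pair are placeholders for your own values, and the crumb step assumes CSRF protection is enabled:

    # Minimal sketch: trigger "Reload Configuration from Disk" via the Jenkins REST API.
    # JENKINS_URL, USER and API_TOKEN are placeholders for your own instance.
    import requests

    JENKINS_URL = "http://jenkins.example.com:8080"
    AUTH = ("USER", "API_TOKEN")

    # Fetch a CSRF crumb first (needed when crumb-based CSRF protection is enabled).
    crumb = requests.get(f"{JENKINS_URL}/crumbIssuer/api/json", auth=AUTH).json()
    headers = {crumb["crumbRequestField"]: crumb["crumb"]}

    # POST /reload is the same action as "Manage Jenkins > Reload Configuration from Disk".
    resp = requests.post(f"{JENKINS_URL}/reload", auth=AUTH, headers=headers)
    print(resp.status_code)

If the folders plug-in was indeed uninstalled, reinstall it before reloading, since Jenkins cannot display folder items without it.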

TFS Build Agent - Waiting for an agent to be requested

I am in the process of testing a TFS 2013 to TFS 2018 on-prem upgrade. I have installed 2018.1 on a new system (and upgraded a copy of my TFS databases). I have installed a build agent on a new host, which shows up under Agent Queues (as online and enabled).
I'm now trying to create a build. I set things up as I feel they should be and it sits at this screen:
Build
Waiting for an available agent
Console
Waiting for an agent to be requested
The VSTS Agent service is running on the build agent system, so I feel that is OK. I'm somewhat at a loss; any assistance is appreciated.
Try the items below to narrow down the issue (a scripted version of the first check is sketched after this list):
Check the build definition requirements (Demands section) against what the agent offers. Make sure the required capabilities are installed on the agent machine.
When a build is queued, the system sends the job only to agents that have the capabilities demanded by the build definition.
Check if the service "Visual Studio Team Foundation Background Job Agent" is running on the TFS application tier server.
If it's not started, just start the service.
If the status is Running, just try to Restart the service.
Make sure the account that the agent is run under is in the "Agent Pool Service Account" role.
Try changing to a domain account that is a member of the Build Agent Service Accounts group and belongs to the "Agent Pool Service Account" role, to see whether the agent works.
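If you prefer to script those checks rather than click through the UI, a rough sketch along these lines can list each pool, its agents and their capabilities over the TFS REST API. The server URL, the personal access token and the api-version value are assumptions; adjust them for your TFS 2018 instance:

    # Minimal sketch: list agent pools, agents and their capabilities via the TFS REST API.
    # SERVER, the PAT and the api-version are assumptions for an on-prem TFS 2018 server.
    import requests

    SERVER = "http://tfs.example.com:8080/tfs"
    AUTH = ("", "PAT")  # personal access token passed as the basic-auth password

    pools = requests.get(f"{SERVER}/_apis/distributedtask/pools?api-version=4.1", auth=AUTH).json()
    for pool in pools["value"]:
        agents = requests.get(
            f"{SERVER}/_apis/distributedtask/pools/{pool['id']}/agents"
            "?includeCapabilities=true&api-version=4.1",
            auth=AUTH,
        ).json()
        for agent in agents["value"]:
            caps = sorted(agent.get("systemCapabilities", {}))
            print(pool["name"], agent["name"], agent["status"], caps)

Comparing the printed capabilities against the Demands section of the build definition usually shows quickly why no agent is being requested.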
We have just spent five days trying to diagnose this issue and believe we have finally nailed the cause (and the solution!).
TL;DR version:
We're using TFS 2017 Update 3, YMMV. We believe the problem is a result of a badly configured old version of an Elastic Search component which is used by the Code Search extension. If you do not use the Code Search feature please disable or uninstall this extension and report back - we have seen huge improvements as a result.
Detailed explanation:
So what we discovered was that MS have repurposed an Elastic Search component to provide the code search facility within TFS - the service is installed when TFS is installed if you choose to include the search feature.
For those unfamiliar with Elastic, one particularly important aspect is that it uses a multi-node architecture, shifting load between nodes and balancing the workload across the cluster and herein lies the MS Code Search problem.
The Elastic Search component installed in TFS is (badly) configured to be single node, with a variety of features intentionally suppressed or disabled. With the high water-mark setting set to 85%, as soon as the search data reaches 85% of the available disk space on the data drive, the node stops creating new indexes and will only accept data to existing indexes.
In a normal Elastic cluster, this would cause another node to create a new index to accept the new data but, since MS have lobotomised the cluster down to one node, the fall-back... is the same node - rinse and repeat.
The behaviour we saw, looking at the communications between the build agent and the build controller, suggests that the Build Controller tries to communicate with Elastic and eventually fails. Over time, Elastic becomes more unresponsive and chokes this communication which manifests as the controller taking longer and longer to respond to build requests.
It is only because we actually use Elastic Search that we were able to interpret the behaviour and logs to come to this conclusion. Without that knowledge it would be almost impossible to determine the actual cause.
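If you want to verify the same thing on your own server, Elasticsearch's REST API will tell you how full the data path is and which watermark settings are in effect. A minimal sketch, assuming the TFS search node answers unauthenticated on localhost:9200 (both are assumptions):

    # Minimal sketch: inspect the Elasticsearch node behind TFS Code Search.
    # The host/port and the lack of authentication are assumptions.
    import requests

    ES = "http://localhost:9200"

    # Cluster health plus per-node disk allocation; the disk.percent column shows
    # how close the data path is to the high water-mark.
    print(requests.get(f"{ES}/_cluster/health?pretty").text)
    print(requests.get(f"{ES}/_cat/allocation?v").text)

    # Any explicitly configured cluster.routing.allocation.disk.watermark.* overrides.
    print(requests.get(f"{ES}/_cluster/settings?pretty").text)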
How to fix this?
There are a number of ways that you can fix this:
Don't install the TFS Search feature
If you don't want to use the Code Search feature, don't install it. The problem will not occur.
Remove the TFS Search feature [what we did]
If you don't use the Code Search feature, uninstall it. The problem will go away - you can either just disable the extension in all collections or you can use the server installer to fully remove it. I have detailed instructions from MS for anyone who wants to eradicate it completely, just ask.
Point the Search feature to a properly configured, real Elastic cluster
If you use Elastic properly, rather than stuffing it in a small box on its own, the problem will not occur.
Ensure the search data disk never hits the 85% water-mark
Elastic will continue to function "properly" and should return search results as expected, within the limited parameters.
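If you go down that route, even a trivial scheduled check on the search data drive keeps you ahead of the threshold. A minimal sketch; the drive letter, the 85% figure and the five-point safety margin are assumptions to adapt:

    # Minimal sketch: warn before the search data drive crosses the 85% high water-mark.
    # SEARCH_DATA_DRIVE and the margin are assumptions; hook the output into your monitoring.
    import shutil

    SEARCH_DATA_DRIVE = "E:\\"  # assumed dedicated drive for the Elasticsearch index data
    WATERMARK = 0.85

    usage = shutil.disk_usage(SEARCH_DATA_DRIVE)
    used_fraction = usage.used / usage.total
    if used_fraction >= WATERMARK - 0.05:  # warn five percentage points early
        print(f"WARNING: search data drive is {used_fraction:.0%} full "
              f"(high water-mark is {WATERMARK:.0%})")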
Hope this helps someone out there avoid the pain we suffered.
The TF Background Job Agent wasn't running on the application tier, because its account didn't have the 'Log on as a service' right.
I was facing the same issue and in my case, it was resolved by restarting the TFS server (TFS is hosted locally in our office).

VSTS agent very slow to download artifacts from local network share

I'm running an on-prem TFS instance with two agents. Agent 1 has a local path where we store our artifacts. Agent 2 has to access that path over a network path (\\agent1\artifacts...).
Downloading the artifacts from agent 1 takes 20-30 seconds. Downloading the artifacts from agent 2 takes 4-5 minutes. If from agent 2 I copy the files using explorer, it takes about 20-30 seconds.
I've tried adding other agents on other machines. All of them perform equally poorly when downloading the artifacts but quick when copying manually.
Anyone else experience this or offer some ideas of what might work to fix this?
Yes, it's definitely the v2 agent that's causing the problem.
Our download artifacts step has gone from 2 minutes to 36 minutes, which is completely unacceptable. I'm going to try out agent v2.120.2 to see if that's any better...
Agent v2.120.2
I think it's because of the amount of files in our artifacts, we have 3.71GB across 12,042 files in 2,604 Folders!
The other option I will look into is zipping (or creating a NuGet package for) each published artifact and then unzipping after the drop (a rough sketch of that idea follows below). Not the ideal solution, but something I've done before when needing to use RoboCopy, which is apparently what this version of the agent uses.
RoboCopy is not great at handling lots of small files, and having to create a handle for each file across the network adds a lot of overhead!
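For what it's worth, the zip-before-drop workaround needs nothing more than an archive step on the build side and an extract step on the release side. A minimal sketch with placeholder paths (the real steps would live in the build and release definitions):

    # Minimal sketch: archive the artifact folder before publishing and expand it after
    # download, so the copy moves one large file instead of thousands of small ones.
    # ARTIFACT_DIR, DROP_DIR and the extract path are placeholders.
    import shutil

    ARTIFACT_DIR = r"C:\agent\_work\1\a"         # placeholder: staged build output
    DROP_DIR = r"\\agent1\artifacts\MyBuild_42"  # placeholder: file-share drop location

    # Build side: produce a single .zip next to the drop location.
    archive = shutil.make_archive(DROP_DIR, "zip", root_dir=ARTIFACT_DIR)

    # Release side: expand it again after the (much faster) single-file copy.
    shutil.unpack_archive(archive, extract_dir=r"C:\agent\_work\r1\a")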
Edit:
The change to the newest version made no difference. We've decided to go a different route and use an Artifact type of "Server" rather than "File Share" which has sped it up from 26 minutes to 4.5 minutes.
I've found the source of my problem and it seems to be the v2 agent.
Going off of Marina's comment I tried to install a 2nd agent on 01 and it had the exact same behavior as 02. I tried to figure out what was different and then I noticed 01's agent version is 1.105.7 and the new test instance is 2.105.7. I took a stab in the dark and installed the 1.105.7 on my second server and they now have comparable artifact download times.
I appreciate your help.

proper preparation before upgrading jenkins

I'm going to upgrade our Jenkins CI to the latest version. I'm following this wiki page (going for the Upgrade button on the "Manage Jenkins" page): How to upgrade Jenkins
My question is this: we have a lot of jobs that run constantly (some timed, some triggered). When upgrading, should I (or do I even need to) disable all jobs beforehand? If there are jobs currently running, should I (or do I even need to) terminate them?
It depends a lot on how you deployed your CI. If you installed with the defaults (no custom settings), I assume you can follow the automatic procedure in the link you already provided.
When upgrading, should (or even need) I disable all jobs before hand?
When upgrading, you should put your Jenkins instance into quiet mode (Configure > Manage > Quiet down). This will prevent further builds from being executed and will let all running builds finish. I hope this answers both of your questions.
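Quiet mode can also be toggled from a script, which helps if the upgrade itself is automated. A minimal sketch; the URL and credentials are placeholders, and you would add a CSRF crumb header if your instance enforces it:

    # Minimal sketch: enter and leave quiet mode over the Jenkins REST API.
    # JENKINS_URL, USER and API_TOKEN are placeholders.
    import requests

    JENKINS_URL = "http://jenkins.example.com:8080"
    AUTH = ("USER", "API_TOKEN")

    requests.post(f"{JENKINS_URL}/quietDown", auth=AUTH)        # stop scheduling new builds
    # ... wait for running builds to drain, perform the upgrade ...
    requests.post(f"{JENKINS_URL}/cancelQuietDown", auth=AUTH)  # resume normal scheduling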
Speaking more about jobs, you should make a backup first in case something goes wrong.
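For the backup, a minimal sketch that archives JENKINS_HOME before the upgrade; the paths and the decision to skip workspaces (they are recreated on the next checkout) are assumptions:

    # Minimal sketch: tar up JENKINS_HOME before upgrading, skipping transient workspaces.
    # JENKINS_HOME and BACKUP are placeholders for your own paths.
    import tarfile
    from pathlib import Path

    JENKINS_HOME = Path("/var/lib/jenkins")
    BACKUP = Path("/backups/jenkins-home-pre-upgrade.tar.gz")

    def keep(tarinfo):
        # Drop job workspaces; configs, builds and plugins are what matter for a rollback.
        if "/workspace/" in tarinfo.name or tarinfo.name.endswith("/workspace"):
            return None
        return tarinfo

    with tarfile.open(BACKUP, "w:gz") as tar:
        tar.add(JENKINS_HOME, arcname="jenkins_home", filter=keep)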
Also, you should think a lot about plugins and review them all, since some of them might not work as you expect after upgrading to a fresh new Jenkins core. There is a plugin called Plugin Usage which might help you understand your current status.

Best way to manage dependent ant builds over multiple servers?

I have these Ant scripts that build and deploy my appservers. My system, though, actually spans 3 servers. They all use the same deploy script (with flags) and all work fine.
The problem is that there are some dependencies. They all use the same database, so I need a way to stop all appservers across all machines before the build first happens on machine 1. Then I need the deployment on machine 1 to complete first, as it's the one that handles the database build (which all appservers need in order to start).
I've had a search around and there are some tools that might be useful but they all seem overkill for what I need.
What do you think would be the best tool to sync and manage the ant builds over multiple machines (all running linux)?
Thanks,
Ryuzaki
You could make your database changes non-breaking, run your database change scripts first, and then deploy to your appservers. This way your code changes aren't intrinsically tied to your database changes and both can happen independently.
When I say non-breaking, I mean that the database changes are written in such a way that two different versions of the code can function against the same database. For example, rather than renaming a column, you add a new one instead.
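If you still want to enforce the ordering from the question (stop every appserver, let machine 1 build the database, then roll out the rest), even a small driver script over ssh will do it without a heavyweight tool. A minimal sketch; the host names, remote path and Ant targets/flags are placeholders:

    # Minimal sketch: sequence the stop/deploy steps across the three servers over ssh.
    # Host names, the remote path and the Ant targets/flags are placeholders.
    import subprocess

    HOSTS = ["app1.example.com", "app2.example.com", "app3.example.com"]

    def run(host, command):
        subprocess.run(["ssh", host, command], check=True)

    # 1. Stop every appserver so nothing touches the shared database during the build.
    for host in HOSTS:
        run(host, "cd /opt/deploy && ant stop-appserver")

    # 2. Machine 1 goes first: it runs the database build the others depend on.
    run(HOSTS[0], "cd /opt/deploy && ant deploy -Dwith.db=true")

    # 3. The remaining machines deploy and start once the database is in place.
    for host in HOSTS[1:]:
        run(host, "cd /opt/deploy && ant deploy")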
