Docker container logging interval: logs are not being sent as they are output by the container - docker

I am building my first, very simple container. The Python script simply prints a message with the number of seconds that have passed.
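For reference, a minimal sketch of what such a script might look like (the loop body and message format are my assumptions, based on the log output below):

import time

seconds = 0
while True:
    # flush=True pushes each line to stdout immediately; without it,
    # Python buffers stdout when it is not attached to a TTY, which can
    # make logs show up in delayed batches like the ones described here
    print(f"Waiting for file... {seconds} seconds passed", flush=True)
    time.sleep(1)
    seconds += 1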
When I try to read the logs using:
docker logs "container name"
I do get the logs; however, it seems like they are only written to the log file at a certain interval. In my case, this interval is about 3.5 minutes. I want a constant stream (so that when the container outputs x, it is immediately written to the log file).
Example of the log with timestamps (only a small part is shown):
...
2022-08-09T09:46:24.677360000Z Waiting for file... 208 seconds passed
2022-08-09T09:46:24.677364500Z Waiting for file... 209 seconds passed
2022-08-09T09:46:24.677369700Z Waiting for file... 210 seconds passed
2022-08-09T09:46:24.677388700Z Waiting for file... 211 seconds passed
2022-08-09T09:46:24.677395900Z Waiting for file... 212 seconds passed
2022-08-09T09:49:54.949131800Z Waiting for file... 213 seconds passed
2022-08-09T09:49:54.949169300Z Waiting for file... 214 seconds passed
2022-08-09T09:49:54.949176000Z Waiting for file... 215 seconds passed
2022-08-09T09:49:54.949180700Z Waiting for file... 216 seconds passed
...

Use the follow flag (--follow or -f):
docker container logs --follow CONTAINER-NAME
Relevant link: https://docs.docker.com/engine/reference/commandline/container_logs/#usage
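If you also want the timestamps shown in the example above, the --timestamps (-t) flag can be combined with --follow:
docker container logs --follow --timestamps CONTAINER-NAME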

Related

Running usd_from_gltf in an AWS Lambda

I'm trying to run Google's usd_from_gltf utility inside AWS Lambda, using a custom Docker image. The setup seems to be working locally, but when executing the same Lambda in AWS, it fails for certain input files.
Minimal app
https://github.com/petrbroz/glb-to-usdz-test
This is a minimalistic AWS SAM app with a Lambda function called GlbToUsdzFunction that downloads a GLB file from a specified URL and converts it to USDZ. The Lambda function uses a custom Docker image (https://github.com/leon/docker-gltf-to-udsz) and Python's subprocess to run the usd_from_gltf tool for the conversion.
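For context, a minimal sketch of what the handler's conversion step might look like (the handler body is my approximation rather than the actual repo code; only the usd_from_gltf command line and the log messages are taken from the output below):

import subprocess
import urllib.request

def handler(event, context):
    print("Downloading file")
    # Lambda functions may only write under /tmp
    urllib.request.urlretrieve(event["url"], "/tmp/input.glb")
    print("Converting file")
    subprocess.run(
        ["usd_from_gltf", "/tmp/input.glb", "/tmp/output.usdz"],
        check=True,
    )
    return {"status": "success"}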
Sample file URLs
https://petrbroz.s3.us-west-1.amazonaws.com/glb-to-usdz-issues/snowmobile.glb
https://petrbroz.s3.us-west-1.amazonaws.com/glb-to-usdz-issues/wall-e.glb
When running locally
The Lambda function succeeds for both snowmobile.glb and wall-e.glb. Here's an example output for the former:
$ sam build
$ echo "{ \"url\": \"https://petrbroz.s3.us-west-1.amazonaws.com/glb-to-usdz-issues/snowmobile.glb\" }" | sam local invoke "GlbToUsdzFunction" --event -
Reading invoke payload from stdin (you can also pass it from file with --event)
Invoking Container created from glbtousdzfunction:glb-to-usdz-lambda
Building image.................
Skip pulling image and use local one: glbtousdzfunction:rapid-1.46.0-x86_64.
START RequestId: 720b6b49-e36c-4429-96fb-9e0e5c02c09b Version: $LATEST
Downloading file
Converting file
Warning: extensionsUsed: Extension is in extensionsUsed but not actually referenced: KHR_texture_transform [GLTF_WARN_EXTENSION_UNREFERENCED]
END RequestId: 720b6b49-e36c-4429-96fb-9e0e5c02c09b
REPORT RequestId: 720b6b49-e36c-4429-96fb-9e0e5c02c09b Init Duration: 0.22 ms Duration: 19997.59 ms Billed Duration: 19998 ms Memory Size: 1024 MB Max Memory Used: 1024 MB
{"status": "success"}
When running in AWS
The Lambda function succeeds for snowmobile.glb but fails for wall-e.glb. Here's the output for the latter:
START RequestId: b1bdc496-ec12-430e-a641-2574af354d60 Version: $LATEST
Downloading file
Converting file
ERROR: USD: Insufficient permissions to write to destination directory '/var/tmp' (Replace) [UFG_ERROR_USD]
ERROR: USD: Failed to map '/var/tmp/output.usdc': No such file or directory (AddFile) [UFG_ERROR_USD]
Warning: USD: Failed to add temporary layer at '/var/tmp/output.usdc' to the package at path 'output.usdz'. (_CreateNewUsdzPackage) [UFG_WARN_USD]
ERROR: Cannot write USD: "/tmp/output.usdz" [UFG_ERROR_IO_WRITE_USD]
Command '['usd_from_gltf', '/tmp/input.glb', '/tmp/output.usdz']' returned non-zero exit status 255.
END RequestId: b1bdc496-ec12-430e-a641-2574af354d60
REPORT RequestId: b1bdc496-ec12-430e-a641-2574af354d60 Duration: 2039.96 ms Billed Duration: 5166 ms Memory Size: 1024 MB Max Memory Used: 101 MB Init Duration: 3125.71 ms
Has anyone run into this? Am I doing something wrong here, or is this perhaps a bug on the AWS side, or on the usd_from_gltf side?
Some USD conversions cause the library to write intermediate files, and it looks like it is built to use /var/tmp for this purpose. Since Lambdas can only write to /tmp, the workaround we came up with is to link /var/tmp to /tmp:
In the glb-to-usdz Dockerfile, add a line like RUN rm -rf /var/tmp && ln -s /tmp /var/tmp
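In Dockerfile terms, that could look something like this (the base image below is a placeholder for whatever the glb-to-usdz image actually builds from):

# placeholder base image; use whatever the glb-to-usdz image builds from
FROM some-base-image
# Lambda permits writes only under /tmp, so point /var/tmp there
RUN rm -rf /var/tmp && ln -s /tmp /var/tmp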
This allows your second example to succeed.

How to print out elapsed time using pause in ansible playbook

I'm using a pause task in my playbook that pauses for 10 minutes while EC2 instances are built in AWS. I would like to see the elapsed time during the 10 minutes instead of guessing when the pause started. I see that stdout is an output of the pause module, but I can't seem to get Ansible to print stdout while the pause is in progress.
- name: Wait 10 minutes for the infra machines to be built in AWS
  pause:
    minutes: 10
Output:
TASK [machinesets : Wait 10 minutes for the infra machines to be built in AWS] ********************************************
Pausing for 600 seconds
(ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
I don't think the pause module prints progress as it waits. You could instead print the time right before the pause. Something like:
- name: Print datetime before pause
  run_once: true
  debug:
    msg: "{{ lookup('pipe', 'date') }}"

- name: Wait 10 minutes for the infra machines to be built in AWS
  pause:
    minutes: 10
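Another option (a sketch of my own, not something the pause module does natively) is to slice the wait into smaller pauses in a loop, so a line is logged as each slice completes:

- name: Wait 10 minutes in 1-minute slices so progress is visible
  pause:
    minutes: 1
  loop: "{{ range(10) | list }}"
  loop_control:
    label: "{{ item + 1 }} of 10 minutes elapsed"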

docker-compose healthcheck retry frequency != interval

I recently set up healthchecks in my docker-compose config.
It is doing great and I like it. Here's a typical example:
services:
  app:
    healthcheck:
      test: curl -sS http://127.0.0.1:4000 || exit 1
      interval: 5s
      timeout: 3s
      retries: 3
      start_period: 30s
My container is quite slow to boot, hence the 30-second start_period.
But this doesn't really fit my expectations: I don't need a check every 5 seconds, but I do need to know as soon as possible when the container is first ready, for my orchestration; and since my start_period is approximate, if the container is not ready yet at the first check, I have to wait a full interval before the retry.
What I'd like to have is:
While container is not healthy, retry every 5 seconds
Once it is healthy, check every 1 minute
Isn't there a way to achieve this out of the box with docker-compose?
I could write a custom script to achieve this, but I'd rather have a native solution if it is possible.
Unfortunately, this is not possible out of the box.
All the durations you set are final; they can't change depending on the container state.
However, according to the documentation, the probe does not seem to wait for start_period to finish before running your test. The only effect is that any failure happening during start_period is not counted as an error.
Here is the sentence that makes me think that:
start_period provides initialization time for containers that need time to bootstrap. Probe failure during that period will not be counted towards the maximum number of retries. However, if a health check succeeds during the start period, the container is considered started and all consecutive failures will be counted towards the maximum number of retries.
I encourage you to test whether this is really the case, as I've never paid close attention to whether the healthcheck is run during the start period or not.
And if it is, you can probably increase your start_period if you're unsure about the duration, and also increase the interval, in order to find a good compromise.
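For example (these values are only illustrative):

healthcheck:
  test: curl -sS http://127.0.0.1:4000 || exit 1
  interval: 30s
  timeout: 3s
  retries: 3
  start_period: 60s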
I wrote a script that does this, though I'd rather find a native solution:
#!/bin/sh
HEALTHCHECK_FILE="/root/.healthchecked"
COMMAND=${*?"Usage: healthcheck_retry <COMMAND>"}
if [ -r "$HEALTHCHECK_FILE" ]; then
  LAST_HEALTHCHECK=$(date -r "$HEALTHCHECK_FILE" +%s)
  # in bash: FIVE_MINUTES_AGO=$(date -d 'now - 5 minutes' +%s)
  FIVE_MINUTES_AGO=$(( $(date +%s) - 5*60 ))
  echo "Healthcheck file present"
  # in bash: if (( LAST_HEALTHCHECK > FIVE_MINUTES_AGO )); then
  if [ "$LAST_HEALTHCHECK" -gt "$FIVE_MINUTES_AGO" ]; then
    echo "Healthcheck too recent"
    exit 0
  fi
fi
if $COMMAND; then
  echo "\"$COMMAND\" succeeded: updating file"
  touch "$HEALTHCHECK_FILE"
  exit 0
else
  echo "\"$COMMAND\" failed: exiting"
  exit 1
fi
Which I use like this: test: /healthcheck_retry.sh curl -fsS localhost:4000/healthcheck
The pain is that I need to make sure the script is available in every container, so I have to create an extra volume for it:
image: postgres:11.6-alpine
volumes:
  - ./scripts/utils/healthcheck_retry.sh:/healthcheck_retry.sh

dask jobqueue worker failure at startup 'Resource temporarily unavailable'

I'm running dask over Slurm via jobqueue and I have been getting three errors pretty consistently...
Basically, my question is: what could be causing these failures? At first glance the problem is that too many workers are writing to disk at once, or that my workers are forking into many other processes, but it's pretty difficult to track. I can ssh into the node, but I'm not seeing an abnormal number of processes, and each node has a 500 GB SSD, so I shouldn't be writing excessively.
Everything below this is just information about my configurations and such
My setup is as follows:
cluster = SLURMCluster(cores=1, memory=f"{args.gbmem}GB", queue='fast_q', name=args.name,
                       env_extra=["source ~/.zshrc"])
cluster.adapt(minimum=1, maximum=200)
client = await Client(cluster, processes=False, asynchronous=True)
I suppose I'm not even sure whether processes=False should be set.
I run this starter script via sbatch with 4 GB of memory, 2 cores (-c) (even though I expect to only need 1), and 1 task (-n). This sets off all of my jobs via the SLURMCluster config from above. I dumped my Slurm submission scripts to files and they look reasonable.
Each job is not complex; it is a subprocess.call() of a compiled executable that takes 1 core and 2-4 GB of memory. I require the client call and further calls to be asynchronous because I have a lot of conditional computations. So each worker, when loaded, should consist of 1 Python process, 1 running executable, and 1 shell.
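Roughly like this (a sketch; the command line is a placeholder for the real executable):

import subprocess

def run_task(cmd):
    # cmd is the command line for the compiled executable;
    # shell=True accounts for the extra shell process per worker
    return subprocess.call(cmd, shell=True)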
Imposed by the scheduler, we have:
>> ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes 512
-n: file descriptors 1024
-l: locked-in-memory size (kbytes) 64
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 1031203
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 0
-N 15: unlimited
And each node has 64 cores, so I don't really think I'm hitting any limits.
I'm using a jobqueue.yaml file that looks like:
slurm:
  name: dask-worker
  cores: 1                   # Total number of cores per job
  memory: 2                  # Total amount of memory per job
  processes: 1               # Number of Python processes per job
  local-directory: /scratch  # Location of fast local storage like /scratch or $TMPDIR
  queue: fast_q
  walltime: '24:00:00'
  log-directory: /home/dbun/slurm_logs
I would appreciate any advice at all! Full log is below.
FORK BLOCKING IO ERROR
distributed.nanny - INFO - Start Nanny at: 'tcp://172.16.131.82:13687'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/dbun/.local/share/pyenv/versions/3.7.0/lib/python3.7/multiprocessing/forkserver.py", line 250, in main
pid = os.fork()
BlockingIOError: [Errno 11] Resource temporarily unavailable
distributed.dask_worker - INFO - End worker
Aborted!
CANT START NEW THREAD ERROR
https://pastebin.com/ibYUNcqD
BLOCKING IO ERROR
https://pastebin.com/FGfxqZEk
EDIT:
Another piece of the puzzle:
It looks like dask_worker is running multiple multiprocessing.forkserver calls? Does that sound reasonable?
https://pastebin.com/r2pTQUS4
This problem was caused by having ulimit -u set too low.
As it turns out, each worker has a few processes associated with it, and the Python ones have multiple threads. In the end you end up with approximately 14 threads per worker counting towards your ulimit -u. Mine was set to 512, and with a 64-core system I was likely hitting ~896 (64 workers × ~14 threads). It looks like the maximum I could have had was about 8 threads per process (512 / 64).
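If you want to verify the count on a node, one way (my suggestion, not from the original answer) is to read the per-process thread count, which on Linux counts against ulimit -u:

# threads (lightweight processes) for a single worker PID
ps -o nlwp= -p "$PID"
# total threads across all of your processes on the node
ps -u "$USER" -o nlwp= | awk '{s+=$1} END {print s}'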
Solution:
In my .zshrc (.bashrc) I added the line:
ulimit -u unlimited
Haven't had any problems since.

PowerShell remote invocation mysteriously hangs

I have created a series of functions that collect all the IIS configuration for a site. Run locally on a server, they execute without issue (albeit slowly), but when I run them remotely using Invoke-Command in PowerShell 2, the process runs and then mysteriously stops approximately 15-20 seconds in. It generally stalls on the same request, but not always. The same commands executed locally work without any issues. No exception is raised; it just hangs indefinitely.
I can post the code if necessary however it is several hundred lines so I'm more looking for guidance on how to investigate a problem like this or if anyone has encountered something similar.
Comparing IISConfig between [targetserver] and localhost.
Checking Installed IIS version on [targetserver]:
IIS major version : 7
IIS minor version : 5
IIS7+ detected, using WebAdmin module and IIS metabase
Name Value
---- -----
name Default Web Site
id 1
serverAutoStart True
state 1
Site Configuration:
Name Path PSPath Handlers_Ac Access_sslF Asp_AppAllo Asp_AppAllo Asp_limits_ Asp_EnableP Asp_limits_
cessFlags lags wClientDebu wDebugging bufferingLi arentPaths queueTimeou
g mit t
---- ---- ------ ----------- ----------- ----------- ----------- ----------- ----------- -----------
Default ... IIS:Site... WebAdmin... Read,Script False False 25000000 True 00:00:00
WebApp VDir: /MyApp, App Pool: MyApp
App pool Configuration:
AppPoolID Enable32Bit managedPipe managedRunt AppPoolName AppPoolAuto processMode processMode processMode recycling_l
AppOnWin64 lineMode imeVersion Start l_idleTimeo l_identityT l_UserName ogEventOnRe
ut ype cycle
--------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- -----------
False Classic v2.0 MyApp True 00:20:00 LocalSer... Time,Req...
Analyzing web directories for /MyApp, this could take a while....
Initial Collection Completed, found 141... took 0.9516122 seconds
0 C:\inetpub\wwwroot\MyApp\Core
1 C:\inetpub\wwwroot\MyApp\Core\AdminTools
2 C:\inetpub\wwwroot\MyApp\Core\AdminTools\Cache
3 C:\inetpub\wwwroot\MyApp\Core\AdminTools\Extra
4 C:\inetpub\wwwroot\MyApp\Core\AdminTools\HTTPPostTest
5 C:\inetpub\wwwroot\MyApp\Core\AdminTools\IISAdmin
6 C:\inetpub\wwwroot\MyApp\Core\AdminTools\Profiling
7 C:\inetpub\wwwroot\MyApp\Core\AdminTools\RecordTestData
8 C:\inetpub\wwwroot\MyApp\Core\AdminTools\ScrambleTest
9 C:\inetpub\wwwroot\MyApp\Core\AdminTools\Sessions
Analyzed 10 so far... took 6.7236862 seconds, remaining time 88.08028922 seconds
Current Folder: C:\inetpub\wwwroot\MyApp\Core\AdminTools\Sessions
10 C:\inetpub\wwwroot\MyApp\Core\AdminTools\SoapTest
11 C:\inetpub\wwwroot\MyApp\Core\AdminTools\StaticContent
Sometimes it makes it to 15 or so. I tried from my laptop and from one server to another and the behavior is the same.
Here is the loop which is hanging:
$start = [System.DateTime]::Now
$numAnalyzed = 0
if ($true) # skip to test
{
    # loop through all physical folders as it is much faster
    foreach ($folder in $folders)
    {
        Write-Host $numAnalyzed $folder.FullName
        # figure out the virtual path to the folder
        $iis7vwebfolderpath = $folder.FullName.Replace($iis7webapp.PhysicalPath, $iis7VDirWebApppath)
        #Get-Item $iis7vwebfolderpath | gm
        $iis7VWebDirConfigItem = Get-LNOSIIS7ConfigForPSPath -PSPath $iis7vwebfolderpath
        # add new item to list
        $iis7VWebDirConfig += $iis7VWebDirConfigItem
        # increment counter and report progress every 10 folders
        $numAnalyzed++
        if ($numAnalyzed % 10 -eq 0)
        {
            $end = [System.DateTime]::Now
            $timeSoFar = (New-TimeSpan -Start $start -End $end).TotalSeconds
            $timeRemaining = ($folders.Count - $numAnalyzed) * ($timeSoFar / $numAnalyzed)
            "Analyzed {0} so far... took {1} seconds, remaining time {2} seconds" -f $numAnalyzed, $timeSoFar, $timeRemaining | Write-Host
            "Current Folder: {0}" -f $folder.FullName | Write-Host
        }
    }
}
$end = [System.DateTime]::Now
"Processed web dirs: {0} took {1} seconds" -f $iis7VWebDirConfig.Count, (New-TimeSpan -Start $start -End $end).TotalSeconds | Write-Host
The function I'm having performance problems with is the subject of a separate question, and that post has the source code for the function:
web-administration vs WMI to query web directory properties performance problems
In my case, it seemed my PowerShell call froze because the idle timeout expired (the call runs for a very long time).
Setting the IdleTimeout value to a sufficiently long duration fixed my issue.
Query the current configuration using:
winrm get winrm/config/winrs
and set the timeout using:
winrm set winrm/config/winrs '#{IdleTimeout="18000000"}'
I think I may have discovered the problem; I started getting some odd failures in other parts of the script:
[SEVERNAME] Processing data from remote server SERVERNAME failed with the following error message: The WSMan provider host process did not return a proper response. A provider in the host process may have behaved improperly. For more information, see the about_Remote_Troubleshooting Help topic.
+ CategoryInfo : OpenError: (SERVERNAME:String) [], PSRemotingTransportException
+ FullyQualifiedErrorId : 1726,PSSessionStateBroken
and
Processing data for a remote command failed with the following error message: Not enough storage is available to complete this operation. For more information, see the about_Remote_Troubleshooting Help topic.
+ CategoryInfo : OperationStopped (System.Manageme...pressionSyncJob:PSInvokeExpressionSyncJob) [], PSRemotingTransportException
+ FullyQualifiedErrorId : JobFailure
This led me to the following site: http://www.gsx.com/blog/bid/83018/Troubleshooting-unknown-PowerShell-error-messages
The following recommendations seem to have cleared up most of the problems, although I still have some testing to do.
Excerpt from site below:
As the first error message specifies, an overflow of memory in the remote session has occurred. Open a PowerShell prompt on the remote server and display the configuration of winrs using:
winrm get winrm/config/winrs
Check the "MaxMemoryPerShellMB" value. It is set by default to 150 MB on Windows Server 2008 R2 and Windows 7. This is something that Microsoft changed in Windows Server 2012 and Windows 8 to 1024 MB.
In order to resolve this issue, you need to increase the value to at least 512 MB with the following command:
winrm set winrm/config/winrs `#`{MaxMemoryPerShellMB=`"512`"`}
As an FYI, if Invoke-Command always hangs:
Try a simple command against the system:
Invoke-Command -ComputerName XXXXX -ScriptBlock { Get-ItemProperty -Path HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion }
Start the Windows Remote Management Service (on that system)
Check for the listening port:
netstat -aon | findstr "5985"
TCP 0.0.0.0:5985 0.0.0.0:0 LISTENING 4
TCP [::]:5985 [::]:0 LISTENING 4
