Google Compute Engine VM stops after some time - docker

We're encountering a very strange problem at my company. For about a month we had been publishing images with Docker to Container Registry and then deploying them to Compute Engine (which creates a VM instance), and it worked fine.
For the past two weeks, when we deploy an image from Container Registry to Compute Engine, the VM starts and runs fine for a while, but after some hours it stops for good. We are on a paid plan, so I guess this is not a billing issue.
Has anyone encountered this problem before? Is it a firewall issue? All the logs from the VM seem fine.
Here are the last logs of the VM:
[ 7214.995439] google_accounts_daemon[465]: Adding user toto to group google-sudoers
[ 7449.139530] google_accounts_daemon[465]: Removing user toto from group google-sudoers
[226991.170566] EXT4-fs (sda5): mounting ext2 file system using the ext4 subsystem
[226991.206891] EXT4-fs (sda5): mounted filesystem without journal. Opts:
[227024.855486] LoadPin: kernel-module pinning-excluded obj="/lib/modules/5.4.49+/kernel/fs/fat/fat.ko" pid=40981 cmdline="/sbin/modprobe -q -- fs-vfat"
[227024.880466] LoadPin: kernel-module pinning-excluded obj="/lib/modules/5.4.49+/kernel/fs/fat/vfat.ko" pid=40981 cmdline="/sbin/modprobe -q -- fs-vfat"
[227024.899845] LoadPin: kernel-module pinning-excluded obj="/lib/modules/5.4.49+/kernel/fs/nls/nls_cp437.ko" pid=40988 cmdline="/sbin/modprobe -q -- nls_cp437"
[227024.917675] LoadPin: kernel-module pinning-excluded obj="/lib/modules/5.4.49+/kernel/fs/nls/nls_iso8859-1.ko" pid=40990 cmdline="/sbin/modprobe -q -- nls_iso8859-1"

According to the GCP docs, you can set the option to automatically restart the VM to true, so that the instance is restarted after a failure.
To enable it:
$ gcloud compute instances set-scheduling [instance-name] --zone [instance-zone] --restart-on-failure
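To confirm the setting took effect, and to look for clues about why the VM stopped, two commands that may help ([instance-name] and [instance-zone] are placeholders as above): the first prints whether automatic restart is enabled, the second dumps the serial console output, which often shows why the guest shut down.
$ gcloud compute instances describe [instance-name] --zone [instance-zone] --format="value(scheduling.automaticRestart)"
$ gcloud compute instances get-serial-port-output [instance-name] --zone [instance-zone]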

measure and possibly speed up VS Code extension installation in Docker container

I'm using VS Code with the DevContainer extension to run inside a Docker container.
It works great, but every time either VS Code is updated or the Dockerfile changes and I have to rebuild the container, it takes a few minutes to install the extensions I need inside the container.
[218513 ms] Start: Run in container: cd /root/.vscode-server/bin/e5a624b788d92b8d34d1392e4c4d9789406efe8f; export VSCODE_AGENT_FOLDER=/root/.vscode-server; /root/.vscode-server/bin/e5a624b788d92b8d34d1392e4c4d9789406efe8f/server.sh --disable-telemetry --extensions-download-dir /root/.vscode-server/extensionsCache --install-extension ms-python.python --install-extension ms-python.vscode-pylance --force
[537378 ms] Installing extensions...
Installing extension 'ms-python.python' v2020.12.424452561...
Installing extension 'ms-python.vscode-pylance' v2020.12.2...
Extension 'ms-python.vscode-pylance' v2020.12.2 was successfully installed.
Extension 'ms-python.python' v2020.12.424452561 was successfully installed.
[537379 ms]
[537379 ms] Start: Run in container: ls /root/.vscode-server/extensionsCache || true
[537387 ms] ms-python.python-2020.12.424452561
ms-python.vscode-pylance-2020.12.2
ms-toolsai.jupyter-2020.12.414227025
I have 2 questions about this:
Is it possible to measure what is taking the time? Is it the download or the install (or both) that takes that long?
If it is the download that takes most of the time, is there a way to cache the extensions?
There are multiple ways to speed up container initialization:
One way is to use a Docker volume mounted at $HOME/.vscode-server. In that case VS Code reuses the server and extensions already installed in the volume (sketched below).
The other way is to mount a local folder into the dev container as the $HOME folder. This may slow down the overall performance of the container, but you also keep a persistent session (e.g. bash history, Azure CLI session, etc.).
The second approach currently has some issues around extension installation (see the issue related to the .installExtensionsMarker file), so for the moment I would recommend using Docker volumes.
For more detail on how to configure the volumes, see the "Avoiding extension reinstalls on container rebuild" section of the Advanced Container Configuration document.
I would also recommend using a prebuilt image that has already been built and pushed to a container registry, so that no Python or other packages need to be installed during a container rebuild.
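A minimal devcontainer.json sketch combining both recommendations; the image name and volume name below are hypothetical placeholders, and the target path assumes the container user is root (adjust it to your container user's home):
{
  // hypothetical prebuilt image pushed to your registry
  "image": "myregistry.example.com/my-team/dev-image:latest",
  // named volume so the VS Code server and extensions survive container rebuilds
  "mounts": [
    "source=vscode-server-cache,target=/root/.vscode-server,type=volume"
  ],
  "extensions": [
    "ms-python.python",
    "ms-python.vscode-pylance"
  ]
}
On the next rebuild the volume still contains the previously installed server and extensions, so only genuinely new versions need to be downloaded.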

Docker Toolbox Error: Looks like something went wrong in step 'Setting env'

I had Docker Toolbox installed on my Windows 7 PC and I wanted to upgrade my Docker installation to the most recent version. To do that, I decided to delete Docker Toolbox from my system and reinstall it. I uninstalled Docker Toolbox, uninstalled VirtualBox, and removed all instances of both in my files (such as files in AppData). After reinstalling Docker Toolbox and launching the Quickstart Terminal, I ran into the following error:
Setting Docker configuration on the remote daemon...
Checking connection to Docker...
Docker is up and running!
To see how to connect your Docker Client to the Docker Engine running on this virtual machine, run: C:\Program Files\Docker Toolbox\docker-machine.exe env default
Looks like something went wrong in step 'Setting env'... Press any key to continue...
So it seems like it failed when "setting env". I'm not sure what that means in this context, and I wish there were a way to check some extended logs to get more detail. I tried following the Docker documentation pointing to the location of the daemon logs in AppData; however, I could not find anything relevant. Something I did find was a file called "no-error-report", though it was empty.
I tried uninstalling everything again and reinstalling with the NDIS5 network type option checked, and I've run the Quickstart Terminal as admin, but I still ran into the exact same error.
Any suggestions on how I might approach this issue?
I had the same issue and fixed it with the procedure below.
I changed the following lines in start.sh:
STEP="Setting env"
eval "$("${DOCKER_MACHINE}" env --shell=bash --no-proxy "${VM}" | sed -e "s/export/SETX/g" | sed -e "s/=/ /g")" &> /dev/null #for persistent Environment Variables, available in next sessions
eval "$("${DOCKER_MACHINE}" env --shell=bash --no-proxy "${VM}")" #for transient Environment Variables, available in current session
In my case I changed --no-proxy to --http_proxy, since I am using an HTTP proxy.
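If the problem is proxy-related, it can also help to export the standard proxy variables before launching the Quickstart Terminal, since docker-machine honours them; the proxy host and port below are hypothetical:
export HTTP_PROXY=http://proxy.example.com:8080
export HTTPS_PROXY=http://proxy.example.com:8080
export NO_PROXY=192.168.99.100,localhost,127.0.0.1
The NO_PROXY entry keeps traffic to the boot2docker VM's address (192.168.99.100 by default) off the proxy.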

How to debug an Elixir application in production?

This is not particularly about my current problem, but more a question in general. Sometimes I have a problem that only happens in the production configuration, and I'd like to debug it there. What is the best way to approach that in Elixir? Production runs without a graphical environment (Docker).
In dev I can use IEx.pry, but since Mix is unavailable in production, that does not seem to be an option.
For Erlang, https://stackoverflow.com/a/21413344/1561489 mentions dbg and redbug, but even if they can be used, I would need help applying them to Elixir code.
First, start a local node running iex on your dev machine using iex -S mix. If you don't want the application that's running locally to cause breakpoints to be activated, you need to disable the app from starting locally. To do this, you can simply comment out the application function in mix.exs or run iex -S mix run --no-start.
Next, you need to connect to the remote node running on docker from iex on your dev node using Node.connect(:"remote#hostname"). In order to do this, you have to make sure both the epmd and the node ports on the remote machine are reachable from your local node.
Finally, once your nodes are connected, from the local iex, run :debugger.start() which opens the debugger with the GUI. Now in the local iex, run :int.ni(<Module you want to debug>) and it will make the module visible to the debugger and you can go ahead and add breakpoints and start debugging.
You can find a tutorial with steps and screenshots here.
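A sketch of those steps from the local side; the remote node name, host and cookie are hypothetical and must match what the production node was started with:
iex --name debug@127.0.0.1 --cookie my_secret_cookie -S mix run --no-start
iex> Node.connect(:"myapp@10.0.0.5")
true
iex> :debugger.start()
iex> :int.ni(MyApp.SomeModule)
After :int.ni/1 the module shows up in the debugger window, where you can set breakpoints.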
If you are running your production system on AWS, then you should first and foremost leverage CloudWatch to your advantage.
In your Elixir config, set up the logger like this:
config :logger,
  handle_otp_reports: true,
  handle_sasl_reports: true,
  metadata: [:application, :module, :function, :file, :line]

config :logger,
  backends: [
    {LoggerFileBackend, :shared_error}
  ]

config :logger, :shared_error,
  path: "#{logging_dir}/verbose-error.log",
  level: :error
(LoggerFileBackend comes from the logger_file_backend Hex package, which needs to be in your dependencies.)
Inside your Dockerfile, set an environment variable for where exactly erl_crash.dump gets written, for example:
ENV ERL_CRASH_DUMP=/opt/log/erl_crash.dump
Then configure awslogs inside a .config file under .ebextensions as follows:
files:
  "/etc/awslogs/config/stdout.conf":
    mode: "000755"
    owner: root
    group: root
    content: |
      [erl_crash.dump]
      log_group_name=/aws/elasticbeanstalk/your_app/erl_crash.dump
      log_stream_name={instance_id}
      file=/var/log/erl_crash.dump
      [verbose-error.log]
      log_group_name=/aws/elasticbeanstalk/your_app/verbose-error.log
      log_stream_name={instance_id}
      file=/var/log/verbose-error.log
And ensure that you map a volume for your Docker container in Dockerrun.aws.json:
"Logging": "/var/log",
"Volumes": [
{
"HostDirectory": "/var/log",
"ContainerDirectory": "/opt/log"
}
],
After that, you can inspect your error messages under CloudWatch.
Now, if you are using Elastic Beanstalk (which my example above implies) with a Docker deployment as opposed to AWS ECS, then the container's stdout/stderr is redirected by default to /var/log/eb-docker/containers/eb-current-app/stdouterr.log, which ends up in CloudWatch.
The main purpose of shipping erl_crash.dump is to at least know when your application crashed and took the container down. AWS EB will normally restart the container, which can leave you unaware that a restart even happened. That information can also be obtained from other Docker-related logs, and you can configure alarms to listen for them so you are notified when your container had to restart. But another advantage of logging erl_crash.dump to CloudWatch is that, if need be, you can later export it to S3, download the file and load it into :observer to analyse what went wrong.
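The crash dump file itself can be opened with OTP's crashdump viewer, which is part of the observer application; a sketch, assuming you have downloaded the dump to /tmp/erl_crash.dump on a machine with a GUI:
iex> :crashdump_viewer.start('/tmp/erl_crash.dump')
This opens graphical inspection windows where you can browse the processes, ETS tables and memory state at the time of the crash.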
If, after consulting the logs, you still need more direct interaction with your production application, then you need a remote shell (remsh) into your node. If you use Distillery, you would configure the cookie and the node name of your production release like this:
Inside rel/config.exs, set the cookie:
environment :prod do
  set include_erts: false
  set include_src: false
  set cookie: :"my_cookie"
end
and under rel/templates/vm.args.eex you set variables:
-name <%= node_name %>
-setcookie <%= release.profile.cookie %>
and inside rel/config.exs, you define the release like this:
release :my_app do
  set version: "0.1.0"
  set overlays: [
    {:template, "rel/templates/vm.args.eex", "releases/<%= release_version %>/vm.args"}
  ]
  set overlay_vars: [
    node_name: "p@127.0.0.1"
  ]
end
Then you can connect directly to your production node running inside Docker by first SSH-ing into the EC2 instance that hosts the container and running the following:
CONTAINER_ID=$(sudo docker ps --format '{{.ID}}')
sudo docker exec -it $CONTAINER_ID bash -c "iex --name q@127.0.0.1 --cookie my_cookie"
Once inside, you can poke around or, if need be and at your own peril, dynamically inject modified code for the module you would like to inspect. An easy way to do that is to create a file inside the container and invoke Node.spawn_link(target_node, fn -> Code.eval_file(file_name, path) end).
If your production node is already running and you do not know the cookie, you can go inside the running container, run ps aux > t.log and cat t.log, and figure out from the beam command line which cookie was applied, then use it accordingly.
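If the node name and cookie line up, you can also attach a remote shell to the already running node instead of starting a separate IEx and connecting by hand (a sketch, reusing the hypothetical names from the release config above):
CONTAINER_ID=$(sudo docker ps --format '{{.ID}}')
sudo docker exec -it $CONTAINER_ID iex --name q@127.0.0.1 --cookie my_cookie --remsh p@127.0.0.1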
Docker gets in the way of how epmd communicates with other nodes, so the best option may be to build your own AWS AMI with Packer and deploy directly onto the instances instead.
Amazon has recently released a new feature for AWS ECS, the awsvpc networking mode, which may make inter-container epmd communication, and thus connecting to your node directly, easier. I have not tried it out yet, so I may be wrong.
In the case that you are running on a provider other than AWS, then figuring out how to get easy access to your remote logs with some SSM agent or some other service is a must.
I would also recommend using some sort of exception-tracking tool; so far I have had a great experience with Sentry.

Unable to start any container when Volumes are enabled Docker Toolbox

I am running Docker Toolbox v1.13.1a on Windows 7 Pro Service Pack 1 x64, with VirtualBox 5.1.14 r112924.
When I try to run any Docker image (e.g. the official postgres image from Docker Hub) with volumes disabled, it works fine!
But when I enable the volumes, it fails.
I tried everything in the official documentation.
The VM has the shared folder set up as required and has full access to it (shared folder screenshot).
In my postgres example it crashes with the following log:
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... LOG: could not link file "pg_xlog/xlogtemp.27" to "pg_xlog/000000010000000000000001": Operation not permitted
FATAL: could not open file "pg_xlog/000000010000000000000001": No such file or directory
child process exited with exit code 1
initdb: removing contents of data directory "/var/lib/postgresql/data"
I know it's a problem with folder permissions, but I'm kind of stuck!
A ton of thanks in advance.
I've been busy with this problem all day and my conclusion is that it's currently simply not possible to run postgresql inside a Docker container while keeping your data persistent in a separate volume.
I even tried running the container without linking to a volume, copying the data that was originally in /var/lib/postgresql into a folder on my host OS (Windows 10 Home), then copying that into the folder that then got linked to the container itself.
Alas, I got the following error:
FATAL: data directory "/var/lib/postgresql/data/pgadmin" has wrong ownership
HINT: The server must be started by the user that owns the data directory.
In conclusion: something goes wrong with the ownership of the data directory and which user owns it, and to be able to fix it you'd need a Unix command line on Windows that can talk to Docker (something currently not possible with Bash on Ubuntu on Windows, which runs Ubuntu 16.04 binaries).
Maybe in the future you'll be able to run the needed commands (found here, under "Arbitrary --user Notes"), but these are *nix commands and PowerShell (started by Kitematic) can't run them. Bash on Ubuntu on Windows could, but that shell has no connection to the Docker daemon/service on Windows...
TL;DR: Lost a day of work: It is currently impossible on Windows.
I have been trying to fix this issue as well.
At first I thought it was a symlink problem (because the first error fails with "could not link ... Operation not permitted").
To be sure symlinks are permitted you have to:
- share a folder in VirtualBox
- run VirtualBox as administrator (if your account is in the administrators group): right-click virtualbox.exe and select "Run as administrator"
- if your account is not an administrator, grant the symlink privilege via secpol.msc > Local Policies > User Rights Assignment and add your user to "Create symbolic links"
- enable symlinks for your shared folder in VirtualBox:
VBoxManage setextradata VM_NAME VBoxInternal2/SharedFoldersEnableSymlinksCreate/SHARED_FOLDER_NAME 1
Alternatively, you can use the C:\Users\username folder, which is shared and symlink-enabled by default by the Docker Toolbox installation.
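For example, with the Docker Toolbox defaults, machine name "default" and shared folder "c/Users" (both assumptions, adjust to your setup), the sequence would be:
docker-machine stop default
VBoxManage setextradata default VBoxInternal2/SharedFoldersEnableSymlinksCreate/c/Users 1
docker-machine start default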
Now I can create symlinks in the shared folder from the Docker container, but I still get the same error: "could not link ... Operation not permitted".
So the reason must be somewhere else, in the file permissions as you said, but I do not see why.

OrientDB " Error on moving existent database"

I'm trying to set up an OrientDB distributed configuration with Docker, but I'm getting an error when starting the second node:
2015-10-09 17:14:14:066 WARNI [node1444321499719]->[[node1444321392311]] requesting deploy of database 'testDB' on local server... [OHazelcastPlugin]
2015-10-09 17:14:14:117 INFO [node1444321499719]<-[node1444321392311] received updated status node1444321499719.testDB=SYNCHRONIZING [OHazelcastPlugin]
2015-10-09 17:14:14:119 INFO [node1444321499719]<-[node1444321392311] received updated status node1444321392311.testDB=SYNCHRONIZING [OHazelcastPlugin]
2015-10-09 17:14:15:935 WARNI [node1444321499719] moving existent database 'testDB' located in '/orientdb/databases/testDB' to '/orientdb/databases/../backup/databases/testDB' and get a fresh copy from a remote node... [OHazelcastPlugin]
2015-10-09 17:14:15:936 SEVER [node1444321499719] error on moving existent database 'testDB' located in '/orientdb/databases/testDB' to '/orientdb/databases/../backup/databases/testDB'. Try to move the database directory manually and retry [OHazelcastPlugin]
[node1444321499719] Error on starting distributed plugin
com.orientechnologies.orient.server.distributed.ODistributedException: Error on moving existent database 'testDB' located in '/orientdb/databases/testDB' to '/orientdb/databases/../backup/databases/testDB'. Try to move the database directory manually and retry
at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.backupCurrentDatabase(OHazelcastPlugin.java:1007)
at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.requestDatabase(OHazelcastPlugin.java:954)
at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.installDatabase(OHazelcastPlugin.java:893)
at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.installNewDatabases(OHazelcastPlugin.java:1426)
at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.startup(OHazelcastPlugin.java:184)
at com.orientechnologies.orient.server.OServer.registerPlugins(OServer.java:979)
at com.orientechnologies.orient.server.OServer.activate(OServer.java:346)
at com.orientechnologies.orient.server.OServerMain.main(OServerMain.java:41)
I don't get this error if I start the OrientDB cluster without Docker.
Also, I can move the directory manually inside the container:
[root@64f6cc1eba61 orientdb]# mv -v /orientdb/databases/testDB /orientdb/databases/../backup/databases/testDB
'/orientdb/databases/testDB' -> '/orientdb/databases/../backup/databases/testDB'
'/orientdb/databases/testDB/distributed-config.json' -> '/orientdb/databases/../backup/databases/testDB/distributed-config.json'
removed '/orientdb/databases/testDB/distributed-config.json'
removed directory: '/orientdb/databases/testDB'
[root@64f6cc1eba61 orientdb]# ls -l /orientdb/databases/../backup/databases/testDB
total 4
-rw-r--r--. 1 root root 455 Oct 9 11:32 distributed-config.json
[root@64f6cc1eba61 orientdb]#
I'm using OrientDB version 2.1.3
This was reported and fixed:
https://github.com/orientechnologies/orientdb/issues/4891
Set the 'distributed.backupDirectory' variable to a specific directory and the issue should be gone.
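One way to set it when running in Docker is to pass it as a JVM system property; a sketch, where the image name is a placeholder and the assumption that ORIENTDB_SETTINGS is forwarded to server.sh should be verified against your image's entrypoint (otherwise add the -D flag wherever your JVM options are defined):
docker run -d -e ORIENTDB_SETTINGS="-Ddistributed.backupDirectory=/orientdb/backup/databases" my-orientdb-image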
By the way, in our experience running OrientDB distributed in Docker is currently a no-go:
- Docker does not support multicast yet; you can work around it, but it's painful. But the main problem:
- Docker doesn't reuse IP addresses on restart, so a restarted container gets a new IP address, and this messes up your cluster big time.
We abandoned running OrientDB distributed with Docker until Docker addresses both issues (I believe both are on the roadmap).
If you experience otherwise, I'm happy to hear your thoughts.
