Oracle Cloud: attach boot volume - devops

There was a need to restore SSH access to an instance (Ubuntu 22), so, following a guide provided by Oracle, the boot volume was detached and attached as block storage to another instance. Once a new SSH key was added to the authorized_keys file, the disk was unmounted and detached from the temporary instance. Now I'm trying to re-attach the boot volume to the original instance: the operation starts successfully, but then the "Attaching" status changes to "Detaching" and in the end the volume is still detached. I'm new to Oracle Cloud and have no idea how to debug such infrastructure operations (where to find an event log).
Screenshots show the status sequence: Attach volume -> Attaching -> Detaching -> Detached.
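For debugging, a rough OCI CLI sketch (assuming the CLI is configured; the compartment/work-request OCIDs and the time window are placeholders) to look at the work requests and audit events that record such operations:
# Work requests track the progress and failures of operations like a volume attach
oci work-requests work-request list --compartment-id <compartment-ocid> --output table
oci work-requests work-request-error list --work-request-id <work-request-ocid>
# The Audit service keeps an event log of API calls made in the compartment
oci audit event list --compartment-id <compartment-ocid> --start-time 2024-01-01T00:00:00Z --end-time 2024-01-02T00:00:00Z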

Related

Restoring Jenkins from a volume snapshot of the JENKINS_HOME directory requires that I reset every user's password before they can log in. How can I avoid this?

Restoring Jenkins with a snapshot of the JENKINS_HOME directory requires that I reset every user's password before they can log in. Why is this happening and how can I avoid it?
Is the password hashing using some information that is not stored in the JENKINS_HOME directory?
According to the following resources, restoring from a volume snapshot seems possible.
https://www.jenkins.io/doc/book/system-administration/backing-up/
https://devopscube.com/jenkins-backup-data-configurations/
More Details:
When I restore, I spin up an entirely new EC2 instance from the same custom AMI that the backed-up server was running.
I have Jenkins running on Ubuntu 22.04, on an AWS EC2 instance, with the latest LTS version of Jenkins (2.361.2).
Volume configuration (a sketch of how this is wired up follows below):
Root volume, 8 GB: from a custom AMI that is basically just Ubuntu 22.04 with Jenkins installed.
Data volume, 80 GB, mounted at /data: restored from a snapshot of a previous, identical Jenkins EC2 instance.
JENKINS_HOME: the root volume has a symlink at /var/lib/jenkins that points to /data/var/lib/jenkins.
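For reference, a hypothetical sketch of how this layout could be wired up on first boot (the device name and paths are assumptions; the real AMI/user-data may differ):
# Prepare the data volume (format only on the very first boot, never on a restored snapshot)
mkfs -t ext4 /dev/xvdf
mkdir -p /data
mount /dev/xvdf /data
# Point JENKINS_HOME at the data volume via the symlink described above
mkdir -p /data/var/lib/jenkins
rm -rf /var/lib/jenkins
ln -s /data/var/lib/jenkins /var/lib/jenkins
chown -R jenkins:jenkins /data/var/lib/jenkins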
Jenkins seems to operate exactly as expected when doing a fresh install using this setup. When I recreate this server using a snapshot of the original data volume, all of the data is in the JENKINS_HOME directory and everything works as expected, with the exception that all user login attempts fail.
If I go through the password recovery steps in this post,
https://www.shellhacks.com/jenkins-reset-admin-password/
I am able to reset all user passwords and at that point it seems like everything is working as expected.
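For context, one common variant of that kind of recovery (it may differ in detail from the linked post) is to temporarily disable Jenkins security, reset the passwords, and then re-enable it:
# Hedged sketch: paths assume a default package install with JENKINS_HOME at /var/lib/jenkins
sudo systemctl stop jenkins
sudo sed -i 's#<useSecurity>true</useSecurity>#<useSecurity>false</useSecurity>#' /var/lib/jenkins/config.xml
sudo systemctl start jenkins
# ...reset the user passwords in the UI, then restore <useSecurity>true</useSecurity> and restart again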
In general terms, I suppose the question is: how do I move the JENKINS_HOME directory from one instance of Jenkins to another, in an automated fashion, without losing the login credentials?

KStreams application - state.dir - No .checkpoint file

I have a KStreams application running inside a Docker container which uses a persistent key-value store. My runtime environment is Docker 1.13.1 on RHEL 7.
I have configured state.dir with a value of /tmp/kafka-streams (which is the default).
When I start this container using "docker run", I mount /tmp/kafka-streams to a directory on my host machine which is, say for example, /mnt/storage/kafka-streams.
My application.id is "myapp". I have 288 partitions in my input topic, which means my state store / changelog topic will also have that many partitions. Accordingly, when I start my Docker container, I see folders named after the partitions, such as 0_1, 0_2, ..., 0_288, under /mnt/storage/kafka-streams/myapp/.
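For context, the container is launched roughly like this (a sketch; the image name and the other flags are placeholders, only the bind mount reflects the setup described above):
docker run -d \
  --name myapp \
  -v /mnt/storage/kafka-streams:/tmp/kafka-streams \
  myapp-image:latest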
When I shut down my application, I do not see any .checkpoint file in any of the partition directories.
And when I restart my application, it starts fetching the records from the changelog topic rather than reading from local disk. I suspect this is because there is no .checkpoint file in any of the partition directories. (Note: I can see the .lock file and the rocksdb sub-directory inside the partition directories.)
This is what I see in the startup log. It seems to be bootstrapping the entire state store from the changelog topic, i.e. performing network I/O rather than reading from what is on disk:
2022-05-31T12:08:02.791 [mtx-caf-f6900c0a-50ca-43a0-8a4b-95eaad9e5093-StreamThread-122] WARN o.a.k.s.p.i.ProcessorStateManager - MSG=stream-thread [myapp-f6900c0a-50ca-43a0-8a4b-95eaad9e5093-StreamThread-122] task [0_170] State store MyAppRecordStore did not find checkpoint offsets while stores are not empty, since under EOS it has the risk of getting uncommitted data in stores we have to treat it as a task corruption error and wipe out the local state of task 0_170 before re-bootstrapping
2022-05-31T12:08:02.791 [myapp-f6900c0a-50ca-43a0-8a4b-95eaad9e5093-StreamThread-122] WARN o.a.k.s.p.internals.StreamThread - MSG=stream-thread [mtx-caf-f6900c0a-50ca-43a0-8a4b-95eaad9e5093-StreamThread-122] Detected the states of tasks [0_170] are corrupted. Will close the task as dirty and re-create and bootstrap from scratch.
org.apache.kafka.streams.errors.TaskCorruptedException: Tasks [0_170] are corrupted and hence needs to be re-initialized
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.initializeStoreOffsetsFromCheckpoint(ProcessorStateManager.java:254)
at org.apache.kafka.streams.processor.internals.StateManagerUtil.registerStateStores(StateManagerUtil.java:109)
at org.apache.kafka.streams.processor.internals.StreamTask.initializeIfNeeded(StreamTask.java:216)
at org.apache.kafka.streams.processor.internals.TaskManager.tryToCompleteRestoration(TaskManager.java:433)
at org.apache.kafka.streams.processor.internals.StreamThread.initializeAndRestorePhase(StreamThread.java:849)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:731)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:583)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:556)
Should I expect to see a .checkpoint file in each of the partition directories under /mnt/storage/kafka-streams/myapp/ when I shut down my application?
Is this an issue because I am running my KStreams app inside a Docker container? If there were permission issues, I would have expected to see problems creating the other files, such as .lock or the rocksdb folder (and its contents).
If I run this application as a standalone/runnable Spring Boot JAR on my Windows laptop, i.e. not in a Docker container, I can see that it creates the .checkpoint file as expected.
My Java application inside the Docker container is run via an entrypoint script. It seems that when I stop the container, the TERM signal is not passed on to my Java process, and hence the KStreams application does not get a clean shutdown.
So all I needed to do was find a way to send a TERM signal to my Java application inside the container.
For the moment, I just ssh'ed into the container and did a kill -s TERM <pid> for my Java process.
Once I did that, it resulted in a clean shutdown and thus created the .checkpoint file as well.
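A more permanent fix (a sketch, assuming a shell entrypoint script; the JAR path and options are placeholders) is to exec the JVM from the entrypoint so it runs as PID 1 and receives the TERM signal that docker stop sends:
#!/bin/sh
# exec replaces the shell with the java process, so SIGTERM from `docker stop`
# reaches the KStreams application and it can shut down cleanly, writing its .checkpoint files
exec java -jar /app/myapp.jar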

How to stop and restart a Compute Engine VM that runs a Docker container

I'm running a Docker container on Compute Engine, using the Container Image VM property.
However, if I stop and restart the VM, my app works but the logs aren't collected any more.
When I run docker ps I only see my own Docker image. However, for a new VM that hasn't been stopped I also see a container image called gcr.io/stackdriver-agents/stackdriver-logging-agent.
Are there any specific steps I need to take to restore the VM as it was before it was stopped? How can I make logging work again, and are there other differences I should be aware of?
I understand you are running a Docker container on Compute Engine and that, when you stop and restart the VM, the logs are no longer collected. You also want to know how to restore the VM to its previous state and what the stackdriver-logging-agent is.
As described in this article [1], you can use GCE snapshots to create backups of persistent disks attached to the instance, including boot volumes. This is useful for backing up your data, recreating a disk that might have been lost, or copying a persistent disk. That said, this is currently the only way to recover a deleted disk.
Therefore, if no snapshots were taken of the VM's disk(s) beforehand, a deleted disk volume unfortunately cannot be recovered; the process is irreversible [2].
In the future, you can set the disk's 'auto-delete' property [3] to no when creating an instance; this way the disk will remain even if the instance is deleted.
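For example, a minimal sketch using the flags documented in [3] (the instance and disk names are placeholders):
# Keep the boot disk when the instance is deleted
gcloud compute instances create my-vm --no-boot-disk-auto-delete
# Keep an additional attached disk as well
gcloud compute instances create my-vm --disk=name=my-data-disk,auto-delete=no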
As for the logging agent image, it is a container image that streams logs from your VM instances and from selected third-party software packages to Stackdriver Logging. It is a best practice to run the Logging agent on all your VM instances, and this also explains why the logs aren't appearing anymore: normally they are simply recorded by the logging agent and sent to Stackdriver Logging, so if the agent container is no longer running after the restart, nothing collects them.
For the logs not being collected, you can try the following to reset the agent state. Please do the following on your affected Windows instance:
1. Stop the "StackdriverLogging" service. You can do this from the command line with "net stop StackdriverLogging".
2. Navigate to the following directory: "C:\Program Files (x86)\Stackdriver\LoggingAgent\Main\pos\winevtlog.pos\worker0".
3. Remove the file "storage.json" located in that directory.
4. Restart the "StackdriverLogging" service: execute "net start StackdriverLogging" from the command line.
This should reset the logging agent state and make logging functional again; the commands are consolidated below.
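For convenience, the same steps as commands from an elevated Windows command prompt (service name and path exactly as above):
net stop StackdriverLogging
del "C:\Program Files (x86)\Stackdriver\LoggingAgent\Main\pos\winevtlog.pos\worker0\storage.json"
net start StackdriverLogging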
[1] https://cloud.google.com/compute/docs/disks/create-snapshots
[2] https://cloud.google.com/compute/docs/disks/#pdspecs
[3] https://cloud.google.com/sdk/gcloud/reference/compute/instances/create#--disk

How to prevent root of the host OS from accessing content inside a container?

We are new to container technology and are currently evaluating whether it can be used in our new project. One of our key requirements is data security for multiple tenants, i.e. each container contains data that is owned by a particular tenant only. Even we, the server admins of the host servers, should NOT be able to access the content inside a container.
By some googling, we know that root on the host OS can execute commands inside a container, for example with the "docker exec" command. I suppose the command is executed with root privileges?
How to get into a docker container?
We wonder if such access (not just "docker exec", but any method a server admin of the host servers could use to reach a container's content) can be blocked/disabled by some security configuration?
For the bash command specifically: add an exit command at the end of the .bashrc file, so the user logs in and immediately gets kicked back out.
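A minimal sketch of that idea, assuming the container's interactive shell is bash and runs as root (note this only frustrates interactive shells; it does not block non-interactive access such as docker exec <container> cat /some/file):
# Run during the image build or inside the container: append `exit` to root's
# .bashrc so any interactive bash session ends right after it starts
echo 'exit' >> /root/.bashrc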
You can go through this link for a better understanding of why this is not prevented by default:
https://github.com/moby/moby/issues/8664

What happens to docker service configs or secrets if a swarm is completely stopped?

I'm aware that service configs and secrets are stored in the Raft log and that this log is replicated to the other swarm managers, but what if the entire swarm is stopped? Is the Raft log persistent, or should you always keep local copies?
I eventually found out that if you back up the swarm, you should be able to recover as detailed in the documentation:
Back up the swarm
Docker manager nodes store the swarm state and manager logs in the /var/lib/docker/swarm/ directory. In 1.13 and higher, this data includes the keys used to encrypt the Raft logs. Without these keys, you will not be able to restore the swarm.
You can back up the swarm using any manager. Use the following procedure.
If the swarm has auto-lock enabled, you will need the unlock key in order to restore the swarm from backup. Retrieve the unlock key if necessary and store it in a safe location. If you are unsure, read Lock your swarm to protect its encryption key.
Stop Docker on the manager before backing up the data, so that no data is being changed during the backup. It is possible to take a backup while the manager is running (a “hot” backup), but this is not recommended and your results will be less predictable when restoring. While the manager is down, other nodes will continue generating swarm data that will not be part of this backup.
Note: Be sure to maintain the quorum of swarm managers. During the time that a manager is shut down, your swarm is more vulnerable to losing the quorum if further nodes are lost. The number of managers you run is a trade-off. If you regularly take down managers to do backups, consider running a 5-manager swarm, so that you can lose an additional manager while the backup is running, without disrupting your services.
Back up the entire /var/lib/docker/swarm directory.
Restart the manager.
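In practice, that procedure can be as small as the following sketch (assuming systemd manages Docker and /backup exists; adjust paths as needed):
systemctl stop docker                    # ensure no swarm data changes during the backup
tar czf /backup/swarm-backup.tgz -C /var/lib/docker swarm    # archive the entire /var/lib/docker/swarm directory
systemctl start docker                   # bring the manager back up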
