Do flock locks reset after a system restart?

Let's say the system powers down unexpectedly due to a power outage.
Are flock locks always considered to be "unlocked" when the system starts up?
On Linux, flock relies on fcntl(...) (file descriptors).
Asked another way: Is it unnecessary to manually call flock -u <lock_filename> when the system first starts up (e.g. from a cron @reboot job)?
Update:
The BSD flock man page says:
Locks are on files, not file descriptors. That is, file descriptors
duplicated through dup(2) or fork(2) do not result in multiple instances
of a lock, but rather multiple references to a single lock.

A Linux-savvy friend of mine mentions that the kernel keeps a table of file locks in memory, and that this table disappears on a reboot.
He also says the file lock only exists for as long as the process holding it is running.

According to the Linux man page:
Locks created by flock() are associated with an open file table entry.
This is a data structure in the kernel's memory, not in the file system, which may live on persistent disk storage.
Open files are closed when the process exits, thus flock locks are valid only as long as the process holding them is running.
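A minimal Python sketch of this behaviour (the /tmp/example.lock path is made up): the lock is taken with the same primitive the flock(1) utility uses, lives only in the kernel's in-memory lock table, and evaporates the moment the holder exits, crashes, or the machine loses power, so no flock -u is needed at boot.

import fcntl, os, time

LOCK_PATH = "/tmp/example.lock"   # hypothetical lock file

fd = os.open(LOCK_PATH, os.O_CREAT | os.O_RDWR, 0o644)
# Same primitive the flock(1) utility uses; the lock is recorded in the
# kernel's in-memory lock table, not in the file's contents on disk.
fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
print("holding lock as pid", os.getpid())
time.sleep(30)
# No explicit LOCK_UN is needed: process exit (or a crash, or a power loss)
# releases the lock, because the kernel table it lived in is gone.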

Trying to understand and debug ETCDv2 Memory usage

I’m trying to understand ETCD’s memory and disk usage within a deployed system using the ETCDv2 API. The system has a file being saved on a regular basis, each time under a new key, and we’re concerned that long-term there’s no clean-up of state leading to both memory and disk usage growing unbounded on each VM in the etcd cluster. We’ve also emulated this, using a large file (several MB) being saved every few minutes.
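For reference, the write pattern we emulate looks roughly like the sketch below; the endpoint, key prefix, and payload are made up, and it simply PUTs a form-encoded value to the etcd v2 keys API under a fresh key each time:

import time
import urllib.parse
import urllib.request

ETCD = "http://127.0.0.1:2379"          # hypothetical etcd v2 endpoint

def save_under_new_key(payload: str) -> None:
    # Each save creates a brand-new key, so old values are never overwritten.
    key = f"/v2/keys/backups/file-{int(time.time())}"
    data = urllib.parse.urlencode({"value": payload}).encode()
    req = urllib.request.Request(ETCD + key, data=data, method="PUT")
    with urllib.request.urlopen(req) as resp:
        print(resp.status, key)

save_under_new_key("several MB of file contents in the real deployment")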
From the etcd docs, I expected the following:
Each insertion would save the file to disk, causing disk usage to grow unbounded.
This matches what I am seeing.
In memory, etcd would save a key-value pair where the value is a lookup address for the file on disk (taking up a very small amount of memory) and a cached version of the file (taking a large amount of memory).
I would then expect that rebooting an etcd pod after several file writes would cause the cache to be (mostly) cleared. In other words, a consistently up pod would have memory growing unbounded, but if the pod rebooted, the cache would be cleared of all but the active entry (and any specifically requested by, e.g., attempted rollbacks), so memory usage would (mostly) reset with each reboot.
However, in practice we see only a very small memory drop on a reboot, which is almost immediately regained once the pod recovers (as though all the cache is restored from the peers).
Is my understanding correct? And if so:
Why doesn't the memory usage fully reset after an etcd pod reboot? Does the etcd cache get synced from the cluster, as well as the main key-value table and file storage?
Is there a recommended way to keep etcd’s memory and disk usage within bounded limits?
Additional notes:
I’ve tried reducing the snapshot_count configuration setting - this doesn’t seem to have had any impact (unless I’ve reduced it too far - I cut it right down to 5 from the default of 100,000).
I’ve attempted changing our file saving to overwrite a single file with a new version each time, instead of storing a new file. This doesn’t appear to have had any impact (although this may be due to issues in my prototype; I’m still investigating).
We can’t migrate existing deployments to the etcd v3 data format, so we are specifically looking at etcd v2 solutions. I think this rules out the compact and defrag steps, which seem to be a core part of the answer to this problem in v3.
Any help or insight very gratefully appreciated.
Thanks!

pgpool node detached for no reason?

I have a two-node PostgreSQL cluster running on VMs where each VM runs both the pgpool service and a Postgres server.
Due to an insufficient memory configuration the Postgres server crashed, so I bumped the VM memory and changed the Postgres memory config in the postgresql.conf file. Since those memory changes, the slave pgpool node detaches every night at a specific time, even though node_exporter metrics for CPU, load, processes, disk usage, and memory show no spikes or sudden changes.
The slave node had detached before, but not day after day. I stumbled upon this thread and read this part of the documentation about failover. Since the Postgres server didn't crash and existing connections to the slave node kept working (it kept serving existing connections but didn't accept new ones), network issues seemed irrelevant, especially after consulting our OPS team on whether they noticed any abnormal network or DNS activity that could explain it. Unfortunately, they didn't come up with any interesting findings.
I have pg_exporter, postgres_exporter, and node_exporter on each node to monitor the server and VM behavior. What should I be looking for to debug this? What should I ask our OPS team to check specifically? Our pgpool log file only reports the failure to access the other node, with no exact reason, as the aforementioned docs say:
Pgpool-II does not distinguish each case and just decides that the
particular PostgreSQL node is not available if health check fails.
Could it still be a network/DNS issue? And if so, how would I confirm this?
Thanks for reading and taking the time to assist me with this conundrum.
That was interesting. To sum up the gist of it: it was part of the OPS team's infrastructure backups.
The entire investigation went like this.
Setting the scene:
We run on-prem on top of a VMware vCenter cluster, backed up on the infra side with VMware VM snapshots and Veeam VM backups, where the vmdk files/ESXi datastores reside on NetApp storage served over NFS.
When checking node_exporter metrics in the Node Exporter Full dashboard, I saw network traffic drop to on the order of 2 packets per second for about 5 to 15 minutes, consistently over the last few months, with the episodes growing dramatically longer in the last month (around the same late-night time).
(Rough illustration of the recurring traffic drop omitted.)
After checking again with our OPS team, they indicated it could be the host configuration/Veeam backups.
It turns out that because the storage for the VMs (including the one that runs the Veeam backup) is attached via the network rather than directly to the ESXi hosts, the final snapshot is saved/consolidated at that late-night time, which is exactly when the node detaches every night.
The way NFS handles disk locking (limiting IOPS to existing data), along with the high IOPS requirements of the Veeam backup, causes the server to hang/freeze and, on rare occasions, even restarts the VM. Here are the relevant quotes from the Veeam issue docs:
The snapshot removal process significantly lowers the total IOPS that can be delivered by the VM because of additional locks on the VMFS storage due to the increase in metadata updates
the snapshot removal process will easily push that into the 80%+ mark and likely much higher. Most storage arrays will see a significant latency penalty once IOP's get into the 80%+ mark which will of course be detrimental to application performance.
This issue occurs when the target virtual machine and the backup appliance [proxy] reside on two different hosts, and the NFSv3 protocol is used to mount NFS datastores. A limitation in the NFSv3 locking method causes a lock timeout, which pauses the virtual machine being backed up [during snapshot removal].
Obviously, that would interfere at the very least with Postgres functionality, especially when it is configured as a cluster with replication, which requires a near-constant connection between the Postgres servers. In a similar thread on SO about a different DB server, a solution is suggested, including addressing the issue described in the last quote in this link. For the time being, we removed Veeam backup for sensitive VMs until the solution can be verified locally (I will update in the future if we try it and it fixes the issue).
Additional incident documentation: a similar issue case, suggested-solution info from Veeam, a third-party site's solution (a temporary fix for the same problem, as I see it), and a Reddit thread acknowledging the issue and suggesting options.

Does an Operating System check every Instruction?

Not sure if anyone here can answer this.
I've learned that an operating system checks whether an instruction of a program changes something outside of its allocated memory, and if it does, the OS won't allow the program to do it.
But if the OS has to check this for every instruction, won't that take up at least 5/6 of the CPU? I tried to estimate this, and that is roughly how many clock cycles I came up with for checking every instruction.
If I've understood something wrong, please correct me, because I can't imagine that an OS takes up that much of the CPU.
There are several safeguards in place to ensure a non-privileged process behaves. I will discuss two of them in the context of the x86_64 architecture, but these concepts (mostly) extend to other major platforms.
Privilege Levels
There is a field in a particular CPU register that indicates the current privilege level. These privileges are often called rings, where ring 0 corresponds to the kernel (i.e. highest privilege) and ring 3 corresponds to a userspace process (i.e. lowest privilege). There are other rings, but they're not relevant to this introduction.
Certain instructions in x86_64 may only be executed by privileged processes. The current ring must be 0 to execute a privileged instruction. If you try to execute such an instruction without the correct privileges, the processor raises a general protection fault. The kernel synchronously processes this interrupt, and will almost certainly kill the userspace process.
The ring level can only be changed while in ring 0, so the userspace process can't simply change from ring 3 to ring 0 by itself.
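As a concrete (and deliberately crashing) Python sketch of this on Linux/x86_64, assuming the kernel permits a writable and executable anonymous mapping: the process places the single-byte privileged hlt opcode in memory and jumps to it, and it is the CPU, not the OS, that rejects the instruction.

import ctypes, mmap

# Map one writable+executable page and place a single "hlt" opcode (0xF4) in
# it. hlt is privileged: executing it from ring 3 makes the CPU raise a fault,
# and the kernel then kills the process -- no OS code inspected the
# instruction before it ran.
page = mmap.mmap(-1, mmap.PAGESIZE,
                 prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
page.write(b"\xf4")

addr = ctypes.addressof(ctypes.c_char.from_buffer(page))
print("jumping to privileged instruction at", hex(addr))
ctypes.CFUNCTYPE(None)(addr)()   # never returns: the process is killed here
print("unreachable")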
Execute Permission in Page Tables
All instructions to be executed are stored in memory. Many architectures (including x86_64) use page tables to store mappings from virtual addresses to physical addresses. These page table entries carry several bookkeeping bits as well, one of which is an execute permission bit (on x86_64, an execute-disable/NX bit). If the page holding the instruction being fetched is not marked executable, the processor raises a page fault. As before, the kernel synchronously processes this fault, and will likely kill the offending process.
When are these execute bits set? They can be set dynamically via mmap(2), but in most cases the compiler emits dedicated code sections in the binaries it generates, and when the OS loads the binary into memory it sets the execute bit in the page table entries for the pages backing those code sections.
Who's checking these bits?
You're right to ask about the performance penalty of an OS checking these bits for every single instruction. If the OS were doing this, it would be prohibitively expensive. Instead, the processor supports privilege levels and page tables (with the execute bit). The OS can set these bits, and rely on the processor to generate interrupts when a process acts outside its privileges.
These hardware checks are very fast.
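To illustrate that division of labour, here is a small Python sketch (Linux assumed; the child snippet is purely illustrative): the child touches an address it never mapped, the MMU raises a page fault, and the kernel turns that into SIGSEGV without ever having scanned the child's instructions.

import signal, subprocess, sys

# The child dereferences address 0, which is not mapped in its page tables.
# The MMU raises a page fault; the kernel sees the address is invalid and
# delivers SIGSEGV. No per-instruction software check ever happened.
child = "import ctypes; ctypes.string_at(0)"

proc = subprocess.run([sys.executable, "-c", child])
print("child exit code:", proc.returncode)                      # -11 on Linux
print("killed by SIGSEGV:", proc.returncode == -signal.SIGSEGV)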

Possible to use a data folder on NFS?

This seems to be the most reliable in-process data store I've found. I tried a few things locally (SIGKILL, SIGTERM, System.exit(), etc. in the middle of a transaction) and Xodus could pick up from its last good state.
I'm interested to know whether Xodus supports storing data over NFS (using an NFS folder as the environment). Is it possible to corrupt the datastore if file locking doesn't work well, as with some NFS setups, when multiple processes open the same folder from different hosts?
I took a quick look at the lock file (xd.lck; at least it looks like a lock file to me), which seems to include the pid, host name, and a call stack for the LockingManager. However, I'm not sure how this lock file works with Xodus. I found that the file is not removed after the environment closes, nor does its content change.
It's not recommended to use any kind of remote or removable storage for hosting database files. The database can easily be corrupted, not only by an attempt at shared access, but also due to possible connectivity issues. In upcoming versions (released after 1.3.232), an attempt to use remote or removable storage will fail if and where it can be reliably detected.

Do memory mapped files in Docker containers in Kubernetes work the same as in regular processes in Linux?

I have process A and process B. Process A opens a file, calls mmap, and writes to it; process B does the same but reads the same mapped region once process A has finished writing.
Using mmap, process B is supposed to read the file from memory instead of from disk, assuming process A has not called munmap.
If I deploy process A and process B to different containers in the same pod in Kubernetes, is memory-mapped I/O supposed to work the same way as in the initial example? Should container B (process B) read the file from memory, as on my regular Linux desktop?
Let's assume both containers are in the same pod and are reading/writing the file from the same persistent volume. Do I need to consider a specific type of volume to achieve mmap IO?
In case you are curious, I am using Apache Arrow and pyarrow to read and write those files and achieve zero-copy reads.
A Kubernetes pod is a group of containers that are deployed together on the same host. (reference). So this question is really about what happens for multiple containers running on the same host.
Containers are isolated on a host using a number of different technologies. There are two that might be relevant here. Neither prevents two processes from different containers sharing the same memory when they mmap a file.
The two things to consider are how the file systems are isolated and how memory is ring fenced (limited).
How the file systems are isolated
The trick used is to create a mount namespace so that any new mount points are not seen by other processes. Then file systems are mounted into a directory tree and finally the process calls chroot to set / as the root of that directory tree.
No part of this affects the way processes mmap files. This is just a clever trick on how file names / file paths work for the two different processes.
Even if, as part of that setup, the same file system was mounted from scratch by the two different processes, the result would be the same as a bind mount. That means the same file system exists under two paths, but it is the same file system, not a copy.
Any attempt to mmap files in this situation would be identical to two processes in the same namespace.
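As a minimal sketch of the pattern from the question (the /shared/data.bin path is made up; in a pod, both containers would mount the same volume, e.g. an emptyDir or the persistent volume mentioned in the question): process A maps the file and writes, process B maps the same file and reads, and B's read is served from the same page cache pages no matter which container each process runs in.

import mmap, os

PATH = "/shared/data.bin"   # hypothetical path on a volume mounted by both containers
SIZE = mmap.PAGESIZE

def writer():
    # Process A (container A): create the file, map it, write, and sync.
    fd = os.open(PATH, os.O_RDWR | os.O_CREAT, 0o644)
    os.ftruncate(fd, SIZE)
    with mmap.mmap(fd, SIZE, prot=mmap.PROT_READ | mmap.PROT_WRITE) as m:
        m[:5] = b"hello"
        m.flush()           # msync(2): push dirty pages to disk; other mappers
                            # already see them through the shared page cache
    os.close(fd)

def reader():
    # Process B (container B): map the same file read-only and read the region.
    fd = os.open(PATH, os.O_RDONLY)
    with mmap.mmap(fd, SIZE, prot=mmap.PROT_READ) as m:
        print(m[:5])        # backed by the same page cache pages, not a private copy
    os.close(fd)

if __name__ == "__main__":
    writer()
    reader()

Running writer() in one container and reader() in the other behaves the same as running both in a single container, which is the point of the answer above.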
How are memory limits applied?
This is done through cgroups. cgroups don't really isolate anything; they just put limits on what a group of processes can do.
But there is a natural question to ask: if two processes have different memory limits through cgroups, can they share the same shared memory? Yes, they can!
Note: file and shmem may be shared among other cgroups. In that case,
mapped_file is accounted only when the memory cgroup is owner of page
cache.
The reference is a little obscure, but describes how memory limits are applied to such situations.
Conclusion
Two processes both memory mapping the same file from the same file system as different containers on the same host will behave almost exactly the same as if the two processes were in the same container.
