Isn't it wasteful to write IPC stream data to a file in the filesystem, and so on to your drive (HDD or SSD)? I mean, isn't it better to create a "buffered" pipe in the memory, so that we get better performance than with the drive? But I'm new to IPC... or isn't it writing onto the disk? But how is it possible that the system writes into the filesystem without writing to a disk?
Aren't Named Pipes in the Filesystem slow?
They're no slower than any other sort of pipe.
isn't it better to create a "buffered" pipe in the memory
If you aren't memory constrained, then yes (see older OS link below).
[...] or isn't it writing onto the disk?
Your guess is correct - on many modern operating systems data going into a named pipe is not being written to the disk; the filesystem is just the namespace that holds something that tells you where the ends of the pipe can be found. From the Linux man page for pipe:
Note: although FIFOs have a pathname in the filesystem, I/O on FIFOs does not involve operations on the underlying device (if there is one).
There are older operating systems that buffer pipe data within a filesystem, but given your question's phrasing (on such systems ALL pipes go through the filesystem, not just named ones) I suspect this is a tangent.
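To see this for yourself, here is a minimal sketch, assuming a Linux (or other POSIX) system with Python 3. The file name demo.fifo and the helper fifo_round_trip are made up for illustration. It pushes data through a named pipe and then checks the size of the FIFO's filesystem entry, which on Linux stays at zero because the bytes only pass through a kernel buffer.

```python
import os
import tempfile
import threading


def fifo_round_trip(payload: bytes):
    """Send payload through a named pipe; return (received bytes, FIFO size)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.fifo")  # hypothetical name
        os.mkfifo(path)  # creates only a filesystem entry, no data blocks

        def writer() -> None:
            # Opening for writing blocks until a reader opens the other end.
            with open(path, "wb") as w:
                w.write(payload)

        t = threading.Thread(target=writer)
        t.start()
        with open(path, "rb") as r:
            received = r.read()  # returns once the writer closes its end
        t.join()

        # The FIFO is still listed in the directory, but it holds no file data.
        size_after = os.stat(path).st_size
        return received, size_after


if __name__ == "__main__":
    data, size = fifo_round_trip(b"x" * 100_000)
    print(len(data), size)
```

The payload in the usage example is deliberately larger than a typical pipe buffer, so the writer has to wait for the reader rather than the data being parked anywhere on disk.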
How can I measure the efficiency of a container image, in terms of what portion of its contents are actually used (accessed) for the processes therein?
There are various forms of wastage that could contribute to excessively large images, such as layers storing files that are superseded in later layers (which can be analysed using dive), or binaries interlaced with unstripped debug information, or the inclusion of extraneous files (or data) that are simply not needed for the process which executes in the container. Here I'm asking about the latter.
Are there docker-specific tools (analogous to dive) for estimating/measuring this kind of wastage/efficiency, or should I just apply general Linux techniques? Can the filesystem access time (atime) be relied upon inside a container (to distinguish which files have/haven't been read since the container was instantiated) or do I need to instrument the image with tools like the Linux auditing system (auditd)?
You know, when an application opens a file and writes to it, the system chooses which cluster it will be stored in. I want to choose myself! Let me tell you what I really want to do... In fact, I don't necessarily want to write anything. I have an HDD with a BAD range of clusters in the middle, and I want to mark that space as occupied by a file, and eventually set it as a hidden, unmovable system file (like the page file in Windows) so that it won't be accessed anymore. Any ideas on how to do that?
Later Edit:
I think THIS is my last hope. I just found it, but I need to investigate... Maybe a file could be created anywhere and then relocated to the desired cluster. But that requires writing, and the function may fail if that cluster is bad.
I believe the answer to your specific question: "Can I write a file to a specific cluster location" is, in general, "No".
The reason for that is that the architecture of modern operating systems is layered so that the underlying disk store is accessed at a lower level than you can access, and of course disks can be formatted in different ways so there will be different kernel mode drivers that support different formats. Even so, an intelligent disk controller can remap the addresses used by the kernel mode driver anyway. In short there are too many levels of possible redirection for you to be sure that your intervention is happening at the correct level.
If you are talking about Windows - which you haven't stated but which appears to be assumed - then you need to be looking at storage drivers in the kernel (see https://learn.microsoft.com/en-us/windows-hardware/drivers/storage/). I think the closest you could reasonably come would be to write your own Installable File System driver (see https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/_ifsk/). This is really a 'filter' as it sits in the IO request chain and can intercept and change IO Request Packets (IRPs). Of course this would run in the kernel, not in userspace, and normally this would be written in C, and I note your question is tagged for Delphi.
Your IFS driver can sit at different levels in the request chain. I have used this technique to intercept calls to specific file system locations (paths / file names) and alter the IRP so as to virtualise the request - even calling back to user space from the kernel to resolve how the request should be handled. Using the provided examples, implementing basic functionality with an IFS driver is not too involved, because it's a filter and not a complete storage system.
However the very nature of this approach means that another filter can also alter what you are doing in your driver.
You could look at replacing the file system driver that interfaces to the hardware, but I think that's likely to be an excessive task under the circumstances ... and as pointed out already by #fpiette the disk controller hardware can remap your request anyway.
In the days of MSDOS the access to the hardware was simpler and provided by the BIOS which could be hooked to allow the requests to be intercepted. Modern environments aren't that simple anymore. The IFS approach does allow IO to be hooked, but it does not provide the level of control you need.
EDIT regarding suggestion by the OP of using FSCTL_MOVE_FILE
For a simple environment this may well do what you want; it is designed to support a defragmentation process.
However, I still think there's no guarantee that this will actually do what you want.
You will note that the page you have linked to states that it moves one or more virtual clusters of a file from one logical cluster to another within the same volume.
This is a code that's passed to the underlying storage drivers which I have referred to above. What the storage layer does is up to the storage layer and will depend on the underlying technology. With more advanced storage there's no guarantee this actually addresses the physical locations which I believe your question is asking about.
However that's entirely dependent on the underlying storage system. For some types of storage relocation by the OS may not be honoured in the same way. As an example consider an enterprise storage array that has a built in data-tiering function. Without the awareness of the OS data will be relocated within the storage based on the tiering algorithms. Also consider that there are technologies which allow data to be directly accessed (like NVMe) and that you are working with 'virtual' and 'logical' clusters, not physical locations.
However, you may well find that in a simple case, with support in the underlying drivers and no remapping done outside the OS and kernel, this does what you need.
Since your problem is to mark bad clusters, you don't need to write any program. Use the command line utility CHKDSK that Windows provides.
In an elevated command prompt (Run as administrator), run the command:
chkdsk /r c:
Because the system volume is in use, the check will be scheduled and done on the next reboot.
Don't forget to read the documentation.
I have process A and process B. Process A opens a file, calls mmap and writes to it; process B does the same but reads the same mapped region once process A has finished writing.
Using mmap, process B is supposed to read the file from memory instead of disk, assuming process A has not called munmap.
If I would like to deploy process A and process B to different containers in the same pod in Kubernetes, is memory mapped IO supposed to work the same way as in the initial example? Should container B (process B) read the file from memory as on my regular Linux desktop?
Let's assume both containers are in the same pod and are reading/writing the file from the same persistent volume. Do I need to consider a specific type of volume to achieve mmap IO?
In case you are curious, I am using Apache Arrow and pyarrow to read and write those files and achieve zero-copy reads.
A Kubernetes pod is a group of containers that are deployed together on the same host. (reference). So this question is really about what happens for multiple containers running on the same host.
Containers are isolated on a host using a number of different technologies. There are two that might be relevant here. Neither prevents two processes from different containers sharing the same memory when they mmap a file.
The two things to consider are how the file systems are isolated and how memory is ring-fenced (limited).
How the file systems are isolated
The trick used is to create a mount namespace so that any new mount points are not seen by other processes. Then file systems are mounted into a directory tree and finally the process calls chroot to set / as the root of that directory tree.
No part of this affects the way processes mmap files. This is just a clever trick on how file names / file paths work for the two different processes.
Even if, as part of that setup, the same file system was mounted from scratch by the two different processes, the result would be the same as a bind mount. That means the same file system exists under two paths, but it is the same file system, not a copy.
Any attempt to mmap files in this situation would be identical to two processes in the same namespace.
How are memory limits applied?
This is done through cgroups. cgroups don't really isolate anything, they just put limits on what a single process can do.
But there is a natural question to ask: if two processes have different memory limits through cgroups, can they share the same shared memory? Yes they can!
Note: file and shmem may be shared among other cgroups. In that case, mapped_file is accounted only when the memory cgroup is owner of page cache.
The reference is a little obscure, but describes how memory limits are applied to such situations.
Conclusion
Two processes that both memory map the same file from the same file system, running in different containers on the same host, will behave almost exactly the same as if the two processes were in the same container.
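As a rough illustration of the sharing described above, here is a minimal sketch, assuming Linux and Python 3. For brevity it uses two independent mappings of one file inside a single process as a stand-in for process A and process B; the helper name shared_mapping_demo is made up. Data written through the first mapping is visible through the second without any write() or flush(), because both mappings are backed by the same page cache pages.

```python
import mmap
import os
import tempfile


def shared_mapping_demo(message: bytes) -> bytes:
    """Write through one shared mapping, read back through a second one."""
    if len(message) > mmap.PAGESIZE:
        raise ValueError("message must fit in one page for this sketch")
    fd_a, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd_a, mmap.PAGESIZE)  # a mapping needs a non-empty file
        # Stand-in for process A: writable mapping (MAP_SHARED is the Unix default).
        writer = mmap.mmap(fd_a, mmap.PAGESIZE)
        # Stand-in for process B: its own descriptor and a read-only mapping.
        fd_b = os.open(path, os.O_RDONLY)
        reader = mmap.mmap(fd_b, mmap.PAGESIZE, access=mmap.ACCESS_READ)
        try:
            writer[: len(message)] = message  # plain memory write, no flush()
            return bytes(reader[: len(message)])
        finally:
            reader.close()
            writer.close()
            os.close(fd_b)
    finally:
        os.close(fd_a)
        os.remove(path)


if __name__ == "__main__":
    print(shared_mapping_demo(b"written by A, read by B"))
```

In the pod scenario the two mappings would live in different containers, with the file on a volume mounted into both; the sketch only shows the mapping behaviour, not the Kubernetes setup.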
What is the difference between partition and volume?
Kindly give an analogy if possible since I am unable to understand the difference between them.
Partitions -
Storage media (DVDs, USB sticks, HDDs, SSDs) can all be divided into partitions. These partitions are identified by a partition table.
The partition table is where the partition information is stored. The information stored there is basically where each partition starts and where it finishes on the disk.
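To make "where the partition starts and where it finishes" concrete, here is a small sketch, assuming the classic MBR layout (four 16-byte entries at byte offset 446 of the first sector, ending with the 0x55 0xAA signature). GPT disks use a different table. The helper names and the partition values in make_demo_mbr are invented for illustration and do not come from a real disk.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446  # the four partition entries start here in a classic MBR
ENTRY_SIZE = 16


def parse_mbr(sector: bytes):
    """Return (type, first_lba, last_lba) for each used entry of an MBR sector."""
    if len(sector) != SECTOR_SIZE or sector[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    partitions = []
    for i in range(4):
        start = TABLE_OFFSET + i * ENTRY_SIZE
        entry = sector[start : start + ENTRY_SIZE]
        # status, CHS start, type, CHS end, first LBA, number of sectors
        _status, _chs_a, ptype, _chs_b, first_lba, count = struct.unpack(
            "<B3sB3sII", entry
        )
        if ptype == 0 or count == 0:
            continue  # unused slot
        partitions.append((ptype, first_lba, first_lba + count - 1))
    return partitions


def make_demo_mbr() -> bytes:
    """Build a synthetic MBR with one made-up partition (illustrative only)."""
    entry = struct.pack("<B3sB3sII", 0x80, b"\0\0\0", 0x83, b"\0\0\0", 2048, 204800)
    return bytes(TABLE_OFFSET) + entry + bytes(3 * ENTRY_SIZE) + b"\x55\xaa"


if __name__ == "__main__":
    for ptype, first, last in parse_mbr(make_demo_mbr()):
        print(hex(ptype), first, last)
```

Note that the table records logical sector numbers (LBAs) rather than physical positions; the drive itself decides where those sectors physically live.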
Volumes -
A Volume is a logical abstraction from physical storage.
Large disks can be partitioned into multiple logical volumes
Volumes are divided up into fixed-size blocks, or clusters of blocks.
We don't see the partition, as this is handled by the file system controller, but we do see volumes, as they are logical and are presented through a GUI with a hierarchical structure and human interface. When we request a file, the request passes through these steps, in order, to reach that information within the volume on the partition:
The application creates the file I/O request
The file system creates a block I/O request
The block I/O driver accesses the disk
Hope this helps... If any part needs clearing up, let me know and I'll try my best to explain it further.
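The middle step above, turning a file request into block requests, boils down to simple arithmetic. Here is a minimal sketch; the function name blocks_for_range and the 4096-byte default block size are assumptions for illustration. It only computes which blocks of the file a byte range touches. A real file system would then look up where each of those blocks lives on the volume, which this sketch does not model.

```python
def blocks_for_range(offset: int, length: int, block_size: int = 4096) -> range:
    """Return the file-relative block indices touched by [offset, offset + length)."""
    if offset < 0 or length < 0 or block_size <= 0:
        raise ValueError("offset and length must be >= 0 and block_size > 0")
    if length == 0:
        return range(0)  # nothing to read, so no blocks are touched
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


if __name__ == "__main__":
    # A 200-byte read that straddles the boundary between two blocks.
    print(list(blocks_for_range(4000, 200)))
```

A request that crosses a block boundary turns into more than one block I/O, even if it is only a few bytes long.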
I'm getting going with Docker, and I've found that I can put the main image repository on a different disk (symlink /var/lib/docker to some other location).
However, now I'd like to see if there is a way to split that across multiple disks.
Specifically, I have an old SSD that is blazingly fast to read from, but doesn't have too many writes left until it kicks the bucket. It would be awesome if I could store the immutable images on here, then have my writeable images in some other location that can handle the writes.
Is this something that is possible? How do you split up the repository?
Maybe you could do this using the AUFS driver and some trickery such as moving layers to the SSD after initially creating them and pointing symlinks at them - I'm not sure, I never had a proper look at how that storage driver worked.
With devicemapper thinp, btrfs and OverlayFS this isn't possible AFAICT:
The Docker dm-thinp and btrfs drivers both build layers one on top of the other using block device snapshot mechanisms. Your best bet here would be to include the SSD in the storage pool and rely on some ability to migrate the r/o snapshots to a specific block device that is part of the pool. Doubt this exists though.
The OverlayFS driver stacks layers by hard-linking files in independent directory structures. Hard-links only work within a filesystem.
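The hard-link restriction is easy to check from user space. Here is a minimal sketch, assuming a POSIX system and Python 3; the helper names and file names are made up. It tries to hard-link a file and reports a cross-filesystem failure (errno EXDEV) instead of raising. The demo only exercises the same-filesystem case, since a second filesystem cannot be assumed to exist.

```python
import errno
import os
import tempfile


def try_hardlink(src: str, dst: str) -> bool:
    """Hard-link src to dst; return False if they are on different filesystems."""
    try:
        os.link(src, dst)
    except OSError as exc:
        if exc.errno == errno.EXDEV:  # "Invalid cross-device link"
            return False
        raise
    return True


def demo() -> bool:
    """Link two names inside one directory and confirm they are the same file."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "layer-file")    # hypothetical names
        dst = os.path.join(tmp, "linked-file")
        with open(src, "w") as f:
            f.write("data")
        # samefile() compares device and inode, so True means one file, two names.
        return try_hardlink(src, dst) and os.path.samefile(src, dst)


if __name__ == "__main__":
    print(demo())
```

Calling try_hardlink with a destination on another disk would return False, which is why layers that are hard-linked together cannot simply be split between the SSD and a second drive.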