What is the difference between partition and volume?
Kindly give an analogy if possible since I am unable to understand the difference between them.
Partitions -
Storage media (DVDs, USB sticks, HDDs, SSDs) can all be divided into partitions; these partitions are identified by a partition table.
The partition table is where the partition information is stored: essentially, where each partition starts and where it finishes on the disk.
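To make the "where it starts and finishes" point concrete, here is a minimal Python sketch that parses a classic MBR partition table. The layout (446 bootstrap bytes, four 16-byte entries, a 0x55AA signature) is real; the "disk" contents are fabricated in memory for the example.

```python
# Minimal sketch: parsing an MBR partition table. For each partition the table
# stores its starting sector (LBA) and its length in sectors, i.e. where the
# partition starts and where it finishes. The 512-byte "disk" is fabricated.
import struct

def parse_mbr(sector: bytes):
    assert len(sector) == 512 and sector[510:512] == b"\x55\xaa"
    partitions = []
    for i in range(4):  # an MBR holds four primary partition entries
        entry = sector[446 + i * 16 : 446 + (i + 1) * 16]
        ptype = entry[4]                                      # partition type byte
        lba_start = struct.unpack_from("<I", entry, 8)[0]     # first sector
        num_sectors = struct.unpack_from("<I", entry, 12)[0]  # length in sectors
        if ptype != 0:  # type 0 means "unused entry"
            partitions.append((ptype, lba_start, num_sectors))
    return partitions

# Fabricate one bootable Linux partition (type 0x83) starting at sector 2048,
# 2097152 sectors long (1 GiB of 512-byte sectors).
entry = bytes([0x80, 0, 0, 0, 0x83, 0, 0, 0]) + struct.pack("<II", 2048, 2097152)
mbr = b"\x00" * 446 + entry + b"\x00" * 48 + b"\x55\xaa"

print(parse_mbr(mbr))  # [(131, 2048, 2097152)]
```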
Volumes -
A Volume is a logical abstraction from physical storage.
Large disks can be partitioned into multiple logical volumes.
Volumes are divided into fixed-size blocks, or clusters of blocks.
We don't see the partition directly, as that is handled by the file system; what we see are volumes, because they are logical and are presented through a GUI with a hierarchical structure and a human interface. When we request a file, the request runs through a specific sequence to retrieve that information from the volume on the partition:
The application creates the file I/O request
The file system creates a block I/O request
The block I/O driver accesses the disk
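The block-oriented nature of the last two steps is visible even from userspace: a file occupies whole filesystem blocks (clusters), not its exact byte count. A small sketch:

```python
# Sketch: the block I/O layer works in whole blocks, and that shows up in
# stat(): a 100-byte file still occupies at least one filesystem block.
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 100)          # 100 bytes of data
    path = f.name

st = os.stat(path)
print(st.st_size)                # 100, the logical size
print(st.st_blocks * 512)        # the space actually allocated, in whole blocks
os.unlink(path)
```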
Hope this helps... If any part needs clearing up, let me know and I'll do my best to clarify it further.
I have process A and process B. Process A opens a file, calls mmap, and writes to it; process B does the same but reads the same mapped region once process A has finished writing.
Using mmap, process B is supposed to read the file from memory instead of disk, assuming process A has not called munmap.
If I deploy process A and process B to different containers in the same pod in Kubernetes, is memory-mapped I/O supposed to work the same way as in the initial example? Should container B (process B) read the file from memory, as on my regular Linux desktop?
Let's assume both containers are in the same pod and are reading/writing the file from the same persistent volume. Do I need to consider a specific type of volume to achieve mmap I/O?
In case you are curious, I am using Apache Arrow and pyarrow to read and write those files and achieve zero-copy reads.
A Kubernetes pod is a group of containers that are deployed together on the same host. (reference). So this question is really about what happens for multiple containers running on the same host.
Containers are isolated on a host using a number of different technologies. There are two that might be relevant here. Neither prevents two processes from different containers sharing the same memory when they mmap a file.
The two things to consider are how the file systems are isolated and how memory is ring fenced (limited).
How the file systems are isolated
The trick used is to create a mount namespace so that any new mount points are not seen by other processes. Then file systems are mounted into a directory tree and finally the process calls chroot to set / as the root of that directory tree.
No part of this affects the way processes mmap files. This is just a clever trick on how file names / file paths work for the two different processes.
Even if, as part of that setup, the same file system was mounted from scratch by the two different processes, the result would be the same as a bind mount. That means the same file system exists under two paths, but it is the same file system, not a copy.
Any attempt to mmap files in this situation would be identical to two processes in the same namespace.
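As a minimal sketch of that claim, the scenario from the question can be reproduced on any Linux box with two processes (a fork stands in for the two containers here; on a single host the kernel page cache is shared either way; the file path is a throwaway temp file):

```python
# Sketch: process A writes through an mmap'd file, process B maps the same
# file and reads the result. The bytes are served via the shared page cache.
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * 4096)      # size the file before mapping it
os.close(fd)

pid = os.fork()
if pid == 0:                       # child: "process A", the writer
    with open(path, "r+b") as f:
        wmap = mmap.mmap(f.fileno(), 4096)
        wmap[:5] = b"hello"
        wmap.flush()
        wmap.close()
    os._exit(0)

os.waitpid(pid, 0)                 # wait until the writer has finished
with open(path, "r+b") as f:       # parent: "process B", the reader
    m = mmap.mmap(f.fileno(), 4096)
    print(m[:5])                   # b'hello'
os.unlink(path)
```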
How are memory limits applied?
This is done through cgroups. cgroups don't really isolate anything; they just put limits on what a single process can do.
But there is a natural question to ask: if two processes have different memory limits through cgroups, can they share the same shared memory? Yes they can!
Note: file and shmem may be shared among other cgroups. In that case, mapped_file is accounted only when the memory cgroup is owner of page cache.
The reference is a little obscure, but describes how memory limits are applied to such situations.
Conclusion
Two processes both memory mapping the same file from the same file system as different containers on the same host will behave almost exactly the same as if the two processes were in the same container.
Isn't it wasteful to write stream data for IPC into a file in the filesystem, and thus onto your HDD or SSD? I mean, isn't it better to create a "buffered" pipe in memory, so that we spare the drive? But I'm new to IPC... or is it not actually writing onto the disk? But how is it possible that the system writes into the filesystem without writing to a disk?
Aren't Named Pipes in the Filesystem slow?
They're no slower than any other sort of pipe.
isn't it better to create a "buffered" pipe in the memory
If you aren't memory constrained, then yes (see older OS link below).
[...] or isn't it writing onto the disk?
Your guess is correct - on many modern operating systems data going into a named pipe is not being written to the disk; the filesystem is just the namespace that holds something that tells you where the ends of the pipe can be found. From the Linux man page for pipe:
Note: although FIFOs have a pathname in the filesystem, I/O on FIFOs does not involve operations on the underlying device (if there is one).
There are older operating systems that buffer pipe data within a filesystem but given your question's phrasing (on such systems ALL pipes go through the filesystem not just named ones) I suspect this is a tangent.
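A small Python sketch makes this visible: the FIFO gets a pathname, data flows through it, yet the filesystem entry never stores a byte (the path and message are made up for the example):

```python
# Sketch: a named pipe (FIFO) has a pathname, but the bytes written through it
# never land on the underlying device; the path is only a rendezvous point.
import os
import tempfile
import threading

path = os.path.join(tempfile.mkdtemp(), "myfifo")
os.mkfifo(path)                          # creates only the filesystem entry

def writer():
    with open(path, "wb") as w:          # blocks until a reader opens the FIFO
        w.write(b"through the kernel, not the disk")

t = threading.Thread(target=writer)
t.start()
with open(path, "rb") as r:
    data = r.read()                      # returns once the writer closes its end
t.join()

print(data)
print(os.stat(path).st_size)             # 0: nothing was ever stored at the path
```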
I have a GlusterFS setup with two nodes (node1 and node2) set up as a replicated volume.
The volume contains many small files, 8 kB - 200 kB in size. When I subject node1 to heavy read load, the glusterfsd and glusterfs processes together use ~100% CPU on both nodes.
There is no write load on any of the nodes. But why is the CPU load so high, on both nodes?
As I understand it all the data is replicated to both nodes, so it "should" perform like a local filesystem.
This is commonly related to small files, e.g. if you have PHP apps running from a Gluster volume.
This one bit me in the rear once, and it mostly comes down to the fact that in many PHP frameworks you get a lot of stat() calls to see if a file exists at a given spot; if not, it will stat a level (directory) higher, or with a slightly different name. Repeat 1000 times. Per file.
Now here's the catch: that lookup to check whether the file exists does not just happen on that node / the local brick (if you use replication), but on ALL the nodes / bricks involved. The cost involved can explode fast (especially on some cloud platforms, where IOPS are capped).
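A hypothetical sketch of how those existence checks multiply (the directory names are made up; the point is the arithmetic, since on a replicated Gluster volume every one of these checks hits every brick):

```python
# Hypothetical illustration of the "stat storm": a PHP-style include-path
# search probes several candidate locations per file before giving up.
import os
import tempfile

base = tempfile.mkdtemp()                # guaranteed-empty directory tree
search_path = [os.path.join(base, d) for d in ("app", "app/lib", "shared")]

def lookups(name):
    """Count existence checks until the file is found or the path is exhausted."""
    checks = 0
    for directory in search_path:
        checks += 1
        if os.path.exists(os.path.join(directory, name)):
            break
    return checks

# 1000 files x 3 candidate directories = 3000 stat() calls for one page load,
# and each call fans out to every brick of the replicated volume.
total = sum(lookups(f"file{i}.php") for i in range(1000))
print(total)  # 3000
```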
This article helped me out significantly. In the end there was still a small penalty, but the benefits outweighed that.
https://www.vanderzee.org/linux/article-170626-141044/article-171031-113239/article-171212-095104
When I see the details of my dataflow compute engine instance, I can see two categories of disks being used - (1) Boot disk and local disks, and (2) Additional disks.
I can see that the size that I specify using the diskSizeGb option determines the size of a single disk under the category 'Boot disk and local disks'. My not-so-heavy job is using 8 additional disks of 40GB each.
What are additional disks used for and is it possible to limit their size/number?
Dataflow will create Compute Engine VM instances, also known as workers, for your job.
To process the input data and store temporary data, each worker may require up to 15 additional Persistent Disks.
The default size of each persistent disk is 250 GB in batch mode and 400 GB in streaming mode; 40 GB is very far from the default value.
In this case, the Dataflow service will spawn more disks for your worker. If you want to keep a 1:1 ratio between workers and disks, increase the diskSizeGb field.
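For reference, a hedged sketch of how that option is passed when launching a Beam pipeline (the Python SDK spells it disk_size_gb, the Java SDK diskSizeGb; the project and region values are placeholders, not taken from the question):

```python
# Hedged sketch: pinning the per-worker disk size for a Dataflow job by
# building the flag list handed to the Beam pipeline runner.
flags = [
    "--runner=DataflowRunner",
    "--project=my-project",       # placeholder project id
    "--region=us-central1",       # placeholder region
    "--disk_size_gb=50",          # request 50 GB persistent disks
    "--max_num_workers=8",
]
print(" ".join(flags))
```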
The existing answer explains how many disks there are and gives information about the disks, but it does not answer the main question: why so many disks per worker?
WHY does Dataflow need several disks per worker?
The way in which Dataflow does load balancing for streaming jobs is that a range of keys is allocated to each disk. Persistent state about each key is stored in these disks.
A worker can be overloaded if the ranges that are allocated to its persistent disks have a very high volume. To load-balance, Dataflow can move a range from one worker to another by transferring a persistent disk to a different worker.
So this is why Dataflow uses multiple disks per worker: Because this allows it to do load balancing and autoscaling by moving the disks from worker to worker.
I'm getting going with Docker, and I've found that I can put the main image repository on a different disk (symlink /var/lib/docker to some other location).
However, now I'd like to see if there is a way to split that across multiple disks.
Specifically, I have an old SSD that is blazingly fast to read from, but doesn't have too many writes left until it kicks the can. It would be awesome if I could store the immutable images on here, then have my writeable images on some other location that can handle the writes.
Is this something that is possible? How do you split up the repository?
Maybe you could do this using the AUFS driver and some trickery such as moving layers to the SSD after initially creating them and pointing symlinks at them - I'm not sure, I never had a proper look at how that storage driver worked.
With devicemapper thinp, btrfs and OverlayFS this isn't possible AFAICT:
The Docker dm-thinp and btrfs drivers both build layers one on top of the other using block device snapshot mechanisms. Your best bet here would be to include the SSD in the storage pool and rely on some ability to migrate the r/o snapshots to a specific block device that is part of the pool. Doubt this exists though.
The OverlayFS driver stacks layers by hard-linking files in independent directory structures. Hard-links only work within a filesystem.
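That last point can be demonstrated in a few lines of Python: a hard link is just a second directory entry for the same inode, which is why it can never cross into another filesystem such as a separately mounted SSD:

```python
# Sketch: a hard link is a second directory entry for the same inode, so it
# can only exist within one filesystem. This is why the OverlayFS driver
# cannot spread its hard-linked layers across two disks.
import os
import tempfile

d = tempfile.mkdtemp()
src = os.path.join(d, "layer-file")
with open(src, "w") as f:
    f.write("immutable layer data")

link = os.path.join(d, "hard-link")
os.link(src, link)                        # same filesystem: allowed
print(os.stat(src).st_ino == os.stat(link).st_ino)  # True, one inode, two names

# Linking across filesystems (e.g. onto a separately mounted SSD) instead
# raises OSError with errno.EXDEV ("Invalid cross-device link").
```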