Cassandra Data storage: data directory space not equal to the space occupied - memory

This is a beginner's question about Cassandra architecture.
I have a 3-node Cassandra cluster. The data directory is at $CASSANDRA_HOME/data/data. I loaded a huge data set, then did a nodetool flush followed by nodetool tablestats on the table I loaded the data into. It says the total space occupied is around 50 GiB. I was curious and checked the size of my data directory with du $CASSANDRA_HOME/data/data on each of the nodes, which shows around 1-2 GB on each. How can the data directory be smaller than the space occupied by a single table? Am I missing something? My table is created with replication factor 1.

du reports the true storage capacity used by the paths given to it. This is not always directly related to the size of the data stored in those paths.
Two main factors can make the output of du differ from other storage usage information you might get (e.g. from Cassandra).
du might report a smaller number than expected for two reasons:
(a) It combines hard links. If the paths given to it contain hard-linked files (I won't explain hard links here, but the term is standard on Unix-like operating systems, so it is easy to look up), these are counted only once even though the files appear in multiple places.
(b) It is aware of sparse files. These are files which contain large (sometimes huge) areas of empty space (zero bytes). Many Unix-like file systems can store these efficiently, depending on how they were created.
du might report a larger number than expected because file systems have some overhead. To store a file of n bytes, n + h bytes need to be stored, where h depends on the file system and its configuration. The most important factor is that file systems typically store files in a block structure. If a file's size isn't an exact multiple of the file system's block size, the last block is still allocated completely for this file, so some of its space is wasted. du will show the whole block as allocated because, in fact, it is.
So in your case, Cassandra might report 50 GiB of space occupied, but a lot of it might be empty (never-written) space. That can be stored in sparse files on the file system which in fact only use about 2 GiB of storage (which is what du shows).
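If you want to see this effect for yourself, here is a minimal Python sketch (assuming a file system with sparse-file support, such as ext4 or XFS; file name is hypothetical) that creates a 1 GiB sparse file and compares its apparent size with the space actually allocated, which is exactly the kind of gap you see between the table stats and du:
import os
path = "sparse_demo.bin"              # hypothetical scratch file name
with open(path, "wb") as f:
    f.seek(1024 * 1024 * 1024 - 1)    # skip past ~1 GiB of never-written space
    f.write(b"\0")                    # a single byte sets the apparent size
st = os.stat(path)
print("apparent size :", st.st_size)          # what ls -l reports
print("allocated size:", st.st_blocks * 512)  # what du reports (st_blocks is in 512-byte units)
os.remove(path)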

Related

Does setting ActiveRecord's text limit save hard drive space?

#pozs... this is NOT a duplicate of the one you indicated. That was the first place I looked. I couldn't care less about the difference between text and varchar. I'm asking about the physical space used on the medium, i.e. the server's hard drive.
I know that hard drives are split into blocks of bytes, a.k.a. chunks, and that if less than the whole block is used, the remaining space is wasted, unused space. What I'm curious about is that the text option itself uses a certain amount of storage. Can the space used be reduced, rather than just limiting the quantity of input? I could say text limit => 1, and it may still use thousands upon thousands of bytes... this is what I'm asking about.
This is a photo of hard drive blocks. This is how I imagine ActiveRecord's text type using space.
Here's the wiki on Blocks (data storage): http://en.wikipedia.org/wiki/Block_(data_storage) As you can see, they say "Block storage is normally abstracted by a file system or database management system (DBMS)". What they do NOT say is HOW it is abstracted.
According to Igor's blog he says "To my surprise, they determined that the average I/O size of our Postgres databases was much higher that 8KB block size, and up to 1MB." http://igorsf.wordpress.com/2010/11/01/things-to-check-when-configuring-new-postgres-storage-for-high-performance-and-availability/ While this is helpful to know it doesn't tell me the default behaviour between ActiveRecord and PostgreSQL in handling blocks.
According to concernedoftunbridgewells, "The database will allocate space in a table or index in some given block size. In the case of Postgres this is 8K". https://dba.stackexchange.com/questions/15510/understanding-block-sizes/15514#15514?newreg=fc10593601be479b8ed697d1bbd108ed So if 8K is used as a block, how high or low do I set the text type limit to match and fit within one 8K block, given that a value may use more than just one block?
I know that PostgreSQL block size setting can be changed. So I would like clarity on "how ActiveRecord PostgreSQL block size handling currently works". I will accept a good answer for that.
A page contains more than one item (assuming there is space, obviously). If your row is less than 8K, then other rows will be stored on the same page with it (I'm simplifying slightly; Postgres stores large column values separately anyway). Limiting the max length of a column doesn't interact with this.
My reading of the details on character types is that strings under 126 characters incur 3 bytes less overhead, but this happens on a row by row basis, independently of what the maximum length is.
The PostgreSQL docs have details on the exact on-disk format and how PostgreSQL deals with large columns.
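If you want to see what a given text value actually costs on disk, one way is to ask PostgreSQL itself. The Python sketch below uses psycopg2 with a hypothetical articles table and text column body; pg_column_size() reports the stored (possibly compressed/TOASTed) size, octet_length() the raw string size, and neither is affected by an ActiveRecord :limit:
import psycopg2                                       # assumes the psycopg2 driver is installed
conn = psycopg2.connect("dbname=myapp_development")   # hypothetical connection string
cur = conn.cursor()
# octet_length(): raw byte length of the string; pg_column_size(): bytes the
# value actually occupies on disk (after compression/TOAST).
cur.execute("""
    SELECT octet_length(body), pg_column_size(body)
    FROM articles
    ORDER BY id
    LIMIT 5
""")
for raw_bytes, stored_bytes in cur.fetchall():
    print(raw_bytes, stored_bytes)
cur.close()
conn.close()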
IMO, the size taken by a text-type DB column mainly depends on the content stored in the column. The limit setting inside ActiveRecord just performs a validation before the content is saved into the DB column; it doesn't have an impact on the actual storage.

Sorting 20GB of data

In the past I had to work with big files, somewhere in the 0.1-3 GB range. Not all the 'columns' were needed, so it was OK to fit the remaining data in RAM.
Now I have to work with files in the 1-20 GB range, and they will probably grow as time passes. That is totally different, because you can no longer fit the data in RAM.
My file contains several million 'entries' (I have found one with 30 million entries). One entry consists of about 10 'columns': one string (50-1000 Unicode chars) and several numbers. I have to sort the data by 'column' and show it. For the user only the top entries (1-30%) are relevant; the rest is low-quality data.
So, I need some suggestions about which direction to head in. I definitely don't want to put the data in a DB, because databases are hard to install and configure for non-computer-savvy persons. I'd like to deliver a monolithic program.
Showing the data is not difficult at all. But sorting... without loading the data into RAM, on regular PCs (2-6 GB RAM)... will take a good few hours.
I was looking a bit into MMF (memory mapped files) but this article from Danny Thorpe shows that it may not be suitable: http://dannythorpe.com/2004/03/19/the-hidden-costs-of-memory-mapped-files/
So, I was thinking about loading into RAM only the data from the column that has to be sorted, AND a pointer to the address (in the disk file) of the 'entry'. I sort the 'column', then I use the pointer to find the entry corresponding to each column cell and restore the entry. The 'restored' entries will be written directly to disk, so no additional RAM will be required.
PS: I am looking for a solution that will work both on Lazarus and Delphi because Lazarus (actually FPC) has 64 bit support for Mac. 64 bit means more RAM available = faster sorting.
I think the way to go is merge sort; it's a great algorithm for sorting a large amount of fixed-size records with limited memory.
General idea:
read N lines from the input file (a value that allows you to keep the lines in memory)
sort these lines and write the sorted lines to file 1
repeat with the next N lines to obtain file 2
...
you reach the end of the input file and you now have M files (each of which is sorted)
merge these files into a single file (you'll have to do this in steps as well; see the sketch after this list)
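Here is a minimal Python sketch of that scheme, purely to illustrate the algorithm (it is not a drop-in Delphi/Lazarus solution); it assumes one entry per line, each line ending with a newline, and a chunk size that fits comfortably in RAM:
import heapq
import os
import tempfile
from itertools import islice
def external_sort(input_path, output_path, chunk_lines=1_000_000):
    chunk_paths = []
    # Phase 1: read N lines at a time, sort them in memory, spill each sorted
    # chunk to its own temporary file.
    with open(input_path, "r", encoding="utf-8") as src:
        while True:
            chunk = list(islice(src, chunk_lines))
            if not chunk:
                break
            chunk.sort()
            tmp = tempfile.NamedTemporaryFile(mode="w", encoding="utf-8",
                                              suffix=".chunk", delete=False)
            tmp.writelines(chunk)
            tmp.close()
            chunk_paths.append(tmp.name)
    # Phase 2: k-way merge of the sorted chunks; heapq.merge keeps only one
    # pending line per chunk file in memory at a time.
    chunks = [open(p, "r", encoding="utf-8") for p in chunk_paths]
    try:
        with open(output_path, "w", encoding="utf-8") as dst:
            dst.writelines(heapq.merge(*chunks))
    finally:
        for f, p in zip(chunks, chunk_paths):
            f.close()
            os.remove(p)
Because the merge phase only ever holds one pending line per chunk, memory use stays small no matter how large the input file is.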
You could also consider a solution based on an embedded database, e.g. Firebird embedded: it works well with Delphi/Windows and you only have to add some DLLs to your program folder (I'm not sure about Lazarus/OSX).
If you only need a fraction of the whole data, scan the file sequentially and keep only the entries needed for display. For instance, let's say you need only 300 entries out of 1 million. Scan the first 300 entries in the file and sort them in memory. Then, for each remaining entry, check whether it is lower than the lowest in memory and skip it. If it is higher than the lowest entry in memory, insert it into the correct place inside the 300 and throw away the lowest. This makes the second lowest the new lowest. Repeat until the end of the file.
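A rough Python sketch of that single pass, assuming tab-separated lines whose first column is the sort key (parse_key and the column layout are illustrative, n=300 matches the example above):
import heapq
def top_n(path, n=300, parse_key=float):
    heap = []  # (key, line) pairs; heap[0] is always the lowest kept entry
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            key = parse_key(line.split("\t", 1)[0])   # first tab-separated column
            if len(heap) < n:
                heapq.heappush(heap, (key, line))
            elif key > heap[0][0]:
                heapq.heapreplace(heap, (key, line))  # throw away the current lowest
    return [line for key, line in sorted(heap, reverse=True)]  # highest first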
Really, there are no sorting algorithms that can make moving 30 GB of randomly ordered data fast.
If you need to sort in multiple ways, the trick is not to move the data itself at all, but instead to create an index for each column that you need to sort.
I do it like that with files that are also tens of gigabytes long, and users can sort, scroll and search the data without noticing that it's a huge dataset they're working with.
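In Python terms (only as an illustration of the idea, not of the actual Delphi code), such an index is just a list of (key, byte offset) pairs; parse_key below is a hypothetical accessor for the column being sorted:
def build_index(path, parse_key):
    index = []
    offset = 0
    with open(path, "rb") as f:
        for raw in f:
            index.append((parse_key(raw), offset))  # sort key plus byte offset
            offset += len(raw)
    index.sort()       # only the index moves, never the entries themselves
    return index
def read_entry(path, offset):
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline()  # the full entry, fetched only when displayed
You build one index per sortable column; re-sorting is then just sorting a list of small tuples, and the data file itself never moves.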
Please find here a class which sorts a file using a slightly optimized merge sort. I wrote it a couple of years ago for fun. It uses a skip list for sorting files in memory.
Edit: The forum is German and you have to register (for free). It's safe but requires a bit of German knowledge.
If you cannot fit the data into main memory then you are in the realm of external sorting. Typically this involves an external merge sort: sort smaller chunks of the data in memory, one by one, write them back to disk, and then merge these chunks.

Why is a process's address space divided into four segments (text, data, stack and heap)?

Why does a process's address space have to be divided into four segments (text, data, stack and heap)? What is the advantage? Is it possible to have just one big segment?
There are multiple reasons for splitting programs into parts in memory.
One of them is that instruction and data memories can be architecturally distinct and discontiguous, that is, read and written from/to using different instructions and circuitry inside and outside of the CPU, forming two different address spaces (i.e. reading code from address 0 and reading data from address 0 will typically return two different values, from different memories).
Another is reliability/security. You rarely want the program's code and constant data to change. Most of the time when that happens, it happens because something is wrong (either in the program itself or in its inputs, which may be maliciously constructed). You want to prevent that from happening and know if there are any attempts. Likewise you don't want the data areas that can change to be executable. If they are and there are security bugs in the program, the program can be easily forced to do something harmful when malicious code makes it into the program data areas as data and triggers those security bugs (e.g. buffer overflows).
Yet another is storage... In many programs a number of data areas aren't initialized at all or are initialized to one common predefined value (often 0). Memory has to be reserved for these data areas when the program is loaded and is about to start, but these areas don't need to be stored on the disk, because there's no meaningful data there.
On some systems you may have everything in one place (section/segment/etc). One notable example here is MSDOS, where .COM-style programs have no structure other than that they have to be less than about 64KB in size, the first executable instruction must appear at the very beginning of the file, and it must assume that its location corresponds to IP=0x100 (where IP is the instruction pointer register). How code and data are placed and interleaved in a .COM program is unimportant and up to the programmer.
There are other architectural artifacts such as x86 segments. Again, MSDOS is a good example of an OS that deals with them. .EXE-style programs in it may have multiple segments in them that correspond directly to the x86 CPU segments, to the real-mode addressing scheme, in which memory is viewed through 64KB-long "windows" known as segments. The position of these windows/segments is relative to the value of the CPU's segment registers. By altering the segment register values you can move the "windows". In order to access more than 64KB one needs to use different segment register values and that often implies having multiple segments in the .EXE (can be not just one segment for code and one for data, but also multiple segments for either of them).
At least the text and data segments are separated to prevent malicious code that's stored inside a variable from being run.
Instructions (compiled code) are stored in the text segment, while the contents of your variables are stored in a data segment, the latter of which never gets executed, only read from and written to.
A little more info here.
Isn't this distinction just a big, hacky workaround for patching security into the von-Neumann architecture where data and instructions share the same memory?
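On Linux you can watch this split from inside a process: /proc/self/maps lists every mapping with its permissions, and the point of the separation shows up in the permission column (text mappings are r-xp, i.e. executable but not writable; the heap, stack and writable data are rw-p). A small Python illustration:
with open("/proc/self/maps") as maps:
    for line in maps:
        if "[heap]" in line or "[stack]" in line or "r-xp" in line:
            print(line.rstrip())  # address range, permissions, backing file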

Relation between Linux /proc/meminfo and /sys/devices/system/node/nodex/meminfo

I'd like to get the amount of "free memory" per NUMA node.
When dealing with a whole machine, one usually parses /proc/meminfo like free does (the number wanted is MemFree + Buffers + Cached).
There is also /sys/devices/system/node/nodeX/meminfo, which seems to display numbers per NUMA node. Does anybody know how these numbers can be correlated with the content of /proc/meminfo? My simple assumption would be that summing up certain numbers over all NUMA nodes in the system should equal some number in /proc/meminfo. But so far I have failed to figure out the relationships, especially for page caches.
The code for proc is in fs/proc/meminfo.c, for the sysfs files it's in drivers/base/node.c. Comparing them might give you some hints.
Note that you'll probably never get the numbers to add up 100%, because you can't atomically read the content of all the files, so the values will change while you're reading them.
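As a rough illustration (not an exact accounting), the Python sketch below sums MemFree over the per-node files and compares it with MemFree from /proc/meminfo; because the reads are not atomic, expect the numbers to be close rather than identical:
import glob
import re
def parse_meminfo(path):
    values = {}
    with open(path) as f:
        for line in f:
            # /proc/meminfo lines look like "MemFree: 123456 kB"; the per-node
            # files prefix them with "Node 0 ".
            m = re.match(r"(?:Node \d+ )?(\w+):\s+(\d+)", line)
            if m:
                values[m.group(1)] = int(m.group(2))  # value in kB
    return values
global_free = parse_meminfo("/proc/meminfo")["MemFree"]
node_free = sum(parse_meminfo(p)["MemFree"]
                for p in glob.glob("/sys/devices/system/node/node*/meminfo"))
print("/proc/meminfo MemFree:", global_free, "kB")
print("sum over NUMA nodes  :", node_free, "kB")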
There also seems to be an inconsistency in the total RAM reported via both methods. One explanation for that is that free_init_mem doesn't appear to be NUMA aware, and increments total_ram_pages but does not do any NUMA accounting.

Finding what hard drive sectors occupy a file

I'm looking for a nice easy way to find what sectors occupy a given file. My language preference is C#.
From my A-Level Computing class I was taught that a hard drive has a lookup table in the first few KB of the disk. In this table there is a linked list for each file detailing which sectors that file occupies. So I'm hoping there's a convenient way to look in this table for a certain file and see which sectors it occupies.
I have tried Googling, but I'm finding nothing useful. Maybe I'm not searching for the right thing, but I can't find anything at all.
Any help is appreciated, thanks.
About Drives
The physical geometry of modern hard drives is no longer directly accessible by the operating system. Early hard drives were simple enough that it was possible to address them according to their physical structure: cylinder-head-sector. Modern drives are much more complex and use schemes like zone bit recording, in which not all tracks have the same number of sectors. It's no longer practical to address them according to their physical geometry.
from the fdisk man page:
If possible, fdisk will obtain the disk geometry automatically. This is not necessarily the physical disk geometry (indeed, modern disks do not really have anything like a physical geometry, certainly not something that can be described in simplistic Cylinders/Heads/Sectors form)
To get around this problem modern drives are addressed using Logical Block Addressing, which is what the operating system knows about. LBA is an addressing scheme where the entire disk is represented as a linear set of blocks, each block being a uniform amount of bytes (usually 512 or larger).
About Files
In order to understand where a "file" is located on a disk (at the LBA level) you will need to understand what a file is. This is going to be dependent on what file system you are using. In Unix style file systems there is a structure called an inode which describes a file. The inode stores all the attributes a file has and points to the LBA location of the actual data.
Ubuntu Example
Here's an example of finding the LBA location of file data.
First get your file's inode number
$ ls -i
659908 test.txt
Run the file system debugger. "yourPartition" will be something like sda1; it is the partition that your file system is located on.
$ sudo debugfs /dev/yourPartition
debugfs: stat <659908>
Inode: 659908 Type: regular Mode: 0644 Flags: 0x80000
Generation: 3039230668 Version: 0x00000000:00000001
...
...
Size of extra inode fields: 28
EXTENTS:
(0): 266301
The number under "EXTENTS", 266301, is the logical block in the file system that your file is located on. If your file is large, there will be multiple blocks listed. There's probably an easier way to get that number; I couldn't find one.
To validate that we have the right block, use dd to read that block off the disk. To find out your file system's block size, use dumpe2fs.
dumpe2fs -h /dev/yourPartition | grep "Block size"
Then put your block size in the ibs= parameter, and the extent logical block in the skip= parameter, and run dd like this:
sudo dd if=/dev/yourPartition of=success.txt ibs=4096 count=1 skip=266301
success.txt should now contain the original file's contents.
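If you prefer doing the same check programmatically, here is a Python equivalent of that dd command as a sketch (the device path, block number and block size are the placeholders and values from the debugfs/dumpe2fs steps above; reading the raw partition requires root):
import os
device = "/dev/yourPartition"   # same placeholder as in the debugfs example
block = 266301                  # the logical block printed under EXTENTS
block_size = 4096               # from dumpe2fs -h ... | grep "Block size"
fd = os.open(device, os.O_RDONLY)
try:
    data = os.pread(fd, block_size, block * block_size)  # read exactly one block
finally:
    os.close(fd)
print(data.rstrip(b"\0"))       # strip the zero padding of the last block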
sudo hdparm --fibmap file
For ext, vfat and NTFS... maybe more.
FIBMAP is also available from C, as a Linux ioctl.
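For completeness, here is a hedged Python sketch of what hdparm --fibmap does under the hood, using the Linux FIBMAP ioctl (the request number 1 comes from <linux/fs.h>; it needs root, and not every file system supports it; the block size parameter must match your file system):
import array
import fcntl
import os
FIBMAP = 1                      # ioctl request number from <linux/fs.h>
def file_blocks(path, fs_block_size=4096):
    fd = os.open(path, os.O_RDONLY)
    try:
        n_blocks = (os.fstat(fd).st_size + fs_block_size - 1) // fs_block_size
        blocks = []
        for i in range(n_blocks):
            buf = array.array("i", [i])          # in: logical block index
            fcntl.ioctl(fd, FIBMAP, buf, True)   # out: file-system block number
            blocks.append(buf[0])
        return blocks
    finally:
        os.close(fd)
print(file_blocks("test.txt"))  # the file from the debugfs example above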
