Calculate disk capacity given information about disk blocks and memory

I got stuck on the following question regarding file systems:
Consider a file system that uses inodes to represent files. Disk blocks are 4 KB in size, and a pointer to a disk block requires 4 bytes. This file system has 12 direct disk blocks, as well as single, double, and triple indirect disk blocks.
(1). What is the maximum size of a file that can be stored in this file system?
(2). What is the disk capacity?
Part 1 is simple:
(12*4KB) + (1024*4KB) + (1024*1024*4KB) + (1024*1024*1024*4KB) ≈ 4 TB,
since each 4 KB block holds 4 KB / 4 bytes = 1024 pointers.
But I got stuck at part 2. My initial thought was that since the maximum size of a file that can be stored in the file system is 4 TB, the disk capacity is 4 TB as well. Is that the case?
I hope someone can help with my problem. Thank you very much.
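For reference, the part 1 sum can be checked mechanically. Below is a small C sketch of the same arithmetic, using only the figures given in the question (4 KB blocks, 4-byte pointers, 12 direct blocks); the variable names are mine:

#include <stdio.h>

int main(void) {
    unsigned long long block = 4ULL * 1024;   /* 4 KB block size          */
    unsigned long long ptrs  = block / 4;     /* 1024 pointers per block  */

    unsigned long long max_file =
          12ULL * block                       /* direct blocks            */
        + ptrs * block                        /* single indirect          */
        + ptrs * ptrs * block                 /* double indirect          */
        + ptrs * ptrs * ptrs * block;         /* triple indirect          */

    printf("max file size: %llu bytes (~%llu GB)\n",
           max_file, max_file >> 30);         /* prints ~4100 GB, i.e. ~4 TB */
    return 0;
}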

Related

Cassandra Data storage: data directory space not equal to the space occupied

This is a beginner's question on Cassandra architecture.
I have a 3-node Cassandra cluster. The data directory is at $CASSANDRA_HOME/data/data. I've loaded a huge data set, ran nodetool flush, and then nodetool tablestats on the table I loaded the data into. It says the total space occupied is around 50 GiB. I was curious and checked the size of the data directory with du $CASSANDRA_HOME/data/data on each of the nodes, which shows around 1-2 GB on each. How can the data directory be smaller than the space occupied by a single table? Am I missing something? My table is created with replication factor 1.
du reports the true storage capacity used by the paths given to it. This is not always directly connected to the size of the data stored in those paths.
Two main factors make the output of du differ from other storage usage figures you might get (e.g. from Cassandra).
du might report a smaller number than expected for two reasons:
a) It combines hard links. If the paths given to it contain hard-linked files (I won't explain hard links here, but the term is a standard one for Unixish operating systems, so it can be looked up easily), these are counted only once even though the files appear multiple times.
b) It is aware of sparse files: files which contain large (sometimes huge) areas of empty space (zero bytes). Many Unixish file systems can store these efficiently, depending on how they were created.
du might report a larger number than expected because file systems have some overhead: to store a file of n bytes, n + h bytes need to be stored, where h depends on the file system and its configuration. The most important factor is that file systems typically store files in a block structure. If a file isn't exactly a multiple of the file system's block size, the last needed block is still allocated completely for this file, so some of its size is wasted. du shows the whole block as allocated because, in fact, it is.
So in your case Cassandra might report 50 GiB of occupied space, but a lot of that might be empty (never-written-to) space. It might be stored in a sparse file which in fact uses only 2 GiB of storage (which is what du shows).
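To see the sparse-file effect directly, here is a minimal C sketch, assuming a Unix-like system (the file name is made up for illustration). It creates a file whose logical size is 1 GiB but which occupies only a few blocks on disk; the gap between the two printed numbers is exactly the gap between a tool's "space occupied" figure and what du reports:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    /* Create a ~1 GiB file containing a single real byte at the end. */
    int fd = open("sparse.bin", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) return 1;
    lseek(fd, (off_t)1024 * 1024 * 1024 - 1, SEEK_SET); /* skip ~1 GiB    */
    write(fd, "", 1);                                   /* one real byte  */
    close(fd);

    struct stat st;
    stat("sparse.bin", &st);
    printf("logical size: %lld bytes\n", (long long)st.st_size);
    printf("allocated:    %lld bytes\n", (long long)st.st_blocks * 512);
    /* du reports the allocated figure, not the logical size. */
    return 0;
}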

Set associative cache

A computer system uses 32-bit memory addresses and has a main memory of 1 GB. It has a 4 KB cache organised in a block-set-associative manner with 4 blocks per set and 64 bytes per block. Calculate the number of bits in each of the tag, set, and word fields of the memory address.
This simulator can help you: www.cachesimulator.com
After the simulated cache has loaded, you can find the address layout under "Cache information".
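For a quick cross-check, the field widths can also be worked out directly from the figures in the question: a 4 KB cache with 64-byte blocks holds 64 blocks, and at 4 blocks per set that is 16 sets. A small C sketch of that arithmetic (the variable names are mine):

#include <stdio.h>

int main(void) {
    /* Figures from the question: 32-bit addresses, 4 KB cache,
       4 blocks per set, 64 bytes per block. */
    int address_bits = 32;
    int block_size   = 64;                       /* bytes per block */
    int cache_size   = 4 * 1024;                 /* bytes           */
    int ways         = 4;                        /* blocks per set  */

    int word_bits = 0;                           /* log2(block_size) */
    for (int b = block_size; b > 1; b >>= 1) word_bits++;

    int sets = cache_size / (block_size * ways); /* 16 sets          */
    int set_bits = 0;                            /* log2(sets)       */
    for (int s = sets; s > 1; s >>= 1) set_bits++;

    int tag_bits = address_bits - set_bits - word_bits;

    printf("word: %d, set: %d, tag: %d\n", word_bits, set_bits, tag_bits);
    /* Prints: word: 6, set: 4, tag: 22 */
    return 0;
}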

Can HDF5 perform "value mapping"?

If I have a 32^3 array of 64-bit integers, but it contains only a dozen different values, can you tell HDF5 to use an "internal mapping" to save memory and/or disk space? What I mean is that the array would be accessed normally with 64-bit ints, but each value would internally be stored as a byte (?) index into a table of 64-bit ints, potentially saving about 7/8 of the memory and/or disk space. If this is possible, does it actually save memory, disk space, or both?
I don't believe that HDF5 provides this functionality right out of the box, but there is no reason why you couldn't implement routines to write your data to an HDF5 file and read it back again in the way that you seem to want. I suppose you could write your look-up table and your array into different datasets.
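For what it's worth, the table-plus-index encoding itself is simple. Here is a sketch in plain C, deliberately not tied to the HDF5 API (the function and array names are made up): build a table of the distinct 64-bit values and store a one-byte index per element; the table and the index array could then be written as the two separate datasets mentioned above.

#include <stdint.h>
#include <stdio.h>

#define N (32 * 32 * 32)

/* Encode data[] as (table, index) such that data[i] == table[index[i]].
   Returns the number of distinct values, or -1 if there are more than 256. */
static int encode(const int64_t *data, size_t n,
                  int64_t table[256], uint8_t *index) {
    int count = 0;
    for (size_t i = 0; i < n; i++) {
        int j = 0;
        while (j < count && table[j] != data[i]) j++; /* linear search is fine:
                                                         only a dozen values   */
        if (j == count) {
            if (count == 256) return -1;              /* too many distinct values */
            table[count++] = data[i];
        }
        index[i] = (uint8_t)j;
    }
    return count;
}

int main(void) {
    static int64_t data[N];
    static uint8_t index[N];
    int64_t table[256];
    for (size_t i = 0; i < N; i++) data[i] = (int64_t)(i % 12) * 1000;
    int count = encode(data, N, table, index);
    /* 32768 x 8 bytes shrinks to 32768 x 1 byte plus a tiny table. */
    printf("distinct values: %d\n", count);
    return 0;
}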
It's possible, though I have no evidence to indicate it, that HDF5's compression facility would compress your integer dataset enough to save a useful amount of space.
Then again, for the HDF5 files I work with (tens of GBs) I wouldn't bother devising my own encoding scheme for the modest savings a 32768-element array of 64-bit numbers might allow. Sure, you could transform a dataset of 2097152 bits into one of 131072, but disk space (even RAM) just isn't that tight these days.
I'm beginning to form the impression that you are trying to use HDF5 on, perhaps, a smartphone :-)

Reading a bit from memory

I'm looking into reading single bits from memory (RAM, hard disk). My understanding was that one cannot read less than a byte.
However, I have read someone claiming it can be done with assembly.
I want the bandwidth usage to be as low as possible, and the data to be retrieved is not sequential, so I cannot simply read a byte and convert it to 8 bits.
I don't think the CPU will read less than the size of a cache line from RAM (64 bytes on recent Intel chips). From disk, the minimum is typically 4 KiB.
Reading a single bit at a time is neither possible nor necessary, since the data bus is much wider than that.
You cannot read less than a byte from any PC or hard disk that I know of. Even if you could, it would be extremely inefficient.
Some machines do memory-mapped port I/O that can read/write less than a byte at the port, but it still shows up as at least a byte by the time you get it.
Use the bitwise operators to pick off specific bits, as in:

#include <stdbool.h>

char someByte = 0x3D;      // In binary, 00111101
bool flag = someByte & 1;  // Get the first bit: 1
flag = someByte & 2;       // Get the second bit: 0
// The number after the & operator is a power of 2 if you want to isolate one bit.
// You can also pick off several bits at once:
int value = someByte & 3;  // Mask the lower 2 bits, assuming those are the interesting ones
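To test an arbitrary bit n rather than a fixed one (a small addition to the snippet above, not from the original answer), shift first and then mask:

bool bit = (someByte >> n) & 1;  // 1 if bit n is set, 0 otherwise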
It used to be, say in 386/486 days, that a memory chip was one bit wide, e.g. 1 meg by 1 bit, but you would have 8 (or some multiple) of those chips, one for each bit lane on the bus, and you could still only read in widths of the bus. Today the memories are a byte or more wide and you can only read in units of 32 or 64 bits, or multiples of those. Even when you read a byte, most designs fetch the whole bus width; it adds unnecessary complication and cost to isolate the bus all the way to the memory. A byte read looks to most of the system like a 32- or 64-bit read; only as it approaches the edge of the processor (sometimes the physical pins, sometimes the edge of the core inside the chip) is the individual byte lane separated out and the other bits discarded. Having the cache enabled changes the smallest divisible read size from the memory: you will see a burst or block of reads.
It is possible to design a memory system that is 8 bits wide and reads 8 bits at a time, but why would you? Unless it is an 8-bit processor, you probably couldn't take advantage of an 8-bit by 2-gig memory. DRAM is pretty slow anyway, something like 133 MHz at the core (even your 1600 MHz memory only hits that in short bursts, as you are reading from slow parts; memory has not gotten faster in over 10 years).
Hard disks are similar but different. I think sectors are the smallest divisible unit; you have to read or write in those units. So when reading, you have a memory cycle on the processor, no different from going to memory, and, depending on the controller, either before you do the read or as a result of it, a sector is read off the disk into a buffer, not unlike a cache-line read. Then your memory cycle to the buffer in the disk controller either causes a bus-width read that the processor divides up, or, if the bus adds the complexity to isolate byte lanes, a byte is isolated, but nobody isolates bit lanes. (I say "nobody" and someone will come back with an exception...)
Most of this is well documented and not hard to find. For ARM platforms look for the AMBA and/or AXI specifications, freely downloadable. The bridge, PCIe controller, and disk controller documents are all available for PCs and other platforms. It still boils down to an address bus and a data bus (or one goes-outta and one goes-inta data bus) and some control signals that indicate the access type. Some buses have byte-lane enables, which are generally for a write, not a read. If I want to write only a byte to a DRAM in a modern 64-bit system, I DO have to tell everyone almost all the way out to the DRAM what I want to write. To write a byte on a memory module which must be accessed 64 bits at a time, at a minimum a 64-bit read happens into a temporary place, either the cache or the memory controller; then the byte to be written modifies the specific byte within the 64-bit word; then that 64-bit quantity, eventually, is written back to the memory module itself. You can do this using a combination of the address bits and a few control signals, or you can just have 8 byte-lane enables and ignore the lower address bits. Hard disk, same deal: you have to read a sector, modify one byte, then eventually write the whole sector back. With flash and EEPROM, you can only write zeros (from the programmer's perspective); you erase to ones (from the programmer's perspective; it is actually a zero in the logic, there is an inversion), and a write has to be a sector at a time; sectors are typically 64, 128, or 256 bytes.
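As a rough illustration of that read-modify-write, here is a small C sketch of merging one byte into a 64-bit word, the way a controller might before writing the whole word back to the module (the function name and values are made up for illustration):

#include <stdint.h>
#include <stdio.h>

/* Merge one byte into a 64-bit word: the "read" is the word argument,
   the "modify" is the mask-and-merge, and the caller "writes" the
   whole 64-bit result back, just as the memory controller must. */
uint64_t write_byte_rmw(uint64_t word, int lane, uint8_t value) {
    uint64_t mask = (uint64_t)0xFF << (lane * 8); /* select the byte lane */
    word &= ~mask;                                /* clear the old byte   */
    word |= (uint64_t)value << (lane * 8);        /* merge in the new one */
    return word;
}

int main(void) {
    uint64_t dram_word = 0x1122334455667788ULL;   /* pretend DRAM contents */
    dram_word = write_byte_rmw(dram_word, 2, 0xAB);
    printf("%016llx\n", (unsigned long long)dram_word); /* 1122334455ab7788 */
    return 0;
}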

Matlab: Free memory is lost after calling a function

I have some trouble with memory management in Matlab, which eventually leads to not enough free memory and an error. I tried to pinpoint the problem and found one interesting "feature": somehow I lose free memory in Matlab.
I do the following:
1) Start Matlab
2) typing "memory" I get: Maximum possible array: 1293 mb, Memory available for all arrays: 1456 mb
3) I'll call a function. The function is rather long, so it's hard to paste it here. But basically it loads 5 ca. 300mb mat files (sequentially), picks some few values and returns them. The returned matrix is ca. 1,2mb (4650x35 double)
4) I clear all variables in workspace ("clear all")
5) typing "memory" I get: Maximum possible array: 759 mb, Memory available for all arrays: 1029 mb
If I repeat steps 3) to 5) the memory numbers are constant.
So what is wrong here? Where do I loose the 400mb of free space? The memory used by Matlab is constant at around 330mb.
Does anyone have some ideas what is wrong here? Or is this something totally natural, but I miss it??
Thanks
Thomas
PS: I use Matlab 2010a and Win 7 Pro 32-bit.
A good part of this "lost" memory is probably due to memory fragmentation. As Matlab allocates and frees arrays over the course of a session, the memory gets broken up into smaller areas, and some is lost to overhead in the memory manager, at both the Matlab and the underlying C levels. The overhead is not counted as "used" by Matlab because it's not being used to hold M-code array values. Some memory may also be consumed by Matlab loading additional M-files and libraries, allocating internal buffers or structures, or by expansion of the Java heap in Matlab's embedded JVM. This is normal. After doing some work, Matlab won't have as much memory available as it did in a fresh session.
AFAIK, once low-level fragmentation occurs, there's nothing you can do to eliminate it aside from restarting Matlab. Allocating lots of small arrays can accelerate fragmentation. This sometimes happens if you use large cellstrs or large arrays of objects. So if you are having problems, you may need to reduce your peak memory usage in the function by breaking the work into smaller chunks, reducing cell usage, and so on. And if you have big cellstr arrays in the MAT files, convert them to char. The "high water mark" of allocation is what governs fragmentation, so if you can break your data set into smaller chunks, you can fit it in less memory.
Inside your function, clear as much as you can from one MAT file before moving on to the next. One way to do this implicitly is to move the per-file processing into a subfunction if it's currently sitting in a loop in your main function.
To help debug, do a "dbstop if all error", which will get triggered by the OOM. From there, you can use whos and the debugger to find out where the space is being taken up when you exhaust memory. That might reveal temp variables that need to be cleaned up, or suggest ways of chunking the work.
If you'd like to experiment to see what fragmentation looks like and how it affects memory()'s output, here's a function that will just create some fragmentation.
function fragmem(nbytes, chunksize)
%FRAGMEM Fragment the Matlab session's memory
if nargin < 2; chunksize = 1*2^10; end
nbytes = nbytes - rem(nbytes, chunksize);
nsteps = 100; % to make initial input relatively small
c = cell([1 nsteps]);
stepsize = nbytes / nsteps;
chunksperstep = ceil(stepsize / chunksize);
fprintf('Fragmenting %d MB memory into %d KB chunks (%d steps of %d chunks)\n',...
    round(nbytes/2^20), round(chunksize/2^10), nsteps, chunksperstep);
x = zeros([1 chunksperstep * chunksize], 'uint8');
colsizes = repmat(chunksize, [1 chunksperstep]);
for i = 1:nsteps
    c{i} = mat2cell(x, 1, colsizes);
end
Fragmenting 300 MB into 1 KB chunks reproduces a "loss" on my win32 machine about the size you're seeing.
>> memory
Maximum possible array: 1384 MB (1.451e+009 bytes) *
Memory available for all arrays: 1552 MB (1.627e+009 bytes) **
Memory used by MATLAB: 235 MB (2.463e+008 bytes)
Physical Memory (RAM): 3311 MB (3.472e+009 bytes)
>> fragmem(300*2^20)
Fragmenting 300 MB memory into 1 KB chunks (100 steps of 3072 chunks)
>> memory
Maximum possible array: 1009 MB (1.059e+009 bytes) *
Memory available for all arrays: 1175 MB (1.232e+009 bytes) **
Memory used by MATLAB: 257 MB (2.691e+008 bytes)
Physical Memory (RAM): 3311 MB (3.472e+009 bytes)
>>
