What is the difference between page file and index file in Essbase? - Hyperion

What is the difference between a .pag file and an .ind file?
I know the page file contains the actual data, that is, the data blocks and cells, and the index file holds pointers to the data blocks stored in the page file.
But is there any other difference, for example regarding size?
In my opinion the page file is always larger than the index file. Is that right?
If the index file turns out to be larger than the page file, what does that mean? Can that happen at all?
If I delete the page file, does that affect the index file?
or
If I delete some data blocks from the page file, how does that affect the index file?

You are correct about the page file including the actual data of the cube (although there is no data without the index, so in effect they are both the data).
Very typically the page files are bigger than the index. The ratio simply depends on the number of dimensions and whether they are sparse or dense, the number of stored members in the dimensions, the density of the data blocks, the compression scheme used in the data blocks, and the number of index entries in the database.
It's not a requirement that one be larger than the other; it will simply depend on how you use the cube. I would advise you not to worry about it unless you run into specific performance problems. At that point it becomes useful to consider whether you should change the configuration of the cube in order to optimize retrieval, calc, or data load time.
If you delete the page file it doesn't affect the index file necessarily, but you would lose all of the data in the cube. You would also lose the data if you just deleted all the index files. While the page files have data in them, as I mentioned, it is truly the combination of the page and index files that make up the data in the cube.
Under the right circumstances you can delete data from the database (such as by doing a CLEARDATA operation) and reduce the size of the page files and/or the index. For example, deleting data such that you clear out some combination of sparse members may reduce the size of the index a bit, as well as any data blocks associated with those index entries (that is, those particular combinations of sparse dimensions). It may be necessary to restructure and compact the cube in order for the size of the files to decrease. In fact, in some cases you can remove data and the size of the stored files could grow.

Related

Control the tie-breaking choice for loading chunks in the scheduler?

I have some large files in a local binary format, which contains many 3D (or 4D) arrays as a series of 2D chunks. The order of the chunks in the files is random (could have chunk 17 of variable A, followed by chunk 6 of variable B, etc.). I don't have control over the file generation, I'm just using the results. Fortunately the files contain a table of contents, so I know where all the chunks are without having to read the entire file.
I have a simple interface to lazily load this data into dask, and re-construct the chunks as Array objects. This works fine - I can slice and dice the array, do calculations on them, and when I finally compute() the final result the chunks get loaded from file appropriately.
However, the order that the chunks are loaded is not optimal for these files. If I understand correctly, for tasks where there is no difference of cost (in terms of # of dependencies?), the local threaded scheduler will use the task keynames as a tie-breaker. This seems to cause the chunks to be loaded in their logical order within the Array. Unfortunately my files do not follow the logical order, so this results in many seeks through the data (e.g. seek halfway through the file to get chunk (0,0,0) of variable A, then go back near the beginning to get chunk (0,0,1) of variable A, etc.). What I would like to do is somehow control the order that these chunks get read, so they follow the order in the file.
I found a kludge that works for simple cases, by creating a callback function on the start_state. It scans through the tasks in the 'ready' state, looking for any references to these data chunks, then re-orders those tasks based on the order of the data on disk. Using this kludge, I was able to speed up my processing by a factor of 3. I'm guessing the OS is doing some kind of read-ahead when the file is being read sequentially, and the chunks are small enough that several get picked up in a single disk read. This kludge is sufficient for my current usage, however, it's ugly and brittle. It will probably work against dask's optimization algorithm for complex calculations. Is there a better way in dask to control which tasks win in a tie-breaker, in particular for loading chunks from disk? I.e., is there a way to tell dask, "all things being equal, here's the relative order I'd like you to process this group of chunks?"
Your assessment is correct. As of 2018-06-16 there is currently no way to add in a final tie-breaker. In the distributed scheduler (which works fine on a single machine) you can provide explicit priorities with the priority= keyword, but these take precedence over all other considerations.
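For reference, here is a minimal sketch of the priority= keyword with the distributed scheduler (assuming dask.distributed is installed; the array below is just a stand-in for the lazily loaded file chunks). Note that, as said above, these priorities are absolute rather than tie-breakers, so they override the scheduler's own ordering heuristics:

# Sketch only: explicit priorities with the distributed scheduler.
from dask.distributed import Client
import dask.array as da

client = Client(processes=False)                # distributed scheduler on a single machine

x = da.ones((1000, 1000), chunks=(100, 100))
future = client.compute(x.sum(), priority=10)   # larger number = scheduled earlier
print(future.result())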

How sqlite3 write capacity is calculated

I create a test table:
create table if not exists `HKDevice` (primaryID integer PRIMARY KEY AUTOINCREMENT,mac integer)
Then I insert one row:
NSString *sql = @"insert into `HKDevice` (mac) values('0')";
int result = sqlite3_exec(_db, sql.UTF8String, NULL, NULL, &errorMesg);
The disk report shows 48 KB written.
That is much bigger than I expected. I know an integer is 4 bytes in SQLite, so I thought the total write should be less than 10 bytes.
The second write is also close to the size of the first, which confuses me even more.
Can someone tell me why? Thanks!
Writing every row individually to the file would be inefficient for larger operations, so the database always reads and writes entire pages.
Your INSERT command needs to modify at least three pages: the table data, the system table that contains the AUTOINCREMENT counter, and the database change counter in the database header.
To make the changes atomic even in the case of interruptions, the database needs to save the old data of all changed pages in the rollback journal. So that's six pages overall.
If you do not use an explicit transaction around both commands, every command is automatically wrapped into an automatic transaction, so you get these writes for both commands. That's twelve pages overall.
With the default page size of 4 KB, those twelve pages add up to the 48 KB of writes you've seen.
(Apparently, the writes for the file system metadata are not shown in the report.)
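As an illustration of the transaction point above, here is a rough sketch using Python's sqlite3 module instead of the C API (the table mirrors the one above): wrapping many INSERTs in one explicit transaction means the modified pages and the rollback journal are written once per transaction rather than once per statement.

import sqlite3

con = sqlite3.connect("test.db")
con.execute("create table if not exists HKDevice "
            "(primaryID integer PRIMARY KEY AUTOINCREMENT, mac integer)")

# One explicit transaction around all the inserts: the changed pages and the
# rollback journal are flushed once at COMMIT, not once per row.
with con:
    for mac in range(1000):
        con.execute("insert into HKDevice (mac) values (?)", (mac,))

con.close()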
There will be some metadata overhead involved. Firstly, the database file will be created, and it must hold the schema for the tables, indexes, sequences, custom functions, etc. These all contribute to the disk space usage.
As an example, on Linux, adding the database table that you define above to a new database results in a file of size 12288 bytes (12 KB). If you add more tables, the space requirements will increase, as they do when you add data to the tables.
Furthermore, for efficiency reasons, databases typically write data to disk in "pages", i.e. a page (or block) of space is allocated and then written to. The page size is selected to optimise I/O, for example 4096 bytes. So writing a single integer might require 4 KB on disk if a new page is required. However, you will notice that writing a second integer will not consume any more space, because there is sufficient room in the existing page. Once the page becomes full, a new page will be allocated and the disk size could increase.
The above is a bit of a simplification. The strategies for database page allocation and management are a complicated subject. I'm sure that a search would find many resources with detailed information for SQLite.
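To see the page-granular growth for yourself, a quick hypothetical check with Python's sqlite3 module (the file name is made up):

import os
import sqlite3

con = sqlite3.connect("pages.db")
con.execute("create table if not exists HKDevice "
            "(primaryID integer PRIMARY KEY AUTOINCREMENT, mac integer)")
con.commit()

page_size = con.execute("PRAGMA page_size").fetchone()[0]
page_count = con.execute("PRAGMA page_count").fetchone()[0]
# The file size is page_size * page_count: the database grows a whole page at a time.
print(page_size, page_count, os.path.getsize("pages.db"))
con.close()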

Sorting 20GB of data

In the past I had to work with big files, somewhere in the 0.1-3 GB range. Not all the 'columns' were needed, so it was OK to fit the remaining data in RAM.
Now I have to work with files in the 1-20 GB range, and they will probably grow as time passes. That is totally different, because you cannot fit the data in RAM anymore.
My file contains several million 'entries' (I have found one with 30 million entries). One entry consists of about 10 'columns': one string (50-1000 Unicode chars) and several numbers. I have to sort the data by 'column' and show it. For the user only the top entries (1-30%) are relevant; the rest is low-quality data.
So, I need some suggestions about which direction to head in. I definitely don't want to put the data in a DB, because databases are hard to install and configure for non-computer-savvy persons. I like to deliver a monolithic program.
Showing the data is not difficult at all. But sorting... without loading the data in RAM, on regular PCs (2-6GB RAM)... will kill some good hours.
I was looking a bit into MMF (memory mapped files) but this article from Danny Thorpe shows that it may not be suitable: http://dannythorpe.com/2004/03/19/the-hidden-costs-of-memory-mapped-files/
So, I was thinking about loading only the data from the column that has to be sorted in ram AND a pointer to the address (into the disk file) of the 'entry'. I sort the 'column' then I use the pointer to find the entry corresponding to each column cell and restore the entry. The 'restoration' will be written directly to disk so no additional RAM will be required.
PS: I am looking for a solution that will work both on Lazarus and Delphi because Lazarus (actually FPC) has 64 bit support for Mac. 64 bit means more RAM available = faster sorting.
I think the way to go is merge sort; it's a great algorithm for sorting a large number of fixed-size records with limited memory.
General idea (a rough sketch in code follows the steps below):
read N lines from the input file (a value that allows you to keep the lines in memory)
sort these lines and write the sorted lines to file 1
repeat with the next N lines to obtain file 2
...
you reach the end of the input file and you now have M files (each of which is sorted)
merge these files into a single file (you'll have to do this in steps as well)
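A rough sketch of those steps in Python, assuming one record per line and that plain lexical ordering of the lines is what you want (heapq.merge does the k-way merge lazily, holding only one line per chunk file in memory at a time):

import heapq
import os
import tempfile

def external_sort(in_path, out_path, lines_per_chunk=1_000_000):
    chunk_paths = []

    # Pass 1: sort fixed-size chunks in memory and write each to its own temp file.
    with open(in_path, "r", encoding="utf-8") as src:
        while True:
            lines = [line for _, line in zip(range(lines_per_chunk), src)]
            if not lines:
                break
            lines.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w", encoding="utf-8") as chunk:
                chunk.writelines(lines)
            chunk_paths.append(path)

    # Pass 2: k-way merge of the sorted chunk files (with very many chunks you
    # would merge in several passes, as noted above, to avoid running out of
    # file handles).
    files = [open(p, "r", encoding="utf-8") for p in chunk_paths]
    try:
        with open(out_path, "w", encoding="utf-8") as dst:
            dst.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)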
You could also consider a solution based on an embedded database, e.g. Firebird Embedded: it works well with Delphi/Windows and you only have to ship some DLLs in your program folder (I'm not sure about Lazarus/OSX).
If you only need a fraction of the whole data, scan the file sequentially and keep only the entries needed for display. For instance, let's say you need only 300 entries out of 1 million. Read the first 300 entries in the file and sort them in memory. Then, for each remaining entry, check whether it is lower than the lowest one in memory and skip it if so. If it is higher than the lowest entry in memory, insert it into the correct place inside the 300 and throw away the lowest; the second lowest becomes the new lowest. Repeat until the end of the file.
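Here is a small sketch of that idea, using a bounded min-heap instead of a sorted array (assumptions: one record per line, the sort key is the first tab-separated field parsed as a number, and 'top' means largest):

import heapq

def top_n(path, n=300):
    # Single sequential pass; keeps only the n best entries in memory.
    best = []  # min-heap of (key, line); the worst entry kept so far sits at best[0]
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            key = float(line.split("\t", 1)[0])
            if len(best) < n:
                heapq.heappush(best, (key, line))
            elif key > best[0][0]:            # better than the worst entry kept
                heapq.heapreplace(best, (key, line))
    return sorted(best, reverse=True)         # best entries first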
Really, there is no sorting algorithm that can make moving 30 GB of randomly ordered data fast.
If you need to sort in multiple ways, the trick is not to move the data itself at all, but instead to create an index for each column that you need to sort by.
I do it like that with files that are also tens of gigabytes long, and users can sort, scroll and search the data without noticing that it's a huge dataset they're working with.
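For illustration, a minimal sketch of such a per-column index in Python (assumptions: one record per line, and the column to sort by is the third tab-separated field). Only small (key, offset) pairs get sorted; the records themselves never move:

def build_index(path, column=2):
    # Returns a list of (key, byte_offset) pairs sorted by key.
    # Keys compare as raw bytes here; parse them (e.g. with float()) for numeric order.
    index = []
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            key = line.split(b"\t")[column]
            index.append((key, offset))
    index.sort()
    return index

def read_record(path, offset):
    # Fetch one record on demand, e.g. while the user scrolls.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline()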
Please find here a class which sorts a file using a slightly optimized merge sort. I wrote it a couple of years ago for fun. It uses a skip list for sorting files in memory.
Edit: The forum is German and you have to register (for free). It's safe, but requires a bit of German knowledge.
If you cannot fit the data into main memory then you are into the realms of external sorting. Typically this involves external merge sort. Sort smaller chunks of the data in memory, one by one, and write back to disk. And then merge these chunks.

Why is a process's address space divided into four segments (text, data, stack and heap)?

Why does a process's address space have to be divided into four segments (text, data, stack and heap)? What is the advantage? Is it possible to have just one big segment?
There are multiple reasons for splitting programs into parts in memory.
One of them is that instruction and data memories can be architecturally distinct and discontiguous, that is, read and written from/to using different instructions and circuitry inside and outside of the CPU, forming two different address spaces (i.e. reading code from address 0 and reading data from address 0 will typically return two different values, from different memories).
Another is reliability/security. You rarely want the program's code and constant data to change. Most of the time when that happens, it happens because something is wrong (either in the program itself or in its inputs, which may be maliciously constructed). You want to prevent that from happening and know if there are any attempts. Likewise you don't want the data areas that can change to be executable. If they are and there are security bugs in the program, the program can be easily forced to do something harmful when malicious code makes it into the program data areas as data and triggers those security bugs (e.g. buffer overflows).
Yet another is storage... In many programs a number of data areas aren't initialized at all or are initialized to one common predefined value (often 0). Memory has to be reserved for these data areas when the program is loaded and is about to start, but these areas don't need to be stored on the disk, because there's no meaningful data there.
On some systems you may have everything in one place (section/segment/etc.). One notable example here is MSDOS, where .COM-style programs have no structure other than that they have to be less than about 64KB in size and the first executable instruction must appear at the very beginning of the file and assume that its location corresponds to IP=0x100 (where IP is the instruction pointer register). How code and data are placed and interleaved in a .COM program is unimportant and up to the programmer.
There are other architectural artifacts such as x86 segments. Again, MSDOS is a good example of an OS that deals with them. .EXE-style programs in it may have multiple segments in them that correspond directly to the x86 CPU segments, to the real-mode addressing scheme, in which memory is viewed through 64KB-long "windows" known as segments. The position of these windows/segments is relative to the value of the CPU's segment registers. By altering the segment register values you can move the "windows". In order to access more than 64KB one needs to use different segment register values and that often implies having multiple segments in the .EXE (can be not just one segment for code and one for data, but also multiple segments for either of them).
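For reference, the position of such a real-mode window is just segment * 16 + offset, so changing a segment register slides the 64KB window around in 16-byte steps. A tiny illustration (Python used purely as a calculator here):

def linear_address(segment, offset):
    # Real-mode x86: 20-bit linear address = segment * 16 + offset.
    return (segment << 4) + offset

print(hex(linear_address(0x1234, 0x0100)))  # 0x12440
print(hex(linear_address(0x1244, 0x0000)))  # 0x12440 -- a different window, same byte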
At least the text and data segments are separated to prevent malicious code that's stored inside a variable from being run.
Instructions (compiled code) are stored in the text segment, while the contents of your variables are stored in a data segment, the latter of which never gets executed, only read from and written to.
A little more info here.
Isn't this distinction just a big, hacky workaround for patching security into the von Neumann architecture, where data and instructions share the same memory?

Organizing thousands of images on a server

I'm developing a website which might grow up to a few thousand users, all of which would upload up to ten pictures on the server.
I'm wondering what would be the best way of storing pictures.
Let's assume that I have 5000 users with 10 pictures each, which gives us 50,000 pics. (I guess it wouldn't be a good idea to store them in the database as blobs ;) )
Would it be a good way to dynamically create directories for every 100 users registered, (50 dirs in total, assuming 5000 users), and upload their pictures there? Would naming convention 'xxx_yy.jpg' (xxx being user id and yy picture number) be ok?
In this case, however, there would be 1000 (100x10) pictures in one folder. Isn't that too many?
I would most likely store the images by a hash of their contents, for instance a 128-bit MD5 or, better, a SHA-256. So I'd rename a user's uploaded image 'foo.jpg' to its content hash (probably base64-encoded, for uniform short names) and then store the user's name for the file and its hash in a database. I'd probably also add a reference count. Then if some folks all upload the same image, it only gets stored once, and you can delete it when all references vanish.
As for actual physical storage, now that you have a guaranteed uniform naming scheme, you can use your file system as a balanced tree. You can either decide how many files maximum you want in a directory, and have a balancer move files to maintain this, or you can imagine what a fully populated tree would look like, and store your files that way.
The only real drawback to this scheme is that it decouples file names from contents so a database loss can mean not knowing what any file is called, but you should be careful to back up that kind of information anyway.
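A rough sketch of this scheme in Python, using SHA-256 and a directory fan-out (the store location and layout are illustrative, not prescriptive):

import hashlib
import os

STORE_ROOT = "image_store"   # illustrative location

def store_image(data):
    # Save the image under its content hash and return the hash; the database
    # would map the user's original filename to this hash plus a reference count.
    digest = hashlib.sha256(data).hexdigest()
    directory = os.path.join(STORE_ROOT, digest[:2], digest[2:4])  # keeps directories small
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, digest)
    if not os.path.exists(path):              # identical uploads are stored only once
        with open(path, "wb") as f:
            f.write(data)
    return digest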
Different filesystems perform differently with directories holding large numbers of files. Some slow down tremendously; some don't mind at all. For example, IBM JFS2 stores the contents of directory inodes as a B+ tree sorted by filename, so it probably provides O(log n) access time even for very large directories.
Getting ls or dir to read, sort, fetch size/date info, and print entries to stdout is a completely different task from accessing the file contents given the filename, so don't let the inability of ls to list a huge directory guide you.
Whatever you do, don't optimize too early. Just make sure your file access mechanism can be abstracted (make a FileStorage that you call .getfile(id) on, or something similar).
That way you can put in whatever directory structure you like, or for example if you find it's better to store these items as a BLOB column in a database, you have that option...
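For instance, a minimal sketch of such an abstraction (all names here are made up); as long as callers only go through put/get, the directory layout, or even a later move to database BLOBs, stays an internal detail:

import os

class FileStorage:
    # Hides where and how files live so the storage layout can change later.
    def __init__(self, root):
        self.root = root

    def _path(self, file_id):
        # Current strategy: fan out on the first characters of the id to keep
        # any single directory small.
        return os.path.join(self.root, file_id[:2], file_id)

    def put(self, file_id, data):
        path = self._path(file_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

    def get(self, file_id):
        with open(self._path(file_id), "rb") as f:
            return f.read()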
Granted, I have never stored 50,000 images, but I usually just store all images in the same directory and name them in a way that avoids conflicts, then store the reference in the DB.
$ext = pathinfo( $filename, PATHINFO_EXTENSION );
$newName = md5( microtime() ) . '.' . $ext;
That way you never get two identical filenames, since microtime will practically never return the same value twice.
