How sqlite3 write capacity is calculated - ios

I created a test table:
create table if not exists `HKDevice` (primaryID integer PRIMARY KEY AUTOINCREMENT,mac integer)
Then I insert one row:
NSString *sql = @"insert into `HKDevice` (mac)values('0')";
int result = sqlite3_exec(_db, sql.UTF8String, NULL, NULL, &errorMesg);
The disk report shows 48 KB written.
This is much bigger than I expected. I know an integer takes 4 bytes in SQLite, so I thought the total written should be less than 10 bytes.
The second write is also close to the size of the first, so I'm even more confused.
Can someone tell me why? Thanks!

Writing every row individually to the file would be inefficient for larger operations, so the database always reads and writes entire pages.
Your INSERT command needs to modify at least three pages: the table data, the system table that contains the AUTOINCREMENT counter, and the database change counter in the database header.
To make the changes atomic even in the case of interruptions, the database needs to save the old data of all changed pages in the rollback journal. So that's six pages overall.
If you do not use an explicit transaction around both commands, every command is automatically wrapped into an automatic transaction, so you get these writes for both commands. That's twelve pages overall.
With the default page size of 4 KB, this is the amount of writes you've seen.
(Apparently, the writes for the file system metadata are not shown in the report.)
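The arithmetic in this answer can be written out as a small sketch. The function name and parameters are made up for illustration; the figures (three changed pages, one journal copy of each, two commands, 4 KB pages) are the ones given above.

```python
def estimated_write_bytes(pages_changed, commands, page_size=4096):
    """Rough estimate of bytes written, following the reasoning above."""
    # Each changed page is written twice: once as the saved old contents
    # in the rollback journal and once as the new contents.
    pages_per_command = pages_changed * 2
    return pages_per_command * commands * page_size


total = estimated_write_bytes(pages_changed=3, commands=2)
print(total, "bytes =", total // 1024, "KB")  # 49152 bytes = 48 KB
```

Wrapping both commands in one explicit transaction would correspond to `commands=1` in this model.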

There will be some metadata overhead involved. Firstly the database file will be created which must maintain the schema for the tables, indexes, sequences, custom functions etc. These all contribute to the disk space usage.
As an example, on Linux adding the database table that you define above to a new database results in a file of size 12288 bytes (12 KB). If you add more tables, the space requirements will increase, as they do when you add data to the tables.
Furthermore, for efficiency reasons, databases typically write data to disk in "pages", i.e. a page (or block) of space is allocated and then written to. The page size will be selected to optimise I/O, for example 4096 bytes. So writing a single integer might require 4 KB on disk if a new page is required. However, you will notice that writing a second integer will not consume any more space because there is sufficient space available in the existing page. Once the page becomes full a new page will be allocated, and the disk size could increase.
The above is a bit of a simplification. The strategies for database page allocation and management are a complicated subject. I'm sure that a search would find many resources with detailed information for SQLite.
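A quick way to see the page granularity is to ask SQLite for its page size and page count and compare them with the file size. This is a minimal sketch using Python's standard sqlite3 module rather than the iOS C API from the question; the helper names and the temporary file are assumptions for illustration.

```python
import os
import sqlite3
import tempfile


def file_stats(path):
    """Return (page_size, page_count, file_size_in_bytes) for a SQLite file."""
    conn = sqlite3.connect(path)
    try:
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    finally:
        conn.close()
    return page_size, page_count, os.path.getsize(path)


def run_demo():
    """Create the table, insert two rows, and return stats before and after."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    conn = sqlite3.connect(path)
    conn.execute(
        "create table if not exists HKDevice "
        "(primaryID integer PRIMARY KEY AUTOINCREMENT, mac integer)"
    )
    conn.commit()
    conn.close()
    before = file_stats(path)

    conn = sqlite3.connect(path)
    for _ in range(2):
        conn.execute("insert into HKDevice (mac) values (0)")
        conn.commit()  # one transaction per insert, as in the question
    conn.close()
    return before, file_stats(path)


before, after = run_demo()
print("after CREATE TABLE:", before)
print("after two INSERTs: ", after)
```

The file size should come out as page size times page count, and the two small rows should fit into pages that already exist, so the page count should not change.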

Related

How to size memory required for TimesTen In-memory Database?

How do I figure out the right values for the memory parameters in TimesTen? How much memory do I need based on my tables and data?
A TimesTen database consists of two shared memory segments; one is small and is used exclusively by PL/SQL while the other is the main database segment which contains your data (tables, indexes etc.), temporary working space, the transaction log buffer and some space used by the system.
Attributes in the DSN definition set the size for these areas as follows:
PLSQL_MEMORY_SIZE - sets the size of the PL/SQL segment (default is 128 MB). If you do not plan to ever use PL/SQL then you can reduce this to 32 MB. If you plan to make very heavy use of PL/SQL then you may need to increase this value.
LogBufMB - sets the size of the transaction log buffer. The default is 64 MB but this is too small for most production databases. A read-mostly workload may be able to get by with a value of 256 MB but workloads involving a lot of database writes will typically need 1024 MB and in extreme cases maybe as much as 16384 MB. When setting this value you should also take into account the setting (or default) for the LogBufParallelism attribute.
PermSize - sets the size for the permanent (persistent) database storage. This needs to be large enough to hold all of your table data, indexes, system metadata etc. and usually some allowance for growth, contingency etc.
TempSize - sets the value for the temporary memory region. This region is used for database locks, materialised tables, temporary indexes, sorting etc. and is not persisted to disk.
The total size of the main database shared memory segment is given by PermSize + TempSize + LogBufMB + SystemOverhead. The value for SystemOverhead varies from release to release but if you allow 64 MB then this is generally sufficient.
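The sizing formula can be turned into a small calculator. This is only a sketch of the arithmetic above: the function name is invented, the 64 MB overhead is the allowance suggested in this answer, and the example values are made up.

```python
def main_segment_mb(perm_size_mb, temp_size_mb, log_buf_mb, system_overhead_mb=64):
    """PermSize + TempSize + LogBufMB + SystemOverhead, all in MB."""
    return perm_size_mb + temp_size_mb + log_buf_mb + system_overhead_mb


# Hypothetical values: 4 GB of data, 512 MB temp space, 1 GB log buffer.
main_mb = main_segment_mb(perm_size_mb=4096, temp_size_mb=512, log_buf_mb=1024)
plsql_mb = 128  # default PLSQL_MEMORY_SIZE, a separate segment
print("main segment:", main_mb, "MB")            # main segment: 5696 MB
print("both segments:", main_mb + plsql_mb, "MB")  # both segments: 5824 MB
```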
Documentation on database attributes can be found here: https://docs.oracle.com/database/timesten-18.1/TTREF/attribute.htm#TTREF114
You can estimate the memory needed for your tables and associated indexes using the TimesTen ttSize utility https://docs.oracle.com/database/timesten-18.1/TTREF/util.htm#TTREF369

Does setting ActiveRecord's text limit save hard drive space?

@pozs... this is NOT a duplicate of the one you indicated. That was the first place I looked. I couldn't care less about the difference between text and varchar. I'm asking about the physical space used on the medium, i.e. the server's hard drive.
I know that hard drives are split into blocks of bytes (chunks), and that if less than the full block is used, the remainder is wasted space. What I'm curious about is that the text option itself uses a certain amount of storage. Can the space used be reduced, rather than just limiting the quantity of input? I could say text limit => 1, and it may still use thousands upon thousands of bytes... this is what I'm asking about.
This is a photo of hard drive blocks. This is how I imagine ActiveRecord text type space used
Here's the wiki on Blocks(data storage) http://en.wikipedia.org/wiki/Block_(data_storage) As you can see they say "Block storage is normally abstracted by a file system or database management system (DBMS)" What they do NOT say is HOW it is abstracted.
According to Igor's blog he says "To my surprise, they determined that the average I/O size of our Postgres databases was much higher that 8KB block size, and up to 1MB." http://igorsf.wordpress.com/2010/11/01/things-to-check-when-configuring-new-postgres-storage-for-high-performance-and-availability/ While this is helpful to know it doesn't tell me the default behaviour between ActiveRecord and PostgreSQL in handling blocks.
According to concernedoftunbridgewells "The database will allocate space in a table or index in some given block size. In the case of Postgres this is 8K". https://dba.stackexchange.com/questions/15510/understanding-block-sizes/15514#15514?newreg=fc10593601be479b8ed697d1bbd108ed So if 8K is used as a block, then how high or low do I set the text type limit to match and fit within the one 8K block, because it may use more than just one block.
I know that PostgreSQL block size setting can be changed. So I would like clarity on "how ActiveRecord PostgreSQL block size handling currently works". I will accept a good answer for that.
A page contains more than one item (assuming there is space obviously). If your row is less than 8k, then other rows will be stored on the same page with it (I simplify slightly - postgres stores large columns separately anyway). Limiting the max length of a column doesn't interact with this.
My reading of the details on character types is that strings under 126 characters incur 3 bytes less overhead, but this happens on a row by row basis, independently of what the maximum length is.
The postgresql docs have details on the exact on disk format and how postgresql deals with large columns.
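To make the point concrete, here is a rough model of the per-value cost described above. It follows the PostgreSQL documentation's rule that a short string (up to 126 bytes) carries 1 byte of overhead and a longer one 4 bytes; it deliberately ignores compression and out-of-line storage of large values, and the function name is invented. Note that the declared limit is not an input at all.

```python
def text_value_bytes(value):
    """Approximate on-disk bytes for one text value (no compression, no TOAST)."""
    payload = len(value.encode("utf-8"))
    overhead = 1 if payload <= 126 else 4  # short strings use a 1-byte header
    return payload + overhead


print(text_value_bytes("a" * 10))   # 11
print(text_value_bytes("a" * 200))  # 204
```

Under this model two columns with different limits but identical contents cost the same; only the stored content changes the result.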
IMO, the size taken by a text column depends mainly on the content stored in it. The limit setting in ActiveRecord just performs a validation before the content is saved to the column, and it has no impact on the actual storage.

Sorting 20GB of data

In the past I had to work with big files, somewhere about in the 0.1-3GB range. Not all the 'columns' were needed so it was ok to fit the remaining data in RAM.
Now I have to work with files in 1-20GB range, and they will probably grow as the time will pass. That is totally different because you cannot fit the data in RAM anymore.
My file contains several million 'entries' (I have found one with 30 million entries). One entry consists of about 10 'columns': one string (50-1000 unicode chars) and several numbers. I have to sort the data by 'column' and show it. For the user only the top entries (1-30%) are relevant; the rest is low quality data.
So, I need some suggestions about which direction to head in. I definitely don't want to put the data in a DB because databases are hard to install and configure for people who are not computer savvy. I would like to deliver a monolithic program.
Showing the data is not difficult at all. But sorting... without loading the data in RAM, on regular PCs (2-6GB RAM)... will kill some good hours.
I was looking a bit into MMF (memory mapped files) but this article from Danny Thorpe shows that it may not be suitable: http://dannythorpe.com/2004/03/19/the-hidden-costs-of-memory-mapped-files/
So, I was thinking about loading only the data from the column that has to be sorted in ram AND a pointer to the address (into the disk file) of the 'entry'. I sort the 'column' then I use the pointer to find the entry corresponding to each column cell and restore the entry. The 'restoration' will be written directly to disk so no additional RAM will be required.
PS: I am looking for a solution that will work both on Lazarus and Delphi because Lazarus (actually FPC) has 64 bit support for Mac. 64 bit means more RAM available = faster sorting.
I think a way to go is Mergesort, it's a great algorithm for sorting a large amount of fixed records with limited memory.
General idea:
read N lines from the input file (a value that allows you to keep the lines in memory)
sort these lines and write the sorted lines to file 1
repeat with the next N lines to obtain file 2
...
you reach the end of the input file and you now have M files (each of which is sorted)
merge these files into a single file (you'll have to do this in steps as well)
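The steps above can be sketched in Python (the question targets Delphi/Lazarus, so treat this as an illustration of the idea only). It assumes one record per line and plain string ordering, and it merges all M chunk files in a single pass, which only works while M stays below the limit on open files; a larger M would need the stepwise merge mentioned above. The function name is invented.

```python
import heapq
import itertools
import os
import tempfile


def external_sort(input_path, output_path, lines_per_chunk):
    """Sort the lines of input_path into output_path via sorted chunk files."""
    chunk_paths = []
    with open(input_path, "r", encoding="utf-8") as src:
        while True:
            # Read at most N lines, so only one chunk is in memory at a time.
            chunk = list(itertools.islice(src, lines_per_chunk))
            if not chunk:
                break
            # Make sure every line ends with a newline before sorting.
            chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
            chunk.sort()
            fd, path = tempfile.mkstemp(suffix=".chunk")
            with os.fdopen(fd, "w", encoding="utf-8") as out:
                out.writelines(chunk)
            chunk_paths.append(path)

    files = [open(p, "r", encoding="utf-8") for p in chunk_paths]
    try:
        with open(output_path, "w", encoding="utf-8") as out:
            # heapq.merge reads the sorted files lazily, line by line.
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)
```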
You could also consider a solution based on an embedded database, e.g. Firebird embedded: it works well with Delphi/Windows and you only have to add some DLL in your program folder (I'm not sure about Lazarus/OSX).
If you only need a fraction of the whole data, scan the file sequentially and keep only the entries needed for display. For instance, let's say you need only 300 entries from 1 million. Scan the first 300 entries in the file and sort them in memory. Then for each remaining entry check if it is lower than the lowest in memory and, if so, skip it. If it is higher than the lowest entry in memory, insert it into the correct place inside the 300 and throw away the lowest. This will make the second lowest the lowest. Repeat until end of file.
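A minimal sketch of this single-pass selection follows. It swaps the sorted in-memory list described above for a min-heap, which keeps the same "compare with the lowest kept entry" logic but makes each replacement cheaper; the function name and the `key` parameter are assumptions for illustration.

```python
import heapq


def top_entries(entries, k, key=lambda entry: entry):
    """Return the k highest entries, best first, keeping only k in memory."""
    if k <= 0:
        return []
    heap = []  # min-heap of (key, sequence, entry); heap[0] is the lowest kept
    for seq, entry in enumerate(entries):
        item = (key(entry), seq, entry)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            # Better than the lowest kept entry: replace it.
            heapq.heapreplace(heap, item)
    return [entry for _, _, entry in sorted(heap, reverse=True)]


print(top_entries([5, 1, 9, 3, 7], 3))  # [9, 7, 5]
```

`entries` can be any iterator, for example a generator that parses the file line by line, so the whole file never has to be loaded.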
Really, there are no sorting algorithms that can make moving 30gb of randomly sorted data fast.
If you need to sort in multiple ways, the trick is not to move the data itself at all, but instead to create an index for each column that you need to sort.
I do it like that with files that are also tens of gigabytes long, and users can sort, scroll and search the data without noticing that it's a huge dataset they're working with.
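One way to picture the index idea, which is also what the question proposes with its "column plus pointer" plan: keep only (key, file offset) pairs in memory, sort those, and fetch each record from disk when it is needed. This sketch assumes one record per line; `key_of` is a hypothetical caller-supplied function that extracts the sort column from a raw line.

```python
def build_index(path, key_of):
    """Return a sorted list of (key, byte_offset) pairs, one per line."""
    index = []
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            index.append((key_of(line), offset))
    index.sort()
    return index


def read_in_order(path, index):
    """Yield the raw lines of the file in index order without moving the data."""
    with open(path, "rb") as f:
        for _, offset in index:
            f.seek(offset)
            yield f.readline()
```

A separate index can be built per sortable column, and a viewer only needs to read the records that are currently on screen.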
Please find here a class which sorts a file using a slightly optimized merge sort. I wrote that a couple of years ago for fun. It uses a skip list for sorting files in-memory.
Edit: The forum is German and you have to register (for free). It's safe but requires a bit of German knowledge.
If you cannot fit the data into main memory then you are into the realms of external sorting. Typically this involves external merge sort. Sort smaller chunks of the data in memory, one by one, and write back to disk. And then merge these chunks.

What is the difference between page file and index file in Essbase?

What is the difference between a .pag file and an .ind file?
I know the page file contains the actual data, meaning data blocks and cells, and the index file holds the pointers to the data blocks in the page file.
But is there any other difference, for example regarding size?
In my opinion the page file is always larger than the index file. Is that right?
What happens if the index file is larger than the page file? Is that valid?
If I delete the page file, does that affect the index file?
Or, if I delete some data blocks from the page file, how does that affect the index file?
You are correct about the page file including the actual data of the cube (although there is no data without the index, so in effect they are both the data).
Very typically the page files are bigger than the index. It's simply based on the number of dimensions and whether they are sparse or dense, the number of stored members in the dimensions, the density of the data blocks, the compression scheme used in the data blocks, and the number of index entries in the database.
It's not a requirement that one be larger than the other; it will simply depend on how you use the cube. I would advise you not to worry about it unless you run into specific performance problems. At that point it is useful to consider, for the purposes of optimizing retrieval, calc, or data load time, whether you should make a change to the configuration of the cube.
If you delete the page file it doesn't affect the index file necessarily, but you would lose all of the data in the cube. You would also lose the data if you just deleted all the index files. While the page files have data in them, as I mentioned, it is truly the combination of the page and index files that make up the data in the cube.
Under the right circumstances you can delete data from the database (such as doing a CLEARDATA operation) and you can reduce the size of the page files and/or the index. For example, deleting data such that you are clearing out some combination of sparse members may reduce the size of the index a bit as well as any data blocks associated with those index entries (that is, those particular combinations of sparse dimensions). It may be necessary to restructure and compact the cube in order for the size of the files to decrease. In fact, in some cases you can remove data and the size of the store files could grow.

TStringList, Dynamic Array or Linked List in Delphi?

I have a choice.
I have a number of already ordered strings that I need to store and access. It looks like I can choose between using:
A TStringList
A Dynamic Array of strings, and
A Linked List of strings (singly linked)
and Alan in his comment suggested I also add to the choices:
TList<string>
In what circumstances is each of these better than the others?
Which is best for small lists (under 10 items)?
Which is best for large lists (over 1000 items)?
Which is best for huge lists (over 1,000,000 items)?
Which is best to minimize memory use?
Which is best to minimize loading time to add extra items on the end?
Which is best to minimize access time for accessing the entire list from first to last?
On this basis (or any others), which data structure would be preferable?
For reference, I am using Delphi 2009.
Dimitry in a comment said:
Describe your task and data access pattern, then it will be possible to give you an exact answer
Okay. I've got a genealogy program with lots of data.
For each person I have a number of events and attributes. I am storing them as short text strings but there are many of them for each person, ranging from 0 to a few hundred. And I've got thousands of people. I don't need random access to them. I only need them associated as a number of strings in a known order attached to each person. This is my case of thousands of "small lists". They take time to load and use memory, and take time to access if I need them all (e.g. to export the entire generated report).
Then I have a few larger lists, e.g. all the names of the sections of my "virtual" treeview, which can have hundreds of thousands of names. Again I only need a list that I can access by index. These are stored separately from the treeview for efficiency, and the treeview retrieves them only as needed. This takes a while to load and is very expensive memory-wise for my program. But I don't have to worry about access time, because only a few are accessed at a time.
Hopefully this gives you an idea of what I'm trying to accomplish.
p.s. I've posted a lot of questions about optimizing Delphi here at StackOverflow. My program reads 25 MB files with 100,000 people and creates data structures and a report and treeview for them in 8 seconds but uses 175 MB of RAM to do so. I'm working to reduce that because I'm aiming to load files with several million people in 32-bit Windows.
I've just found some excellent suggestions for optimizing a TList at this StackOverflow question:
Is there a faster TList implementation?
Unless you have special needs, a TStringList is hard to beat because it provides the TStrings interface that many components can use directly. With TStringList.Sorted := True, binary search will be used which means that search will be very quick. You also get object mapping for free, each item can also be associated with a pointer, and you get all the existing methods for marshalling, stream interfaces, comma-text, delimited-text, and so on.
On the other hand, for special needs purposes, if you need to do many inserts and deletions, then something more approaching a linked list would be better. But then search becomes slower, and it is a rare collection of strings indeed that never needs searching. In such situations, some type of hash is often used where a hash is created out of, say, the first 2 bytes of a string (preallocate an array with length 65536, and the first 2 bytes of a string is converted directly into a hash index within that range), and then at that hash location, a linked list is stored with each item key consisting of the remaining bytes in the strings (to save space---the hash index already contains the first two bytes). Then, the initial hash lookup is O(1), and the subsequent insertions and deletions are linked-list-fast. This is a trade-off that can be manipulated, and the levers should be clear.
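The two-byte hash scheme described above can be sketched as follows. This is an illustration, not a tuned implementation: the class name is invented, keys are assumed to be byte strings of at least two bytes, and a plain Python list stands in for the linked list at each hash location.

```python
class PrefixHashSet:
    """String set with 65536 buckets chosen by the first two bytes of a key."""

    def __init__(self):
        self._buckets = [None] * 65536  # preallocated, one slot per prefix

    @staticmethod
    def _split(key):
        if len(key) < 2:
            raise ValueError("this sketch assumes keys of at least two bytes")
        # The first two bytes form the bucket index; only the rest is stored.
        return (key[0] << 8) | key[1], key[2:]

    def add(self, key):
        slot, rest = self._split(key)
        bucket = self._buckets[slot]
        if bucket is None:
            bucket = self._buckets[slot] = []
        if rest not in bucket:
            bucket.append(rest)

    def __contains__(self, key):
        slot, rest = self._split(key)
        bucket = self._buckets[slot]
        return bucket is not None and rest in bucket

    def discard(self, key):
        slot, rest = self._split(key)
        bucket = self._buckets[slot]
        if bucket is not None and rest in bucket:
            bucket.remove(rest)
```

The lever mentioned above is the prefix length: a longer prefix means more buckets and shorter chains, at the cost of a larger preallocated array.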
A TStringList. Pros: has extended functionality, allowing it to dynamically grow, sort, save, load, search, etc. Cons: with a large number of accesses by index, Strings[Index] introduces a noticeable performance loss (a few percent) compared to array access; memory overhead for each item cell.
A Dynamic Array of strings. Pros: combines the ability to grow dynamically, like a TStrings, with the fastest access by index and the lowest memory usage of the options. Cons: limited standard "string list" functionality.
A Linked List of strings (singly linked). Pros: the linear speed of addition of an item to the list end. Cons: slowest access by index and slowest searching, limited standard "string list" functionality, memory overhead for the "next item" pointer, speed overhead for each item's memory allocation.
TList< string >. As above.
TStringBuilder. I do not have a good idea of how to use TStringBuilder as storage for multiple strings.
Actually, there are many more approaches:
linked list of dynamic arrays
hash tables
databases
binary trees
etc
The best approach will depend on the task.
Which is best for small lists (under 10 items)?
Any of them, maybe even a static array with a variable holding the total item count.
Which is best for large lists (over 1000 items)?
Which is best for huge lists (over 1,000,000 items)?
For large lists I will choose:
- dynamic array, if I need a lot of access by the index or search for specific item
- hash table, if I need to search by the key
- linked list of dynamic arrays, if I need many item appends and no access by the index
Which is best to minimize memory use?
A dynamic array will use the least memory. But the question is not about the overhead itself, but about the number of items at which this overhead becomes noticeable, and then how to properly handle that number of items.
Which is best to minimize loading time to add extra items on the end?
A dynamic array can grow dynamically, but with a really large number of items the memory manager may not find a contiguous memory area. A linked list will work as long as there is memory for at least one cell, but at the cost of a memory allocation for each item. The mixed approach, a linked list of dynamic arrays, should work.
Which is best to minimize access time for accessing the entire list from first to last?
dynamic array.
On this basis (or any others), which data structure would be preferable?
For which task?
If your stated goal is to improve your program to the point that it can load genealogy files with millions of persons in it, then deciding between the four data structures in your question isn't really going to get you there.
Do the math - you are currently loading a 25 MB file with about 100000 persons in it, which causes your application to consume 175 MB of memory. If you wish to load files with several millions of persons in it you can estimate that without drastic changes to your program you will need to multiply your memory needs by n * 10 as well. There's no way to do that in a 32 bit process while keeping everything in memory the way you currently do.
You basically have two options:
Not keeping everything in memory at once, instead using a database, or a file-based solution which you load data from when you need it. I remember you had other questions about this already, and probably decided against it, so I'll leave it at that.
Keep everything in memory, but in the most space-efficient way possible. As long as there is no 64 bit Delphi this should allow for a few million persons, depending on how much data there will be for each person. Recompiling this for 64 bit will do away with that limit as well.
If you go for the second option then you need to minimize memory consumption much more aggressively:
Use string interning. Every loaded data element in your program that contains the same data but is contained in different strings is basically wasted memory. I understand that your program is a viewer, not an editor, so you can probably get away with only ever adding strings to your pool of interned strings. Doing string interning with millions of strings is still difficult; the "Optimizing Memory Consumption with String Pools" blog postings on the SmartInspect blog may give you some good ideas. These guys deal regularly with huge data files and had to make it work with the same constraints you are facing.
This should also connect this answer to your question - if you use string interning you would not need to keep lists of strings in your data structures, but lists of string pool indexes.
It may also be beneficial to use multiple string pools, like one for names, but a different one for locations like cities or countries. This should speed up insertion into the pools.
Use the string encoding that gives the smallest in-memory representation. Storing everything as a native Windows Unicode string will probably consume much more space than storing strings in UTF-8, unless you deal regularly with strings that contain mostly characters which need three or more bytes in the UTF-8 encoding.
Due to the necessary character set conversion your program will need more CPU cycles for displaying strings, but with that amount of data it's a worthy trade-off, as memory access will be the bottleneck, and smaller data size helps with decreasing memory access load.
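Both suggestions can be illustrated together with a small append-only pool that stores each distinct string once, as UTF-8 bytes, and hands out integer indexes. This is a Python sketch of the idea rather than Delphi code; the class and method names are invented, and a real pool for millions of strings would need a more compact lookup structure.

```python
class StringPool:
    """Append-only pool: each distinct string is stored once as UTF-8 bytes."""

    def __init__(self):
        self._index_of = {}  # encoded string -> pool index
        self._values = []    # pool index -> encoded string

    def intern(self, text):
        """Return the pool index for text, adding it on first sight."""
        data = text.encode("utf-8")
        idx = self._index_of.get(data)
        if idx is None:
            idx = len(self._values)
            self._values.append(data)
            self._index_of[data] = idx
        return idx

    def get(self, idx):
        return self._values[idx].decode("utf-8")

    def __len__(self):
        return len(self._values)


places = StringPool()  # a separate pool per kind of string, e.g. places
event_places = [places.intern(p) for p in ["Berlin", "Paris", "Berlin", "Berlin"]]
print(event_places, len(places))  # [0, 1, 0, 0] 2
# Size of the same text in UTF-8 versus UTF-16:
print(len("Berlin".encode("utf-8")), len("Berlin".encode("utf-16-le")))  # 6 12
```

Each person's event list then holds small integers instead of separate string instances.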
One question: How do you query: do you match the strings or query on an ID or position in the list?
Best for small # strings:
Whatever makes your program easy to understand. Program readability is very important and you should only sacrifice it in real hotspots in your application for speed.
Best for memory (if that is the largest constrained) and load times:
Keep all strings in a single memory buffer (or memory mapped file) and only keep pointers to the strings (or offsets). Whenever you need a string you can clip out a string using two pointers and return it as a Delphi string. This way you avoid the overhead of the string structure itself (refcount, length int, codepage int) and the memory manager structures for each string allocation.
This only works fine if the strings are static and don't change.
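A minimal sketch of the single-buffer layout, written in Python for illustration: all strings are concatenated into one byte buffer and an array of offsets marks where each one starts and ends. The class name is invented, and the strings are assumed to be static once loaded.

```python
from array import array


class PackedStrings:
    """Read-only list of strings stored in one buffer plus an offsets array."""

    def __init__(self, strings):
        self._offsets = array("Q", [0])  # offsets[i]..offsets[i+1] bounds item i
        parts = []
        for s in strings:
            data = s.encode("utf-8")
            parts.append(data)
            self._offsets.append(self._offsets[-1] + len(data))
        self._buffer = b"".join(parts)

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        # "Clip out" the string between two neighbouring offsets.
        start, end = self._offsets[i], self._offsets[i + 1]
        return self._buffer[start:end].decode("utf-8")


events = PackedStrings(["born 1900", "", "died 1980"])
print(len(events), events[0], events[2])  # 3 born 1900 died 1980
```

Loading is fast because the buffer can be read from disk in one go, and each item costs one offset instead of a separately allocated string.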
TList, TList<>, array of string and the solution above have a "list" overhead of one pointer per string. A linked list has an overhead of at least 2 pointers (single linked list) or 3 pointers (double linked list). The linked list solution does not have fast random access but allows for O(1) resizes where the other options have O(log N) (using a factor for resize) or O(N) using a fixed resize.
What I would do:
If < 1000 items and performance is not of the utmost importance: use TStringList or a dynamic array, whichever is easiest for you.
else if static: use the trick above. This will give you O(log N) query time, the least memory used and very fast load times (just gulp it in or use a memory mapped file)
All the structures mentioned in your question will fail when using large amounts of data (1M+ strings) that need to be changed dynamically in code. At that point I would use a balanced binary tree or a hash table, depending on the type of queries I need to make.
From your description, I'm not entirely sure if it could fit in your design but one way you could improve on memory usage without suffering a huge performance penalty is by using a trie.
Advantages relative to binary search tree
The following are the main advantages of tries over binary search trees (BSTs):
Looking up keys is faster. Looking up a key of length m takes worst case O(m) time. A BST performs O(log(n)) comparisons of keys, where n is the number of elements in the tree, because lookups depend on the depth of the tree, which is logarithmic in the number of keys if the tree is balanced. Hence in the worst case, a BST takes O(m log n) time. Moreover, in the worst case log(n) will approach m. Also, the simple operations tries use during lookup, such as array indexing using a character, are fast on real machines.
Tries can require less space when they contain a large number of short strings, because the keys are not stored explicitly and nodes are shared between keys with common initial subsequences.
Tries facilitate longest-prefix matching, helping to find the key sharing the longest possible prefix of characters all unique.
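For readers unfamiliar with tries, here is a minimal sketch showing the O(m) lookup and the longest-prefix match mentioned in the quote. It uses nested Python dicts as nodes, which demonstrates the structure but not the memory savings; a space-efficient version would need compact node storage. The class and method names are invented.

```python
class Trie:
    """Character trie built from nested dicts; "" marks the end of a key."""

    def __init__(self):
        self.root = {}

    def insert(self, key):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})  # shared prefixes reuse nodes
        node[""] = True

    def contains(self, key):
        node = self.root
        for ch in key:  # one step per character: O(m) for a key of length m
            node = node.get(ch)
            if node is None:
                return False
        return "" in node

    def longest_prefix(self, text):
        """Return the longest stored key that is a prefix of text, or None."""
        node = self.root
        best = "" if "" in node else None
        for i, ch in enumerate(text):
            node = node.get(ch)
            if node is None:
                break
            if "" in node:
                best = text[: i + 1]
        return best


names = Trie()
for name in ["jo", "john", "johnson"]:
    names.insert(name)
print(names.contains("john"), names.contains("joh"))  # True False
print(names.longest_prefix("johnston"))               # john
```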
Possible alternative:
I've recently discovered SynBigTable (http://blog.synopse.info/post/2010/03/16/Synopse-Big-Table) which has a TSynBigTableString class for storing large amounts of data using a string index.
It is a very simple, single layer bigtable implementation that mainly uses disc storage, so it consumes a lot less memory than expected when storing hundreds of thousands of records.
As simple as:
aId := UTF8String(Format('%s.%s', [name, surname]));
bigtable.Add(data, aId)
and
bigtable.Get(aId, data)
One catch, indexes must be unique, and the cost of update is a bit high (first delete, then re-insert)
TStringList stores an array of pointers to (string, TObject) records.
TList stores an array of pointers.
TStringBuilder cannot store a collection of strings. It is similar to .NET's StringBuilder and should only be used to concatenate (many) strings.
Resizing dynamic arrays is slow, so do not even consider it as an option.
I would use Delphi's generic TList<string> in all your scenarios. It stores an array of strings (not string pointers). It should have faster access in all cases due to no (un)boxing.
You may be able to find or implement a slightly better linked-list solution if you only want sequential access. See Delphi Algorithms and Data Structures.
Delphi promotes its TList and TList<>. The internal array implementation is highly optimized and I have never experienced performance/memory issues when using it. See Efficiency of TList and TStringList

Resources