Mnesia: always suffix fragmented table fragments? - erlang

When I create a fragmented table in Mnesia, all of the table fragments have the suffix _fragN except for the first fragment. This is error-prone: code that accesses the table without specifying the correct access module will appear to work, since it reads from and writes to the first fragment, but it won't interoperate with code that does use the correct access module, since the two will be looking for records in different places.
Is there a way to tell Mnesia to use a fragment suffix for all table fragments? That would avoid that problem, by making incorrect accesses fail noisily.
For example, if I create a table with four fragments:
1> mnesia:start().
ok
2> mnesia:create_table(foo, [{frag_properties, [{node_pool, [node()]}, {n_fragments, 4}]}]).
{atomic,ok}
then mnesia:info/0 will list the fragments as foo, foo_frag2, foo_frag3 and foo_frag4:
3> mnesia:info().
---> Processes holding locks <---
---> Processes waiting for locks <---
---> Participant transactions <---
---> Coordinator transactions <---
---> Uncertain transactions <---
---> Active tables <---
foo : with 0 records occupying 304 words of mem
foo_frag2 : with 0 records occupying 304 words of mem
foo_frag3 : with 0 records occupying 304 words of mem
foo_frag4 : with 0 records occupying 304 words of mem
schema : with 5 records occupying 950 words of mem
===> System info in version "4.14", debug level = none <===
opt_disc. Directory "/Users/legoscia/Mnesia.nonode@nohost" is NOT used.
use fallback at restart = false
running db nodes = [nonode@nohost]
stopped db nodes = []
master node tables = []
remote = []
ram_copies = [foo,foo_frag2,foo_frag3,foo_frag4,schema]
disc_copies = []
disc_only_copies = []
[{nonode@nohost,ram_copies}] = [schema,foo_frag4,foo_frag3,foo_frag2,foo]
3 transactions committed, 0 aborted, 0 restarted, 0 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
I'd want foo to be foo_frag1 instead. Is that possible?
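For reference, the mismatch is easy to demonstrate with the table above. Here is a minimal sketch (the helper name is mine): the same key, read with and without the mnesia_frag access module, can end up looking in different fragments, so the bare read may come back empty even though the record exists.

%% Hypothetical helper: compare a bare dirty read (which always hits the
%% base table foo) with a read through the mnesia_frag access module
%% (which hashes Key to one of foo, foo_frag2, foo_frag3, foo_frag4).
read_both(Key) ->
    Bare = mnesia:dirty_read(foo, Key),
    Frag = mnesia:activity(sync_dirty,
                           fun() -> mnesia:read(foo, Key) end,
                           [], mnesia_frag),
    {Bare, Frag}.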

Intel-x86: The interaction between WC, WB and UC Memory

The memory ordering guarantees across different memory regions on x86 architectures are not clear to me. Specifically, the Intel manual states that WC, WB and UC follow different memory orderings as follows.
WC: weak ordering (where e.g. two stores on different locations can be reordered)
WB (as well as WT and WP, i.e. all cacheable memory types): processor ordering (a.k.a. TSO, where younger loads can be reordered before older stores on different locations)
UC: strong ordering (where all instructions are executed in the program order and cannot be reordered)
What is not clear to me is the interaction between UC and the other regions. Specifically, the manual mentions:
(A) UC accesses are strongly ordered in that they are always executed in program order and cannot be reordered; and
(B) WC accesses are weakly-ordered and can thus be reordered.
So between (A) and (B) it is not clear how UC accesses and WC/WB accesses are ordered w.r.t. one another.
1a) [UC-store/WC-store ordering] For instance, let us assume that x is in UC memory and y is WC memory. Then in the multi-threaded program below, is it possible to load 1 from y and 0 from x? This would be possible if the two stores in thread 0 can be reordered. (I have put an mfence between the two loads hoping that it would stop the loads from being reordered, as it is not clear to me whether WC/UC loads can be reordered; see 3a below)
thread 0 | thread 1
store [x] <-- 1 | load [y]; mfence
store [y] <-- 1 | load [x]
1b) What if instead (symmetrically) x were in WC memory and y were in UC memory?
2a) [UC-store/WB-load ordering] Similarly, can a UC-store and a WB-load (on different locations) be reordered? Let us assume that x is in UC memory and z is in WB memory. Then in the multi-threaded program below, is it possible for both loads to load 0? This would be possible if both x and z were in WB memory due to store buffering (or alternatively justified as: younger loads in each thread can be reordered before the older stores as they are on different locations). But since the accesses on x are in UC memory, it is not clear whether such behaviours are possible.
thread 0 | thread 1
store [x] <-- 1 | store [z] <-- 1
load [z] | load [x]
2b) [UC-store/WC-load ordering] What if z were in WC memory (and x is in UC memory)? Can both loads load 0 then?
3a) [UC-load/WC-load ordering] Can a UC-load and a WC-load be reordered? Once again, let us assume that x is in UC memory and y is in WC memory. Then, in the multi-threaded program below, is it possible to load 1 from y and 0 from x? This would be possible if the two loads could be reordered (I believe the two stores cannot be reordered due to the intervening sfence; the sfence may not be needed depending on the answer to 1a).
thread 0 | thread 1
store [x] <-- 1; sfence | load [y]
store [y] <-- 1 | load [x]
3b) What if instead (symmetrically) x were in WC memory and y were in UC memory?
4a) [WB-load/WC-load ordering] What if in the example of 3a above x were in WB memory (instead of UC) and y were in WC memory (as before)?
4b) What if (symmetrically) x were in WC memory and y were in WB memory?
WARNING: I am ignoring cache coherency in all of this, because it complicates everything and doesn't make any difference to understanding how WB, WT, WP, WC or UC work, or any of the answers.
Assume you have 4 pieces, like:
             ________
            |        |
            | Caches |
            |________|
             /      \
   _________/        \___________________
  |          |       |                   |
  |   CPU    |-------|  Physical address |
  |   core   |       |  space (e.g. RAM) |
  |__________|       |___________________|
           \             /
          __\___________/__
         |                 |
         | Write combining |
         |     buffer      |
         |_________________|
As far as the CPU core is concerned, everything is always "processor ordering" (total store ordering with store forwarding). The only difference between WC, WB, WT, WP and UC is the path data takes between the CPU core and the physical address space.
For UC, writes go directly to the physical address space and reads come directly from the physical address space.
For WC, writes go down to the "write combining buffer" where they're combined with previous writes and eventually evicted from the buffer (and sent to the physical address space later). Reads from WC come directly from the physical address space.
For WB, writes go to caches and are evicted from the caches (and sent to the physical address space) later. For WT, writes go to both caches and the physical address space at the same time. For WP, writes get discarded and don't reach the physical address space at all. For all of these, reads come from cache (and cause a fetch from the physical address space into cache on a "cache miss").
There are 3 other things that influence this:
store forwarding. Any store can be forwarded to a later load within the CPU core, regardless of whether the area is supposed to be WC, WB, WT, ... or UC. This means that it's technically wrong to claim that 80x86 has "total store ordering".
non-temporal stores cause data to go to the write combining buffers (regardless of whether the memory area was originally WB or WT or ... or UC). Non-temporal reads allow a later non-temporal read to occur before an earlier store.
write fences prevent store forwarding and wait for the write combining buffer to be emptied. Read fences cause the CPU to wait until earlier reads complete before allowing later reads. The mfence instruction combines the behavior of a read fence and a write fence. Note: I lost track of lfence - for some/recent CPUs I think it got perverted into a hack to help mitigate the "Spectre" security problems (I think it became a speculative execution barrier rather than just a read fence).
Now...
1a)
thread 0 | thread 1
store [x_in_UC] <-- 1 | load [y_in_WC]; mfence
store [y_in_WC] <-- 1 | load [x_in_UC]
In this case the mfence is irrelevant (the previous load [y_in_WC] acts like UC anyway); but the store to y_in_WC may take ages to make its way to the physical address space (which isn't important because it's possibly last anyway). It's not possible to load 1 from y and 0 from x.
1b)
thread 0 | thread 1
store [x_in_WC] <-- 1 | load [y_in_UC]; mfence
store [y_in_UC] <-- 1 | load [x_in_WC]
In this case, the store [x_in_WC] may take ages to make its way to the physical address space; which means that the data loaded by load [x_in_WC] may fetch older data from the physical address space (even if the load is done after the store). It's very possible to load 1 from y and 0 from x.
2a)
thread 0 | thread 1
store [x_in_UC] <-- 1 | store [z_in_WB] <-- 1
load [z_in_WB] | load [x_in_UC]
In this case there's nothing confusing at all (everything happens in the program order; it's just that store [z_in_WB] writes to cache and load [z_in_WB] reads from cache); and it's not possible for both loads to load 0. Note: an external observer (e.g. a device watching the physical address space) may not see the store to z_in_WB for ages.
2b)
thread 0 | thread 1
store [x_in_UC] <-- 1 | store [z_in_WC] <-- 1
load [z_in_WC] | load [x_in_UC]
In this case the store [z_in_WC] may not reach the physical address space until after the load [z_in_WC] has occurred (even if the load is done after the store). It is possible for both loads to load 0.
3a)
thread 0 | thread 1
store [x_in_UC] <-- 1 | load [y_in_WC]
store [y_in_WC] <-- 1 | load [x_in_UC]
Same as "1a". It's not possible to load 1 from y and 0 from x.
3b)
thread 0 | thread 1
store [x_in_WC] <-- 1 | load [y_in_UC]
store [y_in_UC] <-- 1 | load [x_in_WC]
Same as "1b". It's very possible to load 1 from y and 0 from x.
3c)
thread 0 | thread 1
store [x_in_WC] <-- 1 | load [y_in_UC]
sfence | load [x_in_WC]
store [y_in_UC] <-- 1 |
The sfence forces thread 0 to wait for the write combining buffer to drain, so it's not possible to load 1 from y and 0 from x.
4a)
thread 0 | thread 1
store [x_in_WB] <-- 1 | load [y_in_WC]
store [y_in_WC] <-- 1 | load [x_in_WB]
Mostly the same as "1a" and "3a". The only difference is that the store to x_in_WB goes to caches (and the load to x_in_WB comes from caches). Note: an external observer (e.g. a device watching the physical address space) may not see the store to x_in_WB for ages.
4b)
thread 0 | thread 1
store [x_in_WC] <-- 1 | load [y_in_WB]
store [y_in_WB] <-- 1 | load [x_in_WC]
Mostly the same as "1b" and "3b". Note: an external observer (e.g. a device watching the physical address space) may not see the store to y_in_WB for ages.
Intel's description of the UC memory type is spread out over numerous places in Volume 3 of the manual. I'll focus on the parts that are relevant to memory ordering. The main one is from Section 8.2.5:
The strong uncached (UC) memory type forces a strong-ordering model on memory accesses. Here, all reads and writes to the UC memory region appear on the bus and out-of-order or speculative accesses are not performed.
This states that UC memory accesses across different instructions are guaranteed to be observed in program order. A similar statement appears in Section 11.3. Neither says anything about ordering between UC and other memory types. It's interesting to note here that since all UC accesses are globally observed in order, store forwarding from a UC store to a UC load is impossible. In addition, UC stores are not coalesced or combined in the WCBs, although they do pass through these buffers because that's the physical path that all requests from the core to the uncore have to traverse.
The following two quotes discuss the ordering guarantees between UC loads and stores and previous or later stores of any type. Emphasis is mine.
Section 11.3:
If the WC buffer is partially filled, the writes may be delayed until the next occurrence of a serializing event; such as an SFENCE or MFENCE instruction, CPUID or other serializing instruction, a read or write to uncached memory, an interrupt occurrence, or an execution of a LOCK instruction (including one with an XACQUIRE or XRELEASE prefix).
This means that UC accesses are ordered with respect to earlier WC stores. Contrast this with WB accesses, which are not ordered with earlier WC stores because a WB access doesn't cause the WCBs to be drained.
Section 22.34:
Writes stored in the store buffer(s) are always written to memory in program order, with the exception of “fast string” store operations (see Section 8.2.4, “Fast String Operation and Out-of-Order Stores”).
This means that stores are always committed from the store buffer in program order, which implies that stores of all types, except WC, across different instructions are observed in program order. A store of any type cannot be reordered with an earlier UC store.
Intel provides no guarantees regarding the ordering of non-UC loads with earlier or later UC accesses (loads or stores), so ordering is architecturally possible.
The AMD memory model is described more precisely for all memory types. It clearly states that a non-UC load can be reordered with an earlier UC store and that WC/WC+ loads can be reordered with an earlier UC load. So far the Intel and AMD models agree with each other. However, the AMD model also states that a UC load cannot pass an earlier load of any type. Intel doesn't state this anywhere in the manual as far as I know.
Regarding examples 4a and 4b, Intel doesn't provide guarantees on the ordering between a WB load and a WC load. The AMD model allows a WC load to pass an earlier WB load, but not the other way around.

Size in MB of mnesia table

How do you read the output of :mnesia.info?
For example, I only have one table, some_table, and :mnesia.info returns this:
---> Processes holding locks <---
---> Processes waiting for locks <---
---> Participant transactions <---
---> Coordinator transactions <---
---> Uncertain transactions <---
---> Active tables <---
some_table: with 16020 records occupying 433455 words of mem
schema : with 2 records occupying 536 words of mem
===> System info in version "4.15.5", debug level = none <===
opt_disc. Directory "/home/ubuntu/project/Mnesia.nonode@nohost" is NOT used.
use fallback at restart = false
running db nodes = [nonode@nohost]
stopped db nodes = []
master node tables = []
remote = []
ram_copies = ['some_table',schema]
disc_copies = []
disc_only_copies = []
[{nonode@nohost,ram_copies}] = [schema,'some_table']
488017 transactions committed, 0 aborted, 0 restarted, 0 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
Also calling:
:mnesia.table_info("some_table", :size)
It returns 16020, which I think is the number of keys, but how can I get the memory usage?
First, you need mnesia:table_info(Table, memory) to obtain the number of words occupied by your table; in your example you are getting the number of items in the table, not the memory. To convert that value to MB, first use erlang:system_info(wordsize) to get the word size in bytes for your machine architecture (on a 32-bit system a word is 4 bytes, and on a 64-bit system it's 8 bytes), multiply it by the table's memory to obtain the size in bytes, and finally convert that to megabytes:
MnesiaMemoryMB = (mnesia:table_info(some_table, memory) * erlang:system_info(wordsize)) / (1024*1024).
You can use erlang:system_info(wordsize) to get the word size in bytes: on a 32-bit system a word is 4 bytes, on a 64-bit system it's 8 bytes. So your table is using 433455 × wordsize bytes.
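Putting both answers together, a helper along these lines (the function name is mine; the table name must be an atom) does the whole conversion:

%% Sketch: memory used by a Mnesia table, in megabytes.
table_size_mb(Table) ->
    Words = mnesia:table_info(Table, memory),   % size in machine words
    WordSize = erlang:system_info(wordsize),    % bytes per word: 4 or 8
    Words * WordSize / (1024 * 1024).           % words -> bytes -> MB

For the table above, 433455 words × 8 bytes per word ≈ 3.3 MB on a 64-bit system.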

How to distribute the records over mnesia fragments?

I have two Erlang Mnesia nodes running in a cluster.
I have created a table with the properties below.
mnesia:create_table(vmq_offline_store,
    [{frag_properties,
         [{node_pool, [node() | nodes()]},
          {hash_module, verneDB_frag_hash},
          {n_fragments, 8},
          {n_disc_only_copies, length([node() | nodes()])}]},
     {index, []},
     {type, bag},
     {attributes, record_info(fields, vmq_offline_store)}]).
I could see all 8 fragments created on the two Erlang nodes.
After this, I inserted 50000 records into the table using an RPC call from an external node. All 50000 records were inserted into vmq_offline_store only; they were not distributed over the fragments.
vmq_offline_store: with 50000 records occupying 2096701142 bytes on disc
vmq_offline_store_frag2: with 0 records occupying 5464 bytes on disc
vmq_offline_store_frag3: with 0 records occupying 5464 bytes on disc
vmq_offline_store_frag4: with 0 records occupying 5464 bytes on disc
vmq_offline_store_frag5: with 0 records occupying 5464 bytes on disc
vmq_offline_store_frag6: with 0 records occupying 5464 bytes on disc
vmq_offline_store_frag7: with 0 records occupying 5464 bytes on disc
vmq_offline_store_frag8: with 0 records occupying 5464 bytes on disc
Could you please help me distribute the records over the fragments?
It's not enough to create the Mnesia table with fragmentation properties. Every table operation must explicitly specify the "access module" for fragmented tables, mnesia_frag. This is done by calling the function mnesia:activity/4, instead of calling mnesia:transaction/1 or using dirty operations.
For example, this code:
Fun = fun() -> ... end,
{atomic, Result} = mnesia:transaction(Fun),
becomes:
Fun = fun() -> ... end,
Result = mnesia:activity(transaction, Fun, [], mnesia_frag),
(Note that on errors mnesia:activity raises an exception instead of returning {aborted, Reason}.)
For dirty operations, code like this:
mnesia:dirty_write(MyRecord)
becomes:
mnesia:activity(sync_dirty, fun mnesia:write/1, [MyRecord], mnesia_frag)
or alternatively:
mnesia:activity(sync_dirty, fun() -> mnesia:write(MyRecord) end, [], mnesia_frag)
That is, never use the mnesia:dirty_* functions; use the "bare" ones within a dirty activity.
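As a sketch of what the writing side might then look like (the record shape here is an assumption; it must match the attributes given to mnesia:create_table), inserting through mnesia:activity/4 lets the hash module distribute the keys over all eight fragments:

%% Sketch: insert N records through the mnesia_frag access module so the
%% hash module can spread them over the fragments. The record tuple
%% {vmq_offline_store, Key, Payload} is a placeholder for your real record.
insert_all(N) ->
    mnesia:activity(transaction,
                    fun() ->
                            [mnesia:write({vmq_offline_store, Key, <<"payload">>})
                             || Key <- lists:seq(1, N)],
                            ok
                    end,
                    [], mnesia_frag).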

Forcing a table rehash not working after a previous rehash

I've created a function that resizes an array and sets new entries to 0, but can also decrease the size of the array in 2 different ways:
1. Simply setting the n property to the new size (the length operator cannot be used, for this reason).
2. Setting all values after the new size to nil up to 2*size to force a rehash.
local function resize(array, elements, free)
    local size = array.n
    if elements < size then -- Decrease Size
        array.n = elements
        if free then
            size = math.max(size, #array) -- In case of multiple resizes
            local base = elements + 1
            for idx = base, 2*size do -- Force a rehash -> free extra unneeded memory
                array[idx] = nil
            end
        end
    elseif elements > size then -- Increase Size
        array.n = elements
        for idx = size + 1, elements do
            array[idx] = 0
        end
    end
end
How I tested it:
local mem = {n=0};
resize(mem, 50000)
print(mem.n, #mem) -- 50000 50000
print(collectgarbage("count")) -- relatively large number
resize(mem, 10000, true)
print(mem.n, #mem) -- 10000 10000
print(collectgarbage("count")) -- smaller number
resize(mem, 20, true)
print(mem.n, #mem) -- 20 20
print(collectgarbage("count")) -- same number as above, but it should be a smaller number
However, when I don't pass true as the third argument to the second call of resize (so it doesn't force a rehash on the second call), the third call does end up rehashing it.
Am I missing something? I'm expecting the third call to also trigger a rehash after the second one has.
Here is a clearer picture of what the table usually looks like before and after the resizes:
table: 0x15bd3d0 n: 0 #: 0 narr: 0 nrec: 1
table: 0x15bd3d0 n: 50000 #: 50000 narr: 65536 nrec: 1
table: 0x15bd3d0 n: 10000 #: 10000 narr: 16384 nrec: 2
table: 0x15bd3d0 n: 20 #: 20 narr: 16384 nrec: 2
And here is what happens:
During the resize to 50000 elements, the table is rehashed several times, and at the end it contains exactly one hash part slot for the n field and enough array part slots for the integer keys.
During the shrinking to 10000 elements, you first assign nil to the integer keys 10001 to 65536, and then from 65537 to 100000. The first group of assignments will never cause a rehash, because you assign to existing fields. This has to do with the guarantees for the next function. The second group of assignments will cause rehashes, but since you are assigning nils, Lua will realize at some point that the array part of the table is more than half empty (see the comment at the beginning of ltable.c). Lua will then shrink the array part to a reasonable size and use a second hash slot for the new key. But since you are assigning nils, that second hash slot is never occupied, and Lua is free to re-use it for all the remaining assignments (and it often, but not always, does). You wouldn't notice a rehash at this point anyway, because you will always end up with the 16384 array slots and 2 hash slots (one for n, one for the new element to be assigned).
The shrinking to 20 elements just continues this way, with the exception that a second hash slot is already available. So you might never get a rehash (and the array size stays larger than necessary), but if you do (Lua for some reason doesn't like the one free hash slot), you'll see the number of array slots drop to a reasonable level.
This is what it looks like when you do get a rehash during the second shrinking:
table: 0x11c43d0 n: 0 #: 0 narr: 0 nrec: 1
table: 0x11c43d0 n: 50000 #: 50000 narr: 65536 nrec: 1
table: 0x11c43d0 n: 10000 #: 10000 narr: 16384 nrec: 2
table: 0x11c43d0 n: 20 #: 20 narr: 32 nrec: 2
If you want to repeat my experiments, the git HEAD version of lua-getsize (original version here) now also returns the number of slots in the array/hash parts of a table.

SE 4.10 bcheck <filename>, SE 2.10 bcheck <filename.ext> and other bcheck anomalies

ISQL-SE 4.10.DD6 (DOS 6.22):
BCHECK C-ISAM B-tree Checker version 4.10.DD6
C-ISAM File: c:\dbfiles.dbs\*.*
ERROR: cannot open C-ISAM file
In SE 2.10 it worked with wildcards *.* for all files, but in SE 4.10 it doesn't. I have an SQL script which my users periodically run to reorg the customer and transactions tables. Then I have a FIX.BAT DOS script [bcheck -y *.*] as a utility option for my users in case any tables get screwed up. Since users running the reorg will now increment the table version number, example: CUSTO102, 110, … now I'm going to have to devise a way to strip the .DAT extensions from the .DBS dir and feed it to BCHECK. Before, my reorg would always re-create a static CUSTOMER.DAT with CREATE TABLE customer IN "C:\DBFILES.DBS\CUSTOMER"; but that created the write permission problem and I had to revert back to SE's default datafile journaling…
Before running BCHECK on CUSTO102, its .IDX file size was 22,089 bytes and its .DAT size was 882,832 bytes.
After running BCHECK on CUSTO102, its .IDX size increased to 122,561 bytes, and a new .IDY file was created with 88,430 bytes.
What's a .IDY file?
C:\DBFILES.DBS> bcheck -y CUSTO102
BCHECK C-ISAM B-tree Checker version 4.10.DD6
C-ISAM File: c:\dbfiles.dbs\CUSTO102
Checking dictionary and file sizes.
Index file node size = 512
Current C-ISAM index file node size = 512
Checking data file records.
Checking indexes and key descriptions.
Index 1 = unique key
0 index node(s) used -- 1 index b-tree level(s) used
Index 2 = duplicates (2,30,0)
42 index node(s) used -- 3 index b-tree level(s) used
Index 3 = unique key (32,5,0)
29 index node(s) used -- 2 index b-tree level(s) used
Index 4 = duplicates (242,4,2)
37 index node(s) used -- 2 index b-tree level(s) used
Index 5 = duplicates (241,1,0)
36 index node(s) used -- 2 index b-tree level(s) used
Index 6 = duplicates (46,4,2)
38 index node(s) used -- 2 index b-tree level(s) used
Checking data record and index node free lists.
ERROR: 177 missing index node pointer(s)
Fix index node free list ? yes
Recreating index node free list.
Recreating index 6.
Recreating index 5.
Recreating index 4.
Recreating index 3.
Recreating index 2.
Recreating index 1.
184 index node(s) used, 177 free -- 1083 data record(s) used, 0 free
The problem with the wild cards is more likely an issue with the command interpreter that was used to run bcheck than with bcheck itself. If you give bcheck a list of file names (such as 'abc def.dat def.idx'), then it will process the C-ISAM file pairs (abc.dat, abc.idx), (def.dat, def.idx) and (def.dat, def.idx - again). Since it complained about being unable to open 'c:\dbfiles.dbs\*.*', it means that the command interpreter did not expand the '*.*' bit, or there was nothing for it to expand into.
I expect that the '.IDY' file is an intermediate used while rebuilding the indexes for the table. I do not know why it was not cleaned up - maybe the process did not complete.
About sizes, I think your table has about 55,000 rows of size 368 bytes each (SE might say 367; the difference is the record status byte at the end, which is either '\0' for deleted or '\n' for current). The unique index on the CHAR(5) column (index 3) requires 9 bytes per entry, or about 56 keys per index node, for about 1000 index nodes. The duplicate indexes are harder to size; you need space for the key value plus a list of 4-byte numbers for the duplicates, all packed into 512-byte pages. The 22 KB index file was missing a lot of information. The revised index file is about the right size. Note that index 1 is the 'ROWID' index; it does not occupy any space. (Index 1 is also why although every table created by SE is stored in a C-ISAM file, not all C-ISAM files are necessarily compatible with SE.)
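Spelling out the arithmetic behind the unique-index estimate above (512-byte nodes, 9-byte entries, using the figures already given):

\[ \left\lfloor \frac{512}{9} \right\rfloor = 56 \ \text{keys per node}, \qquad \frac{55000}{56} \approx 982 \approx 1000 \ \text{index nodes}. \]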
