Intel x86: The interaction between WC, WB and UC memory
The memory ordering guarantees across different memory regions on x86 architectures are not clear to me. Specifically, the Intel manual states that WC, WB and UC follow different memory orderings as follows.
WC: weak ordering (where e.g. two stores on different locations can be reordered)
WB (as well as WT and WP, i.e. all cacheable memory types): processor ordering (a.k.a. TSO, where younger loads can be reordered before older stores on different locations)
UC: strong ordering (where all instructions are executed in the program order and cannot be reordered)
What is not clear to me is the interaction between UC and the other regions. Specifically, the manual mentions:
(A) UC accesses are strongly ordered in that they are always executed in program order and cannot be reordered; and
(B) WC accesses are weakly-ordered and can thus be reordered.
So between (A) and (B) it is not clear how UC accesses and WC/WB accesses are ordered w.r.t. one another.
1a) [UC-store/WC-store ordering] For instance, let us assume that x is in UC memory and y is WC memory. Then in the multi-threaded program below, is it possible to load 1 from y and 0 from x? This would be possible if the two stores in thread 0 can be reordered. (I have put an mfence between the two loads hoping that it would stop the loads from being reordered, as it is not clear to me whether WC/UC loads can be reordered; see 3a below)
thread 0 | thread 1
store [x] <-- 1 | load [y]; mfence
store [y] <-- 1 | load [x]
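For concreteness, here is roughly how test 1a might look in C with pthreads and the SSE2 _mm_mfence intrinsic. The names uc_x and wc_y, the thread functions, and the (omitted) platform-specific setup that would actually map them UC and WC are illustrative assumptions, so this is a sketch of the access pattern rather than a runnable reproducer:

    #include <pthread.h>
    #include <stdio.h>
    #include <emmintrin.h>   /* _mm_mfence (SSE2) */

    /* Assumed to point into UC- and WC-mapped memory respectively;
     * how they get mapped is platform-specific and omitted here. */
    volatile int *uc_x;
    volatile int *wc_y;

    static void *thread0(void *arg)
    {
        (void)arg;
        *uc_x = 1;                      /* store [x] <-- 1  (UC) */
        *wc_y = 1;                      /* store [y] <-- 1  (WC) */
        return NULL;
    }

    static void *thread1(void *arg)
    {
        (void)arg;
        int ry = *wc_y;                 /* load [y]  (WC) */
        _mm_mfence();                   /* mfence between the two loads */
        int rx = *uc_x;                 /* load [x]  (UC) */
        printf("y=%d x=%d\n", ry, rx);  /* the question: can this ever print y=1 x=0 ? */
        return NULL;
    }

    int main(void)
    {
        /* uc_x / wc_y must first be set up to point at UC/WC mappings (omitted). */
        pthread_t t0, t1;
        pthread_create(&t0, NULL, thread0, NULL);
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
    }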
1b) What if instead (symmetrically) x were in WC memory and y were in UC memory?
2a) [UC-store/WB-load ordering] Similarly, can a UC-store and a WB-load (on different locations) be reordered? Let us assume that x is in UC memory and z is in WB memory. Then in the multi-threaded program below, is it possible for both loads to load 0? This would be possible if both x and z were in WB memory due to store buffering (or alternatively justified as: younger loads in each thread can be reordered before the older stores as they are on different locations). But since the accesses on x are in UC memory, it is not clear whether such behaviours are possible.
thread 0 | thread 1
store [x] <-- 1 | store [z] <-- 1
load [z] | load [x]
2b) [UC-store/WC-load ordering] What if z were in WC memory (and x is in UC memory)? Can both loads load 0 then?
3a) [UC-load/WC-load ordering] Can a UC-load and a WC-load be reordered? Once again, let us assume that x is in UC memory and y is in WC memory. Then, in the multi-threaded program below, is it possible to load 1 from y and 0 from x? This would be possible if the two loads could be reordered (I believe the two stores cannot be reordered due to the intervening sfence; the sfence may not be needed depending on the answer to 1a).
thread 0 | thread 1
store [x] <-- 1; sfence | load [y]
store [y] <-- 1 | load [x]
3b) What if instead (symmetrically) x were in WC memory and y were in UC memory?
4a) [WB-load/WC-load ordering] What if in the example of 3a above x were in WB memory (instead of UC) and y were in WC memory (as before)?
4b) What if (symmetrically) x were in WC memory and y were in WB memory?
WARNING: I am ignoring cache coherency in all of this, because it complicates everything and doesn't make any difference to understanding how WB, WT, WP, WC or UC work, or to any of the answers.
Assume you have 4 pieces, like:
________
| |
| Caches |
|________|
/ \
______/_ _\__________________
| | | |
| CPU |-----| Physical address |
| core | | space (e.g. RAM) |
|________| |____________________|
\ /
__\______/_
| |
| Write |
| combining |
| buffer |
|___________|
As far as the CPU core is concerned, everything is always "processor ordering" (total store ordering with store forwarding). The only difference between WC, WB, WT, WP and UC is the path data takes to go between the CPU core and the physical address space.
For UC, writes go directly to the physical address space and reads come directly from the physical address space.
For WC, writes go down to the "write combining buffer" where they're combined with previous writes and eventually evicted from the buffer (and sent to the physical address space later). Reads from WC come directly from the physical address space.
For WB, writes go to caches and are evicted from the caches (and sent to the physical address space) later. For WT, writes go to both caches and the physical address space at the same time. For WP, writes get discarded and don't reach the physical address space at all. For all of these, reads come from cache (and cause a fetch from the physical address space into cache on a "cache miss").
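(A side note on where these types come from: software doesn't pick WC/WB/UC per access; the OS assigns a memory type to a physical range through the MTRRs and PAT. In the Linux kernel, for instance, a driver chooses between a UC and a WC mapping of a device BAR roughly like the fragment below; the device and BAR here are purely hypothetical.)

    #include <linux/pci.h>
    #include <linux/io.h>

    /* Hypothetical driver fragment: map a device BAR either UC or WC.
     * Which one you get determines the ordering/combining behavior described above. */
    static void __iomem *map_bar(struct pci_dev *pdev, int bar, bool want_wc)
    {
        resource_size_t start = pci_resource_start(pdev, bar);
        resource_size_t len   = pci_resource_len(pdev, bar);

        return want_wc ? ioremap_wc(start, len)   /* WC: writes may be combined/delayed */
                       : ioremap(start, len);     /* UC: strongly ordered, uncached */
    }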
There are 3 other things that influence this:
store forwarding. Any store can be forwarded to a later load within the CPU core, regardless of whether the area is supposed to be WC, WB, WT, ... or UC. This means that it's technically wrong to claim that 80x86 has "total store ordering".
non-temporal stores cause data to go to the write combining buffers (regardless of whether the memory area was originally WB or WT or ... or UC). Non-temporal reads allow a later non-temporal read to occur before an earlier store.
write fences prevent store forwarding and wait for the write combining buffer to be emptied. Read fences cause the CPU to wait until earlier reads complete before allowing later reads. The mfence instruction combines the behavior of a read fence and a write fence. Note: I lost track of lfence - for some/recent CPUs I think it got perverted into a hack to help mitigate the "Spectre" security problems (I think it became a speculative execution barrier rather than just a read fence).
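A minimal sketch of how these three mechanisms surface in C, using the standard SSE/SSE2 intrinsics; the wc_dst pointer is assumed, purely for illustration, to point into a WC-mapped region:

    #include <emmintrin.h>   /* _mm_stream_si32, _mm_sfence, _mm_lfence, _mm_mfence */

    volatile int plain;              /* ordinary WB variable */

    /* wc_dst is assumed (for illustration only) to point into a WC-mapped region. */
    void demo(int *wc_dst, int v)
    {
        plain = v;                   /* WB store; the load on the next line may be
                                        satisfied by store forwarding within the core */
        int copy = plain;

        _mm_stream_si32(wc_dst, v);  /* non-temporal store: routed through the
                                        write-combining buffers */
        _mm_sfence();                /* write fence: waits for the WC buffer to drain */
        _mm_lfence();                /* read fence (but see the lfence caveat above) */
        _mm_mfence();                /* full fence: read fence + write fence combined */
        (void)copy;
    }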
Now...
1a)
thread 0 | thread 1
store [x_in_UC] <-- 1 | load [y_in_WC]; mfence
store [y_in_WC] <-- 1 | load [x_in_UC]
In this case the mfence is irrelevant (the previous load [y_in_WC] acts like UC anyway); but the store to y_in_WC may take ages to make its way to the physical address space (which isn't important because it's possibly last anyway). It's not possible to load 1 from y and 0 from x.
1b)
thread 0 | thread 1
store [x_in_WC] <-- 1 | load [y_in_UC]; mfence
store [y_in_UC] <-- 1 | load [x_in_WC]
In this case, the store [x_in_WC] may take ages to make its way to the physical address space; which means that the data loaded by load [x_in_WC] may fetch older data from the physical address space (even if the load is done after the store). It's very possible to load 1 from y and 0 from x.
2a)
thread 0 | thread 1
store [x_in_UC] <-- 1 | store [z_in_WB] <-- 1
load [z_in_WB] | load [x_in_UC]
In this case there's nothing confusing at all (everything happens in the program order; it's just that store [z_in_WB] writes to cache and load [z_in_WB] reads from cache); and it's not possible for both loads to load 0. Note: an external observer (e.g. a device watching the physical address space) may not see the store to z_in_WB for ages.
2b)
thread 0 | thread 1
store [x_in_UC] <-- 1 | store [z_in_WC] <-- 1
load [z_in_WC] | load [x_in_UC]
In this case the store [z_in_WC] may not reach the physical address space until after the load [z_in_WC] has occurred (even if the load is done after the store). It is possible for both loads to load 0.
3a)
thread 0 | thread 1
store [x_in_UC] <-- 1 | load [y_in_WC]
store [y_in_WC] <-- 1 | load [x_in_UC]
Same as "1a". It's not possible to load 1 from y and 0 from x.
3b)
thread 0 | thread 1
store [x_in_WC] <-- 1 | load [y_in_UC]
store [y_in_UC] <-- 1 | load [x_in_WC]
Same as "1b". It's very possible to load 1 from y and 0 from x.
3c)
thread 0 | thread 1
store [x_in_WC] <-- 1 | load [y_in_UC]
sfence | load [x_in_WC]
store [y_in_UC] <-- 1 |
The sfence forces thread 0 to wait for the write combining buffer to drain, so it's not possible to load 1 from y and 0 from x.
4a)
thread 0 | thread 1
store [x_in_WB] <-- 1 | load [y_in_WC]
store [y_in_WC] <-- 1 | load [x_in_WB]
Mostly the same as "1a" and "3a". The only difference is that the store to x_in_WB goes to caches (and the load to x_in_WB comes from caches). Note: an external observer (e.g. a device watching the physical address space) may not see the store to x_in_WB for ages.
4b)
thread 0 | thread 1
store [x_in_WC] <-- 1 | load [y_in_WB]
store [y_in_WB] <-- 1 | load [x_in_WC]
Mostly the same as "1b" and "3b". Note: an external observer (e.g. a device watching the physical address space) may not see the store to y_in_WB for ages.
Intel's description of the UC memory type is spread over numerous places in Volume 3 of the manual. I'll focus on the parts that are relevant to memory ordering. The main one is from Section 8.2.5:
The strong uncached (UC) memory type forces a strong-ordering model on memory accesses. Here, all reads and writes to the UC memory
region appear on the bus and out-of-order or speculative accesses are
not performed.
This states that UC memory accesses across different instructions are guaranteed to be observed in program order. A similar statement appears in Section 11.3. Neither says anything about ordering between UC and other memory types. It's interesting to note here that since all UC accesses become globally observable in program order, it's impossible for store forwarding to happen from a UC store to a UC load. In addition, UC stores are not coalesced or combined in the WCBs, although they do pass through these buffers because that's the physical path that all requests from the core to the uncore have to traverse.
The following two quotes discuss the ordering guarantees between UC loads and stores and previous or later stores of any type. Emphasis is mine.
Section 11.3:
If the WC buffer is partially filled, the writes may be delayed until
the next occurrence of a serializing event; such as an SFENCE or
MFENCE instruction, CPUID or other serializing instruction, a read or
write to uncached memory, an interrupt occurrence, or an execution of
a LOCK instruction (including one with an XACQUIRE or XRELEASE
prefix).
This means that UC accesses are ordered with respect to earlier WC stores. Contrast this with WB accesses, which are not ordered with earlier WC stores because a WB access doesn't cause the WCBs to be drained.
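This "UC access drains the WC buffers" rule is what the common driver pattern below relies on: payload is streamed into WC-mapped device memory, then a UC-mapped doorbell register is written, and the doorbell write cannot be observed before the combined payload writes. The sketch below is under those assumptions (wc_ring and uc_doorbell are hypothetical pointers into WC- and UC-mapped device memory); the explicit sfence is kept as the belt-and-braces form, even though the quote above suggests the UC write alone would drain the buffers:

    #include <stdint.h>
    #include <string.h>
    #include <emmintrin.h>   /* _mm_sfence */

    /* Hypothetical device interface: wc_ring is WC-mapped, uc_doorbell is UC-mapped. */
    void post_descriptor(volatile uint8_t *wc_ring, const void *desc, size_t len,
                         volatile uint32_t *uc_doorbell, uint32_t tail)
    {
        memcpy((void *)wc_ring, desc, len);  /* writes gather in the WC buffers */
        _mm_sfence();                        /* explicitly drain the WC buffers; per the
                                                quote, the UC write below would also act
                                                as a serializing event */
        *uc_doorbell = tail;                 /* UC store: strongly ordered, tells the
                                                device the descriptor is ready */
    }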
Section 22.34:
Writes stored in the store buffer(s) are always written to memory in
program order, with the exception of “fast string” store operations
(see Section 8.2.4, “Fast String Operation and Out-of-Order Stores”).
This means that stores are always committed from the store buffer in program order, which implies that stores of all types, except WC, across different instructions are observed in program order. A store of any type cannot be reordered with an earlier UC store.
Intel provides no guarantees regarding the ordering of non-UC loads with earlier or later UC accesses (loads or stores), so ordering is architecturally possible.
The AMD memory model is described more precisely for all memory types. It clearly states that a non-UC load can be reordered with an earlier UC store and that a WC/WC+ load can be reordered with an earlier UC load. So far the Intel and AMD models agree with each other. However, the AMD model also states that a UC load cannot pass an earlier load of any type. Intel doesn't state this anywhere in the manual as far as I know.
Regarding examples 4a and 4b, Intel doesn't provide guarantees on the ordering between a WB load and a WC load. The AMD model allows a WC load to pass an earlier WB load, but not the other way around.