I don't really understand the difference between a normal FENCE in RISC-V (already covered here: What is meant by the FENCE instruction in the RISC-V instruction set?) and FENCE.TSO. The manual says:
The optional FENCE.TSO instruction is encoded as a FENCE instruction with fm=1000, predecessor=RW, and successor=RW. FENCE.TSO orders all load operations in its predecessor set before
all memory operations in its successor set, and all store operations in its predecessor set before all
store operations in its successor set. This leaves non-AMO store operations in the FENCE.TSO’s
predecessor set unordered with non-AMO loads in its successor set.
Okay, so here is my guess. I will just show a sketch of what I understood.
There are two sets of instructions, separated by the FENCE instruction: the predecessor set and the successor set.
Load Operation 1
Load Operation 2
Load Operation 3
Store Operation 1
Store Operation 2
Store Operation 3
**FENCE.TSO**
Memory Operation 1
Memory Operation 2
Memory Operation 3
Store Operation 4
Store Operation 5
Store Operation 6
This is how I understand it. But I'm still confused by the sentence "This leaves non-AMO store operations in the FENCE.TSO’s predecessor set unordered with non-AMO loads in its successor set."
What are non-AMO loads and non-AMO store operations?
Alright, AMO seems to stand for "Atomic Memory Operation". Still, I'm wondering why I can't just use the "normal" FENCE.
You can use the "normal" FENCE, since it orders operations more strictly than FENCE.TSO does. This can be inferred from the note about backward compatibility with implementations that don't support the optional .TSO extension:
The FENCE.TSO encoding was added as an optional extension to the original base FENCE
instruction encoding. The base definition requires that implementations ignore any set bits and
treat the FENCE as global, and so this is a backwards-compatible extension.
So, what is the difference between a FENCE RW,RW and a FENCE.TSO RW,RW? Let's take a simple example.
load A
store B
<fence>
load C
store D
When <fence> is FENCE RW,RW, the following rules apply:
A < C
A < D
B < C
B < D
This results in four different possible orders: ABCD, BACD, ABDC, and BADC. In other words, A/B may be reordered, and C/D may be reordered, but both A and B must be observable no later than C and D.
When <fence> is FENCE.TSO RW,RW, the following rules apply:
A < C
A < D
B < D
Note how B < C is missing; FENCE.TSO does not impose any order between predecessor stores and successor loads. Presumably, this weaker ordering makes it cheaper than a "normal" FENCE.
This gives us five possible orders: ABCD, BACD, ABDC, BADC, and ACBD. If this is acceptable to your program, you may use FENCE.TSO.
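For illustration, here is a minimal C++ sketch of how one might emit the two fences via GCC-style inline assembly on a RISC-V target. This is my own sketch, not something from the manual; it assumes an assembler recent enough to accept the fence.tso mnemonic, and the "memory" clobber is only there so the compiler itself does not reorder accesses across the fence.

// Hedged sketch: assumes a RISC-V target and an assembler that knows fence.tso.
static inline void fence_rw_rw() {
    // FENCE RW,RW: all prior loads/stores are ordered before all later loads/stores.
    asm volatile("fence rw,rw" ::: "memory");
}

static inline void fence_tso() {
    // FENCE.TSO: prior loads are ordered before all later memory operations, and
    // prior stores before later stores; a prior store may still be reordered
    // past a later load (the missing B < C rule above).
    asm volatile("fence.tso" ::: "memory");
}

As far as I can tell from the suggested C/C++ mappings in the spec, a std::atomic_thread_fence(std::memory_order_seq_cst) is expected to be lowered to the full fence rw,rw, so the ordinary FENCE is always the safe, if slightly stronger, choice.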
I have X cores doing unique work in parallel; however, their output needs to be printed in order.
Object {
Data data
int order
}
I've tried putting the objects in a min-heap after they're done with their parallel work, but even that is too much of a bottleneck.
Is there any way I could have work done in parallel and guarantee the print order? Is there a known term for my problem? Have others encountered it before?
Is there any way I could have work done in parallel and guarantee the print order?
Needless to say, we design parallelized routines with a focus on efficiency, not on constraining the order of the calculations. The printing of the results at the end, when everything is done, is what should dictate the ordering. In fact, parallel routines often do calculations in such a way that they’re conspicuously not in order (e.g., striding on each thread) to minimize thread and synchronization overhead.
The only question is how you structure the results to allow efficient storage and efficient, ordered retrieval. I often just use a mutable buffer or a pre-populated array. It’s very efficient in terms of both storage and retrieval. Or you can use a dictionary, too. It depends upon the nature of your Data. But I’d avoid the order property pattern in your result Object.
Just make sure you’re using an optimized build if using standard Swift collections, as this can have a material impact on performance.
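To make the pre-populated-buffer idea concrete, here is a minimal C++ sketch (the answer above had Swift in mind, so this is only an analogous illustration, and do_work is a hypothetical stand-in for the real computation): every worker writes its result into the slot matching its position, and printing the slots afterwards in index order yields the required ordering without constraining the parallel work itself.

#include <iostream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical work function; stands in for whatever produces your Data.
std::string do_work(int i) { return "result " + std::to_string(i); }

int main() {
    const int n = 8;
    std::vector<std::string> results(n);    // pre-sized: slot i belongs to task i
    std::vector<std::thread> workers;

    for (int i = 0; i < n; ++i)
        workers.emplace_back([i, &results] { results[i] = do_work(i); });
    for (auto& t : workers) t.join();        // all parallel work is finished here

    for (const auto& r : results)            // retrieval, not computation, dictates the order
        std::cout << r << '\n';
}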
Q : Is there a known term for my problem?
Yes, there is. A con·tra·dic·tion:
Definition of contradiction
2a : a proposition, statement, or phrase that asserts or implies both the truth and falsity of something // "… both parts of a contradiction cannot possibly be true …" — Thomas Hobbes
2b : a statement or phrase whose parts contradict each other // "a round square is a contradiction in terms"
3a : logical incongruity
3b : a situation in which inherent factors, actions, or propositions are inconsistent or contrary to one another
— source: Merriam-Webster
Computer science, having borrowed the terms { PARALLEL | SERIAL | CONCURRENT } from the theory of systems, respects the distinctive ( and never overlapping ) properties of each such class of operations, where:
[PARALLEL] orchestration of units-of-work implies that any and every work-unit a) starts, b) gets executed and c) gets finished at the same time, i.e. all get into/out of the [PARALLEL]-section at once and are elaborated at the very same time, not otherwise.
[SERIAL] orchestration of units-of-work implies that all work-units are processed in one static, known, particular order, starting work-units in that order, each (known) next one only after the previous one has finished its work - i.e. one-after-another, not otherwise.
[CONCURRENT] orchestration of units-of-work permits starting more than one unit-of-work, if resources and system conditions permit (scheduler priorities obeyed), resulting in an unknown order of execution and an unknown time of completion, as both depend on unknown externalities (system conditions and the (non-)availability of resources that are/will be needed for a particular work-unit's elaboration).
Whereas there is an a-priori known, inherently embedded sense of an ORDER in [SERIAL]-type processing (it was already pre-wired into the units-of-work processing-orchestration code), there is no such meaning in [CONCURRENT] processing, where opportunistic scheduling turns a wished-to-have order into a nondeterministic result of the system state, skewed by the coincidence of all other externalities, nor in true [PARALLEL] processing, where by definition all units-of-work start/execute/finish at the same time, so they have no other chance but to be both first and last at the same time.
Q : Is there any way I could have work done in parallel and guarantee the print order?
No, unless you intentionally or unknowingly violate the [PARALLEL] orchestration rules and re-introduce a re-[SERIAL]-ising logic into the work-units, so as to imperatively enforce a wished-to-have ordering that is neither known to nor natural for the originally [PARALLEL] orchestration of the work-units (a common practice in Python, where the GIL enforces exactly such monopolistic stepping).
Q : Have others encountered it before?
Yes. Since 2011, this or a similar question has reappeared here on Stack Overflow each and every semester, in growing numbers every year.
It is known that the modifications to a single atomic variable form a total order. Suppose we have an atomic read operation on some atomic variable v at wall-clock time T. Then, is this read guaranteed to acquire the current value of v, i.e. the one written by the last write in the modification order of v at time T? To put it another way, if an atomic write is done before an atomic read in natural time, and there are no other writes in between, is the read guaranteed to return the value just written?
My accepted answer is the 6th comment made by Cubbi to his answer.
Wall-clock time is irrelevant. However, what you're describing sounds like the write-read coherence guarantee:
§1.10 [intro.multithread]/20
If a side effect X on an atomic object M happens before a value computation B of M, then the evaluation B shall take its value from X or from a side effect Y that follows X in the modification order of M.
(translating the standardese, "value computation" is a read, and "side effect" is a write)
In particular, if your relaxed write and your relaxed read are in different statements of the same function, they are connected by a sequenced-before relationship, therefore they are connected by a happens-before relationship, therefore the guarantee holds.
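A minimal sketch of that single-thread case (my own illustration, not code from the question): the relaxed store is sequenced before the relaxed load, so the write-read coherence rule forces the load to return 42.

#include <atomic>
#include <cassert>

std::atomic<int> v{0};

int main() {
    v.store(42, std::memory_order_relaxed);     // side effect X on v
    int r = v.load(std::memory_order_relaxed);  // value computation B of v
    // X is sequenced before B, hence happens before B, so B must take its value
    // from X or from a later write in v's modification order (there is none here).
    assert(r == 42);
}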
It depends on the memory order you specify for the load() operation.
By default it is std::memory_order_seq_cst, and the answer is yes, it guarantees the current value stored by another thread (if that value was stored at all; the store must use at least std::memory_order_release, otherwise its visibility is not guaranteed).
But if you specify std::memory_order_relaxed for the load operation, the documentation says: "Relaxed ordering: there are no synchronization or ordering constraints, only atomicity is required of this operation." I.e. the load could end up returning a stale value rather than the most recently stored one.
Is a read on an atomic variable guaranteed to acquire the current value of it
No
Even though each atomic variable has a single modification order (which is observed by all threads), that does not mean that all threads observe modifications at the same time scale.
Consider this code:
std::atomic<int> g{0};
// thread 1
g.store(42);
// thread 2
int a = g.load();
// do stuff with a
int b = g.load();
A possible outcome is:
thread 1: 42 is stored at time T1
thread 2: the first load returns 0 at time T2
thread 2: the store from thread 1 becomes visible at time T3
thread 2: the second load returns 42 at time T4.
This outcome is possible even though the first load at time T2 occurs after the store at T1 (in clock time).
The standard says:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
It does not require a store to become visible right away and it even allows room for a store to remain invisible (e.g. on systems without cache-coherency).
In that case, an atomic read-modify-write (RMW) is required to access the last value.
Atomic read-modify-write operations shall always read the last value (in the modification order) written
before the write associated with the read-modify-write operation.
Needless to say, RMWs are more expensive to execute (they lock the bus), and that is why a regular atomic load is allowed to return an older (cached) value.
If a regular load was required to return the last value, performance would be horrible while there would be hardly any benefit.
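As a small illustrative sketch (plain C++11 atomics, no particular platform assumed): a relaxed load only has to respect coherence and may return an older value in the modification order, while a read-modify-write such as fetch_add reads exactly the value that immediately precedes its own write in that order.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> g{0};

int main() {
    std::thread writer([] { g.store(42, std::memory_order_relaxed); });

    // A plain load may return a value older than the latest one
    // in g's modification order.
    int observed = g.load(std::memory_order_relaxed);        // 0 or 42

    // An RMW reads the value immediately preceding its own write
    // in the modification order -- the "last value" at that point.
    int before = g.fetch_add(1, std::memory_order_relaxed);  // 0 or 42

    writer.join();
    std::cout << observed << ' ' << before << '\n';
}

Both reads can still be 0 or 42 here, because where the writer's store lands in the modification order is itself not fixed; the difference is that the RMW's read is pinned to the position of its own write in that order, whereas the plain load is not.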
I've read the Efficiency Guide and the erlang-questions mailing list archive, as well as all of the available books on Erlang, but I haven't found a precise description of efficient binary pattern matching. (Though I haven't read the sources yet :) ) I hope that people who have already read them will read this post. Here are my questions.
How many match contexts does an erlang binary have?
a) if we match parts of a binary sequentially and just once
A = <<1,2,3,4>>.
<<A1,A2,A3,A4>> = A.
Do we have just one binary match context (moving from the beginning of A to the end), or four?
b) if we match parts of a binary sequentially from the beginning to the end for the first time
and (sequentially again) from the beginning to the end for the second time
B = <<1,2,3,4>>.
<<B1,B2,B3,B4>> = B.
<<B11,B22,B33,B44>> = B.
Do we have just a single match context, which moves from the beginning of B to the end
of B and then moves again from the beginning of B to the end of B,
or
do we have 2 match contexts, one moving from the beginning of B to the end of B,
and another - again from the beginning of B to the end of B (as the first can't move
to the beginning again),
or do we have 8 match contexts?
According to documentation, if I write:
my_binary_to_list(<<H,T/binary>>) ->
    [H|my_binary_to_list(T)];
my_binary_to_list(<<>>) -> [].
there will be only 1 match context for the whole recursion tree, even though this
function isn't tail-recursive.
a) Did I get it right that there would be only 1 match context in this case?
b) Did I get it right that if I match an erlang binary sequentially (from the beginning to
the end), it doesn't matter which recursion type (tail or body) is used? (from the binary-matching efficiency point of view)
c) What if I'm going to process an erlang binary NOT sequentially; say, I'm travelling through
a binary - first I match the first byte, then the 1000th, then the 5th, then the 10001st, then the 10th...
In this case,
d1) If I used body-recursion, how many matching contexts for this binary would I have -
one or >1?
d2) if I used tail-recursion, how many matching contexts for this binary would I have -
one or >1?
If I pass a large binary (say 1 megabyte) via tail recursion, will all of the 1 megabyte of data be copied? Or is only some kind of pointer to the beginning of this binary passed between calls?
Does it matter which binary I'm matching - big or small? Will a match context be created for a binary of any size, or only for large ones?
I am only a beginner in erlang, so take this answer with a grain of salt.
How many match contexts does an erlang binary have?
a) Only one context is created, but it is entirely consumed in that instance, since there's nothing left to match, and thus it may not be reused.
b) Likewise, the whole binary is split and there is no context left after matching, though one context has been created for each line: the assignments of B1 up to B4 create one context, and the second set of assignments from B11 to B44 also creates a context. So in total we get 2 contexts created and consumed.
According to documentation [...]
This section isn't quite totally clear for me as well, but this is what I could figure out.
a) Yes, there will be only one context allocated for the whole duration of the function recursive execution.
b) Indeed no mention is made of distinguishing tail recursion vs non tail recursion. However, the example given is clearly a function which can be transformed (though it's not trivial) into a tail-recursive one. I suppose that the compiler decides to duplicate a matching context when a clause contains more than one path for the context to follow. In that case, the compiler detects that the function is tail optimizable, and goes without doing the allocation.
c) We see the opposite situation happening in the example following the one you've reproduced, which contains a case expression: there, the context may follow 2 different paths, thus the compiler has to force the allocation at each recursion level.
If I pass a large binary (say 1 megabyte) via tail recursion [...]
From § 4.1:
A sub binary is created by split_binary/2 and when a binary is matched out in a binary pattern. A sub binary is a reference into a part of another binary (refc or heap binary, never into another sub binary). Therefore, matching out a binary is relatively cheap because the actual binary data is never copied.
When dealing with binaries, a buffer is used to store the actual data, and any matching of a sub-part is implemented as a structure containing a pointer to the original buffer, plus an offset and a length indicating which sub-part is being considered. That's the sub binary type mentioned in the docs.
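As a rough analogy only (this is not how the BEAM is actually implemented, just an illustration of the pointer-plus-offset-plus-length idea in C++):

#include <cstddef>
#include <cstdint>
#include <vector>

// Rough analogue of a sub binary: a non-owning view into shared data.
struct SubBinary {
    const std::uint8_t* base;  // pointer to the original (refc) buffer
    std::size_t offset;        // where this view starts
    std::size_t length;        // how many bytes it covers
};

// "Matching out" the tail <<_, T/binary>> just builds a new view -- no copy.
SubBinary tail(const SubBinary& b) {
    return {b.base, b.offset + 1, b.length - 1};
}

int main() {
    std::vector<std::uint8_t> buffer{1, 2, 3, 4};
    SubBinary whole{buffer.data(), 0, buffer.size()};
    SubBinary rest = tail(whole);   // shares buffer; only offset/length differ
    (void)rest;
}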
Does it matter which binary I'm matching - big or small - ...
From that same § 4.1:
The binary containers are called refc binaries (short for reference-counted binaries) and heap binaries.
Refc binaries consist of two parts: an object stored on the process heap, called a ProcBin, and the binary object itself stored outside all process heaps.
[...]
Heap binaries are small binaries, up to 64 bytes, that are stored directly on the process heap. They will be copied when the process is garbage collected and when they are sent as a message. They don't require any special handling by the garbage collector.
This indicates that, depending on the size of the binary, it may be stored as a big buffer outside of the processes and referenced in the processes through a proxy structure, or, if that binary is 64 bytes or less, it will be stored directly in the memory of the process dealing with it. The first case avoids copying the binary when processes sharing it are running on the same node.
I am currently wading through the ARM architecture manual for the ARMv7 core. In chapter A3.5.3 about atomicity of memory accesses, it states:
If a single-copy atomic load overlaps a single-copy atomic store and
for any of the overlapping bytes the load returns the data written by
the write inserted into the Coherence order of that byte by the
single-copy atomic store then the load must return data from a point
in the Coherence order no earlier than the writes inserted into the
Coherence order by the single-copy atomic store of all of the
overlapping bytes.
As a non-native English speaker, I admit that I am slightly challenged in understanding this sentence.
Is there a scenario where writes to a memory byte are not inserted in the Coherence Order and thus the above does not apply? If not, am I correct to say that shortening and rephrasing the sentence to the following:
If the load happens to return at least one byte of the write, then the load must return
all overlapping bytes from a point no earlier than where the write inserted them into the
Coherence order of all of the overlapping bytes.
still conveys the same meaning?
I see that wording in the ARMv8 ARM, which really tries to remove any possible ambiguity in a lot of places (even if it does make the memory ordering section virtually unreadable).
In terms of general understanding (as opposed to actually implementing the specification), a little bit of ambiguity doesn't always hurt, so whilst it fails to make it absolutely clear what a "memory location" means, I think the old v7 manual (DDI0406C.b) is a nicer read in this case:
A read or write operation is single-copy atomic if the following conditions are both true:
After any number of write operations to a memory location, the value of the memory location is the value written by one of the write operations. It is impossible for part of the value of the memory location to come from one write operation and another part of the value to come from a different write operation
When a read operation and a write operation are made to the same memory location, the value obtained by the read operation is one of:
the value of the memory location before the write operation
the value of the memory location after the write operation.
It is never the case that the value of the read operation is partly the value of the memory location before the write operation and partly the value of the memory location after the write operation.
So your understanding is right - the defining point of a single-copy atomic operation is that at any given time you can only ever see either all of it, or none of it.
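As a hedged sketch of that all-or-nothing behaviour in portable code (my own illustration using C++ atomics rather than ARM assembly): a naturally aligned 32-bit atomic store is single-copy atomic, so a concurrent reader may observe the old word or the new word, but never a byte-wise mixture of the two.

#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

std::atomic<std::uint32_t> word{0x11111111u};

int main() {
    std::thread writer([] { word.store(0x22222222u, std::memory_order_relaxed); });

    // Single-copy atomic read: either the old value or the new one,
    // never something like 0x11112222 stitched together from both writes.
    std::uint32_t v = word.load(std::memory_order_relaxed);
    assert(v == 0x11111111u || v == 0x22222222u);

    writer.join();
}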
There is a case in v7 whereby (if I'm interpreting it right) two normally single-copy atomic stores that occur to the same location at the same time but with different sizes break any guarantee of atomicity, so in theory you could observe some unexpected mix of bytes there - this looks to have been removed in v8.
Currently I am trying to improve my understanding of memory barriers, locks and memory models.
As far as I know there exist four different types of relaxations, namely
Write -> Read, Write -> Write, Read -> Write and Read -> Read.
An x86 processor allows just Write->Read relaxation which is often called Total Store Order (TSO).
Partial Store Order (PSO) allows further Write->Write relaxations and Relaxed Store Order (RSO)
allows all the above relaxations.
Further, there exist three types of memory barriers: release, acquire, and both together (a full barrier).
Locks can use just acquire and release barriers, or sometimes full barriers (.NET).
Now consider the following example:
// thread 0
x = 1
flag = 1
//thread 1
while (flag != 1);
print x
My current understanding tells me that I need no additional memory barriers if I run this code on a
TSO machine.
If it is a PSO machine, I need a release barrier between x = 1 and flag = 1 to ensure
that thread 1 gets the actual value of x if flag == 1.
If it is an RSO machine, I additionally need an acquire barrier between while (flag != 1); and print x to prevent
thread 1 from reading the value of x too early.
Are my observations correct?
I think your code sample is close to the one in this question.
That said, for RSO you need more memory barriers than you describe - specifically, for example, one that provides a freshness guarantee for thread 1 before the while loop.
I am unsure about the TSO and PSO part; I hope this can be helpful, because I was also trying to understand memory barriers in that question and a couple of related ones.
Reordering can happen on both the software (compiler) and hardware level, so keep that in mind. Even though on a TSO CPU the 2 stores would not be reordered, there is nothing that prevents the compiler from reordering the 2 stores (or the 2 loads). So flag needs to be a synchronization variable, the store of flag needs to be a release store, and the load of flag needs to be an acquire load.
But if we assume that the above code represents X86 instructions:
Then with TSO the above will work correctly since it will prevent the 2 stores and the 2 loads from being reordered.
But with PSO the above could fail because the 2 stores could be reordered.
So imagine you would have the following:
b = 1
x = 1
flag = 1
where b is a value on the same cache line as flag. Then, with write coalescing, the flag = 1 and b = 1 could be coalesced and, as a consequence, flag = 1 could overtake the x = 1 and hence become globally visible before the x = 1.
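For completeness, here is a minimal C++ sketch of the flag pattern with the release store and acquire load mentioned above (an illustration of the idea, not tied to any particular CPU): the release/acquire pair forbids both the compiler and the hardware from making flag = 1 visible before x = 1, whether the machine is TSO, PSO or weaker.

#include <atomic>
#include <cassert>
#include <thread>

int x = 0;                    // plain data
std::atomic<int> flag{0};     // synchronization variable

void thread0() {
    x = 1;                                      // ordinary store
    flag.store(1, std::memory_order_release);   // release: x = 1 cannot sink below this
}

void thread1() {
    while (flag.load(std::memory_order_acquire) != 1)  // acquire: later reads cannot hoist above
        ;                                               // spin until the flag is seen
    assert(x == 1);           // guaranteed once the acquire load has seen 1
}

int main() {
    std::thread t1(thread1);
    std::thread t0(thread0);
    t0.join();
    t1.join();
}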