Confusion about the memory barrier example in the JMM cookbook

I am confused by the compiler barrier-insertion example in the JMM cookbook:
http://g.oswego.edu/dl/jmm/cookbook.html
i = u (doesn't this involve a volatile load from u and a normal store into i?)
j = b (this looks to me like a normal load from b and a normal store into j)
According to the lookup table in the cookbook, where do the two barriers LoadLoad and LoadStore come from?
Thanks!
///////////////// JSR example ////
volatile int u;
int i,b,j;
i = u; //load u
LoadLoad
LoadStore
j = b; //load b
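For reference, the cookbook's "Required barriers" table can be transcribed as a small lookup. This is only a sketch for reading the table, not something a JVM runs: the class, enum and method names are made up for illustration, and monitor enter/exit are folded into the volatile load/store rows as the cookbook does. Read this way, the volatile load of u is the first operation: a later normal load (such as the load of b) calls for LoadLoad, and a later normal store calls for LoadStore.

```java
class BarrierTable {
    enum Op { NORMAL_LOAD, NORMAL_STORE, VOLATILE_LOAD, VOLATILE_STORE }
    enum Barrier { NONE, LOAD_LOAD, LOAD_STORE, STORE_STORE, STORE_LOAD }

    // Rows: first operation; columns: second operation (both in Op order).
    // Entries transcribed from the cookbook's "Required barriers" table.
    private static final Barrier[][] TABLE = {
        /* NORMAL_LOAD    */ { Barrier.NONE, Barrier.NONE, Barrier.NONE, Barrier.LOAD_STORE },
        /* NORMAL_STORE   */ { Barrier.NONE, Barrier.NONE, Barrier.NONE, Barrier.STORE_STORE },
        /* VOLATILE_LOAD  */ { Barrier.LOAD_LOAD, Barrier.LOAD_STORE, Barrier.LOAD_LOAD, Barrier.LOAD_STORE },
        /* VOLATILE_STORE */ { Barrier.NONE, Barrier.NONE, Barrier.STORE_LOAD, Barrier.STORE_STORE },
    };

    // Barrier needed between a first operation and a later second operation.
    static Barrier required(Op first, Op second) {
        return TABLE[first.ordinal()][second.ordinal()];
    }

    public static void main(String[] args) {
        // Volatile load of u followed by a normal load (of b).
        System.out.println(required(Op.VOLATILE_LOAD, Op.NORMAL_LOAD));  // LOAD_LOAD
        // Volatile load of u followed by a normal store.
        System.out.println(required(Op.VOLATILE_LOAD, Op.NORMAL_STORE)); // LOAD_STORE
        // Two normal accesses need nothing between them.
        System.out.println(required(Op.NORMAL_LOAD, Op.NORMAL_STORE));   // NONE
    }
}
```

The barrier belongs to the pair of operations, not to the single statement i = u, which is why both entries in the volatile-load row can apply after one volatile read.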


Concatenate block of memory into a wire array?

In my project I have something like this:
reg [15:0] mem [3:0];
wire [63:0] data;
I know I can concatenate the mem into data like this:
assign data = {mem[3], mem[2], mem[1], mem[0]};
but this gets tedious when the memory grows large:
reg [3:0] mem [255:0];
wire [1023:0] data;
I'm afraid writing something like this isn't a good idea, even if I could write a Python or Ruby script to generate such a line:
assign data = {mem[255], ..........., mem[0]};
summon_cthulhu();
Is there any better approach to do this?
Note: This is not an XY problem - it's the exact problem that I want to solve.
Use a generate-for loop:
genvar ii;
for (ii=0; ii<256; ii=ii+1)
assign data[ii*4 +: 4] = mem[ii]; // 4 bits per entry, matching reg [3:0] mem [255:0]
Here is one way to do it:
parameter MEM_WIDTH = 4;
parameter MEM_DEPTH = 256;
localparam DATA_SIZE = (MEM_WIDTH * MEM_DEPTH);
reg [MEM_WIDTH-1:0] mem [MEM_DEPTH-1:0];
reg [DATA_SIZE-1:0] data;
integer i; // loop index for the procedural for loop
always @(*)
begin
for (i=0; i<MEM_DEPTH; i=i+1)
begin
data[i*MEM_WIDTH +: MEM_WIDTH] = mem[i];
end
end

Neo4j : Difference between cypher execution and Java API call?

Neo4j : Enterprise version 3.2
I see a tremendous difference in speed between the following two calls. Here are the settings and the query/API.
Page Cache : 16g | Heap : 16g
Number of row/nodes -> 600K
cypher code (ignore syntax if any) | Time Taken : 50 sec.
using periodic commit 10000
load with headers from 'file:///xyx.csv' as row with row
create(n:ObjectTension) set n = row
From Java (session pool, with 15 sessions at a time as an example):
Thread_1 : Time Taken : 8 sec / 10K
Map<String,Object> pList = new HashMap<String, Object>();
try(Transaction tx = Driver.session().beginTransaction()){
for(int i = 0; i< 10000; i++){
pList.put(i, i * i);
params.put("props",pList);
String query = "Create(n:Label {props})";
// String query = "create(n:Label) set n = {props})";
tx.run(query, params);
}
Thread_2 : Time taken is 9 sec / 10K
Map<String,Object> pList = new HashMap<String, Object>();
try(Transaction tx = Driver.session().beginTransaction()){
for(int i = 0; i< 10000; i++){
pList.put(i, i * i);
params.put("props",pList);
String query = "Create(n:Label {props})";
// String query = "create(n:Label) set n = {props})";
tx.run(query, params);
}
.
.
.
Thread_3 : Basically the above code is reused..It's just an example.
Thread_N where N = (600K / 10K)
Hence, the overall time taken is around 2 to 3 minutes.
My questions are the following:
How is the CSV load handled internally? Does it open a single session with multiple transactions inside it?
Or
Does it create multiple sessions based on the parameter passed in "using periodic commit 10000", so that 600K/10000 gives 60 sessions?
What's the best way to write via Java?
The idea is to achieve the same write performance from Java as with the CSV load, since the CSV load writes 12000 nodes in ~5 seconds or even better.
Your Java code is doing something very different than your Cypher code, so it really makes no sense to compare processing times.
You should change your Java code to read from the same CSV file. File IO is fairly expensive, but your Java code is not doing any.
Also, whereas your pure Cypher query is creating nodes with a fixed (and presumably relatively small) number of properties, your Java pList is growing in size with every loop iteration, so that each Java loop creates nodes with between 1 and 10K properties! This may be the main reason why your Java code is much slower.
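To make the growth concrete, here is a minimal plain-Java sketch with no Neo4j calls. The class and method names and the "col" key prefix are made up for illustration. It records how many properties a node would receive on each iteration when one map is shared across the loop, compared with a fresh fixed-size map per node.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GrowingPropsDemo {
    // Mirrors the question's loop: one shared map gains a key per iteration.
    static List<Integer> sharedMapSizes(int iterations) {
        Map<String, Object> pList = new HashMap<>();
        List<Integer> sizes = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            pList.put(Integer.toString(i), i * i);
            sizes.add(pList.size()); // property count of a node created at this point
        }
        return sizes;
    }

    // Alternative: a fresh map per node with a fixed number of columns.
    static List<Integer> freshMapSizes(int iterations, int columns) {
        List<Integer> sizes = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            Map<String, Object> props = new HashMap<>();
            for (int c = 0; c < columns; c++) {
                props.put("col" + c, i * c); // "col" prefix is a hypothetical column name
            }
            sizes.add(props.size());
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(sharedMapSizes(5));   // [1, 2, 3, 4, 5]
        System.out.println(freshMapSizes(5, 3)); // [3, 3, 3, 3, 3]
    }
}
```

With 10000 iterations the shared map ends at 10000 entries, so the later nodes in each transaction carry thousands of properties, while the CSV rows presumably carry a handful.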
[UPDATE 1]
If you want to ignore the performance difference between using and not using a CSV file, the following (untested) code should give you an idea of what similar logic would look like in Java. In this example, the i loop assumes that your CSV file has 10 columns (you should adjust the loop to use the correct column count). Also, this example gives all the nodes the same properties, which is OK as long as you have not created a contrary uniqueness constraint.
Session session = Driver.session();
Map<String,Object> pList = new HashMap<String, Object>();
for (int i = 0; i < 10; i++) {
pList.put(Integer.toString(i), i * i); // property keys must be strings
}
Map<String, Object> params = new HashMap<String, Object>();
params.put("props", pList);
String query = "create(n:Label) set n = {props}";
for (int j = 0; j < 60; j++) {
try (Transaction tx = session.beginTransaction()) {
for(int k = 0; k < 10000; k++){
tx.run(query, params);
}
tx.success(); // mark the transaction for commit (1.x driver API)
}
}
[UPDATE 2 and 3, copied from chat and then fixed]
Since the Cypher planner is able to optimize, the actual internal logic is probably a lot more efficient than the Java code I provided (above). If you want to also optimize your Java code (which may be closer to the code that Cypher actually generates), try the following (untested) code. It sends 10000 rows of data in a single run() call, and uses the UNWIND clause to break it up into individual rows on the server.
Session session = Driver.session();
Map<String, Object> pList = new HashMap<String, Object>();
for (int i = 0; i < 10; i++) {
pList.put(Integer.toString(i), i*i);
}
List<Map<String, Object>> rows = Collections.nCopies(10000, pList); // 10000 identical rows per call
Map<String, Object> params = new HashMap<String, Object>();
params.put("rows", rows);
String query = "UNWIND {rows} AS row CREATE(n:Label) SET n = row"; // row is a Cypher variable, not a parameter
for (int j = 0; j < 60; j++) {
try (Transaction tx = session.beginTransaction()) {
tx.run(query, params);
tx.success(); // mark the transaction for commit (1.x driver API)
}
}
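If the rows come from real data rather than identical copies, the batching step might look like the following plain-Java sketch. It deliberately leaves out the driver calls: the class and method names are hypothetical, and the "rows" key is only assumed to match the {rows} parameter in the UNWIND query above. Each resulting parameter map would be passed to one tx.run() call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BatchRows {
    // Split the rows into consecutive batches of at most batchSize entries.
    static <T> List<List<T>> partition(List<T> rows, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            batches.add(rows.subList(start, end)); // a view, not a copy
        }
        return batches;
    }

    // Wrap one batch as the parameter map for "UNWIND {rows} AS row ...".
    static Map<String, Object> paramsFor(List<Map<String, Object>> batch) {
        Map<String, Object> params = new HashMap<>();
        params.put("rows", batch);
        return params;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("name", "example"); // placeholder property, not from the original CSV
        // nCopies stands in for 600K parsed CSV rows.
        List<Map<String, Object>> rows = Collections.nCopies(600_000, row);
        List<List<Map<String, Object>>> batches = partition(rows, 10_000);
        System.out.println(batches.size()); // 60
        // Each paramsFor(batch) would go to one tx.run(query, params) call.
        System.out.println(paramsFor(batches.get(0)).keySet()); // [rows]
    }
}
```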
You can also try creating the nodes using the Java API instead of relying on Cypher:
createNode - http://neo4j.com/docs/java-reference/current/javadocs/org/neo4j/graphdb/GraphDatabaseService.html#createNode-org.neo4j.graphdb.Label...-
setProperty - http://neo4j.com/docs/java-reference/current/javadocs/org/neo4j/graphdb/PropertyContainer.html#setProperty-java.lang.String-java.lang.Object-
Also, as the previous answer mentioned, the props variable holds different values in your two cases.
Additionally, notice that the query is parsed on every iteration (String query = "Create(n:Label {props})";), unless Neo4j itself optimizes this away.

ArrayFire seq to int c++

Imagine a gfor with a seq j...
If I need to use the value of the instance j as an index, how can I do that?
Something like:
vector<double> a(n);
gfor(seq j, n){
//Do some calculation and save this on someValue
a[j] = someValue;
}
Can someone help me (again)?
Thanks.
I've found a solution for this.
If someone has a better option, feel free to post it.
First, create a seq with the same size as your gfor instances.
Then, convert that seq into an array.
Now, take the value at that position in the array (it equals the index):
seq sequencia(0, 200);
af::array sqc = sequencia;
//Inside the gfor loop
countLoop = (int) sqc(j).scalar<float>();
Your approach works, but it breaks gfor's parallelization: converting the index to a scalar forces it to be copied from the GPU back to the host, slamming the brakes on the GPU.
You want to do it more like this:
af::array a(200);
gfor(seq j, 200){
//Do some calculation and save this on someValue
a(j) = af::array(someValue); // af::array is indexed with (), not []; someValue is a primitive type, say float
}
// ... Now we're safe outside the parallel loop, let's grab the array results
float results[200];
a.host(results); // Copy array from GPU to host, populating a C-style array

ARM: May I do direct memory accesses to a range returned by ioremap_nocache() [without using ioread*()/iowrite*()]?

I'm using a TI AM3358 SoC, running an ARM Cortex-A8 processor, which runs Linux 3.12. I enabled a child device of the GPMC node in the device tree, which probes my driver, and in there I call ioremap_nocache() with the resource provided by the device tree node to get an uncached region.
The reason I'm requesting no cache is that it's not an actual memory device which is connected to the GPMC bus, which would of course benefit from the processor cache, but an FPGA device. So accesses need to always go through the actual wires.
When I do this:
u16 __iomem *addr = ioremap_nocache(...);
iowrite16(1, &addr[0]);
iowrite16(1, &addr[1]);
iowrite16(1, &addr[2]);
iowrite16(1, &addr[3]);
ioread16(&addr[0]);
ioread16(&addr[1]);
ioread16(&addr[2]);
ioread16(&addr[3]);
I see the 8 accesses are done on the wires using a logic analyzer. However, when I do this:
u16 v;
addr[0] = 1;
addr[1] = 1;
addr[2] = 1;
addr[3] = 1;
v = addr[0];
v = addr[1];
v = addr[2];
v = addr[3];
I see the four write accesses, but not the subsequent read accesses.
Am I missing something? What would be the difference here between ioread16() and a direct memory access, knowing that the whole GPMC range is supposed to be addressable just like memory?
Could this behaviour be the result of any compiler optimization which can be avoided? I didn't look at the generated instructions yet, but until then, maybe someone experienced enough has something interesting to reply.
ioread*() and iowrite*(), on ARM, combine a volatile access with a data memory barrier (after the access for reads, before it for writes), e.g.:
#define readb(c) ({ u8 __v = readb_relaxed(c); __iormb(); __v; })
#define readw(c) ({ u16 __v = readw_relaxed(c); __iormb(); __v; })
#define readl(c) ({ u32 __v = readl_relaxed(c); __iormb(); __v; })
#define writeb(v,c) ({ __iowmb(); writeb_relaxed(v,c); })
#define writew(v,c) ({ __iowmb(); writew_relaxed(v,c); })
#define writel(v,c) ({ __iowmb(); writel_relaxed(v,c); })
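__raw_read*() and __raw_write*() (where * is b, w, or l) may be used for direct reads/writes. They do the exact single instruction needed for those operations, casting the address pointer to a volatile pointer. Plain accesses through a non-volatile pointer, as in your second snippet, carry no such guarantee: the compiler is free to drop reads whose result is never used, which is probably why your four reads disappear.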
__raw_writew() example (store register, halfword):
#define __raw_writew __raw_writew
static inline void __raw_writew(u16 val, volatile void __iomem *addr)
{
asm volatile("strh %1, %0"
: "+Q" (*(volatile u16 __force *)addr)
: "r" (val));
}
Beware, though, that those two functions do not insert any barrier, so you should call rmb() (read memory barrier) and wmb() (write memory barrier) anywhere you want your memory accesses to be ordered.

Can the STREAM and GUPS (single CPU) benchmark use non-local memory in NUMA machine

I want to run some tests from HPCC: STREAM and GUPS.
They test memory bandwidth, latency, and throughput (in terms of random accesses).
Can I start the Single CPU STREAM or Single CPU GUPS test on a NUMA node with memory interleaving enabled? (Is it allowed by the rules of HPCC, the High Performance Computing Challenge?)
Using non-local memory can increase GUPS results, because it increases the number of memory banks available for random accesses 2- or 4-fold. (GUPS is typically limited by a non-ideal memory subsystem and by slow memory bank opening/closing. With more banks, it can update one bank while the other banks are opening/closing.)
Thanks.
UPDATE:
(you may not reorder the memory accesses that the program makes).
But can the compiler reorder the loop nesting? E.g. hpcc/RandomAccess.c:
/* Perform updates to main table. The scalar equivalent is:
*
* u64Int ran;
* ran = 1;
* for (i=0; i<NUPDATE; i++) {
* ran = (ran << 1) ^ (((s64Int) ran < 0) ? POLY : 0);
* table[ran & (TableSize-1)] ^= stable[ran >> (64-LSTSIZE)];
* }
*/
for (j=0; j<128; j++)
ran[j] = starts ((NUPDATE/128) * j);
for (i=0; i<NUPDATE/128; i++) {
/* #pragma ivdep */
for (j=0; j<128; j++) {
ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
Table[ran[j] & (TableSize-1)] ^= stable[ran[j] >> (64-LSTSIZE)];
}
}
The main loop here is for (i=0; i<NUPDATE/128; i++) { and the nested loop is for (j=0; j<128; j++) {. Using the 'loop interchange' optimization, a compiler can convert this code to
for (j=0; j<128; j++) {
for (i=0; i<NUPDATE/128; i++) {
ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
Table[ran[j] & (TableSize-1)] ^= stable[ran[j] >> (64-LSTSIZE)];
}
}
This can be done because the loop nest is a perfect loop nest. Is such an optimization prohibited by the rules of HPCC?
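To see what the interchange changes, here is a scaled-down Java sketch of the two loop orders. It is an illustration under assumptions, not the HPCC code: the sizes are tiny, the seeds are arbitrary stand-ins for starts(), the update XORs ran[j] itself instead of a stable[] entry, and all names are made up. It says nothing about what the HPCC rules permit. It only shows that the final table is the same while the order of table accesses differs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LoopInterchangeDemo {
    static final long POLY = 0x7L;
    static final int TABLE_SIZE = 16; // power of two, stands in for TableSize
    static final int STREAMS = 4;     // stands in for 128
    static final int STEPS = 8;       // stands in for NUPDATE/128

    // One update of stream j; records which table slot was touched.
    static void update(long[] ran, long[] table, int j, List<Integer> order) {
        ran[j] = (ran[j] << 1) ^ (ran[j] < 0 ? POLY : 0);
        int idx = (int) (ran[j] & (TABLE_SIZE - 1));
        table[idx] ^= ran[j]; // simplified: the original XORs a stable[] entry
        order.add(idx);
    }

    // Runs either the original (i outer) or the interchanged (j outer) nest.
    static long[] run(boolean interchanged, List<Integer> order) {
        long[] ran = new long[STREAMS];
        for (int j = 0; j < STREAMS; j++) ran[j] = j + 1; // hypothetical seeds, not starts()
        long[] table = new long[TABLE_SIZE];
        if (!interchanged) {
            for (int i = 0; i < STEPS; i++)
                for (int j = 0; j < STREAMS; j++) update(ran, table, j, order);
        } else {
            for (int j = 0; j < STREAMS; j++)
                for (int i = 0; i < STEPS; i++) update(ran, table, j, order);
        }
        return table;
    }

    public static void main(String[] args) {
        List<Integer> original = new ArrayList<>();
        List<Integer> swapped = new ArrayList<>();
        long[] a = run(false, original);
        long[] b = run(true, swapped);
        System.out.println("same final table: " + Arrays.equals(a, b));
        System.out.println("same access order: " + original.equals(swapped));
    }
}
```

The tables match because each stream produces the same sequence of values in either order and XOR updates commute. The access order is what changes, and that is the part the quoted rule about reordering memory accesses is concerned with.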
As far as I can tell it is allowed, given that memory interleaving is a system setting rather than a code modification (you may not reorder the memory accesses that the program makes).
Whether GUPS actually gets better performance with non-local memory on a NUMA machine seems doubtful to me. Will bank conflict-induced latency really be greater than the off-node memory access latency?
STREAM should not be limited by bank conflicts but will probably benefit from off-node accesses if the CPU has an on-chip memory controller (like the Opterons), since the bandwidth is then shared between the local memory controller and the NUMA interconnect.
