How to vectorize/group together many signals generated from Qsys to Altera Quartus - vectorization

In Altera Qsys, I am using ten input parallel ports (let's name them pio1 to pio10); each port is 12 bits wide. These parallel ports obtain their values from a VHDL block in the Quartus schematic. In the schematic (.bdf), I can see pio1 to pio10 coming out of the Nios II system symbol, so I can connect these PIOs to other blocks in my BDF.
My question is: how do I vectorize pio1 to pio10? Instead of seeing all ten PIOs coming out of the Nios II system symbol line by line, what should I do to group them so that I only see one instead of ten? That single PIO could then be named pio[1..10][1..12]: the first range means pio1 to pio10, and the second range means bit 1 to bit 12, because each parallel port has 12 bits.
Could you please let me know how I could do that?

Related

Which OpenACC directive will tell compiler to execute a statement on device only?

I am learning OpenACC with Fortran (with a suite of tools from Nvidia) and am doing it by porting my implementation of the Conjugate Gradient (CG) solver to GPUs.
Naturally, I am trying to keep as much data as possible on the device (GPU memory), with the following directives:
27 ! Copy matrix (a_sparse), vectors (ax - b) and scalars (alpha - pap) to GPU
28 !$acc enter data copyin(a_sparse)
29 !$acc enter data copyin(a_sparse % row(:))
30 !$acc enter data copyin(a_sparse % col(:))
31 !$acc enter data copyin(a_sparse % val(:))
32 !$acc enter data copyin(ax(:), ap(:), x(:), p(:), r(:), b(:))
33 !$acc enter data copyin(alpha, beta, rho, rho_old, pap)
From that point on, all operations constituting the solution algorithm of the CG solver are done with the present clause. For a vector operation, an excerpt looks like:
49 !$acc parallel loop &
50 !$acc& present(r, b, ax)
51 do i = 1, n
52 r(i) = b(i) - ax(i)
53 end do
I do the same thing with scalars, for example:
87 !$acc kernels present(alpha, rho, pap)
88 alpha = rho / pap
89 !$acc end kernels
All scalar variables are on the device. With lines 87-89 I am trying to execute the statement alpha = rho / pap on the device only, avoiding any data transfer to or from the host, but the Nsight Systems profiler shows me the following:
To my astonishment, there seems to be data transfer at line 87, both before (red "Enter Data" square) and after (red "Exit Data" square) the compute construct (blue "Cg.f90: 87" square).
Could anyone tell me what is going on? Are lines 87-89 executed on the device? If so, why does there seem to be data transfer between the host and the device? If not, is there an OpenACC directive which would tell the compiler to execute a statement, not necessarily a loop, on the device only? Moreover, why are there no corresponding CUDA commands for these "Enter Data" and "Exit Data" fields?
I noticed the same for the array operations, such as the one in lines 49-53 above: there is some data transfer there too, but I could attribute it to the variable n, which has to be passed to the device.
It could be a few things. The Fortran standard specifies that the right-hand side of an array syntax operation needs to be fully evaluated before assignment to the left-hand side, so the compiler may be allocating a temporary array to hold the result of the evaluation. Often the compiler can optimize away the need for the temporary, so it may or may not be the issue. Try making this an explicit loop rather than using array syntax, to see if that solves it.
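For instance, a minimal sketch of the two forms, reusing the r, b and ax arrays from the question (purely illustrative, not the poster's actual code):
!$acc kernels present(r, b, ax)
r(:) = b(:) - ax(:)   ! array syntax: the compiler may build a temporary for the right-hand side
!$acc end kernels
!$acc parallel loop present(r, b, ax)
do i = 1, n           ! explicit loop: no temporary is needed
   r(i) = b(i) - ax(i)
end do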
A second possibility is that the compiler needs to copy the array descriptors, since it can't tell whether they have changed. Though in that case I'd expect to see some data movement rather than just the enter/exit regions.
The third possibility is that this is just the present check itself, which still goes through the enter/exit runtime calls. Instead of copying data, the call looks up the device pointer, which is later passed to the kernel launch, and the reference counter is incremented/decremented.
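Two things that may help confirm this (both assume the NVIDIA HPC SDK toolchain mentioned in the question; the executable name below is hypothetical). Running with runtime notifications enabled, e.g.
NVCOMPILER_ACC_NOTIFY=3 ./cg_solver
prints one line per kernel launch and per actual upload/download, so if nothing is printed around line 87, the "Enter Data"/"Exit Data" events are only present-table lookups. Also, for a single non-loop statement OpenACC offers the serial construct (!$acc serial ... !$acc end serial), which launches a one-thread kernel on the device, although it goes through the same present lookups as kernels.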

GNU Parallel -- How to understand "block-size" setting, and guess what to set it to?

How do I set the block-size parameter when running grep under GNU parallel on a single machine with multiple cores, based on the "large_file" size, the "small_file" size and the machine I'm using, to get the fastest performance possible (or please correct me if there is something else I'm missing here)? What performance issues/speed bottlenecks will I run into when setting it too high or too low? I understand what block-size does: it splits large_file into chunks and sends those chunks to each job, but I'm still missing how and why that would impact execution speed.
The command in question:
parallel --pipepart --block 100M --jobs 10 -a large_file.csv grep -f small_file.csv
where large_file.csv has in it:
123456 1
234567 2
345667 22
and
where small_file.csv has in it:
1$
2$
and so on...
Thank you!
parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv
--block -1 will split large_file.csv into one block per jobslot (here 10 chunks). The splitting is done on the fly, so the file is not read into RAM to do the splitting.
Splitting into n evenly sized blocks (where n = number of jobs run in parallel) often makes sense if the time spent per line is roughly the same. If it varies a lot (say, some lines take 100 times longer to process than others), then it may make more sense to chop the file into more blocks. E.g. --block -10 will split into 10 times as many blocks as --block -1.
The optimal value can seldom be guessed in advance, because it may also depend on how fast your disk is. So try different values and identify where the bottleneck is. It is typically one of disk I/O, CPU, RAM, or command startup time.
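For example, one simple way to compare candidate values is to time a few runs of the same command, varying only --block (output discarded here purely for the timing comparison):
time parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv > /dev/null
time parallel --pipepart --block -10 --jobs 10 -a large_file.csv grep -f small_file.csv > /dev/null
time parallel --pipepart --block 100M --jobs 10 -a large_file.csv grep -f small_file.csv > /dev/null
While they run, watch CPU and disk utilisation (e.g. with top and iostat) to see which resource saturates first.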

Modbus register address space not linear?

I'm currently developing a Modbus server to control a device.
The device manual says about holding registers:
Address 6000: ValueA, 2 Byte
Address 6001: ValueB, 1 Byte; ValueC, 4 Byte; ValueD, 4 Byte
Address 6005: ValueE, 2 Byte
The only supported read function is FC 03 / Read Multiple Holding Registers
To my knowledge, one can see the registers as a memory block of numbered 16-bit values, and could read the whole table in one go by reading 6 registers / 12 bytes beginning at 6000.
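(For reference, and assuming the manual's addresses map directly onto protocol addresses, the FC 03 request PDU for such a read would be 03 17 70 00 06: function code 0x03, starting address 6000 = 0x1770, register count 6, plus the unit identifier and a CRC or MBAP header depending on whether RTU or TCP framing is used.)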
I think the 1-byte value isn't an issue; the register simply contains a value not exceeding 255.
But expanding the table above gives:
Address 6000: ValueA, 2 Byte
Address 6001: ValueB, 1 Byte
Address 6002-6003: ValueC, 4 Byte
Address 6004-6005: ValueD, 4 Byte
Address 6005: ValueE, 2 Byte
so there is an overlap in the last line, at 6005.
My device manual is full of such occurrences, and by now I'm thinking that Modbus registers aren't the simple, linear memory I thought they were.
Does anybody know whether Modbus registers are linear or not?
I stumbled across a similar situation and asked about it in a more specialized forum. The "too long, didn't read" version was that the address space is linear most of the time, but not always.
Check out the following example:
Excuse the German parts, but what you can see here is that register address 0x2021 holds data made up of eight words, i.e. eight 16-bit blocks. Following your logic above, you would expect the second word to be stored in register 0x2022, but I checked on my local device and it is not. So, in summary, there are some devices out there which decide to give one register more memory than it ought to have. Register 0x2021 really holds 8 words on its own and does not use register 0x2022 to hold them.
You might have a similar case.

Maximum call depth exceeded

I am using the integration engine Iguana by iNTERFACEWARE, which uses Lua as its scripting language.
I have a function that recursively calls itself to split a string based on the number of words, i.e. if a huge chunk of text is passed, it is divided into chunks of, say, 80 words. This function works just fine up to a point, but beyond 3000/4000 words it breaks with the following error:
call-stack-has-exceeded-maximum-of-depth-of-100
Here is the function that I am using:
Function to split text
What can be done to fix this? I am on Windows 8. Does it depend on the stack size of the machine (Windows 8 in my case) or on the depth of recursive calls allowed by the Iguana software?
Any fix?

Cache and memory

First of all, this is not language tag spam; this question is not specific to one language in particular, and I think this Stack Exchange site is the most appropriate one for it.
I'm studying caches and memory, trying to understand how they work.
What I don't understand is this sentence (in bold, not in the picture):
In the MIPS architecture, since words are aligned to multiples of four
bytes, the least significant two bits are ignored when selecting a
word in the block.
So let's say I have these two addresses:
[1........0]10
[1........0]00
They share the same 30 bits: [31-12] for the tag and [11-2] for the index (see figure below).
As I understand it, the first one will result in a MISS (I assume the cache is initially empty). So one slot in the cache will be filled with the data located at this memory address.
Now we take the second one. Since it has the same 30 bits, it will result in a HIT, because we access the same slot (same 10 index bits) and the 20 tag bits of the address are equal to the 20 bits stored in the Tag field.
So as a result, we'll get the data located at memory address [1........0]10 and not [1........0]00, which is wrong!
So I assume this has to do with the sentence I quoted above. Can anyone explain to me why my reasoning is wrong?
The cache in the figure:
In the MIPS architecture, since words are aligned to multiples of four
bytes, the least significant two bits are ignored when selecting a
word in the block.
It just means that, in memory, the words are aligned like this:
So when selecting a word, I don't care about the last two bits, because I'll load a whole word.
These last two bits will be useful to the processor when a load byte (lb) instruction is performed, to shift the data correctly and pick out the byte at the right position.
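As a worked example (using the bit split from the figure: tag = bits 31-12, index = bits 11-2, byte offset = bits 1-0), the two addresses from the question decompose into the same tag and the same index, and differ only in the offset (10 = byte 2 of the word versus 00 = byte 0). Both therefore fall inside the same aligned 4-byte word, so a word load (lw) returns the identical 32-bit word for either address; the hit delivers the correct data, and only a byte load (lb) would use the 2-bit offset to select one byte out of that word.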
