How to use LogiCORE DSP48 Macro?

I want to learn how to use LogiCORE DSP48 Macro. I'm reading the Xilinx documentation but I cannot understand well how to start my first design with DSP48 Macro. Can anyone help me to make a simple design to get a better understanding of this IP core please?
Thanks in advance!

In many cases you would use DSP48 by writing Verilog/VHDL expressions containing add, subtract, and multiply.
x = a * b + c
The problem with this expression is that the multiplication and the addition both take place in a single cycle. The design can run at a higher clock frequency if the operation is pipelined. Vivado can sometimes retime such expressions across registers so that they use the DSP48 pipeline registers.
However, I understand wanting to use the DSP48 directly. You instantiate DSP48s just like other RTL modules. The ports, parameters, and behaviors are described in the DSP Slice User Guide for the FPGA family that you are using.
wire [47:0] c;
wire [24:0] a;
wire [17:0] b;
DSP48E1#() dsp(
.a(a),
.b(b),
.c(c),
.p(x),
.opmode(5),
.alumode(0)
);
This instance is copied from one of my inner-product implementations. It is fully pipelined because I was aiming for 500 MHz operation. I only achieved 400 MHz because of other combinational paths.
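Before writing any RTL, it can help to model the arithmetic in software. The following is a minimal Python sketch, not Xilinx code. It assumes the operand widths from the wire declarations above (25-bit a, 18-bit b, 48-bit c), treats every value as two's complement, and wraps the result to 48 bits. The helper names to_signed and dsp_mult_add are made up for illustration.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def dsp_mult_add(a, b, c):
    """Model p = a * b + c with 25-bit a, 18-bit b, 48-bit c and 48-bit p."""
    a = to_signed(a, 25)
    b = to_signed(b, 18)
    c = to_signed(c, 48)
    return to_signed(a * b + c, 48)


print(dsp_mult_add(3, 4, 5))            # small positive operands
print(dsp_mult_add(-3, 4, 5))           # negative multiplier input
print(dsp_mult_add(1, 1, 2**47 - 1))    # the 48-bit result wraps around
```

A model like this is handy as a reference in a testbench: feed the same operands to the simulation and compare against the Python result, keeping in mind that the hardware result appears some cycles later when the pipeline registers are enabled.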
For Xilinx 7 Series:
DSP48E1 Slice User Guide
For Xilinx Ultrascale:
DSP48E2 Slice User Guide

What is the difference between loadu_ps and set_ps when using unformatted data?

I have some data that isn't stored as structure of arrays. What is the best practice for loading the data in registers?
__m128 _mm_set_ps (float e3, float e2, float e1, float e0)
// or
__m128 _mm_loadu_ps (float const* mem_addr)
With _mm_loadu_ps, I'd copy the data in a temporary stack array, vs. copying the data as values directly. Is there a difference?

It can be a tradeoff between latency and throughput, because separate stores into an array will cause a store-forwarding stall when you do a vector load. So it's high latency, but throughput could still be ok, and it doesn't compete with surrounding code for the vector shuffle execution unit. So it can be a throughput win if the surrounding code also has shuffle operations, vs. 3 shuffles to insert 3 elements into an XMM register after a scalar load of the first one. Either way it's still a lot of total uops, and that's another throughput bottleneck.
Most compilers like gcc and clang do a pretty good job with _mm_set_ps () when optimizing with -O3, whether the inputs are in memory or registers. I'd recommend it, except in some special cases.
The most common missed optimization with _mm_set is when there's some locality between the inputs. E.g. don't do _mm_set_ps(a[i+2], a[i+3], a[i+0], a[i+1]), because many compilers will use their regular pattern without taking advantage of the fact that 2 pairs of elements are contiguous in memory. In that case, use (the intrinsics for) movsd and movhps to load in two 64-bit chunks. (Not movlps: it merges into an existing register instead of zeroing the high elements, so it has a false dependency on the old contents, while movsd zeros the high half.) Or use a shufps if some reordering is needed between or within the 64-bit chunks.
The "regular pattern" that compilers use will usually be movss / insertps from memory if compiling with SSE4, or movss loads and unpcklps shuffles to combine pairs and then another unpcklps, unpcklpd, or movlhps to shuffle into one register. Or a shufps or shufpd if the compiler likes to waste code size on immediate shuffle-control operands instead of using fixed shuffles intelligently.
See also Agner Fog's optimization guides for some handy tables of data-movement instructions to get a better idea of what the compiler has to work with, and how stuff performs. Note that Haswell and later can only do 1 shuffle per clock. Also other links in the x86 tag wiki.
There's no really cheap way for a compiler or human to do this, in the general case when you have 4 separate scalars that aren't contiguous in memory at all. Or for register inputs, where it can't optimize the way they're generated in registers in the first place to have some of them already packed together. (e.g. for function args passed in registers to a function that can't / doesn't inline.)
Anyway, it's not a big deal unless you have this inside an inner loop. In that case, definitely worry about it (and check the compiler's asm output to see if it made a mess or could do better if you program the gather yourself with intrinsics that map to single instructions like _mm_load_ss / _mm_shuffle_ps).
If possible, rearrange your data layout to make data contiguous in at least small chunks / stripes. (See https://stackoverflow.com/tags/sse/info, specifically these slides.) But sometimes one part of the program needs the data one way, and the other needs it another way. Choose the layout that's good for the case that needs to be faster, or that runs more often, or whatever, and suck it up and do the best you can for the other part of the program. :P Possibly transpose / convert once to set up for multiple SIMD operations, but extra passes over data with no computation just suck up time and can hurt your computational intensity (how much ALU work you do for each time you load data into registers) more than they help.
And BTW, actual gather instructions (like AVX2 vgatherdps) are not very fast; even on Skylake it's probably not worth using a gather instruction for four 32-bit elements at known locations. On Broadwell / Haswell, gather is definitely not worth using for this.

What are the advantages of the "apply" functions? When are they better to use than "for" loops, and when are they not? [duplicate]

Possible Duplicate:
Is R's apply family more than syntactic sugar
Just what the title says. Stupid question, perhaps, but my understanding has been that when using an "apply" function, the iteration is performed in compiled code rather than in the R parser. This would seem to imply that lapply, for instance, is only faster than a "for" loop if there are a great many iterations and each operation is relatively simple. For instance, if a single call to a function wrapped up in lapply takes 10 seconds, and there are only, say, 12 iterations of it, I would imagine that there's virtually no difference at all between using "for" and "lapply".
Now that I think of it, if the function inside the "lapply" has to be parsed anyway, why should there be ANY performance benefit from using "lapply" instead of "for" unless you're doing something that there are compiled functions for (like summing or multiplying, etc)?
Thanks in advance!
Josh

There are several reasons why one might prefer an apply family function over a for loop, or vice-versa.
Firstly, for() and apply() or sapply() will generally be just as quick as each other if executed correctly. lapply() does more of its work in compiled code within the R internals than the others, so it can be faster than those functions. It appears the speed advantage is greatest when the act of "looping" over the data is a significant part of the compute time; in many general day-to-day uses you are unlikely to gain much from the inherently quicker lapply(). In the end, these all will be calling R functions, so they need to be interpreted and then run.
for() loops can often be easier to implement, especially if you come from a programming background where loops are prevalent. Working in a loop may be more natural than forcing the iterative computation into one of the apply family functions. However, to use for() loops properly, you need to do some extra work to set-up storage and manage plugging the output of the loop back together again. The apply functions do this for you automagically. E.g.:
IN <- runif(10)
OUT <- logical(length = length(IN))
for (i in seq_along(IN)) {
  OUT[i] <- IN[i] > 0.5
}
That is a silly example, as > is a vectorised operator, but I wanted something to make a point, namely that you have to manage the output. The main thing is that with for() loops, you always allocate sufficient storage to hold the outputs before you start the loop. If you don't know how much storage you will need, then allocate a reasonable chunk of storage, and then in the loop check if you have exhausted that storage, and bolt on another big chunk of storage.
The main reason, in my mind, for using one of the apply family of functions is for more elegant, readable code. Rather than managing the output storage and setting up the loop (as shown above) we can let R handle that and succinctly ask R to run a function on subsets of our data. Speed usually does not enter into the decision, for me at least. I use the function that suits the situation best and will result in simple, easy to understand code, because I'm far more likely to waste more time than I save by always choosing the fastest function if I can't remember what the code is doing a day or a week or more later!
The apply family lend themselves to scalar or vector operations. A for() loop will often lend itself to doing multiple iterated operations using the same index i. For example, I have written code that uses for() loops to do k-fold or bootstrap cross-validation on objects. I probably would never entertain doing that with one of the apply family as each CV iteration needs multiple operations, access to lots of objects in the current frame, and fills in several output objects that hold the output of the iterations.
As to the last point, about why lapply() can possibly be faster than for() or apply(), you need to realise that the "loop" can be performed in interpreted R code or in compiled code. Yes, both will still be calling R functions that need to be interpreted, but if you are doing the looping and calling directly from compiled C code (e.g. lapply()), then that is where the performance gain can come from over, say, apply(), which boils down to a for() loop in actual R code. See the source for apply() to see that it is a wrapper around a for() loop, and then look at the code for lapply(), which is:
> lapply
function (X, FUN, ...)
{
FUN <- match.fun(FUN)
if (!is.vector(X) || is.object(X))
X <- as.list(X)
.Internal(lapply(X, FUN))
}
<environment: namespace:base>
and you should see why there can be a difference in speed between lapply() and for() and the other apply family functions. The .Internal() is one of R's ways of calling compiled C code used by R itself. Apart from coercing X to a list where needed, and a sanity check on FUN, the entire computation is done in C, calling the R function FUN. Compare that with the source for apply().

From Burns' R Inferno (pdf), p25:
Use an explicit for loop when each iteration is a non-trivial task. But a simple loop can be more clearly and compactly expressed using an apply function. There is at least one exception to this rule ... if the result will be a list and some of the components can be NULL, then a for loop is trouble (big trouble) and lapply gives the expected answer.

Does functional programming take up more memory?

Warning! possibly a very dumb question
Does functional programming eat up more memory than procedural programming?
I mean ... if your objects (data structures, whatever) are all immutable, don't you end up having more objects in memory at a given time?
Doesn't this eat up more memory?

It depends on what you're doing. With functional programming you don't have to create defensive copies, so for certain problems it can end up using less memory.
Many functional programming languages also have good support for laziness, which can further reduce memory usage as you don't create objects until you actually use them. This is arguably something that's only correlated with functional programming rather than a direct cause, however.
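The point about defensive copies can be made concrete with a small Python sketch. The two classes below are hypothetical and exist only for illustration: one stores a mutable list and therefore copies it whenever it crosses the class boundary, the other stores an immutable tuple and can hand out the same object to every caller.

```python
class MutableInventory:
    """Holds a mutable list, so it copies on the way in and on the way out."""

    def __init__(self, items):
        self._items = list(items)      # defensive copy of the caller's data

    def items(self):
        return list(self._items)       # defensive copy for every reader


class ImmutableInventory:
    """Holds a tuple; nobody can change it, so it can be handed out as-is."""

    def __init__(self, items):
        self._items = tuple(items)

    def items(self):
        return self._items             # shared, no copy


mutable = MutableInventory(["a", "b", "c"])
immutable = ImmutableInventory(["a", "b", "c"])

print(mutable.items() is mutable.items())      # a fresh list per call
print(immutable.items() is immutable.items())  # the same tuple every time
```

With many readers, the mutable version allocates one list per call, while the immutable version keeps a single tuple alive for all of them.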

Persistent values, that functional languages encourage but which can be implemented in an imperative language, make sharing a no-brainer.
Although the generally accepted idea is that with a garbage collector, there is some amount of wasted space at any given time (already unreachable but not yet collected blocks), in this context, without a garbage collector, you end up very often copying values that are immutable and could be shared, just because it's too much of a mess to decide who is responsible for freeing the memory after use.
These ideas are expanded on a bit in this experience report which does not claim to be an objective study but only anecdotal evidence.

Apart from avoiding defensive copies by the programmer, a very smart implementation of pure functional programming languages like Haskell or Standard ML (which lack physical pointer equality) can actively recover sharing of structurally equal values in memory, e.g. as part of the memory management and garbage collection.
Thus you can have automatic hash consing provided by your programming language runtime-system.
Compare this with objects in Java: object identity is an integral part of the language definition. Even just exchanging one immutable String for another poses semantic problems.

There is indeed at least a tendency to regard memory as an abundant resource (which, in fact, it really is in most cases), but this applies to modern programming as a whole.
With multiple cores, parallel garbage collectors and available RAM in the gigabytes, one tends to concentrate on different aspects of a program than in earlier times, when every byte one could save counted. Remember when Bill Gates supposedly said "640K should be enough for every program"?

I know that I'm very late to this question.
Functional languages do not in general use more memory than imperative or OO languages. It depends more on the code you write. Yes, F#, SML, Haskell and such have immutable values (not variables), but in all of them it goes without saying that if you update, for example, a singly linked list, only what is necessary is recomputed.
Say you have a list of 5 elements, and you remove the first 3 and add new elements in front of the rest. The new list simply takes the pointer to the fourth element and points to that data, i.e. it reuses the data, as seen below.
old list
[x0,x1,x2]
          \
           [x3,x4]
new list  /
   [y0,y1]
If it were an imperative language, we could not do this, because the values x3 and x4 could very well change over time, and the list [x3,x4] could change too. If the 3 removed elements are not used afterward, the memory they use can be cleaned up right away, in contrast to unused space in an array.
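The sharing in the diagram can be reproduced with a few lines of Python. This is only a sketch under the assumption that a list is a chain of immutable cells; Node, from_values, drop and to_values are illustrative names, not part of any library.

```python
from typing import Any, NamedTuple, Optional


class Node(NamedTuple):
    """One immutable cell of a singly linked list."""
    head: Any
    tail: Optional["Node"]


def from_values(values, tail=None):
    """Build cells holding `values` in order, in front of an existing tail."""
    node = tail
    for value in reversed(values):
        node = Node(value, node)
    return node


def drop(node, count):
    """Skip `count` cells; nothing is copied, we only follow pointers."""
    for _ in range(count):
        node = node.tail
    return node


def to_values(node):
    """Collect the elements into a Python list for printing."""
    out = []
    while node is not None:
        out.append(node.head)
        node = node.tail
    return out


old_list = from_values(["x0", "x1", "x2", "x3", "x4"])
shared = drop(old_list, 3)                    # the [x3, x4] suffix
new_list = from_values(["y0", "y1"], shared)  # new cells only for y0 and y1

print(to_values(old_list))
print(to_values(new_list))
print(drop(new_list, 2) is shared)  # both lists end in the very same cells
```

Only two new cells are allocated for the new list. Because no cell can be modified, neither list can observe a change made through the other, which is what makes the sharing safe.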
That all data is immutable (except IO) is a strength. It simplifies data-flow analysis from a non-trivial computation to a trivial one. Combined with an often very strong type system, this gives the compiler a lot of information about the code that it can use for optimizations it normally could not do because of undecidability. Most often the compiler turns values that are recomputed recursively and discarded in each iteration (recursion) into a mutable computation. These two things give you the proof that if your program compiles, it will work (with some assumptions).
If you look at the language Rust (not functional), just by learning about its borrow system you will understand more about how and when things can be shared safely. It is a language that is painful to write code in unless you like seeing your computer scream at you that you are an idiot. Rust is for the most part a combination of the results of more than 40 years of study of programming languages and type theory. I mention Rust because, despite the pain of writing in it, it promises that if your program compiles, its safe code will have no dangling pointers or data races, even in multithreaded programs. (Memory leaks and deadlocks are still possible.) This is because it uses much of the research that has been done on functional programming languages.
For a more complex example of when functional programming uses less memory: I have made a lexer/parser interpreter (the same as a generator, but without the need to generate a code file). When computing the states of the DFA (deterministic finite automaton), it uses immutable sets. Because it computes new sets from already computed sets, my code allocates less memory, simply because it borrows already known data points instead of copying them to a new set.
To wrap it up: yes, functional programming can use more memory than imperative programming. Most likely that is because you are using the wrong abstraction to mirror the problem, i.e. if you try to do it the imperative way in a functional language, it will hurt you.
Try this book. It does not have much on memory management, but it is a good book to start with if you want to learn about compiler theory, and yes, it is legal to download. I have asked Torben; he is my old professor.
http://hjemmesider.diku.dk/~torbenm/Basics/

I'll throw my hat in the ring here. The short answer to the question is no, and this is because immutable does not mean the same thing as stored in memory. For example, let's take this toy program:
x = 2
x = x * 3
x = x * 2
print(x)
Which uses mutation to compute new values. Compare this to the same program which does not use mutation:
x = 2
y = x * 3
z = y * 2
print(z)
At first glance, it appears this requires 3x the memory of the first program! However, just because a value is immutable doesn't mean it needs to be stored in memory. In the case of the second program, after y is computed, x is no longer necessary, because it isn't used for the rest of the program, and can be garbage collected, or removed from memory. Similarly, after z is computed, y can be garbage collected. So, in principle, with a perfect garbage collector, after we execute the third line of code, I only need to have stored z in memory.
Another source of memory consumption that people often worry about in functional languages is deep recursion. For example, calculating a large factorial:
def calc_factorial(x):
    if x > 1:
        return x * calc_factorial(x - 1)
    else:
        return 1
If I run calc_factorial(100000), I could implement this in a way which requires storing 100000 pending calls in memory, or I could rewrite it so that the recursive call is the last operation (passing the running product along as an argument) and rely on tail-call elimination, which reuses the current stack frame instead of keeping every pending call. For less straightforward recursion you can resort to trampolining. So for functional languages which support this, recursion does not need to be a source of massive memory consumption, either. However, not all nominally functional languages do (for example, most JavaScript engines do not).
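Trampolining can be sketched in Python, which makes a useful test bed because CPython does not eliminate tail calls. The recursion above multiplies x by the result for x - 1; the version below carries the running product in an accumulator and, instead of recursing, returns a zero-argument function describing the next step. The names factorial_acc and trampoline are illustrative, not taken from any library.

```python
def factorial_acc(n, acc=1):
    """Accumulator form: the next step is returned as a thunk, not called."""
    if n <= 1:
        return acc
    return lambda: factorial_acc(n - 1, acc * n)


def trampoline(result):
    """Keep calling thunks until a non-callable final value comes back."""
    while callable(result):
        result = result()
    return result


print(trampoline(factorial_acc(5)))     # 120

# 5000 steps is deeper than CPython's default recursion limit,
# but the trampoline only ever has one step on the call stack.
big = trampoline(factorial_acc(5000))
print(big.bit_length())
```

The loop in trampoline plays the role that tail-call elimination plays in a language that supports it: the stack depth stays constant no matter how many steps the computation takes.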

Rules engine for spatial and temporal reasoning?

I have an application that receives a number of datums that characterize 3 dimensional spatial and temporal processes. It then filters these datums and creates actions which are then sent to processes that perform the actions. Rinse and repeat.
At present, I have a collection of custom filters that perform a lot of complicated spatial/temporal calculations.
Many times as I discuss my system to individuals in my company, they ask if I'm using a rules engine.
I have yet to find a rules engine that is able to reason well temporally and spatially. (Things like: When are two 3D entities ever close? Is 3D entity A ever contained in 3D region B? If entity C is near entity D but oriented backwards relative to C then perform action D.)
I have looked at Drools, Cyc, and Jess in the past (say 3-4 years ago). It's time to re-examine the state of the art. Any suggestions? Any standards that you know of that support this kind of reasoning? Any de facto standards? Any applications?
Thanks!

Premise - remember that a SQL-based [1] DBMS is a (quite capable) inference engine, as can be seen from these comparisons between SQL and Prolog:
prolog to SQL converter
difference between SQL and Prolog
To address specifically your spatio-temporal applications, this book will help:
TEMPORAL DATA AND THE RELATIONAL MODEL - A Detailed Investigation into the Application of Interval and Relation Theory to the Problem of Temporal Database Management.
That is, by combining interval and relation theory it is possible to reason about spatio-temporal problems effectively (see 5.2 Applications of Intervals).
Of course, if your SQL-based DBMS is not (yet) equipped with interval (and other) operators, you will need to extend it appropriately (via stored procedures and/or user-defined functions, UDFs).
Update: skimming the paper pointed out in comments by timemirror (Towards a 3D Spatial Query Language for Building Information Models) they do essentially what I touched on above:
(last page)
IMPLEMENTATION CONCEPTS
The implementation of the abstract type system into a query language will be performed on the basis of the query language SQL, which is a widely established standard in the field of object-relational databases. The international standard SQL:1999 extends the relational model to include object-oriented aspects, such as the possibility to define complex abstract data types with integrated methods.
I do not concur with the "object-relational database" terminology (for reasons that are off-topic here), but I think the rest is pertinent.
Update: a quote regarding 3D and interval theory from the book cited above:
NOTE: All of the intervals discussed so far can be thought of as one-dimensional. However, we might want to combine two one-dimensional intervals to form a two-dimensional interval. For example, a rectangular plot of ground might be thought of as a two-dimensional interval, because it is, by definition, an object with length and width, each of which is basically a one-dimensional interval measured along some axis. And, of course, we can extend this idea to any number of dimensions. For example, a (rather simple!) building might be regarded as a three-dimensional interval: It is an object with length, width, and height, or in other words a cuboid. (More realistically, a building might be regarded as a set of several such cuboids that overlap in various ways.) And so on. In what follows, however, we will restrict our attention to one-dimensional intervals specifically, barring explicit statements to the contrary, and we will omit the "one-dimensional" qualifier for simplicity.
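The idea in the quote, a cuboid as a combination of three one-dimensional intervals, is easy to sketch outside a DBMS as well. The Python below is an illustration only: Interval and Cuboid are hypothetical classes, intervals are assumed to be closed, and the coordinates are made up. It answers two of the questions from the original post, "do two 3D entities ever touch or overlap?" and "is entity A contained in region B?".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A closed one-dimensional interval [lo, hi]."""
    lo: float
    hi: float

    def overlaps(self, other):
        return self.lo <= other.hi and other.lo <= self.hi

    def contains(self, other):
        return self.lo <= other.lo and other.hi <= self.hi


@dataclass(frozen=True)
class Cuboid:
    """A three-dimensional interval: one Interval per axis."""
    x: Interval
    y: Interval
    z: Interval

    def axes(self):
        return (self.x, self.y, self.z)

    def overlaps(self, other):
        # Two boxes intersect only if they overlap on every axis.
        return all(a.overlaps(b) for a, b in zip(self.axes(), other.axes()))

    def contains(self, other):
        # Containment must likewise hold on every axis.
        return all(a.contains(b) for a, b in zip(self.axes(), other.axes()))


region_b = Cuboid(Interval(0, 10), Interval(0, 10), Interval(0, 3))
entity_a = Cuboid(Interval(2, 4), Interval(5, 6), Interval(0, 2))
entity_c = Cuboid(Interval(9, 12), Interval(9, 12), Interval(1, 5))

print(region_b.contains(entity_a))
print(region_b.overlaps(entity_c))
print(region_b.contains(entity_c))
```

Time can be treated as one more axis: adding a fourth Interval for the validity period turns the same per-axis test into a spatio-temporal one.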
Note [1]: I wrote SQL-based and not relational because there are ways to use such DBMSes that completely deviate from relational theory.

This is spatial reasoning. There are a few models, but DE-9IM is now accepted by the OGC and implemented in PostGIS and other tools.
PostGIS implements a spatial reasoning engine based on the Dimensionally Extended 9-Intersection Model (DE-9IM):
http://postgis.refractions.net/documentation/manual-svn/ch04.html#DE-9IM
Check section 4.3.6.1, Theory.
So does the Java Topology Suite (and NetTopologySuite for C#, etc.):
http://docs.codehaus.org/display/GEOTDOC/Point+Set+Theory+and+the+DE-9IM+Matrix
In particular, check out the geometry.relate functionality, such as
boolean isRelated = geometry.relate( geometry2, "T*T***T**" )
You can test the relationships, or filter data based on them.
It works with points, lines, polygons, etc.
This might help with the temporal side:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.87.4643&rep=rep1&type=pdf

Check out SpatialRules at http://www.objectfx.com/. It's a geospatial complex event processor for 2D and 3D.

How does a virtual machine work?

I've been looking into how programming languages work, and some of them have so-called virtual machines. I understand that this is some form of emulation of the programming language within another programming language, and that it works like how a compiled language would be executed, with a stack. Did I get that right?
With the proviso that I did, what bamboozles me is that many non-compiled languages allow variables with "liberal" type systems. In Python for example, I can write this:
x = "Hello world!"
x = 2**1000
Strings and big integers are completely unrelated and occupy different amounts of space in memory, so how can this code even be represented in a stack-based environment? What exactly happens here? Is x pointed to a new place on the stack and the old string data left unreferenced? Do these languages not use a stack? If not, how do they represent variables internally?

Your question should probably be titled "How do dynamic languages work?"
That's simple: they store the variable's type information along with its value in memory. This is done not only in interpreted or JIT-compiled languages but also in natively compiled languages such as Objective-C.

In most VM languages, variables can be conceptualized as pointers (or references) to memory in the heap, even if the variable itself is on the stack. For languages that have primitive types (int and bool in Java, for example) those may be stored on the stack as well, but they can not be assigned new types dynamically.
Ignoring primitive types, all variables that exist on the stack have their actual values stored in the heap. Thus, if you dynamically reassign a value to them, the original value is abandoned (and the memory cleaned up via some garbage collection algorithm), and the new value is allocated in a new bit of memory.
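The example from the question can be inspected directly in Python. The sketch below assumes CPython and uses only the standard sys module; the extra name first is introduced purely to keep the string alive so it can be looked at after x has been rebound.

```python
import sys

x = "Hello world!"
first = x                      # a second name for the same string object
print(type(x).__name__, sys.getsizeof(x))

x = 2 ** 1000                  # rebinds the name x; the string is not modified
print(type(x).__name__, sys.getsizeof(x))

print(first)                   # the string still exists while a name refers to it
print(x is first)              # x now refers to a different object
```

The two objects have different types and sizes, but the variable itself only ever holds a reference, so rebinding it never requires the old and new values to fit in the same space.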

The VM has nothing to do with the language. Any language can run on top of a VM (the Java VM has hundreds of languages already).
A VM enables a different kind of "assembly language" to be run, one that is more fit to adapting a compiler to. Everything done in a VM could be done in a CPU, so think of the VM like a CPU. (Some actually are implemented in hardware).
It's extremely low level, and in many cases heavily stack based--instead of registers, machine-level math is all relative to locations relative to the current stack pointer.
With normal compiled languages, many instructions are required for a single step. An addition might look like: "Grab the item from a point relative to the stack pointer into reg a, grab another into reg b, add reg a and b, and put reg a into a place relative to the stack pointer."
The VM does all this with a single, short instruction, possibly one or two bytes, instead of 4 or 8 bytes PER INSTRUCTION in machine language (depending on 32 or 64 bit architecture), which (guessing) should mean around 16 or 32 bytes of x86 machine code for 1-2 bytes of VM bytecode. (I could be wrong, my last x86 coding was in the 80286 era.)
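A toy stack machine shows how one short instruction can stand for the whole load/add/store sequence described above. This is a minimal Python sketch with an invented four-opcode instruction set (PUSH, LOAD, STORE, ADD); it is not the bytecode of the JVM or of any real VM.

```python
def run(program):
    """Execute a list of (opcode, argument) pairs on a tiny stack machine."""
    stack = []       # operand stack: holds references to values of any type
    variables = {}   # variable name -> reference to its current value
    for opcode, arg in program:
        if opcode == "PUSH":      # push a constant
            stack.append(arg)
        elif opcode == "LOAD":    # push the value a variable refers to
            stack.append(variables[arg])
        elif opcode == "STORE":   # pop a value and bind a variable to it
            variables[arg] = stack.pop()
        elif opcode == "ADD":     # pop two operands, push their sum
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return variables


# Hypothetical bytecode for:  x = "Hello world!";  x = 2 + 3;  y = x + x
program = [
    ("PUSH", "Hello world!"), ("STORE", "x"),
    ("PUSH", 2), ("PUSH", 3), ("ADD", None), ("STORE", "x"),
    ("LOAD", "x"), ("LOAD", "x"), ("ADD", None), ("STORE", "y"),
]
print(run(program))
```

Note that ADD carries no operands of its own; it finds them on the stack, which is why stack-based instruction encodings can be so compact.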
Microsoft used (probably still uses) VMs in their office products to reduce the amount of code.
The procedure for creating the VM code is the same as creating machine language, just a different processor type essentially.
VMs can also implement their own security, error recovery and memory mechanisms that are very tightly related to the language.
Some of my description here is summary and from memory. If you want to explore the bytecode definition yourself, it's kinda fun:
http://java.sun.com/docs/books/jvms/second_edition/html/Instructions2.doc.html

The key to many of the 'how do VMs handle variables like this or that' questions really comes down to metadata. The meta information that is stored and then updated gives the VM a much better handle on how to allocate variables and then do the right thing with them.
In many cases this is the type of overhead that can really get in the way of performance. However, modern implementations have come a long way in doing the right thing.
As for your specific questions: treating variables as vanilla objects comes down to reassigning / re-evaluating meta information on new assignments. That's why x can look one way and then another.

To answer a part of your question, I'd recommend a Google Tech Talk about Python, where some of your questions concerning dynamic languages are answered; for example, what a variable is (it is not a pointer, nor a reference, but in the case of Python a label).
