SIMD Programming - SSE

I am using the SSE extensions available on a Core 2 Duo processor (compiler gcc 4.4.1). I see that there are 16 registers available, each of which is 128 bits wide. Now, I can put four 32-bit integer values into a single register and another four into a second register, and using intrinsics I can add them in one instruction. The obvious advantage is that this way I require only 1 instruction instead of 4.
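For illustration, a minimal sketch of that one-instruction add using the gcc SSE2 intrinsics from <emmintrin.h> (the function and array names here are just for the example):

#include <emmintrin.h>   /* SSE2 intrinsics */

/* c[0..3] = a[0..3] + b[0..3]: four 32-bit adds in a single PADDD instruction */
void add4(const int *a, const int *b, int *c)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_add_epi32(va, vb);
    _mm_storeu_si128((__m128i *)c, vc);
}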
My question is: is that all there is to SIMD? Say I have a1, a2, a3, a4, a5, a6, a7, a8 and b1, b2, b3, b4, b5, b6, b7, b8, and let A1, A2, B1, B2 be vector registers. Now, A1 <<< (a1, a2, a3, a4)
and B1 <<< (b1, b2, b3, b4), and add(A1, B1) will perform the vector addition.
Let A2 <<< (a5, a6, a7, a8) and B2 <<< (b5, b6, b7, b8). Is there an add instruction which can do add(A1, B1) and add(A2, B2) simultaneously?
How many vector functional units are available in the Core 2 Duo, and where can I get this information?
Any other sources of information related to this are highly appreciated.

No, there isn't any single SSE instruction to do that. You need to issue two instructions. Are you thinking of something like the x86 string instructions and the REP prefix? There's no SSE equivalent.
The two 4-wide vector operations will be executed concurrently in the sense that all modern processors are highly pipelined. The second instruction will go down the pipe only 1 cycle behind the first (assuming the two aren't interdependent, which is the case in your example), so their execution will overlap in time, except for that one cycle.
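As a sketch of what "two instructions whose execution overlaps" looks like at the source level (again assuming the SSE2 intrinsics; the names are illustrative):

#include <emmintrin.h>

/* c[0..7] = a[0..7] + b[0..7]: two independent PADDD instructions that the
   pipeline can overlap, since the second does not depend on the first. */
void add8(const int *a, const int *b, int *c)
{
    __m128i a_lo = _mm_loadu_si128((const __m128i *)a);        /* a1..a4 */
    __m128i a_hi = _mm_loadu_si128((const __m128i *)(a + 4));  /* a5..a8 */
    __m128i b_lo = _mm_loadu_si128((const __m128i *)b);        /* b1..b4 */
    __m128i b_hi = _mm_loadu_si128((const __m128i *)(b + 4));  /* b5..b8 */
    _mm_storeu_si128((__m128i *)c,       _mm_add_epi32(a_lo, b_lo));
    _mm_storeu_si128((__m128i *)(c + 4), _mm_add_epi32(a_hi, b_hi));
}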
Each core of your multi-core processor has its own vector functional unit. You have to write multi-threaded code to take advantage of this.
Some CPUs have one vector unit per core; some have only half of one! In the latter case, the vector unit is only 64 bits wide and executes each 128-bit SSE instruction in two halves, one at a time. You get what you pay for.
You should look into AVX, the new instruction set extension that evolves SSE to support wider vector units.
Or you could look into real vector programming on a GPU with OpenCL or Cuda.

I don't think there's a single instruction to do this (unless they snuck one into a recent version of SSE).
However, since the operations that you're doing are independent, the processor can start the second add instruction before the first one finishes. So the timeline would look something like
begin C1 = A1 + B1
begin C2 = A2 + B2
wait
end C1 = A1 + B1
end C2 = A2 + B2
So even though you're using two instructions, you're not necessarily taking twice the time. The actual duration of the wait will depend on the processor and the latency of the particular instruction that you're using.
Here's a more detailed explanation of pipelining: http://en.wikipedia.org/wiki/Instruction_pipeline
For help on SIMD programming in general, Apple's SSE page is pretty good. It's somewhat geared towards people migrating applications from PowerPC to SSE, but there's some good general information there too.

The Intel site contains all the info you'll ever need!
http://www.intel.com/products/processor/manuals/
Edit, in answer to the comment: all the info is in the links above, but no. You could pack eight 16-bit integers into one register and thus perform eight simultaneous adds, but no, SSE does not allow adding two pairs of registers simultaneously.
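As a rough sketch of that 16-bit packing (SSE2's _mm_add_epi16; the array names are only illustrative):

#include <emmintrin.h>

/* c[0..7] = a[0..7] + b[0..7]: eight 16-bit adds in one PADDW instruction */
void add8_short(const short *a, const short *b, short *c)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)c, _mm_add_epi16(va, vb));
}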

Related

Quantum computing vs traditional base10 systems

This may show my naiveté, but it is my understanding that quantum computing's obstacle is stabilizing the qubits. I also understand that standard computers use binary (on/off); but it seems like it may be easier with today's tech to read electrical states between 0 and 9. Binary was the answer because it was very hard to read varying amounts of electricity, components degrade over time, and maybe maintaining a clean electrical "signal" was challenging.
But wouldn't it be easier to try to solve the problem of reading varying levels of electricity so we can go from 2 states to 10, thereby increasing the smallest unit of storage and exponentially increasing the number of paths through the logic gates?
I know I am missing quite a bit (sorry, the puns were painful), so I would love to hear why or why not.
Thank you
"Exponentially increasing the number of paths through the logic gates" is exactly the problem. More possible states for each n-ary digit means more transistors, larger gates and more complex CPUs. That's not to say no one is working on ternary and similar systems, but the reason binary is ubiquitous is its simplicity. For storage, more possible states also means we need more sensitive electronics for reading and writing, and a much higher error frequency during these operations. There's a lot of hype around using DNA (base-4) for storage, but this is more on account of the density and durability of the substrate.
You're correct, though, that your question is missing quite a bit - qubits are entirely different from classical information, whether we use bits or digits. Classical bits and trits respectively correspond to vectors like
Binary: |0> = [1,0]; |1> = [0,1];
Ternary: |0> = [1,0,0]; |1> = [0,1,0]; |2> = [0,0,1];
A qubit, on the other hand, can be a linear combination of classical states
Qubit: |Ψ> = α |0> + β |1>
where α and β are arbitrary complex numbers such that |α|² + |β|² = 1.
This is called a superposition, meaning even a single qubit can be in one of an infinite number of states. Moreover, unless you prepared the qubit yourself or received some classical information about α and β, there is no way to determine the values of α and β. If you want to extract information from the qubit you must perform a measurement, which collapses the superposition and returns |0> with probability |α|² and |1> with probability |β|².
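For a concrete (and standard) example: with α = β = 1/√2, the state |Ψ> = (1/√2)|0> + (1/√2)|1> returns |0> or |1> each with probability |1/√2|² = 1/2, and after the measurement the superposition is gone.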
We can extend the idea to qutrits (though, just like trits, these are even more difficult to effectively realize than qubits):
Qutrit: |Ψ> = α |0> + β |1> + γ |2>
These requirements mean that qubits are much more difficult to realize than classical digits of any base.

Could you explain this question? I am new to ML, and I faced this problem, but its solution is not clear to me.

The problem is in the picture; it is transcribed below.
Question 2
Many substances that can burn (such as gasoline and alcohol) have a chemical structure based on carbon atoms; for this reason they are called hydrocarbons. A chemist wants to understand how the number of carbon atoms in a molecule affects how much energy is released when that molecule combusts (meaning that it is burned). The chemist obtains the dataset below. In the column on the right, kJ/mole is the unit measuring the amount of energy released.
You would like to use linear regression (h_a(x) = a0 + a1·x) to estimate the amount of energy released (y) as a function of the number of carbon atoms (x). Which of the following do you think will be the values you obtain for a0 and a1? You should be able to select the right answer without actually implementing linear regression.
A) a0=−1780.0, a1=−530.9 B) a0=−569.6, a1=−530.9
C) a0=−1780.0, a1=530.9 D) a0=−569.6, a1=530.9
Since all of the a0 options are negative but two of the a1 options are positive, let's figure out the latter first.
As you can see, as the number of carbon atoms increases the energy becomes more and more negative, so the relationship cannot be positive, which rules out options C and D.
Then, for the intercept, the value that produces the least error is the correct one. For x = 1 and x = 10 (easier to calculate), the outputs are about −2300 and −7000 for option A, and −1100 and −5900 for option B, so one would prefer B over A.
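Spelling that arithmetic out: for option A, h(1) = −1780.0 − 530.9·1 ≈ −2311 and h(10) = −1780.0 − 530.9·10 ≈ −7089; for option B, h(1) = −569.6 − 530.9 ≈ −1100 and h(10) = −569.6 − 5309 ≈ −5879.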
PS: You might be thinking there should be obvious values for a0 and a1 from the data; there aren't. The intention of the question is to give you a general understanding of the best fit. Also, this way of solving it is itself a kind of machine learning.

How can I use machine learning to extract larger chunks of text from a document?

I am currently learning about machine learning, as I think it might be helpful to solve a problem I have. However, I am unsure about what techniques I should apply to solve my problem. I apologise in advance for probably not knowing enough about this field to even ask a proper question.
What I want to do is extract the significant parts of a knitting pattern (the actual pattern, not all the intro and stuff like that). For instance, I would like to feed this web page into my program and get out something like this:
{
title: "Boot Style Red and White Baby Booties for Cold Weather"
directions: "
Right Bootie.
Cast on (31, 43) with white color.
Rows (1, 3, 5, 7, 9, 10, 11): K.
Row 2: K1, M1, (K14, K20), M1, K1, M1, (K14, K20), M1, K1. (35, 47 sts)
Row 4: K2, M1, (K14, K20), M1, K3, M1, (K14, K20), M1, K2. (39, 51 sts)
Row 6: K3, M1, (K14, K20), M1, K5, M1, (K14, K20), M1, K3. (43, 55 sts)
..."
}
I've been reading about extracting smaller parts, like sentences and words, and also about stuff like Named Entity Recognition, but they all seem to be focused on very small parts of the text.
My current thoughts are to use supervised learning, but I'm also very unsure about how to extract features from the text. Naive methods like using letters, words or even sentences as features seem like they wouldn't be relevant enough to yield any kind of satisfactory results (and also, there would be tons of features, unless I use some kind of sampling), but what really are the significant features for finding out which parts are what in a knitting pattern?
Can someone point me in the right direction of algorithms and methods to do extraction of larger portions of the text?
One way to see this is as a straightforward classification problem: for each sentence in the page, you want to determine if it's relevant to you or not. Optionally, you have different classes of relevant sentences, such as "title" and "directions".
As a result, for each sentence you need to extract the features that contain information about its status. This will likely involve tokenizing the sentence, and possibly applying some type of normalization. Initially I would focus on features such as individual words (M1, K1, etc.) or n-grams (a number of adjacent words). Yes, there are many of them, but a good classifier will learn which features are informative, and which are not. If you're really worried about data sparseness, you can also reduce the number of features by mapping similar "words" such as M1 and K1 to the same feature.
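As a very rough sketch of that per-sentence feature extraction (written in C purely for illustration; the token pattern and the shared feature name are assumptions, not part of any particular NLP library):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Map tokens that look like knitting abbreviations (a letter followed only by
   digits, e.g. K1, M1, K20) onto one shared feature; keep other words as-is. */
static const char *feature_of(const char *tok)
{
    size_t n = strlen(tok);
    if (n >= 2 && isalpha((unsigned char)tok[0])) {
        for (size_t i = 1; i < n; i++)
            if (!isdigit((unsigned char)tok[i]))
                return tok;           /* ordinary word */
        return "STITCH_CODE";         /* K1, M1, K20, ... all collapse here */
    }
    return tok;
}

int main(void)
{
    char sentence[] = "Row 2: K1, M1, K14, M1, K1.";
    /* Print the bag-of-words features for this one sentence; a classifier such
       as Naive Bayes would then be trained on these features plus a label. */
    for (char *tok = strtok(sentence, " ,.:;()\n"); tok; tok = strtok(NULL, " ,.:;()\n"))
        printf("feature: %s\n", feature_of(tok));
    return 0;
}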
Additionally, you will need to label a set of example sentences, to serve as the training and test sets for your classifier. This will allow you to train the system, evaluate its performance and compare different approaches.
To start, you can experiment with some simple, but popular classification methods such as Naive Bayes.

Count occurrences of given character per cell

Question
For example, if I wanted to count the number of Ns in a column of strings, how can I do this in Google Spreadsheets on a per-cell basis (i.e. a formula that points at one cell at a time that I can drag down)?
Background
I'm having to decide on a threshold -min-overlap <integer> for a program called TOMTOM**, which compares similarity between PWMs*** of small DNA motifs****. N is the ambiguity code standing for any one of the letters A, C, G and T. It would be nice if I could get an idea of the distribution of the non-N lengths of my DNA motifs, to help inform a proper -min-overlap <integer> value for TOMTOM.
And here are some real examples:
** TOMTOM is a tool for comparing a DNA motif to a database of known motifs. See here for more info.
*** PWM stands for Position Weight Matrix:
According to Wiki: A position weight matrix (PWM), also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences.
According to this paper, it could be defined as:
Position weight matrix (PWM) or PWM‐like models are widely used to
represent DNA‐binding preferences of proteins (Stormo, 2000). In these
models, a matrix is used to represent the TF‐binding site (TFBS), with
each element representing the contribution to the overall binding
affinity from a nucleotide at the corresponding position. An inherent
assumption of traditional PWM models is position independence; that
is, the contribution of different nucleotide positions within a TFBS
to the overall binding affinity is assumed to be additive. Although
this approximation is broadly valid, nevertheless, it does not hold
for several proteins (Man & Stormo, 2001; Bulyk et al, 2002). To
improve quantitative modeling, PWM models have been extended to
include additional parameters, such as k‐mer features, to account for
position dependencies within TFBSs (Zhao et al, 2012; Mathelier &
Wasserman, 2013; Mordelet et al, 2013; Weirauch et al, 2013; Riley et
al, 2015). Interdependencies between nucleotide positions have a
structural origin. For example, stacking interactions between adjacent
base pairs form the local three‐dimensional DNA structure. TFs have
preferences for sequence‐dependent DNA conformation, which we call DNA
shape readout (Rohs et al, 2009, 2010).
OR, more contemporarily:
Based on this rationale, an alternative approach to augment
traditional PWM models is the inclusion of DNA structural features.
Models of TF–DNA binding specificity incorporating these DNA shape
features achieved comparable performance levels to models
incorporating higher‐order k‐mer features, while requiring a much
smaller number of parameters (Zhou et al, 2015). We previously
revealed the importance of DNA shape readout for members of the basic
helix‐loop‐helix (bHLH) and homeodomain TF families (Dror et al, 2014;
Yang et al, 2014; Zhou et al, 2015). We were also able, for Hox TFs,
to identify which regions in the TFBSs used DNA shape readout,
demonstrating the power of the approach to reveal mechanistic insights
into TF–DNA recognition (Abe et al, 2015). This capability was
extensively shown for only two protein families, due to the lack of
large‐scale high‐quality TF–DNA binding data. With the recent
abundance of high‐throughput measurements of protein–DNA binding, it
is now possible to dissect the role of DNA shape readout for many TF
families.
**** DNA motif: wiki: In genetics, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance. For proteins, a sequence motif is distinguished from a structural motif, a motif formed by the three-dimensional arrangement of amino acids, which may not be adjacent.
For one cell at a time (a formula to be copied down):
=len(A2)-len(SUBSTITUTE(A2,"N",""))
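For example, with ANNGTN in A2, LEN(A2) is 6 and removing the Ns leaves AGT with length 3, so the formula returns 3.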
I don't know if this is going to help, but let's say you have those strings in the range A2:A6 and you enter
=ArrayFormula(LEN(REGEXREPLACE(A2:A6, "[^N]", "")))
in B2; that should output the N count for the whole range.
=len(A2)-len(SUBSTITUTE(A2,"N",""))
This works, but if you want to count occurrences of some other character instead, say 3, then
=len(A2)-len(SUBSTITUTE(A2,"3",""))
is what you need.

Hardware implementation for integer data processing

I am currently trying to implement a datapath which processes image data expressed in grayscale as unsigned integers between 0 and 255. (Just for your information, my goal is to implement a Discrete Wavelet Transform in an FPGA.)
During the data processing, intermediate values will include negative numbers as well. As an example, one of the calculations is
result = 48 - floor((66+39)/2)
The floor function is used to guarantee integer data processing. For the above case, the result is -4, which is outside the 0-255 range.
Having mentioned the above case, I have a series of basic questions.
To deal with the negative intermediate numbers, do I need to represent all the data as 'equivalent unsigned numbers' in 2's complement for the hardware design? e.g. -4 decimal = 1111 1100 binary.
If I represent the data in 2's complement for the signed numbers, will I need 9 bits instead of 8? Or, how many bits will I need to process the data properly? (With 8 bits, I cannot represent any number above 127 in 2's complement.)
How does negative number division work if I use bit-wise shifting? If I want to divide the result, -4, by 4 by shifting it right by 2 bits, the result becomes 63 in decimal (0011 1111 in binary) instead of -1. How can I resolve this problem?
Any help would be appreciated!
If you can choose to use VHDL, then you can use the fixed point library to represent your numbers and choose your rounding mode, as well as allowing bit extensions etc.
In Verilog, well, I'd think twice. I'm not a Verilogger, but the arithmetic rules for mixing signed and unsigned datatypes seem fraught with foot-shooting opportunities.
Another option to consider might be MyHDL as that gives you a very powerful verification environment and allows you to spit out VHDL or Verilog at the back end as you choose.
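On the shift question specifically (question 3 above), the underlying issue is logical versus arithmetic right shift. Here is a minimal C sketch of the same bit pattern interpreted both ways; the same distinction applies to declaring the signal unsigned or signed in your HDL:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t u = 0xFC;   /* bit pattern 1111 1100 treated as unsigned */
    int8_t  s = -4;     /* same bit pattern treated as signed two's complement */

    /* Logical shift: zeros come in from the left -> 0011 1111 = 63 */
    printf("unsigned >> 2 = %u\n", (unsigned)(u >> 2));

    /* Arithmetic shift: the sign bit is replicated -> -1.
       (Right-shifting a negative value is implementation-defined in C,
       but gcc performs an arithmetic shift.) */
    printf("signed   >> 2 = %d\n", (int)(s >> 2));
    return 0;
}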
