Loading/Storing to XMFLOAT4 faster than using XMVECTOR?

I'm going through the DirectX Math/XNA Math library, and I got curious when I read about the alignment requirements for XMVECTOR (now DirectX::XMVECTOR), and how you are expected to use XMFLOAT* for members instead, using XMLoad* and XMStore* when performing mathematical operations. I was specifically curious about the tradeoffs, so I did an experiment, as I'm sure many others have, and tested to see exactly how much you lose by having to load and store the vectors for each operation. This is the resulting code:
#include <Windows.h>
#include <chrono>
#include <cstdint>
#include <DirectXMath.h>
#include <iostream>

using std::chrono::high_resolution_clock;

#define TEST_COUNT 1000000000L

int main(void)
{
    DirectX::XMVECTOR v1 = DirectX::XMVectorSet(1, 2, 3, 4);
    DirectX::XMVECTOR v2 = DirectX::XMVectorSet(2, 3, 4, 5);
    DirectX::XMFLOAT4 x{ 1, 2, 3, 4 };
    DirectX::XMFLOAT4 y{ 2, 3, 4, 5 };

    high_resolution_clock::time_point start, end;
    std::chrono::milliseconds duration;

    // Test with just the XMVECTOR
    start = high_resolution_clock::now();
    for (uint64_t i = 0; i < TEST_COUNT; i++)
    {
        v1 = DirectX::XMVectorAdd(v1, v2);
    }
    end = high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    DirectX::XMFLOAT4 z;
    DirectX::XMStoreFloat4(&z, v1);
    std::cout << std::endl << "z = " << z.x << ", " << z.y << ", " << z.z << std::endl;
    std::cout << duration.count() << " milliseconds" << std::endl;

    // Now try with load/store
    start = high_resolution_clock::now();
    for (uint64_t i = 0; i < TEST_COUNT; i++)
    {
        v1 = DirectX::XMLoadFloat4(&x);
        v2 = DirectX::XMLoadFloat4(&y);
        v1 = DirectX::XMVectorAdd(v1, v2);
        DirectX::XMStoreFloat4(&x, v1);
    }
    end = high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << std::endl << "x = " << x.x << ", " << x.y << ", " << x.z << std::endl;
    std::cout << duration.count() << " milliseconds" << std::endl;
}
Running a debug build yields the output:
z = 3.35544e+007, 6.71089e+007, 6.71089e+007
25817 milliseconds
x = 3.35544e+007, 6.71089e+007, 6.71089e+007
84344 milliseconds
Okay, so about thrice as slow, but does anyone really take perf tests on debug builds seriously? Here are the results when I do a release build:
z = 3.35544e+007, 6.71089e+007, 6.71089e+007
1980 milliseconds
x = 3.35544e+007, 6.71089e+007, 6.71089e+007
670 milliseconds
Like magic, XMFLOAT4 runs almost three times faster! Somehow the tables have turned. Looking at the code, this makes no sense to me; the second part runs a superset of the commands that the first part runs! There must be something going wrong, or something I am not taking into account. It is difficult to believe that the compiler was able to optimize the second part nine-fold over the much simpler, and theoretically more efficient, first part.
The only reasonable explanations I have involve either (1) cache behavior, (2) some crazy out-of-order execution that XMVECTOR can't take advantage of, (3) the compiler making some insane optimizations, or (4) using XMVECTOR directly having some implicit inefficiency that could be optimized away when using XMFLOAT4. That is, the default way the compiler loads and stores XMVECTORs from memory is less efficient than XMLoad* and XMStore*.
I attempted to inspect the disassembly, but I'm not all that familiar with x86 and/or SSE2, and Visual Studio does some crazy optimizations, making it difficult to follow along with the source code. I also tried the Visual Studio performance analysis tool, but that didn't help, as I can't figure out how to make it show the disassembly instead of the source. The only useful information I get out of it is that the first call to XMVectorAdd accounts for ~48.6% of all cycles, while the second call to XMVectorAdd accounts for ~4.4%.
EDIT:
After some more debugging, here is the assembly for the code that gets run inside of the loop. For the first part:
002912E0 movups xmm1,xmmword ptr [esp+18h] <-- HERE
002912E5 add ecx,1
002912E8 movaps xmm0,xmm2 <-- HERE
002912EB adc esi,0
002912EE addps xmm0,xmm1
002912F1 movups xmmword ptr [esp+18h],xmm0 <-- HERE
002912F6 jne main+60h (0291300h)
002912F8 cmp ecx,3B9ACA00h
002912FE jb main+40h (02912E0h)
And for the second part:
00291400 add ecx,1
00291403 addps xmm0,xmm1
00291406 adc esi,0
00291409 jne main+173h (0291413h)
0029140B cmp ecx,3B9ACA00h
00291411 jb main+160h (0291400h)
Note that these two loops are indeed nearly identical. The only difference is that the first loop appears to be the one doing the loading and storing! It would appear that Visual Studio made a ton of optimizations because x and y were on the stack. After changing them both to be heap-allocated (so the writes must actually happen), the machine code for the two loops is identical. Is this generally the case? Is there really no negative side effect to using the storage classes (other than missing out on fully optimized cases like the first loop, I suppose)?
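For reference, a minimal sketch of that heap change to the second loop (hx/hy are hypothetical names, not the code above):
auto* hx = new DirectX::XMFLOAT4(1, 2, 3, 4);
auto* hy = new DirectX::XMFLOAT4(2, 3, 4, 5);
for (uint64_t i = 0; i < TEST_COUNT; i++)
{
    // In practice the compiler no longer treats these stores as dead,
    // so the loads and stores are actually emitted.
    DirectX::XMVECTOR a = DirectX::XMLoadFloat4(hx);
    DirectX::XMVECTOR b = DirectX::XMLoadFloat4(hy);
    DirectX::XMStoreFloat4(hx, DirectX::XMVectorAdd(a, b));
}
delete hx;
delete hy;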

If you define
DirectX::XMVECTOR v3 = DirectX::XMVectorSet(2, 3, 4, 5);
and use v3 instead of v1 as the result:
...
for (uint64_t i = 0; i < TEST_COUNT; i++)
{
v3 = DirectX::XMVectorAdd(v1, v2);
}
you get code that is faster than the second part that uses XMLoadFloat4 and XMStoreFloat4. Since v3 is never read inside the loop and v1, v2 never change, the add is loop-invariant, so the compiler can hoist it out of the loop entirely.

Firstly, don't use Visual Studio's "high-resolution clock" for perf timing. You should use QueryPerformanceCounter instead. See Connect.
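A minimal sketch of what that looks like (assuming a Windows-only build, replacing the <chrono> calls in the test above):
#include <Windows.h>

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);   // ticks per second
QueryPerformanceCounter(&t0);
// ... code under test ...
QueryPerformanceCounter(&t1);
double ms = 1000.0 * double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);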
SIMD performance is difficult to measure in these micro tests because the overhead of loading up vector data can often dominate with such trivial ALU usage. You really need to do something substantial with the data to see the benefits. Also keep in mind that depending on your compiler settings, the compiler itself may be using the 'scalar' SIMD functionality or even auto-vectorizing such trivial code loops.
You are also seeing some issues with the way you are generating your test data. You should create something larger than a single vector on the heap and use that as your source/dest.
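For example, a minimal sketch of that kind of test (the sizes and the specific operations are made up for illustration; the point is several dependent vector ops per element, over heap data):
#include <DirectXMath.h>
#include <vector>

std::vector<DirectX::XMFLOAT4> src(100000), dst(100000);
// ... fill src with real data ...
for (size_t i = 0; i < src.size(); ++i)
{
    DirectX::XMVECTOR v = DirectX::XMLoadFloat4(&src[i]);
    // Do something substantial per element, not a single add:
    v = DirectX::XMVector4Normalize(DirectX::XMVectorMultiplyAdd(v, v, v));
    DirectX::XMStoreFloat4(&dst[i], v);
}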
PS: The best way to create 'static' XMVECTOR data is to use the XMVECTORF32 type.
static const DirectX::XMVECTORF32 v1 = { 1, 2, 3, 4 };
Note that if you want to have the load/save conversions between XMVECTOR and XMFLOATx to be "automatic", take a look at SimpleMath in the DirectX Tool Kit. You just use types like SimpleMath::Vector4 in your data structures, and the implicit conversion operators take care of calling XMLoadFloat4 / XMStoreFloat4 for you.
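For example, a minimal sketch (SimpleMath.h ships with the DirectX Tool Kit; Vector4 derives from XMFLOAT4 and converts implicitly):
#include "SimpleMath.h"
using DirectX::SimpleMath::Vector4;

Vector4 a(1, 2, 3, 4);
Vector4 b(2, 3, 4, 5);
// The operator XMVECTOR() conversion does the load, and the
// Vector4(FXMVECTOR) constructor does the store:
Vector4 c = DirectX::XMVectorAdd(a, b);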

Related

c++ vector with iterator can't be destroyed?

Recently I have faced a problem where, while the program is running, the memory keeps increasing, and when the program is closed the memory returns to a normal level. Obviously, it's a memory leak. After some work, I have located the code responsible, but I don't know why it leaks. The program's workflow is simple:
first, use the lidar API to get the point cloud and image data;
then, transport the data to the next tbb flow graph to process it;
finally, use the open3d API to visualize it.
In the first step, the lidar's own API uses asio to asynchronously invoke callback functions that transport the data, so I create some tbb concurrent_queue instances to store the data, plus an align function to match clouds and images by timestamp. The problem is in the align function. In it, I create a vector<shared_ptr<open3d::..::PointCloud>> and use an iterator to store the point cloud elements. However, I found that when the function completes, the shared_ptr use count doesn't decrease. A similar but simpler example looks like this:
#include <iostream>
#include <memory>
#include <utility>
#include <vector>

std::pair<std::shared_ptr<int>, int> helper() {
    auto a = std::make_shared<int>(90);
    auto c = 100;
    std::vector<std::pair<std::shared_ptr<int>, int>> container;
    container.reserve(5);              // reserves capacity only; size() is still 0
    auto iter = container.begin();
    for (int i = 0; i < 3; i++) {
        *iter = std::make_pair(a, c);  // UB: writes through an iterator to elements that don't exist
        iter++;
    }
    return *(iter - 1);
}

int main() {
    auto b = helper();
    std::cout << "shared_ptr use count: " << std::get<0>(b).use_count() << std::endl;
    return 0;
}
On Ubuntu 20.04 + gcc 9.4, the printed result is shared_ptr use count: 4.
Why isn't the vector automatically destroyed when the function completes? I hope someone can kindly explain this.
Thanks @Retired Ninja! The root of the problem is that vector::reserve only reserves capacity; it does not construct any elements, so the vector's size after reserve is still 0. The subsequent iterator writes therefore land in undefined memory. While the result happens to reach main without a visible error, this is undefined behavior, and the shared_ptr use count never drops back to 1 after the call, because copies were written into storage the vector doesn't know it owns and will never destroy.
To solve the problem, one can change the reserve to resize, which actually constructs the elements so the iterator points at defined objects. Or avoid the iterator entirely: just use push_back and return back(), as in the sketch below.
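A minimal sketch of the push_back fix (same hypothetical helper as above):
std::pair<std::shared_ptr<int>, int> helper() {
    auto a = std::make_shared<int>(90);
    auto c = 100;
    std::vector<std::pair<std::shared_ptr<int>, int>> container;
    container.reserve(5);                          // still fine as a capacity hint
    for (int i = 0; i < 3; i++) {
        container.push_back(std::make_pair(a, c)); // constructs real elements
    }
    return container.back();  // the vector is destroyed on return,
                              // so the use count drops as expected
}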

Fastest way of storing non-adjacent d registers with NEON intrinsics

I am porting 32bit NEON asm code to NEON intrinsics, and I am wondering if this code can be written in a concise way using intrinsics:
vst4.32 {d0[0], d2[0], d4[0], d6[0]}, [%[v1]]!
1) The previous code operates on q registers, but when it comes to storage, instead of using q0, q1, q2 and q3, it has to recreate vectors which have each part in one of the d registers, e.g. v1[0] = d0[0], v1[1] = d2[0] ... v2[0] = d0[1], v2[1] = d2[1] ... v3[0] = d1[0], v3[1] = d3[0] ... etc.
This operation is a one-liner in asm, but with intrinsics I don't know if I can do that without first splitting high and low bits and building a new float32x4x4_t variable to feed to vst4_f32.
Is that possible?
2) I'm not entirely sure what [%[v1]]! does (yes, I googled quite a bit): it should be a reference to a variable named v1, and the exclamation mark does writeback, meaning the pointer is increased by the amount written by the instruction on the same line.
Correct? Any way of replicating that with intrinsics?
After some more investigation I found that this specific intrinsic stores a given lane from each vector of an array of 4 vectors, so there is no need to split into high- and low-half variables:
float32x4x4_t u = { q0, q1, q2, q3 };
vst4q_lane_f32(v1, u, 0);
v1 += 4;
Writeback is just an increased pointer, as @charlesbaylis wrote.
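For completeness, the same thing as a self-contained function (assuming the four q registers arrive as float32x4_t parameters; the names are illustrative):
#include <arm_neon.h>

// Intrinsic form of: vst4.32 {d0[0], d2[0], d4[0], d6[0]}, [v1]!
void store_lane0(float32_t*& v1, float32x4_t q0, float32x4_t q1,
                 float32x4_t q2, float32x4_t q3)
{
    float32x4x4_t u = { { q0, q1, q2, q3 } };
    vst4q_lane_f32(v1, u, 0); // stores lane 0 of each vector, interleaved
    v1 += 4;                  // manual writeback
}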
In principle, a sufficiently smart compiler could use the instruction you want for the vst4_f32 intrinsic, but in practice, no compiler is that good.
To get the post-index writeback, you can write
vst4_f32(ptr, v);
ptr += 4;
Some compilers will recognise this. GCC 5.1 (when released) will do this in at least some cases.
[Edit: misread the question, vst4q_lane_f32 does map to the required instruction perfectly]
It seems to be inline assembly.
Anyway, the answers are:
1) No
2) Yes

OpenCV parallel_for not using multiple processors

I just saw in the new OpenCV 2.4.3 that they added a universal parallel_for. So following this example, I tried to implement it myself. I got it all functioning with my code, but when I timed its processing vs a similar loop done in a typical serial fashion with a regular "for" command, the results were insignificantly faster, or often a tiny bit slower!
I thought maybe this had something to do with my pushing into vectors or something (I'm a pretty big noob to parallel processing), so I set up a test loop that just runs through a big range of numbers, and it still doesn't help.
Code:
class Parallel_Test : public cv::ParallelLoopBody
{
private:
    double* const mypointer;
public:
    Parallel_Test(double* pointer)
        : mypointer(pointer) {
    }
    void operator()(const cv::Range& range) const
    {
        // This overload needs to be here, otherwise the class is considered abstract.
        // qDebug() << "This should never be called";
    }
    void operator()(const cv::BlockedRange& range) const
    {
        for (int x = range.begin(); x < range.end(); ++x) {
            mypointer[x] = x;
        }
    }
};

//TODO Loop pixels in parallel
double t = (double)cv::getTickCount();
//TEST PARALLEL LOOPING AT ALL
double data1[1000000];
cv::parallel_for(cv::BlockedRange(0, 1000000), Parallel_Test(data1));
t = ((double)cv::getTickCount() - t) / cv::getTickFrequency();
qDebug() << "Parallel TEST time " << t << endl;

t = (double)cv::getTickCount();
for (int i = 0; i < 1000000; i++) {
    data1[i] = i;
}
t = ((double)cv::getTickCount() - t) / cv::getTickFrequency();
qDebug() << "SERIAL Scan time " << t << endl;
output:
Parallel TEST time 0.00415479
SERIAL Scan time 0.00204597
Wow! I found the answer! "parallel_for" and "parallel_for_" (with a trailing underscore!) are totally different. You need the trailing underscore to make it work! Otherwise it will just run your loop in serial, and you will have to use a BlockedRange instead of a Range! AHH!
Thanks to @Daniil Osokin and especially @Vladislav Vinogradov for pointing this out!
So again, your code will need to look something like this:
cv::parallel_for_(Range(0, 1000000), Parallel_Test(data1));
More updated details at: http://answers.opencv.org/question/3730/how-to-use-parallel_for/
The problem is most likely that your loop body is too small.
It appears all you are doing is assigning values through a pointer.
You really need to think of a parallel for as an inefficient for loop; the work inside each iteration needs to be large enough that you wouldn't dream of getting speedups by unrolling the loop, because in addition to the usual decrement, compare, and jump, there are also a few interlocked instructions, perhaps a virtual function call or two, and some allocations.
So instead of copying a pointer, try doing a good amount of real math, or work on a large array of data.
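For instance, a minimal sketch of a heavier body (the amount of math per element is made up for illustration; anything nontrivial will do):
#include <cmath>
#include <opencv2/core/core.hpp>

class HeavyBody : public cv::ParallelLoopBody
{
    double* data;
public:
    explicit HeavyBody(double* p) : data(p) {}
    void operator()(const cv::Range& range) const
    {
        for (int i = range.start; i < range.end; ++i)
        {
            // Enough work per element that the threading overhead is amortized.
            data[i] = std::sqrt(std::sin((double)i) * std::cos((double)i) + 1.0);
        }
    }
};

// Usage: cv::parallel_for_(cv::Range(0, 1000000), HeavyBody(data1));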

Comparison Efficiency

What is generally faster:
if (num >= 10)
or:
if (!(num < 10))
The compiler will most likely optimize that sort of thing. Don't worry about it, just code for clarity in this case.
Assembly languages often have operations for >= and <= that are the same number of steps as < and >. For instance, with a Motorola 68k, if you want to compare the data registers %d0 and %d1 and branch if %d0 is greater than or equal to %d1, you would say something like:
cmp %d0, %d1 // compare %d0 and %d1, storing the result
// in the condition code registers.
bge labelname // Branch to the given label name if the comparison
// yielded "greater than or equal to" (hence bge)
It's a common mistake to think that a >= b means the computer will perform two operations instead of one because of that "or" in "greater than or equal to".
Any decent compiler will optimize those two statements to exactly the same underlying code. In fact, it will most likely generate exactly the same code for:
if (!(!(!(!(!(!(!(num < 10))))))))
I would opt for the first of yours just because its intent seems much clearer (mildly clearer than your second choice, massively clearer than that monstrosity I posted above). I tend to think in terms of how I would read it. Think of the two sentences:
if number is greater than or equal to ten.
if it's not the case that number is less than ten.
I believe the first one to be clearer.
In fact, just testing with "gcc -S" to get the assembler output, both statements generate the following code:
cmpl $9,-8(%ebp) ; compare value with 9
jle .L3 ; branch if 9 or less.
I believe you're wasting your time looking at micro-optimisations like this - you'd be far more efficient looking at things like algorithm selection. There's likely to be a much greater return on investment there.
In general any speed difference won't matter a great deal, but they don't necessarily mean exactly the same thing.
In many languages, comparing the floating point value NaN returns false for all comparisons, so if num = NaN, the first is false and the second true.
#include <iostream>
#include <limits>

int main ( ) {
    using namespace std;
    double num = numeric_limits<double>::quiet_NaN();
    cout << boolalpha;
    cout << "( num >= 10 ) " << ( num >= 10 ) << endl;
    cout << "( ! ( num < 10 ) ) " << ( ! ( num < 10 ) ) << endl;
    cout << endl;
}
outputs
( num >= 10 ) false
( ! ( num < 10 ) ) true
So the compiler can use a single instruction to compare num and the value 10 in the first case, but in the second may issue a second instruction to invert the result of the comparison. ( or it may just use a branch if zero rather than branch if non-zero, you can't say in general )
Other languages and compilers will vary, and for types where they really have the same semantics the code emitted might well be identical.

Parsing really big log files (>1Gb, <5Gb)

I need to parse very large log files (>1Gb, <5Gb) - actually I need to strip the data into objects so I can store them in a DB. The log file is sequential (no line breaks), like:
TIMESTAMP=20090101000000;PARAM1=Value11;PARAM2=Value21;PARAM3=Value31;TIMESTAMP=20090101000100;PARAM1=Value11;PARAM2=Value21;PARAM3=Value31;TIMESTAMP=20090101000152;PARAM1=Value11;PARAM2=Value21;PARAM3=Value31;...
I need to strip this into the table:
TIMESTAMP | PARAM1 | PARAM2 | PARAM3
The process need to be as fast as possible. I'm considering using Perl, but any suggestions using C/C++ would be really welcome. Any ideas?
Best regards,
Arthur
Write a prototype in Perl and compare its performance against how fast you can read data off of the storage medium. My guess is that you'll be I/O bound, which means that using C won't offer a performance boost.
This presentation about the use of Python generators blew my mind:
http://www.dabeaz.com/generators-uk/
David M. Beazley shows how to process multi-gigabyte log files by basically defining a generator for each processing step. The generators are then 'plugged' into each other until you have some simple utility functions
lines = lines_from_dir("access-log*","www")
log = apache_log(lines)
for r in log:
print r
which can then be used for all sorts of querying:
stat404 = set(r['request'] for r in log
if r['status'] == 404)
large = (r for r in log
if r['bytes'] > 1000000)
for r in large:
print r['request'], r['bytes']
He also shows that performance compares well to the performance of standard unix tools like grep, find etc.
Of course this being Python, it's much easier to understand and most importantly easier to customise or adapt to different problem sets than perl or awk scripts.
(The code examples above are copied from the presentation slides.)
Lex handles this sort of thing amazingly well.
But really, use AWK. Its performance is not bad, even compared with Perl, etc. Of course Map/Reduce would work quite well, but what about the overhead of splitting the file into appropriate chunks?
Try AWK
The key won't be the language because the problem is I/O bound, so pick the language that you feel most comfortable with.
The key is how it is coded. You'll be fine as long as you don't load the whole file into memory: load a chunk at a time, and save the data a chunk at a time; that will be more efficient.
Java has a PushbackInputStream that may make this easier to code. The idea is that you guess how much to read, and if you read too little, then push the data back, and read a larger chunk.
Then when you've read too much, process the data and then push back the remaining bit and continue to the next iteration of the loop.
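A C++ sketch of the same carry-the-remainder idea (illustrative only; the buffer size and record handling are assumptions, not a tested implementation):
#include <cstdio>
#include <string>

void parse_records(const char* path)
{
    std::string carry;                   // holds any partial record between chunks
    char buf[64 * 1024];
    std::FILE* f = std::fopen(path, "rb");
    if (!f)
        return;
    size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0) {
        carry.append(buf, n);
        size_t last = carry.rfind(';');  // end of the last complete field
        if (last == std::string::npos)
            continue;                    // no complete field yet; read more
        std::string complete = carry.substr(0, last + 1);
        carry.erase(0, last + 1);        // "push back" the incomplete tail
        // ... split `complete` on ';' into KEY=VALUE pairs and store them ...
    }
    std::fclose(f);
}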
Something like this should work.
use strict;
use warnings;

my $filename = shift @ARGV;
open my $io, '<', $filename or die "Can't open $filename";

my ($match_buf, $read_buf, $count);
while (($count = sysread($io, $read_buf, 1024, 0)) != 0) {
    $match_buf .= $read_buf;
    while ($match_buf =~ s{TIMESTAMP=(\d{14});PARAM1=([^;]+);PARAM2=([^;]+);PARAM3=([^;]+);}{}) {
        my ($timestamp, @params) = ($1, $2, $3, $4);
        print $timestamp . "\n";
        last unless $timestamp;
    }
}
This is easily handled in Perl, Awk, or C. Here's a start on a version in C for you:
#include <stdio.h>
#include <string.h>
#include <err.h>

int
main(int argc, char **argv)
{
    const char *filename = "noeol.txt";
    FILE *f;
    char buffer[1024], *s, *p;
    char line[1024];
    size_t n;

    if ((f = fopen(filename, "r")) == NULL)
        err(1, "cannot open %s", filename);
    while (!feof(f)) {
        n = fread(buffer, 1, sizeof buffer, f);
        if (n == 0) {
            if (ferror(f))
                err(1, "error reading %s", filename);
            else
                continue;
        }
        /* note: a field spanning a buffer boundary is not handled here */
        for (s = p = buffer; (size_t)(p - buffer) < n; p++) {
            if (*p == ';') {
                *p = '\0';
                strncpy(line, s, p - s + 1);
                s = p + 1;
                if (strncmp("TIMESTAMP", line, 9) != 0)
                    printf("\t");
                printf("%s\n", line);
            }
        }
    }
    fclose(f);
    return 0;
}
Sounds like a job for sed:
sed -e 's/;\?[A-Z0-9]*=/|/g' -e 's/\(^\|\)\|\(;$\)//g' < input > output
You might want to take a look at Hadoop (java) or Hadoop Streaming (runs Map/Reduce jobs with any executable or script).
If you code your own solution, you will probably benefit from reading larger chunks of data from the file and processing them in batches (rather than using, say, readline()), looking for the delimiter that marks the end of each record (here a semicolon rather than a newline, since the log has no line breaks). With this approach, you need to be mindful that you may not have retrieved the entirety of the last record, so some logic is required to handle that.
I don't know what performance benefits you'd realize, since I haven't tested it, but I've leveraged similar techniques with success.
I know this is an exotic language and may not be the best solution, but when I have ad hoc data, I consider PADS.
