Efficient way of copying between std::complex vector and Intel IPP complex array

I'm using Intel IPP for signal processing. The top-level functions use std::vector<std::complex<float> > data types, whereas the Intel IPP equivalent is Ipp32fc[]. The Ipp32fc data type is defined as
typedef struct {
Ipp32f re;
Ipp32f im;
} Ipp32fc;
From what I know, the Ipp32f data type is simply a C/C++ float. So far, I have been using a for loop for copying, and it loads the processor heavily, considering the symbol rate I'm processing. I have tried to use standard memcpy without much luck.
All suggestions are welcomed.

There is a function named ippsCopy_32f which copies the contents of one vector into another. You could try using it for the copy and see if it helps. The link below gives more details; the function is described under the Vector Initialization Functions section of the IPP developer reference guide.
https://www.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top/volume-1-signal-and-data-processing/vector-initialization-functions/vector-initialization-functions-1/copy.html
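As a rough sketch of a bulk copy that avoids the per-element loop: the C++ standard guarantees that an array of std::complex<float> can be accessed as a flat array of floats (real, imaginary, real, imaginary, ...), which matches the re/im layout of the struct in the question. The Fc32 struct below is a hypothetical stand-in for Ipp32fc so the example compiles without the IPP headers, and the function names to_fc32 and from_fc32 are made up for illustration.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in mirroring the Ipp32fc layout from the question,
// so this sketch does not depend on the IPP headers.
struct Fc32 {
    float re;
    float im;
};

// The bulk copy is only valid if both element types are two packed floats.
static_assert(sizeof(Fc32) == 2 * sizeof(float), "Fc32 must be two packed floats");
static_assert(sizeof(std::complex<float>) == 2 * sizeof(float),
              "std::complex<float> must be two packed floats");
static_assert(std::is_trivially_copyable<Fc32>::value,
              "memcpy requires a trivially copyable destination");

// Copy a std::complex<float> vector into Fc32 elements with one memcpy.
std::vector<Fc32> to_fc32(const std::vector<std::complex<float>>& src) {
    std::vector<Fc32> dst(src.size());
    if (!src.empty()) {
        // Array-oriented access: complex<float>[n] may be read as float[2 * n].
        const float* flat = reinterpret_cast<const float*>(src.data());
        std::memcpy(dst.data(), flat, src.size() * sizeof(Fc32));
    }
    return dst;
}

// Copy n Fc32 elements back into a std::complex<float> vector.
std::vector<std::complex<float>> from_fc32(const Fc32* src, std::size_t n) {
    std::vector<std::complex<float>> dst(n);
    if (n != 0) {
        assert(src != nullptr);
        float* flat = reinterpret_cast<float*>(dst.data());
        std::memcpy(flat, src, n * sizeof(Fc32));
    }
    return dst;
}
```

With the real library, the same idea would apply to Ipp32fc in place of Fc32, provided the size checks above still hold for that type.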

Related

DirectCompute: How to read from a RWTexture2D<float4>?

I have the following buffer:
RWTexture2D<float4> Output : register(u0);
This buffer is used by a compute shader for rendering a computed image.
To write a pixel in that texture, I just use code similar to this:
Output[XY] = SomeFunctionReturningFloat4(SomeArgument);
This works very well and my computed image is correctly rendered on screen.
Now at some stage in the compute shader, I would like to read back an
already computed pixel and process it again.
Output[XY] = SomeOtherFunctionReturningFloat4(Output[XY]);
The compiler returns an error:
error X3676: typed UAV loads are only allowed for single-component 32-bit element types
Any help appreciated.

In Compute Shaders, data access is limited on some data types, and not at all intuitive and straightforward. In your case, you use a
RWTexture2D<float4>
That is a typed UAV in DXGI_FORMAT_R32G32B32A32_FLOAT format.
This format is supported for typed UAV stores, but not for typed UAV loads.
Basically, you can write to it but not read from it. Typed UAV loads only support single-component 32-bit formats, in your case DXGI_FORMAT_R32_FLOAT, which holds one 32-bit component and nothing more.
Your code should run if you use a RWTexture2D<float> but I suppose this is not enough for you.
Possible workarounds that spring to mind are:
1. using 4 different RWTexture2D<float>, one for each component
2. using 2 different textures, RWTexture2D<float4> to write your values and Texture2D<float4> to read from
3. using a RWStructuredBuffer instead of the texture.
I don’t know your code so I don’t know if solutions 1. and 2. could be viable. However, I strongly suggest going for 3. and using StructuredBuffer. A RWStructuredBuffer can hold any type of struct and can easily cover all your needs. To be honest, in compute shaders I almost only use them to pass data. If you need the final output to be a texture, you can do all your calculations on the buffer, then copy the results on the texture when you’re done. I would add that drivers often use CompletePath to access RWTexture2D data, and FastPath to access RWStructuredBuffer data, making the former awfully slower than the latter.
Reference for data type access is here. Scroll down to UAV typed load.

How to save intensity value in sensor_msgs/Image from PointCloud?

I am using ROS-Kinetic. I have a pointcloud of type pcl::PointCloud<pcl::PointXYZI>. I have projected the same pointcloud on a plane. I would like to convert the planar pointcloud to an image of type sensor_msgs/Image.
toROSMsg(cloud, image);
This throws the following error:
error: ‘const struct pcl::PointXYZI’ has no member named ‘rgb’
memcpy (pixel, &cloud (x, y).rgb, 3 * sizeof(uint8_t));
Kindly enlighten me in this regard. If possible along with a code snippet.
Thanks in advance

If toROSMsg() is complaining that your input cloud does not have an 'rgb' member, try to input a cloud of type pcl::PointXYZRGB. This is another type of point cloud handled by PCL. You can look at the documentation of PCL point types.
Convert to type pcl::PointXYZRGB with these lines:
pcl::PointCloud<pcl::PointXYZRGB>::Ptr cloudrgb (new pcl::PointCloud<pcl::PointXYZRGB>);
pcl::copyPointCloud(*cloud, *cloudrgb);
Then call your function as:
toROSMsg(*cloudrgb, image);

What you try to achieve is some 2D voxelization. And I assume that you want to implement some "inverse sensor model" (ISM) as explained by Thrun, right?
This approach is commonly directly implemented into a mapping algorithm to circumvent the exhaustive calculation of the plain ISM.
Therefore, you'll hardly find an out of the box solution.
Anyway, you could do it in multiple ways like this:
Use pointcloud_to_laserscan for 2D projection (but you have it anyway)
Use the ISM alg. explained in the book
or
Transform the PCL to an octree
Downsample to a quadtree and convert it to an image

Fast way to swap endianness using opencl

I'm reading and writing lots of FITS and DNG images which may contain data of an endianness different from my platform and/or opencl device.
Currently I swap the byte order in the host's memory if necessary which is very slow and requires an extra step.
Is there a fast way to pass a buffer of int/float/short with the wrong endianness to an opencl-kernel?
Using an extra kernel run just for fixing the endianness would be ok; using some overheadless auto-fixing-read/-write operation would be perfect.
I know about the variable attribute __attribute__((endian(host/device))) but this doesn't help with a big endian FITS file on a little endian platform using a little endian device.
I thought about a solution like this one (neither implemented nor tested, yet):
uint4 mask = (uint4) (3, 2, 1, 0);
uchar4 swappedEndianness = shuffle(originalEndianness, mask);
// to be applied on a float/int-buffer somehow
Hoping there's a better solution out there.
Thanks in advance,
runtimeterror

Sure. Since you have a uchar4, you can simply swizzle the components and write them back.
output[tid] = input[tid].wzyx;
Swizzling is also very cheap on SIMD architectures, so you should be able to combine it with other operations in your kernel.
Hope this helps!

Most processor architectures perform best with instructions that operate on their full register width, for example 32 or 64 bits. When a CPU/GPU performs byte-wise operations, such as the .wzyx subscript on a uchar4, it needs to mask each byte out of the integer, shift it, and then combine it into the result with an integer add or or. For endianness swapping, the processor has to perform this and, shift, add/or sequence 4 times because there are 4 bytes.
The most efficient way is as follows
#define EndianSwap(n) (rotate((n) & 0x00FF00FF, 24U) | (rotate((n), 8U) & 0x00FF00FF))
n can be of any gentype, for example a uint4 variable. Because OpenCL does not allow C++-style overloading, a macro is the best choice.
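To see why the rotate-and-mask expression swaps all four bytes, here is a host-side C++ sketch of the same arithmetic on a single 32-bit value. The helper names rotl32 and endian_swap32 are hypothetical; rotl32 stands in for OpenCL's rotate(), which rotates left.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 32-bit value left by r bits (0 < r < 32), like OpenCL's rotate().
constexpr std::uint32_t rotl32(std::uint32_t v, unsigned r) {
    return (v << r) | (v >> (32u - r));
}

// Byte-swap a 32-bit value with two rotates, two masks and one or.
// For n = 0xAABBCCDD:
//   n & 0x00FF00FF            = 0x00BB00DD, rotated left by 24 = 0xDD00BB00
//   rotl(n, 8)                = 0xBBCCDDAA, masked             = 0x00CC00AA
//   or of the two halves      = 0xDDCCBBAA
constexpr std::uint32_t endian_swap32(std::uint32_t n) {
    return rotl32(n & 0x00FF00FFu, 24u) | (rotl32(n, 8u) & 0x00FF00FFu);
}

// Compile-time check of the worked example in the comment above.
static_assert(endian_swap32(0xAABBCCDDu) == 0xDDCCBBAAu,
              "rotate-and-mask must reverse the byte order");
```

Each rotate moves two of the four bytes into their final positions at once, which is why this needs fewer operations than extracting and shifting the bytes one by one.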

high performance buffers in objective-c

I'm wondering what the most applicable kind of buffer implementation is for audio data in objective-c. I'm working with audio data on the iPhone, where I do some direct data manipulation/DSP of the audio data while recording or playing, so performance matters. I have been doing iPhone development for some months now. Currently I'm dealing with c-arrays of element type SInt16 or Float32, but I'm looking for something better.
AFAIK, the performance of pointer-iterated c-arrays is unbeatable in an objective-c environment. However, pointer arithmetic and c-arrays are error prone. You always have to make sure that you do not access the arrays out of their bounds. You will not get a runtime error immediately if you do. And you have to make sure manually that you alloc and dealloc the arrays correctly.
Thus, I'm looking for alternatives. What high performance alternatives are there? Is there anything in objective-c similar to the c++ style std::vector?
By similar I mean:
good performance
iterable with a pointer-/iterator-based loop
no overhead of boxing/unboxing basic data types like Float32 or SInt16 into objective-c objects (btw, what's the correct word for 'basic data types' in objective-c?)
bounds-checking
possibility to copy/read/write chunks of other lists or arrays into and out of my searched-for list implementation
memory management included
I've searched and read quite a bit and of course NSData and NSMutableArray are among the mentioned solutions. However don't they double processing cost because of the overhead for the boxing/unboxing of basic data types? That the code looks outright ugly like a simple 'set'-operation becoming some dinosaur named replaceObjectAtIndex:withObject: isn't of my concern, but still it subtly makes me think that this class is not made for me.

NSMutableData hits one of your requirements in that it brings Objective-C memory management semantics to plain C buffers. You can do something like this:
NSMutableData* data = [NSMutableData dataWithLength: sizeof(Float32) * numberOfFloats];
Float32* cFloatArray = (Float32*)[data mutableBytes];
And you can then treat cFloatArray as a standard C array and use pointer iteration. When the NSMutableData object is dealloc'ed the memory backing it will be freed. It doesn't give you bounds checking, but it delivers memory management help while preserving the performance of C arrays.
Also, if you want some help from the tools in ironing out bounds-checking issues read up on Xcode's Malloc Scribble, Malloc Guard Edges and Guard Malloc options. These will make the runtime much more sensitive to bounds problems. Not useful in production, but can be helpful in ironing out issues during development.

The containers provided in the Foundation framework have little to offer for audio processing: they are on the whole rather heavy-weight and do not provide extrinsic iterators.
Furthermore, none of the audio APIs in iOS or MacOSX that interact with buffers of samples are Objective-C - based, or take any parameters of Foundation framework containers.
Most likely, you would want to make use of the Accelerate Framework for DSP operations, and its APIs all work on arrays of floats or int16s.
Whilst all of the APIs are C-style, C++ with the STL is the obvious weapon of choice for your requirements, and interworks cleanly with the rest of an application in the guise of Objective-C++. STL code frequently compiles down to something about as efficient as hand-crafted C.
To memory-manage your buffers, perhaps use std::array if you want bounds checking, or std::shared_ptr or std::unique_ptr with a custom deleter if you're not worried.
Places where an iterator is expected - for instance algorithm functions in <algorithm> - can usually also take pointers to basic types - such as your sample buffers.
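A minimal sketch of that last point: an <algorithm> function running directly over raw sample pointers, with the buffers owned by a standard container. It uses std::vector rather than std::array because the block size is assumed to be known only at run time; the function names and the 1/32768 scaling are illustrative assumptions, not part of any audio API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Convert 16-bit samples to floats in [-1, 1), using raw pointers as iterators.
void int16_to_float(const std::int16_t* in, float* out, std::size_t count) {
    std::transform(in, in + count, out,
                   [](std::int16_t s) { return static_cast<float>(s) / 32768.0f; });
}

// The vectors own the memory, so there is no manual alloc/free to get wrong.
std::vector<float> convert_block(const std::vector<std::int16_t>& samples) {
    std::vector<float> result(samples.size());
    int16_to_float(samples.data(), result.data(), samples.size());
    return result;
}
```

Inside the hot loop the code works on plain pointers, so there is no boxing overhead. Where bounds checking is wanted, for example in debug code, result.at(i) throws std::out_of_range for an invalid index, whereas result[i] performs no check.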

Can I safely assume that the destination Sample received by my DirectShow Transform Filter will have memory already allocated to it?

I've written a DirectShow Transform filter using Delphi 6 and the DSPACK library. I've examined the DSPACK base Filter classes and the code belonging to their 'WAV Dest' sample app, which is a Transform filter example. As far as I can tell, memory is not allocated by the receiving Filter for either the Transform filter's source IMediaSample or the destination IMediaSample parameters, although I do see the destination IMediaSample's length potentially adjusted using IMediaSample.SetActualLength().
I just want to make sure that I can rely on the code calling my Transform filter having already allocated memory for those two parameters so I don't have to, if that is indeed part of the DirectShow API specification. Otherwise, I assume I would need to do that allocation myself using CoTaskMemAlloc(). Can someone give me the definitive answer here?

See "Samples and Allocators". Filters are expected to pre-allocate buffers by negotiating an allocator with the connected peer pin, and the allocation itself takes place when the allocator is committed.
You have no way to allocate the buffers yourself with CoTaskMemAlloc as you suggested.
