Does Halide support ARMv8 (aarch64) with NEON? - arm64

I'd like to use Halide for an ARM A53 (aarch64) target with NEON vectorization.
But I cannot figure out how to create the Target object, and I cannot find an aarch64 target with a NEON feature in Target.h.
The code below, which I've tested, runs on the A53 target, but the generated code does not contain NEON instructions.
Target target("arm-64-linux"); // is it right?
Buffer<uint16_t> input(640,480);
Var x,y;
Func brighter("brighter");
brighter(x,y) = input(x,y) + 100;
brighter.estimate(x, 0, 640).estimate(y, 0, 480);
Pipeline p(brighter);
p.auto_schedule(target);
p.compile_to_static_library("./lib_dummy", {input}, "", target);

arm-64 is what Halide uses for aarch64, so your target is fine. To get NEON instructions, you need to be vectorizing something. I'm not sure whether the autoscheduler is doing that (it should be!). Try skipping the autoscheduler and scheduling by hand instead:
brighter.vectorize(x, 8);
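
To make the effect of that schedule concrete, here is a minimal plain C++ model of the loop nest that vectorize(x, 8) asks for. It is not Halide code and not what Halide emits; brighter_model and kLanes are hypothetical names, and the sketch assumes the width is a multiple of 8 (640 is). With 16-bit pixels, 8 lanes fill one 128-bit NEON register, so the inner loop is the part a backend can turn into a single vector add.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: models the loop structure requested by
// brighter.vectorize(x, 8). Assumes width % 8 == 0.
std::vector<uint16_t> brighter_model(const std::vector<uint16_t>& input,
                                     std::size_t width, std::size_t height) {
    constexpr std::size_t kLanes = 8;  // 8 x 16-bit = one 128-bit register
    std::vector<uint16_t> out(input.size());
    for (std::size_t y = 0; y < height; ++y) {
        // Outer loop walks x in chunks of kLanes pixels.
        for (std::size_t xo = 0; xo < width / kLanes; ++xo) {
            // Inner loop: fixed trip count, no dependencies between lanes,
            // so it can be replaced by one vector load, add and store.
            for (std::size_t xi = 0; xi < kLanes; ++xi) {
                const std::size_t i = y * width + xo * kLanes + xi;
                out[i] = static_cast<uint16_t>(input[i] + 100);
            }
        }
    }
    return out;
}
```

When inspecting the assembly generated for the real pipeline, this inner 8-wide step is the place where vector instructions would be expected to show up.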

Related

Trouble getting started with Metal shader compilation

I'm having trouble getting started with Metal's shader compilation.
How do I make an MTLLibrary that can link to an MTLDynamicLibrary (or MTLLinkedFunctions)? In particular, I want a library that declares extern functions to be resolved at runtime, when preloadedLibraries (or linkedFunctions) is provided in the compute pipeline descriptor. For example, I can compile the following to AIR using xcrun metal (with option -c), but then invoking xcrun metallib (even with option --split-module-without-linking) gives the error LLVM ERROR: Undefined symbol: _Z3addjj. In other words, how do I make a 'partially bound' Metal library?
// shader.h
extern uint add(uint a, uint b);
// shader.metal
#include "shader.h"
kernel void kernel_func(uint gid [[ thread_position_in_grid ]]) { add(gid,2); }
WWDC2021 mentions this extern technique, but the Dynamic Library Code Sample from the previous year doesn't use extern (or the installName), so I can't make sense of it.
When creating an executable library that uses a dynamic library, there are two points where you must include the dynamic library (I thought there was only one).
The process is different depending on whether the executable source is compiled at build time or at runtime. I'll describe the runtime case, because I haven't yet figured out the case where the executable library is created from a metallib file.
The first point is when you compile the executable, where you must include the dynamic library in the libraries field of the MTLCompileOptions. At this point the library serves only as a dummy: it lets the compiler check that a dynamic library defines the declared functions, so that linkage is possible. No linking happens at this stage, only the check.
The second point is when you create the pipeline state, where you must include the dynamic library in the preloadedLibraries field of the pipeline descriptor. This time, the dynamic library is not a dummy but the real library you plan to use, as it will be linked with the executable during pipeline creation.

msvc trying to compile cv::Matx<float,3,1> as a 4-element vector

Using MSVC 2017 and OpenCV 3.4. The code:
typedef Vec3f localcolor;
inline double lensqd(const localcolor & c) {
return c.ddot(c);
}
I get:
error C2338: Matx should have at least 4 elements. channels >= 4
note: while compiling class template member function 'cv::Matx<float,3,1>::Matx(_Tp,_Tp,_Tp,_Tp)'
when compiling the ddot function.
The compiler is trying to instantiate a 3-element vector with 4 initializers. I can't see anything in the OpenCV source code that would make this happen.
So do I file a bug report with MS?
And how do you suggest I get a working build? The code is this way because I sometimes want
typedef Vec4f localcolor;
which BTW compiles without error.
Could you show the ddot function?
I recently had the same error by attempting to initialize a Vec3f with 4 elements.
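
The compiler note names the four-argument constructor, and the "at least 4 elements" message reads like a static assertion inside it. Such an assertion only fires when something causes that constructor to be instantiated, which is consistent with the reply above. The following is a minimal sketch of that mechanism using a hypothetical SmallVec type; it is not OpenCV's code, and it does not identify which expression triggers the instantiation in the original project.

```cpp
#include <cstddef>

// Hypothetical stand-in for a fixed-size vector type; not OpenCV's code.
template <typename T, std::size_t N>
struct SmallVec {
    T val[N];

    SmallVec(T a, T b, T c) : val{} {
        static_assert(N >= 3, "SmallVec should have at least 3 elements");
        val[0] = a; val[1] = b; val[2] = c;
    }

    // The body (and its static_assert) is only instantiated if some
    // expression actually selects this overload.
    SmallVec(T a, T b, T c, T d) : val{} {
        static_assert(N >= 4, "SmallVec should have at least 4 elements");
        val[0] = a; val[1] = b; val[2] = c; val[3] = d;
    }

    // Dot product accumulated in double precision.
    double ddot(const SmallVec& other) const {
        double sum = 0.0;
        for (std::size_t i = 0; i < N; ++i) {
            sum += static_cast<double>(val[i]) * other.val[i];
        }
        return sum;
    }
};

typedef SmallVec<float, 3> localcolor;

inline double lensqd(const localcolor& c) {
    return c.ddot(c);
}

// Uncommenting the next line instantiates the 4-argument constructor
// for N == 3 and trips the N >= 4 assertion:
// localcolor bad(1.0f, 2.0f, 3.0f, 4.0f);
```

With this shape, lensqd itself compiles for a 3-element type, so the place to look is any code in the same translation unit that builds a localcolor from four values while the typedef is Vec3f.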

PCL 1.8.1 crash on cloud delete after StatisticalOutliersRemoval

I have an application which is built from several DLL files.
I'm trying to perform PCL's statistical outliers removal using the following code:
PointCloudWithRGBNormalsPtr pclCloud(new PointCloudWithRGBNormals());
ConvertPointCloudToPCL(in_out_cloud /*my own structure which includes xyz, rgb, nx ny nz*/, *pclCloud);
pcl::StatisticalOutlierRemoval<PointXYZRGBNormal> sor;
sor.setInputCloud(pclCloud);
sor.setMeanK(10);
sor.setStddevMulThresh(1.0);
sor.filter(*pclCloud);
ConvertPointCloudToPCL:
static void ConvertPointCloudToPCL(const std::vector<Cloud3DrgbN> &in, PointCloudWithRGBNormals &output)
{
for (auto it = in.begin(); it != in.end(); it++)
{
const Cloud3DrgbN &p3d = *it;
PointXYZRGBNormal p;
p.x = p3d.x;
p.y = p3d.y;
p.z = p3d.z;
p.normal_x = p3d.nX;
p.normal_y = p3d.nY;
p.normal_z = p3d.nZ;
p.r = p3d.r;
p.g = p3d.g;
p.b = p3d.b;
output.push_back(p);
}
}
For some reason, if I call this function from one of my DLLs it works as it should. However, when I call it from one particular DLL, I get an exception from Eigen's Memory.h file, in the handmade_aligned_free function, when pclCloud goes out of scope.
I'm using Windows 10 64-bit, PCL 1.8.1 and Eigen 3.3 (tried 3.3.4, same thing).
Update:
After further digging, I've found that EIGEN_MALLOC_ALREADY_ALIGNED was set to 0 because I'm using AVX2 in my "problematic" DLL. I'm still not sure though why using Eigen's "handmade" aligned malloc/free causes this crash.
There seems to be a known issue (see this) with Eigen, PCL & AVX
Well, I found the problem and how to solve it.
It seems that the DLLs that come with the "All in 1 installer" for Windows weren't compiled with AVX/AVX2 support.
When these libraries were linked with my own DLLs, which were compiled with AVX, the mismatch caused Eigen to use different allocation and deallocation routines on the two sides, which caused the crash.
I compiled PCL from source with AVX2, linked against these libraries, and everything works.
It's worth mentioning that the DLL that worked before now has issues, since it doesn't use AVX and PCL now does.
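
To illustrate why mixing the two allocation schemes crashes, here is a simplified sketch of a "handmade" aligned allocator. It assumes, as I understand Eigen 3.3's Memory.h, that the scheme over-allocates with malloc and stores the original pointer just below the aligned block; the names sketch_aligned_malloc, sketch_aligned_free and kAlign are hypothetical and this is not Eigen's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Assumed alignment for AVX-sized vectors (hypothetical constant).
constexpr std::size_t kAlign = 32;

// Over-allocate, round up to the next kAlign boundary, and stash the
// pointer returned by malloc in the slot just below the aligned block.
void* sketch_aligned_malloc(std::size_t size) {
    void* original = std::malloc(size + kAlign);
    if (original == nullptr) return nullptr;
    const std::uintptr_t raw = reinterpret_cast<std::uintptr_t>(original);
    void* aligned = reinterpret_cast<void*>(
        (raw & ~static_cast<std::uintptr_t>(kAlign - 1)) + kAlign);
    *(reinterpret_cast<void**>(aligned) - 1) = original;
    return aligned;
}

// Reads the stashed pointer back and frees that. If ptr came from a plain
// malloc in another module, the slot below it was never written, so this
// passes a garbage pointer to free.
void sketch_aligned_free(void* ptr) {
    if (ptr != nullptr) {
        std::free(*(reinterpret_cast<void**>(ptr) - 1));
    }
}
```

Under that assumption, a cloud allocated inside a non-AVX PCL DLL (plain malloc) and destroyed inside an AVX2 DLL (handmade free) would hit exactly the failure described in the comment on sketch_aligned_free, which matches the crash location reported above.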

Unresolved extern when compiling OpenCL to PTX using Clang?

I'm following the instructions on this SO answer, but when I try to run the resulting PTX file I get the following error in clBuild
ptxas fatal : Unresolved extern function 'get_group_id'
In the PTX file I have the following for every OpenCL function call I use
.func (.param .b64 func_retval0) get_group_id
(
.param .b32 get_group_id_param_0
)
;
The above isn't present in the PTX files created by the OpenCL runtime when I provide it with a CL file. Instead it has the proper special register.
Following these instructions (links against a different libclc library) gives me a segmentation fault during the LLVM IR to PTX compilation with the following error:
fatal error: error in backend: Cannot cast between two non-generic address spaces
Are those instructions still valid? Is there something else I should be doing?
I'm using the latest version of libclc, Clang 3.7, and Nvidia driver 352.39
The problem is that LLVM does not provide an OpenCL device code library. LLVM does, however, provide intrinsics for getting the IDs of a GPU thread. So you have to write your own implementations of get_global_id etc. using clang's builtins and compile them to LLVM bitcode with the nvptx target. Before you lower your IR to PTX, use llvm-link to link your device library with your compiled OpenCL module, and that's it.
An example of how you would write such a function:
#define __ptx_mad(a,b,c) ((a)*(b)+(c))
__attribute__((always_inline)) unsigned int get_global_id(unsigned int dimindx) {
switch (dimindx) {
case 0: return __ptx_mad(__nvvm_read_ptx_sreg_ntid_x(), __nvvm_read_ptx_sreg_ctaid_x(), __nvvm_read_ptx_sreg_tid_x());
case 1: return __ptx_mad(__nvvm_read_ptx_sreg_ntid_y(), __nvvm_read_ptx_sreg_ctaid_y(), __nvvm_read_ptx_sreg_tid_y());
case 2: return __ptx_mad(__nvvm_read_ptx_sreg_ntid_z(), __nvvm_read_ptx_sreg_ctaid_z(), __nvvm_read_ptx_sreg_tid_z());
default: return 0;
}
}
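
The error in the question is about get_group_id rather than get_global_id, so a device library along these lines would need the other work-item functions too. As a rough sketch of how they relate to the PTX special registers, here is a host-side model in plain C++ for a single dimension. It is not device code: SregModel and the model_ names are hypothetical, and it assumes no global work offset. In a real device library each field read would be replaced by the matching clang builtin, in the style of the example above.

```cpp
// Hypothetical host-side model of one dimension of the PTX special registers.
struct SregModel {
    unsigned tid;     // %tid:    thread index within the block
    unsigned ntid;    // %ntid:   number of threads per block
    unsigned ctaid;   // %ctaid:  block index within the grid
    unsigned nctaid;  // %nctaid: number of blocks in the grid
};

// OpenCL work-item functions expressed in terms of those registers.
inline unsigned model_get_local_id(const SregModel& r)   { return r.tid; }
inline unsigned model_get_local_size(const SregModel& r) { return r.ntid; }
inline unsigned model_get_group_id(const SregModel& r)   { return r.ctaid; }
inline unsigned model_get_num_groups(const SregModel& r) { return r.nctaid; }

// Same arithmetic as __ptx_mad(ntid, ctaid, tid) in the example above.
inline unsigned model_get_global_id(const SregModel& r) {
    return r.ntid * r.ctaid + r.tid;
}

inline unsigned model_get_global_size(const SregModel& r) {
    return r.ntid * r.nctaid;
}
```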

OpenCV 3.0, OpenCL and meanShiftFiltering

Based on the changes in OpenCV 3.0 and OpenCL, I cannot seem to get pyrMeanShiftFiltering to work using OpenCL. I know that ocl::meanShiftFiltering was supported in OpenCV 2.4.10. The two functions below take the same amount of time to execute.
How can I check which functions in OpenCV 3.0 are supported under OpenCL? Any suggestions?
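#include <opencv2/core/ocl.hpp> // attempting to use OpenCL
#include <opencv2/imgcodecs.hpp> // imread
#include <opencv2/imgproc.hpp> // pyrMeanShiftFiltering
using namespace cv;
using namespace ocl;
void meanShiftOCL()
{
setUseOpenCL(true);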
UMat in, out;
imread("./images/img.png").copyTo(in);
pyrMeanShiftFiltering(in, out, 40, 20, 3);
}
//not using openCL
void meanShift()
{
Mat in, out;
imread("./images/img.png").copyTo(in);
pyrMeanShiftFiltering(in, out, 40, 20, 3);
}
I'm not sure that there is a simple way to determine this with prebuilt OpenCV binaries, but you can recompile OpenCV yourself with an additional define (it can be specified in cmake):
CV_OPENCL_RUN_VERBOSE
With this define, every function that has an OpenCL implementation available prints the following message to the console (stdout):
<function name>: OpenCL implementation is running
Regarding your question: as far as I know, pyrMeanShiftFiltering currently doesn't have an OpenCL implementation.
