How to make an operation similar to _mm_extract_epi8 with non-immediate input? - sse

What I want is to extract a value from a vector using a variable scalar index,
like _mm_extract_epi8 / _mm256_extract_epi8 but with a non-immediate index.
(The vector holds several candidate results; the one at the given index turns out to be the true result, and the rest are discarded.)

In particular, if the index is in a GPR, the easiest way is probably to store val to memory and then movzx the selected byte into another GPR. Sample implementation in C:
uint8_t extract_epu8var(__m256i val, int index) {
    union {
        __m256i m256;
        uint8_t array[32];
    } tmp;
    tmp.m256 = val;
    return tmp.array[index];
}
Godbolt translation (note that a lot of overhead happens for stack alignment -- if you don't have an aligned temporary storage area, you could just vmovdqu instead of vmovdqa):

So far the best option seems to be using _mm_shuffle_epi8 for SSE:
uint8_t extract_epu8var(__m128i val, int index) {
    return (uint8_t)_mm_cvtsi128_si32(
        _mm_shuffle_epi8(val, _mm_cvtsi32_si128(index)));
}
Unfortunately this does not scale well to AVX, because vpshufb does not shuffle across lanes. There is a cross-lane shuffle, _mm256_permutevar8x32_epi32, but it works on 32-bit elements, so the result gets complicated:
uint8_t extract_epu8var(__m256i val, int index) {
    int index_low = index & 0x3;
    int index_high = (index >> 2);
    return (uint8_t)(_mm256_cvtsi256_si32(_mm256_permutevar8x32_epi32(
                         val, _mm256_zextsi128_si256(_mm_cvtsi32_si128(index_high))))
                     >> (index_low << 3));
}

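The index arithmetic in the AVX2 version above can be checked without any intrinsics. What follows is only a portable scalar model, not the SIMD code itself: it treats the 32 bytes as eight little-endian 32-bit words, selects word index >> 2 (the job of _mm256_permutevar8x32_epi32), and shifts right by (index & 3) * 8 bits. The name extract_model is made up for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of the permute-then-shift trick (hypothetical helper).
// `bytes` stands in for the 32 bytes of a __m256i.
std::uint8_t extract_model(const std::array<std::uint8_t, 32>& bytes, int index) {
    const int index_low  = index & 0x3;  // byte position inside the 32-bit word
    const int index_high = index >> 2;   // which of the eight 32-bit words

    // Assemble the selected word in little-endian order, as x86 stores it.
    // This models what the cross-lane permute moves into element 0.
    std::uint32_t word = 0;
    for (int b = 0; b < 4; ++b) {
        word |= static_cast<std::uint32_t>(bytes[4 * index_high + b]) << (8 * b);
    }

    // Shift the wanted byte down to bits 0..7 and truncate.
    return static_cast<std::uint8_t>(word >> (index_low << 3));
}
```

For every index in 0..31 the model returns bytes[index], which is the behaviour the intrinsic version is meant to have.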

About extending a Look Up Table at compile time

I'd like to extend my instrumenting profiler so that it affects performance as little as possible.
In my current implementation, I'm using a ProfilerHelper that takes one string and can be placed wherever you want in the f() being profiled.
The ctor starts the measurement and the dtor closes it, logging the delta in an unordered_map entry whose key is the string.
Now, I'd like to turn all of that into something faster.
First of all, I'd like to create a string LUT (Look Up Table) containing the f() names at compile time, and turn the unordered_map into a plain vector whose entries are paired with the function-name LUT by index.
Now the question is: I've managed to create a LUT of std::string_view, but I cannot find a way to extend it at compile time.
A first rough attempt looks like this:
#include <array>
#include <cstddef>
#include <iostream>
#include <string_view>

template<unsigned N>
constexpr auto LUT() {
    std::array<std::string_view, N> Strs{};
    for (unsigned n = 0; n < N; n++) {
        Strs[n] = "";
    }
    return Strs;
}

constexpr std::array<std::string_view, 0> StringsLUT { LUT<0>() };

constexpr auto AddString(std::string_view const& Str)
{
    constexpr auto Size = StringsLUT.size();
    std::array<std::string_view, Size + 1> Copy{};
    for (std::size_t i = 0; i < Size; ++i)
        Copy[i] = StringsLUT[i];
    Copy[Size] = Str;
    return Copy;
}

int main()
{
    constexpr auto Strs = AddString(__builtin_FUNCTION());
    //for (auto const Str : Strs)
    std::cout << Strs[0] << std::endl;
}
So my idea is to call AddString wherever needed in the f()s to be profiled, extending this list at compile time.
But of course I would have to take the returned Copy and replace StringsLUT every time, ending up with a final StringsLUT that contains all the f() names.
Is there a way to do that at compile time?
Sorry, but I'm only just entering the magic "new" world of constexpr applied to LUTs.
Thanks in advance for your support.
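The copy-and-extend step described above can be written generically so that it takes the existing table as a parameter. This is a minimal sketch, assuming C++17; Append, Lut0, Lut1 and Lut2 are hypothetical names, not part of the original post. It does not solve registration from scattered call sites: a constexpr variable cannot be reassigned, so each extension has to be bound to a new name with a new type.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns a copy of `lut` with `str` appended.
// The input is never modified; the result has type array<string_view, N + 1>.
template <std::size_t N>
constexpr std::array<std::string_view, N + 1>
Append(const std::array<std::string_view, N>& lut, std::string_view str) {
    std::array<std::string_view, N + 1> copy{};
    for (std::size_t i = 0; i < N; ++i) {
        copy[i] = lut[i];
    }
    copy[N] = str;
    return copy;
}

// Each step produces a new constant; the previous table stays unchanged.
constexpr std::array<std::string_view, 0> Lut0{};
constexpr auto Lut1 = Append(Lut0, "main");
constexpr auto Lut2 = Append(Lut1, "worker");

static_assert(Lut2.size() == 2);
static_assert(Lut2[0] == "main" && Lut2[1] == "worker");
```

Because every table is a distinct constant, the full list of names has to be assembled in one place (for example one header) rather than grown implicitly from inside each profiled function.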

Search for sequence in Uint8List

Is there a fast (native) method to search for a sequence in a Uint8List?
/// Return index of first occurrence of seq in list
int indexOfSeq(Uint8List list, Uint8List seq) {
EDIT: Changed List<int> into Uint8List
No. There is no built-in way to search for a sequence of elements in a list.
I am also not aware of any dart:ffi based implementations.
The simplest approach would be:
extension IndexOfElements<T> on List<T> {
  int indexOfElements(List<T> elements, [int start = 0]) {
    if (elements.isEmpty) return start;
    var end = length - elements.length;
    if (start > end) return -1;
    var first = elements.first;
    var pos = start;
    outer:
    while (true) {
      // Find the next candidate position by its first element.
      pos = indexOf(first, pos);
      if (pos < 0 || pos > end) return -1;
      for (var i = 1; i < elements.length; i++) {
        if (this[pos + i] != elements[i]) {
          // Mismatch: resume the search one position further on.
          pos++;
          continue outer;
        }
      }
      return pos;
    }
  }
}
This has worst-case time complexity O(length*elements.length). There are several more algorithms with better worst-case complexity, but they also have larger constant factors and more expensive pre-computations (KMP, BMH). Unless you search for the same long list several times, or do so in a very, very long list, they're unlikely to be faster in practice (and they'd probably have an API where you compile the pattern first, then search with it.)
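For comparison, the same naive scan is what std::search in the C++ standard library performs, with the same worst-case bound of at most length * elements.length comparisons. The sketch below is only an illustration in C++, not a Dart solution; index_of_seq is a hypothetical name chosen to mirror the function in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the index of the first occurrence of `seq` in `list` at or after
// `start`, or -1 if there is none. An empty `seq` matches at `start`.
std::ptrdiff_t index_of_seq(const std::vector<std::uint8_t>& list,
                            const std::vector<std::uint8_t>& seq,
                            std::size_t start = 0) {
    if (start > list.size()) return -1;
    auto first = list.begin() + static_cast<std::ptrdiff_t>(start);
    // std::search returns list.end() when no match exists.
    auto it = std::search(first, list.end(), seq.begin(), seq.end());
    if (it == list.end() && !seq.empty()) return -1;
    return it - list.begin();
}
```

A precompiled-pattern API of the kind mentioned above would split this into two steps (build a searcher from the pattern, then run it), which only pays off when the same pattern is reused.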
You could use dart:ffi to bind to memmem from string.h as you suggested.
We do the same with binding to malloc from stdlib.h in package:ffi (source).
final DynamicLibrary stdlib = Platform.isWindows
    ? DynamicLibrary.open('kernel32.dll')
    : DynamicLibrary.process();
final PosixMalloc posixMalloc =
    stdlib.lookupFunction<Pointer Function(IntPtr), Pointer Function(int)>('malloc');
Edit: as lrn pointed out, we cannot expose the inner data pointer of a Uint8List at the moment, because the GC might relocate it.
One could use dart_api.h and use the FFI to pass TypedData through the FFI trampoline as Dart_Handle and use Dart_TypedDataAcquireData from the dart_api.h to access the inner data pointer.
(If you want to use this in Flutter, we would need to expose Dart_TypedDataAcquireData and Dart_TypedDataReleaseData in dart_api_dl.h. I've filed an issue to track this.)
Alternatively, the GC relocation problem could be addressed so that we could just expose the inner data pointer of a Uint8List directly in the FFI trampoline.

ArrayFire seq to int c++

Imagine a gfor with a seq j...
If I need to use the value of the current instance of j as an index, how can I do that?
Something like:
vector<double> a(n);
gfor(seq j, n){
    //Do some calculation and save this on someValue
    a[j] = someValue;
}
Can someone help me (again)?
I've found a solution for this.
If someone has a better option, feel free to post it.
First, create a seq with the same size as your gfor range.
Then, convert that seq into an array.
Now, take the value at position j of that array (it equals the index):
seq sequencia(0, 200);
af::array sqc = sequencia;
//Inside the gfor loop
countLoop = (int) sqc(j).scalar<float>();
Your approach works, but it breaks gfor's parallelization: converting the index to a scalar forces it to be copied from the GPU back to the host, slamming the brakes on the GPU.
You want to do it more like this:
af::array a(200);
gfor(seq j, 200){
    //Do some calculation and save this on someValue
    a(j) = af::array(someValue); // for someValue a primitive type, say float; af::array is indexed with ()
}
// ... Now we're safe outside the parallel loop, let's grab the array results
float results[200];
a.host(results); // Copy array from GPU to host, populating a c-type array

Print cv::Mat opencv

I am trying to print a cv::Mat that contains my image. However, whenever I print the Mat using cout, a 2D array is printed into my text file. I want to print one pixel per line. How can I print the pixels of a cv::Mat one per line?
Here is a generic for_each loop; you could use it to print your data:
/**
 * @brief implementation details of for_each_channel; users should not call this function directly
 */
template<typename T, typename UnaryFunc>
UnaryFunc for_each_channel_impl(cv::Mat &input, int channel, UnaryFunc func)
{
    int const rows = input.rows;
    int const cols = input.cols;
    int const channels = input.channels();
    for(int row = 0; row != rows; ++row){
        // point at the requested channel of the first pixel in this row
        auto *input_ptr = input.ptr<T>(row) + channel;
        for(int col = 0; col != cols; ++col){
            func(*input_ptr);
            input_ptr += channels; // step to the same channel of the next pixel
        }
    }
    return func;
}
Use it like this:
for_each_channel_impl<uchar>(input, 0, [](uchar a){ std::cout<<(size_t)a<<", "; });
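The stepping in for_each_channel_impl relies on the channels of each pixel being interleaved within a row. The sketch below shows the same access pattern on a plain buffer, without OpenCV; for_each_channel_raw and its parameters are hypothetical and only model the layout, and the buffer is assumed to be tightly packed with no row padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Visits one channel of a row-major image whose channels are interleaved
// (e.g. B,G,R,B,G,R,...). `data` is a stand-in for the pixel storage.
template <typename T, typename UnaryFunc>
UnaryFunc for_each_channel_raw(const std::vector<T>& data, int rows, int cols,
                               int channels, int channel, UnaryFunc func) {
    for (int row = 0; row != rows; ++row) {
        // Start of this row in the flat buffer.
        const T* row_ptr =
            data.data() + static_cast<std::size_t>(row) * cols * channels;
        for (int col = 0; col != cols; ++col) {
            // Element `channel` of pixel `col`; same as stepping by `channels`.
            func(row_ptr[col * channels + channel]);
        }
    }
    return func;
}
```

To get one value per line, as the question asks, the callback can write each element followed by a newline instead of a comma.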
You could add an optimization for continuous single-channel data; then it might look like this:
/**
 * @brief apply an STL-like for_each algorithm on a channel
 * @param T : the type of the channel (e.g. uchar, float, double and so on)
 * @param channel : the channel to apply the for_each algorithm to
 * @param func : unary function that accepts an element in the range as argument
 * @return func
 */
template<typename T, typename UnaryFunc>
inline UnaryFunc for_each_channel(cv::Mat &input, int channel, UnaryFunc func)
{
    if(input.channels() == 1 && input.isContinuous()){
        return for_each_continuous_channels<T>(input, func);
    }
    return for_each_channel_impl<T>(input, channel, func);
}
This kind of generic loop has saved me a lot of time; I hope you find it helpful. If there are any bugs, or you have a better idea, please tell me.
I would like to design some generic algorithms for OpenCL too; sadly, it does not support templates. I hope one day CUDA will become an open standard, or OpenCL will support templates.
This works for any number of channels as long as the channel type is byte-based; non-byte channels may not work.

Java 2ME Swapping Vector elements

I want to sort my vector, but I need a swap function in order to do that. Is there any pre-defined method like Collections.swap(vector, index1, index2) in java2me?
That's pretty simple, actually:
private static void swap(Vector src, int i, int j)
{
    Object tmp = src.elementAt(i);
    src.setElementAt(src.elementAt(j), i);
    src.setElementAt(tmp, j);
}
Well, there's certainly a Collections.sort()
Edit: And since I'm apparently blind, as Bala R points out below, there's a swap as well, exactly as you describe.
