Do SIMD instructions require data to be aligned?

I'm confused about logical and physical memory alignment.
To use vector instructions such as SSE and AVX efficiently, we may need the data to be aligned.
Does this mean that data aligned in virtual memory is also aligned properly in physical memory?
If so, how do the compiler and the operating system make sure of that?

Related

Aligning memory on 16-byte and 32-byte boundaries

I'm doing several operations using SIMD instructions (SSE and AVX). As I understand, SSE instructions work best with 16-byte aligned memory, and AVX instructions work best with 32-byte aligned memory.
Would it be safe to always allocate memory aligned to 32-byte boundaries for optimal use with both SSE and AVX?
Are there ever any cases where 32-byte aligned memory is not also 16-byte aligned?
Alignment just means that the address is a multiple of 32. Any multiple of 32 is also a multiple of 16.
The first Google hit for "alignment" is Wikipedia, and you can follow the links to https://en.wikipedia.org/wiki/Data_structure_alignment#Definitions, which explains this in lots of detail.
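To make this concrete, here is a minimal C++ sketch (not from the original answers): a single 32-byte aligned allocation satisfies both the 16-byte requirement of aligned SSE loads and the 32-byte requirement of aligned AVX loads. std::aligned_alloc is C++17; on MSVC you would use _aligned_malloc instead.

```cpp
#include <cstdlib>      // std::aligned_alloc, std::free
#include <immintrin.h>  // SSE/AVX intrinsics (compile with e.g. -mavx)

int main() {
    constexpr std::size_t n = 1024;  // float count; n * sizeof(float) must be a multiple of 32

    // One 32-byte aligned block for both instruction sets.
    float* data = static_cast<float*>(std::aligned_alloc(32, n * sizeof(float)));
    for (std::size_t i = 0; i < n; ++i) data[i] = static_cast<float>(i);

    // Both aligned loads are valid here: an address that is a multiple
    // of 32 is automatically a multiple of 16 as well.
    __m128 sse_vec = _mm_load_ps(data);     // needs 16-byte alignment
    __m256 avx_vec = _mm256_load_ps(data);  // needs 32-byte alignment

    (void)sse_vec; (void)avx_vec;
    std::free(data);
}
```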

Add the same image with different offsets to the accumulating image on GPU

As the title states, I am trying to add the same image with different offsets, stored in a list, to the accumulating image.
The current implementation performs this on a CPU, and with some intrinsics it can be quite fast.
However, with larger images (2048x2048) and many offsets in the list (~10000), the performance is not satisfactory.
My question is, can the accumulation of the image with different offsets be efficiently implemented on a GPU?
Yes, you can, and the result will likely be much faster than on the CPU. The trick is to not send the data for each addition, and to not even launch a new kernel for each addition: the kernel you have should do a decent number of offset additions at once, at least 16 but possibly a few hundred, depending on your typical list size (and you can have more than one kernel, of course).
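As a rough C++ sketch of the batching idea (names and signatures are illustrative, not from the question): each output pixel accumulates contributions from a whole batch of offsets in one pass. On the GPU, the two outer pixel loops would be replaced by one thread per pixel, so a single kernel launch covers many offsets instead of one.

```cpp
#include <vector>

struct Offset { int dx; int dy; };

// Accumulate `src` shifted by every offset in `batch` into `acc`.
// Images are stored row-major, width * height elements each.
void accumulate_batch(const std::vector<float>& src, std::vector<float>& acc,
                      int width, int height, const std::vector<Offset>& batch) {
    for (int y = 0; y < height; ++y) {       // on the GPU: thread index y
        for (int x = 0; x < width; ++x) {    // on the GPU: thread index x
            float sum = 0.0f;
            for (const Offset& o : batch) {  // many offsets per launch
                const int sx = x - o.dx;
                const int sy = y - o.dy;
                if (sx >= 0 && sx < width && sy >= 0 && sy < height)
                    sum += src[sy * width + sx];
            }
            acc[y * width + x] += sum;
        }
    }
}
```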

Performance loss for using non-power-of-two textures

Is there any performance loss for using non-power-of-two textures under iOS? I have not noticed any in my quick benchmarks. I can save quite a bit of active memory by dumping them all together, since there is a lot of wasted padding (despite texture packing). I don't care about the older hardware that can't use them.
This can vary widely depending on the circumstances and your particular device. On iOS, the loss is smaller if you use NEAREST filtering rather than LINEAR, but it isn't huge to begin with (think 5-10%).
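For reference, a small hedged sketch (OpenGL ES 2.0 on iOS is assumed, and the function name is mine): setting up an NPOT texture with the cheaper NEAREST filtering mentioned above. Note that ES 2.0 additionally requires CLAMP_TO_EDGE wrapping and no mipmaps for NPOT textures.

```cpp
#include <OpenGLES/ES2/gl.h>

GLuint createNpotTexture(const void* pixels, GLsizei width, GLsizei height) {
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);

    // NEAREST filtering: the cheaper option for NPOT textures.
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

    // Required for NPOT textures on OpenGL ES 2.0.
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);

    // width and height do not need to be powers of two.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    return tex;
}
```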

Fast, reliable focus score for camera frames

I'm doing real-time frame-by-frame analysis of a video stream in iOS.
I need to assign a score to each frame for how in focus it is. The method must be very fast to calculate on a mobile device and should be fairly reliable.
I've tried simple things like summing after using an edge detector, but haven't been impressed by the results. I've also tried using the focus scores provided in the frame's metadata dictionary, but they're significantly affected by the brightness of the image, and much more device-specific.
What are good ways to calculate a fast, reliable focus score?
Poor focus means that edges are not very sharp, and small details are lost. High JPEG compression gives very similar distortions.
Compress a copy of your image heavily, unpack it, and calculate the difference with the original. An intense difference, even at a few spots, should mean that the source image had sharp details that were lost in compression. If the difference is relatively small everywhere, the source was already fuzzy.
The method can easily be tried in an image editor. (No, I have not tried it yet.) Hopefully the iPhone already has an optimized JPEG compressor.
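A minimal sketch of this idea, assuming OpenCV is available (the quality setting and the mean-absolute-difference score are illustrative choices, not from the answer):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

// Higher score = sharper frame: heavy compression destroys more detail
// in a sharp image than in an already fuzzy one. Expects an 8-bit
// grayscale frame.
double jpegFocusScore(const cv::Mat& gray) {
    // Compress a copy very heavily (quality 10 is an arbitrary choice).
    std::vector<unsigned char> jpeg;
    cv::imencode(".jpg", gray, jpeg, {cv::IMWRITE_JPEG_QUALITY, 10});
    cv::Mat crushed = cv::imdecode(jpeg, cv::IMREAD_GRAYSCALE);

    // Difference between the original and the heavily compressed copy.
    cv::Mat diff;
    cv::absdiff(gray, crushed, diff);

    // Mean absolute difference as a single scalar score.
    return cv::mean(diff)[0];
}
```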
A simple approach, which the human visual system probably uses as well, is to implement focusing on top of edge tracking: if a set of edges can be tracked across a visual sequence, you can work with the intensity profile of those edges alone and determine when it is steepest.
From a theoretical point of view, blur manifests as a loss of high-frequency content. Thus, you can just do an FFT and check the relative frequency distribution. The iPhone uses ARM Cortex chips with NEON instructions, which can be used for an efficient FFT implementation.
#9000's suggestion of heavy JPEG compression, which keeps only a small number of the largest transform coefficients, will usually result in what is in essence a low-pass filter.
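A hedged sketch of the frequency-domain check, using OpenCV's cv::dft for illustration rather than a NEON FFT (the low-frequency radius and the energy-ratio score are my own choices):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Fraction of spectral energy outside a small low-frequency disc;
// sharper frames should score higher.
double highFrequencyRatio(const cv::Mat& gray, int lowFreqRadius = 16) {
    cv::Mat img;
    gray.convertTo(img, CV_32F);
    img -= cv::mean(img);  // drop the DC term so brightness matters less

    cv::Mat spectrum;
    cv::dft(img, spectrum, cv::DFT_COMPLEX_OUTPUT);

    // Magnitude of the complex spectrum.
    cv::Mat planes[2], mag;
    cv::split(spectrum, planes);
    cv::magnitude(planes[0], planes[1], mag);

    // Rearrange quadrants so the zero frequency sits at the centre
    // (crop to even dimensions first).
    mag = mag(cv::Rect(0, 0, mag.cols & -2, mag.rows & -2));
    const int cx = mag.cols / 2, cy = mag.rows / 2;
    cv::Mat q0(mag, cv::Rect(0, 0, cx, cy)),  q1(mag, cv::Rect(cx, 0, cx, cy));
    cv::Mat q2(mag, cv::Rect(0, cy, cx, cy)), q3(mag, cv::Rect(cx, cy, cx, cy));
    cv::Mat tmp;
    q0.copyTo(tmp); q3.copyTo(q0); tmp.copyTo(q3);
    q1.copyTo(tmp); q2.copyTo(q1); tmp.copyTo(q2);

    const double total = cv::sum(mag)[0];
    cv::circle(mag, cv::Point(cx, cy), lowFreqRadius, cv::Scalar(0), -1);  // zero low frequencies
    const double high = cv::sum(mag)[0];
    return total > 0.0 ? high / total : 0.0;
}
```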
Consider different kinds of edges: e.g. peaks versus step edges. The latter will still be present regardless of focus. To isolate the former, use non-maximum suppression in the direction of the gradient. As a focus score, use the ratio of suppressed edges at two different resolutions.

The direction of stack growth and heap growth

In some systems the stack grows upward while the heap grows downward, and in other systems the stack grows downward while the heap grows upward. Which is the better design? Are there any programming advantages to either of the two layouts? Which is most commonly used, and why has a single approach never been standardized? Are they helpful for, or targeted at, certain specific scenarios? If so, what are they?
Heaps only "grow" in a direction in very naive implementations. As Paul R mentions, the direction a stack grows is defined by the hardware; on Intel CPUs it always grows toward smaller addresses (i.e. "downward").
I have read the works of Miro Samek and various other embedded gurus, and it seems that they are not in favor of dynamic allocation on embedded systems, probably due to the complexity and the potential for memory leaks. If you have a project that absolutely can't fail, you will probably want to avoid using malloc, so the heap will be small. Other, non-mission-critical systems could be just the opposite. I don't think there is a standard approach.
Maybe it just depends on the processor: whether it supports the stack growing upward or downward.
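If you want to see which way your own platform's stack grows, here is a small, deliberately non-portable C++ sketch (comparing addresses of locals in different frames is not strictly specified by the standard, and the noinline attribute is GCC/Clang-specific), useful only as an illustration:

```cpp
#include <cstdio>

// Compare the address of a local in a callee's frame with one in the
// caller's frame. Keep the callee out-of-line so the frames are distinct.
[[gnu::noinline]] void probe(const char* outer) {
    char inner;
    std::printf("stack grows %s\n",
                (&inner < outer) ? "toward lower addresses (downward)"
                                 : "toward higher addresses (upward)");
}

int main() {
    char outer;
    probe(&outer);
}
```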
