I want to speed up image processing that uses Hough circle detection.
// For all rows in the image:
for y := 0 to AnalysisBitmap.Height - 1 do
begin
  // For all pixels in one row:
  for x := 0 to AnalysisBitmap.Width - 1 do
  begin
    // Is there a point?
    if IsPixel(x, y, AnalysisBitmap, 128) then
    begin
      for theta := 0 to max_theta do
      begin
        TestPoint.x := Round(x - r * Cos(theta * PI / max_theta));
        TestPoint.y := Round(y - r * Sin(theta * PI / max_theta));
        if (TestPoint.x < ImageWidth) and (TestPoint.x > 0) and
           (TestPoint.y < ImageHeight) and (TestPoint.y > 0) then
          Inc(aHoughResult[TestPoint.x, TestPoint.y]);
      end;
    end;
  end;
end;
As the VCL bitmap is not thread-safe, I guess I can only parallelize the inner theta loop?
What is the best approach to speed up this code?
Yes, it is enough to parallelize only the inner loop. Don't forget to synchronize access to the shared aHoughResult array, for example with a critical section.
In recent Delphi versions you can use either OTL (OmniThreadLibrary) or the built-in System.Threading.TParallel facilities.
The most important speedup, I think, is to fill a table with the round(r*cos(theta*PI/max_theta)) values once and use it inside the loops.
I am trying to render YUV images on a FMX Form. I have been studying the following examples: https://github.com/grijjy/JustAddCode/tree/master/GpuProgramming
I have managed to render YUVNV12 and YUV420 images (which I decode from video files using FFMPEG library) using Delphi VCL + Direct3D. In case of VCL, I can find enough examples in C/C++ code which I can translate to Delphi (VCL).
I tried the following to render (YUV) images:
SDL2 library: works fine on VCL. Unusable on Android together with Delphi.
PXL library: works fine on Windows and Android (I am not able to test on other platforms: iOS, macOS...) except I need to convert YUV -> RGB before rendering. This is too slow on Android (phone/tablet).
Delphi VCL project using DirectX API calls, with source converted from C/C++ examples: works fine up to HD (Rec. 709) images. I have not yet been able to render HDR (Rec. 2020) images.
The final goal to be achieved is rendering YUV images without converting them to RGB/BGR via CPU on a FMX Form/Component, so it will be usable on multiple platforms.
In the example "03Texture", I added 2 textures: 1 for Y-plane and 1 for UV-plane:
property TextureY : TTexture read FTextureY write SetTextureY;
property TextureUV: TTexture read FTextureUV write SetTextureUV;
I made some changes in procedure "HandleImageChanged":
procedure TImageMaterialSource.HandleImageChanged(Sender: TObject);
begin
  //TImageMaterial(Material).Texture := TTexture.Create;
  //TImageMaterial(Material).Texture.PixelFormat := FMX.types.tpixelformat.a;
  if n12 <> nil then
  begin
    if TImageMaterial(Material).TextureY = nil then
    begin
      TImageMaterial(Material).TextureY := TTexture.Create;
      TImageMaterial(Material).TextureY.PixelFormat := FMX.types.tpixelformat.L;
      TImageMaterial(Material).TextureY.SetSize(n12.pitch, n12.height);
    end;
    if TImageMaterial(Material).TextureUV = nil then
    begin
      TImageMaterial(Material).TextureUV := TTexture.Create;
      TImageMaterial(Material).TextureUV.PixelFormat := FMX.types.tpixelformat.LA;
      TImageMaterial(Material).TextureUV.SetSize(n12.pitch, n12.height div 2);
    end;
    TImageMaterial(Material).TextureY.UpdateTexture(n12.Y, n12.width);
    TImageMaterial(Material).TextureUV.UpdateTexture(n12.UV, n12.width);
  end;
end;
As you can see I am trying to update Y and UV textures hoping that they are being sent to the GPU. I used "FMX.types.tpixelformat.L -> 1byte" for textureY for which I used DXGI_FORMAT_R8_UNORM on my VCL code for DirectX. For textureUV, "FMX.types.tpixelformat.LA - 2bytes" <- DXGI_FORMAT_R8G8_UNORM - I don't know if these are good choices but just for trying...
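For reference, a small C sketch of how an NV12 frame splits into the two textures discussed here. Nv12Layout and nv12_layout are hypothetical names. The sizes follow from the NV12 layout (a full-resolution Y plane followed by one interleaved U/V plane at half resolution in both directions); whether FMX's TPixelFormat.LA actually maps to a two-channel 8-bit GPU format is an assumption that is not confirmed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of the two textures an NV12 frame maps to. */
typedef struct {
    size_t y_width, y_height;    /* texels, 1 byte each (e.g. R8)         */
    size_t uv_width, uv_height;  /* texels, 2 bytes each (e.g. R8G8)      */
    size_t y_bytes, uv_bytes;    /* plane sizes in bytes, including pitch */
} Nv12Layout;

Nv12Layout nv12_layout(size_t width, size_t height, size_t pitch)
{
    assert(width % 2 == 0 && height % 2 == 0 && pitch >= width);
    Nv12Layout l;
    l.y_width   = width;
    l.y_height  = height;
    l.uv_width  = width / 2;    /* one (U, V) pair covers 2x2 luma pixels */
    l.uv_height = height / 2;
    l.y_bytes   = pitch * height;
    l.uv_bytes  = pitch * (height / 2);
    return l;
}
```

Under these assumptions a two-channel UV texture is half as wide in texels as the Y texture, while both planes share the same pitch in bytes.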
Vertex shader:
struct VS_INPUT
{
float4 Pos : POSITION;
float2 Tex : TEXCOORD;
};
struct VS_OUTPUT
{
float4 Pos : SV_POSITION;
float2 Tex : TEXCOORD;
};
VS_OUTPUT main(VS_INPUT input)
{
return input;
}
PixelShader:
Texture2D Texture;
Texture2D TextureY;
Texture2D TextureUV;
SamplerState theSampler;
struct PixelShaderInput
{
float4 pos : SV_POSITION;
float2 tex : TEXCOORD0;
float4 color : COLOR0;
};
float4 main(PixelShaderInput input) : SV_TARGET
{
const float3 offset = {0.0, -0.501960814, -0.501960814};
const float3 Rcoeff = {1.0000, 0.0000, 1.4020};
const float3 Gcoeff = {1.0000, -0.3441, -0.7141};
const float3 Bcoeff = {1.0000, 1.7720, 0.0000};
float4 Output;
float3 yuv;
yuv.x = TextureY.Sample(theSampler, input.tex).x;
yuv.yz = TextureUV.Sample(theSampler, input.tex).yz;
yuv += offset;
Output.r = dot(yuv, Rcoeff);
Output.g = dot(yuv, Gcoeff);
Output.b = dot(yuv, Bcoeff);
Output.a = 1.0f;
//return Output * input.color;
return float4(Output);
//return Texture.Sample(Sampler, input.tex);
}
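To check the shader output against known values, the same arithmetic can be run on the CPU. This is a minimal C sketch using the coefficients from the pixel shader above, scaled to 0..255, so the 0.50196 offset becomes 128. The names clamp_u8 and yuv_to_rgb are made up for this example, and full-range YUV input is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to [0, 255] and round to nearest. */
uint8_t clamp_u8(double v)
{
    if (v <= 0.0) return 0;
    if (v >= 255.0) return 255;
    return (uint8_t)(v + 0.5);
}

/* Full-range YUV -> RGB with the shader's coefficients, per pixel. */
void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v,
                uint8_t *r, uint8_t *g, uint8_t *b)
{
    assert(r && g && b);
    double cb = u - 128.0;   /* centred chroma, like "yuv += offset" */
    double cr = v - 128.0;
    *r = clamp_u8(y + 1.4020 * cr);
    *g = clamp_u8(y - 0.3441 * cb - 0.7141 * cr);
    *b = clamp_u8(y + 1.7720 * cb);
}
```

Neutral chroma (U = V = 128) must give a grey pixel with R = G = B = Y, which is a quick sanity test for both this function and the shader.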
function for reading a NV12 file:
PNV12Frame = ^TNV12Frame;
TNV12Frame = record
width,
height,
pitch:Cardinal;
Y:PByte;
UV:PByte;
end;
function ReadNV12FromFile(fn:TFileName):PNV12Frame;
var f:TFileStream;
xsize,
readBytes:Integer;
nv12Frame:PNV12Frame;
begin
f := TFileStream.Create(fn, fmOpenRead);
//FILE *file = nullptr;
//sprintf_s(buf, "content\\16.nv12");
//fopen_s(&file, buf, "rb");
xsize := sizeof(TNV12Frame);
nv12Frame := GetMemory(xsize);
FillChar(nv12Frame^, xsize, 0);
//readBytes := fread(nv12Frame, size, 1, file);
f.Position := 0;
readBytes := f.Read(nv12frame^, xsize);
xsize := nv12Frame.pitch * nv12Frame.height;
nv12Frame.Y := GetMemory(xsize); //(BYTE *)malloc(size);
readBytes := f.ReadData(nv12Frame.Y, xsize);
xsize := nv12Frame.pitch * nv12Frame.height div 2;
nv12Frame.UV := GetMemory(xsize); //(BYTE *)malloc(size);
readBytes := f.ReadData(nv12Frame.UV, xsize);
f.Free;
//fclose(file);
Result := nv12Frame;
end;
How do I set (or do I even need to set) an equivalent of the following, which are used in the (VCL D3D) API calls?
const vertexDesc:array[0..2] of D3D11_INPUT_ELEMENT_DESC =
(
( SemanticName:'POSITION' ;SemanticIndex: 0;Format: DXGI_FORMAT_R32G32B32_FLOAT
;InputSlot: 0; AlignedByteOffset: 0; InputSlotClass: D3D11_INPUT_PER_VERTEX_DATA;
InstanceDataStepRate: 0 ),
( SemanticName:'TEXCOORD' ;SemanticIndex: 0;Format: DXGI_FORMAT_R32G32_FLOAT
;InputSlot: 0; AlignedByteOffset: 12; InputSlotClass: D3D11_INPUT_PER_VERTEX_DATA;
InstanceDataStepRate: 0 ),
( SemanticName:'COLOR' ;SemanticIndex: 0;Format:
DXGI_FORMAT_R32G32B32A32_FLOAT;InputSlot: 0; AlignedByteOffset: 20; InputSlotClass:
D3D11_INPUT_PER_VERTEX_DATA; InstanceDataStepRate: 0 )
);
Vertices:array [0..NUMVERTICES-1] of TVERTEX =
(
(Pos:(x:-1.0; y:-1.0; z:0); TexCoord:(x:0.0; y:1.0)),
(Pos:(x:-1.0; y: 1.0; z:0); TexCoord:(x:0.0; y:0.0)),
(Pos:(x: 1.0; y:-1.0; z:0); TexCoord:(x:1.0; y:1.0)),
(Pos:(x: 1.0; y:-1.0; z:0); TexCoord:(x:1.0; y:1.0)),
(Pos:(x:-1.0; y: 1.0; z:0); TexCoord:(x:0.0; y:0.0)),
(Pos:(x: 1.0; y: 1.0; z:0); TexCoord:(x:1.0; y:0.0))
);
As you can see, I need an example/explanation for dummies. I have just started experimenting with GPU programming, DirectX rendering etc. Another thing is that I am trying to use as few external libraries as possible. For example: I used SDL2 and tried the BASS library for audio, but later I managed to play audio (files/streams) using WaveAudio on Windows and AudioTrack on Android, which seem to work perfectly for now.
---Edit---
Creating a TTexture with TPixelFormat.L makes no sense because this format is not converted to a DXGI format. So for now I have given up on using TForm3D and TPlane as in the example "03Texture".
The temporary solution is as follows (Windows DX11):
Use the PXL library.
Create 1 TTexture of PixelFormat L8, which is converted to DXGI_FORMAT_R8_UNORM as defined in the PXL library.
Set the height of the texture to double that of the picture.
Write the Y + U + V planes to the texture. See: https://github.com/yabadabu/dx11_video_texture
Use the pixel shader from the link above (3) for FEffectTexturedL declared in PXL.Canvas.DX11.
The next task is to achieve the same for Android/OpenGL, after testing with different YUV types (YUV 601, 709, JPEG, NV12 601, ...).
---Edit 2---
The temporary solution seems to work fine on Windows, but it is again too slow on Android OpenGL, even with the same approach as described in the first edit.
The only difference between the Windows and Android approaches is that on Android I have to copy the data twice, because there I cannot update the texture data directly from a thread, which is possible on Windows.
On the other hand, copying twice is not a problem on Windows.
After a lot of tweaking, I am considering leaving Delphi/FMX for creating a video player, at least for mobile devices.
I hope not to get downvoted for this, but it seems there is no good/fast enough solution for this.
It's possible to render (video) (YUV) images on an FMX form or component using some tweaks or libraries.
When the image size is larger than FHD, it's too slow on Android (tablet/phone). I have no idea how fast or slow it is on iOS or on an Android TV.
TALVideoplayerSurface, which uses ExoPlayer, is about as fast (or slow) as the other approaches.
The ExoPlayer examples for Android Studio are fast enough when playing UHD videos on Android.
Currently it seems that Android Studio is the best choice for creating a video player.
I am developing a program that solves a system of equations. When it gives me the results, it is like: "x1= 1,36842". I'd like to get the fraction of that "1,36842", so I wrote this code.
procedure TForm1.Button1Click(Sender: TObject);
var numero,s:string;
a,intpart,fracpart,frazfatta:double;
y,i,mcd,x,nume,denomin,R:integer;
begin
a:=StrToFloat(Edit1.Text); //get the value of a
IntPart := Trunc(a); // here I get the numerator and the denominator
FracPart := a-Trunc(a);
Edit2.Text:=FloatToStr(FracPart);
numero:='1';
for i:= 1 to (length(Edit2.Text)-2) do
begin
numero:=numero+'0';
end; //in this loop it creates a string that has many 0 as the length of the denominator
Edit3.text:=FloatToStr(IntPart);
y:=StrToInt(numero);
x:=StrToInt(Edit3.Text);
while y <> 0 do
begin
R:= x mod y;
x:=y;
y:=R;
end;
mcd:=x; //at the end of this loop I have the greatest common divisor
nume:= StrToInt(Edit3.Text) div mcd;
denomin:= StrToInt(numero) div mcd;
Memo1.Lines.Add('fraction: '+IntToStr(nume)+'/'+IntToStr(denomin));
end;
It doesn't work correctly because the fraction that it gives to me is wrong. Could anyone help me please?
Your code cannot work because you are using binary floating point. And binary floating point types cannot represent the decimal numbers that you are trying to represent. Representable binary floating point numbers are of the form s × 2^e, where s is the significand and e is the exponent. So, for example, you cannot represent 0.1 as a binary floating point value.
The most obvious solution is to perform the calculation using integer arithmetic. Don't call StrToFloat at all. Don't touch floating point arithmetic. Parse the input string yourself. Locate the decimal point. Use the number of digits that follow to work out the decimal scale. Strip off any leading or trailing zeros. And do the rest using integer arithmetic.
As an example, suppose the input is '2.79'. Convert that, by processing the text, into numerator and denominator variables
Numerator := 279;
Denominator := 100;
Obviously you'd have to code string parsing routines rather than use integer literals, but that is routine.
Finally, complete the problem by finding the gcd of these two integers.
The bottom line is that to represent and operate on decimal data you need a decimal algorithm. And that excludes binary floating point.
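A minimal C sketch of the integer-only approach described above (not Delphi, but the logic ports line by line). The function names are made up for this example; it accepts either '.' or ',' as the decimal separator and does not check for overflow on very long inputs.

```c
#include <assert.h>
#include <ctype.h>

/* Euclid's algorithm; returns a non-negative result. */
long gcd_long(long a, long b)
{
    while (b != 0) { long t = a % b; a = b; b = t; }
    return a < 0 ? -a : a;
}

/* Parse a decimal such as "2.79" or "1,36842" using integers only.
   Returns 1 on success and 0 on malformed input. */
int decimal_to_fraction(const char *s, long *num, long *den)
{
    assert(s && num && den);
    long n = 0, d = 1;
    int sign = 1, seen_point = 0, digits = 0;
    if (*s == '-') { sign = -1; s++; }
    for (; *s; s++) {
        if (isdigit((unsigned char)*s)) {
            n = n * 10 + (*s - '0');
            if (seen_point) d *= 10;   /* one more decimal place */
            digits++;
        } else if ((*s == '.' || *s == ',') && !seen_point) {
            seen_point = 1;
        } else {
            return 0;
        }
    }
    if (digits == 0) return 0;
    long g = gcd_long(n, d);           /* d >= 1, so g >= 1 */
    *num = sign * (n / g);
    *den = d / g;
    return 1;
}
```

With the input "1,36842" from the question this gives 68421/50000, the exact value of the decimal text, which is not the same thing as a "nice" nearby fraction.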
I recommend defining a GreaterCommonDivisor function first (wiki reference).
This is going to be Java/C like code since I'm not familiar with Delphi
let
float x = inputnum // where inputnum is a float
// eg. x = 123.56
Then, multiplying
int n = 1;
float decimalpart = x % 1;
while (decimalpart != 0) { // or cast to int and check for equality -> (int)x == x
    x = x * 10;
    decimalpart = x % 1;
    // or use a function that returns the decimal part if the cast doesn't work
    n *= 10;
}
// running eg. x = 123.56 now x = 12356
// n = 100
Then you should have (float)x/n == inputnum at this point, e.g. (12356/100 == 123.56)
This means you have a fraction that may not be simplified at this point. All you do now is implement and use the GCD function
int gcd = GreaterCommonDivisor(x, n);
// GreaterCommonDivisor(12356, 100) returns 4
// therefore for correct implementation gcd = 4
x /= gcd; // 12356 / 4 = 3089
n /= gcd; // 100 / 4 = 25
This should be quick and simple to implement, but:
Major pitfalls:
The float must be terminating. For example, an input of 0.333333333333333333 won't be turned into 1/3.
Float * n <= max_int_value must hold, otherwise there will be an overflow. There are workarounds for this, but there may be other solutions better suited to such large numbers.
Continued fractions can be used to find good rational approximations to real numbers. Here's an implementation in JavaScript, I'm sure it's trivial to port to Delphi:
function float2rat(x) {
var tolerance = 1.0E-6;
var h1=1; var h2=0;
var k1=0; var k2=1;
var b = x;
do {
var a = Math.floor(b);
var aux = h1; h1 = a*h1+h2; h2 = aux;
aux = k1; k1 = a*k1+k2; k2 = aux;
b = 1/(b-a);
} while (Math.abs(x-h1/k1) > x*tolerance);
return h1+"/"+k1;
}
For example, 1.36842 is converted into 26/19.
You can find a live demo and more information about this algorithm on my blog.
@Joni
I tried 1/2 and the result was a "division by zero" error.
I corrected the loop by adding:
if b - a = 0 then BREAK;
to avoid the division in
b:= 1 / (b - a);
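Putting the continued-fraction loop and the zero check together, a C sketch could look like this. The name float_to_rational, the iteration cap of 32 and the 1e-15 cut-off are choices made for this example, and the input is assumed to be non-negative.

```c
#include <assert.h>

/* Continued-fraction approximation of x (0 <= x < 1e15); writes num/den.
   Stops when the relative error is within tol, when the fractional part
   (almost) vanishes, or after 32 terms. */
void float_to_rational(double x, double tol, long long *num, long long *den)
{
    assert(x >= 0.0 && x < 1e15 && tol > 0.0 && num && den);
    long long h1 = 1, h2 = 0, k1 = 0, k2 = 1;
    double b = x;
    for (int iter = 0; iter < 32; iter++) {
        long long a = (long long)b;        /* floor, since b >= 0 */
        long long t = h1; h1 = a * h1 + h2; h2 = t;
        t = k1;           k1 = a * k1 + k2; k2 = t;
        double err = x - (double)h1 / (double)k1;
        if (err < 0.0) err = -err;
        if (err <= x * tol) break;         /* close enough */
        double frac = b - (double)a;
        if (frac < 1e-15) break;           /* exact: never divide by zero */
        b = 1.0 / frac;
    }
    *num = h1;
    *den = k1;
}
```

Checking the fractional part before taking the reciprocal is what prevents the division-by-zero error reported for 1/2; in JavaScript the same division silently yields Infinity, which is why the original did not fail there.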
Source is either PNG or GIF where the pixels that should be "colorized" are white. Background can be either black or transparent, whichever is easiest.
Now I'd like to cut out a rectangular part of the source, and AND it with the palette color (gif) or RGB color (png) of the "brush", to "stamp" it out on a TImage/TCanvas with that color.
Probably one of those lazy questions where RTFM would do. But if you have a nice solution please share :)
I tried Daud's PNGImage lib, but I can't even get it loading the source image. Is there a trick to using it?
The solution needs to work on D7 and up, XP and up.
Do I understand correctly that you want to replace the white color with some other color?
If so, I think you should go through the image pixel by pixel, check the color of each pixel, and change it if it is white.
This is how you can loop through an image:
var
  iX  : Integer;
  Line: PByteArray;
...
Line := Image1.ScanLine[0]; // We are scanning the first line
iX := 0;
// We can't use the 'for' loop because iX could not be modified from
// within the loop
repeat
  Line[iX]     := Line[iX] - $F;     // Blue value (24-bit scanlines are stored B, G, R)
  Line[iX + 1] := Line[iX + 1] - $F; // Green value
  Line[iX + 2] := Line[iX + 2] - $F; // Red value
  Inc(iX, 3); // Move to next pixel
until iX > (Image1.Width - 1) * 3;
Here's code that shows how to read the red and blue values and swap them.
var
  btTemp: Byte; // Used to swap colors
  iY, iX: Integer;
  Line  : PByteArray;
...
for iY := 0 to Image1.Height - 1 do begin
  Line := Image1.ScanLine[iY]; // Read the current line
  iX := 0; // Restart at the first pixel of each line
  repeat
    btTemp := Line[iX]; // Save blue value (24-bit scanlines are stored B, G, R)
    Line[iX] := Line[iX + 2]; // Switch blue with red
    Line[iX + 2] := btTemp; // Switch red with previously saved blue
    // Line[iX + 1] - Green value, not used in example
    Inc(iX, 3);
  until iX > (Image1.Width - 1) * 3;
end;
Image1.Invalidate; // Redraw bitmap after everything's done
But this works for bitmap images only.
If this is useful, try converting your image to a bitmap and manipulate it from there.
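For the "AND with the brush color" part of the question, the per-pixel logic is short once the pixels are available as raw 24-bit data. This is a C sketch under that assumption; stamp_mask is a made-up name, both buffers are taken to be 24-bit B, G, R rows with a pitch in bytes, and black mask pixels are treated as transparent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stamp a white-on-black mask onto dst in the given colour.
   mask AND colour gives the colour where the mask is white and
   black elsewhere; black mask pixels leave dst untouched. */
void stamp_mask(const uint8_t *mask, size_t mask_pitch,
                uint8_t *dst, size_t dst_pitch,
                size_t width, size_t height,
                uint8_t blue, uint8_t green, uint8_t red)
{
    assert(mask && dst);
    for (size_t y = 0; y < height; y++) {
        const uint8_t *m = mask + y * mask_pitch;
        uint8_t *d = dst + y * dst_pitch;
        for (size_t x = 0; x < width; x++, m += 3, d += 3) {
            if (m[0] | m[1] | m[2]) {      /* non-black mask pixel */
                d[0] = m[0] & blue;
                d[1] = m[1] & green;
                d[2] = m[2] & red;
            }
        }
    }
}
```

To stamp only a rectangular part of the source, pass a pointer to the first pixel of that rectangle together with the full row pitch of the source image.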
The area chart (an image) has a few data series, each charted in a different color. We know the image size and the coordinates of each label on the x-axis. Is it possible to recover the y-axis values of each series by image recognition? Can anybody shed some light?
If you know the y-axis scale, it should be possible.
To screenscrape, you could first filter your image with a color filter for each of the series.
The second step would be to gather the coordinates of all remaining pixels in your temporary image and transform them to the scale needed.
Given
a pixel at coordinates x, y
the offset of the chart's origin in image pixels xoffset, yoffset
the scale of your chart axes xscale, yscale
you could calculate the data for this pixel (pseudocode)
pixelData.x := (x - xoffset) * xscale
pixelData.y := (y - yoffset) * yscale
And afterwards, do some interpolation if your series line is more than one pixel wide (for example, take the average data for all pixels in a single column or so).
Update 1: Pseudocode for a naive color filter that picks out red charts
// set up desired color levels to filter out
redmin := 240;
redmax := 255;
bluemin := 0;
bluemax := 0;
greenmin := 0;
greenmax := 0;
// load source bitmap
myBitmap := LoadBitmap('Chartfile.bmp');
// loop over bitmap pixels
for iX := 0 to myBitmap.width - 1 do
  for iY := 0 to myBitmap.height - 1 do
  begin
    myColorVal := myBitmap.GetPixels(iX, iY);
    // if the pixel color is inside your target color range, store it
    if ((myColorVal.r >= redmin) and (myColorVal.r <= redmax)) and
       ((myColorVal.g >= greenmin) and (myColorVal.g <= greenmax)) and
       ((myColorVal.b >= bluemin) and (myColorVal.b <= bluemax)) then
      storeDataValue(iX, iY); // performs the value scaling operation mentioned above
  end;
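The filter, the per-column averaging and the scaling can be combined in one pass. The following C sketch assumes a 24-bit B, G, R pixel buffer and a red series; extract_series and its parameters are hypothetical names. Since image rows grow downward, a negative y_scale is the usual choice. For a filled area, taking the topmost matching row instead of the average may be more appropriate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* For each pixel column, average the rows whose colour passes the red
   filter and map that row to a data value. Columns without any match
   receive the value `missing`. */
void extract_series(const uint8_t *img, size_t pitch,
                    int width, int height,
                    uint8_t red_min, uint8_t tol,
                    double y_offset, double y_scale,
                    double missing, double *out)
{
    assert(img && out && width > 0 && height > 0);
    for (int x = 0; x < width; x++) {
        long sum = 0, count = 0;
        for (int y = 0; y < height; y++) {
            const uint8_t *p = img + (size_t)y * pitch + (size_t)x * 3;
            /* p[0] = blue, p[1] = green, p[2] = red */
            if (p[2] >= red_min && p[1] <= tol && p[0] <= tol) {
                sum += y;
                count++;
            }
        }
        out[x] = count ? ((double)sum / count - y_offset) * y_scale : missing;
    }
}
```

Real charts are often anti-aliased, so a tolerance above zero for the other two channels is usually needed near the edges of a series.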
Is there a faster way to rotate a large bitmap by 90 or 270 degrees than simply doing a nested loop with inverted coordinates?
The bitmaps are 8bpp and typically 2048x2400x8bpp
Currently I do this by simply copying with argument inversion, roughly (pseudo code):
for x = 0 to 2048-1
for y = 0 to 2048-1
dest[x][y]=src[y][x];
(In reality I do it with pointers, for a bit more speed, but that is roughly the same magnitude)
GDI is quite slow with large images, and GPU load/store times for textures (GF7 cards) are in the same magnitude as the current CPU time.
Any tips, pointers? An in-place algorithm would even be better, but speed is more important than being in-place.
Target is Delphi, but it is more an algorithmic question. SSE(2) vectorization no problem, it is a big enough problem for me to code it in assembler
Follow up to Nils' answer
Image 2048x2700 -> 2700x2048
Compiler Turbo Explorer 2006 with optimization on.
Windows: Power scheme set to "Always on". (important!!!!)
Machine: Core2 6600 (2.4 GHz)
time with old routine: 32ms (step 1)
time with stepsize 8 : 12ms
time with stepsize 16 : 10ms
time with stepsize 32+ : 9ms
Meanwhile I also tested on a Athlon 64 X2 (5200+ iirc), and the speed up there was slightly more than a factor four (80 to 19 ms).
The speed-up is well worth it, thanks. Maybe during the summer months I'll torture myself with an SSE(2) version. However, I have already thought about how to tackle that, and I think I'll run out of SSE2 registers for a straight implementation:
for n:=0 to 7 do
begin
  load r0, <source+n*rowsize>
  shift byte from r0 into r1
  shift byte from r0 into r2
  ..
  shift byte from r0 into r8
end;
store r1, <target>
store r2, <target+1*rowsize>
..
store r8, <target+7*rowsize>
So 8x8 needs 9 registers, but 32-bits SSE only has 8. Anyway that is something for the summer months :-)
Note that the pointer thing is something I do out of instinct, but there could actually be something to it: if your dimensions are not hardcoded, the compiler can't turn the mul into a shift. While muls as such are cheap nowadays, they also generate more register pressure afaik.
The code (validated by subtracting the result from the "naive" rotate1 implementation):
const stepsize = 32;
procedure rotatealign(Source: tbw8image; Target:tbw8image);
var stepsx,stepsy,restx,resty : Integer;
RowPitchSource, RowPitchTarget : Integer;
pSource, pTarget,ps1,ps2 : pchar;
x,y,i,j: integer;
rpstep : integer;
begin
RowPitchSource := source.RowPitch; // bytes to jump to next line. Can be negative (includes alignment)
RowPitchTarget := target.RowPitch; rpstep:=RowPitchTarget*stepsize;
stepsx:=source.ImageWidth div stepsize;
stepsy:=source.ImageHeight div stepsize;
// check if mod 16=0 here for both dimensions, if so -> SSE2.
for y := 0 to stepsy - 1 do
begin
psource:=source.GetImagePointer(0,y*stepsize); // gets pointer to pixel x,y
ptarget:=Target.GetImagePointer(target.imagewidth-(y+1)*stepsize,0);
for x := 0 to stepsx - 1 do
begin
for i := 0 to stepsize - 1 do
begin
ps1:=@psource[rowpitchsource*i]; // ( 0,i)
ps2:=@ptarget[stepsize-1-i]; // (maxx-i,0);
for j := 0 to stepsize - 1 do
begin
ps2[0]:=ps1[j];
inc(ps2,RowPitchTarget);
end;
end;
inc(psource,stepsize);
inc(ptarget,rpstep);
end;
end;
// 3 more areas to do, with dimensions
// - stepsy*stepsize * restx // right most column of restx width
// - stepsx*stepsize * resty // bottom row with resty height
// - restx*resty // bottom-right rectangle.
restx:=source.ImageWidth mod stepsize; // typically zero because width is
// typically 1024 or 2048
resty:=source.Imageheight mod stepsize;
if restx>0 then
begin
// one loop less, since we know this fits in one line of "blocks"
psource:=source.GetImagePointer(source.ImageWidth-restx,0); // gets pointer to pixel x,y
ptarget:=Target.GetImagePointer(Target.imagewidth-stepsize,Target.imageheight-restx);
for y := 0 to stepsy - 1 do
begin
for i := 0 to stepsize - 1 do
begin
ps1:=@psource[rowpitchsource*i]; // ( 0,i)
ps2:=@ptarget[stepsize-1-i]; // (maxx-i,0);
for j := 0 to restx - 1 do
begin
ps2[0]:=ps1[j];
inc(ps2,RowPitchTarget);
end;
end;
inc(psource,stepsize*RowPitchSource);
dec(ptarget,stepsize);
end;
end;
if resty>0 then
begin
// one loop less, since we know this fits in one line of "blocks"
psource:=source.GetImagePointer(0,source.ImageHeight-resty); // gets pointer to pixel x,y
ptarget:=Target.GetImagePointer(0,0);
for x := 0 to stepsx - 1 do
begin
for i := 0 to resty- 1 do
begin
ps1:=@psource[rowpitchsource*i]; // ( 0,i)
ps2:=@ptarget[resty-1-i]; // (maxx-i,0);
for j := 0 to stepsize - 1 do
begin
ps2[0]:=ps1[j];
inc(ps2,RowPitchTarget);
end;
end;
inc(psource,stepsize);
inc(ptarget,rpstep);
end;
end;
if (resty>0) and (restx>0) then
begin
// another loop less, since only one block
psource:=source.GetImagePointer(source.ImageWidth-restx,source.ImageHeight-resty); // gets pointer to pixel x,y
ptarget:=Target.GetImagePointer(0,target.ImageHeight-restx);
for i := 0 to resty- 1 do
begin
ps1:=@psource[rowpitchsource*i]; // ( 0,i)
ps2:=@ptarget[resty-1-i]; // (maxx-i,0);
for j := 0 to restx - 1 do
begin
ps2[0]:=ps1[j];
inc(ps2,RowPitchTarget);
end;
end;
end;
end;
Update 2 Generics
I tried to update this code to a generics version in Delphi XE. I failed because of QC 99703, and forum people have already confirmed it also exists in XE2. Please vote for it :-)
Update 3 Generics
Works now in XE10
Update 4
In 2017 I did some work on an assembler version for 8x8 blocks of 8bpp images only, and asked a related SO question about shuffle bottlenecks where Peter Cordes generously helped me out. This code still has a missed opportunity: it needs another loop-tiling level to aggregate multiple 8x8 block iterations into pseudo-larger ones like 64x64. Right now it processes whole lines again, and that is wasteful.
Yes, there are faster ways to do this.
Your simple loop spends most of its time in cache misses. This happens because you touch a lot of data at very different places in a tight loop. Even worse: your memory locations are exactly a power of two apart. That's a stride at which the cache performs worst.
You can improve this rotation algorithm if you improve the locality of your memory accesses.
A simple way to do this would be to rotate each 8x8 pixel block on its own using the same code you've used for your whole bitmap, and wrap another loop around it that splits the image rotation into chunks of 8x8 pixels each.
E.g. something like this (not checked, and sorry for the C code. My Delphi skills aren't up to date):
// this is the outer-loop that breaks your image rotation
// into chunks of 8x8 pixels each:
for (int block_x = 0; block_x < 2048; block_x += 8)
{
    for (int block_y = 0; block_y < 2048; block_y += 8)
    {
        // this is the inner-loop that processes a block
        // of 8x8 pixels.
        for (int x = 0; x < 8; x++)
            for (int y = 0; y < 8; y++)
                dest[x + block_x][y + block_y] = src[y + block_y][x + block_x];
    }
}
There are other ways as well. You could process the data in Hilbert-Order or Morton-Order. That would be in theory even a bit faster, but the code will be much more complex.
Btw - Since you've mentioned that SSE is an option for you. Note that you can rotate a 8x8 byte block within the SSE-registers. It's a bit tricky to get it working, but looking at SSE matrix transpose code should get you started as it's the same thing.
EDIT:
Just checked:
With a block-size of 8x8 pixels the code runs ca. 5 times faster on my machine. With a block-size of 16x16 it runs 10 times faster.
Seems like it's a good idea to experiment with different block-sizes.
Here is the (very simple) test-program I've used:
#include <stdio.h>
#include <windows.h>
char temp1[2048*2048];
char temp2[2048*2048];
void rotate1 (void)
{
int x,y;
for (y=0; y<2048; y++)
for (x=0; x<2048; x++)
temp2[2048*y+x] = temp1[2048*x+y];
}
void rotate2 (void)
{
int x,y;
int bx, by;
for (by=0; by<2048; by+=8)
for (bx=0; bx<2048; bx+=8)
for (y=0; y<8; y++)
for (x=0; x<8; x++)
temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}
void rotate3 (void)
{
int x,y;
int bx, by;
for (by=0; by<2048; by+=16)
for (bx=0; bx<2048; bx+=16)
for (y=0; y<16; y++)
for (x=0; x<16; x++)
temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}
int main (int argc, char **args)
{
int i, t1;
t1 = GetTickCount();
for (i=0; i<20; i++) rotate1();
printf ("%d\n", GetTickCount()-t1);
t1 = GetTickCount();
for (i=0; i<20; i++) rotate2();
printf ("%d\n", GetTickCount()-t1);
t1 = GetTickCount();
for (i=0; i<20; i++) rotate3();
printf ("%d\n", GetTickCount()-t1);
}
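The test program hardcodes a square 2048x2048 image. A sketch of the same tiling for arbitrary sizes and row pitches could look like the following; rotate90_blocked is a made-up name, the rotation direction (clockwise) is an assumption, and the tile size of 16 is just a starting point for experiments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 16 };  /* tile edge in pixels; worth tuning per machine */

/* Rotate an 8bpp image of w x h pixels clockwise by 90 degrees into dst
   (h x w pixels), walking the source in BLOCK x BLOCK tiles. Partial
   tiles at the right and bottom edges are handled by clamping. */
void rotate90_blocked(const uint8_t *src, size_t src_pitch,
                      uint8_t *dst, size_t dst_pitch,
                      size_t w, size_t h)
{
    assert(src && dst && src != dst && src_pitch >= w && dst_pitch >= h);
    for (size_t by = 0; by < h; by += BLOCK) {
        size_t y_end = by + BLOCK < h ? by + BLOCK : h;
        for (size_t bx = 0; bx < w; bx += BLOCK) {
            size_t x_end = bx + BLOCK < w ? bx + BLOCK : w;
            for (size_t y = by; y < y_end; y++)
                for (size_t x = bx; x < x_end; x++)
                    dst[x * dst_pitch + (h - 1 - y)] = src[y * src_pitch + x];
        }
    }
}
```

Clamping the tile bounds avoids the separate loops for the leftover strips, at the cost of two extra comparisons per tile.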
If you can use C++ then you may want to look at Eigen.
It is a C++ template library that uses SSE (2 and later) and AltiVec instruction sets with graceful fallback to non-vectorized code.
Fast. (See benchmark).
Expression templates allow to intelligently remove temporaries and enable lazy evaluation, when that is appropriate -- Eigen takes care of this automatically and handles aliasing too in most cases.
Explicit vectorization is performed for the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code. Expression templates allow to perform these optimizations globally for whole expressions.
With fixed-size objects, dynamic memory allocation is avoided, and the loops are unrolled when that makes sense.
For large matrices, special attention is paid to cache-friendliness.
You might be able to improve it by copying in cache-aligned blocks rather than by rows, as at the moment the stride of either src or dest will cause a miss (depending on whether Delphi is row-major or column-major).
If the image isn't square, you can't do in-place. Even if you work in square images, the transform isn't conducive to in-place work.
If you want to try to do things a little faster, you can try to take advantage of the row strides to make it work, but I think the best you would do is to read 4 bytes at a time in a long from the source and then write it into four consecutive rows in the dest. That should cut some of your overhead, but I wouldn't expect more than a 5% improvement.