## Tuesday 5 August 2008

### image rotation

As I am yet to post any meaningful code I thought I'd start with a cuda image rotation tutorial.

To keep things simple we are going to use the standard sin/cos (or matrix) rotation method and not the sheer method.

Firstly we need to calculate how big the rotated image will be - we do this on the CPU as it is a single calculation:

assuming we want to rotate by theta degrees we first need to convert our angle to radians.

Use the standard conversion:   float thetaRadians = (2*pi*angle)/360;

where pi is defined elsewhere, usually 3.1416  on bigger images you may need to specify a bit more precision.

now calculate the cos and sin of our angle in radians:

float cosTheta = (float)cos(radians);

float sinTheta = (float)cos(radians);

note that we are using standard float for values and not doubles. Before cuda 2.0 doubles were not supported on the device.

Right, now that we have our angles sorted out let us work out our rotated image size using standard trig - as bitmaps are rectangles we just need to calculate 3 points and then find the minimum and maximum values for the x and y co-ordinates = size of our new image:

float new1X = (-originalImageH*sinTheta);

float new1Y  = (originalImageH*cosTheta);

float new2X = (originalImageW*cosTheta - originalImageH*sinTheta);

float new2Y = (originalImageH*cosThere + originalImageW*sinTheta);

float new3X = (originalImageW*cosTheta);

float new3Y = (originalImageW*sinTheta);

now find the minimum and maximum point pairs - left as an exercise :)  hint:  min and max functions are your friends.

using the min and max values calculate the size of our bitmap

int destinationW = (int)(ceil)(maxx - minx);

int destinationH = (int)(ceil)(maxy - miny);

We use ceil as the min and max values are floats and we need to roundup to ensure we have enough space for all the pixels.

Ok boring cpu stuff is over time for some cuda code:

lets allocate a chunk of memory big enough to store our destination bitmap in:

uchar4 *d_Rotated;

CUDA_SAFE_CALL( cudaMalloc((void **)&d_Rotated,  (destinationW * destinationH * 4)) );
CUDA_SAFE_CALL( cudaMemset(d_Rotated, 255, (destinationW * destinationH * 4) ));

I am assuming you have already copied your source bitmap to the device - let use call the pointer to it: d_Src

in the first line we allocate enough space on the device. *4 as our bitmap is in RGBA format - usually a good idea as it aligns the memory reads/writes nicely.

On the second line we set the entire destination bitmap to white - you can choose any colour, black also being a common one. This will be the area outside the rotated source bitmap.

We now need our kernel that will actually do the rotation, it will need to know the width and height of both the source and destination bitmaps as well as a pointer to their memory locations. We also need to tell our kernel how much to rotate the image so we can pass in our radiansTheta value OR even better our precalculated sinTheta and cosTheta which will save us a few clock cycles in our kernel.

Here is the entire kernel which we will analyse step by step.

__global__  void rotateImage_Kernel(uchar4* origimageData,int originalW,int originalH,uchar4* destimageData,int destinationW,int destinationH, float sinTheta,float cosTheta,float minx,float miny)
{
const int ix = blockDim.x * blockIdx.x + threadIdx.x;�
const int iy = blockDim.y * blockIdx.y + threadIdx.y;

float xpos = floor(((float)ix+(float)minx)*cosTheta+((float)iy+(float)miny)*sinTheta);
float ypos = floor(((float)iy+(float)miny)*cosTheta-((float)ix+(float)minx)*sinTheta);�
�
if ((((int)xpos>=0) && ((int)xpos<originalW)) && (((int)ypos>=0) && ((int)ypos<originalH)))
{
destimageData[iy*destinationW+ix].x = origimageData[(int)(floor(ypos*(float)originalW+xpos))].z;

destimageData[iy*destinationW+ix].y = origimageData[(int)(floor(ypos*(float)originalW+xpos))].y;
destimageData[iy*destinationW+ix].z = origimageData[(int)(floor(ypos*(float)originalW+xpos))].x;
//don't bother about alpha channel for now
}

}

How simple is that? The beauty of CUDA is the way each thread can handle an element or a group of elements. The above kernel has an possibility of diverging the warp but we need to ensure we are still within the source image bounds.

The first two lines are the the most important to understand. They work out which part of the destination image the thread is calculating. This will vary according to how you set up your grid and blocks. This example uses the trivial case of a square block which will be shown later. This kernel does assume that ix and iy wont exceed the allocated destination memory size - ie the destination image should be a multiple of block size in each direction. If it is not - dont put a conditional in a kernel if possible - rather change the destination image size to fit.

The next two lines use our precalculated sin and cos values to work out which part of the source image goes onto this bit of the destination image. Sin and Cos on a cuda device only take 4 clock cycles but as each is called twice we may as well use the precalculated ones.

Now we have a quick conditional to check if we are overrunning our source image size - this may result in a warp divergence for some blocks.

And finally we just copy the pixel RGBA channels from the source to the the destination (at the correct positions).

To call our kernel let us set up our grid and block sizes

We are using a block size of 64

dim3 gridRotate(iDivUp(destinationW, 8), iDivUp(destinationH, 8));

We break up our destination size into blocks of size 8

As an exercise try and play with the block and grid sizes to measure changes in performance etc. Dont forget to change the ix and iy calculation inside the kernel. Try and use __mul24 where possible as 32bit integer multiplications are slow on the device.

Now let us call our kernel.

rotateImage_Kernel<<<gridRotate, threads>>>(d_Src,originalW,originalH,d_Rotated,destinationW,destinationH, sinTheta,cosTheta, minx, miny);

And thats it - you now have a nicely rotated bitmap sitting in the devices memory. Its up to you to get it back to the host and display it :)

[gallery]

(images have been scaled to fit on web site - the kernel above does not resize / scale them)