Monday, 30 June 2008

GPULIB Where-Woes

No one can live without the where() function. Is it faster on the GPU? Here is my attempt to time it:

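pro test_gpuWhere
  gpuinit
  ; one million float flags (1.0s and 0.0s)
  A = float(round(randomu(s,1000000)))
  ; CPU: where() on the host array; the upload is repeated here too,
  ; so both loops carry the same transfer cost
  st = systime(2)
  for i=0,99 do begin
    gpuPutArr, A, A_gpu
    ind = where(A,count)
  endfor
  print,count
  CPUtime = systime(2)-st
  st = systime(2)
  ; undefine count, which where() returned as a long integer
  _ = temporary(count)
  ; GPU: fresh upload on every pass, since gpuWhere overwrites A_gpu
  for i=0,99 do begin
    gpuPutArr, A, A_gpu
    gpuWhere,A_gpu,ind_gpu,count
  endfor
  print,count
  GPUtime = systime(2)-st
  print, 'Speedup: ',CPUtime/GPUtime
  gpuFree,A_gpu
  gpuFree,ind_gpu
end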

Several things to notice here:

A is an array of "float flags", i.e. 1.0s and 0.0s. Everything, repeat, everything in GPULIB is float. Also the count parameter in gpuWhere returns the sum of A, not the number of nonzero elements.

Before reusing count in gpuWhere I have to undefine it, because it is returned as longint from where(). GPULIB will report an error otherwise.

In the loop, I have to call gpuPutArr,A,A_gpu each time, because gpuWhere overwrites its first argument.
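
The first two points can be checked on the CPU alone. Here is a minimal sketch in plain IDL that makes no GPULIB calls; demo_where_count, flags, B and dummy are names made up for illustration:

```idl
; CPU-only sketch in plain IDL; no GPULIB calls are made.
pro demo_where_count
  compile_opt idl2

  ; float flags as in the test above: only 1.0s and 0.0s
  flags = float(round(randomu(seed, 1000)))
  ind = where(flags, count)
  print, count, total(flags)   ; the two numbers agree for 0/1 flags

  ; with other values, the sum is no longer the nonzero count
  B = [0.0, 2.0, 0.5, 0.0]
  ind = where(B, count)
  print, count, total(B)       ; 2 nonzero elements, but the sum is 2.5

  ; where() hands count back as a LONG
  help, count

  ; temporary() returns the value and leaves count undefined
  dummy = temporary(count)
  print, n_elements(count)     ; prints 0 for an undefined variable
end
```

So a count that is really a sum only matches where() when the input holds nothing but 0.0s and 1.0s, which is why the test array is built from rounded random numbers.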

So is all this worth the trouble?

% Compiled module: TEST_GPUWHERE.
500933
500933.
Speedup: 1.3521798

Evidently not.

GPULIB rude awakening

Turns out I had put the GPU into some kind of undefined state by using the gpuWhere procedure incorrectly during the session reported in the last post. Here's the correct output from test_gpuMatrix_Multiply:

Length : 1000
CPU time: 0.047000170
GPU time: 0.030999899
Speedup : 1.5161395
Length : 10000
CPU time: 0.34399986
GPU time: 0.26500010
Speedup : 1.2981122
Length : 100000
CPU time: 3.0780001
GPU time: 2.6250000
Speedup : 1.1725715
Length : 1000000
CPU time: 30.734000
GPU time: 26.282000
Speedup : 1.1693935

It was way too good to be true. Ah well.

OK, what about simple array products?

pro test_gpuMult
  ; initialize
  gpuinit
  ; array of 1000000 3-element observation vectors (rows)
  A = randomu(s,3,1000000)
  gpuPutArr,A,A_gpu
  ; calculate square of A on the CPU
  start = systime(2)
  for j=0L,99 do C = A*A
  CPUtime = systime(2)-start
  print, 'CPU time: ',CPUtime
  ; now on the GPU
  start = systime(2)
  for j=0L,99 do gpuMult,A_gpu,A_gpu,C_gpu
  GPUtime = systime(2)-start
  print,'GPU time: ', GPUtime
  print,'Speedup : ', CPUtime/GPUtime
  gpuFree,A_gpu
  ; check that the results are the same
  gpuGetArr,C_gpu,C1
  print,'Check:'
  print, total(C-C1)
  gpuFree,C_gpu
end

This gives a speedup of about 10:

% Compiled module: TEST_GPUMULT.
CPU time: 2.1100001
GPU time: 0.20300007
Speedup : 10.394086
Check:
0.000000

Saturday, 28 June 2008

GPULIB continued

It is often necessary to calculate the covariance matrix of a multispectral image from the array of observation vectors (pixels). The program

pro test_gpuMatrix_Multiply
  ; initialize
  gpuinit
  for i=3,6 do begin
    ; array of 6-element observation vectors (rows)
    A = randomu(s,6,10L^i)
    gpuPutArr,A,A_gpu
    ; calculate covariance matrix C on the CPU (100 passes)
    start = systime(2)
    for j=0L,99 do C = $
      Matrix_Multiply(A,A,/btranspose)
    CPUtime = systime(2)-start
    print, 'Length : ',10L^i
    print, 'CPU time: ',CPUtime
    ; now on the GPU (1000 passes)
    start = systime(2)
    for j=0L,999 do $
      gpuMatrix_Multiply,A_gpu,A_gpu,C_gpu,/btranspose
    GPUtime = systime(2)-start
    ; divide by 10 to compare like with like: 1000 GPU passes vs. 100 CPU passes
    print,'GPU time: ', GPUtime/10
    print,'Speedup : ', 10*CPUtime/GPUtime
    gpuFree,A_gpu
  endfor
  ; check that the results are the same
  gpuGetArr,C_gpu,C1
  print,'Check:'
  print, determ(C),determ(C1)
  gpuFree,C_gpu
end

produces

% Compiled module: TEST_GPUMATRIX_MULTIPLY.
Length : 1000
CPU time: 0.030999899
GPU time: 0.0062000036
Speedup : 4.9999808
Length : 10000
CPU time: 0.25000000
GPU time: 0.0046999931
Speedup : 53.191567
Length : 100000
CPU time: 0.96900010
GPU time: 0.0046999931
Speedup : 206.17054
Length : 1000000
CPU time: 9.5620000
GPU time: 0.0032000065
Speedup : 2988.1190
Check:
6.33251e+030 6.33251e+030

So for instance the covariance matrix of a 1000x1000x6 multispectral image could be calculated 3000 times faster on the GPU!? My ENVI/IDL change detection extension iterates on the weighted covariance matrix of a bitemporal image. Sooo ...
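
Strictly speaking, and assuming the usual sample definition, the product timed above is the sum of outer products of the observation vectors rather than the covariance matrix itself. The covariance follows from it with a mean correction and a scale factor. With x_i the i-th observation vector, N the number of pixels and x-bar the mean vector (symbols introduced here for illustration only):

```latex
S = \sum_{i=1}^{N} x_i x_i^{\top}, \qquad
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad
\Sigma = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^{\top}
       = \frac{1}{N-1} \left( S - N \bar{x} \bar{x}^{\top} \right)
```

Here S is the 6x6 matrix that Matrix_Multiply(A,A,/btranspose) returns for the 6-band array A.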

Thursday, 26 June 2008

GPULIB Trials (and Tribulations)

With active and friendly help from Tech-X Corporation I've been experimenting with GPULIB, an IDL/Matlab/Python ... interface to NVIDIA's CUDA. I was (and still am) hoping to speed up some of the ENVI/IDL extensions for remote sensing image analysis described in my book.

So I've started blogging (at my age!) to document my experiences for anyone who wants to take a similar plunge into the world of "massively parallel computing" on his or her home PC.

Mine is:

Intel Pentium 4 650, 3400 MHz (17 x 200)
Intel Lakeport-G i945G
2048 MB (DDR2 SDRAM)
NVIDIA GeForce 8600 GT (512 MB)

Having gotten a Windows XP build from Tech-X (I evidently don't have the right C++ compiler to do my own builds), I did the following:

-Copied GPULIB.DLM and GPULIB.DLL to C:\Programme\ITT\IDL70\bin\bin.x86 (my DLL/DLM Path)

-Placed GPUINIT.PRO in my IDL path

-Downloaded and installed the latest NVIDIA drivers for my graphics card

-Downloaded the CUDA Toolkit version 1.1 for Windows XP 32bit

-Copied the DLLs (including CUDART.DLL) to C:\Programme\ITT\IDL70\bin\bin.x86

-Started IDL 7.0 (with ENVI 4.5)

-Ran the benchmark program BENCH.PRO

Here's what it said:

% Compiled module: BENCH.
0.756607 2.33993 0.196372 0.516154 0.0442747 0.839950
0.756607 2.33993 0.196372 0.516154 0.0442747 0.839950
N iter = 50
CPU Time = 9.2029998
GPU Time = 0.10900021
Speedup = 84.431032 (emphasis added)

Wow! If that isn't exciting I don't know what is. This with a cheap, passively cooled graphics card!

Of course, what BENCH.PRO actually does is calculate the log gamma function of a 1000000-element array 50 times, something I seldom have occasion to do in my ENVI extensions.
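
A rough sketch of what the CPU half of such a benchmark might look like, in plain IDL. This is not the actual BENCH.PRO source; demo_lngamma_timing and its variable names are assumptions for illustration, and the GPU half is left out entirely:

```idl
; CPU-only sketch; this is not the actual BENCH.PRO
pro demo_lngamma_timing
  compile_opt idl2

  n_iter = 50
  ; 1000000 uniformly distributed random arguments
  x = randomu(seed, 1000000)

  start = systime(1)           ; seconds since 1 January 1970
  for i = 0, n_iter - 1 do y = lngamma(x)
  cpu_time = systime(1) - start

  print, 'N iter   = ', n_iter
  print, 'CPU Time = ', cpu_time
end
```

The work here is one independent function evaluation per array element, with no dependence between elements.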

And that's where the story starts. Stay tuned.