Saturday, 19 July 2008

GPULIB (Clustering speedup)

The clustering program that I've been re-working with GPULib fits a mixture of K multivariate Gaussians to a cloud of n N-dimensional image pixel vectors using the Expectation Maximization algorithm:

1. Initialize n x K cluster membership
probability matrix U

2. Maximization step: calculate
K N-component mean vectors,
K NxN covariance matrices,
K mixture coefficients

3. Expectation step: recalcuate
U and normalize.

4. If U hasn't changed significantly,
stop, else go to 2.

Step 3 requires inversion of the cluster covariance matrices, which has to be done on the CPU. Apart from that, I still get into trouble every time I use gpuWhere (don't know why yet). So, since I have to check U for zeroes, I have to do its normalization on the CPU as well. Here is my attempt at normalization on the GPU, which usually crashes IDL:

; normalize U
gpuMatrix_Multiply,U_gpu,onesKxK_gpu,den_gpu
gpuEq,den_gpu,fltarr(n,K),zeroes_gpu
gpuWhere,zeroes_gpu,ind_gpu,count
if count gt 0 then $
gpuSubscript,ind_gpu,den_gpu,$
fltarr(count)+1,/LHS
gpuDiv,U_gpu,den_gpu,U_gpu
gpuFree,[den_gpu,ind_gpu,zeroes_gpu]

Speedup (so far) on a 1000 x 1000 x 4 image clustered into 8 classes (i.e. n=1,000,000, N=4, K=8) is a factor of 2.8.

Thursday, 10 July 2008

GPULIB (trip to never-neverland)

I seem to be making a little progress with the clustering algorithm. One problem is that I have to invert a covariance matrix on each iteration, which means jumping back to the CPU to do it. This probably slows things down. Doesn't CUBLAS provide matrix inversion?

Anyway, while coding I stumbled into a horrible trap:

gpuPutArr,A,A_gpu
gpuView,A_gpu,n,m,Av_gpu
;...
;lots of code, so I kind of forgot
;what Av_gpu is
;...
gpuFree,A_gpu
gpuFree,Av_gpu ; very, very bad thing to do

This put me Somewhere Over the Rainbow. Nothing, but nothing works anymore on the GPU until I exit IDL and restart. Took a long time to get back to earth. Incidently, another way to get up there with Judy Garland is to forget that

gpuWhere,A_gpu,ind_gpu,count

overwrites A_gpu.

Tuesday, 8 July 2008

GPULIB succes story (cont'd)

Regarding my last post, Peter Messmer at Tech-X pointed out to me that I should use gpuView rather than shifting subarrays around with gpuSubArr. Doesn't make things much faster in this case, but the code is nicer:

while i lt num_rows do begin
; ggi = GG[*,i*num_cols:(i+1)*num_cols-1]
gpuView,GG_gpu,i*num_cols*num_bands,$
num_bands*num_cols,ggi_gpu
gpuReform,ggi_gpu,num_bands,num_cols
; KK = transpose(G2)##(intarr(num_cols)+1)
gpuMatrix_Multiply,onesnc_gpu,G2_gpu,$
KK_gpu,/btranspose
; KK = KK + transpose(intarr(m)+1)##$
; GG2[i*num_cols:(i+1)*num_cols-1]
gpuView,GG2_gpu,i*num_cols,num_cols,gg2i_gpu
gpuMatrix_Multiply,gg2i_gpu,$
onesm_gpu,res_gpu,/btranspose
gpuAdd,KK_gpu,res_gpu,KK_gpu
; KK = KK - 2*G##transpose(ggi)
gpuMatrix_Multiply,ggi_gpu,G_gpu,$
res_gpu,/atranspose
gpuAdd,1.0,KK_gpu,-2.0,res_gpu,0.0,KK_gpu
; KK = exp(-gma*KK)
gpuExp,1.0,-gma,KK_gpu,0.0,0.0,KK_gpu
; image[*,i,*] = alpha[*,0:npc-1]##KK
gpuMatrix_Multiply,KK_gpu,alpha_gpu,res1_gpu
gpuGetArr,res1_gpu,res
image[*,i,*]=res
i++ ; next row
endwhile

With my new-found skills ;-) I want next to have a go at image clustering. If you cluster all of the pixel vectors in a multispectral image (typically millions) then things can get pretty slow.

Tuesday, 1 July 2008

GPULIB success story (at last)

In kernelized PCA (nonlinear principal components analysis) you have to diagonalize a large (say 1000x1000) Gram matrix (real, symmetric kernel matrix) and then project the original image along the most significant eigenvectors. There is no way to do diagonalizations on the GPU (yet?), but the projection part is also very time consuming. Here is my solution using GPULIB. The commented statements are the original code on the CPU:

start_time=systime(2)
while i lt num_rows do begin
; ggi = GG[*,i*num_cols:(i+1)*num_cols-1]
gpuSubArr,GG_gpu,[0L,-1L],$
[i*num_cols,(i+1)*num_cols-1],$
ggi_gpu,[0L,-1L],[0L,-1L]
; KK = transpose(G2)##(intarr(num_cols)+1)
gpuMatrix_Multiply,onesnc_gpu,$
G2_gpu,KK_gpu,/btranspose
; KK = KK + transpose(intarr(m)+1)##$
; GG2[i*num_cols:(i+1)*num_cols-1]
gpuSubArr,GG2_gpu,[i*num_cols,(i+1)*num_cols-1],$
gg2i_gpu,[0L,-1L]
gpuMatrix_Multiply,gg2i_gpu,onesm_gpu,res_gpu,$
/btranspose
gpuAdd,KK_gpu,res_gpu,KK_gpu
; KK = KK - 2*G##transpose(ggi)
gpuMatrix_Multiply,ggi_gpu,G_gpu,res_gpu,$
/atranspose
gpuAdd,1.0,KK_gpu,-2.0,res_gpu,0.0,KK_gpu
; KK = exp(-gma*KK)
gpuExp,1.0,-gma,KK_gpu,0.0,0.0,KK_gpu
; image[*,i,*] = alpha[*,0:npc-1]##KK
gpuMatrix_Multiply,KK_gpu,alpha_gpu,res1_gpu
gpuGetArr,res1_gpu,res
image[*,i,*]=res
i++
endwhile
print,'elapsed time: ',systime(2)-start_time

Here is the time taken to project a 1000x1000x6 image on the GPU

elapsed time: 12.093000

and on the CPU

elapsed time: 197.39100

for a speedup of 16. Not bad.