CUDA loop in MATLAB

I have been playing around with parallelization, both with CUDA and with OpenMP, in Fortran. I am now trying to do the same in MATLAB. I find it very interesting that it seems to be very hard to parallelize a loop over the GPU in MATLAB. Apparently the only way to do it is by using the arrayfun function, but I might be wrong.
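
(For context, by "using arrayfun" I mean the elementwise GPU pattern along these lines, shown here on toy data:)

x = gpuArray(linspace(0, 1, 1e6));      % put the data on the GPU
y = arrayfun(@(a) a.^2 + sin(a), x);    % the anonymous function runs as a single GPU kernel
z = gather(y);                          % bring the result back to the host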



At a conceptual level, I am wondering why GPU usage in MATLAB is not more straightforward than in Fortran. At a more practical level, I am wondering how to use GPUs in the simple code below.



Below, I am sharing the codes and benchmarks:




  1. Fortran OpenMP code

  2. Fortran CUDA (OpenACC) code

  3. MATLAB parfor code

  4. MATLAB CUDA (?): this is the one I don't know how to write.


Fortran OpenMP:



program rbc

use omp_lib ! For timing
use tools
implicit none

real, parameter :: beta = 0.984, eta = 2, alpha = 0.35, delta = 0.01, &
rho = 0.95, sigma = 0.005, zmin=-0.0480384, zmax=0.0480384;
integer, parameter :: nz = 4, nk=4800;
real :: zgrid(nz), kgrid(nk), t_tran_z(nz,nz), tran_z(nz,nz);
real :: kmax, kmin, tol, dif, c(nk), r(nk), w(nk);
real, dimension(nk,nz) :: v=0., v0=0., ev=0., c0=0.;
integer :: i, iz, ik, cnt;
logical :: ind(nk);
real(kind=8) :: start, finish ! For timing
real :: tmpmax, c1


call omp_set_num_threads(12)


!Grid for productivity z

! [1 x 4] grid of values for z
call linspace(zmin,zmax,nz,zgrid)
zgrid = exp(zgrid)
! [4 x 4] Markov transition matrix of z
tran_z(1,1) = 0.996757
tran_z(1,2) = 0.00324265
tran_z(1,3) = 0
tran_z(1,4) = 0
tran_z(2,1) = 0.000385933
tran_z(2,2) = 0.998441
tran_z(2,3) = 0.00117336
tran_z(2,4) = 0
tran_z(3,1) = 0
tran_z(3,2) = 0.00117336
tran_z(3,3) = 0.998441
tran_z(3,4) = 0.000385933
tran_z(4,1) = 0
tran_z(4,2) = 0
tran_z(4,3) = 0.00324265
tran_z(4,4) = 0.996757

! Grid for capital k

kmin = 0.95*(1/(alpha*zgrid(1)))*((1/beta)-1+delta)**(1/(alpha-1));
kmax = 1.05*(1/(alpha*zgrid(nz)))*((1/beta)-1+delta)**(1/(alpha-1));

! [1 x 4800] grid of possible values of k
call linspace(kmin, kmax, nk, kgrid)


! Compute initial wealth c0(k,z)
do iz=1,nz
c0(:,iz) = zgrid(iz)*kgrid**alpha + (1-delta)*kgrid;
end do

dif = 10000
tol = 1e-8
cnt = 1

do while(dif>tol)
   !$omp parallel do default(shared) private(ik,iz,i,tmpmax,c1)
   do ik = 1,nk
      do iz = 1,nz
         tmpmax = -huge(0.)
         do i = 1,nk
            c1 = c0(ik,iz) - kgrid(i)
            if (c1<0) exit
            c1 = c1**(1-eta)/(1-eta) + ev(i,iz)
            if (tmpmax<c1) tmpmax = c1
         end do
         v(ik,iz) = tmpmax
      end do
   end do
   !$omp end parallel do

   ev  = beta*matmul(v,tran_z)
   dif = maxval(abs(v-v0))
   v0  = v
   if (mod(cnt,1)==0) write(*,*) cnt, ':', dif
   cnt = cnt + 1
end do


end program


Fortran CUDA (OpenACC):



Just replace the main loop in the code above with:



do while(dif>tol)
   !$acc kernels
   !$acc loop gang
   do ik = 1,nk
      !$acc loop gang
      do iz = 1,nz
         tmpmax = -huge(0.)
         do i = 1,nk
            c1 = c0(ik,iz) - kgrid(i)
            if (c1<0) exit
            c1 = c1**(1-eta)/(1-eta) + ev(i,iz)
            if (tmpmax<c1) tmpmax = c1
         end do
         v(ik,iz) = tmpmax
      end do
   end do
   !$acc end kernels

   ev  = beta*matmul(v,tran_z)
   dif = maxval(abs(v-v0))
   v0  = v
   if (mod(cnt,1)==0) write(*,*) cnt, ':', dif
   cnt = cnt + 1
end do


MATLAB parfor:
(I know the code below could be made faster by using vectorized syntax, but the whole point of the exercise is to compare loop speeds.)



tic;
beta = 0.984;
eta = 2;
alpha = 0.35;
delta = 0.01;
rho = 0.95;
sigma = 0.005;
zmin=-0.0480384;
zmax=0.0480384;
nz = 4;
nk=4800;

v=zeros(nk,nz);
v0=zeros(nk,nz);
ev=zeros(nk,nz);
c0=zeros(nk,nz);

%Grid for productivity z

%[1 x 4] grid of values for z
zgrid = linspace(zmin,zmax,nz);
zgrid = exp(zgrid);
% [4 x 4] Markov transition matrix of z
tran_z(1,1) = 0.996757;
tran_z(1,2) = 0.00324265;
tran_z(1,3) = 0;
tran_z(1,4) = 0;
tran_z(2,1) = 0.000385933;
tran_z(2,2) = 0.998441;
tran_z(2,3) = 0.00117336;
tran_z(2,4) = 0;
tran_z(3,1) = 0;
tran_z(3,2) = 0.00117336;
tran_z(3,3) = 0.998441;
tran_z(3,4) = 0.000385933;
tran_z(4,1) = 0;
tran_z(4,2) = 0;
tran_z(4,3) = 0.00324265;
tran_z(4,4) = 0.996757;

% Grid for capital k

kmin = 0.95*(1/(alpha*zgrid(1)))*((1/beta)-1+delta)^(1/(alpha-1));
kmax = 1.05*(1/(alpha*zgrid(nz)))*((1/beta)-1+delta)^(1/(alpha-1));

% [1 x 4800] grid of possible values of k
kgrid = linspace(kmin, kmax, nk);

% Compute initial wealth c0(k,z)
for iz=1:nz
c0(:,iz) = zgrid(iz)*kgrid.^alpha + (1-delta)*kgrid;
end

dif = 10000;
tol = 1e-8;
cnt = 1;

while dif>tol

    parfor ik = 1:nk
        for iz = 1:nz
            tmpmax = -intmax;
            for i = 1:nk
                c1 = c0(ik,iz) - kgrid(i);
                if (c1<0)
                    continue
                end
                c1 = c1^(1-eta)/(1-eta) + ev(i,iz);
                if tmpmax<c1
                    tmpmax = c1;
                end
            end
            v(ik,iz) = tmpmax;
        end
    end

    ev  = beta*v*tran_z;
    dif = max(max(abs(v-v0)));
    v0  = v;
    if mod(cnt,1)==0
        fprintf('%d : %1.5f\n', cnt, dif)
    end
    cnt = cnt+1;
end


toc


MATLAB CUDA:



This is the part I have no clue how to write. Is using arrayfun the only way of doing this? In Fortran it is so simple to move from OpenMP to OpenACC; isn't there an equally easy way in MATLAB to go from parfor to GPU loops?
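
A rough sketch of the vectorized gpuArray route suggested in the comments below might look like the following. I am not sure this is the idiomatic way, I have not benchmarked it, the _g-suffixed names are just my own convention, and it builds an nk-by-nk temporary per productivity state, so memory could be a concern:

kgrid_g = gpuArray(kgrid);              % [1 x nk], now on the GPU
c0_g    = gpuArray(c0);                 % [nk x nz]
ev_g    = gpuArray(ev);
tranz_g = gpuArray(tran_z);
v_g     = zeros(nk, nz, 'gpuArray');
v0_g    = zeros(nk, nz, 'gpuArray');

while dif > tol
    for iz = 1:nz
        % c1(ik,i) = c0(ik,iz) - kgrid(i), built by implicit expansion
        c1   = c0_g(:,iz) - kgrid_g;                  % [nk x nk]
        util = c1.^(1-eta)/(1-eta) + ev_g(:,iz).';    % current utility + continuation value
        util(c1 < 0) = -Inf;                          % rule out infeasible choices
        v_g(:,iz) = max(util, [], 2);                 % maximize over next period's capital
    end
    ev_g = beta*(v_g*tranz_g);
    dif  = gather(max(max(abs(v_g - v0_g))));
    v0_g = v_g;
end
v = gather(v_g);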



The time comparison between codes:



Fortran OpenMP: 83.1 seconds 
Fortran CUDA: 2.4 seconds
Matlab parfor: 1182 seconds


As a final remark, I should say that the codes above solve a simple Real Business Cycle model and were written based on this.


matlab cuda openmp openacc

asked Nov 19 at 11:50 by phdstudent
edited Nov 19 at 11:55 by francescalus

  • As for "easy" ways, there's GPU Coder, but it requires a toolbox. Other than that, there's an example in the MATLAB documentation that compares these things.
    – Dev-iL
    Nov 19 at 12:02










  • You are thinking about it wrongly. Vectorized syntax is MATLAB's way of optimizing a loop, whether it is on the CPU or the GPU. So the easiest way to use the GPU is to use gpuArray() to put everything on the GPU and then use the classical vectorized syntax. Then arrayfun is the more tedious alternative if you cannot write it in a vectorized manner.
    – Nicky Mattsson
    Nov 19 at 12:20
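
(A tiny illustration of the pattern described in the comment above, on made-up data: move the data to the device with gpuArray, compute with ordinary vectorized syntax, then copy the result back with gather.)

A = gpuArray(rand(4800));   % data now lives on the GPU
B = A.^2 + 3*A;             % ordinary vectorized syntax runs on the GPU
C = gather(B);              % copy the result back to host memory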












  • You cannot code in MATLAB and CUDA until 2018b. The newer, very specialized toolbox allows writing kernels in MATLAB, but prior editions only allowed very specific functions to be run using CUDA. I personally write the code in CUDA and mex it.
    – Ander Biguri
    Nov 19 at 12:51
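
(For reference, the MEX-based workflow mentioned in the comment above would look roughly like this with the Parallel Computing Toolbox; bellman_kernel.cu and its calling signature are hypothetical, for illustration only.)

mexcuda bellman_kernel.cu                        % compile a hand-written CUDA MEX file
v = bellman_kernel(c0, kgrid, ev, beta, eta);    % then call it like any other MATLAB function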












  • I am using 2018b. Any reference that I can check?
    – phdstudent
    Nov 19 at 12:52










  • There is no point doing performance checks with MATLAB while refusing to use the most basic optimization, which is vectorization. There is a big overhead with each operation, incurred at each iteration with a for loop but only once with a vectorized op (which hides a loop anyway).
    – Brice
    Nov 19 at 13:50
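
(A quick way to see the per-operation overhead being described, on toy data:)

x = rand(1, 1e6);
tic; s = 0; for i = 1:numel(x), s = s + x(i)^2; end; toc   % one interpreted operation per element
tic; s = sum(x.^2); toc                                    % a single vectorized call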














