CUDA loop in MATLAB

I have been playing around with parallelization, both with CUDA and with OpenMP, in Fortran. I am now trying to do the same in MATLAB. I find it very interesting that it seems to be very hard to parallelize a loop over the GPU in MATLAB. Apparently the only way to do it is by using the arrayfun function, but I might be wrong.
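
(For context, by "using arrayfun" I mean the elementwise GPU pattern along these lines, shown here on toy data:)

x = gpuArray(linspace(0, 1, 1e6));      % put the data on the GPU
y = arrayfun(@(a) a.^2 + sin(a), x);    % the anonymous function runs as a single GPU kernel
z = gather(y);                          % bring the result back to the host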



At a conceptual level, I am wondering why GPU usage in MATLAB is not more straightforward than in Fortran. At a more practical level, I am wondering how to use GPUs in the simple code below.



Below, I am sharing the codes and benchmarks:




  1. Fortran OpenMP code

  2. Fortran CUDA (OpenACC) code

  3. MATLAB parfor code

  4. MATLAB CUDA (?): this is the one I don't know how to write.


Fortran OpenMP:



program rbc

use omp_lib ! For timing
use tools
implicit none

real, parameter :: beta = 0.984, eta = 2, alpha = 0.35, delta = 0.01, &
rho = 0.95, sigma = 0.005, zmin=-0.0480384, zmax=0.0480384;
integer, parameter :: nz = 4, nk=4800;
real :: zgrid(nz), kgrid(nk), t_tran_z(nz,nz), tran_z(nz,nz);
real :: kmax, kmin, tol, dif, c(nk), r(nk), w(nk);
real, dimension(nk,nz) :: v=0., v0=0., ev=0., c0=0.;
integer :: i, iz, ik, cnt;
logical :: ind(nk);
real(kind=8) :: start, finish ! For timing
real :: tmpmax, c1


call omp_set_num_threads(12)


!Grid for productivity z

! [1 x 4] grid of values for z
call linspace(zmin,zmax,nz,zgrid)
zgrid = exp(zgrid)
! [4 x 4] Markov transition matrix of z
tran_z(1,1) = 0.996757
tran_z(1,2) = 0.00324265
tran_z(1,3) = 0
tran_z(1,4) = 0
tran_z(2,1) = 0.000385933
tran_z(2,2) = 0.998441
tran_z(2,3) = 0.00117336
tran_z(2,4) = 0
tran_z(3,1) = 0
tran_z(3,2) = 0.00117336
tran_z(3,3) = 0.998441
tran_z(3,4) = 0.000385933
tran_z(4,1) = 0
tran_z(4,2) = 0
tran_z(4,3) = 0.00324265
tran_z(4,4) = 0.996757

! Grid for capital k

kmin = 0.95*(1/(alpha*zgrid(1)))*((1/beta)-1+delta)**(1/(alpha-1));
kmax = 1.05*(1/(alpha*zgrid(nz)))*((1/beta)-1+delta)**(1/(alpha-1));

! [1 x 4800] grid of possible values of k
call linspace(kmin, kmax, nk, kgrid)


! Compute initial wealth c0(k,z)
do iz=1,nz
c0(:,iz) = zgrid(iz)*kgrid**alpha + (1-delta)*kgrid;
end do

dif = 10000
tol = 1e-8
cnt = 1

do while(dif>tol)
   !$omp parallel do default(shared) private(ik,iz,i,tmpmax,c1)
   do ik = 1,nk
      do iz = 1,nz
         tmpmax = -huge(0.)
         do i = 1,nk
            c1 = c0(ik,iz) - kgrid(i)
            if (c1<0) exit
            c1 = c1**(1-eta)/(1-eta) + ev(i,iz)
            if (tmpmax<c1) tmpmax = c1
         end do
         v(ik,iz) = tmpmax
      end do
   end do
   !$omp end parallel do

   ev  = beta*matmul(v,tran_z)
   dif = maxval(abs(v-v0))
   v0  = v
   if (mod(cnt,1)==0) write(*,*) cnt, ':', dif
   cnt = cnt + 1
end do


end program


Fortran CUDA (OpenACC):



Just replace the main loop in the code above with:



do while(dif>tol)
   !$acc kernels
   !$acc loop gang
   do ik = 1,nk
      !$acc loop gang
      do iz = 1,nz
         tmpmax = -huge(0.)
         do i = 1,nk
            c1 = c0(ik,iz) - kgrid(i)
            if (c1<0) exit
            c1 = c1**(1-eta)/(1-eta) + ev(i,iz)
            if (tmpmax<c1) tmpmax = c1
         end do
         v(ik,iz) = tmpmax
      end do
   end do
   !$acc end kernels

   ev  = beta*matmul(v,tran_z)
   dif = maxval(abs(v-v0))
   v0  = v
   if (mod(cnt,1)==0) write(*,*) cnt, ':', dif
   cnt = cnt + 1
end do


MATLAB parfor:
(I know the code below could be made faster by using vectorized syntax, but the whole point of the exercise is to compare loop speeds.)



tic;
beta = 0.984;
eta = 2;
alpha = 0.35;
delta = 0.01;
rho = 0.95;
sigma = 0.005;
zmin=-0.0480384;
zmax=0.0480384;
nz = 4;
nk=4800;

v=zeros(nk,nz);
v0=zeros(nk,nz);
ev=zeros(nk,nz);
c0=zeros(nk,nz);

%Grid for productivity z

%[1 x 4] grid of values for z
zgrid = linspace(zmin,zmax,nz);
zgrid = exp(zgrid);
% [4 x 4] Markov transition matrix of z
tran_z(1,1) = 0.996757;
tran_z(1,2) = 0.00324265;
tran_z(1,3) = 0;
tran_z(1,4) = 0;
tran_z(2,1) = 0.000385933;
tran_z(2,2) = 0.998441;
tran_z(2,3) = 0.00117336;
tran_z(2,4) = 0;
tran_z(3,1) = 0;
tran_z(3,2) = 0.00117336;
tran_z(3,3) = 0.998441;
tran_z(3,4) = 0.000385933;
tran_z(4,1) = 0;
tran_z(4,2) = 0;
tran_z(4,3) = 0.00324265;
tran_z(4,4) = 0.996757;

% Grid for capital k

kmin = 0.95*(1/(alpha*zgrid(1)))*((1/beta)-1+delta)^(1/(alpha-1));
kmax = 1.05*(1/(alpha*zgrid(nz)))*((1/beta)-1+delta)^(1/(alpha-1));

% [1 x 4800] grid of possible values of k
kgrid = linspace(kmin, kmax, nk);

% Compute initial wealth c0(k,z)
for iz=1:nz
c0(:,iz) = zgrid(iz)*kgrid.^alpha + (1-delta)*kgrid;
end

dif = 10000;
tol = 1e-8;
cnt = 1;

while dif>tol

    parfor ik = 1:nk
        for iz = 1:nz
            tmpmax = -intmax;
            for i = 1:nk
                c1 = c0(ik,iz) - kgrid(i);
                if (c1<0)
                    continue
                end
                c1 = c1^(1-eta)/(1-eta) + ev(i,iz);
                if tmpmax<c1
                    tmpmax = c1;
                end
            end
            v(ik,iz) = tmpmax;
        end
    end

    ev  = beta*v*tran_z;
    dif = max(max(abs(v-v0)));
    v0  = v;
    if mod(cnt,1)==0
        fprintf('%d : %1.5f\n', cnt, dif)
    end
    cnt = cnt+1;
end


toc


MATLAB CUDA:



This is the part I have no clue how to write. Is using arrayfun the only way of doing this? In Fortran it is so simple to move from OpenMP to OpenACC; isn't there an equally easy way in MATLAB to go from parfor to GPU loops?
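
A rough sketch of the vectorized gpuArray route suggested in the comments below might look like the following. I am not sure this is the idiomatic way, I have not benchmarked it, the _g-suffixed names are just my own convention, and it builds an nk-by-nk temporary per productivity state, so memory could be a concern:

kgrid_g = gpuArray(kgrid);              % [1 x nk], now on the GPU
c0_g    = gpuArray(c0);                 % [nk x nz]
ev_g    = gpuArray(ev);
tranz_g = gpuArray(tran_z);
v_g     = zeros(nk, nz, 'gpuArray');
v0_g    = zeros(nk, nz, 'gpuArray');

while dif > tol
    for iz = 1:nz
        % c1(ik,i) = c0(ik,iz) - kgrid(i), built by implicit expansion
        c1   = c0_g(:,iz) - kgrid_g;                  % [nk x nk]
        util = c1.^(1-eta)/(1-eta) + ev_g(:,iz).';    % current utility + continuation value
        util(c1 < 0) = -Inf;                          % rule out infeasible choices
        v_g(:,iz) = max(util, [], 2);                 % maximize over next period's capital
    end
    ev_g = beta*(v_g*tranz_g);
    dif  = gather(max(max(abs(v_g - v0_g))));
    v0_g = v_g;
end
v = gather(v_g);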



The time comparison between codes:



Fortran OpenMP: 83.1 seconds 
Fortran CUDA: 2.4 seconds
Matlab parfor: 1182 seconds


As a final remark, I should say that the codes above solve a simple Real Business Cycle model and were written based on this.


matlab cuda openmp openacc

asked Nov 19 at 11:50 by phdstudent
edited Nov 19 at 11:55 by francescalus

  • As for "easy" ways, there's GPU Coder, but it requires a toolbox. Other than that, there's an example in the MATLAB documentation that compares these things.
    – Dev-iL
    Nov 19 at 12:02










  • You are thinking about it wrongly. Vectorized syntax is MATLAB's way of optimizing a loop, whether it is on the CPU or the GPU. So the easiest way to use the GPU is to use gpuArray() to put everything on the GPU and then use the classical vectorized syntax. Then arrayfun is the more tedious alternative if you cannot write it in a vectorized manner.
    – Nicky Mattsson
    Nov 19 at 12:20
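
(A tiny illustration of the pattern described in the comment above, on made-up data: move the data to the device with gpuArray, compute with ordinary vectorized syntax, then copy the result back with gather.)

A = gpuArray(rand(4800));   % data now lives on the GPU
B = A.^2 + 3*A;             % ordinary vectorized syntax runs on the GPU
C = gather(B);              % copy the result back to host memory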












  • You cannot code in MATLAB and CUDA until 2018b. The newer, very specialized toolbox allows writing kernels in MATLAB, but prior editions only allowed very specific functions to be run using CUDA. I personally write the code in CUDA and mex it.
    – Ander Biguri
    Nov 19 at 12:51
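
(For reference, the MEX-based workflow mentioned in the comment above would look roughly like this with the Parallel Computing Toolbox; bellman_kernel.cu and its calling signature are hypothetical, for illustration only.)

mexcuda bellman_kernel.cu                        % compile a hand-written CUDA MEX file
v = bellman_kernel(c0, kgrid, ev, beta, eta);    % then call it like any other MATLAB function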












  • I am using 2018b. Any reference that I can check?
    – phdstudent
    Nov 19 at 12:52










  • There is no point doing performance checks with MATLAB while refusing to use the most basic optimization, which is vectorization. There is a big overhead with each operation, incurred at each iteration with a for loop but only once with a vectorized op (which hides a loop anyway).
    – Brice
    Nov 19 at 13:50
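
(A quick way to see the per-operation overhead being described, on toy data:)

x = rand(1, 1e6);
tic; s = 0; for i = 1:numel(x), s = s + x(i)^2; end; toc   % one interpreted operation per element
tic; s = sum(x.^2); toc                                    % a single vectorized call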














