Why is Armadillo so slow compared to a C-style array in a simple row-wise computationnal task

I'm currently computing a small quantity for each value of a big matrix (millions of rows, number of columns < 1000) while considering each row independently.

More precisely, for each value M(i,j) in each row i, column j of this matrix, the quantity is simply [ M(i,j) - mean(i,s) ] / std(i,s) where s is the subset s in M(i,:) - j
in other words, s is the subset of all values of row i without value j.

I compared two implementations, one in C-style array and one in Armadillo, and Armadillo is roughly twice slower in termes of execution time. I would expect a similar or slighty slower execution time, but plain C arrays seem to dramatically improve the performance.

Is there any particular reason or somthing that I missed somewhere? Here is a example compiled with: -O2 -lstdc++ -DARMA_DONT_USE_WRAPPER -lopenblas -llapack -lm. Also tried to use ARMA_NO_DEBUG without success.

#include <string>

#include <vector>

#include <iostream>

#include <fstream>

#include <algorithm>

#include <armadillo>

#include <chrono>



using namespace std::chrono;



/***************************

 * main()

 ***************************/

int main( int argc, char *argv )

{

    unsigned nrows = 2000000; //number of rows

    unsigned ncols = 100; //number of cols



    const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix



    const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_cols-1, huge_mat.n_cols); //create a vector of [0,...,n]

    arma::rowvec inds = arma::zeros<arma::rowvec>( huge_mat.n_cols-1 ); //-1 since we remove only one value at each step.

    arma::colvec simuT = arma::zeros<arma::colvec>( ncols ); //let's store the results in this simuT vector.



    high_resolution_clock::time_point t1 = high_resolution_clock::now();



    //compute some normalization over each value of line of this huge matrix:

    for(unsigned i=0; i < nrows; i++) {

        const arma::rowvec current_line = huge_mat.row(i); //extract current line



        //for each observation in current_line:

        for(unsigned j=0; j < ncols; j++) {



            //Take care of side effects first:

            if( j == 0 )

                inds = current_line(arma::span(1, ncols-1));

            else

                if( j == 1 ) {

                    inds(0) = current_line(0);

                    inds(arma::span(1, ncols-2)) = current_line( arma::span(2, ncols-1) );

                } else

                    inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );



            //Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.

            simuT(j) = (current_line(j) - arma::mean(inds))  / ( std::sqrt( 1+1/((double) ncols-1) ) * arma::stddev(inds) );

        }

    }



    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    auto duration = duration_cast<seconds>( t2 - t1 ).count();

    std::cout << "ARMADILLO: " << duration << " secsn";



    //------------------PLAIN C Array

    double *Mat_full;

    double *output;

    unsigned int i,j,k;

    double mean=0, stdd=0;

    double sq_diff_sum = 0, sum=0;

    double diff = 0;



    Mat_full = (double *) malloc(ncols * nrows * sizeof(double));

    output = (double *) malloc(nrows * ncols * sizeof(double));



    std::vector< std::vector<double> > V(huge_mat.n_rows);



    //Some UGLY copy from arma::mat to double* using a vector:

    for (size_t i = 0; i < huge_mat.n_rows; ++i)

        V[i] = arma::conv_to< std::vector<double> >::from(huge_mat.row(i));



    //then dump to Mat_full array:

    for (i=0; i < V.size(); i++)

        for (j=0; j < V[i].size(); j++)

            Mat_full[i + huge_mat.n_rows * j] = V[i][j];



    t1 = high_resolution_clock::now();



    for(i=0; i < nrows; i++)

        for(j=0; j < ncols; j++)

        {

            //compute mean of subset-------------------

            sum = 0;

            for(k = 0; k < ncols; k++)

                if(k!=j)

                {

                    sum = sum + Mat_full[i+k*nrows];

                }

            mean = sum / (ncols-1);



            //compute standard deviation of subset-----

            sq_diff_sum = 0;

            for(k = 0; k < ncols; k++)

                if(k!=j)

                {

                    diff = Mat_full[i+k*nrows] - mean;

                    sq_diff_sum += diff * diff;

                }

            stdd = sqrt(sq_diff_sum / (ncols-2));



            //export to plain C array:

            output[i*ncols+j] = (Mat_full[i+j*nrows] - mean) / (sqrt(1+1/(((double) ncols)-1))*stdd);

        }



    t2 = high_resolution_clock::now();

    duration = duration_cast<seconds>( t2 - t1 ).count();

    std::cout << "C ARRAY: " << duration << " secsn";

}

In particular the calls to arma::mean and arma::stddev seem to perform poorly when comparing execution times. I did not perform any in-depth analyse of the size-effect over performance, but it seems that for small values of nrows the plain C tends to be (very much) faster. For a simple test using this
setup i got:

ARMADILLO: 111 secs

C ARRAY: 79 secs

in execution time.

EDIT
Here is modification where we work column-wise instead of row-wise and treat each column independently, as suggested by @rubenvb and @mtall. The resulting execution time slightly is decreased (ARMADILLO: 104 secs now), thus showing some improvments over working row-wise:

#include <string>

#include <vector>

#include <iostream>

#include <fstream>

#include <algorithm>

#include <armadillo>

#include <chrono>



using namespace std::chrono;



/***************************

 * main()

 ***************************/

int main( int argc, char *argv )

{

    unsigned nrows = 100; //number of rows

    unsigned ncols = 2000000; //number of cols



    const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix



    const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_rows-1, huge_mat.n_rows); //create a vector of [0,...,n]

    arma::colvec inds = arma::zeros<arma::colvec>( huge_mat.n_rows-1 ); //-1 since we remove only one value at each step.

    arma::rowvec simuT = arma::zeros<arma::rowvec>( nrows ); //let's store the results in this simuT vector.



    high_resolution_clock::time_point t1 = high_resolution_clock::now();



    //compute some normalization over each value of line of this huge matrix:

    for(unsigned i=0; i < ncols; i++) {

        const arma::colvec current_line = huge_mat.col(i); //extract current line



        //for each observation in current_line:

        for(unsigned j=0; j < nrows; j++) {



            //Take care of side effects first:

            if( j == 0 )

                inds = current_line(arma::span(1, nrows-1));

            else

                if( j == 1 ) {

                    inds(0) = current_line(0);

                    inds(arma::span(1, nrows-2)) = current_line( arma::span(2, nrows-1) );

                } else

                    inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );



            //Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.

            simuT(j) = (current_line(j) - arma::mean(inds))  / ( std::sqrt( 1+1/((double) nrows-1) ) * arma::stddev(inds) );

        }

    }



    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    auto duration = duration_cast<seconds>( t2 - t1 ).count();

    std::cout << "ARMADILLO: " << duration << " secsn";

}

edited Nov 27 '18 at 10:24

asked Nov 23 '18 at 15:04

Grasshoper

527

2

First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.

– rubenvb
Nov 23 '18 at 15:11

1

Not related: A little optimisation for both codes. You can avoid the test k != j, then subtract the term corresponding to j.

– Damien
Nov 23 '18 at 15:29

1

Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.

– John Bollinger
Nov 23 '18 at 15:50

2

This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data to Mat_full via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example: double* data = huge_mat.memptr(). This means you can directly use C like access when needed.

– mtall
Nov 26 '18 at 7:53

2

@Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.

– mtall
Nov 26 '18 at 14:33

|
show 5 more comments

I'm currently computing a small quantity for each value of a big matrix (millions of rows, number of columns < 1000) while considering each row independently.

#include <string>

#include <vector>

#include <iostream>

#include <fstream>

#include <algorithm>

#include <armadillo>

#include <chrono>



using namespace std::chrono;



/***************************

 * main()

 ***************************/

int main( int argc, char *argv )

{

    unsigned nrows = 2000000; //number of rows

    unsigned ncols = 100; //number of cols



    const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix



    const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_cols-1, huge_mat.n_cols); //create a vector of [0,...,n]

    arma::rowvec inds = arma::zeros<arma::rowvec>( huge_mat.n_cols-1 ); //-1 since we remove only one value at each step.

    arma::colvec simuT = arma::zeros<arma::colvec>( ncols ); //let's store the results in this simuT vector.



    high_resolution_clock::time_point t1 = high_resolution_clock::now();



    //compute some normalization over each value of line of this huge matrix:

    for(unsigned i=0; i < nrows; i++) {

        const arma::rowvec current_line = huge_mat.row(i); //extract current line



        //for each observation in current_line:

        for(unsigned j=0; j < ncols; j++) {



            //Take care of side effects first:

            if( j == 0 )

                inds = current_line(arma::span(1, ncols-1));

            else

                if( j == 1 ) {

                    inds(0) = current_line(0);

                    inds(arma::span(1, ncols-2)) = current_line( arma::span(2, ncols-1) );

                } else

                    inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );



            //Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.

            simuT(j) = (current_line(j) - arma::mean(inds))  / ( std::sqrt( 1+1/((double) ncols-1) ) * arma::stddev(inds) );

        }

    }



    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    auto duration = duration_cast<seconds>( t2 - t1 ).count();

    std::cout << "ARMADILLO: " << duration << " secsn";



    //------------------PLAIN C Array

    double *Mat_full;

    double *output;

    unsigned int i,j,k;

    double mean=0, stdd=0;

    double sq_diff_sum = 0, sum=0;

    double diff = 0;



    Mat_full = (double *) malloc(ncols * nrows * sizeof(double));

    output = (double *) malloc(nrows * ncols * sizeof(double));



    std::vector< std::vector<double> > V(huge_mat.n_rows);



    //Some UGLY copy from arma::mat to double* using a vector:

    for (size_t i = 0; i < huge_mat.n_rows; ++i)

        V[i] = arma::conv_to< std::vector<double> >::from(huge_mat.row(i));



    //then dump to Mat_full array:

    for (i=0; i < V.size(); i++)

        for (j=0; j < V[i].size(); j++)

            Mat_full[i + huge_mat.n_rows * j] = V[i][j];



    t1 = high_resolution_clock::now();



    for(i=0; i < nrows; i++)

        for(j=0; j < ncols; j++)

        {

            //compute mean of subset-------------------

            sum = 0;

            for(k = 0; k < ncols; k++)

                if(k!=j)

                {

                    sum = sum + Mat_full[i+k*nrows];

                }

            mean = sum / (ncols-1);



            //compute standard deviation of subset-----

            sq_diff_sum = 0;

            for(k = 0; k < ncols; k++)

                if(k!=j)

                {

                    diff = Mat_full[i+k*nrows] - mean;

                    sq_diff_sum += diff * diff;

                }

            stdd = sqrt(sq_diff_sum / (ncols-2));



            //export to plain C array:

            output[i*ncols+j] = (Mat_full[i+j*nrows] - mean) / (sqrt(1+1/(((double) ncols)-1))*stdd);

        }



    t2 = high_resolution_clock::now();

    duration = duration_cast<seconds>( t2 - t1 ).count();

    std::cout << "C ARRAY: " << duration << " secsn";

}

ARMADILLO: 111 secs

C ARRAY: 79 secs

in execution time.

#include <string>

#include <vector>

#include <iostream>

#include <fstream>

#include <algorithm>

#include <armadillo>

#include <chrono>



using namespace std::chrono;



/***************************

 * main()

 ***************************/

int main( int argc, char *argv )

{

    unsigned nrows = 100; //number of rows

    unsigned ncols = 2000000; //number of cols



    const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix



    const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_rows-1, huge_mat.n_rows); //create a vector of [0,...,n]

    arma::colvec inds = arma::zeros<arma::colvec>( huge_mat.n_rows-1 ); //-1 since we remove only one value at each step.

    arma::rowvec simuT = arma::zeros<arma::rowvec>( nrows ); //let's store the results in this simuT vector.



    high_resolution_clock::time_point t1 = high_resolution_clock::now();



    //compute some normalization over each value of line of this huge matrix:

    for(unsigned i=0; i < ncols; i++) {

        const arma::colvec current_line = huge_mat.col(i); //extract current line



        //for each observation in current_line:

        for(unsigned j=0; j < nrows; j++) {



            //Take care of side effects first:

            if( j == 0 )

                inds = current_line(arma::span(1, nrows-1));

            else

                if( j == 1 ) {

                    inds(0) = current_line(0);

                    inds(arma::span(1, nrows-2)) = current_line( arma::span(2, nrows-1) );

                } else

                    inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );



            //Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.

            simuT(j) = (current_line(j) - arma::mean(inds))  / ( std::sqrt( 1+1/((double) nrows-1) ) * arma::stddev(inds) );

        }

    }



    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    auto duration = duration_cast<seconds>( t2 - t1 ).count();

    std::cout << "ARMADILLO: " << duration << " secsn";

}

edited Nov 27 '18 at 10:24

asked Nov 23 '18 at 15:04

Grasshoper

527

2

First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.

– rubenvb
Nov 23 '18 at 15:11

1

Not related: A little optimisation for both codes. You can avoid the test k != j, then subtract the term corresponding to j.

– Damien
Nov 23 '18 at 15:29

1

Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.

– John Bollinger
Nov 23 '18 at 15:50

2

This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data to Mat_full via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example: double* data = huge_mat.memptr(). This means you can directly use C like access when needed.

– mtall
Nov 26 '18 at 7:53

2

@Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.

– mtall
Nov 26 '18 at 14:33

|
show 5 more comments

I'm currently computing a small quantity for each value of a big matrix (millions of rows, number of columns < 1000) while considering each row independently.

#include <string>

#include <vector>

#include <iostream>

#include <fstream>

#include <algorithm>

#include <armadillo>

#include <chrono>



using namespace std::chrono;



/***************************

 * main()

 ***************************/

int main( int argc, char *argv )

{

    unsigned nrows = 2000000; //number of rows

    unsigned ncols = 100; //number of cols



    const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix



    const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_cols-1, huge_mat.n_cols); //create a vector of [0,...,n]

    arma::rowvec inds = arma::zeros<arma::rowvec>( huge_mat.n_cols-1 ); //-1 since we remove only one value at each step.

    arma::colvec simuT = arma::zeros<arma::colvec>( ncols ); //let's store the results in this simuT vector.



    high_resolution_clock::time_point t1 = high_resolution_clock::now();



    //compute some normalization over each value of line of this huge matrix:

    for(unsigned i=0; i < nrows; i++) {

        const arma::rowvec current_line = huge_mat.row(i); //extract current line



        //for each observation in current_line:

        for(unsigned j=0; j < ncols; j++) {



            //Take care of side effects first:

            if( j == 0 )

                inds = current_line(arma::span(1, ncols-1));

            else

                if( j == 1 ) {

                    inds(0) = current_line(0);

                    inds(arma::span(1, ncols-2)) = current_line( arma::span(2, ncols-1) );

                } else

                    inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );



            //Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.

            simuT(j) = (current_line(j) - arma::mean(inds))  / ( std::sqrt( 1+1/((double) ncols-1) ) * arma::stddev(inds) );

        }

    }



    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    auto duration = duration_cast<seconds>( t2 - t1 ).count();

    std::cout << "ARMADILLO: " << duration << " secsn";



    //------------------PLAIN C Array

    double *Mat_full;

    double *output;

    unsigned int i,j,k;

    double mean=0, stdd=0;

    double sq_diff_sum = 0, sum=0;

    double diff = 0;



    Mat_full = (double *) malloc(ncols * nrows * sizeof(double));

    output = (double *) malloc(nrows * ncols * sizeof(double));



    std::vector< std::vector<double> > V(huge_mat.n_rows);



    //Some UGLY copy from arma::mat to double* using a vector:

    for (size_t i = 0; i < huge_mat.n_rows; ++i)

        V[i] = arma::conv_to< std::vector<double> >::from(huge_mat.row(i));



    //then dump to Mat_full array:

    for (i=0; i < V.size(); i++)

        for (j=0; j < V[i].size(); j++)

            Mat_full[i + huge_mat.n_rows * j] = V[i][j];



    t1 = high_resolution_clock::now();



    for(i=0; i < nrows; i++)

        for(j=0; j < ncols; j++)

        {

            //compute mean of subset-------------------

            sum = 0;

            for(k = 0; k < ncols; k++)

                if(k!=j)

                {

                    sum = sum + Mat_full[i+k*nrows];

                }

            mean = sum / (ncols-1);



            //compute standard deviation of subset-----

            sq_diff_sum = 0;

            for(k = 0; k < ncols; k++)

                if(k!=j)

                {

                    diff = Mat_full[i+k*nrows] - mean;

                    sq_diff_sum += diff * diff;

                }

            stdd = sqrt(sq_diff_sum / (ncols-2));



            //export to plain C array:

            output[i*ncols+j] = (Mat_full[i+j*nrows] - mean) / (sqrt(1+1/(((double) ncols)-1))*stdd);

        }



    t2 = high_resolution_clock::now();

    duration = duration_cast<seconds>( t2 - t1 ).count();

    std::cout << "C ARRAY: " << duration << " secsn";

}

ARMADILLO: 111 secs

C ARRAY: 79 secs

in execution time.

#include <string>

#include <vector>

#include <iostream>

#include <fstream>

#include <algorithm>

#include <armadillo>

#include <chrono>



using namespace std::chrono;



/***************************

 * main()

 ***************************/

int main( int argc, char *argv )

{

    unsigned nrows = 100; //number of rows

    unsigned ncols = 2000000; //number of cols



    const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix



    const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_rows-1, huge_mat.n_rows); //create a vector of [0,...,n]

    arma::colvec inds = arma::zeros<arma::colvec>( huge_mat.n_rows-1 ); //-1 since we remove only one value at each step.

    arma::rowvec simuT = arma::zeros<arma::rowvec>( nrows ); //let's store the results in this simuT vector.



    high_resolution_clock::time_point t1 = high_resolution_clock::now();



    //compute some normalization over each value of line of this huge matrix:

    for(unsigned i=0; i < ncols; i++) {

        const arma::colvec current_line = huge_mat.col(i); //extract current line



        //for each observation in current_line:

        for(unsigned j=0; j < nrows; j++) {



            //Take care of side effects first:

            if( j == 0 )

                inds = current_line(arma::span(1, nrows-1));

            else

                if( j == 1 ) {

                    inds(0) = current_line(0);

                    inds(arma::span(1, nrows-2)) = current_line( arma::span(2, nrows-1) );

                } else

                    inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );



            //Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.

            simuT(j) = (current_line(j) - arma::mean(inds))  / ( std::sqrt( 1+1/((double) nrows-1) ) * arma::stddev(inds) );

        }

    }



    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    auto duration = duration_cast<seconds>( t2 - t1 ).count();

    std::cout << "ARMADILLO: " << duration << " secsn";

}

edited Nov 27 '18 at 10:24

asked Nov 23 '18 at 15:04

Grasshoper

527

I'm currently computing a small quantity for each value of a big matrix (millions of rows, number of columns < 1000) while considering each row independently.

#include <string>

#include <vector>

#include <iostream>

#include <fstream>

#include <algorithm>

#include <armadillo>

#include <chrono>



using namespace std::chrono;



/***************************

 * main()

 ***************************/

int main( int argc, char *argv )

{

    unsigned nrows = 2000000; //number of rows

    unsigned ncols = 100; //number of cols



    const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix



    const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_cols-1, huge_mat.n_cols); //create a vector of [0,...,n]

    arma::rowvec inds = arma::zeros<arma::rowvec>( huge_mat.n_cols-1 ); //-1 since we remove only one value at each step.

    arma::colvec simuT = arma::zeros<arma::colvec>( ncols ); //let's store the results in this simuT vector.



    high_resolution_clock::time_point t1 = high_resolution_clock::now();



    //compute some normalization over each value of line of this huge matrix:

    for(unsigned i=0; i < nrows; i++) {

        const arma::rowvec current_line = huge_mat.row(i); //extract current line



        //for each observation in current_line:

        for(unsigned j=0; j < ncols; j++) {



            //Take care of side effects first:

            if( j == 0 )

                inds = current_line(arma::span(1, ncols-1));

            else

                if( j == 1 ) {

                    inds(0) = current_line(0);

                    inds(arma::span(1, ncols-2)) = current_line( arma::span(2, ncols-1) );

                } else

                    inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );



            //Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.

            simuT(j) = (current_line(j) - arma::mean(inds))  / ( std::sqrt( 1+1/((double) ncols-1) ) * arma::stddev(inds) );

        }

    }



    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    auto duration = duration_cast<seconds>( t2 - t1 ).count();

    std::cout << "ARMADILLO: " << duration << " secsn";



    //------------------PLAIN C Array

    double *Mat_full;

    double *output;

    unsigned int i,j,k;

    double mean=0, stdd=0;

    double sq_diff_sum = 0, sum=0;

    double diff = 0;



    Mat_full = (double *) malloc(ncols * nrows * sizeof(double));

    output = (double *) malloc(nrows * ncols * sizeof(double));



    std::vector< std::vector<double> > V(huge_mat.n_rows);



    //Some UGLY copy from arma::mat to double* using a vector:

    for (size_t i = 0; i < huge_mat.n_rows; ++i)

        V[i] = arma::conv_to< std::vector<double> >::from(huge_mat.row(i));



    //then dump to Mat_full array:

    for (i=0; i < V.size(); i++)

        for (j=0; j < V[i].size(); j++)

            Mat_full[i + huge_mat.n_rows * j] = V[i][j];



    t1 = high_resolution_clock::now();



    for(i=0; i < nrows; i++)

        for(j=0; j < ncols; j++)

        {

            //compute mean of subset-------------------

            sum = 0;

            for(k = 0; k < ncols; k++)

                if(k!=j)

                {

                    sum = sum + Mat_full[i+k*nrows];

                }

            mean = sum / (ncols-1);



            //compute standard deviation of subset-----

            sq_diff_sum = 0;

            for(k = 0; k < ncols; k++)

                if(k!=j)

                {

                    diff = Mat_full[i+k*nrows] - mean;

                    sq_diff_sum += diff * diff;

                }

            stdd = sqrt(sq_diff_sum / (ncols-2));



            //export to plain C array:

            output[i*ncols+j] = (Mat_full[i+j*nrows] - mean) / (sqrt(1+1/(((double) ncols)-1))*stdd);

        }



    t2 = high_resolution_clock::now();

    duration = duration_cast<seconds>( t2 - t1 ).count();

    std::cout << "C ARRAY: " << duration << " secsn";

}

ARMADILLO: 111 secs

C ARRAY: 79 secs

in execution time.

#include <string>

#include <vector>

#include <iostream>

#include <fstream>

#include <algorithm>

#include <armadillo>

#include <chrono>



using namespace std::chrono;



/***************************

 * main()

 ***************************/

int main( int argc, char *argv )

{

    unsigned nrows = 100; //number of rows

    unsigned ncols = 2000000; //number of cols



    const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix



    const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_rows-1, huge_mat.n_rows); //create a vector of [0,...,n]

    arma::colvec inds = arma::zeros<arma::colvec>( huge_mat.n_rows-1 ); //-1 since we remove only one value at each step.

    arma::rowvec simuT = arma::zeros<arma::rowvec>( nrows ); //let's store the results in this simuT vector.



    high_resolution_clock::time_point t1 = high_resolution_clock::now();



    //compute some normalization over each value of line of this huge matrix:

    for(unsigned i=0; i < ncols; i++) {

        const arma::colvec current_line = huge_mat.col(i); //extract current line



        //for each observation in current_line:

        for(unsigned j=0; j < nrows; j++) {



            //Take care of side effects first:

            if( j == 0 )

                inds = current_line(arma::span(1, nrows-1));

            else

                if( j == 1 ) {

                    inds(0) = current_line(0);

                    inds(arma::span(1, nrows-2)) = current_line( arma::span(2, nrows-1) );

                } else

                    inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );



            //Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.

            simuT(j) = (current_line(j) - arma::mean(inds))  / ( std::sqrt( 1+1/((double) nrows-1) ) * arma::stddev(inds) );

        }

    }



    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    auto duration = duration_cast<seconds>( t2 - t1 ).count();

    std::cout << "ARMADILLO: " << duration << " secsn";

}

c++ arrays matrix armadillo

edited Nov 27 '18 at 10:24

asked Nov 23 '18 at 15:04

Grasshoper

527

edited Nov 27 '18 at 10:24

asked Nov 23 '18 at 15:04

Grasshoper

527

edited Nov 27 '18 at 10:24

asked Nov 23 '18 at 15:04

Grasshoper

527

asked Nov 23 '18 at 15:04

Grasshoper

527

asked Nov 23 '18 at 15:04

Grasshoper

527

2

First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.

– rubenvb
Nov 23 '18 at 15:11

1

Not related: A little optimisation for both codes. You can avoid the test k != j, then subtract the term corresponding to j.

– Damien
Nov 23 '18 at 15:29

1

Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.

– John Bollinger
Nov 23 '18 at 15:50

2

This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data to Mat_full via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example: double* data = huge_mat.memptr(). This means you can directly use C like access when needed.

– mtall
Nov 26 '18 at 7:53

2

@Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.

– mtall
Nov 26 '18 at 14:33

|
show 5 more comments

2

First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.

– rubenvb
Nov 23 '18 at 15:11

1

Not related: A little optimisation for both codes. You can avoid the test k != j, then subtract the term corresponding to j.

– Damien
Nov 23 '18 at 15:29

1

Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.

– John Bollinger
Nov 23 '18 at 15:50

2

This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data to Mat_full via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example: double* data = huge_mat.memptr(). This means you can directly use C like access when needed.

– mtall
Nov 26 '18 at 7:53

2

@Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.

– mtall
Nov 26 '18 at 14:33

First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.

– rubenvb
Nov 23 '18 at 15:11

Not related: A little optimisation for both codes. You can avoid the test k != j, then subtract the term corresponding to j.

– Damien
Nov 23 '18 at 15:29

Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.

– John Bollinger
Nov 23 '18 at 15:50

This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data to Mat_full via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example: double* data = huge_mat.memptr(). This means you can directly use C like access when needed.

– mtall
Nov 26 '18 at 7:53

@Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.

– mtall
Nov 26 '18 at 14:33

|
show 5 more comments

1 Answer
1

active

oldest

votes

The reason is that Armadillo uses column-major ordering in mat, while your C array uses row-major ordering. This is kind of a big deal because your processor can use instruction vectorization to process multiple elements at once, where this requires contiguous memory chunks.

To verify whether this is the cause, do the same calculation but for columns instead of rows, and check the difference.

answered Nov 27 '18 at 10:56

The Quantum Physicist

12k748101

Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.

– Grasshoper
Nov 27 '18 at 13:22

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53448997%2fwhy-is-armadillo-so-slow-compared-to-a-c-style-array-in-a-simple-row-wise-comput%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

To verify whether this is the cause, do the same calculation but for columns instead of rows, and check the difference.

answered Nov 27 '18 at 10:56

The Quantum Physicist

12k748101

Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.

– Grasshoper
Nov 27 '18 at 13:22

add a comment |

To verify whether this is the cause, do the same calculation but for columns instead of rows, and check the difference.

answered Nov 27 '18 at 10:56

The Quantum Physicist

12k748101

Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.

– Grasshoper
Nov 27 '18 at 13:22

add a comment |

To verify whether this is the cause, do the same calculation but for columns instead of rows, and check the difference.

answered Nov 27 '18 at 10:56

The Quantum Physicist

12k748101

To verify whether this is the cause, do the same calculation but for columns instead of rows, and check the difference.

answered Nov 27 '18 at 10:56

The Quantum Physicist

12k748101

answered Nov 27 '18 at 10:56

The Quantum Physicist

12k748101

answered Nov 27 '18 at 10:56

The Quantum Physicist

12k748101

answered Nov 27 '18 at 10:56

The Quantum Physicist

12k748101

Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.

– Grasshoper
Nov 27 '18 at 13:22

add a comment |

Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.

– Grasshoper
Nov 27 '18 at 13:22

Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.

– Grasshoper
Nov 27 '18 at 13:22

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Ytukyg