speed of different loop implementations

gde23

Hello,

I need to perform a lot of matrix * vector multiplications and want to find out the best way to store the data.
The vectors and matrices are Eigen3 objects.

For testing I have implemented some useless loops, and I get different timings for each of them.

A big Eigen3 matrix from which I pick rows:
(At work I have had to use MATLAB a lot over the last few years, so I expected this to be the fastest.)

void calcMatrix(const Matrix4 &M, const Matrix1000 &raysM)
{
    #pragma omp parallel for
    for(int j=0;j<10000;j++)
    {
        for(int i = 0;i<1000;i++)
        {
            // local to the loop body, so each thread works on its own copy
            Vector4 ray;
            ray = raysM.row(i);
            ray = M*ray;
        }
    }
}

Using a QList of Eigen3 vectors:

void calcList(const Matrix4 &M, QList<Vector4> *raysL)
{
    #pragma omp parallel for
    for(int i=0;i<1000;i++)
    {
        for(int j = 0;j<10000;j++)
        {
            // local to the loop body, so each thread works on its own copy
            Vector4 ray = raysL->at(i);
            ray = M*ray;
        }
    }
}

A QList of objects that contain the Eigen3 vectors:

void calcListClass(const Matrix4 &M, QList<RayClass> *raysC)
{
    #pragma omp parallel for
    for(int i=0;i<1000;i++)
    {
        for(int j = 0;j<10000;j++)
        {
            // local to the loop body, so each thread works on its own copy
            Vector4 ray = raysC->at(i).pos;
            ray = M*ray;
        }
    }
}

A QList of objects that contain the Eigen3 vectors and have a method (trace) that computes the useless loop:

void calcListClassMethod(const Matrix4 &M, QList<RayClass> *raysC)
{
    #pragma omp parallel for
    for(int i=0;i<1000;i++)
    {
        // the copy is local to the loop body, so each thread has its own
        RayClass ray = raysC->at(i);
        ray.trace(M);
    }
}

When I measure the time each computation takes with QElapsedTimer, I get the following results:

Eigen3: 7120 milliseconds
QList: 5458 milliseconds
RayClass: 5425 milliseconds
RayClassWithMethod: 5088 milliseconds

It seems that the variant that calls a method on the object is the fastest.
But I want to understand why. And is there maybe an even faster version?

Thanks in advance
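
A side note on the OpenMP loops, as a sketch rather than a diagnosis of the timings: a variable declared before #pragma omp parallel for is shared by all threads unless it is marked private, so a temporary like ray is safest declared inside the loop body. It is also worth storing the results, since a compiler may remove work whose result is never used. The example uses plain std::array types instead of Eigen; Vec4, Mat4, multiply and transformAll are made-up names for illustration only.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<Vec4, 4>;   // row-major: m[row][col]

// Hypothetical helper: plain 4x4 matrix times 4-vector.
Vec4 multiply(const Mat4 &m, const Vec4 &v)
{
    Vec4 out{};
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            out[r] += m[r][c] * v[c];
    return out;
}

// Transforms every ray. Each iteration writes only its own output slot,
// so no two threads ever touch the same element.
std::vector<Vec4> transformAll(const Mat4 &m, const std::vector<Vec4> &rays)
{
    std::vector<Vec4> result(rays.size());
    const long n = static_cast<long>(rays.size());
    // The pragma is ignored when the compiler is run without OpenMP support.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
    {
        // Declared inside the loop body: every thread gets its own copy.
        const Vec4 ray = rays[static_cast<std::size_t>(i)];
        result[static_cast<std::size_t>(i)] = multiply(m, ray);
    }
    return result;
}
```

Because the output is returned to the caller, the multiplications cannot be optimized away, which makes timings of such a loop easier to trust.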

SGaist

Hi,

You should rather use a QVector if you want to go the Qt way. It should perform better than QList.

gde23

@SGaist: Thanks for the quick answer.
I tested QVector as well as std::vector as the container and get more or less the same result as for QList in all cases.
QVector seems to be slightly faster; however, the difference is less than 1%.

Eigen3 4x1000                  61874 milliseconds
Eigen3 4x1 QList               49248 milliseconds
RayClass QList                 49127 milliseconds
RayClass QVector               49536 milliseconds
RayClassMethod QList           47555 milliseconds
RayClassMethod QVector         47347 milliseconds
RayClassMethod std::vector     47126 milliseconds

I think I will implement the real algorithm and test it again with the different containers.

kshegunov

Instead of doing matrix-vector multiplications in a loop, do a single matrix-matrix multiplication and drop the manual OpenMP loops. Eigen (if that's the library you're using) already vectorizes its kernels with the SIMD extensions your processor supports, and it can run some operations, such as large matrix-matrix products, on several threads when it is built with OpenMP enabled. Put your vectors as columns in a rectangular matrix (4x1000) and multiply it by the 4x4 matrix from the left. The resulting (multiplied) vectors will be the columns of the produced (4x1000) rectangular matrix. Basically:

void calcMatrix(const Matrix<qreal, 4, 4> & M, Matrix<qreal, 4, 1000> & rays)
{
    rays = M * rays;
}
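
The idea behind this can be checked without Eigen: column j of M * R is just M times column j of R, so one batched product gives the same numbers as the loop of matrix-vector products. Below is a minimal sketch in plain C++, assuming a column-major layout (which is also Eigen's default) in a flat buffer. Mat4, multiplyColumn and multiplyBatch are hypothetical names, not Eigen API.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Mat4 = std::array<double, 16>;   // 4x4, column-major: m[4*col + row]

// Reference version: one matrix-vector product for the 4 values starting at v.
void multiplyColumn(const Mat4 &m, const double *v, double *out)
{
    for (std::size_t r = 0; r < 4; ++r)
    {
        out[r] = 0.0;
        for (std::size_t c = 0; c < 4; ++c)
            out[r] += m[4 * c + r] * v[c];
    }
}

// Batched version: rays holds a 4 x n matrix, column-major, so each ray is
// four consecutive doubles. Returns m * rays in the same layout.
std::vector<double> multiplyBatch(const Mat4 &m, const std::vector<double> &rays)
{
    std::vector<double> out(rays.size(), 0.0);
    const std::size_t n = rays.size() / 4;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t c = 0; c < 4; ++c)
            for (std::size_t r = 0; r < 4; ++r)
                out[4 * j + r] += m[4 * c + r] * rays[4 * j + c];
    return out;
}
```

The batched version walks one contiguous buffer from start to end, which is the kind of access pattern a linear algebra library can vectorize well. No container lookups or per-vector copies are involved.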

gde23

@kshegunov Thanks. That is really a lot faster.
However, I'm running into trouble with large matrices (4x10000).

I get the following error:

/usr/include/eigen3/Eigen/src/Core/DenseStorage.h:33: error: 'OBJECT_ALLOCATED_ON_STACK_IS_TOO_BIG' is not a member of 'Eigen::internal::static_assertion<false>' EIGEN_STATIC_ASSERT(Size * sizeof(T) <= EIGEN_STACK_ALLOCATION_LIMIT, OBJECT_ALLOCATED_ON_STACK_IS_TOO_BIG);

The matrices I created should not be on the stack, so I think Eigen allocates some memory on the stack internally? Can this be changed?

mrjj

@gde23 said in speed of different loop implementations:

OBJECT_ALLOCATED_ON_STACK_IS_TOO_BIG

Google tells me you can put
#define EIGEN_STACK_ALLOCATION_LIMIT 1000000
before including Eigen/Core
to alter the limit.
Whether that is enough, I can't tell :)
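
Whether a given limit is enough can be estimated with a bit of arithmetic. The sketch below mirrors the quoted check (Size * sizeof(T) <= EIGEN_STACK_ALLOCATION_LIMIT) under two assumptions: that the scalar type is double (qreal is double on most platforms), and that Eigen's default limit is 128 KB. The names matrixBytes, fitsLimit, kDefaultLimit and kRaisedLimit are made up for the example.

```cpp
#include <cstddef>

// Assumption: Eigen's default EIGEN_STACK_ALLOCATION_LIMIT is 128 KB.
constexpr std::size_t kDefaultLimit = 128 * 1024;   // 131072 bytes
constexpr std::size_t kRaisedLimit  = 1000000;      // value from the #define above

// Bytes needed by a fixed-size rows x cols matrix of doubles.
constexpr std::size_t matrixBytes(std::size_t rows, std::size_t cols)
{
    return rows * cols * sizeof(double);
}

// Same comparison as the static assertion in the error message.
constexpr bool fitsLimit(std::size_t bytes, std::size_t limit)
{
    return bytes <= limit;
}

// 4 x 1000 doubles  =  32000 bytes: below the assumed default limit.
static_assert(fitsLimit(matrixBytes(4, 1000), kDefaultLimit), "4x1000 fits");
// 4 x 10000 doubles = 320000 bytes: above it, hence the compile error.
static_assert(!fitsLimit(matrixBytes(4, 10000), kDefaultLimit), "4x10000 does not fit");
// With the limit raised to 1000000 bytes the assertion passes again.
static_assert(fitsLimit(matrixBytes(4, 10000), kRaisedLimit), "fits the raised limit");
```

Raising the limit only silences the compile-time check. It does not make the thread's real stack any larger, so a matrix of this size still has to fit into whatever stack the platform provides.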

VRonin

#define EIGEN_STACK_ALLOCATION_LIMIT 0 removes the limit completely. I'm not sure whether this will just cause a stack overflow anyway, as that limit is designed to catch this kind of problem at compile time instead of at runtime.

gde23

@mrjj Thanks, that solved the problem

kshegunov

Don't mess with the stack! Instead make your (big) matrix, the one holding the vectors, dynamically sized (i.e. allocated on the heap). Use:

Matrix<qreal, 4, Dynamic>

instead of a fixed number of columns. And don't forget to initialize it before using it. See the documentation for more details.
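
To illustrate what "dynamically sized" buys you, here is an Eigen-free sketch: the column count is chosen at runtime, the coefficients live on the heap, and the object on the stack stays a small handle. DynamicRays is a hypothetical stand-in written for this example, not an Eigen type.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 4 x n matrix with a runtime column count.
class DynamicRays
{
public:
    // Allocates 4 * cols doubles on the heap and sets them all to zero,
    // so the contents are well defined before first use.
    explicit DynamicRays(std::size_t cols)
        : cols_(cols), data_(4 * cols, 0.0) {}

    std::size_t cols() const { return cols_; }

    // Column-major access: the four entries of one ray are adjacent.
    double &operator()(std::size_t row, std::size_t col) { return data_[4 * col + row]; }
    double operator()(std::size_t row, std::size_t col) const { return data_[4 * col + row]; }

private:
    std::size_t cols_;
    std::vector<double> data_;   // heap-allocated storage
};

// The handle is a few machine words, however many columns it refers to.
static_assert(sizeof(DynamicRays) < 128, "only the handle lives on the stack");
```

With Eigen itself, the corresponding object would be declared as Matrix<qreal, 4, Dynamic> and constructed with its sizes, for example rays(4, 10000), followed by rays.setZero() or an assignment of real data before the first multiplication.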

Kind regards.

VRonin

@kshegunov Can I upvote you 10 times?

kshegunov

Yes. I allow it. :]