After the release of EPD 6.0, which now links numpy against the Intel MKL library (10.2), I wanted to get some insight into the performance impact of using the MKL.

**What impact does the MKL have on numpy performance?**

I have started a very rough, basic benchmark comparing EPD 5.1 with EPD 6.0. The former uses numpy 1.3 with ATLAS and the latter numpy 1.4 with the MKL. I am using a Thinkpad T60 with an Intel dual-core 2 GHz CPU running 32-bit Windows Vista.

Note: the benchmarking methodology is very crude and could be made much more realistic, but it gives a first insight.

Contrary to what I said at the last LFPUG meeting on Wednesday, you can control the maximum number of threads used for computation with the OMP_NUM_THREADS environment variable. I have updated the benchmark script to show its value when it runs.
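As a minimal sketch of how a script can report that limit, the snippet below reads `OMP_NUM_THREADS` from a mapping of environment variables. The helper name `thread_limit` is hypothetical and is not part of the benchmark script; it returns `None` when the variable is unset or is not a plain integer.

```python
import os


def thread_limit(environ=None, name="OMP_NUM_THREADS"):
    """Return the thread limit found in the environment, or None.

    `thread_limit` is a hypothetical helper used for illustration only.
    """
    if environ is None:
        environ = os.environ
    value = environ.get(name)
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        # Anything that is not a plain integer is not parsed by this sketch
        return None


if __name__ == "__main__":
    limit = thread_limit()
    if limit is None:
        print("OMP_NUM_THREADS is not set")
    else:
        print("Maximum number of threads: %d" % limit)
```

The variable should be set in the shell that launches Python, before the script starts, for example `set OMP_NUM_THREADS=1` on Windows or `export OMP_NUM_THREADS=1` in a Unix shell.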

Here are some results:

**Testing linear algebra functions**

I took some of the most commonly used functions and simply compared the CPU time using the IPython timeit command.
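Outside IPython, a rough equivalent can be put together with the standard `timeit` module. The sketch below is only an illustration: the helper name `mean_time_ms` and the pure-Python `sample_workload` are assumptions, and in practice you would pass one of the test functions from this post instead.

```python
import timeit


def mean_time_ms(func, repeat=3):
    """Call func() once per run, `repeat` times, and return the mean time in ms."""
    timer = timeit.Timer(func)  # Timer accepts a zero-argument callable
    runs = timer.repeat(repeat=repeat, number=1)
    return 1000.0 * sum(runs) / len(runs)


def sample_workload():
    """Hypothetical stand-in for a linear algebra call."""
    return sum(x * x for x in range(100000))


if __name__ == "__main__":
    print("sample_workload: %.1f ms" % mean_time_ms(sample_workload))
```

Averaging three single runs is a crude measure; taking the minimum of the runs, or repeating more often, would likely give more stable figures.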

Example 1: eigenvalues

    def test_eigenvalue():
        i = 500
        data = random((i, i))
        result = numpy.linalg.eig(data)

The results are interesting: 752 ms for the MKL version versus 3376 ms for the ATLAS version, which is about 4.5x faster. The very same code on Matlab 7.4 (R2007a) takes 790 ms.

Example 2: singular value decomposition

    def test_svd():
        i = 1000
        data = random((i, i))
        result = numpy.linalg.svd(data)
        result = numpy.linalg.svd(data, full_matrices=False)

Results are 4608ms with the MKL versus 15990ms without. This is nearly 3.5x faster.

Example 3: matrix inversion

    def test_inv():
        i = 1000
        data = random((i, i))
        result = numpy.linalg.inv(data)

Results are 418ms with the MKL versus 1457ms without. This is 3.5x faster.

Example 4: det()

    def test_det():
        i = 1000
        data = random((i, i))
        result = numpy.linalg.det(data)

Results are 186ms with the MKL versus 400ms without. This is 2x faster.

Example 5: dot()

    def test_dot():
        i = 1000
        a = random((i, i))
        b = numpy.linalg.inv(a)
        result = numpy.dot(a, b) - numpy.eye(i)

Results are 666ms with the MKL versus 2444ms without. This is 3.5x faster.

**Conclusion:**

Linear algebra functions show a clear performance improvement. I am happy to collect more information on this if you have run your own benchmarks. If enough data comes in, we should consider publishing the results as an official benchmark somewhere.

Function | Without MKL | With MKL | Speed up |
---|---|---|---|
test_eigenvalue | 3376ms | 752ms | 4.5x |
test_svd | 15990ms | 4608ms | 3.5x |
test_inv | 1457ms | 418ms | 3.5x |
test_det | 400ms | 186ms | 2x |
test_dot | 2444ms | 666ms | 3.5x |
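The speed-up column is simply the ratio of the two timings. As a quick check, the sketch below recomputes it from the figures in the table; the names `timings_ms` and `speedup` are assumptions made for illustration.

```python
# Timings in milliseconds taken from the table: (without MKL, with MKL)
timings_ms = {
    "test_eigenvalue": (3376.0, 752.0),
    "test_svd": (15990.0, 4608.0),
    "test_inv": (1457.0, 418.0),
    "test_det": (400.0, 186.0),
    "test_dot": (2444.0, 666.0),
}


def speedup(without_mkl, with_mkl):
    """Return how many times faster the MKL run is than the non-MKL run."""
    return without_mkl / with_mkl


if __name__ == "__main__":
    for name, (slow, fast) in sorted(timings_ms.items()):
        print("%-16s %.2fx" % (name, speedup(slow, fast)))
```

The exact ratios range from roughly 2.15x to 4.5x, so the values in the table are rounded approximations.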

If you want to test your own environment, feel free to use the script below:

    """
    Benchmark script to be used to evaluate the performance improvement of the MKL

    Copyright (c) 2010, Didrik Pinte <dpinte@enthought.com>
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions are met:

    1. Redistributions of source code must retain the above copyright notice,
       this list of conditions and the following disclaimer.
    2. Redistributions in binary form must reproduce the above copyright notice,
       this list of conditions and the following disclaimer in the documentation
       and/or other materials provided with the distribution.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
    AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
    DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
    FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
    DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
    SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
    CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
    OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
    """

    import os
    import sys
    import timeit

    import numpy
    from numpy.random import random


    def test_eigenvalue():
        """ Test eigenvalue computation of a matrix """
        i = 500
        data = random((i, i))
        result = numpy.linalg.eig(data)


    def test_svd():
        """ Test singular value decomposition of a matrix """
        i = 1000
        data = random((i, i))
        result = numpy.linalg.svd(data)
        result = numpy.linalg.svd(data, full_matrices=False)


    def test_inv():
        """ Test matrix inversion """
        i = 1000
        data = random((i, i))
        result = numpy.linalg.inv(data)


    def test_det():
        """ Test the computation of the matrix determinant """
        i = 1000
        data = random((i, i))
        result = numpy.linalg.det(data)


    def test_dot():
        """ Test the dot product """
        i = 1000
        a = random((i, i))
        b = numpy.linalg.inv(a)
        result = numpy.dot(a, b) - numpy.eye(i)


    # Tests to run. Each value is the pair of timings in ms that I measured
    # with the MKL (EPD 6.0) and without the MKL (EPD 5.1)
    tests = {test_eigenvalue: (752., 3376.),
             test_svd: (4608., 15990.),
             test_inv: (418., 1457.),
             test_det: (186.0, 400.),
             test_dot: (666., 2444.)
             }

    # Setting the following environment variable in the shell executing the script
    # allows you to limit the maximum number of threads used for computation
    THREADS_LIMIT_ENV = 'OMP_NUM_THREADS'


    def start_benchmark():
        # print(...) with a single argument runs on both Python 2 and Python 3
        print("""Benchmark is made against EPD 6.0, NumPy 1.4 with MKL and EPD 5.1, NumPy 1.3 with ATLAS
    on a Thinkpad T60 with Intel CoreDuo 2Ghz CPU on Windows Vista 32 bit
    """)
        # `in` replaces dict.has_key(), which no longer exists in Python 3
        if THREADS_LIMIT_ENV in os.environ:
            print("Maximum number of threads used for computation is : %s" % os.environ[THREADS_LIMIT_ENV])
        print("-" * 80)
        print("Starting timing with numpy %s\nVersion: %s" % (numpy.__version__, sys.version))
        print("%20s : %10s - %5s / %5s" % ("Function", "Timing [ms]", "MKL", "No MKL"))
        for fun, bench in tests.items():
            t = timeit.Timer(stmt="%s()" % fun.__name__, setup="from __main__ import %s" % fun.__name__)
            res = t.repeat(repeat=3, number=1)
            # Mean of the three runs, in ms; the last two columns are the ratios
            # of my reference timings to the local timing
            timing = 1000.0 * sum(res)/len(res)
            print("%20s : %7.1f ms -  %3.2f / %3.2f" % (fun.__name__, timing, bench[0]/timing, bench[1]/timing))


    if __name__ == '__main__':
        start_benchmark()

The Matlab bench function is the following:

    disp('Testing some linear algebra functions');
    disp('Eig');tic;data=rand(500,500);eig(data);toc;
    disp('Svd');tic;data=rand(1000,1000);[u,s,v]=svd(data);s=svd(data);toc;
    disp('Inv');tic;data=rand(1000,1000);result=inv(data);toc;
    disp('Det');tic;data=rand(1000,1000);result=det(data);toc;
    disp('Dot');tic;a=rand(1000,1000);b=inv(a);result=a*b-eye(1000);toc;
    disp('Done');

Hello,

Nice post. I’m interested to know which version of Intel MKL you were using. Would you be interested in sharing your results on the Intel MKL forum?

Todd,

The MKL version is 10.2 (the one coming with IFortran 11.1.051).

I am happy to share the information but will probably continue updating this page as I get more tests and information.

Thanks dpinte. I’m sure others would be interested to know about your results.

http://software.intel.com/en-us/forums/intel-math-kernel-library/

I look forward to your updates.

Todd, the link is posted here

http://software.intel.com/en-us/forums/showthread.php?t=71563

Didrik

[…] vs Matlab performance Further to the post I wrote on the MKL performance improvement on NumPy, I have tried to get some figures comparing it to Matlab. Here are some results. Any suggestion to […]

[…] I thought above is bad example and I searched for one of many comparisons available, i.e. first Google hit: https://dpinte.wordpress.com/2010/01/15/numpy-performance-improvement-with-the-mkl/ […]

[…] P.S.: In running the tests I partly took inspiration from this article […]

[…] GCC compiler suite, purely because of the cost. Intel’s compilers are not cheap! But they are independently reported to have some performance gains for Intel processors, and I managed to get them massively reduced […]

Sorry to bother you, but I wonder if you might attach a license to this code? It’s quite useful and I’d like to include it in a system for which an explicit license would be required.

John, I’ve added BSD license to the code. Feel free to use it!

Many thanks, Didrik. Much appreciated.

Is the script complete? The indentation is broken in the way the blog renders it for me in Chrome, and when I try to run it, Python just expects an indented block at start_benchmark().