Intel Parallel Studio XE 2016: High Performance for HPC Applications and Big Data Analytics
Written by James Reinders
Intel Parallel Studio XE 2016, launched on August 25, 2015, is the latest installment in Intel’s developer toolkit for high performance computing (HPC) and technical computing applications. This suite of compilers, libraries, debugging facilities, and analysis tools targets Intel architecture, including support for the latest Intel® Xeon® processors (codenamed Skylake) and Intel® Xeon Phi™ processors (codenamed Knights Landing). Intel Parallel Studio XE 2016 helps software developers design, build, verify and tune code in Fortran, C++, C and Java.
There are four things that I like to highlight when I describe this year’s tool release:
- Intel Data Analytics Acceleration Library
- Vectorization Advisor
- MPI Performance Snapshot
- High performance support for industry standards, the latest processors, operating systems and their related development environments.
Intel Data Analytics Acceleration Library (Intel® DAAL)
Data scientists are finding Intel® DAAL very exciting because it helps speed up big data analytics. It’s designed for use with popular data platforms including Hadoop*, Spark, R, and MATLAB, for highly efficient data access. We’ve seen Intel DAAL accelerate PCA by 4-7X, and one customer has seen 200X for the Alternating Least Squares prediction algorithm, when compared with the latest open source Spark + MLlib (details for both claims are in my blog about DAAL). Intel DAAL was created by the renowned team that creates the Intel® Math Kernel Library (Intel MKL). Intel DAAL can be thought of as “Intel MKL for Big Data”, but it is actually much more! Many more details on Intel DAAL, including ways to download it today for free, are in my blog about DAAL. Intel DAAL is available for Linux, OS X and Windows.
Vectorization Advisor
Vectorisation is the process of using SIMD instructions in processors. In the quest to “modernise” applications to get top performance out of any modern processor, a software developer needs to tackle multithreading, vectorisation and fabric scaling. Intel Advisor XE 2016 provides tools to help with multithreading and vectorisation:
Vectorization Advisor is an analysis tool that helps you identify the loops that will benefit most from vectorisation, pinpoint the obstacles to vectorisation that are particular to your program, explore the benefit of alternative data organisation, and increase your confidence that transformations aimed at increasing vectorisation will preserve the correctness of your original program.
Threading Advisor is a threading design and prototyping tool that lets you analyse, design, tune, and check threading design options rapidly.
Threading Advisor has gained a reputation in the past five years for helping find the right choice for multithreading an application more quickly and without costly oversights. The experience of refining this advisor, together with customer feedback on the best ways to give advice based on program analysis, has helped us to create the new advisor for vectorisation.
Vectorization Advisor cannot tell you anything we could not show you how to do yourself. However, when I teach vectorisation I tend to rattle off a list of things to check, and each item that I suggest checking involves using a tool in a particular way. Bringing all of that into one tool makes life easier and makes the process faster and more efficient. One of the key Vectorization Advisor features is a Survey Report that offers integrated compiler report data and performance data all in one place, including GUI-embedded advice on how to fix vectorisation issues specific to your code. That GUI-embedded advice is augmented with links to web-based vectorisation resources.
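To make “alternative data organisation” a little more concrete, here is a minimal C++ sketch of one common case: switching from an array-of-structures layout to a structure-of-arrays layout so that a loop reads memory with unit stride. The type and function names (`ParticleAoS`, `ParticlesSoA`, `scale_x_aos`, `scale_x_soa`, `sum_x`) are hypothetical and chosen only for illustration; they are not taken from any Intel tool or report. The `#pragma omp simd` lines are standard OpenMP 4.0 hints, which a compiler ignores when OpenMP SIMD support is not enabled.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structures layout: the x, y and z of one particle sit next to
// each other, so a loop that touches only x strides through memory.
struct ParticleAoS {
    float x, y, z;
};

// Structure-of-arrays layout: all x values are contiguous (unit stride),
// which is usually easier for a compiler to map onto SIMD loads and stores.
struct ParticlesSoA {
    std::vector<float> x, y, z;
};

// Scale every x coordinate, AoS version (stride of three floats).
void scale_x_aos(std::vector<ParticleAoS>& particles, float factor) {
    for (std::size_t i = 0; i < particles.size(); ++i) {
        particles[i].x *= factor;
    }
}

// Scale every x coordinate, SoA version (unit stride).
void scale_x_soa(ParticlesSoA& particles, float factor) {
    float* x = particles.x.data();
    const std::size_t n = particles.x.size();
    // Hint that the iterations are independent and may run in SIMD lanes.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        x[i] *= factor;
    }
}

// Sum of all x coordinates. The reduction clause tells the compiler that
// the accumulation into 'total' may be reordered across SIMD lanes.
float sum_x(const ParticlesSoA& particles) {
    const float* x = particles.x.data();
    const std::size_t n = particles.x.size();
    float total = 0.0f;
    #pragma omp simd reduction(+:total)
    for (std::size_t i = 0; i < n; ++i) {
        total += x[i];
    }
    return total;
}
```

Both versions compute the same result; whether the SoA form is actually faster depends on the compiler, the target processor and the rest of the program, which is exactly the kind of question a measurement-based tool is meant to answer.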
MPI Performance Snapshot
The MPI Performance Snapshot is a scalable, lightweight performance tool for MPI applications. It collects a variety of MPI application statistics (such as communication, activity, and load balance) and presents them in an easy-to-read format. The tool is not available separately but is provided as part of the Intel® Parallel Studio XE 2016 Cluster Edition.
The MPI Performance Snapshot helps deal with the following problems, which arise when analysing MPI applications that scale out to thousands of ranks:
- The size of clusters continues to grow, so applications are being scaled to more and more ranks.
- Large amounts of data are collected when profiling at larger scale, and that data can easily become unmanageable.
- It’s hard to identify which key metrics to track when you gather such large amounts of data.
By addressing these three items, MPI Performance Snapshot improves scaling to at least 32K ranks, which is an order of magnitude above what is tolerable with the prior Intel Trace Analyzer and Collector. Therefore, when aiming to optimise a large-scale run (anything above one thousand MPI ranks), we now suggest starting with the MPI Performance Snapshot to figure out where you need to dig deeper (which processes are slowing the application down, where the peaks in memory usage are, etc.). Then, do another run with the Intel Trace Analyzer and Collector on a subset of selected ranks to get more detailed per-process information, in order to visualise how a communication algorithm is implemented and to see if there are apparent bottlenecks.
MPI Performance Snapshot combines lightweight statistics from the Intel® MPI Library with OS and hardware-level counters to provide you with a high-level categorisation of your application: MPI vs. OpenMP load imbalance information, memory usage, and a breakdown of MPI vs. computation vs. serial time.
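As a rough illustration of what such a high-level categorisation boils down to, the C++ sketch below reduces per-rank timings to a few summary numbers. It is not the tool’s implementation or its output format: the `RankTime` and `Summary` types, the `summarise` function and the imbalance formula ((max - mean) / max of compute time, one commonly used definition) are all assumptions made for this example, and the timings are supplied by hand rather than collected from a real MPI run.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-rank timing record, in seconds.
struct RankTime {
    double mpi;      // time spent inside MPI calls
    double compute;  // time spent outside MPI calls
};

// Hypothetical summary of a whole run.
struct Summary {
    double mpi_fraction;  // total MPI time / total (MPI + compute) time
    double imbalance;     // (max compute - mean compute) / max compute
    int slowest_rank;     // rank with the most compute time, -1 if no ranks
};

// Reduce per-rank timings to a handful of numbers that fit on one screen,
// regardless of how many ranks took part in the run.
Summary summarise(const std::vector<RankTime>& ranks) {
    Summary s{0.0, 0.0, -1};
    if (ranks.empty()) {
        return s;
    }

    double total_mpi = 0.0;
    double total_compute = 0.0;
    double max_compute = -1.0;
    for (std::size_t i = 0; i < ranks.size(); ++i) {
        total_mpi += ranks[i].mpi;
        total_compute += ranks[i].compute;
        if (ranks[i].compute > max_compute) {
            max_compute = ranks[i].compute;
            s.slowest_rank = static_cast<int>(i);
        }
    }

    const double total = total_mpi + total_compute;
    if (total > 0.0) {
        s.mpi_fraction = total_mpi / total;
    }

    const double mean_compute = total_compute / static_cast<double>(ranks.size());
    if (max_compute > 0.0) {
        s.imbalance = (max_compute - mean_compute) / max_compute;
    }
    return s;
}
```

A summary like this points at the ranks worth a closer look (here, `slowest_rank`), which fits the workflow of taking a snapshot of the whole run first and then tracing only a selected subset of ranks in detail.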