Java* has long been a favorite developer language—one that is gaining traction with enterprise, embedded, and Internet of Things (IoT) applications. Intel® VTune™ Amplifier power and performance profiler (part of the Intel® software tools suite) has been profiling managed code like Java and .NET for quite some time. Python* has also become a popular programming language due to its ease of programming and ability to integrate systems more efficiently.
With the growth in popularity of these two different programming languages, it’s important to make sure applications are effectively using the available CPU capabilities. In response to these demands, Intel VTune Amplifier has extended its profiling capability to both Java- and Python-based applications. Here we explore using Intel VTune Amplifier to gain more understanding about how an application is performing.
What is Intel VTune Amplifier?
Intel VTune Amplifier is a performance and power profiler that can help you discover the specific modules/processes that are taking more CPU time (hotspots). It can also help catch issues at the microarchitectural level (e.g., cache misses, page walks, TLB issues, and others) through a sampling technology with minimal overhead.
Why Application Performance Profiling?
Application performance profiling identifies the code blocks that consume the most CPU clock cycles. You can gather this information by manually inserting timing APIs, but inspecting the results by hand to find the module that is causing problems takes a long time when your project has many modules. Intel VTune Amplifier offers many profilers that can help drill down to the issue.
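As a rough illustration of what "manually inserting timing APIs" means, the sketch below wraps two functions with a timer from the Python standard library. The function names (timed, load_data, process) are hypothetical and exist only for this example.

```python
import time


def timed(label, func, *args):
    """Run func(*args), print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.6f} s")
    return result, elapsed


def load_data(n):
    """Hypothetical stage 1: build the input values."""
    return list(range(n))


def process(values):
    """Hypothetical stage 2: do some arithmetic on every value."""
    return sum(v * v for v in values)


# Every block of interest has to be wrapped by hand.
data, t_load = timed("load_data", load_data, 100_000)
total, t_process = timed("process", process, data)
```

This works for two functions, but each new module needs its own timer calls, which is the maintenance cost a profiler avoids.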
Intel VTune Amplifier Features
Intel VTune Amplifier has various features that make it intuitive and easy to use. It:
- Uses sampling technology, which has a minimal overhead compared to instrumentation used for profiling.
- Is an easy-to-use profiler with various grouping/filters/caller-callee options, which can help to focus on the problematic code.
- Can provide the information at source level and assembly level.
- Can provide various types of microarchitectural information about the application.
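To give a feel for the sampling approach in the first bullet, here is a toy sampler in Python. It is only a conceptual sketch and not how Intel VTune Amplifier is implemented: a background thread periodically looks at which function the main thread is executing and counts the hits, so the profiled code itself is never modified. It relies on sys._current_frames(), which is specific to CPython, and the names sample_thread, busy_work, and profile_by_sampling are assumptions made for this example.

```python
import collections
import sys
import threading
import time


def sample_thread(target_ident, counts, stop_event, interval=0.005):
    """Every `interval` seconds, record the function the target thread is in."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(target_ident)  # CPython-specific
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)


def busy_work(duration=0.3):
    """Burn CPU for roughly `duration` seconds so there is something to sample."""
    deadline = time.perf_counter() + duration
    total = 0
    while time.perf_counter() < deadline:
        total += 1
    return total


def profile_by_sampling(func):
    """Run func() in the calling thread while a sampler thread watches it."""
    counts = collections.Counter()
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_thread,
        args=(threading.get_ident(), counts, stop_event),
    )
    sampler.start()
    try:
        func()
    finally:
        stop_event.set()
        sampler.join()
    return counts


samples = profile_by_sampling(busy_work)
for name, hits in samples.most_common(3):
    print(name, hits)
```

The function with the most samples is the likeliest hotspot. The cost of profiling is governed by the sampling interval rather than by how often the profiled functions are called.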
Profiling a Java Application
There are four steps to profiling a Java application:
- Create an Intel VTune Amplifier Project.
- Select the Application/Process to Profile.
- Select the type of analysis.
- Collect the results and interpret.
Create an Intel VTune Amplifier Project on a Windows* System
First, set up the environment variables with the amplxe-vars batch file. For example, if you installed Intel VTune Amplifier using the default directory, type:
C:\[Program Files]\IntelSWTools\VTune Amplifier XE\amplxe-vars.bat
The batch file displays the product name and the build number.
Next, launch Intel VTune Amplifier. For the standalone GUI, run the amplxe-gui command. Or, for the command-line interface, run the amplxe-cl command.
Create an Intel VTune Amplifier project (standalone version only):
- Click the menu button in the right corner and go to New > Project.
- Specify the project name and location in the Create Project dialogue box.
Note: If you are working on a Linux* platform, see Getting Started with Intel® VTune™ Amplifier XE 2016 for Linux* OS to learn how to create a new project.
Select the Application/Process to Profile
In the Analysis Target tab, select a target system from the left pane and select an analysis target type from the right pane. You can:
- Launch an application to be profiled using Launch Application.
- Attach the profiler to an application that is already running using Attach to Process.
- Observe an application’s interaction with system calls by selecting Profile System.
If you select Launch Application, provide the path to the java.exe file in the Application field and a set of parameters like "Java options" or "prop files" in the Application parameters (Figure 1). Additionally, you can give the set of environment variables that your application might need to run in the User-Defined Environment variables field. Alternatively, define a *.bat file with the various environment variables, java.exe binary, and parameters and provide this *.bat file in the Application field rather than the java.exe file.
Figure 1 - Choose target and analysis type
Select the Type of Analysis
Switch to the Analysis Type tab, select the Basic Hotspots analysis type, and click Start.
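For readers who prefer the command-line interface, the following Python sketch assembles what an equivalent amplxe-cl invocation might look like. It only builds and prints the command; it does not run it. The option names (-collect hotspots, -result-dir) are written from memory of amplxe-cl releases of this era and should be checked against amplxe-cl -help. The result directory, JAR file, and class name are hypothetical placeholders.

```python
def build_hotspots_command(result_dir, java_exe, java_args):
    """Return an amplxe-cl command line (as a list) for a hotspots collection.

    Assumption: "-collect hotspots" and "-result-dir" are the option names;
    verify them with "amplxe-cl -help" for the installed version.
    """
    return [
        "amplxe-cl",
        "-collect", "hotspots",
        "-result-dir", result_dir,
        "--",                      # everything after this is the target command
        java_exe,
        *java_args,
    ]


# Hypothetical target: app.jar and com.example.Main are placeholders.
command = build_hotspots_command(
    "r000hs",
    "java.exe",
    ["-cp", "app.jar", "com.example.Main"],
)
print(" ".join(command))
```

Keeping the command as a list makes it straightforward to hand to subprocess.run later without worrying about quoting paths that contain spaces.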
Collect the Results and Interpret
We have used the SPECjbb2000 benchmark for this example.
Once the data collection and finalization are done, a Summary tab opens showing the various metrics. Let’s interpret some of the metrics here:
- In the first section of the Summary tab (Figure 2), we see the Elapsed Time of the application (wall clock time from start to stop of the application), the CPU Time (time during which the application was actively utilizing the CPU), and the Total Thread Count (number of threads that were deployed in our application).
Figure 2 - Elapsed time
- The second section of the Summary tab (Figure 3) shows Top Hotspots, which are the functions that account for the most CPU time. In our example, we have the top five hotspots, with the first hotspot accounting for 13.165 seconds. The Module column shows both [Dynamic code] and [Compiled Java code]; these entries represent Java methods that cannot be attributed to any native binary module. All Java user methods are attributed to the [Compiled Java code] module. The JVM can also generate some methods for its internal use and report them to the Java profiler. Such methods are attributed to the [Dynamic code] module.
Figure 3 - Top hotspots
- The third section of the Summary tab (Figure 4) includes the CPU Usage Histogram, which visualizes the number of logical cores being used to run the application. The color coding signifies how Intel VTune Amplifier classifies the number of cores used by the application as Idle, Poor, OK, and Ideal.
Figure 4 - CPU usage histogram
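The difference between the Elapsed Time and CPU Time metrics described above can be reproduced in a few lines of Python. This is only a loose analogy using standard-library clocks, not the way Intel VTune Amplifier measures these values; the helper names measure and mostly_waiting are made up for this sketch.

```python
import time


def measure(func):
    """Return (elapsed_seconds, cpu_seconds) for one call of func()."""
    wall_start = time.perf_counter()   # wall clock
    cpu_start = time.process_time()    # CPU time consumed by this process
    func()
    elapsed = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return elapsed, cpu


def mostly_waiting():
    """Sleeping adds to elapsed time but uses almost no CPU time."""
    time.sleep(0.2)


elapsed, cpu = measure(mostly_waiting)
print(f"elapsed: {elapsed:.3f} s, cpu: {cpu:.3f} s")
```

A large gap between the two numbers, as here, suggests the application spends much of its time waiting rather than computing.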
We can retrieve more details by looking at the Bottom-Up tab (Figure 5), where we have the timeline view to visualize the workload on the threads with various groupings and filters and focus on a specific time interval. We can further drill down to the sources by double-clicking on the function/module of interest.
Figure 5 - Bottom-Up tab
To quickly start optimizing code, it is helpful to drill down to a specific source file location. In addition to the source, we can also see the corresponding assembly generated for a given source line, which is highlighted (Figure 6).
Figure 6 - Assembly generated for a source line
Besides knowing the modules that are consuming most of the CPU resources using the Basic Hotspots analysis type, we can find the context switches and transitions between threads using the Concurrency Analysis type (Figure 7). The yellow lines signify the transitions between threads.
Figure 7 - Concurrency analysis
Use the General Exploration analysis type to perform an architectural analysis to deep dive and understand the various issues related to caches, TLBs, page walks, false sharing, and many more.
Profiling a Python Application
Python profiling is a new feature in the 2017 version of Intel VTune Amplifier.
See the article, "Getting Your Python* Code to Run Faster Using Intel® VTune™ Amplifier XE," in Issue 25 of Parallel Universe magazine to learn how to get started with Intel VTune Amplifier for profiling Python applications. The article focuses on mixed-mode profiling, resolving symbols of *.pyd modules, and setting up the compiler and linker flags for Cython* compilation.
Mixed-Mode Profiling
Python is an interpreted language and doesn't use a compiler to generate native binary code. Cython is a Python-based language with C extensions, and unlike plain Python it can be compiled to native code. Intel VTune Amplifier 2017 fully supports reporting hotspot functions in both Python and Cython code.
Cython works in a two-step process:
- Convert a *.pyx (Cython) file into a *.c file
- Compile this *.c file into a *.pyd file (equivalent of a .so on Linux)
The sample used for this example is a matrix multiplication of size 512 x 512.
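The contents of Matrix_mul.pyx are not shown in this article. As a stand-in, the sketch below is a hypothetical minimal version written in plain Python, which Cython also accepts as input. Only the module name and the entry point func() come from the article; make_matrix, multiply, and the n parameter are assumptions.

```python
# Hypothetical stand-in for Matrix_mul.pyx (plain Python is valid Cython input).
import random


def make_matrix(n, seed):
    """Build an n x n matrix of pseudo-random floats."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n)] for _ in range(n)]


def multiply(a, b):
    """Naive triple-loop matrix multiplication: result = a * b."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(inner):
            a_ik = a[i][k]
            for j in range(cols):
                result[i][j] += a_ik * b[k][j]
    return result


def func(n=512):
    """Entry point called from simple.py; multiplies two n x n matrices."""
    return multiply(make_matrix(n, seed=1), make_matrix(n, seed=2))
```

With the default size the triple loop performs about 134 million multiply-add steps, so the uncompiled version is slow, which makes it a convenient hotspot to look for in the profiler.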
Below are the steps for generating the *.pyd files and using VTune Amplifier to resolve the symbols:
- Set up the setup.py file with extensions and compiler and linker flags, which allows you to use Cython to convert the *.pyx source code from Python to C.
#setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
from Cython.Distutils import build_ext

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[
        Extension(
            'Matrix_mul',
            sources=["Matrix_mul.pyx"],
            extra_compile_args=['/Z7'],  # compiler flag: emit debug information
            extra_link_args=['/DEBUG'],  # linker flag: generate debug symbols
        ),
    ],
)
In the setup.py file above, extensions were created with the compiler flag /Z7 and the linker flag /DEBUG to generate the debug symbols. Including these flags is important if we want to drill down to the source level in Intel VTune Amplifier. If you are working on Linux platforms, use -g in both extra_compile_args and extra_link_args.
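If the same setup.py has to work on both Windows and Linux, the flag choice described above could be factored into a small helper like the one sketched here. The function name debug_build_flags is hypothetical, and the Windows branch assumes the Microsoft compiler and linker, since /Z7 and /DEBUG are their options.

```python
import sys


def debug_build_flags(platform=sys.platform):
    """Return (extra_compile_args, extra_link_args) that request debug symbols.

    Assumption: Windows builds use the Microsoft toolchain; every other
    platform uses a compiler that understands -g.
    """
    if platform.startswith("win"):
        return ["/Z7"], ["/DEBUG"]
    return ["-g"], ["-g"]


compile_args, link_args = debug_build_flags()
print(compile_args, link_args)
```

The two returned lists can then be passed as extra_compile_args and extra_link_args when creating the Extension.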
- Call Python with the following command to convert it to C using Cython:
python setup.py build_ext
This command generates the corresponding *.c and *.pyd module. In our case, we should see Matrix_mul.c and Matrix_mul.pyd. Since we have also requested the generation of symbolic information, we should see a *.pdb file as well.
- Create a Python script importing the Matrix_mul.pyd module as shown below:
#simple.py
import Matrix_mul
Matrix_mul.func()
- Use the Basic Hotspots analysis type in Intel VTune Amplifier 2017 to analyze the application with simple.py as the parameter to the Python interpreter (Figure 8).
Figure 8 - Choose target and analysis type
- Once the Basic Hotspots analysis completes, the modules/functions can be seen on the Bottom-Up tab (Figure 9).
Figure 9 - Modules/functions
In this example, the Matrix_mul.pyd module (converted module) is one of the top hotspots that is consuming most of the CPU time. In the Filter section at the bottom of the page, we have selected "Only user functions" as we were interested in analyzing/profiling the user code alone. To view the user- and system-level functions and their interaction, select "User/system functions." We can further drill down into each of these source files to figure out the issue at the specific source line of these modules. We can also see the corresponding assembly for the C code being generated. By double-clicking the first user-function hotspot above, we get the corresponding source file as shown in Figure 10.
Figure 10 - Clicking the first user-function hotspot
Using the Source View, we can relate the generated C code to the corresponding Python code (in this example, lines 1052-1058 correspond to Python code). Using this data, a user can focus on the problematic piece or module of Python code identified by Intel VTune Amplifier and begin optimization.
Use the Caller/Callee tab to see the caller-callee for the selected function/module (Figure 11).
Figure 11 - Caller/Callee tab
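To illustrate what caller/callee information looks like, the sketch below uses cProfile and pstats from the Python standard library. This is not Intel VTune Amplifier output, only an analogy for the kind of relationship the Caller/Callee tab displays. The functions multiply, dot, and callers_report are hypothetical.

```python
import cProfile
import io
import pstats


def multiply(x, y):
    return x * y


def dot(row, col):
    """Dot product that calls multiply() once per element pair."""
    total = 0
    for x, y in zip(row, col):
        total += multiply(x, y)
    return total


def callers_report(func, *args, pattern):
    """Profile func(*args) and return (result, text listing who called `pattern`)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).print_callers(pattern)
    return result, stream.getvalue()


result, report = callers_report(dot, [1, 2, 3], [4, 5, 6], pattern="multiply")
print(report)
```

The report lists multiply together with the function that called it (dot) and the call count, which is the same question the Caller/Callee tab answers for a selected function.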
Summary
Intel VTune Amplifier can be used to narrow down performance analysis to the specific code that is consuming the most CPU time (hotspots) in Java or Python applications. This is possible for both pure Python applications and mixed Python/Cython code.