Tuesday, February 19, 2008


In software engineering, performance analysis (a field of dynamic program analysis) is the investigation of a program's behavior using information gathered as the program runs, as opposed to static code analysis. The usual goal of performance analysis is to determine which parts of a program to optimize for speed or memory usage.

Use of profilers
Profiler-driven program analysis on Unix dates back to at least 1979, when Unix systems included a basic tool, "prof", that listed each function and how much of the program's execution time it used. In 1982, gprof extended the concept to a complete call graph analysis (Gprof: a Call Graph Execution Profiler [1]).
In 1994, Amitabh Srivastava and Alan Eustace of Digital Equipment Corporation published a paper describing ATOM [2]. ATOM is a platform for converting a program into its own profiler. That is, at compile time, it inserts code into the program to be analyzed. That inserted code outputs analysis data. This technique, modifying a program to analyze itself, is known as "instrumentation".
In 2004, both the Gprof and ATOM papers appeared on the list of the 20 most influential PLDI papers of all time. [3]
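The instrumentation idea — a program modified so that it emits its own analysis data as it runs — can be sketched in a few lines. The following is a hypothetical Python illustration of the concept only, not ATOM's actual mechanism (ATOM rewrites compiled binaries; the names `instrument`, `busy` and `PROFILE_DATA` are invented for this sketch):

```python
import functools
import time

PROFILE_DATA = {}   # function name -> (call count, total seconds)

def instrument(func):
    """Wrap a function with inserted timing code, so the program
    collects analysis data about itself as it runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            calls, total = PROFILE_DATA.get(func.__name__, (0, 0.0))
            PROFILE_DATA[func.__name__] = (calls + 1, total + elapsed)
    return wrapper

@instrument
def busy(n):                      # a stand-in workload
    return sum(i * i for i in range(n))

busy(10_000)
busy(20_000)
calls, total = PROFILE_DATA["busy"]
print(f"busy: {calls} calls, {total:.6f} s total")
```

The essential point is the same as ATOM's: the analysis code runs inside the analyzed program, so the profile is produced as a side effect of a normal run.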

Profiler types based on output

Flat profiler
Flat profilers compute the average call times from the calls, and do not break down the call times based on the callee or the context.

Call-graph profiler
Call-graph profilers show the call times and frequencies of the functions, and also the call chains involved, based on the callee. However, the context is not preserved.

Statistical profilers
Some profilers operate by sampling. A sampling profiler probes the target program's program counter at regular intervals using operating system interrupts. Sampling profiles are typically less accurate and specific, but allow the target program to run at near full speed.
Other profilers instead instrument the target program with additional instructions to collect the required information. Instrumenting the program can change its performance, causing inaccurate results and heisenbugs. Instrumentation can be very specific, but slows down the target program as more detailed information is collected.
The data resulting from sampling are not exact, but a statistical approximation. The actual amount of error is usually more than one sampling period. In fact, if a value is n times the sampling period, the expected error in it is the square root of n sampling periods. [4]
Some of the most commonly used statistical profilers are GNU's gprof, OProfile and SGI's Pixie.
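The square-root error estimate is easy to check with a small simulation (not a real profiler — just Bernoulli trials standing in for program-counter probes; all names here are invented for the sketch):

```python
import random

# If a routine truly accounts for n sampling periods of run time, the
# sample-count estimate of n should have a standard error near sqrt(n).
random.seed(42)

NUM_SAMPLES = 10_000            # PC probes per simulated profiling run
P_BUSY = 0.02                   # the routine occupies 2% of run time
TRUE_N = P_BUSY * NUM_SAMPLES   # 200 sampling periods

def sampled_n(p_busy, num_samples):
    """Count how many PC probes land inside the routine."""
    return sum(1 for _ in range(num_samples) if random.random() < p_busy)

runs = [sampled_n(P_BUSY, NUM_SAMPLES) for _ in range(300)]
mean = sum(runs) / len(runs)
std = (sum((r - mean) ** 2 for r in runs) / len(runs)) ** 0.5
print(f"true n = {TRUE_N:.0f}, observed error = {std:.1f}, "
      f"sqrt(n) = {TRUE_N ** 0.5:.1f}")
```

The observed spread of the estimates comes out close to sqrt(200) ≈ 14 sampling periods, matching the claim.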

Instrumentation
Manual: done by the programmer, e.g. by adding instructions to explicitly calculate runtimes.
Compiler assisted: e.g. "gcc -pg ..." for gprof, or "quantify g++ ..." for Quantify.
Binary translation: the tool adds instrumentation to a compiled binary. Example: ATOM.
Runtime instrumentation: the code is instrumented directly before execution, and the program run is fully supervised and controlled by the tool. Examples: PIN, Valgrind.
Runtime injection: more lightweight than runtime instrumentation; code is modified at runtime to jump to helper functions. Example: DynInst.
Hypervisor: data are collected by running the (usually) unmodified program under a hypervisor. Example: SIMMON.
Simulator: data are collected by running the program under an instruction set simulator. Example: SIMMON.
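The "manual" method in the list above is the simplest to see concretely: the programmer adds explicit timing instructions around a region of interest. A minimal sketch (the loop is just a hypothetical workload):

```python
import time

# Manual instrumentation: explicit timing instructions added by hand.
start = time.perf_counter()
total = 0
for i in range(100_000):
    total += i * i
elapsed = time.perf_counter() - start
print(f"region took {elapsed:.6f} s (total = {total})")
```

Everything else in the list automates some version of this, trading programmer effort for tool complexity and, in some cases, run-time overhead.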
Simple manual technique
When a sequential program has an infinite loop, the simplest way to find the problem is to run it under a debugger, halt it with a "pause" button (not a breakpoint), and examine the call stack. Each statement (or instruction) on the call stack is a function call, except for the one at the "bottom" of the stack. One of those statements is in the infinite loop, and it can be found by single-stepping and examining the context of each statement.
The method works even if the running time is finite. First the program is modified, if necessary, to make it take more than a few seconds, perhaps by adding a temporary outer loop. Then, while the program is doing whatever seems to take too long, it is randomly halted, and a record is made of the call stack. The process is repeated to get additional samples of the call stack. At the same time, the call stacks are compared, so as to find any statements that appear on more than one. Any such statement, if a way can be found to invoke it much less frequently or eliminate it, reduces execution time by the fraction of time it resided on the call stack. Once that is done, the entire process can be repeated, up to several times, usually resulting in significant cumulative speedups. This method is called "random halting" or "deep sampling".
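The same idea can be automated: a sampler repeatedly snapshots the running program's call stack and counts which functions keep appearing. The sketch below uses a second thread as a stand-in for the human pressing pause; `sys._current_frames` is CPython-specific, and `worker`/`slow_helper` are hypothetical stand-ins for a real workload:

```python
import sys
import threading
import time
import traceback
from collections import Counter

def slow_helper():
    time.sleep(0.01)          # stands in for an expensive call

def worker():
    for _ in range(100):
        slow_helper()

stack_counts = Counter()
target = threading.Thread(target=worker)
target.start()

while target.is_alive():
    time.sleep(0.005)                      # "randomly" halt and look
    frame = sys._current_frames().get(target.ident)
    if frame is not None:
        for entry in traceback.extract_stack(frame):
            stack_counts[entry.name] += 1  # count each function on the stack
target.join()

# Functions present in most samples are where the time is going.
print(stack_counts.most_common(3))
```

Here `slow_helper` and `worker` appear on nearly every sample, which is exactly the signal the manual technique looks for: a statement that shows up on more than one stack snapshot.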
In making these performance-enhancing modifications, one gets the sense that one is fixing bugs of a type that only make a program slow, not wrong. A name for this type of bug is "slug" (slowness bug). Programs as first written generally contain both bugs and slugs. Bugs are usually removed during program testing, while slugs are usually not, unless performance analysis is employed during development.
There are different kinds of slugs. Generally, things that could be done intentionally to make a program run longer can also occur unintentionally. One commonly accepted kind of slug is a "hot spot", which is a tight inner loop where the program counter spends much of its time. For example, if one often finds at the bottom of the call stack a linear search algorithm instead of binary search, this would be a true hot spot slug. However, if another function is called in the search loop, such as string compare, that function would be found at the bottom of the stack, and the call to it in the loop would be at the next level up. In this case, the loop would not be a hot spot, but it would still be a slug. In all but the smallest programs, hot spot slugs are rare, but slugs are quite common.
Another way to slow down software is to use data structures that are too general for the problem at hand. For example, if a collection of objects remains small, a simple array with linear search could be much faster than something like a "dictionary" class, complete with hash coding. With this kind of slug, the program counter is most often found in system memory allocation and freeing routines as the collections are being constructed and destructed.
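This is the kind of claim worth measuring rather than assuming. The toy benchmark below (illustrative only; which structure wins depends heavily on the language and runtime — in CPython, for instance, dicts are very fast even at this size) compares a four-element list with linear search against a hash-based dict, for both lookup and construction cost:

```python
import timeit

pairs = [("alpha", 0), ("beta", 1), ("gamma", 2), ("delta", 3)]
table = dict(pairs)

def linear_lookup(key):
    for k, v in pairs:           # linear search through a tiny list
        if k == key:
            return v
    return None

lookup_list = timeit.timeit(lambda: linear_lookup("beta"), number=100_000)
lookup_dict = timeit.timeit(lambda: table["beta"], number=100_000)
build_list = timeit.timeit(lambda: list(pairs), number=100_000)
build_dict = timeit.timeit(lambda: dict(pairs), number=100_000)
print(f"lookup  list {lookup_list:.4f}s  dict {lookup_dict:.4f}s")
print(f"build   list {build_list:.4f}s  dict {build_dict:.4f}s")
```

The construction lines are the ones that correspond to the slug described above: when many small collections are built and torn down, the allocator, not the lookup, is often where the program counter is found.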
Another common motif is that a powerful function is written to collect a set of useful information (from a database, for example). Then that function is called multiple times, rather than taking the trouble to save the results from a prior call. A possible explanation for this could be that it is beyond a programmer's comprehension that a function call might take a million times as long to execute as an adjacent assignment statement. A contributing factor could be "information hiding", in which external users of a module can be ignorant of what goes on inside it.
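The cure for this motif is simply to save the result of the first call. A minimal sketch, assuming the powerful function is pure for a given argument (`expensive_query` is a hypothetical stand-in for the costly information-gathering function; `CALLS` just counts how often the slow work actually runs):

```python
import functools
import time

CALLS = {"count": 0}

@functools.lru_cache(maxsize=None)   # cache results per argument
def expensive_query(table):
    CALLS["count"] += 1
    time.sleep(0.01)                 # simulate slow database work
    return f"rows-of-{table}"

for _ in range(100):
    expensive_query("customers")     # 99 of these hit the cache

print(CALLS["count"])                # prints 1: the slow work ran once
```

This directly addresses the comprehension gap the text describes: once the cost of the call is visible, saving the result is the obvious fix.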
At this time, there are certain misconceptions in performance analysis. One is that timing is important. Knowing how much time is spent in functions is good for reporting improvements, but it provides only vague help in finding problems. The information that matters is the fraction of time that individual statements reside on the call stack.
Another misconception is that statistical precision matters. Typical slugs sit on the call stack between 5 and 95 percent of the time. The larger they are, the fewer samples are needed to find them. As in sport fishing, the object is to catch them first, and measure them later, if ever.
As an example, the iteration of slug removal tends to go something like this: Slug X1 could be taking 50% of the time, and X2 could be taking 25% of the time. If X1 is removed, execution time is cut in half, at which point X2 takes 50% of the time. If on the first pass X2 is removed instead, the time is only reduced by 1/4, but then X1 is seen as taking 67% of the time, so it is even more obvious, and can be removed. Either way, removing both X1 and X2 reduces execution time by 75%, so the remaining slugs are four times larger. This "magnification effect" allows the process to continue through X3, X4, and so on until all the easily-removed slugs have been fixed.
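The arithmetic behind this magnification effect is easy to verify:

```python
# Fractions of total run time taken by the two slugs:
x1, x2 = 0.50, 0.25

# Remove X1 first: time drops to 0.50 of original; X2's share doubles.
after_x1 = 1 - x1                 # 0.50 of original time remains
x2_share = x2 / after_x1          # 0.25 / 0.50 = 0.50

# Remove X2 first instead: time drops to 0.75; X1 now looks bigger.
after_x2 = 1 - x2                 # 0.75 of original time remains
x1_share = x1 / after_x2          # 0.50 / 0.75 ~ 0.67

# Either order, removing both leaves 0.25 of the original time,
# so every remaining slug looks four times larger than before.
both_removed = 1 - x1 - x2        # 0.25
print(x2_share, round(x1_share, 2), both_removed)   # prints: 0.5 0.67 0.25
```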


Dunlavey, "Performance Tuning: Slugging It Out!", Dr. Dobb's Journal, Vol 18, #12, November 1993, pp 18-26.
Dunlavey, "Building Better Applications: a Theory of Efficient Software Development", International Thomson Publishing, ISBN 0-442-01740-5, 1994.

List of Profilers

Ada
RapiTime is a performance analysis and instrumentation toolset, developed by Rapita Systems Ltd., that provides a solution to the problem of determining worst-case execution times (WCET).

C and C++
gprof, the GNU profiler, is part of GNU Binutils (which are part of the GNU project). It can be combined with the VCG visualization tools using the Call Graph Drawing Interface (CGDI); another option is kprof. Aimed mainly at C/C++, but works well for other languages.
Valgrind is a GPL-licensed system for debugging and profiling x86 Linux programs. It can automatically detect many memory management and threading bugs. alleyoop is a front-end for Valgrind. It works for any language, including assembler.
VTune Performance Analyzer is Intel's family of commercial performance analyzers for Windows and Linux executables on Intel CPUs. It has command-line tools, a standalone environment and plugins for Microsoft Visual Studio and Eclipse.
PurifyPlus is a commercial family of performance analysis tools from IBM's Rational unit, for Linux, UNIX and Windows.
Visual Studio Team System Profiler is Microsoft's commercial profiler offering.
Shark is Apple's free performance analyzer for Macintosh executables.
Intel Software Tools include Intel's analyzers, compilers and libraries.
PerfSuite is an open source collection of tools and libraries for application performance analysis on Linux that includes aggregate performance measurement and statistical profiling using hardware performance counters.
OProfile is a statistical, kernel-based, GPL-licensed profiler for Linux.
CodeAnalyst is AMD's free performance analyzer for Windows programs on AMD hardware. A Linux version is available.
Digital Mars ships a profiler for use with its C, C++ and D compilers.
Sysprof is a statistical profiler for Linux that uses a kernel module to profile all running processes. Its user interface shows the processes' call trees graphically.
PAPI is a portable interface (in the form of a library) to hardware performance counters on modern microprocessors.
DynaProf is a performance analysis tool designed to insert performance measurement instrumentation directly into a running application's address space at run time. It can measure real time as well as any hardware performance metrics available through PAPI.
TAU is an integrated performance instrumentation, measurement and analysis toolkit that uses PAPI. Automatic source-to-source instrumentation is implemented in TAU using the PDT (Program Database Toolkit) package. TAU provides both profiling and tracing capabilities, and uses the OTF (Open Trace Format) for scalable tracing with the Vampir/VampirServer trace visualizers.
AQTime is a commercial Windows profiler for Microsoft, Borland, Intel, Compaq and GNU compilers.
DynInst is an API that allows dynamic injection of code into a running program.
GlowCode is a commercial Windows profiler for managed and native code built using Microsoft Visual Studio 2005, 2003, 2002, and 6.0, on Windows XP, Windows Server 2003, and Windows Vista.
ANTS is billed as "more intuitive than AQTime".
RapiTime is a worst-case execution time (WCET) analysis tool; see the entry under Ada above. RapiTime also provides performance analysis of C programs.

Fortran
TAU is a portable performance evaluation toolkit. It supports Fortran loop-level, memory and I/O instrumentation with profiling and tracing.
KOJAK is an automatic trace-based performance bottleneck detection tool. It supports compiler-based automatic instrumentation.
SCALASCA is a scalable performance bottleneck detection tool derived from KOJAK.

Java
Eclipse Test and Performance Tools Platform (TPTP) is the Eclipse IDE profiler plugin.
NetBeans Profiler is the NetBeans IDE profiler pack.
JProfiler is a Java application performance analysis tool which includes a memory debugger, thread analyzer and CPU profiler.
JRat (Java Runtime Analysis Toolkit) is an LGPL-licensed profiler.
YourKit is a profiler for Java and the .NET framework.
JProbe is a profiler by Quest Software, part of the JProbe suite, which also includes tools such as a memory debugger.
JInspired's JXInsight provides distributed trace, profile and transaction analysis.
JProfiler by ProSyst.

.NET
Visual Studio Team System Profiler is Microsoft's commercial profiler offering.
CLR Profiler is a free profiler provided by Microsoft for CLR applications.
IJW Profiler is a currently-free lightweight profiler for .NET produced by IJW Software (NZ).
NProf is a statistical open source profiler for .NET.
Xenocode Fox Profiler is a commercial low-overhead profiler with integrated code decompilation.
JetBrains dotTrace Profiler is a commercial performance and memory profiler for .NET.
Red Gate ANTS Profiler is a commercial performance and memory profiler.
Mercury Diagnostics Profiler is a commercial profiler for .NET produced by HP.
SpeedTrace Profiler is a commercial profiler for .NET.

Object Pascal (Delphi)
ProDelphi is a profiler for Delphi 2-7, 2005, 2006 and Turbo Delphi (Win32 applications).
GpProfile is an open source instrumenting profiler for Delphi 2, 3, 4 and 5.

Perl
Devel::DProf is the original Perl code profiler.
Devel::Profiler is a more recent Perl code profiler.

PHP
Xdebug is a debugger and profiler tool for PHP.
Advanced PHP Debugger (APD) is a full-featured engine-level profiler/debugger.

Various languages
VB Watch Profiler is a commercial profiler for Visual Basic 6.0.
NetQoS Performance Center is a suite of network performance management tools for WAN environments.
