mirror of
https://github.com/Zenithsiz/ftmemsim-valgrind.git
synced 2026-02-03 18:13:01 +00:00
- There were detailed descriptions of all the tools in the Quick Start Guide, the Manual introduction, and the start of each tool chapter. To avoid duplication/overlap, I removed these altogether from the Quick Start Guide, and shortened them in the intro.
- Improved the description of what errors Memcheck can find.
- Made all tool chapters start with "Overview" section, for consistency.
- Made the "run with --tool=XXX" bit consistent in each tool chapter.
- Made all tool chapter titles match the description given when running them.
- Added BBV to the User Manual intro.
- Generally clarified, updated, and future-proofed various bits of text in the Quick Start Guide and User Manual introduction.

Also:
- Changed Nulgrind's start-up description to "the minimal Valgrind tool".
- Fixed some punctuation in the usage message.

git-svn-id: svn://svn.valgrind.org/valgrind/trunk@10652
1022 lines
42 KiB
XML
<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"
[ <!ENTITY % cl-entities SYSTEM "cl-entities.xml"> %cl-entities; ]>

<chapter id="cl-manual" xreflabel="Callgrind Manual">
<title>Callgrind: a call-graph generating cache profiler</title>


<para>To use this tool, you must specify
<computeroutput>--tool=callgrind</computeroutput> on the
Valgrind command line.</para>

<sect1 id="cl-manual.use" xreflabel="Overview">
<title>Overview</title>

<para>Callgrind is a profiling tool that can
construct a call graph for a program's run.
By default, the collected data consists of
the number of instructions executed, their relationship
to source lines, the caller/callee relationship between functions,
and the numbers of such calls.
Optionally, a cache simulator (similar to Cachegrind) can produce
further information about the memory access behavior of the application.
</para>

<para>The profile data is written out to a file at program
termination. For presentation of the data, and interactive control
of the profiling, two command line tools are provided:</para>
<variablelist>
<varlistentry>
<term><command>callgrind_annotate</command></term>
<listitem>
<para>This command reads in the profile data, and prints a
sorted list of functions, optionally with source annotation.</para>
<!--
<para>You can read the manpage here: <xref
linkend="callgrind-annotate"/>.</para>
-->
<para>For graphical visualization of the data, try
<ulink url="&cl-gui;">KCachegrind</ulink>, which is a KDE/Qt based
GUI that makes it easy to navigate the large amount of data that
Callgrind produces.</para>

</listitem>
</varlistentry>

<varlistentry>
<term><command>callgrind_control</command></term>
<listitem>
<para>This command enables you to interactively observe and control
the status of currently running applications, without stopping
the application. You can
get statistics information as well as the current stack trace, and
you can request zeroing of counters or dumping of profile data.</para>
<!--
<para>You can read the manpage here: <xref linkend="callgrind-control"/>.</para>
-->
</listitem>
</varlistentry>
</variablelist>

<para>To use Callgrind, you must specify
<computeroutput>--tool=callgrind</computeroutput> on the Valgrind
command line.</para>

<sect2 id="cl-manual.functionality" xreflabel="Functionality">
<title>Functionality</title>

<para>Cachegrind collects flat profile data: event counts (data reads,
cache misses, etc.) are attributed directly to the function they
occurred in. This cost attribution mechanism is
called <emphasis>self</emphasis> or <emphasis>exclusive</emphasis>
attribution.</para>

<para>Callgrind extends this functionality by propagating costs
across function call boundaries. If function <code>foo</code> calls
<code>bar</code>, the costs from <code>bar</code> are added into
<code>foo</code>'s costs. When applied to the program as a whole,
this builds up a picture of so-called <emphasis>inclusive</emphasis>
costs, that is, where the cost of each function includes the costs of
all functions it called, directly or indirectly.</para>

<para>As an example, the inclusive cost of
<computeroutput>main</computeroutput> should be almost 100 percent
of the total program cost. It is not exactly 100 percent because
some costs arise before <computeroutput>main</computeroutput> is run,
such as initialization of the run time linker and construction of
global C++ objects.</para>

<para>Together with the call graph, this allows you to find the
specific call chains starting from
<computeroutput>main</computeroutput> in which the majority of the
program's costs occur. Caller/callee cost attribution is also useful
for profiling functions called from multiple call sites, and where
optimization opportunities depend on changing code in the callers, in
particular by reducing the call count.</para>

<para>Callgrind's cache simulation is based on the
<ulink url="&cg-tool-url;">Cachegrind tool</ulink>. Read
<ulink url="&cg-doc-url;">Cachegrind's documentation</ulink> first.
The material below describes the features supported in addition to
Cachegrind's features.</para>

<para>Callgrind's ability to detect function calls and returns depends
on the instruction set of the platform it is run on. It works best
on x86 and amd64, and currently works less well
on PowerPC code. This is because there are no explicit call or return
instructions in the PowerPC instruction set, so Callgrind has to rely
on heuristics to detect calls and returns.</para>

</sect2>

<sect2 id="cl-manual.basics" xreflabel="Basic Usage">
<title>Basic Usage</title>

<para>As with Cachegrind, you probably want to compile with debugging info
(the -g flag), but with optimization turned on.</para>

<para>To start a profile run for a program, execute:
<screen>valgrind --tool=callgrind [callgrind options] your-program [program options]</screen>
</para>

<para>While the simulation is running, you can observe execution with
<screen>callgrind_control -b</screen>
This will print out the current backtrace. To annotate the backtrace with
event counts, run
<screen>callgrind_control -e -b</screen>
</para>

<para>After program termination, a profile data file named
<computeroutput>callgrind.out.&lt;pid&gt;</computeroutput>
is generated, where <emphasis>pid</emphasis> is the process ID
of the program being profiled.
The data file contains information about the calls made in the
program among the functions executed, together with events of type
<command>Instruction Read Accesses</command> (Ir).</para>

<para>To generate a function-by-function summary from the profile
data file, use
<screen>callgrind_annotate [options] callgrind.out.&lt;pid&gt;</screen>
This summary is similar to the output you get from a Cachegrind
run with <computeroutput>cg_annotate</computeroutput>: the list
of functions is ordered by exclusive cost, and exclusive costs
are what is shown.
The following two options are important for the additional
features of Callgrind:</para>

<itemizedlist>
<listitem>
<para><option>--inclusive=yes</option>: Instead of using
exclusive cost of functions as sorting order, use and show
inclusive cost.</para>
</listitem>

<listitem>
<para><option>--tree=both</option>: Interleave into the
top level list of functions, information on the callers and the callees
of each function. In these lines, which represent executed
calls, the cost gives the number of events spent in the call.
Indented, above each function, there is the list of callers,
and below, the list of callees. The total inclusive cost of the
function is given both by the sum of events in calls to
it (caller lines) and by the sum of events in
calls from it (callee lines) plus its self
cost.</para>
</listitem>
</itemizedlist>

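<para>As an illustration of the two options above, the following
invocation sorts by inclusive cost and interleaves caller and callee
lines. This is only a sketch: the file name
<computeroutput>callgrind.out.12345</computeroutput> is a made-up
example, so substitute the name of your own profile data file.</para>
<screen>callgrind_annotate --inclusive=yes --tree=both callgrind.out.12345</screen>
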
<para>Use <option>--auto=yes</option> to get annotated source code
for all relevant functions for which the source can be found. In
addition to source annotation as produced by
<computeroutput>cg_annotate</computeroutput>, you will see the
annotated call sites with call counts. For all other options,
consult the (Cachegrind) documentation for
<computeroutput>cg_annotate</computeroutput>.
</para>

<para>For a better call graph browsing experience, it is highly recommended
to use <ulink url="&cl-gui;">KCachegrind</ulink>.
If your code
has a significant fraction of its cost in <emphasis>cycles</emphasis> (sets
of functions calling each other in a recursive manner), you have to
use KCachegrind, as <computeroutput>callgrind_annotate</computeroutput>
currently does not do any cycle detection, which is important to get correct
results in this case.</para>

<para>If you are additionally interested in measuring the
cache behavior of your
program, use Callgrind with the option
<option><xref linkend="opt.simulate-cache"/>=yes</option>.
However, expect a further slowdown by a factor of approximately 2.</para>

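<para>A run with cache simulation might look like the following sketch.
It assumes the option is spelled
<option>--simulate-cache</option>, as its cross-reference label suggests,
and <computeroutput>./myprog</computeroutput> is a placeholder for your
own program.</para>
<screen>valgrind --tool=callgrind --simulate-cache=yes ./myprog</screen>
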
<para>If the program section you want to profile is somewhere in the
middle of the run, it is beneficial to
<emphasis>fast forward</emphasis> to this section without any
profiling, and then switch on profiling. This is achieved by using
the command line option
<option><xref linkend="opt.instr-atstart"/>=no</option>
and running, in a shell,
<computeroutput>callgrind_control -i on</computeroutput> just before the
interesting code section is executed. To exactly specify
the code position where profiling should start, use the client request
<computeroutput><xref linkend="cr.start-instr"/></computeroutput>.</para>

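<para>The fast-forward workflow described above could look roughly like
this. The first command starts the program without instrumentation; the
second is typed in a different shell while the program is still running.
<computeroutput>./myprog</computeroutput> is a placeholder for your own
program.</para>
<screen>valgrind --tool=callgrind --instr-atstart=no ./myprog
callgrind_control -i on</screen>
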
<para>If you want to be able to see assembly code level annotation, specify
<option><xref linkend="opt.dump-instr"/>=yes</option>. This will produce
profile data at instruction granularity. Note that the resulting profile
data
can only be viewed with KCachegrind. For assembly annotation, it also is
interesting to see more details of the control flow inside of functions,
i.e. (conditional) jumps. This will be collected by further specifying
<option><xref linkend="opt.collect-jumps"/>=yes</option>.</para>

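<para>Both options can be combined in one run, as in the following
sketch. It assumes the second option is spelled
<option>--collect-jumps</option>, as its cross-reference label suggests;
<computeroutput>./myprog</computeroutput> is a placeholder for your own
program.</para>
<screen>valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes ./myprog</screen>
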
</sect2>

</sect1>

<sect1 id="cl-manual.usage" xreflabel="Advanced Usage">
<title>Advanced Usage</title>

<sect2 id="cl-manual.dumps"
xreflabel="Multiple dumps from one program run">
<title>Multiple profiling dumps from one program run</title>

<para>Sometimes you are not interested in characteristics of a full
program run, but only of a small part of it, for example execution of one
algorithm. If there are multiple algorithms, or one algorithm
running with different input data, it may even be useful to get different
profile information for different parts of a single program run.</para>

<para>Profile data files have names of the form
<screen>
callgrind.out.<emphasis>pid</emphasis>.<emphasis>part</emphasis>-<emphasis>threadID</emphasis>
</screen>
</para>
<para>where <emphasis>pid</emphasis> is the PID of the running
program, <emphasis>part</emphasis> is a number incremented on each
dump (".part" is skipped for the dump at program termination), and
<emphasis>threadID</emphasis> is a thread identification
("-threadID" is only used if you request dumps of individual
threads with <option><xref linkend="opt.separate-threads"/>=yes</option>).</para>

<para>There are different ways to generate multiple profile dumps
while a program is running under Callgrind's supervision. Nevertheless,
all methods trigger the same action, which is "dump all profile
information since the last dump or program start, and zero cost
counters afterwards". To allow for zeroing cost counters without
dumping, there is a second action "zero all cost counters now".
The different methods are:</para>
<itemizedlist>

<listitem>
<para><command>Dump on program termination.</command>
This method is the standard way and doesn't need any special
action on your part.</para>
</listitem>

<listitem>
<para><command>Spontaneous, interactive dumping.</command> Use
<screen>callgrind_control -d [hint [PID/Name]]</screen> to
request the dumping of profile information of the supervised
application with PID or Name. <emphasis>hint</emphasis> is an
arbitrary string you can optionally specify to later be able to
distinguish profile dumps. The control program will not terminate
before the dump is completely written. Note that the application
must be actively running for detection of the dump command. So,
for a GUI application, resize the window, or for a server, send a
request.</para>
<para>If you are using <ulink url="&cl-gui;">KCachegrind</ulink>
for browsing of profile information, you can use the toolbar
button <command>Force dump</command>. This will request a dump
and trigger a reload after the dump is written.</para>
</listitem>

<listitem>
<para><command>Periodic dumping after execution of a specified
number of basic blocks</command>. For this, use the command line
option <option><xref linkend="opt.dump-every-bb"/>=count</option>.
</para>
</listitem>

<listitem>
<para><command>Dumping at enter/leave of specified functions.</command>
Use the
option <option><xref linkend="opt.dump-before"/>=function</option>
and <option><xref linkend="opt.dump-after"/>=function</option>.
To zero cost counters before entering a function, use
<option><xref linkend="opt.zero-before"/>=function</option>.</para>
<para>You can specify these options multiple times for different
functions. Function specifications support wildcards: e.g. use
<option><xref linkend="opt.dump-before"/>='foo*'</option> to
generate dumps before entering any function starting with
<emphasis>foo</emphasis>.</para>
</listitem>

<listitem>
<para><command>Program controlled dumping.</command>
Insert
<computeroutput><xref linkend="cr.dump-stats"/>;</computeroutput>
at the position in your code where you want a profile dump to happen. Use
<computeroutput><xref linkend="cr.zero-stats"/>;</computeroutput> to only
zero profile counters.
See <xref linkend="cl-manual.clientrequests"/> for more information on
Callgrind-specific client requests.</para>
</listitem>
</itemizedlist>

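<para>The command-line driven methods in the list above might be used as
in the following sketch. The program name
<computeroutput>./myprog</computeroutput>, the function names
<computeroutput>foo*</computeroutput> and
<computeroutput>bar</computeroutput>, the basic block count and the hint
<computeroutput>after-startup</computeroutput> are all made-up
examples. The last command is typed in a different shell while the
program is running.</para>
<screen>valgrind --tool=callgrind --dump-every-bb=10000000 ./myprog
valgrind --tool=callgrind --dump-before='foo*' --zero-before=bar ./myprog
callgrind_control -d after-startup</screen>
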
<para>If you are running a multi-threaded application and specify the
command line option <option><xref linkend="opt.separate-threads"/>=yes</option>,
every thread will be profiled on its own and will create its own
profile dump. Thus, the last two methods will only generate one dump
of the currently running thread. With the other methods, you will get
multiple dumps (one for each thread) on a dump request.</para>

</sect2>


<sect2 id="cl-manual.limits"
xreflabel="Limiting range of event collection">
<title>Limiting the range of collected events</title>

<para>For aggregating events (function enter/leave,
instruction execution, memory access) into event numbers,
first, the events must be recognizable by Callgrind, and second,
the collection state must be switched on.</para>

<para>Event collection is only possible if <emphasis>instrumentation</emphasis>
for program code is switched on. This is the default, but for faster
execution (identical to <computeroutput>valgrind --tool=none</computeroutput>),
it can be switched off until the program reaches a state in which
you want to start collecting profiling data.
Callgrind can start without instrumentation
by specifying option <option><xref linkend="opt.instr-atstart"/>=no</option>.
Instrumentation can be switched on interactively
with <screen>callgrind_control -i on</screen>
and off by specifying "off" instead of "on".
Furthermore, instrumentation state can be programmatically changed with
the macros <computeroutput><xref linkend="cr.start-instr"/>;</computeroutput>
and <computeroutput><xref linkend="cr.stop-instr"/>;</computeroutput>.
</para>

<para>In addition to enabling instrumentation, you must also enable
event collection for the parts of your program you are interested in.
By default, event collection is enabled everywhere.
You can limit collection to a specific function
by using
<option><xref linkend="opt.toggle-collect"/>=function</option>.
This will toggle the collection state on entering and leaving
the specified functions.
When this option is in effect, the default collection state
at program start is "off". Only events happening while running
inside of the given function will be collected. Recursive
calls of the given function do not trigger any action.</para>

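<para>As a sketch, limiting collection to a single function could look
like this. It assumes the option is spelled
<option>--toggle-collect</option>, as its cross-reference label suggests;
the function name <computeroutput>compute</computeroutput> and the
program name <computeroutput>./myprog</computeroutput> are made-up
examples.</para>
<screen>valgrind --tool=callgrind --toggle-collect=compute ./myprog</screen>
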
<para>It is important to note that with instrumentation switched off, the
cache simulator cannot see any memory access events, and thus, any
simulated cache state will be frozen and wrong without instrumentation.
Therefore, to get useful cache events (hits/misses) after switching on
instrumentation, the cache first must warm up,
probably leading to many <emphasis>cold misses</emphasis>
which would not have happened in reality. If you do not want to see these,
start event collection a few million instructions after you have switched
on instrumentation.</para>


</sect2>


<sect2 id="cl-manual.cycles" xreflabel="Avoiding cycles">
<title>Avoiding cycles</title>

<para>Informally speaking, a cycle is a group of functions which
call each other in a recursive way.</para>

<para>Formally speaking, a cycle is a nonempty set S of functions,
such that for every pair of functions F and G in S, it is possible
to call from F to G (possibly via intermediate functions) and also
from G to F. Furthermore, S must be maximal -- that is, be the
largest set of functions satisfying this property. For example, if
a third function H is called from inside S and calls back into S,
then H is also part of the cycle and should be included in S.</para>

<para>Recursion is quite usual in programs, and therefore, cycles
sometimes appear in the call graph output of Callgrind. However,
the title of this section should raise two questions: What is bad
about cycles which makes you want to avoid them? And: How can
cycles be avoided without changing program code?</para>

<para>Cycles are not bad in themselves, but tend to make performance
analysis of your code harder. This is because inclusive costs
for calls inside of a cycle are meaningless. The definition of
inclusive cost, i.e. self cost of a function plus inclusive cost
of its callees, needs a topological order among functions. For
cycles, this does not hold true: callees of a function in a cycle include
the function itself. Therefore, KCachegrind does cycle detection
and skips visualization of any inclusive cost for calls inside
of cycles. Further, all functions in a cycle are collapsed into artificial
functions with names like <computeroutput>Cycle 1</computeroutput>.</para>

<para>Now, when a program exposes really big cycles (as is
true for some GUI code, or in general code using event or callback based
programming style), you lose the ability to pinpoint
the bottlenecks by following call chains from
<computeroutput>main()</computeroutput>, guided by
inclusive cost. In addition, KCachegrind loses its ability to show
interesting parts of the call graph, as it uses inclusive costs to
cut off uninteresting areas.</para>

<para>Despite the meaninglessness of inclusive costs in cycles, the big
drawback for visualization motivates the possibility to temporarily
switch off cycle detection in KCachegrind, which can lead to
misleading visualization. However, often cycles appear because of
unlucky superposition of independent call chains in a way that
the profile result will see a cycle. Neglecting uninteresting
calls with very small measured inclusive cost would break these
cycles. In such cases, incorrect handling of cycles by not detecting
them still gives meaningful profiling visualization.</para>

<para>Note that <command>callgrind_annotate</command> currently
does not do any cycle detection at all. For program executions with function
recursion, it can, for example, print nonsensical inclusive costs well
above 100%.</para>

<para>After describing why cycles are bad for profiling, it is worth
talking about cycle avoidance. The key insight here is that symbols in
the profile data do not have to exactly match the symbols found in the
program. Instead, the symbol name could encode additional information
from the current execution context such as recursion level of the
current function, or even some part of the call chain leading to the
function. While encoding of additional information into symbols is
quite capable of avoiding cycles, it has to be used carefully to not cause
symbol explosion. The latter imposes a large memory requirement on Callgrind,
with possible out-of-memory conditions, and produces big profile data
files.</para>

<para>A further possibility to avoid cycles in Callgrind's profile data
output is to simply leave out given functions in the call graph. Of course, this
also skips any call information from and to an ignored function, and thus can
break a cycle. Candidates for this typically are dispatcher functions in event
driven code. The option to ignore calls to a function is
<option><xref linkend="opt.fn-skip"/>=function</option>. Aside from
possibly breaking cycles, this is used in Callgrind to skip
trampoline functions in the PLT sections
for calls to functions in shared libraries. You can see the difference
if you profile with <option><xref linkend="opt.skip-plt"/>=no</option>.
If a call is ignored, its cost events will be propagated to the
enclosing function.</para>

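<para>For instance, to leave a family of dispatcher functions out of the
call graph, one might run something like the following sketch. It
assumes the option is spelled <option>--fn-skip</option>, as its
cross-reference label suggests; the pattern
<computeroutput>dispatch_*</computeroutput> and the program name
<computeroutput>./myprog</computeroutput> are made-up examples. The
single quotes keep the shell from expanding the wildcard itself.</para>
<screen>valgrind --tool=callgrind --fn-skip='dispatch_*' ./myprog</screen>
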
<para>If you have a recursive function, you can distinguish the first
10 recursion levels by specifying
<option><xref linkend="opt.separate-recs-num"/>=function</option>.
To do the same for all functions, use
<option><xref linkend="opt.separate-recs"/>=10</option>, but this will
give you much bigger profile data files. In the profile data, you will see
the recursion levels of "func" as the different functions with names
"func", "func'2", "func'3" and so on.</para>

<para>If you have call chains "A &gt; B &gt; C" and "A &gt; C &gt; B"
in your program, you usually get a "false" cycle "B &lt;&gt; C". Use
<option><xref linkend="opt.separate-callers-num"/>=B</option>
<option><xref linkend="opt.separate-callers-num"/>=C</option>,
and functions "B" and "C" will be treated as different functions
depending on the direct caller. Using the apostrophe for appending
this "context" to the function name, you get "A &gt; B'A &gt; C'B"
and "A &gt; C'A &gt; B'C", and there will be no cycle. Use
<option><xref linkend="opt.separate-callers"/>=2</option> to get a 2-caller
dependency for all functions. Note that doing this will increase
the size of profile data files.</para>

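<para>A run that separates all functions by their two nearest callers
might look like the following sketch. It assumes the option is spelled
<option>--separate-callers</option>, as its cross-reference label
suggests; <computeroutput>./myprog</computeroutput> is a placeholder for
your own program.</para>
<screen>valgrind --tool=callgrind --separate-callers=2 ./myprog</screen>
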
</sect2>

<sect2 id="cl-manual.forkingprograms" xreflabel="Forking Programs">
<title>Forking Programs</title>

<para>If your program forks, the child will inherit all the profiling
data that has been gathered for the parent. To start with empty profile
counter values in the child, the client request
<computeroutput><xref linkend="cr.zero-stats"/>;</computeroutput>
can be inserted into code to be executed by the child, directly after
<computeroutput>fork()</computeroutput>.</para>

<para>However, you will have to make sure that the output file format string
(controlled by <option>--callgrind-out-file</option>) contains
<option>%p</option> (which is true by default). Otherwise, the
outputs from the parent and child will overwrite each other or will be
intermingled, which almost certainly is not what you want.</para>

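<para>If you do override the output file name for a forking program, a
format string along the following lines keeps the
<option>%p</option> specifier so that parent and child write to
different files. The file name pattern and
<computeroutput>./myprog</computeroutput> are made-up examples.</para>
<screen>valgrind --tool=callgrind --callgrind-out-file=myprog.%p.out ./myprog</screen>
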
<para>You will be able to control the new child independently from
the parent via <computeroutput>callgrind_control</computeroutput>.</para>

</sect2>

</sect1>


<sect1 id="cl-manual.options" xreflabel="Command line option reference">
<title>Command line option reference</title>

<para>
In the following, options are grouped into classes, in the same order as
the output of <computeroutput>valgrind --tool=callgrind --help</computeroutput>.
</para>
<para>
Some options allow the specification of a function/symbol name, such as
<option><xref linkend="opt.dump-before"/>=function</option>, or
<option><xref linkend="opt.fn-skip"/>=function</option>. All these options
can be specified multiple times for different functions.
In addition, the function specifications are actually patterns: they support
the wildcards '*' (zero or more arbitrary characters) and '?'
(exactly one arbitrary character), similar to file name globbing in the
shell. This feature is especially important for C++, as without wildcards
the function would have to be specified in full, including its
parameter signature. </para>

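<para>As a sketch of wildcard use with C++ names, the following run
dumps before entering any function whose name starts with
<computeroutput>MyClass::render</computeroutput>, whatever its parameter
signature. The class name and
<computeroutput>./myprog</computeroutput> are made-up examples; the
single quotes keep the shell from expanding the wildcard itself.</para>
<screen>valgrind --tool=callgrind --dump-before='MyClass::render*' ./myprog</screen>
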
<sect2 id="cl-manual.options.misc"
xreflabel="Miscellaneous options">
<title>Miscellaneous options</title>

<variablelist id="cl.opts.list.misc">

<varlistentry>
<term><option>--help</option></term>
<listitem>
<para>Show summary of options. This is a short version of this
manual section.</para>
</listitem>
</varlistentry>

<varlistentry>
<term><option>--version</option></term>
<listitem>
<para>Show version of Callgrind.</para>
</listitem>
</varlistentry>

</variablelist>
</sect2>

<sect2 id="cl-manual.options.creation"
xreflabel="Dump creation options">
<title>Dump creation options</title>

<para>
These options influence the name and format of the profile data files.
</para>

<variablelist id="cl.opts.list.creation">

<varlistentry id="opt.callgrind-out-file" xreflabel="--callgrind-out-file">
<term>
<option><![CDATA[--callgrind-out-file=<file> ]]></option>
</term>
<listitem>
<para>Write the profile data to
<computeroutput>file</computeroutput> rather than to the default
output file,
<computeroutput>callgrind.out.&lt;pid&gt;</computeroutput>. The
<option>%p</option> and <option>%q</option> format specifiers
can be used to embed the process ID and/or the contents of an
environment variable in the name, as is the case for the core
option <option>--log-file</option>. See <link
linkend="manual-core.basicopts">here</link> for details.
When multiple dumps are made, the file name
is modified further; see <xref linkend="cl-manual.dumps"/>.</para>
</listitem>
</varlistentry>

<varlistentry id="opt.dump-instr" xreflabel="--dump-instr">
<term>
<option><![CDATA[--dump-instr=<no|yes> [default: no] ]]></option>
</term>
<listitem>
<para>This specifies that event counting should be performed at
per-instruction granularity.
This allows for assembly code
annotation. Currently the results can only be
displayed by KCachegrind.</para>
</listitem>
</varlistentry>

<varlistentry id="opt.dump-line" xreflabel="--dump-line">
<term>
<option><![CDATA[--dump-line=<no|yes> [default: yes] ]]></option>
</term>
<listitem>
<para>This specifies that event counting should be performed at
source line granularity. This allows source
annotation for sources which are compiled with debug information ("-g").</para>
</listitem>
</varlistentry>

<varlistentry id="opt.compress-strings" xreflabel="--compress-strings">
<term>
<option><![CDATA[--compress-strings=<no|yes> [default: yes] ]]></option>
</term>
<listitem>
<para>This option influences the output format of the profile data.
It specifies whether strings (file and function names) should be
identified by numbers. This shrinks the file,
but makes it more difficult
for humans to read (which is not recommended in any case).</para>
</listitem>
</varlistentry>

<varlistentry id="opt.compress-pos" xreflabel="--compress-pos">
<term>
<option><![CDATA[--compress-pos=<no|yes> [default: yes] ]]></option>
</term>
<listitem>
<para>This option influences the output format of the profile data.
It specifies whether numerical positions are always specified as absolute
values or are allowed to be relative to previous numbers.
Allowing relative positions shrinks the file size.</para>
</listitem>
</varlistentry>

<varlistentry id="opt.combine-dumps" xreflabel="--combine-dumps">
<term>
<option><![CDATA[--combine-dumps=<no|yes> [default: no] ]]></option>
</term>
<listitem>
<para>When multiple profile data parts are to be generated, these
parts are appended to the same output file if this option is set to
"yes". Not recommended.</para>
</listitem>
</varlistentry>

</variablelist>
</sect2>

<sect2 id="cl-manual.options.activity"
xreflabel="Activity options">
<title>Activity options</title>

<para>
These options specify when actions relating to event counts are to
be executed. For interactive control use
<computeroutput>callgrind_control</computeroutput>.
</para>

<variablelist id="cl.opts.list.activity">

<varlistentry id="opt.dump-every-bb" xreflabel="--dump-every-bb">
<term>
<option><![CDATA[--dump-every-bb=<count> [default: 0, never] ]]></option>
</term>
<listitem>
<para>Dump profile data every &lt;count&gt; basic blocks.
Whether a dump is needed is only checked when Valgrind's internal
scheduler is run. Therefore, the minimum setting useful is about 100000.
The count is a 64-bit value to make long dump periods possible.
</para>
</listitem>
</varlistentry>

<varlistentry id="opt.dump-before" xreflabel="--dump-before">
<term>
<option><![CDATA[--dump-before=<function> ]]></option>
</term>
<listitem>
<para>Dump when entering &lt;function&gt;.</para>
</listitem>
</varlistentry>

<varlistentry id="opt.zero-before" xreflabel="--zero-before">
<term>
<option><![CDATA[--zero-before=<function> ]]></option>
</term>
<listitem>
<para>Zero all costs when entering &lt;function&gt;.</para>
</listitem>
</varlistentry>

<varlistentry id="opt.dump-after" xreflabel="--dump-after">
<term>
<option><![CDATA[--dump-after=<function> ]]></option>
</term>
<listitem>
<para>Dump when leaving &lt;function&gt;.</para>
</listitem>
</varlistentry>

</variablelist>
</sect2>

<sect2 id="cl-manual.options.collection"
|
|
xreflabel="Data collection options">
|
|
<title>Data collection options</title>
|
|
|
|
<para>
|
|
These options specify when events are to be aggregated into event counts.
|
|
Also see <xref linkend="cl-manual.limits"/>.</para>
|
|
|
|
<variablelist id="cl.opts.list.collection">
|
|
|
|
<varlistentry id="opt.instr-atstart" xreflabel="--instr-atstart">
|
|
<term>
|
|
<option><![CDATA[--instr-atstart=<yes|no> [default: yes] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Specify if you want Callgrind to start simulation and
|
|
profiling from the beginning of the program.
|
|
When set to <computeroutput>no</computeroutput>,
|
|
Callgrind will not be able
|
|
to collect any information, including calls, but it will have at
|
|
most a slowdown of around 4, which is the minimum Valgrind
|
|
overhead. Instrumentation can be interactively switched on via
|
|
<computeroutput>callgrind_control -i on</computeroutput>.</para>
|
|
<para>Note that the resulting call graph will most probably not
contain <computeroutput>main</computeroutput>, but will contain all the
functions executed after instrumentation was switched on.
Instrumentation can also be switched on and off programmatically. See the
Callgrind include file
<computeroutput>&lt;callgrind.h&gt;</computeroutput> for the macros
you have to use in your source code.</para>
<para>For cache
simulation, results will be less accurate when switching on
instrumentation later in the program run, as the simulator starts
with an empty cache at that moment. To cope with this error, switch on
event collection later, once the simulated cache has had time to
warm up.</para>
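<para>As a minimal sketch of programmatic control, the program below
switches instrumentation on only around the code of interest. It is meant
to be run with <computeroutput>--instr-atstart=no</computeroutput>. The
function names <computeroutput>setup</computeroutput> and
<computeroutput>hot_loop</computeroutput> are hypothetical, and the
fallback definitions that let the file build without the Valgrind headers
are an assumption of this example, not part of Callgrind.</para>
<programlisting><![CDATA[
#include <stdio.h>

/* Assumption: use the real macros if the Valgrind headers are installed,
   otherwise fall back to no-ops so the program still builds. */
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_START_INSTRUMENTATION
#  define CALLGRIND_START_INSTRUMENTATION ((void)0)
#  define CALLGRIND_STOP_INSTRUMENTATION  ((void)0)
#endif

/* Hypothetical start-up work we do not want to profile. */
long setup(long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Hypothetical code of interest. */
long hot_loop(long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += i * i;
    return sum;
}

int main(void)
{
    /* With --instr-atstart=no this part runs without instrumentation. */
    long a = setup(1000);

    CALLGRIND_START_INSTRUMENTATION;   /* profiling starts here */
    long b = hot_loop(1000);
    CALLGRIND_STOP_INSTRUMENTATION;    /* back to minimal slowdown */

    printf("%ld %ld\n", a, b);
    return 0;
}
]]></programlisting>
<para>Outside of Valgrind the requests do nothing, so the same binary can
also be run natively.</para>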
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="opt.collect-atstart" xreflabel="--collect-atstart">
|
|
<term>
|
|
<option><![CDATA[--collect-atstart=<yes|no> [default: yes] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Specify whether event collection is switched on at beginning
|
|
of the profile run.</para>
|
|
<para>To only look at parts of your program, you have two
|
|
possibilities:</para>
|
|
<orderedlist>
|
|
<listitem>
|
|
<para>Zero event counters before entering the program part you
|
|
want to profile, and dump the event counters to a file after
|
|
leaving that program part.</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>Switch on/off collection state as needed to only see
|
|
event counters happening while inside of the program part you
|
|
want to profile.</para>
|
|
</listitem>
|
|
</orderedlist>
|
|
<para>The second option can be used if the program part you want to
profile is called many times. Option 1, i.e. creating a lot of
dumps, is not practical in that case.</para>
|
|
<para>Collection state can be
toggled at entry and exit of a given function with the
option <xref linkend="opt.toggle-collect"/>. If you use this option,
collection
state should be switched off at the beginning. Note that the
specification of <computeroutput>--toggle-collect</computeroutput>
implicitly sets
<computeroutput>--collect-atstart=no</computeroutput>.</para>
|
|
<para>Collection state can also be toggled by inserting the client request
<computeroutput><xref linkend="cr.toggle-collect"/>;</computeroutput>
at the needed code positions.</para>
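<para>The following sketch shows the client request variant for a function
that is called many times. It assumes the run is started with
<computeroutput>--collect-atstart=no</computeroutput>, so the first
toggle switches collection on and the second switches it off again. The
function <computeroutput>checksum</computeroutput> is hypothetical, and
the no-op fallback for builds without the Valgrind headers is an
assumption of this example.</para>
<programlisting><![CDATA[
#include <stddef.h>
#include <stdio.h>

/* Assumption: use the real macro if the Valgrind headers are installed,
   otherwise fall back to a no-op so the program still builds. */
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_TOGGLE_COLLECT
#  define CALLGRIND_TOGGLE_COLLECT ((void)0)
#endif

/* Hypothetical function whose cost we want to see in isolation. */
unsigned long checksum(const unsigned char *buf, size_t len)
{
    unsigned long sum = 0;

    CALLGRIND_TOGGLE_COLLECT;          /* off -> on */
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    CALLGRIND_TOGGLE_COLLECT;          /* on -> off */

    return sum;
}

int main(void)
{
    const unsigned char data[] = "abc";
    unsigned long total = 0;

    /* Many calls, but only one profile dump at program exit. */
    for (int round = 0; round < 1000; round++)
        total += checksum(data, 3);

    printf("%lu\n", total);
    return 0;
}
]]></programlisting>
<para>Because the request toggles rather than sets the state, the calls
must stay balanced; an early return between the two requests would leave
collection switched on.</para>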
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="opt.toggle-collect" xreflabel="--toggle-collect">
|
|
<term>
|
|
<option><![CDATA[--toggle-collect=<function> ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Toggle collection on entry/exit of &lt;function&gt;.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="opt.collect-jumps" xreflabel="--collect-jumps">
|
|
<term>
|
|
<option><![CDATA[--collect-jumps=<no|yes> [default: no] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>This specifies whether information for (conditional) jumps
|
|
should be collected. As above, callgrind_annotate currently is not
|
|
able to show you the data. You have to use KCachegrind to get jump
|
|
arrows in the annotated code.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
</sect2>
|
|
|
|
<sect2 id="cl-manual.options.separation"
|
|
xreflabel="Cost entity separation options">
|
|
<title>Cost entity separation options</title>
|
|
|
|
<para>
|
|
These options specify how event counts should be attributed to execution
|
|
contexts.
|
|
For example, they specify whether the recursion level or the
|
|
call chain leading to a function should be taken into account,
|
|
and whether the thread ID should be considered.
|
|
Also see <xref linkend="cl-manual.cycles"/>.</para>
|
|
|
|
<variablelist id="cmd-options.separation">
|
|
|
|
<varlistentry id="opt.separate-threads" xreflabel="--separate-threads">
|
|
<term>
|
|
<option><![CDATA[--separate-threads=<no|yes> [default: no] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>This option specifies whether profile data should be generated
|
|
separately for every thread. If yes, the file names get "-threadID"
|
|
appended.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="opt.separate-recs" xreflabel="--separate-recs">
|
|
<term>
|
|
<option><![CDATA[--separate-recs=<level> [default: 2] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Separate function recursions by at most &lt;level&gt; levels.
|
|
See <xref linkend="cl-manual.cycles"/>.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="opt.separate-callers" xreflabel="--separate-callers">
|
|
<term>
|
|
<option><![CDATA[--separate-callers=<callers> [default: 0] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Separate contexts by at most &lt;callers&gt; functions in the
|
|
call chain. See <xref linkend="cl-manual.cycles"/>.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="opt.skip-plt" xreflabel="--skip-plt">
|
|
<term>
|
|
<option><![CDATA[--skip-plt=<no|yes> [default: yes] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Ignore calls to/from PLT sections.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="opt.fn-skip" xreflabel="--fn-skip">
|
|
<term>
|
|
<option><![CDATA[--fn-skip=<function> ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Ignore calls to/from a given function. E.g. if you have a
call chain A &gt; B &gt; C, and you specify function B to be
ignored, you will only see A &gt; C.</para>
|
|
<para>This is very convenient for skipping functions handling callback
behaviour. For example, with the signal/slot mechanism in the
Qt graphics library, you only want
to see the function emitting a signal to call the slots connected
to that signal. First, determine the real call chain to see the
functions that need to be skipped, then use this option.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="opt.fn-group">
|
|
<term>
|
|
<option><![CDATA[--fn-group<number>=<function> ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Put a function into a separate group. This influences the
|
|
context name for cycle avoidance. All functions inside such a
|
|
group are treated as being the same for context name building, which
|
|
resembles the call chain leading to a context. By specifying function
|
|
groups with this option, you can shorten the context name, as functions
|
|
in the same group will not appear in sequence in the name. </para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="opt.separate-recs-num" xreflabel="--separate-recs10">
|
|
<term>
|
|
<option><![CDATA[--separate-recs<number>=<function> ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Separate &lt;number&gt; recursions for &lt;function&gt;.
|
|
See <xref linkend="cl-manual.cycles"/>.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="opt.separate-callers-num" xreflabel="--separate-callers2">
|
|
<term>
|
|
<option><![CDATA[--separate-callers<number>=<function> ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Separate &lt;number&gt; callers for &lt;function&gt;.
|
|
See <xref linkend="cl-manual.cycles"/>.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
</sect2>
|
|
|
|
<sect2 id="cl-manual.options.simulation"
|
|
xreflabel="Cache simulation options">
|
|
<title>Cache simulation options</title>
|
|
|
|
<variablelist id="cl.opts.list.simulation">
|
|
|
|
<varlistentry id="opt.simulate-cache" xreflabel="--simulate-cache">
|
|
<term>
|
|
<option><![CDATA[--simulate-cache=<yes|no> [default: no] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Specify if you want to do full cache simulation. By default,
|
|
only instruction read accesses will be profiled.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="opt.simulate-hwpref" xreflabel="--simulate-hwpref">
|
|
<term>
|
|
<option><![CDATA[--simulate-hwpref=<yes|no> [default: no] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Specify whether simulation of a hardware prefetcher should be
added which is able to detect stream access in the second level cache
by comparing accesses separately for each page.
As the simulation cannot decide about any timing issues of prefetching,
it is assumed that any hardware prefetch triggered succeeds before a
real access is done. Thus, this gives a best-case scenario by covering
all possible stream accesses.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="cl-manual.clientrequests" xreflabel="Client request reference">
|
|
<title>Callgrind specific client requests</title>
|
|
|
|
<para>In Valgrind terminology, a client request is a C macro which
can be inserted into your code to request specific functionality when
run under Valgrind. For this, special instruction patterns are used
which act as NOPs when run natively, but which Valgrind can detect.</para>
|
|
|
|
<para>Callgrind provides the following specific client requests.
To use them, add the line
<screen><![CDATA[#include <valgrind/callgrind.h>]]></screen>
into your code for the macro definitions.</para>
|
|
|
|
<variablelist id="cl.clientrequests.list">
|
|
|
|
<varlistentry id="cr.dump-stats" xreflabel="CALLGRIND_DUMP_STATS">
|
|
<term>
|
|
<computeroutput>CALLGRIND_DUMP_STATS</computeroutput>
|
|
</term>
|
|
<listitem>
|
|
<para>Force generation of a profile dump at specified position
|
|
in code, for the current thread only. Written counters will be reset
|
|
to zero.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="cr.dump-stats-at" xreflabel="CALLGRIND_DUMP_STATS_AT">
|
|
<term>
|
|
<computeroutput>CALLGRIND_DUMP_STATS_AT(string)</computeroutput>
|
|
</term>
|
|
<listitem>
|
|
<para>Same as <computeroutput>CALLGRIND_DUMP_STATS</computeroutput>,
but lets you specify a string
to be able to distinguish profile dumps.</para>
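<para>A minimal sketch of labelled dumps, one per program phase, might
look as follows. The phase functions
<computeroutput>parse_phase</computeroutput> and
<computeroutput>compute_phase</computeroutput> and the label strings are
hypothetical, and the no-op fallback for builds without the Valgrind
headers is an assumption of this example.</para>
<programlisting><![CDATA[
#include <stdio.h>

/* Assumption: use the real macros if the Valgrind headers are installed,
   otherwise fall back to no-ops so the program still builds. */
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_DUMP_STATS_AT
#  define CALLGRIND_ZERO_STATS             ((void)0)
#  define CALLGRIND_DUMP_STATS_AT(pos_str) ((void)(pos_str))
#endif

/* Hypothetical first phase of the program. */
long parse_phase(long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Hypothetical second phase of the program. */
long compute_phase(long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += i * i;
    return sum;
}

int main(void)
{
    CALLGRIND_ZERO_STATS;                      /* discard start-up cost */

    long p = parse_phase(100);
    CALLGRIND_DUMP_STATS_AT("after parse");    /* dump, then counters reset */

    long c = compute_phase(100);
    CALLGRIND_DUMP_STATS_AT("after compute");  /* second, separate dump */

    printf("%ld %ld\n", p, c);
    return 0;
}
]]></programlisting>
<para>Since each dump resets the written counters, the second dump only
covers the cost incurred after the first one.</para>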
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="cr.zero-stats" xreflabel="CALLGRIND_ZERO_STATS">
|
|
<term>
|
|
<computeroutput>CALLGRIND_ZERO_STATS</computeroutput>
|
|
</term>
|
|
<listitem>
|
|
<para>Reset the profile counters for the current thread to zero.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="cr.toggle-collect" xreflabel="CALLGRIND_TOGGLE_COLLECT">
|
|
<term>
|
|
<computeroutput>CALLGRIND_TOGGLE_COLLECT</computeroutput>
|
|
</term>
|
|
<listitem>
|
|
<para>Toggle the collection state. This allows events to be ignored
with regard to profile counters. See also options
<xref linkend="opt.collect-atstart"/> and
<xref linkend="opt.toggle-collect"/>.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="cr.start-instr" xreflabel="CALLGRIND_START_INSTRUMENTATION">
|
|
<term>
|
|
<computeroutput>CALLGRIND_START_INSTRUMENTATION</computeroutput>
|
|
</term>
|
|
<listitem>
|
|
<para>Start full Callgrind instrumentation if not already switched on.
When cache simulation is done, this will flush the simulated cache
and lead to an artificial cache warmup phase afterwards with
cache misses which would not have happened in reality.
See also option <xref linkend="opt.instr-atstart"/>.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="cr.stop-instr" xreflabel="CALLGRIND_STOP_INSTRUMENTATION">
|
|
<term>
|
|
<computeroutput>CALLGRIND_STOP_INSTRUMENTATION</computeroutput>
|
|
</term>
|
|
<listitem>
|
|
<para>Stop full Callgrind instrumentation if not already switched off.
This flushes Valgrind's translation cache, and does no additional
instrumentation afterwards: it effectively will run at the same
speed as the "none" tool, i.e. at minimal slowdown. Use this to
speed up the Callgrind run for uninteresting code parts. Use
<xref linkend="cr.start-instr"/> to switch on instrumentation again.
See also option <xref linkend="opt.instr-atstart"/>.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
</sect1>
|
|
|
|
</chapter>
|