mirror of
https://github.com/Zenithsiz/ftmemsim-valgrind.git
synced 2026-02-03 18:13:01 +00:00
1512 lines
51 KiB
XML
<?xml version="1.0"?> <!-- -*- sgml -*- -->
|
|
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
|
|
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"
|
|
[ <!ENTITY % vg-entities SYSTEM "../../docs/xml/vg-entities.xml"> %vg-entities; ]>
|
|
|
|
<!-- Referenced from both the manual and manpage -->
|
|
<chapter id="&vg-cg-manual-id;" xreflabel="&vg-cg-manual-label;">
|
|
<title>Cachegrind: a high-precision tracing profiler</title>
|
|
|
|
<para>
|
|
To use this tool, specify <option>--tool=cachegrind</option> on the Valgrind
|
|
command line.
|
|
</para>
|
|
|
|
<sect1 id="cg-manual.overview" xreflabel="Overview">
|
|
<title>Overview</title>
|
|
|
|
<para>
|
|
Cachegrind is a high-precision tracing profiler. It runs slowly, but collects
|
|
precise and reproducible profiling data. It can merge and diff data from
|
|
different runs. To expand on these characteristics:
|
|
</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
<emphasis>Precise.</emphasis> Cachegrind measures the exact number of
|
|
instructions executed by your program, not an approximation. Furthermore,
|
|
it presents the gathered data at the file, function, and line level. This
|
|
is different to many other profilers that measure approximate execution
|
|
time, using sampling, and only at the function level.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<emphasis>Reproducible.</emphasis> In general, execution time is a better
|
|
metric than instruction counts because it's what users perceive. However,
|
|
execution time often has high variability. When running the exact same
|
|
program on the exact same input multiple times, execution time might vary
|
|
by several percent. Furthermore, small changes in a program can change its
|
|
memory layout and have even larger effects on runtime. In contrast,
|
|
instruction counts are highly reproducible; for some programs they are
|
|
perfectly reproducible. This means the effects of small changes in a
|
|
program can be measured with high precision.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>
|
|
For these reasons, Cachegrind is an excellent complement to time-based profilers.
|
|
</para>
|
|
|
|
<para>
|
|
Cachegrind can annotate programs written in any language, so long as debug info
|
|
is present to map machine code back to the original source code. Cachegrind has
|
|
been used successfully on programs written in C, C++, Rust, and assembly.
|
|
</para>
|
|
|
|
<para>
|
|
Cachegrind can also simulate how your program interacts with a machine's cache
|
|
hierarchy and branch predictor. This simulation was the original motivation for
|
|
the tool, hence its name. However, the simulations are basic and unlikely to
|
|
reflect the behaviour of a modern machine. For this reason they are off by
|
|
default. If you really want cache and branch information, a profiler like
|
|
<computeroutput>perf</computeroutput> that accesses hardware counters is a
|
|
better choice.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="cg-manual.profile"
|
|
xreflabel="Using Cachegrind and cg_annotate">
|
|
<title>Using Cachegrind and cg_annotate</title>
|
|
|
|
<para>
|
|
First, as for normal Valgrind use, you should compile with debugging info (the
|
|
<option>-g</option> option in most compilers). But by contrast with normal
|
|
Valgrind use, you probably do want to turn optimisation on, since you should
|
|
profile your program as it will be normally run.
|
|
</para>
|
|
|
|
<para>
|
|
Second, run Cachegrind itself to gather the profiling data.
|
|
</para>
|
|
|
|
<para>
|
|
Third, run cg_annotate to get a detailed presentation of that data. cg_annotate
|
|
can combine the results of multiple Cachegrind output files. It can also
|
|
perform a diff between two Cachegrind output files.
|
|
</para>
|
|
|
|
|
|
<sect2 id="cg-manual.running-cachegrind" xreflabel="Running Cachegrind">
|
|
<title>Running Cachegrind</title>
|
|
|
|
<para>
|
|
To run Cachegrind on a program <filename>prog</filename>, run:
|
|
<screen><![CDATA[
|
|
valgrind --tool=cachegrind prog
|
|
]]></screen>
|
|
</para>
|
|
|
|
<para>
|
|
The program will execute (slowly). Upon completion, summary statistics that
|
|
look like this will be printed:
|
|
</para>
|
|
|
|
<programlisting><![CDATA[
|
|
==17942== I refs: 8,195,070
|
|
]]></programlisting>
|
|
|
|
<para>
|
|
The <computeroutput>I refs</computeroutput> number is short for "Instruction
|
|
cache references", which is equivalent to "instructions executed". If you
|
|
enable the cache and/or branch simulation, additional counts will be shown.
|
|
</para>
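<para>
When scripting around Cachegrind, the instruction count can be recovered from
the summary line. The following Python sketch is illustrative only: the name
<function>parse_i_refs</function> and the regular expression are assumptions
based on the sample line above, and are not part of Valgrind.
</para>

<programlisting><![CDATA[
```python
import re

# Matches a summary line such as "==17942== I refs: 8,195,070".
# The pattern is an assumption based on the sample output above.
_I_REFS = re.compile(r"^==(\d+)==\s+I\s+refs:\s+([\d,]+)\s*$")


def parse_i_refs(line):
    """Return (pid, instruction_count) for an "I refs" line, else None."""
    match = _I_REFS.match(line)
    if match is None:
        return None
    pid = int(match.group(1))
    count = int(match.group(2).replace(",", ""))
    return pid, count


print(parse_i_refs("==17942== I refs: 8,195,070"))
```
]]></programlisting>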
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="cg-manual.outputfile" xreflabel="Output File">
|
|
<title>Output File</title>
|
|
|
|
<para>
|
|
Cachegrind also writes more detailed profiling data to a file. By default this
|
|
Cachegrind output file is named <filename>cachegrind.out.&lt;pid&gt;</filename>
(where <filename>&lt;pid&gt;</filename> is the program's process ID), but its
|
|
name can be changed with the <option>--cachegrind-out-file</option> option.
|
|
This file is human-readable, but is intended to be interpreted by the
|
|
accompanying program cg_annotate, described in the next section.
|
|
</para>
|
|
|
|
<para>
|
|
The default <computeroutput>.&lt;pid&gt;</computeroutput> suffix on the output
|
|
file name serves two purposes. First, it means existing Cachegrind output files
|
|
aren't immediately overwritten. Second, and more importantly, it allows correct
|
|
profiling with the <option>--trace-children=yes</option> option of programs
|
|
that spawn child processes.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="cg-manual.running-cg_annotate" xreflabel="Running cg_annotate">
|
|
<title>Running cg_annotate</title>
|
|
|
|
<para>
|
|
Before using cg_annotate, it is worth widening your window to be at least 120
|
|
characters wide if possible, because the output lines can be quite long.
|
|
</para>
|
|
|
|
<para>
|
|
Then run:
<screen>cg_annotate &lt;filename&gt;</screen>
|
|
on a Cachegrind output file.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
<!--
|
|
To produce the sample data, I did the following. Note that the single hyphens in
|
|
the valgrind command should be double hyphens, but XML doesn't allow double
|
|
hyphens in comments.
|
|
|
|
gcc -g -O concord.c -o concord
|
|
valgrind -tool=cachegrind -cachegrind-out-file=concord.cgout ./concord ../cg_main.c
|
|
(to exit, type `q` and hit enter)
|
|
python ../cg_annotate concord.cgout > concord.cgann
|
|
|
|
concord.c is a small C program I wrote at university. It's a good size for an example.
|
|
-->
|
|
|
|
<sect2 id="cg-manual.the-metadata" xreflabel="The Metadata Section">
|
|
<title>The Metadata Section</title>
|
|
|
|
<para>
|
|
The first part of the output looks like this:
|
|
</para>
|
|
|
|
<programlisting><![CDATA[
|
|
--------------------------------------------------------------------------------
|
|
-- Metadata
|
|
--------------------------------------------------------------------------------
|
|
Invocation: ../cg_annotate concord.cgout
|
|
Command: ./concord ../cg_main.c
|
|
Events recorded: Ir
|
|
Events shown: Ir
|
|
Event sort order: Ir
|
|
Threshold: 0.1%
|
|
Annotation: on
|
|
]]></programlisting>
|
|
|
|
<para>
|
|
It summarizes how Cachegrind and the profiled program were run.
|
|
</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Invocation: the command line used to produce this output.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Command: the command line used to run the profiled program.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Events recorded: which events were recorded. By default, this is
|
|
<computeroutput>Ir</computeroutput>. More events will be recorded if cache
|
|
and/or branch simulation is enabled.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Events shown: the events shown, which is a subset of the events gathered.
|
|
This can be adjusted with the <option>--show</option> option.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Event sort order: the sort order used for the subsequent sections. For
|
|
example, in this case those sections are sorted from highest
|
|
<computeroutput>Ir</computeroutput> counts to lowest. If there are multiple
|
|
events, one will be the primary sort event, and then there can be a
|
|
secondary sort event, tertiary sort event, etc., though more than one is
|
|
rarely needed. This order can be adjusted with the <option>--sort</option>
|
|
option. Note that this does <emphasis>not</emphasis> specify the order in
|
|
which the columns appear. That is specified by the "events shown" line (and
|
|
can be changed with the <option>--show</option> option).
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Threshold: cg_annotate by default omits files and functions with very low
|
|
counts to keep the output size reasonable. By default cg_annotate only
|
|
shows files and functions that account for at least 0.1% of the primary
|
|
sort event. The threshold can be adjusted with the
|
|
<option>--threshold</option> option.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Annotation: whether source file annotation is enabled. Controlled with the
|
|
<option>--annotate</option> option.
|
|
</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
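<para>
The interaction between the sort order and the threshold can be pictured as a
sort followed by a filter. The Python sketch below is a simplified model, not
cg_annotate's actual implementation; the function name
<function>select_entries</function> and the counts are hypothetical.
</para>

<programlisting><![CDATA[
```python
def select_entries(entries, sort_events, threshold_pct):
    """Sort entries by the given events (highest first) and drop small ones.

    entries: dict mapping a name to a dict of event counts.
    sort_events: event names, primary sort event first.
    threshold_pct: minimum share of the primary event total, in percent.
    """
    primary = sort_events[0]
    total = sum(counts[primary] for counts in entries.values())
    ordered = sorted(
        entries.items(),
        key=lambda item: tuple(item[1][event] for event in sort_events),
        reverse=True,
    )
    return [
        name
        for name, counts in ordered
        if total and 100.0 * counts[primary] / total >= threshold_pct
    ]


# Hypothetical counts, for illustration only.
example = {
    "a.c:f": {"Ir": 9000, "Dr": 100},
    "b.c:g": {"Ir": 995, "Dr": 700},
    "c.c:h": {"Ir": 5, "Dr": 1},
}
print(select_entries(example, ["Ir", "Dr"], 0.1))
```
]]></programlisting>

<para>
With a 0.1% threshold on <computeroutput>Ir</computeroutput>, the third entry
(0.05% of the total) is omitted, while the first two are kept in descending
order.
</para>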
|
|
|
|
<para>
|
|
If cache simulation is enabled, details of the cache parameters will be shown
|
|
above the "Invocation" line.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="cg-manual.the-global"
|
|
xreflabel="Global, File, and Function-level Counts">
|
|
<title>Global, File, and Function-level Counts</title>
|
|
|
|
<para>
|
|
Next comes the summary for the whole program:
|
|
</para>
|
|
|
|
<programlisting><![CDATA[
|
|
--------------------------------------------------------------------------------
|
|
-- Summary
|
|
--------------------------------------------------------------------------------
|
|
Ir________________
|
|
|
|
8,195,070 (100.0%) PROGRAM TOTALS
|
|
]]></programlisting>
|
|
|
|
<para>
|
|
The <computeroutput>Ir</computeroutput> column label is suffixed with
|
|
underscores to show the bounds of the columns underneath.
|
|
</para>
|
|
|
|
<para>
|
|
Then comes file:function counts. Here is the first part of that section:
|
|
</para>
|
|
|
|
<programlisting><![CDATA[
|
|
--------------------------------------------------------------------------------
|
|
-- File:function summary
|
|
--------------------------------------------------------------------------------
|
|
Ir______________________ file:function
|
|
|
|
< 3,078,746 (37.6%, 37.6%) /home/njn/grind/ws1/cachegrind/concord.c:
|
|
1,630,232 (19.9%) get_word
|
|
630,918 (7.7%) hash
|
|
461,095 (5.6%) insert
|
|
130,560 (1.6%) add_existing
|
|
91,014 (1.1%) init_hash_table
|
|
88,056 (1.1%) create
|
|
46,676 (0.6%) new_word_node
|
|
|
|
< 1,746,038 (21.3%, 58.9%) ./malloc/./malloc/malloc.c:
|
|
1,285,938 (15.7%) _int_malloc
|
|
458,225 (5.6%) malloc
|
|
|
|
< 1,107,550 (13.5%, 72.4%) ./libio/./libio/getc.c:getc
|
|
|
|
< 551,071 (6.7%, 79.1%) ./string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
|
|
|
|
< 521,228 (6.4%, 85.5%) ./ctype/../include/ctype.h:
|
|
260,616 (3.2%) __ctype_tolower_loc
|
|
260,612 (3.2%) __ctype_b_loc
|
|
|
|
< 468,163 (5.7%, 91.2%) ???:
|
|
468,151 (5.7%) ???
|
|
|
|
< 456,071 (5.6%, 96.8%) /usr/include/ctype.h:get_word
|
|
|
|
]]></programlisting>
|
|
|
|
<para>
|
|
Each entry covers one file, and one or more functions within that file. If
there is only one significant function within a file, as in the third entry,
the file and function are shown on the same line, separated by a colon. If there
are multiple significant functions within a file, as in the first entry, each
function gets its own line.
|
|
</para>
|
|
|
|
<para>
|
|
This example involves a small C program, and shows a combination of code from
|
|
the program itself (including functions like <function>get_word</function> and
|
|
<function>hash</function> in the file <filename>concord.c</filename>) as well
|
|
as code from system libraries, such as functions like
|
|
<function>malloc</function> and <function>getc</function>.
|
|
</para>
|
|
|
|
<para>
|
|
Each entry is preceded with a <computeroutput>&lt;</computeroutput>, which can
|
|
be useful when navigating through the output in an editor, or grepping through
|
|
results.
|
|
</para>
|
|
|
|
<para>
|
|
The first percentage in each column indicates the proportion of the total event
count that is covered by this line. The second percentage, which only shows on the
|
|
first line of each entry, shows the cumulative percentage of all the entries up
|
|
to and including this one. The entries shown here account for 96.8% of the
|
|
instructions executed by the program.
|
|
</para>
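<para>
As a check on how the two percentages relate, the following Python sketch
recomputes them for the first three entries of the example, using the program
total of 8,195,070. The helper name <function>with_percentages</function> is
hypothetical, and rounding to one decimal place is an assumption that matches
the figures shown.
</para>

<programlisting><![CDATA[
```python
def with_percentages(counts, total):
    """Return (count, percent, cumulative_percent) rows in display order."""
    running = 0
    rows = []
    for count in counts:
        running += count
        rows.append(
            (
                count,
                round(100.0 * count / total, 1),
                round(100.0 * running / total, 1),
            )
        )
    return rows


# First three file-level Ir counts and the program total from the example.
rows = with_percentages([3078746, 1746038, 1107550], 8195070)
for count, pct, cum in rows:
    print(f"{count:>10,} ({pct}%, {cum}%)")
```
]]></programlisting>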
|
|
|
|
<para>
|
|
The name <computeroutput>???</computeroutput> is used if the file name and/or
|
|
function name could not be determined from debugging information. If
|
|
<filename>???</filename> filenames dominate, the program probably wasn't
|
|
compiled with <option>-g</option>. If <function>???</function> function names
|
|
dominate, the program may have had symbols stripped.
|
|
</para>
|
|
|
|
<para>
|
|
After that comes function:file counts. Here is the first part of that section:
|
|
</para>
|
|
|
|
<programlisting><![CDATA[
|
|
--------------------------------------------------------------------------------
|
|
-- Function:file summary
|
|
--------------------------------------------------------------------------------
|
|
Ir______________________ function:file
|
|
|
|
> 2,086,303 (25.5%, 25.5%) get_word:
|
|
1,630,232 (19.9%) /home/njn/grind/ws1/cachegrind/concord.c
|
|
456,071 (5.6%) /usr/include/ctype.h
|
|
|
|
> 1,285,938 (15.7%, 41.1%) _int_malloc:./malloc/./malloc/malloc.c
|
|
|
|
> 1,107,550 (13.5%, 54.7%) getc:./libio/./libio/getc.c
|
|
|
|
> 630,918 (7.7%, 62.4%) hash:/home/njn/grind/ws1/cachegrind/concord.c
|
|
|
|
> 551,071 (6.7%, 69.1%) __strcmp_avx2:./string/../sysdeps/x86_64/multiarch/strcmp-avx2.S
|
|
|
|
> 480,248 (5.9%, 74.9%) malloc:
|
|
458,225 (5.6%) ./malloc/./malloc/malloc.c
|
|
22,023 (0.3%) ./malloc/./malloc/arena.c
|
|
|
|
> 468,151 (5.7%, 80.7%) ???:???
|
|
|
|
> 461,095 (5.6%, 86.3%) insert:/home/njn/grind/ws1/cachegrind/concord.c
|
|
]]></programlisting>
|
|
|
|
<para>
|
|
This is similar to the previous section, but is grouped by functions first and
|
|
files second. Also, the entry markers are <computeroutput>&gt;</computeroutput>
instead of <computeroutput>&lt;</computeroutput>.
|
|
</para>
|
|
|
|
<para>
|
|
You might wonder why this section is needed, and how it differs from the
|
|
previous section. The answer is inlining. In this example there are two entries
|
|
demonstrating a function whose code is effectively spread across more than one
|
|
file: <function>get_word</function> and <function>malloc</function>. Here is an
|
|
example from profiling the Rust compiler, a much larger program that uses
|
|
inlining more:
|
|
</para>
|
|
|
|
<programlisting><![CDATA[
|
|
> 30,469,230 (1.3%, 11.1%) <rustc_middle::ty::context::CtxtInterners>::intern_ty:
|
|
10,269,220 (0.5%) /home/njn/.cargo/registry/src/github.com-1ecc6299db9ec823/hashbrown-0.12.3/src/raw/mod.rs
|
|
7,696,827 (0.3%) /home/njn/dev/rust0/compiler/rustc_middle/src/ty/context.rs
|
|
3,858,099 (0.2%) /home/njn/dev/rust0/library/core/src/cell.rs
|
|
]]></programlisting>
|
|
|
|
<para>
|
|
In this case the compiled function <function>intern_ty</function> includes code
|
|
from three different source files, due to inlining. These should be examined
|
|
together. Older versions of cg_annotate presented this entry as three separate
|
|
file:function entries, which would typically be intermixed with all the other
|
|
entries, making it hard to see that they are all really part of the same
|
|
function.
|
|
</para>
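<para>
The regrouping from file:function to function:file can be sketched in a few
lines of Python. This is only a model of the idea, not cg_annotate's code; the
name <function>group_by_function</function> is hypothetical, and the counts are
copied from the concord example above.
</para>

<programlisting><![CDATA[
```python
from collections import defaultdict


def group_by_function(file_function_counts):
    """Regroup {(file, function): count} into {function: {file: count}}."""
    grouped = defaultdict(dict)
    for (file_name, function), count in file_function_counts.items():
        grouped[function][file_name] = count
    return dict(grouped)


# Ir counts copied from the concord example.
counts = {
    ("/home/njn/grind/ws1/cachegrind/concord.c", "get_word"): 1630232,
    ("/usr/include/ctype.h", "get_word"): 456071,
    ("./malloc/./malloc/malloc.c", "malloc"): 458225,
    ("./malloc/./malloc/arena.c", "malloc"): 22023,
}
for function, per_file in group_by_function(counts).items():
    print(function, sum(per_file.values()))
```
]]></programlisting>

<para>
The per-function sums reproduce the totals in the function:file listing:
2,086,303 for <function>get_word</function> and 480,248 for
<function>malloc</function>.
</para>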
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="cg-manual.line-by-line" xreflabel="Per-line Counts">
|
|
<title>Per-line Counts</title>
|
|
|
|
<para>
|
|
By default, a source file is annotated if it contains at least one function
|
|
that meets the significance threshold. This can be disabled with the
|
|
<option>--annotate</option> option.
|
|
</para>
|
|
|
|
<para>
|
|
To continue the previous example, here is part of the annotation of the file
|
|
<filename>concord.c</filename>:
|
|
</para>
|
|
|
|
<programlisting><![CDATA[
|
|
--------------------------------------------------------------------------------
|
|
-- Annotated source file: /home/njn/grind/ws1/cachegrind/docs/concord.c
|
|
--------------------------------------------------------------------------------
|
|
Ir____________
|
|
|
|
. /* Function builds the hash table from the given file. */
|
|
. void init_hash_table(char *file_name, Word_Node *table[])
|
|
8 (0.0%) {
|
|
. FILE *file_ptr;
|
|
. Word_Info *data;
|
|
2 (0.0%) int line = 1, i;
|
|
.
|
|
. /* Structure used when reading in words and line numbers. */
|
|
3 (0.0%) data = (Word_Info *) create(sizeof(Word_Info));
|
|
.
|
|
. /* Initialise entire table to NULL. */
|
|
2,993 (0.0%) for (i = 0; i < TABLE_SIZE; i++)
|
|
997 (0.0%) table[i] = NULL;
|
|
.
|
|
. /* Open file, check it. */
|
|
4 (0.0%) file_ptr = fopen(file_name, "r");
|
|
2 (0.0%) if (!(file_ptr)) {
|
|
. fprintf(stderr, "Couldn't open '%s'.\n", file_name);
|
|
. exit(EXIT_FAILURE);
|
|
. }
|
|
.
|
|
. /* 'Get' the words and lines one at a time from the file, and insert them
|
|
. ** into the table one at a time. */
|
|
55,363 (0.7%) while ((line = get_word(data, line, file_ptr)) != EOF)
|
|
31,632 (0.4%) insert(data->word, data->line, table);
|
|
.
|
|
2 (0.0%) free(data);
|
|
2 (0.0%) fclose(file_ptr);
|
|
6 (0.0%) }
|
|
]]></programlisting>
|
|
|
|
<para>
|
|
Each executed line is annotated with its event counts. Other lines are
|
|
annotated with a dot. This may be because they contain no executable code, or
|
|
they contain executable code but were never executed.
|
|
</para>
|
|
|
|
<para>
|
|
You can easily tell if a function is inlined from this output. If it is not
|
|
inlined, it will have event counts on the lines containing the opening and
|
|
closing braces. If it is inlined, it will not have event counts on those lines.
|
|
In the example above, <function>init_hash_table</function> does have counts,
|
|
so you can tell it is not inlined.
|
|
</para>
|
|
|
|
<para>
|
|
Note again that inlining can lead to surprising results. If a function
|
|
<function>f</function> is always inlined, in the file:function and
|
|
function:file sections counts will be attributed to the functions it is inlined
|
|
into, rather than itself. However, if you look at the line-by-line annotations
|
|
for <function>f</function> you'll see the counts that belong to
|
|
<function>f</function>. So it's worth looking for large counts/percentages in the
|
|
line-by-line annotations.
|
|
</para>
|
|
|
|
<para>
|
|
Sometimes only a small section of a source file is executed. To minimise
|
|
uninteresting output, cg_annotate only shows annotated lines and lines within a
|
|
small distance of annotated lines. Gaps are marked with line numbers, for
|
|
example:
|
|
</para>
|
|
|
|
<programlisting><![CDATA[
|
|
(counts and code for line 704)
|
|
-- line 375 ----------------------------------------
|
|
-- line 514 ----------------------------------------
|
|
(counts and code for line 878)
|
|
]]></programlisting>
|
|
|
|
<para>
|
|
The number of lines of context shown around annotated lines is controlled by
|
|
the <option>--context</option> option.
|
|
</para>
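<para>
One way to picture the effect of the context setting is as a set of line ranges
around each annotated line, with overlapping ranges merged. The Python sketch
below is a rough model under that assumption and may differ in detail from what
cg_annotate does; <function>shown_ranges</function> and its parameters are
hypothetical names.
</para>

<programlisting><![CDATA[
```python
def shown_ranges(annotated_lines, context, last_line):
    """Return merged (first, last) line ranges to display.

    annotated_lines: line numbers that have event counts.
    context: number of extra lines shown on each side of an annotated line.
    last_line: number of lines in the source file.
    """
    ranges = []
    for line in sorted(annotated_lines):
        first = max(1, line - context)
        last = min(last_line, line + context)
        if ranges and first <= ranges[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], last))
        else:
            ranges.append((first, last))
    return ranges


print(shown_ranges([10, 12, 200], 2, 1000))
```
]]></programlisting>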
|
|
|
|
<para>
|
|
Any significant source files that could not be found are shown like this:
|
|
</para>
|
|
|
|
<programlisting><![CDATA[
|
|
--------------------------------------------------------------------------------
|
|
-- Annotated source file: ./malloc/./malloc/malloc.c
|
|
--------------------------------------------------------------------------------
|
|
Unannotated because one or more of these original files are unreadable:
|
|
- ./malloc/./malloc/malloc.c
|
|
]]></programlisting>
|
|
|
|
<para>
|
|
This is common for library files, because libraries are usually compiled with
|
|
debugging information but the source files are rarely present on a system.
|
|
</para>
|
|
|
|
<para>
|
|
Cachegrind relies heavily on accurate debug info. Sometimes compilers map a
particular compiled instruction to line number 0, where the 0 represents
|
|
"unknown" or "none". This is annoying but does happen in practice. cg_annotate
|
|
prints these in the following way:
|
|
</para>
|
|
|
|
<programlisting><![CDATA[
|
|
--------------------------------------------------------------------------------
|
|
-- Annotated source file: /home/njn/dev/rust0/compiler/rustc_borrowck/src/lib.rs
|
|
--------------------------------------------------------------------------------
|
|
Ir______________
|
|
|
|
1,046,746 (0.0%) <unknown (line 0)>
|
|
]]></programlisting>
|
|
|
|
<para>
|
|
Finally, when annotation is performed, the output ends with a summary of how
|
|
many counts were annotated and unannotated, and why. For example:
|
|
</para>
|
|
|
|
<programlisting><![CDATA[
|
|
--------------------------------------------------------------------------------
|
|
-- Annotation summary
|
|
--------------------------------------------------------------------------------
|
|
Ir_______________
|
|
|
|
3,534,817 (43.1%) annotated: files known & above threshold & readable, line numbers known
|
|
0 annotated: files known & above threshold & readable, line numbers unknown
|
|
0 unannotated: files known & above threshold & two or more non-identical
|
|
4,132,126 (50.4%) unannotated: files known & above threshold & unreadable
|
|
59,950 (0.7%) unannotated: files known & below threshold
|
|
468,163 (5.7%) unannotated: files unknown
|
|
]]></programlisting>
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="cg-manual.forkingprograms" xreflabel="Forking Programs">
|
|
<title>Forking Programs</title>
|
|
|
|
<para>
|
|
If your program forks, the child will inherit all the profiling data that
|
|
has been gathered for the parent.
|
|
</para>
|
|
|
|
<para>
|
|
If the output file name (controlled by <option>--cachegrind-out-file</option>)
|
|
does not contain <option>%p</option>, then the outputs from the parent and
|
|
child will be intermingled in a single output file, which will almost certainly
|
|
make it unreadable by cg_annotate.
|
|
</para>
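<para>
The following Python sketch illustrates why a name without
<option>%p</option> leads to a collision. Valgrind performs the real expansion
itself; <function>expand_out_file</function> is a hypothetical helper that
mimics only the <option>%p</option> substitution.
</para>

<programlisting><![CDATA[
```python
import os


def expand_out_file(template, pid=None):
    """Replace every "%p" in template with the process ID."""
    if pid is None:
        pid = os.getpid()
    return template.replace("%p", str(pid))


# Parent (pid 100) and child (pid 101) get distinct names only with %p.
print(expand_out_file("cachegrind.out.%p", 100))
print(expand_out_file("cachegrind.out.%p", 101))
print(expand_out_file("prof.cgout", 100) == expand_out_file("prof.cgout", 101))
```
]]></programlisting>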
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="cg-manual.annopts.warnings" xreflabel="cg_annotate Warnings">
|
|
<title>cg_annotate Warnings</title>
|
|
|
|
<para>
|
|
There are two situations in which cg_annotate prints warnings.
|
|
</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
If a source file is more recent than the Cachegrind output file. This is
|
|
because the information in the Cachegrind output file is only recorded with
|
|
line numbers, so if the line numbers change at all in the source (e.g.
|
|
lines added, deleted, swapped), any annotations will be incorrect.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
If information is recorded about line numbers past the end of a file. This
|
|
can be caused by the above problem, e.g. shortening the source file while
|
|
using an old Cachegrind output file. If this happens, the figures for the
|
|
bogus lines are printed anyway (and clearly marked as bogus) in case they
|
|
are important.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
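<para>
The first condition amounts to a comparison of modification times. A minimal
Python sketch of such a check follows; it assumes the comparison is made on
file modification times, and <function>source_is_newer</function> is a
hypothetical name rather than a cg_annotate function.
</para>

<programlisting><![CDATA[
```python
import os


def source_is_newer(source_path, cgout_path):
    """Return True if the source file was modified after the output file."""
    return os.path.getmtime(source_path) > os.path.getmtime(cgout_path)
```
]]></programlisting>

<para>
A wrapper script could call this before running cg_annotate and re-profile when
it returns true.
</para>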
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="cg-manual.cg_merge" xreflabel="cg_merge">
|
|
<title>Merging Cachegrind Output Files</title>
|
|
|
|
<para>
|
|
cg_annotate can merge data from multiple Cachegrind output files in a single
|
|
run. (There is also a program called cg_merge that can merge multiple
|
|
Cachegrind output files into a single Cachegrind output file, but it is now
|
|
deprecated because cg_annotate's merging does a better job.)
|
|
</para>
|
|
|
|
<para>
|
|
Use it as follows:
|
|
</para>
|
|
|
|
<programlisting><![CDATA[
|
|
cg_annotate file1 file2 file3 ...
|
|
]]></programlisting>
|
|
|
|
<para>
|
|
cg_annotate computes the sum of these files (effectively
|
|
<filename>file1</filename> + <filename>file2</filename> +
|
|
<filename>file3</filename>), and then produces output as usual that shows the
|
|
summed counts.
|
|
</para>
|
|
|
|
<para>
|
|
The most common merging scenario is if you want to aggregate costs over
|
|
multiple runs of the same program, possibly on different inputs.
|
|
</para>
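<para>
Conceptually, merging adds the counts for each location across the runs. The
Python sketch below models this with hypothetical counts; it does not read real
Cachegrind output files, and <function>merge_counts</function> is a
hypothetical name.
</para>

<programlisting><![CDATA[
```python
from collections import Counter


def merge_counts(*runs):
    """Sum per-location counts from several runs."""
    total = Counter()
    for run in runs:
        total.update(run)  # Counter.update adds counts rather than replacing
    return dict(total)


# Hypothetical Ir counts from three runs.
run1 = {"a.c:f": 10, "a.c:g": 5}
run2 = {"a.c:f": 7}
run3 = {"b.c:h": 1}
print(merge_counts(run1, run2, run3))
```
]]></programlisting>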
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="cg-manual.cg_diff" xreflabel="cg_diff">
|
|
<title>Differencing Cachegrind Output Files</title>
|
|
|
|
<para>
|
|
cg_annotate can diff data from two Cachegrind output files in a single run.
|
|
(There is also a program called cg_diff that can diff two Cachegrind output
|
|
files into a single Cachegrind output file, but it is now deprecated because
|
|
cg_annotate's differencing does a better job.)
|
|
</para>
|
|
|
|
<para>
|
|
Use it as follows:
|
|
</para>
|
|
|
|
<programlisting><![CDATA[
|
|
cg_annotate --diff file1 file2
|
|
]]></programlisting>
|
|
|
|
<para>
|
|
cg_annotate computes the difference between these two files (effectively
|
|
<filename>file2</filename> - <filename>file1</filename>), and then
|
|
produces output as usual that shows the count differences. Note that many of
|
|
the counts may be negative; this indicates that the counts for the relevant
|
|
file/function/line are smaller in the second version than those in the first
|
|
version.
|
|
</para>
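<para>
The subtraction can be modelled as follows. This Python sketch uses
hypothetical counts and a hypothetical helper name,
<function>diff_counts</function>; a location missing from one run is treated
as having a count of zero, which is an assumption of the sketch.
</para>

<programlisting><![CDATA[
```python
def diff_counts(first, second):
    """Return second - first for every location present in either run."""
    locations = set(first) | set(second)
    return {loc: second.get(loc, 0) - first.get(loc, 0) for loc in locations}


# Hypothetical Ir counts for two runs.
file1 = {"a.c:f": 10, "a.c:g": 5}
file2 = {"a.c:f": 7, "b.c:h": 2}
print(diff_counts(file1, file2))
```
]]></programlisting>

<para>
Here <computeroutput>a.c:f</computeroutput> comes out as -3 because its count
is smaller in the second run.
</para>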
|
|
|
|
<para>
|
|
The simplest common scenario is comparing two Cachegrind output files that came
|
|
from the same program, but on different inputs. cg_annotate will do a good job
|
|
on this without assistance.
|
|
</para>
|
|
|
|
<para>
|
|
A more complex scenario is if you want to compare Cachegrind output files from
|
|
two slightly different versions of a program that you have sitting
|
|
side-by-side, running on the same input. For example, you might have
|
|
<filename>version1/prog.c</filename> and <filename>version2/prog.c</filename>.
|
|
A straight comparison of the two would not be useful. Because functions are
|
|
always paired with filenames, a function <function>f</function> would be listed
|
|
as <filename>version1/prog.c:f</filename> for the first version but
|
|
<filename>version2/prog.c:f</filename> for the second version.
|
|
</para>
|
|
|
|
<para>
|
|
In this case, use the <option>--mod-filename</option> option. Its argument is a
|
|
search-and-replace expression that will be applied to all the filenames in both
|
|
Cachegrind output files. It can be used to remove minor differences in
|
|
filenames. For example, the option
|
|
<option>--mod-filename='s/version[0-9]/versionN/'</option> will suffice for the
|
|
above example.
|
|
</para>
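<para>
The effect of that expression can be previewed with Python's
<function>re.sub</function>. This is only an approximation for experimenting
with patterns: <function>mod_filename</function> is a hypothetical helper, and
replacing only the first match is an assumption borrowed from how a sed-style
substitution without a global flag behaves.
</para>

<programlisting><![CDATA[
```python
import re


def mod_filename(name):
    """Apply the equivalent of s/version[0-9]/versionN/ to a file name."""
    # count=1 replaces only the first match.
    return re.sub(r"version[0-9]", "versionN", name, count=1)


print(mod_filename("version1/prog.c"))
print(mod_filename("version2/prog.c"))
```
]]></programlisting>

<para>
Both names map to <filename>versionN/prog.c</filename>, so the counts for a
function <function>f</function> in the two versions are compared directly.
</para>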
|
|
|
|
<para>
|
|
Similarly, sometimes compilers auto-generate certain functions and give them
|
|
randomized names like <function>T.1234</function> where the suffixes vary from
|
|
build to build. You can use the <option>--mod-funcname</option> option to
|
|
remove small differences like these; it works in the same way as
|
|
<option>--mod-filename</option>.
|
|
</para>
|
|
|
|
<para>
|
|
When <option>--mod-filename</option> is used to compare two different versions
|
|
of the same program, cg_annotate will not annotate any file that is different
|
|
between the two versions, because the per-line counts are not reliable in such
|
|
a case. For example, imagine if <filename>version2/prog.c</filename> is the
|
|
same as <filename>version1/prog.c</filename> except with an extra blank line at
|
|
the top of the file. Every single per-line count will have changed. In
|
|
comparison, the per-file and per-function counts have not changed, and are
|
|
still very useful for determining differences between programs. You might think
|
|
that this means every interesting file will be left unannotated, but again
|
|
inlining means that files that are identical in the two versions can have
|
|
different counts on many lines.
|
|
</para>
|
|
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="cg-manual.cache-branch-sim" xreflabel="cache-branch-sim">
|
|
<title>Cache and Branch Simulation</title>
|
|
|
|
<para>
|
|
Cachegrind can simulate how your program interacts with a machine's cache
|
|
hierarchy and/or branch predictor.
|
|
|
|
The cache simulation models a machine with independent first-level instruction
and data caches (I1 and D1), backed by a unified second-level cache (L2). Some
machines have three or more levels of cache. For these machines (in the cases
where Cachegrind can auto-detect the cache configuration) Cachegrind simulates
the first-level and last-level caches. Therefore, Cachegrind always refers to
the I1, D1 and LL (last-level) caches.
|
|
</para>
|
|
|
|
<para>
|
|
When simulating the cache, with <option>--cache-sim=yes</option>, Cachegrind
|
|
gathers the following statistics:
|
|
</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
I cache reads (<computeroutput>Ir</computeroutput>, which equals the number
|
|
of instructions executed), I1 cache read misses
|
|
(<computeroutput>I1mr</computeroutput>) and LL cache instruction read
|
|
misses (<computeroutput>ILmr</computeroutput>).
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
D cache reads (<computeroutput>Dr</computeroutput>, which equals the number
|
|
of memory reads), D1 cache read misses
|
|
(<computeroutput>D1mr</computeroutput>), and LL cache data read misses
|
|
(<computeroutput>DLmr</computeroutput>).
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
D cache writes (<computeroutput>Dw</computeroutput>, which equals the
|
|
number of memory writes), D1 cache write misses
|
|
(<computeroutput>D1mw</computeroutput>), and LL cache data write misses
|
|
(<computeroutput>DLmw</computeroutput>).
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>
|
|
Note that the total number of D1 misses is given by
<computeroutput>D1mr</computeroutput> + <computeroutput>D1mw</computeroutput>,
and that the total number of LL misses is given by
<computeroutput>ILmr</computeroutput> + <computeroutput>DLmr</computeroutput> +
<computeroutput>DLmw</computeroutput>.
|
|
</para>
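<para>
These events can be combined into miss totals and miss rates. The Python sketch
below shows the arithmetic on hypothetical counts; the helper names and the
dictionary layout are assumptions made for illustration.
</para>

<programlisting><![CDATA[
```python
def miss_rate(misses, accesses):
    """Return misses as a percentage of accesses (0.0 if there were none)."""
    return 100.0 * misses / accesses if accesses else 0.0


def cache_summary(c):
    """Derive miss totals and rates from a dict of event counts."""
    d1_misses = c["D1mr"] + c["D1mw"]
    ll_misses = c["ILmr"] + c["DLmr"] + c["DLmw"]
    return {
        "I1 miss rate": miss_rate(c["I1mr"], c["Ir"]),
        "D1 misses": d1_misses,
        "D1 miss rate": miss_rate(d1_misses, c["Dr"] + c["Dw"]),
        "LL misses": ll_misses,
    }


# Hypothetical event counts, for illustration only.
sample = {
    "Ir": 1000, "I1mr": 10, "ILmr": 5,
    "Dr": 400, "D1mr": 20, "DLmr": 8,
    "Dw": 100, "D1mw": 5, "DLmw": 2,
}
print(cache_summary(sample))
```
]]></programlisting>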
|
|
|
|
<para>
|
|
When simulating the branch predictor, with <option>--branch-sim=yes</option>,
|
|
Cachegrind gathers the following statistics:
|
|
</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Conditional branches executed (<computeroutput>Bc</computeroutput>) and
|
|
conditional branches mispredicted (<computeroutput>Bcm</computeroutput>).
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Indirect branches executed (<computeroutput>Bi</computeroutput>) and
|
|
indirect branches mispredicted (<computeroutput>Bim</computeroutput>).
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>
|
|
When cache and/or branch simulation is enabled, cg_annotate will print multiple
|
|
counts per line of output. For example:
|
|
</para>
|
|
|
|
<programlisting><![CDATA[
|
|
Ir______________________ Bc____________________ Bcm__________________ Bi____________________ Bim______________ function:file
|
|
|
|
> 8,547 (0.1%, 99.4%) 936 (0.1%, 99.1%) 177 (0.3%, 96.7%) 59 (0.0%, 99.9%) 38 (19.4%, 66.3%) strcmp:
|
|
8,503 (0.1%) 928 (0.1%) 175 (0.3%) 59 (0.0%) 38 (19.4%) ./string/../sysdeps/x86_64/multiarch/../multiarch/strcmp-sse2.S
|
|
]]></programlisting>
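<para>
From such counts, misprediction rates can be derived by dividing mispredicted
branches by executed branches. The Python sketch below does this for the
<function>strcmp</function> line above; <function>mispredict_rate</function> is
a hypothetical helper, and rounding to one decimal place is a presentation
choice.
</para>

<programlisting><![CDATA[
```python
def mispredict_rate(executed, mispredicted):
    """Return mispredictions as a percentage of executed branches."""
    if not executed:
        return 0.0
    return round(100.0 * mispredicted / executed, 1)


# Counts from the strcmp function:file line in the example.
strcmp = {"Bc": 936, "Bcm": 177, "Bi": 59, "Bim": 38}

conditional = mispredict_rate(strcmp["Bc"], strcmp["Bcm"])
indirect = mispredict_rate(strcmp["Bi"], strcmp["Bim"])
overall = mispredict_rate(
    strcmp["Bc"] + strcmp["Bi"], strcmp["Bcm"] + strcmp["Bim"]
)
print(conditional, indirect, overall)
```
]]></programlisting>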
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="cg-manual.cgopts" xreflabel="Cachegrind Command-line Options">
|
|
<title>Cachegrind Command-line Options</title>
|
|
|
|
<!-- start of xi:include in the manpage -->
|
|
<para>
|
|
Cachegrind-specific options are:
|
|
</para>
|
|
|
|
<variablelist id="cg.opts.list">
|
|
|
|
<varlistentry id="opt.cachegrind-out-file" xreflabel="--cachegrind-out-file">
|
|
<term>
|
|
<option><![CDATA[--cachegrind-out-file=<file> ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
Write the Cachegrind output file to <filename>file</filename> rather than
|
|
to the default output file,
|
|
<filename>cachegrind.out.&lt;pid&gt;</filename>. The <option>%p</option>
|
|
and <option>%q</option> format specifiers can be used to embed the
|
|
process ID and/or the contents of an environment variable in the name, as
|
|
is the case for the core option
|
|
<option><link linkend="opt.log-file">--log-file</link></option>.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="opt.cache-sim" xreflabel="--cache-sim">
|
|
<term>
|
|
<option><![CDATA[--cache-sim=no|yes [no] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
Enables or disables collection of cache access and miss counts.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="opt.branch-sim" xreflabel="--branch-sim">
|
|
<term>
|
|
<option><![CDATA[--branch-sim=no|yes [no] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
Enables or disables collection of branch instruction and
|
|
misprediction counts.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="cg.opt.I1" xreflabel="--I1">
|
|
<term>
|
|
<option><![CDATA[--I1=<size>,<associativity>,<line size> ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
Specify the size, associativity and line size of the level 1 instruction
|
|
cache. Only useful with <option>--cache-sim=yes</option>.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="cg.opt.D1" xreflabel="--D1">
|
|
<term>
|
|
<option><![CDATA[--D1=<size>,<associativity>,<line size> ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
Specify the size, associativity and line size of the level 1 data cache.
|
|
Only useful with <option>--cache-sim=yes</option>.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="cg.opt.LL" xreflabel="--LL">
|
|
<term>
|
|
<option><![CDATA[--LL=<size>,<associativity>,<line size> ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
Specify the size, associativity and line size of the last-level cache.
|
|
Only useful with <option>--cache-sim=yes</option>.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
<!-- end of xi:include in the manpage -->
|
|
|
|
</sect1>
|
|
|
|
|
|
|
|
<sect1 id="cg-manual.annopts" xreflabel="cg_annotate Command-line Options">
|
|
<title>cg_annotate Command-line Options</title>
|
|
|
|
<!-- start of xi:include in the manpage -->
|
|
<variablelist id="cg_annotate.opts.list">
|
|
|
|
<varlistentry>
|
|
<term>
|
|
<option><![CDATA[-h --help ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Show the help message.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>
|
|
<option><![CDATA[--version ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Show the version number.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>
|
|
<option><![CDATA[--diff ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Diff two Cachegrind output files.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>
|
|
<option><![CDATA[--mod-filename <regex> [default: none]]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
Specifies an <option>s/old/new/</option> search-and-replace expression
|
|
that is applied to all filenames. Useful when differencing, for removing
|
|
minor differences in paths between two different versions of a program
|
|
that are sitting in different directories. An <option>i</option> suffix
|
|
makes the regex case-insensitive, and a <option>g</option> suffix makes
|
|
it match multiple times.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>
|
|
<option><![CDATA[--mod-funcname <regex> [default: none]]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
Like <option>--mod-filename</option>, but for function names. Useful for
|
|
removing minor differences in randomized names of auto-generated
|
|
functions generated by some compilers.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>
|
|
<option><![CDATA[--show=A,B,C [default: all, using order in
|
|
the Cachegrind output file] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
Specifies which events to show (and the column order). Default is to use
|
|
all present in the Cachegrind output file (and use the order in the
|
|
file). Best used in conjunction with <option>--sort</option>.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>
|
|
<option><![CDATA[--sort=A,B,C [default: order in the Cachegrind output file] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
Specifies the events upon which the sorting of the file:function and
|
|
function:file entries will be based.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>
|
|
<option><![CDATA[--threshold=X [default: 0.1%] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
Sets the significance threshold for the file:function and function:file
|
|
sections. A file or function is shown if it accounts for more than X% of
|
|
the counts for the primary sort event. If annotating source files, this
|
|
also affects which files are annotated.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>
|
|
<option><![CDATA[--show-percs, --no-show-percs, --show-percs=<no|yes> [default: yes] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
When enabled, a percentage is printed next to all event counts. This
|
|
helps gauge the relative importance of each function and line.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>
|
|
<option><![CDATA[--annotate, --no-annotate, --auto=<no|yes> [default: yes] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
Enables or disables source file annotation.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>
|
|
<option><![CDATA[--context=N [default: 8] ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
The number of lines of context to show before and after each annotated
|
|
line. Use a large number (e.g. 100000) to show all source lines.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
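<para>
As a rough illustration of how an <option>s/old/new/</option> expression with
optional <option>i</option> and <option>g</option> suffixes behaves, the
following Python sketch builds a rewriting function from such an expression.
It is an approximation written for this manual, not cg_annotate's actual code:
the name <computeroutput>search_and_replace</computeroutput> and the example
paths are hypothetical, and Python regular-expression syntax is assumed.
</para>

<programlisting><![CDATA[
import re

def search_and_replace(expr):
    """Turn an 's/old/new/' expression into a function that rewrites names."""
    # Split into old, new and suffix on slashes not preceded by a backslash.
    m = re.fullmatch(r"s/(.*)(?<!\\)/(.*)(?<!\\)/(g|i|gi|ig|)", expr)
    if m is None:
        raise ValueError("not an s/old/new/ expression: " + expr)
    old, new, tail = m.groups()
    flags = re.IGNORECASE if "i" in tail else 0   # 'i': case-insensitive
    count = 0 if "g" in tail else 1               # 'g': replace every match
    pattern = re.compile(old, flags)
    return lambda name: pattern.sub(new, name, count=count)

# Hypothetical paths: two versions of a program in different directories.
strip_version = search_and_replace("s/prog-v[0-9]+/prog/")
print(strip_version("/home/me/prog-v1/src/main.c"))
print(strip_version("/home/me/prog-v2/src/main.c"))
]]></programlisting>

<para>
Both calls yield the same path, so entries from the two versions line up when
the files are diffed.
</para>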
|
|
<!-- end of xi:include in the manpage -->
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="cg-manual.mergeopts" xreflabel="cg_merge Command-line Options">
|
|
<title>cg_merge Command-line Options</title>
|
|
|
|
<!-- start of xi:include in the manpage -->
|
|
<variablelist id="cg_merge.opts.list">
|
|
|
|
<varlistentry>
|
|
<term>
|
|
<option><![CDATA[-o outfile]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
Write the output to <computeroutput>outfile</computeroutput>
|
|
instead of standard output.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
<!-- end of xi:include in the manpage -->
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="cg-manual.diffopts" xreflabel="cg_diff Command-line Options">
|
|
<title>cg_diff Command-line Options</title>
|
|
|
|
<!-- start of xi:include in the manpage -->
|
|
<variablelist id="cg_diff.opts.list">
|
|
|
|
<varlistentry>
|
|
<term>
|
|
<option><![CDATA[-h --help ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Show the help message.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>
|
|
<option><![CDATA[--version ]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>Show the version number.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>
|
|
<option><![CDATA[--mod-filename=<expr> [default: none]]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
Specifies an <option>s/old/new/</option> search-and-replace expression
|
|
that is applied to all filenames.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>
|
|
<option><![CDATA[--mod-funcname=<expr> [default: none]]]></option>
|
|
</term>
|
|
<listitem>
|
|
<para>
|
|
Like <option>--mod-filename</option>, but for function names.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
<!-- end of xi:include in the manpage -->
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="cg-manual.sim-details"
|
|
xreflabel="Simulation Details">
|
|
<title>Simulation Details</title>
|
|
<para>
|
|
This section covers details that you don't need to know in order to
use Cachegrind, but that may be of interest to some people.
|
|
</para>
|
|
|
|
<sect2 id="cache-sim" xreflabel="Cache Simulation Specifics">
|
|
<title>Cache Simulation Specifics</title>
|
|
|
|
<para>
|
|
The cache simulation approximates the hardware of an AMD Athlon CPU circa 2002.
|
|
Its specific characteristics are as follows:</para>
|
|
|
|
<itemizedlist>
|
|
|
|
<listitem>
|
|
<para>Write-allocate: when a write miss occurs, the block
|
|
written to is brought into the D1 cache. Most modern caches
|
|
have this property.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Bit-selection hash function: the set of line(s) in the cache
|
|
to which a memory block maps is chosen by the middle bits
|
|
M--(M+N-1) of the byte address, where:</para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>line size = 2^M bytes</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>(cache size / line size / associativity) = 2^N, the number of sets</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Inclusive LL cache: the LL cache typically replicates all
|
|
the entries of the L1 caches, because fetching into L1 involves
|
|
fetching into LL first (this does not guarantee strict inclusiveness,
|
|
as lines evicted from LL still could reside in L1). This is
|
|
standard on Pentium chips, but AMD Opterons, Athlons and Durons
|
|
use an exclusive LL cache that only holds
|
|
blocks evicted from L1. Ditto most modern VIA CPUs.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
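<para>
As a rough illustration of the bit-selection mapping described above, the
following Python sketch computes the set index for a byte address. The
function name <computeroutput>set_index</computeroutput> and the example cache
geometry are hypothetical, chosen only for illustration; this is not code
taken from Cachegrind.
</para>

<programlisting><![CDATA[
def set_index(addr, cache_size, assoc, line_size):
    """Return the set that a byte address maps to under bit selection."""
    num_sets = cache_size // line_size // assoc
    # Bit selection needs both quantities to be powers of two.
    assert line_size & (line_size - 1) == 0
    assert num_sets & (num_sets - 1) == 0
    m = line_size.bit_length() - 1    # line size = 2^M bytes
    n = num_sets.bit_length() - 1     # number of sets = 2^N
    # Drop the M offset bits, then keep the next N bits.
    return (addr >> m) & ((1 << n) - 1)

# Hypothetical 32 KiB, 8-way cache with 64-byte lines: 64 sets, M = 6, N = 6.
print(set_index(0x1234, 32768, 8, 64))   # prints 8
]]></programlisting>

<para>
Addresses within the same line map to the same set, and so do addresses that
differ by a multiple of (line size x number of sets).
</para>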
|
|
|
|
<para>The cache configuration simulated (cache size,
|
|
associativity and line size) is determined automatically using
|
|
the x86 CPUID instruction. If you have a machine that (a)
|
|
doesn't support the CPUID instruction, or (b) supports it in an
|
|
early incarnation that doesn't give any cache information, then
|
|
Cachegrind will fall back to using a default configuration (that
|
|
of a model 3/4 Athlon). Cachegrind will tell you if this
|
|
happens. You can manually specify one, two or all three levels
|
|
(I1/D1/LL) of the cache from the command line using the
|
|
<option>--I1</option>,
|
|
<option>--D1</option> and
|
|
<option>--LL</option> options.
|
|
For cache parameters to be valid for simulation, the number
|
|
of sets (with associativity being the number of cache lines in
|
|
each set) has to be a power of two.</para>
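<para>
The power-of-two rule can be checked before starting a run. The sketch below
is a hypothetical helper, not part of Cachegrind: it assumes the
<computeroutput>size,associativity,line size</computeroutput> value is given
as three plain integers in bytes, and it models only the rule stated above,
not any other checks Cachegrind may perform.
</para>

<programlisting><![CDATA[
def check_cache_option(value):
    """Parse 'size,assoc,line_size'; return (number of sets, is it valid)."""
    size, assoc, line_size = (int(field) for field in value.split(","))
    num_sets = size // (assoc * line_size)
    divides_evenly = num_sets > 0 and num_sets * assoc * line_size == size
    power_of_two = num_sets > 0 and num_sets & (num_sets - 1) == 0
    return num_sets, divides_evenly and power_of_two

print(check_cache_option("32768,8,64"))    # 64 sets: a power of two
print(check_cache_option("49152,8,64"))    # 96 sets: not a power of two
]]></programlisting>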
|
|
|
|
<para>On PowerPC platforms
|
|
Cachegrind cannot automatically
|
|
determine the cache configuration, so you will
|
|
need to specify it with the
|
|
<option>--I1</option>,
|
|
<option>--D1</option> and
|
|
<option>--LL</option> options.</para>
|
|
|
|
|
|
<para>Other noteworthy behaviour:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>References that straddle two cache lines are treated as
|
|
follows:</para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>If both blocks hit --> counted as one hit</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>If one block hits, the other misses --> counted
|
|
as one miss.</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>If both blocks miss --> counted as one miss (not
|
|
two)</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Instructions that modify a memory location
|
|
(e.g. <computeroutput>inc</computeroutput> and
|
|
<computeroutput>dec</computeroutput>) are counted as doing
|
|
just a read, i.e. a single data reference. This may seem
|
|
strange, but since the write can never cause a miss (the read
|
|
guarantees the block is in the cache) it's not very
|
|
interesting.</para>
|
|
|
|
<para>Thus Cachegrind measures not the number of times the data cache
|
|
is accessed, but the number of times a data cache miss could
|
|
occur.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
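<para>
The rule for references that straddle two cache lines can be sketched as
follows. The model is deliberately simplified: a Python set of resident line
numbers stands in for the cache, with no associativity or eviction. That
simplification, and the name
<computeroutput>reference_misses</computeroutput>, are assumptions made for
brevity.
</para>

<programlisting><![CDATA[
def reference_misses(addr, size, line_size, resident_lines):
    """Return True if one data reference counts as a (single) miss."""
    first = addr // line_size
    last = (addr + size - 1) // line_size
    missed = False
    for line in range(first, last + 1):
        if line not in resident_lines:
            missed = True               # at most one miss per reference
            resident_lines.add(line)    # the missing line is brought in
    return missed

resident = {0}                                     # line 0 already cached
misses = 0
misses += reference_misses(60, 8, 64, resident)    # straddles lines 0 and 1
misses += reference_misses(60, 8, 64, resident)    # both lines now resident
print(misses)   # prints 1
]]></programlisting>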
|
|
|
|
<para>
|
|
If you are interested in simulating a cache with different properties, it is
|
|
not particularly hard to write your own cache simulator, or to modify the
|
|
existing ones in <computeroutput>cg_sim.c</computeroutput>.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="branch-sim" xreflabel="Branch Simulation Specifics">
|
|
<title>Branch Simulation Specifics</title>
|
|
|
|
<para>Cachegrind simulates branch predictors intended to be
|
|
typical of mainstream desktop/server processors of around 2004.</para>
|
|
|
|
<para>Conditional branches are predicted using an array of 16384 2-bit
|
|
saturating counters. The array index used for a branch instruction is
|
|
computed partly from the low-order bits of the branch instruction's
|
|
address and partly using the taken/not-taken behaviour of the last few
|
|
conditional branches. As a result the predictions for any specific
|
|
branch depend both on its own history and the behaviour of previous
|
|
branches. This is a standard technique for improving prediction
|
|
accuracy.</para>
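<para>
A predictor of this general shape can be sketched in a few lines of Python.
Only the table size and the use of 2-bit saturating counters come from the
description above. The history length, the way the address and history are
combined into an index, and the initial counter state are assumptions made for
illustration, and may differ from what Cachegrind does.
</para>

<programlisting><![CDATA[
TABLE_SIZE = 16384     # number of 2-bit counters, as described above
HISTORY_BITS = 8       # assumption: length of the global branch history
HISTORY_SHIFT = 6      # assumption: where the history is mixed into the index

class CondPredictor:
    def __init__(self):
        self.counters = [1] * TABLE_SIZE   # assumption: weakly not-taken
        self.history = 0

    def predict_and_update(self, pc, taken):
        """Record one conditional branch; return True on a mispredict."""
        index = (pc ^ (self.history << HISTORY_SHIFT)) & (TABLE_SIZE - 1)
        predicted_taken = self.counters[index] >= 2
        # Saturating update: counters stay within 0..3.
        if taken:
            self.counters[index] = min(3, self.counters[index] + 1)
        else:
            self.counters[index] = max(0, self.counters[index] - 1)
        mask = (1 << HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
        return predicted_taken != taken

predictor = CondPredictor()
outcomes = [predictor.predict_and_update(0x400, True) for _ in range(50)]
print(sum(outcomes))   # mispredicts occur only while the history warms up
]]></programlisting>

<para>
In this sketch a branch that is always taken is mispredicted only during the
first few executions, while the history register fills and each newly indexed
counter is trained.
</para>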
|
|
|
|
<para>For indirect branches (that is, jumps to unknown destinations)
|
|
Cachegrind uses a simple branch target address predictor. Targets are
|
|
predicted using an array of 512 entries indexed by the low order 9
|
|
bits of the branch instruction's address. Each branch is predicted to
|
|
jump to the same address it did last time. Any other behaviour causes
|
|
a mispredict.</para>
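<para>
The indirect branch scheme above is simple enough to sketch directly. The
class and method names below are hypothetical, and the initial table contents
(all zero) are an assumption.
</para>

<programlisting><![CDATA[
class IndirectPredictor:
    def __init__(self):
        self.targets = [0] * 512           # one predicted target per entry

    def predict_and_update(self, pc, target):
        """Record one indirect branch; return True on a mispredict."""
        index = pc & 0x1FF                 # low-order 9 bits of the address
        mispredicted = self.targets[index] != target
        self.targets[index] = target       # remember where it went this time
        return mispredicted

predictor = IndirectPredictor()
print(predictor.predict_and_update(0x1000, 0x2000))   # True: first sighting
print(predictor.predict_and_update(0x1000, 0x2000))   # False: same target
print(predictor.predict_and_update(0x1000, 0x3000))   # True: target changed
]]></programlisting>

<para>
Because only 9 address bits are used, two branches whose addresses differ by
a multiple of 512 share an entry and can evict each other's predictions.
</para>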
|
|
|
|
<para>More recent processors have better branch predictors, in
|
|
particular better indirect branch predictors. Cachegrind's predictor
|
|
design is deliberately conservative so as to be representative of the
|
|
large installed base of processors which pre-date widespread
|
|
deployment of more sophisticated indirect branch predictors. In
|
|
particular, late model Pentium 4s (Prescott), Pentium M, Core and Core
|
|
2 have more sophisticated indirect branch predictors than modelled by
|
|
Cachegrind. </para>
|
|
|
|
<para>Cachegrind does not simulate a return stack predictor. It
|
|
assumes that processors perfectly predict function return addresses,
|
|
an assumption which is probably close to being true.</para>
|
|
|
|
<para>See Hennessy and Patterson's classic text "Computer
|
|
Architecture: A Quantitative Approach", 4th edition (2007), Section
|
|
2.3 (pages 80-89) for background on modern branch predictors.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="cg-manual.annopts.accuracy" xreflabel="Accuracy">
|
|
<title>Accuracy</title>
|
|
|
|
<para>
|
|
Cachegrind's instruction counting has one shortcoming on x86/amd64:
|
|
</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
When a <function>REP</function>-prefixed instruction executes, each
|
|
iteration is counted separately. In contrast, hardware counters count each
|
|
such instruction just once, no matter how many times it iterates. It is
|
|
arguable that Cachegrind's behaviour is more useful.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>
|
|
Cachegrind's cache profiling has a number of shortcomings:
|
|
</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
It doesn't account for kernel activity. The effect of system calls on the
|
|
cache and branch predictor contents is ignored.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
It doesn't account for other process activity. This is arguably desirable
|
|
when considering a single program.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>It doesn't account for virtual-to-physical address
|
|
mappings. Hence the simulation is not a true
|
|
representation of what's happening in the
|
|
cache. Most caches and branch predictors are physically indexed, but
|
|
Cachegrind simulates caches using virtual addresses.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>It doesn't account for cache misses not visible at the
|
|
instruction level, e.g. those arising from TLB misses, or
|
|
speculative execution.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Valgrind will schedule
|
|
threads differently from how they would be when running natively.
|
|
This could warp the results for threaded programs.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
The x86/amd64 instructions <computeroutput>bts</computeroutput>,
|
|
<computeroutput>btr</computeroutput> and
|
|
<computeroutput>btc</computeroutput> will incorrectly be counted as doing a
|
|
data read if both the arguments are registers, e.g.:
|
|
<programlisting><![CDATA[
btsl %eax, %edx]]></programlisting>
|
|
This should only happen rarely.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>x86/amd64 FPU instructions with data sizes of 28 and 108 bytes
|
|
(e.g. <computeroutput>fsave</computeroutput>) are treated as
|
|
though they only access 16 bytes. These instructions seem to
|
|
be rare so hopefully this won't affect accuracy much.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
<para>Another thing worth noting is that simulation results are very sensitive to memory layout.
|
|
Changing the size of the executable being profiled, or the sizes
|
|
of any of the shared libraries it uses, or even the length of their
|
|
file names, can perturb the results. Variations will be small, but
|
|
don't expect perfectly repeatable results if your program changes at
|
|
all.</para>
|
|
|
|
<para>
|
|
Many Linux distributions perform address space layout randomisation (ASLR), in
|
|
which identical runs of the same program have their shared libraries loaded at
|
|
different locations, as a security measure. This also perturbs the
|
|
results.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
|
|
|
|
<sect1 id="cg-manual.impl-details"
|
|
xreflabel="Implementation Details">
|
|
<title>Implementation Details</title>
|
|
<para>
|
|
This section covers details that you don't need to know in order to
use Cachegrind, but that may be of interest to some people.
|
|
</para>
|
|
|
|
<sect2 id="cg-manual.impl-details.how-cg-works"
|
|
xreflabel="How Cachegrind Works">
|
|
<title>How Cachegrind Works</title>
|
|
<para>The best reference for understanding how Cachegrind works is chapter 3 of
|
|
"Dynamic Binary Analysis and Instrumentation", by Nicholas Nethercote. It
|
|
is available on the <ulink url="&vg-pubs-url;">Valgrind publications
|
|
page</ulink>.</para>
|
|
</sect2>
|
|
|
|
<sect2 id="cg-manual.impl-details.file-format"
|
|
xreflabel="Cachegrind Output File Format">
|
|
<title>Cachegrind Output File Format</title>
|
|
<para>The file format is fairly straightforward, basically giving the
|
|
cost centre for every line, grouped by files and
|
|
functions. It's also totally generic and self-describing, in the sense that
|
|
it can be used for any events that can be counted on a line-by-line basis,
|
|
not just cache and branch predictor events. For example, earlier versions
|
|
of Cachegrind didn't have a branch predictor simulation. When this was
|
|
added, the file format didn't need to change at all. So the format (and
|
|
consequently, cg_annotate) could be used by other tools.</para>
|
|
|
|
<para>The file format:</para>
|
|
<programlisting><![CDATA[
file         ::= desc_line* cmd_line events_line data_line+ summary_line
desc_line    ::= "desc:" ws? non_nl_string
cmd_line     ::= "cmd:" ws? cmd
events_line  ::= "events:" ws? (event ws)+
data_line    ::= file_line | fn_line | count_line
file_line    ::= "fl=" filename
fn_line      ::= "fn=" fn_name
count_line   ::= line_num (ws+ count)* ws*
summary_line ::= "summary:" ws? count (ws+ count)+ ws*
count        ::= num]]></programlisting>
|
|
|
|
<para>Where:</para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para><computeroutput>non_nl_string</computeroutput> is any
|
|
string not containing a newline.</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><computeroutput>cmd</computeroutput> is a string holding the
|
|
command line of the profiled program.</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><computeroutput>event</computeroutput> is a string containing
|
|
no whitespace.</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><computeroutput>filename</computeroutput> and
|
|
<computeroutput>fn_name</computeroutput> are strings.</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><computeroutput>num</computeroutput> and
|
|
<computeroutput>line_num</computeroutput> are decimal
|
|
numbers.</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para><computeroutput>ws</computeroutput> is whitespace.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>The contents of the "desc:" lines are printed out at the top
|
|
of the summary. This is a generic way of providing simulation-specific
information, e.g. for giving the cache configuration for
|
|
cache simulation.</para>
|
|
|
|
<para>More than one line of info can be present for each file/fn/line number.
|
|
In such cases, the counts for the named events will be accumulated.</para>
|
|
|
|
<para>The number of counts in each
<computeroutput>count_line</computeroutput> and the
<computeroutput>summary_line</computeroutput> should not exceed
the number of events in the
<computeroutput>events_line</computeroutput>. If a
<computeroutput>count_line</computeroutput> has fewer counts, cg_annotate
treats the missing ones as though they were "0" entries. This can reduce
file size.
</para>
|
|
|
|
<para>A <computeroutput>file_line</computeroutput> changes the
|
|
current file name. A <computeroutput>fn_line</computeroutput>
|
|
changes the current function name. A
|
|
<computeroutput>count_line</computeroutput> contains counts that
|
|
pertain to the current filename/fn_name. A
|
|
<computeroutput>file_line</computeroutput> and a
|
|
<computeroutput>fn_line</computeroutput> must appear before any
|
|
<computeroutput>count_line</computeroutput>s to give the context
|
|
of the first <computeroutput>count_line</computeroutput>s.</para>
|
|
|
|
<para>Similarly, each <computeroutput>file_line</computeroutput> must be
|
|
immediately followed by a <computeroutput>fn_line</computeroutput>.
|
|
</para>
|
|
|
|
<para>The summary line is redundant, because it just holds the total counts
|
|
for each event. But this serves as a useful sanity check of the data; if
|
|
the totals for each event don't match the summary line, something has gone
|
|
wrong.</para>
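<para>
To make the format concrete, here is a minimal Python reader that follows the
grammar above: it accumulates counts per file, function and line number, pads
short count lines with zeros, and compares the totals with the summary line.
It is an illustrative sketch, not the parser used by cg_annotate; the function
name <computeroutput>parse_cachegrind_out</computeroutput> and the sample data
are made up.
</para>

<programlisting><![CDATA[
def parse_cachegrind_out(text):
    """Return (events, counts, totals, summary) for a Cachegrind output file."""
    events, summary = [], []
    counts = {}                  # (file, function, line number) -> counts
    cur_file = cur_fn = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("desc:", "cmd:")):
            continue
        if line.startswith("events:"):
            events = line[len("events:"):].split()
        elif line.startswith("fl="):
            cur_file = line[3:]
        elif line.startswith("fn="):
            cur_fn = line[3:]
        elif line.startswith("summary:"):
            summary = [int(n) for n in line[len("summary:"):].split()]
        else:                    # a count_line: line_num followed by counts
            fields = line.split()
            nums = [int(n) for n in fields[1:]]
            nums += [0] * (len(events) - len(nums))    # missing counts are 0
            key = (cur_file, cur_fn, int(fields[0]))
            old = counts.get(key, [0] * len(events))
            counts[key] = [a + b for a, b in zip(old, nums)]
    totals = [sum(column) for column in zip(*counts.values())]
    return events, counts, totals, summary

# Made-up sample data, for illustration only.
SAMPLE = """\
desc: hypothetical description line
cmd: ./hypothetical-program
events: Ir Dr Dw
fl=main.c
fn=main
10 5 2 1
11 7
fn=helper
20 3 1
fl=main.c
fn=main
10 1 1 1
summary: 16 4 2
"""

events, counts, totals, summary = parse_cachegrind_out(SAMPLE)
print(events, totals, totals == summary)
]]></programlisting>

<para>
Line 10 of <computeroutput>main</computeroutput> appears twice in the sample,
so its counts are accumulated, and the final comparison is the sanity check
that the summary line makes possible.
</para>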
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
</chapter>
|