mirror of
https://github.com/Zenithsiz/ftmemsim-valgrind.git
synced 2026-02-04 10:21:20 +00:00
Majorly improved. Still a lot to do, but the structure is better. git-svn-id: svn://svn.valgrind.org/valgrind/trunk@1324
718 lines
32 KiB
HTML
718 lines
32 KiB
HTML
<html>
|
|
<head>
|
|
<title>Cachegrind: a cache-miss profiler</title>
|
|
</head>
|
|
|
|
<body>
|
|
<a name="cg-top"></a>
|
|
<h2>4 <b>Cachegrind</b>: a cache-miss profiler</h2>
|
|
|
|
To use this skin, you must specify <code>--skin=cachegrind</code>
|
|
on the Valgrind command line.
|
|
|
|
<p>
|
|
Detailed technical documentation on how Cachegrind works is available
|
|
<A HREF="cg_techdocs.html">here</A>. If you want to know how
|
|
to <b>use</b> it, you only need to read this page.
|
|
|
|
|
|
<a name="cache"></a>
|
|
<h3>4.1 Cache profiling</h3>
|
|
Cachegrind is a tool for doing cache simulations and annotating your source
|
|
line-by-line with the number of cache misses. In particular, it records:
|
|
<ul>
|
|
<li>L1 instruction cache reads and misses;
|
|
<li>L1 data cache reads and read misses, writes and write misses;
|
|
<li>L2 unified cache reads and read misses, writes and writes misses.
|
|
</ul>
|
|
On a modern x86 machine, an L1 miss will typically cost around 10 cycles,
|
|
and an L2 miss can cost as much as 200 cycles. Detailed cache profiling can be
|
|
very useful for improving the performance of your program.<p>
|
|
|
|
Also, since one instruction cache read is performed per instruction executed,
|
|
you can find out how many instructions are executed per line, which can be
|
|
useful for traditional profiling and test coverage.<p>
|
|
|
|
Any feedback, bug-fixes, suggestions, etc, welcome.
|
|
|
|
|
|
<h3>4.2 Overview</h3>
|
|
First off, as for normal Valgrind use, you probably want to compile with
|
|
debugging info (the <code>-g</code> flag). But by contrast with normal
|
|
Valgrind use, you probably <b>do</b> want to turn optimisation on, since you
|
|
should profile your program as it will be normally run.
|
|
|
|
The two steps are:
|
|
<ol>
|
|
<li>Run your program with <code>valgrind --skin=cachegrind</code> in front of
|
|
the normal command line invocation. When the program finishes,
|
|
Cachegrind will print summary cache statistics. It also collects
|
|
line-by-line information in a file
|
|
<code>cachegrind.out.<i>pid</i></code>, where <code><i>pid</i></code>
|
|
is the program's process id.
|
|
<p>
|
|
This step should be done every time you want to collect
|
|
information about a new program, a changed program, or about the
|
|
same program with different input.
|
|
</li>
|
|
<p>
|
|
<li>Generate a function-by-function summary, and possibly annotate
|
|
source files, using the supplied
|
|
<code>cg_annotate</code> program. Source files to annotate can be
|
|
specified manually, or manually on the command line, or
|
|
"interesting" source files can be annotated automatically with
|
|
the <code>--auto=yes</code> option. You can annotate C/C++
|
|
files or assembly language files equally easily.
|
|
<p>
|
|
This step can be performed as many times as you like for each
|
|
Step 2. You may want to do multiple annotations showing
|
|
different information each time.<p>
|
|
</li>
|
|
</ol>
|
|
|
|
The steps are described in detail in the following sections.<p>
|
|
|
|
|
|
<h4>4.3 Cache simulation specifics</h3>
|
|
|
|
Cachegrind uses a simulation for a machine with a split L1 cache and a unified
|
|
L2 cache. This configuration is used for all (modern) x86-based machines we
|
|
are aware of. Old Cyrix CPUs had a unified I and D L1 cache, but they are
|
|
ancient history now.<p>
|
|
|
|
The more specific characteristics of the simulation are as follows.
|
|
|
|
<ul>
|
|
<li>Write-allocate: when a write miss occurs, the block written to
|
|
is brought into the D1 cache. Most modern caches have this
|
|
property.</li><p>
|
|
|
|
<li>Bit-selection hash function: the line(s) in the cache to which a
|
|
memory block maps is chosen by the middle bits M--(M+N-1) of the
|
|
byte address, where:
|
|
<ul>
|
|
<li> line size = 2^M bytes </li>
|
|
<li>(cache size / line size) = 2^N bytes</li>
|
|
</ul> </li><p>
|
|
|
|
<li>Inclusive L2 cache: the L2 cache replicates all the entries of
|
|
the L1 cache. This is standard on Pentium chips, but AMD
|
|
Athlons use an exclusive L2 cache that only holds blocks evicted
|
|
from L1. Ditto AMD Durons and most modern VIAs.</li><p>
|
|
</ul>
|
|
|
|
The cache configuration simulated (cache size, associativity and line size) is
|
|
determined automagically using the CPUID instruction. If you have an old
|
|
machine that (a) doesn't support the CPUID instruction, or (b) supports it in
|
|
an early incarnation that doesn't give any cache information, then Cachegrind
|
|
will fall back to using a default configuration (that of a model 3/4 Athlon).
|
|
Cachegrind will tell you if this happens. You can manually specify one, two or
|
|
all three levels (I1/D1/L2) of the cache from the command line using the
|
|
<code>--I1</code>, <code>--D1</code> and <code>--L2</code> options.<p>
|
|
|
|
Other noteworthy behaviour:
|
|
|
|
<ul>
|
|
<li>References that straddle two cache lines are treated as follows:
|
|
<ul>
|
|
<li>If both blocks hit --> counted as one hit</li>
|
|
<li>If one block hits, the other misses --> counted as one miss</li>
|
|
<li>If both blocks miss --> counted as one miss (not two)</li>
|
|
</ul><p></li>
|
|
|
|
<li>Instructions that modify a memory location (eg. <code>inc</code> and
|
|
<code>dec</code>) are counted as doing just a read, ie. a single data
|
|
reference. This may seem strange, but since the write can never cause a
|
|
miss (the read guarantees the block is in the cache) it's not very
|
|
interesting.<p>
|
|
|
|
Thus it measures not the number of times the data cache is accessed, but
|
|
the number of times a data cache miss could occur.<p>
|
|
</li>
|
|
</ul>
|
|
|
|
If you are interested in simulating a cache with different properties, it is
|
|
not particularly hard to write your own cache simulator, or to modify the
|
|
existing ones in <code>vg_cachesim_I1.c</code>, <code>vg_cachesim_D1.c</code>,
|
|
<code>vg_cachesim_L2.c</code> and <code>vg_cachesim_gen.c</code>. We'd be
|
|
interested to hear from anyone who does.
|
|
|
|
|
|
<a name="profile"></a>
|
|
<h3>4.4 Profiling programs</h3>
|
|
|
|
Cache profiling is enabled by using the <code>--skin=cachegrind</code>
|
|
option to the <code>valgrind</code> shell script. To gather cache profiling
|
|
information about the program <code>ls -l</code>, type:
|
|
|
|
<blockquote><code>valgrind --skin=cachegrind ls -l</code></blockquote>
|
|
|
|
The program will execute (slowly). Upon completion, summary statistics
|
|
that look like this will be printed:
|
|
|
|
<pre>
|
|
==31751== I refs: 27,742,716
|
|
==31751== I1 misses: 276
|
|
==31751== L2 misses: 275
|
|
==31751== I1 miss rate: 0.0%
|
|
==31751== L2i miss rate: 0.0%
|
|
==31751==
|
|
==31751== D refs: 15,430,290 (10,955,517 rd + 4,474,773 wr)
|
|
==31751== D1 misses: 41,185 ( 21,905 rd + 19,280 wr)
|
|
==31751== L2 misses: 23,085 ( 3,987 rd + 19,098 wr)
|
|
==31751== D1 miss rate: 0.2% ( 0.1% + 0.4%)
|
|
==31751== L2d miss rate: 0.1% ( 0.0% + 0.4%)
|
|
==31751==
|
|
==31751== L2 misses: 23,360 ( 4,262 rd + 19,098 wr)
|
|
==31751== L2 miss rate: 0.0% ( 0.0% + 0.4%)
|
|
</pre>
|
|
|
|
Cache accesses for instruction fetches are summarised first, giving the
|
|
number of fetches made (this is the number of instructions executed, which
|
|
can be useful to know in its own right), the number of I1 misses, and the
|
|
number of L2 instruction (<code>L2i</code>) misses.<p>
|
|
|
|
Cache accesses for data follow. The information is similar to that of the
|
|
instruction fetches, except that the values are also shown split between reads
|
|
and writes (note each row's <code>rd</code> and <code>wr</code> values add up
|
|
to the row's total).<p>
|
|
|
|
Combined instruction and data figures for the L2 cache follow that.<p>
|
|
|
|
|
|
<h3>4.5 Output file</h3>
|
|
|
|
As well as printing summary information, Cachegrind also writes
|
|
line-by-line cache profiling information to a file named
|
|
<code>cachegrind.out.<i>pid</i></code>. This file is human-readable, but is
|
|
best interpreted by the accompanying program <code>cg_annotate</code>,
|
|
described in the next section.
|
|
<p>
|
|
Things to note about the <code>cachegrind.out.<i>pid</i></code> file:
|
|
<ul>
|
|
<li>It is written every time <code>valgrind --skin=cachegrind</code>
|
|
is run, and will overwrite any existing
|
|
<code>cachegrind.out.<i>pid</i></code> in the current directory (but
|
|
that won't happen very often because it takes some time for process ids
|
|
to be recycled).</li>
|
|
<p>
|
|
<li>It can be huge: <code>ls -l</code> generates a file of about
|
|
350KB. Browsing a few files and web pages with a Konqueror
|
|
built with full debugging information generates a file
|
|
of around 15 MB.</li>
|
|
</ul>
|
|
|
|
Note that older versions of Cachegrind used a log file named
|
|
<code>cachegrind.out</code> (i.e. no <code><i>.pid</i></code> suffix).
|
|
The suffix serves two purposes. Firstly, it means you don't have to
|
|
rename old log files that you don't want to overwrite. Secondly, and
|
|
more importantly, it allows correct profiling with the
|
|
<code>--trace-children=yes</code> option of programs that spawn child
|
|
processes.
|
|
|
|
|
|
<a name="profileflags"></a>
|
|
<h3>4.6 Cachegrind options</h3>
|
|
|
|
Cache-simulation specific options are:
|
|
|
|
<ul>
|
|
<li><code>--I1=<size>,<associativity>,<line_size></code><br>
|
|
<code>--D1=<size>,<associativity>,<line_size></code><br>
|
|
<code>--L2=<size>,<associativity>,<line_size></code><p>
|
|
[default: uses CPUID for automagic cache configuration]<p>
|
|
|
|
Manually specifies the I1/D1/L2 cache configuration, where
|
|
<code>size</code> and <code>line_size</code> are measured in bytes. The
|
|
three items must be comma-separated, but with no spaces, eg:
|
|
|
|
<blockquote>
|
|
<code>valgrind --skin=cachegrind --I1=65535,2,64</code>
|
|
</blockquote>
|
|
|
|
You can specify one, two or three of the I1/D1/L2 caches. Any level not
|
|
manually specified will be simulated using the configuration found in the
|
|
normal way (via the CPUID instruction, or failing that, via defaults).
|
|
</ul>
|
|
|
|
|
|
<a name="annotate"></a>
|
|
<h3>4.7 Annotating C/C++ programs</h3>
|
|
|
|
Before using <code>cg_annotate</code>, it is worth widening your
|
|
window to be at least 120-characters wide if possible, as the output
|
|
lines can be quite long.
|
|
<p>
|
|
To get a function-by-function summary, run <code>cg_annotate
|
|
--<i>pid</i></code> in a directory containing a
|
|
<code>cachegrind.out.<i>pid</i></code> file. The <code>--<i>pid</i></code>
|
|
is required so that <code>cg_annotate</code> knows which log file to use when
|
|
several are present.
|
|
<p>
|
|
The output looks like this:
|
|
|
|
<pre>
|
|
--------------------------------------------------------------------------------
|
|
I1 cache: 65536 B, 64 B, 2-way associative
|
|
D1 cache: 65536 B, 64 B, 2-way associative
|
|
L2 cache: 262144 B, 64 B, 8-way associative
|
|
Command: concord vg_to_ucode.c
|
|
Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
|
|
Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
|
|
Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
|
|
Threshold: 99%
|
|
Chosen for annotation:
|
|
Auto-annotation: on
|
|
|
|
--------------------------------------------------------------------------------
|
|
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
|
|
--------------------------------------------------------------------------------
|
|
27,742,716 276 275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS
|
|
|
|
--------------------------------------------------------------------------------
|
|
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
|
|
--------------------------------------------------------------------------------
|
|
8,821,482 5 5 2,242,702 1,621 73 1,794,230 0 0 getc.c:_IO_getc
|
|
5,222,023 4 4 2,276,334 16 12 875,959 1 1 concord.c:get_word
|
|
2,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp
|
|
2,521,927 2 2 591,215 0 0 179,398 0 0 concord.c:hash
|
|
2,242,740 2 2 1,046,612 568 22 448,548 0 0 ctype.c:tolower
|
|
1,496,937 4 4 630,874 9,000 1,400 279,388 0 0 concord.c:insert
|
|
897,991 51 51 897,831 95 30 62 1 1 ???:???
|
|
598,068 1 1 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__flockfile
|
|
598,068 0 0 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__funlockfile
|
|
598,024 4 4 213,580 35 16 149,506 0 0 vg_clientmalloc.c:malloc
|
|
446,587 1 1 215,973 2,167 430 129,948 14,057 13,957 concord.c:add_existing
|
|
341,760 2 2 128,160 0 0 128,160 0 0 vg_clientmalloc.c:vg_trap_here_WRAPPER
|
|
320,782 4 4 150,711 276 0 56,027 53 53 concord.c:init_hash_table
|
|
298,998 1 1 106,785 0 0 64,071 1 1 concord.c:create
|
|
149,518 0 0 149,516 0 0 1 0 0 ???:tolower@@GLIBC_2.0
|
|
149,518 0 0 149,516 0 0 1 0 0 ???:fgetc@@GLIBC_2.0
|
|
95,983 4 4 38,031 0 0 34,409 3,152 3,150 concord.c:new_word_node
|
|
85,440 0 0 42,720 0 0 21,360 0 0 vg_clientmalloc.c:vg_bogus_epilogue
|
|
</pre>
|
|
|
|
First up is a summary of the annotation options:
|
|
|
|
<ul>
|
|
<li>I1 cache, D1 cache, L2 cache: cache configuration. So you know the
|
|
configuration with which these results were obtained.</li><p>
|
|
|
|
<li>Command: the command line invocation of the program under
|
|
examination.</li><p>
|
|
|
|
<li>Events recorded: event abbreviations are:<p>
|
|
<ul>
|
|
<li><code>Ir </code>: I cache reads (ie. instructions executed)</li>
|
|
<li><code>I1mr</code>: I1 cache read misses</li>
|
|
<li><code>I2mr</code>: L2 cache instruction read misses</li>
|
|
<li><code>Dr </code>: D cache reads (ie. memory reads)</li>
|
|
<li><code>D1mr</code>: D1 cache read misses</li>
|
|
<li><code>D2mr</code>: L2 cache data read misses</li>
|
|
<li><code>Dw </code>: D cache writes (ie. memory writes)</li>
|
|
<li><code>D1mw</code>: D1 cache write misses</li>
|
|
<li><code>D2mw</code>: L2 cache data write misses</li>
|
|
</ul><p>
|
|
Note that D1 total accesses is given by <code>D1mr</code> +
|
|
<code>D1mw</code>, and that L2 total accesses is given by
|
|
<code>I2mr</code> + <code>D2mr</code> + <code>D2mw</code>.</li><p>
|
|
|
|
<li>Events shown: the events shown (a subset of events gathered). This can
|
|
be adjusted with the <code>--show</code> option.</li><p>
|
|
|
|
<li>Event sort order: the sort order in which functions are shown. For
|
|
example, in this case the functions are sorted from highest
|
|
<code>Ir</code> counts to lowest. If two functions have identical
|
|
<code>Ir</code> counts, they will then be sorted by <code>I1mr</code>
|
|
counts, and so on. This order can be adjusted with the
|
|
<code>--sort</code> option.<p>
|
|
|
|
Note that this dictates the order the functions appear. It is <b>not</b>
|
|
the order in which the columns appear; that is dictated by the "events
|
|
shown" line (and can be changed with the <code>--show</code> option).
|
|
</li><p>
|
|
|
|
<li>Threshold: <code>cg_annotate</code> by default omits functions
|
|
that cause very low numbers of misses to avoid drowning you in
|
|
information. In this case, cg_annotate shows summaries the
|
|
functions that account for 99% of the <code>Ir</code> counts;
|
|
<code>Ir</code> is chosen as the threshold event since it is the
|
|
primary sort event. The threshold can be adjusted with the
|
|
<code>--threshold</code> option.</li><p>
|
|
|
|
<li>Chosen for annotation: names of files specified manually for annotation;
|
|
in this case none.</li><p>
|
|
|
|
<li>Auto-annotation: whether auto-annotation was requested via the
|
|
<code>--auto=yes</code> option. In this case no.</li><p>
|
|
</ul>
|
|
|
|
Then follows summary statistics for the whole program. These are similar
|
|
to the summary provided when running <code>valgrind --skin=cachegrind</code>.<p>
|
|
|
|
Then follows function-by-function statistics. Each function is
|
|
identified by a <code>file_name:function_name</code> pair. If a column
|
|
contains only a dot it means the function never performs
|
|
that event (eg. the third row shows that <code>strcmp()</code>
|
|
contains no instructions that write to memory). The name
|
|
<code>???</code> is used if the the file name and/or function name
|
|
could not be determined from debugging information. If most of the
|
|
entries have the form <code>???:???</code> the program probably wasn't
|
|
compiled with <code>-g</code>. If any code was invalidated (either due to
|
|
self-modifying code or unloading of shared objects) its counts are aggregated
|
|
into a single cost centre written as <code>(discarded):(discarded)</code>.<p>
|
|
|
|
It is worth noting that functions will come from three types of source files:
|
|
<ol>
|
|
<li> From the profiled program (<code>concord.c</code> in this example).</li>
|
|
<li>From libraries (eg. <code>getc.c</code>)</li>
|
|
<li>From Valgrind's implementation of some libc functions (eg.
|
|
<code>vg_clientmalloc.c:malloc</code>). These are recognisable because
|
|
the filename begins with <code>vg_</code>, and is probably one of
|
|
<code>vg_main.c</code>, <code>vg_clientmalloc.c</code> or
|
|
<code>vg_mylibc.c</code>.
|
|
</li>
|
|
</ol>
|
|
|
|
There are two ways to annotate source files -- by choosing them
|
|
manually, or with the <code>--auto=yes</code> option. To do it
|
|
manually, just specify the filenames as arguments to
|
|
<code>cg_annotate</code>. For example, the output from running
|
|
<code>cg_annotate concord.c</code> for our example produces the same
|
|
output as above followed by an annotated version of
|
|
<code>concord.c</code>, a section of which looks like:
|
|
|
|
<pre>
|
|
--------------------------------------------------------------------------------
|
|
-- User-annotated source: concord.c
|
|
--------------------------------------------------------------------------------
|
|
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
|
|
|
|
[snip]
|
|
|
|
. . . . . . . . . void init_hash_table(char *file_name, Word_Node *table[])
|
|
3 1 1 . . . 1 0 0 {
|
|
. . . . . . . . . FILE *file_ptr;
|
|
. . . . . . . . . Word_Info *data;
|
|
1 0 0 . . . 1 1 1 int line = 1, i;
|
|
. . . . . . . . .
|
|
5 0 0 . . . 3 0 0 data = (Word_Info *) create(sizeof(Word_Info));
|
|
. . . . . . . . .
|
|
4,991 0 0 1,995 0 0 998 0 0 for (i = 0; i < TABLE_SIZE; i++)
|
|
3,988 1 1 1,994 0 0 997 53 52 table[i] = NULL;
|
|
. . . . . . . . .
|
|
. . . . . . . . . /* Open file, check it. */
|
|
6 0 0 1 0 0 4 0 0 file_ptr = fopen(file_name, "r");
|
|
2 0 0 1 0 0 . . . if (!(file_ptr)) {
|
|
. . . . . . . . . fprintf(stderr, "Couldn't open '%s'.\n", file_name);
|
|
1 1 1 . . . . . . exit(EXIT_FAILURE);
|
|
. . . . . . . . . }
|
|
. . . . . . . . .
|
|
165,062 1 1 73,360 0 0 91,700 0 0 while ((line = get_word(data, line, file_ptr)) != EOF)
|
|
146,712 0 0 73,356 0 0 73,356 0 0 insert(data->;word, data->line, table);
|
|
. . . . . . . . .
|
|
4 0 0 1 0 0 2 0 0 free(data);
|
|
4 0 0 1 0 0 2 0 0 fclose(file_ptr);
|
|
3 0 0 2 0 0 . . . }
|
|
</pre>
|
|
|
|
(Although column widths are automatically minimised, a wide terminal is clearly
|
|
useful.)<p>
|
|
|
|
Each source file is clearly marked (<code>User-annotated source</code>) as
|
|
having been chosen manually for annotation. If the file was found in one of
|
|
the directories specified with the <code>-I</code>/<code>--include</code>
|
|
option, the directory and file are both given.<p>
|
|
|
|
Each line is annotated with its event counts. Events not applicable for a line
|
|
are represented by a `.'; this is useful for distinguishing between an event
|
|
which cannot happen, and one which can but did not.<p>
|
|
|
|
Sometimes only a small section of a source file is executed. To minimise
|
|
uninteresting output, Valgrind only shows annotated lines and lines within a
|
|
small distance of annotated lines. Gaps are marked with the line numbers so
|
|
you know which part of a file the shown code comes from, eg:
|
|
|
|
<pre>
|
|
(figures and code for line 704)
|
|
-- line 704 ----------------------------------------
|
|
-- line 878 ----------------------------------------
|
|
(figures and code for line 878)
|
|
</pre>
|
|
|
|
The amount of context to show around annotated lines is controlled by the
|
|
<code>--context</code> option.<p>
|
|
|
|
To get automatic annotation, run <code>cg_annotate --auto=yes</code>.
|
|
cg_annotate will automatically annotate every source file it can find that is
|
|
mentioned in the function-by-function summary. Therefore, the files chosen for
|
|
auto-annotation are affected by the <code>--sort</code> and
|
|
<code>--threshold</code> options. Each source file is clearly marked
|
|
(<code>Auto-annotated source</code>) as being chosen automatically. Any files
|
|
that could not be found are mentioned at the end of the output, eg:
|
|
|
|
<pre>
|
|
--------------------------------------------------------------------------------
|
|
The following files chosen for auto-annotation could not be found:
|
|
--------------------------------------------------------------------------------
|
|
getc.c
|
|
ctype.c
|
|
../sysdeps/generic/lockfile.c
|
|
</pre>
|
|
|
|
This is quite common for library files, since libraries are usually compiled
|
|
with debugging information, but the source files are often not present on a
|
|
system. If a file is chosen for annotation <b>both</b> manually and
|
|
automatically, it is marked as <code>User-annotated source</code>.
|
|
|
|
Use the <code>-I/--include</code> option to tell Valgrind where to look for
|
|
source files if the filenames found from the debugging information aren't
|
|
specific enough.
|
|
|
|
Beware that cg_annotate can take some time to digest large
|
|
<code>cachegrind.out.<i>pid</i></code> files, e.g. 30 seconds or more. Also
|
|
beware that auto-annotation can produce a lot of output if your program is
|
|
large!
|
|
|
|
|
|
<h3>4.8 Annotating assembler programs</h3>
|
|
|
|
Valgrind can annotate assembler programs too, or annotate the
|
|
assembler generated for your C program. Sometimes this is useful for
|
|
understanding what is really happening when an interesting line of C
|
|
code is translated into multiple instructions.<p>
|
|
|
|
To do this, you just need to assemble your <code>.s</code> files with
|
|
assembler-level debug information. gcc doesn't do this, but you can
|
|
use the GNU assembler with the <code>--gstabs</code> option to
|
|
generate object files with this information, eg:
|
|
|
|
<blockquote><code>as --gstabs foo.s</code></blockquote>
|
|
|
|
You can then profile and annotate source files in the same way as for C/C++
|
|
programs.
|
|
|
|
|
|
<h3>4.9 <code>cg_annotate</code> options</h3>
|
|
<ul>
|
|
<li><code>--<i>pid</i></code></li><p>
|
|
|
|
Indicates which <code>cachegrind.out.<i>pid</i></code> file to read.
|
|
Not actually an option -- it is required.
|
|
|
|
<li><code>-h, --help</code></li><p>
|
|
<li><code>-v, --version</code><p>
|
|
|
|
Help and version, as usual.</li>
|
|
|
|
<li><code>--sort=A,B,C</code> [default: order in
|
|
<code>cachegrind.out.<i>pid</i></code>]<p>
|
|
Specifies the events upon which the sorting of the function-by-function
|
|
entries will be based. Useful if you want to concentrate on eg. I cache
|
|
misses (<code>--sort=I1mr,I2mr</code>), or D cache misses
|
|
(<code>--sort=D1mr,D2mr</code>), or L2 misses
|
|
(<code>--sort=D2mr,I2mr</code>).</li><p>
|
|
|
|
<li><code>--show=A,B,C</code> [default: all, using order in
|
|
<code>cachegrind.out.<i>pid</i></code>]<p>
|
|
Specifies which events to show (and the column order). Default is to use
|
|
all present in the <code>cachegrind.out.<i>pid</i></code> file (and use
|
|
the order in the file).</li><p>
|
|
|
|
<li><code>--threshold=X</code> [default: 99%] <p>
|
|
Sets the threshold for the function-by-function summary. Functions are
|
|
shown that account for more than X% of the primary sort event. If
|
|
auto-annotating, also affects which files are annotated.
|
|
|
|
Note: thresholds can be set for more than one of the events by appending
|
|
any events for the <code>--sort</code> option with a colon and a number
|
|
(no spaces, though). E.g. if you want to see the functions that cover
|
|
99% of L2 read misses and 99% of L2 write misses, use this option:
|
|
|
|
<blockquote><code>--sort=D2mr:99,D2mw:99</code></blockquote>
|
|
</li><p>
|
|
|
|
<li><code>--auto=no</code> [default]<br>
|
|
<code>--auto=yes</code> <p>
|
|
When enabled, automatically annotates every file that is mentioned in the
|
|
function-by-function summary that can be found. Also gives a list of
|
|
those that couldn't be found.
|
|
|
|
<li><code>--context=N</code> [default: 8]<p>
|
|
Print N lines of context before and after each annotated line. Avoids
|
|
printing large sections of source files that were not executed. Use a
|
|
large number (eg. 10,000) to show all source lines.
|
|
</li><p>
|
|
|
|
<li><code>-I=<dir>, --include=<dir></code>
|
|
[default: empty string]<p>
|
|
Adds a directory to the list in which to search for files. Multiple
|
|
-I/--include options can be given to add multiple directories.
|
|
</ul>
|
|
|
|
|
|
<h3>4.10 Warnings</h3>
|
|
There are a couple of situations in which cg_annotate issues warnings.
|
|
|
|
<ul>
|
|
<li>If a source file is more recent than the
|
|
<code>cachegrind.out.<i>pid</i></code> file. This is because the
|
|
information in <code>cachegrind.out.<i>pid</i></code> is only recorded
|
|
with line numbers, so if the line numbers change at all in the source
|
|
(eg. lines added, deleted, swapped), any annotations will be
|
|
incorrect.<p>
|
|
|
|
<li>If information is recorded about line numbers past the end of a file.
|
|
This can be caused by the above problem, ie. shortening the source file
|
|
while using an old <code>cachegrind.out.<i>pid</i></code> file. If this
|
|
happens, the figures for the bogus lines are printed anyway (clearly
|
|
marked as bogus) in case they are important.</li><p>
|
|
</ul>
|
|
|
|
|
|
<h3>4.11 Things to watch out for</h3>
|
|
Some odd things that can occur during annotation:
|
|
|
|
<ul>
|
|
<li>If annotating at the assembler level, you might see something like this:
|
|
|
|
<pre>
|
|
1 0 0 . . . . . . leal -12(%ebp),%eax
|
|
1 0 0 . . . 1 0 0 movl %eax,84(%ebx)
|
|
2 0 0 0 0 0 1 0 0 movl $1,-20(%ebp)
|
|
. . . . . . . . . .align 4,0x90
|
|
1 0 0 . . . . . . movl $.LnrB,%eax
|
|
1 0 0 . . . 1 0 0 movl %eax,-16(%ebp)
|
|
</pre>
|
|
|
|
How can the third instruction be executed twice when the others are
|
|
executed only once? As it turns out, it isn't. Here's a dump of the
|
|
executable, using <code>objdump -d</code>:
|
|
|
|
<pre>
|
|
8048f25: 8d 45 f4 lea 0xfffffff4(%ebp),%eax
|
|
8048f28: 89 43 54 mov %eax,0x54(%ebx)
|
|
8048f2b: c7 45 ec 01 00 00 00 movl $0x1,0xffffffec(%ebp)
|
|
8048f32: 89 f6 mov %esi,%esi
|
|
8048f34: b8 08 8b 07 08 mov $0x8078b08,%eax
|
|
8048f39: 89 45 f0 mov %eax,0xfffffff0(%ebp)
|
|
</pre>
|
|
|
|
Notice the extra <code>mov %esi,%esi</code> instruction. Where did this
|
|
come from? The GNU assembler inserted it to serve as the two bytes of
|
|
padding needed to align the <code>movl $.LnrB,%eax</code> instruction on
|
|
a four-byte boundary, but pretended it didn't exist when adding debug
|
|
information. Thus when Valgrind reads the debug info it thinks that the
|
|
<code>movl $0x1,0xffffffec(%ebp)</code> instruction covers the address
|
|
range 0x8048f2b--0x804833 by itself, and attributes the counts for the
|
|
<code>mov %esi,%esi</code> to it.<p>
|
|
</li>
|
|
|
|
<li>Inlined functions can cause strange results in the function-by-function
|
|
summary. If a function <code>inline_me()</code> is defined in
|
|
<code>foo.h</code> and inlined in the functions <code>f1()</code>,
|
|
<code>f2()</code> and <code>f3()</code> in <code>bar.c</code>, there will
|
|
not be a <code>foo.h:inline_me()</code> function entry. Instead, there
|
|
will be separate function entries for each inlining site, ie.
|
|
<code>foo.h:f1()</code>, <code>foo.h:f2()</code> and
|
|
<code>foo.h:f3()</code>. To find the total counts for
|
|
<code>foo.h:inline_me()</code>, add up the counts from each entry.<p>
|
|
|
|
The reason for this is that although the debug info output by gcc
|
|
indicates the switch from <code>bar.c</code> to <code>foo.h</code>, it
|
|
doesn't indicate the name of the function in <code>foo.h</code>, so
|
|
Valgrind keeps using the old one.<p>
|
|
|
|
<li>Sometimes, the same filename might be represented with a relative name
|
|
and with an absolute name in different parts of the debug info, eg:
|
|
<code>/home/user/proj/proj.h</code> and <code>../proj.h</code>. In this
|
|
case, if you use auto-annotation, the file will be annotated twice with
|
|
the counts split between the two.<p>
|
|
</li>
|
|
|
|
<li>Files with more than 65,535 lines cause difficulties for the stabs debug
|
|
info reader. This is because the line number in the <code>struct
|
|
nlist</code> defined in <code>a.out.h</code> under Linux is only a 16-bit
|
|
value. Valgrind can handle some files with more than 65,535 lines
|
|
correctly by making some guesses to identify line number overflows. But
|
|
some cases are beyond it, in which case you'll get a warning message
|
|
explaining that annotations for the file might be incorrect.<p>
|
|
</li>
|
|
|
|
<li>If you compile some files with <code>-g</code> and some without, some
|
|
events that take place in a file without debug info could be attributed
|
|
to the last line of a file with debug info (whichever one gets placed
|
|
before the non-debug-info file in the executable).<p>
|
|
</li>
|
|
</ul>
|
|
|
|
This list looks long, but these cases should be fairly rare.<p>
|
|
|
|
Note: stabs is not an easy format to read. If you come across bizarre
|
|
annotations that look like might be caused by a bug in the stabs reader,
|
|
please let us know.<p>
|
|
|
|
|
|
<h3>4.12 Accuracy</h3>
|
|
Valgrind's cache profiling has a number of shortcomings:
|
|
|
|
<ul>
|
|
<li>It doesn't account for kernel activity -- the effect of system calls on
|
|
the cache contents is ignored.</li><p>
|
|
|
|
<li>It doesn't account for other process activity (although this is probably
|
|
desirable when considering a single program).</li><p>
|
|
|
|
<li>It doesn't account for virtual-to-physical address mappings; hence the
|
|
entire simulation is not a true representation of what's happening in the
|
|
cache.</li><p>
|
|
|
|
<li>It doesn't account for cache misses not visible at the instruction level,
|
|
eg. those arising from TLB misses, or speculative execution.</li><p>
|
|
|
|
<li>Valgrind's custom <code>malloc()</code> will allocate memory in different
|
|
ways to the standard <code>malloc()</code>, which could warp the results.
|
|
</li><p>
|
|
|
|
<li>Valgrind's custom threads implementation will schedule threads
|
|
differently to the standard one. This too could warp the results for
|
|
threaded programs.
|
|
</li><p>
|
|
|
|
<li>The instructions <code>bts</code>, <code>btr</code> and <code>btc</code>
|
|
will incorrectly be counted as doing a data read if both the arguments
|
|
are registers, eg:
|
|
|
|
<blockquote><code>btsl %eax, %edx</code></blockquote>
|
|
|
|
This should only happen rarely.
|
|
</li><p>
|
|
|
|
<li>FPU instructions with data sizes of 28 and 108 bytes (e.g.
|
|
<code>fsave</code>) are treated as though they only access 16 bytes.
|
|
These instructions seem to be rare so hopefully this won't affect
|
|
accuracy much.
|
|
</li><p>
|
|
</ul>
|
|
|
|
Another thing worth nothing is that results are very sensitive. Changing the
|
|
size of the <code>valgrind.so</code> file, the size of the program being
|
|
profiled, or even the length of its name can perturb the results. Variations
|
|
will be small, but don't expect perfectly repeatable results if your program
|
|
changes at all.<p>
|
|
|
|
While these factors mean you shouldn't trust the results to be super-accurate,
|
|
hopefully they should be close enough to be useful.<p>
|
|
|
|
|
|
<h3>4.13 Todo</h3>
|
|
<ul>
|
|
<li>Program start-up/shut-down calls a lot of functions that aren't
|
|
interesting and just complicate the output. Would be nice to exclude
|
|
these somehow.</li>
|
|
<p>
|
|
</ul>
|
|
</body>
|
|
</html>
|
|
|