From 5d93498d4d296e2ef048e4ced96412e24d83fc16 Mon Sep 17 00:00:00 2001 From: Julian Seward Date: Mon, 11 Nov 2002 00:20:07 +0000 Subject: [PATCH] Add documentation back in, in its new form. Still all very rough and totally borked, but pretty much all the duplication is gone, and there is a good start on a common core section in coregrind/coregrind_core.html. At least I know where I'm going with all this now. The Makefile.am's need to be fixed up. Basic idea is that, when put together in a single directory, these files make a coherent manual, starting at manual.html. Fortunately :-) "make install" does exactly that -- copies them to a single directory. After redundancy removal, there's more than 38000 words of documentation here, according to wc. Amazing. git-svn-id: svn://svn.valgrind.org/valgrind/trunk@1284 --- addrcheck/ac_main.html | 10 + cachegrind/cg_main.html | 752 ++++++++++++ cachegrind/cg_techdocs.html | 461 +++++++ coregrind/coregrind_core.html | 1270 +++++++++++++++++++ coregrind/coregrind_intro.html | 176 +++ coregrind/coregrind_skins.html | 687 +++++++++++ docs/manual.html | 92 ++ helgrind/hg_main.html | 80 ++ lackey/lk_main.html | 68 + memcheck/mc_main.html | 830 +++++++++++++ memcheck/mc_techdocs.html | 2113 ++++++++++++++++++++++++++++++++ none/nl_main.html | 57 + 12 files changed, 6596 insertions(+) create mode 100644 addrcheck/ac_main.html create mode 100644 cachegrind/cg_main.html create mode 100644 cachegrind/cg_techdocs.html create mode 100644 coregrind/coregrind_core.html create mode 100644 coregrind/coregrind_intro.html create mode 100644 coregrind/coregrind_skins.html create mode 100644 docs/manual.html create mode 100644 helgrind/hg_main.html create mode 100644 lackey/lk_main.html create mode 100644 memcheck/mc_main.html create mode 100644 memcheck/mc_techdocs.html create mode 100644 none/nl_main.html diff --git a/addrcheck/ac_main.html b/addrcheck/ac_main.html new file mode 100644 index 000000000..7aa2e9a52 --- /dev/null +++ 
b/addrcheck/ac_main.html @@ -0,0 +1,10 @@ + + + AddrCheck + + + +(no docs yet, sorry) + + + diff --git a/cachegrind/cg_main.html b/cachegrind/cg_main.html new file mode 100644 index 000000000..85462560e --- /dev/null +++ b/cachegrind/cg_main.html @@ -0,0 +1,752 @@ + + + + Cachegrind + + + + +  +

Cachegrind, version 1.0.0

+
This manual was last updated on 20020726
+

+ +

+jseward@acm.org
+Copyright © 2000-2002 Julian Seward +

+Cachegrind is licensed under the GNU General Public License, +version 2
+An open-source cache-profiling tool for
+Linux-x86 executables.
+

+ +

+ +


+ +

Contents of this manual

+ +

How to use Cachegrind

+ +

How Cachegrind works

+ +
+ + + +

1  Cache profiling

+Cachegrind is a tool for doing cache simulations and annotating your source
+line-by-line with the number of cache misses.  In particular, it records:
+
  1. L1 instruction cache reads and misses;
  2. L1 data cache reads and read misses, writes and write misses;
  3. L2 unified cache reads and read misses, writes and write misses.
+
+On a modern x86 machine, an L1 miss will typically cost around 10 cycles,
+and an L2 miss can cost as much as 200 cycles.  Detailed cache profiling can be
+very useful for improving the performance of your program.

+ +Also, since one instruction cache read is performed per instruction executed, +you can find out how many instructions are executed per line, which can be +useful for traditional profiling and test coverage.

+ +Any feedback, bug-fixes, suggestions, etc, welcome. + + +

1.1  Overview

+First off, as for normal Valgrind use, you probably want to compile with +debugging info (the -g flag). But by contrast with normal +Valgrind use, you probably do want to turn optimisation on, since you +should profile your program as it will be normally run. + +The two steps are: +
    +
  1. Run your program with valgrind --skin=cachegrind in front of + the normal command line invocation. When the program finishes, + Valgrind will print summary cache statistics. It also collects + line-by-line information in a file + cachegrind.out.pid, where pid + is the program's process id. +

    + This step should be done every time you want to collect + information about a new program, a changed program, or about the + same program with different input. +

  2. Generate a function-by-function summary, and possibly annotate
+ source files with 'cg_annotate'.  Source files to annotate can be
+ specified manually on the command line, or
+ "interesting" source files can be annotated automatically with
+ the --auto=yes option.  You can annotate C/C++
+ files or assembly language files equally easily.
+
+ This step can be performed as many times as you like for each
+ Step 1.  You may want to do multiple annotations showing
+ different information each time.
+ +The steps are described in detail in the following sections.

+ + +

1.2  Cache simulation specifics

+ 
+Cachegrind simulates a machine with a split L1 cache and a unified
+L2 cache.  This configuration matches all (modern) x86-based machines we
+are aware of.  Old Cyrix CPUs had a unified I and D L1 cache, but they are
+ancient history now.

+ +The more specific characteristics of the simulation are as follows. + +

+ +The cache configuration simulated (cache size, associativity and line size) is +determined automagically using the CPUID instruction. If you have an old +machine that (a) doesn't support the CPUID instruction, or (b) supports it in +an early incarnation that doesn't give any cache information, then Cachegrind +will fall back to using a default configuration (that of a model 3/4 Athlon). +Cachegrind will tell you if this happens. You can manually specify one, two or +all three levels (I1/D1/L2) of the cache from the command line using the +--I1, --D1 and --L2 options.

+ +Other noteworthy behaviour: + +

+ +If you are interested in simulating a cache with different properties, it is +not particularly hard to write your own cache simulator, or to modify the +existing ones in vg_cachesim_I1.c, vg_cachesim_D1.c, +vg_cachesim_L2.c and vg_cachesim_gen.c. We'd be +interested to hear from anyone who does. + + +

1.3  Profiling programs

+ +Cache profiling is enabled by using the --skin=cachegrind +option to the valgrind shell script. To gather cache profiling +information about the program ls -l, type: + +
valgrind --skin=cachegrind ls -l
+ +The program will execute (slowly). Upon completion, summary statistics +that look like this will be printed: + +
+==31751== I   refs:      27,742,716
+==31751== I1  misses:           276
+==31751== L2  misses:           275
+==31751== I1  miss rate:        0.0%
+==31751== L2i miss rate:        0.0%
+==31751== 
+==31751== D   refs:      15,430,290  (10,955,517 rd + 4,474,773 wr)
+==31751== D1  misses:        41,185  (    21,905 rd +    19,280 wr)
+==31751== L2  misses:        23,085  (     3,987 rd +    19,098 wr)
+==31751== D1  miss rate:        0.2% (       0.1%   +       0.4%)
+==31751== L2d miss rate:        0.1% (       0.0%   +       0.4%)
+==31751== 
+==31751== L2 misses:         23,360  (     4,262 rd +    19,098 wr)
+==31751== L2 miss rate:         0.0% (       0.0%   +       0.4%)
+
+ +Cache accesses for instruction fetches are summarised first, giving the +number of fetches made (this is the number of instructions executed, which +can be useful to know in its own right), the number of I1 misses, and the +number of L2 instruction (L2i) misses.

+ +Cache accesses for data follow. The information is similar to that of the +instruction fetches, except that the values are also shown split between reads +and writes (note each row's rd and wr values add up +to the row's total).

+ +Combined instruction and data figures for the L2 cache follow that.

+ + +

1.4  Output file

+ +As well as printing summary information, Cachegrind also writes +line-by-line cache profiling information to a file named +cachegrind.out.pid. This file is human-readable, but is +best interpreted by the accompanying program cg_annotate, +described in the next section. +

+Things to note about the cachegrind.out.pid file: +

+ +Note that older versions of Cachegrind used a log file named +cachegrind.out (i.e. no .pid suffix). +The suffix serves two purposes. Firstly, it means you don't have to rename old +log files that you don't want to overwrite. Secondly, and more importantly, +it allows correct profiling with the --trace-children=yes option +of programs that spawn child processes. + + +

1.5  Cachegrind options

+Cachegrind accepts all the options that Valgrind does, although some of them +(ones related to memory checking) don't do anything when cache profiling.

+ +The interesting cache-simulation specific options are: + +

+ + + +

1.6  Annotating C/C++ programs

+ 
+Before using cg_annotate, it is worth widening your
+window to be at least 120 characters wide if possible, as the output
+lines can be quite long.
+

+To get a function-by-function summary, run cg_annotate +--pid in a directory containing a +cachegrind.out.pid file. The --pid +is required so that cg_annotate knows which log file to use when +several are present. +

+The output looks like this: + +

+--------------------------------------------------------------------------------
+I1 cache:              65536 B, 64 B, 2-way associative
+D1 cache:              65536 B, 64 B, 2-way associative
+L2 cache:              262144 B, 64 B, 8-way associative
+Command:               concord vg_to_ucode.c
+Events recorded:       Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
+Events shown:          Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
+Event sort order:      Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
+Threshold:             99%
+Chosen for annotation:
+Auto-annotation:       on
+
+--------------------------------------------------------------------------------
+Ir         I1mr I2mr Dr         D1mr   D2mr  Dw        D1mw   D2mw
+--------------------------------------------------------------------------------
+27,742,716  276  275 10,955,517 21,905 3,987 4,474,773 19,280 19,098  PROGRAM TOTALS
+
+--------------------------------------------------------------------------------
+Ir        I1mr I2mr Dr        D1mr  D2mr  Dw        D1mw   D2mw    file:function
+--------------------------------------------------------------------------------
+8,821,482    5    5 2,242,702 1,621    73 1,794,230      0      0  getc.c:_IO_getc
+5,222,023    4    4 2,276,334    16    12   875,959      1      1  concord.c:get_word
+2,649,248    2    2 1,344,810 7,326 1,385         .      .      .  vg_main.c:strcmp
+2,521,927    2    2   591,215     0     0   179,398      0      0  concord.c:hash
+2,242,740    2    2 1,046,612   568    22   448,548      0      0  ctype.c:tolower
+1,496,937    4    4   630,874 9,000 1,400   279,388      0      0  concord.c:insert
+  897,991   51   51   897,831    95    30        62      1      1  ???:???
+  598,068    1    1   299,034     0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__flockfile
+  598,068    0    0   299,034     0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__funlockfile
+  598,024    4    4   213,580    35    16   149,506      0      0  vg_clientmalloc.c:malloc
+  446,587    1    1   215,973 2,167   430   129,948 14,057 13,957  concord.c:add_existing
+  341,760    2    2   128,160     0     0   128,160      0      0  vg_clientmalloc.c:vg_trap_here_WRAPPER
+  320,782    4    4   150,711   276     0    56,027     53     53  concord.c:init_hash_table
+  298,998    1    1   106,785     0     0    64,071      1      1  concord.c:create
+  149,518    0    0   149,516     0     0         1      0      0  ???:tolower@@GLIBC_2.0
+  149,518    0    0   149,516     0     0         1      0      0  ???:fgetc@@GLIBC_2.0
+   95,983    4    4    38,031     0     0    34,409  3,152  3,150  concord.c:new_word_node
+   85,440    0    0    42,720     0     0    21,360      0      0  vg_clientmalloc.c:vg_bogus_epilogue
+
+ +First up is a summary of the annotation options: + + + +Then follows summary statistics for the whole program. These are similar +to the summary provided when running valgrind --skin=cachegrind.

+ 
+Then follows function-by-function statistics.  Each function is
+identified by a file_name:function_name pair.  If a column
+contains only a dot it means the function never performs
+that event (eg. the third row shows that strcmp()
+contains no instructions that write to memory).  The name
+??? is used if the file name and/or function name
+could not be determined from debugging information.  If most of the
+entries have the form ???:??? the program probably wasn't
+compiled with -g.  If any code was invalidated (either due to
+self-modifying code or unloading of shared objects) its counts are aggregated
+into a single cost centre written as (discarded):(discarded).

+ +It is worth noting that functions will come from three types of source files: +

    +
  1. From the profiled program (concord.c in this example).

  2. From libraries (eg. getc.c)

  3. From Valgrind's implementation of some libc functions (eg.
+ vg_clientmalloc.c:malloc).  These are recognisable because
+ the filename begins with vg_, and is probably one of
+ vg_main.c, vg_clientmalloc.c or
+ vg_mylibc.c.
+ +There are two ways to annotate source files -- by choosing them +manually, or with the --auto=yes option. To do it +manually, just specify the filenames as arguments to +cg_annotate. For example, the output from running +cg_annotate concord.c for our example produces the same +output as above followed by an annotated version of +concord.c, a section of which looks like: + +
+--------------------------------------------------------------------------------
+-- User-annotated source: concord.c
+--------------------------------------------------------------------------------
+Ir        I1mr I2mr Dr      D1mr  D2mr  Dw      D1mw   D2mw
+
+[snip]
+
+        .    .    .       .     .     .       .      .      .  void init_hash_table(char *file_name, Word_Node *table[])
+        3    1    1       .     .     .       1      0      0  {
+        .    .    .       .     .     .       .      .      .      FILE *file_ptr;
+        .    .    .       .     .     .       .      .      .      Word_Info *data;
+        1    0    0       .     .     .       1      1      1      int line = 1, i;
+        .    .    .       .     .     .       .      .      .
+        5    0    0       .     .     .       3      0      0      data = (Word_Info *) create(sizeof(Word_Info));
+        .    .    .       .     .     .       .      .      .
+    4,991    0    0   1,995     0     0     998      0      0      for (i = 0; i < TABLE_SIZE; i++)
+    3,988    1    1   1,994     0     0     997     53     52          table[i] = NULL;
+        .    .    .       .     .     .       .      .      .
+        .    .    .       .     .     .       .      .      .      /* Open file, check it. */
+        6    0    0       1     0     0       4      0      0      file_ptr = fopen(file_name, "r");
+        2    0    0       1     0     0       .      .      .      if (!(file_ptr)) {
+        .    .    .       .     .     .       .      .      .          fprintf(stderr, "Couldn't open '%s'.\n", file_name);
+        1    1    1       .     .     .       .      .      .          exit(EXIT_FAILURE);
+        .    .    .       .     .     .       .      .      .      }
+        .    .    .       .     .     .       .      .      .
+  165,062    1    1  73,360     0     0  91,700      0      0      while ((line = get_word(data, line, file_ptr)) != EOF)
+  146,712    0    0   73,356     0     0  73,356      0      0          insert(data->word, data->line, table);
+        .    .    .       .     .     .       .      .      .
+        4    0    0       1     0     0       2      0      0      free(data);
+        4    0    0       1     0     0       2      0      0      fclose(file_ptr);
+        3    0    0       2     0     0       .      .      .  }
+
+ +(Although column widths are automatically minimised, a wide terminal is clearly +useful.)

+ +Each source file is clearly marked (User-annotated source) as +having been chosen manually for annotation. If the file was found in one of +the directories specified with the -I/--include +option, the directory and file are both given.

+ +Each line is annotated with its event counts. Events not applicable for a line +are represented by a `.'; this is useful for distinguishing between an event +which cannot happen, and one which can but did not.

+ +Sometimes only a small section of a source file is executed. To minimise +uninteresting output, Valgrind only shows annotated lines and lines within a +small distance of annotated lines. Gaps are marked with the line numbers so +you know which part of a file the shown code comes from, eg: + +

+(figures and code for line 704)
+-- line 704 ----------------------------------------
+-- line 878 ----------------------------------------
+(figures and code for line 878)
+
+ +The amount of context to show around annotated lines is controlled by the +--context option.

+ +To get automatic annotation, run cg_annotate --auto=yes. +cg_annotate will automatically annotate every source file it can find that is +mentioned in the function-by-function summary. Therefore, the files chosen for +auto-annotation are affected by the --sort and +--threshold options. Each source file is clearly marked +(Auto-annotated source) as being chosen automatically. Any files +that could not be found are mentioned at the end of the output, eg: + +

+--------------------------------------------------------------------------------
+The following files chosen for auto-annotation could not be found:
+--------------------------------------------------------------------------------
+  getc.c
+  ctype.c
+  ../sysdeps/generic/lockfile.c
+
+ +This is quite common for library files, since libraries are usually compiled +with debugging information, but the source files are often not present on a +system. If a file is chosen for annotation both manually and +automatically, it is marked as User-annotated source. + +Use the -I/--include option to tell Valgrind where to look for +source files if the filenames found from the debugging information aren't +specific enough. + +Beware that cg_annotate can take some time to digest large +cachegrind.out.pid files, e.g. 30 seconds or more. Also +beware that auto-annotation can produce a lot of output if your program is +large! + + +

1.7  Annotating assembler programs

+ +Valgrind can annotate assembler programs too, or annotate the +assembler generated for your C program. Sometimes this is useful for +understanding what is really happening when an interesting line of C +code is translated into multiple instructions.

+ +To do this, you just need to assemble your .s files with +assembler-level debug information. gcc doesn't do this, but you can +use the GNU assembler with the --gstabs option to +generate object files with this information, eg: + +

as --gstabs foo.s
+ +You can then profile and annotate source files in the same way as for C/C++ +programs. + + +

1.8  cg_annotate options

+ + + +

1.9  Warnings

+There are a couple of situations in which cg_annotate issues warnings. + + + + +

1.10  Things to watch out for

+Some odd things that can occur during annotation: + + + +This list looks long, but these cases should be fairly rare.

+ 
+Note: stabs is not an easy format to read.  If you come across bizarre
+annotations that look like they might be caused by a bug in the stabs reader,
+please let us know.

+ + +

1.11  Accuracy

+Valgrind's cache profiling has a number of shortcomings:
+
+
+Another thing worth noting is that results are very sensitive.  Changing the
+size of the valgrind.so file, the size of the program being
+profiled, or even the length of its name can perturb the results.  Variations
+will be small, but don't expect perfectly repeatable results if your program
+changes at all.

+ +While these factors mean you shouldn't trust the results to be super-accurate, +hopefully they should be close enough to be useful.

+ + +

1.12  Todo

+ +
+ + + diff --git a/cachegrind/cg_techdocs.html b/cachegrind/cg_techdocs.html new file mode 100644 index 000000000..3375ef066 --- /dev/null +++ b/cachegrind/cg_techdocs.html @@ -0,0 +1,461 @@ + + + + The design and implementation of Valgrind + + + + +  +

How Cachegrind works

+ +
+Detailed technical notes for hackers, maintainers and the +overly-curious
+These notes pertain to snapshot 20020306
+

+jseward@acm.org
+
http://developer.kde.org/~sewardj
+Copyright © 2000-2002 Julian Seward +

+Valgrind is licensed under the GNU General Public License, +version 2
+An open-source tool for finding memory-management problems in +x86 GNU/Linux executables. +

+ +

+ + + + +


+ +

Cache profiling

+Valgrind is a very nice platform for doing cache profiling and other kinds of +simulation, because it converts horrible x86 instructions into nice clean +RISC-like UCode. For example, for cache profiling we are interested in +instructions that read and write memory; in UCode there are only four +instructions that do this: LOAD, STORE, +FPU_R and FPU_W. By contrast, because of the x86 +addressing modes, almost every instruction can read or write memory.

+ +Most of the cache profiling machinery is in the file +vg_cachesim.c.

+ +These notes are a somewhat haphazard guide to how Valgrind's cache profiling +works.

+ +

Cost centres

+Valgrind gathers cache profiling information about every instruction executed,
+individually.  Each instruction has a cost centre associated with it.
+There are two kinds of cost centre: one for instructions that don't reference
+memory (iCC), and one for instructions that do
+(idCC):
+
+typedef struct _CC {
+   ULong a;
+   ULong m1;
+   ULong m2;
+} CC;
+
+typedef struct _iCC {
+   /* word 1 */
+   UChar tag;
+   UChar instr_size;
+
+   /* words 2+ */
+   Addr instr_addr;
+   CC I;
+} iCC;
+   
+typedef struct _idCC {
+   /* word 1 */
+   UChar tag;
+   UChar instr_size;
+   UChar data_size;
+
+   /* words 2+ */
+   Addr instr_addr;
+   CC I; 
+   CC D; 
+} idCC; 
+
+ 
+Each CC has three fields a, m1,
+m2 for recording references, level 1 misses and level 2 misses.
+Each of these is a 64-bit ULong -- the numbers can get very large,
+ie. greater than the 4.2 billion allowed by a 32-bit unsigned int.

+ 
+An iCC has one CC for instruction cache accesses.  An
+idCC has two, one for instruction cache accesses, and one for data
+cache accesses.

+ 
+The iCC and idCC structs also store unchanging
+information about the instruction:

+ 
+Note that the data address is not one of the fields for idCC.  This is
+because for many memory-referencing instructions the data address can change
+each time it's executed (eg. if it uses register-offset addressing).  We have
+to give this item to the cache simulation in a different way (see the
+Instrumentation section below).  Some memory-referencing instructions do always
+reference the same address, but we don't try to treat them specially in order to
+keep things simple.

+ +Also note that there is only room for recording info about one data cache +access in an idCC. So what about instructions that do a read then +a write, such as: + +

incl (%esi)
+ +In a write-allocate cache, as simulated by Valgrind, the write cannot miss, +since it immediately follows the read which will drag the block into the cache +if it's not already there. So the write access isn't really interesting, and +Valgrind doesn't record it. This means that Valgrind doesn't measure +memory references, but rather memory references that could miss in the cache. +This behaviour is the same as that used by the AMD Athlon hardware counters. +It also has the benefit of simplifying the implementation -- instructions that +read and write memory can be treated like instructions that read memory.

+ +

Storing cost-centres

+Cost centres are stored in a way that makes them very cheap to look up, which is
+important since one is looked up for every original x86 instruction
+executed.

+ 
+Valgrind does JIT translations at the basic block level, and cost centres are
+also set up and stored at the basic block level.  By doing things carefully, we
+store all the cost centres for a basic block in a contiguous array, and lookup
+comes almost for free.

+ +Consider this part of a basic block (for exposition purposes, pretend it's an +entire basic block): + +

+movl $0x0,%eax
+movl $0x99, -4(%ebp)
+
+ +The translation to UCode looks like this: + +
+MOVL      $0x0, t20
+PUTL      t20, %EAX
+INCEIPo   $5
+
+LEA1L     -4(t4), t14
+MOVL      $0x99, t18
+STL       t18, (t14)
+INCEIPo   $7
+
+ +The first step is to allocate the cost centres. This requires a preliminary +pass to count how many x86 instructions were in the basic block, and their +types (and thus sizes). UCode translations for single x86 instructions are +delimited by the INCEIPo instruction, the argument of which gives +the byte size of the instruction (note that lazy INCEIP updating is turned off +to allow this).

+ +We can tell if an x86 instruction references memory by looking for +LDL and STL UCode instructions, and thus what kind of +cost centre is required. From this we can determine how many cost centres we +need for the basic block, and their sizes. We can then allocate them in a +single array.

+ 
+Consider the example code above.  After the preliminary pass, we know we need
+two cost centres, one iCC and one idCC.  So we
+allocate an array to store these, which looks like this:

+|(uninit)|      tag         (1 byte)
+|(uninit)|      instr_size  (1 byte)
+|(uninit)|      (padding)   (2 bytes)
+|(uninit)|      instr_addr  (4 bytes)
+|(uninit)|      I.a         (8 bytes)
+|(uninit)|      I.m1        (8 bytes)
+|(uninit)|      I.m2        (8 bytes)
+
+|(uninit)|      tag         (1 byte)
+|(uninit)|      instr_size  (1 byte)
+|(uninit)|      data_size   (1 byte)
+|(uninit)|      (padding)   (1 byte)
+|(uninit)|      instr_addr  (4 bytes)
+|(uninit)|      I.a         (8 bytes)
+|(uninit)|      I.m1        (8 bytes)
+|(uninit)|      I.m2        (8 bytes)
+|(uninit)|      D.a         (8 bytes)
+|(uninit)|      D.m1        (8 bytes)
+|(uninit)|      D.m2        (8 bytes)
+
+ +(We can see now why we need tags to distinguish between the two types of cost +centres.)

+ +We also record the size of the array. We look up the debug info of the first +instruction in the basic block, and then stick the array into a table indexed +by filename and function name. This makes it easy to dump the information +quickly to file at the end.

+ +

Instrumentation

+The instrumentation pass has two main jobs: + +
    +
  1. Fill in the gaps in the allocated cost centres.

  2. Add UCode to call the cache simulator for each instruction.

+ 
+The instrumentation pass steps through the UCode and the cost centres in
+tandem.  As each original x86 instruction's UCode is processed, the appropriate
+gaps in the instruction's cost centre are filled in, for example:
+|INSTR_CC|      tag         (1 byte)
+|5       |      instr_size  (1 byte)
+|(uninit)|      (padding)   (2 bytes)
+|i_addr1 |      instr_addr  (4 bytes)
+|0       |      I.a         (8 bytes)
+|0       |      I.m1        (8 bytes)
+|0       |      I.m2        (8 bytes)
+
+|WRITE_CC|      tag         (1 byte)
+|7       |      instr_size  (1 byte)
+|4       |      data_size   (1 byte)
+|(uninit)|      (padding)   (1 byte)
+|i_addr2 |      instr_addr  (4 bytes)
+|0       |      I.a         (8 bytes)
+|0       |      I.m1        (8 bytes)
+|0       |      I.m2        (8 bytes)
+|0       |      D.a         (8 bytes)
+|0       |      D.m1        (8 bytes)
+|0       |      D.m2        (8 bytes)
+
+ 
+(Note that this step is not performed if a basic block is re-translated; see
+the section on basic block retranslations below for more information.)

+ 
+GCC inserts padding before the instr_addr field so that it is word
+aligned.

+ +The instrumentation added to call the cache simulation function looks like this +(instrumentation is indented to distinguish it from the original UCode): + +

+MOVL      $0x0, t20
+PUTL      t20, %EAX
+  PUSHL     %eax
+  PUSHL     %ecx
+  PUSHL     %edx
+  MOVL      $0x4091F8A4, t46  # address of 1st CC
+  PUSHL     t46
+  CALLMo    $0x12             # first cachesim function
+  CLEARo    $0x4
+  POPL      %edx
+  POPL      %ecx
+  POPL      %eax
+INCEIPo   $5
+
+LEA1L     -4(t4), t14
+MOVL      $0x99, t18
+  MOVL      t14, t42
+STL       t18, (t14)
+  PUSHL     %eax
+  PUSHL     %ecx
+  PUSHL     %edx
+  PUSHL     t42
+  MOVL      $0x4091F8C4, t44  # address of 2nd CC
+  PUSHL     t44
+  CALLMo    $0x13             # second cachesim function
+  CLEARo    $0x8
+  POPL      %edx
+  POPL      %ecx
+  POPL      %eax
+INCEIPo   $7
+
+ +Consider the first instruction's UCode. Each call is surrounded by three +PUSHL and POPL instructions to save and restore the +caller-save registers. Then the address of the instruction's cost centre is +pushed onto the stack, to be the first argument to the cache simulation +function. The address is known at this point because we are doing a +simultaneous pass through the cost centre array. This means the cost centre +lookup for each instruction is almost free (just the cost of pushing an +argument for a function call). Then the call to the cache simulation function +for non-memory-reference instructions is made (note that the +CALLMo UInstruction takes an offset into a table of predefined +functions; it is not an absolute address), and the single argument is +CLEARed from the stack.

+ +The second instruction's UCode is similar. The only difference is that, as +mentioned before, we have to pass the address of the data item referenced to +the cache simulation function too. This explains the MOVL t14, +t42 and PUSHL t42 UInstructions. (Note that the seemingly +redundant MOVing will probably be optimised away during register +allocation.)

+ +Note that instead of storing unchanging information about each instruction +(instruction size, data size, etc) in its cost centre, we could have passed in +these arguments to the simulation function. But this would slow the calls down +(two or three extra arguments pushed onto the stack). Also it would bloat the +UCode instrumentation by amounts similar to the space required for them in the +cost centre; bloated UCode would also fill the translation cache more quickly, +requiring more translations for large programs and slowing them down more.

+ + +

Handling basic block retranslations

+The above description ignores one complication. Valgrind has a limited size +cache for basic block translations; if it fills up, old translations are +discarded. If a discarded basic block is executed again, it must be +re-translated.

+ +However, we can't use this approach for profiling -- we can't throw away cost +centres for instructions in the middle of execution! So when a basic block is +translated, we first look for its cost centre array in the hash table. If +there is no cost centre array, it must be the first translation, so we proceed +as described above. But if there is a cost centre array already, it must be a +retranslation. In this case, we skip the cost centre allocation and +initialisation steps, but still do the UCode instrumentation step.

+ +

The cache simulation

+The cache simulation is fairly straightforward. It just tracks which memory +blocks are in the cache at the moment (it doesn't track the contents, since +that is irrelevant).

+ 
+The interface to the simulation is quite clean.  The functions called from the
+UCode contain calls to the simulation functions in the files
+vg_cachesim_{I1,D1,L2}.c; these calls are inlined so that only
+one function call is done per simulated x86 instruction.  The file
+vg_cachesim.c simply #includes the three files
+containing the simulation, which makes plugging in new cache simulations
+very easy -- you just replace the three files and recompile.

+ +

Output

+Output is fairly straightforward, basically printing the cost centre for every +instruction, grouped by files and functions. Total counts (eg. total cache +accesses, total L1 misses) are calculated when traversing this structure rather +than during execution, to save time; the cache simulation functions are called +so often that even one or two extra adds can make a sizeable difference.

+ +The input file has the following format: + +

+file         ::= desc_line* cmd_line events_line data_line+ summary_line
+desc_line    ::= "desc:" ws? non_nl_string
+cmd_line     ::= "cmd:" ws? cmd
+events_line  ::= "events:" ws? (event ws)+
+data_line    ::= file_line | fn_line | count_line
+file_line    ::= ("fl=" | "fi=" | "fe=") filename
+fn_line      ::= "fn=" fn_name
+count_line   ::= line_num ws? (count ws)+
+summary_line ::= "summary:" ws? (count ws)+
+count        ::= num | "."
+
+ +Where: + + + +The contents of the "desc:" lines are printed out at the top of the summary. +This is a generic way of providing simulation-specific information, eg. for +giving the cache configuration for cache simulation.

+ +Counts can be "." to represent "N/A", eg. the number of write misses for an +instruction that doesn't write to memory.

+ +The number of counts in each line and the +summary_line should not exceed the number of events in the +events_line. If the number in each line is less, +cg_annotate treats the missing entries as though they were "." entries.

+ +A file_line changes the current file name. A fn_line +changes the current function name. A count_line contains counts +that pertain to the current filename/fn_name. A "fl=" file_line +and a fn_line must appear before any count_lines to +give the context of the first count_lines.

+ +Each file_line should be immediately followed by a +fn_line. "fi=" file_lines are used to switch +filenames for inlined functions; "fe=" file_lines are similar, but +are put at the end of a basic block in which the file name hasn't been switched +back to the original file name. (fi and fe lines behave the same; they are +only distinguished to help debugging.)
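Putting the grammar together, a small input file might look like the following. All numbers, file names and function names here are invented purely to illustrate the format:

```
desc: I1 cache: 65536 B, 32 B, 2-way associative
cmd: ./myprog
events: Ir I1mr I2mr
fl=myprog.c
fn=main
45 10 1 1
46 20 0 0
fn=get_word
60 5 . .
summary: 35 1 1
```

Note the "." entries on line 60: per the grammar, a count may be "." to mean "N/A" for that event.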

+ + +

Summary of performance features

+Quite a lot of work has gone into making the profiling as fast as possible. +This is a summary of the important features: + + + + +

Annotation

+Annotation is done by cg_annotate. It is a fairly straightforward Perl script +that slurps up all the cost centres, and then runs through all the chosen +source files, printing out cost centres with them. It too has been carefully +optimised. + + +

Similar work, extensions

+It would be relatively straightforward to do other simulations and obtain +line-by-line information about interesting events. A good example would be +branch prediction -- all branches could be instrumented to interact with a +branch prediction simulator, using very similar techniques to those described +above.

+ +In particular, cg_annotate would not need to change -- the file format is such +that it is not specific to the cache simulation, but could be used for any kind +of line-by-line information. The only part of cg_annotate that is specific to +the cache simulation is the name of the input file +(cachegrind.out), although it would be very simple to add an +option to control this.

+ + + diff --git a/coregrind/coregrind_core.html b/coregrind/coregrind_core.html new file mode 100644 index 000000000..7e6083636 --- /dev/null +++ b/coregrind/coregrind_core.html @@ -0,0 +1,1270 @@ + + + +

2  Using and understanding the valgrind core services

+ +This section describes the core services, flags and behaviours. That +means it is relevant regardless of what particular skin you are using. +A point of terminology: most references to "valgrind" in the rest of +this section (Section 2) refer to the valgrind core services. + + + +

2.1  What it does with your program

+ +Valgrind is designed to be as non-intrusive as possible. It works +directly with existing executables. You don't need to recompile, +relink, or otherwise modify, the program to be checked. Simply place +the word valgrind at the start of the command line +normally used to run the program, and tell it what skin you want to +use. + +

+So, for example, if you want to run the command ls -l +using the heavyweight memory-checking tool, issue the command: +valgrind --skin=memcheck ls -l. The --skin= +parameter tells the core which skin is to be used. + +

+To preserve compatibility with the 1.0.X series, if you do not specify +a skin, the default is to use the memcheck skin. That means the above +example simplifies to: valgrind ls -l. + +

Regardless of which skin is in use, Valgrind takes control of your +program before it starts. Debugging information is read from the +executable and associated libraries, so that error messages can be +phrased in terms of source code locations (if that is appropriate). + +

+Your program is then run on a synthetic x86 CPU provided by the +valgrind core. As new code is executed for the first time, the core +hands the code to the selected skin. The skin adds its own +instrumentation code to this and hands the result back to the core, +which coordinates the continued execution of this instrumented code. + +

+The amount of instrumentation code added varies widely between skins. +At one end of the scale, the memcheck skin adds code to check every +memory access and every value computed, increasing the size of the +code at least 12 times, and making it run 25-50 times slower than +natively. At the other end of the spectrum, the ultra-trivial "none" +skin adds no instrumentation at all and causes in total "only" about a +4 times slowdown. + +

+Valgrind simulates every single instruction your program executes. +Because of this, the active skin checks, or profiles, not only the +code in your application but also in all supporting dynamically-linked +(.so-format) libraries, including the GNU C library, the +X client libraries, Qt, if you work with KDE, and so on. + +

+If -- as is usually the case -- you're using one of the +error-detection skins, valgrind will often detect errors in +libraries, for example the GNU C or X11 libraries, which you have to +use. Since you're probably using valgrind to debug your own +application, and not those libraries, you don't want to see those +errors and probably can't fix them anyway. + +

+So, rather than swamping you with errors in which you are not +interested, Valgrind allows you to selectively suppress errors, by +recording them in a suppressions file which is read when Valgrind +starts up. The build mechanism attempts to select suppressions which +give reasonable behaviour for the libc and XFree86 versions detected +on your machine. + +

+Different skins report different kinds of errors. The suppression +mechanism therefore allows you to say which skin or skins each +suppression applies to. + + + +

2.2  Getting started

+ +First off, consider whether it might be beneficial to recompile your +application and supporting libraries with debugging info enabled (the +-g flag). Without debugging info, the best valgrind +will be able to do is guess which function a particular piece of code +belongs to, which makes both error messages and profiling output +nearly useless. With -g, you'll potentially get messages +which point directly to the relevant source code lines. + +

+You don't have to do this, but doing so helps Valgrind produce more +accurate and less confusing error reports. Chances are you're set up +like this already, if you intended to debug your program with GNU gdb, +or some other debugger. + +

+This paragraph applies only if you plan to use the memcheck +skin (which is the default). On rare occasions, optimisation levels +at -O2 and above have been observed to generate code which +fools memcheck into wrongly reporting uninitialised value +errors. We have looked in detail into fixing this, and unfortunately +the result is that doing so would give a further significant slowdown +in what is already a slow skin. So the best solution is to turn off +optimisation altogether. Since this often makes things unmanageably +slow, a plausible compromise is to use -O. This gets +you the majority of the benefits of higher optimisation levels whilst +keeping relatively small the chances of false complaints from memcheck. +All other skins (as far as we know) are unaffected by optimisation +level. +

+Valgrind understands both the older "stabs" debugging format, used by +gcc versions prior to 3.1, and the newer DWARF2 format used by gcc 3.1 +and later. We continue to refine and debug our debug-info readers, +although the majority of effort will naturally enough go into the +newer DWARF2 reader. + +

+Then just run your application, but place valgrind +--skin=the-selected-skin in front of your usual command-line +invocation. Note that you should run the real (machine-code) +executable here. If your application is started by, for example, a +shell or perl script, you'll need to modify it to invoke Valgrind on +the real executables. Running such scripts directly under Valgrind +will result in you getting error reports pertaining to +/bin/sh, /usr/bin/perl, or whatever +interpreter you're using. This almost certainly isn't what you want +and can be confusing. You can probably force the issue by +giving the flag --trace-children=yes, but confusion is +still highly likely. + +

2.3  The commentary

+ +Valgrind writes a commentary, a stream of text, detailing error +reports and other significant events. All lines in the commentary +have the following form:
+
+  ==12345== some-message-from-Valgrind
+
+ +

The 12345 is the process ID. This scheme makes it easy +to distinguish program output from Valgrind commentary, and also easy +to differentiate commentaries from different processes which have +become merged together, for whatever reason. + +

By default, Valgrind writes only essential messages to the commentary, +so as to avoid flooding you with information of secondary importance. +If you want more information about what is happening, re-run, passing +the -v flag to Valgrind. + +

+Version 2 of valgrind gives significantly more flexibility than 1.0.X +does about where that stream is sent to. You have three options: + +

+

+Here is an important point about the relationship between the +commentary and profiling output from skins. The commentary contains a +mix of messages from the valgrind core and the selected skin. If the +skin reports errors, it will report them to the commentary. However, +if the skin does profiling, the profile data will be written to a file +of some kind, depending on the skin, and independent of what +--log* options are in force. The commentary is intended +to be a low-bandwidth, human-readable channel. Profiling data, on the +other hand, is usually voluminous and not meaningful without further +processing, which is why we have chosen this arrangement. + + + +

2.4  Reporting of errors

+ +When one of the error-checking skins (memcheck, addrcheck, helgrind) +detects something bad happening in the program, an error message is +written to the commentary. For example:
+
+  ==25832== Invalid read of size 4
+  ==25832==    at 0x8048724: BandMatrix::ReSize(int, int, int) (bogon.cpp:45)
+  ==25832==    by 0x80487AF: main (bogon.cpp:66)
+  ==25832==    by 0x40371E5E: __libc_start_main (libc-start.c:129)
+  ==25832==    by 0x80485D1: (within /home/sewardj/newmat10/bogon)
+  ==25832==    Address 0xBFFFF74C is not stack'd, malloc'd or free'd
+
+ +

+This message says that the program did an illegal 4-byte read of +address 0xBFFFF74C, which, as far as memcheck can tell, is not a valid +stack address, nor corresponds to any currently malloc'd or free'd +blocks. The read is happening at line 45 of bogon.cpp, +called from line 66 of the same file, etc. For errors associated with +an identified malloc'd/free'd block, for example reading free'd +memory, Valgrind reports not only the location where the error +happened, but also where the associated block was malloc'd/free'd. + +

+Valgrind remembers all error reports. When an error is detected, +it is compared against old reports, to see if it is a duplicate. If +so, the error is noted, but no further commentary is emitted. This +avoids you being swamped with bazillions of duplicate error reports. + +

+If you want to know how many times each error occurred, run with the +-v option. When execution finishes, all the reports are +printed out, along with, and sorted by, their occurrence counts. This +makes it easy to see which errors have occurred most frequently. + +

+Errors are reported before the associated operation actually happens. +If you're using a skin (memcheck, addrcheck) which does address +checking, and your program attempts to read from address zero, the +skin will emit a message to this effect, and the program will then +duly die with a segmentation fault. + +

+In general, you should try and fix errors in the order that they are +reported. Not doing so can be confusing. For example, a program +which copies uninitialised values to several memory locations, and +later uses them, will generate several error messages, when run on +memcheck. The first such error message may well give the most direct +clue to the root cause of the problem. + +

+The process of detecting duplicate errors is quite an expensive one +and can become a significant performance overhead if your program +generates huge quantities of errors. To avoid serious problems here, +Valgrind will simply stop collecting errors after 300 different errors +have been seen, or 30000 errors in total have been seen. In this +situation you might as well stop your program and fix it, because +Valgrind won't tell you anything else useful after this. Note that +the 300/30000 limits apply after suppressed errors are removed. These +limits are defined in vg_include.h and can be increased +if necessary. + +

+To avoid this cutoff you can use the --error-limit=no +flag. Then valgrind will always show errors, regardless of how many +there are. Use this flag carefully, since it may have a dire effect +on performance. + + + +

2.5  Suppressing errors

+ +The error-checking skins detect numerous problems in the base +libraries, such as the GNU C library and the XFree86 client +libraries, which come pre-installed on your GNU/Linux system. You +can't easily fix these, but you don't want to see these errors (and +yes, there are many!). So Valgrind reads a list of errors to suppress +at startup. A default suppression file is cooked up by the +./configure script when the system is built. +

+You can modify and add to the suppressions file at your leisure, +or, better, write your own. Multiple suppression files are allowed. +This is useful if part of your project contains errors you can't or +don't want to fix, yet you don't want to continuously be reminded of +them. + +

+Each error to be suppressed is described very specifically, to +minimise the possibility that a suppression-directive inadvertently +suppresses a bunch of similar errors which you did want to see. The +suppression mechanism is designed to allow precise yet flexible +specification of errors to suppress. +

+If you use the -v flag, at the end of execution, Valgrind +prints out one line for each used suppression, giving its name and the +number of times it got used. Here are the suppressions used by a run of +valgrind --skin=memcheck ls -l: 

+  --27579-- supp: 1 socketcall.connect(serv_addr)/__libc_connect/__nscd_getgrgid_r
+  --27579-- supp: 1 socketcall.connect(serv_addr)/__libc_connect/__nscd_getpwuid_r
+  --27579-- supp: 6 strrchr/_dl_map_object_from_fd/_dl_map_object
+
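For reference, an individual entry in a suppressions file looks something like the following. The first line is the suppression's name, the second names the skin and error kind it applies to, and the remaining lines give the call-stack context; treat the exact error-kind spelling here as illustrative:

```
{
   strrchr/_dl_map_object_from_fd/_dl_map_object
   Memcheck:Addr4
   fun:strrchr
   fun:_dl_map_object_from_fd
   fun:_dl_map_object
}
```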
+ + + +

2.6  Command-line flags for the valgrind core

+ + +As mentioned above, valgrind's core accepts a common set of flags. +The skins also accept skin-specific flags, which are documented +separately for each skin. + +You invoke Valgrind like this: 
+  valgrind [options-for-Valgrind] your-prog [options for your-prog]
+
+ +

Note that Valgrind also reads options from the environment variable +$VALGRIND_OPTS, and processes them before the command-line +options. Options for the valgrind core may be freely mixed with those +for the selected skin. + +

Valgrind's default settings succeed in giving reasonable behaviour +in most cases. Available options, in no particular order, are as +follows: +

+ +There are also some options for debugging Valgrind itself. You +shouldn't need to use them in the normal run of things. Nevertheless: + + + + + +

2.8  The Client Request mechanism

+ +Valgrind has a trapdoor mechanism via which the client program can +pass all manner of requests and queries to Valgrind. Internally, this +is used extensively to make malloc, free, signals, threads, etc, work, +although you don't see that. +

+For your convenience, a subset of these so-called client requests is +provided to allow you to tell Valgrind facts about the behaviour of +your program, and conversely to make queries. In particular, your +program can tell Valgrind about changes in memory range permissions +that Valgrind would not otherwise know about, and so allows clients to +get Valgrind to do arbitrary custom checks. +

+Clients need to include the header file valgrind.h to +make this work. The macros therein have the magical property that +they generate code in-line which Valgrind can spot. However, the code +does nothing when not run on Valgrind, so you are not forced to run +your program on Valgrind just because you use the macros in this file. +Also, you are not required to link your program with any extra +supporting libraries. +

+A brief description of the available macros: +

+

+ + + +

2.9  Support for POSIX Pthreads

+ +As of late April 02, Valgrind supports programs which use POSIX +pthreads. Doing this has proved technically challenging but is now +mostly complete. It works well enough for significant threaded +applications to work. +

+It works as follows: threaded apps are (dynamically) linked against +libpthread.so. Usually this is the one installed with +your Linux distribution. Valgrind, however, supplies its own +libpthread.so and automatically connects your program to +it instead. +

+The fake libpthread.so and Valgrind cooperate to +implement a user-space pthreads package. This approach avoids the +horrible problems of implementing a truly +multiprocessor version of Valgrind, but it does mean that threaded +apps run only on one CPU, even if you have a multiprocessor machine. 

+Valgrind schedules your threads in a round-robin fashion, with all +threads having equal priority. It switches threads every 50000 basic +blocks (typically around 300000 x86 instructions), which means you'll +get a much finer interleaving of thread executions than when run +natively. This in itself may cause your program to behave differently +if you have some kind of concurrency, critical race, locking, or +similar, bugs. +

+The current (valgrind-1.0 release) state of pthread support is as +follows: +

+ + +As of 18 May 02, the following threaded programs now work fine on my +RedHat 7.2 box: Opera 6.0Beta2, KNode in KDE 3.0, Mozilla-0.9.2.1 and +Galeon-0.11.3, both as supplied with RedHat 7.2. Also Mozilla 1.0RC2. +OpenOffice 1.0. MySQL 3.something (the current stable release). + + +

2.10  Building and installing

+ +We now use the standard Unix ./configure, +make, make install mechanism, and I have +attempted to ensure that it works on machines with kernel 2.2 or 2.4 +and glibc 2.1.X or 2.2.X. I don't think there is much else to say. +There are no options apart from the usual --prefix that +you should give to ./configure. + +

+The configure script tests the version of the X server +indicated by the current $DISPLAY. This is a +known bug. The intention was to detect the version of the current +XFree86 client libraries, so that correct suppressions could be +selected for them, but instead the test checks the server version. +This is just plain wrong. 

+If you are building a binary package of Valgrind for distribution, +please read README_PACKAGERS. It contains some important +information. + +

+Apart from that there is no excitement here. Let me know if you have +build problems. + + + + +

2.11  If you have problems

+Mail me (jseward@acm.org). + +

See Section 4 for the known limitations of +Valgrind, and for a list of programs which are known not to work on +it. + +

The translator/instrumentor has a lot of assertions in it. They +are permanently enabled, and I have no plans to disable them. If one +of these breaks, please mail me! + +

If you get an assertion failure on the expression +chunkSane(ch) in vg_free() in +vg_malloc.c, this may have happened because your program +wrote off the end of a malloc'd block, or before its beginning. +Valgrind should have emitted a proper message to that effect before +dying in this way. This is a known problem which I should fix. +

+ +


+ + + +

3.4  Signals

+ +Valgrind provides suitable handling of signals, so, provided you stick +to POSIX stuff, you should be ok. Basic sigaction() and sigprocmask() +are handled. Signal handlers may return in the normal way or do +longjmp(); both should work ok. As specified by POSIX, a signal is +blocked in its own handler. Default actions for signals should work +as before. Etc, etc. + +

Under the hood, dealing with signals is a real pain, and Valgrind's +simulation leaves much to be desired. If your program does +way-strange stuff with signals, bad things may happen. If so, let me +know. I don't promise to fix it, but I'd at least like to be aware of +it. + + + + +

4  Limitations

+ +The following list of limitations seems depressingly long. However, +most programs actually work fine. + +

Valgrind will run x86-GNU/Linux ELF dynamically linked binaries, on +a kernel 2.2.X or 2.4.X system, subject to the following constraints: + +

+ +Programs which are known not to work are: + + + +Known platform-specific limitations, as of release 1.0.0: + + + + +


+ + + +

5  How it works -- a rough overview

+Some gory details, for those with a passion for gory details. You +don't need to read this section if all you want to do is use Valgrind. + + +

5.1  Getting started

+ +Valgrind is compiled into a shared object, valgrind.so. The shell +script valgrind sets the LD_PRELOAD environment variable to point to +valgrind.so. This causes the .so to be loaded as an extra library to +any subsequently executed dynamically-linked ELF binary, viz, the +program you want to debug. + +

The dynamic linker allows each .so in the process image to have an +initialisation function which is run before main(). It also allows +each .so to have a finalisation function run after main() exits. + +

When valgrind.so's initialisation function is called by the dynamic +linker, the synthetic CPU starts up. The real CPU remains locked +in valgrind.so for the entire rest of the program, but the synthetic +CPU returns from the initialisation function. Startup of the program +now continues as usual -- the dynamic linker calls all the other .so's +initialisation routines, and eventually runs main(). This all runs on +the synthetic CPU, not the real one, but the client program cannot +tell the difference. 

Eventually main() exits, so the synthetic CPU calls valgrind.so's +finalisation function. Valgrind detects this, and uses it as its cue +to exit. It prints summaries of all errors detected, possibly checks +for memory leaks, and then exits the finalisation routine, but now on +the real CPU. The synthetic CPU has now lost control -- permanently +-- so the program exits back to the OS on the real CPU, just as it +would have done anyway. + +

On entry, Valgrind switches stacks, so it runs on its own stack. +On exit, it switches back. This means that the client program +continues to run on its own stack, so we can switch back and forth +between running it on the simulated and real CPUs without difficulty. +This was an important design decision, because it makes it easy (well, +significantly less difficult) to debug the synthetic CPU. + + + +

5.2  The translation/instrumentation engine

+ +Valgrind does not directly run any of the original program's code. Only +instrumented translations are run. Valgrind maintains a translation +table, which allows it to find the translation quickly for any branch +target (code address). If no translation has yet been made, the +translator - a just-in-time translator - is summoned. This makes an +instrumented translation, which is added to the collection of +translations. Subsequent jumps to that address will use this +translation. + +

Valgrind no longer directly supports detection of self-modifying +code. Such checking is expensive, and in practice (fortunately) +almost no applications need it. However, to help people who are +debugging dynamic code generation systems, there is a Client Request +(basically a macro you can put in your program) which directs Valgrind +to discard translations in a given address range. So Valgrind can +still work in this situation provided the client tells it when +code has become out-of-date and needs to be retranslated. + +

+ The JITter translates basic blocks -- blocks of straight-line code +-- as single entities. To minimise the considerable difficulties of +dealing with the x86 instruction set, x86 instructions are first +translated to a RISC-like intermediate code, similar to sparc code, +but with an infinite number of virtual integer registers. Initially +each insn is translated separately, and there is no attempt at +instrumentation. 
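As a rough illustration of this RISC-like form, an instruction such as movl 4(%ebp),%eax might be broken into get/load/put steps over virtual registers. The mnemonics below are a sketch in the spirit of UCode, not a verbatim dump of Valgrind's output:

```
GETL    %EBP, t0      # fetch simulated %ebp into virtual register t0
ADDL    $4, t0        # compute the effective address
LDL     (t0), t2      # load 4 bytes from simulated memory
PUTL    t2, %EAX      # write the result back to simulated %eax
```

Each virtual register (t0, t2, ...) is later mapped to a real register by the allocator described below.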

+ The intermediate code is improved, mostly so as to try and cache +the simulated machine's registers in the real machine's registers over +several simulated instructions. This is often very effective. Also, +we try to remove redundant updates of the simulated machine's +condition-code register. 

The intermediate code is then instrumented, giving more +intermediate code. There are a few extra intermediate-code operations +to support instrumentation; it is all refreshingly simple. After +instrumentation there is a cleanup pass to remove redundant value +checks. + +

This gives instrumented intermediate code which mentions arbitrary +numbers of virtual registers. A linear-scan register allocator is +used to assign real registers and possibly generate spill code. All +of this is still phrased in terms of the intermediate code. This +machinery is inspired by the work of Reuben Thomas (Mite). + +

Then, and only then, is the final x86 code emitted. The +intermediate code is carefully designed so that x86 code can be +generated from it without need for spare registers or other +inconveniences. + +

The translations are managed using a traditional LRU-based caching +scheme. The translation cache has a default size of about 14MB. + + + +

5.3  Tracking the status of memory

Each byte in the +process' address space has nine bits associated with it: one A bit and +eight V bits. The A and V bits for each byte are stored using a +sparse array, which flexibly and efficiently covers arbitrary parts of +the 32-bit address space without imposing significant space or +performance overheads for the parts of the address space never +visited. The scheme used, and speedup hacks, are described in detail +at the top of the source file vg_memory.c, so you should read that for +the gory details. + + + +

5.4 System calls

+All system calls are intercepted. The memory status map is consulted +before and updated after each call. It's all rather tiresome. See +vg_syscall_mem.c for details. + + + +

5.5  Signals

+All system calls to sigaction() and sigprocmask() are intercepted. If +the client program is trying to set a signal handler, Valgrind makes a +note of the handler address and which signal it is for. Valgrind then +arranges for the same signal to be delivered to its own handler. + +

When such a signal arrives, Valgrind's own handler catches it, and +notes the fact. At a convenient safe point in execution, Valgrind +builds a signal delivery frame on the client's stack and runs its +handler. If the handler longjmp()s, there is nothing more to be said. +If the handler returns, Valgrind notices this, zaps the delivery +frame, and carries on where it left off before delivering the signal. + +

The purpose of this nonsense is that setting signal handlers +essentially amounts to giving callback addresses to the Linux kernel. +We can't allow this to happen, because if it did, signal handlers +would run on the real CPU, not the simulated one. This means the +checking machinery would not operate during the handler run, and, +worse, memory permissions maps would not be updated, which could cause +spurious error reports once the handler had returned. + +

An even worse thing would happen if the signal handler longjmp'd +rather than returned: Valgrind would completely lose control of the +client program. + +

Upshot: we can't allow the client to install signal handlers +directly. Instead, Valgrind must catch, on behalf of the client, any +signal the client asks to catch, and must deliver it to the client on +the simulated CPU, not the real one. This involves considerable +gruesome fakery; see vg_signals.c for details. 

+ +


+ + +

6  Example

+This is the log for a run of a small program. The program is in fact +correct, and the reported error is as the result of a potentially serious +code generation bug in GNU g++ (snapshot 20010527). +
+sewardj@phoenix:~/newmat10$
+~/Valgrind-6/valgrind -v ./bogon 
+==25832== Valgrind 0.10, a memory error detector for x86 RedHat 7.1.
+==25832== Copyright (C) 2000-2001, and GNU GPL'd, by Julian Seward.
+==25832== Startup, with flags:
+==25832== --suppressions=/home/sewardj/Valgrind/redhat71.supp
+==25832== reading syms from /lib/ld-linux.so.2
+==25832== reading syms from /lib/libc.so.6
+==25832== reading syms from /mnt/pima/jrs/Inst/lib/libgcc_s.so.0
+==25832== reading syms from /lib/libm.so.6
+==25832== reading syms from /mnt/pima/jrs/Inst/lib/libstdc++.so.3
+==25832== reading syms from /home/sewardj/Valgrind/valgrind.so
+==25832== reading syms from /proc/self/exe
+==25832== loaded 5950 symbols, 142333 line number locations
+==25832== 
+==25832== Invalid read of size 4
+==25832==    at 0x8048724: _ZN10BandMatrix6ReSizeEiii (bogon.cpp:45)
+==25832==    by 0x80487AF: main (bogon.cpp:66)
+==25832==    by 0x40371E5E: __libc_start_main (libc-start.c:129)
+==25832==    by 0x80485D1: (within /home/sewardj/newmat10/bogon)
+==25832==    Address 0xBFFFF74C is not stack'd, malloc'd or free'd
+==25832==
+==25832== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
+==25832== malloc/free: in use at exit: 0 bytes in 0 blocks.
+==25832== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
+==25832== For a detailed leak analysis, rerun with: --leak-check=yes
+==25832==
+==25832== exiting, did 1881 basic blocks, 0 misses.
+==25832== 223 translations, 3626 bytes in, 56801 bytes out.
+
+

The GCC folks fixed this about a week before gcc-3.0 shipped. +


+

+ + + + diff --git a/coregrind/coregrind_intro.html b/coregrind/coregrind_intro.html new file mode 100644 index 000000000..e561da410 --- /dev/null +++ b/coregrind/coregrind_intro.html @@ -0,0 +1,176 @@ + + + +

1  Introduction

+ + +

1.1  An overview of Valgrind

+ +Valgrind is a flexible tool for profiling and debugging Linux-x86 +executables. The tool consists of a core, which provides a synthetic +x86 CPU in software, and a series of "skins", each of which is a +debugging or profiling tool. The architecture is modular, so that new +skins can be created easily and without disturbing the existing +structure. + +

+A number of useful skins are supplied as standard. In summary, these +are: + +

+ +A number of minor skins (corecheck, lackey and +none) are also supplied. These aren't particularly useful -- +they exist to illustrate how to create simple skins and to help the +valgrind developers in various ways. + + +

+Valgrind is closely tied to details of the CPU, operating system and, +to a lesser extent, compiler and basic C libraries. This makes it +difficult to port, so we have chosen at the outset to +concentrate on what we believe to be a widely used platform: Linux on +x86s. Valgrind uses the standard Unix ./configure, +make, make install mechanism, and we have +attempted to ensure that it works on machines with kernel 2.2 or 2.4 +and glibc 2.1.X, 2.2.X or 2.3.X. This should cover the vast majority +of modern Linux installations. 

+Valgrind is licensed under the GNU General Public License, version +2. Read the file LICENSE in the source distribution for details. Some +of the PThreads test cases, pth_*.c, are taken from +"Pthreads Programming" by Bradford Nichols, Dick Buttlar & +Jacqueline Proulx Farrell, ISBN 1-56592-115-1, published by O'Reilly +& Associates, Inc. + + + + + +

1.2  How to navigate this manual

+ +Valgrind is structured as a set of core services supporting a number +of profiling and debugging tools ("skins"). This manual is structured +similarly. Below, we continue with a description of the valgrind +core, how to use it, and the flags it supports. + +

+The skins each have their own chapters in this manual. You only need +to read the documentation for the skin(s) you actually use, although +you may find it helpful to be at least a little bit familiar with what +all skins do. +

+If you're new to all this, you're most likely to be using the Memcheck +skin, since that's the one selected by default. So, read the rest of +this page, and the section Memcheck. + +

+Be aware that the core understands some command line flags, and the +skins then have their own flags which they know about. This means +there is no central place describing all the flags that are accepted +-- you have to read the flags documentation both for valgrind's core +(below) and for the skin you want to use. + +

+

For users migrating from valgrind-1.0.X

+

+Valgrind-2.0.X is a major redesign of the 1.0.X series. You should at +least be familiar with the concept of the new core/skin division, +as explained above in the Introduction. Having said that, we've tried +to make the command line handling and behaviour as +backwards-compatible as we can. In particular, just running +valgrind [args-for-valgrind] my_prog [args-for-my-prog] +should work pretty much as before. + +

+ diff --git a/coregrind/coregrind_skins.html b/coregrind/coregrind_skins.html new file mode 100644 index 000000000..a17397139 --- /dev/null +++ b/coregrind/coregrind_skins.html @@ -0,0 +1,687 @@ + + + + Valgrind + + + + +  +

Valgrind Skins

+
+ A guide to writing new skins for Valgrind
+ This guide was last updated on 26 September 2002 +
+

+ +

+njn25@cam.ac.uk
+Nick Nethercote, October 2002 +

+Valgrind is licensed under the GNU General Public License, +version 2
+An open-source tool for supervising execution of Linux-x86 executables. +

+ +

+ +


+ +

Contents of this manual

+ +

Introduction

+ 1.1  Supervised Execution
+ 1.2  Skins
+ 1.3  Execution Spaces
+ +

Writing a Skin

+ 2.1  Why write a skin?
+ 2.2  How skins work
+ 2.3  Getting the code
+ 2.4  Getting started
+ 2.5  Writing the code
+ 2.6  Initialisation
+ 2.7  Instrumentation
+ 2.8  Finalisation
+ 2.9  Other important information
+ 2.10  Words of advice
+ +

Advanced Topics

+ 3.1  Suppressions
+ 3.2  Documentation
+ 3.3  Regression tests
+ 3.4  Profiling
+ 3.5  Other makefile hackery
+ 3.6  Core/skin interface versions
+ +

Final Words

+ +
+ + +

1  Introduction

+ + +

1.1  Supervised Execution

+ +Valgrind provides a generic infrastructure for supervising the execution of +programs. This is done by providing a way to instrument programs in very +precise ways, making it relatively easy to support activities such as dynamic +error detection and profiling.

+ +Although writing a skin is not easy and requires learning quite a few things +about Valgrind, it is much easier than instrumenting a program from scratch +yourself. +

1.2  Skins

+The key idea behind Valgrind's architecture is the division between its +``core'' and ``skins''. +

+The core provides the common low-level infrastructure to support program +instrumentation, including the x86-to-x86 JIT compiler, low-level memory +manager, signal handling and a scheduler (for pthreads). It also provides +certain services that are useful to some but not all skins, such as support +for error recording and suppression. +

+But the core leaves certain operations undefined, which must be filled in by skins. +Most notably, skins define how program code should be instrumented. They can +also define certain variables to indicate to the core that they would like to +use certain services, or be notified when certain interesting events occur. +

+Each skin that is written defines a new program supervision tool. Writing a +new tool just requires writing a new skin. The core takes care of all the hard +work. +

+ + +

1.3  Execution Spaces

+An important concept to understand before writing a skin is that there are +three spaces in which program code executes: + +
    +
  1. User space: this covers most of the program's execution. The skin is + given the code and can instrument it any way it likes, providing (more or + less) total control over the code.

    + + Code executed in user space includes all the program code, almost all of + the C library (including things like the dynamic linker), and almost + all parts of all other libraries. +

  2. + +

  3. Core space: a small proportion of the program's execution takes place + entirely within Valgrind's core. This includes:

    + +

      +
    • Dynamic memory management (malloc() etc.)
    • + +
    • Pthread operations and scheduling
    • + +
    • Signal handling
    • +

    + + A skin has no control over these operations; it never ``sees'' the code + doing this work and thus cannot instrument it. However, the core + provides hooks so a skin can be notified when certain interesting events + happen, for example when dynamic memory is allocated or freed, the + stack pointer is changed, or a pthread mutex is locked.

    + + Note that these hooks only notify skins of events relevant to user + space. For example, when the core allocates some memory for its own use, + the skin is not notified of this, because it's not directly part of the + supervised program's execution. +

  4. + +

  5. Kernel space: execution in the kernel. Two kinds:

    + +

      +
    1. System calls: can't be directly observed by either the skin or the + core. But the core does have some idea of what happens to the + arguments, and it provides hooks for a skin to wrap system calls. +
    2. + +

    3. Other: all other kernel activity (e.g. process scheduling) is + totally opaque and irrelevant to the program. +
    4. +

    +
  6. + + It should be noted that a skin only has direct control over code executed in + user space. This is the vast majority of code executed, but it is not + absolutely all of it, so any profiling information recorded by a skin won't + be totally accurate. +

+ + + +

2  Writing a Skin

+ + +

2.1  Why write a skin?

+ +Before you write a skin, you should have some idea of what it should do. What +is it you want to know about your programs of interest? Consider some existing +skins: + + + +These examples give a reasonable idea of what kinds of things Valgrind can be +used for. The instrumentation can range from very lightweight (e.g. counting +the number of times a particular function is called) to very intrusive (e.g. +memcheck's memory checking). + +
+

2.2  How skins work

+ +Skins must define various functions for instrumenting programs that are called +by Valgrind's core, yet they must be implemented in such a way that they can be +written and compiled without touching Valgrind's core. This is important, +because one of our aims is to allow people to write and distribute their own +skins that can be plugged into Valgrind's core easily.

+ +This is achieved by packaging each skin into a separate shared object which is +then loaded ahead of the core shared object valgrind.so, using the +dynamic linker's LD_PRELOAD variable. Any functions defined in +the skin that share a name with a function defined in the core (such as +the instrumentation function SK_(instrument)()) override the +core's definition. Thus the core can call the necessary skin functions.

+ +This magic is all done for you; the shared object used is chosen with the +--skin option to the valgrind startup script. The +default skin used is memcheck, Valgrind's original memory checker. + + +

2.3  Getting the code

+ +To write your own skin, you'll need to check out a copy of Valgrind from the +CVS repository, rather than using a packaged distribution. This is because it +contains several extra files needed for writing skins.

+ +To check out the code from the CVS repository, first login: +

+cvs -d:pserver:anonymous@cvs.valgrind.sourceforge.net:/cvsroot/valgrind login +
+ +Then checkout the code. To get a copy of the current development version +(recommended for the brave only): +
+cvs -z3 -d:pserver:anonymous@cvs.valgrind.sourceforge.net:/cvsroot/valgrind co valgrind +
+ +To get a copy of the stable released branch: +
+cvs -z3 -d:pserver:anonymous@cvs.valgrind.sourceforge.net:/cvsroot/valgrind co -r TAG valgrind +
+ +where TAG has the form VALGRIND_X_Y_Z for +version X.Y.Z. + +
+

2.4  Getting started

+ +Valgrind uses GNU automake and autoconf for the +creation of Makefiles and configuration. But don't worry, these instructions +should be enough to get you started even if you know nothing about those +tools.

+ +In what follows, all filenames are relative to Valgrind's top-level directory +valgrind/. + +

    +
  1. Choose a name for the skin, and an abbreviation that can be used as a + short prefix. We'll use foobar and fb as an + example. +
  2. + +

  3. Make a new directory foobar/ which will hold the skin. +
  4. + +

  5. Copy example/Makefile.am into foobar/. + Edit it by replacing all occurrences of the string + ``example'' with ``foobar'' and the one + occurrence of the string ``ex_'' with ``fb_''. + It might be worth trying to understand this file, at least a little; you + might have to do more complicated things with it later on. In + particular, the name of the vgskin_foobar_so_SOURCES variable + determines the name of the skin's shared object, which determines what + name must be passed to the --skin option to use the skin. +
  6. + +

  7. Copy example/ex_main.c into + foobar/, renaming it as fb_main.c. + Edit it by changing the five lines in SK_(pre_clo_init)() + to something appropriate for the skin. These fields are used in the + startup message, except for bug_reports_to which is used + if a skin assertion fails. +
  8. + +

  9. Edit Makefile.am, adding the new directory + foobar to the SUBDIRS variable. +
  10. + +

  11. Edit configure.in, adding foobar/Makefile to the + AC_OUTPUT list. +
  12. + +

  13. Run: +
    +    autogen.sh
    +    ./configure --prefix=`pwd`/inst
    +    make install
    + + It should automake, configure and compile without errors, putting copies + of the skin's shared object vgskin_foobar.so in + foobar/ and + inst/lib/valgrind/. +
  14. + +

  15. You can test it with a command like +
    +    inst/bin/valgrind --skin=foobar date
    + + (almost any program should work; date is just an example). + The output should be something like this: +
    +==738== foobar-0.0.1, a foobarring tool for x86-linux.
    +==738== Copyright (C) 2002, and GNU GPL'd, by J. Random Hacker.
    +==738== Built with valgrind-1.1.0, a program execution monitor.
    +==738== Copyright (C) 2000-2002, and GNU GPL'd, by Julian Seward.
    +==738== Estimated CPU clock rate is 1400 MHz
    +==738== For more details, rerun with: -v
    +==738== 
    +Wed Sep 25 10:31:54 BST 2002
    +==738==
    + + The skin does nothing except run the program uninstrumented. +
  16. +

+ +These steps don't have to be followed exactly - you can choose different names +for your source files, and use a different --prefix for +./configure.

+ +Now that we've set up, built and tested the simplest possible skin, on to the +interesting stuff... + +

2.5  Writing the code

+ +A skin must define at least these four functions: +
+    SK_(pre_clo_init)()
+    SK_(post_clo_init)()
+    SK_(instrument)()
+    SK_(fini)()
+
+ +Also, it must use the macro VG_DETERMINE_INTERFACE_VERSION +exactly once in its source code. If it doesn't, you will get a link error +involving VG_(skin_interface_major_version). This macro is +used to ensure the core/skin interface used by the core and a plugged-in +skin are binary compatible. + +In addition, if a skin wants to use some of the optional services provided by +the core, it may have to define other functions. + + +

2.6  Initialisation

+ +Most of the initialisation should be done in SK_(pre_clo_init)(). +Only use SK_(post_clo_init)() if a skin provides command line +options and must do some initialisation after option processing takes place +(``clo'' stands for ``command line options'').

+ +The first argument to SK_(pre_clo_init)() must be initialised with +various ``details'' for a skin. These are all compulsory except for +version. They are used when constructing the startup message, +except for bug_reports_to, which is used if VG_(skin_panic)() is +ever called, or a skin assertion fails.

+ +The second argument to SK_(pre_clo_init)() must be initialised with +the ``needs'' for a skin. They are mostly booleans, and can be left untouched +(they default to False). They determine whether a skin can do +various things such as: record, report and suppress errors; process command +line options; wrap system calls; record extra information about malloc'd +blocks, etc.

+ +For example, if a skin wants the core's help in recording and reporting errors, +it must set the skin_errors need to True, and then +provide definitions of six functions for comparing errors, printing out errors, +reading suppressions from a suppressions file, etc. While writing these +functions requires some work, it's much less than doing error handling from +scratch because the core is doing most of the work. See the type +VgNeeds in include/vg_skin.h for full details of all +the needs.

+ +The third argument to SK_(pre_clo_init)() must be initialised to +indicate which events in the core the skin wants to be notified about. These +include things such as blocks of memory being malloc'd, the stack pointer +changing, a mutex being locked, etc. If a skin wants to know about such an event, +it should set the relevant pointer in the structure to point to a function, +which will be called when that event happens.

+ +For example, if the skin wants to be notified when a new block of memory is +malloc'd, it should set the new_mem_heap function pointer, and the +assigned function will be called each time this happens. See the type +VgTrackEvents in include/vg_skin.h for full details +of all the trackable events.

+ + +

2.7  Instrumentation

+ +SK_(instrument)() is the interesting one. It allows you to +instrument UCode, which is Valgrind's RISC-like intermediate language. +UCode is described in the technical docs. + +The easiest way to instrument UCode is to insert calls to C functions when +interesting things happen. See the skin ``lackey'' +(lackey/lk_main.c) for a simple example of this, or +Cachegrind (cachegrind/cg_main.c) for a more complex +example.

+ +A much more complicated way to instrument UCode, albeit one that might result +in faster instrumented programs, is to extend UCode with new UCode +instructions. This is recommended for advanced Valgrind hackers only! See the +``memcheck'' skin for an example. + + +

2.8  Finalisation

+ +This is where you can present the final results, such as a summary of the +information collected. Any log files should be written out at this point. + + +

2.9  Other important information

+ +Please note that the core/skin split infrastructure is all very new, and not +very well documented. Here are some important points, but there are +undoubtedly many others that I should note but haven't thought of.

+ +The file include/vg_skin.h contains all the types, +macros, functions, etc. that a skin should (hopefully) need, and is the only +.h file a skin should need to #include.

+ +In particular, you probably shouldn't use anything from the C library (there +are deep reasons for this, trust us). Valgrind provides an implementation of a +reasonable subset of the C library, details of which are in +vg_skin.h.

+ +Similarly, when writing a skin, you shouldn't need to look at any of the code +in Valgrind's core, although it might sometimes be useful for understanding +something.

+ +vg_skin.h has a reasonable amount of documentation in it that +should hopefully be enough to get you going. But ultimately, the skins +distributed (memcheck, addrcheck, cachegrind, lackey, etc.) are probably the +best documentation of all, for the moment.

+ +Note that the VG_ and SK_ macros are used heavily. +These just prepend longer strings in front of names to avoid potential +namespace clashes. We strongly recommend using the SK_ macro +for any global functions and variables in your skin.

+ + +

2.10  Words of Advice

+ +Writing and debugging skins is not trivial. Here are some suggestions for +solving common problems.

+ +If you are getting segmentation faults in C functions used by your skin, the +usual GDB command: +

gdb prog core
+usually gives the location of the segmentation fault.

+ +If you want to debug C functions used by your skin, you can attach GDB to +Valgrind with some effort: +

+ +GDB may be able to give you useful information. Note that by default +most of the system is built with -fomit-frame-pointer, +and you'll need to get rid of this to extract useful tracebacks from +GDB.

+ +If you just want to know whether a program point has been reached, using the +OINK macro (in include/vg_skin.h) can be easier than +using GDB.

+ +If you are having problems with your UCode instrumentation, it's likely that +GDB won't be able to help at all. In this case, Valgrind's +--trace-codegen option is invaluable for observing the results of +instrumentation.

+ +The other debugging command line options can be useful too (run valgrind +-h for the list).

+ + +

3  Advanced Topics

+ +Once a skin becomes more complicated, there are some extra things you may +want/need to do. + + +

3.1  Suppressions

+ +If your skin reports errors and you want to suppress some common ones, you can +add suppressions to the suppression files. The relevant files are +valgrind/*.supp; the final suppression file is aggregated by +combining the relevant .supp files depending on the +versions of Linux, X and glibc on a system. +
+

3.2  Documentation

+ +If you are feeling conscientious and want to write some HTML documentation for +your skin, follow these steps (using foobar as the example skin +name again): + +
    +
  1. Make a directory foobar/docs/. +
  2. + +

  3. Edit foobar/Makefile.am, adding docs to + the SUBDIRS variable. +
  4. + +

  5. Edit configure.in, adding + foobar/docs/Makefile to the AC_OUTPUT list. +
  6. + +

  7. Write foobar/docs/Makefile.am. Use + memcheck/docs/Makefile.am as an example. +
  8. + +
  9. Write the documentation; the top-level file should be called + foobar/docs/index.html. +
  10. + +

  11. (optional) Add a link in the main documentation index + docs/index.html to + foobar/docs/index.html +
  12. +

+ +
+

3.3  Regression tests

+ +Valgrind has some support for regression tests. If you want to write +regression tests for your skin: + +
    +
  1. Make a directory foobar/tests/. +
  2. + +

  3. Edit foobar/Makefile.am, adding tests to + the SUBDIRS variable. +
  4. + +

  5. Edit configure.in, adding + foobar/tests/Makefile to the AC_OUTPUT list. +
  6. + +

  7. Write foobar/tests/Makefile.am. Use + memcheck/tests/Makefile.am as an example. +
  8. + +

  9. Write the tests, .vgtest test description files, + .stdout.exp and .stderr.exp expected output + files. (Note that Valgrind's output goes to stderr.) Some details + on writing and running tests are given in the comments at the top of the + testing script tests/vg_regtest. +
  10. + +

  11. Write a filter for stderr results foobar/tests/filter_stderr. + It can call the existing filters in tests/. See + memcheck/tests/filter_stderr for an example; in particular + note the $dir trick that ensures the filter works correctly + from any directory. +
  12. +

+ +
+

3.4  Profiling

+ +To do simple tick-based profiling of a skin, include the line +
+#include "vg_profile.c" +
+in the skin somewhere, and rebuild (you may have to make clean +first). Then run Valgrind with the --profile=yes option.

+ +The profiler is stack-based; you can register a profiling event with +VGP_(register_profile_event)() and then use the +VGP_PUSHCC and VGP_POPCC macros to record time spent +doing certain things. New profiling event numbers must not overlap with the +core profiling event numbers. See include/vg_skin.h for details +and the ``memcheck'' skin for an example. + + + +

3.5  Other makefile hackery

+ +If you add any directories under valgrind/foobar/, you will +need to add an appropriate Makefile.am to it, and add a +corresponding entry to the AC_OUTPUT list in +valgrind/configure.in.

+ +If you add any scripts to your skin (see Cachegrind for an example) you need to +add them to the bin_SCRIPTS variable in +valgrind/foobar/Makefile.am.

+ + + +

3.6  Core/skin interface versions

+ +In order to allow for the core/skin interface to evolve over time, Valgrind +uses a basic interface versioning system. All a skin has to do is use the +VG_DETERMINE_INTERFACE_VERSION macro exactly once in its code. +If not, a link error will occur when the skin is built. +

+The interface version number has the form X.Y. Changes in Y indicate binary +compatible changes. Changes in X indicate binary incompatible changes. If +the core and skin have the same major version number X, they should work +together. If X doesn't match, Valgrind will abort execution with an +explanation of the problem. +

+This approach was chosen so that if the interface changes in the future, +old skins won't work and the reason will be clearly explained, instead of +possibly crashing mysteriously. We have attempted to minimise the potential +for binary incompatible changes by means such as minimising the use of naked +structs in the interface. + + +

4  Final Words

+ +This whole core/skin business is very new and experimental, and under active +development.

+ +The first consequence of this is that the core/skin interface is quite +immature. It will almost certainly change in the future; we have no intention +of freezing it and then regretting the inevitable stupidities. Hopefully most +of the future changes will be to add new features, hooks, functions, etc, +rather than to change old ones, which should cause a minimum of trouble for +existing skins, and we've put some effort into future-proofing the interface +to avoid binary incompatibility. But we can't guarantee anything. The +versioning system should catch any incompatibilities. Just something to be +aware of.

+ +The second consequence of this is that we'd love to hear your feedback about +it: + +

+ +or anything else!

+ +Happy programming. + diff --git a/docs/manual.html b/docs/manual.html new file mode 100644 index 000000000..d3670199a --- /dev/null +++ b/docs/manual.html @@ -0,0 +1,92 @@ + + + + Valgrind + + + + +  +

Valgrind, version 2.0.0

+
This manual was last updated on 10 November 2002
+

+ +

+jseward@acm.org, + njn25@cam.ac.uk
+Copyright © 2000-2002 Julian Seward, Nick Nethercote +

+ +Valgrind is licensed under the GNU General Public License, version +2
+ +An open-source tool for debugging and profiling Linux-x86 executables. +

+ +

+ +


+ +

Contents of this manual

+ +

Introduction

+ 1.1  What Valgrind is for
+ 1.2  What it does with your program + +

How to use it, and how to make sense + of the results

+ 2.1  Getting started
+ 2.2  The commentary
+ 2.3  Reporting of errors
+ 2.4  Suppressing errors
+ 2.5  Command-line flags
+ 2.6  Explanation of error messages
+ 2.7  Writing suppressions files
+ 2.8  The Client Request mechanism
+ 2.9  Support for POSIX pthreads
+ 2.10  Building and installing
+ 2.11  If you have problems
+ +

Details of the checking machinery

+ 3.1  Valid-value (V) bits
+ 3.2  Valid-address (A) bits
+ 3.3  Putting it all together
+ 3.4  Signals
+ 3.5  Memory leak detection
+ +

Limitations

+ +

How it works -- a rough overview

+ 5.1  Getting started
+ 5.2  The translation/instrumentation engine
+ 5.3  Tracking the status of memory
+ 5.4  System calls
+ 5.5  Signals
+ +

An example

+ +

Cache profiling

+ +

The design and implementation of Valgrind

+ +
+ + + diff --git a/helgrind/hg_main.html new file mode 100644 index 000000000..b9d72f9bc --- /dev/null +++ b/helgrind/hg_main.html @@ -0,0 +1,80 @@ + + + + Helgrind + + + + + +

Helgrind

+
This manual was last updated on 2002-10-03
+

+ +

+njn25@cam.ac.uk
+Copyright © 2000-2002 Nicholas Nethercote +

+Helgrind is licensed under the GNU General Public License, +version 2
+Helgrind is a Valgrind skin for detecting data races in threaded programs. +

+ +

+ +

1  Helgrind

+ +Helgrind is a Valgrind skin for detecting data races in C and C++ programs +that use the Pthreads library. +

+It uses the Eraser algorithm described in +

+ Eraser: A Dynamic Data Race Detector for Multithreaded Programs
+ Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro and + Thomas Anderson
+ ACM Transactions on Computer Systems, 15(4):391-411
+ November 1997. +
+ +It is unfortunately in a rather mangy state and probably doesn't work at all. +We include it partly because it may serve as a useful example skin, and partly +in case anybody is inspired to improve it and get it working. +

+If you are inspired, we'd love to hear from you. And if you are successful, +you might like to include some improvements to the basic Eraser algorithm +described in Section 4.2 of + +

+ Runtime Checking of Multithreaded Applications with Visual Threads + Jerry J. Harrow, Jr.
+ Proceedings of the 7th International SPIN Workshop on Model Checking of + Software
+ Stanford, California, USA
+ August 2000
+ LNCS 1885, pp331--342
+ K. Havelund, J. Penix, and W. Visser, editors.
+
+ + +
+ + + diff --git a/lackey/lk_main.html new file mode 100644 index 000000000..72f1e8425 --- /dev/null +++ b/lackey/lk_main.html @@ -0,0 +1,68 @@ + + + + Lackey + + + + + +

Lackey

+
This manual was last updated on 2002-10-03
+

+ +

+njn25@cam.ac.uk
+Copyright © 2000-2002 Nicholas Nethercote +

+Lackey is licensed under the GNU General Public License, +version 2
+Lackey is an example Valgrind skin that does some very basic program +measurement. +

+ +

+ +

1  Lackey

+ +Lackey is a simple Valgrind skin that does some basic program measurement. +It adds quite a lot of simple instrumentation to the program's code. It is +primarily intended to be of use as an example skin. +

+It measures three things: + +

    +
  1. The number of calls to _dl_runtime_resolve(), the function + in glibc's dynamic linker that resolves function lookups into shared + objects.

    + +

  2. The number of UCode instructions (UCode is Valgrind's RISC-like + intermediate language), x86 instructions, and basic blocks executed by the + program, and some ratios between the three counts.

    + +

  3. The number of conditional branches encountered and the proportion of those + taken.

    +

+ +
+ + + diff --git a/memcheck/mc_main.html b/memcheck/mc_main.html new file mode 100644 index 000000000..32177f30f --- /dev/null +++ b/memcheck/mc_main.html @@ -0,0 +1,830 @@ + +--------------------------- + +
  • --partial-loads-ok=yes [the default]
    + --partial-loads-ok=no +

    Controls how Valgrind handles word (4-byte) loads from + addresses for which some bytes are addressable and others + are not. When yes (the default), such loads + do not elicit an address error. Instead, the loaded V bytes + corresponding to the illegal addresses indicate undefined data, and + those corresponding to legal addresses are loaded from shadow + memory, as usual.

    + When no, loads from partially + invalid addresses are treated the same as loads from completely + invalid addresses: an illegal-address error is issued, + and the resulting V bytes indicate valid data. +


  • + +

  • --freelist-vol=<number> [default: 1000000] +

    When the client program releases memory using free (in C) or + delete (C++), that memory is not immediately made available for + re-allocation. Instead it is marked inaccessible and placed in + a queue of freed blocks. The purpose is to delay the point at + which freed-up memory comes back into circulation. This + increases the chance that Valgrind will be able to detect + invalid accesses to blocks for some significant period of time + after they have been freed. +

    + This flag specifies the maximum total size, in bytes, of the + blocks in the queue. The default value is one million bytes. + Increasing this increases the total amount of memory used by + Valgrind but may detect invalid uses of freed blocks which would + otherwise go undetected.


  • + +

  • --leak-check=no [default]
    + --leak-check=yes +

    When enabled, Valgrind searches for memory leaks when the client program + finishes. A memory leak means a malloc'd block which has not + yet been free'd, but to which no pointer can be found. Such a + block can never be free'd by the program, since no pointer to it + exists. Leak checking is disabled by default because it tends + to generate dozens of error messages.


  • + +

  • --show-reachable=no [default]
    + --show-reachable=yes +

    When disabled, the memory leak detector only shows blocks to + which it cannot find a pointer at all, or to which it can only + find a pointer into the middle. These blocks are prime candidates for + memory leaks. When enabled, the leak detector also reports on + blocks to which it could find a pointer. Your program could, at + least in principle, have freed such blocks before exit. + Contrast this with blocks for which no pointer, or only an + interior pointer, could be found: they are more likely to + indicate memory leaks, because you do not actually have a + pointer to the start of the block which you can hand to + free, even if you wanted to.


  • + +

  • --leak-resolution=low [default]
    + --leak-resolution=med
    + --leak-resolution=high +

    When doing leak checking, determines how willing Valgrind is + to consider different backtraces to be the same. When set to + low, the default, only the first two entries need + match. When med, four entries have to match. When + high, all entries need to match. +

    + For hardcore leak debugging, you probably want to use + --leak-resolution=high together with + --num-callers=40 or some such large number. Note + however that this can give an overwhelming amount of + information, which is why the defaults are 4 callers and + low-resolution matching. +

    + Note that the --leak-resolution= setting does not + affect Valgrind's ability to find leaks. It only changes how + the results are presented. +


  • + +

  • --workaround-gcc296-bugs=no [default]
    + --workaround-gcc296-bugs=yes

    When enabled, Valgrind + assumes that reads and writes some small distance below the stack + pointer %esp are due to bugs in gcc 2.96, and does + not report them. The "small distance" is 256 bytes by default. + Note that gcc 2.96 is the default compiler on some popular Linux + distributions (RedHat 7.X, Mandrake) and so you may well need to + use this flag. Do not use it if you do not have to, as it can + cause real errors to be overlooked. Another option is to use a + gcc/g++ which does not generate accesses below the stack + pointer. 2.95.3 seems to be a good choice in this respect.

    + Unfortunately (27 Feb 02) it looks like g++ 3.0.4 has a similar + bug, so you may need to issue this flag if you use 3.0.4. A + while later (early Apr 02) this is confirmed as a scheduling bug + in g++-3.0.4. +


  • + +

  • --cleanup=no
    + --cleanup=yes [default] +

    When enabled, various improvements are applied to the + post-instrumented intermediate code, aimed at removing redundant + value checks.


  • +

    + + + + + +

    2.6  Explanation of error messages

    + +Despite considerable sophistication under the hood, Valgrind can only +really detect two kinds of errors, use of illegal addresses, and use +of undefined values. Nevertheless, this is enough to help you +discover all sorts of memory-management nasties in your code. This +section presents a quick summary of what error messages mean. The +precise behaviour of the error-checking machinery is described in +Section 4. + + +

    2.6.1  Illegal read / Illegal write errors

    +For example: +
    +  Invalid read of size 4
    +     at 0x40F6BBCC: (within /usr/lib/libpng.so.2.1.0.9)
    +     by 0x40F6B804: (within /usr/lib/libpng.so.2.1.0.9)
    +     by 0x40B07FF4: read_png_image__FP8QImageIO (kernel/qpngio.cpp:326)
    +     by 0x40AC751B: QImageIO::read() (kernel/qimage.cpp:3621)
    +     Address 0xBFFFF0E0 is not stack'd, malloc'd or free'd
    +
    + +

    This happens when your program reads or writes memory at a place +which Valgrind reckons it shouldn't. In this example, the program did +a 4-byte read at address 0xBFFFF0E0, somewhere within the +system-supplied library libpng.so.2.1.0.9, which was called from +somewhere else in the same library, called from line 326 of +qpngio.cpp, and so on. + +

    Valgrind tries to establish what the illegal address might relate +to, since that's often useful. So, if it points into a block of +memory which has already been freed, you'll be informed of this, and +also where the block was free'd at. Likewise, if it should turn out +to be just off the end of a malloc'd block, a common result of +off-by-one-errors in array subscripting, you'll be informed of this +fact, and also where the block was malloc'd. + +

    In this example, Valgrind can't identify the address. Actually the +address is on the stack, but, for some reason, this is not a valid +stack address -- it is below the stack pointer, %esp, and that isn't +allowed. In this particular case it's probably caused by gcc +generating invalid code, a known bug in various flavours of gcc. + +

    Note that Valgrind only tells you that your program is about to +access memory at an illegal address. It can't stop the access from +happening. So, if your program makes an access which normally would +result in a segmentation fault, your program will still suffer the same +fate -- but you will get a message from Valgrind immediately prior to +this. In this particular example, reading junk on the stack is +non-fatal, and the program stays alive. + +

    2.6.2  Use of uninitialised values

    +For example: +
    +  Conditional jump or move depends on uninitialised value(s)
    +     at 0x402DFA94: _IO_vfprintf (_itoa.h:49)
    +     by 0x402E8476: _IO_printf (printf.c:36)
    +     by 0x8048472: main (tests/manuel1.c:8)
    +     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
    +
    + +

    An uninitialised-value use error is reported when your program uses +a value which hasn't been initialised -- in other words, is undefined. +Here, the undefined value is used somewhere inside the printf() +machinery of the C library. This error was reported when running the +following small program: +

    +  #include <stdio.h>
    +
    +  int main()
    +  {
    +    int x;
    +    printf ("x = %d\n", x);
    +  }
    +
    + +

    It is important to understand that your program can copy around +junk (uninitialised) data to its heart's content. Valgrind observes +this and keeps track of the data, but does not complain. A complaint +is issued only when your program attempts to make use of uninitialised +data. In this example, x is uninitialised. Valgrind observes the +value being passed to _IO_printf and thence to _IO_vfprintf, but makes +no comment. However, _IO_vfprintf has to examine the value of x so it +can turn it into the corresponding ASCII string, and it is at this +point that Valgrind complains. + +

    Sources of uninitialised data tend to be: +

    + + + +

    2.6.3  Illegal frees

    +For example: +
    +  Invalid free()
    +     at 0x4004FFDF: free (ut_clientmalloc.c:577)
    +     by 0x80484C7: main (tests/doublefree.c:10)
    +     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
    +     by 0x80483B1: (within tests/doublefree)
    +     Address 0x3807F7B4 is 0 bytes inside a block of size 177 free'd
    +     at 0x4004FFDF: free (ut_clientmalloc.c:577)
    +     by 0x80484C7: main (tests/doublefree.c:10)
    +     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
    +     by 0x80483B1: (within tests/doublefree)
    +
    +

    Valgrind keeps track of the blocks allocated by your program with +malloc/new, so it knows exactly whether or not the argument to +free/delete is legitimate. Here, this test program has +freed the same block twice. As with the illegal read/write errors, +Valgrind attempts to make sense of the address free'd. If, as +here, the address is one which has previously been freed, you will +be told that -- making duplicate frees of the same block easy to spot. + +

    2.6.4  When a block is freed with an inappropriate +deallocation function

    +In the following example, a block allocated with new[] +has wrongly been deallocated with free: +
    +  Mismatched free() / delete / delete []
    +     at 0x40043249: free (vg_clientfuncs.c:171)
    +     by 0x4102BB4E: QGArray::~QGArray(void) (tools/qgarray.cpp:149)
    +     by 0x4C261C41: PptDoc::~PptDoc(void) (include/qmemarray.h:60)
    +     by 0x4C261F0E: PptXml::~PptXml(void) (pptxml.cc:44)
    +     Address 0x4BB292A8 is 0 bytes inside a block of size 64 alloc'd
    +     at 0x4004318C: __builtin_vec_new (vg_clientfuncs.c:152)
    +     by 0x4C21BC15: KLaola::readSBStream(int) const (klaola.cc:314)
    +     by 0x4C21C155: KLaola::stream(KLaola::OLENode const *) (klaola.cc:416)
    +     by 0x4C21788F: OLEFilter::convert(QCString const &) (olefilter.cc:272)
    +
    +The following was told to me by the KDE 3 developers. I didn't know +any of it myself. They also implemented the check itself. +

    +In C++ it's important to deallocate memory in a way compatible with +how it was allocated. The deal is: if allocated with +malloc, calloc, realloc, +valloc or memalign, you must deallocate +with free; if allocated with new[], you must +deallocate with delete[]; and if allocated with +new, you must deallocate with delete. +

    +The worst thing is that on Linux apparently it doesn't matter if you +do muddle these up, and it all seems to work ok, but the same program +may then crash on a different platform, Solaris for example. So it's +best to fix it properly. According to the KDE folks "it's amazing how +many C++ programmers don't know this". +

    +Pascal Massimino adds the following clarification: +delete[] must be associated with a +new[] because the compiler stores the size of the array +and the pointer-to-member to the destructor of the array's content +just before the pointer actually returned. This implies a +variable-sized overhead in what's returned by new or +new[]. It is rather surprising how robust compilers [Ed: +runtime-support libraries?] are to mismatches of +new/delete and +new[]/delete[]. + +

    2.6.5  Passing system call parameters with inadequate +read/write permissions

    + +Valgrind checks all parameters to system calls. If a system call +needs to read from a buffer provided by your program, Valgrind checks +that the entire buffer is addressable and has valid data, i.e., it is +readable. And if the system call needs to write to a user-supplied +buffer, Valgrind checks that the buffer is addressable. After the +system call, Valgrind updates its administrative information to +precisely reflect any changes in memory permissions caused by the +system call. + +

    Here's an example of a system call with an invalid parameter: +

    +  #include <stdlib.h>
    +  #include <unistd.h>
    +  int main( void )
    +  {
    +    char* arr = malloc(10);
    +    (void) write( 1 /* stdout */, arr, 10 );
    +    return 0;
    +  }
    +
    + +

    You get this complaint ... +

    +  Syscall param write(buf) contains uninitialised or unaddressable byte(s)
    +     at 0x4035E072: __libc_write
    +     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
    +     by 0x80483B1: (within tests/badwrite)
    +     by <bogus frame pointer> ???
    +     Address 0x3807E6D0 is 0 bytes inside a block of size 10 alloc'd
    +     at 0x4004FEE6: malloc (ut_clientmalloc.c:539)
    +     by 0x80484A0: main (tests/badwrite.c:6)
    +     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
    +     by 0x80483B1: (within tests/badwrite)
    +
    + +

    ... because the program has tried to write uninitialised junk from +the malloc'd block to the standard output. + + +

    2.6.6  Warning messages you might see

    + +Most of these only appear if you run in verbose mode (enabled by +-v): + + + + +

    2.7  Writing suppressions files

    + +A suppression file describes a bunch of errors which, for one reason +or another, you don't want Valgrind to tell you about. Usually the +reason is that the system libraries are buggy but unfixable, at least +within the scope of the current debugging session. Multiple +suppression files are allowed. By default, Valgrind uses +$PREFIX/lib/valgrind/default.supp. + +

    +You can ask to add suppressions from another file, by specifying +--suppressions=/path/to/file.supp. + +

    Each suppression has the following components:
    +

    + +

    +Locations may be either names of shared objects/executables or +function names; they begin with obj: and fun: +respectively. Function and object names to match against may use the +wildcard characters * and ?. + +A suppression only suppresses an error when the error matches all the +details in the suppression. Here's an example: +

    +  {
    +    __gconv_transform_ascii_internal/__mbrtowc/mbtowc
    +    Value4
    +    fun:__gconv_transform_ascii_internal
    +    fun:__mbr*toc
    +    fun:mbtowc
    +  }
    +
    + +

    What it means is: suppress a use-of-uninitialised-value error, when +the data size is 4, when it occurs in the function +__gconv_transform_ascii_internal, when that is called +from any function of name matching __mbr*toc, +when that is called from +mbtowc. It doesn't apply under any other circumstances. +The string by which this suppression is identified to the user is +__gconv_transform_ascii_internal/__mbrtowc/mbtowc. + +

    Another example: +

    +  {
    +    libX11.so.6.2/libX11.so.6.2/libXaw.so.7.0
    +    Value4
    +    obj:/usr/X11R6/lib/libX11.so.6.2
    +    obj:/usr/X11R6/lib/libX11.so.6.2
    +    obj:/usr/X11R6/lib/libXaw.so.7.0
    +  }
    +
    + +

    Suppress any size 4 uninitialised-value error which occurs anywhere +in libX11.so.6.2, when called from anywhere in the same +library, when called from anywhere in libXaw.so.7.0. The +inexact specification of locations is regrettable, but is about all +you can hope for, given that the X11 libraries shipped with Red Hat +7.2 have had their symbol tables removed. + +

    Note -- since the above two examples did not make it clear -- that +you can freely mix the obj: and fun: +styles of description within a single suppression record. + + + + + + +

    3  Details of the checking machinery

    + +Read this section if you want to know, in detail, exactly what and how +Valgrind is checking. + + +

    3.1  Valid-value (V) bits

    + +It is simplest to think of Valgrind implementing a synthetic Intel x86 +CPU which is identical to a real CPU, except for one crucial detail. +Every bit (literally) of data processed, stored and handled by the +real CPU has, in the synthetic CPU, an associated "valid-value" bit, +which says whether or not the accompanying bit has a legitimate value. +In the discussions which follow, this bit is referred to as the V +(valid-value) bit. + +

    Each byte in the system therefore has 8 V bits which follow +it wherever it goes. For example, when the CPU loads a word-size item +(4 bytes) from memory, it also loads the corresponding 32 V bits from +a bitmap which stores the V bits for the process' entire address +space. If the CPU should later write the whole or some part of that +value to memory at a different address, the relevant V bits will be +stored back in the V-bit bitmap. + +

    In short, each bit in the system has an associated V bit, which +follows it around everywhere, even inside the CPU. Yes, the CPU's +(integer and %eflags) registers have their own V bit +vectors. + +

    Copying values around does not cause Valgrind to check for, or +report on, errors. However, when a value is used in a way which might +conceivably affect the outcome of your program's computation, the +associated V bits are immediately checked. If any of these indicate +that the value is undefined, an error is reported. + +

    Here's an (admittedly nonsensical) example: +

    +  int i, j;
    +  int a[10], b[10];
    +  for (i = 0; i < 10; i++) {
    +    j = a[i];
    +    b[i] = j;
    +  }
    +
    + +

    Valgrind emits no complaints about this, since it merely copies +uninitialised values from a[] into b[], and +doesn't use them in any way. However, if the loop is changed to +

    +  for (i = 0; i < 10; i++) {
    +    j += a[i];
    +  }
    +  if (j == 77) 
    +     printf("hello there\n");
    +
    +then Valgrind will complain, at the if, that the +condition depends on uninitialised values. + +

    Most low level operations, such as adds, cause Valgrind to +use the V bits for the operands to calculate the V bits for the +result. Even if the result is partially or wholly undefined, +it does not complain. + +

    Checks on definedness only occur in two places: when a value is +used to generate a memory address, and where a control flow decision +needs to be made. Also, when a system call is detected, Valgrind +checks the definedness of parameters as required. + +

    If a check should detect undefinedness, an error message is +issued. The resulting value is subsequently regarded as well-defined. +To do otherwise would give long chains of error messages. In effect, +we say that undefined values are non-infectious. + +

    This sounds overcomplicated. Why not just check all reads from +memory, and complain if an undefined value is loaded into a CPU register? +Well, that doesn't work well, because perfectly legitimate C programs routinely +copy uninitialised values around in memory, and we don't want endless complaints +about that. Here's the canonical example. Consider a struct +like this: +

    +  struct S { int x; char c; };
    +  struct S s1, s2;
    +  s1.x = 42;
    +  s1.c = 'z';
    +  s2 = s1;
    +
    + +

    The question to ask is: how large is struct S, in +bytes? An int is 4 bytes and a char one byte, so perhaps a struct S +occupies 5 bytes? Wrong. All (non-toy) compilers I know of will +round the size of struct S up to a whole number of words, +in this case 8 bytes. Not doing this forces compilers to generate +truly appalling code for subscripting arrays of struct +S's. + +

    So s1 occupies 8 bytes, yet only 5 of them will be initialised. +For the assignment s2 = s1, gcc generates code to copy +all 8 bytes wholesale into s2 without regard for their +meaning. If Valgrind simply checked values as they came out of +memory, it would yelp every time a structure assignment like this +happened. So the more complicated semantics described above is +necessary. This allows gcc to copy s1 into +s2 any way it likes, and a warning will only be emitted +if the uninitialised values are later used. + +

    One final twist to this story. The above scheme allows garbage to +pass through the CPU's integer registers without complaint. It does +this by giving the integer registers V tags, passing these around in +the expected way. This is complicated and computationally expensive to +do, but is necessary. Valgrind is more simplistic about +floating-point loads and stores. In particular, V bits for data read +as a result of floating-point loads are checked at the load +instruction. So if your program uses the floating-point registers to +do memory-to-memory copies, you will get complaints about +uninitialised values. Fortunately, I have not yet encountered a +program which (ab)uses the floating-point registers in this way. + +

    3.2  Valid-address (A) bits

    + +Notice that the previous section describes how the validity of values +is established and maintained without having to say whether the +program does or does not have the right to access any particular +memory location. We now consider the latter issue. + +

    As described above, every bit in memory or in the CPU has an +associated valid-value (V) bit. In addition, all bytes in memory, but +not in the CPU, have an associated valid-address (A) bit. This +indicates whether or not the program can legitimately read or write +that location. It does not give any indication of the validity of the +data at that location -- that's the job of the V bits -- only whether +or not the location may be accessed. + +

    Every time your program reads or writes memory, Valgrind checks the +A bits associated with the address. If any of them indicate an +invalid address, an error is emitted. Note that the reads and writes +themselves do not change the A bits, only consult them. + +

    So how do the A bits get set/cleared? Like this: + +

    + + + +

    3.3  Putting it all together

    +Valgrind's checking machinery can be summarised as follows: + + + +Valgrind intercepts calls to malloc, calloc, realloc, valloc, +memalign, free, new and delete. The behaviour you get is: + + + + + + + +

    3.5  Memory leak detection

    + +Valgrind keeps track of all memory blocks issued in response to calls +to malloc/calloc/realloc/new. So when the program exits, it knows +which blocks are still outstanding -- have not been returned, in other +words. Ideally, you want your program to have no blocks still in use +at exit. But many programs do. + +

    For each such block, Valgrind scans the entire address space of the +process, looking for pointers to the block. One of three situations +may result: + +

    + +Valgrind reports summaries about leaked and dubious blocks. +For each such block, it will also tell you where the block was +allocated. This should help you figure out why the pointer to it has +been lost. In general, you should attempt to ensure your programs do +not have any leaked or dubious blocks at exit. + +

    The precise area of memory in which Valgrind searches for pointers +is: all naturally-aligned 4-byte words for which all A bits indicate +addressability and all V bits indicate that the stored value is +actually valid. + +


    diff --git a/memcheck/mc_techdocs.html b/memcheck/mc_techdocs.html new file mode 100644 index 000000000..017763412 --- /dev/null +++ b/memcheck/mc_techdocs.html @@ -0,0 +1,2113 @@ + + + + The design and implementation of Valgrind + + + + +  +

    The design and implementation of Valgrind

    + +
    +Detailed technical notes for hackers, maintainers and the +overly-curious
    +These notes pertain to snapshot 20020306
    +

    +jseward@acm.org
    +
    http://developer.kde.org/~sewardj
    +Copyright © 2000-2002 Julian Seward +

    +Valgrind is licensed under the GNU General Public License, +version 2
    +An open-source tool for finding memory-management problems in +x86 GNU/Linux executables. +

    + +

    + + + + +


    + +

    Introduction

    + +This document contains a detailed, highly-technical description of the +internals of Valgrind. This is not the user manual; if you are an +end-user of Valgrind, you do not want to read this. Conversely, if +you really are a hacker-type and want to know how it works, I assume +that you have read the user manual thoroughly. +

    +You may need to read this document several times, and carefully. Some +important things, I only say once. + + +

    History

    + +Valgrind came into public view in late Feb 2002. However, it has been +under contemplation for a very long time, perhaps seriously for about +five years. Somewhat over two years ago, I started working on the x86 +code generator for the Glasgow Haskell Compiler +(http://www.haskell.org/ghc), gaining familiarity with x86 internals +on the way. I then did Cacheprof (http://www.cacheprof.org), gaining +further x86 experience. Some time around Feb 2000 I started +experimenting with a user-space x86 interpreter for x86-Linux. This +worked, but it was clear that a JIT-based scheme would be necessary to +give reasonable performance for Valgrind. Design work for the JITter +started in earnest in Oct 2000, and by early 2001 I had an x86-to-x86 +dynamic translator which could run quite large programs. This +translator was in a sense pointless, since it did not do any +instrumentation or checking. + +

    +Most of the rest of 2001 was taken up designing and implementing the +instrumentation scheme. The main difficulty, which consumed a lot +of effort, was to design a scheme which did not generate large numbers +of false uninitialised-value warnings. By late 2001 a satisfactory +scheme had been arrived at, and I started to test it on ever-larger +programs, with an eventual eye to making it work well enough so that +it was helpful to folks debugging the upcoming version 3 of KDE. I've +used KDE since before version 1.0, and wanted Valgrind to be an +indirect contribution to the KDE 3 development effort. At the start of +Feb 02 the kde-core-devel crew started using it, and gave a huge +amount of helpful feedback and patches in the space of three weeks. +Snapshot 20020306 is the result. + +

    +In the best Unix tradition, or perhaps in the spirit of Fred Brooks' +depressing-but-completely-accurate epitaph "build one to throw away; +you will anyway", much of Valgrind is a second or third rendition of +the initial idea. The instrumentation machinery +(vg_translate.c, vg_memory.c) and core CPU +simulation (vg_to_ucode.c, vg_from_ucode.c) +have had three redesigns and rewrites; the register allocator, +low-level memory manager (vg_malloc2.c) and symbol table +reader (vg_symtab2.c) are on the second rewrite. In a +sense, this document serves to record some of the knowledge gained as +a result. + + +

    Design overview

    + +Valgrind is compiled into a Linux shared object, +valgrind.so, and also a dummy one, +valgrinq.so, of which more later. The +valgrind shell script adds valgrind.so to +the LD_PRELOAD list of extra libraries to be +loaded with any dynamically linked library. This is a standard trick, +one which I assume the LD_PRELOAD mechanism was developed +to support. + +

    +valgrind.so +is linked with the -z initfirst flag, which requests that +its initialisation code is run before that of any other object in the +executable image. When this happens, valgrind gains control. The +real CPU becomes "trapped" in valgrind.so and the +translations it generates. The synthetic CPU provided by Valgrind +does, however, return from this initialisation function. So the +normal startup actions, orchestrated by the dynamic linker +ld.so, continue as usual, except on the synthetic CPU, +not the real one. Eventually main is run and returns, +and then the finalisation code of the shared objects is run, +presumably in inverse order to which they were initialised. Remember, +this is still all happening on the simulated CPU. Eventually +valgrind.so's own finalisation code is called. It spots +this event, shuts down the simulated CPU, prints any error summaries +and/or does leak detection, and returns from the initialisation code +on the real CPU. At this point, in effect the real and synthetic CPUs +have merged back into one, Valgrind has lost control of the program, +and the program finally exit()s back to the kernel in the +usual way. + +

    +The normal course of activity, once Valgrind has started up, is as +follows. Valgrind never runs any part of your program (usually +referred to as the "client"), not a single byte of it, directly. +Instead it uses function VG_(translate) to translate +basic blocks (BBs, straight-line sequences of code) into instrumented +translations, and those are run instead. The translations are stored +in the translation cache (TC), vg_tc, with the +translation table (TT), vg_tt supplying the +original-to-translation code address mapping. Auxiliary array +VG_(tt_fast) is used as a direct-map cache for fast +lookups in TT; it usually achieves a hit rate of around 98% and +facilitates an orig-to-trans lookup in 4 x86 insns, which is not bad. + +

    +Function VG_(dispatch) in vg_dispatch.S is +the heart of the JIT dispatcher. Once a translated code address has +been found, it is executed simply by an x86 call +to the translation. At the end of the translation, the next +original code addr is loaded into %eax, and the +translation then does a ret, taking it back to the +dispatch loop, with, interestingly, zero branch mispredictions. +The address requested in %eax is looked up first in +VG_(tt_fast), and, if not found, by calling C helper +VG_(search_transtab). If there is still no translation +available, VG_(dispatch) exits back to the top-level +C dispatcher VG_(toploop), which arranges for +VG_(translate) to make a new translation. All fairly +unsurprising, really. There are various complexities described below. + +

    +The translator, orchestrated by VG_(translate), is +complicated but entirely self-contained. It is described in great +detail in subsequent sections. Translations are stored in TC, with TT +tracking administrative information. The translations are subject to +an approximate LRU-based management scheme. With the current +settings, the TC can hold at most about 15MB of translations, and LRU +passes prune it to about 13.5MB. Given that the +orig-to-translation expansion ratio is about 13:1 to 14:1, this means +TC holds translations for more or less a megabyte of original code, +which generally comes to about 70000 basic blocks for C++ compiled +with optimisation on. Generating new translations is expensive, so it +is worth having a large TC to minimise the (capacity) miss rate. + +

    +The dispatcher, VG_(dispatch), receives hints from +the translations which allow it to cheaply spot all control +transfers corresponding to x86 call and ret +instructions. It has to do this in order to spot some special events: +

    +Valgrind intercepts the client's malloc, +free, etc, +calls, so that it can store additional information. Each block +malloc'd by the client gives rise to a shadow block +in which Valgrind stores the call stack at the time of the +malloc +call. When the client calls free, Valgrind tries to +find the shadow block corresponding to the address passed to +free, and emits an error message if none can be found. +If it is found, the block is placed on the freed blocks queue +vg_freed_list, it is marked as inaccessible, and +its shadow block now records the call stack at the time of the +free call. Keeping free'd blocks in +this queue allows Valgrind to spot all (presumably invalid) accesses +to them. However, once the volume of blocks in the free queue +exceeds VG_(clo_freelist_vol), blocks are finally +removed from the queue. + +

    +Keeping track of A and V bits (note: if you don't know what these are, +you haven't read the user guide carefully enough) for memory is done +in vg_memory.c. This implements a sparse array structure +which covers the entire 4G address space in a way which is reasonably +fast and reasonably space efficient. The 4G address space is divided +up into 64K sections, each covering 64Kb of address space. Given a +32-bit address, the top 16 bits are used to select one of the 65536 +entries in VG_(primary_map). The resulting "secondary" +(SecMap) holds A and V bits for the 64k of address space +chunk corresponding to the lower 16 bits of the address. + + +

    Design decisions

    + +Some design decisions were motivated by the need to make Valgrind +debuggable. Imagine you are writing a CPU simulator. It works fairly +well. However, you run some large program, like Netscape, and after +tens of millions of instructions, it crashes. How can you figure out +where in your simulator the bug is? + +

    +Valgrind's answer is: cheat. Valgrind is designed so that it is +possible to switch back to running the client program on the real +CPU at any point. Using the --stop-after= flag, you can +ask Valgrind to run just some number of basic blocks, and then +run the rest of the way on the real CPU. If you are searching for +a bug in the simulated CPU, you can use this to do a binary search, +which quickly leads you to the specific basic block which is +causing the problem. + +

    +This is all very handy. It does constrain the design in certain +unimportant ways. Firstly, the layout of memory, when viewed from the +client's point of view, must be identical regardless of whether it is +running on the real or simulated CPU. This means that Valgrind can't +do pointer swizzling -- well, no great loss -- and it can't run on +the same stack as the client -- again, no great loss. +Valgrind operates on its own stack, VG_(stack), which +it switches to at startup, temporarily switching back to the client's +stack when doing system calls for the client. + +

    +Valgrind also receives signals on its own stack, +VG_(sigstack), but for different gruesome reasons +discussed below. + +

    +This nice clean switch-back-to-the-real-CPU-whenever-you-like story +is muddied by signals. Problem is that signals arrive at arbitrary +times and tend to slightly perturb the basic block count, with the +result that you can get close to the basic block causing a problem but +can't home in on it exactly. My kludgey hack is to define +SIGNAL_SIMULATION to 1 towards the bottom of +vg_syscall_mem.c, so that signal handlers are run on the +real CPU and don't change the BB counts. + +

    +A second hole in the switch-back-to-real-CPU story is that Valgrind's +way of delivering signals to the client is different from that of the +kernel. Specifically, the layout of the signal delivery frame, and +the mechanism used to detect a sighandler returning, are different. +So you can't expect to make the transition inside a sighandler and +still have things working, but in practice that's not much of a +restriction. + +

    +Valgrind's implementation of malloc, free, +etc, (in vg_clientmalloc.c, not the low-level stuff in +vg_malloc2.c) is somewhat complicated by the need to +handle switching back at arbitrary points. It does work, though. + + + +

    Correctness

    + +There's only one of me, and I have a Real Life (tm) as well as hacking +Valgrind [allegedly :-]. That means I don't have time to waste +chasing endless bugs in Valgrind. My emphasis is therefore on doing +everything as simply as possible, with correctness, stability and +robustness being the number one priority, more important than +performance or functionality. As a result: + + +

    +Some more specific things are: + +

    + +

    Current limitations

    + +No threads. I think fixing this is close to a research-grade problem. +

    +No MMX. Fixing this should be relatively easy, using the same giant +trick used for x86 FPU instructions. See below. +

    +Support for weird (non-POSIX) signal stuff is patchy. Does anybody +care? +

    + + + + +


    + +

    The instrumenting JITter

    + +This really is the heart of the matter. We begin with various side +issues. + +

    Run-time storage, and the use of host registers

    + +Valgrind translates client (original) basic blocks into instrumented +basic blocks, which live in the translation cache TC, until either the +client finishes or the translations are ejected from TC to make room +for newer ones. +

    +Since it generates x86 code in memory, Valgrind has complete control +of the use of registers in the translations. Now pay attention. I +shall say this only once, and it is important you understand this. In +what follows I will refer to registers in the host (real) cpu using +their standard names, %eax, %edi, etc. I +refer to registers in the simulated CPU by capitalising them: +%EAX, %EDI, etc. These two sets of +registers usually bear no direct relationship to each other; there is +no fixed mapping between them. This naming scheme is used fairly +consistently in the comments in the sources. +

    +Host registers, once things are up and running, are used as follows: +

    + +

    +The state of the simulated CPU is stored in memory, in +VG_(baseBlock), which is a block of 200 words IIRC. +Recall that %ebp points permanently at the start of this +block. Function vg_init_baseBlock decides what the +offsets of various entities in VG_(baseBlock) are to be, +and allocates word offsets for them. The code generator then emits +%ebp relative addresses to get at those things. The +sequence in which entities are allocated has been carefully chosen so +that the 32 most popular entities come first, because this means 8-bit +offsets can be used in the generated code. + +
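The 8-bit-offset point can be made concrete with a tiny sketch (Python for brevity; `disp_bytes` is an invented name, not a Valgrind function). An x86 effective address of the form disp(%ebp) can encode the displacement either as a signed 8-bit disp8 or as a full 32-bit disp32, so the first 32 word slots of VG_(baseBlock) are reachable with a one-byte displacement:

```python
# Why the first 32 words of VG_(baseBlock) get compact addressing:
# word offsets 0..31 give byte offsets 0..124, which fit in a signed
# 8-bit displacement; word offset 32 is byte offset 128, which needs
# the full 32-bit form.

def disp_bytes(word_offset):
    """Bytes of displacement needed to address word_offset(%ebp)."""
    byte_offset = word_offset * 4
    return 1 if -128 <= byte_offset <= 127 else 4
```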

    +If I was clever, I could make %ebp point 32 words along +VG_(baseBlock), so that I'd have another 32 words of +short-form offsets available, but that's just complicated, and it's +not important -- the first 32 words take 99% (or whatever) of the +traffic. + +

    +Currently, the sequence of stuff in VG_(baseBlock) is as +follows: +

    + +

    +As a general rule, the simulated machine's state lives permanently in +memory at VG_(baseBlock). However, the JITter does some +optimisations which allow the simulated integer registers to be +cached in real registers over multiple simulated instructions within +the same basic block. These are always flushed back into memory at +the end of every basic block, so that the in-memory state is +up-to-date between basic blocks. (This flushing is implied by the +statement above that the real machine's allocatable registers are +dead in between simulated blocks). + + +

    Startup, shutdown, and system calls

    + +Getting into Valgrind (VG_(startup), called from +valgrind.so's initialisation section), really means +copying the real CPU's state into VG_(baseBlock), and +then installing our own stack pointer, etc, into the real CPU, and +then starting up the JITter. Exiting valgrind involves copying the +simulated state back to the real state. + +

    +Unfortunately, there's a complication at startup time. Problem is +that at the point where we need to take a snapshot of the real CPU's +state, the offsets in VG_(baseBlock) are not set up yet, +because to do so would involve disrupting the real machine's state +significantly. The way round this is to dump the real machine's state +into a temporary, static block of memory, +VG_(m_state_static). We can then set up the +VG_(baseBlock) offsets at our leisure, and copy into it +from VG_(m_state_static) at some convenient later time. +This copying is done by +VG_(copy_m_state_static_to_baseBlock). + +

    +On exit, the inverse transformation is (rather unnecessarily) used: +stuff in VG_(baseBlock) is copied to +VG_(m_state_static), and the assembly stub then copies +from VG_(m_state_static) into the real machine registers. + +

    +Doing system calls on behalf of the client (vg_syscall.S) +is something of a half-way house. We have to make the world look +sufficiently like what the client would normally see, so that +the syscall actually works properly, but we can't afford to lose +control. So the trick is to copy all of the client's state, except +its program counter, into the real CPU, do the system call, and +copy the state back out. Note that the client's state includes its +stack pointer register, so one effect of this partial restoration is +to cause the system call to be run on the client's stack, as it should +be. + +

    +As ever there are complications. We have to save some of our own state +somewhere when restoring the client's state into the CPU, so that we +can keep going sensibly afterwards. In fact the only thing which is +important is our own stack pointer, but for paranoia reasons I save +and restore our own FPU state as well, even though that's probably +pointless. + +

    +The complication on the above complication is, that for horrible +reasons to do with signals, we may have to handle a second client +system call whilst the client is blocked inside some other system +call (unbelievable!). That means there's two sets of places to +dump Valgrind's stack pointer and FPU state across the syscall, +and we decide which to use by consulting +VG_(syscall_depth), which is in turn maintained by +VG_(wrap_syscall). + + + +

    Introduction to UCode

    + +UCode lies at the heart of the x86-to-x86 JITter. The basic premise +is that dealing with the x86 instruction set head-on is just too darn +complicated, so we do the traditional compiler-writer's trick and +translate it into a simpler, easier-to-deal-with form. + +

    +In normal operation, translation proceeds through six stages, +coordinated by VG_(translate): +

      +
    1. Parsing of an x86 basic block into a sequence of UCode + instructions (VG_(disBB)). +

      +

    2. UCode optimisation (vg_improve), with the aim of + caching simulated registers in real registers over multiple + simulated instructions, and removing redundant simulated + %EFLAGS saving/restoring. +

      +

    3. UCode instrumentation (vg_instrument), which adds + value and address checking code. +

      +

    4. Post-instrumentation cleanup (vg_cleanup), removing + redundant value-check computations. +

      +

    5. Register allocation (vg_do_register_allocation), + which, note, is done on UCode. +

      +

    6. Emission of final instrumented x86 code + (VG_(emit_code)). +
    + +

    +Notice how steps 2, 3, 4 and 5 are simple UCode-to-UCode +transformation passes, all on straight-line blocks of UCode (type +UCodeBlock). Steps 2 and 4 are optimisation passes and +can be disabled for debugging purposes, with +--optimise=no and --cleanup=no respectively. + +

    +Valgrind can also run in a no-instrumentation mode, given +--instrument=no. This is useful for debugging the JITter +quickly without having to deal with the complexity of the +instrumentation mechanism too. In this mode, steps 3 and 4 are +omitted. + +

    +These flags combine, so that --instrument=no together with +--optimise=no means only steps 1, 5 and 6 are used. +--single-step=yes causes each x86 instruction to be +treated as a single basic block. The translations are terrible but +this is sometimes instructive. + +
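The way the flags compose can be summarised in a few lines. This is only an illustrative sketch (the function name is invented, not part of Valgrind) of which of the six stages run under each flag combination, as described in the text:

```python
# Which of the six translation stages run, given the flags.  Per the
# text: --optimise=no drops stage 2, --instrument=no drops stages 3
# and 4, and --cleanup=no drops stage 4 only.

def stages_used(optimise=True, instrument=True, cleanup=True):
    stages = [1]                 # stage 1 (disassembly) always runs
    if optimise:
        stages.append(2)         # UCode optimisation
    if instrument:
        stages.append(3)         # instrumentation
        if cleanup:
            stages.append(4)     # post-instrumentation cleanup
    stages += [5, 6]             # register allocation, code emission
    return stages
```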

    +The --stop-after=N flag switches back to the real CPU +after N basic blocks. It also re-JITs the final basic +block executed and prints the resulting debugging info, so this +gives you a way to get a quick snapshot of how a basic block looks as +it passes through the six stages mentioned above. If you want to +see full information for every block translated (probably not, but +still ...) find, in VG_(translate), the lines +
    dis = True; +
    dis = debugging_translation; +
    +and comment out the second line. This will spew out debugging +junk faster than you can possibly imagine. + + + +

    UCode operand tags: type Tag

    + +UCode is, more or less, a simple two-address RISC-like code. In +keeping with the x86 AT&T assembly syntax, generally speaking the +first operand is the source operand, and the second is the destination +operand, which is modified when the uinstr is notionally executed. + +

    +UCode instructions have up to three operand fields, each of which has +a corresponding Tag describing it. Possible values for +the tag are: + +

    + + +

    UCode instructions: type UInstr

    + +

    +UCode was carefully designed to make it possible to do register +allocation on UCode and then translate the result into x86 code +without needing any extra registers ... well, that was the original +plan, anyway. Things have gotten a little more complicated since +then. In what follows, UCode instructions are referred to as uinstrs, +to distinguish them from x86 instructions. Uinstrs of course have +uopcodes which are (naturally) different from x86 opcodes. + +

    +A uinstr (type UInstr) contains +various fields, not all of which are used by any one uopcode: +

    + +

    +UOpcodes (type Opcode) are divided into two groups: those +necessary merely to express the functionality of the x86 code, and +extra uopcodes needed to express the instrumentation. The former +group contains: +

    + +

    +Stages 1 and 2 of the 6-stage translation process mentioned above +deal purely with these uopcodes, and no others. They are +sufficient to express pretty much all the x86 32-bit protected-mode +instruction set, at +least everything understood by a pre-MMX original Pentium (P54C). + +

    +Stages 3, 4, 5 and 6 also deal with the following extra +"instrumentation" uopcodes. They are used to express all the +definedness-tracking and -checking machinery which valgrind does. In +later sections we show how to create checking code for each of the +uopcodes above. Note that these instrumentation uopcodes, although +some appear complicated, have been carefully chosen so that +efficient x86 code can be generated for them. GNU superopt v2.5 did a +great job helping out here. Anyways, the uopcodes are as follows: + +

    + +

    +These 10 uopcodes are sufficient to express Valgrind's entire +definedness-checking semantics. In fact most of the interesting magic +is done by the TAG1 and TAG2 +suboperations. + +

    +First, however, I need to explain about V-vector operation sizes. +There are 4 sizes: 1, 2 and 4, which operate on groups of 8, 16 and 32 +V bits at a time, supporting the usual 1, 2 and 4 byte x86 operations. +However there is also the mysterious size 0, which really means a +single V bit. Single V bits are used in various circumstances; in +particular, the definedness of %EFLAGS is modelled with a +single V bit. Now might be a good time to also point out that for +V bits, 1 means "undefined" and 0 means "defined". Similarly, for A +bits, 1 means "invalid address" and 0 means "valid address". This +seems counterintuitive (and so it is), but testing against zero on +x86s saves instructions compared to testing against all 1s, because +many ALU operations set the Z flag for free, so to speak. + +

    +With that in mind, the tag ops are: + +

    + +

    +That's all the tag ops. If you stare at this long enough, and then +run Valgrind and stare at the pre- and post-instrumented ucode, it +should be fairly obvious how the instrumentation machinery hangs +together. + +
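To help with that staring, here is a small Python model of the core tag-op semantics. This is an illustration, not Valgrind code: the 1-means-undefined encoding is from the text above, while the exact definitions of Left, the pessimising cast, and the AND improvement term are my reading of the scheme and should be checked against vg_translate.c:

```python
# V-bit tag-op sketch.  Convention (from the text): 1 = undefined,
# 0 = defined.  All function names shadow the ucode tag ops.

MASK32 = 0xFFFFFFFF

def uifu(v1, v2):
    # Undefined-if-Undefined: undefined where either input is
    # undefined -- bitwise OR in this encoding.
    return v1 | v2

def difd(v1, v2):
    # Defined-if-Defined: defined where either input is defined --
    # bitwise AND in this encoding.
    return v1 & v2

def improve_and_tq(value, vbits):
    # Improvement term for AND (T = the value, Q = its V bits): where
    # a value bit is a *defined zero*, the AND result bit is a defined
    # zero regardless of the other operand.
    return value | vbits

def left(v):
    # Smear undefinedness towards the MSB (modelling carry
    # propagation in additions): left(x) = x | -x.
    return (v | (-v & MASK32)) & MASK32

def pcast(v, bits=32):
    # Pessimising cast: if anything is undefined, everything is.
    return (1 << bits) - 1 if v != 0 else 0

def and_vbits(val1, v1, val2, v2):
    # V bits for "val1 AND val2", mirroring uinstrs 27..33 of the
    # instrumented example: naive UifU, then DifD in each operand's
    # improvement term.
    naive = uifu(v1, v2)
    naive = difd(improve_and_tq(val1, v1), naive)
    return difd(improve_and_tq(val2, v2), naive)
```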

    +One point, if you do this: in order to make it easy to differentiate +TempRegs carrying values from TempRegs +carrying V bit vectors, Valgrind prints the former as (for example) +t28 and the latter as q28; the fact that +they carry the same number serves to indicate their relationship. +This is purely for the convenience of the human reader; the register +allocator and code generator don't regard them as different. + + +

    Translation into UCode

    + +VG_(disBB) allocates a new UCodeBlock and +then uses disInstr to translate x86 instructions one at a +time into UCode, dumping the result in the UCodeBlock. +This goes on until a control-flow transfer instruction is encountered. + +

    +Despite the large size of vg_to_ucode.c, this translation +is really very simple. Each x86 instruction is translated entirely +independently of its neighbours, merrily allocating new +TempRegs as it goes. The idea is to have a simple +translator -- in reality, no more than a macro-expander -- and the +resulting bad UCode translation is cleaned up by the UCode +optimisation phase which follows. To give you an idea, here are some x86 +instructions and their translations (this is a complete basic block, +as Valgrind sees it): +

    +        0x40435A50:  incl %edx
    +
    +           0: GETL      %EDX, t0
    +           1: INCL      t0  (-wOSZAP)
    +           2: PUTL      t0, %EDX
    +
    +        0x40435A51:  movsbl (%edx),%eax
    +
    +           3: GETL      %EDX, t2
    +           4: LDB       (t2), t2
    +           5: WIDENL_Bs t2
    +           6: PUTL      t2, %EAX
    +
    +        0x40435A54:  testb $0x20, 1(%ecx,%eax,2)
    +
    +           7: GETL      %EAX, t6
    +           8: GETL      %ECX, t8
    +           9: LEA2L     1(t8,t6,2), t4
    +          10: LDB       (t4), t10
    +          11: MOVB      $0x20, t12
    +          12: ANDB      t12, t10  (-wOSZACP)
    +          13: INCEIPo   $9
    +
    +        0x40435A59:  jnz-8 0x40435A50
    +
    +          14: Jnzo      $0x40435A50  (-rOSZACP)
    +          15: JMPo      $0x40435A5B
    +
    + +

    +Notice how the block always ends with an unconditional jump to the +next block. This is a bit unnecessary, but makes many things simpler. + +

    +Most x86 instructions turn into sequences of GET, +PUT, LEA1, LEA2, +LOAD and STORE. Some complicated ones, +however, rely on calling helper bits of code in +vg_helpers.S. The ucode instructions PUSH, +POP, CALL, CALLM_S and +CALLM_E support this. The calling convention is somewhat +ad-hoc and is not the C calling convention. The helper routines must +save all integer registers, and the flags, that they use. Args are +passed on the stack underneath the return address, as usual, and if +results are to be returned, they are either placed in dummy arg +slots created by the ucode PUSH sequence, or just +overwrite the incoming args. + +

    +In order that the instrumentation mechanism can handle calls to these +helpers, VG_(saneUCodeBlock) enforces the following +restrictions on calls to helpers: + +

    + +Some of the translations may appear to have redundant +TempReg-to-TempReg moves. This helps the +next phase, UCode optimisation, to generate better code. + + + +

    UCode optimisation

    + +UCode is then subjected to an improvement pass +(vg_improve()), which blurs the boundaries between the +translations of the original x86 instructions. It's pretty +straightforward. Three transformations are done: + + + +The effect of these transformations on our short block is rather +unexciting, and shown below. On longer basic blocks they can +dramatically improve code quality. + +
    +at 3: delete GET, rename t2 to t0 in (4 .. 6)
    +at 7: delete GET, rename t6 to t0 in (8 .. 9)
    +at 1: annul flag write OSZAP due to later OSZACP
    +
    +Improved code:
    +           0: GETL      %EDX, t0
    +           1: INCL      t0
    +           2: PUTL      t0, %EDX
    +           4: LDB       (t0), t0
    +           5: WIDENL_Bs t0
    +           6: PUTL      t0, %EAX
    +           8: GETL      %ECX, t8
    +           9: LEA2L     1(t8,t0,2), t4
    +          10: LDB       (t4), t10
    +          11: MOVB      $0x20, t12
    +          12: ANDB      t12, t10  (-wOSZACP)
    +          13: INCEIPo   $9
    +          14: Jnzo      $0x40435A50  (-rOSZACP)
    +          15: JMPo      $0x40435A5B
    +
    + +

    UCode instrumentation

    + +Once you understand the meaning of the instrumentation uinstrs, +discussed in detail above, the instrumentation scheme is fairly +straightforward. Each uinstr is instrumented in isolation, and the +instrumentation uinstrs are placed before the original uinstr. +Our running example continues below. I have placed a blank line +after every original ucode, to make it easier to see which +instrumentation uinstrs correspond to which originals. + +

    +As mentioned somewhere above, TempRegs carrying values +have names like t28, and each one has a shadow carrying +its V bits, with names like q28. This pairing aids in +reading instrumented ucode. + +

    +One decision about all this is where to have "observation points", +that is, where to check that V bits are valid. I use a minimalistic +scheme, only checking where a failure of validity could cause the +original program to (seg)fault. So the use of values as memory +addresses causes a check, as do conditional jumps (these cause a check +on the definedness of the condition codes). And arguments +PUSHed for helper calls are checked, hence the weird +restrictions on helper call preambles described above. + +

    +Another decision is that once a value is tested, it is thereafter +regarded as defined, so that we do not emit multiple undefined-value +errors for the same undefined value. That means that +TESTV uinstrs are always followed by SETV +on the same (shadow) TempRegs. Most of these +SETVs are redundant and are removed by the +post-instrumentation cleanup phase. + +

    +The instrumentation for calling helper functions deserves further +comment. The definedness of results from a helper is modelled using +just one V bit. So, in short, we do pessimising casts of the +definedness of all the args, down to a single bit, and then +UifU these bits together. So this single V bit will say +"undefined" if any part of any arg is undefined. This V bit is then +pessimally cast back up to the result(s) sizes, as needed. If Valgrind +sees that the result of the call is not actually used -- all the args +are got rid of with CLEAR and none with +POP -- it immediately examines the result V bit with a +TESTV -- SETV pair. If it did not do this, +there would be no observation point to detect that some of the +args to the helper were undefined. Of course, if the helper's results +are indeed used, we don't do this, since the result usage will +presumably cause the result definedness to be checked at some suitable +future point. + +
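The cast-down / UifU / cast-up dance for helper calls looks roughly like this (an illustrative Python sketch with invented names, not Valgrind's code):

```python
# Helper-call approximation: pessimally cast each arg's V bits down
# to one bit, UifU (OR) the single bits together, then cast the
# result back up to the result size.

def pcast_down(vbits):
    return 1 if vbits != 0 else 0        # any undefined bit -> undefined

def pcast_up(bit, size_bits):
    return (1 << size_bits) - 1 if bit else 0

def helper_result_vbits(arg_vbits, result_size_bits):
    acc = 0
    for v in arg_vbits:
        acc |= pcast_down(v)             # UifU of the single bits
    return pcast_up(acc, result_size_bits)
```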

    +In general Valgrind tries to track definedness on a bit-for-bit basis, +but as the above para shows, for calls to helpers we throw in the +towel and approximate down to a single bit. This is because it's too +complex and difficult to track bit-level definedness through complex +ops such as integer multiply and divide, and in any case there are no +reasonable code fragments which attempt to (eg) multiply two +partially-defined values and end up with something meaningful, so +there seems little point in modelling multiplies, divides, etc, at +that level of detail. + +

    +Integer loads and stores are instrumented with firstly a test of the +definedness of the address, followed by a LOADV or +STOREV respectively. These turn into calls to +(for example) VG_(helperc_LOADV4). These helpers do two +things: they perform an address-valid check, and they load or store V +bits from/to the relevant address in the (simulated V-bit) memory. + +
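The two jobs of such a helper can be modelled like so. This is a toy: the real helper walks VG_(primary_map) and a secondary map, whereas a dict stands in for both here, and the names are invented:

```python
# Toy model of a LOADV helper: for each byte, check the A bit
# (1 = invalid address) and gather the V bits (1 = undefined) from
# shadow memory.  Unmapped addresses count as invalid and undefined.

A_INVALID, V_UNDEFINED = 1, 0xFF

shadow = {0x1000: (0, 0x00),     # valid, fully defined byte
          0x1001: (0, 0xFF)}     # valid, fully undefined byte

def loadv_toy(addr, nbytes):
    addr_error = False
    vbits = 0
    for i in range(nbytes):
        a, v = shadow.get(addr + i, (A_INVALID, V_UNDEFINED))
        if a == A_INVALID:
            addr_error = True    # address-valid check fails
        vbits |= v << (8 * i)    # accumulate per-byte V bits
    return addr_error, vbits
```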

    +FPU loads and stores are different. As above, the definedness of the +address is first tested. However, the helper routine for FPU loads +(VGM_(fpu_read_check)) emits an error if either the +address is invalid or the referenced area contains undefined values. +It has to do this because we do not simulate the FPU at all, and so +cannot track the definedness of values loaded into it from memory; we +therefore have to check them as soon as they are loaded into the FPU, ie, at +this point. We notionally assume that everything in the FPU is +defined. + +

    +It follows therefore that FPU writes first check the definedness of +the address, then the validity of the address, and finally mark the +written bytes as well-defined. + +

    +If anyone is inspired to extend Valgrind to MMX/SSE insns, I suggest +you use the same trick. It works provided that the FPU/MMX unit is +not used merely as a conduit to copy partially undefined data from +one place in memory to another. Unfortunately the integer CPU is used +like that (when copying C structs with holes, for example) and this is +the cause of much of the elaborateness of the instrumentation +described here. + +

    +vg_instrument() in vg_translate.c actually +does the instrumentation. There are comments explaining how each +uinstr is handled, so we do not repeat that here. As explained +already, it is bit-accurate, except for calls to helper functions. +Unfortunately the x86 insns bt/bts/btc/btr are done by +helper fns, so bit-level accuracy is lost there. This should be fixed +by doing them inline; it will probably require adding a couple of new +uinstrs. Also, left and right rotates through the carry flag (x86 +rcl and rcr) are approximated via a single +V bit; so far this has not caused anyone to complain. The +non-carry rotates, rol and ror, are much +more common and are done exactly. Re-visiting the instrumentation for +AND and OR, they seem rather verbose, and I wonder if it could be done +more concisely now. + +

    +The lowercase o on many of the uopcodes in the running +example indicates that the size field is zero, usually meaning a +single-bit operation. + +

    +Anyroads, the post-instrumented version of our running example looks +like this: + +

    +Instrumented code:
    +           0: GETVL     %EDX, q0
    +           1: GETL      %EDX, t0
    +
    +           2: TAG1o     q0 = Left4 ( q0 )
    +           3: INCL      t0
    +
    +           4: PUTVL     q0, %EDX
    +           5: PUTL      t0, %EDX
    +
    +           6: TESTVL    q0
    +           7: SETVL     q0
    +           8: LOADVB    (t0), q0
    +           9: LDB       (t0), t0
    +
    +          10: TAG1o     q0 = SWiden14 ( q0 )
    +          11: WIDENL_Bs t0
    +
    +          12: PUTVL     q0, %EAX
    +          13: PUTL      t0, %EAX
    +
    +          14: GETVL     %ECX, q8
    +          15: GETL      %ECX, t8
    +
    +          16: MOVL      q0, q4
    +          17: SHLL      $0x1, q4
    +          18: TAG2o     q4 = UifU4 ( q8, q4 )
    +          19: TAG1o     q4 = Left4 ( q4 )
    +          20: LEA2L     1(t8,t0,2), t4
    +
    +          21: TESTVL    q4
    +          22: SETVL     q4
    +          23: LOADVB    (t4), q10
    +          24: LDB       (t4), t10
    +
    +          25: SETVB     q12
    +          26: MOVB      $0x20, t12
    +
    +          27: MOVL      q10, q14
    +          28: TAG2o     q14 = ImproveAND1_TQ ( t10, q14 )
    +          29: TAG2o     q10 = UifU1 ( q12, q10 )
    +          30: TAG2o     q10 = DifD1 ( q14, q10 )
    +          31: MOVL      q12, q14
    +          32: TAG2o     q14 = ImproveAND1_TQ ( t12, q14 )
    +          33: TAG2o     q10 = DifD1 ( q14, q10 )
    +          34: MOVL      q10, q16
    +          35: TAG1o     q16 = PCast10 ( q16 )
    +          36: PUTVFo    q16
    +          37: ANDB      t12, t10  (-wOSZACP)
    +
    +          38: INCEIPo   $9
    +
    +          39: GETVFo    q18
    +          40: TESTVo    q18
    +          41: SETVo     q18
    +          42: Jnzo      $0x40435A50  (-rOSZACP)
    +
    +          43: JMPo      $0x40435A5B
    +
    + + +

    UCode post-instrumentation cleanup

    + +

    +This pass, coordinated by vg_cleanup(), removes redundant +definedness computation created by the simplistic instrumentation +pass. It consists of two passes, +vg_propagate_definedness() followed by +vg_delete_redundant_SETVs. + +

    +vg_propagate_definedness() is a simple +constant-propagation and constant-folding pass. It tries to determine +which TempRegs containing V bits will always indicate +"fully defined", and it propagates this information as far as it can, +and folds out as many operations as possible. For example, the +instrumentation for an ADD of a literal to a variable quantity will be +reduced down so that the definedness of the result is simply the +definedness of the variable quantity, since the literal is by +definition fully defined. + +
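The folding rule for the ADD-of-a-literal case amounts to this (an illustrative Python fragment, not Valgrind's code):

```python
# UifU with an operand known to be fully defined (all V bits 0, as
# the shadow of a literal always is) is the identity, so the
# instrumentation for ADD literal, reg collapses to the variable
# operand's definedness.

FULLY_DEFINED = 0

def fold_uifu(v1, v2):
    if v1 == FULLY_DEFINED:
        return v2                # literal contributes nothing
    if v2 == FULLY_DEFINED:
        return v1
    return v1 | v2               # general UifU (1 = undefined)
```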

    +vg_delete_redundant_SETVs removes SETVs on +shadow TempRegs for which the next action is a write. +I don't think there's anything else worth saying about this; it is +simple. Read the sources for details. + +

    +So the cleaned-up running example looks like this. As above, I have +inserted line breaks after every original (non-instrumentation) uinstr +to aid readability. As with straightforward ucode optimisation, the +results in this block are undramatic because it is so short; longer +blocks benefit more because they have more redundancy which gets +eliminated. + + +

    +at 29: delete UifU1 due to defd arg1
    +at 32: change ImproveAND1_TQ to MOV due to defd arg2
    +at 41: delete SETV
    +at 31: delete MOV
    +at 25: delete SETV
    +at 22: delete SETV
    +at 7: delete SETV
    +
    +           0: GETVL     %EDX, q0
    +           1: GETL      %EDX, t0
    +
    +           2: TAG1o     q0 = Left4 ( q0 )
    +           3: INCL      t0
    +
    +           4: PUTVL     q0, %EDX
    +           5: PUTL      t0, %EDX
    +
    +           6: TESTVL    q0
    +           8: LOADVB    (t0), q0
    +           9: LDB       (t0), t0
    +
    +          10: TAG1o     q0 = SWiden14 ( q0 )
    +          11: WIDENL_Bs t0
    +
    +          12: PUTVL     q0, %EAX
    +          13: PUTL      t0, %EAX
    +
    +          14: GETVL     %ECX, q8
    +          15: GETL      %ECX, t8
    +
    +          16: MOVL      q0, q4
    +          17: SHLL      $0x1, q4
    +          18: TAG2o     q4 = UifU4 ( q8, q4 )
    +          19: TAG1o     q4 = Left4 ( q4 )
    +          20: LEA2L     1(t8,t0,2), t4
    +
    +          21: TESTVL    q4
    +          23: LOADVB    (t4), q10
    +          24: LDB       (t4), t10
    +
    +          26: MOVB      $0x20, t12
    +
    +          27: MOVL      q10, q14
    +          28: TAG2o     q14 = ImproveAND1_TQ ( t10, q14 )
    +          30: TAG2o     q10 = DifD1 ( q14, q10 )
    +          32: MOVL      t12, q14
    +          33: TAG2o     q10 = DifD1 ( q14, q10 )
    +          34: MOVL      q10, q16
    +          35: TAG1o     q16 = PCast10 ( q16 )
    +          36: PUTVFo    q16
    +          37: ANDB      t12, t10  (-wOSZACP)
    +
    +          38: INCEIPo   $9
    +          39: GETVFo    q18
    +          40: TESTVo    q18
    +          42: Jnzo      $0x40435A50  (-rOSZACP)
    +
    +          43: JMPo      $0x40435A5B
    +
    + + +

    Translation from UCode

    + +This is all very simple, even though vg_from_ucode.c +is a big file. Position-independent x86 code is generated into +a dynamically allocated array emitted_code; this is +doubled in size when it overflows. Eventually the array is handed +back to the caller of VG_(translate), who must copy +the result into TC and TT, and free the array. + +

    +This file is structured into four layers of abstraction, which, +thankfully, are glued back together with extensive +__inline__ directives. From the bottom upwards: + +

    + +

    +Some comments: +

    + +

    + +And so ... that's the end of the documentation for the instrumenting +translator! It's really not that complex, because it's composed as a +sequence of simple(ish) self-contained transformations on +straight-line blocks of code. + +

    Top-level dispatch loop

    + +Urk. In VG_(toploop). This is basically boring and +unsurprising, not to mention fiddly and fragile. It needs to be +cleaned up. + +

    +Perhaps the only surprise is that the whole thing runs +on top of a setjmp-installed exception handler because, +if a translation gets a segfault, we have to bail out of the +Valgrind-supplied exception handler VG_(oursignalhandler) +and immediately start running the client's segfault handler, if it has +one. In particular we can't finish the current basic block and then +deliver the signal at some convenient future point, because signals +like SIGILL, SIGSEGV and SIGBUS mean that the faulting insn should not +simply be re-tried. (I'm sure there is a clearer way to explain this). + +

    Exceptions, creating new translations

    +

    Self-modifying code

    + +

    Lazy updates of the simulated program counter

    + +Simulated %EIP is not updated after every simulated x86 +insn as this was regarded as too expensive. Instead ucode +INCEIP insns move it along as and when necessary. +Currently we don't allow it to fall more than 4 bytes behind reality +(see VG_(disBB) for the way this works). +

    +Note that %EIP is always brought up to date by the inner +dispatch loop in VG_(dispatch), so that if the client +takes a fault we know at least which basic block this happened in. + + +

    The translation cache and translation table

    + +

    Signals

    + +Horrible, horrible. vg_signals.c. +Basically, since we have to intercept all system +calls anyway, we can see when the client tries to install a signal +handler. If it does so, we make a note of what the client asked to +happen, and ask the kernel to route the signal to our own signal +handler, VG_(oursignalhandler). This simply notes the +delivery of signals, and returns. + +

    +Every 1000 basic blocks, we see if more signals have arrived. If so, +VG_(deliver_signals) builds signal delivery frames on the +client's stack, and allows their handlers to be run. Valgrind places +in these signal delivery frames a bogus return address, +VG_(signalreturn_bogusRA), and checks all jumps to see +if any jump to it. If one does, this is a sign that a signal handler is +returning, so Valgrind removes the relevant signal frame from +the client's stack, restores from the signal frame the simulated +state as it was before the signal was delivered, and allows the client to run +onwards. We have to do it this way because some signal handlers never +return, they just longjmp(), which nukes the signal +delivery frame. + +

    +The Linux kernel has a different but equally horrible hack for +detecting signal handler returns. Discovering it is left as an +exercise for the reader. + + + +

    Errors, error contexts, error reporting, suppressions

    +

    Client malloc/free

    +

    Low-level memory management

    +

    A and V bitmaps

    +

    Symbol table management

    +

    Dealing with system calls

    +

    Namespace management

    +

    GDB attaching

    +

    Non-dependence on glibc or anything else

    +

    The leak detector

    +

    Performance problems

    +

    Continuous sanity checking

    +

    Tracing, or not tracing, child processes

    +

    Assembly glue for syscalls

    + + +
    + +

    Extensions

    + +Some comments about Stuff To Do. + +

    Bugs

    + +Stephan Kulow and Marc Mutz report problems with kmail in KDE 3 CVS +(RC2 ish) when run on Valgrind. Stephan has it deadlocking; Marc has +it looping at startup. I can't repro either behaviour. Needs +repro-ing and fixing. + + +

    Threads

    + +Doing a good job of thread support strikes me as almost a +research-level problem. The central issues are how to do fast cheap +locking of the VG_(primary_map) structure, whether or not +accesses to the individual secondary maps need locking, what +race-condition issues result, and whether the already-nasty mess that +is the signal simulator needs further hackery. + +

    +I realise that threads are the most-frequently-requested feature, and +I am thinking about it all. If you have guru-level understanding of +fast mutual exclusion mechanisms and race conditions, I would be +interested in hearing from you. + + +

    Verification suite

    + +Directory tests/ contains various ad-hoc tests for +Valgrind. However, there is no systematic verification or regression +suite that, for example, exercises all the stuff in +vg_memory.c to ensure that illegal memory accesses and +undefined value uses are detected as they should be. It would be good +to have such a suite. + +

    Porting to other platforms

    + +It would be great if Valgrind were ported to FreeBSD and x86 NetBSD, +and to x86 OpenBSD, if it's possible (doesn't OpenBSD use a.out-style +executables, not ELF?) + +

    +The main difficulties, for an x86-ELF platform, seem to be: + +

    + +All in all, I think a port to x86-ELF *BSDs is not really very +difficult, and in some ways I would like to see it happen, because +that would force a more clear factoring of Valgrind into platform +dependent and independent pieces. Not to mention, *BSD folks also +deserve to use Valgrind just as much as the Linux crew do. + + +

    +


    + +

    Easy stuff which ought to be done

    + +

    MMX instructions

    + +MMX insns should be supported, using the same trick as for FPU insns. +If the MMX registers are not used to copy uninitialised junk from one +place to another in memory, this means we don't have to actually +simulate the internal MMX unit state, so the FPU hack applies. This +should be fairly easy. + + + +

    Fix stabs-info reader

+ +The machinery in vg_symtab2.c which reads "stabs" style +debugging info is pretty weak. It usually correctly translates +simulated program counter values into line numbers and procedure +names, but the file name is often completely wrong; the logic used to +parse "stabs" entries is flawed and should be fixed. +The simplest solution, IMO, is to copy either the logic or simply the +code out of GNU binutils which does this; since GDB can clearly get it +right, binutils (or GDB?) must have code to do this somewhere. + + + + + +

    BT/BTC/BTS/BTR

+ +These are x86 instructions which test, complement, set, or reset a +single bit in a word. At the moment they are both incorrectly +implemented and incorrectly instrumented. + +

    +The incorrect instrumentation is due to use of helper functions. This +means we lose bit-level definedness tracking, which could wind up +giving spurious uninitialised-value use errors. The Right Thing to do +is to invent a couple of new UOpcodes, I think GET_BIT +and SET_BIT, which can be used to implement all 4 x86 +insns, get rid of the helpers, and give bit-accurate instrumentation +rules for the two new UOpcodes. + +

    +I realised the other day that they are mis-implemented too. The x86 +insns take a bit-index and a register or memory location to access. +For registers the bit index clearly can only be in the range zero to +register-width minus 1, and I assumed the same applied to memory +locations too. But evidently not; for memory locations the index can +be arbitrary, and the processor will index arbitrarily into memory as +a result. This too should be fixed. Sigh. Presumably indexing +outside the immediate word is not actually used by any programs yet +tested on Valgrind, for otherwise they (presumably) would simply not +work at all. If you plan to hack on this, first check the Intel docs +to make sure my understanding is really correct. + + + +
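As a sanity check on that reading of the Intel docs, here is a small C model (an illustration only, not Valgrind code) of BTS with a memory operand. The point is that the bit index selects both a word offset from the base address and a bit within that word, so it can reach arbitrarily far from the addressed location:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative C model of x86 BTS with a memory operand: the bit
   index picks out a word at an arbitrary offset from the base
   address, not merely a bit within the addressed word itself. */
static int bts_mem(uint32_t *base, long bitidx)
{
   uint32_t *word = base + (bitidx >> 5);   /* may be far from *base */
   uint32_t  mask = 1u << (bitidx & 31);
   int old_bit = (*word & mask) != 0;       /* CF receives the old bit */
   *word |= mask;                           /* then the bit is set */
   return old_bit;
}
```

Bit index 40, for instance, lands in the word one past the base, at bit 8.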

    Using PREFETCH instructions

+ +Here's a small but potentially interesting project for performance +junkies. Experiments with Valgrind's code generator and optimiser(s) +suggest that reducing the number of instructions executed in the +translations and mem-check helpers gives disappointingly small +performance improvements. Perhaps this is because performance of +Valgrindified code is limited by cache misses. After all, each read +in the original program now gives rise to at least three reads, one +from the VG_(primary_map), one from the resulting +secondary map, and the original. Not to mention, the instrumented +translations are 13 to 14 times larger than the originals. All in all +one would expect the memory system to be hammered to hell and then +some. + +

    +So here's an idea. An x86 insn involving a read from memory, after +instrumentation, will turn into ucode of the following form: +

    +    ... calculate effective addr, into ta and qa ...
    +    TESTVL qa             -- is the addr defined?
    +    LOADV (ta), qloaded   -- fetch V bits for the addr
    +    LOAD  (ta), tloaded   -- do the original load
    +
    +At the point where the LOADV is done, we know the actual +address (ta) from which the real LOAD will +be done. We also know that the LOADV will take around +20 x86 insns to do. So it seems plausible that doing a prefetch of +ta just before the LOADV might just avoid a +miss at the LOAD point, and that might be a significant +performance win. + +

+Prefetch insns are notoriously temperamental, more often than not +making things worse rather than better, so this would require +considerable fiddling around. It's complicated because Intels and +AMDs have different prefetch insns with different semantics, so that +too needs to be taken into account. As a general rule, even placing +the prefetches immediately before the LOADV insn is too near the +LOAD; the ideal distance is apparently circa 200 CPU +cycles. So it might be worth having another analysis/transformation +pass which pushes prefetches as far back as possible, hopefully +immediately after the effective address becomes available. + +

+Doing too many prefetches is also bad because they soak up bus +bandwidth / cpu resources, so some cleverness in deciding which loads +to prefetch and which not to might be helpful. One can imagine not +prefetching client-stack-relative (%EBP or +%ESP) accesses, since the stack in general tends to show +good locality anyway. + +
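The distance idea can be sketched with GCC's __builtin_prefetch, a portable stand-in for the raw Intel/AMD prefetch insns; the 64-element distance here is an arbitrary tuning knob picked for illustration, not a measured optimum:

```c
#include <stddef.h>

/* Prefetch-ahead sketch: issue the fetch well before the data is
   needed, so the cache line has time to arrive.  PF_DIST is an
   assumed tuning parameter, not a measured value. */
enum { PF_DIST = 64 };

static long sum_with_prefetch(const int *a, size_t n)
{
   long s = 0;
   for (size_t i = 0; i < n; i++) {
      if (i + PF_DIST < n)
         /* rw=0: read; locality=1: low temporal reuse expected */
         __builtin_prefetch(&a[i + PF_DIST], 0, 1);
      s += a[i];
   }
   return s;
}
```

Whether this wins anything is exactly the experimental question: a prefetch is only a hint, and on a short or cache-resident array it is pure overhead.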

    +There's quite a lot of experimentation to do here, but I think it +might make an interesting week's work for someone. + +

    +As of 15-ish March 2002, I've started to experiment with this, using +the AMD prefetch/prefetchw insns. + + + +

    User-defined permission ranges

    + +This is quite a large project -- perhaps a month's hacking for a +capable hacker to do a good job -- but it's potentially very +interesting. The outcome would be that Valgrind could detect a +whole class of bugs which it currently cannot. + +

    +The presentation falls into two pieces. + +

    +Part 1: user-defined address-range permission setting +

+ +Valgrind intercepts the client's malloc, +free, etc calls, watches system calls, and watches the +stack pointer move. This is currently the only way it knows which +addresses are valid and which are not. Sometimes the client program +knows extra information about its memory areas. For example, the +client could at some point know that all elements of an array are +out-of-date. We would like to be able to convey to Valgrind this +information that the array is now addressable-but-uninitialised, so +that Valgrind can then warn if elements are used before they get new +values. + +

    +What I would like are some macros like this: +

    +   VALGRIND_MAKE_NOACCESS(addr, len)
    +   VALGRIND_MAKE_WRITABLE(addr, len)
    +   VALGRIND_MAKE_READABLE(addr, len)
    +
+and also, to check that memory is addressable/initialised, +
    +   VALGRIND_CHECK_ADDRESSIBLE(addr, len)
    +   VALGRIND_CHECK_INITIALISED(addr, len)
    +
    + +

    +I then include in my sources a header defining these macros, rebuild +my app, run under Valgrind, and get user-defined checks. + +

    +Now here's a neat trick. It's a nuisance to have to re-link the app +with some new library which implements the above macros. So the idea +is to define the macros so that the resulting executable is still +completely stand-alone, and can be run without Valgrind, in which case +the macros do nothing, but when run on Valgrind, the Right Thing +happens. How to do this? The idea is for these macros to turn into a +piece of inline assembly code, which (1) has no effect when run on the +real CPU, (2) is easily spotted by Valgrind's JITter, and (3) no sane +person would ever write, which is important for avoiding false matches +in (2). So here's a suggestion: +

    +   VALGRIND_MAKE_NOACCESS(addr, len)
    +
    +becomes (roughly speaking) +
    +   movl addr, %eax
    +   movl len,  %ebx
    +   movl $1,   %ecx   -- 1 describes the action; MAKE_WRITABLE might be
    +                     -- 2, etc
    +   rorl $13, %ecx
    +   rorl $19, %ecx
    +   rorl $11, %eax
    +   rorl $21, %eax
    +
+Each pair of rotates sums to 32 bits, so the sequences have no net +effect, and it's unlikely they would appear for any other reason, but +they define a unique byte-sequence +which the JITter can easily spot. Using the operand constraints +section at the end of a gcc inline-assembly statement, we can tell gcc +that the assembly fragment kills %eax, %ebx, +%ecx and the condition codes, so this fragment is harmless +and quick when not running on Valgrind, and does not require any other +library support. + + +
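A rough sketch of how such a macro might be written as gcc extended inline assembly follows. The register choices and the request code are taken from the text above; the double cast is merely an assumption to keep the 32-bit operands compilable on x86-64, and this is not the final Valgrind macro:

```c
/* Sketch only: each rotate pair totals 32 bits, so on a real CPU the
   sequence restores %ecx and %eax and the whole thing is a no-op;
   under Valgrind the JITter would recognise the byte pattern. */
#define VALGRIND_MAKE_NOACCESS(addr, len)                         \
   __asm__ __volatile__(                                          \
      "movl %0, %%eax\n\t"                                        \
      "movl %1, %%ebx\n\t"                                        \
      "movl $1, %%ecx\n\t"   /* 1 == the MAKE_NOACCESS action */  \
      "rorl $13, %%ecx\n\t"                                       \
      "rorl $19, %%ecx\n\t"  /* 13+19 == 32: no net rotation */   \
      "rorl $11, %%eax\n\t"                                       \
      "rorl $21, %%eax\n\t"  /* 11+21 == 32: no net rotation */   \
      : /* no outputs */                                          \
      : "r"((unsigned int)(unsigned long)(addr)),                 \
        "r"((unsigned int)(len))                                  \
      : "eax", "ebx", "ecx", "cc")

/* On the real CPU the request must leave memory untouched. */
static int noaccess_is_noop(void)
{
   unsigned int guard = 0xdeadbeefu;
   VALGRIND_MAKE_NOACCESS(&guard, sizeof guard);
   return guard == 0xdeadbeefu;
}
```

The clobber list is what makes point (1) hold: gcc knows the three registers and the flags are trashed and keeps nothing live in them across the fragment.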

    +Part 2: using it to detect interference between stack variables +

    + +Currently Valgrind cannot detect errors of the following form: +

    +void fooble ( void )
    +{
    +   int a[10];
    +   int b[10];
    +   a[10] = 99;
    +}
    +
    +Now imagine rewriting this as +
    +void fooble ( void )
    +{
    +   int spacer0;
    +   int a[10];
    +   int spacer1;
    +   int b[10];
    +   int spacer2;
    +   VALGRIND_MAKE_NOACCESS(&spacer0, sizeof(int));
    +   VALGRIND_MAKE_NOACCESS(&spacer1, sizeof(int));
    +   VALGRIND_MAKE_NOACCESS(&spacer2, sizeof(int));
    +   a[10] = 99;
    +}
    +
    +Now the invalid write is certain to hit spacer0 or +spacer1, so Valgrind will spot the error. + +

    +There are two complications. + +

+The first is that we don't want to annotate sources by hand, so the +Right Thing to do is to write a C/C++ parser, annotator and +prettyprinter which does this automatically, and run it on post-CPP'd +C/C++ source. +See http://www.cacheprof.org for an example of a system which +transparently inserts another phase into the gcc/g++ compilation +route. The parser/prettyprinter is probably not as hard as it sounds; +I would write it in Haskell, a powerful functional language well +suited to doing symbolic computation, with which I am intimately +familiar. There is already a C parser written in Haskell by someone in +the Haskell community, and that would probably be a good starting +point. + +

    +The second complication is how to get rid of these +NOACCESS records inside Valgrind when the instrumented +function exits; after all, these refer to stack addresses and will +make no sense whatever when some other function happens to re-use the +same stack address range, probably shortly afterwards. I think I +would be inclined to define a special stack-specific macro +

    +   VALGRIND_MAKE_NOACCESS_STACK(addr, len)
    +
    +which causes Valgrind to record the client's %ESP at the +time it is executed. Valgrind will then watch for changes in +%ESP and discard such records as soon as the protected +area is uncovered by an increase in %ESP. I hesitate +with this scheme only because it is potentially expensive, if there +are hundreds of such records, and considering that changes in +%ESP already require expensive messing with stack access +permissions. + +

+This is probably easier and more robust than for the instrumenter +program to try to spot all exit points for the procedure and place +suitable deallocation annotations there. Plus, C++ procedures can +bomb out at any point if they get an exception, so spotting return +points at the source level just won't work at all. + +

    +Although some work, it's all eminently doable, and it would make +Valgrind into an even-more-useful tool. + + +

+ + + diff --git a/none/nl_main.html b/none/nl_main.html new file mode 100644 index 000000000..95f947178 --- /dev/null +++ b/none/nl_main.html @@ -0,0 +1,57 @@ + + + + Nulgrind + + + + + +

    Nulgrind

    +
    This manual was last updated on 2002-10-02
    +

    + +

    +njn25@cam.ac.uk
    +Copyright © 2000-2002 Nicholas Nethercote +

    +Nulgrind is licensed under the GNU General Public License, +version 2
+Nulgrind is a Valgrind skin that does not do very much at all. +

    + +

    + +

    1  Nulgrind

    + +Nulgrind is the minimal skin for Valgrind. It does no initialisation or +finalisation, and adds no instrumentation to the program's code. It is mainly +of use for Valgrind's developers for debugging and regression testing. +

    +Nonetheless you can run programs with Nulgrind. They will run roughly 5-10 +times more slowly than normal, for no useful effect. Note that you need to use +the option --skin=none to run Nulgrind (ie. not +--skin=nulgrind). + +


    + + +