From 5d93498d4d296e2ef048e4ced96412e24d83fc16 Mon Sep 17 00:00:00 2001 From: Julian Seward Date: Mon, 11 Nov 2002 00:20:07 +0000 Subject: [PATCH] Add documentation back in, in its new form. Still all very rough and totally borked, but pretty much all the duplication is gone, and there is a good start on a common core section in coregrind/coregrind_core.html. At least I know where I'm going with all this now. The Makefile.am's need to be fixed up. Basic idea is that, when put together in a single directory, these files make a coherent manual, starting at manual.html. Fortunately :-) "make install" does exactly that -- copies them to a single directory. After redundancy removal, there's more than 38000 words of documentation here, according to wc. Amazing. git-svn-id: svn://svn.valgrind.org/valgrind/trunk@1284 --- addrcheck/ac_main.html | 10 + cachegrind/cg_main.html | 752 ++++++++++++ cachegrind/cg_techdocs.html | 461 +++++++ coregrind/coregrind_core.html | 1270 +++++++++++++++++++ coregrind/coregrind_intro.html | 176 +++ coregrind/coregrind_skins.html | 687 +++++++++++ docs/manual.html | 92 ++ helgrind/hg_main.html | 80 ++ lackey/lk_main.html | 68 + memcheck/mc_main.html | 830 +++++++++++++ memcheck/mc_techdocs.html | 2113 ++++++++++++++++++++++++++++++++ none/nl_main.html | 57 + 12 files changed, 6596 insertions(+) create mode 100644 addrcheck/ac_main.html create mode 100644 cachegrind/cg_main.html create mode 100644 cachegrind/cg_techdocs.html create mode 100644 coregrind/coregrind_core.html create mode 100644 coregrind/coregrind_intro.html create mode 100644 coregrind/coregrind_skins.html create mode 100644 docs/manual.html create mode 100644 helgrind/hg_main.html create mode 100644 lackey/lk_main.html create mode 100644 memcheck/mc_main.html create mode 100644 memcheck/mc_techdocs.html create mode 100644 none/nl_main.html diff --git a/addrcheck/ac_main.html b/addrcheck/ac_main.html new file mode 100644 index 000000000..7aa2e9a52 --- /dev/null +++ 
b/addrcheck/ac_main.html @@ -0,0 +1,10 @@ + + + AddrCheck + + + +(no docs yet, sorry) + + + diff --git a/cachegrind/cg_main.html b/cachegrind/cg_main.html new file mode 100644 index 000000000..85462560e --- /dev/null +++ b/cachegrind/cg_main.html @@ -0,0 +1,752 @@ + + + + Cachegrind + + + + +  +

Cachegrind, version 1.0.0

+
This manual was last updated on 20020726
+

+ +

+jseward@acm.org
+Copyright © 2000-2002 Julian Seward +

+Cachegrind is licensed under the GNU General Public License, +version 2
+An open-source cache-profiling tool for
+Linux-x86 executables.
+

+ +

+ +


+ +

Contents of this manual

+ +

How to use Cachegrind

+ +

How Cachegrind works

+ +
+ + + +

1  Cache profiling

+Cachegrind is a tool for doing cache simulations and annotating your source
+line-by-line with the number of cache misses.  In particular, it records:
+
  1. L1 instruction cache reads and misses;
  2. L1 data cache reads and read misses, writes and write misses;
  3. L2 unified cache reads and read misses, writes and write misses.
+
+On a modern x86 machine, an L1 miss will typically cost around 10 cycles,
+and an L2 miss can cost as much as 200 cycles.  Detailed cache profiling can be
+very useful for improving the performance of your program.

+ +Also, since one instruction cache read is performed per instruction executed, +you can find out how many instructions are executed per line, which can be +useful for traditional profiling and test coverage.

+ +Any feedback, bug-fixes, suggestions, etc, welcome. + + +

1.1  Overview

+First off, as for normal Valgrind use, you probably want to compile with +debugging info (the -g flag). But by contrast with normal +Valgrind use, you probably do want to turn optimisation on, since you +should profile your program as it will be normally run. + +The two steps are: +
    +
  1. Run your program with valgrind --skin=cachegrind in front of + the normal command line invocation. When the program finishes, + Valgrind will print summary cache statistics. It also collects + line-by-line information in a file + cachegrind.out.pid, where pid + is the program's process id. +

    + This step should be done every time you want to collect + information about a new program, a changed program, or about the + same program with different input. +

  2. Generate a function-by-function summary, and possibly annotate
+ source files with 'cg_annotate'.  Source files to annotate can be
+ specified manually on the command line, or
+ "interesting" source files can be annotated automatically with
+ the --auto=yes option.  You can annotate C/C++
+ files or assembly language files equally easily.
+
+ This step can be performed as many times as you like for each
+ Step 1.  You may want to do multiple annotations showing
+ different information each time.
+ +The steps are described in detail in the following sections.

+ + +

1.2  Cache simulation specifics

+ 
+Cachegrind simulates a machine with a split L1 cache and a unified
+L2 cache.  This configuration matches all (modern) x86-based machines we
+are aware of.  Old Cyrix CPUs had a unified I and D L1 cache, but they are
+ancient history now.

+ +The more specific characteristics of the simulation are as follows. + +

+ +The cache configuration simulated (cache size, associativity and line size) is +determined automagically using the CPUID instruction. If you have an old +machine that (a) doesn't support the CPUID instruction, or (b) supports it in +an early incarnation that doesn't give any cache information, then Cachegrind +will fall back to using a default configuration (that of a model 3/4 Athlon). +Cachegrind will tell you if this happens. You can manually specify one, two or +all three levels (I1/D1/L2) of the cache from the command line using the +--I1, --D1 and --L2 options.

+ +Other noteworthy behaviour: + +

+ +If you are interested in simulating a cache with different properties, it is +not particularly hard to write your own cache simulator, or to modify the +existing ones in vg_cachesim_I1.c, vg_cachesim_D1.c, +vg_cachesim_L2.c and vg_cachesim_gen.c. We'd be +interested to hear from anyone who does. + + +

1.3  Profiling programs

+ +Cache profiling is enabled by using the --skin=cachegrind +option to the valgrind shell script. To gather cache profiling +information about the program ls -l, type: + +
valgrind --skin=cachegrind ls -l
+ +The program will execute (slowly). Upon completion, summary statistics +that look like this will be printed: + +
+==31751== I   refs:      27,742,716
+==31751== I1  misses:           276
+==31751== L2  misses:           275
+==31751== I1  miss rate:        0.0%
+==31751== L2i miss rate:        0.0%
+==31751== 
+==31751== D   refs:      15,430,290  (10,955,517 rd + 4,474,773 wr)
+==31751== D1  misses:        41,185  (    21,905 rd +    19,280 wr)
+==31751== L2  misses:        23,085  (     3,987 rd +    19,098 wr)
+==31751== D1  miss rate:        0.2% (       0.1%   +       0.4%)
+==31751== L2d miss rate:        0.1% (       0.0%   +       0.4%)
+==31751== 
+==31751== L2 misses:         23,360  (     4,262 rd +    19,098 wr)
+==31751== L2 miss rate:         0.0% (       0.0%   +       0.4%)
+
+ +Cache accesses for instruction fetches are summarised first, giving the +number of fetches made (this is the number of instructions executed, which +can be useful to know in its own right), the number of I1 misses, and the +number of L2 instruction (L2i) misses.

+ +Cache accesses for data follow. The information is similar to that of the +instruction fetches, except that the values are also shown split between reads +and writes (note each row's rd and wr values add up +to the row's total).

+ +Combined instruction and data figures for the L2 cache follow that.

+ + +

1.4  Output file

+ +As well as printing summary information, Cachegrind also writes +line-by-line cache profiling information to a file named +cachegrind.out.pid. This file is human-readable, but is +best interpreted by the accompanying program cg_annotate, +described in the next section. +

+Things to note about the cachegrind.out.pid file: +

+ +Note that older versions of Cachegrind used a log file named +cachegrind.out (i.e. no .pid suffix). +The suffix serves two purposes. Firstly, it means you don't have to rename old +log files that you don't want to overwrite. Secondly, and more importantly, +it allows correct profiling with the --trace-children=yes option +of programs that spawn child processes. + + +

1.5  Cachegrind options

+Cachegrind accepts all the options that Valgrind does, although some of them +(ones related to memory checking) don't do anything when cache profiling.

+ +The interesting cache-simulation specific options are: + +

+ + + +

1.6  Annotating C/C++ programs

+ 
+Before using cg_annotate, it is worth widening your
+window to be at least 120 characters wide if possible, as the output
+lines can be quite long.
+

+To get a function-by-function summary, run cg_annotate +--pid in a directory containing a +cachegrind.out.pid file. The --pid +is required so that cg_annotate knows which log file to use when +several are present. +

+The output looks like this: + +

+--------------------------------------------------------------------------------
+I1 cache:              65536 B, 64 B, 2-way associative
+D1 cache:              65536 B, 64 B, 2-way associative
+L2 cache:              262144 B, 64 B, 8-way associative
+Command:               concord vg_to_ucode.c
+Events recorded:       Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
+Events shown:          Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
+Event sort order:      Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
+Threshold:             99%
+Chosen for annotation:
+Auto-annotation:       on
+
+--------------------------------------------------------------------------------
+Ir         I1mr I2mr Dr         D1mr   D2mr  Dw        D1mw   D2mw
+--------------------------------------------------------------------------------
+27,742,716  276  275 10,955,517 21,905 3,987 4,474,773 19,280 19,098  PROGRAM TOTALS
+
+--------------------------------------------------------------------------------
+Ir        I1mr I2mr Dr        D1mr  D2mr  Dw        D1mw   D2mw    file:function
+--------------------------------------------------------------------------------
+8,821,482    5    5 2,242,702 1,621    73 1,794,230      0      0  getc.c:_IO_getc
+5,222,023    4    4 2,276,334    16    12   875,959      1      1  concord.c:get_word
+2,649,248    2    2 1,344,810 7,326 1,385         .      .      .  vg_main.c:strcmp
+2,521,927    2    2   591,215     0     0   179,398      0      0  concord.c:hash
+2,242,740    2    2 1,046,612   568    22   448,548      0      0  ctype.c:tolower
+1,496,937    4    4   630,874 9,000 1,400   279,388      0      0  concord.c:insert
+  897,991   51   51   897,831    95    30        62      1      1  ???:???
+  598,068    1    1   299,034     0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__flockfile
+  598,068    0    0   299,034     0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__funlockfile
+  598,024    4    4   213,580    35    16   149,506      0      0  vg_clientmalloc.c:malloc
+  446,587    1    1   215,973 2,167   430   129,948 14,057 13,957  concord.c:add_existing
+  341,760    2    2   128,160     0     0   128,160      0      0  vg_clientmalloc.c:vg_trap_here_WRAPPER
+  320,782    4    4   150,711   276     0    56,027     53     53  concord.c:init_hash_table
+  298,998    1    1   106,785     0     0    64,071      1      1  concord.c:create
+  149,518    0    0   149,516     0     0         1      0      0  ???:tolower@@GLIBC_2.0
+  149,518    0    0   149,516     0     0         1      0      0  ???:fgetc@@GLIBC_2.0
+   95,983    4    4    38,031     0     0    34,409  3,152  3,150  concord.c:new_word_node
+   85,440    0    0    42,720     0     0    21,360      0      0  vg_clientmalloc.c:vg_bogus_epilogue
+
+ +First up is a summary of the annotation options: + + + +Then follows summary statistics for the whole program. These are similar +to the summary provided when running valgrind --skin=cachegrind.

+ 
+Then follows function-by-function statistics.  Each function is
+identified by a file_name:function_name pair.  If a column
+contains only a dot it means the function never performs
+that event (eg. the third row shows that strcmp()
+contains no instructions that write to memory).  The name
+??? is used if the file name and/or function name
+could not be determined from debugging information.  If most of the
+entries have the form ???:??? the program probably wasn't
+compiled with -g.  If any code was invalidated (either due to
+self-modifying code or unloading of shared objects) its counts are aggregated
+into a single cost centre written as (discarded):(discarded).

+ +It is worth noting that functions will come from three types of source files: +

    +
  1. From the profiled program (concord.c in this example).

  2. From libraries (eg. getc.c)

  3. From Valgrind's implementation of some libc functions (eg.
+ vg_clientmalloc.c:malloc).  These are recognisable because
+ the filename begins with vg_, and is probably one of
+ vg_main.c, vg_clientmalloc.c or
+ vg_mylibc.c.
+ +There are two ways to annotate source files -- by choosing them +manually, or with the --auto=yes option. To do it +manually, just specify the filenames as arguments to +cg_annotate. For example, the output from running +cg_annotate concord.c for our example produces the same +output as above followed by an annotated version of +concord.c, a section of which looks like: + +
+--------------------------------------------------------------------------------
+-- User-annotated source: concord.c
+--------------------------------------------------------------------------------
+Ir        I1mr I2mr Dr      D1mr  D2mr  Dw      D1mw   D2mw
+
+[snip]
+
+        .    .    .       .     .     .       .      .      .  void init_hash_table(char *file_name, Word_Node *table[])
+        3    1    1       .     .     .       1      0      0  {
+        .    .    .       .     .     .       .      .      .      FILE *file_ptr;
+        .    .    .       .     .     .       .      .      .      Word_Info *data;
+        1    0    0       .     .     .       1      1      1      int line = 1, i;
+        .    .    .       .     .     .       .      .      .
+        5    0    0       .     .     .       3      0      0      data = (Word_Info *) create(sizeof(Word_Info));
+        .    .    .       .     .     .       .      .      .
+    4,991    0    0   1,995     0     0     998      0      0      for (i = 0; i < TABLE_SIZE; i++)
+    3,988    1    1   1,994     0     0     997     53     52          table[i] = NULL;
+        .    .    .       .     .     .       .      .      .
+        .    .    .       .     .     .       .      .      .      /* Open file, check it. */
+        6    0    0       1     0     0       4      0      0      file_ptr = fopen(file_name, "r");
+        2    0    0       1     0     0       .      .      .      if (!(file_ptr)) {
+        .    .    .       .     .     .       .      .      .          fprintf(stderr, "Couldn't open '%s'.\n", file_name);
+        1    1    1       .     .     .       .      .      .          exit(EXIT_FAILURE);
+        .    .    .       .     .     .       .      .      .      }
+        .    .    .       .     .     .       .      .      .
+  165,062    1    1  73,360     0     0  91,700      0      0      while ((line = get_word(data, line, file_ptr)) != EOF)
+  146,712    0    0   73,356     0     0  73,356      0      0          insert(data->word, data->line, table);
+        .    .    .       .     .     .       .      .      .
+        4    0    0       1     0     0       2      0      0      free(data);
+        4    0    0       1     0     0       2      0      0      fclose(file_ptr);
+        3    0    0       2     0     0       .      .      .  }
+
+ +(Although column widths are automatically minimised, a wide terminal is clearly +useful.)

+ +Each source file is clearly marked (User-annotated source) as +having been chosen manually for annotation. If the file was found in one of +the directories specified with the -I/--include +option, the directory and file are both given.

+ +Each line is annotated with its event counts. Events not applicable for a line +are represented by a `.'; this is useful for distinguishing between an event +which cannot happen, and one which can but did not.

+ +Sometimes only a small section of a source file is executed. To minimise +uninteresting output, Valgrind only shows annotated lines and lines within a +small distance of annotated lines. Gaps are marked with the line numbers so +you know which part of a file the shown code comes from, eg: + +

+(figures and code for line 704)
+-- line 704 ----------------------------------------
+-- line 878 ----------------------------------------
+(figures and code for line 878)
+
+ +The amount of context to show around annotated lines is controlled by the +--context option.

+ +To get automatic annotation, run cg_annotate --auto=yes. +cg_annotate will automatically annotate every source file it can find that is +mentioned in the function-by-function summary. Therefore, the files chosen for +auto-annotation are affected by the --sort and +--threshold options. Each source file is clearly marked +(Auto-annotated source) as being chosen automatically. Any files +that could not be found are mentioned at the end of the output, eg: + +

+--------------------------------------------------------------------------------
+The following files chosen for auto-annotation could not be found:
+--------------------------------------------------------------------------------
+  getc.c
+  ctype.c
+  ../sysdeps/generic/lockfile.c
+
+ +This is quite common for library files, since libraries are usually compiled +with debugging information, but the source files are often not present on a +system. If a file is chosen for annotation both manually and +automatically, it is marked as User-annotated source. + +Use the -I/--include option to tell Valgrind where to look for +source files if the filenames found from the debugging information aren't +specific enough. + +Beware that cg_annotate can take some time to digest large +cachegrind.out.pid files, e.g. 30 seconds or more. Also +beware that auto-annotation can produce a lot of output if your program is +large! + + +

1.7  Annotating assembler programs

+ +Valgrind can annotate assembler programs too, or annotate the +assembler generated for your C program. Sometimes this is useful for +understanding what is really happening when an interesting line of C +code is translated into multiple instructions.

+ +To do this, you just need to assemble your .s files with +assembler-level debug information. gcc doesn't do this, but you can +use the GNU assembler with the --gstabs option to +generate object files with this information, eg: + +

as --gstabs foo.s
+ +You can then profile and annotate source files in the same way as for C/C++ +programs. + + +

1.8  cg_annotate options

+ + + +

1.9  Warnings

+There are a couple of situations in which cg_annotate issues warnings. + + + + +

1.10  Things to watch out for

+Some odd things that can occur during annotation: + + + +This list looks long, but these cases should be fairly rare.

+ 
+Note: stabs is not an easy format to read.  If you come across bizarre
+annotations that look like they might be caused by a bug in the stabs reader,
+please let us know.

+ + +

1.11  Accuracy

+Valgrind's cache profiling has a number of shortcomings:
+
+
+Another thing worth noting is that results are very sensitive.  Changing the
+size of the valgrind.so file, the size of the program being
+profiled, or even the length of its name can perturb the results.  Variations
+will be small, but don't expect perfectly repeatable results if your program
+changes at all.

+ +While these factors mean you shouldn't trust the results to be super-accurate, +hopefully they should be close enough to be useful.

+ + +

1.12  Todo

+ +
+ + + diff --git a/cachegrind/cg_techdocs.html b/cachegrind/cg_techdocs.html new file mode 100644 index 000000000..3375ef066 --- /dev/null +++ b/cachegrind/cg_techdocs.html @@ -0,0 +1,461 @@ + + + + The design and implementation of Valgrind + + + + +  +

How Cachegrind works

+ +
+Detailed technical notes for hackers, maintainers and the +overly-curious
+These notes pertain to snapshot 20020306
+

+jseward@acm.org
+
http://developer.kde.org/~sewardj
+Copyright © 2000-2002 Julian Seward +

+Valgrind is licensed under the GNU General Public License, +version 2
+An open-source tool for finding memory-management problems in +x86 GNU/Linux executables. +

+ +

+ + + + +


+ +

Cache profiling

+Valgrind is a very nice platform for doing cache profiling and other kinds of +simulation, because it converts horrible x86 instructions into nice clean +RISC-like UCode. For example, for cache profiling we are interested in +instructions that read and write memory; in UCode there are only four +instructions that do this: LOAD, STORE, +FPU_R and FPU_W. By contrast, because of the x86 +addressing modes, almost every instruction can read or write memory.

+ +Most of the cache profiling machinery is in the file +vg_cachesim.c.

+ +These notes are a somewhat haphazard guide to how Valgrind's cache profiling +works.

+ +

Cost centres

+Valgrind gathers cache profiling information about every instruction executed,
+individually.  Each instruction has a cost centre associated with it.
+There are two kinds of cost centre: one for instructions that don't reference
+memory (iCC), and one for instructions that do
+(idCC):
+
+typedef struct _CC {
+   ULong a;
+   ULong m1;
+   ULong m2;
+} CC;
+
+typedef struct _iCC {
+   /* word 1 */
+   UChar tag;
+   UChar instr_size;
+
+   /* words 2+ */
+   Addr instr_addr;
+   CC I;
+} iCC;
+   
+typedef struct _idCC {
+   /* word 1 */
+   UChar tag;
+   UChar instr_size;
+   UChar data_size;
+
+   /* words 2+ */
+   Addr instr_addr;
+   CC I; 
+   CC D; 
+} idCC; 
+
+ 
+Each CC has three fields a, m1,
+m2 for recording references, level 1 misses and level 2 misses.
+Each of these is a 64-bit ULong -- the numbers can get very large,
+ie. greater than the 4.2 billion allowed by a 32-bit unsigned int.

+ 
+An iCC has one CC for instruction cache accesses.  An
+idCC has two, one for instruction cache accesses, and one for data
+cache accesses.

+ 
+The iCC and idCC structs also store unchanging
+information about the instruction:

+ 
+Note that the data address is not one of the fields for idCC.  This is
+because for many memory-referencing instructions the data address can change
+each time it's executed (eg. if it uses register-offset addressing).  We have
+to give this item to the cache simulation in a different way (see the
+Instrumentation section below).  Some memory-referencing instructions do always
+reference the same address, but we don't try to treat them specially in order to
+keep things simple.

+ +Also note that there is only room for recording info about one data cache +access in an idCC. So what about instructions that do a read then +a write, such as: + +

incl (%esi)
+ +In a write-allocate cache, as simulated by Valgrind, the write cannot miss, +since it immediately follows the read which will drag the block into the cache +if it's not already there. So the write access isn't really interesting, and +Valgrind doesn't record it. This means that Valgrind doesn't measure +memory references, but rather memory references that could miss in the cache. +This behaviour is the same as that used by the AMD Athlon hardware counters. +It also has the benefit of simplifying the implementation -- instructions that +read and write memory can be treated like instructions that read memory.

+ +

Storing cost-centres

+Cost centres are stored in a way that makes them very cheap to look up, which is
+important since one is looked up for every original x86 instruction
+executed.

+ 
+Valgrind does JIT translations at the basic block level, and cost centres are
+also set up and stored at the basic block level.  By doing things carefully, we
+store all the cost centres for a basic block in a contiguous array, and lookup
+comes almost for free.

+ +Consider this part of a basic block (for exposition purposes, pretend it's an +entire basic block): + +

+movl $0x0,%eax
+movl $0x99, -4(%ebp)
+
+ +The translation to UCode looks like this: + +
+MOVL      $0x0, t20
+PUTL      t20, %EAX
+INCEIPo   $5
+
+LEA1L     -4(t4), t14
+MOVL      $0x99, t18
+STL       t18, (t14)
+INCEIPo   $7
+
+ +The first step is to allocate the cost centres. This requires a preliminary +pass to count how many x86 instructions were in the basic block, and their +types (and thus sizes). UCode translations for single x86 instructions are +delimited by the INCEIPo instruction, the argument of which gives +the byte size of the instruction (note that lazy INCEIP updating is turned off +to allow this).

+ +We can tell if an x86 instruction references memory by looking for +LDL and STL UCode instructions, and thus what kind of +cost centre is required. From this we can determine how many cost centres we +need for the basic block, and their sizes. We can then allocate them in a +single array.

+ 
+Consider the example code above.  After the preliminary pass, we know we need
+two cost centres, one iCC and one idCC.  So we
+allocate an array to store these, which looks like this:

+|(uninit)|      tag         (1 byte)
+|(uninit)|      instr_size  (1 byte)
+|(uninit)|      (padding)   (2 bytes)
+|(uninit)|      instr_addr  (4 bytes)
+|(uninit)|      I.a         (8 bytes)
+|(uninit)|      I.m1        (8 bytes)
+|(uninit)|      I.m2        (8 bytes)
+
+|(uninit)|      tag         (1 byte)
+|(uninit)|      instr_size  (1 byte)
+|(uninit)|      data_size   (1 byte)
+|(uninit)|      (padding)   (1 byte)
+|(uninit)|      instr_addr  (4 bytes)
+|(uninit)|      I.a         (8 bytes)
+|(uninit)|      I.m1        (8 bytes)
+|(uninit)|      I.m2        (8 bytes)
+|(uninit)|      D.a         (8 bytes)
+|(uninit)|      D.m1        (8 bytes)
+|(uninit)|      D.m2        (8 bytes)
+
+ +(We can see now why we need tags to distinguish between the two types of cost +centres.)

+ +We also record the size of the array. We look up the debug info of the first +instruction in the basic block, and then stick the array into a table indexed +by filename and function name. This makes it easy to dump the information +quickly to file at the end.

+ +

Instrumentation

+The instrumentation pass has two main jobs: + +
    +
  1. Fill in the gaps in the allocated cost centres.

  2. Add UCode to call the cache simulator for each instruction.

+ 
+The instrumentation pass steps through the UCode and the cost centres in
+tandem.  As each original x86 instruction's UCode is processed, the appropriate
+gaps in the instruction's cost centre are filled in, for example:
+|INSTR_CC|      tag         (1 byte)
+|5       |      instr_size  (1 byte)
+|(uninit)|      (padding)   (2 bytes)
+|i_addr1 |      instr_addr  (4 bytes)
+|0       |      I.a         (8 bytes)
+|0       |      I.m1        (8 bytes)
+|0       |      I.m2        (8 bytes)
+
+|WRITE_CC|      tag         (1 byte)
+|7       |      instr_size  (1 byte)
+|4       |      data_size   (1 byte)
+|(uninit)|      (padding)   (1 byte)
+|i_addr2 |      instr_addr  (4 bytes)
+|0       |      I.a         (8 bytes)
+|0       |      I.m1        (8 bytes)
+|0       |      I.m2        (8 bytes)
+|0       |      D.a         (8 bytes)
+|0       |      D.m1        (8 bytes)
+|0       |      D.m2        (8 bytes)
+
+ 
+(Note that this step is not performed if a basic block is re-translated; see
+the section on basic block retranslations below for more information.)

+ 
+GCC inserts padding before the instr_addr field so that it is word
+aligned.

+ +The instrumentation added to call the cache simulation function looks like this +(instrumentation is indented to distinguish it from the original UCode): + +

+MOVL      $0x0, t20
+PUTL      t20, %EAX
+  PUSHL     %eax
+  PUSHL     %ecx
+  PUSHL     %edx
+  MOVL      $0x4091F8A4, t46  # address of 1st CC
+  PUSHL     t46
+  CALLMo    $0x12             # first cachesim function
+  CLEARo    $0x4
+  POPL      %edx
+  POPL      %ecx
+  POPL      %eax
+INCEIPo   $5
+
+LEA1L     -4(t4), t14
+MOVL      $0x99, t18
+  MOVL      t14, t42
+STL       t18, (t14)
+  PUSHL     %eax
+  PUSHL     %ecx
+  PUSHL     %edx
+  PUSHL     t42
+  MOVL      $0x4091F8C4, t44  # address of 2nd CC
+  PUSHL     t44
+  CALLMo    $0x13             # second cachesim function
+  CLEARo    $0x8
+  POPL      %edx
+  POPL      %ecx
+  POPL      %eax
+INCEIPo   $7
+
+ +Consider the first instruction's UCode. Each call is surrounded by three +PUSHL and POPL instructions to save and restore the +caller-save registers. Then the address of the instruction's cost centre is +pushed onto the stack, to be the first argument to the cache simulation +function. The address is known at this point because we are doing a +simultaneous pass through the cost centre array. This means the cost centre +lookup for each instruction is almost free (just the cost of pushing an +argument for a function call). Then the call to the cache simulation function +for non-memory-reference instructions is made (note that the +CALLMo UInstruction takes an offset into a table of predefined +functions; it is not an absolute address), and the single argument is +CLEARed from the stack.

+ +The second instruction's UCode is similar. The only difference is that, as +mentioned before, we have to pass the address of the data item referenced to +the cache simulation function too. This explains the MOVL t14, +t42 and PUSHL t42 UInstructions. (Note that the seemingly +redundant MOVing will probably be optimised away during register +allocation.)

+ +Note that instead of storing unchanging information about each instruction +(instruction size, data size, etc) in its cost centre, we could have passed in +these arguments to the simulation function. But this would slow the calls down +(two or three extra arguments pushed onto the stack). Also it would bloat the +UCode instrumentation by amounts similar to the space required for them in the +cost centre; bloated UCode would also fill the translation cache more quickly, +requiring more translations for large programs and slowing them down more.

+ + +

Handling basic block retranslations

+The above description ignores one complication. Valgrind has a limited size +cache for basic block translations; if it fills up, old translations are +discarded. If a discarded basic block is executed again, it must be +re-translated.

+ +However, we can't use this approach for profiling -- we can't throw away cost +centres for instructions in the middle of execution! So when a basic block is +translated, we first look for its cost centre array in the hash table. If +there is no cost centre array, it must be the first translation, so we proceed +as described above. But if there is a cost centre array already, it must be a +retranslation. In this case, we skip the cost centre allocation and +initialisation steps, but still do the UCode instrumentation step.

+ +

The cache simulation

+The cache simulation is fairly straightforward. It just tracks which memory +blocks are in the cache at the moment (it doesn't track the contents, since +that is irrelevant).

+ 
+The interface to the simulation is quite clean.  The functions called from the
+UCode contain calls to the simulation functions in the files
+vg_cachesim_{I1,D1,L2}.c; these calls are inlined so that only
+one function call is done per simulated x86 instruction.  The file
+vg_cachesim.c simply #includes the three files
+containing the simulation, which makes plugging in new cache simulations
+very easy -- you just replace the three files and recompile.

+ +

Output

+Output is fairly straightforward, basically printing the cost centre for every +instruction, grouped by files and functions. Total counts (eg. total cache +accesses, total L1 misses) are calculated when traversing this structure rather +than during execution, to save time; the cache simulation functions are called +so often that even one or two extra adds can make a sizeable difference.

+ +The input file has the following format: + +

+file         ::= desc_line* cmd_line events_line data_line+ summary_line
+desc_line    ::= "desc:" ws? non_nl_string
+cmd_line     ::= "cmd:" ws? cmd
+events_line  ::= "events:" ws? (event ws)+
+data_line    ::= file_line | fn_line | count_line
+file_line    ::= ("fl=" | "fi=" | "fe=") filename
+fn_line      ::= "fn=" fn_name
+count_line   ::= line_num ws? (count ws)+
+summary_line ::= "summary:" ws? (count ws)+
+count        ::= num | "."
+
+ +Where: + + + +The contents of the "desc:" lines are printed out at the top of the summary. +This is a generic way of providing simulation-specific information, eg. for +giving the cache configuration for cache simulation.

+ +Counts can be "." to represent "N/A", eg. the number of write misses for an +instruction that doesn't write to memory.

+ +The number of counts in each line and the +summary_line should not exceed the number of events in the +events_line. If the number in each line is less, +cg_annotate treats the missing entries as though they were "." entries.

+ +A file_line changes the current file name. A fn_line +changes the current function name. A count_line contains counts +that pertain to the current filename/fn_name. A "fl=" file_line +and a fn_line must appear before any count_lines to +give the context of the first count_lines.

+ +Each file_line should be immediately followed by a +fn_line. "fi=" file_lines are used to switch +filenames for inlined functions; "fe=" file_lines are similar, but +are put at the end of a basic block in which the file name hasn't been switched +back to the original file name. (fi and fe lines behave the same; they are +only distinguished to help debugging.)
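Putting the grammar together, a small input file might look like the following. All numbers, file names and function names here are invented purely to illustrate the format:

```
desc: I1 cache: 65536 B, 32 B, 2-way associative
cmd: ./myprog
events: Ir I1mr I2mr
fl=myprog.c
fn=main
45 10 1 1
46 20 0 0
fn=get_word
60 5 . .
summary: 35 1 1
```

Note the "." entries on line 60: per the grammar, a count may be "." to mean "N/A" for that event.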

+ + +

Summary of performance features

+Quite a lot of work has gone into making the profiling as fast as possible. +This is a summary of the important features: + + + + +

Annotation

+Annotation is done by cg_annotate. It is a fairly straightforward Perl script +that slurps up all the cost centres, and then runs through all the chosen +source files, printing out cost centres with them. It too has been carefully +optimised. + + +

Similar work, extensions

+It would be relatively straightforward to do other simulations and obtain +line-by-line information about interesting events. A good example would be +branch prediction -- all branches could be instrumented to interact with a +branch prediction simulator, using very similar techniques to those described +above.

+ +In particular, cg_annotate would not need to change -- the file format is such +that it is not specific to the cache simulation, but could be used for any kind +of line-by-line information. The only part of cg_annotate that is specific to +the cache simulation is the name of the input file +(cachegrind.out), although it would be very simple to add an +option to control this.

+ + + diff --git a/coregrind/coregrind_core.html b/coregrind/coregrind_core.html new file mode 100644 index 000000000..7e6083636 --- /dev/null +++ b/coregrind/coregrind_core.html @@ -0,0 +1,1270 @@ + + + +

2  Using and understanding the valgrind core services

+ +This section describes the core services, flags and behaviours. That +means it is relevant regardless of what particular skin you are using. +A point of terminology: most references to "valgrind" in the rest of +this section (Section 2) refer to the valgrind core services. + + + +

2.1  What it does with your program

+ +Valgrind is designed to be as non-intrusive as possible. It works +directly with existing executables. You don't need to recompile, +relink, or otherwise modify, the program to be checked. Simply place +the word valgrind at the start of the command line +normally used to run the program, and tell it what skin you want to +use. + +

+So, for example, if you want to run the command ls -l +using the heavyweight memory-checking tool, issue the command: +valgrind --skin=memcheck ls -l. The --skin= +parameter tells the core which skin is to be used. + +

+To preserve compatibility with the 1.0.X series, if you do not specify +a skin, the default is to use the memcheck skin. That means the above +example simplifies to: valgrind ls -l. + +

Regardless of which skin is in use, Valgrind takes control of your +program before it starts. Debugging information is read from the +executable and associated libraries, so that error messages can be +phrased in terms of source code locations (if that is appropriate). + +

+Your program is then run on a synthetic x86 CPU provided by the +valgrind core. As new code is executed for the first time, the core +hands the code to the selected skin. The skin adds its own +instrumentation code to this and hands the result back to the core, +which coordinates the continued execution of this instrumented code. + +

+The amount of instrumentation code added varies widely between skins. +At one end of the scale, the memcheck skin adds code to check every +memory access and every value computed, increasing the size of the +code at least 12 times, and making it run 25-50 times slower than +natively. At the other end of the spectrum, the ultra-trivial "none" +skin adds no instrumentation at all and causes in total "only" about a +4 times slowdown. + +

+Valgrind simulates every single instruction your program executes. +Because of this, the active skin checks, or profiles, not only the +code in your application but also in all supporting dynamically-linked +(.so-format) libraries, including the GNU C library, the +X client libraries, Qt, if you work with KDE, and so on. + +

+If -- as is usually the case -- you're using one of the +error-detection skins, valgrind will often detect errors in +libraries, for example the GNU C or X11 libraries, which you have to +use. Since you're probably using valgrind to debug your own +application, and not those libraries, you don't want to see those +errors and probably can't fix them anyway. + +

+So, rather than swamping you with errors in which you are not +interested, Valgrind allows you to selectively suppress errors, by +recording them in a suppressions file which is read when Valgrind +starts up. The build mechanism attempts to select suppressions which +give reasonable behaviour for the libc and XFree86 versions detected +on your machine. + +

+Different skins report different kinds of errors. The suppression +mechanism therefore allows you to say which skin or skins each +suppression applies to. + + + +

2.2  Getting started

+ +First off, consider whether it might be beneficial to recompile your +application and supporting libraries with debugging info enabled (the +-g flag). Without debugging info, the best valgrind +will be able to do is guess which function a particular piece of code +belongs to, which makes both error messages and profiling output +nearly useless. With -g, you'll potentially get messages +which point directly to the relevant source code lines. + +

+You don't have to do this, but doing so helps Valgrind produce more +accurate and less confusing error reports. Chances are you're set up +like this already, if you intended to debug your program with GNU gdb, +or some other debugger. + +

+This paragraph applies only if you plan to use the memcheck +skin (which is the default). On rare occasions, optimisation levels +at -O2 and above have been observed to generate code which +fools memcheck into wrongly reporting uninitialised value +errors. We have looked in detail into fixing this, and unfortunately +the result is that doing so would give a further significant slowdown +in what is already a slow skin. So the best solution is to turn off +optimisation altogether. Since this often makes things unmanageably +slow, a plausible compromise is to use -O. This gets +you the majority of the benefits of higher optimisation levels whilst +keeping relatively small the chances of false complaints from memcheck. +All other skins (as far as we know) are unaffected by optimisation +level. +

+Valgrind understands both the older "stabs" debugging format, used by +gcc versions prior to 3.1, and the newer DWARF2 format used by gcc 3.1 +and later. We continue to refine and debug our debug-info readers, +although the majority of effort will naturally enough go into the +newer DWARF2 reader. + +

+Then just run your application, but place valgrind +--skin=the-selected-skin in front of your usual command-line +invocation. Note that you should run the real (machine-code) +executable here. If your application is started by, for example, a +shell or perl script, you'll need to modify it to invoke Valgrind on +the real executables. Running such scripts directly under Valgrind +will result in you getting error reports pertaining to +/bin/sh, /usr/bin/perl, or whatever +interpreter you're using. This almost certainly isn't what you want +and can be confusing. You can probably force the issue by +giving the flag --trace-children=yes, but confusion is +still highly likely. + +

2.3  The commentary

+ +Valgrind writes a commentary, a stream of text, detailing error +reports and other significant events. All lines in the commentary +have the following form:
+
+  ==12345== some-message-from-Valgrind
+
+ +

The 12345 is the process ID. This scheme makes it easy +to distinguish program output from Valgrind commentary, and also easy +to differentiate commentaries from different processes which have +become merged together, for whatever reason. + +

By default, Valgrind writes only essential messages to the commentary, +so as to avoid flooding you with information of secondary importance. +If you want more information about what is happening, re-run, passing +the -v flag to Valgrind. + +

+Version 2 of valgrind gives significantly more flexibility than 1.0.X +does about where that stream is sent to. You have three options: + +

+

+Here is an important point about the relationship between the +commentary and profiling output from skins. The commentary contains a +mix of messages from the valgrind core and the selected skin. If the +skin reports errors, it will report them to the commentary. However, +if the skin does profiling, the profile data will be written to a file +of some kind, depending on the skin, and independent of what +--log* options are in force. The commentary is intended +to be a low-bandwidth, human-readable channel. Profiling data, on the +other hand, is usually voluminous and not meaningful without further +processing, which is why we have chosen this arrangement. + + + +

2.4  Reporting of errors

+ +When one of the error-checking skins (memcheck, addrcheck, helgrind) +detects something bad happening in the program, an error message is +written to the commentary. For example:
+
+  ==25832== Invalid read of size 4
+  ==25832==    at 0x8048724: BandMatrix::ReSize(int, int, int) (bogon.cpp:45)
+  ==25832==    by 0x80487AF: main (bogon.cpp:66)
+  ==25832==    by 0x40371E5E: __libc_start_main (libc-start.c:129)
+  ==25832==    by 0x80485D1: (within /home/sewardj/newmat10/bogon)
+  ==25832==    Address 0xBFFFF74C is not stack'd, malloc'd or free'd
+
+ +

+This message says that the program did an illegal 4-byte read of +address 0xBFFFF74C, which, as far as memcheck can tell, is not a valid +stack address, nor corresponds to any currently malloc'd or free'd +blocks. The read is happening at line 45 of bogon.cpp, +called from line 66 of the same file, etc. For errors associated with +an identified malloc'd/free'd block, for example reading free'd +memory, Valgrind reports not only the location where the error +happened, but also where the associated block was malloc'd/free'd. + +

+Valgrind remembers all error reports. When an error is detected, +it is compared against old reports, to see if it is a duplicate. If +so, the error is noted, but no further commentary is emitted. This +avoids you being swamped with bazillions of duplicate error reports. + +

+If you want to know how many times each error occurred, run with the +-v option. When execution finishes, all the reports are +printed out, along with, and sorted by, their occurrence counts. This +makes it easy to see which errors have occurred most frequently. + +

+Errors are reported before the associated operation actually happens. +If you're using a skin (memcheck, addrcheck) which does address +checking, and your program attempts to read from address zero, the +skin will emit a message to this effect, and the program will then +duly die with a segmentation fault. + +

+In general, you should try and fix errors in the order that they are +reported. Not doing so can be confusing. For example, a program +which copies uninitialised values to several memory locations, and +later uses them, will generate several error messages, when run on +memcheck. The first such error message may well give the most direct +clue to the root cause of the problem. + +

+The process of detecting duplicate errors is quite an expensive one +and can become a significant performance overhead if your program +generates huge quantities of errors. To avoid serious problems here, +Valgrind will simply stop collecting errors after 300 different errors +have been seen, or 30000 errors in total have been seen. In this +situation you might as well stop your program and fix it, because +Valgrind won't tell you anything else useful after this. Note that +the 300/30000 limits apply after suppressed errors are removed. These +limits are defined in vg_include.h and can be increased +if necessary. + +

+To avoid this cutoff you can use the --error-limit=no +flag. Then valgrind will always show errors, regardless of how many +there are. Use this flag carefully, since it may have a dire effect +on performance. + + + +

2.5  Suppressing errors

+ +The error-checking skins detect numerous problems in the base +libraries, such as the GNU C library and the XFree86 client +libraries, which come pre-installed on your GNU/Linux system. You +can't easily fix these, but you don't want to see these errors (and +yes, there are many!). So Valgrind reads a list of errors to suppress +at startup. A default suppression file is cooked up by the +./configure script when the system is built. +

+You can modify and add to the suppressions file at your leisure, +or, better, write your own. Multiple suppression files are allowed. +This is useful if part of your project contains errors you can't or +don't want to fix, yet you don't want to continuously be reminded of +them. + +

+Each error to be suppressed is described very specifically, to +minimise the possibility that a suppression-directive inadvertently +suppresses a bunch of similar errors which you did want to see. The +suppression mechanism is designed to allow precise yet flexible +specification of errors to suppress. +

+If you use the -v flag, at the end of execution, Valgrind +prints out one line for each used suppression, giving its name and the +number of times it got used. Here are the suppressions used by a run of +valgrind --skin=memcheck ls -l: 

+  --27579-- supp: 1 socketcall.connect(serv_addr)/__libc_connect/__nscd_getgrgid_r
+  --27579-- supp: 1 socketcall.connect(serv_addr)/__libc_connect/__nscd_getpwuid_r
+  --27579-- supp: 6 strrchr/_dl_map_object_from_fd/_dl_map_object
+
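For reference, an individual entry in a suppressions file looks something like the following. The first line is the suppression's name, the second names the skin and error kind it applies to, and the remaining lines give the call-stack context; treat the exact error-kind spelling here as illustrative:

```
{
   strrchr/_dl_map_object_from_fd/_dl_map_object
   Memcheck:Addr4
   fun:strrchr
   fun:_dl_map_object_from_fd
   fun:_dl_map_object
}
```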
+ + + +

2.6  Command-line flags for the valgrind core

+ + +As mentioned above, valgrind's core accepts a common set of flags. +The skins also accept skin-specific flags, which are documented +separately for each skin. + +You invoke Valgrind like this: 
+  valgrind [options-for-Valgrind] your-prog [options for your-prog]
+
+ +

Note that Valgrind also reads options from the environment variable +$VALGRIND_OPTS, and processes them before the command-line +options. Options for the valgrind core may be freely mixed with those +for the selected skin. + +

Valgrind's default settings succeed in giving reasonable behaviour +in most cases. Available options, in no particular order, are as +follows: +

+ +There are also some options for debugging Valgrind itself. You +shouldn't need to use them in the normal run of things. Nevertheless: + + + + + +

2.8  The Client Request mechanism

+ +Valgrind has a trapdoor mechanism via which the client program can +pass all manner of requests and queries to Valgrind. Internally, this +is used extensively to make malloc, free, signals, threads, etc, work, +although you don't see that. +

+For your convenience, a subset of these so-called client requests is +provided to allow you to tell Valgrind facts about the behaviour of +your program, and conversely to make queries. In particular, your +program can tell Valgrind about changes in memory range permissions +that Valgrind would not otherwise know about, and so allows clients to +get Valgrind to do arbitrary custom checks. +

+Clients need to include the header file valgrind.h to +make this work. The macros therein have the magical property that +they generate code in-line which Valgrind can spot. However, the code +does nothing when not run on Valgrind, so you are not forced to run +your program on Valgrind just because you use the macros in this file. +Also, you are not required to link your program with any extra +supporting libraries. +

+A brief description of the available macros: +

+

+ + + +

2.9  Support for POSIX Pthreads

+ +As of late April 02, Valgrind supports programs which use POSIX +pthreads. Doing this has proved technically challenging but is now +mostly complete. It works well enough for significant threaded +applications to work. +

+It works as follows: threaded apps are (dynamically) linked against +libpthread.so. Usually this is the one installed with +your Linux distribution. Valgrind, however, supplies its own +libpthread.so and automatically connects your program to +it instead. +

+The fake libpthread.so and Valgrind cooperate to +implement a user-space pthreads package. This approach avoids the +horrible problems of implementing a truly +multiprocessor version of Valgrind, but it does mean that threaded +apps run only on one CPU, even if you have a multiprocessor machine. 

+Valgrind schedules your threads in a round-robin fashion, with all +threads having equal priority. It switches threads every 50000 basic +blocks (typically around 300000 x86 instructions), which means you'll +get a much finer interleaving of thread executions than when run +natively. This in itself may cause your program to behave differently +if you have some kind of concurrency, critical race, locking, or +similar, bugs. +

+The current (valgrind-1.0 release) state of pthread support is as +follows: +

+ + +As of 18 May 02, the following threaded programs now work fine on my +RedHat 7.2 box: Opera 6.0Beta2, KNode in KDE 3.0, Mozilla-0.9.2.1 and +Galeon-0.11.3, both as supplied with RedHat 7.2. Also Mozilla 1.0RC2. +OpenOffice 1.0. MySQL 3.something (the current stable release). + + +

2.10  Building and installing

+ +We now use the standard Unix ./configure, +make, make install mechanism, and I have +attempted to ensure that it works on machines with kernel 2.2 or 2.4 +and glibc 2.1.X or 2.2.X. I don't think there is much else to say. +There are no options apart from the usual --prefix that +you should give to ./configure. + +

+The configure script tests the version of the X server +indicated by the current $DISPLAY. This is a +known bug. The intention was to detect the version of the current +XFree86 client libraries, so that correct suppressions could be +selected for them, but instead the test checks the server version. +This is just plain wrong. 

+If you are building a binary package of Valgrind for distribution, +please read README_PACKAGERS. It contains some important +information. + +

+Apart from that there is no excitement here. Let me know if you have +build problems. + + + + +

2.11  If you have problems

+Mail me (jseward@acm.org). + +

See Section 4 for the known limitations of +Valgrind, and for a list of programs which are known not to work on +it. + +

The translator/instrumentor has a lot of assertions in it. They +are permanently enabled, and I have no plans to disable them. If one +of these breaks, please mail me! + +

If you get an assertion failure on the expression +chunkSane(ch) in vg_free() in +vg_malloc.c, this may have happened because your program +wrote off the end of a malloc'd block, or before its beginning. +Valgrind should have emitted a proper message to that effect before +dying in this way. This is a known problem which I should fix. +

+ +


+ + + +

3.4  Signals

+ +Valgrind provides suitable handling of signals, so, provided you stick +to POSIX stuff, you should be ok. Basic sigaction() and sigprocmask() +are handled. Signal handlers may return in the normal way or do +longjmp(); both should work ok. As specified by POSIX, a signal is +blocked in its own handler. Default actions for signals should work +as before. Etc, etc. + +

Under the hood, dealing with signals is a real pain, and Valgrind's +simulation leaves much to be desired. If your program does +way-strange stuff with signals, bad things may happen. If so, let me +know. I don't promise to fix it, but I'd at least like to be aware of +it. + + + + +

4  Limitations

+ +The following list of limitations seems depressingly long. However, +most programs actually work fine. + +

Valgrind will run x86-GNU/Linux ELF dynamically linked binaries, on +a kernel 2.2.X or 2.4.X system, subject to the following constraints: + +

+ +Programs which are known not to work are: + + + +Known platform-specific limitations, as of release 1.0.0: + + + + +


+ + + +

5  How it works -- a rough overview

+Some gory details, for those with a passion for gory details. You +don't need to read this section if all you want to do is use Valgrind. + + +

5.1  Getting started

+ +Valgrind is compiled into a shared object, valgrind.so. The shell +script valgrind sets the LD_PRELOAD environment variable to point to +valgrind.so. This causes the .so to be loaded as an extra library to +any subsequently executed dynamically-linked ELF binary, viz, the +program you want to debug. + +

The dynamic linker allows each .so in the process image to have an +initialisation function which is run before main(). It also allows +each .so to have a finalisation function run after main() exits. + +

When valgrind.so's initialisation function is called by the dynamic +linker, the synthetic CPU starts up. The real CPU remains locked +in valgrind.so for the entire rest of the program, but the synthetic +CPU returns from the initialisation function. Startup of the program +now continues as usual -- the dynamic linker calls all the other .so's +initialisation routines, and eventually runs main(). This all runs on +the synthetic CPU, not the real one, but the client program cannot +tell the difference. 

Eventually main() exits, so the synthetic CPU calls valgrind.so's +finalisation function. Valgrind detects this, and uses it as its cue +to exit. It prints summaries of all errors detected, possibly checks +for memory leaks, and then exits the finalisation routine, but now on +the real CPU. The synthetic CPU has now lost control -- permanently +-- so the program exits back to the OS on the real CPU, just as it +would have done anyway. + +

On entry, Valgrind switches stacks, so it runs on its own stack. +On exit, it switches back. This means that the client program +continues to run on its own stack, so we can switch back and forth +between running it on the simulated and real CPUs without difficulty. +This was an important design decision, because it makes it easy (well, +significantly less difficult) to debug the synthetic CPU. + + + +

5.2  The translation/instrumentation engine

+ +Valgrind does not directly run any of the original program's code. Only +instrumented translations are run. Valgrind maintains a translation +table, which allows it to find the translation quickly for any branch +target (code address). If no translation has yet been made, the +translator - a just-in-time translator - is summoned. This makes an +instrumented translation, which is added to the collection of +translations. Subsequent jumps to that address will use this +translation. + +

Valgrind no longer directly supports detection of self-modifying +code. Such checking is expensive, and in practice (fortunately) +almost no applications need it. However, to help people who are +debugging dynamic code generation systems, there is a Client Request +(basically a macro you can put in your program) which directs Valgrind +to discard translations in a given address range. So Valgrind can +still work in this situation provided the client tells it when +code has become out-of-date and needs to be retranslated. + +

+ The JITter translates basic blocks -- blocks of straight-line code +-- as single entities. To minimise the considerable difficulties of +dealing with the x86 instruction set, x86 instructions are first +translated to a RISC-like intermediate code, similar to sparc code, +but with an infinite number of virtual integer registers. Initially +each insn is translated separately, and there is no attempt at +instrumentation. 
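As a rough illustration of this RISC-like form, an instruction such as movl 4(%ebp),%eax might be broken into get/load/put steps over virtual registers. The mnemonics below are a sketch in the spirit of UCode, not a verbatim dump of Valgrind's output:

```
GETL    %EBP, t0      # fetch simulated %ebp into virtual register t0
ADDL    $4, t0        # compute the effective address
LDL     (t0), t2      # load 4 bytes from simulated memory
PUTL    t2, %EAX      # write the result back to simulated %eax
```

Each virtual register (t0, t2, ...) is later mapped to a real register by the allocator described below.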

+ The intermediate code is improved, mostly so as to try and cache +the simulated machine's registers in the real machine's registers over +several simulated instructions. This is often very effective. Also, +we try to remove redundant updates of the simulated machine's +condition-code register. 

The intermediate code is then instrumented, giving more +intermediate code. There are a few extra intermediate-code operations +to support instrumentation; it is all refreshingly simple. After +instrumentation there is a cleanup pass to remove redundant value +checks. + +

This gives instrumented intermediate code which mentions arbitrary +numbers of virtual registers. A linear-scan register allocator is +used to assign real registers and possibly generate spill code. All +of this is still phrased in terms of the intermediate code. This +machinery is inspired by the work of Reuben Thomas (Mite). + +

Then, and only then, is the final x86 code emitted. The +intermediate code is carefully designed so that x86 code can be +generated from it without need for spare registers or other +inconveniences. + +

The translations are managed using a traditional LRU-based caching +scheme. The translation cache has a default size of about 14MB. + + + +

5.3  Tracking the status of memory

Each byte in the +process' address space has nine bits associated with it: one A bit and +eight V bits. The A and V bits for each byte are stored using a +sparse array, which flexibly and efficiently covers arbitrary parts of +the 32-bit address space without imposing significant space or +performance overheads for the parts of the address space never +visited. The scheme used, and speedup hacks, are described in detail +at the top of the source file vg_memory.c, so you should read that for +the gory details. + + + +

5.4 System calls

+All system calls are intercepted. The memory status map is consulted +before and updated after each call. It's all rather tiresome. See +vg_syscall_mem.c for details. + + + +

5.5  Signals

+All system calls to sigaction() and sigprocmask() are intercepted. If +the client program is trying to set a signal handler, Valgrind makes a +note of the handler address and which signal it is for. Valgrind then +arranges for the same signal to be delivered to its own handler. + +

When such a signal arrives, Valgrind's own handler catches it, and +notes the fact. At a convenient safe point in execution, Valgrind +builds a signal delivery frame on the client's stack and runs its +handler. If the handler longjmp()s, there is nothing more to be said. +If the handler returns, Valgrind notices this, zaps the delivery +frame, and carries on where it left off before delivering the signal. + +

The purpose of this nonsense is that setting signal handlers +essentially amounts to giving callback addresses to the Linux kernel. +We can't allow this to happen, because if it did, signal handlers +would run on the real CPU, not the simulated one. This means the +checking machinery would not operate during the handler run, and, +worse, memory permissions maps would not be updated, which could cause +spurious error reports once the handler had returned. + +

An even worse thing would happen if the signal handler longjmp'd +rather than returned: Valgrind would completely lose control of the +client program. + +

Upshot: we can't allow the client to install signal handlers +directly. Instead, Valgrind must catch, on behalf of the client, any +signal the client asks to catch, and must deliver it to the client on +the simulated CPU, not the real one. This involves considerable +gruesome fakery; see vg_signals.c for details. 

+ +


+ + +

6  Example

+This is the log for a run of a small program. The program is in fact +correct, and the reported error is as the result of a potentially serious +code generation bug in GNU g++ (snapshot 20010527). +
+sewardj@phoenix:~/newmat10$
+~/Valgrind-6/valgrind -v ./bogon 
+==25832== Valgrind 0.10, a memory error detector for x86 RedHat 7.1.
+==25832== Copyright (C) 2000-2001, and GNU GPL'd, by Julian Seward.
+==25832== Startup, with flags:
+==25832== --suppressions=/home/sewardj/Valgrind/redhat71.supp
+==25832== reading syms from /lib/ld-linux.so.2
+==25832== reading syms from /lib/libc.so.6
+==25832== reading syms from /mnt/pima/jrs/Inst/lib/libgcc_s.so.0
+==25832== reading syms from /lib/libm.so.6
+==25832== reading syms from /mnt/pima/jrs/Inst/lib/libstdc++.so.3
+==25832== reading syms from /home/sewardj/Valgrind/valgrind.so
+==25832== reading syms from /proc/self/exe
+==25832== loaded 5950 symbols, 142333 line number locations
+==25832== 
+==25832== Invalid read of size 4
+==25832==    at 0x8048724: _ZN10BandMatrix6ReSizeEiii (bogon.cpp:45)
+==25832==    by 0x80487AF: main (bogon.cpp:66)
+==25832==    by 0x40371E5E: __libc_start_main (libc-start.c:129)
+==25832==    by 0x80485D1: (within /home/sewardj/newmat10/bogon)
+==25832==    Address 0xBFFFF74C is not stack'd, malloc'd or free'd
+==25832==
+==25832== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
+==25832== malloc/free: in use at exit: 0 bytes in 0 blocks.
+==25832== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
+==25832== For a detailed leak analysis, rerun with: --leak-check=yes
+==25832==
+==25832== exiting, did 1881 basic blocks, 0 misses.
+==25832== 223 translations, 3626 bytes in, 56801 bytes out.
+
+

The GCC folks fixed this about a week before gcc-3.0 shipped. +


+

+ + + + diff --git a/coregrind/coregrind_intro.html b/coregrind/coregrind_intro.html new file mode 100644 index 000000000..e561da410 --- /dev/null +++ b/coregrind/coregrind_intro.html @@ -0,0 +1,176 @@ + + + +

1  Introduction

+ + +

1.1  An overview of Valgrind

+ +Valgrind is a flexible tool for profiling and debugging Linux-x86 +executables. The tool consists of a core, which provides a synthetic +x86 CPU in software, and a series of "skins", each of which is a +debugging or profiling tool. The architecture is modular, so that new +skins can be created easily and without disturbing the existing +structure. + +

+A number of useful skins are supplied as standard. In summary, these +are: + +

+ +A number of minor skins (corecheck, lackey and +none) are also supplied. These aren't particularly useful -- +they exist to illustrate how to create simple skins and to help the +valgrind developers in various ways. + + +

+Valgrind is closely tied to details of the CPU, operating system and, +to a lesser extent, compiler and basic C libraries. This makes it +difficult to port, so we have chosen at the outset to +concentrate on what we believe to be a widely used platform: Linux on +x86s. Valgrind uses the standard Unix ./configure, +make, make install mechanism, and we have +attempted to ensure that it works on machines with kernel 2.2 or 2.4 +and glibc 2.1.X, 2.2.X or 2.3.X. This should cover the vast majority +of modern Linux installations. 

+Valgrind is licensed under the GNU General Public License, version +2. Read the file LICENSE in the source distribution for details. Some +of the PThreads test cases, pth_*.c, are taken from +"Pthreads Programming" by Bradford Nichols, Dick Buttlar & +Jacqueline Proulx Farrell, ISBN 1-56592-115-1, published by O'Reilly +& Associates, Inc. + + + + + +

1.2  How to navigate this manual

+ +Valgrind is structured as a set of core services supporting a number +of profiling and debugging tools ("skins"). This manual is structured +similarly. Below, we continue with a description of the valgrind +core, how to use it, and the flags it supports. + +

+The skins each have their own chapters in this manual. You only need +to read the documentation for the skin(s) you actually use, although +you may find it helpful to be at least a little bit familiar with what +all skins do. +

+If you're new to all this, you're most likely to be using the Memcheck +skin, since that's the one selected by default. So, read the rest of +this page, and the section Memcheck. + +

+Be aware that the core understands some command line flags, and the +skins then have their own flags which they know about. This means +there is no central place describing all the flags that are accepted +-- you have to read the flags documentation both for valgrind's core +(below) and for the skin you want to use. + +

+

For users migrating from valgrind-1.0.X

+

+Valgrind-2.0.X is a major redesign of the 1.0.X series. You should at +least be familiar with the concept of the new core/skin division, +as explained above in the Introduction. Having said that, we've tried +to make the command line handling and behaviour as +backwards-compatible as we can. In particular, just running +valgrind [args-for-valgrind] my_prog [args-for-my-prog] +should work pretty much as before. + +

+ diff --git a/coregrind/coregrind_skins.html b/coregrind/coregrind_skins.html new file mode 100644 index 000000000..a17397139 --- /dev/null +++ b/coregrind/coregrind_skins.html @@ -0,0 +1,687 @@ + + + + Valgrind + + + + +  +

Valgrind Skins

+
+ A guide to writing new skins for Valgrind
+ This guide was last updated on 26 September 2002 +
+

+ +

+njn25@cam.ac.uk
+Nick Nethercote, October 2002 +

+Valgrind is licensed under the GNU General Public License, +version 2
+An open-source tool for supervising execution of Linux-x86 executables. +

+ +

+ +


+ +

Contents of this manual

+ +

Introduction

+ 1.1  Supervised Execution
+ 1.2  Skins
+ 1.3  Execution Spaces
+ +

Writing a Skin

+ 2.1  Why write a skin?
+ 2.2  How skins work
+ 2.3  Getting the code
+ 2.4  Getting started
+ 2.5  Writing the code
+ 2.6  Initialisation
+ 2.7  Instrumentation
+ 2.8  Finalisation
+ 2.9  Other important information
+ 2.10  Words of advice
+ +

Advanced Topics

+ 3.1  Suppressions
+ 3.2  Documentation
+ 3.3  Regression tests
+ 3.4  Profiling
+ 3.5  Other makefile hackery
+ 3.6  Core/skin interface versions
+ +

Final Words

+ +
+ + +

1  Introduction

+ + +

1.1  Supervised Execution

+ +Valgrind provides a generic infrastructure for supervising the execution of +programs. This is done by providing a way to instrument programs in very +precise ways, making it relatively easy to support activities such as dynamic +error detection and profiling.

+ +Although writing a skin is not easy and requires learning quite a few things +about Valgrind, it is much easier than instrumenting a program from scratch +yourself. +

1.2  Skins

+The key idea behind Valgrind's architecture is the division between its +``core'' and ``skins''. +

+The core provides the common low-level infrastructure to support program +instrumentation, including the x86-to-x86 JIT compiler, low-level memory +manager, signal handling and a scheduler (for pthreads). It also provides +certain services that are useful to some but not all skins, such as support +for error recording and suppression. +

+But the core leaves certain operations undefined, which must be filled in by skins. +Most notably, skins define how program code should be instrumented. They can +also define certain variables to indicate to the core that they would like to +use certain services, or be notified when certain interesting events occur. +

+Each skin that is written defines a new program supervision tool. Writing a +new tool just requires writing a new skin. The core takes care of all the hard +work. +

+ + +

1.3  Execution Spaces

+An important concept to understand before writing a skin is that there are +three spaces in which program code executes: + +
    +
  1. User space: this covers most of the program's execution. The skin is + given the code and can instrument it any way it likes, providing (more or + less) total control over the code.

    + + Code executed in user space includes all the program code, almost all of + the C library (including things like the dynamic linker), and almost + all parts of all other libraries. +

  2. + +

  3. Core space: a small proportion of the program's execution takes place + entirely within Valgrind's core. This includes:

    + +

      +
    • Dynamic memory management (malloc() etc.)
    • + +
    • Pthread operations and scheduling
    • + +
    • Signal handling
    • +

    + + A skin has no control over these operations; it never ``sees'' the code + doing this work and thus cannot instrument it. However, the core + provides hooks so a skin can be notified when certain interesting events + happen, for example when dynamic memory is allocated or freed, the + stack pointer is changed, or a pthread mutex is locked.

    + + Note that these hooks only notify skins of events relevant to user + space. For example, when the core allocates some memory for its own use, + the skin is not notified of this, because it's not directly part of the + supervised program's execution. +

  4. + +

  5. Kernel space: execution in the kernel. Two kinds:

    + +

      +
    1. System calls: can't be directly observed by either the skin or the + core. But the core does have some idea of what happens to the + arguments, and it provides hooks for a skin to wrap system calls. +
    2. + +

    3. Other: all other kernel activity (e.g. process scheduling) is + totally opaque and irrelevant to the program. +
    4. +

    +
  6. + + It should be noted that a skin only has direct control over code executed in + user space. This is the vast majority of code executed, but it is not + absolutely all of it, so any profiling information recorded by a skin won't + be totally accurate. +

+ + + +

2  Writing a Skin

+ + +

2.1  Why write a skin?

+ +Before you write a skin, you should have some idea of what it should do. What +is it you want to know about your programs of interest? Consider some existing +skins: + + + +These examples give a reasonable idea of what kinds of things Valgrind can be +used for. The instrumentation can range from very lightweight (e.g. counting +the number of times a particular function is called) to very intrusive (e.g. +memcheck's memory checking). + +
+

2.2  How skins work

+ +Skins must define various functions for instrumenting programs that are called +by Valgrind's core, yet they must be implemented in such a way that they can be +written and compiled without touching Valgrind's core. This is important, +because one of our aims is to allow people to write and distribute their own +skins that can be plugged into Valgrind's core easily.

+ +This is achieved by packaging each skin into a separate shared object which is +then loaded ahead of the core shared object valgrind.so, using the +dynamic linker's LD_PRELOAD variable. Any functions defined in +the skin that share a name with a function defined in the core (such as +the instrumentation function SK_(instrument)()) override the +core's definition. Thus the core can call the necessary skin functions.

+ +This magic is all done for you; the shared object used is chosen with the +--skin option to the valgrind startup script. The +default skin used is memcheck, Valgrind's original memory checker. + + +

2.3  Getting the code

+ +To write your own skin, you'll need to check out a copy of Valgrind from the +CVS repository, rather than using a packaged distribution. This is because it +contains several extra files needed for writing skins.

+ +To check out the code from the CVS repository, first login: +

+cvs -d:pserver:anonymous@cvs.valgrind.sourceforge.net:/cvsroot/valgrind login +
+ +Then checkout the code. To get a copy of the current development version +(recommended for the brave only): +
+cvs -z3 -d:pserver:anonymous@cvs.valgrind.sourceforge.net:/cvsroot/valgrind co valgrind +
+ +To get a copy of the stable released branch: +
+cvs -z3 -d:pserver:anonymous@cvs.valgrind.sourceforge.net:/cvsroot/valgrind co -r TAG valgrind +
+ +where TAG has the form VALGRIND_X_Y_Z for +version X.Y.Z. + +
+

2.4  Getting started

+ +Valgrind uses GNU automake and autoconf for the +creation of Makefiles and configuration. But don't worry, these instructions +should be enough to get you started even if you know nothing about those +tools.

+ +In what follows, all filenames are relative to Valgrind's top-level directory +valgrind/. + +

    +
  1. Choose a name for the skin, and an abbreviation that can be used as a + short prefix. We'll use foobar and fb as an + example. +
  2. + +

  3. Make a new directory foobar/ which will hold the skin. +
  4. + +

  5. Copy example/Makefile.am into foobar/. + Edit it by replacing all occurrences of the string + ``example'' with ``foobar'' and the one + occurrence of the string ``ex_'' with ``fb_''. + It might be worth trying to understand this file, at least a little; you + might have to do more complicated things with it later on. In + particular, the name of the vgskin_foobar_so_SOURCES variable + determines the name of the skin's shared object, which determines what + name must be passed to the --skin option to use the skin. +
  6. + +

  7. Copy example/ex_main.c into + foobar/, renaming it as fb_main.c. + Edit it by changing the five lines in SK_(pre_clo_init)() + to something appropriate for the skin. These fields are used in the + startup message, except for bug_reports_to which is used + if a skin assertion fails. +
  8. + +

  9. Edit Makefile.am, adding the new directory + foobar to the SUBDIRS variable. +
  10. + +

  11. Edit configure.in, adding foobar/Makefile to the + AC_OUTPUT list. +
  12. + +

  13. Run: +
    +    autogen.sh
    +    ./configure --prefix=`pwd`/inst
    +    make install
    + + It should automake, configure and compile without errors, putting copies + of the skin's shared object vgskin_foobar.so in + foobar/ and + inst/lib/valgrind/. +
  14. + +

  15. You can test it with a command like +
    +    inst/bin/valgrind --skin=foobar date
    + + (almost any program should work; date is just an example). + The output should be something like this: +
    +==738== foobar-0.0.1, a foobarring tool for x86-linux.
    +==738== Copyright (C) 2002, and GNU GPL'd, by J. Random Hacker.
    +==738== Built with valgrind-1.1.0, a program execution monitor.
    +==738== Copyright (C) 2000-2002, and GNU GPL'd, by Julian Seward.
    +==738== Estimated CPU clock rate is 1400 MHz
    +==738== For more details, rerun with: -v
    +==738== 
    +Wed Sep 25 10:31:54 BST 2002
    +==738==
    + + The skin does nothing except run the program uninstrumented. +
  16. +

+ +These steps don't have to be followed exactly - you can choose different names +for your source files, and use a different --prefix for +./configure.

+ +Now that we've set up, built and tested the simplest possible skin, on to the +interesting stuff... + +

2.5  Writing the code

+ +A skin must define at least these four functions: +
+    SK_(pre_clo_init)()
+    SK_(post_clo_init)()
+    SK_(instrument)()
+    SK_(fini)()
+
+ +Also, it must use the macro VG_DETERMINE_INTERFACE_VERSION +exactly once in its source code. If it doesn't, you will get a link error +involving VG_(skin_interface_major_version). This macro is +used to ensure the core/skin interface used by the core and a plugged-in +skin are binary compatible. + +In addition, if a skin wants to use some of the optional services provided by +the core, it may have to define other functions. + + +

2.6  Initialisation

+ +Most of the initialisation should be done in SK_(pre_clo_init)(). +Only use SK_(post_clo_init)() if a skin provides command line +options and must do some initialisation after option processing takes place +(``clo'' stands for ``command line options'').

+ +The first argument to SK_(pre_clo_init)() must be initialised with +various ``details'' for a skin. These are all compulsory except for +version. They are used when constructing the startup message, +except for bug_reports_to, which is used if VG_(skin_panic)() is +ever called, or a skin assertion fails.

+ +The second argument to SK_(pre_clo_init)() must be initialised with +the ``needs'' for a skin. They are mostly booleans, and can be left untouched +(they default to False). They determine whether a skin can do +various things such as: record, report and suppress errors; process command +line options; wrap system calls; record extra information about malloc'd +blocks, etc.

+ +For example, if a skin wants the core's help in recording and reporting errors, +it must set the skin_errors need to True, and then +provide definitions of six functions for comparing errors, printing out errors, +reading suppressions from a suppressions file, etc. While writing these +functions requires some work, it's much less than doing error handling from +scratch because the core is doing most of the work. See the type +VgNeeds in include/vg_skin.h for full details of all +the needs.

+ +The third argument to SK_(pre_clo_init)() must be initialised to +indicate which events in the core the skin wants to be notified about. These +include things such as blocks of memory being malloc'd, the stack pointer +changing, a mutex being locked, etc. If a skin wants to know about such an event, +it should set the relevant pointer in the structure to point to a function, +which will be called when that event happens.

+ +For example, if the skin wants to be notified when a new block of memory is +malloc'd, it should set the new_mem_heap function pointer, and the +assigned function will be called each time this happens. See the type +VgTrackEvents in include/vg_skin.h for full details +of all the trackable events.

+ + +

2.7  Instrumentation

+ +SK_(instrument)() is the interesting one. It allows you to +instrument UCode, which is Valgrind's RISC-like intermediate language. +UCode is described in the technical docs. + +The easiest way to instrument UCode is to insert calls to C functions when +interesting things happen. See the skin ``lackey'' +(lackey/lk_main.c) for a simple example of this, or +Cachegrind (cachegrind/cg_main.c) for a more complex +example.

+ +A much more complicated way to instrument UCode, albeit one that might result +in faster instrumented programs, is to extend UCode with new UCode +instructions. This is recommended for advanced Valgrind hackers only! See the +``memcheck'' skin for an example. + + +

2.8  Finalisation

+ +This is where you can present the final results, such as a summary of the +information collected. Any log files should be written out at this point. + + +

2.9  Other important information

+ +Please note that the core/skin split infrastructure is all very new, and not +very well documented. Here are some important points, but there are +undoubtedly many others that I should note but haven't thought of.

+ +The file include/vg_skin.h contains all the types, +macros, functions, etc. that a skin should (hopefully) need, and is the only +.h file a skin should need to #include.

+ +In particular, you probably shouldn't use anything from the C library (there +are deep reasons for this, trust us). Valgrind provides an implementation of a +reasonable subset of the C library, details of which are in +vg_skin.h.

+ +Similarly, when writing a skin, you shouldn't need to look at any of the code +in Valgrind's core, although it might sometimes be useful for understanding +something.

+ +vg_skin.h has a reasonable amount of documentation in it that +should hopefully be enough to get you going. But ultimately, the skins +distributed (memcheck, addrcheck, cachegrind, lackey, etc.) are probably the +best documentation of all, for the moment.

+ +Note that the VG_ and SK_ macros are used heavily. +These just prepend longer strings in front of names to avoid potential +namespace clashes. We strongly recommend using the SK_ macro +for any global functions and variables in your skin.

+ + +

2.10  Words of Advice

+ +Writing and debugging skins is not trivial. Here are some suggestions for +solving common problems.

+ +If you are getting segmentation faults in C functions used by your skin, the +usual GDB command: +

gdb prog core
+usually gives the location of the segmentation fault.

+ +If you want to debug C functions used by your skin, you can attach GDB to +Valgrind with some effort: +

+ +GDB may be able to give you useful information. Note that by default +most of the system is built with -fomit-frame-pointer, +and you'll need to get rid of this to extract useful tracebacks from +GDB.

+ +If you just want to know whether a program point has been reached, using the +OINK macro (in include/vg_skin.h) can be easier than +using GDB.

+ +If you are having problems with your UCode instrumentation, it's likely that +GDB won't be able to help at all. In this case, Valgrind's +--trace-codegen option is invaluable for observing the results of +instrumentation.

+ +The other debugging command line options can be useful too (run valgrind +-h for the list).

+ + +

3  Advanced Topics

+ +Once a skin becomes more complicated, there are some extra things you may +want/need to do. + + +

3.1  Suppressions

+ +If your skin reports errors and you want to suppress some common ones, you can +add suppressions to the suppression files. The relevant files are +valgrind/*.supp; the final suppression file is aggregated by +combining the relevant .supp files depending on the +versions of Linux, X and glibc on a system. +
+

3.2  Documentation

+ +If you are feeling conscientious and want to write some HTML documentation for +your skin, follow these steps (using foobar as the example skin +name again): + +
    +
  1. Make a directory foobar/docs/. +
  2. + +

  3. Edit foobar/Makefile.am, adding docs to + the SUBDIRS variable. +
  4. + +

  5. Edit configure.in, adding + foobar/docs/Makefile to the AC_OUTPUT list. +
  6. + +

  7. Write foobar/docs/Makefile.am. Use + memcheck/docs/Makefile.am as an example. +
  8. + +
  9. Write the documentation; the top-level file should be called + foobar/docs/index.html. +
  10. + +

  11. (optional) Add a link in the main documentation index + docs/index.html to + foobar/docs/index.html +
  12. +

+ +
+

3.3  Regression tests

+ +Valgrind has some support for regression tests. If you want to write +regression tests for your skin: + +
    +
  1. Make a directory foobar/tests/. +
  2. + +

  3. Edit foobar/Makefile.am, adding tests to + the SUBDIRS variable. +
  4. + +

  5. Edit configure.in, adding + foobar/tests/Makefile to the AC_OUTPUT list. +
  6. + +

  7. Write foobar/tests/Makefile.am. Use + memcheck/tests/Makefile.am as an example. +
  8. + +

  9. Write the tests, .vgtest test description files, + .stdout.exp and .stderr.exp expected output + files. (Note that Valgrind's output goes to stderr.) Some details + on writing and running tests are given in the comments at the top of the + testing script tests/vg_regtest. +
  10. + +

  11. Write a filter for stderr results foobar/tests/filter_stderr. + It can call the existing filters in tests/. See + memcheck/tests/filter_stderr for an example; in particular + note the $dir trick that ensures the filter works correctly + from any directory. +
  12. +

+ +
+

3.4  Profiling

+ +To do simple tick-based profiling of a skin, include the line +
+#include "vg_profile.c" +
+in the skin somewhere, and rebuild (you may have to make clean +first). Then run Valgrind with the --profile=yes option.

+ +The profiler is stack-based; you can register a profiling event with +VGP_(register_profile_event)() and then use the +VGP_PUSHCC and VGP_POPCC macros to record time spent +doing certain things. New profiling event numbers must not overlap with the +core profiling event numbers. See include/vg_skin.h for details +and the ``memcheck'' skin for an example. + + + +

3.5  Other makefile hackery

+ +If you add any directories under valgrind/foobar/, you will +need to add an appropriate Makefile.am to it, and add a +corresponding entry to the AC_OUTPUT list in +valgrind/configure.in.

+ +If you add any scripts to your skin (see Cachegrind for an example) you need to +add them to the bin_SCRIPTS variable in +valgrind/foobar/Makefile.am.

+ + + +

3.6  Core/skin interface versions

+ +In order to allow for the core/skin interface to evolve over time, Valgrind +uses a basic interface versioning system. All a skin has to do is use the +VG_DETERMINE_INTERFACE_VERSION macro exactly once in its code. +If not, a link error will occur when the skin is built. +

+The interface version number has the form X.Y. Changes in Y indicate binary +compatible changes. Changes in X indicate binary incompatible changes. If +the core and skin have the same major version number X, they should work +together. If X doesn't match, Valgrind will abort execution with an +explanation of the problem. +

+This approach was chosen so that if the interface changes in the future, +old skins won't work and the reason will be clearly explained, instead of +possibly crashing mysteriously. We have attempted to minimise the potential +for binary incompatible changes by means such as minimising the use of naked +structs in the interface. + + +

4  Final Words

+ +This whole core/skin business is very new and experimental, and under active +development.

+ +The first consequence of this is that the core/skin interface is quite +immature. It will almost certainly change in the future; we have no intention +of freezing it and then regretting the inevitable stupidities. Hopefully most +of the future changes will be to add new features, hooks, functions, etc, +rather than to change old ones, which should cause a minimum of trouble for +existing skins, and we've put some effort into future-proofing the interface +to avoid binary incompatibility. But we can't guarantee anything. The +versioning system should catch any incompatibilities. Just something to be +aware of.

+ +The second consequence of this is that we'd love to hear your feedback about +it: + +

+ +or anything else!

+ +Happy programming. + diff --git a/docs/manual.html b/docs/manual.html new file mode 100644 index 000000000..d3670199a --- /dev/null +++ b/docs/manual.html @@ -0,0 +1,92 @@ + + + + Valgrind + + + + +  +

Valgrind, version 2.0.0

+
This manual was last updated on 10 November 2002
+

+ +

+jseward@acm.org, + njn25@cam.ac.uk
+Copyright © 2000-2002 Julian Seward, Nick Nethercote +

+ +Valgrind is licensed under the GNU General Public License, version +2
+ +An open-source tool for debugging and profiling Linux-x86 executables. +

+ +

+ +


+ +

Contents of this manual

+ +

Introduction

+ 1.1  What Valgrind is for
+ 1.2  What it does with your program + +

How to use it, and how to make sense + of the results

+ 2.1  Getting started
+ 2.2  The commentary
+ 2.3  Reporting of errors
+ 2.4  Suppressing errors
+ 2.5  Command-line flags
+ 2.6  Explanation of error messages
+ 2.7  Writing suppressions files
+ 2.8  The Client Request mechanism
+ 2.9  Support for POSIX pthreads
+ 2.10  Building and installing
+ 2.11  If you have problems
+ +

Details of the checking machinery

+ 3.1  Valid-value (V) bits
+ 3.2  Valid-address (A) bits
+ 3.3  Putting it all together
+ 3.4  Signals
+ 3.5  Memory leak detection
+ +

Limitations

+ +

How it works -- a rough overview

+ 5.1  Getting started
+ 5.2  The translation/instrumentation engine
+ 5.3  Tracking the status of memory
+ 5.4  System calls
+ 5.5  Signals
+ +

An example

+ +

Cache profiling

+ +

The design and implementation of Valgrind

+ +
+ + + diff --git a/helgrind/hg_main.html new file mode 100644 index 000000000..b9d72f9bc --- /dev/null +++ b/helgrind/hg_main.html @@ -0,0 +1,80 @@ + + + + Helgrind + + + + + +

Helgrind

+
This manual was last updated on 2002-10-03
+

+ +

+njn25@cam.ac.uk
+Copyright © 2000-2002 Nicholas Nethercote +

+Helgrind is licensed under the GNU General Public License, +version 2
+Helgrind is a Valgrind skin for detecting data races in threaded programs. +

+ +

+ +

1  Helgrind

+ +Helgrind is a Valgrind skin for detecting data races in C and C++ programs +that use the Pthreads library. +

+It uses the Eraser algorithm described in +

+ Eraser: A Dynamic Data Race Detector for Multithreaded Programs
+ Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro and + Thomas Anderson
+ ACM Transactions on Computer Systems, 15(4):391-411
+ November 1997. +
+ +It is unfortunately in a rather mangy state and probably doesn't work at all. +We include it partly because it may serve as a useful example skin, and partly +in case anybody is inspired to improve it and get it working. +

+If you are inspired, we'd love to hear from you. And if you are successful, +you might like to include some improvements to the basic Eraser algorithm +described in Section 4.2 of + +

+ Runtime Checking of Multithreaded Applications with Visual Threads + Jerry J. Harrow, Jr.
+ Proceedings of the 7th International SPIN Workshop on Model Checking of + Software
+ Stanford, California, USA
+ August 2000
+ LNCS 1885, pp331--342
+ K. Havelund, J. Penix, and W. Visser, editors.
+
+ + +
+ + + diff --git a/lackey/lk_main.html new file mode 100644 index 000000000..72f1e8425 --- /dev/null +++ b/lackey/lk_main.html @@ -0,0 +1,68 @@ + + + + Lackey + + + + + +

Lackey

+
This manual was last updated on 2002-10-03
+

+ +

+njn25@cam.ac.uk
+Copyright © 2000-2002 Nicholas Nethercote +

+Lackey is licensed under the GNU General Public License, +version 2
+Lackey is an example Valgrind skin that does some very basic program +measurement. +

+ +

+ +

1  Lackey

+ +Lackey is a simple Valgrind skin that does some basic program measurement. +It adds quite a lot of simple instrumentation to the program's code. It is +primarily intended to be of use as an example skin. +

+It measures three things: + +

    +
  1. The number of calls to _dl_runtime_resolve(), the function + in glibc's dynamic linker that resolves function lookups into shared + objects.

    + +

  2. The number of UCode instructions (UCode is Valgrind's RISC-like + intermediate language), x86 instructions, and basic blocks executed by the + program, and some ratios between the three counts.

    + +

  3. The number of conditional branches encountered and the proportion of those + taken.

    +

+ +
+ + + diff --git a/memcheck/mc_main.html b/memcheck/mc_main.html new file mode 100644 index 000000000..32177f30f --- /dev/null +++ b/memcheck/mc_main.html @@ -0,0 +1,830 @@ + +--------------------------- + +
  • --partial-loads-ok=yes [the default]
    + --partial-loads-ok=no +

    Controls how Valgrind handles word (4-byte) loads from + addresses for which some bytes are addressable and others + are not. When yes (the default), such loads + do not elicit an address error. Instead, the loaded V bytes + corresponding to the illegal addresses indicate undefined data, and + those corresponding to legal addresses are loaded from shadow + memory, as usual.

    + When no, loads from partially + invalid addresses are treated the same as loads from completely + invalid addresses: an illegal-address error is issued, + and the resulting V bytes indicate valid data. +


  • + +

  • --freelist-vol=<number> [default: 1000000] +

    When the client program releases memory using free (in C) or + delete (C++), that memory is not immediately made available for + re-allocation. Instead it is marked inaccessible and placed in + a queue of freed blocks. The purpose is to delay the point at + which freed-up memory comes back into circulation. This + increases the chance that Valgrind will be able to detect + invalid accesses to blocks for some significant period of time + after they have been freed. +

    + This flag specifies the maximum total size, in bytes, of the + blocks in the queue. The default value is one million bytes. + Increasing this increases the total amount of memory used by + Valgrind but may detect invalid uses of freed blocks which would + otherwise go undetected.


  • + +

  • --leak-check=no [default]
    + --leak-check=yes +

    When enabled, Valgrind searches for memory leaks when the client program + finishes. A memory leak means a malloc'd block which has not + yet been free'd, but to which no pointer can be found. Such a + block can never be free'd by the program, since no pointer to it + exists. Leak checking is disabled by default because it tends + to generate dozens of error messages.


  • + +

  • --show-reachable=no [default]
    + --show-reachable=yes +

    When disabled, the memory leak detector only shows blocks to + which it cannot find a pointer at all, or to which it can only + find a pointer into the middle. These blocks are prime candidates for + memory leaks. When enabled, the leak detector also reports on + blocks to which it could find a pointer. Your program could, at + least in principle, have freed such blocks before exit. + Contrast this with blocks for which no pointer, or only an + interior pointer, could be found: they are more likely to + indicate memory leaks, because you do not actually have a + pointer to the start of the block which you can hand to + free, even if you wanted to.


  • + +

  • --leak-resolution=low [default]
    + --leak-resolution=med
    + --leak-resolution=high +

    When doing leak checking, determines how willing Valgrind is + to consider different backtraces to be the same. When set to + low, the default, only the first two entries need + match. When med, four entries have to match. When + high, all entries need to match. +

    + For hardcore leak debugging, you probably want to use + --leak-resolution=high together with + --num-callers=40 or some such large number. Note + however that this can give an overwhelming amount of + information, which is why the defaults are 4 callers and + low-resolution matching. +

    + Note that the --leak-resolution= setting does not + affect Valgrind's ability to find leaks. It only changes how + the results are presented. +


  • + +

  • --workaround-gcc296-bugs=no [default]
    + --workaround-gcc296-bugs=yes

    When enabled, Valgrind + assumes that reads and writes some small distance below the stack + pointer %esp are due to bugs in gcc 2.96, and does + not report them. The "small distance" is 256 bytes by default. + Note that gcc 2.96 is the default compiler on some popular Linux + distributions (RedHat 7.X, Mandrake) and so you may well need to + use this flag. Do not use it if you do not have to, as it can + cause real errors to be overlooked. Another option is to use a + gcc/g++ which does not generate accesses below the stack + pointer. 2.95.3 seems to be a good choice in this respect.

    + Unfortunately (27 Feb 02) it looks like g++ 3.0.4 has a similar + bug, so you may need to issue this flag if you use 3.0.4. A + while later (early Apr 02) this is confirmed as a scheduling bug + in g++-3.0.4. +


  • + +

  • --cleanup=no
    + --cleanup=yes [default] +

    When enabled, various improvements are applied to the + post-instrumented intermediate code, aimed at removing redundant + value checks.


  • +

    + + + + + +

    2.6  Explanation of error messages

    + +Despite considerable sophistication under the hood, Valgrind can only +really detect two kinds of errors, use of illegal addresses, and use +of undefined values. Nevertheless, this is enough to help you +discover all sorts of memory-management nasties in your code. This +section presents a quick summary of what error messages mean. The +precise behaviour of the error-checking machinery is described in +Section 4. + + +

    2.6.1  Illegal read / Illegal write errors

    +For example: +
    +  Invalid read of size 4
    +     at 0x40F6BBCC: (within /usr/lib/libpng.so.2.1.0.9)
    +     by 0x40F6B804: (within /usr/lib/libpng.so.2.1.0.9)
    +     by 0x40B07FF4: read_png_image__FP8QImageIO (kernel/qpngio.cpp:326)
    +     by 0x40AC751B: QImageIO::read() (kernel/qimage.cpp:3621)
    +     Address 0xBFFFF0E0 is not stack'd, malloc'd or free'd
    +
    + +

    This happens when your program reads or writes memory at a place +which Valgrind reckons it shouldn't. In this example, the program did +a 4-byte read at address 0xBFFFF0E0, somewhere within the +system-supplied library libpng.so.2.1.0.9, which was called from +somewhere else in the same library, called from line 326 of +qpngio.cpp, and so on. + +

    Valgrind tries to establish what the illegal address might relate +to, since that's often useful. So, if it points into a block of +memory which has already been freed, you'll be informed of this, and +also where the block was free'd at. Likewise, if it should turn out +to be just off the end of a malloc'd block, a common result of +off-by-one-errors in array subscripting, you'll be informed of this +fact, and also where the block was malloc'd. + +

    In this example, Valgrind can't identify the address. Actually the +address is on the stack, but, for some reason, this is not a valid +stack address -- it is below the stack pointer, %esp, and that isn't +allowed. In this particular case it's probably caused by gcc +generating invalid code, a known bug in various flavours of gcc. + +

    Note that Valgrind only tells you that your program is about to +access memory at an illegal address. It can't stop the access from +happening. So, if your program makes an access which normally would +result in a segmentation fault, your program will still suffer the same +fate -- but you will get a message from Valgrind immediately prior to +this. In this particular example, reading junk on the stack is +non-fatal, and the program stays alive. + +

    2.6.2  Use of uninitialised values

    +For example: +
    +  Conditional jump or move depends on uninitialised value(s)
    +     at 0x402DFA94: _IO_vfprintf (_itoa.h:49)
    +     by 0x402E8476: _IO_printf (printf.c:36)
    +     by 0x8048472: main (tests/manuel1.c:8)
    +     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
    +
    + +

    An uninitialised-value use error is reported when your program uses +a value which hasn't been initialised -- in other words, is undefined. +Here, the undefined value is used somewhere inside the printf() +machinery of the C library. This error was reported when running the +following small program: +

    +  #include <stdio.h>
    +
    +  int main()
    +  {
    +    int x;
    +    printf ("x = %d\n", x);
    +  }
    +
    + +

    It is important to understand that your program can copy around +junk (uninitialised) data to its heart's content. Valgrind observes +this and keeps track of the data, but does not complain. A complaint +is issued only when your program attempts to make use of uninitialised +data. In this example, x is uninitialised. Valgrind observes the +value being passed to _IO_printf and thence to _IO_vfprintf, but makes +no comment. However, _IO_vfprintf has to examine the value of x so it +can turn it into the corresponding ASCII string, and it is at this +point that Valgrind complains. + +

    Sources of uninitialised data tend to be: +

    + + + +

    2.6.3  Illegal frees

    +For example: +
    +  Invalid free()
    +     at 0x4004FFDF: free (ut_clientmalloc.c:577)
    +     by 0x80484C7: main (tests/doublefree.c:10)
    +     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
    +     by 0x80483B1: (within tests/doublefree)
    +     Address 0x3807F7B4 is 0 bytes inside a block of size 177 free'd
    +     at 0x4004FFDF: free (ut_clientmalloc.c:577)
    +     by 0x80484C7: main (tests/doublefree.c:10)
    +     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
    +     by 0x80483B1: (within tests/doublefree)
    +
    +

    Valgrind keeps track of the blocks allocated by your program with +malloc/new, so it knows exactly whether or not the argument to +free/delete is legitimate. Here, this test program has +freed the same block twice. As with the illegal read/write errors, +Valgrind attempts to make sense of the address free'd. If, as +here, the address is one which has previously been freed, you will +be told that -- making duplicate frees of the same block easy to spot. + +

    2.6.4  When a block is freed with an inappropriate +deallocation function

    +In the following example, a block allocated with new[] +has wrongly been deallocated with free: +
    +  Mismatched free() / delete / delete []
    +     at 0x40043249: free (vg_clientfuncs.c:171)
    +     by 0x4102BB4E: QGArray::~QGArray(void) (tools/qgarray.cpp:149)
    +     by 0x4C261C41: PptDoc::~PptDoc(void) (include/qmemarray.h:60)
    +     by 0x4C261F0E: PptXml::~PptXml(void) (pptxml.cc:44)
    +     Address 0x4BB292A8 is 0 bytes inside a block of size 64 alloc'd
    +     at 0x4004318C: __builtin_vec_new (vg_clientfuncs.c:152)
    +     by 0x4C21BC15: KLaola::readSBStream(int) const (klaola.cc:314)
    +     by 0x4C21C155: KLaola::stream(KLaola::OLENode const *) (klaola.cc:416)
    +     by 0x4C21788F: OLEFilter::convert(QCString const &) (olefilter.cc:272)
    +
    +The following was told to me by the KDE 3 developers. I didn't know +any of it myself. They also implemented the check itself. +

    +In C++ it's important to deallocate memory in a way compatible with +how it was allocated. The deal is: if allocated with +malloc, calloc, realloc, +valloc or memalign, you must deallocate +with free; if allocated with new[], you must +deallocate with delete[]; and if allocated with +new, you must deallocate with delete. +

    +The worst thing is that on Linux apparently it doesn't matter if you +do muddle these up, and it all seems to work ok, but the same program +may then crash on a different platform, Solaris for example. So it's +best to fix it properly. According to the KDE folks "it's amazing how +many C++ programmers don't know this". +

    +Pascal Massimino adds the following clarification: +delete[] must be associated with a +new[] because the compiler stores the size of the array +and the pointer-to-member to the destructor of the array's content +just before the pointer actually returned. This implies a +variable-sized overhead in what's returned by new or +new[]. It is rather surprising how robust compilers [Ed: +runtime-support libraries?] are to mismatches of +new/delete and +new[]/delete[]. + +

    2.6.5  Passing system call parameters with inadequate +read/write permissions

    + +Valgrind checks all parameters to system calls. If a system call +needs to read from a buffer provided by your program, Valgrind checks +that the entire buffer is addressable and has valid data, i.e., it is +readable. And if the system call needs to write to a user-supplied +buffer, Valgrind checks that the buffer is addressable. After the +system call, Valgrind updates its administrative information to +precisely reflect any changes in memory permissions caused by the +system call. + +

    Here's an example of a system call with an invalid parameter: +

    +  #include <stdlib.h>
    +  #include <unistd.h>
    +  int main( void )
    +  {
    +    char* arr = malloc(10);
    +    (void) write( 1 /* stdout */, arr, 10 );
    +    return 0;
    +  }
    +
    + +

    You get this complaint ... +

    +  Syscall param write(buf) contains uninitialised or unaddressable byte(s)
    +     at 0x4035E072: __libc_write
    +     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
    +     by 0x80483B1: (within tests/badwrite)
    +     by <bogus frame pointer> ???
    +     Address 0x3807E6D0 is 0 bytes inside a block of size 10 alloc'd
    +     at 0x4004FEE6: malloc (ut_clientmalloc.c:539)
    +     by 0x80484A0: main (tests/badwrite.c:6)
    +     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
    +     by 0x80483B1: (within tests/badwrite)
    +
    + +

    ... because the program has tried to write uninitialised junk from +the malloc'd block to the standard output. + + +

    2.6.6  Warning messages you might see

    + +Most of these only appear if you run in verbose mode (enabled by +-v): + + + + +

    2.7  Writing suppressions files

    + +A suppression file describes a bunch of errors which, for one reason +or another, you don't want Valgrind to tell you about. Usually the +reason is that the system libraries are buggy but unfixable, at least +within the scope of the current debugging session. Multiple +suppression files are allowed. By default, Valgrind uses +$PREFIX/lib/valgrind/default.supp. + +

    +You can ask to add suppressions from another file, by specifying +--suppressions=/path/to/file.supp. + +

    Each suppression has the following components:
    +

    + +

    +Locations may be either names of shared objects/executables or +function names; they begin with obj: and fun: +respectively. Function and object names to match against may use the +wildcard characters * and ?. + +A suppression only suppresses an error when the error matches all the +details in the suppression. Here's an example: +

    +  {
    +    __gconv_transform_ascii_internal/__mbrtowc/mbtowc
    +    Value4
    +    fun:__gconv_transform_ascii_internal
    +    fun:__mbr*toc
    +    fun:mbtowc
    +  }
    +
    + +

    What it means is: suppress a use-of-uninitialised-value error, when +the data size is 4, when it occurs in the function +__gconv_transform_ascii_internal, when that is called +from any function of name matching __mbr*toc, +when that is called from +mbtowc. It doesn't apply under any other circumstances. +The string by which this suppression is identified to the user is +__gconv_transform_ascii_internal/__mbrtowc/mbtowc. + +

    Another example: +

    +  {
    +    libX11.so.6.2/libX11.so.6.2/libXaw.so.7.0
    +    Value4
    +    obj:/usr/X11R6/lib/libX11.so.6.2
    +    obj:/usr/X11R6/lib/libX11.so.6.2
    +    obj:/usr/X11R6/lib/libXaw.so.7.0
    +  }
    +
    + +

    Suppress any size 4 uninitialised-value error which occurs anywhere +in libX11.so.6.2, when called from anywhere in the same +library, when called from anywhere in libXaw.so.7.0. The +inexact specification of locations is regrettable, but is about all +you can hope for, given that the X11 libraries shipped with Red Hat +7.2 have had their symbol tables removed. + +

    Note -- since the above two examples did not make it clear -- that +you can freely mix the obj: and fun: +styles of description within a single suppression record. + + + + + + +

    3  Details of the checking machinery

    + +Read this section if you want to know, in detail, exactly what and how +Valgrind is checking. + + +

    3.1  Valid-value (V) bits

    + +It is simplest to think of Valgrind implementing a synthetic Intel x86 +CPU which is identical to a real CPU, except for one crucial detail. +Every bit (literally) of data processed, stored and handled by the +real CPU has, in the synthetic CPU, an associated "valid-value" bit, +which says whether or not the accompanying bit has a legitimate value. +In the discussions which follow, this bit is referred to as the V +(valid-value) bit. + +

    Each byte in the system therefore has 8 V bits which follow +it wherever it goes. For example, when the CPU loads a word-size item +(4 bytes) from memory, it also loads the corresponding 32 V bits from +a bitmap which stores the V bits for the process' entire address +space. If the CPU should later write the whole or some part of that +value to memory at a different address, the relevant V bits will be +stored back in the V-bit bitmap. + +

    In short, each bit in the system has an associated V bit, which +follows it around everywhere, even inside the CPU. Yes, the CPU's +(integer and %eflags) registers have their own V bit +vectors. + +

    Copying values around does not cause Valgrind to check for, or +report on, errors. However, when a value is used in a way which might +conceivably affect the outcome of your program's computation, the +associated V bits are immediately checked. If any of these indicate +that the value is undefined, an error is reported. + +

    Here's an (admittedly nonsensical) example: +

    +  int i, j;
    +  int a[10], b[10];
    +  for (i = 0; i < 10; i++) {
    +    j = a[i];
    +    b[i] = j;
    +  }
    +
    + +

    Valgrind emits no complaints about this, since it merely copies +uninitialised values from a[] into b[], and +doesn't use them in any way. However, if the loop is changed to +

    +  for (i = 0; i < 10; i++) {
    +    j += a[i];
    +  }
    +  if (j == 77) 
    +     printf("hello there\n");
    +
    +then Valgrind will complain, at the if, that the +condition depends on uninitialised values. + +

    Most low level operations, such as adds, cause Valgrind to +use the V bits for the operands to calculate the V bits for the +result. Even if the result is partially or wholly undefined, +it does not complain. + +

    Checks on definedness only occur in two places: when a value is +used to generate a memory address, and where a control flow decision +needs to be made. Also, when a system call is detected, Valgrind +checks the definedness of parameters as required. + +

    If a check should detect undefinedness, an error message is +issued. The resulting value is subsequently regarded as well-defined. +To do otherwise would give long chains of error messages. In effect, +we say that undefined values are non-infectious. + +

    This sounds overcomplicated. Why not just check all reads from +memory, and complain if an undefined value is loaded into a CPU register? +Well, that doesn't work well, because perfectly legitimate C programs routinely +copy uninitialised values around in memory, and we don't want endless complaints +about that. Here's the canonical example. Consider a struct +like this: +

    +  struct S { int x; char c; };
    +  struct S s1, s2;
    +  s1.x = 42;
    +  s1.c = 'z';
    +  s2 = s1;
    +
    + +

    The question to ask is: how large is struct S, in +bytes? An int is 4 bytes and a char one byte, so perhaps a struct S +occupies 5 bytes? Wrong. All (non-toy) compilers I know of will +round the size of struct S up to a whole number of words, +in this case 8 bytes. Not doing this forces compilers to generate +truly appalling code for subscripting arrays of struct +S's. + +

    So s1 occupies 8 bytes, yet only 5 of them will be initialised. +For the assignment s2 = s1, gcc generates code to copy +all 8 bytes wholesale into s2 without regard for their +meaning. If Valgrind simply checked values as they came out of +memory, it would yelp every time a structure assignment like this +happened. So the more complicated semantics described above is +necessary. This allows gcc to copy s1 into +s2 any way it likes, and a warning will only be emitted +if the uninitialised values are later used. + +

    One final twist to this story. The above scheme allows garbage to +pass through the CPU's integer registers without complaint. It does +this by giving the integer registers V tags, passing these around in +the expected way. This is complicated and computationally expensive to +do, but is necessary. Valgrind is more simplistic about +floating-point loads and stores. In particular, V bits for data read +as a result of floating-point loads are checked at the load +instruction. So if your program uses the floating-point registers to +do memory-to-memory copies, you will get complaints about +uninitialised values. Fortunately, I have not yet encountered a +program which (ab)uses the floating-point registers in this way. + +

    3.2  Valid-address (A) bits

    + +Notice that the previous section describes how the validity of values +is established and maintained without having to say whether the +program does or does not have the right to access any particular +memory location. We now consider the latter issue. + +

    As described above, every bit in memory or in the CPU has an +associated valid-value (V) bit. In addition, all bytes in memory, but +not in the CPU, have an associated valid-address (A) bit. This +indicates whether or not the program can legitimately read or write +that location. It does not give any indication of the validity of the +data at that location -- that's the job of the V bits -- only whether +or not the location may be accessed. + +

    Every time your program reads or writes memory, Valgrind checks the +A bits associated with the address. If any of them indicate an +invalid address, an error is emitted. Note that the reads and writes +themselves do not change the A bits, only consult them. + +

    So how do the A bits get set/cleared? Like this: + +

    + + + +

    3.3  Putting it all together

    +Valgrind's checking machinery can be summarised as follows: + + + +Valgrind intercepts calls to malloc, calloc, realloc, valloc, +memalign, free, new and delete. The behaviour you get is: + + + + + + + +

    3.5  Memory leak detection

    + +Valgrind keeps track of all memory blocks issued in response to calls +to malloc/calloc/realloc/new. So when the program exits, it knows +which blocks are still outstanding -- have not been returned, in other +words. Ideally, you want your program to have no blocks still in use +at exit. But many programs do. + +

    For each such block, Valgrind scans the entire address space of the +process, looking for pointers to the block. One of three situations +may result: + +

    + +Valgrind reports summaries about leaked and dubious blocks. +For each such block, it will also tell you where the block was +allocated. This should help you figure out why the pointer to it has +been lost. In general, you should attempt to ensure your programs do +not have any leaked or dubious blocks at exit. + +

    The precise area of memory in which Valgrind searches for pointers +is: all naturally-aligned 4-byte words for which all A bits indicate +addressability and all V bits indicate that the stored value is +actually valid. + +


    diff --git a/memcheck/mc_techdocs.html b/memcheck/mc_techdocs.html new file mode 100644 index 000000000..017763412 --- /dev/null +++ b/memcheck/mc_techdocs.html @@ -0,0 +1,2113 @@ + + + + The design and implementation of Valgrind + + + + +  +

    The design and implementation of Valgrind

    + +
    +Detailed technical notes for hackers, maintainers and the +overly-curious
    +These notes pertain to snapshot 20020306
    +

    +jseward@acm.org
    +
    http://developer.kde.org/~sewardj
    +Copyright © 2000-2002 Julian Seward +

    +Valgrind is licensed under the GNU General Public License, +version 2
    +An open-source tool for finding memory-management problems in +x86 GNU/Linux executables. +

    + +

    + + + + +


    + +

    Introduction

    + +This document contains a detailed, highly-technical description of the +internals of Valgrind. This is not the user manual; if you are an +end-user of Valgrind, you do not want to read this. Conversely, if +you really are a hacker-type and want to know how it works, I assume +that you have read the user manual thoroughly. +

    +You may need to read this document several times, and carefully. Some +important things, I only say once. + + +

    History

    + +Valgrind came into public view in late Feb 2002. However, it has been +under contemplation for a very long time, perhaps seriously for about +five years. Somewhat over two years ago, I started working on the x86 +code generator for the Glasgow Haskell Compiler +(http://www.haskell.org/ghc), gaining familiarity with x86 internals +on the way. I then did Cacheprof (http://www.cacheprof.org), gaining +further x86 experience. Some time around Feb 2000 I started +experimenting with a user-space x86 interpreter for x86-Linux. This +worked, but it was clear that a JIT-based scheme would be necessary to +give reasonable performance for Valgrind. Design work for the JITter +started in earnest in Oct 2000, and by early 2001 I had an x86-to-x86 +dynamic translator which could run quite large programs. This +translator was in a sense pointless, since it did not do any +instrumentation or checking. + +

    +Most of the rest of 2001 was taken up designing and implementing the +instrumentation scheme. The main difficulty, which consumed a lot +of effort, was to design a scheme which did not generate large numbers +of false uninitialised-value warnings. By late 2001 a satisfactory +scheme had been arrived at, and I started to test it on ever-larger +programs, with an eventual eye to making it work well enough so that +it was helpful to folks debugging the upcoming version 3 of KDE. I've +used KDE since before version 1.0, and wanted Valgrind to be an +indirect contribution to the KDE 3 development effort. At the start of +Feb 02 the kde-core-devel crew started using it, and gave a huge +amount of helpful feedback and patches in the space of three weeks. +Snapshot 20020306 is the result. + +

    +In the best Unix tradition, or perhaps in the spirit of Fred Brooks' +depressing-but-completely-accurate epitaph "build one to throw away; +you will anyway", much of Valgrind is a second or third rendition of +the initial idea. The instrumentation machinery +(vg_translate.c, vg_memory.c) and core CPU +simulation (vg_to_ucode.c, vg_from_ucode.c) +have had three redesigns and rewrites; the register allocator, +low-level memory manager (vg_malloc2.c) and symbol table +reader (vg_symtab2.c) are on the second rewrite. In a +sense, this document serves to record some of the knowledge gained as +a result. + + +

    Design overview

    + +Valgrind is compiled into a Linux shared object, +valgrind.so, and also a dummy one, +valgrinq.so, of which more later. The +valgrind shell script adds valgrind.so to +the LD_PRELOAD list of extra libraries to be +loaded with any dynamically linked library. This is a standard trick, +one which I assume the LD_PRELOAD mechanism was developed +to support. + +

    +valgrind.so +is linked with the -z initfirst flag, which requests that +its initialisation code is run before that of any other object in the +executable image. When this happens, valgrind gains control. The +real CPU becomes "trapped" in valgrind.so and the +translations it generates. The synthetic CPU provided by Valgrind +does, however, return from this initialisation function. So the +normal startup actions, orchestrated by the dynamic linker +ld.so, continue as usual, except on the synthetic CPU, +not the real one. Eventually main is run and returns, +and then the finalisation code of the shared objects is run, +presumably in inverse order to which they were initialised. Remember, +this is still all happening on the simulated CPU. Eventually +valgrind.so's own finalisation code is called. It spots +this event, shuts down the simulated CPU, prints any error summaries +and/or does leak detection, and returns from the initialisation code +on the real CPU. At this point, in effect the real and synthetic CPUs +have merged back into one, Valgrind has lost control of the program, +and the program finally exit()s back to the kernel in the +usual way. + +

    +The normal course of activity, once Valgrind has started up, is as +follows. Valgrind never runs any part of your program (usually +referred to as the "client"), not a single byte of it, directly. +Instead it uses function VG_(translate) to translate +basic blocks (BBs, straight-line sequences of code) into instrumented +translations, and those are run instead. The translations are stored +in the translation cache (TC), vg_tc, with the +translation table (TT), vg_tt supplying the +original-to-translation code address mapping. Auxiliary array +VG_(tt_fast) is used as a direct-map cache for fast +lookups in TT; it usually achieves a hit rate of around 98% and +facilitates an orig-to-trans lookup in 4 x86 insns, which is not bad. + +

    +Function VG_(dispatch) in vg_dispatch.S is +the heart of the JIT dispatcher. Once a translated code address has +been found, it is executed simply by an x86 call +to the translation. At the end of the translation, the next +original code addr is loaded into %eax, and the +translation then does a ret, taking it back to the +dispatch loop, with, interestingly, zero branch mispredictions. +The address requested in %eax is looked up first in +VG_(tt_fast), and, if not found, by calling C helper +VG_(search_transtab). If there is still no translation +available, VG_(dispatch) exits back to the top-level +C dispatcher VG_(toploop), which arranges for +VG_(translate) to make a new translation. All fairly +unsurprising, really. There are various complexities described below. + +

    +The translator, orchestrated by VG_(translate), is +complicated but entirely self-contained. It is described in great +detail in subsequent sections. Translations are stored in TC, with TT +tracking administrative information. The translations are subject to +an approximate LRU-based management scheme. With the current +settings, the TC can hold at most about 15MB of translations, and LRU +passes prune it to about 13.5MB. Given that the +orig-to-translation expansion ratio is about 13:1 to 14:1, this means +TC holds translations for more or less a megabyte of original code, +which generally comes to about 70000 basic blocks for C++ compiled +with optimisation on. Generating new translations is expensive, so it +is worth having a large TC to minimise the (capacity) miss rate. + +

    +The dispatcher, VG_(dispatch), receives hints from +the translations which allow it to cheaply spot all control +transfers corresponding to x86 call and ret +instructions. It has to do this in order to spot some special events: +

    +Valgrind intercepts the client's malloc, +free, etc, +calls, so that it can store additional information. Each block +malloc'd by the client gives rise to a shadow block +in which Valgrind stores the call stack at the time of the +malloc +call. When the client calls free, Valgrind tries to +find the shadow block corresponding to the address passed to +free, and emits an error message if none can be found. +If it is found, the block is placed on the freed blocks queue +vg_freed_list, it is marked as inaccessible, and +its shadow block now records the call stack at the time of the +free call. Keeping free'd blocks in +this queue allows Valgrind to spot all (presumably invalid) accesses +to them. However, once the volume of blocks in the free queue +exceeds VG_(clo_freelist_vol), blocks are finally +removed from the queue. + +

    +Keeping track of A and V bits (note: if you don't know what these are, +you haven't read the user guide carefully enough) for memory is done +in vg_memory.c. This implements a sparse array structure +which covers the entire 4G address space in a way which is reasonably +fast and reasonably space efficient. The 4G address space is divided +up into 64K sections, each covering 64Kb of address space. Given a +32-bit address, the top 16 bits are used to select one of the 65536 +entries in VG_(primary_map). The resulting "secondary" +(SecMap) holds A and V bits for the 64k of address space +chunk corresponding to the lower 16 bits of the address. + + +

    Design decisions

    + +Some design decisions were motivated by the need to make Valgrind +debuggable. Imagine you are writing a CPU simulator. It works fairly +well. However, you run some large program, like Netscape, and after +tens of millions of instructions, it crashes. How can you figure out +where in your simulator the bug is? + +

    +Valgrind's answer is: cheat. Valgrind is designed so that it is +possible to switch back to running the client program on the real +CPU at any point. Using the --stop-after= flag, you can +ask Valgrind to run just some number of basic blocks, and then +run the rest of the way on the real CPU. If you are searching for +a bug in the simulated CPU, you can use this to do a binary search, +which quickly leads you to the specific basic block which is +causing the problem. + +

    +This is all very handy. It does constrain the design in certain +unimportant ways. Firstly, the layout of memory, when viewed from the +client's point of view, must be identical regardless of whether it is +running on the real or simulated CPU. This means that Valgrind can't +do pointer swizzling -- well, no great loss -- and it can't run on +the same stack as the client -- again, no great loss. +Valgrind operates on its own stack, VG_(stack), which +it switches to at startup, temporarily switching back to the client's +stack when doing system calls for the client. + +

    +Valgrind also receives signals on its own stack, +VG_(sigstack), but for different gruesome reasons +discussed below. + +

    +This nice clean switch-back-to-the-real-CPU-whenever-you-like story +is muddied by signals. Problem is that signals arrive at arbitrary +times and tend to slightly perturb the basic block count, with the +result that you can get close to the basic block causing a problem but +can't home in on it exactly. My kludgey hack is to define +SIGNAL_SIMULATION to 1 towards the bottom of +vg_syscall_mem.c, so that signal handlers are run on the +real CPU and don't change the BB counts. + +

    +A second hole in the switch-back-to-real-CPU story is that Valgrind's +way of delivering signals to the client is different from that of the +kernel. Specifically, the layout of the signal delivery frame, and +the mechanism used to detect a sighandler returning, are different. +So you can't expect to make the transition inside a sighandler and +still have things working, but in practice that's not much of a +restriction. + +

    +Valgrind's implementation of malloc, free, +etc, (in vg_clientmalloc.c, not the low-level stuff in +vg_malloc2.c) is somewhat complicated by the need to +handle switching back at arbitrary points. It does work, though. + + + +

    Correctness

    + +There's only one of me, and I have a Real Life (tm) as well as hacking +Valgrind [allegedly :-]. That means I don't have time to waste +chasing endless bugs in Valgrind. My emphasis is therefore on doing +everything as simply as possible, with correctness, stability and +robustness being the number one priority, more important than +performance or functionality. As a result: + + +

    +Some more specific things are: + +

    + +

    Current limitations

    + +No threads. I think fixing this is close to a research-grade problem. +

    +No MMX. Fixing this should be relatively easy, using the same giant +trick used for x86 FPU instructions. See below. +

    +Support for weird (non-POSIX) signal stuff is patchy. Does anybody +care? +

    + + + + +


    + +

    The instrumenting JITter

    + +This really is the heart of the matter. We begin with various side +issues. + +

    Run-time storage, and the use of host registers

    + +Valgrind translates client (original) basic blocks into instrumented +basic blocks, which live in the translation cache TC, until either the +client finishes or the translations are ejected from TC to make room +for newer ones. +

    +Since it generates x86 code in memory, Valgrind has complete control +of the use of registers in the translations. Now pay attention. I +shall say this only once, and it is important you understand this. In +what follows I will refer to registers in the host (real) cpu using +their standard names, %eax, %edi, etc. I +refer to registers in the simulated CPU by capitalising them: +%EAX, %EDI, etc. These two sets of +registers usually bear no direct relationship to each other; there is +no fixed mapping between them. This naming scheme is used fairly +consistently in the comments in the sources. +

    +Host registers, once things are up and running, are used as follows: +

    + +

    +The state of the simulated CPU is stored in memory, in +VG_(baseBlock), which is a block of 200 words IIRC. +Recall that %ebp points permanently at the start of this +block. Function vg_init_baseBlock decides what the +offsets of various entities in VG_(baseBlock) are to be, +and allocates word offsets for them. The code generator then emits +%ebp relative addresses to get at those things. The +sequence in which entities are allocated has been carefully chosen so +that the 32 most popular entities come first, because this means 8-bit +offsets can be used in the generated code. + +
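The 8-bit-offset point can be made concrete with a tiny sketch (Python for brevity; `disp_bytes` is an invented name, not a Valgrind function). An x86 effective address of the form disp(%ebp) can encode the displacement either as a signed 8-bit disp8 or as a full 32-bit disp32, so the first 32 word slots of VG_(baseBlock) are reachable with a one-byte displacement:

```python
# Why the first 32 words of VG_(baseBlock) get compact addressing:
# word offsets 0..31 give byte offsets 0..124, which fit in a signed
# 8-bit displacement; word offset 32 is byte offset 128, which needs
# the full 32-bit form.

def disp_bytes(word_offset):
    """Bytes of displacement needed to address word_offset(%ebp)."""
    byte_offset = word_offset * 4
    return 1 if -128 <= byte_offset <= 127 else 4
```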

    +If I was clever, I could make %ebp point 32 words along +VG_(baseBlock), so that I'd have another 32 words of +short-form offsets available, but that's just complicated, and it's +not important -- the first 32 words take 99% (or whatever) of the +traffic. + +

    +Currently, the sequence of stuff in VG_(baseBlock) is as +follows: +

    + +

    +As a general rule, the simulated machine's state lives permanently in +memory at VG_(baseBlock). However, the JITter does some +optimisations which allow the simulated integer registers to be +cached in real registers over multiple simulated instructions within +the same basic block. These are always flushed back into memory at +the end of every basic block, so that the in-memory state is +up-to-date between basic blocks. (This flushing is implied by the +statement above that the real machine's allocatable registers are +dead in between simulated blocks). + + +

    Startup, shutdown, and system calls

    + +Getting into Valgrind (VG_(startup), called from +valgrind.so's initialisation section), really means +copying the real CPU's state into VG_(baseBlock), and +then installing our own stack pointer, etc, into the real CPU, and +then starting up the JITter. Exiting valgrind involves copying the +simulated state back to the real state. + +

    +Unfortunately, there's a complication at startup time. Problem is +that at the point where we need to take a snapshot of the real CPU's +state, the offsets in VG_(baseBlock) are not set up yet, +because to do so would involve disrupting the real machine's state +significantly. The way round this is to dump the real machine's state +into a temporary, static block of memory, +VG_(m_state_static). We can then set up the +VG_(baseBlock) offsets at our leisure, and copy into it +from VG_(m_state_static) at some convenient later time. +This copying is done by +VG_(copy_m_state_static_to_baseBlock). + +

    +On exit, the inverse transformation is (rather unnecessarily) used: +stuff in VG_(baseBlock) is copied to +VG_(m_state_static), and the assembly stub then copies +from VG_(m_state_static) into the real machine registers. + +

    +Doing system calls on behalf of the client (vg_syscall.S) +is something of a half-way house. We have to make the world look +sufficiently like what the client would normally see, so that +the syscall actually works properly, but we can't afford to lose +control. So the trick is to copy all of the client's state, except +its program counter, into the real CPU, do the system call, and +copy the state back out. Note that the client's state includes its +stack pointer register, so one effect of this partial restoration is +to cause the system call to be run on the client's stack, as it should +be. + +

    +As ever there are complications. We have to save some of our own state +somewhere when restoring the client's state into the CPU, so that we +can keep going sensibly afterwards. In fact the only thing which is +important is our own stack pointer, but for paranoia reasons I save +and restore our own FPU state as well, even though that's probably +pointless. + +

    +The complication on the above complication is, that for horrible +reasons to do with signals, we may have to handle a second client +system call whilst the client is blocked inside some other system +call (unbelievable!). That means there's two sets of places to +dump Valgrind's stack pointer and FPU state across the syscall, +and we decide which to use by consulting +VG_(syscall_depth), which is in turn maintained by +VG_(wrap_syscall). + + + +

    Introduction to UCode

    + +UCode lies at the heart of the x86-to-x86 JITter. The basic premise +is that dealing with the x86 instruction set head-on is just too darn +complicated, so we do the traditional compiler-writer's trick and +translate it into a simpler, easier-to-deal-with form. + +

    +In normal operation, translation proceeds through six stages, +coordinated by VG_(translate): +

      +
    1. Parsing of an x86 basic block into a sequence of UCode + instructions (VG_(disBB)). +

      +

    2. UCode optimisation (vg_improve), with the aim of + caching simulated registers in real registers over multiple + simulated instructions, and removing redundant simulated + %EFLAGS saving/restoring. +

      +

    3. UCode instrumentation (vg_instrument), which adds + value and address checking code. +

      +

    4. Post-instrumentation cleanup (vg_cleanup), removing + redundant value-check computations. +

      +

    5. Register allocation (vg_do_register_allocation), + which, note, is done on UCode. +

      +

    6. Emission of final instrumented x86 code + (VG_(emit_code)). +
    + +

    +Notice how steps 2, 3, 4 and 5 are simple UCode-to-UCode +transformation passes, all on straight-line blocks of UCode (type +UCodeBlock). Steps 2 and 4 are optimisation passes and +can be disabled for debugging purposes, with +--optimise=no and --cleanup=no respectively. + +

    +Valgrind can also run in a no-instrumentation mode, given +--instrument=no. This is useful for debugging the JITter +quickly without having to deal with the complexity of the +instrumentation mechanism too. In this mode, steps 3 and 4 are +omitted. + +

    +These flags combine, so that --instrument=no together with +--optimise=no means only steps 1, 5 and 6 are used. +--single-step=yes causes each x86 instruction to be +treated as a single basic block. The translations are terrible but +this is sometimes instructive. + +
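The way the flags compose can be summarised in a few lines. This is only an illustrative sketch (the function name is invented, not part of Valgrind) of which of the six stages run under each flag combination, as described in the text:

```python
# Which of the six translation stages run, given the flags.  Per the
# text: --optimise=no drops stage 2, --instrument=no drops stages 3
# and 4, and --cleanup=no drops stage 4 only.

def stages_used(optimise=True, instrument=True, cleanup=True):
    stages = [1]                 # stage 1 (disassembly) always runs
    if optimise:
        stages.append(2)         # UCode optimisation
    if instrument:
        stages.append(3)         # instrumentation
        if cleanup:
            stages.append(4)     # post-instrumentation cleanup
    stages += [5, 6]             # register allocation, code emission
    return stages
```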

    +The --stop-after=N flag switches back to the real CPU +after N basic blocks. It also re-JITs the final basic +block executed and prints the resulting debugging info, so this +gives you a way to get a quick snapshot of how a basic block looks as +it passes through the six stages mentioned above. If you want to +see full information for every block translated (probably not, but +still ...) find, in VG_(translate), the lines +
    dis = True; +
    dis = debugging_translation; +
    +and comment out the second line. This will spew out debugging +junk faster than you can possibly imagine. + + + +

    UCode operand tags: type Tag

    + +UCode is, more or less, a simple two-address RISC-like code. In +keeping with the x86 AT&T assembly syntax, generally speaking the +first operand is the source operand, and the second is the destination +operand, which is modified when the uinstr is notionally executed. + +

    +UCode instructions have up to three operand fields, each of which has +a corresponding Tag describing it. Possible values for +the tag are: + +

    + + +

    UCode instructions: type UInstr

    + +

    +UCode was carefully designed to make it possible to do register +allocation on UCode and then translate the result into x86 code +without needing any extra registers ... well, that was the original +plan, anyway. Things have gotten a little more complicated since +then. In what follows, UCode instructions are referred to as uinstrs, +to distinguish them from x86 instructions. Uinstrs of course have +uopcodes which are (naturally) different from x86 opcodes. + +

    +A uinstr (type UInstr) contains +various fields, not all of which are used by any one uopcode: +

    + +

    +UOpcodes (type Opcode) are divided into two groups: those +necessary merely to express the functionality of the x86 code, and +extra uopcodes needed to express the instrumentation. The former +group contains: +

    + +

    +Stages 1 and 2 of the 6-stage translation process mentioned above +deal purely with these uopcodes, and no others. They are +sufficient to express pretty much all the x86 32-bit protected-mode +instruction set, at +least everything understood by a pre-MMX original Pentium (P54C). + +

    +Stages 3, 4, 5 and 6 also deal with the following extra +"instrumentation" uopcodes. They are used to express all the +definedness-tracking and -checking machinery which valgrind does. In +later sections we show how to create checking code for each of the +uopcodes above. Note that these instrumentation uopcodes, although +some appear complicated, have been carefully chosen so that +efficient x86 code can be generated for them. GNU superopt v2.5 did a +great job helping out here. Anyways, the uopcodes are as follows: + +

    + +

    +These 10 uopcodes are sufficient to express Valgrind's entire +definedness-checking semantics. In fact most of the interesting magic +is done by the TAG1 and TAG2 +suboperations. + +

    +First, however, I need to explain about V-vector operation sizes. +There are 4 sizes: 1, 2 and 4, which operate on groups of 8, 16 and 32 +V bits at a time, supporting the usual 1, 2 and 4 byte x86 operations. +However there is also the mysterious size 0, which really means a +single V bit. Single V bits are used in various circumstances; in +particular, the definedness of %EFLAGS is modelled with a +single V bit. Now might be a good time to also point out that for +V bits, 1 means "undefined" and 0 means "defined". Similarly, for A +bits, 1 means "invalid address" and 0 means "valid address". This +seems counterintuitive (and so it is), but testing against zero on +x86s saves instructions compared to testing against all 1s, because +many ALU operations set the Z flag for free, so to speak. + +

    +With that in mind, the tag ops are: + +

    + +

    +That's all the tag ops. If you stare at this long enough, and then +run Valgrind and stare at the pre- and post-instrumented ucode, it +should be fairly obvious how the instrumentation machinery hangs +together. + +
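To help with that staring, here is a small Python model of the core tag-op semantics. This is an illustration, not Valgrind code: the 1-means-undefined encoding is from the text above, while the exact definitions of Left, the pessimising cast, and the AND improvement term are my reading of the scheme and should be checked against vg_translate.c:

```python
# V-bit tag-op sketch.  Convention (from the text): 1 = undefined,
# 0 = defined.  All function names shadow the ucode tag ops.

MASK32 = 0xFFFFFFFF

def uifu(v1, v2):
    # Undefined-if-Undefined: undefined where either input is
    # undefined -- bitwise OR in this encoding.
    return v1 | v2

def difd(v1, v2):
    # Defined-if-Defined: defined where either input is defined --
    # bitwise AND in this encoding.
    return v1 & v2

def improve_and_tq(value, vbits):
    # Improvement term for AND (T = the value, Q = its V bits): where
    # a value bit is a *defined zero*, the AND result bit is a defined
    # zero regardless of the other operand.
    return value | vbits

def left(v):
    # Smear undefinedness towards the MSB (modelling carry
    # propagation in additions): left(x) = x | -x.
    return (v | (-v & MASK32)) & MASK32

def pcast(v, bits=32):
    # Pessimising cast: if anything is undefined, everything is.
    return (1 << bits) - 1 if v != 0 else 0

def and_vbits(val1, v1, val2, v2):
    # V bits for "val1 AND val2", mirroring uinstrs 27..33 of the
    # instrumented example: naive UifU, then DifD in each operand's
    # improvement term.
    naive = uifu(v1, v2)
    naive = difd(improve_and_tq(val1, v1), naive)
    return difd(improve_and_tq(val2, v2), naive)
```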

    +One point, if you do this: in order to make it easy to differentiate +TempRegs carrying values from TempRegs +carrying V bit vectors, Valgrind prints the former as (for example) +t28 and the latter as q28; the fact that +they carry the same number serves to indicate their relationship. +This is purely for the convenience of the human reader; the register +allocator and code generator don't regard them as different. + + +

    Translation into UCode

    + +VG_(disBB) allocates a new UCodeBlock and +then uses disInstr to translate x86 instructions one at a +time into UCode, dumping the result in the UCodeBlock. +This goes on until a control-flow transfer instruction is encountered. + +

    +Despite the large size of vg_to_ucode.c, this translation +is really very simple. Each x86 instruction is translated entirely +independently of its neighbours, merrily allocating new +TempRegs as it goes. The idea is to have a simple +translator -- in reality, no more than a macro-expander -- and the +resulting bad UCode translation is cleaned up by the UCode +optimisation phase which follows. To give you an idea, here are some x86 +instructions and their translations (this is a complete basic block, +as Valgrind sees it): +

    +        0x40435A50:  incl %edx
    +
    +           0: GETL      %EDX, t0
    +           1: INCL      t0  (-wOSZAP)
    +           2: PUTL      t0, %EDX
    +
    +        0x40435A51:  movsbl (%edx),%eax
    +
    +           3: GETL      %EDX, t2
    +           4: LDB       (t2), t2
    +           5: WIDENL_Bs t2
    +           6: PUTL      t2, %EAX
    +
    +        0x40435A54:  testb $0x20, 1(%ecx,%eax,2)
    +
    +           7: GETL      %EAX, t6
    +           8: GETL      %ECX, t8
    +           9: LEA2L     1(t8,t6,2), t4
    +          10: LDB       (t4), t10
    +          11: MOVB      $0x20, t12
    +          12: ANDB      t12, t10  (-wOSZACP)
    +          13: INCEIPo   $9
    +
    +        0x40435A59:  jnz-8 0x40435A50
    +
    +          14: Jnzo      $0x40435A50  (-rOSZACP)
    +          15: JMPo      $0x40435A5B
    +
    + +

    +Notice how the block always ends with an unconditional jump to the +next block. This is a bit unnecessary, but makes many things simpler. + +

    +Most x86 instructions turn into sequences of GET, +PUT, LEA1, LEA2, +LOAD and STORE. Some complicated ones, +however, rely on calling helper bits of code in +vg_helpers.S. The ucode instructions PUSH, +POP, CALL, CALLM_S and +CALLM_E support this. The calling convention is somewhat +ad-hoc and is not the C calling convention. The helper routines must +save all integer registers, and the flags, that they use. Args are +passed on the stack underneath the return address, as usual, and if +results are to be returned, they are either placed in dummy arg +slots created by the ucode PUSH sequence, or just +overwrite the incoming args. + +

    +In order that the instrumentation mechanism can handle calls to these +helpers, VG_(saneUCodeBlock) enforces the following +restrictions on calls to helpers: + +

    + +Some of the translations may appear to have redundant +TempReg-to-TempReg moves. This helps the +next phase, UCode optimisation, to generate better code. + + + +

    UCode optimisation

    + +UCode is then subjected to an improvement pass +(vg_improve()), which blurs the boundaries between the +translations of the original x86 instructions. It's pretty +straightforward. Three transformations are done: + + + +The effect of these transformations on our short block is rather +unexciting, and shown below. On longer basic blocks they can +dramatically improve code quality. + +
    +at 3: delete GET, rename t2 to t0 in (4 .. 6)
    +at 7: delete GET, rename t6 to t0 in (8 .. 9)
    +at 1: annul flag write OSZAP due to later OSZACP
    +
    +Improved code:
    +           0: GETL      %EDX, t0
    +           1: INCL      t0
    +           2: PUTL      t0, %EDX
    +           4: LDB       (t0), t0
    +           5: WIDENL_Bs t0
    +           6: PUTL      t0, %EAX
    +           8: GETL      %ECX, t8
    +           9: LEA2L     1(t8,t0,2), t4
    +          10: LDB       (t4), t10
    +          11: MOVB      $0x20, t12
    +          12: ANDB      t12, t10  (-wOSZACP)
    +          13: INCEIPo   $9
    +          14: Jnzo      $0x40435A50  (-rOSZACP)
    +          15: JMPo      $0x40435A5B
    +
    + +

    UCode instrumentation

    + +Once you understand the meaning of the instrumentation uinstrs, +discussed in detail above, the instrumentation scheme is fairly +straightforward. Each uinstr is instrumented in isolation, and the +instrumentation uinstrs are placed before the original uinstr. +Our running example continues below. I have placed a blank line +after every original ucode, to make it easier to see which +instrumentation uinstrs correspond to which originals. + +

    +As mentioned somewhere above, TempRegs carrying values +have names like t28, and each one has a shadow carrying +its V bits, with names like q28. This pairing aids in +reading instrumented ucode. + +

    +One decision about all this is where to have "observation points", +that is, where to check that V bits are valid. I use a minimalistic +scheme, only checking where a failure of validity could cause the +original program to (seg)fault. So the use of values as memory +addresses causes a check, as do conditional jumps (these cause a check +on the definedness of the condition codes). And arguments +PUSHed for helper calls are checked, hence the weird +restrictions on helper call preambles described above. + +

    +Another decision is that once a value is tested, it is thereafter +regarded as defined, so that we do not emit multiple undefined-value +errors for the same undefined value. That means that +TESTV uinstrs are always followed by SETV +on the same (shadow) TempRegs. Most of these +SETVs are redundant and are removed by the +post-instrumentation cleanup phase. + +

    +The instrumentation for calling helper functions deserves further +comment. The definedness of results from a helper is modelled using +just one V bit. So, in short, we do pessimising casts of the +definedness of all the args, down to a single bit, and then +UifU these bits together. So this single V bit will say +"undefined" if any part of any arg is undefined. This V bit is then +pessimally cast back up to the result(s) sizes, as needed. If Valgrind +sees that the result of the call is not actually used -- all the args +are got rid of with CLEAR and none with +POP -- it immediately examines the result V bit with a +TESTV -- SETV pair. If it did not do this, +there would be no observation point to detect that some of the +args to the helper were undefined. Of course, if the helper's results +are indeed used, we don't do this, since the result usage will +presumably cause the result definedness to be checked at some suitable +future point. + +
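The cast-down / UifU / cast-up dance for helper calls looks roughly like this (an illustrative Python sketch with invented names, not Valgrind's code):

```python
# Helper-call approximation: pessimally cast each arg's V bits down
# to one bit, UifU (OR) the single bits together, then cast the
# result back up to the result size.

def pcast_down(vbits):
    return 1 if vbits != 0 else 0        # any undefined bit -> undefined

def pcast_up(bit, size_bits):
    return (1 << size_bits) - 1 if bit else 0

def helper_result_vbits(arg_vbits, result_size_bits):
    acc = 0
    for v in arg_vbits:
        acc |= pcast_down(v)             # UifU of the single bits
    return pcast_up(acc, result_size_bits)
```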

    +In general Valgrind tries to track definedness on a bit-for-bit basis, +but as the above para shows, for calls to helpers we throw in the +towel and approximate down to a single bit. This is because it's too +complex and difficult to track bit-level definedness through complex +ops such as integer multiply and divide, and in any case there are no +reasonable code fragments which attempt to (eg) multiply two +partially-defined values and end up with something meaningful, so +there seems little point in modelling multiplies, divides, etc, at +that level of detail. + +

    +Integer loads and stores are instrumented with firstly a test of the +definedness of the address, followed by a LOADV or +STOREV respectively. These turn into calls to +(for example) VG_(helperc_LOADV4). These helpers do two +things: they perform an address-valid check, and they load or store V +bits from/to the relevant address in the (simulated V-bit) memory. + +
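The two jobs of such a helper can be modelled like so. This is a toy: the real helper walks VG_(primary_map) and a secondary map, whereas a dict stands in for both here, and the names are invented:

```python
# Toy model of a LOADV helper: for each byte, check the A bit
# (1 = invalid address) and gather the V bits (1 = undefined) from
# shadow memory.  Unmapped addresses count as invalid and undefined.

A_INVALID, V_UNDEFINED = 1, 0xFF

shadow = {0x1000: (0, 0x00),     # valid, fully defined byte
          0x1001: (0, 0xFF)}     # valid, fully undefined byte

def loadv_toy(addr, nbytes):
    addr_error = False
    vbits = 0
    for i in range(nbytes):
        a, v = shadow.get(addr + i, (A_INVALID, V_UNDEFINED))
        if a == A_INVALID:
            addr_error = True    # address-valid check fails
        vbits |= v << (8 * i)    # accumulate per-byte V bits
    return addr_error, vbits
```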

    +FPU loads and stores are different. As above, the definedness of the +address is first tested. However, the helper routine for FPU loads +(VGM_(fpu_read_check)) emits an error if either the +address is invalid or the referenced area contains undefined values. +It has to do this because we do not simulate the FPU at all, and so +cannot track the definedness of values loaded into it from memory; we +therefore have to check them as soon as they are loaded into the FPU, ie, at +this point. We notionally assume that everything in the FPU is +defined. + +

    +It follows therefore that FPU writes first check the definedness of +the address, then the validity of the address, and finally mark the +written bytes as well-defined. + +

    +If anyone is inspired to extend Valgrind to MMX/SSE insns, I suggest +you use the same trick. It works provided that the FPU/MMX unit is +not used merely as a conduit to copy partially undefined data from +one place in memory to another. Unfortunately the integer CPU is used +like that (when copying C structs with holes, for example) and this is +the cause of much of the elaborateness of the instrumentation +described here. + +

    +vg_instrument() in vg_translate.c actually +does the instrumentation. There are comments explaining how each +uinstr is handled, so we do not repeat that here. As explained +already, it is bit-accurate, except for calls to helper functions. +Unfortunately the x86 insns bt/bts/btc/btr are done by +helper fns, so bit-level accuracy is lost there. This should be fixed +by doing them inline; it will probably require adding a couple of new +uinstrs. Also, left and right rotates through the carry flag (x86 +rcl and rcr) are approximated via a single +V bit; so far this has not caused anyone to complain. The +non-carry rotates, rol and ror, are much +more common and are done exactly. Re-visiting the instrumentation for +AND and OR, they seem rather verbose, and I wonder if it could be done +more concisely now. + +

    +The lowercase o on many of the uopcodes in the running +example indicates that the size field is zero, usually meaning a +single-bit operation. + +

    +Anyroads, the post-instrumented version of our running example looks +like this: + +

    +Instrumented code:
    +           0: GETVL     %EDX, q0
    +           1: GETL      %EDX, t0
    +
    +           2: TAG1o     q0 = Left4 ( q0 )
    +           3: INCL      t0
    +
    +           4: PUTVL     q0, %EDX
    +           5: PUTL      t0, %EDX
    +
    +           6: TESTVL    q0
    +           7: SETVL     q0
    +           8: LOADVB    (t0), q0
    +           9: LDB       (t0), t0
    +
    +          10: TAG1o     q0 = SWiden14 ( q0 )
    +          11: WIDENL_Bs t0
    +
    +          12: PUTVL     q0, %EAX
    +          13: PUTL      t0, %EAX
    +
    +          14: GETVL     %ECX, q8
    +          15: GETL      %ECX, t8
    +
    +          16: MOVL      q0, q4
    +          17: SHLL      $0x1, q4
    +          18: TAG2o     q4 = UifU4 ( q8, q4 )
    +          19: TAG1o     q4 = Left4 ( q4 )
    +          20: LEA2L     1(t8,t0,2), t4
    +
    +          21: TESTVL    q4
    +          22: SETVL     q4
    +          23: LOADVB    (t4), q10
    +          24: LDB       (t4), t10
    +
    +          25: SETVB     q12
    +          26: MOVB      $0x20, t12
    +
    +          27: MOVL      q10, q14
    +          28: TAG2o     q14 = ImproveAND1_TQ ( t10, q14 )
    +          29: TAG2o     q10 = UifU1 ( q12, q10 )
    +          30: TAG2o     q10 = DifD1 ( q14, q10 )
    +          31: MOVL      q12, q14
    +          32: TAG2o     q14 = ImproveAND1_TQ ( t12, q14 )
    +          33: TAG2o     q10 = DifD1 ( q14, q10 )
    +          34: MOVL      q10, q16
    +          35: TAG1o     q16 = PCast10 ( q16 )
    +          36: PUTVFo    q16
    +          37: ANDB      t12, t10  (-wOSZACP)
    +
    +          38: INCEIPo   $9
    +
    +          39: GETVFo    q18
    +          40: TESTVo    q18
    +          41: SETVo     q18
    +          42: Jnzo      $0x40435A50  (-rOSZACP)
    +
    +          43: JMPo      $0x40435A5B
    +
    + + +

    UCode post-instrumentation cleanup

    + +

    +This pass, coordinated by vg_cleanup(), removes redundant +definedness computation created by the simplistic instrumentation +pass. It consists of two passes, +vg_propagate_definedness() followed by +vg_delete_redundant_SETVs. + +

    +vg_propagate_definedness() is a simple +constant-propagation and constant-folding pass. It tries to determine +which TempRegs containing V bits will always indicate +"fully defined", and it propagates this information as far as it can, +and folds out as many operations as possible. For example, the +instrumentation for an ADD of a literal to a variable quantity will be +reduced down so that the definedness of the result is simply the +definedness of the variable quantity, since the literal is by +definition fully defined. + +
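The folding rule for the ADD-of-a-literal case amounts to this (an illustrative Python fragment, not Valgrind's code):

```python
# UifU with an operand known to be fully defined (all V bits 0, as
# the shadow of a literal always is) is the identity, so the
# instrumentation for ADD literal, reg collapses to the variable
# operand's definedness.

FULLY_DEFINED = 0

def fold_uifu(v1, v2):
    if v1 == FULLY_DEFINED:
        return v2                # literal contributes nothing
    if v2 == FULLY_DEFINED:
        return v1
    return v1 | v2               # general UifU (1 = undefined)
```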

    +vg_delete_redundant_SETVs removes SETVs on +shadow TempRegs for which the next action is a write. +I don't think there's anything else worth saying about this; it is +simple. Read the sources for details. + +

    +So the cleaned-up running example looks like this. As above, I have +inserted line breaks after every original (non-instrumentation) uinstr +to aid readability. As with straightforward ucode optimisation, the +results in this block are undramatic because it is so short; longer +blocks benefit more because they have more redundancy which gets +eliminated. + + +

    +at 29: delete UifU1 due to defd arg1
    +at 32: change ImproveAND1_TQ to MOV due to defd arg2
    +at 41: delete SETV
    +at 31: delete MOV
    +at 25: delete SETV
    +at 22: delete SETV
    +at 7: delete SETV
    +
    +           0: GETVL     %EDX, q0
    +           1: GETL      %EDX, t0
    +
    +           2: TAG1o     q0 = Left4 ( q0 )
    +           3: INCL      t0
    +
    +           4: PUTVL     q0, %EDX
    +           5: PUTL      t0, %EDX
    +
    +           6: TESTVL    q0
    +           8: LOADVB    (t0), q0
    +           9: LDB       (t0), t0
    +
    +          10: TAG1o     q0 = SWiden14 ( q0 )
    +          11: WIDENL_Bs t0
    +
    +          12: PUTVL     q0, %EAX
    +          13: PUTL      t0, %EAX
    +
    +          14: GETVL     %ECX, q8
    +          15: GETL      %ECX, t8
    +
    +          16: MOVL      q0, q4
    +          17: SHLL      $0x1, q4
    +          18: TAG2o     q4 = UifU4 ( q8, q4 )
    +          19: TAG1o     q4 = Left4 ( q4 )
    +          20: LEA2L     1(t8,t0,2), t4
    +
    +          21: TESTVL    q4
    +          23: LOADVB    (t4), q10
    +          24: LDB       (t4), t10
    +
    +          26: MOVB      $0x20, t12
    +
    +          27: MOVL      q10, q14
    +          28: TAG2o     q14 = ImproveAND1_TQ ( t10, q14 )
    +          30: TAG2o     q10 = DifD1 ( q14, q10 )
    +          32: MOVL      t12, q14
    +          33: TAG2o     q10 = DifD1 ( q14, q10 )
    +          34: MOVL      q10, q16
    +          35: TAG1o     q16 = PCast10 ( q16 )
    +          36: PUTVFo    q16
    +          37: ANDB      t12, t10  (-wOSZACP)
    +
    +          38: INCEIPo   $9
    +          39: GETVFo    q18
    +          40: TESTVo    q18
    +          42: Jnzo      $0x40435A50  (-rOSZACP)
    +
    +          43: JMPo      $0x40435A5B
    +
    + + +

    Translation from UCode

    + +This is all very simple, even though vg_from_ucode.c +is a big file. Position-independent x86 code is generated into +a dynamically allocated array emitted_code; this is +doubled in size when it overflows. Eventually the array is handed +back to the caller of VG_(translate), who must copy +the result into TC and TT, and free the array. + +

    +This file is structured into four layers of abstraction, which, +thankfully, are glued back together with extensive +__inline__ directives. From the bottom upwards: + +

    + +

    +Some comments: +

    + +

    + +And so ... that's the end of the documentation for the instrumenting +translator! It's really not that complex, because it's composed as a +sequence of simple(ish) self-contained transformations on +straight-line blocks of code. + +

    Top-level dispatch loop

    + +Urk. In VG_(toploop). This is basically boring and +unsurprising, not to mention fiddly and fragile. It needs to be +cleaned up. + +

    +Perhaps the only surprise is that the whole thing runs +on top of a setjmp-installed exception handler because, +if a translation gets a segfault, we have to bail out of the +Valgrind-supplied exception handler VG_(oursignalhandler) +and immediately start running the client's segfault handler, if it has +one. In particular we can't finish the current basic block and then +deliver the signal at some convenient future point, because signals +like SIGILL, SIGSEGV and SIGBUS mean that the faulting insn should not +simply be re-tried. (I'm sure there is a clearer way to explain this). + +

    Exceptions, creating new translations

    +

    Self-modifying code

    + +

    Lazy updates of the simulated program counter

    + +Simulated %EIP is not updated after every simulated x86 +insn as this was regarded as too expensive. Instead ucode +INCEIP insns move it along as and when necessary. +Currently we don't allow it to fall more than 4 bytes behind reality +(see VG_(disBB) for the way this works). +

    +Note that %EIP is always brought up to date by the inner +dispatch loop in VG_(dispatch), so that if the client +takes a fault we know at least which basic block this happened in. + + +

    The translation cache and translation table

    + +

    Signals

    + +Horrible, horrible. vg_signals.c. +Basically, since we have to intercept all system +calls anyway, we can see when the client tries to install a signal +handler. If it does so, we make a note of what the client asked to +happen, and ask the kernel to route the signal to our own signal +handler, VG_(oursignalhandler). This simply notes the +delivery of signals, and returns. + +

    +Every 1000 basic blocks, we see if more signals have arrived. If so, +VG_(deliver_signals) builds signal delivery frames on the +client's stack, and allows their handlers to be run. Valgrind places +in these signal delivery frames a bogus return address, +VG_(signalreturn_bogusRA), and checks all jumps to see +if any jump to it. If one does, this is a sign that a signal handler is +returning, so Valgrind removes the relevant signal frame from +the client's stack, restores from the signal frame the simulated +state as it was before the signal was delivered, and allows the client to run +onwards. We have to do it this way because some signal handlers never +return, they just longjmp(), which nukes the signal +delivery frame. + +

    +The Linux kernel has a different but equally horrible hack for +detecting signal handler returns. Discovering it is left as an +exercise for the reader. + + + +

    Errors, error contexts, error reporting, suppressions

    +

    Client malloc/free

    +

    Low-level memory management

    +

    A and V bitmaps

    +

    Symbol table management

    +

    Dealing with system calls

    +

    Namespace management

    +

    GDB attaching

    +

    Non-dependence on glibc or anything else

    +

    The leak detector

    +

    Performance problems

    +

    Continuous sanity checking

    +

    Tracing, or not tracing, child processes

    +

    Assembly glue for syscalls

    + + +
    + +

    Extensions

    + +Some comments about Stuff To Do. + +

    Bugs

    + +Stephan Kulow and Marc Mutz report problems with kmail in KDE 3 CVS +(RC2 ish) when run on Valgrind. Stephan has it deadlocking; Marc has +it looping at startup. I can't repro either behaviour. Needs +repro-ing and fixing. + + +

    Threads

    + +Doing a good job of thread support strikes me as almost a +research-level problem. The central issues are how to do fast cheap +locking of the VG_(primary_map) structure, whether or not +accesses to the individual secondary maps need locking, what +race-condition issues result, and whether the already-nasty mess that +is the signal simulator needs further hackery. + +

    +I realise that threads are the most-frequently-requested feature, and +I am thinking about it all. If you have guru-level understanding of +fast mutual exclusion mechanisms and race conditions, I would be +interested in hearing from you. + + +

    Verification suite

    + +Directory tests/ contains various ad-hoc tests for +Valgrind. However, there is no systematic verification or regression +suite that, for example, exercises all the stuff in +vg_memory.c to ensure that illegal memory accesses and +undefined value uses are detected as they should be. It would be good +to have such a suite. + +

    Porting to other platforms

    + +It would be great if Valgrind were ported to FreeBSD and x86 NetBSD, +and to x86 OpenBSD, if it's possible (doesn't OpenBSD use a.out-style +executables, not ELF?) + +

    +The main difficulties, for an x86-ELF platform, seem to be: + +

    + +All in all, I think a port to x86-ELF *BSDs is not really very +difficult, and in some ways I would like to see it happen, because +that would force a more clear factoring of Valgrind into platform +dependent and independent pieces. Not to mention, *BSD folks also +deserve to use Valgrind just as much as the Linux crew do. + + +

    +


    + +

    Easy stuff which ought to be done

    + +

    MMX instructions

    + +MMX insns should be supported, using the same trick as for FPU insns. +If the MMX registers are not used to copy uninitialised junk from one +place to another in memory, this means we don't have to actually +simulate the internal MMX unit state, so the FPU hack applies. This +should be fairly easy. + + + +

    Fix stabs-info reader

+ +The machinery in vg_symtab2.c which reads "stabs" style +debugging info is pretty weak. It usually correctly translates +simulated program counter values into line numbers and procedure +names, but the file name is often completely wrong; the logic used to +parse "stabs" entries is flawed and should be fixed. +The simplest solution, IMO, is to copy either the logic or simply the +code out of GNU binutils which does this; since GDB can clearly get it +right, binutils (or GDB?) must have code to do this somewhere. + + + + + +

    BT/BTC/BTS/BTR

+ +These are x86 instructions which test, complement, set, or reset a +single bit in a word. At the moment they are both incorrectly +implemented and incorrectly instrumented. + +

    +The incorrect instrumentation is due to use of helper functions. This +means we lose bit-level definedness tracking, which could wind up +giving spurious uninitialised-value use errors. The Right Thing to do +is to invent a couple of new UOpcodes, I think GET_BIT +and SET_BIT, which can be used to implement all 4 x86 +insns, get rid of the helpers, and give bit-accurate instrumentation +rules for the two new UOpcodes. + +

    +I realised the other day that they are mis-implemented too. The x86 +insns take a bit-index and a register or memory location to access. +For registers the bit index clearly can only be in the range zero to +register-width minus 1, and I assumed the same applied to memory +locations too. But evidently not; for memory locations the index can +be arbitrary, and the processor will index arbitrarily into memory as +a result. This too should be fixed. Sigh. Presumably indexing +outside the immediate word is not actually used by any programs yet +tested on Valgrind, for otherwise they (presumably) would simply not +work at all. If you plan to hack on this, first check the Intel docs +to make sure my understanding is really correct. + + + +
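As a sanity check on that reading of the Intel docs, here is a small C model (an illustration only, not Valgrind code) of BTS with a memory operand. The point is that the bit index selects both a word offset from the base address and a bit within that word, so it can reach arbitrarily far from the addressed location:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative C model of x86 BTS with a memory operand: the bit
   index picks out a word at an arbitrary offset from the base
   address, not merely a bit within the addressed word itself. */
static int bts_mem(uint32_t *base, long bitidx)
{
   uint32_t *word = base + (bitidx >> 5);   /* may be far from *base */
   uint32_t  mask = 1u << (bitidx & 31);
   int old_bit = (*word & mask) != 0;       /* CF receives the old bit */
   *word |= mask;                           /* then the bit is set */
   return old_bit;
}
```

Bit index 40, for instance, lands in the word one past the base, at bit 8.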

    Using PREFETCH instructions

+ +Here's a small but potentially interesting project for performance +junkies. Experiments with Valgrind's code generator and optimiser(s) +suggest that reducing the number of instructions executed in the +translations and mem-check helpers gives disappointingly small +performance improvements. Perhaps this is because performance of +Valgrindified code is limited by cache misses. After all, each read +in the original program now gives rise to at least three reads, one +from the VG_(primary_map), one from the resulting +secondary map, and the original. Not to mention, the instrumented +translations are 13 to 14 times larger than the originals. All in all +one would expect the memory system to be hammered to hell and then +some. + +

    +So here's an idea. An x86 insn involving a read from memory, after +instrumentation, will turn into ucode of the following form: +

    +    ... calculate effective addr, into ta and qa ...
    +    TESTVL qa             -- is the addr defined?
    +    LOADV (ta), qloaded   -- fetch V bits for the addr
    +    LOAD  (ta), tloaded   -- do the original load
    +
    +At the point where the LOADV is done, we know the actual +address (ta) from which the real LOAD will +be done. We also know that the LOADV will take around +20 x86 insns to do. So it seems plausible that doing a prefetch of +ta just before the LOADV might just avoid a +miss at the LOAD point, and that might be a significant +performance win. + +

+Prefetch insns are notoriously temperamental, more often than not +making things worse rather than better, so this would require +considerable fiddling around. It's complicated because Intels and +AMDs have different prefetch insns with different semantics, so that +too needs to be taken into account. As a general rule, even placing +the prefetches immediately before the LOADV insn is too near the +LOAD; the ideal distance is apparently circa 200 CPU +cycles. So it might be worth having another analysis/transformation +pass which pushes prefetches as far back as possible, hopefully +immediately after the effective address becomes available. + +

+Doing too many prefetches is also bad because they soak up bus +bandwidth / cpu resources, so some cleverness in deciding which loads +to prefetch and which not to might be helpful. One can imagine not +prefetching client-stack-relative (%EBP or +%ESP) accesses, since the stack in general tends to show +good locality anyway. + +
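The distance idea can be sketched with GCC's __builtin_prefetch, a portable stand-in for the raw Intel/AMD prefetch insns; the 64-element distance here is an arbitrary tuning knob picked for illustration, not a measured optimum:

```c
#include <stddef.h>

/* Prefetch-ahead sketch: issue the fetch well before the data is
   needed, so the cache line has time to arrive.  PF_DIST is an
   assumed tuning parameter, not a measured value. */
enum { PF_DIST = 64 };

static long sum_with_prefetch(const int *a, size_t n)
{
   long s = 0;
   for (size_t i = 0; i < n; i++) {
      if (i + PF_DIST < n)
         /* rw=0: read; locality=1: low temporal reuse expected */
         __builtin_prefetch(&a[i + PF_DIST], 0, 1);
      s += a[i];
   }
   return s;
}
```

Whether this wins anything is exactly the experimental question: a prefetch is only a hint, and on a short or cache-resident array it is pure overhead.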

    +There's quite a lot of experimentation to do here, but I think it +might make an interesting week's work for someone. + +

    +As of 15-ish March 2002, I've started to experiment with this, using +the AMD prefetch/prefetchw insns. + + + +

    User-defined permission ranges

    + +This is quite a large project -- perhaps a month's hacking for a +capable hacker to do a good job -- but it's potentially very +interesting. The outcome would be that Valgrind could detect a +whole class of bugs which it currently cannot. + +

    +The presentation falls into two pieces. + +

    +Part 1: user-defined address-range permission setting +

+ +Valgrind intercepts the client's malloc, +free, etc calls, watches system calls, and watches the +stack pointer move. This is currently the only way it knows which +addresses are valid and which are not. Sometimes the client program +knows extra information about its memory areas. For example, the +client could at some point know that all elements of an array are +out-of-date. We would like to be able to convey to Valgrind this +information that the array is now addressable-but-uninitialised, so +that Valgrind can then warn if elements are used before they get new +values. + +

    +What I would like are some macros like this: +

    +   VALGRIND_MAKE_NOACCESS(addr, len)
    +   VALGRIND_MAKE_WRITABLE(addr, len)
    +   VALGRIND_MAKE_READABLE(addr, len)
    +
+and also, to check that memory is addressable/initialised, +
    +   VALGRIND_CHECK_ADDRESSIBLE(addr, len)
    +   VALGRIND_CHECK_INITIALISED(addr, len)
    +
    + +

    +I then include in my sources a header defining these macros, rebuild +my app, run under Valgrind, and get user-defined checks. + +

    +Now here's a neat trick. It's a nuisance to have to re-link the app +with some new library which implements the above macros. So the idea +is to define the macros so that the resulting executable is still +completely stand-alone, and can be run without Valgrind, in which case +the macros do nothing, but when run on Valgrind, the Right Thing +happens. How to do this? The idea is for these macros to turn into a +piece of inline assembly code, which (1) has no effect when run on the +real CPU, (2) is easily spotted by Valgrind's JITter, and (3) no sane +person would ever write, which is important for avoiding false matches +in (2). So here's a suggestion: +

    +   VALGRIND_MAKE_NOACCESS(addr, len)
    +
    +becomes (roughly speaking) +
    +   movl addr, %eax
    +   movl len,  %ebx
    +   movl $1,   %ecx   -- 1 describes the action; MAKE_WRITABLE might be
    +                     -- 2, etc
    +   rorl $13, %ecx
    +   rorl $19, %ecx
    +   rorl $11, %eax
    +   rorl $21, %eax
    +
+Each pair of rotates sums to 32 bits, so the sequences have no net +effect, and it's unlikely they would appear for any other reason, but +they define a unique byte-sequence +which the JITter can easily spot. Using the operand constraints +section at the end of a gcc inline-assembly statement, we can tell gcc +that the assembly fragment kills %eax, %ebx, +%ecx and the condition codes, so this fragment is harmless +and quick when not running on Valgrind, and does not require any other +library support. + + +
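A rough sketch of how such a macro might be written as gcc extended inline assembly follows. The register choices and the request code are taken from the text above; the double cast is merely an assumption to keep the 32-bit operands compilable on x86-64, and this is not the final Valgrind macro:

```c
/* Sketch only: each rotate pair totals 32 bits, so on a real CPU the
   sequence restores %ecx and %eax and the whole thing is a no-op;
   under Valgrind the JITter would recognise the byte pattern. */
#define VALGRIND_MAKE_NOACCESS(addr, len)                         \
   __asm__ __volatile__(                                          \
      "movl %0, %%eax\n\t"                                        \
      "movl %1, %%ebx\n\t"                                        \
      "movl $1, %%ecx\n\t"   /* 1 == the MAKE_NOACCESS action */  \
      "rorl $13, %%ecx\n\t"                                       \
      "rorl $19, %%ecx\n\t"  /* 13+19 == 32: no net rotation */   \
      "rorl $11, %%eax\n\t"                                       \
      "rorl $21, %%eax\n\t"  /* 11+21 == 32: no net rotation */   \
      : /* no outputs */                                          \
      : "r"((unsigned int)(unsigned long)(addr)),                 \
        "r"((unsigned int)(len))                                  \
      : "eax", "ebx", "ecx", "cc")

/* On the real CPU the request must leave memory untouched. */
static int noaccess_is_noop(void)
{
   unsigned int guard = 0xdeadbeefu;
   VALGRIND_MAKE_NOACCESS(&guard, sizeof guard);
   return guard == 0xdeadbeefu;
}
```

The clobber list is what makes point (1) hold: gcc knows the three registers and the flags are trashed and keeps nothing live in them across the fragment.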

    +Part 2: using it to detect interference between stack variables +

    + +Currently Valgrind cannot detect errors of the following form: +

    +void fooble ( void )
    +{
    +   int a[10];
    +   int b[10];
    +   a[10] = 99;
    +}
    +
    +Now imagine rewriting this as +
    +void fooble ( void )
    +{
    +   int spacer0;
    +   int a[10];
    +   int spacer1;
    +   int b[10];
    +   int spacer2;
    +   VALGRIND_MAKE_NOACCESS(&spacer0, sizeof(int));
    +   VALGRIND_MAKE_NOACCESS(&spacer1, sizeof(int));
    +   VALGRIND_MAKE_NOACCESS(&spacer2, sizeof(int));
    +   a[10] = 99;
    +}
    +
    +Now the invalid write is certain to hit spacer0 or +spacer1, so Valgrind will spot the error. + +

    +There are two complications. + +

+The first is that we don't want to annotate sources by hand, so the +Right Thing to do is to write a C/C++ parser, annotator and +prettyprinter which does this automatically, and run it on post-CPP'd +C/C++ source. +See http://www.cacheprof.org for an example of a system which +transparently inserts another phase into the gcc/g++ compilation +route. The parser/prettyprinter is probably not as hard as it sounds; +I would write it in Haskell, a powerful functional language well +suited to doing symbolic computation, with which I am intimately +familiar. There is already a C parser written in Haskell by someone in +the Haskell community, and that would probably be a good starting +point. + +

    +The second complication is how to get rid of these +NOACCESS records inside Valgrind when the instrumented +function exits; after all, these refer to stack addresses and will +make no sense whatever when some other function happens to re-use the +same stack address range, probably shortly afterwards. I think I +would be inclined to define a special stack-specific macro +

    +   VALGRIND_MAKE_NOACCESS_STACK(addr, len)
    +
    +which causes Valgrind to record the client's %ESP at the +time it is executed. Valgrind will then watch for changes in +%ESP and discard such records as soon as the protected +area is uncovered by an increase in %ESP. I hesitate +with this scheme only because it is potentially expensive, if there +are hundreds of such records, and considering that changes in +%ESP already require expensive messing with stack access +permissions. + +

+This is probably easier and more robust than for the instrumenter +program to try to spot all exit points for the procedure and place +suitable deallocation annotations there. Plus, C++ procedures can +bomb out at any point if they get an exception, so spotting return +points at the source level just won't work at all. + +

    +Although some work, it's all eminently doable, and it would make +Valgrind into an even-more-useful tool. + + +

+ + + diff --git a/none/nl_main.html b/none/nl_main.html new file mode 100644 index 000000000..95f947178 --- /dev/null +++ b/none/nl_main.html @@ -0,0 +1,57 @@ + + + + Nulgrind + + + + + +

    Nulgrind

    +
    This manual was last updated on 2002-10-02
    +

    + +

    +njn25@cam.ac.uk
    +Copyright © 2000-2002 Nicholas Nethercote +

    +Nulgrind is licensed under the GNU General Public License, +version 2
+Nulgrind is a Valgrind skin that does not do very much at all. +

    + +

    + +

    1  Nulgrind

    + +Nulgrind is the minimal skin for Valgrind. It does no initialisation or +finalisation, and adds no instrumentation to the program's code. It is mainly +of use for Valgrind's developers for debugging and regression testing. +

    +Nonetheless you can run programs with Nulgrind. They will run roughly 5-10 +times more slowly than normal, for no useful effect. Note that you need to use +the option --skin=none to run Nulgrind (ie. not +--skin=nulgrind). + +


    + + +