diff --git a/cachegrind/docs/manual.html b/cachegrind/docs/manual.html index 5644872d0..4b6b77391 100644 --- a/cachegrind/docs/manual.html +++ b/cachegrind/docs/manual.html @@ -1929,7 +1929,11 @@ particular, it records: On a modern x86 machine, an L1 miss will typically cost around 10 cycles, and an L2 miss can cost as much as 200 cycles. Detailed cache profiling can be -very useful for improving the performance of your program. +very useful for improving the performance of your program.
+ +Also, since one instruction cache read is performed per instruction executed, +you can find out how many instructions are executed per line, which can be +useful for optimisation and test coverage.
Please note that this is an experimental feature. Any feedback, bug-fixes, suggestions, etc., are welcome. diff --git a/cachegrind/docs/techdocs.html b/cachegrind/docs/techdocs.html index aea95c9bb..5bfda47ee 100644 --- a/cachegrind/docs/techdocs.html +++ b/cachegrind/docs/techdocs.html @@ -2108,5 +2108,415 @@ Valgrind into an even-more-useful tool.
+ + +
In UCode, memory references occur only in the explicit memory instructions LOAD, STORE,
+FPU_R and FPU_W. By contrast, because of the x86
+addressing modes, almost every instruction can read or write memory.
+
+Most of the cache profiling machinery is in the file
+vg_cachesim.c.
+ +These notes are a somewhat haphazard guide to how Valgrind's cache profiling +works.
+ +
There are two kinds of cost centre: one for instructions that don't reference memory (iCC), and one for instructions that do
+(idCC):
+
+
+typedef struct _CC {
+ ULong a;
+ ULong m1;
+ ULong m2;
+} CC;
+
+typedef struct _iCC {
+ /* word 1 */
+ UChar tag;
+ UChar instr_size;
+
+ /* words 2+ */
+ Addr instr_addr;
+ CC I;
+} iCC;
+
+typedef struct _idCC {
+ /* word 1 */
+ UChar tag;
+ UChar instr_size;
+ UChar data_size;
+
+ /* words 2+ */
+ Addr instr_addr;
+ CC I;
+ CC D;
+} idCC;
+
+
+Each CC has three fields a, m1,
+m2 for recording references, level 1 misses and level 2 misses.
+Each of these is a 64-bit ULong -- the numbers can get very large,
+i.e. greater than the roughly 4.3 billion allowed by a 32-bit unsigned int.
+
+An iCC has one CC for instruction cache accesses. A
+idCC has two, one for instruction cache accesses, and one for data
+cache accesses.
+
+The iCC and idCC structs also store unchanging
+information about the instruction:
+
+
+
data size (idCC only)
+
idCC. This is
+because for many memory-referencing instructions the data address can change
+each time it's executed (e.g. if it uses register-offset addressing). We have
+to give this item to the cache simulation in a different way (see
+Instrumentation section below). Some memory-referencing instructions do always
+reference the same address, but we don't try to treat them specially in order to
+keep things simple.
+
+Also note that there is only room for recording info about one data cache
+access in an idCC. So what about instructions that do a read then
+a write, such as:
+
+
incl (%esi)
+
+In a write-allocate cache, as simulated by Valgrind, the write cannot miss,
+since it immediately follows the read which will drag the block into the cache
+if it's not already there. So the write access isn't really interesting, and
+Valgrind doesn't record it. This means that Valgrind doesn't measure
+memory references, but rather memory references that could miss in the cache.
+This behaviour is the same as that used by the AMD Athlon hardware counters.
+It also has the benefit of simplifying the implementation -- instructions that
+read and write memory can be treated like instructions that read memory.+ +
+ +Valgrind does JIT translations at the basic block level, and cost centres are +also set up and stored at the basic block level. By doing things carefully, we +store all the cost centres for a basic block in a contiguous array, and lookup +comes almost for free.
+ +Consider this part of a basic block (for exposition purposes, pretend it's an +entire basic block): + +
+movl $0x0,%eax +movl $0x99, -4(%ebp) ++ +The translation to UCode looks like this: + +
+MOVL $0x0, t20 +PUTL t20, %EAX +INCEIPo $5 + +LEA1L -4(t4), t14 +MOVL $0x99, t18 +STL t18, (t14) +INCEIPo $7 ++ +The first step is to allocate the cost centres. This requires a preliminary +pass to count how many x86 instructions were in the basic block, and their +types (and thus sizes). UCode translations for single x86 instructions are +delimited by the
INCEIPo instruction, the argument of which gives
+the byte size of the instruction (note that lazy INCEIP updating is turned off
+to allow this).
+
+We can tell if an x86 instruction references memory by looking for
+LDL and STL UCode instructions, and thus what kind of
+cost centre is required. From this we can determine how many cost centres we
+need for the basic block, and their sizes. We can then allocate them in a
+single array.
+
+Consider the example code above. After the preliminary pass, we know we need
+two cost centres, one iCC and one idCC. So we
+allocate an array to store these which looks like this:
+
+
+|(uninit)| tag (1 byte) +|(uninit)| instr_size (1 byte) +|(uninit)| (padding) (2 bytes) +|(uninit)| instr_addr (4 bytes) +|(uninit)| I.a (8 bytes) +|(uninit)| I.m1 (8 bytes) +|(uninit)| I.m2 (8 bytes) + +|(uninit)| tag (1 byte) +|(uninit)| instr_size (1 byte) +|(uninit)| data_size (1 byte) +|(uninit)| (padding) (1 byte) +|(uninit)| instr_addr (4 bytes) +|(uninit)| I.a (8 bytes) +|(uninit)| I.m1 (8 bytes) +|(uninit)| I.m2 (8 bytes) +|(uninit)| D.a (8 bytes) +|(uninit)| D.m1 (8 bytes) +|(uninit)| D.m2 (8 bytes) +
 + +(We can see now why we need tags to distinguish between the two types of cost +centres.)
+ +We also record the size of the array. We look up the debug info of the first +instruction in the basic block, and then stick the array into a table indexed +by filename and function name. This makes it easy to dump the information +quickly to file at the end.
+ +
+
+
+|INSTR_CC| tag (1 byte) +|5 | instr_size (1 byte) +|(uninit)| (padding) (2 bytes) +|i_addr1 | instr_addr (4 bytes) +|0 | I.a (8 bytes) +|0 | I.m1 (8 bytes) +|0 | I.m2 (8 bytes) + +|READ_CC | tag (1 byte) +|7 | instr_size (1 byte) +|4 | data_size (1 byte) +|(uninit)| (padding) (1 byte) +|i_addr2 | instr_addr (4 bytes) +|0 | I.a (8 bytes) +|0 | I.m1 (8 bytes) +|0 | I.m2 (8 bytes) +|0 | D.a (8 bytes) +|0 | D.m1 (8 bytes) +|0 | D.m2 (8 bytes) +
 + +(Note that this step is not performed if a basic block is re-translated; see +here for more information.)
+
+GCC inserts padding before the instr_addr field so that it is word
+aligned.
+ +The instrumentation added to call the cache simulation function looks like this +(instrumentation is indented to distinguish it from the original UCode): + +
+MOVL $0x0, t20 +PUTL t20, %EAX + PUSHL %eax + PUSHL %ecx + PUSHL %edx + MOVL $0x4091F8A4, t46 # address of 1st CC + PUSHL t46 + CALLMo $0x12 # first cachesim function + CLEARo $0x4 + POPL %edx + POPL %ecx + POPL %eax +INCEIPo $5 + +LEA1L -4(t4), t14 +MOVL $0x99, t18 + MOVL t14, t42 +STL t18, (t14) + PUSHL %eax + PUSHL %ecx + PUSHL %edx + PUSHL t42 + MOVL $0x4091F8C4, t44 # address of 2nd CC + PUSHL t44 + CALLMo $0x13 # second cachesim function + CLEARo $0x8 + POPL %edx + POPL %ecx + POPL %eax +INCEIPo $7 +
 + +Consider the first instruction's UCode. Each call is surrounded by three
PUSHL and POPL instructions to save and restore the
+caller-save registers. Then the address of the instruction's cost centre is
+pushed onto the stack, to be the first argument to the cache simulation
+function. The address is known at this point because we are doing a
+simultaneous pass through the cost centre array. This means the cost centre
+lookup for each instruction is almost free (just the cost of pushing an
+argument for a function call). Then the call to the cache simulation function
+for non-memory-reference instructions is made (note that the
+CALLMo UInstruction takes an offset into a table of predefined
+functions; it is not an absolute address), and the single argument is
+CLEARed from the stack.
+
+The second instruction's UCode is similar. The only difference is that, as
+mentioned before, we have to pass the address of the data item referenced to
+the cache simulation function too. This explains the MOVL t14,
+t42 and PUSHL t42 UInstructions. (Note that the seemingly
+redundant MOVing will probably be optimised away during register
+allocation.)
+ +Note that instead of storing unchanging information about each instruction +(instruction size, data size, etc) in its cost centre, we could have passed in +these arguments to the simulation function. But this would slow the calls down +(two or three extra arguments pushed onto the stack). Also it would bloat the +UCode instrumentation by amounts similar to the space required for them in the +cost centre; bloated UCode would also fill the translation cache more quickly, +requiring more translations for large programs and slowing them down more.
+ +However, we can't use this approach for profiling -- we can't throw away cost +centres for instructions in the middle of execution! So when a basic block is +translated, we first look for its cost centre array in the hash table. If +there is no cost centre array, it must be the first translation, so we proceed +as described above. But if there is a cost centre array already, it must be a +retranslation. In this case, we skip the cost centre allocation and +initialisation steps, but still do the UCode instrumentation step.
+ +
+
+The interface to the simulation is quite clean. The functions called from the
+UCode contain calls to the simulation functions in the files
+vg_cachesim_{I1,D1,L2}.c; these calls are inlined so that only
+one function call is done per simulated x86 instruction. The file
+vg_cachesim.c simply #includes the three files
+containing the simulation, which makes plugging in new cache simulations
+very easy -- you just replace the three files and recompile.
+ +
+ +The input file has the following format: + +
+file ::= desc_line* cmd_line events_line data_line+ summary_line
+desc_line ::= "desc:" ws? non_nl_string
+cmd_line ::= "cmd:" ws? cmd
+events_line ::= "events:" ws? (event ws)+
+data_line ::= file_line | fn_line | count_line
+file_line ::= ("fl=" | "fi=" | "fe=") filename
+fn_line ::= "fn=" fn_name
+count_line ::= line_num ws? (count ws)+
+summary_line ::= "summary:" ws? (count ws)+
+count ::= num | "."
+
+
+Where:
+
+non_nl_string is any string not containing a newline.+
cmd is a command line invocation.+
filename and fn_name can be anything.+
num and line_num are decimal numbers.+
ws is whitespace.+
nl is a newline.+
+ +Counts can be "." to represent "N/A", e.g. the number of write misses for an +instruction that doesn't write to memory.
+
+The number of counts in each line and the
+summary_line should not exceed the number of events in the
+events_line. If a line has fewer,
+vg_annotate treats the missing counts as though they were "." entries.
+
+A file_line changes the current file name. A fn_line
+changes the current function name. A count_line contains counts
+that pertain to the current filename/fn_name. A "fl=" file_line
+and a fn_line must appear before any count_lines to
+give the context of the first count_lines.
+
+Each file_line should be immediately followed by a
+fn_line. "fi=" file_lines are used to switch
+filenames for inlined functions; "fe=" file_lines are similar, but
+are put at the end of a basic block in which the file name hasn't been switched
+back to the original file name. (fi and fe lines behave the same; they are
+only distinguished to help debugging.)
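To make the grammar concrete, here is a small hand-written input file that conforms to it. The cache description, command, line numbers and counts are invented for illustration (the event names mimic Cachegrind's own), and the summary line here is simply the column totals:

```
desc: I1 cache: 16384 B, 32 B, 4-way associative
cmd: ./concord concord.c
events: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
fl=concord.c
fn=main
9 3 1 1 . . . . . .
10 8 . . 2 1 . 2 . .
fn=init_table
20 4 1 1 1 . . 1 . .
summary: 15 2 2 3 1 . 3 . .
```

Line 10 both reads and writes memory, so it has Dr and Dw counts; line 9 executes only non-memory-referencing instructions, so all its data columns are ".".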
+ + +
+ +
+ +
+ +
+ +
cachegrind.out output files can contain huge amounts of
+ information; the file format was carefully chosen to minimise file
+ sizes.+
+
+In particular, vg_annotate would not need to change -- the file format is such
+that it is not specific to the cache simulation, but could be used for any kind
+of line-by-line information. The only part of vg_annotate that is specific to
+the cache simulation is the name of the input file
+(cachegrind.out), although it would be very simple to add an
+option to control this.
+