diff --git a/cachegrind/docs/manual.html b/cachegrind/docs/manual.html index 5644872d0..4b6b77391 100644 --- a/cachegrind/docs/manual.html +++ b/cachegrind/docs/manual.html @@ -1929,7 +1929,11 @@ particular, it records: On a modern x86 machine, an L1 miss will typically cost around 10 cycles, and an L2 miss can cost as much as 200 cycles. Detailed cache profiling can be -very useful for improving the performance of your program. +very useful for improving the performance of your program.
+ +Also, since one instruction cache read is performed per instruction executed, +you can find out how many instructions are executed per line, which can be +useful for optimisation and test coverage.
Please note that this is an experimental feature. Any feedback, bug-fixes, suggestions, etc., are welcome. diff --git a/cachegrind/docs/techdocs.html b/cachegrind/docs/techdocs.html index aea95c9bb..5bfda47ee 100644 --- a/cachegrind/docs/techdocs.html +++ b/cachegrind/docs/techdocs.html @@ -2108,5 +2108,415 @@ Valgrind into an even-more-useful tool.
+ + +
In UCode, memory references occur only in the explicit memory instructions LOAD, STORE,
+FPU_R and FPU_W. By contrast, because of the x86
+addressing modes, almost every instruction can read or write memory.
+
+Most of the cache profiling machinery is in the file
+vg_cachesim.c.
+ +These notes are a somewhat haphazard guide to how Valgrind's cache profiling +works.
+ +
There are two kinds of cost centre: one for instructions that don't reference memory (iCC), and one for instructions that do
+(idCC):
+
+
+typedef struct _CC {
+ ULong a;
+ ULong m1;
+ ULong m2;
+} CC;
+
+typedef struct _iCC {
+ /* word 1 */
+ UChar tag;
+ UChar instr_size;
+
+ /* words 2+ */
+ Addr instr_addr;
+ CC I;
+} iCC;
+
+typedef struct _idCC {
+ /* word 1 */
+ UChar tag;
+ UChar instr_size;
+ UChar data_size;
+
+ /* words 2+ */
+ Addr instr_addr;
+ CC I;
+ CC D;
+} idCC;
+
+
+Each CC has three fields a, m1,
+m2 for recording references, level 1 misses and level 2 misses.
+Each of these is a 64-bit ULong -- the numbers can get very large,
+i.e. greater than the roughly 4.3 billion allowed by a 32-bit unsigned int.
+
+An iCC has one CC for instruction cache accesses. A
+idCC has two, one for instruction cache accesses, and one for data
+cache accesses.
+
+The iCC and idCC structs also store unchanging
+information about the instruction:
+
+
+
data size (idCC only)
+
idCC. This is
+because for many memory-referencing instructions the data address can change
+each time it's executed (e.g. if it uses register-offset addressing). We have
+to give this item to the cache simulation in a different way (see
+Instrumentation section below). Some memory-referencing instructions do always
+reference the same address, but we don't try to treat them specially in order to
+keep things simple.
+
+Also note that there is only room for recording info about one data cache
+access in an idCC. So what about instructions that do a read then
+a write, such as:
+
+
incl (%esi)
+
+In a write-allocate cache, as simulated by Valgrind, the write cannot miss,
+since it immediately follows the read which will drag the block into the cache
+if it's not already there. So the write access isn't really interesting, and
+Valgrind doesn't record it. This means that Valgrind doesn't measure
+memory references, but rather memory references that could miss in the cache.
+This behaviour is the same as that used by the AMD Athlon hardware counters.
+It also has the benefit of simplifying the implementation -- instructions that
+read and write memory can be treated like instructions that read memory.+ +
+ +Valgrind does JIT translations at the basic block level, and cost centres are +also set up and stored at the basic block level. By doing things carefully, we +store all the cost centres for a basic block in a contiguous array, and lookup +comes almost for free.
+ +Consider this part of a basic block (for exposition purposes, pretend it's an +entire basic block): + +
+movl $0x0,%eax +movl $0x99, -4(%ebp) ++ +The translation to UCode looks like this: + +
+MOVL $0x0, t20 +PUTL t20, %EAX +INCEIPo $5 + +LEA1L -4(t4), t14 +MOVL $0x99, t18 +STL t18, (t14) +INCEIPo $7 ++ +The first step is to allocate the cost centres. This requires a preliminary +pass to count how many x86 instructions were in the basic block, and their +types (and thus sizes). UCode translations for single x86 instructions are +delimited by the
INCEIPo instruction, the argument of which gives
+the byte size of the instruction (note that lazy INCEIP updating is turned off
+to allow this).
+
+We can tell if an x86 instruction references memory by looking for
+LDL and STL UCode instructions, and thus what kind of
+cost centre is required. From this we can determine how many cost centres we
+need for the basic block, and their sizes. We can then allocate them in a
+single array.
+
+Consider the example code above. After the preliminary pass, we know we need
+two cost centres, one iCC and one idCC. So we
+allocate an array to store these which looks like this:
+
+
+|(uninit)| tag (1 byte) +|(uninit)| instr_size (1 byte) +|(uninit)| (padding) (2 bytes) +|(uninit)| instr_addr (4 bytes) +|(uninit)| I.a (8 bytes) +|(uninit)| I.m1 (8 bytes) +|(uninit)| I.m2 (8 bytes) + +|(uninit)| tag (1 byte) +|(uninit)| instr_size (1 byte) +|(uninit)| data_size (1 byte) +|(uninit)| (padding) (1 byte) +|(uninit)| instr_addr (4 bytes) +|(uninit)| I.a (8 bytes) +|(uninit)| I.m1 (8 bytes) +|(uninit)| I.m2 (8 bytes) +|(uninit)| D.a (8 bytes) +|(uninit)| D.m1 (8 bytes) +|(uninit)| D.m2 (8 bytes) +
 + +(We can see now why we need tags to distinguish between the two types of cost +centres.)
+ +We also record the size of the array. We look up the debug info of the first +instruction in the basic block, and then stick the array into a table indexed +by filename and function name. This makes it easy to dump the information +quickly to file at the end.
+ +
+
+
+|INSTR_CC| tag (1 byte) +|5 | instr_size (1 byte) +|(uninit)| (padding) (2 bytes) +|i_addr1 | instr_addr (4 bytes) +|0 | I.a (8 bytes) +|0 | I.m1 (8 bytes) +|0 | I.m2 (8 bytes) + +|READ_CC | tag (1 byte) +|7 | instr_size (1 byte) +|4 | data_size (1 byte) +|(uninit)| (padding) (1 byte) +|i_addr2 | instr_addr (4 bytes) +|0 | I.a (8 bytes) +|0 | I.m1 (8 bytes) +|0 | I.m2 (8 bytes) +|0 | D.a (8 bytes) +|0 | D.m1 (8 bytes) +|0 | D.m2 (8 bytes) +
 + +(Note that this step is not performed if a basic block is re-translated; see +here for more information.)
+
+GCC inserts padding before the instr_addr field so that it is word
+aligned.
+ +The instrumentation added to call the cache simulation function looks like this +(instrumentation is indented to distinguish it from the original UCode): + +
+MOVL $0x0, t20 +PUTL t20, %EAX + PUSHL %eax + PUSHL %ecx + PUSHL %edx + MOVL $0x4091F8A4, t46 # address of 1st CC + PUSHL t46 + CALLMo $0x12 # first cachesim function + CLEARo $0x4 + POPL %edx + POPL %ecx + POPL %eax +INCEIPo $5 + +LEA1L -4(t4), t14 +MOVL $0x99, t18 + MOVL t14, t42 +STL t18, (t14) + PUSHL %eax + PUSHL %ecx + PUSHL %edx + PUSHL t42 + MOVL $0x4091F8C4, t44 # address of 2nd CC + PUSHL t44 + CALLMo $0x13 # second cachesim function + CLEARo $0x8 + POPL %edx + POPL %ecx + POPL %eax +INCEIPo $7 +
 + +Consider the first instruction's UCode. Each call is surrounded by three
PUSHL and POPL instructions to save and restore the
+caller-save registers. Then the address of the instruction's cost centre is
+pushed onto the stack, to be the first argument to the cache simulation
+function. The address is known at this point because we are doing a
+simultaneous pass through the cost centre array. This means the cost centre
+lookup for each instruction is almost free (just the cost of pushing an
+argument for a function call). Then the call to the cache simulation function
+for non-memory-reference instructions is made (note that the
+CALLMo UInstruction takes an offset into a table of predefined
+functions; it is not an absolute address), and the single argument is
+CLEARed from the stack.
+
+The second instruction's UCode is similar. The only difference is that, as
+mentioned before, we have to pass the address of the data item referenced to
+the cache simulation function too. This explains the MOVL t14,
+t42 and PUSHL t42 UInstructions. (Note that the seemingly
+redundant MOVing will probably be optimised away during register
+allocation.)
+ +Note that instead of storing unchanging information about each instruction +(instruction size, data size, etc) in its cost centre, we could have passed in +these arguments to the simulation function. But this would slow the calls down +(two or three extra arguments pushed onto the stack). Also it would bloat the +UCode instrumentation by amounts similar to the space required for them in the +cost centre; bloated UCode would also fill the translation cache more quickly, +requiring more translations for large programs and slowing them down more.
+ +However, we can't use this approach for profiling -- we can't throw away cost +centres for instructions in the middle of execution! So when a basic block is +translated, we first look for its cost centre array in the hash table. If +there is no cost centre array, it must be the first translation, so we proceed +as described above. But if there is a cost centre array already, it must be a +retranslation. In this case, we skip the cost centre allocation and +initialisation steps, but still do the UCode instrumentation step.
+ +
+
+The interface to the simulation is quite clean. The functions called from the
+UCode contain calls to the simulation functions in the files
+vg_cachesim_{I1,D1,L2}.c; these calls are inlined so that only
+one function call is done per simulated x86 instruction. The file
+vg_cachesim.c simply #includes the three files
+containing the simulation, which makes plugging in new cache simulations
+very easy -- you just replace the three files and recompile.
+ +
+ +The input file has the following format: + +
+file ::= desc_line* cmd_line events_line data_line+ summary_line
+desc_line ::= "desc:" ws? non_nl_string
+cmd_line ::= "cmd:" ws? cmd
+events_line ::= "events:" ws? (event ws)+
+data_line ::= file_line | fn_line | count_line
+file_line ::= ("fl=" | "fi=" | "fe=") filename
+fn_line ::= "fn=" fn_name
+count_line ::= line_num ws? (count ws)+
+summary_line ::= "summary:" ws? (count ws)+
+count ::= num | "."
+
+
+Where:
+
+non_nl_string is any string not containing a newline.+
cmd is a command line invocation.+
filename and fn_name can be anything.+
num and line_num are decimal numbers.+
ws is whitespace.+
nl is a newline.+
+ +Counts can be "." to represent "N/A", e.g. the number of write misses for an +instruction that doesn't write to memory.
+
+The number of counts in each line and the
+summary_line should not exceed the number of events in the
+events_line. If a line has fewer,
+vg_annotate treats the missing counts as though they were "." entries.
+
+A file_line changes the current file name. A fn_line
+changes the current function name. A count_line contains counts
+that pertain to the current filename/fn_name. A "fl=" file_line
+and a fn_line must appear before any count_lines to
+give the context of the first count_lines.
+
+Each file_line should be immediately followed by a
+fn_line. "fi=" file_lines are used to switch
+filenames for inlined functions; "fe=" file_lines are similar, but
+are put at the end of a basic block in which the file name hasn't been switched
+back to the original file name. (fi and fe lines behave the same; they are
+only distinguished to help debugging.)
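To make the grammar concrete, here is a small hand-written input file that conforms to it. The cache description, command, line numbers and counts are invented for illustration (the event names mimic Cachegrind's own), and the summary line here is simply the column totals:

```
desc: I1 cache: 16384 B, 32 B, 4-way associative
cmd: ./concord concord.c
events: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
fl=concord.c
fn=main
9 3 1 1 . . . . . .
10 8 . . 2 1 . 2 . .
fn=init_table
20 4 1 1 1 . . 1 . .
summary: 15 2 2 3 1 . 3 . .
```

Line 10 both reads and writes memory, so it has Dr and Dw counts; line 9 executes only non-memory-referencing instructions, so all its data columns are ".".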
+ + +
+ +
+ +
+ +
+ +
cachegrind.out output files can contain huge amounts of
+ information; the file format was carefully chosen to minimise file
+ sizes.+
+
+In particular, vg_annotate would not need to change -- the file format is such
+that it is not specific to the cache simulation, but could be used for any kind
+of line-by-line information. The only part of vg_annotate that is specific to
+the cache simulation is the name of the input file
+(cachegrind.out), although it would be very simple to add an
+option to control this.
+