diff --git a/cachegrind/docs/cg-manual.xml b/cachegrind/docs/cg-manual.xml index 92fe08682..35d6a412e 100644 --- a/cachegrind/docs/cg-manual.xml +++ b/cachegrind/docs/cg-manual.xml @@ -5,167 +5,117 @@ -Cachegrind: a cache and branch-prediction profiler +Cachegrind: a high-precision tracing profiler -To use this tool, you must specify - on the -Valgrind command line. + +To use this tool, specify on the Valgrind +command line. + Overview -Cachegrind simulates how your program interacts with a machine's cache -hierarchy and (optionally) branch predictor. It simulates a machine with -independent first-level instruction and data caches (I1 and D1), backed by a -unified second-level cache (L2). This exactly matches the configuration of -many modern machines. - -However, some modern machines have three or four levels of cache. For these -machines (in the cases where Cachegrind can auto-detect the cache -configuration) Cachegrind simulates the first-level and last-level caches. -The reason for this choice is that the last-level cache has the most influence on -runtime, as it masks accesses to main memory. Furthermore, the L1 caches -often have low associativity, so simulating them can detect cases where the -code interacts badly with this cache (eg. traversing a matrix column-wise -with the row length being a power of 2). - -Therefore, Cachegrind always refers to the I1, D1 and LL (last-level) -caches. - -Cachegrind gathers the following statistics (abbreviations used for each statistic -is given in parentheses): +Cachegrind is a high-precision tracing profiler. It runs slowly, but collects +precise and reproducible profiling data. It can merge and diff data from +different runs. To expand on these characteristics: + + - I cache reads (Ir, - which equals the number of instructions executed), - I1 cache read misses (I1mr) and - LL cache instruction read misses (ILmr). + + Precise. 
Cachegrind measures the exact number of + instructions executed by your program, not an approximation. Furthermore, + it presents the gathered data at the file, function, and line level. This + is different to many other profilers that measure approximate execution + time, using sampling, and only at the function level. + - D cache reads (Dr, which - equals the number of memory reads), - D1 cache read misses (D1mr), and - LL cache data read misses (DLmr). - - - - D cache writes (Dw, which equals - the number of memory writes), - D1 cache write misses (D1mw), and - LL cache data write misses (DLmw). - - - - Conditional branches executed (Bc) and - conditional branches mispredicted (Bcm). - - - - Indirect branches executed (Bi) and - indirect branches mispredicted (Bim). + + Reproducible. In general, execution time is a better + metric than instruction counts because it's what users perceive. However, + execution time often has high variability. When running the exact same + program on the exact same input multiple times, execution time might vary + by several percent. Furthermore, small changes in a program can change its + memory layout and have even larger effects on runtime. In contrast, + instruction counts are highly reproducible; for some programs they are + perfectly reproducible. This means the effects of small changes in a + program can be measured with high precision. -Note that D1 total accesses is given by -D1mr + -D1mw, and that LL total -accesses is given by ILmr + -DLmr + -DLmw. + +For these reasons, Cachegrind is an excellent complement to time-based profilers. -These statistics are presented for the entire program and for each -function in the program. You can also annotate each line of source code in -the program with the counts that were caused directly by it. + +Cachegrind can annotate programs written in any language, so long as debug info +is present to map machine code back to the original source code. 
Cachegrind has +been used successfully on programs written in C, C++, Rust, and assembly. + -On a modern machine, an L1 miss will typically cost -around 10 cycles, an LL miss can cost as much as 200 -cycles, and a mispredicted branch costs in the region of 10 -to 30 cycles. Detailed cache and branch profiling can be very useful -for understanding how your program interacts with the machine and thus how -to make it faster. - -Also, since one instruction cache read is performed per -instruction executed, you can find out how many instructions are -executed per line, which can be useful for traditional profiling. + +Cachegrind can also simulate how your program interacts with a machine's cache +hierarchy and branch predictor. This simulation was the original motivation for +the tool, hence its name. However, the simulations are basic and unlikely to +reflect the behaviour of a modern machine. For this reason they are off by +default. If you really want cache and branch information, a profiler like +perf that accesses hardware counters is a +better choice. + - -Using Cachegrind, cg_annotate and cg_merge + xreflabel="Using Cachegrind and cg_annotate"> +Using Cachegrind and cg_annotate -First off, as for normal Valgrind use, you probably want to -compile with debugging info (the - option). But by contrast with -normal Valgrind use, you probably do want to turn -optimisation on, since you should profile your program as it will -be normally run. + +First, as for normal Valgrind use, you should compile with debugging info (the + option in most compilers). But by contrast with normal +Valgrind use, you probably do want to turn optimisation on, since you should +profile your program as it will be normally run. + -Then, you need to run Cachegrind itself to gather the profiling -information, and then run cg_annotate to get a detailed presentation of that -information. 
As an optional intermediate step, you can use cg_merge to sum -together the outputs of multiple Cachegrind runs into a single file which -you then use as the input for cg_annotate. Alternatively, you can use -cg_diff to difference the outputs of two Cachegrind runs into a single file -which you then use as the input for cg_annotate. + +Second, run Cachegrind itself to gather the profiling data. + + + +Third, run cg_annotate to get a detailed presentation of that data. cg_annotate +can combine the results of multiple Cachegrind output files. It can also +perform a diff between two Cachegrind output files. + Running Cachegrind -To run Cachegrind on a program prog, run: + +To run Cachegrind on a program prog, run: - -The program will execute (slowly). Upon completion, -summary statistics that look like this will be printed: - - - -Cache accesses for instruction fetches are summarised -first, giving the number of fetches made (this is the number of -instructions executed, which can be useful to know in its own -right), the number of I1 misses, and the number of LL instruction -(LLi) misses. - -Cache accesses for data follow. The information is similar -to that of the instruction fetches, except that the values are -also shown split between reads and writes (note each row's -rd and -wr values add up to the row's -total). - -Combined instruction and data figures for the LL cache -follow that. Note that the LL miss rate is computed relative to the total -number of memory accesses, not the number of L1 misses. I.e. it is -(ILmr + DLmr + DLmw) / (Ir + Dr + Dw) -not -(ILmr + DLmr + DLmw) / (I1mr + D1mr + D1mw) -Branch prediction statistics are not collected by default. -To do so, add the option . + +The program will execute (slowly). Upon completion, summary statistics that +look like this will be printed: + + + + + +The I refs number is short for "Instruction +cache references", which is equivalent to "instructions executed". 
If you +enable the cache and/or branch simulation, additional counts will be shown. + @@ -173,660 +123,744 @@ To do so, add the option . Output File -As well as printing summary information, Cachegrind also writes -more detailed profiling information to a file. By default this file is named -cachegrind.out.<pid> (where -<pid> is the program's process ID), but its name -can be changed with the option. This -file is human-readable, but is intended to be interpreted by the -accompanying program cg_annotate, described in the next section. + +Cachegrind also writes more detailed profiling data to a file. By default this +Cachegrind output file is named cachegrind.out.<pid> +(where <pid> is the program's process ID), but its +name can be changed with the option. +This file is human-readable, but is intended to be interpreted by the +accompanying program cg_annotate, described in the next section. + -The default .<pid> suffix -on the output file name serves two purposes. Firstly, it means you -don't have to rename old log files that you don't want to overwrite. -Secondly, and more importantly, it allows correct profiling with the - option of -programs that spawn child processes. - -The output file can be big, many megabytes for large applications -built with full debugging information. + +The default .<pid> suffix on the output +file name serves two purposes. First, it means existing Cachegrind output files +aren't immediately overwritten. Second, and more importantly, it allows correct +profiling with the option of programs +that spawn child processes. + - Running cg_annotate -Before using cg_annotate, -it is worth widening your window to be at least 120-characters -wide if possible, as the output lines can be quite long. - -To get a function-by-function summary, run: + +Before using cg_annotate, it is worth widening your window to be at least 120 +characters wide if possible, because the output lines can be quite long. 
+ + +Then run: cg_annotate <filename> - -on a Cachegrind output file. +on a Cachegrind output file. + + + + +The Metadata Section + + +The first part of the output looks like this: + - -This is a summary of the annotation options: + +It summarizes how Cachegrind and the profiled program were run. + - - I1 cache, D1 cache, LL cache: cache configuration. So - you know the configuration with which these results were - obtained. + + Invocation: the command line used to produce this output. + - Command: the command line invocation of the program - under examination. + + Command: the command line used to run the profiled program. + - Events recorded: which events were recorded. - - - - - Events shown: the events shown, which is a subset of the events - gathered. This can be adjusted with the - option. + + Events recorded: which events were recorded. By default, this is + Ir. More events will be recorded if cache + and/or branch simulation is enabled. + - Event sort order: the sort order in which functions are - shown. For example, in this case the functions are sorted - from highest Ir counts to - lowest. If two functions have identical - Ir counts, they will then be - sorted by I1mr counts, and - so on. This order can be adjusted with the - option. - - Note that this dictates the order the functions appear. - It is not the order in which the columns - appear; that is dictated by the "events shown" line (and can - be changed with the - option). + + Events shown: the events shown, which is a subset of the events gathered. + This can be adjusted with the option. + - Threshold: cg_annotate - by default omits functions that cause very low counts - to avoid drowning you in information. In this case, - cg_annotate shows summaries the functions that account for - 99% of the Ir counts; - Ir is chosen as the - threshold event since it is the primary sort event. The - threshold can be adjusted with the - - option. 
+ + Event sort order: the sort order used for the subsequent sections. For + example, in this case those sections are sorted from highest + Ir counts to lowest. If there are multiple + events, one will be the primary sort event, and then there can be a + secondary sort event, tertiary sort event, etc., though more than one is + rarely needed. This order can be adjusted with the + option. Note that this does not specify the order in + which the columns appear. That is specified by the "events shown" line (and + can be changed with the option). + - Chosen for annotation: names of files specified - manually for annotation; in this case none. + + Threshold: cg_annotate by default omits files and functions with very low + counts to keep the output size reasonable. By default cg_annotate only + shows files and functions that account for at least 0.1% of the primary + sort event. The threshold can be adjusted with the + option. + - Auto-annotation: whether auto-annotation was requested - via the - option. In this case no. + + Annotation: whether source file annotation is enabled. Controlled with the + option. + + +If cache simulation is enabled, details of the cache parameters will be shown +above the "Invocation" line. + + -The Global and Function-level Counts - -Then follows summary statistics for the whole -program: - - + xreflabel="Global, File, and Function-level Counts"> +Global, File, and Function-level Counts -These are similar to the summary provided when Cachegrind finishes running. +Next comes the summary for the whole program: + + + + + +The Ir column label is suffixed with +underscores to show the bounds of the columns underneath. -Then comes function-by-function statistics: + +Then comes file:function counts. Here is the first part of that section: + + Ir______________________ file:function -Each function -is identified by a -file_name:function_name pair. If -a column contains only a dot it means the function never performs -that event (e.g. 
the third row shows that -strcmp() contains no -instructions that write to memory). The name -??? is used if the file name -and/or function name could not be determined from debugging -information. If most of the entries have the form -???:??? the program probably -wasn't compiled with . +< 3,078,746 (37.6%, 37.6%) /home/njn/grind/ws1/cachegrind/concord.c: + 1,630,232 (19.9%) get_word + 630,918 (7.7%) hash + 461,095 (5.6%) insert + 130,560 (1.6%) add_existing + 91,014 (1.1%) init_hash_table + 88,056 (1.1%) create + 46,676 (0.6%) new_word_node -It is worth noting that functions will come both from -the profiled program (e.g. concord.c) -and from libraries (e.g. getc.c) +< 1,746,038 (21.3%, 58.9%) ./malloc/./malloc/malloc.c: + 1,285,938 (15.7%) _int_malloc + 458,225 (5.6%) malloc + +< 1,107,550 (13.5%, 72.4%) ./libio/./libio/getc.c:getc + +< 551,071 (6.7%, 79.1%) ./string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2 + +< 521,228 (6.4%, 85.5%) ./ctype/../include/ctype.h: + 260,616 (3.2%) __ctype_tolower_loc + 260,612 (3.2%) __ctype_b_loc + +< 468,163 (5.7%, 91.2%) ???: + 468,151 (5.7%) ??? + +< 456,071 (5.6%, 96.8%) /usr/include/ctype.h:get_word + +]]> + + +Each entry covers one file, and one or more functions within that file. If +there is only one significant function within a file, as in the third entry, +the file and function are shown on the same line separated by a colon. If there +are multiple significant functions within a file, as in the first entry, each +function gets its own line. + + + +This example involves a small C program, and shows a combination of code from +the program itself (including functions like get_word and +hash in the file concord.c) as well +as code from system libraries, such as functions like +malloc and getc. + + + +Each entry is preceded with a <, which can +be useful when navigating through the output in an editor, or grepping through +results. 
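
As a rough illustration of the grepping use case, the following Python sketch pulls the entry header lines (those starting with `<`) out of saved cg_annotate output. The regular expression, the `entry_headers` helper, and the exact column spacing in `SAMPLE` are assumptions made for this example, based only on the layout of the sample shown above; they are not part of cg_annotate.

```python
import re

# Assumed layout of an entry header, inferred from the sample output:
#   "< <count> (<percent>%, <cumulative>%)  <file>:[<function>]"
ENTRY = re.compile(r"^< +(-?[\d,]+) +\((-?[\d.]+)%, +(-?[\d.]+)%\) +(\S.*)$")


def entry_headers(text):
    """Return (count, percent, cumulative_percent, name) for each '<' line."""
    results = []
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:
            count = int(m.group(1).replace(",", ""))  # drop thousands separators
            results.append((count, float(m.group(2)), float(m.group(3)), m.group(4)))
    return results


# A few lines copied from the sample above (spacing is an assumption).
SAMPLE = """\
< 3,078,746 (37.6%, 37.6%)  /home/njn/grind/ws1/cachegrind/concord.c:
  1,630,232 (19.9%)           get_word
    630,918  (7.7%)           hash

< 1,107,550 (13.5%, 72.4%)  ./libio/./libio/getc.c:getc
"""

for count, pct, cumulative, name in entry_headers(SAMPLE):
    print(f"{count:>12,} {pct:5.1f}% {cumulative:5.1f}%  {name}")
```

Only the header lines match, so the indented per-function lines are skipped; a real script would probably want to collect those as well.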
+ + + +The first percentage in each column indicates the proportion of the total event +count that is covered by this line. The second percentage, which only shows on the +first line of each entry, shows the cumulative percentage of all the entries up +to and including this one. The entries shown here account for 96.8% of the +instructions executed by the program. + + + +The name ??? is used if the file name and/or +function name could not be determined from debugging information. If +??? filenames dominate, the program probably wasn't +compiled with . If ??? function names +dominate, the program may have had symbols stripped. + + + +After that come the function:file counts. Here is the first part of that section: + + + 2,086,303 (25.5%, 25.5%) get_word: + 1,630,232 (19.9%) /home/njn/grind/ws1/cachegrind/concord.c + 456,071 (5.6%) /usr/include/ctype.h + +> 1,285,938 (15.7%, 41.1%) _int_malloc:./malloc/./malloc/malloc.c + +> 1,107,550 (13.5%, 54.7%) getc:./libio/./libio/getc.c + +> 630,918 (7.7%, 62.4%) hash:/home/njn/grind/ws1/cachegrind/concord.c + +> 551,071 (6.7%, 69.1%) __strcmp_avx2:./string/../sysdeps/x86_64/multiarch/strcmp-avx2.S + +> 480,248 (5.9%, 74.9%) malloc: + 458,225 (5.6%) ./malloc/./malloc/malloc.c + 22,023 (0.3%) ./malloc/./malloc/arena.c + +> 468,151 (5.7%, 80.7%) ???:??? + +> 461,095 (5.6%, 86.3%) insert:/home/njn/grind/ws1/cachegrind/concord.c +]]> + + +This is similar to the previous section, but is grouped by functions first and +files second. Also, the entry markers are > +instead of <. + + + +You might wonder why this section is needed, and how it differs from the +previous section. The answer is inlining. In this example there are two entries +demonstrating a function whose code is effectively spread across more than one +file: get_word and malloc. 
Here is an +example from profiling the Rust compiler, a much larger program that uses +inlining more: + + + 30,469,230 (1.3%, 11.1%) ::intern_ty: + 10,269,220 (0.5%) /home/njn/.cargo/registry/src/github.com-1ecc6299db9ec823/hashbrown-0.12.3/src/raw/mod.rs + 7,696,827 (0.3%) /home/njn/dev/rust0/compiler/rustc_middle/src/ty/context.rs + 3,858,099 (0.2%) /home/njn/dev/rust0/library/core/src/cell.rs +]]> + + +In this case the compiled function intern_ty includes code +from three different source files, due to inlining. These should be examined +together. Older versions of cg_annotate presented this entry as three separate +file:function entries, which would typically be intermixed with all the other +entries, making it hard to see that they are all really part of the same +function. + - -Line-by-line Counts + +Per-line Counts -By default, all source code annotation is also shown. (Filenames to be -annotated can also by specified manually as arguments to cg_annotate, but this -is rarely needed.) For example, the output from running cg_annotate -<filename> for our example produces the same output as above -followed by an annotated version of concord.c, a section -of which looks like: + +By default, a source file is annotated if it contains at least one function +that meets the significance threshold. This can be disabled with the + option. + + + +To continue the previous example, here is part of the annotation of the file +concord.c: + ;word, data->line, table); - . . . . . . . . . - 4 0 0 1 0 0 2 0 0 free(data); - 4 0 0 1 0 0 2 0 0 fclose(file_ptr); - 3 0 0 2 0 0 . . . }]]> + . /* Function builds the hash table from the given file. */ + . void init_hash_table(char *file_name, Word_Node *table[]) + 8 (0.0%) { + . FILE *file_ptr; + . Word_Info *data; + 2 (0.0%) int line = 1, i; + . + . /* Structure used when reading in words and line numbers. */ + 3 (0.0%) data = (Word_Info *) create(sizeof(Word_Info)); + . + . /* Initialise entire table to NULL. 
*/ + 2,993 (0.0%) for (i = 0; i < TABLE_SIZE; i++) + 997 (0.0%) table[i] = NULL; + . + . /* Open file, check it. */ + 4 (0.0%) file_ptr = fopen(file_name, "r"); + 2 (0.0%) if (!(file_ptr)) { + . fprintf(stderr, "Couldn't open '%s'.\n", file_name); + . exit(EXIT_FAILURE); + . } + . + . /* 'Get' the words and lines one at a time from the file, and insert them + . ** into the table one at a time. */ + 55,363 (0.7%) while ((line = get_word(data, line, file_ptr)) != EOF) + 31,632 (0.4%) insert(data->word, data->line, table); + . + 2 (0.0%) free(data); + 2 (0.0%) fclose(file_ptr); + 6 (0.0%) } +]]> -(Although column widths are automatically minimised, a wide -terminal is clearly useful.) - -Each source file is clearly marked -(User-annotated source) as -having been chosen manually for annotation. If the file was -found in one of the directories specified with the -/ option, the directory -and file are both given. + +Each executed line is annotated with its event counts. Other lines are +annotated with a dot. This may be because they contain no executable code, or +they contain executable code but were never executed. + -Each line is annotated with its event counts. Events not -applicable for a line are represented by a dot. This is useful -for distinguishing between an event which cannot happen, and one -which can but did not. + +You can easily tell if a function is inlined from this output. If it is not +inlined, it will have event counts on the lines containing the opening and +closing braces. If it is inlined, it will not have event counts on those lines. +In the example above, init_hash_table does have counts, +so you can tell it is not inlined. + -Sometimes only a small section of a source file is -executed. To minimise uninteresting output, Cachegrind only shows -annotated lines and lines within a small distance of annotated -lines. 
Gaps are marked with the line numbers so you know which -part of a file the shown code comes from, eg: + +Note again that inlining can lead to surprising results. If a function +f is always inlined, in the file:function and +function:file sections counts will be attributed to the functions it is inlined +into, rather than itself. However, if you look at the line-by-line annotations +for f you'll see the counts that belong to +f. So it's worth looking for large counts/percentages in the +line-by-line annotations. + + + +Sometimes only a small section of a source file is executed. To minimise +uninteresting output, Cachegrind only shows annotated lines and lines within a +small distance of annotated lines. Gaps are marked with line numbers, for +example: + +(counts and code for line 704) +-- line 375 ---------------------------------------- +-- line 514 ---------------------------------------- +(counts and code for line 878) +]]> -The amount of context to show around annotated lines is -controlled by the -option. + +The number of lines of context shown around annotated lines is controlled by +the option. + -Automatic annotation is enabled by default. -cg_annotate will automatically annotate every source file it can -find that is mentioned in the function-by-function summary. -Therefore, the files chosen for auto-annotation are affected by -the and - options. Each -source file is clearly marked (Auto-annotated -source) as being chosen automatically. 
Any -files that could not be found are mentioned at the end of the -output, eg: + +Any significant source files that could not be found are shown like this: + +-------------------------------------------------------------------------------- +-- Annotated source file: ./malloc/./malloc/malloc.c +-------------------------------------------------------------------------------- +Unannotated because one or more of these original files are unreadable: +- ./malloc/./malloc/malloc.c +]]> -This is quite common for library files, since libraries are -usually compiled with debugging information, but the source files -are often not present on a system. If a file is chosen for -annotation both manually and automatically, it -is marked as User-annotated -source. Use the -/ option to tell Valgrind where -to look for source files if the filenames found from the debugging -information aren't specific enough. + +This is common for library files, because libraries are usually compiled with +debugging information but the source files are rarely present on a system. + - Beware that auto-annotation can produce a lot of output if your program -is large. + +Cachegrind relies heavily on accurate debug info. Sometimes compilers map +a particular compiled instruction to line number 0, where the 0 represents +"unknown" or "none". This is annoying but does happen in practice. cg_annotate +prints these in the following way: + + + +]]> + + +Finally, when annotation is performed, the output ends with a summary of how +many counts were annotated and unannotated, and why. For example: + + + - -Annotating Assembly Code Programs - -Valgrind can annotate assembly code programs too, or annotate -the assembly code generated for your C program. Sometimes this is -useful for understanding what is really happening when an -interesting line of C code is translated into multiple -instructions. - -To do this, you just need to assemble your -.s files with assembly-level debug -information. 
You can use compile with the to compile C/C++ -programs to assembly code, and then assemble the assembly code files with - to achieve this. You can then profile and annotate the -assembly code source files in the same way as C/C++ source files. - - - Forking Programs -If your program forks, the child will inherit all the profiling data that -has been gathered for the parent. -If the output file format string (controlled by -) does not contain , -then the outputs from the parent and child will be intermingled in a single -output file, which will almost certainly make it unreadable by -cg_annotate. + +If your program forks, the child will inherit all the profiling data that +has been gathered for the parent. + + + +If the output file name (controlled by ) +does not contain , then the outputs from the parent and +child will be intermingled in a single output file, which will almost certainly +make it unreadable by cg_annotate. + + cg_annotate Warnings -There are a couple of situations in which -cg_annotate issues warnings. + +There are two situations in which cg_annotate prints warnings. + - If a source file is more recent than the - cachegrind.out.<pid> file. - This is because the information in - cachegrind.out.<pid> is only - recorded with line numbers, so if the line numbers change at - all in the source (e.g. lines added, deleted, swapped), any - annotations will be incorrect. + + If a source file is more recent than the Cachegrind output file. This is + because the information in the Cachegrind output file is only recorded with + line numbers, so if the line numbers change at all in the source (e.g. + lines added, deleted, swapped), any annotations will be incorrect. + - If information is recorded about line numbers past the - end of a file. This can be caused by the above problem, - i.e. shortening the source file while using an old - cachegrind.out.<pid> file. 
If - this happens, the figures for the bogus lines are printed - anyway (clearly marked as bogus) in case they are - important. + + If information is recorded about line numbers past the end of a file. This + can be caused by the above problem, e.g. shortening the source file while + using an old Cachegrind output file. If this happens, the figures for the + bogus lines are printed anyway (and clearly marked as bogus) in case they + are important. + - - -Unusual Annotation Cases - -Some odd things that can occur during annotation: - - - - If annotating at the assembler level, you might see - something like this: - - - How can the third instruction be executed twice when - the others are executed only once? As it turns out, it - isn't. Here's a dump of the executable, using - objdump -d: - - - Notice the extra mov - %esi,%esi instruction. Where did this come - from? The GNU assembler inserted it to serve as the two - bytes of padding needed to align the movl - $.LnrB,%eax instruction on a four-byte - boundary, but pretended it didn't exist when adding debug - information. Thus when Valgrind reads the debug info it - thinks that the movl - $0x1,0xffffffec(%ebp) instruction covers the - address range 0x8048f2b--0x804833 by itself, and attributes - the counts for the mov - %esi,%esi to it. - - - - - - Sometimes, the same filename might be represented with - a relative name and with an absolute name in different parts - of the debug info, eg: - /home/user/proj/proj.h and - ../proj.h. In this case, if you use - auto-annotation, the file will be annotated twice with the - counts split between the two. - - - - If you compile some files with - and some without, some - events that take place in a file without debug info could be - attributed to the last line of a file with debug info - (whichever one gets placed before the non-debug-info file in - the executable). - - - - -These cases should be rare. 
- - - - -Merging Profiles with cg_merge +Merging Cachegrind Output Files -cg_merge is a simple program which -reads multiple profile files, as created by Cachegrind, merges them -together, and writes the results into another file in the same format. -You can then examine the merged results using -cg_annotate <filename>, as -described above. The merging functionality might be useful if you -want to aggregate costs over multiple runs of the same program, or -from a single parallel run with multiple instances of the same -program. +cg_annotate can merge data from multiple Cachegrind output files in a single +run. (There is also a program called cg_merge that can merge multiple +Cachegrind output files into a single Cachegrind output file, but it is now +deprecated because cg_annotate's merging does a better job.) + -cg_merge is invoked as follows: +Use it as follows: +cg_annotate file1 file2 file3 ... +]]> -It reads and checks file1, then read -and checks file2 and merges it into -the running totals, then the same with -file3, etc. The final results are -written to outputfile, or to standard -out if no output file is specified. +cg_annotate computes the sum of these files (effectively +file1 + file2 + +file3), and then produces output as usual that shows the +summed counts. + -Costs are summed on a per-function, per-line and per-instruction -basis. Because of this, the order in which the input files does not -matter, although you should take care to only mention each file once, -since any file mentioned twice will be added in twice. - - -cg_merge does not attempt to check -that the input files come from runs of the same executable. It will -happily merge together profile files from completely unrelated -programs. It does however check that the -Events: lines of all the inputs are -identical, so as to ensure that the addition of costs makes sense. 
-For example, it would be nonsensical for it to add a number indicating -D1 read references to a number from a different file indicating LL -write misses. - - -A number of other syntax and sanity checks are done whilst reading the -inputs. cg_merge will stop and -attempt to print a helpful error message if any of the input files -fail these checks. +The most common merging scenario is aggregating costs over +multiple runs of the same program, possibly on different inputs. + -Differencing Profiles with cg_diff +Differencing Cachegrind Output Files -cg_diff is a simple program which -reads two profile files, as created by Cachegrind, finds the difference -between them, and writes the results into another file in the same format. -You can then examine the merged results using -cg_annotate <filename>, as -described above. This is very useful if you want to measure how a change to -a program affected its performance. +cg_annotate can diff data from two Cachegrind output files in a single run. +(There is also a program called cg_diff that can diff two Cachegrind output +files into a single Cachegrind output file, but it is now deprecated because +cg_annotate's differencing does a better job.) -cg_diff is invoked as follows: +Use it as follows: +cg_annotate --diff file1 file2 +]]> -It reads and checks file1, then read -and checks file2, then computes the -difference (effectively file1 - -file2). The final results are written to -standard output. +cg_annotate computes the difference between these two files (effectively +file2 - file1), and then +produces output as usual that shows the count differences. Note that many of +the counts may be negative; this indicates that the counts for the relevant +file/function/line are smaller in the second version than those in the first +version. + -Costs are summed on a per-function basis. Per-line costs are not summed, -because doing so is too difficult. 
For example, consider differencing two -profiles, one from a single-file program A, and one from the same program A -where a single blank line was inserted at the top of the file. Every single -per-line count has changed. In comparison, the per-function counts have not -changed. The per-function count differences are still very useful for -determining differences between programs. Note that because the result is -the difference of two profiles, many of the counts will be negative; this -indicates that the counts for the relevant function are fewer in the second -version than those in the first version. +The simplest common scenario is comparing two Cachegrind output files that came +from the same program, but on different inputs. cg_annotate will do a good job +on this without assistance. + -cg_diff does not attempt to check -that the input files come from runs of the same executable. It will -happily merge together profile files from completely unrelated -programs. It does however check that the -Events: lines of all the inputs are -identical, so as to ensure that the addition of costs makes sense. -For example, it would be nonsensical for it to add a number indicating -D1 read references to a number from a different file indicating LL -write misses. +A more complex scenario is if you want to compare Cachegrind output files from +two slightly different versions of a program that you have sitting +side-by-side, running on the same input. For example, you might have +version1/prog.c and version2/prog.c. +A straight comparison of the two would not be useful. Because functions are +always paired with filenames, a function f would be listed +as version1/prog.c:f for the first version but +version2/prog.c:f for the second version. + -A number of other syntax and sanity checks are done whilst reading the -inputs. cg_diff will stop and -attempt to print a helpful error message if any of the input files -fail these checks. 
- - -Sometimes you will want to compare Cachegrind profiles of two versions of a -program that you have sitting side-by-side. For example, you might have -version1/prog.c and -version2/prog.c, where the second is -slightly different to the first. A straight comparison of the two will not -be useful -- because functions are qualified with filenames, a function -f will be listed as -version1/prog.c:f for the first version but -version2/prog.c:f for the second -version. - - -When this happens, you can use the option. -Its argument is a Perl search-and-replace expression that will be applied -to all the filenames in both Cachegrind output files. It can be used to -remove minor differences in filenames. For example, the option - will suffice for -this case. +In this case, use the option. Its argument is a +search-and-replace expression that will be applied to all the filenames in both +Cachegrind output files. It can be used to remove minor differences in +filenames. For example, the option + will suffice for the +above example. + Similarly, sometimes compilers auto-generate certain functions and give them -randomized names. For example, GCC sometimes auto-generates functions with -names like T.1234, and the suffixes vary from build to -build. You can use the option to remove -small differences like these; it works in the same way as -. +randomized names like T.1234 where the suffixes vary from +build to build. You can use the option to +remove small differences like these; it works in the same way as +. + + + +When is used to compare two different versions +of the same program, cg_annotate will not annotate any file that is different +between the two versions, because the per-line counts are not reliable in such +a case. For example, imagine if version2/prog.c is the +same as version1/prog.c except with an extra blank line at +the top of the file. Every single per-line count will have changed. 
In +comparison, the per-file and per-function counts have not changed, and are +still very useful for determining differences between programs. You might think +that this means every interesting file will be left unannotated, but again +inlining means that files that are identical in the two versions can have +different counts on many lines. + + + +Cache and Branch Simulation + + +Cachegrind can simulate how your program interacts with a machine's cache +hierarchy and/or branch predictor. + +The cache simulation models a machine with independent first-level instruction +and data caches (I1 and D1), backed by a unified second-level cache (L2). For +machines with three or four levels of cache (in the cases where Cachegrind can auto-detect the cache +configuration) Cachegrind simulates the first-level and last-level caches. +Therefore, Cachegrind always refers to the I1, D1 and LL (last-level) caches. + + + +When simulating the cache, with , Cachegrind +gathers the following statistics: + + + + + + I cache reads (Ir, which equals the number + of instructions executed), I1 cache read misses + (I1mr) and LL cache instruction read + misses (ILmr). + + + + + D cache reads (Dr, which equals the number + of memory reads), D1 cache read misses + (D1mr), and LL cache data read misses + (DLmr). + + + + + D cache writes (Dw, which equals the + number of memory writes), D1 cache write misses + (D1mw), and LL cache data write misses + (DLmw). + + + + + +Note that the total number of D1 misses is given by D1mr + +D1mw, and that the total number of LL misses is given by +ILmr + DLmr + +DLmw. + + + +When simulating the branch predictor, with , +Cachegrind gathers the following statistics: + + + + + + Conditional branches executed (Bc) and + conditional branches mispredicted (Bcm). + + + + + Indirect branches executed (Bi) and + indirect branches mispredicted (Bim). + + + + + +When cache and/or branch simulation is enabled, cg_annotate will print multiple +counts per line of output.
For example: + + + 8,547 (0.1%, 99.4%) 936 (0.1%, 99.1%) 177 (0.3%, 96.7%) 59 (0.0%, 99.9%) 38 (19.4%, 66.3%) strcmp: + 8,503 (0.1%) 928 (0.1%) 175 (0.3%) 59 (0.0%) 38 (19.4%) ./string/../sysdeps/x86_64/multiarch/../multiarch/strcmp-sse2.S +]]> + + - Cachegrind Command-line Options -Cachegrind-specific options are: + +Cachegrind-specific options are: + - + - + - Specify the size, associativity and line size of the level 1 - instruction cache. - - - - - - - - - Specify the size, associativity and line size of the level 1 - data cache. - - - - - - - - - Specify the size, associativity and line size of the last-level - cache. + + Write the Cachegrind output file to file rather than + to the default output file, + cachegrind.out.<pid>. The + and format specifiers can be used to embed the + process ID and/or the contents of an environment variable in the name, as + is the case for the core option + . + - + - Enables or disables collection of cache access and miss - counts. + + Enables or disables collection of cache access and miss counts. + @@ -835,29 +869,45 @@ small differences like these; it works in the same way as - Enables or disables collection of branch instruction and - misprediction counts. By default this is disabled as it - slows Cachegrind down by approximately 25%. Note that you - cannot specify - and - together, as that would leave Cachegrind with no - information to collect. + + Enables or disables collection of branch instruction and + misprediction counts. + - + - + - Write the profile data to - file rather than to the default - output file, - cachegrind.out.<pid>. The - and format specifiers - can be used to embed the process ID and/or the contents of an - environment variable in the name, as is the case for the core - option . + + Specify the size, associativity and line size of the level 1 instruction + cache. Only useful with . + + + + + + + + + + + Specify the size, associativity and line size of the level 1 data cache. + Only useful with . 
+ + + + + + + + + + + Specify the size, associativity and line size of the last-level cache. + Only useful with . @@ -895,29 +945,65 @@ small differences like these; it works in the same way as - + - Specifies which events to show (and the column - order). Default is to use all present in the - cachegrind.out.<pid> file (and - use the order in the file). Useful if you want to concentrate on, for - example, I cache misses (), or data - read misses (), or LL data misses - (). Best used in conjunction with - . + Diff two Cachegrind output files. - + - Specifies the events upon which the sorting of the - function-by-function entries will be based. + + Specifies a search-and-replace expression + that is applied to all filenames. Useful when differencing, for removing + minor differences in paths between two different versions of a program + that are sitting in different directories. An suffix + makes the regex case-insensitive, and a suffix makes + it match multiple times. + + + + + + + + + + + Like , but for function names. Useful for + removing minor differences in randomized names of auto-generated + functions generated by some compilers. + + + + + + + + + + + Specifies which events to show (and the column order). Default is to use + all present in the Cachegrind output file (and use the order in the + file). Best used in conjunction with . + + + + + + + + + + + Specifies the events upon which the sorting of the file:function and + function:file entries will be based. + @@ -926,18 +1012,12 @@ small differences like these; it works in the same way as - Sets the threshold for the function-by-function - summary. A function is shown if it accounts for more than X% - of the counts for the primary sort event. If auto-annotating, also - affects which files are annotated. - - Note: thresholds can be set for more than one of the - events by appending any events for the - option with a colon - and a number (no spaces, though). E.g.
if you want to see - each function that covers more than 1% of LL read misses or 1% of LL - write misses, use this option: - + + Sets the significance threshold for the file:function and function:file + sections. A file or function is shown if it accounts for more than X% of + the counts for the primary sort event. If annotating source files, this + also affects which files are annotated. + @@ -946,20 +1026,21 @@ small differences like these; it works in the same way as - When enabled, a percentage is printed next to all event counts. - This helps gauge the relative importance of each function and line. + + When enabled, a percentage is printed next to all event counts. This + helps gauge the relative importance of each function and line. - + - When enabled, automatically annotates every file that - is mentioned in the function-by-function summary that can be - found. Also gives a list of those that couldn't be found. + + Enables or disables source file annotation. + @@ -968,21 +1049,10 @@ small differences like these; it works in the same way as - Print N lines of context before and after each - annotated line. Avoids printing large sections of source - files that were not executed. Use a large number - (e.g. 100000) to show all source lines. - - - - - -