diff --git a/addrcheck/ac_main.html b/addrcheck/ac_main.html new file mode 100644 index 000000000..7aa2e9a52 --- /dev/null +++ b/addrcheck/ac_main.html @@ -0,0 +1,10 @@ + +
++ +
+Valgrind is licensed under the GNU General Public License,
+version 2
+An open-source tool for finding memory-management problems in
+Linux-x86 executables.
+
+ +
+ +Also, since one instruction cache read is performed per instruction executed, +you can find out how many instructions are executed per line, which can be +useful for traditional profiling and test coverage.
+ +Any feedback, bug-fixes, suggestions, etc, welcome. + + +
-g flag). But by contrast with normal
+Valgrind use, you probably do want to turn optimisation on, since you
+should profile your program as it will normally be run.
+
+The two steps are:
+valgrind --skin=cachegrind in front of
+ the normal command line invocation. When the program finishes,
+ Valgrind will print summary cache statistics. It also collects
+ line-by-line information in a file
+ cachegrind.out.pid, where pid
+ is the program's process id.
+ + This step should be done every time you want to collect + information about a new program, a changed program, or about the + same program with different input. +
+
--auto=yes option. You can annotate C/C++
+ files or assembly language files equally easily.
+
+ This step can be performed as many times as you like for each
+ Step 1. You may want to do multiple annotations showing
+ different information each time.
+
+ + +
+ +The more specific characteristics of the simulation are as follows. + +
+ +
+ +
+
--I1, --D1 and --L2 options.+ +Other noteworthy behaviour: + +
inc and
+ dec) are counted as doing just a read, ie. a single data
+ reference. This may seem strange, but since the write can never cause a
+ miss (the read guarantees the block is in the cache) it's not very
+ interesting.+ + Thus it measures not the number of times the data cache is accessed, but + the number of times a data cache miss could occur.
+
vg_cachesim_I1.c, vg_cachesim_D1.c,
+vg_cachesim_L2.c and vg_cachesim_gen.c. We'd be
+interested to hear from anyone who does.
+
+
+--skin=cachegrind
+option to the valgrind shell script. To gather cache profiling
+information about the program ls -l, type:
+
+valgrind --skin=cachegrind ls -l
+
+The program will execute (slowly). Upon completion, summary statistics
+that look like this will be printed:
+
++==31751== I refs: 27,742,716 +==31751== I1 misses: 276 +==31751== L2 misses: 275 +==31751== I1 miss rate: 0.0% +==31751== L2i miss rate: 0.0% +==31751== +==31751== D refs: 15,430,290 (10,955,517 rd + 4,474,773 wr) +==31751== D1 misses: 41,185 ( 21,905 rd + 19,280 wr) +==31751== L2 misses: 23,085 ( 3,987 rd + 19,098 wr) +==31751== D1 miss rate: 0.2% ( 0.1% + 0.4%) +==31751== L2d miss rate: 0.1% ( 0.0% + 0.4%) +==31751== +==31751== L2 misses: 23,360 ( 4,262 rd + 19,098 wr) +==31751== L2 miss rate: 0.0% ( 0.0% + 0.4%) ++ +Cache accesses for instruction fetches are summarised first, giving the +number of fetches made (this is the number of instructions executed, which +can be useful to know in its own right), the number of I1 misses, and the +number of L2 instruction (
L2i) misses.
+
+Cache accesses for data follow. The information is similar to that of the
+instruction fetches, except that the values are also shown split between reads
+and writes (note each row's rd and wr values add up
+to the row's total).
+ +Combined instruction and data figures for the L2 cache follow that.
+ + +
cachegrind.out.pid. This file is human-readable, but is
+best interpreted by the accompanying program cg_annotate,
+described in the next section.
+
+Things to note about the cachegrind.out.pid file:
+
valgrind --skin=cachegrind
+ is run, and will overwrite any existing
+ cachegrind.out.pid in the current directory (but
+ that won't happen very often because it takes some time for process ids
+ to be recycled).+
ls -l generates a file of about
+ 350KB. Browsing a few files and web pages with a Konqueror
+ built with full debugging information generates a file
+ of around 15 MB.cachegrind.out (i.e. no .pid suffix).
+The suffix serves two purposes. Firstly, it means you don't have to rename old
+log files that you don't want to overwrite. Secondly, and more importantly,
+it allows correct profiling with the --trace-children=yes option
+of programs that spawn child processes.
+
+
++ +The interesting cache-simulation specific options are: + +
--I1=<size>,<associativity>,<line_size>--D1=<size>,<associativity>,<line_size>--L2=<size>,<associativity>,<line_size>+ [default: uses CPUID for automagic cache configuration]
+
+ Manually specifies the I1/D1/L2 cache configuration, where
+ size and line_size are measured in bytes. The
+ three items must be comma-separated, but with no spaces, eg:
+
+
+ valgrind --skin=cachegrind --I1=65536,2,64
+
+
+ You can specify one, two or three of the I1/D1/L2 caches. Any level not
+ manually specified will be simulated using the configuration found in the
+ normal way (via the CPUID instruction, or failing that, via defaults).
+cg_annotate, it is worth widening your
+window to be at least 120 characters wide if possible, as the output
+lines can be quite long.
+
+To get a function-by-function summary, run cg_annotate
+--pid in a directory containing a
+cachegrind.out.pid file. The --pid
+is required so that cg_annotate knows which log file to use when
+several are present.
+
+The output looks like this: + +
+-------------------------------------------------------------------------------- +I1 cache: 65536 B, 64 B, 2-way associative +D1 cache: 65536 B, 64 B, 2-way associative +L2 cache: 262144 B, 64 B, 8-way associative +Command: concord vg_to_ucode.c +Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw +Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw +Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw +Threshold: 99% +Chosen for annotation: +Auto-annotation: on + +-------------------------------------------------------------------------------- +Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw +-------------------------------------------------------------------------------- +27,742,716 276 275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS + +-------------------------------------------------------------------------------- +Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function +-------------------------------------------------------------------------------- +8,821,482 5 5 2,242,702 1,621 73 1,794,230 0 0 getc.c:_IO_getc +5,222,023 4 4 2,276,334 16 12 875,959 1 1 concord.c:get_word +2,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp +2,521,927 2 2 591,215 0 0 179,398 0 0 concord.c:hash +2,242,740 2 2 1,046,612 568 22 448,548 0 0 ctype.c:tolower +1,496,937 4 4 630,874 9,000 1,400 279,388 0 0 concord.c:insert + 897,991 51 51 897,831 95 30 62 1 1 ???:??? 
+ 598,068 1 1 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__flockfile + 598,068 0 0 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__funlockfile + 598,024 4 4 213,580 35 16 149,506 0 0 vg_clientmalloc.c:malloc + 446,587 1 1 215,973 2,167 430 129,948 14,057 13,957 concord.c:add_existing + 341,760 2 2 128,160 0 0 128,160 0 0 vg_clientmalloc.c:vg_trap_here_WRAPPER + 320,782 4 4 150,711 276 0 56,027 53 53 concord.c:init_hash_table + 298,998 1 1 106,785 0 0 64,071 1 1 concord.c:create + 149,518 0 0 149,516 0 0 1 0 0 ???:tolower@@GLIBC_2.0 + 149,518 0 0 149,516 0 0 1 0 0 ???:fgetc@@GLIBC_2.0 + 95,983 4 4 38,031 0 0 34,409 3,152 3,150 concord.c:new_word_node + 85,440 0 0 42,720 0 0 21,360 0 0 vg_clientmalloc.c:vg_bogus_epilogue ++ +First up is a summary of the annotation options: + +
+ +
+ +
+
Ir : I cache reads (ie. instructions executed)I1mr: I1 cache read missesI2mr: L2 cache instruction read missesDr : D cache reads (ie. memory reads)D1mr: D1 cache read missesD2mr: L2 cache data read missesDw : D cache writes (ie. memory writes)D1mw: D1 cache write missesD2mw: L2 cache data write misses
+ Note that D1 total misses is given by D1mr +
+ D1mw, and that L2 total misses is given by
+ I2mr + D2mr + D2mw.
+ +
--show option.+ +
Ir counts to lowest. If two functions have identical
+ Ir counts, they will then be sorted by I1mr
+ counts, and so on. This order can be adjusted with the
+ --sort option.
+
+ Note that this dictates the order the functions appear. It is not
+ the order in which the columns appear; that is dictated by the "events
+ shown" line (and can be changed with the --show option).
+
+ +
cg_annotate by default omits functions
+ that cause very low numbers of misses to avoid drowning you in
+ information. In this case, cg_annotate shows summaries of the
+ functions that account for 99% of the Ir counts;
+ Ir is chosen as the threshold event since it is the
+ primary sort event. The threshold can be adjusted with the
+ --threshold option.+ +
+ +
--auto=yes option. In this case no.+
valgrind --skin=cachegrind.
+
+Then follows function-by-function statistics. Each function is
+identified by a file_name:function_name pair. If a column
+contains only a dot it means the function never performs
+that event (eg. the third row shows that strcmp()
+contains no instructions that write to memory). The name
+??? is used if the file name and/or function name
+could not be determined from debugging information. If most of the
+entries have the form ???:??? the program probably wasn't
+compiled with -g. If any code was invalidated (either due to
+self-modifying code or unloading of shared objects) its counts are aggregated
+into a single cost centre written as (discarded):(discarded).
+ +It is worth noting that functions will come from three types of source files: +
concord.c in this example).getc.c)vg_clientmalloc.c:malloc). These are recognisable because
+ the filename begins with vg_, and is probably one of
+ vg_main.c, vg_clientmalloc.c or
+ vg_mylibc.c.
+ --auto=yes option. To do it
+manually, just specify the filenames as arguments to
+cg_annotate. For example, running
+cg_annotate concord.c for our example produces the same
+output as above followed by an annotated version of
+concord.c, a section of which looks like:
+
+
+--------------------------------------------------------------------------------
+-- User-annotated source: concord.c
+--------------------------------------------------------------------------------
+Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
+
+[snip]
+
+ . . . . . . . . . void init_hash_table(char *file_name, Word_Node *table[])
+ 3 1 1 . . . 1 0 0 {
+ . . . . . . . . . FILE *file_ptr;
+ . . . . . . . . . Word_Info *data;
+ 1 0 0 . . . 1 1 1 int line = 1, i;
+ . . . . . . . . .
+ 5 0 0 . . . 3 0 0 data = (Word_Info *) create(sizeof(Word_Info));
+ . . . . . . . . .
+ 4,991 0 0 1,995 0 0 998 0 0 for (i = 0; i < TABLE_SIZE; i++)
+ 3,988 1 1 1,994 0 0 997 53 52 table[i] = NULL;
+ . . . . . . . . .
+ . . . . . . . . . /* Open file, check it. */
+ 6 0 0 1 0 0 4 0 0 file_ptr = fopen(file_name, "r");
+ 2 0 0 1 0 0 . . . if (!(file_ptr)) {
+ . . . . . . . . . fprintf(stderr, "Couldn't open '%s'.\n", file_name);
+ 1 1 1 . . . . . . exit(EXIT_FAILURE);
+ . . . . . . . . . }
+ . . . . . . . . .
+ 165,062 1 1 73,360 0 0 91,700 0 0 while ((line = get_word(data, line, file_ptr)) != EOF)
+ 146,712 0 0 73,356 0 0 73,356 0 0 insert(data->word, data->line, table);
+ . . . . . . . . .
+ 4 0 0 1 0 0 2 0 0 free(data);
+ 4 0 0 1 0 0 2 0 0 fclose(file_ptr);
+ 3 0 0 2 0 0 . . . }
+
+
+(Although column widths are automatically minimised, a wide terminal is clearly
+useful.)
+
+Each source file is clearly marked (User-annotated source) as
+having been chosen manually for annotation. If the file was found in one of
+the directories specified with the -I/--include
+option, the directory and file are both given.
+ +Each line is annotated with its event counts. Events not applicable for a line +are represented by a `.'; this is useful for distinguishing between an event +which cannot happen, and one which can but did not.
+ +Sometimes only a small section of a source file is executed. To minimise +uninteresting output, Valgrind only shows annotated lines and lines within a +small distance of annotated lines. Gaps are marked with the line numbers so +you know which part of a file the shown code comes from, eg: + +
+(figures and code for line 704) +-- line 704 ---------------------------------------- +-- line 878 ---------------------------------------- +(figures and code for line 878) ++ +The amount of context to show around annotated lines is controlled by the +
--context option.
+
+To get automatic annotation, run cg_annotate --auto=yes.
+cg_annotate will automatically annotate every source file it can find that is
+mentioned in the function-by-function summary. Therefore, the files chosen for
+auto-annotation are affected by the --sort and
+--threshold options. Each source file is clearly marked
+(Auto-annotated source) as being chosen automatically. Any files
+that could not be found are mentioned at the end of the output, eg:
+
+
+-------------------------------------------------------------------------------- +The following files chosen for auto-annotation could not be found: +-------------------------------------------------------------------------------- + getc.c + ctype.c + ../sysdeps/generic/lockfile.c ++ +This is quite common for library files, since libraries are usually compiled +with debugging information, but the source files are often not present on a +system. If a file is chosen for annotation both manually and +automatically, it is marked as
User-annotated source.
+
+Use the -I/--include option to tell Valgrind where to look for
+source files if the filenames found from the debugging information aren't
+specific enough.
+
+Beware that cg_annotate can take some time to digest large
+cachegrind.out.pid files, e.g. 30 seconds or more. Also
+beware that auto-annotation can produce a lot of output if your program is
+large!
+
+
+
+
+To do this, you just need to assemble your .s files with
+assembler-level debug information. gcc doesn't do this, but you can
+use the GNU assembler with the --gstabs option to
+generate object files with this information, eg:
+
+
as --gstabs foo.s
+
+You can then profile and annotate source files in the same way as for C/C++
+programs.
+
+
+cg_annotate options--pid
+
+ Indicates which cachegrind.out.pid file to read.
+ Not actually an option -- it is required.
+
+
-h, --help+
-v, --version+ + Help and version, as usual.
--sort=A,B,C [default: order in
+ cachegrind.out.pid]
+ Specifies the events upon which the sorting of the function-by-function
+ entries will be based. Useful if you want to concentrate on eg. I cache
+ misses (--sort=I1mr,I2mr), or D cache misses
+ (--sort=D1mr,D2mr), or L2 misses
+ (--sort=D2mr,I2mr).
+ +
--show=A,B,C [default: all, using order in
+ cachegrind.out.pid]
+ Specifies which events to show (and the column order). Default is to use
+ all present in the cachegrind.out.pid file (and use
+ the order in the file).
+ +
--threshold=X [default: 99%]
+ Sets the threshold for the function-by-function summary. Functions are
+ shown that account for more than X% of the primary sort event. If
+ auto-annotating, also affects which files are annotated.
+
+ Note: thresholds can be set for more than one of the events by appending
+ any events for the --sort option with a colon and a number
+ (no spaces, though). E.g. if you want to see the functions that cover
+ 99% of L2 read misses and 99% of L2 write misses, use this option:
+
+
--sort=D2mr:99,D2mw:99
+ + +
--auto=no [default]--auto=yes + When enabled, automatically annotates every file that is mentioned in the + function-by-function summary that can be found. Also gives a list of + those that couldn't be found. + +
--context=N [default: 8]+ Print N lines of context before and after each annotated line. Avoids + printing large sections of source files that were not executed. Use a + large number (eg. 10,000) to show all source lines. +
+ +
-I=<dir>, --include=<dir>
+ [default: empty string]+ Adds a directory to the list in which to search for files. Multiple + -I/--include options can be given to add multiple directories. +
cachegrind.out.pid file. This is because the
+ information in cachegrind.out.pid is only recorded
+ with line numbers, so if the line numbers change at all in the source
+ (eg. lines added, deleted, swapped), any annotations will be
+ incorrect.+ +
cachegrind.out.pid file. If this
+ happens, the figures for the bogus lines are printed anyway (clearly
+ marked as bogus) in case they are important.+
+ 1 0 0 . . . . . . leal -12(%ebp),%eax + 1 0 0 . . . 1 0 0 movl %eax,84(%ebx) + 2 0 0 0 0 0 1 0 0 movl $1,-20(%ebp) + . . . . . . . . . .align 4,0x90 + 1 0 0 . . . . . . movl $.LnrB,%eax + 1 0 0 . . . 1 0 0 movl %eax,-16(%ebp) ++ + How can the third instruction be executed twice when the others are + executed only once? As it turns out, it isn't. Here's a dump of the + executable, using
objdump -d:
+
+ + 8048f25: 8d 45 f4 lea 0xfffffff4(%ebp),%eax + 8048f28: 89 43 54 mov %eax,0x54(%ebx) + 8048f2b: c7 45 ec 01 00 00 00 movl $0x1,0xffffffec(%ebp) + 8048f32: 89 f6 mov %esi,%esi + 8048f34: b8 08 8b 07 08 mov $0x8078b08,%eax + 8048f39: 89 45 f0 mov %eax,0xfffffff0(%ebp) ++ + Notice the extra
mov %esi,%esi instruction. Where did this
+ come from? The GNU assembler inserted it to serve as the two bytes of
+ padding needed to align the movl $.LnrB,%eax instruction on
+ a four-byte boundary, but pretended it didn't exist when adding debug
+ information. Thus when Valgrind reads the debug info it thinks that the
+ movl $0x1,0xffffffec(%ebp) instruction covers the address
+range 0x8048f2b--0x8048f33 by itself, and attributes the counts for the
+ mov %esi,%esi to it.+
inline_me() is defined in
+ foo.h and inlined in the functions f1(),
+ f2() and f3() in bar.c, there will
+ not be a foo.h:inline_me() function entry. Instead, there
+ will be separate function entries for each inlining site, ie.
+ foo.h:f1(), foo.h:f2() and
+ foo.h:f3(). To find the total counts for
+ foo.h:inline_me(), add up the counts from each entry.
+
+ The reason for this is that although the debug info output by gcc
+ indicates the switch from bar.c to foo.h, it
+ doesn't indicate the name of the function in foo.h, so
+ Valgrind keeps using the old one.
+ +
/home/user/proj/proj.h and ../proj.h. In this
+ case, if you use auto-annotation, the file will be annotated twice with
+ the counts split between the two.+
struct
+ nlist defined in a.out.h under Linux is only a 16-bit
+ value. Valgrind can handle some files with more than 65,535 lines
+ correctly by making some guesses to identify line number overflows. But
+ some cases are beyond it, in which case you'll get a warning message
+ explaining that annotations for the file might be incorrect.+
-g and some without, some
+ events that take place in a file without debug info could be attributed
+ to the last line of a file with debug info (whichever one gets placed
+ before the non-debug-info file in the executable).+
+
+Note: stabs is not an easy format to read. If you come across bizarre
+annotations that look like they might be caused by a bug in the stabs reader,
+please let us know.
+ + +
+ +
+ +
+ +
+ +
malloc() will allocate memory in different
+ ways to the standard malloc(), which could warp the results.
+ + +
+ +
bts, btr and btc
+ will incorrectly be counted as doing a data read if both the arguments
+ are registers, eg:
+
+ btsl %eax, %edx
+
+ This should only happen rarely.
+ + +
fsave) are treated as though they only access 16 bytes.
+ These instructions seem to be rare so hopefully this won't affect
+ accuracy much.
+ +
valgrind.so file, the size of the program being
+profiled, or even the length of its name can perturb the results. Variations
+will be small, but don't expect perfectly repeatable results if your program
+changes at all.+ +While these factors mean you shouldn't trust the results to be super-accurate, +hopefully they should be close enough to be useful.
+ + +
+
+jseward@acm.org
+http://developer.kde.org/~sewardj
+Copyright © 2000-2002 Julian Seward
+
+Valgrind is licensed under the GNU General Public License,
+version 2
+An open-source tool for finding memory-management problems in
+x86 GNU/Linux executables.
+
+ + + + +
LOAD, STORE,
+FPU_R and FPU_W. By contrast, because of the x86
+addressing modes, almost every instruction can read or write memory.
+
+Most of the cache profiling machinery is in the file
+vg_cachesim.c.
+ +These notes are a somewhat haphazard guide to how Valgrind's cache profiling +works.
+ +
iCC), and one for instructions that do
+(idCC):
+
+
+typedef struct _CC {
+ ULong a;
+ ULong m1;
+ ULong m2;
+} CC;
+
+typedef struct _iCC {
+ /* word 1 */
+ UChar tag;
+ UChar instr_size;
+
+ /* words 2+ */
+ Addr instr_addr;
+ CC I;
+} iCC;
+
+typedef struct _idCC {
+ /* word 1 */
+ UChar tag;
+ UChar instr_size;
+ UChar data_size;
+
+ /* words 2+ */
+ Addr instr_addr;
+ CC I;
+ CC D;
+} idCC;
+
+
+Each CC has three fields a, m1,
+m2 for recording references, level 1 misses and level 2 misses.
+Each of these is a 64-bit ULong -- the numbers can get very large,
+ie. greater than the 4.2 billion allowed by a 32-bit unsigned int.
+
+An iCC has one CC for instruction cache accesses. An
+idCC has two, one for instruction cache accesses, and one for data
+cache accesses.
+
+The iCC and idCC structs also store unchanging
+information about the instruction:
+
+
+
idCC only)+
+
idCC. This is
+because for many memory-referencing instructions the data address can change
+each time it's executed (eg. if it uses register-offset addressing). We have
+to give this item to the cache simulation in a different way (see
+Instrumentation section below). Some memory-referencing instructions do always
+reference the same address, but we don't try to treat them specially in order to
+keep things simple.
+
+Also note that there is only room for recording info about one data cache
+access in an idCC. So what about instructions that do a read then
+a write, such as:
+
+
+inc (%esi)
+
+In a write-allocate cache, as simulated by Valgrind, the write cannot miss,
+since it immediately follows the read which will drag the block into the cache
+if it's not already there. So the write access isn't really interesting, and
+Valgrind doesn't record it. This means that Valgrind doesn't measure
+memory references, but rather memory references that could miss in the cache.
+This behaviour is the same as that used by the AMD Athlon hardware counters.
+It also has the benefit of simplifying the implementation -- instructions that
+read and write memory can be treated like instructions that read memory.+ +
+
+Valgrind does JIT translations at the basic block level, and cost centres are
+also set up and stored at the basic block level. By doing things carefully, we
+store all the cost centres for a basic block in a contiguous array, and lookup
+comes almost for free.
+ +Consider this part of a basic block (for exposition purposes, pretend it's an +entire basic block): + +
+movl $0x0,%eax +movl $0x99, -4(%ebp) ++ +The translation to UCode looks like this: + +
+MOVL $0x0, t20 +PUTL t20, %EAX +INCEIPo $5 + +LEA1L -4(t4), t14 +MOVL $0x99, t18 +STL t18, (t14) +INCEIPo $7 ++ +The first step is to allocate the cost centres. This requires a preliminary +pass to count how many x86 instructions were in the basic block, and their +types (and thus sizes). UCode translations for single x86 instructions are +delimited by the
INCEIPo instruction, the argument of which gives
+the byte size of the instruction (note that lazy INCEIP updating is turned off
+to allow this).
+
+We can tell if an x86 instruction references memory by looking for
+LDL and STL UCode instructions, and thus what kind of
+cost centre is required. From this we can determine how many cost centres we
+need for the basic block, and their sizes. We can then allocate them in a
+single array.
+
+Consider the example code above. After the preliminary pass, we know we need
+two cost centres, one iCC and one idCC. So we
+allocate an array to store these which looks like this:
+
+
+
+|(uninit)| tag          (1 byte)
+|(uninit)| instr_size   (1 byte)
+|(uninit)| (padding)    (2 bytes)
+|(uninit)| instr_addr   (4 bytes)
+|(uninit)| I.a          (8 bytes)
+|(uninit)| I.m1         (8 bytes)
+|(uninit)| I.m2         (8 bytes)
+
+|(uninit)| tag          (1 byte)
+|(uninit)| instr_size   (1 byte)
+|(uninit)| data_size    (1 byte)
+|(uninit)| (padding)    (1 byte)
+|(uninit)| instr_addr   (4 bytes)
+|(uninit)| I.a          (8 bytes)
+|(uninit)| I.m1         (8 bytes)
+|(uninit)| I.m2         (8 bytes)
+|(uninit)| D.a          (8 bytes)
+|(uninit)| D.m1         (8 bytes)
+|(uninit)| D.m2         (8 bytes)
+
+
+(We can see now why we need tags to distinguish between the two types of cost
+centres.)
+ +We also record the size of the array. We look up the debug info of the first +instruction in the basic block, and then stick the array into a table indexed +by filename and function name. This makes it easy to dump the information +quickly to file at the end.
+ +
+
+
+|INSTR_CC| tag          (1 byte)
+|5       | instr_size   (1 byte)
+|(uninit)| (padding)    (2 bytes)
+|i_addr1 | instr_addr   (4 bytes)
+|0       | I.a          (8 bytes)
+|0       | I.m1         (8 bytes)
+|0       | I.m2         (8 bytes)
+
+|WRITE_CC| tag          (1 byte)
+|7       | instr_size   (1 byte)
+|4       | data_size    (1 byte)
+|(uninit)| (padding)    (1 byte)
+|i_addr2 | instr_addr   (4 bytes)
+|0       | I.a          (8 bytes)
+|0       | I.m1         (8 bytes)
+|0       | I.m2         (8 bytes)
+|0       | D.a          (8 bytes)
+|0       | D.m1         (8 bytes)
+|0       | D.m2         (8 bytes)
+
+
+(Note that this step is not performed if a basic block is re-translated; see
+here for more information.)
+
+GCC inserts padding before the instr_addr field so that it is word
+aligned.
+ +The instrumentation added to call the cache simulation function looks like this +(instrumentation is indented to distinguish it from the original UCode): + +
+MOVL $0x0, t20 +PUTL t20, %EAX + PUSHL %eax + PUSHL %ecx + PUSHL %edx + MOVL $0x4091F8A4, t46 # address of 1st CC + PUSHL t46 + CALLMo $0x12 # second cachesim function + CLEARo $0x4 + POPL %edx + POPL %ecx + POPL %eax +INCEIPo $5 + +LEA1L -4(t4), t14 +MOVL $0x99, t18 + MOVL t14, t42 +STL t18, (t14) + PUSHL %eax + PUSHL %ecx + PUSHL %edx + PUSHL t42 + MOVL $0x4091F8C4, t44 # address of 2nd CC + PUSHL t44 + CALLMo $0x13 # second cachesim function + CLEARo $0x8 + POPL %edx + POPL %ecx + POPL %eax +INCEIPo $7 ++ +Consider the first instruction's UCode. Each call is surrounded by three +
PUSHL and POPL instructions to save and restore the
+caller-save registers. Then the address of the instruction's cost centre is
+pushed onto the stack, to be the first argument to the cache simulation
+function. The address is known at this point because we are doing a
+simultaneous pass through the cost centre array. This means the cost centre
+lookup for each instruction is almost free (just the cost of pushing an
+argument for a function call). Then the call to the cache simulation function
+for non-memory-reference instructions is made (note that the
+CALLMo UInstruction takes an offset into a table of predefined
+functions; it is not an absolute address), and the single argument is
+CLEARed from the stack.
+
+The second instruction's UCode is similar. The only difference is that, as
+mentioned before, we have to pass the address of the data item referenced to
+the cache simulation function too. This explains the MOVL t14,
+t42 and PUSHL t42 UInstructions. (Note that the seemingly
+redundant MOVing will probably be optimised away during register
+allocation.)
+ +Note that instead of storing unchanging information about each instruction +(instruction size, data size, etc) in its cost centre, we could have passed in +these arguments to the simulation function. But this would slow the calls down +(two or three extra arguments pushed onto the stack). Also it would bloat the +UCode instrumentation by amounts similar to the space required for them in the +cost centre; bloated UCode would also fill the translation cache more quickly, +requiring more translations for large programs and slowing them down more.
+ +However, we can't use this approach for profiling -- we can't throw away cost +centres for instructions in the middle of execution! So when a basic block is +translated, we first look for its cost centre array in the hash table. If +there is no cost centre array, it must be the first translation, so we proceed +as described above. But if there is a cost centre array already, it must be a +retranslation. In this case, we skip the cost centre allocation and +initialisation steps, but still do the UCode instrumentation step.
+ +
+
+The interface to the simulation is quite clean. The functions called from the
+UCode contain calls to the simulation functions in the files
+vg_cachesim_{I1,D1,L2}.c; these calls are inlined so that only
+one function call is done per simulated x86 instruction. The file
+vg_cachesim.c simply #includes the three files
+containing the simulation, which makes plugging in new cache simulations
+very easy -- you just replace the three files and recompile.
+ +
+ +Input file has the following format: + +
+file ::= desc_line* cmd_line events_line data_line+ summary_line
+desc_line ::= "desc:" ws? non_nl_string
+cmd_line ::= "cmd:" ws? cmd
+events_line ::= "events:" ws? (event ws)+
+data_line ::= file_line | fn_line | count_line
+file_line ::= ("fl=" | "fi=" | "fe=") filename
+fn_line ::= "fn=" fn_name
+count_line ::= line_num ws? (count ws)+
+summary_line ::= "summary:" ws? (count ws)+
+count ::= num | "."
+
+
+Where:
+
+non_nl_string is any string not containing a newline.+
cmd is a command line invocation.+
filename and fn_name can be anything.+
num and line_num are decimal numbers.+
ws is whitespace.+
nl is a newline.+
+ +Counts can be "." to represent "N/A", eg. the number of write misses for an +instruction that doesn't write to memory.
+
+The number of counts in each line and the
+summary_line should not exceed the number of events in the
+events_line. If the number in each line is less,
+cg_annotate treats those missing as though they were a "." entry.
+
+A file_line changes the current file name. A fn_line
+changes the current function name. A count_line contains counts
+that pertain to the current filename/fn_name. A "fl=" file_line
+and a fn_line must appear before any count_lines to
+give the context of the first count_lines.
+
+Each file_line should be immediately followed by a
+fn_line. "fi=" file_lines are used to switch
+filenames for inlined functions; "fe=" file_lines are similar, but
+are put at the end of a basic block in which the file name hasn't been switched
+back to the original file name. (fi and fe lines behave the same, they are
+only distinguished to help debugging.)
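As an illustration of this grammar (with invented filenames and counts, not output from a real run), a minimal file might look like this; note that each count_line starts with a line number followed by one count per event:

```
desc: I1 cache: 65536 B, 64 B, 2-way associative
cmd: concord vg_to_ucode.c
events: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
fl=concord.c
fn=hash
25 10 1 1 2 . . . . .
30 6 0 0 4 1 1 2 0 0
fn=insert
42 7 0 0 3 1 1 2 0 0
summary: 23 1 1 9 2 2 4 0 0
```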
+ + +
+ +
+ +
+ +
+ +
cachegrind.out output files can contain huge amounts of
+ information; the file format was carefully chosen to minimise file
+ sizes.
+
+ information; file format was carefully chosen to minimise file
+ sizes.+
+
+In particular, cg_annotate would not need to change -- the file format is such
+that it is not specific to the cache simulation, but could be used for any kind
+of line-by-line information. The only part of cg_annotate that is specific to
+the cache simulation is the name of the input file
+(cachegrind.out), although it would be very simple to add an
+option to control this.
+ + + diff --git a/coregrind/coregrind_core.html b/coregrind/coregrind_core.html new file mode 100644 index 000000000..7e6083636 --- /dev/null +++ b/coregrind/coregrind_core.html @@ -0,0 +1,1270 @@ + + + +
valgrind at the start of the command line
+normally used to run the program, and tell it what skin you want to
+use.
+
+
+So, for example, if you want to run the command ls -l
+using the heavyweight memory-checking tool, issue the command:
+valgrind --skin=memcheck ls -l. The --skin=
+parameter tells the core which skin is to be used.
+
+
+To preserve compatibility with the 1.0.X series, if you do not specify
+a skin, the default is to use the memcheck skin. That means the above
+example simplifies to: valgrind ls -l.
+
+
Regardless of which skin is in use, Valgrind takes control of your +program before it starts. Debugging information is read from the +executable and associated libraries, so that error messages can be +phrased in terms of source code locations (if that is appropriate). + +
+Your program is then run on a synthetic x86 CPU provided by the +valgrind core. As new code is executed for the first time, the core +hands the code to the selected skin. The skin adds its own +instrumentation code to this and hands the result back to the core, +which coordinates the continued execution of this instrumented code. + +
+The amount of instrumentation code added varies widely between skins. +At one end of the scale, the memcheck skin adds code to check every +memory access and every value computed, increasing the size of the +code at least 12 times, and making it run 25-50 times slower than +natively. At the other end of the spectrum, the ultra-trivial "none" +skin adds no instrumentation at all and causes in total "only" about a +4 times slowdown. + +
+Valgrind simulates every single instruction your program executes.
+Because of this, the active skin checks, or profiles, not only the
+code in your application but also in all supporting dynamically-linked
+(.so-format) libraries, including the GNU C library, the
+X client libraries, Qt, if you work with KDE, and so on.
+
+
+If -- as is usually the case -- you're using one of the +error-detection skins, valgrind will often detect errors in +libraries, for example the GNU C or X11 libraries, which you have to +use. Since you're probably using valgrind to debug your own +application, and not those libraries, you don't want to see those +errors and probably can't fix them anyway. + +
+So, rather than swamping you with errors in which you are not +interested, Valgrind allows you to selectively suppress errors, by +recording them in a suppressions file which is read when Valgrind +starts up. The build mechanism attempts to select suppressions which +give reasonable behaviour for the libc and XFree86 versions detected +on your machine. + +
+Different skins report different kinds of errors. The suppression +mechanism therefore allows you to say which skin or skin(s) each +suppression applies to. + + + + +
-g flag). Without debugging info, the best valgrind
+will be able to do is guess which function a particular piece of code
+belongs to, which makes both error messages and profiling output
+nearly useless. With -g, you'll potentially get messages
+which point directly to the relevant source code lines.
+
++You don't have to do this, but doing so helps Valgrind produce more +accurate and less confusing error reports. Chances are you're set up +like this already, if you intended to debug your program with GNU gdb, +or some other debugger. + +
+This paragraph applies only if you plan to use the memcheck
+skin (which is the default). On rare occasions, optimisation levels
+at -O2 and above have been observed to generate code which
+fools memcheck into wrongly reporting uninitialised value
+errors. We have looked in detail into fixing this, and unfortunately
+the result is that doing so would give a further significant slowdown
+in what is already a slow skin. So the best solution is to turn off
+optimisation altogether. Since this often makes things unmanageably
+slow, a plausible compromise is to use -O. This gets
+you the majority of the benefits of higher optimisation levels whilst
+keeping relatively small the chances of false complaints from memcheck.
+All other skins (as far as we know) are unaffected by optimisation
+level.
+
+
+Valgrind understands both the older "stabs" debugging format, used by +gcc versions prior to 3.1, and the newer DWARF2 format used by gcc 3.1 +and later. We continue to refine and debug our debug-info readers, +although the majority of effort will naturally enough go into the +newer DWARF2 reader. + +
+Then just run your application, but place valgrind
+--skin=the-selected-skin in front of your usual command-line
+invocation. Note that you should run the real (machine-code)
+executable here. If your application is started by, for example, a
+shell or perl script, you'll need to modify it to invoke Valgrind on
+the real executables. Running such scripts directly under Valgrind
+will result in you getting error reports pertaining to
+/bin/sh, /usr/bin/perl, or whatever
+interpreter you're using. This almost certainly isn't what you want
+and can be confusing. You can probably force the issue by
+giving the flag --trace-children=yes, but confusion is
+still highly likely.
+
+
+
+
+ ==12345== some-message-from-Valgrind ++ +
The 12345 is the process ID. This scheme makes it easy
+to distinguish program output from Valgrind commentary, and also easy
+to differentiate commentaries from different processes which have
+become merged together, for whatever reason.
+
+
By default, Valgrind writes only essential messages to the commentary,
+so as to avoid flooding you with information of secondary importance.
+If you want more information about what is happening, re-run, passing
+the -v flag to Valgrind.
+
+
+Version 2 of valgrind gives significantly more flexibility than 1.0.X +does about where that stream is sent to. You have three options: + +
--logfile-fd=9.
++
--logfile=filename. Note
+ carefully that the commentary is not written to the file
+ you specify, but instead to one called
+ filename.pid12345, if for example the pid of the
+ traced process is 12345. This is helpful when valgrinding a whole
+ tree of processes at once, since it means that each process writes
+ to its own logfile, rather than the result being jumbled up in one
+ big logfile.
++
--logsocket=192.168.0.1:12345 if you
+ want to send the output to host IP 192.168.0.1 port 12345 (I have
+ no idea if 12345 is a port of pre-existing significance). You can
+ also omit the port number: --logsocket=192.168.0.1,
+ in which case a default port of 1500 is used. This default is
+ defined by the constant VG_CLO_DEFAULT_LOGPORT
+ in the sources.
+ + Note, unfortunately, that you have to use an IP address here -- + for technical reasons, valgrind's core itself can't use the GNU C + library, and this makes it difficult to do hostname-to-IP lookups. +
+ Writing to a network socket is pretty useless if you don't have
+ something listening at the other end. We provide a simple
+ listener program, valgrind-listener, which accepts
+ connections on the specified port and copies whatever it is sent
+ to stdout. Probably someone will tell us this is a horrible
+ security risk. It seems likely that people will write more
+ sophisticated listeners in the fullness of time.
+
+ valgrind-listener can accept simultaneous connections from up to 50 + valgrinded processes. In front of each line of output it prints + the current number of active connections in round brackets. +
+ valgrind-listener accepts two command-line flags: +
-e or --exit-at-zero: when the
+ number of connected processes falls back to zero, exit.
+ Without this, it will run forever, that is, until you send it
+ Control-C.
+ +
portnumber: changes the port it listens on from
+ the default (1500). The specified port must be in the range
+ 1024 to 65535. The same restriction applies to port numbers
+ specified by a --logsocket= to valgrind itself.
+ + If a valgrinded process fails to connect to a listener, for + whatever reason (the listener isn't running, invalid or + unreachable host or port, etc), valgrind switches back to writing + the commentary to stderr. The same goes for any process which + loses an established connection to a listener. In other words, + killing the listener doesn't kill the processes sending data to + it. +
+Here is an important point about the relationship between the
+commentary and profiling output from skins. The commentary contains a
+mix of messages from the valgrind core and the selected skin. If the
+skin reports errors, it will report them to the commentary. However,
+if the skin does profiling, the profile data will be written to a file
+of some kind, depending on the skin, and independent of what
+--log* options are in force. The commentary is intended
+to be a low-bandwidth, human-readable channel. Profiling data, on the
+other hand, is usually voluminous and not meaningful without further
+processing, which is why we have chosen this arrangement.
+
+
+
+
+ ==25832== Invalid read of size 4 + ==25832== at 0x8048724: BandMatrix::ReSize(int, int, int) (bogon.cpp:45) + ==25832== by 0x80487AF: main (bogon.cpp:66) + ==25832== by 0x40371E5E: __libc_start_main (libc-start.c:129) + ==25832== by 0x80485D1: (within /home/sewardj/newmat10/bogon) + ==25832== Address 0xBFFFF74C is not stack'd, malloc'd or free'd ++ +
+This message says that the program did an illegal 4-byte read of
+address 0xBFFFF74C, which, as far as memcheck can tell, is not a valid
+stack address, nor corresponds to any currently malloc'd or free'd
+blocks. The read is happening at line 45 of bogon.cpp,
+called from line 66 of the same file, etc. For errors associated with
+an identified malloc'd/free'd block, for example reading free'd
+memory, Valgrind reports not only the location where the error
+happened, but also where the associated block was malloc'd/free'd.
+
+
+Valgrind remembers all error reports. When an error is detected, +it is compared against old reports, to see if it is a duplicate. If +so, the error is noted, but no further commentary is emitted. This +avoids you being swamped with bazillions of duplicate error reports. + +
+If you want to know how many times each error occurred, run with the
+-v option. When execution finishes, all the reports are
+printed out, along with, and sorted by, their occurrence counts. This
+makes it easy to see which errors have occurred most frequently.
+
+
+Errors are reported before the associated operation actually happens. +If you're using a skin (memcheck, addrcheck) which does address +checking, and your program attempts to read from address zero, the +skin will emit a message to this effect, and the program will then +duly die with a segmentation fault. + +
+In general, you should try and fix errors in the order that they are +reported. Not doing so can be confusing. For example, a program +which copies uninitialised values to several memory locations, and +later uses them, will generate several error messages, when run on +memcheck. The first such error message may well give the most direct +clue to the root cause of the problem. + +
+The process of detecting duplicate errors is quite an expensive one
+and can become a significant performance overhead if your program
+generates huge quantities of errors. To avoid serious problems here,
+Valgrind will simply stop collecting errors after 300 different errors,
+or 30000 errors in total, have been seen. In this
+situation you might as well stop your program and fix it, because
+Valgrind won't tell you anything else useful after this. Note that
+the 300/30000 limits apply after suppressed errors are removed. These
+limits are defined in vg_include.h and can be increased
+if necessary.
+
+
+To avoid this cutoff you can use the --error-limit=no
+flag. Then valgrind will always show errors, regardless of how many
+there are. Use this flag carefully, since it may have a dire effect
+on performance.
+
+
+
+
./configure script when the system is built.
+
++You can modify and add to the suppressions file at your leisure, +or, better, write your own. Multiple suppression files are allowed. +This is useful if part of your project contains errors you can't or +don't want to fix, yet you don't want to continuously be reminded of +them. + +
+Each error to be suppressed is described very specifically, to
+minimise the possibility that a suppression-directive inadvertently
+suppresses a bunch of similar errors which you did want to see. The
+suppression mechanism is designed to allow precise yet flexible
+specification of errors to suppress.
+If you use the -v flag, at the end of execution, Valgrind
+prints out one line for each used suppression, giving its name and the
+number of times it got used. Here are the suppressions used by a run of
+valgrind --skin=memcheck ls -l:
+
+ --27579-- supp: 1 socketcall.connect(serv_addr)/__libc_connect/__nscd_getgrgid_r + --27579-- supp: 1 socketcall.connect(serv_addr)/__libc_connect/__nscd_getpwuid_r + --27579-- supp: 6 strrchr/_dl_map_object_from_fd/_dl_map_object ++ + + +
+ valgrind [options-for-Valgrind] your-prog [options for your-prog] ++ +
Note that Valgrind also reads options from the environment variable
+$VALGRIND_OPTS, and processes them before the command-line
+options. Options for the valgrind core may be freely mixed with those
+for the selected skin.
+
+
Valgrind's default settings succeed in giving reasonable behaviour +in most cases. Available options, in no particular order, are as +follows: +
--helpShow help for all options, both for the core and for the + selected skin. + +
--versionShow the version number of the + valgrind core. Skins can have their own version numbers. There + is a scheme in place to ensure that skins only execute when the + core version is one they are known to work with. This was done + to minimise the chances of strange problems arising from + skin-vs-core version incompatibilities.
+ +
-v --verboseBe more verbose. Gives extra + information on various aspects of your program, such as: the + shared objects loaded, the suppressions used, the progress of + the instrumentation and execution engines, and warnings about + unusual behaviour. Repeating the flag increases the verbosity + level.
+ +
-q --quietRun silently, and only print error messages. Useful if you + are running regression tests or have some other automated test + machinery. +
+ +
--demangle=no--demangle=yes [the default]
+ Disable/enable automatic demangling (decoding) of C++ names. + Enabled by default. When enabled, Valgrind will attempt to + translate encoded C++ procedure names back to something + approaching the original. The demangler handles symbols mangled + by g++ versions 2.X and 3.X. + +
An important fact about demangling is that function + names mentioned in suppressions files should be in their mangled + form. Valgrind does not demangle function names when searching + for applicable suppressions, because to do otherwise would make + suppressions file contents dependent on the state of Valgrind's + demangling machinery, and would also be slow and pointless. +
+ +
--num-callers=<number> [default=4]By default, Valgrind shows four levels of function call names + to help you identify program locations. You can change that + number with this option. This can help in determining the + program's location in deeply-nested call chains. Note that errors + are commoned up using only the top three function locations (the + place in the current function, and that of its two immediate + callers). So this doesn't affect the total number of errors + reported. +
+ The maximum value for this is 50. Note that higher settings + will make Valgrind run a bit more slowly and take a bit more + memory, but can be useful when working with programs with + deeply-nested call chains. +
+ +
--gdb-attach=no [the default]--gdb-attach=yes
+ When enabled, Valgrind will pause after every error shown,
+ and print the line
+
+ ---- Attach to GDB ? --- [Return/N/n/Y/y/C/c] ----
+
+ Pressing Ret, or N Ret
+ or n Ret, causes Valgrind not to
+ start GDB for this error.
+
+ Y Ret
+ or y Ret causes Valgrind to
+ start GDB, for the program at this point. When you have
+ finished with GDB, quit from it, and the program will continue.
+ Trying to continue from inside GDB doesn't work.
+
+ C Ret
+ or c Ret causes Valgrind not to
+ start GDB, and not to ask again.
+
+ --gdb-attach=yes conflicts with
+ --trace-children=yes. You can't use them together.
+ Valgrind refuses to start up in this situation. 1 May 2002:
+ this is a historical relic which could be easily fixed if it
+ gets in your way. Mail me and complain if this is a problem for
+ you.
+
+ Nov 2002: if you're sending output to a logfile or to a network + socket, I guess this option doesn't make any sense. Caveat emptor. +
+ +
--alignment=<number> [default: 4]By
+ default valgrind's malloc, realloc,
+ etc, return 4-byte aligned addresses. These are suitable for
+ any accesses on x86 processors.
+ Some programs might however assume that malloc et
+ al return memory aligned to 8 bytes or more.
+ These programs are broken and should be fixed, but
+ if this is impossible for whatever reason the alignment can be
+ increased using this parameter. The supplied value must be
+ between 4 and 4096 inclusive, and must be a power of two.
+ +
--sloppy-malloc=no [the default]--sloppy-malloc=yes
+ When enabled, all requests for malloc/calloc are rounded up + to a whole number of machine words -- in other words, made + divisible by 4. For example, a request for 17 bytes of space + would result in a 20-byte area being made available. This works + around bugs in sloppy libraries which assume that they can + safely rely on malloc/calloc requests being rounded up in this + fashion. Without the workaround, these libraries tend to + generate large numbers of errors when they access the ends of + these areas. +
+ Valgrind snapshots dated 17 Feb 2002 and later are
+ cleverer about this problem, and you should no longer need to
+ use this flag. To put it bluntly, if you do need to use this
+ flag, your program violates the ANSI C semantics defined for
+ malloc and free, even if it appears to
+ work correctly, and you should fix it, at least if you hope for
+ maximum portability.
+
+ +
--trace-children=no [the default]--trace-children=yes
+ When enabled, Valgrind will trace into child processes. This + is confusing and usually not what you want, so is disabled by + default. +
+ +
--logfile-fd=<number> [default: 2, stderr]
+ Specifies that Valgrind should send all of its + messages to the specified file descriptor. The default, 2, is + the standard error channel (stderr). Note that this may + interfere with the client's own use of stderr. +
+ +
--logfile=<filename>
+ Specifies that Valgrind should send all of its
+ messages to the specified file. In fact, the file name used
+ is created by concatenating the text filename,
+ ".pid" and the process ID, so as to create a file per process.
+ The specified file name may not be the empty string.
+
+ +
--logsocket=<ip-address:port-number>
+ Specifies that Valgrind should send all of its messages to
+ the specified port at the specified IP address. The port may be
+ omitted, in which case port 1500 is used. If a connection
+ cannot be made to the specified socket, valgrind falls back to
+ writing output to the standard error (stderr). This option is
+ intended to be used in conjunction with the
+ valgrind-listener program. For further details,
+ see section 2.3.
+
+ +
--suppressions=<filename>
+ [default: $PREFIX/lib/valgrind/default.supp]
+ Specifies an extra + file from which to read descriptions of errors to suppress. You + may use as many extra suppressions files as you + like. +
+ +
--error-limit=yes [default]--error-limit=no When enabled, valgrind stops + reporting errors after 30000 in total, or 300 different ones, + have been seen. This is to stop the error tracking machinery + from becoming a huge performance overhead in programs with many + errors. +
+ +
--run-libc-freeres=yes [the default]--run-libc-freeres=no
+ The GNU C library (libc.so), which is used by
+ all programs, may allocate memory for its own uses. Usually it
+ doesn't bother to free that memory when the program ends -- there
+ would be no point, since the Linux kernel reclaims all process
+ resources when a process exits anyway, so it would just slow
+ things down.
+
+ The glibc authors realised that this behaviour causes leak
+ checkers, such as Valgrind, to falsely report leaks in glibc,
+ when a leak check is done at exit. In order to avoid this, they
+ provided a routine called __libc_freeres
+ specifically to make glibc release all memory it has allocated.
+ The MemCheck and AddrCheck skins therefore try and run
+ __libc_freeres at exit.
+
+ Unfortunately, in some versions of glibc,
+ __libc_freeres is sufficiently buggy to cause
+ segmentation faults. This is particularly noticeable on Red Hat
+ 7.1. So this flag is provided in order to inhibit the run of
+ __libc_freeres. If your program seems to run fine
+ on valgrind, but segfaults at exit, you may find that
+ --run-libc-freeres=no fixes that, although at the
+ cost of possibly falsely reporting space leaks in
+ libc.so.
+
+ +
--weird-hacks=hack1,hack2,...
+ Pass miscellaneous hints to Valgrind which slightly modify the
+ simulated behaviour in nonstandard or dangerous ways, possibly
+ to help the simulation of strange features. By default no hacks
+ are enabled. Use with caution! Currently known hacks are:
+ +
ioctl-VTIME Use this if you have a program
+ which sets readable file descriptors to have a timeout by
+ doing ioctl on them with a
+ TCSETA-style command and a non-zero
+ VTIME timeout value. This is considered
+ potentially dangerous and therefore is not engaged by
+ default, because it is (remotely) conceivable that it could
+ cause threads doing read to incorrectly block
+ the entire process.
+
+ You probably want to try this one if you have a program
+ which unexpectedly blocks in a read from a file
+ descriptor which you know to have been messed with by
+ ioctl. This could happen, for example, if the
+ descriptor is used to read input from some kind of screen
+ handling library.
+
+ To find out if your program is blocking unexpectedly in the
+ read system call, run with the
+ --trace-syscalls=yes flag.
+
+
truncate-writes Use this if you have a threaded
+ program which appears to unexpectedly block whilst writing
+ into a pipe. The effect is to modify all calls to
+ write() so that requests to write more than
+ 4096 bytes are treated as if they only requested a write of
+ 4096 bytes. Valgrind does this by changing the
+ count argument of write(), as
+ passed to the kernel, so that it is at most 4096. The
+ amount of data written will then be less than the client
+ program asked for, but the client should have a loop around
+ its write() call to check whether the requested
+ number of bytes have been written. If not, it should issue
+ further write() calls until all the data is
+ written.
+ + This all sounds pretty dodgy to me, which is why I've made + this behaviour only happen on request. It is not the + default behaviour. At the time of writing this (30 June + 2002) I have only seen one example where this is necessary, + so either the problem is extremely rare or nobody is using + Valgrind :-) +
+ On experimentation I see that truncate-writes
+ doesn't interact well with ioctl-VTIME, so you
+ probably don't want to try both at once.
+
+ As above, to find out if your program is blocking
+ unexpectedly in the write() system call, you
+ may find the --trace-syscalls=yes
+ --trace-sched=yes flags useful.
+
+
--single-step=no [default]--single-step=yes
+ When enabled, each x86 insn is translated separately into + instrumented code. When disabled, translation is done on a + per-basic-block basis, giving much better translations.
+ +
--optimise=no--optimise=yes [default]
+ When enabled, various improvements are applied to the + intermediate code, mainly aimed at allowing the simulated CPU's + registers to be cached in the real CPU's registers over several + simulated instructions.
+ +
--profile=no--profile=yes [default]
+ When enabled, does crude internal profiling of valgrind + itself. This is not for profiling your programs. Rather it is + to allow the developers to assess where valgrind is spending + its time. The skins must be built for profiling for this to + work. +
+ +
--trace-syscalls=no [default]--trace-syscalls=yes
+ Enable/disable tracing of system call intercepts.
+ +
--trace-signals=no [default]--trace-signals=yes
+ Enable/disable tracing of signal handling.
+ +
--trace-sched=no [default]--trace-sched=yes
+ Enable/disable tracing of thread scheduling events.
+ +
--trace-pthread=none [default]--trace-pthread=some --trace-pthread=all
+ Specifies the amount of trace detail for pthread-related events.
+ +
--trace-symtab=no [default]--trace-symtab=yes
+ Enable/disable tracing of symbol table reading.
+ +
--trace-malloc=no [default]--trace-malloc=yes
+ Enable/disable tracing of malloc/free (et al) intercepts. +
+ +
--stop-after=<number>
+ [default: infinity, more or less]
+ After <number> basic blocks have been executed, shut down + Valgrind and switch back to running the client on the real CPU. +
+ +
--dump-error=<number> [default: inactive]
+ After the program has exited, show gory details of the
+ translation of the basic block containing the <number>'th
+ error context. When used with --single-step=yes,
+ can show the exact x86 instruction causing an error. This is
+ all fairly dodgy and doesn't work at all if threads are
+ involved.
+
+For your convenience, a subset of these so-called client requests is +provided to allow you to tell Valgrind facts about the behaviour of +your program, and conversely to make queries. In particular, your +program can tell Valgrind about changes in memory range permissions +that Valgrind would not otherwise know about, and so allows clients to +get Valgrind to do arbitrary custom checks. +
+Clients need to include the header file valgrind.h to
+make this work. The macros therein have the magical property that
+they generate code in-line which Valgrind can spot. However, the code
+does nothing when not run on Valgrind, so you are not forced to run
+your program on Valgrind just because you use the macros in this file.
+Also, you are not required to link your program with any extra
+supporting libraries.
+
+A brief description of the available macros: +
VALGRIND_MAKE_NOACCESS,
+ VALGRIND_MAKE_WRITABLE and
+ VALGRIND_MAKE_READABLE. These mark address
+ ranges as completely inaccessible, accessible but containing
+ undefined data, and accessible and containing defined data,
+ respectively. Subsequent errors may have their faulting
+ addresses described in terms of these blocks. Returns a
+ "block handle". Returns zero when not run on Valgrind.
++
VALGRIND_DISCARD: At some point you may want
+ Valgrind to stop reporting errors in terms of the blocks
+ defined by the previous three macros. To do this, the above
+ macros return a small-integer "block handle". You can pass
+ this block handle to VALGRIND_DISCARD. After
+ doing so, Valgrind will no longer be able to relate
+ addressing errors to the user-defined block associated with
+ the handle. The permissions settings associated with the
+ handle remain in place; this just affects how errors are
+ reported, not whether they are reported. Returns 1 for an
+ invalid handle and 0 for a valid handle (although passing
+ invalid handles is harmless). Always returns 0 when not run
+ on Valgrind.
++
VALGRIND_CHECK_NOACCESS,
+ VALGRIND_CHECK_WRITABLE and
+ VALGRIND_CHECK_READABLE: check immediately
+ whether or not the given address range has the relevant
+ property, and if not, print an error message. Also, for the
+ convenience of the client, returns zero if the relevant
+ property holds; otherwise, the returned value is the address
+ of the first byte for which the property is not true.
+ Always returns 0 when not run on Valgrind.
++
VALGRIND_CHECK_DEFINED: a quick and easy way
+ to find out whether Valgrind thinks a particular variable
+ (lvalue, to be precise) is addressable and defined. Prints
+ an error message if not. Returns no value.
++
VALGRIND_MAKE_NOACCESS_STACK: a highly
+ experimental feature. Similarly to
+ VALGRIND_MAKE_NOACCESS, this marks an address
+ range as inaccessible, so that subsequent accesses to an
+ address in the range gives an error. However, this macro
+ does not return a block handle. Instead, all annotations
+ created like this are reviewed at each client
+ ret (subroutine return) instruction, and those
+ which now define an address range below the client's stack
+ pointer register (%esp) are automatically
+ deleted.
+ + In other words, this macro allows the client to tell + Valgrind about red-zones on its own stack. Valgrind + automatically discards this information when the stack + retreats past such blocks. Beware: hacky and flaky, and + probably interacts badly with the new pthread support. +
+
RUNNING_ON_VALGRIND: returns 1 if running on
+ Valgrind, 0 if running on the real CPU.
++
VALGRIND_DO_LEAK_CHECK: run the memory leak detector
+ right now. Returns no value. I guess this could be used to
+ incrementally check for leaks between arbitrary places in the
+ program's execution. Warning: not properly tested!
++
VALGRIND_DISCARD_TRANSLATIONS: discard translations
+ of code in the specified address range. Useful if you are
+ debugging a JITter or some other dynamic code generation system.
+ After this call, attempts to execute code in the invalidated
+ address range will cause valgrind to make new translations of that
+ code, which is probably the semantics you want. Note that this is
+ implemented naively, and involves checking all 200191 entries in
+ the translation table to see if any of them overlap the specified
+ address range. So try not to call it often, or performance will
+ nosedive. Note that you can be clever about this: you only need
+ to call it when an area which previously contained code is
+ overwritten with new code. You can choose to write code into
+ fresh memory, and just call this occasionally to discard large
+ chunks of old code all at once.
+ + Warning: minimally tested, especially for the cache simulator. +
+It works as follows: threaded apps are (dynamically) linked against
+libpthread.so. Usually this is the one installed with
+your Linux distribution. Valgrind, however, supplies its own
+libpthread.so and automatically connects your program to
+it instead.
+
+The fake libpthread.so and Valgrind cooperate to
+implement a user-space pthreads package. This approach avoids the
+horrible implementation problems of implementing a truly
+multiprocessor version of Valgrind, but it does mean that threaded
+apps run only on one CPU, even if you have a multiprocessor machine.
+
+Valgrind schedules your threads in a round-robin fashion, with all
+threads having equal priority. It switches threads every 50000 basic
+blocks (typically around 300000 x86 instructions), which means you'll
+get a much finer interleaving of thread executions than when run
+natively. This in itself may cause your program to behave differently
+if you have some kind of concurrency, critical race, locking, or
+similar bugs.
+
+The current (valgrind-1.0 release) state of pthread support is as +follows: +
pthread_once, reader-writer locks, semaphores,
+ cleanup stacks, cancellation and thread detaching currently work.
+ Various attribute-like calls are handled but ignored; you get a
+ warning message.
++
write read nanosleep
+ sleep select poll
+ recvmsg and
+ accept.
++
pthread_sigmask, pthread_kill,
+ sigwait and raise are now implemented.
+ Each thread has its own signal mask, as POSIX requires.
+ It's a bit kludgey -- there's a system-wide pending signal set,
+ rather than one for each thread. But hey.
+./configure,
+make, make install mechanism, and I have
+attempted to ensure that it works on machines with kernel 2.2 or 2.4
+and glibc 2.1.X or 2.2.X. I don't think there is much else to say.
+There are no options apart from the usual --prefix that
+you should give to ./configure.
+
+
+The configure script tests the version of the X server
+indicated by the current $DISPLAY. This is a
+known bug. The intention was to detect the version of the current
+XFree86 client libraries, so that correct suppressions could be
+selected for them, but instead the test checks the server version.
+This is just plain wrong.
+
+
+If you are building a binary package of Valgrind for distribution,
+please read README_PACKAGERS. It contains some important
+information.
+
+
+Apart from that there is no excitement here. Let me know if you have +build problems. + + + + +
See Section 4 for the known limitations of +Valgrind, and for a list of programs which are known not to work on +it. + +
The translator/instrumentor has a lot of assertions in it. They +are permanently enabled, and I have no plans to disable them. If one +of these breaks, please mail me! + +
If you get an assertion failure on the expression
+chunkSane(ch) in vg_free() in
+vg_malloc.c, this may have happened because your program
+wrote off the end of a malloc'd block, or before its beginning.
+Valgrind should have emitted a proper message to that effect before
+dying in this way. This is a known problem which I should fix.
+
+ +
Under the hood, dealing with signals is a real pain, and Valgrind's +simulation leaves much to be desired. If your program does +way-strange stuff with signals, bad things may happen. If so, let me +know. I don't promise to fix it, but I'd at least like to be aware of +it. + + + + +
Valgrind will run x86-GNU/Linux ELF dynamically linked binaries, on +a kernel 2.2.X or 2.4.X system, subject to the following constraints: + +
+ +
libpthread.so, so that Valgrind can
+ substitute its own implementation at program startup time. If
+ you're statically linked against it, things will fail
+ badly.+ +
+ +
+ +
+ +
+ +
+ +
+ +
+ +
+
__pthread_clock_gettime and
+ __pthread_clock_settime. This appears to be due to
+ /lib/librt-2.2.5.so needing them. Unfortunately I
+ do not understand enough about this problem to fix it properly,
+ and I can't reproduce it on my test RedHat 7.3 system. Please
+ mail me if you have more information / understanding. +
+
-fno-builtin-strlen in
+ the meantime. Or use an earlier gcc.+
The dynamic linker allows each .so in the process image to have an +initialisation function which is run before main(). It also allows +each .so to have a finalisation function run after main() exits. + +
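The same init/fini mechanism can be seen from C using GCC's constructor and destructor attributes, which hook a plain object file into the dynamic linker's startup and shutdown sequence just as an .so's initialisation and finalisation functions are.  This is a sketch of the mechanism, not Valgrind's code:

```c
/* Runs before main(), like an .so's initialisation function. */
static int init_ran = 0;

__attribute__((constructor))
static void my_init(void) { init_ran = 1; }

/* Runs after main() exits, like an .so's finalisation function. */
__attribute__((destructor))
static void my_fini(void) { /* cleanup would go here */ }

/* By the time any ordinary code runs, the constructor has fired. */
int constructor_ran(void) { return init_ran; }
```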
When valgrind.so's initialisation function is called by the dynamic
+linker, the synthetic CPU starts up.  The real CPU remains locked
+in valgrind.so for the entire rest of the program, but the synthetic
+CPU returns from the initialisation function.  Startup of the program
+now continues as usual -- the dynamic linker calls all the other .so's
+initialisation routines, and eventually runs main().  This all runs on
+the synthetic CPU, not the real one, but the client program cannot
+tell the difference.
Eventually main() exits, so the synthetic CPU calls valgrind.so's +finalisation function. Valgrind detects this, and uses it as its cue +to exit. It prints summaries of all errors detected, possibly checks +for memory leaks, and then exits the finalisation routine, but now on +the real CPU. The synthetic CPU has now lost control -- permanently +-- so the program exits back to the OS on the real CPU, just as it +would have done anyway. + +
On entry, Valgrind switches stacks, so it runs on its own stack. +On exit, it switches back. This means that the client program +continues to run on its own stack, so we can switch back and forth +between running it on the simulated and real CPUs without difficulty. +This was an important design decision, because it makes it easy (well, +significantly less difficult) to debug the synthetic CPU. + + + +
Valgrind no longer directly supports detection of self-modifying +code. Such checking is expensive, and in practice (fortunately) +almost no applications need it. However, to help people who are +debugging dynamic code generation systems, there is a Client Request +(basically a macro you can put in your program) which directs Valgrind +to discard translations in a given address range. So Valgrind can +still work in this situation provided the client tells it when +code has become out-of-date and needs to be retranslated. + +
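A sketch of how a dynamic code generator might use that Client Request.  The macro name <code>VALGRIND_DISCARD_TRANSLATIONS</code> is the one found in later Valgrind headers and is an assumption here; the fallback definition below makes the sketch build even without Valgrind's headers installed.

```c
#include <string.h>

#if defined(__has_include) && __has_include(<valgrind/valgrind.h>)
#  include <valgrind/valgrind.h>
#else
/* Fallback so the sketch builds without Valgrind's headers. */
#  define VALGRIND_DISCARD_TRANSLATIONS(addr, len) ((void)0)
#endif

/* Buffer a JIT would repeatedly overwrite with fresh code. */
unsigned char codebuf[4096];

void regenerate_code(const unsigned char *newcode, size_t len) {
    memcpy(codebuf, newcode, len);
    /* Tell Valgrind any cached translations of this range are stale. */
    VALGRIND_DISCARD_TRANSLATIONS(codebuf, len);
}
```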
The JITter translates basic blocks -- blocks of straight-line code
+-- as single entities.  To minimise the considerable difficulties of
+dealing with the x86 instruction set, x86 instructions are first
+translated to a RISC-like intermediate code, similar to sparc code,
+but with an infinite number of virtual integer registers.  Initially
+each insn is translated separately, and there is no attempt at
+instrumentation.
The intermediate code is improved, mostly so as to try and cache
+the simulated machine's registers in the real machine's registers over
+several simulated instructions.  This is often very effective.  Also,
+we try to remove redundant updates of the simulated machine's
+condition-code register.
The intermediate code is then instrumented, giving more +intermediate code. There are a few extra intermediate-code operations +to support instrumentation; it is all refreshingly simple. After +instrumentation there is a cleanup pass to remove redundant value +checks. + +
This gives instrumented intermediate code which mentions arbitrary +numbers of virtual registers. A linear-scan register allocator is +used to assign real registers and possibly generate spill code. All +of this is still phrased in terms of the intermediate code. This +machinery is inspired by the work of Reuben Thomas (Mite). + +
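To make the linear-scan idea concrete, here is a toy version (not Valgrind's actual allocator): each virtual register is a live interval [start, end]; scanning intervals in order of start point, an interval gets a real register whose previous occupant has expired, or is spilled when none is free.  Intervals must be pre-sorted by start point.

```c
#include <limits.h>

#define NREGS 3                          /* toy machine: 3 real registers */

struct interval { int start, end, reg; };

void linear_scan(struct interval *iv, int n) {
    /* End point of each register's current occupant. */
    int busy_until[NREGS];
    for (int r = 0; r < NREGS; r++) busy_until[r] = INT_MIN;

    for (int i = 0; i < n; i++) {
        iv[i].reg = -1;                  /* -1 means spilled to memory */
        for (int r = 0; r < NREGS; r++) {
            if (busy_until[r] < iv[i].start) {   /* occupant expired */
                iv[i].reg = r;
                busy_until[r] = iv[i].end;
                break;
            }
        }
    }
}
```

A real allocator also has to generate the spill loads and stores; this sketch only decides who gets a register.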
Then, and only then, is the final x86 code emitted. The +intermediate code is carefully designed so that x86 code can be +generated from it without need for spare registers or other +inconveniences. + +
The translations are managed using a traditional LRU-based caching +scheme. The translation cache has a default size of about 14MB. + + + +
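The cache's eviction policy can be illustrated with a toy LRU table (illustrative only; the real translation cache is far larger and indexed differently).  Each entry maps a guest address to its translation and records when it was last used; when the table is full, the least-recently-used entry is evicted.

```c
#define TC_ENTRIES 4                     /* tiny cache for illustration */

struct tc_entry { unsigned long addr; void *trans; long last_used; int valid; };

static struct tc_entry cache[TC_ENTRIES];
static long clock_ticks = 0;

/* Look up a translation; a hit refreshes its LRU timestamp. */
void *tc_lookup(unsigned long addr) {
    for (int i = 0; i < TC_ENTRIES; i++)
        if (cache[i].valid && cache[i].addr == addr) {
            cache[i].last_used = ++clock_ticks;
            return cache[i].trans;
        }
    return 0;   /* miss: caller must translate addr and tc_insert it */
}

/* Insert a translation, evicting the LRU entry if the cache is full. */
void tc_insert(unsigned long addr, void *trans) {
    int victim = 0;
    for (int i = 0; i < TC_ENTRIES; i++) {
        if (!cache[i].valid) { victim = i; break; }
        if (cache[i].last_used < cache[victim].last_used) victim = i;
    }
    cache[victim] = (struct tc_entry){ addr, trans, ++clock_ticks, 1 };
}
```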
When such a signal arrives, Valgrind's own handler catches it, and +notes the fact. At a convenient safe point in execution, Valgrind +builds a signal delivery frame on the client's stack and runs its +handler. If the handler longjmp()s, there is nothing more to be said. +If the handler returns, Valgrind notices this, zaps the delivery +frame, and carries on where it left off before delivering the signal. + +
The purpose of this nonsense is that setting signal handlers +essentially amounts to giving callback addresses to the Linux kernel. +We can't allow this to happen, because if it did, signal handlers +would run on the real CPU, not the simulated one. This means the +checking machinery would not operate during the handler run, and, +worse, memory permissions maps would not be updated, which could cause +spurious error reports once the handler had returned. + +
An even worse thing would happen if the signal handler longjmp'd +rather than returned: Valgrind would completely lose control of the +client program. + +
Upshot: we can't allow the client to install signal handlers
+directly.  Instead, Valgrind must catch, on behalf of the client, any
+signal the client asks to catch, and must deliver it to the client on
+the simulated CPU, not the real one.  This involves considerable
+gruesome fakery; see vg_signals.c for details.
+ +
+sewardj@phoenix:~/newmat10$ +~/Valgrind-6/valgrind -v ./bogon +==25832== Valgrind 0.10, a memory error detector for x86 RedHat 7.1. +==25832== Copyright (C) 2000-2001, and GNU GPL'd, by Julian Seward. +==25832== Startup, with flags: +==25832== --suppressions=/home/sewardj/Valgrind/redhat71.supp +==25832== reading syms from /lib/ld-linux.so.2 +==25832== reading syms from /lib/libc.so.6 +==25832== reading syms from /mnt/pima/jrs/Inst/lib/libgcc_s.so.0 +==25832== reading syms from /lib/libm.so.6 +==25832== reading syms from /mnt/pima/jrs/Inst/lib/libstdc++.so.3 +==25832== reading syms from /home/sewardj/Valgrind/valgrind.so +==25832== reading syms from /proc/self/exe +==25832== loaded 5950 symbols, 142333 line number locations +==25832== +==25832== Invalid read of size 4 +==25832== at 0x8048724: _ZN10BandMatrix6ReSizeEiii (bogon.cpp:45) +==25832== by 0x80487AF: main (bogon.cpp:66) +==25832== by 0x40371E5E: __libc_start_main (libc-start.c:129) +==25832== by 0x80485D1: (within /home/sewardj/newmat10/bogon) +==25832== Address 0xBFFFF74C is not stack'd, malloc'd or free'd +==25832== +==25832== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0) +==25832== malloc/free: in use at exit: 0 bytes in 0 blocks. +==25832== malloc/free: 0 allocs, 0 frees, 0 bytes allocated. +==25832== For a detailed leak analysis, rerun with: --leak-check=yes +==25832== +==25832== exiting, did 1881 basic blocks, 0 misses. +==25832== 223 translations, 3626 bytes in, 56801 bytes out. ++
The GCC folks fixed this about a week before gcc-3.0 shipped. +
+ +