From 8291bb8643c900faed0ddf55a9e22aa6d2d0c58a Mon Sep 17 00:00:00 2001 From: Nicholas Nethercote Date: Mon, 23 Sep 2002 13:10:54 +0000 Subject: [PATCH] Removed files that are now elsewhere (in core/docs/ and memcheck/docs/), setup basic skeleton linking to core + skin docs. git-svn-id: svn://svn.valgrind.org/valgrind/trunk@1096 --- docs/Makefile.am | 2 +- docs/index.html | 53 +- docs/manual.html | 2731 -------------------------------------------- docs/nav.html | 72 -- docs/techdocs.html | 2524 ---------------------------------------- 5 files changed, 31 insertions(+), 5351 deletions(-) delete mode 100644 docs/manual.html delete mode 100644 docs/nav.html delete mode 100644 docs/techdocs.html diff --git a/docs/Makefile.am b/docs/Makefile.am index e8a58fa18..39b9008b6 100644 --- a/docs/Makefile.am +++ b/docs/Makefile.am @@ -1,5 +1,5 @@ docdir = $(datadir)/doc/valgrind -doc_DATA = index.html manual.html nav.html techdocs.html +doc_DATA = index.html EXTRA_DIST = $(doc_DATA) diff --git a/docs/index.html b/docs/index.html index 111170256..d4db7c868 100644 --- a/docs/index.html +++ b/docs/index.html @@ -1,26 +1,33 @@ - + + Valgrind + + + - - - - - - - - Valgrind's user manual - - - - - - - <body> - <p>This page uses frames, but your browser doesn't support them.</p> - </body> - - - + +

Documentation Contents

+ Core
+ memcheck
+ Cachegrind
+ diff --git a/docs/manual.html b/docs/manual.html deleted file mode 100644 index 95fe84080..000000000 --- a/docs/manual.html +++ /dev/null @@ -1,2731 +0,0 @@ - - - - Valgrind - - - - -  -

Valgrind, version 1.0.0

-
This manual was last updated on 20020726
-

- -

-jseward@acm.org
-Copyright © 2000-2002 Julian Seward -

-Valgrind is licensed under the GNU General Public License, -version 2
-An open-source tool for finding memory-management problems in -Linux-x86 executables. -

- -

- -


- -

Contents of this manual

- -

Introduction

- 1.1  What Valgrind is for
- 1.2  What it does with your program - -

How to use it, and how to make sense - of the results

- 2.1  Getting started
- 2.2  The commentary
- 2.3  Reporting of errors
- 2.4  Suppressing errors
- 2.5  Command-line flags
- 2.6  Explanation of error messages
- 2.7  Writing suppressions files
- 2.8  The Client Request mechanism
- 2.9  Support for POSIX pthreads
- 2.10  Building and installing
- 2.11  If you have problems
- -

Details of the checking machinery

- 3.1  Valid-value (V) bits
- 3.2  Valid-address (A) bits
- 3.3  Putting it all together
- 3.4  Signals
- 3.5  Memory leak detection
- -

Limitations

- -

How it works -- a rough overview

- 5.1  Getting started
- 5.2  The translation/instrumentation engine
- 5.3  Tracking the status of memory
- 5.4  System calls
- 5.5  Signals
- -

An example

- -

Cache profiling

- -

The design and implementation of Valgrind

- -
- - -

1  Introduction

- - -

1.1  What Valgrind is for

- -Valgrind is a tool to help you find memory-management problems in your programs. When a program is run under Valgrind's supervision, all reads and writes of memory are checked, and calls to malloc/new/free/delete are intercepted. As a result, Valgrind can detect problems such as use of uninitialised memory, reading or writing memory after it has been freed, reading or writing off the ends of malloc'd blocks, mismatched use of malloc/new/new[] versus free/delete/delete[], passing unaddressable or uninitialised buffers to system calls, and memory leaks. Problems like these can be difficult to find by other means, often lying undetected for long periods, then causing occasional, difficult-to-diagnose crashes.

-Valgrind is closely tied to details of the CPU, the operating system and, to a lesser extent, the compiler and basic C libraries. This makes it difficult to port, so I have chosen at the outset to concentrate on what I believe to be a widely used platform: Linux on x86. Valgrind uses the standard Unix ./configure, make, make install mechanism, and I have attempted to ensure that it works on machines with kernel 2.2 or 2.4 and glibc 2.1.X or 2.2.X. This should cover the vast majority of modern Linux installations.

-Valgrind is licensed under the GNU General Public License, version -2. Read the file LICENSE in the source distribution for details. Some -of the PThreads test cases, test/pth_*.c, are taken from -"Pthreads Programming" by Bradford Nichols, Dick Buttlar & Jacqueline -Proulx Farrell, ISBN 1-56592-115-1, published by O'Reilly & -Associates, Inc. - - - -

1.2  What it does with your program

- -Valgrind is designed to be as non-intrusive as possible. It works directly with existing executables. You don't need to recompile, relink, or otherwise modify the program to be checked. Simply place the word valgrind at the start of the command line normally used to run the program. So, for example, if you want to run the command ls -l on Valgrind, simply issue the command: valgrind ls -l.

Valgrind takes control of your program before it starts. Debugging -information is read from the executable and associated libraries, so -that error messages can be phrased in terms of source code -locations. Your program is then run on a synthetic x86 CPU which -checks every memory access. All detected errors are written to a -log. When the program finishes, Valgrind searches for and reports on -leaked memory. - -

You can run pretty much any dynamically linked ELF x86 executable -using Valgrind. Programs run 25 to 50 times slower, and take a lot -more memory, than they usually would. It works well enough to run -large programs. For example, the Konqueror web browser from the KDE -Desktop Environment, version 3.0, runs slowly but usably on Valgrind. - -

Valgrind simulates every single instruction your program executes. -Because of this, it finds errors not only in your application but also -in all supporting dynamically-linked (.so-format) -libraries, including the GNU C library, the X client libraries, Qt, if -you work with KDE, and so on. That often includes libraries, for -example the GNU C library, which contain memory access violations, but -which you cannot or do not want to fix. - -

Rather than swamping you with errors in which you are not -interested, Valgrind allows you to selectively suppress errors, by -recording them in a suppressions file which is read when Valgrind -starts up. The build mechanism attempts to select suppressions which -give reasonable behaviour for the libc and XFree86 versions detected -on your machine. - - -

Section 6 shows an example of use. -

-


- - -

2  How to use it, and how to make sense of the results

- - -

2.1  Getting started

- -First off, consider whether it might be beneficial to recompile your -application and supporting libraries with optimisation disabled and -debugging info enabled (the -g flag). You don't have to -do this, but doing so helps Valgrind produce more accurate and less -confusing error reports. Chances are you're set up like this already, -if you intended to debug your program with GNU gdb, or some other -debugger. - -

-A plausible compromise is to use -g -O. -Optimisation levels above -O have been observed, on very -rare occasions, to cause gcc to generate code which fools Valgrind's -error tracking machinery into wrongly reporting uninitialised value -errors. -O gets you the vast majority of the benefits of -higher optimisation levels anyway, so you don't lose much there. - -

-Valgrind understands both the older "stabs" debugging format, used by -gcc versions prior to 3.1, and the newer DWARF2 format used by gcc 3.1 -and later. - -

-Then just run your application, but place the word valgrind in front of your usual command-line invocation. Note that you should run the real (machine-code) executable here. If your application is started by, for example, a shell or perl script, you'll need to modify it to invoke Valgrind on the real executables. Running such scripts directly under Valgrind will result in error reports pertaining to /bin/sh, /usr/bin/perl, or whatever interpreter you're using. This almost certainly isn't what you want and can be confusing.

2.2  The commentary

- -Valgrind writes a commentary, detailing error reports and other -significant events. The commentary goes to standard output by -default. This may interfere with your program, so you can ask for it -to be directed elsewhere. - -

All lines in the commentary are of the following form:
-

-  ==12345== some-message-from-Valgrind
-
-

The 12345 is the process ID. This scheme makes it easy -to distinguish program output from Valgrind commentary, and also easy -to differentiate commentaries from different processes which have -become merged together, for whatever reason. - -

By default, Valgrind writes only essential messages to the commentary, -so as to avoid flooding you with information of secondary importance. -If you want more information about what is happening, re-run, passing -the -v flag to Valgrind. - - - -

2.3  Reporting of errors

- -When Valgrind detects something bad happening in the program, an error -message is written to the commentary. For example:
-
-  ==25832== Invalid read of size 4
-  ==25832==    at 0x8048724: BandMatrix::ReSize(int, int, int) (bogon.cpp:45)
-  ==25832==    by 0x80487AF: main (bogon.cpp:66)
-  ==25832==    by 0x40371E5E: __libc_start_main (libc-start.c:129)
-  ==25832==    by 0x80485D1: (within /home/sewardj/newmat10/bogon)
-  ==25832==    Address 0xBFFFF74C is not stack'd, malloc'd or free'd
-
- -

This message says that the program did an illegal 4-byte read of -address 0xBFFFF74C, which, as far as it can tell, is not a valid stack -address, nor corresponds to any currently malloc'd or free'd blocks. -The read is happening at line 45 of bogon.cpp, called -from line 66 of the same file, etc. For errors associated with an -identified malloc'd/free'd block, for example reading free'd memory, -Valgrind reports not only the location where the error happened, but -also where the associated block was malloc'd/free'd. - -

Valgrind remembers all error reports. When an error is detected, -it is compared against old reports, to see if it is a duplicate. If -so, the error is noted, but no further commentary is emitted. This -avoids you being swamped with bazillions of duplicate error reports. - -

If you want to know how many times each error occurred, run with -the -v option. When execution finishes, all the reports -are printed out, along with, and sorted by, their occurrence counts. -This makes it easy to see which errors have occurred most frequently. - -

Errors are reported before the associated operation actually happens. For example, if your program decides to read from address zero, Valgrind will emit a message to this effect, and the program will then duly die with a segmentation fault.

In general, you should try and fix errors in the order that they -are reported. Not doing so can be confusing. For example, a program -which copies uninitialised values to several memory locations, and -later uses them, will generate several error messages. The first such -error message may well give the most direct clue to the root cause of -the problem. - -

The process of detecting duplicate errors is quite an expensive -one and can become a significant performance overhead if your program -generates huge quantities of errors. To avoid serious problems here, -Valgrind will simply stop collecting errors after 300 different errors -have been seen, or 30000 errors in total have been seen. In this -situation you might as well stop your program and fix it, because -Valgrind won't tell you anything else useful after this. Note that -the 300/30000 limits apply after suppressed errors are removed. These -limits are defined in vg_include.h and can be increased -if necessary. - -

To avoid this cutoff you can use the --error-limit=no flag. Then Valgrind will always show errors, regardless of how many there are. Use this flag carefully, since it may have a dire effect on performance.

2.4  Suppressing errors

- -Valgrind detects numerous problems in the base libraries, such as the -GNU C library, and the XFree86 client libraries, which come -pre-installed on your GNU/Linux system. You can't easily fix these, -but you don't want to see these errors (and yes, there are many!) So -Valgrind reads a list of errors to suppress at startup. -A default suppression file is cooked up by the -./configure script. - -

You can modify and add to the suppressions file at your leisure, -or, better, write your own. Multiple suppression files are allowed. -This is useful if part of your project contains errors you can't or -don't want to fix, yet you don't want to continuously be reminded of -them. - -

Each error to be suppressed is described very specifically, to minimise the possibility that a suppression-directive inadvertently suppresses a bunch of similar errors which you did want to see. The suppression mechanism is designed to allow precise yet flexible specification of errors to suppress.

If you use the -v flag, at the end of execution, Valgrind prints out one line for each used suppression, giving its name and the number of times it got used. Here are the suppressions used by a run of ls -l:

-  --27579-- supp: 1 socketcall.connect(serv_addr)/__libc_connect/__nscd_getgrgid_r
-  --27579-- supp: 1 socketcall.connect(serv_addr)/__libc_connect/__nscd_getpwuid_r
-  --27579-- supp: 6 strrchr/_dl_map_object_from_fd/_dl_map_object
-
- - -

2.5  Command-line flags

- -You invoke Valgrind like this: -
-  valgrind [options-for-Valgrind] your-prog [options for your-prog]
-
- -

Note that Valgrind also reads options from the environment variable -$VALGRIND_OPTS, and processes them before the command-line -options. - -

Valgrind's default settings succeed in giving reasonable behaviour -in most cases. Available options, in no particular order, are as -follows: -

- -There are also some options for debugging Valgrind itself. You -shouldn't need to use them in the normal run of things. Nevertheless: - - - - - -

2.6  Explanation of error messages

- -Despite considerable sophistication under the hood, Valgrind can only -really detect two kinds of errors, use of illegal addresses, and use -of undefined values. Nevertheless, this is enough to help you -discover all sorts of memory-management nasties in your code. This -section presents a quick summary of what error messages mean. The -precise behaviour of the error-checking machinery is described in -Section 4. - - -

2.6.1  Illegal read / Illegal write errors

-For example: -
-  Invalid read of size 4
-     at 0x40F6BBCC: (within /usr/lib/libpng.so.2.1.0.9)
-     by 0x40F6B804: (within /usr/lib/libpng.so.2.1.0.9)
-     by 0x40B07FF4: read_png_image__FP8QImageIO (kernel/qpngio.cpp:326)
-     by 0x40AC751B: QImageIO::read() (kernel/qimage.cpp:3621)
-     Address 0xBFFFF0E0 is not stack'd, malloc'd or free'd
-
- -

This happens when your program reads or writes memory at a place -which Valgrind reckons it shouldn't. In this example, the program did -a 4-byte read at address 0xBFFFF0E0, somewhere within the -system-supplied library libpng.so.2.1.0.9, which was called from -somewhere else in the same library, called from line 326 of -qpngio.cpp, and so on. - -

Valgrind tries to establish what the illegal address might relate to, since that's often useful. So, if it points into a block of memory which has already been freed, you'll be informed of this, and also where the block was free'd. Likewise, if it should turn out to be just off the end of a malloc'd block, a common result of off-by-one errors in array subscripting, you'll be informed of this fact, and also where the block was malloc'd.

In this example, Valgrind can't identify the address. Actually the -address is on the stack, but, for some reason, this is not a valid -stack address -- it is below the stack pointer, %esp, and that isn't -allowed. In this particular case it's probably caused by gcc -generating invalid code, a known bug in various flavours of gcc. - -

Note that Valgrind only tells you that your program is about to access memory at an illegal address. It can't stop the access from happening. So, if your program makes an access which normally would result in a segmentation fault, your program will still suffer the same fate -- but you will get a message from Valgrind immediately prior to this. In this particular example, reading junk on the stack is non-fatal, and the program stays alive.
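For illustration, a minimal program that provokes this kind of report might look like the sketch below (an invented example, not one of the test programs shipped with Valgrind). Compiled with -g and no optimisation, the off-by-one read should be reported as an invalid read of size 4, with the address described as just past the end of a malloc'd block rather than as a bad stack address:

  #include <stdlib.h>

  int main(void)
  {
     int *a = calloc(10, sizeof(int));   /* valid indices are 0..9            */
     int  x = a[10];                     /* off-by-one: reads past the block  */
     free(a);
     return x == 42;                     /* keep the read live at -O0         */
  }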

2.6.2  Use of uninitialised values

-For example: -
-  Conditional jump or move depends on uninitialised value(s)
-     at 0x402DFA94: _IO_vfprintf (_itoa.h:49)
-     by 0x402E8476: _IO_printf (printf.c:36)
-     by 0x8048472: main (tests/manuel1.c:8)
-     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
-
- -

An uninitialised-value use error is reported when your program uses -a value which hasn't been initialised -- in other words, is undefined. -Here, the undefined value is used somewhere inside the printf() -machinery of the C library. This error was reported when running the -following small program: -

-  #include <stdio.h>
-  int main()
-  {
-    int x;
-    printf ("x = %d\n", x);
-  }
-
- -

It is important to understand that your program can copy around -junk (uninitialised) data to its heart's content. Valgrind observes -this and keeps track of the data, but does not complain. A complaint -is issued only when your program attempts to make use of uninitialised -data. In this example, x is uninitialised. Valgrind observes the -value being passed to _IO_printf and thence to _IO_vfprintf, but makes -no comment. However, _IO_vfprintf has to examine the value of x so it -can turn it into the corresponding ASCII string, and it is at this -point that Valgrind complains. - -

Sources of uninitialised data tend to be local variables in procedures which have not been initialised, and the contents of malloc'd blocks, before you write something there.

- - - -

2.6.3  Illegal frees

-For example: -
-  Invalid free()
-     at 0x4004FFDF: free (ut_clientmalloc.c:577)
-     by 0x80484C7: main (tests/doublefree.c:10)
-     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
-     by 0x80483B1: (within tests/doublefree)
-     Address 0x3807F7B4 is 0 bytes inside a block of size 177 free'd
-     at 0x4004FFDF: free (ut_clientmalloc.c:577)
-     by 0x80484C7: main (tests/doublefree.c:10)
-     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
-     by 0x80483B1: (within tests/doublefree)
-
-

Valgrind keeps track of the blocks allocated by your program with malloc/new, so it knows exactly whether or not the argument to free/delete is legitimate. Here, this test program has freed the same block twice. As with the illegal read/write errors, Valgrind attempts to make sense of the address free'd. If, as here, the address is one which has previously been freed, you will be told that -- making duplicate frees of the same block easy to spot.
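A hypothetical program along the lines of the doublefree test above (this sketch is invented for illustration and is not the actual tests/doublefree.c) is simply:

  #include <stdlib.h>

  int main(void)
  {
     void *p = malloc(177);   /* same block size as in the report above              */
     free(p);
     free(p);                 /* freeing the same block twice: reported as an        */
     return 0;                /* "Invalid free()" pointing at both free locations    */
  }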

2.6.4  When a block is freed with an inappropriate -deallocation function

-In the following example, a block allocated with new[] -has wrongly been deallocated with free: -
-  Mismatched free() / delete / delete []
-     at 0x40043249: free (vg_clientfuncs.c:171)
-     by 0x4102BB4E: QGArray::~QGArray(void) (tools/qgarray.cpp:149)
-     by 0x4C261C41: PptDoc::~PptDoc(void) (include/qmemarray.h:60)
-     by 0x4C261F0E: PptXml::~PptXml(void) (pptxml.cc:44)
-     Address 0x4BB292A8 is 0 bytes inside a block of size 64 alloc'd
-     at 0x4004318C: __builtin_vec_new (vg_clientfuncs.c:152)
-     by 0x4C21BC15: KLaola::readSBStream(int) const (klaola.cc:314)
-     by 0x4C21C155: KLaola::stream(KLaola::OLENode const *) (klaola.cc:416)
-     by 0x4C21788F: OLEFilter::convert(QCString const &) (olefilter.cc:272)
-
-The following was told to me by the KDE 3 developers. I didn't know any of it myself. They also implemented the check itself.

-In C++ it's important to deallocate memory in a way compatible with how it was allocated. The deal is: memory allocated with malloc, calloc, realloc, valloc or memalign must be deallocated with free; memory allocated with new must be deallocated with delete; and memory allocated with new[] must be deallocated with delete[].

-The worst thing is that on Linux apparently it doesn't matter if you -do muddle these up, and it all seems to work ok, but the same program -may then crash on a different platform, Solaris for example. So it's -best to fix it properly. According to the KDE folks "it's amazing how -many C++ programmers don't know this". -

-Pascal Massimino adds the following clarification: delete[] must be paired with new[], because the compiler stores the size of the array, and the pointer-to-member to the destructor of the array's contents, just before the pointer actually returned. This implies a variable-sized overhead in what's returned by new or new[]. It is rather surprising how robust compilers [Ed: runtime-support libraries?] are to mismatches between new/delete and new[]/delete[].

2.6.5  Passing system call parameters with inadequate -read/write permissions

- -Valgrind checks all parameters to system calls. If a system call needs to read from a buffer provided by your program, Valgrind checks that the entire buffer is addressable and has valid data, i.e., it is readable. And if the system call needs to write to a user-supplied buffer, Valgrind checks that the buffer is addressable. After the system call, Valgrind updates its administrative information to precisely reflect any changes in memory permissions caused by the system call.

Here's an example of a system call with an invalid parameter: -

-  #include <stdlib.h>
-  #include <unistd.h>
-  int main( void )
-  {
-    char* arr = malloc(10);
-    (void) write( 1 /* stdout */, arr, 10 );
-    return 0;
-  }
-
- -

You get this complaint ... -

-  Syscall param write(buf) contains uninitialised or unaddressable byte(s)
-     at 0x4035E072: __libc_write
-     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
-     by 0x80483B1: (within tests/badwrite)
-     by <bogus frame pointer> ???
-     Address 0x3807E6D0 is 0 bytes inside a block of size 10 alloc'd
-     at 0x4004FEE6: malloc (ut_clientmalloc.c:539)
-     by 0x80484A0: main (tests/badwrite.c:6)
-     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
-     by 0x80483B1: (within tests/badwrite)
-
- -

... because the program has tried to write uninitialised junk from -the malloc'd block to the standard output. - - -

2.6.6  Warning messages you might see

- -Most of these only appear if you run in verbose mode (enabled by -v):

2.7  Writing suppressions files

- -A suppression file describes a bunch of errors which, for one reason -or another, you don't want Valgrind to tell you about. Usually the -reason is that the system libraries are buggy but unfixable, at least -within the scope of the current debugging session. Multiple -suppressions files are allowed. By default, Valgrind uses -$PREFIX/lib/valgrind/default.supp. - -

-You can ask to add suppressions from another file, by specifying --suppressions=/path/to/file.supp.

Each suppression has the following components:
-

- -

-Locations may be either names of shared objects/executables or wildcards matching function names. They begin with obj: and fun: respectively. Function and object names to match against may use the wildcard characters * and ?. A suppression only suppresses an error when the error matches all the details in the suppression. Here's an example:

-  {
-    __gconv_transform_ascii_internal/__mbrtowc/mbtowc
-    Value4
-    fun:__gconv_transform_ascii_internal
-    fun:__mbr*toc
-    fun:mbtowc
-  }
-
- -

What it means is: suppress a use-of-uninitialised-value error, when the data size is 4, when it occurs in the function __gconv_transform_ascii_internal, when that is called from any function of name matching __mbr*toc, when that is called from mbtowc. It doesn't apply under any other circumstances. The string by which this suppression is identified to the user is __gconv_transform_ascii_internal/__mbrtowc/mbtowc.

Another example: -

-  {
-    libX11.so.6.2/libX11.so.6.2/libXaw.so.7.0
-    Value4
-    obj:/usr/X11R6/lib/libX11.so.6.2
-    obj:/usr/X11R6/lib/libX11.so.6.2
-    obj:/usr/X11R6/lib/libXaw.so.7.0
-  }
-
- -

Suppress any size 4 uninitialised-value error which occurs anywhere -in libX11.so.6.2, when called from anywhere in the same -library, when called from anywhere in libXaw.so.7.0. The -inexact specification of locations is regrettable, but is about all -you can hope for, given that the X11 libraries shipped with Red Hat -7.2 have had their symbol tables removed. - -

Note -- since the above two examples did not make it clear -- that -you can freely mix the obj: and fun: -styles of description within a single suppression record. - - - -

2.8  The Client Request mechanism

- -Valgrind has a trapdoor mechanism via which the client program can -pass all manner of requests and queries to Valgrind. Internally, this -is used extensively to make malloc, free, signals, threads, etc, work, -although you don't see that. -

-For your convenience, a subset of these so-called client requests is -provided to allow you to tell Valgrind facts about the behaviour of -your program, and conversely to make queries. In particular, your -program can tell Valgrind about changes in memory range permissions -that Valgrind would not otherwise know about, and so allows clients to -get Valgrind to do arbitrary custom checks. -

-Clients need to include the header file valgrind.h to -make this work. The macros therein have the magical property that -they generate code in-line which Valgrind can spot. However, the code -does nothing when not run on Valgrind, so you are not forced to run -your program on Valgrind just because you use the macros in this file. -Also, you are not required to link your program with any extra -supporting libraries. -
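As a hedged illustration of the general idea, a client might query whether it is running under Valgrind like this. The RUNNING_ON_VALGRIND macro used here is taken from later versions of valgrind.h and is assumed, not guaranteed, to be available under this name in this release; check the valgrind.h installed on your system for the macros it actually provides.

  #include <stdio.h>
  #include "valgrind.h"

  int main(void)
  {
     /* Assumed macro: expands to non-zero only when running on Valgrind,
        and to 0 (at trivial cost) when the program runs natively. */
     if (RUNNING_ON_VALGRIND)
        printf("running on Valgrind's synthetic CPU\n");
     else
        printf("running natively\n");
     return 0;
  }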

-A brief description of the available macros: -

-

- - - -

2.9  Support for POSIX Pthreads

- -As of late April 02, Valgrind supports programs which use POSIX pthreads. Doing this has proved technically challenging but is now mostly complete. It works well enough to run significant threaded applications.

-It works as follows: threaded apps are (dynamically) linked against -libpthread.so. Usually this is the one installed with -your Linux distribution. Valgrind, however, supplies its own -libpthread.so and automatically connects your program to -it instead. -

-The fake libpthread.so and Valgrind cooperate to implement a user-space pthreads package. This approach avoids the horrible problems of implementing a truly multiprocessor version of Valgrind, but it does mean that threaded apps run only on one CPU, even if you have a multiprocessor machine.

-Valgrind schedules your threads in a round-robin fashion, with all threads having equal priority. It switches threads every 50000 basic blocks (typically around 300000 x86 instructions), which means you'll get a much finer interleaving of thread executions than when run natively. This in itself may cause your program to behave differently if you have concurrency, race, locking, or similar bugs.

-The current (valgrind-1.0 release) state of pthread support is as -follows: -

- - -As of 18 May 02, the following threaded programs now work fine on my -RedHat 7.2 box: Opera 6.0Beta2, KNode in KDE 3.0, Mozilla-0.9.2.1 and -Galeon-0.11.3, both as supplied with RedHat 7.2. Also Mozilla 1.0RC2. -OpenOffice 1.0. MySQL 3.something (the current stable release). - - -

2.10  Building and installing

- -We now use the standard Unix ./configure, -make, make install mechanism, and I have -attempted to ensure that it works on machines with kernel 2.2 or 2.4 -and glibc 2.1.X or 2.2.X. I don't think there is much else to say. -There are no options apart from the usual --prefix that -you should give to ./configure. - -

-The configure script tests the version of the X server indicated by the current $DISPLAY. This is a known bug. The intention was to detect the version of the current XFree86 client libraries, so that correct suppressions could be selected for them, but instead the test checks the server version. This is just plain wrong.

-If you are building a binary package of Valgrind for distribution, -please read README_PACKAGERS. It contains some important -information. - -

-Apart from that there is no excitement here. Let me know if you have -build problems. - - - - -

2.11  If you have problems

-Mail me (jseward@acm.org). - -

See Section 4 for the known limitations of -Valgrind, and for a list of programs which are known not to work on -it. - -

The translator/instrumentor has a lot of assertions in it. They -are permanently enabled, and I have no plans to disable them. If one -of these breaks, please mail me! - -

If you get an assertion failure on the expression -chunkSane(ch) in vg_free() in -vg_malloc.c, this may have happened because your program -wrote off the end of a malloc'd block, or before its beginning. -Valgrind should have emitted a proper message to that effect before -dying in this way. This is a known problem which I should fix. -

- -


- - -

3  Details of the checking machinery

- -Read this section if you want to know, in detail, exactly what and how -Valgrind is checking. - - -

3.1  Valid-value (V) bits

- -It is simplest to think of Valgrind implementing a synthetic Intel x86 -CPU which is identical to a real CPU, except for one crucial detail. -Every bit (literally) of data processed, stored and handled by the -real CPU has, in the synthetic CPU, an associated "valid-value" bit, -which says whether or not the accompanying bit has a legitimate value. -In the discussions which follow, this bit is referred to as the V -(valid-value) bit. - -

Each byte in the system therefore has 8 V bits which follow it wherever it goes. For example, when the CPU loads a word-size item (4 bytes) from memory, it also loads the corresponding 32 V bits from a bitmap which stores the V bits for the process' entire address space. If the CPU should later write the whole or some part of that value to memory at a different address, the relevant V bits will be stored back in the V-bit bitmap.

In short, each bit in the system has an associated V bit, which -follows it around everywhere, even inside the CPU. Yes, the CPU's -(integer and %eflags) registers have their own V bit -vectors. - -

Copying values around does not cause Valgrind to check for, or -report on, errors. However, when a value is used in a way which might -conceivably affect the outcome of your program's computation, the -associated V bits are immediately checked. If any of these indicate -that the value is undefined, an error is reported. - -

Here's an (admittedly nonsensical) example: -

-  int i, j;
-  int a[10], b[10];
-  for (i = 0; i < 10; i++) {
-    j = a[i];
-    b[i] = j;
-  }
-
- -

Valgrind emits no complaints about this, since it merely copies -uninitialised values from a[] into b[], and -doesn't use them in any way. However, if the loop is changed to -

-  for (i = 0; i < 10; i++) {
-    j += a[i];
-  }
-  if (j == 77) 
-     printf("hello there\n");
-
-then Valgrind will complain, at the if, that the -condition depends on uninitialised values. - -

Most low level operations, such as adds, cause Valgrind to -use the V bits for the operands to calculate the V bits for the -result. Even if the result is partially or wholly undefined, -it does not complain. - -

Checks on definedness only occur in two places: when a value is used to generate a memory address, and where a control flow decision needs to be made. Also, when a system call is detected, Valgrind checks the definedness of its parameters as required.
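To make the address-generation case concrete, here is a small invented sketch in the same spirit as the loop example above; the copies draw no complaint, but the use of an uninitialised value to form an array address does:

  int main(void)
  {
     int a[10];
     int i;               /* never initialised                               */
     int j = i;           /* the junk value is only copied: no complaint     */
     a[0] = j;            /* still only copying: no complaint                */
     j = a[i & 7];        /* i used to compute an address: error reported    */
     return 0;            /* (compile with -g and without -O, as advised)    */
  }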

If a check should detect undefinedness, an error message is -issued. The resulting value is subsequently regarded as well-defined. -To do otherwise would give long chains of error messages. In effect, -we say that undefined values are non-infectious. - -

This sounds overcomplicated. Why not just check all reads from -memory, and complain if an undefined value is loaded into a CPU register? -Well, that doesn't work well, because perfectly legitimate C programs routinely -copy uninitialised values around in memory, and we don't want endless complaints -about that. Here's the canonical example. Consider a struct -like this: -

-  struct S { int x; char c; };
-  struct S s1, s2;
-  s1.x = 42;
-  s1.c = 'z';
-  s2 = s1;
-
- -

The question to ask is: how large is struct S, in -bytes? An int is 4 bytes and a char one byte, so perhaps a struct S -occupies 5 bytes? Wrong. All (non-toy) compilers I know of will -round the size of struct S up to a whole number of words, -in this case 8 bytes. Not doing this forces compilers to generate -truly appalling code for subscripting arrays of struct -S's. - -

So s1 occupies 8 bytes, yet only 5 of them will be initialised. -For the assignment s2 = s1, gcc generates code to copy -all 8 bytes wholesale into s2 without regard for their -meaning. If Valgrind simply checked values as they came out of -memory, it would yelp every time a structure assignment like this -happened. So the more complicated semantics described above is -necessary. This allows gcc to copy s1 into -s2 any way it likes, and a warning will only be emitted -if the uninitialised values are later used. - -
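A quick way to convince yourself of the size claim (illustrative only; the exact padding is ABI- and compiler-dependent) is:

  #include <stdio.h>

  struct S { int x; char c; };

  int main(void)
  {
     /* On a typical 32-bit x86 ABI this prints 8, not 5: the compiler pads
        the struct to a whole number of words, and the padding bytes are
        never initialised by the assignments in the example above. */
     printf("sizeof(struct S) = %d\n", (int) sizeof(struct S));
     return 0;
  }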

One final twist to this story. The above scheme allows garbage to pass through the CPU's integer registers without complaint. It does this by giving the integer registers V tags, passing these around in the expected way. This is complicated and computationally expensive to do, but is necessary. Valgrind is more simplistic about floating-point loads and stores. In particular, V bits for data read as a result of floating-point loads are checked at the load instruction. So if your program uses the floating-point registers to do memory-to-memory copies, you will get complaints about uninitialised values. Fortunately, I have not yet encountered a program which (ab)uses the floating-point registers in this way.

3.2  Valid-address (A) bits

- -Notice that the previous section describes how the validity of values -is established and maintained without having to say whether the -program does or does not have the right to access any particular -memory location. We now consider the latter issue. - -

As described above, every bit in memory or in the CPU has an associated valid-value (V) bit. In addition, all bytes in memory, but not in the CPU, have an associated valid-address (A) bit. This indicates whether or not the program can legitimately read or write that location. It does not give any indication of the validity of the data at that location -- that's the job of the V bits -- only whether or not the location may be accessed.

Every time your program reads or writes memory, Valgrind checks the -A bits associated with the address. If any of them indicate an -invalid address, an error is emitted. Note that the reads and writes -themselves do not change the A bits, only consult them. - -

So how do the A bits get set/cleared? Like this: - -

- - - -

3.3  Putting it all together

-Valgrind's checking machinery can be summarised as follows: - - - -Valgrind intercepts calls to malloc, calloc, realloc, valloc, -memalign, free, new and delete. The behaviour you get is: - - - - - - -

3.4  Signals

- -Valgrind provides suitable handling of signals, so, provided you stick -to POSIX stuff, you should be ok. Basic sigaction() and sigprocmask() -are handled. Signal handlers may return in the normal way or do -longjmp(); both should work ok. As specified by POSIX, a signal is -blocked in its own handler. Default actions for signals should work -as before. Etc, etc. - -

Under the hood, dealing with signals is a real pain, and Valgrind's -simulation leaves much to be desired. If your program does -way-strange stuff with signals, bad things may happen. If so, let me -know. I don't promise to fix it, but I'd at least like to be aware of -it. - - - -

3.5  Memory leak detection

- -Valgrind keeps track of all memory blocks issued in response to calls -to malloc/calloc/realloc/new. So when the program exits, it knows -which blocks are still outstanding -- have not been returned, in other -words. Ideally, you want your program to have no blocks still in use -at exit. But many programs do. - -

For each such block, Valgrind scans the entire address space of the process, looking for pointers to the block. One of three situations may result: a pointer to the start of the block is found, in which case the block is probably still in use; only a pointer to the interior of the block is found, in which case the block is reported as dubious; or no pointer to the block is found at all, in which case the block is reported as leaked.

- -Valgrind reports summaries about leaked and dubious blocks. -For each such block, it will also tell you where the block was -allocated. This should help you figure out why the pointer to it has -been lost. In general, you should attempt to ensure your programs do -not have any leaked or dubious blocks at exit. - -
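As a concrete illustration (an invented sketch, not a shipped test program), the following run with --leak-check=yes should report one leaked block and one dubious block, although the exact classification can vary if stray copies of the pointers happen to survive in registers or on the stack:

  #include <stdlib.h>

  static char *interior;               /* keeps only an interior pointer alive */

  int main(void)
  {
     char *lost = malloc(100);         /* no pointer survives: leaked          */
     char *mid  = malloc(100);
     interior = mid + 50;              /* only a mid-block pointer survives:   */
     mid  = NULL;                      /*   reported as dubious                */
     lost = NULL;
     return 0;
  }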

The precise area of memory in which Valgrind searches for pointers is: all naturally-aligned 4-byte words for which all A bits indicate addressability and all V bits indicate that the stored value is actually valid.


- - - -

4  Limitations

- -The following list of limitations seems depressingly long. However, -most programs actually work fine. - -

Valgrind will run x86-GNU/Linux ELF dynamically linked binaries, on -a kernel 2.2.X or 2.4.X system, subject to the following constraints: - -

- -Programs which are known not to work are: - - - -Known platform-specific limitations, as of release 1.0.0: - - - - -


- - - -

5  How it works -- a rough overview

-Some gory details, for those with a passion for gory details. You -don't need to read this section if all you want to do is use Valgrind. - - -

5.1  Getting started

- -Valgrind is compiled into a shared object, valgrind.so. The shell -script valgrind sets the LD_PRELOAD environment variable to point to -valgrind.so. This causes the .so to be loaded as an extra library to -any subsequently executed dynamically-linked ELF binary, viz, the -program you want to debug. - -

The dynamic linker allows each .so in the process image to have an -initialisation function which is run before main(). It also allows -each .so to have a finalisation function run after main() exits. - -

When valgrind.so's initialisation function is called by the dynamic linker, the synthetic CPU starts up. The real CPU remains locked in valgrind.so for the entire rest of the program, but the synthetic CPU returns from the initialisation function. Startup of the program now continues as usual -- the dynamic linker calls all the other .so's initialisation routines, and eventually runs main(). This all runs on the synthetic CPU, not the real one, but the client program cannot tell the difference.

Eventually main() exits, so the synthetic CPU calls valgrind.so's -finalisation function. Valgrind detects this, and uses it as its cue -to exit. It prints summaries of all errors detected, possibly checks -for memory leaks, and then exits the finalisation routine, but now on -the real CPU. The synthetic CPU has now lost control -- permanently --- so the program exits back to the OS on the real CPU, just as it -would have done anyway. - -

On entry, Valgrind switches stacks, so it runs on its own stack. -On exit, it switches back. This means that the client program -continues to run on its own stack, so we can switch back and forth -between running it on the simulated and real CPUs without difficulty. -This was an important design decision, because it makes it easy (well, -significantly less difficult) to debug the synthetic CPU. - - - -

5.2  The translation/instrumentation engine

- -Valgrind does not directly run any of the original program's code. Only -instrumented translations are run. Valgrind maintains a translation -table, which allows it to find the translation quickly for any branch -target (code address). If no translation has yet been made, the -translator - a just-in-time translator - is summoned. This makes an -instrumented translation, which is added to the collection of -translations. Subsequent jumps to that address will use this -translation. - -

Valgrind no longer directly supports detection of self-modifying -code. Such checking is expensive, and in practice (fortunately) -almost no applications need it. However, to help people who are -debugging dynamic code generation systems, there is a Client Request -(basically a macro you can put in your program) which directs Valgrind -to discard translations in a given address range. So Valgrind can -still work in this situation provided the client tells it when -code has become out-of-date and needs to be retranslated. - -

The JITter translates basic blocks -- blocks of straight-line code -- as single entities. To minimise the considerable difficulties of dealing with the x86 instruction set, x86 instructions are first translated to a RISC-like intermediate code, similar to sparc code, but with an infinite number of virtual integer registers. Initially each insn is translated separately, and there is no attempt at instrumentation.

The intermediate code is improved, mostly so as to try and cache the simulated machine's registers in the real machine's registers over several simulated instructions. This is often very effective. Also, we try to remove redundant updates of the simulated machine's condition-code register.

The intermediate code is then instrumented, giving more -intermediate code. There are a few extra intermediate-code operations -to support instrumentation; it is all refreshingly simple. After -instrumentation there is a cleanup pass to remove redundant value -checks. - -

This gives instrumented intermediate code which mentions arbitrary -numbers of virtual registers. A linear-scan register allocator is -used to assign real registers and possibly generate spill code. All -of this is still phrased in terms of the intermediate code. This -machinery is inspired by the work of Reuben Thomas (MITE). - -

Then, and only then, is the final x86 code emitted. The -intermediate code is carefully designed so that x86 code can be -generated from it without need for spare registers or other -inconveniences. - -

The translations are managed using a traditional LRU-based caching -scheme. The translation cache has a default size of about 14MB. - - - -

5.3  Tracking the status of memory

Each byte in the -process' address space has nine bits associated with it: one A bit and -eight V bits. The A and V bits for each byte are stored using a -sparse array, which flexibly and efficiently covers arbitrary parts of -the 32-bit address space without imposing significant space or -performance overheads for the parts of the address space never -visited. The scheme used, and speedup hacks, are described in detail -at the top of the source file vg_memory.c, so you should read that for -the gory details. - - - -
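For flavour, here is a much-simplified sketch of one plausible two-level layout. This is an assumption for illustration only; the real data structure and its speedup hacks, described at the top of vg_memory.c, differ in detail.

  #include <stdlib.h>
  #include <string.h>
  #include <stdint.h>

  /* Simplified two-level shadow memory: each 64KB chunk of the 32-bit
     address space gets its own secondary map, allocated only on demand,
     so untouched regions cost nothing beyond one null pointer slot. */
  typedef struct {
     uint8_t abits[8192];          /* 1 A bit per byte of the 64KB chunk  */
     uint8_t vbytes[65536];        /* 8 V bits per byte of the chunk      */
  } SecMap;

  static SecMap *primary[65536];   /* one slot per 64KB of address space  */

  static SecMap *get_secmap(uint32_t addr)
  {
     uint32_t i = addr >> 16;
     if (primary[i] == NULL) {
        primary[i] = malloc(sizeof(SecMap));
        memset(primary[i]->abits,  0x00, sizeof(primary[i]->abits));  /* all unaddressable */
        memset(primary[i]->vbytes, 0xFF, sizeof(primary[i]->vbytes)); /* all undefined     */
     }
     return primary[i];
  }

  static int addr_is_addressable(uint32_t addr)   /* consult one byte's A bit */
  {
     SecMap  *sm  = get_secmap(addr);
     uint32_t off = addr & 0xFFFF;
     return (sm->abits[off >> 3] >> (off & 7)) & 1;
  }

  int main(void)
  {
     return addr_is_addressable(0x08048000);   /* demand-creates one secondary map */
  }

The point of the two levels is that the 64K-entry primary array is tiny, and secondary maps are only ever allocated for the parts of the address space the program actually touches, which is what keeps the scheme cheap for unvisited regions.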

5.4 System calls

-All system calls are intercepted. The memory status map is consulted -before and updated after each call. It's all rather tiresome. See -vg_syscall_mem.c for details. - - - -

5.5  Signals

-All system calls to sigaction() and sigprocmask() are intercepted. If -the client program is trying to set a signal handler, Valgrind makes a -note of the handler address and which signal it is for. Valgrind then -arranges for the same signal to be delivered to its own handler. - -

When such a signal arrives, Valgrind's own handler catches it, and -notes the fact. At a convenient safe point in execution, Valgrind -builds a signal delivery frame on the client's stack and runs its -handler. If the handler longjmp()s, there is nothing more to be said. -If the handler returns, Valgrind notices this, zaps the delivery -frame, and carries on where it left off before delivering the signal. - -

The purpose of this nonsense is that setting signal handlers -essentially amounts to giving callback addresses to the Linux kernel. -We can't allow this to happen, because if it did, signal handlers -would run on the real CPU, not the simulated one. This means the -checking machinery would not operate during the handler run, and, -worse, memory permissions maps would not be updated, which could cause -spurious error reports once the handler had returned. - -

An even worse thing would happen if the signal handler longjmp'd -rather than returned: Valgrind would completely lose control of the -client program. - -

Upshot: we can't allow the client to install signal handlers directly. Instead, Valgrind must catch, on behalf of the client, any signal the client asks to catch, and must deliver it to the client on the simulated CPU, not the real one. This involves considerable gruesome fakery; see vg_signals.c for details.

- -


- - -

6  Example

-This is the log for a run of a small program. The program is in fact correct, and the reported error is the result of a potentially serious code generation bug in GNU g++ (snapshot 20010527).
-sewardj@phoenix:~/newmat10$
-~/Valgrind-6/valgrind -v ./bogon 
-==25832== Valgrind 0.10, a memory error detector for x86 RedHat 7.1.
-==25832== Copyright (C) 2000-2001, and GNU GPL'd, by Julian Seward.
-==25832== Startup, with flags:
-==25832== --suppressions=/home/sewardj/Valgrind/redhat71.supp
-==25832== reading syms from /lib/ld-linux.so.2
-==25832== reading syms from /lib/libc.so.6
-==25832== reading syms from /mnt/pima/jrs/Inst/lib/libgcc_s.so.0
-==25832== reading syms from /lib/libm.so.6
-==25832== reading syms from /mnt/pima/jrs/Inst/lib/libstdc++.so.3
-==25832== reading syms from /home/sewardj/Valgrind/valgrind.so
-==25832== reading syms from /proc/self/exe
-==25832== loaded 5950 symbols, 142333 line number locations
-==25832== 
-==25832== Invalid read of size 4
-==25832==    at 0x8048724: _ZN10BandMatrix6ReSizeEiii (bogon.cpp:45)
-==25832==    by 0x80487AF: main (bogon.cpp:66)
-==25832==    by 0x40371E5E: __libc_start_main (libc-start.c:129)
-==25832==    by 0x80485D1: (within /home/sewardj/newmat10/bogon)
-==25832==    Address 0xBFFFF74C is not stack'd, malloc'd or free'd
-==25832==
-==25832== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
-==25832== malloc/free: in use at exit: 0 bytes in 0 blocks.
-==25832== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
-==25832== For a detailed leak analysis, rerun with: --leak-check=yes
-==25832==
-==25832== exiting, did 1881 basic blocks, 0 misses.
-==25832== 223 translations, 3626 bytes in, 56801 bytes out.
-
-

The GCC folks fixed this about a week before gcc-3.0 shipped. -


-

- - - - -

7  Cache profiling

-As well as memory debugging, Valgrind also allows you to do cache simulations and annotate your source line-by-line with the number of cache misses. In particular, it records instruction cache reads and misses, data cache reads and read misses, data cache writes and write misses, and the L2 misses arising from each of these. On a modern x86 machine, an L1 miss will typically cost around 10 cycles, and an L2 miss can cost as much as 200 cycles. Detailed cache profiling can be very useful for improving the performance of your program.

- -Also, since one instruction cache read is performed per instruction executed, -you can find out how many instructions are executed per line, which can be -useful for traditional profiling and test coverage.

- -Any feedback, bug-fixes, suggestions, etc, welcome. - - -

7.1  Overview

-First off, as for normal Valgrind use, you probably want to turn on debugging -info (the -g flag). But by contrast with normal Valgrind use, you -probably do want to turn optimisation on, since you should profile your -program as it will be normally run. - -The two steps are: -
  1. Run your program with cachegrind in front of the normal command
     line invocation.  When the program finishes, Valgrind will print
     summary cache statistics.  It also collects line-by-line
     information in a file cachegrind.out.pid, where pid is the
     program's process id.

     This step should be done every time you want to collect
     information about a new program, a changed program, or about the
     same program with different input.

  2. Generate a function-by-function summary, and possibly annotate
     source files, with vg_annotate.  Source files to annotate can be
     specified manually on the command line, or "interesting" source
     files can be annotated automatically with the --auto=yes option.
     You can annotate C/C++ files or assembly language files equally
     easily.

     This step can be performed as many times as you like for each
     profiling run.  You may want to do multiple annotations showing
     different information each time.
- -The steps are described in detail in the following sections.

- - -

7.2  Cache simulation specifics

- -Cachegrind uses a simulation for a machine with a split L1 cache and a unified -L2 cache. This configuration is used for all (modern) x86-based machines we -are aware of. Old Cyrix CPUs had a unified I and D L1 cache, but they are -ancient history now.

- -The more specific characteristics of the simulation are as follows. - -

- -The cache configuration simulated (cache size, associativity and line size) is -determined automagically using the CPUID instruction. If you have an old -machine that (a) doesn't support the CPUID instruction, or (b) supports it in -an early incarnation that doesn't give any cache information, then Cachegrind -will fall back to using a default configuration (that of a model 3/4 Athlon). -Cachegrind will tell you if this happens. You can manually specify one, two or -all three levels (I1/D1/L2) of the cache from the command line using the ---I1, --D1 and --L2 options.

- -Other noteworthy behaviour: - -

- -If you are interested in simulating a cache with different properties, it is -not particularly hard to write your own cache simulator, or to modify the -existing ones in vg_cachesim_I1.c, vg_cachesim_D1.c, -vg_cachesim_L2.c and vg_cachesim_gen.c. We'd be -interested to hear from anyone who does. - - -

7.3  Profiling programs

- -Cache profiling is enabled by using the --cachesim=yes -option to the valgrind shell script. Alternatively, it -is probably more convenient to use the cachegrind script. -Either way automatically turns off Valgrind's memory checking functions, -since the cache simulation is slow enough already, and you probably -don't want to do both at once. -

-To gather cache profiling information about the program ls --l, type: - -

cachegrind ls -l
- -The program will execute (slowly). Upon completion, summary statistics -that look like this will be printed: - -
-==31751== I   refs:      27,742,716
-==31751== I1  misses:           276
-==31751== L2  misses:           275
-==31751== I1  miss rate:        0.0%
-==31751== L2i miss rate:        0.0%
-==31751== 
-==31751== D   refs:      15,430,290  (10,955,517 rd + 4,474,773 wr)
-==31751== D1  misses:        41,185  (    21,905 rd +    19,280 wr)
-==31751== L2  misses:        23,085  (     3,987 rd +    19,098 wr)
-==31751== D1  miss rate:        0.2% (       0.1%   +       0.4%)
-==31751== L2d miss rate:        0.1% (       0.0%   +       0.4%)
-==31751== 
-==31751== L2 misses:         23,360  (     4,262 rd +    19,098 wr)
-==31751== L2 miss rate:         0.0% (       0.0%   +       0.4%)
-
- -Cache accesses for instruction fetches are summarised first, giving the -number of fetches made (this is the number of instructions executed, which -can be useful to know in its own right), the number of I1 misses, and the -number of L2 instruction (L2i) misses.

- -Cache accesses for data follow. The information is similar to that of the -instruction fetches, except that the values are also shown split between reads -and writes (note each row's rd and wr values add up -to the row's total).

- -Combined instruction and data figures for the L2 cache follow that.

- - -

7.4  Output file

- -As well as printing summary information, Cachegrind also writes -line-by-line cache profiling information to a file named -cachegrind.out.pid. This file is human-readable, but is -best interpreted by the accompanying program vg_annotate, -described in the next section. -

-Things to note about the cachegrind.out.pid file: -

- -Note that older versions of Cachegrind used a log file named -cachegrind.out (i.e. no .pid suffix). -The suffix serves two purposes. Firstly, it means you don't have to rename old -log files that you don't want to overwrite. Secondly, and more importantly, -it allows correct profiling with the --trace-children=yes option -of programs that spawn child processes. - - -

7.5  Cachegrind options

-Cachegrind accepts all the options that Valgrind does, although some of them -(ones related to memory checking) don't do anything when cache profiling.

- -The interesting cache-simulation specific options are: - -

- - - -

7.6  Annotating C/C++ programs

- -Before using vg_annotate, it is worth widening your -window to be at least 120-characters wide if possible, as the output -lines can be quite long. -

-To get a function-by-function summary, run vg_annotate ---pid in a directory containing a -cachegrind.out.pid file. The --pid -is required so that vg_annotate knows which log file to use when -several are present. -

-The output looks like this: - -

---------------------------------------------------------------------------------
-I1 cache:              65536 B, 64 B, 2-way associative
-D1 cache:              65536 B, 64 B, 2-way associative
-L2 cache:              262144 B, 64 B, 8-way associative
-Command:               concord vg_to_ucode.c
-Events recorded:       Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
-Events shown:          Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
-Event sort order:      Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
-Threshold:             99%
-Chosen for annotation:
-Auto-annotation:       on
-
---------------------------------------------------------------------------------
-Ir         I1mr I2mr Dr         D1mr   D2mr  Dw        D1mw   D2mw
---------------------------------------------------------------------------------
-27,742,716  276  275 10,955,517 21,905 3,987 4,474,773 19,280 19,098  PROGRAM TOTALS
-
---------------------------------------------------------------------------------
-Ir        I1mr I2mr Dr        D1mr  D2mr  Dw        D1mw   D2mw    file:function
---------------------------------------------------------------------------------
-8,821,482    5    5 2,242,702 1,621    73 1,794,230      0      0  getc.c:_IO_getc
-5,222,023    4    4 2,276,334    16    12   875,959      1      1  concord.c:get_word
-2,649,248    2    2 1,344,810 7,326 1,385         .      .      .  vg_main.c:strcmp
-2,521,927    2    2   591,215     0     0   179,398      0      0  concord.c:hash
-2,242,740    2    2 1,046,612   568    22   448,548      0      0  ctype.c:tolower
-1,496,937    4    4   630,874 9,000 1,400   279,388      0      0  concord.c:insert
-  897,991   51   51   897,831    95    30        62      1      1  ???:???
-  598,068    1    1   299,034     0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__flockfile
-  598,068    0    0   299,034     0     0   149,517      0      0  ../sysdeps/generic/lockfile.c:__funlockfile
-  598,024    4    4   213,580    35    16   149,506      0      0  vg_clientmalloc.c:malloc
-  446,587    1    1   215,973 2,167   430   129,948 14,057 13,957  concord.c:add_existing
-  341,760    2    2   128,160     0     0   128,160      0      0  vg_clientmalloc.c:vg_trap_here_WRAPPER
-  320,782    4    4   150,711   276     0    56,027     53     53  concord.c:init_hash_table
-  298,998    1    1   106,785     0     0    64,071      1      1  concord.c:create
-  149,518    0    0   149,516     0     0         1      0      0  ???:tolower@@GLIBC_2.0
-  149,518    0    0   149,516     0     0         1      0      0  ???:fgetc@@GLIBC_2.0
-   95,983    4    4    38,031     0     0    34,409  3,152  3,150  concord.c:new_word_node
-   85,440    0    0    42,720     0     0    21,360      0      0  vg_clientmalloc.c:vg_bogus_epilogue
-
- -First up is a summary of the annotation options: - - - -Then follows summary statistics for the whole program. These are similar -to the summary provided when running cachegrind.

- -Then follows function-by-function statistics. Each function is identified by a file_name:function_name pair. If a column contains only a dot it means the function never performs that event (eg. the third row shows that strcmp() contains no instructions that write to memory). The name ??? is used if the file name and/or function name could not be determined from debugging information. If most of the entries have the form ???:??? the program probably wasn't compiled with -g. If any code was invalidated (either due to self-modifying code or unloading of shared objects) its counts are aggregated into a single cost centre written as (discarded):(discarded).

- -It is worth noting that functions will come from three types of source files: -

  1. From the profiled program (concord.c in this example).

  2. From libraries (eg. getc.c).

  3. From Valgrind's implementation of some libc functions (eg.
     vg_clientmalloc.c:malloc).  These are recognisable because the
     filename begins with vg_, and is probably one of vg_main.c,
     vg_clientmalloc.c or vg_mylibc.c.
- -There are two ways to annotate source files -- by choosing them -manually, or with the --auto=yes option. To do it -manually, just specify the filenames as arguments to -vg_annotate. For example, the output from running -vg_annotate concord.c for our example produces the same -output as above followed by an annotated version of -concord.c, a section of which looks like: - -
---------------------------------------------------------------------------------
--- User-annotated source: concord.c
---------------------------------------------------------------------------------
-Ir        I1mr I2mr Dr      D1mr  D2mr  Dw      D1mw   D2mw
-
-[snip]
-
-        .    .    .       .     .     .       .      .      .  void init_hash_table(char *file_name, Word_Node *table[])
-        3    1    1       .     .     .       1      0      0  {
-        .    .    .       .     .     .       .      .      .      FILE *file_ptr;
-        .    .    .       .     .     .       .      .      .      Word_Info *data;
-        1    0    0       .     .     .       1      1      1      int line = 1, i;
-        .    .    .       .     .     .       .      .      .
-        5    0    0       .     .     .       3      0      0      data = (Word_Info *) create(sizeof(Word_Info));
-        .    .    .       .     .     .       .      .      .
-    4,991    0    0   1,995     0     0     998      0      0      for (i = 0; i < TABLE_SIZE; i++)
-    3,988    1    1   1,994     0     0     997     53     52          table[i] = NULL;
-        .    .    .       .     .     .       .      .      .
-        .    .    .       .     .     .       .      .      .      /* Open file, check it. */
-        6    0    0       1     0     0       4      0      0      file_ptr = fopen(file_name, "r");
-        2    0    0       1     0     0       .      .      .      if (!(file_ptr)) {
-        .    .    .       .     .     .       .      .      .          fprintf(stderr, "Couldn't open '%s'.\n", file_name);
-        1    1    1       .     .     .       .      .      .          exit(EXIT_FAILURE);
-        .    .    .       .     .     .       .      .      .      }
-        .    .    .       .     .     .       .      .      .
-  165,062    1    1  73,360     0     0  91,700      0      0      while ((line = get_word(data, line, file_ptr)) != EOF)
-  146,712    0    0  73,356     0     0  73,356      0      0          insert(data->word, data->line, table);
-        .    .    .       .     .     .       .      .      .
-        4    0    0       1     0     0       2      0      0      free(data);
-        4    0    0       1     0     0       2      0      0      fclose(file_ptr);
-        3    0    0       2     0     0       .      .      .  }
-
- -(Although column widths are automatically minimised, a wide terminal is clearly -useful.)

- -Each source file is clearly marked (User-annotated source) as -having been chosen manually for annotation. If the file was found in one of -the directories specified with the -I/--include -option, the directory and file are both given.

- -Each line is annotated with its event counts. Events not applicable for a line -are represented by a `.'; this is useful for distinguishing between an event -which cannot happen, and one which can but did not.

- -Sometimes only a small section of a source file is executed. To minimise -uninteresting output, Valgrind only shows annotated lines and lines within a -small distance of annotated lines. Gaps are marked with the line numbers so -you know which part of a file the shown code comes from, eg: - -

-(figures and code for line 704)
--- line 704 ----------------------------------------
--- line 878 ----------------------------------------
-(figures and code for line 878)
-
- -The amount of context to show around annotated lines is controlled by the ---context option.

- -To get automatic annotation, run vg_annotate --auto=yes. -vg_annotate will automatically annotate every source file it can find that is -mentioned in the function-by-function summary. Therefore, the files chosen for -auto-annotation are affected by the --sort and ---threshold options. Each source file is clearly marked -(Auto-annotated source) as being chosen automatically. Any files -that could not be found are mentioned at the end of the output, eg: - -

---------------------------------------------------------------------------------
-The following files chosen for auto-annotation could not be found:
---------------------------------------------------------------------------------
-  getc.c
-  ctype.c
-  ../sysdeps/generic/lockfile.c
-
- -This is quite common for library files, since libraries are usually compiled -with debugging information, but the source files are often not present on a -system. If a file is chosen for annotation both manually and -automatically, it is marked as User-annotated source. - -Use the -I/--include option to tell Valgrind where to look for -source files if the filenames found from the debugging information aren't -specific enough. - -Beware that vg_annotate can take some time to digest large -cachegrind.out.pid files, e.g. 30 seconds or more. Also -beware that auto-annotation can produce a lot of output if your program is -large! - - -

7.7  Annotating assembler programs

- -Valgrind can annotate assembler programs too, or annotate the -assembler generated for your C program. Sometimes this is useful for -understanding what is really happening when an interesting line of C -code is translated into multiple instructions.

- -To do this, you just need to assemble your .s files with -assembler-level debug information. gcc doesn't do this, but you can -use the GNU assembler with the --gstabs option to -generate object files with this information, eg: - -

as --gstabs foo.s
- -You can then profile and annotate source files in the same way as for C/C++ -programs. - - -

7.8  vg_annotate options

- - - -

7.9  Warnings

-There are a couple of situations in which vg_annotate issues warnings. - - - - -

7.10  Things to watch out for

-Some odd things that can occur during annotation: - - - -This list looks long, but these cases should be fairly rare.

-Note: stabs is not an easy format to read. If you come across bizarre -annotations that look like they might be caused by a bug in the stabs reader, -please let us know.

- - -

7.11  Accuracy

-Valgrind's cache profiling has a number of shortcomings: - - - -Another thing worth noting is that results are very sensitive. Changing the -size of the valgrind.so file, the size of the program being -profiled, or even the length of its name can perturb the results. Variations -will be small, but don't expect perfectly repeatable results if your program -changes at all.

- -While these factors mean you shouldn't trust the results to be super-accurate, -hopefully they should be close enough to be useful.

- - -

7.12  Todo

- -
- - - diff --git a/docs/nav.html b/docs/nav.html deleted file mode 100644 index ad920ad44..000000000 --- a/docs/nav.html +++ /dev/null @@ -1,72 +0,0 @@ - - - Valgrind - - - - - -
- Contents of this manual
- 1 Introduction
- 1.1 What Valgrind is for
- 1.2 What it does with - your program -

- 2 How to use it, and how to - make sense of the results
- 2.1 Getting started
- 2.2 The commentary
- 2.3 Reporting of errors
- 2.4 Suppressing errors
- 2.5 Command-line flags
- 2.6 Explanation of error messages
- 2.7 Writing suppressions files
- 2.8 The Client Request mechanism
- 2.9 Support for POSIX pthreads
- 2.10 Building and installing
- 2.11 If you have problems -

- 3 Details of the checking machinery
- 3.1 Valid-value (V) bits
- 3.2 Valid-address (A) bits
- 3.3 Putting it all together
- 3.4 Signals
- 3.5 Memory leak detection -

- 4 Limitations
-

- 5 How it works -- a rough overview
- 5.1 Getting started
- 5.2 The translation/instrumentation engine
- 5.3 Tracking the status of memory
- 5.4 System calls
- 5.5 Signals -

- 6 An example
-

- 7 Cache profiling -

- 8 The design and implementation of Valgrind
- - - diff --git a/docs/techdocs.html b/docs/techdocs.html deleted file mode 100644 index 2e1cc8b7e..000000000 --- a/docs/techdocs.html +++ /dev/null @@ -1,2524 +0,0 @@ - - - - The design and implementation of Valgrind - - - - -  -

The design and implementation of Valgrind

- -
-Detailed technical notes for hackers, maintainers and the -overly-curious
-These notes pertain to snapshot 20020306
-

-jseward@acm.org
-
http://developer.kde.org/~sewardj
-Copyright © 2000-2002 Julian Seward -

-Valgrind is licensed under the GNU General Public License, -version 2
-An open-source tool for finding memory-management problems in -x86 GNU/Linux executables. -

- -

- - - - -


- -

Introduction

- -This document contains a detailed, highly-technical description of the -internals of Valgrind. This is not the user manual; if you are an -end-user of Valgrind, you do not want to read this. Conversely, if -you really are a hacker-type and want to know how it works, I assume -that you have read the user manual thoroughly. -

-You may need to read this document several times, and carefully. Some -important things, I only say once. - - -

History

- -Valgrind came into public view in late Feb 2002. However, it has been -under contemplation for a very long time, perhaps seriously for about -five years. Somewhat over two years ago, I started working on the x86 -code generator for the Glasgow Haskell Compiler -(http://www.haskell.org/ghc), gaining familiarity with x86 internals -on the way. I then did Cacheprof (http://www.cacheprof.org), gaining -further x86 experience. Some time around Feb 2000 I started -experimenting with a user-space x86 interpreter for x86-Linux. This -worked, but it was clear that a JIT-based scheme would be necessary to -give reasonable performance for Valgrind. Design work for the JITter -started in earnest in Oct 2000, and by early 2001 I had an x86-to-x86 -dynamic translator which could run quite large programs. This -translator was in a sense pointless, since it did not do any -instrumentation or checking. - -

-Most of the rest of 2001 was taken up designing and implementing the -instrumentation scheme. The main difficulty, which consumed a lot -of effort, was to design a scheme which did not generate large numbers -of false uninitialised-value warnings. By late 2001 a satisfactory -scheme had been arrived at, and I started to test it on ever-larger -programs, with an eventual eye to making it work well enough so that -it was helpful to folks debugging the upcoming version 3 of KDE. I've -used KDE since before version 1.0, and wanted Valgrind to be an -indirect contribution to the KDE 3 development effort. At the start of -Feb 02 the kde-core-devel crew started using it, and gave a huge -amount of helpful feedback and patches in the space of three weeks. -Snapshot 20020306 is the result.

-In the best Unix tradition, or perhaps in the spirit of Fred Brooks' -depressing-but-completely-accurate epitaph "build one to throw away; -you will anyway", much of Valgrind is a second or third rendition of -the initial idea. The instrumentation machinery -(vg_translate.c, vg_memory.c) and core CPU -simulation (vg_to_ucode.c, vg_from_ucode.c) -have had three redesigns and rewrites; the register allocator, -low-level memory manager (vg_malloc2.c) and symbol table -reader (vg_symtab2.c) are on the second rewrite. In a -sense, this document serves to record some of the knowledge gained as -a result. - - -

Design overview

- -Valgrind is compiled into a Linux shared object, -valgrind.so, and also a dummy one, -valgrinq.so, of which more later. The -valgrind shell script adds valgrind.so to -the LD_PRELOAD list of extra libraries to be -loaded with any dynamically linked library. This is a standard trick, -one which I assume the LD_PRELOAD mechanism was developed -to support. - -

-valgrind.so -is linked with the -z initfirst flag, which requests that -its initialisation code is run before that of any other object in the -executable image. When this happens, valgrind gains control. The -real CPU becomes "trapped" in valgrind.so and the -translations it generates. The synthetic CPU provided by Valgrind -does, however, return from this initialisation function. So the -normal startup actions, orchestrated by the dynamic linker -ld.so, continue as usual, except on the synthetic CPU, -not the real one. Eventually main is run and returns, -and then the finalisation code of the shared objects is run, -presumably in inverse order to which they were initialised. Remember, -this is still all happening on the simulated CPU. Eventually -valgrind.so's own finalisation code is called. It spots -this event, shuts down the simulated CPU, prints any error summaries -and/or does leak detection, and returns from the initialisation code -on the real CPU. At this point, in effect the real and synthetic CPUs -have merged back into one, Valgrind has lost control of the program, -and the program finally exit()s back to the kernel in the -usual way. - -

-The normal course of activity, once Valgrind has started up, is as -follows. Valgrind never runs any part of your program (usually -referred to as the "client"), not a single byte of it, directly. -Instead it uses function VG_(translate) to translate -basic blocks (BBs, straight-line sequences of code) into instrumented -translations, and those are run instead. The translations are stored -in the translation cache (TC), vg_tc, with the -translation table (TT), vg_tt supplying the -original-to-translation code address mapping. Auxiliary array -VG_(tt_fast) is used as a direct-map cache for fast -lookups in TT; it usually achieves a hit rate of around 98% and -facilitates an orig-to-trans lookup in 4 x86 insns, which is not bad.
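-
-To make the fast lookup concrete, here is a minimal C sketch of such a
-direct-mapped cache. The table size, hash and names here (FastEntry,
-tt_fast_cache, full_tt_search, find_translation) are illustrative
-assumptions, not the real vg_transtab layout; on a miss the real
-dispatcher ends up in VG_(search_transtab), as described next.
-
-#define FAST_CACHE_SIZE 8192               /* must be a power of two */
-
-typedef struct {
-   unsigned int orig_addr;                 /* client code address        */
-   unsigned int trans_addr;                /* address of its translation */
-} FastEntry;
-
-static FastEntry tt_fast_cache[FAST_CACHE_SIZE];
-
-/* Slow path: search the full translation table; returns 0 if the block
-   has not been translated yet.  Elided in this sketch. */
-extern unsigned int full_tt_search ( unsigned int orig_addr );
-
-static unsigned int find_translation ( unsigned int orig_addr )
-{
-   /* Mask, index, compare: this is why the hit path needs only a
-      handful of x86 instructions. */
-   FastEntry* e = &tt_fast_cache[(orig_addr >> 2) & (FAST_CACHE_SIZE-1)];
-   if (e->orig_addr == orig_addr)
-      return e->trans_addr;                /* hit, ~98% of the time */
-   /* Miss: consult the full table and refill this slot. */
-   {
-      unsigned int trans = full_tt_search(orig_addr);
-      if (trans != 0) {
-         e->orig_addr  = orig_addr;
-         e->trans_addr = trans;
-      }
-      return trans;                        /* 0: caller must translate */
-   }
-}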

-Function VG_(dispatch) in vg_dispatch.S is -the heart of the JIT dispatcher. Once a translated code address has -been found, it is executed simply by an x86 call -to the translation. At the end of the translation, the next -original code addr is loaded into %eax, and the -translation then does a ret, taking it back to the -dispatch loop, with, interestingly, zero branch mispredictions. -The address requested in %eax is looked up first in -VG_(tt_fast), and, if not found, by calling C helper -VG_(search_transtab). If there is still no translation -available, VG_(dispatch) exits back to the top-level -C dispatcher VG_(toploop), which arranges for -VG_(translate) to make a new translation. All fairly -unsurprising, really. There are various complexities described below. - -

-The translator, orchestrated by VG_(translate), is -complicated but entirely self-contained. It is described in great -detail in subsequent sections. Translations are stored in TC, with TT -tracking administrative information. The translations are subject to -an approximate LRU-based management scheme. With the current -settings, the TC can hold at most about 15MB of translations, and LRU -passes prune it to about 13.5MB. Given that the -orig-to-translation expansion ratio is about 13:1 to 14:1, this means -TC holds translations for more or less a megabyte of original code, -which generally comes to about 70000 basic blocks for C++ compiled -with optimisation on. Generating new translations is expensive, so it -is worth having a large TC to minimise the (capacity) miss rate. - -

-The dispatcher, VG_(dispatch), receives hints from -the translations which allow it to cheaply spot all control -transfers corresponding to x86 call and ret -instructions. It has to do this in order to spot some special events: -

-Valgrind intercepts the client's malloc, -free, etc, -calls, so that it can store additional information. Each block -malloc'd by the client gives rise to a shadow block -in which Valgrind stores the call stack at the time of the -malloc -call. When the client calls free, Valgrind tries to -find the shadow block corresponding to the address passed to -free, and emits an error message if none can be found. -If it is found, the block is placed on the freed blocks queue -vg_freed_list, it is marked as inaccessible, and -its shadow block now records the call stack at the time of the -free call. Keeping free'd blocks in -this queue allows Valgrind to spot all (presumably invalid) accesses -to them. However, once the volume of blocks in the free queue -exceeds VG_(clo_freelist_vol), blocks are finally -removed from the queue. - -
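-
-A rough C sketch of that bookkeeping follows. All the names here
-(ShadowBlock, client_free, the stub helpers) are invented for
-illustration; the real code in vg_clientmalloc.c differs in detail.
-
-#include <stddef.h>
-
-#define STACK_DEPTH 4
-
-typedef struct _ShadowBlock {
-   void*   base;                       /* address handed to the client */
-   size_t  size;
-   void*   allocd_at[STACK_DEPTH];     /* call stack at malloc time */
-   void*   freed_at [STACK_DEPTH];     /* call stack at free time   */
-   struct _ShadowBlock* next;          /* link in the freed queue   */
-} ShadowBlock;
-
-static ShadowBlock* freed_head = NULL;       /* oldest queued block */
-static ShadowBlock* freed_tail = NULL;       /* newest queued block */
-static size_t freed_vol = 0;                 /* bytes currently queued    */
-static size_t freed_vol_limit = 1000000;     /* cf. VG_(clo_freelist_vol) */
-
-/* Stubs standing in for the real machinery. */
-extern ShadowBlock* find_shadow     ( void* addr );
-extern void         record_stack    ( void** dst, int depth );
-extern void         mark_noaccess   ( void* addr, size_t len );
-extern void         really_free     ( ShadowBlock* sb );
-extern void         report_bad_free ( void* addr );
-
-static void client_free ( void* addr )
-{
-   ShadowBlock* sb = find_shadow(addr);
-   if (sb == NULL) { report_bad_free(addr); return; }
-
-   record_stack(sb->freed_at, STACK_DEPTH);
-   mark_noaccess(sb->base, sb->size);  /* later accesses become errors */
-
-   /* Park the block on the freed queue rather than recycling it. */
-   sb->next = NULL;
-   if (freed_tail) freed_tail->next = sb; else freed_head = sb;
-   freed_tail = sb;
-   freed_vol += sb->size;
-
-   /* Only once the queue is too big do blocks really get recycled. */
-   while (freed_vol > freed_vol_limit && freed_head) {
-      ShadowBlock* old = freed_head;
-      freed_head = old->next;
-      if (freed_head == NULL) freed_tail = NULL;
-      freed_vol -= old->size;
-      really_free(old);
-   }
-}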

-Keeping track of A and V bits (note: if you don't know what these are, -you haven't read the user guide carefully enough) for memory is done -in vg_memory.c. This implements a sparse array structure -which covers the entire 4G address space in a way which is reasonably -fast and reasonably space efficient. The 4G address space is divided -up into 64K sections, each covering 64Kb of address space. Given a -32-bit address, the top 16 bits are used to select one of the 65536 -entries in VG_(primary_map). The resulting "secondary" -(SecMap) holds A and V bits for the 64k of address space -chunk corresponding to the lower 16 bits of the address. - - -
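-
-A minimal sketch of the two-level lookup, with invented field and function
-names; the real SecMap in vg_memory.c packs its bits differently, but the
-shape is the same.
-
-#include <stdint.h>
-#include <stdlib.h>
-#include <string.h>
-
-/* One secondary map covers a 64Kb chunk of address space. */
-typedef struct {
-   uint8_t abits[65536 / 8];     /* 1 A bit  per byte: 1 = invalid   */
-   uint8_t vbits[65536];         /* 8 V bits per byte: 1 = undefined */
-} SecMap;
-
-static SecMap* primary_map[65536];  /* indexed by the top 16 address bits */
-
-/* Fetch the V bits for the byte at 'addr', creating the secondary
-   lazily.  Fresh chunks start as invalid-and-undefined (all ones). */
-static uint8_t get_vbyte ( uint32_t addr )
-{
-   SecMap** slot = &primary_map[addr >> 16];       /* top 16 bits    */
-   if (*slot == NULL) {
-      *slot = malloc(sizeof(SecMap));
-      if (*slot == NULL) abort();                  /* sketch only    */
-      memset(*slot, 0xFF, sizeof(SecMap));
-   }
-   return (*slot)->vbits[addr & 0xFFFF];           /* bottom 16 bits */
-}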

Design decisions

- -Some design decisions were motivated by the need to make Valgrind -debuggable. Imagine you are writing a CPU simulator. It works fairly -well. However, you run some large program, like Netscape, and after -tens of millions of instructions, it crashes. How can you figure out -where in your simulator the bug is? - -

-Valgrind's answer is: cheat. Valgrind is designed so that it is -possible to switch back to running the client program on the real -CPU at any point. Using the --stop-after= flag, you can -ask Valgrind to run just some number of basic blocks, and then -run the rest of the way on the real CPU. If you are searching for -a bug in the simulated CPU, you can use this to do a binary search, -which quickly leads you to the specific basic block which is -causing the problem. - -

-This is all very handy. It does constrain the design in certain -unimportant ways. Firstly, the layout of memory, when viewed from the -client's point of view, must be identical regardless of whether it is -running on the real or simulated CPU. This means that Valgrind can't -do pointer swizzling -- well, no great loss -- and it can't run on -the same stack as the client -- again, no great loss. -Valgrind operates on its own stack, VG_(stack), which -it switches to at startup, temporarily switching back to the client's -stack when doing system calls for the client. - -

-Valgrind also receives signals on its own stack, -VG_(sigstack), but for different gruesome reasons -discussed below. - -

-This nice clean switch-back-to-the-real-CPU-whenever-you-like story -is muddied by signals. Problem is that signals arrive at arbitrary -times and tend to slightly perturb the basic block count, with the -result that you can get close to the basic block causing a problem but -can't home in on it exactly. My kludgey hack is to define -SIGNAL_SIMULATION to 1 towards the bottom of -vg_syscall_mem.c, so that signal handlers are run on the -real CPU and don't change the BB counts. - -

-A second hole in the switch-back-to-real-CPU story is that Valgrind's -way of delivering signals to the client is different from that of the -kernel. Specifically, the layout of the signal delivery frame, and -the mechanism used to detect a sighandler returning, are different. -So you can't expect to make the transition inside a sighandler and -still have things working, but in practice that's not much of a -restriction. - -

-Valgrind's implementation of malloc, free, -etc, (in vg_clientmalloc.c, not the low-level stuff in -vg_malloc2.c) is somewhat complicated by the need to -handle switching back at arbitrary points. It does work, though.

Correctness

- -There's only one of me, and I have a Real Life (tm) as well as hacking -Valgrind [allegedly :-]. That means I don't have time to waste -chasing endless bugs in Valgrind. My emphasis is therefore on doing -everything as simply as possible, with correctness, stability and -robustness being the number one priority, more important than -performance or functionality. As a result: - - -

-Some more specific things are: - -

- -

Current limitations

- -No threads. I think fixing this is close to a research-grade problem. -

-No MMX. Fixing this should be relatively easy, using the same giant -trick used for x86 FPU instructions. See below. -

-Support for weird (non-POSIX) signal stuff is patchy. Does anybody -care? -

- - - - -


- -

The instrumenting JITter

- -This really is the heart of the matter. We begin with various side -issues. - -

Run-time storage, and the use of host registers

- -Valgrind translates client (original) basic blocks into instrumented -basic blocks, which live in the translation cache TC, until either the -client finishes or the translations are ejected from TC to make room -for newer ones. -

-Since it generates x86 code in memory, Valgrind has complete control -of the use of registers in the translations. Now pay attention. I -shall say this only once, and it is important you understand this. In -what follows I will refer to registers in the host (real) cpu using -their standard names, %eax, %edi, etc. I -refer to registers in the simulated CPU by capitalising them: -%EAX, %EDI, etc. These two sets of -registers usually bear no direct relationship to each other; there is -no fixed mapping between them. This naming scheme is used fairly -consistently in the comments in the sources. -

-Host registers, once things are up and running, are used as follows: -

- -

-The state of the simulated CPU is stored in memory, in -VG_(baseBlock), which is a block of 200 words IIRC. -Recall that %ebp points permanently at the start of this -block. Function vg_init_baseBlock decides what the -offsets of various entities in VG_(baseBlock) are to be, -and allocates word offsets for them. The code generator then emits -%ebp relative addresses to get at those things. The -sequence in which entities are allocated has been carefully chosen so -that the 32 most popular entities come first, because this means 8-bit -offsets can be used in the generated code. - -
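-
-The following C fragment sketches the offset-allocation idea with invented
-names; it is not the actual vg_init_baseBlock code.
-
-#define BASEBLOCK_WORDS 200
-
-static unsigned int baseBlock[BASEBLOCK_WORDS];   /* %ebp points here */
-static int next_free_word = 0;
-
-/* Hand out the next word offset; the code generator turns it into an
-   %ebp displacement of (offset * 4) bytes, so offsets 0..31 fit the
-   short 8-bit displacement form. */
-static int alloc_baseblock_word ( void )
-{
-   return next_free_word++;
-}
-
-/* Entities register themselves in rough popularity order, e.g.: */
-static int off_EAX, off_EAX_vbits;
-static void init_baseblock ( void )
-{
-   off_EAX       = alloc_baseblock_word();
-   off_EAX_vbits = alloc_baseblock_word();
-   /* ... and so on for the remaining registers, shadows, etc ... */
-}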

-If I was clever, I could make %ebp point 32 words along -VG_(baseBlock), so that I'd have another 32 words of -short-form offsets available, but that's just complicated, and it's -not important -- the first 32 words take 99% (or whatever) of the -traffic. - -

-Currently, the sequence of stuff in VG_(baseBlock) is as -follows: -

- -

-As a general rule, the simulated machine's state lives permanently in -memory at VG_(baseBlock). However, the JITter does some -optimisations which allow the simulated integer registers to be -cached in real registers over multiple simulated instructions within -the same basic block. These are always flushed back into memory at -the end of every basic block, so that the in-memory state is -up-to-date between basic blocks. (This flushing is implied by the -statement above that the real machine's allocatable registers are -dead in between simulated blocks). - - -

Startup, shutdown, and system calls

-Getting into Valgrind (VG_(startup), called from -valgrind.so's initialisation section) really means -copying the real CPU's state into VG_(baseBlock), and -then installing our own stack pointer, etc, into the real CPU, and -then starting up the JITter. Exiting valgrind involves copying the -simulated state back to the real state.

-Unfortunately, there's a complication at startup time. Problem is -that at the point where we need to take a snapshot of the real CPU's -state, the offsets in VG_(baseBlock) are not set up yet, -because to do so would involve disrupting the real machine's state -significantly. The way round this is to dump the real machine's state -into a temporary, static block of memory, -VG_(m_state_static). We can then set up the -VG_(baseBlock) offsets at our leisure, and copy into it -from VG_(m_state_static) at some convenient later time. -This copying is done by -VG_(copy_m_state_static_to_baseBlock). - -

-On exit, the inverse transformation is (rather unnecessarily) used: -stuff in VG_(baseBlock) is copied to -VG_(m_state_static), and the assembly stub then copies -from VG_(m_state_static) into the real machine registers. - -

-Doing system calls on behalf of the client (vg_syscall.S) -is something of a half-way house. We have to make the world look -sufficiently like that which the client would normally have to make -the syscall actually work properly, but we can't afford to lose -control. So the trick is to copy all of the client's state, except -its program counter, into the real CPU, do the system call, and -copy the state back out. Note that the client's state includes its -stack pointer register, so one effect of this partial restoration is -to cause the system call to be run on the client's stack, as it should -be. - -

-As ever there are complications. We have to save some of our own state -somewhere when restoring the client's state into the CPU, so that we -can keep going sensibly afterwards. In fact the only thing which is -important is our own stack pointer, but for paranoia reasons I save -and restore our own FPU state as well, even though that's probably -pointless. - -

-The complication on the above complication is, that for horrible -reasons to do with signals, we may have to handle a second client -system call whilst the client is blocked inside some other system -call (unbelievable!). That means there's two sets of places to -dump Valgrind's stack pointer and FPU state across the syscall, -and we decide which to use by consulting -VG_(syscall_depth), which is in turn maintained by -VG_(wrap_syscall). - - - -

Introduction to UCode

-UCode lies at the heart of the x86-to-x86 JITter. The basic premise -is that dealing with the x86 instruction set head-on is just too darn -complicated, so we do the traditional compiler-writer's trick and -translate it into a simpler, easier-to-deal-with form.

-In normal operation, translation proceeds through six stages, -coordinated by VG_(translate): -

    -
  1. Parsing of an x86 basic block into a sequence of UCode - instructions (VG_(disBB)). -

    -

  2. UCode optimisation (vg_improve), with the aim of - caching simulated registers in real registers over multiple - simulated instructions, and removing redundant simulated - %EFLAGS saving/restoring. -

    -

  3. UCode instrumentation (vg_instrument), which adds - value and address checking code. -

    -

  4. Post-instrumentation cleanup (vg_cleanup), removing - redundant value-check computations. -

    -

  5. Register allocation (vg_do_register_allocation), - which, note, is done on UCode. -

    -

  6. Emission of final instrumented x86 code - (VG_(emit_code)). -
- -

-Notice how steps 2, 3, 4 and 5 are simple UCode-to-UCode -transformation passes, all on straight-line blocks of UCode (type -UCodeBlock). Steps 2 and 4 are optimisation passes and -can be disabled for debugging purposes, with ---optimise=no and --cleanup=no respectively. - -

-Valgrind can also run in a no-instrumentation mode, given ---instrument=no. This is useful for debugging the JITter -quickly without having to deal with the complexity of the -instrumentation mechanism too. In this mode, steps 3 and 4 are -omitted. - -

-These flags combine, so that --instrument=no together with ---optimise=no means only steps 1, 5 and 6 are used. ---single-step=yes causes each x86 instruction to be -treated as a single basic block. The translations are terrible but -this is sometimes instructive. - -

-The --stop-after=N flag switches back to the real CPU -after N basic blocks. It also re-JITs the final basic -block executed and prints the debugging info resulting, so this -gives you a way to get a quick snapshot of how a basic block looks as -it passes through the six stages mentioned above. If you want to -see full information for every block translated (probably not, but -still ...) find, in VG_(translate), the lines -
-    dis = True;
-    dis = debugging_translation;
-and comment out the second line. This will spew out debugging -junk faster than you can possibly imagine. - - - -

UCode operand tags: type Tag

- -UCode is, more or less, a simple two-address RISC-like code. In -keeping with the x86 AT&T assembly syntax, generally speaking the -first operand is the source operand, and the second is the destination -operand, which is modified when the uinstr is notionally executed. - -

-UCode instructions have up to three operand fields, each of which has -a corresponding Tag describing it. Possible values for -the tag are: - -

- - -

UCode instructions: type UInstr

- -

-UCode was carefully designed to make it possible to do register -allocation on UCode and then translate the result into x86 code -without needing any extra registers ... well, that was the original -plan, anyway. Things have gotten a little more complicated since -then. In what follows, UCode instructions are referred to as uinstrs, -to distinguish them from x86 instructions. Uinstrs of course have -uopcodes which are (naturally) different from x86 opcodes. - -

-A uinstr (type UInstr) contains -various fields, not all of which are used by any one uopcode: -

- -

-UOpcodes (type Opcode) are divided into two groups: those -necessary merely to express the functionality of the x86 code, and -extra uopcodes needed to express the instrumentation. The former -group contains: -

- -

-Stages 1 and 2 of the 6-stage translation process mentioned above -deal purely with these uopcodes, and no others. They are -sufficient to express pretty much all the x86 32-bit protected-mode -instruction set, at -least everything understood by a pre-MMX original Pentium (P54C). - -

-Stages 3, 4, 5 and 6 also deal with the following extra -"instrumentation" uopcodes. They are used to express all the -definedness-tracking and -checking machinery which valgrind does. In -later sections we show how to create checking code for each of the -uopcodes above. Note that these instrumentation uopcodes, although -some appear complicated, have been carefully chosen so that -efficient x86 code can be generated for them. GNU superopt v2.5 did a -great job helping out here. Anyway, the uopcodes are as follows:

- -

-These 10 uopcodes are sufficient to express Valgrind's entire -definedness-checking semantics. In fact most of the interesting magic -is done by the TAG1 and TAG2 -suboperations. - -

-First, however, I need to explain about V-vector operation sizes. -There are 4 sizes: 1, 2 and 4, which operate on groups of 8, 16 and 32 -V bits at a time, supporting the usual 1, 2 and 4 byte x86 operations. -However there is also the mysterious size 0, which really means a -single V bit. Single V bits are used in various circumstances; in -particular, the definedness of %EFLAGS is modelled with a -single V bit. Now might be a good time to also point out that for -V bits, 1 means "undefined" and 0 means "defined". Similarly, for A -bits, 1 means "invalid address" and 0 means "valid address". This -seems counterintuitive (and so it is), but testing against zero on -x86s saves instructions compared to testing against all 1s, because -many ALU operations set the Z flag for free, so to speak. - -
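-
-A tiny C illustration of why that polarity is convenient;
-report_value_error is a stand-in for the real error machinery.
-
-#include <stdint.h>
-
-extern void report_value_error ( void );   /* error-reporting stub */
-
-/* 'vword' holds the 32 V bits shadowing a 32-bit value. */
-static void check_defined4 ( uint32_t vword )
-{
-   /* "Fully defined" is the all-zeroes pattern, so the whole check is
-      a single compare against zero -- and the ALU op that produced
-      vword has usually set the Z flag already, making it nearly free.
-      With the opposite polarity we would need an extra compare
-      against 0xFFFFFFFF. */
-   if (vword != 0)
-      report_value_error();
-}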

-With that in mind, the tag ops are: - -

- -

-That's all the tag ops. If you stare at this long enough, and then -run Valgrind and stare at the pre- and post-instrumented ucode, it -should be fairly obvious how the instrumentation machinery hangs -together. - -

-One point, if you do this: in order to make it easy to differentiate -TempRegs carrying values from TempRegs -carrying V bit vectors, Valgrind prints the former as (for example) -t28 and the latter as q28; the fact that -they carry the same number serves to indicate their relationship. -This is purely for the convenience of the human reader; the register -allocator and code generator don't regard them as different. - - -

Translation into UCode

- -VG_(disBB) allocates a new UCodeBlock and -then uses disInstr to translate x86 instructions one at a -time into UCode, dumping the result in the UCodeBlock. -This goes on until a control-flow transfer instruction is encountered. - -

-Despite the large size of vg_to_ucode.c, this translation -is really very simple. Each x86 instruction is translated entirely -independently of its neighbours, merrily allocating new -TempRegs as it goes. The idea is to have a simple -translator -- in reality, no more than a macro-expander -- and the -resulting bad UCode translation is cleaned up by the UCode -optimisation phase which follows. To give you an idea of some x86 -instructions and their translations (this is a complete basic block, -as Valgrind sees it):

-        0x40435A50:  incl %edx
-
-           0: GETL      %EDX, t0
-           1: INCL      t0  (-wOSZAP)
-           2: PUTL      t0, %EDX
-
-        0x40435A51:  movsbl (%edx),%eax
-
-           3: GETL      %EDX, t2
-           4: LDB       (t2), t2
-           5: WIDENL_Bs t2
-           6: PUTL      t2, %EAX
-
-        0x40435A54:  testb $0x20, 1(%ecx,%eax,2)
-
-           7: GETL      %EAX, t6
-           8: GETL      %ECX, t8
-           9: LEA2L     1(t8,t6,2), t4
-          10: LDB       (t4), t10
-          11: MOVB      $0x20, t12
-          12: ANDB      t12, t10  (-wOSZACP)
-          13: INCEIPo   $9
-
-        0x40435A59:  jnz-8 0x40435A50
-
-          14: Jnzo      $0x40435A50  (-rOSZACP)
-          15: JMPo      $0x40435A5B
-
- -

-Notice how the block always ends with an unconditional jump to the -next block. This is a bit unnecessary, but makes many things simpler. - -

-Most x86 instructions turn into sequences of GET, -PUT, LEA1, LEA2, -LOAD and STORE. Some complicated ones -however rely on calling helper bits of code in -vg_helpers.S. The ucode instructions PUSH, -POP, CALL, CALLM_S and -CALLM_E support this. The calling convention is somewhat -ad-hoc and is not the C calling convention. The helper routines must -save all integer registers, and the flags, that they use. Args are -passed on the stack underneath the return address, as usual, and if -result(s) are to be returned, it (they) are either placed in dummy arg -slots created by the ucode PUSH sequence, or just -overwrite the incoming args. - -

-In order that the instrumentation mechanism can handle calls to these -helpers, VG_(saneUCodeBlock) enforces the following -restrictions on calls to helpers: - -

- -Some of the translations may appear to have redundant -TempReg-to-TempReg moves. This helps the -next phase, UCode optimisation, to generate better code. - - - -

UCode optimisation

- -UCode is then subjected to an improvement pass -(vg_improve()), which blurs the boundaries between the -translations of the original x86 instructions. It's pretty -straightforward. Three transformations are done: - - - -The effect of these transformations on our short block is rather -unexciting, and shown below. On longer basic blocks they can -dramatically improve code quality. - -
-at 3: delete GET, rename t2 to t0 in (4 .. 6)
-at 7: delete GET, rename t6 to t0 in (8 .. 9)
-at 1: annul flag write OSZAP due to later OSZACP
-
-Improved code:
-           0: GETL      %EDX, t0
-           1: INCL      t0
-           2: PUTL      t0, %EDX
-           4: LDB       (t0), t0
-           5: WIDENL_Bs t0
-           6: PUTL      t0, %EAX
-           8: GETL      %ECX, t8
-           9: LEA2L     1(t8,t0,2), t4
-          10: LDB       (t4), t10
-          11: MOVB      $0x20, t12
-          12: ANDB      t12, t10  (-wOSZACP)
-          13: INCEIPo   $9
-          14: Jnzo      $0x40435A50  (-rOSZACP)
-          15: JMPo      $0x40435A5B
-
- -

UCode instrumentation

-Once you understand the meaning of the instrumentation uinstrs, -discussed in detail above, the instrumentation scheme is fairly -straightforward. Each uinstr is instrumented in isolation, and the -instrumentation uinstrs are placed before the original uinstr. -Our running example continues below. I have placed a blank line -after every original ucode, to make it easier to see which -instrumentation uinstrs correspond to which originals.

-As mentioned somewhere above, TempRegs carrying values -have names like t28, and each one has a shadow carrying -its V bits, with names like q28. This pairing aids in -reading instrumented ucode. - -

-One decision about all this is where to have "observation points", -that is, where to check that V bits are valid. I use a minimalistic -scheme, only checking where a failure of validity could cause the -original program to (seg)fault. So the use of values as memory -addresses causes a check, as do conditional jumps (these cause a check -on the definedness of the condition codes). And arguments -PUSHed for helper calls are checked, hence the weird -restrictions on helper call preambles described above.

-Another decision is that once a value is tested, it is thereafter -regarded as defined, so that we do not emit multiple undefined-value -errors for the same undefined value. That means that -TESTV uinstrs are always followed by SETV -on the same (shadow) TempRegs. Most of these -SETVs are redundant and are removed by the -post-instrumentation cleanup phase. - -

-The instrumentation for calling helper functions deserves further -comment. The definedness of results from a helper is modelled using -just one V bit. So, in short, we do pessimising casts of the -definedness of all the args, down to a single bit, and then -UifU these bits together. So this single V bit will say -"undefined" if any part of any arg is undefined. This V bit is then -pessimally cast back up to the result(s) sizes, as needed. If, by -seeing that all the args are got rid of with CLEAR and -none with POP, Valgrind sees that the result of the call -is not actually used, it immediately examines the result V bit with a -TESTV -- SETV pair. If it did not do this, -there would be no observation point to detect that some of the -args to the helper were undefined. Of course, if the helper's results -are indeed used, we don't do this, since the result usage will -presumably cause the result definedness to be checked at some suitable -future point.

-In general Valgrind tries to track definedness on a bit-for-bit basis, -but as the above para shows, for calls to helpers we throw in the -towel and approximate down to a single bit. This is because it's too -complex and difficult to track bit-level definedness through complex -ops such as integer multiply and divide, and in any case there are no -reasonable code fragments which attempt to (eg) multiply two -partially-defined values and end up with something meaningful, so -there seems little point in modelling multiplies, divides, etc, at -that level of detail.
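-
-In C terms the pessimising-cast scheme amounts to something like the
-sketch below; the function names are invented, and with the 0-means-defined
-polarity UifU is simply a bitwise OR.
-
-#include <stdint.h>
-
-/* 32 V bits -> 1: "is anything anywhere undefined?" */
-static uint32_t pcast_down ( uint32_t vbits )
-{
-   return vbits == 0 ? 0x0 : 0x1;
-}
-
-/* 1 V bit -> 32: smear the verdict over the whole result. */
-static uint32_t pcast_up4 ( uint32_t vbit )
-{
-   return vbit == 0 ? 0x00000000 : 0xFFFFFFFF;
-}
-
-/* Definedness of a two-argument helper call's result: any
-   undefinedness anywhere in either arg taints the whole result. */
-static uint32_t helper_result_vbits ( uint32_t varg1, uint32_t varg2 )
-{
-   uint32_t vbit = pcast_down(varg1) | pcast_down(varg2);   /* UifU */
-   return pcast_up4(vbit);
-}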

-Integer loads and stores are instrumented with firstly a test of the -definedness of the address, followed by a LOADV or -STOREV respectively. These turn into calls to -(for example) VG_(helperc_LOADV4). These helpers do two -things: they perform an address-valid check, and they load or store V -bits from/to the relevant address in the (simulated V-bit) memory. - -

-FPU loads and stores are different. As above the definedness of the -address is first tested. However, the helper routine for FPU loads -(VGM_(fpu_read_check)) emits an error if either the -address is invalid or the referenced area contains undefined values. -It has to do this because we do not simulate the FPU at all, and so -cannot track definedness of values loaded into it from memory, so we -have to check them as soon as they are loaded into the FPU, ie, at -this point. We notionally assume that everything in the FPU is -defined. - -

-It follows therefore that FPU writes first check the definedness of -the address, then the validity of the address, and finally mark the -written bytes as well-defined. - -

-If anyone is inspired to extend Valgrind to MMX/SSE insns, I suggest -you use the same trick. It works provided that the FPU/MMX unit is -not merely used as a conduit to copy partially undefined data from -one place in memory to another. Unfortunately the integer CPU is used -like that (when copying C structs with holes, for example) and this is -the cause of much of the elaborateness of the instrumentation here -described.

-vg_instrument() in vg_translate.c actually -does the instrumentation. There are comments explaining how each -uinstr is handled, so we do not repeat that here. As explained -already, it is bit-accurate, except for calls to helper functions. -Unfortunately the x86 insns bt/bts/btc/btr are done by -helper fns, so bit-level accuracy is lost there. This should be fixed -by doing them inline; it will probably require adding a couple of new -uinstrs. Also, left and right rotates through the carry flag (x86 -rcl and rcr) are approximated via a single -V bit; so far this has not caused anyone to complain. The -non-carry rotates, rol and ror, are much -more common and are done exactly. Re-visiting the instrumentation for -AND and OR, they seem rather verbose, and I wonder if it could be done -more concisely now.

-The lowercase o on many of the uopcodes in the running -example indicates that the size field is zero, usually meaning a -single-bit operation. - -

-Anyroads, the post-instrumented version of our running example looks -like this: - -

-Instrumented code:
-           0: GETVL     %EDX, q0
-           1: GETL      %EDX, t0
-
-           2: TAG1o     q0 = Left4 ( q0 )
-           3: INCL      t0
-
-           4: PUTVL     q0, %EDX
-           5: PUTL      t0, %EDX
-
-           6: TESTVL    q0
-           7: SETVL     q0
-           8: LOADVB    (t0), q0
-           9: LDB       (t0), t0
-
-          10: TAG1o     q0 = SWiden14 ( q0 )
-          11: WIDENL_Bs t0
-
-          12: PUTVL     q0, %EAX
-          13: PUTL      t0, %EAX
-
-          14: GETVL     %ECX, q8
-          15: GETL      %ECX, t8
-
-          16: MOVL      q0, q4
-          17: SHLL      $0x1, q4
-          18: TAG2o     q4 = UifU4 ( q8, q4 )
-          19: TAG1o     q4 = Left4 ( q4 )
-          20: LEA2L     1(t8,t0,2), t4
-
-          21: TESTVL    q4
-          22: SETVL     q4
-          23: LOADVB    (t4), q10
-          24: LDB       (t4), t10
-
-          25: SETVB     q12
-          26: MOVB      $0x20, t12
-
-          27: MOVL      q10, q14
-          28: TAG2o     q14 = ImproveAND1_TQ ( t10, q14 )
-          29: TAG2o     q10 = UifU1 ( q12, q10 )
-          30: TAG2o     q10 = DifD1 ( q14, q10 )
-          31: MOVL      q12, q14
-          32: TAG2o     q14 = ImproveAND1_TQ ( t12, q14 )
-          33: TAG2o     q10 = DifD1 ( q14, q10 )
-          34: MOVL      q10, q16
-          35: TAG1o     q16 = PCast10 ( q16 )
-          36: PUTVFo    q16
-          37: ANDB      t12, t10  (-wOSZACP)
-
-          38: INCEIPo   $9
-
-          39: GETVFo    q18
-          40: TESTVo    q18
-          41: SETVo     q18
-          42: Jnzo      $0x40435A50  (-rOSZACP)
-
-          43: JMPo      $0x40435A5B
-
- - -

UCode post-instrumentation cleanup

- -

-This pass, coordinated by vg_cleanup(), removes redundant -definedness computation created by the simplistic instrumentation -pass. It consists of two passes, -vg_propagate_definedness() followed by -vg_delete_redundant_SETVs. - -

-vg_propagate_definedness() is a simple -constant-propagation and constant-folding pass. It tries to determine -which TempRegs containing V bits will always indicate -"fully defined", and it propagates this information as far as it can, -and folds out as many operations as possible. For example, the -instrumentation for an ADD of a literal to a variable quantity will be -reduced down so that the definedness of the result is simply the -definedness of the variable quantity, since the literal is by -definition fully defined. - -

-vg_delete_redundant_SETVs removes SETVs on -shadow TempRegs for which the next action is a write. -I don't think there's anything else worth saying about this; it is -simple. Read the sources for details. - -

-So the cleaned-up running example looks like this. As above, I have -inserted line breaks after every original (non-instrumentation) uinstr -to aid readability. As with straightforward ucode optimisation, the -results in this block are undramatic because it is so short; longer -blocks benefit more because they have more redundancy which gets -eliminated. - - -

-at 29: delete UifU1 due to defd arg1
-at 32: change ImproveAND1_TQ to MOV due to defd arg2
-at 41: delete SETV
-at 31: delete MOV
-at 25: delete SETV
-at 22: delete SETV
-at 7: delete SETV
-
-           0: GETVL     %EDX, q0
-           1: GETL      %EDX, t0
-
-           2: TAG1o     q0 = Left4 ( q0 )
-           3: INCL      t0
-
-           4: PUTVL     q0, %EDX
-           5: PUTL      t0, %EDX
-
-           6: TESTVL    q0
-           8: LOADVB    (t0), q0
-           9: LDB       (t0), t0
-
-          10: TAG1o     q0 = SWiden14 ( q0 )
-          11: WIDENL_Bs t0
-
-          12: PUTVL     q0, %EAX
-          13: PUTL      t0, %EAX
-
-          14: GETVL     %ECX, q8
-          15: GETL      %ECX, t8
-
-          16: MOVL      q0, q4
-          17: SHLL      $0x1, q4
-          18: TAG2o     q4 = UifU4 ( q8, q4 )
-          19: TAG1o     q4 = Left4 ( q4 )
-          20: LEA2L     1(t8,t0,2), t4
-
-          21: TESTVL    q4
-          23: LOADVB    (t4), q10
-          24: LDB       (t4), t10
-
-          26: MOVB      $0x20, t12
-
-          27: MOVL      q10, q14
-          28: TAG2o     q14 = ImproveAND1_TQ ( t10, q14 )
-          30: TAG2o     q10 = DifD1 ( q14, q10 )
-          32: MOVL      t12, q14
-          33: TAG2o     q10 = DifD1 ( q14, q10 )
-          34: MOVL      q10, q16
-          35: TAG1o     q16 = PCast10 ( q16 )
-          36: PUTVFo    q16
-          37: ANDB      t12, t10  (-wOSZACP)
-
-          38: INCEIPo   $9
-          39: GETVFo    q18
-          40: TESTVo    q18
-          42: Jnzo      $0x40435A50  (-rOSZACP)
-
-          43: JMPo      $0x40435A5B
-
- - -

Translation from UCode

- -This is all very simple, even though vg_from_ucode.c -is a big file. Position-independent x86 code is generated into -a dynamically allocated array emitted_code; this is -doubled in size when it overflows. Eventually the array is handed -back to the caller of VG_(translate), who must copy -the result into TC and TT, and free the array. - -

-This file is structured into four layers of abstraction, which, -thankfully, are glued back together with extensive -__inline__ directives. From the bottom upwards: - -

- -

-Some comments: -

- -

-And so ... that's the end of the documentation for the instrumenting -translator! It's really not that complex, because it's composed as a -sequence of simple(ish) self-contained transformations on -straight-line blocks of code.

Top-level dispatch loop

- -Urk. In VG_(toploop). This is basically boring and -unsurprising, not to mention fiddly and fragile. It needs to be -cleaned up. - -

-Perhaps the only surprise is that the whole thing is run -on top of a setjmp-installed exception handler, because, -supposing a translation got a segfault, we have to bail out of the -Valgrind-supplied exception handler VG_(oursignalhandler) -and immediately start running the client's segfault handler, if it has -one. In particular we can't finish the current basic block and then -deliver the signal at some convenient future point, because signals -like SIGILL, SIGSEGV and SIGBUS mean that the faulting insn should not -simply be re-tried. (I'm sure there is a clearer way to explain this).

Exceptions, creating new translations

-

Self-modifying code

- -

Lazy updates of the simulated program counter

- -Simulated %EIP is not updated after every simulated x86 -insn as this was regarded as too expensive. Instead ucode -INCEIP insns move it along as and when necessary. -Currently we don't allow it to fall more than 4 bytes behind reality -(see VG_(disBB) for the way this works). -

-Note that %EIP is always brought up to date by the inner -dispatch loop in VG_(dispatch), so that if the client -takes a fault we know at least which basic block this happened in. - - -

The translation cache and translation table

- -

Signals

- -Horrible, horrible. vg_signals.c. -Basically, since we have to intercept all system -calls anyway, we can see when the client tries to install a signal -handler. If it does so, we make a note of what the client asked to -happen, and ask the kernel to route the signal to our own signal -handler, VG_(oursignalhandler). This simply notes the -delivery of signals, and returns. - -

-Every 1000 basic blocks, we see if more signals have arrived. If so, -VG_(deliver_signals) builds signal delivery frames on the -client's stack, and allows their handlers to be run. Valgrind places -in these signal delivery frames a bogus return address, -VG_(signalreturn_bogusRA), and checks all jumps to see -if any jump to it. If so, this is a sign that a signal handler is -returning, and Valgrind then removes the relevant signal frame from -the client's stack, restores from the signal frame the simulated -state as it was before the signal was delivered, and allows the client to run -onwards. We have to do it this way because some signal handlers never -return, they just longjmp(), which nukes the signal -delivery frame.
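-
-In C the sentinel-return-address check amounts to one extra compare in
-the dispatcher, roughly as below; the constant and helper names are
-invented for this sketch, since the dispatcher already inspects every
-control transfer anyway.
-
-#include <stdint.h>
-
-#define SIGRETURN_BOGUS_RA  0xDEADBEEFu   /* illustrative value only */
-
-extern void pop_signal_frame_and_restore ( void );  /* stubs for this sketch */
-extern void run_translation_at ( uint32_t addr );
-
-static void dispatch_jump ( uint32_t target )
-{
-   if (target == SIGRETURN_BOGUS_RA) {
-      /* The handler just executed "ret" on the frame we built: unwind
-         it and put back the pre-signal simulated state. */
-      pop_signal_frame_and_restore();
-      return;
-   }
-   run_translation_at(target);
-}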

-The Linux kernel has a different but equally horrible hack for -detecting signal handler returns. Discovering it is left as an -exercise for the reader. - - - -

Errors, error contexts, error reporting, suppressions

-

Client malloc/free

-

Low-level memory management

-

A and V bitmaps

-

Symbol table management

-

Dealing with system calls

-

Namespace management

-

GDB attaching

-

Non-dependence on glibc or anything else

-

The leak detector

-

Performance problems

-

Continuous sanity checking

-

Tracing, or not tracing, child processes

-

Assembly glue for syscalls

- - -
- -

Extensions

- -Some comments about Stuff To Do. - -

Bugs

- -Stephan Kulow and Marc Mutz report problems with kmail in KDE 3 CVS -(RC2 ish) when run on Valgrind. Stephan has it deadlocking; Marc has -it looping at startup. I can't repro either behaviour. Needs -repro-ing and fixing. - - -

Threads

- -Doing a good job of thread support strikes me as almost a -research-level problem. The central issues are how to do fast cheap -locking of the VG_(primary_map) structure, whether or not -accesses to the individual secondary maps need locking, what -race-condition issues result, and whether the already-nasty mess that -is the signal simulator needs further hackery. - -

-I realise that threads are the most-frequently-requested feature, and -I am thinking about it all. If you have guru-level understanding of -fast mutual exclusion mechanisms and race conditions, I would be -interested in hearing from you. - - -

Verification suite

- -Directory tests/ contains various ad-hoc tests for -Valgrind. However, there is no systematic verification or regression -suite, that, for example, exercises all the stuff in -vg_memory.c, to ensure that illegal memory accesses and -undefined value uses are detected as they should be. It would be good -to have such a suite. - - -

Porting to other platforms

- -It would be great if Valgrind was ported to FreeBSD and x86 NetBSD, -and to x86 OpenBSD, if it's possible (doesn't OpenBSD use a.out-style -executables, not ELF ?) - -

-The main difficulties, for an x86-ELF platform, seem to be: - -

- -All in all, I think a port to x86-ELF *BSDs is not really very -difficult, and in some ways I would like to see it happen, because -that would force a more clear factoring of Valgrind into platform -dependent and independent pieces. Not to mention, *BSD folks also -deserve to use Valgrind just as much as the Linux crew do. - - -

-


- -

Easy stuff which ought to be done

- -

MMX instructions

- -MMX insns should be supported, using the same trick as for FPU insns. -If the MMX registers are not used to copy uninitialised junk from one -place to another in memory, this means we don't have to actually -simulate the internal MMX unit state, so the FPU hack applies. This -should be fairly easy. - - - -

Fix stabs-info reader

- -The machinery in vg_symtab2.c which reads "stabs" style -debugging info is pretty weak. It usually correctly translates -simulated program counter values into line numbers and procedure -names, but the file name is often completely wrong. I think the -logic used to parse "stabs" entries is weak. It should be fixed. -The simplest solution, IMO, is to copy either the logic or simply the -code out of GNU binutils which does this; since GDB can clearly get it -right, binutils (or GDB?) must have code to do this somewhere. - - - - - -

BT/BTC/BTS/BTR

- -These are x86 instructions which test, complement, set, or reset, a -single bit in a word. At the moment they are both incorrectly -implemented and incorrectly instrumented. - -

-The incorrect instrumentation is due to use of helper functions. This -means we lose bit-level definedness tracking, which could wind up -giving spurious uninitialised-value use errors. The Right Thing to do -is to invent a couple of new UOpcodes, I think GET_BIT -and SET_BIT, which can be used to implement all 4 x86 -insns, get rid of the helpers, and give bit-accurate instrumentation -rules for the two new UOpcodes. - -

-I realised the other day that they are mis-implemented too. The x86 -insns take a bit-index and a register or memory location to access. -For registers the bit index clearly can only be in the range zero to -register-width minus 1, and I assumed the same applied to memory -locations too. But evidently not; for memory locations the index can -be arbitrary, and the processor will index arbitrarily into memory as -a result. This too should be fixed. Sigh. Presumably indexing -outside the immediate word is not actually used by any programs yet -tested on Valgrind, for otherwise they (presumably) would simply not -work at all. If you plan to hack on this, first check the Intel docs -to make sure my understanding is really correct. - - - -
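-
-For reference, one reading of the memory-operand semantics described
-above, as a C sketch; as noted, check the Intel docs before relying on
-it, and the negative-index case assumes the usual arithmetic right
-shift for signed values.
-
-#include <stdint.h>
-
-/* CF result of "bt" with a memory operand and register bit offset.
-   The key point: the index is *not* reduced modulo 32; it selects a
-   word at an arbitrary (possibly negative) offset from 'base'. */
-static int bt_mem ( const uint32_t* base, int32_t bitindex )
-{
-   const uint32_t* word = base + (bitindex >> 5);   /* which word */
-   return (*word >> (bitindex & 31)) & 1;           /* which bit  */
-}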

Using PREFETCH instructions

-Here's a small but potentially interesting project for performance -junkies. Experiments with valgrind's code generator and optimiser(s) -suggest that reducing the number of instructions executed in the -translations and mem-check helpers gives disappointingly small -performance improvements. Perhaps this is because performance of -Valgrindified code is limited by cache misses. After all, each read -in the original program now gives rise to at least three reads, one -for the VG_(primary_map), one for the resulting -secondary, and the original. Not to mention, the instrumented -translations are 13 to 14 times larger than the originals. All in all -one would expect the memory system to be hammered to hell and then -some.

-So here's an idea. An x86 insn involving a read from memory, after -instrumentation, will turn into ucode of the following form: -

-    ... calculate effective addr, into ta and qa ...
-    TESTVL qa             -- is the addr defined?
-    LOADV (ta), qloaded   -- fetch V bits for the addr
-    LOAD  (ta), tloaded   -- do the original load
-
-At the point where the LOADV is done, we know the actual -address (ta) from which the real LOAD will -be done. We also know that the LOADV will take around -20 x86 insns to do. So it seems plausible that doing a prefetch of -ta just before the LOADV might just avoid a -miss at the LOAD point, and that might be a significant -performance win. - -
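-
-Restated in C, with an invented helper name and GCC's
-__builtin_prefetch standing in for the prefetch insn the JITter would
-actually emit, the idea is simply:
-
-/* The ~20-insn LOADV helper, stubbed for this sketch. */
-extern unsigned int load_vbits_for ( unsigned int* ta );
-
-static unsigned int checked_load ( unsigned int* ta, unsigned int* qloaded )
-{
-   __builtin_prefetch(ta);          /* start pulling the line in now...   */
-   *qloaded = load_vbits_for(ta);   /* ...while the LOADV helper runs     */
-   return *ta;                      /* the original LOAD, hopefully a hit */
-}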

-Prefetch insns are notoriously temperamental, more often than not -making things worse rather than better, so this would require -considerable fiddling around. It's complicated because Intels and -AMDs have different prefetch insns with different semantics, so that -too needs to be taken into account. As a general rule, even placing -the prefetches before the LOADV insn is too near the -LOAD; the ideal distance is apparently circa 200 CPU -cycles. So it might be worth having another analysis/transformation -pass which pushes prefetches as far back as possible, hopefully -immediately after the effective address becomes available.

-Doing too many prefetches is also bad because they soak up bus -bandwidth / cpu resources, so some cleverness in deciding which loads -to prefetch and which to not might be helpful. One can imagine not -prefetching client-stack-relative (%EBP or -%ESP) accesses, since the stack in general tends to show -good locality anyway. - -

-There's quite a lot of experimentation to do here, but I think it -might make an interesting week's work for someone. - -

-As of 15-ish March 2002, I've started to experiment with this, using -the AMD prefetch/prefetchw insns. - - - -

User-defined permission ranges

- -This is quite a large project -- perhaps a month's hacking for a -capable hacker to do a good job -- but it's potentially very -interesting. The outcome would be that Valgrind could detect a -whole class of bugs which it currently cannot. - -

-The presentation falls into two pieces. - -

-Part 1: user-defined address-range permission setting -

- -Valgrind intercepts the client's malloc, -free, etc calls, watches system calls, and watches the -stack pointer move. This is currently the only way it knows about -which addresses are valid and which not. Sometimes the client program -knows extra information about its memory areas. For example, the -client could at some point know that all elements of an array are -out-of-date. We would like to be able to convey to Valgrind this -information that the array is now addressable-but-uninitialised, so -that Valgrind can then warn if elements are used before they get new -values. - -

-What I would like are some macros like this: -

-   VALGRIND_MAKE_NOACCESS(addr, len)
-   VALGRIND_MAKE_WRITABLE(addr, len)
-   VALGRIND_MAKE_READABLE(addr, len)
-
-and also, to check that memory is addressable/initialised,
-   VALGRIND_CHECK_ADDRESSIBLE(addr, len)
-   VALGRIND_CHECK_INITIALISED(addr, len)
-
- -

-I then include in my sources a header defining these macros, rebuild -my app, run under Valgrind, and get user-defined checks. - -

-Now here's a neat trick. It's a nuisance to have to re-link the app -with some new library which implements the above macros. So the idea -is to define the macros so that the resulting executable is still -completely stand-alone, and can be run without Valgrind, in which case -the macros do nothing, but when run on Valgrind, the Right Thing -happens. How to do this? The idea is for these macros to turn into a -piece of inline assembly code, which (1) has no effect when run on the -real CPU, (2) is easily spotted by Valgrind's JITter, and (3) no sane -person would ever write, which is important for avoiding false matches -in (2). So here's a suggestion: -

-   VALGRIND_MAKE_NOACCESS(addr, len)
-
-becomes (roughly speaking) -
-   movl addr, %eax
-   movl len,  %ebx
-   movl $1,   %ecx   -- 1 describes the action; MAKE_WRITABLE might be
-                     -- 2, etc
-   rorl $13, %ecx
-   rorl $19, %ecx
-   rorl $11, %eax
-   rorl $21, %eax
-
-The rotate sequences have no effect, and it's unlikely they would
-appear for any other reason, but they define a unique byte-sequence
-which the JITter can easily spot.  Using the operand constraints
-section at the end of a gcc inline-assembly statement, we can tell gcc
-that the assembly fragment kills %eax, %ebx,
-%ecx and the condition codes.  So the fragment is harmless
-and quick when not running on Valgrind, and requires no other library
-support.
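- 
-For concreteness, here is roughly how the fragment suggested above could
-be packaged as a macro using gcc inline assembly.  This is only a sketch
-of the idea, not the eventual implementation; the request code 1 is just
-the example value used above.
- 
-/* Sketch only.  Each rorl pair rotates by a total of 32 bits, so the
-   sequence has no architectural effect, but it forms a byte pattern
-   the JITter can recognise.  %eax, %ebx, %ecx and the flags are
-   declared clobbered so gcc keeps nothing live in them. */
-#define VALGRIND_MAKE_NOACCESS(_addr, _len)                        \
-   __asm__ __volatile__(                                           \
-      "movl %0, %%eax\n\t"                                         \
-      "movl %1, %%ebx\n\t"                                         \
-      "movl $1, %%ecx\n\t"     /* 1 identifies MAKE_NOACCESS */    \
-      "rorl $13, %%ecx\n\t"                                        \
-      "rorl $19, %%ecx\n\t"    /* 13+19 == 32: no net effect */    \
-      "rorl $11, %%eax\n\t"                                        \
-      "rorl $21, %%eax"        /* 11+21 == 32: no net effect */    \
-      : /* no outputs */                                           \
-      : "r" (_addr), "r" (_len)                                    \
-      : "eax", "ebx", "ecx", "cc" )
- 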

-Part 2: using it to detect interference between stack variables -

- -Currently Valgrind cannot detect errors of the following form: -

-void fooble ( void )
-{
-   int a[10];
-   int b[10];
-   a[10] = 99;
-}
-
-Now imagine rewriting this as -
-void fooble ( void )
-{
-   int spacer0;
-   int a[10];
-   int spacer1;
-   int b[10];
-   int spacer2;
-   VALGRIND_MAKE_NOACCESS(&spacer0, sizeof(int));
-   VALGRIND_MAKE_NOACCESS(&spacer1, sizeof(int));
-   VALGRIND_MAKE_NOACCESS(&spacer2, sizeof(int));
-   a[10] = 99;
-}
-
-Now the invalid write is certain to hit spacer0 or -spacer1, so Valgrind will spot the error. - -

-There are two complications. - -

- 
-The first is that we don't want to annotate sources by hand, so the
-Right Thing to do is to write a C/C++ parser, annotator, prettyprinter
-which does this automatically, and run it on post-CPP'd C/C++ source.
-See http://www.cacheprof.org for an example of a system which
-transparently inserts another phase into the gcc/g++ compilation
-route.  The parser/prettyprinter is probably not as hard as it sounds;
-I would write it in Haskell, a powerful functional language well
-suited to doing symbolic computation, with which I am intimately
-familiar.  There is already a C parser written in Haskell by someone in
-the Haskell community, and that would probably be a good starting
-point.

-The second complication is how to get rid of these -NOACCESS records inside Valgrind when the instrumented -function exits; after all, these refer to stack addresses and will -make no sense whatever when some other function happens to re-use the -same stack address range, probably shortly afterwards. I think I -would be inclined to define a special stack-specific macro -

-   VALGRIND_MAKE_NOACCESS_STACK(addr, len)
-
-which causes Valgrind to record the client's %ESP at the -time it is executed. Valgrind will then watch for changes in -%ESP and discard such records as soon as the protected -area is uncovered by an increase in %ESP. I hesitate -with this scheme only because it is potentially expensive, if there -are hundreds of such records, and considering that changes in -%ESP already require expensive messing with stack access -permissions. - -
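- 
-A rough sketch of the bookkeeping this would involve is below.  None of
-it is existing Valgrind code; the record layout, the names and the
-simple linear scan are invented for illustration.
- 
-/* Hypothetical records for VALGRIND_MAKE_NOACCESS_STACK: one is made
-   when the client request executes, and it is discarded as soon as an
-   increase in the simulated %ESP uncovers the protected range. */
-typedef unsigned int Addr;
-typedef unsigned int UInt;
-
-typedef struct {
-   Addr start;      /* protected range is [start, start+len) */
-   UInt len;
-   Addr esp_then;   /* client %ESP when the request was made */
-} StackNoAccessRec;
-
-/* Called whenever the simulated %ESP increases (ie. the stack shrinks).
-   Any record whose range now lies entirely below the new %ESP refers
-   to dead stack and is dropped. */
-void discard_uncovered ( StackNoAccessRec* recs, int* n_recs, Addr new_esp )
-{
-   int i = 0;
-   while (i < *n_recs) {
-      if (recs[i].start + recs[i].len <= new_esp)
-         recs[i] = recs[--(*n_recs)];     /* drop: overwrite with last */
-      else
-         i++;
-   }
-}
- 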

- 
-This %ESP-based scheme is probably easier and more robust than having the
-instrumenter program try to spot all exit points for the procedure and
-place suitable deallocation annotations there.  Plus C++ procedures can
-bomb out at any point if they get an exception, so spotting return
-points at the source level just won't work at all.

-Although some work, it's all eminently doable, and it would make -Valgrind into an even-more-useful tool. - - -

- - -


- -

Cache profiling

-Valgrind is a very nice platform for doing cache profiling and other kinds of -simulation, because it converts horrible x86 instructions into nice clean -RISC-like UCode. For example, for cache profiling we are interested in -instructions that read and write memory; in UCode there are only four -instructions that do this: LOAD, STORE, -FPU_R and FPU_W. By contrast, because of the x86 -addressing modes, almost every instruction can read or write memory.

- -Most of the cache profiling machinery is in the file -vg_cachesim.c.

- -These notes are a somewhat haphazard guide to how Valgrind's cache profiling -works.

- -

Cost centres

-Valgrind gathers cache profiling information about every instruction
-executed, individually.  Each instruction has a cost centre associated
-with it.  There are two kinds of cost centre: one for instructions that
-don't reference memory (iCC), and one for instructions that do
-(idCC):
-typedef struct _CC {
-   ULong a;
-   ULong m1;
-   ULong m2;
-} CC;
-
-typedef struct _iCC {
-   /* word 1 */
-   UChar tag;
-   UChar instr_size;
-
-   /* words 2+ */
-   Addr instr_addr;
-   CC I;
-} iCC;
-   
-typedef struct _idCC {
-   /* word 1 */
-   UChar tag;
-   UChar instr_size;
-   UChar data_size;
-
-   /* words 2+ */
-   Addr instr_addr;
-   CC I; 
-   CC D; 
-} idCC; 
-
- 
-Each CC has three fields a, m1,
-m2 for recording references, level 1 misses and level 2 misses.
-Each of these is a 64-bit ULong -- the numbers can get very large,
-ie. greater than the 4.2 billion allowed by a 32-bit unsigned int.

- 
-An iCC has one CC, for instruction cache accesses.  An
-idCC has two: one for instruction cache accesses, and one for data
-cache accesses.

- 
-The iCC and idCC structs also store unchanging
-information about the instruction: its address (instr_addr), its
-size in bytes (instr_size) and, for an idCC, the size of
-the data item it references (data_size).

- 
-Note that the data address is not one of the fields in an idCC.  This is
-because for many memory-referencing instructions the data address can change
-each time the instruction is executed (eg. if it uses register-offset
-addressing).  We have to give this item to the cache simulation in a
-different way (see the Instrumentation section below).  Some
-memory-referencing instructions do always reference the same address, but we
-don't try to treat them specially, in order to keep things simple.

- -Also note that there is only room for recording info about one data cache -access in an idCC. So what about instructions that do a read then -a write, such as: - -

inc (%esi)
- -In a write-allocate cache, as simulated by Valgrind, the write cannot miss, -since it immediately follows the read which will drag the block into the cache -if it's not already there. So the write access isn't really interesting, and -Valgrind doesn't record it. This means that Valgrind doesn't measure -memory references, but rather memory references that could miss in the cache. -This behaviour is the same as that used by the AMD Athlon hardware counters. -It also has the benefit of simplifying the implementation -- instructions that -read and write memory can be treated like instructions that read memory.

- -

Storing cost-centres

-Cost centres are stored in a way that makes them very cheap to look up,
-which is important since one is looked up for every original x86
-instruction executed.

- 
-Valgrind does JIT translations at the basic block level, and cost centres are
-also set up and stored at the basic block level.  By doing things carefully,
-we store all the cost centres for a basic block in a contiguous array, and
-lookup comes almost for free.

- -Consider this part of a basic block (for exposition purposes, pretend it's an -entire basic block): - -

-movl $0x0,%eax
-movl $0x99, -4(%ebp)
-
- -The translation to UCode looks like this: - -
-MOVL      $0x0, t20
-PUTL      t20, %EAX
-INCEIPo   $5
-
-LEA1L     -4(t4), t14
-MOVL      $0x99, t18
-STL       t18, (t14)
-INCEIPo   $7
-
- -The first step is to allocate the cost centres. This requires a preliminary -pass to count how many x86 instructions were in the basic block, and their -types (and thus sizes). UCode translations for single x86 instructions are -delimited by the INCEIPo instruction, the argument of which gives -the byte size of the instruction (note that lazy INCEIP updating is turned off -to allow this).

- -We can tell if an x86 instruction references memory by looking for -LDL and STL UCode instructions, and thus what kind of -cost centre is required. From this we can determine how many cost centres we -need for the basic block, and their sizes. We can then allocate them in a -single array.
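- 
-In outline, the preliminary pass might look like the sketch below.  The
-types are simplified stand-ins invented for illustration, not the real
-UCode declarations.
- 
-/* Simplified stand-ins for the real UCode types. */
-typedef enum { LOAD, STORE, FPU_R, FPU_W, INCEIP, OTHER } UOpcode;
-typedef struct { UOpcode opcode; } UInstr;
-typedef struct { int used; UInstr* instrs; } UCodeBlock;
-
-/* Preliminary pass: INCEIP marks the end of each x86 instruction's
-   UCode, so walk the block, note whether a memory-referencing opcode
-   has appeared since the last INCEIP, and count how many iCCs (no
-   memory reference) and idCCs (memory reference) are needed. */
-void count_cost_centres ( UCodeBlock* cb, int* n_iCC, int* n_idCC )
-{
-   int i, refs_mem = 0;
-   *n_iCC = *n_idCC = 0;
-   for (i = 0; i < cb->used; i++) {
-      UOpcode op = cb->instrs[i].opcode;
-      if (op == LOAD || op == STORE || op == FPU_R || op == FPU_W)
-         refs_mem = 1;
-      if (op == INCEIP) {
-         if (refs_mem) (*n_idCC)++; else (*n_iCC)++;
-         refs_mem = 0;
-      }
-   }
-}
- 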

- 
-Consider the two-instruction example above.  After the preliminary pass, we
-know we need two cost centres, one iCC and one idCC.  So
-we allocate an array to store these, which looks like this:

-|(uninit)|      tag         (1 byte)
-|(uninit)|      instr_size  (1 byte)
-|(uninit)|      (padding)   (2 bytes)
-|(uninit)|      instr_addr  (4 bytes)
-|(uninit)|      I.a         (8 bytes)
-|(uninit)|      I.m1        (8 bytes)
-|(uninit)|      I.m2        (8 bytes)
-
-|(uninit)|      tag         (1 byte)
-|(uninit)|      instr_size  (1 byte)
-|(uninit)|      data_size   (1 byte)
-|(uninit)|      (padding)   (1 byte)
-|(uninit)|      instr_addr  (4 bytes)
-|(uninit)|      I.a         (8 bytes)
-|(uninit)|      I.m1        (8 bytes)
-|(uninit)|      I.m2        (8 bytes)
-|(uninit)|      D.a         (8 bytes)
-|(uninit)|      D.m1        (8 bytes)
-|(uninit)|      D.m2        (8 bytes)
-
- -(We can see now why we need tags to distinguish between the two types of cost -centres.)

- -We also record the size of the array. We look up the debug info of the first -instruction in the basic block, and then stick the array into a table indexed -by filename and function name. This makes it easy to dump the information -quickly to file at the end.
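- 
-The storage scheme just described might be pictured with a record like the
-one below.  It is purely illustrative: the names are invented and the real
-structures in vg_cachesim.c are not reproduced here.
- 
-/* One record per translated basic block: its contiguous cost-centre
-   array, plus the file and function of its first instruction so the
-   results can be grouped and dumped per source line at the end. */
-typedef struct _BBCostCentres {
-   struct _BBCostCentres* next;   /* hash-chain link                    */
-   unsigned int  orig_addr;       /* address of the basic block         */
-   const char*   filename;        /* from the first instr's debug info  */
-   const char*   fn_name;
-   unsigned int  array_size;      /* size in bytes of the array below   */
-   unsigned char array[1];        /* the packed iCC / idCC entries      */
-} BBCostCentres;
- 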

- -

Instrumentation

-The instrumentation pass has two main jobs: - -
  1. Fill in the gaps in the allocated cost centres.
  2. Add UCode to call the cache simulator for each instruction.

- 
-The instrumentation pass steps through the UCode and the cost centres in
-tandem.  As each original x86 instruction's UCode is processed, the
-appropriate gaps in that instruction's cost centre are filled in, for
-example:
-|INSTR_CC|      tag         (1 byte)
-|5       |      instr_size  (1 byte)
-|(uninit)|      (padding)   (2 bytes)
-|i_addr1 |      instr_addr  (4 bytes)
-|0       |      I.a         (8 bytes)
-|0       |      I.m1        (8 bytes)
-|0       |      I.m2        (8 bytes)
-
-|WRITE_CC|      tag         (1 byte)
-|7       |      instr_size  (1 byte)
-|4       |      data_size   (1 byte)
-|(uninit)|      (padding)   (1 byte)
-|i_addr2 |      instr_addr  (4 bytes)
-|0       |      I.a         (8 bytes)
-|0       |      I.m1        (8 bytes)
-|0       |      I.m2        (8 bytes)
-|0       |      D.a         (8 bytes)
-|0       |      D.m1        (8 bytes)
-|0       |      D.m2        (8 bytes)
-
- 
-(Note that this step is not performed if a basic block is re-translated; see
-the section on handling basic block retranslations below for more
-information.)

- 
-GCC inserts padding before the instr_addr field so that it is
-word-aligned.

- -The instrumentation added to call the cache simulation function looks like this -(instrumentation is indented to distinguish it from the original UCode): - -

-MOVL      $0x0, t20
-PUTL      t20, %EAX
-  PUSHL     %eax
-  PUSHL     %ecx
-  PUSHL     %edx
-  MOVL      $0x4091F8A4, t46  # address of 1st CC
-  PUSHL     t46
-  CALLMo    $0x12             # first cachesim function
-  CLEARo    $0x4
-  POPL      %edx
-  POPL      %ecx
-  POPL      %eax
-INCEIPo   $5
-
-LEA1L     -4(t4), t14
-MOVL      $0x99, t18
-  MOVL      t14, t42
-STL       t18, (t14)
-  PUSHL     %eax
-  PUSHL     %ecx
-  PUSHL     %edx
-  PUSHL     t42
-  MOVL      $0x4091F8C4, t44  # address of 2nd CC
-  PUSHL     t44
-  CALLMo    $0x13             # second cachesim function
-  CLEARo    $0x8
-  POPL      %edx
-  POPL      %ecx
-  POPL      %eax
-INCEIPo   $7
-
- -Consider the first instruction's UCode. Each call is surrounded by three -PUSHL and POPL instructions to save and restore the -caller-save registers. Then the address of the instruction's cost centre is -pushed onto the stack, to be the first argument to the cache simulation -function. The address is known at this point because we are doing a -simultaneous pass through the cost centre array. This means the cost centre -lookup for each instruction is almost free (just the cost of pushing an -argument for a function call). Then the call to the cache simulation function -for non-memory-reference instructions is made (note that the -CALLMo UInstruction takes an offset into a table of predefined -functions; it is not an absolute address), and the single argument is -CLEARed from the stack.

- -The second instruction's UCode is similar. The only difference is that, as -mentioned before, we have to pass the address of the data item referenced to -the cache simulation function too. This explains the MOVL t14, -t42 and PUSHL t42 UInstructions. (Note that the seemingly -redundant MOVing will probably be optimised away during register -allocation.)
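- 
-Putting the two cases together, the helpers reached through CALLMo
-presumably have shapes along these lines.  The names are hypothetical
-(the real functions live in vg_cachesim.c and are reached by
-their offsets in the helper table, not by name); the types are the
-iCC, idCC and Addr already described.
- 
-/* Hypothetical helper prototypes.  Both take a pointer to the
-   instruction's cost centre; the memory-referencing variant also takes
-   the data address that the instrumented code computed into t42. */
-void cachesim_log_instr     ( iCC*  cc );                   /* CALLMo $0x12 */
-void cachesim_log_mem_instr ( idCC* cc, Addr data_addr );   /* CALLMo $0x13 */
- 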

- -Note that instead of storing unchanging information about each instruction -(instruction size, data size, etc) in its cost centre, we could have passed in -these arguments to the simulation function. But this would slow the calls down -(two or three extra arguments pushed onto the stack). Also it would bloat the -UCode instrumentation by amounts similar to the space required for them in the -cost centre; bloated UCode would also fill the translation cache more quickly, -requiring more translations for large programs and slowing them down more.

- - -

Handling basic block retranslations

-The above description ignores one complication. Valgrind has a limited size -cache for basic block translations; if it fills up, old translations are -discarded. If a discarded basic block is executed again, it must be -re-translated.

- -However, we can't use this approach for profiling -- we can't throw away cost -centres for instructions in the middle of execution! So when a basic block is -translated, we first look for its cost centre array in the hash table. If -there is no cost centre array, it must be the first translation, so we proceed -as described above. But if there is a cost centre array already, it must be a -retranslation. In this case, we skip the cost centre allocation and -initialisation steps, but still do the UCode instrumentation step.
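- 
-In pseudo-C, with every function name invented for illustration (and
-reusing the hypothetical BBCostCentres record from the storage
-sketch earlier), the decision looks something like this:
- 
-/* Sketch of the (re)translation path described above. */
-void setup_profiling_for_BB ( UCodeBlock* cb, unsigned int orig_addr )
-{
-   BBCostCentres* ccs = find_BB_cost_centres(orig_addr);
-   if (ccs == NULL) {
-      /* First translation: count, allocate and zero the cost centres. */
-      ccs = alloc_and_init_cost_centres(cb, orig_addr);
-      add_BB_cost_centres(ccs);
-   }
-   /* Retranslation or not, (re)instrument the UCode against the same
-      cost centres, so no counts accumulated so far are lost. */
-   add_cachesim_calls(cb, ccs);
-}
- 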

- -

The cache simulation

-The cache simulation is fairly straightforward. It just tracks which memory -blocks are in the cache at the moment (it doesn't track the contents, since -that is irrelevant).

- 
-The interface to the simulation is quite clean.  The functions called from
-the UCode contain calls to the simulation functions in the files
-vg_cachesim_{I1,D1,L2}.c; these calls are inlined so that only
-one function call is done per simulated x86 instruction.  The file
-vg_cachesim.c simply #includes the three files
-containing the simulation, which makes plugging in new cache simulations
-very easy -- you just replace the three files and recompile.
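- 
-To give a flavour of what the simulation involves, here is a minimal,
-self-contained sketch of a set-associative, LRU cache that records only
-which block tags are present.  It is not the code in
-vg_cachesim_{I1,D1,L2}.c; the geometry and names are made up, and
-accesses that straddle a line boundary are ignored for simplicity.
- 
-/* Minimal LRU set-associative cache sketch (illustrative only). */
-#define N_SETS     256       /* eg. 16KB: 256 sets x 4 ways x 16-byte lines */
-#define N_WAYS     4
-#define LINE_BITS  4
-
-static unsigned int tags [N_SETS][N_WAYS];  /* way 0 is most recently used */
-static int          valid[N_SETS][N_WAYS];
-
-/* Simulate one reference to 'addr'; returns 1 on a miss, 0 on a hit,
-   updating the LRU order either way. */
-int cache_ref ( unsigned int addr )
-{
-   unsigned int tag = addr >> LINE_BITS;
-   unsigned int set = tag & (N_SETS - 1);
-   int i, j;
-
-   for (i = 0; i < N_WAYS; i++) {
-      if (valid[set][i] && tags[set][i] == tag) {
-         /* hit: shuffle this way up to the most-recently-used slot */
-         for (j = i; j > 0; j--) tags[set][j] = tags[set][j-1];
-         tags[set][0] = tag;
-         return 0;
-      }
-   }
-   /* miss: evict the least-recently-used way, install new tag as MRU */
-   for (j = N_WAYS-1; j > 0; j--) {
-      tags [set][j] = tags [set][j-1];
-      valid[set][j] = valid[set][j-1];
-   }
-   tags [set][0] = tag;
-   valid[set][0] = 1;
-   return 1;
-}
- 
-One instance of something like this per cache level (with the appropriate
-geometry), plus the logic to consult L2 on an L1 miss, is roughly all the
-simulation has to do; the hit/miss outcomes are accumulated into the
-a, m1 and m2 fields of the cost centre
-passed in by the instrumented code.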

- -

Output

-Output is fairly straightforward, basically printing the cost centre for every -instruction, grouped by files and functions. Total counts (eg. total cache -accesses, total L1 misses) are calculated when traversing this structure rather -than during execution, to save time; the cache simulation functions are called -so often that even one or two extra adds can make a sizeable difference.

- 
-The output file (cachegrind.out), which in turn is the input to
-vg_annotate, has the following format:

-file         ::= desc_line* cmd_line events_line data_line+ summary_line
-desc_line    ::= "desc:" ws? non_nl_string
-cmd_line     ::= "cmd:" ws? cmd
-events_line  ::= "events:" ws? (event ws)+
-data_line    ::= file_line | fn_line | count_line
-file_line    ::= ("fl=" | "fi=" | "fe=") filename
-fn_line      ::= "fn=" fn_name
-count_line   ::= line_num ws? (count ws)+
-summary_line ::= "summary:" ws? (count ws)+
-count        ::= num | "."
-
- 
-Where non_nl_string is any string not containing a newline; cmd is the
-command line used to invoke the profiled program; event is an event name;
-filename, fn_name, line_num and num are a filename, a function name, a line
-number and a decimal count respectively; and ws is whitespace.
- 
-The contents of the "desc:" lines are printed out at the top of the summary.
-This is a generic way of providing simulation-specific information, eg. for
-giving the cache configuration for cache simulation.

- -Counts can be "." to represent "N/A", eg. the number of write misses for an -instruction that doesn't write to memory.

- 
-The number of counts in each count_line and in the
-summary_line should not exceed the number of events in the
-events_line.  If the number in a line is less,
-vg_annotate treats the missing entries as though they were ".".

- 
-A file_line changes the current file name.  A fn_line
-changes the current function name.  A count_line contains counts
-that pertain to the current filename/fn_name.  An "fl=" file_line
-and a fn_line must appear before any count_lines, to
-give the context of the first count_line.

- 
-Each file_line should be immediately followed by a
-fn_line.  "fi=" file_lines are used to switch
-filenames for inlined functions; "fe=" file_lines are similar, but
-are put at the end of a basic block in which the file name hasn't been
-switched back to the original file name.  (fi and fe lines behave the same;
-they are only distinguished to help debugging.)
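- 
-As a concrete (and entirely made-up) example, a minimal file conforming to
-the grammar above might look like this; the event names and counts are
-invented for illustration:
- 
-desc: I1 cache:  16384 B, 32 B, 4-way associative
-desc: D1 cache:  16384 B, 32 B, 4-way associative
-desc: L2 cache:  262144 B, 64 B, 8-way associative
-cmd: ./example arg1
-events: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
-fl=example.c
-fn=main
-5 2 1 1 . . . . . .
-6 3 0 0 1 1 1 1 0 0
-summary: 5 1 1 1 1 1 1 0 0
- 
-Here line 5 of example.c fetched an instruction twice, with one I1
-and one L2 miss and no data access; line 6 did a data read that missed in
-both levels and a write that hit; and the summary line in this example
-simply totals the count_lines above it, event by event.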

- - -

Summary of performance features

-Quite a lot of work has gone into making the profiling as fast as possible.
-This is a summary of the important features:

  * Cost centres for a basic block are stored in one contiguous array and
    are walked in tandem with the UCode during instrumentation, so looking
    up the cost centre for an instruction costs no more than pushing one
    extra argument for the simulation call.

  * Unchanging per-instruction information (address, instruction size, data
    size) is stored in the cost centre rather than passed to the simulation
    functions, keeping both the calls and the instrumented UCode small.

  * The per-level simulation functions are inlined, so only one function
    call is made per simulated x86 instruction.

  * The write half of a read-then-write instruction is not passed to the
    data-cache simulation, since in a write-allocate cache it cannot miss.

  * Totals are computed when the results are dumped at the end of the run,
    not during execution.

Annotation

-Annotation is done by vg_annotate. It is a fairly straightforward Perl script -that slurps up all the cost centres, and then runs through all the chosen -source files, printing out cost centres with them. It too has been carefully -optimised. - - -

Similar work, extensions

-It would be relatively straightforward to do other simulations and obtain -line-by-line information about interesting events. A good example would be -branch prediction -- all branches could be instrumented to interact with a -branch prediction simulator, using very similar techniques to those described -above.

- -In particular, vg_annotate would not need to change -- the file format is such -that it is not specific to the cache simulation, but could be used for any kind -of line-by-line information. The only part of vg_annotate that is specific to -the cache simulation is the name of the input file -(cachegrind.out), although it would be very simple to add an -option to control this.

- - -