diff --git a/docs/Makefile.am b/docs/Makefile.am
index e8a58fa18..39b9008b6 100644
--- a/docs/Makefile.am
+++ b/docs/Makefile.am
@@ -1,5 +1,5 @@
 docdir = $(datadir)/doc/valgrind
-doc_DATA = index.html manual.html nav.html techdocs.html
+doc_DATA = index.html
 
 EXTRA_DIST = $(doc_DATA)
diff --git a/docs/index.html b/docs/index.html
index 111170256..d4db7c868 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -1,26 +1,33 @@
-
+
+- -
-Valgrind is licensed under the GNU General Public License,
-version 2
-An open-source tool for finding memory-management problems in
-Linux-x86 executables.
-
- -
-Valgrind is closely tied to details of the CPU, operating system and
-to a lesser extent, compiler and basic C libraries. This makes it
-difficult to port, so I have chosen at the outset to
-concentrate on what I believe to be a widely used platform: Linux on
-x86s. Valgrind uses the standard Unix ./configure,
-make, make install mechanism, and I have
-attempted to ensure that it works on machines with kernel 2.2 or 2.4
-and glibc 2.1.X or 2.2.X. This should cover the vast majority of
-modern Linux installations.
-
-
-
-Valgrind is licensed under the GNU General Public License, version
-2. Read the file LICENSE in the source distribution for details. Some
-of the PThreads test cases, test/pth_*.c, are taken from
-"Pthreads Programming" by Bradford Nichols, Dick Buttlar & Jacqueline
-Proulx Farrell, ISBN 1-56592-115-1, published by O'Reilly &
-Associates, Inc.
-
-
-
-To use it, you don't need to recompile or relink anything; just put the word
valgrind at the start of the command line
-normally used to run the program. So, for example, if you want to run
-the command ls -l on Valgrind, simply issue the
-command: valgrind ls -l.
-
-Valgrind takes control of your program before it starts. Debugging -information is read from the executable and associated libraries, so -that error messages can be phrased in terms of source code -locations. Your program is then run on a synthetic x86 CPU which -checks every memory access. All detected errors are written to a -log. When the program finishes, Valgrind searches for and reports on -leaked memory. - -
You can run pretty much any dynamically linked ELF x86 executable -using Valgrind. Programs run 25 to 50 times slower, and take a lot -more memory, than they usually would. It works well enough to run -large programs. For example, the Konqueror web browser from the KDE -Desktop Environment, version 3.0, runs slowly but usably on Valgrind. - -
Valgrind simulates every single instruction your program executes.
-Because of this, it finds errors not only in your application but also
-in all supporting dynamically-linked (.so-format)
-libraries, including the GNU C library, the X client libraries, Qt, if
-you work with KDE, and so on. That often includes libraries, for
-example the GNU C library, which contain memory access violations, but
-which you cannot or do not want to fix.
-
-
Rather than swamping you with errors in which you are not -interested, Valgrind allows you to selectively suppress errors, by -recording them in a suppressions file which is read when Valgrind -starts up. The build mechanism attempts to select suppressions which -give reasonable behaviour for the libc and XFree86 versions detected -on your machine. - - -
Section 6 shows an example of use. -
-
-g flag). You don't have to
-do this, but doing so helps Valgrind produce more accurate and less
-confusing error reports. Chances are you're set up like this already,
-if you intended to debug your program with GNU gdb, or some other
-debugger.
-
-
-A plausible compromise is to use -g -O.
-Optimisation levels above -O have been observed, on very
-rare occasions, to cause gcc to generate code which fools Valgrind's
-error tracking machinery into wrongly reporting uninitialised value
-errors. -O gets you the vast majority of the benefits of
-higher optimisation levels anyway, so you don't lose much there.
-
-
-Valgrind understands both the older "stabs" debugging format, used by -gcc versions prior to 3.1, and the newer DWARF2 format used by gcc 3.1 -and later. - -
-Then just run your application, but place the word
-valgrind in front of your usual command-line invocation.
-Note that you should run the real (machine-code) executable here. If
-your application is started by, for example, a shell or perl script,
-you'll need to modify it to invoke Valgrind on the real executables.
-Running such scripts directly under Valgrind will result in you
-getting error reports pertaining to /bin/sh,
-/usr/bin/perl, or whatever interpreter you're using.
-This almost certainly isn't what you want and can be confusing.
-
-
-
All lines in the commentary are of the following form:
-
- ==12345== some-message-from-Valgrind --
The 12345 is the process ID. This scheme makes it easy
-to distinguish program output from Valgrind commentary, and also easy
-to differentiate commentaries from different processes which have
-become merged together, for whatever reason.
-
-
By default, Valgrind writes only essential messages to the commentary,
-so as to avoid flooding you with information of secondary importance.
-If you want more information about what is happening, re-run, passing
-the -v flag to Valgrind.
-
-
-
-
- ==25832== Invalid read of size 4 - ==25832== at 0x8048724: BandMatrix::ReSize(int, int, int) (bogon.cpp:45) - ==25832== by 0x80487AF: main (bogon.cpp:66) - ==25832== by 0x40371E5E: __libc_start_main (libc-start.c:129) - ==25832== by 0x80485D1: (within /home/sewardj/newmat10/bogon) - ==25832== Address 0xBFFFF74C is not stack'd, malloc'd or free'd -- -
This message says that the program did an illegal 4-byte read of
-address 0xBFFFF74C, which, as far as it can tell, is not a valid stack
-address, nor does it correspond to any currently malloc'd or free'd block.
-The read is happening at line 45 of bogon.cpp, called
-from line 66 of the same file, etc. For errors associated with an
-identified malloc'd/free'd block, for example reading free'd memory,
-Valgrind reports not only the location where the error happened, but
-also where the associated block was malloc'd/free'd.
-
-
Valgrind remembers all error reports. When an error is detected, -it is compared against old reports, to see if it is a duplicate. If -so, the error is noted, but no further commentary is emitted. This -avoids you being swamped with bazillions of duplicate error reports. - -
If you want to know how many times each error occurred, run with
-the -v option. When execution finishes, all the reports
-are printed out, along with, and sorted by, their occurrence counts.
-This makes it easy to see which errors have occurred most frequently.
-
-
Errors are reported before the associated operation actually -happens. For example, if your program decides to read from address -zero, Valgrind will emit a message to this effect, and the program -will then duly die with a segmentation fault. - -
In general, you should try and fix errors in the order that they -are reported. Not doing so can be confusing. For example, a program -which copies uninitialised values to several memory locations, and -later uses them, will generate several error messages. The first such -error message may well give the most direct clue to the root cause of -the problem. - -
The process of detecting duplicate errors is quite an expensive
-one and can become a significant performance overhead if your program
-generates huge quantities of errors. To avoid serious problems here,
-Valgrind will simply stop collecting errors after 300 different errors
-have been seen, or 30000 errors in total have been seen. In this
-situation you might as well stop your program and fix it, because
-Valgrind won't tell you anything else useful after this. Note that
-the 300/30000 limits apply after suppressed errors are removed. These
-limits are defined in vg_include.h and can be increased
-if necessary.
-
-
To avoid this cutoff you can use the
---error-limit=no flag. Then valgrind will always show
-errors, regardless of how many there are. Use this flag carefully,
-since it may have a dire effect on performance.
-
-
-
-
./configure script.
-
-You can modify and add to the suppressions file at your leisure, -or, better, write your own. Multiple suppression files are allowed. -This is useful if part of your project contains errors you can't or -don't want to fix, yet you don't want to continuously be reminded of -them. - -
Each error to be suppressed is described very specifically, to -minimise the possibility that a suppression-directive inadvertently -suppresses a bunch of similar errors which you did want to see. The -suppression mechanism is designed to allow precise yet flexible -specification of errors to suppress. - -
If you use the -v flag, at the end of execution, Valgrind
-prints out one line for each used suppression, giving its name and the
-number of times it got used. Here are the suppressions used by a run of
-ls -l:
-
- --27579-- supp: 1 socketcall.connect(serv_addr)/__libc_connect/__nscd_getgrgid_r - --27579-- supp: 1 socketcall.connect(serv_addr)/__libc_connect/__nscd_getpwuid_r - --27579-- supp: 6 strrchr/_dl_map_object_from_fd/_dl_map_object -- - -
- valgrind [options-for-Valgrind] your-prog [options for your-prog] -- -
Note that Valgrind also reads options from the environment variable
-$VALGRIND_OPTS, and processes them before the command-line
-options.
-
-
Valgrind's default settings succeed in giving reasonable behaviour -in most cases. Available options, in no particular order, are as -follows: -
--help--versionThe usual deal.
- -
-v --verboseBe more verbose. Gives extra information on various aspects - of your program, such as: the shared objects loaded, the - suppressions used, the progress of the instrumentation engine, - and warnings about unusual behaviour. -
- -
-q --quietRun silently, and only print error messages. Useful if you - are running regression tests or have some other automated test - machinery. -
- -
--demangle=no--demangle=yes [the default]
- Disable/enable automatic demangling (decoding) of C++ names. - Enabled by default. When enabled, Valgrind will attempt to - translate encoded C++ procedure names back to something - approaching the original. The demangler handles symbols mangled - by g++ versions 2.X and 3.X. - -
An important fact about demangling is that function - names mentioned in suppressions files should be in their mangled - form. Valgrind does not demangle function names when searching - for applicable suppressions, because to do otherwise would make - suppressions file contents dependent on the state of Valgrind's - demangling machinery, and would also be slow and pointless. -
- -
--num-callers=<number> [default=4]By default, Valgrind shows four levels of function call names - to help you identify program locations. You can change that - number with this option. This can help in determining the - program's location in deeply-nested call chains. Note that errors - are commoned up using only the top three function locations (the - place in the current function, and that of its two immediate - callers). So this doesn't affect the total number of errors - reported. -
- The maximum value for this is 50. Note that higher settings - will make Valgrind run a bit more slowly and take a bit more - memory, but can be useful when working with programs with - deeply-nested call chains. -
- -
--gdb-attach=no [the default]--gdb-attach=yes
- When enabled, Valgrind will pause after every error shown,
- and print the line
-
- ---- Attach to GDB ? --- [Return/N/n/Y/y/C/c] ----
-
- Pressing Ret, or N Ret
- or n Ret, causes Valgrind not to
- start GDB for this error.
-
- Y Ret
- or y Ret causes Valgrind to
- start GDB, for the program at this point. When you have
- finished with GDB, quit from it, and the program will continue.
- Trying to continue from inside GDB doesn't work.
-
- C Ret
- or c Ret causes Valgrind not to
- start GDB, and not to ask again.
-
- --gdb-attach=yes conflicts with
- --trace-children=yes. You can't use them together.
- Valgrind refuses to start up in this situation. 1 May 2002:
- this is a historical relic which could be easily fixed if it
- gets in your way. Mail me and complain if this is a problem for
- you.
- -
--partial-loads-ok=yes [the default]--partial-loads-ok=no
- Controls how Valgrind handles word (4-byte) loads from
- addresses for which some bytes are addressable and others
- are not. When yes (the default), such loads
- do not elicit an address error. Instead, the loaded V bytes
- corresponding to the illegal addresses indicate undefined data, and
- those corresponding to legal addresses are loaded from shadow
- memory, as usual.
-
- When no, loads from partially
- invalid addresses are treated the same as loads from completely
- invalid addresses: an illegal-address error is issued,
- and the resulting V bytes indicate valid data.
-
- -
--sloppy-malloc=no [the default]--sloppy-malloc=yes
- When enabled, all requests for malloc/calloc are rounded up - to a whole number of machine words -- in other words, made - divisible by 4. For example, a request for 17 bytes of space - would result in a 20-byte area being made available. This works - around bugs in sloppy libraries which assume that they can - safely rely on malloc/calloc requests being rounded up in this - fashion. Without the workaround, these libraries tend to - generate large numbers of errors when they access the ends of - these areas. -
- Valgrind snapshots dated 17 Feb 2002 and later are
- cleverer about this problem, and you should no longer need to
- use this flag. To put it bluntly, if you do need to use this
- flag, your program violates the ANSI C semantics defined for
- malloc and free, even if it appears to
- work correctly, and you should fix it, at least if you hope for
- maximum portability.
-
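The rounding described above is the usual align-up-to-a-word computation; a minimal sketch of it (illustrative only, not Valgrind's actual code):

   #include <stddef.h>

   /* Round a request up to a whole number of 4-byte machine words,
      as --sloppy-malloc=yes does: 17 -> 20, 16 -> 16. */
   static size_t round_up_to_word ( size_t nbytes )
   {
      return (nbytes + 3) & ~(size_t)3;
   }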
- -
--alignment=<number> [default: 4]By
- default valgrind's malloc, realloc,
- etc, return 4-byte aligned addresses. These are suitable for
- any accesses on x86 processors.
- Some programs might however assume that malloc et
- al return memory aligned to 8 bytes or more.
- These programs are broken and should be fixed, but
- if this is impossible for whatever reason the alignment can be
- increased using this parameter. The supplied value must be
- between 4 and 4096 inclusive, and must be a power of two.
- -
--trace-children=no [the default]--trace-children=yes
- When enabled, Valgrind will trace into child processes. This
- is confusing and usually not what you want, so is disabled by
- default. As of 1 May 2002, tracing into a child process from a
- parent which uses libpthread.so is probably broken
- and likely to cause problems. Please report any such
- problems to me.
- -
--freelist-vol=<number> [default: 1000000]
- When the client program releases memory using free (in C) or - delete (C++), that memory is not immediately made available for - re-allocation. Instead it is marked inaccessible and placed in - a queue of freed blocks. The purpose is to delay the point at - which freed-up memory comes back into circulation. This - increases the chance that Valgrind will be able to detect - invalid accesses to blocks for some significant period of time - after they have been freed. -
- This flag specifies the maximum total size, in bytes, of the - blocks in the queue. The default value is one million bytes. - Increasing this increases the total amount of memory used by - Valgrind but may detect invalid uses of freed blocks which would - otherwise go undetected.
- -
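The kind of bug this freed-blocks queue exists to catch is an access to a block after it has been freed; a contrived example (not from the Valgrind test suite):

   #include <stdlib.h>

   int main ( void )
   {
      int* p = malloc(10 * sizeof(int));
      p[0] = 17;
      free(p);
      /* Use after free: caught as long as the block is still
         sitting in the freed-blocks queue. */
      return p[0];
   }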
--logfile-fd=<number> [default: 2, stderr]
- Specifies the file descriptor on which Valgrind communicates
- all of its messages. The default, 2, is the standard error
- channel. This may interfere with the client's own use of
- stderr. To dump Valgrind's commentary in a file without using
- stderr, something like the following works well (sh/bash
- syntax):
-
- valgrind --logfile-fd=9 my_prog 9> logfile
- That is: tell Valgrind to send all output to file descriptor 9,
- and ask the shell to route file descriptor 9 to "logfile".
-
- -
--suppressions=<filename>
- [default: $PREFIX/lib/valgrind/default.supp]
- Specifies an extra - file from which to read descriptions of errors to suppress. You - may use as many extra suppressions files as you - like.
- -
--leak-check=no [default]--leak-check=yes
- When enabled, search for memory leaks when the client program - finishes. A memory leak means a malloc'd block, which has not - yet been free'd, but to which no pointer can be found. Such a - block can never be free'd by the program, since no pointer to it - exists. Leak checking is disabled by default because it tends - to generate dozens of error messages.
- -
--show-reachable=no [default]--show-reachable=yes
- When disabled, the memory leak detector only shows blocks to
- which it cannot find any pointer at all, or only a
- pointer to the middle. These blocks are prime candidates for
- memory leaks. When enabled, the leak detector also reports on
- blocks which it could find a pointer to. Your program could, at
- least in principle, have freed such blocks before exit.
- Contrast this to blocks for which no pointer, or only an
- interior pointer could be found: they are more likely to
- indicate memory leaks, because you do not actually have a
- pointer to the start of the block which you can hand to
- free, even if you wanted to.
- -
--leak-resolution=low [default]--leak-resolution=med --leak-resolution=high
- When doing leak checking, determines how willing Valgrind is
- to consider different backtraces to be the same. When set to
- low, the default, only the first two entries need
- match. When med, four entries have to match. When
- high, all entries need to match.
-
- For hardcore leak debugging, you probably want to use
- --leak-resolution=high together with
- --num-callers=40 or some such large number. Note
- however that this can give an overwhelming amount of
- information, which is why the defaults are 4 callers and
- low-resolution matching.
-
- Note that the --leak-resolution= setting does not
- affect Valgrind's ability to find leaks. It only changes how
- the results are presented.
-
- -
--workaround-gcc296-bugs=no [default]--workaround-gcc296-bugs=yes When enabled,
- Valgrind assumes that reads and writes some small distance below the stack
- pointer %esp are due to bugs in gcc 2.96, and does
- not report them. The "small distance" is 256 bytes by default.
- Note that gcc 2.96 is the default compiler on some popular Linux
- distributions (RedHat 7.X, Mandrake) and so you may well need to
- use this flag. Do not use it if you do not have to, as it can
- cause real errors to be overlooked. Another option is to use a
- gcc/g++ which does not generate accesses below the stack
- pointer. 2.95.3 seems to be a good choice in this respect.
-
- Unfortunately (27 Feb 02) it looks like g++ 3.0.4 has a similar - bug, so you may need to issue this flag if you use 3.0.4. A - while later (early Apr 02) this is confirmed as a scheduling bug - in g++-3.0.4. -
- -
--error-limit=yes [default]--error-limit=no When enabled, valgrind stops - reporting errors after 30000 in total, or 300 different ones, - have been seen. This is to stop the error tracking machinery - from becoming a huge performance overhead in programs with many - errors.
- -
--cachesim=no [default]--cachesim=yes When enabled, turns off memory - checking, and turns on cache profiling. Cache profiling is - described in detail in Section 7. -
- -
--weird-hacks=hack1,hack2,...
- Pass miscellaneous hints to Valgrind which slightly modify the
- simulated behaviour in nonstandard or dangerous ways, possibly
- to help the simulation of strange features. By default no hacks
- are enabled. Use with caution! Currently known hacks are:
- -
ioctl-VTIME Use this if you have a program
- which sets readable file descriptors to have a timeout by
- doing ioctl on them with a
- TCSETA-style command and a non-zero
- VTIME timeout value. This is considered
- potentially dangerous and therefore is not engaged by
- default, because it is (remotely) conceivable that it could
- cause threads doing read to incorrectly block
- the entire process.
-
- You probably want to try this one if you have a program
- which unexpectedly blocks in a read from a file
- descriptor which you know to have been messed with by
- ioctl. This could happen, for example, if the
- descriptor is used to read input from some kind of screen
- handling library.
-
- To find out if your program is blocking unexpectedly in the
- read system call, run with
- --trace-syscalls=yes flag.
-
-
truncate-writes Use this if you have a threaded
- program which appears to unexpectedly block whilst writing
- into a pipe. The effect is to modify all calls to
- write() so that requests to write more than
- 4096 bytes are treated as if they only requested a write of
- 4096 bytes. Valgrind does this by changing the
- count argument of write(), as
- passed to the kernel, so that it is at most 4096. The
- amount of data written will then be less than the client
- program asked for, but the client should have a loop around
- its write() call to check whether the requested
- number of bytes have been written. If not, it should issue
- further write() calls until all the data is
- written.
- - This all sounds pretty dodgy to me, which is why I've made - this behaviour only happen on request. It is not the - default behaviour. At the time of writing this (30 June - 2002) I have only seen one example where this is necessary, - so either the problem is extremely rare or nobody is using - Valgrind :-) -
- On experimentation I see that truncate-writes
- doesn't interact well with ioctl-VTIME, so you
- probably don't want to try both at once.
-
- As above, to find out if your program is blocking
- unexpectedly in the write() system call, you
- may find the --trace-syscalls=yes
- --trace-sched=yes flags useful.
-
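As noted above, a client writing to a pipe under truncate-writes must loop until everything has been written. A sketch of such a loop (ordinary POSIX code, nothing Valgrind-specific):

   #include <errno.h>
   #include <unistd.h>

   /* Write exactly 'count' bytes to 'fd', coping with short writes
      such as those produced by the truncate-writes hack. */
   int write_all ( int fd, const void* buf, size_t count )
   {
      size_t done = 0;
      while (done < count) {
         ssize_t n = write(fd, (const char*)buf + done, count - done);
         if (n < 0) {
            if (errno == EINTR) continue;   /* interrupted; retry */
            return -1;                      /* genuine error */
         }
         done += (size_t)n;
      }
      return 0;
   }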
-
--single-step=no [default]--single-step=yes
- When enabled, each x86 insn is translated separately into - instrumented code. When disabled, translation is done on a - per-basic-block basis, giving much better translations.
- -
--optimise=no--optimise=yes [default]
- When enabled, various improvements are applied to the - intermediate code, mainly aimed at allowing the simulated CPU's - registers to be cached in the real CPU's registers over several - simulated instructions.
- -
--instrument=no--instrument=yes [default]
- When disabled, the translations don't actually contain any - instrumentation.
- -
--cleanup=no--cleanup=yes [default]
- When enabled, various improvements are applied to the - post-instrumented intermediate code, aimed at removing redundant - value checks.
- -
--trace-syscalls=no [default]--trace-syscalls=yes
- Enable/disable tracing of system call intercepts.
- -
--trace-signals=no [default]--trace-signals=yes
- Enable/disable tracing of signal handling.
- -
--trace-sched=no [default]--trace-sched=yes
- Enable/disable tracing of thread scheduling events.
- -
--trace-pthread=none [default]--trace-pthread=some --trace-pthread=all
- Specifies amount of trace detail for pthread-related events.
- -
--trace-symtab=no [default]--trace-symtab=yes
- Enable/disable tracing of symbol table reading.
- -
--trace-malloc=no [default]--trace-malloc=yes
- Enable/disable tracing of malloc/free (et al) intercepts. -
- -
--stop-after=<number>
- [default: infinity, more or less]
- After <number> basic blocks have been executed, shut down - Valgrind and switch back to running the client on the real CPU. -
- -
--dump-error=<number> [default: inactive]
- After the program has exited, show gory details of the
- translation of the basic block containing the <number>'th
- error context. When used with --single-step=yes,
- can show the exact x86 instruction causing an error. This is
- all fairly dodgy and doesn't work at all if threads are
- involved.
-
- Invalid read of size 4 - at 0x40F6BBCC: (within /usr/lib/libpng.so.2.1.0.9) - by 0x40F6B804: (within /usr/lib/libpng.so.2.1.0.9) - by 0x40B07FF4: read_png_image__FP8QImageIO (kernel/qpngio.cpp:326) - by 0x40AC751B: QImageIO::read() (kernel/qimage.cpp:3621) - Address 0xBFFFF0E0 is not stack'd, malloc'd or free'd -- -
This happens when your program reads or writes memory at a place -which Valgrind reckons it shouldn't. In this example, the program did -a 4-byte read at address 0xBFFFF0E0, somewhere within the -system-supplied library libpng.so.2.1.0.9, which was called from -somewhere else in the same library, called from line 326 of -qpngio.cpp, and so on. - -
Valgrind tries to establish what the illegal address might relate -to, since that's often useful. So, if it points into a block of -memory which has already been freed, you'll be informed of this, and -also where the block was free'd. Likewise, if it should turn out -to be just off the end of a malloc'd block, a common result of -off-by-one errors in array subscripting, you'll be informed of this -fact, and also where the block was malloc'd. - -
In this example, Valgrind can't identify the address. Actually the -address is on the stack, but, for some reason, this is not a valid -stack address -- it is below the stack pointer, %esp, and that isn't -allowed. In this particular case it's probably caused by gcc -generating invalid code, a known bug in various flavours of gcc. - -
Note that Valgrind only tells you that your program is about to -access memory at an illegal address. It can't stop the access from -happening. So, if your program makes an access which normally would -result in a segmentation fault, your program will still suffer the same -fate -- but you will get a message from Valgrind immediately prior to -this. In this particular example, reading junk on the stack is -non-fatal, and the program stays alive. - -
- Conditional jump or move depends on uninitialised value(s) - at 0x402DFA94: _IO_vfprintf (_itoa.h:49) - by 0x402E8476: _IO_printf (printf.c:36) - by 0x8048472: main (tests/manuel1.c:8) - by 0x402A6E5E: __libc_start_main (libc-start.c:129) -- -
An uninitialised-value use error is reported when your program uses -a value which hasn't been initialised -- in other words, is undefined. -Here, the undefined value is used somewhere inside the printf() -machinery of the C library. This error was reported when running the -following small program: -
- #include <stdio.h>
- int main()
- {
- int x;
- printf ("x = %d\n", x);
- }
-
-
-It is important to understand that your program can copy around -junk (uninitialised) data to its heart's content. Valgrind observes -this and keeps track of the data, but does not complain. A complaint -is issued only when your program attempts to make use of uninitialised -data. In this example, x is uninitialised. Valgrind observes the -value being passed to _IO_printf and thence to _IO_vfprintf, but makes -no comment. However, _IO_vfprintf has to examine the value of x so it -can turn it into the corresponding ASCII string, and it is at this -point that Valgrind complains. - -
Sources of uninitialised data tend to be: -
- Local variables in procedures which have not been initialised, as in the example above.
- The contents of malloc'd blocks, before you (or a constructor) write something there.
-
- Invalid free() - at 0x4004FFDF: free (ut_clientmalloc.c:577) - by 0x80484C7: main (tests/doublefree.c:10) - by 0x402A6E5E: __libc_start_main (libc-start.c:129) - by 0x80483B1: (within tests/doublefree) - Address 0x3807F7B4 is 0 bytes inside a block of size 177 free'd - at 0x4004FFDF: free (ut_clientmalloc.c:577) - by 0x80484C7: main (tests/doublefree.c:10) - by 0x402A6E5E: __libc_start_main (libc-start.c:129) - by 0x80483B1: (within tests/doublefree) --
Valgrind keeps track of the blocks allocated by your program with -malloc/new, so it knows exactly whether or not the argument to -free/delete is legitimate. Here, this test program has -freed the same block twice. As with the illegal read/write errors, -Valgrind attempts to make sense of the address free'd. If, as -here, the address is one which has previously been freed, you will -be told that -- making duplicate frees of the same block easy to spot. - -
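A minimal program which provokes this kind of report (not the actual tests/doublefree.c) is simply:

   #include <stdlib.h>

   int main ( void )
   {
      char* p = malloc(177);
      free(p);
      free(p);    /* second free of the same block: Invalid free() */
      return 0;
   }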
new[]
-has wrongly been deallocated with free:
-- Mismatched free() / delete / delete [] - at 0x40043249: free (vg_clientfuncs.c:171) - by 0x4102BB4E: QGArray::~QGArray(void) (tools/qgarray.cpp:149) - by 0x4C261C41: PptDoc::~PptDoc(void) (include/qmemarray.h:60) - by 0x4C261F0E: PptXml::~PptXml(void) (pptxml.cc:44) - Address 0x4BB292A8 is 0 bytes inside a block of size 64 alloc'd - at 0x4004318C: __builtin_vec_new (vg_clientfuncs.c:152) - by 0x4C21BC15: KLaola::readSBStream(int) const (klaola.cc:314) - by 0x4C21C155: KLaola::stream(KLaola::OLENode const *) (klaola.cc:416) - by 0x4C21788F: OLEFilter::convert(QCString const &) (olefilter.cc:272) --The following was told to me by the KDE 3 developers. I didn't know -any of it myself. They also implemented the check itself.
-In C++ it's important to deallocate memory in a way compatible with -how it was allocated. The deal is: -
malloc, calloc,
- realloc, valloc or
- memalign, you must deallocate with free.
-new[], you must deallocate with
- delete[].
-new, you must deallocate with
- delete.
-
-Pascal Massimino adds the following clarification:
-delete[] must be used with a pointer obtained from
-new[], because the compiler stores the size of the array
-and the pointer-to-member to the destructor of the array's contents
-just before the pointer actually returned. This implies a
-variable-sized overhead in what's returned by new or
-new[]. It is rather surprising how robust compilers [Ed:
-runtime-support libraries?] are to mismatches between
-new/delete and
-new[]/delete[].
-
-
-
Here's an example of a system call with an invalid parameter: -
- #include <stdlib.h>
- #include <unistd.h>
- int main( void )
- {
- char* arr = malloc(10);
- (void) write( 1 /* stdout */, arr, 10 );
- return 0;
- }
-
-
-You get this complaint ... -
- Syscall param write(buf) contains uninitialised or unaddressable byte(s) - at 0x4035E072: __libc_write - by 0x402A6E5E: __libc_start_main (libc-start.c:129) - by 0x80483B1: (within tests/badwrite) - by <bogus frame pointer> ??? - Address 0x3807E6D0 is 0 bytes inside a block of size 10 alloc'd - at 0x4004FEE6: malloc (ut_clientmalloc.c:539) - by 0x80484A0: main (tests/badwrite.c:6) - by 0x402A6E5E: __libc_start_main (libc-start.c:129) - by 0x80483B1: (within tests/badwrite) -- -
... because the program has tried to write uninitialised junk from -the malloc'd block to the standard output. - - -
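One way to make this particular complaint go away is to give the block defined contents before handing it to write (an illustrative fix, not part of the manual's test suite):

   #include <stdlib.h>
   #include <string.h>
   #include <unistd.h>

   int main ( void )
   {
      char* arr = malloc(10);
      memset(arr, 'x', 10);   /* the bytes are now initialised */
      (void) write( 1 /* stdout */, arr, 10 );
      free(arr);
      return 0;
   }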
-v):
-More than 50 errors detected. Subsequent errors
- will still be recorded, but in less detail than before.
- -
More than 300 errors detected. I'm not reporting any more.
- Final error counts may be inaccurate. Go fix your
- program!
- -
Warning: client switching stacks?
- -
Warning: client attempted to close Valgrind's logfile fd <number>
-
- --logfile-fd=<number>
- option to specify a different logfile file-descriptor number.
--
Warning: noted but unhandled ioctl <number>
- ioctl system calls, but did not modify its
- memory status info (because I have not yet got round to it).
- The call will still have gone through, but you may get spurious
- errors after this as a result of the non-update of the memory info.
--
Warning: set address range perms: large range <number>
- $PREFIX/lib/valgrind/default.supp.
-
-
-You can ask to add suppressions from another file, by specifying
---suppressions=/path/to/file.supp.
-
-
Each suppression has the following components:
-
- -
Value1,
- Value2,
- Value4 or
- Value8,
- meaning an uninitialised-value error when
- using a value of 1, 2, 4 or 8 bytes.
- Or
- Cond (or its old name, Value0),
- meaning use of an uninitialised CPU condition code. Or:
- Addr1,
- Addr2,
- Addr4 or
- Addr8, meaning an invalid address during a
- memory access of 1, 2, 4 or 8 bytes respectively. Or
- Param,
- meaning an invalid system call parameter error. Or
- Free, meaning an invalid or mismatching free.
- Or PThread, meaning any kind of complaint to do
- with the PThreads API.- -
free, __builtin_vec_delete, etc)- -
- -
-Locations may be either names of shared objects/executables or wildcards
-matching function names. They begin obj: and fun:
-respectively. Function and object names to match against may use the
-wildcard characters * and ?.
-
-A suppression only suppresses an error when the error matches all the
-details in the suppression. Here's an example:
-
- {
- __gconv_transform_ascii_internal/__mbrtowc/mbtowc
- Value4
- fun:__gconv_transform_ascii_internal
- fun:__mbr*toc
- fun:mbtowc
- }
-
-
-What it means is: suppress a use-of-uninitialised-value error, when
-the data size is 4, when it occurs in the function
-__gconv_transform_ascii_internal, when that is called
-from any function of name matching __mbr*toc,
-when that is called from
-mbtowc. It doesn't apply under any other circumstances.
-The string by which this suppression is identified to the user is
-__gconv_transform_ascii_internal/__mbrtowc/mbtowc.
-
-
Another example: -
- {
- libX11.so.6.2/libX11.so.6.2/libXaw.so.7.0
- Value4
- obj:/usr/X11R6/lib/libX11.so.6.2
- obj:/usr/X11R6/lib/libX11.so.6.2
- obj:/usr/X11R6/lib/libXaw.so.7.0
- }
-
-
-Suppress any size 4 uninitialised-value error which occurs anywhere
-in libX11.so.6.2, when called from anywhere in the same
-library, when called from anywhere in libXaw.so.7.0. The
-inexact specification of locations is regrettable, but is about all
-you can hope for, given that the X11 libraries shipped with Red Hat
-7.2 have had their symbol tables removed.
-
-
Note -- since the above two examples did not make it clear -- that
-you can freely mix the obj: and fun:
-styles of description within a single suppression record.
-
-
-
-
-For your convenience, a subset of these so-called client requests is -provided to allow you to tell Valgrind facts about the behaviour of -your program, and conversely to make queries. In particular, your -program can tell Valgrind about changes in memory range permissions -that Valgrind would not otherwise know about, and so allows clients to -get Valgrind to do arbitrary custom checks. -
-Clients need to include the header file valgrind.h to
-make this work. The macros therein have the magical property that
-they generate code in-line which Valgrind can spot. However, the code
-does nothing when not run on Valgrind, so you are not forced to run
-your program on Valgrind just because you use the macros in this file.
-Also, you are not required to link your program with any extra
-supporting libraries.
-
-A brief description of the available macros: -
VALGRIND_MAKE_NOACCESS,
- VALGRIND_MAKE_WRITABLE and
- VALGRIND_MAKE_READABLE. These mark address
- ranges as completely inaccessible, accessible but containing
- undefined data, and accessible and containing defined data,
- respectively. Subsequent errors may have their faulting
- addresses described in terms of these blocks. Returns a
- "block handle". Returns zero when not run on Valgrind.
--
VALGRIND_DISCARD: At some point you may want
- Valgrind to stop reporting errors in terms of the blocks
- defined by the previous three macros. To do this, the above
- macros return a small-integer "block handle". You can pass
- this block handle to VALGRIND_DISCARD. After
- doing so, Valgrind will no longer be able to relate
- addressing errors to the user-defined block associated with
- the handle. The permissions settings associated with the
- handle remain in place; this just affects how errors are
- reported, not whether they are reported. Returns 1 for an
- invalid handle and 0 for a valid handle (although passing
- invalid handles is harmless). Always returns 0 when not run
- on Valgrind.
--
VALGRIND_CHECK_NOACCESS,
- VALGRIND_CHECK_WRITABLE and
- VALGRIND_CHECK_READABLE: check immediately
- whether or not the given address range has the relevant
- property, and if not, print an error message. Also, for the
- convenience of the client, returns zero if the relevant
- property holds; otherwise, the returned value is the address
- of the first byte for which the property is not true.
- Always returns 0 when not run on Valgrind.
--
VALGRIND_CHECK_DEFINED: a quick and easy way
- to find out whether Valgrind thinks a particular variable
- (lvalue, to be precise) is addressable and defined. Prints
- an error message if not. Returns no value.
--
VALGRIND_MAKE_NOACCESS_STACK: a highly
- experimental feature. Similarly to
- VALGRIND_MAKE_NOACCESS, this marks an address
- range as inaccessible, so that subsequent accesses to an
- address in the range give an error. However, this macro
- does not return a block handle. Instead, all annotations
- created like this are reviewed at each client
- ret (subroutine return) instruction, and those
- which now define an address range below the client's stack
- pointer register (%esp) are automatically
- deleted.
- - In other words, this macro allows the client to tell - Valgrind about red-zones on its own stack. Valgrind - automatically discards this information when the stack - retreats past such blocks. Beware: hacky and flaky, and - probably interacts badly with the new pthread support. -
-
RUNNING_ON_VALGRIND: returns 1 if running on
- Valgrind, 0 if running on the real CPU.
--
VALGRIND_DO_LEAK_CHECK: run the memory leak detector
- right now. Returns no value. I guess this could be used to
- incrementally check for leaks between arbitrary places in the
- program's execution. Warning: not properly tested!
--
VALGRIND_DISCARD_TRANSLATIONS: discard translations
- of code in the specified address range. Useful if you are
- debugging a JITter or some other dynamic code generation system.
- After this call, attempts to execute code in the invalidated
- address range will cause valgrind to make new translations of that
- code, which is probably the semantics you want. Note that this is
- implemented naively, and involves checking all 200191 entries in
- the translation table to see if any of them overlap the specified
- address range. So try not to call it often, or performance will
- nosedive. Note that you can be clever about this: you only need
- to call it when an area which previously contained code is
- overwritten with new code. You can choose to write code into
- fresh memory, and just call this occasionally to discard large
- chunks of old code all at once.
- - Warning: minimally tested, especially for the cache simulator. -
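Tying a few of these together, here is a sketch of how a client might use the macros; the two-argument (address, length) forms are assumed from the descriptions above, and exact macro signatures may differ between Valgrind versions:

   #include <stdio.h>
   #include <stdlib.h>
   #include "valgrind.h"   /* expands to near-nothing when not on Valgrind */

   int main ( void )
   {
      char* buf = malloc(64);
      int   handle;

      if (RUNNING_ON_VALGRIND)
         printf("running on the synthetic CPU\n");

      /* Declare the last 16 bytes off-limits; later errors in this
         range will be described in terms of this block. */
      handle = VALGRIND_MAKE_NOACCESS(buf + 48, 16);

      /* ... work with the first 48 bytes only ... */

      /* Make the tail usable again and drop the block description. */
      VALGRIND_MAKE_WRITABLE(buf + 48, 16);
      VALGRIND_DISCARD(handle);

      free(buf);
      return 0;
   }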
-It works as follows: threaded apps are (dynamically) linked against
-libpthread.so. Usually this is the one installed with
-your Linux distribution. Valgrind, however, supplies its own
-libpthread.so and automatically connects your program to
-it instead.
-
-The fake libpthread.so and Valgrind cooperate to
-implement a user-space pthreads package. This approach avoids the
-horrible problems of implementing a truly
-multiprocessor version of Valgrind, but it does mean that threaded
-apps run only on one CPU, even if you have a multiprocessor machine.
-
-Valgrind schedules your threads in a round-robin fashion, with all -threads having equal priority. It switches threads every 50000 basic -blocks (typically around 300000 x86 instructions), which means you'll -get a much finer interleaving of thread executions than when run -natively. This in itself may cause your program to behave differently -if you have some kind of concurrency, critical race, locking, or -similar, bugs. -
-The current (valgrind-1.0 release) state of pthread support is as -follows: -
pthread_once, reader-writer locks, semaphores,
- cleanup stacks, cancellation and thread detaching currently work.
- Various attribute-like calls are handled but ignored; you get a
- warning message.
--
write read nanosleep
- sleep select poll
- recvmsg and
- accept.
--
pthread_sigmask, pthread_kill,
- sigwait and raise are now implemented.
- Each thread has its own signal mask, as POSIX requires.
- It's a bit kludgey -- there's a system-wide pending signal set,
- rather than one for each thread. But hey.
-./configure,
-make, make install mechanism, and I have
-attempted to ensure that it works on machines with kernel 2.2 or 2.4
-and glibc 2.1.X or 2.2.X. I don't think there is much else to say.
-There are no options apart from the usual --prefix that
-you should give to ./configure.
-
-
-The configure script tests the version of the X server
-currently indicated by the current $DISPLAY. This is a
-known bug. The intention was to detect the version of the current
-XFree86 client libraries, so that correct suppressions could be
-selected for them, but instead the test checks the server version.
-This is just plain wrong.
-
-
-If you are building a binary package of Valgrind for distribution,
-please read README_PACKAGERS. It contains some important
-information.
-
-
-Apart from that there is no excitement here. Let me know if you have -build problems. - - - - -
See Section 4 for the known limitations of -Valgrind, and for a list of programs which are known not to work on -it. - -
The translator/instrumentor has a lot of assertions in it. They -are permanently enabled, and I have no plans to disable them. If one -of these breaks, please mail me! - -
If you get an assertion failure on the expression
-chunkSane(ch) in vg_free() in
-vg_malloc.c, this may have happened because your program
-wrote off the end of a malloc'd block, or before its beginning.
-Valgrind should have emitted a proper message to that effect before
-dying in this way. This is a known problem which I should fix.
-
- -
Each byte in the system therefore has 8 V bits which follow -it wherever it goes. For example, when the CPU loads a word-size item -(4 bytes) from memory, it also loads the corresponding 32 V bits from -a bitmap which stores the V bits for the process' entire address -space. If the CPU should later write the whole or some part of that -value to memory at a different address, the relevant V bits will be -stored back in the V-bit bitmap. - -
In short, each bit in the system has an associated V bit, which
-follows it around everywhere, even inside the CPU. Yes, the CPU's
-(integer and %eflags) registers have their own V bit
-vectors.
-
-
Copying values around does not cause Valgrind to check for, or -report on, errors. However, when a value is used in a way which might -conceivably affect the outcome of your program's computation, the -associated V bits are immediately checked. If any of these indicate -that the value is undefined, an error is reported. - -
Here's an (admittedly nonsensical) example: -
- int i, j;
- int a[10], b[10];
- for (i = 0; i < 10; i++) {
- j = a[i];
- b[i] = j;
- }
-
-
-Valgrind emits no complaints about this, since it merely copies
-uninitialised values from a[] into b[], and
-doesn't use them in any way. However, if the loop is changed to
-
- for (i = 0; i < 10; i++) {
- j += a[i];
- }
- if (j == 77)
- printf("hello there\n");
-
-then Valgrind will complain, at the if, that the
-condition depends on uninitialised values.
-
-Most low level operations, such as adds, cause Valgrind to -use the V bits for the operands to calculate the V bits for the -result. Even if the result is partially or wholly undefined, -it does not complain. - -
Checks on definedness only occur in two places: when a value is -used to generate a memory address, and where a control flow decision -needs to be made. Also, when a system call is detected, Valgrind -checks the definedness of parameters as required. - -
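For instance, an uninitialised value used to form an address is checked at that point; a small, deliberately broken example:

   int main ( void )
   {
      int a[10];
      int i;        /* never initialised */
      a[i] = 42;    /* i is used to compute an address: checked here */
      return 0;
   }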
If a check should detect undefinedness, an error message is -issued. The resulting value is subsequently regarded as well-defined. -To do otherwise would give long chains of error messages. In effect, -we say that undefined values are non-infectious. - -
This sounds overcomplicated. Why not just check all reads from -memory, and complain if an undefined value is loaded into a CPU register? -Well, that doesn't work well, because perfectly legitimate C programs routinely -copy uninitialised values around in memory, and we don't want endless complaints -about that. Here's the canonical example. Consider a struct -like this: -
- struct S { int x; char c; };
- struct S s1, s2;
- s1.x = 42;
- s1.c = 'z';
- s2 = s1;
-
-
-The question to ask is: how large is struct S, in
-bytes? An int is 4 bytes and a char one byte, so perhaps a struct S
-occupies 5 bytes? Wrong. All (non-toy) compilers I know of will
-round the size of struct S up to a whole number of words,
-in this case 8 bytes. Not doing this forces compilers to generate
-truly appalling code for subscripting arrays of struct
-S's.
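If you want to see the padding for yourself, a trivial check (not one of the manual's examples) is:

   #include <stdio.h>

   struct S { int x; char c; };

   int main ( void )
   {
      /* Typically prints 8 on x86: 5 bytes of data plus 3 of padding. */
      printf("sizeof(struct S) = %lu\n", (unsigned long)sizeof(struct S));
      return 0;
   }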
-
-
So s1 occupies 8 bytes, yet only 5 of them will be initialised.
-For the assignment s2 = s1, gcc generates code to copy
-all 8 bytes wholesale into s2 without regard for their
-meaning. If Valgrind simply checked values as they came out of
-memory, it would yelp every time a structure assignment like this
-happened. So the more complicated semantics described above is
-necessary. This allows gcc to copy s1 into
-s2 any way it likes, and a warning will only be emitted
-if the uninitialised values are later used.
-
-
One final twist to this story. The above scheme allows garbage to -pass through the CPU's integer registers without complaint. It does -this by giving the integer registers V tags, passing these around in -the expected way. This is complicated and computationally expensive to -do, but is necessary. Valgrind is more simplistic about -floating-point loads and stores. In particular, V bits for data read -as a result of floating-point loads are checked at the load -instruction. So if your program uses the floating-point registers to -do memory-to-memory copies, you will get complaints about -uninitialised values. Fortunately, I have not yet encountered a -program which (ab)uses the floating-point registers in this way. - -
As described above, every bit in memory or in the CPU has an -associated valid-value (V) bit. In addition, all bytes in memory, but -not in the CPU, have an associated valid-address (A) bit. This -indicates whether or not the program can legitimately read or write -that location. It does not give any indication of the validity of the -data at that location -- that's the job of the V bits -- only whether -or not the location may be accessed. - -
Every time your program reads or writes memory, Valgrind checks the -A bits associated with the address. If any of them indicate an -invalid address, an error is emitted. Note that the reads and writes -themselves do not change the A bits, only consult them. - -
So how do the A bits get set/cleared? Like this: - -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
This apparently strange choice reduces the amount of confusing - information presented to the user. It avoids the - unpleasant phenomenon in which memory is read from a place which - is both unaddressable and contains invalid values, and, as a - result, you get not only an invalid-address (read/write) error, - but also a potentially large set of uninitialised-value errors, - one for every time the value is used. -
- There is a hazy boundary case to do with multi-byte loads from
- addresses which are partially valid and partially invalid. See
- the description of the flag --partial-loads-ok for details.
-
- -
- -
- -
- -
Under the hood, dealing with signals is a real pain, and Valgrind's -simulation leaves much to be desired. If your program does -way-strange stuff with signals, bad things may happen. If so, let me -know. I don't promise to fix it, but I'd at least like to be aware of -it. - - - -
For each such block, Valgrind scans the entire address space of the -process, looking for pointers to the block. One of three situations -may result: - -
- A pointer to the start of the block is found. The block is still reachable, and is only reported if you use the --show-reachable=yes option.
- Only a pointer to the interior of the block is found. Such a block is a likely leak, since the program no longer holds a pointer it could legitimately pass to free.
- No pointer to the block is found at all. The block has definitely been leaked, since the program can never free it.
The precise area of memory in which Valgrind searches for pointers -is: all naturally-aligned 4-byte words for which all A bits indicate -addressability and all V bits indicate that the stored value is -actually valid. - -
Valgrind will run x86-GNU/Linux ELF dynamically linked binaries, on -a kernel 2.2.X or 2.4.X system, subject to the following constraints: - -
- -
libpthread.so, so that Valgrind can
- substitute its own implementation at program startup time. If
- you're statically linked against it, things will fail
- badly.- -
- -
- -
- -
- -
- -
- -
- -
-
__pthread_clock_gettime and
- __pthread_clock_settime. This appears to be due to
- /lib/librt-2.2.5.so needing them. Unfortunately I
- do not understand enough about this problem to fix it properly,
- and I can't reproduce it on my test RedHat 7.3 system. Please
- mail me if you have more information / understanding. -
-
-fno-builtin-strlen in
- the meantime. Or use an earlier gcc.-
The dynamic linker allows each .so in the process image to have an -initialisation function which is run before main(). It also allows -each .so to have a finalisation function run after main() exits. - -
When valgrind.so's initialisation function is called by the dynamic -linker, the synthetic CPU starts up. The real CPU remains locked -in valgrind.so for the entire rest of the program, but the synthetic -CPU returns from the initialisation function. Startup of the program -now continues as usual -- the dynamic linker calls all the other .so's -initialisation routines, and eventually runs main(). This all runs on -the synthetic CPU, not the real one, but the client program cannot -tell the difference. - -
Eventually main() exits, so the synthetic CPU calls valgrind.so's -finalisation function. Valgrind detects this, and uses it as its cue -to exit. It prints summaries of all errors detected, possibly checks -for memory leaks, and then exits the finalisation routine, but now on -the real CPU. The synthetic CPU has now lost control -- permanently --- so the program exits back to the OS on the real CPU, just as it -would have done anyway. - -
On entry, Valgrind switches stacks, so it runs on its own stack. -On exit, it switches back. This means that the client program -continues to run on its own stack, so we can switch back and forth -between running it on the simulated and real CPUs without difficulty. -This was an important design decision, because it makes it easy (well, -significantly less difficult) to debug the synthetic CPU. - - - -
Valgrind no longer directly supports detection of self-modifying -code. Such checking is expensive, and in practice (fortunately) -almost no applications need it. However, to help people who are -debugging dynamic code generation systems, there is a Client Request -(basically a macro you can put in your program) which directs Valgrind -to discard translations in a given address range. So Valgrind can -still work in this situation provided the client tells it when -code has become out-of-date and needs to be retranslated. - -
The JITter translates basic blocks -- blocks of straight-line code --- as single entities. To minimise the considerable difficulties of -dealing with the x86 instruction set, x86 instructions are first -translated to a RISC-like intermediate code, similar to sparc code, -but with an infinite number of virtual integer registers. Initially -each insn is translated separately, and there is no attempt at -instrumentation. - -
The intermediate code is improved, mostly so as to try and cache -the simulated machine's registers in the real machine's registers over -several simulated instructions. This is often very effective. Also, -we try to remove redundant updates of the simulated machine's -condition-code register. - -
The intermediate code is then instrumented, giving more -intermediate code. There are a few extra intermediate-code operations -to support instrumentation; it is all refreshingly simple. After -instrumentation there is a cleanup pass to remove redundant value -checks. - -
This gives instrumented intermediate code which mentions arbitrary -numbers of virtual registers. A linear-scan register allocator is -used to assign real registers and possibly generate spill code. All -of this is still phrased in terms of the intermediate code. This -machinery is inspired by the work of Reuben Thomas (MITE). - -
Then, and only then, is the final x86 code emitted. The -intermediate code is carefully designed so that x86 code can be -generated from it without need for spare registers or other -inconveniences. - -
The translations are managed using a traditional LRU-based caching -scheme. The translation cache has a default size of about 14MB. - - - -
When such a signal arrives, Valgrind's own handler catches it, and -notes the fact. At a convenient safe point in execution, Valgrind -builds a signal delivery frame on the client's stack and runs its -handler. If the handler longjmp()s, there is nothing more to be said. -If the handler returns, Valgrind notices this, zaps the delivery -frame, and carries on where it left off before delivering the signal. - -
The purpose of this nonsense is that setting signal handlers -essentially amounts to giving callback addresses to the Linux kernel. -We can't allow this to happen, because if it did, signal handlers -would run on the real CPU, not the simulated one. This means the -checking machinery would not operate during the handler run, and, -worse, memory permissions maps would not be updated, which could cause -spurious error reports once the handler had returned. - -
An even worse thing would happen if the signal handler longjmp'd -rather than returned: Valgrind would completely lose control of the -client program. - -
Upshot: we can't allow the client to install signal handlers -directly. Instead, Valgrind must catch, on behalf of the client, any -signal the client asks to catch, and must deliver it to the client on -the simulated CPU, not the real one. This involves considerable -gruesome fakery; see vg_signals.c for details. -
- -
-sewardj@phoenix:~/newmat10$ -~/Valgrind-6/valgrind -v ./bogon -==25832== Valgrind 0.10, a memory error detector for x86 RedHat 7.1. -==25832== Copyright (C) 2000-2001, and GNU GPL'd, by Julian Seward. -==25832== Startup, with flags: -==25832== --suppressions=/home/sewardj/Valgrind/redhat71.supp -==25832== reading syms from /lib/ld-linux.so.2 -==25832== reading syms from /lib/libc.so.6 -==25832== reading syms from /mnt/pima/jrs/Inst/lib/libgcc_s.so.0 -==25832== reading syms from /lib/libm.so.6 -==25832== reading syms from /mnt/pima/jrs/Inst/lib/libstdc++.so.3 -==25832== reading syms from /home/sewardj/Valgrind/valgrind.so -==25832== reading syms from /proc/self/exe -==25832== loaded 5950 symbols, 142333 line number locations -==25832== -==25832== Invalid read of size 4 -==25832== at 0x8048724: _ZN10BandMatrix6ReSizeEiii (bogon.cpp:45) -==25832== by 0x80487AF: main (bogon.cpp:66) -==25832== by 0x40371E5E: __libc_start_main (libc-start.c:129) -==25832== by 0x80485D1: (within /home/sewardj/newmat10/bogon) -==25832== Address 0xBFFFF74C is not stack'd, malloc'd or free'd -==25832== -==25832== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0) -==25832== malloc/free: in use at exit: 0 bytes in 0 blocks. -==25832== malloc/free: 0 allocs, 0 frees, 0 bytes allocated. -==25832== For a detailed leak analysis, rerun with: --leak-check=yes -==25832== -==25832== exiting, did 1881 basic blocks, 0 misses. -==25832== 223 translations, 3626 bytes in, 56801 bytes out. --
The GCC folks fixed this about a week before gcc-3.0 shipped. -
- -Also, since one instruction cache read is performed per instruction executed, -you can find out how many instructions are executed per line, which can be -useful for traditional profiling and test coverage.
- -Any feedback, bug-fixes, suggestions, etc, welcome. - - -
-g flag). But by contrast with normal Valgrind use, you
-probably do want to turn optimisation on, since you should profile your
-program as it will be normally run.
-
-The two steps are:
-cachegrind in front of the
- normal command line invocation. When the program finishes,
- Valgrind will print summary cache statistics. It also collects
- line-by-line information in a file
- cachegrind.out.pid, where pid
- is the program's process id.
- - This step should be done every time you want to collect - information about a new program, a changed program, or about the - same program with different input. -
-
--auto=yes option. You can annotate C/C++
- files or assembly language files equally easily.
- - This step can be performed as many times as you like for each - Step 2. You may want to do multiple annotations showing - different information each time.
-
- - -
- -The more specific characteristics of the simulation are as follows. - -
- -
- -
-
--I1, --D1 and --L2 options.- -Other noteworthy behaviour: - -
inc and
- dec) are counted as doing just a read, ie. a single data
- reference. This may seem strange, but since the write can never cause a
- miss (the read guarantees the block is in the cache) it's not very
- interesting.- - Thus it measures not the number of times the data cache is accessed, but - the number of times a data cache miss could occur.
-
vg_cachesim_I1.c, vg_cachesim_D1.c,
-vg_cachesim_L2.c and vg_cachesim_gen.c. We'd be
-interested to hear from anyone who does.
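-
-To give a feel for what such a simulator involves, here is a deliberately
-simplified, self-contained C sketch of a set-associative cache model with
-LRU replacement. It is an illustration only, not Valgrind's actual
-vg_cachesim_*.c code, and the sizes below are assumptions made purely for
-the example:
-
-   #include <stdio.h>
-
-   #define LINE_SIZE 64                  /* bytes per line (assumed)      */
-   #define ASSOC     2                   /* ways per set   (assumed)      */
-   #define N_SETS    512                 /* 64KB / (64 * 2)               */
-
-   typedef unsigned int UInt;
-
-   static UInt tags [N_SETS][ASSOC];     /* most-recently-used way first  */
-   static int  valid[N_SETS][ASSOC];
-   static unsigned long accesses, misses;
-
-   /* Simulate one reference; returns 1 on a miss, 0 on a hit. */
-   static int cache_ref(UInt addr)
-   {
-      UInt set = (addr / LINE_SIZE) % N_SETS;
-      UInt tag = (addr / LINE_SIZE) / N_SETS;
-      int  i, j;
-      accesses++;
-      for (i = 0; i < ASSOC; i++) {
-         if (valid[set][i] && tags[set][i] == tag) {
-            /* hit: promote this way to the most-recently-used slot */
-            for (j = i; j > 0; j--) tags[set][j] = tags[set][j-1];
-            tags[set][0] = tag;
-            return 0;
-         }
-      }
-      /* miss: evict the least-recently-used way (the last slot) */
-      misses++;
-      for (j = ASSOC-1; j > 0; j--) {
-         tags[set][j]  = tags[set][j-1];
-         valid[set][j] = valid[set][j-1];
-      }
-      tags[set][0]  = tag;
-      valid[set][0] = 1;
-      return 1;
-   }
-
-   int main(void)
-   {
-      cache_ref(0x1000); cache_ref(0x1004);   /* same line: hit 2nd time */
-      printf("%lu accesses, %lu misses\n", accesses, misses);
-      return 0;
-   }
-
-A real simulator must also cope with references which straddle two lines,
-and with separate read/write accounting, both of which the sketch above
-ignores.
-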
-
-
---cachesim=yes
-option to the valgrind shell script. Alternatively, it
-is probably more convenient to use the cachegrind script.
-Either way automatically turns off Valgrind's memory checking functions,
-since the cache simulation is slow enough already, and you probably
-don't want to do both at once.
-
-To gather cache profiling information about the program ls
--l, type:
-
-
cachegrind ls -l
-
-The program will execute (slowly). Upon completion, summary statistics
-that look like this will be printed:
-
--==31751== I refs: 27,742,716 -==31751== I1 misses: 276 -==31751== L2 misses: 275 -==31751== I1 miss rate: 0.0% -==31751== L2i miss rate: 0.0% -==31751== -==31751== D refs: 15,430,290 (10,955,517 rd + 4,474,773 wr) -==31751== D1 misses: 41,185 ( 21,905 rd + 19,280 wr) -==31751== L2 misses: 23,085 ( 3,987 rd + 19,098 wr) -==31751== D1 miss rate: 0.2% ( 0.1% + 0.4%) -==31751== L2d miss rate: 0.1% ( 0.0% + 0.4%) -==31751== -==31751== L2 misses: 23,360 ( 4,262 rd + 19,098 wr) -==31751== L2 miss rate: 0.0% ( 0.0% + 0.4%) -- -Cache accesses for instruction fetches are summarised first, giving the -number of fetches made (this is the number of instructions executed, which -can be useful to know in its own right), the number of I1 misses, and the -number of L2 instruction (
L2i) misses.
-
-Cache accesses for data follow. The information is similar to that of the
-instruction fetches, except that the values are also shown split between reads
-and writes (note each row's rd and wr values add up
-to the row's total).
- -Combined instruction and data figures for the L2 cache follow that.
- - -
cachegrind.out.pid. This file is human-readable, but is
-best interpreted by the accompanying program vg_annotate,
-described in the next section.
-
-Things to note about the cachegrind.out.pid file:
-
valgrind --cachesim=yes or
- cachegrind is run, and will overwrite any existing
- cachegrind.out.pid in the current directory (but
- that won't happen very often because it takes some time for process ids
- to be recycled).-
ls -l generates a file of about
- 350KB. Browsing a few files and web pages with a Konqueror
- built with full debugging information generates a file
- of around 15 MB.cachegrind.out (i.e. no .pid suffix).
-The suffix serves two purposes. Firstly, it means you don't have to rename old
-log files that you don't want to overwrite. Secondly, and more importantly,
-it allows correct profiling with the --trace-children=yes option
-of programs that spawn child processes.
-
-
-- -The interesting cache-simulation specific options are: - -
--I1=<size>,<associativity>,<line_size>--D1=<size>,<associativity>,<line_size>--L2=<size>,<associativity>,<line_size>- [default: uses CPUID for automagic cache configuration]
-
- Manually specifies the I1/D1/L2 cache configuration, where
- size and line_size are measured in bytes. The
- three items must be comma-separated, but with no spaces, eg:
-
-
cachegrind --I1=65536,2,64
-
- You can specify one, two or three of the I1/D1/L2 caches. Any level not
- manually specified will be simulated using the configuration found in the
- normal way (via the CPUID instruction, or failing that, via defaults).
-vg_annotate, it is worth widening your
-window to be at least 120 characters wide if possible, as the output
-lines can be quite long.
-
-To get a function-by-function summary, run vg_annotate
---pid in a directory containing a
-cachegrind.out.pid file. The --pid
-is required so that vg_annotate knows which log file to use when
-several are present.
-
-The output looks like this: - -
--------------------------------------------------------------------------------- -I1 cache: 65536 B, 64 B, 2-way associative -D1 cache: 65536 B, 64 B, 2-way associative -L2 cache: 262144 B, 64 B, 8-way associative -Command: concord vg_to_ucode.c -Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw -Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw -Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw -Threshold: 99% -Chosen for annotation: -Auto-annotation: on - --------------------------------------------------------------------------------- -Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw --------------------------------------------------------------------------------- -27,742,716 276 275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS - --------------------------------------------------------------------------------- -Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function --------------------------------------------------------------------------------- -8,821,482 5 5 2,242,702 1,621 73 1,794,230 0 0 getc.c:_IO_getc -5,222,023 4 4 2,276,334 16 12 875,959 1 1 concord.c:get_word -2,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp -2,521,927 2 2 591,215 0 0 179,398 0 0 concord.c:hash -2,242,740 2 2 1,046,612 568 22 448,548 0 0 ctype.c:tolower -1,496,937 4 4 630,874 9,000 1,400 279,388 0 0 concord.c:insert - 897,991 51 51 897,831 95 30 62 1 1 ???:??? - 598,068 1 1 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__flockfile - 598,068 0 0 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__funlockfile - 598,024 4 4 213,580 35 16 149,506 0 0 vg_clientmalloc.c:malloc - 446,587 1 1 215,973 2,167 430 129,948 14,057 13,957 concord.c:add_existing - 341,760 2 2 128,160 0 0 128,160 0 0 vg_clientmalloc.c:vg_trap_here_WRAPPER - 320,782 4 4 150,711 276 0 56,027 53 53 concord.c:init_hash_table - 298,998 1 1 106,785 0 0 64,071 1 1 concord.c:create - 149,518 0 0 149,516 0 0 1 0 0 ???:tolower@@GLIBC_2.0 - 149,518 0 0 149,516 0 0 1 0 0 ???:fgetc@@GLIBC_2.0 - 95,983 4 4 38,031 0 0 34,409 3,152 3,150 concord.c:new_word_node - 85,440 0 0 42,720 0 0 21,360 0 0 vg_clientmalloc.c:vg_bogus_epilogue -- -First up is a summary of the annotation options: - -
- -
- -
-
Ir : I cache reads (ie. instructions executed)I1mr: I1 cache read missesI2mr: L2 cache instruction read missesDr : D cache reads (ie. memory reads)D1mr: D1 cache read missesD2mr: L2 cache data read missesDw : D cache writes (ie. memory writes)D1mw: D1 cache write missesD2mw: L2 cache data write misses
- Note that D1 total misses is given by D1mr +
- D1mw, and that L2 total misses is given by
- I2mr + D2mr + D2mw.
- -
--show option.- -
Ir counts to lowest. If two functions have identical
- Ir counts, they will then be sorted by I1mr
- counts, and so on. This order can be adjusted with the
- --sort option.
-
- Note that this dictates the order the functions appear. It is not
- the order in which the columns appear; that is dictated by the "events
- shown" line (and can be changed with the --show option).
-
- -
vg_annotate by default omits functions
- that cause very low numbers of misses to avoid drowning you in
- information. In this case, vg_annotate shows summaries for the
- functions that account for 99% of the Ir counts;
- Ir is chosen as the threshold event since it is the
- primary sort event. The threshold can be adjusted with the
- --threshold option.- -
- -
--auto=yes option. In this case no.-
cachegrind.
-
-Then follows function-by-function statistics. Each function is
-identified by a file_name:function_name pair. If a column
-contains only a dot it means the function never performs
-that event (eg. the third row shows that strcmp()
-contains no instructions that write to memory). The name
-??? is used if the file name and/or function name
-could not be determined from debugging information. If most of the
-entries have the form ???:??? the program probably wasn't
-compiled with -g. If any code was invalidated (either due to
-self-modifying code or unloading of shared objects) its counts are aggregated
-into a single cost centre written as (discarded):(discarded).
- -It is worth noting that functions will come from three types of source files: -
concord.c in this example).getc.c)vg_clientmalloc.c:malloc). These are recognisable because
- the filename begins with vg_, and is probably one of
- vg_main.c, vg_clientmalloc.c or
- vg_mylibc.c.
- --auto=yes option. To do it
-manually, just specify the filenames as arguments to
-vg_annotate. For example, the output from running
-vg_annotate concord.c for our example produces the same
-output as above followed by an annotated version of
-concord.c, a section of which looks like:
-
-
---------------------------------------------------------------------------------
--- User-annotated source: concord.c
---------------------------------------------------------------------------------
-Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
-
-[snip]
-
- . . . . . . . . . void init_hash_table(char *file_name, Word_Node *table[])
- 3 1 1 . . . 1 0 0 {
- . . . . . . . . . FILE *file_ptr;
- . . . . . . . . . Word_Info *data;
- 1 0 0 . . . 1 1 1 int line = 1, i;
- . . . . . . . . .
- 5 0 0 . . . 3 0 0 data = (Word_Info *) create(sizeof(Word_Info));
- . . . . . . . . .
- 4,991 0 0 1,995 0 0 998 0 0 for (i = 0; i < TABLE_SIZE; i++)
- 3,988 1 1 1,994 0 0 997 53 52 table[i] = NULL;
- . . . . . . . . .
- . . . . . . . . . /* Open file, check it. */
- 6 0 0 1 0 0 4 0 0 file_ptr = fopen(file_name, "r");
- 2 0 0 1 0 0 . . . if (!(file_ptr)) {
- . . . . . . . . . fprintf(stderr, "Couldn't open '%s'.\n", file_name);
- 1 1 1 . . . . . . exit(EXIT_FAILURE);
- . . . . . . . . . }
- . . . . . . . . .
- 165,062 1 1 73,360 0 0 91,700 0 0 while ((line = get_word(data, line, file_ptr)) != EOF)
- 146,712 0 0 73,356 0 0 73,356 0 0 insert(data->word, data->line, table);
- . . . . . . . . .
- 4 0 0 1 0 0 2 0 0 free(data);
- 4 0 0 1 0 0 2 0 0 fclose(file_ptr);
- 3 0 0 2 0 0 . . . }
-
-
-(Although column widths are automatically minimised, a wide terminal is clearly
-useful.)
-
-Each source file is clearly marked (User-annotated source) as
-having been chosen manually for annotation. If the file was found in one of
-the directories specified with the -I/--include
-option, the directory and file are both given.
- -Each line is annotated with its event counts. Events not applicable for a line -are represented by a `.'; this is useful for distinguishing between an event -which cannot happen, and one which can but did not.
- -Sometimes only a small section of a source file is executed. To minimise -uninteresting output, Valgrind only shows annotated lines and lines within a -small distance of annotated lines. Gaps are marked with the line numbers so -you know which part of a file the shown code comes from, eg: - -
-(figures and code for line 704) --- line 704 ---------------------------------------- --- line 878 ---------------------------------------- -(figures and code for line 878) -- -The amount of context to show around annotated lines is controlled by the -
--context option.
-
-To get automatic annotation, run vg_annotate --auto=yes.
-vg_annotate will automatically annotate every source file it can find that is
-mentioned in the function-by-function summary. Therefore, the files chosen for
-auto-annotation are affected by the --sort and
---threshold options. Each source file is clearly marked
-(Auto-annotated source) as being chosen automatically. Any files
-that could not be found are mentioned at the end of the output, eg:
-
-
--------------------------------------------------------------------------------- -The following files chosen for auto-annotation could not be found: --------------------------------------------------------------------------------- - getc.c - ctype.c - ../sysdeps/generic/lockfile.c -- -This is quite common for library files, since libraries are usually compiled -with debugging information, but the source files are often not present on a -system. If a file is chosen for annotation both manually and -automatically, it is marked as
User-annotated source.
-
-Use the -I/--include option to tell Valgrind where to look for
-source files if the filenames found from the debugging information aren't
-specific enough.
-
-Beware that vg_annotate can take some time to digest large
-cachegrind.out.pid files, e.g. 30 seconds or more. Also
-beware that auto-annotation can produce a lot of output if your program is
-large!
-
-
-
-
-To do this, you just need to assemble your .s files with
-assembler-level debug information. gcc doesn't do this, but you can
-use the GNU assembler with the --gstabs option to
-generate object files with this information, eg:
-
-
as --gstabs foo.s
-
-You can then profile and annotate source files in the same way as for C/C++
-programs.
-
-
-vg_annotate options--pid
-
- Indicates which cachegrind.out.pid file to read.
- Not actually an option -- it is required.
-
-
-h, --help-
-v, --version- - Help and version, as usual.
--sort=A,B,C [default: order in
- cachegrind.out.pid]
- Specifies the events upon which the sorting of the function-by-function
- entries will be based. Useful if you want to concentrate on eg. I cache
- misses (--sort=I1mr,I2mr), or D cache misses
- (--sort=D1mr,D2mr), or L2 misses
- (--sort=D2mr,I2mr).
- -
--show=A,B,C [default: all, using order in
- cachegrind.out.pid]
- Specifies which events to show (and the column order). Default is to use
- all present in the cachegrind.out.pid file (and use
- the order in the file).
- -
--threshold=X [default: 99%]
- Sets the threshold for the function-by-function summary. Functions are
- shown that account for more than X% of the primary sort event. If
- auto-annotating, also affects which files are annotated.
-
- Note: thresholds can be set for more than one of the events by appending
- a colon and a number to any event given to the --sort option
- (no spaces, though). E.g. if you want to see the functions that cover
- 99% of L2 read misses and 99% of L2 write misses, use this option:
-
-
--sort=D2mr:99,D2mw:99
- - -
--auto=no [default]--auto=yes - When enabled, automatically annotates every file that is mentioned in the - function-by-function summary that can be found. Also gives a list of - those that couldn't be found. - -
--context=N [default: 8]- Print N lines of context before and after each annotated line. Avoids - printing large sections of source files that were not executed. Use a - large number (eg. 10,000) to show all source lines. -
- -
-I=<dir>, --include=<dir>
- [default: empty string]- Adds a directory to the list in which to search for files. Multiple - -I/--include options can be given to add multiple directories. -
cachegrind.out.pid file. This is because the
- information in cachegrind.out.pid is only recorded
- with line numbers, so if the line numbers change at all in the source
- (eg. lines added, deleted, swapped), any annotations will be
- incorrect.- -
cachegrind.out.pid file. If this
- happens, the figures for the bogus lines are printed anyway (clearly
- marked as bogus) in case they are important.-
- 1 0 0 . . . . . . leal -12(%ebp),%eax - 1 0 0 . . . 1 0 0 movl %eax,84(%ebx) - 2 0 0 0 0 0 1 0 0 movl $1,-20(%ebp) - . . . . . . . . . .align 4,0x90 - 1 0 0 . . . . . . movl $.LnrB,%eax - 1 0 0 . . . 1 0 0 movl %eax,-16(%ebp) -- - How can the third instruction be executed twice when the others are - executed only once? As it turns out, it isn't. Here's a dump of the - executable, using
objdump -d:
-
- - 8048f25: 8d 45 f4 lea 0xfffffff4(%ebp),%eax - 8048f28: 89 43 54 mov %eax,0x54(%ebx) - 8048f2b: c7 45 ec 01 00 00 00 movl $0x1,0xffffffec(%ebp) - 8048f32: 89 f6 mov %esi,%esi - 8048f34: b8 08 8b 07 08 mov $0x8078b08,%eax - 8048f39: 89 45 f0 mov %eax,0xfffffff0(%ebp) -- - Notice the extra
mov %esi,%esi instruction. Where did this
- come from? The GNU assembler inserted it to serve as the two bytes of
- padding needed to align the movl $.LnrB,%eax instruction on
- a four-byte boundary, but pretended it didn't exist when adding debug
- information. Thus when Valgrind reads the debug info it thinks that the
- movl $0x1,0xffffffec(%ebp) instruction covers the address
- range 0x8048f2b--0x8048f33 by itself, and attributes the counts for the
- mov %esi,%esi to it.-
inline_me() is defined in
- foo.h and inlined in the functions f1(),
- f2() and f3() in bar.c, there will
- not be a foo.h:inline_me() function entry. Instead, there
- will be separate function entries for each inlining site, ie.
- foo.h:f1(), foo.h:f2() and
- foo.h:f3(). To find the total counts for
- foo.h:inline_me(), add up the counts from each entry.
-
- The reason for this is that although the debug info output by gcc
- indicates the switch from bar.c to foo.h, it
- doesn't indicate the name of the function in foo.h, so
- Valgrind keeps using the old one.
- -
/home/user/proj/proj.h and ../proj.h. In this
- case, if you use auto-annotation, the file will be annotated twice with
- the counts split between the two.-
struct
- nlist defined in a.out.h under Linux is only a 16-bit
- value. Valgrind can handle some files with more than 65,535 lines
- correctly by making some guesses to identify line number overflows. But
- some cases are beyond it, in which case you'll get a warning message
- explaining that annotations for the file might be incorrect.-
-g and some without, some
- events that take place in a file without debug info could be attributed
- to the last line of a file with debug info (whichever one gets placed
- before the non-debug-info file in the executable).-
- -Note: stabs is not an easy format to read. If you come across bizarre -annotations that look like they might be caused by a bug in the stabs reader, -please let us know.
- - -
- -
- -
- -
- -
malloc() will allocate memory in different
- ways to the standard malloc(), which could warp the results.
- - -
- -
bts, btr and btc
- will incorrectly be counted as doing a data read if both the arguments
- are registers, eg:
-
- btsl %eax, %edx
-
- This should only happen rarely.
- - -
fsave) are treated as though they only access 16 bytes.
- These instructions seem to be rare so hopefully this won't affect
- accuracy much.
- -
valgrind.so file, the size of the program being
-profiled, or even the length of its name can perturb the results. Variations
-will be small, but don't expect perfectly repeatable results if your program
-changes at all.- -While these factors mean you shouldn't trust the results to be super-accurate, -hopefully they should be close enough to be useful.
- - -
-
- 2 How to use it, and how to
- make sense of the results
- 2.1 Getting started
- 2.2 The commentary
- 2.3 Reporting of errors
- 2.4 Suppressing errors
- 2.5 Command-line flags
- 2.6 Explanation of error messages
- 2.7 Writing suppressions files
- 2.8 The Client Request mechanism
- 2.9 Support for POSIX pthreads
- 2.10 Building and installing
- 2.11 If you have problems
-
- 3 Details of the checking machinery
- 3.1 Valid-value (V) bits
- 3.2 Valid-address (A) bits
- 3.3 Putting it all together
- 3.4 Signals
- 3.5 Memory leak detection
-
- 4 Limitations
-
- 5 How it works -- a rough overview
- 5.1 Getting started
- 5.2 The translation/instrumentation engine
- 5.3 Tracking the status of memory
- 5.4 System calls
- 5.5 Signals
-
- 6 An example
-
- 8 The design and implementation of Valgrind
-
-
-
diff --git a/docs/techdocs.html b/docs/techdocs.html
deleted file mode 100644
index 2e1cc8b7e..000000000
--- a/docs/techdocs.html
+++ /dev/null
@@ -1,2524 +0,0 @@
-
-
-jseward@acm.org
-http://developer.kde.org/~sewardj
-Copyright © 2000-2002 Julian Seward
-
-Valgrind is licensed under the GNU General Public License,
-version 2
-An open-source tool for finding memory-management problems in
-x86 GNU/Linux executables.
-
- - - - -
-You may need to read this document several times, and carefully. Some -important things, I only say once. - - -
-Most of the rest of 2001 was taken up designing and implementing the -instrumentation scheme. The main difficulty, which consumed a lot -of effort, was to design a scheme which did not generate large numbers -of false uninitialised-value warnings. By late 2001 a satisfactory -scheme had been arrived at, and I started to test it on ever-larger -programs, with an eventual eye to making it work well enough so that -it was helpful to folks debugging the upcoming version 3 of KDE. I've -used KDE since before version 1.0, and wanted Valgrind to be an -indirect contribution to the KDE 3 development effort. At the start of -Feb 02 the kde-core-devel crew started using it, and gave a huge -amount of helpful feedback and patches in the space of three weeks. -Snapshot 20020306 is the result.
-In the best Unix tradition, or perhaps in the spirit of Fred Brooks'
-depressing-but-completely-accurate epitaph "build one to throw away;
-you will anyway", much of Valgrind is a second or third rendition of
-the initial idea. The instrumentation machinery
-(vg_translate.c, vg_memory.c) and core CPU
-simulation (vg_to_ucode.c, vg_from_ucode.c)
-have had three redesigns and rewrites; the register allocator,
-low-level memory manager (vg_malloc2.c) and symbol table
-reader (vg_symtab2.c) are on the second rewrite. In a
-sense, this document serves to record some of the knowledge gained as
-a result.
-
-
-
valgrind.so, and also a dummy one,
-valgrinq.so, of which more later. The
-valgrind shell script adds valgrind.so to
-the LD_PRELOAD list of extra libraries to be
-loaded with any dynamically linked library. This is a standard trick,
-one which I assume the LD_PRELOAD mechanism was developed
-to support.
-
-
-valgrind.so
-is linked with the -z initfirst flag, which requests that
-its initialisation code is run before that of any other object in the
-executable image. When this happens, valgrind gains control. The
-real CPU becomes "trapped" in valgrind.so and the
-translations it generates. The synthetic CPU provided by Valgrind
-does, however, return from this initialisation function. So the
-normal startup actions, orchestrated by the dynamic linker
-ld.so, continue as usual, except on the synthetic CPU,
-not the real one. Eventually main is run and returns,
-and then the finalisation code of the shared objects is run,
-presumably in inverse order to which they were initialised. Remember,
-this is still all happening on the simulated CPU. Eventually
-valgrind.so's own finalisation code is called. It spots
-this event, shuts down the simulated CPU, prints any error summaries
-and/or does leak detection, and returns from the initialisation code
-on the real CPU. At this point, in effect the real and synthetic CPUs
-have merged back into one, Valgrind has lost control of the program,
-and the program finally exit()s back to the kernel in the
-usual way.
-
-
-The normal course of activity, one Valgrind has started up, is as
-follows. Valgrind never runs any part of your program (usually
-referred to as the "client"), not a single byte of it, directly.
-Instead it uses function VG_(translate) to translate
-basic blocks (BBs, straight-line sequences of code) into instrumented
-translations, and those are run instead. The translations are stored
-in the translation cache (TC), vg_tc, with the
-translation table (TT), vg_tt supplying the
-original-to-translation code address mapping. Auxiliary array
-VG_(tt_fast) is used as a direct-map cache for fast
-lookups in TT; it usually achieves a hit rate of around 98% and
-facilitates an orig-to-trans lookup in 4 x86 insns, which is not bad.
-
-
-Function VG_(dispatch) in vg_dispatch.S is
-the heart of the JIT dispatcher. Once a translated code address has
-been found, it is executed simply by an x86 call
-to the translation. At the end of the translation, the next
-original code addr is loaded into %eax, and the
-translation then does a ret, taking it back to the
-dispatch loop, with, interestingly, zero branch mispredictions.
-The address requested in %eax is looked up first in
-VG_(tt_fast), and, if not found, by calling C helper
-VG_(search_transtab). If there is still no translation
-available, VG_(dispatch) exits back to the top-level
-C dispatcher VG_(toploop), which arranges for
-VG_(translate) to make a new translation. All fairly
-unsurprising, really. There are various complexities described below.
-
-
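-
-The fast-path lookup can be pictured like this (a rough C rendering only;
-the real fast path is assembly in vg_dispatch.S, and the table shape, size
-and helper signatures here are guesses for illustration, not the real
-declarations):
-
-   typedef unsigned long Addr;
-
-   typedef struct { Addr orig; Addr trans; } FastEntry;  /* assumed shape  */
-
-   #define N_FAST 8192                  /* size is an assumption           */
-   static FastEntry tt_fast[N_FAST];    /* plays the role of VG_(tt_fast)  */
-
-   /* stand-ins for VG_(search_transtab) and VG_(translate) */
-   static Addr search_transtab(Addr orig) { (void)orig; return 0; }
-   static Addr make_translation(Addr orig) { (void)orig; return 0; }
-
-   static Addr find_translation(Addr orig)
-   {
-      /* 1. direct-mapped fast cache: one hash, one compare */
-      FastEntry* e = &tt_fast[(orig >> 2) % N_FAST];
-      if (e->orig == orig && e->trans != 0)
-         return e->trans;
-
-      /* 2. miss in the fast cache: search the full translation table */
-      Addr trans = search_transtab(orig);
-
-      /* 3. still no translation: have a new one made */
-      if (trans == 0)
-         trans = make_translation(orig);
-
-      /* refill the fast cache slot before returning */
-      e->orig  = orig;
-      e->trans = trans;
-      return trans;
-   }
-
-   int main(void) { find_translation(0x8048000UL); return 0; }
-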
-The translator, orchestrated by VG_(translate), is
-complicated but entirely self-contained. It is described in great
-detail in subsequent sections. Translations are stored in TC, with TT
-tracking administrative information. The translations are subject to
-an approximate LRU-based management scheme. With the current
-settings, the TC can hold at most about 15MB of translations, and LRU
-passes prune it to about 13.5MB. Given that the
-orig-to-translation expansion ratio is about 13:1 to 14:1, this means
-TC holds translations for more or less a megabyte of original code,
-which generally comes to about 70000 basic blocks for C++ compiled
-with optimisation on. Generating new translations is expensive, so it
-is worth having a large TC to minimise the (capacity) miss rate.
-
-
-The dispatcher, VG_(dispatch), receives hints from
-the translations which allow it to cheaply spot all control
-transfers corresponding to x86 call and ret
-instructions. It has to do this in order to spot some special events:
-
VG_(shutdown). This is Valgrind's cue to
- exit. NOTE: actually this is done a different way; it should be
- cleaned up.
--
VG_(signalreturn_bogusRA). The signal simulator
- needs to know when a signal handler is returning, so we spot
- jumps (returns) to this address.
--
vg_trap_here. All malloc,
- free, etc calls that the client program makes are
- eventually routed to a call to vg_trap_here,
- and Valgrind does its own special thing with these calls.
- In effect this provides a trapdoor, by which Valgrind can
- intercept certain calls on the simulated CPU, run the call as it
- sees fit itself (on the real CPU), and return the result to
- the simulated CPU, quite transparently to the client program.
-malloc,
-free, etc,
-calls, so that it can store additional information. Each block
-malloc'd by the client gives rise to a shadow block
-in which Valgrind stores the call stack at the time of the
-malloc
-call. When the client calls free, Valgrind tries to
-find the shadow block corresponding to the address passed to
-free, and emits an error message if none can be found.
-If it is found, the block is placed on the freed blocks queue
-vg_freed_list, it is marked as inaccessible, and
-its shadow block now records the call stack at the time of the
-free call. Keeping free'd blocks in
-this queue allows Valgrind to spot all (presumably invalid) accesses
-to them. However, once the volume of blocks in the free queue
-exceeds VG_(clo_freelist_vol), blocks are finally
-removed from the queue.
-
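-
-In outline, that bookkeeping might look something like the following in C
-(an illustrative sketch; the field names, the fixed-depth stack trace and
-the limit value are assumptions, not the real vg_clientmalloc.c layout):
-
-   #include <stddef.h>
-
-   #define TRACE_DEPTH 4
-
-   typedef struct ShadowBlock {
-      void*  payload;                           /* address given to client */
-      size_t size;
-      unsigned long where_alloced[TRACE_DEPTH]; /* stack at malloc time    */
-      unsigned long where_freed  [TRACE_DEPTH]; /* filled in at free time  */
-      struct ShadowBlock* next;
-   } ShadowBlock;
-
-   static ShadowBlock* freed_head = NULL;       /* oldest freed block      */
-   static ShadowBlock* freed_tail = NULL;       /* newest freed block      */
-   static size_t freed_vol   = 0;
-   static size_t freed_limit = 1000000;         /* stands in for the
-                                                   VG_(clo_freelist_vol)
-                                                   volume cap              */
-
-   static void put_on_freed_queue(ShadowBlock* sb)
-   {
-      /* keep the block around so that later (invalid) accesses to it
-         can still be spotted and reported */
-      sb->next = NULL;
-      if (freed_tail) freed_tail->next = sb; else freed_head = sb;
-      freed_tail  = sb;
-      freed_vol  += sb->size;
-
-      /* once the queue exceeds the volume limit, really recycle the
-         oldest blocks */
-      while (freed_vol > freed_limit && freed_head != NULL) {
-         ShadowBlock* oldest = freed_head;
-         freed_head = oldest->next;
-         if (freed_head == NULL) freed_tail = NULL;
-         freed_vol -= oldest->size;
-         /* here the real code would hand the block's storage back to
-            the low-level allocator */
-      }
-   }
-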
-
-Keeping track of A and V bits (note: if you don't know what these are,
-you haven't read the user guide carefully enough) for memory is done
-in vg_memory.c. This implements a sparse array structure
-which covers the entire 4G address space in a way which is reasonably
-fast and reasonably space efficient. The 4G address space is divided
-up into 64K sections, each covering 64Kb of address space. Given a
-32-bit address, the top 16 bits are used to select one of the 65536
-entries in VG_(primary_map). The resulting "secondary"
-(SecMap) holds A and V bits for the 64k of address space
-chunk corresponding to the lower 16 bits of the address.
-
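-
-In outline (an illustrative sketch only; the real SecMap packs A and V
-bits differently, and handles initialisation of fresh chunks properly):
-
-   #include <stdlib.h>
-
-   typedef unsigned int  UInt;
-   typedef unsigned char UChar;
-
-   typedef struct {
-      UChar abits[65536 / 8];    /* one A bit per byte of the 64KB chunk   */
-      UChar vbyte[65536];        /* eight V bits per byte of the chunk     */
-   } SecMap;
-
-   static SecMap* primary_map[65536];   /* stands in for VG_(primary_map)  */
-
-   static SecMap* get_secmap(UInt addr)
-   {
-      UInt i = addr >> 16;              /* top 16 bits pick the entry      */
-      if (primary_map[i] == NULL) {
-         /* simplification: the real code would initialise a fresh chunk's
-            A/V bits to "inaccessible"/"undefined", not to zero */
-         primary_map[i] = calloc(1, sizeof(SecMap));
-      }
-      return primary_map[i];
-   }
-
-   /* fetch the V bits for the byte at addr */
-   static UChar get_vbyte(UInt addr)
-   {
-      return get_secmap(addr)->vbyte[addr & 0xFFFF];  /* low 16 bits index
-                                                         within the chunk */
-   }
-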
-
-
-Valgrind's answer is: cheat. Valgrind is designed so that it is
-possible to switch back to running the client program on the real
-CPU at any point. Using the --stop-after= flag, you can
-ask Valgrind to run just some number of basic blocks, and then
-run the rest of the way on the real CPU. If you are searching for
-a bug in the simulated CPU, you can use this to do a binary search,
-which quickly leads you to the specific basic block which is
-causing the problem.
-
-
-This is all very handy. It does constrain the design in certain
-unimportant ways. Firstly, the layout of memory, when viewed from the
-client's point of view, must be identical regardless of whether it is
-running on the real or simulated CPU. This means that Valgrind can't
-do pointer swizzling -- well, no great loss -- and it can't run on
-the same stack as the client -- again, no great loss.
-Valgrind operates on its own stack, VG_(stack), which
-it switches to at startup, temporarily switching back to the client's
-stack when doing system calls for the client.
-
-
-Valgrind also receives signals on its own stack,
-VG_(sigstack), but for different gruesome reasons
-discussed below.
-
-
-This nice clean switch-back-to-the-real-CPU-whenever-you-like story
-is muddied by signals. Problem is that signals arrive at arbitrary
-times and tend to slightly perturb the basic block count, with the
-result that you can get close to the basic block causing a problem but
-can't home in on it exactly. My kludgey hack is to define
-SIGNAL_SIMULATION to 1 towards the bottom of
-vg_syscall_mem.c, so that signal handlers are run on the
-real CPU and don't change the BB counts.
-
-
-A second hole in the switch-back-to-real-CPU story is that Valgrind's -way of delivering signals to the client is different from that of the -kernel. Specifically, the layout of the signal delivery frame, and -the mechanism used to detect a sighandler returning, are different. -So you can't expect to make the transition inside a sighandler and -still have things working, but in practice that's not much of a -restriction. - -
-Valgrind's implementation of malloc, free,
-etc, (in vg_clientmalloc.c, not the low-level stuff in
-vg_malloc2.c) is somewhat complicated by the need to
-handle switching back at arbitrary points. It does work tho.
-
-
-
-
- I am of the view that it's acceptable to spend 5% of the total - running time of your valgrindified program doing assertion checks - and other internal sanity checks. -
-
VG_(do_sanity_checks)
- runs every 1000 basic blocks, which means 500 to 2000 times/second
- for typical machines at present. It checks that Valgrind hasn't
- overrun its private stack, and does some simple checks on the
- memory permissions maps. Once every 25 calls it does some more
- extensive checks on those maps. Etc, etc.
- - The following components also have sanity check code, which can - be enabled to aid debugging: -
VG_(mallocSanityCheckArena)). This does a
- complete check of all blocks and chains in an arena, which
- is very slow. Is not engaged by default.
- -
VG_(read_symbols)
- for a start. Is permanently engaged.
- -
vg_memory.c.
- This can be compiled with cpp symbol
- VG_DEBUG_MEMORY defined, which removes all the
- fast, optimised cases, and uses simple-but-slow fallbacks
- instead. Not engaged by default.
- -
VG_DEBUG_LEAKCHECK.
- -
VG_(saneUInstr) and sanity checks the sequence
- as a whole with VG_(saneUCodeBlock). This stuff
- is engaged by default, and has caught some way-obscure bugs
- in the simulated CPU machinery in its time.
- -
VG_(first_and_last_secondaries_look_plausible) after
- every syscall; this is known to pick up bugs in the syscall
- wrappers. Engaged by default.
- -
VG_(dispatch), checks
- that translations do not set %ebp to any value
- different from VG_EBP_DISPATCH_CHECKED or
- & VG_(baseBlock). In effect this test is free,
- and is permanently engaged.
- -
vg_do_register_allocation.
- -
-Some more specific things are: - -
ld.so's point of view, and it therefore absolutely
- had better not export any symbol with a name which could clash
- with that of the client or any of its libraries. Therefore, all
- globally visible symbols exported from valgrind.so
- are defined using the VG_ CPP macro. As you'll see
- from vg_constants.h, this appends some arbitrary
- prefix to the symbol, in order that it be, we hope, globally
- unique. Currently the prefix is vgPlain_. For
- convenience there are also VGM_, VGP_
- and VGOFF_. All locally defined symbols are declared
- static and do not appear in the final shared object.
-
- To check this, I periodically do
- nm valgrind.so | grep " T ",
- which shows you all the globally exported text symbols.
- They should all have an approved prefix, except for those like
- malloc, free, etc, which we deliberately
- want to shadow and take precedence over the same names exported
- from glibc.so, so that valgrind can intercept those
- calls easily. Similarly, nm valgrind.so | grep " D "
- allows you to find any rogue data-segment symbol names.
-
-
glibc.so. For example, we have our own low-level
- memory manager in vg_malloc2.c, which is a fairly
- standard malloc/free scheme augmented with arenas, and
- vg_mylibc.c exports reimplementations of various bits
- and pieces you'd normally get from the C library.
-
- Why all the hassle? Because imagine the potential chaos of both
- the simulated and real CPUs executing in glibc.so.
- It just seems simpler and cleaner to be completely self-contained,
- so that only the simulated CPU visits glibc.so. In
- practice it's not much hassle anyway. Also, valgrind starts up
- before glibc has a chance to initialise itself, and who knows what
- difficulties that could lead to. Finally, glibc has definitions
- for some types, specifically sigset_t, which conflict
- (are different from) the Linux kernel's idea of same. When
- Valgrind wants to fiddle around with signal stuff, it wants to
- use the kernel's definitions, not glibc's definitions. So it's
- simplest just to keep glibc out of the picture entirely.
-
- To find out which glibc symbols are used by Valgrind, reinstate
- the link flags -nostdlib -Wl,-no-undefined. This
- causes linking to fail, but will tell you what you depend on.
- I have mostly, but not entirely, got rid of the glibc
- dependencies; what remains is, IMO, fairly harmless. AFAIK the
- current dependencies are: memset,
- memcmp, stat, system,
- sbrk, setjmp and longjmp.
-
-
-
vg_syscall_mem imports, via
- vg_unsafe.h, a significant number of C-library
- headers so as to know the sizes of various structs passed across
- the kernel boundary. This is of course completely bogus, since
- there is no guarantee that the C library's definitions of these
- structs matches those of the kernel. I have started to sort this
- out using vg_kerneliface.h, into which I had intended
- to copy all kernel definitions which valgrind could need, but this
- has not gotten very far. At the moment it mostly contains
- definitions for sigset_t and struct
- sigaction, since the kernel's definition for these really
- does clash with glibc's. I plan to use a vki_ prefix
- on all these types and constants, to denote the fact that they
- pertain to Valgrind's Kernel Interface.
-
- Another advantage of having a vg_kerneliface.h file
- is that it makes it simpler to interface to a different kernel.
- One can, for example, easily imagine writing a new
- vg_kerneliface.h for FreeBSD, or x86 NetBSD.
-
-
-No MMX. Fixing this should be relatively easy, using the same giant -trick used for x86 FPU instructions. See below. -
-Support for weird (non-POSIX) signal stuff is patchy. Does anybody -care? -
- - - - -
-Since it generates x86 code in memory, Valgrind has complete control
-of the use of registers in the translations. Now pay attention. I
-shall say this only once, and it is important you understand this. In
-what follows I will refer to registers in the host (real) cpu using
-their standard names, %eax, %edi, etc. I
-refer to registers in the simulated CPU by capitalising them:
-%EAX, %EDI, etc. These two sets of
-registers usually bear no direct relationship to each other; there is
-no fixed mapping between them. This naming scheme is used fairly
-consistently in the comments in the sources.
-
-Host registers, once things are up and running, are used as follows: -
%esp, the real stack pointer, points
- somewhere in Valgrind's private stack area,
- VG_(stack) or, transiently, into its signal delivery
- stack, VG_(sigstack).
--
%edi is used as a temporary in code generation; it
- is almost always dead, except when used for the Left
- value-tag operations.
--
%eax, %ebx, %ecx,
- %edx and %esi are available to
- Valgrind's register allocator. They are dead (carry unimportant
- values) in between translations, and are live only in
- translations. The one exception to this is %eax,
- which, as mentioned far above, has a special significance to the
- dispatch loop VG_(dispatch): when a translation
- returns to the dispatch loop, %eax is expected to
- contain the original-code-address of the next translation to run.
- The register allocator is so good at minimising spill code that
- using five regs and not having to save/restore %edi
- actually gives better code than allocating to %edi
- as well, but then having to push/pop it around special uses.
--
%ebp points permanently at
- VG_(baseBlock). Valgrind's translations are
- position-independent, partly because this is convenient, but also
- because translations get moved around in TC as part of the LRUing
- activity. All static entities which need to be referred to
- from generated code, whether data or helper functions, are stored
- starting at VG_(baseBlock) and are therefore reached
- by indexing from %ebp. There is but one exception,
- which is that by placing the value
- VG_EBP_DISPATCH_CHECKED
- in %ebp just before a return to the dispatcher,
- the dispatcher is informed that the next address to run,
- in %eax, requires special treatment.
--
%eflags
- register.
-
-The state of the simulated CPU is stored in memory, in
-VG_(baseBlock), which is a block of 200 words IIRC.
-Recall that %ebp points permanently at the start of this
-block. Function vg_init_baseBlock decides what the
-offsets of various entities in VG_(baseBlock) are to be,
-and allocates word offsets for them. The code generator then emits
-%ebp relative addresses to get at those things. The
-sequence in which entities are allocated has been carefully chosen so
-that the 32 most popular entities come first, because this means 8-bit
-offsets can be used in the generated code.
-
-
-If I was clever, I could make %ebp point 32 words along
-VG_(baseBlock), so that I'd have another 32 words of
-short-form offsets available, but that's just complicated, and it's
-not important -- the first 32 words take 99% (or whatever) of the
-traffic.
-
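-
-The offset-allocation idea can be pictured with a small illustrative
-fragment (the names here are made up, not the real vg_init_baseBlock
-code):
-
-   typedef unsigned int UInt;
-
-   #define N_BASEBLOCK_WORDS 200
-   UInt baseBlock[N_BASEBLOCK_WORDS];   /* %ebp points at baseBlock[0]     */
-
-   static int n_words_used = 0;
-
-   /* hand out the next free word offset(s); entities registered early get
-      small offsets, and offsets below 32 words fit in a one-byte
-      %ebp displacement in the emitted code */
-   static int alloc_baseBlock_words(int n)
-   {
-      int off = n_words_used;
-      n_words_used += n;
-      return off;
-   }
-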
-
-Currently, the sequence of stuff in VG_(baseBlock) is as
-follows:
-
%EAX .. %EDI, and the simulated flags,
- %EFLAGS.
--
-
VG_(helper_value_check4_fail),
- VG_(helper_value_check0_fail),
- which register V-check failures,
- VG_(helperc_STOREV4),
- VG_(helperc_STOREV1),
- VG_(helperc_LOADV4),
- VG_(helperc_LOADV1),
- which do stores and loads of V bits to/from the
- sparse array which keeps track of V bits in memory,
- and
- VGM_(handle_esp_assignment), which messes with
- memory addressibility resulting from changes in %ESP.
--
%EIP.
--
-
VG_(helperc_STOREV2),
- VG_(helperc_LOADV2). These are here because 2-byte
- loads and stores are relatively rare, so are placed above the
- magic 32-word offset boundary.
--
VGM_(fpu_write_check) and
- VGM_(fpu_read_check), which handle the A/V maps
- testing and changes required by FPU writes/reads.
--
VG_(helper_value_check2_fail) and
- VG_(helper_value_check1_fail). These are probably
- never emitted now, and should be removed.
--
-
vg_helpers.S, which deal with rare situations which
- are tedious or difficult to generate code in-line for.
-
-As a general rule, the simulated machine's state lives permanently in
-memory at VG_(baseBlock). However, the JITter does some
-optimisations which allow the simulated integer registers to be
-cached in real registers over multiple simulated instructions within
-the same basic block. These are always flushed back into memory at
-the end of every basic block, so that the in-memory state is
-up-to-date between basic blocks. (This flushing is implied by the
-statement above that the real machine's allocatable registers are
-dead in between simulated blocks).
-
-
-
VG_(startup), called from
-valgrind.so's initialisation section), really means
-copying the real CPU's state into VG_(baseBlock), and
-then installing our own stack pointer, etc, into the real CPU, and
-then starting up the JITter. Exiting valgrind involves copying the
-simulated state back to the real state.
-
-
-Unfortunately, there's a complication at startup time. Problem is
-that at the point where we need to take a snapshot of the real CPU's
-state, the offsets in VG_(baseBlock) are not set up yet,
-because to do so would involve disrupting the real machine's state
-significantly. The way round this is to dump the real machine's state
-into a temporary, static block of memory,
-VG_(m_state_static). We can then set up the
-VG_(baseBlock) offsets at our leisure, and copy into it
-from VG_(m_state_static) at some convenient later time.
-This copying is done by
-VG_(copy_m_state_static_to_baseBlock).
-
-
-On exit, the inverse transformation is (rather unnecessarily) used:
-stuff in VG_(baseBlock) is copied to
-VG_(m_state_static), and the assembly stub then copies
-from VG_(m_state_static) into the real machine registers.
-
-
-Doing system calls on behalf of the client (vg_syscall.S)
-is something of a half-way house. We have to make the world look
-sufficiently like that which the client would normally have to make
-the syscall actually work properly, but we can't afford to lose
-control. So the trick is to copy all of the client's state, except
-its program counter, into the real CPU, do the system call, and
-copy the state back out. Note that the client's state includes its
-stack pointer register, so one effect of this partial restoration is
-to cause the system call to be run on the client's stack, as it should
-be.
-
-
-As ever there are complications. We have to save some of our own state -somewhere when restoring the client's state into the CPU, so that we -can keep going sensibly afterwards. In fact the only thing which is -important is our own stack pointer, but for paranoia reasons I save -and restore our own FPU state as well, even though that's probably -pointless. - -
-The complication on the above complication is, that for horrible
-reasons to do with signals, we may have to handle a second client
-system call whilst the client is blocked inside some other system
-call (unbelievable!). That means there's two sets of places to
-dump Valgrind's stack pointer and FPU state across the syscall,
-and we decide which to use by consulting
-VG_(syscall_depth), which is in turn maintained by
-VG_(wrap_syscall).
-
-
-
-
-In normal operation, translation proceeds through six stages,
-coordinated by VG_(translate):
-
VG_(disBB)).
--
vg_improve), with the aim of
- caching simulated registers in real registers over multiple
- simulated instructions, and removing redundant simulated
- %EFLAGS saving/restoring.
--
vg_instrument), which adds
- value and address checking code.
--
vg_cleanup), removing
- redundant value-check computations.
--
vg_do_register_allocation),
- which, note, is done on UCode.
--
VG_(emit_code)).
-
-Notice how steps 2, 3, 4 and 5 are simple UCode-to-UCode
-transformation passes, all on straight-line blocks of UCode (type
-UCodeBlock). Steps 2 and 4 are optimisation passes and
-can be disabled for debugging purposes, with
---optimise=no and --cleanup=no respectively.
-
-
-Valgrind can also run in a no-instrumentation mode, given
---instrument=no. This is useful for debugging the JITter
-quickly without having to deal with the complexity of the
-instrumentation mechanism too. In this mode, steps 3 and 4 are
-omitted.
-
-
-These flags combine, so that --instrument=no together with
---optimise=no means only steps 1, 5 and 6 are used.
---single-step=yes causes each x86 instruction to be
-treated as a single basic block. The translations are terrible but
-this is sometimes instructive.
-
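-
-Putting the six stages and these flags together, the overall shape of one
-translation is roughly as follows (an illustrative outline using the pass
-names from the text; the clo_* variables and the exact signatures are
-stand-ins, not the real ones):
-
-   typedef struct UCodeBlock_ UCodeBlock;
-   typedef unsigned int Addr;
-
-   extern UCodeBlock* disBB(Addr orig);
-   extern void vg_improve(UCodeBlock*), vg_instrument(UCodeBlock*),
-               vg_cleanup(UCodeBlock*), vg_do_register_allocation(UCodeBlock*),
-               emit_code(UCodeBlock*);
-   extern int clo_optimise, clo_instrument, clo_cleanup;
-
-   void translate_one_bb(Addr orig_addr)
-   {
-      UCodeBlock* cb = disBB(orig_addr);    /* 1. parse x86 into UCode     */
-      if (clo_optimise)
-         vg_improve(cb);                    /* 2. skipped by --optimise=no */
-      if (clo_instrument) {
-         vg_instrument(cb);                 /* 3. add A/V checking code    */
-         if (clo_cleanup)
-            vg_cleanup(cb);                 /* 4. skipped by --cleanup=no  */
-      }
-      vg_do_register_allocation(cb);        /* 5. reg-alloc, still UCode   */
-      emit_code(cb);                        /* 6. emit final x86 code      */
-   }
-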
-
-The --stop-after=N flag switches back to the real CPU
-after N basic blocks. It also re-JITs the final basic
-block executed and prints the debugging info resulting, so this
-gives you a way to get a quick snapshot of how a basic block looks as
-it passes through the six stages mentioned above. If you want to
-see full information for every block translated (probably not, but
-still ...) find, in VG_(translate), the lines
- dis = True;
- dis = debugging_translation;
-
-and comment out the second line. This will spew out debugging
-junk faster than you can possibly imagine.
-
-
-
-
Tag
-UCode instructions have up to three operand fields, each of which has
-a corresponding Tag describing it. Possible values for
-the tag are:
-
-
NoValue: indicates that the field is not in use.
--
Lit16: the field contains a 16-bit literal.
--
Literal: the field denotes a 32-bit literal, whose
- value is stored in the lit32 field of the uinstr
- itself. Since there is only one lit32 for the whole
- uinstr, only one operand field may contain this tag.
--
SpillNo: the field contains a spill slot number, in
- the range 0 to 23 inclusive, denoting one of the spill slots
- contained inside VG_(baseBlock). Such tags only
- exist after register allocation.
--
RealReg: the field contains a number in the range 0
- to 7 denoting an integer x86 ("real") register on the host. The
- number is the Intel encoding for integer registers. Such tags
- only exist after register allocation.
--
ArchReg: the field contains a number in the range 0
- to 7 denoting an integer x86 register on the simulated CPU. In
- reality this means a reference to one of the first 8 words of
- VG_(baseBlock). Such tags can exist at any point in
- the translation process.
--
TempReg. The field contains the
- number of one of an infinite set of virtual (integer)
- registers. TempRegs are used everywhere throughout
- the translation process; you can have as many as you want. The
- register allocator maps as many as it can into
- RealRegs and turns the rest into
- SpillNos, so TempRegs should not exist
- after the register allocation phase.
-
- TempRegs are always 32 bits long, even if the data
- they hold is logically shorter. In that case the upper unused
- bits are required, and, I think, generally assumed, to be zero.
- TempRegs holding V bits for quantities shorter than
- 32 bits are expected to have ones in the unused places, since a
- one denotes "undefined".
-
UInstr-UCode was carefully designed to make it possible to do register -allocation on UCode and then translate the result into x86 code -without needing any extra registers ... well, that was the original -plan, anyway. Things have gotten a little more complicated since -then. In what follows, UCode instructions are referred to as uinstrs, -to distinguish them from x86 instructions. Uinstrs of course have -uopcodes which are (naturally) different from x86 opcodes. - -
-A uinstr (type UInstr) contains
-various fields, not all of which are used by any one uopcode:
-
val1, val2
- and val3.
--
tag1, tag2
- and tag3. Each of these has a value of type
- Tag,
- and they describe what the val1, val2
- and val3 fields contain.
--
-
FlagSets, specifying which x86 condition codes are
- read and written by the uinstr.
--
Opcode.
--
-
Condcode, indicating the condition
- which applies. The encoding is as it is in the x86 insn stream,
- except we add a 17th value CondAlways to indicate
- an unconditional transfer.
--
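-
-Collecting the above into a small C sketch (the real declarations in the
-sources have more fields and different types; this is just to fix ideas):
-
-   typedef enum { NoValue, Lit16, Literal, SpillNo,
-                  RealReg, ArchReg, TempReg } Tag;
-
-   typedef struct {
-      int            opcode;            /* a uopcode (type Opcode)         */
-      unsigned int   lit32;             /* 32-bit literal, if any operand
-                                           carries the Literal tag         */
-      unsigned short val1, val2, val3;  /* the three operand fields        */
-      Tag            tag1, tag2, tag3;  /* what each operand field holds   */
-      unsigned char  cond;              /* Condcode, for conditional jumps */
-      unsigned char  flags_r, flags_w;  /* FlagSets: condition codes read
-                                           and written by this uinstr      */
-   } UInstrSketch;
-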
-UOpcodes (type Opcode) are divided into two groups: those
-necessary merely to express the functionality of the x86 code, and
-extra uopcodes needed to express the instrumentation. The former
-group contains:
-
GET and PUT, which move values from the
- simulated CPU's integer registers (ArchRegs) into
- TempRegs, and back. GETF and
- PUTF do the corresponding thing for the simulated
- %EFLAGS. There are no corresponding insns for the
- FPU register stack, since we don't explicitly simulate its
- registers.
--
LOAD and STORE, which, in RISC-like
- fashion, are the only uinstrs able to interact with memory.
--
MOV and CMOV allow unconditional and
- conditional moves of values between TempRegs.
--
TempRegs (before reg-alloc) or RealRegs
- (after reg-alloc). These are: ADD, ADC,
- AND, OR, XOR,
- SUB, SBB, SHL,
- SHR, SAR, ROL,
- ROR, RCL, RCR,
- NOT, NEG, INC,
- DEC, BSWAP, CC2VAL and
- WIDEN. WIDEN does signed or unsigned
- value widening. CC2VAL is used to convert condition
- codes into a value, zero or one. The rest are obvious.
-
- To allow for more efficient code generation, we bend slightly the
- restriction at the start of the previous para: for
- ADD, ADC, XOR,
- SUB and SBB, we allow the first (source)
- operand to also be an ArchReg, that is, one of the
- simulated machine's registers. Also, many of these ALU ops allow
- the source operand to be a literal. See
- VG_(saneUInstr) for the final word on the allowable
- forms of uinstrs.
-
-
LEA1 and LEA2 are not strictly
- necessary, but facilitate better translations. They
- record the fancy x86 addressing modes in a direct way, which
- allows those amodes to be emitted back into the final
- instruction stream more or less verbatim.
--
CALLM calls a machine-code helper, one of the methods
- whose address is stored at some VG_(baseBlock)
- offset. PUSH and POP move values
- to/from TempReg to the real (Valgrind's) stack, and
- CLEAR removes values from the stack.
- CALLM_S and CALLM_E delimit the
- boundaries of call setups and clearings, for the benefit of the
- instrumentation passes. Getting this right is critical, and so
- VG_(saneUCodeBlock) makes various checks on the use
- of these uopcodes.
-
- It is important to understand that these uopcodes have nothing to
- do with the x86 call, return,
- push or pop instructions, and are not
- used to implement them. Those guys turn into combinations of
- GET, PUT, LOAD,
- STORE, ADD, SUB, and
- JMP. What these uopcodes support is calling of
- helper functions such as VG_(helper_imul_32_64),
- which do stuff which is too difficult or tedious to emit inline.
-
-
FPU, FPU_R and FPU_W.
- Valgrind doesn't attempt to simulate the internal state of the
- FPU at all. Consequently it only needs to be able to distinguish
- FPU ops which read and write memory from those that don't, and
- for those which do, it needs to know the effective address and
- data transfer size. This is made easier because the x86 FP
- instruction encoding is very regular, basically consisting of
- 16 bits for a non-memory FPU insn and 11 (IIRC) bits + an address mode
- for a memory FPU insn. So our FPU uinstr carries
- the 16 bits in its val1 field. And
- FPU_R and FPU_W carry 11 bits in that
- field, together with the identity of a TempReg or
- (later) RealReg which contains the address.
--
JIFZ is unique, in that it allows a control-flow
- transfer which is not deemed to end a basic block. It causes a
- jump to a literal (original) address if the specified argument
- is zero.
--
INCEIP advances the simulated
- %EIP by the specified literal amount. This supports
- lazy %EIP updating, as described below.
--Stages 1 and 2 of the 6-stage translation process mentioned above -deal purely with these uopcodes, and no others. They are -sufficient to express pretty much all the x86 32-bit protected-mode -instruction set, at -least everything understood by a pre-MMX original Pentium (P54C). - -
-Stages 3, 4, 5 and 6 also deal with the following extra -"instrumentation" uopcodes. They are used to express all the -definedness-tracking and -checking machinery which valgrind does. In -later sections we show how to create checking code for each of the -uopcodes above. Note that these instrumentation uopcodes, although -some appearing complicated, have been carefully chosen so that -efficient x86 code can be generated for them. GNU superopt v2.5 did a -great job helping out here. Anyways, the uopcodes are as follows: - -
GETV and PUTV are analogues to
- GET and PUT above. They are identical
- except that they move the V bits for the specified values back and
- forth to TempRegs, rather than moving the values
- themselves.
--
LOADV and STOREV read and
- write V bits from the synthesised shadow memory that Valgrind
- maintains. In fact they do more than that, since they also do
- address-validity checks, and emit complaints if the read/written
- addresses are unaddressible.
--
TESTV, whose parameters are a TempReg
- and a size, tests the V bits in the TempReg, at the
- specified operation size (0/1/2/4 byte) and emits an error if any
- of them indicate undefinedness. This is the only uopcode capable
- of doing such tests.
--
SETV, whose parameters are also TempReg
- and a size, makes the V bits in the TempReg indicated
- definedness, at the specified operation size. This is usually
- used to generate the correct V bits for a literal value, which is
- of course fully defined.
--
GETVF and PUTVF are analogues to
- GETF and PUTF. They move the single V
- bit used to model definedness of %EFLAGS between its
- home in VG_(baseBlock) and the specified
- TempReg.
--
TAG1 denotes one of a family of unary operations on
- TempRegs containing V bits. Similarly,
- TAG2 denotes one in a family of binary operations on
- V bits.
-
-These 10 uopcodes are sufficient to express Valgrind's entire
-definedness-checking semantics. In fact most of the interesting magic
-is done by the TAG1 and TAG2
-suboperations.
-
-
-First, however, I need to explain about V-vector operation sizes.
-There are 4 sizes: 1, 2 and 4, which operate on groups of 8, 16 and 32
-V bits at a time, supporting the usual 1, 2 and 4 byte x86 operations.
-However there is also the mysterious size 0, which really means a
-single V bit. Single V bits are used in various circumstances; in
-particular, the definedness of %EFLAGS is modelled with a
-single V bit. Now might be a good time to also point out that for
-V bits, 1 means "undefined" and 0 means "defined". Similarly, for A
-bits, 1 means "invalid address" and 0 means "valid address". This
-seems counterintuitive (and so it is), but testing against zero on
-x86s saves instructions compared to testing against all 1s, because
-many ALU operations set the Z flag for free, so to speak.
-
-
-With that in mind, the tag ops are: - -
VgT_PCast40,
- VgT_PCast20, VgT_PCast10,
- VgT_PCast01, VgT_PCast02 and
- VgT_PCast04. A "pessimising cast" takes a V-bit
- vector at one size, and creates a new one at another size,
- pessimised in the sense that if any of the bits in the source
- vector indicate undefinedness, then all the bits in the result
- indicate undefinedness. In this case the casts are all to or from
- a single V bit, so for example VgT_PCast40 is a
- pessimising cast from 32 bits to 1, whereas
- VgT_PCast04 simply copies the single source V bit
- into all 32 bit positions in the result. Surprisingly, these ops
- can all be implemented very efficiently.
-
- There are also the pessimising casts VgT_PCast14,
- from 8 bits to 32, VgT_PCast12, from 8 bits to 16,
- and VgT_PCast11, from 8 bits to 8. This last one
- seems nonsensical, but in fact it isn't a no-op because, as
- mentioned above, any undefined (1) bits in the source infect the
- entire result.
-
-
VgT_Left4, VgT_Left2 and
- VgT_Left1. These are used to simulate the worst-case
- effects of carry propagation in adds and subtracts. They return a
- V vector identical to the original, except that if the original
- contained any undefined bits, then it and all bits above it are
- marked as undefined too. Hence the Left bit in the names.
--
VgT_SWiden14, VgT_SWiden24,
- VgT_SWiden12, VgT_ZWiden14,
- VgT_ZWiden24 and VgT_ZWiden12. These
- mimic the definedness effects of standard signed and unsigned
- integer widening. Unsigned widening creates zero bits in the new
- positions, so VgT_ZWiden* accordingly mark
- those parts of their argument as defined. Signed widening copies
- the sign bit into the new positions, so VgT_SWiden*
- copies the definedness of the sign bit into the new positions.
- Because 1 means undefined and 0 means defined, these operations
- can (fascinatingly) be done by the same operations which they
- mimic. Go figure.
--
VgT_UifU4,
- VgT_UifU2, VgT_UifU1,
- VgT_UifU0, VgT_DifD4,
- VgT_DifD2, VgT_DifD1. These do simple
- bitwise operations on pairs of V-bit vectors, with
- UifU giving undefined if either arg bit is
- undefined, and DifD giving defined if either arg bit
- is defined. Abstract interpretation junkies, if any make it this
- far, may like to think of them as meets and joins (or is it joins
- and meets) in the definedness lattices.
--
VgT_ImproveAND4_TQ,
- VgT_ImproveAND2_TQ, VgT_ImproveAND1_TQ,
- VgT_ImproveOR4_TQ, VgT_ImproveOR2_TQ,
- VgT_ImproveOR1_TQ. These help out with AND and OR
- operations. AND and OR have the inconvenient property that the
- definedness of the result depends on the actual values of the
- arguments as well as their definedness. At the bit level:
- 1 AND undefined = undefined, but
- 0 AND undefined = 0, and similarly
- 0 OR undefined = undefined, but
- 1 OR undefined = 1.
-
- It turns out that gcc (quite legitimately) generates code which
- relies on this fact, so we have to model it properly in order to
- avoid flooding users with spurious value errors. The ultimate
- definedness result of AND and OR is calculated using
- UifU on the definedness of the arguments, but we
- also DifD in some "improvement" terms which
- take into account the above phenomena.
-
- ImproveAND takes as its first argument the actual
- value of an argument to AND (the T) and the definedness of that
- argument (the Q), and returns a V-bit vector which is defined (0)
-  for bits which have value 0 and are defined; this, when
-  DifD'd into the final result, causes those bits to be
- defined even if the corresponding bit in the other argument is undefined.
-
- The ImproveOR ops do the dual thing for OR
- arguments. Note that XOR does not have this property that one
- argument can make the other irrelevant, so there is no need for
- such complexity for XOR.
-
-That's all the tag ops. If you stare at this long enough, and then -run Valgrind and stare at the pre- and post-instrumented ucode, it -should be fairly obvious how the instrumentation machinery hangs -together. - -
-One point, if you do this: in order to make it easy to differentiate
-TempRegs carrying values from TempRegs
-carrying V bit vectors, Valgrind prints the former as (for example)
-t28 and the latter as q28; the fact that
-they carry the same number serves to indicate their relationship.
-This is purely for the convenience of the human reader; the register
-allocator and code generator don't regard them as different.
-
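-
-

To make the tag ops above concrete, here is a small, purely -illustrative C rendering of a few of them, operating on 32-bit V-bit -vectors (1 = undefined, 0 = defined). The function names mirror the -tag op names, but the code is just a sketch of the semantics, not what -Valgrind itself executes. -

-
-
-/* Purely illustrative: plain-C models of some tag ops.  In a V-bit
-   vector, 1 means "undefined" and 0 means "defined". */
-typedef unsigned int  UInt;    /* 32-bit V-bit vector */
-typedef unsigned char UChar;   /*  8-bit V-bit vector */
-
-/* Pessimising casts: PCast40 squashes 32 V bits down to one,
-   PCast04 smears a single V bit back out to all 32 positions. */
-UInt pcast40 ( UInt q )  { return q == 0 ? 0 : 1; }
-UInt pcast04 ( UInt q1 ) { return q1 == 0 ? 0 : 0xFFFFFFFF; }
-
-/* Left4: any undefined bit infects itself and everything above it,
-   the worst case for carry propagation.  q | -q does exactly that. */
-UInt left4 ( UInt q ) { return q | (0u - q); }
-
-/* SWiden14 / ZWiden14: because of the 1-means-undefined convention,
-   widening the V bits is done by the very operation being mimicked. */
-UInt swiden14 ( UChar q ) { return (UInt)(int)(signed char)q; }
-UInt zwiden14 ( UChar q ) { return (UInt)q; }
-
-/* UifU / DifD: undefined-if-either-undefined is bitwise OR,
-   defined-if-either-defined is bitwise AND. */
-UInt uifu4 ( UInt q1, UInt q2 ) { return q1 | q2; }
-UInt difd4 ( UInt q1, UInt q2 ) { return q1 & q2; }
-
-/* ImproveAND_TQ: a bit which is defined and holds 0 in the value t
-   forces the corresponding result bit of an AND to be defined. */
-UInt improveAND4_TQ ( UInt t, UInt q ) { return t | q; }
-
-/* Overall definedness of (t1 AND t2): UifU of the argument V bits,
-   DifD'd with the two improvement terms, as described above. */
-UInt and_definedness ( UInt t1, UInt q1, UInt t2, UInt q2 )
-{
-   return difd4( uifu4(q1, q2),
-                 difd4( improveAND4_TQ(t1, q1),
-                        improveAND4_TQ(t2, q2) ) );
-}
-
-
-

Note how and_definedness composes UifU -with the two improvement terms in exactly the way the AND/OR discussion -above describes. -

-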
-
-
VG_(disBB) allocates a new UCodeBlock and
-then uses disInstr to translate x86 instructions one at a
-time into UCode, dumping the result in the UCodeBlock.
-This goes on until a control-flow transfer instruction is encountered.
-
-
-Despite the large size of vg_to_ucode.c, this translation
-is really very simple. Each x86 instruction is translated entirely
-independently of its neighbours, merrily allocating new
-TempRegs as it goes. The idea is to have a simple
-translator -- in reality, no more than a macro-expander -- and the
-resulting bad UCode translation is cleaned up by the UCode
-optimisation phase which follows. To give you an idea of some x86
-instructions and their translations (this is a complete basic block,
-as Valgrind sees it):
-
- 0x40435A50: incl %edx - - 0: GETL %EDX, t0 - 1: INCL t0 (-wOSZAP) - 2: PUTL t0, %EDX - - 0x40435A51: movsbl (%edx),%eax - - 3: GETL %EDX, t2 - 4: LDB (t2), t2 - 5: WIDENL_Bs t2 - 6: PUTL t2, %EAX - - 0x40435A54: testb $0x20, 1(%ecx,%eax,2) - - 7: GETL %EAX, t6 - 8: GETL %ECX, t8 - 9: LEA2L 1(t8,t6,2), t4 - 10: LDB (t4), t10 - 11: MOVB $0x20, t12 - 12: ANDB t12, t10 (-wOSZACP) - 13: INCEIPo $9 - - 0x40435A59: jnz-8 0x40435A50 - - 14: Jnzo $0x40435A50 (-rOSZACP) - 15: JMPo $0x40435A5B -- -
-Notice how the block always ends with an unconditional jump to the -next block. This is a bit unnecessary, but makes many things simpler. - -
-Most x86 instructions turn into sequences of GET,
-PUT, LEA1, LEA2,
-LOAD and STORE. Some complicated ones
-however rely on calling helper bits of code in
-vg_helpers.S. The ucode instructions PUSH,
-POP, CALL, CALLM_S and
-CALLM_E support this. The calling convention is somewhat
-ad-hoc and is not the C calling convention. The helper routines must
-save all integer registers, and the flags, that they use. Args are
-passed on the stack underneath the return address, as usual, and if
-result(s) are to be returned, they are either placed in dummy arg
-slots created by the ucode PUSH sequence, or just
-overwrite the incoming args.
-
-
-In order that the instrumentation mechanism can handle calls to these
-helpers, VG_(saneUCodeBlock) enforces the following
-restrictions on calls to helpers:
-
-
CALL uinstr must be bracketed by a preceding
- CALLM_S marker (dummy uinstr) and a trailing
- CALLM_E marker. These markers are used by the
- instrumentation mechanism later to establish the boundaries of the
- PUSH, POP and CLEAR
- sequences for the call.
--
PUSH, POP and CLEAR
- may only appear inside sections bracketed by CALLM_S
- and CALLM_E, and nowhere else.
--
-  No two PUSH insns may
-  push the same TempReg.  Dually, no two
- POPs may pop the same TempReg.
--
-  Args should be removed from the stack with
-  CLEAR, rather than POPs
-  into a TempReg which is not subsequently used.  This
- is because the instrumentation mechanism assumes that all values
- POPped from the stack are actually used.
-TempReg-to-TempReg moves. This helps the
-next phase, UCode optimisation, to generate better code.
-
-
-
-vg_improve()), which blurs the boundaries between the
-translations of the original x86 instructions. It's pretty
-straightforward. Three transformations are done:
-
-GET elimination. Actually, more general
- than that -- eliminates redundant fetches of ArchRegs. In our
- running example, uinstr 3 GETs %EDX into
- t2 despite the fact that, by looking at the previous
- uinstr, it is already in t0. The GET is
- therefore removed, and t2 renamed to t0.
- Assuming t0 is allocated to a host register, it means
- the simulated %EDX will exist in a host CPU register
- for more than one simulated x86 instruction, which seems to me to
- be a highly desirable property.
-
- There is some mucking around to do with subregisters;
-  %AL vs %AH vs %AX vs
- %EAX etc. I can't remember how it works, but in
- general we are very conservative, and these tend to invalidate the
- caching.
-
-
PUT elimination. This annuls
- PUTs of values back to simulated CPU registers if a
- later PUT would overwrite the earlier
-  PUT value, and there are no intervening reads of the
- simulated register (ArchReg).
-
- As before, we are paranoid when faced with subregister references.
- Also, PUTs of %ESP are never annulled,
-  because it is vital that the instrumenter always has an up-to-date
-  %ESP value available: %ESP changes
-  affect the addressability of the memory around the simulated stack
- pointer.
-
- The implication of the above paragraph is that the simulated
- machine's registers are only lazily updated once the above two
- optimisation phases have run, with the exception of
- %ESP. TempRegs go dead at the end of
-  every basic block, from which it is inferrable that any
- TempReg caching a simulated CPU reg is flushed (back
- into the relevant VG_(baseBlock) slot) at the end of
- every basic block. The further implication is that the simulated
-  registers are only up-to-date in between basic blocks, and not
- at arbitrary points inside basic blocks. And the consequence of
- that is that we can only deliver signals to the client in between
- basic blocks. None of this seems any problem in practice.
-
-
-at 3: delete GET, rename t2 to t0 in (4 .. 6) -at 7: delete GET, rename t6 to t0 in (8 .. 9) -at 1: annul flag write OSZAP due to later OSZACP - -Improved code: - 0: GETL %EDX, t0 - 1: INCL t0 - 2: PUTL t0, %EDX - 4: LDB (t0), t0 - 5: WIDENL_Bs t0 - 6: PUTL t0, %EAX - 8: GETL %ECX, t8 - 9: LEA2L 1(t8,t0,2), t4 - 10: LDB (t4), t10 - 11: MOVB $0x20, t12 - 12: ANDB t12, t10 (-wOSZACP) - 13: INCEIPo $9 - 14: Jnzo $0x40435A50 (-rOSZACP) - 15: JMPo $0x40435A5B -- -
-As mentioned somewhere above, TempRegs carrying values
-have names like t28, and each one has a shadow carrying
-its V bits, with names like q28. This pairing aids in
-reading instrumented ucode.
-
-
-One decision about all this is where to have "observation points",
-that is, where to check that V bits are valid. I use a minimalistic
-scheme, only checking where a failure of validity could cause the
-original program to (seg)fault. So the use of values as memory
-addresses causes a check, as do conditional jumps (these cause a check
-on the definedness of the condition codes). And arguments
-PUSHed for helper calls are checked, hence the weird
-restrictions on helper call preambles described above.
-
-
-Another decision is that once a value is tested, it is thereafter
-regarded as defined, so that we do not emit multiple undefined-value
-errors for the same undefined value. That means that
-TESTV uinstrs are always followed by SETV
-on the same (shadow) TempRegs. Most of these
-SETVs are redundant and are removed by the
-post-instrumentation cleanup phase.
-
-
-The instrumentation for calling helper functions deserves further
-comment. The definedness of results from a helper is modelled using
-just one V bit. So, in short, we do pessimising casts of the
-definedness of all the args, down to a single bit, and then
-UifU these bits together. So this single V bit will say
-"undefined" if any part of any arg is undefined. This V bit is then
-pessimally cast back up to the result(s) sizes, as needed.  If
-Valgrind sees that all the args are got rid of with CLEAR and
-none with POP, so that the result of the call
-is not actually used, it immediately examines the result V bit with a
-TESTV -- SETV pair.  If it did not do this,
-there would be no observation point to detect that some of the
-args to the helper were undefined. Of course, if the helper's results
-are indeed used, we don't do this, since the result usage will
-presumably cause the result definedness to be checked at some suitable
-future point.
-
-
-
-In general Valgrind tries to track definedness on a bit-for-bit basis,
-but as the above para shows, for calls to helpers we throw in the
-towel and approximate down to a single bit.  This is because it's too
-complex and difficult to track bit-level definedness through complex
-ops such as integer multiply and divide, and in any case there are no
-reasonable code fragments which attempt to (eg) multiply two
-partially-defined values and end up with something meaningful, so
-there seems little point in modelling multiplies, divides, etc, in
-that level of detail.
-
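-<p>
-To make the single-bit approximation concrete, here is a tiny,
-purely illustrative C fragment for a two-argument helper; the names
-are invented for the sketch and are not Valgrind's.
-<p>
-<pre>
-typedef unsigned int UInt;
-
-static UInt pcast_down ( UInt q ) { return q == 0 ? 0 : 1; }          /* V vector -> 1 bit */
-static UInt pcast_up   ( UInt b ) { return b == 0 ? 0 : 0xFFFFFFFF; } /* 1 bit -> V vector */
-
-/* Result V bits for a two-argument helper: one pessimistic bit,
-   "undefined if any part of any arg is undefined", smeared back up. */
-UInt helper_result_vbits ( UInt q_arg1, UInt q_arg2 )
-{
-   UInt one_bit = pcast_down(q_arg1) | pcast_down(q_arg2);   /* UifU at 1 bit */
-   return pcast_up(one_bit);
-}
-</pre>
-
-<p>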
-Integer loads and stores are instrumented with firstly a test of the
-definedness of the address, followed by a LOADV or
-STOREV respectively. These turn into calls to
-(for example) VG_(helperc_LOADV4). These helpers do two
-things: they perform an address-valid check, and they load or store V
-bits from/to the relevant address in the (simulated V-bit) memory.
-
-
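-
-<p>The following is a rough, illustrative sketch of the shape of such a
-helper, assuming a two-level primary/secondary shadow map.  The layout,
-sizes and names here are guesses for exposition, not the definitions in
-<code>vg_memory.c</code>.
-<p>
-<pre>
-#include <stdio.h>
-
-typedef unsigned int   UInt;
-typedef unsigned char  UChar;
-typedef unsigned long  Addr;    /* 32-bit address space assumed, as everywhere here */
-
-typedef struct {
-   UChar abits[65536 / 8];   /* A bits, 1 bit per byte: 1 = invalid address */
-   UChar vbyte[65536];       /* V bits, 8 per byte: 1 = undefined           */
-} SecMap;
-
-/* Stand-ins for VG_(primary_map) and the error machinery. */
-static SecMap* primary_map[65536];
-
-static void record_address_error ( Addr a, int size )
-{
-   fprintf(stderr, "Invalid read of size %d at %p\n", size, (void*)a);
-}
-
-/* Roughly what a 4-byte LOADV helper does: a per-byte A-bit check,
-   then gather the 32 V bits for the word being loaded. */
-UInt helperc_LOADV4_sketch ( Addr a )
-{
-   UInt vword = 0;
-   int  i;
-   for (i = 3; i >= 0; i--) {
-      Addr    ai  = a + i;
-      SecMap* sm  = primary_map[ai >> 16];
-      UInt    off = ai & 0xFFFF;
-      if (sm == NULL || ((sm->abits[off >> 3] >> (off & 7)) & 1)) {
-         record_address_error(a, 4);
-         return 0;                 /* pretend the data is defined */
-      }
-      vword = vword * 256 + sm->vbyte[off];   /* highest-addressed byte first */
-   }
-   return vword;
-}
-</pre>
-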
-FPU loads and stores are different. As above the definedness of the
-address is first tested. However, the helper routine for FPU loads
-(VGM_(fpu_read_check)) emits an error if either the
-address is invalid or the referenced area contains undefined values.
-It has to do this because we do not simulate the FPU at all, and so
-cannot track the definedness of values loaded into it from memory; we
-therefore have to check them as soon as they are loaded into the FPU, ie, at
-this point. We notionally assume that everything in the FPU is
-defined.
-
-
-It follows therefore that FPU writes first check the definedness of -the address, then the validity of the address, and finally mark the -written bytes as well-defined. - -
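-
-<p>Sketching that checking policy in C: the two byte-level predicates
-below are assumed stand-ins for the real A-bit and V-bit queries, not
-actual Valgrind functions.
-<p>
-<pre>
-typedef unsigned long Addr;
-
-/* Stand-in shadow-memory queries; in Valgrind these would consult the
-   A and V bits for the byte at 'a'.  Stubs here so the sketch links. */
-static int  byte_is_addressable ( Addr a ) { (void)a; return 1; }
-static int  byte_is_defined     ( Addr a ) { (void)a; return 1; }
-static void report_address_error ( Addr a, int size ) { (void)a; (void)size; }
-static void report_value_error   ( int size )         { (void)size; }
-
-/* The policy described above: because the FPU's contents are not
-   shadowed, both addressability and definedness must be checked at
-   the moment the bytes enter the FPU. */
-void fpu_read_check_sketch ( Addr addr, int size )
-{
-   int i;
-   for (i = 0; i < size; i++) {
-      if (!byte_is_addressable(addr + i)) { report_address_error(addr, size); return; }
-      if (!byte_is_defined(addr + i))     { report_value_error(size);         return; }
-   }
-}
-</pre>
-
-<p>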
-
-If anyone is inspired to extend Valgrind to MMX/SSE insns, I suggest
-you use the same trick.  It works provided that the FPU/MMX unit is
-not used merely as a conduit to copy partially undefined data from
-one place in memory to another.  Unfortunately the integer CPU is used
-like that (when copying C structs with holes, for example) and this is
-the cause of much of the elaborateness of the instrumentation here
-described.
-
-<p>
-vg_instrument() in vg_translate.c actually
-does the instrumentation. There are comments explaining how each
-uinstr is handled, so we do not repeat that here. As explained
-already, it is bit-accurate, except for calls to helper functions.
-Unfortunately the x86 insns bt/bts/btc/btr are done by
-helper fns, so bit-level accuracy is lost there. This should be fixed
-by doing them inline; it will probably require adding a couple of new
-uinstrs. Also, left and right rotates through the carry flag (x86
-rcl and rcr) are approximated via a single
-V bit; so far this has not caused anyone to complain. The
-non-carry rotates, rol and ror, are much
-more common and are done exactly. Re-visiting the instrumentation for
-AND and OR, it seems rather verbose, and I wonder if it could be done
-more concisely now.
-
-
-The lowercase o on many of the uopcodes in the running
-example indicates that the size field is zero, usually meaning a
-single-bit operation.
-
-
-Anyroads, the post-instrumented version of our running example looks -like this: - -
-Instrumented code: - 0: GETVL %EDX, q0 - 1: GETL %EDX, t0 - - 2: TAG1o q0 = Left4 ( q0 ) - 3: INCL t0 - - 4: PUTVL q0, %EDX - 5: PUTL t0, %EDX - - 6: TESTVL q0 - 7: SETVL q0 - 8: LOADVB (t0), q0 - 9: LDB (t0), t0 - - 10: TAG1o q0 = SWiden14 ( q0 ) - 11: WIDENL_Bs t0 - - 12: PUTVL q0, %EAX - 13: PUTL t0, %EAX - - 14: GETVL %ECX, q8 - 15: GETL %ECX, t8 - - 16: MOVL q0, q4 - 17: SHLL $0x1, q4 - 18: TAG2o q4 = UifU4 ( q8, q4 ) - 19: TAG1o q4 = Left4 ( q4 ) - 20: LEA2L 1(t8,t0,2), t4 - - 21: TESTVL q4 - 22: SETVL q4 - 23: LOADVB (t4), q10 - 24: LDB (t4), t10 - - 25: SETVB q12 - 26: MOVB $0x20, t12 - - 27: MOVL q10, q14 - 28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 ) - 29: TAG2o q10 = UifU1 ( q12, q10 ) - 30: TAG2o q10 = DifD1 ( q14, q10 ) - 31: MOVL q12, q14 - 32: TAG2o q14 = ImproveAND1_TQ ( t12, q14 ) - 33: TAG2o q10 = DifD1 ( q14, q10 ) - 34: MOVL q10, q16 - 35: TAG1o q16 = PCast10 ( q16 ) - 36: PUTVFo q16 - 37: ANDB t12, t10 (-wOSZACP) - - 38: INCEIPo $9 - - 39: GETVFo q18 - 40: TESTVo q18 - 41: SETVo q18 - 42: Jnzo $0x40435A50 (-rOSZACP) - - 43: JMPo $0x40435A5B -- - -
-This pass, coordinated by vg_cleanup(), removes redundant
-definedness computation created by the simplistic instrumentation
-pass. It consists of two passes,
-vg_propagate_definedness() followed by
-vg_delete_redundant_SETVs.
-
-
-vg_propagate_definedness() is a simple
-constant-propagation and constant-folding pass. It tries to determine
-which TempRegs containing V bits will always indicate
-"fully defined", and it propagates this information as far as it can,
-and folds out as many operations as possible. For example, the
-instrumentation for an ADD of a literal to a variable quantity will be
-reduced down so that the definedness of the result is simply the
-definedness of the variable quantity, since the literal is by
-definition fully defined.
-
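-
-<p>As a purely illustrative aside, the folding rules can be pictured
-like this; the enum and names below are invented for the sketch and do
-not appear in Valgrind.
-<p>
-<pre>
-/* Purely illustrative; none of these names exist in Valgrind. */
-typedef enum { VS_UNKNOWN, VS_ALL_DEFINED } VState;
-
-/* A literal operand's shadow is all zeroes ("fully defined"), so for
-   an ADD-of-literal the chain  Left4(UifU4(q_lit, q_var))  folds down
-   to  Left4(q_var), and the UifU op can be deleted. */
-VState fold_UifU4 ( VState a, VState b )
-{
-   return (a == VS_ALL_DEFINED && b == VS_ALL_DEFINED)
-          ? VS_ALL_DEFINED : VS_UNKNOWN;
-}
-
-/* Left and the pessimising casts map an all-zeroes vector to an
-   all-zeroes vector, so "fully defined" propagates straight through. */
-VState fold_Left4   ( VState a ) { return a; }
-VState fold_PCast40 ( VState a ) { return a; }
-</pre>
-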
-
-vg_delete_redundant_SETVs removes SETVs on
-shadow TempRegs for which the next action is a write.
-I don't think there's anything else worth saying about this; it is
-simple. Read the sources for details.
-
-
-So the cleaned-up running example looks like this. As above, I have -inserted line breaks after every original (non-instrumentation) uinstr -to aid readability. As with straightforward ucode optimisation, the -results in this block are undramatic because it is so short; longer -blocks benefit more because they have more redundancy which gets -eliminated. - - -
-at 29: delete UifU1 due to defd arg1 -at 32: change ImproveAND1_TQ to MOV due to defd arg2 -at 41: delete SETV -at 31: delete MOV -at 25: delete SETV -at 22: delete SETV -at 7: delete SETV - - 0: GETVL %EDX, q0 - 1: GETL %EDX, t0 - - 2: TAG1o q0 = Left4 ( q0 ) - 3: INCL t0 - - 4: PUTVL q0, %EDX - 5: PUTL t0, %EDX - - 6: TESTVL q0 - 8: LOADVB (t0), q0 - 9: LDB (t0), t0 - - 10: TAG1o q0 = SWiden14 ( q0 ) - 11: WIDENL_Bs t0 - - 12: PUTVL q0, %EAX - 13: PUTL t0, %EAX - - 14: GETVL %ECX, q8 - 15: GETL %ECX, t8 - - 16: MOVL q0, q4 - 17: SHLL $0x1, q4 - 18: TAG2o q4 = UifU4 ( q8, q4 ) - 19: TAG1o q4 = Left4 ( q4 ) - 20: LEA2L 1(t8,t0,2), t4 - - 21: TESTVL q4 - 23: LOADVB (t4), q10 - 24: LDB (t4), t10 - - 26: MOVB $0x20, t12 - - 27: MOVL q10, q14 - 28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 ) - 30: TAG2o q10 = DifD1 ( q14, q10 ) - 32: MOVL t12, q14 - 33: TAG2o q10 = DifD1 ( q14, q10 ) - 34: MOVL q10, q16 - 35: TAG1o q16 = PCast10 ( q16 ) - 36: PUTVFo q16 - 37: ANDB t12, t10 (-wOSZACP) - - 38: INCEIPo $9 - 39: GETVFo q18 - 40: TESTVo q18 - 42: Jnzo $0x40435A50 (-rOSZACP) - - 43: JMPo $0x40435A5B -- - -
vg_from_ucode.c
-is a big file. Position-independent x86 code is generated into
-a dynamically allocated array emitted_code; this is
-doubled in size when it overflows. Eventually the array is handed
-back to the caller of VG_(translate), who must copy
-the result into TC and TT, and free the array.
-
-
-This file is structured into four layers of abstraction, which,
-thankfully, are glued back together with extensive
-__inline__ directives. From the bottom upwards:
-
-
emit_amode_regmem_reg et al.
--
emit_movv_offregmem_reg.
- The v suffix is Intel parlance for a 16/32 bit insn;
- there are also b suffixes for 8 bit insns.
--
synth_* functions, which
- synthesise possibly a sequence of raw x86 instructions to do some
- simple task. Some of these are quite complex because they have to
- work around Intel's silly restrictions on subregister naming. See
- synth_nonshiftop_reg_reg for example.
--
emitUInstr(),
- which emits code for a single uinstr.
--Some comments: -
FPU ucode instruction, we load the simulated FPU's
-  state from its VG_(baseBlock) into the real FPU
- using an x86 frstor insn, do the ucode
- FPU insn on the real CPU, and write the updated FPU
- state back into VG_(baseBlock) using an
- fnsave instruction. This is pretty brutal, but is
- simple and it works, and even seems tolerably efficient. There is
- no attempt to cache the simulated FPU state in the real FPU over
- multiple back-to-back ucode FPU instructions.
-
- FPU_R and FPU_W are also done this way,
- with the minor complication that we need to patch in some
- addressing mode bits so the resulting insn knows the effective
- address to use. This is easy because of the regularity of the x86
- FPU instruction encodings.
-
-
-  Some uinstrs indicate, via their <code>flags_r</code> and <code>flags_w</code> fields, that they
- read or write the simulated %EFLAGS. For such cases
- we first copy the simulated %EFLAGS into the real
- %eflags, then do the insn, then, if the insn says it
- writes the flags, copy back to %EFLAGS. This is a
- bit expensive, which is why the ucode optimisation pass goes to
- some effort to remove redundant flag-update annotations.
--</ul>
-And so ... that's the end of the documentation for the instrumenting
-translator!  It's really not that complex, because it's composed as a
-sequence of simple(ish) self-contained transformations on
-straight-line blocks of code.
-
-<p>
VG_(toploop). This is basically boring and
-unsurprising, not to mention fiddly and fragile. It needs to be
-cleaned up.
-
-
-The only surprise, perhaps, is that the whole thing is run
-on top of a setjmp-installed exception handler, because,
-supposing a translation got a segfault, we have to bail out of the
-Valgrind-supplied exception handler VG_(oursignalhandler)
-and immediately start running the client's segfault handler, if it has
-one. In particular we can't finish the current basic block and then
-deliver the signal at some convenient future point, because signals
-like SIGILL, SIGSEGV and SIGBUS mean that the faulting insn should not
-simply be re-tried. (I'm sure there is a clearer way to explain this).
-
-
-
%EIP is not updated after every simulated x86
-insn as this was regarded as too expensive. Instead ucode
-INCEIP insns move it along as and when necessary.
-Currently we don't allow it to fall more than 4 bytes behind reality
-(see VG_(disBB) for the way this works).
-
-Note that %EIP is always brought up to date by the inner
-dispatch loop in VG_(dispatch), so that if the client
-takes a fault we know at least which basic block this happened in.
-
-
-
vg_signals.c.
-Basically, since we have to intercept all system
-calls anyway, we can see when the client tries to install a signal
-handler. If it does so, we make a note of what the client asked to
-happen, and ask the kernel to route the signal to our own signal
-handler, VG_(oursignalhandler). This simply notes the
-delivery of signals, and returns.
-
-
-Every 1000 basic blocks, we see if more signals have arrived. If so,
-VG_(deliver_signals) builds signal delivery frames on the
-client's stack, and allows their handlers to be run. Valgrind places
-in these signal delivery frames a bogus return address,
-VG_(signalreturn_bogusRA), and checks all jumps to see
-if any jump to it.  If one does, this is a sign that a signal handler is
-returning, so Valgrind removes the relevant signal frame from
-the client's stack, restores, from the signal frame, the simulated
-state as it was before the signal was delivered, and allows the client to run
-onwards. We have to do it this way because some signal handlers never
-return, they just longjmp(), which nukes the signal
-delivery frame.
-
-
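-
-<p>The shape of the trick, in a deliberately simplified sketch; the
-names and the dispatcher hook are assumptions for illustration, not the
-code in <code>vg_signals.c</code>.
-<p>
-<pre>
-typedef unsigned long Addr;
-
-/* Stand-ins: the magic return address and the frame bookkeeping.
-   In Valgrind these are VG_(signalreturn_bogusRA) and code in
-   vg_signals.c; here they are just placeholders. */
-static Addr bogus_return_addr = 0xDEADBEEF;
-static void pop_signal_frame_and_restore ( void ) { /* unwind frame, restore state */ }
-
-/* Called by the dispatcher on every control-flow transfer: a "jump"
-   to the magic address means a client signal handler has returned. */
-int maybe_signal_return ( Addr jump_target )
-{
-   if (jump_target != bogus_return_addr)
-      return 0;                        /* ordinary control transfer */
-   pop_signal_frame_and_restore();     /* remove the delivery frame */
-   return 1;                           /* resume the interrupted client state */
-}
-</pre>
-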
-The Linux kernel has a different but equally horrible hack for -detecting signal handler returns. Discovering it is left as an -exercise for the reader. - - - -
VG_(primary_map) structure, whether or not
-accesses to the individual secondary maps need locking, what
-race-condition issues result, and whether the already-nasty mess that
-is the signal simulator needs further hackery.
-
--I realise that threads are the most-frequently-requested feature, and -I am thinking about it all. If you have guru-level understanding of -fast mutual exclusion mechanisms and race conditions, I would be -interested in hearing from you. - - -
tests/ contains various ad-hoc tests for
-Valgrind. However, there is no systematic verification or regression
-suite which, for example, exercises all the stuff in
-vg_memory.c, to ensure that illegal memory accesses and
-undefined value uses are detected as they should be. It would be good
-to have such a suite.
-
-
--The main difficulties, for an x86-ELF platform, seem to be: - -
/proc/self/maps parser
- (vg_procselfmaps.c).
- Easy.
--
vg_syscall_mem.c, or, more
- specifically, provide one for your OS. This is tedious, but you
- can implement syscalls on demand, and the Linux kernel interface
- is, for the most part, going to look very similar to the *BSD
- interfaces, so it's really a copy-paste-and-modify-on-demand job.
- As part of this, you'd need to supply a new
- vg_kerneliface.h file.
--
vg_mylibc.c.
--
vg_symtab2.c which reads "stabs" style
-debugging info is pretty weak. It usually correctly translates
-simulated program counter values into line numbers and procedure
-names, but the file name is often completely wrong. I think the
-logic used to parse "stabs" entries is weak. It should be fixed.
-The simplest solution, IMO, is to copy either the logic or simply the
-code out of GNU binutils which does this; since GDB can clearly get it
-right, binutils (or GDB?) must have code to do this somewhere.
-
-
-
-
-
-
-The incorrect instrumentation is due to use of helper functions. This
-means we lose bit-level definedness tracking, which could wind up
-giving spurious uninitialised-value use errors. The Right Thing to do
-is to invent a couple of new UOpcodes, I think GET_BIT
-and SET_BIT, which can be used to implement all 4 x86
-insns, get rid of the helpers, and give bit-accurate instrumentation
-rules for the two new UOpcodes.
-
-
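-
-<p>For illustration only, here is one plausible bit-accurate
-definedness rule for the register form of <code>bt</code>, of the kind
-a <code>GET_BIT</code> uinstr might implement.  This is a proposal
-sketch, not existing code.
-<p>
-<pre>
-typedef unsigned int UInt;
-
-/* V bit (0 = defined, 1 = undefined) for the carry flag produced by
-   "bt value, index", given the V bits of both operands. */
-UInt bt_cf_vbit ( UInt q_val, UInt t_ix, UInt q_ix )
-{
-   if ((q_ix & 31) != 0)
-      return 1;                       /* the bit index itself is (partly) undefined */
-   return (q_val >> (t_ix & 31)) & 1; /* CF inherits the selected bit's V bit */
-}
-</pre>
-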
-I realised the other day that they are mis-implemented too. The x86 -insns take a bit-index and a register or memory location to access. -For registers the bit index clearly can only be in the range zero to -register-width minus 1, and I assumed the same applied to memory -locations too. But evidently not; for memory locations the index can -be arbitrary, and the processor will index arbitrarily into memory as -a result. This too should be fixed. Sigh. Presumably indexing -outside the immediate word is not actually used by any programs yet -tested on Valgrind, for otherwise they (presumably) would simply not -work at all. If you plan to hack on this, first check the Intel docs -to make sure my understanding is really correct. - - - -
VG_(primary_map), one of the resulting
-secondary, and the original. Not to mention, the instrumented
-translations are 13 to 14 times larger than the originals. All in all
-one would expect the memory system to be hammered to hell and then
-some.
-
--So here's an idea. An x86 insn involving a read from memory, after -instrumentation, will turn into ucode of the following form: -
- ... calculate effective addr, into ta and qa ... - TESTVL qa -- is the addr defined? - LOADV (ta), qloaded -- fetch V bits for the addr - LOAD (ta), tloaded -- do the original load --At the point where the
LOADV is done, we know the actual
-address (ta) from which the real LOAD will
-be done. We also know that the LOADV will take around
-20 x86 insns to do. So it seems plausible that doing a prefetch of
-ta just before the LOADV might just avoid a
-miss at the LOAD point, and that might be a significant
-performance win.
-
-
-Prefetch insns are notoriously temperamental, more often than not
-making things worse rather than better, so this would require
-considerable fiddling around. It's complicated because Intels and
-AMDs have different prefetch insns with different semantics, so that
-too needs to be taken into account. As a general rule, even placing
-the prefetches before the LOADV insn is too near the
-LOAD; the ideal distance is apparently circa 200 CPU
-cycles. So it might be worth having another analysis/transformation
-pass which pushes prefetches as far back as possible, hopefully
-immediately after the effective address becomes available.
-
-
-Doing too many prefetches is also bad because they soak up bus
-bandwidth / cpu resources, so some cleverness in deciding which loads
-to prefetch and which to not might be helpful. One can imagine not
-prefetching client-stack-relative (%EBP or
-%ESP) accesses, since the stack in general tends to show
-good locality anyway.
-
-
-There's quite a lot of experimentation to do here, but I think it -might make an interesting week's work for someone. - -
-As of 15-ish March 2002, I've started to experiment with this, using
-the AMD prefetch/prefetchw insns.
-
-
-
-
-The presentation falls into two pieces. - -
-Part 1: user-defined address-range permission setting -
-
-Valgrind intercepts the client's malloc,
-free, etc calls, watches system calls, and watches the
-stack pointer move. This is currently the only way it knows about
-which addresses are valid and which not. Sometimes the client program
-knows extra information about its memory areas. For example, the
-client could at some point know that all elements of an array are
-out-of-date. We would like to be able to convey to Valgrind this
-information that the array is now addressable-but-uninitialised, so
-that Valgrind can then warn if elements are used before they get new
-values.
-
-
-What I would like are some macros like this: -
- VALGRIND_MAKE_NOACCESS(addr, len) - VALGRIND_MAKE_WRITABLE(addr, len) - VALGRIND_MAKE_READABLE(addr, len) --and also, to check that memory is addressible/initialised, -
- VALGRIND_CHECK_ADDRESSIBLE(addr, len) - VALGRIND_CHECK_INITIALISED(addr, len) -- -
-I then include in my sources a header defining these macros, rebuild -my app, run under Valgrind, and get user-defined checks. - -
-Now here's a neat trick. It's a nuisance to have to re-link the app -with some new library which implements the above macros. So the idea -is to define the macros so that the resulting executable is still -completely stand-alone, and can be run without Valgrind, in which case -the macros do nothing, but when run on Valgrind, the Right Thing -happens. How to do this? The idea is for these macros to turn into a -piece of inline assembly code, which (1) has no effect when run on the -real CPU, (2) is easily spotted by Valgrind's JITter, and (3) no sane -person would ever write, which is important for avoiding false matches -in (2). So here's a suggestion: -
- VALGRIND_MAKE_NOACCESS(addr, len) --becomes (roughly speaking) -
- movl addr, %eax - movl len, %ebx - movl $1, %ecx -- 1 describes the action; MAKE_WRITABLE might be - -- 2, etc - rorl $13, %ecx - rorl $19, %ecx - rorl $11, %eax - rorl $21, %eax --The rotate sequences have no effect, and it's unlikely they would -appear for any other reason, but they define a unique byte-sequence -which the JITter can easily spot. Using the operand constraints -section at the end of a gcc inline-assembly statement, we can tell gcc -that the assembly fragment kills
%eax, %ebx,
-%ecx and the condition codes, so this fragment is made
-harmless when not running on Valgrind, runs quickly when not on
-Valgrind, and does not require any other library support.
-
-
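-
-<p>Just to make the recipe concrete, here is a hedged sketch of how
-such a macro might look using gcc inline assembly on x86.  The register
-choice and rotate counts follow the fragment above, the request numbers
-are arbitrary, and a real implementation would need more care (for
-example with <code>%ebx</code> under <code>-fPIC</code>).
-<p>
-<pre>
-/* Sketch only: a no-op rotate sequence which a Valgrind JITter could
-   recognise.  On a real CPU it has no effect beyond scratching the
-   clobbered registers and flags. */
-#define VALGRIND_MAGIC_SEQUENCE(request, _addr, _len)              \
-   __asm__ __volatile__(                                           \
-      "movl %0, %%eax\n\t"                                         \
-      "movl %1, %%ebx\n\t"                                         \
-      "movl %2, %%ecx\n\t"                                         \
-      "rorl $13, %%ecx\n\t"  /* 13+19 = 32: %ecx ends up unchanged */ \
-      "rorl $19, %%ecx\n\t"                                        \
-      "rorl $11, %%eax\n\t"  /* 11+21 = 32: %eax ends up unchanged */ \
-      "rorl $21, %%eax"                                            \
-      : /* no outputs */                                           \
-      : "r" (_addr), "r" (_len), "i" (request)                     \
-      : "eax", "ebx", "ecx", "cc", "memory" )
-
-#define VALGRIND_MAKE_NOACCESS(addr, len)  VALGRIND_MAGIC_SEQUENCE(1, (addr), (len))
-#define VALGRIND_MAKE_WRITABLE(addr, len)  VALGRIND_MAGIC_SEQUENCE(2, (addr), (len))
-#define VALGRIND_MAKE_READABLE(addr, len)  VALGRIND_MAGIC_SEQUENCE(3, (addr), (len))
-</pre>
-
-<p>On a real CPU this compiles to a handful of harmless rotates; under
-Valgrind the JITter would spot the byte sequence and act on the request.
-<p>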
--Part 2: using it to detect interference between stack variables -
- -Currently Valgrind cannot detect errors of the following form: -
-void fooble ( void )
-{
- int a[10];
- int b[10];
- a[10] = 99;
-}
-
-Now imagine rewriting this as
-
-void fooble ( void )
-{
- int spacer0;
- int a[10];
- int spacer1;
- int b[10];
- int spacer2;
- VALGRIND_MAKE_NOACCESS(&spacer0, sizeof(int));
- VALGRIND_MAKE_NOACCESS(&spacer1, sizeof(int));
- VALGRIND_MAKE_NOACCESS(&spacer2, sizeof(int));
- a[10] = 99;
-}
-
-Now the invalid write is certain to hit spacer0 or
-spacer1, so Valgrind will spot the error.
-
--There are two complications. - -
-
-The first is that we don't want to annotate sources by hand, so the
-Right Thing to do is to write a C/C++ parser, annotator, prettyprinter
-which does this automatically, and run it on post-CPP'd C/C++ source.
-See http://www.cacheprof.org for an example of a system which
-transparently inserts another phase into the gcc/g++ compilation
-route.  The parser/prettyprinter is probably not as hard as it sounds;
-I would write it in Haskell, a powerful functional language well
-suited to doing symbolic computation, with which I am intimately
-familiar.  There is already a C parser written in Haskell by someone in
-the Haskell community, and that would probably be a good starting
-point.
-
-The second complication is how to get rid of these
-NOACCESS records inside Valgrind when the instrumented
-function exits; after all, these refer to stack addresses and will
-make no sense whatever when some other function happens to re-use the
-same stack address range, probably shortly afterwards. I think I
-would be inclined to define a special stack-specific macro
-
- VALGRIND_MAKE_NOACCESS_STACK(addr, len) --which causes Valgrind to record the client's
%ESP at the
-time it is executed. Valgrind will then watch for changes in
-%ESP and discard such records as soon as the protected
-area is uncovered by an increase in %ESP. I hesitate
-with this scheme only because it is potentially expensive, if there
-are hundreds of such records, and considering that changes in
-%ESP already require expensive messing with stack access
-permissions.
-
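-
-<p>A sketch of the bookkeeping this would need is below; the record
-layout, limit and names are invented for illustration, not a real
-Valgrind interface.
-<p>
-<pre>
-typedef unsigned int  UInt;
-typedef unsigned long Addr;
-
-#define MAX_STACK_NOACCESS 1000
-
-typedef struct {
-   Addr start;       /* protected range is [start, start+len) */
-   UInt len;
-   Addr esp_then;    /* simulated %ESP when the request was made */
-} StackNoAccess;
-
-static StackNoAccess records[MAX_STACK_NOACCESS];
-static int           n_records = 0;
-
-/* Stand-in for the call that restores normal permissions. */
-static void make_writable ( Addr a, UInt len ) { (void)a; (void)len; }
-
-/* To be called whenever the simulated %ESP increases.  Any record
-   whose range now lies below the stack pointer belongs to a dead
-   frame, so drop it and re-expose the bytes it covered. */
-void discard_stale_records ( Addr new_esp )
-{
-   int i = 0;
-   while (i < n_records) {
-      if (records[i].start + records[i].len <= new_esp) {
-         make_writable(records[i].start, records[i].len);
-         records[i] = records[--n_records];    /* unordered delete */
-      } else {
-         i++;
-      }
-   }
-}
-</pre>
-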
--This is probably easier and more robust than for the instrumenter -program to try and spot all exit points for the procedure and place -suitable deallocation annotations there. Plus C++ procedures can -bomb out at any point if they get an exception, so spotting return -points at the source level just won't work at all. - -
-Although some work, it's all eminently doable, and it would make -Valgrind into an even-more-useful tool. - - -
- - -
LOAD, STORE,
-FPU_R and FPU_W. By contrast, because of the x86
-addressing modes, almost every instruction can read or write memory.
-
-Most of the cache profiling machinery is in the file
-vg_cachesim.c.
- -These notes are a somewhat haphazard guide to how Valgrind's cache profiling -works.
- -
-There are two kinds of cost centre: one for instructions that don't
-reference memory (<code>iCC</code>), and one for instructions that do
-(idCC):
-
-
-typedef struct _CC {
- ULong a;
- ULong m1;
- ULong m2;
-} CC;
-
-typedef struct _iCC {
- /* word 1 */
- UChar tag;
- UChar instr_size;
-
- /* words 2+ */
- Addr instr_addr;
- CC I;
-} iCC;
-
-typedef struct _idCC {
- /* word 1 */
- UChar tag;
- UChar instr_size;
- UChar data_size;
-
- /* words 2+ */
- Addr instr_addr;
- CC I;
- CC D;
-} idCC;
-
-
-Each CC has three fields a, m1,
-m2 for recording references, level 1 misses and level 2 misses.
-Each of these is a 64-bit ULong -- the numbers can get very large,
-ie. greater than the 4.2 billion allowed by a 32-bit unsigned int.
-
-A iCC has one CC for instruction cache accesses. A
-idCC has two, one for instruction cache accesses, and one for data
-cache accesses.
-
-The iCC and dCC structs also store unchanging
-information about the instruction:
-
-<ul>
-  <li>instruction address</li>
-  <li>instruction size</li>
-  <li>data size (<code>idCC</code> only)</li>
-</ul>
-
-Note that the address of the data accessed is not stored in the
-<code>idCC</code>.  This is
-because for many memory-referencing instructions the data address can change
-each time it's executed (eg. if it uses register-offset addressing). We have
-to give this item to the cache simulation in a different way (see
-Instrumentation section below). Some memory-referencing instructions do always
-reference the same address, but we don't try to treat them specially in order to
-keep things simple.
-
-Also note that there is only room for recording info about one data cache
-access in an idCC. So what about instructions that do a read then
-a write, such as:
-
-
inc %(esi)
-
-In a write-allocate cache, as simulated by Valgrind, the write cannot miss,
-since it immediately follows the read which will drag the block into the cache
-if it's not already there. So the write access isn't really interesting, and
-Valgrind doesn't record it. This means that Valgrind doesn't measure
-memory references, but rather memory references that could miss in the cache.
-This behaviour is the same as that used by the AMD Athlon hardware counters.
-It also has the benefit of simplifying the implementation -- instructions that
-read and write memory can be treated like instructions that read memory.- -
- -Valgrind does JIT translations at the basic block level, and cost centres are -also setup and stored at the basic block level. By doing things carefully, we -store all the cost centres for a basic block in a contiguous array, and lookup -comes almost for free.
- -Consider this part of a basic block (for exposition purposes, pretend it's an -entire basic block): - -
-movl $0x0,%eax -movl $0x99, -4(%ebp) -- -The translation to UCode looks like this: - -
-MOVL $0x0, t20 -PUTL t20, %EAX -INCEIPo $5 - -LEA1L -4(t4), t14 -MOVL $0x99, t18 -STL t18, (t14) -INCEIPo $7 -- -The first step is to allocate the cost centres. This requires a preliminary -pass to count how many x86 instructions were in the basic block, and their -types (and thus sizes). UCode translations for single x86 instructions are -delimited by the
INCEIPo instruction, the argument of which gives
-the byte size of the instruction (note that lazy INCEIP updating is turned off
-to allow this).
-
-We can tell if an x86 instruction references memory by looking for
-LDL and STL UCode instructions, and thus what kind of
-cost centre is required. From this we can determine how many cost centres we
-need for the basic block, and their sizes. We can then allocate them in a
-single array.
-
-Consider the example code above. After the preliminary pass, we know we need
-two cost centres, one <code>iCC</code> and one <code>idCC</code>.  So we
-allocate an array to store these which looks like this:
-
-
-|(uninit)| tag (1 byte) -|(uninit)| instr_size (1 bytes) -|(uninit)| (padding) (2 bytes) -|(uninit)| instr_addr (4 bytes) -|(uninit)| I.a (8 bytes) -|(uninit)| I.m1 (8 bytes) -|(uninit)| I.m2 (8 bytes) - -|(uninit)| tag (1 byte) -|(uninit)| instr_size (1 byte) -|(uninit)| data_size (1 byte) -|(uninit)| (padding) (1 byte) -|(uninit)| instr_addr (4 bytes) -|(uninit)| I.a (8 bytes) -|(uninit)| I.m1 (8 bytes) -|(uninit)| I.m2 (8 bytes) -|(uninit)| D.a (8 bytes) -|(uninit)| D.m1 (8 bytes) -|(uninit)| D.m2 (8 bytes) -- -(We can see now why we need tags to distinguish between the two types of cost -centres.)
- -We also record the size of the array. We look up the debug info of the first -instruction in the basic block, and then stick the array into a table indexed -by filename and function name. This makes it easy to dump the information -quickly to file at the end.
- -
-
-
-|INSTR_CC| tag (1 byte) -|5 | instr_size (1 bytes) -|(uninit)| (padding) (2 bytes) -|i_addr1 | instr_addr (4 bytes) -|0 | I.a (8 bytes) -|0 | I.m1 (8 bytes) -|0 | I.m2 (8 bytes) - -|WRITE_CC| tag (1 byte) -|7 | instr_size (1 byte) -|4 | data_size (1 byte) -|(uninit)| (padding) (1 byte) -|i_addr2 | instr_addr (4 bytes) -|0 | I.a (8 bytes) -|0 | I.m1 (8 bytes) -|0 | I.m2 (8 bytes) -|0 | D.a (8 bytes) -|0 | D.m1 (8 bytes) -|0 | D.m2 (8 bytes) -- -(Note that this step is not performed if a basic block is re-translated; see -here for more information.)
-
-GCC inserts padding before the <code>instr_addr</code> field so that it is word
-aligned.
- -The instrumentation added to call the cache simulation function looks like this -(instrumentation is indented to distinguish it from the original UCode): - -
-MOVL $0x0, t20 -PUTL t20, %EAX - PUSHL %eax - PUSHL %ecx - PUSHL %edx - MOVL $0x4091F8A4, t46 # address of 1st CC - PUSHL t46 - CALLMo $0x12 # second cachesim function - CLEARo $0x4 - POPL %edx - POPL %ecx - POPL %eax -INCEIPo $5 - -LEA1L -4(t4), t14 -MOVL $0x99, t18 - MOVL t14, t42 -STL t18, (t14) - PUSHL %eax - PUSHL %ecx - PUSHL %edx - PUSHL t42 - MOVL $0x4091F8C4, t44 # address of 2nd CC - PUSHL t44 - CALLMo $0x13 # second cachesim function - CLEARo $0x8 - POPL %edx - POPL %ecx - POPL %eax -INCEIPo $7 -- -Consider the first instruction's UCode. Each call is surrounded by three -
PUSHL and POPL instructions to save and restore the
-caller-save registers. Then the address of the instruction's cost centre is
-pushed onto the stack, to be the first argument to the cache simulation
-function. The address is known at this point because we are doing a
-simultaneous pass through the cost centre array. This means the cost centre
-lookup for each instruction is almost free (just the cost of pushing an
-argument for a function call). Then the call to the cache simulation function
-for non-memory-reference instructions is made (note that the
-CALLMo UInstruction takes an offset into a table of predefined
-functions; it is not an absolute address), and the single argument is
-CLEARed from the stack.
-
-The second instruction's UCode is similar. The only difference is that, as
-mentioned before, we have to pass the address of the data item referenced to
-the cache simulation function too. This explains the MOVL t14,
-t42 and PUSHL t42 UInstructions. (Note that the seemingly
-redundant MOVing will probably be optimised away during register
-allocation.)
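-
-<p>For orientation, here is an illustrative sketch of what a simulation
-function reached through <code>CALLMo</code> might do with the cost
-centre address it is passed.  The function and cache-model names here
-are assumptions, not the contents of
-<code>vg_cachesim_{I1,D1,L2}.c</code>.
-<p>
-<pre>
-typedef unsigned char      UChar;
-typedef unsigned long      Addr;
-typedef unsigned long long ULong;
-
-typedef struct { ULong a; ULong m1; ULong m2; } CC;
-
-typedef struct {
-   UChar tag, instr_size, data_size;
-   Addr  instr_addr;
-   CC    I;
-   CC    D;
-} idCC;
-
-/* Stand-ins for the real I1/D1/L2 simulators: return nonzero on a miss. */
-static int I1_miss ( Addr a, int size ) { (void)a; (void)size; return 0; }
-static int D1_miss ( Addr a, int size ) { (void)a; (void)size; return 0; }
-static int L2_miss ( Addr a, int size ) { (void)a; (void)size; return 0; }
-
-/* Called once per memory-referencing x86 instruction: the cost centre
-   address was pushed by the instrumentation; the data address arrives
-   via a TempReg, since it can differ on every execution. */
-void log_instr_and_data_ref ( idCC* cc, Addr data_addr )
-{
-   cc->I.a++;                                          /* I-cache reference */
-   if (I1_miss(cc->instr_addr, cc->instr_size)) {
-      cc->I.m1++;
-      if (L2_miss(cc->instr_addr, cc->instr_size)) cc->I.m2++;
-   }
-   cc->D.a++;                                          /* D-cache reference */
-   if (D1_miss(data_addr, cc->data_size)) {
-      cc->D.m1++;
-      if (L2_miss(data_addr, cc->data_size)) cc->D.m2++;
-   }
-}
-</pre>
-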
- -Note that instead of storing unchanging information about each instruction -(instruction size, data size, etc) in its cost centre, we could have passed in -these arguments to the simulation function. But this would slow the calls down -(two or three extra arguments pushed onto the stack). Also it would bloat the -UCode instrumentation by amounts similar to the space required for them in the -cost centre; bloated UCode would also fill the translation cache more quickly, -requiring more translations for large programs and slowing them down more.
- -However, we can't use this approach for profiling -- we can't throw away cost -centres for instructions in the middle of execution! So when a basic block is -translated, we first look for its cost centre array in the hash table. If -there is no cost centre array, it must be the first translation, so we proceed -as described above. But if there is a cost centre array already, it must be a -retranslation. In this case, we skip the cost centre allocation and -initialisation steps, but still do the UCode instrumentation step.
- -
-
-The interface to the simulation is quite clean. The functions called from the
-UCode contain calls to the simulation functions in the files
-vg_cachesim_{I1,D1,L2}.c; these calls are inlined so that only
-one function call is done per simulated x86 instruction. The file
-vg_cachesim.c simply #includes the three files
-containing the simulation, which makes plugging in new cache simulations
-very easy -- you just replace the three files and recompile.
- -
- -Input file has the following format: - -
-file ::= desc_line* cmd_line events_line data_line+ summary_line
-desc_line ::= "desc:" ws? non_nl_string
-cmd_line ::= "cmd:" ws? cmd
-events_line ::= "events:" ws? (event ws)+
-data_line ::= file_line | fn_line | count_line
-file_line ::= ("fl=" | "fi=" | "fe=") filename
-fn_line ::= "fn=" fn_name
-count_line ::= line_num ws? (count ws)+
-summary_line ::= "summary:" ws? (count ws)+
-count ::= num | "."
-
-
-Where:
-
-non_nl_string is any string not containing a newline.-
cmd is a command line invocation.-
filename and fn_name can be anything.-
num and line_num are decimal numbers.-
ws is whitespace.-
nl is a newline.-
- -Counts can be "." to represent "N/A", eg. the number of write misses for an -instruction that doesn't write to memory.
-
-The number of counts in each line and the
-summary_line should not exceed the number of events in the
-event_line. If the number in each line is less,
-vg_annotate treats those missing as though they were a "." entry.
-
-A file_line changes the current file name. A fn_line
-changes the current function name. A count_line contains counts
-that pertain to the current filename/fn_name.  A "fl=" <code>file_line</code>
-and a fn_line must appear before any count_lines to
-give the context of the first count_lines.
-
-Each file_line should be immediately followed by a
-fn_line. "fi=" file_lines are used to switch
-filenames for inlined functions; "fe=" file_lines are similar, but
-are put at the end of a basic block in which the file name hasn't been switched
-back to the original file name. (fi and fe lines behave the same, they are
-only distinguished to help debugging.)
- - -
- -
- -
- -
- -
cachegrind.out output files can contain huge amounts of
-  information; the file format was carefully chosen to minimise file
- sizes.-
-
-In particular, vg_annotate would not need to change -- the file format is such
-that it is not specific to the cache simulation, but could be used for any kind
-of line-by-line information. The only part of vg_annotate that is specific to
-the cache simulation is the name of the input file
-(cachegrind.out), although it would be very simple to add an
-option to control this.
- - -