diff --git a/docs/Makefile.am b/docs/Makefile.am
index e8a58fa18..39b9008b6 100644
--- a/docs/Makefile.am
+++ b/docs/Makefile.am
@@ -1,5 +1,5 @@
 docdir = $(datadir)/doc/valgrind
-doc_DATA = index.html manual.html nav.html techdocs.html
+doc_DATA = index.html
 
 EXTRA_DIST = $(doc_DATA)
diff --git a/docs/index.html b/docs/index.html
index 111170256..d4db7c868 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -1,26 +1,33 @@
-
+
+- -
-Valgrind is licensed under the GNU General Public License,
-version 2
-An open-source tool for finding memory-management problems in
-Linux-x86 executables.
-
- -
-Valgrind is closely tied to details of the CPU, operating system and
-to a lesser extent, compiler and basic C libraries. This makes it
-difficult to port, so I have chosen at the outset to
-concentrate on what I believe to be a widely used platform: Linux on
-x86s. Valgrind uses the standard Unix ./configure,
-make, make install mechanism, and I have
-attempted to ensure that it works on machines with kernel 2.2 or 2.4
-and glibc 2.1.X or 2.2.X. This should cover the vast majority of
-modern Linux installations.
-
-
-
-Valgrind is licensed under the GNU General Public License, version
-2. Read the file LICENSE in the source distribution for details. Some
-of the PThreads test cases, test/pth_*.c, are taken from
-"Pthreads Programming" by Bradford Nichols, Dick Buttlar & Jacqueline
-Proulx Farrell, ISBN 1-56592-115-1, published by O'Reilly &
-Associates, Inc.
-
-
-
-To use it, you don't need to recompile or relink anything; just put the word
valgrind at the start of the command line
-normally used to run the program. So, for example, if you want to run
-the command ls -l on Valgrind, simply issue the
-command: valgrind ls -l.
-
-Valgrind takes control of your program before it starts. Debugging -information is read from the executable and associated libraries, so -that error messages can be phrased in terms of source code -locations. Your program is then run on a synthetic x86 CPU which -checks every memory access. All detected errors are written to a -log. When the program finishes, Valgrind searches for and reports on -leaked memory. - -
You can run pretty much any dynamically linked ELF x86 executable -using Valgrind. Programs run 25 to 50 times slower, and take a lot -more memory, than they usually would. It works well enough to run -large programs. For example, the Konqueror web browser from the KDE -Desktop Environment, version 3.0, runs slowly but usably on Valgrind. - -
Valgrind simulates every single instruction your program executes.
-Because of this, it finds errors not only in your application but also
-in all supporting dynamically-linked (.so-format)
-libraries, including the GNU C library, the X client libraries, Qt, if
-you work with KDE, and so on. That often includes libraries, for
-example the GNU C library, which contain memory access violations, but
-which you cannot or do not want to fix.
-
-
Rather than swamping you with errors in which you are not -interested, Valgrind allows you to selectively suppress errors, by -recording them in a suppressions file which is read when Valgrind -starts up. The build mechanism attempts to select suppressions which -give reasonable behaviour for the libc and XFree86 versions detected -on your machine. - - -
Section 6 shows an example of use. -
-
-g flag). You don't have to
-do this, but doing so helps Valgrind produce more accurate and less
-confusing error reports. Chances are you're set up like this already,
-if you intended to debug your program with GNU gdb, or some other
-debugger.
-
-
-A plausible compromise is to use -g -O.
-Optimisation levels above -O have been observed, on very
-rare occasions, to cause gcc to generate code which fools Valgrind's
-error tracking machinery into wrongly reporting uninitialised value
-errors. -O gets you the vast majority of the benefits of
-higher optimisation levels anyway, so you don't lose much there.
-
-
-Valgrind understands both the older "stabs" debugging format, used by -gcc versions prior to 3.1, and the newer DWARF2 format used by gcc 3.1 -and later. - -
-Then just run your application, but place the word
-valgrind in front of your usual command-line invocation.
-Note that you should run the real (machine-code) executable here. If
-your application is started by, for example, a shell or perl script,
-you'll need to modify it to invoke Valgrind on the real executables.
-Running such scripts directly under Valgrind will result in you
-getting error reports pertaining to /bin/sh,
-/usr/bin/perl, or whatever interpreter you're using.
-This almost certainly isn't what you want and can be confusing.
-
-
-
All lines in the commentary are of the following form:
-
- ==12345== some-message-from-Valgrind --
The 12345 is the process ID. This scheme makes it easy
-to distinguish program output from Valgrind commentary, and also easy
-to differentiate commentaries from different processes which have
-become merged together, for whatever reason.
-
-
By default, Valgrind writes only essential messages to the commentary,
-so as to avoid flooding you with information of secondary importance.
-If you want more information about what is happening, re-run, passing
-the -v flag to Valgrind.
-
-
-
-
- ==25832== Invalid read of size 4 - ==25832== at 0x8048724: BandMatrix::ReSize(int, int, int) (bogon.cpp:45) - ==25832== by 0x80487AF: main (bogon.cpp:66) - ==25832== by 0x40371E5E: __libc_start_main (libc-start.c:129) - ==25832== by 0x80485D1: (within /home/sewardj/newmat10/bogon) - ==25832== Address 0xBFFFF74C is not stack'd, malloc'd or free'd -- -
This message says that the program did an illegal 4-byte read of
-address 0xBFFFF74C, which, as far as it can tell, is not a valid stack
-address, nor does it correspond to any currently malloc'd or free'd block.
-The read is happening at line 45 of bogon.cpp, called
-from line 66 of the same file, etc. For errors associated with an
-identified malloc'd/free'd block, for example reading free'd memory,
-Valgrind reports not only the location where the error happened, but
-also where the associated block was malloc'd/free'd.
-
-
Valgrind remembers all error reports. When an error is detected, -it is compared against old reports, to see if it is a duplicate. If -so, the error is noted, but no further commentary is emitted. This -avoids you being swamped with bazillions of duplicate error reports. - -
If you want to know how many times each error occurred, run with
-the -v option. When execution finishes, all the reports
-are printed out, along with, and sorted by, their occurrence counts.
-This makes it easy to see which errors have occurred most frequently.
-
-
Errors are reported before the associated operation actually -happens. For example, if your program decides to read from address -zero, Valgrind will emit a message to this effect, and the program -will then duly die with a segmentation fault. - -
In general, you should try and fix errors in the order that they -are reported. Not doing so can be confusing. For example, a program -which copies uninitialised values to several memory locations, and -later uses them, will generate several error messages. The first such -error message may well give the most direct clue to the root cause of -the problem. - -
The process of detecting duplicate errors is quite an expensive
-one and can become a significant performance overhead if your program
-generates huge quantities of errors. To avoid serious problems here,
-Valgrind will simply stop collecting errors after 300 different errors
-have been seen, or 30000 errors in total have been seen. In this
-situation you might as well stop your program and fix it, because
-Valgrind won't tell you anything else useful after this. Note that
-the 300/30000 limits apply after suppressed errors are removed. These
-limits are defined in vg_include.h and can be increased
-if necessary.
-
-
To avoid this cutoff you can use the
---error-limit=no flag. Then valgrind will always show
-errors, regardless of how many there are. Use this flag carefully,
-since it may have a dire effect on performance.
-
-
-
-
./configure script.
-
-You can modify and add to the suppressions file at your leisure, -or, better, write your own. Multiple suppression files are allowed. -This is useful if part of your project contains errors you can't or -don't want to fix, yet you don't want to continuously be reminded of -them. - -
Each error to be suppressed is described very specifically, to -minimise the possibility that a suppression-directive inadvertently -suppresses a bunch of similar errors which you did want to see. The -suppression mechanism is designed to allow precise yet flexible -specification of errors to suppress. - -
If you use the -v flag, at the end of execution, Valgrind
-prints out one line for each used suppression, giving its name and the
-number of times it got used. Here are the suppressions used by a run of
-ls -l:
-
- --27579-- supp: 1 socketcall.connect(serv_addr)/__libc_connect/__nscd_getgrgid_r - --27579-- supp: 1 socketcall.connect(serv_addr)/__libc_connect/__nscd_getpwuid_r - --27579-- supp: 6 strrchr/_dl_map_object_from_fd/_dl_map_object -- - -
- valgrind [options-for-Valgrind] your-prog [options for your-prog] -- -
Note that Valgrind also reads options from the environment variable
-$VALGRIND_OPTS, and processes them before the command-line
-options.
-
-
Valgrind's default settings succeed in giving reasonable behaviour -in most cases. Available options, in no particular order, are as -follows: -
--help--versionThe usual deal.
- -
-v --verboseBe more verbose. Gives extra information on various aspects - of your program, such as: the shared objects loaded, the - suppressions used, the progress of the instrumentation engine, - and warnings about unusual behaviour. -
- -
-q --quietRun silently, and only print error messages. Useful if you - are running regression tests or have some other automated test - machinery. -
- -
--demangle=no--demangle=yes [the default]
- Disable/enable automatic demangling (decoding) of C++ names. - Enabled by default. When enabled, Valgrind will attempt to - translate encoded C++ procedure names back to something - approaching the original. The demangler handles symbols mangled - by g++ versions 2.X and 3.X. - -
An important fact about demangling is that function - names mentioned in suppressions files should be in their mangled - form. Valgrind does not demangle function names when searching - for applicable suppressions, because to do otherwise would make - suppressions file contents dependent on the state of Valgrind's - demangling machinery, and would also be slow and pointless. -
- -
--num-callers=<number> [default=4]By default, Valgrind shows four levels of function call names - to help you identify program locations. You can change that - number with this option. This can help in determining the - program's location in deeply-nested call chains. Note that errors - are commoned up using only the top three function locations (the - place in the current function, and that of its two immediate - callers). So this doesn't affect the total number of errors - reported. -
- The maximum value for this is 50. Note that higher settings - will make Valgrind run a bit more slowly and take a bit more - memory, but can be useful when working with programs with - deeply-nested call chains. -
- -
--gdb-attach=no [the default]--gdb-attach=yes
- When enabled, Valgrind will pause after every error shown,
- and print the line
-
- ---- Attach to GDB ? --- [Return/N/n/Y/y/C/c] ----
-
- Pressing Ret, or N Ret
- or n Ret, causes Valgrind not to
- start GDB for this error.
-
- Y Ret
- or y Ret causes Valgrind to
- start GDB, for the program at this point. When you have
- finished with GDB, quit from it, and the program will continue.
- Trying to continue from inside GDB doesn't work.
-
- C Ret
- or c Ret causes Valgrind not to
- start GDB, and not to ask again.
-
- --gdb-attach=yes conflicts with
- --trace-children=yes. You can't use them together.
- Valgrind refuses to start up in this situation. 1 May 2002:
- this is a historical relic which could be easily fixed if it
- gets in your way. Mail me and complain if this is a problem for
- you.
- -
--partial-loads-ok=yes [the default]--partial-loads-ok=no
- Controls how Valgrind handles word (4-byte) loads from
- addresses for which some bytes are addressable and others
- are not. When yes (the default), such loads
- do not elicit an address error. Instead, the loaded V bytes
- corresponding to the illegal addresses indicate undefined data, and
- those corresponding to legal addresses are loaded from shadow
- memory, as usual.
-
- When no, loads from partially
- invalid addresses are treated the same as loads from completely
- invalid addresses: an illegal-address error is issued,
- and the resulting V bytes indicate valid data.
-
- -
--sloppy-malloc=no [the default]--sloppy-malloc=yes
- When enabled, all requests for malloc/calloc are rounded up - to a whole number of machine words -- in other words, made - divisible by 4. For example, a request for 17 bytes of space - would result in a 20-byte area being made available. This works - around bugs in sloppy libraries which assume that they can - safely rely on malloc/calloc requests being rounded up in this - fashion. Without the workaround, these libraries tend to - generate large numbers of errors when they access the ends of - these areas. -
- Valgrind snapshots dated 17 Feb 2002 and later are
- cleverer about this problem, and you should no longer need to
- use this flag. To put it bluntly, if you do need to use this
- flag, your program violates the ANSI C semantics defined for
- malloc and free, even if it appears to
- work correctly, and you should fix it, at least if you hope for
- maximum portability.
-
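The rounding described above is the usual align-up-to-a-word computation; a minimal sketch of it (illustrative only, not Valgrind's actual code):

   #include <stddef.h>

   /* Round a request up to a whole number of 4-byte machine words,
      as --sloppy-malloc=yes does: 17 -> 20, 16 -> 16. */
   static size_t round_up_to_word ( size_t nbytes )
   {
      return (nbytes + 3) & ~(size_t)3;
   }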
- -
--alignment=<number> [default: 4]By
- default valgrind's malloc, realloc,
- etc, return 4-byte aligned addresses. These are suitable for
- any accesses on x86 processors.
- Some programs might however assume that malloc et
- al return memory aligned to 8 bytes or more.
- These programs are broken and should be fixed, but
- if this is impossible for whatever reason the alignment can be
- increased using this parameter. The supplied value must be
- between 4 and 4096 inclusive, and must be a power of two.
- -
--trace-children=no [the default]--trace-children=yes
- When enabled, Valgrind will trace into child processes. This
- is confusing and usually not what you want, so is disabled by
- default. As of 1 May 2002, tracing into a child process from a
- parent which uses libpthread.so is probably broken
- and likely to cause problems. Please report any such
- problems to me.
- -
--freelist-vol=<number> [default: 1000000]
- When the client program releases memory using free (in C) or - delete (C++), that memory is not immediately made available for - re-allocation. Instead it is marked inaccessible and placed in - a queue of freed blocks. The purpose is to delay the point at - which freed-up memory comes back into circulation. This - increases the chance that Valgrind will be able to detect - invalid accesses to blocks for some significant period of time - after they have been freed. -
- This flag specifies the maximum total size, in bytes, of the - blocks in the queue. The default value is one million bytes. - Increasing this increases the total amount of memory used by - Valgrind but may detect invalid uses of freed blocks which would - otherwise go undetected.
- -
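The kind of bug this freed-blocks queue exists to catch is an access to a block after it has been freed; a contrived example (not from the Valgrind test suite):

   #include <stdlib.h>

   int main ( void )
   {
      int* p = malloc(10 * sizeof(int));
      p[0] = 17;
      free(p);
      /* Use after free: caught as long as the block is still
         sitting in the freed-blocks queue. */
      return p[0];
   }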
--logfile-fd=<number> [default: 2, stderr]
- Specifies the file descriptor on which Valgrind communicates
- all of its messages. The default, 2, is the standard error
- channel. This may interfere with the client's own use of
- stderr. To dump Valgrind's commentary in a file without using
- stderr, something like the following works well (sh/bash
- syntax):
-
- valgrind --logfile-fd=9 my_prog 9> logfile
- That is: tell Valgrind to send all output to file descriptor 9,
- and ask the shell to route file descriptor 9 to "logfile".
-
- -
--suppressions=<filename>
- [default: $PREFIX/lib/valgrind/default.supp]
- Specifies an extra - file from which to read descriptions of errors to suppress. You - may use as many extra suppressions files as you - like.
- -
--leak-check=no [default]--leak-check=yes
- When enabled, search for memory leaks when the client program - finishes. A memory leak means a malloc'd block, which has not - yet been free'd, but to which no pointer can be found. Such a - block can never be free'd by the program, since no pointer to it - exists. Leak checking is disabled by default because it tends - to generate dozens of error messages.
- -
--show-reachable=no [default]--show-reachable=yes
- When disabled, the memory leak detector only shows blocks to
- which it cannot find any pointer at all, or only a
- pointer to the middle. These blocks are prime candidates for
- memory leaks. When enabled, the leak detector also reports on
- blocks which it could find a pointer to. Your program could, at
- least in principle, have freed such blocks before exit.
- Contrast this to blocks for which no pointer, or only an
- interior pointer could be found: they are more likely to
- indicate memory leaks, because you do not actually have a
- pointer to the start of the block which you can hand to
- free, even if you wanted to.
- -
--leak-resolution=low [default]--leak-resolution=med --leak-resolution=high
- When doing leak checking, determines how willing Valgrind is
- to consider different backtraces to be the same. When set to
- low, the default, only the first two entries need
- match. When med, four entries have to match. When
- high, all entries need to match.
-
- For hardcore leak debugging, you probably want to use
- --leak-resolution=high together with
- --num-callers=40 or some such large number. Note
- however that this can give an overwhelming amount of
- information, which is why the defaults are 4 callers and
- low-resolution matching.
-
- Note that the --leak-resolution= setting does not
- affect Valgrind's ability to find leaks. It only changes how
- the results are presented.
-
- -
--workaround-gcc296-bugs=no [default]--workaround-gcc296-bugs=yes When enabled,
- Valgrind assumes that reads and writes some small distance below the stack
- pointer %esp are due to bugs in gcc 2.96, and does
- not report them. The "small distance" is 256 bytes by default.
- Note that gcc 2.96 is the default compiler on some popular Linux
- distributions (RedHat 7.X, Mandrake) and so you may well need to
- use this flag. Do not use it if you do not have to, as it can
- cause real errors to be overlooked. Another option is to use a
- gcc/g++ which does not generate accesses below the stack
- pointer. 2.95.3 seems to be a good choice in this respect.
-
- Unfortunately (27 Feb 02) it looks like g++ 3.0.4 has a similar - bug, so you may need to issue this flag if you use 3.0.4. A - while later (early Apr 02) this is confirmed as a scheduling bug - in g++-3.0.4. -
- -
--error-limit=yes [default]--error-limit=no When enabled, valgrind stops - reporting errors after 30000 in total, or 300 different ones, - have been seen. This is to stop the error tracking machinery - from becoming a huge performance overhead in programs with many - errors.
- -
--cachesim=no [default]--cachesim=yes When enabled, turns off memory - checking, and turns on cache profiling. Cache profiling is - described in detail in Section 7. -
- -
--weird-hacks=hack1,hack2,...
- Pass miscellaneous hints to Valgrind which slightly modify the
- simulated behaviour in nonstandard or dangerous ways, possibly
- to help the simulation of strange features. By default no hacks
- are enabled. Use with caution! Currently known hacks are:
- -
ioctl-VTIME Use this if you have a program
- which sets readable file descriptors to have a timeout by
- doing ioctl on them with a
- TCSETA-style command and a non-zero
- VTIME timeout value. This is considered
- potentially dangerous and therefore is not engaged by
- default, because it is (remotely) conceivable that it could
- cause threads doing read to incorrectly block
- the entire process.
-
- You probably want to try this one if you have a program
- which unexpectedly blocks in a read from a file
- descriptor which you know to have been messed with by
- ioctl. This could happen, for example, if the
- descriptor is used to read input from some kind of screen
- handling library.
-
- To find out if your program is blocking unexpectedly in the
- read system call, run with
- --trace-syscalls=yes flag.
-
-
truncate-writes Use this if you have a threaded
- program which appears to unexpectedly block whilst writing
- into a pipe. The effect is to modify all calls to
- write() so that requests to write more than
- 4096 bytes are treated as if they only requested a write of
- 4096 bytes. Valgrind does this by changing the
- count argument of write(), as
- passed to the kernel, so that it is at most 4096. The
- amount of data written will then be less than the client
- program asked for, but the client should have a loop around
- its write() call to check whether the requested
- number of bytes have been written. If not, it should issue
- further write() calls until all the data is
- written.
- - This all sounds pretty dodgy to me, which is why I've made - this behaviour only happen on request. It is not the - default behaviour. At the time of writing this (30 June - 2002) I have only seen one example where this is necessary, - so either the problem is extremely rare or nobody is using - Valgrind :-) -
- On experimentation I see that truncate-writes
- doesn't interact well with ioctl-VTIME, so you
- probably don't want to try both at once.
-
- As above, to find out if your program is blocking
- unexpectedly in the write() system call, you
- may find the --trace-syscalls=yes
- --trace-sched=yes flags useful.
-
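As noted above, a client writing to a pipe under truncate-writes must loop until everything has been written. A sketch of such a loop (ordinary POSIX code, nothing Valgrind-specific):

   #include <errno.h>
   #include <unistd.h>

   /* Write exactly 'count' bytes to 'fd', coping with short writes
      such as those produced by the truncate-writes hack. */
   int write_all ( int fd, const void* buf, size_t count )
   {
      size_t done = 0;
      while (done < count) {
         ssize_t n = write(fd, (const char*)buf + done, count - done);
         if (n < 0) {
            if (errno == EINTR) continue;   /* interrupted; retry */
            return -1;                      /* genuine error */
         }
         done += (size_t)n;
      }
      return 0;
   }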
-
--single-step=no [default]--single-step=yes
- When enabled, each x86 insn is translated separately into - instrumented code. When disabled, translation is done on a - per-basic-block basis, giving much better translations.
- -
--optimise=no--optimise=yes [default]
- When enabled, various improvements are applied to the - intermediate code, mainly aimed at allowing the simulated CPU's - registers to be cached in the real CPU's registers over several - simulated instructions.
- -
--instrument=no--instrument=yes [default]
- When disabled, the translations don't actually contain any - instrumentation.
- -
--cleanup=no--cleanup=yes [default]
- When enabled, various improvements are applied to the - post-instrumented intermediate code, aimed at removing redundant - value checks.
- -
--trace-syscalls=no [default]--trace-syscalls=yes
- Enable/disable tracing of system call intercepts.
- -
--trace-signals=no [default]--trace-signals=yes
- Enable/disable tracing of signal handling.
- -
--trace-sched=no [default]--trace-sched=yes
- Enable/disable tracing of thread scheduling events.
- -
--trace-pthread=none [default]--trace-pthread=some --trace-pthread=all
- Specifies amount of trace detail for pthread-related events.
- -
--trace-symtab=no [default]--trace-symtab=yes
- Enable/disable tracing of symbol table reading.
- -
--trace-malloc=no [default]--trace-malloc=yes
- Enable/disable tracing of malloc/free (et al) intercepts. -
- -
--stop-after=<number>
- [default: infinity, more or less]
- After <number> basic blocks have been executed, shut down - Valgrind and switch back to running the client on the real CPU. -
- -
--dump-error=<number> [default: inactive]
- After the program has exited, show gory details of the
- translation of the basic block containing the <number>'th
- error context. When used with --single-step=yes,
- can show the exact x86 instruction causing an error. This is
- all fairly dodgy and doesn't work at all if threads are
- involved.
-
- Invalid read of size 4 - at 0x40F6BBCC: (within /usr/lib/libpng.so.2.1.0.9) - by 0x40F6B804: (within /usr/lib/libpng.so.2.1.0.9) - by 0x40B07FF4: read_png_image__FP8QImageIO (kernel/qpngio.cpp:326) - by 0x40AC751B: QImageIO::read() (kernel/qimage.cpp:3621) - Address 0xBFFFF0E0 is not stack'd, malloc'd or free'd -- -
This happens when your program reads or writes memory at a place -which Valgrind reckons it shouldn't. In this example, the program did -a 4-byte read at address 0xBFFFF0E0, somewhere within the -system-supplied library libpng.so.2.1.0.9, which was called from -somewhere else in the same library, called from line 326 of -qpngio.cpp, and so on. - -
Valgrind tries to establish what the illegal address might relate -to, since that's often useful. So, if it points into a block of -memory which has already been freed, you'll be informed of this, and -also where the block was free'd. Likewise, if it should turn out -to be just off the end of a malloc'd block, a common result of -off-by-one errors in array subscripting, you'll be informed of this -fact, and also where the block was malloc'd. - -
In this example, Valgrind can't identify the address. Actually the -address is on the stack, but, for some reason, this is not a valid -stack address -- it is below the stack pointer, %esp, and that isn't -allowed. In this particular case it's probably caused by gcc -generating invalid code, a known bug in various flavours of gcc. - -
Note that Valgrind only tells you that your program is about to -access memory at an illegal address. It can't stop the access from -happening. So, if your program makes an access which normally would -result in a segmentation fault, your program will still suffer the same -fate -- but you will get a message from Valgrind immediately prior to -this. In this particular example, reading junk on the stack is -non-fatal, and the program stays alive. - -
- Conditional jump or move depends on uninitialised value(s) - at 0x402DFA94: _IO_vfprintf (_itoa.h:49) - by 0x402E8476: _IO_printf (printf.c:36) - by 0x8048472: main (tests/manuel1.c:8) - by 0x402A6E5E: __libc_start_main (libc-start.c:129) -- -
An uninitialised-value use error is reported when your program uses -a value which hasn't been initialised -- in other words, is undefined. -Here, the undefined value is used somewhere inside the printf() -machinery of the C library. This error was reported when running the -following small program: -
- #include <stdio.h>
- int main()
- {
- int x;
- printf ("x = %d\n", x);
- }
-
-
-It is important to understand that your program can copy around -junk (uninitialised) data to its heart's content. Valgrind observes -this and keeps track of the data, but does not complain. A complaint -is issued only when your program attempts to make use of uninitialised -data. In this example, x is uninitialised. Valgrind observes the -value being passed to _IO_printf and thence to _IO_vfprintf, but makes -no comment. However, _IO_vfprintf has to examine the value of x so it -can turn it into the corresponding ASCII string, and it is at this -point that Valgrind complains. - -
Sources of uninitialised data tend to be: -
- Local variables in procedures which have not been initialised, as in the example above.
- The contents of malloc'd blocks, before you (or a constructor) write something there.
-
- Invalid free() - at 0x4004FFDF: free (ut_clientmalloc.c:577) - by 0x80484C7: main (tests/doublefree.c:10) - by 0x402A6E5E: __libc_start_main (libc-start.c:129) - by 0x80483B1: (within tests/doublefree) - Address 0x3807F7B4 is 0 bytes inside a block of size 177 free'd - at 0x4004FFDF: free (ut_clientmalloc.c:577) - by 0x80484C7: main (tests/doublefree.c:10) - by 0x402A6E5E: __libc_start_main (libc-start.c:129) - by 0x80483B1: (within tests/doublefree) --
Valgrind keeps track of the blocks allocated by your program with -malloc/new, so it knows exactly whether or not the argument to -free/delete is legitimate. Here, this test program has -freed the same block twice. As with the illegal read/write errors, -Valgrind attempts to make sense of the address free'd. If, as -here, the address is one which has previously been freed, you will -be told that -- making duplicate frees of the same block easy to spot. - -
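A minimal program which provokes this kind of report (not the actual tests/doublefree.c) is simply:

   #include <stdlib.h>

   int main ( void )
   {
      char* p = malloc(177);
      free(p);
      free(p);    /* second free of the same block: Invalid free() */
      return 0;
   }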
new[]
-has wrongly been deallocated with free:
-- Mismatched free() / delete / delete [] - at 0x40043249: free (vg_clientfuncs.c:171) - by 0x4102BB4E: QGArray::~QGArray(void) (tools/qgarray.cpp:149) - by 0x4C261C41: PptDoc::~PptDoc(void) (include/qmemarray.h:60) - by 0x4C261F0E: PptXml::~PptXml(void) (pptxml.cc:44) - Address 0x4BB292A8 is 0 bytes inside a block of size 64 alloc'd - at 0x4004318C: __builtin_vec_new (vg_clientfuncs.c:152) - by 0x4C21BC15: KLaola::readSBStream(int) const (klaola.cc:314) - by 0x4C21C155: KLaola::stream(KLaola::OLENode const *) (klaola.cc:416) - by 0x4C21788F: OLEFilter::convert(QCString const &) (olefilter.cc:272) --The following was told to me by the KDE 3 developers. I didn't know -any of it myself. They also implemented the check itself.
-In C++ it's important to deallocate memory in a way compatible with -how it was allocated. The deal is: -
malloc, calloc,
- realloc, valloc or
- memalign, you must deallocate with free.
-new[], you must deallocate with
- delete[].
-new, you must deallocate with
- delete.
-
-Pascal Massimino adds the following clarification:
-delete[] must be used with a pointer obtained from
-new[], because the compiler stores the size of the array
-and the pointer-to-member to the destructor of the array's contents
-just before the pointer actually returned. This implies a
-variable-sized overhead in what's returned by new or
-new[]. It is rather surprising how robust compilers [Ed:
-runtime-support libraries?] are to mismatches between
-new/delete and
-new[]/delete[].
-
-
-
Here's an example of a system call with an invalid parameter: -
- #include <stdlib.h>
- #include <unistd.h>
- int main( void )
- {
- char* arr = malloc(10);
- (void) write( 1 /* stdout */, arr, 10 );
- return 0;
- }
-
-
-You get this complaint ... -
- Syscall param write(buf) contains uninitialised or unaddressable byte(s) - at 0x4035E072: __libc_write - by 0x402A6E5E: __libc_start_main (libc-start.c:129) - by 0x80483B1: (within tests/badwrite) - by <bogus frame pointer> ??? - Address 0x3807E6D0 is 0 bytes inside a block of size 10 alloc'd - at 0x4004FEE6: malloc (ut_clientmalloc.c:539) - by 0x80484A0: main (tests/badwrite.c:6) - by 0x402A6E5E: __libc_start_main (libc-start.c:129) - by 0x80483B1: (within tests/badwrite) -- -
... because the program has tried to write uninitialised junk from -the malloc'd block to the standard output. - - -
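One way to make this particular complaint go away is to give the block defined contents before handing it to write (an illustrative fix, not part of the manual's test suite):

   #include <stdlib.h>
   #include <string.h>
   #include <unistd.h>

   int main ( void )
   {
      char* arr = malloc(10);
      memset(arr, 'x', 10);   /* the bytes are now initialised */
      (void) write( 1 /* stdout */, arr, 10 );
      free(arr);
      return 0;
   }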
-v):
-More than 50 errors detected. Subsequent errors
- will still be recorded, but in less detail than before.
- -
More than 300 errors detected. I'm not reporting any more.
- Final error counts may be inaccurate. Go fix your
- program!
- -
Warning: client switching stacks?
- -
Warning: client attempted to close Valgrind's logfile fd <number>
-
- --logfile-fd=<number>
- option to specify a different logfile file-descriptor number.
--
Warning: noted but unhandled ioctl <number>
- ioctl system calls, but did not modify its
- memory status info (because I have not yet got round to it).
- The call will still have gone through, but you may get spurious
- errors after this as a result of the non-update of the memory info.
--
Warning: set address range perms: large range <number>
- $PREFIX/lib/valgrind/default.supp.
-
-
-You can ask to add suppressions from another file, by specifying
---suppressions=/path/to/file.supp.
-
-
Each suppression has the following components:
-
- -
Value1,
- Value2,
- Value4 or
- Value8,
- meaning an uninitialised-value error when
- using a value of 1, 2, 4 or 8 bytes.
- Or
- Cond (or its old name, Value0),
- meaning use of an uninitialised CPU condition code. Or:
- Addr1,
- Addr2,
- Addr4 or
- Addr8, meaning an invalid address during a
- memory access of 1, 2, 4 or 8 bytes respectively. Or
- Param,
- meaning an invalid system call parameter error. Or
- Free, meaning an invalid or mismatching free.
- Or PThread, meaning any kind of complaint to do
- with the PThreads API.- -
free, __builtin_vec_delete, etc)- -
- -
-Locations may be either names of shared objects/executables or wildcards
-matching function names. They begin obj: and fun:
-respectively. Function and object names to match against may use the
-wildcard characters * and ?.
-
-A suppression only suppresses an error when the error matches all the
-details in the suppression. Here's an example:
-
- {
- __gconv_transform_ascii_internal/__mbrtowc/mbtowc
- Value4
- fun:__gconv_transform_ascii_internal
- fun:__mbr*toc
- fun:mbtowc
- }
-
-
-What it means is: suppress a use-of-uninitialised-value error, when
-the data size is 4, when it occurs in the function
-__gconv_transform_ascii_internal, when that is called
-from any function of name matching __mbr*toc,
-when that is called from
-mbtowc. It doesn't apply under any other circumstances.
-The string by which this suppression is identified to the user is
-__gconv_transform_ascii_internal/__mbrtowc/mbtowc.
-
-
Another example: -
- {
- libX11.so.6.2/libX11.so.6.2/libXaw.so.7.0
- Value4
- obj:/usr/X11R6/lib/libX11.so.6.2
- obj:/usr/X11R6/lib/libX11.so.6.2
- obj:/usr/X11R6/lib/libXaw.so.7.0
- }
-
-
-Suppress any size 4 uninitialised-value error which occurs anywhere
-in libX11.so.6.2, when called from anywhere in the same
-library, when called from anywhere in libXaw.so.7.0. The
-inexact specification of locations is regrettable, but is about all
-you can hope for, given that the X11 libraries shipped with Red Hat
-7.2 have had their symbol tables removed.
-
-
Note -- since the above two examples did not make it clear -- that
-you can freely mix the obj: and fun:
-styles of description within a single suppression record.
-
-
-
-
-For your convenience, a subset of these so-called client requests is -provided to allow you to tell Valgrind facts about the behaviour of -your program, and conversely to make queries. In particular, your -program can tell Valgrind about changes in memory range permissions -that Valgrind would not otherwise know about, and so allows clients to -get Valgrind to do arbitrary custom checks. -
-Clients need to include the header file valgrind.h to
-make this work. The macros therein have the magical property that
-they generate code in-line which Valgrind can spot. However, the code
-does nothing when not run on Valgrind, so you are not forced to run
-your program on Valgrind just because you use the macros in this file.
-Also, you are not required to link your program with any extra
-supporting libraries.
-
-A brief description of the available macros: -
VALGRIND_MAKE_NOACCESS,
- VALGRIND_MAKE_WRITABLE and
- VALGRIND_MAKE_READABLE. These mark address
- ranges as completely inaccessible, accessible but containing
- undefined data, and accessible and containing defined data,
- respectively. Subsequent errors may have their faulting
- addresses described in terms of these blocks. Returns a
- "block handle". Returns zero when not run on Valgrind.
--
VALGRIND_DISCARD: At some point you may want
- Valgrind to stop reporting errors in terms of the blocks
- defined by the previous three macros. To do this, the above
- macros return a small-integer "block handle". You can pass
- this block handle to VALGRIND_DISCARD. After
- doing so, Valgrind will no longer be able to relate
- addressing errors to the user-defined block associated with
- the handle. The permissions settings associated with the
- handle remain in place; this just affects how errors are
- reported, not whether they are reported. Returns 1 for an
- invalid handle and 0 for a valid handle (although passing
- invalid handles is harmless). Always returns 0 when not run
- on Valgrind.
--
VALGRIND_CHECK_NOACCESS,
- VALGRIND_CHECK_WRITABLE and
- VALGRIND_CHECK_READABLE: check immediately
- whether or not the given address range has the relevant
- property, and if not, print an error message. Also, for the
- convenience of the client, returns zero if the relevant
- property holds; otherwise, the returned value is the address
- of the first byte for which the property is not true.
- Always returns 0 when not run on Valgrind.
--
VALGRIND_CHECK_DEFINED: a quick and easy way
- to find out whether Valgrind thinks a particular variable
- (lvalue, to be precise) is addressable and defined. Prints
- an error message if not. Returns no value.
--
VALGRIND_MAKE_NOACCESS_STACK: a highly
- experimental feature. Similarly to
- VALGRIND_MAKE_NOACCESS, this marks an address
- range as inaccessible, so that subsequent accesses to an
- address in the range give an error. However, this macro
- does not return a block handle. Instead, all annotations
- created like this are reviewed at each client
- ret (subroutine return) instruction, and those
- which now define an address range below the client's stack
- pointer register (%esp) are automatically
- deleted.
- - In other words, this macro allows the client to tell - Valgrind about red-zones on its own stack. Valgrind - automatically discards this information when the stack - retreats past such blocks. Beware: hacky and flaky, and - probably interacts badly with the new pthread support. -
-
RUNNING_ON_VALGRIND: returns 1 if running on
- Valgrind, 0 if running on the real CPU.
--
VALGRIND_DO_LEAK_CHECK: run the memory leak detector
- right now. Returns no value. I guess this could be used to
- incrementally check for leaks between arbitrary places in the
- program's execution. Warning: not properly tested!
--
VALGRIND_DISCARD_TRANSLATIONS: discard translations
- of code in the specified address range. Useful if you are
- debugging a JITter or some other dynamic code generation system.
- After this call, attempts to execute code in the invalidated
- address range will cause valgrind to make new translations of that
- code, which is probably the semantics you want. Note that this is
- implemented naively, and involves checking all 200191 entries in
- the translation table to see if any of them overlap the specified
- address range. So try not to call it often, or performance will
- nosedive. Note that you can be clever about this: you only need
- to call it when an area which previously contained code is
- overwritten with new code. You can choose to write code into
- fresh memory, and just call this occasionally to discard large
- chunks of old code all at once.
- - Warning: minimally tested, especially for the cache simulator. -
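Tying a few of these together, here is a sketch of how a client might use the macros; the two-argument (address, length) forms are assumed from the descriptions above, and exact macro signatures may differ between Valgrind versions:

   #include <stdio.h>
   #include <stdlib.h>
   #include "valgrind.h"   /* expands to near-nothing when not on Valgrind */

   int main ( void )
   {
      char* buf = malloc(64);
      int   handle;

      if (RUNNING_ON_VALGRIND)
         printf("running on the synthetic CPU\n");

      /* Declare the last 16 bytes off-limits; later errors in this
         range will be described in terms of this block. */
      handle = VALGRIND_MAKE_NOACCESS(buf + 48, 16);

      /* ... work with the first 48 bytes only ... */

      /* Make the tail usable again and drop the block description. */
      VALGRIND_MAKE_WRITABLE(buf + 48, 16);
      VALGRIND_DISCARD(handle);

      free(buf);
      return 0;
   }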
-It works as follows: threaded apps are (dynamically) linked against
-libpthread.so. Usually this is the one installed with
-your Linux distribution. Valgrind, however, supplies its own
-libpthread.so and automatically connects your program to
-it instead.
-
-The fake libpthread.so and Valgrind cooperate to
-implement a user-space pthreads package. This approach avoids the
-horrible problems of implementing a truly
-multiprocessor version of Valgrind, but it does mean that threaded
-apps run only on one CPU, even if you have a multiprocessor machine.
-
-Valgrind schedules your threads in a round-robin fashion, with all -threads having equal priority. It switches threads every 50000 basic -blocks (typically around 300000 x86 instructions), which means you'll -get a much finer interleaving of thread executions than when run -natively. This in itself may cause your program to behave differently -if you have some kind of concurrency, critical race, locking, or -similar, bugs. -
-The current (valgrind-1.0 release) state of pthread support is as -follows: -
pthread_once, reader-writer locks, semaphores,
- cleanup stacks, cancellation and thread detaching currently work.
- Various attribute-like calls are handled but ignored; you get a
- warning message.
--
write read nanosleep
- sleep select poll
- recvmsg and
- accept.
--
pthread_sigmask, pthread_kill,
- sigwait and raise are now implemented.
- Each thread has its own signal mask, as POSIX requires.
- It's a bit kludgey -- there's a system-wide pending signal set,
- rather than one for each thread. But hey.
-./configure,
-make, make install mechanism, and I have
-attempted to ensure that it works on machines with kernel 2.2 or 2.4
-and glibc 2.1.X or 2.2.X. I don't think there is much else to say.
-There are no options apart from the usual --prefix that
-you should give to ./configure.
-
-
-The configure script tests the version of the X server
-currently indicated by the current $DISPLAY. This is a
-known bug. The intention was to detect the version of the current
-XFree86 client libraries, so that correct suppressions could be
-selected for them, but instead the test checks the server version.
-This is just plain wrong.
-
-
-If you are building a binary package of Valgrind for distribution,
-please read README_PACKAGERS. It contains some important
-information.
-
-
-Apart from that there is no excitement here. Let me know if you have -build problems. - - - - -
See Section 4 for the known limitations of -Valgrind, and for a list of programs which are known not to work on -it. - -
The translator/instrumentor has a lot of assertions in it. They -are permanently enabled, and I have no plans to disable them. If one -of these breaks, please mail me! - -
If you get an assertion failure on the expression
-chunkSane(ch) in vg_free() in
-vg_malloc.c, this may have happened because your program
-wrote off the end of a malloc'd block, or before its beginning.
-Valgrind should have emitted a proper message to that effect before
-dying in this way. This is a known problem which I should fix.
-
- -
Each byte in the system therefore has 8 V bits which follow -it wherever it goes. For example, when the CPU loads a word-size item -(4 bytes) from memory, it also loads the corresponding 32 V bits from -a bitmap which stores the V bits for the process' entire address -space. If the CPU should later write the whole or some part of that -value to memory at a different address, the relevant V bits will be -stored back in the V-bit bitmap. - -
In short, each bit in the system has an associated V bit, which
-follows it around everywhere, even inside the CPU. Yes, the CPU's
-(integer and %eflags) registers have their own V bit
-vectors.
-
-
Copying values around does not cause Valgrind to check for, or -report on, errors. However, when a value is used in a way which might -conceivably affect the outcome of your program's computation, the -associated V bits are immediately checked. If any of these indicate -that the value is undefined, an error is reported. - -
Here's an (admittedly nonsensical) example: -
- int i, j;
- int a[10], b[10];
- for (i = 0; i < 10; i++) {
- j = a[i];
- b[i] = j;
- }
-
-
-Valgrind emits no complaints about this, since it merely copies
-uninitialised values from a[] into b[], and
-doesn't use them in any way. However, if the loop is changed to
-
- for (i = 0; i < 10; i++) {
- j += a[i];
- }
- if (j == 77)
- printf("hello there\n");
-
-then Valgrind will complain, at the if, that the
-condition depends on uninitialised values.
-
-Most low level operations, such as adds, cause Valgrind to -use the V bits for the operands to calculate the V bits for the -result. Even if the result is partially or wholly undefined, -it does not complain. - -
Checks on definedness only occur in two places: when a value is -used to generate a memory address, and where a control flow decision -needs to be made. Also, when a system call is detected, Valgrind -checks the definedness of parameters as required. - -
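For instance, an uninitialised value used to form an address is checked at that point; a small, deliberately broken example:

   int main ( void )
   {
      int a[10];
      int i;        /* never initialised */
      a[i] = 42;    /* i is used to compute an address: checked here */
      return 0;
   }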
If a check should detect undefinedness, an error message is -issued. The resulting value is subsequently regarded as well-defined. -To do otherwise would give long chains of error messages. In effect, -we say that undefined values are non-infectious. - -
This sounds overcomplicated. Why not just check all reads from -memory, and complain if an undefined value is loaded into a CPU register? -Well, that doesn't work well, because perfectly legitimate C programs routinely -copy uninitialised values around in memory, and we don't want endless complaints -about that. Here's the canonical example. Consider a struct -like this: -
- struct S { int x; char c; };
- struct S s1, s2;
- s1.x = 42;
- s1.c = 'z';
- s2 = s1;
-
-
-The question to ask is: how large is struct S, in
-bytes? An int is 4 bytes and a char one byte, so perhaps a struct S
-occupies 5 bytes? Wrong. All (non-toy) compilers I know of will
-round the size of struct S up to a whole number of words,
-in this case 8 bytes. Not doing this forces compilers to generate
-truly appalling code for subscripting arrays of struct
-S's.
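If you want to see the padding for yourself, a trivial check (not one of the manual's examples) is:

   #include <stdio.h>

   struct S { int x; char c; };

   int main ( void )
   {
      /* Typically prints 8 on x86: 5 bytes of data plus 3 of padding. */
      printf("sizeof(struct S) = %lu\n", (unsigned long)sizeof(struct S));
      return 0;
   }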
-
-
So s1 occupies 8 bytes, yet only 5 of them will be initialised.
-For the assignment s2 = s1, gcc generates code to copy
-all 8 bytes wholesale into s2 without regard for their
-meaning. If Valgrind simply checked values as they came out of
-memory, it would yelp every time a structure assignment like this
-happened. So the more complicated semantics described above is
-necessary. This allows gcc to copy s1 into
-s2 any way it likes, and a warning will only be emitted
-if the uninitialised values are later used.
-
-
One final twist to this story. The above scheme allows garbage to -pass through the CPU's integer registers without complaint. It does -this by giving the integer registers V tags, passing these around in -the expected way. This is complicated and computationally expensive to -do, but is necessary. Valgrind is more simplistic about -floating-point loads and stores. In particular, V bits for data read -as a result of floating-point loads are checked at the load -instruction. So if your program uses the floating-point registers to -do memory-to-memory copies, you will get complaints about -uninitialised values. Fortunately, I have not yet encountered a -program which (ab)uses the floating-point registers in this way. - -
As described above, every bit in memory or in the CPU has an -associated valid-value (V) bit. In addition, all bytes in memory, but -not in the CPU, have an associated valid-address (A) bit. This -indicates whether or not the program can legitimately read or write -that location. It does not give any indication of the validity of the -data at that location -- that's the job of the V bits -- only whether -or not the location may be accessed. - -
Every time your program reads or writes memory, Valgrind checks the -A bits associated with the address. If any of them indicate an -invalid address, an error is emitted. Note that the reads and writes -themselves do not change the A bits, only consult them. - -
So how do the A bits get set/cleared? Like this: - -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
This apparently strange choice reduces the amount of confusing - information presented to the user. It avoids the - unpleasant phenomenon in which memory is read from a place which - is both unaddressable and contains invalid values, and, as a - result, you get not only an invalid-address (read/write) error, - but also a potentially large set of uninitialised-value errors, - one for every time the value is used. -
- There is a hazy boundary case to do with multi-byte loads from
- addresses which are partially valid and partially invalid. See
- the description of the flag --partial-loads-ok for details.
-
- -
- -
- -
- -
Under the hood, dealing with signals is a real pain, and Valgrind's -simulation leaves much to be desired. If your program does -way-strange stuff with signals, bad things may happen. If so, let me -know. I don't promise to fix it, but I'd at least like to be aware of -it. - - - -
For each such block, Valgrind scans the entire address space of the -process, looking for pointers to the block. One of three situations -may result: - -
- A pointer to the start of the block is found. The block is still reachable, and is only reported if you use the --show-reachable=yes option.
- Only a pointer to the interior of the block is found. Such a block is a likely leak, since the program no longer holds a pointer it could legitimately pass to free.
- No pointer to the block is found at all. The block has definitely been leaked, since the program can never free it.
The precise area of memory in which Valgrind searches for pointers -is: all naturally-aligned 4-byte words for which all A bits indicate -addressability and all V bits indicate that the stored value is -actually valid. - -
Valgrind will run x86-GNU/Linux ELF dynamically linked binaries, on -a kernel 2.2.X or 2.4.X system, subject to the following constraints: - -
- -
libpthread.so, so that Valgrind can
- substitute its own implementation at program startup time. If
- you're statically linked against it, things will fail
- badly.- -
- -
- -
- -
- -
- -
- -
- -
-
__pthread_clock_gettime and
- __pthread_clock_settime. This appears to be due to
- /lib/librt-2.2.5.so needing them. Unfortunately I
- do not understand enough about this problem to fix it properly,
- and I can't reproduce it on my test RedHat 7.3 system. Please
- mail me if you have more information / understanding. -
-
-fno-builtin-strlen in
- the meantime. Or use an earlier gcc.-
The dynamic linker allows each .so in the process image to have an -initialisation function which is run before main(). It also allows -each .so to have a finalisation function run after main() exits. - -
When valgrind.so's initialisation function is called by the dynamic -linker, the synthetic CPU starts up. The real CPU remains locked -in valgrind.so for the entire rest of the program, but the synthetic -CPU returns from the initialisation function. Startup of the program -now continues as usual -- the dynamic linker calls all the other .so's -initialisation routines, and eventually runs main(). This all runs on -the synthetic CPU, not the real one, but the client program cannot -tell the difference. - -
Eventually main() exits, so the synthetic CPU calls valgrind.so's -finalisation function. Valgrind detects this, and uses it as its cue -to exit. It prints summaries of all errors detected, possibly checks -for memory leaks, and then exits the finalisation routine, but now on -the real CPU. The synthetic CPU has now lost control -- permanently --- so the program exits back to the OS on the real CPU, just as it -would have done anyway. - -
On entry, Valgrind switches stacks, so it runs on its own stack. -On exit, it switches back. This means that the client program -continues to run on its own stack, so we can switch back and forth -between running it on the simulated and real CPUs without difficulty. -This was an important design decision, because it makes it easy (well, -significantly less difficult) to debug the synthetic CPU. - - - -
Valgrind no longer directly supports detection of self-modifying -code. Such checking is expensive, and in practice (fortunately) -almost no applications need it. However, to help people who are -debugging dynamic code generation systems, there is a Client Request -(basically a macro you can put in your program) which directs Valgrind -to discard translations in a given address range. So Valgrind can -still work in this situation provided the client tells it when -code has become out-of-date and needs to be retranslated. - -
The JITter translates basic blocks -- blocks of straight-line code --- as single entities. To minimise the considerable difficulties of -dealing with the x86 instruction set, x86 instructions are first -translated to a RISC-like intermediate code, similar to sparc code, -but with an infinite number of virtual integer registers. Initially -each insn is translated separately, and there is no attempt at -instrumentation. - -
The intermediate code is improved, mostly so as to try and cache -the simulated machine's registers in the real machine's registers over -several simulated instructions. This is often very effective. Also, -we try to remove redundant updates of the simulated machine's -condition-code register. - -
The intermediate code is then instrumented, giving more -intermediate code. There are a few extra intermediate-code operations -to support instrumentation; it is all refreshingly simple. After -instrumentation there is a cleanup pass to remove redundant value -checks. - -
This gives instrumented intermediate code which mentions arbitrary -numbers of virtual registers. A linear-scan register allocator is -used to assign real registers and possibly generate spill code. All -of this is still phrased in terms of the intermediate code. This -machinery is inspired by the work of Reuben Thomas (MITE). - -
Then, and only then, is the final x86 code emitted. The -intermediate code is carefully designed so that x86 code can be -generated from it without need for spare registers or other -inconveniences. - -
The translations are managed using a traditional LRU-based caching -scheme. The translation cache has a default size of about 14MB. - - - -
When such a signal arrives, Valgrind's own handler catches it, and -notes the fact. At a convenient safe point in execution, Valgrind -builds a signal delivery frame on the client's stack and runs its -handler. If the handler longjmp()s, there is nothing more to be said. -If the handler returns, Valgrind notices this, zaps the delivery -frame, and carries on where it left off before delivering the signal. - -
The purpose of this nonsense is that setting signal handlers -essentially amounts to giving callback addresses to the Linux kernel. -We can't allow this to happen, because if it did, signal handlers -would run on the real CPU, not the simulated one. This means the -checking machinery would not operate during the handler run, and, -worse, memory permissions maps would not be updated, which could cause -spurious error reports once the handler had returned. - -
An even worse thing would happen if the signal handler longjmp'd -rather than returned: Valgrind would completely lose control of the -client program. - -
Upshot: we can't allow the client to install signal handlers -directly. Instead, Valgrind must catch, on behalf of the client, any -signal the client asks to catch, and must deliver it to the client on -the simulated CPU, not the real one. This involves considerable -gruesome fakery; see vg_signals.c for details. -
- -
-sewardj@phoenix:~/newmat10$ -~/Valgrind-6/valgrind -v ./bogon -==25832== Valgrind 0.10, a memory error detector for x86 RedHat 7.1. -==25832== Copyright (C) 2000-2001, and GNU GPL'd, by Julian Seward. -==25832== Startup, with flags: -==25832== --suppressions=/home/sewardj/Valgrind/redhat71.supp -==25832== reading syms from /lib/ld-linux.so.2 -==25832== reading syms from /lib/libc.so.6 -==25832== reading syms from /mnt/pima/jrs/Inst/lib/libgcc_s.so.0 -==25832== reading syms from /lib/libm.so.6 -==25832== reading syms from /mnt/pima/jrs/Inst/lib/libstdc++.so.3 -==25832== reading syms from /home/sewardj/Valgrind/valgrind.so -==25832== reading syms from /proc/self/exe -==25832== loaded 5950 symbols, 142333 line number locations -==25832== -==25832== Invalid read of size 4 -==25832== at 0x8048724: _ZN10BandMatrix6ReSizeEiii (bogon.cpp:45) -==25832== by 0x80487AF: main (bogon.cpp:66) -==25832== by 0x40371E5E: __libc_start_main (libc-start.c:129) -==25832== by 0x80485D1: (within /home/sewardj/newmat10/bogon) -==25832== Address 0xBFFFF74C is not stack'd, malloc'd or free'd -==25832== -==25832== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0) -==25832== malloc/free: in use at exit: 0 bytes in 0 blocks. -==25832== malloc/free: 0 allocs, 0 frees, 0 bytes allocated. -==25832== For a detailed leak analysis, rerun with: --leak-check=yes -==25832== -==25832== exiting, did 1881 basic blocks, 0 misses. -==25832== 223 translations, 3626 bytes in, 56801 bytes out. --
The GCC folks fixed this about a week before gcc-3.0 shipped. -
- -Also, since one instruction cache read is performed per instruction executed, -you can find out how many instructions are executed per line, which can be -useful for traditional profiling and test coverage.
- -Any feedback, bug-fixes, suggestions, etc, welcome. - - -
-g flag). But by contrast with normal Valgrind use, you
-probably do want to turn optimisation on, since you should profile your
-program as it will be normally run.
-
-The two steps are:
-cachegrind in front of the
- normal command line invocation. When the program finishes,
- Valgrind will print summary cache statistics. It also collects
- line-by-line information in a file
- cachegrind.out.pid, where pid
- is the program's process id.
- - This step should be done every time you want to collect - information about a new program, a changed program, or about the - same program with different input. -
-
--auto=yes option. You can annotate C/C++
- files or assembly language files equally easily.
- - This step can be performed as many times as you like for each - Step 2. You may want to do multiple annotations showing - different information each time.
-
- - -
- -The more specific characteristics of the simulation are as follows. - -
- -
- -
-
--I1, --D1 and --L2 options.- -Other noteworthy behaviour: - -
inc and
- dec) are counted as doing just a read, ie. a single data
- reference. This may seem strange, but since the write can never cause a
- miss (the read guarantees the block is in the cache) it's not very
- interesting.- - Thus it measures not the number of times the data cache is accessed, but - the number of times a data cache miss could occur.
-
vg_cachesim_I1.c, vg_cachesim_D1.c,
-vg_cachesim_L2.c and vg_cachesim_gen.c. We'd be
-interested to hear from anyone who does.
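-
-To give a feel for what such a simulator involves, here is a deliberately
-simplified, self-contained C sketch of a set-associative cache model with
-LRU replacement. It is an illustration only, not Valgrind's actual
-vg_cachesim_*.c code, and the sizes below are assumptions made purely for
-the example:
-
-   #include <stdio.h>
-
-   #define LINE_SIZE 64                  /* bytes per line (assumed)      */
-   #define ASSOC     2                   /* ways per set   (assumed)      */
-   #define N_SETS    512                 /* 64KB / (64 * 2)               */
-
-   typedef unsigned int UInt;
-
-   static UInt tags [N_SETS][ASSOC];     /* most-recently-used way first  */
-   static int  valid[N_SETS][ASSOC];
-   static unsigned long accesses, misses;
-
-   /* Simulate one reference; returns 1 on a miss, 0 on a hit. */
-   static int cache_ref(UInt addr)
-   {
-      UInt set = (addr / LINE_SIZE) % N_SETS;
-      UInt tag = (addr / LINE_SIZE) / N_SETS;
-      int  i, j;
-      accesses++;
-      for (i = 0; i < ASSOC; i++) {
-         if (valid[set][i] && tags[set][i] == tag) {
-            /* hit: promote this way to the most-recently-used slot */
-            for (j = i; j > 0; j--) tags[set][j] = tags[set][j-1];
-            tags[set][0] = tag;
-            return 0;
-         }
-      }
-      /* miss: evict the least-recently-used way (the last slot) */
-      misses++;
-      for (j = ASSOC-1; j > 0; j--) {
-         tags[set][j]  = tags[set][j-1];
-         valid[set][j] = valid[set][j-1];
-      }
-      tags[set][0]  = tag;
-      valid[set][0] = 1;
-      return 1;
-   }
-
-   int main(void)
-   {
-      cache_ref(0x1000); cache_ref(0x1004);   /* same line: hit 2nd time */
-      printf("%lu accesses, %lu misses\n", accesses, misses);
-      return 0;
-   }
-
-A real simulator must also cope with references which straddle two lines,
-and with separate read/write accounting, both of which the sketch above
-ignores.
-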
-
-
---cachesim=yes
-option to the valgrind shell script. Alternatively, it
-is probably more convenient to use the cachegrind script.
-Either way automatically turns off Valgrind's memory checking functions,
-since the cache simulation is slow enough already, and you probably
-don't want to do both at once.
-
-To gather cache profiling information about the program ls
--l, type:
-
-
cachegrind ls -l
-
-The program will execute (slowly). Upon completion, summary statistics
-that look like this will be printed:
-
--==31751== I refs: 27,742,716 -==31751== I1 misses: 276 -==31751== L2 misses: 275 -==31751== I1 miss rate: 0.0% -==31751== L2i miss rate: 0.0% -==31751== -==31751== D refs: 15,430,290 (10,955,517 rd + 4,474,773 wr) -==31751== D1 misses: 41,185 ( 21,905 rd + 19,280 wr) -==31751== L2 misses: 23,085 ( 3,987 rd + 19,098 wr) -==31751== D1 miss rate: 0.2% ( 0.1% + 0.4%) -==31751== L2d miss rate: 0.1% ( 0.0% + 0.4%) -==31751== -==31751== L2 misses: 23,360 ( 4,262 rd + 19,098 wr) -==31751== L2 miss rate: 0.0% ( 0.0% + 0.4%) -- -Cache accesses for instruction fetches are summarised first, giving the -number of fetches made (this is the number of instructions executed, which -can be useful to know in its own right), the number of I1 misses, and the -number of L2 instruction (
L2i) misses.
-
-Cache accesses for data follow. The information is similar to that of the
-instruction fetches, except that the values are also shown split between reads
-and writes (note each row's rd and wr values add up
-to the row's total).
- -Combined instruction and data figures for the L2 cache follow that.
- - -
cachegrind.out.pid. This file is human-readable, but is
-best interpreted by the accompanying program vg_annotate,
-described in the next section.
-
-Things to note about the cachegrind.out.pid file:
-
valgrind --cachesim=yes or
- cachegrind is run, and will overwrite any existing
- cachegrind.out.pid in the current directory (but
- that won't happen very often because it takes some time for process ids
- to be recycled).-
ls -l generates a file of about
- 350KB. Browsing a few files and web pages with a Konqueror
- built with full debugging information generates a file
- of around 15 MB.cachegrind.out (i.e. no .pid suffix).
-The suffix serves two purposes. Firstly, it means you don't have to rename old
-log files that you don't want to overwrite. Secondly, and more importantly,
-it allows correct profiling with the --trace-children=yes option
-of programs that spawn child processes.
-
-
-- -The interesting cache-simulation specific options are: - -
--I1=<size>,<associativity>,<line_size>--D1=<size>,<associativity>,<line_size>--L2=<size>,<associativity>,<line_size>- [default: uses CPUID for automagic cache configuration]
-
- Manually specifies the I1/D1/L2 cache configuration, where
- size and line_size are measured in bytes. The
- three items must be comma-separated, but with no spaces, eg:
-
-
cachegrind --I1=65536,2,64
-
- You can specify one, two or three of the I1/D1/L2 caches. Any level not
- manually specified will be simulated using the configuration found in the
- normal way (via the CPUID instruction, or failing that, via defaults).
-vg_annotate, it is worth widening your
-window to be at least 120 characters wide if possible, as the output
-lines can be quite long.
-
-To get a function-by-function summary, run vg_annotate
---pid in a directory containing a
-cachegrind.out.pid file. The --pid
-is required so that vg_annotate knows which log file to use when
-several are present.
-
-The output looks like this: - -
--------------------------------------------------------------------------------- -I1 cache: 65536 B, 64 B, 2-way associative -D1 cache: 65536 B, 64 B, 2-way associative -L2 cache: 262144 B, 64 B, 8-way associative -Command: concord vg_to_ucode.c -Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw -Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw -Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw -Threshold: 99% -Chosen for annotation: -Auto-annotation: on - --------------------------------------------------------------------------------- -Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw --------------------------------------------------------------------------------- -27,742,716 276 275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS - --------------------------------------------------------------------------------- -Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function --------------------------------------------------------------------------------- -8,821,482 5 5 2,242,702 1,621 73 1,794,230 0 0 getc.c:_IO_getc -5,222,023 4 4 2,276,334 16 12 875,959 1 1 concord.c:get_word -2,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp -2,521,927 2 2 591,215 0 0 179,398 0 0 concord.c:hash -2,242,740 2 2 1,046,612 568 22 448,548 0 0 ctype.c:tolower -1,496,937 4 4 630,874 9,000 1,400 279,388 0 0 concord.c:insert - 897,991 51 51 897,831 95 30 62 1 1 ???:??? - 598,068 1 1 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__flockfile - 598,068 0 0 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__funlockfile - 598,024 4 4 213,580 35 16 149,506 0 0 vg_clientmalloc.c:malloc - 446,587 1 1 215,973 2,167 430 129,948 14,057 13,957 concord.c:add_existing - 341,760 2 2 128,160 0 0 128,160 0 0 vg_clientmalloc.c:vg_trap_here_WRAPPER - 320,782 4 4 150,711 276 0 56,027 53 53 concord.c:init_hash_table - 298,998 1 1 106,785 0 0 64,071 1 1 concord.c:create - 149,518 0 0 149,516 0 0 1 0 0 ???:tolower@@GLIBC_2.0 - 149,518 0 0 149,516 0 0 1 0 0 ???:fgetc@@GLIBC_2.0 - 95,983 4 4 38,031 0 0 34,409 3,152 3,150 concord.c:new_word_node - 85,440 0 0 42,720 0 0 21,360 0 0 vg_clientmalloc.c:vg_bogus_epilogue -- -First up is a summary of the annotation options: - -
- -
- -
-
Ir : I cache reads (ie. instructions executed)I1mr: I1 cache read missesI2mr: L2 cache instruction read missesDr : D cache reads (ie. memory reads)D1mr: D1 cache read missesD2mr: L2 cache data read missesDw : D cache writes (ie. memory writes)D1mw: D1 cache write missesD2mw: L2 cache data write misses
- Note that D1 total misses is given by D1mr +
- D1mw, and that L2 total misses is given by
- I2mr + D2mr + D2mw.
- -
--show option.- -
Ir counts to lowest. If two functions have identical
- Ir counts, they will then be sorted by I1mr
- counts, and so on. This order can be adjusted with the
- --sort option.
-
- Note that this dictates the order the functions appear. It is not
- the order in which the columns appear; that is dictated by the "events
- shown" line (and can be changed with the --show option).
-
- -
vg_annotate by default omits functions
- that cause very low numbers of misses to avoid drowning you in
- information. In this case, vg_annotate shows summaries for the
- functions that account for 99% of the Ir counts;
- Ir is chosen as the threshold event since it is the
- primary sort event. The threshold can be adjusted with the
- --threshold option.- -
- -
--auto=yes option. In this case no.-
cachegrind.
-
-Then follows function-by-function statistics. Each function is
-identified by a file_name:function_name pair. If a column
-contains only a dot it means the function never performs
-that event (eg. the third row shows that strcmp()
-contains no instructions that write to memory). The name
-??? is used if the file name and/or function name
-could not be determined from debugging information. If most of the
-entries have the form ???:??? the program probably wasn't
-compiled with -g. If any code was invalidated (either due to
-self-modifying code or unloading of shared objects) its counts are aggregated
-into a single cost centre written as (discarded):(discarded).
- -It is worth noting that functions will come from three types of source files: -
concord.c in this example).getc.c)vg_clientmalloc.c:malloc). These are recognisable because
- the filename begins with vg_, and is probably one of
- vg_main.c, vg_clientmalloc.c or
- vg_mylibc.c.
- --auto=yes option. To do it
-manually, just specify the filenames as arguments to
-vg_annotate. For example, the output from running
-vg_annotate concord.c for our example produces the same
-output as above followed by an annotated version of
-concord.c, a section of which looks like:
-
-
---------------------------------------------------------------------------------
--- User-annotated source: concord.c
---------------------------------------------------------------------------------
-Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
-
-[snip]
-
- . . . . . . . . . void init_hash_table(char *file_name, Word_Node *table[])
- 3 1 1 . . . 1 0 0 {
- . . . . . . . . . FILE *file_ptr;
- . . . . . . . . . Word_Info *data;
- 1 0 0 . . . 1 1 1 int line = 1, i;
- . . . . . . . . .
- 5 0 0 . . . 3 0 0 data = (Word_Info *) create(sizeof(Word_Info));
- . . . . . . . . .
- 4,991 0 0 1,995 0 0 998 0 0 for (i = 0; i < TABLE_SIZE; i++)
- 3,988 1 1 1,994 0 0 997 53 52 table[i] = NULL;
- . . . . . . . . .
- . . . . . . . . . /* Open file, check it. */
- 6 0 0 1 0 0 4 0 0 file_ptr = fopen(file_name, "r");
- 2 0 0 1 0 0 . . . if (!(file_ptr)) {
- . . . . . . . . . fprintf(stderr, "Couldn't open '%s'.\n", file_name);
- 1 1 1 . . . . . . exit(EXIT_FAILURE);
- . . . . . . . . . }
- . . . . . . . . .
- 165,062 1 1 73,360 0 0 91,700 0 0 while ((line = get_word(data, line, file_ptr)) != EOF)
- 146,712 0 0 73,356 0 0 73,356 0 0 insert(data->word, data->line, table);
- . . . . . . . . .
- 4 0 0 1 0 0 2 0 0 free(data);
- 4 0 0 1 0 0 2 0 0 fclose(file_ptr);
- 3 0 0 2 0 0 . . . }
-
-
-(Although column widths are automatically minimised, a wide terminal is clearly
-useful.)
-
-Each source file is clearly marked (User-annotated source) as
-having been chosen manually for annotation. If the file was found in one of
-the directories specified with the -I/--include
-option, the directory and file are both given.
- -Each line is annotated with its event counts. Events not applicable for a line -are represented by a `.'; this is useful for distinguishing between an event -which cannot happen, and one which can but did not.
- -Sometimes only a small section of a source file is executed. To minimise -uninteresting output, Valgrind only shows annotated lines and lines within a -small distance of annotated lines. Gaps are marked with the line numbers so -you know which part of a file the shown code comes from, eg: - -
-(figures and code for line 704) --- line 704 ---------------------------------------- --- line 878 ---------------------------------------- -(figures and code for line 878) -- -The amount of context to show around annotated lines is controlled by the -
--context option.
-
-To get automatic annotation, run vg_annotate --auto=yes.
-vg_annotate will automatically annotate every source file it can find that is
-mentioned in the function-by-function summary. Therefore, the files chosen for
-auto-annotation are affected by the --sort and
---threshold options. Each source file is clearly marked
-(Auto-annotated source) as being chosen automatically. Any files
-that could not be found are mentioned at the end of the output, eg:
-
-
--------------------------------------------------------------------------------- -The following files chosen for auto-annotation could not be found: --------------------------------------------------------------------------------- - getc.c - ctype.c - ../sysdeps/generic/lockfile.c -- -This is quite common for library files, since libraries are usually compiled -with debugging information, but the source files are often not present on a -system. If a file is chosen for annotation both manually and -automatically, it is marked as
User-annotated source.
-
-Use the -I/--include option to tell Valgrind where to look for
-source files if the filenames found from the debugging information aren't
-specific enough.
-
-Beware that vg_annotate can take some time to digest large
-cachegrind.out.pid files, e.g. 30 seconds or more. Also
-beware that auto-annotation can produce a lot of output if your program is
-large!
-
-
-
-
-To do this, you just need to assemble your .s files with
-assembler-level debug information. gcc doesn't do this, but you can
-use the GNU assembler with the --gstabs option to
-generate object files with this information, eg:
-
-
as --gstabs foo.s
-
-You can then profile and annotate source files in the same way as for C/C++
-programs.
-
-
-vg_annotate options--pid
-
- Indicates which cachegrind.out.pid file to read.
- Not actually an option -- it is required.
-
-
-h, --help-
-v, --version- - Help and version, as usual.
--sort=A,B,C [default: order in
- cachegrind.out.pid]
- Specifies the events upon which the sorting of the function-by-function
- entries will be based. Useful if you want to concentrate on eg. I cache
- misses (--sort=I1mr,I2mr), or D cache misses
- (--sort=D1mr,D2mr), or L2 misses
- (--sort=D2mr,I2mr).
- -
--show=A,B,C [default: all, using order in
- cachegrind.out.pid]
- Specifies which events to show (and the column order). Default is to use
- all present in the cachegrind.out.pid file (and use
- the order in the file).
- -
--threshold=X [default: 99%]
- Sets the threshold for the function-by-function summary. Functions are
- shown that account for more than X% of the primary sort event. If
- auto-annotating, also affects which files are annotated.
-
- Note: thresholds can be set for more than one of the events by appending
- a colon and a number to any event given to the --sort option
- (no spaces, though). E.g. if you want to see the functions that cover
- 99% of L2 read misses and 99% of L2 write misses, use this option:
-
-
--sort=D2mr:99,D2mw:99
- - -
--auto=no [default]--auto=yes - When enabled, automatically annotates every file that is mentioned in the - function-by-function summary that can be found. Also gives a list of - those that couldn't be found. - -
--context=N [default: 8]- Print N lines of context before and after each annotated line. Avoids - printing large sections of source files that were not executed. Use a - large number (eg. 10,000) to show all source lines. -
- -
-I=<dir>, --include=<dir>
- [default: empty string]- Adds a directory to the list in which to search for files. Multiple - -I/--include options can be given to add multiple directories. -
cachegrind.out.pid file. This is because the
- information in cachegrind.out.pid is only recorded
- with line numbers, so if the line numbers change at all in the source
- (eg. lines added, deleted, swapped), any annotations will be
- incorrect.- -
cachegrind.out.pid file. If this
- happens, the figures for the bogus lines are printed anyway (clearly
- marked as bogus) in case they are important.-
- 1 0 0 . . . . . . leal -12(%ebp),%eax - 1 0 0 . . . 1 0 0 movl %eax,84(%ebx) - 2 0 0 0 0 0 1 0 0 movl $1,-20(%ebp) - . . . . . . . . . .align 4,0x90 - 1 0 0 . . . . . . movl $.LnrB,%eax - 1 0 0 . . . 1 0 0 movl %eax,-16(%ebp) -- - How can the third instruction be executed twice when the others are - executed only once? As it turns out, it isn't. Here's a dump of the - executable, using
objdump -d:
-
- - 8048f25: 8d 45 f4 lea 0xfffffff4(%ebp),%eax - 8048f28: 89 43 54 mov %eax,0x54(%ebx) - 8048f2b: c7 45 ec 01 00 00 00 movl $0x1,0xffffffec(%ebp) - 8048f32: 89 f6 mov %esi,%esi - 8048f34: b8 08 8b 07 08 mov $0x8078b08,%eax - 8048f39: 89 45 f0 mov %eax,0xfffffff0(%ebp) -- - Notice the extra
mov %esi,%esi instruction. Where did this
- come from? The GNU assembler inserted it to serve as the two bytes of
- padding needed to align the movl $.LnrB,%eax instruction on
- a four-byte boundary, but pretended it didn't exist when adding debug
- information. Thus when Valgrind reads the debug info it thinks that the
- movl $0x1,0xffffffec(%ebp) instruction covers the address
- range 0x8048f2b--0x8048f33 by itself, and attributes the counts for the
- mov %esi,%esi to it.-
inline_me() is defined in
- foo.h and inlined in the functions f1(),
- f2() and f3() in bar.c, there will
- not be a foo.h:inline_me() function entry. Instead, there
- will be separate function entries for each inlining site, ie.
- foo.h:f1(), foo.h:f2() and
- foo.h:f3(). To find the total counts for
- foo.h:inline_me(), add up the counts from each entry.
-
- The reason for this is that although the debug info output by gcc
- indicates the switch from bar.c to foo.h, it
- doesn't indicate the name of the function in foo.h, so
- Valgrind keeps using the old one.
- -
/home/user/proj/proj.h and ../proj.h. In this
- case, if you use auto-annotation, the file will be annotated twice with
- the counts split between the two.-
struct
- nlist defined in a.out.h under Linux is only a 16-bit
- value. Valgrind can handle some files with more than 65,535 lines
- correctly by making some guesses to identify line number overflows. But
- some cases are beyond it, in which case you'll get a warning message
- explaining that annotations for the file might be incorrect.-
-g and some without, some
- events that take place in a file without debug info could be attributed
- to the last line of a file with debug info (whichever one gets placed
- before the non-debug-info file in the executable).-
- -Note: stabs is not an easy format to read. If you come across bizarre -annotations that look like they might be caused by a bug in the stabs reader, -please let us know.
- - -
- -
- -
- -
- -
malloc() will allocate memory in different
- ways to the standard malloc(), which could warp the results.
- - -
- -
bts, btr and btc
- will incorrectly be counted as doing a data read if both the arguments
- are registers, eg:
-
- btsl %eax, %edx
-
- This should only happen rarely.
- - -
fsave) are treated as though they only access 16 bytes.
- These instructions seem to be rare so hopefully this won't affect
- accuracy much.
- -
valgrind.so file, the size of the program being
-profiled, or even the length of its name can perturb the results. Variations
-will be small, but don't expect perfectly repeatable results if your program
-changes at all.- -While these factors mean you shouldn't trust the results to be super-accurate, -hopefully they should be close enough to be useful.
- - -
-
- 2 How to use it, and how to
- make sense of the results
- 2.1 Getting started
- 2.2 The commentary
- 2.3 Reporting of errors
- 2.4 Suppressing errors
- 2.5 Command-line flags
- 2.6 Explanation of error messages
- 2.7 Writing suppressions files
- 2.8 The Client Request mechanism
- 2.9 Support for POSIX pthreads
- 2.10 Building and installing
- 2.11 If you have problems
-
- 3 Details of the checking machinery
- 3.1 Valid-value (V) bits
- 3.2 Valid-address (A) bits
- 3.3 Putting it all together
- 3.4 Signals
- 3.5 Memory leak detection
-
- 4 Limitations
-
- 5 How it works -- a rough overview
- 5.1 Getting started
- 5.2 The translation/instrumentation engine
- 5.3 Tracking the status of memory
- 5.4 System calls
- 5.5 Signals
-
- 6 An example
-
- 8 The design and implementation of Valgrind
-
-
-
diff --git a/docs/techdocs.html b/docs/techdocs.html
deleted file mode 100644
index 2e1cc8b7e..000000000
--- a/docs/techdocs.html
+++ /dev/null
@@ -1,2524 +0,0 @@
-
-
-jseward@acm.org
-http://developer.kde.org/~sewardj
-Copyright © 2000-2002 Julian Seward
-
-Valgrind is licensed under the GNU General Public License,
-version 2
-An open-source tool for finding memory-management problems in
-x86 GNU/Linux executables.
-
- - - - -
-You may need to read this document several times, and carefully. Some -important things, I only say once. - - -
-Most of the rest of 2001 was taken up designing and implementing the -instrumentation scheme. The main difficulty, which consumed a lot -of effort, was to design a scheme which did not generate large numbers -of false uninitialised-value warnings. By late 2001 a satisfactory -scheme had been arrived at, and I started to test it on ever-larger -programs, with an eventual eye to making it work well enough so that -it was helpful to folks debugging the upcoming version 3 of KDE. I've -used KDE since before version 1.0, and wanted Valgrind to be an -indirect contribution to the KDE 3 development effort. At the start of -Feb 02 the kde-core-devel crew started using it, and gave a huge -amount of helpful feedback and patches in the space of three weeks. -Snapshot 20020306 is the result.
-In the best Unix tradition, or perhaps in the spirit of Fred Brooks'
-depressing-but-completely-accurate epitaph "build one to throw away;
-you will anyway", much of Valgrind is a second or third rendition of
-the initial idea. The instrumentation machinery
-(vg_translate.c, vg_memory.c) and core CPU
-simulation (vg_to_ucode.c, vg_from_ucode.c)
-have had three redesigns and rewrites; the register allocator,
-low-level memory manager (vg_malloc2.c) and symbol table
-reader (vg_symtab2.c) are on the second rewrite. In a
-sense, this document serves to record some of the knowledge gained as
-a result.
-
-
-
valgrind.so, and also a dummy one,
-valgrinq.so, of which more later. The
-valgrind shell script adds valgrind.so to
-the LD_PRELOAD list of extra libraries to be
-loaded with any dynamically linked library. This is a standard trick,
-one which I assume the LD_PRELOAD mechanism was developed
-to support.
-
-
-valgrind.so
-is linked with the -z initfirst flag, which requests that
-its initialisation code is run before that of any other object in the
-executable image. When this happens, valgrind gains control. The
-real CPU becomes "trapped" in valgrind.so and the
-translations it generates. The synthetic CPU provided by Valgrind
-does, however, return from this initialisation function. So the
-normal startup actions, orchestrated by the dynamic linker
-ld.so, continue as usual, except on the synthetic CPU,
-not the real one. Eventually main is run and returns,
-and then the finalisation code of the shared objects is run,
-presumably in inverse order to which they were initialised. Remember,
-this is still all happening on the simulated CPU. Eventually
-valgrind.so's own finalisation code is called. It spots
-this event, shuts down the simulated CPU, prints any error summaries
-and/or does leak detection, and returns from the initialisation code
-on the real CPU. At this point, in effect the real and synthetic CPUs
-have merged back into one, Valgrind has lost control of the program,
-and the program finally exit()s back to the kernel in the
-usual way.
-
-
-The normal course of activity, one Valgrind has started up, is as
-follows. Valgrind never runs any part of your program (usually
-referred to as the "client"), not a single byte of it, directly.
-Instead it uses function VG_(translate) to translate
-basic blocks (BBs, straight-line sequences of code) into instrumented
-translations, and those are run instead. The translations are stored
-in the translation cache (TC), vg_tc, with the
-translation table (TT), vg_tt supplying the
-original-to-translation code address mapping. Auxiliary array
-VG_(tt_fast) is used as a direct-map cache for fast
-lookups in TT; it usually achieves a hit rate of around 98% and
-facilitates an orig-to-trans lookup in 4 x86 insns, which is not bad.
-
-
-Function VG_(dispatch) in vg_dispatch.S is
-the heart of the JIT dispatcher. Once a translated code address has
-been found, it is executed simply by an x86 call
-to the translation. At the end of the translation, the next
-original code addr is loaded into %eax, and the
-translation then does a ret, taking it back to the
-dispatch loop, with, interestingly, zero branch mispredictions.
-The address requested in %eax is looked up first in
-VG_(tt_fast), and, if not found, by calling C helper
-VG_(search_transtab). If there is still no translation
-available, VG_(dispatch) exits back to the top-level
-C dispatcher VG_(toploop), which arranges for
-VG_(translate) to make a new translation. All fairly
-unsurprising, really. There are various complexities described below.
-
-
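-
-The fast-path lookup can be pictured like this (a rough C rendering only;
-the real fast path is assembly in vg_dispatch.S, and the table shape, size
-and helper signatures here are guesses for illustration, not the real
-declarations):
-
-   typedef unsigned long Addr;
-
-   typedef struct { Addr orig; Addr trans; } FastEntry;  /* assumed shape  */
-
-   #define N_FAST 8192                  /* size is an assumption           */
-   static FastEntry tt_fast[N_FAST];    /* plays the role of VG_(tt_fast)  */
-
-   /* stand-ins for VG_(search_transtab) and VG_(translate) */
-   static Addr search_transtab(Addr orig) { (void)orig; return 0; }
-   static Addr make_translation(Addr orig) { (void)orig; return 0; }
-
-   static Addr find_translation(Addr orig)
-   {
-      /* 1. direct-mapped fast cache: one hash, one compare */
-      FastEntry* e = &tt_fast[(orig >> 2) % N_FAST];
-      if (e->orig == orig && e->trans != 0)
-         return e->trans;
-
-      /* 2. miss in the fast cache: search the full translation table */
-      Addr trans = search_transtab(orig);
-
-      /* 3. still no translation: have a new one made */
-      if (trans == 0)
-         trans = make_translation(orig);
-
-      /* refill the fast cache slot before returning */
-      e->orig  = orig;
-      e->trans = trans;
-      return trans;
-   }
-
-   int main(void) { find_translation(0x8048000UL); return 0; }
-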
-The translator, orchestrated by VG_(translate), is
-complicated but entirely self-contained. It is described in great
-detail in subsequent sections. Translations are stored in TC, with TT
-tracking administrative information. The translations are subject to
-an approximate LRU-based management scheme. With the current
-settings, the TC can hold at most about 15MB of translations, and LRU
-passes prune it to about 13.5MB. Given that the
-orig-to-translation expansion ratio is about 13:1 to 14:1, this means
-TC holds translations for more or less a megabyte of original code,
-which generally comes to about 70000 basic blocks for C++ compiled
-with optimisation on. Generating new translations is expensive, so it
-is worth having a large TC to minimise the (capacity) miss rate.
-
-
-The dispatcher, VG_(dispatch), receives hints from
-the translations which allow it to cheaply spot all control
-transfers corresponding to x86 call and ret
-instructions. It has to do this in order to spot some special events:
-
VG_(shutdown). This is Valgrind's cue to
- exit. NOTE: actually this is done a different way; it should be
- cleaned up.
--
VG_(signalreturn_bogusRA). The signal simulator
- needs to know when a signal handler is returning, so we spot
- jumps (returns) to this address.
--
vg_trap_here. All malloc,
- free, etc calls that the client program makes are
- eventually routed to a call to vg_trap_here,
- and Valgrind does its own special thing with these calls.
- In effect this provides a trapdoor, by which Valgrind can
- intercept certain calls on the simulated CPU, run the call as it
- sees fit itself (on the real CPU), and return the result to
- the simulated CPU, quite transparently to the client program.
-malloc,
-free, etc,
-calls, so that it can store additional information. Each block
-malloc'd by the client gives rise to a shadow block
-in which Valgrind stores the call stack at the time of the
-malloc
-call. When the client calls free, Valgrind tries to
-find the shadow block corresponding to the address passed to
-free, and emits an error message if none can be found.
-If it is found, the block is placed on the freed blocks queue
-vg_freed_list, it is marked as inaccessible, and
-its shadow block now records the call stack at the time of the
-free call. Keeping free'd blocks in
-this queue allows Valgrind to spot all (presumably invalid) accesses
-to them. However, once the volume of blocks in the free queue
-exceeds VG_(clo_freelist_vol), blocks are finally
-removed from the queue.
-
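-
-In outline, that bookkeeping might look something like the following in C
-(an illustrative sketch; the field names, the fixed-depth stack trace and
-the limit value are assumptions, not the real vg_clientmalloc.c layout):
-
-   #include <stddef.h>
-
-   #define TRACE_DEPTH 4
-
-   typedef struct ShadowBlock {
-      void*  payload;                           /* address given to client */
-      size_t size;
-      unsigned long where_alloced[TRACE_DEPTH]; /* stack at malloc time    */
-      unsigned long where_freed  [TRACE_DEPTH]; /* filled in at free time  */
-      struct ShadowBlock* next;
-   } ShadowBlock;
-
-   static ShadowBlock* freed_head = NULL;       /* oldest freed block      */
-   static ShadowBlock* freed_tail = NULL;       /* newest freed block      */
-   static size_t freed_vol   = 0;
-   static size_t freed_limit = 1000000;         /* stands in for the
-                                                   VG_(clo_freelist_vol)
-                                                   volume cap              */
-
-   static void put_on_freed_queue(ShadowBlock* sb)
-   {
-      /* keep the block around so that later (invalid) accesses to it
-         can still be spotted and reported */
-      sb->next = NULL;
-      if (freed_tail) freed_tail->next = sb; else freed_head = sb;
-      freed_tail  = sb;
-      freed_vol  += sb->size;
-
-      /* once the queue exceeds the volume limit, really recycle the
-         oldest blocks */
-      while (freed_vol > freed_limit && freed_head != NULL) {
-         ShadowBlock* oldest = freed_head;
-         freed_head = oldest->next;
-         if (freed_head == NULL) freed_tail = NULL;
-         freed_vol -= oldest->size;
-         /* here the real code would hand the block's storage back to
-            the low-level allocator */
-      }
-   }
-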
-
-Keeping track of A and V bits (note: if you don't know what these are,
-you haven't read the user guide carefully enough) for memory is done
-in vg_memory.c. This implements a sparse array structure
-which covers the entire 4G address space in a way which is reasonably
-fast and reasonably space efficient. The 4G address space is divided
-up into 64K sections, each covering 64Kb of address space. Given a
-32-bit address, the top 16 bits are used to select one of the 65536
-entries in VG_(primary_map). The resulting "secondary"
-(SecMap) holds A and V bits for the 64k of address space
-chunk corresponding to the lower 16 bits of the address.
-
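-
-In outline (an illustrative sketch only; the real SecMap packs A and V
-bits differently, and handles initialisation of fresh chunks properly):
-
-   #include <stdlib.h>
-
-   typedef unsigned int  UInt;
-   typedef unsigned char UChar;
-
-   typedef struct {
-      UChar abits[65536 / 8];    /* one A bit per byte of the 64KB chunk   */
-      UChar vbyte[65536];        /* eight V bits per byte of the chunk     */
-   } SecMap;
-
-   static SecMap* primary_map[65536];   /* stands in for VG_(primary_map)  */
-
-   static SecMap* get_secmap(UInt addr)
-   {
-      UInt i = addr >> 16;              /* top 16 bits pick the entry      */
-      if (primary_map[i] == NULL) {
-         /* simplification: the real code would initialise a fresh chunk's
-            A/V bits to "inaccessible"/"undefined", not to zero */
-         primary_map[i] = calloc(1, sizeof(SecMap));
-      }
-      return primary_map[i];
-   }
-
-   /* fetch the V bits for the byte at addr */
-   static UChar get_vbyte(UInt addr)
-   {
-      return get_secmap(addr)->vbyte[addr & 0xFFFF];  /* low 16 bits index
-                                                         within the chunk */
-   }
-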
-
-
-Valgrind's answer is: cheat. Valgrind is designed so that it is
-possible to switch back to running the client program on the real
-CPU at any point. Using the --stop-after= flag, you can
-ask Valgrind to run just some number of basic blocks, and then
-run the rest of the way on the real CPU. If you are searching for
-a bug in the simulated CPU, you can use this to do a binary search,
-which quickly leads you to the specific basic block which is
-causing the problem.
-
-
-This is all very handy. It does constrain the design in certain
-unimportant ways. Firstly, the layout of memory, when viewed from the
-client's point of view, must be identical regardless of whether it is
-running on the real or simulated CPU. This means that Valgrind can't
-do pointer swizzling -- well, no great loss -- and it can't run on
-the same stack as the client -- again, no great loss.
-Valgrind operates on its own stack, VG_(stack), which
-it switches to at startup, temporarily switching back to the client's
-stack when doing system calls for the client.
-
-
-Valgrind also receives signals on its own stack,
-VG_(sigstack), but for different gruesome reasons
-discussed below.
-
-
-This nice clean switch-back-to-the-real-CPU-whenever-you-like story
-is muddied by signals. Problem is that signals arrive at arbitrary
-times and tend to slightly perturb the basic block count, with the
-result that you can get close to the basic block causing a problem but
-can't home in on it exactly. My kludgey hack is to define
-SIGNAL_SIMULATION to 1 towards the bottom of
-vg_syscall_mem.c, so that signal handlers are run on the
-real CPU and don't change the BB counts.
-
-
-A second hole in the switch-back-to-real-CPU story is that Valgrind's -way of delivering signals to the client is different from that of the -kernel. Specifically, the layout of the signal delivery frame, and -the mechanism used to detect a sighandler returning, are different. -So you can't expect to make the transition inside a sighandler and -still have things working, but in practice that's not much of a -restriction. - -
-Valgrind's implementation of malloc, free,
-etc, (in vg_clientmalloc.c, not the low-level stuff in
-vg_malloc2.c) is somewhat complicated by the need to
-handle switching back at arbitrary points. It does work tho.
-
-
-
-
- I am of the view that it's acceptable to spend 5% of the total - running time of your valgrindified program doing assertion checks - and other internal sanity checks. -
-
VG_(do_sanity_checks)
- runs every 1000 basic blocks, which means 500 to 2000 times/second
- for typical machines at present. It checks that Valgrind hasn't
- overrun its private stack, and does some simple checks on the
- memory permissions maps. Once every 25 calls it does some more
- extensive checks on those maps. Etc, etc.
- - The following components also have sanity check code, which can - be enabled to aid debugging: -
VG_(mallocSanityCheckArena)). This does a
- complete check of all blocks and chains in an arena, which
- is very slow. Is not engaged by default.
- -
VG_(read_symbols)
- for a start. Is permanently engaged.
- -
vg_memory.c.
- This can be compiled with cpp symbol
- VG_DEBUG_MEMORY defined, which removes all the
- fast, optimised cases, and uses simple-but-slow fallbacks
- instead. Not engaged by default.
- -
VG_DEBUG_LEAKCHECK.
- -
VG_(saneUInstr) and sanity checks the sequence
- as a whole with VG_(saneUCodeBlock). This stuff
- is engaged by default, and has caught some way-obscure bugs
- in the simulated CPU machinery in its time.
- -
VG_(first_and_last_secondaries_look_plausible) after
- every syscall; this is known to pick up bugs in the syscall
- wrappers. Engaged by default.
- -
VG_(dispatch), checks
- that translations do not set %ebp to any value
- different from VG_EBP_DISPATCH_CHECKED or
- & VG_(baseBlock). In effect this test is free,
- and is permanently engaged.
- -
vg_do_register_allocation.
- -
-Some more specific things are: - -
ld.so's point of view, and it therefore absolutely
- had better not export any symbol with a name which could clash
- with that of the client or any of its libraries. Therefore, all
- globally visible symbols exported from valgrind.so
- are defined using the VG_ CPP macro. As you'll see
- from vg_constants.h, this appends some arbitrary
- prefix to the symbol, in order that it be, we hope, globally
- unique. Currently the prefix is vgPlain_. For
- convenience there are also VGM_, VGP_
- and VGOFF_. All locally defined symbols are declared
- static and do not appear in the final shared object.
-
- To check this, I periodically do
- nm valgrind.so | grep " T ",
- which shows you all the globally exported text symbols.
- They should all have an approved prefix, except for those like
- malloc, free, etc, which we deliberately
- want to shadow and take precedence over the same names exported
- from glibc.so, so that valgrind can intercept those
- calls easily. Similarly, nm valgrind.so | grep " D "
- allows you to find any rogue data-segment symbol names.
-
-
glibc.so. For example, we have our own low-level
- memory manager in vg_malloc2.c, which is a fairly
- standard malloc/free scheme augmented with arenas, and
- vg_mylibc.c exports reimplementations of various bits
- and pieces you'd normally get from the C library.
-
- Why all the hassle? Because imagine the potential chaos of both
- the simulated and real CPUs executing in glibc.so.
- It just seems simpler and cleaner to be completely self-contained,
- so that only the simulated CPU visits glibc.so. In
- practice it's not much hassle anyway. Also, valgrind starts up
- before glibc has a chance to initialise itself, and who knows what
- difficulties that could lead to. Finally, glibc has definitions
- for some types, specifically sigset_t, which conflict
- (are different from) the Linux kernel's idea of same. When
- Valgrind wants to fiddle around with signal stuff, it wants to
- use the kernel's definitions, not glibc's definitions. So it's
- simplest just to keep glibc out of the picture entirely.
-
- To find out which glibc symbols are used by Valgrind, reinstate
- the link flags -nostdlib -Wl,-no-undefined. This
- causes linking to fail, but will tell you what you depend on.
- I have mostly, but not entirely, got rid of the glibc
- dependencies; what remains is, IMO, fairly harmless. AFAIK the
- current dependencies are: memset,
- memcmp, stat, system,
- sbrk, setjmp and longjmp.
-
-
-
vg_syscall_mem imports, via
- vg_unsafe.h, a significant number of C-library
- headers so as to know the sizes of various structs passed across
- the kernel boundary. This is of course completely bogus, since
- there is no guarantee that the C library's definitions of these
- structs matches those of the kernel. I have started to sort this
- out using vg_kerneliface.h, into which I had intended
- to copy all kernel definitions which valgrind could need, but this
- has not gotten very far. At the moment it mostly contains
- definitions for sigset_t and struct
- sigaction, since the kernel's definition for these really
- does clash with glibc's. I plan to use a vki_ prefix
- on all these types and constants, to denote the fact that they
- pertain to Valgrind's Kernel Interface.
-
- Another advantage of having a vg_kerneliface.h file
- is that it makes it simpler to interface to a different kernel.
- One can, for example, easily imagine writing a new
- vg_kerneliface.h for FreeBSD, or x86 NetBSD.
-
-
-No MMX. Fixing this should be relatively easy, using the same giant -trick used for x86 FPU instructions. See below. -
-Support for weird (non-POSIX) signal stuff is patchy. Does anybody -care? -
- - - - -
-Since it generates x86 code in memory, Valgrind has complete control
-of the use of registers in the translations. Now pay attention. I
-shall say this only once, and it is important you understand this. In
-what follows I will refer to registers in the host (real) cpu using
-their standard names, %eax, %edi, etc. I
-refer to registers in the simulated CPU by capitalising them:
-%EAX, %EDI, etc. These two sets of
-registers usually bear no direct relationship to each other; there is
-no fixed mapping between them. This naming scheme is used fairly
-consistently in the comments in the sources.
-
-Host registers, once things are up and running, are used as follows: -
%esp, the real stack pointer, points
- somewhere in Valgrind's private stack area,
- VG_(stack) or, transiently, into its signal delivery
- stack, VG_(sigstack).
--
%edi is used as a temporary in code generation; it
- is almost always dead, except when used for the Left
- value-tag operations.
--
%eax, %ebx, %ecx,
- %edx and %esi are available to
- Valgrind's register allocator. They are dead (carry unimportant
- values) in between translations, and are live only in
- translations. The one exception to this is %eax,
- which, as mentioned far above, has a special significance to the
- dispatch loop VG_(dispatch): when a translation
- returns to the dispatch loop, %eax is expected to
- contain the original-code-address of the next translation to run.
- The register allocator is so good at minimising spill code that
- using five regs and not having to save/restore %edi
- actually gives better code than allocating to %edi
- as well, but then having to push/pop it around special uses.
--
%ebp points permanently at
- VG_(baseBlock). Valgrind's translations are
- position-independent, partly because this is convenient, but also
- because translations get moved around in TC as part of the LRUing
- activity. All static entities which need to be referred to
- from generated code, whether data or helper functions, are stored
- starting at VG_(baseBlock) and are therefore reached
- by indexing from %ebp. There is but one exception,
- which is that by placing the value
- VG_EBP_DISPATCH_CHECKED
- in %ebp just before a return to the dispatcher,
- the dispatcher is informed that the next address to run,
- in %eax, requires special treatment.
--
%eflags
- register.
-
-The state of the simulated CPU is stored in memory, in
-VG_(baseBlock), which is a block of 200 words IIRC.
-Recall that %ebp points permanently at the start of this
-block. Function vg_init_baseBlock decides what the
-offsets of various entities in VG_(baseBlock) are to be,
-and allocates word offsets for them. The code generator then emits
-%ebp relative addresses to get at those things. The
-sequence in which entities are allocated has been carefully chosen so
-that the 32 most popular entities come first, because this means 8-bit
-offsets can be used in the generated code.
-
-
-If I was clever, I could make %ebp point 32 words along
-VG_(baseBlock), so that I'd have another 32 words of
-short-form offsets available, but that's just complicated, and it's
-not important -- the first 32 words take 99% (or whatever) of the
-traffic.
-
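-
-The offset-allocation idea can be pictured with a small illustrative
-fragment (the names here are made up, not the real vg_init_baseBlock
-code):
-
-   typedef unsigned int UInt;
-
-   #define N_BASEBLOCK_WORDS 200
-   UInt baseBlock[N_BASEBLOCK_WORDS];   /* %ebp points at baseBlock[0]     */
-
-   static int n_words_used = 0;
-
-   /* hand out the next free word offset(s); entities registered early get
-      small offsets, and offsets below 32 words fit in a one-byte
-      %ebp displacement in the emitted code */
-   static int alloc_baseBlock_words(int n)
-   {
-      int off = n_words_used;
-      n_words_used += n;
-      return off;
-   }
-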
-
-Currently, the sequence of stuff in VG_(baseBlock) is as
-follows:
-
%EAX .. %EDI, and the simulated flags,
- %EFLAGS.
--
-
VG_(helper_value_check4_fail),
- VG_(helper_value_check0_fail),
- which register V-check failures,
- VG_(helperc_STOREV4),
- VG_(helperc_STOREV1),
- VG_(helperc_LOADV4),
- VG_(helperc_LOADV1),
- which do stores and loads of V bits to/from the
- sparse array which keeps track of V bits in memory,
- and
- VGM_(handle_esp_assignment), which messes with
- memory addressibility resulting from changes in %ESP.
--
%EIP.
--
-
VG_(helperc_STOREV2),
- VG_(helperc_LOADV2). These are here because 2-byte
- loads and stores are relatively rare, so are placed above the
- magic 32-word offset boundary.
--
VGM_(fpu_write_check) and
- VGM_(fpu_read_check), which handle the A/V maps
- testing and changes required by FPU writes/reads.
--
VG_(helper_value_check2_fail) and
- VG_(helper_value_check1_fail). These are probably
- never emitted now, and should be removed.
--
-
vg_helpers.S, which deal with rare situations which
- are tedious or difficult to generate code in-line for.
-
-As a general rule, the simulated machine's state lives permanently in
-memory at VG_(baseBlock). However, the JITter does some
-optimisations which allow the simulated integer registers to be
-cached in real registers over multiple simulated instructions within
-the same basic block. These are always flushed back into memory at
-the end of every basic block, so that the in-memory state is
-up-to-date between basic blocks. (This flushing is implied by the
-statement above that the real machine's allocatable registers are
-dead in between simulated blocks).
-
-
-
VG_(startup), called from
-valgrind.so's initialisation section), really means
-copying the real CPU's state into VG_(baseBlock), and
-then installing our own stack pointer, etc, into the real CPU, and
-then starting up the JITter. Exiting valgrind involves copying the
-simulated state back to the real state.
-
-
-Unfortunately, there's a complication at startup time. Problem is
-that at the point where we need to take a snapshot of the real CPU's
-state, the offsets in VG_(baseBlock) are not set up yet,
-because to do so would involve disrupting the real machine's state
-significantly. The way round this is to dump the real machine's state
-into a temporary, static block of memory,
-VG_(m_state_static). We can then set up the
-VG_(baseBlock) offsets at our leisure, and copy into it
-from VG_(m_state_static) at some convenient later time.
-This copying is done by
-VG_(copy_m_state_static_to_baseBlock).
-
-
-On exit, the inverse transformation is (rather unnecessarily) used:
-stuff in VG_(baseBlock) is copied to
-VG_(m_state_static), and the assembly stub then copies
-from VG_(m_state_static) into the real machine registers.
-
-
-Doing system calls on behalf of the client (vg_syscall.S)
-is something of a half-way house. We have to make the world look
-sufficiently like that which the client would normally have to make
-the syscall actually work properly, but we can't afford to lose
-control. So the trick is to copy all of the client's state, except
-its program counter, into the real CPU, do the system call, and
-copy the state back out. Note that the client's state includes its
-stack pointer register, so one effect of this partial restoration is
-to cause the system call to be run on the client's stack, as it should
-be.
-
-
-As ever there are complications. We have to save some of our own state -somewhere when restoring the client's state into the CPU, so that we -can keep going sensibly afterwards. In fact the only thing which is -important is our own stack pointer, but for paranoia reasons I save -and restore our own FPU state as well, even though that's probably -pointless. - -
-The complication on the above complication is, that for horrible
-reasons to do with signals, we may have to handle a second client
-system call whilst the client is blocked inside some other system
-call (unbelievable!). That means there's two sets of places to
-dump Valgrind's stack pointer and FPU state across the syscall,
-and we decide which to use by consulting
-VG_(syscall_depth), which is in turn maintained by
-VG_(wrap_syscall).
-
-
-
-
-In normal operation, translation proceeds through six stages,
-coordinated by VG_(translate):
-
VG_(disBB)).
--
vg_improve), with the aim of
- caching simulated registers in real registers over multiple
- simulated instructions, and removing redundant simulated
- %EFLAGS saving/restoring.
--
vg_instrument), which adds
- value and address checking code.
--
vg_cleanup), removing
- redundant value-check computations.
--
vg_do_register_allocation),
- which, note, is done on UCode.
--
VG_(emit_code)).
-
-Notice how steps 2, 3, 4 and 5 are simple UCode-to-UCode
-transformation passes, all on straight-line blocks of UCode (type
-UCodeBlock). Steps 2 and 4 are optimisation passes and
-can be disabled for debugging purposes, with
---optimise=no and --cleanup=no respectively.
-
-
-Valgrind can also run in a no-instrumentation mode, given
---instrument=no. This is useful for debugging the JITter
-quickly without having to deal with the complexity of the
-instrumentation mechanism too. In this mode, steps 3 and 4 are
-omitted.
-
-
-These flags combine, so that --instrument=no together with
---optimise=no means only steps 1, 5 and 6 are used.
---single-step=yes causes each x86 instruction to be
-treated as a single basic block. The translations are terrible but
-this is sometimes instructive.
-
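-
-Putting the six stages and these flags together, the overall shape of one
-translation is roughly as follows (an illustrative outline using the pass
-names from the text; the clo_* variables and the exact signatures are
-stand-ins, not the real ones):
-
-   typedef struct UCodeBlock_ UCodeBlock;
-   typedef unsigned int Addr;
-
-   extern UCodeBlock* disBB(Addr orig);
-   extern void vg_improve(UCodeBlock*), vg_instrument(UCodeBlock*),
-               vg_cleanup(UCodeBlock*), vg_do_register_allocation(UCodeBlock*),
-               emit_code(UCodeBlock*);
-   extern int clo_optimise, clo_instrument, clo_cleanup;
-
-   void translate_one_bb(Addr orig_addr)
-   {
-      UCodeBlock* cb = disBB(orig_addr);    /* 1. parse x86 into UCode     */
-      if (clo_optimise)
-         vg_improve(cb);                    /* 2. skipped by --optimise=no */
-      if (clo_instrument) {
-         vg_instrument(cb);                 /* 3. add A/V checking code    */
-         if (clo_cleanup)
-            vg_cleanup(cb);                 /* 4. skipped by --cleanup=no  */
-      }
-      vg_do_register_allocation(cb);        /* 5. reg-alloc, still UCode   */
-      emit_code(cb);                        /* 6. emit final x86 code      */
-   }
-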
-
-The --stop-after=N flag switches back to the real CPU
-after N basic blocks. It also re-JITs the final basic
-block executed and prints the debugging info resulting, so this
-gives you a way to get a quick snapshot of how a basic block looks as
-it passes through the six stages mentioned above. If you want to
-see full information for every block translated (probably not, but
-still ...) find, in VG_(translate), the lines
- dis = True;
- dis = debugging_translation;
-
-and comment out the second line. This will spew out debugging
-junk faster than you can possibly imagine.
-
-
-
-
Tag
-UCode instructions have up to three operand fields, each of which has
-a corresponding Tag describing it. Possible values for
-the tag are:
-
-
NoValue: indicates that the field is not in use.
--
Lit16: the field contains a 16-bit literal.
--
Literal: the field denotes a 32-bit literal, whose
- value is stored in the lit32 field of the uinstr
- itself. Since there is only one lit32 for the whole
- uinstr, only one operand field may contain this tag.
--
SpillNo: the field contains a spill slot number, in
- the range 0 to 23 inclusive, denoting one of the spill slots
- contained inside VG_(baseBlock). Such tags only
- exist after register allocation.
--
RealReg: the field contains a number in the range 0
- to 7 denoting an integer x86 ("real") register on the host. The
- number is the Intel encoding for integer registers. Such tags
- only exist after register allocation.
--
ArchReg: the field contains a number in the range 0
- to 7 denoting an integer x86 register on the simulated CPU. In
- reality this means a reference to one of the first 8 words of
- VG_(baseBlock). Such tags can exist at any point in
- the translation process.
--
TempReg. The field contains the
- number of one of an infinite set of virtual (integer)
- registers. TempRegs are used everywhere throughout
- the translation process; you can have as many as you want. The
- register allocator maps as many as it can into
- RealRegs and turns the rest into
- SpillNos, so TempRegs should not exist
- after the register allocation phase.
-
- TempRegs are always 32 bits long, even if the data
- they hold is logically shorter. In that case the upper unused
- bits are required, and, I think, generally assumed, to be zero.
- TempRegs holding V bits for quantities shorter than
- 32 bits are expected to have ones in the unused places, since a
- one denotes "undefined".
-
UInstr-UCode was carefully designed to make it possible to do register -allocation on UCode and then translate the result into x86 code -without needing any extra registers ... well, that was the original -plan, anyway. Things have gotten a little more complicated since -then. In what follows, UCode instructions are referred to as uinstrs, -to distinguish them from x86 instructions. Uinstrs of course have -uopcodes which are (naturally) different from x86 opcodes. - -
-A uinstr (type UInstr) contains
-various fields, not all of which are used by any one uopcode:
-
val1, val2
- and val3.
--
tag1, tag2
- and tag3. Each of these has a value of type
- Tag,
- and they describe what the val1, val2
- and val3 fields contain.
--
-
FlagSets, specifying which x86 condition codes are
- read and written by the uinstr.
--
Opcode.
--
-
Condcode, indicating the condition
- which applies. The encoding is as it is in the x86 insn stream,
- except we add a 17th value CondAlways to indicate
- an unconditional transfer.
--
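-
-Collecting the above into a small C sketch (the real declarations in the
-sources have more fields and different types; this is just to fix ideas):
-
-   typedef enum { NoValue, Lit16, Literal, SpillNo,
-                  RealReg, ArchReg, TempReg } Tag;
-
-   typedef struct {
-      int            opcode;            /* a uopcode (type Opcode)         */
-      unsigned int   lit32;             /* 32-bit literal, if any operand
-                                           carries the Literal tag         */
-      unsigned short val1, val2, val3;  /* the three operand fields        */
-      Tag            tag1, tag2, tag3;  /* what each operand field holds   */
-      unsigned char  cond;              /* Condcode, for conditional jumps */
-      unsigned char  flags_r, flags_w;  /* FlagSets: condition codes read
-                                           and written by this uinstr      */
-   } UInstrSketch;
-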
-UOpcodes (type Opcode) are divided into two groups: those
-necessary merely to express the functionality of the x86 code, and
-extra uopcodes needed to express the instrumentation. The former
-group contains:
-
GET and PUT, which move values from the
- simulated CPU's integer registers (ArchRegs) into
- TempRegs, and back. GETF and
- PUTF do the corresponding thing for the simulated
- %EFLAGS. There are no corresponding insns for the
- FPU register stack, since we don't explicitly simulate its
- registers.
--
LOAD and STORE, which, in RISC-like
- fashion, are the only uinstrs able to interact with memory.
--
MOV and CMOV allow unconditional and
- conditional moves of values between TempRegs.
--
TempRegs (before reg-alloc) or RealRegs
- (after reg-alloc). These are: ADD, ADC,
- AND, OR, XOR,
- SUB, SBB, SHL,
- SHR, SAR, ROL,
- ROR, RCL, RCR,
- NOT, NEG, INC,
- DEC, BSWAP, CC2VAL and
- WIDEN. WIDEN does signed or unsigned
- value widening. CC2VAL is used to convert condition
- codes into a value, zero or one. The rest are obvious.
-
- To allow for more efficient code generation, we bend slightly the
- restriction at the start of the previous para: for
- ADD, ADC, XOR,
- SUB and SBB, we allow the first (source)
- operand to also be an ArchReg, that is, one of the
- simulated machine's registers. Also, many of these ALU ops allow
- the source operand to be a literal. See
- VG_(saneUInstr) for the final word on the allowable
- forms of uinstrs.
-
-
LEA1 and LEA2 are not strictly
- necessary, but facilitate better translations. They
- record the fancy x86 addressing modes in a direct way, which
- allows those amodes to be emitted back into the final
- instruction stream more or less verbatim.
--
CALLM calls a machine-code helper, one of the methods
- whose address is stored at some VG_(baseBlock)
- offset. PUSH and POP move values
- to/from TempReg to the real (Valgrind's) stack, and
- CLEAR removes values from the stack.
- CALLM_S and CALLM_E delimit the
- boundaries of call setups and clearings, for the benefit of the
- instrumentation passes. Getting this right is critical, and so
- VG_(saneUCodeBlock) makes various checks on the use
- of these uopcodes.
-
- It is important to understand that these uopcodes have nothing to
- do with the x86 call, return,
- push or pop instructions, and are not
- used to implement them. Those guys turn into combinations of
- GET, PUT, LOAD,
- STORE, ADD, SUB, and
- JMP. What these uopcodes support is calling of
- helper functions such as VG_(helper_imul_32_64),
- which do stuff which is too difficult or tedious to emit inline.
-
-
FPU, FPU_R and FPU_W.
- Valgrind doesn't attempt to simulate the internal state of the
- FPU at all. Consequently it only needs to be able to distinguish
- FPU ops which read and write memory from those that don't, and
- for those which do, it needs to know the effective address and
- data transfer size. This is made easier because the x86 FP
- instruction encoding is very regular, basically consisting of
- 16 bits for a non-memory FPU insn and 11 (IIRC) bits + an address mode
- for a memory FPU insn. So our FPU uinstr carries
- the 16 bits in its val1 field. And
- FPU_R and FPU_W carry 11 bits in that
- field, together with the identity of a TempReg or
- (later) RealReg which contains the address.
--
JIFZ is unique, in that it allows a control-flow
- transfer which is not deemed to end a basic block. It causes a
- jump to a literal (original) address if the specified argument
- is zero.
--
INCEIP advances the simulated
- %EIP by the specified literal amount. This supports
- lazy %EIP updating, as described below.
--Stages 1 and 2 of the 6-stage translation process mentioned above -deal purely with these uopcodes, and no others. They are -sufficient to express pretty much all the x86 32-bit protected-mode -instruction set, at -least everything understood by a pre-MMX original Pentium (P54C). - -
-Stages 3, 4, 5 and 6 also deal with the following extra -"instrumentation" uopcodes. They are used to express all the -definedness-tracking and -checking machinery which valgrind does. In -later sections we show how to create checking code for each of the -uopcodes above. Note that these instrumentation uopcodes, although -some appearing complicated, have been carefully chosen so that -efficient x86 code can be generated for them. GNU superopt v2.5 did a -great job helping out here. Anyways, the uopcodes are as follows: - -
GETV and PUTV are analogues to
- GET and PUT above. They are identical
- except that they move the V bits for the specified values back and
- forth to TempRegs, rather than moving the values
- themselves.
--
LOADV and STOREV read and
- write V bits from the synthesised shadow memory that Valgrind
- maintains. In fact they do more than that, since they also do
- address-validity checks, and emit complaints if the read/written
- addresses are unaddressible.
--
TESTV, whose parameters are a TempReg
- and a size, tests the V bits in the TempReg, at the
- specified operation size (0/1/2/4 byte) and emits an error if any
- of them indicate undefinedness. This is the only uopcode capable
- of doing such tests.
--
SETV, whose parameters are also TempReg
- and a size, makes the V bits in the TempReg indicated
- definedness, at the specified operation size. This is usually
- used to generate the correct V bits for a literal value, which is
- of course fully defined.
--
GETVF and PUTVF are analogues to
- GETF and PUTF. They move the single V
- bit used to model definedness of %EFLAGS between its
- home in VG_(baseBlock) and the specified
- TempReg.
--
TAG1 denotes one of a family of unary operations on
- TempRegs containing V bits. Similarly,
- TAG2 denotes one in a family of binary operations on
- V bits.
-
-These 10 uopcodes are sufficient to express Valgrind's entire
-definedness-checking semantics. In fact most of the interesting magic
-is done by the TAG1 and TAG2
-suboperations.
-
-
-First, however, I need to explain about V-vector operation sizes.
-There are 4 sizes: 1, 2 and 4, which operate on groups of 8, 16 and 32
-V bits at a time, supporting the usual 1, 2 and 4 byte x86 operations.
-However there is also the mysterious size 0, which really means a
-single V bit. Single V bits are used in various circumstances; in
-particular, the definedness of %EFLAGS is modelled with a
-single V bit. Now might be a good time to also point out that for
-V bits, 1 means "undefined" and 0 means "defined". Similarly, for A
-bits, 1 means "invalid address" and 0 means "valid address". This
-seems counterintuitive (and so it is), but testing against zero on
-x86s saves instructions compared to testing against all 1s, because
-many ALU operations set the Z flag for free, so to speak.
-
-
-With that in mind, the tag ops are: - -
VgT_PCast40,
- VgT_PCast20, VgT_PCast10,
- VgT_PCast01, VgT_PCast02 and
- VgT_PCast04. A "pessimising cast" takes a V-bit
- vector at one size, and creates a new one at another size,
- pessimised in the sense that if any of the bits in the source
- vector indicate undefinedness, then all the bits in the result
- indicate undefinedness. In this case the casts are all to or from
- a single V bit, so for example VgT_PCast40 is a
- pessimising cast from 32 bits to 1, whereas
- VgT_PCast04 simply copies the single source V bit
- into all 32 bit positions in the result. Surprisingly, these ops
- can all be implemented very efficiently.
-
- There are also the pessimising casts VgT_PCast14,
- from 8 bits to 32, VgT_PCast12, from 8 bits to 16,
- and VgT_PCast11, from 8 bits to 8. This last one
- seems nonsensical, but in fact it isn't a no-op because, as
- mentioned above, any undefined (1) bits in the source infect the
- entire result.
-
-
VgT_Left4, VgT_Left2 and
- VgT_Left1. These are used to simulate the worst-case
- effects of carry propagation in adds and subtracts. They return a
- V vector identical to the original, except that if the original
- contained any undefined bits, then it and all bits above it are
- marked as undefined too. Hence the Left bit in the names.
--
VgT_SWiden14, VgT_SWiden24,
- VgT_SWiden12, VgT_ZWiden14,
- VgT_ZWiden24 and VgT_ZWiden12. These
- mimic the definedness effects of standard signed and unsigned
- integer widening. Unsigned widening creates zero bits in the new
- positions, so VgT_ZWiden* accordingly mark
- those parts of their argument as defined. Signed widening copies
- the sign bit into the new positions, so VgT_SWiden*
- copies the definedness of the sign bit into the new positions.
- Because 1 means undefined and 0 means defined, these operations
- can (fascinatingly) be done by the same operations which they
- mimic. Go figure.
--
VgT_UifU4,
- VgT_UifU2, VgT_UifU1,
- VgT_UifU0, VgT_DifD4,
- VgT_DifD2, VgT_DifD1. These do simple
- bitwise operations on pairs of V-bit vectors, with
- UifU giving undefined if either arg bit is
- undefined, and DifD giving defined if either arg bit
- is defined. Abstract interpretation junkies, if any make it this
- far, may like to think of them as meets and joins (or is it joins
- and meets) in the definedness lattices.
--
VgT_ImproveAND4_TQ,
- VgT_ImproveAND2_TQ, VgT_ImproveAND1_TQ,
- VgT_ImproveOR4_TQ, VgT_ImproveOR2_TQ,
- VgT_ImproveOR1_TQ. These help out with AND and OR
- operations. AND and OR have the inconvenient property that the
- definedness of the result depends on the actual values of the
- arguments as well as their definedness. At the bit level:
- 1 AND undefined = undefined, but
- 0 AND undefined = 0, and similarly
- 0 OR undefined = undefined, but
- 1 OR undefined = 1.
-
- It turns out that gcc (quite legitimately) generates code which
- relies on this fact, so we have to model it properly in order to
- avoid flooding users with spurious value errors. The ultimate
- definedness result of AND and OR is calculated using
- UifU on the definedness of the arguments, but we
- also DifD in some "improvement" terms which
- take into account the above phenomena.
-
- ImproveAND takes as its first argument the actual
- value of an argument to AND (the T) and the definedness of that
- argument (the Q), and returns a V-bit vector which is defined (0)
-  for bits which have value 0 and are defined; this, when
-  DifD'd into the final result, causes those bits to be
- defined even if the corresponding bit in the other argument is undefined.
-
- The ImproveOR ops do the dual thing for OR
- arguments. Note that XOR does not have this property that one
- argument can make the other irrelevant, so there is no need for
- such complexity for XOR.
-
-That's all the tag ops. If you stare at this long enough, and then -run Valgrind and stare at the pre- and post-instrumented ucode, it -should be fairly obvious how the instrumentation machinery hangs -together. - -
-One point, if you do this: in order to make it easy to differentiate
-TempRegs carrying values from TempRegs
-carrying V bit vectors, Valgrind prints the former as (for example)
-t28 and the latter as q28; the fact that
-they carry the same number serves to indicate their relationship.
-This is purely for the convenience of the human reader; the register
-allocator and code generator don't regard them as different.
-
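-
-

To make the tag ops above concrete, here is a small, purely -illustrative C rendering of a few of them, operating on 32-bit V-bit -vectors (1 = undefined, 0 = defined). The function names mirror the -tag op names, but the code is just a sketch of the semantics, not what -Valgrind itself executes. -

-
-
-/* Purely illustrative: plain-C models of some tag ops.  In a V-bit
-   vector, 1 means "undefined" and 0 means "defined". */
-typedef unsigned int  UInt;    /* 32-bit V-bit vector */
-typedef unsigned char UChar;   /*  8-bit V-bit vector */
-
-/* Pessimising casts: PCast40 squashes 32 V bits down to one,
-   PCast04 smears a single V bit back out to all 32 positions. */
-UInt pcast40 ( UInt q )  { return q == 0 ? 0 : 1; }
-UInt pcast04 ( UInt q1 ) { return q1 == 0 ? 0 : 0xFFFFFFFF; }
-
-/* Left4: any undefined bit infects itself and everything above it,
-   the worst case for carry propagation.  q | -q does exactly that. */
-UInt left4 ( UInt q ) { return q | (0u - q); }
-
-/* SWiden14 / ZWiden14: because of the 1-means-undefined convention,
-   widening the V bits is done by the very operation being mimicked. */
-UInt swiden14 ( UChar q ) { return (UInt)(int)(signed char)q; }
-UInt zwiden14 ( UChar q ) { return (UInt)q; }
-
-/* UifU / DifD: undefined-if-either-undefined is bitwise OR,
-   defined-if-either-defined is bitwise AND. */
-UInt uifu4 ( UInt q1, UInt q2 ) { return q1 | q2; }
-UInt difd4 ( UInt q1, UInt q2 ) { return q1 & q2; }
-
-/* ImproveAND_TQ: a bit which is defined and holds 0 in the value t
-   forces the corresponding result bit of an AND to be defined. */
-UInt improveAND4_TQ ( UInt t, UInt q ) { return t | q; }
-
-/* Overall definedness of (t1 AND t2): UifU of the argument V bits,
-   DifD'd with the two improvement terms, as described above. */
-UInt and_definedness ( UInt t1, UInt q1, UInt t2, UInt q2 )
-{
-   return difd4( uifu4(q1, q2),
-                 difd4( improveAND4_TQ(t1, q1),
-                        improveAND4_TQ(t2, q2) ) );
-}
-
-
-

Note how and_definedness composes UifU -with the two improvement terms in exactly the way the AND/OR discussion -above describes. -

-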
-
-
VG_(disBB) allocates a new UCodeBlock and
-then uses disInstr to translate x86 instructions one at a
-time into UCode, dumping the result in the UCodeBlock.
-This goes on until a control-flow transfer instruction is encountered.
-
-
-Despite the large size of vg_to_ucode.c, this translation
-is really very simple. Each x86 instruction is translated entirely
-independently of its neighbours, merrily allocating new
-TempRegs as it goes. The idea is to have a simple
-translator -- in reality, no more than a macro-expander -- and the
-resulting bad UCode translation is cleaned up by the UCode
-optimisation phase which follows. To give you an idea of some x86
-instructions and their translations (this is a complete basic block,
-as Valgrind sees it):
-
- 0x40435A50: incl %edx - - 0: GETL %EDX, t0 - 1: INCL t0 (-wOSZAP) - 2: PUTL t0, %EDX - - 0x40435A51: movsbl (%edx),%eax - - 3: GETL %EDX, t2 - 4: LDB (t2), t2 - 5: WIDENL_Bs t2 - 6: PUTL t2, %EAX - - 0x40435A54: testb $0x20, 1(%ecx,%eax,2) - - 7: GETL %EAX, t6 - 8: GETL %ECX, t8 - 9: LEA2L 1(t8,t6,2), t4 - 10: LDB (t4), t10 - 11: MOVB $0x20, t12 - 12: ANDB t12, t10 (-wOSZACP) - 13: INCEIPo $9 - - 0x40435A59: jnz-8 0x40435A50 - - 14: Jnzo $0x40435A50 (-rOSZACP) - 15: JMPo $0x40435A5B -- -
-Notice how the block always ends with an unconditional jump to the -next block. This is a bit unnecessary, but makes many things simpler. - -
-Most x86 instructions turn into sequences of GET,
-PUT, LEA1, LEA2,
-LOAD and STORE. Some complicated ones
-however rely on calling helper bits of code in
-vg_helpers.S. The ucode instructions PUSH,
-POP, CALL, CALLM_S and
-CALLM_E support this. The calling convention is somewhat
-ad-hoc and is not the C calling convention. The helper routines must
-save all integer registers, and the flags, that they use. Args are
-passed on the stack underneath the return address, as usual, and if
-result(s) are to be returned, they are either placed in dummy arg
-slots created by the ucode PUSH sequence, or just
-overwrite the incoming args.
-
-
-In order that the instrumentation mechanism can handle calls to these
-helpers, VG_(saneUCodeBlock) enforces the following
-restrictions on calls to helpers:
-
-
CALL uinstr must be bracketed by a preceding
- CALLM_S marker (dummy uinstr) and a trailing
- CALLM_E marker. These markers are used by the
- instrumentation mechanism later to establish the boundaries of the
- PUSH, POP and CLEAR
- sequences for the call.
--
PUSH, POP and CLEAR
- may only appear inside sections bracketed by CALLM_S
- and CALLM_E, and nowhere else.
--
-  No two PUSH insns may
-  push the same TempReg.  Dually, no two
- POPs may pop the same TempReg.
--
-  Args should be removed from the stack with
-  CLEAR, rather than POPs
-  into a TempReg which is not subsequently used.  This
- is because the instrumentation mechanism assumes that all values
- POPped from the stack are actually used.
-TempReg-to-TempReg moves. This helps the
-next phase, UCode optimisation, to generate better code.
-
-
-
-vg_improve()), which blurs the boundaries between the
-translations of the original x86 instructions. It's pretty
-straightforward. Three transformations are done:
-
-GET elimination. Actually, more general
- than that -- eliminates redundant fetches of ArchRegs. In our
- running example, uinstr 3 GETs %EDX into
- t2 despite the fact that, by looking at the previous
- uinstr, it is already in t0. The GET is
- therefore removed, and t2 renamed to t0.
- Assuming t0 is allocated to a host register, it means
- the simulated %EDX will exist in a host CPU register
- for more than one simulated x86 instruction, which seems to me to
- be a highly desirable property.
-
- There is some mucking around to do with subregisters;
-  %AL vs %AH vs %AX vs
- %EAX etc. I can't remember how it works, but in
- general we are very conservative, and these tend to invalidate the
- caching.
-
-
PUT elimination. This annuls
- PUTs of values back to simulated CPU registers if a
- later PUT would overwrite the earlier
-  PUT value, and there are no intervening reads of the
- simulated register (ArchReg).
-
- As before, we are paranoid when faced with subregister references.
- Also, PUTs of %ESP are never annulled,
-  because it is vital that the instrumenter always has an up-to-date
-  %ESP value available: %ESP changes
-  affect the addressability of the memory around the simulated stack
- pointer.
-
- The implication of the above paragraph is that the simulated
- machine's registers are only lazily updated once the above two
- optimisation phases have run, with the exception of
- %ESP. TempRegs go dead at the end of
-  every basic block, from which it is inferrable that any
- TempReg caching a simulated CPU reg is flushed (back
- into the relevant VG_(baseBlock) slot) at the end of
- every basic block. The further implication is that the simulated
-  registers are only up-to-date in between basic blocks, and not
- at arbitrary points inside basic blocks. And the consequence of
- that is that we can only deliver signals to the client in between
- basic blocks. None of this seems any problem in practice.
-
-
-at 3: delete GET, rename t2 to t0 in (4 .. 6) -at 7: delete GET, rename t6 to t0 in (8 .. 9) -at 1: annul flag write OSZAP due to later OSZACP - -Improved code: - 0: GETL %EDX, t0 - 1: INCL t0 - 2: PUTL t0, %EDX - 4: LDB (t0), t0 - 5: WIDENL_Bs t0 - 6: PUTL t0, %EAX - 8: GETL %ECX, t8 - 9: LEA2L 1(t8,t0,2), t4 - 10: LDB (t4), t10 - 11: MOVB $0x20, t12 - 12: ANDB t12, t10 (-wOSZACP) - 13: INCEIPo $9 - 14: Jnzo $0x40435A50 (-rOSZACP) - 15: JMPo $0x40435A5B -- -
-As mentioned somewhere above, TempRegs carrying values
-have names like t28, and each one has a shadow carrying
-its V bits, with names like q28. This pairing aids in
-reading instrumented ucode.
-
-
-One decision about all this is where to have "observation points",
-that is, where to check that V bits are valid. I use a minimalistic
-scheme, only checking where a failure of validity could cause the
-original program to (seg)fault. So the use of values as memory
-addresses causes a check, as do conditional jumps (these cause a check
-on the definedness of the condition codes). And arguments
-PUSHed for helper calls are checked, hence the weird
-restrictions on helper call preambles described above.
-
-
-Another decision is that once a value is tested, it is thereafter
-regarded as defined, so that we do not emit multiple undefined-value
-errors for the same undefined value. That means that
-TESTV uinstrs are always followed by SETV
-on the same (shadow) TempRegs. Most of these
-SETVs are redundant and are removed by the
-post-instrumentation cleanup phase.
-
-
-The instrumentation for calling helper functions deserves further
-comment. The definedness of results from a helper is modelled using
-just one V bit. So, in short, we do pessimising casts of the
-definedness of all the args, down to a single bit, and then
-UifU these bits together. So this single V bit will say
-"undefined" if any part of any arg is undefined. This V bit is then
-pessimally cast back up to the result(s) sizes, as needed.  If
-Valgrind sees that all the args are got rid of with CLEAR and
-none with POP, so that the result of the call
-is not actually used, it immediately examines the result V bit with a
-TESTV -- SETV pair.  If it did not do this,
-there would be no observation point to detect that some of the
-args to the helper were undefined. Of course, if the helper's results
-are indeed used, we don't do this, since the result usage will
-presumably cause the result definedness to be checked at some suitable
-future point.
-
-
-
-In general Valgrind tries to track definedness on a bit-for-bit basis,
-but as the above para shows, for calls to helpers we throw in the
-towel and approximate down to a single bit.  This is because it's too
-complex and difficult to track bit-level definedness through complex
-ops such as integer multiply and divide, and in any case there are no
-reasonable code fragments which attempt to (eg) multiply two
-partially-defined values and end up with something meaningful, so
-there seems little point in modelling multiplies, divides, etc, in
-that level of detail.
-
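-<p>
-To make the single-bit approximation concrete, here is a tiny,
-purely illustrative C fragment for a two-argument helper; the names
-are invented for the sketch and are not Valgrind's.
-<p>
-<pre>
-typedef unsigned int UInt;
-
-static UInt pcast_down ( UInt q ) { return q == 0 ? 0 : 1; }          /* V vector -> 1 bit */
-static UInt pcast_up   ( UInt b ) { return b == 0 ? 0 : 0xFFFFFFFF; } /* 1 bit -> V vector */
-
-/* Result V bits for a two-argument helper: one pessimistic bit,
-   "undefined if any part of any arg is undefined", smeared back up. */
-UInt helper_result_vbits ( UInt q_arg1, UInt q_arg2 )
-{
-   UInt one_bit = pcast_down(q_arg1) | pcast_down(q_arg2);   /* UifU at 1 bit */
-   return pcast_up(one_bit);
-}
-</pre>
-
-<p>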
-Integer loads and stores are instrumented with firstly a test of the
-definedness of the address, followed by a LOADV or
-STOREV respectively. These turn into calls to
-(for example) VG_(helperc_LOADV4). These helpers do two
-things: they perform an address-valid check, and they load or store V
-bits from/to the relevant address in the (simulated V-bit) memory.
-
-
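-
-<p>The following is a rough, illustrative sketch of the shape of such a
-helper, assuming a two-level primary/secondary shadow map.  The layout,
-sizes and names here are guesses for exposition, not the definitions in
-<code>vg_memory.c</code>.
-<p>
-<pre>
-#include <stdio.h>
-
-typedef unsigned int   UInt;
-typedef unsigned char  UChar;
-typedef unsigned long  Addr;    /* 32-bit address space assumed, as everywhere here */
-
-typedef struct {
-   UChar abits[65536 / 8];   /* A bits, 1 bit per byte: 1 = invalid address */
-   UChar vbyte[65536];       /* V bits, 8 per byte: 1 = undefined           */
-} SecMap;
-
-/* Stand-ins for VG_(primary_map) and the error machinery. */
-static SecMap* primary_map[65536];
-
-static void record_address_error ( Addr a, int size )
-{
-   fprintf(stderr, "Invalid read of size %d at %p\n", size, (void*)a);
-}
-
-/* Roughly what a 4-byte LOADV helper does: a per-byte A-bit check,
-   then gather the 32 V bits for the word being loaded. */
-UInt helperc_LOADV4_sketch ( Addr a )
-{
-   UInt vword = 0;
-   int  i;
-   for (i = 3; i >= 0; i--) {
-      Addr    ai  = a + i;
-      SecMap* sm  = primary_map[ai >> 16];
-      UInt    off = ai & 0xFFFF;
-      if (sm == NULL || ((sm->abits[off >> 3] >> (off & 7)) & 1)) {
-         record_address_error(a, 4);
-         return 0;                 /* pretend the data is defined */
-      }
-      vword = vword * 256 + sm->vbyte[off];   /* highest-addressed byte first */
-   }
-   return vword;
-}
-</pre>
-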
-FPU loads and stores are different. As above the definedness of the
-address is first tested. However, the helper routine for FPU loads
-(VGM_(fpu_read_check)) emits an error if either the
-address is invalid or the referenced area contains undefined values.
-It has to do this because we do not simulate the FPU at all, and so
-cannot track the definedness of values loaded into it from memory; we
-therefore have to check them as soon as they are loaded into the FPU, ie, at
-this point. We notionally assume that everything in the FPU is
-defined.
-
-
-It follows therefore that FPU writes first check the definedness of -the address, then the validity of the address, and finally mark the -written bytes as well-defined. - -
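-
-<p>Sketching that checking policy in C: the two byte-level predicates
-below are assumed stand-ins for the real A-bit and V-bit queries, not
-actual Valgrind functions.
-<p>
-<pre>
-typedef unsigned long Addr;
-
-/* Stand-in shadow-memory queries; in Valgrind these would consult the
-   A and V bits for the byte at 'a'.  Stubs here so the sketch links. */
-static int  byte_is_addressable ( Addr a ) { (void)a; return 1; }
-static int  byte_is_defined     ( Addr a ) { (void)a; return 1; }
-static void report_address_error ( Addr a, int size ) { (void)a; (void)size; }
-static void report_value_error   ( int size )         { (void)size; }
-
-/* The policy described above: because the FPU's contents are not
-   shadowed, both addressability and definedness must be checked at
-   the moment the bytes enter the FPU. */
-void fpu_read_check_sketch ( Addr addr, int size )
-{
-   int i;
-   for (i = 0; i < size; i++) {
-      if (!byte_is_addressable(addr + i)) { report_address_error(addr, size); return; }
-      if (!byte_is_defined(addr + i))     { report_value_error(size);         return; }
-   }
-}
-</pre>
-
-<p>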
-
-If anyone is inspired to extend Valgrind to MMX/SSE insns, I suggest
-you use the same trick.  It works provided that the FPU/MMX unit is
-not used merely as a conduit to copy partially undefined data from
-one place in memory to another.  Unfortunately the integer CPU is used
-like that (when copying C structs with holes, for example) and this is
-the cause of much of the elaborateness of the instrumentation here
-described.
-
-<p>
-vg_instrument() in vg_translate.c actually
-does the instrumentation. There are comments explaining how each
-uinstr is handled, so we do not repeat that here. As explained
-already, it is bit-accurate, except for calls to helper functions.
-Unfortunately the x86 insns bt/bts/btc/btr are done by
-helper fns, so bit-level accuracy is lost there. This should be fixed
-by doing them inline; it will probably require adding a couple of new
-uinstrs. Also, left and right rotates through the carry flag (x86
-rcl and rcr) are approximated via a single
-V bit; so far this has not caused anyone to complain. The
-non-carry rotates, rol and ror, are much
-more common and are done exactly. Re-visiting the instrumentation for
-AND and OR, it seems rather verbose, and I wonder if it could be done
-more concisely now.
-
-
-The lowercase o on many of the uopcodes in the running
-example indicates that the size field is zero, usually meaning a
-single-bit operation.
-
-
-Anyroads, the post-instrumented version of our running example looks -like this: - -
-Instrumented code: - 0: GETVL %EDX, q0 - 1: GETL %EDX, t0 - - 2: TAG1o q0 = Left4 ( q0 ) - 3: INCL t0 - - 4: PUTVL q0, %EDX - 5: PUTL t0, %EDX - - 6: TESTVL q0 - 7: SETVL q0 - 8: LOADVB (t0), q0 - 9: LDB (t0), t0 - - 10: TAG1o q0 = SWiden14 ( q0 ) - 11: WIDENL_Bs t0 - - 12: PUTVL q0, %EAX - 13: PUTL t0, %EAX - - 14: GETVL %ECX, q8 - 15: GETL %ECX, t8 - - 16: MOVL q0, q4 - 17: SHLL $0x1, q4 - 18: TAG2o q4 = UifU4 ( q8, q4 ) - 19: TAG1o q4 = Left4 ( q4 ) - 20: LEA2L 1(t8,t0,2), t4 - - 21: TESTVL q4 - 22: SETVL q4 - 23: LOADVB (t4), q10 - 24: LDB (t4), t10 - - 25: SETVB q12 - 26: MOVB $0x20, t12 - - 27: MOVL q10, q14 - 28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 ) - 29: TAG2o q10 = UifU1 ( q12, q10 ) - 30: TAG2o q10 = DifD1 ( q14, q10 ) - 31: MOVL q12, q14 - 32: TAG2o q14 = ImproveAND1_TQ ( t12, q14 ) - 33: TAG2o q10 = DifD1 ( q14, q10 ) - 34: MOVL q10, q16 - 35: TAG1o q16 = PCast10 ( q16 ) - 36: PUTVFo q16 - 37: ANDB t12, t10 (-wOSZACP) - - 38: INCEIPo $9 - - 39: GETVFo q18 - 40: TESTVo q18 - 41: SETVo q18 - 42: Jnzo $0x40435A50 (-rOSZACP) - - 43: JMPo $0x40435A5B -- - -
-This pass, coordinated by vg_cleanup(), removes redundant
-definedness computation created by the simplistic instrumentation
-pass. It consists of two passes,
-vg_propagate_definedness() followed by
-vg_delete_redundant_SETVs.
-
-
-vg_propagate_definedness() is a simple
-constant-propagation and constant-folding pass. It tries to determine
-which TempRegs containing V bits will always indicate
-"fully defined", and it propagates this information as far as it can,
-and folds out as many operations as possible. For example, the
-instrumentation for an ADD of a literal to a variable quantity will be
-reduced down so that the definedness of the result is simply the
-definedness of the variable quantity, since the literal is by
-definition fully defined.
-
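-
-<p>As a purely illustrative aside, the folding rules can be pictured
-like this; the enum and names below are invented for the sketch and do
-not appear in Valgrind.
-<p>
-<pre>
-/* Purely illustrative; none of these names exist in Valgrind. */
-typedef enum { VS_UNKNOWN, VS_ALL_DEFINED } VState;
-
-/* A literal operand's shadow is all zeroes ("fully defined"), so for
-   an ADD-of-literal the chain  Left4(UifU4(q_lit, q_var))  folds down
-   to  Left4(q_var), and the UifU op can be deleted. */
-VState fold_UifU4 ( VState a, VState b )
-{
-   return (a == VS_ALL_DEFINED && b == VS_ALL_DEFINED)
-          ? VS_ALL_DEFINED : VS_UNKNOWN;
-}
-
-/* Left and the pessimising casts map an all-zeroes vector to an
-   all-zeroes vector, so "fully defined" propagates straight through. */
-VState fold_Left4   ( VState a ) { return a; }
-VState fold_PCast40 ( VState a ) { return a; }
-</pre>
-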
-
-vg_delete_redundant_SETVs removes SETVs on
-shadow TempRegs for which the next action is a write.
-I don't think there's anything else worth saying about this; it is
-simple. Read the sources for details.
-
-
-So the cleaned-up running example looks like this. As above, I have -inserted line breaks after every original (non-instrumentation) uinstr -to aid readability. As with straightforward ucode optimisation, the -results in this block are undramatic because it is so short; longer -blocks benefit more because they have more redundancy which gets -eliminated. - - -
-at 29: delete UifU1 due to defd arg1 -at 32: change ImproveAND1_TQ to MOV due to defd arg2 -at 41: delete SETV -at 31: delete MOV -at 25: delete SETV -at 22: delete SETV -at 7: delete SETV - - 0: GETVL %EDX, q0 - 1: GETL %EDX, t0 - - 2: TAG1o q0 = Left4 ( q0 ) - 3: INCL t0 - - 4: PUTVL q0, %EDX - 5: PUTL t0, %EDX - - 6: TESTVL q0 - 8: LOADVB (t0), q0 - 9: LDB (t0), t0 - - 10: TAG1o q0 = SWiden14 ( q0 ) - 11: WIDENL_Bs t0 - - 12: PUTVL q0, %EAX - 13: PUTL t0, %EAX - - 14: GETVL %ECX, q8 - 15: GETL %ECX, t8 - - 16: MOVL q0, q4 - 17: SHLL $0x1, q4 - 18: TAG2o q4 = UifU4 ( q8, q4 ) - 19: TAG1o q4 = Left4 ( q4 ) - 20: LEA2L 1(t8,t0,2), t4 - - 21: TESTVL q4 - 23: LOADVB (t4), q10 - 24: LDB (t4), t10 - - 26: MOVB $0x20, t12 - - 27: MOVL q10, q14 - 28: TAG2o q14 = ImproveAND1_TQ ( t10, q14 ) - 30: TAG2o q10 = DifD1 ( q14, q10 ) - 32: MOVL t12, q14 - 33: TAG2o q10 = DifD1 ( q14, q10 ) - 34: MOVL q10, q16 - 35: TAG1o q16 = PCast10 ( q16 ) - 36: PUTVFo q16 - 37: ANDB t12, t10 (-wOSZACP) - - 38: INCEIPo $9 - 39: GETVFo q18 - 40: TESTVo q18 - 42: Jnzo $0x40435A50 (-rOSZACP) - - 43: JMPo $0x40435A5B -- - -
vg_from_ucode.c
-is a big file. Position-independent x86 code is generated into
-a dynamically allocated array emitted_code; this is
-doubled in size when it overflows. Eventually the array is handed
-back to the caller of VG_(translate), who must copy
-the result into TC and TT, and free the array.
-
-
-This file is structured into four layers of abstraction, which,
-thankfully, are glued back together with extensive
-__inline__ directives. From the bottom upwards:
-
-
emit_amode_regmem_reg et al.
--
emit_movv_offregmem_reg.
- The v suffix is Intel parlance for a 16/32 bit insn;
- there are also b suffixes for 8 bit insns.
--
synth_* functions, which
- synthesise possibly a sequence of raw x86 instructions to do some
- simple task. Some of these are quite complex because they have to
- work around Intel's silly restrictions on subregister naming. See
- synth_nonshiftop_reg_reg for example.
--
emitUInstr(),
- which emits code for a single uinstr.
--Some comments: -
FPU ucode instruction, we load the simulated FPU's
-  state from its VG_(baseBlock) into the real FPU
- using an x86 frstor insn, do the ucode
- FPU insn on the real CPU, and write the updated FPU
- state back into VG_(baseBlock) using an
- fnsave instruction. This is pretty brutal, but is
- simple and it works, and even seems tolerably efficient. There is
- no attempt to cache the simulated FPU state in the real FPU over
- multiple back-to-back ucode FPU instructions.
-
- FPU_R and FPU_W are also done this way,
- with the minor complication that we need to patch in some
- addressing mode bits so the resulting insn knows the effective
- address to use. This is easy because of the regularity of the x86
- FPU instruction encodings.
-
-
-  Some uinstrs indicate, via their <code>flags_r</code> and <code>flags_w</code> fields, that they
- read or write the simulated %EFLAGS. For such cases
- we first copy the simulated %EFLAGS into the real
- %eflags, then do the insn, then, if the insn says it
- writes the flags, copy back to %EFLAGS. This is a
- bit expensive, which is why the ucode optimisation pass goes to
- some effort to remove redundant flag-update annotations.
--</ul>
-And so ... that's the end of the documentation for the instrumenting
-translator!  It's really not that complex, because it's composed as a
-sequence of simple(ish) self-contained transformations on
-straight-line blocks of code.
-
-<p>
VG_(toploop). This is basically boring and
-unsurprising, not to mention fiddly and fragile. It needs to be
-cleaned up.
-
-
-The only surprise, perhaps, is that the whole thing is run
-on top of a setjmp-installed exception handler, because,
-supposing a translation got a segfault, we have to bail out of the
-Valgrind-supplied exception handler VG_(oursignalhandler)
-and immediately start running the client's segfault handler, if it has
-one. In particular we can't finish the current basic block and then
-deliver the signal at some convenient future point, because signals
-like SIGILL, SIGSEGV and SIGBUS mean that the faulting insn should not
-simply be re-tried. (I'm sure there is a clearer way to explain this).
-
-
-
%EIP is not updated after every simulated x86
-insn as this was regarded as too expensive. Instead ucode
-INCEIP insns move it along as and when necessary.
-Currently we don't allow it to fall more than 4 bytes behind reality
-(see VG_(disBB) for the way this works).
-
-Note that %EIP is always brought up to date by the inner
-dispatch loop in VG_(dispatch), so that if the client
-takes a fault we know at least which basic block this happened in.
-
-
-
vg_signals.c.
-Basically, since we have to intercept all system
-calls anyway, we can see when the client tries to install a signal
-handler. If it does so, we make a note of what the client asked to
-happen, and ask the kernel to route the signal to our own signal
-handler, VG_(oursignalhandler). This simply notes the
-delivery of signals, and returns.
-
-
-Every 1000 basic blocks, we see if more signals have arrived. If so,
-VG_(deliver_signals) builds signal delivery frames on the
-client's stack, and allows their handlers to be run. Valgrind places
-in these signal delivery frames a bogus return address,
-VG_(signalreturn_bogusRA), and checks all jumps to see
-if any jump to it.  If one does, this is a sign that a signal handler is
-returning, so Valgrind removes the relevant signal frame from
-the client's stack, restores, from the signal frame, the simulated
-state as it was before the signal was delivered, and allows the client to run
-onwards. We have to do it this way because some signal handlers never
-return, they just longjmp(), which nukes the signal
-delivery frame.
-
-
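-
-<p>The shape of the trick, in a deliberately simplified sketch; the
-names and the dispatcher hook are assumptions for illustration, not the
-code in <code>vg_signals.c</code>.
-<p>
-<pre>
-typedef unsigned long Addr;
-
-/* Stand-ins: the magic return address and the frame bookkeeping.
-   In Valgrind these are VG_(signalreturn_bogusRA) and code in
-   vg_signals.c; here they are just placeholders. */
-static Addr bogus_return_addr = 0xDEADBEEF;
-static void pop_signal_frame_and_restore ( void ) { /* unwind frame, restore state */ }
-
-/* Called by the dispatcher on every control-flow transfer: a "jump"
-   to the magic address means a client signal handler has returned. */
-int maybe_signal_return ( Addr jump_target )
-{
-   if (jump_target != bogus_return_addr)
-      return 0;                        /* ordinary control transfer */
-   pop_signal_frame_and_restore();     /* remove the delivery frame */
-   return 1;                           /* resume the interrupted client state */
-}
-</pre>
-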
-The Linux kernel has a different but equally horrible hack for -detecting signal handler returns. Discovering it is left as an -exercise for the reader. - - - -
VG_(primary_map) structure, whether or not
-accesses to the individual secondary maps need locking, what
-race-condition issues result, and whether the already-nasty mess that
-is the signal simulator needs further hackery.
-
--I realise that threads are the most-frequently-requested feature, and -I am thinking about it all. If you have guru-level understanding of -fast mutual exclusion mechanisms and race conditions, I would be -interested in hearing from you. - - -
tests/ contains various ad-hoc tests for
-Valgrind. However, there is no systematic verification or regression
-suite which, for example, exercises all the stuff in
-vg_memory.c, to ensure that illegal memory accesses and
-undefined value uses are detected as they should be. It would be good
-to have such a suite.
-
-
--The main difficulties, for an x86-ELF platform, seem to be: - -
/proc/self/maps parser
- (vg_procselfmaps.c).
- Easy.
--
vg_syscall_mem.c, or, more
- specifically, provide one for your OS. This is tedious, but you
- can implement syscalls on demand, and the Linux kernel interface
- is, for the most part, going to look very similar to the *BSD
- interfaces, so it's really a copy-paste-and-modify-on-demand job.
- As part of this, you'd need to supply a new
- vg_kerneliface.h file.
--
vg_mylibc.c.
--
vg_symtab2.c which reads "stabs" style
-debugging info is pretty weak. It usually correctly translates
-simulated program counter values into line numbers and procedure
-names, but the file name is often completely wrong. I think the
-logic used to parse "stabs" entries is weak. It should be fixed.
-The simplest solution, IMO, is to copy either the logic or simply the
-code out of GNU binutils which does this; since GDB can clearly get it
-right, binutils (or GDB?) must have code to do this somewhere.
-
-
-
-
-
-
-The incorrect instrumentation is due to use of helper functions. This
-means we lose bit-level definedness tracking, which could wind up
-giving spurious uninitialised-value use errors. The Right Thing to do
-is to invent a couple of new UOpcodes, I think GET_BIT
-and SET_BIT, which can be used to implement all 4 x86
-insns, get rid of the helpers, and give bit-accurate instrumentation
-rules for the two new UOpcodes.
-
-
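-
-<p>For illustration only, here is one plausible bit-accurate
-definedness rule for the register form of <code>bt</code>, of the kind
-a <code>GET_BIT</code> uinstr might implement.  This is a proposal
-sketch, not existing code.
-<p>
-<pre>
-typedef unsigned int UInt;
-
-/* V bit (0 = defined, 1 = undefined) for the carry flag produced by
-   "bt value, index", given the V bits of both operands. */
-UInt bt_cf_vbit ( UInt q_val, UInt t_ix, UInt q_ix )
-{
-   if ((q_ix & 31) != 0)
-      return 1;                       /* the bit index itself is (partly) undefined */
-   return (q_val >> (t_ix & 31)) & 1; /* CF inherits the selected bit's V bit */
-}
-</pre>
-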
-I realised the other day that they are mis-implemented too. The x86 -insns take a bit-index and a register or memory location to access. -For registers the bit index clearly can only be in the range zero to -register-width minus 1, and I assumed the same applied to memory -locations too. But evidently not; for memory locations the index can -be arbitrary, and the processor will index arbitrarily into memory as -a result. This too should be fixed. Sigh. Presumably indexing -outside the immediate word is not actually used by any programs yet -tested on Valgrind, for otherwise they (presumably) would simply not -work at all. If you plan to hack on this, first check the Intel docs -to make sure my understanding is really correct. - - - -
VG_(primary_map), one of the resulting
-secondary, and the original. Not to mention, the instrumented
-translations are 13 to 14 times larger than the originals. All in all
-one would expect the memory system to be hammered to hell and then
-some.
-
--So here's an idea. An x86 insn involving a read from memory, after -instrumentation, will turn into ucode of the following form: -
- ... calculate effective addr, into ta and qa ... - TESTVL qa -- is the addr defined? - LOADV (ta), qloaded -- fetch V bits for the addr - LOAD (ta), tloaded -- do the original load --At the point where the
LOADV is done, we know the actual
-address (ta) from which the real LOAD will
-be done. We also know that the LOADV will take around
-20 x86 insns to do. So it seems plausible that doing a prefetch of
-ta just before the LOADV might just avoid a
-miss at the LOAD point, and that might be a significant
-performance win.
-
-
-Prefetch insns are notoriously temperamental, more often than not
-making things worse rather than better, so this would require
-considerable fiddling around. It's complicated because Intels and
-AMDs have different prefetch insns with different semantics, so that
-too needs to be taken into account. As a general rule, even placing
-the prefetches before the LOADV insn is too near the
-LOAD; the ideal distance is apparently circa 200 CPU
-cycles. So it might be worth having another analysis/transformation
-pass which pushes prefetches as far back as possible, hopefully
-immediately after the effective address becomes available.
-
-
-Doing too many prefetches is also bad because they soak up bus
-bandwidth / cpu resources, so some cleverness in deciding which loads
-to prefetch and which to not might be helpful. One can imagine not
-prefetching client-stack-relative (%EBP or
-%ESP) accesses, since the stack in general tends to show
-good locality anyway.
-
-
-There's quite a lot of experimentation to do here, but I think it -might make an interesting week's work for someone. - -
-As of 15-ish March 2002, I've started to experiment with this, using
-the AMD prefetch/prefetchw insns.
-
-
-
-
-The presentation falls into two pieces. - -
-Part 1: user-defined address-range permission setting -
-
-Valgrind intercepts the client's malloc,
-free, etc calls, watches system calls, and watches the
-stack pointer move. This is currently the only way it knows about
-which addresses are valid and which not. Sometimes the client program
-knows extra information about its memory areas. For example, the
-client could at some point know that all elements of an array are
-out-of-date. We would like to be able to convey to Valgrind this
-information that the array is now addressable-but-uninitialised, so
-that Valgrind can then warn if elements are used before they get new
-values.
-
-
-What I would like are some macros like this: -
- VALGRIND_MAKE_NOACCESS(addr, len) - VALGRIND_MAKE_WRITABLE(addr, len) - VALGRIND_MAKE_READABLE(addr, len) --and also, to check that memory is addressible/initialised, -
- VALGRIND_CHECK_ADDRESSIBLE(addr, len) - VALGRIND_CHECK_INITIALISED(addr, len) -- -
-I then include in my sources a header defining these macros, rebuild -my app, run under Valgrind, and get user-defined checks. - -
-Now here's a neat trick. It's a nuisance to have to re-link the app -with some new library which implements the above macros. So the idea -is to define the macros so that the resulting executable is still -completely stand-alone, and can be run without Valgrind, in which case -the macros do nothing, but when run on Valgrind, the Right Thing -happens. How to do this? The idea is for these macros to turn into a -piece of inline assembly code, which (1) has no effect when run on the -real CPU, (2) is easily spotted by Valgrind's JITter, and (3) no sane -person would ever write, which is important for avoiding false matches -in (2). So here's a suggestion: -
- VALGRIND_MAKE_NOACCESS(addr, len) --becomes (roughly speaking) -
- movl addr, %eax - movl len, %ebx - movl $1, %ecx -- 1 describes the action; MAKE_WRITABLE might be - -- 2, etc - rorl $13, %ecx - rorl $19, %ecx - rorl $11, %eax - rorl $21, %eax --The rotate sequences have no effect, and it's unlikely they would -appear for any other reason, but they define a unique byte-sequence -which the JITter can easily spot. Using the operand constraints -section at the end of a gcc inline-assembly statement, we can tell gcc -that the assembly fragment kills
%eax, %ebx,
-%ecx and the condition codes, so this fragment is made
-harmless when not running on Valgrind, runs quickly when not on
-Valgrind, and does not require any other library support.
-
-
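-
-<p>Just to make the recipe concrete, here is a hedged sketch of how
-such a macro might look using gcc inline assembly on x86.  The register
-choice and rotate counts follow the fragment above, the request numbers
-are arbitrary, and a real implementation would need more care (for
-example with <code>%ebx</code> under <code>-fPIC</code>).
-<p>
-<pre>
-/* Sketch only: a no-op rotate sequence which a Valgrind JITter could
-   recognise.  On a real CPU it has no effect beyond scratching the
-   clobbered registers and flags. */
-#define VALGRIND_MAGIC_SEQUENCE(request, _addr, _len)              \
-   __asm__ __volatile__(                                           \
-      "movl %0, %%eax\n\t"                                         \
-      "movl %1, %%ebx\n\t"                                         \
-      "movl %2, %%ecx\n\t"                                         \
-      "rorl $13, %%ecx\n\t"  /* 13+19 = 32: %ecx ends up unchanged */ \
-      "rorl $19, %%ecx\n\t"                                        \
-      "rorl $11, %%eax\n\t"  /* 11+21 = 32: %eax ends up unchanged */ \
-      "rorl $21, %%eax"                                            \
-      : /* no outputs */                                           \
-      : "r" (_addr), "r" (_len), "i" (request)                     \
-      : "eax", "ebx", "ecx", "cc", "memory" )
-
-#define VALGRIND_MAKE_NOACCESS(addr, len)  VALGRIND_MAGIC_SEQUENCE(1, (addr), (len))
-#define VALGRIND_MAKE_WRITABLE(addr, len)  VALGRIND_MAGIC_SEQUENCE(2, (addr), (len))
-#define VALGRIND_MAKE_READABLE(addr, len)  VALGRIND_MAGIC_SEQUENCE(3, (addr), (len))
-</pre>
-
-<p>On a real CPU this compiles to a handful of harmless rotates; under
-Valgrind the JITter would spot the byte sequence and act on the request.
-<p>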
--Part 2: using it to detect interference between stack variables -
- -Currently Valgrind cannot detect errors of the following form: -
-void fooble ( void )
-{
- int a[10];
- int b[10];
- a[10] = 99;
-}
-
-Now imagine rewriting this as
-
-void fooble ( void )
-{
- int spacer0;
- int a[10];
- int spacer1;
- int b[10];
- int spacer2;
- VALGRIND_MAKE_NOACCESS(&spacer0, sizeof(int));
- VALGRIND_MAKE_NOACCESS(&spacer1, sizeof(int));
- VALGRIND_MAKE_NOACCESS(&spacer2, sizeof(int));
- a[10] = 99;
-}
-
-Now the invalid write is certain to hit spacer0 or
-spacer1, so Valgrind will spot the error.
-
--There are two complications. - -
-
-The first is that we don't want to annotate sources by hand, so the
-Right Thing to do is to write a C/C++ parser, annotator, prettyprinter
-which does this automatically, and run it on post-CPP'd C/C++ source.
-See http://www.cacheprof.org for an example of a system which
-transparently inserts another phase into the gcc/g++ compilation
-route.  The parser/prettyprinter is probably not as hard as it sounds;
-I would write it in Haskell, a powerful functional language well
-suited to doing symbolic computation, with which I am intimately
-familiar.  There is already a C parser written in Haskell by someone in
-the Haskell community, and that would probably be a good starting
-point.
-
-The second complication is how to get rid of these
-NOACCESS records inside Valgrind when the instrumented
-function exits; after all, these refer to stack addresses and will
-make no sense whatever when some other function happens to re-use the
-same stack address range, probably shortly afterwards. I think I
-would be inclined to define a special stack-specific macro
-
- VALGRIND_MAKE_NOACCESS_STACK(addr, len) --which causes Valgrind to record the client's
%ESP at the
-time it is executed. Valgrind will then watch for changes in
-%ESP and discard such records as soon as the protected
-area is uncovered by an increase in %ESP. I hesitate
-with this scheme only because it is potentially expensive, if there
-are hundreds of such records, and considering that changes in
-%ESP already require expensive messing with stack access
-permissions.
-
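-
-<p>A sketch of the bookkeeping this would need is below; the record
-layout, limit and names are invented for illustration, not a real
-Valgrind interface.
-<p>
-<pre>
-typedef unsigned int  UInt;
-typedef unsigned long Addr;
-
-#define MAX_STACK_NOACCESS 1000
-
-typedef struct {
-   Addr start;       /* protected range is [start, start+len) */
-   UInt len;
-   Addr esp_then;    /* simulated %ESP when the request was made */
-} StackNoAccess;
-
-static StackNoAccess records[MAX_STACK_NOACCESS];
-static int           n_records = 0;
-
-/* Stand-in for the call that restores normal permissions. */
-static void make_writable ( Addr a, UInt len ) { (void)a; (void)len; }
-
-/* To be called whenever the simulated %ESP increases.  Any record
-   whose range now lies below the stack pointer belongs to a dead
-   frame, so drop it and re-expose the bytes it covered. */
-void discard_stale_records ( Addr new_esp )
-{
-   int i = 0;
-   while (i < n_records) {
-      if (records[i].start + records[i].len <= new_esp) {
-         make_writable(records[i].start, records[i].len);
-         records[i] = records[--n_records];    /* unordered delete */
-      } else {
-         i++;
-      }
-   }
-}
-</pre>
-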
--This is probably easier and more robust than for the instrumenter -program to try and spot all exit points for the procedure and place -suitable deallocation annotations there. Plus C++ procedures can -bomb out at any point if they get an exception, so spotting return -points at the source level just won't work at all. - -
-Although some work, it's all eminently doable, and it would make -Valgrind into an even-more-useful tool. - - -
- - -
LOAD, STORE,
-FPU_R and FPU_W. By contrast, because of the x86
-addressing modes, almost every instruction can read or write memory.
-
-Most of the cache profiling machinery is in the file
-vg_cachesim.c.
- -These notes are a somewhat haphazard guide to how Valgrind's cache profiling -works.
- -
-There are two kinds of cost centre: one for instructions that don't
-reference memory (<code>iCC</code>), and one for instructions that do
-(idCC):
-
-
-typedef struct _CC {
- ULong a;
- ULong m1;
- ULong m2;
-} CC;
-
-typedef struct _iCC {
- /* word 1 */
- UChar tag;
- UChar instr_size;
-
- /* words 2+ */
- Addr instr_addr;
- CC I;
-} iCC;
-
-typedef struct _idCC {
- /* word 1 */
- UChar tag;
- UChar instr_size;
- UChar data_size;
-
- /* words 2+ */
- Addr instr_addr;
- CC I;
- CC D;
-} idCC;
-
-
-Each CC has three fields a, m1,
-m2 for recording references, level 1 misses and level 2 misses.
-Each of these is a 64-bit ULong -- the numbers can get very large,
-ie. greater than the 4.2 billion allowed by a 32-bit unsigned int.
-
-A iCC has one CC for instruction cache accesses. A
-idCC has two, one for instruction cache accesses, and one for data
-cache accesses.
-
-The iCC and dCC structs also store unchanging
-information about the instruction:
-
-<ul>
-  <li>instruction address</li>
-  <li>instruction size</li>
-  <li>data size (<code>idCC</code> only)</li>
-</ul>
-
-Note that the address of the data accessed is not stored in the
-<code>idCC</code>.  This is
-because for many memory-referencing instructions the data address can change
-each time it's executed (eg. if it uses register-offset addressing). We have
-to give this item to the cache simulation in a different way (see
-Instrumentation section below). Some memory-referencing instructions do always
-reference the same address, but we don't try to treat them specially in order to
-keep things simple.
-
-Also note that there is only room for recording info about one data cache
-access in an idCC. So what about instructions that do a read then
-a write, such as:
-
-
inc %(esi)
-
-In a write-allocate cache, as simulated by Valgrind, the write cannot miss,
-since it immediately follows the read which will drag the block into the cache
-if it's not already there. So the write access isn't really interesting, and
-Valgrind doesn't record it. This means that Valgrind doesn't measure
-memory references, but rather memory references that could miss in the cache.
-This behaviour is the same as that used by the AMD Athlon hardware counters.
-It also has the benefit of simplifying the implementation -- instructions that
-read and write memory can be treated like instructions that read memory.- -
- -Valgrind does JIT translations at the basic block level, and cost centres are -also setup and stored at the basic block level. By doing things carefully, we -store all the cost centres for a basic block in a contiguous array, and lookup -comes almost for free.
- -Consider this part of a basic block (for exposition purposes, pretend it's an -entire basic block): - -
-movl $0x0,%eax -movl $0x99, -4(%ebp) -- -The translation to UCode looks like this: - -
-MOVL $0x0, t20 -PUTL t20, %EAX -INCEIPo $5 - -LEA1L -4(t4), t14 -MOVL $0x99, t18 -STL t18, (t14) -INCEIPo $7 -- -The first step is to allocate the cost centres. This requires a preliminary -pass to count how many x86 instructions were in the basic block, and their -types (and thus sizes). UCode translations for single x86 instructions are -delimited by the
INCEIPo instruction, the argument of which gives
-the byte size of the instruction (note that lazy INCEIP updating is turned off
-to allow this).
-
-We can tell if an x86 instruction references memory by looking for
-LDL and STL UCode instructions, and thus what kind of
-cost centre is required. From this we can determine how many cost centres we
-need for the basic block, and their sizes. We can then allocate them in a
-single array.
-
-Consider the example code above. After the preliminary pass, we know we need
-two cost centres, one <code>iCC</code> and one <code>idCC</code>.  So we
-allocate an array to store these which looks like this:
-
-
-|(uninit)| tag (1 byte) -|(uninit)| instr_size (1 bytes) -|(uninit)| (padding) (2 bytes) -|(uninit)| instr_addr (4 bytes) -|(uninit)| I.a (8 bytes) -|(uninit)| I.m1 (8 bytes) -|(uninit)| I.m2 (8 bytes) - -|(uninit)| tag (1 byte) -|(uninit)| instr_size (1 byte) -|(uninit)| data_size (1 byte) -|(uninit)| (padding) (1 byte) -|(uninit)| instr_addr (4 bytes) -|(uninit)| I.a (8 bytes) -|(uninit)| I.m1 (8 bytes) -|(uninit)| I.m2 (8 bytes) -|(uninit)| D.a (8 bytes) -|(uninit)| D.m1 (8 bytes) -|(uninit)| D.m2 (8 bytes) -- -(We can see now why we need tags to distinguish between the two types of cost -centres.)
- -We also record the size of the array. We look up the debug info of the first -instruction in the basic block, and then stick the array into a table indexed -by filename and function name. This makes it easy to dump the information -quickly to file at the end.
- -
-
-
-|INSTR_CC| tag (1 byte) -|5 | instr_size (1 bytes) -|(uninit)| (padding) (2 bytes) -|i_addr1 | instr_addr (4 bytes) -|0 | I.a (8 bytes) -|0 | I.m1 (8 bytes) -|0 | I.m2 (8 bytes) - -|WRITE_CC| tag (1 byte) -|7 | instr_size (1 byte) -|4 | data_size (1 byte) -|(uninit)| (padding) (1 byte) -|i_addr2 | instr_addr (4 bytes) -|0 | I.a (8 bytes) -|0 | I.m1 (8 bytes) -|0 | I.m2 (8 bytes) -|0 | D.a (8 bytes) -|0 | D.m1 (8 bytes) -|0 | D.m2 (8 bytes) -- -(Note that this step is not performed if a basic block is re-translated; see -here for more information.)
-
-GCC inserts padding before the <code>instr_addr</code> field so that it is word
-aligned.
- -The instrumentation added to call the cache simulation function looks like this -(instrumentation is indented to distinguish it from the original UCode): - -
-MOVL $0x0, t20 -PUTL t20, %EAX - PUSHL %eax - PUSHL %ecx - PUSHL %edx - MOVL $0x4091F8A4, t46 # address of 1st CC - PUSHL t46 - CALLMo $0x12 # second cachesim function - CLEARo $0x4 - POPL %edx - POPL %ecx - POPL %eax -INCEIPo $5 - -LEA1L -4(t4), t14 -MOVL $0x99, t18 - MOVL t14, t42 -STL t18, (t14) - PUSHL %eax - PUSHL %ecx - PUSHL %edx - PUSHL t42 - MOVL $0x4091F8C4, t44 # address of 2nd CC - PUSHL t44 - CALLMo $0x13 # second cachesim function - CLEARo $0x8 - POPL %edx - POPL %ecx - POPL %eax -INCEIPo $7 -- -Consider the first instruction's UCode. Each call is surrounded by three -
PUSHL and POPL instructions to save and restore the
-caller-save registers. Then the address of the instruction's cost centre is
-pushed onto the stack, to be the first argument to the cache simulation
-function. The address is known at this point because we are doing a
-simultaneous pass through the cost centre array. This means the cost centre
-lookup for each instruction is almost free (just the cost of pushing an
-argument for a function call). Then the call to the cache simulation function
-for non-memory-reference instructions is made (note that the
-CALLMo UInstruction takes an offset into a table of predefined
-functions; it is not an absolute address), and the single argument is
-CLEARed from the stack.
-
-The second instruction's UCode is similar. The only difference is that, as
-mentioned before, we have to pass the address of the data item referenced to
-the cache simulation function too. This explains the MOVL t14,
-t42 and PUSHL t42 UInstructions. (Note that the seemingly
-redundant MOVing will probably be optimised away during register
-allocation.)
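-
-<p>For orientation, here is an illustrative sketch of what a simulation
-function reached through <code>CALLMo</code> might do with the cost
-centre address it is passed.  The function and cache-model names here
-are assumptions, not the contents of
-<code>vg_cachesim_{I1,D1,L2}.c</code>.
-<p>
-<pre>
-typedef unsigned char      UChar;
-typedef unsigned long      Addr;
-typedef unsigned long long ULong;
-
-typedef struct { ULong a; ULong m1; ULong m2; } CC;
-
-typedef struct {
-   UChar tag, instr_size, data_size;
-   Addr  instr_addr;
-   CC    I;
-   CC    D;
-} idCC;
-
-/* Stand-ins for the real I1/D1/L2 simulators: return nonzero on a miss. */
-static int I1_miss ( Addr a, int size ) { (void)a; (void)size; return 0; }
-static int D1_miss ( Addr a, int size ) { (void)a; (void)size; return 0; }
-static int L2_miss ( Addr a, int size ) { (void)a; (void)size; return 0; }
-
-/* Called once per memory-referencing x86 instruction: the cost centre
-   address was pushed by the instrumentation; the data address arrives
-   via a TempReg, since it can differ on every execution. */
-void log_instr_and_data_ref ( idCC* cc, Addr data_addr )
-{
-   cc->I.a++;                                          /* I-cache reference */
-   if (I1_miss(cc->instr_addr, cc->instr_size)) {
-      cc->I.m1++;
-      if (L2_miss(cc->instr_addr, cc->instr_size)) cc->I.m2++;
-   }
-   cc->D.a++;                                          /* D-cache reference */
-   if (D1_miss(data_addr, cc->data_size)) {
-      cc->D.m1++;
-      if (L2_miss(data_addr, cc->data_size)) cc->D.m2++;
-   }
-}
-</pre>
-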
- -Note that instead of storing unchanging information about each instruction -(instruction size, data size, etc) in its cost centre, we could have passed in -these arguments to the simulation function. But this would slow the calls down -(two or three extra arguments pushed onto the stack). Also it would bloat the -UCode instrumentation by amounts similar to the space required for them in the -cost centre; bloated UCode would also fill the translation cache more quickly, -requiring more translations for large programs and slowing them down more.
- -However, we can't use this approach for profiling -- we can't throw away cost -centres for instructions in the middle of execution! So when a basic block is -translated, we first look for its cost centre array in the hash table. If -there is no cost centre array, it must be the first translation, so we proceed -as described above. But if there is a cost centre array already, it must be a -retranslation. In this case, we skip the cost centre allocation and -initialisation steps, but still do the UCode instrumentation step.
- -
-
-The interface to the simulation is quite clean. The functions called from the
-UCode contain calls to the simulation functions in the files
-vg_cachesim_{I1,D1,L2}.c; these calls are inlined so that only
-one function call is done per simulated x86 instruction. The file
-vg_cachesim.c simply #includes the three files
-containing the simulation, which makes plugging in new cache simulations
-very easy -- you just replace the three files and recompile.
- -
- -Input file has the following format: - -
-file ::= desc_line* cmd_line events_line data_line+ summary_line
-desc_line ::= "desc:" ws? non_nl_string
-cmd_line ::= "cmd:" ws? cmd
-events_line ::= "events:" ws? (event ws)+
-data_line ::= file_line | fn_line | count_line
-file_line ::= ("fl=" | "fi=" | "fe=") filename
-fn_line ::= "fn=" fn_name
-count_line ::= line_num ws? (count ws)+
-summary_line ::= "summary:" ws? (count ws)+
-count ::= num | "."
-
-
-Where:
-
-non_nl_string is any string not containing a newline.-
cmd is a command line invocation.-
filename and fn_name can be anything.-
num and line_num are decimal numbers.-
ws is whitespace.-
nl is a newline.-
- -Counts can be "." to represent "N/A", eg. the number of write misses for an -instruction that doesn't write to memory.
-
-The number of counts in each line and the
-summary_line should not exceed the number of events in the
-event_line. If the number in each line is less,
-vg_annotate treats those missing as though they were a "." entry.
-
-A file_line changes the current file name. A fn_line
-changes the current function name. A count_line contains counts
-that pertain to the current filename/fn_name.  A "fl=" <code>file_line</code>
-and a fn_line must appear before any count_lines to
-give the context of the first count_lines.
-
-Each file_line should be immediately followed by a
-fn_line. "fi=" file_lines are used to switch
-filenames for inlined functions; "fe=" file_lines are similar, but
-are put at the end of a basic block in which the file name hasn't been switched
-back to the original file name. (fi and fe lines behave the same, they are
-only distinguished to help debugging.)
- - -
- -
- -
- -
- -
cachegrind.out output files can contain huge amounts of
-  information; the file format was carefully chosen to minimise file
- sizes.-
-
-In particular, vg_annotate would not need to change -- the file format is such
-that it is not specific to the cache simulation, but could be used for any kind
-of line-by-line information. The only part of vg_annotate that is specific to
-the cache simulation is the name of the input file
-(cachegrind.out), although it would be very simple to add an
-option to control this.
- - -