On a big executable, the trunk needs:
dinfo: 134873088/71438336 max/curr mmap'd, 134607808/66717872 max/curr
With the patch, we have:
dinfo: 99065856/56836096 max/curr mmap'd, 97883776/51663656 max/curr
So, peak dinfo memory decreases by about 36Mb, and final by 15Mb.
(for info, valgrind 3.9.0 uses
dinfo: 158941184/109666304 max/curr mmap'd, 156775944/107590656 max/curr
So, compared to 3.9.0, dinfo peak decreases by about 40%, and the final
memory is divided by more than 2).
The memory decrease is obtained by:
* using a dedup pool to store the filename/dirname pairs for the loctab
source/line information.
As there are typically not many such pairs, a UShort is usually good
enough to identify a fn/dn pair in the dedup pool.
To avoid losing memory due to alignment, the fndn indexes are stored
in a "parallel" array to the DiLoc loctab array, with entries of
1, 2 or 4 bytes according to the nr of fn/dn pairs in the dedup pool
(a sketch of this variable-width lookup is shown after this list).
See priv_storage.h comments for details.
(readdwarf.c had a local extensible WordArray implementation.
As this change uses an xarray instead, the local implementation was
removed.)
* the memory needed for --read-inline-info is slightly decreased (-2Mb)
by removing the (unused) dirname from the DiInlLoc struct.
Handling the dirname for inlined function callers implies reworking
read_filename_table in the dwarf3 parser, which is common to the var
and inline-info parsers.
Until this is done, the dirname component is removed from DiInlLoc.
* the stabs reader (readstabs.c) has been broken since 3.9.0.
For this change, the code has been updated to make it compile with the new
DiLoc/FnDn dedup pool. As the code is completely broken, a vg_assert(0)
has been put at the beginning of the stabs reader.
* the pdb reader (readpdb.c) has been trivially updated and should still work.
It has not been tested (how do we test this?).
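The variable-width fndn index array mentioned in the first point can be
sketched as follows. This is a standalone illustration only; the names and
types are not the actual priv_storage.h ones.

  /* Standalone illustration of the "parallel array with 1, 2 or 4 byte
     entries" idea: the entry width is picked from the number of distinct
     filename/dirname pairs in the dedup pool. */
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  typedef struct { void *base; int width; } FnDnIndexArray;

  /* Pick the smallest entry width able to hold an index for each pair. */
  static FnDnIndexArray new_fndn_array(size_t nr_locs, size_t nr_fndn_pairs)
  {
      FnDnIndexArray a;
      a.width = nr_fndn_pairs < 255 ? 1 : nr_fndn_pairs < 65536 ? 2 : 4;
      a.base  = calloc(nr_locs, a.width);
      return a;
  }

  static void set_fndn(FnDnIndexArray *a, size_t i, uint32_t fndn_ix)
  {
      switch (a->width) {
      case 1:  ((uint8_t  *)a->base)[i] = (uint8_t)fndn_ix;  break;
      case 2:  ((uint16_t *)a->base)[i] = (uint16_t)fndn_ix; break;
      default: ((uint32_t *)a->base)[i] = fndn_ix;           break;
      }
  }

  static uint32_t get_fndn(const FnDnIndexArray *a, size_t i)
  {
      switch (a->width) {
      case 1:  return ((const uint8_t  *)a->base)[i];
      case 2:  return ((const uint16_t *)a->base)[i];
      default: return ((const uint32_t *)a->base)[i];
      }
  }

  int main(void)
  {
      FnDnIndexArray a = new_fndn_array(1000 /* locs */, 200 /* fn/dn pairs */);
      set_fndn(&a, 42, 7);
      printf("entry width = %d byte(s), fndn[42] = %u\n",
             a.width, (unsigned)get_fndn(&a, 42));
      free(a.base);
      return 0;
  }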
A follow-up patch will avoid making too many calls to
ML_(addFnDn): instead of one call per ML_(addLineInfo), a single
call should be done when reading the filename table.
This has also been tested in an outer/inner setup, to verify there are no
memory leaks/bugs.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@14158
* Avoid printing the size of a null dedup pool
* Avoid warnings of 2 unused variables on some platforms
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@14132
On a big executable, the trunk needs:
dinfo: 155844608/106737664 max/curr mmap'd 155572624/102276760 max/curr
With the patch, we have:
dinfo: 134873088/70389760 max/curr mmap'd 134607808/66717512 max/curr
So, peak dinfo memory decreases by 21Mb, and final by 36Mb.
The memory decrease is obtained by:
* using a dedup pool to store the machine dependent part (cfsi_m)
of the cfsi information, as this information is highly duplicated.
For x86 and amd64, the duplication factor of the cfsi machine dependent
part is very high (up to a factor of 60).
For arm64, it is more like a factor of 3.
A 'variable size' index (1, 2 or 4 bytes) is automatically used to
identify the cfsi_m, depending on whether there are fewer than 255,
fewer than 64K, or more different cfsi_m.
* not storing explicitly the length of the range for which a cfsi_m
is to be used: in the large majority of cases, ranges are
consecutive, and so the end of a range is just one byte before
the start of the next range.
So, we do not store the length of the ranges.
If there is a hole between 2 ranges, the hole is stored explicitly
as a range for which we have no cfsi_m information.
On x86 and amd64, we have quite a few holes (something like one hole
every 7 cfsi). On arm64, we have very few holes (less than one hole
every 50 cfsi).
Even with the nr of holes on x86/amd64, it is more memory efficient
to store the holes rather than the length of each cfsi.
* Merging consecutive ranges that have the same cfsi_m info:
Many cfsi are "mergeable": there is no hole between 2 cfsi, and their
machine dependent part is identical
(I guess the unwind info needed by valgrind is a subset of the full
unwind info, and so the cfsi entries are not merged by the compiler,
but can be merged for simple unwinding). Depending on the platform
(x86, amd64, arm64) and on the library/object file, we can have a
significant nr of mergeable entries.
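The hole/merge range encoding from the last two points can be sketched as
follows. This is a standalone illustration, not the actual m_debuginfo code.

  /* Each output entry stores only a start address and a cfsi_m index;
     a hole is an entry with index 0, and consecutive input ranges that
     share the same cfsi_m are merged.  The end of the last range is
     assumed to be recorded elsewhere. */
  #include <stdio.h>

  typedef struct { unsigned long base, len; unsigned cfsi_m_ix; } RangeIn;
  typedef struct { unsigned long base; unsigned cfsi_m_ix; } RangeOut; /* ix 0 == hole */

  static int encode(const RangeIn *in, int n_in, RangeOut *out)
  {
      int n_out = 0;
      for (int i = 0; i < n_in; i++) {
          int contiguous = i > 0 && in[i-1].base + in[i-1].len == in[i].base;
          if (i > 0 && !contiguous)            /* explicit hole entry */
              out[n_out++] = (RangeOut){ in[i-1].base + in[i-1].len, 0 };
          if (contiguous && n_out > 0
              && out[n_out-1].cfsi_m_ix == in[i].cfsi_m_ix)
              continue;                        /* mergeable: extends previous range */
          out[n_out++] = (RangeOut){ in[i].base, in[i].cfsi_m_ix };
      }
      return n_out;
  }

  int main(void)
  {
      RangeIn in[] = {
          { 0x1000, 0x10, 1 },   /* merged with the next range      */
          { 0x1010, 0x20, 1 },
          { 0x1030, 0x08, 2 },   /* contiguous but different cfsi_m */
          { 0x1050, 0x10, 2 },   /* a hole precedes this range      */
      };
      RangeOut out[8];
      int n = encode(in, 4, out);
      for (int i = 0; i < n; i++)
          printf("0x%lx -> cfsi_m %u%s\n", out[i].base, out[i].cfsi_m_ix,
                 out[i].cfsi_m_ix ? "" : " (hole)");
      return 0;
  }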
The patch is not very small, but a lot of it is mechanical changes.
The patch has been compiled and tested on x86/amd64/ppc32/ppc64
(but ppc does not use cfsi, so that just verifies it compiles).
It has been compiled on arm64, and "tested" by launching valgrind on
one executable.
It has not been compiled on s390 and mips.
With some luck, maybe it will compile on these platforms.
And if that uses up the whole provision of luck for 2014, it might even work
on these platforms :).
If it does not compile, the fix should be straightforward.
Runtime problems might be trickier (but arm64 "worked out of the box"
once x86/amd64 were ok).
This has also been tested in an outer/inner setup, to verify there are no
memory leaks/bugs.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@14129
showing inlined function calls.
See bug 278972: valgrind stacktraces and suppressions do not handle
inlined function call debuginfo.
Reading the inlined dwarf call info is activated using the new clo
--read-inline-info=yes.
The default is currently no, but an objective is to optimise the performance
and memory in order to possibly set it on by default
(see the discussion about performance below).
Basically, the patch provides the following pieces:
1. Implement a new dwarf3 reader that reads the inlined call info
2. Some performance improvements done for this new parser, and
on some common code between the new parser and the var info parser.
3. Use the parsed inlined info to produce stacktrace showing inlined calls
4. Use the parsed inlined info in the suppression matching and suppression generation
5. and of course, some reg tests
1. new dwarf3 reader:
---------------------
Two options were possible: add the reading of the inlined info
to the current var info dwarf reader, or add a 2nd reader.
The 2nd approach was preferred, for the following reasons:
The var info reader is slow, memory hungry and quite complex.
Having a separate parsing phase for the inlined information
is simpler/faster when just reading the inlined info.
Possibly, a single parser would be faster when using both
--read-var-info=yes and --read-inline-info=yes.
However, var-info being extremely memory/cpu hungry, it is unlikely
to be used often, and having a separate parsing pass for inlined info
in any case does not make much difference.
(--read-var-info=yes is also now less interesting thanks to commit
r13991, which provides a fast and low memory "reasonable" location
for an address).
The inlined info parser reads the dwarf info and makes calls
to ML_(addInlInfo) (see priv_storage.h).
2. performance optimisations
----------------------------
* the abbrev cache has been improved in revision r14035.
* The new parser skips the uninteresting DIEs
(the var-info parser has no logic to skip uninteresting DIEs).
* Some other minor perf optimisations here and there.
In total now, on a big executable, 15 seconds of CPU are needed to
create the inlined info (on my slow x86 pentium).
With regards to memory, the dinfo arena:
with inlined info: 172281856/121085952 max/curr mmap'd
without : 157892608/106721280 max/curr mmap'd,
So, basically, inlined information costs about 15Mb of memory for
my big executable (compared to the first version of the patch, this is
already using less memory, thanks to the strpool deduppoolalloc).
The needed memory can probably be decreased somewhat more.
3. produce better stack traces
------------------------------
VG_(describe_IP) has a new argument, InlIPCursor *iipc, which allows
inlined function calls to be described by making repeated calls
to describe_IP. See pub_tool_debuginfo.h for a description.
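A minimal usage sketch for a tool follows. The exact argument lists below
are an approximation from memory; pub_tool_debuginfo.h at this revision is
the authoritative reference.

  #include "pub_tool_basics.h"
  #include "pub_tool_debuginfo.h"
  #include "pub_tool_libcprint.h"

  /* Print one line per (possibly inlined) function call described by ip. */
  static void print_ip_with_inlined_calls ( Addr ip )
  {
     HChar buf[256];
     InlIPCursor* iipc = VG_(new_IIPC)(ip);
     do {
        /* Each iteration describes one level of the inlined call chain
           at ip, until the enclosing non-inlined function is reached. */
        VG_(describe_IP)(ip, buf, sizeof buf, iipc);
        VG_(printf)("   %s\n", buf);
     } while (VG_(next_IIPC)(iipc));
     VG_(delete_IIPC)(iipc);
  }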
4. suppression generation and matching
--------------------------------------
* suppression generation now also uses an InlIPCursor *iipc
to generate a line for each inlined fn call.
* suppression matching: to allow suppression matching to
match one IP to several function calls in a suppression entry,
the 'inputCompleter' object (which allows lazily generating
function or object names for a stacktrace when matching
an error with a suppression) has been generalised a little bit
more, to also lazily generate the input sequence.
VG_(generic_match) has been updated so as to be more generic
with respect to the input completer: when an input completer
is provided, VG_(generic_match) no longer needs
to produce/compute any input itself; this is all delegated
to the input completer.
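For illustration, here is a conceptual, standalone sketch of "lazy input via
a completer" (this is not the actual VG_(generic_match) code): the matcher
never materialises the input itself and asks the completer for the i-th
element only when it needs it.

  #include <stdio.h>
  #include <string.h>

  typedef struct {
      int   n_input;
      /* lazily produce the name of the i-th input element */
      const char *(*get_input)(int i, void *env);
      void *env;
  } InputCompleter;

  /* Match pattern[0..n_pat) against the completer's input; "*" matches any
     single element (real suppression matching is of course richer). */
  static int generic_match(const char **pattern, int n_pat,
                           const InputCompleter *ic)
  {
      if (n_pat != ic->n_input)
          return 0;
      for (int i = 0; i < n_pat; i++) {
          const char *inp = ic->get_input(i, ic->env); /* computed on demand */
          if (strcmp(pattern[i], "*") != 0 && strcmp(pattern[i], inp) != 0)
              return 0;
      }
      return 1;
  }

  static const char *demo_get(int i, void *env)
  {
      const char **names = env;    /* stands in for lazily demangled names */
      return names[i];
  }

  int main(void)
  {
      const char *names[]   = { "a", "brussels_fn", "main" };
      const char *pattern[] = { "a", "*", "main" };
      InputCompleter ic = { 3, demo_get, names };
      printf("match: %d\n", generic_match(pattern, 3, &ic));
      return 0;
  }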
5. various regtests
-------------------
to test stack traces with inlined calls, and suppressions
of (some of) these errors using inlined fn call matching.
Work still to do:
-----------------
* improve parsing performance
* reduce the memory overhead
* handling the directory name for the files of the inlined function calls
is not yet done (probably implies refactoring some code)
* see if the m_errormgr.c *offsets arrays cannot be managed via an xarray
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@14036
It is possible that a debug info contains no strings (and so strpool
is never allocated).
A protection to avoid accessing strpool was already necessary
in ML_(canonicaliseTables):
  if (di->strpool)
     VG_(freezeDedupPA) (di->strpool);
So, if such a debug info is released, we need the same protection
to avoid accessing a NULL strpool.
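A minimal sketch of the analogous guard in the release path (this assumes
VG_(deleteDedupPA) is the matching release call of the dedup pool API):

  if (di->strpool)
     VG_(deleteDedupPA) (di->strpool);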
Detected by Julian on arm64, but not (at least easily) reproduced on amd64.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@14033
include/pub_tool_deduppoolalloc.h
coregrind/pub_core_deduppoolalloc.h
coregrind/m_deduppoolalloc.c
and uses it (currently only) for the strings in m_debuginfo/storage.c
The idea is that such a dedup pool allocator will also be used for other
highly duplicated information (e.g. the DiCFSI information), where
significant gains can also be achieved.
The dedup pool for strings also decreases significantly the memory
needed by the read inline information (patch still to be committed,
see bug 278972).
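For illustration, here is a conceptual, standalone sketch of what a dedup
pool does for strings (this is not the m_deduppoolalloc.c implementation; it
only shows the idea that identical elements are stored once and callers
receive a pointer to the shared copy):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define N_BUCKETS 1024

  typedef struct Entry { struct Entry *next; char *str; } Entry;
  static Entry *buckets[N_BUCKETS];

  static unsigned hash_str(const char *s)    /* simple illustrative hash */
  {
      unsigned h = 5381;
      while (*s) h = h * 33 + (unsigned char)*s++;
      return h % N_BUCKETS;
  }

  /* Return a pooled copy of s; identical strings share one allocation. */
  static const char *dedup_addStr(const char *s)
  {
      unsigned h = hash_str(s);
      for (Entry *e = buckets[h]; e; e = e->next)
          if (strcmp(e->str, s) == 0)
              return e->str;          /* duplicate: reuse the existing copy */
      Entry *e = malloc(sizeof *e);
      e->str = strdup(s);             /* new element: store it once */
      e->next = buckets[h];
      buckets[h] = e;
      return e->str;
  }

  int main(void)
  {
      const char *a = dedup_addStr("readdwarf3.c");
      const char *b = dedup_addStr("readdwarf3.c");
      printf("same storage: %s\n", a == b ? "yes" : "no");
      return 0;
  }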
When testing with a big executable (tacot_process),
this reduces the size of the dinfo arena from
trunk: 158941184/109760512 max/curr mmap'd, 156775944/107882728 max/curr,
to
ddup: 157892608/106614784 max/curr mmap'd, 156362160/101414712 max/curr
(so 3Mb less mmap-ed once debug info is read, 1Mb less mmap-ed in peak,
6Mb less allocated once debug info is read).
This is all gained on the strings, whose memory use changes from:
trunk: 17,434,704 in 266: di.storage.addStr.1
to
ddup: 10,966,608 in 750: di.storage.addStr.1
(6.5Mb less memory used by strings)
The gain in mmap-ed memory is smaller due to fragmentation.
One could probably decrease the fragmentation by using a bigger
size for the dedup pool, but then we would lose memory on the last
allocated pool (and for small libraries, we often do not use much
of a big pool block).
A solution might be to increase the pool size but have a "shrink_block"
operation. To be looked at in the future.
In terms of performance, the startup of a big executable is not influenced
significantly (something like 0.1 seconds on a 15 second startup, on a
slow pentium).
The dedup pool uses a hash table. The hash function used currently
is the VG_(adler32) checksum. It is reported (and also visible here)
that this checksum is not a very good hash function (many collisions).
To get statistics about collisions, use --stats -v -v -v.
As an example of the collisions, on the strings in the debug info of the
memcheck tool on x86, one obtains:
--4789-- dedupPA:di.storage.addStr.1 9983 allocs (8174 uniq) 11 pools (4820 bytes free in last pool)
--4789-- nr occurences of chains of len N, N-plicated keys, N-plicated elts
--4789-- N: 0 : nr chain 6975, nr keys 0, nr elts 0
--4789-- N: 1 : nr chain 3670, nr keys 6410, nr elts 8174
--4789-- N: 2 : nr chain 1070, nr keys 226, nr elts 0
--4789-- N: 3 : nr chain 304, nr keys 100, nr elts 0
--4789-- N: 4 : nr chain 104, nr keys 84, nr elts 0
--4789-- N: 5 : nr chain 72, nr keys 42, nr elts 0
--4789-- N: 6 : nr chain 44, nr keys 34, nr elts 0
--4789-- N: 7 : nr chain 18, nr keys 13, nr elts 0
--4789-- N: 8 : nr chain 17, nr keys 8, nr elts 0
--4789-- N: 9 : nr chain 4, nr keys 6, nr elts 0
--4789-- N:10 : nr chain 9, nr keys 4, nr elts 0
--4789-- N:11 : nr chain 1, nr keys 0, nr elts 0
--4789-- N:13 : nr chain 1, nr keys 1, nr elts 0
--4789-- total nr of unique chains: 12289, keys 6928, elts 8174
which shows that of 8174 different strings, only 6410 have
a unique hash value. As other examples, the N:13 line shows we have 13 strings
mapping to the same key, and the N:10 line shows we have 4 groups of 10 strings
mapping to the same key, etc.
So, adler32 is definitely a bad hash function.
Trials have been done with another hash function, giving a much lower
collision rate. So, a better (but still fast) hash function would probably
be beneficial. To be looked at ...
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@14029
Necessary changes to Valgrind to support MIPS64LE on Linux.
Minor cleanup/style changes embedded in the patch as well.
The change corresponds to r2687 in VEX.
Patch written by Dejan Jevtic and Petar Jovanovic.
More information about this issue:
https://bugs.kde.org/show_bug.cgi?id=313267
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@13292
* other platforms (e.g. amd64) first try to unwind
with cfi info, then with the fp chain.
* fp unwinding of code compiled without a frame pointer can
fail and give incomplete stack traces (often terminating
with a random program counter, causing a huge number of
recorded stack traces).
This patch improves unwinding on x86 by:
* the first time an IP is unwound, doing the unwind both with
the CFI technique and with the fp technique.
If the results are identical, the IP is inserted in a cache of
'fp unwindable' IPs.
* following unwinds of the same IP are then done directly
either with fp unwind or with cfi, depending on the
cached result of the check done during the first unwind.
The cache is needed so as to avoid cfi unwinding as much as possible,
as it is significantly slower than fp unwinding.
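A conceptual, standalone sketch of this per-IP cache follows (not the actual
m_stacktrace.c code; the two "unwinders" are stand-ins):

  #include <stdio.h>

  typedef enum { UNKNOWN = 0, FP_OK, NEED_CFI } UnwindKind;

  #define CACHE_SZ 509                     /* small direct-mapped cache */
  static struct { unsigned long ip; UnwindKind kind; } cache[CACHE_SZ];

  /* Stand-ins for the real fp-chain and CFI unwinders. */
  static unsigned long unwind_fp (unsigned long ip) { return ip - 16; }
  static unsigned long unwind_cfi(unsigned long ip) { return ip & 1 ? ip - 32 : ip - 16; }

  static unsigned long unwind_one_frame(unsigned long ip)
  {
      unsigned h = ip % CACHE_SZ;
      if (cache[h].ip == ip) {
          /* already checked: use the cheap fp unwind when it was validated */
          return cache[h].kind == FP_OK ? unwind_fp(ip) : unwind_cfi(ip);
      }
      /* first time this IP is unwound: do both and compare the results */
      unsigned long next_fp  = unwind_fp(ip);
      unsigned long next_cfi = unwind_cfi(ip);
      cache[h].ip   = ip;
      cache[h].kind = (next_fp == next_cfi) ? FP_OK : NEED_CFI;
      return next_cfi;                     /* trust CFI on the first pass */
  }

  int main(void)
  {
      printf("%#lx\n", unwind_one_frame(0x400124));
      printf("%#lx\n", unwind_one_frame(0x400124));  /* now served from cache */
      return 0;
  }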
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@13280
This change is based on rumours/legends/oral transmission of experience/...
that prime nrs are good to use for hash table sizes :).
If someone has a (short) explanation of why this is useful,
it will be welcome.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@13237
Two fixes could be done:
either we fix the comments,
or we increase N_FRAMES to VG_DEEPEST_BACKTRACE.
We fix the comment, for the following reason:
this is (at least for the moment) not performance critical,
as it is only called when an error is reported.
However, searching for local vars is extremely costly.
It is unlikely that an error is reported for a stack variable
which is more than 8 frames deeper than the frame in which
it is detected.
So, fix the comment, waiting for a complaint that a deeper
variable is not properly described.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@13235
This patch changes the way static variables are
recorded by readdwarf3.c (when giving --read-var-info=yes),
improving the way such variables are described.
Currently:
A static variable does not have the DW_AT_external attribute.
So, readdwarf3.c does not consider it a global variable.
It is rather considered a "local" variable.
When it is recorded, it is associated with a range of program counters
(the functions in the file where it is visible).
However, even if the static variable is only visible
in the source file where it is declared, it can in reality
be used from any range of program counters, typically
by having the address of the local variable passed
to other functions.
Such a local variable can then only be described
when the program counter is in the range of program
counters for which it has been recorded.
However, this (local) description is obtained
by a kludge in debuginfo.c (around line 3285).
This kludge then produces a strange description,
saying that the variable has been declared in
frame 0 of a thread (see the second example below).
The kludge is not always able to describe
the address (e.g. if the IP of the tid is in another file than
the one where the variable has been declared).
I suspect the kludge can sometimes describe the var as being
declared in an unrelated thread
(e.g. if an error is triggered by tid 5, but tid 1 is by
luck at an IP corresponding to the recorded range).
The patch changes the way a static variable is recorded:
if the DW_AT_external attribute is found, the variable is marked as global.
If a variable is not external, but is seen when the level is 1,
then we also record the variable as a global variable (i.e.
with a full IP range).
This improves the way such static variables are described:
* they are described even when accessed from other files.
* their description is not in an artificial "thread frame".
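A conceptual sketch of the new recording decision (illustrative names only,
not the actual readdwarf3.c code):

  /* Record the variable with a full IP range (i.e. as a global)? */
  static int record_with_full_ip_range(int has_dw_at_external, int level)
  {
      /* externals are true globals; non-external variables seen at level 1
         (directly under the compilation unit DIE) are file-scope statics
         and are now also recorded with a full IP range */
      return has_dw_at_external || level == 1;
  }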
First example:
**************
a variable cannot be described because it is
accessed by a function in another file:
with the trunk:
==20410== ----------------------------------------------------------------
==20410==
==20410== Possible data race during read of size 4 at 0x600F54 by thread #1
==20410== Locks held: none
==20410== at 0x4007E4: a (abc.c:42)
==20410== by 0x4006BC: main (mabc.c:24)
==20410==
==20410== This conflicts with a previous write of size 4 by thread #2
==20410== Locks held: none
==20410== at 0x4007ED: a (abc.c:42)
==20410== by 0x400651: brussels_fn (mabc.c:9)
==20410== by 0x4C2B54E: mythread_wrapper (hg_intercepts.c:219)
==20410== by 0x4E348C9: start_thread (pthread_create.c:300)
==20410==
==20410== ----------------------------------------------------------------
with the patch:
==4515== ----------------------------------------------------------------
==4515==
==4515== Possible data race during read of size 4 at 0x600F54 by thread #1
==4515== Locks held: none
==4515== at 0x4007E4: a (abc.c:42)
==4515== by 0x4006BC: main (mabc.c:24)
==4515==
==4515== This conflicts with a previous write of size 4 by thread #2
==4515== Locks held: none
==4515== at 0x4007ED: a (abc.c:42)
==4515== by 0x400651: brussels_fn (mabc.c:9)
==4515== by 0x4C2B54E: mythread_wrapper (hg_intercepts.c:219)
==4515== by 0x4E348C9: start_thread (pthread_create.c:300)
==4515==
==4515== Location 0x600f54 is 0 bytes inside global var "static_global"
==4515== declared at mabc.c:4
==4515==
==4515== ----------------------------------------------------------------
Second example:
***************
When the kludge can describe the variable, it is strangely described
as being declared in a frame of a thread, while the declaration
certainly has nothing to do with a thread.
With the trunk:
==20410== Location 0x600f68 is 0 bytes inside local var "static_global_a"
==20410== declared at abc.c:3, in frame #0 of thread 1
With the patch:
==4515== Location 0x600f68 is 0 bytes inside global var "static_global_a"
==4515== declared at abc.c:3
For reference, the relevant declaration in abc.c:
#include <stdio.h>
static int static_global_a = 0; //// <<<< this is abc.c:3
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@13153
From: Ian Campbell <Ian.Campbell@citrix.com>
Under Xen the toolstack is responsible for managing the domains in
the system, e.g. creating, destroying, and otherwise manipulating
them.
To do this it uses a number of ioctls on the /proc/xen/privcmd
device. Most of these (the MMAPBATCH ones) simply set things up such
that a subsequent mmap call will map the desired guest memory. Since
valgrind has no way of knowing what the memory contains, we assume
that it is all initialised (to do otherwise would require valgrind to
be observing the complete state of the system and not just the given
process).
The most interesting ioctl is XEN_IOCTL_PRIVCMD_HYPERCALL which
allows the toolstack to make arbitrary hypercalls. Although the
mechanism here is specific to the OS of the guest running the
toolstack the hypercalls themselves are defined solely by the
hypervisor. Therefore I have split support for this ioctl into a part
in syswrap-linux.c which handles the ioctl itself and passes things
onto a new syswrap-xen.c which handles the specifics of the
hypercalls themselves. Porting this to another OS should just be a
matter of wiring up syswrap-$OS.c to decode the ioctl and call into
syswrap-xen.c. In the future we may want to split this into
syswrap-$ARCH-xen.c but for now this is x86 only.
The hypercall coverage here is pretty small but is enough to get
reasonable(-ish) results out of the xl toolstack when listing,
creating and destroying domains.
One issue is that the hypercalls which are exclusively used by the
toolstacks (as opposed to those used by guest operating systems) are
not considered a stable ABI, since the hypervisor and the low-level
tools are considered a matched pair. This covers the sysctl and
domctl hypercalls, which are a fairly large chunk of the support
here. I'm not sure how to solve this without introducing a massive
amount of duplication. Right now this targets the Xen unstable
interface (which will shortly be released as Xen 4.2); perhaps I can
get away with deferring this problem until the first change.
On the plus side the vast majority of hypercalls are not of interest
to the toolstack (they are used by guests) so we can get away without
implementing them.
Note: a hypercall only reads as many words from the ioctl arg
struct as there are actual arguments to that hypercall, and the
toolstack only initialises the arguments which are used. However,
there is no space in the DEFN_PRE_TEMPLATE prototype to allow this to
be communicated from syswrap-xen.c back to syswrap-linux.c. Since a
hypercall can have at most 5 arguments, I have hackily stolen ARG8 for
this purpose.
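For illustration, here is a standalone sketch of the idea (not the actual
syswrap code; all names here are made up): the hypercall-specific handler
reports how many argument words are really used, and the generic layer then
checks only that many, mirroring the "steal an otherwise unused ARG slot to
pass the count back" trick.

  #include <stdio.h>

  #define MAX_HCALL_ARGS 5

  /* Hypothetical per-hypercall handler: returns the number of argument
     words this hypercall actually consumes. */
  static int pre_hypercall(unsigned long op)
  {
      switch (op) {
      case 0:  return 1;              /* e.g. a call using 1 word  */
      case 1:  return 3;              /* e.g. a call using 3 words */
      default: return MAX_HCALL_ARGS; /* be conservative if unknown */
      }
  }

  /* Generic layer: only inspects the words reported as used. */
  static void check_hypercall_args(unsigned long op, const unsigned long *arg)
  {
      int n = pre_hypercall(op);
      for (int i = 0; i < n; i++)
          printf("would check arg[%d] = %#lx for definedness\n", i, arg[i]);
  }

  int main(void)
  {
      unsigned long args[MAX_HCALL_ARGS] = { 0x1000, 0x2000, 0x3000, 0, 0 };
      check_hypercall_args(1, args);
      return 0;
  }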
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@12963
This is a slightly modified version of a patch provided by Petar Jovanovic
<petar.jovanovic@rt-rk.com>.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@12960
di->soname was not freed, so it was leaked when the debug info was removed.
free(soname) added in free_DebugInfo, after having verified
and then ensured that all sonames are allocated in dinfo.
Regtested on deb6/amd64.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@12442
(Rusty Russell, rusty@rustcorp.com.au)
tdb uses fcntl locks and mmap, and some of the tests fail under valgrind.
strace showed valgrind opening the tdb file, reading 1024 bytes, then closing
it. This is not allowed: POSIX says if you open and close a file, all fcntl
locks on it are dropped (insane, yes).
Finally got around to hacking the source to track this down: di_notify_mmap is
doing the damage. The simplest fix was to hand in an optional fd for it to
use, then have it do pread.
I had to fix your pread; surely this should seek back even if the platform
doesn't have pread support?
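For illustration, a standalone sketch of the point being made (this is not
the actual VG_(pread) implementation): a pread() emulation built on
lseek()+read() should restore the original file offset, so callers that rely
on the current position are not disturbed.

  #include <sys/types.h>
  #include <unistd.h>

  static ssize_t pread_fallback(int fd, void *buf, size_t count, off_t offset)
  {
      off_t saved = lseek(fd, 0, SEEK_CUR);  /* remember current position */
      if (saved == (off_t)-1)
          return -1;
      if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
          return -1;
      ssize_t n = read(fd, buf, count);
      lseek(fd, saved, SEEK_SET);            /* seek back, even on error  */
      return n;
  }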
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@12224
* configure.in support
* new supp file darwin11.supp
* comment out many intercepts in mc_replace_strmem.c and
vg_replace_malloc.c that are apparently unnecessary for Darwin
* add minimal handling for the following new syscalls and mach traps:
mach_port_set_context
task_get_exception_ports
getaudit_addr
psynch_mutexwait
psynch_mutexdrop
psynch_cvbroad
psynch_cvsignal
psynch_cvwait
psynch_rw_rdlock
psynch_rw_wrlock
psynch_rw_unlock
psynch_cvclrprepost
* wqthread_hijack on amd64-darwin: deal with
tst->os_state.pthread having an apparently different offset,
which caused an assertion failure
* m_debuginfo: for 32 bit processes on Lion, use the DebugInfoFSM
cleanup added in r12041/12042 to handle apparently new dyld
behaviour, which is to map text areas r-- first and only
vm_protect them later to r-x.
The following cleanups remain to be done
* remove apparently pointless, commented out wrapper macro
invocations in mc_replace_strmem.c, e.g.
//MEMMOVE(VG_Z_DYLD, memmove)
(or determine that they are still necessary, and uncomment)
* ditto in vg_replace_malloc.c, plus general VGO_darwin cleanups
there
* write proper syscall wrappers for
mach_port_set_context
task_get_exception_ports
getaudit_addr
psynch_mutexwait
psynch_mutexdrop
psynch_cvbroad
psynch_cvsignal
psynch_cvwait
psynch_rw_rdlock
psynch_rw_wrlock
psynch_rw_unlock
psynch_cvclrprepost
These are currently just no-ops and may be causing Memcheck to
report false undef-value errors
* figure out why it doesn't work properly unless built with gcc-4.2 on
Lion.
gcc-4.2 is the "normal" gcc (i686-apple-darwin11-gcc-4.2.1). Plain
gcc is the hybrid gcc-front-end clang-back-end thing
(i686-apple-darwin11-llvm-gcc-4.2). Whereas on Snow Leopard, plain
gcc is the normal gcc.
The symptoms of the failure are that wqthread_hijack in
syswrap-amd64-darwin.c hits this /*NOTREACHED*/ vg_assert(0); right
at the end (you need a pretty complex threaded app to trigger this),
which makes me think that either ML_(wqthread_continue_NORETURN) or
call_on_new_stack_0_1 do return, which they are not expected to.
* figure out if some of the uninitialised value errors reported in
system libraries are caused by Memcheck being confused by LLVM
generated code, as per bug #242137
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@12043
contains a bunch of fields which are used as a very simple state
machine that observes mmap calls and decides when to read debuginfo
for the associated file. This change moves these fields into their
own structure, struct _DebugInfoFSM, for cleanness, so as to make it
clear they have a common purpose.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@12041
with only one symbol. Instead, allow an address to have arbitrarily
many names. This reflects reality better, particularly for systemy
libraries such as glibc and ld.so, and is background work needed for
fixing #275284. This is not in itself a fix for #275284. A followup
commit to un-break compilation on OSX will follow shortly.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@11981