slightly increases performance. It also moderately improves
the number of cases where helgrind can provide the stack trace of the old
access (when using the same amount of memory for the OldRef entries).
The patch also provides a new helgrind monitor command to show
the recorded accesses for an address+len, and adds an optional argument
lock_address to the monitor command 'info locks', to show the information
about just that lock.
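For example (the lock address below is purely illustrative), the extended
'info locks' command can be issued from gdb via the Valgrind gdbserver:

  (gdb) monitor info locks 0x485ab30

showing the information for the lock at that address only, instead of all locks.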
Currently, OldRef entries are maintained in a sparse WA that points to N
entries, as specified by --conflict-cache-size=N.
For each entry (associated with an address), we have the last 5 accesses.
Old entries are recycled in an exact LRU order.
But inside an entry, we could have one recent access and 4 very
old accesses that are kept 'alive' merely because a single thread keeps
accessing the shared address repeatedly.
The attached patch replaces the sparse WA that maintains the OldRef entries
with a hash table.
Each OldRef now also maintains only a single access for an address.
As an OldRef now maintains only one access, all the entries are now
recycled in strict LRU order.
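A minimal sketch of the reworked OldRef (field names and layout are invented
for illustration; the real structure packs its fields more tightly to reach
the sizes given below): each entry records a single access and is chained
both in its hash-table bucket and in the global LRU list.

  /* Sketch only: names and layout are illustrative, not the real ones. */
  #include <stdint.h>

  typedef struct OldRef_ {
     struct OldRef_* ht_next;    /* next entry in the same hash-table bucket  */
     struct OldRef_* lru_prev;   /* doubly linked LRU list over all OldRef    */
     struct OldRef_* lru_next;
     uintptr_t       ga;         /* guest address of the recorded access      */
     void*           acc_thr;    /* the single recorded access: its thread,   */
     void*           acc_rcec;   /* its ref-counted stack trace (RCEC),       */
     uint8_t         acc_szB;    /* its size in bytes,                        */
     uint8_t         acc_isW;    /* and whether it was a write                */
  } OldRef;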
Memory used for OldRef
-----------------------
For the trunk, an OldRef has a size of 72 bytes (on 32-bit archs),
maintaining up to 5 accesses to the same address.
On 64-bit archs, an OldRef is 104 bytes.
With the patch, an OldRef has a size of 32 bytes (on 32-bit archs)
or 56 bytes (on 64-bit archs).
So, for one single access, the new code needs (on 32 bits)
32 bytes, while the trunk needs only 14.4 bytes (72/5).
However, that is the worst case, assuming that the 5 entries in the
accs array are all used.
Looking at two big apps (one of them being firefox), we see that
very few OldRef entries have all 5 entries occupied.
On a firefox startup, of the 5x1,000,000 access slots, only
1,406,939 are actually used.
So, on average, the trunk in reality uses around 52 bytes per access
(1,000,000 entries x 72 bytes / 1,406,939 used accesses).
The default value for --conflict-cache-size has been doubled to 2,000,000.
This ensures that the memory used for the OldRef entries is more or less the
same as for the trunk (104MB for OldRef entries).
Memory used for sparseWA versus hashtable
-----------------------------------------
Looking at two big apps (one of them being firefox), we see that
there are big variations in the size of the WA: it can go in a few
seconds from 10MB to 250MB, or decrease back to 10MB.
This all depends on where the last N accesses were done: if well localised,
the WA will be small.
If the last N accesses were distributed over a big address space,
then the WA will be big: the last level of the WA (the biggest memory consumer)
uses slightly more than 1KB (2KB on 64 bits) for each '256 bytes' memory
zone where there is an OldRef. So, in the worst case, on 32 bits, we
need > 1_000_000_000 bytes of sparseWA memory to keep 1_000_000 OldRef entries.
The hash table has between 1 and 2 Words of overhead per OldRef
(as the chain array is roughly doubled each time the hash table is full).
So, unless the OldRef entries are extremely localised, the overhead of the
hash table will be significantly smaller.
With the patch, the core arena total alloc is:
5299535/1201448632 totalloc-blocks/bytes
For the trunk it is:
6693111/3959050280 totalloc-blocks/bytes
(so, around 1.20GB versus 3.95GB).
This big difference is due to the fact that the sparseWA repeatedly
allocates and then frees Level0 or LevelN nodes when all the OldRef entries
in the region covered by the Level0/N have been recycled.
In terms of CPU
---------------
With the patch, on amd64, a firefox startup seems slightly faster (around 1%).
The peak memory mmap'd/used decreases by 200MB.
For a libreoffice test, the memory decreases by 230MB. CPU usage also decreases
slightly (around 1%).
In terms of correctness
-----------------------
The trunk could potentially show an access that is not the most recent
access to the memory involved in a race: the first OldRef entry matching
the raced-upon address was used, while a more recent access could exist
in a following OldRef entry. In other words, the trunk only guaranteed to
find the most recent access within an OldRef, but not across the several
OldRef entries that could cover the raced-upon address.
So, assuming it is important to show the most recent access, this patch
ensures we really show the most recent access, even in the presence of
overlapping accesses.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@15289
Having a one-element free lineF cache avoids many PA calls.
This seems to slightly improve (by a few %) a firefox startup.
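A minimal sketch of the idea, with hypothetical names (pool_alloc_lineF and
pool_free_lineF stand in for the PA calls): the last freed lineF is parked in
a one-element cache and handed back on the next allocation, so a free
immediately followed by an alloc avoids the pool allocator entirely.

  /* Sketch only: LineF and the pool helpers are stand-ins, not the real API. */
  typedef struct { unsigned long w[8]; } LineF;

  extern LineF* pool_alloc_lineF(void);        /* hypothetical PA wrappers    */
  extern void   pool_free_lineF(LineF* fl);

  static LineF* cached_lineF = 0;              /* the one-element free cache  */

  static LineF* lineF_alloc(void)
  {
     if (cached_lineF) {                       /* re-use the last freed lineF */
        LineF* fl = cached_lineF;
        cached_lineF = 0;
        return fl;
     }
     return pool_alloc_lineF();                /* cache empty: go to the PA   */
  }

  static void lineF_free(LineF* fl)
  {
     if (!cached_lineF) {                      /* park it for the next alloc  */
        cached_lineF = fl;
        return;
     }
     pool_free_lineF(fl);                      /* cache occupied: real free   */
  }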
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@15254
Currently, each SecMap has an array of linesF, referenced by the linesZ
of the SecMap that need a lineF, via an index stored in dict[1].
When the array is full, its size is doubled.
The linesF array of a SecMap is freed when the SecMap is GC-ed.
The above strategy has the following consequences:
A. on average, 25% of the linesF are unused.
B. if a SecMap 'temporarily' needs linesF, but afterwards
these linesF are converted back to the normal lineZ representation, the linesF
will not be reclaimed unless the SecMap is GC-ed (i.e. fully marked
noaccess).
The patch replaces the per-SecMap private linesF array
with a pool allocator of linesF shared between all SecMaps.
A lineZ that needs a lineF now points directly to its lineF (using a pointer
stored in dict[1]), instead of having in dict[1] the index into the SecMap's
linesF array.
When a lineZ needs a lineF, one is allocated from the pool allocator.
When a lineZ no longer needs its lineF, it is returned to the
pool allocator.
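A hedged sketch of the change in representation (types and names simplified;
the real lineZ keeps the pointer in its dict[1] slot, and the pool wrappers
below are hypothetical stand-ins for the shared pool allocator calls):

  /* Sketch only: simplified types, not the real helgrind structures. */
  typedef struct { unsigned long w[8]; } LineF;
  typedef struct { LineF* lineF; } LineZ;      /* stand-in for the dict[1] slot  */

  extern LineF* pool_alloc_lineF(void);        /* hypothetical wrappers around   */
  extern void   pool_free_lineF(LineF* fl);    /* the shared pool allocator      */

  static void attach_lineF(LineZ* zl)          /* lineZ now needs a full lineF   */
  {
     zl->lineF = pool_alloc_lineF();           /* direct pointer, no index into  */
  }                                            /* a per-SecMap array             */

  static void detach_lineF(LineZ* zl)          /* lineF not needed any more:     */
  {
     pool_free_lineF(zl->lineF);               /* give it back to the shared     */
     zl->lineF = 0;                            /* pool right away, instead of    */
  }                                            /* waiting for a SecMap GC        */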
On a firefox startup, the above strategy reduces the memory for linesF
by about 42MB. It seems that the more firefox is used (e.g. to visit
a few websites), the bigger the memory gain.
After opening the home pages of valgrind, wikipedia and google, the memory
gain is about 94MB:
trunk:
linesF: 392,181 allocd ( 203,934,120 bytes occupied) ( 173,279 used)
patch:
linesF: 212,966 allocd ( 109,038,592 bytes occupied) ( 170,252 used)
There are also fewer alloc/free operations in the core arena with the patch:
trunk:
core : 810,680,320/ 802,291,712 max/curr mmap'd, 17/19 unsplit/split sb unmmap'd, 759,441,224/ 703,191,896 max/curr, 40631760/16376828248 totalloc-blocks/bytes, 188015696 searches 8 rzB
patch:
core : 701,628,416/ 690,753,536 max/curr mmap'd, 12/29 unsplit/split sb unmmap'd, 643,041,944/ 577,793,712 max/curr, 32050040/14056017712 totalloc-blocks/bytes, 174097728 searches 8 rzB
In terms of performance, no CPU impact was detected on a Firefox startup.
Note we have no representative, reproducible (and preferably small)
perf test that uses linesF extensively. Firefox is a good heavy lineF
user, but it is far from reproducible and very far from small.
Theoretically, in terms of CPU performance, the patch might have some
small benefits here and there for read operations, as the lineF pointer
is retrieved directly from the lineZ, rather than via an indirection
through the linesF array.
For write operations, the patch might need a little bit more CPU,
as an assignment of the lineF inUse boolean to False (and then probably
back to True when the cacheline is written back) is replaced by a call
to the pool allocator VG_(freeEltPA) (and then probably a call to
VG_(allocEltPA) when the cacheline is written back).
These PA functions are small, so the cost should be ok.
We might however still maintain in clear_LineF_of_Z the last cleared lineF
and re-use it in alloc_LineF_for_Z. It is not clear how many calls to the PA
functions would be avoided by this '1 elt cache' (and by the needed
'if elt == NULL' check in both clear_LineF_of_Z and alloc_LineF_for_Z).
This possible optimisation will be looked at later.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@15253
reduced memory use doing SecMap GC, but was slowing down some workloads
(typically, workloads doing a lot of malloc/free).
A significant part of the slowdown came from the clearing of the filter,
which was not optimised for big ranges: the filter was working byte
by byte until 8-byte alignment, then 8 bytes at a time.
With the patch, the filter clear is done the following way (sketched below):
* all the bytes until 8-byte alignment are done together
* then 8 bytes at a time until filter-line alignment (32 bytes)
* then 32 bytes at a time.
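A minimal sketch of the staged clear, assuming hypothetical per-granularity
clear helpers (the real code operates on helgrind's Filter structure
directly, and the trailing-tail handling here is my own completion):

  /* Sketch only: clear_1/clear_8/clear_32 stand in for the real per-size steps. */
  #include <stdint.h>
  #include <stddef.h>

  static void filter_clear_range(uintptr_t a, size_t len,
                                 void (*clear_1)(uintptr_t),
                                 void (*clear_8)(uintptr_t),
                                 void (*clear_32)(uintptr_t))
  {
     uintptr_t end = a + len;
     while (a < end && (a & 7) != 0)       { clear_1(a);  a += 1;  } /* to 8-byte alignment   */
     while (a + 8 <= end && (a & 31) != 0) { clear_8(a);  a += 8;  } /* to 32-byte alignment  */
     while (a + 32 <= end)                 { clear_32(a); a += 32; } /* whole filter lines    */
     while (a + 8 <= end)                  { clear_8(a);  a += 8;  } /* trailing tail         */
     while (a < end)                       { clear_1(a);  a += 1;  }
  }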
Moreover, as the filter cache is small (1024 lines of 32 bytes),
clearing the filter for ranges bigger than 32KB was uselessly checking
the same entry several times. This is now avoided by using a range
check rather than a tag equality check.
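For ranges larger than what the filter can hold, a hedged sketch of that
range check (layout and names are hypothetical; the real filter stores more
than a single tag per line):

  /* Sketch only: hypothetical filter layout, one address tag per 32-byte line. */
  #include <stdint.h>
  #include <stddef.h>

  #define FI_NUM_LINES 1024u                 /* filter size quoted above       */
  #define FI_LINE_SZB  32u
  #define TAG_EMPTY    ((uintptr_t)1)        /* hypothetical 'no line' marker  */

  typedef struct { uintptr_t tags[FI_NUM_LINES]; } Filter;

  static void filter_clear_big_range(Filter* fi, uintptr_t a, size_t len)
  {
     uintptr_t lo = a & ~(uintptr_t)(FI_LINE_SZB - 1);
     uintptr_t hi = a + len;
     for (unsigned i = 0; i < FI_NUM_LINES; i++) {
        /* one pass over the 1024 cached lines, a range check per line,
           instead of one tag-equality probe per 32-byte chunk of the range */
        if (fi->tags[i] >= lo && fi->tags[i] < hi)
           fi->tags[i] = TAG_EMPTY;
     }
  }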
As the new filter clear is significantly more complex than the previous simple
algorithm, the old algorithm is kept and used to check the new algorithm
when CHECK_ZSM is defined as 1.
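A hedged sketch of one way such a cross-check can be wired up (names are
hypothetical; the real check in the code may be organised differently):

  /* Sketch only: slow_clear_range stands in for the old simple algorithm,
     fast_clear_range for the new staged one. */
  #include <assert.h>
  #include <string.h>
  #include <stdint.h>
  #include <stddef.h>

  #define CHECK_ZSM 1                                        /* 0 in normal builds */

  typedef struct { uintptr_t tags[1024]; } Filter;           /* as sketched above  */
  extern void slow_clear_range(Filter*, uintptr_t, size_t);  /* old algorithm      */
  extern void fast_clear_range(Filter*, uintptr_t, size_t);  /* new algorithm      */

  static void clear_range_checked(Filter* fi, uintptr_t a, size_t len)
  {
  #if CHECK_ZSM
     Filter ref = *fi;                          /* old algorithm on a private copy */
     slow_clear_range(&ref, a, len);
  #endif
     fast_clear_range(fi, a, len);
  #if CHECK_ZSM
     assert(memcmp(&ref, fi, sizeof *fi) == 0); /* both must give the same filter  */
  #endif
  }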
The patch also contains a few micro-optimisations and disables
// VG_(track_die_mem_stack) ( evh__die_mem );
as this had no effect and was somewhat costly.
With this patch, we have almost reached, for all perf tests, the same
performance as we had before revision 15207. Some tests are still
slightly slower than before the SecMap GC (at most a 2% difference).
Some tests are now significantly faster (e.g. sarp).
For almost all tests, we are now faster than valgrind 3.10.1.
Details below.
Regtested on x86/amd64/ppc64 (and regtested with all compile-time
checks set).
I have also regtested with libreoffice and firefox
(with firefox, also with CHECK_ZSM set to 1).
Details about performance:
hgtrace = this patch
trunk_untouched = trunk
base_secmap = trunk before secmap GC
valgrind 3.10.1 included for comparison
Measured on core i5 2.53GHz
-- Running tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1 hgtrace :0.14s he: 2.6s (18.4x, -----)
bigcode1 trunk_untouched:0.14s he: 2.6s (18.4x, -0.4%)
bigcode1 base_secmap:0.14s he: 2.6s (18.6x, -1.2%)
bigcode1 valgrind-3.10.1:0.14s he: 2.8s (19.8x, -7.8%)
-- bigcode2 --
bigcode2 hgtrace :0.14s he: 6.3s (44.7x, -----)
bigcode2 trunk_untouched:0.14s he: 6.2s (44.6x, 0.2%)
bigcode2 base_secmap:0.14s he: 6.3s (45.0x, -0.6%)
bigcode2 valgrind-3.10.1:0.14s he: 6.6s (47.1x, -5.4%)
-- bz2 --
bz2 hgtrace :0.64s he:11.3s (17.7x, -----)
bz2 trunk_untouched:0.64s he:11.7s (18.2x, -3.2%)
bz2 base_secmap:0.64s he:11.1s (17.3x, 1.9%)
bz2 valgrind-3.10.1:0.64s he:12.6s (19.7x,-11.3%)
-- fbench --
fbench hgtrace :0.29s he: 3.4s (11.8x, -----)
fbench trunk_untouched:0.29s he: 3.4s (11.7x, 0.6%)
fbench base_secmap:0.29s he: 3.6s (12.4x, -5.0%)
fbench valgrind-3.10.1:0.29s he: 3.5s (12.2x, -3.5%)
-- ffbench --
ffbench hgtrace :0.26s he: 9.8s (37.7x, -----)
ffbench trunk_untouched:0.26s he:10.0s (38.4x, -1.9%)
ffbench base_secmap:0.26s he: 9.8s (37.8x, -0.2%)
ffbench valgrind-3.10.1:0.26s he:10.0s (38.4x, -1.9%)
-- heap --
heap hgtrace :0.11s he: 9.2s (84.0x, -----)
heap trunk_untouched:0.11s he: 9.6s (87.1x, -3.7%)
heap base_secmap:0.11s he: 9.0s (81.9x, 2.5%)
heap valgrind-3.10.1:0.11s he: 9.1s (82.9x, 1.3%)
-- heap_pdb4 --
heap_pdb4 hgtrace :0.13s he:10.7s (82.3x, -----)
heap_pdb4 trunk_untouched:0.13s he:11.0s (84.8x, -3.0%)
heap_pdb4 base_secmap:0.13s he:10.5s (80.8x, 1.8%)
heap_pdb4 valgrind-3.10.1:0.13s he:10.6s (81.8x, 0.7%)
-- many-loss-records --
many-loss-records hgtrace :0.01s he: 1.5s (152.0x, -----)
many-loss-records trunk_untouched:0.01s he: 1.6s (157.0x, -3.3%)
many-loss-records base_secmap:0.01s he: 1.6s (158.0x, -3.9%)
many-loss-records valgrind-3.10.1:0.01s he: 1.7s (167.0x, -9.9%)
-- many-xpts --
many-xpts hgtrace :0.03s he: 2.8s (91.7x, -----)
many-xpts trunk_untouched:0.03s he: 2.8s (94.7x, -3.3%)
many-xpts base_secmap:0.03s he: 2.8s (94.0x, -2.5%)
many-xpts valgrind-3.10.1:0.03s he: 2.9s (97.7x, -6.5%)
-- memrw --
memrw hgtrace :0.06s he: 7.3s (121.2x, -----)
memrw trunk_untouched:0.06s he: 7.2s (120.3x, 0.7%)
memrw base_secmap:0.06s he: 7.1s (117.7x, 2.9%)
memrw valgrind-3.10.1:0.06s he: 8.1s (135.2x,-11.6%)
-- sarp --
sarp hgtrace :0.02s he: 7.6s (378.5x, -----)
sarp trunk_untouched:0.02s he: 8.4s (422.0x,-11.5%)
sarp base_secmap:0.02s he: 8.6s (431.0x,-13.9%)
sarp valgrind-3.10.1:0.02s he: 8.8s (442.0x,-16.8%)
-- tinycc --
tinycc hgtrace :0.20s he:12.4s (62.0x, -----)
tinycc trunk_untouched:0.20s he:12.6s (63.2x, -1.9%)
tinycc base_secmap:0.20s he:12.6s (63.0x, -1.6%)
tinycc valgrind-3.10.1:0.20s he:12.7s (63.5x, -2.3%)
-- Finished tests in perf ----------------------------------------------
== 12 programs, 48 timings =================
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@15236
* avoid indirection via function pointers to call SVal__rcinc and SVal__rcdec
* declare these functions inline
* transform 2 asserts on the hot path into conditionally compiled checks
on CHECK_ZSM (sketched below)
This slightly optimises some perf tests with helgrind.
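A minimal sketch of the pattern (SVal is a stand-in type here, the sanity
check is hypothetical, and the real function bodies are omitted):

  /* Sketch only: illustrates inline direct calls plus CHECK_ZSM-guarded asserts. */
  #include <assert.h>
  #include <stdint.h>

  #define CHECK_ZSM 1                       /* 0 on the normal hot path           */
  typedef uint64_t SVal;                    /* stand-in for helgrind's SVal       */
  static int SVal__looks_sane(SVal s) { return s != 0; }   /* hypothetical check  */

  /* previously reached through a function pointer; now a direct, inlinable call */
  static inline void SVal__rcinc(SVal s)
  {
  #if CHECK_ZSM
     assert(SVal__looks_sane(s));            /* kept only in checking builds      */
  #endif
     (void)s;                                /* real body omitted in this sketch  */
  }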
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@15212
by implementing a Garbage Collection for the SecMap.
The basic change is that freed memory is marked as noaccess
(while before, it kept its previous marking, on the basis that
non-buggy applications do not access freed memory in any case).
Keeping the previous marking avoids the CPU/memory work needed
to mark it noaccess.
However, marking freed memory noaccess and GC-ing the SecMaps reduces
the memory used on big apps.
For example, a firefox test needs 220MB less (out of about 2.06GB).
Similar reduction for a libreoffice batch (260MB less out of 1.09GB).
On such applications, the performance with the patch is similar to the trunk.
There is a performance decrease for applications that are doing
a lot of malloc/free repeatedly: e.g. on some perf tests, an increase
in CPU of up to 15% has been observed.
Several performance optimisations can be done afterwards to avoid losing
too much performance. The decrease in memory is in any case expected
to be a significant benefit in memory-constrained environments
(e.g. android phones).
So, after discussion with Julian, it was decided to commit as-is
and (re-)gain (part of) the performance in follow-up commits.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@15207
on 32-bit platforms. No memory reduction on 64-bit platforms,
due to alignment.
The patch also shows the VTS stats when showing the helgrind stats.
The perf/memrw.c perf test also gets some new features,
allowing e.g. control of the size of the read or written blocks.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@15174
* number of client malloc-ed blocks
* how many OldRef entries helgrind has, and the distribution
of these OldRef entries according to the number of accesses (accs) they have
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@15128
Otherwise, long-running applications still see the max number of RCECs
slowly growing, which increases the memory usage and
makes the (fixed-size) contextTab hash table slower to search.
Without this margin, the max could increase because the GC code
is not called at exactly the moment we reach the previous max,
but rather when a thread has run a bunch of basic blocks.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@15126
conflict cache size.
The current policy is:
A 'more or less' LRU policy is implemented by giving
each OldRef a generation number in which it was last touched.
A new generation is created every 50,000 new accesses.
GC is done when the number of OldRef entries reaches --conflict-cache-size.
The GC consists of removing enough generations to free
half of the entries.
After the GC of OldRef, the RCECs (Ref Counted Exe Contexts)
that are no longer referenced are GC-ed.
The new policy is:
An exact LRU policy is implemented using a doubly linked list
of OldRef entries.
When reaching --conflict-cache-size, the LRU entry is re-used
(a sketch of this recycling follows below).
The unreferenced RCECs are GC-ed when less than 75% of the RCECs
are referenced, and the number of RCECs is 'big' (at least half the
size of the contextTab, and at least the max number of RCECs reached
previously).
(Note: we tried to directly recover an unreferenced RCEC when recycling
the LRU OldRef, but that gives a lot of re-creation of RCECs.)
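A minimal sketch of the exact-LRU recycling (node and list names are
hypothetical, and the real OldRef holds more fields): each entry sits on a
doubly linked list, a touched entry moves to the MRU end, and once
--conflict-cache-size entries exist the entry at the LRU end is unchained
and re-used instead of allocating a new one.

  /* Sketch only: hypothetical node and list types. */
  #include <stdint.h>
  #include <stddef.h>

  typedef struct Node_ {
     struct Node_* prev;      /* towards the LRU end */
     struct Node_* next;      /* towards the MRU end */
     uintptr_t     ga;        /* accessed address    */
  } Node;

  typedef struct {
     Node* lru;               /* least recently used */
     Node* mru;               /* most recently used  */
     unsigned long n, max;    /* current count, --conflict-cache-size */
  } LruList;

  static void unchain(LruList* l, Node* n)
  {
     if (n->prev) n->prev->next = n->next; else l->lru = n->next;
     if (n->next) n->next->prev = n->prev; else l->mru = n->prev;
  }

  static void chain_mru(LruList* l, Node* n)
  {
     n->prev = l->mru; n->next = NULL;
     if (l->mru) l->mru->next = n; else l->lru = n;
     l->mru = n;
  }

  /* touching an existing entry: move it to the MRU end */
  static void touch(LruList* l, Node* n) { unchain(l, n); chain_mru(l, n); }

  /* need an entry for a new address: recycle the LRU one once the cache is full */
  static Node* get_entry(LruList* l, Node* (*alloc_node)(void))
  {
     Node* n;
     if (l->n >= l->max) { n = l->lru; unchain(l, n); }   /* re-use the LRU entry */
     else                { n = alloc_node(); l->n++; }
     chain_mru(l, n);
     return n;
  }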
The new policy has the following advantages/disadvantages:
1. It is faster (at least for big applications)
On a firefox startup/exit, we gain about 1m30s out of 11m.
A similar 5-10% speedup was encountered on other big applications
and on the new perf/memrw test.
The speed increase depends on the amount of memory
touched by the application. For applications with a
working set fitting in conflict-cache-size, the new policy
might be marginally slower than the previous policy on platforms
having a small cache: the current policy only sets a generation
number when an address is re-accessed, while the new policy
has to unchain and rechain the OldRef in the LRU
doubly linked list.
2. It uses less memory (at least for big applications)
Firefox startup/exit "core" arena max use decreases from
1175MB mmap-ed/1060MB alloc-ed
to
994MB mmap-ed/913MB alloc-ed.
The decrease in memory is the result of having far fewer RCECs:
the current policy lets the number of RCECs grow until the conflict
cache is GC-ed, while the new policy limits the number of RCECs to
133% of the RCECs actually referenced. So, we end up with a much
smaller max number of RCECs with the new policy: max 191,000 RCECs
versus 1,317,000, while the total number of RCEC discard operations
is almost the same: 33M versus 32M.
Also, the current policy allocates a big temporary array
to do the GC of OldRef.
With the new policy, the size of an OldRef increases because
we need 2 pointers for the LRU doubly linked list, and
we need the accessed address.
In total, the OldRef increase is limited to one Word,
as we no longer need the generation, and the 'magic'
used for sanity checking was removed (the check becomes
somewhat less needed, because an OldRef is never freed
anymore; also, we now do a new cross-check between
the ga in the OldRef and the sparseWA key).
For applications using little memory and having
a small number of different stack traces accessing memory,
the new policy causes an increase in memory (one Word
per OldRef).
3. Functionally, the new policy gives better past information:
once the steady state is reached (i.e. the conflict cache
is full), the new policy always has --conflict-cache-size
entries of past information.
The current policy has an amount of past information varying
between --conflict-cache-size/2 and --conflict-cache-size
(so on average, 75% of conflict-cache-size).
4. The new code is a little bit smaller/simpler:
the generation-based GC is replaced by a simpler LRU policy.
So, in summary, this patch should allow big applications
to use less CPU/memory, while having very little
or no impact on the memory/CPU of small applications.
Note that the OldRef data structure LRU policy
is not really explicitly tested by a regtest.
It is not easy at first sight to make such a test portable
between platforms/OS/compilers/....
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@15119
done from libhb_maybe_GC, i.e. check the condition in
libhb_maybe_GC, and call the (non-inlined) GC only if
a GC is needed.
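A hedged sketch of the pattern (everything except the libhb_maybe_GC name is
hypothetical): a cheap inlined test guards the call, so the expensive,
non-inlined GC routine stays off the fast path.

  /* Sketch only: counters and the GC routine are stand-ins. */
  static unsigned long stats__evt_count, next_gc_threshold;

  static void do_RCEC_and_OldRef_GC(void)    /* hypothetical; deliberately out of line */
  {
     /* ... discard unreferenced RCECs, recycle OldRef entries ... */
  }

  static inline void libhb_maybe_GC(void)
  {
     if (stats__evt_count >= next_gc_threshold)   /* cheap test on the hot path */
        do_RCEC_and_OldRef_GC();
  }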
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@15082
* Do VTS pruning only if new threads were declared
very dead since the last pruning round.
* When doing pruning, use the new list of very dead threads
to do the pruning: this decreases the cost of the dichotomic search
in VTS__substract.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@15044
Eliminates a fixed-size buffer in helgrind. Instead of building up a
string in a buffer and then writing the string to stdout, we can just as
well write to stdout directly.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@14672
First, as the allocator function does not fail, there is no need
to assert its return value.
Second, remove commented out (since r8765) function VG_(isEmptyFM).
Third, remove VG_(getNodeSizeFM) from the API. The details of the
implementation do not need to be exposed.
Fourth, for consistency require that the copy functions for keys and
values in VG_(dopyFM) (which are essentially like allocators) return
non-NULL values for non-NULL arguments if they return.
Fifth, document NULL-ness of return values for VG_(newFM), VG_(dopyFM),
and VG_(newBag). Remove pointless asserts at call sites.
Sixth, change avl_dopy to assert that the node the function is
supposed to copy is not NULL. It is called that way anyhow. With
that change, the function never returns NULL, which allows us to
simplify the call sites. Checking the return value is no longer needed.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@14535
So, we can now allocate this memory only when the approx history level
is requested.
I double-checked using printf that clo processing was done before
this procedure is called.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@13915
Also fix all usages of the wordFM data structure. Once upon a time
wordFM used Words but now it uses UWords.
Likewise for WordBag.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@13070
* new files include/pub_tool_groupalloc.h and coregrind/m_groupalloc.c
implementing a group allocator (based on helgrind group alloc).
* include/Makefile.am coregrind/Makefile.am : added pub_tool_groupalloc.h
and m_groupalloc.c
* helgrind/libhb_core.c : use pub_tool_groupalloc.h/m_groupalloc.c
instead of the local implementation.
* include/pub_tool_oset.h coregrind/m_oset.c : new function
allowing creation of an OSet that will use a pool allocator;
new function allowing an OSet to be cloned (so as to share the pool allocator)
* memcheck/tests/unit_oset.c drd/tests/unit_bitmap.c : modified
so that they compile with the new m_oset.c
* memcheck/mc_main.c : use group alloc for MC_Chunk
memcheck/mc_include.h : declare the MC_Chunk group alloc
* memcheck/mc_main.c : use group alloc for the nodes of the secVBitTable OSet
* include/pub_tool_hashtable.h coregrind/m_hashtable.c : pass the free-node
function to VG_(HT_destruct)
(needed because the hashtable user can allocate a node with its own allocator,
so the hash table destroy must be able to free the nodes with the user's
own free function).
* coregrind/m_gdbserver/m_gdbserver.c : pass free function to VG_(HT_destruct)
* memcheck/mc_replace_strmem.c memcheck/mc_machine.c
memcheck/mc_malloc_wrappers.c memcheck/mc_leakcheck.c
memcheck/mc_errors.c memcheck/mc_translate.c : new includes needed
due to the group alloc.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@12341
* performance and scalability improvements
* show locks held by both threads in a race
* show all 4 locks involved in a lock order violation
* better delimited error messages
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@11824
paint the relevant address range as NoAccess rather than ignoring the
event. This is important for avoiding VTS leaks in libhb_core.
More details in comments in the code.
Also rename the _noaccess_ painters that do nothing to make it clearer
that they do nothing :-)
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@11654