Sync VEX/LICENSE.GPL with the top-level COPYING file. We used 3 different
FSF postal addresses for requesting a copy of the GPL. Replace all the
variants with the URL <http://www.gnu.org/licenses/>.
The following files might still have slightly different (L)GPL
copyright notices because they were derived from other programs:
- files under coregrind/m_demangle which come from libiberty:
cplus-dem.c, d-demangle.c, demangle.h, rust-demangle.c,
safe-ctype.c and safe-ctype.h
- coregrind/m_demangle/dyn-string.[hc] derived from GCC.
- coregrind/m_demangle/ansidecl.h derived from glibc.
- VEX files for FMA derived from glibc:
host_generic_maddf.h and host_generic_maddf.c
- files under coregrind/m_debuginfo derived from LZO:
lzoconf.h, lzodefs.h, minilzo-inl.c and minilzo.h
- files under coregrind/m_gdbserver derived from GDB:
gdb/signals.h, inferiors.c, regcache.c, regcache.h,
regdef.h, remote-utils.c, server.c, server.h, signals.c,
target.c, target.h and utils.c
Plus the following test files:
- none/tests/ppc32/testVMX.c derived from testVMX.
- ppc tests derived from QEMU: jm-insns.c, ppc64_helpers.h
and test_isa_3_0.c
- tests derived from bzip2 (with embedded GPL text in code):
hackedbz2.c, origin5-bz2.c, varinfo6.c
- tests derived from glibc: str_tester.c, pth_atfork1.c
- test derived from GCC libgomp: tc17_sembar.c
- performance tests derived from bzip2 or tinycc (with embedded GPL
text in code): bz2.c, test_input_for_tinycc.c and tinycc.c
[This commit contains an implementation for all targets except amd64-solaris
and x86-solaris, which will be completed shortly.]
In the baseline simulator, jumps to guest code addresses that are not known at
JIT time have to be looked up in a guest->host mapping table. That means:
indirect branches, indirect calls and, most commonly, returns. Since there are
huge numbers of these (often 10+ million/second) the mapping mechanism needs
to be extremely cheap.
Currently, this is implemented using a direct-mapped cache, VG_(tt_fast), with
2^15 (guest_addr, host_addr) pairs. This is queried in handwritten assembly
in VG_(disp_cp_xindir) in dispatch-<arch>-<os>.S. If there is a miss in the
cache then we fall back out to C land, and do a slow lookup using
VG_(search_transtab).
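A minimal C sketch of the current scheme, using assumed names and types
(the real cache is VG_(tt_fast), queried from hand-written assembly, and
the slow path is VG_(search_transtab)):

  #include <stdint.h>

  typedef uintptr_t Addr;

  #define FAST_BITS 15
  #define FAST_SIZE (1u << FAST_BITS)            /* 2^15 entries */

  typedef struct { Addr guest; Addr host; } FastEntry;
  static FastEntry fast_cache[FAST_SIZE];        /* direct-mapped */

  /* Assumed stand-in for the slow lookup, VG_(search_transtab). */
  extern Addr slow_search_transtab(Addr guest);

  /* One probe into the cache; on a miss, do the slow lookup and
     refill the probed slot. */
  static Addr lookup_host_code(Addr guest)
  {
     uint32_t idx = (uint32_t)guest & (FAST_SIZE - 1u); /* bottom 15 bits */
     if (fast_cache[idx].guest == guest)
        return fast_cache[idx].host;                    /* fast hit */
     Addr host = slow_search_transtab(guest);           /* slow miss path */
     fast_cache[idx].guest = guest;
     fast_cache[idx].host  = host;
     return host;
  }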
Given that the size of the translation table(s) in recent years has expanded
significantly in order to keep pace with increasing application sizes, two bad
things have happened: (1) the cost of a miss in the fast cache has risen
significantly, and (2) the miss rate on the fast cache has also increased
significantly. This means that large (~ one-million-basic-blocks-JITted)
applications that run for a long time end up spending a lot of time in
VG_(search_transtab).
The proposed fix is to increase the associativity of the fast cache, from 1
(direct-mapped) to 4. Simulations of various cache configurations using
indirect-branch traces from a large application show that this is the best of
the configurations tried. In an extreme case with 5.7 billion indirect
branches:
* The increase of associativity from 1 way to 4 way, whilst keeping the
overall cache size the same (32k guest/host pairs), reduces the miss rate by
around a factor of 3, from 4.02% to 1.30%.
* Using a slightly better hash function than merely slicing off the
bottom 15 bits of the address reduces the miss rate further, from 1.30% to
0.53%.
Overall the VG_(tt_fast) miss rate is almost unchanged on small workloads, but
reduced by a factor of up to almost 8 on large workloads.
By implementing each (4-entry) cache set using a move-to-front scheme in the
case of hits in ways 1, 2 or 3, the vast majority of hits can be made to
happen in way 0. Hence the cost of having this extra associativity is almost
zero in the case of a hit. The improved hash function costs an extra 2 ALU
operations (a shift and an xor), but overall this seems to range from
performance-neutral to a win.
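A C sketch of one such 4-way set with move-to-front promotion and a
shift-and-xor hash; the names, set count and exact hash below are
assumptions made for illustration (the real lookup stays in the dispatcher
assembly):

  #include <stdint.h>

  typedef uintptr_t Addr;

  #define SETS_BITS 13
  #define N_SETS    (1u << SETS_BITS)   /* 8192 sets x 4 ways = 32k pairs */

  typedef struct { Addr guest[4]; Addr host[4]; } FastSet;
  static FastSet fast_cache[N_SETS];

  /* Fold higher address bits in with one shift and one xor, rather than
     just taking the bottom bits. */
  static inline uint32_t set_index(Addr guest)
  {
     return (uint32_t)((guest ^ (guest >> SETS_BITS)) & (N_SETS - 1u));
  }

  extern Addr slow_search_transtab(Addr guest);   /* assumed slow path */

  static Addr lookup_host_code(Addr guest)
  {
     FastSet* s = &fast_cache[set_index(guest)];

     if (s->guest[0] == guest)           /* common case: hit in way 0 */
        return s->host[0];

     for (int w = 1; w < 4; w++) {
        if (s->guest[w] == guest) {
           /* Move-to-front: promote the hit to way 0 so the next lookup
              of this address is as cheap as a direct-mapped hit. */
           Addr h = s->host[w];
           for (int k = w; k > 0; k--) {
              s->guest[k] = s->guest[k-1];
              s->host[k]  = s->host[k-1];
           }
           s->guest[0] = guest;
           s->host[0]  = h;
           return h;
        }
     }

     /* Miss in all 4 ways: slow lookup, then install at way 0 and push
        the least-recently-hit entry out of way 3. */
     Addr h = slow_search_transtab(guest);
     for (int k = 3; k > 0; k--) {
        s->guest[k] = s->guest[k-1];
        s->host[k]  = s->host[k-1];
     }
     s->guest[0] = guest;
     s->host[0]  = h;
     return h;
  }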
Explanation by Matthias Schwarzott:
The linker will request an executable stack as soon as at least one
linked-in object file wants an executable stack.
And the absence of the directive
.section .note.GNU-stack,"",@progbits
is enough to tell the linker that an executable stack is needed.
So even an empty asm file must contain at least this statement so as not
to force an executable stack on the whole executable.
* Define a helper macro MARK_STACK_NO_EXEC that disables the
executable stack (a sketch follows below).
* Instantiate this macro unconditionally at the end of each asm file.
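A minimal sketch of what the macro could look like, assuming a Linux-only
guard and the @progbits spelling quoted above (the exact definition and
the header it lives in may differ):

  /* Emitting an (empty) .note.GNU-stack section tells the GNU linker
     that this object file does not need an executable stack. */
  #if defined(VGO_linux)
  #  define MARK_STACK_NO_EXEC  .section .note.GNU-stack,"",@progbits
  #else
  #  define MARK_STACK_NO_EXEC  /* nothing needed on other targets */
  #endif

Each .S file then ends with a bare MARK_STACK_NO_EXEC line, which the
preprocessor expands before the assembler sees it.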
Patch by Matthias Schwarzott <zzam@gentoo.org>.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@15692
the fact that all {VG,VEX}_TRC_VALUES have their lowest bit set. All
other targets can benefit from this trick too.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@11781
make some attempt to schedule for the Cortex-A8. Improves overall IPC
for the none tool running perf/bz2.c "-O" from 0.879 to 0.925.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@11780
"insn_address >> 1". The former is appropriate for ARM code, where
all insns are 4-sized and 4-aligned, but not for Thumb code, where the
minimum size and alignment is 2. The old scheme happened to work for
Thumb (indeed, any hash function would), but caused huge amounts of
conflict misses in the fast cache for some programs.
The change has been observed to reduce conflict misses by up to a factor
of 100 and, in some cases, to improve performance significantly for Thumb
code. Performance of ARM code is unchanged, or possibly a bit worse.
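As a rough illustration only (the old hash is not quoted in full here, and
the names and table size below are assumptions): shifting the instruction
address by one bit before masking lets 2-byte-aligned Thumb insns fall into
distinct fast-cache slots instead of colliding.

  #include <stdint.h>

  #define FAST_BITS 15
  #define FAST_MASK ((1u << FAST_BITS) - 1u)

  /* Thumb insns can start at any 2-byte boundary, so shift by only 1
     (the minimum alignment); a coarser shift suited to 4-aligned ARM
     code makes adjacent Thumb insns alias into the same slot. */
  static inline uint32_t fast_cache_index(uint32_t insn_address)
  {
     return (insn_address >> 1) & FAST_MASK;
  }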
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@11716
the changes to do with reading and using ELF and DWARF3 info.
This breaks all targets except amd64-linux and x86-linux.
git-svn-id: svn://svn.valgrind.org/valgrind/trunk@10982