Implementation for x86-solaris and amd64-solaris. This completes the
implementations for all targets. Note these two are untested because I don't
have any way to test them.
[This commit contains an implementation for all targets except amd64-solaris
and x86-solaris, which will be completed shortly.]
In the baseline simulator, jumps to guest code addresses that are not known at
JIT time have to be looked up in a guest->host mapping table. That means:
indirect branches, indirect calls and most commonly, returns. Since there are
huge numbers of these (often 10+ million/second) the mapping mechanism needs
to be extremely cheap.
Currently, this is implemented using a direct-mapped cache, VG_(tt_fast), with
2^15 (guest_addr, host_addr) pairs. This is queried in handwritten assembly
in VG_(disp_cp_xindir) in dispatch-<arch>-<os>.S. If there is a miss in the
cache then we fall back out to C land, and do a slow lookup using
VG_(search_transtab).
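For orientation, the fast path conceptually behaves like the sketch below (my
own naming and layout, not the actual VG_(tt_fast) data structure or the
handwritten assembly):
```
/* Illustrative sketch only.  The real lookup lives in handwritten assembly
   in dispatch-<arch>-<os>.S and falls back to VG_(search_transtab). */
#define FAST_CACHE_BITS 15
#define FAST_CACHE_SIZE (1 << FAST_CACHE_BITS)       /* 2^15 pairs */

typedef struct { unsigned long guest; unsigned long host; } FastCacheEntry;
static FastCacheEntry fast_cache[FAST_CACHE_SIZE];

/* Stand-in for the slow VG_(search_transtab) path. */
extern unsigned long slow_lookup(unsigned long guest_addr);

static unsigned long lookup_host_addr(unsigned long guest_addr)
{
   /* Direct mapped: the entry index is just the low 15 bits of the address. */
   FastCacheEntry* e = &fast_cache[guest_addr & (FAST_CACHE_SIZE - 1)];
   if (e->guest == guest_addr)
      return e->host;                /* hit: jump straight to the host code */
   return slow_lookup(guest_addr);   /* miss: slow lookup, then refill      */
}
```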
Given that the size of the translation table(s) in recent years has expanded
significantly in order to keep pace with increasing application sizes, two bad
things have happened: (1) the cost of a miss in the fast cache has risen
significantly, and (2) the miss rate on the fast cache has also increased
significantly. This means that large (~ one-million-basic-blocks-JITted)
applications that run for a long time end up spending a lot of time in
VG_(search_transtab).
The proposed fix is to increase the associativity of the fast cache from 1
(direct mapped) to 4.  Simulations of various cache configurations, driven by
indirect-branch traces from a large application, show that a 4-way arrangement
is the best of those tried.  In an extreme case with 5.7 billion indirect
branches:
* The increase of associativity from 1 way to 4 way, whilst keeping the
overall cache size the same (32k guest/host pairs), reduces the miss rate by
around a factor of 3, from 4.02% to 1.30%.
* The use of a slightly better hash function than merely slicing off the
bottom 15 bits of the address reduces the miss rate further, from 1.30% to
0.53%.
Overall the VG_(tt_fast) miss rate is almost unchanged on small workloads, but
reduced by a factor of up to almost 8 on large workloads.
By implementing each (4-entry) cache set using a move-to-front scheme in the
case of hits in ways 1, 2 or 3, the vast majority of hits can be made to
happen in way 0. Hence the cost of having this extra associativity is almost
zero in the case of a hit.  The improved hash function costs an extra two ALU
operations (a shift and an xor), but overall this seems to range from
performance-neutral to a small win.
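A hedged sketch of the new arrangement (again my own naming, and a plausible
shift-and-xor hash rather than necessarily the exact one used), showing the
4-way sets, the move-to-front on hits in ways 1-3, and the extra two ALU
operations in the index computation:
```
/* Illustrative sketch only; not the actual VG_(tt_fast) layout or hash. */
#define N_SETS (1 << 13)             /* 8192 sets x 4 ways = 32k pairs */
#define N_WAYS 4

typedef struct { unsigned long guest; unsigned long host; } Entry;
static Entry sets[N_SETS][N_WAYS];

extern unsigned long slow_lookup(unsigned long guest_addr);

static inline unsigned hash_addr(unsigned long a)
{
   /* One extra shift and xor compared to just slicing off the low bits. */
   return (unsigned)((a ^ (a >> 13)) & (N_SETS - 1));
}

static unsigned long lookup_host_addr(unsigned long guest_addr)
{
   Entry* set = sets[hash_addr(guest_addr)];
   for (int way = 0; way < N_WAYS; way++) {
      if (set[way].guest == guest_addr) {
         unsigned long host = set[way].host;
         if (way != 0) {
            /* Move-to-front: keep the hot entry in way 0, so that almost
               all hits are satisfied by the very first comparison. */
            Entry hit = set[way];
            for (int k = way; k > 0; k--) set[k] = set[k-1];
            set[0] = hit;
         }
         return host;
      }
   }
   return slow_lookup(guest_addr);   /* miss: VG_(search_transtab) path */
}
```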
The wrong bit number was used when checking for the vector facility. This
can result in a fatal emulation error: "Encountered an instruction that
requires the vector facility. That facility is not available on this
host."
In most cases the wrong facility bit happened to be set as well, so nothing
bad happened.  But when running Valgrind within a QEMU/KVM guest, that bit
was not (always?) set and the emulation error occurred.
This fix simply corrects the vector facility bit number, changing it from
128 to 129.
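For context, STFLE facility bits are numbered starting from the most
significant bit of the facility list, so a check of bit 128 tests a different
flag than bit 129.  A minimal sketch of such a check (my own helper, not the
Valgrind code):
```
#include <stdint.h>

/* Illustrative only: facility bit 0 is the MSB of the first doubleword
   returned by STFLE. */
static int facility_bit_set(const uint64_t* facility_list, unsigned bitno)
{
   return (facility_list[bitno / 64] >> (63 - (bitno % 64))) & 1;
}

/* The vector facility is bit 129, not 128. */
#define S390_FACILITY_VX 129
```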
PTRACE_GET_THREAD_AREA is not handled by the amd64 Linux syswrap, which leads
to false positive errors in 64-bit programs ptrace-ing 32-bit processes.
For example, the following error was wrongly reported when running GDB:
==25377== Conditional jump or move depends on uninitialised value(s)
==25377== at 0x8A1D7EC: td_thr_get_info (td_thr_get_info.c:35)
==25377== by 0x526819: thread_from_lwp(thread_info*, ptid_t) (linux-thread-db.c:417)
==25377== by 0x5281D4: thread_db_notice_clone(ptid_t, ptid_t) (linux-thread-db.c:442)
==25377== by 0x51773B: linux_handle_extended_wait(lwp_info*, int) (linux-nat.c:2027)
....
==25377== Uninitialised value was created by a stack allocation
==25377== at 0x69A360: x86_linux_get_thread_area(int, void*, unsigned int*) (x86-linux-nat.c:278)
Fix this by implementing PTRACE_GET|SET_THREAD_AREA on amd64.
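For illustration, the failing scenario is roughly the following kind of call
made by the 64-bit tracer (as in GDB's x86_linux_get_thread_area); before the
fix, Memcheck never learned that the kernel had filled in the output buffer.
The helper name here is mine:
```
#include <errno.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <asm/ldt.h>                  /* struct user_desc */

/* Illustrative only: a 64-bit tracer fetching a TLS descriptor from a
   32-bit tracee.  The kernel writes *desc; without a syswrap handler for
   PTRACE_GET_THREAD_AREA, Memcheck did not mark it as defined, so later
   reads of it were flagged as uninitialised. */
static int get_tracee_thread_area(pid_t pid, int idx, struct user_desc* desc)
{
   errno = 0;
   if (ptrace(PTRACE_GET_THREAD_AREA, pid, (void*)(long)idx, desc) != 0)
      return -errno;
   return 0;
}
```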
addex uses OV as carry-in and carry-out, whereas for all other instructions
OV is the signed-overflow flag, and instructions like adde use CA as the
carry.
Replace set_XER_OV_OV32 with set_XER_OV_OV32_ADDEX, which calls
calculate_XER_CA_64 and calculate_XER_CA_32 with OV as input, and sets OV
and OV32.
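To make the intended semantics concrete, here is a hedged sketch (my own
helper, not VEX code) of the carry computation that addex effectively
performs, using OV as both carry-in and carry-out, analogous to what
calculate_XER_CA_64 computes for adde-style additions:
```
#include <stdint.h>

/* Illustrative only: 64-bit addex with OV as carry-in/carry-out. */
static uint64_t addex64(uint64_t ra, uint64_t rb, unsigned* ov)
{
   uint64_t sum = ra + rb + *ov;
   /* Unsigned carry out of bit 63. */
   *ov = (sum < ra) || (*ov && sum == ra);
   return sum;
}
```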
Enable test_addex in none/tests/ppc64/test_isa_3_0.c and update
the expected output. test_addex would fail to match the expected
output before this patch.
A few .exp files (not tested on amd64) have to be changed to
have the messages in the new order:
Use --track-origins=yes to see where uninitialised values come from
For lists of detected and suppressed errors, rerun with: -s
This option makes it possible to list the detected errors and the used
suppressions without increasing the verbosity.
Increasing the verbosity also activates a lot of messages that are often
not very useful to the user.  So this option lets the list of errors and
the used suppressions be shown independently of the verbosity.
Note that if a high verbosity is selected, the behaviour is unchanged.  In
other words, when specifying -v, the list of detected errors and the used
suppressions are still shown, even if --show-error-list=yes and -s are not
used.
Each tool that produces errors had identical code to produce this message.
Factor the production of the message out into m_main.c.
This prepares for a specific option to show the list of detected errors
and the count of suppressed errors.
This has a (small) visible effect on the output of memcheck:
Instead of producing
For counts of detected and suppressed errors, rerun with: -v
Use --track-origins=yes to see where uninitialised values come from
memcheck now produces:
Use --track-origins=yes to see where uninitialised values come from
For counts of detected and suppressed errors, rerun with: -v
i.e. the track-origins and error-counts messages are swapped.
For most SIMD operations that happen on 64-bit values (as would arise from MMX
instructions, for example, such as Add16x4, CmpEQ32x2, etc), generate code
that performs the operation using SSE/SSE2 instructions on values in the low
halves of XMM registers. This is much more efficient than the previous scheme
of calling out to helper functions written in C. There are still a few SIMD64
operations done via helpers, though.
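The strategy can be illustrated with intrinsics (a sketch of the idea only,
not the code the back end emits): a 64-bit Add16x4 becomes a single 128-bit
paddw operating on the low halves of XMM registers.
```
#include <stdint.h>
#include <emmintrin.h>                /* SSE2 */

/* Illustrative only: an Iop_Add16x4-style operation on 64-bit values,
   done in the low half of XMM registers instead of calling a C helper. */
static uint64_t add16x4(uint64_t a, uint64_t b)
{
   __m128i va = _mm_cvtsi64_si128((long long)a);   /* movq a -> xmm */
   __m128i vb = _mm_cvtsi64_si128((long long)b);
   __m128i r  = _mm_add_epi16(va, vb);             /* paddw */
   return (uint64_t)_mm_cvtsi128_si64(r);          /* movq xmm -> result */
}
```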
.. by adding support for MOVQ xmm/ireg and using that to implement 64HLtoV128,
4x64toV256 and their inverses. This reduces the number of instructions,
removes the use of memory as an intermediary, and avoids store-forwarding
stalls.
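In intrinsics terms (again only an illustration of the pattern, not the back
end's exact output), 64HLtoV128 can now be formed entirely in registers via
MOVQ-style moves plus an unpack, with no trip through memory:
```
#include <stdint.h>
#include <emmintrin.h>

/* Illustrative only: combine two 64-bit halves into a 128-bit vector
   without a store/load through memory (no store-forwarding stall). */
static __m128i v128_from_halves(uint64_t hi, uint64_t lo)
{
   __m128i vlo = _mm_cvtsi64_si128((long long)lo);  /* movq ireg -> xmm */
   __m128i vhi = _mm_cvtsi64_si128((long long)hi);
   return _mm_unpacklo_epi64(vlo, vhi);   /* lo in lane 0, hi in lane 1 */
}
```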
pshufb mm/xmm/ymm rearranges byte lanes in vector registers. It's fairly
widely used, but we generated terrible code for it. With this patch, we just
generate, at the back end, pshufb plus a bit of masking, which is a great
improvement.
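For reference, the byte-lane rearrangement that pshufb performs looks like
this with intrinsics (an illustration of the operation itself, not the back
end's exact emitted sequence):
```
#include <tmmintrin.h>                /* SSSE3: pshufb */

/* Illustrative only: reverse the 16 bytes of a vector with one pshufb.
   Each control byte selects a source lane; a control byte with its top
   bit set would zero that lane instead. */
static __m128i reverse_bytes(__m128i v)
{
   const __m128i ctrl = _mm_set_epi8(0, 1, 2,  3,  4,  5,  6,  7,
                                     8, 9, 10, 11, 12, 13, 14, 15);
   return _mm_shuffle_epi8(v, ctrl);               /* pshufb */
}
```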
* changes set_AV_CR6 so that it does scalar comparisons against zero,
rather than sometimes against an all-ones word. This is something
that Memcheck can instrument exactly.
* in Memcheck, requests expensive instrumentation of Iop_Cmp{EQ,NE}64
by default on ppc64le.
https://bugs.kde.org/show_bug.cgi?id=386945#c62
This makes it possible for memcheck to know which part of the 128-bit
vector is defined, even if the load is partly beyond an addressable block.
Partially resolves bug 386945.
On powerpc partial unaligned loads of vectors from partially invalid
addresses are OK and could be generated by our translation of lxvd2x.
Adjust partial_load memcheck tests to allow partial loads of 16 byte
vectors on powerpc64.
Part of resolving bug #386945.
On powerpc partial unaligned loads of words from partially invalid
addresses are OK and could be generated by our translation of ldbrx.
Adjust partial_load memcheck tests to allow partial loads of words
on powerpc64.
Part of resolving bug #386945.
This makes it possible for memcheck to analyse the new gcc strcmp
inlined code correctly even if the ldbrx load is partly beyond an
addressable block.
Partially resolves bug 386945.
This happens when processing openssl aes_v8_set_encrypt_key
(aesv8-armx.S:133). The noteTmpUsesIn() function is new since
PR387664 "Memcheck: make expensive-definedness-checks be the default".
It didn't handle Iex_VECRET, which is used in the dirty helpers for the
arm64 crypto instructions.
The sys_ptrace post didn't mark the thread as being in traceme mode.
This occasionally made the memcheck/tests/linux/getregset.vgtest
testcase fail. With this patch it passes reliably.
Wait for children to finish before terminating the main process.
This fixes occasional failures of the following tests:
drd/tests/fork-parallel (stderr)
drd/tests/fork-serial (stderr)
In final_tidyup we set up the guest to call the freeres_wrapper, which
will (possibly) call __gnu_cxx::__freeres() and/or __libc_freeres().
In a few cases (ppc64be, ppc64le and mips32) this involves setting
up one or more helper registers. Since we set up these guest registers
ourselves, we should mark them as fully defined. Otherwise we might
see spurious warnings about undefined value usage if the guest register
happened to not be fully defined before.
This fixes PR402006.
Because it's very useful. As part of this, the "percentage of events
annotated" numbers at the bottom of the output are changed to "events
annotated" so that --show-percs doesn't compute a percentage of a
percentage.
Example output lines:
```
4,967,137,442 (100.0%) PROGRAM TOTALS
4,543 (25.23%) 17,566 ( 0.43%) 47,993 ( 0.92%) /build/glibc-OTsEL5/glibc-2.27/elf/dl-lookup.c
1 ( 0.01%) 2,000,001 (49.29%) 3,000,004 (57.36%) for (int i = 0; i < 1000000; i++) {
```
The commit also adds some much-needed tests for cg_annotate and
callgrind_annotate.
glibc 2.28 filters out some bad signal numbers and returns
"Invalid argument" instead of passing such bad signal numbers on to
the kernel sigaction syscall. So we won't see such bad signal
numbers and won't print "bad signal number" ourselves.
Add a new memcheck/tests/sigkill.stderr.exp-glibc-2.28 to catch
this case.
The mfvscr and vor instructions in jm-insns.c had a "=vr" constraint.
This should have been an "=v" constraint. Fixing it resolves assembler
warnings and the testcase failure on ppc64le with gcc 8.2 and
binutils 2.30.
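For illustration (a hedged sketch, not the jm-insns.c code): the output of
such instructions needs the Altivec/VSX register constraint "v"; a constraint
string like "vr" lists two alternatives, a vector register or a
general-purpose register, which is not what these instructions want.
```
#include <altivec.h>

/* Illustrative only: copy a vector with vor, forcing both operands into
   vector registers via the "v" constraint (rather than "vr"). */
static vector unsigned int copy_vec(vector unsigned int vs)
{
   vector unsigned int vd;
   __asm__ ("vor %0,%1,%1" : "=v" (vd) : "v" (vs));
   return vd;
}
```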
This adds a configuration file ".dir-locals.el" for Emacs to the topmost
directory of the Valgrind source tree, and another such file to the
directory drd/tests. These files contain per-directory local Emacs
variables.
The following settings are performed:
* The base C style is set to "Linux", indentation is set to 3 columns
per level, the use of tabs for indentation is disabled, and the fill
column is set to 80.
* The source files in drd/tests use 2 instead of 3 columns per indentation
level.