diff --git a/NEWS b/NEWS
index e169ac1ce..e6e0fb578 100644
--- a/NEWS
+++ b/NEWS
@@ -27,6 +27,11 @@ Release 3.8.0 (????)
* The C++ demangler has been updated so as to work well with C++
compiled by even the most recent g++'s.
+* The new option --fair-sched controls the locking mechanism
+ used by Valgrind. The locking mechanism influences the performance
+ and scheduling of multithreaded applications (in particular
+ on multiprocessor/multicore systems).
+
* ==================== FIXED BUGS ====================
The following bugs have been fixed or resolved. Note that "n-i-bz"
@@ -41,6 +46,7 @@ https://bugs.kde.org/show_bug.cgi?id=XXXXXX
where XXXXXX is the bug number as listed below.
247386 make perf does not run all performance tests
+270006 Valgrind scheduler unfair
270796 s390x: Removed broken support for the TS insn
271438 Fix configure for proper SSE4.2 detection
273114 s390x: Support TR, TRE, TROO, TROT, TRTO, and TRTT instructions
diff --git a/docs/xml/manual-core.xml b/docs/xml/manual-core.xml
index 736c39756..2d3d086af 100644
--- a/docs/xml/manual-core.xml
+++ b/docs/xml/manual-core.xml
@@ -1660,6 +1660,44 @@ need to use these.
+
+
+
+
+
+ The --fair-sched option controls the
+ locking mechanism used by Valgrind to serialise thread
+ execution. The locking mechanism determines how threads
+ are scheduled, and each setting gives a different trade-off between
+ fairness and performance. For more details about the Valgrind thread
+ serialisation principle and its impact on performance and thread
+ scheduling, see "Scheduling and Multi-Thread Performance".
+
+
+ The value --fair-sched=yes
+ activates fair scheduling. If multiple threads are
+ ready to run, they are scheduled in round-robin
+ fashion. This mechanism is not available on all platforms or
+ Linux versions. If it is not available,
+ using --fair-sched=yes will cause Valgrind to
+ terminate with an error.
+
+
+ The value --fair-sched=try
+ activates fair scheduling if it is available on the
+ platform. Otherwise, it automatically falls back
+ to --fair-sched=no.
+
+
+ The value --fair-sched=no activates
+ a scheduling mechanism which does not guarantee fairness
+ between threads that are ready to run.
+
+
+
+
+
+
@@ -1836,8 +1874,8 @@ that your program will use the native threading library, but Valgrind
serialises execution so that only one (kernel) thread is running at a
time. This approach avoids the horrible implementation problems of
implementing a truly multithreaded version of Valgrind, but it does
-mean that threaded apps run only on one CPU, even if you have a
-multiprocessor or multicore machine.
+mean that threaded apps never use more than one CPU simultaneously,
+even if you have a multiprocessor or multicore machine.
Valgrind doesn't schedule the threads itself. It merely ensures
that only one thread runs at once, using a simple locking scheme. The
@@ -1860,6 +1898,86 @@ everything is shared (a thread) or nothing is shared (fork-like); partial
sharing will fail.
+
+Scheduling and Multi-Thread Performance
+
+A thread executes code only when it holds the lock. After
+executing a certain number of instructions, the running thread releases
+the lock. All threads that are ready to run then compete to acquire it.
+
+The --fair-sched option controls the locking mechanism
+used to serialise thread execution.
+
+ The default pipe-based locking
+(--fair-sched=no) is available on all platforms.
+Pipe-based locking does not guarantee fairness between threads: it is
+quite possible that a thread that has just released the lock
+reacquires it immediately. With pipe-based locking, different
+runs of the same multithreaded application might produce very different
+thread schedules.
+
+ Futex-based locking is available on some platforms.
+If available, it is activated by --fair-sched=yes or
+--fair-sched=try. Futex-based locking ensures
+fairness between threads: if multiple threads are ready to run, the lock
+is given to the thread that requested it first. Note that a thread
+that is blocked in a system call (e.g. a blocking read system call) has
+not (yet) requested the lock: such a thread requests the lock only after the
+system call has finished.
+
+ The fairness of futex-based locking gives better reproducibility
+of thread scheduling across different executions of a multithreaded
+application. This better reproducibility is particularly
+useful when using Helgrind or DRD.
+
+ The Valgrind thread serialisation implies that only one thread
+is running at a time. On a multiprocessor/multicore system, the
+running thread is assigned to one of the CPUs by the OS kernel
+scheduler. When a thread acquires the lock, it is sometimes
+assigned to the same CPU as the thread that just released the
+lock, and sometimes to another CPU. With
+pipe-based locking, the thread that just acquired the lock
+will often be scheduled on the same CPU as the thread that just
+released the lock. With futex-based locking, the thread that
+just acquired the lock will more often be scheduled on another
+CPU.
+
+Valgrind's thread serialisation and CPU assignment by the OS
+kernel scheduler can interact badly with the CPU frequency scaling
+available on many modern CPUs. To decrease power consumption, the
+frequency of a CPU or core is automatically decreased if the CPU/core
+has not been used recently. If the OS kernel often assigns the thread
+that just acquired the lock to another CPU/core, there is a good
+chance that this CPU/core is currently at a low frequency. The
+frequency of this CPU will be increased after some time. However,
+during this time, the (only) running thread will have run at a low
+frequency. Once this thread has run for some time, it will release
+the lock. Another thread will acquire the lock, and might again be
+scheduled on another CPU whose clock frequency was decreased in
+the meantime.
+
+Futex-based locking causes threads to switch CPU/core more
+often. So, if CPU frequency scaling is active, futex-based
+locking might significantly decrease the performance of a
+multithreaded app running under Valgrind (up to 50% degradation
+has been observed). Pipe-based locking also interacts somewhat badly
+with CPU frequency scaling: up to 10-20% performance degradation has
+been observed.
+
+To avoid this performance degradation, you can tell the
+kernel that all CPUs/cores should always run at maximum clock
+speed. Depending on your Linux distribution, CPU frequency scaling
+might be controlled using a graphical interface or using command-line
+tools such as
+cpufreq-selector or
+cpufreq-set. You can also tell the
+OS scheduler to run a Valgrind process on a specific (fixed) CPU using the
+taskset command: running on a fixed
+CPU should ensure that this specific CPU stays at a high clock frequency.
+
+
+
+