<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

<chapter id="bbv-manual" xreflabel="BBV">
  <title>BBV: an experimental basic block vector generation tool</title>

<para>To use this tool, you must specify
<option>--tool=exp-bbv</option> on the Valgrind
command line.</para>

<sect1 id="bbv-manual.overview" xreflabel="Overview">
<title>Overview</title>

<para>
   A Basic Block Vector (BBV) is a list of all basic blocks entered
   during program execution, together with a count of how many times
   each block was run.  A basic block is a section of code
   with only one entry point and one exit point.
</para>

<para>
   BBV is a tool that generates basic block vectors
   for use with the SimPoint analysis tool
   (http://www.cse.ucsd.edu/~calder/simpoint/).
   The SimPoint methodology enables speeding up architectural
   simulations by only running a small portion of a program
   and then extrapolating total behavior from this
   small portion.  Most programs exhibit phase-based behavior, which
   means that at various times during execution a program will encounter
   intervals of time where the code behaves similarly to a previous
   interval.  If you can detect these intervals and group them together,
   an approximation of the total program behavior can be obtained
   by only simulating a bare minimum number of intervals, and then scaling
   the results.
</para>

<para>
  In computer architecture research, running a
  benchmark on a cycle-accurate simulator can cause slowdowns on the order
  of 1000 times, making it take days, weeks, or even longer to run full
  benchmarks.  Using SimPoint can reduce this time significantly,
  usually by 90-95%, while still retaining reasonable accuracy.
</para>

<para>
   A more complete introduction to how SimPoint works can be
   found in the paper "Automatically Characterizing Large Scale
   Program Behavior" by T. Sherwood, E. Perelman, G. Hamerly, and
   B. Calder.
</para>

</sect1>

<sect1 id="bbv-manual.quickstart" xreflabel="Quick Start">
<title>Using Basic Block Vectors to create SimPoints</title>

<para>
   To quickly create a basic block vector file, run Valgrind
   like this:
   <computeroutput>valgrind --tool=exp-bbv /bin/ls</computeroutput>.
   In this case we are running the "ls" program, but this
   can be any executable.  By default a file called
   <computeroutput>bb.out.PID</computeroutput> will be created,
   where PID is replaced by the process ID of the running process.
   This file is the basic block vector.  For long-running programs
   this file can be quite large, so it might be wise to compress
   it with gzip or some other compression program.
</para>
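
<para>
   As a minimal sketch of the compression step, the following Python
   helper gzips a vector file using only the standard library.  The
   function name <computeroutput>compress_bbv</computeroutput> is
   hypothetical and not part of Valgrind; running
   <computeroutput>gzip</computeroutput> from the shell works just as well.
</para>

<programlisting><![CDATA[
import gzip
import shutil
from pathlib import Path


def compress_bbv(path):
    """Gzip a BBV file and return the path of the compressed copy."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    # Stream the data so large vector files need not fit in memory.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst]]></programlisting>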

<para>
   To create actual SimPoint results, you will need the
   SimPoint utility, available from the SimPoint webpage
   (http://www.cse.ucsd.edu/~calder/simpoint/).
   Assuming you have downloaded SimPoint 3.2 and compiled it,
   create SimPoint results with a command like the following:

   <programlisting><![CDATA[
./SimPoint.3.2/bin/simpoint -inputVectorsGzipped \
    -loadFVFile bb.out.1234.gz \
    -k 5 -saveSimpoints results.simpts \
    -saveSimpointWeights results.weights]]></programlisting>

   where bb.out.1234.gz is your compressed basic block vector file
   generated by Valgrind exp-bbv.
</para>

<para>
   The SimPoint utility performs a random linear projection down to
   15 dimensions, then uses k-means clustering to determine which
   intervals are of interest.  In this example we specify 5 intervals
   with the <option>-k 5</option> option.
</para>
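
<para>
   To make the two steps concrete, here is a small pure-Python sketch of
   random linear projection followed by k-means clustering.  It is not
   SimPoint's implementation: the normalisation of each vector, the
   uniform random coefficients, and the deterministic farthest-first
   choice of starting centers are all assumptions made to keep the
   example short and repeatable.
</para>

<programlisting><![CDATA[
import random


def random_projection(vectors, dims=15, seed=0):
    """Project sparse {block_id: count} vectors down to `dims` dimensions."""
    rng = random.Random(seed)
    block_ids = sorted({b for v in vectors for b in v})
    # One random row of `dims` coefficients in [-1, 1] per basic block.
    rows = {b: [rng.uniform(-1.0, 1.0) for _ in range(dims)] for b in block_ids}
    projected = []
    for v in vectors:
        total = sum(v.values()) or 1  # normalise each interval (assumption)
        point = [0.0] * dims
        for b, count in v.items():
            for d in range(dims):
                point[d] += (count / total) * rows[b][d]
        projected.append(point)
    return projected


def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; returns one cluster label per point."""
    # Farthest-first seeding: start at the first point, then repeatedly
    # add the point that is farthest from every center chosen so far.
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centers)
        )
        centers.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels]]></programlisting>

<para>
   Given one dictionary of block counts per interval, intervals that
   execute similar code end up with the same label, and one
   representative per label can then be chosen for simulation.
</para>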

<para>
   The outputs from the SimPoint run are the
   <computeroutput>results.simpts</computeroutput>
   and <computeroutput>results.weights</computeroutput> files.
   The first holds the 5 most representative intervals of the program.
   The second holds the weight by which to scale each interval when
   extrapolating full-program behavior.  The intervals and the weights
   can be used in conjunction with a simulator that supports
   fast-forwarding: you fast-forward to the interval of interest,
   collect statistics for the desired interval length, then combine
   the gathered statistics with the weights to
   calculate your results.
</para>
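
<para>
   The final weighting step can be sketched in Python as follows.  The
   sketch assumes that each line of the
   <computeroutput>.simpts</computeroutput> file has the form
   "interval cluster" and each line of the
   <computeroutput>.weights</computeroutput> file has the form
   "weight cluster"; check this against the files your SimPoint version
   produces.  The function names and the
   <computeroutput>measured</computeroutput> dictionary are hypothetical.
</para>

<programlisting><![CDATA[
def parse_pairs(text, first_type):
    """Parse lines of the form '<value> <cluster>' into {cluster: value}."""
    result = {}
    for line in text.splitlines():
        if line.strip():
            value, cluster = line.split()
            result[int(cluster)] = first_type(value)
    return result


def extrapolate(simpts_text, weights_text, measured):
    """Weighted sum of a per-interval metric.

    `measured` maps an interval number to the statistic (for example CPI)
    collected by simulating only that interval.
    """
    intervals = parse_pairs(simpts_text, int)
    weights = parse_pairs(weights_text, float)
    return sum(weights[c] * measured[intervals[c]] for c in intervals)]]></programlisting>

<para>
   For example, if interval 12 has weight 0.75 and a measured CPI of 2.0,
   and interval 40 has weight 0.25 and a measured CPI of 4.0, the
   estimated whole-program CPI is 0.75 * 2.0 + 0.25 * 4.0 = 2.5.
</para>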

</sect1>

<sect1 id="bbv-manual.usage" xreflabel="BBV Usage">
<title>BBV Command Line Options</title>

<para>
   BBV has various options that control the behavior of the plugin:
<!-- start of xi:include in the manpage -->
<variablelist id="bbv.opts.list">

  <varlistentry id="opt.interval-size" xreflabel="--interval-size">
      <term>
        <option><![CDATA[--interval-size=<number> [default: 100000000] ]]></option>
      </term>
      <listitem>
      <para>
         This option selects the size of the interval to use.
         The default is 100
         million instructions, which is a commonly used value.
         Other sizes can be used; smaller intervals can help programs
         with finer-grained phases.  However, a smaller interval size
         can lead to accuracy issues due to warm-up effects:
         when fast-forwarding, the various architectural features
         will be uninitialized, and it will take some number
         of instructions before they "warm up" to the state a
         full simulation would have reached without the fast-forwarding.
         Large interval sizes tend to mitigate this.
      </para>
      </listitem>
  </varlistentry>

  <varlistentry id="opt.instr-count-only" xreflabel="--instr-count-only">
     <term>
        <option><![CDATA[--instr-count-only [default: no] ]]></option>
     </term>
     <listitem>
        <para>
           This option tells the tool to only display instruction
           count totals, and to not generate the
           actual BBV file.  This is useful for debugging, and for
           gathering instruction count info without generating
           the large BBV files.
        </para>
     </listitem>
   </varlistentry>

  <varlistentry id="opt.bb-out-file" xreflabel="--bb-out-file">
     <term>
        <option><![CDATA[--bb-out-file=<name> [default: bb.out.%p] ]]></option>
     </term>
     <listitem>
        <para>
           This option selects the name of the basic block vector file.
           The default is bb.out.%p.  The
           <option>%p</option> and <option>%q</option> format specifiers can be
           used to embed the process ID and/or the contents of an environment
           variable in the name, as is the case for the core option
           <option>--log-file</option>.
        </para>
     </listitem>
  </varlistentry>

  <varlistentry id="opt.pc-out-file" xreflabel="--pc-out-file">
     <term>
        <option><![CDATA[--pc-out-file=<name> [default: pc.out.%p] ]]></option>
     </term>
     <listitem>
        <para>
           This option selects the name of the PC file.
           This file holds program counter addresses
           and function name info for the various basic blocks.
           This can be used in conjunction
           with the BBV file to fast-forward via function names
           instead of just instruction counts.
           The default filename is pc.out.%p.  The
           <option>%p</option> and <option>%q</option> format specifiers can be
           used to embed the process ID and/or the contents of an environment
           variable in the name, as is the case for the core option
           <option>--log-file</option>.
        </para>
     </listitem>
   </varlistentry>
</variablelist>
<!-- end of xi:include in the manpage -->

</para>
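
<para>
   When driving many runs from a script, it can help to assemble the
   command line programmatically.  The following Python sketch builds an
   argument list from the options described above; the helper name
   <computeroutput>bbv_command</computeroutput> and its parameter names
   are hypothetical.
</para>

<programlisting><![CDATA[
def bbv_command(program, interval_size=None, bb_out=None, pc_out=None):
    """Return an argv list that runs `program` under exp-bbv.

    `program` is the target command as a list, e.g. ["/bin/ls", "-l"].
    Options left as None are omitted so the tool's defaults apply.
    """
    argv = ["valgrind", "--tool=exp-bbv"]
    if interval_size is not None:
        argv.append(f"--interval-size={interval_size}")
    if bb_out is not None:
        argv.append(f"--bb-out-file={bb_out}")
    if pc_out is not None:
        argv.append(f"--pc-out-file={pc_out}")
    return argv + list(program)]]></programlisting>

<para>
   The resulting list can be passed directly to
   <computeroutput>subprocess.run</computeroutput> without any shell
   quoting.
</para>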

</sect1>

<sect1 id="bbv-manual.fileformat" xreflabel="BBV File Format">
<title>Basic Block Vector File Format</title>

<para>
  The Basic Block Vector is dumped at fixed intervals.  By default
  this happens every 100 million instructions; the
  <option>--interval-size</option> option can be
  used to change this.
</para>

<para>
  The output file looks like this:
</para>

<programlisting><![CDATA[
T:45:1024 :189:99343
T:11:78573 :15:1353  :56:1
T:18:45 :12:135353 :56:78 :314:4324263]]></programlisting>

<para>
  Each line describes one interval and starts with a T.  Every entry on
  the line consists of a colon, a unique number identifying the basic
  block, another colon, and the frequency (which is scaled
  by the number of instructions in the basic block).
</para>
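
<para>
   A file in this format can be read with a few lines of Python.  The
   sketch below is one possible reader, not code shipped with Valgrind or
   SimPoint; the name <computeroutput>parse_bbv</computeroutput> is
   hypothetical.
</para>

<programlisting><![CDATA[
import re

# One entry: colon, block number, colon, weighted count.
ENTRY = re.compile(r":(\d+):(\d+)")


def parse_bbv(lines):
    """Yield one {block_id: weighted_count} dict per interval line."""
    for line in lines:
        line = line.strip()
        if not line.startswith("T"):
            continue  # skip blank lines and anything that is not an interval
        yield {int(block): int(count) for block, count in ENTRY.findall(line)}]]></programlisting>

<para>
   Passing an open file object (or the result of
   <computeroutput>gzip.open(name, "rt")</computeroutput> for a compressed
   file) yields the intervals in the order they were written.
</para>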

<para>
  The entry count is multiplied by the number of instructions
  in the basic block, so that instructions in
  small basic blocks aren't counted as more important than instructions
  in large basic blocks.
</para>
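
<para>
   The effect of this scaling is easiest to see with numbers.  In the
   Python sketch below (the function name and the block sizes are made up
   for illustration), a 2-instruction block entered 1000 times and a
   40-instruction block entered 50 times both account for 2000 executed
   instructions, so they receive the same weighted count.
</para>

<programlisting><![CDATA[
def weighted_counts(entries, block_sizes):
    """Scale raw entry counts by each block's instruction count."""
    return {block: entries[block] * block_sizes[block] for block in entries}


entries = {1: 1000, 2: 50}     # how often each block was entered
block_sizes = {1: 2, 2: 40}    # instructions per block (illustrative values)
scaled = weighted_counts(entries, block_sizes)]]></programlisting>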

</sect1>

<sect1 id="bbv-manual.implementation" xreflabel="Implementation">
<title>Implementation</title>

<para>
   Valgrind provides all of the information necessary to create
   BBV files.  In the current implementation, all instructions
   are instrumented.  This is slower (by approximately a factor
   of two) than a method that instruments at the basic-block level,
   but there are some complications (especially with rep prefix
   detection) that make that method more difficult.
</para>

<para>
   Valgrind actually provides instrumentation at a super-block level.
   A super-block has one entry point but unlike basic-blocks can
   have multiple exit points.  Once a branch occurs into the middle
   of a block, it is split into a new basic-block.  Because
   Valgrind cannot produce "true" basic blocks, the generated
   basic block vectors will differ from those generated by other tools.
   In practice this does not seem to affect the accuracy of the
   SimPoint results.  The tool internally forces the Valgrind option
   <option>--vex-guest-chase-thresh=0</option>,
   which gives more basic-block-like behavior.
</para>

<para>
   When a super block is run for the first time, it is instrumented
   with our BBV routine.  This adds a call to our instruction
   counting function for each original instruction.
   The current superblock is looked up in an Ordered Set to find
   a structure that holds block-specific statistics (the entry point
   address is the lookup key).  We increment the
   instruction count for this superblock and
   also update the master instruction count.
   If the master count exceeds the interval size,
   we write the basic block statistics for the current interval
   to disk and then reset all the superblock counters to zero.
</para>
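
<para>
   The bookkeeping described above can be modelled in a few lines of
   Python.  This is a toy model, not the tool's C code: a plain
   dictionary stands in for the ordered set, a list stands in for the
   output file, and carrying the excess of the master count over into
   the next interval is an assumption.
</para>

<programlisting><![CDATA[
class IntervalCounter:
    """Toy model of per-instruction counting with periodic dumps."""

    def __init__(self, interval_size):
        self.interval_size = interval_size
        self.block_counts = {}   # superblock entry address -> instructions run
        self.master_count = 0    # instructions counted in the current interval
        self.dumped = []         # stands in for lines written to the BBV file

    def on_instruction(self, block_addr):
        """Called once for every original instruction that executes."""
        self.block_counts[block_addr] = self.block_counts.get(block_addr, 0) + 1
        self.master_count += 1
        if self.master_count > self.interval_size:
            self.flush()

    def flush(self):
        # Record the non-zero counters, then start a fresh interval.
        self.dumped.append({a: n for a, n in self.block_counts.items() if n})
        for addr in self.block_counts:
            self.block_counts[addr] = 0
        self.master_count -= self.interval_size  # carry-over is an assumption]]></programlisting>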

<para>
   On the x86 and amd64 architectures the code takes special
   care with rep-prefixed string instructions.  This is because
   actual hardware counts a rep-prefixed instruction
   as one instruction, while a naive Valgrind implementation
   would count it as many (possibly hundreds, thousands or even millions)
   of instructions.  We have special code to handle
   this properly, which makes the results match hardware performance
   counter results.
</para>
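
<para>
   The difference between the two counting methods can be illustrated
   with a short Python sketch.  It works on a hypothetical trace of
   (address, is_rep) events, one per executed iteration, and is not
   necessarily how exp-bbv itself detects rep-prefixed instructions.
</para>

<programlisting><![CDATA[
def count_instructions(trace):
    """Return (naive, rep_aware) counts for a list of (address, is_rep) events."""
    naive = 0
    rep_aware = 0
    last_rep_addr = None
    for addr, is_rep in trace:
        naive += 1  # the naive count treats every iteration as an instruction
        if is_rep:
            # Back-to-back iterations of the same rep instruction count once.
            if addr != last_rep_addr:
                rep_aware += 1
            last_rep_addr = addr
        else:
            rep_aware += 1
            last_rep_addr = None
    return naive, rep_aware]]></programlisting>

<para>
   For a trace with one ordinary instruction, a rep-prefixed instruction
   that iterates 1000 times, and another ordinary instruction, the naive
   count is 1002 while the rep-aware count is 3.
</para>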

<para>
   The exp-bbv tool also counts the fldcw instruction.  This
   instruction is used on x86 machines when converting numbers
   from floating point to integer (among other uses).
   On Pentium 4 systems the retired instruction performance
   counter counts this instruction as two
   instructions (all other known processors only count it as one).
   This can affect results when using SimPoint on Pentium 4 systems,
   so we provide the count for use in mitigating this at analysis time.
</para>

</sect1>

<sect1 id="bbv-manual.threadsupport" xreflabel="BBV Threaded Support">
<title>Threaded Executable Support</title>

<para>
   BBV supports threaded programs.  When a program has multiple threads,
   an additional BBV file is created for each thread (each additional
   file is the specified filename with the thread number
   appended at the end).
</para>

<para>
   There is no official method of using SimPoint with
   threaded workloads.  The most common method is to run
   SimPoint on each thread's results independently, and use
   some method of deterministic execution to try to match the
   original workload.  This should be possible with the current
   version of exp-bbv.
</para>

</sect1>

<sect1 id="bbv-manual.validation" xreflabel="BBV Validation">
<title>Validation</title>

<para>
   This plugin has been tested on x86, amd64, and ppc32 platforms.
   An earlier version of the plugin was tested in detail using
   hardware performance counters; this work is described in a paper
   from the HiPEAC'08 conference, "Using Dynamic Binary Instrumentation
   to Generate Multi-Platform SimPoints: Methodology and Accuracy" by
   V.M. Weaver and S.A. McKee.
</para>

</sect1>

<sect1 id="bbv-manual.performance" xreflabel="BBV Performance">
<title>Performance</title>

<para>
  Using this program slows down execution by roughly a factor of 40
  over native execution.  This varies depending on the machine
  used and the benchmark being run.
  On the SPEC CPU 2000 benchmarks running on a 3.4GHz Pentium D
  processor, the slowdown ranges from 24x (mcf) to 340x (vortex.2).
</para>

</sect1>

</chapter>