\documentclass[twocolumn]{article}

% Packages
\usepackage{lipsum}
\usepackage{biblatex}
\usepackage{hyperref}

% TODO: DUE DATE: 12/07

% `biblatex` resources
\addbibresource{references.bib}

% Meta
\title{TODO}
\author{Filipe Rodrigues}
\date{June 9, 2023}

% Document
\begin{document}

% Title
\maketitle
% Abstract
\begin{abstract}
\end{abstract}

% Introduction
\section{Introduction}
Over recent years, we have witnessed the introduction of
programs that require extremely large runtime datasets.
Despite advancements in main-memory capacity, some programs
genuinely require datasets larger than main memory.

Traditionally, the memory system used on most machines consists
of main memory, situated on DRAM chips, alongside swap memory,
typically located on non-volatile storage shared with other user
filesystem partitions.

Once main memory is close to full, the kernel moves pages from
it to swap. When a program later attempts to access such a page, a page
fault occurs and the kernel transparently moves the page from
swap back to RAM before resuming the program.

This handling of memory can incur large costs, largely because
swap resides on persistent storage, typically a hard drive or
solid-state disk. These devices have much lower throughput and higher
latency than DRAM. They may also contain partitions other
than the swap partition, which implies competition with
filesystem reads and writes.

Another source of overhead is the time it takes for the kernel
to resolve a page fault whenever a program accesses a swapped-out page.

When a dataset grows beyond main-memory capacity, a program
will encounter thrashing, where pages constantly bounce between
main memory and swap.
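To make thrashing concrete, the following is a minimal sketch (not part of the report's tooling; all names and capacities are illustrative) that counts page faults in an LRU-managed main memory. Once a cyclic working set exceeds the number of available frames, every single access faults:

```python
# Hypothetical illustration: count page faults under an LRU replacement
# policy, to show thrashing once the working set exceeds capacity.
from collections import OrderedDict

def count_faults(accesses, capacity):
    """Simulate `capacity` main-memory frames with LRU; return fault count."""
    frames = OrderedDict()  # page -> None, ordered by recency
    faults = 0
    for page in accesses:
        if page in frames:
            frames.move_to_end(page)  # hit: refresh recency
        else:
            faults += 1  # miss: page must be brought in from swap
            if len(frames) >= capacity:
                frames.popitem(last=False)  # evict least-recently-used page
            frames[page] = None
    return faults

# A cyclic working set of 5 pages over 100 rounds:
trace = list(range(5)) * 100
print(count_faults(trace, capacity=5))  # 5: only cold faults, then all hits
print(count_faults(trace, capacity=4))  # 500: thrashing, every access faults
```

With 5 frames, the working set fits and only the 5 initial cold faults occur; with 4 frames, LRU evicts exactly the page that is needed next, so all 500 accesses fault.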
With the advent of tiered-memory systems, a different, much more
scalable paradigm emerged.

In tiered-memory systems, we typically find two or more memories,
each with different trade-offs regarding capacity, throughput, and latency.
Each of these may be addressed by the CPU individually.

By assigning each page to a memory based on its usage patterns,
it is possible to avoid paying the cost of page faults, since the program
can access the memory directly. Even though this memory is potentially
slower than main memory, it is still much faster than swap and does not require
copying a whole page to main memory on access.

These systems have an additional problem to solve, however: they must
decide where each page should reside. This is typically handled by a
classifier running in the background, which receives periodic samples of the access
patterns of each page. The classifier can then choose to migrate certain pages
to other memories.

These migrations are similar to swap migrations, but they can be much faster,
since they do not involve copying to or from a hard drive or solid-state disk.
In some configurations, it is also possible to perform these migrations in
the background via a DMA command issued to another processing unit in the system.
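The sampling-based classification described above can be sketched as follows (a hypothetical, simplified model, not any system's actual classifier; the threshold and all names are illustrative):

```python
# Hypothetical illustration: an epoch-based classifier that samples per-page
# access counts and flags pages crossing a threshold as hot (migration
# candidates for fast memory); unseen pages cool down and leave the hot set.
from collections import Counter

HOT_THRESHOLD = 8  # sampled accesses per epoch before a page counts as hot

def classify_epoch(sampled_accesses, hot_pages):
    """Update the hot-page set from one epoch of sampled page accesses."""
    counts = Counter(sampled_accesses)
    for page, count in counts.items():
        if count >= HOT_THRESHOLD:
            hot_pages.add(page)  # candidate for migration to fast memory
    for page in list(hot_pages):
        if counts[page] < HOT_THRESHOLD:
            hot_pages.discard(page)  # page cooled down this epoch
    return hot_pages

hot = set()
classify_epoch([1] * 8 + [2], hot)
print(hot)  # {1}: page 1 crossed the threshold, page 2 did not
```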
% TODO: Explain what we're doing

% TODO: Problem is somewhat unresolved

% TODO: Mention more tiered-memory systems

% TODO: Memory systems aren't very well developed

% TODO: Want to contribute to this problem, by studying current systems

% TODO: Explain what we solved (Some HeMem results not yet studied + framework)

% TODO: Explain swap isn't byte-addressable

% TODO: Explain how heterogeneous memory models address multiple RAMs (~NUMA)

% Background
\section{Background}

% Explain how some memory systems exist
% Hybrid:
% TODO: Example: DRAM + Optane
% TODO: Example: DRAM + CXL mention

% TODO: Explain how OSes interact with memory systems

One such implementation of a tiered-memory system is HeMem \cite{10.1145/3477132.3483550}.
It is designed to work with two memories: a main memory based on DRAM, as well
as a secondary memory based on Intel Optane.
This gives it a fast but small DRAM for frequently accessed pages, and a
slightly slower but larger Optane for less frequently accessed pages.

Its classifier works by tagging pages as cold or hot, depending on their access patterns.

A cold page is one that is not accessed very often.
This usually implies it can remain in, or be moved to, Optane memory.

Hot pages, on the other hand, are ones that are accessed very often.
This usually implies they should remain in, or be moved to, DRAM.

Pages are allocated in main memory as long as space is available; otherwise
they are allocated in secondary memory.

When a cold page that resides in secondary memory becomes hot, it is moved to main memory.
This may require first migrating colder pages from main memory to secondary memory;
the coldest pages are chosen for this migration.

When a hot page that resides in main memory becomes cold, it is moved to secondary memory.
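A minimal model of this promotion policy, assuming a tiny DRAM tier of a few pages backed by a larger Optane tier, might look as follows (a sketch for intuition, not HeMem's actual implementation; the capacity and all names are illustrative):

```python
# Hypothetical illustration of the promotion/demotion policy described above:
# a newly hot page moves to DRAM, demoting the coldest DRAM pages first if
# DRAM is full.
DRAM_CAPACITY = 2  # illustrative DRAM size, in pages

def promote(page, dram, optane, hotness):
    """Move a newly hot page from Optane to DRAM, evicting coldest pages."""
    optane.discard(page)
    while len(dram) >= DRAM_CAPACITY:
        coldest = min(dram, key=lambda p: hotness[p])
        dram.discard(coldest)  # demote the coldest DRAM page...
        optane.add(coldest)    # ...to the secondary (Optane) tier
    dram.add(page)

dram, optane = {10, 11}, {12}
hotness = {10: 5, 11: 1, 12: 9}  # page 12 just became hot
promote(12, dram, optane, hotness)
print(sorted(dram), sorted(optane))  # [10, 12] [11]: page 11 was demoted
```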
% TODO: Mention approach 1 (JIT recompiler that checks "hot" memory sections)

% TODO: Mention approach 2 (Runtime, like HeMem)

% Objectives
\section{Objectives}

The HeMem classifier will be the main object of study of this report.
We will study how it acts under certain circumstances, to determine
whether anti-patterns occur, where it may be choosing actions that
ultimately slow down the system as a whole.

HeMem as a whole is a very complex system; however, we are only interested
in studying its classifier. To achieve this, we create a simulator that
runs just the classifier.

In order to study how a program's accesses are evaluated by HeMem, we record
all memory accesses of that program into a trace file.
This is done with \texttt{valgrind}, using a custom tool.

Given that valgrind records all accesses, and not just those that reach memory,
this tool contains an ideal cache simulator, to ensure that only accesses that
would reach memory are registered.
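The filtering step can be sketched as follows (a simplified, hypothetical model of the idea, not the report's actual valgrind tool; the line size and capacity are illustrative): an "ideal" fully-associative LRU cache absorbs hits, so only misses, i.e. accesses that would reach memory, survive into the trace.

```python
# Hypothetical illustration: filter an address trace through an ideal
# fully-associative LRU cache, keeping only accesses that miss the cache
# and would therefore reach memory.
from collections import OrderedDict

LINE_SIZE = 64   # bytes per cache line
CACHE_LINES = 4  # capacity of the ideal cache, in lines

def filter_trace(addresses):
    """Return only the addresses whose access would reach memory."""
    cache = OrderedDict()  # line -> None, ordered by recency
    misses = []
    for addr in addresses:
        line = addr // LINE_SIZE
        if line in cache:
            cache.move_to_end(line)  # hit: served by the cache, dropped
        else:
            misses.append(addr)      # miss: this access reaches memory
            if len(cache) >= CACHE_LINES:
                cache.popitem(last=False)  # evict least-recently-used line
            cache[line] = None
    return misses

# Addresses 0 and 8 share a cache line: only the first reaches memory.
print(filter_trace([0, 8, 64]))  # [0, 64]
```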
A second tool, called \texttt{ftmemsim}, then accepts these traces and
runs the HeMem classifier on them, outputting a data file with the
statistics collected.

A third tool, called \texttt{ftmemsim-graphs}, can then accept this data file and
produce graphs to intuitively visualize the data.

% Results
\section{Results}

% TODO: Ensure replicability

% TODO: Mention workloads & explain them (Single-threaded versus Multi-threaded).

% TODO: Section should be Q&A.

% TODO: Add "conclusion" to each question / graph.

% Conclusion
\section{Conclusion}

% Bibliography
% TODO: Investigate if we can fix the hbadness of
% `printbibliography`?
\hbadness 10000
\printbibliography

\end{document}