Sunday, June 2, 2013

Multi-threading support (infrastructure)

Recently, I have been looking into multi-thread enabling some of the time-consuming elements of the end-turn processing.

My initial focus was on path generation, which turned out to be an interesting exercise - I was able to increase performance by 50% or so through general optimizations that were unrelated to threading, but which were illuminated by recasting the algorithm to support it.  However, multi-threading at the level of individual paths (and probably individual unit groups) was ultimately not very successful, and in the end I didn't push these changes to the main codebase: the gains were too small to justify the added complexity, or the potential for harming performance by loading more cores and thereby reducing turbo-mode availability to the primary thread.

Subsequent to that I started looking at the processing of cities in turn-end processing, which proved a much more fruitful area.

The reason city processing is a better candidate than (individual) path generation basically comes down to granularity.  If the work you want to split up (across threads) only takes a small amount of time anyway, the latency effects of thread context switches (background threads won't get going on a problem until they first get a time-slice) quickly become significant.  In the case of path generation, the total time to generate a path is comparable with the time-slice latency, which makes it much harder to achieve real gains by attempting to split the processing up.  City end-turn processing, on the other hand, takes an appreciable fraction of a second, and processing all of a player's cities at the end of a turn typically takes many seconds.

Anyway, the approach I've taken to multi-threading aspects of the city turn processing will be the subject of my next blog entry, since it involves a re-usable generic pattern that can be applied to many areas.  On the way to that I wanted to stop off and talk about profiling a bit, and multi-threaded profiling in particular.

Profiling Basics

Some time ago (one of the first things I did on joining the C2C team a couple of years ago), I rewrote the profiler that came with the Firaxis code, so I could get more useful output from it to do performance analysis.  A brief digression on what the profiler does, and some basics as to how it works are in order, before we get to the impact of multi-threading on it.  I'll start off pretty basic here for anyone that doesn't have much background in the area:

What is profiling?

Profiling is a means of measuring, ideally in some detail, how long different parts of a program take to execute.

Why do we want to profile?

It's not generally possible to optimize effectively without knowing what is taking all the time in the first place.  For example, it's easy to spot a piece of inefficient code that can be improved, but if it's only executed a small number of times and takes a small percentage of overall execution time, it doesn't matter how much you optimize it - you'll never have much effect on the overall time taken. As such we need to know where the time is spent in order to know what to focus our optimization efforts on.

So what specific things does a profiler measure?

This will vary from profiler to profiler, but for the purposes of our discussion we want to measure the time taken by individual methods and functions in the code.  We might also want to know things like:
  • How many times is a function executed?
  • What is its average execution time?
  • What is the total execution time for all calls to the method/function?
  • Where is it being called from?

How do we tell the profiler which functions to measure?

Again this will vary depending on the profiler and the environment.  In some environments, a profiler can identify functions and allow the user to select which to profile dynamically, without impacting the code being profiled at all.  The C++ environment of the Civ4 DLL is not one of these!  Instead we rely on the programmer placing profiling macros in the code, usually at function starts.  In the version we release to the end users (the 'Final Release' build in C2C nomenclature) these macros will expand to nothing at all (so they have no impact or function in that build), but in a profilable build (the 'Release' build in C2C) they expand into function calls to some profiling code, which I'll be talking more about later in this blog entry.

How do we view the results of profiling?

Profilers may have their own UI, or may output to files.  C2C's profiler outputs tab-separated records for each profile point to a special profile log file, which can then be loaded into a spreadsheet for easy manipulation (sorting by time for example).   The following is an example.  This is the profiler output from the latest build, while it executes the infamous 'Talin 6 minute turn' which I've been using as a test case for the past month or so (it doesn't take 6 minutes any longer as you can see):

[Screenshot: profiler output loaded into a spreadsheet, sorted by 'Time']

This output has been sorted by 'Time', so it shows the things taking longest first.  'root' is a pseudo-entry that means 'the entirety of everything that got profiled', which in this case is effectively a synonym for 'CvGame::update()' which is the entry point for the entire end-turn calculation.  'PipelineThread' is also a pseudo-entry, which I'll have more to say about later (the next blog entry in fact), but everything after that corresponds to a function, or more granular profiled section, in the code.  As you can see, some major constituents are the kinds of things we'd expect:
  • An end turn consists of ending the turn for every AI player, so a lot of it is in CvPlayer::doTurn, which is the end turn processing for an individual player
  • A large chunk is deciding what to do with units (AI_unitUpdate)
  • Pathing is still a major cost (CvPathGenerator::generatePath) - as you can see this turn required us to calculate over 65000 unit paths
Just because a function has a large 'time' value doesn't necessarily mean it's the problem in itself - functions near the root of the call tree will inevitably have large times, but note the columns 'Child time' and 'Self time' - for any given profile entry these are the amount of the time accrued to that function spent in (respectively) child functions that it calls, and itself.  As such, it is often entries with large 'self time' values that are the actual issues.
Also note the '#calls' column - this is the number of calls to the function made within the profiled turn.  Some functions may not take long for an individual call, but if they are used a lot it can add up (check out CvUnit::getCommander, for example - it is called 47 million times).  Such functions may well be worthwhile targets of optimization efforts, either to make them as streamlined as possible or (often more effectively) to look for higher level algorithmic changes that can reduce the number of times we need to call them.

How the profiler works

When adding profiling, it is very important to ensure that the overhead imposed by the act of profiling does not come to dominate the overall execution times, and thereby distort the very figures it is generating.  As such, the profiling code itself must be very light-weight - we can't have it doing things like allocating memory or performing other lengthy operations when processing a function invocation!

The profiler (prior to any attempt to add multi-threading) worked by having the profiling macros (typically placed at the start of functions) declare a static structure and a locally-scoped object.  The locally scoped object is responsible for the measurement of that particular function invocation (its constructor records the start time, its destructor the end time), and the static structure holds the aggregate totals for that function.  Thus the following code:

int myFunction()
{
    PROFILE_FUNC();

    // do some stuff
    ...
}

expands to something like this, through the definition of the macro PROFILE_FUNC:

int myFunction()
{
    static ProfileSample __myFunctionSample;
    CProfileScope __myFunctionScope(&__myFunctionSample);

    // do some stuff
    ...
}

Internally the constructor for CProfileScope will squirrel away the pointer to the static ProfileSample structure, and record the start time of the function call.  When the function exits, it will go out of scope and its destructor is called.  The destructor will calculate the time and add it to an accumulated value in the static structure.  It will also add a reference to the static structure to a list of profile entries that have been hit in this run.  At the end of profiling the list of entries hit will be iterated over, and their values dumped out to form the log which gives rise to spreadsheets like the earlier example.
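
To make that concrete, here is a minimal single-threaded sketch of what those two pieces might look like (the names, fields and timing calls here are illustrative, not the actual C2C implementation):

#include <windows.h>

struct ProfileSample
{
    const char* Name;       // name of the profile point (usually the function name)
    LONGLONG    TotalTime;  // accumulated QueryPerformanceCounter ticks over all calls
    int         NumCalls;   // number of invocations seen
    // The real structure also holds the parent id, recursion depth, output-table
    // index and so on, as described below.
};

class CProfileScope
{
public:
    explicit CProfileScope(ProfileSample* pSample) : m_pSample(pSample)
    {
        QueryPerformanceCounter(&m_start);              // record entry time
    }

    ~CProfileScope()
    {
        LARGE_INTEGER end;
        QueryPerformanceCounter(&end);                  // record exit time
        m_pSample->TotalTime += end.QuadPart - m_start.QuadPart;
        m_pSample->NumCalls++;
        // The real destructor also registers the sample in the output list the
        // first time it is hit, and maintains the parent/recursion bookkeeping.
    }

private:
    ProfileSample* m_pSample;
    LARGE_INTEGER  m_start;
};

// A PROFILE_FUNC-style macro can then be as simple as:
#define PROFILE_FUNC() \
    static ProfileSample __sample = { __FUNCTION__ }; \
    CProfileScope __scope(&__sample)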

In practice there is slightly more to it than this, but the gist is as simple as the above.  Specific other aspects are:
  • You need to do the 'add a reference' bit efficiently, so the table index of the reference added is cached in the static structure, so that subsequent calls don't have to do it again.
  • In order to track call paths we want to record the parent from which an entry is (first) called, so the profiler maintains a stack of pointers to the static structures declared by each macro invocation, and stores the id of the current top-of-stack entry as the parent of a child entry when that child is called.
  • Recursion needs some special handling, because you don't want the inner invocations of a function to add to its time, or the outer invocation will be double-counting.  To handle this, the static structure also contains a current 'EntryCount' which amounts to a recursion depth - the destructor will only accumulate the time if this is the outer exit (EntryCount going to 0)

Enabling multi-threading support

Enabling multi-threading on the above structure raises some issues:
  • You cannot use simple EntryCount and ParentId fields in the static structure, because 2 or more independent threads may be in it at once, and need to store independent values
  • You cannot use a global stack - each thread has its own independent stack to keep track of
  • You need to be careful to lock access to parts of the data structure that multiple threads can be manipulating at once - for example - when accumulating a time into a sample's total you cannot just say something like:
    Total += elapsed;
    This is because the addition (in this case of 64-bit quantities in a 32-bit program) is not atomic, so two threads doing it at once can mangle the result

Addressing some of these points is easier than others!

The global stack and parent linkage are the easiest, and have an obvious solution: replace the stack with a linked list, and put the linkage information in the locally scoped object rather than in the static sample structure (we still keep the id of the first parent we see in the static structure, to give the 'parent' dumped in the final output, which was always just the first call path a function was seen on).  In this scheme we use thread-local storage (variables declared as thread-local have different values on different threads, but they need to be used carefully as you cannot have very many of them in total) to hold a 'currentSample', which is a pointer to the sample structure of the profile scope we're currently in on any given thread.  Using this we can replace the global stack with a series of linked lists (one per thread), each rooted in the thread-local 'currentSample' entry.

The EntryCount is a bit trickier.  In principle we can continue to use a single value in the sample structure, which now becomes the TOTAL entry count (across all threads).  Of course if we do that then incrementing, decrementing and testing it all need to be done in a thread-safe manner (which I'll come to in a moment).  In this scheme we need to explicitly check for recursion by walking up the linked list to determine whether the same sample occurs higher up - this would be expensive, except that the check can be omitted in the special (but very common) case where the total entry count is going to 0 on exit anyway (in which case there cannot possibly be a recursive entry, since that would mean the count would still have to be at least 1).
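
Putting those two paragraphs together, the scope object's entry and exit code might look something like the sketch below.  Again this is illustrative only - the use of __declspec(thread) for the thread-local pointer, and the member and function names, are assumptions rather than the actual implementation:

#include <windows.h>

struct ProfileSample
{
    const char* Name;
    LONG        EntryCount;    // total entries across ALL threads (Interlocked access only)
    // time accumulators omitted here - making them thread-safe is discussed next
};

// Thread-safe accumulation of elapsed time into the sample's totals - how this is
// eventually done without locks is covered later in this entry (hypothetical name).
void AccumulateTime(ProfileSample* pSample, LONGLONG elapsed);

class CProfileScope
{
public:
    explicit CProfileScope(ProfileSample* pSample);
    ~CProfileScope();

private:
    ProfileSample* m_pSample;
    CProfileScope* m_pParentScope;   // next scope up in this thread's chain
    LARGE_INTEGER  m_start;
};

// The profile scope we are currently inside on this thread (NULL when none).
// One of these per thread replaces the old global stack.
__declspec(thread) CProfileScope* g_currentScope = NULL;

CProfileScope::CProfileScope(ProfileSample* pSample) : m_pSample(pSample)
{
    m_pParentScope = g_currentScope;             // link into this thread's chain
    g_currentScope = this;

    InterlockedIncrement(&pSample->EntryCount);  // total entries across all threads
    QueryPerformanceCounter(&m_start);
}

CProfileScope::~CProfileScope()
{
    LARGE_INTEGER end;
    QueryPerformanceCounter(&end);

    bool bRecursive = false;
    if (InterlockedDecrement(&m_pSample->EntryCount) > 0)
    {
        // Someone is still inside this sample; walk our own thread's chain to
        // see whether that someone is an outer invocation on this same thread
        for (CProfileScope* p = m_pParentScope; p != NULL; p = p->m_pParentScope)
        {
            if (p->m_pSample == m_pSample)
            {
                bRecursive = true;               // the outer invocation will count the time
                break;
            }
        }
    }

    if (!bRecursive)
    {
        AccumulateTime(m_pSample, end.QuadPart - m_start.QuadPart);
    }

    g_currentScope = m_pParentScope;             // unlink from this thread's chain
}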

Finally we need to lock access to the structures across update of them (and testing of them in some cases, like the recursion checking's access to EntryCount).

This is where things get interesting.

My initial (naïve) attempt was pretty much what I've described above, with critical sections protecting the key manipulations in the profile entry and exit functions.  This turned out to be problematic however, as I'll explain shortly.  First, a brief digression into locking and thread safety...

Locking and Thread Safety

If you have 2 or more threads concurrently accessing a single data structure, and at least one of them can be making changes to it, then you need to take precautions to ensure that a thread switch at a 'bad' time (say, when one thread is half way through modifying the structure) cannot cause another thread to operate on a corrupted (i.e. partially updated) set of data, getting an incorrect result or (often) just plain crashing.  Note that this can occur even on single-core systems, since the point at which the OS chooses to pre-emptively switch threads is outside the application's control.  To achieve thread safety you need to either use locks of one sort or another, or develop a lockless algorithm that relies on guaranteed atomicity of certain operations.

Broadly speaking, there are 3 levels of primitives you can choose to employ:

  1. OS-level synchronization primitives - semaphores, mutexes, events, etc.  I'm not going to go into the details of all the various options available (the list is extensive), but canonically consider a mutex.  This is an entity that can be 'owned' by a single thread, and has an API that allows threads to 'request' and 'release' it.  A thread requesting an un-owned mutex will take ownership, and continue to process.  When it is done it releases it.  If a thread requests a mutex that is already owned, the OS will block that thread's execution until the owner releases it.  Hence you can use mutexes to guarantee that only a single thread is in some critical piece of code at any one time.
  2. Process-level synchronization primitives - essentially critical sections.  A critical section is in effect a mutex that only operates within a single process.
  3. Interlocked functions, which provide (processor-level) guarantees of atomicity.  Intel x86 CPUs have a capability for some instructions to be 'locked', such that their operation is guaranteed to be atomic across the system, even in multi-CPU systems.  These are exposed via Windows (or compiler intrinsics, but not in the ancient compiler we're stuck with) as the Interlocked... set of APIs.  Canonically:
    • InterlockedIncrement - adds 1 to a (32 bit) variable atomically, returning the value it ended up as
    • InterlockedDecrement - subtracts 1 atomically, returning the resulting value
    • InterlockedCompareExchange - compare a variable against a provided value, and if they are equal set it to another value - all occurring atomically
Each has pros and cons:
  • The OS-level primitives are easy to use and come in many useful forms.  However, a call to one always requires a context switch into kernel space, which is expensive (think order of 1000 CPU cycles expensive!)
  • Critical sections are relatively fast in at least the uncontended case (that is, if the section is not owned, it is fairly fast to take ownership and later to release it), but they only operate within a single process (which is fine for us).  In the contended case they still cause a thread switch, and hence a drop into kernel space, so contention gets expensive.  There is an API that allows you to set a spin wait count to mitigate this - a detail rather than a fundamental shift in behavior at the level we're concerned with, albeit one we want to take advantage of in many cases when we use critical sections (see the sketch after this list)
  • Interlocked functions are much harder to code complex algorithms with, and are generally suitable for very small-scale usage (like using InterlockedIncrement rather than just the '++' operator with a critical section).  However, if something really needs the tightest possible optimization, it is sometimes possible to construct lockless algorithms around the use of the Interlocked functions.  Even Interlocked functions have some performance overhead, which I'll come back to.
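
As a concrete illustration of the middle option, this is roughly how a critical section with a spin count is used via the Win32 API (the shared 64-bit total being protected here is just an example):

#include <windows.h>

static CRITICAL_SECTION g_cs;
static LONGLONG         g_total = 0;   // 64-bit total shared between threads

void InitSharedState()
{
    // The spin count means a contending thread will busy-wait briefly before
    // dropping into the kernel to block, which helps when the lock is only
    // ever held for a few instructions.
    InitializeCriticalSectionAndSpinCount(&g_cs, 4000);
}

void AddToTotal(LONGLONG elapsed)
{
    EnterCriticalSection(&g_cs);       // fast if uncontended; expensive if contended
    g_total += elapsed;                // safe: only one thread can be in here at a time
    LeaveCriticalSection(&g_cs);
}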

What I discovered, with my initial naïve attempt, was that the use of critical sections in profiler function entry processing was highly distorting (the very act of profiling added tens of percent to many functions, and hundreds to some very frequently called ones).  This was even in the single threaded case - where contention occurred due to multiple threads executing the same code, it got much, much worse!  In short, having to take a critical section inside of the processing of profiling a function entry was not acceptable.

I then spent some time and was able to recode it using only Interlocked functions, making it non-blocking, which totally removes the contention problem.  However, to my surprise, even this produced massive distortions!  At that point I started digging into the exact cost of an Interlocked function.  Although I'd realized it had some overhead (CPU caches have to synchronize at what are known as 'memory fences', and out-of-order execution pipelines in the CPU have to be flushed and restarted), I had not comprehended just how expensive this can be.  Consider:

result = ++iValue;

vs

result = InterlockedIncrement(&iValue);

Semantically they are the same (well, they are in a single-threaded environment at least), but the first will execute (subject to compiler optimizations on where it has put result and iValue) in as little as a single CPU cycle (250 picoseconds, say), whereas the second will typically cost hundreds of cycles, due to pipeline stalls and so on!

So, we need a totally lockless algorithm, which is a bit harder to achieve!

What I ended up doing is this:
  1. Only support up to some (small) finite maximum number of concurrent threads (to be profiled at least) - I chose 8
  2. For the key shared fields in the sample structure (EntryCount, and the accumulated time totals) replace the single value with an array of MAX_THREADS+1 values
  3. On first call from a thread, take a critical section lock (only on thread entry and exit will we need to do this), and find an unused slot in a table of size MAX_THREADS (i.e. - give the thread a unique index in the arrays declared in (2) to use).  Record in this table which slots are occupied, and save the thread's slot id in a thread-local variable
  4. On exit from the root call for a thread (thread exit from profiling, essentially) take the lock again and, for each sample in the entire profile, add the accumulator values from the thread's slot into an aggregate maintained in a reserved slot (the extra '+1' slot from step 2) - thus as threads exit (from profiling) their profile data is added into the aggregate total.  Mark the thread's slot id as unused again in the slots table as the thread exits profiling.  This step is very expensive, but only occurs at thread exit, so it's essentially irrelevant on the overall scale of the entire profile.
  5. At profile exit dump the values from the reserved slot (which has aggregated all individual thread's contributions) to the output log.
The net result of all this is no need to ever lock anything except at thread profile start and end.
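
Sketched in code (with illustrative names, sizes and bookkeeping, and the per-thread slot index held in another thread-local variable), the scheme looks something like this:

#include <windows.h>

#define MAX_PROFILED_THREADS 8       // step 1: small fixed maximum

struct ProfileSample
{
    const char* Name;
    // Step 2: one accumulator per potential thread, plus one extra reserved
    // slot (index MAX_PROFILED_THREADS) that aggregates exited threads' totals.
    LONGLONG    TotalTime[MAX_PROFILED_THREADS + 1];
    int         NumCalls[MAX_PROFILED_THREADS + 1];
};

// Initialized once with InitializeCriticalSection; only taken at thread profile start/end.
static CRITICAL_SECTION g_slotLock;
static bool             g_slotInUse[MAX_PROFILED_THREADS] = { false };
__declspec(thread) int  g_threadSlot = -1;   // this thread's slot index, -1 = not profiled

// Step 3: on first profiled call from a thread, hand it a slot.
bool AllocateThreadSlot()
{
    EnterCriticalSection(&g_slotLock);
    for (int i = 0; i < MAX_PROFILED_THREADS; i++)
    {
        if (!g_slotInUse[i])
        {
            g_slotInUse[i] = true;
            g_threadSlot = i;
            break;
        }
    }
    LeaveCriticalSection(&g_slotLock);
    return g_threadSlot != -1;       // false => too many threads; don't profile this one
}

// During normal profiling no locking is needed at all: each thread only ever
// touches its own slot in each sample's arrays.
void AccumulateTime(ProfileSample* pSample, LONGLONG elapsed)
{
    if (g_threadSlot < 0)
        return;                      // this thread is not being profiled
    pSample->TotalTime[g_threadSlot] += elapsed;
    pSample->NumCalls[g_threadSlot]++;
}

// Step 4: when a thread leaves profiling, fold its slot into the reserved
// aggregate slot for every sample, then free the slot for reuse.
void ReleaseThreadSlot(ProfileSample* apSamples[], int iNumSamples)
{
    EnterCriticalSection(&g_slotLock);
    for (int i = 0; i < iNumSamples; i++)
    {
        apSamples[i]->TotalTime[MAX_PROFILED_THREADS] += apSamples[i]->TotalTime[g_threadSlot];
        apSamples[i]->NumCalls[MAX_PROFILED_THREADS]  += apSamples[i]->NumCalls[g_threadSlot];
        apSamples[i]->TotalTime[g_threadSlot] = 0;     // reset for the next thread to use this slot
        apSamples[i]->NumCalls[g_threadSlot]  = 0;
    }
    g_slotInUse[g_threadSlot] = false;
    g_threadSlot = -1;
    LeaveCriticalSection(&g_slotLock);
}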

Profiler changes DLL modders may need to be aware of

Macros

The macros used are unchanged, with the exception of the addition of:

PROFILE_THREAD(name)

usage example:

void MyThread(void)
{
    PROFILE_THREAD("MyThreadRoot");

    // Do stuff
    ...
}

This macro should be placed (typically) in the root thread function of any thread you wish to enable profiling on (apart from the main thread, which is always profiled).  It serves the twin purpose of turning on profiling for that thread (by default a thread will not be), and declaring a root profiling entry.

Global Defines

I have also added some new global defines as follows (in 'A_New_Dawn_GlobalDefines.xml'):
  • (Boolean) 'ENABLE_BACKGROUND_PROFILING' (default value '1').  If this is set to 0, all profiling of threads other than the main thread will be disabled.
  • (String) 'PROFILER_ALTERNATE_SAMPLE_SET_SOURCE'.  If this is set to the name of a profile section (e.g. the name of a profiled method), then for each entry in the report, the time spent in the named section within call stacks rooted at that entry will accrue into that entry's 'Alternate sample set' (see below).  For example, if we see a function showing up a lot on the overall profile and want to know more about the call paths it gets executed in, we can set it as the alternate sample, and this will tell us under which ancestor methods the (expensive) calls are taking place.  Pathing is a good example - suppose I know that a lot of time is spent pathing, but I want to look into optimizing the use of the pathing (reducing the number of calls) rather than the pathing engine itself.  To do that I need to know which callers are accounting for the most time (in this case I'll find that AttackCityMove is responsible for a lot of it, as it happens).  An example of how these defines might appear in the XML follows this list.
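
For reference, the GlobalDefines XML files use the standard Civ4 Define layout, so (assuming that layout, and using the alternate sample from the example spreadsheet) the entries might look something like this:

<Define>
    <DefineName>ENABLE_BACKGROUND_PROFILING</DefineName>
    <iDefineIntVal>1</iDefineIntVal>
</Define>
<Define>
    <DefineName>PROFILER_ALTERNATE_SAMPLE_SET_SOURCE</DefineName>
    <DefineTextVal>CvTeam::setHasTech</DefineTextVal>
</Define>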

Changes to the output table in the profile log

The log example above is from the new profiler.  Compared to the old version there are a few extra columns:
  • Main Thread Time - this column shows how much of the total time (reported in 'Time') was on the main thread.  This is important to know, because (unless we're trying to optimize heat output at least), it is only main thread elapsed time that ultimately matters (the rest is happening in parallel)
  • Alternate Time - this is time accrued by the alternate sample as discussed above.  In the example spreadsheet shown earlier, the 'alternate sample set' was set to 'CvTeam::setHasTech'.  The results are unremarkable, but note that although most time accrues from CvPlayer::doResearch, there are also some calls from other paths

Other Interesting Things I learnt

One other interesting discovery occurred while I was testing this and the general framework for multi-threading (which I'll describe in my next entry, but this is as good a place as any to mention the discovery).

It turns out that something (I think the boost threading library, but that's just a hypothesis which I have not confirmed) catches structured exceptions on background threads (at least those launched via boost::thread).  The net effect of this is that (without further coding at least), if a crash condition (typically a bad memory reference or divide by zero) occurs on a background thread, no minidump is created, the thread silently terminates, and execution continues (or at least attempts to)!

I have rectified this for the threads we currently have (that are boost-launched) with some explicit code to call the minidump creator, and then exit the process.
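
For anyone wanting to do something similar, the shape of the fix is roughly as below.  This is a sketch rather than the actual C2C code - the dump file name, dump options and the worker function (DoThreadWork) are placeholders:

#include <windows.h>
#include <dbghelp.h>                 // for MiniDumpWriteDump; link with dbghelp.lib

// Exception filter: write a minidump for the faulting thread.  The file name,
// dump type and (lack of) error handling here are purely illustrative.
static LONG WriteThreadMiniDump(EXCEPTION_POINTERS* pExceptionInfo)
{
    HANDLE hFile = CreateFileA("BackgroundThreadCrash.dmp", GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile != INVALID_HANDLE_VALUE)
    {
        MINIDUMP_EXCEPTION_INFORMATION dumpInfo;
        dumpInfo.ThreadId = GetCurrentThreadId();
        dumpInfo.ExceptionPointers = pExceptionInfo;
        dumpInfo.ClientPointers = FALSE;

        MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), hFile,
                          MiniDumpNormal, &dumpInfo, NULL, NULL);
        CloseHandle(hFile);
    }
    return EXCEPTION_EXECUTE_HANDLER;
}

void DoThreadWork();                 // hypothetical - the thread's real payload

// Root function handed to boost::thread: any structured exception in the
// payload produces a dump and kills the process rather than dying silently.
void MyThreadRoot()
{
    __try
    {
        DoThreadWork();
    }
    __except (WriteThreadMiniDump(GetExceptionInformation()))
    {
        ExitProcess(1);              // don't let execution limp on after a crash
    }
}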
