Trends and the Future of
Computing and the Internet

Symmetric Multi-Processing and Threaded Algorithms



Introduction

Modern computers have multi-core processors primarily because we have reached a plateau in the size and speed achievable on a silicon wafer. It had to happen eventually: the number of atoms needed to form a gate can only be reduced so far before you hit a hard limit, even if that limit is '1'. The speed of light and the physical properties of silicon govern the switching time of a gate, and the time it takes for a signal to travel from one place on the die to another. Physics just won't let us make processors much faster or smaller than they are now, without some sort of exotic or radical change in the way they are made. Yet people want faster computers, and our current technology can't produce superconducting or quantum processors, at least not yet...

As a compromise, chip manufacturers cram as many CPU cores as they can onto the same piece of silicon, and then provide sufficient logic that the cores 'play well together'. As an example, when one CPU core modifies memory, it can transparently notify the others, so that they flush their L1 caches and re-read any changed values from the next cache level (usually L2) as necessary. And of course the operating system must be written to take advantage of these built-in features as much as possible.

But even though you now have (effectively) SEVERAL computers under the hood of your machine, most of the applications you run are STILL SINGLE THREADED, so that only a single CPU core is in use at any given time. Though the operating system makes use of other cores from time to time (running certain background processes, drawing on the screen, processing keyboard and mouse input, and controlling disk operations), the majority of an application STILL operates in a single 'thread of execution', on a single CPU core. The other core(s) remain idle, effectively twiddling their thumbs. For all the good it does you, you might as well have purchased a two-year-old machine instead of shelling out more cash for a multi-core one.

Legacy Libraries and Linear Logic

Most, if not all, of the various language libraries that provide math and utility functions use the same basic approach that computers have used since the punch-card era. Execution of a program or function starts in one place, loops around a bit, maybe calls some other functions, then exits back to whatever called it, all in a STRAIGHT LINE. This is one-dimensional coding. It works well, and has been in use for as long as there have been computers. Everyone has seen the classic 'flow chart' example, and probably learned it that way in school. And for simple algorithms and non-speed-critical applications, this is perfectly acceptable.

(pseudo code, typical function)

    myfunc(parameters)
    {
      input_and_validate

      the_loop
      {
        do_something
      }

      cleanup_and_return
    }

HOWEVER, for a large application that does a LOT of independent calculations, it's really unacceptable, particularly if it forces the end-user to wait for results and leaves all but one CPU core idle.

Further, the types of calculations being done might seem "too complex" for the average programmer, who still thinks in a straight line for the most part. One-dimensional coding is familiar, and so this model continues to be used, despite the fact that most computers now have at least 2 cores. And the existing language libraries have not been updated to use multi-threaded algorithms. In some cases that may be neither possible nor practical, but in other cases it is definitely both possible and practical to make the standard solution use multiple threads of execution instead of just one.

Background processing does (to some extent) take advantage of multi-core processing, and if written properly, even screen updates can be done independently of user input and calculations. These relatively trivial uses give an APPARENT increase in speed, which affects the user's perception of your application in a positive way.

Multi-threaded Algorithms (general)

To really take advantage of multi-core processors, the application itself needs to be designed with multiple threads in mind, from the beginning. This kind of design requires the ability to think in more than one dimension, which (sadly) MANY people cannot do without being trained to do it. Thinking in more than one dimension is a bit like juggling, or more accurately, it produces an algorithm or solution that is a bit like juggling. You have to get the timing and the motions right, but if you can, you may end up with something that is both functional and fast.

It would be a waste of time to make EVERYTHING threaded, particularly trivial things (unless you are performing an extremely large number of trivial things). Threaded programs require focused attention on synchronization, 'race conditions' (where two threads attempt to alter the same value without synchronizing first, and the last to alter it 'wins' the race, completely overwriting the first thread's changes), inconsistent data (like the race condition, except that only part of the data is incorrect), and 'deadlock' (where two threads wait forever for each other, or for a resource that has been exhausted and cannot be replenished because of the deadlock). All of these conditions are well known to those who write multi-threaded applications, but understanding them well enough that avoiding them becomes second nature takes time and experience. And schools may not teach it at all.
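The race condition described above, and the usual mutex cure for it, can be sketched in a few lines of C++. This is an illustrative example only (the names are invented for this sketch): two threads each perform many increments of a shared counter, and the lock serializes the read-modify-write so that no increment is lost.

```cpp
// Two threads increment a shared counter; the mutex prevents the
// 'last writer wins' race described above.
#include <mutex>
#include <thread>

long counter = 0;
std::mutex counter_lock;

void add_many(int n)
{
    for (int i = 0; i < n; ++i)
    {
        std::lock_guard<std::mutex> hold(counter_lock); // serialize the read-modify-write
        ++counter;                                      // safe: one thread at a time
    }
}

long run_two_threads(int n_each)
{
    counter = 0;
    std::thread a(add_many, n_each);
    std::thread b(add_many, n_each);
    a.join();
    b.join();
    return counter;   // with the mutex held, always exactly 2 * n_each
}
```

Remove the lock_guard and the final count will usually come up short, because concurrent increments overwrite one another — exactly the race condition described above.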

A multi-threaded solution or algorithm might start execution at one point, but then divide up the work and assign it to different 'threads of execution', each running on a different CPU core. The caller would then do its own processing and wait for the other threads to finish theirs. When properly designed, you end up utilizing all of the available CPU cores for most of the processing time needed. This would be two-dimensional coding. In this case, it's likely that the work units are non-overlapping, and you simply assign a portion of the work to each thread until all of the work units have been completed.
A good example might be a Fourier transform, where each work unit consists of the calculations for a single harmonic. Each thread would request the next work unit (a harmonic), compute and store its results, and repeat the process until all work units (harmonics) are completed. Since each thread executes independently without affecting the results of any other, synchronization isn't a problem. It is only necessary to know when all of the work units have been completed, at which point the main function returns the results.
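The non-overlapping work-unit scheme above can be sketched as follows. An atomic counter acts as the work scheduler, handing out unit numbers; each thread claims units until none remain, and joining the threads tells the caller that all work units are done. Here `compute_harmonic` is a hypothetical stand-in for the real per-harmonic math, not an actual library function.

```cpp
// Worker threads pull independent work units ("harmonics") from a
// shared atomic counter; each result goes into a slot no other
// thread touches, so no locking is needed.
#include <atomic>
#include <thread>
#include <vector>

double compute_harmonic(int n)          // placeholder for the real calculation
{
    return 1.0 / (n + 1);
}

std::vector<double> run_all_harmonics(int total, int n_threads)
{
    std::vector<double> results(total);
    std::atomic<int> next_unit(0);      // the 'work scheduler': hands out unit numbers

    auto worker = [&]() {
        for (;;) {
            int unit = next_unit.fetch_add(1);      // claim the next work unit
            if (unit >= total) break;               // all units handed out
            results[unit] = compute_harmonic(unit); // no overlap with other threads
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < n_threads; ++i)
        pool.emplace_back(worker);
    for (auto& t : pool)
        t.join();       // 'know when all of the work units have been completed'
    return results;
}
```

To the caller, `run_all_harmonics` looks like an ordinary linear function, even though internally every available core can be busy.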

But what if the work units DO overlap, or are dependent upon one another? Now you have to schedule work units such that one work unit can wait on another, and perhaps the waiting thread can choose a different work unit to occupy its time in the meantime. This would be three-dimensional coding: you have temporal requirements (waiting for the completion of one or more work units) in addition to the splitting of the task into work units (the other two dimensions).
A good example of this might be a quick sort algorithm. As review, a quick sort works basically like this:

(pseudo-code)

   myqsort(array, compare, low, high)
   {
     if (low >= high)
       return;  /* a single entry (or none) is already sorted */

     pivot(array, compare, low, high, &pivot_down, &pivot_up);
     myqsort(array, compare, low, pivot_down);
     myqsort(array, compare, pivot_up, high);
   }

The first portion of the algorithm, 'pivot', ensures that everything below the pivot point ends up between 'low' and 'pivot_down', and that everything above the pivot point ends up between 'pivot_up' and 'high'. You then recursively call the function to sort each half, until a 'half' consists of at most a single entry. The good news here is that each call to 'myqsort' can become a single work unit. Such a work unit would first call 'pivot', schedule a 2nd work unit for one of the recursive 'myqsort' calls, and then perform the other 'myqsort' call itself. When both recursive processes are complete, the function returns. To make this work properly, you must be able to allocate, schedule, complete, and signal completion using a work unit as an object, and be able to wait for completion notification from any level above where you currently are. The end results are actually somewhat impressive: 4 cores should improve the quick sort's performance by a factor of 2 to 3, depending upon how good your pivot point selection is and how 'ordered' your data is [the worst case for quick sort is data that is close to reverse order].
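A minimal working sketch of this threaded quick sort is shown below, using `std::async` in place of a hand-built work-unit scheduler: each call pivots, hands one half to another thread, and recurses on the other half itself. The depth limit is an assumption added for this sketch, so the recursion doesn't spawn an unbounded number of threads.

```cpp
// Threaded quick sort: one half of each partition is scheduled as a
// separate work unit (up to a depth limit), the other is sorted by
// the current thread, which then waits for the scheduled half.
#include <algorithm>
#include <functional>
#include <future>
#include <vector>

void par_qsort(std::vector<int>& a, int low, int high, int depth)
{
    if (low >= high) return;                   // base case: 0 or 1 element
    int pivot = a[low + (high - low) / 2];
    int i = low, j = high;
    while (i <= j) {                           // Hoare-style pivot pass
        while (a[i] < pivot) ++i;
        while (a[j] > pivot) --j;
        if (i <= j) std::swap(a[i++], a[j--]);
    }
    if (depth > 0) {
        // schedule one half as a 2nd work unit, sort the other ourselves
        auto left = std::async(std::launch::async,
                               par_qsort, std::ref(a), low, j, depth - 1);
        par_qsort(a, i, high, depth - 1);
        left.wait();                           // wait for the scheduled half
    } else {
        par_qsort(a, low, j, 0);               // deep enough: stay single-threaded
        par_qsort(a, i, high, 0);
    }
}
```

With a depth limit of 2, up to four threads can be sorting partitions at once, which is where the 2x to 3x figure on 4 cores comes from.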

In general, for a multi-threaded algorithm to work efficiently, you will need an entry point that implements a Process Control of some kind. This will typically start one or more 'worker' threads, create (or initialize) a work scheduler, divide the work load into appropriate segments or work units, and wait for all of the processes to complete before consolidating the results and returning them to the caller. This gives the caller the illusion of a linear process, making it possible for single-threaded code to invoke a threaded algorithm seamlessly.

Alternatively, you can make use of a Notification to the caller, so long as the caller is thread-aware. The caller would typically invoke an asynchronous function, passing an object or function capable of notifying the caller when the process is complete. This technique is commonly used for asynchronous I/O, as well as for various kinds of background processing, such as re-calculating a spreadsheet, printing, or saving snapshots of a document that is being actively edited.
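The notification style above can be sketched as follows. The caller passes a callback, the asynchronous function returns immediately, and a worker thread invokes the callback when the job finishes. The names (`recalculate_async`, the doubling 'work') are invented for this sketch, not from any real API; a real design would also define which thread the callback runs on.

```cpp
// Asynchronous function with completion notification: the worker
// thread calls the caller-supplied callback when the work is done.
#include <functional>
#include <thread>

std::thread recalculate_async(int input, std::function<void(int)> on_done)
{
    return std::thread([input, on_done]() {
        int result = input * 2;   // stand-in for a long recalculation
        on_done(result);          // notify the (thread-aware) caller
    });
}
```

A thread-aware caller might use it like this: start the operation, keep interacting with the user, and pick up the result when the notification arrives (here, simply by joining the thread).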

Background processes are normally chosen for things that don't require speed, but still need to be done without impacting user interaction. A user does not expect a document to be immediately printed when he activates the 'print' function, but he does expect to immediately return to the main menu, or to be able to continue working with the document from that point. Background printing is almost a necessity these days. But other operations may not be so easily done in the background, such as re-calculating all of the data in a database after the user changes some of the data or data parameters. The user will typically be waiting for an answer, and as such a threaded algorithm can help deliver it more quickly.

Multi-threaded Algorithms - Pitfalls

For trivial applications, converting them to use multiple threads may require a lot of extra effort, and use additional system resources, without any visible benefit. Trivial calculations and applications with no performance restrictions really don't need the extra speed that a threaded algorithm can provide. Ultimately it's user perception that should govern whether the advantages of using multiple threads outweigh the additional development time, additional resource use, and additional possibility of failure (due to the more complicated programming involved). Where possible, the simpler method is usually best, provided that there is no perceptible difference.

It's also worth mentioning that MOST people will not know what 'multiple threads' means. Users may not even understand that their computer has multiple CPU cores in it, and may opt for 8 cores over 2 simply because 8 is bigger than 2. They are then disappointed to discover that their 8 core machine does not perform perceptibly better than someone else's 2 core machine. It all comes back to the user's perception of speed: clicking a button should result in instantaneous action and completion of that action, instead of delays, hourglass cursors, and unpainted regions on the screen.

We Need More Multi-threaded Code

Regardless of the difficulty involved, we need more multi-threaded code to be written, adapting existing single-thread code to use multiple threads whenever it is being used for time-consuming operations. This is especially true with respect to multi-media and games. Even if only a single thread is responsible for actually rendering a frame onto the display surface, several threads can be implemented to perform various parts of the rendering, and then schedule the rendered surface to be drawn by the primary 'rendering' thread.

One clear example would be the decoding and rendering of HD video on a PC. I have seen several different applications that are capable of doing this, but only the fastest computers and display hardware seem to be able to properly render a true HD image. This is not because the hardware on other machines is too slow; it is because the application that renders the video is SINGLE THREADED. For each frame, which typically represents 1/30 of a second, linear time is used to FIRST decode the image, THEN render the frame into the video hardware. The thread waits for the video hardware (often accelerated) to COMPLETE its rendering, and only then begins decoding the NEXT frame. CPU utilization typically does not exceed a small percentage, and yet the video player complains about late and dropped frames.

So how can this 'typical' video player be fixed to perform better? The simplest solution would be to use a set of worker threads that decode frames ahead of time and schedule display surfaces to be rendered at the correct frame display time, while a 'rendering' thread does the actual transfer to the video hardware for each frame's display surface. So instead of decode 1, render 1, decode 2, render 2, etc., you have decode 1, schedule 1, decode 2, schedule 2 running in parallel with de-schedule 1, render 1, de-schedule 2, render 2 (with appropriate frame timing applied to the rendering).
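The decode-ahead scheme above is a classic producer/consumer pipeline, sketched below with a condition-variable-protected queue. Frames are represented as plain integers; `decode_frame` and `render_frame` are placeholders (noted in the comments) for the real decoding and hardware transfer, and a real player would also bound the queue and apply frame timing.

```cpp
// Decoder thread produces 'frames' into a queue while the single
// 'rendering' thread consumes them, so a slow frame no longer
// stalls the display.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct FrameQueue {
    std::queue<int> frames;
    std::mutex lock;
    std::condition_variable ready;
    bool done = false;

    void push(int frame) {
        { std::lock_guard<std::mutex> hold(lock); frames.push(frame); }
        ready.notify_one();
    }
    bool pop(int& frame) {          // returns false once the stream has ended
        std::unique_lock<std::mutex> hold(lock);
        ready.wait(hold, [&] { return !frames.empty() || done; });
        if (frames.empty()) return false;
        frame = frames.front();
        frames.pop();
        return true;
    }
    void finish() {
        { std::lock_guard<std::mutex> hold(lock); done = true; }
        ready.notify_all();
    }
};

std::vector<int> play(int n_frames)
{
    FrameQueue q;
    std::vector<int> rendered;
    std::thread decoder([&] {            // decode ahead of time
        for (int f = 0; f < n_frames; ++f)
            q.push(f);                   // stand-in for decode_frame(f)
        q.finish();
    });
    std::thread renderer([&] {           // the single 'rendering' thread
        int f;
        while (q.pop(f))
            rendered.push_back(f);       // stand-in for render_frame(f)
    });
    decoder.join();
    renderer.join();
    return rendered;
}
```

Because the decoder runs ahead of the renderer, an occasional slow-to-decode frame is absorbed by the queue instead of causing a dropped frame.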

When many cores are available, it may make sense to do the MPEG decoding with several 'staggered' threads in parallel. MPEG decoding usually requires that the previous frame be decoded correctly before the next frame can be decoded, the exception being I frames (which encode the entire image). You could exploit this by having I frames decoded and scheduled independently of the other (P and B) frames, while a different thread decodes the P and B frames, taking the I frame data as its input whenever appropriate and modifying its 'working copy' otherwise. Other, 'more aggressive' threading schemes could still be applied here, but with less impact on overall performance.

The most important factor for quality HD video playback is to allow some frames to take more time to decode than the inter-frame interval would normally allow, through the use of decoding threads, a 'rendering' scheduler, and sufficient buffering. User perception would then focus on how cleanly the video and audio are rendered, with proper timing and no defects or 'stuttering' in the playback. Without threads, a complex video may cause some frames to take too long to decode, resulting in dropped frames or 'stuttering' in the audio or video playback. This is obviously unacceptable. Yet the use of threads is the exception, and not the norm, for multimedia. As a result, people often purchase hardware that is unnecessarily expensive for their needs, to avoid problems that could have been overcome by proper software design and the use of threaded algorithms.

Some examples of multi-threaded algorithms



©2010-2013 by Stewart~Frazier Tools, Inc. - all rights reserved
last updated: 8/16/2013