Tutorials:TinyThreadHOWTO

From ALPS
Jump to: navigation, search

The code on which this tutorial is based on is currently under development (and not part of the ALPS cvs)

  • [#Why_do_multithread_computing.3F ]
  • [#Multithread_programming ]
  • [#Locking ]
  • [#The_framework ]
  • [#Synopsis ]
  • [#Template_parameters ]
  • [#Member_functions ]
  • [#Type_definitions ]
  • [#Configuration_macros ]
  • [#The_Class_bridge_t ]
  • [#Worker ]
  • [#Master ]
  • [#Run_the_programm ]
  • [#Finding_the_optimal_number_of_threads ]

Why do multithread computing?

With the advent of powerful dualcore processors multiprocessor shared memory machines have found their way into everyday desktop-based computing. However, making use of these resources and writing code that efficiently benefits from the multiple cores is a challenging tasks. One way to achieve this goal is by writing multithread applications.

This tutorial provides an introduction on how to write a multithreaded code and introduces a simple framework -- called TinyThread -- which greatly facilitates the code development. This tutorial extends the tutorial " ALPS Single Task Application" which we suggest to look at first. This tutorial will help you develop a simple multithread code that counts the number of prime numbers in a given interval.

Multithread programming

If compared to other parallelization schemes, there are some major differences between multithread programming as compared to message passing protocols such as MPI which are primarily intended for distributed memory machines. In contrast, multithread programs are intended for shared memory machines as the individual threads are supposed to access the same physical memory (which is the case for a dual-core architectore for example). The almost instantaneous memory access allows for much faster data exchange between parallel threads. However, subtle synchronization problems may occur which are overcome by a technique called [#Locking locking]. The TinyThread framework implements this technique and offers bridges, e.g. to pass data by callback functions from a main thread to a number of worker threads.

To use the full power of multithread programming it is crucial to push as much work as possible into such woker threads. However, it is not important that the overall workload is split into equal amounts. Specifically, you would normally create more threads than actuall processors, this is discussed in more detail the last section of this tutorial on [#Finding_the_optimal_number_of_threads finding the optimal number of threads]).

Locking

Because multithread applications use shared memory, it is important to make sure that two threads never simultaneously access/modify the same memory. The TinyThread framework is based on the [1] libary which offers 6 locking strategies. In the following only the simplest strategy of these strategies is discussed.

First, to make sure that every lock created by a thread is actually removed at some point, the [2] libary encapsulated the locking process in class objects, which assign a lock in the constructor and free it in the destructor. The simplest lock mechanism ist the boost::mutex. To create a lock you need to create an object of the type boost::mutex::scoped_lock and pass to the constructor an instance of boost::mutex (see the [#Master master section] for an example). Since you create this object in a local scope, it will be destroyed when you jump out of the function and the lock will be removed.

The framework

The TinyThread framework is a ``tiny" framework that is easy to use and helps you to write portable multithread applications. The major goal of this framework is to offer a mechanism to pass callbacks from a master thread to a number of so called worker threads. Since multithread applications cannot use the standard timer, the TinyThread framework measures time in seconds using a threadsave timer.

Synopsis

   tinythread {
           typedef time_t;
           template <class worker_t, bool use_threads = true> class Master {
                   public:
                   Master(std::size_t = TINYTHREAD_DEFAULT_THREAD_COUNT);
                   void set_computing_threads(std::size_t);
                   time_t execute0();
                   template<class callback_t_0> time_t executeN(callback_t_0);
                   ...
                   template<class callback_t_0, ..., class callback_t_(N-1)> time_t executeN(callback_t_0, ..., callback_t_(N-1));
           };
           class Worker {
                   public:
                   template<class bridge_t> void operator()(bridge_t) {}
           };
   }
   

Template parameters

Member functions

Type definitions

Configuration macros

The configuration macros can be changed by definig them before including the TinyThread header. There are two configuration macros in the TynyThread framework:

The Class bridge_t

The bridge_t object is passed to the Worker::operator(bridge_t). It offers an interface to the functions passed to the Master::executeN. The funcitons can be accessed by the public memberfunction bridge._M, where M goes from 0 to N-1.

Worker

First we need a worker class. This class fetches a number from the main thread and checks if it is prime. (This is a very slow algorithm. If you realy want to check if a number is prime use the primality test)


   class PrimeWorker {
           public:
           template<class bridge_t> void operator()(bridge_t bridge) {
                   long number, divisor;
                   while ((number = bridge._0()) > -1) {
                           if (number < 2) bridge._1(number, false);
                           else if (number == 2) bridge._1(number, true);
                           else {
                                   for (divisor = 2; divisor < (long)sqrt(number) + 1; divisor++)
                                   if (number % divisor == 0) {
                                           bridge._1(number, false);
                                           break;
                                   }
                                   if (divisor > (long)sqrt(number)) bridge._1(number, true);
                           }
                   }
           }
   };
   

Let us focus on the bridge object. The bridge object offers an interface to call the functions passed to the execute function. The long bridge._0() and void bridge._1(long, bool) represents the functions passed by the PrimeMaster (see [#Masternext section]).

Master

The master class controles the threads. The protocol works the following way: A thread calls the next_prime function to receive an integer to check if it is prime. When it has finished, it calls the register_prime function, to pass the checked number and the test result. You have to be careful that these two functions are not called by two threads at the same time. This is done with the mutex class from the [3] libary (see the [#Locking locking] section). To integrate the programm into ALPS we use the classes from SingeTask tutorial.


   class PrimeWorker: public tinythread::Master<PrimeThread>, public singletask::SingleWorker {
           public:
           PrimeWorker(const alps::ProcessList$amp; where, const alps::Parameters$amp; p, int node):
           singletask::SingleWorker(where, p, node)
           {}
           void dostep() {
                   min_ = parms["Min"]; max_ = parms["Max"];
                   current_ = min_; cnt_ = 0;
                   tinythread::time_t interval = execute2(
                   static_cast<boost::function<long()> >(boost::bind($amp;PrimeWorker::next_prime, this)),
                   static_cast<boost::function<void(long, bool)> >(boost::bind($amp;PrimeWorker::register_prime, this, _1, _2))
                   );
                   std::cout << "Between " << min_ << " and " << max_ << " there are " << cnt_ << " prime numbers." << std::endl
                   << "Time used: " << interval << "s" << std::endl;
           }
           long next_prime() {
                   boost::mutex::scoped_lock lock(mutex_);
                   return (current_ < max_ + 1)$nbsp;? current_++$nbsp;: -1;
           }
           void register_prime(long number, bool is_prime) {
                   boost::mutex::scoped_lock lock(mutex_);
                   if (is_prime)
                   cnt_++;
           }
           private:
           boost::mutex mutex_;
           long min_, max_, current_, cnt_;
   };
   

Let us focus on the following lines of code:


   tinythread::time_t interval = execute2(
   static_cast<boost::function<long()> >(boost::bind($amp;PrimeMaster::next_prime, this)),
   static_cast<boost::function<void(long, bool)> >(boost::bind($amp;PrimeMaster::register_prime, this, _1, _2))
   );
   

Here we start the execution of the threads. Because we want to pass two callback functions, we have to use the execute2 function of the TinyThread::Master class (see the [#Memberfunctions manual]). The [4] returns a unspecified-x-y type, which is not a calid callback type, so we cannot pass it directly to the execute function. We have to convert the output of Boost::boost to a valid callback function. This can be managed by the [5] libary.

To run the application as an ALPS application we neet to initialize the SingleWorker class:


   PrimeWorker(const alps::ProcessList$amp; where, const alps::Parameters$amp; p, int node):
   singletask::SingleWorker(where, p, node)
   {}
   

The ALPS::scheduler::Worker class provides a property params which represents all the properties you passed to the application in the parameter file. Here the variables Min and Max can be accessed.

Run the programm

To run the program we need a parameter file, to pass the range we want to scan:


   {Min=100;Max=100000}
   

Now we generate an XML-file and run the program:


   parameter2xml params
   primecounter.out params.in.xml
   

Finding the optimal number of threads

The last question we treat is: what is the optimal number of threads. We used the [/wiki/index.php?title=GraphGeneratorHOWTO$amp;action=edit Graph generator] to do some measurements. For each datapoint we run the application 16 times on one node with 4 Opteron 265 1.8 GHz processors. The plot shows the speedup by creating all graphs with 11 bonds (needed 10s with 64 threads), 12 bonds (needed 31s with 64 threads), 13 bonds (needed 2m 2s with 64 threads), 14 bonds (needed 8m 43s with 64 threads).

File:ThreadMeasurements.png

We decided to take 64 threads for our calculations.


$copy; 2007 by Lukas Gamper