Nian-Feng Tzeng
Center for Advanced Computer Studies
University of Louisiana at Lafayette
The performance of a shared-memory multiprocessor is dictated in part by the
latency of fetching global memory locations through the interconnection network.
We investigated techniques to improve access latency
under various traffic patterns in multiprocessors.
The use of locks for critical sections is fundamental in parallel execution,
and an efficient lock implementation makes it possible to build scalable,
high-performance parallel systems.
We have considered lock mechanisms for both shared-memory and distributed-memory machines,
because an efficient lock makes it feasible to
construct distributed shared memory (DSM) systems on top of distributed-memory machines,
enforcing data coherence
without high overhead and thereby ensuring good performance.
A cost-effective combining structure for large-scale multiprocessors is realized by separating the combining function from the routing function, which makes a binary tree-based combining configuration feasible. At considerably lower cost, this combining structure is well suited to applications where hot-spot traffic can be differentiated from regular traffic at run time. An approach that provides good performance regardless of whether hot-spot requests are combinable is also proposed. It is based on multiple queues at the network primary inputs and at the switch inputs. The queues in a switch share a fixed amount of storage, allocated so that each queue permanently holds one slot and dynamically shares the remaining storage with the other queues. This design improves network bandwidth in the absence of hot spots and alleviates the impact of tree saturation on regular requests.
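The storage-sharing rule described above (one permanently reserved slot per queue, with the rest of a fixed buffer shared dynamically) can be illustrated with a small sketch. This is only a software model under stated assumptions, not the switch design itself: the class name `SharedBufferSwitch`, its methods, and the rule that a queue uses its reserved slot first are all hypothetical choices made for illustration.

```python
from collections import deque


class SharedBufferSwitch:
    """Hypothetical model: several queues share a fixed number of slots.

    Each queue owns one reserved slot; the remaining slots form a pool
    that any queue may borrow from while space is available.
    """

    def __init__(self, num_queues, total_slots):
        if total_slots < num_queues:
            raise ValueError("need at least one slot per queue")
        self.queues = [deque() for _ in range(num_queues)]
        # Slots left over after giving every queue its reserved slot.
        self.shared_free = total_slots - num_queues

    def enqueue(self, q, request):
        """Try to buffer a request in queue q; return False if it must wait."""
        queue = self.queues[q]
        if not queue:
            # The queue's own reserved slot is free (assumed to be used first).
            queue.append(request)
            return True
        if self.shared_free > 0:
            # Borrow one slot from the shared pool.
            self.shared_free -= 1
            queue.append(request)
            return True
        return False

    def dequeue(self, q):
        """Remove the oldest request of queue q, or return None if empty."""
        queue = self.queues[q]
        if not queue:
            return None
        request = queue.popleft()
        if queue:
            # The queue held more than its reserved slot, so the freed
            # slot goes back to the shared pool.
            self.shared_free += 1
        return request
```

In this model a queue filled by hot-spot requests can exhaust the shared pool but can never take another queue's reserved slot, so a regular request arriving at an empty queue is always accepted.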
Distributed-memory systems need locks to realize synchronization, such as barrier synchronization, during parallel program execution. When a distributed shared memory system is constructed on a distributed-memory machine, data coherence is commonly enforced through mutual exclusion on the data cache lines of processors. An approach to mutual exclusion in distributed-memory systems has been considered in which processors waiting for the critical section get permission quickly, with as few message exchanges as possible; it often exhibits better performance than prior algorithms. The approach is naturally free of deadlock and starvation, and it provides simple fault recovery without employing any time-out mechanism.
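To make the notion of counting message exchanges per critical-section entry concrete, the sketch below models a classic centralized-coordinator scheme. It is a textbook baseline chosen for illustration, not the approach summarized above; the class `Coordinator` and its message counter are hypothetical names. Each entry costs three messages here: a request, a grant, and a release.

```python
from collections import deque


class Coordinator:
    """Textbook centralized mutual exclusion (illustrative baseline only)."""

    def __init__(self):
        self.holder = None       # process currently in the critical section
        self.waiting = deque()   # FIFO order keeps the scheme starvation-free
        self.messages = 0        # total messages exchanged so far

    def request(self, pid):
        self.messages += 1       # REQUEST: process -> coordinator
        if self.holder is None:
            self._grant(pid)
        else:
            self.waiting.append(pid)

    def release(self, pid):
        if pid != self.holder:
            raise ValueError("only the current holder may release")
        self.messages += 1       # RELEASE: process -> coordinator
        self.holder = None
        if self.waiting:
            self._grant(self.waiting.popleft())

    def _grant(self, pid):
        self.messages += 1       # GRANT: coordinator -> process
        self.holder = pid
```

A scheme that aims for fewer exchanges would be compared against a count such as this one.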
For a parallel machine, system performance tends to be proportional to the computation power, which in turn is proportional to the system size. We measured the execution behavior of four applications on a reconfigured incomplete subcube using the Intel iPSC/860. This is the first experimental study of incomplete hypercubes. Each application is mapped onto the incomplete hypercube with load balance and minimum communication overhead in mind.
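As a rough illustration of the topology involved, the sketch below lists the neighbors of a node in an incomplete hypercube. It assumes the common definition in which nodes are labeled 0 to N-1, N need not be a power of two, and two nodes are linked when their labels differ in exactly one bit; the function name `neighbors` is hypothetical and the code does not reflect the mappings used in the experiments.

```python
def neighbors(node, num_nodes):
    """Neighbors of `node` in an incomplete hypercube with `num_nodes` nodes.

    Assumed definition: labels 0..num_nodes-1, with a link between two
    labels that differ in exactly one bit position.
    """
    if not 0 <= node < num_nodes:
        raise ValueError("node label out of range")
    dims = (num_nodes - 1).bit_length()  # dimension of the enclosing full cube
    result = []
    for d in range(dims):
        other = node ^ (1 << d)          # flip bit d
        if other < num_nodes:            # keep only nodes that actually exist
            result.append(other)
    return result


def link_count(num_nodes):
    """Total number of links, counting each undirected link once."""
    return sum(len(neighbors(v, num_nodes)) for v in range(num_nodes)) // 2
```

With 6 nodes, for example, node 0 keeps all three of its links while node 5 has only two, which is why a mapping has to take both load balance and communication into account.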
Send e-mail to: tzeng@cacs.louisiana.edu