### Transcription

Thinking in Parallel:Some Basic Data-Parallel Algorithms and TechniquesUzi Vishkin October 12, 2010Copyright 1992-2009, Uzi Vishkin. These class notes reflect the theorertical part in the ParallelAlgorithms course at UMD. The parallel programming part and its computer architecture context withinthe PRAM-On-Chip Explicit Multi-Threading (XMT) platform is provided through the XMT home pagewww.umiacs.umd.edu/users/vishkin/XMT and the class home page. Comments are welcome: pleasewrite to me using my last name at umd.edu 0

Contents1 Preface - A Case for Studying Parallel Algorithmics31.1Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .31.2More background and detail . . . . . . . . . . . . . . . . . . . . . . . . .52 Introduction2.18The PRAM Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .82.1.1Example of a PRAM algorithm . . . . . . . . . . . . . . . . . . .92.2Work-Depth presentation of algorithms . . . . . . . . . . . . . . . . . . .122.3Goals for Designers of Parallel Algorithms . . . . . . . . . . . . . . . . .162.4Some final Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . .172.4.1Default assumption regarding shared assumption access resolution172.4.2NC: A Related, Yet Different, Efficiency Goal . . . . . . . . . . .172.4.3On selection of material for these notes . . . . . . . . . . . . . . .182.4.4Level of material . . . . . . . . . . . . . . . . . . . . . . . . . . .183 Technique: Balanced Binary Trees; Problem: Prefix-Sums193.1Application - the Compaction Problem . . . . . . . . . . . . . . . . . . .213.2Recursive Presentation of the Prefix-Sums Algorithm . . . . . . . . . . .224 The Simple Merging-Sorting Cluster244.1Technique: Partitioning; Problem: Merging . . . . . . . . . . . . . . . . .244.2Technique: Divide and Conquer; Problem: Sorting-by-merging . . . . . .285 “Putting Things Together” - a First Example. Techniques: InformalWork-Depth (IWD), and Accelerating Cascades; Problem: Selection 305.1Accelerating Cascades . . . . . . . . . . . . . . . . . . . . . . . . . . . .305.2The Selection Algorithm (continued) . . . . . . . . . . . . . . . . . . . .325.3A top-down description methodology for parallel algorithms . . . . . . .345.4The Selection Algorithm (wrap-up) . . . . . . . . . . . . . . . . . . . . .356 Integer Sorting6.136An orthogonal-trees algorithm . . . . . . . . . . . . . . . . . . . . . . . .139

7 2-3 trees; Technique: Pipelining7.17.27.341Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .437.1.1Parallel search . . . . . . . . . . . . . . . . . . . . . . . . . . . . .43Insert. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .437.2.1Serial processing . . . . . . . . . . . . . . . . . . . . . . . . . . .437.2.2Parallel processing . . . . . . . . . . . . . . . . . . . . . . . . . .46Delete . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .487.3.1Serial processing . . . . . . . . . . . . . . . . . . . . . . . . . . .487.3.2Parallel processing . . . . . . . . . . . . . . . . . . . . . . . . . .508 Maximum Finding528.1A doubly-logarithmic Paradigm . . . . . . . . . . . . . . . . . . . . . . .538.2Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .569 The List Ranking Cluster; Techniques: Euler tours; pointer jumping;randomized and deterministic symmetry breaking589.1The Euler Tour Technique for Trees . . . . . . . . . . . . . . . . . . . . .589.1.1More tree problems . . . . . . . . . . . . . . . . . . . . . . . . . .629.2A first list ranking algorithm: Technique: Pointer Jumping . . . . . . . .649.3A Framework for Work-optimal List Ranking Algorithms . . . . . . . . .679.4Randomized Symmetry Breaking . . . . . . . . . . . . . . . . . . . . . .699.5Deterministic Symmetry Breaking . . . . . . . . . . . . . . . . . . . . . .719.6An Optimal-Work 2-Ruling set Algorithm . . . . . . . . . . . . . . . . .7510 Tree Contraction7610.1 Evaluating an arithmetic expression . . . . . . . . . . . . . . . . . . . . .11 Graph connectivity8011.1 A first connectivity algorithm . . . . . . . . . . . . . . . . . . . . . . . .11.1.1 Detailed description7984. . . . . . . . . . . . . . . . . . . . . . . . .8711.2 A second connectivity algorithm . . . . . . . . . . . . . . . . . . . . . . .8911.3 Minimum spanning forest. . . . . . . . . . . . . . . . . . . . . . . . . .9311.3.1 A first MSF algorithm . . . . . . . . . . . . . . . . . . . . . . . .9411.3.2 A second MSF algorithm . . . . . . . . . . . . . . . . . . . . . . .952

12 Bibliographic Notes9513 Acknowledgement9714 Index1021. Preface - A Case for Studying Parallel Algorithmics1.1. SummaryWe start with two kinds of justification, and proceed to a suggestion: Basic Need. Technological difficulties coupled with fundamental physical limitations will continue to lead computer designers into introducing an increasing amountof parallelism to essentially all the machines that they are building. Given a computational task, one important performance goal is faster completion time. In addition to this “single task completion time” objective, at least one other performanceobjective is also very important. Namely, increasing the number of computationaltasks that can be completed within a given time window. The latter “task throughput” objective is not addressed in the current notes. There are several ways in whichmachine parallelism can help in improving single task completion time. It wouldbe ideal if an existing program could be translated, using compiler methods, intoeffectively utilizing machine parallelism. Following decades of research, and somesignificant yet overall limited accomplishments, it is quite clear that, in general,such compiler methods are insufficient. Given a standard serial program, writtenin a serial performance language such as C, a fundamental problem for which compiler methods have been short handed is the extraction of parallelism. Namely,deriving from a program many operations that could be executed concurrently. Aneffective way for getting around this problem is to have the programmer conceptualize the parallelism in the algorithm at hand and express the concurrency thealgorithm permits in a computer program that allows such expression. Methodology - the system of methods and principles is new. Parallelism is a concernthat is missing from “traditional” algorithmic design. Unfortunately, it turns outthat most efficient serial data structures and quite a few serial algorithms providerather inefficient parallel algorithms. The design of parallel algorithms and datastructures, or even the design of existing algorithms and data structures for parallelism, require new paradigms and techniques. These notes attempt to providea short guided tour of some of the new concepts at a level and scope which makeit possible for inclusion as early as in an undergraduate curriculum in computerscience and engineering.3

Suggestion - where to teach this? We suggest to incorporate the design for parallelism of algorithms and data structures in the computer science and engineeringbasic curriculum. Turing award winner N. Wirth entitled one of his books: algorithms data structures programs. Instead of the current practice where computerscience and engineering students are taught to be in charge of incorporating datastructures in order to serialize an algorithms, they will be in charge of expressingits parallelism. Even this relatively modest goal of expressing parallelism whichis inherent in an existing (“serial”) algorithm requires non-trivial understanding.The current notes seek to provide such understanding. Since algorithmic designfor parallelism involves “first principles” that cannot be derived from other areas,we further suggest to include this topic in the standard curriculum for a bachelordegree in computer science and engineering, perhaps as a component in one of thecourses on algorithms and data-structures.To sharpen the above statements on the basic need, we consider two notions: machineparallelism and algorithm parallelism.Machine parallelism - Each possible state of a computer system, sometimes called itsinstantaneous description, can be presented by listing the contents of all its data cells,where data cells include memory cells and registers. For instance, pipelining with, says, single cycle stages, may be described by associating a data cell with each stage; all scells may change in a single cycle of the machine. More generally, a transition functionmay describe all possible changes of data cells that can occur in a single cycle; the set ofdata cells that change in a cycle define the machine parallelism of the cycle; a machine isliterally serial if the size of this set never exceeds one. Machine parallelism comes in suchforms as: (1) processor parallelism (a machine with several processors); (2) pipelining; or(3) in connection with the Very-Long Instruction Word (VLIW) technology, to mentionjust a few.We claim that literally serial machines hardly exist and that considerable increase inmachine parallelism is to be expected.Parallel algorithms - We will focus our attention on the design and analysis of efficientparallel algorithms within the Work-Depth (WD) model of parallel computation. Themain methodological goal of these notes is to cope with the ill-defined goal of educatingthe reader to “think in parallel”. For this purpose, we outline an informal modelof computation, called Informal Work-Depth (IWD). The presentation reaches thisimportant model of computation at a relatively late stage, when the reader is ready for it.There is no inconsistency between the centrality of the IWD and the focus on the WD,as explained next. WD allows to present algorithmic methods and paradigms includingtheir complexity analysis and the their in a rigorous manner, while IWD will be used foroutlining ideas and high level descriptions.The following two interrelated contexts may explain why the IWD model may bemore robust than any particular WD model.4

(i) Direct hardware implementation of some routines. It may be possible to implementsome routines, such as performing the sum or prefix-sum of n variables, within the sameperformance bounds as simply “reading these variables”. A reasonable rule-of-thumb forselecting a programmer’s model for parallel computation might be to start with somemodel that includes primitives which are considered essential, and then augment it withuseful primitives, as long as the cost of implementing them effectively does not increasethe cost of implementing the original model.(ii) The ongoing evolution of programming languages. Development of facilities for expressing parallelism is an important driving force there; popular programming languages(such as C and Fortran) are being augmented with constructs for this purpose. Constructs offered by a language may affect the programmer’s view on his/her model ofcomputation, and are likely to fit better the more loosely defined IWD. See reference toFetch-and-Add and Prefix-Sum constructs later.1.2. More background and detailA legacy of traditional computer science has been to seek appropriate levels of abstraction. But, why have abstractions worked? To what extent does the introduction ofabstractions between the user and the machine reduce the available computational capability? Following the ingenious insights of Alan Turing, in 1936, where he showed theexistence of a universal computing machine that can simulate any computing machine,we emphasize high-level computer languages. Such languages are much more convenientto human beings than are machine languages whose instructions consists of sequencesof zeros and ones that machine can execute. Programs written in the high-level languages can be translated into machine languages to yield the desired results withoutsacrificing expression power. Usually, the overheads involved are minimal and could beoffset only by very sophisticated machine language programs, and even then only afteran overwhelming investment in human time. In a nutshell, this manuscript is all aboutseeking and developing proper levels of abstractions for designing parallel algorithms andreasoning about their performance and correctness.We suggest that based on the state-of-the-art, the Work-Depth model has to be astandard programmer’s model for any successful general-purpose parallel machine. Inother words, our assertion implies that a general-purpose parallel machine cannot besuccessful unless it can be effectively programmed using the Work-Depth programmer’smodel. This does not mean that there will not be others styles of programming, or modelsof parallel computation, which some, or all, of these computer systems will support. Theauthor predicted in several position papers since the early 1980’s that the strongest nonparallel machine will continue in the future to outperform, as a general-purpose machine,any parallel machine that does not support the Work-Depth model. Indeed, currentlythere is no other parallel programming models which is a serious contender primarilysince no other model enables solving nearly as many problems as the Work-Depth model.5

However, a skeptical reader may wonder, why should Work-Depth be a preferred programmer’s model?We base our answer to this question on experience. For nearly thirty years, numerousresearchers have asked this very question, and quite a few alternative models of parallelcomputation have been suggested. Thousands of research papers were published withalgorithms for these models. This exciting research experience can be summarized asfollows: Unique knowledge-base. The knowledge-base on Work-Depth (or PRAM) algorithms exceeds in order of magnitude any knowledge-base of parallel algorithmswithin any other model. Paradigms and techniques that have been developed ledto efficient and fast parallel algorithms for numerous problems. This applies to adiversity of areas, including data-structures, computational geometry, graph problems, pattern matching, arithmetic computations and comparison problems. Thisprovides an overwhelming circumstantial evidence for the unique importance ofWork-Depth algorithms. Simplicity. A majority of the users of a future general-purpose parallel computerwould like, and/or need, the convenience of a simple programmer’s model, sincethey will not have the time to master advanced, complex computer science skills.Designing algorithms and developing computer programs is an intellectually demanding and time consuming job. Overall, the time for doing those representsthe most expensive component in using computers for applications. This truismapplies to parallel algorithms, parallel programming and parallel computers, aswell. The relative simplicity of the Work-Depth model is one of the main reasonsfor its broad appeal. The Work-Depth (or PRAM) model of computation stripsaway levels of algorithmic complexity concerning synchronization, reliability, datalocality, machine connectivity, and communication contention and thereby allowsthe algorithm designer to focus on the fundamental computational difficulties ofthe problem at hand. Indeed, the result has been a substantial number of efficientalgorithms designed in this model, as well as of design paradigms and utilities fordesigning such algorithms. Reusability. All generations of an evolutionary development of parallel machinesmust support a single robust programmer’s model. If such a model cannot bepromised, the whole development risks immediate failure, because of the following. Imagine a computer industry decision maker that considers whether to investseveral human-years in writing code for some computer application using a certainparallel programming language (or stick to his/her existing serial code). By thetime the code development will have been finished, the language is likely to become,or about to become, obsolete. The only reasonable business decision under this circumstances is simply not to do it. Machines that do not support a robust parallelprogramming language are likely to remain an academic exercise, since from the6

industry perspective, the test for successful parallel computers is their continuedusability. At the present time, “Work-Depth-related” programming language is theonly serious candidate for playing the role of such a robust programming language. To get started. Some sophisticated programmers of parallel machines are willing totune an algorithm that they design to the specifics of a certain parallel machine.The following methodology has become common practice among such programmers: start with a Work-Depth algorithm for the problem under consideration andadvance from there. Performance prediction. This point of performance prediction needs clarification,since the use the Work-Depth model for performance prediction of a buildable architecture is being developed concurrently with the current version of this publication.To make sure that the current paragraph remains current, we refer the interestedreader to the home page for the PRAM-On-Chip project at the University of Maryland. A pointer is provided in the section. Formal emulation. Early work has shown Work-Depth algorithms to be formallyemulatable on high interconnect machines, and formal machine designs that supporta large number of virtual processes can, in fact, give a speedup that approachesthe number of processors for some sufficiently large problems. Some new machinedesigns are aimed at realizing idealizations that support pipelined, virtual unit timeaccess of the Work-Depth model.Note that in the context of serial computation, which has of course been a tremendoussuccess story (the whole computer era), all the above points can be attributed to theserial random-access-machine (RAM) model of serial computation, which is arguably, a“standard programmer’s model” for a general-purpose serial machine. We finish withcommenting on what appears to be a common misconception: Misconception: The Work-Depth model, or the closely related PRAM model, aremachine models. These model are only meant to be convenient programmer’s models; in other words, design your algorithms for the Work-Depth, or the PRAM,model; use the algorithm to develop a computer program in an appropriate language; the machine software will later take your code and translate it to result inan effective use of a machine.Other approaches The approach advocated here for taking advantage of machine parallelism is certainly not the only one that has been proposed. Below two more approachesare noted: (i) Let compilers do it. A widely studied approach for taking advantage of suchparallelism is through automatic parallelization, where a compiler attempts to find parallelism, typically in programs written in a conventional language, such as C. As appealing7

as it may seem, this approach has not worked well in practice even for simpler languagessuch as Fortran. (ii) Parallel programming not through parallel algorithms. This handson mode of operation has been used primarily for the programming of massively parallelprocessors. A parallel program is often derived from a serial one through a multi-stageeffort by the programmer. This multi-stage effort tends to be rather involved since ittargets a “coarse-grained” parallel system that requires decomposition of the executioninto relatively large “chunk”. See, for example, Culler and Singh’s book on parallel computer architectures [CS99]. Many attribute the programming difficulty of such machinesto this methodology. In contrast, the approach presented in this text is much more similar to the serial approach as taught to computer science and engineering students. Asmany readers of this text would recognize, courses on algorithms and data structures arestandard in practically any undergraduate computer science and engineering curriculumand are considered a critical component in the education of programming. Envisioninga curriculum that addresses parallel computing, this manuscript could provide its basicalgorithms component. However, it should be noted that the approach presented in thecurrent text does not necessarily provide a good alternative for parallel systems whichare too coarse-grained.2. IntroductionWe start with describing a model of computation which is called the parallel randomaccess machine (PRAM). Besides its historical role, as the model for which many parallelalgorithms were originally written, it is easy to understand its assumption. We thenproceed to describe the Work-Depth (WD) model, which is essentially equivalent tothe PRAM model. The WD model, which is more convenient for describing parallelalgorithms, is the principal model for presenting parallel algorithms in these notes.2.1. The PRAM ModelWe review the basics of the PRAM model. A PRAM employs p synchronous processors,all having unit time access to a shared memory. Each processor has also a local memory.See Figure 1.At each time unit a processor can write into the shared memory (i.e., copy one of itslocal memory registers into a shared memory cell), read into shared memory (i.e., copya shared memory cell into one of its local memory registers ), or do some computationwith respect to its local memory. We will avoid entering this level of detail in describingPRAM algorithms. That is, an instruction of the form:processor i: c : a bwhere a, b and c are shared memory locations, will be a short form for instructing processor i to: first, copy location a into its local memory, second, copy location b into its localmemory, third, add them, and, fourth, write the result into location c. This paragraph is8

P1P2PnShared memoryFigure 1: Processors and shared memorya first example for selecting a level of abstraction, which as noted before, is an importanttheme in this manuscript.There are a variety of rules for resolving access conflicts to the same shared memorylocation. The most common are exclusive-read exclusive-write (EREW), concurrent-readexclusive-write (CREW), and concurrent-read concurrent-write (CRCW), giving rise toseveral PRAM models. An EREW PRAM does not allow simultaneous access by morethan one processor to the same memory location for read or write purposes, while aCREW PRAM allows concurrent access for reads but not for writes, and a CRCWPRAM allows concurrent access for both reads and writes. We shall assume that in aconcurrent-write model, an arbitrary processor among the processors attempting to writeinto a common memory location, succeeds. This is called the Arbitrary CRCW rule.There are two alternative CRCW rules: (i) By the Priority CRCW rule, the smallestnumbered, among the processors attempting to write into a common memory location,actually succeeds. (ii) The Common CRCW rule allows concurrent writes only when allthe processors attempting to write into a common memory location are trying to writethe same value.For concreteness, we proceed to an example of a PRAM algorithm. However, beforedoing this we present the pardo “programming construct”, which is heavily used in thesenotes to express operations that are performed in parallel:for Pi , 1 i n pardoA(i) : B(i)This means that the following n operations are performed concurrently: processor P1assigns B(1) into A(1), processor P2 assigns B(2) into A(2), and so on.2.1.1. Example of a PRAM algorithm The summation problemInput: An array A A(1) . . . A(n) of n numbers.The problem is to compute A(1) . . . A(n).9

The summation algorithm below works in rounds. In each round, add, in parallel,pairs of elements as follows: add each odd-numbered element and its successive evennumbered element.For example, assume that n 8; then the outcome of the first round isA(1) A(2), A(3) A(4), A(5) A(6), A(7) A(8)the outcome of the second round isA(1) A(2) A(3) A(4), A(5) A(6) A(7) A(8)and the outcome of the third round isA(1) A(2) A(3) A(4) A(5) A(6) A(7) A(8)which is the sum that we seek. A detailed PRAM description of this “pairwise summation” algorithm follows.For simplicity, assume that: (i) we are given a two dimensional array B (whose entriesare B(h, i), 0 h log n and 1 1 i n/2h ) for storing all intermediate steps of thecomputation, and (ii) n 2l for some integer l.ALGORITHM 1 (Summation)1. for Pi , 1 i n pardo2.B(0, i) : A(i)3.for h : 1 to log n do4.if i n/2h5.then B(h, i) : B(h 1, 2i 1) B(h 1, 2i)6.else stay idle7.for i 1: output B(log n, 1); for i 1: stay idleSee Figure 2.Algorithm 1 uses p n processors. Line 2 takes one round, line 3 defines a looptaking log n rounds, and line 7 takes one round. Since each round takes constant time,Algorithm 1 runs in O(log n) time.So, an algorithm in the PRAM modelis presented in terms of a sequence of parallel time units (or “rounds”, or “pulses”); weallow p instructions to be performed at each time unit, one per processor; this meansthat a time unit consists of a sequence of exactly p instructions to be performedconcurrently. See Figure 31Unless otherwise stated the base of logarithms is always assumed to be 210

B(3,1)B(2,1)B(1,1)B(0,1) A(1)B(2,2)B(1,2)B(0,2) A(2)B(1,3)B(0,3) A(3)B(0,4) A(4)B(1,4)B(0,5) A(5) B(0,6) A(6) B(0,7) A(7)B(0,8) A(8)Figure 2: Summation on an n 8 processor PRAMTimet.321pNumber of operationsFigure 3: Standard PRAM mode: in each of the t steps of an algorithm, exactly poperations, arranged in a sequence, are performedWe refer to such a presentation, as the standard PRAM mode.The standard PRAM mode has a few drawbacks: (i) It does not reveal how thealgorithm will run on PRAMs with different number of processors; specifically, it doesnot tell to what extent more processors will speed the computation, or fewer processorswill slow it. (ii) Fully specifying the allocation of instructions to processors requires alevel of detail which might be unnecessary (since a compiler can extract it automatically- see the WD-presentation sufficiency theorem below).11

2.2. Work-Depth presentation of algorithmsAn alternative model which is actually an alternative presentation mode, called WorkDepth, is outlined next. Work-Depth algorithms are also presented in terms of a sequenceof parallel time units (or “rounds”, or “pulses”); however, each time unit consists of asequence of instructions to be performed concurrently; the sequence of instructions mayinclude any number. See Figure 4.Timet.4321Number of operationsFigure 4: WD mode: in each of the t steps of an algorithm as many operations as neededby the algorithm, arranged in a sequence, are performedComment on rigor. Strictly speaking, WD actually defines a slightly different modelof computation. Consider an instruction of the formfor i , 1 i α pardoA(i) : B(C(i))where the time unit under consideration, consists of a sequence of α concurrent instructions, for some positive integer α. Models such as Common CRCW WD, ArbitraryCRCW WD, or Priority CRCW WD, are defined as their PRAM respective counterpartswith α processors. We explain below why these WD models are essentially equivalentto their PRAM counterpart and therefore treat the WD only as a separate presentationmode and suppress the fact that these are actually (slightly) different models. The onlyadditional assumption, which we make for proving these equivalences, is as follows. Incase the same variable is accessed for both reads and write in the same time unit, all thereads precede all the writes.The summation example (cont’d). We give a WD presentation for a summationalgorithm. It is interesting to note that Algorithm 2, below, highlights the following12

“greedy-parallelism” attribute in the summation algorithm: At each point in time thesummation algorithm seeks to break the problem into as many pairwise additions aspossible, or, in other words, into the largest possible number of independent tasks thatcan performed concurrently. We make the same assumptions as for Algorithm 1, above.1.2.3.4.5.6.ALGORITHM 2 (WD-Summation)for i , 1 i n pardoB(0, i) : A(i)for h : 1 to log nfor i , 1 i n/2h pardoB(h, i) : B(h 1, 2i 1) B(h 1, 2i)for i 1 pardo output B(log n, 1)The first round of the algorithm (lines 1 and 2) has n operations. The second round(lines 4 and 5 for h 1) has n/2 operations. The third round (lines 4 and 5 for h 2)has n/4 operations. In general, the kth round of the algorithm, 1 k log n 1, hasn/2k 1 operations and round log n 2 (line 6) has one more operation (note that the useof a pardo instruction in line 6 is somewhat artificial). The total number of operationsis 2n and the time is log n 2. We will use this information in the corollary below.The next theorem demonstrates that the WD presentation mode does not suffer fromthe same drawbacks as the standard PRAM mode, and that every algorithm in the WDmode can be automatically translated into a PRAM algorithm.The WD-presentation sufficiency Theorem. Consider an algorithm in the WD modethat takes a total of x x(n) elementary operations and d d(n) time. The algorithmcan be implemented by any p p(n)-processor within O(x/p d) time, using the sameconcurrent-write convention as in the WD presentation.Note that the theorem actually represents five separate theorems for five respectiveconcurrent-read and concurrent-write models: EREW, CREW, Common CRCW, Arbitrary CRCW and Priority CRCW.Proof of the Theorem. After explaining the notion of a round-robin simulation, weadvance to proving the theorem. Given a sequence of y instructions inst1 , . . . , insty , around-robin simulation of these instructions by p processors, P1 . . . Pp , means the following ⌈y/p⌉ rounds. See also Figure 5. In round 1, the first group of p instructionsinst1 . . . instp are performed in parallel, as follows. For each j, 1 j p, processor Pjperforms instruction instj , respectively. In round 2, the second group of p instructionsinstp 1 . . . inst2p is performed in para

Parallel algorithms - We will focus our attention on the design and analysis of eﬃcient parallel algorithms within the Work-Depth (WD) model of parallel computation. The main methodological goal of these notes is to cope with the ill-deﬁned goal of educating the reader to "think in parallel". For this purpose, we outline an informal model