Technische Universität MünchenInstitut für InformatikProf. Dr. Thomas HuckleDipl.-Math. Benjamin UekermannWiSe 2016/17Parallel NumericsExercise Sheet 1: Flynn’s Taxonomy & MPI Basics1 Flynn’s 1)Load B(1), Reg(0)CPUCPU43Reg(0)Reg(1)Add Reg(0),Reg(1)CPU73Given is a simple computer with a main memory that is capable to hold two arrays A and Bwith n entries each. The CPU has two registers (Reg(0) and Reg(1)) and is able to performone operation (load, store, add, mult) per clock tick.The following program computes A(2) A(3) 2 · B(5):123456loadloadmultloadaddstoreB (5) , Reg (0)2 , Reg (1)A (3) , Reg (1)Reg (0) , A (2)Please note, that this exercise has a more theoretical character!a) The memory holds the vectors A, B IRn , n 6. Write a program that computes2A B and stores the result in A.

1234567for i 0.5loadloadmultloadaddstoreA(i), Reg(0)2, Reg(1)B(i), Reg(1)Reg(0), A(i)b) To what type of computer does the architecture belong according to Flynn’s taxonomy?SISDc) The CPU is added a second pair of registers (Reg(2) and Reg(3)). Furthermore, thereis an operation add2 computing “Reg(0) added Reg(1)” and “Reg(2) added Reg(3)”in one clock cycle. The result values are stored in Reg(0) and Reg(2). Rewrite theapplication using add2 and compare the number of clock cycles required. Is there acomputer available nowadays that supports such operations? To what type of computerdoes such an architecture belong according to Flynn’s taxonomy?Program sequence for first two iterations:12loadloadA(0), Reg(0)A(1), Reg(2)34loadload2, Reg(1)2, Reg(3)56multmultReg(0), Reg(1)Reg(2), Reg(3)78loadloadB(0), Reg(1)B(1), Reg(3)910addaddReg(0), Reg(1)Reg(2), Reg(3)1112storestoreReg(0), A(0)Reg(2), A(1)// replace by add2This way, using add2, only 33 cycles are required instead of 36 cycles. Hence, 21 cyclesper iteration for addition are saved. Further improvement could be an operation mult2.Real computers nowadays e.g. SSE, graphic cards, vector components. The Flynn typeis SIMD.d) The original computer is added a second CPU with two registers. Rewrite the program.To what type of computer does such an architecture belong according to Flynn’s taxonomy?Real parallel architecture/program. Flynn type is MIMD. Split up the program into 2loops (operation sequences):12for i 0.2see exercise a)

12for i 3.5see exercise a)e) Suppose, the CPU (the original one with two registers) is able to handle large instructions, i.e., two operations per cycle as long as they do not use the same resources/operation. This means the combinationload B (5) , Reg (0) and load A (2) , Reg (1)1is not supported since both operations would use the main memory. On the other handmult and load B (4) , Reg (1)1is allowed. Rewrite your program and count the number of clock cycles required. Whatstandard computer architectures support such operations?Again program sequence for first two iterations:12345loadA(0), Reg(0)load2, Reg(1)mult / load B(0), Reg(1)add / load2, Reg(1)// already step 2 of second iteration (loop merge)storeReg(0), A(0)5 cycles first iteration, 4 cycles following iterations. Overall: 5 · 4 5 25 cycles.Modern architectures implement pipeline concept. Splitting up into substeps or executing substeps simultaneously. E.g. superscalar architectures, Very Large InstructionWords.2 Single Program Multiple DataDefine the term Single Program Multiple Data (SPMD). To what type of computer architectureaccording to Flynn do such applications fit? Compare SPMD to SIMD.All processors use the same program but different data. Flynn type is MIMD: Although allprocessors run on the same instruction stream, the operations are not synchronized (differenceto SIMD).3 Parallel Programming ParadigmsIn exercise 1, the simple computer was made a shared memory machine. Define the terms“shared memory” and “distributed shared memory”.What type of architectures was implemented in the HLRB II (Höchstleistungsrechner BayernII) (not in use any more) and is currently used in the SuperMUC at LRZ?Shared memory: Memory that may be simultaneously accessed by multiple programs/processors.Distributed memory: Refers to a multiple-processor computer system in which each processorhas its own private memory Computational tasks on local data.

Hybrid: CPU’s on node share memory. Nodes only know about own memory but may communicate with other nodes.The HLRB-II architecture was a Numalink4 (enhanced ccNUMA) distributed-shared memoryarchitecture.The SuperMUC is a classical distributed memory system.4 Setting up a MPI EnvironmentLast summer semester you attended the lecture on “Parallel Programming”. Part of the coursedealt with MPI (Message Passing Interface) used to parallelise programs on supercomputersor workstation clusters. This term, we will use MPI for programming a couple of numericalcodes. Therefore, a recapitulation of the material about MPI for those who are familiarwith the interface is subject of this exercise. For all the others an introduction will be giventhroughout the tutorials.Make yourself familiar with working with MPI and the terms NFS, public key authorizationand the command mpdboot.a) Write a machine file for your personal experiments.A machine file specifies all the machines you want to use. Entries are seperated by linebreaks.b) To avoid to be asked for your password every time MPI starts up, execute the followingcommands in your home-directory: Create a public dsa key, but do not give it a passphrase123456 ssh - keygen -t dsaGenerating public / private rsa key pair .Enter file in which to save the key (/ home / login /. ssh /id dsa ) :Enter passphrase ( empty for no passphrase ) :Enter same passphrase again :Your identification has been saved in / home / login /. ssh /id dsa . Add the new key stored in .ssh/id dsa to the public key file:12 cd /. ssh cat id dsa . pub authorized keys Log in once on every computer you want to use.c) Make yourself familiar with the commands mpdboot, mpirun and mpdallexit.Example: mpdboot -f mfile -n 25 --ncpus 1 -1 mpdboot starts up MPI environment. -f passes machinefile. -n number of parallel SPMD instances (number of entries within machinefile shouldbe greater).

--ncpus number of processes to be started on local machine. -1 allows more than one processor/node. Other duplicates in machinefile are neglected.Starting parallel application with: mpirun -n * ./executable.mpdallexit closes the MPI environment.5 MPI Application StructureBelow is a very simple MPI application. Try to run this example and identify the semanticsand the syntax conventions of the five different MPI commands.12int main ( int argc , char * argv []) {int myrank , nproz ;3MPI Init ( & argc , & argv ) ;MPI Comm size ( MPI COMM WORLD , & nproz ) ;MPI Comm rank ( MPI COMM WORLD , & myrank ) ;4567MPI Barrier ( MPI COMM WORLD ) ;MPI Finalize () ;8910return 0;1112}MPI Init, MPI Comm size, and MPI Comm rank build the header. MPI Finalize is the trailer.All commands start with MPI . MPI always uses pointers to pass variables. MPI enumeratesall nodes starting with 0 (rank of a node). Size gives you the number of nodes available.6 A First MPI ApplicationWrite a simple application that defines a constant π on the first node in a cluster and, afterwards, sends this constant to all other nodes one by one. Use only the MPI operationsMPI Send and MPI Recv.See source code to corresponding tutorial on webpage.

b)To what type of computer does the architecture belong according to Flynn's taxonomy? SISD c)The CPU is added a second pair of registers (Reg(2) and Reg(3)). Furthermore, there is an operation add2 computing "Reg(0) added Reg(1)" and "Reg(2) added Reg(3)" in one clock cycle. The result values are stored in Reg(0) and Reg(2). Rewrite the