Parallel Computing on SoC Architecture
Daniele Cesini – INFN-CNAF
Andrea Ferraro – INFN-CNAF
Lucia Morganti – INFN-CNAF

Outline
- Some history
- Modern Systems on Chip
- Computing on System on Chip, the INFN experience
- European projects on low power computing
- Conclusion
Daniele Cesini – INFN-CNAF, 10/12/2014

Once upon a time... The vector machines
- Cray-1, serial number 001: Los Alamos National Laboratory in 1976
- $8.8 million
- 80 MFLOPS scalar, 160/250 MFLOPS vector
- 1 Mword (64 bit) main memory
- 8 vector registers, 64 elements of 64 bits each
- Freon refrigerated
- 5.5 tons including the Freon refrigeration
- 115 kW of power, 330 kW with refrigeration
- Serial number 003 was installed at the National Center for Atmospheric Research (NCAR) in 1977 and decommissioned in 1989
(*) Source: Wikipedia

Not properly wireless...

CRAY-XMP... Vector MultiProcessor
- 1982: CRAY-XMP with 2 processors
  - 9.5 ns clock cycle (105 MHz)
  - 2 x 200 MFLOPS
  - 2 Mwords (64 bit) = 16 MB
- 1984: CRAY-XMP with four processors
  - 800 MFLOPS
  - 8 Mwords = 64 MB main memory
  - about US$15 million
- CRAY-XMP48 @ CERN in 1984
(*) 50plus the cost of disks!!!

The Cray-2
- Released in 1985
- 4 processors, 250 MHz (4.1 ns)
- 256 Mword (64 bit) main memory = 2 GByte
- 1.9 GFLOPS
- 150 - 200 kW
- Fluorinert cooling
- 16 sq ft floor space
- 5500 pounds
- About $17 million
(*) Cray/Cray.Cray2.1985.102646185.pdf

The attack of the Killer Micros
- Taken from the title of Eugene Brooks' talk "Attack of the Killer Micros" at Supercomputing 1990
- Caltech Cosmic Cube
  - By Charles Seitz and Geoffrey Fox in 1981
  - 64 Intel 8086/8087 processors
  - 128 kB per processor
  - 6-dimensional hypercube
Source: 419/1/Cubism.pdf

The Killer Tomatoes
- Attack of the Killer Tomatoes is a film directed and co-written by John DeBello
- Had three sequels!!!

Massively Parallel Processor (MPP)
- A single computer with many networked processors
- Specialized interconnect networks
  - Low latency interconnection
- Up to thousands of processors
- Some examples
  - Connection Machines (CM-1/2/200/5)
  - Intel Paragon
  - ASCI series
  - IBM SP
  - IBM BlueGene

Thinking Machines
- 1985: Thinking Machines introduces the Connection Machine CM-1
- Connection Machine CM-200
  - maximum configuration of 65536 1-bit CPUs(!)
  - floating-point unit for every 32 1-bit CPUs
  - A cube composed of 8 cubes; each cube contains up to 8192 processors
  - (The curved structure is a Data Vault - a disk array)
  - 40 GFLOPS peak
- 1991: CM-5
  - Featured in “Jurassic Park”
(*) Sources: l Machines Corporation

Intel Paragon MPP
- Launched in 1993
- Up to 2048 (later 4000) Intel i860 RISC microprocessors
  - Connected in a 2D grid
  - Processors @ 50 MHz
- World's most powerful supercomputer in 1994
  - Paragon XP/S140
  - 3680 processors
  - 184 GFLOPS peak
(*) Source: Wikipedia

ASCI Red MPP
- 1996, at Sandia Laboratories
- Based on the Paragon architecture
- Fastest supercomputer from 1997 to 2000
  - 1.4 TFLOPS (peak) in 1997, 9152 cores
  - 3.2 TFLOPS (peak) in 1999, 9632 cores
- 1st supercomputer above 1 TFLOPS
(*) Source: Wikipedia

IBM BlueGene/Q MPP
- Trading the speed of processors for lower power consumption
- System-on-a-chip design: all node components were embedded on one chip
- A large number of nodes
- 5D torus interconnect
- Compute chip is an 18-core chip
  - The 64-bit PowerPC A2
  - 4-way simultaneously multithreaded per core
  - 1.6 GHz
  - a 17th core for operating system functions
  - chip manufactured on IBM's copper SOI process at 45 nm
  - 204.8 GFLOPS and 55 watts per processor
- Up to 20 PFLOPS (peak)
  - 16384 cores
(*) Source: Wikipedia

Clusters
“[a cluster is a] parallel computer system comprising an integrated collection of independent nodes, each of which is a system in its own right, capable of independent operation and derived from products developed and marketed for other stand-alone purposes”
Dongarra et al.: “High-performance computing: clusters, constellations, MPPs, and future directions”, Computing in Science & Engineering (Volume: 7, Issue: 2)
- From the “stack of Sparc Pizza Boxes” of the 80s to modern supercomputers
(*) Picture from: cluster

architectures share
(*) Source: 201306 Poster.pdf

chip technology share
(*) Source: 201306 Poster.pdf

Vector vs Micro computing power
[Plot comparing NEC SX-ACE, NEC SX-5, HITACHI S820/60, CRAY-1, INTEL i7, Pentium Pro, INTEL 8086, MOS 6510]
- Why did microprocessors take over?
  - They have never been more powerful...
  - ...but they were cheaper, highly available and less power demanding
  - Cool people would say “greener”

Commodity hardware
- Microprocessors started to be mass produced and used in everyday life
  - Personal computer at home
  - Office automation
  - Gaming

What’s commodity nowadays?
- Low-power Systems on Chip (SoCs)

Where do I find a SoC?
- Mobile
- Embedded

ARM based processor shipment
- ARM based processors are shipped in billions of units
- ARM licenses its intellectual property to manufacturers...
- ...many manufacturers:
  Samsung (Korea), MediaTek (China), Allwinner (China), Qualcomm (USA), NVIDIA (USA), RockChip (China), Freescale (USA), Texas Instruments (USA), HiSilicon (China), Xilinx (USA), Broadcom (USA), Apple (USA), Altera (USA), ST (EU), WonderMedia (Taiwan), Marvell (USA), AMD (USA), etc.

Vector vs Micro computing power
[Plot comparing NEC SX-ACE, NEC SX-5, HITACHI S820/60, CRAY-1, INTEL i7, Pentium Pro, INTEL 8086, MOS 6510]

Vector vs Micro vs ARM based
Is history repeating?

OK, but an iPhone cluster?
- NO, we are not planning to build an iPhone cluster
- We want to use these processors in a standard computing center configuration
  - Rack mounted
  - Linux powered
  - Running scientific applications, mostly in a batch environment
- ...Use development boards...

ODROID-XU3
- Powered by ARM big.LITTLE technology, with a Heterogeneous Multi-Processing (HMP) solution
- Exynos 5422 by Samsung
  - 4 cores ARM A15 + 4 cores ARM A7
  - 20 GFLOPS peak (32 bit) single precision
- Mali-T628 MP6 GPU
  - 110 GFLOPS peak single precision
- 2 GB RAM
- 2xUSB3.0, 2xUSB2.0, 1x10/100 eth
- Ubuntu 14.04
- HDMI 1.4 port
- 64 GB flash storage
- Power consumption max 15 W
- Costs 150 euro!

Other nice ieBoard Platforms
- DragonBoard
- Texas Instruments EVMK2H
- ...and counting...

Some specs
- FREESCALE SABRE Board (embedded SoC): Freescale i.MX6Q; ARM IP: ARM A9 (4); GPU IP: Vivante GC2100 (19.2 GFLOPS)
- ARNDALE Octa Board (mobile SoC): Samsung Exynos 5420; ARM IP: ARM A15 (4) + A7 (4); GPU IP: ARM Mali-T628 MP6 (110 GFLOPS)
- HARDKERNEL Odroid-XU-E (mobile SoC): Samsung Exynos 5410; ARM IP: ARM A15 (4) + A7 (4); GPU IP: Imagination Technologies PowerVR SGX544MP3 (51.1 GFLOPS)
- HARDKERNEL Odroid-XU3 (mobile SoC): Samsung Exynos 5422; ARM IP: ARM A15 (4) + A7 (4); GPU IP: ARM Mali-T628 MP6 (110 GFLOPS)
- INTRINSYC DragonBoard (mobile SoC): Qualcomm Snapdragon 800; ARM IP: Qualcomm Krait (4); GPU IP: Qualcomm Adreno 330 (130 GFLOPS)
- TI EVMK2H (embedded SoC): TI Keystone 66AK2H14; ARM IP: ARM A15 (2); DSP IP: 6x (189 GFLOPS); GFLOPS: 210; Eth: 1Gb (10Gb)
- TDP between 5 W and 15 W (EVMK2H 15 W)

How do you program them? (in a Linux environment)
- GCC is available for ARM CPUs (for free)
- OpenCL for the GPU
  - If you are lucky enough to find working drivers
- Cross compilation
  - If you dare!

NVIDIA JETSON K1
- First ARM-based, CUDA-programmable, GPU-accelerated Linux development board!
- 4 cores ARM A15 CPU
- 192 cores NVIDIA GPU
- 300 GFLOPS (peak sp)
- ...for less than 200 euros

Modern Accelerators
- Many simple cores to speed up parallel portions of code
- GPUs for general purpose applications
  - Available since the mid 90s, but now we have “reasonable” programming models
- Intel XEON PHI (MIC, Many Integrated Cores)
- High performance/Watt ratio (if properly used)

GPU acceleration in scientific computation
- 2 x E5-2673v2 (IvyBridge), 8 cores each
  - 2 x 100 = 200 GFLOPS (double precision)
  - 2 x 110 Watt = 220 W
  - about 1 GFLOPS/W
- 1 x NVIDIA TESLA K40
  - 2880 cores
  - 12 GB RAM
  - 1400 GFLOPS (double precision)
  - 4300 GFLOPS (single precision)
  - 235 Watt
  - about 6 GFLOPS/W dp
  - about 18 GFLOPS/W sp
- CPU + GPU
  - about 3 GFLOPS/W dp
  - about 9 GFLOPS/W sp
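
The efficiency figures on this slide are ratios of peak throughput to rated power. The following is a minimal sketch of that arithmetic, using only the peak values quoted above. The helper name `gflops_per_watt` is hypothetical, and summing CPU and GPU peaks for the combined node is an assumption; the slide's rounded CPU + GPU figures may have been derived differently.

```python
def gflops_per_watt(gflops, watts):
    """Peak throughput (GFLOPS) divided by rated power (W)."""
    return gflops / watts

# Peak figures quoted on the slide
cpu_dp, cpu_w = 2 * 100.0, 2 * 110.0            # two 8-core E5-2673v2 sockets
gpu_dp, gpu_sp, gpu_w = 1400.0, 4300.0, 235.0   # one Tesla K40

cpu_dp_eff = gflops_per_watt(cpu_dp, cpu_w)
gpu_dp_eff = gflops_per_watt(gpu_dp, gpu_w)
gpu_sp_eff = gflops_per_watt(gpu_sp, gpu_w)

# Assumption: combined node = sum of peaks over sum of rated powers
node_dp_eff = gflops_per_watt(cpu_dp + gpu_dp, cpu_w + gpu_w)

print(f"CPU  dp: {cpu_dp_eff:.2f} GFLOPS/W")
print(f"GPU  dp: {gpu_dp_eff:.2f} GFLOPS/W")
print(f"GPU  sp: {gpu_sp_eff:.2f} GFLOPS/W")
print(f"Node dp: {node_dp_eff:.2f} GFLOPS/W")
```

These are peak ratios only; sustained efficiency depends on how well the application actually uses the hardware.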

CPU GFLOPS/Watt

GPU acceleration in K1
- 4 core ARM A15: 18 GFLOPS
- Kepler SMX1, 192 cores: 300 GFLOPS
- 15 Watt
- 21 GFLOPS/W
- 1.5 GFLOPS/euro (0.67 euro/GFLOPS)
- N.B. Single precision – 32 bit architecture
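
The K1 figures can be checked with the same kind of ratio. This is a sketch under two assumptions: the 21 GFLOPS/W value comes from adding the CPU and GPU peaks, and the 1.5 and 0.67 values come from dividing the GPU peak by the "less than 200 euros" board price quoted earlier in the talk. All variable names are hypothetical.

```python
# Single-precision peak figures quoted on the slide
cpu_gflops = 18.0     # 4-core ARM A15
gpu_gflops = 300.0    # Kepler SMX, 192 cores
board_watts = 15.0
board_euros = 200.0   # assumption: upper bound on the board price

# Assumption: efficiency uses the sum of CPU and GPU peaks
efficiency = (cpu_gflops + gpu_gflops) / board_watts

# Assumption: price/performance uses the GPU peak only
gflops_per_euro = gpu_gflops / board_euros
euro_per_gflops = board_euros / gpu_gflops

print(f"{efficiency:.1f} GFLOPS/W")
print(f"{gflops_per_euro:.2f} GFLOPS/euro, {euro_per_gflops:.2f} euro/GFLOPS")
```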

OK... good, that’s the theory... but what happens in reality?
- Real life is, as usual, harder than the theory
- In SoCs most of the computing power is in the GPU
  - Extracting it could be unbelievably difficult

Limitations
- Moreover, commodity SoCs and development boards have a number of limitations:
  - 32 bit
  - Small caches
  - Small RAM size
  - No ECC memory
  - Frequent failures and system crashes
  - Slow connections in some cases
  - HW bugs
- If anything can go wrong, it will. (Murphy)

Not always perfect

OpenMP π computation
CPU only

Prime numbers computation
CPU only

[email protected] results
CPU only
- High Energy Physics Monte Carlo simulations
- ARM slower by a factor of 3 or 4
- but ARM better by a factor of 3 or 5 on the power ratio
David Abdurachmanov et al 2014 J. Phys.: Conf. Ser. 513 052008, doi:10.1088/1742-6596/513/5/052008

Molecular Dynamics on Jetson-K1
CPU + GPU
- Parallel application for CPU and GPU
- [Plot: lower is better]
- Jetson-K1 about 10X slower using the same number of cores
- Jetson-K1 about 10X slower using the GPU (vs. an NVIDIA Tesla K20)
- Jetson-K1: 13.5 Watt
- Xeon + K20: 320 Watt
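
One way to read the slowdown and power figures together is energy to solution: run time multiplied by power. The sketch below is not from the slides. It assumes the quoted wattages are the average draw during the run and that the slowdown is exactly 10x; `energy_ratio` is a hypothetical helper.

```python
def energy_ratio(slowdown, watts_slow, watts_fast):
    """Energy used by the slower system relative to the faster one
    for the same job: (slowdown * t * watts_slow) / (t * watts_fast)."""
    return slowdown * watts_slow / watts_fast

# Figures quoted on the slide: Jetson-K1 at 13.5 W, Xeon + K20 at 320 W
jetson_vs_xeon_k20 = energy_ratio(slowdown=10.0, watts_slow=13.5, watts_fast=320.0)

print(f"Jetson-K1 energy relative to Xeon + K20: {jetson_vs_xeon_k20:.2f}")
```

A ratio below 1 would mean the slower board still finishes the job with less energy, at the cost of a longer wait.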

Lattice Boltzmann on the Tegra K1
GPU only (*)
- On Tegra-K1: 15 GFLOPS, 12 GB/s, 10 Watt
- Porting: easier than expected
- Performance: 40x slower than a K20m, under investigation
(*) Schifano et al.; A portable OpenCL Lattice Boltzmann code for multi- and many-core processor architectures; Procedia Computer Science, Volume 29, 2014, Pages 40-49, doi: 10.1016/j.procs.2014.05.004

OK, nice... but... where is the cluster?
- We still need to build it
  - Waiting for a 64 bit low cost, low power SoC
- INFN COSA project
  - Two-year project starting in 2015
  - 50 keuro per year
  - Build a 10 TFLOPS cluster with SoC architectures
  - Prototype a scalable interconnect
  - Test real life INFN parallel and scalar applications

Only ARM based SoCs? And Intel?
- Intel produces SoCs
  - You probably have one in your laptop
- Some of them are low power
- Already 64 bit
- Integrated GPU
- Cilk programmable
- OpenCL programmable

Some low power from Intel

- 2 cores
- GPU: Intel HD Graphics
- OpenCL 2.0 support
- 4.5 Watt (TDP)
(*) r-family-spec-update.html

Test on Intel AVOTON
- N.B. - Preliminary results
- N.B. - Old Xeon CPU

European leadership?
- Mobile CPU and GPU (IP licenses)
  - ARM
  - Imagination Technologies (iPhone GPU, MIPS tech)
- Embedded/Automotive/Avionics
  - Siemens
  - Bosch
  - ST
  - Infineon

The MontBlanc Project - 1

The MontBlanc Project - 2


EU Horizon 2020 Call

The road to ExaScale
- A machine capable of running a parallel application (not embarrassingly parallel) requiring 10^18 fp operations per second
  - A group of 1000 PetaScale machines is not an Exascale system
- Generally agreed requirements
  - Available by 2018/2020
  - Needs less than 20 MW
- Nov 2014: Tianhe-2, 33 PFLOPS, 18 MW
- Need a factor of 30 in computing performance while keeping the same power consumption as today's machines
  - Low power needed!
- But... will a parallel application requiring 10^18 fp operations per second exist in 2020? Will there be someone able to code it?
“And what comes after exascale? We can look forward to zettascale (10^21) and yottascale (10^24). Then we run out of prefixes.” (from: omputers-of-the-future)
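
The factor of 30 follows from the Nov 2014 figures quoted above (33 PFLOPS at 18 MW) and the exascale target of 10^18 operations per second within 20 MW. A minimal sketch of that arithmetic, with hypothetical variable names:

```python
EXAFLOPS = 1e18                # target: fp operations per second
power_budget_watts = 20e6      # target: less than 20 MW

top_flops = 33e15              # Nov 2014 top system: 33 PFLOPS
top_watts = 18e6               # at 18 MW

# How much faster an exascale machine must be
performance_gap = EXAFLOPS / top_flops

# Energy efficiency today vs. what the power budget implies (GFLOPS/W)
current_eff = top_flops / top_watts / 1e9
target_eff = EXAFLOPS / power_budget_watts / 1e9
efficiency_gap = target_eff / current_eff

print(f"Performance gap: {performance_gap:.1f}x")
print(f"Efficiency: {current_eff:.2f} -> {target_eff:.0f} GFLOPS/W "
      f"({efficiency_gap:.1f}x)")
```

Because the power budget is only slightly larger than today's consumption, almost all of the performance gain has to come from better GFLOPS/W rather than from more power.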

Conclusion
- Power consumption is becoming a key factor
  - From mobile to ExaScale supercomputers
- Mobile and embedded low power Systems-on-Chip are becoming attractive for scientific computing
  - Europe can play an industrial leadership role
  - The European Commission is investing in it
- But they still have many limitations
  - Don’t just look at specs and GFLOPS count!!!
- The future is heterogeneous
  - A single system will have multiple types of processors
  - CPU, GPU, DSP, MIC...
  - Often in the same chip
