Influence of Technology Directions on System Architecture
Dr. Randy Isaac
VP of Science and Technology
IBM Research Division
September 10, 2001

Moore's Law continues beyond conventional scaling
Power becomes the limiting metric
The integration focus moves from circuit to processor

[Chart: computations/sec per $1000 (1E-6 to 1E 12) vs. year, 1900-2020, spanning the mechanical, electro-mechanical, vacuum tube, discrete transistor, and integrated circuit eras. After Kurzweil, 1999 & Moravec, 1998.]

Integrated Circuit Performance Trends
[Chart: memory density (Mb) and logic clock frequency (MHz) vs. year, 1980-2000, both rising exponentially.]

The Original Moore's Law Proposal
After G. E. Moore, "Electronics," 1965

A Decade of Agreement
After G. E. Moore, Proc. IEDM, 1975

Complexity's Influence
After G. E. Moore, SPIE v. 2440, 1995

Increased Integration
Function implemented with: many silicon components -> few silicon components -> one chip
Benefits: function, speed, cost

Partitioning the Improvement Rate
Improving integration (components per chip):
  50% gain from lithography
  25% gain from device and circuit innovation
  25% gain from increased chip size (manufacturability)
Improving performance:
  Transistor performance improvement
  Interconnect density and delay
  Packaging and cooling
  Circuit-level and system-level gains

Evolution of Memory Density
[Chart: megabits/chip vs. year, 1980-2015; doubling time improves from 3 yr. to 1.5 yr.]

ITRS Lithography Roadmap
[Chart: minimum feature size (nm, DRAM half-pitch) vs. year for the 1994 SIA NTRS, 1997 SIA NTRS, 1998/1999 ITRS, and ISMT Litho 2000 Plan roadmaps, from 500 nm down to 35 nm, with an area marked for future acceleration.]
Industry-Wide Lithography Technology Acceleration

Dimensions in Lithography
[Chart: feature sizes and exposure wavelengths (nm) over time; wavelengths of 435, 405, 365, 248, 193, and 157 nm (deep UV), followed by extreme UV, X-ray proximity, and electron beam.]

Device Scaling: Original Device
[Diagram: bulk FET at voltage V, with wiring width W, oxide thickness t_ox, gate length L, n source/drain, depletion depth x_d, and p substrate with doping N_A.]

Device Scaling: Scaled Device
[Diagram: the same FET with every dimension and the voltage reduced by the scaling factor α.]
SCALING:
  Voltage: V/α
  Oxide: t_ox/α
  Wire width: W/α
  Gate width: L/α
  Diffusion: x_d/α
  Substrate doping: α * N_A

Device Scaling: Scaled Device
SCALING:
  Voltage: V/α
  Oxide: t_ox/α
  Wire width: W/α
  Gate width: L/α
  Diffusion: x_d/α
  Substrate doping: α * N_A
RESULTS:
  Higher density: α²
  Higher speed: α
  Lower power/circuit: 1/α²
  Power density: constant
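The scaling rules on this slide can be checked with a few lines of arithmetic. This is a minimal sketch of classical constant-field (Dennard) scaling; the function name and dictionary layout are my own, not from the slides.

```python
def dennard_scaling(alpha):
    """First-order consequences of shrinking all dimensions and V by alpha."""
    return {
        "density": alpha ** 2,                # devices per area: (1/alpha)^-2
        "speed": alpha,                       # delay ~ C*V/I shrinks by 1/alpha
        "power_per_circuit": 1 / alpha ** 2,  # P ~ C * V^2 * f
        "power_density": (1 / alpha ** 2) * alpha ** 2,  # constant
    }

r = dennard_scaling(2.0)
# alpha = 2: 4x density, 2x speed, 1/4 power per circuit,
# and power density unchanged -- exactly the slide's results.
```

The constant power density in the last line is the key result: under ideal scaling, chips get denser and faster without getting hotter.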

Fundamental Atomic Limit to Scaling
[Image: cross-sections of a silicon bulk field-effect transistor (FET) with a 1.2 nm oxynitride gate dielectric; present recipe vs. future.]
Oxide thickness is approaching a few atomic layers

Limit of Oxide Scaling
[Chart: gate current density (A/cm²), 1E-8 to 1E 6, vs. gate oxide thickness (0-4 nm) at gate voltages of 0.9 to 2.0 V; tunneling current rises steeply as the oxide thins.]

High Performance CMOS Logic Trend
[Chart: industry logic performance trends vs. year of first production, 1994-2008.]

Relative CMOS Device Performance
[Chart: relative device performance vs. year of technology capability, 1986-2010, for bulk FETs, SOI FETs, double-gate FETs, and low-temperature operation.]
New structures are needed to maintain device performance.

MOSFET Device Structure (R)evolution
[Chart: relative device performance vs. year.]
New devices/materials support accelerated growth rate

Better Performance Without Scaling

Novel Devices
V-Groove Transistors
Carbon Nanotubes
Organic Transistors
Quantum Computing
Molecular Devices

64-bit S/390 Microprocessor
47 million transistors
Copper interconnect -- 7 layers
Size: 17.9 x 9.9 mm
Single scalar, in-order execution
Split L1 cache (256K I & D)
BTB 2K x 4, multiported
On-chip compression unit
1 GHz frequency on a 20-way system

Blue Pacific
3.9 trillion operations/sec
Can simulate nuclear devices
15,000 X speed of average desktop
80,000 X memory of average desktop
75 terabytes of disk storage capacity

System Level Performance Improvement
Overall system-level performance improvement will come from many small improvements.
[Chart: stacked contributions to performance vs. year; overall performance grows at 60 to 90% CAGR, while traditional CMOS scaling alone contributes 20% CAGR.]
Contributions:
  Application tuning
  Middleware tuning
  OS: tuning/scalability
  Compilers
  Multi-way systems
  Motherboard design: electrical, debug
  Memory subsystem: latency/bandwidth
  Packaging: more pins, better electrical/cooling
  Tools / environment / designer productivity
  Architecture/microarchitecture/logic design
  Circuit design
  New device structures
  Other process technology
  Traditional CMOS scaling
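The slide's point is that many modest gains compound multiplicatively. A minimal sketch, with per-layer gain values that are illustrative assumptions (only the 20% scaling figure and the 60-90% range come from the slide):

```python
def compound(gains):
    """Multiply per-layer annual improvement factors into one overall factor."""
    total = 1.0
    for g in gains:
        total *= g
    return total

# Hypothetical annual gains: scaling 1.20, circuit design 1.08,
# microarchitecture 1.10, compilers 1.05, OS/middleware/app tuning 1.12.
overall = compound([1.20, 1.08, 1.10, 1.05, 1.12])
# overall ~ 1.68, i.e. ~68% CAGR -- inside the slide's 60-90% range,
# even though no single layer beyond scaling exceeds 12%.
```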

Moore's Law continues beyond conventional scaling
Power becomes the limiting metric
The integration focus moves from circuit to processor

Microprocessor Size Trends
[Chart: transistor count and chip area vs. year, log scale from 1/4x to 256x. Transistor count grows at roughly 2x/1.5 yrs. (Moore's Law), with the industry trend somewhat slower; chip area is nearly flat by comparison, within roughly a factor of 2 per 6 yrs.]

Microprocessor Performance and Power
[Chart: performance and power vs. year through 2000; performance grows at roughly 2x/1.5-2 yrs., while power grows at roughly 2x/3-6 yrs.]

Microprocessor Scaling Trends
                    486DX (4/10/89)   Pentium 4 (4/23/01)
Frequency (MHz)     25                1700
# Transistors (M)   1.2               42
[Additional rows for technology node (um), voltage (V), SpecInt95, chip size (sq. mm), power (W), and power density (W/cm²) are not recoverable from the source. Annotations compare the trends to pure device scaling and to Moore's Law.]

Power Density: The Fundamental Problem
[Chart: power density (W/cm², 1-1000) vs. technology node, 1.5 µm down to 0.07 µm, for i386, i486, Pentium, Pentium Pro, Pentium II, and Pentium III; the trend passes a hot plate and heads toward nuclear-reactor levels.]
Source: Fred Pollack, Intel. New Microprocessor Challenges in the Coming Generations of CMOS Technologies, Micro32.

Power
IT electrical power needs are projected to reach crisis proportions
Server farm energy consumption is increasing exponentially
  More Watts/sq. ft. than semiconductor or automobile plants
  Power needs constitute 60% of cost
Interesting anecdotes
  The "2,400 megawatt problem": 27 farms proposed for South King County will require as much energy as Seattle (including Boeing)
  Exodus considering building a power plant near its Santa Clara facility
  San Jose City Council approved a 250 MW power plant for the US DataPort server farm, and installation of 80 back-up diesel generators

Server Farm Heat Density Trend
[Chart: heat density vs. year by equipment class. Highest growth: communication equipment, 28% AGR; lowest: tape storage, 7%.]
* Slower growth after 2005 due to improvement in semiconductor power consumption
Reprinted with permission of The Uptime Institute from a White Paper titled Heat Density Trends in Data Processing, Computer Systems, and Telecommunications Equipment, Version 1.0.

Energy Dissipated per Logic Operation
[Chart: energy per logic operation (pJ), 1E-10 to 1E 10, vs. year, 1940-2020; the trend descends steadily toward kT (room temp.).]

Device Scaling (revisited)
SCALING:
  Voltage: V/α
  Oxide: t_ox/α
  Wire width: W/α
  Gate width: L/α
  Diffusion: x_d/α
  Substrate doping: α * N_A
RESULTS:
  Higher density: α²
  Higher speed: α
  Lower power/circuit: 1/α²
  Power density: constant

MOSFET Device Parameter Trends
[Chart: T_ox, Vdd (V), and Vt (V) vs. gate length L_gate (0.01-1 um), showing departure from classic scaling.]

Low Temperature CMOS
[Chart: source/drain current (A/cm), 1.0E-10 to 1.0E 2, vs. gate voltage (-0.5 to 1 V) at T = 100 K, 200 K, and 300 K, for L = 25 nm and Vds = 1 V.]
Subthreshold slope steepens as temperature is reduced

CMOS Performance Parameter Trends
[Chart: C_gate (fF/um), inverter delay (ps), NFET Id-sat (A/m), power density (W/cm²), and CV/I delay (a.u.) vs. gate length L_GATE (0.1-1 um).]

Relative Power Density in Scaled CMOS
[Chart: relative power density (at constant relative density) vs. channel length, 0.05-1.0 µm, for high-performance designs (supply voltages from 5.0 V down to 1.2 V) and low-power designs (3.3 V down to 0.8 V).]
After B. Davari, et al., IEEE Proc. Vol. 83, p. 595, 1995.

CMOS Power Density Trends
[Chart: active and subthreshold power density (W/cm²) vs. gate length (0.01-1 um); subthreshold power density rises steeply as gate length shrinks, approaching the active power density.]
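A sketch of why the subthreshold curve climbs so much faster than the active one: off-state leakage grows exponentially as the threshold voltage is lowered with each generation, while active power scales only polynomially. The constants below (unit prefactor, 85 mV/decade slope) are illustrative assumptions, not values from the slide.

```python
def subthreshold_leakage(vt, i0=1.0, slope_mv_per_decade=85.0):
    """Off-current (arbitrary units) for threshold voltage vt, in volts.

    Models the standard exponential subthreshold dependence:
    one decade of current per `slope_mv_per_decade` of Vt.
    """
    return i0 * 10 ** (-vt * 1000.0 / slope_mv_per_decade)

# Halving Vt from 0.4 V to 0.2 V raises leakage by 10^(200/85), roughly 225x,
# while the active-power saving from the lower voltage is only polynomial.
ratio = subthreshold_leakage(0.2) / subthreshold_leakage(0.4)
```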

Microprocessor Power Draw vs. Frequency
[Chart: power (0-40 W) vs. operating frequency (200-700 MHz).]

Moore's Law continues beyond conventional scaling
Power becomes the limiting metric
The integration focus moves from circuit to processor

We've Been Here Before! Heat Flux Explosion
[Chart: module heat flux (watts/cm²) vs. year of announcement, 1950-2010, for vacuum-tube, bipolar, and CMOS machines (IBM 360, 370, 3033, 3081, 3090, 3090S, 4381, ES9000; Fujitsu M380, M-780, VP2000; CDC Cyber 205; NTT; IBM GP, RY4-RY7, Pulsar, Apache; Pentium II (DSIP), Merced). Bipolar heat flux climbed past a steam iron (5 W/cm²) before the switch to CMOS reset it; CMOS is now climbing the same curve.]

S/390 Mainframe CPU Performance
[Chart: relative performance (1-1000) vs. year, 1970-2005; bipolar machines through the 9021-711, then CMOS machines S/390 G3 through G7.]

S/390: Comparison of Bipolar and CMOS
                      ES9000 9X2   S/390 G5
Technology            Bipolar      CMOS
Total chips           5000         29 (12 CPUs)
Total parts           6659         92
Weight (lbs)          31.1 K       2.0 K
Power req. (KW)       153          5
Chips/processor       390          1
Maximum memory (GB)   10           24
Space (sq ft)         672          52


Focus on Massively Parallel Systems
Use slower processors with much greater power efficiency
Scale to desired performance with parallel systems
  Workload scaling efficiency must sustain power efficiency
  Physical distance must be small to keep communication power manageable
Example: Processor A is slower than B by a factor S but more power efficient by E. Then MP System A at the same performance as MP System B has lower power by E/S.
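The slide's E/S claim follows from simple bookkeeping. A minimal sketch, assuming perfect workload scaling (the function and variable names are my own):

```python
def mp_power_ratio(s, e):
    """Power of a parallel system of A processors, relative to a B system
    of the same aggregate performance.

    Per processor, A is s-times slower than B but draws e-times less power.
    """
    n_processors = s           # s slow processors match one fast one
    per_proc_power = 1.0 / e   # each draws 1/e of a B processor's power
    return n_processors * per_proc_power  # = s/e, i.e. lower by e/s

# e.g. 4x slower but 10x more power efficient:
# the parallel system delivers the same performance at 0.4x the power.
ratio = mp_power_ratio(4, 10)
```

The assumption of perfect scaling is exactly why the slide flags workload scaling efficiency: any shortfall there requires more than S processors and erodes the E/S advantage.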

Microprocessor Efficiencies
[Chart: performance (DMIPS, 1000-4000) vs. active power (100 mW to 100 W) for a range of microprocessors, bracketed by 0.1 MIPS/mW and 1 MIPS/mW efficiency lines.]

Parallel Performance Scaling Model
[Chart: relative performance vs. number of processors; ideal scaling is linear, while real scaling falls away from it and peaks at Pmax at Nmax processors.]
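One simple way to reproduce the chart's two curves is to charge each added processor a communication/synchronization cost that grows with system size, so the real curve eventually peaks. The quadratic overhead term below is an assumption for illustration, not the slide's actual model.

```python
def real_scaling(n, overhead=0.001):
    """Relative performance of n processors with a quadratic comm. overhead."""
    return n / (1.0 + overhead * n * n)

# Ideal scaling is just linear in n.
ideal = [float(n) for n in range(1, 101)]
real = [real_scaling(n) for n in range(1, 101)]

# The real curve peaks (Pmax) near n = 1/sqrt(overhead), about 32 here (Nmax);
# beyond that, added processors cost more in communication than they compute.
n_max = max(range(1, 101), key=real_scaling)
```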

Power/Bandwidth by Interconnect Length
[Chart: power per unit bandwidth vs. interconnect length, 0.001-100 meters, from on-chip to room-to-room scales.]

Supercomputer Peak Performance
[Chart: peak speed (flops, 1E 2 to 1E 16) vs. year introduced, 1940-2010, with a 1.5 yr. doubling time. ENIAC, UNIVAC (vacuum tubes); IBM 701, 704, 7090 (transistors); IBM Stretch, CDC 6600, CDC 7600 (ICs); CDC STAR-100 (vectors); CRAY-1, X-MP2, Cyber 205, X-MP4, CRAY-2, Y-MP8 (parallel vectors); i860, Delta, Paragon, CM-5, NWT, CP-PACS (MPPs); ASCI Red, Blue Pacific, ASCI White; BlueGene approaching a petaflop.]

ASCI White

Cellular Architecture
Computational efficiency: 0.2 GFLOP/W

Example of a Cellular Node
IBM PPC440 system-on-chip
[Diagram: 440 PowerPC core (~1 Watt) with 32 kB I-cache and 32 kB D-cache; 2.8 GF floating-point unit; integrated memory controller with DDR/DDR2 buffers to 256-512 MB of SDRAM; link DMA and buffers for six 2 Gb/sec serial links; tree and global functions; 10/100Mb Ethernet for boot; 1Gb Ethernet or Infiniband for I/O.]

Cellular Communication Networks
65536 nodes interconnected with three integrated networks
Ethernet
  Incorporated into every node ASIC
  Disk I/O
  Host control, booting and diagnostics
3-Dimensional Torus
  Virtual cut-through hardware routing to maximize efficiency
  2.8 Gb/s on each of 12 node links (total 4.2 GB/s per node)
  Communication backbone
  134 TB/s total torus interconnect bandwidth
  1.4/2.8 TB/s bisectional bandwidth
Global Tree
  One-to-all or all-to-all broadcast functionality
  Arithmetic operations implemented in tree
  1.4 GB/s of bandwidth from any node to all other nodes
  Latency of tree less than 1 usec
  90 TB/s total binary tree bandwidth (64k machine)
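The torus numbers above can be made concrete with a small sketch of the topology: each node at (x, y, z) has six nearest neighbors with wraparound at the edges, and each neighbor connection is a send/receive pair, giving the 12 unidirectional node links the slide counts (12 x 2.8 Gb/s = 4.2 GB/s). The dimensions assume the 65536-node 32x32x64 machine; the function name is my own.

```python
DIMS = (32, 32, 64)  # 32 * 32 * 64 = 65,536 nodes

def torus_neighbors(x, y, z, dims=DIMS):
    """Coordinates of a node's six nearest neighbors, with wraparound."""
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]

# A corner node wraps around: (31, 0, 0) and (0, 0, 63) are among
# the six neighbors of node (0, 0, 0).
neighbors = torus_neighbors(0, 0, 0)
```

The wraparound is what keeps every node's link count uniform and halves the worst-case hop distance compared with a plain mesh.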

Node Card and I/O Card Design
Compute cards
  8 processors, 2 x 2 x 2 (x,y,z)
  256 MB RAM each processor
  Redundant power supplies
  Fast Ethernet
I/O cards
  4 processors (no torus)
  512 MB-1 GB each processor
  Redundant power supplies
  Fast and 1Gb Ethernet
[Diagram: compute nodes connected through a 100Mb Ethernet switch to an I/O node with Gb Ethernet.]

Rack Design
1024 compute nodes
  256 GB DRAM
  2.8 TF peak
16 I/O nodes
  8 GB DRAM
  16 Gb Ethernet
15 KW, air cooled
1+1 or 2+1 redundant power
2+1 redundant fans
[Diagram: each compute node and each I/O node is one BL ASIC (2 cores) with nine DRAM chips.]

Building a Cellular System
Chip (2 processors: 440 core + 440 core, with EDRAM and I/O): 5.6 GF/s, 4 MB
Board (8 chips, 2x2x2): 44.8 GF/s, 2.08 GB
Cabinet (128 boards, 8x8x16): 5.7 TF/s, 266 GB
System (64 cabinets, 32x32x64): 360 TF/s, 16 TB
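The packaging hierarchy multiplies out as a quick sanity check: 8 chips per board, 128 boards per cabinet, 64 cabinets per system. The slide's quoted figures round slightly (5.7 TF/s, 360 TF/s); the exact products are shown in the comments.

```python
CHIP_GFLOPS = 5.6  # one chip, two processors

board_gf = CHIP_GFLOPS * 8            # 44.8 GF/s per board
cabinet_tf = board_gf * 128 / 1000.0  # 5.73 TF/s per cabinet (quoted as 5.7)
system_tf = cabinet_tf * 64           # ~367 TF/s (quoted as 360 TF/s)
```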

Moore's Law continues beyond conventional scaling
  Technology innovation will overcome limits
Power becomes the limiting metric
  Technology trend is to higher power density
The integration focus moves from circuit to processor
  Radical power reduction depends on efficient processors
  Massively parallel systems have great potential

(Hopefully Not) The End!
