Handling Large Data Sets for High-Performance Embedded Applications in Heterogeneous Systems-on-Chip

Paolo Mantovani, Emilio G. Cota, Christian Pilato, Giuseppe Di Guglielmo, Luca P. Carloni
Department of Computer Science, Columbia University, New York, NY, USA

ABSTRACT
Local memory is a key factor for the performance of accelerators in SoCs. Despite technology scaling, the gap between on-chip storage and the memory footprint of embedded applications keeps widening. We present a solution to preserve the speedup of accelerators when scaling from small to large data sets. Combining specialized DMA and address translation with a software layer in Linux, our design is transparent to user applications and broadly applicable to any class of SoC hosting high-throughput accelerators. We demonstrate the robustness of our design across many heterogeneous workload scenarios and memory-allocation policies with FPGA-based SoC prototypes featuring twelve concurrent accelerators accessing up to 768MB out of 1GB-addressable DRAM.

[Figure 1: The growing gap between the aggregate size of the SoC on-chip caches and the main-memory size, across seven years of Apple iPhone products.]

1. INTRODUCTION
The end of Dennard's constant-field scaling has led designers towards heterogeneous system-on-chip (SoC) architectures that exploit the large number of available transistors to incorporate a variety of customized hardware accelerators along with the processor cores [3]. To achieve energy-efficient high performance in embedded applications, both academia and industry have developed many different classes of accelerators and accelerator-rich architectures [5, 7, 8, 13, 18, 24, 28, 31].
A recent analysis of die photos of three generations of Apple SoCs, which empower the iPhone product line, shows that more than half of the chip area is consistently dedicated to application-specific accelerators [29]. These SoCs integrate high-throughput loosely-coupled accelerators [10] to implement complete application kernels, such as video encoding [23]. Each of these accelerators leverages a dedicated, highly-customized Private Local Memory (PLM) and fetches data from DRAM through DMA. The PLM is key to achieving high data-processing throughput: by integrating many independent SRAM banks whose ports can sustain multiple memory operations per cycle, it enables concurrent accesses from both the highly-parallelized logic of the accelerator datapath and the DMA interface to main memory. Recent studies confirm the importance of the PLM, which occupies 40 to 90% of the accelerator area [10, …] and contributes, together with the processors' caches, to the growing fraction of chip area dedicated to memory.

[Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
’16, October 01-07 2016, Pittsburgh, PA, USA
© 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM. ISBN 978-1-4503-4482-1/16/10 . . . $15.00
DOI:]

However, despite this trend, the data-set sizes of embedded applications are increasing much faster than the available on-chip memory. As an example, Fig.
1 illustrates the growing gap between the aggregate size of the SoC on-chip caches (accounting for L1 and L2 caches, and, starting with the iPhone 5S, a 4MB L3 cache) and DRAM size, across seven years of Apple iPhone generations. Relative to the first product generation, the difference between the two sizes has grown by a factor of 16×, to reach almost 2GB¹. The growth in DRAM size reflects the need for supporting applications with increasingly large footprints, which pose new challenges for high-throughput accelerators.

A loosely-coupled accelerator has a twofold nature: while it is similar to an on-board peripheral, in that the processor core can offload a specific task to it, the accelerator does not have a large private storage system (e.g. a dedicated off-chip memory), and therefore shares the external memory with the processor core. At the same time, the accelerator's computation modules are unaware of the physical memory allocation, which can even be spread over multiple physically-separated DRAM banks [34]. Thus, an accelerator can be likened to a software thread, for which physical memory is abstracted by virtual memory and multiple levels of caches. In an accelerator, the PLM gives the illusion of a contiguous address space, allowing the accelerator to perform concurrent random accesses on data structures. This contiguous address space, however, is limited by the size of the PLM, and processing large data sets necessarily involves multiple data transfers between DRAM and PLM.

¹ Admittedly, the cache numbers do not include the aggregate sizes of the PLMs of the Apple SoC accelerators, which are likely to be large but are not publicly known. Still, even if we assume that their contributions could double or triple the reported figures, the on-chip memory sizes would remain very small compared to DRAM.

[Figure 2: (a) The DEBAYER accelerator structure. (b) Overlapping of computation and communication.]

We define the Large Data Set (LDS) Problem for SoC Accelerators as the problem of finding a high-performance and low-overhead mechanism that allows hardware accelerators to process large data sets without incurring penalties for data transfers. A possible solution to the LDS Problem is to have the accelerators share the virtual address space of the processor in a fully coherent way. This is obtained by replacing the PLM with a standard private L1-cache and sharing the higher levels of the memory hierarchy and the memory-management unit with the general-purpose cores [34]. This approach, however, is not effective for high-throughput accelerators because it degrades their performance by depriving them of their customized PLMs. Moreover, as the data set grows, the overhead of maintaining coherence further limits the accelerator speedup over software [10]. Alternatively, one could expose the PLM to the processor and let it manage data transfers across separate address spaces in suitably small chunks [16]. However, as we show in Section 2, this increases software complexity and forces accelerators to stall waiting for the software-managed transfers, thus wasting most of the speedup offered by the dedicated hardware.

In this paper, we present an effective hardware-software solution to the LDS Problem that avoids the most common shortcomings of accelerators coupled with embedded processors.
Our solution achieves this by combining the following features:

- a low-overhead accelerator virtual address space, which is distinct from the processor virtual address space (to reduce the processor-accelerator interaction);
- direct sharing of physical memory across processors and accelerators (to avoid redundant copies of data);
- a dedicated DMA controller with a specialized translation lookaside buffer (TLB) per accelerator (to support many heterogeneous accelerators coexisting in the same SoC, each with its specific memory-access pattern);
- hardware and software support for implementing run-time policies to balance traffic among the available DRAM channels.

After motivating our work in Section 2, we describe our approach in Section 3 and its system-level integration in Section 4. Section 5 shows a full-system evaluation on FPGA. Our experiments demonstrate that we are able to preserve the accelerators' speedup as the number of concurrent workloads and the size of the data sets scale.

2. PRESERVING THE SPEEDUP
A loosely-coupled accelerator executes very efficiently as long as it keeps its computation and communication phases balanced.

[Figure 3: Traditional software-managed DMA.]

As an example, Fig. 2(a) shows the high-level block diagram of a high-throughput accelerator for the DEBAYER kernel [2]. This kernel takes as input a Bayer-array image with one color sample per pixel and returns an image with three color samples (red, green, and blue) per pixel, where the missing colors are estimated via interpolation. The accelerator consists of a load module (to fetch data from DRAM), one or more computation modules, and a store module (to send results to DRAM). These modules communicate through a PLM, which is composed of multiple banks and ports. Such a PLM architecture allows the computation modules to process multiple pixels per clock cycle. Additionally, circular and ping-pong data buffers support the pipelining of computation and DMA transfers with the off-chip DRAM.
This choice derives directly from the functional specification of the kernel: the DEBAYER interpolates pixels row-by-row and uses 5×5-interpolation masks centered on the pixel of interest. To start the computation, the accelerator needs at least the first five rows of the input image in the circular buffer (input bursts 1 to 5 in Fig. 2(b)). Then, while the computation modules work, the input module can prefetch more rows for future processing (input burst 6 in Fig. 2(b)). As soon as a computation step completes, an interpolated row is stored in the first half of the ping-pong buffer so that it can be transferred back to DRAM (output burst 1). Meanwhile, the computation modules can start processing the additional row in the circular buffer and storing the result in the second half of the ping-pong buffer (output burst 2).

This behavior is representative of many high-throughput accelerators. However, the specifics of the micro-architecture of any given accelerator, including the PLM organization, may vary considerably depending on the particular computation kernel. The timing diagram in Fig. 2(b) shows a hypothetical scenario, where the communication (i.e. input and output) and computation phases overlap, and the latency of DMA transfers is hidden by the local buffers. Intuitively, if such latency becomes larger than the processing time, then the accelerator must stall until new data are available for computation. This can limit the efficiency of the accelerator, reducing its advantages over software execution. Next, we present an experiment demonstrating that, when loosely-coupled accelerators process large data sets, traditional memory handling for non-coherent devices leads to such an undesirable scenario.

By using an FPGA board, we realized a simple SoC that integrates one embedded processor with a 32-bit architecture, which runs the Linux Operating System (OS), and two loosely-coupled accelerators.
These two accelerators implement the DEBAYER and SORT computational kernels [2]. The virtual memory available to user-level applications is 3GB, while the actual physical memory is 1GB. For this experiment we considered a memory footprint of 32MB for DEBAYER, which elaborates one 2048×2048-pixel Bayer array and the corresponding bitmap image (16-bit colors), and of 4MB for SORT, which processes in place 1024 vectors, each containing 1024 single-precision floating-point numbers. The two accelerated applications share the processor in time multiplexing according to the Linux scheduler. Each of them can invoke the appropriate accelerator through the API provided by a device driver.

[Figure 4: Software-managed DMA versus hardware-only DMA execution-time breakdown. Orange segments correspond to accelerator DMA and computation, while purple segments represent software-handled data transfers.]

Fig. 3 shows the memory layout of one application: the physical memory is usually allocated in 4KB pages and remapped to a contiguous virtual memory area where the program stores the application's data. To allow a non-coherent device to access these data, the driver typically implements a memory-mapping function that serves three main tasks. First, it requests the OS to reserve a contiguous area in physical memory for DMA and pins the corresponding pages (orange-shaded memory area in Fig. 3). Then, it passes the physical address to the device, referred to as the dma_handle of the allocated buffer. Finally, it remaps the DMA buffer to the virtual memory (purple-shaded area) and returns a pointer to the user-level application. The exact amount of contiguous memory that the OS can allocate depends on the target processor architecture, but it is usually limited to a few megabytes. If we set the size of the DMA buffer to 1MB, the DEBAYER computation can be easily split into 32 parts, each processing a different portion of the input image. Similarly, the input vectors for SORT can be divided into 4 sets of 256 vectors each.
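To make this software-managed flow concrete, the following user-space C sketch models it under simplifying assumptions: run_accelerator is a hypothetical stand-in for invoking the accelerator through the driver (in the real system, an ioctl), and the pinned DMA buffer is modeled as a plain array. With a 1MB buffer and DEBAYER's 32MB footprint, the loop runs 32 times, each iteration paying two CPU-side copies.

```c
#include <stddef.h>
#include <string.h>

#define DMA_BUF_SIZE (1ul << 20)  /* 1MB DMA buffer, as in the experiment */

/* Hypothetical stand-in for the accelerator elaborating the pinned DMA
 * buffer in place (in the real system this is triggered via the driver). */
static void run_accelerator(unsigned char *dma_buf, size_t n)
{
        for (size_t i = 0; i < n; i++)
                dma_buf[i] ^= 0xFF;  /* toy computation */
}

/* Software-managed DMA: the processor copies each block of the large
 * data set into the small DMA buffer, triggers the accelerator, and
 * copies the results back. */
void process_large_buffer(unsigned char *data, size_t total,
                          unsigned char *dma_buf)
{
        for (size_t off = 0; off < total; off += DMA_BUF_SIZE) {
                size_t n = total - off < DMA_BUF_SIZE ? total - off
                                                      : DMA_BUF_SIZE;
                memcpy(dma_buf, data + off, n);  /* CPU copy in  */
                run_accelerator(dma_buf, n);     /* HW phase     */
                memcpy(data + off, dma_buf, n);  /* CPU copy out */
        }
}
```

The two memcpy calls are exactly the software-handled transfers that dominate the purple segments of Fig. 4; a hardware-only scheme removes both copies and the chunking loop altogether.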
The bars of Fig. 4 show how hardware acceleration (orange) and software execution (purple) interleave over time. The orange segments include the time for fetching the input data from DRAM via DMA, elaborating them, and transferring results back to DRAM, also via DMA. The purple segments, instead, correspond to the time spent by the user application in saving results from the DMA buffer to another virtual memory area and copying the next block of input data into the DMA buffer. Note that the first and the last segments of each bar are always orange, since we are not considering the application setup and wrap-up phases, which are constant across all scenarios. We repeated the experiment four times, varying the size of the DMA buffer from 1MB up to 8MB. As the size of the DMA buffer increases, the data processed by the accelerators are split into fewer blocks and the overhead of interleaving hardware and software decreases. Further, the execution of SORT benefits from having a DMA buffer large enough for its memory footprint: the accelerator is able to complete the entire task without the intervention of the processor, thus obtaining a speedup of 21× over the test case with a 2MB DMA buffer. For DEBAYER, however, the software-based data management is always responsible for the largest part of the execution time, because its memory footprint never fits into the DMA buffer. Additionally, the execution of multiple accelerators creates contention on the processor, which must handle multiple concurrent transfers between each DMA buffer and the virtual memory of the corresponding application.

[Figure 5: Hardware-only DMA using the Linux big-physical area patch to reserve up to tens of MB of contiguous memory.]
This is shown by the purple bars, which become longer when the two accelerators execute at the same time.

Following the intuition that avoiding the intervention of the processor core in DMA transfers (except for the initial setup) benefits the accelerated application, we executed the experiment again leveraging a Linux patch known as big-physical area. When enabled, this patch forces the Linux OS to reserve a region of contiguous memory, configurable in size up to a few tens of megabytes. Fig. 5 shows the updated memory layout made possible by the patch: the entire application data set for both SORT and DEBAYER can be mapped to contiguous physical memory. Hence, the accelerator needs only the base address of the buffer to process all data, while the processor can remain idle or perform other tasks. The result is reported in the last two bars at the bottom of Fig. 4: the accelerator for DEBAYER achieves a speedup of 8× with respect to the scenario with an 8MB DMA buffer.

This experiment proves the benefits of reducing the processor intervention when loosely-coupled accelerators move data with DMA transactions. The big-physical area patch, however, is only viable for applications with medium-sized memory footprints. As the number of accelerators and the size of data sets grow, it is necessary to adopt a more scalable and flexible approach.

3. HANDLING LARGE DATA SETS
In the context of general-purpose processors, the cache hierarchy and virtual memory are typically used to give user applications the illusion of accessing the entire address space with low latency. As the number of accelerators integrated in SoCs keeps growing, designers need a similarly efficient solution dedicated to special-purpose hardware components. In this section we describe a combination of hardware and software techniques that gives accelerators the illusion of accessing contiguous physical memory.
Each accelerator can therefore issue memory references using an accelerator-virtual address (AVA), equivalent to a simple offset with respect to its data structures, without requiring any information about the underlying system memory hierarchy. Combined with a lightweight dedicated DMA controller, this makes all transactions occur across the entire data set without intervention of the processor, thus allowing the accelerators to preserve the speedup they were initially designed for.

Scatter-gather DMA and accelerators. For off-chip peripherals and non-coherent devices, the standard Linux API provides routines to create a list of the pages reserved for any virtual buffer. This list, called a scatterlist, represents the page table (PT) for the buffer. This name refers to scatter-gather DMA, which is a common technique mostly applied to move data between main memory and the dedicated DRAM of on-board peripherals. To reduce the size of the PT, Linux tends to reserve blocks of contiguous pages whenever possible, so that it is sufficient to store the base address and length of each block. A typical transaction to an external peripheral implies transferring all data stored in the area pointed to by the PT. Hence, the scatter-gather DMA controller must simply walk the PT and gather data from all memory areas in order. Conversely, on-chip accelerators must deal with PLMs having limited size. Therefore, they have to issue several random accesses to memory, following a pattern that is highly dependent on the implemented algorithm. Since the blocks may have different sizes, accessing a scattered memory buffer at a random offset requires adding up the lengths of all preceding blocks until the requested data is effectively reached. Moreover, long DMA transfers may easily span multiple blocks, incurring further overhead to complete the transaction.

Alternatively, Linux can guarantee a set of equally-sized blocks, each consisting of one page (typically 4KB). However, if we consider a data set of 300MB, we need 76,800 entries in the PT, equivalent to 300KB on a 32-bit address space. A traditional TLB, holding only a few of these PT entries, would incur high miss rates. In fact, high-throughput accelerators do not typically reuse the same data multiple times, and very little spatial locality can be exploited.

To overcome this issue, we implemented a kernel module, named contig_alloc, and a companion user-space library to replace the standard malloc interface. Fig. 6 shows the request data structure for our module.

    /* Data structure for contig_alloc */
    struct contig_alloc_req {
        unsigned long size;                /* aggregate size required          */
        unsigned long block_size;          /* size of one block                */
        struct contig_alloc_params params; /* DRAM allocation policy           */
        unsigned int n_blocks;             /* number of contiguous blocks      */
        contig_khandle_t khandle;          /* handle for the device driver     */
        unsigned long __user *arr;         /* blocks' physical addresses (PT)  */
        void __user **mm;                  /* user-space mapping of the blocks */
    };

[Figure 6: Data structure to request an accelerator buffer.]

A request to contig_alloc includes the size of the requested memory area (size), the desired size of each contiguous physical block (block_size) into which the memory region will be divided, and some allocation-policy parameters. These parameters are intended for load balancing in the case of multiple DRAM banks. If we specify only the parameter size, as typically done for malloc, then default values are used for the other parameters. The kernel module generates a DMA handle for the accelerator's driver, the resulting number of equally-sized blocks (also called accelerator pages), the corresponding PT, and the virtual-memory mapping for the user-space application. Fig. 7 shows the memory layout after calling contig_alloc. Note that only the calling process is allowed to access the allocated memory region, and the user-level application can still operate transparently on the data in its virtual address space (purple-shaded area). However, differently from standard allocation mechanisms, the corresponding pages have a larger size (orange-shaded regions in physical memory). For very large data sets, we can set a medium size for the accelerator pages (e.g. 1MB) so that the resulting PT has a size on the order of a few KBs and can thus be stored contiguously in memory, as shown in Fig. 7.

[Figure 7: Memory layout after calling contig_alloc to enable low-overhead scatter-gather DMA for accelerators.]

This approach enables a low-overhead version of scatter-gather DMA specialized for loosely-coupled accelerators, while maintaining shared memory across processors and accelerators², without requiring coherence with the PLMs.

TLB and DMA controller for accelerators. Once the data are ready in memory, laid out as shown in Fig. 7, the application can run the accelerator by invoking the device driver through the traditional ioctl system call. The driver takes the configuration parameters from the user application and passes them to the accelerator through memory-mapped registers. Such parameters include application-specific variables to be used directly by the accelerator kernel (e.g. the size of the image for the DEBAYER application), and the information for the DMA controller (e.g. the memory address where the PT is stored). To guarantee memory consistency without coherence, the device driver performs a simple flush of the cache lines holding data from the shared buffers right before sending the start command to the accelerator. This operation is completely transparent to the user-level applications and incurs negligible performance overhead.

At this stage, DMA and computation are entirely managed by the accelerator.
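The sizing trade-off behind contig_alloc can be checked with a few lines of arithmetic. The helpers below are ours, for illustration only (the 32-bit entry size is an assumption from the paper's 32-bit address space); they reproduce the numbers in the text: a 300MB data set needs 76,800 PT entries with 4KB pages (about 300KB of PT), versus only 300 entries with 1MB accelerator pages.

```c
#include <stddef.h>

/* One PT entry (a 32-bit physical pointer) per equally-sized block. */
static size_t pt_entries(size_t data_set_bytes, size_t page_bytes)
{
        return (data_set_bytes + page_bytes - 1) / page_bytes; /* ceil */
}

static size_t pt_size_bytes(size_t data_set_bytes, size_t page_bytes)
{
        return pt_entries(data_set_bytes, page_bytes) * 4; /* 32-bit entries */
}
```

A PT of a few hundred entries fits entirely in a small on-chip table and can be fetched with a single DMA transfer, which is what makes it practical to match the number of PT entries to the TLB size.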
The accelerator requests are composed of a set of control signals to: distinguish memory-to-device from device-to-memory transfers, set an offset with respect to the data structure to process (corresponding to the AVA), and determine the transaction length. To serve such requests we equip every accelerator with a DMA controller (DMAC) and a parametrized TLB. These components autonomously fetch the PT through a single memory-to-device transaction, and store it inside the TLB. Once the TLB is initialized, every accelerator request is translated in only four cycles. When operating on large data sets with very long DMA transfers, this address-translation overhead is negligible. The TLB is configured to match the requirements of a given accelerator in terms of the number of supported physical memory pages and their size. In fact, thanks to contig_alloc, the number of pages is kept under control and set according to the size of the required memory area. Therefore it is possible to have the number of PT entries match the TLB size. This not only simplifies the design, but it also minimizes the performance degradation due to scatter-gather DMA. Indeed, filling in the TLB can be done with one single transfer before activating the computational blocks. Results reported in Section 5 confirm that preparing and using the TLB has a negligible impact on the overall execution time of the accelerators. Across the analyzed workloads, we set the accelerator page size to 1MB, which is a reasonable trade-off between the complexity of the memory allocation performed by the operating system and the PT size, resulting in a few hundred entries. Should an application require more entries, in order to handle even larger data sets, the TLB can be parametrized to hold more pointers in exchange for silicon area. Note that the relative performance overhead would not increase, because transactions and computation would also scale with the data set.

² Differently from Linux huge pages, our module supports dynamic allocation of blocks; the latter can have variable sizes, trading off PT size for memory fragmentation; and it is supported across all architectures.

[Figure 8: DMA controller interface.]

[Figure 9: DMA Controller (a) and TLB (b) finite state machines.]

Fig. 8 shows the organization of the accelerator, the DMAC, the dedicated TLB, and the bank of configuration registers. Note that one of the registers stores the DMA handle generated by contig_alloc. This corresponds to the PT base address and is used to initialize the TLB.
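The translation step performed by the TLB can be modeled in a few lines of C. This is an illustrative model only (the function and names are ours, not the RTL's): given the PT fetched into the TLB, it turns an AVA and a transfer length into one (physical address, length) pair per accelerator page touched, splitting long bursts at page boundaries.

```c
#include <stddef.h>
#include <stdint.h>

#define ACC_PAGE_SIZE (1ul << 20)  /* 1MB accelerator pages */

/* Model of the TLB lookup: split a DMA request (AVA = byte offset into
 * the accelerator's data set, plus a length) into per-page transfers.
 * pt[] holds one physical base address per accelerator page, as fetched
 * into the TLB from the page table prepared by contig_alloc. */
size_t ava_translate(const uint64_t *pt, uint64_t ava, size_t len,
                     uint64_t *pa_out, size_t *len_out)
{
        size_t n = 0;

        while (len > 0) {
                uint64_t off  = ava % ACC_PAGE_SIZE;
                size_t   take = ACC_PAGE_SIZE - off < len
                              ? ACC_PAGE_SIZE - off : len;

                pa_out[n]  = pt[ava / ACC_PAGE_SIZE] + off; /* physical addr */
                len_out[n] = take;
                n++;
                ava += take;
                len -= take;
        }
        return n;  /* number of per-page transfers issued */
}
```

A single burst spanning two accelerator pages thus becomes two back-to-back transactions; in the hardware, the second one is issued without repeating the handshake with the accelerator.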
The DMAC and TLB behaviors are described by the finite state machines in Fig. 9(a) and Fig. 9(b), respectively. As soon as the PT register is written by the device driver, the DMAC engine initiates an autonomous transaction to retrieve the PT, as shown by the transition from idle to send address in Fig. 9(a). The request includes the PT address and the number of entries to fetch. Then, following the control flow of read requests (i.e. the MEM TO DEV path), the DMAC waits for the response of the memory controller before transferring the received pointers to the physical blocks into the TLB. The operation terminates when all entries are received: this corresponds to the transition from rcv data to idle in Fig. 9(a), where the signal tlb empty is deasserted. This also corresponds to the transition from tlb init to idle in Fig. 9(b).

After this TLB initialization, the DMAC steps through the states config and running, and starts its execution. Whenever the accelerator needs to perform a read or write request to DRAM, it sends a request to the DMAC through its DMA interface, as shown in Fig. 8. Specifically, the AVA and the length of the data transfer are sent to the TLB, which initiates the address translation, while the DMAC starts a handshake protocol with the DMA interface of the accelerator. The TLB determines whether the transaction needs to access one or multiple pages in memory, and computes the length of the transfer for the first accelerator page. In four cycles the TLB is ready to provide the physical address, and the DMAC initiates a transaction over the interconnection system, following either the MEM TO DEV or the DEV TO MEM path, for read or write operations, respectively. In the case of read requests, the acknowledge signal (Ack) shown in Fig. 8 is set when valid data are available. Conversely, in the case of write requests, the signal Ack is set when an output value (Data) has been sent to the DMAC.
This simple latency-insensitive protocol [4] ensures functional correctness, while coping with congestion and DRAM latency. After the request has been sent to the interconnect, the TLB controller steps to a second waiting state (i.e. wait data in Fig. 9(b)). In this state, if the current transfer length does not match the actual length requested by the accelerator, the controller reads the physical address of the next page in the TLB and initiates another transaction, skipping the handshake with the accelerator. When the DMAC returns to the state running, it first checks for pending transactions, then it reads the command register to check for a reset from software, and finally waits for the accelerator to raise another request or for completion (i.e. signal acc done). Note that, even considering the DMAC initialization, the delay introduced by each accelerator request is negligible when compared to the lengths of typical burst data transfers (thousands of words).

Main memory load balancing. As the number of accelerators grows, the system interconnect and the I/O channels to the external memory are responsible for sustaining the increasing traffic generated by many long DMA transactions. A system interconnect based on a network-on-chip (NoC) offers larger throughput and has better scalability than traditional bus-based interconnects [12]. However, since all accelerators need access to external memory, it is necessary to improve the traffic on the NoC to mini

[Figure 1 data (as extracted, main-memory row truncated): models iPhone 3G, 3GS, 4, 4S, 5, 5S, 6, 6S; aggregate SoC cache 32KB, 320KB, 576KB, 1064KB, 1064KB, 5128KB, 5128KB, 5128KB; main memory 128MB, 256MB, 256MB, 512MB, 512MB, …]