ResearchHub | Open Science Community

In-Datacenter Performance Analysis of a Tensor Processing Unit

Norman Jouppi et al.Jun 15, 2017

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU) --- deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X -- 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X -- 80X higher. Moreover, using the CPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

Electrical And Electronic Engineering

Computer Networks And Communications

0

Paper

Electrical And Electronic Engineering

3,590

0

Save

0

McPAT

Sheng Li et al.Dec 12, 2009

This paper introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, integrated memory controllers, and multiple-domain clocking. At the circuit and technology levels, McPAT supports critical-path timing modeling, area modeling, and dynamic, short-circuit, and leakage power modeling for each of the device types forecast in the ITRS roadmap including bulk CMOS, SOI, and double-gate transistors. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to consistently quantify the cost of new ideas and assess tradeoffs of different architectures using new metrics like energy-delay-area2 product (EDA2P) and energy-delay-area product (EDAP). This paper explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering will bring interesting tradeoffs between area and performance because the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to synergies of cache sharing. Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account configuring clusters with 4 cores gives the best EDA2P and EDAP.

Electrical And Electronic Engineering

Computer Networks And Communications

0

Paper

Electrical And Electronic Engineering

2,340

0

Save

0

Improving direct-mapped cache performance by the addition of a small fully-associative cache prefetch buffers

Norman JouppiAug 1, 1998

Article Free Access Share on Improving direct-mapped cache performance by the addition of a small fully-associative cache prefetch buffers Author: Norman P. Jouppi Digital Equipment Corporation Western Research Lab, 100 Hamilton Ave., Palo Alto, CA Digital Equipment Corporation Western Research Lab, 100 Hamilton Ave., Palo Alto, CAView Profile Authors Info & Claims ISCA '98: 25 years of the international symposia on Computer architecture (selected papers)August 1998 Pages 388–397https://doi.org/10.1145/285930.285998Online:01 August 1998Publication History 16citation1,530DownloadsMetricsTotal Citations16Total Downloads1,530Last 12 Months26Last 6 weeks1 Get Citation AlertsNew Citation Alert added!This alert has been successfully added and will be sent to:You will be notified whenever a record that you have chosen has been cited.To manage your alert preferences, click on the button below.Manage my AlertsNew Citation Alert!Please log in to your account Save to BinderSave to BinderCreate a New BinderNameCancelCreateExport CitationPublisher SiteeReaderPDF

Computer Networks And Communications

Hardware And Architecture

0

Paper

Computer Networks And Communications

1,336

0

Save

0

In-Datacenter Performance Analysis of a Tensor Processing Unit

Norman Jouppi et al.Jun 24, 2017

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU) --- deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X -- 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X -- 80X higher. Moreover, using the CPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

Electrical And Electronic Engineering

Computer Networks And Communications

0

Paper

Electrical And Electronic Engineering

1,250

0

Save

0

NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory

Xiangyu Dong et al.Jun 14, 2012

Various new nonvolatile memory (NVM) technologies have emerged recently. Among all the investigated new NVM candidate technologies, spin-torque-transfer memory (STT-RAM, or MRAM), phase-change random-access memory (PCRAM), and resistive random-access memory (ReRAM) are regarded as the most promising candidates. As the ultimate goal of this NVM research is to deploy them into multiple levels in the memory hierarchy, it is necessary to explore the wide NVM design space and find the proper implementation at different memory hierarchy levels from highly latency-optimized caches to highly density- optimized secondary storage. While abundant tools are available as SRAM/DRAM design assistants, similar tools for NVM designs are currently missing. Thus, in this paper, we develop NVSim, a circuit-level model for NVM performance, energy, and area estimation, which supports various NVM technologies, including STT-RAM, PCRAM, ReRAM, and legacy NAND Flash. NVSim is successfully validated against industrial NVM prototypes, and it is expected to help boost architecture-level NVM-related studies.

Materials Chemistry

Electrical And Electronic Engineering

0

Paper

Save

CACTI: an enhanced cache access and cycle time model

Steven Wilton et al.May 1, 1996

This paper describes an analytical model for the access and cycle times of on-chip direct-mapped and set-associative caches. The inputs to the model are the cache size, block size, and associativity, as well as array organization and process parameters. The model gives estimates that are within 6% of Hspice results for the circuits we have chosen. This model extends previous models and fixes many of their major shortcomings. New features include models for the tag array, comparator, and multiplexor drivers, nonstep stage input slopes, rectangular stacking of memory subarrays, a transistor-level decoder model, column-multiplexed bitlines controlled by an additional array organizational parameter, load-dependent size transistors for wordline drivers, and output of cycle times as well as access times. Software implementing the model is available via ftp.

Software

Electrical And Electronic Engineering

0

Paper

Save

Complexity-effective superscalar processors

Subbarao Palacharla et al.May 1, 1997

The performance tradeoff between hardware complexity and clock speed is studied.First, a generic superscalar pipeline is defined.Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed.Each is modeled and Spice simulated for feature sizes of O&m, 0.35,um, and 0.18~7% Performance results and trends are expressed in terms of issue width and window size.Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future.A microarchitecture that simplifies wakeup and selection logic is proposed and discussed.This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel.Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles.Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster-consequently overall performance is improved.By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines.

Electrical And Electronic Engineering

Hardware And Architecture

0

Paper

Electrical And Electronic Engineering

817

0

Save

0

Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction

Rakesh Kumar et al.Dec 3, 2003

This paper proposes and evaluates single-ISA heterogeneous multi-core architectures as a mechanism to reduce processor power dissipation. Our design incorporates heterogeneous cores representing different points in the power/performance design space; during an application's execution, system software dynamically chooses the most appropriate core to meet specific performance and power requirements. Our evaluation of this architecture shows significant energy benefits. For an objective function that optimizes for energy efficiency with a tight performance threshold, for 14 SPEC benchmarks, our results indicate a 39% average energy reduction while only sacrificing 3% in performance. An objective function that optimizes for energy-delay with looser performance bounds achieves, on average, nearly a factor of three improvements in energy-delay product while sacrificing only 22% in performance. Energy savings are substantially more than chip-wide voltage/frequency scaling.

Software

Electrical And Electronic Engineering

0

Paper

Save

Corona

Dana Vantrease et al.Jun 1, 2008

We expect that many-core microprocessors will push performance per chip from the 10 gigaflop to the 10 teraflop range in the coming decade. To support this increased performance, memory and inter-core bandwidths will also have to scale by orders of magnitude. Pin limitations, the energy cost of electrical signaling, and the non-scalability of chip-length global wires are significant bandwidth impediments. Recent developments in silicon nanophotonic technology have the potential to meet these off- and on-stack bandwidth requirements at acceptable power levels. Corona is a 3D many-core architecture that uses nanophotonic communication for both inter-core communication and off-stack communication to memory or I/O devices. Its peak floating-point performance is 10 teraflops. Dense wavelength division multiplexed optically connected memory modules provide 10 terabyte per second memory bandwidth. A photonic crossbar fully interconnects its 256 low-power multithreaded cores at 20 terabyte per second bandwidth. We have simulated a 1024 thread Corona system running synthetic benchmarks and scaled versions of the SPLASH-2 benchmark suite. We believe that in comparison with an electrically-connected many-core alternative that uses the same on-stack interconnect power, Corona can provide 2 to 6 times more performance on many memory intensive workloads, while simultaneously reducing power.

Electrical And Electronic Engineering

Computer Science

0

Paper

Electrical And Electronic Engineering

540

0

Save

0

Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance

Rakesh Kumar et al.Mar 2, 2004

A single-ISA heterogeneous multi-core architecture is achip multiprocessor composed of cores of varying size, performance,and complexity. This paper demonstrates that thisarchitecture can provide significantly higher performance inthe same area than a conventional chip multiprocessor. It doesso by matching the various jobs of a diverse workload to thevarious cores. This type of architecture covers a spectrum ofworkloads particularly well, providing high single-thread performancewhen thread parallelism is low, and high throughputwhen thread parallelism is high.This paper examines two such architectures in detail,demonstrating dynamic core assignment policies that providesignificant performance gains over naive assignment, andeven outperform the best static assignment. It examines policiesfor heterogeneous architectures both with and withoutmultithreading cores. One heterogeneous architecture we examineoutperforms the comparable-area homogeneous architectureby up to 63%, and our best core assignment strategyachieves up to 31% speedup over a naive policy.

Architecture

Information Systems

0

Paper

Architecture

512

0

Save

In-Datacenter Performance Analysis of a Tensor Processing Unit

McPAT

Improving direct-mapped cache performance by the addition of a small fully-associative cache prefetch buffers

In-Datacenter Performance Analysis of a Tensor Processing Unit

NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory

CACTI: an enhanced cache access and cycle time model

Complexity-effective superscalar processors

Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction

Corona

Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance

Scan to connect with one of our mobile apps

Coinbase Wallet app

Coinbase app

Or try the Coinbase Wallet browser extension